linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 0/2] Makes it easier for the wakee to choose previous CPU
@ 2023-09-11  2:49 Chen Yu
  2023-09-11  2:49 ` [RFC PATCH 1/2] sched/fair: Record the average sleep time of a task Chen Yu
  2023-09-11  2:50 ` [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() Chen Yu
  0 siblings, 2 replies; 34+ messages in thread
From: Chen Yu @ 2023-09-11  2:49 UTC (permalink / raw)
  To: Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar, Vincent Guittot,
	Juri Lelli
  Cc: Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	K Prateek Nayak, Gautham R . Shenoy, linux-kernel, Chen Yu

When task p is woken up, the scheduler leverages select_idle_sibling()
to find an idle CPU for it. p's previous CPU is usually a preference
because it can improve cache locality. However in many cases the
previous CPU has already been taken by other wakees, thus p has to
find another idle CPU.

Inhibiting task migration while keeping the work conservation of the
scheduler could benefit many workloads. Inspired by Mathieu's
proposal to limit the task migration ratio[1], this patch set takes
the task's average sleep duration into consideration. If the
task is a short sleeping one, tag its previous CPU as cache hot
for a short while. During this tag period, other wakees are not
allowed to pick this idle CPU until a timeout. Later, when the
task is woken up again, it can find its previous CPU still idle
and choose it in select_idle_sibling().
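
In short, the mechanism boils down to the following (simplified
excerpts from patch 2/2):

	/* on dequeue: if the rq becomes idle and p is a short sleeper */
	if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
	    p->se.sleep_avg && p->se.sleep_avg < sysctl_sched_migration_cost)
		rq->cache_hot_timeout = now + p->se.sleep_avg;

	/* in __select_idle_cpu(): other wakees skip the reserved CPU */
	if (sched_feat(SIS_CACHE) &&
	    sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout)
		return -1;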

The netperf benchmark has shown some improvement, which is
described in patch 2/2.

This series is based on tip/sched/core, on top of
Commit 3f4feb58037a ("sched: Misc cleanups").

Any feedback/comments would be appreciated.

Link: https://lore.kernel.org/lkml/20230905171105.1005672-2-mathieu.desnoyers@efficios.com/ #1

Chen Yu (2):
  sched/fair: Record the average sleep time of a task
  sched/fair: skip the cache hot CPU in select_idle_cpu()

 include/linux/sched.h   |  3 +++
 kernel/sched/fair.c     | 47 ++++++++++++++++++++++++++++++++++++++---
 kernel/sched/features.h |  1 +
 kernel/sched/sched.h    |  1 +
 4 files changed, 49 insertions(+), 3 deletions(-)

-- 
2.25.1



* [RFC PATCH 1/2] sched/fair: Record the average sleep time of a task
  2023-09-11  2:49 [RFC PATCH 0/2] Makes it easier for the wakee to choose previous CPU Chen Yu
@ 2023-09-11  2:49 ` Chen Yu
  2023-09-11  2:50 ` [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() Chen Yu
  1 sibling, 0 replies; 34+ messages in thread
From: Chen Yu @ 2023-09-11  2:49 UTC (permalink / raw)
  To: Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar, Vincent Guittot,
	Juri Lelli
  Cc: Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	K Prateek Nayak, Gautham R . Shenoy, linux-kernel, Chen Yu

There is a requirement to have information on how long
a task sleeps. This information can be used for better
task placement: a short sleeping task can more easily be
woken up on its previous, still idle, CPU. The
optimization will be introduced in the next patch.

Define the sleep time of a task as:
the time delta between when the task was dequeued and when it is enqueued again.
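
The average is tracked with the scheduler's existing update_avg()
helper, which is (roughly) an exponentially weighted moving average
where each new sample gets a 1/8 weight:

	/* kernel/sched/sched.h */
	static inline void update_avg(u64 *avg, u64 sample)
	{
		s64 diff = sample - *avg;

		*avg += diff / 8;
	}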

Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 include/linux/sched.h |  3 +++
 kernel/sched/fair.c   | 17 +++++++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 177b3f3676ef..c11879468027 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -561,6 +561,9 @@ struct sched_entity {
 	u64				vruntime;
 	s64				vlag;
 	u64				slice;
+	u64				prev_sleep_time;
+	/* average sleep duration of a task */
+	u64				sleep_avg;
 
 	u64				nr_migrations;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0b7445cd5af9..e20f50726ab8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6457,6 +6457,20 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	struct sched_entity *se = &p->se;
 	int idle_h_nr_running = task_has_idle_policy(p);
 	int task_new = !(flags & ENQUEUE_WAKEUP);
+	u64 last_sleep = p->se.prev_sleep_time;
+
+	/*
+	 * Calculate the average sleep time of the wakee.
+	 * Use wakee's previous CPU's clock to get the
+	 * delta. The previous CPU should be online, otherwise
+	 * the sleep time could be larger than expected(the clock
+	 * of the offline previous CPU is paused). Check if there
+	 * is a sched_clock() reset on previous CPU due to suspend/resume.
+	 */
+	if ((flags & ENQUEUE_WAKEUP) && last_sleep && cpu_online(task_cpu(p)) &&
+	    sched_clock_cpu(task_cpu(p)) > last_sleep)
+		update_avg(&p->se.sleep_avg,
+			   sched_clock_cpu(task_cpu(p)) - last_sleep);
 
 	/*
 	 * The code below (indirectly) updates schedutil which looks at
@@ -6551,6 +6565,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	int task_sleep = flags & DEQUEUE_SLEEP;
 	int idle_h_nr_running = task_has_idle_policy(p);
 	bool was_sched_idle = sched_idle_rq(rq);
+	u64 now;
 
 	util_est_dequeue(&rq->cfs, p);
 
@@ -6612,6 +6627,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 dequeue_throttle:
 	util_est_update(&rq->cfs, p, task_sleep);
 	hrtick_update(rq);
+	now = sched_clock_cpu(cpu_of(rq));
+	p->se.prev_sleep_time = task_sleep ? now : 0;
 }
 
 #ifdef CONFIG_SMP
-- 
2.25.1



* [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
  2023-09-11  2:49 [RFC PATCH 0/2] Makes it easier for the wakee to choose previous CPU Chen Yu
  2023-09-11  2:49 ` [RFC PATCH 1/2] sched/fair: Record the average sleep time of a task Chen Yu
@ 2023-09-11  2:50 ` Chen Yu
  2023-09-11  7:26   ` Aaron Lu
                     ` (3 more replies)
  1 sibling, 4 replies; 34+ messages in thread
From: Chen Yu @ 2023-09-11  2:50 UTC (permalink / raw)
  To: Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar, Vincent Guittot,
	Juri Lelli
  Cc: Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	K Prateek Nayak, Gautham R . Shenoy, linux-kernel, Chen Yu

When task p is woken up, the scheduler leverages select_idle_sibling()
to find an idle CPU for it. p's previous CPU is usually a preference
because it can improve cache locality. However in many cases the
previous CPU has already been taken by other wakees, thus p has to
find another idle CPU.

Inspired by Mathieu's idea[1], consider the sleep time of the task.
If that task is a short sleeping one, keep p's previous CPU idle
for a short while. Later when p is woken up, it can choose its
previous CPU in select_idle_sibling(). While p's previous CPU is
reserved, other wakees are not allowed to choose this CPU in
select_idle_cpu(). The reservation period is set to the task's
average sleep time. That is to say, if p is a short sleeping task,
there is no need to migrate p to another idle CPU.
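
As a made-up example: suppose p's average sleep time is 20us. When p
is dequeued at time t and its rq becomes idle, the rq is marked cache
hot until t + 20us:

	rq->cache_hot_timeout = t + 20000;	/* 20us in ns, illustration only */

A wakee scanning this CPU at t + 5us sees it as cache hot and skips
it; if p wakes up at t + 15us, the CPU is still idle and
select_idle_sibling() can pick it again.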

This does not break the work conservation of the scheduler,
because the wakee will still try its best to find an idle CPU.
The difference is that different idle CPUs might have different
priorities. On the other hand, in theory this extra check could
increase the failure ratio of select_idle_cpu(), but per the
initial test results, no regression is detected.

Baseline: tip/sched/core, on top of:
Commit 3f4feb58037a ("sched: Misc cleanups")

Benchmark results on Intel Sapphire Rapids, 112 CPUs/socket, 2 sockets.
The cpufreq governor is performance, turbo boost is disabled, C-states deeper
than C1 are disabled, and NUMA balancing is disabled.

netperf
=======
case                    load            baseline(std%)  compare%( std%)
UDP_RR                  56-threads       1.00 (  1.34)   +1.05 (  1.04)
UDP_RR                  112-threads      1.00 (  7.94)   -0.68 ( 14.42)
UDP_RR                  168-threads      1.00 ( 33.17)  +49.63 (  5.96)
UDP_RR                  224-threads      1.00 ( 13.52)  +122.53 ( 18.50)

Noticeable improvements in netperf are observed in the 168 and 224
thread cases.

hackbench
=========
case                    load            baseline(std%)  compare%( std%)
process-pipe            1-groups         1.00 (  5.61)   -4.69 (  1.48)
process-pipe            2-groups         1.00 (  8.74)   -0.24 (  3.10)
process-pipe            4-groups         1.00 (  3.52)   +1.61 (  4.41)
process-sockets         1-groups         1.00 (  4.73)   +2.32 (  0.95)
process-sockets         2-groups         1.00 (  1.27)   -3.29 (  0.97)
process-sockets         4-groups         1.00 (  0.09)   +0.24 (  0.09)
threads-pipe            1-groups         1.00 ( 10.44)   -5.88 (  1.49)
threads-pipe            2-groups         1.00 ( 19.15)   +5.31 ( 12.90)
threads-pipe            4-groups         1.00 (  1.74)   -5.01 (  6.44)
threads-sockets         1-groups         1.00 (  1.58)   -1.79 (  0.43)
threads-sockets         2-groups         1.00 (  1.19)   -8.43 (  6.91)
threads-sockets         4-groups         1.00 (  0.10)   -0.09 (  0.07)

schbench(old)
========
case                    load            baseline(std%)  compare%( std%)
normal                  1-mthreads       1.00 (  0.63)   +1.28 (  0.37)
normal                  2-mthreads       1.00 (  8.33)   +1.58 (  2.83)
normal                  4-mthreads       1.00 (  2.48)   -2.98 (  3.28)
normal                  8-mthreads       1.00 (  3.97)   +5.01 (  1.28)

Not much difference is observed in hackbench/schbench, due to
run-to-run variance.

Link: https://lore.kernel.org/lkml/20230905171105.1005672-2-mathieu.desnoyers@efficios.com/ #1
Suggested-by: Tim Chen <tim.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 kernel/sched/fair.c     | 30 +++++++++++++++++++++++++++---
 kernel/sched/features.h |  1 +
 kernel/sched/sched.h    |  1 +
 3 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e20f50726ab8..fe3b760c9654 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	hrtick_update(rq);
 	now = sched_clock_cpu(cpu_of(rq));
 	p->se.prev_sleep_time = task_sleep ? now : 0;
+#ifdef CONFIG_SMP
+	/*
+	 * If this rq will become idle, and dequeued task is
+	 * a short sleeping one, check if we can reserve
+	 * this idle CPU for that task for a short while.
+	 * During this reservation period, other wakees will
+	 * skip this 'idle' CPU in select_idle_cpu(), and this
+	 * short sleeping task can pick its previous CPU in
+	 * select_idle_sibling(), which brings better cache
+	 * locality.
+	 */
+	if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
+	    p->se.sleep_avg && p->se.sleep_avg < sysctl_sched_migration_cost)
+		rq->cache_hot_timeout = now + p->se.sleep_avg;
+#endif
 }
 
 #ifdef CONFIG_SMP
@@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
 static inline int __select_idle_cpu(int cpu, struct task_struct *p)
 {
 	if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
-	    sched_cpu_cookie_match(cpu_rq(cpu), p))
+	    sched_cpu_cookie_match(cpu_rq(cpu), p)) {
+		if (sched_feat(SIS_CACHE) &&
+		    sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout)
+			return -1;
+
 		return cpu;
+	}
 
 	return -1;
 }
@@ -7052,10 +7072,14 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
 	int cpu;
 
 	for_each_cpu(cpu, cpu_smt_mask(core)) {
-		if (!available_idle_cpu(cpu)) {
+		bool cache_hot = sched_feat(SIS_CACHE) ?
+			sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout : false;
+
+		if (!available_idle_cpu(cpu) || cache_hot) {
 			idle = false;
 			if (*idle_cpu == -1) {
-				if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr)) {
+				if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr) &&
+				    !cache_hot) {
 					*idle_cpu = cpu;
 					break;
 				}
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index f770168230ae..04ed9fcf67f8 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -51,6 +51,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  */
 SCHED_FEAT(SIS_PROP, false)
 SCHED_FEAT(SIS_UTIL, true)
+SCHED_FEAT(SIS_CACHE, true)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 62013c49c451..7a2c12c3b6d0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1078,6 +1078,7 @@ struct rq {
 #endif
 	u64			idle_stamp;
 	u64			avg_idle;
+	u64			cache_hot_timeout;
 
 	unsigned long		wake_stamp;
 	u64			wake_avg_idle;
-- 
2.25.1



* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
  2023-09-11  2:50 ` [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() Chen Yu
@ 2023-09-11  7:26   ` Aaron Lu
  2023-09-11  8:40     ` Chen Yu
  2023-09-11  8:29   ` K Prateek Nayak
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 34+ messages in thread
From: Aaron Lu @ 2023-09-11  7:26 UTC (permalink / raw)
  To: Chen Yu
  Cc: Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Tim Chen, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Valentin Schneider, K Prateek Nayak, Gautham R . Shenoy,
	linux-kernel

On Mon, Sep 11, 2023 at 10:50:00AM +0800, Chen Yu wrote:
> When task p is woken up, the scheduler leverages select_idle_sibling()
> to find an idle CPU for it. p's previous CPU is usually a preference
> because it can improve cache locality. However in many cases the
> previous CPU has already been taken by other wakees, thus p has to
> find another idle CPU.
> 
> Inspired by Mathieu's idea[1], consider the sleep time of the task.
> If that task is a short sleeping one, keep p's previous CPU idle
> for a short while. Later when p is woken up, it can choose its
> previous CPU in select_idle_sibling(). When p's previous CPU is reserved,
> other wakee is not allowed to choose this CPU in select_idle_idle().
> The reservation period is set to the task's average sleep time. That
> is to say, if p is a short sleeping task, there is no need to migrate
> p to another idle CPU.
> 
> This does not break the work conservation of the scheduler,
> because wakee will still try its best to find an idle CPU.
> The difference is that, different idle CPUs might have different
> priorities. On the other hand, in theory this extra check could
> increase the failure ratio of select_idle_cpu(), but per the
> initial test result, no regression is detected.
> 
> Baseline: tip/sched/core, on top of:
> Commit 3f4feb58037a ("sched: Misc cleanups")
> 
> Benchmark results on Intel Sapphire Rapids, 112 CPUs/socket, 2 sockets.
> cpufreq governor is performance, turbo boost is disabled, C-states deeper
> than C1 are disabled, Numa balancing is disabled.
> 
> netperf
> =======
> case                    load            baseline(std%)  compare%( std%)
> UDP_RR                  56-threads       1.00 (  1.34)   +1.05 (  1.04)
> UDP_RR                  112-threads      1.00 (  7.94)   -0.68 ( 14.42)
> UDP_RR                  168-threads      1.00 ( 33.17)  +49.63 (  5.96)
> UDP_RR                  224-threads      1.00 ( 13.52)  +122.53 ( 18.50)
> 
> Noticeable improvements of netperf is observed in 168 and 224 threads
> cases.
> 
> hackbench
> =========
> case                    load            baseline(std%)  compare%( std%)
> process-pipe            1-groups         1.00 (  5.61)   -4.69 (  1.48)
> process-pipe            2-groups         1.00 (  8.74)   -0.24 (  3.10)
> process-pipe            4-groups         1.00 (  3.52)   +1.61 (  4.41)
> process-sockets         1-groups         1.00 (  4.73)   +2.32 (  0.95)
> process-sockets         2-groups         1.00 (  1.27)   -3.29 (  0.97)
> process-sockets         4-groups         1.00 (  0.09)   +0.24 (  0.09)
> threads-pipe            1-groups         1.00 ( 10.44)   -5.88 (  1.49)
> threads-pipe            2-groups         1.00 ( 19.15)   +5.31 ( 12.90)
> threads-pipe            4-groups         1.00 (  1.74)   -5.01 (  6.44)
> threads-sockets         1-groups         1.00 (  1.58)   -1.79 (  0.43)
> threads-sockets         2-groups         1.00 (  1.19)   -8.43 (  6.91)
> threads-sockets         4-groups         1.00 (  0.10)   -0.09 (  0.07)
> 
> schbench(old)
> ========
> case                    load            baseline(std%)  compare%( std%)
> normal                  1-mthreads       1.00 (  0.63)   +1.28 (  0.37)
> normal                  2-mthreads       1.00 (  8.33)   +1.58 (  2.83)
> normal                  4-mthreads       1.00 (  2.48)   -2.98 (  3.28)
> normal                  8-mthreads       1.00 (  3.97)   +5.01 (  1.28)
> 
> No much difference is observed in hackbench/schbench, due to the
> run-to-run variance.
> 
> Link: https://lore.kernel.org/lkml/20230905171105.1005672-2-mathieu.desnoyers@efficios.com/ #1
> Suggested-by: Tim Chen <tim.c.chen@intel.com>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> ---
>  kernel/sched/fair.c     | 30 +++++++++++++++++++++++++++---
>  kernel/sched/features.h |  1 +
>  kernel/sched/sched.h    |  1 +
>  3 files changed, 29 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e20f50726ab8..fe3b760c9654 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  	hrtick_update(rq);
>  	now = sched_clock_cpu(cpu_of(rq));
>  	p->se.prev_sleep_time = task_sleep ? now : 0;
> +#ifdef CONFIG_SMP
> +	/*
> +	 * If this rq will become idle, and dequeued task is
> +	 * a short sleeping one, check if we can reserve
> +	 * this idle CPU for that task for a short while.
> +	 * During this reservation period, other wakees will
> +	 * skip this 'idle' CPU in select_idle_cpu(), and this
> +	 * short sleeping task can pick its previous CPU in
> +	 * select_idle_sibling(), which brings better cache
> +	 * locality.
> +	 */
> +	if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
> +	    p->se.sleep_avg && p->se.sleep_avg < sysctl_sched_migration_cost)
> +		rq->cache_hot_timeout = now + p->se.sleep_avg;

Should this be written as:
		rq->cache_hot_timeout = max(rq->cache_hot_timeout, now + p->se.sleep_avg);
?

An even earlier task may have a bigger sleep_avg, and overwriting
rq->cache_hot_timeout here with an earlier time may cause that task
to migrate. Not sure how much impact this can have, just something
that hit my brain while reading this code.

> +#endif
>  }
>  
>  #ifdef CONFIG_SMP
> @@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
>  static inline int __select_idle_cpu(int cpu, struct task_struct *p)
>  {
>  	if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
> -	    sched_cpu_cookie_match(cpu_rq(cpu), p))
> +	    sched_cpu_cookie_match(cpu_rq(cpu), p)) {
> +		if (sched_feat(SIS_CACHE) &&
> +		    sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout)
> +			return -1;
> +

Maybe introduce a new function that also considers rq->cache_hot_timeout,
like available_idle_cpu_migrate(), so that the logic above and below can
be simplified a bit?
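
Something like the following, perhaps (just a rough sketch of the
suggestion; the helper name and exact shape are an assumption):

	/* idle and not reserved as cache-hot for a recently dequeued task */
	static inline int available_idle_cpu_migrate(int cpu)
	{
		if (!available_idle_cpu(cpu))
			return 0;

		if (sched_feat(SIS_CACHE) &&
		    sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout)
			return 0;

		return 1;
	}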

I was thinking of simply adding that rq->cache_hot_timeout check to
available_idle_cpu(), but then a long sleeping task could be forced to
migrate if its prev_cpu happens to just deschedule a task that sets rq's
cache_hot_timeout. I guess that's why you chose to only change the idle
semantics in select_idle_cpu() but not in select_idle_sibling()?

Thanks,
Aaron

>  		return cpu;
> +	}
>  
>  	return -1;
>  }
> @@ -7052,10 +7072,14 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
>  	int cpu;
>  
>  	for_each_cpu(cpu, cpu_smt_mask(core)) {
> -		if (!available_idle_cpu(cpu)) {
> +		bool cache_hot = sched_feat(SIS_CACHE) ?
> +			sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout : false;
> +
> +		if (!available_idle_cpu(cpu) || cache_hot) {
>  			idle = false;
>  			if (*idle_cpu == -1) {
> -				if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr)) {
> +				if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr) &&
> +				    !cache_hot) {
>  					*idle_cpu = cpu;
>  					break;
>  				}
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index f770168230ae..04ed9fcf67f8 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -51,6 +51,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
>   */
>  SCHED_FEAT(SIS_PROP, false)
>  SCHED_FEAT(SIS_UTIL, true)
> +SCHED_FEAT(SIS_CACHE, true)
>  
>  /*
>   * Issue a WARN when we do multiple update_rq_clock() calls
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 62013c49c451..7a2c12c3b6d0 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1078,6 +1078,7 @@ struct rq {
>  #endif
>  	u64			idle_stamp;
>  	u64			avg_idle;
> +	u64			cache_hot_timeout;
>  
>  	unsigned long		wake_stamp;
>  	u64			wake_avg_idle;
> -- 
> 2.25.1
> 


* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
  2023-09-11  2:50 ` [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() Chen Yu
  2023-09-11  7:26   ` Aaron Lu
@ 2023-09-11  8:29   ` K Prateek Nayak
  2023-09-11 10:19     ` Chen Yu
  2023-09-12  6:32     ` Mike Galbraith
  2023-09-11 15:26   ` Mathieu Desnoyers
  2023-09-14  5:30   ` K Prateek Nayak
  3 siblings, 2 replies; 34+ messages in thread
From: K Prateek Nayak @ 2023-09-11  8:29 UTC (permalink / raw)
  To: Chen Yu
  Cc: Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Gautham R . Shenoy, linux-kernel, Peter Zijlstra,
	Mathieu Desnoyers, Ingo Molnar, Vincent Guittot, Juri Lelli

Hello Chenyu,

On 9/11/2023 8:20 AM, Chen Yu wrote:
>  [..snip..]
>  kernel/sched/fair.c     | 30 +++++++++++++++++++++++++++---
>  kernel/sched/features.h |  1 +
>  kernel/sched/sched.h    |  1 +
>  3 files changed, 29 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e20f50726ab8..fe3b760c9654 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  	hrtick_update(rq);
>  	now = sched_clock_cpu(cpu_of(rq));
>  	p->se.prev_sleep_time = task_sleep ? now : 0;
> +#ifdef CONFIG_SMP
> +	/*
> +	 * If this rq will become idle, and dequeued task is
> +	 * a short sleeping one, check if we can reserve
> +	 * this idle CPU for that task for a short while.
> +	 * During this reservation period, other wakees will
> +	 * skip this 'idle' CPU in select_idle_cpu(), and this
> +	 * short sleeping task can pick its previous CPU in
> +	 * select_idle_sibling(), which brings better cache
> +	 * locality.
> +	 */
> +	if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
> +	    p->se.sleep_avg && p->se.sleep_avg < sysctl_sched_migration_cost)
> +		rq->cache_hot_timeout = now + p->se.sleep_avg;
> +#endif
>  }
>  
>  #ifdef CONFIG_SMP
> @@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
>  static inline int __select_idle_cpu(int cpu, struct task_struct *p)
>  {
>  	if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
> -	    sched_cpu_cookie_match(cpu_rq(cpu), p))
> +	    sched_cpu_cookie_match(cpu_rq(cpu), p)) {
> +		if (sched_feat(SIS_CACHE) &&
> +		    sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout)
> +			return -1;

Just wondering,

Similar to how select_idle_core() caches the "idle_cpu" if it ends up
finding one in its search for an idle core, would returning a "cache-hot
idle CPU" be better than returning previous CPU / current CPU if all
idle CPUs found during the search in select_idle_cpu() are marked
cache-hot?
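
For example, a purely hypothetical sketch of what that could look like
inside the select_idle_cpu() scan loop, with made-up variable names:

	i = __select_idle_cpu(cpu, p);
	if ((unsigned int)i < nr_cpumask_bits)
		return i;		/* found a genuinely idle CPU */
	if (cache_hot_cpu == -1 &&
	    (available_idle_cpu(cpu) || sched_idle_cpu(cpu)))
		cache_hot_cpu = cpu;	/* idle but reserved: remember it */

	/* ... and after the scan, before falling back to prev/target: */
	if (cache_hot_cpu != -1)
		return cache_hot_cpu;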

Speaking of cache-hot idle CPUs, is netperf actually happier with
piling onto the current CPU? I ask this because the logic seems to be
reserving the previous CPU for a task that dislikes migration, but I
do not see anything in the wake_affine_idle() path that would make the
short sleeper proactively choose the previous CPU when the wakeup is
marked with the WF_SYNC flag. Let me know if I'm missing something.

To confirm this, can you look at the trend in migration count with and
without the series? Also, the ratio of cache-hot idle CPUs to the number
of CPUs searched can help estimate the overhead of the additional search - I
presume SIS_UTIL is efficient at curbing the additional search in
a busy system.

In the meantime, I'll queue this series for testing on my end too.

> +
>  		return cpu;
> +	}
>  
>  	return -1;
>  }
> [..snip..]


--
Thanks and Regards,
Prateek


* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
  2023-09-11  7:26   ` Aaron Lu
@ 2023-09-11  8:40     ` Chen Yu
  2023-09-13  6:22       ` Gautham R. Shenoy
  0 siblings, 1 reply; 34+ messages in thread
From: Chen Yu @ 2023-09-11  8:40 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Tim Chen, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Valentin Schneider, K Prateek Nayak, Gautham R . Shenoy,
	linux-kernel

Hi Aaron,

thanks for the review,

On 2023-09-11 at 15:26:29 +0800, Aaron Lu wrote:
> On Mon, Sep 11, 2023 at 10:50:00AM +0800, Chen Yu wrote:
> > When task p is woken up, the scheduler leverages select_idle_sibling()
> > to find an idle CPU for it. p's previous CPU is usually a preference
> > because it can improve cache locality. However in many cases the
> > previous CPU has already been taken by other wakees, thus p has to
> > find another idle CPU.
> > 
> > Inspired by Mathieu's idea[1], consider the sleep time of the task.
> > If that task is a short sleeping one, keep p's previous CPU idle
> > for a short while. Later when p is woken up, it can choose its
> > previous CPU in select_idle_sibling(). When p's previous CPU is reserved,
> > other wakee is not allowed to choose this CPU in select_idle_idle().
> > The reservation period is set to the task's average sleep time. That
> > is to say, if p is a short sleeping task, there is no need to migrate
> > p to another idle CPU.
> > 
> > This does not break the work conservation of the scheduler,
> > because wakee will still try its best to find an idle CPU.
> > The difference is that, different idle CPUs might have different
> > priorities. On the other hand, in theory this extra check could
> > increase the failure ratio of select_idle_cpu(), but per the
> > initial test result, no regression is detected.
> > 
> > Baseline: tip/sched/core, on top of:
> > Commit 3f4feb58037a ("sched: Misc cleanups")
> > 
> > Benchmark results on Intel Sapphire Rapids, 112 CPUs/socket, 2 sockets.
> > cpufreq governor is performance, turbo boost is disabled, C-states deeper
> > than C1 are disabled, Numa balancing is disabled.
> > 
> > netperf
> > =======
> > case                    load            baseline(std%)  compare%( std%)
> > UDP_RR                  56-threads       1.00 (  1.34)   +1.05 (  1.04)
> > UDP_RR                  112-threads      1.00 (  7.94)   -0.68 ( 14.42)
> > UDP_RR                  168-threads      1.00 ( 33.17)  +49.63 (  5.96)
> > UDP_RR                  224-threads      1.00 ( 13.52)  +122.53 ( 18.50)
> > 
> > Noticeable improvements of netperf is observed in 168 and 224 threads
> > cases.
> > 
> > hackbench
> > =========
> > case                    load            baseline(std%)  compare%( std%)
> > process-pipe            1-groups         1.00 (  5.61)   -4.69 (  1.48)
> > process-pipe            2-groups         1.00 (  8.74)   -0.24 (  3.10)
> > process-pipe            4-groups         1.00 (  3.52)   +1.61 (  4.41)
> > process-sockets         1-groups         1.00 (  4.73)   +2.32 (  0.95)
> > process-sockets         2-groups         1.00 (  1.27)   -3.29 (  0.97)
> > process-sockets         4-groups         1.00 (  0.09)   +0.24 (  0.09)
> > threads-pipe            1-groups         1.00 ( 10.44)   -5.88 (  1.49)
> > threads-pipe            2-groups         1.00 ( 19.15)   +5.31 ( 12.90)
> > threads-pipe            4-groups         1.00 (  1.74)   -5.01 (  6.44)
> > threads-sockets         1-groups         1.00 (  1.58)   -1.79 (  0.43)
> > threads-sockets         2-groups         1.00 (  1.19)   -8.43 (  6.91)
> > threads-sockets         4-groups         1.00 (  0.10)   -0.09 (  0.07)
> > 
> > schbench(old)
> > ========
> > case                    load            baseline(std%)  compare%( std%)
> > normal                  1-mthreads       1.00 (  0.63)   +1.28 (  0.37)
> > normal                  2-mthreads       1.00 (  8.33)   +1.58 (  2.83)
> > normal                  4-mthreads       1.00 (  2.48)   -2.98 (  3.28)
> > normal                  8-mthreads       1.00 (  3.97)   +5.01 (  1.28)
> > 
> > No much difference is observed in hackbench/schbench, due to the
> > run-to-run variance.
> > 
> > Link: https://lore.kernel.org/lkml/20230905171105.1005672-2-mathieu.desnoyers@efficios.com/ #1
> > Suggested-by: Tim Chen <tim.c.chen@intel.com>
> > Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> > ---
> >  kernel/sched/fair.c     | 30 +++++++++++++++++++++++++++---
> >  kernel/sched/features.h |  1 +
> >  kernel/sched/sched.h    |  1 +
> >  3 files changed, 29 insertions(+), 3 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index e20f50726ab8..fe3b760c9654 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >  	hrtick_update(rq);
> >  	now = sched_clock_cpu(cpu_of(rq));
> >  	p->se.prev_sleep_time = task_sleep ? now : 0;
> > +#ifdef CONFIG_SMP
> > +	/*
> > +	 * If this rq will become idle, and dequeued task is
> > +	 * a short sleeping one, check if we can reserve
> > +	 * this idle CPU for that task for a short while.
> > +	 * During this reservation period, other wakees will
> > +	 * skip this 'idle' CPU in select_idle_cpu(), and this
> > +	 * short sleeping task can pick its previous CPU in
> > +	 * select_idle_sibling(), which brings better cache
> > +	 * locality.
> > +	 */
> > +	if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
> > +	    p->se.sleep_avg && p->se.sleep_avg < sysctl_sched_migration_cost)
> > +		rq->cache_hot_timeout = now + p->se.sleep_avg;
> 
> Should this be written as:
> 		rq->cache_hot_timeout = max(rq->cache_hot_timeout, now + p->se.sleep_avg);
> ?
> 
> A even earlier task may have a bigger sleep_avg and overwriting
> rq->cache_hot_timeout here with an earlier time may cause that task
> migrate. Not sure how much impact this can have, just someting hit my
> brain while reading this code.
>

My previous thought was that SIS_CACHE only honors the average sleep time
of the latest dequeued task. Say task p1 goes to sleep, then task p2 is woken
up and goes to sleep; we only honor the sleep time of p2. That is because we
don't know how long p2 has executed and how much p2 has polluted the L1/L2 cache,
so does it still make sense to let p1 be woken up on its previous
CPU (is this CPU still cache hot for p1?)
 
> > +#endif
> >  }
> >  
> >  #ifdef CONFIG_SMP
> > @@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
> >  static inline int __select_idle_cpu(int cpu, struct task_struct *p)
> >  {
> >  	if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
> > -	    sched_cpu_cookie_match(cpu_rq(cpu), p))
> > +	    sched_cpu_cookie_match(cpu_rq(cpu), p)) {
> > +		if (sched_feat(SIS_CACHE) &&
> > +		    sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout)
> > +			return -1;
> > +
> 
> Maybe introduce a new function that also considers rq->cache_hot_timeout,
> like available_idle_cpu_migrate() so that above and below logic can be
> simplified a bit?
> 

Yes, that would be simpler, I'll do in next version.

> I was thinking to simply add that rq->cache_hot_timeout check to
> available_idle_cpu() but then a long sleeping task could be forced to
> migrate if its prev_cpu happens to just deschedule a task that sets rq's
> cache_hot_timeout. I guess that's why you chose to only change the idle
> semantic in select_idle_cpu() but not in select_idle_sibling()?
>

Yes, sort of. And the reason I did not put this cache hot check in available_idle_cpu()
or idle_cpu() was mainly because these APIs are generic and could be invoked by select_idle_sibling().
If the task falls asleep and is woken up quickly, its previous idle CPU would also be skipped,
so no one could use this CPU within the cache hot period, including the cache-hot task
itself.

thanks,
Chenyu


* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
  2023-09-11  8:29   ` K Prateek Nayak
@ 2023-09-11 10:19     ` Chen Yu
  2023-09-12  3:05       ` K Prateek Nayak
  2023-09-12  9:39       ` Mike Galbraith
  2023-09-12  6:32     ` Mike Galbraith
  1 sibling, 2 replies; 34+ messages in thread
From: Chen Yu @ 2023-09-11 10:19 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Gautham R . Shenoy, linux-kernel, Peter Zijlstra,
	Mathieu Desnoyers, Ingo Molnar, Vincent Guittot, Juri Lelli

Hi Prateek,

thanks for your review,

On 2023-09-11 at 13:59:10 +0530, K Prateek Nayak wrote:
> Hello Chenyu,
> 
> On 9/11/2023 8:20 AM, Chen Yu wrote:
> >  [..snip..]
> >  kernel/sched/fair.c     | 30 +++++++++++++++++++++++++++---
> >  kernel/sched/features.h |  1 +
> >  kernel/sched/sched.h    |  1 +
> >  3 files changed, 29 insertions(+), 3 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index e20f50726ab8..fe3b760c9654 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >  	hrtick_update(rq);
> >  	now = sched_clock_cpu(cpu_of(rq));
> >  	p->se.prev_sleep_time = task_sleep ? now : 0;
> > +#ifdef CONFIG_SMP
> > +	/*
> > +	 * If this rq will become idle, and dequeued task is
> > +	 * a short sleeping one, check if we can reserve
> > +	 * this idle CPU for that task for a short while.
> > +	 * During this reservation period, other wakees will
> > +	 * skip this 'idle' CPU in select_idle_cpu(), and this
> > +	 * short sleeping task can pick its previous CPU in
> > +	 * select_idle_sibling(), which brings better cache
> > +	 * locality.
> > +	 */
> > +	if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
> > +	    p->se.sleep_avg && p->se.sleep_avg < sysctl_sched_migration_cost)
> > +		rq->cache_hot_timeout = now + p->se.sleep_avg;
> > +#endif
> >  }
> >  
> >  #ifdef CONFIG_SMP
> > @@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
> >  static inline int __select_idle_cpu(int cpu, struct task_struct *p)
> >  {
> >  	if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
> > -	    sched_cpu_cookie_match(cpu_rq(cpu), p))
> > +	    sched_cpu_cookie_match(cpu_rq(cpu), p)) {
> > +		if (sched_feat(SIS_CACHE) &&
> > +		    sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout)
> > +			return -1;
> 
> Just wondering,
> 
> Similar to how select_idle_core() caches the "idle_cpu" if it ends up
> finding one in its search for an idle core, would returning a "cache-hot
> idle CPU" be better than returning previous CPU / current CPU if all
> idle CPUs found during the search in select_idle_cpu() are marked
> cache-hot?
> 

This is a good point; we can optimize this further. Currently I only
sent a simple version to demonstrate how we can leverage the task's
sleep time.

> Speaking of cache-hot idle CPU, is netperf actually more happy with
> piling on current CPU?

Yes. Per my previous test, netperf of TCP_RR/UDP_RR really likes to
put the waker and wakee together.

> I ask this because the logic seems to be
> reserving the previous CPU for a task that dislikes migration but I
> do not see anything in the wake_affine_idle() path that would make the
> short sleeper proactively choose the previous CPU when the wakeup is
> marked with the WF_SYNC flag. Let me know if I'm missing something?
> 

If I understand correctly, WF_SYNC is to let the wakee be woken up
on the waker's CPU, rather than on the wakee's previous CPU, because
the waker goes to sleep after the wakeup. SIS_CACHE mainly cares about
the wakee's previous CPU. We only restrict other wakees from
occupying the previous CPU, but we do not increase the possibility
that wake_affine_idle() chooses the previous CPU.

Say there are two tasks t1 and t2, and t1's previous CPU is p1.
We don't influence whether wake_affine_idle() will choose p1 when t1
is woken up, but we make sure t2 will not choose p1.

> To confirm this can you look at the trend in migration count with and
> without the series? Also the ratio of cache-hot idle CPUs to number
> of CPUs searched can help estimate overheads of additional search - I
> presume SIS_UTIL is efficient at curbing the additional search in
> a busy system.

OK, I'll collect these statistics.

> 
> In the meantime, I'll queue this series for testing on my end too.

Thanks again for your time.


thanks,
Chenyu 


* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
  2023-09-11  2:50 ` [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() Chen Yu
  2023-09-11  7:26   ` Aaron Lu
  2023-09-11  8:29   ` K Prateek Nayak
@ 2023-09-11 15:26   ` Mathieu Desnoyers
  2023-09-11 15:43     ` Mathieu Desnoyers
  2023-09-20 12:34     ` Chen Yu
  2023-09-14  5:30   ` K Prateek Nayak
  3 siblings, 2 replies; 34+ messages in thread
From: Mathieu Desnoyers @ 2023-09-11 15:26 UTC (permalink / raw)
  To: Chen Yu, Peter Zijlstra, Ingo Molnar, Vincent Guittot, Juri Lelli
  Cc: Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	K Prateek Nayak, Gautham R . Shenoy, linux-kernel

On 9/10/23 22:50, Chen Yu wrote:
> When task p is woken up, the scheduler leverages select_idle_sibling()
> to find an idle CPU for it. p's previous CPU is usually a preference
> because it can improve cache locality. However in many cases the
> previous CPU has already been taken by other wakees, thus p has to
> find another idle CPU.
> 
> Inspired by Mathieu's idea[1], consider the sleep time of the task.
> If that task is a short sleeping one, keep p's previous CPU idle
> for a short while. Later when p is woken up, it can choose its
> previous CPU in select_idle_sibling(). When p's previous CPU is reserved,
> other wakee is not allowed to choose this CPU in select_idle_idle().
> The reservation period is set to the task's average sleep time. That
> is to say, if p is a short sleeping task, there is no need to migrate
> p to another idle CPU.
> 
> This does not break the work conservation of the scheduler,
> because wakee will still try its best to find an idle CPU.
> The difference is that, different idle CPUs might have different
> priorities. On the other hand, in theory this extra check could
> increase the failure ratio of select_idle_cpu(), but per the
> initial test result, no regression is detected.
> 
> Baseline: tip/sched/core, on top of:
> Commit 3f4feb58037a ("sched: Misc cleanups")
> 
> Benchmark results on Intel Sapphire Rapids, 112 CPUs/socket, 2 sockets.
> cpufreq governor is performance, turbo boost is disabled, C-states deeper
> than C1 are disabled, Numa balancing is disabled.
> 
> netperf
> =======
> case                    load            baseline(std%)  compare%( std%)
> UDP_RR                  56-threads       1.00 (  1.34)   +1.05 (  1.04)
> UDP_RR                  112-threads      1.00 (  7.94)   -0.68 ( 14.42)
> UDP_RR                  168-threads      1.00 ( 33.17)  +49.63 (  5.96)
> UDP_RR                  224-threads      1.00 ( 13.52)  +122.53 ( 18.50)
> 
> Noticeable improvements of netperf is observed in 168 and 224 threads
> cases.
> 
> hackbench
> =========
> case                    load            baseline(std%)  compare%( std%)
> process-pipe            1-groups         1.00 (  5.61)   -4.69 (  1.48)
> process-pipe            2-groups         1.00 (  8.74)   -0.24 (  3.10)
> process-pipe            4-groups         1.00 (  3.52)   +1.61 (  4.41)
> process-sockets         1-groups         1.00 (  4.73)   +2.32 (  0.95)
> process-sockets         2-groups         1.00 (  1.27)   -3.29 (  0.97)
> process-sockets         4-groups         1.00 (  0.09)   +0.24 (  0.09)
> threads-pipe            1-groups         1.00 ( 10.44)   -5.88 (  1.49)
> threads-pipe            2-groups         1.00 ( 19.15)   +5.31 ( 12.90)
> threads-pipe            4-groups         1.00 (  1.74)   -5.01 (  6.44)
> threads-sockets         1-groups         1.00 (  1.58)   -1.79 (  0.43)
> threads-sockets         2-groups         1.00 (  1.19)   -8.43 (  6.91)
> threads-sockets         4-groups         1.00 (  0.10)   -0.09 (  0.07)
> 
> schbench(old)
> ========
> case                    load            baseline(std%)  compare%( std%)
> normal                  1-mthreads       1.00 (  0.63)   +1.28 (  0.37)
> normal                  2-mthreads       1.00 (  8.33)   +1.58 (  2.83)
> normal                  4-mthreads       1.00 (  2.48)   -2.98 (  3.28)
> normal                  8-mthreads       1.00 (  3.97)   +5.01 (  1.28)
> 
> No much difference is observed in hackbench/schbench, due to the
> run-to-run variance.
> 
> Link: https://lore.kernel.org/lkml/20230905171105.1005672-2-mathieu.desnoyers@efficios.com/ #1
> Suggested-by: Tim Chen <tim.c.chen@intel.com>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> ---
>   kernel/sched/fair.c     | 30 +++++++++++++++++++++++++++---
>   kernel/sched/features.h |  1 +
>   kernel/sched/sched.h    |  1 +
>   3 files changed, 29 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e20f50726ab8..fe3b760c9654 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>   	hrtick_update(rq);
>   	now = sched_clock_cpu(cpu_of(rq));
>   	p->se.prev_sleep_time = task_sleep ? now : 0;
> +#ifdef CONFIG_SMP
> +	/*
> +	 * If this rq will become idle, and dequeued task is
> +	 * a short sleeping one, check if we can reserve
> +	 * this idle CPU for that task for a short while.
> +	 * During this reservation period, other wakees will
> +	 * skip this 'idle' CPU in select_idle_cpu(), and this
> +	 * short sleeping task can pick its previous CPU in
> +	 * select_idle_sibling(), which brings better cache
> +	 * locality.
> +	 */
> +	if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
> +	    p->se.sleep_avg && p->se.sleep_avg < sysctl_sched_migration_cost)
> +		rq->cache_hot_timeout = now + p->se.sleep_avg;

This is really cool!

There is one scenario that worries me with this approach: workloads
that sleep for a long time and then have short blocked periods.
Those bursts will likely bring the average to values too high
to stay below sysctl_sched_migration_cost.

I wonder if changing the code above for the following would help ?

if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && p->se.sleep_avg)
	rq->cache_hot_timeout = now + min(sysctl_sched_migration_cost, p->se.sleep_avg);

For tasks that have a large sleep_avg, it would activate this rq
"appear as not idle for rq selection" scheme for a window of
sysctl_sched_migration_cost. If the sleep ends up being a long one,
preventing other tasks from being migrated to this rq for a tiny
window should not matter performance-wise. I would expect that it
could help workloads that come in bursts.

Thanks,

Mathieu


> +#endif
>   }
>   
>   #ifdef CONFIG_SMP
> @@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
>   static inline int __select_idle_cpu(int cpu, struct task_struct *p)
>   {
>   	if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
> -	    sched_cpu_cookie_match(cpu_rq(cpu), p))
> +	    sched_cpu_cookie_match(cpu_rq(cpu), p)) {
> +		if (sched_feat(SIS_CACHE) &&
> +		    sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout)
> +			return -1;
> +
>   		return cpu;
> +	}
>   
>   	return -1;
>   }
> @@ -7052,10 +7072,14 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
>   	int cpu;
>   
>   	for_each_cpu(cpu, cpu_smt_mask(core)) {
> -		if (!available_idle_cpu(cpu)) {
> +		bool cache_hot = sched_feat(SIS_CACHE) ?
> +			sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout : false;
> +
> +		if (!available_idle_cpu(cpu) || cache_hot) {
>   			idle = false;
>   			if (*idle_cpu == -1) {
> -				if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr)) {
> +				if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr) &&
> +				    !cache_hot) {
>   					*idle_cpu = cpu;
>   					break;
>   				}
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index f770168230ae..04ed9fcf67f8 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -51,6 +51,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
>    */
>   SCHED_FEAT(SIS_PROP, false)
>   SCHED_FEAT(SIS_UTIL, true)
> +SCHED_FEAT(SIS_CACHE, true)
>   
>   /*
>    * Issue a WARN when we do multiple update_rq_clock() calls
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 62013c49c451..7a2c12c3b6d0 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1078,6 +1078,7 @@ struct rq {
>   #endif
>   	u64			idle_stamp;
>   	u64			avg_idle;
> +	u64			cache_hot_timeout;
>   
>   	unsigned long		wake_stamp;
>   	u64			wake_avg_idle;

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com



* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
  2023-09-11 15:26   ` Mathieu Desnoyers
@ 2023-09-11 15:43     ` Mathieu Desnoyers
  2023-09-12 11:53       ` Chen Yu
  2023-09-20 12:34     ` Chen Yu
  1 sibling, 1 reply; 34+ messages in thread
From: Mathieu Desnoyers @ 2023-09-11 15:43 UTC (permalink / raw)
  To: Chen Yu, Peter Zijlstra, Ingo Molnar, Vincent Guittot, Juri Lelli
  Cc: Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	K Prateek Nayak, Gautham R . Shenoy, linux-kernel

On 9/11/23 11:26, Mathieu Desnoyers wrote:
> On 9/10/23 22:50, Chen Yu wrote:
[...]
>> ---
>>   kernel/sched/fair.c     | 30 +++++++++++++++++++++++++++---
>>   kernel/sched/features.h |  1 +
>>   kernel/sched/sched.h    |  1 +
>>   3 files changed, 29 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index e20f50726ab8..fe3b760c9654 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, 
>> struct task_struct *p, int flags)
>>       hrtick_update(rq);
>>       now = sched_clock_cpu(cpu_of(rq));
>>       p->se.prev_sleep_time = task_sleep ? now : 0;
>> +#ifdef CONFIG_SMP
>> +    /*
>> +     * If this rq will become idle, and dequeued task is
>> +     * a short sleeping one, check if we can reserve
>> +     * this idle CPU for that task for a short while.
>> +     * During this reservation period, other wakees will
>> +     * skip this 'idle' CPU in select_idle_cpu(), and this
>> +     * short sleeping task can pick its previous CPU in
>> +     * select_idle_sibling(), which brings better cache
>> +     * locality.
>> +     */
>> +    if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
>> +        p->se.sleep_avg && p->se.sleep_avg < 
>> sysctl_sched_migration_cost)
>> +        rq->cache_hot_timeout = now + p->se.sleep_avg;
> 
> This is really cool!
> 
> There is one scenario that worries me with this approach: workloads
> that sleep for a long time and then have short blocked periods.
> Those bursts will likely bring the average to values too high
> to stay below sysctl_sched_migration_cost.
> 
> I wonder if changing the code above for the following would help ?
> 
> if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && 
> p->se.sleep_avg)
>      rq->cache_hot_timeout = now + min(sysctl_sched_migration_cost, 
> p->se.sleep_avg);
> 
> For tasks that have a large sleep_avg, it would activate this rq
> "appear as not idle for rq selection" scheme for a window of
> sysctl_sched_migration_cost. If the sleep ends up being a long one,
> preventing other tasks from being migrated to this rq for a tiny
> window should not matter performance-wise. I would expect that it
> could help workloads that come in bursts.

There is perhaps a better way to handle bursts:

When calculating the sleep_avg, we actually only really care about
the sleep time for short bursts, so we could use the sysctl_sched_migration_cost
to select which of the sleeps should be taken into account in the avg.

We could rename the field "sleep_avg" to "burst_sleep_avg", and have:

u64 now = sched_clock_cpu(task_cpu(p));

if ((flags & ENQUEUE_WAKEUP) && last_sleep && cpu_online(task_cpu(p)) &&
     now > last_sleep && now - last_sleep < sysctl_sched_migration_cost)
	update_avg(&p->se.burst_sleep_avg, now - last_sleep);

Then we can have this code in dequeue_task_fair():

if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && p->se.burst_sleep_avg)
	rq->cache_hot_timeout = now + p->se.burst_sleep_avg;

Thoughts ?

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com



* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
  2023-09-11 10:19     ` Chen Yu
@ 2023-09-12  3:05       ` K Prateek Nayak
  2023-09-12 12:32         ` Chen Yu
  2023-09-12  9:39       ` Mike Galbraith
  1 sibling, 1 reply; 34+ messages in thread
From: K Prateek Nayak @ 2023-09-12  3:05 UTC (permalink / raw)
  To: Chen Yu
  Cc: Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Gautham R . Shenoy, linux-kernel, Peter Zijlstra,
	Mathieu Desnoyers, Ingo Molnar, Vincent Guittot, Juri Lelli

Hello Chenyu,

On 9/11/2023 3:49 PM, Chen Yu wrote:
> Hi Prateek,
> 
> thanks for your review,
> 
> On 2023-09-11 at 13:59:10 +0530, K Prateek Nayak wrote:
>> Hello Chenyu,
>>
>> On 9/11/2023 8:20 AM, Chen Yu wrote:
>>>  [..snip..]
>>>  kernel/sched/fair.c     | 30 +++++++++++++++++++++++++++---
>>>  kernel/sched/features.h |  1 +
>>>  kernel/sched/sched.h    |  1 +
>>>  3 files changed, 29 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index e20f50726ab8..fe3b760c9654 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>>  	hrtick_update(rq);
>>>  	now = sched_clock_cpu(cpu_of(rq));
>>>  	p->se.prev_sleep_time = task_sleep ? now : 0;
>>> +#ifdef CONFIG_SMP
>>> +	/*
>>> +	 * If this rq will become idle, and dequeued task is
>>> +	 * a short sleeping one, check if we can reserve
>>> +	 * this idle CPU for that task for a short while.
>>> +	 * During this reservation period, other wakees will
>>> +	 * skip this 'idle' CPU in select_idle_cpu(), and this
>>> +	 * short sleeping task can pick its previous CPU in
>>> +	 * select_idle_sibling(), which brings better cache
>>> +	 * locality.
>>> +	 */
>>> +	if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
>>> +	    p->se.sleep_avg && p->se.sleep_avg < sysctl_sched_migration_cost)
>>> +		rq->cache_hot_timeout = now + p->se.sleep_avg;
>>> +#endif
>>>  }
>>>  
>>>  #ifdef CONFIG_SMP
>>> @@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
>>>  static inline int __select_idle_cpu(int cpu, struct task_struct *p)
>>>  {
>>>  	if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
>>> -	    sched_cpu_cookie_match(cpu_rq(cpu), p))
>>> +	    sched_cpu_cookie_match(cpu_rq(cpu), p)) {
>>> +		if (sched_feat(SIS_CACHE) &&
>>> +		    sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout)
>>> +			return -1;
>>
>> Just wondering,
>>
>> Similar to how select_idle_core() caches the "idle_cpu" if it ends up
>> finding one in its search for an idle core, would returning a "cache-hot
>> idle CPU" be better than returning previous CPU / current CPU if all
>> idle CPUs found during the search in select_idle_cpu() are marked
>> cache-hot?
>>
> 
> This is a good point, we can optimize this further. Currently I only
> send a simple version to desmonstrate how we can leverage the task's
> sleep time.
> 
>> Speaking of cache-hot idle CPU, is netperf actually more happy with
>> piling on current CPU?
> 
> Yes. Per my previous test, netperf of TCP_RR/UDP_RR really likes to
> put the waker and wakee together.
> 
>> I ask this because the logic seems to be
>> reserving the previous CPU for a task that dislikes migration but I
>> do not see anything in the wake_affine_idle() path that would make the
>> short sleeper proactively choose the previous CPU when the wakeup is
>> marked with the WF_SYNC flag. Let me know if I'm missing something?
>>
> 
> If I understand correctly, WF_SYNC is to let the wakee to woken up
> on the waker's CPU, rather than the wakee's previous CPU, because
> the waker goes to sleep after wakeup. SIS_CACHE mainly cares about
> wakee's previous CPU. We can only restrict that other wakee does not
> occupy the previous CPU, but do not enhance the possibility that
> wake_affine_idle() chooses the previous CPU.

Correct me if I'm wrong here,

Say a short sleeper is always woken up using the WF_SYNC flag. When the
task is dequeued, we mark the previous CPU where it ran as "cache-hot"
and restrict any wakeup there until the "cache_hot_timeout" is
crossed. Let us assume a perfect world where the task wakes up before
the "cache_hot_timeout" expires. Logically this CPU was reserved all
this while for the short sleeper, but since the wakeup bears the WF_SYNC
flag, the whole reservation is ignored and the waker's LLC is explored.

Should the timeout be cleared if the wakeup decides not to target the
previous CPU? (The default "sysctl_sched_migration_cost" is probably
small enough to curb any side effect that could possibly show up here,
but if a genuine use-case warrants setting "sysctl_sched_migration_cost"
to a larger value, the wakeup path might be affected, with a lot of idle
targets being overlooked since the CPUs are marked cache-hot for a longer
duration.)
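
For instance, something along these lines in the wakeup path (purely a
hypothetical sketch of the idea, not tested):

	/* hypothetical: p's wakeup ended up targeting a CPU other than prev */
	if (target != prev_cpu)
		cpu_rq(prev_cpu)->cache_hot_timeout = 0;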

Let me know what you think.

> 
> Say, there are two tasks t1, t2. t1's previous CPU is p1.
> We don't enhance that when t1 is woken up, wake_affine_idle() will
> choose p1 or not, but we makes sure t2 will not choose p1.
> 
>> To confirm this can you look at the trend in migration count with and
>> without the series? Also the ratio of cache-hot idle CPUs to number
>> of CPUs searched can help estimate overheads of additional search - I
>> presume SIS_UTIL is efficient at curbing the additional search in
>> a busy system.
> 
> OK, I'll collect these statistics.

Thank you :)

> 
> [..snip..]
> 

--
Thanks and Regards,
Prateek


* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
  2023-09-11  8:29   ` K Prateek Nayak
  2023-09-11 10:19     ` Chen Yu
@ 2023-09-12  6:32     ` Mike Galbraith
  1 sibling, 0 replies; 34+ messages in thread
From: Mike Galbraith @ 2023-09-12  6:32 UTC (permalink / raw)
  To: K Prateek Nayak, Chen Yu
  Cc: Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Gautham R . Shenoy, linux-kernel, Peter Zijlstra,
	Mathieu Desnoyers, Ingo Molnar, Vincent Guittot, Juri Lelli

On Mon, 2023-09-11 at 13:59 +0530, K Prateek Nayak wrote:
>
> Speaking of cache-hot idle CPU, is netperf actually more happy with
> piling on current CPU?

Some tests would be happier, others not at all, some numbers below.

I doubt much in the real world can perform better stacked: to be a win,
the service latency and utilization loss induced by stacked-task overlap
have to be less than the cache population cost of an idle CPU, something
that modern CPUs have become darn good at, making for a high bar.

> I ask this because the logic seems to be
> reserving the previous CPU for a task that dislikes migration but I
> do not see anything in the wake_affine_idle() path that would make the
> short sleeper proactively choose the previous CPU when the wakeup is
> marked with the WF_SYNC flag. Let me know if I'm missing something?

If select_idle_sibling() didn't intervene, the wake affine logic would
indeed routinely step all over working sets, and at one time briefly
did so due to a silly bug. (see kernel/sched/fair.c.today:7292)

The sync hint stems from the bad old days of SMP when cross-cpu latency
was horrid, and has lost much of its value, but its bias toward the
waker CPU still helps reduce man-in-the-middle latency in a busy box,
which can do even more damage than the stacking of not really
synchronous tasks seen below.

The TCP/UDP_RR tests are very close to synchronous, and the numbers
reflect that: stacking is unbeatable for them [1], but for the other
tests, hopefully doing something a bit more realistic than tiny ball
ping-pong, stacking is a demonstrable loser.

Not super carefully run script output:

homer:/root # netperf.sh
TCP_SENDFILE-1  unbound    Avg:  87889  Sum:    87889
TCP_SENDFILE-1  stacked    Avg:  62885  Sum:    62885
TCP_SENDFILE-1  cross-smt  Avg:  58887  Sum:    58887
TCP_SENDFILE-1  cross-core Avg:  90673  Sum:    90673

TCP_STREAM-1    unbound    Avg:  71858  Sum:    71858
TCP_STREAM-1    stacked    Avg:  58883  Sum:    58883
TCP_STREAM-1    cross-smt  Avg:  49345  Sum:    49345
TCP_STREAM-1    cross-core Avg:  72346  Sum:    72346

TCP_MAERTS-1    unbound    Avg:  73890  Sum:    73890
TCP_MAERTS-1    stacked    Avg:  60682  Sum:    60682
TCP_MAERTS-1    cross-smt  Avg:  49868  Sum:    49868
TCP_MAERTS-1    cross-core Avg:  73343  Sum:    73343

UDP_STREAM-1    unbound    Avg:  99442  Sum:    99442
UDP_STREAM-1    stacked    Avg:  85319  Sum:    85319
UDP_STREAM-1    cross-smt  Avg:  63239  Sum:    63239
UDP_STREAM-1    cross-core Avg:  99102  Sum:    99102

TCP_RR-1        unbound    Avg: 200833  Sum:   200833
TCP_RR-1        stacked    Avg: 243733  Sum:   243733
TCP_RR-1        cross-smt  Avg: 138507  Sum:   138507
TCP_RR-1        cross-core Avg: 210404  Sum:   210404

UDP_RR-1        unbound    Avg: 252575  Sum:   252575
UDP_RR-1        stacked    Avg: 273081  Sum:   273081
UDP_RR-1        cross-smt  Avg: 168448  Sum:   168448
UDP_RR-1        cross-core Avg: 264124  Sum:   264124

1. nearly unbeatable - shared L2 CPUs can beat it by a wee bit.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
  2023-09-11 10:19     ` Chen Yu
  2023-09-12  3:05       ` K Prateek Nayak
@ 2023-09-12  9:39       ` Mike Galbraith
  2023-09-12 14:51         ` Chen Yu
  1 sibling, 1 reply; 34+ messages in thread
From: Mike Galbraith @ 2023-09-12  9:39 UTC (permalink / raw)
  To: Chen Yu, K Prateek Nayak
  Cc: Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Gautham R . Shenoy, linux-kernel, Peter Zijlstra,
	Mathieu Desnoyers, Ingo Molnar, Vincent Guittot, Juri Lelli

On Mon, 2023-09-11 at 18:19 +0800, Chen Yu wrote:
>
> > Speaking of cache-hot idle CPU, is netperf actually more happy with
> > piling on current CPU?
>
> Yes. Per my previous test, netperf of TCP_RR/UDP_RR really likes to
> put the waker and wakee together.

Hm, seems there's at least one shared L2 case where that's untrue by
more than a tiny margin, which surprised me rather a lot.

For grins, I tested netperf on my dinky rpi4b, and while its RR numbers
seem kinda odd, they're also seemingly repeatable (ergo showing them).
I measured a very modest cross-core win on a shared L2 Intel CPU some
years ago (when Q6600 was shiny/new) but nothing close to these deltas.

Makes me wonder what (a tad beefier) Bulldog RR numbers look like.

root@rpi4:~# ONLY=TCP_RR netperf.sh
TCP_RR-1        unbound    Avg:  29611  Sum:    29611
TCP_RR-1        stacked    Avg:  22540  Sum:    22540
TCP_RR-1        cross-core Avg:  30181  Sum:    30181

root@rpi4:~# netperf.sh
TCP_SENDFILE-1  unbound    Avg:  15572  Sum:    15572
TCP_SENDFILE-1  stacked    Avg:  11533  Sum:    11533
TCP_SENDFILE-1  cross-core Avg:  15751  Sum:    15751

TCP_STREAM-1    unbound    Avg:   6331  Sum:     6331
TCP_STREAM-1    stacked    Avg:   6031  Sum:     6031
TCP_STREAM-1    cross-core Avg:   6211  Sum:     6211

TCP_MAERTS-1    unbound    Avg:   6306  Sum:     6306
TCP_MAERTS-1    stacked    Avg:   6094  Sum:     6094
TCP_MAERTS-1    cross-core Avg:   9393  Sum:     9393

UDP_STREAM-1    unbound    Avg:  22277  Sum:    22277
UDP_STREAM-1    stacked    Avg:  18844  Sum:    18844
UDP_STREAM-1    cross-core Avg:  24749  Sum:    24749

TCP_RR-1        unbound    Avg:  29674  Sum:    29674
TCP_RR-1        stacked    Avg:  22267  Sum:    22267
TCP_RR-1        cross-core Avg:  30237  Sum:    30237

UDP_RR-1        unbound    Avg:  36189  Sum:    36189
UDP_RR-1        stacked    Avg:  27129  Sum:    27129
UDP_RR-1        cross-core Avg:  37033  Sum:    37033

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
  2023-09-11 15:43     ` Mathieu Desnoyers
@ 2023-09-12 11:53       ` Chen Yu
  2023-09-12 14:06         ` Mathieu Desnoyers
  0 siblings, 1 reply; 34+ messages in thread
From: Chen Yu @ 2023-09-12 11:53 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Juri Lelli,
	Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	K Prateek Nayak, Gautham R . Shenoy, linux-kernel

Hi Mathieu,

thanks for the review,

On 2023-09-11 at 11:43:27 -0400, Mathieu Desnoyers wrote:
> On 9/11/23 11:26, Mathieu Desnoyers wrote:
> > On 9/10/23 22:50, Chen Yu wrote:
> [...]
> > > ---
> > >   kernel/sched/fair.c     | 30 +++++++++++++++++++++++++++---
> > >   kernel/sched/features.h |  1 +
> > >   kernel/sched/sched.h    |  1 +
> > >   3 files changed, 29 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index e20f50726ab8..fe3b760c9654 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq,
> > > struct task_struct *p, int flags)
> > >       hrtick_update(rq);
> > >       now = sched_clock_cpu(cpu_of(rq));
> > >       p->se.prev_sleep_time = task_sleep ? now : 0;
> > > +#ifdef CONFIG_SMP
> > > +    /*
> > > +     * If this rq will become idle, and dequeued task is
> > > +     * a short sleeping one, check if we can reserve
> > > +     * this idle CPU for that task for a short while.
> > > +     * During this reservation period, other wakees will
> > > +     * skip this 'idle' CPU in select_idle_cpu(), and this
> > > +     * short sleeping task can pick its previous CPU in
> > > +     * select_idle_sibling(), which brings better cache
> > > +     * locality.
> > > +     */
> > > +    if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
> > > +        p->se.sleep_avg && p->se.sleep_avg <
> > > sysctl_sched_migration_cost)
> > > +        rq->cache_hot_timeout = now + p->se.sleep_avg;
> > 
> > This is really cool!
> > 
> > There is one scenario that worries me with this approach: workloads
> > that sleep for a long time and then have short blocked periods.
> > Those bursts will likely bring the average to values too high
> > to stay below sysctl_sched_migration_cost.
> > 
> > I wonder if changing the code above for the following would help ?
> > 
> > if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
> > p->se.sleep_avg)
> >      rq->cache_hot_timeout = now + min(sysctl_sched_migration_cost,
> > p->se.sleep_avg);
> > 
> > For tasks that have a large sleep_avg, it would activate this rq
> > "appear as not idle for rq selection" scheme for a window of
> > sysctl_sched_migration_cost. If the sleep ends up being a long one,
> > preventing other tasks from being migrated to this rq for a tiny
> > window should not matter performance-wise. I would expect that it
> > could help workloads that come in bursts.
> 
> There is perhaps a better way to handle bursts:
> 
> When calculating the sleep_avg, we actually only really care about
> the sleep time for short bursts, so we could use the sysctl_sched_migration_cost
> to select which of the sleeps should be taken into account in the avg.
> 
> We could rename the field "sleep_avg" to "burst_sleep_avg", and have:
> 
> u64 now = sched_clock_cpu(task_cpu(p));
> 
> if ((flags & ENQUEUE_WAKEUP) && last_sleep && cpu_online(task_cpu(p)) &&
>     now > last_sleep && now - last_sleep < sysctl_sched_migration_cost)
> 	update_avg(&p->se.burst_sleep_avg, now - last_sleep);
> 
> Then we can have this code in dequeue_task_fair:
> 
> if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && p->se.burst_sleep_avg)
> 	rq->cache_hot_timeout = now + p->se.burst_sleep_avg;
> 
> Thoughts ?
> 

This looks reasonable; it would be more fine-grained to only monitor the short sleep
durations of that task. Let me try this approach to see if there is any difference.

thanks,
Chenyu

> Thanks,
> 
> Mathieu
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
  2023-09-12  3:05       ` K Prateek Nayak
@ 2023-09-12 12:32         ` Chen Yu
  2023-09-12 14:26           ` K Prateek Nayak
  0 siblings, 1 reply; 34+ messages in thread
From: Chen Yu @ 2023-09-12 12:32 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Gautham R . Shenoy, linux-kernel, Peter Zijlstra,
	Mathieu Desnoyers, Ingo Molnar, Vincent Guittot, Juri Lelli

Hi Prateek,

On 2023-09-12 at 08:35:12 +0530, K Prateek Nayak wrote:
> Hello Chenyu,
> 
> On 9/11/2023 3:49 PM, Chen Yu wrote:
> > Hi Prateek,
> > 
> > thanks for your review,
> > 
> > On 2023-09-11 at 13:59:10 +0530, K Prateek Nayak wrote:
> >> Hello Chenyu,
> >>
> >> On 9/11/2023 8:20 AM, Chen Yu wrote:
> >>>  [..snip..]
> >>>  kernel/sched/fair.c     | 30 +++++++++++++++++++++++++++---
> >>>  kernel/sched/features.h |  1 +
> >>>  kernel/sched/sched.h    |  1 +
> >>>  3 files changed, 29 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>> index e20f50726ab8..fe3b760c9654 100644
> >>> --- a/kernel/sched/fair.c
> >>> +++ b/kernel/sched/fair.c
> >>> @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >>>  	hrtick_update(rq);
> >>>  	now = sched_clock_cpu(cpu_of(rq));
> >>>  	p->se.prev_sleep_time = task_sleep ? now : 0;
> >>> +#ifdef CONFIG_SMP
> >>> +	/*
> >>> +	 * If this rq will become idle, and dequeued task is
> >>> +	 * a short sleeping one, check if we can reserve
> >>> +	 * this idle CPU for that task for a short while.
> >>> +	 * During this reservation period, other wakees will
> >>> +	 * skip this 'idle' CPU in select_idle_cpu(), and this
> >>> +	 * short sleeping task can pick its previous CPU in
> >>> +	 * select_idle_sibling(), which brings better cache
> >>> +	 * locality.
> >>> +	 */
> >>> +	if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
> >>> +	    p->se.sleep_avg && p->se.sleep_avg < sysctl_sched_migration_cost)
> >>> +		rq->cache_hot_timeout = now + p->se.sleep_avg;
> >>> +#endif
> >>>  }
> >>>  
> >>>  #ifdef CONFIG_SMP
> >>> @@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
> >>>  static inline int __select_idle_cpu(int cpu, struct task_struct *p)
> >>>  {
> >>>  	if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
> >>> -	    sched_cpu_cookie_match(cpu_rq(cpu), p))
> >>> +	    sched_cpu_cookie_match(cpu_rq(cpu), p)) {
> >>> +		if (sched_feat(SIS_CACHE) &&
> >>> +		    sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout)
> >>> +			return -1;
> >>
> >> Just wondering,
> >>
> >> Similar to how select_idle_core() caches the "idle_cpu" if it ends up
> >> finding one in its search for an idle core, would returning a "cache-hot
> >> idle CPU" be better than returning previous CPU / current CPU if all
> >> idle CPUs found during the search in select_idle_cpu() are marked
> >> cache-hot?
> >>
> > 
> > This is a good point, we can optimize this further. Currently I only
> > send a simple version to demonstrate how we can leverage the task's
> > sleep time.
> > 
> >> Speaking of cache-hot idle CPU, is netperf actually more happy with
> >> piling on current CPU?
> > 
> > Yes. Per my previous test, netperf of TCP_RR/UDP_RR really likes to
> > put the waker and wakee together.
> > 
> >> I ask this because the logic seems to be
> >> reserving the previous CPU for a task that dislikes migration but I
> >> do not see anything in the wake_affine_idle() path that would make the
> >> short sleeper proactively choose the previous CPU when the wakeup is
> >> marked with the WF_SYNC flag. Let me know if I'm missing something?
> >>
> > 
> > If I understand correctly, WF_SYNC is to let the wakee to woken up
> > on the waker's CPU, rather than the wakee's previous CPU, because
> > the waker goes to sleep after wakeup. SIS_CACHE mainly cares about
> > wakee's previous CPU. We can only restrict that other wakee does not
> > occupy the previous CPU, but do not enhance the possibility that
> > wake_affine_idle() chooses the previous CPU.
> 
> Correct me if I'm wrong here,
> 
> Say a short sleeper, is always woken up using WF_SYNC flag. When the
> task is dequeued, we mark the previous  CPU where it ran as "cache-hot"
> and restrict any wakeup happening until the "cache_hot_timeout" is
> crossed. Let us assume a perfect world where the task wakes up before
> the "cache_hot_timeout" expires. Logically this CPU was reserved all
> this while for the short sleeper but since the wakeup bears WF_SYNC
> flag, the whole reservation is ignored and waker's LLC is explored.
> 

Ah, I see your point. Do you mean that, because the waker has WF_SYNC, wake_affine_idle()
forces the short sleeping wakee to be woken up on the waker's CPU rather than the
wakee's previous CPU, but the wakee's previous CPU has been marked as cache hot
for nothing?

> Should the timeout be cleared if the wakeup decides to not target the
> previous CPU? (The default "sysctl_sched_migration_cost" is probably
> small enough to curb any side effect that could possibly show here but
> if a genuine use-case warrants setting "sysctl_sched_migration_cost" to
> a larger value, the wakeup path might be affected where lot of idle
> > targets are overlooked since the CPUs are marked cache-hot for a longer
> duration)
> 
> Let me know what you think.
> 

This makes sense. In theory the above logic can be added to
select_idle_sibling(): if the target CPU is chosen rather than
the previous CPU, the previous CPU's cache-hot flag should be
cleared.
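
Just to illustrate the idea, a minimal sketch (not part of this series;
the helper name is made up, and it uses a plain store rather than
answering the locking question discussed below):

static inline void clear_cache_hot_reservation(int prev_cpu, int new_cpu)
{
	/* The wakee is going back to its reserved previous CPU, keep it. */
	if (!sched_feat(SIS_CACHE) || new_cpu == prev_cpu)
		return;

	/*
	 * The wakeup landed elsewhere, so let other wakees use the
	 * reserved CPU again. A real version may need the remote rq
	 * lock instead of a plain cross-CPU store; that cost is the
	 * concern below.
	 */
	WRITE_ONCE(cpu_rq(prev_cpu)->cache_hot_timeout, 0);
}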

But this might bring overhead, because we need to grab the rq
lock and write to another CPU's rq, which could be costly. It
seems to be a trade-off of the current implementation. On the other
hand, if the user sets sysctl_sched_migration_cost to a quite
large value:
1. Without SIS_CACHE, there is no task migration.
2. With SIS_CACHE enabled, all idle CPUs are cache hot and are skipped
   in select_idle_cpu(), so the wakee will be woken up locally.
It seems to be of the same effect, so there is not much impact
on wakeup behavior, I suppose.

thanks,
Chenyu


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
  2023-09-12 11:53       ` Chen Yu
@ 2023-09-12 14:06         ` Mathieu Desnoyers
  2023-09-12 14:14           ` Chen Yu
  0 siblings, 1 reply; 34+ messages in thread
From: Mathieu Desnoyers @ 2023-09-12 14:06 UTC (permalink / raw)
  To: Chen Yu
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Juri Lelli,
	Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	K Prateek Nayak, Gautham R . Shenoy, linux-kernel

On 9/12/23 07:53, Chen Yu wrote:
> Hi Mathieu,
> 
> thanks for the review,
> 
> On 2023-09-11 at 11:43:27 -0400, Mathieu Desnoyers wrote:
>> On 9/11/23 11:26, Mathieu Desnoyers wrote:
>>> On 9/10/23 22:50, Chen Yu wrote:
>> [...]
>>>> ---
>>>>    kernel/sched/fair.c     | 30 +++++++++++++++++++++++++++---
>>>>    kernel/sched/features.h |  1 +
>>>>    kernel/sched/sched.h    |  1 +
>>>>    3 files changed, 29 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index e20f50726ab8..fe3b760c9654 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq,
>>>> struct task_struct *p, int flags)
>>>>        hrtick_update(rq);
>>>>        now = sched_clock_cpu(cpu_of(rq));
>>>>        p->se.prev_sleep_time = task_sleep ? now : 0;
>>>> +#ifdef CONFIG_SMP
>>>> +    /*
>>>> +     * If this rq will become idle, and dequeued task is
>>>> +     * a short sleeping one, check if we can reserve
>>>> +     * this idle CPU for that task for a short while.
>>>> +     * During this reservation period, other wakees will
>>>> +     * skip this 'idle' CPU in select_idle_cpu(), and this
>>>> +     * short sleeping task can pick its previous CPU in
>>>> +     * select_idle_sibling(), which brings better cache
>>>> +     * locality.
>>>> +     */
>>>> +    if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
>>>> +        p->se.sleep_avg && p->se.sleep_avg <
>>>> sysctl_sched_migration_cost)
>>>> +        rq->cache_hot_timeout = now + p->se.sleep_avg;
>>>
>>> This is really cool!
>>>
>>> There is one scenario that worries me with this approach: workloads
>>> that sleep for a long time and then have short blocked periods.
>>> Those bursts will likely bring the average to values too high
>>> to stay below sysctl_sched_migration_cost.
>>>
>>> I wonder if changing the code above for the following would help ?
>>>
>>> if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
>>> p->se.sleep_avg)
>>>       rq->cache_hot_timeout = now + min(sysctl_sched_migration_cost,
>>> p->se.sleep_avg);
>>>
>>> For tasks that have a large sleep_avg, it would activate this rq
>>> "appear as not idle for rq selection" scheme for a window of
>>> sysctl_sched_migration_cost. If the sleep ends up being a long one,
>>> preventing other tasks from being migrated to this rq for a tiny
>>> window should not matter performance-wise. I would expect that it
>>> could help workloads that come in bursts.
>>
>> There is perhaps a better way to handle bursts:
>>
>> When calculating the sleep_avg, we actually only really care about
>> the sleep time for short bursts, so we could use the sysctl_sched_migration_cost
>> to select which of the sleeps should be taken into account in the avg.
>>
>> We could rename the field "sleep_avg" to "burst_sleep_avg", and have:
>>
>> u64 now = sched_clock_cpu(task_cpu(p));
>>
>> if ((flags & ENQUEUE_WAKEUP) && last_sleep && cpu_online(task_cpu(p)) &&
>>      now > last_sleep && now - last_sleep < sysctl_sched_migration_cost)
>> 	update_avg(&p->se.burst_sleep_avg, now - last_sleep);
>>
>> Then we can have this code in dequeue_task_fair:
>>
>> if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && p->se.burst_sleep_avg)
>> 	rq->cache_hot_timeout = now + p->se.burst_sleep_avg;
>>
>> Thoughts ?
>>
> 
> This looks reasonable, it would be fine grained to only monitor the short sleep duration
> of that task. Let me try this approach to see if there is any difference.
> 

One more tweak: given that more than one task can update the cache_hot_timeout forward
one after another, and given that some tasks have larger burst_sleep_avg values than
others, I suspect we want to keep the forward movement monotonic with something like:

if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && p->se.burst_sleep_avg &&
     rq->cache_hot_timeout < now + p->se.burst_sleep_avg)
	rq->cache_hot_timeout = now + p->se.burst_sleep_avg;

Thanks,

Mathieu


> thanks,
> Chenyu
> 
>> Thanks,
>>
>> Mathieu
>>
>> -- 
>> Mathieu Desnoyers
>> EfficiOS Inc.
>> https://www.efficios.com
>>

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
  2023-09-12 14:06         ` Mathieu Desnoyers
@ 2023-09-12 14:14           ` Chen Yu
  2023-09-12 15:18             ` Mathieu Desnoyers
  0 siblings, 1 reply; 34+ messages in thread
From: Chen Yu @ 2023-09-12 14:14 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Juri Lelli,
	Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	K Prateek Nayak, Gautham R . Shenoy, linux-kernel

On 2023-09-12 at 10:06:27 -0400, Mathieu Desnoyers wrote:
> On 9/12/23 07:53, Chen Yu wrote:
> > Hi Mathieu,
> > 
> > thanks for the review,
> > 
> > On 2023-09-11 at 11:43:27 -0400, Mathieu Desnoyers wrote:
> > > On 9/11/23 11:26, Mathieu Desnoyers wrote:
> > > > On 9/10/23 22:50, Chen Yu wrote:
> > > [...]
> > > > > ---
> > > > >    kernel/sched/fair.c     | 30 +++++++++++++++++++++++++++---
> > > > >    kernel/sched/features.h |  1 +
> > > > >    kernel/sched/sched.h    |  1 +
> > > > >    3 files changed, 29 insertions(+), 3 deletions(-)
> > > > > 
> > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > > index e20f50726ab8..fe3b760c9654 100644
> > > > > --- a/kernel/sched/fair.c
> > > > > +++ b/kernel/sched/fair.c
> > > > > @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq,
> > > > > struct task_struct *p, int flags)
> > > > >        hrtick_update(rq);
> > > > >        now = sched_clock_cpu(cpu_of(rq));
> > > > >        p->se.prev_sleep_time = task_sleep ? now : 0;
> > > > > +#ifdef CONFIG_SMP
> > > > > +    /*
> > > > > +     * If this rq will become idle, and dequeued task is
> > > > > +     * a short sleeping one, check if we can reserve
> > > > > +     * this idle CPU for that task for a short while.
> > > > > +     * During this reservation period, other wakees will
> > > > > +     * skip this 'idle' CPU in select_idle_cpu(), and this
> > > > > +     * short sleeping task can pick its previous CPU in
> > > > > +     * select_idle_sibling(), which brings better cache
> > > > > +     * locality.
> > > > > +     */
> > > > > +    if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
> > > > > +        p->se.sleep_avg && p->se.sleep_avg <
> > > > > sysctl_sched_migration_cost)
> > > > > +        rq->cache_hot_timeout = now + p->se.sleep_avg;
> > > > 
> > > > This is really cool!
> > > > 
> > > > There is one scenario that worries me with this approach: workloads
> > > > that sleep for a long time and then have short blocked periods.
> > > > Those bursts will likely bring the average to values too high
> > > > to stay below sysctl_sched_migration_cost.
> > > > 
> > > > I wonder if changing the code above for the following would help ?
> > > > 
> > > > if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
> > > > p->se.sleep_avg)
> > > >       rq->cache_hot_timeout = now + min(sysctl_sched_migration_cost,
> > > > p->se.sleep_avg);
> > > > 
> > > > For tasks that have a large sleep_avg, it would activate this rq
> > > > "appear as not idle for rq selection" scheme for a window of
> > > > sysctl_sched_migration_cost. If the sleep ends up being a long one,
> > > > preventing other tasks from being migrated to this rq for a tiny
> > > > window should not matter performance-wise. I would expect that it
> > > > could help workloads that come in bursts.
> > > 
> > > There is perhaps a better way to handle bursts:
> > > 
> > > When calculating the sleep_avg, we actually only really care about
> > > the sleep time for short bursts, so we could use the sysctl_sched_migration_cost
> > > to select which of the sleeps should be taken into account in the avg.
> > > 
> > > We could rename the field "sleep_avg" to "burst_sleep_avg", and have:
> > > 
> > > u64 now = sched_clock_cpu(task_cpu(p));
> > > 
> > > if ((flags & ENQUEUE_WAKEUP) && last_sleep && cpu_online(task_cpu(p)) &&
> > >      now > last_sleep && now - last_sleep < sysctl_sched_migration_cost)
> > > 	update_avg(&p->se.burst_sleep_avg, now - last_sleep);
> > > 
> > > Then we can have this code in dequeue_task_fair:
> > > 
> > > if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && p->se.burst_sleep_avg)
> > > 	rq->cache_hot_timeout = now + p->se.burst_sleep_avg;
> > > 
> > > Thoughts ?
> > > 
> > 
> > This looks reasonable, it would be fine grained to only monitor the short sleep duration
> > of that task. Let me try this approach to see if there is any difference.
> > 
> 
> One more tweak: given that more than one task can update the cache_hot_timeout forward
> one after another, and given that some tasks have larger burst_sleep_avg values than
> others, I suspect we want to keep the forward movement monotonic with something like:
> 
> if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && p->se.burst_sleep_avg &&
>     rq->cache_hot_timeout < now + p->se.burst_sleep_avg)
> 	rq->cache_hot_timeout = now + p->se.burst_sleep_avg;
>

Yeah, Aaron has mentioned this too:
https://lore.kernel.org/lkml/ZP7SYu+gxlc%2FYjHu@chenyu5-mobl2/
May I know the benefit of keeping the forward movement monotonic?
I thought we should only honor the latest dequeued task's burst_sleep_avg.
Because we don't know whether the old dequeued task's cache has been scribbled by the latest
dequeued task or not, does it still make sense to wake up the old dequeued task on its
previous CPU?

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
  2023-09-12 12:32         ` Chen Yu
@ 2023-09-12 14:26           ` K Prateek Nayak
  2023-09-13  2:57             ` Chen Yu
  0 siblings, 1 reply; 34+ messages in thread
From: K Prateek Nayak @ 2023-09-12 14:26 UTC (permalink / raw)
  To: Chen Yu
  Cc: Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Gautham R . Shenoy, linux-kernel, Peter Zijlstra,
	Mathieu Desnoyers, Ingo Molnar, Vincent Guittot, Juri Lelli

Hello Chenyu,

On 9/12/2023 6:02 PM, Chen Yu wrote:
> [..snip..]
>
>>> If I understand correctly, WF_SYNC is to let the wakee to woken up
>>> on the waker's CPU, rather than the wakee's previous CPU, because
>>> the waker goes to sleep after wakeup. SIS_CACHE mainly cares about
>>> wakee's previous CPU. We can only restrict that other wakee does not
>>> occupy the previous CPU, but do not enhance the possibility that
>>> wake_affine_idle() chooses the previous CPU.
>>
>> Correct me if I'm wrong here,
>>
>> Say a short sleeper, is always woken up using WF_SYNC flag. When the
>> task is dequeued, we mark the previous  CPU where it ran as "cache-hot"
>> and restrict any wakeup happening until the "cache_hot_timeout" is
>> crossed. Let us assume a perfect world where the task wakes up before
>> the "cache_hot_timeout" expires. Logically this CPU was reserved all
>> this while for the short sleeper but since the wakeup bears WF_SYNC
>> flag, the whole reservation is ignored and waker's LLC is explored.
>>
> 
> Ah, I see your point. Do you mean, because the waker has a WF_SYNC, wake_affine_idle()
> forces the short sleeping wakee to be woken up on waker's CPU rather the
> wakee's previous CPU, but wakee's previous has been marked as cache hot
> for nothing?

Precisely :)

> 
>> Should the timeout be cleared if the wakeup decides to not target the
>> previous CPU? (The default "sysctl_sched_migration_cost" is probably
>> small enough to curb any side effect that could possibly show here but
>> if a genuine use-case warrants setting "sysctl_sched_migration_cost" to
>> a larger value, the wakeup path might be affected where lot of idle
>> targets are overlooked since the CPUs are marked cache-hot for a longer
>> duration)
>>
>> Let me know what you think.
>>
> 
> This makes sense. In theory the above logic can be added in
> select_idle_sibling(), if target CPU is chosen rather than
> the previous CPU, the previous CPU's cache hot flag should be
> cleared.
> 
> But this might bring overhead. Because we need to grab the rq
> lock and write to other CPU's rq, which could be costly. It
> seems to be a trade-off of current implementation.

I agree, it will not be pretty. Maybe the other way is to have a
history of the type of wakeup the task experiences (similar to
wakee_flips but for sync and non-sync wakeups) and only reserve
the CPU if the task wakes up more via non-sync wakeups? Thinking
out loud here.
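
Something roughly like the below is what I'm imagining (only a sketch;
the two counters would be new, hypothetical fields in sched_entity, and
the aging policy is left open, it could decay the same way wakee_flips
does):

/* Hypothetical per-task counters, aged periodically like wakee_flips. */
static void record_wakeup_type(struct task_struct *p, int sync)
{
	if (sync)
		p->se.nr_sync_wakeups++;
	else
		p->se.nr_nonsync_wakeups++;
}

/*
 * Reserve the previous CPU at dequeue time only if the task has
 * historically been woken more often without WF_SYNC, i.e. when the
 * reservation actually has a chance of being used.
 */
static inline bool task_worth_reserving(struct task_struct *p)
{
	return p->se.nr_nonsync_wakeups > p->se.nr_sync_wakeups;
}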

> On the other
> hand, if the user sets the sysctl_sched_migration_cost to a quite
> large value:
> 1. Without SIS_CACHE, there is no task migration.

But that is in the load balancing path. I think the wakeup path will
still migrate the task. But I believe there might be very few cases
where all CPUs are marked cache-hot and the SIS_UTIL will not bail
out straight away as a result of high utilization. Probably a rare
scenario.

> 2. With SIS_CACHE enabled, all idle CPUs are cache hot and be skipped
>    in select_idle_cpu(), the wakee will be woken up locally.
> It seems to be of the same effect, so there is no much impact
> to wakeup behavior I suppose.
> 
> [..snip..]
> 

--
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
  2023-09-12  9:39       ` Mike Galbraith
@ 2023-09-12 14:51         ` Chen Yu
  0 siblings, 0 replies; 34+ messages in thread
From: Chen Yu @ 2023-09-12 14:51 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: K Prateek Nayak, Tim Chen, Aaron Lu, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider,
	Gautham R . Shenoy, linux-kernel, Peter Zijlstra,
	Mathieu Desnoyers, Ingo Molnar, Vincent Guittot, Juri Lelli

Hi Mike,

thanks for taking a look,

On 2023-09-12 at 11:39:55 +0200, Mike Galbraith wrote:
> On Mon, 2023-09-11 at 18:19 +0800, Chen Yu wrote:
> >
> > > Speaking of cache-hot idle CPU, is netperf actually more happy with
> > > piling on current CPU?
> >
> > Yes. Per my previous test, netperf of TCP_RR/UDP_RR really likes to
> > put the waker and wakee together.
> 
> Hm, seems there's at least one shared L2 case where that's untrue by
> more than a tiny margin, which surprised me rather a lot.
> 

Yes, task stacking is in theory against the work conservation of the
scheduler, and the benefit depends on how much resource (L1/L2 cache, DSB)
locality there is; it is workload and hardware specific.

> For grins, I tested netperf on my dinky rpi4b, and while its RR numbers
> seem kinda odd, they're also seemingly repeatable (ergo showing them).
> I measured a very modest cross-core win on a shared L2 Intel CPU some
> years ago (when Q6600 was shiny/new) but nothing close to these deltas.
> 

This is interesting. I have a Jacobsville which also has shared L2; I'll
run some tests to check the difference between stacking and spreading tasks
on that platform. But I guess that is another topic, because the current patch
avoids stacking tasks.

thanks,
Chenyu

> Makes me wonder what (a tad beefier) Bulldog RR numbers look like.
> 
> root@rpi4:~# ONLY=TCP_RR netperf.sh
> TCP_RR-1        unbound    Avg:  29611  Sum:    29611
> TCP_RR-1        stacked    Avg:  22540  Sum:    22540
> TCP_RR-1        cross-core Avg:  30181  Sum:    30181
> 
> root@rpi4:~# netperf.sh
> TCP_SENDFILE-1  unbound    Avg:  15572  Sum:    15572
> TCP_SENDFILE-1  stacked    Avg:  11533  Sum:    11533
> TCP_SENDFILE-1  cross-core Avg:  15751  Sum:    15751
> 
> TCP_STREAM-1    unbound    Avg:   6331  Sum:     6331
> TCP_STREAM-1    stacked    Avg:   6031  Sum:     6031
> TCP_STREAM-1    cross-core Avg:   6211  Sum:     6211
> 
> TCP_MAERTS-1    unbound    Avg:   6306  Sum:     6306
> TCP_MAERTS-1    stacked    Avg:   6094  Sum:     6094
> TCP_MAERTS-1    cross-core Avg:   9393  Sum:     9393
> 
> UDP_STREAM-1    unbound    Avg:  22277  Sum:    22277
> UDP_STREAM-1    stacked    Avg:  18844  Sum:    18844
> UDP_STREAM-1    cross-core Avg:  24749  Sum:    24749
> 
> TCP_RR-1        unbound    Avg:  29674  Sum:    29674
> TCP_RR-1        stacked    Avg:  22267  Sum:    22267
> TCP_RR-1        cross-core Avg:  30237  Sum:    30237
> 
> UDP_RR-1        unbound    Avg:  36189  Sum:    36189
> UDP_RR-1        stacked    Avg:  27129  Sum:    27129
> UDP_RR-1        cross-core Avg:  37033  Sum:    37033

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
  2023-09-12 14:14           ` Chen Yu
@ 2023-09-12 15:18             ` Mathieu Desnoyers
  2023-09-13  3:02               ` Chen Yu
  0 siblings, 1 reply; 34+ messages in thread
From: Mathieu Desnoyers @ 2023-09-12 15:18 UTC (permalink / raw)
  To: Chen Yu
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Juri Lelli,
	Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	K Prateek Nayak, Gautham R . Shenoy, linux-kernel

On 9/12/23 10:14, Chen Yu wrote:
> On 2023-09-12 at 10:06:27 -0400, Mathieu Desnoyers wrote:
[...]
>>
>> One more tweak: given that more than one task can update the cache_hot_timeout forward
>> one after another, and given that some tasks have larger burst_sleep_avg values than
>> others, I suspect we want to keep the forward movement monotonic with something like:
>>
>> if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && p->se.burst_sleep_avg &&
>>      rq->cache_hot_timeout < now + p->se.burst_sleep_avg)
>> 	rq->cache_hot_timeout = now + p->se.burst_sleep_avg;
>>
> 
> Yeah, Aaron has mentioned this too:
> https://lore.kernel.org/lkml/ZP7SYu+gxlc%2FYjHu@chenyu5-mobl2/
> May I know the benefit of keeping forward movement monotonic?
> I thought that, should we only honor the latest dequeued task's burst_sleep_avg?
> Because we don't know whether the old deuqued task's cache has been scribbled by the latest
> dequeued task or not, does it still make sense to wake up the old dequeued task on its
> previous CPU?

Here is my reasoning:

If a second task is scheduled after the first dequeued task (a
task with large burst_sleep_avg) is dequeued, that second task (with
small burst_sleep_avg) would need to entirely scribble the other task's
cache lines within the time given by sysctl_sched_migration_cost, which
I suspect is typically not very large. So I doubt that the second task
can entirely kick out the first task cache lines within that time frame,
and therefore that second task should not move the cache_hot_timeout
value backwards.

But perhaps I'm missing something ?

Thanks,

Mathieu


-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
  2023-09-12 14:26           ` K Prateek Nayak
@ 2023-09-13  2:57             ` Chen Yu
  2023-09-14  4:13               ` K Prateek Nayak
  0 siblings, 1 reply; 34+ messages in thread
From: Chen Yu @ 2023-09-13  2:57 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Gautham R . Shenoy, linux-kernel, Peter Zijlstra,
	Mathieu Desnoyers, Ingo Molnar, Vincent Guittot, Juri Lelli

On 2023-09-12 at 19:56:37 +0530, K Prateek Nayak wrote:
> Hello Chenyu,
> 
> On 9/12/2023 6:02 PM, Chen Yu wrote:
> > [..snip..]
> >
> >>> If I understand correctly, WF_SYNC is to let the wakee to woken up
> >>> on the waker's CPU, rather than the wakee's previous CPU, because
> >>> the waker goes to sleep after wakeup. SIS_CACHE mainly cares about
> >>> wakee's previous CPU. We can only restrict that other wakee does not
> >>> occupy the previous CPU, but do not enhance the possibility that
> >>> wake_affine_idle() chooses the previous CPU.
> >>
> >> Correct me if I'm wrong here,
> >>
> >> Say a short sleeper, is always woken up using WF_SYNC flag. When the
> >> task is dequeued, we mark the previous  CPU where it ran as "cache-hot"
> >> and restrict any wakeup happening until the "cache_hot_timeout" is
> >> crossed. Let us assume a perfect world where the task wakes up before
> >> the "cache_hot_timeout" expires. Logically this CPU was reserved all
> >> this while for the short sleeper but since the wakeup bears WF_SYNC
> >> flag, the whole reservation is ignored and waker's LLC is explored.
> >>
> > 
> > Ah, I see your point. Do you mean, because the waker has a WF_SYNC, wake_affine_idle()
> > forces the short sleeping wakee to be woken up on waker's CPU rather the
> > wakee's previous CPU, but wakee's previous has been marked as cache hot
> > for nothing?
> 
> Precisely :)
> 
> > 
> >> Should the timeout be cleared if the wakeup decides to not target the
> >> previous CPU? (The default "sysctl_sched_migration_cost" is probably
> >> small enough to curb any side effect that could possibly show here but
> >> if a genuine use-case warrants setting "sysctl_sched_migration_cost" to
> >> a larger value, the wakeup path might be affected where lot of idle
> >> targets are overlooked since the CPUs are marked cache-hot for a longer
> >> duration)
> >>
> >> Let me know what you think.
> >>
> > 
> > This makes sense. In theory the above logic can be added in
> > select_idle_sibling(), if target CPU is chosen rather than
> > the previous CPU, the previous CPU's cache hot flag should be
> > cleared.
> > 
> > But this might bring overhead. Because we need to grab the rq
> > lock and write to other CPU's rq, which could be costly. It
> > seems to be a trade-off of current implementation.
> 
> I agree, it will not be pretty. Maybe the other way is to have a
> history of the type of wakeup the task experiences (similar to
> wakee_flips but for sync and non-sync wakeups) and only reserve
> the CPU if the task wakes up more via non-sync wakeups? Thinking
> out loud here.
>

Taking the task's attributes into account looks good. Or maybe something
like this:

new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
if (new_cpu != prev_cpu)
	p->burst_sleep_avg >>= 1;
So the duration of the reservation could be shrunk.

> > On the other
> > hand, if the user sets the sysctl_sched_migration_cost to a quite
> > large value:
> > 1. Without SIS_CACHE, there is no task migration.
> 
> But that is in the load balancing path. I think the wakeup path will
> still migrate the task.

OK, I see.

> But I believe there might be very few cases
> where all CPUs are marked cache-hot and the SIS_UTIL will not bail
> out straight away as a result of high utilization. Probably a rare
> scenario.
>

Agree.

thanks,
Chenyu 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
  2023-09-12 15:18             ` Mathieu Desnoyers
@ 2023-09-13  3:02               ` Chen Yu
  0 siblings, 0 replies; 34+ messages in thread
From: Chen Yu @ 2023-09-13  3:02 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Juri Lelli,
	Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	K Prateek Nayak, Gautham R . Shenoy, linux-kernel

On 2023-09-12 at 11:18:33 -0400, Mathieu Desnoyers wrote:
> On 9/12/23 10:14, Chen Yu wrote:
> > On 2023-09-12 at 10:06:27 -0400, Mathieu Desnoyers wrote:
> [...]
> > > 
> > > One more tweak: given that more than one task can update the cache_hot_timeout forward
> > > one after another, and given that some tasks have larger burst_sleep_avg values than
> > > others, I suspect we want to keep the forward movement monotonic with something like:
> > > 
> > > if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && p->se.burst_sleep_avg &&
> > >      rq->cache_hot_timeout < now + p->se.burst_sleep_avg)
> > > 	rq->cache_hot_timeout = now + p->se.burst_sleep_avg;
> > > 
> > 
> > Yeah, Aaron has mentioned this too:
> > https://lore.kernel.org/lkml/ZP7SYu+gxlc%2FYjHu@chenyu5-mobl2/
> > May I know the benefit of keeping forward movement monotonic?
> > I thought that, should we only honor the latest dequeued task's burst_sleep_avg?
> > Because we don't know whether the old dequeued task's cache has been scribbled by the latest
> > dequeued task or not, does it still make sense to wake up the old dequeued task on its
> > previous CPU?
> 
> Here is my reasoning:
> 
> If a second task is scheduled after the first dequeued task (a
> task with large burst_sleep_avg) is dequeued, that second task (with
> small burst_sleep_avg) would need to entirely scribble the other task's
> cache lines within the time given by sysctl_sched_migration_cost, which
> I suspect is typically not very large. So I doubt that the second task
> can entirely kick out the first task cache lines within that time frame,
> and therefore that second task should not move the cache_hot_timeout
> value backwards.
> 
> But perhaps I'm missing something ?
>

You are right, the reservation time itself is not very long, unless someone
enlarges sysctl_sched_migration_cost in user space. Keeping the reservation
time moving forward could help the first dequeued task be placed on its previous
CPU more easily (if the second dequeued task has not been woken up yet). I'll modify
it according to your suggestion and see the results.

thanks,
Chenyu
 
> Thanks,
> 
> Mathieu
> 
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> https://www.efficios.com
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
  2023-09-11  8:40     ` Chen Yu
@ 2023-09-13  6:22       ` Gautham R. Shenoy
  2023-09-13  7:25         ` Chen Yu
  0 siblings, 1 reply; 34+ messages in thread
From: Gautham R. Shenoy @ 2023-09-13  6:22 UTC (permalink / raw)
  To: Chen Yu
  Cc: Aaron Lu, Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar,
	Vincent Guittot, Juri Lelli, Tim Chen, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, K Prateek Nayak,
	linux-kernel

On Mon, Sep 11, 2023 at 04:40:02PM +0800, Chen Yu wrote:
> Hi Aaron,
> 
> thanks for the review,
> 
> On 2023-09-11 at 15:26:29 +0800, Aaron Lu wrote:

[..snip..]

> > > @@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
> > >  static inline int __select_idle_cpu(int cpu, struct task_struct *p)
> > >  {
> > >  	if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
> > > -	    sched_cpu_cookie_match(cpu_rq(cpu), p))
> > > +	    sched_cpu_cookie_match(cpu_rq(cpu), p)) {
> > > +		if (sched_feat(SIS_CACHE) &&
> > > +		    sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout)
> > > +			return -1;
> > > +
> > 
> > Maybe introduce a new function that also considers rq->cache_hot_timeout,
> > like available_idle_cpu_migrate() so that above and below logic can be
> > simplified a bit?
> > 
> 
> Yes, that would be simpler, I'll do in next version.
> 
> > I was thinking to simply add that rq->cache_hot_timeout check to
> > available_idle_cpu() but then a long sleeping task could be forced to
> > migrate if its prev_cpu happens to just deschedule a task that sets rq's
> > cache_hot_timeout. I guess that's why you chose to only change the idle
> > semantic in select_idle_cpu() but not in select_idle_sibling()?
> >
> 
> Yes, sort of. And the reason I did not put this cache hot check in available_idle_cpu()
> or idle_cpu() was mainly because these APIs are generic and could be invoked by select_idle_sibling().
> > If the task falls asleep and is woken up quickly, its previous idle CPU will also be skipped,
> thus no one could use this CPU within the cache hot period, including the cache-hot task
> itself.

This happens even with this patch right? It is possible for a task p1
whose avg sleep time is "t" to go to sleep which causes its CPU to go
idle. When it wakes up after a time t' < t, the logic above skips the
idle CPU because it is still cache-hot, despite the fact that it is
cache hot for p1!

Have you considered recording p1's identity in the
rq->cache_hot_sleeper so that in select_task_rq_fair(), we can simply
return the previous CPU if it is cache hot and the wakee is
rq->cache_hot_sleeper, thus avoiding the whole select_idle_sibling()
scan?
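
Roughly something like this minimal sketch (the helper name is made up
here, rq->cache_hot_sleeper would be the new field, and clearing and
lifetime of that pointer are glossed over):

/* At dequeue time, next to where rq->cache_hot_timeout is set: */
	rq->cache_hot_sleeper = p;

/* Early in the wakeup path, before the select_idle_sibling() scan: */
static bool can_use_reserved_prev_cpu(struct task_struct *p, int prev_cpu)
{
	struct rq *rq = cpu_rq(prev_cpu);

	/* Take the previous CPU only if it is still idle and was
	 * reserved for this very task, and the reservation has not
	 * timed out yet. */
	return sched_feat(SIS_CACHE) &&
	       available_idle_cpu(prev_cpu) &&
	       READ_ONCE(rq->cache_hot_sleeper) == p &&
	       sched_clock_cpu(prev_cpu) < READ_ONCE(rq->cache_hot_timeout);
}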

> 
> thanks,
> Chenyu

--
Thanks and Regards
gautham.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
  2023-09-13  6:22       ` Gautham R. Shenoy
@ 2023-09-13  7:25         ` Chen Yu
  2023-09-14  7:06           ` Gautham R. Shenoy
  0 siblings, 1 reply; 34+ messages in thread
From: Chen Yu @ 2023-09-13  7:25 UTC (permalink / raw)
  To: Gautham R. Shenoy
  Cc: Aaron Lu, Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar,
	Vincent Guittot, Juri Lelli, Tim Chen, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, K Prateek Nayak,
	linux-kernel

Hi Gautham,

thanks for the review,

On 2023-09-13 at 11:52:14 +0530, Gautham R. Shenoy wrote:
> On Mon, Sep 11, 2023 at 04:40:02PM +0800, Chen Yu wrote:
> > Hi Aaron,
> > 
> > thanks for the review,
> > 
> > On 2023-09-11 at 15:26:29 +0800, Aaron Lu wrote:
> 
> [..snip..]
> 
> > > > @@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
> > > >  static inline int __select_idle_cpu(int cpu, struct task_struct *p)
> > > >  {
> > > >  	if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
> > > > -	    sched_cpu_cookie_match(cpu_rq(cpu), p))
> > > > +	    sched_cpu_cookie_match(cpu_rq(cpu), p)) {
> > > > +		if (sched_feat(SIS_CACHE) &&
> > > > +		    sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout)
> > > > +			return -1;
> > > > +
> > > 
> > > Maybe introduce a new function that also considers rq->cache_hot_timeout,
> > > like available_idle_cpu_migrate() so that above and below logic can be
> > > simplified a bit?
> > > 
> > 
> > Yes, that would be simpler, I'll do in next version.
> > 
> > > I was thinking to simply add that rq->cache_hot_timeout check to
> > > available_idle_cpu() but then a long sleeping task could be forced to
> > > migrate if its prev_cpu happens to just deschedule a task that sets rq's
> > > cache_hot_timeout. I guess that's why you chose to only change the idle
> > > semantic in select_idle_cpu() but not in select_idle_sibling()?
> > >
> > 
> > Yes, sort of. And the reason I did not put this cache hot check in available_idle_cpu()
> > or idle_cpu() was mainly because these APIs are generic and could be invoked by select_idle_sibling().
> > If the task falls asleep and is woken up quickly, its previous idle CPU will also be skipped,
> > thus no one could use this CPU within the cache hot period, including the cache-hot task
> > itself.
> 
> This happens even with this patch right? It is possible for a task p1
> whose avg sleep time is "t" to go to sleep which causes its CPU to go
> idle. When it wakes up after a time t' < t, the logic above skips the
> idle CPU because it is still cache-hot, despite the fact that it is
> cache hot for p1!
> 
Not sure if I understand correctly: in select_idle_sibling(), p1's previous
CPU will be checked first, and that check does not involve cache-hot. So if
p1's previous CPU is idle, it will be picked, no?

        if (prev != target && cpus_share_cache(prev, target) &&
            (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
            asym_fits_cpu(task_util, util_min, util_max, prev))
                return prev;

Or do you mean, in select_idle_cpu(), we will re-check p1's previous
CPU but it is skipped due to cache-hot?

> Have you considered recording p1's identity in the
> rq->cache_hot_sleeper so that in select_task_rq_fair(), we can simply
> return the previous CPU if it is cache hot and the wakee is
> rq->cache_hot_sleeper, thus avoiding the whole select_idle_sibling
> scan.
> 

Yes, this seems doable. One problem would be: if more than two dequeued
tasks prefer the same (previous) CPU, which task should be the
rq->cache_hot_sleeper? And per Mathieu's feedback[1], we want to deal
with multiple dequeued tasks. If we record all of them, it might be
costly.

[1] https://lore.kernel.org/lkml/2a47ae82-b8cd-95db-9f48-82b3df0730f3@efficios.com/

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
  2023-09-13  2:57             ` Chen Yu
@ 2023-09-14  4:13               ` K Prateek Nayak
  2023-09-14 11:01                 ` Chen Yu
  0 siblings, 1 reply; 34+ messages in thread
From: K Prateek Nayak @ 2023-09-14  4:13 UTC (permalink / raw)
  To: Chen Yu
  Cc: Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Gautham R . Shenoy, linux-kernel, Peter Zijlstra,
	Mathieu Desnoyers, Ingo Molnar, Vincent Guittot, Juri Lelli

Hello Chenyu,

On 9/13/2023 8:27 AM, Chen Yu wrote:
> On 2023-09-12 at 19:56:37 +0530, K Prateek Nayak wrote:
>> Hello Chenyu,
>>
>> On 9/12/2023 6:02 PM, Chen Yu wrote:
>>> [..snip..]
>>>
>>>>> If I understand correctly, WF_SYNC is to let the wakee to woken up
>>>>> on the waker's CPU, rather than the wakee's previous CPU, because
>>>>> the waker goes to sleep after wakeup. SIS_CACHE mainly cares about
>>>>> wakee's previous CPU. We can only restrict that other wakee does not
>>>>> occupy the previous CPU, but do not enhance the possibility that
>>>>> wake_affine_idle() chooses the previous CPU.
>>>>
>>>> Correct me if I'm wrong here,
>>>>
>>>> Say a short sleeper, is always woken up using WF_SYNC flag. When the
>>>> task is dequeued, we mark the previous  CPU where it ran as "cache-hot"
>>>> and restrict any wakeup happening until the "cache_hot_timeout" is
>>>> crossed. Let us assume a perfect world where the task wakes up before
>>>> the "cache_hot_timeout" expires. Logically this CPU was reserved all
>>>> this while for the short sleeper but since the wakeup bears WF_SYNC
>>>> flag, the whole reservation is ignored and waker's LLC is explored.
>>>>
>>>
>>> Ah, I see your point. Do you mean, because the waker has a WF_SYNC, wake_affine_idle()
>>> forces the short sleeping wakee to be woken up on waker's CPU rather the
>>> wakee's previous CPU, but wakee's previous has been marked as cache hot
>>> for nothing?
>>
>> Precisely :)
>>
>>>
>>>> Should the timeout be cleared if the wakeup decides to not target the
>>>> previous CPU? (The default "sysctl_sched_migration_cost" is probably
>>>> small enough to curb any side effect that could possibly show here but
>>>> if a genuine use-case warrants setting "sysctl_sched_migration_cost" to
>>>> a larger value, the wakeup path might be affected where lot of idle
>>>> targets are overlooked since the CPUs are marked cache-hot for a longer
>>>> duration)
>>>>
>>>> Let me know what you think.
>>>>
>>>
>>> This makes sense. In theory the above logic can be added in
>>> select_idle_sibling(), if target CPU is chosen rather than
>>> the previous CPU, the previous CPU's cache hot flag should be
>>> cleared.
>>>
>>> But this might bring overhead. Because we need to grab the rq
>>> lock and write to other CPU's rq, which could be costly. It
>>> seems to be a trade-off of current implementation.
>>
>> I agree, it will not be pretty. Maybe the other way is to have a
>> history of the type of wakeup the task experiences (similar to
>> wakee_flips but for sync and non-sync wakeups) and only reserve
>> the CPU if the task wakes up more via non-sync wakeups? Thinking
>> out loud here.
>>
> 
> This looks good to consider the task's attribute, or maybe something
> like this:
> 
> new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> if (new_cpu != prev_cpu)
> 	p->burst_sleep_avg >>= 1;
> So the duration of the reservation could be shrunk.

That seems like a good approach.

Meanwhile, here are the results for the current series without any
modifications:

tl;dr

- There seems to be a noticeable increase in hackbench runtime with a
  single group but big gains beyond that. The regression could possibly
  be because of added searching but let me do some digging to confirm
  that. 

- Small regressions (~2%) noticed in ycsb-mongodb (medium utilization)
  and DeathStarBench (High Utilization)

- Other benchmarks are more or less perf neutral with the changes.

More information below:

o System information

  - Dual socket 3rd Generation EPYC System (2 x 64C/128T)
  - NPS1 mode (each socket is a NUMA node)
  - Boost Enabled
  - C2 disabled (MWAIT based C1 is still enabled)


o Kernel information

base		:   tip:sched/core at commit b41bbb33cf75 ("Merge branch
		    'sched/eevdf' into sched/core")
		  + cherry-pick commit 63304558ba5d ("sched/eevdf: Curb
		    wakeup-preemption")

SIS_CACHE	:   base
		  + this series as is


o Benchmark results

==================================================================
Test          : hackbench
Units         : Normalized time in seconds
Interpretation: Lower is better
Statistic     : AMean
==================================================================
Case:          base[pct imp](CV)     SIS_CACHE[pct imp](CV)
 1-groups     1.00 [ -0.00]( 1.89)     1.10 [-10.28]( 2.03)
 2-groups     1.00 [ -0.00]( 2.04)     0.98 [  1.57]( 2.04)
 4-groups     1.00 [ -0.00]( 2.38)     0.95 [  4.70]( 0.88)
 8-groups     1.00 [ -0.00]( 1.52)     0.93 [  7.18]( 0.76)
16-groups     1.00 [ -0.00]( 3.44)     0.90 [  9.76]( 1.04)


==================================================================
Test          : tbench
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:          base[pct imp](CV)     SIS_CACHE[pct imp](CV)
    1     1.00 [  0.00]( 0.18)     0.98 [ -1.61]( 0.27)
    2     1.00 [  0.00]( 0.63)     0.98 [ -1.58]( 0.09)
    4     1.00 [  0.00]( 0.86)     0.99 [ -0.52]( 0.42)
    8     1.00 [  0.00]( 0.22)     0.98 [ -1.77]( 0.65)
   16     1.00 [  0.00]( 1.99)     1.00 [ -0.10]( 1.55)
   32     1.00 [  0.00]( 4.29)     0.98 [ -1.73]( 1.55)
   64     1.00 [  0.00]( 1.71)     0.97 [ -2.77]( 3.74)
  128     1.00 [  0.00]( 0.65)     1.00 [ -0.14]( 0.88)
  256     1.00 [  0.00]( 0.19)     0.97 [ -2.65]( 0.49)
  512     1.00 [  0.00]( 0.20)     0.99 [ -1.10]( 0.33)
 1024     1.00 [  0.00]( 0.29)     0.99 [ -0.70]( 0.16)


==================================================================
Test          : stream-10
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:          base[pct imp](CV)     SIS_CACHE[pct imp](CV)
 Copy     1.00 [  0.00]( 4.32)     0.90 [ -9.82](10.72)
Scale     1.00 [  0.00]( 5.21)     1.01 [  0.59]( 1.83)
  Add     1.00 [  0.00]( 6.25)     0.99 [ -0.91]( 4.49)
Triad     1.00 [  0.00](10.74)     1.02 [  2.28]( 6.07)


==================================================================
Test          : stream-100
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:          base[pct imp](CV)     SIS_CACHE[pct imp](CV)
 Copy     1.00 [  0.00]( 0.70)     0.98 [ -1.79]( 2.26)
Scale     1.00 [  0.00]( 6.55)     1.03 [  2.80]( 0.74)
  Add     1.00 [  0.00]( 6.53)     1.02 [  2.05]( 1.82)
Triad     1.00 [  0.00]( 6.66)     1.04 [  3.54]( 1.04)


==================================================================
Test          : netperf
Units         : Normalized Througput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:          base[pct imp](CV)     SIS_CACHE[pct imp](CV)
 1-clients     1.00 [  0.00]( 0.46)     0.99 [ -0.55]( 0.49)
 2-clients     1.00 [  0.00]( 0.38)     0.99 [ -1.23]( 1.19)
 4-clients     1.00 [  0.00]( 0.72)     0.98 [ -1.91]( 1.21)
 8-clients     1.00 [  0.00]( 0.98)     0.98 [ -1.61]( 1.08)
16-clients     1.00 [  0.00]( 0.70)     0.98 [ -1.80]( 1.04)
32-clients     1.00 [  0.00]( 0.74)     0.98 [ -1.55]( 1.20)
64-clients     1.00 [  0.00]( 2.24)     1.00 [ -0.04]( 2.77)
128-clients    1.00 [  0.00]( 1.72)     1.03 [  3.22]( 1.99)
256-clients    1.00 [  0.00]( 4.44)     0.99 [ -1.33]( 4.71)
512-clients    1.00 [  0.00](52.42)     0.98 [ -1.61](52.72)


==================================================================
Test          : schbench (old)
Units         : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic     : Median
==================================================================
#workers:          base[pct imp](CV)     SIS_CACHE[pct imp](CV)
  1     1.00 [ -0.00]( 2.28)     0.96 [  4.00](15.68)
  2     1.00 [ -0.00]( 6.42)     1.00 [ -0.00](10.96)
  4     1.00 [ -0.00]( 3.77)     0.97 [  3.33]( 7.61)
  8     1.00 [ -0.00](13.83)     1.08 [ -7.89]( 2.86)
 16     1.00 [ -0.00]( 4.37)     1.00 [ -0.00]( 2.13)
 32     1.00 [ -0.00]( 8.69)     0.95 [  4.94]( 2.73)
 64     1.00 [ -0.00]( 2.30)     1.05 [ -5.13]( 1.26)
128     1.00 [ -0.00](12.12)     1.03 [ -3.41]( 5.08)
256     1.00 [ -0.00](26.04)     0.91 [  8.88]( 2.59)
512     1.00 [ -0.00]( 5.62)     0.97 [  3.32]( 0.37)


==================================================================
Test          : Unixbench
Units         : Various, Throughput
Interpretation: Higher is better
Statistic     : AMean, Hmean (Specified)
==================================================================
Metric		variant                      base		     SIS_CACHE
Hmean     unixbench-dhry2reg-1            41248390.97 (   0.00%)    41485503.82 (   0.57%)
Hmean     unixbench-dhry2reg-512        6239969914.15 (   0.00%)  6233919689.40 (  -0.10%)
Amean     unixbench-syscall-1              2968518.27 (   0.00%)     2841236.43 *   4.29%*
Amean     unixbench-syscall-512            7790656.20 (   0.00%)     7631558.00 *   2.04%*
Hmean     unixbench-pipe-1                 2535689.01 (   0.00%)     2598208.16 *   2.47%*
Hmean     unixbench-pipe-512             361385055.25 (   0.00%)   368566373.76 *   1.99%*
Hmean     unixbench-spawn-1                   4506.26 (   0.00%)        4551.67 (   1.01%)
Hmean     unixbench-spawn-512                69380.09 (   0.00%)       69264.30 (  -0.17%)
Hmean     unixbench-execl-1                   3824.57 (   0.00%)        3822.67 (  -0.05%)
Hmean     unixbench-execl-512                12288.64 (   0.00%)       11728.12 (  -4.56%)


==================================================================
Test          : ycsb-mongodb
Units         : Throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
base            : 309589.33 (var: 1.41%) 
SIS_CACHE       : 304931.33 (var: 1.29%) [diff: -1.50%]


==================================================================
Test          : DeathStarBench
Units         : Normalized Throughput, relative to base
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Pinning         base     SIS_CACHE
1 CCD           100%      99.18% [%diff: -0.82%]
2 CCD           100%      97.46% [%diff: -2.54%]
4 CCD           100%      97.22% [%diff: -2.78%]
8 CCD           100%      99.01% [%diff: -0.99%]

--

The regression observed could be either because of the larger search
time to find a non-cache-hot idle CPU, or perhaps just because the
larger search time in general adds to utilization and curbs the
SIS_UTIL limits further. I'll go gather some stats to back my suspicion
(particularly for hackbench).

> 
> [..snip..]
 
--
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
  2023-09-11  2:50 ` [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() Chen Yu
                     ` (2 preceding siblings ...)
  2023-09-11 15:26   ` Mathieu Desnoyers
@ 2023-09-14  5:30   ` K Prateek Nayak
  2023-09-14 10:43     ` Chen Yu
  3 siblings, 1 reply; 34+ messages in thread
From: K Prateek Nayak @ 2023-09-14  5:30 UTC (permalink / raw)
  To: Chen Yu
  Cc: Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Gautham R . Shenoy, linux-kernel, Peter Zijlstra,
	Mathieu Desnoyers, Vincent Guittot, Juri Lelli, Ingo Molnar

Hello Chenyu,

One question ...

On 9/11/2023 8:20 AM, Chen Yu wrote:
> [..snip..]
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e20f50726ab8..fe3b760c9654 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> [..more snip..]
> @@ -7052,10 +7072,14 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
>  	int cpu;
>  
>  	for_each_cpu(cpu, cpu_smt_mask(core)) {
> -		if (!available_idle_cpu(cpu)) {
> +		bool cache_hot = sched_feat(SIS_CACHE) ?
> +			sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout : false;
> +
> +		if (!available_idle_cpu(cpu) || cache_hot) {
>  			idle = false;
>  			if (*idle_cpu == -1) {
> -				if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr)) {
> +				if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr) &&
> +				    !cache_hot) {

Here, the CPU is running a SCHED_IDLE task ...

>  					*idle_cpu = cpu;
>  					break;
>  				}

... but just below this, there are following lines to cache the idle_cpu:

		}
		if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr))
			*idle_cpu = cpu;

Would it make sense to also add the same "cache_hot" check here when we
come across an idle CPU during the search for an idle core? Something
like:

-		if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr))
+		if (*idle_cpu == -1 && !cache_hot && cpumask_test_cpu(cpu, p->cpus_ptr))
			*idle_cpu = cpu;

Implications with the above change:

If the entire core is idle, "select_idle_core()" will return the core
and the search will bail out in "select_idle_cpu()". Otherwise, the
cache-hot idle CPUs encountered during the search for idle core will be
ignored now and if "idle_cpu" is not -1, it contains an idle CPU that is
not cache-hot.

Thoughts?

> [..snip..]

--
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
  2023-09-13  7:25         ` Chen Yu
@ 2023-09-14  7:06           ` Gautham R. Shenoy
  2023-09-14 12:09             ` Chen Yu
  0 siblings, 1 reply; 34+ messages in thread
From: Gautham R. Shenoy @ 2023-09-14  7:06 UTC (permalink / raw)
  To: Chen Yu
  Cc: Aaron Lu, Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar,
	Vincent Guittot, Juri Lelli, Tim Chen, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, K Prateek Nayak,
	linux-kernel

Hello Chen Yu,

On Wed, Sep 13, 2023 at 03:25:14PM +0800, Chen Yu wrote:
> Hi Gautham,
> 
> thanks for the review,
> 
> On 2023-09-13 at 11:52:14 +0530, Gautham R. Shenoy wrote:
> > On Mon, Sep 11, 2023 at 04:40:02PM +0800, Chen Yu wrote:
> > > Hi Aaron,
> > > 
> > > thanks for the review,
> > > 
> > > On 2023-09-11 at 15:26:29 +0800, Aaron Lu wrote:
> > 
> > [..snip..]
> > 
> > > > > @@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
> > > > >  static inline int __select_idle_cpu(int cpu, struct task_struct *p)
> > > > >  {
> > > > >  	if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
> > > > > -	    sched_cpu_cookie_match(cpu_rq(cpu), p))
> > > > > +	    sched_cpu_cookie_match(cpu_rq(cpu), p)) {
> > > > > +		if (sched_feat(SIS_CACHE) &&
> > > > > +		    sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout)
> > > > > +			return -1;
> > > > > +
> > > > 
> > > > Maybe introduce a new function that also considers rq->cache_hot_timeout,
> > > > like available_idle_cpu_migrate() so that above and below logic can be
> > > > simplified a bit?
> > > > 
> > > 
> > > Yes, that would be simpler, I'll do in next version.
> > > 
> > > > I was thinking to simply add that rq->cache_hot_timeout check to
> > > > available_idle_cpu() but then a long sleeping task could be forced to
> > > > migrate if its prev_cpu happens to just deschedule a task that sets rq's
> > > > cache_hot_timeout. I guess that's why you chose to only change the idle
> > > > semantic in select_idle_cpu() but not in select_idle_sibling()?
> > > >
> > > 
> > > Yes, sort of. And the reason I did not put this cache hot check in available_idle_cpu()
> > > or idle_cpu() was mainly because these APIs are generic and could be invoked by select_idle_sibling().
> > > If the task fall asleep and woken up quickly, its previous idle CPU will also be skipped,
> > > thus no one could use this CPU within the cache hot period, including the cache-hot task
> > > itself.
> > 
> > This happens even with this patch right? It is possible for a task p1
> > whose avg sleep time is "t" to go to sleep which causes its CPU to go
> > idle. When it wakes up after a time t' < t, the logic above skips the
> > idle CPU because it is still cache-hot, despite the fact that it is
> > cache hot for p1!
> > 
> Not sure if I understand correctly, in select_idle_sibling() p1's previous
> CPU will be checked first, and that check does not involve cache-hot. So if
> p1's previous CPU is idle, it will be picked, no?
> 
>         if (prev != target && cpus_share_cache(prev, target) &&
>             (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
>             asym_fits_cpu(task_util, util_min, util_max, prev))
>                 return prev;


You are right, but the "if" condition is the one prior to "if (prev !=
target ...)" which causes p1 can pick the previous CPU here. However,
note that it is not just p1 which can pick the previous CPU as
intended by your patch.

Consider the following case:

* p1' ran on CPUX, and went to sleep. We now update the
  cache_hot_timeout for CPUX based on the sleep-avg of p1'

* When the CPU goes idle, during CPU_NEWLY_IDLE load balancing, (or
  due to shared_runq search), the CPU could pull p1 from another CPU
  Y.

* p1 now runs on CPUX, and goes to sleep.

* We update the cache_hot_timeout for CPUX based on the sleep-avg of p1.

* p1' wakesup and picks prev as the target after doing wake_affine()
  check in select_task_rq_fair().

* In select_idle_sibling() it encounters the following, which checks
  out and thus returns from target.

	if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
	    asym_fits_cpu(task_util, util_min, util_max, target))
		return target;

* p1 now wakes up, picks the previous cpu as the target. However,
  available_idle_cpu() is no longer true.

So despite "reserving" the CPU for p1, which is likely to have its
data still hot in the case, we would have scheduled p1', thus
defeating the whole purpose of reservation.

To be honest, this isn't so bad, because we have been able to avoid a
migration in this case.

> 
> Or do you mean, in select_idle_cpu(), we will re-check p1's previous
> CPU but it is skipped due to cache-hot?

I had originally thought about this, but then as you pointed out we
have an opportunity to pick the previous cpu in the early checks
inside select_idle_sibling().

> 
> > Have you considered recording p1's identity in the
> > rq->cache_hot_sleeper so that in select_task_rq_fair(), we can simply
> > return the previous CPU if it is cache hot and the wakee is
> > rq->cache_hot_sleeper, thus avoiding the whole select_idle_sibling
> > scan.
> > 
> 
> Yes this seems to be donable, and one problem would be, if there are
> more than 2 dequeued tasks prefer the same (previous) CPU, which task
> should be the rq->cache_hot_sleeper. And per Mathieu's feedback[1], we
> want to deal with multiple dequeued tasks. If we record all of them,
> it might be costly.

If there are multiple dequeued tasks, then it doesn't make sense to
record the identity of the tasks. However, we need the bail out to be
much earlier, in select_task_rq_fair(), perhaps even before the
want_affine() checks.

After all, if the previous CPU is idle, and its cache_hot_timeout
hasn't expired, and if the wakee's sleep duration is less than the
cache_hot_timeout, why don't we just pick it here and be done with it?
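
Something like this minimal sketch is what I have in mind (purely
illustrative: the exact placement, the WF_TTWU/cpus_ptr checks and the
comparison against sysctl_sched_migration_cost are my assumptions;
cache_hot_timeout and sleep_avg are the fields from this series):

	/* Sketch only: early bail-out near the top of select_task_rq_fair() */
	if (sched_feat(SIS_CACHE) && (wake_flags & WF_TTWU) &&
	    cpumask_test_cpu(prev_cpu, p->cpus_ptr) &&
	    available_idle_cpu(prev_cpu) &&
	    sched_clock_cpu(prev_cpu) < cpu_rq(prev_cpu)->cache_hot_timeout &&
	    p->se.sleep_avg < sysctl_sched_migration_cost)
		return prev_cpu;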

> 
> [1] https://lore.kernel.org/lkml/2a47ae82-b8cd-95db-9f48-82b3df0730f3@efficios.com/
> 
> thanks,
> Chenyu

--
Thanks and Regards
gautham.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
  2023-09-14  5:30   ` K Prateek Nayak
@ 2023-09-14 10:43     ` Chen Yu
  2023-09-15  3:37       ` K Prateek Nayak
  0 siblings, 1 reply; 34+ messages in thread
From: Chen Yu @ 2023-09-14 10:43 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Gautham R . Shenoy, linux-kernel, Peter Zijlstra,
	Mathieu Desnoyers, Vincent Guittot, Juri Lelli, Ingo Molnar

Hi Prateek,

On 2023-09-14 at 11:00:02 +0530, K Prateek Nayak wrote:
> Hello Chenyu,
> 
> One question ...
> 
> On 9/11/2023 8:20 AM, Chen Yu wrote:
> > [..snip..]
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index e20f50726ab8..fe3b760c9654 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > [..more snip..]
> > @@ -7052,10 +7072,14 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
> >  	int cpu;
> >  
> >  	for_each_cpu(cpu, cpu_smt_mask(core)) {
> > -		if (!available_idle_cpu(cpu)) {
> > +		bool cache_hot = sched_feat(SIS_CACHE) ?
> > +			sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout : false;
> > +
> > +		if (!available_idle_cpu(cpu) || cache_hot) {
> >  			idle = false;
> >  			if (*idle_cpu == -1) {
> > -				if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr)) {
> > +				if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr) &&
> > +				    !cache_hot) {
> 
> Here, the CPU is running a SCHED_IDLE task ...
> 
> >  					*idle_cpu = cpu;
> >  					break;
> >  				}
> 
> ... but just below this, there are following lines to cache the idle_cpu:
> 
> 		}
> 		if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr))
> 			*idle_cpu = cpu;
> 
> Would it make sense to also add the same "cache_hot" check here when we
> come across an idle CPU during the search for an idle core? Something
> like:
> 
> -		if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr))

When we reach the code above, the following condition should be true:
 (available_idle_cpu(cpu) && !cache_hot)
because the previous 'if' statement is false. So I guess we already
have !cache_hot?
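
For reference, here is how the loop looks with the patch applied
(reconstructed from the hunk above, the comment is mine):

	for_each_cpu(cpu, cpu_smt_mask(core)) {
		bool cache_hot = sched_feat(SIS_CACHE) ?
			sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout : false;

		if (!available_idle_cpu(cpu) || cache_hot) {
			idle = false;
			if (*idle_cpu == -1) {
				if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr) &&
				    !cache_hot) {
					*idle_cpu = cpu;
					break;
				}
				continue;
			}
			break;
		}
		/* reached only if available_idle_cpu(cpu) && !cache_hot */
		if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr))
			*idle_cpu = cpu;
	}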

> +		if (*idle_cpu == -1 && !cache_hot && cpumask_test_cpu(cpu, p->cpus_ptr))
> 			*idle_cpu = cpu;
> 
> Implications with the above change:
> 
> If the entire core is idle, "select_idle_core()" will return the core
> and the search will bail out in "select_idle_cpu()". Otherwise, the
> cache-hot idle CPUs encountered during the search for idle core will be
> ignored now and if "idle_cpu" is not -1, it contains an idle CPU that is
> not cache-hot.
> 
> Thoughts?
> 

Yes, agreed, in your case we want to skip the cache-hot idle CPU if that
entire core is not idle.

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
  2023-09-14  4:13               ` K Prateek Nayak
@ 2023-09-14 11:01                 ` Chen Yu
  2023-09-15  3:21                   ` K Prateek Nayak
  0 siblings, 1 reply; 34+ messages in thread
From: Chen Yu @ 2023-09-14 11:01 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Gautham R . Shenoy, linux-kernel, Peter Zijlstra,
	Mathieu Desnoyers, Ingo Molnar, Vincent Guittot, Juri Lelli

Hi Prateek,

thanks for the test,

On 2023-09-14 at 09:43:52 +0530, K Prateek Nayak wrote:
> Hello Chenyu,
> 
> On 9/13/2023 8:27 AM, Chen Yu wrote:
> > On 2023-09-12 at 19:56:37 +0530, K Prateek Nayak wrote:
> >> Hello Chenyu,
> >>
> >> On 9/12/2023 6:02 PM, Chen Yu wrote:
> >>> [..snip..]
> >>>
> >>>>> If I understand correctly, WF_SYNC is to let the wakee to woken up
> >>>>> on the waker's CPU, rather than the wakee's previous CPU, because
> >>>>> the waker goes to sleep after wakeup. SIS_CACHE mainly cares about
> >>>>> wakee's previous CPU. We can only restrict that other wakee does not
> >>>>> occupy the previous CPU, but do not enhance the possibility that
> >>>>> wake_affine_idle() chooses the previous CPU.
> >>>>
> >>>> Correct me if I'm wrong here,
> >>>>
> >>>> Say a short sleeper, is always woken up using WF_SYNC flag. When the
> >>>> task is dequeued, we mark the previous  CPU where it ran as "cache-hot"
> >>>> and restrict any wakeup happening until the "cache_hot_timeout" is
> >>>> crossed. Let us assume a perfect world where the task wakes up before
> >>>> the "cache_hot_timeout" expires. Logically this CPU was reserved all
> >>>> this while for the short sleeper but since the wakeup bears WF_SYNC
> >>>> flag, the whole reservation is ignored and waker's LLC is explored.
> >>>>
> >>>
> >>> Ah, I see your point. Do you mean, because the waker has a WF_SYNC, wake_affine_idle()
> >>> forces the short sleeping wakee to be woken up on waker's CPU rather the
> >>> wakee's previous CPU, but wakee's previous has been marked as cache hot
> >>> for nothing?
> >>
> >> Precisely :)
> >>
> >>>
> >>>> Should the timeout be cleared if the wakeup decides to not target the
> >>>> previous CPU? (The default "sysctl_sched_migration_cost" is probably
> >>>> small enough to curb any side effect that could possibly show here but
> >>>> if a genuine use-case warrants setting "sysctl_sched_migration_cost" to
> >>>> a larger value, the wakeup path might be affected where lot of idle
> >>>> targets are overlooked since the CPUs are marked cache-hot forr longer
> >>>> duration)
> >>>>
> >>>> Let me know what you think.
> >>>>
> >>>
> >>> This makes sense. In theory the above logic can be added in
> >>> select_idle_sibling(), if target CPU is chosen rather than
> >>> the previous CPU, the previous CPU's cache hot flag should be
> >>> cleared.
> >>>
> >>> But this might bring overhead. Because we need to grab the rq
> >>> lock and write to other CPU's rq, which could be costly. It
> >>> seems to be a trade-off of current implementation.
> >>
> >> I agree, it will not be pretty. Maybe the other way is to have a
> >> history of the type of wakeup the task experiences (similar to
> >> wakee_flips but for sync and non-syn wakeups) and only reserve
> >> the CPU if the task wakes up more via non-sync wakeups? Thinking
> >> out loud here.
> >>
> > 
> > This looks good to consider the task's attribute, or maybe something
> > like this:
> > 
> > new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> > if (new_cpu != prev_cpu)
> > 	p->burst_sleep_avg >>= 1;
> > So the duration of reservation could be shrinked.
> 
> That seems like a good approach.
> 
> Meanwhile, here is result for the current series without any
> modifications:
> 
> tl;dr
> 
> - There seems to be a noticeable increase in hackbench runtime with a
>   single group but big gains beyond that. The regression could possibly
>   be because of added searching but let me do some digging to confirm
>   that. 

Ah OK. May I have the command to run 1 group hackbench?

> 
> - Small regressions (~2%) noticed in ycsb-mongodb (medium utilization)
>   and DeathStarBench (High Utilization)
> 
> - Other benchmarks are more of less perf neutral with the changes.
> 
> More information below:
> 
> o System information
> 
>   - Dual socket 3rd Generation EPYC System (2 x 64C/128T)
>   - NPS1 mode (each socket is a NUMA node)
>   - Boost Enabled
>   - C2 disabled (MWAIT based C1 is still enabled)
> 
> 
> o Kernel information
> 
> base		:   tip:sched/core at commit b41bbb33cf75 ("Merge branch
> 		    'sched/eevdf' into sched/core")
> 		  + cheery-pick commit 63304558ba5d ("sched/eevdf: Curb
> 		    wakeup-preemption")
> 
> SIS_CACHE	:   base
> 		  + this series as is
> 
> 
> o Benchmark results
> 
> ==================================================================
> Test          : hackbench
> Units         : Normalized time in seconds
> Interpretation: Lower is better
> Statistic     : AMean
> ==================================================================
> Case:          base[pct imp](CV)     SIS_CACHE[pct imp](CV)
>  1-groups     1.00 [ -0.00]( 1.89)     1.10 [-10.28]( 2.03)
>  2-groups     1.00 [ -0.00]( 2.04)     0.98 [  1.57]( 2.04)
>  4-groups     1.00 [ -0.00]( 2.38)     0.95 [  4.70]( 0.88)
>  8-groups     1.00 [ -0.00]( 1.52)     0.93 [  7.18]( 0.76)
> 16-groups     1.00 [ -0.00]( 3.44)     0.90 [  9.76]( 1.04)
> 
> 
> ==================================================================
> Test          : tbench
> Units         : Normalized throughput
> Interpretation: Higher is better
> Statistic     : AMean
> ==================================================================
> Clients:          base[pct imp](CV)     SIS_CACHE[pct imp](CV)
>     1     1.00 [  0.00]( 0.18)     0.98 [ -1.61]( 0.27)
>     2     1.00 [  0.00]( 0.63)     0.98 [ -1.58]( 0.09)
>     4     1.00 [  0.00]( 0.86)     0.99 [ -0.52]( 0.42)
>     8     1.00 [  0.00]( 0.22)     0.98 [ -1.77]( 0.65)
>    16     1.00 [  0.00]( 1.99)     1.00 [ -0.10]( 1.55)
>    32     1.00 [  0.00]( 4.29)     0.98 [ -1.73]( 1.55)
>    64     1.00 [  0.00]( 1.71)     0.97 [ -2.77]( 3.74)
>   128     1.00 [  0.00]( 0.65)     1.00 [ -0.14]( 0.88)
>   256     1.00 [  0.00]( 0.19)     0.97 [ -2.65]( 0.49)
>   512     1.00 [  0.00]( 0.20)     0.99 [ -1.10]( 0.33)
>  1024     1.00 [  0.00]( 0.29)     0.99 [ -0.70]( 0.16)
> 
> 
> ==================================================================
> Test          : stream-10
> Units         : Normalized Bandwidth, MB/s
> Interpretation: Higher is better
> Statistic     : HMean
> ==================================================================
> Test:          base[pct imp](CV)     SIS_CACHE[pct imp](CV)
>  Copy     1.00 [  0.00]( 4.32)     0.90 [ -9.82](10.72)
> Scale     1.00 [  0.00]( 5.21)     1.01 [  0.59]( 1.83)
>   Add     1.00 [  0.00]( 6.25)     0.99 [ -0.91]( 4.49)
> Triad     1.00 [  0.00](10.74)     1.02 [  2.28]( 6.07)
> 
> 
> ==================================================================
> Test          : stream-100
> Units         : Normalized Bandwidth, MB/s
> Interpretation: Higher is better
> Statistic     : HMean
> ==================================================================
> Test:          base[pct imp](CV)     SIS_CACHE[pct imp](CV)
>  Copy     1.00 [  0.00]( 0.70)     0.98 [ -1.79]( 2.26)
> Scale     1.00 [  0.00]( 6.55)     1.03 [  2.80]( 0.74)
>   Add     1.00 [  0.00]( 6.53)     1.02 [  2.05]( 1.82)
> Triad     1.00 [  0.00]( 6.66)     1.04 [  3.54]( 1.04)
> 
> 
> ==================================================================
> Test          : netperf
> Units         : Normalized Througput
> Interpretation: Higher is better
> Statistic     : AMean
> ==================================================================
> Clients:          base[pct imp](CV)     SIS_CACHE[pct imp](CV)
>  1-clients     1.00 [  0.00]( 0.46)     0.99 [ -0.55]( 0.49)
>  2-clients     1.00 [  0.00]( 0.38)     0.99 [ -1.23]( 1.19)
>  4-clients     1.00 [  0.00]( 0.72)     0.98 [ -1.91]( 1.21)
>  8-clients     1.00 [  0.00]( 0.98)     0.98 [ -1.61]( 1.08)
> 16-clients     1.00 [  0.00]( 0.70)     0.98 [ -1.80]( 1.04)
> 32-clients     1.00 [  0.00]( 0.74)     0.98 [ -1.55]( 1.20)
> 64-clients     1.00 [  0.00]( 2.24)     1.00 [ -0.04]( 2.77)
> 128-clients    1.00 [  0.00]( 1.72)     1.03 [  3.22]( 1.99)
> 256-clients    1.00 [  0.00]( 4.44)     0.99 [ -1.33]( 4.71)
> 512-clients    1.00 [  0.00](52.42)     0.98 [ -1.61](52.72)
> 
> 
> ==================================================================
> Test          : schbench (old)
> Units         : Normalized 99th percentile latency in us
> Interpretation: Lower is better
> Statistic     : Median
> ==================================================================
> #workers:          base[pct imp](CV)     SIS_CACHE[pct imp](CV)
>   1     1.00 [ -0.00]( 2.28)     0.96 [  4.00](15.68)
>   2     1.00 [ -0.00]( 6.42)     1.00 [ -0.00](10.96)
>   4     1.00 [ -0.00]( 3.77)     0.97 [  3.33]( 7.61)
>   8     1.00 [ -0.00](13.83)     1.08 [ -7.89]( 2.86)
>  16     1.00 [ -0.00]( 4.37)     1.00 [ -0.00]( 2.13)
>  32     1.00 [ -0.00]( 8.69)     0.95 [  4.94]( 2.73)
>  64     1.00 [ -0.00]( 2.30)     1.05 [ -5.13]( 1.26)
> 128     1.00 [ -0.00](12.12)     1.03 [ -3.41]( 5.08)
> 256     1.00 [ -0.00](26.04)     0.91 [  8.88]( 2.59)
> 512     1.00 [ -0.00]( 5.62)     0.97 [  3.32]( 0.37)
> 
> 
> ==================================================================
> Test          : Unixbench
> Units         : Various, Throughput
> Interpretation: Higher is better
> Statistic     : AMean, Hmean (Specified)
> ==================================================================
> Metric		variant                      base		     SIS_CACHE
> Hmean     unixbench-dhry2reg-1            41248390.97 (   0.00%)    41485503.82 (   0.57%)
> Hmean     unixbench-dhry2reg-512        6239969914.15 (   0.00%)  6233919689.40 (  -0.10%)
> Amean     unixbench-syscall-1              2968518.27 (   0.00%)     2841236.43 *   4.29%*
> Amean     unixbench-syscall-512            7790656.20 (   0.00%)     7631558.00 *   2.04%*
> Hmean     unixbench-pipe-1                 2535689.01 (   0.00%)     2598208.16 *   2.47%*
> Hmean     unixbench-pipe-512             361385055.25 (   0.00%)   368566373.76 *   1.99%*
> Hmean     unixbench-spawn-1                   4506.26 (   0.00%)        4551.67 (   1.01%)
> Hmean     unixbench-spawn-512                69380.09 (   0.00%)       69264.30 (  -0.17%)
> Hmean     unixbench-execl-1                   3824.57 (   0.00%)        3822.67 (  -0.05%)
> Hmean     unixbench-execl-512                12288.64 (   0.00%)       11728.12 (  -4.56%)
> 
> 
> ==================================================================
> Test          : ycsb-mongodb
> Units         : Throughput
> Interpretation: Higher is better
> Statistic     : AMean
> ==================================================================
> base            : 309589.33 (var: 1.41%) 
> SIS_CACHE       : 304931.33 (var: 1.29%) [diff: -1.50%]
> 
> 
> ==================================================================
> Test          : DeathStarBench
> Units         : Normalized Throughput, relative to base
> Interpretation: Higher is better
> Statistic     : AMean
> ==================================================================
> Pinning         base     SIS_CACHE
> 1 CCD           100%      99.18% [%diff: -0.82%]
> 2 CCD           100%      97.46% [%diff: -2.54%]
> 4 CCD           100%      97.22% [%diff: -2.78%]
> 8 CCD           100%      99.01% [%diff: -0.99%]
> 
> --
> 
> Regression observed could either be because of the larger search time to
> find a non cache-hot idle CPU, or perhaps just the larger search time in
> general adding to utilization and curbing the SIS_UTIL limits further.

Yeah, that is possible. And you also mentioned that we should consider
the cache-hot idle CPU if we cannot find any cache-cold idle CPU; that
might be a better choice than forcibly putting the wakee on the current
CPU, which brings task stacking.
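
A minimal sketch of that fallback in select_idle_cpu() (illustrative
only: the has_idle_core/SIS_UTIL handling is omitted and 'fallback' is
a hypothetical local variable):

	int i, fallback = -1;

	for_each_cpu_wrap(cpu, cpus, target + 1) {
		i = __select_idle_cpu(cpu, p);
		if ((unsigned int)i < nr_cpumask_bits)
			return i;
		/*
		 * With SIS_CACHE, -1 means either busy or idle-but-cache-hot,
		 * remember the latter as a fallback.
		 */
		if (fallback == -1 && available_idle_cpu(cpu))
			fallback = cpu;
	}

	/* no cache-cold idle CPU: a cache-hot idle CPU beats stacking on target */
	return fallback;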

> I'll go gather some stats to back my suspicion (particularly for
> hackbench).
>

Thanks!
Chenyu

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
  2023-09-14  7:06           ` Gautham R. Shenoy
@ 2023-09-14 12:09             ` Chen Yu
  2023-09-15 15:18               ` Gautham R. Shenoy
  0 siblings, 1 reply; 34+ messages in thread
From: Chen Yu @ 2023-09-14 12:09 UTC (permalink / raw)
  To: Gautham R. Shenoy
  Cc: Aaron Lu, Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar,
	Vincent Guittot, Juri Lelli, Tim Chen, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, K Prateek Nayak,
	linux-kernel

Hi Gautham,

On 2023-09-14 at 12:36:33 +0530, Gautham R. Shenoy wrote:
> Hello Chen Yu,
> 
> On Wed, Sep 13, 2023 at 03:25:14PM +0800, Chen Yu wrote:
> > Hi Gautham,
> > 
> > thanks for the review,
> > 
> > On 2023-09-13 at 11:52:14 +0530, Gautham R. Shenoy wrote:
> > > On Mon, Sep 11, 2023 at 04:40:02PM +0800, Chen Yu wrote:
> > > > Hi Aaron,
> > > > 
> > > > thanks for the review,
> > > > 
> > > > On 2023-09-11 at 15:26:29 +0800, Aaron Lu wrote:
> > > 
> > > [..snip..]
> > > 
> > > > > > @@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
> > > > > >  static inline int __select_idle_cpu(int cpu, struct task_struct *p)
> > > > > >  {
> > > > > >  	if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
> > > > > > -	    sched_cpu_cookie_match(cpu_rq(cpu), p))
> > > > > > +	    sched_cpu_cookie_match(cpu_rq(cpu), p)) {
> > > > > > +		if (sched_feat(SIS_CACHE) &&
> > > > > > +		    sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout)
> > > > > > +			return -1;
> > > > > > +
> > > > > 
> > > > > Maybe introduce a new function that also considers rq->cache_hot_timeout,
> > > > > like available_idle_cpu_migrate() so that above and below logic can be
> > > > > simplified a bit?
> > > > > 
> > > > 
> > > > Yes, that would be simpler, I'll do in next version.
> > > > 
> > > > > I was thinking to simply add that rq->cache_hot_timeout check to
> > > > > available_idle_cpu() but then a long sleeping task could be forced to
> > > > > migrate if its prev_cpu happens to just deschedule a task that sets rq's
> > > > > cache_hot_timeout. I guess that's why you chose to only change the idle
> > > > > semantic in select_idle_cpu() but not in select_idle_sibling()?
> > > > >
> > > > 
> > > > Yes, sort of. And the reason I did not put this cache hot check in available_idle_cpu()
> > > > or idle_cpu() was mainly because these APIs are generic and could be invoked by select_idle_sibling().
> > > > If the task fall asleep and woken up quickly, its previous idle CPU will also be skipped,
> > > > thus no one could use this CPU within the cache hot period, including the cache-hot task
> > > > itself.
> > > 
> > > This happens even with this patch right? It is possible for a task p1
> > > whose avg sleep time is "t" to go to sleep which causes its CPU to go
> > > idle. When it wakes up after a time t' < t, the logic above skips the
> > > idle CPU because it is still cache-hot, despite the fact that it is
> > > cache hot for p1!
> > > 
> > Not sure if I understand correctly, in select_idle_sibling() p1's previous
> > CPU will be checked first, and that check does not involve cache-hot. So if
> > p1's previous CPU is idle, it will be picked, no?
> > 
> >         if (prev != target && cpus_share_cache(prev, target) &&
> >             (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
> >             asym_fits_cpu(task_util, util_min, util_max, prev))
> >                 return prev;
> 
> 
> You are right, but the "if" condition is the one prior to "if (prev !=
> target ...)" which causes p1 can pick the previous CPU here. However,
> note that it is not just p1 which can pick the previous CPU as
> intended by your patch.
> 
> Consider the following case:
> 
> * p1' ran on CPUX, and went to sleep. We now update the
>   cache_hot_timeout for CPUX based on the sleep-avg of p1'
> 
> * When the CPU goes idle, during CPU_NEWLY_IDLE load balancing, (or
>   due to shared_runq search), the CPU could pull p1 from another CPU
>   Y.

Agreed, newidle balance could clobber the reservation on the cache-hot CPU.

> 
> * p1 now runs on CPUX, and goes to sleep.
> 
> * We update the cache_hot_timeout for CPUX based on the sleep-avg of p1.
> 
> * p1' wakesup and picks prev as the target after doing wake_affine()
>   check in select_task_rq_fair().
> 
> * In select_idle_sibling() it encounters the following, which checks
>   out and thus returns from target.
> 
> 	if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
> 	    asym_fits_cpu(task_util, util_min, util_max, target))
> 		return target;
> 
> * p1 now wakes up, picks the previous cpu as the target. However,
>   available_idle_cpu() is no longer true.
> 
> So despite "reserving" the CPU for p1, which is likely to have its
> data still hot in the case, we would have scheduled p1', thus
> defeating the whole purpose of reservation.
> 

I see. So you mean, although we reserve the CPU for the wakee,
the wakee might not choose its previous CPU, which is against our
goal.

The reasons preventing the wakee from choosing its previous CPU could be:
1. wake_affine() chooses the waker's CPU rather than the wakee's previous CPU, or
2. the wakee's previous CPU has already been taken by someone else, via newidle_balance().

For 1, I think Prateek has expressed this concern. One mitigation method could
be to penalize the wakee if it decides not to choose its previous CPU:

"
new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
if (new_cpu != prev_cpu)
	p->burst_sleep_avg >>= 1;
So the duration of the reservation could be shrunk.
"

For 2, maybe inhibit the newidle balance; something like this comes to mind:


--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12022,6 +12022,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	u64 t0, t1, curr_cost = 0;
 	struct sched_domain *sd;
 	int pulled_task = 0;
+	bool cache_hot = false;
 
 	update_misfit_status(NULL, this_rq);
 
@@ -12055,8 +12056,19 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
 	rcu_read_lock();
 	sd = rcu_dereference_check_sched_domain(this_rq->sd);
 
+	if (sched_feat(SIS_CACHE)) {
+		s64 delta = this_rq->cache_hot_timeout - sched_clock_cpu(this_cpu);
+
+		/*
+		 * If a short time later, a short sleeping task will be woken up
+		 * on this idle CPU, do not launch the newidle balance.
+		 */
+		if (delta > 0 && delta < this_rq->max_idle_balance_cost)
+			cache_hot = true;
+	}
+
 	if (!READ_ONCE(this_rq->rd->overload) ||
-	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
+	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost) || cache_hot) {
 
 		if (sd)
 			update_next_balance(sd, &next_balance);


> To be honest, this isn't so bad, because we have been able to avoid a
> migration in this case.
> 
> > 
> > Or do you mean, in select_idle_cpu(), we will re-check p1's previous
> > CPU but it is skipped due to cache-hot?
> 
> I had originally thought about this, but then as you pointed out we
> have an opportunity to pick the previous cpu in the early checks
> inside select_idle_sibling().
> 
> > 
> > > Have you considered recording p1's identity in the
> > > rq->cache_hot_sleeper so that in select_task_rq_fair(), we can simply
> > > return the previous CPU if it is cache hot and the wakee is
> > > rq->cache_hot_sleeper, thus avoiding the whole select_idle_sibling
> > > scan.
> > > 
> > 
> > Yes this seems to be donable, and one problem would be, if there are
> > more than 2 dequeued tasks prefer the same (previous) CPU, which task
> > should be the rq->cache_hot_sleeper. And per Mathieu's feedback[1], we
> > want to deal with multiple dequeued tasks. If we record all of them,
> > it might be costly.
> 
> If there are multiple dequeued tasks, then it doesn't make sense to
> record the identity of the tasks. However, we need the bail out to be
> much earlier, in select_task_rq_fair(), perhaps even before the
> want_affine() checks.
> 
> After all, if the previous CPU is idle, and its cache_hot_timeout
> hasn't expired, and if the wakee's sleep duration is less than the
> cache_hot_timeout, why don't we just pick it here and be done with it?
> 

Yes, we can return the previous CPU earlier. One concern is whether we
should honor the WF_SYNC flag or not, because in wake_affine_idle(),
WF_SYNC seems to have a higher priority than available_idle_cpu(prev_cpu).
Say the current CPU has 1 running task and the previous CPU is idle;
wake_affine_idle() still prefers the current CPU.
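
For reference, the relevant ordering in wake_affine_idle() is roughly
the following (comments trimmed), where the sync shortcut is evaluated
before the idle previous CPU check:

	if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
		return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;

	if (sync && cpu_rq(this_cpu)->nr_running == 1)
		return this_cpu;

	if (available_idle_cpu(prev_cpu))
		return prev_cpu;

	return nr_cpumask_bits;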

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
  2023-09-14 11:01                 ` Chen Yu
@ 2023-09-15  3:21                   ` K Prateek Nayak
  0 siblings, 0 replies; 34+ messages in thread
From: K Prateek Nayak @ 2023-09-15  3:21 UTC (permalink / raw)
  To: Chen Yu
  Cc: Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Gautham R . Shenoy, linux-kernel, Peter Zijlstra,
	Mathieu Desnoyers, Ingo Molnar, Vincent Guittot, Juri Lelli

Hello Chenyu,

Thank you for taking a look at the report.

On 9/14/2023 4:31 PM, Chen Yu wrote:
> Hi Prateek,
> 
> thanks for the test,
> 
> On 2023-09-14 at 09:43:52 +0530, K Prateek Nayak wrote:
>> [..snip..]
>>
>> Meanwhile, here is result for the current series without any
>> modifications:
>>
>> tl;dr
>>
>> - There seems to be a noticeable increase in hackbench runtime with a
>>   single group but big gains beyond that. The regression could possibly
>>   be because of added searching but let me do some digging to confirm
>>   that. 
> 
> Ah OK. May I have the command to run 1 group hackbench?

This is actually perf bench sched messaging. The cmdline from the runner
is:

	$ perf bench sched messaging -t -p -l 100000 -g <# of groups>
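
If I remember the perf bench defaults correctly, -t uses threads, -p
uses pipes instead of socketpairs, -l sets the number of message loops,
and each group (-g) is 20 sender plus 20 receiver tasks, so the 1-group
case above runs 40 tasks in total.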

> [..snip..]
 
--
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
  2023-09-14 10:43     ` Chen Yu
@ 2023-09-15  3:37       ` K Prateek Nayak
  0 siblings, 0 replies; 34+ messages in thread
From: K Prateek Nayak @ 2023-09-15  3:37 UTC (permalink / raw)
  To: Chen Yu
  Cc: Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Gautham R . Shenoy, linux-kernel, Peter Zijlstra,
	Mathieu Desnoyers, Vincent Guittot, Juri Lelli, Ingo Molnar

Hello Chenyu,

On 9/14/2023 4:13 PM, Chen Yu wrote:
> Hi Prateek,
> 
> On 2023-09-14 at 11:00:02 +0530, K Prateek Nayak wrote:
>> Hello Chenyu,
>>
>> One question ...
>>
>> On 9/11/2023 8:20 AM, Chen Yu wrote:
>>> [..snip..]
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index e20f50726ab8..fe3b760c9654 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> [..more snip..]
>>> @@ -7052,10 +7072,14 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
>>>  	int cpu;
>>>  
>>>  	for_each_cpu(cpu, cpu_smt_mask(core)) {
>>> -		if (!available_idle_cpu(cpu)) {
>>> +		bool cache_hot = sched_feat(SIS_CACHE) ?
>>> +			sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout : false;
>>> +
>>> +		if (!available_idle_cpu(cpu) || cache_hot) {
>>>  			idle = false;
>>>  			if (*idle_cpu == -1) {
>>> -				if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr)) {
>>> +				if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, p->cpus_ptr) &&
>>> +				    !cache_hot) {
>>
>> Here, the CPU is running a SCHED_IDLE task ...
>>
>>>  					*idle_cpu = cpu;
>>>  					break;
>>>  				}
>>
>> ... but just below this, there are following lines to cache the idle_cpu:
>>
>> 		}
>> 		if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr))
>> 			*idle_cpu = cpu;
>>
>> Would it make sense to also add the same "cache_hot" check here when we
>> come across an idle CPU during the search for an idle core? Something
>> like:
>>
>> -		if (*idle_cpu == -1 && cpumask_test_cpu(cpu, p->cpus_ptr))
> 
> When we reached above code, the following condition should be true:
>  (available_idle_cpu(cpu) && !cache_hot)
> because the previous 'if' statement is false. So I guess we already
> has !cache_hot ?

Ah! You are right. I missed the break at the end of the if block. Thank you
for pointing it out to me :)

> 
>> +		if (*idle_cpu == -1 && !cache_hot && cpumask_test_cpu(cpu, p->cpus_ptr))
>> 			*idle_cpu = cpu;
>>
>> Implications with the above change:
>>
>> If the entire core is idle, "select_idle_core()" will return the core
>> and the search will bail out in "select_idle_cpu()". Otherwise, the
>> cache-hot idle CPUs encountered during the search for idle core will be
>> ignored now and if "idle_cpu" is not -1, it contains an idle CPU that is
>> not cache-hot.
>>
>> Thoughts?
>>
> 
> Yes, agree, we want to skip the cache-hot idle CPU if that entire core is not idle
> in your case.
> 
> thanks,
> Chenyu
 
--
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
  2023-09-14 12:09             ` Chen Yu
@ 2023-09-15 15:18               ` Gautham R. Shenoy
  2023-09-19  9:01                 ` Chen Yu
  0 siblings, 1 reply; 34+ messages in thread
From: Gautham R. Shenoy @ 2023-09-15 15:18 UTC (permalink / raw)
  To: Chen Yu
  Cc: Aaron Lu, Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar,
	Vincent Guittot, Juri Lelli, Tim Chen, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, K Prateek Nayak,
	linux-kernel

Hello Chen Yu,

On Thu, Sep 14, 2023 at 08:09:26PM +0800, Chen Yu wrote:
[..snip..]

> > 
> > So despite "reserving" the CPU for p1, which is likely to have its
> > data still hot in the case, we would have scheduled p1', thus
> > defeating the whole purpose of reservation.
> > 
> 
> I see. So you mean, although we reserve the CPU for the wakee,
> the wakee might not choose its previous CPU, which is against our
> goal.


Yes, but only because some other task could have run on the previous
CPU. That other task could be something that was woken up on that CPU
due to:

1) wake-affine choosing that CPU 
2) newidle-balance pulling the other task on that CPU
3) !wake-affine && that CPU was also the other task's previous CPU

It could also be due to this wakee task being woken up on the waker
CPU due to wake-affine.

> 
> The reason to prevent the wakee choosing its previous CPU could be:
> 1. wake_affine() choose the waker's CPU rather the wakee's previous CPU, or
> 2. the wakee's CPU has already been taken by someone else, via newidle_balance().
>


> For 1, I think Prateek has expressed the concern. One mitigation method could be
> that, we give penalty to that wakee, if it decides not to choose its previous CPU:

We would be penalizing the task for something that the scheduler
decides :-)

As you point out below, in the presence of the WF_SYNC flag,
wake_affine_idle() prefers the waker CPU over the previous CPU when
they are on different LLCs and the waker is the only running task on its CPU.

This strategy makes sense for two reasons:

1) The wakee may be consuming the data produced by the waker.
2) Since the wakeup will happen on the local CPU, there is no risk of
   task-stacking, exactly what your SIS_CURRENT patchset was
   attempting.

But this strategy would also result in increased task migration, which
both Mathieu and you have found is not so beneficial for workloads
such as hackbench. Is it only because the task's data is still hot in
the previous CPU's cache? Or is there more to it?


It would be good to confirm if this is why lower migration is better
for these kinds of workloads.

> 
> "
> new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> if (new_cpu != prev_cpu)
> 	p->burst_sleep_avg >>= 1;
> So the duration of reservation could be shrinked.
> "
> 
> For 2, maybe inhit the newidle balance, something in my mind:
> 
> 
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12022,6 +12022,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  	u64 t0, t1, curr_cost = 0;
>  	struct sched_domain *sd;
>  	int pulled_task = 0;
> +	bool cache_hot = false;
>  
>  	update_misfit_status(NULL, this_rq);
>  
> @@ -12055,8 +12056,19 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
>  	rcu_read_lock();
>  	sd = rcu_dereference_check_sched_domain(this_rq->sd);
>  
> +	if (sched_feat(SIS_CACHE)) {
> +		s64 delta = this_rq->cache_hot_timeout - sched_clock_cpu(this_cpu);
> +
> +		/*
> +		 * If a short time later, a short sleeping task will be woken up
> +		 * on this idle CPU, do not launch the newidle balance.
> +		 */
> +		if (delta > 0 && delta < this_rq->max_idle_balance_cost)
> +			cache_hot = true;
> +	}
> +
>  	if (!READ_ONCE(this_rq->rd->overload) ||
> -	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
> +	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost) || cache_hot) {

>  
>  		if (sd)
>  			update_next_balance(sd, &next_balance);

If the benefit that the workload obtains is really due to the data
being hot near its previous CPU, then this seems a sensible thing to
do.

It would be good to confirm this. Let me get some IBS data for
hackbench, which is the workload that likes a sticky wakeup.

--
Thanks and Regards
gautham.



> 
> 
> > To be honest, this isn't so bad, because we have been able to avoid a
> > migration in this case.
> > 
> > > 
> > > Or do you mean, in select_idle_cpu(), we will re-check p1's previous
> > > CPU but it is skipped due to cache-hot?
> > 
> > I had originally thought about this, but then as you pointed out we
> > have an opportunity to pick the previous cpu in the early checks
> > inside select_idle_sibling().
> > 
> > > 
> > > > Have you considered recording p1's identity in the
> > > > rq->cache_hot_sleeper so that in select_task_rq_fair(), we can simply
> > > > return the previous CPU if it is cache hot and the wakee is
> > > > rq->cache_hot_sleeper, thus avoiding the whole select_idle_sibling
> > > > scan.
> > > > 
> > > 
> > > Yes this seems to be donable, and one problem would be, if there are
> > > more than 2 dequeued tasks prefer the same (previous) CPU, which task
> > > should be the rq->cache_hot_sleeper. And per Mathieu's feedback[1], we
> > > want to deal with multiple dequeued tasks. If we record all of them,
> > > it might be costly.
> > 
> > If there are multiple dequeued tasks, then it doesn't make sense to
> > record the identity of the tasks. However, we need the bail out to be
> > much earlier, in select_task_rq_fair(), perhaps even before the
> > want_affine() checks.
> > 
> > After all, if the previous CPU is idle, and its cache_hot_timeout
> > hasn't expired, and if the wakee's sleep duration is less than the
> > cache_hot_timeout, why don't we just pick it here and be done with it?
> > 
> 
> Yes we can return the previous CPU earlier, one concern is that, should
> we honor WF_SYNC flag or not,  because in wake_affine_idle(), WF_SYNC
> seems to have a higher priority than available_idle_cpu(prev_cpu). Say,
> if the current CPU has 1 running task, and the previous CPU is idle,
> wake_affine_idle() still prefers the current CPU.
> 
> thanks,
> Chenyu

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
  2023-09-15 15:18               ` Gautham R. Shenoy
@ 2023-09-19  9:01                 ` Chen Yu
  0 siblings, 0 replies; 34+ messages in thread
From: Chen Yu @ 2023-09-19  9:01 UTC (permalink / raw)
  To: Gautham R. Shenoy
  Cc: Aaron Lu, Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar,
	Vincent Guittot, Juri Lelli, Tim Chen, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman,
	Daniel Bristot de Oliveira, Valentin Schneider, K Prateek Nayak,
	linux-kernel

Hi Gautham,

Sorry for the late reply,

On 2023-09-15 at 20:48:14 +0530, Gautham R. Shenoy wrote:
> Hello Chen Yu,
> 
> On Thu, Sep 14, 2023 at 08:09:26PM +0800, Chen Yu wrote:
> [..snip..]
> 
> > > 
> > > So despite "reserving" the CPU for p1, which is likely to have its
> > > data still hot in the case, we would have scheduled p1', thus
> > > defeating the whole purpose of reservation.
> > > 
> > 
> > I see. So you mean, although we reserve the CPU for the wakee,
> > the wakee might not choose its previous CPU, which is against our
> > goal.
> 
> 
> Yes, but only because some other task could have run on the previous
> CPU. That other task could be something that was woken up on that CPU
> due to:
> 
> 1) wake-affine choosing that CPU 
> 2) newidle-balance pulling the other task on that CPU
> 3) !wake-affine && that CPU was also the other task's previous CPU
> 
> It could also be due to this wakee task being woken up on the waker
> CPU due to wake-affine.
> 
> > 
> > The reason to prevent the wakee choosing its previous CPU could be:
> > 1. wake_affine() choose the waker's CPU rather the wakee's previous CPU, or
> > 2. the wakee's CPU has already been taken by someone else, via newidle_balance().
> >
> 
> 
> > For 1, I think Prateek has expressed the concern. One mitigation method could be
> > that, we give penalty to that wakee, if it decides not to choose its previous CPU:
> 
> We would be penalizing the task for something that the scheduler
> decides :-)
> 
> As you point out below, in the presence of the WF_SYNC flag,
> wake_affine_idle() prefer the waker CPU over the previous CPU when
> they are on different LLCs and when the waker is the only task.
> 
> This strategy makes sense for two reasons:
> 
> 1) The wakee may be consuming the data produced by the waker.
> 2) Since the wakeup will happen on the local CPU, there is no risk of
>    task-stacking, exactly what your SIS_CURRENT patchset was
>    attempting.
> 
> But this strategy would also result in increased task-migration. Which
> both Mattieu and you have found is not so beneficial for workloads
> such as hackbench. Is it only because task's data is still hot in the
> previous CPU's cache ? Or is there more to it ?
> 
> 
> It would be good to confirm if this is why lower migration is better
> for these kinds of workloads.
>

Right. According to the previous hackbench test for the shared runqueue,
higher migration brings a higher DSB miss rate [1]. I'll collect some
statistics with this patch applied to confirm.

[1] https://lore.kernel.org/lkml/ZO7e5YaS71cXVxQN@chenyu5-mobl2/
> > 
> > "
> > new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> > if (new_cpu != prev_cpu)
> > 	p->burst_sleep_avg >>= 1;
> > So the duration of reservation could be shrinked.
> > "
> > 
> > For 2, maybe inhit the newidle balance, something in my mind:
> > 
> > 
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -12022,6 +12022,7 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> >  	u64 t0, t1, curr_cost = 0;
> >  	struct sched_domain *sd;
> >  	int pulled_task = 0;
> > +	bool cache_hot = false;
> >  
> >  	update_misfit_status(NULL, this_rq);
> >  
> > @@ -12055,8 +12056,19 @@ static int newidle_balance(struct rq *this_rq, struct rq_flags *rf)
> >  	rcu_read_lock();
> >  	sd = rcu_dereference_check_sched_domain(this_rq->sd);
> >  
> > +	if (sched_feat(SIS_CACHE)) {
> > +		s64 delta = this_rq->cache_hot_timeout - sched_clock_cpu(this_cpu);
> > +
> > +		/*
> > +		 * If a short time later, a short sleeping task will be woken up
> > +		 * on this idle CPU, do not launch the newidle balance.
> > +		 */
> > +		if (delta > 0 && delta < this_rq->max_idle_balance_cost)
> > +			cache_hot = true;
> > +	}
> > +
> >  	if (!READ_ONCE(this_rq->rd->overload) ||
> > -	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost)) {
> > +	    (sd && this_rq->avg_idle < sd->max_newidle_lb_cost) || cache_hot) {
> 
> >  
> >  		if (sd)
> >  			update_next_balance(sd, &next_balance);
> 
> If the benefit that the workload obtains is really due to the data
> being hot near its previous CPU, then this seems a sensible thing to
> do.
> 
> It would be good to confirm this. Let me get some IBS data for
> hackbench which is the workload which likes a sticky wakeup.
> 

I'll collect the statistics too. Thanks for your time.

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
  2023-09-11 15:26   ` Mathieu Desnoyers
  2023-09-11 15:43     ` Mathieu Desnoyers
@ 2023-09-20 12:34     ` Chen Yu
  1 sibling, 0 replies; 34+ messages in thread
From: Chen Yu @ 2023-09-20 12:34 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Juri Lelli,
	Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	K Prateek Nayak, Gautham R . Shenoy, linux-kernel

Hi Mathieu,

On 2023-09-11 at 11:26:50 -0400, Mathieu Desnoyers wrote:
> On 9/10/23 22:50, Chen Yu wrote:
> > When task p is woken up, the scheduler leverages select_idle_sibling()
> > to find an idle CPU for it. p's previous CPU is usually a preference
> > because it can improve cache locality. However in many cases the
> > previous CPU has already been taken by other wakees, thus p has to
> > find another idle CPU.
> > 
> > Inspired by Mathieu's idea[1], consider the sleep time of the task.
> > If that task is a short sleeping one, keep p's previous CPU idle
> > for a short while. Later when p is woken up, it can choose its
> > previous CPU in select_idle_sibling(). When p's previous CPU is reserved,
> > other wakee is not allowed to choose this CPU in select_idle_idle().
> > The reservation period is set to the task's average sleep time. That
> > is to say, if p is a short sleeping task, there is no need to migrate
> > p to another idle CPU.
> > 
> > This does not break the work conservation of the scheduler,
> > because wakee will still try its best to find an idle CPU.
> > The difference is that, different idle CPUs might have different
> > priorities. On the other hand, in theory this extra check could
> > increase the failure ratio of select_idle_cpu(), but per the
> > initial test result, no regression is detected.
> > 
> > Baseline: tip/sched/core, on top of:
> > Commit 3f4feb58037a ("sched: Misc cleanups")
> > 
> > Benchmark results on Intel Sapphire Rapids, 112 CPUs/socket, 2 sockets.
> > cpufreq governor is performance, turbo boost is disabled, C-states deeper
> > than C1 are disabled, Numa balancing is disabled.
> > 
> > netperf
> > =======
> > case                    load            baseline(std%)  compare%( std%)
> > UDP_RR                  56-threads       1.00 (  1.34)   +1.05 (  1.04)
> > UDP_RR                  112-threads      1.00 (  7.94)   -0.68 ( 14.42)
> > UDP_RR                  168-threads      1.00 ( 33.17)  +49.63 (  5.96)
> > UDP_RR                  224-threads      1.00 ( 13.52)  +122.53 ( 18.50)
> > 
> > Noticeable improvements of netperf is observed in 168 and 224 threads
> > cases.
> > 
> > hackbench
> > =========
> > case                    load            baseline(std%)  compare%( std%)
> > process-pipe            1-groups         1.00 (  5.61)   -4.69 (  1.48)
> > process-pipe            2-groups         1.00 (  8.74)   -0.24 (  3.10)
> > process-pipe            4-groups         1.00 (  3.52)   +1.61 (  4.41)
> > process-sockets         1-groups         1.00 (  4.73)   +2.32 (  0.95)
> > process-sockets         2-groups         1.00 (  1.27)   -3.29 (  0.97)
> > process-sockets         4-groups         1.00 (  0.09)   +0.24 (  0.09)
> > threads-pipe            1-groups         1.00 ( 10.44)   -5.88 (  1.49)
> > threads-pipe            2-groups         1.00 ( 19.15)   +5.31 ( 12.90)
> > threads-pipe            4-groups         1.00 (  1.74)   -5.01 (  6.44)
> > threads-sockets         1-groups         1.00 (  1.58)   -1.79 (  0.43)
> > threads-sockets         2-groups         1.00 (  1.19)   -8.43 (  6.91)
> > threads-sockets         4-groups         1.00 (  0.10)   -0.09 (  0.07)
> > 
> > schbench(old)
> > ========
> > case                    load            baseline(std%)  compare%( std%)
> > normal                  1-mthreads       1.00 (  0.63)   +1.28 (  0.37)
> > normal                  2-mthreads       1.00 (  8.33)   +1.58 (  2.83)
> > normal                  4-mthreads       1.00 (  2.48)   -2.98 (  3.28)
> > normal                  8-mthreads       1.00 (  3.97)   +5.01 (  1.28)
> > 
> > No much difference is observed in hackbench/schbench, due to the
> > run-to-run variance.
> > 
> > Link: https://lore.kernel.org/lkml/20230905171105.1005672-2-mathieu.desnoyers@efficios.com/ #1
> > Suggested-by: Tim Chen <tim.c.chen@intel.com>
> > Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> > ---
> >   kernel/sched/fair.c     | 30 +++++++++++++++++++++++++++---
> >   kernel/sched/features.h |  1 +
> >   kernel/sched/sched.h    |  1 +
> >   3 files changed, 29 insertions(+), 3 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index e20f50726ab8..fe3b760c9654 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >   	hrtick_update(rq);
> >   	now = sched_clock_cpu(cpu_of(rq));
> >   	p->se.prev_sleep_time = task_sleep ? now : 0;
> > +#ifdef CONFIG_SMP
> > +	/*
> > +	 * If this rq will become idle, and dequeued task is
> > +	 * a short sleeping one, check if we can reserve
> > +	 * this idle CPU for that task for a short while.
> > +	 * During this reservation period, other wakees will
> > +	 * skip this 'idle' CPU in select_idle_cpu(), and this
> > +	 * short sleeping task can pick its previous CPU in
> > +	 * select_idle_sibling(), which brings better cache
> > +	 * locality.
> > +	 */
> > +	if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
> > +	    p->se.sleep_avg && p->se.sleep_avg < sysctl_sched_migration_cost)
> > +		rq->cache_hot_timeout = now + p->se.sleep_avg;
> 
> This is really cool!
> 
> There is one scenario that worries me with this approach: workloads
> that sleep for a long time and then have short blocked periods.
> The long sleeps will likely bring the average to values too high
> to stay below sysctl_sched_migration_cost.
> 
> I wonder if changing the code above to the following would help?
> 
> if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running && p->se.sleep_avg)
> 	rq->cache_hot_timeout = now + min(sysctl_sched_migration_cost, p->se.sleep_avg);
> 
> For tasks that have a large sleep_avg, it would activate this
> "make the rq appear non-idle for rq selection" scheme for a window of
> sysctl_sched_migration_cost. If the sleep ends up being a long one,
> preventing other tasks from being migrated to this rq for a tiny
> window should not matter performance-wise. I would expect that it
> could help workloads that come in bursts.
>
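
For reference, the reservation set in the hunk quoted above would be
consumed on the wakeup side roughly as in the sketch below. This is only
an illustration of the scheme described in the changelog, not the exact
hunk from patch 2/2, and the cache_hot_cpu() helper name is hypothetical:

static bool cache_hot_cpu(int cpu)
{
	/* Only honor the reservation when SIS_CACHE is enabled. */
	if (!sched_feat(SIS_CACHE))
		return false;

	/* The CPU stays reserved until its cache_hot_timeout expires. */
	return sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout;
}

/* In the select_idle_cpu() scan, a reserved CPU is treated as busy: */
	if (cache_hot_cpu(cpu))
		continue;	/* leave it idle for the short sleeping task */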

After revisiting this, I think the min() approach above may be better,
because it considers all the historic data rather than only the short
sleep durations, unlike the following:

if ((flags & ENQUEUE_WAKEUP) && last_sleep && cpu_online(task_cpu(p)) &&
     now > last_sleep && now - last_sleep < sysctl_sched_migration_cost)
	update_avg(&p->se.burst_sleep_avg, now - last_sleep);

Say a task sleeps for 1 ms once, and in all the subsequent context switches
it sleeps much longer than 1 ms; then burst_sleep_avg will remain at
1 ms, which does not reflect the behavior of this task.

update_avg() is actually an Exponentially Weighted Moving Average:

new_avg = 0.875 * old_avg + 0.125 * latest_sleep_duration

which approximately reflects the latest 1/0.125 = 8 samples. I guess
8 is small enough to capture the burst sleep duration.
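
In C terms, the existing update_avg() helper in the scheduler is
essentially the following (shown here only for reference; the 1/8 weight
gives the 0.875/0.125 split above):

static void update_avg(u64 *avg, u64 sample)
{
	/* Move the average 1/8 of the way towards the newest sample. */
	s64 diff = sample - *avg;

	*avg += diff / 8;
}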

Unless we want to capture the scenario that, among the latest 8 sleep
events, there is 1 short sleep and 7 long sleeps, and we still want to
honor that 1 short sleep and reserve the idle CPU. Then we could either
use the min() logic you proposed, or use the logic you proposed in
another reply, enhanced by clearing that burst average sleep duration
after 8 consecutive long sleep events (a rough sketch is below), which
looks a little like overkill. Anyway, I can get some data based on
this.
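
For concreteness, the clearing variant could look roughly like the sketch
below; the long_sleep_count field is hypothetical and only used here for
illustration, it is not part of the posted patches:

if ((flags & ENQUEUE_WAKEUP) && last_sleep && cpu_online(task_cpu(p)) &&
    now > last_sleep) {
	if (now - last_sleep < sysctl_sched_migration_cost) {
		/* Short (burst) sleep: fold it into the burst average. */
		update_avg(&p->se.burst_sleep_avg, now - last_sleep);
		p->se.long_sleep_count = 0;
	} else if (++p->se.long_sleep_count >= 8) {
		/* 8 consecutive long sleeps: the task is no longer bursty. */
		p->se.burst_sleep_avg = 0;
		p->se.long_sleep_count = 0;
	}
}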

thanks,
Chenyu



^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2023-09-20 12:35 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
2023-09-11  2:49 [RFC PATCH 0/2] Makes it easier for the wakee to choose previous CPU Chen Yu
2023-09-11  2:49 ` [RFC PATCH 1/2] sched/fair: Record the average sleep time of a task Chen Yu
2023-09-11  2:50 ` [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() Chen Yu
2023-09-11  7:26   ` Aaron Lu
2023-09-11  8:40     ` Chen Yu
2023-09-13  6:22       ` Gautham R. Shenoy
2023-09-13  7:25         ` Chen Yu
2023-09-14  7:06           ` Gautham R. Shenoy
2023-09-14 12:09             ` Chen Yu
2023-09-15 15:18               ` Gautham R. Shenoy
2023-09-19  9:01                 ` Chen Yu
2023-09-11  8:29   ` K Prateek Nayak
2023-09-11 10:19     ` Chen Yu
2023-09-12  3:05       ` K Prateek Nayak
2023-09-12 12:32         ` Chen Yu
2023-09-12 14:26           ` K Prateek Nayak
2023-09-13  2:57             ` Chen Yu
2023-09-14  4:13               ` K Prateek Nayak
2023-09-14 11:01                 ` Chen Yu
2023-09-15  3:21                   ` K Prateek Nayak
2023-09-12  9:39       ` Mike Galbraith
2023-09-12 14:51         ` Chen Yu
2023-09-12  6:32     ` Mike Galbraith
2023-09-11 15:26   ` Mathieu Desnoyers
2023-09-11 15:43     ` Mathieu Desnoyers
2023-09-12 11:53       ` Chen Yu
2023-09-12 14:06         ` Mathieu Desnoyers
2023-09-12 14:14           ` Chen Yu
2023-09-12 15:18             ` Mathieu Desnoyers
2023-09-13  3:02               ` Chen Yu
2023-09-20 12:34     ` Chen Yu
2023-09-14  5:30   ` K Prateek Nayak
2023-09-14 10:43     ` Chen Yu
2023-09-15  3:37       ` K Prateek Nayak
