* [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup
@ 2023-09-26  5:10 Chen Yu
  2023-09-26  5:11 ` [PATCH 1/2] sched/fair: Record the short sleeping time of a task Chen Yu
                   ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: Chen Yu @ 2023-09-26  5:10 UTC (permalink / raw)
  To: Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar, Vincent Guittot,
	Juri Lelli
  Cc: Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	K Prateek Nayak, Gautham R . Shenoy, linux-kernel, Chen Yu,
	Chen Yu

RFC -> v1:
- drop RFC
- Only record the short sleeping time for each task, to better honor the
  burst sleeping tasks. (Mathieu Desnoyers)
- Keep the forward movement monotonic for runqueue's cache-hot timeout value.
  (Mathieu Desnoyers, Aaron Lu)
- Introduce a new helper function cache_hot_cpu() that considers
  rq->cache_hot_timeout. (Aaron Lu)
- Add analysis of why inhibiting task migration could bring better throughput
  for some benchmarks. (Gautham R. Shenoy)
- Choose the first cache-hot CPU if all idle CPUs are cache-hot in
  select_idle_cpu(), to avoid possible task stacking on the waker's CPU.
  (K Prateek Nayak)

Thanks for your comments and review!

----------------------------------------------------------------------

When task p is woken up, the scheduler leverages select_idle_sibling()
to find an idle CPU for it. p's previous CPU is usually a preference
because it can improve cache locality. However in many cases, the
previous CPU has already been taken by other wakees, thus p has to
find another idle CPU.

Inhibiting task migration while keeping the work conservation of the
scheduler could benefit many workloads. Inspired by Mathieu's
proposal to limit the task migration ratio[1], this patch considers
the task's average sleep duration. If the task is a short-sleeping one,
then tag its previous CPU as cache-hot for a short while. During this
reservation period, other wakees are not allowed to pick this idle CPU
until a timeout. Later, if the task is woken up again, it can find its
previous CPU still idle and choose it in select_idle_sibling().
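
To illustrate the bookkeeping, below is a small stand-alone user-space
sketch (hypothetical names, plain C, not the kernel implementation; the
real code is in the two patches):

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

/* default of sysctl_sched_migration_cost, in ns */
#define MIGRATION_COST_NS	500000ULL

/* per-task: average short sleep time, in ns (like se->sis_rsv_avg) */
static uint64_t task_rsv_avg;
/* per-CPU: the CPU is treated as cache-hot until this timestamp */
static uint64_t cpu_cache_hot_timeout;

/* 1/8-weight moving average, same spirit as the scheduler's update_avg() */
static void update_avg(uint64_t *avg, uint64_t sample)
{
	int64_t diff = sample - *avg;

	*avg += diff / 8;
}

/* the task woke up after sleeping for 'slept' ns */
static void record_sleep(uint64_t slept)
{
	if (slept < MIGRATION_COST_NS)
		update_avg(&task_rsv_avg, slept);	/* honor short/burst sleeps */
	else
		task_rsv_avg >>= 1;			/* penalize long sleeps */
}

/* the task goes to sleep at 'now' and leaves its CPU idle: reserve it */
static void reserve_cpu(uint64_t now)
{
	if (now + task_rsv_avg > cpu_cache_hot_timeout)
		cpu_cache_hot_timeout = now + task_rsv_avg;
}

/* other wakees skip this CPU while it is still inside the window */
static int cache_hot_cpu(uint64_t now)
{
	return now < cpu_cache_hot_timeout;
}

int main(void)
{
	record_sleep(100000);		/* two ~100us short sleeps */
	record_sleep(120000);
	reserve_cpu(1000000);		/* task sleeps again at t=1ms */

	printf("rsv_avg=%" PRIu64 " ns, still cache-hot at t=1.005ms: %d\n",
	       task_rsv_avg, cache_hot_cpu(1005000));
	return 0;
}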

This test is based on tip/sched/core, on top of
Commit afc1996859a2
("sched/fair: Ratelimit update to tg->load_avg")

Commit afc1996859a2 has significantly reduced the cost of task migration,
and SIS_CACHE further reduces that cost. SIS_CACHE shows a noticeable
throughput improvement for netperf/tbench at around 100% load.

[patch 1/2] records the task's average short sleeping time in
its sched_entity structure.
[patch 2/2] introduces SIS_CACHE to skip cache-hot idle CPUs
during wakeup.

Link: https://lore.kernel.org/lkml/20230905171105.1005672-2-mathieu.desnoyers@efficios.com/ #1

Chen Yu (2):
  sched/fair: Record the short sleeping time of a task
  sched/fair: skip the cache hot CPU in select_idle_cpu()

 include/linux/sched.h   |  3 ++
 kernel/sched/fair.c     | 86 +++++++++++++++++++++++++++++++++++++++--
 kernel/sched/features.h |  1 +
 kernel/sched/sched.h    |  1 +
 4 files changed, 87 insertions(+), 4 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/2] sched/fair: Record the short sleeping time of a task
  2023-09-26  5:10 [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup Chen Yu
@ 2023-09-26  5:11 ` Chen Yu
  2023-09-26 10:29   ` Mathieu Desnoyers
  2023-09-27  7:53   ` Aaron Lu
  2023-09-26  5:11 ` [PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() Chen Yu
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 18+ messages in thread
From: Chen Yu @ 2023-09-26  5:11 UTC (permalink / raw)
  To: Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar, Vincent Guittot,
	Juri Lelli
  Cc: Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	K Prateek Nayak, Gautham R . Shenoy, linux-kernel, Chen Yu,
	Chen Yu

During task wakeup, the wakee first checks if its previous
running CPU is idle. If yes, choose that CPU as its first
choice. However, in most cases, the wakee's previous CPU
has already been chosen by someone else, which breaks the
cache locality.

Propose a mechanism to reserve the task's previous
CPU for a short while. In this reservation period, other
tasks are not allowed to pick that CPU until a timeout.
The reservation period is defined as the average short
sleep time of the task. To be more specific, it is the
time delta between this task being dequeued and enqueued.
Only the sleep time shorter than sysctl_sched_migration_cost
will be recorded. If the sleep time is longer than
sysctl_sched_migration_cost, give the reservation period
a penalty by shrinking it to half. In this way, the 'burst'
sleeping time of the task is honored; meanwhile, if that
task becomes a long sleeper, its reservation time is
shrunk to reduce the impact on task wakeup.
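
For reference, update_avg() is the existing scheduler helper (also used
for things like rq->avg_idle); it keeps a simple moving average with a
1/8 weight, roughly:

	static inline void update_avg(u64 *avg, u64 sample)
	{
		s64 diff = sample - *avg;

		*avg += diff / 8;
	}

So a few consecutive short sleeps make se->sis_rsv_avg converge towards
the task's typical sleep length, while a single long sleep halves it.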

Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 include/linux/sched.h |  3 +++
 kernel/sched/fair.c   | 21 +++++++++++++++++++++
 2 files changed, 24 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index dc37ae787e33..4a0ac0276384 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -561,6 +561,9 @@ struct sched_entity {
 	u64				vruntime;
 	s64				vlag;
 	u64				slice;
+	u64				prev_dequeue_time;
+	/* the reservation period of this task during wakeup */
+	u64				sis_rsv_avg;
 
 	u64				nr_migrations;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d0877878bcdb..297b9470829c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6456,6 +6456,24 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	struct sched_entity *se = &p->se;
 	int idle_h_nr_running = task_has_idle_policy(p);
 	int task_new = !(flags & ENQUEUE_WAKEUP);
+	u64 last_dequeue = p->se.prev_dequeue_time;
+	u64 now = sched_clock_cpu(task_cpu(p));
+
+	/*
+	 * If the task is a short-sleeping task, there is no need
+	 * to migrate it to other CPUs. Estimate the average short sleeping
+	 * time of the wakee. This sleep time is used as a hint to reserve
+	 * the dequeued task's previous CPU for a short while. During this
+	 * reservation period, select_idle_cpu() prevents other wakees from
+	 * choosing this CPU. This could bring a better cache locality.
+	 */
+	if ((flags & ENQUEUE_WAKEUP) && last_dequeue && cpu_online(task_cpu(p)) &&
+	    now > last_dequeue) {
+		if (now - last_dequeue < sysctl_sched_migration_cost)
+			update_avg(&p->se.sis_rsv_avg, now - last_dequeue);
+		else
+			p->se.sis_rsv_avg >>= 1;
+	}
 
 	/*
 	 * The code below (indirectly) updates schedutil which looks at
@@ -6550,6 +6568,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	int task_sleep = flags & DEQUEUE_SLEEP;
 	int idle_h_nr_running = task_has_idle_policy(p);
 	bool was_sched_idle = sched_idle_rq(rq);
+	u64 now;
 
 	util_est_dequeue(&rq->cfs, p);
 
@@ -6611,6 +6630,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 dequeue_throttle:
 	util_est_update(&rq->cfs, p, task_sleep);
 	hrtick_update(rq);
+	now = sched_clock_cpu(cpu_of(rq));
+	p->se.prev_dequeue_time = task_sleep ? now : 0;
 }
 
 #ifdef CONFIG_SMP
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
  2023-09-26  5:10 [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup Chen Yu
  2023-09-26  5:11 ` [PATCH 1/2] sched/fair: Record the short sleeping time of a task Chen Yu
@ 2023-09-26  5:11 ` Chen Yu
  2023-09-27 16:11   ` Mathieu Desnoyers
  2023-09-27  8:00 ` [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup Ingo Molnar
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 18+ messages in thread
From: Chen Yu @ 2023-09-26  5:11 UTC (permalink / raw)
  To: Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar, Vincent Guittot,
	Juri Lelli
  Cc: Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	K Prateek Nayak, Gautham R . Shenoy, linux-kernel, Chen Yu,
	Chen Yu

Problem statement:
When task p is woken up, the scheduler leverages select_idle_sibling()
to find an idle CPU for it. p's previous CPU is usually a preference
because it can improve cache locality. However in many cases, the
previous CPU has already been taken by other wakees, thus p has to
find another idle CPU.

Proposal:
Inspired by Mathieu's idea[1], introduce the SIS_CACHE. It considers
the sleep time of the task for better task placement. Based on the
task's short sleeping history, keep p's previous CPU idle for a short
while. Later when p is woken up, it can choose its previous CPU in
select_idle_sibling(). When p's previous CPU is reserved, another wakee
is not allowed to choose this CPU in select_idle_cpu(). The reservation
period is set to the task's average short sleep time, i.e., se->sis_rsv_avg.

This does not break the work conservation of the scheduler, because
the wakee will still try its best to find an idle CPU. The difference is that
different idle CPUs might have different priorities.

Prateek pointed out that, with SIS_CACHE enabled, if all idle CPUs are
cache-hot, select_idle_cpu() might have to choose a non-idle target CPU,
which brings task stacking. Mitigate this by returning the first cache-hot
idle CPU if no cache-cold CPU is found.

Benchmarks:
Tested on Intel Xeon Sapphire Rapids, 56C/112T x 2 sockets. The CPU
frequency governor is set to performance, and C-states deeper than C1 are
disabled. There is a noticeable improvement of netperf/tbench under 100%
load. A 0.7% improvement is observed on an OLTP (on-line transaction
processing) benchmark.

hackbench throughput
=========
case            	load    	baseline(std%)	compare%( std%)
process-pipe    	1-groups	 1.00 (  4.34)	 +8.43 ( 10.67)
process-pipe    	2-groups	 1.00 (  7.02)	 +0.48 (  6.66)
process-pipe    	4-groups	 1.00 (  1.69)	 +3.06 (  5.26)
process-sockets 	1-groups	 1.00 (  3.35)	 +0.51 (  0.99)
process-sockets 	2-groups	 1.00 (  3.56)	 +1.56 (  0.85)
process-sockets 	4-groups	 1.00 (  0.10)	 -0.10 (  0.07)
threads-pipe    	1-groups	 1.00 (  9.86)	 +8.21 (  0.46)
threads-pipe    	2-groups	 1.00 ( 12.26)	 +1.95 (  7.06)
threads-pipe    	4-groups	 1.00 (  8.33)	 -5.81 (  7.16)
threads-sockets 	1-groups	 1.00 (  1.77)	 +0.41 (  2.29)
threads-sockets 	2-groups	 1.00 (  5.26)	 -7.86 (  9.41)
threads-sockets 	4-groups	 1.00 (  0.04)	 +0.02 (  0.08)

netperf throughput
=======
case            	load    	baseline(std%)	compare%( std%)
TCP_RR          	56-threads	 1.00 (  1.57)	 +0.08 (  1.23)
TCP_RR          	112-threads	 1.00 (  1.01)	 -0.41 (  0.28)
TCP_RR          	168-threads	 1.00 (  0.95)	 -1.00 (  0.70)
TCP_RR          	224-threads	 1.00 (  0.70)	+119.95 ( 34.26)
TCP_RR          	280-threads	 1.00 ( 16.77)	 -0.37 ( 17.40)
TCP_RR          	336-threads	 1.00 ( 20.37)	 -0.33 ( 22.35)
TCP_RR          	392-threads	 1.00 ( 28.30)	 -0.48 ( 31.66)
TCP_RR          	448-threads	 1.00 ( 39.48)	 -0.13 ( 38.72)
UDP_RR          	56-threads	 1.00 (  1.47)	 -0.09 (  1.57)
UDP_RR          	112-threads	 1.00 (  7.88)	 -0.05 (  9.98)
UDP_RR          	168-threads	 1.00 ( 10.48)	 -1.27 ( 16.89)
UDP_RR          	224-threads	 1.00 ( 15.08)	 -2.13 ( 23.49)
UDP_RR          	280-threads	 1.00 ( 21.16)	 +0.17 ( 23.05)
UDP_RR          	336-threads	 1.00 ( 24.41)	 +0.46 ( 22.11)
UDP_RR          	392-threads	 1.00 ( 27.58)	 +0.01 ( 29.47)
UDP_RR          	448-threads	 1.00 ( 37.33)	 +0.08 ( 56.11)

tbench throughput
======
case            	load    	baseline(std%)	compare%( std%)
loopback        	56-threads	 1.00 (  1.87)	 -0.45 (  2.10)
loopback        	112-threads	 1.00 (  0.21)	 -0.86 (  0.12)
loopback        	168-threads	 1.00 (  0.08)	 -2.47 (  0.21)
loopback        	224-threads	 1.00 (  2.57)	+13.39 (  0.79)
loopback        	280-threads	 1.00 (  0.20)	 -0.07 (  0.39)
loopback        	336-threads	 1.00 (  0.14)	 +0.14 (  0.22)
loopback        	392-threads	 1.00 (  0.32)	 +0.25 (  0.23)
loopback        	448-threads	 1.00 (  0.87)	 +1.45 (  0.17)

schbench 99.0th latency
========
case            	load    	baseline(std%)	compare%( std%)
normal          	1-mthreads	 1.00 (  0.36)	 +1.02 (  0.63)
normal          	2-mthreads	 1.00 (  2.44)	 -3.74 (  3.92)
normal          	4-mthreads	 1.00 (  3.14)	 +4.17 (  3.87)

Analysis:
Gautham is interested in the reason why waking up the task on its previous
CPU brings benefits. The following digs into that.

Take the netperf 224-threads case as an example. The reason why netperf
improves a lot is that fewer task migrations bring better cache locality,
as described below.

bpftrace script to track the number of task migrations every 10 seconds:
kretfunc:select_task_rq_fair
{
        $p = (struct task_struct *)args->p;
        if ($p->comm == "netperf") {
                if ($p->thread_info.cpu != retval) {
                        @wakeup_migrate_netperf = count();
                } else {
                        @wakeup_prev_netperf = count();
                }
        }
}

interval:s:10
{
        time("\n%H:%M:%S Wakeup statistics: \n");
        print(@wakeup_migrate_netperf);
        clear(@wakeup_migrate_netperf);
        print(@wakeup_prev_netperf);
        clear(@wakeup_prev_netperf);
}
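
Assuming the script above is saved as, say, wakeup_stats.bt (a name
chosen here just for illustration), it can be run as root for the
duration of the benchmark:

bpftrace wakeup_stats.bt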

NO_SIS_CACHE:
@wakeup_migrate_netperf: 48002669
@wakeup_prev_netperf:    14853103

SIS_CACHE:
@wakeup_migrate_netperf: 571
@wakeup_prev_netperf: 136620030

The task migration ratio has been reduced a lot with SIS_CACHE.

Use perf topdown to track the PMU events by running the following commands:
perf stat -M TopdownL1 -- sleep 10
perf stat -M tma_backend_bound_group -- sleep 10
perf stat -M tma_core_bound_group -- sleep 10
perf stat -M tma_ports_utilization_group -- sleep 10
perf stat -M tma_memory_bound_group -- sleep 10

NO_SIS_CACHE:
19.1 %  tma_backend_bound     <--------
        7.1 %  tma_core_bound
               0.2 %  tma_divider
               35.0 %  tma_ports_utilization
                       13.9 %  tma_ports_utilized_0
                       15.0 %  tma_ports_utilized_1
                       16.5 %  tma_ports_utilized_3m
        8.3 %  tma_memory_bound
               4.8 %  tma_dram_bound
               2.1 %  tma_l2_bound
14.4 %  tma_bad_speculation
43.0 %  tma_frontend_bound
23.4 %  tma_retiring

SIS_CACHE:
5.4 %   tma_backend_bound     <--------
        7.5 %  tma_core_bound
               0.2 %  tma_divider
               39.8 %  tma_ports_utilization
                       13.9 %  tma_ports_utilized_0
                       15.3 %  tma_ports_utilized_1
                       16.5 %  tma_ports_utilized_3m
        6.7 %  tma_memory_bound
               4.5 %  tma_dram_bound
               2.2 %  tma_l2_bound
16.6 %  tma_bad_speculation
44.7 %  tma_frontend_bound
33.3 %  tma_retiring

The tma_backend_bound ratio has decreased a lot. Within the backend-bound
metrics, tma_ports_utilized_x is the fraction of cycles in which the CPU
executed x uops per cycle across all execution ports (on this platform,
these are Logical Processor cycles). It corresponds to the RESOURCE_STALLS
cycles on this Logical Processor. This indicates that waking up a task on
its previous logical CPU reduces the cycles spent on uop execution.
Meanwhile, l2_bound is also reduced due to better L2 cache locality.

The reason why netperf improves more on TCP_RR than on UDP_RR might be
that TCP_RR has a shorter sleep time than UDP_RR (per the bpftrace
result), which is easier for SIS_CACHE to detect and take care of.
TCP has to maintain a connection between the sender and the receiver,
so the CPU needs to do more work before falling asleep. UDP finishes
sending a packet much faster and enters sleep more frequently.

Limitations:
As pointed out by Gautham, some other tasks could be woken up on that
reserved CPU due to:

  1) wake-affine choosing that CPU
  2) newidle-balance pulling the other task on that CPU
  3) !wake-affine && that CPU was also the other task's previous CPU

Currently, we do not deal with these cases and only propose a simple enough
method; there could be some incremental optimizations later.

Link: https://lore.kernel.org/lkml/20230905171105.1005672-2-mathieu.desnoyers@efficios.com/ #1
Suggested-by: Tim Chen <tim.c.chen@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 kernel/sched/fair.c     | 65 ++++++++++++++++++++++++++++++++++++++---
 kernel/sched/features.h |  1 +
 kernel/sched/sched.h    |  1 +
 3 files changed, 63 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 297b9470829c..12a8b4594dff 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6632,6 +6632,21 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	hrtick_update(rq);
 	now = sched_clock_cpu(cpu_of(rq));
 	p->se.prev_dequeue_time = task_sleep ? now : 0;
+#ifdef CONFIG_SMP
+	/*
+	 * If this rq will become idle, and if the dequeued
+	 * task is a short sleeping one, check if we can reserve
+	 * this idle CPU for that task for a short while.
+	 * During this reservation period, other wakees will
+	 * skip this 'idle' CPU in select_idle_cpu(), and this
+	 * short sleeping task can pick its previous CPU in
+	 * select_idle_sibling(), which brings better cache
+	 * locality.
+	 */
+	if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
+	    p->se.sis_rsv_avg)
+		rq->cache_hot_timeout = max(rq->cache_hot_timeout, now + p->se.sis_rsv_avg);
+#endif
 }
 
 #ifdef CONFIG_SMP
@@ -6982,6 +6997,25 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
 	return new_cpu;
 }
 
+/*
+ * If a short sleeping task has run on this idle CPU
+ * not long ago, return 1 to indicate that the CPU is still
+ * cache-hot for that task, and no one except for that task
+ * should pick this cache-hot CPU during wakeup.
+ *
+ * The CPU is expected to be idle when invoking this function.
+ */
+static int cache_hot_cpu(int cpu)
+{
+	if (!sched_feat(SIS_CACHE))
+		return 0;
+
+	if (sched_clock_cpu(cpu) >= cpu_rq(cpu)->cache_hot_timeout)
+		return 0;
+
+	return 1;
+}
+
 static inline int __select_idle_cpu(int cpu, struct task_struct *p)
 {
 	if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
@@ -7125,7 +7159,7 @@ static inline int select_idle_smt(struct task_struct *p, int target)
 static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
 {
 	struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
-	int i, cpu, idle_cpu = -1, nr = INT_MAX;
+	int i, cpu, idle_cpu = -1, nr = INT_MAX, first_hot_cpu = -1;
 	struct sched_domain_shared *sd_share;
 	struct rq *this_rq = this_rq();
 	int this = smp_processor_id();
@@ -7180,18 +7214,41 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 	for_each_cpu_wrap(cpu, cpus, target + 1) {
 		if (has_idle_core) {
 			i = select_idle_core(p, cpu, cpus, &idle_cpu);
-			if ((unsigned int)i < nr_cpumask_bits)
+			/*
+			 * Only the cache-cold idle CPU is returned. If no
+			 * cache-cold CPU is found, choose the first idle cpu
+			 * stored in idle_cpu.
+			 */
+			if ((unsigned int)i < nr_cpumask_bits && !cache_hot_cpu(i))
 				return i;
 
 		} else {
 			if (!--nr)
 				return -1;
 			idle_cpu = __select_idle_cpu(cpu, p);
-			if ((unsigned int)idle_cpu < nr_cpumask_bits)
-				break;
+			if ((unsigned int)idle_cpu < nr_cpumask_bits) {
+				if (!cache_hot_cpu(idle_cpu)) {
+					first_hot_cpu = -1;
+					break;
+				}
+
+				/*
+				 * Record the first cache-hot idle CPU
+				 * as the last resort. This is to deal
+				 * with the case that, every CPU is
+				 * cache-hot, and we want to choose an
+				 * idle CPU over a non-idle target CPU.
+				 */
+				if (first_hot_cpu == -1)
+					first_hot_cpu = idle_cpu;
+			}
 		}
 	}
 
+	/* all idle CPUs are cache-hot, choose the first cache-hot one */
+	if (first_hot_cpu != -1)
+		idle_cpu = first_hot_cpu;
+
 	if (has_idle_core)
 		set_idle_cores(target, false);
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index f770168230ae..04ed9fcf67f8 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -51,6 +51,7 @@ SCHED_FEAT(TTWU_QUEUE, true)
  */
 SCHED_FEAT(SIS_PROP, false)
 SCHED_FEAT(SIS_UTIL, true)
+SCHED_FEAT(SIS_CACHE, true)
 
 /*
  * Issue a WARN when we do multiple update_rq_clock() calls
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 887468c48ff6..7aa728289395 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1079,6 +1079,7 @@ struct rq {
 #endif
 	u64			idle_stamp;
 	u64			avg_idle;
+	u64			cache_hot_timeout;
 
 	unsigned long		wake_stamp;
 	u64			wake_avg_idle;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/2] sched/fair: Record the short sleeping time of a task
  2023-09-26  5:11 ` [PATCH 1/2] sched/fair: Record the short sleeping time of a task Chen Yu
@ 2023-09-26 10:29   ` Mathieu Desnoyers
  2023-09-27  6:17     ` Chen Yu
  2023-09-27  7:53   ` Aaron Lu
  1 sibling, 1 reply; 18+ messages in thread
From: Mathieu Desnoyers @ 2023-09-26 10:29 UTC (permalink / raw)
  To: Chen Yu, Peter Zijlstra, Ingo Molnar, Vincent Guittot, Juri Lelli
  Cc: Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	K Prateek Nayak, Gautham R . Shenoy, linux-kernel, Chen Yu

On 9/26/23 06:11, Chen Yu wrote:
> During task wakeup, the wakee firstly checks if its previous
> running CPU is idle. If yes, choose that CPU as its first
> choice. However, in most cases, the wakee's previous CPU
> could be chosen by someone else, which breaks the cache
> locality.
> 
> Proposes a mechanism to reserve the task's previous
> CPU for a short while. In this reservation period, other
> tasks are not allowed to pick that CPU until a timeout.
> The reservation period is defined as the average short
> sleep time of the task. To be more specific, it is the
> time delta between this task being dequeued and enqueued.
> Only the sleep time shorter than sysctl_sched_migration_cost
> will be recorded. If the sleep time is longer than
> sysctl_sched_migration_cost, give the reservation period
> a penalty by shrinking it to half. In this way, the 'burst'
> sleeping time of the task is honored, meanwhile, if that
> task becomes a long-sleeper, the reservation time of that
> task is shrunk to reduce the impact on task wakeup.
> 
> Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> ---
>   include/linux/sched.h |  3 +++
>   kernel/sched/fair.c   | 21 +++++++++++++++++++++
>   2 files changed, 24 insertions(+)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index dc37ae787e33..4a0ac0276384 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -561,6 +561,9 @@ struct sched_entity {
>   	u64				vruntime;
>   	s64				vlag;
>   	u64				slice;
> +	u64				prev_dequeue_time;
> +	/* the reservation period of this task during wakeup */
> +	u64				sis_rsv_avg;
>   
>   	u64				nr_migrations;
>   
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d0877878bcdb..297b9470829c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6456,6 +6456,24 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>   	struct sched_entity *se = &p->se;
>   	int idle_h_nr_running = task_has_idle_policy(p);
>   	int task_new = !(flags & ENQUEUE_WAKEUP);
> +	u64 last_dequeue = p->se.prev_dequeue_time;
> +	u64 now = sched_clock_cpu(task_cpu(p));
> +
> +	/*
> +	 * If the task is a short-sleepting task, there is no need
> +	 * to migrate it to other CPUs. Estimate the average short sleeping
> +	 * time of the wakee. This sleep time is used as a hint to reserve
> +	 * the dequeued task's previous CPU for a short while. During this
> +	 * reservation period, select_idle_cpu() prevents other wakees from
> +	 * choosing this CPU. This could bring a better cache locality.

"This could bring a better cache locality." could be rephrased as
"This improves cache locality for short-sleeping tasks."

Please add my:

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

> +	 */
> +	if ((flags & ENQUEUE_WAKEUP) && last_dequeue && cpu_online(task_cpu(p)) &&
> +	    now > last_dequeue) {
> +		if (now - last_dequeue < sysctl_sched_migration_cost)
> +			update_avg(&p->se.sis_rsv_avg, now - last_dequeue);
> +		else
> +			p->se.sis_rsv_avg >>= 1;
> +	}
>   
>   	/*
>   	 * The code below (indirectly) updates schedutil which looks at
> @@ -6550,6 +6568,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>   	int task_sleep = flags & DEQUEUE_SLEEP;
>   	int idle_h_nr_running = task_has_idle_policy(p);
>   	bool was_sched_idle = sched_idle_rq(rq);
> +	u64 now;
>   
>   	util_est_dequeue(&rq->cfs, p);
>   
> @@ -6611,6 +6630,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>   dequeue_throttle:
>   	util_est_update(&rq->cfs, p, task_sleep);
>   	hrtick_update(rq);
> +	now = sched_clock_cpu(cpu_of(rq));
> +	p->se.prev_dequeue_time = task_sleep ? now : 0;
>   }
>   
>   #ifdef CONFIG_SMP

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/2] sched/fair: Record the short sleeping time of a task
  2023-09-26 10:29   ` Mathieu Desnoyers
@ 2023-09-27  6:17     ` Chen Yu
  0 siblings, 0 replies; 18+ messages in thread
From: Chen Yu @ 2023-09-27  6:17 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Juri Lelli,
	Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	K Prateek Nayak, Gautham R . Shenoy, linux-kernel, Chen Yu

Hi Mathieu,

On 2023-09-26 at 06:29:40 -0400, Mathieu Desnoyers wrote:
> On 9/26/23 06:11, Chen Yu wrote:
> > During task wakeup, the wakee firstly checks if its previous
> > running CPU is idle. If yes, choose that CPU as its first
> > choice. However, in most cases, the wakee's previous CPU
> > could be chosen by someone else, which breaks the cache
> > locality.
> > 
> > Proposes a mechanism to reserve the task's previous
> > CPU for a short while. In this reservation period, other
> > tasks are not allowed to pick that CPU until a timeout.
> > The reservation period is defined as the average short
> > sleep time of the task. To be more specific, it is the
> > time delta between this task being dequeued and enqueued.
> > Only the sleep time shorter than sysctl_sched_migration_cost
> > will be recorded. If the sleep time is longer than
> > sysctl_sched_migration_cost, give the reservation period
> > a penalty by shrinking it to half. In this way, the 'burst'
> > sleeping time of the task is honored, meanwhile, if that
> > task becomes a long-sleeper, the reservation time of that
> > task is shrunk to reduce the impact on task wakeup.
> > 
> > Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> > Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> > ---
> >   include/linux/sched.h |  3 +++
> >   kernel/sched/fair.c   | 21 +++++++++++++++++++++
> >   2 files changed, 24 insertions(+)
> > 
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index dc37ae787e33..4a0ac0276384 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -561,6 +561,9 @@ struct sched_entity {
> >   	u64				vruntime;
> >   	s64				vlag;
> >   	u64				slice;
> > +	u64				prev_dequeue_time;
> > +	/* the reservation period of this task during wakeup */
> > +	u64				sis_rsv_avg;
> >   	u64				nr_migrations;
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index d0877878bcdb..297b9470829c 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6456,6 +6456,24 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >   	struct sched_entity *se = &p->se;
> >   	int idle_h_nr_running = task_has_idle_policy(p);
> >   	int task_new = !(flags & ENQUEUE_WAKEUP);
> > +	u64 last_dequeue = p->se.prev_dequeue_time;
> > +	u64 now = sched_clock_cpu(task_cpu(p));
> > +
> > +	/*
> > +	 * If the task is a short-sleepting task, there is no need
> > +	 * to migrate it to other CPUs. Estimate the average short sleeping
> > +	 * time of the wakee. This sleep time is used as a hint to reserve
> > +	 * the dequeued task's previous CPU for a short while. During this
> > +	 * reservation period, select_idle_cpu() prevents other wakees from
> > +	 * choosing this CPU. This could bring a better cache locality.
> 
> "This could bring a better cache locality." could be rephrased as
> "This improves cache locality for short-sleeping tasks."
>

OK, will do in the next version.
 
> Please add my:
> 
> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> 

Thanks very much!

best,
Chenyu

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/2] sched/fair: Record the short sleeping time of a task
  2023-09-26  5:11 ` [PATCH 1/2] sched/fair: Record the short sleeping time of a task Chen Yu
  2023-09-26 10:29   ` Mathieu Desnoyers
@ 2023-09-27  7:53   ` Aaron Lu
  2023-09-28  7:57     ` Chen Yu
  1 sibling, 1 reply; 18+ messages in thread
From: Aaron Lu @ 2023-09-27  7:53 UTC (permalink / raw)
  To: Chen Yu
  Cc: Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Tim Chen, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Valentin Schneider, K Prateek Nayak, Gautham R . Shenoy,
	linux-kernel, Chen Yu

On Tue, Sep 26, 2023 at 01:11:02PM +0800, Chen Yu wrote:
> During task wakeup, the wakee firstly checks if its previous
> running CPU is idle. If yes, choose that CPU as its first
> choice. However, in most cases, the wakee's previous CPU
> could be chosen by someone else, which breaks the cache
> locality.
> 
> Proposes a mechanism to reserve the task's previous
> CPU for a short while. In this reservation period, other
> tasks are not allowed to pick that CPU until a timeout.
> The reservation period is defined as the average short
> sleep time of the task. To be more specific, it is the
> time delta between this task being dequeued and enqueued.
> Only the sleep time shorter than sysctl_sched_migration_cost
> will be recorded. If the sleep time is longer than
> sysctl_sched_migration_cost, give the reservation period
> a penalty by shrinking it to half. In this way, the 'burst'
> sleeping time of the task is honored, meanwhile, if that
> task becomes a long-sleeper, the reservation time of that
> task is shrunk to reduce the impact on task wakeup.
> 
> Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> ---
>  include/linux/sched.h |  3 +++
>  kernel/sched/fair.c   | 21 +++++++++++++++++++++
>  2 files changed, 24 insertions(+)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index dc37ae787e33..4a0ac0276384 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -561,6 +561,9 @@ struct sched_entity {
>  	u64				vruntime;
>  	s64				vlag;
>  	u64				slice;
> +	u64				prev_dequeue_time;
> +	/* the reservation period of this task during wakeup */
> +	u64				sis_rsv_avg;

Nit: this info is only relevant for tasks, not groups, so it might be better
to move it to task_struct to save a little memory?

>  
>  	u64				nr_migrations;
>  
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d0877878bcdb..297b9470829c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6456,6 +6456,24 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  	struct sched_entity *se = &p->se;
>  	int idle_h_nr_running = task_has_idle_policy(p);
>  	int task_new = !(flags & ENQUEUE_WAKEUP);
> +	u64 last_dequeue = p->se.prev_dequeue_time;
> +	u64 now = sched_clock_cpu(task_cpu(p));

I think cpu_of(rq) is more clear than task_cpu(p). Using task_cpu(p)
seems to suggest task_cpu(p) can be different from cpu_of(rq).

> +
> +	/*
> +	 * If the task is a short-sleepting task, there is no need
> +	 * to migrate it to other CPUs. Estimate the average short sleeping
> +	 * time of the wakee. This sleep time is used as a hint to reserve
> +	 * the dequeued task's previous CPU for a short while. During this
> +	 * reservation period, select_idle_cpu() prevents other wakees from
> +	 * choosing this CPU. This could bring a better cache locality.
> +	 */
> +	if ((flags & ENQUEUE_WAKEUP) && last_dequeue && cpu_online(task_cpu(p)) &&

Hmm...the cpu_online() check looks weird. If the cpu is offlined, no task
will be enqueued there, right?

Thanks,
Aaron

> +	    now > last_dequeue) {
> +		if (now - last_dequeue < sysctl_sched_migration_cost)
> +			update_avg(&p->se.sis_rsv_avg, now - last_dequeue);
> +		else
> +			p->se.sis_rsv_avg >>= 1;
> +	}
>  
>  	/*
>  	 * The code below (indirectly) updates schedutil which looks at
> @@ -6550,6 +6568,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  	int task_sleep = flags & DEQUEUE_SLEEP;
>  	int idle_h_nr_running = task_has_idle_policy(p);
>  	bool was_sched_idle = sched_idle_rq(rq);
> +	u64 now;
>  
>  	util_est_dequeue(&rq->cfs, p);
>  
> @@ -6611,6 +6630,8 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  dequeue_throttle:
>  	util_est_update(&rq->cfs, p, task_sleep);
>  	hrtick_update(rq);
> +	now = sched_clock_cpu(cpu_of(rq));
> +	p->se.prev_dequeue_time = task_sleep ? now : 0;
>  }
>  
>  #ifdef CONFIG_SMP
> -- 
> 2.25.1
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup
  2023-09-26  5:10 [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup Chen Yu
  2023-09-26  5:11 ` [PATCH 1/2] sched/fair: Record the short sleeping time of a task Chen Yu
  2023-09-26  5:11 ` [PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() Chen Yu
@ 2023-09-27  8:00 ` Ingo Molnar
  2023-09-27 21:34   ` Tim Chen
  2023-09-28  8:23   ` Chen Yu
  2023-10-05  6:22 ` K Prateek Nayak
  2023-10-17  9:49 ` Madadi Vineeth Reddy
  4 siblings, 2 replies; 18+ messages in thread
From: Ingo Molnar @ 2023-09-27  8:00 UTC (permalink / raw)
  To: Chen Yu
  Cc: Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Valentin Schneider, K Prateek Nayak, Gautham R . Shenoy,
	linux-kernel, Chen Yu


* Chen Yu <yu.c.chen@intel.com> wrote:

> When task p is woken up, the scheduler leverages select_idle_sibling()
> to find an idle CPU for it. p's previous CPU is usually a preference
> because it can improve cache locality. However in many cases, the
> previous CPU has already been taken by other wakees, thus p has to
> find another idle CPU.
> 
> Inhibit the task migration while keeping the work conservation of
> scheduler could benefit many workloads. Inspired by Mathieu's
> proposal to limit the task migration ratio[1], this patch considers
> the task average sleep duration. If the task is a short sleeping one,
> then tag its previous CPU as cache hot for a short while. During this
> reservation period, other wakees are not allowed to pick this idle CPU
> until a timeout. Later if the task is woken up again, it can find its
> previous CPU still idle, and choose it in select_idle_sibling().

Yeah, so I'm not convinced about this at this stage.

By allowing a task to basically hog a CPU after it has gone idle already,
however briefly, we reduce resource utilization efficiency for the sake
of singular benchmark workloads.

In a mixed environment the cost of leaving CPUs idle longer than necessary
will show up - and none of these benchmarks show that kind of side effect
and indirect overhead.

This feature would be a lot more convincing if it tried to measure overhead
in the pathological case, not the case it's been written for.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
  2023-09-26  5:11 ` [PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() Chen Yu
@ 2023-09-27 16:11   ` Mathieu Desnoyers
  2023-09-28  7:52     ` Chen Yu
  0 siblings, 1 reply; 18+ messages in thread
From: Mathieu Desnoyers @ 2023-09-27 16:11 UTC (permalink / raw)
  To: Chen Yu, Peter Zijlstra, Ingo Molnar, Vincent Guittot, Juri Lelli
  Cc: Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	K Prateek Nayak, Gautham R . Shenoy, linux-kernel, Chen Yu

On 9/26/23 06:11, Chen Yu wrote:
> Problem statement:
> When task p is woken up, the scheduler leverages select_idle_sibling()
> to find an idle CPU for it. p's previous CPU is usually a preference
> because it can improve cache locality. However in many cases, the
> previous CPU has already been taken by other wakees, thus p has to
> find another idle CPU.
> 
> Proposal:
> Inspired by Mathieu's idea[1], introduce the SIS_CACHE. It considers
> the sleep time of the task for better task placement. Based on the
> task's short sleeping history, keep p's previous CPU idle for a short
> while. Later when p is woken up, it can choose its previous CPU in
> select_idle_sibling(). When p's previous CPU is reserved, another wakee
> is not allowed to choose this CPU in select_idle_idle(). The reservation
> period is set to the task's average short sleep time, AKA, se->sis_rsv_avg.
> 
> This does not break the work conservation of the scheduler, because
> wakee will still try its best to find an idle CPU. The difference is that
> different idle CPUs might have different priorities.
> 
> Prateek pointed out that, with SIS_CACHE enabled, if all idle CPUs are
> cache-hot, select_idle_cpu() might have to choose a non-idle target CPU,
> which brings task stacking. Mitigate this by returning the first cache-hot
> idle CPU if no cache-cold CPU is found.

I've tried your patches on my reference hackbench workload:

./hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100

Unfortunately they don't appear to help for that specific load.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup
  2023-09-27  8:00 ` [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup Ingo Molnar
@ 2023-09-27 21:34   ` Tim Chen
  2023-09-28  8:23   ` Chen Yu
  1 sibling, 0 replies; 18+ messages in thread
From: Tim Chen @ 2023-09-27 21:34 UTC (permalink / raw)
  To: Ingo Molnar, Chen Yu
  Cc: Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Valentin Schneider, K Prateek Nayak, Gautham R . Shenoy,
	linux-kernel, Chen Yu

On Wed, 2023-09-27 at 10:00 +0200, Ingo Molnar wrote:
> * Chen Yu <yu.c.chen@intel.com> wrote:
> 
> > When task p is woken up, the scheduler leverages select_idle_sibling()
> > to find an idle CPU for it. p's previous CPU is usually a preference
> > because it can improve cache locality. However in many cases, the
> > previous CPU has already been taken by other wakees, thus p has to
> > find another idle CPU.
> > 
> > Inhibit the task migration while keeping the work conservation of
> > scheduler could benefit many workloads. Inspired by Mathieu's
> > proposal to limit the task migration ratio[1], this patch considers
> > the task average sleep duration. If the task is a short sleeping one,
> > then tag its previous CPU as cache hot for a short while. During this
> > reservation period, other wakees are not allowed to pick this idle CPU
> > until a timeout. Later if the task is woken up again, it can find its
> > previous CPU still idle, and choose it in select_idle_sibling().
> 
> Yeah, so I'm not convinced about this at this stage.
> 
> By allowing a task to basically hog a CPU after it has gone idle already,
> however briefly, we reduce resource utilization efficiency for the sake
> of singular benchmark workloads.
> 
> In a mixed environment the cost of leaving CPUs idle longer than necessary
> will show up - and none of these benchmarks show that kind of side effect
> and indirect overhead.
> 
> This feature would be a lot more convincing if it tried to measure overhead
> in the pathological case, not the case it's been written for.
> 

Ingo,

Mathieu's patches on detecting overly high task migrations and then
rate limiting migration are a way to detect that tasks are getting
crazy doing CPU musical chairs and are in a pathological state.

Will the migration rate be a reasonable indicator that we need to
do something to reduce pathological migrations, like the SIS_CACHE
proposal, so the tasks don't get jerked all over?
Or do you have some other better indicators in mind?

We did some experiments on the OLTP workload on a 112 core 2 socket
SPR machine.  The OLTP workload have a mixture of threads
handling database updates on disks and handling transaction
queries over network.

For Mathieu's original task migration rate limit patches,
we saw a 1.2% improvement, and for Chen Yu's SIS_CACHE proposal, we
saw a 0.7% improvement.  The system is running at
~94% busy so is under high utilization.  The variation of this workload
is less than 0.2%. There are improvements for such a mixed workload,
though not as much as for the microbenchmarks.  These
data are preliminary and we are still doing more experiments.

For the OLTP experiments, each socket with 64 cores is divided
into sub-NUMA clusters of 4 nodes of 16 cores each, so the scheduling
overhead in the idle CPU search is much less than if SNC were off.

Thanks.

Tim

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu()
  2023-09-27 16:11   ` Mathieu Desnoyers
@ 2023-09-28  7:52     ` Chen Yu
  0 siblings, 0 replies; 18+ messages in thread
From: Chen Yu @ 2023-09-28  7:52 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Juri Lelli,
	Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	K Prateek Nayak, Gautham R . Shenoy, linux-kernel, Chen Yu

Hi Mathieu,

On 2023-09-27 at 12:11:33 -0400, Mathieu Desnoyers wrote:
> On 9/26/23 06:11, Chen Yu wrote:
> > Problem statement:
> > When task p is woken up, the scheduler leverages select_idle_sibling()
> > to find an idle CPU for it. p's previous CPU is usually a preference
> > because it can improve cache locality. However in many cases, the
> > previous CPU has already been taken by other wakees, thus p has to
> > find another idle CPU.
> > 
> > Proposal:
> > Inspired by Mathieu's idea[1], introduce the SIS_CACHE. It considers
> > the sleep time of the task for better task placement. Based on the
> > task's short sleeping history, keep p's previous CPU idle for a short
> > while. Later when p is woken up, it can choose its previous CPU in
> > select_idle_sibling(). When p's previous CPU is reserved, another wakee
> > is not allowed to choose this CPU in select_idle_idle(). The reservation
> > period is set to the task's average short sleep time, AKA, se->sis_rsv_avg.
> > 
> > This does not break the work conservation of the scheduler, because
> > wakee will still try its best to find an idle CPU. The difference is that
> > different idle CPUs might have different priorities.
> > 
> > Prateek pointed out that, with SIS_CACHE enabled, if all idle CPUs are
> > cache-hot, select_idle_cpu() might have to choose a non-idle target CPU,
> > which brings task stacking. Mitigate this by returning the first cache-hot
> > idle CPU if no cache-cold CPU is found.
> 
> I've tried your patches on my reference hackbench workload:
> 
> ./hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100
> 
> Unfortunately they don't appear to help for that specific load.
>

I just ran the same test on a 224 CPU system and it seems that
there is not much difference with/without SIS_CACHE. To figure out
the reason, I used bpftrace to track how often hackbench is woken
up on its previous CPU:

kretfunc:select_task_rq_fair
{
        $p = (struct task_struct *)args->p;
        if ($p->comm == "sender") {
                if ($p->thread_info.cpu != retval) {
                        @wakeup_migrate_sender = count();
                } else {
                        @wakeup_prev_sender = count();
                }
        }

        if ($p->comm == "receiver") {
                if ($p->thread_info.cpu != retval) {
                        @wakeup_migrate_receiver = count();
                } else {
                        @wakeup_prev_receiver = count();
                }
        }
}

and print the data every 10 seconds:
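
The periodic dump mirrors the interval:s:10 probe from the 2/2
changelog; a sketch covering the four maps above:

interval:s:10
{
        time("\n%H:%M:%S Wakeup statistics: \n");
        print(@wakeup_migrate_sender);
        clear(@wakeup_migrate_sender);
        print(@wakeup_prev_sender);
        clear(@wakeup_prev_sender);
        print(@wakeup_migrate_receiver);
        clear(@wakeup_migrate_receiver);
        print(@wakeup_prev_receiver);
        clear(@wakeup_prev_receiver);
}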

NO_SIS_CACHE:
23:50:24 Wakeup statistics: 
@wakeup_migrate_sender: 9043961
@wakeup_prev_sender: 20073128
@wakeup_migrate_receiver: 12071462
@wakeup_prev_receiver: 19587895

sender: migration/previous = 45.06%
receiver: migration/previous = 61.612%


SIS_CACHE:
23:49:21 Wakeup statistics: 
@wakeup_migrate_sender: 6716902
@wakeup_prev_sender: 22727513
@wakeup_migrate_receiver: 11547623
@wakeup_prev_receiver: 24615810
 
sender: migration/previous = 29.55%
receiver: migration/previous = 46.91%

Both the sender and the receiver in hackbench have a higher chance
of being woken up on their previous CPU, but not as much as netperf.
Why is there not much score difference? I checked the bottleneck
via perf topdown.

perf stat -M TopdownL1 -- sleep 10
perf stat -M tma_frontend_bound_group -- sleep 10
perf stat -M tma_fetch_latency_group -- sleep 10

NO_SIS_CACHE:
15.2 %  tma_backend_bound
14.9 %  tma_bad_speculation
43.9 %  tma_frontend_bound
        30.3 %  tma_fetch_latency
                9.7 %  tma_ms_switches
        14.0 %  tma_fetch_bandwidth
26.1 %  tma_retiring


SIS_CACHE:
14.5 %  tma_backend_bound
15.3 %  tma_bad_speculation
44.5 %  tma_frontend_bound
        31.5 %  tma_fetch_latency
                10.6 %  tma_ms_switches
        13.0 %  tma_fetch_bandwidth
25.8 %  tma_retiring

The ratios do not change much with/without SIS_CACHE enabled.
This is because SIS_CACHE might bring benefit if tasks have a large
cache footprint (backend bound, like netperf), but it seems that hackbench
pipe mode is frontend bound and its bottleneck is the complexity of the
instructions being executed (tma_ms_switches: the MS decodes complex
instructions, and an increase of the MS switch counter usually means that
the workload is running some complex instructions). That is to say, the
pipe_read/write code path could be the bottleneck. Your original
rate limit on task migration might be more aggressive in reducing
the tma_backend_bound ratio, and that might bring the score benefit,
like in the netperf case. Let me apply your original patch to confirm
if this is the case.

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/2] sched/fair: Record the short sleeping time of a task
  2023-09-27  7:53   ` Aaron Lu
@ 2023-09-28  7:57     ` Chen Yu
  0 siblings, 0 replies; 18+ messages in thread
From: Chen Yu @ 2023-09-28  7:57 UTC (permalink / raw)
  To: Aaron Lu
  Cc: Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Tim Chen, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Valentin Schneider, K Prateek Nayak, Gautham R . Shenoy,
	linux-kernel, Chen Yu

Hi Aaron,

On 2023-09-27 at 15:53:33 +0800, Aaron Lu wrote:
> On Tue, Sep 26, 2023 at 01:11:02PM +0800, Chen Yu wrote:
> > During task wakeup, the wakee firstly checks if its previous
> > running CPU is idle. If yes, choose that CPU as its first
> > choice. However, in most cases, the wakee's previous CPU
> > could be chosen by someone else, which breaks the cache
> > locality.
> > 
> > Proposes a mechanism to reserve the task's previous
> > CPU for a short while. In this reservation period, other
> > tasks are not allowed to pick that CPU until a timeout.
> > The reservation period is defined as the average short
> > sleep time of the task. To be more specific, it is the
> > time delta between this task being dequeued and enqueued.
> > Only the sleep time shorter than sysctl_sched_migration_cost
> > will be recorded. If the sleep time is longer than
> > sysctl_sched_migration_cost, give the reservation period
> > a penalty by shrinking it to half. In this way, the 'burst'
> > sleeping time of the task is honored, meanwhile, if that
> > task becomes a long-sleeper, the reservation time of that
> > task is shrunk to reduce the impact on task wakeup.
> > 
> > Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
> > Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> > ---
> >  include/linux/sched.h |  3 +++
> >  kernel/sched/fair.c   | 21 +++++++++++++++++++++
> >  2 files changed, 24 insertions(+)
> > 
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index dc37ae787e33..4a0ac0276384 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -561,6 +561,9 @@ struct sched_entity {
> >  	u64				vruntime;
> >  	s64				vlag;
> >  	u64				slice;
> > +	u64				prev_dequeue_time;
> > +	/* the reservation period of this task during wakeup */
> > +	u64				sis_rsv_avg;
> 
> Nit: these info are only relavant for task, not group so might be better
> to move them to task_struct to save a little memory?
>

Yes, I'll try to do this.
 
> >  
> >  	u64				nr_migrations;
> >  
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index d0877878bcdb..297b9470829c 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6456,6 +6456,24 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >  	struct sched_entity *se = &p->se;
> >  	int idle_h_nr_running = task_has_idle_policy(p);
> >  	int task_new = !(flags & ENQUEUE_WAKEUP);
> > +	u64 last_dequeue = p->se.prev_dequeue_time;
> > +	u64 now = sched_clock_cpu(task_cpu(p));
> 
> I think cpu_of(rq) is more clear than task_cpu(p). Using task_cpu(p)
> seems to suggest task_cpu(p) can be different from cpu_of(rq).
>

You are right. My original thought was to use the task's previous CPU rather
than the current rq's CPU. But at this stage the task's CPU has already been
updated to be the same as the rq's. I'll think more about how to deal with
this properly.
 
> > +
> > +	/*
> > +	 * If the task is a short-sleepting task, there is no need
> > +	 * to migrate it to other CPUs. Estimate the average short sleeping
> > +	 * time of the wakee. This sleep time is used as a hint to reserve
> > +	 * the dequeued task's previous CPU for a short while. During this
> > +	 * reservation period, select_idle_cpu() prevents other wakees from
> > +	 * choosing this CPU. This could bring a better cache locality.
> > +	 */
> > +	if ((flags & ENQUEUE_WAKEUP) && last_dequeue && cpu_online(task_cpu(p)) &&
> 
> Hmm...the cpu_online() check looks weird. If the cpu is offlined, no task
> will be enqueued there, right?
>

Right. If rq and the task's CPU are the same, there is no need to check cpu_online.

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup
  2023-09-27  8:00 ` [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup Ingo Molnar
  2023-09-27 21:34   ` Tim Chen
@ 2023-09-28  8:23   ` Chen Yu
  1 sibling, 0 replies; 18+ messages in thread
From: Chen Yu @ 2023-09-28  8:23 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Valentin Schneider, K Prateek Nayak, Gautham R . Shenoy,
	linux-kernel, Chen Yu

Hi Ingo,

On 2023-09-27 at 10:00:11 +0200, Ingo Molnar wrote:
> 
> * Chen Yu <yu.c.chen@intel.com> wrote:
> 
> > When task p is woken up, the scheduler leverages select_idle_sibling()
> > to find an idle CPU for it. p's previous CPU is usually a preference
> > because it can improve cache locality. However in many cases, the
> > previous CPU has already been taken by other wakees, thus p has to
> > find another idle CPU.
> > 
> > Inhibit the task migration while keeping the work conservation of
> > scheduler could benefit many workloads. Inspired by Mathieu's
> > proposal to limit the task migration ratio[1], this patch considers
> > the task average sleep duration. If the task is a short sleeping one,
> > then tag its previous CPU as cache hot for a short while. During this
> > reservation period, other wakees are not allowed to pick this idle CPU
> > until a timeout. Later if the task is woken up again, it can find its
> > previous CPU still idle, and choose it in select_idle_sibling().
> 
> Yeah, so I'm not convinced about this at this stage.
> 
> By allowing a task to basically hog a CPU after it has gone idle already,
> however briefly, we reduce resource utilization efficiency for the sake
> of singular benchmark workloads.
>

Currently, in the code we do not really reserve the idle CPU or force it
to be idle. We just give other wakees a suggested search order for finding
an idle CPU. If all idle CPUs are in the reserved state, the first reserved
idle CPU will be picked rather than left idle. This fully utilizes the
idle CPU resource. The main impact is on wakeup latency, if I understand
correctly. Let me run the latest schbench and monitor these latency
statistics in detail.
 
> In a mixed environment the cost of leaving CPUs idle longer than necessary
> will show up - and none of these benchmarks show that kind of side effect
> and indirect overhead.
> 
> This feature would be a lot more convincing if it tried to measure overhead
> in the pathological case, not the case it's been written for.
>

Thanks for the suggestion, Ingo. Yes, we should launch more tests to evaluate this
proposal. As Tim mentioned, we have previously tested it using an OLTP benchmark
as described in PATCH [2/2]. I'm thinking of running more benchmarks to get
a wider understanding of how this change would impact them, both the positive
and the negative parts.

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup
  2023-09-26  5:10 [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup Chen Yu
                   ` (2 preceding siblings ...)
  2023-09-27  8:00 ` [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup Ingo Molnar
@ 2023-10-05  6:22 ` K Prateek Nayak
  2023-10-07  3:23   ` Chen Yu
  2023-10-17  9:49 ` Madadi Vineeth Reddy
  4 siblings, 1 reply; 18+ messages in thread
From: K Prateek Nayak @ 2023-10-05  6:22 UTC (permalink / raw)
  To: Chen Yu, Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar,
	Vincent Guittot, Juri Lelli
  Cc: Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	Gautham R . Shenoy, linux-kernel, Chen Yu

Hello Chenyu,

On 9/26/2023 10:40 AM, Chen Yu wrote:
> RFC -> v1:
> - drop RFC
> - Only record the short sleeping time for each task, to better honor the
>   burst sleeping tasks. (Mathieu Desnoyers)
> - Keep the forward movement monotonic for runqueue's cache-hot timeout value.
>   (Mathieu Desnoyers, Aaron Lu)
> - Introduce a new helper function cache_hot_cpu() that considers
>   rq->cache_hot_timeout. (Aaron Lu)
> - Add analysis of why inhibiting task migration could bring better throughput
>   for some benchmarks. (Gautham R. Shenoy)
> - Choose the first cache-hot CPU, if all idle CPUs are cache-hot in
>   select_idle_cpu(). To avoid possible task stacking on the waker's CPU.
>   (K Prateek Nayak)
> 
> Thanks for your comments and review!

Sorry for the delay! I'll leave the test results from a 3rd Generation
EPYC system below.

tl;dr

- Small regression in tbench and netperf, possibly due to more searching
  for an idle CPU.

- Small regression in schbench (old) at 256 workers albeit with large
  run to run variance.

- Other benchmarks are more or less same.

I'll leave the full results below.

o System details

- 3rd Generation EPYC System
- 2 sockets each with 64C/128T
- NPS1 (Each socket is a NUMA node)
- Boost enabled, C2 Disabled (POLL and MWAIT based C1 remained enabled)


o Kernel Details

- tip:	tip:sched/core at commit 5fe7765997b1 (sched/deadline: Make
	dl_rq->pushable_dl_tasks update drive dl_rq->overloaded)

- SIS_CACHE: tip + this series


o Benchmark results

==================================================================
Test          : hackbench
Units         : Normalized time in seconds
Interpretation: Lower is better
Statistic     : AMean
==================================================================
Case:           tip[pct imp](CV)     SIS_CACHE[pct imp](CV)
 1-groups     1.00 [ -0.00]( 2.36)     1.01 [ -1.47]( 3.02)
 2-groups     1.00 [ -0.00]( 2.35)     0.99 [  0.92]( 1.01)
 4-groups     1.00 [ -0.00]( 1.79)     0.98 [  2.34]( 0.63)
 8-groups     1.00 [ -0.00]( 0.84)     0.98 [  1.73]( 1.02)
16-groups     1.00 [ -0.00]( 2.39)     0.97 [  2.76]( 2.33)


==================================================================
Test          : tbench
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:    tip[pct imp](CV)      SIS_CACHE[pct imp](CV)
    1     1.00 [  0.00]( 0.86)     0.97 [ -2.68]( 0.74)
    2     1.00 [  0.00]( 0.99)     0.98 [ -2.18]( 0.17)
    4     1.00 [  0.00]( 0.49)     0.98 [ -2.47]( 1.15)
    8     1.00 [  0.00]( 0.96)     0.96 [ -3.81]( 0.24)
   16     1.00 [  0.00]( 1.38)     0.96 [ -4.33]( 1.31)
   32     1.00 [  0.00]( 1.64)     0.95 [ -4.70]( 1.59)
   64     1.00 [  0.00]( 0.92)     0.97 [ -2.97]( 0.49)
  128     1.00 [  0.00]( 0.57)     0.99 [ -1.15]( 0.57)
  256     1.00 [  0.00]( 0.38)     1.00 [  0.03]( 0.79)
  512     1.00 [  0.00]( 0.04)     1.00 [  0.43]( 0.34)
 1024     1.00 [  0.00]( 0.20)     1.00 [  0.41]( 0.13)


==================================================================
Test          : stream-10
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:       tip[pct imp](CV)      SIS_CACHE[pct imp](CV)
 Copy     1.00 [  0.00]( 2.52)     0.93 [ -6.90]( 6.75)
Scale     1.00 [  0.00]( 6.38)     0.99 [ -1.18]( 7.45)
  Add     1.00 [  0.00]( 6.54)     0.97 [ -2.55]( 7.34)
Triad     1.00 [  0.00]( 5.18)     0.95 [ -4.64]( 6.81)


==================================================================
Test          : stream-100
Units         : Normalized Bandwidth, MB/s
Interpretation: Higher is better
Statistic     : HMean
==================================================================
Test:       tip[pct imp](CV)      SIS_CACHE[pct imp](CV)
 Copy     1.00 [  0.00]( 0.74)     1.00 [ -0.20]( 1.69)
Scale     1.00 [  0.00]( 6.25)     1.03 [  3.46]( 0.55)
  Add     1.00 [  0.00]( 6.53)     1.05 [  4.58]( 0.43)
Triad     1.00 [  0.00]( 5.14)     0.98 [ -1.78]( 6.24)


==================================================================
Test          : netperf
Units         : Normalized Throughput
Interpretation: Higher is better
Statistic     : AMean
==================================================================
Clients:         tip[pct imp](CV)      SIS_CACHE[pct imp](CV)
 1-clients     1.00 [  0.00]( 0.27)     0.98 [ -1.50]( 0.14)
 2-clients     1.00 [  0.00]( 1.32)     0.98 [ -2.35]( 0.54)
 4-clients     1.00 [  0.00]( 0.40)     0.98 [ -2.35]( 0.56)
 8-clients     1.00 [  0.00]( 0.97)     0.97 [ -2.72]( 0.50)
16-clients     1.00 [  0.00]( 0.54)     0.96 [ -3.92]( 0.86)
32-clients     1.00 [  0.00]( 1.38)     0.97 [ -3.10]( 0.44)
64-clients     1.00 [  0.00]( 1.78)     0.97 [ -3.44]( 1.70)
128-clients    1.00 [  0.00]( 1.09)     0.94 [ -5.75]( 2.67)
256-clients    1.00 [  0.00]( 4.45)     0.97 [ -2.61]( 4.93)
512-clients    1.00 [  0.00](54.70)     0.98 [ -1.64](55.09)


==================================================================
Test          : schbench
Units         : Normalized 99th percentile latency in us
Interpretation: Lower is better
Statistic     : Median
==================================================================
#workers:  tip[pct imp](CV)     SIS_CACHE[pct imp](CV)
  1     1.00 [ -0.00]( 3.95)     0.97 [  2.56](10.42)
  2     1.00 [ -0.00]( 5.89)     0.83 [ 16.67](22.56)
  4     1.00 [ -0.00](14.28)     1.00 [ -0.00](14.75)
  8     1.00 [ -0.00]( 4.90)     0.84 [ 15.69]( 6.01)
 16     1.00 [ -0.00]( 4.15)     1.00 [ -0.00]( 4.41)
 32     1.00 [ -0.00]( 5.10)     1.01 [ -1.10]( 3.44)
 64     1.00 [ -0.00]( 2.69)     1.04 [ -3.72]( 2.57)
128     1.00 [ -0.00]( 2.63)     0.94 [  6.29]( 2.55)
256     1.00 [ -0.00](26.75)     1.51 [-50.57](11.40)
512     1.00 [ -0.00]( 2.93)     0.96 [  3.52]( 3.56)

==================================================================
Test          : ycsb-cassandra
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : Mean
==================================================================
Metric          tip     SIS_CACHE(pct imp)
Throughput      1.00    1.00 (%diff: 0.27%)


==================================================================
Test          : ycsb-mongodb
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : Mean
==================================================================
Metric          tip      SIS_CACHE(pct imp)
Throughput      1.00    1.00 (%diff: -0.45%)


==================================================================
Test          : DeathStarBench
Units         : Normalized throughput
Interpretation: Higher is better
Statistic     : Mean
==================================================================
Pinning      scaling     tip                SIS_CACHE(pct imp)
 1CCD           1        1.00              1.00 (%diff: -0.47%)
 2CCD           2        1.00              0.98 (%diff: -2.34%)
 4CCD           4        1.00              1.00 (%diff: -0.29%)
 8CCD           8        1.00              1.01 (%diff: 0.54%)

> 
> ----------------------------------------------------------------------
> 
> [..snip..]
> 

--
Thanks and Regards,
Prateek

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup
  2023-10-05  6:22 ` K Prateek Nayak
@ 2023-10-07  3:23   ` Chen Yu
  0 siblings, 0 replies; 18+ messages in thread
From: Chen Yu @ 2023-10-07  3:23 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Valentin Schneider, Gautham R . Shenoy, linux-kernel, Chen Yu

Hi Prateek,

On 2023-10-05 at 11:52:13 +0530, K Prateek Nayak wrote:
> Hello Chenyu,
> 
> On 9/26/2023 10:40 AM, Chen Yu wrote:
> > RFC -> v1:
> > - drop RFC
> > - Only record the short sleeping time for each task, to better honor the
> >   burst sleeping tasks. (Mathieu Desnoyers)
> > - Keep the forward movement monotonic for runqueue's cache-hot timeout value.
> >   (Mathieu Desnoyers, Aaron Lu)
> > - Introduce a new helper function cache_hot_cpu() that considers
> >   rq->cache_hot_timeout. (Aaron Lu)
> > - Add analysis of why inhibiting task migration could bring better throughput
> >   for some benchmarks. (Gautham R. Shenoy)
> > - Choose the first cache-hot CPU, if all idle CPUs are cache-hot in
> >   select_idle_cpu(). To avoid possible task stacking on the waker's CPU.
> >   (K Prateek Nayak)
> > 
> > Thanks for your comments and review!
> 
> Sorry for the delay! I'll leave the test results from a 3rd Generation
> EPYC system below.
> 
> tl;dr
> 
> - Small regression in tbench and netperf possible due to more searching
>   for an idle CPU.
> 
> - Small regression in schbench (old) at 256 workers albeit with large
>   run to run variance.
> 
> - Other benchmarks are more or less same.
> 
> Test          : schbench
> Units         : Normalized 99th percentile latency in us
> Interpretation: Lower is better
> Statistic     : Median
> ==================================================================
> #workers:  tip[pct imp](CV)     SIS_CACHE[pct imp](CV)
>   1     1.00 [ -0.00]( 3.95)     0.97 [  2.56](10.42)
>   2     1.00 [ -0.00]( 5.89)     0.83 [ 16.67](22.56)
>   4     1.00 [ -0.00](14.28)     1.00 [ -0.00](14.75)
>   8     1.00 [ -0.00]( 4.90)     0.84 [ 15.69]( 6.01)
>  16     1.00 [ -0.00]( 4.15)     1.00 [ -0.00]( 4.41)
>  32     1.00 [ -0.00]( 5.10)     1.01 [ -1.10]( 3.44)
>  64     1.00 [ -0.00]( 2.69)     1.04 [ -3.72]( 2.57)
> 128     1.00 [ -0.00]( 2.63)     0.94 [  6.29]( 2.55)
> 256     1.00 [ -0.00](26.75)     1.51 [-50.57](11.40)

Thanks for the testing. The latency regression from schbench is quite
obvious, and as you mentioned, it is possibly caused by the longer scan
time in select_idle_cpu(). I'll run the same test on a machine with a
split LLC to see whether I can reproduce the issue.
I'm also working with Mathieu on another direction: choosing the previous
CPU over the current CPU when the system is overloaded. That should be
more moderate, and I'll post the test results later.
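
For reference, the general shape of that direction could look like the sketch
below (purely illustrative, not Mathieu's actual patch; prefer_prev_cpu() is a
made-up helper and the nr_running test is a placeholder for the real overload
heuristic):

/*
 * Minimal sketch, assuming kernel context: keep the wakee on its
 * previous CPU either when that CPU is idle, or when both the waker's
 * CPU and the previous CPU are busy (system overloaded), so cache
 * locality is preserved instead of migrating the task.
 */
static int prefer_prev_cpu(int this_cpu, int prev_cpu)
{
	if (available_idle_cpu(prev_cpu))
		return prev_cpu;

	if (cpu_rq(this_cpu)->nr_running > 1 &&
	    cpu_rq(prev_cpu)->nr_running > 1)
		return prev_cpu;

	return this_cpu;
}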

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup
  2023-09-26  5:10 [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup Chen Yu
                   ` (3 preceding siblings ...)
  2023-10-05  6:22 ` K Prateek Nayak
@ 2023-10-17  9:49 ` Madadi Vineeth Reddy
  2023-10-17 11:09   ` Chen Yu
  4 siblings, 1 reply; 18+ messages in thread
From: Madadi Vineeth Reddy @ 2023-10-17  9:49 UTC (permalink / raw)
  To: Chen Yu, Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar,
	Vincent Guittot, Juri Lelli
  Cc: Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	K Prateek Nayak, Gautham R . Shenoy, linux-kernel, Chen Yu

Hi Chen Yu,

On 26/09/23 10:40, Chen Yu wrote:
> RFC -> v1:
> - drop RFC
> - Only record the short sleeping time for each task, to better honor the
>   burst sleeping tasks. (Mathieu Desnoyers)
> - Keep the forward movement monotonic for runqueue's cache-hot timeout value.
>   (Mathieu Desnoyers, Aaron Lu)
> - Introduce a new helper function cache_hot_cpu() that considers
>   rq->cache_hot_timeout. (Aaron Lu)
> - Add analysis of why inhibiting task migration could bring better throughput
>   for some benchmarks. (Gautham R. Shenoy)
> - Choose the first cache-hot CPU, if all idle CPUs are cache-hot in
>   select_idle_cpu(). To avoid possible task stacking on the waker's CPU.
>   (K Prateek Nayak)
> 
> Thanks for your comments and review!
> 
> ----------------------------------------------------------------------

Regarding the trade-off between a longer scan for an idle CPU and the cache
benefits, I ran some benchmarks.

Tested the patch on a Power system with 12 cores, 96 CPUs in total.
The system has two NUMA nodes.

Below are some of the benchmark results

schbench 99.0th latency (lower is better)
========
case            load        	baseline[pct imp](std%)       SIS_CACHE[pct imp]( std%)
normal          1-mthreads      1.00 [ 0.00]( 3.66)            1.00 [  0.00]( 1.71)
normal          2-mthreads      1.00 [ 0.00]( 4.55)            1.02 [ -2.00]( 3.00)
normal          4-mthreads      1.00 [ 0.00]( 4.77)            0.96 [ +4.00]( 4.27)
normal          6-mthreads      1.00 [ 0.00]( 60.37)           2.66 [ -166.00]( 23.67)


The schbench results show that there is not much impact on wakeup latencies from
the extra iterations spent searching for an idle CPU in the select_idle_cpu() code path;
interestingly, the numbers are slightly better for SIS_CACHE in the 4-mthreads case.
I think we can ignore the last case due to huge run-to-run variations.

producer_consumer avg time/access (lower is better)
========
loads per consumer iteration   baseline[pct imp](std%)         SIS_CACHE[pct imp]( std%)
5                  		1.00 [ 0.00]( 0.00)            0.87 [ +13.0]( 1.92)
20                   		1.00 [ 0.00]( 0.00)            0.92 [ +8.00]( 0.00)
50                    		1.00 [ 0.00]( 0.00)            1.00 [  0.00]( 0.00)
100                    		1.00 [ 0.00]( 0.00)            1.00 [  0.00]( 0.00)

The main goal of the patch, improving cache locality, shows up in this workload:
it is the only one where SIS_CACHE improves, mainly when the loads per consumer
iteration are lower.

hackbench normalized time in seconds (lower is better)
========
case            load        baseline[pct imp](std%)         SIS_CACHE[pct imp]( std%)
process-pipe    1-groups     1.00 [ 0.00]( 1.50)            1.02 [ -2.00]( 3.36)
process-pipe    2-groups     1.00 [ 0.00]( 4.76)            0.99 [ +1.00]( 5.68)
process-sockets 1-groups     1.00 [ 0.00]( 2.56)            1.00 [  0.00]( 0.86)
process-sockets 2-groups     1.00 [ 0.00]( 0.50)            0.99 [ +1.00]( 0.96)
threads-pipe    1-groups     1.00 [ 0.00]( 3.87)            0.71 [ +29.0]( 3.56)
threads-pipe    2-groups     1.00 [ 0.00]( 1.60)            0.97 [ +3.00]( 3.44)
threads-sockets 1-groups     1.00 [ 0.00]( 7.65)            0.99 [ +1.00]( 1.05)
threads-sockets 2-groups     1.00 [ 0.00]( 3.12)            1.03 [ -3.00]( 1.70)

The hackbench results are similar on both kernels, except for a 29% improvement
in the threads-pipe case with 1 group.

Daytrader throughput (higher is better)
========

As per Ingo's suggestion, I ran a real-life workload, daytrader.

baseline:
=================================================================================== 
 Instance      1
     Throughputs         Ave. Resp. Time   Min. Resp. Time   Max. Resp. Time
  ================       ===============   ===============   ===============
       10124.5 			    2 		    0 		   3970

SIS_CACHE:
===================================================================================
 Instance      1
     Throughputs         Ave. Resp. Time   Min. Resp. Time   Max. Resp. Time
  ================       ===============   ===============   ===============
       10319.5                       2               0              5771

In the above run, daytrader performance was 2% better with SIS_CACHE.

Thanks and Regards 
Madadi Vineeth Reddy


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup
  2023-10-17  9:49 ` Madadi Vineeth Reddy
@ 2023-10-17 11:09   ` Chen Yu
  2023-10-18 19:32     ` Madadi Vineeth Reddy
  0 siblings, 1 reply; 18+ messages in thread
From: Chen Yu @ 2023-10-17 11:09 UTC (permalink / raw)
  To: cover.1695704179.git.yu.c.chen
  Cc: Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Valentin Schneider, K Prateek Nayak, Gautham R . Shenoy,
	Madadi Vineeth Reddy, linux-kernel, Chen Yu

Hi Madadi,

On 2023-10-17 at 15:19:24 +0530, Madadi Vineeth Reddy wrote:
> Hi Chen Yu,
> 
> On 26/09/23 10:40, Chen Yu wrote:
> > RFC -> v1:
> > - drop RFC
> > - Only record the short sleeping time for each task, to better honor the
> >   burst sleeping tasks. (Mathieu Desnoyers)
> > - Keep the forward movement monotonic for runqueue's cache-hot timeout value.
> >   (Mathieu Desnoyers, Aaron Lu)
> > - Introduce a new helper function cache_hot_cpu() that considers
> >   rq->cache_hot_timeout. (Aaron Lu)
> > - Add analysis of why inhibiting task migration could bring better throughput
> >   for some benchmarks. (Gautham R. Shenoy)
> > - Choose the first cache-hot CPU, if all idle CPUs are cache-hot in
> >   select_idle_cpu(). To avoid possible task stacking on the waker's CPU.
> >   (K Prateek Nayak)
> > 
> > Thanks for your comments and review!
> > 
> > ----------------------------------------------------------------------
> 
> Regarding making the scan for finding an idle cpu longer vs cache benefits, 
> I ran some benchmarks.
> 

Thanks very much for your interest and your time on the patch.

> Tested the patch on power system with 12 cores. Total of 96 CPU's.
> System has two NUMA nodes.
>
> Below are some of the benchmark results
> 
> schbench 99.0th latency (lower is better)
> ========
> case            load        	baseline[pct imp](std%)       SIS_CACHE[pct imp]( std%)
> normal          1-mthreads      1.00 [ 0.00]( 3.66)            1.00 [  0.00]( 1.71)
> normal          2-mthreads      1.00 [ 0.00]( 4.55)            1.02 [ -2.00]( 3.00)
> normal          4-mthreads      1.00 [ 0.00]( 4.77)            0.96 [ +4.00]( 4.27)
> normal          6-mthreads      1.00 [ 0.00]( 60.37)           2.66 [ -166.00]( 23.67)
> 
> 
> schbench results are showing that there is not much impact in wakeup latencies due to more iterations 
> in search for an idle cpu in the select_idle_cpu code path and interestingly numbers are slightly better 
> for SIS_CACHE in case of 4-mthreads.

The 4% improvement is within the std%, so I suppose we did not see much difference in the 4-mthreads case.

> I think we can ignore the last case due to huge run to run variations.

Although the run-to-run variation is large, it seems that the decrease is within that range.
Prateek has also reported that when the system is overloaded there could be some regression
in schbench:
https://lore.kernel.org/lkml/27651e14-f441-c1e2-9b5b-b958d6aadc79@amd.com/
Could you also post the raw data printed by schbench? Using the latest schbench might also
give more detailed latency figures.
 
> producer_consumer avg time/access (lower is better)
> ========
> loads per consumer iteration   baseline[pct imp](std%)         SIS_CACHE[pct imp]( std%)
> 5                  		1.00 [ 0.00]( 0.00)            0.87 [ +13.0]( 1.92)
> 20                   		1.00 [ 0.00]( 0.00)            0.92 [ +8.00]( 0.00)
> 50                    		1.00 [ 0.00]( 0.00)            1.00 [  0.00]( 0.00)
> 100                    		1.00 [ 0.00]( 0.00)            1.00 [  0.00]( 0.00)
> 
> The main goal of the patch of improving cache locality is reflected as SIS_CACHE only improves in this workload, 
> mainly when loads per consumer iteration is lower.
> 
> hackbench normalized time in seconds (lower is better)
> ========
> case            load        baseline[pct imp](std%)         SIS_CACHE[pct imp]( std%)
> process-pipe    1-groups     1.00 [ 0.00]( 1.50)            1.02 [ -2.00]( 3.36)
> process-pipe    2-groups     1.00 [ 0.00]( 4.76)            0.99 [ +1.00]( 5.68)
> process-sockets 1-groups     1.00 [ 0.00]( 2.56)            1.00 [  0.00]( 0.86)
> process-sockets 2-groups     1.00 [ 0.00]( 0.50)            0.99 [ +1.00]( 0.96)
> threads-pipe    1-groups     1.00 [ 0.00]( 3.87)            0.71 [ +29.0]( 3.56)
> threads-pipe    2-groups     1.00 [ 0.00]( 1.60)            0.97 [ +3.00]( 3.44)
> threads-sockets 1-groups     1.00 [ 0.00]( 7.65)            0.99 [ +1.00]( 1.05)
> threads-sockets 2-groups     1.00 [ 0.00]( 3.12)            1.03 [ -3.00]( 1.70)
> 
> hackbench results are similar in both kernels except the case where there is an improvement of
> 29% in case of threads-pipe case with 1 groups.
> 
> Daytrader throughput (higher is better)
> ========
> 
> As per Ingo suggestion, ran a real life workload daytrader
> 
> baseline:
> =================================================================================== 
>  Instance      1
>      Throughputs         Ave. Resp. Time   Min. Resp. Time   Max. Resp. Time
>   ================       ===============   ===============   ===============
>        10124.5 			    2 		    0 		   3970
> 
> SIS_CACHE:
> ===================================================================================
>  Instance      1
>      Throughputs         Ave. Resp. Time   Min. Resp. Time   Max. Resp. Time
>   ================       ===============   ===============   ===============
>        10319.5                       2               0              5771
> 
> In the above run, daytrader perfomance was 2% better in case of SIS_CACHE.
>

Thanks for bringing this good news; it is encouraging that a real-life workload benefits
from this change. I'll tune this patch a bit to address the regression from schbench.
I'd also mention that I'm working with Mathieu on his proposal to make it easier for the
wakee to choose its previous CPU (similar to SIS_CACHE, but a little simpler), and we'll
check how to make more platforms benefit from this change.
https://lore.kernel.org/lkml/20231012203626.1298944-1-mathieu.desnoyers@efficios.com/

thanks,
Chenyu
 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup
  2023-10-17 11:09   ` Chen Yu
@ 2023-10-18 19:32     ` Madadi Vineeth Reddy
  2023-10-19 10:57       ` Chen Yu
  0 siblings, 1 reply; 18+ messages in thread
From: Madadi Vineeth Reddy @ 2023-10-18 19:32 UTC (permalink / raw)
  To: Chen Yu, cover.1695704179.git.yu.c.chen
  Cc: Peter Zijlstra, Mathieu Desnoyers, Ingo Molnar, Vincent Guittot,
	Juri Lelli, Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Daniel Bristot de Oliveira,
	Valentin Schneider, K Prateek Nayak, Gautham R . Shenoy,
	linux-kernel, Chen Yu, Madadi Vineeth Reddy

Hi Chen Yu,
On 17/10/23 16:39, Chen Yu wrote:
> Hi Madadi,
> 
> On 2023-10-17 at 15:19:24 +0530, Madadi Vineeth Reddy wrote:
>> Hi Chen Yu,
>>
>> On 26/09/23 10:40, Chen Yu wrote:
>>> RFC -> v1:
>>> - drop RFC
>>> - Only record the short sleeping time for each task, to better honor the
>>>   burst sleeping tasks. (Mathieu Desnoyers)
>>> - Keep the forward movement monotonic for runqueue's cache-hot timeout value.
>>>   (Mathieu Desnoyers, Aaron Lu)
>>> - Introduce a new helper function cache_hot_cpu() that considers
>>>   rq->cache_hot_timeout. (Aaron Lu)
>>> - Add analysis of why inhibiting task migration could bring better throughput
>>>   for some benchmarks. (Gautham R. Shenoy)
>>> - Choose the first cache-hot CPU, if all idle CPUs are cache-hot in
>>>   select_idle_cpu(). To avoid possible task stacking on the waker's CPU.
>>>   (K Prateek Nayak)
>>>
>>> Thanks for your comments and review!
>>>
>>> ----------------------------------------------------------------------
>>
>> Regarding making the scan for finding an idle cpu longer vs cache benefits, 
>> I ran some benchmarks.
>>
> 
> Thanks very much for your interest and your time on the patch.
> 
>> Tested the patch on power system with 12 cores. Total of 96 CPU's.
>> System has two NUMA nodes.
>>
>> Below are some of the benchmark results
>>
>> schbench 99.0th latency (lower is better)
>> ========
>> case            load        	baseline[pct imp](std%)       SIS_CACHE[pct imp]( std%)
>> normal          1-mthreads      1.00 [ 0.00]( 3.66)            1.00 [  0.00]( 1.71)
>> normal          2-mthreads      1.00 [ 0.00]( 4.55)            1.02 [ -2.00]( 3.00)
>> normal          4-mthreads      1.00 [ 0.00]( 4.77)            0.96 [ +4.00]( 4.27)
>> normal          6-mthreads      1.00 [ 0.00]( 60.37)           2.66 [ -166.00]( 23.67)
>>
>>
>> schbench results are showing that there is not much impact in wakeup latencies due to more iterations 
>> in search for an idle cpu in the select_idle_cpu code path and interestingly numbers are slightly better 
>> for SIS_CACHE in case of 4-mthreads.
> 
> The 4% improvement is within std%, so I suppose we did not see much difference in 4 mthreads case.
> 
>> I think we can ignore the last case due to huge run to run variations.
> 
> Although the run-to-run variation is large, it seems that the decrease is within that range.
> Prateek has also reported that when the system is overloaded there could be some regression
> from schbench:
> https://lore.kernel.org/lkml/27651e14-f441-c1e2-9b5b-b958d6aadc79@amd.com/
> Could you also post the raw data printed by schbench? And maybe using the latest schbench could get the
> latency in detail.
>  

raw data by schbench(old) with 6-mthreads
======================

Baseline (5 runs)
========
Latency percentiles (usec)
        50.0000th: 22
        75.0000th: 29
        90.0000th: 34
        95.0000th: 37
        *99.0000th: 981 
        99.5000th: 4424
        99.9000th: 9200
        min=0, max=29497

Latency percentiles (usec)
        50.0000th: 23
        75.0000th: 29
        90.0000th: 35
        95.0000th: 38
        *99.0000th: 495 
        99.5000th: 3924
        99.9000th: 9872
        min=0, max=29997

Latency percentiles (usec)
        50.0000th: 23
        75.0000th: 30
        90.0000th: 36
        95.0000th: 39
        *99.0000th: 1326
        99.5000th: 4744
        99.9000th: 10000
        min=0, max=23394

Latency percentiles (usec)
        50.0000th: 23
        75.0000th: 29
        90.0000th: 34
        95.0000th: 37
        *99.0000th: 55
        99.5000th: 3292
        99.9000th: 9104
        min=0, max=25196

Latency percentiles (usec)
        50.0000th: 23
        75.0000th: 29
        90.0000th: 34
        95.0000th: 37
        *99.0000th: 711 
        99.5000th: 4600
        99.9000th: 9424
        min=0, max=19997

SIS_CACHE (5 runs)
=========
Latency percentiles (usec)
        50.0000th: 23
        75.0000th: 30
        90.0000th: 35
        95.0000th: 38
        *99.0000th: 1894
        99.5000th: 5464
        99.9000th: 10000
        min=0, max=19157

Latency percentiles (usec)
        50.0000th: 22
        75.0000th: 29
        90.0000th: 34
        95.0000th: 37
        *99.0000th: 2396
        99.5000th: 6664
        99.9000th: 10000
        min=0, max=24029

Latency percentiles (usec)
        50.0000th: 22
        75.0000th: 29
        90.0000th: 34
        95.0000th: 37
        *99.0000th: 2132
        99.5000th: 6296
        99.9000th: 10000
        min=0, max=25313

Latency percentiles (usec)
        50.0000th: 22
        75.0000th: 29
        90.0000th: 34
        95.0000th: 37
        *99.0000th: 1090
        99.5000th: 6232
        99.9000th: 9744
        min=0, max=27264

Latency percentiles (usec)
        50.0000th: 22
        75.0000th: 29
        90.0000th: 34
        95.0000th: 38
        *99.0000th: 1786
        99.5000th: 5240
        99.9000th: 9968
        min=0, max=24754

As indicated, the above data has large run-to-run variation, and in general the
99th %ile latency is higher with SIS_CACHE.


schbench(new) with 6-mthreads
=============

Baseline
========
Wakeup Latencies percentiles (usec) runtime 30 (s) (209403 total samples)
	  50.0th: 8          (43672 samples)
	  90.0th: 13         (83908 samples)
	* 99.0th: 20         (18323 samples)
	  99.9th: 775        (1785 samples)
	  min=1, max=8400
Request Latencies percentiles (usec) runtime 30 (s) (209543 total samples)
	  50.0th: 13648      (59873 samples)
	  90.0th: 14000      (82767 samples)
	* 99.0th: 14320      (16342 samples)
	  99.9th: 18720      (1670 samples)
	  min=5130, max=38334
RPS percentiles (requests) runtime 30 (s) (31 total samples)
	  20.0th: 6968       (8 samples)
	* 50.0th: 6984       (23 samples)
	  90.0th: 6984       (0 samples)
	  min=6835, max=6991
average rps: 6984.77


SIS_CACHE
=========
Wakeup Latencies percentiles (usec) runtime 30 (s) (209295 total samples)
	  50.0th: 9          (49267 samples)
	  90.0th: 14         (86522 samples)
	* 99.0th: 21         (14091 samples)
	  99.9th: 1146       (1722 samples)
	  min=1, max=10427
Request Latencies percentiles (usec) runtime 30 (s) (209432 total samples)
	  50.0th: 13616      (62838 samples)
	  90.0th: 14000      (85301 samples)
	* 99.0th: 14352      (16149 samples)
	  99.9th: 21408      (1660 samples)
	  min=5070, max=41866
RPS percentiles (requests) runtime 30 (s) (31 total samples)
	  20.0th: 6968       (7 samples)
	* 50.0th: 6984       (21 samples)
	  90.0th: 6984       (0 samples)
	  min=6672, max=6996
average rps: 6981.07

With the new schbench, I didn't observe run-to-run variation, and there was no
regression for SIS_CACHE at the 99th %ile.


>> producer_consumer avg time/access (lower is better)
>> ========
>> loads per consumer iteration   baseline[pct imp](std%)         SIS_CACHE[pct imp]( std%)
>> 5                  		1.00 [ 0.00]( 0.00)            0.87 [ +13.0]( 1.92)
>> 20                   		1.00 [ 0.00]( 0.00)            0.92 [ +8.00]( 0.00)
>> 50                    		1.00 [ 0.00]( 0.00)            1.00 [  0.00]( 0.00)
>> 100                    		1.00 [ 0.00]( 0.00)            1.00 [  0.00]( 0.00)
>>
>> The main goal of the patch of improving cache locality is reflected as SIS_CACHE only improves in this workload, 
>> mainly when loads per consumer iteration is lower.
>>
>> hackbench normalized time in seconds (lower is better)
>> ========
>> case            load        baseline[pct imp](std%)         SIS_CACHE[pct imp]( std%)
>> process-pipe    1-groups     1.00 [ 0.00]( 1.50)            1.02 [ -2.00]( 3.36)
>> process-pipe    2-groups     1.00 [ 0.00]( 4.76)            0.99 [ +1.00]( 5.68)
>> process-sockets 1-groups     1.00 [ 0.00]( 2.56)            1.00 [  0.00]( 0.86)
>> process-sockets 2-groups     1.00 [ 0.00]( 0.50)            0.99 [ +1.00]( 0.96)
>> threads-pipe    1-groups     1.00 [ 0.00]( 3.87)            0.71 [ +29.0]( 3.56)
>> threads-pipe    2-groups     1.00 [ 0.00]( 1.60)            0.97 [ +3.00]( 3.44)
>> threads-sockets 1-groups     1.00 [ 0.00]( 7.65)            0.99 [ +1.00]( 1.05)
>> threads-sockets 2-groups     1.00 [ 0.00]( 3.12)            1.03 [ -3.00]( 1.70)
>>
>> hackbench results are similar in both kernels except the case where there is an improvement of
>> 29% in case of threads-pipe case with 1 groups.
>>
>> Daytrader throughput (higher is better)
>> ========
>>
>> As per Ingo suggestion, ran a real life workload daytrader
>>
>> baseline:
>> =================================================================================== 
>>  Instance      1
>>      Throughputs         Ave. Resp. Time   Min. Resp. Time   Max. Resp. Time
>>   ================       ===============   ===============   ===============
>>        10124.5 			    2 		    0 		   3970
>>
>> SIS_CACHE:
>> ===================================================================================
>>  Instance      1
>>      Throughputs         Ave. Resp. Time   Min. Resp. Time   Max. Resp. Time
>>   ================       ===============   ===============   ===============
>>        10319.5                       2               0              5771
>>
>> In the above run, daytrader perfomance was 2% better in case of SIS_CACHE.
>>
> 
> Thanks for bringing this good news, a real life workload benefits from this change.
> I'll tune this patch a little bit to address the regression from schbench. Also to mention
> that, I'm working with Mathieu on his proposal to make the wakee choosing its previous
> CPU easier(similar to SIS_CACHE, but a little simpler), and we'll check how to make more
> platform benefit from this change.
> https://lore.kernel.org/lkml/20231012203626.1298944-1-mathieu.desnoyers@efficios.com/

Oh..ok. Thanks for the pointer!

> 
> thanks,
> Chenyu
>  

Thanks and Regards
Madadi Vineeth Reddy


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup
  2023-10-18 19:32     ` Madadi Vineeth Reddy
@ 2023-10-19 10:57       ` Chen Yu
  0 siblings, 0 replies; 18+ messages in thread
From: Chen Yu @ 2023-10-19 10:57 UTC (permalink / raw)
  To: Madadi Vineeth Reddy
  Cc: cover.1695704179.git.yu.c.chen, Peter Zijlstra,
	Mathieu Desnoyers, Ingo Molnar, Vincent Guittot, Juri Lelli,
	Tim Chen, Aaron Lu, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider,
	K Prateek Nayak, Gautham R . Shenoy, linux-kernel, Chen Yu,
	Madadi Vineeth Reddy

On 2023-10-19 at 01:02:16 +0530, Madadi Vineeth Reddy wrote:
> Hi Chen Yu,
> On 17/10/23 16:39, Chen Yu wrote:
> > Hi Madadi,
> > 
> > On 2023-10-17 at 15:19:24 +0530, Madadi Vineeth Reddy wrote:
> >> Hi Chen Yu,
> >>
> >> On 26/09/23 10:40, Chen Yu wrote:
> >>> RFC -> v1:
> >>> - drop RFC
> >>> - Only record the short sleeping time for each task, to better honor the
> >>>   burst sleeping tasks. (Mathieu Desnoyers)
> >>> - Keep the forward movement monotonic for runqueue's cache-hot timeout value.
> >>>   (Mathieu Desnoyers, Aaron Lu)
> >>> - Introduce a new helper function cache_hot_cpu() that considers
> >>>   rq->cache_hot_timeout. (Aaron Lu)
> >>> - Add analysis of why inhibiting task migration could bring better throughput
> >>>   for some benchmarks. (Gautham R. Shenoy)
> >>> - Choose the first cache-hot CPU, if all idle CPUs are cache-hot in
> >>>   select_idle_cpu(). To avoid possible task stacking on the waker's CPU.
> >>>   (K Prateek Nayak)
> >>>
> >>> Thanks for your comments and review!
> >>>
> >>> ----------------------------------------------------------------------
> >>
> >> Regarding making the scan for finding an idle cpu longer vs cache benefits, 
> >> I ran some benchmarks.
> >>
> > 
> > Thanks very much for your interest and your time on the patch.
> > 
> >> Tested the patch on power system with 12 cores. Total of 96 CPU's.
> >> System has two NUMA nodes.
> >>
> >> Below are some of the benchmark results
> >>
> >> schbench 99.0th latency (lower is better)
> >> ========
> >> case            load        	baseline[pct imp](std%)       SIS_CACHE[pct imp]( std%)
> >> normal          1-mthreads      1.00 [ 0.00]( 3.66)            1.00 [  0.00]( 1.71)
> >> normal          2-mthreads      1.00 [ 0.00]( 4.55)            1.02 [ -2.00]( 3.00)
> >> normal          4-mthreads      1.00 [ 0.00]( 4.77)            0.96 [ +4.00]( 4.27)
> >> normal          6-mthreads      1.00 [ 0.00]( 60.37)           2.66 [ -166.00]( 23.67)
> >>
> >>
> >> schbench results are showing that there is not much impact in wakeup latencies due to more iterations 
> >> in search for an idle cpu in the select_idle_cpu code path and interestingly numbers are slightly better 
> >> for SIS_CACHE in case of 4-mthreads.
> > 
> > The 4% improvement is within std%, so I suppose we did not see much difference in 4 mthreads case.
> > 
> >> I think we can ignore the last case due to huge run to run variations.
> > 
> > Although the run-to-run variation is large, it seems that the decrease is within that range.
> > Prateek has also reported that when the system is overloaded there could be some regression
> > from schbench:
> > https://lore.kernel.org/lkml/27651e14-f441-c1e2-9b5b-b958d6aadc79@amd.com/
> > Could you also post the raw data printed by schbench? And maybe using the latest schbench could get the
> > latency in detail.
> >  
> 
> raw data by schbench(old) with 6-mthreads
> ======================
> 
> Baseline (5 runs)
> ========
> Latency percentiles (usec)                                                                                                                                                                                                                                  
>         50.0000th: 22
>         75.0000th: 29
>         90.0000th: 34
>         95.0000th: 37
>         *99.0000th: 981 
>         99.5000th: 4424
>         99.9000th: 9200
>         min=0, max=29497
> 
> Latency percentiles (usec)
>         50.0000th: 23
>         75.0000th: 29
>         90.0000th: 35
>         95.0000th: 38
>         *99.0000th: 495 
>         99.5000th: 3924
>         99.9000th: 9872
>         min=0, max=29997
> 
> Latency percentiles (usec)
>         50.0000th: 23
>         75.0000th: 30
>         90.0000th: 36
>         95.0000th: 39
>         *99.0000th: 1326
>         99.5000th: 4744
>         99.9000th: 10000
>         min=0, max=23394
> 
> Latency percentiles (usec)
>         50.0000th: 23
>         75.0000th: 29
>         90.0000th: 34
>         95.0000th: 37
>         *99.0000th: 55
>         99.5000th: 3292
>         99.9000th: 9104
>         min=0, max=25196
> 
> Latency percentiles (usec)
>         50.0000th: 23
>         75.0000th: 29
>         90.0000th: 34
>         95.0000th: 37
>         *99.0000th: 711 
>         99.5000th: 4600
>         99.9000th: 9424
>         min=0, max=19997
> 
> SIS_CACHE (5 runs)
> =========
> Latency percentiles (usec)                                                                                                                                                                                                                                                                                     
>         50.0000th: 23
>         75.0000th: 30
>         90.0000th: 35
>         95.0000th: 38
>         *99.0000th: 1894
>         99.5000th: 5464
>         99.9000th: 10000
>         min=0, max=19157
> 
> Latency percentiles (usec)
>         50.0000th: 22
>         75.0000th: 29
>         90.0000th: 34
>         95.0000th: 37
>         *99.0000th: 2396
>         99.5000th: 6664
>         99.9000th: 10000
>         min=0, max=24029
> 
> Latency percentiles (usec)
>         50.0000th: 22
>         75.0000th: 29
>         90.0000th: 34
>         95.0000th: 37
>         *99.0000th: 2132
>         99.5000th: 6296
>         99.9000th: 10000
>         min=0, max=25313
> 
> Latency percentiles (usec)
>         50.0000th: 22
>         75.0000th: 29
>         90.0000th: 34
>         95.0000th: 37
>         *99.0000th: 1090
>         99.5000th: 6232
>         99.9000th: 9744
>         min=0, max=27264
> 
> Latency percentiles (usec)
>         50.0000th: 22
>         75.0000th: 29
>         90.0000th: 34
>         95.0000th: 38
>         *99.0000th: 1786
>         99.5000th: 5240
>         99.9000th: 9968
>         min=0, max=24754
> 
> The above data as indicated has large run to run variation and in general, the latency is
> high in case of SIS_CACHE for the 99th %ile.
> 
> 
> schbench(new) with 6-mthreads
> =============
> 
> Baseline
> ========
> Wakeup Latencies percentiles (usec) runtime 30 (s) (209403 total samples)
> 	  50.0th: 8          (43672 samples)
> 	  90.0th: 13         (83908 samples)
> 	* 99.0th: 20         (18323 samples)
> 	  99.9th: 775        (1785 samples)
> 	  min=1, max=8400
> Request Latencies percentiles (usec) runtime 30 (s) (209543 total samples)
> 	  50.0th: 13648      (59873 samples)
> 	  90.0th: 14000      (82767 samples)
> 	* 99.0th: 14320      (16342 samples)
> 	  99.9th: 18720      (1670 samples)
> 	  min=5130, max=38334
> RPS percentiles (requests) runtime 30 (s) (31 total samples)
> 	  20.0th: 6968       (8 samples)
> 	* 50.0th: 6984       (23 samples)
> 	  90.0th: 6984       (0 samples)
> 	  min=6835, max=6991
> average rps: 6984.77
> 
> 
> SIS_CACHE
> =========
> Wakeup Latencies percentiles (usec) runtime 30 (s) (209295 total samples)
> 	  50.0th: 9          (49267 samples)
> 	  90.0th: 14         (86522 samples)
> 	* 99.0th: 21         (14091 samples)
> 	  99.9th: 1146       (1722 samples)
> 	  min=1, max=10427
> Request Latencies percentiles (usec) runtime 30 (s) (209432 total samples)
> 	  50.0th: 13616      (62838 samples)
> 	  90.0th: 14000      (85301 samples)
> 	* 99.0th: 14352      (16149 samples)
> 	  99.9th: 21408      (1660 samples)
> 	  min=5070, max=41866
> RPS percentiles (requests) runtime 30 (s) (31 total samples)
> 	  20.0th: 6968       (7 samples)
> 	* 50.0th: 6984       (21 samples)
> 	  90.0th: 6984       (0 samples)
> 	  min=6672, max=6996
> average rps: 6981.07
> 
> In new schbench, I didn't observe run to run variation and also there was no regression
> in case of SIS_CACHE for the 99th %ile.
>

Thanks for the test, Madadi. In my opinion we can stick with the new schbench
in the future. I'll double-check on my test machine.

thanks,
Chenyu

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2023-10-19 10:57 UTC | newest]

Thread overview: 18+ messages
2023-09-26  5:10 [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup Chen Yu
2023-09-26  5:11 ` [PATCH 1/2] sched/fair: Record the short sleeping time of a task Chen Yu
2023-09-26 10:29   ` Mathieu Desnoyers
2023-09-27  6:17     ` Chen Yu
2023-09-27  7:53   ` Aaron Lu
2023-09-28  7:57     ` Chen Yu
2023-09-26  5:11 ` [PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle_cpu() Chen Yu
2023-09-27 16:11   ` Mathieu Desnoyers
2023-09-28  7:52     ` Chen Yu
2023-09-27  8:00 ` [PATCH 0/2] Introduce SIS_CACHE to choose previous CPU during task wakeup Ingo Molnar
2023-09-27 21:34   ` Tim Chen
2023-09-28  8:23   ` Chen Yu
2023-10-05  6:22 ` K Prateek Nayak
2023-10-07  3:23   ` Chen Yu
2023-10-17  9:49 ` Madadi Vineeth Reddy
2023-10-17 11:09   ` Chen Yu
2023-10-18 19:32     ` Madadi Vineeth Reddy
2023-10-19 10:57       ` Chen Yu
