linux-arm-kernel.lists.infradead.org archive mirror
* [RESEND PATCH v7 0/2] sched/fair: Scan cluster before scanning LLC in wake-up path
@ 2022-09-15  7:34 Yicong Yang
  2022-09-15  7:34 ` [RESEND PATCH v7 1/2] sched: Add per_cpu cluster domain info and cpus_share_lowest_cache API Yicong Yang
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Yicong Yang @ 2022-09-15  7:34 UTC (permalink / raw)
  To: peterz, mingo, juri.lelli, vincent.guittot, tim.c.chen,
	gautham.shenoy, linux-kernel, linux-arm-kernel
  Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng,
	yangyicong, jonathan.cameron, ego, srikar, linuxarm, 21cnbao,
	guodong.xu, hesham.almatary, john.garry, shenyang39,
	kprateek.nayak, yu.c.chen, wuyun.abel

From: Yicong Yang <yangyicong@hisilicon.com>

This is the follow-up work to support the cluster scheduler. Previously
we added a cluster level to the scheduler for both ARM64[1] and
X86[2] to support load balancing between clusters, which brings more
memory bandwidth and decreases cache contention. This patchset, on the
other hand, takes care of the wake-up path by trying CPUs within the
same cluster before scanning the whole LLC, to benefit tasks
communicating with each other.

[1] 778c558f49a2 ("sched: Add cluster scheduler level in core and related Kconfig for ARM64")
[2] 66558b730f25 ("sched: Add cluster scheduler level for x86")
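
To make the idea concrete outside of the kernel sources, here is a minimal
user-space sketch of the scan order introduced by this series. It is only an
illustration: the real change is the select_idle_cpu() hunk in patch 2/2, and
the 8-CPU LLC split into 4-CPU clusters, the helper names and the idle map
below are made-up assumptions.

/*
 * Illustrative sketch only -- not kernel code. Wrap-iterate the wakeup
 * target's cluster first, then the remaining CPUs of the LLC.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS		8	/* one LLC with 8 CPUs (assumption) */
#define CLUSTER_SIZE	4	/* split into clusters of 4 (assumption) */

static bool cpu_idle[NR_CPUS] = { [5] = true };	/* pretend only CPU5 is idle */

static int cluster_first(int cpu)
{
	return cpu - (cpu % CLUSTER_SIZE);
}

/* Scan 'len' CPUs starting at 'start', wrapping like for_each_cpu_wrap(). */
static int scan_wrap(int base, int len, int start, const bool *skip)
{
	for (int i = 0; i < len; i++) {
		int cpu = base + (start - base + i) % len;

		if (skip && skip[cpu])
			continue;
		if (cpu_idle[cpu])
			return cpu;
	}
	return -1;
}

static int select_idle_cpu_sketch(int target)
{
	bool in_cluster[NR_CPUS] = { false };
	int base = cluster_first(target);
	int cpu, i;

	/* 1) Try the target's cluster first, starting after the target. */
	cpu = scan_wrap(base, CLUSTER_SIZE, target + 1, NULL);
	if (cpu >= 0)
		return cpu;

	/* 2) Drop the cluster (cpumask_andnot() in the real patch)... */
	for (i = 0; i < CLUSTER_SIZE; i++)
		in_cluster[base + i] = true;

	/* ...and only then scan the rest of the LLC. */
	return scan_wrap(0, NR_CPUS, target + 1, in_cluster);
}

int main(void)
{
	/*
	 * CPU 5 is idle but sits in the other cluster, so it is only found
	 * after the target's own cluster (CPUs 0-3) came up empty.
	 */
	printf("wake-up near CPU 1 picks CPU %d\n", select_idle_cpu_sketch(1));
	return 0;
}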

Change for RESEND:
- Collect tag from Chen Yu and rebase on the latest tip/sched/core. Thanks.
Link: https://lore.kernel.org/lkml/20220822073610.27205-1-yangyicong@huawei.com/

Change since v6:
- rebase on 6.0-rc1
Link: https://lore.kernel.org/lkml/20220726074758.46686-1-yangyicong@huawei.com/

Change since v5:
- Improve patch 2 according to Peter's suggestion:
  - use sched_cluster_active to indicate whether cluster is active
  - consider SMT case and use wrap iteration when scanning cluster
- Add Vincent's tag
Thanks.
Link: https://lore.kernel.org/lkml/20220720081150.22167-1-yangyicong@hisilicon.com/

Change since v4:
- rename cpus_share_resources to cpus_share_lowest_cache to be more informative, per Tim
- return -1 when nr==0 in scan_cluster(), per Abel
Thanks!
Link: https://lore.kernel.org/lkml/20220609120622.47724-1-yangyicong@hisilicon.com/

Change since v3:
- fix compile error when !CONFIG_SCHED_CLUSTER, reported by lkp test.
Link: https://lore.kernel.org/lkml/20220608095758.60504-1-yangyicong@hisilicon.com/

Change since v2:
- leverage SIS_PROP to suspend redundant scanning when LLC is overloaded
- remove the ping-pong suppression
- address the comment from Tim, thanks.
Link: https://lore.kernel.org/lkml/20220126080947.4529-1-yangyicong@hisilicon.com/

Change since v1:
- regain the performance data based on v5.17-rc1
- rename cpus_share_cluster to cpus_share_resources per Vincent and Gautham, thanks!
Link: https://lore.kernel.org/lkml/20211215041149.73171-1-yangyicong@hisilicon.com/

Barry Song (2):
  sched: Add per_cpu cluster domain info and cpus_share_lowest_cache API
  sched/fair: Scan cluster before scanning LLC in wake-up path

 include/linux/sched/sd_flags.h |  7 +++++++
 include/linux/sched/topology.h |  8 +++++++-
 kernel/sched/core.c            | 12 ++++++++++++
 kernel/sched/fair.c            | 30 +++++++++++++++++++++++++++---
 kernel/sched/sched.h           |  3 +++
 kernel/sched/topology.c        | 25 +++++++++++++++++++++++++
 6 files changed, 81 insertions(+), 4 deletions(-)

-- 
2.24.0



^ permalink raw reply	[flat|nested] 8+ messages in thread

* [RESEND PATCH v7 1/2] sched: Add per_cpu cluster domain info and cpus_share_lowest_cache API
  2022-09-15  7:34 [RESEND PATCH v7 0/2] sched/fair: Scan cluster before scanning LLC in wake-up path Yicong Yang
@ 2022-09-15  7:34 ` Yicong Yang
  2022-09-15  7:34 ` [RESEND PATCH v7 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path Yicong Yang
  2022-09-26 14:52 ` [RESEND PATCH v7 0/2] " Yicong Yang
  2 siblings, 0 replies; 8+ messages in thread
From: Yicong Yang @ 2022-09-15  7:34 UTC (permalink / raw)
  To: peterz, mingo, juri.lelli, vincent.guittot, tim.c.chen,
	gautham.shenoy, linux-kernel, linux-arm-kernel
  Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng,
	yangyicong, jonathan.cameron, ego, srikar, linuxarm, 21cnbao,
	guodong.xu, hesham.almatary, john.garry, shenyang39,
	kprateek.nayak, yu.c.chen, wuyun.abel

From: Barry Song <song.bao.hua@hisilicon.com>

Add per-cpu cluster domain info and the cpus_share_lowest_cache() API.
This is preparation for optimizing select_idle_cpu() on platforms
with a cluster scheduler level.
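
As a rough mental model of the new API (a user-space sketch, not the kernel
implementation; only the two per-CPU array names mirror what this patch adds,
and the example topology of one 8-CPU LLC with two 4-CPU clusters is an
assumption): cpus_share_lowest_cache() is just a per-CPU id comparison, where
the id is the cluster id on machines with clusters and the LLC id otherwise.

#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 8

/* Filled from the sched domains at topology-build time in the real code. */
static int sd_llc_id[NR_CPUS]          = { 0, 0, 0, 0, 0, 0, 0, 0 };
static int sd_lowest_cache_id[NR_CPUS] = { 0, 0, 0, 0, 4, 4, 4, 4 };

static bool cpus_share_cache(int a, int b)
{
	return a == b || sd_llc_id[a] == sd_llc_id[b];
}

static bool cpus_share_lowest_cache(int a, int b)
{
	return a == b || sd_lowest_cache_id[a] == sd_lowest_cache_id[b];
}

int main(void)
{
	/* CPUs 1 and 5 share the LLC but sit in different clusters. */
	assert(cpus_share_cache(1, 5));
	assert(!cpus_share_lowest_cache(1, 5));
	return 0;
}

Patch 2/2 then switches the prev/recent_used_cpu checks in
select_idle_sibling() from cpus_share_cache() to this narrower test.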

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 include/linux/sched/sd_flags.h |  7 +++++++
 include/linux/sched/topology.h |  8 +++++++-
 kernel/sched/core.c            | 12 ++++++++++++
 kernel/sched/sched.h           |  2 ++
 kernel/sched/topology.c        | 15 +++++++++++++++
 5 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/sd_flags.h b/include/linux/sched/sd_flags.h
index 57bde66d95f7..42ed454e8b18 100644
--- a/include/linux/sched/sd_flags.h
+++ b/include/linux/sched/sd_flags.h
@@ -109,6 +109,13 @@ SD_FLAG(SD_ASYM_CPUCAPACITY_FULL, SDF_SHARED_PARENT | SDF_NEEDS_GROUPS)
  */
 SD_FLAG(SD_SHARE_CPUCAPACITY, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)
 
+/*
+ * Domain members share CPU cluster (LLC tags or L2 cache)
+ *
+ * NEEDS_GROUPS: Clusters are shared between groups.
+ */
+SD_FLAG(SD_CLUSTER, SDF_NEEDS_GROUPS)
+
 /*
  * Domain members share CPU package resources (i.e. caches)
  *
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 816df6cc444e..c0d21667ddf3 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -45,7 +45,7 @@ static inline int cpu_smt_flags(void)
 #ifdef CONFIG_SCHED_CLUSTER
 static inline int cpu_cluster_flags(void)
 {
-	return SD_SHARE_PKG_RESOURCES;
+	return SD_CLUSTER | SD_SHARE_PKG_RESOURCES;
 }
 #endif
 
@@ -179,6 +179,7 @@ cpumask_var_t *alloc_sched_domains(unsigned int ndoms);
 void free_sched_domains(cpumask_var_t doms[], unsigned int ndoms);
 
 bool cpus_share_cache(int this_cpu, int that_cpu);
+bool cpus_share_lowest_cache(int this_cpu, int that_cpu);
 
 typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
 typedef int (*sched_domain_flags_f)(void);
@@ -232,6 +233,11 @@ static inline bool cpus_share_cache(int this_cpu, int that_cpu)
 	return true;
 }
 
+static inline bool cpus_share_lowest_cache(int this_cpu, int that_cpu)
+{
+	return true;
+}
+
 #endif	/* !CONFIG_SMP */
 
 #if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2b85d1b5fe0c..e6966f66631d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3801,6 +3801,18 @@ bool cpus_share_cache(int this_cpu, int that_cpu)
 	return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
 }
 
+/*
+ * Whether CPUs share the lowest cache, which means LLC on non-cluster
+ * machines and LLC tag or L2 on machines with clusters.
+ */
+bool cpus_share_lowest_cache(int this_cpu, int that_cpu)
+{
+	if (this_cpu == that_cpu)
+		return true;
+
+	return per_cpu(sd_lowest_cache_id, this_cpu) == per_cpu(sd_lowest_cache_id, that_cpu);
+}
+
 static inline bool ttwu_queue_cond(struct task_struct *p, int cpu)
 {
 	/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 91b2c7ec53bd..c148f6c4f0fc 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1794,7 +1794,9 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DECLARE_PER_CPU(int, sd_llc_size);
 DECLARE_PER_CPU(int, sd_llc_id);
+DECLARE_PER_CPU(int, sd_lowest_cache_id);
 DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
+DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8739c2a5a54e..8ab27c0d6d1f 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -664,6 +664,8 @@ static void destroy_sched_domains(struct sched_domain *sd)
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DEFINE_PER_CPU(int, sd_llc_size);
 DEFINE_PER_CPU(int, sd_llc_id);
+DEFINE_PER_CPU(int, sd_lowest_cache_id);
+DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
 DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
@@ -689,6 +691,18 @@ static void update_top_cache_domain(int cpu)
 	per_cpu(sd_llc_id, cpu) = id;
 	rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
 
+	sd = lowest_flag_domain(cpu, SD_CLUSTER);
+	if (sd)
+		id = cpumask_first(sched_domain_span(sd));
+	rcu_assign_pointer(per_cpu(sd_cluster, cpu), sd);
+
+	/*
+	 * This assignment should be placed after the sd_llc_id as
+	 * we want this id to equal the cluster id on cluster machines
+	 * but the LLC id on non-cluster machines.
+	 */
+	per_cpu(sd_lowest_cache_id, cpu) = id;
+
 	sd = lowest_flag_domain(cpu, SD_NUMA);
 	rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
 
@@ -1532,6 +1546,7 @@ static struct cpumask		***sched_domains_numa_masks;
  */
 #define TOPOLOGY_SD_FLAGS		\
 	(SD_SHARE_CPUCAPACITY	|	\
+	 SD_CLUSTER		|	\
 	 SD_SHARE_PKG_RESOURCES |	\
 	 SD_NUMA		|	\
 	 SD_ASYM_PACKING)
-- 
2.24.0



^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [RESEND PATCH v7 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path
  2022-09-15  7:34 [RESEND PATCH v7 0/2] sched/fair: Scan cluster before scanning LLC in wake-up path Yicong Yang
  2022-09-15  7:34 ` [RESEND PATCH v7 1/2] sched: Add per_cpu cluster domain info and cpus_share_lowest_cache API Yicong Yang
@ 2022-09-15  7:34 ` Yicong Yang
  2023-05-22  6:29   ` Chen Yu
  2022-09-26 14:52 ` [RESEND PATCH v7 0/2] " Yicong Yang
  2 siblings, 1 reply; 8+ messages in thread
From: Yicong Yang @ 2022-09-15  7:34 UTC (permalink / raw)
  To: peterz, mingo, juri.lelli, vincent.guittot, tim.c.chen,
	gautham.shenoy, linux-kernel, linux-arm-kernel
  Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng,
	yangyicong, jonathan.cameron, ego, srikar, linuxarm, 21cnbao,
	guodong.xu, hesham.almatary, john.garry, shenyang39,
	kprateek.nayak, yu.c.chen, wuyun.abel

From: Barry Song <song.bao.hua@hisilicon.com>

For platforms having clusters like Kunpeng920, CPUs within the same cluster
have lower latency when synchronizing and accessing shared resources like
cache. Thus, this patch tries to find an idle cpu within the cluster of the
target CPU before scanning the whole LLC to gain lower latency.

Testing has been done on Kunpeng920 by pinning tasks to one NUMA node and
to two NUMA nodes. On Kunpeng920, each NUMA node has 8 clusters and each
cluster has 4 CPUs.

With this patch, we noticed improvement on tbench both within one NUMA node
and across two NUMA nodes.

On numa 0:
                             6.0-rc1                patched
Hmean     1        351.20 (   0.00%)      396.45 *  12.88%*
Hmean     2        700.43 (   0.00%)      793.76 *  13.32%*
Hmean     4       1404.42 (   0.00%)     1583.62 *  12.76%*
Hmean     8       2833.31 (   0.00%)     3147.85 *  11.10%*
Hmean     16      5501.90 (   0.00%)     6089.89 *  10.69%*
Hmean     32     10428.59 (   0.00%)    10619.63 *   1.83%*
Hmean     64      8223.39 (   0.00%)     8306.93 *   1.02%*
Hmean     128     7042.88 (   0.00%)     7068.03 *   0.36%*

On numa 0-1:
                             6.0-rc1                patched
Hmean     1        363.06 (   0.00%)      397.13 *   9.38%*
Hmean     2        721.68 (   0.00%)      789.84 *   9.44%*
Hmean     4       1435.15 (   0.00%)     1566.01 *   9.12%*
Hmean     8       2776.17 (   0.00%)     3007.05 *   8.32%*
Hmean     16      5471.71 (   0.00%)     6103.91 *  11.55%*
Hmean     32     10164.98 (   0.00%)    11531.81 *  13.45%*
Hmean     64     17143.28 (   0.00%)    20078.68 *  17.12%*
Hmean     128    14552.70 (   0.00%)    15156.41 *   4.15%*
Hmean     256    12827.37 (   0.00%)    13326.86 *   3.89%*

Note that neither Kunpeng920 nor x86 Jacobsville supports SMT, so the SMT
branch in the code has not been tested, but it is supposed to work.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
[https://lore.kernel.org/lkml/Ytfjs+m1kUs0ScSn@worktop.programming.kicks-ass.net]
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
---
 kernel/sched/fair.c     | 30 +++++++++++++++++++++++++++---
 kernel/sched/sched.h    |  1 +
 kernel/sched/topology.c | 10 ++++++++++
 3 files changed, 38 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4e5b171b1171..e6505b0764c0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6444,6 +6444,30 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 		}
 	}
 
+	if (static_branch_unlikely(&sched_cluster_active)) {
+		struct sched_domain *sdc = rcu_dereference(per_cpu(sd_cluster, target));
+
+		if (sdc) {
+			for_each_cpu_wrap(cpu, sched_domain_span(sdc), target + 1) {
+				if (!cpumask_test_cpu(cpu, cpus))
+					continue;
+
+				if (has_idle_core) {
+					i = select_idle_core(p, cpu, cpus, &idle_cpu);
+					if ((unsigned int)i < nr_cpumask_bits)
+						return i;
+				} else {
+					if (--nr <= 0)
+						return -1;
+					idle_cpu = __select_idle_cpu(cpu, p);
+					if ((unsigned int)idle_cpu < nr_cpumask_bits)
+						return idle_cpu;
+				}
+			}
+			cpumask_andnot(cpus, cpus, sched_domain_span(sdc));
+		}
+	}
+
 	for_each_cpu_wrap(cpu, cpus, target + 1) {
 		if (has_idle_core) {
 			i = select_idle_core(p, cpu, cpus, &idle_cpu);
@@ -6451,7 +6475,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 				return i;
 
 		} else {
-			if (!--nr)
+			if (--nr <= 0)
 				return -1;
 			idle_cpu = __select_idle_cpu(cpu, p);
 			if ((unsigned int)idle_cpu < nr_cpumask_bits)
@@ -6550,7 +6574,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	/*
 	 * If the previous CPU is cache affine and idle, don't be stupid:
 	 */
-	if (prev != target && cpus_share_cache(prev, target) &&
+	if (prev != target && cpus_share_lowest_cache(prev, target) &&
 	    (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
 	    asym_fits_capacity(task_util, prev))
 		return prev;
@@ -6576,7 +6600,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	p->recent_used_cpu = prev;
 	if (recent_used_cpu != prev &&
 	    recent_used_cpu != target &&
-	    cpus_share_cache(recent_used_cpu, target) &&
+	    cpus_share_lowest_cache(recent_used_cpu, target) &&
 	    (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
 	    cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
 	    asym_fits_capacity(task_util, recent_used_cpu)) {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c148f6c4f0fc..f99208146d7a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1801,6 +1801,7 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
 extern struct static_key_false sched_asym_cpucapacity;
+extern struct static_key_false sched_cluster_active;
 
 static __always_inline bool sched_asym_cpucap_active(void)
 {
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8ab27c0d6d1f..04ead3227201 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -670,7 +670,9 @@ DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
+
 DEFINE_STATIC_KEY_FALSE(sched_asym_cpucapacity);
+DEFINE_STATIC_KEY_FALSE(sched_cluster_active);
 
 static void update_top_cache_domain(int cpu)
 {
@@ -2268,6 +2270,7 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 	struct rq *rq = NULL;
 	int i, ret = -ENOMEM;
 	bool has_asym = false;
+	bool has_cluster = false;
 
 	if (WARN_ON(cpumask_empty(cpu_map)))
 		goto error;
@@ -2289,6 +2292,7 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 			sd = build_sched_domain(tl, cpu_map, attr, sd, i);
 
 			has_asym |= sd->flags & SD_ASYM_CPUCAPACITY;
+			has_cluster |= sd->flags & SD_CLUSTER;
 
 			if (tl == sched_domain_topology)
 				*per_cpu_ptr(d.sd, i) = sd;
@@ -2399,6 +2403,9 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 	if (has_asym)
 		static_branch_inc_cpuslocked(&sched_asym_cpucapacity);
 
+	if (has_cluster)
+		static_branch_inc_cpuslocked(&sched_cluster_active);
+
 	if (rq && sched_debug_verbose) {
 		pr_info("root domain span: %*pbl (max cpu_capacity = %lu)\n",
 			cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);
@@ -2498,6 +2505,9 @@ static void detach_destroy_domains(const struct cpumask *cpu_map)
 	if (rcu_access_pointer(per_cpu(sd_asym_cpucapacity, cpu)))
 		static_branch_dec_cpuslocked(&sched_asym_cpucapacity);
 
+	if (rcu_access_pointer(per_cpu(sd_cluster, cpu)))
+		static_branch_dec_cpuslocked(&sched_cluster_active);
+
 	rcu_read_lock();
 	for_each_cpu(i, cpu_map)
 		cpu_attach_domain(NULL, &def_root_domain, i);
-- 
2.24.0



^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [RESEND PATCH v7 0/2] sched/fair: Scan cluster before scanning LLC in wake-up path
  2022-09-15  7:34 [RESEND PATCH v7 0/2] sched/fair: Scan cluster before scanning LLC in wake-up path Yicong Yang
  2022-09-15  7:34 ` [RESEND PATCH v7 1/2] sched: Add per_cpu cluster domain info and cpus_share_lowest_cache API Yicong Yang
  2022-09-15  7:34 ` [RESEND PATCH v7 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path Yicong Yang
@ 2022-09-26 14:52 ` Yicong Yang
  2 siblings, 0 replies; 8+ messages in thread
From: Yicong Yang @ 2022-09-26 14:52 UTC (permalink / raw)
  To: peterz
  Cc: gautham.shenoy, vincent.guittot, tim.c.chen, dietmar.eggemann,
	rostedt, bsegall, bristot, prime.zeng, yangyicong,
	jonathan.cameron, ego, srikar, linuxarm, 21cnbao, juri.lelli,
	mingo, guodong.xu, hesham.almatary, john.garry, shenyang39,
	kprateek.nayak, yu.c.chen, wuyun.abel, Yicong Yang,
	linux-arm-kernel, linux-kernel

Hi Peter,

Does it still have a chance to make 6.1?

It should be stable now, since I got no new comments during this whole review cycle and
it can still be applied cleanly on tip/sched/core.

However, any further comments would be appreciated!

Thanks.

On 9/15/2022 3:34 PM, Yicong Yang wrote:
> From: Yicong Yang <yangyicong@hisilicon.com>
> 
> This is the follow-up work to support the cluster scheduler. Previously
> we added a cluster level to the scheduler for both ARM64[1] and
> X86[2] to support load balancing between clusters, which brings more
> memory bandwidth and decreases cache contention. This patchset, on the
> other hand, takes care of the wake-up path by trying CPUs within the
> same cluster before scanning the whole LLC, to benefit tasks
> communicating with each other.
> 
> [1] 778c558f49a2 ("sched: Add cluster scheduler level in core and related Kconfig for ARM64")
> [2] 66558b730f25 ("sched: Add cluster scheduler level for x86")
> 
> Change for RESEND:
> - Collect tag from Chen Yu and rebase on the latest tip/sched/core. Thanks.
> Link: https://lore.kernel.org/lkml/20220822073610.27205-1-yangyicong@huawei.com/
> 
> Change since v6:
> - rebase on 6.0-rc1
> Link: https://lore.kernel.org/lkml/20220726074758.46686-1-yangyicong@huawei.com/
> 
> Change since v5:
> - Improve patch 2 according to Peter's suggestion:
>   - use sched_cluster_active to indicate whether cluster is active
>   - consider SMT case and use wrap iteration when scanning cluster
> - Add Vincent's tag
> Thanks.
> Link: https://lore.kernel.org/lkml/20220720081150.22167-1-yangyicong@hisilicon.com/
> 
> Change since v4:
> - rename cpus_share_resources to cpus_share_lowest_cache to be more informative, per Tim
> - return -1 when nr==0 in scan_cluster(), per Abel
> Thanks!
> Link: https://lore.kernel.org/lkml/20220609120622.47724-1-yangyicong@hisilicon.com/
> 
> Change since v3:
> - fix compile error when !CONFIG_SCHED_CLUSTER, reported by lkp test.
> Link: https://lore.kernel.org/lkml/20220608095758.60504-1-yangyicong@hisilicon.com/
> 
> Change since v2:
> - leverage SIS_PROP to suspend redundant scanning when LLC is overloaded
> - remove the ping-pong suppression
> - address the comment from Tim, thanks.
> Link: https://lore.kernel.org/lkml/20220126080947.4529-1-yangyicong@hisilicon.com/
> 
> Change since v1:
> - regain the performance data based on v5.17-rc1
> - rename cpus_share_cluster to cpus_share_resources per Vincent and Gautham, thanks!
> Link: https://lore.kernel.org/lkml/20211215041149.73171-1-yangyicong@hisilicon.com/
> 
> Barry Song (2):
>   sched: Add per_cpu cluster domain info and cpus_share_lowest_cache API
>   sched/fair: Scan cluster before scanning LLC in wake-up path
> 
>  include/linux/sched/sd_flags.h |  7 +++++++
>  include/linux/sched/topology.h |  8 +++++++-
>  kernel/sched/core.c            | 12 ++++++++++++
>  kernel/sched/fair.c            | 30 +++++++++++++++++++++++++++---
>  kernel/sched/sched.h           |  3 +++
>  kernel/sched/topology.c        | 25 +++++++++++++++++++++++++
>  6 files changed, 81 insertions(+), 4 deletions(-)
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RESEND PATCH v7 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path
  2022-09-15  7:34 ` [RESEND PATCH v7 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path Yicong Yang
@ 2023-05-22  6:29   ` Chen Yu
  2023-05-22 12:42     ` Yicong Yang
  0 siblings, 1 reply; 8+ messages in thread
From: Chen Yu @ 2023-05-22  6:29 UTC (permalink / raw)
  To: Yicong Yang
  Cc: peterz, mingo, juri.lelli, vincent.guittot, tim.c.chen,
	gautham.shenoy, linux-kernel, linux-arm-kernel, dietmar.eggemann,
	rostedt, bsegall, bristot, prime.zeng, yangyicong,
	jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu,
	hesham.almatary, john.garry, shenyang39, kprateek.nayak,
	wuyun.abel, tim.c.chen

Hi Yicong,
On 2022-09-15 at 15:34:23 +0800, Yicong Yang wrote:
> From: Barry Song <song.bao.hua@hisilicon.com>
> 
> For platforms having clusters like Kunpeng920, CPUs within the same cluster
> have lower latency when synchronizing and accessing shared resources like
> cache. Thus, this patch tries to find an idle cpu within the cluster of the
> target CPU before scanning the whole LLC to gain lower latency.
> 
> Testing has been done on Kunpeng920 by pinning tasks to one NUMA node and
> to two NUMA nodes. On Kunpeng920, each NUMA node has 8 clusters and each
> cluster has 4 CPUs.
> 
> With this patch, we noticed improvement on tbench both within one NUMA node
> and across two NUMA nodes.
> 
> On numa 0:
>                              6.0-rc1                patched
> Hmean     1        351.20 (   0.00%)      396.45 *  12.88%*
> Hmean     2        700.43 (   0.00%)      793.76 *  13.32%*
> Hmean     4       1404.42 (   0.00%)     1583.62 *  12.76%*
> Hmean     8       2833.31 (   0.00%)     3147.85 *  11.10%*
> Hmean     16      5501.90 (   0.00%)     6089.89 *  10.69%*
> Hmean     32     10428.59 (   0.00%)    10619.63 *   1.83%*
> Hmean     64      8223.39 (   0.00%)     8306.93 *   1.02%*
> Hmean     128     7042.88 (   0.00%)     7068.03 *   0.36%*
> 
> On numa 0-1:
>                              6.0-rc1                patched
> Hmean     1        363.06 (   0.00%)      397.13 *   9.38%*
> Hmean     2        721.68 (   0.00%)      789.84 *   9.44%*
> Hmean     4       1435.15 (   0.00%)     1566.01 *   9.12%*
> Hmean     8       2776.17 (   0.00%)     3007.05 *   8.32%*
> Hmean     16      5471.71 (   0.00%)     6103.91 *  11.55%*
> Hmean     32     10164.98 (   0.00%)    11531.81 *  13.45%*
> Hmean     64     17143.28 (   0.00%)    20078.68 *  17.12%*
> Hmean     128    14552.70 (   0.00%)    15156.41 *   4.15%*
> Hmean     256    12827.37 (   0.00%)    13326.86 *   3.89%*
> 
> Note that neither Kunpeng920 nor x86 Jacobsville supports SMT, so the SMT
> branch in the code has not been tested, but it is supposed to work.
>
May I know if this is the latest version to support cluster based wakeup?

I did a double check on Jacobsville (24 CPUs, 1 socket) with this patch applied.
Overall there are obvious improvements for netperf/tbench in throughput:

netperf
=======
case            	load    	baseline(std%)	compare%( std%)
TCP_RR          	6-threads	 1.00 (  0.59)	 +6.63 (  0.71)
TCP_RR          	12-threads	 1.00 (  0.25)	 +5.90 (  0.16)
TCP_RR          	18-threads	 1.00 (  0.39)	 +9.49 (  0.49)
TCP_RR          	24-threads	 1.00 (  0.95)	 +2.61 (  0.94)
TCP_RR          	30-threads	 1.00 (  5.01)	 +2.37 (  3.82)
TCP_RR          	36-threads	 1.00 (  3.73)	 +2.02 (  2.97)
TCP_RR          	42-threads	 1.00 (  3.88)	 +1.99 (  3.96)
TCP_RR          	48-threads	 1.00 (  1.39)	 +1.74 (  1.50)
UDP_RR          	6-threads	 1.00 (  1.31)	 +5.04 (  1.70)
UDP_RR          	12-threads	 1.00 (  0.30)	 +8.18 (  0.20)
UDP_RR          	18-threads	 1.00 (  0.37)	+10.94 (  0.59)
UDP_RR          	24-threads	 1.00 (  0.84)	 +1.12 (  0.99)
UDP_RR          	30-threads	 1.00 (  4.70)	 +1.61 (  6.54)
UDP_RR          	36-threads	 1.00 ( 10.53)	 +1.71 (  2.67)
UDP_RR          	42-threads	 1.00 (  2.52)	 +0.63 (  3.60)
UDP_RR          	48-threads	 1.00 (  1.61)	 +0.12 (  1.27)

tbench
======
case            	load    	baseline(std%)	compare%( std%)
loopback        	6-threads	 1.00 (  0.60)	 +2.94 (  0.23)
loopback        	12-threads	 1.00 (  0.11)	 +4.27 (  0.23)
loopback        	18-threads	 1.00 (  0.12)	+13.45 (  0.14)
loopback        	24-threads	 1.00 (  0.13)	 +0.69 (  0.24)
loopback        	30-threads	 1.00 (  0.34)	 +0.42 (  0.15)
loopback        	36-threads	 1.00 (  0.29)	 +0.58 (  0.07)
loopback        	42-threads	 1.00 (  0.06)	 +0.38 (  0.45)
loopback        	48-threads	 1.00 (  0.04)	 +0.15 (  0.68)

schbench
========
case            	load    	baseline(std%)	compare%( std%)
normal          	1-mthreads	 1.00 (  4.56)	 +3.23 (  0.00)
normal          	2-mthreads	 1.00 (  0.00)	 +0.00 (  0.00)
normal          	4-mthreads	 1.00 ( 11.00)	 -8.82 ( 16.66)
normal          	8-mthreads	 1.00 (  7.10)	 -4.49 (  3.26)

hackbench
=========
case            	load    	baseline(std%)	compare%( std%)
process-pipe    	1-groups	 1.00 (  0.62)	 +4.71 (  0.96)
process-pipe    	2-groups	 1.00 (  0.84)	 +3.56 (  2.35)
process-pipe    	4-groups	 1.00 (  1.56)	 +6.74 (  0.74)
process-pipe    	8-groups	 1.00 ( 14.27)	 +0.85 (  8.34)
process-sockets 	1-groups	 1.00 (  0.36)	 -8.05 (  1.54)
process-sockets 	2-groups	 1.00 (  3.19)	 +1.77 (  2.39)
process-sockets 	4-groups	 1.00 (  1.86)	-29.10 (  2.63)
process-sockets 	8-groups	 1.00 (  1.77)	 -2.94 (  1.55)
threads-pipe    	1-groups	 1.00 (  0.74)	 +6.62 (  0.94)
threads-pipe    	2-groups	 1.00 (  1.28)	 +7.50 (  0.93)
threads-pipe    	4-groups	 1.00 (  0.80)	 +8.72 (  4.54)
threads-pipe    	8-groups	 1.00 (  8.77)	 +6.49 (  7.49)
threads-sockets 	1-groups	 1.00 (  0.43)	 -4.35 (  0.27)
threads-sockets 	2-groups	 1.00 (  0.35)	 -5.60 (  1.86)
threads-sockets 	4-groups	 1.00 (  0.61)	-26.87 (  2.35)
threads-sockets 	8-groups	 1.00 (  0.81)	 -6.60 (  0.62)

And there is a regression in hackbench socket mode, especially in the
4-groups case.

In the 4-groups case, the fd number of each hackbench group is 3, so there
are 3 x 4 x 2 = 24 tasks in the system, which is the same number
as the online CPUs.

I added a schedstats trace and found that it was because the target CPU (used when
the idle CPU scan in select_idle_sibling() fails) is chosen more often than
the previous CPU with this patch applied. And with this patch applied, when
there are 4 groups of hackbench, some CPUs are around 80% utilization, while
without the patch applied, every CPU is nearly 100% utilized. This suggests
that task migration is unnecessary in this case; just putting the wakee on its
previous CPU is optimal and could mitigate the race condition. I did an experiment
that keeps cpus_share_cache() as it is when checking prev cpu and recent_used_cpu,
and the regression was gone (comment below).
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> [https://lore.kernel.org/lkml/Ytfjs+m1kUs0ScSn@worktop.programming.kicks-ass.net]
> Tested-by: Yicong Yang <yangyicong@hisilicon.com>
> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
> Reviewed-by: Chen Yu <yu.c.chen@intel.com>
> ---
>  kernel/sched/fair.c     | 30 +++++++++++++++++++++++++++---
>  kernel/sched/sched.h    |  1 +
>  kernel/sched/topology.c | 10 ++++++++++
>  3 files changed, 38 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4e5b171b1171..e6505b0764c0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6444,6 +6444,30 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>  		}
>  	}
>  
> +	if (static_branch_unlikely(&sched_cluster_active)) {
> +		struct sched_domain *sdc = rcu_dereference(per_cpu(sd_cluster, target));
> +
> +		if (sdc) {
> +			for_each_cpu_wrap(cpu, sched_domain_span(sdc), target + 1) {
> +				if (!cpumask_test_cpu(cpu, cpus))
> +					continue;
> +
> +				if (has_idle_core) {
> +					i = select_idle_core(p, cpu, cpus, &idle_cpu);
> +					if ((unsigned int)i < nr_cpumask_bits)
> +						return i;
> +				} else {
> +					if (--nr <= 0)
> +						return -1;
> +					idle_cpu = __select_idle_cpu(cpu, p);
> +					if ((unsigned int)idle_cpu < nr_cpumask_bits)
> +						return idle_cpu;
> +				}
> +			}
> +			cpumask_andnot(cpus, cpus, sched_domain_span(sdc));
> +		}
> +	}
> +
>  	for_each_cpu_wrap(cpu, cpus, target + 1) {
>  		if (has_idle_core) {
>  			i = select_idle_core(p, cpu, cpus, &idle_cpu);
> @@ -6451,7 +6475,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>  				return i;
>  
>  		} else {
> -			if (!--nr)
> +			if (--nr <= 0)
This change seems to not be needed because if the cluster scan has run out of nr budget,
it will return -1 there, and there's no need to check nr here. But yes, with this
change the code is more readable.
>  				return -1;
>  			idle_cpu = __select_idle_cpu(cpu, p);
>  			if ((unsigned int)idle_cpu < nr_cpumask_bits)
> @@ -6550,7 +6574,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>  	/*
>  	 * If the previous CPU is cache affine and idle, don't be stupid:
>  	 */
> -	if (prev != target && cpus_share_cache(prev, target) &&
> +	if (prev != target && cpus_share_lowest_cache(prev, target) &&
This change impacts hackbench in socket mode a bit. It seems that for hackbench even
putting the wakee on its previous CPU in the same LLC is better than putting it on
current cluster. But it seems to be hackbench specific.

thanks,
Chenyu


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RESEND PATCH v7 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path
  2023-05-22  6:29   ` Chen Yu
@ 2023-05-22 12:42     ` Yicong Yang
  2023-05-23 13:44       ` Chen Yu
  0 siblings, 1 reply; 8+ messages in thread
From: Yicong Yang @ 2023-05-22 12:42 UTC (permalink / raw)
  To: Chen Yu
  Cc: yangyicong, peterz, mingo, juri.lelli, vincent.guittot,
	tim.c.chen, gautham.shenoy, linux-kernel, linux-arm-kernel,
	dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng,
	jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu,
	hesham.almatary, john.garry, shenyang39, kprateek.nayak,
	wuyun.abel, tim.c.chen

Hi Chen,

On 2023/5/22 14:29, Chen Yu wrote:
> Hi Yicong,
> On 2022-09-15 at 15:34:23 +0800, Yicong Yang wrote:
>> From: Barry Song <song.bao.hua@hisilicon.com>
>>
>> For platforms having clusters like Kunpeng920, CPUs within the same cluster
>> have lower latency when synchronizing and accessing shared resources like
>> cache. Thus, this patch tries to find an idle cpu within the cluster of the
>> target CPU before scanning the whole LLC to gain lower latency.
>>
>> Testing has been done on Kunpeng920 by pinning tasks to one NUMA node and
>> to two NUMA nodes. On Kunpeng920, each NUMA node has 8 clusters and each
>> cluster has 4 CPUs.
>>
>> With this patch, we noticed improvement on tbench both within one NUMA node
>> and across two NUMA nodes.
>>
>> On numa 0:
>>                              6.0-rc1                patched
>> Hmean     1        351.20 (   0.00%)      396.45 *  12.88%*
>> Hmean     2        700.43 (   0.00%)      793.76 *  13.32%*
>> Hmean     4       1404.42 (   0.00%)     1583.62 *  12.76%*
>> Hmean     8       2833.31 (   0.00%)     3147.85 *  11.10%*
>> Hmean     16      5501.90 (   0.00%)     6089.89 *  10.69%*
>> Hmean     32     10428.59 (   0.00%)    10619.63 *   1.83%*
>> Hmean     64      8223.39 (   0.00%)     8306.93 *   1.02%*
>> Hmean     128     7042.88 (   0.00%)     7068.03 *   0.36%*
>>
>> On numa 0-1:
>>                              6.0-rc1                patched
>> Hmean     1        363.06 (   0.00%)      397.13 *   9.38%*
>> Hmean     2        721.68 (   0.00%)      789.84 *   9.44%*
>> Hmean     4       1435.15 (   0.00%)     1566.01 *   9.12%*
>> Hmean     8       2776.17 (   0.00%)     3007.05 *   8.32%*
>> Hmean     16      5471.71 (   0.00%)     6103.91 *  11.55%*
>> Hmean     32     10164.98 (   0.00%)    11531.81 *  13.45%*
>> Hmean     64     17143.28 (   0.00%)    20078.68 *  17.12%*
>> Hmean     128    14552.70 (   0.00%)    15156.41 *   4.15%*
>> Hmean     256    12827.37 (   0.00%)    13326.86 *   3.89%*
>>
>> Note that neither Kunpeng920 nor x86 Jacobsville supports SMT, so the SMT
>> branch in the code has not been tested, but it is supposed to work.
>>
> May I know if this is the latest version to support cluster based wakeup?
> 

Yes it's the latest version.

> I did a double check on Jacobsville (24 CPUs, 1 socket) with this patch applied.
> Overall there are obvious improvements for netperf/tbench in throughput:
> 

Thanks for the further information. The results of netperf/tbench look good, as we
imagined: the cluster wakeup is expected to gain more benefit when the system is
under-loaded or well-loaded. May I know how many CPUs share a cluster on Jacobsville?

> netperf
> =======
> case            	load    	baseline(std%)	compare%( std%)
> TCP_RR          	6-threads	 1.00 (  0.59)	 +6.63 (  0.71)
> TCP_RR          	12-threads	 1.00 (  0.25)	 +5.90 (  0.16)
> TCP_RR          	18-threads	 1.00 (  0.39)	 +9.49 (  0.49)
> TCP_RR          	24-threads	 1.00 (  0.95)	 +2.61 (  0.94)
> TCP_RR          	30-threads	 1.00 (  5.01)	 +2.37 (  3.82)
> TCP_RR          	36-threads	 1.00 (  3.73)	 +2.02 (  2.97)
> TCP_RR          	42-threads	 1.00 (  3.88)	 +1.99 (  3.96)
> TCP_RR          	48-threads	 1.00 (  1.39)	 +1.74 (  1.50)
> UDP_RR          	6-threads	 1.00 (  1.31)	 +5.04 (  1.70)
> UDP_RR          	12-threads	 1.00 (  0.30)	 +8.18 (  0.20)
> UDP_RR          	18-threads	 1.00 (  0.37)	+10.94 (  0.59)
> UDP_RR          	24-threads	 1.00 (  0.84)	 +1.12 (  0.99)
> UDP_RR          	30-threads	 1.00 (  4.70)	 +1.61 (  6.54)
> UDP_RR          	36-threads	 1.00 ( 10.53)	 +1.71 (  2.67)
> UDP_RR          	42-threads	 1.00 (  2.52)	 +0.63 (  3.60)
> UDP_RR          	48-threads	 1.00 (  1.61)	 +0.12 (  1.27)
> 
> tbench
> ======
> case            	load    	baseline(std%)	compare%( std%)
> loopback        	6-threads	 1.00 (  0.60)	 +2.94 (  0.23)
> loopback        	12-threads	 1.00 (  0.11)	 +4.27 (  0.23)
> loopback        	18-threads	 1.00 (  0.12)	+13.45 (  0.14)
> loopback        	24-threads	 1.00 (  0.13)	 +0.69 (  0.24)
> loopback        	30-threads	 1.00 (  0.34)	 +0.42 (  0.15)
> loopback        	36-threads	 1.00 (  0.29)	 +0.58 (  0.07)
> loopback        	42-threads	 1.00 (  0.06)	 +0.38 (  0.45)
> loopback        	48-threads	 1.00 (  0.04)	 +0.15 (  0.68)
> 
> schbench
> ========
> case            	load    	baseline(std%)	compare%( std%)
> normal          	1-mthreads	 1.00 (  4.56)	 +3.23 (  0.00)
> normal          	2-mthreads	 1.00 (  0.00)	 +0.00 (  0.00)
> normal          	4-mthreads	 1.00 ( 11.00)	 -8.82 ( 16.66)
> normal          	8-mthreads	 1.00 (  7.10)	 -4.49 (  3.26)
> 
> hackbench
> =========
> case            	load    	baseline(std%)	compare%( std%)
> process-pipe    	1-groups	 1.00 (  0.62)	 +4.71 (  0.96)
> process-pipe    	2-groups	 1.00 (  0.84)	 +3.56 (  2.35)
> process-pipe    	4-groups	 1.00 (  1.56)	 +6.74 (  0.74)
> process-pipe    	8-groups	 1.00 ( 14.27)	 +0.85 (  8.34)
> process-sockets 	1-groups	 1.00 (  0.36)	 -8.05 (  1.54)
> process-sockets 	2-groups	 1.00 (  3.19)	 +1.77 (  2.39)
> process-sockets 	4-groups	 1.00 (  1.86)	-29.10 (  2.63)
> process-sockets 	8-groups	 1.00 (  1.77)	 -2.94 (  1.55)
> threads-pipe    	1-groups	 1.00 (  0.74)	 +6.62 (  0.94)
> threads-pipe    	2-groups	 1.00 (  1.28)	 +7.50 (  0.93)
> threads-pipe    	4-groups	 1.00 (  0.80)	 +8.72 (  4.54)
> threads-pipe    	8-groups	 1.00 (  8.77)	 +6.49 (  7.49)
> threads-sockets 	1-groups	 1.00 (  0.43)	 -4.35 (  0.27)
> threads-sockets 	2-groups	 1.00 (  0.35)	 -5.60 (  1.86)
> threads-sockets 	4-groups	 1.00 (  0.61)	-26.87 (  2.35)
> threads-sockets 	8-groups	 1.00 (  0.81)	 -6.60 (  0.62)
> 
> And there is a regression in hackbench socket mode, especially in the
> 4-groups case.
> 
> In the 4-groups case, the fd number of each hackbench group is 3, so there
> are 3 x 4 x 2 = 24 tasks in the system, which is the same number
> as the online CPUs.
> 
> I added a schedstats trace and found that it was because the target CPU (used when
> the idle CPU scan in select_idle_sibling() fails) is chosen more often than
> the previous CPU with this patch applied. And with this patch applied, when
> there are 4 groups of hackbench, some CPUs are around 80% utilization, while
> without the patch applied, every CPU is nearly 100% utilized. This suggests
> that task migration is unnecessary in this case; just putting the wakee on its
> previous CPU is optimal and could mitigate the race condition. I did an experiment
> that keeps cpus_share_cache() as it is when checking prev cpu and recent_used_cpu,
> and the regression was gone (comment below).

Thanks for the information, see the reply below...

>> Suggested-by: Peter Zijlstra <peterz@infradead.org>
>> [https://lore.kernel.org/lkml/Ytfjs+m1kUs0ScSn@worktop.programming.kicks-ass.net]
>> Tested-by: Yicong Yang <yangyicong@hisilicon.com>
>> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
>> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
>> Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
>> Reviewed-by: Chen Yu <yu.c.chen@intel.com>
>> ---
>>  kernel/sched/fair.c     | 30 +++++++++++++++++++++++++++---
>>  kernel/sched/sched.h    |  1 +
>>  kernel/sched/topology.c | 10 ++++++++++
>>  3 files changed, 38 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 4e5b171b1171..e6505b0764c0 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6444,6 +6444,30 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>  		}
>>  	}
>>  
>> +	if (static_branch_unlikely(&sched_cluster_active)) {
>> +		struct sched_domain *sdc = rcu_dereference(per_cpu(sd_cluster, target));
>> +
>> +		if (sdc) {
>> +			for_each_cpu_wrap(cpu, sched_domain_span(sdc), target + 1) {
>> +				if (!cpumask_test_cpu(cpu, cpus))
>> +					continue;
>> +
>> +				if (has_idle_core) {
>> +					i = select_idle_core(p, cpu, cpus, &idle_cpu);
>> +					if ((unsigned int)i < nr_cpumask_bits)
>> +						return i;
>> +				} else {
>> +					if (--nr <= 0)
>> +						return -1;
>> +					idle_cpu = __select_idle_cpu(cpu, p);
>> +					if ((unsigned int)idle_cpu < nr_cpumask_bits)
>> +						return idle_cpu;
>> +				}
>> +			}
>> +			cpumask_andnot(cpus, cpus, sched_domain_span(sdc));
>> +		}
>> +	}
>> +
>>  	for_each_cpu_wrap(cpu, cpus, target + 1) {
>>  		if (has_idle_core) {
>>  			i = select_idle_core(p, cpu, cpus, &idle_cpu);
>> @@ -6451,7 +6475,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>  				return i;
>>  
>>  		} else {
>> -			if (!--nr)
>> +			if (--nr <= 0)
> This change seems to not be needed because if the cluster scan has run out of nr budget,
> it will return -1 there, and there's no need to check nr here. But yes, with this
> change the code is more readable.
>>  				return -1;
>>  			idle_cpu = __select_idle_cpu(cpu, p);
>>  			if ((unsigned int)idle_cpu < nr_cpumask_bits)
>> @@ -6550,7 +6574,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>>  	/*
>>  	 * If the previous CPU is cache affine and idle, don't be stupid:
>>  	 */
>> -	if (prev != target && cpus_share_cache(prev, target) &&
>> +	if (prev != target && cpus_share_lowest_cache(prev, target) &&
> This change impacts hackbench in socket mode a bit. It seems that for hackbench even
> putting the wakee on its previous CPU in the same LLC is better than putting it on
> current cluster. But it seems to be hackbench specific.
> 

...without this, do you still see the same improvement in the under-loaded case (threads less than or
equal to the CPU number) for tbench/netperf? The idea here is to always try to wake up in the same
cluster as the target to benefit from the cluster cache, but the early test for the prev and recent
used cpu may break that. If we keep it as is, at low load the prev cpu or recent used cpu gets more
chance to be idle, so we have less chance to benefit from the cluster and gain less performance
improvement.

In the hackbench case, as you noticed, the utilization can ideally reach 100%, so SIS_UTIL
will regulate the scanning number to around 4. If the prev/recent used CPU is not in the same
cluster as the target, we will scan the cluster first, and when no idle CPU is found and we have
run out of the scanning number, we'll fall back to waking up on the target. That may be the reason
why more wakeups were observed on the target rather than the previous CPU.

In this case I wonder whether choosing the prev cpu or recent used cpu after scanning the cluster
can help the situation here, like the snippet below (kind of messy though).

Thanks,
Yicong

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ecdc7970566e..5a25cb680350 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6939,7 +6939,8 @@ static inline int select_idle_smt(struct task_struct *p, int target)
  * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
  * average idle time for this rq (as found in rq->avg_idle).
  */
-static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
+static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target,
+                          int aff_prev_cpu, int aff_recent_cpu)
 {
        struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_rq_mask);
        int i, cpu, idle_cpu = -1, nr = INT_MAX;
@@ -7008,12 +7009,22 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
                                                return i;
                                } else {
                                        if (--nr <= 0)
-                                               return -1;
+                                               break;
                                        idle_cpu = __select_idle_cpu(cpu, p);
                                        if ((unsigned int)idle_cpu < nr_cpumask_bits)
                                                return idle_cpu;
                                }
                        }
+
+                       /*
+                        * No idle CPU in the target's cluster found, check task's
+                        * prev_cpu and recent_used_cpu first for better affinity.
+                        */
+                       if ((unsigned int)aff_prev_cpu < nr_cpumask_bits)
+                               return aff_prev_cpu;
+                       else if ((unsigned int)aff_recent_cpu < nr_cpumask_bits)
+                               return aff_recent_cpu;
+
                        cpumask_andnot(cpus, cpus, sched_domain_span(sdc));
                }
        }
@@ -7128,6 +7139,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
        struct sched_domain *sd;
        unsigned long task_util, util_min, util_max;
        int i, recent_used_cpu;
+       int aff_prev_cpu = -1, aff_recent_cpu = -1;

        /*
         * On asymmetric system, update task utilization because we will check
@@ -7152,10 +7164,14 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
        /*
         * If the previous CPU is cache affine and idle, don't be stupid:
         */
-       if (prev != target && cpus_share_lowest_cache(prev, target) &&
+       if (prev != target &&
            (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
-           asym_fits_cpu(task_util, util_min, util_max, prev))
-               return prev;
+           asym_fits_cpu(task_util, util_min, util_max, prev)) {
+               if (cpus_share_lowest_cache(prev, target))
+                       return prev;
+               else if (cpus_share_cache(prev, target))
+                       aff_prev_cpu = prev;
+       }

        /*
         * Allow a per-cpu kthread to stack with the wakee if the
@@ -7178,11 +7194,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
        p->recent_used_cpu = prev;
        if (recent_used_cpu != prev &&
            recent_used_cpu != target &&
-           cpus_share_lowest_cache(recent_used_cpu, target) &&
            (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
            cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
            asym_fits_cpu(task_util, util_min, util_max, recent_used_cpu)) {
-               return recent_used_cpu;
+               if (cpus_share_lowest_cache(recent_used_cpu, target))
+                       return recent_used_cpu;
+               else if (cpus_share_cache(recent_used_cpu, target))
+                       aff_recent_cpu = recent_used_cpu;
        }

        /*
@@ -7219,7 +7237,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
                }
        }

-       i = select_idle_cpu(p, sd, has_idle_core, target);
+       i = select_idle_cpu(p, sd, has_idle_core, target, aff_prev_cpu, aff_recent_cpu);
        if ((unsigned)i < nr_cpumask_bits)
                return i;


> thanks,
> Chenyu
> .
> 


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [RESEND PATCH v7 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path
  2023-05-22 12:42     ` Yicong Yang
@ 2023-05-23 13:44       ` Chen Yu
  2023-05-24  8:05         ` Yicong Yang
  0 siblings, 1 reply; 8+ messages in thread
From: Chen Yu @ 2023-05-23 13:44 UTC (permalink / raw)
  To: Yicong Yang
  Cc: yangyicong, peterz, mingo, juri.lelli, vincent.guittot,
	tim.c.chen, gautham.shenoy, linux-kernel, linux-arm-kernel,
	dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng,
	jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu,
	hesham.almatary, john.garry, shenyang39, kprateek.nayak,
	wuyun.abel, tim.c.chen

On 2023-05-22 at 20:42:19 +0800, Yicong Yang wrote:
> Hi Chen,
> 
> On 2023/5/22 14:29, Chen Yu wrote:
> > Hi Yicong,
> > On 2022-09-15 at 15:34:23 +0800, Yicong Yang wrote:
> >> From: Barry Song <song.bao.hua@hisilicon.com>
> >>
[snip...]
> 
> Thanks for the further information. The results of netperf/tbench look good, as we
> imagined: the cluster wakeup is expected to gain more benefit when the system is
> under-loaded or well-loaded. May I know how many CPUs share a cluster on Jacobsville?
>
There are 4 CPUs per cluster on Jacobsville.
[snip...]
> >> @@ -6550,7 +6574,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
> >>  	/*
> >>  	 * If the previous CPU is cache affine and idle, don't be stupid:
> >>  	 */
> >> -	if (prev != target && cpus_share_cache(prev, target) &&
> >> +	if (prev != target && cpus_share_lowest_cache(prev, target) &&
> > This change impacts hackbench in socket mode a bit. It seems that for hackbench even
> > putting the wakee on its previous CPU in the same LLC is better than putting it on
> > current cluster. But it seems to be hackbench specific.
> > 
> 
> ...without this, do you still see the same improvement in the under-loaded case (threads less than or
> equal to the CPU number) for tbench/netperf? The idea here is to always try to wake up in the same
> cluster as the target to benefit from the cluster cache, but the early test for the prev and recent
> used cpu may break that. If we keep it as is, at low load the prev cpu or recent used cpu gets more
> chance to be idle, so we have less chance to benefit from the cluster and gain less performance
> improvement.
> 
Right. Without the above change I saw lower improvement in the lightly loaded case for netperf/tbench.
> In the hackbench case, as you noticed, the utilization can ideally reach 100%, so SIS_UTIL
> will regulate the scanning number to around 4. If the prev/recent used CPU is not in the same
> cluster as the target, we will scan the cluster first, and when no idle CPU is found and we have
> run out of the scanning number, we'll fall back to waking up on the target. That may be the reason
> why more wakeups were observed on the target rather than the previous CPU.
> 
Looks reasonable. When the scanning budget is low, we cannot find an idle target
on the local cluster and we terminate the scan before checking an idle prev on a remote cluster,
although that prev could be a better choice than the target cpu.
> In this case I wonder whether choosing the prev cpu or recent used cpu after scanning the cluster
> can help the situation here, like the snippet below (kind of messy though).
> 
This change makes sense to me. I only modified it a little bit to give only prev a second
chance. With your patch applied, the improvement of netperf/tbench remains while the big
hackbench regression is gone.

hackbench
=========
case            	load    	baseline(std%)	compare%( std%)
process-pipe    	1-groups	 1.00 (  2.35)	 -0.65 (  1.81)
process-pipe    	2-groups	 1.00 (  0.42)	 -2.16 (  1.12)
process-pipe    	4-groups	 1.00 (  1.84)	 +0.72 (  1.34)
process-pipe    	8-groups	 1.00 (  2.81)	 +1.12 (  3.88)
process-sockets 	1-groups	 1.00 (  1.88)	 -0.99 (  4.84)
process-sockets 	2-groups	 1.00 (  5.49)	 -4.50 (  4.09)
process-sockets 	4-groups	 1.00 (  3.54)	 +2.28 (  3.13)
process-sockets 	8-groups	 1.00 (  0.79)	 -0.13 (  1.28)
threads-pipe    	1-groups	 1.00 (  1.73)	 -2.39 (  0.40)
threads-pipe    	2-groups	 1.00 (  0.73)	 +2.88 (  1.94)
threads-pipe    	4-groups	 1.00 (  0.64)	 +1.12 (  1.82)
threads-pipe    	8-groups	 1.00 (  1.55)	 -1.59 (  1.20)
threads-sockets 	1-groups	 1.00 (  3.76)	 +3.21 (  3.56)
threads-sockets 	2-groups	 1.00 (  1.20)	 -5.56 (  2.64)
threads-sockets 	4-groups	 1.00 (  2.65)	 +1.48 (  4.91)
threads-sockets 	8-groups	 1.00 (  0.08)	 +0.18 (  0.15)

netperf
=======
case            	load    	baseline(std%)	compare%( std%)
TCP_RR          	6-threads	 1.00 (  0.91)	 +2.87 (  0.83)
TCP_RR          	12-threads	 1.00 (  0.22)	 +3.48 (  0.31)
TCP_RR          	18-threads	 1.00 (  0.41)	 +7.81 (  0.48)
TCP_RR          	24-threads	 1.00 (  1.02)	 -0.32 (  1.25)
TCP_RR          	30-threads	 1.00 (  4.67)	 -0.04 (  5.14)
TCP_RR          	36-threads	 1.00 (  4.53)	 -0.13 (  4.37)
TCP_RR          	42-threads	 1.00 (  3.92)	 -0.15 (  3.07)
TCP_RR          	48-threads	 1.00 (  2.07)	 -0.17 (  1.52)
UDP_RR          	6-threads	 1.00 (  0.98)	 +4.50 (  2.38)
UDP_RR          	12-threads	 1.00 (  0.26)	 +3.64 (  0.25)
UDP_RR          	18-threads	 1.00 (  0.27)	 +9.93 (  0.55)
UDP_RR          	24-threads	 1.00 (  1.22)	 +0.13 (  1.33)
UDP_RR          	30-threads	 1.00 (  3.86)	 -0.03 (  5.05)
UDP_RR          	36-threads	 1.00 (  2.81)	 +0.10 (  3.37)
UDP_RR          	42-threads	 1.00 (  3.51)	 -0.26 (  2.94)
UDP_RR          	48-threads	 1.00 ( 12.54)	 +0.74 (  9.44)

tbench
======
case            	load    	baseline(std%)	compare%( std%)
loopback        	6-threads	 1.00 (  0.04)	 +2.94 (  0.26)
loopback        	12-threads	 1.00 (  0.30)	 +4.58 (  0.12)
loopback        	18-threads	 1.00 (  0.37)	+12.38 (  0.10)
loopback        	24-threads	 1.00 (  0.56)	 -0.27 (  0.50)
loopback        	30-threads	 1.00 (  0.17)	 -0.18 (  0.06)
loopback        	36-threads	 1.00 (  0.25)	 -0.73 (  0.44)
loopback        	42-threads	 1.00 (  0.10)	 -0.22 (  0.18)
loopback        	48-threads	 1.00 (  0.29)	 -0.48 (  0.19)

schbench
========
case            	load    	baseline(std%)	compare%( std%)
normal          	1-mthreads	 1.00 (  0.00)	 +0.00 (  0.00)
normal          	2-mthreads	 1.00 (  0.00)	 +0.00 (  0.00)
normal          	4-mthreads	 1.00 (  6.80)	 +2.78 (  8.08)
normal          	8-mthreads	 1.00 (  3.65)	 -0.23 (  4.30)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0989116b0796..07495b44c68f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7127,7 +7127,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	bool has_idle_core = false;
 	struct sched_domain *sd;
 	unsigned long task_util, util_min, util_max;
-	int i, recent_used_cpu;
+	int i, recent_used_cpu, prev_aff = -1;
 
 	/*
 	 * On asymmetric system, update task utilization because we will check
@@ -7152,10 +7152,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	/*
 	 * If the previous CPU is cache affine and idle, don't be stupid:
 	 */
-	if (prev != target && cpus_share_lowest_cache(prev, target) &&
+	if (prev != target && cpus_share_cache(prev, target) &&
 	    (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
-	    asym_fits_cpu(task_util, util_min, util_max, prev))
-		return prev;
+	    asym_fits_cpu(task_util, util_min, util_max, prev)) {
+		if (cpus_share_lowest_cache(prev, target))
+			return prev;
+		prev_aff = prev;
+	}
 
 	/*
 	 * Allow a per-cpu kthread to stack with the wakee if the
@@ -7223,6 +7226,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if ((unsigned)i < nr_cpumask_bits)
 		return i;
 
+	/*
+	 * Give prev another chance, in case prev has not been
+	 * scanned in select_idle_cpu() due to nr constrain.
+	 */
+	if (prev_aff != -1)
+		return prev_aff;
+
 	return target;
 }
 

thanks,
Chenyu


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [RESEND PATCH v7 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path
  2023-05-23 13:44       ` Chen Yu
@ 2023-05-24  8:05         ` Yicong Yang
  0 siblings, 0 replies; 8+ messages in thread
From: Yicong Yang @ 2023-05-24  8:05 UTC (permalink / raw)
  To: Chen Yu
  Cc: yangyicong, peterz, mingo, juri.lelli, vincent.guittot,
	tim.c.chen, gautham.shenoy, linux-kernel, linux-arm-kernel,
	dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng,
	jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu,
	hesham.almatary, john.garry, shenyang39, kprateek.nayak,
	wuyun.abel, tim.c.chen

On 2023/5/23 21:44, Chen Yu wrote:
> On 2023-05-22 at 20:42:19 +0800, Yicong Yang wrote:
>> Hi Chen,
>>
>> On 2023/5/22 14:29, Chen Yu wrote:
>>> Hi Yicong,
>>> On 2022-09-15 at 15:34:23 +0800, Yicong Yang wrote:
>>>> From: Barry Song <song.bao.hua@hisilicon.com>
>>>>
> [snip...]
>>
>> Thanks for the further information. The netperf/tbench results look as we expected: the cluster
>> wakeup is expected to bring more benefit when the system is under-loaded or moderately loaded.
>> May I know how many CPUs share a cluster on Jacobsville?
>>
> There are 4 CPUs per cluster on Jacobsville.
> [snip...]
>>>> @@ -6550,7 +6574,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>>>>  	/*
>>>>  	 * If the previous CPU is cache affine and idle, don't be stupid:
>>>>  	 */
>>>> -	if (prev != target && cpus_share_cache(prev, target) &&
>>>> +	if (prev != target && cpus_share_lowest_cache(prev, target) &&
>>> This change impacts hackbench in socket mode a bit. It seems that for hackbench even
>>> putting the wakee on its previous CPU in the same LLC is better than putting it on the
>>> current cluster. But this seems to be hackbench-specific.
>>>
>>
>> ...without this change, do you still see the same improvement in the under-loaded case (thread
>> count less than or equal to the CPU count) for tbench/netperf?
>> The idea here is to always try to wake up in the same cluster as the target to benefit from the
>> cluster cache, but the early tests for the prev and recent_used CPUs may break that. If we keep
>> it as is, at low load the prev or recent_used CPU has a better chance to be idle, so we get less
>> chance to benefit from the cluster and gain less performance improvement.
>>
> Right. Without the above change I saw a lower improvement in the lightly loaded case for netperf/tbench.
>> In the hackbench case, as you noticed, the utilization can ideally reach 100%, so SIS_UTIL will
>> regulate the scan number down to around 4. If the prev/recent_used CPU is not in the same
>> cluster as the target, we scan the cluster first; when no idle CPU is found and the scan number
>> has run out, we fall back to waking up on the target. That may be the reason why more wakeups
>> were observed on the target rather than on the previous CPU.
>>
> Looks reasonable. When the scan budget is low, we cannot find an idle CPU in the local cluster
> and we stop before scanning for an idle prev in the remote cluster, although that prev could be
> a better choice than the target CPU.
>> In this case I wonder whether choosing the prev CPU or recent_used CPU after scanning the
>> cluster can help the situation here, like the snippet below (it is kind of messy though).
>>
> This change makes sense to me. I only modified it a little bit so that only prev gets a second
> chance. With your patch applied, the netperf/tbench improvement remains while the big hackbench
> regression is gone.
> 

Thanks for the test, it looks justified to have this. I will include this change in the next
version. For reference, a rough sketch of the resulting selection order follows below.
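
With the hunk you quoted below applied, the order in select_idle_sibling() becomes roughly: an
idle, cluster-affine prev; then whatever the (budget-limited) scan finds; then an idle, merely
LLC-affine prev; then the target. A stand-alone toy model of just that order (illustration only,
not kernel code: the topology, idle state and helper names are made up here, and the
recent_used/per-cpu-kthread/asym checks are left out):

/* Toy model of the wakeup fallback order -- not kernel code. */
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS          8
#define CPUS_PER_CLUSTER 4                    /* two clusters sharing one LLC */

static bool idle[NR_CPUS] = { [5] = true };   /* pretend only CPU5 is idle */

static bool share_cluster(int a, int b)       /* stands in for cpus_share_lowest_cache() */
{
        return a / CPUS_PER_CLUSTER == b / CPUS_PER_CLUSTER;
}

static bool share_llc(int a, int b)           /* stands in for cpus_share_cache() */
{
        (void)a; (void)b;
        return true;                          /* single LLC in this model */
}

static int scan_for_idle_cpu(int target, int nr)  /* stands in for the SIS_UTIL-limited scan */
{
        for (int i = 0; i < nr; i++) {
                int cpu = (target + i) % NR_CPUS;
                if (idle[cpu])
                        return cpu;
        }
        return -1;
}

static int pick_wakeup_cpu(int prev, int target, int nr)
{
        int prev_aff = -1, i;

        if (prev != target && share_llc(prev, target) && idle[prev]) {
                if (share_cluster(prev, target))
                        return prev;          /* idle and cluster-affine: take it */
                prev_aff = prev;              /* remember for a second chance */
        }

        i = scan_for_idle_cpu(target, nr);
        if (i >= 0)
                return i;

        if (prev_aff != -1)                   /* scan budget ran out before reaching prev */
                return prev_aff;

        return target;
}

int main(void)
{
        /* prev=5 is idle but in the other cluster; a scan budget of 4 misses it */
        printf("picked CPU%d\n", pick_wakeup_cpu(5, 0, 4));
        return 0;
}

In this toy case the second chance is what lets the idle prev (CPU5) win over the busy target.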

> hackbench
> =========
> case            	load    	baseline(std%)	compare%( std%)
> process-pipe    	1-groups	 1.00 (  2.35)	 -0.65 (  1.81)
> process-pipe    	2-groups	 1.00 (  0.42)	 -2.16 (  1.12)
> process-pipe    	4-groups	 1.00 (  1.84)	 +0.72 (  1.34)
> process-pipe    	8-groups	 1.00 (  2.81)	 +1.12 (  3.88)
> process-sockets 	1-groups	 1.00 (  1.88)	 -0.99 (  4.84)
> process-sockets 	2-groups	 1.00 (  5.49)	 -4.50 (  4.09)
> process-sockets 	4-groups	 1.00 (  3.54)	 +2.28 (  3.13)
> process-sockets 	8-groups	 1.00 (  0.79)	 -0.13 (  1.28)
> threads-pipe    	1-groups	 1.00 (  1.73)	 -2.39 (  0.40)
> threads-pipe    	2-groups	 1.00 (  0.73)	 +2.88 (  1.94)
> threads-pipe    	4-groups	 1.00 (  0.64)	 +1.12 (  1.82)
> threads-pipe    	8-groups	 1.00 (  1.55)	 -1.59 (  1.20)
> threads-sockets 	1-groups	 1.00 (  3.76)	 +3.21 (  3.56)
> threads-sockets 	2-groups	 1.00 (  1.20)	 -5.56 (  2.64)
> threads-sockets 	4-groups	 1.00 (  2.65)	 +1.48 (  4.91)
> threads-sockets 	8-groups	 1.00 (  0.08)	 +0.18 (  0.15)
> 
> netperf
> =======
> case            	load    	baseline(std%)	compare%( std%)
> TCP_RR          	6-threads	 1.00 (  0.91)	 +2.87 (  0.83)
> TCP_RR          	12-threads	 1.00 (  0.22)	 +3.48 (  0.31)
> TCP_RR          	18-threads	 1.00 (  0.41)	 +7.81 (  0.48)
> TCP_RR          	24-threads	 1.00 (  1.02)	 -0.32 (  1.25)
> TCP_RR          	30-threads	 1.00 (  4.67)	 -0.04 (  5.14)
> TCP_RR          	36-threads	 1.00 (  4.53)	 -0.13 (  4.37)
> TCP_RR          	42-threads	 1.00 (  3.92)	 -0.15 (  3.07)
> TCP_RR          	48-threads	 1.00 (  2.07)	 -0.17 (  1.52)
> UDP_RR          	6-threads	 1.00 (  0.98)	 +4.50 (  2.38)
> UDP_RR          	12-threads	 1.00 (  0.26)	 +3.64 (  0.25)
> UDP_RR          	18-threads	 1.00 (  0.27)	 +9.93 (  0.55)
> UDP_RR          	24-threads	 1.00 (  1.22)	 +0.13 (  1.33)
> UDP_RR          	30-threads	 1.00 (  3.86)	 -0.03 (  5.05)
> UDP_RR          	36-threads	 1.00 (  2.81)	 +0.10 (  3.37)
> UDP_RR          	42-threads	 1.00 (  3.51)	 -0.26 (  2.94)
> UDP_RR          	48-threads	 1.00 ( 12.54)	 +0.74 (  9.44)
> 
> tbench
> ======
> case            	load    	baseline(std%)	compare%( std%)
> loopback        	6-threads	 1.00 (  0.04)	 +2.94 (  0.26)
> loopback        	12-threads	 1.00 (  0.30)	 +4.58 (  0.12)
> loopback        	18-threads	 1.00 (  0.37)	+12.38 (  0.10)
> loopback        	24-threads	 1.00 (  0.56)	 -0.27 (  0.50)
> loopback        	30-threads	 1.00 (  0.17)	 -0.18 (  0.06)
> loopback        	36-threads	 1.00 (  0.25)	 -0.73 (  0.44)
> loopback        	42-threads	 1.00 (  0.10)	 -0.22 (  0.18)
> loopback        	48-threads	 1.00 (  0.29)	 -0.48 (  0.19)
> 
> schbench
> ========
> case            	load    	baseline(std%)	compare%( std%)
> normal          	1-mthreads	 1.00 (  0.00)	 +0.00 (  0.00)
> normal          	2-mthreads	 1.00 (  0.00)	 +0.00 (  0.00)
> normal          	4-mthreads	 1.00 (  6.80)	 +2.78 (  8.08)
> normal          	8-mthreads	 1.00 (  3.65)	 -0.23 (  4.30)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0989116b0796..07495b44c68f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7127,7 +7127,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>  	bool has_idle_core = false;
>  	struct sched_domain *sd;
>  	unsigned long task_util, util_min, util_max;
> -	int i, recent_used_cpu;
> +	int i, recent_used_cpu, prev_aff = -1;
>  
>  	/*
>  	 * On asymmetric system, update task utilization because we will check
> @@ -7152,10 +7152,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>  	/*
>  	 * If the previous CPU is cache affine and idle, don't be stupid:
>  	 */
> -	if (prev != target && cpus_share_lowest_cache(prev, target) &&
> +	if (prev != target && cpus_share_cache(prev, target) &&
>  	    (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
> -	    asym_fits_cpu(task_util, util_min, util_max, prev))
> -		return prev;
> +	    asym_fits_cpu(task_util, util_min, util_max, prev)) {
> +		if (cpus_share_lowest_cache(prev, target))
> +			return prev;
> +		prev_aff = prev;
> +	}
>  
>  	/*
>  	 * Allow a per-cpu kthread to stack with the wakee if the
> @@ -7223,6 +7226,13 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
>  	if ((unsigned)i < nr_cpumask_bits)
>  		return i;
>  
> +	/*
> +	 * Give prev another chance, in case prev has not been
> +	 * scanned in select_idle_cpu() due to the nr constraint.
> +	 */
> +	if (prev_aff != -1)
> +		return prev_aff;
> +

It looks neater. Based on the current implementation we should also give recent_used_cpu such a
chance, if it does no harm; a rough sketch of that idea is below.
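
A rough extension of the toy model earlier in this mail (it replaces pick_wakeup_cpu() there and
reuses its stub helpers; main() would pass a recent_used CPU as well). Illustration only and
untested: the real kernel version still needs the cpus_ptr/sched_idle/asym_fits_cpu checks that
the model leaves out, and preferring prev over recent_used as the fallback is just one possible
choice.

/*
 * Give recent_used the same treatment as prev: return it directly when it is
 * idle and cluster-affine, otherwise keep it as a post-scan fallback -- but
 * prefer prev as the fallback when both qualify.
 */
static int pick_wakeup_cpu(int prev, int recent_used, int target, int nr)
{
        int fallback = -1, i;

        if (prev != target && share_llc(prev, target) && idle[prev]) {
                if (share_cluster(prev, target))
                        return prev;
                fallback = prev;
        }

        if (recent_used != prev && recent_used != target &&
            share_llc(recent_used, target) && idle[recent_used]) {
                if (share_cluster(recent_used, target))
                        return recent_used;
                if (fallback == -1)
                        fallback = recent_used;
        }

        i = scan_for_idle_cpu(target, nr);
        if (i >= 0)
                return i;

        if (fallback != -1)
                return fallback;

        return target;
}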

Thanks,
Yicong


end of thread, other threads:[~2023-05-24  8:06 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-15  7:34 [RESEND PATCH v7 0/2] sched/fair: Scan cluster before scanning LLC in wake-up path Yicong Yang
2022-09-15  7:34 ` [RESEND PATCH v7 1/2] sched: Add per_cpu cluster domain info and cpus_share_lowest_cache API Yicong Yang
2022-09-15  7:34 ` [RESEND PATCH v7 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path Yicong Yang
2023-05-22  6:29   ` Chen Yu
2023-05-22 12:42     ` Yicong Yang
2023-05-23 13:44       ` Chen Yu
2023-05-24  8:05         ` Yicong Yang
2022-09-26 14:52 ` [RESEND PATCH v7 0/2] " Yicong Yang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).