linux-arm-kernel.lists.infradead.org archive mirror
* [PATCH v7 0/2] sched/fair: Scan cluster before scanning LLC in wake-up path
@ 2022-08-22  7:36 Yicong Yang
  2022-08-22  7:36 ` [PATCH v7 1/2] sched: Add per_cpu cluster domain info and cpus_share_lowest_cache API Yicong Yang
From: Yicong Yang @ 2022-08-22  7:36 UTC (permalink / raw)
  To: peterz, mingo, juri.lelli, vincent.guittot, tim.c.chen,
	gautham.shenoy, linux-kernel, linux-arm-kernel
  Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng,
	yangyicong, jonathan.cameron, ego, srikar, linuxarm, 21cnbao,
	guodong.xu, hesham.almatary, john.garry, shenyang39,
	kprateek.nayak, yu.c.chen, wuyun.abel

From: Yicong Yang <yangyicong@hisilicon.com>

This is the follow-up work to support the cluster scheduler. Previously
we added a cluster level in the scheduler for both ARM64[1] and
x86[2] to support load balancing between clusters, which brings more
memory bandwidth and decreases cache contention. This patchset, on the
other hand, takes care of the wake-up path by giving CPUs within the
same cluster a try before scanning the whole LLC, to benefit tasks
that communicate with each other.

[1] 778c558f49a2 ("sched: Add cluster scheduler level in core and related Kconfig for ARM64")
[2] 66558b730f25 ("sched: Add cluster scheduler level for x86")
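
For readers skimming the thread, the idea can be sketched as follows
(simplified pseudo-C rather than the exact kernel code; cluster_span()
and llc_span() are made-up helpers standing in for the scheduler
domain spans the real patches use):

    /* Simplified scan order of select_idle_cpu() after this series. */
    static int scan_for_idle_cpu(int target)
    {
            int cpu;

            /* 1) Try the CPUs sharing the target's cluster first. */
            for_each_cpu(cpu, cluster_span(target))
                    if (idle_cpu(cpu))
                            return cpu;

            /* 2) Then fall back to the rest of the LLC. */
            for_each_cpu(cpu, llc_span(target)) {
                    if (cpumask_test_cpu(cpu, cluster_span(target)))
                            continue;       /* already scanned */
                    if (idle_cpu(cpu))
                            return cpu;
            }

            return -1;
    }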

Change since v6:
- rebase on 6.0-rc1
Link: https://lore.kernel.org/lkml/20220726074758.46686-1-yangyicong@huawei.com/

Change since v5:
- Improve patch 2 according to Peter's suggestion:
  - use sched_cluster_active to indicate whether cluster is active
  - consider SMT case and use wrap iteration when scanning cluster
- Add Vincent's tag
Thanks.
Link: https://lore.kernel.org/lkml/20220720081150.22167-1-yangyicong@hisilicon.com/

Change since v4:
- rename cpus_share_resources to cpus_share_lowest_cache to be more informative, per Tim
- return -1 when nr==0 in scan_cluster(), per Abel
Thanks!
Link: https://lore.kernel.org/lkml/20220609120622.47724-1-yangyicong@hisilicon.com/

Change since v3:
- fix compile error when !CONFIG_SCHED_CLUSTER, reported by lkp test.
Link: https://lore.kernel.org/lkml/20220608095758.60504-1-yangyicong@hisilicon.com/

Change since v2:
- leverage SIS_PROP to suspend redundant scanning when LLC is overloaded
- remove the ping-pong suppression
- address the comment from Tim, thanks.
Link: https://lore.kernel.org/lkml/20220126080947.4529-1-yangyicong@hisilicon.com/

Change since v1:
- re-collect the performance data based on v5.17-rc1
- rename cpus_share_cluster to cpus_share_resources per Vincent and Gautham, thanks!
Link: https://lore.kernel.org/lkml/20211215041149.73171-1-yangyicong@hisilicon.com/


Barry Song (2):
  sched: Add per_cpu cluster domain info and cpus_share_lowest_cache API
  sched/fair: Scan cluster before scanning LLC in wake-up path

 include/linux/sched/sd_flags.h |  7 +++++++
 include/linux/sched/topology.h |  8 +++++++-
 kernel/sched/core.c            | 12 ++++++++++++
 kernel/sched/fair.c            | 30 +++++++++++++++++++++++++++---
 kernel/sched/sched.h           |  4 ++++
 kernel/sched/topology.c        | 25 +++++++++++++++++++++++++
 6 files changed, 82 insertions(+), 4 deletions(-)

-- 
2.24.0



* [PATCH v7 1/2] sched: Add per_cpu cluster domain info and cpus_share_lowest_cache API
  2022-08-22  7:36 [PATCH v7 0/2] sched/fair: Scan cluster before scanning LLC in wake-up path Yicong Yang
@ 2022-08-22  7:36 ` Yicong Yang
  2022-08-22  7:36 ` [PATCH v7 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path Yicong Yang
From: Yicong Yang @ 2022-08-22  7:36 UTC (permalink / raw)
  To: peterz, mingo, juri.lelli, vincent.guittot, tim.c.chen,
	gautham.shenoy, linux-kernel, linux-arm-kernel
  Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng,
	yangyicong, jonathan.cameron, ego, srikar, linuxarm, 21cnbao,
	guodong.xu, hesham.almatary, john.garry, shenyang39,
	kprateek.nayak, yu.c.chen, wuyun.abel

From: Barry Song <song.bao.hua@hisilicon.com>

Add per-cpu cluster domain info and cpus_share_lowest_cache() API.
This is the preparation for the optimization of select_idle_cpu()
on platforms with cluster scheduler level.
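
As a rough usage illustration (not part of this patch), the new helper
is meant to be used like cpus_share_cache() in the wake-up fast path,
e.g.:

    /*
     * Illustrative caller, simplified from what patch 2/2 does in
     * select_idle_sibling(): prefer the previous CPU if it shares
     * the lowest cache (the cluster if one exists, otherwise the
     * LLC) with the wake-up target and is idle.
     */
    if (prev != target && cpus_share_lowest_cache(prev, target) &&
        idle_cpu(prev))
            return prev;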

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 include/linux/sched/sd_flags.h |  7 +++++++
 include/linux/sched/topology.h |  8 +++++++-
 kernel/sched/core.c            | 12 ++++++++++++
 kernel/sched/sched.h           |  2 ++
 kernel/sched/topology.c        | 15 +++++++++++++++
 5 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/sd_flags.h b/include/linux/sched/sd_flags.h
index 57bde66d95f7..42ed454e8b18 100644
--- a/include/linux/sched/sd_flags.h
+++ b/include/linux/sched/sd_flags.h
@@ -109,6 +109,13 @@ SD_FLAG(SD_ASYM_CPUCAPACITY_FULL, SDF_SHARED_PARENT | SDF_NEEDS_GROUPS)
  */
 SD_FLAG(SD_SHARE_CPUCAPACITY, SDF_SHARED_CHILD | SDF_NEEDS_GROUPS)
 
+/*
+ * Domain members share CPU cluster (LLC tags or L2 cache)
+ *
+ * NEEDS_GROUPS: Clusters are shared between groups.
+ */
+SD_FLAG(SD_CLUSTER, SDF_NEEDS_GROUPS)
+
 /*
  * Domain members share CPU package resources (i.e. caches)
  *
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 816df6cc444e..c0d21667ddf3 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -45,7 +45,7 @@ static inline int cpu_smt_flags(void)
 #ifdef CONFIG_SCHED_CLUSTER
 static inline int cpu_cluster_flags(void)
 {
-	return SD_SHARE_PKG_RESOURCES;
+	return SD_CLUSTER | SD_SHARE_PKG_RESOURCES;
 }
 #endif
 
@@ -179,6 +179,7 @@ cpumask_var_t *alloc_sched_domains(unsigned int ndoms);
 void free_sched_domains(cpumask_var_t doms[], unsigned int ndoms);
 
 bool cpus_share_cache(int this_cpu, int that_cpu);
+bool cpus_share_lowest_cache(int this_cpu, int that_cpu);
 
 typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
 typedef int (*sched_domain_flags_f)(void);
@@ -232,6 +233,11 @@ static inline bool cpus_share_cache(int this_cpu, int that_cpu)
 	return true;
 }
 
+static inline bool cpus_share_lowest_cache(int this_cpu, int that_cpu)
+{
+	return true;
+}
+
 #endif	/* !CONFIG_SMP */
 
 #if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ee28253c9ac0..2d647598d26c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3802,6 +3802,18 @@ bool cpus_share_cache(int this_cpu, int that_cpu)
 	return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
 }
 
+/*
+ * Whether CPUs share the lowest-level cache, which means the LLC on
+ * non-cluster machines and the LLC tag or L2 on machines with clusters.
+ */
+bool cpus_share_lowest_cache(int this_cpu, int that_cpu)
+{
+	if (this_cpu == that_cpu)
+		return true;
+
+	return per_cpu(sd_lowest_cache_id, this_cpu) == per_cpu(sd_lowest_cache_id, that_cpu);
+}
+
 static inline bool ttwu_queue_cond(struct task_struct *p, int cpu)
 {
 	/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e26688d387ae..e9f0935605e2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1809,7 +1809,9 @@ static inline struct sched_domain *lowest_flag_domain(int cpu, int flag)
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DECLARE_PER_CPU(int, sd_llc_size);
 DECLARE_PER_CPU(int, sd_llc_id);
+DECLARE_PER_CPU(int, sd_lowest_cache_id);
 DECLARE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
+DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8739c2a5a54e..8ab27c0d6d1f 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -664,6 +664,8 @@ static void destroy_sched_domains(struct sched_domain *sd)
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
 DEFINE_PER_CPU(int, sd_llc_size);
 DEFINE_PER_CPU(int, sd_llc_id);
+DEFINE_PER_CPU(int, sd_lowest_cache_id);
+DEFINE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
 DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
@@ -689,6 +691,18 @@ static void update_top_cache_domain(int cpu)
 	per_cpu(sd_llc_id, cpu) = id;
 	rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
 
+	sd = lowest_flag_domain(cpu, SD_CLUSTER);
+	if (sd)
+		id = cpumask_first(sched_domain_span(sd));
+	rcu_assign_pointer(per_cpu(sd_cluster, cpu), sd);
+
+	/*
+	 * This assignment should be placed after sd_llc_id as we want
+	 * this id to equal the cluster id on cluster machines but the
+	 * LLC id on non-cluster machines.
+	 */
+	per_cpu(sd_lowest_cache_id, cpu) = id;
+
 	sd = lowest_flag_domain(cpu, SD_NUMA);
 	rcu_assign_pointer(per_cpu(sd_numa, cpu), sd);
 
@@ -1532,6 +1546,7 @@ static struct cpumask		***sched_domains_numa_masks;
  */
 #define TOPOLOGY_SD_FLAGS		\
 	(SD_SHARE_CPUCAPACITY	|	\
+	 SD_CLUSTER		|	\
 	 SD_SHARE_PKG_RESOURCES |	\
 	 SD_NUMA		|	\
 	 SD_ASYM_PACKING)
-- 
2.24.0



* [PATCH v7 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path
  2022-08-22  7:36 [PATCH v7 0/2] sched/fair: Scan cluster before scanning LLC in wake-up path Yicong Yang
  2022-08-22  7:36 ` [PATCH v7 1/2] sched: Add per_cpu cluster domain info and cpus_share_lowest_cache API Yicong Yang
@ 2022-08-22  7:36 ` Yicong Yang
  2022-08-23  3:45   ` Chen Yu
  2022-09-05 12:37 ` [PATCH v7 0/2] " Yicong Yang
From: Yicong Yang @ 2022-08-22  7:36 UTC (permalink / raw)
  To: peterz, mingo, juri.lelli, vincent.guittot, tim.c.chen,
	gautham.shenoy, linux-kernel, linux-arm-kernel
  Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng,
	yangyicong, jonathan.cameron, ego, srikar, linuxarm, 21cnbao,
	guodong.xu, hesham.almatary, john.garry, shenyang39,
	kprateek.nayak, yu.c.chen, wuyun.abel

From: Barry Song <song.bao.hua@hisilicon.com>

For platforms having clusters like Kunpeng920, CPUs within the same cluster
have lower latency when synchronizing and accessing shared resources like
cache. Thus, this patch tries to find an idle CPU within the cluster of the
target CPU before scanning the whole LLC, to gain lower latency.

Testing has been done on Kunpeng920 by pinning tasks to one NUMA node and
to two NUMA nodes. On Kunpeng920, each NUMA node has 8 clusters and each
cluster has 4 CPUs.

With this patch, we noticed an enhancement on tbench both within one NUMA
node and across two NUMA nodes.

On NUMA node 0:
                             6.0-rc1                patched
Hmean     1        351.20 (   0.00%)      396.45 *  12.88%*
Hmean     2        700.43 (   0.00%)      793.76 *  13.32%*
Hmean     4       1404.42 (   0.00%)     1583.62 *  12.76%*
Hmean     8       2833.31 (   0.00%)     3147.85 *  11.10%*
Hmean     16      5501.90 (   0.00%)     6089.89 *  10.69%*
Hmean     32     10428.59 (   0.00%)    10619.63 *   1.83%*
Hmean     64      8223.39 (   0.00%)     8306.93 *   1.02%*
Hmean     128     7042.88 (   0.00%)     7068.03 *   0.36%*

On NUMA nodes 0-1:
                             6.0-rc1                patched
Hmean     1        363.06 (   0.00%)      397.13 *   9.38%*
Hmean     2        721.68 (   0.00%)      789.84 *   9.44%*
Hmean     4       1435.15 (   0.00%)     1566.01 *   9.12%*
Hmean     8       2776.17 (   0.00%)     3007.05 *   8.32%*
Hmean     16      5471.71 (   0.00%)     6103.91 *  11.55%*
Hmean     32     10164.98 (   0.00%)    11531.81 *  13.45%*
Hmean     64     17143.28 (   0.00%)    20078.68 *  17.12%*
Hmean     128    14552.70 (   0.00%)    15156.41 *   4.15%*
Hmean     256    12827.37 (   0.00%)    13326.86 *   3.89%*

Note that neither Kunpeng920 nor x86 Jacobsville supports SMT, so the SMT
branch in the code has not been tested, but it is supposed to work.
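
(As a concrete illustration of the scan order, assuming the Kunpeng920
topology above with CPUs 0-31 sharing an LLC and CPUs 4-7 forming one
cluster: a wake-up targeting CPU 5 now probes CPUs 6, 7, 4 and 5 first
(wrap iteration from target + 1) before falling back to the remaining
LLC CPUs 0-3 and 8-31.)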

Suggested-by: Peter Zijlstra <peterz@infradead.org>
[https://lore.kernel.org/lkml/Ytfjs+m1kUs0ScSn@worktop.programming.kicks-ass.net]
Tested-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
---
 kernel/sched/fair.c     | 30 +++++++++++++++++++++++++++---
 kernel/sched/sched.h    |  2 ++
 kernel/sched/topology.c | 10 ++++++++++
 3 files changed, 39 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 914096c5b1ae..6fa77610d0f5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6437,6 +6437,30 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 		}
 	}
 
+	if (static_branch_unlikely(&sched_cluster_active)) {
+		struct sched_domain *sdc = rcu_dereference(per_cpu(sd_cluster, target));
+
+		if (sdc) {
+			for_each_cpu_wrap(cpu, sched_domain_span(sdc), target + 1) {
+				if (!cpumask_test_cpu(cpu, cpus))
+					continue;
+
+				if (has_idle_core) {
+					i = select_idle_core(p, cpu, cpus, &idle_cpu);
+					if ((unsigned int)i < nr_cpumask_bits)
+						return i;
+				} else {
+					if (--nr <= 0)
+						return -1;
+					idle_cpu = __select_idle_cpu(cpu, p);
+					if ((unsigned int)idle_cpu < nr_cpumask_bits)
+						return idle_cpu;
+				}
+			}
+			cpumask_andnot(cpus, cpus, sched_domain_span(sdc));
+		}
+	}
+
 	for_each_cpu_wrap(cpu, cpus, target + 1) {
 		if (has_idle_core) {
 			i = select_idle_core(p, cpu, cpus, &idle_cpu);
@@ -6444,7 +6468,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
 				return i;
 
 		} else {
-			if (!--nr)
+			if (--nr <= 0)
 				return -1;
 			idle_cpu = __select_idle_cpu(cpu, p);
 			if ((unsigned int)idle_cpu < nr_cpumask_bits)
@@ -6543,7 +6567,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	/*
 	 * If the previous CPU is cache affine and idle, don't be stupid:
 	 */
-	if (prev != target && cpus_share_cache(prev, target) &&
+	if (prev != target && cpus_share_lowest_cache(prev, target) &&
 	    (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
 	    asym_fits_capacity(task_util, prev))
 		return prev;
@@ -6569,7 +6593,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	p->recent_used_cpu = prev;
 	if (recent_used_cpu != prev &&
 	    recent_used_cpu != target &&
-	    cpus_share_cache(recent_used_cpu, target) &&
+	    cpus_share_lowest_cache(recent_used_cpu, target) &&
 	    (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
 	    cpumask_test_cpu(p->recent_used_cpu, p->cpus_ptr) &&
 	    asym_fits_capacity(task_util, recent_used_cpu)) {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e9f0935605e2..60e8a91e29d1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1815,7 +1815,9 @@ DECLARE_PER_CPU(struct sched_domain __rcu *, sd_cluster);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_numa);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
 DECLARE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
+
 extern struct static_key_false sched_asym_cpucapacity;
+extern struct static_key_false sched_cluster_active;
 
 struct sched_group_capacity {
 	atomic_t		ref;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8ab27c0d6d1f..04ead3227201 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -670,7 +670,9 @@ DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_packing);
 DEFINE_PER_CPU(struct sched_domain __rcu *, sd_asym_cpucapacity);
+
 DEFINE_STATIC_KEY_FALSE(sched_asym_cpucapacity);
+DEFINE_STATIC_KEY_FALSE(sched_cluster_active);
 
 static void update_top_cache_domain(int cpu)
 {
@@ -2268,6 +2270,7 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 	struct rq *rq = NULL;
 	int i, ret = -ENOMEM;
 	bool has_asym = false;
+	bool has_cluster = false;
 
 	if (WARN_ON(cpumask_empty(cpu_map)))
 		goto error;
@@ -2289,6 +2292,7 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 			sd = build_sched_domain(tl, cpu_map, attr, sd, i);
 
 			has_asym |= sd->flags & SD_ASYM_CPUCAPACITY;
+			has_cluster |= sd->flags & SD_CLUSTER;
 
 			if (tl == sched_domain_topology)
 				*per_cpu_ptr(d.sd, i) = sd;
@@ -2399,6 +2403,9 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 	if (has_asym)
 		static_branch_inc_cpuslocked(&sched_asym_cpucapacity);
 
+	if (has_cluster)
+		static_branch_inc_cpuslocked(&sched_cluster_active);
+
 	if (rq && sched_debug_verbose) {
 		pr_info("root domain span: %*pbl (max cpu_capacity = %lu)\n",
 			cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);
@@ -2498,6 +2505,9 @@ static void detach_destroy_domains(const struct cpumask *cpu_map)
 	if (rcu_access_pointer(per_cpu(sd_asym_cpucapacity, cpu)))
 		static_branch_dec_cpuslocked(&sched_asym_cpucapacity);
 
+	if (rcu_access_pointer(per_cpu(sd_cluster, cpu)))
+		static_branch_dec_cpuslocked(&sched_cluster_active);
+
 	rcu_read_lock();
 	for_each_cpu(i, cpu_map)
 		cpu_attach_domain(NULL, &def_root_domain, i);
-- 
2.24.0



* Re: [PATCH v7 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path
  2022-08-22  7:36 ` [PATCH v7 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path Yicong Yang
@ 2022-08-23  3:45   ` Chen Yu
  2022-08-23  7:48     ` Yicong Yang
From: Chen Yu @ 2022-08-23  3:45 UTC (permalink / raw)
  To: Yicong Yang
  Cc: peterz, mingo, juri.lelli, vincent.guittot, tim.c.chen,
	gautham.shenoy, linux-kernel, linux-arm-kernel, dietmar.eggemann,
	rostedt, bsegall, bristot, prime.zeng, yangyicong,
	jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu,
	hesham.almatary, john.garry, shenyang39, kprateek.nayak,
	wuyun.abel

On 2022-08-22 at 15:36:10 +0800, Yicong Yang wrote:
> From: Barry Song <song.bao.hua@hisilicon.com>
> 
> [..snip..]
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 914096c5b1ae..6fa77610d0f5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6437,6 +6437,30 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>  		}
>  	}
>  
> +	if (static_branch_unlikely(&sched_cluster_active)) {
> +		struct sched_domain *sdc = rcu_dereference(per_cpu(sd_cluster, target));
> +
> +		if (sdc) {
> +			for_each_cpu_wrap(cpu, sched_domain_span(sdc), target + 1) {
Looks good to me. One minor question: why don't we use
cpumask_and(cpus, sched_domain_span(sdc), cpus);
> +				if (!cpumask_test_cpu(cpu, cpus))
> +					continue;
so the above check can be removed in each loop? Besides, may I know what version this patch
is based on? I failed to apply the patch on v6.0-rc2. Other than that:

Reviewed-by: Chen Yu <yu.c.chen@intel.com>

thanks,
Chenyu
> [..snip..]


* Re: [PATCH v7 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path
  2022-08-23  3:45   ` Chen Yu
@ 2022-08-23  7:48     ` Yicong Yang
  2022-08-23  8:09       ` Chen Yu
From: Yicong Yang @ 2022-08-23  7:48 UTC (permalink / raw)
  To: Chen Yu
  Cc: yangyicong, peterz, mingo, juri.lelli, vincent.guittot,
	tim.c.chen, gautham.shenoy, linux-kernel, linux-arm-kernel,
	dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng,
	jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu,
	hesham.almatary, john.garry, shenyang39, kprateek.nayak,
	wuyun.abel

On 2022/8/23 11:45, Chen Yu wrote:
> On 2022-08-22 at 15:36:10 +0800, Yicong Yang wrote:
>> From: Barry Song <song.bao.hua@hisilicon.com>
>>
>> [..snip..]
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 914096c5b1ae..6fa77610d0f5 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6437,6 +6437,30 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>  		}
>>  	}
>>  
>> +	if (static_branch_unlikely(&sched_cluster_active)) {
>> +		struct sched_domain *sdc = rcu_dereference(per_cpu(sd_cluster, target));
>> +
>> +		if (sdc) {
>> +			for_each_cpu_wrap(cpu, sched_domain_span(sdc), target + 1) {
> Looks good to me. One minor question: why don't we use
> cpumask_and(cpus, sched_domain_span(sdc), cpus);
>> +				if (!cpumask_test_cpu(cpu, cpus))
>> +					continue;
> so the above check can be removed in each loop?

Since we'll need to recalculate the mask of the remaining CPUs to test in the LLC after scanning the cluster CPUs.
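
(To spell that out -- an illustration only, trimmed from the patch's
actual loop:)

    /* "cpus" keeps holding all allowed CPUs of the whole LLC. */
    for_each_cpu_wrap(cpu, sched_domain_span(sdc), target + 1) {
            if (!cpumask_test_cpu(cpu, cpus))   /* per-CPU filter */
                    continue;
            /* ... scan this cluster CPU ... */
    }
    /* So the LLC CPUs outside the cluster can still be derived: */
    cpumask_andnot(cpus, cpus, sched_domain_span(sdc));

    /*
     * Had we done cpumask_and(cpus, sched_domain_span(sdc), cpus)
     * up front, "cpus" would have shrunk to the cluster and the
     * remainder for the second scan would be lost.
     */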

> Besides, may I know what version this patch
> is based on? I failed to apply the patch on v6.0-rc2. Other than that:
> 

It was based on 6.0-rc1 when sent but can be cleanly rebased onto rc2:

yangyicong@ubuntu:~/mainline_linux/linux_sub_workspace$ git log --oneline -3
0079c27ba265 (HEAD -> topost-cls-v7, topost-cls-v6) sched/fair: Scan cluster before scanning LLC in wake-up path
1ecb9e322bd7 sched: Add per_cpu cluster domain info and cpus_share_lowest_cache API
1c23f9e627a7 (tag: v6.0-rc2, origin/master, origin/HEAD, master) Linux 6.0-rc2

So I'm not sure where the problem is...

> Reviewed-by: Chen Yu <yu.c.chen@intel.com>
> 

Thanks!

> thanks,
> Chenyu
>> +
>> +				if (has_idle_core) {
>> +					i = select_idle_core(p, cpu, cpus, &idle_cpu);
>> +					if ((unsigned int)i < nr_cpumask_bits)
>> +						return i;
>> +				} else {
>> +					if (--nr <= 0)
>> +						return -1;
>> +					idle_cpu = __select_idle_cpu(cpu, p);
>> +					if ((unsigned int)idle_cpu < nr_cpumask_bits)
>> +						return idle_cpu;
>> +				}
>> +			}
>> +			cpumask_andnot(cpus, cpus, sched_domain_span(sdc));
>> +		}
>> +	}
> .
> 


* Re: [PATCH v7 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path
  2022-08-23  7:48     ` Yicong Yang
@ 2022-08-23  8:09       ` Chen Yu
From: Chen Yu @ 2022-08-23  8:09 UTC (permalink / raw)
  To: Yicong Yang
  Cc: yangyicong, peterz, mingo, juri.lelli, vincent.guittot,
	tim.c.chen, gautham.shenoy, linux-kernel, linux-arm-kernel,
	dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng,
	jonathan.cameron, ego, srikar, linuxarm, 21cnbao, guodong.xu,
	hesham.almatary, john.garry, shenyang39, kprateek.nayak,
	wuyun.abel

On 2022-08-23 at 15:48:00 +0800, Yicong Yang wrote:
> On 2022/8/23 11:45, Chen Yu wrote:
> > On 2022-08-22 at 15:36:10 +0800, Yicong Yang wrote:
> >> From: Barry Song <song.bao.hua@hisilicon.com>
> >>
> >> [..snip..]
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 914096c5b1ae..6fa77610d0f5 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -6437,6 +6437,30 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> >>  		}
> >>  	}
> >>  
> >> +	if (static_branch_unlikely(&sched_cluster_active)) {
> >> +		struct sched_domain *sdc = rcu_dereference(per_cpu(sd_cluster, target));
> >> +
> >> +		if (sdc) {
> >> +			for_each_cpu_wrap(cpu, sched_domain_span(sdc), target + 1) {
> > Looks good to me. One minor question: why don't we use
> > cpumask_and(cpus, sched_domain_span(sdc), cpus);
> >> +				if (!cpumask_test_cpu(cpu, cpus))
> >> +					continue;
> > so the above check can be removed in each loop?
> 
> Since we'll need to recalculate the mask of the remaining CPUs to test in the LLC after scanning the cluster CPUs.
>
I was thinking of introducing a temporary variable,
cpumask_and(cpus_cluster, sched_domain_span(sdc), cpus);
and iterating over this cpus_cluster in the loop. But since
cpus is reused, it is OK as it is.
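
(For concreteness, the alternative would have looked roughly like the
sketch below; "cpus_cluster" is a hypothetical spare mask and this was
never posted as an actual patch:)

    cpumask_and(cpus_cluster, sched_domain_span(sdc), cpus);
    for_each_cpu_wrap(cpu, cpus_cluster, target + 1) {
            /* no per-iteration cpumask_test_cpu() filter needed */
            idle_cpu = __select_idle_cpu(cpu, p);
            if ((unsigned int)idle_cpu < nr_cpumask_bits)
                    return idle_cpu;
    }
    /* "cpus" is untouched, so the LLC remainder stays correct: */
    cpumask_andnot(cpus, cpus, sched_domain_span(sdc));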
> > Besides, may I know what version this patch
> > is based on? I failed to apply the patch on v6.0-rc2. Other than that:
> > 
> 
> It was based on 6.0-rc1 when sent but can be cleanly rebased onto rc2:
> 
> yangyicong@ubuntu:~/mainline_linux/linux_sub_workspace$ git log --oneline -3
> 0079c27ba265 (HEAD -> topost-cls-v7, topost-cls-v6) sched/fair: Scan cluster before scanning LLC in wake-up path
> 1ecb9e322bd7 sched: Add per_cpu cluster domain info and cpus_share_lowest_cache API
I did not apply 1/2, and I think that is why it failed. Thanks for the explanation.

Thanks,
Chenyu
> 1c23f9e627a7 (tag: v6.0-rc2, origin/master, origin/HEAD, master) Linux 6.0-rc2
> 
> So I'm not sure where the problem is...
> 
> > Reviewed-by: Chen Yu <yu.c.chen@intel.com>
> > 
> 
> Thanks!
> 
> > thanks,
> > Chenyu
> > [..snip..]

* Re: [PATCH v7 0/2] sched/fair: Scan cluster before scanning LLC in wake-up path
  2022-08-22  7:36 [PATCH v7 0/2] sched/fair: Scan cluster before scanning LLC in wake-up path Yicong Yang
  2022-08-22  7:36 ` [PATCH v7 1/2] sched: Add per_cpu cluster domain info and cpus_share_lowest_cache API Yicong Yang
  2022-08-22  7:36 ` [PATCH v7 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path Yicong Yang
@ 2022-09-05 12:37 ` Yicong Yang
  2022-09-06  5:28 ` K Prateek Nayak
  2022-09-07 11:52 ` Barry Song
From: Yicong Yang @ 2022-09-05 12:37 UTC (permalink / raw)
  To: peterz, mingo, juri.lelli, vincent.guittot, tim.c.chen,
	gautham.shenoy, linux-kernel, linux-arm-kernel
  Cc: yangyicong, dietmar.eggemann, rostedt, bsegall, bristot,
	prime.zeng, jonathan.cameron, ego, srikar, linuxarm, 21cnbao,
	guodong.xu, hesham.almatary, john.garry, shenyang39,
	kprateek.nayak, yu.c.chen, wuyun.abel

a friendly ping...

Thanks.

On 2022/8/22 15:36, Yicong Yang wrote:
> [..snip..]


* Re: [PATCH v7 0/2] sched/fair: Scan cluster before scanning LLC in wake-up path
  2022-08-22  7:36 [PATCH v7 0/2] sched/fair: Scan cluster before scanning LLC in wake-up path Yicong Yang
  2022-09-05 12:37 ` [PATCH v7 0/2] " Yicong Yang
@ 2022-09-06  5:28 ` K Prateek Nayak
  2022-09-06  8:46   ` Yicong Yang
  2022-09-07 11:52 ` Barry Song
From: K Prateek Nayak @ 2022-09-06  5:28 UTC (permalink / raw)
  To: Yicong Yang, peterz, mingo, juri.lelli, vincent.guittot,
	tim.c.chen, gautham.shenoy, linux-kernel, linux-arm-kernel
  Cc: dietmar.eggemann, rostedt, bsegall, bristot, prime.zeng,
	yangyicong, jonathan.cameron, ego, srikar, linuxarm, 21cnbao,
	guodong.xu, hesham.almatary, john.garry, shenyang39, yu.c.chen,
	wuyun.abel

Hello Yicong,

We've tested the series on a dual socket Zen3 system (2 x 64C/128T).

tl;dr

- The results look good and the changes do not affect the Zen3 machine,
  which doesn't contain any sched domain with the SD_CLUSTER flag set.

- With the latest BIOS, I don't see any regression due to the addition
  of the new per-CPU variables.
  We had previously observed a tbench regression when testing v4 of
  the series on a system with a slightly outdated BIOS
  (https://lore.kernel.org/lkml/e000b124-afd4-28e1-fde2-393b0e38ce19@amd.com/)
  but that doesn't seem to be the case with the latest BIOS :)

Detailed results from the standard benchmarks are reported below.

On 8/22/2022 1:06 PM, Yicong Yang wrote:
> From: Yicong Yang <yangyicong@hisilicon.com>
> 
> This is the follow-up work to support the cluster scheduler. Previously
> we added a cluster level in the scheduler for both ARM64[1] and
> x86[2] to support load balancing between clusters, which brings more
> memory bandwidth and decreases cache contention. This patchset, on the
> other hand, takes care of the wake-up path by giving CPUs within the
> same cluster a try before scanning the whole LLC, to benefit tasks
> that communicate with each other.
> 
> [1] 778c558f49a2 ("sched: Add cluster scheduler level in core and related Kconfig for ARM64")
> [2] 66558b730f25 ("sched: Add cluster scheduler level for x86")
> 

Discussed below are the results from running standard benchmarks on
a dual socket Zen3 (2 x 64C/128T) machine configured in different
NPS modes.

NPS modes are used to logically divide a single socket into
multiple NUMA regions.
The following is the NUMA configuration for each NPS mode on the system:

NPS1: Each socket is a NUMA node.
    A total of 2 NUMA nodes in the dual-socket machine.

    Node 0: 0-63,   128-191
    Node 1: 64-127, 192-255

NPS2: Each socket is further logically divided into 2 NUMA regions.
    A total of 4 NUMA nodes exist across the 2 sockets.
   
    Node 0: 0-31,   128-159
    Node 1: 32-63,  160-191
    Node 2: 64-95,  192-223
    Node 3: 96-127, 223-255

NPS4: Each socket is logically divided into 4 NUMA regions.
    A total of 8 NUMA nodes exist across the 2 sockets.
   
    Node 0: 0-15,    128-143
    Node 1: 16-31,   144-159
    Node 2: 32-47,   160-175
    Node 3: 48-63,   176-191
    Node 4: 64-79,   192-207
    Node 5: 80-95,   208-223
    Node 6: 96-111,  223-231
    Node 7: 112-127, 232-255

Benchmark Results:

Kernel versions:
- tip:      5.19.0 tip sched/core
- cluster:  5.19.0 tip sched/core + both the patches of the series

When we started testing, the tip was at:
commit: 5531ecffa4b9 "sched: Add update_current_exec_runtime helper"

~~~~~~~~~~~~~
~ hackbench ~
~~~~~~~~~~~~~

NPS1

Test:		      tip                    cluster
 1-groups:	   4.31 (0.00 pct)	   4.31 (0.00 pct)
 2-groups:	   4.93 (0.00 pct)	   4.86 (1.41 pct)
 4-groups:	   5.38 (0.00 pct)	   5.36 (0.37 pct)
 8-groups:	   5.59 (0.00 pct)	   5.54 (0.89 pct)
16-groups:	   7.18 (0.00 pct)	   7.47 (-4.03 pct)

NPS2

Test:		      tip                     cluster
 1-groups:	   4.25 (0.00 pct)	   4.40 (-3.52 pct)
 2-groups:	   4.83 (0.00 pct)	   4.73 (2.07 pct)
 4-groups:	   5.25 (0.00 pct)	   5.18 (1.33 pct)
 8-groups:	   5.56 (0.00 pct)	   5.45 (1.97 pct)
16-groups:	   6.72 (0.00 pct)	   6.63 (1.33 pct)

NPS4

Test:		      tip                     cluster
 1-groups:	   4.24 (0.00 pct)	   4.23 (0.23 pct)
 2-groups:	   4.88 (0.00 pct)	   4.78 (2.04 pct)
 4-groups:	   5.30 (0.00 pct)	   5.25 (0.94 pct)
 8-groups:	   5.66 (0.00 pct)	   5.61 (0.88 pct)
16-groups:	   6.79 (0.00 pct)	   7.05 (-3.82 pct)

~~~~~~~~~~~~
~ schbench ~
~~~~~~~~~~~~

NPS1

#workers:     tip                       cluster
  1:	  37.00 (0.00 pct)	     22.00 (40.54 pct)
  2:	  39.00 (0.00 pct)	     23.00 (41.02 pct)
  4:	  41.00 (0.00 pct)	     30.00 (26.82 pct)
  8:	  53.00 (0.00 pct)	     47.00 (11.32 pct)
 16:	  73.00 (0.00 pct)	     73.00 (0.00 pct)
 32:	 116.00 (0.00 pct)	    117.00 (-0.86 pct)
 64:	 217.00 (0.00 pct)	    221.00 (-1.84 pct)
128:	 477.00 (0.00 pct)	    444.00 (6.91 pct)
256:	1062.00 (0.00 pct)	   1050.00 (1.12 pct)
512:   47552.00 (0.00 pct)	  48576.00 (-2.15 pct)

NPS2

#workers:     tip                       cluster
  1:	  20.00 (0.00 pct)	     20.00 (0.00 pct)
  2:	  22.00 (0.00 pct)	     23.00 (-4.54 pct)
  4:	  30.00 (0.00 pct)	     31.00 (-3.33 pct)
  8:	  46.00 (0.00 pct)	     49.00 (-6.52 pct)
 16:	  70.00 (0.00 pct)	     72.00 (-2.85 pct)
 32:	 120.00 (0.00 pct)	    118.00 (1.66 pct)
 64:	 215.00 (0.00 pct)	    216.00 (-0.46 pct)
128:	 482.00 (0.00 pct)	    449.00 (6.84 pct)
256:	1042.00 (0.00 pct)	    995.00 (4.51 pct)
512:   47552.00 (0.00 pct)	  47296.00 (0.53 pct)

NPS4

#workers:     tip                       cluster
  1:	  18.00 (0.00 pct)	     20.00 (-11.11 pct)
  2:	  23.00 (0.00 pct)	     22.00 (4.34 pct)
  4:	  27.00 (0.00 pct)	     30.00 (-11.11 pct)
  8:	  57.00 (0.00 pct)	     60.00 (-5.26 pct)
 16:	  76.00 (0.00 pct)	     84.00 (-10.52 pct)
 32:	 120.00 (0.00 pct)	    115.00 (4.16 pct)
 64:	 219.00 (0.00 pct)	    212.00 (3.19 pct)
128:	 459.00 (0.00 pct)	    442.00 (3.70 pct)
256:	1078.00 (0.00 pct)	    983.00 (8.81 pct)
512:   47040.00 (0.00 pct)	  48192.00 (-2.44 pct)

Note: schbench displays a lot of run-to-run variance for
low worker counts. This behavior is due to the timing of
new-idle balance, which is not consistent across runs.

~~~~~~~~~~
~ tbench ~
~~~~~~~~~~

NPS1

Clients:      tip            	      cluster
    1	   573.26 (0.00 pct)	   572.61 (-0.11 pct)
    2	  1131.19 (0.00 pct)	  1122.41 (-0.77 pct)
    4	  2100.07 (0.00 pct)	  2081.74 (-0.87 pct)
    8	  3809.88 (0.00 pct)	  3732.14 (-2.04 pct)
   16	  6560.72 (0.00 pct)	  6289.22 (-4.13 pct)
   32	 12203.23 (0.00 pct)	 11811.74 (-3.20 pct)
   64	 22389.81 (0.00 pct)	 21587.79 (-3.58 pct)
  128	 32449.37 (0.00 pct)	 32967.15 (1.59 pct)
  256	 58962.40 (0.00 pct)	 56604.63 (-3.99 pct)
  512	 59608.71 (0.00 pct)	 56529.95 (-5.16 pct) * (Machine Overloaded)
  512	 57925.05 (0.00 pct)	 56697.38 (-2.11 pct) [Verification Run]
 1024	 58037.02 (0.00 pct)	 55751.53 (-3.93 pct)

NPS2

Clients:      tip                     cluster
    1	   574.20 (0.00 pct)	   572.49 (-0.29 pct)
    2	  1131.56 (0.00 pct)	  1149.53 (1.58 pct)
    4	  2132.26 (0.00 pct)	  2084.18 (-2.25 pct)
    8	  3812.20 (0.00 pct)	  3683.04 (-3.38 pct)
   16	  6457.61 (0.00 pct)	  6340.70 (-1.81 pct)
   32	 12263.82 (0.00 pct)	 11714.15 (-4.48 pct)
   64	 22224.11 (0.00 pct)	 21226.34 (-4.48 pct)
  128	 33040.38 (0.00 pct)	 32478.99 (-1.69 pct)
  256	 56547.25 (0.00 pct)	 52915.71 (-6.42 pct) * (Machine Overloaded)
  256    55631.80 (0.00 pct)     52905.99 (-4.89 pct) [Verification Run]
  512	 56220.67 (0.00 pct)	 54735.69 (-2.64 pct)
 1024	 56048.88 (0.00 pct)	 54426.63 (-2.89 pct)

NPS4

Clients:     tip                      cluster
    1	   575.50 (0.00 pct)	   570.65 (-0.84 pct)
    2	  1138.70 (0.00 pct)	  1137.75 (-0.08 pct)
    4	  2070.66 (0.00 pct)	  2103.18 (1.57 pct)
    8	  3811.70 (0.00 pct)	  3573.52 (-6.24 pct) *
    8	  3769.53 (0.00 pct)      3653.05 (-3.09 pct) [Verification Run]
   16	  6312.80 (0.00 pct)	  6212.41 (-1.59 pct)
   32	 11418.14 (0.00 pct)	 11721.01 (2.65 pct)
   64	 19671.16 (0.00 pct)	 20053.77 (1.94 pct)
  128	 30258.53 (0.00 pct)	 32585.15 (7.68 pct)
  256	 55838.10 (0.00 pct)	 51318.64 (-8.09 pct) * (Machine Overloaded)
  256	 54291.03 (0.00 pct)     54379.80 (0.16 pct)  [Verification Run]
  512	 55586.44 (0.00 pct)	 51538.93 (-7.28 pct) * (Machine Overloaded)
  512	 54190.04 (0.00 pct)     54096.16 (-0.17 pct) [Verification Run]
 1024	 56370.35 (0.00 pct)	 50768.68 (-9.93 pct) * (Machine Overloaded)
 1024    56498.36 (0.00 pct)     54661.85 (-3.25 pct) [Verification Run]

~~~~~~~~~~
~ stream ~
~~~~~~~~~~

NPS1

- 10 Runs:

Test:	      tip                  cluster
 Copy:	 332237.51 (0.00 pct)	 338085.24 (1.76 pct)
Scale:	 215236.94 (0.00 pct)	 214179.72 (-0.49 pct)
  Add:	 250753.67 (0.00 pct)	 251181.86 (0.17 pct)
Triad:	 259467.60 (0.00 pct)	 262541.92 (1.18 pct)

- 100 Runs:

Test:	      tip                  cluster
 Copy:	 329320.65 (0.00 pct)	 336947.39 (2.31 pct)
Scale:	 218102.78 (0.00 pct)	 219617.85 (0.69 pct)
  Add:	 251283.30 (0.00 pct)	 251918.03 (0.25 pct)
Triad:	 258044.33 (0.00 pct)	 261512.99 (1.34 pct)

NPS2

- 10 Runs:

Test:	      tip                  cluster
 Copy:	 336926.24 (0.00 pct)	 324310.01 (-3.74 pct)
Scale:	 220120.41 (0.00 pct)	 212795.43 (-3.32 pct)
  Add:	 252428.34 (0.00 pct)	 254355.80 (0.76 pct)
Triad:	 274268.23 (0.00 pct)	 261777.03 (-4.55 pct)

- 100 Runs:

Test:	      tip                  cluster
 Copy:   338126.49 (0.00 pct)    338947.03 (0.24 pct)
Scale:   230229.59 (0.00 pct)    229991.65 (-0.10 pct)
  Add:   253964.25 (0.00 pct)    264374.57 (4.09 pct)
Triad:   272176.19 (0.00 pct)    274587.35 (0.88 pct)

NPS4

- 10 Runs:

Test:	      tip                  cluster
 Copy:   367144.56 (0.00 pct)    375452.26 (2.26 pct)
Scale:   246928.04 (0.00 pct)    243651.53 (-1.32 pct)
  Add:   272096.30 (0.00 pct)    272845.33 (0.27 pct)
Triad:   286644.55 (0.00 pct)    290925.20 (1.49 pct)

- 100 Runs:

Test:	      tip                  cluster
 Copy:	 351980.15 (0.00 pct)	 375854.72 (6.78 pct)
Scale:	 254918.41 (0.00 pct)	 255904.90 (0.38 pct)
  Add:	 272722.89 (0.00 pct)	 274075.11 (0.49 pct)
Triad:   283340.94 (0.00 pct)	 287608.77 (1.50 pct)

~~~~~~~~~~~~~~~~~~~~
~ Additional notes ~
~~~~~~~~~~~~~~~~~~~~

- schbench is known to have noticeable run-to-run variation for lower
  worker counts, and any improvements or regressions observed can be
  safely ignored. The results are included to make sure there are
  no unnecessarily large regressions as a result of task pileup.

- tbench shows slight run-to-run variation with a larger number of
  clients on both the tip and the patched kernel. This is expected as
  the machine is overloaded at that point (the equivalent of two or
  more tasks per CPU). The "Verification Run" entries show none of
  these regressions are persistent.

>
> [..snip..]
> 

Overall, the changes look good and don't affect systems without an
SD_CLUSTER domain, like the Zen3 system used during testing.

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

--
Thanks and Regards,
Prateek


* Re: [PATCH v7 0/2] sched/fair: Scan cluster before scanning LLC in wake-up path
  2022-09-06  5:28 ` K Prateek Nayak
@ 2022-09-06  8:46   ` Yicong Yang
From: Yicong Yang @ 2022-09-06  8:46 UTC (permalink / raw)
  To: K Prateek Nayak, peterz, mingo, juri.lelli, vincent.guittot,
	tim.c.chen, gautham.shenoy, linux-kernel, linux-arm-kernel
  Cc: yangyicong, dietmar.eggemann, rostedt, bsegall, bristot,
	prime.zeng, jonathan.cameron, ego, srikar, linuxarm, 21cnbao,
	guodong.xu, hesham.almatary, john.garry, shenyang39, yu.c.chen,
	wuyun.abel

On 2022/9/6 13:28, K Prateek Nayak wrote:
> Hello Yicong,
> 
> We've tested the series on a dual socket Zen3 system (2 x 64C/128T).
> 
> tl;dr
> 
> - The results look good and the changes do not affect the Zen3 machine,
>   which doesn't contain any sched domain with the SD_CLUSTER flag set.
> 
> - With the latest BIOS, I don't see any regression due to the addition
>   of the new per-CPU variables.
>   We had previously observed a tbench regression when testing v4 of
>   the series on a system with a slightly outdated BIOS
>   (https://lore.kernel.org/lkml/e000b124-afd4-28e1-fde2-393b0e38ce19@amd.com/)
>   but that doesn't seem to be the case with the latest BIOS :)
> 
> Detailed results from the standard benchmarks are reported below.
> 
> On 8/22/2022 1:06 PM, Yicong Yang wrote:
>> From: Yicong Yang <yangyicong@hisilicon.com>
>>
>> This is the follow-up work to support cluster scheduler. Previously
>> we have added cluster level in the scheduler for both ARM64[1] and
>> X86[2] to support load balance between clusters to bring more memory
>> bandwidth and decrease cache contention. This patchset, on the other
>> hand, takes care of wake-up path by giving CPUs within the same cluster
>> a try before scanning the whole LLC to benefit those tasks communicating
>> with each other.
>>
>> [1] 778c558f49a2 ("sched: Add cluster scheduler level in core and related Kconfig for ARM64")
>> [2] 66558b730f25 ("sched: Add cluster scheduler level for x86")
>>
> 
> Discussed below are the results from running standard benchmarks on
> a dual socket Zen3 (2 x 64C/128T) machine configured in different
> NPS modes.
> 
> NPS Modes are used to logically divide single socket into
> multiple NUMA region.
> Following is the NUMA configuration for each NPS mode on the system:
> 
> NPS1: Each socket is a NUMA node.
>     Total 2 NUMA nodes in the dual socket machine.
> 
>     Node 0: 0-63,   128-191
>     Node 1: 64-127, 192-255
> 
> NPS2: Each socket is further logically divided into 2 NUMA regions.
>     Total 4 NUMA nodes exist over 2 socket.
>    
>     Node 0: 0-31,   128-159
>     Node 1: 32-63,  160-191
>     Node 2: 64-95,  192-223
>     Node 3: 96-127, 223-255
> 
> NPS4: Each socket is logically divided into 4 NUMA regions.
>     Total 8 NUMA nodes exist over 2 socket.
>    
>     Node 0: 0-15,    128-143
>     Node 1: 16-31,   144-159
>     Node 2: 32-47,   160-175
>     Node 3: 48-63,   176-191
>     Node 4: 64-79,   192-207
>     Node 5: 80-95,   208-223
>     Node 6: 96-111,  223-231
>     Node 7: 112-127, 232-255
> 
> Benchmark Results:
> 
> Kernel versions:
> - tip:      5.19.0 tip sched/core
> - cluster:  5.19.0 tip sched/core + both the patches of the series
> 
> When we started testing, the tip was at:
> commit: 5531ecffa4b9 "sched: Add update_current_exec_runtime helper"
> 
> ~~~~~~~~~~~~~
> ~ hackbench ~
> ~~~~~~~~~~~~~
> 
> NPS1
> 
> Test:		      tip                    cluster
>  1-groups:	   4.31 (0.00 pct)	   4.31 (0.00 pct)
>  2-groups:	   4.93 (0.00 pct)	   4.86 (1.41 pct)
>  4-groups:	   5.38 (0.00 pct)	   5.36 (0.37 pct)
>  8-groups:	   5.59 (0.00 pct)	   5.54 (0.89 pct)
> 16-groups:	   7.18 (0.00 pct)	   7.47 (-4.03 pct)
> 
> NPS2
> 
> Test:		      tip                     cluster
>  1-groups:	   4.25 (0.00 pct)	   4.40 (-3.52 pct)
>  2-groups:	   4.83 (0.00 pct)	   4.73 (2.07 pct)
>  4-groups:	   5.25 (0.00 pct)	   5.18 (1.33 pct)
>  8-groups:	   5.56 (0.00 pct)	   5.45 (1.97 pct)
> 16-groups:	   6.72 (0.00 pct)	   6.63 (1.33 pct)
> 
> NPS4
> 
> Test:		      tip                     cluster
>  1-groups:	   4.24 (0.00 pct)	   4.23 (0.23 pct)
>  2-groups:	   4.88 (0.00 pct)	   4.78 (2.04 pct)
>  4-groups:	   5.30 (0.00 pct)	   5.25 (0.94 pct)
>  8-groups:	   5.66 (0.00 pct)	   5.61 (0.88 pct)
> 16-groups:	   6.79 (0.00 pct)	   7.05 (-3.82 pct)
> 
> ~~~~~~~~~~~~
> ~ schbench ~
> ~~~~~~~~~~~~
> 
> NPS1
> 
> #workers:     tip                       cluster
>   1:	  37.00 (0.00 pct)	     22.00 (40.54 pct)
>   2:	  39.00 (0.00 pct)	     23.00 (41.02 pct)
>   4:	  41.00 (0.00 pct)	     30.00 (26.82 pct)
>   8:	  53.00 (0.00 pct)	     47.00 (11.32 pct)
>  16:	  73.00 (0.00 pct)	     73.00 (0.00 pct)
>  32:	 116.00 (0.00 pct)	    117.00 (-0.86 pct)
>  64:	 217.00 (0.00 pct)	    221.00 (-1.84 pct)
> 128:	 477.00 (0.00 pct)	    444.00 (6.91 pct)
> 256:	1062.00 (0.00 pct)	   1050.00 (1.12 pct)
> 512:   47552.00 (0.00 pct)	  48576.00 (-2.15 pct)
> 
> NPS2
> 
> #workers:     tip                       cluster
>   1:	  20.00 (0.00 pct)	     20.00 (0.00 pct)
>   2:	  22.00 (0.00 pct)	     23.00 (-4.54 pct)
>   4:	  30.00 (0.00 pct)	     31.00 (-3.33 pct)
>   8:	  46.00 (0.00 pct)	     49.00 (-6.52 pct)
>  16:	  70.00 (0.00 pct)	     72.00 (-2.85 pct)
>  32:	 120.00 (0.00 pct)	    118.00 (1.66 pct)
>  64:	 215.00 (0.00 pct)	    216.00 (-0.46 pct)
> 128:	 482.00 (0.00 pct)	    449.00 (6.84 pct)
> 256:	1042.00 (0.00 pct)	    995.00 (4.51 pct)
> 512:   47552.00 (0.00 pct)	  47296.00 (0.53 pct)
> 
> NPS4
> 
> #workers:     tip                       cluster
>   1:	  18.00 (0.00 pct)	     20.00 (-11.11 pct)
>   2:	  23.00 (0.00 pct)	     22.00 (4.34 pct)
>   4:	  27.00 (0.00 pct)	     30.00 (-11.11 pct)
>   8:	  57.00 (0.00 pct)	     60.00 (-5.26 pct)
>  16:	  76.00 (0.00 pct)	     84.00 (-10.52 pct)
>  32:	 120.00 (0.00 pct)	    115.00 (4.16 pct)
>  64:	 219.00 (0.00 pct)	    212.00 (3.19 pct)
> 128:	 459.00 (0.00 pct)	    442.00 (3.70 pct)
> 256:	1078.00 (0.00 pct)	    983.00 (8.81 pct)
> 512:   47040.00 (0.00 pct)	  48192.00 (-2.44 pct)
> 
> Note: schbench displays lot of run to run variance for
> low worker count. This behavior is due to the timing of
> new-idle balance which is not consistent across runs.
> 
> ~~~~~~~~~~
> ~ tbench ~
> ~~~~~~~~~~
> 
> NPS1
> 
> Clients:      tip            	      cluster
>     1	   573.26 (0.00 pct)	   572.61 (-0.11 pct)
>     2	  1131.19 (0.00 pct)	  1122.41 (-0.77 pct)
>     4	  2100.07 (0.00 pct)	  2081.74 (-0.87 pct)
>     8	  3809.88 (0.00 pct)	  3732.14 (-2.04 pct)
>    16	  6560.72 (0.00 pct)	  6289.22 (-4.13 pct)
>    32	 12203.23 (0.00 pct)	 11811.74 (-3.20 pct)
>    64	 22389.81 (0.00 pct)	 21587.79 (-3.58 pct)
>   128	 32449.37 (0.00 pct)	 32967.15 (1.59 pct)
>   256	 58962.40 (0.00 pct)	 56604.63 (-3.99 pct)
>   512	 59608.71 (0.00 pct)	 56529.95 (-5.16 pct) * (Machine Overloaded)
>   512	 57925.05 (0.00 pct)	 56697.38 (-2.11 pct) [Verification Run]
>  1024	 58037.02 (0.00 pct)	 55751.53 (-3.93 pct)
> 
> NPS2
> 
> Clients:      tip                     cluster
>     1	   574.20 (0.00 pct)	   572.49 (-0.29 pct)
>     2	  1131.56 (0.00 pct)	  1149.53 (1.58 pct)
>     4	  2132.26 (0.00 pct)	  2084.18 (-2.25 pct)
>     8	  3812.20 (0.00 pct)	  3683.04 (-3.38 pct)
>    16	  6457.61 (0.00 pct)	  6340.70 (-1.81 pct)
>    32	 12263.82 (0.00 pct)	 11714.15 (-4.48 pct)
>    64	 22224.11 (0.00 pct)	 21226.34 (-4.48 pct)
>   128	 33040.38 (0.00 pct)	 32478.99 (-1.69 pct)
>   256	 56547.25 (0.00 pct)	 52915.71 (-6.42 pct) * (Machine Overloaded)
>   256    55631.80 (0.00 pct)     52905.99 (-4.89 pct) [Verification Run]
>   512	 56220.67 (0.00 pct)	 54735.69 (-2.64 pct)
>  1024	 56048.88 (0.00 pct)	 54426.63 (-2.89 pct)
> 
> NPS4
> 
> Clients:     tip                      cluster
>     1	   575.50 (0.00 pct)	   570.65 (-0.84 pct)
>     2	  1138.70 (0.00 pct)	  1137.75 (-0.08 pct)
>     4	  2070.66 (0.00 pct)	  2103.18 (1.57 pct)
>     8	  3811.70 (0.00 pct)	  3573.52 (-6.24 pct) *
>     8	  3769.53 (0.00 pct)      3653.05 (-3.09 pct) [Verification Run]
>    16	  6312.80 (0.00 pct)	  6212.41 (-1.59 pct)
>    32	 11418.14 (0.00 pct)	 11721.01 (2.65 pct)
>    64	 19671.16 (0.00 pct)	 20053.77 (1.94 pct)
>   128	 30258.53 (0.00 pct)	 32585.15 (7.68 pct)
>   256	 55838.10 (0.00 pct)	 51318.64 (-8.09 pct) * (Machine Overloaded)
>   256	 54291.03 (0.00 pct)     54379.80 (0.16 pct)  [Verification Run]
>   512	 55586.44 (0.00 pct)	 51538.93 (-7.28 pct) * (Machine Overloaded)
>   512	 54190.04 (0.00 pct)     54096.16 (-0.17 pct) [Verification Run]
>  1024	 56370.35 (0.00 pct)	 50768.68 (-9.93 pct) * (Machine Overloaded)
>  1024    56498.36 (0.00 pct)     54661.85 (-3.25 pct) [Verification Run]
> 
> ~~~~~~~~~~
> ~ stream ~
> ~~~~~~~~~~
> 
> NPS1
> 
> - 10 Runs:
> 
> Test:	      tip                  cluster
>  Copy:	 332237.51 (0.00 pct)	 338085.24 (1.76 pct)
> Scale:	 215236.94 (0.00 pct)	 214179.72 (-0.49 pct)
>   Add:	 250753.67 (0.00 pct)	 251181.86 (0.17 pct)
> Triad:	 259467.60 (0.00 pct)	 262541.92 (1.18 pct)
> 
> - 100 Runs:
> 
> Test:	      tip                  cluster
>  Copy:	 329320.65 (0.00 pct)	 336947.39 (2.31 pct)
> Scale:	 218102.78 (0.00 pct)	 219617.85 (0.69 pct)
>   Add:	 251283.30 (0.00 pct)	 251918.03 (0.25 pct)
> Triad:	 258044.33 (0.00 pct)	 261512.99 (1.34 pct)
> 
> NPS2
> 
> - 10 Runs:
> 
> Test:	      tip                  cluster
>  Copy:	 336926.24 (0.00 pct)	 324310.01 (-3.74 pct)
> Scale:	 220120.41 (0.00 pct)	 212795.43 (-3.32 pct)
>   Add:	 252428.34 (0.00 pct)	 254355.80 (0.76 pct)
> Triad:	 274268.23 (0.00 pct)	 261777.03 (-4.55 pct)
> 
> - 100 Runs:
> 
> Test:	      tip                  cluster
>  Copy:   338126.49 (0.00 pct)    338947.03 (0.24 pct)
> Scale:   230229.59 (0.00 pct)    229991.65 (-0.10 pct)
>   Add:   253964.25 (0.00 pct)    264374.57 (4.09 pct)
> Triad:   272176.19 (0.00 pct)    274587.35 (0.88 pct)
> 
> NPS4
> 
> - 10 Runs:
> 
> Test:	      tip                  cluster
>  Copy:   367144.56 (0.00 pct)    375452.26 (2.26 pct)
> Scale:   246928.04 (0.00 pct)    243651.53 (-1.32 pct)
>   Add:   272096.30 (0.00 pct)    272845.33 (0.27 pct)
> Triad:   286644.55 (0.00 pct)    290925.20 (1.49 pct)
> 
> - 100 Runs:
> 
> Test:	      tip                  cluster
>  Copy:	 351980.15 (0.00 pct)	 375854.72 (6.78 pct)
> Scale:	 254918.41 (0.00 pct)	 255904.90 (0.38 pct)
>   Add:	 272722.89 (0.00 pct)	 274075.11 (0.49 pct)
> Triad:   283340.94 (0.00 pct)	 287608.77 (1.50 pct)
> 
> ~~~~~~~~~~~~~~~~~~~~
> ~ Additional notes ~
> ~~~~~~~~~~~~~~~~~~~~
> 
> - schbench is known to have noticeable run-to-run variation for lower
>   worker counts, so any improvements or regressions observed there can
>   be safely ignored. The results are included to make sure there are
>   no unexpectedly large regressions as a result of task pileup.
> 
> - tbench shows slight run-to-run variation with larger numbers of
>   clients on both the tip and the patched kernel. This is expected, as
>   the machine is overloaded at that point (the equivalent of two or
>   more tasks per CPU). The "Verification Run" entries show that none of
>   these regressions are persistent.
> 
>>
>> [..snip..]
>>
> 
> Overall, the changes look good and don't affect systems without an
> SD_CLUSTER domain, like the Zen3 system used during testing.
> 
> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> 

Thanks a lot for the testing and verification on the Zen3 system.
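
One note for anyone comparing the numbers above: a positive percentage
consistently means the patched (cluster) kernel did better than tip,
whether the metric is time based (hackbench, schbench: lower is better)
or throughput based (tbench, stream: higher is better), since the
improvement is computed relative to tip. For example, schbench NPS1
with 1 worker works out to (37.00 - 22.00) / 37.00 * 100 = 40.54 pct,
and tbench NPS1 with 1 client to
(572.61 - 573.26) / 573.26 * 100 = -0.11 pct.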

Regards,
Yicong


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v7 0/2] sched/fair: Scan cluster before scanning LLC in wake-up path
  2022-08-22  7:36 [PATCH v7 0/2] sched/fair: Scan cluster before scanning LLC in wake-up path Yicong Yang
                   ` (3 preceding siblings ...)
  2022-09-06  5:28 ` K Prateek Nayak
@ 2022-09-07 11:52 ` Barry Song
  4 siblings, 0 replies; 10+ messages in thread
From: Barry Song @ 2022-09-07 11:52 UTC (permalink / raw)
  To: yangyicong, peterz
  Cc: 21cnbao, bristot, bsegall, dietmar.eggemann, ego, gautham.shenoy,
	guodong.xu, hesham.almatary, john.garry, jonathan.cameron,
	juri.lelli, kprateek.nayak, linux-arm-kernel, linux-kernel,
	linuxarm, mingo, prime.zeng, rostedt, shenyang39, srikar,
	tim.c.chen, vincent.guittot, wuyun.abel, yangyicong, yu.c.chen

> From: Yicong Yang <yangyicong@hisilicon.com>

> This is the follow-up work to support cluster scheduler. Previously
> we have added cluster level in the scheduler for both ARM64[1] and
> X86[2] to support load balance between clusters to bring more memory
> bandwidth and decrease cache contention. This patchset, on the other
> hand, takes care of wake-up path by giving CPUs within the same cluster
> a try before scanning the whole LLC to benefit those tasks communicating
> with each other.

> Barry Song (2):
>   sched: Add per_cpu cluster domain info and cpus_share_lowest_cache API
>   sched/fair: Scan cluster before scanning LLC in wake-up path
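
For illustration, here is a minimal, self-contained userspace sketch of
the idea behind the two patches above. Everything in it (the toy
topology, the cluster_active flag, every function name) is an
assumption made up for this sketch, not the kernel implementation:

#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 8

/* Stand-in for a "cluster scheduling is active" gate. */
static bool cluster_active = true;

/* Toy topology: one 8-CPU LLC split into two 4-CPU clusters. */
static int cluster_of(int cpu) { return cpu / 4; }
static int llc_of(int cpu)     { return cpu / 8; }

/* Pretend CPUs 2, 5 and 7 are idle right now. */
static bool cpu_idle[NR_CPUS] = { [2] = true, [5] = true, [7] = true };

/*
 * Patch 1/2's idea: the "lowest cache" is the cluster where one
 * exists, and degrades to plain LLC sharing otherwise.
 */
static bool cpus_share_lowest_cache_sketch(int a, int b)
{
	if (cluster_active)
		return cluster_of(a) == cluster_of(b);
	return llc_of(a) == llc_of(b);
}

/*
 * Patch 2/2's idea: when waking a task near @target, try the idle
 * CPUs of the target's cluster first (they share the cluster-level
 * cache), and only then fall back to the rest of the LLC.
 */
static int select_idle_cpu_sketch(int target)
{
	int cpu;

	if (cluster_active)
		for (cpu = 0; cpu < NR_CPUS; cpu++)
			if (cluster_of(cpu) == cluster_of(target) && cpu_idle[cpu])
				return cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (cluster_active && cluster_of(cpu) == cluster_of(target))
			continue;	/* cluster was already scanned above */
		if (cpu_idle[cpu])
			return cpu;
	}
	return -1;	/* the whole LLC is busy */
}

int main(void)
{
	printf("CPUs 0 and 2 share lowest cache: %d\n",
	       cpus_share_lowest_cache_sketch(0, 2));
	printf("waker near CPU 0 picks CPU %d\n", select_idle_cpu_sketch(0));
	printf("waker near CPU 4 picks CPU %d\n", select_idle_cpu_sketch(4));
	return 0;
}

With cluster_active flipped to false, the first helper degrades to a
plain LLC check and the scan walks the whole LLC as before, which lines
up with Prateek's observation earlier in the thread that a Zen3 system
without an SD_CLUSTER domain sees no behavior change.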

Hi Peter,
I believe this one is ready. It has also been widely reviewed and
tested on platforms w/ and w/o clusters.
Could you please pick it up?

Thanks
Barry

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2022-09-07 11:53 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
2022-08-22  7:36 [PATCH v7 0/2] sched/fair: Scan cluster before scanning LLC in wake-up path Yicong Yang
2022-08-22  7:36 ` [PATCH v7 1/2] sched: Add per_cpu cluster domain info and cpus_share_lowest_cache API Yicong Yang
2022-08-22  7:36 ` [PATCH v7 2/2] sched/fair: Scan cluster before scanning LLC in wake-up path Yicong Yang
2022-08-23  3:45   ` Chen Yu
2022-08-23  7:48     ` Yicong Yang
2022-08-23  8:09       ` Chen Yu
2022-09-05 12:37 ` [PATCH v7 0/2] " Yicong Yang
2022-09-06  5:28 ` K Prateek Nayak
2022-09-06  8:46   ` Yicong Yang
2022-09-07 11:52 ` Barry Song
