All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH V2 0/2] Add complex scheduler level for arm64
@ 2022-04-22 11:51 ` Qing Wang
  0 siblings, 0 replies; 24+ messages in thread
From: Qing Wang @ 2022-04-22 11:51 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Sudeep Holla, Greg Kroah-Hartman,
	Rafael J. Wysocki, linux-arm-kernel, linux-kernel
  Cc: vincent.guittot, dietmar.eggemann, Wang Qing

From: Wang Qing <wangqing@vivo.com>

The DSU cluster supports blocks that are called complexes
which contain up to two cores of the same type and some shared logic,
which sharing some logic between the cores can make a complex area efficient.

Complex also can be considered as a shared cache group smaller
than cluster.

This patch adds complex level for complexs by parsing cache topology
form DT. It will directly benefit a lot of workload which loves more
resources such as memory bandwidth, caches.

Note this patch only handle the DT case.

V2:
fix commit log and loop more

wangqing (2):
  arch_topology: support for describing cache topology from DT
  arm64: Add complex scheduler level for arm64

 arch/arm64/Kconfig            | 13 ++++++++++
 arch/arm64/kernel/smp.c       | 48 ++++++++++++++++++++++++++++++++++-
 drivers/base/arch_topology.c  | 47 +++++++++++++++++++++++++++++++++-
 include/linux/arch_topology.h |  3 +++
 4 files changed, 109 insertions(+), 2 deletions(-)

-- 
2.7.4


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH V2 0/2] Add complex scheduler level for arm64
@ 2022-04-22 11:51 ` Qing Wang
  0 siblings, 0 replies; 24+ messages in thread
From: Qing Wang @ 2022-04-22 11:51 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Sudeep Holla, Greg Kroah-Hartman,
	Rafael J. Wysocki, linux-arm-kernel, linux-kernel
  Cc: vincent.guittot, dietmar.eggemann, Wang Qing

From: Wang Qing <wangqing@vivo.com>

The DSU cluster supports blocks that are called complexes
which contain up to two cores of the same type and some shared logic,
which sharing some logic between the cores can make a complex area efficient.

Complex also can be considered as a shared cache group smaller
than cluster.

This patch adds complex level for complexs by parsing cache topology
form DT. It will directly benefit a lot of workload which loves more
resources such as memory bandwidth, caches.

Note this patch only handle the DT case.

V2:
fix commit log and loop more

wangqing (2):
  arch_topology: support for describing cache topology from DT
  arm64: Add complex scheduler level for arm64

 arch/arm64/Kconfig            | 13 ++++++++++
 arch/arm64/kernel/smp.c       | 48 ++++++++++++++++++++++++++++++++++-
 drivers/base/arch_topology.c  | 47 +++++++++++++++++++++++++++++++++-
 include/linux/arch_topology.h |  3 +++
 4 files changed, 109 insertions(+), 2 deletions(-)

-- 
2.7.4


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH V2 1/2] arch_topology: support for parsing cache topology from DT
  2022-04-22 11:51 ` Qing Wang
@ 2022-04-22 11:51   ` Qing Wang
  -1 siblings, 0 replies; 24+ messages in thread
From: Qing Wang @ 2022-04-22 11:51 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Sudeep Holla, Greg Kroah-Hartman,
	Rafael J. Wysocki, linux-arm-kernel, linux-kernel
  Cc: vincent.guittot, dietmar.eggemann, Wang Qing

From: Wang Qing <wangqing@vivo.com>

When ACPI is not enabled, we can get cache topolopy from DT like:
*		cpu0: cpu@000 {
*			next-level-cache = <&L2_1>;
*			L2_1: l2-cache {
* 				compatible = "cache";
*				next-level-cache = <&L3_1>;
* 			};
*			L3_1: l3-cache {
* 				compatible = "cache";
* 			};
*		};
*
*		cpu1: cpu@001 {
*			next-level-cache = <&L2_1>;
*		};
*		...
*		};
cache_topology[] hold the pointer describing by "next-level-cache", 
which can describe the cache topology of every level.

MAX_CACHE_LEVEL is strictly corresponding to the cache level from L2.

V2:
make function name more sense

Signed-off-by: Wang Qing <wangqing@vivo.com>
---
 drivers/base/arch_topology.c  | 47 ++++++++++++++++++++++++++++++++++-
 include/linux/arch_topology.h |  3 +++
 2 files changed, 49 insertions(+), 1 deletion(-)

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 1d6636ebaac5..46e84ce2ec0c 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -480,8 +480,10 @@ static int __init get_cpu_for_node(struct device_node *node)
 		return -1;
 
 	cpu = of_cpu_node_to_id(cpu_node);
-	if (cpu >= 0)
+	if (cpu >= 0) {
 		topology_parse_cpu_capacity(cpu_node, cpu);
+		topology_parse_cpu_caches(cpu_node, cpu);
+	}
 	else
 		pr_info("CPU node for %pOF exist but the possible cpu range is :%*pbl\n",
 			cpu_node, cpumask_pr_args(cpu_possible_mask));
@@ -647,6 +649,49 @@ static int __init parse_dt_topology(void)
 }
 #endif
 
+/*
+ * cpu cache topology table
+ */
+#define MAX_CACHE_LEVEL 7
+static struct device_node *cache_topology[NR_CPUS][MAX_CACHE_LEVEL];
+
+void topology_parse_cpu_caches(struct device_node *cpu_node, int cpu)
+{
+	struct device_node *node_cache = cpu_node;
+	int level = 0;
+
+	while (level < MAX_CACHE_LEVEL) {
+		node_cache = of_parse_phandle(node_cache, "next-level-cache", 0);
+		if (!node_cache)
+			break;
+
+		cache_topology[cpu][level++] = node_cache;
+	}
+}
+
+/*
+ * find the largest subset of the shared cache in the range of cpu_mask
+ */
+void find_subset_of_share_cache(const struct cpumask *cpu_mask, int cpu,
+								 struct cpumask *sc_mask)
+{
+	int cache_level, cpu_id;
+
+	for (cache_level = MAX_CACHE_LEVEL - 1; cache_level >= 0; cache_level--) {
+		if (!cache_topology[cpu][cache_level])
+			continue;
+
+		cpumask_clear(sc_mask);
+		for (cpu_id = 0; cpu_id < NR_CPUS; cpu_id++) {
+			if (cache_topology[cpu][cache_level] == cache_topology[cpu_id][cache_level])
+				cpumask_set_cpu(cpu_id, sc_mask);
+		}
+
+		if (cpumask_subset(sc_mask, cpu_mask))
+			break;
+	}
+}
+
 /*
  * cpu topology table
  */
diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
index 58cbe18d825c..c6ed727e453c 100644
--- a/include/linux/arch_topology.h
+++ b/include/linux/arch_topology.h
@@ -93,6 +93,9 @@ void update_siblings_masks(unsigned int cpu);
 void remove_cpu_topology(unsigned int cpuid);
 void reset_cpu_topology(void);
 int parse_acpi_topology(void);
+void topology_parse_cpu_caches(struct device_node *cpu_node, int cpu);
+void find_subset_of_share_cache(const struct cpumask *cpu_mask, int cpu,
+								 struct cpumask *sc_mask);
 #endif
 
 #endif /* _LINUX_ARCH_TOPOLOGY_H_ */
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH V2 1/2] arch_topology: support for parsing cache topology from DT
@ 2022-04-22 11:51   ` Qing Wang
  0 siblings, 0 replies; 24+ messages in thread
From: Qing Wang @ 2022-04-22 11:51 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Sudeep Holla, Greg Kroah-Hartman,
	Rafael J. Wysocki, linux-arm-kernel, linux-kernel
  Cc: vincent.guittot, dietmar.eggemann, Wang Qing

From: Wang Qing <wangqing@vivo.com>

When ACPI is not enabled, we can get cache topolopy from DT like:
*		cpu0: cpu@000 {
*			next-level-cache = <&L2_1>;
*			L2_1: l2-cache {
* 				compatible = "cache";
*				next-level-cache = <&L3_1>;
* 			};
*			L3_1: l3-cache {
* 				compatible = "cache";
* 			};
*		};
*
*		cpu1: cpu@001 {
*			next-level-cache = <&L2_1>;
*		};
*		...
*		};
cache_topology[] hold the pointer describing by "next-level-cache", 
which can describe the cache topology of every level.

MAX_CACHE_LEVEL is strictly corresponding to the cache level from L2.

V2:
make function name more sense

Signed-off-by: Wang Qing <wangqing@vivo.com>
---
 drivers/base/arch_topology.c  | 47 ++++++++++++++++++++++++++++++++++-
 include/linux/arch_topology.h |  3 +++
 2 files changed, 49 insertions(+), 1 deletion(-)

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 1d6636ebaac5..46e84ce2ec0c 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -480,8 +480,10 @@ static int __init get_cpu_for_node(struct device_node *node)
 		return -1;
 
 	cpu = of_cpu_node_to_id(cpu_node);
-	if (cpu >= 0)
+	if (cpu >= 0) {
 		topology_parse_cpu_capacity(cpu_node, cpu);
+		topology_parse_cpu_caches(cpu_node, cpu);
+	}
 	else
 		pr_info("CPU node for %pOF exist but the possible cpu range is :%*pbl\n",
 			cpu_node, cpumask_pr_args(cpu_possible_mask));
@@ -647,6 +649,49 @@ static int __init parse_dt_topology(void)
 }
 #endif
 
+/*
+ * cpu cache topology table
+ */
+#define MAX_CACHE_LEVEL 7
+static struct device_node *cache_topology[NR_CPUS][MAX_CACHE_LEVEL];
+
+void topology_parse_cpu_caches(struct device_node *cpu_node, int cpu)
+{
+	struct device_node *node_cache = cpu_node;
+	int level = 0;
+
+	while (level < MAX_CACHE_LEVEL) {
+		node_cache = of_parse_phandle(node_cache, "next-level-cache", 0);
+		if (!node_cache)
+			break;
+
+		cache_topology[cpu][level++] = node_cache;
+	}
+}
+
+/*
+ * find the largest subset of the shared cache in the range of cpu_mask
+ */
+void find_subset_of_share_cache(const struct cpumask *cpu_mask, int cpu,
+								 struct cpumask *sc_mask)
+{
+	int cache_level, cpu_id;
+
+	for (cache_level = MAX_CACHE_LEVEL - 1; cache_level >= 0; cache_level--) {
+		if (!cache_topology[cpu][cache_level])
+			continue;
+
+		cpumask_clear(sc_mask);
+		for (cpu_id = 0; cpu_id < NR_CPUS; cpu_id++) {
+			if (cache_topology[cpu][cache_level] == cache_topology[cpu_id][cache_level])
+				cpumask_set_cpu(cpu_id, sc_mask);
+		}
+
+		if (cpumask_subset(sc_mask, cpu_mask))
+			break;
+	}
+}
+
 /*
  * cpu topology table
  */
diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
index 58cbe18d825c..c6ed727e453c 100644
--- a/include/linux/arch_topology.h
+++ b/include/linux/arch_topology.h
@@ -93,6 +93,9 @@ void update_siblings_masks(unsigned int cpu);
 void remove_cpu_topology(unsigned int cpuid);
 void reset_cpu_topology(void);
 int parse_acpi_topology(void);
+void topology_parse_cpu_caches(struct device_node *cpu_node, int cpu);
+void find_subset_of_share_cache(const struct cpumask *cpu_mask, int cpu,
+								 struct cpumask *sc_mask);
 #endif
 
 #endif /* _LINUX_ARCH_TOPOLOGY_H_ */
-- 
2.7.4


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH V2 2/2] arm64: Add complex scheduler level for arm64
  2022-04-22 11:51 ` Qing Wang
@ 2022-04-22 11:51   ` Qing Wang
  -1 siblings, 0 replies; 24+ messages in thread
From: Qing Wang @ 2022-04-22 11:51 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Sudeep Holla, Greg Kroah-Hartman,
	Rafael J. Wysocki, linux-arm-kernel, linux-kernel
  Cc: vincent.guittot, dietmar.eggemann, Wang Qing

From: Wang Qing <wangqing@vivo.com>

The DSU-110 DynamIQ™ cluster supports blocks that are called complexes
which contain up to two cores of the same type and some shared logic.
Sharing some logic between the cores can make a complex area efficient.

This patch adds complex level for complexs and automatically enables
the load balance among complexs. It will directly benefit a lot of
workload which loves more resources such as memory bandwidth, caches.

Testing has been done with Stream benchmark:
8threads stream (2 little cores * 2(complex) + 3 medium cores + 1 big core)
                stream                 stream
                w/o patch              w/ patch
MB/sec copy     37579.2 (   0.00%)    39127.3 (   4.12%)
MB/sec scale    38261.1 (   0.00%)    39195.4 (   2.44%)
MB/sec add      39497.0 (   0.00%)    41101.5 (   4.06%)
MB/sec triad    39885.6 (   0.00%)    40772.7 (   2.22%)

And in order to support this features, we defined arm64_topology.

V2:
fix commit log and loop more

Signed-off-by: Wang Qing <wangqing@vivo.com>
---
 arch/arm64/Kconfig      | 13 +++++++++++
 arch/arm64/kernel/smp.c | 48 ++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 60 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index edbe035cb0e3..4063de8c6153 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1207,6 +1207,19 @@ config SCHED_CLUSTER
 	  by sharing mid-level caches, last-level cache tags or internal
 	  busses.
 
+config SCHED_COMPLEX
+	bool "Complex scheduler support"
+	help
+	  DSU supports blocks that are called complexes which contain up to
+	  two cores of the same type and some shared logic. Sharing some logic
+	  between the cores can make a complex area efficient.
+
+	  Complex also can be considered as a shared cache group smaller
+	  than cluster.
+
+	  Complex scheduler support improves the CPU scheduler's decision
+	  making when dealing with machines that have complexs of CPUs.
+
 config SCHED_SMT
 	bool "SMT scheduler support"
 	help
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 3b46041f2b97..526765112146 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -14,6 +14,7 @@
 #include <linux/sched/mm.h>
 #include <linux/sched/hotplug.h>
 #include <linux/sched/task_stack.h>
+#include <linux/sched/topology.h>
 #include <linux/interrupt.h>
 #include <linux/cache.h>
 #include <linux/profile.h>
@@ -57,6 +58,10 @@
 DEFINE_PER_CPU_READ_MOSTLY(int, cpu_number);
 EXPORT_PER_CPU_SYMBOL(cpu_number);
 
+#ifdef SCHED_COMPLEX
+DEFINE_PER_CPU_READ_MOSTLY(cpumask_t, cpu_complex_map);
+#endif
+
 /*
  * as from 2.5, kernels no longer have an init_tasks structure
  * so we need some other way of telling a new secondary core
@@ -715,6 +720,47 @@ void __init smp_init_cpus(void)
 	}
 }
 
+#ifdef SCHED_COMPLEX
+static int arm64_complex_flags(void)
+{
+	return SD_SHARE_PKG_RESOURCES;
+}
+
+const struct cpumask *arm64_complex_mask(int cpu)
+{
+	const struct cpumask *core_mask = cpu_cpu_mask(cpu);
+
+	/* Find the smaller shared cache level than clustergroup and coregroup*/
+#ifdef CONFIG_SCHED_MC
+	core_mask = cpu_coregroup_mask(cpu);
+#endif
+#ifdef CONFIG_SCHED_CLUSTER
+	core_mask = cpu_clustergroup_mask(cpu);
+#endif
+
+	find_subset_of_share_cache(core_mask, cpu, &per_cpu(cpu_complex_map, cpu));
+
+	return &per_cpu(cpu_complex_map, cpu);
+}
+#endif
+
+static struct sched_domain_topology_level arm64_topology[] = {
+#ifdef CONFIG_SCHED_SMT
+	{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
+#endif
+#ifdef CONFIG_SCHED_COMPLEX
+	{ arm64_complex_mask, arm64_complex_flags, SD_INIT_NAME(CPL) },
+#endif
+#ifdef CONFIG_SCHED_CLUSTER
+	{ cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
+#endif
+#ifdef CONFIG_SCHED_MC
+	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
+#endif
+	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ NULL, },
+};
+
 void __init smp_prepare_cpus(unsigned int max_cpus)
 {
 	const struct cpu_operations *ops;
@@ -723,9 +769,9 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
 	unsigned int this_cpu;
 
 	init_cpu_topology();
-
 	this_cpu = smp_processor_id();
 	store_cpu_topology(this_cpu);
+	set_sched_topology(arm64_topology);
 	numa_store_cpu_info(this_cpu);
 	numa_add_cpu(this_cpu);
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH V2 2/2] arm64: Add complex scheduler level for arm64
@ 2022-04-22 11:51   ` Qing Wang
  0 siblings, 0 replies; 24+ messages in thread
From: Qing Wang @ 2022-04-22 11:51 UTC (permalink / raw)
  To: Catalin Marinas, Will Deacon, Sudeep Holla, Greg Kroah-Hartman,
	Rafael J. Wysocki, linux-arm-kernel, linux-kernel
  Cc: vincent.guittot, dietmar.eggemann, Wang Qing

From: Wang Qing <wangqing@vivo.com>

The DSU-110 DynamIQ™ cluster supports blocks that are called complexes
which contain up to two cores of the same type and some shared logic.
Sharing some logic between the cores can make a complex area efficient.

This patch adds complex level for complexs and automatically enables
the load balance among complexs. It will directly benefit a lot of
workload which loves more resources such as memory bandwidth, caches.

Testing has been done with Stream benchmark:
8threads stream (2 little cores * 2(complex) + 3 medium cores + 1 big core)
                stream                 stream
                w/o patch              w/ patch
MB/sec copy     37579.2 (   0.00%)    39127.3 (   4.12%)
MB/sec scale    38261.1 (   0.00%)    39195.4 (   2.44%)
MB/sec add      39497.0 (   0.00%)    41101.5 (   4.06%)
MB/sec triad    39885.6 (   0.00%)    40772.7 (   2.22%)

And in order to support this features, we defined arm64_topology.

V2:
fix commit log and loop more

Signed-off-by: Wang Qing <wangqing@vivo.com>
---
 arch/arm64/Kconfig      | 13 +++++++++++
 arch/arm64/kernel/smp.c | 48 ++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 60 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index edbe035cb0e3..4063de8c6153 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1207,6 +1207,19 @@ config SCHED_CLUSTER
 	  by sharing mid-level caches, last-level cache tags or internal
 	  busses.
 
+config SCHED_COMPLEX
+	bool "Complex scheduler support"
+	help
+	  DSU supports blocks that are called complexes which contain up to
+	  two cores of the same type and some shared logic. Sharing some logic
+	  between the cores can make a complex area efficient.
+
+	  Complex also can be considered as a shared cache group smaller
+	  than cluster.
+
+	  Complex scheduler support improves the CPU scheduler's decision
+	  making when dealing with machines that have complexs of CPUs.
+
 config SCHED_SMT
 	bool "SMT scheduler support"
 	help
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 3b46041f2b97..526765112146 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -14,6 +14,7 @@
 #include <linux/sched/mm.h>
 #include <linux/sched/hotplug.h>
 #include <linux/sched/task_stack.h>
+#include <linux/sched/topology.h>
 #include <linux/interrupt.h>
 #include <linux/cache.h>
 #include <linux/profile.h>
@@ -57,6 +58,10 @@
 DEFINE_PER_CPU_READ_MOSTLY(int, cpu_number);
 EXPORT_PER_CPU_SYMBOL(cpu_number);
 
+#ifdef SCHED_COMPLEX
+DEFINE_PER_CPU_READ_MOSTLY(cpumask_t, cpu_complex_map);
+#endif
+
 /*
  * as from 2.5, kernels no longer have an init_tasks structure
  * so we need some other way of telling a new secondary core
@@ -715,6 +720,47 @@ void __init smp_init_cpus(void)
 	}
 }
 
+#ifdef SCHED_COMPLEX
+static int arm64_complex_flags(void)
+{
+	return SD_SHARE_PKG_RESOURCES;
+}
+
+const struct cpumask *arm64_complex_mask(int cpu)
+{
+	const struct cpumask *core_mask = cpu_cpu_mask(cpu);
+
+	/* Find the smaller shared cache level than clustergroup and coregroup*/
+#ifdef CONFIG_SCHED_MC
+	core_mask = cpu_coregroup_mask(cpu);
+#endif
+#ifdef CONFIG_SCHED_CLUSTER
+	core_mask = cpu_clustergroup_mask(cpu);
+#endif
+
+	find_subset_of_share_cache(core_mask, cpu, &per_cpu(cpu_complex_map, cpu));
+
+	return &per_cpu(cpu_complex_map, cpu);
+}
+#endif
+
+static struct sched_domain_topology_level arm64_topology[] = {
+#ifdef CONFIG_SCHED_SMT
+	{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
+#endif
+#ifdef CONFIG_SCHED_COMPLEX
+	{ arm64_complex_mask, arm64_complex_flags, SD_INIT_NAME(CPL) },
+#endif
+#ifdef CONFIG_SCHED_CLUSTER
+	{ cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
+#endif
+#ifdef CONFIG_SCHED_MC
+	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
+#endif
+	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ NULL, },
+};
+
 void __init smp_prepare_cpus(unsigned int max_cpus)
 {
 	const struct cpu_operations *ops;
@@ -723,9 +769,9 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
 	unsigned int this_cpu;
 
 	init_cpu_topology();
-
 	this_cpu = smp_processor_id();
 	store_cpu_topology(this_cpu);
+	set_sched_topology(arm64_topology);
 	numa_store_cpu_info(this_cpu);
 	numa_add_cpu(this_cpu);
 
-- 
2.7.4


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH V2 1/2] arch_topology: support for parsing cache topology from DT
  2022-04-22 11:51   ` Qing Wang
@ 2022-04-22 12:30     ` Greg Kroah-Hartman
  -1 siblings, 0 replies; 24+ messages in thread
From: Greg Kroah-Hartman @ 2022-04-22 12:30 UTC (permalink / raw)
  To: Qing Wang
  Cc: Catalin Marinas, Will Deacon, Sudeep Holla, Rafael J. Wysocki,
	linux-arm-kernel, linux-kernel, vincent.guittot,
	dietmar.eggemann

On Fri, Apr 22, 2022 at 04:51:25AM -0700, Qing Wang wrote:
> From: Wang Qing <wangqing@vivo.com>
> 
> When ACPI is not enabled, we can get cache topolopy from DT like:
> *		cpu0: cpu@000 {
> *			next-level-cache = <&L2_1>;
> *			L2_1: l2-cache {
> * 				compatible = "cache";
> *				next-level-cache = <&L3_1>;
> * 			};
> *			L3_1: l3-cache {
> * 				compatible = "cache";
> * 			};
> *		};
> *
> *		cpu1: cpu@001 {
> *			next-level-cache = <&L2_1>;
> *		};
> *		...
> *		};
> cache_topology[] hold the pointer describing by "next-level-cache", 
> which can describe the cache topology of every level.
> 
> MAX_CACHE_LEVEL is strictly corresponding to the cache level from L2.

I have no idea what this changelog means at all.

What are you trying to do?  What problem are you solving?  Why are you
doing any of this?


> 
> V2:
> make function name more sense

As per the documentation this goes below the --- line, right?


> 
> Signed-off-by: Wang Qing <wangqing@vivo.com>
> ---
>  drivers/base/arch_topology.c  | 47 ++++++++++++++++++++++++++++++++++-
>  include/linux/arch_topology.h |  3 +++
>  2 files changed, 49 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> index 1d6636ebaac5..46e84ce2ec0c 100644
> --- a/drivers/base/arch_topology.c
> +++ b/drivers/base/arch_topology.c
> @@ -480,8 +480,10 @@ static int __init get_cpu_for_node(struct device_node *node)
>  		return -1;
>  
>  	cpu = of_cpu_node_to_id(cpu_node);
> -	if (cpu >= 0)
> +	if (cpu >= 0) {
>  		topology_parse_cpu_capacity(cpu_node, cpu);
> +		topology_parse_cpu_caches(cpu_node, cpu);
> +	}
>  	else
>  		pr_info("CPU node for %pOF exist but the possible cpu range is :%*pbl\n",
>  			cpu_node, cpumask_pr_args(cpu_possible_mask));
> @@ -647,6 +649,49 @@ static int __init parse_dt_topology(void)
>  }
>  #endif
>  
> +/*
> + * cpu cache topology table
> + */
> +#define MAX_CACHE_LEVEL 7
> +static struct device_node *cache_topology[NR_CPUS][MAX_CACHE_LEVEL];

So for a normal big system of 4k cpus * 7 levels, that's a lot of
memory?  are you sure?

How big of a box did you test this on?

> +
> +void topology_parse_cpu_caches(struct device_node *cpu_node, int cpu)
> +{
> +	struct device_node *node_cache = cpu_node;
> +	int level = 0;
> +
> +	while (level < MAX_CACHE_LEVEL) {
> +		node_cache = of_parse_phandle(node_cache, "next-level-cache", 0);
> +		if (!node_cache)
> +			break;
> +
> +		cache_topology[cpu][level++] = node_cache;
> +	}

No locking anywhere?  What could go wrong :(

> +}
> +
> +/*
> + * find the largest subset of the shared cache in the range of cpu_mask
> + */
> +void find_subset_of_share_cache(const struct cpumask *cpu_mask, int cpu,
> +								 struct cpumask *sc_mask)

Again, horrid global function name.

And no kernel documentation for how this works?



> +{
> +	int cache_level, cpu_id;
> +
> +	for (cache_level = MAX_CACHE_LEVEL - 1; cache_level >= 0; cache_level--) {
> +		if (!cache_topology[cpu][cache_level])
> +			continue;

No locking???


> +
> +		cpumask_clear(sc_mask);
> +		for (cpu_id = 0; cpu_id < NR_CPUS; cpu_id++) {
> +			if (cache_topology[cpu][cache_level] == cache_topology[cpu_id][cache_level])
> +				cpumask_set_cpu(cpu_id, sc_mask);
> +		}
> +
> +		if (cpumask_subset(sc_mask, cpu_mask))
> +			break;
> +	}
> +}
> +
>  /*
>   * cpu topology table
>   */
> diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
> index 58cbe18d825c..c6ed727e453c 100644
> --- a/include/linux/arch_topology.h
> +++ b/include/linux/arch_topology.h
> @@ -93,6 +93,9 @@ void update_siblings_masks(unsigned int cpu);
>  void remove_cpu_topology(unsigned int cpuid);
>  void reset_cpu_topology(void);
>  int parse_acpi_topology(void);
> +void topology_parse_cpu_caches(struct device_node *cpu_node, int cpu);
> +void find_subset_of_share_cache(const struct cpumask *cpu_mask, int cpu,
> +								 struct cpumask *sc_mask);

I still have no idea what this last function is supposed to do.

And very odd indentation, did you run checkpatch on this?

totally confused,

greg k-h

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V2 1/2] arch_topology: support for parsing cache topology from DT
@ 2022-04-22 12:30     ` Greg Kroah-Hartman
  0 siblings, 0 replies; 24+ messages in thread
From: Greg Kroah-Hartman @ 2022-04-22 12:30 UTC (permalink / raw)
  To: Qing Wang
  Cc: Catalin Marinas, Will Deacon, Sudeep Holla, Rafael J. Wysocki,
	linux-arm-kernel, linux-kernel, vincent.guittot,
	dietmar.eggemann

On Fri, Apr 22, 2022 at 04:51:25AM -0700, Qing Wang wrote:
> From: Wang Qing <wangqing@vivo.com>
> 
> When ACPI is not enabled, we can get cache topolopy from DT like:
> *		cpu0: cpu@000 {
> *			next-level-cache = <&L2_1>;
> *			L2_1: l2-cache {
> * 				compatible = "cache";
> *				next-level-cache = <&L3_1>;
> * 			};
> *			L3_1: l3-cache {
> * 				compatible = "cache";
> * 			};
> *		};
> *
> *		cpu1: cpu@001 {
> *			next-level-cache = <&L2_1>;
> *		};
> *		...
> *		};
> cache_topology[] hold the pointer describing by "next-level-cache", 
> which can describe the cache topology of every level.
> 
> MAX_CACHE_LEVEL is strictly corresponding to the cache level from L2.

I have no idea what this changelog means at all.

What are you trying to do?  What problem are you solving?  Why are you
doing any of this?


> 
> V2:
> make function name more sense

As per the documentation this goes below the --- line, right?


> 
> Signed-off-by: Wang Qing <wangqing@vivo.com>
> ---
>  drivers/base/arch_topology.c  | 47 ++++++++++++++++++++++++++++++++++-
>  include/linux/arch_topology.h |  3 +++
>  2 files changed, 49 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> index 1d6636ebaac5..46e84ce2ec0c 100644
> --- a/drivers/base/arch_topology.c
> +++ b/drivers/base/arch_topology.c
> @@ -480,8 +480,10 @@ static int __init get_cpu_for_node(struct device_node *node)
>  		return -1;
>  
>  	cpu = of_cpu_node_to_id(cpu_node);
> -	if (cpu >= 0)
> +	if (cpu >= 0) {
>  		topology_parse_cpu_capacity(cpu_node, cpu);
> +		topology_parse_cpu_caches(cpu_node, cpu);
> +	}
>  	else
>  		pr_info("CPU node for %pOF exist but the possible cpu range is :%*pbl\n",
>  			cpu_node, cpumask_pr_args(cpu_possible_mask));
> @@ -647,6 +649,49 @@ static int __init parse_dt_topology(void)
>  }
>  #endif
>  
> +/*
> + * cpu cache topology table
> + */
> +#define MAX_CACHE_LEVEL 7
> +static struct device_node *cache_topology[NR_CPUS][MAX_CACHE_LEVEL];

So for a normal big system of 4k cpus * 7 levels, that's a lot of
memory?  are you sure?

How big of a box did you test this on?

> +
> +void topology_parse_cpu_caches(struct device_node *cpu_node, int cpu)
> +{
> +	struct device_node *node_cache = cpu_node;
> +	int level = 0;
> +
> +	while (level < MAX_CACHE_LEVEL) {
> +		node_cache = of_parse_phandle(node_cache, "next-level-cache", 0);
> +		if (!node_cache)
> +			break;
> +
> +		cache_topology[cpu][level++] = node_cache;
> +	}

No locking anywhere?  What could go wrong :(

> +}
> +
> +/*
> + * find the largest subset of the shared cache in the range of cpu_mask
> + */
> +void find_subset_of_share_cache(const struct cpumask *cpu_mask, int cpu,
> +								 struct cpumask *sc_mask)

Again, horrid global function name.

And no kernel documentation for how this works?



> +{
> +	int cache_level, cpu_id;
> +
> +	for (cache_level = MAX_CACHE_LEVEL - 1; cache_level >= 0; cache_level--) {
> +		if (!cache_topology[cpu][cache_level])
> +			continue;

No locking???


> +
> +		cpumask_clear(sc_mask);
> +		for (cpu_id = 0; cpu_id < NR_CPUS; cpu_id++) {
> +			if (cache_topology[cpu][cache_level] == cache_topology[cpu_id][cache_level])
> +				cpumask_set_cpu(cpu_id, sc_mask);
> +		}
> +
> +		if (cpumask_subset(sc_mask, cpu_mask))
> +			break;
> +	}
> +}
> +
>  /*
>   * cpu topology table
>   */
> diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
> index 58cbe18d825c..c6ed727e453c 100644
> --- a/include/linux/arch_topology.h
> +++ b/include/linux/arch_topology.h
> @@ -93,6 +93,9 @@ void update_siblings_masks(unsigned int cpu);
>  void remove_cpu_topology(unsigned int cpuid);
>  void reset_cpu_topology(void);
>  int parse_acpi_topology(void);
> +void topology_parse_cpu_caches(struct device_node *cpu_node, int cpu);
> +void find_subset_of_share_cache(const struct cpumask *cpu_mask, int cpu,
> +								 struct cpumask *sc_mask);

I still have no idea what this last function is supposed to do.

And very odd indentation, did you run checkpatch on this?

totally confused,

greg k-h

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V2 2/2] arm64: Add complex scheduler level for arm64
  2022-04-22 11:51   ` Qing Wang
@ 2022-04-22 12:31     ` Greg Kroah-Hartman
  -1 siblings, 0 replies; 24+ messages in thread
From: Greg Kroah-Hartman @ 2022-04-22 12:31 UTC (permalink / raw)
  To: Qing Wang
  Cc: Catalin Marinas, Will Deacon, Sudeep Holla, Rafael J. Wysocki,
	linux-arm-kernel, linux-kernel, vincent.guittot,
	dietmar.eggemann

On Fri, Apr 22, 2022 at 04:51:26AM -0700, Qing Wang wrote:
> From: Wang Qing <wangqing@vivo.com>
> 
> The DSU-110 DynamIQ™ cluster supports blocks that are called complexes
> which contain up to two cores of the same type and some shared logic.
> Sharing some logic between the cores can make a complex area efficient.
> 
> This patch adds complex level for complexs and automatically enables
> the load balance among complexs. It will directly benefit a lot of
> workload which loves more resources such as memory bandwidth, caches.
> 
> Testing has been done with Stream benchmark:
> 8threads stream (2 little cores * 2(complex) + 3 medium cores + 1 big core)
>                 stream                 stream
>                 w/o patch              w/ patch
> MB/sec copy     37579.2 (   0.00%)    39127.3 (   4.12%)
> MB/sec scale    38261.1 (   0.00%)    39195.4 (   2.44%)
> MB/sec add      39497.0 (   0.00%)    41101.5 (   4.06%)
> MB/sec triad    39885.6 (   0.00%)    40772.7 (   2.22%)
> 
> And in order to support this features, we defined arm64_topology.
> 
> V2:
> fix commit log and loop more
> 
> Signed-off-by: Wang Qing <wangqing@vivo.com>
> ---
>  arch/arm64/Kconfig      | 13 +++++++++++
>  arch/arm64/kernel/smp.c | 48 ++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 60 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index edbe035cb0e3..4063de8c6153 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -1207,6 +1207,19 @@ config SCHED_CLUSTER
>  	  by sharing mid-level caches, last-level cache tags or internal
>  	  busses.
>  
> +config SCHED_COMPLEX
> +	bool "Complex scheduler support"
> +	help
> +	  DSU supports blocks that are called complexes which contain up to
> +	  two cores of the same type and some shared logic. Sharing some logic
> +	  between the cores can make a complex area efficient.
> +
> +	  Complex also can be considered as a shared cache group smaller
> +	  than cluster.
> +
> +	  Complex scheduler support improves the CPU scheduler's decision
> +	  making when dealing with machines that have complexs of CPUs.
> +
>  config SCHED_SMT
>  	bool "SMT scheduler support"
>  	help
> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> index 3b46041f2b97..526765112146 100644
> --- a/arch/arm64/kernel/smp.c
> +++ b/arch/arm64/kernel/smp.c
> @@ -14,6 +14,7 @@
>  #include <linux/sched/mm.h>
>  #include <linux/sched/hotplug.h>
>  #include <linux/sched/task_stack.h>
> +#include <linux/sched/topology.h>
>  #include <linux/interrupt.h>
>  #include <linux/cache.h>
>  #include <linux/profile.h>
> @@ -57,6 +58,10 @@
>  DEFINE_PER_CPU_READ_MOSTLY(int, cpu_number);
>  EXPORT_PER_CPU_SYMBOL(cpu_number);
>  
> +#ifdef SCHED_COMPLEX
> +DEFINE_PER_CPU_READ_MOSTLY(cpumask_t, cpu_complex_map);
> +#endif

ifdefs should not be in .c files.


> +
>  /*
>   * as from 2.5, kernels no longer have an init_tasks structure
>   * so we need some other way of telling a new secondary core
> @@ -715,6 +720,47 @@ void __init smp_init_cpus(void)
>  	}
>  }
>  
> +#ifdef SCHED_COMPLEX

same here.

> +static int arm64_complex_flags(void)
> +{
> +	return SD_SHARE_PKG_RESOURCES;
> +}
> +
> +const struct cpumask *arm64_complex_mask(int cpu)
> +{
> +	const struct cpumask *core_mask = cpu_cpu_mask(cpu);
> +
> +	/* Find the smaller shared cache level than clustergroup and coregroup*/
> +#ifdef CONFIG_SCHED_MC
> +	core_mask = cpu_coregroup_mask(cpu);
> +#endif
> +#ifdef CONFIG_SCHED_CLUSTER
> +	core_mask = cpu_clustergroup_mask(cpu);
> +#endif

See, same here.  This is a mess and unmaintainable.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V2 2/2] arm64: Add complex scheduler level for arm64
@ 2022-04-22 12:31     ` Greg Kroah-Hartman
  0 siblings, 0 replies; 24+ messages in thread
From: Greg Kroah-Hartman @ 2022-04-22 12:31 UTC (permalink / raw)
  To: Qing Wang
  Cc: Catalin Marinas, Will Deacon, Sudeep Holla, Rafael J. Wysocki,
	linux-arm-kernel, linux-kernel, vincent.guittot,
	dietmar.eggemann

On Fri, Apr 22, 2022 at 04:51:26AM -0700, Qing Wang wrote:
> From: Wang Qing <wangqing@vivo.com>
> 
> The DSU-110 DynamIQ™ cluster supports blocks that are called complexes
> which contain up to two cores of the same type and some shared logic.
> Sharing some logic between the cores can make a complex area efficient.
> 
> This patch adds complex level for complexs and automatically enables
> the load balance among complexs. It will directly benefit a lot of
> workload which loves more resources such as memory bandwidth, caches.
> 
> Testing has been done with Stream benchmark:
> 8threads stream (2 little cores * 2(complex) + 3 medium cores + 1 big core)
>                 stream                 stream
>                 w/o patch              w/ patch
> MB/sec copy     37579.2 (   0.00%)    39127.3 (   4.12%)
> MB/sec scale    38261.1 (   0.00%)    39195.4 (   2.44%)
> MB/sec add      39497.0 (   0.00%)    41101.5 (   4.06%)
> MB/sec triad    39885.6 (   0.00%)    40772.7 (   2.22%)
> 
> And in order to support this features, we defined arm64_topology.
> 
> V2:
> fix commit log and loop more
> 
> Signed-off-by: Wang Qing <wangqing@vivo.com>
> ---
>  arch/arm64/Kconfig      | 13 +++++++++++
>  arch/arm64/kernel/smp.c | 48 ++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 60 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index edbe035cb0e3..4063de8c6153 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -1207,6 +1207,19 @@ config SCHED_CLUSTER
>  	  by sharing mid-level caches, last-level cache tags or internal
>  	  busses.
>  
> +config SCHED_COMPLEX
> +	bool "Complex scheduler support"
> +	help
> +	  DSU supports blocks that are called complexes which contain up to
> +	  two cores of the same type and some shared logic. Sharing some logic
> +	  between the cores can make a complex area efficient.
> +
> +	  Complex also can be considered as a shared cache group smaller
> +	  than cluster.
> +
> +	  Complex scheduler support improves the CPU scheduler's decision
> +	  making when dealing with machines that have complexs of CPUs.
> +
>  config SCHED_SMT
>  	bool "SMT scheduler support"
>  	help
> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
> index 3b46041f2b97..526765112146 100644
> --- a/arch/arm64/kernel/smp.c
> +++ b/arch/arm64/kernel/smp.c
> @@ -14,6 +14,7 @@
>  #include <linux/sched/mm.h>
>  #include <linux/sched/hotplug.h>
>  #include <linux/sched/task_stack.h>
> +#include <linux/sched/topology.h>
>  #include <linux/interrupt.h>
>  #include <linux/cache.h>
>  #include <linux/profile.h>
> @@ -57,6 +58,10 @@
>  DEFINE_PER_CPU_READ_MOSTLY(int, cpu_number);
>  EXPORT_PER_CPU_SYMBOL(cpu_number);
>  
> +#ifdef SCHED_COMPLEX
> +DEFINE_PER_CPU_READ_MOSTLY(cpumask_t, cpu_complex_map);
> +#endif

ifdefs should not be in .c files.


> +
>  /*
>   * as from 2.5, kernels no longer have an init_tasks structure
>   * so we need some other way of telling a new secondary core
> @@ -715,6 +720,47 @@ void __init smp_init_cpus(void)
>  	}
>  }
>  
> +#ifdef SCHED_COMPLEX

same here.

> +static int arm64_complex_flags(void)
> +{
> +	return SD_SHARE_PKG_RESOURCES;
> +}
> +
> +const struct cpumask *arm64_complex_mask(int cpu)
> +{
> +	const struct cpumask *core_mask = cpu_cpu_mask(cpu);
> +
> +	/* Find the smaller shared cache level than clustergroup and coregroup*/
> +#ifdef CONFIG_SCHED_MC
> +	core_mask = cpu_coregroup_mask(cpu);
> +#endif
> +#ifdef CONFIG_SCHED_CLUSTER
> +	core_mask = cpu_clustergroup_mask(cpu);
> +#endif

See, same here.  This is a mess and unmaintainable.

thanks,

greg k-h

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH V2 2/2] arm64: Add complex scheduler level for arm64
  2022-04-22 12:31     ` Greg Kroah-Hartman
@ 2022-04-24  2:26       ` 王擎
  -1 siblings, 0 replies; 24+ messages in thread
From: 王擎 @ 2022-04-24  2:26 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Catalin Marinas, Will Deacon, Sudeep Holla, Rafael J. Wysocki,
	linux-arm-kernel, linux-kernel, vincent.guittot,
	dietmar.eggemann


>> From: Wang Qing <wangqing@vivo.com>
>> 
>> The DSU-110 DynamIQ™ cluster supports blocks that are called complexes
>> which contain up to two cores of the same type and some shared logic.
>> Sharing some logic between the cores can make a complex area efficient.
>> 
>> This patch adds complex level for complexs and automatically enables
>> the load balance among complexs. It will directly benefit a lot of
>> workload which loves more resources such as memory bandwidth, caches.
>> 
>> Testing has been done with Stream benchmark:
>> 8threads stream (2 little cores * 2(complex) + 3 medium cores + 1 big core)
>>                 stream                 stream
>>                 w/o patch              w/ patch
>> MB/sec copy     37579.2 (   0.00%)    39127.3 (   4.12%)
>> MB/sec scale    38261.1 (   0.00%)    39195.4 (   2.44%)
>> MB/sec add      39497.0 (   0.00%)    41101.5 (   4.06%)
>> MB/sec triad    39885.6 (   0.00%)    40772.7 (   2.22%)
>> 
>> And in order to support this features, we defined arm64_topology.
>> 
>> V2:
>> fix commit log and loop more
>> 
>> Signed-off-by: Wang Qing <wangqing@vivo.com>
>> ---
>>  arch/arm64/Kconfig      | 13 +++++++++++
>>  arch/arm64/kernel/smp.c | 48 ++++++++++++++++++++++++++++++++++++++++-
>>  2 files changed, 60 insertions(+), 1 deletion(-)
>> 
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index edbe035cb0e3..4063de8c6153 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -1207,6 +1207,19 @@ config SCHED_CLUSTER
>>          by sharing mid-level caches, last-level cache tags or internal
>>          busses.
>>  
>> +config SCHED_COMPLEX
>> +     bool "Complex scheduler support"
>> +     help
>> +       DSU supports blocks that are called complexes which contain up to
>> +       two cores of the same type and some shared logic. Sharing some logic
>> +       between the cores can make a complex area efficient.
>> +
>> +       Complex also can be considered as a shared cache group smaller
>> +       than cluster.
>> +
>> +       Complex scheduler support improves the CPU scheduler's decision
>> +       making when dealing with machines that have complexs of CPUs.
>> +
>>  config SCHED_SMT
>>        bool "SMT scheduler support"
>>        help
>> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
>> index 3b46041f2b97..526765112146 100644
>> --- a/arch/arm64/kernel/smp.c
>> +++ b/arch/arm64/kernel/smp.c
>> @@ -14,6 +14,7 @@
>>  #include <linux/sched/mm.h>
>>  #include <linux/sched/hotplug.h>
>>  #include <linux/sched/task_stack.h>
>> +#include <linux/sched/topology.h>
>>  #include <linux/interrupt.h>
>>  #include <linux/cache.h>
>>  #include <linux/profile.h>
>> @@ -57,6 +58,10 @@
>>  DEFINE_PER_CPU_READ_MOSTLY(int, cpu_number);
>>  EXPORT_PER_CPU_SYMBOL(cpu_number);
>>  
>> +#ifdef SCHED_COMPLEX
>> +DEFINE_PER_CPU_READ_MOSTLY(cpumask_t, cpu_complex_map);
>> +#endif
>
>ifdefs should not be in .c files.

But I see a lot of ifdefs in .c files, change to IsEnabled() instead?
I'm just follow the x86_topology and default_topology does.

Thanks,
Qing

>
>
>> +
>>  /*
>>   * as from 2.5, kernels no longer have an init_tasks structure
>>   * so we need some other way of telling a new secondary core
>> @@ -715,6 +720,47 @@ void __init smp_init_cpus(void)
>>        }
>>  }
>>  
>> +#ifdef SCHED_COMPLEX
>
>same here.
>
>> +static int arm64_complex_flags(void)
>> +{
>> +     return SD_SHARE_PKG_RESOURCES;
>> +}
>> +
>> +const struct cpumask *arm64_complex_mask(int cpu)
>> +{
>> +     const struct cpumask *core_mask = cpu_cpu_mask(cpu);
>> +
>> +     /* Find the smaller shared cache level than clustergroup and coregroup*/
>> +#ifdef CONFIG_SCHED_MC
>> +     core_mask = cpu_coregroup_mask(cpu);
>> +#endif
>> +#ifdef CONFIG_SCHED_CLUSTER
>> +     core_mask = cpu_clustergroup_mask(cpu);
>> +#endif
>
>See, same here.  This is a mess and unmaintainable.
>
>thanks,
>
>greg k-h

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH V2 2/2] arm64: Add complex scheduler level for arm64
@ 2022-04-24  2:26       ` 王擎
  0 siblings, 0 replies; 24+ messages in thread
From: 王擎 @ 2022-04-24  2:26 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Catalin Marinas, Will Deacon, Sudeep Holla, Rafael J. Wysocki,
	linux-arm-kernel, linux-kernel, vincent.guittot,
	dietmar.eggemann


>> From: Wang Qing <wangqing@vivo.com>
>> 
>> The DSU-110 DynamIQ™ cluster supports blocks that are called complexes
>> which contain up to two cores of the same type and some shared logic.
>> Sharing some logic between the cores can make a complex area efficient.
>> 
>> This patch adds complex level for complexs and automatically enables
>> the load balance among complexs. It will directly benefit a lot of
>> workload which loves more resources such as memory bandwidth, caches.
>> 
>> Testing has been done with Stream benchmark:
>> 8threads stream (2 little cores * 2(complex) + 3 medium cores + 1 big core)
>>                 stream                 stream
>>                 w/o patch              w/ patch
>> MB/sec copy     37579.2 (   0.00%)    39127.3 (   4.12%)
>> MB/sec scale    38261.1 (   0.00%)    39195.4 (   2.44%)
>> MB/sec add      39497.0 (   0.00%)    41101.5 (   4.06%)
>> MB/sec triad    39885.6 (   0.00%)    40772.7 (   2.22%)
>> 
>> And in order to support this features, we defined arm64_topology.
>> 
>> V2:
>> fix commit log and loop more
>> 
>> Signed-off-by: Wang Qing <wangqing@vivo.com>
>> ---
>>  arch/arm64/Kconfig      | 13 +++++++++++
>>  arch/arm64/kernel/smp.c | 48 ++++++++++++++++++++++++++++++++++++++++-
>>  2 files changed, 60 insertions(+), 1 deletion(-)
>> 
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index edbe035cb0e3..4063de8c6153 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -1207,6 +1207,19 @@ config SCHED_CLUSTER
>>          by sharing mid-level caches, last-level cache tags or internal
>>          busses.
>>  
>> +config SCHED_COMPLEX
>> +     bool "Complex scheduler support"
>> +     help
>> +       DSU supports blocks that are called complexes which contain up to
>> +       two cores of the same type and some shared logic. Sharing some logic
>> +       between the cores can make a complex area efficient.
>> +
>> +       Complex also can be considered as a shared cache group smaller
>> +       than cluster.
>> +
>> +       Complex scheduler support improves the CPU scheduler's decision
>> +       making when dealing with machines that have complexs of CPUs.
>> +
>>  config SCHED_SMT
>>        bool "SMT scheduler support"
>>        help
>> diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
>> index 3b46041f2b97..526765112146 100644
>> --- a/arch/arm64/kernel/smp.c
>> +++ b/arch/arm64/kernel/smp.c
>> @@ -14,6 +14,7 @@
>>  #include <linux/sched/mm.h>
>>  #include <linux/sched/hotplug.h>
>>  #include <linux/sched/task_stack.h>
>> +#include <linux/sched/topology.h>
>>  #include <linux/interrupt.h>
>>  #include <linux/cache.h>
>>  #include <linux/profile.h>
>> @@ -57,6 +58,10 @@
>>  DEFINE_PER_CPU_READ_MOSTLY(int, cpu_number);
>>  EXPORT_PER_CPU_SYMBOL(cpu_number);
>>  
>> +#ifdef SCHED_COMPLEX
>> +DEFINE_PER_CPU_READ_MOSTLY(cpumask_t, cpu_complex_map);
>> +#endif
>
>ifdefs should not be in .c files.

But I see a lot of ifdefs in .c files, change to IsEnabled() instead?
I'm just follow the x86_topology and default_topology does.

Thanks,
Qing

>
>
>> +
>>  /*
>>   * as from 2.5, kernels no longer have an init_tasks structure
>>   * so we need some other way of telling a new secondary core
>> @@ -715,6 +720,47 @@ void __init smp_init_cpus(void)
>>        }
>>  }
>>  
>> +#ifdef SCHED_COMPLEX
>
>same here.
>
>> +static int arm64_complex_flags(void)
>> +{
>> +     return SD_SHARE_PKG_RESOURCES;
>> +}
>> +
>> +const struct cpumask *arm64_complex_mask(int cpu)
>> +{
>> +     const struct cpumask *core_mask = cpu_cpu_mask(cpu);
>> +
>> +     /* Find the smaller shared cache level than clustergroup and coregroup*/
>> +#ifdef CONFIG_SCHED_MC
>> +     core_mask = cpu_coregroup_mask(cpu);
>> +#endif
>> +#ifdef CONFIG_SCHED_CLUSTER
>> +     core_mask = cpu_clustergroup_mask(cpu);
>> +#endif
>
>See, same here.  This is a mess and unmaintainable.
>
>thanks,
>
>greg k-h
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH V2 RESEND 1/2] arch_topology: support for parsing cache topology from DT
  2022-04-22 12:30     ` Greg Kroah-Hartman
@ 2022-04-24  2:53       ` 王擎
  -1 siblings, 0 replies; 24+ messages in thread
From: 王擎 @ 2022-04-24  2:53 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Catalin Marinas, Will Deacon, Sudeep Holla, Rafael J. Wysocki,
	linux-arm-kernel, linux-kernel, vincent.guittot,
	dietmar.eggemann


>> From: Wang Qing <wangqing@vivo.com>
>> 
>> When ACPI is not enabled, we can get cache topolopy from DT like:
>> *             cpu0: cpu@000 {
>> *                     next-level-cache = <&L2_1>;
>> *                     L2_1: l2-cache {
>> *                              compatible = "cache";
>> *                             next-level-cache = <&L3_1>;
>> *                      };
>> *                     L3_1: l3-cache {
>> *                              compatible = "cache";
>> *                      };
>> *             };
>> *
>> *             cpu1: cpu@001 {
>> *                     next-level-cache = <&L2_1>;
>> *             };
>> *             ...
>> *             };
>> cache_topology[] hold the pointer describing by "next-level-cache", 
>> which can describe the cache topology of every level.
>> 
>> MAX_CACHE_LEVEL is strictly corresponding to the cache level from L2.
>
>I have no idea what this changelog means at all.
>
>What are you trying to do?  What problem are you solving?  Why are you
>doing any of this?

This patch just supports parsing cache topology from DT in the early
(before sched domain initialization). Then we can use the information
about shared cache level to build the sched domain level.

>
>
>> 
>> V2:
>> make function name more sense
>
>As per the documentation this goes below the --- line, right?
>
>
>> 
>> Signed-off-by: Wang Qing <wangqing@vivo.com>
>> ---
>>  drivers/base/arch_topology.c  | 47 ++++++++++++++++++++++++++++++++++-
>>  include/linux/arch_topology.h |  3 +++
>>  2 files changed, 49 insertions(+), 1 deletion(-)
>> 
>> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
>> index 1d6636ebaac5..46e84ce2ec0c 100644
>> --- a/drivers/base/arch_topology.c
>> +++ b/drivers/base/arch_topology.c
>> @@ -480,8 +480,10 @@ static int __init get_cpu_for_node(struct device_node *node)
>>                return -1;
>>  
>>        cpu = of_cpu_node_to_id(cpu_node);
>> -     if (cpu >= 0)
>> +     if (cpu >= 0) {
>>                topology_parse_cpu_capacity(cpu_node, cpu);
>> +             topology_parse_cpu_caches(cpu_node, cpu);
>> +     }
>>        else
>>                pr_info("CPU node for %pOF exist but the possible cpu range is :%*pbl\n",
>>                        cpu_node, cpumask_pr_args(cpu_possible_mask));
>> @@ -647,6 +649,49 @@ static int __init parse_dt_topology(void)
>>  }
>>  #endif
>>  
>> +/*
>> + * cpu cache topology table
>> + */
>> +#define MAX_CACHE_LEVEL 7
>> +static struct device_node *cache_topology[NR_CPUS][MAX_CACHE_LEVEL];
>
>So for a normal big system of 4k cpus * 7 levels, that's a lot of
>memory?  are you sure?

Yes, if CONFIG_NR_CPUS is configured as 4k, it will take up 224KB memory
in 4k cpus system.

>
>How big of a box did you test this on?

I tested on android phone.

>
>> +
>> +void topology_parse_cpu_caches(struct device_node *cpu_node, int cpu)
>> +{
>> +     struct device_node *node_cache = cpu_node;
>> +     int level = 0;
>> +
>> +     while (level < MAX_CACHE_LEVEL) {
>> +             node_cache = of_parse_phandle(node_cache, "next-level-cache", 0);
>> +             if (!node_cache)
>> +                     break;
>> +
>> +             cache_topology[cpu][level++] = node_cache;
>> +     }
>
>No locking anywhere?  What could go wrong :(
 
The calling process of the function has guaranteed no data racing condition.
cache_topology[] will only be modified by parse_dt_topology().
cache_topology[] behaves like cpu_topology[].

>
>> +}
>> +
>> +/*
>> + * find the largest subset of the shared cache in the range of cpu_mask
>> + */
>> +void find_subset_of_share_cache(const struct cpumask *cpu_mask, int cpu,
>> +                                                              struct cpumask *sc_mask)
>
>Again, horrid global function name.
>
>And no kernel documentation for how this works?

Maybe I should post an FRC patch first.

>
>
>
>> +{
>> +     int cache_level, cpu_id;
>> +
>> +     for (cache_level = MAX_CACHE_LEVEL - 1; cache_level >= 0; cache_level--) {
>> +             if (!cache_topology[cpu][cache_level])
>> +                     continue;
>
>No locking???
>
>
>> +
>> +             cpumask_clear(sc_mask);
>> +             for (cpu_id = 0; cpu_id < NR_CPUS; cpu_id++) {
>> +                     if (cache_topology[cpu][cache_level] == cache_topology[cpu_id][cache_level])
>> +                             cpumask_set_cpu(cpu_id, sc_mask);
>> +             }
>> +
>> +             if (cpumask_subset(sc_mask, cpu_mask))
>> +                     break;
>> +     }
>> +}
>> +
>>  /*
>>   * cpu topology table
>>   */
>> diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
>> index 58cbe18d825c..c6ed727e453c 100644
>> --- a/include/linux/arch_topology.h
>> +++ b/include/linux/arch_topology.h
>> @@ -93,6 +93,9 @@ void update_siblings_masks(unsigned int cpu);
>>  void remove_cpu_topology(unsigned int cpuid);
>>  void reset_cpu_topology(void);
>>  int parse_acpi_topology(void);
>> +void topology_parse_cpu_caches(struct device_node *cpu_node, int cpu);
>> +void find_subset_of_share_cache(const struct cpumask *cpu_mask, int cpu,
>> +                                                              struct cpumask *sc_mask);
>
>I still have no idea what this last function is supposed to do.

This function is to find the largest shared cache subset from the
specified cpumask, so we can make them a new level sched domain,
which will directly benefit a lot of workload.

>
>And very odd indentation, did you run checkpatch on this?

Yes, I modify it manually after running checkpatch, this is my problem.

>
>totally confused,

Thanks for your correction, 
Qing

>
>greg k-h 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH V2 RESEND 1/2] arch_topology: support for parsing cache topology from DT
@ 2022-04-24  2:53       ` 王擎
  0 siblings, 0 replies; 24+ messages in thread
From: 王擎 @ 2022-04-24  2:53 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Catalin Marinas, Will Deacon, Sudeep Holla, Rafael J. Wysocki,
	linux-arm-kernel, linux-kernel, vincent.guittot,
	dietmar.eggemann


>> From: Wang Qing <wangqing@vivo.com>
>> 
>> When ACPI is not enabled, we can get cache topolopy from DT like:
>> *             cpu0: cpu@000 {
>> *                     next-level-cache = <&L2_1>;
>> *                     L2_1: l2-cache {
>> *                              compatible = "cache";
>> *                             next-level-cache = <&L3_1>;
>> *                      };
>> *                     L3_1: l3-cache {
>> *                              compatible = "cache";
>> *                      };
>> *             };
>> *
>> *             cpu1: cpu@001 {
>> *                     next-level-cache = <&L2_1>;
>> *             };
>> *             ...
>> *             };
>> cache_topology[] hold the pointer describing by "next-level-cache", 
>> which can describe the cache topology of every level.
>> 
>> MAX_CACHE_LEVEL is strictly corresponding to the cache level from L2.
>
>I have no idea what this changelog means at all.
>
>What are you trying to do?  What problem are you solving?  Why are you
>doing any of this?

This patch just supports parsing cache topology from DT in the early
(before sched domain initialization). Then we can use the information
about shared cache level to build the sched domain level.

>
>
>> 
>> V2:
>> make function name more sense
>
>As per the documentation this goes below the --- line, right?
>
>
>> 
>> Signed-off-by: Wang Qing <wangqing@vivo.com>
>> ---
>>  drivers/base/arch_topology.c  | 47 ++++++++++++++++++++++++++++++++++-
>>  include/linux/arch_topology.h |  3 +++
>>  2 files changed, 49 insertions(+), 1 deletion(-)
>> 
>> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
>> index 1d6636ebaac5..46e84ce2ec0c 100644
>> --- a/drivers/base/arch_topology.c
>> +++ b/drivers/base/arch_topology.c
>> @@ -480,8 +480,10 @@ static int __init get_cpu_for_node(struct device_node *node)
>>                return -1;
>>  
>>        cpu = of_cpu_node_to_id(cpu_node);
>> -     if (cpu >= 0)
>> +     if (cpu >= 0) {
>>                topology_parse_cpu_capacity(cpu_node, cpu);
>> +             topology_parse_cpu_caches(cpu_node, cpu);
>> +     }
>>        else
>>                pr_info("CPU node for %pOF exist but the possible cpu range is :%*pbl\n",
>>                        cpu_node, cpumask_pr_args(cpu_possible_mask));
>> @@ -647,6 +649,49 @@ static int __init parse_dt_topology(void)
>>  }
>>  #endif
>>  
>> +/*
>> + * cpu cache topology table
>> + */
>> +#define MAX_CACHE_LEVEL 7
>> +static struct device_node *cache_topology[NR_CPUS][MAX_CACHE_LEVEL];
>
>So for a normal big system of 4k cpus * 7 levels, that's a lot of
>memory?  are you sure?

Yes, if CONFIG_NR_CPUS is configured as 4k, it will take up 224KB memory
in 4k cpus system.

>
>How big of a box did you test this on?

I tested on android phone.

>
>> +
>> +void topology_parse_cpu_caches(struct device_node *cpu_node, int cpu)
>> +{
>> +     struct device_node *node_cache = cpu_node;
>> +     int level = 0;
>> +
>> +     while (level < MAX_CACHE_LEVEL) {
>> +             node_cache = of_parse_phandle(node_cache, "next-level-cache", 0);
>> +             if (!node_cache)
>> +                     break;
>> +
>> +             cache_topology[cpu][level++] = node_cache;
>> +     }
>
>No locking anywhere?  What could go wrong :(
 
The calling process of the function has guaranteed no data racing condition.
cache_topology[] will only be modified by parse_dt_topology().
cache_topology[] behaves like cpu_topology[].

>
>> +}
>> +
>> +/*
>> + * find the largest subset of the shared cache in the range of cpu_mask
>> + */
>> +void find_subset_of_share_cache(const struct cpumask *cpu_mask, int cpu,
>> +                                                              struct cpumask *sc_mask)
>
>Again, horrid global function name.
>
>And no kernel documentation for how this works?

Maybe I should post an FRC patch first.

>
>
>
>> +{
>> +     int cache_level, cpu_id;
>> +
>> +     for (cache_level = MAX_CACHE_LEVEL - 1; cache_level >= 0; cache_level--) {
>> +             if (!cache_topology[cpu][cache_level])
>> +                     continue;
>
>No locking???
>
>
>> +
>> +             cpumask_clear(sc_mask);
>> +             for (cpu_id = 0; cpu_id < NR_CPUS; cpu_id++) {
>> +                     if (cache_topology[cpu][cache_level] == cache_topology[cpu_id][cache_level])
>> +                             cpumask_set_cpu(cpu_id, sc_mask);
>> +             }
>> +
>> +             if (cpumask_subset(sc_mask, cpu_mask))
>> +                     break;
>> +     }
>> +}
>> +
>>  /*
>>   * cpu topology table
>>   */
>> diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
>> index 58cbe18d825c..c6ed727e453c 100644
>> --- a/include/linux/arch_topology.h
>> +++ b/include/linux/arch_topology.h
>> @@ -93,6 +93,9 @@ void update_siblings_masks(unsigned int cpu);
>>  void remove_cpu_topology(unsigned int cpuid);
>>  void reset_cpu_topology(void);
>>  int parse_acpi_topology(void);
>> +void topology_parse_cpu_caches(struct device_node *cpu_node, int cpu);
>> +void find_subset_of_share_cache(const struct cpumask *cpu_mask, int cpu,
>> +                                                              struct cpumask *sc_mask);
>
>I still have no idea what this last function is supposed to do.

This function is to find the largest shared cache subset from the
specified cpumask, so we can make them a new level sched domain,
which will directly benefit a lot of workload.

>
>And very odd indentation, did you run checkpatch on this?

Yes, I modify it manually after running checkpatch, this is my problem.

>
>totally confused,

Thanks for your correction, 
Qing

>
>greg k-h 
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V2 0/2] Add complex scheduler level for arm64
  2022-04-22 11:51 ` Qing Wang
@ 2022-04-25 13:19   ` Jonathan Cameron
  -1 siblings, 0 replies; 24+ messages in thread
From: Jonathan Cameron @ 2022-04-25 13:19 UTC (permalink / raw)
  To: Qing Wang
  Cc: Catalin Marinas, Will Deacon, Sudeep Holla, Greg Kroah-Hartman,
	Rafael J. Wysocki, linux-arm-kernel, linux-kernel,
	vincent.guittot, dietmar.eggemann

On Fri, 22 Apr 2022 04:51:24 -0700
Qing Wang <wangqing@vivo.com> wrote:

> From: Wang Qing <wangqing@vivo.com>
> 
> The DSU cluster supports blocks that are called complexes
> which contain up to two cores of the same type and some shared logic,
> which sharing some logic between the cores can make a complex area efficient.
>

Given the complex shares things like the SVE units (cortex a510)...

Why not handle this as SMT?

Seems like a blurred boundary between separate cores and SMT threads.
I think we need to express and potentially take advantage of knowledge
about what logic is being shared.

Jonathan

 
> Complex also can be considered as a shared cache group smaller
> than cluster.
> 
> This patch adds complex level for complexs by parsing cache topology
> form DT. It will directly benefit a lot of workload which loves more
> resources such as memory bandwidth, caches.
> 
> Note this patch only handle the DT case.
> 
> V2:
> fix commit log and loop more
> 
> wangqing (2):
>   arch_topology: support for describing cache topology from DT
>   arm64: Add complex scheduler level for arm64
> 
>  arch/arm64/Kconfig            | 13 ++++++++++
>  arch/arm64/kernel/smp.c       | 48 ++++++++++++++++++++++++++++++++++-
>  drivers/base/arch_topology.c  | 47 +++++++++++++++++++++++++++++++++-
>  include/linux/arch_topology.h |  3 +++
>  4 files changed, 109 insertions(+), 2 deletions(-)
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V2 0/2] Add complex scheduler level for arm64
@ 2022-04-25 13:19   ` Jonathan Cameron
  0 siblings, 0 replies; 24+ messages in thread
From: Jonathan Cameron @ 2022-04-25 13:19 UTC (permalink / raw)
  To: Qing Wang
  Cc: Catalin Marinas, Will Deacon, Sudeep Holla, Greg Kroah-Hartman,
	Rafael J. Wysocki, linux-arm-kernel, linux-kernel,
	vincent.guittot, dietmar.eggemann

On Fri, 22 Apr 2022 04:51:24 -0700
Qing Wang <wangqing@vivo.com> wrote:

> From: Wang Qing <wangqing@vivo.com>
> 
> The DSU cluster supports blocks that are called complexes
> which contain up to two cores of the same type and some shared logic,
> which sharing some logic between the cores can make a complex area efficient.
>

Given the complex shares things like the SVE units (cortex a510)...

Why not handle this as SMT?

Seems like a blurred boundary between separate cores and SMT threads.
I think we need to express and potentially take advantage of knowledge
about what logic is being shared.

Jonathan

 
> Complex also can be considered as a shared cache group smaller
> than cluster.
> 
> This patch adds complex level for complexs by parsing cache topology
> form DT. It will directly benefit a lot of workload which loves more
> resources such as memory bandwidth, caches.
> 
> Note this patch only handle the DT case.
> 
> V2:
> fix commit log and loop more
> 
> wangqing (2):
>   arch_topology: support for describing cache topology from DT
>   arm64: Add complex scheduler level for arm64
> 
>  arch/arm64/Kconfig            | 13 ++++++++++
>  arch/arm64/kernel/smp.c       | 48 ++++++++++++++++++++++++++++++++++-
>  drivers/base/arch_topology.c  | 47 +++++++++++++++++++++++++++++++++-
>  include/linux/arch_topology.h |  3 +++
>  4 files changed, 109 insertions(+), 2 deletions(-)
> 


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH V2 0/2] Add complex scheduler level for arm64
  2022-04-25 13:19   ` Jonathan Cameron
@ 2022-04-26  3:06     ` 王擎
  -1 siblings, 0 replies; 24+ messages in thread
From: 王擎 @ 2022-04-26  3:06 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Catalin Marinas, Will Deacon, Sudeep Holla, Greg Kroah-Hartman,
	Rafael J. Wysocki, linux-arm-kernel, linux-kernel,
	vincent.guittot, dietmar.eggemann


>> From: Wang Qing <wangqing@vivo.com>
>> 
>> The DSU cluster supports blocks that are called complexes
>> which contain up to two cores of the same type and some shared logic,
>> which sharing some logic between the cores can make a complex area efficient.
>>
>
>Given the complex shares things like the SVE units (cortex a510)...
>
>Why not handle this as SMT?

SMT should share all cache levels. but complexs only share L2(and L3)
cache here.

>
>Seems like a blurred boundary between separate cores and SMT threads.
>I think we need to express and potentially take advantage of knowledge
>about what logic is being shared.

Logic such as a Vector Processing Unit, L2 Translation Lookaside Buffer
(TLB) ... are shared, I believe this will improve efficiency even if
only L2 cache is shared.

Thanks,
Qing

>
>Jonathan
>
>> Complex also can be considered as a shared cache group smaller
>> than cluster.
>> 
>> This patch adds complex level for complexs by parsing cache topology
>> form DT. It will directly benefit a lot of workload which loves more
>> resources such as memory bandwidth, caches.
>> 
>> Note this patch only handle the DT case.
>> 
>> V2:
>> fix commit log and loop more
>> 
>> wangqing (2):
>>   arch_topology: support for describing cache topology from DT
>>   arm64: Add complex scheduler level for arm64
>> 
>>  arch/arm64/Kconfig            | 13 ++++++++++
>>  arch/arm64/kernel/smp.c       | 48 ++++++++++++++++++++++++++++++++++-
>>  drivers/base/arch_topology.c  | 47 +++++++++++++++++++++++++++++++++-
>>  include/linux/arch_topology.h |  3 +++
>>  4 files changed, 109 insertions(+), 2 deletions(-)
>> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH V2 0/2] Add complex scheduler level for arm64
@ 2022-04-26  3:06     ` 王擎
  0 siblings, 0 replies; 24+ messages in thread
From: 王擎 @ 2022-04-26  3:06 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Catalin Marinas, Will Deacon, Sudeep Holla, Greg Kroah-Hartman,
	Rafael J. Wysocki, linux-arm-kernel, linux-kernel,
	vincent.guittot, dietmar.eggemann


>> From: Wang Qing <wangqing@vivo.com>
>> 
>> The DSU cluster supports blocks that are called complexes
>> which contain up to two cores of the same type and some shared logic,
>> which sharing some logic between the cores can make a complex area efficient.
>>
>
>Given the complex shares things like the SVE units (cortex a510)...
>
>Why not handle this as SMT?

SMT should share all cache levels. but complexs only share L2(and L3)
cache here.

>
>Seems like a blurred boundary between separate cores and SMT threads.
>I think we need to express and potentially take advantage of knowledge
>about what logic is being shared.

Logic such as a Vector Processing Unit, L2 Translation Lookaside Buffer
(TLB) ... are shared, I believe this will improve efficiency even if
only L2 cache is shared.

Thanks,
Qing

>
>Jonathan
>
>> Complex also can be considered as a shared cache group smaller
>> than cluster.
>> 
>> This patch adds complex level for complexs by parsing cache topology
>> form DT. It will directly benefit a lot of workload which loves more
>> resources such as memory bandwidth, caches.
>> 
>> Note this patch only handle the DT case.
>> 
>> V2:
>> fix commit log and loop more
>> 
>> wangqing (2):
>>   arch_topology: support for describing cache topology from DT
>>   arm64: Add complex scheduler level for arm64
>> 
>>  arch/arm64/Kconfig            | 13 ++++++++++
>>  arch/arm64/kernel/smp.c       | 48 ++++++++++++++++++++++++++++++++++-
>>  drivers/base/arch_topology.c  | 47 +++++++++++++++++++++++++++++++++-
>>  include/linux/arch_topology.h |  3 +++
>>  4 files changed, 109 insertions(+), 2 deletions(-)
>> 
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH V2 0/2] Add complex scheduler level for arm64
  2022-04-26  3:06     ` 王擎
@ 2022-04-26  7:05       ` 王擎
  -1 siblings, 0 replies; 24+ messages in thread
From: 王擎 @ 2022-04-26  7:05 UTC (permalink / raw)
  To: Sudeep Holla
  Cc: Catalin Marinas, Will Deacon, Greg Kroah-Hartman,
	Rafael J. Wysocki, linux-arm-kernel, linux-kernel,
	vincent.guittot, dietmar.eggemann, Jonathan Cameron


Hi Sudeep:

I am thinking if it is possible to add a complex level cpu topology
between cluster and SMT?

We can describe it directly in “cpu-map”, instead of parsing it through
the cache info.

Thanks,
Qing

>>> From: Wang Qing <wangqing@vivo.com>
>>> 
>>> The DSU cluster supports blocks that are called complexes
>>> which contain up to two cores of the same type and some shared logic,
>>> which sharing some logic between the cores can make a complex area efficient.
>>>
>>
>>Given the complex shares things like the SVE units (cortex a510)...
>>
>>Why not handle this as SMT?
>
>SMT should share all cache levels. but complexs only share L2(and L3)
>cache here.
>
>>
>>Seems like a blurred boundary between separate cores and SMT threads.
>>I think we need to express and potentially take advantage of knowledge
>>about what logic is being shared.
>
>Logic such as a Vector Processing Unit, L2 Translation Lookaside Buffer
>(TLB) ... are shared, I believe this will improve efficiency even if
>only L2 cache is shared.
>
>Thanks,
>Qing
>
>>
>>Jonathan
>>
>>> Complex also can be considered as a shared cache group smaller
>>> than cluster.
>>> 
>>> This patch adds complex level for complexs by parsing cache topology
>>> form DT. It will directly benefit a lot of workload which loves more
>>> resources such as memory bandwidth, caches.
>>> 
>>> Note this patch only handle the DT case.
>>> 
>>> V2:
>>> fix commit log and loop more
>>> 
>>> wangqing (2):
>>>   arch_topology: support for describing cache topology from DT
>>>   arm64: Add complex scheduler level for arm64
>>> 
>>>  arch/arm64/Kconfig            | 13 ++++++++++
>>>  arch/arm64/kernel/smp.c       | 48 ++++++++++++++++++++++++++++++++++-
>>>  drivers/base/arch_topology.c  | 47 +++++++++++++++++++++++++++++++++-
>>>  include/linux/arch_topology.h |  3 +++
>>>  4 files changed, 109 insertions(+), 2 deletions(-)
>>> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH V2 0/2] Add complex scheduler level for arm64
@ 2022-04-26  7:05       ` 王擎
  0 siblings, 0 replies; 24+ messages in thread
From: 王擎 @ 2022-04-26  7:05 UTC (permalink / raw)
  To: Sudeep Holla
  Cc: Catalin Marinas, Will Deacon, Greg Kroah-Hartman,
	Rafael J. Wysocki, linux-arm-kernel, linux-kernel,
	vincent.guittot, dietmar.eggemann, Jonathan Cameron


Hi Sudeep:

I am thinking if it is possible to add a complex level cpu topology
between cluster and SMT?

We can describe it directly in “cpu-map”, instead of parsing it through
the cache info.

Thanks,
Qing

>>> From: Wang Qing <wangqing@vivo.com>
>>> 
>>> The DSU cluster supports blocks that are called complexes
>>> which contain up to two cores of the same type and some shared logic,
>>> which sharing some logic between the cores can make a complex area efficient.
>>>
>>
>>Given the complex shares things like the SVE units (cortex a510)...
>>
>>Why not handle this as SMT?
>
>SMT should share all cache levels. but complexs only share L2(and L3)
>cache here.
>
>>
>>Seems like a blurred boundary between separate cores and SMT threads.
>>I think we need to express and potentially take advantage of knowledge
>>about what logic is being shared.
>
>Logic such as a Vector Processing Unit, L2 Translation Lookaside Buffer
>(TLB) ... are shared, I believe this will improve efficiency even if
>only L2 cache is shared.
>
>Thanks,
>Qing
>
>>
>>Jonathan
>>
>>> Complex also can be considered as a shared cache group smaller
>>> than cluster.
>>> 
>>> This patch adds complex level for complexs by parsing cache topology
>>> form DT. It will directly benefit a lot of workload which loves more
>>> resources such as memory bandwidth, caches.
>>> 
>>> Note this patch only handle the DT case.
>>> 
>>> V2:
>>> fix commit log and loop more
>>> 
>>> wangqing (2):
>>>   arch_topology: support for describing cache topology from DT
>>>   arm64: Add complex scheduler level for arm64
>>> 
>>>  arch/arm64/Kconfig            | 13 ++++++++++
>>>  arch/arm64/kernel/smp.c       | 48 ++++++++++++++++++++++++++++++++++-
>>>  drivers/base/arch_topology.c  | 47 +++++++++++++++++++++++++++++++++-
>>>  include/linux/arch_topology.h |  3 +++
>>>  4 files changed, 109 insertions(+), 2 deletions(-)
>>> 
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V2 0/2] Add complex scheduler level for arm64
  2022-04-26  7:05       ` 王擎
@ 2022-04-26 13:27         ` Sudeep Holla
  -1 siblings, 0 replies; 24+ messages in thread
From: Sudeep Holla @ 2022-04-26 13:27 UTC (permalink / raw)
  To: 王擎
  Cc: Catalin Marinas, Sudeep Holla, Will Deacon, Greg Kroah-Hartman,
	Rafael J. Wysocki, linux-arm-kernel, linux-kernel,
	vincent.guittot, dietmar.eggemann, Jonathan Cameron

On Tue, Apr 26, 2022 at 07:05:15AM +0000, 王擎 wrote:
> 
> Hi Sudeep:
> 
> I am thinking if it is possible to add a complex level cpu topology
> between cluster and SMT?
> 
> We can describe it directly in “cpu-map”, instead of parsing it through
> the cache info.
> 

I don't know or understand what you mean by that. Do you have a proposal
for DT bindings ? Please post the patch with details on the motivation for
the change to help us understand your proposal. I can't comment much without
seeing or understanding your proposal at this moment.

-- 
Regards,
Sudeep

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V2 0/2] Add complex scheduler level for arm64
@ 2022-04-26 13:27         ` Sudeep Holla
  0 siblings, 0 replies; 24+ messages in thread
From: Sudeep Holla @ 2022-04-26 13:27 UTC (permalink / raw)
  To: 王擎
  Cc: Catalin Marinas, Sudeep Holla, Will Deacon, Greg Kroah-Hartman,
	Rafael J. Wysocki, linux-arm-kernel, linux-kernel,
	vincent.guittot, dietmar.eggemann, Jonathan Cameron

On Tue, Apr 26, 2022 at 07:05:15AM +0000, 王擎 wrote:
> 
> Hi Sudeep:
> 
> I am thinking if it is possible to add a complex level cpu topology
> between cluster and SMT?
> 
> We can describe it directly in “cpu-map”, instead of parsing it through
> the cache info.
> 

I don't know or understand what you mean by that. Do you have a proposal
for DT bindings ? Please post the patch with details on the motivation for
the change to help us understand your proposal. I can't comment much without
seeing or understanding your proposal at this moment.

-- 
Regards,
Sudeep

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V2 2/2] arm64: Add complex scheduler level for arm64
  2022-04-22 11:51   ` Qing Wang
@ 2022-04-26 18:15     ` kernel test robot
  -1 siblings, 0 replies; 24+ messages in thread
From: kernel test robot @ 2022-04-26 18:15 UTC (permalink / raw)
  To: Qing Wang, Catalin Marinas, Will Deacon, Sudeep Holla,
	Greg Kroah-Hartman, Rafael J. Wysocki, linux-arm-kernel,
	linux-kernel
  Cc: llvm, kbuild-all, vincent.guittot, dietmar.eggemann, Wang Qing

Hi Qing,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on arm64/for-next/core]
[also build test ERROR on driver-core/driver-core-testing linus/master arm-perf/for-next/perf v5.18-rc4 next-20220422]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/intel-lab-lkp/linux/commits/Qing-Wang/Add-complex-scheduler-level-for-arm64/20220422-201107
base:   https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/core
config: arm64-allyesconfig (https://download.01.org/0day-ci/archive/20220427/202204270201.FkymxKhn-lkp@intel.com/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project 1cddcfdc3c683b393df1a5c9063252eb60e52818)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install arm64 cross compiling tool for clang build
        # apt-get install binutils-aarch64-linux-gnu
        # https://github.com/intel-lab-lkp/linux/commit/3b18155ccd99fb790e719fa432366dfdb97ab57c
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Qing-Wang/Add-complex-scheduler-level-for-arm64/20220422-201107
        git checkout 3b18155ccd99fb790e719fa432366dfdb97ab57c
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=arm64 SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> arch/arm64/kernel/smp.c:752:4: error: use of undeclared identifier 'arm64_complex_mask'
           { arm64_complex_mask, arm64_complex_flags, SD_INIT_NAME(CPL) },
             ^
>> arch/arm64/kernel/smp.c:752:24: error: use of undeclared identifier 'arm64_complex_flags'
           { arm64_complex_mask, arm64_complex_flags, SD_INIT_NAME(CPL) },
                                 ^
   2 errors generated.


vim +/arm64_complex_mask +752 arch/arm64/kernel/smp.c

   746	
   747	static struct sched_domain_topology_level arm64_topology[] = {
   748	#ifdef CONFIG_SCHED_SMT
   749		{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
   750	#endif
   751	#ifdef CONFIG_SCHED_COMPLEX
 > 752		{ arm64_complex_mask, arm64_complex_flags, SD_INIT_NAME(CPL) },
   753	#endif
   754	#ifdef CONFIG_SCHED_CLUSTER
   755		{ cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
   756	#endif
   757	#ifdef CONFIG_SCHED_MC
   758		{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
   759	#endif
   760		{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
   761		{ NULL, },
   762	};
   763	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH V2 2/2] arm64: Add complex scheduler level for arm64
@ 2022-04-26 18:15     ` kernel test robot
  0 siblings, 0 replies; 24+ messages in thread
From: kernel test robot @ 2022-04-26 18:15 UTC (permalink / raw)
  To: Qing Wang, Catalin Marinas, Will Deacon, Sudeep Holla,
	Greg Kroah-Hartman, Rafael J. Wysocki, linux-arm-kernel,
	linux-kernel
  Cc: llvm, kbuild-all, vincent.guittot, dietmar.eggemann, Wang Qing

Hi Qing,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on arm64/for-next/core]
[also build test ERROR on driver-core/driver-core-testing linus/master arm-perf/for-next/perf v5.18-rc4 next-20220422]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/intel-lab-lkp/linux/commits/Qing-Wang/Add-complex-scheduler-level-for-arm64/20220422-201107
base:   https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/core
config: arm64-allyesconfig (https://download.01.org/0day-ci/archive/20220427/202204270201.FkymxKhn-lkp@intel.com/config)
compiler: clang version 15.0.0 (https://github.com/llvm/llvm-project 1cddcfdc3c683b393df1a5c9063252eb60e52818)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install arm64 cross compiling tool for clang build
        # apt-get install binutils-aarch64-linux-gnu
        # https://github.com/intel-lab-lkp/linux/commit/3b18155ccd99fb790e719fa432366dfdb97ab57c
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Qing-Wang/Add-complex-scheduler-level-for-arm64/20220422-201107
        git checkout 3b18155ccd99fb790e719fa432366dfdb97ab57c
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=arm64 SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

>> arch/arm64/kernel/smp.c:752:4: error: use of undeclared identifier 'arm64_complex_mask'
           { arm64_complex_mask, arm64_complex_flags, SD_INIT_NAME(CPL) },
             ^
>> arch/arm64/kernel/smp.c:752:24: error: use of undeclared identifier 'arm64_complex_flags'
           { arm64_complex_mask, arm64_complex_flags, SD_INIT_NAME(CPL) },
                                 ^
   2 errors generated.


vim +/arm64_complex_mask +752 arch/arm64/kernel/smp.c

   746	
   747	static struct sched_domain_topology_level arm64_topology[] = {
   748	#ifdef CONFIG_SCHED_SMT
   749		{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
   750	#endif
   751	#ifdef CONFIG_SCHED_COMPLEX
 > 752		{ arm64_complex_mask, arm64_complex_flags, SD_INIT_NAME(CPL) },
   753	#endif
   754	#ifdef CONFIG_SCHED_CLUSTER
   755		{ cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
   756	#endif
   757	#ifdef CONFIG_SCHED_MC
   758		{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
   759	#endif
   760		{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
   761		{ NULL, },
   762	};
   763	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2022-04-26 18:17 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-22 11:51 [PATCH V2 0/2] Add complex scheduler level for arm64 Qing Wang
2022-04-22 11:51 ` Qing Wang
2022-04-22 11:51 ` [PATCH V2 1/2] arch_topology: support for parsing cache topology from DT Qing Wang
2022-04-22 11:51   ` Qing Wang
2022-04-22 12:30   ` Greg Kroah-Hartman
2022-04-22 12:30     ` Greg Kroah-Hartman
2022-04-24  2:53     ` [PATCH V2 RESEND " 王擎
2022-04-24  2:53       ` 王擎
2022-04-22 11:51 ` [PATCH V2 2/2] arm64: Add complex scheduler level for arm64 Qing Wang
2022-04-22 11:51   ` Qing Wang
2022-04-22 12:31   ` Greg Kroah-Hartman
2022-04-22 12:31     ` Greg Kroah-Hartman
2022-04-24  2:26     ` 王擎
2022-04-24  2:26       ` 王擎
2022-04-26 18:15   ` kernel test robot
2022-04-26 18:15     ` kernel test robot
2022-04-25 13:19 ` [PATCH V2 0/2] " Jonathan Cameron
2022-04-25 13:19   ` Jonathan Cameron
2022-04-26  3:06   ` 王擎
2022-04-26  3:06     ` 王擎
2022-04-26  7:05     ` 王擎
2022-04-26  7:05       ` 王擎
2022-04-26 13:27       ` Sudeep Holla
2022-04-26 13:27         ` Sudeep Holla

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.