* [PATCH 0/3] Represent cluster topology and enable load balance between clusters
@ 2021-08-20  1:30 ` Barry Song
  0 siblings, 0 replies; 16+ messages in thread
From: Barry Song @ 2021-08-20  1:30 UTC (permalink / raw)
  To: bp, catalin.marinas, dietmar.eggemann, gregkh, hpa, juri.lelli,
	bristot, lenb, mgorman, mingo, peterz, rjw, sudeep.holla, tglx
  Cc: aubrey.li, bsegall, guodong.xu, jonathan.cameron, liguozhu,
	linux-acpi, linux-arm-kernel, linux-kernel, mark.rutland,
	msys.mizuma, prime.zeng, rostedt, tim.c.chen, valentin.schneider,
	vincent.guittot, will, x86, xuwei5, yangyicong, linuxarm,
	Barry Song

From: Barry Song <song.bao.hua@hisilicon.com>

ARM64 machines like kunpeng920 and x86 machines like Jacobsville have a
level of hardware topology in which some CPU cores, typically 4 cores,
share L3 tags or an L2 cache.

That means spreading tasks between clusters brings more memory
bandwidth and less cache contention, while packing tasks within one
cluster can decrease the latency of cache synchronization.

We have three series to bring up a cluster-level scheduler in the kernel.
This is the first series.

1st series (this one): make the kernel aware of clusters, expose clusters
in the sysfs ABI, and add SCHED_CLUSTER, which load-balances among
clusters to benefit many workloads.
Testing shows this can hugely boost performance; for example, it
improves SPECrate mcf by 25.1% on Jacobsville and mcf by 13.574%
on kunpeng920.

2nd series (packing path): modify wake_affine so the kernel selects CPUs
within the cluster first, before scanning the whole LLC, so that we can
benefit from the lower latency of communication within one single cluster.
This series is much trickier, so we would like to send it after the 1st
series settles down. Prototype here:
https://op-lists.linaro.org/pipermail/linaro-open-discussions/2021-June/000219.html

3rd series: a sysctl from Tim Chen to permit users to enable or disable
the cluster scheduler at run time. Prototype here:
Add run time sysctl to enable/disable cluster scheduling
https://op-lists.linaro.org/pipermail/linaro-open-discussions/2021-July/000258.html

This series is rebased on Greg's driver-core-next, which contains the
updated topology sysfs ABI.

-V1:
 differences from RFC v6:
 * rebased on top of the latest update to the topology sysfs ABI in Greg's
   driver-core-next
 * removed the wake_affine path modification, which will be sent separately
   as the 2nd series
 * cluster_id is obtained by detecting a valid ID before falling back to
   using the table offset
 * lots of benchmark data from both x86 Jacobsville and ARM64 kunpeng920

-RFC v6:
https://lore.kernel.org/lkml/20210420001844.9116-1-song.bao.hua@hisilicon.com/

Barry Song (1):
  scheduler: Add cluster scheduler level in core and related Kconfig for
    ARM64

Jonathan Cameron (1):
  topology: Represent clusters of CPUs within a die

Tim Chen (1):
  scheduler: Add cluster scheduler level for x86

 .../ABI/stable/sysfs-devices-system-cpu       | 15 +++++
 Documentation/admin-guide/cputopology.rst     | 12 ++--
 arch/arm64/Kconfig                            |  7 ++
 arch/arm64/kernel/topology.c                  |  2 +
 arch/x86/Kconfig                              |  8 +++
 arch/x86/include/asm/smp.h                    |  7 ++
 arch/x86/include/asm/topology.h               |  3 +
 arch/x86/kernel/cpu/cacheinfo.c               |  1 +
 arch/x86/kernel/cpu/common.c                  |  3 +
 arch/x86/kernel/smpboot.c                     | 44 +++++++++++-
 drivers/acpi/pptt.c                           | 67 +++++++++++++++++++
 drivers/base/arch_topology.c                  | 14 ++++
 drivers/base/topology.c                       | 10 +++
 include/linux/acpi.h                          |  5 ++
 include/linux/arch_topology.h                 |  5 ++
 include/linux/sched/topology.h                |  7 ++
 include/linux/topology.h                      | 13 ++++
 kernel/sched/topology.c                       |  5 ++
 18 files changed, 223 insertions(+), 5 deletions(-)

-- 
2.25.1



* [PATCH 1/3] topology: Represent clusters of CPUs within a die
  2021-08-20  1:30 ` Barry Song
@ 2021-08-20  1:30   ` Barry Song
  -1 siblings, 0 replies; 16+ messages in thread
From: Barry Song @ 2021-08-20  1:30 UTC (permalink / raw)
  To: bp, catalin.marinas, dietmar.eggemann, gregkh, hpa, juri.lelli,
	bristot, lenb, mgorman, mingo, peterz, rjw, sudeep.holla, tglx
  Cc: aubrey.li, bsegall, guodong.xu, jonathan.cameron, liguozhu,
	linux-acpi, linux-arm-kernel, linux-kernel, mark.rutland,
	msys.mizuma, prime.zeng, rostedt, tim.c.chen, valentin.schneider,
	vincent.guittot, will, x86, xuwei5, yangyicong, linuxarm,
	Jonathan Cameron, Tian Tao, Barry Song

From: Jonathan Cameron <Jonathan.Cameron@huawei.com>

Both ACPI and DT provide the ability to describe additional layers of
topology between that of individual cores and higher level constructs
such as the level at which the last level cache is shared.
In ACPI this can be represented in PPTT as a Processor Hierarchy
Node Structure [1] that is the parent of the CPU cores and in turn
has a parent Processor Hierarchy Node Structure representing
a higher level of topology.

For example, Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each
cluster has 4 CPUs. All clusters share the L3 cache data, but each cluster
has its own local L3 tag. In addition, the clusters share some internal
system bus.

+-----------------------------------+                          +---------+
|  +------+    +------+            +---------------------------+         |
|  | CPU0 |    | cpu1 |             |    +-----------+         |         |
|  +------+    +------+             |    |           |         |         |
|                                   +----+    L3     |         |         |
|  +------+    +------+   cluster   |    |    tag    |         |         |
|  | CPU2 |    | CPU3 |             |    |           |         |         |
|  +------+    +------+             |    +-----------+         |         |
|                                   |                          |         |
+-----------------------------------+                          |         |
+-----------------------------------+                          |         |
|  +------+    +------+             +--------------------------+         |
|  |      |    |      |             |    +-----------+         |         |
|  +------+    +------+             |    |           |         |         |
|                                   |    |    L3     |         |         |
|  +------+    +------+             +----+    tag    |         |         |
|  |      |    |      |             |    |           |         |         |
|  +------+    +------+             |    +-----------+         |         |
|                                   |                          |         |
+-----------------------------------+                          |   L3    |
                                                               |   data  |
+-----------------------------------+                          |         |
|  +------+    +------+             |    +-----------+         |         |
|  |      |    |      |             |    |           |         |         |
|  +------+    +------+             +----+    L3     |         |         |
|                                   |    |    tag    |         |         |
|  +------+    +------+             |    |           |         |         |
|  |      |    |      |            ++    +-----------+         |         |
|  +------+    +------+            |---------------------------+         |
+-----------------------------------|                          |         |
+-----------------------------------|                          |         |
|  +------+    +------+            +---------------------------+         |
|  |      |    |      |             |    +-----------+         |         |
|  +------+    +------+             |    |           |         |         |
|                                   +----+    L3     |         |         |
|  +------+    +------+             |    |    tag    |         |         |
|  |      |    |      |             |    |           |         |         |
|  +------+    +------+             |    +-----------+         |         |
|                                   |                          |         |
+-----------------------------------+                          |         |
+-----------------------------------+                          |         |
|  +------+    +------+             +--------------------------+         |
|  |      |    |      |             |   +-----------+          |         |
|  +------+    +------+             |   |           |          |         |
|                                   |   |    L3     |          |         |
|  +------+    +------+             +---+    tag    |          |         |
|  |      |    |      |             |   |           |          |         |
|  +------+    +------+             |   +-----------+          |         |
|                                   |                          |         |
+-----------------------------------+                          |         |
+-----------------------------------+                         ++         |
|  +------+    +------+             +--------------------------+         |
|  |      |    |      |             |  +-----------+           |         |
|  +------+    +------+             |  |           |           |         |
|                                   |  |    L3     |           |         |
|  +------+    +------+             +--+    tag    |           |         |
|  |      |    |      |             |  |           |           |         |
|  +------+    +------+             |  +-----------+           |         |
|                                   |                          +---------+
+-----------------------------------+

That means spreading tasks among clusters will bring more bandwidth,
while packing tasks within one cluster will lead to smaller cache
synchronization latency. So both the kernel and userspace have a
chance to leverage this topology and place tasks accordingly, achieving
either lower cache latency within one cluster or an even distribution
of load among clusters for higher throughput.

This patch exposes the cluster topology to both kernel and userspace.
Libraries like hwloc will learn about clusters from cluster_cpus and
related sysfs attributes. PoC of hwloc support at [2].

Note this patch only handles the ACPI case.

Special consideration is needed for SMT processors, where it is
necessary to move 2 levels up the hierarchy from the leaf nodes
(thus skipping the processor core level).

Note that arm64 / ACPI does not provide any means of identifying
a die level in the topology, but that may be unrelated to the cluster
level.

[1] ACPI Specification 6.3 - section 5.2.29.1 processor hierarchy node
    structure (Type 0)
[2] https://github.com/hisilicon/hwloc/tree/linux-cluster

Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
---
 .../ABI/stable/sysfs-devices-system-cpu       | 15 +++++
 Documentation/admin-guide/cputopology.rst     | 12 ++--
 arch/arm64/kernel/topology.c                  |  2 +
 drivers/acpi/pptt.c                           | 67 +++++++++++++++++++
 drivers/base/arch_topology.c                  | 14 ++++
 drivers/base/topology.c                       | 10 +++
 include/linux/acpi.h                          |  5 ++
 include/linux/arch_topology.h                 |  5 ++
 include/linux/topology.h                      |  6 ++
 9 files changed, 132 insertions(+), 4 deletions(-)

diff --git a/Documentation/ABI/stable/sysfs-devices-system-cpu b/Documentation/ABI/stable/sysfs-devices-system-cpu
index 516dafea03eb..3965ce504484 100644
--- a/Documentation/ABI/stable/sysfs-devices-system-cpu
+++ b/Documentation/ABI/stable/sysfs-devices-system-cpu
@@ -42,6 +42,12 @@ Description:    the CPU core ID of cpuX. Typically it is the hardware platform's
                 architecture and platform dependent.
 Values:         integer
 
+What:           /sys/devices/system/cpu/cpuX/topology/cluster_id
+Description:    the cluster ID of cpuX.  Typically it is the hardware platform's
+                identifier (rather than the kernel's). The actual value is
+                architecture and platform dependent.
+Values:         integer
+
 What:           /sys/devices/system/cpu/cpuX/topology/book_id
 Description:    the book ID of cpuX. Typically it is the hardware platform's
                 identifier (rather than the kernel's). The actual value is
@@ -85,6 +91,15 @@ Description:    human-readable list of CPUs within the same die.
                 The format is like 0-3, 8-11, 14,17.
 Values:         decimal list.
 
+What:           /sys/devices/system/cpu/cpuX/topology/cluster_cpus
+Description:    internal kernel map of CPUs within the same cluster.
+Values:         hexadecimal bitmask.
+
+What:           /sys/devices/system/cpu/cpuX/topology/cluster_cpus_list
+Description:    human-readable list of CPUs within the same cluster.
+                The format is like 0-3, 8-11, 14,17.
+Values:         decimal list.
+
 What:           /sys/devices/system/cpu/cpuX/topology/book_siblings
 Description:    internal kernel map of cpuX's hardware threads within the same
                 book_id. it's only used on s390.
diff --git a/Documentation/admin-guide/cputopology.rst b/Documentation/admin-guide/cputopology.rst
index 8632a1db36e4..a5491949880d 100644
--- a/Documentation/admin-guide/cputopology.rst
+++ b/Documentation/admin-guide/cputopology.rst
@@ -19,11 +19,13 @@ these macros in include/asm-XXX/topology.h::
 
 	#define topology_physical_package_id(cpu)
 	#define topology_die_id(cpu)
+	#define topology_cluster_id(cpu)
 	#define topology_core_id(cpu)
 	#define topology_book_id(cpu)
 	#define topology_drawer_id(cpu)
 	#define topology_sibling_cpumask(cpu)
 	#define topology_core_cpumask(cpu)
+	#define topology_cluster_cpumask(cpu)
 	#define topology_die_cpumask(cpu)
 	#define topology_book_cpumask(cpu)
 	#define topology_drawer_cpumask(cpu)
@@ -39,10 +41,12 @@ not defined by include/asm-XXX/topology.h:
 
 1) topology_physical_package_id: -1
 2) topology_die_id: -1
-3) topology_core_id: 0
-4) topology_sibling_cpumask: just the given CPU
-5) topology_core_cpumask: just the given CPU
-6) topology_die_cpumask: just the given CPU
+3) topology_cluster_id: -1
+4) topology_core_id: 0
+5) topology_sibling_cpumask: just the given CPU
+6) topology_core_cpumask: just the given CPU
+7) topology_cluster_cpumask: just the given CPU
+8) topology_die_cpumask: just the given CPU
 
 For architectures that don't support books (CONFIG_SCHED_BOOK) there are no
 default definitions for topology_book_id() and topology_book_cpumask().
diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index 4dd14a6620c1..9ab78ad826e2 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -103,6 +103,8 @@ int __init parse_acpi_topology(void)
 			cpu_topology[cpu].thread_id  = -1;
 			cpu_topology[cpu].core_id    = topology_id;
 		}
+		topology_id = find_acpi_cpu_topology_cluster(cpu);
+		cpu_topology[cpu].cluster_id = topology_id;
 		topology_id = find_acpi_cpu_topology_package(cpu);
 		cpu_topology[cpu].package_id = topology_id;
 
diff --git a/drivers/acpi/pptt.c b/drivers/acpi/pptt.c
index fe69dc518f31..701f61c01359 100644
--- a/drivers/acpi/pptt.c
+++ b/drivers/acpi/pptt.c
@@ -746,6 +746,73 @@ int find_acpi_cpu_topology_package(unsigned int cpu)
 					  ACPI_PPTT_PHYSICAL_PACKAGE);
 }
 
+/**
+ * find_acpi_cpu_topology_cluster() - Determine a unique CPU cluster value
+ * @cpu: Kernel logical CPU number
+ *
+ * Determine a topology unique cluster ID for the given CPU/thread.
+ * This ID can then be used to group peers, which will have matching ids.
+ *
+ * The cluster, if present, is the level of topology above CPUs. In a
+ * multi-thread CPU, it will be the level above the CPU, not the thread.
+ * It may not exist in single CPU systems. In simple multi-CPU systems,
+ * it may be equal to the package topology level.
+ *
+ * Return: -ENOENT if the PPTT doesn't exist, the CPU cannot be found,
+ * or there is no topology level above the CPU.
+ * Otherwise returns a value which represents the cluster for this CPU.
+ */
+
+int find_acpi_cpu_topology_cluster(unsigned int cpu)
+{
+	struct acpi_table_header *table;
+	acpi_status status;
+	struct acpi_pptt_processor *cpu_node, *cluster_node;
+	u32 acpi_cpu_id;
+	int retval;
+	int is_thread;
+
+	status = acpi_get_table(ACPI_SIG_PPTT, 0, &table);
+	if (ACPI_FAILURE(status)) {
+		acpi_pptt_warn_missing();
+		return -ENOENT;
+	}
+
+	acpi_cpu_id = get_acpi_id_for_cpu(cpu);
+	cpu_node = acpi_find_processor_node(table, acpi_cpu_id);
+	if (cpu_node == NULL || !cpu_node->parent) {
+		retval = -ENOENT;
+		goto put_table;
+	}
+
+	is_thread = cpu_node->flags & ACPI_PPTT_ACPI_PROCESSOR_IS_THREAD;
+	cluster_node = fetch_pptt_node(table, cpu_node->parent);
+	if (cluster_node == NULL) {
+		retval = -ENOENT;
+		goto put_table;
+	}
+	if (is_thread) {
+		if (!cluster_node->parent) {
+			retval = -ENOENT;
+			goto put_table;
+		}
+		cluster_node = fetch_pptt_node(table, cluster_node->parent);
+		if (cluster_node == NULL) {
+			retval = -ENOENT;
+			goto put_table;
+		}
+	}
+	if (cluster_node->flags & ACPI_PPTT_ACPI_PROCESSOR_ID_VALID)
+		retval = cluster_node->acpi_processor_id;
+	else
+		retval = ACPI_PTR_DIFF(cluster_node, table);
+
+put_table:
+	acpi_put_table(table);
+
+	return retval;
+}
+
 /**
  * find_acpi_cpu_topology_hetero_id() - Get a core architecture tag
  * @cpu: Kernel logical CPU number
diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 921312a8d957..5b1589adacaf 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -598,6 +598,11 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
 	return core_mask;
 }
 
+const struct cpumask *cpu_clustergroup_mask(int cpu)
+{
+	return &cpu_topology[cpu].cluster_sibling;
+}
+
 void update_siblings_masks(unsigned int cpuid)
 {
 	struct cpu_topology *cpu_topo, *cpuid_topo = &cpu_topology[cpuid];
@@ -615,6 +620,11 @@ void update_siblings_masks(unsigned int cpuid)
 		if (cpuid_topo->package_id != cpu_topo->package_id)
 			continue;
 
+		if (cpuid_topo->cluster_id == cpu_topo->cluster_id) {
+			cpumask_set_cpu(cpu, &cpuid_topo->cluster_sibling);
+			cpumask_set_cpu(cpuid, &cpu_topo->cluster_sibling);
+		}
+
 		cpumask_set_cpu(cpuid, &cpu_topo->core_sibling);
 		cpumask_set_cpu(cpu, &cpuid_topo->core_sibling);
 
@@ -633,6 +643,9 @@ static void clear_cpu_topology(int cpu)
 	cpumask_clear(&cpu_topo->llc_sibling);
 	cpumask_set_cpu(cpu, &cpu_topo->llc_sibling);
 
+	cpumask_clear(&cpu_topo->cluster_sibling);
+	cpumask_set_cpu(cpu, &cpu_topo->cluster_sibling);
+
 	cpumask_clear(&cpu_topo->core_sibling);
 	cpumask_set_cpu(cpu, &cpu_topo->core_sibling);
 	cpumask_clear(&cpu_topo->thread_sibling);
@@ -648,6 +661,7 @@ void __init reset_cpu_topology(void)
 
 		cpu_topo->thread_id = -1;
 		cpu_topo->core_id = -1;
+		cpu_topo->cluster_id = -1;
 		cpu_topo->package_id = -1;
 		cpu_topo->llc_id = -1;
 
diff --git a/drivers/base/topology.c b/drivers/base/topology.c
index 43c0940643f5..8f2b641d0b8c 100644
--- a/drivers/base/topology.c
+++ b/drivers/base/topology.c
@@ -48,6 +48,9 @@ static DEVICE_ATTR_RO(physical_package_id);
 define_id_show_func(die_id);
 static DEVICE_ATTR_RO(die_id);
 
+define_id_show_func(cluster_id);
+static DEVICE_ATTR_RO(cluster_id);
+
 define_id_show_func(core_id);
 static DEVICE_ATTR_RO(core_id);
 
@@ -63,6 +66,10 @@ define_siblings_read_func(core_siblings, core_cpumask);
 static BIN_ATTR_RO(core_siblings, 0);
 static BIN_ATTR_RO(core_siblings_list, 0);
 
+define_siblings_read_func(cluster_cpus, cluster_cpumask);
+static BIN_ATTR_RO(cluster_cpus, 0);
+static BIN_ATTR_RO(cluster_cpus_list, 0);
+
 define_siblings_read_func(die_cpus, die_cpumask);
 static BIN_ATTR_RO(die_cpus, 0);
 static BIN_ATTR_RO(die_cpus_list, 0);
@@ -94,6 +101,8 @@ static struct bin_attribute *bin_attrs[] = {
 	&bin_attr_thread_siblings_list,
 	&bin_attr_core_siblings,
 	&bin_attr_core_siblings_list,
+	&bin_attr_cluster_cpus,
+	&bin_attr_cluster_cpus_list,
 	&bin_attr_die_cpus,
 	&bin_attr_die_cpus_list,
 	&bin_attr_package_cpus,
@@ -112,6 +121,7 @@ static struct bin_attribute *bin_attrs[] = {
 static struct attribute *default_attrs[] = {
 	&dev_attr_physical_package_id.attr,
 	&dev_attr_die_id.attr,
+	&dev_attr_cluster_id.attr,
 	&dev_attr_core_id.attr,
 #ifdef CONFIG_SCHED_BOOK
 	&dev_attr_book_id.attr,
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 72e4f7fd268c..6d65427e5f67 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -1353,6 +1353,7 @@ static inline int lpit_read_residency_count_address(u64 *address)
 #ifdef CONFIG_ACPI_PPTT
 int acpi_pptt_cpu_is_thread(unsigned int cpu);
 int find_acpi_cpu_topology(unsigned int cpu, int level);
+int find_acpi_cpu_topology_cluster(unsigned int cpu);
 int find_acpi_cpu_topology_package(unsigned int cpu);
 int find_acpi_cpu_topology_hetero_id(unsigned int cpu);
 int find_acpi_cpu_cache_topology(unsigned int cpu, int level);
@@ -1365,6 +1366,10 @@ static inline int find_acpi_cpu_topology(unsigned int cpu, int level)
 {
 	return -EINVAL;
 }
+static inline int find_acpi_cpu_topology_cluster(unsigned int cpu)
+{
+	return -EINVAL;
+}
 static inline int find_acpi_cpu_topology_package(unsigned int cpu)
 {
 	return -EINVAL;
diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
index f180240dc95f..b97cea83b25e 100644
--- a/include/linux/arch_topology.h
+++ b/include/linux/arch_topology.h
@@ -62,10 +62,12 @@ void topology_set_thermal_pressure(const struct cpumask *cpus,
 struct cpu_topology {
 	int thread_id;
 	int core_id;
+	int cluster_id;
 	int package_id;
 	int llc_id;
 	cpumask_t thread_sibling;
 	cpumask_t core_sibling;
+	cpumask_t cluster_sibling;
 	cpumask_t llc_sibling;
 };
 
@@ -73,13 +75,16 @@ struct cpu_topology {
 extern struct cpu_topology cpu_topology[NR_CPUS];
 
 #define topology_physical_package_id(cpu)	(cpu_topology[cpu].package_id)
+#define topology_cluster_id(cpu)	(cpu_topology[cpu].cluster_id)
 #define topology_core_id(cpu)		(cpu_topology[cpu].core_id)
 #define topology_core_cpumask(cpu)	(&cpu_topology[cpu].core_sibling)
 #define topology_sibling_cpumask(cpu)	(&cpu_topology[cpu].thread_sibling)
+#define topology_cluster_cpumask(cpu)	(&cpu_topology[cpu].cluster_sibling)
 #define topology_llc_cpumask(cpu)	(&cpu_topology[cpu].llc_sibling)
 void init_cpu_topology(void);
 void store_cpu_topology(unsigned int cpuid);
 const struct cpumask *cpu_coregroup_mask(int cpu);
+const struct cpumask *cpu_clustergroup_mask(int cpu);
 void update_siblings_masks(unsigned int cpu);
 void remove_cpu_topology(unsigned int cpuid);
 void reset_cpu_topology(void);
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 7634cd737061..80d27d717631 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -186,6 +186,9 @@ static inline int cpu_to_mem(int cpu)
 #ifndef topology_die_id
 #define topology_die_id(cpu)			((void)(cpu), -1)
 #endif
+#ifndef topology_cluster_id
+#define topology_cluster_id(cpu)		((void)(cpu), -1)
+#endif
 #ifndef topology_core_id
 #define topology_core_id(cpu)			((void)(cpu), 0)
 #endif
@@ -195,6 +198,9 @@ static inline int cpu_to_mem(int cpu)
 #ifndef topology_core_cpumask
 #define topology_core_cpumask(cpu)		cpumask_of(cpu)
 #endif
+#ifndef topology_cluster_cpumask
+#define topology_cluster_cpumask(cpu)		cpumask_of(cpu)
+#endif
 #ifndef topology_die_cpumask
 #define topology_die_cpumask(cpu)		cpumask_of(cpu)
 #endif
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 1/3] topology: Represent clusters of CPUs within a die
@ 2021-08-20  1:30   ` Barry Song
  0 siblings, 0 replies; 16+ messages in thread
From: Barry Song @ 2021-08-20  1:30 UTC (permalink / raw)
  To: bp, catalin.marinas, dietmar.eggemann, gregkh, hpa, juri.lelli,
	bristot, lenb, mgorman, mingo, peterz, rjw, sudeep.holla, tglx
  Cc: aubrey.li, bsegall, guodong.xu, jonathan.cameron, liguozhu,
	linux-acpi, linux-arm-kernel, linux-kernel, mark.rutland,
	msys.mizuma, prime.zeng, rostedt, tim.c.chen, valentin.schneider,
	vincent.guittot, will, x86, xuwei5, yangyicong, linuxarm,
	Jonathan Cameron, Tian Tao, Barry Song

From: Jonathan Cameron <Jonathan.Cameron@huawei.com>

Both ACPI and DT provide the ability to describe additional layers of
topology between that of individual cores and higher level constructs
such as the level at which the last level cache is shared.
In ACPI this can be represented in PPTT as a Processor Hierarchy
Node Structure [1] that is the parent of the CPU cores and in turn
has a parent Processor Hierarchy Nodes Structure representing
a higher level of topology.

For example Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each
cluster has 4 cpus. All clusters share L3 cache data, but each cluster
has local L3 tag. On the other hand, each clusters will share some
internal system bus.

+-----------------------------------+                          +---------+
|  +------+    +------+            +---------------------------+         |
|  | CPU0 |    | cpu1 |             |    +-----------+         |         |
|  +------+    +------+             |    |           |         |         |
|                                   +----+    L3     |         |         |
|  +------+    +------+   cluster   |    |    tag    |         |         |
|  | CPU2 |    | CPU3 |             |    |           |         |         |
|  +------+    +------+             |    +-----------+         |         |
|                                   |                          |         |
+-----------------------------------+                          |         |
+-----------------------------------+                          |         |
|  +------+    +------+             +--------------------------+         |
|  |      |    |      |             |    +-----------+         |         |
|  +------+    +------+             |    |           |         |         |
|                                   |    |    L3     |         |         |
|  +------+    +------+             +----+    tag    |         |         |
|  |      |    |      |             |    |           |         |         |
|  +------+    +------+             |    +-----------+         |         |
|                                   |                          |         |
+-----------------------------------+                          |   L3    |
                                                               |   data  |
+-----------------------------------+                          |         |
|  +------+    +------+             |    +-----------+         |         |
|  |      |    |      |             |    |           |         |         |
|  +------+    +------+             +----+    L3     |         |         |
|                                   |    |    tag    |         |         |
|  +------+    +------+             |    |           |         |         |
|  |      |    |      |            ++    +-----------+         |         |
|  +------+    +------+            |---------------------------+         |
+-----------------------------------|                          |         |
+-----------------------------------|                          |         |
|  +------+    +------+            +---------------------------+         |
|  |      |    |      |             |    +-----------+         |         |
|  +------+    +------+             |    |           |         |         |
|                                   +----+    L3     |         |         |
|  +------+    +------+             |    |    tag    |         |         |
|  |      |    |      |             |    |           |         |         |
|  +------+    +------+             |    +-----------+         |         |
|                                   |                          |         |
+-----------------------------------+                          |         |
+-----------------------------------+                          |         |
|  +------+    +------+             +--------------------------+         |
|  |      |    |      |             |   +-----------+          |         |
|  +------+    +------+             |   |           |          |         |
|                                   |   |    L3     |          |         |
|  +------+    +------+             +---+    tag    |          |         |
|  |      |    |      |             |   |           |          |         |
|  +------+    +------+             |   +-----------+          |         |
|                                   |                          |         |
+-----------------------------------+                          |         |
+-----------------------------------+                         ++         |
|  +------+    +------+             +--------------------------+         |
|  |      |    |      |             |  +-----------+           |         |
|  +------+    +------+             |  |           |           |         |
|                                   |  |    L3     |           |         |
|  +------+    +------+             +--+    tag    |           |         |
|  |      |    |      |             |  |           |           |         |
|  +------+    +------+             |  +-----------+           |         |
|                                   |                          +---------+
+-----------------------------------+

That means spreading tasks among clusters will bring more memory
bandwidth, while packing tasks within one cluster will lead to smaller
cache synchronization latency. So both the kernel and userspace have
a chance to leverage this topology and deploy tasks accordingly, to
achieve either smaller cache latency within one cluster or an even
distribution of load among clusters for higher throughput.

This patch exposes the cluster topology to both kernel and userspace.
Libraries like hwloc will learn about clusters from cluster_cpus and the
related sysfs attributes. A PoC of hwloc support is at [2].

Note this patch only handles the ACPI case.

Special consideration is needed for SMT processors, where it is
necessary to move 2 levels up the hierarchy from the leaf nodes
(thus skipping the processor core level).

Note that arm64 / ACPI does not provide any means of identifying
a die level in the topology, but that may be unrelated to the cluster
level.

[1] ACPI Specification 6.3 - section 5.2.29.1 processor hierarchy node
    structure (Type 0)
[2] https://github.com/hisilicon/hwloc/tree/linux-cluster

Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
---
 .../ABI/stable/sysfs-devices-system-cpu       | 15 +++++
 Documentation/admin-guide/cputopology.rst     | 12 ++--
 arch/arm64/kernel/topology.c                  |  2 +
 drivers/acpi/pptt.c                           | 67 +++++++++++++++++++
 drivers/base/arch_topology.c                  | 14 ++++
 drivers/base/topology.c                       | 10 +++
 include/linux/acpi.h                          |  5 ++
 include/linux/arch_topology.h                 |  5 ++
 include/linux/topology.h                      |  6 ++
 9 files changed, 132 insertions(+), 4 deletions(-)

diff --git a/Documentation/ABI/stable/sysfs-devices-system-cpu b/Documentation/ABI/stable/sysfs-devices-system-cpu
index 516dafea03eb..3965ce504484 100644
--- a/Documentation/ABI/stable/sysfs-devices-system-cpu
+++ b/Documentation/ABI/stable/sysfs-devices-system-cpu
@@ -42,6 +42,12 @@ Description:    the CPU core ID of cpuX. Typically it is the hardware platform's
                 architecture and platform dependent.
 Values:         integer
 
+What:           /sys/devices/system/cpu/cpuX/topology/cluster_id
+Description:    the cluster ID of cpuX.  Typically it is the hardware platform's
+                identifier (rather than the kernel's). The actual value is
+                architecture and platform dependent.
+Values:         integer
+
 What:           /sys/devices/system/cpu/cpuX/topology/book_id
 Description:    the book ID of cpuX. Typically it is the hardware platform's
                 identifier (rather than the kernel's). The actual value is
@@ -85,6 +91,15 @@ Description:    human-readable list of CPUs within the same die.
                 The format is like 0-3, 8-11, 14,17.
 Values:         decimal list.
 
+What:           /sys/devices/system/cpu/cpuX/topology/cluster_cpus
+Description:    internal kernel map of CPUs within the same cluster.
+Values:         hexadecimal bitmask.
+
+What:           /sys/devices/system/cpu/cpuX/topology/cluster_cpus_list
+Description:    human-readable list of CPUs within the same cluster.
+                The format is like 0-3, 8-11, 14,17.
+Values:         decimal list.
+
 What:           /sys/devices/system/cpu/cpuX/topology/book_siblings
 Description:    internal kernel map of cpuX's hardware threads within the same
                 book_id. it's only used on s390.
diff --git a/Documentation/admin-guide/cputopology.rst b/Documentation/admin-guide/cputopology.rst
index 8632a1db36e4..a5491949880d 100644
--- a/Documentation/admin-guide/cputopology.rst
+++ b/Documentation/admin-guide/cputopology.rst
@@ -19,11 +19,13 @@ these macros in include/asm-XXX/topology.h::
 
 	#define topology_physical_package_id(cpu)
 	#define topology_die_id(cpu)
+	#define topology_cluster_id(cpu)
 	#define topology_core_id(cpu)
 	#define topology_book_id(cpu)
 	#define topology_drawer_id(cpu)
 	#define topology_sibling_cpumask(cpu)
 	#define topology_core_cpumask(cpu)
+	#define topology_cluster_cpumask(cpu)
 	#define topology_die_cpumask(cpu)
 	#define topology_book_cpumask(cpu)
 	#define topology_drawer_cpumask(cpu)
@@ -39,10 +41,12 @@ not defined by include/asm-XXX/topology.h:
 
 1) topology_physical_package_id: -1
 2) topology_die_id: -1
-3) topology_core_id: 0
-4) topology_sibling_cpumask: just the given CPU
-5) topology_core_cpumask: just the given CPU
-6) topology_die_cpumask: just the given CPU
+3) topology_cluster_id: -1
+4) topology_core_id: 0
+5) topology_sibling_cpumask: just the given CPU
+6) topology_core_cpumask: just the given CPU
+7) topology_cluster_cpumask: just the given CPU
+8) topology_die_cpumask: just the given CPU
 
 For architectures that don't support books (CONFIG_SCHED_BOOK) there are no
 default definitions for topology_book_id() and topology_book_cpumask().
diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
index 4dd14a6620c1..9ab78ad826e2 100644
--- a/arch/arm64/kernel/topology.c
+++ b/arch/arm64/kernel/topology.c
@@ -103,6 +103,8 @@ int __init parse_acpi_topology(void)
 			cpu_topology[cpu].thread_id  = -1;
 			cpu_topology[cpu].core_id    = topology_id;
 		}
+		topology_id = find_acpi_cpu_topology_cluster(cpu);
+		cpu_topology[cpu].cluster_id = topology_id;
 		topology_id = find_acpi_cpu_topology_package(cpu);
 		cpu_topology[cpu].package_id = topology_id;
 
diff --git a/drivers/acpi/pptt.c b/drivers/acpi/pptt.c
index fe69dc518f31..701f61c01359 100644
--- a/drivers/acpi/pptt.c
+++ b/drivers/acpi/pptt.c
@@ -746,6 +746,73 @@ int find_acpi_cpu_topology_package(unsigned int cpu)
 					  ACPI_PPTT_PHYSICAL_PACKAGE);
 }
 
+/**
+ * find_acpi_cpu_topology_cluster() - Determine a unique CPU cluster value
+ * @cpu: Kernel logical CPU number
+ *
+ * Determine a topology unique cluster ID for the given CPU/thread.
+ * This ID can then be used to group peers, which will have matching ids.
+ *
+ * The cluster, if present, is the level of topology above CPUs. In a
+ * multi-thread CPU, it will be the level above the CPU, not the thread.
+ * It may not exist in single CPU systems. In simple multi-CPU systems,
+ * it may be equal to the package topology level.
+ *
+ * Return: -ENOENT if the PPTT doesn't exist, the CPU cannot be found
+ * or there is no topology level above the CPU.
+ * Otherwise returns a value which represents the cluster for this CPU.
+ */
+
+int find_acpi_cpu_topology_cluster(unsigned int cpu)
+{
+	struct acpi_table_header *table;
+	acpi_status status;
+	struct acpi_pptt_processor *cpu_node, *cluster_node;
+	u32 acpi_cpu_id;
+	int retval;
+	int is_thread;
+
+	status = acpi_get_table(ACPI_SIG_PPTT, 0, &table);
+	if (ACPI_FAILURE(status)) {
+		acpi_pptt_warn_missing();
+		return -ENOENT;
+	}
+
+	acpi_cpu_id = get_acpi_id_for_cpu(cpu);
+	cpu_node = acpi_find_processor_node(table, acpi_cpu_id);
+	if (cpu_node == NULL || !cpu_node->parent) {
+		retval = -ENOENT;
+		goto put_table;
+	}
+
+	is_thread = cpu_node->flags & ACPI_PPTT_ACPI_PROCESSOR_IS_THREAD;
+	cluster_node = fetch_pptt_node(table, cpu_node->parent);
+	if (cluster_node == NULL) {
+		retval = -ENOENT;
+		goto put_table;
+	}
+	if (is_thread) {
+		if (!cluster_node->parent) {
+			retval = -ENOENT;
+			goto put_table;
+		}
+		cluster_node = fetch_pptt_node(table, cluster_node->parent);
+		if (cluster_node == NULL) {
+			retval = -ENOENT;
+			goto put_table;
+		}
+	}
+	if (cluster_node->flags & ACPI_PPTT_ACPI_PROCESSOR_ID_VALID)
+		retval = cluster_node->acpi_processor_id;
+	else
+		retval = ACPI_PTR_DIFF(cluster_node, table);
+
+put_table:
+	acpi_put_table(table);
+
+	return retval;
+}
+
 /**
  * find_acpi_cpu_topology_hetero_id() - Get a core architecture tag
  * @cpu: Kernel logical CPU number
diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 921312a8d957..5b1589adacaf 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -598,6 +598,11 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
 	return core_mask;
 }
 
+const struct cpumask *cpu_clustergroup_mask(int cpu)
+{
+	return &cpu_topology[cpu].cluster_sibling;
+}
+
 void update_siblings_masks(unsigned int cpuid)
 {
 	struct cpu_topology *cpu_topo, *cpuid_topo = &cpu_topology[cpuid];
@@ -615,6 +620,11 @@ void update_siblings_masks(unsigned int cpuid)
 		if (cpuid_topo->package_id != cpu_topo->package_id)
 			continue;
 
+		if (cpuid_topo->cluster_id == cpu_topo->cluster_id) {
+			cpumask_set_cpu(cpu, &cpuid_topo->cluster_sibling);
+			cpumask_set_cpu(cpuid, &cpu_topo->cluster_sibling);
+		}
+
 		cpumask_set_cpu(cpuid, &cpu_topo->core_sibling);
 		cpumask_set_cpu(cpu, &cpuid_topo->core_sibling);
 
@@ -633,6 +643,9 @@ static void clear_cpu_topology(int cpu)
 	cpumask_clear(&cpu_topo->llc_sibling);
 	cpumask_set_cpu(cpu, &cpu_topo->llc_sibling);
 
+	cpumask_clear(&cpu_topo->cluster_sibling);
+	cpumask_set_cpu(cpu, &cpu_topo->cluster_sibling);
+
 	cpumask_clear(&cpu_topo->core_sibling);
 	cpumask_set_cpu(cpu, &cpu_topo->core_sibling);
 	cpumask_clear(&cpu_topo->thread_sibling);
@@ -648,6 +661,7 @@ void __init reset_cpu_topology(void)
 
 		cpu_topo->thread_id = -1;
 		cpu_topo->core_id = -1;
+		cpu_topo->cluster_id = -1;
 		cpu_topo->package_id = -1;
 		cpu_topo->llc_id = -1;
 
diff --git a/drivers/base/topology.c b/drivers/base/topology.c
index 43c0940643f5..8f2b641d0b8c 100644
--- a/drivers/base/topology.c
+++ b/drivers/base/topology.c
@@ -48,6 +48,9 @@ static DEVICE_ATTR_RO(physical_package_id);
 define_id_show_func(die_id);
 static DEVICE_ATTR_RO(die_id);
 
+define_id_show_func(cluster_id);
+static DEVICE_ATTR_RO(cluster_id);
+
 define_id_show_func(core_id);
 static DEVICE_ATTR_RO(core_id);
 
@@ -63,6 +66,10 @@ define_siblings_read_func(core_siblings, core_cpumask);
 static BIN_ATTR_RO(core_siblings, 0);
 static BIN_ATTR_RO(core_siblings_list, 0);
 
+define_siblings_read_func(cluster_cpus, cluster_cpumask);
+static BIN_ATTR_RO(cluster_cpus, 0);
+static BIN_ATTR_RO(cluster_cpus_list, 0);
+
 define_siblings_read_func(die_cpus, die_cpumask);
 static BIN_ATTR_RO(die_cpus, 0);
 static BIN_ATTR_RO(die_cpus_list, 0);
@@ -94,6 +101,8 @@ static struct bin_attribute *bin_attrs[] = {
 	&bin_attr_thread_siblings_list,
 	&bin_attr_core_siblings,
 	&bin_attr_core_siblings_list,
+	&bin_attr_cluster_cpus,
+	&bin_attr_cluster_cpus_list,
 	&bin_attr_die_cpus,
 	&bin_attr_die_cpus_list,
 	&bin_attr_package_cpus,
@@ -112,6 +121,7 @@ static struct bin_attribute *bin_attrs[] = {
 static struct attribute *default_attrs[] = {
 	&dev_attr_physical_package_id.attr,
 	&dev_attr_die_id.attr,
+	&dev_attr_cluster_id.attr,
 	&dev_attr_core_id.attr,
 #ifdef CONFIG_SCHED_BOOK
 	&dev_attr_book_id.attr,
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 72e4f7fd268c..6d65427e5f67 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -1353,6 +1353,7 @@ static inline int lpit_read_residency_count_address(u64 *address)
 #ifdef CONFIG_ACPI_PPTT
 int acpi_pptt_cpu_is_thread(unsigned int cpu);
 int find_acpi_cpu_topology(unsigned int cpu, int level);
+int find_acpi_cpu_topology_cluster(unsigned int cpu);
 int find_acpi_cpu_topology_package(unsigned int cpu);
 int find_acpi_cpu_topology_hetero_id(unsigned int cpu);
 int find_acpi_cpu_cache_topology(unsigned int cpu, int level);
@@ -1365,6 +1366,10 @@ static inline int find_acpi_cpu_topology(unsigned int cpu, int level)
 {
 	return -EINVAL;
 }
+static inline int find_acpi_cpu_topology_cluster(unsigned int cpu)
+{
+	return -EINVAL;
+}
 static inline int find_acpi_cpu_topology_package(unsigned int cpu)
 {
 	return -EINVAL;
diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
index f180240dc95f..b97cea83b25e 100644
--- a/include/linux/arch_topology.h
+++ b/include/linux/arch_topology.h
@@ -62,10 +62,12 @@ void topology_set_thermal_pressure(const struct cpumask *cpus,
 struct cpu_topology {
 	int thread_id;
 	int core_id;
+	int cluster_id;
 	int package_id;
 	int llc_id;
 	cpumask_t thread_sibling;
 	cpumask_t core_sibling;
+	cpumask_t cluster_sibling;
 	cpumask_t llc_sibling;
 };
 
@@ -73,13 +75,16 @@ struct cpu_topology {
 extern struct cpu_topology cpu_topology[NR_CPUS];
 
 #define topology_physical_package_id(cpu)	(cpu_topology[cpu].package_id)
+#define topology_cluster_id(cpu)	(cpu_topology[cpu].cluster_id)
 #define topology_core_id(cpu)		(cpu_topology[cpu].core_id)
 #define topology_core_cpumask(cpu)	(&cpu_topology[cpu].core_sibling)
 #define topology_sibling_cpumask(cpu)	(&cpu_topology[cpu].thread_sibling)
+#define topology_cluster_cpumask(cpu)	(&cpu_topology[cpu].cluster_sibling)
 #define topology_llc_cpumask(cpu)	(&cpu_topology[cpu].llc_sibling)
 void init_cpu_topology(void);
 void store_cpu_topology(unsigned int cpuid);
 const struct cpumask *cpu_coregroup_mask(int cpu);
+const struct cpumask *cpu_clustergroup_mask(int cpu);
 void update_siblings_masks(unsigned int cpu);
 void remove_cpu_topology(unsigned int cpuid);
 void reset_cpu_topology(void);
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 7634cd737061..80d27d717631 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -186,6 +186,9 @@ static inline int cpu_to_mem(int cpu)
 #ifndef topology_die_id
 #define topology_die_id(cpu)			((void)(cpu), -1)
 #endif
+#ifndef topology_cluster_id
+#define topology_cluster_id(cpu)		((void)(cpu), -1)
+#endif
 #ifndef topology_core_id
 #define topology_core_id(cpu)			((void)(cpu), 0)
 #endif
@@ -195,6 +198,9 @@ static inline int cpu_to_mem(int cpu)
 #ifndef topology_core_cpumask
 #define topology_core_cpumask(cpu)		cpumask_of(cpu)
 #endif
+#ifndef topology_cluster_cpumask
+#define topology_cluster_cpumask(cpu)		cpumask_of(cpu)
+#endif
 #ifndef topology_die_cpumask
 #define topology_die_cpumask(cpu)		cpumask_of(cpu)
 #endif
-- 
2.25.1



* [PATCH 2/3] scheduler: Add cluster scheduler level in core and related Kconfig for ARM64
  2021-08-20  1:30 ` Barry Song
@ 2021-08-20  1:30   ` Barry Song
  -1 siblings, 0 replies; 16+ messages in thread
From: Barry Song @ 2021-08-20  1:30 UTC (permalink / raw)
  To: bp, catalin.marinas, dietmar.eggemann, gregkh, hpa, juri.lelli,
	bristot, lenb, mgorman, mingo, peterz, rjw, sudeep.holla, tglx
  Cc: aubrey.li, bsegall, guodong.xu, jonathan.cameron, liguozhu,
	linux-acpi, linux-arm-kernel, linux-kernel, mark.rutland,
	msys.mizuma, prime.zeng, rostedt, tim.c.chen, valentin.schneider,
	vincent.guittot, will, x86, xuwei5, yangyicong, linuxarm,
	Barry Song, Yicong Yang

From: Barry Song <song.bao.hua@hisilicon.com>

This patch adds a scheduler level for clusters and automatically enables
load balancing among clusters. It will directly benefit many workloads
that want more resources, such as memory bandwidth and caches.

Testing has been done widely on two different hardware configurations of
Kunpeng920:

 24 cores in one NUMA node (6 clusters per NUMA node);
 32 cores in one NUMA node (8 clusters per NUMA node)

Workloads run on either one NUMA node or four NUMA nodes, so this can
estimate the effect of cluster spreading with and without NUMA load
balancing.

* Stream benchmark:

4threads stream (on 1NUMA * 24cores = 24cores)
                stream                 stream
                w/o patch              w/ patch
MB/sec copy     29929.64 (   0.00%)    32932.68 (  10.03%)
MB/sec scale    29861.10 (   0.00%)    32710.58 (   9.54%)
MB/sec add      27034.42 (   0.00%)    32400.68 (  19.85%)
MB/sec triad    27225.26 (   0.00%)    31965.36 (  17.41%)

6threads stream (on 1NUMA * 24cores = 24cores)
                stream                 stream
                w/o patch              w/ patch
MB/sec copy     40330.24 (   0.00%)    42377.68 (   5.08%)
MB/sec scale    40196.42 (   0.00%)    42197.90 (   4.98%)
MB/sec add      37427.00 (   0.00%)    41960.78 (  12.11%)
MB/sec triad    37841.36 (   0.00%)    42513.64 (  12.35%)

12threads stream (on 1NUMA * 24cores = 24cores)
                stream                 stream
                w/o patch              w/ patch
MB/sec copy     52639.82 (   0.00%)    53818.04 (   2.24%)
MB/sec scale    52350.30 (   0.00%)    53253.38 (   1.73%)
MB/sec add      53607.68 (   0.00%)    55198.82 (   2.97%)
MB/sec triad    54776.66 (   0.00%)    56360.40 (   2.89%)

Thus, it could help memory-bound workloads, especially under medium load.
Similar improvement is also seen in lkp-pbzip2:

* lkp-pbzip2 benchmark

2-96 threads (on 4NUMA * 24cores = 96cores)
                  lkp-pbzip2              lkp-pbzip2
                  w/o patch               w/ patch
Hmean     tput-2   11062841.57 (   0.00%)  11341817.51 *   2.52%*
Hmean     tput-5   26815503.70 (   0.00%)  27412872.65 *   2.23%*
Hmean     tput-8   41873782.21 (   0.00%)  43326212.92 *   3.47%*
Hmean     tput-12  61875980.48 (   0.00%)  64578337.51 *   4.37%*
Hmean     tput-21 105814963.07 (   0.00%) 111381851.01 *   5.26%*
Hmean     tput-30 150349470.98 (   0.00%) 156507070.73 *   4.10%*
Hmean     tput-48 237195937.69 (   0.00%) 242353597.17 *   2.17%*
Hmean     tput-79 360252509.37 (   0.00%) 362635169.23 *   0.66%*
Hmean     tput-96 394571737.90 (   0.00%) 400952978.48 *   1.62%*

2-24 threads (on 1NUMA * 24cores = 24cores)
                 lkp-pbzip2               lkp-pbzip2
                 w/o patch                w/ patch
Hmean     tput-2   11071705.49 (   0.00%)  11296869.10 *   2.03%*
Hmean     tput-4   20782165.19 (   0.00%)  21949232.15 *   5.62%*
Hmean     tput-6   30489565.14 (   0.00%)  33023026.96 *   8.31%*
Hmean     tput-8   40376495.80 (   0.00%)  42779286.27 *   5.95%*
Hmean     tput-12  61264033.85 (   0.00%)  62995632.78 *   2.83%*
Hmean     tput-18  86697139.39 (   0.00%)  86461545.74 (  -0.27%)
Hmean     tput-24 104854637.04 (   0.00%) 104522649.46 *  -0.32%*

In the case of 6 threads and 8 threads, we see the greatest performance
improvement.

Similar improvement can be seen on lkp-pixz though the improvement is
smaller:

* lkp-pixz benchmark

2-24 threads lkp-pixz (on 1NUMA * 24cores = 24cores)
                  lkp-pixz               lkp-pixz
                  w/o patch              w/ patch
Hmean     tput-2   6486981.16 (   0.00%)  6561515.98 *   1.15%*
Hmean     tput-4  11645766.38 (   0.00%) 11614628.43 (  -0.27%)
Hmean     tput-6  15429943.96 (   0.00%) 15957350.76 *   3.42%*
Hmean     tput-8  19974087.63 (   0.00%) 20413746.98 *   2.20%*
Hmean     tput-12 28172068.18 (   0.00%) 28751997.06 *   2.06%*
Hmean     tput-18 39413409.54 (   0.00%) 39896830.55 *   1.23%*
Hmean     tput-24 49101815.85 (   0.00%) 49418141.47 *   0.64%*

* SPECrate benchmark

4,8,16 copies mcf_r(on 1NUMA * 32cores = 32cores)
		Base     	 	Base
		Run Time   	 	Rate
		-------  	 	---------
4 Copies	w/o 580 (w/ 570)       	w/o 11.1 (w/ 11.3)
8 Copies	w/o 647 (w/ 605)       	w/o 20.0 (w/ 21.4, +7%)
16 Copies	w/o 844 (w/ 844)       	w/o 30.6 (w/ 30.6)

32 Copies(on 4NUMA * 32 cores = 128cores)
[w/o patch]
                 Base     Base        Base
Benchmarks       Copies  Run Time     Rate
--------------- -------  ---------  ---------
500.perlbench_r      32        584       87.2  *
502.gcc_r            32        503       90.2  *
505.mcf_r            32        745       69.4  *
520.omnetpp_r        32       1031       40.7  *
523.xalancbmk_r      32        597       56.6  *
525.x264_r            1         --            CE
531.deepsjeng_r      32        336      109    *
541.leela_r          32        556       95.4  *
548.exchange2_r      32        513      163    *
557.xz_r             32        530       65.2  *
 Est. SPECrate2017_int_base              80.3

[w/ patch]
                  Base     Base        Base
Benchmarks       Copies  Run Time     Rate
--------------- -------  ---------  ---------
500.perlbench_r      32        580      87.8 (+0.688%)  *
502.gcc_r            32        477      95.1 (+5.432%)  *
505.mcf_r            32        644      80.3 (+13.574%) *
520.omnetpp_r        32        942      44.6 (+9.58%)   *
523.xalancbmk_r      32        560      60.4 (+6.714%)  *
525.x264_r            1         --           CE
531.deepsjeng_r      32        337      109  (+0.000%) *
541.leela_r          32        554      95.6 (+0.210%) *
548.exchange2_r      32        515      163  (+0.000%) *
557.xz_r             32        524      66.0 (+1.227%) *
 Est. SPECrate2017_int_base              83.7 (+4.062%)

On the other hand, it is slightly helpful to CPU-bound tasks like
kernbench:

* 24-96 threads kernbench (on 4NUMA * 24cores = 96cores)
                     kernbench              kernbench
                     w/o cluster            w/ cluster
Min       user-24    12054.67 (   0.00%)    12024.19 (   0.25%)
Min       syst-24     1751.51 (   0.00%)     1731.68 (   1.13%)
Min       elsp-24      600.46 (   0.00%)      598.64 (   0.30%)
Min       user-48    12361.93 (   0.00%)    12315.32 (   0.38%)
Min       syst-48     1917.66 (   0.00%)     1892.73 (   1.30%)
Min       elsp-48      333.96 (   0.00%)      332.57 (   0.42%)
Min       user-96    12922.40 (   0.00%)    12921.17 (   0.01%)
Min       syst-96     2143.94 (   0.00%)     2110.39 (   1.56%)
Min       elsp-96      211.22 (   0.00%)      210.47 (   0.36%)
Amean     user-24    12063.99 (   0.00%)    12030.78 *   0.28%*
Amean     syst-24     1755.20 (   0.00%)     1735.53 *   1.12%*
Amean     elsp-24      601.60 (   0.00%)      600.19 (   0.23%)
Amean     user-48    12362.62 (   0.00%)    12315.56 *   0.38%*
Amean     syst-48     1921.59 (   0.00%)     1894.95 *   1.39%*
Amean     elsp-48      334.10 (   0.00%)      332.82 *   0.38%*
Amean     user-96    12925.27 (   0.00%)    12922.63 (   0.02%)
Amean     syst-96     2146.66 (   0.00%)     2122.20 *   1.14%*
Amean     elsp-96      211.96 (   0.00%)      211.79 (   0.08%)

Note this patch isn't a universal win; it might hurt workloads that
benefit from packing. While the kernel is unaware of clusters, tasks
that want to take advantage of the lower communication latency within
one cluster won't necessarily be packed into one cluster, but they do
have some chance of being randomly packed. This patch will make them
spread anyway.

Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
---
 arch/arm64/Kconfig             | 7 +++++++
 include/linux/sched/topology.h | 7 +++++++
 include/linux/topology.h       | 7 +++++++
 kernel/sched/topology.c        | 5 +++++
 4 files changed, 26 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index fdcd54d39c1e..7a3cc2314a03 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -992,6 +992,13 @@ config SCHED_MC
 	  making when dealing with multi-core CPU chips at a cost of slightly
 	  increased overhead in some places. If unsure say N here.
 
+config SCHED_CLUSTER
+	bool "Cluster scheduler support"
+	help
+	  Cluster scheduler support improves the CPU scheduler's decision
+	  making when dealing with machines that have clusters (sharing an
+	  internal bus or LLC cache tags). If unsure say N here.
+
 config SCHED_SMT
 	bool "SMT scheduler support"
 	help
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 8f0f778b7c91..2f9166f6dec8 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -42,6 +42,13 @@ static inline int cpu_smt_flags(void)
 }
 #endif
 
+#ifdef CONFIG_SCHED_CLUSTER
+static inline int cpu_cluster_flags(void)
+{
+	return SD_SHARE_PKG_RESOURCES;
+}
+#endif
+
 #ifdef CONFIG_SCHED_MC
 static inline int cpu_core_flags(void)
 {
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 80d27d717631..0b3704ad13c8 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -212,6 +212,13 @@ static inline const struct cpumask *cpu_smt_mask(int cpu)
 }
 #endif
 
+#if defined(CONFIG_SCHED_CLUSTER) && !defined(cpu_cluster_mask)
+static inline const struct cpumask *cpu_cluster_mask(int cpu)
+{
+	return topology_cluster_cpumask(cpu);
+}
+#endif
+
 static inline const struct cpumask *cpu_cpu_mask(int cpu)
 {
 	return cpumask_of_node(cpu_to_node(cpu));
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index b77ad49dc14f..546cfb1c728e 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1625,6 +1625,11 @@ static struct sched_domain_topology_level default_topology[] = {
 #ifdef CONFIG_SCHED_SMT
 	{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
 #endif
+
+#ifdef CONFIG_SCHED_CLUSTER
+	{ cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
+#endif
+
 #ifdef CONFIG_SCHED_MC
 	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
 #endif
-- 
2.25.1



* [PATCH 2/3] scheduler: Add cluster scheduler level in core and related Kconfig for ARM64
@ 2021-08-20  1:30   ` Barry Song
  0 siblings, 0 replies; 16+ messages in thread
From: Barry Song @ 2021-08-20  1:30 UTC (permalink / raw)
  To: bp, catalin.marinas, dietmar.eggemann, gregkh, hpa, juri.lelli,
	bristot, lenb, mgorman, mingo, peterz, rjw, sudeep.holla, tglx
  Cc: aubrey.li, bsegall, guodong.xu, jonathan.cameron, liguozhu,
	linux-acpi, linux-arm-kernel, linux-kernel, mark.rutland,
	msys.mizuma, prime.zeng, rostedt, tim.c.chen, valentin.schneider,
	vincent.guittot, will, x86, xuwei5, yangyicong, linuxarm,
	Barry Song, Yicong Yang

From: Barry Song <song.bao.hua@hisilicon.com>

This patch adds scheduler level for clusters and automatically enables
the load balance among clusters. It will directly benefit a lot of
workload which loves more resources such as memory bandwidth, caches.

Testing has widely been done in two different hardware configurations of
Kunpeng920:

 24 cores in one NUMA(6 clusters in each NUMA node);
 32 cores in one NUMA(8 clusters in each NUMA node)

Workload is running on either one NUMA node or four NUMA nodes, thus,
this can estimate the effect of cluster spreading w/ and w/o NUMA load
balance.

* Stream benchmark:

4threads stream (on 1NUMA * 24cores = 24cores)
                stream                 stream
                w/o patch              w/ patch
MB/sec copy     29929.64 (   0.00%)    32932.68 (  10.03%)
MB/sec scale    29861.10 (   0.00%)    32710.58 (   9.54%)
MB/sec add      27034.42 (   0.00%)    32400.68 (  19.85%)
MB/sec triad    27225.26 (   0.00%)    31965.36 (  17.41%)

6threads stream (on 1NUMA * 24cores = 24cores)
                stream                 stream
                w/o patch              w/ patch
MB/sec copy     40330.24 (   0.00%)    42377.68 (   5.08%)
MB/sec scale    40196.42 (   0.00%)    42197.90 (   4.98%)
MB/sec add      37427.00 (   0.00%)    41960.78 (  12.11%)
MB/sec triad    37841.36 (   0.00%)    42513.64 (  12.35%)

12threads stream (on 1NUMA * 24cores = 24cores)
                stream                 stream
                w/o patch              w/ patch
MB/sec copy     52639.82 (   0.00%)    53818.04 (   2.24%)
MB/sec scale    52350.30 (   0.00%)    53253.38 (   1.73%)
MB/sec add      53607.68 (   0.00%)    55198.82 (   2.97%)
MB/sec triad    54776.66 (   0.00%)    56360.40 (   2.89%)

Thus, it could help memory-bound workload especially under medium load.
Similar improvement is also seen in lkp-pbzip2:

* lkp-pbzip2 benchmark

2-96 threads (on 4NUMA * 24cores = 96cores)
                  lkp-pbzip2              lkp-pbzip2
                  w/o patch               w/ patch
Hmean     tput-2   11062841.57 (   0.00%)  11341817.51 *   2.52%*
Hmean     tput-5   26815503.70 (   0.00%)  27412872.65 *   2.23%*
Hmean     tput-8   41873782.21 (   0.00%)  43326212.92 *   3.47%*
Hmean     tput-12  61875980.48 (   0.00%)  64578337.51 *   4.37%*
Hmean     tput-21 105814963.07 (   0.00%) 111381851.01 *   5.26%*
Hmean     tput-30 150349470.98 (   0.00%) 156507070.73 *   4.10%*
Hmean     tput-48 237195937.69 (   0.00%) 242353597.17 *   2.17%*
Hmean     tput-79 360252509.37 (   0.00%) 362635169.23 *   0.66%*
Hmean     tput-96 394571737.90 (   0.00%) 400952978.48 *   1.62%*

2-24 threads (on 1NUMA * 24cores = 24cores)
                 lkp-pbzip2               lkp-pbzip2
                 w/o patch                w/ patch
Hmean     tput-2   11071705.49 (   0.00%)  11296869.10 *   2.03%*
Hmean     tput-4   20782165.19 (   0.00%)  21949232.15 *   5.62%*
Hmean     tput-6   30489565.14 (   0.00%)  33023026.96 *   8.31%*
Hmean     tput-8   40376495.80 (   0.00%)  42779286.27 *   5.95%*
Hmean     tput-12  61264033.85 (   0.00%)  62995632.78 *   2.83%*
Hmean     tput-18  86697139.39 (   0.00%)  86461545.74 (  -0.27%)
Hmean     tput-24 104854637.04 (   0.00%) 104522649.46 *  -0.32%*

In the case of 6 threads and 8 threads, we see the greatest performance
improvement.

Similar improvement can be seen on lkp-pixz though the improvement is
smaller:

* lkp-pixz benchmark

2-24 threads lkp-pixz (on 1NUMA * 24cores = 24cores)
                  lkp-pixz               lkp-pixz
                  w/o patch              w/ patch
Hmean     tput-2   6486981.16 (   0.00%)  6561515.98 *   1.15%*
Hmean     tput-4  11645766.38 (   0.00%) 11614628.43 (  -0.27%)
Hmean     tput-6  15429943.96 (   0.00%) 15957350.76 *   3.42%*
Hmean     tput-8  19974087.63 (   0.00%) 20413746.98 *   2.20%*
Hmean     tput-12 28172068.18 (   0.00%) 28751997.06 *   2.06%*
Hmean     tput-18 39413409.54 (   0.00%) 39896830.55 *   1.23%*
Hmean     tput-24 49101815.85 (   0.00%) 49418141.47 *   0.64%*

* SPECrate benchmark

4,8,16 copies mcf_r(on 1NUMA * 32cores = 32cores)
		Base     	 	Base
		Run Time   	 	Rate
		-------  	 	---------
4 Copies	w/o 580 (w/ 570)       	w/o 11.1 (w/ 11.3)
8 Copies	w/o 647 (w/ 605)       	w/o 20.0 (w/ 21.4, +7%)
16 Copies	w/o 844 (w/ 844)       	w/o 30.6 (w/ 30.6)

32 Copies(on 4NUMA * 32 cores = 128cores)
[w/o patch]
                 Base     Base        Base
Benchmarks       Copies  Run Time     Rate
--------------- -------  ---------  ---------
500.perlbench_r      32        584       87.2  *
502.gcc_r            32        503       90.2  *
505.mcf_r            32        745       69.4  *
520.omnetpp_r        32       1031       40.7  *
523.xalancbmk_r      32        597       56.6  *
525.x264_r            1         --            CE
531.deepsjeng_r      32        336      109    *
541.leela_r          32        556       95.4  *
548.exchange2_r      32        513      163    *
557.xz_r             32        530       65.2  *
 Est. SPECrate2017_int_base              80.3

[w/ patch]
                  Base     Base        Base
Benchmarks       Copies  Run Time     Rate
--------------- -------  ---------  ---------
500.perlbench_r      32        580      87.8 (+0.688%)  *
502.gcc_r            32        477      95.1 (+5.432%)  *
505.mcf_r            32        644      80.3 (+13.574%) *
520.omnetpp_r        32        942      44.6 (+9.58%)   *
523.xalancbmk_r      32        560      60.4 (+6.714%)  *
525.x264_r            1         --           CE
531.deepsjeng_r      32        337      109  (+0.000%) *
541.leela_r          32        554      95.6 (+0.210%) *
548.exchange2_r      32        515      163  (+0.000%) *
557.xz_r             32        524      66.0 (+1.227%) *
 Est. SPECrate2017_int_base              83.7 (+4.062%)
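
For reference, SPECrate's base estimate is the geometric mean of the
per-benchmark rates. A quick sketch checking the w/-patch numbers above
(a sanity check only; the nine completed benchmarks are treated as the
run, with 525.x264_r excluded here because of its compile error):

```python
from math import prod

# Per-benchmark base rates from the "w/ patch" table above.
rates = [87.8, 95.1, 80.3, 44.6, 60.4, 109, 95.6, 163, 66.0]

# SPECrate composite = geometric mean of the individual rates.
est = prod(rates) ** (1 / len(rates))
print(est)  # ≈ 83.7, matching the Est. SPECrate2017_int_base line
```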

The patch is also slightly helpful to CPU-bound tasks such as
kernbench:

* 24-96 threads kernbench (on 4NUMA * 24cores = 96cores)
                     kernbench              kernbench
                     w/o cluster            w/ cluster
Min       user-24    12054.67 (   0.00%)    12024.19 (   0.25%)
Min       syst-24     1751.51 (   0.00%)     1731.68 (   1.13%)
Min       elsp-24      600.46 (   0.00%)      598.64 (   0.30%)
Min       user-48    12361.93 (   0.00%)    12315.32 (   0.38%)
Min       syst-48     1917.66 (   0.00%)     1892.73 (   1.30%)
Min       elsp-48      333.96 (   0.00%)      332.57 (   0.42%)
Min       user-96    12922.40 (   0.00%)    12921.17 (   0.01%)
Min       syst-96     2143.94 (   0.00%)     2110.39 (   1.56%)
Min       elsp-96      211.22 (   0.00%)      210.47 (   0.36%)
Amean     user-24    12063.99 (   0.00%)    12030.78 *   0.28%*
Amean     syst-24     1755.20 (   0.00%)     1735.53 *   1.12%*
Amean     elsp-24      601.60 (   0.00%)      600.19 (   0.23%)
Amean     user-48    12362.62 (   0.00%)    12315.56 *   0.38%*
Amean     syst-48     1921.59 (   0.00%)     1894.95 *   1.39%*
Amean     elsp-48      334.10 (   0.00%)      332.82 *   0.38%*
Amean     user-96    12925.27 (   0.00%)    12922.63 (   0.02%)
Amean     syst-96     2146.66 (   0.00%)     2122.20 *   1.14%*
Amean     elsp-96      211.96 (   0.00%)      211.79 (   0.08%)

Note this patch isn't a universal win; it might hurt workloads that
can benefit from packing. Although tasks that want to take advantage
of the lower communication latency within a single cluster won't
necessarily be packed into one cluster while the kernel is unaware of
clusters, they do have some chance of being randomly packed. This
patch will make them spread anyway.
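
For anyone wanting to check how tasks land across clusters, each
cluster's span can be read from the sysfs ABI this series adds (a
cluster_cpus_list-style file under /sys/devices/system/cpu/cpuN/topology/;
the exact file name is per the sysfs patch in this series). A minimal
parser for the kernel's cpulist format, as a sketch (the helper name is
illustrative):

```python
def parse_cpulist(s: str) -> list[int]:
    """Parse a kernel cpulist string such as '0-3,8,10-11' into CPU ids."""
    cpus: list[int] = []
    for part in s.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return sorted(cpus)

# Example: a 4-core cluster on kunpeng920 might report "0-3"
print(parse_cpulist("0-3"))        # [0, 1, 2, 3]
print(parse_cpulist("0-1,4,6-7"))  # [0, 1, 4, 6, 7]
```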

Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
---
 arch/arm64/Kconfig             | 7 +++++++
 include/linux/sched/topology.h | 7 +++++++
 include/linux/topology.h       | 7 +++++++
 kernel/sched/topology.c        | 5 +++++
 4 files changed, 26 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index fdcd54d39c1e..7a3cc2314a03 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -992,6 +992,13 @@ config SCHED_MC
 	  making when dealing with multi-core CPU chips at a cost of slightly
 	  increased overhead in some places. If unsure say N here.
 
+config SCHED_CLUSTER
+	bool "Cluster scheduler support"
+	help
+	  Cluster scheduler support improves the CPU scheduler's decision
+	  making when dealing with machines that have clusters (sharing an
+	  internal bus or LLC cache tags). If unsure say N here.
+
 config SCHED_SMT
 	bool "SMT scheduler support"
 	help
diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 8f0f778b7c91..2f9166f6dec8 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -42,6 +42,13 @@ static inline int cpu_smt_flags(void)
 }
 #endif
 
+#ifdef CONFIG_SCHED_CLUSTER
+static inline int cpu_cluster_flags(void)
+{
+	return SD_SHARE_PKG_RESOURCES;
+}
+#endif
+
 #ifdef CONFIG_SCHED_MC
 static inline int cpu_core_flags(void)
 {
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 80d27d717631..0b3704ad13c8 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -212,6 +212,13 @@ static inline const struct cpumask *cpu_smt_mask(int cpu)
 }
 #endif
 
+#if defined(CONFIG_SCHED_CLUSTER) && !defined(cpu_cluster_mask)
+static inline const struct cpumask *cpu_cluster_mask(int cpu)
+{
+	return topology_cluster_cpumask(cpu);
+}
+#endif
+
 static inline const struct cpumask *cpu_cpu_mask(int cpu)
 {
 	return cpumask_of_node(cpu_to_node(cpu));
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index b77ad49dc14f..546cfb1c728e 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1625,6 +1625,11 @@ static struct sched_domain_topology_level default_topology[] = {
 #ifdef CONFIG_SCHED_SMT
 	{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
 #endif
+
+#ifdef CONFIG_SCHED_CLUSTER
+	{ cpu_clustergroup_mask, cpu_cluster_flags, SD_INIT_NAME(CLS) },
+#endif
+
 #ifdef CONFIG_SCHED_MC
 	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
 #endif
-- 
2.25.1


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH 3/3] scheduler: Add cluster scheduler level for x86
  2021-08-20  1:30 ` Barry Song
@ 2021-08-20  1:30   ` Barry Song
  -1 siblings, 0 replies; 16+ messages in thread
From: Barry Song @ 2021-08-20  1:30 UTC (permalink / raw)
  To: bp, catalin.marinas, dietmar.eggemann, gregkh, hpa, juri.lelli,
	bristot, lenb, mgorman, mingo, peterz, rjw, sudeep.holla, tglx
  Cc: aubrey.li, bsegall, guodong.xu, jonathan.cameron, liguozhu,
	linux-acpi, linux-arm-kernel, linux-kernel, mark.rutland,
	msys.mizuma, prime.zeng, rostedt, tim.c.chen, valentin.schneider,
	vincent.guittot, will, x86, xuwei5, yangyicong, linuxarm,
	Barry Song

From: Tim Chen <tim.c.chen@linux.intel.com>

There are x86 CPU architectures (e.g. Jacobsville) where the L2 cache
is shared among a cluster of cores instead of being exclusive to a
single core.
To prevent oversubscription of the L2 cache, load should be balanced
between such L2 clusters, especially for tasks with no shared data.
On benchmarks such as the SPECrate mcf test, this change boosts
performance, especially on a medium-load system.
On a Jacobsville that has 24 Atom cores, arranged into 6 clusters
of 4 cores each, the benchmark numbers are as follows:

 Improvement over baseline kernel for mcf_r
 copies		run time	base rate
 1		-0.1%		-0.2%
 6		25.1%		25.1%
 12		18.8%		19.0%
 24		0.3%		0.3%

So this looks pretty good. In terms of the system's task distribution,
some pretty bad clumping can be seen for the vanilla kernel without
the L2 cluster domain in the 6- and 12-copy cases. With the extra
cluster domain, the load does get evened out between the clusters.

Note this patch isn't a universal win, as spreading isn't necessarily
beneficial, particularly for workloads that can benefit from packing.

Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
---
 arch/x86/Kconfig                |  8 ++++++
 arch/x86/include/asm/smp.h      |  7 ++++++
 arch/x86/include/asm/topology.h |  3 +++
 arch/x86/kernel/cpu/cacheinfo.c |  1 +
 arch/x86/kernel/cpu/common.c    |  3 +++
 arch/x86/kernel/smpboot.c       | 44 ++++++++++++++++++++++++++++++++-
 6 files changed, 65 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 88fb922c23a0..e97356e99bbe 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -996,6 +996,14 @@ config NR_CPUS
 	  This is purely to save memory: each supported CPU adds about 8KB
 	  to the kernel image.
 
+config SCHED_CLUSTER
+	bool "Cluster scheduler support"
+	default n
+	help
+	 Cluster scheduler support improves the CPU scheduler's decision
+	 making when dealing with machines that have clusters of CPUs
+	 sharing L2 cache. If unsure say N here.
+
 config SCHED_SMT
 	def_bool y if SMP
 
diff --git a/arch/x86/include/asm/smp.h b/arch/x86/include/asm/smp.h
index 630ff08532be..08b0e90623ad 100644
--- a/arch/x86/include/asm/smp.h
+++ b/arch/x86/include/asm/smp.h
@@ -16,7 +16,9 @@ DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_core_map);
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_die_map);
 /* cpus sharing the last level cache: */
 DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map);
+DECLARE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_l2c_shared_map);
 DECLARE_PER_CPU_READ_MOSTLY(u16, cpu_llc_id);
+DECLARE_PER_CPU_READ_MOSTLY(u16, cpu_l2c_id);
 DECLARE_PER_CPU_READ_MOSTLY(int, cpu_number);
 
 static inline struct cpumask *cpu_llc_shared_mask(int cpu)
@@ -24,6 +26,11 @@ static inline struct cpumask *cpu_llc_shared_mask(int cpu)
 	return per_cpu(cpu_llc_shared_map, cpu);
 }
 
+static inline struct cpumask *cpu_l2c_shared_mask(int cpu)
+{
+	return per_cpu(cpu_l2c_shared_map, cpu);
+}
+
 DECLARE_EARLY_PER_CPU_READ_MOSTLY(u16, x86_cpu_to_apicid);
 DECLARE_EARLY_PER_CPU_READ_MOSTLY(u32, x86_cpu_to_acpiid);
 DECLARE_EARLY_PER_CPU_READ_MOSTLY(u16, x86_bios_cpu_apicid);
diff --git a/arch/x86/include/asm/topology.h b/arch/x86/include/asm/topology.h
index 9239399e5491..2548d824f103 100644
--- a/arch/x86/include/asm/topology.h
+++ b/arch/x86/include/asm/topology.h
@@ -103,17 +103,20 @@ static inline void setup_node_to_cpumask_map(void) { }
 #include <asm-generic/topology.h>
 
 extern const struct cpumask *cpu_coregroup_mask(int cpu);
+extern const struct cpumask *cpu_clustergroup_mask(int cpu);
 
 #define topology_logical_package_id(cpu)	(cpu_data(cpu).logical_proc_id)
 #define topology_physical_package_id(cpu)	(cpu_data(cpu).phys_proc_id)
 #define topology_logical_die_id(cpu)		(cpu_data(cpu).logical_die_id)
 #define topology_die_id(cpu)			(cpu_data(cpu).cpu_die_id)
+#define topology_cluster_id(cpu)		(per_cpu(cpu_l2c_id, cpu))
 #define topology_core_id(cpu)			(cpu_data(cpu).cpu_core_id)
 
 extern unsigned int __max_die_per_package;
 
 #ifdef CONFIG_SMP
 #define topology_die_cpumask(cpu)		(per_cpu(cpu_die_map, cpu))
+#define topology_cluster_cpumask(cpu)		(cpu_clustergroup_mask(cpu))
 #define topology_core_cpumask(cpu)		(per_cpu(cpu_core_map, cpu))
 #define topology_sibling_cpumask(cpu)		(per_cpu(cpu_sibling_map, cpu))
 
diff --git a/arch/x86/kernel/cpu/cacheinfo.c b/arch/x86/kernel/cpu/cacheinfo.c
index d66af2950e06..3528987fef1d 100644
--- a/arch/x86/kernel/cpu/cacheinfo.c
+++ b/arch/x86/kernel/cpu/cacheinfo.c
@@ -846,6 +846,7 @@ void init_intel_cacheinfo(struct cpuinfo_x86 *c)
 		l2 = new_l2;
 #ifdef CONFIG_SMP
 		per_cpu(cpu_llc_id, cpu) = l2_id;
+		per_cpu(cpu_l2c_id, cpu) = l2_id;
 #endif
 	}
 
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 64b805bd6a54..38871f114af1 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -79,6 +79,9 @@ EXPORT_SYMBOL(smp_num_siblings);
 /* Last level cache ID of each logical CPU */
 DEFINE_PER_CPU_READ_MOSTLY(u16, cpu_llc_id) = BAD_APICID;
 
+/* L2 cache ID of each logical CPU */
+DEFINE_PER_CPU_READ_MOSTLY(u16, cpu_l2c_id) = BAD_APICID;
+
 /* correctly size the local cpu masks */
 void __init setup_cpu_local_masks(void)
 {
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 9320285a5e29..5832c6b6348f 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -101,6 +101,8 @@ EXPORT_PER_CPU_SYMBOL(cpu_die_map);
 
 DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_llc_shared_map);
 
+DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_l2c_shared_map);
+
 /* Per CPU bogomips and other parameters */
 DEFINE_PER_CPU_READ_MOSTLY(struct cpuinfo_x86, cpu_info);
 EXPORT_PER_CPU_SYMBOL(cpu_info);
@@ -464,6 +466,21 @@ static bool match_die(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
 	return false;
 }
 
+static bool match_l2c(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
+{
+	int cpu1 = c->cpu_index, cpu2 = o->cpu_index;
+
+	/* Do not match if we do not have a valid APICID for cpu: */
+	if (per_cpu(cpu_l2c_id, cpu1) == BAD_APICID)
+		return false;
+
+	/* Do not match if L2 cache id does not match: */
+	if (per_cpu(cpu_l2c_id, cpu1) != per_cpu(cpu_l2c_id, cpu2))
+		return false;
+
+	return topology_sane(c, o, "l2c");
+}
+
 /*
  * Unlike the other levels, we do not enforce keeping a
  * multicore group inside a NUMA node.  If this happens, we will
@@ -523,7 +540,7 @@ static bool match_llc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
 }
 
 
-#if defined(CONFIG_SCHED_SMT) || defined(CONFIG_SCHED_MC)
+#if defined(CONFIG_SCHED_SMT) || defined(CONFIG_SCHED_CLUSTER) || defined(CONFIG_SCHED_MC)
 static inline int x86_sched_itmt_flags(void)
 {
 	return sysctl_sched_itmt_enabled ? SD_ASYM_PACKING : 0;
@@ -541,12 +558,21 @@ static int x86_smt_flags(void)
 	return cpu_smt_flags() | x86_sched_itmt_flags();
 }
 #endif
+#ifdef CONFIG_SCHED_CLUSTER
+static int x86_cluster_flags(void)
+{
+	return cpu_cluster_flags() | x86_sched_itmt_flags();
+}
+#endif
 #endif
 
 static struct sched_domain_topology_level x86_numa_in_package_topology[] = {
 #ifdef CONFIG_SCHED_SMT
 	{ cpu_smt_mask, x86_smt_flags, SD_INIT_NAME(SMT) },
 #endif
+#ifdef CONFIG_SCHED_CLUSTER
+	{ cpu_clustergroup_mask, x86_cluster_flags, SD_INIT_NAME(CLS) },
+#endif
 #ifdef CONFIG_SCHED_MC
 	{ cpu_coregroup_mask, x86_core_flags, SD_INIT_NAME(MC) },
 #endif
@@ -557,6 +583,9 @@ static struct sched_domain_topology_level x86_topology[] = {
 #ifdef CONFIG_SCHED_SMT
 	{ cpu_smt_mask, x86_smt_flags, SD_INIT_NAME(SMT) },
 #endif
+#ifdef CONFIG_SCHED_CLUSTER
+	{ cpu_clustergroup_mask, x86_cluster_flags, SD_INIT_NAME(CLS) },
+#endif
 #ifdef CONFIG_SCHED_MC
 	{ cpu_coregroup_mask, x86_core_flags, SD_INIT_NAME(MC) },
 #endif
@@ -584,6 +613,7 @@ void set_cpu_sibling_map(int cpu)
 	if (!has_mp) {
 		cpumask_set_cpu(cpu, topology_sibling_cpumask(cpu));
 		cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu));
+		cpumask_set_cpu(cpu, cpu_l2c_shared_mask(cpu));
 		cpumask_set_cpu(cpu, topology_core_cpumask(cpu));
 		cpumask_set_cpu(cpu, topology_die_cpumask(cpu));
 		c->booted_cores = 1;
@@ -602,6 +632,9 @@ void set_cpu_sibling_map(int cpu)
 		if ((i == cpu) || (has_mp && match_llc(c, o)))
 			link_mask(cpu_llc_shared_mask, cpu, i);
 
+		if ((i == cpu) || (has_mp && match_l2c(c, o)))
+			link_mask(cpu_l2c_shared_mask, cpu, i);
+
 		if ((i == cpu) || (has_mp && match_die(c, o)))
 			link_mask(topology_die_cpumask, cpu, i);
 	}
@@ -649,6 +682,11 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
 	return cpu_llc_shared_mask(cpu);
 }
 
+const struct cpumask *cpu_clustergroup_mask(int cpu)
+{
+	return cpu_l2c_shared_mask(cpu);
+}
+
 static void impress_friends(void)
 {
 	int cpu;
@@ -1332,6 +1370,7 @@ void __init native_smp_prepare_cpus(unsigned int max_cpus)
 		zalloc_cpumask_var(&per_cpu(cpu_core_map, i), GFP_KERNEL);
 		zalloc_cpumask_var(&per_cpu(cpu_die_map, i), GFP_KERNEL);
 		zalloc_cpumask_var(&per_cpu(cpu_llc_shared_map, i), GFP_KERNEL);
+		zalloc_cpumask_var(&per_cpu(cpu_l2c_shared_map, i), GFP_KERNEL);
 	}
 
 	/*
@@ -1556,7 +1595,10 @@ static void remove_siblinginfo(int cpu)
 		cpumask_clear_cpu(cpu, topology_sibling_cpumask(sibling));
 	for_each_cpu(sibling, cpu_llc_shared_mask(cpu))
 		cpumask_clear_cpu(cpu, cpu_llc_shared_mask(sibling));
+	for_each_cpu(sibling, cpu_l2c_shared_mask(cpu))
+		cpumask_clear_cpu(cpu, cpu_l2c_shared_mask(sibling));
 	cpumask_clear(cpu_llc_shared_mask(cpu));
+	cpumask_clear(cpu_l2c_shared_mask(cpu));
 	cpumask_clear(topology_sibling_cpumask(cpu));
 	cpumask_clear(topology_core_cpumask(cpu));
 	cpumask_clear(topology_die_cpumask(cpu));
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH 3/3] scheduler: Add cluster scheduler level for x86
  2021-08-20  1:30   ` Barry Song
@ 2021-08-23 17:49     ` Tim Chen
  -1 siblings, 0 replies; 16+ messages in thread
From: Tim Chen @ 2021-08-23 17:49 UTC (permalink / raw)
  To: Barry Song, bp, catalin.marinas, dietmar.eggemann, gregkh, hpa,
	juri.lelli, bristot, lenb, mgorman, mingo, peterz, rjw,
	sudeep.holla, tglx
  Cc: aubrey.li, bsegall, guodong.xu, jonathan.cameron, liguozhu,
	linux-acpi, linux-arm-kernel, linux-kernel, mark.rutland,
	msys.mizuma, prime.zeng, rostedt, valentin.schneider,
	vincent.guittot, will, x86, xuwei5, yangyicong, linuxarm,
	Barry Song



On 8/19/21 6:30 PM, Barry Song wrote:
> From: Tim Chen <tim.c.chen@linux.intel.com>
> 
> There are x86 CPU architectures (e.g. Jacobsville) where L2 cahce is
> shared among a cluster of cores instead of being exclusive to one
> single core.
> To prevent oversubscription of L2 cache, load should be balanced
> between such L2 clusters, especially for tasks with no shared data.
> On benchmark such as SPECrate mcf test, this change provides a
> boost to performance especially on medium load system on Jacobsville.
> on a Jacobsville that has 24 Atom cores, arranged into 6 clusters
> of 4 cores each, the benchmark number is as follow:
> 
>  Improvement over baseline kernel for mcf_r
>  copies		run time	base rate
>  1		-0.1%		-0.2%
>  6		25.1%		25.1%
>  12		18.8%		19.0%
>  24		0.3%		0.3%
> 
> So this looks pretty good. In terms of the system's task distribution,
> some pretty bad clumping can be seen for the vanilla kernel without
> the L2 cluster domain for the 6 and 12 copies case. With the extra
> domain for cluster, the load does get evened out between the clusters.
> 
> Note this patch isn't an universal win as spreading isn't necessarily
> a win, particually for those workload who can benefit from packing.

I have another patch set to make cluster scheduling selectable at run
time and boot time.  I would like to see people's feedback on this
patch set first before sending that out.

Thanks.

Tim

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [BUG] Re: [PATCH 1/3] topology: Represent clusters of CPUs within a die
  2021-08-20  1:30   ` Barry Song
@ 2022-05-06 20:24     ` Jeremy Linton
  -1 siblings, 0 replies; 16+ messages in thread
From: Jeremy Linton @ 2022-05-06 20:24 UTC (permalink / raw)
  To: Barry Song, bp, catalin.marinas, dietmar.eggemann, gregkh, hpa,
	juri.lelli, bristot, lenb, mgorman, mingo, peterz, rjw,
	sudeep.holla, tglx
  Cc: aubrey.li, bsegall, guodong.xu, jonathan.cameron, liguozhu,
	linux-acpi, linux-arm-kernel, linux-kernel, mark.rutland,
	msys.mizuma, prime.zeng, rostedt, tim.c.chen, valentin.schneider,
	vincent.guittot, will, x86, xuwei5, yangyicong, linuxarm,
	Tian Tao, Barry Song

Hi,

It seems this set triggers:

"BUG: arch topology borken"
                    ^code

on machines that don't actually have clusters, or that provide a
representation which might be mistaken for a cluster. The Ampere Altra,
for one. So, I guess it's my job to relay what I was told when I
initially proposed something similar a few years back.

Neither the ACPI/PPTT spec nor the Arm architecture spec mandates the
concept of a "cluster", particularly in the form of a system with cores
sharing an L2, which IIRC is the case for the Kunpeng. And it tends to
be a shared L2 which gives the most bang for the buck (or did when I was
testing/benchmarking all this, on aarch64) from scheduler changes which
create cluster-level scheduling domains. But OTOH, things like SPECjbb
didn't really like those smaller MC levels, and I suspect they are hurt
by this change (without having run a full benchmark suite), especially
on something like the Altra above, given what is happening to its
scheduling domains.

So, the one takeaway I can give is this: the code below, which attempts
to create a cluster level, should be a bit more intelligent about
whether there is an actual cluster. A first-order approximation might be
adding a check that the node immediately above the CPU contains an L2
and that it is shared. A better fix, of course, is the reason this
wasn't previously done: convince the ACPI committee to standardize a
CLUSTER level flag which a firmware/machine manufacturer could use to
declare whether cluster-level scheduling provides an advantage, and
simply not create the level on machines which don't flag CLUSTER levels
because it isn't advantageous.
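
To make the suggestion concrete, here is a rough, standalone sketch of
that first-order check. The struct and helper are simplified stand-ins,
not the kernel's PPTT types (struct acpi_pptt_processor / struct
acpi_pptt_cache) or any real API; they only model the "node above the
CPU has a shared L2" test:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Illustrative stand-in for a processor hierarchy node; the real PPTT
 * structures carry far more state (flags, cache resources, offsets).
 */
struct topo_node {
	struct topo_node *parent;
	bool has_l2;	/* node carries a private L2 cache resource */
	int num_cpus;	/* CPUs grouped directly under this node */
};

/*
 * First-order approximation: only treat the node immediately above a
 * CPU as a cluster if that node has an L2 and the L2 is shared, i.e.
 * the node groups more than one CPU.
 */
static bool node_is_real_cluster(const struct topo_node *cpu_node)
{
	const struct topo_node *parent = cpu_node ? cpu_node->parent : NULL;

	if (!parent)
		return false;
	return parent->has_l2 && parent->num_cpus > 1;
}
```

A real implementation would walk the PPTT cache resources of the parent
node instead of reading a precomputed flag.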


Thanks,



On 8/19/21 20:30, Barry Song wrote:
> From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> 
> Both ACPI and DT provide the ability to describe additional layers of
> topology between that of individual cores and higher level constructs
> such as the level at which the last level cache is shared.
> In ACPI this can be represented in PPTT as a Processor Hierarchy
> Node Structure [1] that is the parent of the CPU cores and in turn
> has a parent Processor Hierarchy Nodes Structure representing
> a higher level of topology.
> 
> For example, Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each
> cluster has 4 CPUs. All clusters share the L3 cache data, but each cluster
> has its own local L3 tag. In addition, the CPUs within each cluster
> share some internal system bus.
> 
> +-----------------------------------+                          +---------+
> |  +------+    +------+            +---------------------------+         |
> |  | CPU0 |    | cpu1 |             |    +-----------+         |         |
> |  +------+    +------+             |    |           |         |         |
> |                                   +----+    L3     |         |         |
> |  +------+    +------+   cluster   |    |    tag    |         |         |
> |  | CPU2 |    | CPU3 |             |    |           |         |         |
> |  +------+    +------+             |    +-----------+         |         |
> |                                   |                          |         |
> +-----------------------------------+                          |         |
> +-----------------------------------+                          |         |
> |  +------+    +------+             +--------------------------+         |
> |  |      |    |      |             |    +-----------+         |         |
> |  +------+    +------+             |    |           |         |         |
> |                                   |    |    L3     |         |         |
> |  +------+    +------+             +----+    tag    |         |         |
> |  |      |    |      |             |    |           |         |         |
> |  +------+    +------+             |    +-----------+         |         |
> |                                   |                          |         |
> +-----------------------------------+                          |   L3    |
>                                                                 |   data  |
> +-----------------------------------+                          |         |
> |  +------+    +------+             |    +-----------+         |         |
> |  |      |    |      |             |    |           |         |         |
> |  +------+    +------+             +----+    L3     |         |         |
> |                                   |    |    tag    |         |         |
> |  +------+    +------+             |    |           |         |         |
> |  |      |    |      |            ++    +-----------+         |         |
> |  +------+    +------+            |---------------------------+         |
> +-----------------------------------|                          |         |
> +-----------------------------------|                          |         |
> |  +------+    +------+            +---------------------------+         |
> |  |      |    |      |             |    +-----------+         |         |
> |  +------+    +------+             |    |           |         |         |
> |                                   +----+    L3     |         |         |
> |  +------+    +------+             |    |    tag    |         |         |
> |  |      |    |      |             |    |           |         |         |
> |  +------+    +------+             |    +-----------+         |         |
> |                                   |                          |         |
> +-----------------------------------+                          |         |
> +-----------------------------------+                          |         |
> |  +------+    +------+             +--------------------------+         |
> |  |      |    |      |             |   +-----------+          |         |
> |  +------+    +------+             |   |           |          |         |
> |                                   |   |    L3     |          |         |
> |  +------+    +------+             +---+    tag    |          |         |
> |  |      |    |      |             |   |           |          |         |
> |  +------+    +------+             |   +-----------+          |         |
> |                                   |                          |         |
> +-----------------------------------+                          |         |
> +-----------------------------------+                         ++         |
> |  +------+    +------+             +--------------------------+         |
> |  |      |    |      |             |  +-----------+           |         |
> |  +------+    +------+             |  |           |           |         |
> |                                   |  |    L3     |           |         |
> |  +------+    +------+             +--+    tag    |           |         |
> |  |      |    |      |             |  |           |           |         |
> |  +------+    +------+             |  +-----------+           |         |
> |                                   |                          +---------+
> +-----------------------------------+
> 
> That means spreading tasks among clusters will bring more bandwidth
> while packing tasks within one cluster will lead to smaller cache
> synchronization latency. So both kernel and userspace will have
> a chance to leverage this topology to deploy tasks accordingly to
> achieve either smaller cache latency within one cluster or an even
> distribution of load among clusters for higher throughput.
> 
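[Editorial aside: on the userspace side, "deploy tasks accordingly" can
be as simple as pinning a task to one cluster's CPUs. A minimal sketch
using sched_setaffinity(2); the CPU numbers are supplied by the caller
and would really come from the cluster_cpus_list sysfs attribute:]

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/*
 * Pack the calling task onto one cluster by restricting it to that
 * cluster's CPUs, e.g. the CPUs parsed from
 * /sys/devices/system/cpu/cpuX/topology/cluster_cpus_list.
 * Returns 0 on success, -1 on error (see errno).
 */
static int pin_to_cluster(const int *cpus, size_t n)
{
	cpu_set_t set;
	size_t i;

	CPU_ZERO(&set);
	for (i = 0; i < n; i++)
		CPU_SET(cpus[i], &set);

	/* pid 0 means the calling thread */
	return sched_setaffinity(0, sizeof(set), &set);
}
```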
> This patch exposes the cluster topology to both kernel and userspace.
> Libraries like hwloc will learn about clusters via cluster_cpus and the
> related sysfs attributes. PoC of hwloc support at [2].
> 
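[Editorial aside: for tools that consume the new attributes, the
cluster_cpus_list format ("0-3,8-11,14,17") is straightforward to
parse. A sketch under the simplifying assumption of at most 64 CPUs;
real consumers such as hwloc use dynamically sized bitmaps:]

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse a sysfs cpulist string such as "0-3,8-11,14,17" (the format
 * of the new cluster_cpus_list attribute) into a CPU bitmask.
 * A fixed 64-bit mask keeps the sketch short; CPUs >= 64 are ignored.
 */
static uint64_t parse_cpulist(const char *s)
{
	uint64_t mask = 0;

	while (*s) {
		char *end;
		long lo, hi, cpu;

		while (*s == ' ' || *s == '\n')	/* tolerate "0-3, 8-11" */
			s++;
		if (!*s)
			break;

		lo = strtol(s, &end, 10);
		hi = lo;
		if (*end == '-')
			hi = strtol(end + 1, &end, 10);

		for (cpu = lo; cpu <= hi && cpu < 64; cpu++)
			mask |= UINT64_C(1) << cpu;

		if (*end != ',')
			break;
		s = end + 1;
	}
	return mask;
}
```

A real tool would first read the string from
/sys/devices/system/cpu/cpuX/topology/cluster_cpus_list.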
> Note this patch only handles the ACPI case.
> 
> Special consideration is needed for SMT processors, where it is
> necessary to move 2 levels up the hierarchy from the leaf nodes
> (thus skipping the processor core level).
> 
> Note that arm64 / ACPI does not provide any means of identifying
> a die level in the topology, but that may be unrelated to the cluster
> level.
> 
> [1] ACPI Specification 6.3 - section 5.2.29.1 processor hierarchy node
>      structure (Type 0)
> [2] https://github.com/hisilicon/hwloc/tree/linux-cluster
> 
> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
> ---
>   .../ABI/stable/sysfs-devices-system-cpu       | 15 +++++
>   Documentation/admin-guide/cputopology.rst     | 12 ++--
>   arch/arm64/kernel/topology.c                  |  2 +
>   drivers/acpi/pptt.c                           | 67 +++++++++++++++++++
>   drivers/base/arch_topology.c                  | 14 ++++
>   drivers/base/topology.c                       | 10 +++
>   include/linux/acpi.h                          |  5 ++
>   include/linux/arch_topology.h                 |  5 ++
>   include/linux/topology.h                      |  6 ++
>   9 files changed, 132 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/ABI/stable/sysfs-devices-system-cpu b/Documentation/ABI/stable/sysfs-devices-system-cpu
> index 516dafea03eb..3965ce504484 100644
> --- a/Documentation/ABI/stable/sysfs-devices-system-cpu
> +++ b/Documentation/ABI/stable/sysfs-devices-system-cpu
> @@ -42,6 +42,12 @@ Description:    the CPU core ID of cpuX. Typically it is the hardware platform's
>                   architecture and platform dependent.
>   Values:         integer
>   
> +What:           /sys/devices/system/cpu/cpuX/topology/cluster_id
> +Description:    the cluster ID of cpuX.  Typically it is the hardware platform's
> +                identifier (rather than the kernel's). The actual value is
> +                architecture and platform dependent.
> +Values:         integer
> +
>   What:           /sys/devices/system/cpu/cpuX/topology/book_id
>   Description:    the book ID of cpuX. Typically it is the hardware platform's
>                   identifier (rather than the kernel's). The actual value is
> @@ -85,6 +91,15 @@ Description:    human-readable list of CPUs within the same die.
>                   The format is like 0-3, 8-11, 14,17.
>   Values:         decimal list.
>   
> +What:           /sys/devices/system/cpu/cpuX/topology/cluster_cpus
> +Description:    internal kernel map of CPUs within the same cluster.
> +Values:         hexadecimal bitmask.
> +
> +What:           /sys/devices/system/cpu/cpuX/topology/cluster_cpus_list
> +Description:    human-readable list of CPUs within the same cluster.
> +                The format is like 0-3, 8-11, 14,17.
> +Values:         decimal list.
> +
>   What:           /sys/devices/system/cpu/cpuX/topology/book_siblings
>   Description:    internal kernel map of cpuX's hardware threads within the same
>                   book_id. it's only used on s390.
> diff --git a/Documentation/admin-guide/cputopology.rst b/Documentation/admin-guide/cputopology.rst
> index 8632a1db36e4..a5491949880d 100644
> --- a/Documentation/admin-guide/cputopology.rst
> +++ b/Documentation/admin-guide/cputopology.rst
> @@ -19,11 +19,13 @@ these macros in include/asm-XXX/topology.h::
>   
>   	#define topology_physical_package_id(cpu)
>   	#define topology_die_id(cpu)
> +	#define topology_cluster_id(cpu)
>   	#define topology_core_id(cpu)
>   	#define topology_book_id(cpu)
>   	#define topology_drawer_id(cpu)
>   	#define topology_sibling_cpumask(cpu)
>   	#define topology_core_cpumask(cpu)
> +	#define topology_cluster_cpumask(cpu)
>   	#define topology_die_cpumask(cpu)
>   	#define topology_book_cpumask(cpu)
>   	#define topology_drawer_cpumask(cpu)
> @@ -39,10 +41,12 @@ not defined by include/asm-XXX/topology.h:
>   
>   1) topology_physical_package_id: -1
>   2) topology_die_id: -1
> -3) topology_core_id: 0
> -4) topology_sibling_cpumask: just the given CPU
> -5) topology_core_cpumask: just the given CPU
> -6) topology_die_cpumask: just the given CPU
> +3) topology_cluster_id: -1
> +4) topology_core_id: 0
> +5) topology_sibling_cpumask: just the given CPU
> +6) topology_core_cpumask: just the given CPU
> +7) topology_cluster_cpumask: just the given CPU
> +8) topology_die_cpumask: just the given CPU
>   
>   For architectures that don't support books (CONFIG_SCHED_BOOK) there are no
>   default definitions for topology_book_id() and topology_book_cpumask().
> diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
> index 4dd14a6620c1..9ab78ad826e2 100644
> --- a/arch/arm64/kernel/topology.c
> +++ b/arch/arm64/kernel/topology.c
> @@ -103,6 +103,8 @@ int __init parse_acpi_topology(void)
>   			cpu_topology[cpu].thread_id  = -1;
>   			cpu_topology[cpu].core_id    = topology_id;
>   		}
> +		topology_id = find_acpi_cpu_topology_cluster(cpu);
> +		cpu_topology[cpu].cluster_id = topology_id;
>   		topology_id = find_acpi_cpu_topology_package(cpu);
>   		cpu_topology[cpu].package_id = topology_id;
>   
> diff --git a/drivers/acpi/pptt.c b/drivers/acpi/pptt.c
> index fe69dc518f31..701f61c01359 100644
> --- a/drivers/acpi/pptt.c
> +++ b/drivers/acpi/pptt.c
> @@ -746,6 +746,73 @@ int find_acpi_cpu_topology_package(unsigned int cpu)
>   					  ACPI_PPTT_PHYSICAL_PACKAGE);
>   }
>   
> +/**
> + * find_acpi_cpu_topology_cluster() - Determine a unique CPU cluster value
> + * @cpu: Kernel logical CPU number
> + *
> + * Determine a topology unique cluster ID for the given CPU/thread.
> + * This ID can then be used to group peers, which will have matching ids.
> + *
> + * The cluster, if present, is the level of topology above CPUs. In a
> + * multi-thread CPU, it will be the level above the CPU, not the thread.
> + * It may not exist in single CPU systems. In simple multi-CPU systems,
> + * it may be equal to the package topology level.
> + *
> + * Return: -ENOENT if the PPTT doesn't exist, the CPU cannot be found
> + * or there is no topology level above the CPU.
> + * Otherwise returns a value which represents the cluster for this CPU.
> + */
> +
> +int find_acpi_cpu_topology_cluster(unsigned int cpu)
> +{
> +	struct acpi_table_header *table;
> +	acpi_status status;
> +	struct acpi_pptt_processor *cpu_node, *cluster_node;
> +	u32 acpi_cpu_id;
> +	int retval;
> +	int is_thread;
> +
> +	status = acpi_get_table(ACPI_SIG_PPTT, 0, &table);
> +	if (ACPI_FAILURE(status)) {
> +		acpi_pptt_warn_missing();
> +		return -ENOENT;
> +	}
> +
> +	acpi_cpu_id = get_acpi_id_for_cpu(cpu);
> +	cpu_node = acpi_find_processor_node(table, acpi_cpu_id);
> +	if (cpu_node == NULL || !cpu_node->parent) {
> +		retval = -ENOENT;
> +		goto put_table;
> +	}
> +
> +	is_thread = cpu_node->flags & ACPI_PPTT_ACPI_PROCESSOR_IS_THREAD;
> +	cluster_node = fetch_pptt_node(table, cpu_node->parent);
> +	if (cluster_node == NULL) {
> +		retval = -ENOENT;
> +		goto put_table;
> +	}
> +	if (is_thread) {
> +		if (!cluster_node->parent) {
> +			retval = -ENOENT;
> +			goto put_table;
> +		}
> +		cluster_node = fetch_pptt_node(table, cluster_node->parent);
> +		if (cluster_node == NULL) {
> +			retval = -ENOENT;
> +			goto put_table;
> +		}
> +	}
> +	if (cluster_node->flags & ACPI_PPTT_ACPI_PROCESSOR_ID_VALID)
> +		retval = cluster_node->acpi_processor_id;
> +	else
> +		retval = ACPI_PTR_DIFF(cluster_node, table);
> +
> +put_table:
> +	acpi_put_table(table);
> +
> +	return retval;
> +}
> +
>   /**
>    * find_acpi_cpu_topology_hetero_id() - Get a core architecture tag
>    * @cpu: Kernel logical CPU number
> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> index 921312a8d957..5b1589adacaf 100644
> --- a/drivers/base/arch_topology.c
> +++ b/drivers/base/arch_topology.c
> @@ -598,6 +598,11 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
>   	return core_mask;
>   }
>   
> +const struct cpumask *cpu_clustergroup_mask(int cpu)
> +{
> +	return &cpu_topology[cpu].cluster_sibling;
> +}
> +
>   void update_siblings_masks(unsigned int cpuid)
>   {
>   	struct cpu_topology *cpu_topo, *cpuid_topo = &cpu_topology[cpuid];
> @@ -615,6 +620,11 @@ void update_siblings_masks(unsigned int cpuid)
>   		if (cpuid_topo->package_id != cpu_topo->package_id)
>   			continue;
>   
> +		if (cpuid_topo->cluster_id == cpu_topo->cluster_id) {
> +			cpumask_set_cpu(cpu, &cpuid_topo->cluster_sibling);
> +			cpumask_set_cpu(cpuid, &cpu_topo->cluster_sibling);
> +		}
> +
>   		cpumask_set_cpu(cpuid, &cpu_topo->core_sibling);
>   		cpumask_set_cpu(cpu, &cpuid_topo->core_sibling);
>   
> @@ -633,6 +643,9 @@ static void clear_cpu_topology(int cpu)
>   	cpumask_clear(&cpu_topo->llc_sibling);
>   	cpumask_set_cpu(cpu, &cpu_topo->llc_sibling);
>   
> +	cpumask_clear(&cpu_topo->cluster_sibling);
> +	cpumask_set_cpu(cpu, &cpu_topo->cluster_sibling);
> +
>   	cpumask_clear(&cpu_topo->core_sibling);
>   	cpumask_set_cpu(cpu, &cpu_topo->core_sibling);
>   	cpumask_clear(&cpu_topo->thread_sibling);
> @@ -648,6 +661,7 @@ void __init reset_cpu_topology(void)
>   
>   		cpu_topo->thread_id = -1;
>   		cpu_topo->core_id = -1;
> +		cpu_topo->cluster_id = -1;
>   		cpu_topo->package_id = -1;
>   		cpu_topo->llc_id = -1;
>   
> diff --git a/drivers/base/topology.c b/drivers/base/topology.c
> index 43c0940643f5..8f2b641d0b8c 100644
> --- a/drivers/base/topology.c
> +++ b/drivers/base/topology.c
> @@ -48,6 +48,9 @@ static DEVICE_ATTR_RO(physical_package_id);
>   define_id_show_func(die_id);
>   static DEVICE_ATTR_RO(die_id);
>   
> +define_id_show_func(cluster_id);
> +static DEVICE_ATTR_RO(cluster_id);
> +
>   define_id_show_func(core_id);
>   static DEVICE_ATTR_RO(core_id);
>   
> @@ -63,6 +66,10 @@ define_siblings_read_func(core_siblings, core_cpumask);
>   static BIN_ATTR_RO(core_siblings, 0);
>   static BIN_ATTR_RO(core_siblings_list, 0);
>   
> +define_siblings_read_func(cluster_cpus, cluster_cpumask);
> +static BIN_ATTR_RO(cluster_cpus, 0);
> +static BIN_ATTR_RO(cluster_cpus_list, 0);
> +
>   define_siblings_read_func(die_cpus, die_cpumask);
>   static BIN_ATTR_RO(die_cpus, 0);
>   static BIN_ATTR_RO(die_cpus_list, 0);
> @@ -94,6 +101,8 @@ static struct bin_attribute *bin_attrs[] = {
>   	&bin_attr_thread_siblings_list,
>   	&bin_attr_core_siblings,
>   	&bin_attr_core_siblings_list,
> +	&bin_attr_cluster_cpus,
> +	&bin_attr_cluster_cpus_list,
>   	&bin_attr_die_cpus,
>   	&bin_attr_die_cpus_list,
>   	&bin_attr_package_cpus,
> @@ -112,6 +121,7 @@ static struct bin_attribute *bin_attrs[] = {
>   static struct attribute *default_attrs[] = {
>   	&dev_attr_physical_package_id.attr,
>   	&dev_attr_die_id.attr,
> +	&dev_attr_cluster_id.attr,
>   	&dev_attr_core_id.attr,
>   #ifdef CONFIG_SCHED_BOOK
>   	&dev_attr_book_id.attr,
> diff --git a/include/linux/acpi.h b/include/linux/acpi.h
> index 72e4f7fd268c..6d65427e5f67 100644
> --- a/include/linux/acpi.h
> +++ b/include/linux/acpi.h
> @@ -1353,6 +1353,7 @@ static inline int lpit_read_residency_count_address(u64 *address)
>   #ifdef CONFIG_ACPI_PPTT
>   int acpi_pptt_cpu_is_thread(unsigned int cpu);
>   int find_acpi_cpu_topology(unsigned int cpu, int level);
> +int find_acpi_cpu_topology_cluster(unsigned int cpu);
>   int find_acpi_cpu_topology_package(unsigned int cpu);
>   int find_acpi_cpu_topology_hetero_id(unsigned int cpu);
>   int find_acpi_cpu_cache_topology(unsigned int cpu, int level);
> @@ -1365,6 +1366,10 @@ static inline int find_acpi_cpu_topology(unsigned int cpu, int level)
>   {
>   	return -EINVAL;
>   }
> +static inline int find_acpi_cpu_topology_cluster(unsigned int cpu)
> +{
> +	return -EINVAL;
> +}
>   static inline int find_acpi_cpu_topology_package(unsigned int cpu)
>   {
>   	return -EINVAL;
> diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
> index f180240dc95f..b97cea83b25e 100644
> --- a/include/linux/arch_topology.h
> +++ b/include/linux/arch_topology.h
> @@ -62,10 +62,12 @@ void topology_set_thermal_pressure(const struct cpumask *cpus,
>   struct cpu_topology {
>   	int thread_id;
>   	int core_id;
> +	int cluster_id;
>   	int package_id;
>   	int llc_id;
>   	cpumask_t thread_sibling;
>   	cpumask_t core_sibling;
> +	cpumask_t cluster_sibling;
>   	cpumask_t llc_sibling;
>   };
>   
> @@ -73,13 +75,16 @@ struct cpu_topology {
>   extern struct cpu_topology cpu_topology[NR_CPUS];
>   
>   #define topology_physical_package_id(cpu)	(cpu_topology[cpu].package_id)
> +#define topology_cluster_id(cpu)	(cpu_topology[cpu].cluster_id)
>   #define topology_core_id(cpu)		(cpu_topology[cpu].core_id)
>   #define topology_core_cpumask(cpu)	(&cpu_topology[cpu].core_sibling)
>   #define topology_sibling_cpumask(cpu)	(&cpu_topology[cpu].thread_sibling)
> +#define topology_cluster_cpumask(cpu)	(&cpu_topology[cpu].cluster_sibling)
>   #define topology_llc_cpumask(cpu)	(&cpu_topology[cpu].llc_sibling)
>   void init_cpu_topology(void);
>   void store_cpu_topology(unsigned int cpuid);
>   const struct cpumask *cpu_coregroup_mask(int cpu);
> +const struct cpumask *cpu_clustergroup_mask(int cpu);
>   void update_siblings_masks(unsigned int cpu);
>   void remove_cpu_topology(unsigned int cpuid);
>   void reset_cpu_topology(void);
> diff --git a/include/linux/topology.h b/include/linux/topology.h
> index 7634cd737061..80d27d717631 100644
> --- a/include/linux/topology.h
> +++ b/include/linux/topology.h
> @@ -186,6 +186,9 @@ static inline int cpu_to_mem(int cpu)
>   #ifndef topology_die_id
>   #define topology_die_id(cpu)			((void)(cpu), -1)
>   #endif
> +#ifndef topology_cluster_id
> +#define topology_cluster_id(cpu)		((void)(cpu), -1)
> +#endif
>   #ifndef topology_core_id
>   #define topology_core_id(cpu)			((void)(cpu), 0)
>   #endif
> @@ -195,6 +198,9 @@ static inline int cpu_to_mem(int cpu)
>   #ifndef topology_core_cpumask
>   #define topology_core_cpumask(cpu)		cpumask_of(cpu)
>   #endif
> +#ifndef topology_cluster_cpumask
> +#define topology_cluster_cpumask(cpu)		cpumask_of(cpu)
> +#endif
>   #ifndef topology_die_cpumask
>   #define topology_die_cpumask(cpu)		cpumask_of(cpu)
>   #endif


^ permalink raw reply	[flat|nested] 16+ messages in thread

>   
>   1) topology_physical_package_id: -1
>   2) topology_die_id: -1
> -3) topology_core_id: 0
> -4) topology_sibling_cpumask: just the given CPU
> -5) topology_core_cpumask: just the given CPU
> -6) topology_die_cpumask: just the given CPU
> +3) topology_cluster_id: -1
> +4) topology_core_id: 0
> +5) topology_sibling_cpumask: just the given CPU
> +6) topology_core_cpumask: just the given CPU
> +7) topology_cluster_cpumask: just the given CPU
> +8) topology_die_cpumask: just the given CPU
>   
>   For architectures that don't support books (CONFIG_SCHED_BOOK) there are no
>   default definitions for topology_book_id() and topology_book_cpumask().
> diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
> index 4dd14a6620c1..9ab78ad826e2 100644
> --- a/arch/arm64/kernel/topology.c
> +++ b/arch/arm64/kernel/topology.c
> @@ -103,6 +103,8 @@ int __init parse_acpi_topology(void)
>   			cpu_topology[cpu].thread_id  = -1;
>   			cpu_topology[cpu].core_id    = topology_id;
>   		}
> +		topology_id = find_acpi_cpu_topology_cluster(cpu);
> +		cpu_topology[cpu].cluster_id = topology_id;
>   		topology_id = find_acpi_cpu_topology_package(cpu);
>   		cpu_topology[cpu].package_id = topology_id;
>   
> diff --git a/drivers/acpi/pptt.c b/drivers/acpi/pptt.c
> index fe69dc518f31..701f61c01359 100644
> --- a/drivers/acpi/pptt.c
> +++ b/drivers/acpi/pptt.c
> @@ -746,6 +746,73 @@ int find_acpi_cpu_topology_package(unsigned int cpu)
>   					  ACPI_PPTT_PHYSICAL_PACKAGE);
>   }
>   
> +/**
> + * find_acpi_cpu_topology_cluster() - Determine a unique CPU cluster value
> + * @cpu: Kernel logical CPU number
> + *
> + * Determine a topology unique cluster ID for the given CPU/thread.
> + * This ID can then be used to group peers, which will have matching ids.
> + *
> + * The cluster, if present is the level of topology above CPUs. In a
> + * multi-thread CPU, it will be the level above the CPU, not the thread.
> + * It may not exist in single CPU systems. In simple multi-CPU systems,
> + * it may be equal to the package topology level.
> + *
> + * Return: -ENOENT if the PPTT doesn't exist, the CPU cannot be found
> + * or there is no topology level above the CPU.
> + * Otherwise returns a value which represents the cluster for this CPU.
> + */
> +
> +int find_acpi_cpu_topology_cluster(unsigned int cpu)
> +{
> +	struct acpi_table_header *table;
> +	acpi_status status;
> +	struct acpi_pptt_processor *cpu_node, *cluster_node;
> +	u32 acpi_cpu_id;
> +	int retval;
> +	int is_thread;
> +
> +	status = acpi_get_table(ACPI_SIG_PPTT, 0, &table);
> +	if (ACPI_FAILURE(status)) {
> +		acpi_pptt_warn_missing();
> +		return -ENOENT;
> +	}
> +
> +	acpi_cpu_id = get_acpi_id_for_cpu(cpu);
> +	cpu_node = acpi_find_processor_node(table, acpi_cpu_id);
> +	if (cpu_node == NULL || !cpu_node->parent) {
> +		retval = -ENOENT;
> +		goto put_table;
> +	}
> +
> +	is_thread = cpu_node->flags & ACPI_PPTT_ACPI_PROCESSOR_IS_THREAD;
> +	cluster_node = fetch_pptt_node(table, cpu_node->parent);
> +	if (cluster_node == NULL) {
> +		retval = -ENOENT;
> +		goto put_table;
> +	}
> +	if (is_thread) {
> +		if (!cluster_node->parent) {
> +			retval = -ENOENT;
> +			goto put_table;
> +		}
> +		cluster_node = fetch_pptt_node(table, cluster_node->parent);
> +		if (cluster_node == NULL) {
> +			retval = -ENOENT;
> +			goto put_table;
> +		}
> +	}
> +	if (cluster_node->flags & ACPI_PPTT_ACPI_PROCESSOR_ID_VALID)
> +		retval = cluster_node->acpi_processor_id;
> +	else
> +		retval = ACPI_PTR_DIFF(cluster_node, table);
> +
> +put_table:
> +	acpi_put_table(table);
> +
> +	return retval;
> +}
> +
>   /**
>    * find_acpi_cpu_topology_hetero_id() - Get a core architecture tag
>    * @cpu: Kernel logical CPU number
> diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> index 921312a8d957..5b1589adacaf 100644
> --- a/drivers/base/arch_topology.c
> +++ b/drivers/base/arch_topology.c
> @@ -598,6 +598,11 @@ const struct cpumask *cpu_coregroup_mask(int cpu)
>   	return core_mask;
>   }
>   
> +const struct cpumask *cpu_clustergroup_mask(int cpu)
> +{
> +	return &cpu_topology[cpu].cluster_sibling;
> +}
> +
>   void update_siblings_masks(unsigned int cpuid)
>   {
>   	struct cpu_topology *cpu_topo, *cpuid_topo = &cpu_topology[cpuid];
> @@ -615,6 +620,11 @@ void update_siblings_masks(unsigned int cpuid)
>   		if (cpuid_topo->package_id != cpu_topo->package_id)
>   			continue;
>   
> +		if (cpuid_topo->cluster_id == cpu_topo->cluster_id) {
> +			cpumask_set_cpu(cpu, &cpuid_topo->cluster_sibling);
> +			cpumask_set_cpu(cpuid, &cpu_topo->cluster_sibling);
> +		}
> +
>   		cpumask_set_cpu(cpuid, &cpu_topo->core_sibling);
>   		cpumask_set_cpu(cpu, &cpuid_topo->core_sibling);
>   
> @@ -633,6 +643,9 @@ static void clear_cpu_topology(int cpu)
>   	cpumask_clear(&cpu_topo->llc_sibling);
>   	cpumask_set_cpu(cpu, &cpu_topo->llc_sibling);
>   
> +	cpumask_clear(&cpu_topo->cluster_sibling);
> +	cpumask_set_cpu(cpu, &cpu_topo->cluster_sibling);
> +
>   	cpumask_clear(&cpu_topo->core_sibling);
>   	cpumask_set_cpu(cpu, &cpu_topo->core_sibling);
>   	cpumask_clear(&cpu_topo->thread_sibling);
> @@ -648,6 +661,7 @@ void __init reset_cpu_topology(void)
>   
>   		cpu_topo->thread_id = -1;
>   		cpu_topo->core_id = -1;
> +		cpu_topo->cluster_id = -1;
>   		cpu_topo->package_id = -1;
>   		cpu_topo->llc_id = -1;
>   
> diff --git a/drivers/base/topology.c b/drivers/base/topology.c
> index 43c0940643f5..8f2b641d0b8c 100644
> --- a/drivers/base/topology.c
> +++ b/drivers/base/topology.c
> @@ -48,6 +48,9 @@ static DEVICE_ATTR_RO(physical_package_id);
>   define_id_show_func(die_id);
>   static DEVICE_ATTR_RO(die_id);
>   
> +define_id_show_func(cluster_id);
> +static DEVICE_ATTR_RO(cluster_id);
> +
>   define_id_show_func(core_id);
>   static DEVICE_ATTR_RO(core_id);
>   
> @@ -63,6 +66,10 @@ define_siblings_read_func(core_siblings, core_cpumask);
>   static BIN_ATTR_RO(core_siblings, 0);
>   static BIN_ATTR_RO(core_siblings_list, 0);
>   
> +define_siblings_read_func(cluster_cpus, cluster_cpumask);
> +static BIN_ATTR_RO(cluster_cpus, 0);
> +static BIN_ATTR_RO(cluster_cpus_list, 0);
> +
>   define_siblings_read_func(die_cpus, die_cpumask);
>   static BIN_ATTR_RO(die_cpus, 0);
>   static BIN_ATTR_RO(die_cpus_list, 0);
> @@ -94,6 +101,8 @@ static struct bin_attribute *bin_attrs[] = {
>   	&bin_attr_thread_siblings_list,
>   	&bin_attr_core_siblings,
>   	&bin_attr_core_siblings_list,
> +	&bin_attr_cluster_cpus,
> +	&bin_attr_cluster_cpus_list,
>   	&bin_attr_die_cpus,
>   	&bin_attr_die_cpus_list,
>   	&bin_attr_package_cpus,
> @@ -112,6 +121,7 @@ static struct bin_attribute *bin_attrs[] = {
>   static struct attribute *default_attrs[] = {
>   	&dev_attr_physical_package_id.attr,
>   	&dev_attr_die_id.attr,
> +	&dev_attr_cluster_id.attr,
>   	&dev_attr_core_id.attr,
>   #ifdef CONFIG_SCHED_BOOK
>   	&dev_attr_book_id.attr,
> diff --git a/include/linux/acpi.h b/include/linux/acpi.h
> index 72e4f7fd268c..6d65427e5f67 100644
> --- a/include/linux/acpi.h
> +++ b/include/linux/acpi.h
> @@ -1353,6 +1353,7 @@ static inline int lpit_read_residency_count_address(u64 *address)
>   #ifdef CONFIG_ACPI_PPTT
>   int acpi_pptt_cpu_is_thread(unsigned int cpu);
>   int find_acpi_cpu_topology(unsigned int cpu, int level);
> +int find_acpi_cpu_topology_cluster(unsigned int cpu);
>   int find_acpi_cpu_topology_package(unsigned int cpu);
>   int find_acpi_cpu_topology_hetero_id(unsigned int cpu);
>   int find_acpi_cpu_cache_topology(unsigned int cpu, int level);
> @@ -1365,6 +1366,10 @@ static inline int find_acpi_cpu_topology(unsigned int cpu, int level)
>   {
>   	return -EINVAL;
>   }
> +static inline int find_acpi_cpu_topology_cluster(unsigned int cpu)
> +{
> +	return -EINVAL;
> +}
>   static inline int find_acpi_cpu_topology_package(unsigned int cpu)
>   {
>   	return -EINVAL;
> diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
> index f180240dc95f..b97cea83b25e 100644
> --- a/include/linux/arch_topology.h
> +++ b/include/linux/arch_topology.h
> @@ -62,10 +62,12 @@ void topology_set_thermal_pressure(const struct cpumask *cpus,
>   struct cpu_topology {
>   	int thread_id;
>   	int core_id;
> +	int cluster_id;
>   	int package_id;
>   	int llc_id;
>   	cpumask_t thread_sibling;
>   	cpumask_t core_sibling;
> +	cpumask_t cluster_sibling;
>   	cpumask_t llc_sibling;
>   };
>   
> @@ -73,13 +75,16 @@ struct cpu_topology {
>   extern struct cpu_topology cpu_topology[NR_CPUS];
>   
>   #define topology_physical_package_id(cpu)	(cpu_topology[cpu].package_id)
> +#define topology_cluster_id(cpu)	(cpu_topology[cpu].cluster_id)
>   #define topology_core_id(cpu)		(cpu_topology[cpu].core_id)
>   #define topology_core_cpumask(cpu)	(&cpu_topology[cpu].core_sibling)
>   #define topology_sibling_cpumask(cpu)	(&cpu_topology[cpu].thread_sibling)
> +#define topology_cluster_cpumask(cpu)	(&cpu_topology[cpu].cluster_sibling)
>   #define topology_llc_cpumask(cpu)	(&cpu_topology[cpu].llc_sibling)
>   void init_cpu_topology(void);
>   void store_cpu_topology(unsigned int cpuid);
>   const struct cpumask *cpu_coregroup_mask(int cpu);
> +const struct cpumask *cpu_clustergroup_mask(int cpu);
>   void update_siblings_masks(unsigned int cpu);
>   void remove_cpu_topology(unsigned int cpuid);
>   void reset_cpu_topology(void);
> diff --git a/include/linux/topology.h b/include/linux/topology.h
> index 7634cd737061..80d27d717631 100644
> --- a/include/linux/topology.h
> +++ b/include/linux/topology.h
> @@ -186,6 +186,9 @@ static inline int cpu_to_mem(int cpu)
>   #ifndef topology_die_id
>   #define topology_die_id(cpu)			((void)(cpu), -1)
>   #endif
> +#ifndef topology_cluster_id
> +#define topology_cluster_id(cpu)		((void)(cpu), -1)
> +#endif
>   #ifndef topology_core_id
>   #define topology_core_id(cpu)			((void)(cpu), 0)
>   #endif
> @@ -195,6 +198,9 @@ static inline int cpu_to_mem(int cpu)
>   #ifndef topology_core_cpumask
>   #define topology_core_cpumask(cpu)		cpumask_of(cpu)
>   #endif
> +#ifndef topology_cluster_cpumask
> +#define topology_cluster_cpumask(cpu)		cpumask_of(cpu)
> +#endif
>   #ifndef topology_die_cpumask
>   #define topology_die_cpumask(cpu)		cpumask_of(cpu)
>   #endif

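The cluster_cpus_list attribute documented in the ABI hunk above uses the standard Linux cpulist format (e.g. "0-3,8-11,14,17"). A small Python helper, my own sketch rather than any kernel interface, showing how userspace might parse it:

```python
def parse_cpulist(s):
    """Parse a Linux cpulist string like '0-3,8-11,14,17' into a sorted list."""
    cpus = set()
    for part in s.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)

# e.g. the contents of /sys/devices/system/cpu/cpu0/topology/cluster_cpus_list
print(parse_cpulist("0-3,8-11,14,17"))  # [0, 1, 2, 3, 8, 9, 10, 11, 14, 17]
```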


* Re: [BUG] Re: [PATCH 1/3] topology: Represent clusters of CPUs within a die
  2022-05-06 20:24     ` Jeremy Linton
@ 2022-05-09 10:15       ` Jonathan Cameron
  -1 siblings, 0 replies; 16+ messages in thread
From: Jonathan Cameron @ 2022-05-09 10:15 UTC (permalink / raw)
  To: Jeremy Linton
  Cc: Barry Song, bp, catalin.marinas, dietmar.eggemann, gregkh, hpa,
	juri.lelli, bristot, lenb, mgorman, mingo, peterz, rjw,
	sudeep.holla, tglx, aubrey.li, bsegall, guodong.xu, liguozhu,
	linux-acpi, linux-arm-kernel, linux-kernel, mark.rutland,
	msys.mizuma, prime.zeng, rostedt, tim.c.chen, valentin.schneider,
	vincent.guittot, will, x86, xuwei5, yangyicong, linuxarm,
	Tian Tao, Barry Song, Darren Hart

On Fri, 6 May 2022 15:24:27 -0500
Jeremy Linton <jeremy.linton@arm.com> wrote:

> Hi,
> 
> It seems this set is:
> 
> "BUG: arch topology borken"
>                     ^code
> 
> on machines that don't actually have clusters, or that provide a 
> representation which might be taken for a cluster. The Ampere Altra, for 
> one. So, I guess it's my job to relay what I was informed of when I 
> initially proposed something similar a few years back.
> 
> Neither the ACPI/PPTT spec nor the Arm architectural spec mandates the 
> concept of a "cluster", particularly in the form of a system with cores 
> sharing the L2, which IIRC is the case for the Kunpeng.

It is not. Kunpeng 920 shares the L3 tag cache, but not the L2 cache (which
is private to each core).
As such the existence of a cluster is not distinguished by sharing
of any cache resources that are in PPTT.  There is an argument for potentially
adding more types of resource to PPTT to give a richer description.

Whilst ACPI doesn't mandate a cluster (there is an example, though that happens
to have L3 shared across the cluster), it does allow for additional
hierarchy description. Cluster is just a name for such an extra level.
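To make that "extra level" concrete: find_acpi_cpu_topology_cluster() in the patch simply walks one step up the PPTT parent chain from the CPU's leaf node (two steps for an SMT thread, to skip the core level). A toy Python model of that walk, using a simplified node structure of my own rather than the real PPTT layout:

```python
class Node:
    """Minimal stand-in for a PPTT processor hierarchy node."""
    def __init__(self, name, parent=None, is_thread=False):
        self.name = name
        self.parent = parent
        self.is_thread = is_thread

def find_cluster(leaf):
    """Return the node treated as the cluster, or None (the -ENOENT cases)."""
    node = leaf.parent
    if node is None:
        return None
    if leaf.is_thread:
        # For SMT, the immediate parent is the core; go one more level up.
        node = node.parent
    return node

package = Node("package")
cluster = Node("cluster", parent=package)
core = Node("core", parent=cluster)
cpu = Node("cpu", parent=cluster)                     # non-SMT leaf
thread = Node("thread", parent=core, is_thread=True)  # SMT leaf

print(find_cluster(cpu).name)     # cluster
print(find_cluster(thread).name)  # cluster
```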

> And it tends to 
> be a shared L2 which gives the most bang for the buck (or was when I was 
> testing/benchmarking all this, on aarch64) from scheduler changes which 
> create cluster level scheduling domains.
> But OTOH, things like specJBB 
> didn't really like those smaller MC levels (which I suspect are hurt by 
> this change, without running a full benchmark suite, especially on 
> something like the above Ampere, given what is happening to its 
> scheduling domains).
> 
> So, the one takeaway I can give is this: the code below, which is 
> attempting to create a cluster level, should be a bit more intelligent 
> about whether there is an actual cluster.

I agree that more intelligence is needed, though I think that belongs
in the interpretation of the cluster level.  This particular patch
should present that information in a consistent fashion.  My understanding
is it is consistent with how other levels have been presented in that
it's perfectly acceptable to have multiple levels that can be collapsed
by the users of the description. (perhaps I'm wrong on that?)

> A first-order approximation 
> might be adding a check to see if the node immediately above the CPU 
> contains an L2 and that it's shared. 

That rules out our clusters, so not a great starting point :)

Darren Hart's recent set for Ampere Altra is fixing a different combination
but is in some sense similar in that it corrects an assumption that turned
out to be false in the user of the topology description whilst leaving the
description alone.

> A better fix, of course, is the 
> reason this wasn't previously done, and that is to convince the ACPI 
> committee to standardize a CLUSTER level flag which could be utilized by 
> a firmware/machine manufacturer to decide whether cluster-level 
> scheduling provides an advantage, and simply not do it on machines which 
> don't flag CLUSTER levels because it's not advantageous.

While I obviously can't predict discussions in ASWG, my gut feeling
is that would be a non starter with questions along the lines of:

1) Why is this level special? The spec already defines a hierarchical
   description with caches described at each level, so you can infer
   what is intended.  If we define cluster, we'll also need to define
   super cluster (we have designs with super clusters and it's only going
   to get worse as systems continue to get bigger.)
2) If an architecture does not share resources at a given level in a way
   that will have significant impact on scheduling decisions, don't present
   the level.  So if no advantage is seen, what is the level doing there?

Thanks

Jonathan

> 
> 
> Thanks,
> 
> 
> 
> On 8/19/21 20:30, Barry Song wrote:
> > From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > 
> > Both ACPI and DT provide the ability to describe additional layers of
> > topology between that of individual cores and higher level constructs
> > such as the level at which the last level cache is shared.
> > In ACPI this can be represented in PPTT as a Processor Hierarchy
> > Node Structure [1] that is the parent of the CPU cores and in turn
> > has a parent Processor Hierarchy Nodes Structure representing
> > a higher level of topology.
> > 
> > For example Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each
> > cluster has 4 cpus. All clusters share L3 cache data, but each cluster
> > has local L3 tag. On the other hand, each clusters will share some
> > internal system bus.
> > 
> > +-----------------------------------+                          +---------+
> > |  +------+    +------+            +---------------------------+         |
> > |  | CPU0 |    | cpu1 |             |    +-----------+         |         |
> > |  +------+    +------+             |    |           |         |         |
> > |                                   +----+    L3     |         |         |
> > |  +------+    +------+   cluster   |    |    tag    |         |         |
> > |  | CPU2 |    | CPU3 |             |    |           |         |         |
> > |  +------+    +------+             |    +-----------+         |         |
> > |                                   |                          |         |
> > +-----------------------------------+                          |         |
> > +-----------------------------------+                          |         |
> > |  +------+    +------+             +--------------------------+         |
> > |  |      |    |      |             |    +-----------+         |         |
> > |  +------+    +------+             |    |           |         |         |
> > |                                   |    |    L3     |         |         |
> > |  +------+    +------+             +----+    tag    |         |         |
> > |  |      |    |      |             |    |           |         |         |
> > |  +------+    +------+             |    +-----------+         |         |
> > |                                   |                          |         |
> > +-----------------------------------+                          |   L3    |
> >                                                                 |   data  |
> > +-----------------------------------+                          |         |
> > |  +------+    +------+             |    +-----------+         |         |
> > |  |      |    |      |             |    |           |         |         |
> > |  +------+    +------+             +----+    L3     |         |         |
> > |                                   |    |    tag    |         |         |
> > |  +------+    +------+             |    |           |         |         |
> > |  |      |    |      |            ++    +-----------+         |         |
> > |  +------+    +------+            |---------------------------+         |
> > +-----------------------------------|                          |         |
> > +-----------------------------------|                          |         |
> > |  +------+    +------+            +---------------------------+         |
> > |  |      |    |      |             |    +-----------+         |         |
> > |  +------+    +------+             |    |           |         |         |
> > |                                   +----+    L3     |         |         |
> > |  +------+    +------+             |    |    tag    |         |         |
> > |  |      |    |      |             |    |           |         |         |
> > |  +------+    +------+             |    +-----------+         |         |
> > |                                   |                          |         |
> > +-----------------------------------+                          |         |
> > +-----------------------------------+                          |         |
> > |  +------+    +------+             +--------------------------+         |
> > |  |      |    |      |             |   +-----------+          |         |
> > |  +------+    +------+             |   |           |          |         |
> > |                                   |   |    L3     |          |         |
> > |  +------+    +------+             +---+    tag    |          |         |
> > |  |      |    |      |             |   |           |          |         |
> > |  +------+    +------+             |   +-----------+          |         |
> > |                                   |                          |         |
> > +-----------------------------------+                          |         |
> > +-----------------------------------+                         ++         |
> > |  +------+    +------+             +--------------------------+         |
> > |  |      |    |      |             |  +-----------+           |         |
> > |  +------+    +------+             |  |           |           |         |
> > |                                   |  |    L3     |           |         |
> > |  +------+    +------+             +--+    tag    |           |         |
> > |  |      |    |      |             |  |           |           |         |
> > |  +------+    +------+             |  +-----------+           |         |
> > |                                   |                          +---------+
> > +-----------------------------------+
> > 
> > That means spreading tasks among clusters will bring more bandwidth
> > while packing tasks within one cluster will lead to smaller cache
> > synchronization latency. So both kernel and userspace will have
> > a chance to leverage this topology to deploy tasks accordingly to
> > achieve either smaller cache latency within one cluster or an even
> > distribution of load among clusters for higher throughput.
> > 
> > This patch exposes cluster topology to both kernel and userspace.
> > [snip: remainder of the patch quoted verbatim; see the full patch
> > earlier in the thread]
> >   	&bin_attr_core_siblings,
> >   	&bin_attr_core_siblings_list,
> > +	&bin_attr_cluster_cpus,
> > +	&bin_attr_cluster_cpus_list,
> >   	&bin_attr_die_cpus,
> >   	&bin_attr_die_cpus_list,
> >   	&bin_attr_package_cpus,
> > @@ -112,6 +121,7 @@ static struct bin_attribute *bin_attrs[] = {
> >   static struct attribute *default_attrs[] = {
> >   	&dev_attr_physical_package_id.attr,
> >   	&dev_attr_die_id.attr,
> > +	&dev_attr_cluster_id.attr,
> >   	&dev_attr_core_id.attr,
> >   #ifdef CONFIG_SCHED_BOOK
> >   	&dev_attr_book_id.attr,
> > diff --git a/include/linux/acpi.h b/include/linux/acpi.h
> > index 72e4f7fd268c..6d65427e5f67 100644
> > --- a/include/linux/acpi.h
> > +++ b/include/linux/acpi.h
> > @@ -1353,6 +1353,7 @@ static inline int lpit_read_residency_count_address(u64 *address)
> >   #ifdef CONFIG_ACPI_PPTT
> >   int acpi_pptt_cpu_is_thread(unsigned int cpu);
> >   int find_acpi_cpu_topology(unsigned int cpu, int level);
> > +int find_acpi_cpu_topology_cluster(unsigned int cpu);
> >   int find_acpi_cpu_topology_package(unsigned int cpu);
> >   int find_acpi_cpu_topology_hetero_id(unsigned int cpu);
> >   int find_acpi_cpu_cache_topology(unsigned int cpu, int level);
> > @@ -1365,6 +1366,10 @@ static inline int find_acpi_cpu_topology(unsigned int cpu, int level)
> >   {
> >   	return -EINVAL;
> >   }
> > +static inline int find_acpi_cpu_topology_cluster(unsigned int cpu)
> > +{
> > +	return -EINVAL;
> > +}
> >   static inline int find_acpi_cpu_topology_package(unsigned int cpu)
> >   {
> >   	return -EINVAL;
> > diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
> > index f180240dc95f..b97cea83b25e 100644
> > --- a/include/linux/arch_topology.h
> > +++ b/include/linux/arch_topology.h
> > @@ -62,10 +62,12 @@ void topology_set_thermal_pressure(const struct cpumask *cpus,
> >   struct cpu_topology {
> >   	int thread_id;
> >   	int core_id;
> > +	int cluster_id;
> >   	int package_id;
> >   	int llc_id;
> >   	cpumask_t thread_sibling;
> >   	cpumask_t core_sibling;
> > +	cpumask_t cluster_sibling;
> >   	cpumask_t llc_sibling;
> >   };
> >   
> > @@ -73,13 +75,16 @@ struct cpu_topology {
> >   extern struct cpu_topology cpu_topology[NR_CPUS];
> >   
> >   #define topology_physical_package_id(cpu)	(cpu_topology[cpu].package_id)
> > +#define topology_cluster_id(cpu)	(cpu_topology[cpu].cluster_id)
> >   #define topology_core_id(cpu)		(cpu_topology[cpu].core_id)
> >   #define topology_core_cpumask(cpu)	(&cpu_topology[cpu].core_sibling)
> >   #define topology_sibling_cpumask(cpu)	(&cpu_topology[cpu].thread_sibling)
> > +#define topology_cluster_cpumask(cpu)	(&cpu_topology[cpu].cluster_sibling)
> >   #define topology_llc_cpumask(cpu)	(&cpu_topology[cpu].llc_sibling)
> >   void init_cpu_topology(void);
> >   void store_cpu_topology(unsigned int cpuid);
> >   const struct cpumask *cpu_coregroup_mask(int cpu);
> > +const struct cpumask *cpu_clustergroup_mask(int cpu);
> >   void update_siblings_masks(unsigned int cpu);
> >   void remove_cpu_topology(unsigned int cpuid);
> >   void reset_cpu_topology(void);
> > diff --git a/include/linux/topology.h b/include/linux/topology.h
> > index 7634cd737061..80d27d717631 100644
> > --- a/include/linux/topology.h
> > +++ b/include/linux/topology.h
> > @@ -186,6 +186,9 @@ static inline int cpu_to_mem(int cpu)
> >   #ifndef topology_die_id
> >   #define topology_die_id(cpu)			((void)(cpu), -1)
> >   #endif
> > +#ifndef topology_cluster_id
> > +#define topology_cluster_id(cpu)		((void)(cpu), -1)
> > +#endif
> >   #ifndef topology_core_id
> >   #define topology_core_id(cpu)			((void)(cpu), 0)
> >   #endif
> > @@ -195,6 +198,9 @@ static inline int cpu_to_mem(int cpu)
> >   #ifndef topology_core_cpumask
> >   #define topology_core_cpumask(cpu)		cpumask_of(cpu)
> >   #endif
> > +#ifndef topology_cluster_cpumask
> > +#define topology_cluster_cpumask(cpu)		cpumask_of(cpu)
> > +#endif
> >   #ifndef topology_die_cpumask
> >   #define topology_die_cpumask(cpu)		cpumask_of(cpu)
> >   #endif  
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [BUG] Re: [PATCH 1/3] topology: Represent clusters of CPUs within a die
@ 2022-05-09 10:15       ` Jonathan Cameron
  0 siblings, 0 replies; 16+ messages in thread
From: Jonathan Cameron @ 2022-05-09 10:15 UTC (permalink / raw)
  To: Jeremy Linton
  Cc: Barry Song, bp, catalin.marinas, dietmar.eggemann, gregkh, hpa,
	juri.lelli, bristot, lenb, mgorman, mingo, peterz, rjw,
	sudeep.holla, tglx, aubrey.li, bsegall, guodong.xu, liguozhu,
	linux-acpi, linux-arm-kernel, linux-kernel, mark.rutland,
	msys.mizuma, prime.zeng, rostedt, tim.c.chen, valentin.schneider,
	vincent.guittot, will, x86, xuwei5, yangyicong, linuxarm,
	Tian Tao, Barry Song, Darren Hart

On Fri, 6 May 2022 15:24:27 -0500
Jeremy Linton <jeremy.linton@arm.com> wrote:

> Hi,
> 
> It seems this set is:
> 
> "BUG: arch topology borken"
>                     ^code
> 
> on machines that don't actually have clusters, or that provide a 
> representation which might be taken for a cluster. The Ampere Altra, for 
> one. So I guess it's my job to relay what I was informed of when I 
> initially proposed something similar a few years back.
> 
> Neither the ACPI/PPTT spec nor the Arm architectural spec mandate the 
> concept of a "cluster" particularly in the form of a system with cores 
> sharing the L2, which IIRC is the case for the Kunpeng.

It is not. Kunpeng 920 shares the L3 tag cache, but not the L2 cache (which is
private for each core).
As such the existence of a cluster is not distinguished by sharing
of any cache resources that are in PPTT.  There is an argument for potentially
adding more types of resource to PPTT to give a richer description.

Whilst ACPI doesn't mandate a cluster (there is an example, though that happens
to have L3 shared across the cluster), it does allow for additional
hierarchy description. Cluster is just a name for such an extra level.

> And it tends to 
> be a shared L2 which gives the most bang for the buck (or was when I was 
> testing/benchmarking all this, on aarch64) from scheduler changes which 
> create cluster level scheduling domains.
> But OTOH, things like SPECjbb 
> didn't really like those smaller MC levels (which I suspect is hurt by 
> this change, without running a full benchmark suite, especially on 
> something like the above Ampere, given what is happening to its 
> scheduling domains).
> 
> So, the one takeaway I can give is this: the code below, which is 
> attempting to create a cluster level, should be a bit more intelligent 
> about whether there is an actual cluster.

I agree that more intelligence is needed, though I think that belongs
in the interpretation of the cluster level.  This particular patch
should present that information in a consistent fashion.  My understanding
is it is consistent with how other levels have been presented in that
it's perfectly acceptable to have multiple levels that can be collapsed
by the users of the description. (perhaps I'm wrong on that?)

> A first-order approximation 
> might be adding a check to see if the node immediately above the CPU 
> contains an L2 and that it is shared. 

That rules out our clusters, so not a great starting point :)

Darren Hart's recent set for Ampere Altra is fixing a different combination
but is in some sense similar in that it corrects an assumption that turned
out to be false in the user of the topology description whilst leaving the
description alone.

> A better fix, of course, is the 
> reason this wasn't previously done, and that is to convince the ACPI 
> committee to standardize a CLUSTER level flag which could be utilized by 
> a firmware/machine manufacturer to decide whether cluster-level 
> scheduling provides an advantage, and simply not do it on machines which 
> don't flag CLUSTER levels because it's not advantageous.

While I obviously can't predict discussions in ASWG, my gut feeling
is that it would be a non-starter, with questions along the lines of:

1) Why is this level special? The spec already defines a hierarchical
   description with caches described at each level, so you can infer
   what is intended.  If we define cluster, we'll also need to define
   super cluster (we have designs with super clusters and it's only going
   to get worse as systems continue to get bigger).
2) If an architecture does not share resources at a given level that will
   have significant impact on scheduling decisions, don't present the
   level.  So if no advantage is seen, what is it doing there?

Thanks

Jonathan

> 
> 
> Thanks,
> 
> 
> 
> On 8/19/21 20:30, Barry Song wrote:
> > From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > 
> > Both ACPI and DT provide the ability to describe additional layers of
> > topology between that of individual cores and higher level constructs
> > such as the level at which the last level cache is shared.
> > In ACPI this can be represented in PPTT as a Processor Hierarchy
> > Node Structure [1] that is the parent of the CPU cores and in turn
> > has a parent Processor Hierarchy Node Structure representing
> > a higher level of topology.
> > 
> > For example Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each
> > cluster has 4 CPUs. All clusters share L3 cache data, but each cluster
> > has a local L3 tag. In addition, the CPUs within each cluster share some
> > internal system bus.
> > 
> > +-----------------------------------+                          +---------+
> > |  +------+    +------+            +---------------------------+         |
> > |  | CPU0 |    | cpu1 |             |    +-----------+         |         |
> > |  +------+    +------+             |    |           |         |         |
> > |                                   +----+    L3     |         |         |
> > |  +------+    +------+   cluster   |    |    tag    |         |         |
> > |  | CPU2 |    | CPU3 |             |    |           |         |         |
> > |  +------+    +------+             |    +-----------+         |         |
> > |                                   |                          |         |
> > +-----------------------------------+                          |         |
> > +-----------------------------------+                          |         |
> > |  +------+    +------+             +--------------------------+         |
> > |  |      |    |      |             |    +-----------+         |         |
> > |  +------+    +------+             |    |           |         |         |
> > |                                   |    |    L3     |         |         |
> > |  +------+    +------+             +----+    tag    |         |         |
> > |  |      |    |      |             |    |           |         |         |
> > |  +------+    +------+             |    +-----------+         |         |
> > |                                   |                          |         |
> > +-----------------------------------+                          |   L3    |
> >                                                                 |   data  |
> > +-----------------------------------+                          |         |
> > |  +------+    +------+             |    +-----------+         |         |
> > |  |      |    |      |             |    |           |         |         |
> > |  +------+    +------+             +----+    L3     |         |         |
> > |                                   |    |    tag    |         |         |
> > |  +------+    +------+             |    |           |         |         |
> > |  |      |    |      |            ++    +-----------+         |         |
> > |  +------+    +------+            |---------------------------+         |
> > +-----------------------------------|                          |         |
> > +-----------------------------------|                          |         |
> > |  +------+    +------+            +---------------------------+         |
> > |  |      |    |      |             |    +-----------+         |         |
> > |  +------+    +------+             |    |           |         |         |
> > |                                   +----+    L3     |         |         |
> > |  +------+    +------+             |    |    tag    |         |         |
> > |  |      |    |      |             |    |           |         |         |
> > |  +------+    +------+             |    +-----------+         |         |
> > |                                   |                          |         |
> > +-----------------------------------+                          |         |
> > +-----------------------------------+                          |         |
> > |  +------+    +------+             +--------------------------+         |
> > |  |      |    |      |             |   +-----------+          |         |
> > |  +------+    +------+             |   |           |          |         |
> > |                                   |   |    L3     |          |         |
> > |  +------+    +------+             +---+    tag    |          |         |
> > |  |      |    |      |             |   |           |          |         |
> > |  +------+    +------+             |   +-----------+          |         |
> > |                                   |                          |         |
> > +-----------------------------------+                          |         |
> > +-----------------------------------+                         ++         |
> > |  +------+    +------+             +--------------------------+         |
> > |  |      |    |      |             |  +-----------+           |         |
> > |  +------+    +------+             |  |           |           |         |
> > |                                   |  |    L3     |           |         |
> > |  +------+    +------+             +--+    tag    |           |         |
> > |  |      |    |      |             |  |           |           |         |
> > |  +------+    +------+             |  +-----------+           |         |
> > |                                   |                          +---------+
> > +-----------------------------------+
> > 
> > That means spreading tasks among clusters will bring more bandwidth
> > while packing tasks within one cluster will lead to smaller cache
> > synchronization latency. So both kernel and userspace will have
> > a chance to leverage this topology to place tasks accordingly, to
> > achieve either smaller cache latency within one cluster or an even
> > distribution of load among clusters for higher throughput.
> > 
> > This patch exposes cluster topology to both kernel and userspace.
> > Libraries like hwloc will learn about clusters via cluster_cpus and related
> > sysfs attributes. PoC of HWLOC support at [2].
> > 
> > Note this patch only handles the ACPI case.
> > 
> > Special consideration is needed for SMT processors, where it is
> > necessary to move 2 levels up the hierarchy from the leaf nodes
> > (thus skipping the processor core level).
> > 
> > Note that arm64 / ACPI does not provide any means of identifying
> > a die level in the topology but that may be unrelated to the cluster
> > level.
> > 
> > [1] ACPI Specification 6.3 - section 5.2.29.1 processor hierarchy node
> >      structure (Type 0)
> > [2] https://github.com/hisilicon/hwloc/tree/linux-cluster
> > 
> > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
> > Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
> > ---
> >   .../ABI/stable/sysfs-devices-system-cpu       | 15 +++++
> >   Documentation/admin-guide/cputopology.rst     | 12 ++--
> >   arch/arm64/kernel/topology.c                  |  2 +
> >   drivers/acpi/pptt.c                           | 67 +++++++++++++++++++
> >   drivers/base/arch_topology.c                  | 14 ++++
> >   drivers/base/topology.c                       | 10 +++
> >   include/linux/acpi.h                          |  5 ++
> >   include/linux/arch_topology.h                 |  5 ++
> >   include/linux/topology.h                      |  6 ++
> >   9 files changed, 132 insertions(+), 4 deletions(-)
> > 
> > diff --git a/Documentation/ABI/stable/sysfs-devices-system-cpu b/Documentation/ABI/stable/sysfs-devices-system-cpu
> > index 516dafea03eb..3965ce504484 100644
> > --- a/Documentation/ABI/stable/sysfs-devices-system-cpu
> > +++ b/Documentation/ABI/stable/sysfs-devices-system-cpu
> > @@ -42,6 +42,12 @@ Description:    the CPU core ID of cpuX. Typically it is the hardware platform's
> >                   architecture and platform dependent.
> >   Values:         integer
> >   
> > +What:           /sys/devices/system/cpu/cpuX/topology/cluster_id
> > +Description:    the cluster ID of cpuX.  Typically it is the hardware platform's
> > +                identifier (rather than the kernel's). The actual value is
> > +                architecture and platform dependent.
> > +Values:         integer
> > +
> >   What:           /sys/devices/system/cpu/cpuX/topology/book_id
> >   Description:    the book ID of cpuX. Typically it is the hardware platform's
> >                   identifier (rather than the kernel's). The actual value is
> > @@ -85,6 +91,15 @@ Description:    human-readable list of CPUs within the same die.
> >                   The format is like 0-3, 8-11, 14,17.
> >   Values:         decimal list.
> >   
> > +What:           /sys/devices/system/cpu/cpuX/topology/cluster_cpus
> > +Description:    internal kernel map of CPUs within the same cluster.
> > +Values:         hexadecimal bitmask.
> > +
> > +What:           /sys/devices/system/cpu/cpuX/topology/cluster_cpus_list
> > +Description:    human-readable list of CPUs within the same cluster.
> > +                The format is like 0-3, 8-11, 14,17.
> > +Values:         decimal list.
> > +
> >   What:           /sys/devices/system/cpu/cpuX/topology/book_siblings
> >   Description:    internal kernel map of cpuX's hardware threads within the same
> >                   book_id. it's only used on s390.
> > diff --git a/Documentation/admin-guide/cputopology.rst b/Documentation/admin-guide/cputopology.rst
> > index 8632a1db36e4..a5491949880d 100644
> > --- a/Documentation/admin-guide/cputopology.rst
> > +++ b/Documentation/admin-guide/cputopology.rst
> > @@ -19,11 +19,13 @@ these macros in include/asm-XXX/topology.h::
> >   
> >   	#define topology_physical_package_id(cpu)
> >   	#define topology_die_id(cpu)
> > +	#define topology_cluster_id(cpu)
> >   	#define topology_core_id(cpu)
> >   	#define topology_book_id(cpu)
> >   	#define topology_drawer_id(cpu)
> >   	#define topology_sibling_cpumask(cpu)
> >   	#define topology_core_cpumask(cpu)
> > +	#define topology_cluster_cpumask(cpu)
> >   	#define topology_die_cpumask(cpu)
> >   	#define topology_book_cpumask(cpu)
> >   	#define topology_drawer_cpumask(cpu)
> > @@ -39,10 +41,12 @@ not defined by include/asm-XXX/topology.h:
> >   
> >   1) topology_physical_package_id: -1
> >   2) topology_die_id: -1
> > -3) topology_core_id: 0
> > -4) topology_sibling_cpumask: just the given CPU
> > -5) topology_core_cpumask: just the given CPU
> > -6) topology_die_cpumask: just the given CPU
> > +3) topology_cluster_id: -1
> > +4) topology_core_id: 0
> > +5) topology_sibling_cpumask: just the given CPU
> > +6) topology_core_cpumask: just the given CPU
> > +7) topology_cluster_cpumask: just the given CPU
> > +8) topology_die_cpumask: just the given CPU
> >   
> >   For architectures that don't support books (CONFIG_SCHED_BOOK) there are no
> >   default definitions for topology_book_id() and topology_book_cpumask().
> > diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c
> > index 4dd14a6620c1..9ab78ad826e2 100644
> > --- a/arch/arm64/kernel/topology.c
> > +++ b/arch/arm64/kernel/topology.c
> > @@ -103,6 +103,8 @@ int __init parse_acpi_topology(void)
> >   			cpu_topology[cpu].thread_id  = -1;
> >   			cpu_topology[cpu].core_id    = topology_id;
> >   		}
> > +		topology_id = find_acpi_cpu_topology_cluster(cpu);
> > +		cpu_topology[cpu].cluster_id = topology_id;
> >   		topology_id = find_acpi_cpu_topology_package(cpu);
> >   		cpu_topology[cpu].package_id = topology_id;
> >   
> > diff --git a/drivers/acpi/pptt.c b/drivers/acpi/pptt.c
> > index fe69dc518f31..701f61c01359 100644
> > --- a/drivers/acpi/pptt.c
> > +++ b/drivers/acpi/pptt.c
> > @@ -746,6 +746,73 @@ int find_acpi_cpu_topology_package(unsigned int cpu)
> >   					  ACPI_PPTT_PHYSICAL_PACKAGE);
> >   }
> >   
> > +/**
> > + * find_acpi_cpu_topology_cluster() - Determine a unique CPU cluster value
> > + * @cpu: Kernel logical CPU number
> > + *
> > + * Determine a topology unique cluster ID for the given CPU/thread.
> > + * This ID can then be used to group peers, which will have matching ids.
> > + *
> > + * The cluster, if present, is the level of topology above CPUs. In a
> > + * multi-thread CPU, it will be the level above the CPU, not the thread.
> > + * It may not exist in single CPU systems. In simple multi-CPU systems,
> > + * it may be equal to the package topology level.
> > + *
> > + * Return: -ENOENT if the PPTT doesn't exist, the CPU cannot be found
> > + * or there is no topology level above the CPU.
> > + * Otherwise returns a value which represents the cluster for this CPU.
> > + */
> > +
> > +int find_acpi_cpu_topology_cluster(unsigned int cpu)
> > +{
> > +	struct acpi_table_header *table;
> > +	acpi_status status;
> > +	struct acpi_pptt_processor *cpu_node, *cluster_node;
> > +	u32 acpi_cpu_id;
> > +	int retval;
> > +	int is_thread;
> > +
> > +	status = acpi_get_table(ACPI_SIG_PPTT, 0, &table);
> > [ ... remainder of patch snipped; identical to the diff quoted in full earlier in this thread ... ]
> 


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
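On Jonathan's point that extra levels "can be collapsed by the users of the description": the check a consumer needs is essentially whether a level spans the same CPUs as the level below it. A toy sketch of that idea (hypothetical types and names, not the kernel's actual degenerate-domain handling):

```c
#include <stdbool.h>

/*
 * Toy model: a topology level adds no scheduling information
 * when it spans exactly the same CPUs as its child level, so a
 * consumer of the description can simply drop it. CPU masks are
 * modeled as plain bitmaps (one bit per CPU); this illustrates
 * the idea discussed above, nothing more.
 */
struct level {
	const char *name;
	unsigned long span;	/* one bit per CPU */
};

bool level_is_redundant(const struct level *lvl,
			const struct level *child)
{
	return lvl->span == child->span;
}
```

With such a check, a redundant level in the description can be folded away by the consumer rather than tripping the "arch topology borken" sanity checks.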


* Re: [BUG] Re: [PATCH 1/3] topology: Represent clusters of CPUs within a die
  2022-05-09 10:15       ` Jonathan Cameron
@ 2022-05-10 19:17         ` Darren Hart
  -1 siblings, 0 replies; 16+ messages in thread
From: Darren Hart @ 2022-05-10 19:17 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Jeremy Linton, Barry Song, bp, catalin.marinas, dietmar.eggemann,
	gregkh, hpa, juri.lelli, bristot, lenb, mgorman, mingo, peterz,
	rjw, sudeep.holla, tglx, aubrey.li, bsegall, guodong.xu,
	liguozhu, linux-acpi, linux-arm-kernel, linux-kernel,
	mark.rutland, msys.mizuma, prime.zeng, rostedt, tim.c.chen,
	valentin.schneider, vincent.guittot, will, x86, xuwei5,
	yangyicong, linuxarm, Tian Tao, Barry Song

On Mon, May 09, 2022 at 11:15:53AM +0100, Jonathan Cameron wrote:
> On Fri, 6 May 2022 15:24:27 -0500
> Jeremy Linton <jeremy.linton@arm.com> wrote:
> 
> > Hi,
> > 
> > It seems this set is:
> > 
> > "BUG: arch topology borken"
> >                     ^code
> > 
> > on machines that don't actually have clusters, or provide a 
> > representation which might be taken for a cluster. The Ampere Altra for 

Hi All,

The fix for this particular issue is upstream:
db1e59483dfd topology: make core_mask include at least cluster_siblings


> > one. So, I guess it's my job to relay what I was informed of when I 
> > initially proposed something similar a few years back.
> > 
> > Neither the ACPI/PPTT spec nor the Arm architectural spec mandate the 
> > concept of a "cluster" particularly in the form of a system with cores 
> > sharing the L2, which IIRC is the case for the Kunpeng.
> 
> It is not. Kunpeng 920 shares l3 tag cache, but not l2 cache (which is
> private for each core).
> As such the existence of a cluster is not distinguished by sharing
> of any cache resources that are in PPTT.  There is an argument for potentially
> adding more types of resource to PPTT to give a richer description.
> 
> Whilst ACPI doesn't mandate a cluster (there is an example, though that happens
> > to have L3 shared across the cluster), it does allow for additional
> hierarchy description. Cluster is just a name for such an extra level.
> 
> > And it tends to 
> > be a shared L2 which gives the most bang for the buck (or was when I was 
> > testing/benchmarking all this, on aarch64) from scheduler changes which 
> > create cluster level scheduling domains.
> > But OTOH, things like specJBB 
> > didn't really like those smaller MC levels (which I suspect is hurt by 
> > this change, without running a full benchmark suite, especially on 
> > something like the above ampere, given what is happening to its 
> > scheduling domains).
> > 
> > So, the one takeaway I can give is this: the code below, which is 
> > attempting to create a cluster level, should be a bit more intelligent 
> > about whether there is an actual cluster.
> 
> I agree that more intelligence is needed, though I think that belongs
> in the interpretation of the cluster level.  This particular patch
> should present that information in a consistent fashion.  My understanding
> is it is consistent with how other levels have been presented in that
> it's perfectly acceptable to have multiple levels that can be collapsed
> by the users of the description. (perhaps I'm wrong on that?)
> 

Collapsing redundant levels is indeed an intentional part of the design as I
understand it.

> > A first order approximation 
> > might be adding a check to see if the node immediately above the CPU 
> > contains an L2 and that it is shared. 
> 
> That rules out our clusters, so not a great starting point :)
> 
> Darren Hart's recent set for Ampere Altra is fixing a different combination
> but is in some sense similar in that it corrects an assumption that turned
> out to be false in the user of the topology description whilst leaving the
> description alone.

I think that concept is important: "correct assumptions in the abstraction while
leaving the description alone" (provided the description follows the relevant
standards and specifications of course).

> 
> > A better fix, of course, is the 
> > reason this wasn't previously done, and that is to convince the ACPI 
> > committee to standardize a CLUSTER level flag which could be utilized by 
> > a firmware/machine manufacturer to decide whether cluster level 
> > scheduling provides an advantage, and simply not do it on machines which 
> > don't flag CLUSTER levels because it's not advantageous.
> 
> While I obviously can't predict discussions in ASWG, my gut feeling
> is that would be a non-starter, with questions along the lines of:
> 
> 1) Why is this level special? The spec already defines a hierarchical
>    description with caches described at each level, so you can infer
>    what is intended.  If we define cluster, we'll also need to define
>    super cluster (we have designs with super clusters and it's only going
>    to get worse as systems continue to get bigger.)

While I share Jeremy's concern about the lack of specificity of the term
Cluster, I suspect Jonathan's point about that path leading to more and more
categorization (e.g. super cluster) in a space that is rapidly evolving is
accurate.

Beyond the topology of cores and cpu-side caches, we have other properties to
consider which affect scheduling performance (like the shared snoop filter,
memory-side caches, etc.) and could/should be considered in the heuristics.
Comprehending all of these into a fixed set of defined topology constructs seems
unlikely.

It seems to me we are going to need to respond to a set of properties
rather than attempting to rigidly define what future topologies will be. Even
terms like die and package start to get fuzzy as more and more complex
architectures get pushed down into a single socket.

-- 
Darren Hart
Ampere Computing / OS and Kernel

^ permalink raw reply	[flat|nested] 16+ messages in thread


end of thread, other threads:[~2022-05-10 19:19 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-20  1:30 [PATCH 0/3] Represent cluster topology and enable load balance between clusters Barry Song
2021-08-20  1:30 ` Barry Song
2021-08-20  1:30 ` [PATCH 1/3] topology: Represent clusters of CPUs within a die Barry Song
2021-08-20  1:30   ` Barry Song
2022-05-06 20:24   ` [BUG] " Jeremy Linton
2022-05-06 20:24     ` Jeremy Linton
2022-05-09 10:15     ` Jonathan Cameron
2022-05-09 10:15       ` Jonathan Cameron
2022-05-10 19:17       ` Darren Hart
2022-05-10 19:17         ` Darren Hart
2021-08-20  1:30 ` [PATCH 2/3] scheduler: Add cluster scheduler level in core and related Kconfig for ARM64 Barry Song
2021-08-20  1:30   ` Barry Song
2021-08-20  1:30 ` [PATCH 3/3] scheduler: Add cluster scheduler level for x86 Barry Song
2021-08-20  1:30   ` Barry Song
2021-08-23 17:49   ` Tim Chen
2021-08-23 17:49     ` Tim Chen
