LinuxPPC-Dev Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH v4 00/10] Coregroup support on Powerpc
@ 2020-07-27  5:32 Srikar Dronamraju
  2020-07-27  5:32 ` [PATCH v4 01/10] powerpc/smp: Fix a warning under !NEED_MULTIPLE_NODES Srikar Dronamraju
                   ` (10 more replies)
  0 siblings, 11 replies; 32+ messages in thread
From: Srikar Dronamraju @ 2020-07-27  5:32 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Oliver OHalloran,
	Michael Neuling, Srikar Dronamraju, Michael Ellerman,
	Peter Zijlstra, Jordan Niethe, Anton Blanchard, LKML,
	Valentin Schneider, Nick Piggin, linuxppc-dev, Ingo Molnar

Changelog v3 ->v4:
v3: https://lore.kernel.org/lkml/20200723085116.4731-1-srikar@linux.vnet.ibm.com/t/#u

powerpc/smp: Create coregroup domain
	if coregroup_support doesn't exist, update MC mask to the next
	smaller domain mask.

Changelog v2 -> v3:
v2: https://lore.kernel.org/linuxppc-dev/20200721113814.32284-1-srikar@linux.vnet.ibm.com/t/#u

powerpc/smp: Cache node for reuse
	Removed node caching part. Rewrote the Commit msg (Michael Ellerman)
	Renamed to powerpc/smp: Fix a warning under !NEED_MULTIPLE_NODES

powerpc/smp: Enable small core scheduling sooner
	Rewrote changelog (Gautham)
	Renamed to powerpc/smp: Move topology fixups into  a new function

powerpc/smp: Create coregroup domain
	Add optimization for mask updation under coregroup_support

Changelog v1 -> v2:
v1: https://lore.kernel.org/linuxppc-dev/20200714043624.5648-1-srikar@linux.vnet.ibm.com/t/#u

powerpc/smp: Merge Power9 topology with Power topology
	Replaced a reference to cpu_smt_mask with per_cpu(cpu_sibling_map, cpu)
	since cpu_smt_mask is only defined under CONFIG_SCHED_SMT

powerpc/smp: Enable small core scheduling sooner
	Restored the previous info msg (Jordan)
	Moved big core topology fixup to fixup_topology (Gautham)

powerpc/smp: Dont assume l2-cache to be superset of sibling
	Set cpumask after verifying l2-cache. (Gautham)

powerpc/smp: Generalize 2nd sched domain
	Moved shared_cache topology fixup to fixup_topology (Gautham)

Powerpc/numa: Detect support for coregroup
	Explained Coregroup in commit msg (Michael Ellerman)

Powerpc/smp: Create coregroup domain
	Moved coregroup topology fixup to fixup_topology (Gautham)

powerpc/smp: Implement cpu_to_coregroup_id
	Move coregroup_enabled before getting associativity (Gautham)

powerpc/smp: Provide an ability to disable coregroup
	Patch dropped (Michael Ellerman)

Cleanup of existing powerpc topologies and add coregroup support on
Powerpc. Coregroup is a group of (subset of) cores of a DIE that share
a resource.

Patch 7 of this patch series: "Powerpc/numa: Detect support for coregroup"
depends on
https://lore.kernel.org/linuxppc-dev/20200707140644.7241-1-srikar@linux.vnet.ibm.com/t/#u
However it should be easy to rebase the patch without the above patch.

This patch series is based on top of current powerpc/next tree + the
above patch.

On Power 8 Systems
------------------
$ tail /proc/cpuinfo
processor	: 255
cpu		: POWER8 (architected), altivec supported
clock		: 3724.000000MHz
revision	: 2.1 (pvr 004b 0201)

timebase	: 512000000
platform	: pSeries
model		: IBM,8408-E8E
machine		: CHRP IBM,8408-E8E
MMU		: Hash

Before the patchset
-------------------
$ cat /proc/sys/kernel/sched_domain/cpu0/domain*/name
SMT
DIE
NUMA
NUMA
$ head /proc/schedstat
version 15
timestamp 4295534931
cpu0 0 0 0 0 0 0 41389823338 17682779896 14117
domain0 00000000,00000000,00000000,00000000,00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00000000,00000000,00000000,00000000,00000000,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain2 00000000,00000000,00000000,00000000,00000000,00000000,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain3 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
cpu1 0 0 0 0 0 0 27087859050 152273672 10396
domain0 00000000,00000000,00000000,00000000,00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00000000,00000000,00000000,00000000,00000000,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

After the patchset
------------------
$ cat /proc/sys/kernel/sched_domain/cpu0/domain*/name
SMT
DIE
NUMA
NUMA
$ head /proc/schedstat
version 15
timestamp 4295534931
cpu0 0 0 0 0 0 0 41389823338 17682779896 14117
domain0 00000000,00000000,00000000,00000000,00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00000000,00000000,00000000,00000000,00000000,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain2 00000000,00000000,00000000,00000000,00000000,00000000,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain3 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
cpu1 0 0 0 0 0 0 27087859050 152273672 10396
domain0 00000000,00000000,00000000,00000000,00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00000000,00000000,00000000,00000000,00000000,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

On Power 9 (with device-tree enablement to show coregroups).
(hunks for mimicing a coregroup was posted at
https://lore.kernel.org/linuxppc-dev/20200714043624.5648-1-srikar@linux.vnet.ibm.com/t/#m2cb09bb11c7a93257d6123d1d27edb8212f8af21)
-----------------------------------------------------------
$ tail /proc/cpuinfo
processor	: 127
cpu		: POWER9 (architected), altivec supported
clock		: 3000.000000MHz
revision	: 2.2 (pvr 004e 0202)

timebase	: 512000000
platform	: pSeries
model		: IBM,9008-22L
machine		: CHRP IBM,9008-22L
MMU		: Hash

Before patchset
--------------
$ cat /proc/sys/kernel/sched_domain/cpu0/domain*/name
SMT
CACHE
DIE
NUMA

$ head /proc/schedstat
version 15
timestamp 4318242208
cpu0 0 0 0 0 0 0 28077107004 4773387362 78205
domain0 00000000,00000000,00000000,00000055 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain2 00000000,00000000,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain3 ffffffff,ffffffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
cpu1 0 0 0 0 0 0 24177439200 413887604 75393
domain0 00000000,00000000,00000000,000000aa 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

After patchset
--------------
$ cat /proc/sys/kernel/sched_domain/cpu0/domain*/name
SMT
CACHE
MC
DIE
NUMA

$ head /proc/schedstat
version 15
timestamp 4318242208
cpu0 0 0 0 0 0 0 28077107004 4773387362 78205
domain0 00000000,00000000,00000000,00000055 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain2 00000000,00000000,00000000,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain3 00000000,00000000,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain4 ffffffff,ffffffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
cpu1 0 0 0 0 0 0 24177439200 413887604 75393
domain0 00000000,00000000,00000000,000000aa 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Michael Ellerman <michaele@au1.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Nick Piggin <npiggin@au1.ibm.com>
Cc: Oliver OHalloran <oliveroh@au1.ibm.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Neuling <mikey@linux.ibm.com>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
Cc: Jordan Niethe <jniethe5@gmail.com>

Srikar Dronamraju (10):
  powerpc/smp: Fix a warning under !NEED_MULTIPLE_NODES
  powerpc/smp: Merge Power9 topology with Power topology
  powerpc/smp: Move powerpc_topology above
  powerpc/smp: Move topology fixups into  a new function
  powerpc/smp: Dont assume l2-cache to be superset of sibling
  powerpc/smp: Generalize 2nd sched domain
  powerpc/numa: Detect support for coregroup
  powerpc/smp: Allocate cpumask only after searching thread group
  powerpc/smp: Create coregroup domain
  powerpc/smp: Implement cpu_to_coregroup_id

 arch/powerpc/include/asm/smp.h      |   1 +
 arch/powerpc/include/asm/topology.h |  10 ++
 arch/powerpc/kernel/smp.c           | 246 +++++++++++++++++-----------
 arch/powerpc/mm/numa.c              |  59 +++++--
 4 files changed, 210 insertions(+), 106 deletions(-)

-- 
2.17.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v4 01/10] powerpc/smp: Fix a warning under !NEED_MULTIPLE_NODES
  2020-07-27  5:32 [PATCH v4 00/10] Coregroup support on Powerpc Srikar Dronamraju
@ 2020-07-27  5:32 ` Srikar Dronamraju
  2020-07-27  5:32 ` [PATCH v4 02/10] powerpc/smp: Merge Power9 topology with Power topology Srikar Dronamraju
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Srikar Dronamraju @ 2020-07-27  5:32 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling,
	Srikar Dronamraju, Peter Zijlstra, Jordan Niethe, LKML,
	Nicholas Piggin, Valentin Schneider, Oliver O'Halloran,
	linuxppc-dev, Ingo Molnar

Fix a build warning in a non CONFIG_NEED_MULTIPLE_NODES
"error: _numa_cpu_lookup_table_ undeclared"

Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Oliver O'Halloran <oohall@gmail.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Jordan Niethe <jniethe5@gmail.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v2 -> v3:
	Removed node caching part. Rewrote the Commit msg (Michael Ellerman)
	Renamed to powerpc/smp: Fix a warning under !NEED_MULTIPLE_NODES

 arch/powerpc/kernel/smp.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 73199470c265..edf94ca64eea 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -860,6 +860,7 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
 					GFP_KERNEL, cpu_to_node(cpu));
 		zalloc_cpumask_var_node(&per_cpu(cpu_core_map, cpu),
 					GFP_KERNEL, cpu_to_node(cpu));
+#ifdef CONFIG_NEED_MULTIPLE_NODES
 		/*
 		 * numa_node_id() works after this.
 		 */
@@ -868,6 +869,7 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
 			set_cpu_numa_mem(cpu,
 				local_memory_node(numa_cpu_lookup_table[cpu]));
 		}
+#endif
 	}
 
 	/* Init the cpumasks so the boot CPU is related to itself */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v4 02/10] powerpc/smp: Merge Power9 topology with Power topology
  2020-07-27  5:32 [PATCH v4 00/10] Coregroup support on Powerpc Srikar Dronamraju
  2020-07-27  5:32 ` [PATCH v4 01/10] powerpc/smp: Fix a warning under !NEED_MULTIPLE_NODES Srikar Dronamraju
@ 2020-07-27  5:32 ` Srikar Dronamraju
  2020-07-27  5:32 ` [PATCH v4 03/10] powerpc/smp: Move powerpc_topology above Srikar Dronamraju
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Srikar Dronamraju @ 2020-07-27  5:32 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling,
	Srikar Dronamraju, Peter Zijlstra, Jordan Niethe, LKML,
	Nicholas Piggin, Valentin Schneider, Oliver O'Halloran,
	linuxppc-dev, Ingo Molnar

A new sched_domain_topology_level was added just for Power9. However the
same can be achieved by merging powerpc_topology with power9_topology
and makes the code more simpler especially when adding a new sched
domain.

Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Oliver O'Halloran <oohall@gmail.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Jordan Niethe <jniethe5@gmail.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v1 -> v2:
	Replaced a reference to cpu_smt_mask with per_cpu(cpu_sibling_map, cpu)
	since cpu_smt_mask is only defined under CONFIG_SCHED_SMT

 arch/powerpc/kernel/smp.c | 33 ++++++++++-----------------------
 1 file changed, 10 insertions(+), 23 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index edf94ca64eea..283a04e54f52 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1313,7 +1313,7 @@ int setup_profiling_timer(unsigned int multiplier)
 }
 
 #ifdef CONFIG_SCHED_SMT
-/* cpumask of CPUs with asymetric SMT dependancy */
+/* cpumask of CPUs with asymmetric SMT dependency */
 static int powerpc_smt_flags(void)
 {
 	int flags = SD_SHARE_CPUCAPACITY | SD_SHARE_PKG_RESOURCES;
@@ -1326,14 +1326,6 @@ static int powerpc_smt_flags(void)
 }
 #endif
 
-static struct sched_domain_topology_level powerpc_topology[] = {
-#ifdef CONFIG_SCHED_SMT
-	{ cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
-#endif
-	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
-	{ NULL, },
-};
-
 /*
  * P9 has a slightly odd architecture where pairs of cores share an L2 cache.
  * This topology makes it *much* cheaper to migrate tasks between adjacent cores
@@ -1351,7 +1343,13 @@ static int powerpc_shared_cache_flags(void)
  */
 static const struct cpumask *shared_cache_mask(int cpu)
 {
-	return cpu_l2_cache_mask(cpu);
+	if (shared_caches)
+		return cpu_l2_cache_mask(cpu);
+
+	if (has_big_cores)
+		return cpu_smallcore_mask(cpu);
+
+	return per_cpu(cpu_sibling_map, cpu);
 }
 
 #ifdef CONFIG_SCHED_SMT
@@ -1361,7 +1359,7 @@ static const struct cpumask *smallcore_smt_mask(int cpu)
 }
 #endif
 
-static struct sched_domain_topology_level power9_topology[] = {
+static struct sched_domain_topology_level powerpc_topology[] = {
 #ifdef CONFIG_SCHED_SMT
 	{ cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
 #endif
@@ -1386,21 +1384,10 @@ void __init smp_cpus_done(unsigned int max_cpus)
 #ifdef CONFIG_SCHED_SMT
 	if (has_big_cores) {
 		pr_info("Big cores detected but using small core scheduling\n");
-		power9_topology[0].mask = smallcore_smt_mask;
 		powerpc_topology[0].mask = smallcore_smt_mask;
 	}
 #endif
-	/*
-	 * If any CPU detects that it's sharing a cache with another CPU then
-	 * use the deeper topology that is aware of this sharing.
-	 */
-	if (shared_caches) {
-		pr_info("Using shared cache scheduler topology\n");
-		set_sched_topology(power9_topology);
-	} else {
-		pr_info("Using standard scheduler topology\n");
-		set_sched_topology(powerpc_topology);
-	}
+	set_sched_topology(powerpc_topology);
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
-- 
2.17.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v4 03/10] powerpc/smp: Move powerpc_topology above
  2020-07-27  5:32 [PATCH v4 00/10] Coregroup support on Powerpc Srikar Dronamraju
  2020-07-27  5:32 ` [PATCH v4 01/10] powerpc/smp: Fix a warning under !NEED_MULTIPLE_NODES Srikar Dronamraju
  2020-07-27  5:32 ` [PATCH v4 02/10] powerpc/smp: Merge Power9 topology with Power topology Srikar Dronamraju
@ 2020-07-27  5:32 ` Srikar Dronamraju
  2020-07-27  5:32 ` [PATCH v4 04/10] powerpc/smp: Move topology fixups into a new function Srikar Dronamraju
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Srikar Dronamraju @ 2020-07-27  5:32 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling,
	Srikar Dronamraju, Peter Zijlstra, Jordan Niethe, LKML,
	Nicholas Piggin, Valentin Schneider, Oliver O'Halloran,
	linuxppc-dev, Ingo Molnar

Just moving the powerpc_topology description above.
This will help in using functions in this file and avoid declarations.

No other functional changes

Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Oliver O'Halloran <oohall@gmail.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Jordan Niethe <jniethe5@gmail.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/smp.c | 116 +++++++++++++++++++-------------------
 1 file changed, 58 insertions(+), 58 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 283a04e54f52..a685915e5941 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -818,6 +818,64 @@ static int init_cpu_l1_cache_map(int cpu)
 	return err;
 }
 
+static bool shared_caches;
+
+#ifdef CONFIG_SCHED_SMT
+/* cpumask of CPUs with asymmetric SMT dependency */
+static int powerpc_smt_flags(void)
+{
+	int flags = SD_SHARE_CPUCAPACITY | SD_SHARE_PKG_RESOURCES;
+
+	if (cpu_has_feature(CPU_FTR_ASYM_SMT)) {
+		printk_once(KERN_INFO "Enabling Asymmetric SMT scheduling\n");
+		flags |= SD_ASYM_PACKING;
+	}
+	return flags;
+}
+#endif
+
+/*
+ * P9 has a slightly odd architecture where pairs of cores share an L2 cache.
+ * This topology makes it *much* cheaper to migrate tasks between adjacent cores
+ * since the migrated task remains cache hot. We want to take advantage of this
+ * at the scheduler level so an extra topology level is required.
+ */
+static int powerpc_shared_cache_flags(void)
+{
+	return SD_SHARE_PKG_RESOURCES;
+}
+
+/*
+ * We can't just pass cpu_l2_cache_mask() directly because
+ * returns a non-const pointer and the compiler barfs on that.
+ */
+static const struct cpumask *shared_cache_mask(int cpu)
+{
+	if (shared_caches)
+		return cpu_l2_cache_mask(cpu);
+
+	if (has_big_cores)
+		return cpu_smallcore_mask(cpu);
+
+	return per_cpu(cpu_sibling_map, cpu);
+}
+
+#ifdef CONFIG_SCHED_SMT
+static const struct cpumask *smallcore_smt_mask(int cpu)
+{
+	return cpu_smallcore_mask(cpu);
+}
+#endif
+
+static struct sched_domain_topology_level powerpc_topology[] = {
+#ifdef CONFIG_SCHED_SMT
+	{ cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
+#endif
+	{ shared_cache_mask, powerpc_shared_cache_flags, SD_INIT_NAME(CACHE) },
+	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ NULL, },
+};
+
 static int init_big_cores(void)
 {
 	int cpu;
@@ -1247,8 +1305,6 @@ static void add_cpu_to_masks(int cpu)
 			set_cpus_related(cpu, i, cpu_core_mask);
 }
 
-static bool shared_caches;
-
 /* Activate a secondary processor. */
 void start_secondary(void *unused)
 {
@@ -1312,62 +1368,6 @@ int setup_profiling_timer(unsigned int multiplier)
 	return 0;
 }
 
-#ifdef CONFIG_SCHED_SMT
-/* cpumask of CPUs with asymmetric SMT dependency */
-static int powerpc_smt_flags(void)
-{
-	int flags = SD_SHARE_CPUCAPACITY | SD_SHARE_PKG_RESOURCES;
-
-	if (cpu_has_feature(CPU_FTR_ASYM_SMT)) {
-		printk_once(KERN_INFO "Enabling Asymmetric SMT scheduling\n");
-		flags |= SD_ASYM_PACKING;
-	}
-	return flags;
-}
-#endif
-
-/*
- * P9 has a slightly odd architecture where pairs of cores share an L2 cache.
- * This topology makes it *much* cheaper to migrate tasks between adjacent cores
- * since the migrated task remains cache hot. We want to take advantage of this
- * at the scheduler level so an extra topology level is required.
- */
-static int powerpc_shared_cache_flags(void)
-{
-	return SD_SHARE_PKG_RESOURCES;
-}
-
-/*
- * We can't just pass cpu_l2_cache_mask() directly because
- * returns a non-const pointer and the compiler barfs on that.
- */
-static const struct cpumask *shared_cache_mask(int cpu)
-{
-	if (shared_caches)
-		return cpu_l2_cache_mask(cpu);
-
-	if (has_big_cores)
-		return cpu_smallcore_mask(cpu);
-
-	return per_cpu(cpu_sibling_map, cpu);
-}
-
-#ifdef CONFIG_SCHED_SMT
-static const struct cpumask *smallcore_smt_mask(int cpu)
-{
-	return cpu_smallcore_mask(cpu);
-}
-#endif
-
-static struct sched_domain_topology_level powerpc_topology[] = {
-#ifdef CONFIG_SCHED_SMT
-	{ cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
-#endif
-	{ shared_cache_mask, powerpc_shared_cache_flags, SD_INIT_NAME(CACHE) },
-	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
-	{ NULL, },
-};
-
 void __init smp_cpus_done(unsigned int max_cpus)
 {
 	/*
-- 
2.17.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v4 04/10] powerpc/smp: Move topology fixups into a new function
  2020-07-27  5:32 [PATCH v4 00/10] Coregroup support on Powerpc Srikar Dronamraju
                   ` (2 preceding siblings ...)
  2020-07-27  5:32 ` [PATCH v4 03/10] powerpc/smp: Move powerpc_topology above Srikar Dronamraju
@ 2020-07-27  5:32 ` Srikar Dronamraju
  2020-07-27  5:32 ` [PATCH v4 05/10] powerpc/smp: Dont assume l2-cache to be superset of sibling Srikar Dronamraju
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Srikar Dronamraju @ 2020-07-27  5:32 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling,
	Srikar Dronamraju, Peter Zijlstra, Jordan Niethe, LKML,
	Nicholas Piggin, Valentin Schneider, Oliver O'Halloran,
	linuxppc-dev, Ingo Molnar

Move topology fixup based on the platform attributes into its own
function which is called just before set_sched_topology.

Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Oliver O'Halloran <oohall@gmail.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Jordan Niethe <jniethe5@gmail.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v2 -> v3:
	Rewrote changelog (Gautham)
	Renamed to powerpc/smp: Move topology fixups into  a new function

 arch/powerpc/kernel/smp.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index a685915e5941..da27f6909be1 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1368,6 +1368,16 @@ int setup_profiling_timer(unsigned int multiplier)
 	return 0;
 }
 
+static void fixup_topology(void)
+{
+#ifdef CONFIG_SCHED_SMT
+	if (has_big_cores) {
+		pr_info("Big cores detected but using small core scheduling\n");
+		powerpc_topology[0].mask = smallcore_smt_mask;
+	}
+#endif
+}
+
 void __init smp_cpus_done(unsigned int max_cpus)
 {
 	/*
@@ -1381,12 +1391,7 @@ void __init smp_cpus_done(unsigned int max_cpus)
 
 	dump_numa_cpu_topology();
 
-#ifdef CONFIG_SCHED_SMT
-	if (has_big_cores) {
-		pr_info("Big cores detected but using small core scheduling\n");
-		powerpc_topology[0].mask = smallcore_smt_mask;
-	}
-#endif
+	fixup_topology();
 	set_sched_topology(powerpc_topology);
 }
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v4 05/10] powerpc/smp: Dont assume l2-cache to be superset of sibling
  2020-07-27  5:32 [PATCH v4 00/10] Coregroup support on Powerpc Srikar Dronamraju
                   ` (3 preceding siblings ...)
  2020-07-27  5:32 ` [PATCH v4 04/10] powerpc/smp: Move topology fixups into a new function Srikar Dronamraju
@ 2020-07-27  5:32 ` Srikar Dronamraju
  2020-07-27  5:32 ` [PATCH v4 06/10] powerpc/smp: Generalize 2nd sched domain Srikar Dronamraju
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 32+ messages in thread
From: Srikar Dronamraju @ 2020-07-27  5:32 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling,
	Srikar Dronamraju, Peter Zijlstra, Jordan Niethe, LKML,
	Nicholas Piggin, Valentin Schneider, Oliver O'Halloran,
	linuxppc-dev, Ingo Molnar

Current code assumes that cpumask of cpus sharing a l2-cache mask will
always be a superset of cpu_sibling_mask.

Lets stop that assumption. cpu_l2_cache_mask is a superset of
cpu_sibling_mask if and only if shared_caches is set.

Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Oliver O'Halloran <oohall@gmail.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Jordan Niethe <jniethe5@gmail.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v1 -> v2:
	Set cpumask after verifying l2-cache. (Gautham)

 arch/powerpc/kernel/smp.c | 28 +++++++++++++++-------------
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index da27f6909be1..d997c7411664 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1194,6 +1194,7 @@ static bool update_mask_by_l2(int cpu, struct cpumask *(*mask_fn)(int))
 	if (!l2_cache)
 		return false;
 
+	cpumask_set_cpu(cpu, mask_fn(cpu));
 	for_each_cpu(i, cpu_online_mask) {
 		/*
 		 * when updating the marks the current CPU has not been marked
@@ -1276,29 +1277,30 @@ static void add_cpu_to_masks(int cpu)
 	 * add it to it's own thread sibling mask.
 	 */
 	cpumask_set_cpu(cpu, cpu_sibling_mask(cpu));
+	cpumask_set_cpu(cpu, cpu_core_mask(cpu));
 
 	for (i = first_thread; i < first_thread + threads_per_core; i++)
 		if (cpu_online(i))
 			set_cpus_related(i, cpu, cpu_sibling_mask);
 
 	add_cpu_to_smallcore_masks(cpu);
-	/*
-	 * Copy the thread sibling mask into the cache sibling mask
-	 * and mark any CPUs that share an L2 with this CPU.
-	 */
-	for_each_cpu(i, cpu_sibling_mask(cpu))
-		set_cpus_related(cpu, i, cpu_l2_cache_mask);
 	update_mask_by_l2(cpu, cpu_l2_cache_mask);
 
-	/*
-	 * Copy the cache sibling mask into core sibling mask and mark
-	 * any CPUs on the same chip as this CPU.
-	 */
-	for_each_cpu(i, cpu_l2_cache_mask(cpu))
-		set_cpus_related(cpu, i, cpu_core_mask);
+	if (pkg_id == -1) {
+		struct cpumask *(*mask)(int) = cpu_sibling_mask;
+
+		/*
+		 * Copy the sibling mask into core sibling mask and
+		 * mark any CPUs on the same chip as this CPU.
+		 */
+		if (shared_caches)
+			mask = cpu_l2_cache_mask;
+
+		for_each_cpu(i, mask(cpu))
+			set_cpus_related(cpu, i, cpu_core_mask);
 
-	if (pkg_id == -1)
 		return;
+	}
 
 	for_each_cpu(i, cpu_online_mask)
 		if (get_physical_package_id(i) == pkg_id)
-- 
2.17.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v4 06/10] powerpc/smp: Generalize 2nd sched domain
  2020-07-27  5:32 [PATCH v4 00/10] Coregroup support on Powerpc Srikar Dronamraju
                   ` (4 preceding siblings ...)
  2020-07-27  5:32 ` [PATCH v4 05/10] powerpc/smp: Dont assume l2-cache to be superset of sibling Srikar Dronamraju
@ 2020-07-27  5:32 ` Srikar Dronamraju
  2020-07-30  5:55   ` Gautham R Shenoy
  2020-07-31  7:45   ` Michael Ellerman
  2020-07-27  5:32 ` [PATCH v4 07/10] Powerpc/numa: Detect support for coregroup Srikar Dronamraju
                   ` (4 subsequent siblings)
  10 siblings, 2 replies; 32+ messages in thread
From: Srikar Dronamraju @ 2020-07-27  5:32 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling,
	Srikar Dronamraju, Peter Zijlstra, Jordan Niethe, LKML,
	Nicholas Piggin, Valentin Schneider, Oliver O'Halloran,
	linuxppc-dev, Ingo Molnar

Currently "CACHE" domain happens to be the 2nd sched domain as per
powerpc_topology. This domain will collapse if cpumask of l2-cache is
same as SMT domain. However we could generalize this domain such that it
could mean either be a "CACHE" domain or a "BIGCORE" domain.

While setting up the "CACHE" domain, check if shared_cache is already
set.

Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Oliver O'Halloran <oohall@gmail.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v1 -> v2:
	Moved shared_cache topology fixup to fixup_topology (Gautham)

 arch/powerpc/kernel/smp.c | 48 +++++++++++++++++++++++++++------------
 1 file changed, 34 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index d997c7411664..3c5ccf6d2b1c 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -85,6 +85,14 @@ EXPORT_PER_CPU_SYMBOL(cpu_l2_cache_map);
 EXPORT_PER_CPU_SYMBOL(cpu_core_map);
 EXPORT_SYMBOL_GPL(has_big_cores);
 
+enum {
+#ifdef CONFIG_SCHED_SMT
+	smt_idx,
+#endif
+	bigcore_idx,
+	die_idx,
+};
+
 #define MAX_THREAD_LIST_SIZE	8
 #define THREAD_GROUP_SHARE_L1   1
 struct thread_groups {
@@ -851,13 +859,7 @@ static int powerpc_shared_cache_flags(void)
  */
 static const struct cpumask *shared_cache_mask(int cpu)
 {
-	if (shared_caches)
-		return cpu_l2_cache_mask(cpu);
-
-	if (has_big_cores)
-		return cpu_smallcore_mask(cpu);
-
-	return per_cpu(cpu_sibling_map, cpu);
+	return per_cpu(cpu_l2_cache_map, cpu);
 }
 
 #ifdef CONFIG_SCHED_SMT
@@ -867,11 +869,16 @@ static const struct cpumask *smallcore_smt_mask(int cpu)
 }
 #endif
 
+static const struct cpumask *cpu_bigcore_mask(int cpu)
+{
+	return per_cpu(cpu_sibling_map, cpu);
+}
+
 static struct sched_domain_topology_level powerpc_topology[] = {
 #ifdef CONFIG_SCHED_SMT
 	{ cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
 #endif
-	{ shared_cache_mask, powerpc_shared_cache_flags, SD_INIT_NAME(CACHE) },
+	{ cpu_bigcore_mask, SD_INIT_NAME(BIGCORE) },
 	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
 	{ NULL, },
 };
@@ -1311,7 +1318,6 @@ static void add_cpu_to_masks(int cpu)
 void start_secondary(void *unused)
 {
 	unsigned int cpu = smp_processor_id();
-	struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask;
 
 	mmgrab(&init_mm);
 	current->active_mm = &init_mm;
@@ -1337,14 +1343,20 @@ void start_secondary(void *unused)
 	/* Update topology CPU masks */
 	add_cpu_to_masks(cpu);
 
-	if (has_big_cores)
-		sibling_mask = cpu_smallcore_mask;
 	/*
 	 * Check for any shared caches. Note that this must be done on a
 	 * per-core basis because one core in the pair might be disabled.
 	 */
-	if (!cpumask_equal(cpu_l2_cache_mask(cpu), sibling_mask(cpu)))
-		shared_caches = true;
+	if (!shared_caches) {
+		struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask;
+		struct cpumask *mask = cpu_l2_cache_mask(cpu);
+
+		if (has_big_cores)
+			sibling_mask = cpu_smallcore_mask;
+
+		if (cpumask_weight(mask) > cpumask_weight(sibling_mask(cpu)))
+			shared_caches = true;
+	}
 
 	set_numa_node(numa_cpu_lookup_table[cpu]);
 	set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu]));
@@ -1375,9 +1387,17 @@ static void fixup_topology(void)
 #ifdef CONFIG_SCHED_SMT
 	if (has_big_cores) {
 		pr_info("Big cores detected but using small core scheduling\n");
-		powerpc_topology[0].mask = smallcore_smt_mask;
+		powerpc_topology[smt_idx].mask = smallcore_smt_mask;
 	}
 #endif
+	if (shared_caches) {
+		pr_info("Using shared cache scheduler topology\n");
+		powerpc_topology[bigcore_idx].mask = shared_cache_mask;
+		powerpc_topology[bigcore_idx].sd_flags = powerpc_shared_cache_flags;
+#ifdef CONFIG_SCHED_DEBUG
+		powerpc_topology[bigcore_idx].name = "CACHE";
+#endif
+	}
 }
 
 void __init smp_cpus_done(unsigned int max_cpus)
-- 
2.17.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v4 07/10] Powerpc/numa: Detect support for coregroup
  2020-07-27  5:32 [PATCH v4 00/10] Coregroup support on Powerpc Srikar Dronamraju
                   ` (5 preceding siblings ...)
  2020-07-27  5:32 ` [PATCH v4 06/10] powerpc/smp: Generalize 2nd sched domain Srikar Dronamraju
@ 2020-07-27  5:32 ` Srikar Dronamraju
  2020-07-31  7:49   ` Michael Ellerman
  2020-07-27  5:32 ` [PATCH v4 08/10] powerpc/smp: Allocate cpumask only after searching thread group Srikar Dronamraju
                   ` (3 subsequent siblings)
  10 siblings, 1 reply; 32+ messages in thread
From: Srikar Dronamraju @ 2020-07-27  5:32 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling,
	Srikar Dronamraju, Peter Zijlstra, Jordan Niethe, LKML,
	Nicholas Piggin, Valentin Schneider, Oliver O'Halloran,
	linuxppc-dev, Ingo Molnar

Add support for grouping cores based on the device-tree classification.
- The last domain in the associativity domains always refers to the
core.
- If primary reference domain happens to be the penultimate domain in
the associativity domains device-tree property, then there are no
coregroups. However if its not a penultimate domain, then there are
coregroups. There can be more than one coregroup. For now we would be
interested in the last or the smallest coregroups.

Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Oliver O'Halloran <oohall@gmail.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Jordan Niethe <jniethe5@gmail.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v1 -> v2:
	Explained Coregroup in commit msg (Michael Ellerman)

 arch/powerpc/include/asm/smp.h |  1 +
 arch/powerpc/kernel/smp.c      |  1 +
 arch/powerpc/mm/numa.c         | 34 +++++++++++++++++++++-------------
 3 files changed, 23 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 49a25e2400f2..5bdc17a7049f 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -28,6 +28,7 @@
 extern int boot_cpuid;
 extern int spinning_secondaries;
 extern u32 *cpu_to_phys_id;
+extern bool coregroup_enabled;
 
 extern void cpu_die(void);
 extern int cpu_to_chip_id(int cpu);
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 3c5ccf6d2b1c..698000c7f76f 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -74,6 +74,7 @@ static DEFINE_PER_CPU(int, cpu_state) = { 0 };
 
 struct task_struct *secondary_current;
 bool has_big_cores;
+bool coregroup_enabled;
 
 DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map);
 DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map);
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 2298899a0f0a..51cb672f113b 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -886,7 +886,9 @@ static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn)
 static void __init find_possible_nodes(void)
 {
 	struct device_node *rtas;
-	u32 numnodes, i;
+	const __be32 *domains;
+	int prop_length, max_nodes;
+	u32 i;
 
 	if (!numa_enabled)
 		return;
@@ -895,25 +897,31 @@ static void __init find_possible_nodes(void)
 	if (!rtas)
 		return;
 
-	if (of_property_read_u32_index(rtas, "ibm,current-associativity-domains",
-				min_common_depth, &numnodes)) {
-		/*
-		 * ibm,current-associativity-domains is a fairly recent
-		 * property. If it doesn't exist, then fallback on
-		 * ibm,max-associativity-domains. Current denotes what the
-		 * platform can support compared to max which denotes what the
-		 * Hypervisor can support.
-		 */
-		if (of_property_read_u32_index(rtas, "ibm,max-associativity-domains",
-				min_common_depth, &numnodes))
+	/*
+	 * ibm,current-associativity-domains is a fairly recent property. If
+	 * it doesn't exist, then fallback on ibm,max-associativity-domains.
+	 * Current denotes what the platform can support compared to max
+	 * which denotes what the Hypervisor can support.
+	 */
+	domains = of_get_property(rtas, "ibm,current-associativity-domains",
+					&prop_length);
+	if (!domains) {
+		domains = of_get_property(rtas, "ibm,max-associativity-domains",
+					&prop_length);
+		if (!domains)
 			goto out;
 	}
 
-	for (i = 0; i < numnodes; i++) {
+	max_nodes = of_read_number(&domains[min_common_depth], 1);
+	for (i = 0; i < max_nodes; i++) {
 		if (!node_possible(i))
 			node_set(i, node_possible_map);
 	}
 
+	prop_length /= sizeof(int);
+	if (prop_length > min_common_depth + 2)
+		coregroup_enabled = 1;
+
 out:
 	of_node_put(rtas);
 }
-- 
2.17.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v4 08/10] powerpc/smp: Allocate cpumask only after searching thread group
  2020-07-27  5:32 [PATCH v4 00/10] Coregroup support on Powerpc Srikar Dronamraju
                   ` (6 preceding siblings ...)
  2020-07-27  5:32 ` [PATCH v4 07/10] Powerpc/numa: Detect support for coregroup Srikar Dronamraju
@ 2020-07-27  5:32 ` Srikar Dronamraju
  2020-07-31  7:52   ` Michael Ellerman
  2020-07-27  5:32 ` [PATCH v4 09/10] Powerpc/smp: Create coregroup domain Srikar Dronamraju
                   ` (2 subsequent siblings)
  10 siblings, 1 reply; 32+ messages in thread
From: Srikar Dronamraju @ 2020-07-27  5:32 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling,
	Srikar Dronamraju, Peter Zijlstra, Jordan Niethe, LKML,
	Nicholas Piggin, Valentin Schneider, Oliver O'Halloran,
	linuxppc-dev, Ingo Molnar

If allocated earlier and the search fails, then cpumask need to be
freed. However cpu_l1_cache_map can be allocated after we search thread
group.

Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Oliver O'Halloran <oohall@gmail.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Jordan Niethe <jniethe5@gmail.com>
Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/smp.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 698000c7f76f..dab96a1203ec 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -797,10 +797,6 @@ static int init_cpu_l1_cache_map(int cpu)
 	if (err)
 		goto out;
 
-	zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu),
-				GFP_KERNEL,
-				cpu_to_node(cpu));
-
 	cpu_group_start = get_cpu_thread_group_start(cpu, &tg);
 
 	if (unlikely(cpu_group_start == -1)) {
@@ -809,6 +805,9 @@ static int init_cpu_l1_cache_map(int cpu)
 		goto out;
 	}
 
+	zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu),
+				GFP_KERNEL, cpu_to_node(cpu));
+
 	for (i = first_thread; i < first_thread + threads_per_core; i++) {
 		int i_group_start = get_cpu_thread_group_start(i, &tg);
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v4 09/10] Powerpc/smp: Create coregroup domain
  2020-07-27  5:32 [PATCH v4 00/10] Coregroup support on Powerpc Srikar Dronamraju
                   ` (7 preceding siblings ...)
  2020-07-27  5:32 ` [PATCH v4 08/10] powerpc/smp: Allocate cpumask only after searching thread group Srikar Dronamraju
@ 2020-07-27  5:32 ` Srikar Dronamraju
  2020-07-27 18:52   ` Gautham R Shenoy
  2020-07-28 15:03   ` Valentin Schneider
  2020-07-27  5:32 ` [PATCH v4 10/10] powerpc/smp: Implement cpu_to_coregroup_id Srikar Dronamraju
  2020-07-30 17:22 ` [PATCH v4 00/10] Coregroup support on Powerpc Srikar Dronamraju
  10 siblings, 2 replies; 32+ messages in thread
From: Srikar Dronamraju @ 2020-07-27  5:32 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling,
	Srikar Dronamraju, Peter Zijlstra, Jordan Niethe, LKML,
	Nicholas Piggin, Valentin Schneider, Oliver O'Halloran,
	linuxppc-dev, Ingo Molnar

Add percpu coregroup maps and masks to create coregroup domain.
If a coregroup doesn't exist, the coregroup domain will be degenerated
in favour of SMT/CACHE domain.

Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Oliver O'Halloran <oohall@gmail.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v3 ->v4:
	if coregroup_support doesn't exist, update MC mask to the next
	smaller domain mask.

Changelog v2 -> v3:
	Add optimization for mask updation under coregroup_support

Changelog v1 -> v2:
	Moved coregroup topology fixup to fixup_topology (Gautham)

 arch/powerpc/include/asm/topology.h | 10 +++++++
 arch/powerpc/kernel/smp.c           | 44 +++++++++++++++++++++++++++++
 arch/powerpc/mm/numa.c              |  5 ++++
 3 files changed, 59 insertions(+)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index f0b6300e7dd3..6609174918ab 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -88,12 +88,22 @@ static inline int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
 
 #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR)
 extern int find_and_online_cpu_nid(int cpu);
+extern int cpu_to_coregroup_id(int cpu);
 #else
 static inline int find_and_online_cpu_nid(int cpu)
 {
 	return 0;
 }
 
+static inline int cpu_to_coregroup_id(int cpu)
+{
+#ifdef CONFIG_SMP
+	return cpu_to_core_id(cpu);
+#else
+	return 0;
+#endif
+}
+
 #endif /* CONFIG_NUMA && CONFIG_PPC_SPLPAR */
 
 #include <asm-generic/topology.h>
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index dab96a1203ec..95f0bf72e283 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -80,6 +80,7 @@ DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map);
 DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map);
 DEFINE_PER_CPU(cpumask_var_t, cpu_l2_cache_map);
 DEFINE_PER_CPU(cpumask_var_t, cpu_core_map);
+DEFINE_PER_CPU(cpumask_var_t, cpu_coregroup_map);
 
 EXPORT_PER_CPU_SYMBOL(cpu_sibling_map);
 EXPORT_PER_CPU_SYMBOL(cpu_l2_cache_map);
@@ -91,6 +92,7 @@ enum {
 	smt_idx,
 #endif
 	bigcore_idx,
+	mc_idx,
 	die_idx,
 };
 
@@ -869,6 +871,21 @@ static const struct cpumask *smallcore_smt_mask(int cpu)
 }
 #endif
 
+static struct cpumask *cpu_coregroup_mask(int cpu)
+{
+	return per_cpu(cpu_coregroup_map, cpu);
+}
+
+static bool has_coregroup_support(void)
+{
+	return coregroup_enabled;
+}
+
+static const struct cpumask *cpu_mc_mask(int cpu)
+{
+	return cpu_coregroup_mask(cpu);
+}
+
 static const struct cpumask *cpu_bigcore_mask(int cpu)
 {
 	return per_cpu(cpu_sibling_map, cpu);
@@ -879,6 +896,7 @@ static struct sched_domain_topology_level powerpc_topology[] = {
 	{ cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
 #endif
 	{ cpu_bigcore_mask, SD_INIT_NAME(BIGCORE) },
+	{ cpu_mc_mask, SD_INIT_NAME(MC) },
 	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
 	{ NULL, },
 };
@@ -925,6 +943,10 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
 					GFP_KERNEL, cpu_to_node(cpu));
 		zalloc_cpumask_var_node(&per_cpu(cpu_core_map, cpu),
 					GFP_KERNEL, cpu_to_node(cpu));
+		if (has_coregroup_support())
+			zalloc_cpumask_var_node(&per_cpu(cpu_coregroup_map, cpu),
+						GFP_KERNEL, cpu_to_node(cpu));
+
 #ifdef CONFIG_NEED_MULTIPLE_NODES
 		/*
 		 * numa_node_id() works after this.
@@ -942,6 +964,9 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
 	cpumask_set_cpu(boot_cpuid, cpu_l2_cache_mask(boot_cpuid));
 	cpumask_set_cpu(boot_cpuid, cpu_core_mask(boot_cpuid));
 
+	if (has_coregroup_support())
+		cpumask_set_cpu(boot_cpuid, cpu_coregroup_mask(boot_cpuid));
+
 	init_big_cores();
 	if (has_big_cores) {
 		cpumask_set_cpu(boot_cpuid,
@@ -1233,6 +1258,8 @@ static void remove_cpu_from_masks(int cpu)
 		set_cpus_unrelated(cpu, i, cpu_sibling_mask);
 		if (has_big_cores)
 			set_cpus_unrelated(cpu, i, cpu_smallcore_mask);
+		if (has_coregroup_support())
+			set_cpus_unrelated(cpu, i, cpu_coregroup_mask);
 	}
 }
 #endif
@@ -1293,6 +1320,20 @@ static void add_cpu_to_masks(int cpu)
 	add_cpu_to_smallcore_masks(cpu);
 	update_mask_by_l2(cpu, cpu_l2_cache_mask);
 
+	if (has_coregroup_support()) {
+		int coregroup_id = cpu_to_coregroup_id(cpu);
+
+		cpumask_set_cpu(cpu, cpu_coregroup_mask(cpu));
+		for_each_cpu_and(i, cpu_online_mask, cpu_cpu_mask(cpu)) {
+			int fcpu = cpu_first_thread_sibling(i);
+
+			if (fcpu == first_thread)
+				set_cpus_related(cpu, i, cpu_coregroup_mask);
+			else if (coregroup_id == cpu_to_coregroup_id(i))
+				set_cpus_related(cpu, i, cpu_coregroup_mask);
+		}
+	}
+
 	if (pkg_id == -1) {
 		struct cpumask *(*mask)(int) = cpu_sibling_mask;
 
@@ -1398,6 +1439,9 @@ static void fixup_topology(void)
 		powerpc_topology[bigcore_idx].name = "CACHE";
 #endif
 	}
+
+	if (!has_coregroup_support())
+		powerpc_topology[mc_idx].mask = powerpc_topology[bigcore_idx].mask;
 }
 
 void __init smp_cpus_done(unsigned int max_cpus)
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 51cb672f113b..0d57779e7942 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1216,6 +1216,11 @@ int find_and_online_cpu_nid(int cpu)
 	return new_nid;
 }
 
+int cpu_to_coregroup_id(int cpu)
+{
+	return cpu_to_core_id(cpu);
+}
+
 static int topology_update_init(void)
 {
 	topology_inited = 1;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v4 10/10] powerpc/smp: Implement cpu_to_coregroup_id
  2020-07-27  5:32 [PATCH v4 00/10] Coregroup support on Powerpc Srikar Dronamraju
                   ` (8 preceding siblings ...)
  2020-07-27  5:32 ` [PATCH v4 09/10] Powerpc/smp: Create coregroup domain Srikar Dronamraju
@ 2020-07-27  5:32 ` Srikar Dronamraju
  2020-07-31  8:02   ` Michael Ellerman
  2020-07-30 17:22 ` [PATCH v4 00/10] Coregroup support on Powerpc Srikar Dronamraju
  10 siblings, 1 reply; 32+ messages in thread
From: Srikar Dronamraju @ 2020-07-27  5:32 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling,
	Srikar Dronamraju, Peter Zijlstra, Jordan Niethe, LKML,
	Nicholas Piggin, Valentin Schneider, Oliver O'Halloran,
	linuxppc-dev, Ingo Molnar

Lookup the coregroup id from the associativity array.

If unable to detect the coregroup id, fallback on the core id.
This way, ensure sched_domain degenerates and an extra sched domain is
not created.

Ideally this function should have been implemented in
arch/powerpc/kernel/smp.c. However if its implemented in mm/numa.c, we
don't need to find the primary domain again.

If the device-tree mentions more than one coregroup, then kernel
implements only the last or the smallest coregroup, which currently
corresponds to the penultimate domain in the device-tree.

Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Oliver O'Halloran <oohall@gmail.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Jordan Niethe <jniethe5@gmail.com>
Reviewed-by : Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v1 -> v2:
	Move coregroup_enabled before getting associativity (Gautham)

 arch/powerpc/mm/numa.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 0d57779e7942..8b3b3ec7fcc4 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1218,6 +1218,26 @@ int find_and_online_cpu_nid(int cpu)
 
 int cpu_to_coregroup_id(int cpu)
 {
+	__be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
+	int index;
+
+	if (cpu < 0 || cpu > nr_cpu_ids)
+		return -1;
+
+	if (!coregroup_enabled)
+		goto out;
+
+	if (!firmware_has_feature(FW_FEATURE_VPHN))
+		goto out;
+
+	if (vphn_get_associativity(cpu, associativity))
+		goto out;
+
+	index = of_read_number(associativity, 1);
+	if (index > min_common_depth + 1)
+		return of_read_number(&associativity[index - 1], 1);
+
+out:
 	return cpu_to_core_id(cpu);
 }
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 09/10] Powerpc/smp: Create coregroup domain
  2020-07-27  5:32 ` [PATCH v4 09/10] Powerpc/smp: Create coregroup domain Srikar Dronamraju
@ 2020-07-27 18:52   ` Gautham R Shenoy
  2020-07-28 15:03   ` Valentin Schneider
  1 sibling, 0 replies; 32+ messages in thread
From: Gautham R Shenoy @ 2020-07-27 18:52 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling, Peter Zijlstra,
	LKML, Nicholas Piggin, Valentin Schneider, Oliver O'Halloran,
	Jordan Niethe, linuxppc-dev, Ingo Molnar

Hi Srikar,

On Mon, Jul 27, 2020 at 11:02:29AM +0530, Srikar Dronamraju wrote:
> Add percpu coregroup maps and masks to create coregroup domain.
> If a coregroup doesn't exist, the coregroup domain will be degenerated
> in favour of SMT/CACHE domain.
> 
> Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
> Cc: LKML <linux-kernel@vger.kernel.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Anton Blanchard <anton@ozlabs.org>
> Cc: Oliver O'Halloran <oohall@gmail.com>
> Cc: Nathan Lynch <nathanl@linux.ibm.com>
> Cc: Michael Neuling <mikey@neuling.org>
> Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Valentin Schneider <valentin.schneider@arm.com>
> Cc: Jordan Niethe <jniethe5@gmail.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

This version looks good to me.

Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>


> ---
> Changelog v3 ->v4:
> 	if coregroup_support doesn't exist, update MC mask to the next
> 	smaller domain mask.
> 
> Changelog v2 -> v3:
> 	Add optimization for mask updation under coregroup_support
> 
> Changelog v1 -> v2:
> 	Moved coregroup topology fixup to fixup_topology (Gautham)
> 
>  arch/powerpc/include/asm/topology.h | 10 +++++++
>  arch/powerpc/kernel/smp.c           | 44 +++++++++++++++++++++++++++++
>  arch/powerpc/mm/numa.c              |  5 ++++
>  3 files changed, 59 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
> index f0b6300e7dd3..6609174918ab 100644
> --- a/arch/powerpc/include/asm/topology.h
> +++ b/arch/powerpc/include/asm/topology.h
> @@ -88,12 +88,22 @@ static inline int cpu_distance(__be32 *cpu1_assoc, __be32 *cpu2_assoc)
> 
>  #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR)
>  extern int find_and_online_cpu_nid(int cpu);
> +extern int cpu_to_coregroup_id(int cpu);
>  #else
>  static inline int find_and_online_cpu_nid(int cpu)
>  {
>  	return 0;
>  }
> 
> +static inline int cpu_to_coregroup_id(int cpu)
> +{
> +#ifdef CONFIG_SMP
> +	return cpu_to_core_id(cpu);
> +#else
> +	return 0;
> +#endif
> +}
> +
>  #endif /* CONFIG_NUMA && CONFIG_PPC_SPLPAR */
> 
>  #include <asm-generic/topology.h>
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index dab96a1203ec..95f0bf72e283 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -80,6 +80,7 @@ DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map);
>  DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map);
>  DEFINE_PER_CPU(cpumask_var_t, cpu_l2_cache_map);
>  DEFINE_PER_CPU(cpumask_var_t, cpu_core_map);
> +DEFINE_PER_CPU(cpumask_var_t, cpu_coregroup_map);
> 
>  EXPORT_PER_CPU_SYMBOL(cpu_sibling_map);
>  EXPORT_PER_CPU_SYMBOL(cpu_l2_cache_map);
> @@ -91,6 +92,7 @@ enum {
>  	smt_idx,
>  #endif
>  	bigcore_idx,
> +	mc_idx,
>  	die_idx,
>  };
> 
> @@ -869,6 +871,21 @@ static const struct cpumask *smallcore_smt_mask(int cpu)
>  }
>  #endif
> 
> +static struct cpumask *cpu_coregroup_mask(int cpu)
> +{
> +	return per_cpu(cpu_coregroup_map, cpu);
> +}
> +
> +static bool has_coregroup_support(void)
> +{
> +	return coregroup_enabled;
> +}
> +
> +static const struct cpumask *cpu_mc_mask(int cpu)
> +{
> +	return cpu_coregroup_mask(cpu);
> +}
> +
>  static const struct cpumask *cpu_bigcore_mask(int cpu)
>  {
>  	return per_cpu(cpu_sibling_map, cpu);
> @@ -879,6 +896,7 @@ static struct sched_domain_topology_level powerpc_topology[] = {
>  	{ cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
>  #endif
>  	{ cpu_bigcore_mask, SD_INIT_NAME(BIGCORE) },
> +	{ cpu_mc_mask, SD_INIT_NAME(MC) },
>  	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
>  	{ NULL, },
>  };
> @@ -925,6 +943,10 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
>  					GFP_KERNEL, cpu_to_node(cpu));
>  		zalloc_cpumask_var_node(&per_cpu(cpu_core_map, cpu),
>  					GFP_KERNEL, cpu_to_node(cpu));
> +		if (has_coregroup_support())
> +			zalloc_cpumask_var_node(&per_cpu(cpu_coregroup_map, cpu),
> +						GFP_KERNEL, cpu_to_node(cpu));
> +
>  #ifdef CONFIG_NEED_MULTIPLE_NODES
>  		/*
>  		 * numa_node_id() works after this.
> @@ -942,6 +964,9 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
>  	cpumask_set_cpu(boot_cpuid, cpu_l2_cache_mask(boot_cpuid));
>  	cpumask_set_cpu(boot_cpuid, cpu_core_mask(boot_cpuid));
> 
> +	if (has_coregroup_support())
> +		cpumask_set_cpu(boot_cpuid, cpu_coregroup_mask(boot_cpuid));
> +
>  	init_big_cores();
>  	if (has_big_cores) {
>  		cpumask_set_cpu(boot_cpuid,
> @@ -1233,6 +1258,8 @@ static void remove_cpu_from_masks(int cpu)
>  		set_cpus_unrelated(cpu, i, cpu_sibling_mask);
>  		if (has_big_cores)
>  			set_cpus_unrelated(cpu, i, cpu_smallcore_mask);
> +		if (has_coregroup_support())
> +			set_cpus_unrelated(cpu, i, cpu_coregroup_mask);
>  	}
>  }
>  #endif
> @@ -1293,6 +1320,20 @@ static void add_cpu_to_masks(int cpu)
>  	add_cpu_to_smallcore_masks(cpu);
>  	update_mask_by_l2(cpu, cpu_l2_cache_mask);
> 
> +	if (has_coregroup_support()) {
> +		int coregroup_id = cpu_to_coregroup_id(cpu);
> +
> +		cpumask_set_cpu(cpu, cpu_coregroup_mask(cpu));
> +		for_each_cpu_and(i, cpu_online_mask, cpu_cpu_mask(cpu)) {
> +			int fcpu = cpu_first_thread_sibling(i);
> +
> +			if (fcpu == first_thread)
> +				set_cpus_related(cpu, i, cpu_coregroup_mask);
> +			else if (coregroup_id == cpu_to_coregroup_id(i))
> +				set_cpus_related(cpu, i, cpu_coregroup_mask);
> +		}
> +	}
> +
>  	if (pkg_id == -1) {
>  		struct cpumask *(*mask)(int) = cpu_sibling_mask;
> 
> @@ -1398,6 +1439,9 @@ static void fixup_topology(void)
>  		powerpc_topology[bigcore_idx].name = "CACHE";
>  #endif
>  	}
> +
> +	if (!has_coregroup_support())
> +		powerpc_topology[mc_idx].mask = powerpc_topology[bigcore_idx].mask;
>  }
> 
>  void __init smp_cpus_done(unsigned int max_cpus)
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 51cb672f113b..0d57779e7942 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -1216,6 +1216,11 @@ int find_and_online_cpu_nid(int cpu)
>  	return new_nid;
>  }
> 
> +int cpu_to_coregroup_id(int cpu)
> +{
> +	return cpu_to_core_id(cpu);
> +}
> +
>  static int topology_update_init(void)
>  {
>  	topology_inited = 1;
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 09/10] Powerpc/smp: Create coregroup domain
  2020-07-27  5:32 ` [PATCH v4 09/10] Powerpc/smp: Create coregroup domain Srikar Dronamraju
  2020-07-27 18:52   ` Gautham R Shenoy
@ 2020-07-28 15:03   ` Valentin Schneider
  2020-07-29  6:13     ` Srikar Dronamraju
  1 sibling, 1 reply; 32+ messages in thread
From: Valentin Schneider @ 2020-07-28 15:03 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling, Peter Zijlstra,
	LKML, Nicholas Piggin, Oliver O'Halloran, Jordan Niethe,
	linuxppc-dev, Ingo Molnar


Hi,

On 27/07/20 06:32, Srikar Dronamraju wrote:
> Add percpu coregroup maps and masks to create coregroup domain.
> If a coregroup doesn't exist, the coregroup domain will be degenerated
> in favour of SMT/CACHE domain.
>

So there's at least one arm64 platform out there with the same "pairs of
cores share L2" thing (Ampere eMAG), and that lives quite happily with the
default scheduler topology (SMT/MC/DIE). Each pair of core gets its MC
domain, and the whole system is covered by DIE.

Now arguably it's not a perfect representation; DIE doesn't have
SD_SHARE_PKG_RESOURCES so the highest level sd_llc can point to is MC. That
will impact all callsites using cpus_share_cache(): in the eMAG case, only
pairs of cores will be seen as sharing cache, even though *all* cores share
the same L3.

I'm trying to paint a picture of what the P9 topology looks like (the one
you showcase in your cover letter) to see if there are any similarities;
from what I gather in [1], wikichips and your cover letter, with P9 you can
have something like this in a single DIE (somewhat unsure about L3 setup;
it looks to be distributed?)

     +---------------------------------------------------------------------+
     |                                  L3                                 |
     +---------------+-+---------------+-+---------------+-+---------------+
     |       L2      | |       L2      | |       L2      | |       L2      |
     +------+-+------+ +------+-+------+ +------+-+------+ +------+-+------+
     |  L1  | |  L1  | |  L1  | |  L1  | |  L1  | |  L1  | |  L1  | |  L1  |
     +------+ +------+ +------+ +------+ +------+ +------+ +------+ +------+
     |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs|
     +------+ +------+ +------+ +------+ +------+ +------+ +------+ +------+

Which would lead to (ignoring the whole SMT CPU numbering shenanigans)

NUMA     [                                                   ...
DIE      [                                             ]
MC       [         ] [         ] [         ] [         ]
BIGCORE  [         ] [         ] [         ] [         ]
SMT      [   ] [   ] [   ] [   ] [   ] [   ] [   ] [   ]
         00-03 04-07 08-11 12-15 16-19 20-23 24-27 28-31  <other node here>

This however has MC == BIGCORE; what makes it you can have different spans
for these two domains? If it's not too much to ask, I'd love to have a P9
topology diagram.

[1]: 20200722081822.GG9290@linux.vnet.ibm.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 09/10] Powerpc/smp: Create coregroup domain
  2020-07-28 15:03   ` Valentin Schneider
@ 2020-07-29  6:13     ` Srikar Dronamraju
  2020-07-31  1:05       ` Valentin Schneider
  2020-07-31  7:36       ` Gautham R Shenoy
  0 siblings, 2 replies; 32+ messages in thread
From: Srikar Dronamraju @ 2020-07-29  6:13 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling, Peter Zijlstra,
	LKML, Nicholas Piggin, Oliver O'Halloran, Jordan Niethe,
	linuxppc-dev, Ingo Molnar

* Valentin Schneider <valentin.schneider@arm.com> [2020-07-28 16:03:11]:

Hi Valentin,

Thanks for looking into the patches.

> On 27/07/20 06:32, Srikar Dronamraju wrote:
> > Add percpu coregroup maps and masks to create coregroup domain.
> > If a coregroup doesn't exist, the coregroup domain will be degenerated
> > in favour of SMT/CACHE domain.
> >
> 
> So there's at least one arm64 platform out there with the same "pairs of
> cores share L2" thing (Ampere eMAG), and that lives quite happily with the
> default scheduler topology (SMT/MC/DIE). Each pair of core gets its MC
> domain, and the whole system is covered by DIE.
> 
> Now arguably it's not a perfect representation; DIE doesn't have
> SD_SHARE_PKG_RESOURCES so the highest level sd_llc can point to is MC. That
> will impact all callsites using cpus_share_cache(): in the eMAG case, only
> pairs of cores will be seen as sharing cache, even though *all* cores share
> the same L3.
> 

Okay, Its good to know that we have a chip which is similar to P9 in
topology.

> I'm trying to paint a picture of what the P9 topology looks like (the one
> you showcase in your cover letter) to see if there are any similarities;
> from what I gather in [1], wikichips and your cover letter, with P9 you can
> have something like this in a single DIE (somewhat unsure about L3 setup;
> it looks to be distributed?)
> 
>      +---------------------------------------------------------------------+
>      |                                  L3                                 |
>      +---------------+-+---------------+-+---------------+-+---------------+
>      |       L2      | |       L2      | |       L2      | |       L2      |
>      +------+-+------+ +------+-+------+ +------+-+------+ +------+-+------+
>      |  L1  | |  L1  | |  L1  | |  L1  | |  L1  | |  L1  | |  L1  | |  L1  |
>      +------+ +------+ +------+ +------+ +------+ +------+ +------+ +------+
>      |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs|
>      +------+ +------+ +------+ +------+ +------+ +------+ +------+ +------+
> 
> Which would lead to (ignoring the whole SMT CPU numbering shenanigans)
> 
> NUMA     [                                                   ...
> DIE      [                                             ]
> MC       [         ] [         ] [         ] [         ]
> BIGCORE  [         ] [         ] [         ] [         ]
> SMT      [   ] [   ] [   ] [   ] [   ] [   ] [   ] [   ]
>          00-03 04-07 08-11 12-15 16-19 20-23 24-27 28-31  <other node here>
> 

What you have summed up is perfectly what a P9 topology looks like. I dont
think I could have explained it better than this.

> This however has MC == BIGCORE; what makes it you can have different spans
> for these two domains? If it's not too much to ask, I'd love to have a P9
> topology diagram.
> 
> [1]: 20200722081822.GG9290@linux.vnet.ibm.com

At this time the current topology would be good enough i.e BIGCORE would
always be equal to a MC. However in future we could have chips that can have
lesser/larger number of CPUs in llc than in a BIGCORE or we could have
granular or split L3 caches within a DIE. In such a case BIGCORE != MC.

Also in the current P9 itself, two neighbouring core-pairs form a quad.
Cache latency within a quad is better than a latency to a distant core-pair.
Cache latency within a core pair is way better than latency within a quad.
So if we have only 4 threads running on a DIE all of them accessing the same
cache-lines, then we could probably benefit if all the tasks were to run
within the quad aka MC/Coregroup.

I have found some benchmarks which are latency sensitive to benefit by
having a grouping a quad level (using kernel hacks and not backed by
firmware changes). Gautham also found similar results in his experiments
but he only used binding within the stock kernel.

I am not setting SD_SHARE_PKG_RESOURCES in MC/Coregroup sd_flags as in MC
domain need not be LLC domain for Power.

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 06/10] powerpc/smp: Generalize 2nd sched domain
  2020-07-27  5:32 ` [PATCH v4 06/10] powerpc/smp: Generalize 2nd sched domain Srikar Dronamraju
@ 2020-07-30  5:55   ` Gautham R Shenoy
  2020-07-31  7:45   ` Michael Ellerman
  1 sibling, 0 replies; 32+ messages in thread
From: Gautham R Shenoy @ 2020-07-30  5:55 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling, Peter Zijlstra,
	LKML, Nicholas Piggin, Valentin Schneider, Oliver O'Halloran,
	Jordan Niethe, linuxppc-dev, Ingo Molnar

On Mon, Jul 27, 2020 at 11:02:26AM +0530, Srikar Dronamraju wrote:
> Currently "CACHE" domain happens to be the 2nd sched domain as per
> powerpc_topology. This domain will collapse if cpumask of l2-cache is
> same as SMT domain. However we could generalize this domain such that it
> could mean either be a "CACHE" domain or a "BIGCORE" domain.
> 
> While setting up the "CACHE" domain, check if shared_cache is already
> set.
> 
> Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
> Cc: LKML <linux-kernel@vger.kernel.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Anton Blanchard <anton@ozlabs.org>
> Cc: Oliver O'Halloran <oohall@gmail.com>
> Cc: Nathan Lynch <nathanl@linux.ibm.com>
> Cc: Michael Neuling <mikey@neuling.org>
> Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Valentin Schneider <valentin.schneider@arm.com>
> Cc: Jordan Niethe <jniethe5@gmail.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>

> ---
> Changelog v1 -> v2:
> 	Moved shared_cache topology fixup to fixup_topology (Gautham)
> 
>  arch/powerpc/kernel/smp.c | 48 +++++++++++++++++++++++++++------------
>  1 file changed, 34 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index d997c7411664..3c5ccf6d2b1c 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -85,6 +85,14 @@ EXPORT_PER_CPU_SYMBOL(cpu_l2_cache_map);
>  EXPORT_PER_CPU_SYMBOL(cpu_core_map);
>  EXPORT_SYMBOL_GPL(has_big_cores);
> 
> +enum {
> +#ifdef CONFIG_SCHED_SMT
> +	smt_idx,
> +#endif
> +	bigcore_idx,
> +	die_idx,
> +};
> +
>  #define MAX_THREAD_LIST_SIZE	8
>  #define THREAD_GROUP_SHARE_L1   1
>  struct thread_groups {
> @@ -851,13 +859,7 @@ static int powerpc_shared_cache_flags(void)
>   */
>  static const struct cpumask *shared_cache_mask(int cpu)
>  {
> -	if (shared_caches)
> -		return cpu_l2_cache_mask(cpu);
> -
> -	if (has_big_cores)
> -		return cpu_smallcore_mask(cpu);
> -
> -	return per_cpu(cpu_sibling_map, cpu);
> +	return per_cpu(cpu_l2_cache_map, cpu);
>  }
> 
>  #ifdef CONFIG_SCHED_SMT
> @@ -867,11 +869,16 @@ static const struct cpumask *smallcore_smt_mask(int cpu)
>  }
>  #endif
> 
> +static const struct cpumask *cpu_bigcore_mask(int cpu)
> +{
> +	return per_cpu(cpu_sibling_map, cpu);
> +}
> +
>  static struct sched_domain_topology_level powerpc_topology[] = {
>  #ifdef CONFIG_SCHED_SMT
>  	{ cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
>  #endif
> -	{ shared_cache_mask, powerpc_shared_cache_flags, SD_INIT_NAME(CACHE) },
> +	{ cpu_bigcore_mask, SD_INIT_NAME(BIGCORE) },
>  	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
>  	{ NULL, },
>  };
> @@ -1311,7 +1318,6 @@ static void add_cpu_to_masks(int cpu)
>  void start_secondary(void *unused)
>  {
>  	unsigned int cpu = smp_processor_id();
> -	struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask;
> 
>  	mmgrab(&init_mm);
>  	current->active_mm = &init_mm;
> @@ -1337,14 +1343,20 @@ void start_secondary(void *unused)
>  	/* Update topology CPU masks */
>  	add_cpu_to_masks(cpu);
> 
> -	if (has_big_cores)
> -		sibling_mask = cpu_smallcore_mask;
>  	/*
>  	 * Check for any shared caches. Note that this must be done on a
>  	 * per-core basis because one core in the pair might be disabled.
>  	 */
> -	if (!cpumask_equal(cpu_l2_cache_mask(cpu), sibling_mask(cpu)))
> -		shared_caches = true;
> +	if (!shared_caches) {
> +		struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask;
> +		struct cpumask *mask = cpu_l2_cache_mask(cpu);
> +
> +		if (has_big_cores)
> +			sibling_mask = cpu_smallcore_mask;
> +
> +		if (cpumask_weight(mask) > cpumask_weight(sibling_mask(cpu)))
> +			shared_caches = true;
> +	}
> 
>  	set_numa_node(numa_cpu_lookup_table[cpu]);
>  	set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu]));
> @@ -1375,9 +1387,17 @@ static void fixup_topology(void)
>  #ifdef CONFIG_SCHED_SMT
>  	if (has_big_cores) {
>  		pr_info("Big cores detected but using small core scheduling\n");
> -		powerpc_topology[0].mask = smallcore_smt_mask;
> +		powerpc_topology[smt_idx].mask = smallcore_smt_mask;
>  	}
>  #endif
> +	if (shared_caches) {
> +		pr_info("Using shared cache scheduler topology\n");
> +		powerpc_topology[bigcore_idx].mask = shared_cache_mask;
> +		powerpc_topology[bigcore_idx].sd_flags = powerpc_shared_cache_flags;
> +#ifdef CONFIG_SCHED_DEBUG
> +		powerpc_topology[bigcore_idx].name = "CACHE";
> +#endif
> +	}
>  }
> 
>  void __init smp_cpus_done(unsigned int max_cpus)
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 00/10] Coregroup support on Powerpc
  2020-07-27  5:32 [PATCH v4 00/10] Coregroup support on Powerpc Srikar Dronamraju
                   ` (9 preceding siblings ...)
  2020-07-27  5:32 ` [PATCH v4 10/10] powerpc/smp: Implement cpu_to_coregroup_id Srikar Dronamraju
@ 2020-07-30 17:22 ` Srikar Dronamraju
  10 siblings, 0 replies; 32+ messages in thread
From: Srikar Dronamraju @ 2020-07-30 17:22 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Oliver OHalloran,
	Michael Neuling, Michael Ellerman, Peter Zijlstra, Jordan Niethe,
	Anton Blanchard, LKML, Ingo Molnar, Nick Piggin, linuxppc-dev,
	Valentin Schneider

* Srikar Dronamraju <srikar@linux.vnet.ibm.com> [2020-07-27 11:02:20]:

> Changelog v3 ->v4:
> v3: https://lore.kernel.org/lkml/20200723085116.4731-1-srikar@linux.vnet.ibm.com/t/#u
>

Here is a summary of some of the testing done with coregroup v4 patchsets.
It includes ebizzy, schbench, perf bench sched pipe and topology verification.
One the left side are results from powerpc/next tree and on the right are the
results with the patchset applied.  Topological verification clearly shows that
there is no change in topology with and without the patches on all the 3 class
of systems that were tested.

On PowerPc/Next                                                            On Powerpc/next + Coregroup Support v4 patchset

Power 9 PowerNV (2 Node/ 160 Cpu System)
---------------------------------
ebizzy (Throughput of 100 iterations of 30 seconds higher throughput is better)
  N      Min       Max    Median       Avg        Stddev                  N      Min       Max    Median       Avg      Stddev
100   993884   1276090   1173476   1165914     54867.201                100   910470   1279820   1171095   1162091    67363.28

schbench (latency hence lower is better)
Latency percentiles (usec)                                              Latency percentiles (usec)
        50.0th: 455                                                             50.0th: 454
        75.0th: 533                                                             75.0th: 543
        90.0th: 683                                                             90.0th: 701
        95.0th: 743                                                             95.0th: 737
        *99.0th: 815                                                            *99.0th: 805
        99.5th: 839                                                             99.5th: 835
        99.9th: 913                                                             99.9th: 893
        min=0, max=1011                                                         min=0, max=2833

perf bench sched pipe (lesser time and higher ops/sec is better)
# Running 'sched/pipe' benchmark:                                       # Running 'sched/pipe' benchmark:
# Executed 1000000 pipe operations between two processes                # Executed 1000000 pipe operations between two processes

     Total time: 6.083 [sec]                                                 Total time: 6.303 [sec]

       6.083576 usecs/op                                                       6.303318 usecs/op
         164377 ops/sec                                                          158646 ops/sec


Power 9 LPAR (2 Node/ 128 Cpu System)
---------------------------------
ebizzy (Throughput of 100 iterations of 30 seconds higher throughput is better)
  N       Min       Max    Median         Avg      Stddev                 N       Min       Max    Median         Avg      Stddev
100   1058029   1295393   1200414   1188306.7   56786.538               100    943264   1287619   1180522   1168473.2   64469.955

schbench (latency hence lower is better)
Latency percentiles (usec)                                                Latency percentiles (usec)
        50.0000th: 34                                                             50.0000th: 39
        75.0000th: 46                                                             75.0000th: 52
        90.0000th: 53                                                             90.0000th: 68
        95.0000th: 56                                                             95.0000th: 77
        *99.0000th: 61                                                            *99.0000th: 89
        99.5000th: 63                                                             99.5000th: 94
        99.9000th: 81                                                             99.9000th: 169
        min=0, max=8405                                                           min=0, max=23674

perf bench sched pipe (lesser time and higher ops/sec is better)
# Running 'sched/pipe' benchmark:                                        # Running 'sched/pipe' benchmark:
# Executed 1000000 pipe operations between two processes                 # Executed 1000000 pipe operations between two processes

     Total time: 8.768 [sec]                                                      Total time: 5.217 [sec]

       8.768400 usecs/op                                                            5.217625 usecs/op
         114045 ops/sec                                                               191658 ops/sec

Power 8 LPAR (8 Node/ 256 Cpu System)
---------------------------------
ebizzy (Throughput of 100 iterations of 30 seconds higher throughput is better)
  N       Min       Max    Median         Avg      Stddev               N      Min      Max   Median        Avg     Stddev
100   1267615   1965234   1707423   1689137.6   144363.29             100  1175357  1924262  1691104  1664792.1   145876.4

schbench (latency hence lower is better)
Latency percentiles (usec)                                             Latency percentiles (usec)
        50.0th: 37                                                             50.0th: 36
        75.0th: 51                                                             75.0th: 48
        90.0th: 59                                                             90.0th: 55
        95.0th: 63                                                             95.0th: 59
        *99.0th: 71                                                            *99.0th: 67
        99.5th: 75                                                             99.5th: 72
        99.9th: 105                                                            99.9th: 170
        min=0, max=18560                                                       min=0, max=27031

perf bench sched pipe (lesser time and higher ops/sec is better)
# Running 'sched/pipe' benchmark:                                       # Running 'sched/pipe' benchmark:
# Executed 1000000 pipe operations between two processes               # Executed 1000000 pipe operations between two processes

     Total time: 6.013 [sec]                                                 Total time: 5.930 [sec]

       6.013963 usecs/op                                                        5.930724 usecs/op
         166279 ops/sec                                                           168613 ops/sec

Topology verification on Power9
Power9/ PowerNV / SMT4

tail -f /proc/cpuinfo
---------------------
cpu                : POWER9, altivec supported
clock                : 3600.000000MHz
revision        : 2.2 (pvr 004e 1202)

timebase        : 512000000
platform        : PowerNV
model                : 9006-22P
machine                : PowerNV 9006-22P
firmware        : OPAL
MMU                : Radix

On PowerPc/Next                                                            On Powerpc/next + Coregroup Support v4 patchset
lscpu                                                                      lscpu
------                                                                     ------
Architecture:        ppc64le                                               Architecture:        ppc64le
Byte Order:          Little Endian                                         Byte Order:          Little Endian
CPU(s):              160                                                   CPU(s):              160
On-line CPU(s) list: 0-159                                                 On-line CPU(s) list: 0-159
Thread(s) per core:  4                                                     Thread(s) per core:  4
Core(s) per socket:  20                                                    Core(s) per socket:  20
Socket(s):           2                                                     Socket(s):           2
NUMA node(s):        2                                                     NUMA node(s):        2
Model:               2.2 (pvr 004e 1202)                                   Model:               2.2 (pvr 004e 1202)
Model name:          POWER9, altivec supported                             Model name:          POWER9, altivec supported
CPU max MHz:         3800.0000                                             CPU max MHz:         3800.0000
CPU min MHz:         2166.0000                                             CPU min MHz:         2166.0000
L1d cache:           32K                                                   L1d cache:           32K
L1i cache:           32K                                                   L1i cache:           32K
L2 cache:            512K                                                  L2 cache:            512K
L3 cache:            10240K                                                L3 cache:            10240K
NUMA node0 CPU(s):   0-79                                                  NUMA node0 CPU(s):   0-79
NUMA node8 CPU(s):   80-159                                                NUMA node8 CPU(s):   80-159

grep . /proc/sys/kernel/sched_domain/cpu0/domain*/name                     grep . /proc/sys/kernel/sched_domain/cpu0/domain*/name
-----------------------------------------------------                      -----------------------------------------------------
/proc/sys/kernel/sched_domain/cpu0/domain0/name:SMT                        /proc/sys/kernel/sched_domain/cpu0/domain0/name:SMT
/proc/sys/kernel/sched_domain/cpu0/domain1/name:CACHE                      /proc/sys/kernel/sched_domain/cpu0/domain1/name:CACHE
/proc/sys/kernel/sched_domain/cpu0/domain2/name:DIE                        /proc/sys/kernel/sched_domain/cpu0/domain2/name:DIE
/proc/sys/kernel/sched_domain/cpu0/domain3/name:NUMA                       /proc/sys/kernel/sched_domain/cpu0/domain3/name:NUMA

grep . /proc/sys/kernel/sched_domain/cpu0/domain*/flags                    grep . /proc/sys/kernel/sched_domain/cpu0/domain*/flags
------------------------------------------------------                     ------------------------------------------------------
/proc/sys/kernel/sched_domain/cpu0/domain0/flags:2391                      /proc/sys/kernel/sched_domain/cpu0/domain0/flags:2391
/proc/sys/kernel/sched_domain/cpu0/domain1/flags:2327                      /proc/sys/kernel/sched_domain/cpu0/domain1/flags:2327
/proc/sys/kernel/sched_domain/cpu0/domain2/flags:2071                      /proc/sys/kernel/sched_domain/cpu0/domain2/flags:2071
/proc/sys/kernel/sched_domain/cpu0/domain3/flags:12801                     /proc/sys/kernel/sched_domain/cpu0/domain3/flags:12801


On PowerPc/Next
head /proc/schedstat
--------------------
version 15
timestamp 4295043536
cpu0 0 0 0 0 0 0 9597119314 2408913694 11897
domain0 00000000,00000000,00000000,00000000,0000000f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain2 00000000,00000000,0000ffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain3 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
cpu1 0 0 0 0 0 0 4941435230 11106132 1583
domain0 00000000,00000000,00000000,00000000,0000000f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

On Powerpc/next + Coregroup Support v4 patchset
head /proc/schedstat
--------------------
version 15
timestamp 4296311826
cpu0 0 0 0 0 0 0 3353674045024 3781680865826 297483
domain0 00000000,00000000,00000000,00000000,0000000f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain2 00000000,00000000,0000ffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain3 ffffffff,ffffffff,ffffffff,ffffffff,ffffffff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
cpu1 0 0 0 0 0 0 3337873293332 4231590033856 229090
domain0 00000000,00000000,00000000,00000000,0000000f 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00000000,00000000,000000ff 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0


Post sudo ppc64_cpu --smt=1                                                     Post sudo ppc64_cpu --smt=1
---------------------                                                           ---------------------
grep . /proc/sys/kernel/sched_domain/cpu0/domain*/name                          grep . /proc/sys/kernel/sched_domain/cpu0/domain*/name
-----------------------------------------------------                           -----------------------------------------------------
/proc/sys/kernel/sched_domain/cpu0/domain0/name:CACHE                           /proc/sys/kernel/sched_domain/cpu0/domain0/name:CACHE
/proc/sys/kernel/sched_domain/cpu0/domain1/name:DIE                             /proc/sys/kernel/sched_domain/cpu0/domain1/name:DIE
/proc/sys/kernel/sched_domain/cpu0/domain2/name:NUMA                            /proc/sys/kernel/sched_domain/cpu0/domain2/name:NUMA

grep . /proc/sys/kernel/sched_domain/cpu0/domain*/flags                         grep . /proc/sys/kernel/sched_domain/cpu0/domain*/flags
------------------------------------------------------                          ------------------------------------------------------
/proc/sys/kernel/sched_domain/cpu0/domain0/flags:2327                           /proc/sys/kernel/sched_domain/cpu0/domain0/flags:2327
/proc/sys/kernel/sched_domain/cpu0/domain1/flags:2071                           /proc/sys/kernel/sched_domain/cpu0/domain1/flags:2071
/proc/sys/kernel/sched_domain/cpu0/domain2/flags:12801                          /proc/sys/kernel/sched_domain/cpu0/domain2/flags:12801


On Powerpc/next
head /proc/schedstat
--------------------
version 15
timestamp 4295046242
cpu0 0 0 0 0 0 0 10978610020 2658997390 13068
domain0 00000000,00000000,00000000,00000000,00000011 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00001111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain2 91111111,11111111,11111111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
cpu4 0 0 0 0 0 0 5408663896 95701034 7697
domain0 00000000,00000000,00000000,00000000,00000011 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00001111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain2 91111111,11111111,11111111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

On Powerpc/next + Coregroup Support v4 patchset
head /proc/schedstat
--------------------
version 15
timestamp 4296314905
cpu0 0 0 0 0 0 0 3355392013536 3781975150576 298723
domain0 00000000,00000000,00000000,00000000,00000011 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00001111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain2 91111111,11111111,11111111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
cpu4 0 0 0 0 0 0 3351637920996 4427329763050 256776
domain0 00000000,00000000,00000000,00000000,00000011 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain1 00000000,00000000,00001111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
domain2 91111111,11111111,11111111,11111111,11111111 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Similar verification was done on Power 8 (8 Node 256 CPU LPAR) and Power 9 (2
node 128 Cpu LPAR) and they showed the topology before and after the patch to be
identical. If Interested, I could provide the same.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 09/10] Powerpc/smp: Create coregroup domain
  2020-07-29  6:13     ` Srikar Dronamraju
@ 2020-07-31  1:05       ` Valentin Schneider
  2020-08-03  6:01         ` Srikar Dronamraju
  2020-07-31  7:36       ` Gautham R Shenoy
  1 sibling, 1 reply; 32+ messages in thread
From: Valentin Schneider @ 2020-07-31  1:05 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling, Peter Zijlstra,
	LKML, Nicholas Piggin, Morten Rasmussen, Oliver O'Halloran,
	Jordan Niethe, linuxppc-dev, Ingo Molnar


(+Cc Morten)

On 29/07/20 07:13, Srikar Dronamraju wrote:
> * Valentin Schneider <valentin.schneider@arm.com> [2020-07-28 16:03:11]:
>
> Hi Valentin,
>
> Thanks for looking into the patches.
>
>> On 27/07/20 06:32, Srikar Dronamraju wrote:
>> > Add percpu coregroup maps and masks to create coregroup domain.
>> > If a coregroup doesn't exist, the coregroup domain will be degenerated
>> > in favour of SMT/CACHE domain.
>> >
>>
>> So there's at least one arm64 platform out there with the same "pairs of
>> cores share L2" thing (Ampere eMAG), and that lives quite happily with the
>> default scheduler topology (SMT/MC/DIE). Each pair of core gets its MC
>> domain, and the whole system is covered by DIE.
>>
>> Now arguably it's not a perfect representation; DIE doesn't have
>> SD_SHARE_PKG_RESOURCES so the highest level sd_llc can point to is MC. That
>> will impact all callsites using cpus_share_cache(): in the eMAG case, only
>> pairs of cores will be seen as sharing cache, even though *all* cores share
>> the same L3.
>>
>
> Okay, Its good to know that we have a chip which is similar to P9 in
> topology.
>
>> I'm trying to paint a picture of what the P9 topology looks like (the one
>> you showcase in your cover letter) to see if there are any similarities;
>> from what I gather in [1], wikichips and your cover letter, with P9 you can
>> have something like this in a single DIE (somewhat unsure about L3 setup;
>> it looks to be distributed?)
>>
>>      +---------------------------------------------------------------------+
>>      |                                  L3                                 |
>>      +---------------+-+---------------+-+---------------+-+---------------+
>>      |       L2      | |       L2      | |       L2      | |       L2      |
>>      +------+-+------+ +------+-+------+ +------+-+------+ +------+-+------+
>>      |  L1  | |  L1  | |  L1  | |  L1  | |  L1  | |  L1  | |  L1  | |  L1  |
>>      +------+ +------+ +------+ +------+ +------+ +------+ +------+ +------+
>>      |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs| |4 CPUs|
>>      +------+ +------+ +------+ +------+ +------+ +------+ +------+ +------+
>>
>> Which would lead to (ignoring the whole SMT CPU numbering shenanigans)
>>
>> NUMA     [                                                   ...
>> DIE      [                                             ]
>> MC       [         ] [         ] [         ] [         ]
>> BIGCORE  [         ] [         ] [         ] [         ]
>> SMT      [   ] [   ] [   ] [   ] [   ] [   ] [   ] [   ]
>>          00-03 04-07 08-11 12-15 16-19 20-23 24-27 28-31  <other node here>
>>
>
> What you have summed up is perfectly what a P9 topology looks like. I dont
> think I could have explained it better than this.
>

Yay!

>> This however has MC == BIGCORE; what makes it you can have different spans
>> for these two domains? If it's not too much to ask, I'd love to have a P9
>> topology diagram.
>>
>> [1]: 20200722081822.GG9290@linux.vnet.ibm.com
>
> At this time the current topology would be good enough i.e BIGCORE would
> always be equal to a MC. However in future we could have chips that can have
> lesser/larger number of CPUs in llc than in a BIGCORE or we could have
> granular or split L3 caches within a DIE. In such a case BIGCORE != MC.
>

Right, that one's fair enough.

> Also in the current P9 itself, two neighbouring core-pairs form a quad.
> Cache latency within a quad is better than a latency to a distant core-pair.
> Cache latency within a core pair is way better than latency within a quad.
> So if we have only 4 threads running on a DIE all of them accessing the same
> cache-lines, then we could probably benefit if all the tasks were to run
> within the quad aka MC/Coregroup.
>

Did you test this? WRT load balance we do try to balance "load" over the
different domain spans, so if you represent quads as their own MC domain,
you would AFAICT end up spreading tasks over the quads (rather than packing
them) when balancing at e.g. DIE level. The desired behaviour might be
hackable with some more ASYM_PACKING, but I'm not sure I should be
suggesting that :-)

> I have found some benchmarks which are latency sensitive to benefit by
> having a grouping a quad level (using kernel hacks and not backed by
> firmware changes). Gautham also found similar results in his experiments
> but he only used binding within the stock kernel.
>

IIUC you reflect this "fabric quirk" (i.e. coregroups) using this DT
binding thing.

That's also where things get interesting (for me) because I experienced
something similar on another arm64 platform (ThunderX1). This was more
about cache bandwidth than cache latency, but IMO it's in the same bag of
fabric quirks. I blabbered a bit about this at last LPC [1], but kind of
gave up on it given the TX1 was the only (arm64) platform where I could get
both significant and reproducible results.

Now, if you folks are seeing this on completely different hardware and have
"real" workloads that truly benefit from this kind of domain partitioning,
this might be another incentive to try and sort of generalize this. That's
outside the scope of your series, but your findings give me some hope!

I think what I had in mind back then was that if enough folks cared about
it, we might get some bits added to the ACPI spec; something along the
lines of proximity domains for the caches described in PPTT, IOW a cache
distance matrix. I don't really know what it'll take to get there, but I
figured I'd dump this in case someone's listening :-)

> I am not setting SD_SHARE_PKG_RESOURCES in MC/Coregroup sd_flags as in MC
> domain need not be LLC domain for Power.

From what I understood your MC domain does seem to map to LLC; but in any
case, shouldn't you set that flag at least for BIGCORE (i.e. L2)? AIUI with
your changes your sd_llc is gonna be SMT, and that's not going to be a very
big mask. IMO you do want to correctly reflect your LLC situation via this
flag to make cpus_share_cache() work properly.

[1]: https://linuxplumbersconf.org/event/4/contributions/484/

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 09/10] Powerpc/smp: Create coregroup domain
  2020-07-29  6:13     ` Srikar Dronamraju
  2020-07-31  1:05       ` Valentin Schneider
@ 2020-07-31  7:36       ` Gautham R Shenoy
  1 sibling, 0 replies; 32+ messages in thread
From: Gautham R Shenoy @ 2020-07-31  7:36 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling, Peter Zijlstra,
	LKML, Nicholas Piggin, Ingo Molnar, Oliver O'Halloran,
	Jordan Niethe, linuxppc-dev, Valentin Schneider

Hi Srikar, Valentin,

On Wed, Jul 29, 2020 at 11:43:55AM +0530, Srikar Dronamraju wrote:
> * Valentin Schneider <valentin.schneider@arm.com> [2020-07-28 16:03:11]:
>

[..snip..]

> At this time the current topology would be good enough i.e BIGCORE would
> always be equal to a MC. However in future we could have chips that can have
> lesser/larger number of CPUs in llc than in a BIGCORE or we could have
> granular or split L3 caches within a DIE. In such a case BIGCORE != MC.
> 
> Also in the current P9 itself, two neighbouring core-pairs form a quad.
> Cache latency within a quad is better than a latency to a distant core-pair.
> Cache latency within a core pair is way better than latency within a quad.
> So if we have only 4 threads running on a DIE all of them accessing the same
> cache-lines, then we could probably benefit if all the tasks were to run
> within the quad aka MC/Coregroup.
> 
> I have found some benchmarks which are latency sensitive to benefit by
> having a grouping a quad level (using kernel hacks and not backed by
> firmware changes). Gautham also found similar results in his experiments
> but he only used binding within the stock kernel.
> 
> I am not setting SD_SHARE_PKG_RESOURCES in MC/Coregroup sd_flags as in MC
> domain need not be LLC domain for Power.

I am observing that SD_SHARE_PKG_RESOURCES at L2 provides the best
results for POWER9 in terms of cache-benefits during wakeup.  On a
POWER9 Boston machine, running a producer-consumer test case
(https://github.com/gautshen/misc/blob/master/producer_consumer/producer_consumer.c)

The test case creates two threads, one Producer and another
Consumer. Both work on a fairly large shared array of size 64M.  In an
interation the Producer performs stores to 1024 random locations and
wakes up the Consumer. In the Consumer's iteration, loads from those
exact 1024 locations.

We measure the number of Consumer iterations per second and the
average time for each Consumer iteration. The smaller the time, the
better it is.

The following results are when I pinned the Producer and Consumer to
different combinations of CPUs to cover Small core , Big-core,
Neighbouring Big-core, Far off core within the same chip, and across
chips. There is a also a case where they are not affined anywhere, and
we let the scheduler wake them up correctly.

We find the best results when the Producer and Consumer are within the
same L2 domain. These numbers are also close to the numbers that we
get when we let the Scheduler wake them up (where LLC is L2).

## Same Small core (4 threads: Shares L1, L2, L3, Frequency Domain)

Consumer affined to  CPU 3
Producer affined to  CPU 1
    4698 iterations, avg time: 20034 ns
    4951 iterations, avg time: 20012 ns
    4957 iterations, avg time: 19971 ns
    4968 iterations, avg time: 19985 ns
    4970 iterations, avg time: 19977 ns


## Same Big Core (8 threads: Shares L2, L3, Frequency Domain)

Consumer affined to  CPU 7
Producer affined to  CPU 1
    4580 iterations, avg time: 19403 ns
    4851 iterations, avg time: 19373 ns
    4849 iterations, avg time: 19394 ns
    4856 iterations, avg time: 19394 ns
    4867 iterations, avg time: 19353 ns

## Neighbouring Big-core (Faster data-snooping from L2. Shares L3, Frequency Domain)

Producer affined to  CPU 1
Consumer affined to  CPU 11
    4270 iterations, avg time: 24158 ns
    4491 iterations, avg time: 24157 ns
    4500 iterations, avg time: 24148 ns
    4516 iterations, avg time: 24164 ns
    4518 iterations, avg time: 24165 ns


## Any other Big-core from Same Chip (Shares L3)

Producer affined to  CPU 1
Consumer affined to  CPU 87
    4176 iterations, avg time: 27953 ns
    4417 iterations, avg time: 27925 ns
    4415 iterations, avg time: 27934 ns
    4417 iterations, avg time: 27983 ns
    4430 iterations, avg time: 27958 ns


## Different Chips (No cache-sharing)

Consumer affined to  CPU 175
Producer affined to  CPU 1
    3277 iterations, avg time: 50786 ns
    3063 iterations, avg time: 50732 ns
    2831 iterations, avg time: 50737 ns
    2859 iterations, avg time: 50688 ns
    2849 iterations, avg time: 50722 ns

## Without affining them (Let Scheduler wake-them up appropriately)
Consumer affined to  CPU 0-175
Producer affined to  CPU 0-175
    4821 iterations, avg time: 19412 ns
    4863 iterations, avg time: 19435 ns
    4855 iterations, avg time: 19381 ns
    4811 iterations, avg time: 19458 ns
    4892 iterations, avg time: 19429 ns


--
Thanks and Regards
gautham.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 06/10] powerpc/smp: Generalize 2nd sched domain
  2020-07-27  5:32 ` [PATCH v4 06/10] powerpc/smp: Generalize 2nd sched domain Srikar Dronamraju
  2020-07-30  5:55   ` Gautham R Shenoy
@ 2020-07-31  7:45   ` Michael Ellerman
  2020-07-31  9:29     ` Srikar Dronamraju
  1 sibling, 1 reply; 32+ messages in thread
From: Michael Ellerman @ 2020-07-31  7:45 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling,
	Srikar Dronamraju, Peter Zijlstra, Jordan Niethe, LKML,
	Nicholas Piggin, Valentin Schneider, Oliver O'Halloran,
	linuxppc-dev, Ingo Molnar

Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> Currently "CACHE" domain happens to be the 2nd sched domain as per
> powerpc_topology. This domain will collapse if cpumask of l2-cache is
> same as SMT domain. However we could generalize this domain such that it
> could mean either be a "CACHE" domain or a "BIGCORE" domain.
>
> While setting up the "CACHE" domain, check if shared_cache is already
> set.

PeterZ asked for some overview of what you're doing and why, you
responded to his mail, but I was expecting to see that text incorporated
here somewhere.

He also asked for some comments, which I would also like to see.


I'm also not clear why we want to rename it to "bigcore", that's not a
commonly understood term, I don't think it's clear to new readers what
it means.

Leaving it as the shared cache domain, and having a comment mentioning
that "bigcores" share a cache, would be clearer I think.

cheers


> Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
> Cc: LKML <linux-kernel@vger.kernel.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Anton Blanchard <anton@ozlabs.org>
> Cc: Oliver O'Halloran <oohall@gmail.com>
> Cc: Nathan Lynch <nathanl@linux.ibm.com>
> Cc: Michael Neuling <mikey@neuling.org>
> Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Valentin Schneider <valentin.schneider@arm.com>
> Cc: Jordan Niethe <jniethe5@gmail.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---
> Changelog v1 -> v2:
> 	Moved shared_cache topology fixup to fixup_topology (Gautham)
>
>  arch/powerpc/kernel/smp.c | 48 +++++++++++++++++++++++++++------------
>  1 file changed, 34 insertions(+), 14 deletions(-)
>
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index d997c7411664..3c5ccf6d2b1c 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -85,6 +85,14 @@ EXPORT_PER_CPU_SYMBOL(cpu_l2_cache_map);
>  EXPORT_PER_CPU_SYMBOL(cpu_core_map);
>  EXPORT_SYMBOL_GPL(has_big_cores);
>  
> +enum {
> +#ifdef CONFIG_SCHED_SMT
> +	smt_idx,
> +#endif
> +	bigcore_idx,
> +	die_idx,
> +};
> +
>  #define MAX_THREAD_LIST_SIZE	8
>  #define THREAD_GROUP_SHARE_L1   1
>  struct thread_groups {
> @@ -851,13 +859,7 @@ static int powerpc_shared_cache_flags(void)
>   */
>  static const struct cpumask *shared_cache_mask(int cpu)
>  {
> -	if (shared_caches)
> -		return cpu_l2_cache_mask(cpu);
> -
> -	if (has_big_cores)
> -		return cpu_smallcore_mask(cpu);
> -
> -	return per_cpu(cpu_sibling_map, cpu);
> +	return per_cpu(cpu_l2_cache_map, cpu);
>  }
>  
>  #ifdef CONFIG_SCHED_SMT
> @@ -867,11 +869,16 @@ static const struct cpumask *smallcore_smt_mask(int cpu)
>  }
>  #endif
>  
> +static const struct cpumask *cpu_bigcore_mask(int cpu)
> +{
> +	return per_cpu(cpu_sibling_map, cpu);
> +}
> +
>  static struct sched_domain_topology_level powerpc_topology[] = {
>  #ifdef CONFIG_SCHED_SMT
>  	{ cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
>  #endif
> -	{ shared_cache_mask, powerpc_shared_cache_flags, SD_INIT_NAME(CACHE) },
> +	{ cpu_bigcore_mask, SD_INIT_NAME(BIGCORE) },
>  	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
>  	{ NULL, },
>  };
> @@ -1311,7 +1318,6 @@ static void add_cpu_to_masks(int cpu)
>  void start_secondary(void *unused)
>  {
>  	unsigned int cpu = smp_processor_id();
> -	struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask;
>  
>  	mmgrab(&init_mm);
>  	current->active_mm = &init_mm;
> @@ -1337,14 +1343,20 @@ void start_secondary(void *unused)
>  	/* Update topology CPU masks */
>  	add_cpu_to_masks(cpu);
>  
> -	if (has_big_cores)
> -		sibling_mask = cpu_smallcore_mask;
>  	/*
>  	 * Check for any shared caches. Note that this must be done on a
>  	 * per-core basis because one core in the pair might be disabled.
>  	 */
> -	if (!cpumask_equal(cpu_l2_cache_mask(cpu), sibling_mask(cpu)))
> -		shared_caches = true;
> +	if (!shared_caches) {
> +		struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask;
> +		struct cpumask *mask = cpu_l2_cache_mask(cpu);
> +
> +		if (has_big_cores)
> +			sibling_mask = cpu_smallcore_mask;
> +
> +		if (cpumask_weight(mask) > cpumask_weight(sibling_mask(cpu)))
> +			shared_caches = true;
> +	}
>  
>  	set_numa_node(numa_cpu_lookup_table[cpu]);
>  	set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu]));
> @@ -1375,9 +1387,17 @@ static void fixup_topology(void)
>  #ifdef CONFIG_SCHED_SMT
>  	if (has_big_cores) {
>  		pr_info("Big cores detected but using small core scheduling\n");
> -		powerpc_topology[0].mask = smallcore_smt_mask;
> +		powerpc_topology[smt_idx].mask = smallcore_smt_mask;
>  	}
>  #endif
> +	if (shared_caches) {
> +		pr_info("Using shared cache scheduler topology\n");
> +		powerpc_topology[bigcore_idx].mask = shared_cache_mask;
> +		powerpc_topology[bigcore_idx].sd_flags = powerpc_shared_cache_flags;
> +#ifdef CONFIG_SCHED_DEBUG
> +		powerpc_topology[bigcore_idx].name = "CACHE";
> +#endif
> +	}
>  }
>  
>  void __init smp_cpus_done(unsigned int max_cpus)
> -- 
> 2.17.1

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 07/10] Powerpc/numa: Detect support for coregroup
  2020-07-27  5:32 ` [PATCH v4 07/10] Powerpc/numa: Detect support for coregroup Srikar Dronamraju
@ 2020-07-31  7:49   ` Michael Ellerman
  2020-07-31  9:18     ` Srikar Dronamraju
  0 siblings, 1 reply; 32+ messages in thread
From: Michael Ellerman @ 2020-07-31  7:49 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling,
	Srikar Dronamraju, Peter Zijlstra, Jordan Niethe, LKML,
	Nicholas Piggin, Valentin Schneider, Oliver O'Halloran,
	linuxppc-dev, Ingo Molnar

Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> Add support for grouping cores based on the device-tree classification.
> - The last domain in the associativity domains always refers to the
> core.
> - If primary reference domain happens to be the penultimate domain in
> the associativity domains device-tree property, then there are no
> coregroups. However if its not a penultimate domain, then there are
> coregroups. There can be more than one coregroup. For now we would be
> interested in the last or the smallest coregroups.

This still doesn't tell me what a coregroup actually represents.

I get that it's a grouping of cores, and that the device tree specifies
it for us, but grouping based on what?

I think the answer is we aren't being told by firmware, it's just a
grouping based on some opaque performance characteristic and we just
have to take that as given.

But please explain that clearly in the change log and the code comments.

cheers


> Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
> Cc: LKML <linux-kernel@vger.kernel.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Anton Blanchard <anton@ozlabs.org>
> Cc: Oliver O'Halloran <oohall@gmail.com>
> Cc: Nathan Lynch <nathanl@linux.ibm.com>
> Cc: Michael Neuling <mikey@neuling.org>
> Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Valentin Schneider <valentin.schneider@arm.com>
> Cc: Jordan Niethe <jniethe5@gmail.com>
> Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---
> Changelog v1 -> v2:
> 	Explained Coregroup in commit msg (Michael Ellerman)
>
>  arch/powerpc/include/asm/smp.h |  1 +
>  arch/powerpc/kernel/smp.c      |  1 +
>  arch/powerpc/mm/numa.c         | 34 +++++++++++++++++++++-------------
>  3 files changed, 23 insertions(+), 13 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
> index 49a25e2400f2..5bdc17a7049f 100644
> --- a/arch/powerpc/include/asm/smp.h
> +++ b/arch/powerpc/include/asm/smp.h
> @@ -28,6 +28,7 @@
>  extern int boot_cpuid;
>  extern int spinning_secondaries;
>  extern u32 *cpu_to_phys_id;
> +extern bool coregroup_enabled;
>  
>  extern void cpu_die(void);
>  extern int cpu_to_chip_id(int cpu);
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 3c5ccf6d2b1c..698000c7f76f 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -74,6 +74,7 @@ static DEFINE_PER_CPU(int, cpu_state) = { 0 };
>  
>  struct task_struct *secondary_current;
>  bool has_big_cores;
> +bool coregroup_enabled;
>  
>  DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map);
>  DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map);
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 2298899a0f0a..51cb672f113b 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -886,7 +886,9 @@ static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn)
>  static void __init find_possible_nodes(void)
>  {
>  	struct device_node *rtas;
> -	u32 numnodes, i;
> +	const __be32 *domains;
> +	int prop_length, max_nodes;
> +	u32 i;
>  
>  	if (!numa_enabled)
>  		return;
> @@ -895,25 +897,31 @@ static void __init find_possible_nodes(void)
>  	if (!rtas)
>  		return;
>  
> -	if (of_property_read_u32_index(rtas, "ibm,current-associativity-domains",
> -				min_common_depth, &numnodes)) {
> -		/*
> -		 * ibm,current-associativity-domains is a fairly recent
> -		 * property. If it doesn't exist, then fallback on
> -		 * ibm,max-associativity-domains. Current denotes what the
> -		 * platform can support compared to max which denotes what the
> -		 * Hypervisor can support.
> -		 */
> -		if (of_property_read_u32_index(rtas, "ibm,max-associativity-domains",
> -				min_common_depth, &numnodes))
> +	/*
> +	 * ibm,current-associativity-domains is a fairly recent property. If
> +	 * it doesn't exist, then fallback on ibm,max-associativity-domains.
> +	 * Current denotes what the platform can support compared to max
> +	 * which denotes what the Hypervisor can support.
> +	 */
> +	domains = of_get_property(rtas, "ibm,current-associativity-domains",
> +					&prop_length);
> +	if (!domains) {
> +		domains = of_get_property(rtas, "ibm,max-associativity-domains",
> +					&prop_length);
> +		if (!domains)
>  			goto out;
>  	}
>  
> -	for (i = 0; i < numnodes; i++) {
> +	max_nodes = of_read_number(&domains[min_common_depth], 1);
> +	for (i = 0; i < max_nodes; i++) {
>  		if (!node_possible(i))
>  			node_set(i, node_possible_map);
>  	}
>  
> +	prop_length /= sizeof(int);
> +	if (prop_length > min_common_depth + 2)
> +		coregroup_enabled = 1;
> +
>  out:
>  	of_node_put(rtas);
>  }
> -- 
> 2.17.1

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 08/10] powerpc/smp: Allocate cpumask only after searching thread group
  2020-07-27  5:32 ` [PATCH v4 08/10] powerpc/smp: Allocate cpumask only after searching thread group Srikar Dronamraju
@ 2020-07-31  7:52   ` Michael Ellerman
  2020-07-31  9:49     ` Srikar Dronamraju
  0 siblings, 1 reply; 32+ messages in thread
From: Michael Ellerman @ 2020-07-31  7:52 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling,
	Srikar Dronamraju, Peter Zijlstra, Jordan Niethe, LKML,
	Nicholas Piggin, Valentin Schneider, Oliver O'Halloran,
	linuxppc-dev, Ingo Molnar

Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> If allocated earlier and the search fails, then cpumask need to be
> freed. However cpu_l1_cache_map can be allocated after we search thread
> group.

It's not freed anywhere AFAICS?

And even after this change there's still an error path that doesn't free
it, isn't there?

cheers

> Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
> Cc: LKML <linux-kernel@vger.kernel.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Anton Blanchard <anton@ozlabs.org>
> Cc: Oliver O'Halloran <oohall@gmail.com>
> Cc: Nathan Lynch <nathanl@linux.ibm.com>
> Cc: Michael Neuling <mikey@neuling.org>
> Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Valentin Schneider <valentin.schneider@arm.com>
> Cc: Jordan Niethe <jniethe5@gmail.com>
> Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---
>  arch/powerpc/kernel/smp.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 698000c7f76f..dab96a1203ec 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -797,10 +797,6 @@ static int init_cpu_l1_cache_map(int cpu)
>  	if (err)
>  		goto out;
>  
> -	zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu),
> -				GFP_KERNEL,
> -				cpu_to_node(cpu));
> -
>  	cpu_group_start = get_cpu_thread_group_start(cpu, &tg);
>  
>  	if (unlikely(cpu_group_start == -1)) {
> @@ -809,6 +805,9 @@ static int init_cpu_l1_cache_map(int cpu)
>  		goto out;
>  	}
>  
> +	zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu),
> +				GFP_KERNEL, cpu_to_node(cpu));
> +
>  	for (i = first_thread; i < first_thread + threads_per_core; i++) {
>  		int i_group_start = get_cpu_thread_group_start(i, &tg);
>  
> -- 
> 2.17.1

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 10/10] powerpc/smp: Implement cpu_to_coregroup_id
  2020-07-27  5:32 ` [PATCH v4 10/10] powerpc/smp: Implement cpu_to_coregroup_id Srikar Dronamraju
@ 2020-07-31  8:02   ` Michael Ellerman
  2020-07-31  9:58     ` Srikar Dronamraju
  0 siblings, 1 reply; 32+ messages in thread
From: Michael Ellerman @ 2020-07-31  8:02 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling,
	Srikar Dronamraju, Peter Zijlstra, Jordan Niethe, LKML,
	Nicholas Piggin, Valentin Schneider, Oliver O'Halloran,
	linuxppc-dev, Ingo Molnar

Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> Lookup the coregroup id from the associativity array.

It's slightly strange that this is called in patch 9, but only properly
implemented here in patch 10.

I'm not saying you have to squash them together, but it would be good if
the change log for patch 9 mentioned that a subsequent commit will
complete the implementation and how that affects the behaviour.

cheers

> If unable to detect the coregroup id, fallback on the core id.
> This way, ensure sched_domain degenerates and an extra sched domain is
> not created.
>
> Ideally this function should have been implemented in
> arch/powerpc/kernel/smp.c. However if its implemented in mm/numa.c, we
> don't need to find the primary domain again.
>
> If the device-tree mentions more than one coregroup, then kernel
> implements only the last or the smallest coregroup, which currently
> corresponds to the penultimate domain in the device-tree.
>
> Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
> Cc: LKML <linux-kernel@vger.kernel.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Nicholas Piggin <npiggin@gmail.com>
> Cc: Anton Blanchard <anton@ozlabs.org>
> Cc: Oliver O'Halloran <oohall@gmail.com>
> Cc: Nathan Lynch <nathanl@linux.ibm.com>
> Cc: Michael Neuling <mikey@neuling.org>
> Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Valentin Schneider <valentin.schneider@arm.com>
> Cc: Jordan Niethe <jniethe5@gmail.com>
> Reviewed-by : Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---
> Changelog v1 -> v2:
> 	Move coregroup_enabled before getting associativity (Gautham)
>
>  arch/powerpc/mm/numa.c | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
>
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index 0d57779e7942..8b3b3ec7fcc4 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -1218,6 +1218,26 @@ int find_and_online_cpu_nid(int cpu)
>  
>  int cpu_to_coregroup_id(int cpu)
>  {
> +	__be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
> +	int index;
> +
> +	if (cpu < 0 || cpu > nr_cpu_ids)
> +		return -1;
> +
> +	if (!coregroup_enabled)
> +		goto out;
> +
> +	if (!firmware_has_feature(FW_FEATURE_VPHN))
> +		goto out;
> +
> +	if (vphn_get_associativity(cpu, associativity))
> +		goto out;
> +
> +	index = of_read_number(associativity, 1);
> +	if (index > min_common_depth + 1)
> +		return of_read_number(&associativity[index - 1], 1);
> +
> +out:
>  	return cpu_to_core_id(cpu);
>  }
>  
> -- 
> 2.17.1

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 07/10] Powerpc/numa: Detect support for coregroup
  2020-07-31  7:49   ` Michael Ellerman
@ 2020-07-31  9:18     ` Srikar Dronamraju
  2020-07-31 11:31       ` Michael Ellerman
  0 siblings, 1 reply; 32+ messages in thread
From: Srikar Dronamraju @ 2020-07-31  9:18 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling, Peter Zijlstra,
	Jordan Niethe, LKML, Nicholas Piggin, Valentin Schneider,
	Oliver O'Halloran, linuxppc-dev, Ingo Molnar

* Michael Ellerman <mpe@ellerman.id.au> [2020-07-31 17:49:55]:

> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> > Add support for grouping cores based on the device-tree classification.
> > - The last domain in the associativity domains always refers to the
> > core.
> > - If primary reference domain happens to be the penultimate domain in
> > the associativity domains device-tree property, then there are no
> > coregroups. However if its not a penultimate domain, then there are
> > coregroups. There can be more than one coregroup. For now we would be
> > interested in the last or the smallest coregroups.
> 
> This still doesn't tell me what a coregroup actually represents.
> 
> I get that it's a grouping of cores, and that the device tree specifies
> it for us, but grouping based on what?
> 

We have just abstracted the fact that we are creating a sub-group of cores
within a DIE. We are limiting to one sub-group per core. However this would
allow the firmware the flexibility to vary the grouping. Once the firmware
starts using this group, we could add more code to detect the type of
grouping and adjust the sd domain flags accordingly.

> I think the answer is we aren't being told by firmware, it's just a
> grouping based on some opaque performance characteristic and we just
> have to take that as given.
> 

This is partially true. At this time, we dont have firmwares that can
exploit this code. Once the firmwares start using this grouping, we could
add more code to align the grouping to the scheduler topology.

> But please explain that clearly in the change log and the code comments.
> 

Okay, I will do the needful.

> cheers
> 

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 06/10] powerpc/smp: Generalize 2nd sched domain
  2020-07-31  7:45   ` Michael Ellerman
@ 2020-07-31  9:29     ` Srikar Dronamraju
  2020-07-31 12:22       ` Michael Ellerman
  0 siblings, 1 reply; 32+ messages in thread
From: Srikar Dronamraju @ 2020-07-31  9:29 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling, Peter Zijlstra,
	Jordan Niethe, LKML, Nicholas Piggin, Valentin Schneider,
	Oliver O'Halloran, linuxppc-dev, Ingo Molnar

* Michael Ellerman <mpe@ellerman.id.au> [2020-07-31 17:45:37]:

> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> > Currently "CACHE" domain happens to be the 2nd sched domain as per
> > powerpc_topology. This domain will collapse if cpumask of l2-cache is
> > same as SMT domain. However we could generalize this domain such that it
> > could mean either be a "CACHE" domain or a "BIGCORE" domain.
> >
> > While setting up the "CACHE" domain, check if shared_cache is already
> > set.
> 
> PeterZ asked for some overview of what you're doing and why, you
> responded to his mail, but I was expecting to see that text incorporated
> here somewhere.
> 

Okay, do you want that as part of the code or documentation dir or the
changelog?

> He also asked for some comments, which I would also like to see.
> 
> 
> I'm also not clear why we want to rename it to "bigcore", that's not a
> commonly understood term, I don't think it's clear to new readers what
> it means.
> 
> Leaving it as the shared cache domain, and having a comment mentioning
> that "bigcores" share a cache, would be clearer I think.
> 

Today, Shared cache is equal to Big Core. However in not too distant future,
Shared cache domain and Big Core may not be the same. For example lets
assume that L2 cache were to Shrink per small core with the firmware
exposing the core as a bigcore. Then with the current design, we have a SMT
== SHARED CACHE, and a DIE. We would not have any domain at the publicised 8
thread level. Keeping the Bigcore as a domain and mapping the shared
cache, (I am resetting the domain name as CACHE if BIGCORE==SHARED_CACHE),
helps us in this scenario.

> cheers
> 

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 08/10] powerpc/smp: Allocate cpumask only after searching thread group
  2020-07-31  7:52   ` Michael Ellerman
@ 2020-07-31  9:49     ` Srikar Dronamraju
  2020-07-31 12:14       ` Michael Ellerman
  0 siblings, 1 reply; 32+ messages in thread
From: Srikar Dronamraju @ 2020-07-31  9:49 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling, Peter Zijlstra,
	Jordan Niethe, LKML, Nicholas Piggin, Valentin Schneider,
	Oliver O'Halloran, linuxppc-dev, Ingo Molnar

* Michael Ellerman <mpe@ellerman.id.au> [2020-07-31 17:52:15]:

> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> > If allocated earlier and the search fails, then cpumask need to be
> > freed. However cpu_l1_cache_map can be allocated after we search thread
> > group.
> 
> It's not freed anywhere AFAICS?
> 

Yes, its never freed. Infact we are never checking if
zalloc_cpumask_var_node fails. Its not just this cpumask, but historically
all the other existing cpumasks in arch/powerpc/kernel/smp.c are never
freed/checked. I did dig into this a bit and it appears that ..
(Please do correct me if I am wrong!! )

Powerpc using cpumask_var_t for all of the percpu variables. And it dont seem
to enable CONFIG_CPUMASK_OFFSTACK even from the MAXSMP config.

So from include/linux/cpumask.h

typedef struct cpumask cpumask_var_t[1];
and
zalloc_cpumask_var_node ends up being cpumask_clear

So I think we are historically we seem to assume we are always
!CPUMASK_OFFSTACK and hence we dont need to check for return as well as
free..

I would look forward to your comments on how we should handle this going
forward. But I would keep this the same for this patchset.

One of the questions that I have is if we most likely are to be in
!CONFIG_CPUMASK_OFFSTACK, then should be migrate to cpumask_t for percpu
variables. 
 
The reason being we end up using NR_CPU cpumask for each percpu cpumask
variable instead of using NR_CPU cpumask_t pointer.

> And even after this change there's still an error path that doesn't free
> it, isn't there?
> 
> cheers
> 
> > Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
> > Cc: LKML <linux-kernel@vger.kernel.org>
> > Cc: Michael Ellerman <mpe@ellerman.id.au>
> > Cc: Nicholas Piggin <npiggin@gmail.com>
> > Cc: Anton Blanchard <anton@ozlabs.org>
> > Cc: Oliver O'Halloran <oohall@gmail.com>
> > Cc: Nathan Lynch <nathanl@linux.ibm.com>
> > Cc: Michael Neuling <mikey@neuling.org>
> > Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> > Cc: Ingo Molnar <mingo@kernel.org>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Valentin Schneider <valentin.schneider@arm.com>
> > Cc: Jordan Niethe <jniethe5@gmail.com>
> > Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> > Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> > ---
> >  arch/powerpc/kernel/smp.c | 7 +++----
> >  1 file changed, 3 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> > index 698000c7f76f..dab96a1203ec 100644
> > --- a/arch/powerpc/kernel/smp.c
> > +++ b/arch/powerpc/kernel/smp.c
> > @@ -797,10 +797,6 @@ static int init_cpu_l1_cache_map(int cpu)
> >  	if (err)
> >  		goto out;
> >  
> > -	zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu),
> > -				GFP_KERNEL,
> > -				cpu_to_node(cpu));
> > -
> >  	cpu_group_start = get_cpu_thread_group_start(cpu, &tg);
> >  
> >  	if (unlikely(cpu_group_start == -1)) {
> > @@ -809,6 +805,9 @@ static int init_cpu_l1_cache_map(int cpu)
> >  		goto out;
> >  	}
> >  
> > +	zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu),
> > +				GFP_KERNEL, cpu_to_node(cpu));
> > +
> >  	for (i = first_thread; i < first_thread + threads_per_core; i++) {
> >  		int i_group_start = get_cpu_thread_group_start(i, &tg);
> >  
> > -- 
> > 2.17.1

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 10/10] powerpc/smp: Implement cpu_to_coregroup_id
  2020-07-31  8:02   ` Michael Ellerman
@ 2020-07-31  9:58     ` Srikar Dronamraju
  2020-07-31 11:29       ` Michael Ellerman
  0 siblings, 1 reply; 32+ messages in thread
From: Srikar Dronamraju @ 2020-07-31  9:58 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling, Peter Zijlstra,
	Jordan Niethe, LKML, Nicholas Piggin, Valentin Schneider,
	Oliver O'Halloran, linuxppc-dev, Ingo Molnar

* Michael Ellerman <mpe@ellerman.id.au> [2020-07-31 18:02:21]:

> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> > Lookup the coregroup id from the associativity array.
> 

Thanks Michael for all your comments and inputs.

> It's slightly strange that this is called in patch 9, but only properly
> implemented here in patch 10.
> 
> I'm not saying you have to squash them together, but it would be good if
> the change log for patch 9 mentioned that a subsequent commit will
> complete the implementation and how that affects the behaviour.
> 

I probably got influenced by few LKML community members who always add a
stub and implement the gory details in a subsequent patch. I will surely add
the change log in patch 9 about the subsequent patches.

> cheers
> 

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 10/10] powerpc/smp: Implement cpu_to_coregroup_id
  2020-07-31  9:58     ` Srikar Dronamraju
@ 2020-07-31 11:29       ` Michael Ellerman
  0 siblings, 0 replies; 32+ messages in thread
From: Michael Ellerman @ 2020-07-31 11:29 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling, Peter Zijlstra,
	Jordan Niethe, LKML, Nicholas Piggin, Valentin Schneider,
	Oliver O'Halloran, linuxppc-dev, Ingo Molnar

Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> * Michael Ellerman <mpe@ellerman.id.au> [2020-07-31 18:02:21]:
>
>> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
>> > Lookup the coregroup id from the associativity array.
>
> Thanks Michael for all your comments and inputs.
>
>> It's slightly strange that this is called in patch 9, but only properly
>> implemented here in patch 10.
>> 
>> I'm not saying you have to squash them together, but it would be good if
>> the change log for patch 9 mentioned that a subsequent commit will
>> complete the implementation and how that affects the behaviour.
>
> I probably got influenced by few LKML community members who always add a
> stub and implement the gory details in a subsequent patch. I will surely add
> the change log in patch 9 about the subsequent patches.

That's OK, it's a valid way to do things, and can be good for keeping
the size of individual patches down to make them easier to review.

But yeah a mention in the change log of the preceeding patch is helpful
for anyone looking at that commit on its own in the future.

cheers

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 07/10] Powerpc/numa: Detect support for coregroup
  2020-07-31  9:18     ` Srikar Dronamraju
@ 2020-07-31 11:31       ` Michael Ellerman
  0 siblings, 0 replies; 32+ messages in thread
From: Michael Ellerman @ 2020-07-31 11:31 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling, Peter Zijlstra,
	Jordan Niethe, LKML, Nicholas Piggin, Valentin Schneider,
	Oliver O'Halloran, linuxppc-dev, Ingo Molnar

Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> * Michael Ellerman <mpe@ellerman.id.au> [2020-07-31 17:49:55]:
>
>> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
>> > Add support for grouping cores based on the device-tree classification.
>> > - The last domain in the associativity domains always refers to the
>> > core.
>> > - If primary reference domain happens to be the penultimate domain in
>> > the associativity domains device-tree property, then there are no
>> > coregroups. However if its not a penultimate domain, then there are
>> > coregroups. There can be more than one coregroup. For now we would be
>> > interested in the last or the smallest coregroups.
>> 
>> This still doesn't tell me what a coregroup actually represents.
>> 
>> I get that it's a grouping of cores, and that the device tree specifies
>> it for us, but grouping based on what?
>
> We have just abstracted the fact that we are creating a sub-group of cores
> within a DIE. We are limiting to one sub-group per core. However this would
> allow the firmware the flexibility to vary the grouping. Once the firmware
> starts using this group, we could add more code to detect the type of
> grouping and adjust the sd domain flags accordingly.

OK. That's good info to have in the change log.

>> I think the answer is we aren't being told by firmware, it's just a
>> grouping based on some opaque performance characteristic and we just
>> have to take that as given.
>> 
>
> This is partially true. At this time, we dont have firmwares that can
> exploit this code. Once the firmwares start using this grouping, we could
> add more code to align the grouping to the scheduler topology.
>
>> But please explain that clearly in the change log and the code comments.
>> 
>
> Okay, I will do the needful.

Thanks.

cheers

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 08/10] powerpc/smp: Allocate cpumask only after searching thread group
  2020-07-31  9:49     ` Srikar Dronamraju
@ 2020-07-31 12:14       ` Michael Ellerman
  0 siblings, 0 replies; 32+ messages in thread
From: Michael Ellerman @ 2020-07-31 12:14 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling, Peter Zijlstra,
	Jordan Niethe, LKML, Nicholas Piggin, Valentin Schneider,
	Oliver O'Halloran, linuxppc-dev, Ingo Molnar

Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> * Michael Ellerman <mpe@ellerman.id.au> [2020-07-31 17:52:15]:
>
>> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
>> > If allocated earlier and the search fails, then cpumask need to be
>> > freed. However cpu_l1_cache_map can be allocated after we search thread
>> > group.
>> 
>> It's not freed anywhere AFAICS?
>
> Yes, its never freed. Infact we are never checking if
> zalloc_cpumask_var_node fails. Its not just this cpumask, but historically
> all the other existing cpumasks in arch/powerpc/kernel/smp.c are never
> freed/checked. I did dig into this a bit and it appears that ..
> (Please do correct me if I am wrong!! )

That's correct.

> Powerpc using cpumask_var_t for all of the percpu variables. And it dont seem
> to enable CONFIG_CPUMASK_OFFSTACK even from the MAXSMP config.

I remember Rusty adding that code, but I don't know if we ever
considered enabling CPUMASK_OFFSTACK.

Probably we meant to but never got around to doing it.

> So from include/linux/cpumask.h
>
> typedef struct cpumask cpumask_var_t[1];
> and
> zalloc_cpumask_var_node ends up being cpumask_clear
>
> So I think we are historically we seem to assume we are always
> !CPUMASK_OFFSTACK and hence we dont need to check for return as well as
> free..

Right.

> I would look forward to your comments on how we should handle this going
> forward. But I would keep this the same for this patchset.

Agreed, just clarify in the change log that it's not freed at the moment
because of CPU_MASK_OFFSTACK=n

> One of the questions that I have is if we most likely are to be in
> !CONFIG_CPUMASK_OFFSTACK, then should be migrate to cpumask_t for percpu
> variables. 

I don't think so, cpumask_t is semi-deprecated AIUI.
  
> The reason being we end up using NR_CPU cpumask for each percpu cpumask
> variable instead of using NR_CPU cpumask_t pointer.

Our current defconfigs have NR_CPUS=2048, which is probably just small
enough to continue using OFFSTACK=n.

But we allow configuring NR_CPUS up to 8192, which surely would need
OFFSTACK=y in order to work.

So I think we need to stick with cpumask_var_t, but we should test with
OFFSTACK=y, and should probably be a bit more careful with checking the
allocations succeed.

And then we should select OFFSTACK=y for NR_CPUS above some threshold.

cheers

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 06/10] powerpc/smp: Generalize 2nd sched domain
  2020-07-31  9:29     ` Srikar Dronamraju
@ 2020-07-31 12:22       ` Michael Ellerman
  0 siblings, 0 replies; 32+ messages in thread
From: Michael Ellerman @ 2020-07-31 12:22 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling, Peter Zijlstra,
	Jordan Niethe, LKML, Nicholas Piggin, Valentin Schneider,
	Oliver O'Halloran, linuxppc-dev, Ingo Molnar

Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> * Michael Ellerman <mpe@ellerman.id.au> [2020-07-31 17:45:37]:
>
>> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
>> > Currently "CACHE" domain happens to be the 2nd sched domain as per
>> > powerpc_topology. This domain will collapse if cpumask of l2-cache is
>> > same as SMT domain. However we could generalize this domain such that it
>> > could mean either be a "CACHE" domain or a "BIGCORE" domain.
>> >
>> > While setting up the "CACHE" domain, check if shared_cache is already
>> > set.
>> 
>> PeterZ asked for some overview of what you're doing and why, you
>> responded to his mail, but I was expecting to see that text incorporated
>> here somewhere.
>> 
>
> Okay, do you want that as part of the code or documentation dir or the
> changelog?

I guess a comment is best, as that's most likely to be seen by people
looking at the code in future.

A little bit of overview in the change log is also good.

>> He also asked for some comments, which I would also like to see.
>> 
>> 
>> I'm also not clear why we want to rename it to "bigcore", that's not a
>> commonly understood term, I don't think it's clear to new readers what
>> it means.
>> 
>> Leaving it as the shared cache domain, and having a comment mentioning
>> that "bigcores" share a cache, would be clearer I think.
>> 
>
> Today, Shared cache is equal to Big Core. However in not too distant future,
> Shared cache domain and Big Core may not be the same. For example lets
> assume that L2 cache were to Shrink per small core with the firmware
> exposing the core as a bigcore. Then with the current design, we have a SMT
> == SHARED CACHE, and a DIE. We would not have any domain at the publicised 8
> thread level. Keeping the Bigcore as a domain and mapping the shared
> cache, (I am resetting the domain name as CACHE if BIGCORE==SHARED_CACHE),
> helps us in this scenario.

Yeah OK.

In that scenario it's not really clear what the 8 thread level domain
expresses, if the two "halves" of the bigcore have separate L2s. But
presumably there is still some benefit to exposing it.

cheers

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 09/10] Powerpc/smp: Create coregroup domain
  2020-07-31  1:05       ` Valentin Schneider
@ 2020-08-03  6:01         ` Srikar Dronamraju
  0 siblings, 0 replies; 32+ messages in thread
From: Srikar Dronamraju @ 2020-08-03  6:01 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling, Peter Zijlstra,
	LKML, Nicholas Piggin, Morten Rasmussen, Oliver O'Halloran,
	Jordan Niethe, linuxppc-dev, Ingo Molnar

> > Also in the current P9 itself, two neighbouring core-pairs form a quad.
> > Cache latency within a quad is better than a latency to a distant core-pair.
> > Cache latency within a core pair is way better than latency within a quad.
> > So if we have only 4 threads running on a DIE all of them accessing the same
> > cache-lines, then we could probably benefit if all the tasks were to run
> > within the quad aka MC/Coregroup.
> >
> 
> Did you test this? WRT load balance we do try to balance "load" over the
> different domain spans, so if you represent quads as their own MC domain,
> you would AFAICT end up spreading tasks over the quads (rather than packing
> them) when balancing at e.g. DIE level. The desired behaviour might be
> hackable with some more ASYM_PACKING, but I'm not sure I should be
> suggesting that :-)
> 

Agree, load balance will try to spread the load across the quads. In my hack,
I was explicitly marking QUAD domains as !SD_PREFER_SIBLING + relaxing few
load spreading rules when SD_PREFER_SIBLING was not set. And this was on a
slightly older kernel (without recent Vincent's load balance overhaul).

> > I have found some benchmarks which are latency sensitive to benefit by
> > having a grouping a quad level (using kernel hacks and not backed by
> > firmware changes). Gautham also found similar results in his experiments
> > but he only used binding within the stock kernel.
> >
> 
> IIUC you reflect this "fabric quirk" (i.e. coregroups) using this DT
> binding thing.
> 
> That's also where things get interesting (for me) because I experienced
> something similar on another arm64 platform (ThunderX1). This was more
> about cache bandwidth than cache latency, but IMO it's in the same bag of
> fabric quirks. I blabbered a bit about this at last LPC [1], but kind of
> gave up on it given the TX1 was the only (arm64) platform where I could get
> both significant and reproducible results.
> 
> Now, if you folks are seeing this on completely different hardware and have
> "real" workloads that truly benefit from this kind of domain partitioning,
> this might be another incentive to try and sort of generalize this. That's
> outside the scope of your series, but your findings give me some hope!
> 
> I think what I had in mind back then was that if enough folks cared about
> it, we might get some bits added to the ACPI spec; something along the
> lines of proximity domains for the caches described in PPTT, IOW a cache
> distance matrix. I don't really know what it'll take to get there, but I
> figured I'd dump this in case someone's listening :-)
> 

Very interesting.

> > I am not setting SD_SHARE_PKG_RESOURCES in MC/Coregroup sd_flags as in MC
> > domain need not be LLC domain for Power.
> 
> From what I understood your MC domain does seem to map to LLC; but in any
> case, shouldn't you set that flag at least for BIGCORE (i.e. L2)? AIUI with
> your changes your sd_llc is gonna be SMT, and that's not going to be a very
> big mask. IMO you do want to correctly reflect your LLC situation via this
> flag to make cpus_share_cache() work properly.

I detect if the LLC is shared at BIGCORE, and if they are shared at BIGCORE,
then I dynamically rename the DOMAIN as CACHE and enable
SD_SHARE_PKG_RESOURCES in that domain.

> 
> [1]: https://linuxplumbersconf.org/event/4/contributions/484/

Thanks for the pointer.

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v4 06/10] powerpc/smp: Generalize 2nd sched domain
  2020-07-27  5:17 Srikar Dronamraju
@ 2020-07-27  5:18 ` Srikar Dronamraju
  0 siblings, 0 replies; 32+ messages in thread
From: Srikar Dronamraju @ 2020-07-27  5:18 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling,
	Srikar Dronamraju, Peter Zijlstra, Jordan Niethe, LKML,
	Nicholas Piggin, Valentin Schneider, Oliver O'Halloran,
	linuxppc-dev, Ingo Molnar

Currently "CACHE" domain happens to be the 2nd sched domain as per
powerpc_topology. This domain will collapse if cpumask of l2-cache is
same as SMT domain. However we could generalize this domain such that it
could mean either be a "CACHE" domain or a "BIGCORE" domain.

While setting up the "CACHE" domain, check if shared_cache is already
set.

Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Oliver O'Halloran <oohall@gmail.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Jordan Niethe <jniethe5@gmail.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v1 -> v2:
	Moved shared_cache topology fixup to fixup_topology (Gautham)

 arch/powerpc/kernel/smp.c | 49 ++++++++++++++++++++++++++++-----------
 1 file changed, 35 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index d997c7411664..3c5ccf6d2b1c 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -85,6 +85,14 @@ EXPORT_PER_CPU_SYMBOL(cpu_l2_cache_map);
 EXPORT_PER_CPU_SYMBOL(cpu_core_map);
 EXPORT_SYMBOL_GPL(has_big_cores);
 
+enum {
+#ifdef CONFIG_SCHED_SMT
+	smt_idx,
+#endif
+	bigcore_idx,
+	die_idx,
+};
+
 #define MAX_THREAD_LIST_SIZE	8
 #define THREAD_GROUP_SHARE_L1   1
 struct thread_groups {
@@ -851,13 +859,7 @@ static int powerpc_shared_cache_flags(void)
  */
 static const struct cpumask *shared_cache_mask(int cpu)
 {
-	if (shared_caches)
-		return cpu_l2_cache_mask(cpu);
-
-	if (has_big_cores)
-		return cpu_smallcore_mask(cpu);
-
-	return per_cpu(cpu_sibling_map, cpu);
+	return per_cpu(cpu_l2_cache_map, cpu);
 }
 
 #ifdef CONFIG_SCHED_SMT
@@ -867,11 +869,16 @@ static const struct cpumask *smallcore_smt_mask(int cpu)
 }
 #endif
 
+static const struct cpumask *cpu_bigcore_mask(int cpu)
+{
+	return per_cpu(cpu_sibling_map, cpu);
+}
+
 static struct sched_domain_topology_level powerpc_topology[] = {
 #ifdef CONFIG_SCHED_SMT
 	{ cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
 #endif
-	{ shared_cache_mask, powerpc_shared_cache_flags, SD_INIT_NAME(CACHE) },
+	{ cpu_bigcore_mask, SD_INIT_NAME(BIGCORE) },
 	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
 	{ NULL, },
 };
@@ -1311,7 +1318,6 @@ static void add_cpu_to_masks(int cpu)
 void start_secondary(void *unused)
 {
 	unsigned int cpu = smp_processor_id();
-	struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask;
 
 	mmgrab(&init_mm);
 	current->active_mm = &init_mm;
@@ -1337,14 +1343,20 @@ void start_secondary(void *unused)
 	/* Update topology CPU masks */
 	add_cpu_to_masks(cpu);
 
-	if (has_big_cores)
-		sibling_mask = cpu_smallcore_mask;
 	/*
 	 * Check for any shared caches. Note that this must be done on a
 	 * per-core basis because one core in the pair might be disabled.
 	 */
-	if (!cpumask_equal(cpu_l2_cache_mask(cpu), sibling_mask(cpu)))
-		shared_caches = true;
+	if (!shared_caches) {
+		struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask;
+		struct cpumask *mask = cpu_l2_cache_mask(cpu);
+
+		if (has_big_cores)
+			sibling_mask = cpu_smallcore_mask;
+
+		if (cpumask_weight(mask) > cpumask_weight(sibling_mask(cpu)))
+			shared_caches = true;
+	}
 
 	set_numa_node(numa_cpu_lookup_table[cpu]);
 	set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu]));
@@ -1375,9 +1387,17 @@ static void fixup_topology(void)
 #ifdef CONFIG_SCHED_SMT
 	if (has_big_cores) {
 		pr_info("Big cores detected but using small core scheduling\n");
-		powerpc_topology[0].mask = smallcore_smt_mask;
+		powerpc_topology[smt_idx].mask = smallcore_smt_mask;
 	}
 #endif
+	if (shared_caches) {
+		pr_info("Using shared cache scheduler topology\n");
+		powerpc_topology[bigcore_idx].mask = shared_cache_mask;
+		powerpc_topology[bigcore_idx].sd_flags = powerpc_shared_cache_flags;
+#ifdef CONFIG_SCHED_DEBUG
+		powerpc_topology[bigcore_idx].name = "CACHE";
+#endif
+	}
 }
 
 void __init smp_cpus_done(unsigned int max_cpus)
-- 
2.17.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, back to index

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-27  5:32 [PATCH v4 00/10] Coregroup support on Powerpc Srikar Dronamraju
2020-07-27  5:32 ` [PATCH v4 01/10] powerpc/smp: Fix a warning under !NEED_MULTIPLE_NODES Srikar Dronamraju
2020-07-27  5:32 ` [PATCH v4 02/10] powerpc/smp: Merge Power9 topology with Power topology Srikar Dronamraju
2020-07-27  5:32 ` [PATCH v4 03/10] powerpc/smp: Move powerpc_topology above Srikar Dronamraju
2020-07-27  5:32 ` [PATCH v4 04/10] powerpc/smp: Move topology fixups into a new function Srikar Dronamraju
2020-07-27  5:32 ` [PATCH v4 05/10] powerpc/smp: Dont assume l2-cache to be superset of sibling Srikar Dronamraju
2020-07-27  5:32 ` [PATCH v4 06/10] powerpc/smp: Generalize 2nd sched domain Srikar Dronamraju
2020-07-30  5:55   ` Gautham R Shenoy
2020-07-31  7:45   ` Michael Ellerman
2020-07-31  9:29     ` Srikar Dronamraju
2020-07-31 12:22       ` Michael Ellerman
2020-07-27  5:32 ` [PATCH v4 07/10] Powerpc/numa: Detect support for coregroup Srikar Dronamraju
2020-07-31  7:49   ` Michael Ellerman
2020-07-31  9:18     ` Srikar Dronamraju
2020-07-31 11:31       ` Michael Ellerman
2020-07-27  5:32 ` [PATCH v4 08/10] powerpc/smp: Allocate cpumask only after searching thread group Srikar Dronamraju
2020-07-31  7:52   ` Michael Ellerman
2020-07-31  9:49     ` Srikar Dronamraju
2020-07-31 12:14       ` Michael Ellerman
2020-07-27  5:32 ` [PATCH v4 09/10] Powerpc/smp: Create coregroup domain Srikar Dronamraju
2020-07-27 18:52   ` Gautham R Shenoy
2020-07-28 15:03   ` Valentin Schneider
2020-07-29  6:13     ` Srikar Dronamraju
2020-07-31  1:05       ` Valentin Schneider
2020-08-03  6:01         ` Srikar Dronamraju
2020-07-31  7:36       ` Gautham R Shenoy
2020-07-27  5:32 ` [PATCH v4 10/10] powerpc/smp: Implement cpu_to_coregroup_id Srikar Dronamraju
2020-07-31  8:02   ` Michael Ellerman
2020-07-31  9:58     ` Srikar Dronamraju
2020-07-31 11:29       ` Michael Ellerman
2020-07-30 17:22 ` [PATCH v4 00/10] Coregroup support on Powerpc Srikar Dronamraju
  -- strict thread matches above, loose matches on Subject: below --
2020-07-27  5:17 Srikar Dronamraju
2020-07-27  5:18 ` [PATCH v4 06/10] powerpc/smp: Generalize 2nd sched domain Srikar Dronamraju

LinuxPPC-Dev Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linuxppc-dev/0 linuxppc-dev/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linuxppc-dev linuxppc-dev/ https://lore.kernel.org/linuxppc-dev \
		linuxppc-dev@lists.ozlabs.org linuxppc-dev@ozlabs.org
	public-inbox-index linuxppc-dev

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.ozlabs.lists.linuxppc-dev


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git