LKML Archive on lore.kernel.org
* [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain
@ 2021-04-02  5:37 Gautham R. Shenoy
  2021-04-02  7:36 ` Gautham R Shenoy
  2021-04-12  6:24 ` Srikar Dronamraju
  0 siblings, 2 replies; 13+ messages in thread
From: Gautham R. Shenoy @ 2021-04-02  5:37 UTC (permalink / raw)
  To: Michael Ellerman, Michael Neuling, Mel Gorman, Rik van Riel,
	Srikar Dronamraju, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, Nicholas Piggin, Anton Blanchard, Parth Shah,
	Vaidyanathan Srinivasan
  Cc: LKML, linuxppc-dev, Gautham R. Shenoy

From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>

On POWER10 systems, the L2 cache is at the SMT4 small core level. The
following commits ensure that L2 cache gets correctly discovered and
the Last-Level-Cache domain (LLC) is set to the SMT sched-domain.

    790a166 powerpc/smp: Parse ibm,thread-groups with multiple properties
    1fdc1d6 powerpc/smp: Rename cpu_l1_cache_map as thread_group_l1_cache_map
    fbd2b67 powerpc/smp: Rename init_thread_group_l1_cache_map() to make
                         it generic
    538abe powerpc/smp: Add support detecting thread-groups sharing L2 cache
    0be4763 powerpc/cacheinfo: Print correct cache-sibling map/list for L2 cache

However, with the LLC now on the SMT sched-domain, we are seeing some
regressions in the performance of applications that require
single-threaded performance. The reason for this is as follows:

Prior to the change (we call this P9-sched below), the sched-domain
hierarchy was:

	  SMT (SMT4) --> CACHE (SMT8)[LLC] --> MC (Hemisphere) --> DIE

where the CACHE sched-domain is defined to be the Last Level Cache (LLC).

On the upstream kernel, with the aforementioned commits (P10-sched),
the sched-domain hierarchy is:

    	  SMT (SMT4)[LLC] --> MC (Hemisphere) --> DIE

with the SMT sched-domain as the LLC.

When the scheduler tries to wake up a task, it chooses between the
waker's CPU and the wakee's previous CPU. Calling this choice the
"target", the scheduler then, within the target's LLC domain,

a) tries to find an idle core in the LLC. This helps exploit the
   SMT folding that the wakee task can benefit from. If an idle
   core is found, the wakee is woken up on it.

b) Failing to find an idle core, the scheduler tries to find an idle
   CPU in the LLC. This helps minimise the wakeup latency for the
   wakee since it gets to run on the CPU immediately.

c) Failing that, it wakes the task up on the target CPU itself.

Thus, with the P9-sched topology, since the CACHE domain comprises two
SMT4 cores, there is a decent chance of finding an idle core, failing
which there is a relatively higher probability of finding an idle CPU
among the 8 threads in the domain.

However, in the P10-sched topology, since the SMT domain is the LLC
and contains only a single SMT4 core, the probability of finding that
core idle is lower. Furthermore, with only 4 CPUs to search, the
probability of finding an idle CPU to wake the task up on is also
lower.

Thus applications which require single-threaded performance end up
getting woken up on a potentially busy core, even though there are
idle cores elsewhere in the system.

To remedy this, this patch proposes that the LLC be moved to the MC
level, which is a group of cores in one half of the chip.

      SMT (SMT4) --> MC (Hemisphere)[LLC] --> DIE

While no cache is actually shared at this level, this is still the
level where some amount of cache-snooping takes place, and it is
relatively faster to access data from the caches of the other cores
within this domain. With this change, we no longer see regressions on
P10 for applications which require single-threaded performance.

The patch also improves the tail latencies on schbench and the
usecs/op on "perf bench sched pipe".

On a 10 core P10 system with 80 CPUs,

schbench
============
(https://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git/)

Values: lower is better.
The 99th percentile is the tail latency.


99th percentile
~~~~~~~~~~~~~~~~~~
No. messenger
threads       5.12-rc4    5.12-rc4
              P10-sched   MC-LLC
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1             70 us         85 us
2             81 us        101 us
3             92 us        107 us
4             96 us        110 us
5            103 us        123 us
6           3412 us ---->  122 us
7           1490 us        136 us
8           6200 us       3572 us


perf bench sched pipe
============
Values: lower is better.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
No. of
parallel
instances   5.12-rc4       5.12-rc4
            P10-sched      MC-LLC 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1           24.04 us/op    18.72 us/op 
2           24.04 us/op    18.65 us/op 
4           24.01 us/op    18.76 us/op 
8           24.10 us/op    19.11 us/op 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/smp.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 5a4d59a..c75dbd4 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -976,6 +976,13 @@ static bool has_coregroup_support(void)
 	return coregroup_enabled;
 }
 
+static int powerpc_mc_flags(void)
+{
+	if (has_coregroup_support())
+		return SD_SHARE_PKG_RESOURCES;
+	return 0;
+}
+
 static const struct cpumask *cpu_mc_mask(int cpu)
 {
 	return cpu_coregroup_mask(cpu);
@@ -986,7 +993,7 @@ static const struct cpumask *cpu_mc_mask(int cpu)
 	{ cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
 #endif
 	{ shared_cache_mask, powerpc_shared_cache_flags, SD_INIT_NAME(CACHE) },
-	{ cpu_mc_mask, SD_INIT_NAME(MC) },
+	{ cpu_mc_mask, powerpc_mc_flags, SD_INIT_NAME(MC) },
 	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
 	{ NULL, },
 };
-- 
1.9.4


Thread overview: 13+ messages
2021-04-02  5:37 [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain Gautham R. Shenoy
2021-04-02  7:36 ` Gautham R Shenoy
2021-04-12  6:24 ` Srikar Dronamraju
2021-04-12  9:37   ` Mel Gorman
2021-04-12 10:06     ` Valentin Schneider
2021-04-12 10:48       ` Mel Gorman
2021-04-19  6:14         ` Gautham R Shenoy
2021-04-12 12:21     ` Vincent Guittot
2021-04-12 15:24       ` Mel Gorman
2021-04-12 16:33         ` Michal Suchánek
2021-04-14  7:02           ` Gautham R Shenoy
2021-04-13  7:10         ` Vincent Guittot
2021-04-14  7:00         ` Gautham R Shenoy
