From: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
To: Michael Ellerman <mpe@ellerman.id.au>
Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>,
	Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
	Juergen Gross <jgross@suse.com>,
	Nicholas Piggin <npiggin@gmail.com>,
	Christophe Leroy <christophe.leroy@csgroup.eu>,
	Josh Poimboeuf <jpoimboe@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Paul E McKenney <paulmck@kernel.org>,
	Valentin Schneider <vschneid@redhat.com>,
	Nathan Lynch <nathanl@linux.ibm.com>,
	virtualization@lists.linux-foundation.org,
	linux-kernel@vger.kernel.org
Subject: [PATCH 0/4] powerpc/smp: Shared processor sched optimizations
Date: Wed, 30 Aug 2023 16:22:40 +0530
Message-ID: <20230830105244.62477-1-srikar@linux.vnet.ibm.com>

PowerVM systems configured in shared processor mode have some unique
challenges. Some device-tree properties are missing on a shared
processor LPAR, so some sched domains do not make sense for shared
processor systems.

Most shared processor systems are over-provisioned. The underlying
PowerVM hypervisor dispatches at Big Core granularity, and the most
recent Power processors contain two almost independent small cores per
Big Core. Under light load, overall system performance improves if we
pack work onto a smaller number of Big Cores.
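
To make the direction concrete, below is a rough sketch of the idea
behind patches 3 and 4. It is only an illustration, not the actual
diff: it leans on the existing shared_processor static key (patch 2
moves the is_shared_processor() helper from asm/paravirt.h to
asm/smp.h) and on the fixup_topology()/topology-index arrangement
already present in arch/powerpc/kernel/smp.c; the
powerpc_shared_proc_flags() name is made up here.

/* Sketch only -- not the patches themselves. */
#include <linux/sched/topology.h>
#include <asm/paravirt.h>	/* is_shared_processor(); patch 2 moves it to asm/smp.h */

/* Hypothetical flags callback for a core-granular sched domain level. */
static int powerpc_shared_proc_flags(void)
{
	/*
	 * On a shared processor LPAR, turn on asymmetric packing so the
	 * scheduler consolidates load onto lower-numbered CPUs: with the
	 * default asym priority (lower CPU id preferred) that keeps fewer
	 * Big Cores busy and lets the hypervisor cede the idle ones.
	 */
	if (is_shared_processor())
		return SD_ASYM_PACKING;

	return 0;
}

static void __init fixup_topology(void)
{
	/*
	 * The MC level adds little on a shared processor LPAR, where the
	 * hypervisor dispatches at Big Core granularity and some of the
	 * device-tree properties used to build it are absent.  Reusing
	 * the CACHE mask for the MC level lets the scheduler degenerate
	 * the redundant domain (mc_idx/cache_idx as used in smp.c today).
	 */
	if (is_shared_processor())
		powerpc_topology[mc_idx].mask = powerpc_topology[cache_idx].mask;
}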

System Configuration
type=Shared mode=Capped smt=8 lcpu=128 mem=1066732224 kB cpus=96 ent=40.00
So this is a *40 Entitled cores / 128 Virtual processors* scenario
(SMT8, hence 1024 logical CPUs), i.e. roughly 3.2x over-committed on cores.

lscpu
Architecture:                    ppc64le
Byte Order:                      Little Endian
CPU(s):                          1024
On-line CPU(s) list:             0-1023
Model name:                      POWER10 (architected), altivec supported
Model:                           2.0 (pvr 0080 0200)
Thread(s) per core:              8
Core(s) per socket:              16
Socket(s):                       8
Hypervisor vendor:               pHyp
Virtualization type:             para
L1d cache:                       8 MiB (256 instances)
L1i cache:                       12 MiB (256 instances)
NUMA node(s):                    8
NUMA node0 CPU(s): 0-7,64-71,128-135,192-199,256-263,320-327,384-391,448-455,512-519,576-583,640-647,704-711,768-775,832-839,896-903,960-967
NUMA node1 CPU(s): 8-15,72-79,136-143,200-207,264-271,328-335,392-399,456-463,520-527,584-591,648-655,712-719,776-783,840-847,904-911,968-975
NUMA node2 CPU(s): 16-23,80-87,144-151,208-215,272-279,336-343,400-407,464-471,528-535,592-599,656-663,720-727,784-791,848-855,912-919,976-983
NUMA node3 CPU(s): 24-31,88-95,152-159,216-223,280-287,344-351,408-415,472-479,536-543,600-607,664-671,728-735,792-799,856-863,920-927,984-991
NUMA node4 CPU(s): 32-39,96-103,160-167,224-231,288-295,352-359,416-423,480-487,544-551,608-615,672-679,736-743,800-807,864-871,928-935,992-999
NUMA node5 CPU(s): 40-47,104-111,168-175,232-239,296-303,360-367,424-431,488-495,552-559,616-623,680-687,744-751,808-815,872-879,936-943,1000-1007
NUMA node6 CPU(s): 48-55,112-119,176-183,240-247,304-311,368-375,432-439,496-503,560-567,624-631,688-695,752-759,816-823,880-887,944-951,1008-1015
NUMA node7 CPU(s): 56-63,120-127,184-191,248-255,312-319,376-383,440-447,504-511,568-575,632-639,696-703,760-767,824-831,888-895,952-959,1016-1023

ebizzy -t 40 -S 200 (5 iterations) Records per second. (Higher is better)
Kernel     N  Min       Max       Median    Avg        Stddev     %Change
v6.5       5  4664647   5148125   5130549   5043050.2  211756.06
+patch     5  4769453   5220808   5137476   5040333.8  193586.43  -0.0538642

From lparstat (when the workload stabilized)
(physc = physical cores consumed, %entc = % of entitled capacity used,
 lbusy = logical CPU utilization, app = available cores in the shared
 pool, vcsw = virtual context switches, phint = phantom interrupts)
Kernel  %user  %sys  %wait  %idle  physc  %entc   lbusy  app    vcsw       phint
v6.5    6.23   0.00  0.00   93.77  40.06  100.15  6.23   55.92  138699651  100
+patch  6.26   0.01  0.00   93.73  21.15  52.87   6.27   74.78  71743299   148

ebizzy -t 80 -S 200 (5 iterations) Records per second. (Higher is better)
Kernel     N  Min       Max       Median    Avg        Stddev     %Change
v6.5       5  8735907   9121401   8986218   8967125.6  152793.38
+patch     5  9636679   9990229   9765958   9770081.8  143913.29  8.95444

From lparstat (when the workload stabilized)
Kernel  %user  %sys  %wait  %idle  physc  %entc   lbusy  app    vcsw      phint
v6.5    12.40  0.01  0.00   87.60  71.05  177.62  12.40  24.61  98047012  85
+patch  12.47  0.02  0.00   87.50  41.06  102.65  12.50  54.90  77821678  158

ebizzy -t 160 -S 200 (5 iterations) Records per second. (Higher is better)
Kernel     N  Min       Max       Median    Avg        Stddev     %Change
v6.5       5  12378356  12946633  12780732  12682369   266135.73
+patch     5  16756702  17676670  17406971  17341585   346054.89  36.7377

From lparstat (when the workload stabilized)
Kernel  %user  %sys  %wait  %idle  physc  %entc   lbusy  app    vcsw       phint
v6.5    24.56  0.09  0.15   75.19  77.42  193.55  24.65  17.94  135625276  98
+patch  24.78  0.03  0.00   75.19  78.33  195.83  24.81  17.17  107826112  215
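
(The %Change column is simply the relative difference of the Avg
columns; a standalone snippet, not part of the series, for anyone who
wants to re-derive the values quoted above:)

/* Quick check of the %Change values from the 128-VP tables above. */
#include <stdio.h>

static double pct_change(double base, double patched)
{
	return (patched - base) / base * 100.0;
}

int main(void)
{
	printf("%.6f\n", pct_change(5043050.2, 5040333.8));	/* ~ -0.0539 */
	printf("%.6f\n", pct_change(8967125.6, 9770081.8));	/* ~  8.9544 */
	printf("%.6f\n", pct_change(12682369.0, 17341585.0));	/* ~ 36.7377 */
	return 0;
}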
-------------------------------------------------------------------------

System Configuration
type=Shared mode=Capped smt=8 lcpu=40 mem=1066732672 kB cpus=96 ent=40.00
So this is a *40 Entitled cores / 40 Virtual processors* scenario
(SMT8, hence 320 logical CPUs), i.e. entitlement equals the number of
virtual processors.

lscpu
Architecture:                    ppc64le
Byte Order:                      Little Endian
CPU(s):                          320
On-line CPU(s) list:             0-319
Model name:                      POWER10 (architected), altivec supported
Model:                           2.0 (pvr 0080 0200)
Thread(s) per core:              8
Core(s) per socket:              10
Socket(s):                       4
Hypervisor vendor:               pHyp
Virtualization type:             para
L1d cache:                       2.5 MiB (80 instances)
L1i cache:                       3.8 MiB (80 instances)
NUMA node(s):                    4
NUMA node0 CPU(s):               0-7,32-39,64-71,96-103,128-135,160-167,192-199,224-231,256-263,288-295
NUMA node1 CPU(s):               8-15,40-47,72-79,104-111,136-143,168-175,200-207,232-239,264-271,296-303
NUMA node2 CPU(s):               16-23,48-55,80-87,112-119,144-151,176-183,208-215,240-247,272-279,304-311
NUMA node3 CPU(s):               24-31,56-63,88-95,120-127,152-159,184-191,216-223,248-255,280-287,312-319

ebizzy -t 40 -S 200 (5 iterations) Records per second. (Higher is better)
Kernel    N  Min       Max       Median    Avg        Stddev     %Change
v6.5      5  4966196   5148045   5078348   5072977.4  66572.122
+patch    5  5035210   5232882   5158456   5151734    78906.893  1.55247

From lparstat (when the workload stabilized)
Kernel  %user  %sys  %wait  %idle  physc  %entc   lbusy  app    vcsw     phint
v6.5    12.58  0.02  0.00   87.41  40.00  100.00  12.59  55.97  1029603  82
+patch  12.58  0.02  0.00   87.40  21.16  52.90   12.60  74.82  1188571  657

ebizzy -t 80 -S 200 (5 iterations) Records per second. (Higher is better)
Kernel    N  Min       Max       Median    Avg        Stddev     %Change
v6.5      5  10081713  10162128  10145721  10128119   35603.196
+patch    5  9928483   10430256  10338097  10218466   221155.16  0.892041

From lparstat (when the workload stabilized)
Kernel  %user  %sys  %wait  %idle  physc  %entc   lbusy  app    vcsw     phint
v6.5    25.02  0.06  0.00   74.93  40.00  100.00  25.07  55.99  1530297  92
+patch  25.03  0.04  0.00   74.93  40.00  100.00  25.07  55.99  2475875  667

ebizzy -t 160 -S 200 (5 iterations) Records per second. (Higher is better)
Kernel    N  Min       Max       Median    Avg        Stddev     %Change
v6.5      5  9064802   9169798   9115250   9123968.2  44901.261
+patch    5  9064533   9235200   9072374   9119558.2  76260.411  -0.0483342

From lparstat (when the workload stabilized)
Kernel  %user  %sys  %wait  %idle  physc  %entc   lbusy  app    vcsw     phint
v6.5    49.94  0.03  0.00   50.03  40.06  100.15  49.97  55.99  2058879  93
+patch  49.94  0.03  0.00   50.03  40.06  100.15  49.97  55.99  2058879  93
-------------------------------------------------------------------------

Observation:
In the low-utilization runs we see ebizzy throughput retained or
improved while consuming almost half the physical cores (physc drops
from ~40 to ~21 at 40 threads on the 128-VP LPAR). Throughput is
retained in the mid- and high-utilization runs as well, and improves by
~36% at 160 threads on the 128-VP LPAR.
Note: the numbers above are for the uncapped + no-noise case. With
capping and/or noise, where the cores are contended, the numbers are
expected to improve further.

Srikar Dronamraju (4):
  powerpc/smp: Cache CPU has Asymmetric SMP
  powerpc/smp: Move shared_processor static key to smp.h
  powerpc/smp: Enable Asym packing for cores on shared processor
  powerpc/smp: Disable MC domain for shared processor

 arch/powerpc/include/asm/paravirt.h | 12 -----------
 arch/powerpc/include/asm/smp.h      | 14 +++++++++++++
 arch/powerpc/kernel/smp.c           | 31 +++++++++++++++++++----------
 3 files changed, 35 insertions(+), 22 deletions(-)

-- 
2.41.0

