* [PATCH v2 0/8] sched/fair: wake_affine improvements
From: Srikar Dronamraju @ 2021-05-06 16:45 UTC
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju,
	Thomas Gleixner, Valentin Schneider, Vincent Guittot,
	Dietmar Eggemann, Michael Ellerman, Gautham R Shenoy, Parth Shah

Changelog v1->v2:
v1 Link: http://lore.kernel.org/lkml/20210422102326.35889-1-srikar@linux.vnet.ibm.com/t/#u
 - Fallback LLC domain has been split out as a subsequent patchset.
					(Suggested by Mel)
 - Fix a panic due to two wakeups racing for the same idle-core
					(Reported by Mel)
 - Differentiate between an LLC that surely has no idle-cores (-2) and an
   LLC that may or may not have idle-cores (-1).
 - Rebased to v5.12
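
To illustrate the idle-core changelog item, here is a minimal sketch of how
such a hint could be encoded in one integer. Only the meanings of -1 and -2
come from the changelog; the macro names, the struct and the helpers are
hypothetical and are not taken from the patches.

```c
#include <stdbool.h>

/* Hypothetical names; only the values -1 and -2 come from the changelog. */
#define IDLE_CORE_UNKNOWN	(-1)	/* LLC may or may not have an idle core */
#define IDLE_CORE_NONE		(-2)	/* LLC surely has no idle core */

/* Stand-in for per-LLC shared state; this is not the kernel's struct. */
struct llc_hint {
	int idle_core;	/* >= 0: CPU of a known idle core, else a value above */
};

/* Scanning for an idle core only makes sense if the hint does not rule one out. */
static bool should_scan_for_idle_core(const struct llc_hint *llc)
{
	return llc->idle_core != IDLE_CORE_NONE;
}

/* A recorded CPU number can be used directly, without a scan. */
static bool has_known_idle_core(const struct llc_hint *llc)
{
	return llc->idle_core >= 0;
}
```

With this encoding a waker can skip the scan entirely for -2, scan for -1,
and use the recorded CPU directly for any non-negative value.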

Recently we found that some of the benchmark numbers on Power10 were lower
than expected. Some analysis showed that the problem lies in the fact that
the L2 cache on Power10 is at core level, i.e. only 4 threads share the L2 cache.

One possible solution was worked out by Gautham, who posted a patch that
marks the MC domain as LLC:
http://lore.kernel.org/lkml/1617341874-1205-1-git-send-email-ego@linux.vnet.ibm.com/t/#u

Here the focus is on improving the core scheduler's current wakeup
mechanism by looking at idle-cores and nr_busy_cpus, which are already
maintained per last-level cache (LLC).

Hence this approach can work well with the mc-llc too. It can help
other architectures too.
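
As a rough illustration of the idea (not the policy implemented by the
patches), the sketch below picks between the previous LLC and the waker's
LLC using only an idle-core hint and a busy-CPU count. The struct, the enum
and pick_llc() are hypothetical names made up for this example; the tie
break in favour of the previous LLC is an assumption as well.

```c
/* Hypothetical per-LLC snapshot; field names follow the terms used above. */
struct llc_snapshot {
	int idle_core;		/* >= 0: CPU of a known idle core; negative: none known */
	int nr_busy_cpus;	/* CPUs currently busy in this LLC */
	int nr_cpus;		/* total CPUs in this LLC */
};

enum llc_choice { PICK_PREV_LLC, PICK_WAKER_LLC };

static enum llc_choice pick_llc(const struct llc_snapshot *prev,
				const struct llc_snapshot *waker)
{
	/* A known idle core wins; the previous LLC is checked first (assumed warmer cache). */
	if (prev->idle_core >= 0)
		return PICK_PREV_LLC;
	if (waker->idle_core >= 0)
		return PICK_WAKER_LLC;

	/*
	 * No idle core known on either side: compare the busy fractions
	 * nr_busy_cpus / nr_cpus, cross-multiplied to avoid division.
	 */
	if ((long)waker->nr_busy_cpus * prev->nr_cpus <
	    (long)prev->nr_busy_cpus * waker->nr_cpus)
		return PICK_WAKER_LLC;

	return PICK_PREV_LLC;
}
```

Because the decision only depends on per-LLC counters, the same logic would
apply whether the LLC is the large cache domain or the smaller MC domain.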

Please review and provide your feedback.

Benchmarking numbers are from Power 10, but I have verified that we don't
regress on a Power 9 setup.

# lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              80
On-line CPU(s) list: 0-79
Thread(s) per core:  8
Core(s) per socket:  10
Socket(s):           1
NUMA node(s):        1
Model:               1.0 (pvr 0080 0100)
Model name:          POWER10 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:           64K
L1i cache:           32K
L2 cache:            256K
L3 cache:            8K
NUMA node2 CPU(s):   0-79

Hackbench: (latency, lower is better)

v5.12-rc5
instances = 1, min = 24.102529 usecs/op, median = 24.102529 usecs/op, max = 24.102529 usecs/op
instances = 2, min = 24.096112 usecs/op, median = 24.096112 usecs/op, max = 24.178903 usecs/op
instances = 4, min = 24.080541 usecs/op, median = 24.082990 usecs/op, max = 24.166873 usecs/op
instances = 8, min = 24.088969 usecs/op, median = 24.116081 usecs/op, max = 24.199853 usecs/op
instances = 16, min = 24.267228 usecs/op, median = 26.204510 usecs/op, max = 29.218360 usecs/op
instances = 32, min = 30.680071 usecs/op, median = 32.911664 usecs/op, max = 37.380470 usecs/op
instances = 64, min = 43.908331 usecs/op, median = 44.454343 usecs/op, max = 46.210298 usecs/op
instances = 80, min = 44.585754 usecs/op, median = 56.738546 usecs/op, max = 60.625826 usecs/op

v5.12-rc5 + mc-llc
instances = 1, min = 18.676505 usecs/op, median = 18.676505 usecs/op, max = 18.676505 usecs/op
instances = 2, min = 18.488627 usecs/op, median = 18.488627 usecs/op, max = 18.574946 usecs/op
instances = 4, min = 18.428399 usecs/op, median = 18.589051 usecs/op, max = 18.872548 usecs/op
instances = 8, min = 18.597389 usecs/op, median = 18.783815 usecs/op, max = 19.265532 usecs/op
instances = 16, min = 21.922350 usecs/op, median = 22.737792 usecs/op, max = 24.832429 usecs/op
instances = 32, min = 29.770446 usecs/op, median = 31.996687 usecs/op, max = 34.053042 usecs/op
instances = 64, min = 53.067842 usecs/op, median = 53.295139 usecs/op, max = 53.473059 usecs/op
instances = 80, min = 44.423288 usecs/op, median = 44.713767 usecs/op, max = 45.159761 usecs/op

v5.12-rc5 + this patchset
instances = 1, min = 19.240824 usecs/op, median = 19.240824 usecs/op, max = 19.240824 usecs/op
instances = 2, min = 19.143470 usecs/op, median = 19.143470 usecs/op, max = 19.249875 usecs/op
instances = 4, min = 19.399812 usecs/op, median = 19.487433 usecs/op, max = 19.501298 usecs/op
instances = 8, min = 19.024297 usecs/op, median = 19.908682 usecs/op, max = 20.741605 usecs/op
instances = 16, min = 22.209444 usecs/op, median = 23.971275 usecs/op, max = 25.145198 usecs/op
instances = 32, min = 31.220392 usecs/op, median = 32.689189 usecs/op, max = 34.081588 usecs/op
instances = 64, min = 39.012110 usecs/op, median = 44.062042 usecs/op, max = 45.370525 usecs/op
instances = 80, min = 43.884358 usecs/op, median = 44.326417 usecs/op, max = 48.031303 usecs/op

Summary:
mc-llc and this patchset seem to be performing much better than vanilla v5.12-rc5
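
To put a number on "much better", the helper below computes the relative
latency reduction between two hackbench runs. It is only a convenience
sketch; the function name is made up here. Applied to the 80-instance
medians above, it gives a reduction of about 22% for this patchset
(56.738546 -> 44.326417 usecs/op) and about 21% for mc-llc
(56.738546 -> 44.713767 usecs/op).

```c
/*
 * Relative latency reduction in percent, for lower-is-better metrics.
 * A positive result means 'patched' is faster than 'base'.
 */
static double latency_reduction_pct(double base, double patched)
{
	return (base - patched) / base * 100.0;
}
```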

DayTrader (throughput, higher is better)
                      v5.12-rc5   v5.12-rc5     v5.12-rc5
                                  + mc-llc      + patchset
64CPUs/1JVM/ 60Users  6373.7      7520.5        7375.6
64CPUs/1JVM/ 80Users  6742.1      7940.9        7832.9
64CPUs/1JVM/100Users  6482.2      7730.3        7538.4
64CPUs/2JVM/ 60Users  6335        8081.6        8000.2
64CPUs/2JVM/ 80Users  6360.8      8259.6        8315.4
64CPUs/2JVM/100Users  6215.6      8046.5        8049.4
64CPUs/4JVM/ 60Users  5385.4      7685.3        8013.5
64CPUs/4JVM/ 80Users  5380.8      7753.3        7868
64CPUs/4JVM/100Users  5275.2      7549.2        7620

Summary: Across all profiles, both this patchset and mc-llc outperform
vanilla v5.12-rc5.
Note: Only 64 CPUs were online during this test.

schbench (latency, lower is better)
======== Running schbench -m 3 -r 30 =================
Latency percentiles (usec) runtime 10 (s) (2545 total samples)
v5.12-rc5                  |  v5.12-rc5 + mc-llc                 | v5.12-rc5 + patchset

50.0th: 56 (1301 samples)  |     50.0th: 49 (1309 samples)       | 50.0th: 53 (1285 samples)
75.0th: 76 (623 samples)   |     75.0th: 66 (628 samples)        | 75.0th: 72 (635 samples)
90.0th: 93 (371 samples)   |     90.0th: 78 (371 samples)        | 90.0th: 88 (388 samples)
95.0th: 107 (123 samples)  |     95.0th: 87 (117 samples)        | 95.0th: 94 (118 samples)
*99.0th: 12560 (102 samples)|    *99.0th: 100 (97 samples)       | *99.0th: 108 (108 samples)
99.5th: 15312 (14 samples) |     99.5th: 104 (12 samples)        | 99.5th: 108 (0 samples)
99.9th: 19936 (9 samples)  |     99.9th: 106 (8 samples)         | 99.9th: 110 (8 samples)
min=13, max=20684          |     min=15, max=113                 | min=15, max=1433

Latency percentiles (usec) runtime 20 (s) (7649 total samples)

50.0th: 51 (3884 samples)  |     50.0th: 50 (3935 samples)       | 50.0th: 51 (3843 samples)
75.0th: 69 (1859 samples)  |     75.0th: 66 (1817 samples)       | 75.0th: 69 (1962 samples)
90.0th: 87 (1173 samples)  |     90.0th: 80 (1204 samples)       | 90.0th: 84 (1103 samples)
95.0th: 97 (368 samples)   |     95.0th: 87 (342 samples)        | 95.0th: 93 (386 samples)
*99.0th: 8624 (290 samples)|     *99.0th: 98 (294 samples)       | *99.0th: 107 (297 samples)
99.5th: 11344 (37 samples) |     99.5th: 102 (37 samples)        | 99.5th: 110 (39 samples)
99.9th: 18592 (31 samples) |     99.9th: 106 (30 samples)        | 99.9th: 1714 (27 samples)
min=13, max=20684          |     min=12, max=113                 | min=15, max=4456

Latency percentiles (usec) runtime 30 (s) (12785 total samples)

50.0th: 50 (6614 samples)  |     50.0th: 49 (6544 samples)       | 50.0th: 50 (6443 samples)
75.0th: 67 (3059 samples)  |     75.0th: 65 (3100 samples)       | 75.0th: 67 (3263 samples)
90.0th: 84 (1894 samples)  |     90.0th: 79 (1912 samples)       | 90.0th: 82 (1890 samples)
95.0th: 94 (586 samples)   |     95.0th: 87 (646 samples)        | 95.0th: 92 (652 samples)
*99.0th: 8304 (507 samples)|     *99.0th: 101 (496 samples)      | *99.0th: 107 (464 samples)
99.5th: 11696 (62 samples) |     99.5th: 104 (45 samples)        | 99.5th: 110 (61 samples)
99.9th: 18592 (51 samples) |     99.9th: 110 (51 samples)        | 99.9th: 1434 (47 samples)
min=12, max=21421          |     min=1, max=126                  | min=15, max=4456

Summary:
mc-llc is the best option, but this patchset also helps compared to vanilla v5.12-rc5

mongodb (threads=6) (throughput, higher is better)
                                          Throughput         read        clean      update
                                                             latency     latency    latency
v5.12-rc5            JVM=YCSB_CLIENTS=14  68116.05 ops/sec   1109.82 us  944.19 us  1342.29 us
v5.12-rc5            JVM=YCSB_CLIENTS=21  64802.69 ops/sec   1772.64 us  944.69 us  2099.57 us
v5.12-rc5            JVM=YCSB_CLIENTS=28  61792.78 ops/sec   2490.48 us  930.09 us  2928.03 us
v5.12-rc5            JVM=YCSB_CLIENTS=35  59604.44 ops/sec   3236.86 us  870.28 us  3787.48 us

v5.12-rc5 + mc-llc   JVM=YCSB_CLIENTS=14  70948.51 ops/sec   1060.21 us  842.02 us  1289.44 us
v5.12-rc5 + mc-llc   JVM=YCSB_CLIENTS=21  68732.48 ops/sec   1669.91 us  871.57 us  1975.19 us
v5.12-rc5 + mc-llc   JVM=YCSB_CLIENTS=28  66674.81 ops/sec   2313.79 us  889.59 us  2702.36 us
v5.12-rc5 + mc-llc   JVM=YCSB_CLIENTS=35  64397.51 ops/sec   3010.66 us  966.28 us  3484.19 us

v5.12-rc5 + patchset JVM=YCSB_CLIENTS=14  67604.51 ops/sec   1117.91 us  947.07 us  1353.41 us
v5.12-rc5 + patchset JVM=YCSB_CLIENTS=21  63979.39 ops/sec   1793.63 us  869.72 us  2130.22 us
v5.12-rc5 + patchset JVM=YCSB_CLIENTS=28  62032.34 ops/sec   2475.89 us  869.06 us  2922.01 us
v5.12-rc5 + patchset JVM=YCSB_CLIENTS=35  60152.96 ops/sec   3203.84 us  972.00 us  3756.52 us

Summary:
mc-llc outperforms both; this patchset and upstream give almost the same performance.

Cc: LKML <linux-kernel@vger.kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Parth Shah <parth@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Rik van Riel <riel@surriel.com>

Srikar Dronamraju (8):
  sched/fair: Update affine statistics when needed
  sched/fair: Maintain the identity of idle-core
  sched/fair: Update idle-core more often
  sched/fair: Prefer idle CPU to cache affinity
  sched/fair: Use affine_idler_llc for wakeups across LLC
  sched/idle: Move busy_cpu accounting to idle callback
  sched/fair: Remove ifdefs in waker_affine_idler_llc
  sched/fair: Dont iterate if no idle CPUs

 include/linux/sched/topology.h |   2 +-
 kernel/sched/fair.c            | 204 ++++++++++++++++++++++++++-------
 kernel/sched/features.h        |   1 +
 kernel/sched/idle.c            |  33 +++++-
 kernel/sched/sched.h           |   6 +
 kernel/sched/topology.c        |   9 ++
 6 files changed, 214 insertions(+), 41 deletions(-)

-- 
2.18.2


* Re: [PATCH v2 2/8] sched/fair: Maintain the identity of idle-core
From: kernel test robot @ 2021-05-06 20:24 UTC
  To: kbuild

CC: kbuild-all(a)lists.01.org
In-Reply-To: <20210506164543.90688-3-srikar@linux.vnet.ibm.com>
References: <20210506164543.90688-3-srikar@linux.vnet.ibm.com>
TO: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
TO: Ingo Molnar <mingo@kernel.org>
TO: Peter Zijlstra <peterz@infradead.org>
CC: LKML <linux-kernel@vger.kernel.org>
CC: Mel Gorman <mgorman@suse.de>
CC: Rik van Riel <riel@surriel.com>
CC: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
CC: Thomas Gleixner <tglx@linutronix.de>
CC: Valentin Schneider <valentin.schneider@arm.com>
CC: Vincent Guittot <vincent.guittot@linaro.org>
CC: Dietmar Eggemann <dietmar.eggemann@arm.com>

Hi Srikar,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linux/master]
[also build test WARNING on v5.12]
[cannot apply to tip/sched/core tip/master linus/master next-20210506]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Srikar-Dronamraju/sched-fair-wake_affine-improvements/20210507-004927
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 1fe5501ba1abf2b7e78295df73675423bd6899a0
:::::: branch date: 4 hours ago
:::::: commit date: 4 hours ago
compiler: arm-linux-gnueabi-gcc (GCC) 9.3.0

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>


cppcheck warnings: (new ones prefixed by >>)
   kernel/sched/fair.c:10894:62: warning: Same value in both branches of ternary operator. [duplicateValueTernary]
    update_load_avg(cfs_rq, se, sched_feat(ATTACH_AGE_LOAD) ? 0 : SKIP_AGE_LOAD);
                                                                ^

cppcheck possible warnings: (new ones prefixed by >>, may not be real problems)

>> kernel/sched/fair.c:6113:2: warning: Non-boolean value returned from function returning bool [returnNonBoolInBooleanFunction]
    return def;
    ^
   kernel/sched/fair.c:6199:16: warning: Local variable task_util shadows outer function [shadowFunction]
    unsigned long task_util, best_cap = 0;
                  ^
   kernel/sched/fair.c:3881:29: note: Shadowed declaration
   static inline unsigned long task_util(struct task_struct *p)
                               ^
   kernel/sched/fair.c:6199:16: note: Shadow variable
    unsigned long task_util, best_cap = 0;
                  ^
   kernel/sched/fair.c:6239:16: warning: Local variable task_util shadows outer function [shadowFunction]
    unsigned long task_util;
                  ^
   kernel/sched/fair.c:3881:29: note: Shadowed declaration
   static inline unsigned long task_util(struct task_struct *p)
                               ^
   kernel/sched/fair.c:6239:16: note: Shadow variable
    unsigned long task_util;
                  ^
   kernel/sched/fair.c:6528:17: warning: Local variable cpu_util shadows outer function [shadowFunction]
     unsigned long cpu_util, util_cfs = cpu_util_next(cpu, p, dst_cpu);
                   ^
   kernel/sched/fair.c:5450:29: note: Shadowed declaration
   static inline unsigned long cpu_util(int cpu);
                               ^
   kernel/sched/fair.c:6528:17: note: Shadow variable
     unsigned long cpu_util, util_cfs = cpu_util_next(cpu, p, dst_cpu);
                   ^
   kernel/sched/fair.c:6789:7: warning: Local variable min_vruntime shadows outer function [shadowFunction]
     u64 min_vruntime;
         ^
   kernel/sched/fair.c:525:19: note: Shadowed declaration
   static inline u64 min_vruntime(u64 min_vruntime, u64 vruntime)
                     ^
   kernel/sched/fair.c:6789:7: note: Shadow variable
     u64 min_vruntime;
         ^
   kernel/sched/fair.c:9998:6: warning: Local variable update_next_balance shadows outer function [shadowFunction]
    int update_next_balance = 0;
        ^
   kernel/sched/fair.c:9870:1: note: Shadowed declaration
   update_next_balance(struct sched_domain *sd, unsigned long *next_balance)
   ^
   kernel/sched/fair.c:9998:6: note: Shadow variable
    int update_next_balance = 0;
        ^

vim +6113 kernel/sched/fair.c

10e2f1acd0106c0 Peter Zijlstra    2016-05-09  6110  
c6513fba6f118be Srikar Dronamraju 2021-05-06  6111  static inline bool get_idle_core(int cpu, int def)
9fe1f127b913318 Mel Gorman        2021-01-27  6112  {
9fe1f127b913318 Mel Gorman        2021-01-27 @6113  	return def;
9fe1f127b913318 Mel Gorman        2021-01-27  6114  }
10e2f1acd0106c0 Peter Zijlstra    2016-05-09  6115  
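
The flagged stub is declared bool but returns the int default, so negative
sentinel values passed as 'def' would all collapse to true. One plausible
way to address the warning, sketched below, is to make the stub return int.
This assumes callers expect the default back unchanged; it has not been
checked against the rest of the series.

```c
#include <stdbool.h>

/* Shape of the flagged stub: any non-zero 'def' is converted to true. */
static inline bool get_idle_core_as_bool(int cpu, int def)
{
	(void)cpu;
	return def;
}

/* Possible fix (assumption): return the default unchanged as an int. */
static inline int get_idle_core(int cpu, int def)
{
	(void)cpu;
	return def;
}
```

With the bool variant, a caller passing -1 or -2 gets back 1 in both cases;
with the int variant, the two values stay distinguishable.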

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org


Thread overview: 32+ messages
2021-05-06 16:45 [PATCH v2 0/8] sched/fair: wake_affine improvements Srikar Dronamraju
2021-05-06 16:45 ` [PATCH v2 1/8] sched/fair: Update affine statistics when needed Srikar Dronamraju
2021-05-07 16:08   ` Valentin Schneider
2021-05-07 17:05     ` Srikar Dronamraju
2021-05-11 11:51       ` Valentin Schneider
2021-05-11 16:22         ` Srikar Dronamraju
2021-05-06 16:45 ` [PATCH v2 2/8] sched/fair: Maintain the identity of idle-core Srikar Dronamraju
2021-05-11 11:51   ` Valentin Schneider
2021-05-11 16:27     ` Srikar Dronamraju
2021-05-06 16:45 ` [PATCH v2 3/8] sched/fair: Update idle-core more often Srikar Dronamraju
2021-05-06 16:45 ` [PATCH v2 4/8] sched/fair: Prefer idle CPU to cache affinity Srikar Dronamraju
2021-05-06 16:45 ` [PATCH v2 5/8] sched/fair: Use affine_idler_llc for wakeups across LLC Srikar Dronamraju
2021-05-06 16:45 ` [PATCH v2 6/8] sched/idle: Move busy_cpu accounting to idle callback Srikar Dronamraju
2021-05-11 11:51   ` Valentin Schneider
2021-05-11 16:55     ` Srikar Dronamraju
2021-05-12  0:32     ` Aubrey Li
2021-05-12  8:08   ` Aubrey Li
2021-05-13  7:31     ` Srikar Dronamraju
2021-05-14  4:11       ` Aubrey Li
2021-05-17 10:40         ` Srikar Dronamraju
2021-05-17 12:48           ` Aubrey Li
2021-05-17 12:57             ` Srikar Dronamraju
2021-05-18  0:59               ` Aubrey Li
2021-05-18  4:00                 ` Srikar Dronamraju
2021-05-18  6:05                   ` Aubrey Li
2021-05-18  7:18                     ` Srikar Dronamraju
2021-05-19  9:43                       ` Aubrey Li
2021-05-19 17:34                         ` Srikar Dronamraju
2021-05-06 16:45 ` [PATCH v2 7/8] sched/fair: Remove ifdefs in waker_affine_idler_llc Srikar Dronamraju
2021-05-06 16:45 ` [PATCH v2 8/8] sched/fair: Dont iterate if no idle CPUs Srikar Dronamraju
2021-05-06 16:53 ` [PATCH v2 0/8] sched/fair: wake_affine improvements Srikar Dronamraju
2021-05-06 20:24 [PATCH v2 2/8] sched/fair: Maintain the identity of idle-core kernel test robot
