From: Vincent Guittot <vincent.guittot@linaro.org>
To: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Mel Gorman <mgorman@techsingularity.net>,
	Rik van Riel <riel@surriel.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Valentin Schneider <valentin.schneider@arm.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Michael Ellerman <mpe@ellerman.id.au>,
	Gautham R Shenoy <ego@linux.vnet.ibm.com>,
	Parth Shah <parth@linux.ibm.com>
Subject: Re: [PATCH 00/10] sched/fair: wake_affine improvements
Date: Tue, 27 Apr 2021 16:52:30 +0200
Message-ID: <CAKfTPtAuFpr05-ZBNjB9OiNNQnmgPSX3S4=Sz-A8sOnFAkr7Tg@mail.gmail.com>
In-Reply-To: <20210422102326.35889-1-srikar@linux.vnet.ibm.com>

Hi Srikar,

On Thu, 22 Apr 2021 at 12:23, Srikar Dronamraju
<srikar@linux.vnet.ibm.com> wrote:
>
> Recently we found that some of the benchmark numbers on Power10 were lower
> than expected. Some analysis showed that the problem lies in the fact that
> L2-Cache on Power10 is at core level, i.e. only 4 threads share the L2-cache.
>
> One probable solution to the problem was worked out by Gautham, who posted
> a patch that marks the MC domain as LLC:
> http://lore.kernel.org/lkml/1617341874-1205-1-git-send-email-ego@linux.vnet.ibm.com/t/#u
>
> Here the focus is on seeing if we can improve the current core scheduler's
> wakeup mechanism by looking at idle-cores and nr_busy_cpus, which are already
> maintained per last-level cache (aka LLC) (first 8 patches), and on exploring
> the possibility of providing a fallback LLC domain that can be preferred if
> the current LLC is busy (last 2 patches).
>
> Except for the last 2 patches, the rest of the patches should work
> independently of the other proposed solution, i.e. if the mc-llc patch is
> accepted, then the last two patches may not be needed for Power10. However,
> they may be helpful for other archs/platforms.
>
> In the fallback approach, we look for a one-to-one mapping for each LLC.
> However, this can easily be modified to look for all LLCs in the current
> LLC's parent. Also, the fallback is only used for sync wakeups, because
> that is where we expect the maximum benefit from moving the wakee closer to
> the waker. For non-sync wakeups, it's expected that a CPU from the previous
> LLC may be better off.
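
To make the fallback policy described above concrete, here is a minimal sketch in plain C. It is not code from the patchset: `struct llc_model`, `llc_is_busy()` and `pick_wakeup_llc()` are hypothetical names, and the "busy" test (no known idle core and every CPU busy) is an assumption made only for illustration. The only parts taken from the cover letter are that the fallback is one-to-one, that it is only considered for sync wakeups, and that non-sync wakeups stay with the previous LLC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of one LLC domain (not a kernel struct). */
struct llc_model {
	int nr_cpus;                /* CPUs sharing this LLC */
	int nr_busy_cpus;           /* how many of them are currently busy */
	int idle_core;              /* id of a fully idle core, or -1 if none known */
	struct llc_model *fallback; /* one-to-one fallback LLC, or NULL */
};

/* Assumption: an LLC counts as busy when it has no idle core and no idle CPU. */
static bool llc_is_busy(const struct llc_model *llc)
{
	return llc->idle_core < 0 && llc->nr_busy_cpus >= llc->nr_cpus;
}

/* Return the LLC in which to look for a CPU for the task being woken. */
static const struct llc_model *
pick_wakeup_llc(const struct llc_model *waker_llc,
		const struct llc_model *prev_llc, bool sync)
{
	assert(waker_llc != NULL && prev_llc != NULL);

	/* Non-sync wakeup: the previous LLC is expected to be the better choice. */
	if (!sync)
		return prev_llc;

	/* Sync wakeup: prefer the waker's LLC while it still has room. */
	if (!llc_is_busy(waker_llc))
		return waker_llc;

	/* The waker's LLC is busy: try its one-to-one fallback LLC. */
	if (waker_llc->fallback && !llc_is_busy(waker_llc->fallback))
		return waker_llc->fallback;

	/* Assumption: if the fallback is busy too, stay with the previous LLC. */
	return prev_llc;
}
```

With mc-llc the question does not arise, because both halves of the core sit in the same LLC domain and the usual idle-CPU search already covers them.
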
>
> Please review and provide your feedback.
>
> Benchmarking numbers are from Power 10 but I have verified that we don't
> regress on Power 9 setup.
>
> # lscpu
> Architecture:        ppc64le
> Byte Order:          Little Endian
> CPU(s):              80
> On-line CPU(s) list: 0-79
> Thread(s) per core:  8
> Core(s) per socket:  10
> Socket(s):           1
> NUMA node(s):        1
> Model:               1.0 (pvr 0080 0100)
> Model name:          POWER10 (architected), altivec supported
> Hypervisor vendor:   pHyp
> Virtualization type: para
> L1d cache:           64K
> L1i cache:           32K
> L2 cache:            256K
> L3 cache:            8K
> NUMA node2 CPU(s):   0-79
>
> Hackbench: (latency, lower is better)
>
> v5.12-rc5
> instances = 1, min = 24.102529 usecs/op, median =  usecs/op, max = 24.102529 usecs/op
> instances = 2, min = 24.096112 usecs/op, median = 24.096112 usecs/op, max = 24.178903 usecs/op
> instances = 4, min = 24.080541 usecs/op, median = 24.082990 usecs/op, max = 24.166873 usecs/op
> instances = 8, min = 24.088969 usecs/op, median = 24.116081 usecs/op, max = 24.199853 usecs/op
> instances = 16, min = 24.267228 usecs/op, median = 26.204510 usecs/op, max = 29.218360 usecs/op
> instances = 32, min = 30.680071 usecs/op, median = 32.911664 usecs/op, max = 37.380470 usecs/op
> instances = 64, min = 43.908331 usecs/op, median = 44.454343 usecs/op, max = 46.210298 usecs/op
> instances = 80, min = 44.585754 usecs/op, median = 56.738546 usecs/op, max = 60.625826 usecs/op
>
> v5.12-rc5 + mc-llc
> instances = 1, min = 18.676505 usecs/op, median =  usecs/op, max = 18.676505 usecs/op
> instances = 2, min = 18.488627 usecs/op, median = 18.488627 usecs/op, max = 18.574946 usecs/op
> instances = 4, min = 18.428399 usecs/op, median = 18.589051 usecs/op, max = 18.872548 usecs/op
> instances = 8, min = 18.597389 usecs/op, median = 18.783815 usecs/op, max = 19.265532 usecs/op
> instances = 16, min = 21.922350 usecs/op, median = 22.737792 usecs/op, max = 24.832429 usecs/op
> instances = 32, min = 29.770446 usecs/op, median = 31.996687 usecs/op, max = 34.053042 usecs/op
> instances = 64, min = 53.067842 usecs/op, median = 53.295139 usecs/op, max = 53.473059 usecs/op
> instances = 80, min = 44.423288 usecs/op, median = 44.713767 usecs/op, max = 45.159761 usecs/op
>
> v5.12-rc5 + this patchset
> instances = 1, min = 19.368805 usecs/op, median =  usecs/op, max = 19.368805 usecs/op
> instances = 2, min = 19.423674 usecs/op, median = 19.423674 usecs/op, max = 19.506203 usecs/op
> instances = 4, min = 19.454523 usecs/op, median = 19.596947 usecs/op, max = 19.863620 usecs/op
> instances = 8, min = 20.005272 usecs/op, median = 20.239924 usecs/op, max = 20.878947 usecs/op
> instances = 16, min = 21.856779 usecs/op, median = 24.102147 usecs/op, max = 25.496110 usecs/op
> instances = 32, min = 31.460159 usecs/op, median = 32.809621 usecs/op, max = 33.939650 usecs/op
> instances = 64, min = 39.506553 usecs/op, median = 43.835221 usecs/op, max = 45.645505 usecs/op
> instances = 80, min = 43.805716 usecs/op, median = 44.314757 usecs/op, max = 48.910236 usecs/op
>
> Summary:
> mc-llc and this patchset seem to be performing much better than vanilla v5.12-rc5
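
To put a number on "much better", the relative latency reduction can be computed from the figures above. The helper below is only an illustrative sketch; `latency_reduction_pct()` is a hypothetical name and not part of hackbench or the patchset.

```c
#include <assert.h>

/*
 * Relative latency reduction of `patched` versus `baseline`, in percent.
 * Both arguments are usecs/op; a positive result means `patched` is faster.
 */
static double latency_reduction_pct(double baseline, double patched)
{
	assert(baseline > 0.0);
	return (baseline - patched) / baseline * 100.0;
}
```

Using the minimum latency at 1 instance, latency_reduction_pct(24.102529, 18.676505) gives roughly 22.5% for mc-llc, and latency_reduction_pct(24.102529, 19.368805) gives roughly 19.6% for this patchset. The gain is not uniform across instance counts: at 64 instances the mc-llc median (53.295139) is about 20% higher than the vanilla median (44.454343).
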
>
> DayTrader (throughput, higher is better)
>                      v5.12-rc5   v5.12-rc5     v5.12-rc5
>                                  + mc-llc      + patchset
> 64CPUs/1JVM/ 60Users  6373.7      7520.5        7232.3
> 64CPUs/1JVM/ 80Users  6742.1      7940.9        7732.8
> 64CPUs/1JVM/100Users  6482.2      7730.3        7540
> 64CPUs/2JVM/ 60Users  6335        8081.6        7914.3
> 64CPUs/2JVM/ 80Users  6360.8      8259.6        8138.6
> 64CPUs/2JVM/100Users  6215.6      8046.5        8039.2
> 64CPUs/4JVM/ 60Users  5385.4      7685.3        7706.1
> 64CPUs/4JVM/ 80Users  5380.8      7753.3        7721.5
> 64CPUs/4JVM/100Users  5275.2      7549.2        7608.3
>
> Summary: Across all profiles, this patchset or mc-llc outperforms
> vanilla v5.12-rc5.
> Note: Only 64 cores were online during this test.
>
> schbench (latency: lower is better)
> ======== Running schbench -m 3 -r 30 =================
> Latency percentiles (usec) runtime 10 (s) (2545 total samples)
> v5.12-rc5                  |  v5.12-rc5 + mc-llc                 | v5.12-rc5 + patchset
>
> 50.0th: 56 (1301 samples)  |     50.0th: 49 (1309 samples)       | 50.0th: 50 (1310 samples)
> 75.0th: 76 (623 samples)   |     75.0th: 66 (628 samples)        | 75.0th: 68 (632 samples)
> 90.0th: 93 (371 samples)   |     90.0th: 78 (371 samples)        | 90.0th: 80 (354 samples)
> 95.0th: 107 (123 samples)  |     95.0th: 87 (117 samples)        | 95.0th: 86 (126 samples)
> *99.0th: 12560 (102 samples)|   *99.0th: 100 (97 samples)        | *99.0th: 103 (97 samples)
> 99.5th: 15312 (14 samples) |     99.5th: 104 (12 samples)        | 99.5th: 1202 (13 samples)
> 99.9th: 19936 (9 samples)  |     99.9th: 106 (8 samples)         | 99.9th: 14992 (10 samples)
> min=13, max=20684          |     min=15, max=113                 | min=15, max=18721
>
> Latency percentiles (usec) runtime 20 (s) (7649 total samples)
>
> 50.0th: 51 (3884 samples)  |     50.0th: 50 (3935 samples)       | 50.0th: 49 (3841 samples)
> 75.0th: 69 (1859 samples)  |     75.0th: 66 (1817 samples)       | 75.0th: 67 (1965 samples)
> 90.0th: 87 (1173 samples)  |     90.0th: 80 (1204 samples)       | 90.0th: 78 (1134 samples)
> 95.0th: 97 (368 samples)   |     95.0th: 87 (342 samples)        | 95.0th: 83 (359 samples)
> *99.0th: 8624 (290 samples)|     *99.0th: 98 (294 samples)       | *99.0th: 93 (296 samples)
> 99.5th: 11344 (37 samples) |     99.5th: 102 (37 samples)        | 99.5th: 98 (34 samples)
> 99.9th: 18592 (31 samples) |     99.9th: 106 (30 samples)        | 99.9th: 7544 (28 samples)
> min=13, max=20684          |     min=12, max=113                 | min=13, max=18721
>
> Latency percentiles (usec) runtime 30 (s) (12785 total samples)
>
> 50.0th: 50 (6614 samples)  |     50.0th: 49 (6544 samples)       | 50.0th: 48 (6527 samples)
> 75.0th: 67 (3059 samples)  |     75.0th: 65 (3100 samples)       | 75.0th: 64 (3143 samples)
> 90.0th: 84 (1894 samples)  |     90.0th: 79 (1912 samples)       | 90.0th: 76 (1985 samples)
> 95.0th: 94 (586 samples)   |     95.0th: 87 (646 samples)        | 95.0th: 81 (585 samples)
> *99.0th: 8304 (507 samples)|     *99.0th: 101 (496 samples)      | *99.0th: 90 (453 samples)
> 99.5th: 11696 (62 samples) |     99.5th: 104 (45 samples)        | 99.5th: 94 (66 samples)
> 99.9th: 18592 (51 samples) |     99.9th: 110 (51 samples)        | 99.9th: 1202 (49 samples)
> min=12, max=21421          |     min=1, max=126                  | min=3, max=18721
>
> Summary:
> mc-llc is the best option, but this patchset also helps compared to vanilla v5.12-rc5
>
>
> mongodb (threads=6) (throughput, higher is better)
>                                          Throughput         read        clean      update
>                                                             latency     latency    latency
> v5.12-rc5            JVM=YCSB_CLIENTS=14  68116.05 ops/sec   1109.82 us  944.19 us  1342.29 us
> v5.12-rc5            JVM=YCSB_CLIENTS=21  64802.69 ops/sec   1772.64 us  944.69 us  2099.57 us
> v5.12-rc5            JVM=YCSB_CLIENTS=28  61792.78 ops/sec   2490.48 us  930.09 us  2928.03 us
> v5.12-rc5            JVM=YCSB_CLIENTS=35  59604.44 ops/sec   3236.86 us  870.28 us  3787.48 us
>
> v5.12-rc5 + mc-llc   JVM=YCSB_CLIENTS=14  70948.51 ops/sec   1060.21 us  842.02 us  1289.44 us
> v5.12-rc5 + mc-llc   JVM=YCSB_CLIENTS=21  68732.48 ops/sec   1669.91 us  871.57 us  1975.19 us
> v5.12-rc5 + mc-llc   JVM=YCSB_CLIENTS=28  66674.81 ops/sec   2313.79 us  889.59 us  2702.36 us
> v5.12-rc5 + mc-llc   JVM=YCSB_CLIENTS=35  64397.51 ops/sec   3010.66 us  966.28 us  3484.19 us
>
> v5.12-rc5 + patchset JVM=YCSB_CLIENTS=14  67403.29 ops/sec   1121.80 us  797.81 us  1357.28 us
> v5.12-rc5 + patchset JVM=YCSB_CLIENTS=21  63952.79 ops/sec   1792.86 us  779.59 us  2130.54 us
> v5.12-rc5 + patchset JVM=YCSB_CLIENTS=28  62198.83 ops/sec   2469.60 us  780.00 us  2914.48 us
> v5.12-rc5 + patchset JVM=YCSB_CLIENTS=35  60333.81 ops/sec   3192.41 us  822.09 us  3748.24 us
>
> Summary:
> mc-llc outperforms; this patchset and upstream give almost similar performance.

So the mc-llc patch seems to be the best approach IMHO. Although the
hemispheres don't share a cache, they share enough resources that
cache snooping is as efficient as sharing a cache.

>
>
> Cc: LKML <linux-kernel@vger.kernel.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> Cc: Parth Shah <parth@linux.ibm.com>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Valentin Schneider <valentin.schneider@arm.com>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Rik van Riel <riel@surriel.com>
>
> Srikar Dronamraju (10):
>   sched/fair: Update affine statistics when needed
>   sched/fair: Maintain the identity of idle-core
>   sched/fair: Update idle-core more often
>   sched/fair: Prefer idle CPU to cache affinity
>   sched/fair: Call wake_affine only if necessary
>   sched/idle: Move busy_cpu accounting to idle callback
>   sched/fair: Remove ifdefs in waker_affine_idler_llc
>   sched/fair: Dont iterate if no idle CPUs
>   sched/topology: Introduce fallback LLC
>   powerpc/smp: Add fallback flag to powerpc MC domain
>
>  arch/powerpc/kernel/smp.c      |   7 +-
>  include/linux/sched/sd_flags.h |   7 +
>  include/linux/sched/topology.h |   3 +-
>  kernel/sched/fair.c            | 229 +++++++++++++++++++++++++++------
>  kernel/sched/features.h        |   1 +
>  kernel/sched/idle.c            |  33 ++++-
>  kernel/sched/sched.h           |   6 +
>  kernel/sched/topology.c        |  54 +++++++-
>  8 files changed, 296 insertions(+), 44 deletions(-)
>
> --
> 2.18.2
>
