From: David Vernet <void@manifault.com>
To: "Gautham R. Shenoy" <gautham.shenoy@amd.com>
Cc: linux-kernel@vger.kernel.org, mingo@redhat.com,
	peterz@infradead.org, juri.lelli@redhat.com,
	vincent.guittot@linaro.org, dietmar.eggemann@arm.com,
	rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de,
	bristot@redhat.com, vschneid@redhat.com, kprateek.nayak@amd.com,
	aaron.lu@intel.com, clm@meta.com, tj@kernel.org,
	roman.gushchin@linux.dev, kernel-team@meta.com
Subject: Re: [PATCH v2 0/7] sched: Implement shared runqueue in CFS
Date: Tue, 25 Jul 2023 15:22:55 -0500	[thread overview]
Message-ID: <20230725202255.GA13618@maniforge> (raw)
In-Reply-To: <ZLpMGVPDXqWEu+gm@BLR-5CG11610CF.amd.com>

On Fri, Jul 21, 2023 at 02:42:57PM +0530, Gautham R. Shenoy wrote:
> Hello David,

Hello Gautham,

Thank you for taking the time to run these benchmarks. Apologies for the
delayed response -- I've been traveling this week.

> On Mon, Jul 10, 2023 at 03:03:35PM -0500, David Vernet wrote:
> > Changes
> > -------
> > 
> > This is v2 of the shared wakequeue (now called shared runqueue)
> > patchset. The following are changes from the RFC v1 patchset
> > (https://lore.kernel.org/lkml/20230613052004.2836135-1-void@manifault.com/).
> > 
> > v1 -> v2 changes:
> > - Change name from swqueue to shared_runq (Peter)
> > 
> > - Sharded per-LLC shared runqueues to avoid contention on
> >   scheduler-heavy workloads (Peter)
> > 
> > - Pull tasks from the shared_runq in newidle_balance() rather than in
> >   pick_next_task_fair() (Peter and Vincent)
> > 
> > - Rename a few functions to reflect their actual purpose. For example,
> >   shared_runq_dequeue_task() instead of swqueue_remove_task() (Peter)
> > 
> > - Expose move_queued_task() from core.c rather than migrate_task_to()
> >   (Peter)
> > 
> > - Properly check is_cpu_allowed() when pulling a task from a shared_runq
> >   to ensure it can actually be migrated (Peter and Gautham)
> > 
> > - Dropped RFC tag
> > 
> > This patch set is based off of commit ebb83d84e49b ("sched/core: Avoid
> > multiple calling update_rq_clock() in __cfsb_csd_unthrottle()") on the
> > sched/core branch of tip.git.
> 
> I have evaluated this v2 patchset on AMD Zen3 and Zen4 servers.
> 
> tldr:
> 
> * We see non-trivial improvements on hackbench on both Zen3 and Zen4
>   until the system is super-overloaded, at which point we see
>   regressions.

This makes sense to me. SHARED_RUNQ is more likely to help performance
when the system is not over-utilized, as it has more of a chance to
actually increase work conservation. If the system is over-utilized,
it's likely that a core will be able to find a task regardless of
whether it looks at the shared runq.
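
For anyone following the thread without the patches in front of them, the
core idea is that each LLC has a shared queue of runnable tasks, and a core
that fails to find work locally pulls from that queue in newidle_balance()
before going idle; that extra pull is where the work conservation comes
from. Here's a toy userspace model of the concept (illustrative only -- the
names and the list implementation are made up and are not the actual kernel
code):

/*
 * Toy model of a per-LLC shared runqueue: a single locked FIFO that a
 * CPU about to go idle pops from before giving up. Build with e.g.
 * gcc -O2 shared_runq_toy.c -lpthread (hypothetical file name).
 */
#include <pthread.h>
#include <stdio.h>

struct task {
	int id;
	struct task *next;
};

struct shared_runq {
	pthread_spinlock_t lock;
	struct task *head, *tail;
};

/* Called when a task becomes runnable / is enqueued. */
static void shared_runq_push(struct shared_runq *q, struct task *t)
{
	pthread_spin_lock(&q->lock);
	t->next = NULL;
	if (q->tail)
		q->tail->next = t;
	else
		q->head = t;
	q->tail = t;
	pthread_spin_unlock(&q->lock);
}

/* Called by a CPU that found nothing on its own runqueue ("newidle"). */
static struct task *shared_runq_pop(struct shared_runq *q)
{
	struct task *t;

	pthread_spin_lock(&q->lock);
	t = q->head;
	if (t) {
		q->head = t->next;
		if (!q->head)
			q->tail = NULL;
	}
	pthread_spin_unlock(&q->lock);
	return t;
}

int main(void)
{
	struct shared_runq q = { .head = NULL, .tail = NULL };
	struct task a = { .id = 1 }, b = { .id = 2 };
	struct task *t;

	pthread_spin_init(&q.lock, PTHREAD_PROCESS_PRIVATE);
	shared_runq_push(&q, &a);
	shared_runq_push(&q, &b);
	while ((t = shared_runq_pop(&q)))
		printf("pulled task %d instead of going idle\n", t->id);
	return 0;
}

When the system is saturated, the local runqueues rarely run dry, so that
pop almost never finds anything useful and the queue (and its lock) is
mostly just overhead.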

That said, I wasn't able to reproduce the regressions (with --groups 16)
on my 7950X, presumably because it only has 8 cores / CCX.

> * tbench shows regressions on Zen3 with each client
>   configuration. tbench on Zen4 shows some improvements when the system is
>   overloaded.

Hmm, I also observed tbench not performing well with SHARED_RUNQ on my
Zen4 / 7950X, but only with heavy load. It also seems that sharding
helps a lot for tbench on Zen3, whereas Zen4 performs fine without it.
I'm having trouble reasoning about why Zen4 wouldn't require sharding
whereas Zen3 would, given that Zen4 has more cores per CCX.

Just to verify -- these benchmarks were run with boost disabled,
correct? Otherwise, there could be a lot of run-to-run variance
depending on thermal throttling.

> 
> * netperf shows minor improvements on Zen3 when the system is under
>   low to moderate load. netperf regresses on Zen3 at high load, and at
>   all load-points on Zen4.

netperf in general seems to regress as the size of the LLC increases,
because it hammers the runqueue so relentlessly, though it's still
surprising to me that your Zen4 test showed regressions under low /
moderate load as well. Was this with -t TCP_RR, or -t UDP_RR? I observed
SHARED_RUNQ improving performance on my 7950X for -t TCP_RR as described
in [0], so I'd be curious to better understand where the slowdowns are
coming from (presumably it's just contention on the shard lock due to
the larger CCX?).

[0]: https://lore.kernel.org/all/20230615000103.GC2883716@maniforge/

> * Stream, SPECjbb2015 and Mongodb show no significant difference compared
>   to the current tip.
> 
> * With netperf and tbench, using the shared-runqueue during
>   enqueue_entity performs badly.

My reading of your Zen4 numbers for tbench seems to imply that it actually
performs well under heavy load. Copying them here for convenience:

Zen4, 2 Sockets, 128 cores per socket, SMT2:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Clients:    tip[pct imp](CV)       swqueue[pct imp](CV)     noshard[pct imp](CV)    shard_wakeup[pct imp](CV)   shard_all[pct imp](CV)
    1     1.00 [  0.00]( 0.19)     0.98 [ -1.72]( 0.19)     0.99 [ -1.15]( 0.28)     0.98 [ -1.79]( 0.28)     0.99 [ -1.49]( 0.10)
    2     1.00 [  0.00]( 0.63)     0.98 [ -2.28]( 0.63)     0.98 [ -1.91]( 0.26)     0.97 [ -3.14]( 0.25)     0.98 [ -1.77]( 0.32)
    4     1.00 [  0.00]( 0.22)     1.00 [  0.00]( 1.13)     0.99 [ -0.69]( 0.57)     0.98 [ -1.59]( 0.35)     0.99 [ -0.64]( 0.18)
    8     1.00 [  0.00]( 1.14)     0.99 [ -0.73]( 0.61)     0.98 [ -2.28]( 2.61)     0.97 [ -2.56]( 0.34)     0.98 [ -1.77]( 0.70)
   16     1.00 [  0.00]( 0.98)     0.97 [ -2.54]( 1.24)     0.98 [ -1.71]( 1.86)     0.98 [ -1.53]( 0.62)     0.96 [ -3.56]( 0.93)
   32     1.00 [  0.00]( 0.76)     0.98 [ -2.31]( 1.35)     0.98 [ -2.06]( 0.77)     0.96 [ -3.53]( 1.63)     0.88 [-11.72]( 2.77)
   64     1.00 [  0.00]( 0.96)     0.96 [ -4.45]( 3.53)     0.97 [ -3.44]( 1.53)     0.96 [ -3.52]( 0.89)     0.31 [-69.03]( 0.64)
  128     1.00 [  0.00]( 3.03)     0.95 [ -4.78]( 0.56)     0.98 [ -2.48]( 0.47)     0.92 [ -7.73]( 0.16)     0.20 [-79.75]( 0.24)
  256     1.00 [  0.00]( 0.04)     0.93 [ -7.21]( 1.00)     0.94 [ -5.90]( 0.63)     0.59 [-41.29]( 1.76)     0.16 [-83.71]( 0.07)
  512     1.00 [  0.00]( 3.08)     1.07 [  7.07](17.78)     1.15 [ 15.49]( 2.65)     0.82 [-17.53](29.11)     0.93 [ -7.18](32.23)
 1024     1.00 [  0.00]( 0.60)     1.16 [ 15.61]( 0.07)     1.16 [ 15.92]( 0.06)     1.12 [ 11.57]( 1.86)     1.12 [ 11.97]( 0.21)
 2048     1.00 [  0.00]( 0.16)     1.15 [ 14.62]( 0.90)     1.15 [ 15.20]( 0.29)     1.08 [  7.64]( 1.44)     1.15 [ 14.57]( 0.23)

I'm also struggling to come up with an explanation for why Zen4 would
perform well with SHARED_RUNQ under heavy load. Do you have a theory?

> Server configurations used:
> 
> AMD Zen3 Server:
>   * 2 sockets,
>   * 64 cores per socket,
>   * SMT2 enabled
>   * Total of 256 threads.
>   * Configured in Nodes-Per-Socket(NPS)=1
> 
> AMD Zen4 Server:
>   * 2 sockets,
>   * 128 cores per socket,
>   * SMT2 enabled
>   * Total of 512 threads.
>   * Configured in Nodes-Per-Socket(NPS)=1
>
> The trends on NPS=2 and NPS=4 are similar. So I am not posting those.
> 
> 
> Legend:
> tip           : Tip kernel with top commit ebb83d84e49b
>                 ("sched/core: Avoid multiple calling update_rq_clock() in __cfsb_csd_unthrottle()")
> 
> swqueue_v1    : Your v1 patches applied on top of the aforementioned tip commit.
> 
> noshard       : shared-runqueue v2 patches 1-5. This uses a shared-runqueue
>                 during wakeup. No sharding.
> 
> shard_wakeup  : shared-runqueue v2 patches 1-6. This uses a
>                 shared-runqueue during wakeup and has shards with
>                 shard size = 6 (default)
> 
> shard_all     : v2 patches 1-7. This uses a sharded shared-runqueue during
>                 enqueue_entity

So, what's your overall impression from these numbers? My general
impression so far is the following:

- SHARED_RUNQ works best when the system would otherwise be
  under-utilized. If the system is going to be overloaded, it's unlikely
  to provide a significant benefit over CFS, and may even just add
  overhead with no benefit (or just cause worse cache locality).

- SHARED_RUNQ isn't well-suited to workloads such as netperf which
  pummel the scheduler. Sharding helps a lot here, but doesn't
  completely fix the problem depending on how aggressively tasks are
  hammering the runqueue.

- To the point above, using SHARED_RUNQ in __enqueue_entity() /
  __dequeue_entity(), rather than just on the wakeup path, is a net
  positive. Workloads which hammer the runq such as netperf or schbench
  -L -m 52 -p 512 -r 10 -t 1 will do poorly in both scenarios, so we may
  as well get the better work conservation from __enqueue_entity() /
  __dequeue_entity(). hackbench is one example of a workload that
  benefits from this, another is kernel compile, and I strongly suspect
  that HHVM would similarly benefit.

- Sharding in general doesn't seem to regress performance by much when
  it wouldn't have otherwise been necessary to avoid contention.
  hackbench is better without sharding on Zen3, but it's also better
  with shard_all on Zen4.

  In general, if our goal is to better support hosts with large CCXs, I
  think we'll just need to support sharding; a rough sketch of the kind
  of cpu -> shard mapping I have in mind is below.
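
To make the sharding point concrete, here's the kind of cpu -> shard
mapping I mean (again, just an illustrative userspace sketch rather than
the actual code from patch 6; the only number taken from the series is
the default shard size of 6):

/*
 * Sharding sketch: instead of one queue (and one lock) per LLC, carve
 * the LLC into shards of SHARD_SIZE CPUs each, so a runqueue-hammering
 * workload like netperf contends with a handful of CPUs rather than
 * the whole CCX. The mapping below is hypothetical; the real patches
 * may assign CPUs to shards differently.
 */
#include <stdio.h>

#define SHARD_SIZE	6	/* default shard size in patch 6 */

static int shard_idx(int cpu_in_llc)
{
	return cpu_in_llc / SHARD_SIZE;
}

int main(void)
{
	int cpus_in_llc = 16;	/* example LLC size */
	int cpu;

	for (cpu = 0; cpu < cpus_in_llc; cpu++)
		printf("cpu %2d -> shard %d\n", cpu, shard_idx(cpu));
	return 0;
}

The obvious tradeoff is weaker work conservation across the full LLC,
which is the cost to weigh against the contention relief.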

Thoughts?

I have the v3 version of the patch set which properly supports domain
recreation and hotplug, but I still need to get updated benchmark
numbers on it, as well as benchmark spreading a shared_runq over
multiple CCXs per Peter's comment in [1] about the initial motivation
behind SIS_NODE also applying to SHARED_RUNQ.

[1]: https://lore.kernel.org/all/20230711114207.GK3062772@hirez.programming.kicks-ass.net/

Given the points above, I would ideally like to just run the shard_all
variant and compare that to the numbers I collected on v2 and shared in
[2]. What do you think? There will be tradeoffs no matter what we choose
to do, but enqueuing / dequeuing in __enqueue_entity() /
__dequeue_entity() seems to perform the best for workloads that don't
hammer the runqueue, and sharding seems like a given if we do decide to
do that.

[2]: https://lore.kernel.org/all/20230710200342.358255-1-void@manifault.com/

Thanks,
David
