From: Vincent Guittot <firstname.lastname@example.org>
To: Mel Gorman <email@example.com>
Cc: Srikar Dronamraju <firstname.lastname@example.org>,
"Gautham R. Shenoy" <email@example.com>,
Michael Ellerman <firstname.lastname@example.org>,
Michael Neuling <email@example.com>,
Rik van Riel <firstname.lastname@example.org>,
Valentin Schneider <email@example.com>,
Dietmar Eggemann <firstname.lastname@example.org>,
Nicholas Piggin <email@example.com>,
Anton Blanchard <firstname.lastname@example.org>,
Parth Shah <email@example.com>,
Vaidyanathan Srinivasan <firstname.lastname@example.org>,
Subject: Re: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain
Date: Tue, 13 Apr 2021 09:10:31 +0200 [thread overview]
Message-ID: <CAKfTPtDTyyrqBymQx16P+uhQKZkCyCNvDtEebG7FX5Emh=kkCg@mail.gmail.com> (raw)
On Mon, 12 Apr 2021 at 17:24, Mel Gorman <email@example.com> wrote:
> On Mon, Apr 12, 2021 at 02:21:47PM +0200, Vincent Guittot wrote:
> > > > Peter, Valentin, Vincent, Mel, etal
> > > >
> > > > On architectures where we have multiple levels of cache access latencies
> > > > within a DIE, (For example: one within the current LLC or SMT core and the
> > > > other at MC or Hemisphere, and finally across hemispheres), do you have any
> > > > suggestions on how we could handle the same in the core scheduler?
> > I would say that SD_SHARE_PKG_RESOURCES is there for that and doesn't
> > only rely on cache
> From topology.c
> SD_SHARE_PKG_RESOURCES - describes shared caches
> I'm guessing here because I am not familiar with power10 but the central
> problem appears to be when to prefer selecting a CPU sharing L2 or L3
> cache and the core assumes the last-level-cache is the only relevant one.
> For this patch, I wondered if setting SD_SHARE_PKG_RESOURCES would have
> unintended consequences for load balancing because load within a die may
> not be spread between SMT4 domains if SD_SHARE_PKG_RESOURCES was set at
> the MC level.
But the SMT4 level is still present here with select_idle_core taking
of the spreading
> > >
> > > Minimally I think it would be worth detecting when there are multiple
> > > LLCs per node and detecting that in generic code as a static branch. In
> > > select_idle_cpu, consider taking two passes -- first on the LLC domain
> > > and if no idle CPU is found then taking a second pass if the search depth
> > We have done a lot of changes to reduce and optimize the fast path and
> > I don't think re adding another layer in the fast path makes sense as
> > you will end up unrolling the for_each_domain behind some
> > static_banches.
> Searching the node would only happen if a) there was enough search depth
> left and b) there were no idle CPUs at the LLC level. As no new domain
> is added, it's not clear to me why for_each_domain would change.
What I mean is that you should directly do for_each_sched_domain in
the fast path because that what you are proposing at the end. It's no
more looks like a fast path but a traditional LB
> But still, your comment reminded me that different architectures have
> different requirements
> Power 10 appears to prefer CPU selection sharing L2 cache but desires
> spillover to L3 when selecting and idle CPU.
> X86 varies, it might want the Power10 approach for some families and prefer
> L3 spilling over to a CPU on the same node in others.
> S390 cares about something called books and drawers although I've no
> what it means as such and whether it has any preferences on
> search order.
> ARM has similar requirements again according to "scheduler: expose the
> topology of clusters and add cluster scheduler" and that one *does*
> add another domain.
> I had forgotten about the ARM patches but remembered that they were
> interesting because they potentially help the Zen situation but I didn't
> get the chance to review them before they fell off my radar again. About
> all I recall is that I thought the "cluster" terminology was vague.
> The only commonality I thought might exist is that architectures may
> like to define what the first domain to search for an idle CPU and a
> second domain. Alternatively, architectures could specify a domain to
> search primarily but also search the next domain in the hierarchy if
> search depth permits. The default would be the existing behaviour --
> search CPUs sharing a last-level-cache.
> > SD_SHARE_PKG_RESOURCES should be set to the last level where we can
> > efficiently move task between CPUs at wakeup
> The definition of "efficiently" varies. Moving tasks between CPUs sharing
> a cache is most efficient but moving the task to a CPU that at least has
> local memory channels is a reasonable option if there are no idle CPUs
> sharing cache and preferable to stacking.
That's why setting SD_SHARE_PKG_RESOURCES for P10 looks fine to me.
This last level of SD_SHARE_PKG_RESOURCES should define the cpumask to
be considered in fast path
> > > allows within the node with the LLC CPUs masked out. While there would be
> > > a latency hit because cache is not shared, it would still be a CPU local
> > > to memory that is idle. That would potentially be beneficial on Zen*
> > > as well without having to introduce new domains in the topology hierarchy.
> > What is the current sched_domain topology description for zen ?
> The cache and NUMA topologies differ slightly between each generation
> of Zen. The common pattern is that a single NUMA node can have multiple
> L3 caches and at one point I thought it might be reasonable to allow
> spillover to select a local idle CPU instead of stacking multiple tasks
> on a CPU sharing cache. I never got as far as thinking how it could be
> done in a way that multiple architectures would be happy with.
> Mel Gorman
> SUSE Labs
next prev parent reply other threads:[~2021-04-13 7:10 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-04-02 5:37 [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain Gautham R. Shenoy
2021-04-02 7:36 ` Gautham R Shenoy
2021-04-12 6:24 ` Srikar Dronamraju
2021-04-12 9:37 ` Mel Gorman
2021-04-12 10:06 ` Valentin Schneider
2021-04-12 10:48 ` Mel Gorman
2021-04-19 6:14 ` Gautham R Shenoy
2021-04-12 12:21 ` Vincent Guittot
2021-04-12 15:24 ` Mel Gorman
2021-04-12 16:33 ` Michal Suchánek
2021-04-14 7:02 ` Gautham R Shenoy
2021-04-13 7:10 ` Vincent Guittot [this message]
2021-04-14 7:00 ` Gautham R Shenoy
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).