archive mirror
 help / color / mirror / Atom feed
From: Vincent Guittot <>
To: Mel Gorman <>
Cc: Srikar Dronamraju <>,
	"Gautham R. Shenoy" <>,
	Michael Ellerman <>,
	Michael Neuling <>,
	Rik van Riel <>,
	Valentin Schneider <>,
	Dietmar Eggemann <>,
	Nicholas Piggin <>,
	Anton Blanchard <>,
	Parth Shah <>,
	Vaidyanathan Srinivasan <>,
	LKML <>,
Subject: Re: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain
Date: Tue, 13 Apr 2021 09:10:31 +0200	[thread overview]
Message-ID: <> (raw)
In-Reply-To: <>

On Mon, 12 Apr 2021 at 17:24, Mel Gorman <> wrote:
> On Mon, Apr 12, 2021 at 02:21:47PM +0200, Vincent Guittot wrote:
> > > > Peter, Valentin, Vincent, Mel, etal
> > > >
> > > > On architectures where we have multiple levels of cache access latencies
> > > > within a DIE, (For example: one within the current LLC or SMT core and the
> > > > other at MC or Hemisphere, and finally across hemispheres), do you have any
> > > > suggestions on how we could handle the same in the core scheduler?
> >
> > I would say that SD_SHARE_PKG_RESOURCES is there for that and doesn't
> > only rely on cache
> >
> From topology.c
>         SD_SHARE_PKG_RESOURCES - describes shared caches
> I'm guessing here because I am not familiar with power10 but the central
> problem appears to be when to prefer selecting a CPU sharing L2 or L3
> cache and the core assumes the last-level-cache is the only relevant one.
> For this patch, I wondered if setting SD_SHARE_PKG_RESOURCES would have
> unintended consequences for load balancing because load within a die may
> not be spread between SMT4 domains if SD_SHARE_PKG_RESOURCES was set at
> the MC level.

But the SMT4 level is still present  here with select_idle_core taking
of the spreading

> > >
> > > Minimally I think it would be worth detecting when there are multiple
> > > LLCs per node and detecting that in generic code as a static branch. In
> > > select_idle_cpu, consider taking two passes -- first on the LLC domain
> > > and if no idle CPU is found then taking a second pass if the search depth
> >
> > We have done a lot of changes to reduce and optimize the fast path and
> > I don't think re adding another layer  in the fast path makes sense as
> > you will end up unrolling the for_each_domain behind some
> > static_banches.
> >
> Searching the node would only happen if a) there was enough search depth
> left and b) there were no idle CPUs at the LLC level. As no new domain
> is added, it's not clear to me why for_each_domain would change.

What I mean is that you should directly do for_each_sched_domain in
the fast path because that what you are proposing at the end. It's no
more looks like a fast path but a traditional LB

> But still, your comment reminded me that different architectures have
> different requirements
> Power 10 appears to prefer CPU selection sharing L2 cache but desires
>         spillover to L3 when selecting and idle CPU.
> X86 varies, it might want the Power10 approach for some families and prefer
>         L3 spilling over to a CPU on the same node in others.
> S390 cares about something called books and drawers although I've no
>         what it means as such and whether it has any preferences on
>         search order.
> ARM has similar requirements again according to "scheduler: expose the
>         topology of clusters and add cluster scheduler" and that one *does*
>         add another domain.
> I had forgotten about the ARM patches but remembered that they were
> interesting because they potentially help the Zen situation but I didn't
> get the chance to review them before they fell off my radar again. About
> all I recall is that I thought the "cluster" terminology was vague.
> The only commonality I thought might exist is that architectures may
> like to define what the first domain to search for an idle CPU and a
> second domain. Alternatively, architectures could specify a domain to
> search primarily but also search the next domain in the hierarchy if
> search depth permits. The default would be the existing behaviour --
> search CPUs sharing a last-level-cache.
> > SD_SHARE_PKG_RESOURCES should be set to the last level where we can
> > efficiently move task between CPUs at wakeup
> >
> The definition of "efficiently" varies. Moving tasks between CPUs sharing
> a cache is most efficient but moving the task to a CPU that at least has
> local memory channels is a reasonable option if there are no idle CPUs
> sharing cache and preferable to stacking.

That's why setting SD_SHARE_PKG_RESOURCES for P10 looks fine to me.
This last level of SD_SHARE_PKG_RESOURCES should define the cpumask to
be considered  in fast path

> > > allows within the node with the LLC CPUs masked out. While there would be
> > > a latency hit because cache is not shared, it would still be a CPU local
> > > to memory that is idle. That would potentially be beneficial on Zen*
> > > as well without having to introduce new domains in the topology hierarchy.
> >
> > What is the current sched_domain topology description for zen ?
> >
> The cache and NUMA topologies differ slightly between each generation
> of Zen. The common pattern is that a single NUMA node can have multiple
> L3 caches and at one point I thought it might be reasonable to allow
> spillover to select a local idle CPU instead of stacking multiple tasks
> on a CPU sharing cache. I never got as far as thinking how it could be
> done in a way that multiple architectures would be happy with.
> --
> Mel Gorman
> SUSE Labs

  parent reply	other threads:[~2021-04-13  7:10 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-02  5:37 [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain Gautham R. Shenoy
2021-04-02  7:36 ` Gautham R Shenoy
2021-04-12  6:24 ` Srikar Dronamraju
2021-04-12  9:37   ` Mel Gorman
2021-04-12 10:06     ` Valentin Schneider
2021-04-12 10:48       ` Mel Gorman
2021-04-19  6:14         ` Gautham R Shenoy
2021-04-12 12:21     ` Vincent Guittot
2021-04-12 15:24       ` Mel Gorman
2021-04-12 16:33         ` Michal Suchánek
2021-04-14  7:02           ` Gautham R Shenoy
2021-04-13  7:10         ` Vincent Guittot [this message]
2021-04-14  7:00         ` Gautham R Shenoy

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='' \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).