Re: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain

From: Gautham R Shenoy <ego@linux.vnet.ibm.com>
To: "Michal Suchánek" <msuchanek@suse.de>
Cc: Mel Gorman <mgorman@techsingularity.net>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	"Gautham R. Shenoy" <ego@linux.vnet.ibm.com>,
	Michael Neuling <mikey@neuling.org>,
	Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>,
	Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
	Rik van Riel <riel@surriel.com>,
	LKML <linux-kernel@vger.kernel.org>,
	Nicholas Piggin <npiggin@gmail.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Parth Shah <parth@linux.ibm.com>,
	linuxppc-dev@lists.ozlabs.org,
	Valentin Schneider <valentin.schneider@arm.com>
Subject: Re: [RFC/PATCH] powerpc/smp: Add SD_SHARE_PKG_RESOURCES flag to MC sched-domain
Date: Wed, 14 Apr 2021 12:32:46 +0530
Message-ID: <20210414070246.GB13782@in.ibm.com>
In-Reply-To: <20210412163355.GV6564@kitsune.suse.cz>

On Mon, Apr 12, 2021 at 06:33:55PM +0200, Michal Suchánek wrote:
> On Mon, Apr 12, 2021 at 04:24:44PM +0100, Mel Gorman wrote:
> > On Mon, Apr 12, 2021 at 02:21:47PM +0200, Vincent Guittot wrote:
> > > > > Peter, Valentin, Vincent, Mel, etal
> > > > >
> > > > > On architectures where we have multiple levels of cache access latencies
> > > > > within a DIE, (For example: one within the current LLC or SMT core and the
> > > > > other at MC or Hemisphere, and finally across hemispheres), do you have any
> > > > > suggestions on how we could handle the same in the core scheduler?
> > >
> > > I would say that SD_SHARE_PKG_RESOURCES is there for that and doesn't
> > > only rely on cache
> > >
> >
> > From topology.c
> >
> > 	SD_SHARE_PKG_RESOURCES - describes shared caches
> >
> > I'm guessing here because I am not familiar with power10 but the central
> > problem appears to be when to prefer selecting a CPU sharing L2 or L3
> > cache and the core assumes the last-level-cache is the only relevant one.
> 
> It does not seem to be the case according to original description:
> 
> >>>> When the scheduler tries to wakeup a task, it chooses between the
> >>>> waker-CPU and the wakee's previous-CPU. Suppose this choice is called
> >>>> the "target", then in the target's LLC domain, the scheduler
> >>>> 
> >>>> a) tries to find an idle core in the LLC. This helps exploit the
> This is the same as (b) Should this be SMT^^^ ?

On POWER10, without this patch, the LLC is at the SMT sched-domain.
The difference between a) and b) is that a) searches for an idle
core, while b) searches for an idle CPU.

> >>>>    SMT folding that the wakee task can benefit from. If an idle
> >>>>    core is found, the wakee is woken up on it.
> >>>> 
> >>>> b) Failing to find an idle core, the scheduler tries to find an idle
> >>>>    CPU in the LLC. This helps minimise the wakeup latency for the
> >>>>    wakee since it gets to run on the CPU immediately.
> >>>> 
> >>>> c) Failing this, it will wake it up on target CPU.
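
To make steps a) to c) concrete, here is a toy model in Python. It is
only a sketch and not kernel code: the function name, the list-of-cores
layout and the "busy" map are made up for illustration, and it ignores
everything the real wakeup path does beyond these three steps.

```python
# Toy model of the wakeup search in steps a) to c); not kernel code.
# All names here are hypothetical and exist only for this illustration.

def select_wakeup_cpu(target, llc_cores, busy):
    """Pick a CPU for the wakee, searching only the target's LLC domain.

    target:    CPU chosen between the waker-CPU and the previous-CPU.
    llc_cores: cores in the target's LLC domain, each a list of CPU ids.
    busy:      dict mapping CPU id -> True if that CPU is not idle.
    """
    # a) Prefer a fully idle core (every SMT thread in it is idle).
    for core in llc_cores:
        if all(not busy[cpu] for cpu in core):
            return core[0]

    # b) Otherwise settle for any idle CPU in the LLC.
    for core in llc_cores:
        for cpu in core:
            if not busy[cpu]:
                return cpu

    # c) Nothing idle in the LLC: wake up on the target itself.
    return target


# CPUs 0-3 (one SMT4 core) are busy, CPUs 4-7 (its neighbour) are idle.
busy = {cpu: cpu < 4 for cpu in range(8)}

llc_two_cores = [[0, 1, 2, 3], [4, 5, 6, 7]]  # LLC spans two SMT4 cores
llc_one_core = [[0, 1, 2, 3]]                 # LLC is a single SMT4 core

print(select_wakeup_cpu(0, llc_two_cores, busy))  # finds the idle core
print(select_wakeup_cpu(0, llc_one_core, busy))   # falls back to target
```

With the LLC spanning two cores the search reaches the idle neighbour;
with the LLC limited to one core the idle neighbour is never looked at
and the wakee ends up on the busy target.
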
> >>>> 
> >>>> Thus, with P9-sched topology, since the CACHE domain comprises of two
> >>>> SMT4 cores, there is a decent chance that we get an idle core, failing
> >>>> which there is a relatively higher probability of finding an idle CPU
> >>>> among the 8 threads in the domain.
> >>>> 
> >>>> However, in P10-sched topology, since the SMT domain is the LLC and it
> >>>> contains only a single SMT4 core, the probability that we find that
> >>>> core to be idle is less. Furthermore, since there are only 4 CPUs to
> >>>> search for an idle CPU, there is lower probability that we can get an
> >>>> idle CPU to wake up the task on.
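
The probability argument above can be put into rough numbers with a
back-of-the-envelope model. The big assumption, which is mine and not
something measured, is that every CPU is busy independently with the
same probability p_busy. Real workloads will not behave like that, so
treat the output as an illustration of the trend only. The helper names
are hypothetical.

```python
# Rough estimate of how LLC size affects the idle-core / idle-CPU search.
# Assumes each CPU is busy independently with probability p_busy, which
# is a simplification made only for this illustration.

def p_idle_core(p_busy, cores, smt=4):
    """Probability that at least one of `cores` SMT cores is fully idle."""
    p_one_core_idle = (1 - p_busy) ** smt
    return 1 - (1 - p_one_core_idle) ** cores


def p_idle_cpu(p_busy, cores, smt=4):
    """Probability that at least one CPU in the LLC is idle."""
    return 1 - p_busy ** (cores * smt)


for p_busy in (0.25, 0.5, 0.75):
    print(
        f"p_busy={p_busy}: "
        f"idle core: 2 cores {p_idle_core(p_busy, 2):.3f}, "
        f"1 core {p_idle_core(p_busy, 1):.3f}; "
        f"idle CPU: 8 CPUs {p_idle_cpu(p_busy, 2):.3f}, "
        f"4 CPUs {p_idle_cpu(p_busy, 1):.3f}"
    )
```

Under this model both probabilities are always lower for the single
SMT4 core LLC than for the two-core LLC, which matches the reasoning in
the description.
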
> 
> >
> > For this patch, I wondered if setting SD_SHARE_PKG_RESOURCES would have
> > unintended consequences for load balancing because load within a die may
> > not be spread between SMT4 domains if SD_SHARE_PKG_RESOURCES was set at
> > the MC level.
> 
> Not spreading load between SMT4 domains within MC is exactly what setting LLC
> at MC level would address, wouldn't it?
>
> As in on P10 we have two relevant levels but the topology as is describes only
> one, and moving the LLC level lower gives two levels the scheduler looks at
> again. Or am I missing something?

This is my current understanding as well: with this patch we would
be able to move tasks quickly between the SMT4 cores, perhaps at the
expense of some cache affinity. That is why it would be good to
verify this with a test/benchmark.

> 
> Thanks
> 
> Michal
> 

--
Thanks and Regards
gautham.