From: Peter Zijlstra <peterz@infradead.org>
To: "Gautham R. Shenoy" <gautham.shenoy@amd.com>
Cc: Mel Gorman <mgorman@techsingularity.net>,
	Ingo Molnar <mingo@kernel.org>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Valentin Schneider <valentin.schneider@arm.com>,
	Aubrey Li <aubrey.li@linux.intel.com>,
	Barry Song <song.bao.hua@hisilicon.com>,
	Mike Galbraith <efault@gmx.de>,
	Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
	LKML <linux-kernel@vger.kernel.org>,
	Sadagopan Srinivasan <Sadagopan.Srinivasan@amd.com>,
	Krupa Ramakrishnan <Krupa.Ramakrishnan@amd.com>
Subject: Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
Date: Mon, 6 Dec 2021 15:51:58 +0100
Message-ID: <Ya4jjgnejp3XP4yi@hirez.programming.kicks-ass.net>
In-Reply-To: <Ya3OVWUftTL5a2C6@BLR-5CG11610CF.amd.com>

On Mon, Dec 06, 2021 at 02:18:21PM +0530, Gautham R. Shenoy wrote:
> On Sat, Dec 04, 2021 at 11:40:56AM +0100, Peter Zijlstra wrote:
> > On Wed, Dec 01, 2021 at 03:18:44PM +0000, Mel Gorman wrote:
> > > +	/* Calculate allowed NUMA imbalance */
> > > +	for_each_cpu(i, cpu_map) {
> > > +		int imb_numa_nr = 0;
> > > +
> > > +		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> > > +			struct sched_domain *child = sd->child;
> > > +
> > > +			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> > > +			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
> > > +				int nr_groups;
> > > +
> > > +				nr_groups = sd->span_weight / child->span_weight;
> > > +				imb_numa_nr = max(1U, ((child->span_weight) >> 1) /
> > > +						(nr_groups * num_online_nodes()));
> > > +			}
> > > +
> > > +			sd->imb_numa_nr = imb_numa_nr;
> > > +		}
> > 
> > OK, so let's see. All domains with SHARE_PKG_RESOURCES set will have
> > imb_numa_nr = 0, all domains above it will have the same value
> > calculated here.
> > 
> > So far so good I suppose :-)
> 
> Well, we will still have the same imb_numa_nr set for different NUMA
> domains which have different distances!

Fair enough; that would need making the computation depend on more
things, but that shouldn't be too hard.

> > Then nr_groups is what it says on the tin; we could've equally well
> > iterated sd->groups and gotten the same number, but this is simpler.
> > 
> > Now, imb_numa_nr is where the magic happens; the way it's written
> > doesn't help, but it's something like:
> > 
> > 	(child->span_weight / 2) / (nr_groups * num_online_nodes())
> > 
> > With a minimum value of 1. So the larger the system is, or the smaller
> > the LLCs, the smaller this number gets, right?
> > 
> > So my ivb-ep that has 20 cpus in a LLC and 2 nodes, will get: (20 / 2)
> > / (1 * 2) = 5, while the ivb-ex will get: (20/2) / (1*4) = 2.
> > 
> > But a Zen box that has only like 4 CPUs per LLC will have 1, regardless
> > of how many nodes it has.
> 
> That's correct. On a Zen3 box with 2 sockets with 64 cores per
> sockets, we can configure it with either 1/2/4 Nodes Per Socket
> (NPS). The imb_numa_nr value for each of the NPS configurations is as
> follows:

Cute; that's similar to the whole Intel sub-numa-cluster stuff then;
perhaps update the comment that goes with x86_has_numa_in_package?
Currently that only mentions AMD Magny-Cours, which is a few generations
old by now.
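
Something like this, perhaps; x86_has_numa_in_package itself lives in
arch/x86/kernel/smpboot.c, but the wording below is only a suggestion:

	/*
	 * Set when the NUMA topology puts multiple nodes inside one
	 * physical package: AMD Magny-Cours, AMD Zen with NPS2/NPS4,
	 * Intel Sub-NUMA Clustering.
	 */
	static bool x86_has_numa_in_package;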


> NPS 4:
> ~~~~~~~
> SMT [span_wt=2]
>    --> MC [span_wt=16, LLC]
>        --> NODE [span_wt=32]
>            --> NUMA [span_wt=128, SD_NUMA]
>                --> NUMA [span_wt=256, SD_NUMA]

OK, so at max nodes you still have at least 2 LLCs per node.

> While the imb_numa_nr = 1 is good for the NUMA domain within a socket
> (the lower NUMA domains in NPS2 and NPS4 modes), it appears to be a
> little bit aggressive for the NUMA domain spanning the two sockets. If
> we have only a pair of communicating tasks in a socket, we will end up
> spreading them across the two sockets with this patch.
> 
> > 
> > Now, I'm thinking this assumes (fairly reasonably) that the level above
> > LLC is a node, but I don't think we need to assume this, while also not
> > assuming the balance domain spans the whole machine (yay partitions!).
> > 
> > 	for (top = sd; top->parent; top = top->parent)
> > 		;
> > 
> > 	nr_llcs = top->span_weight / child->span_weight;
> > 	imb_numa_nr = max(1, child->span_weight / nr_llcs);
> > 
> > which for my ivb-ep gets me:  20 / (40 / 20) = 10
> > and the Zen system will have:  4 / (huge number) = 1
> > 
> > Now, the exp: a / (b / a) is equivalent to a * (a / b) or a^2/b, so we
> > can also write the above as:
> > 
> > 	(child->span_weight * child->span_weight) / top->span_weight;
> 
> 
> Assuming that "child" here refers to the LLC domain, on Zen3 we would have
> (a) child->span_weight = 16. (b) top->span_weight = 256.
> 
> So we get a^2/b = 1.

Yes, it would be in the same place as the current imb_numa_nr
calculation, so child would be the largest domain having
SHARE_PKG_RESOURCES, aka. LLC.
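
To sanity-check that a^2/b form against the machines above, a quick
userspace model (plain ints, numbers taken from this thread; just a
sketch, not kernel code):

#include <stdio.h>

/* imb = max(1, llc^2 / top): the a^2/b expression from above */
static unsigned int imb(unsigned int llc, unsigned int top)
{
	unsigned int v = (llc * llc) / top;
	return v ? v : 1;
}

int main(void)
{
	printf("ivb-ep: %u\n", imb(20, 40));	/* 20*20/40  = 10 */
	printf("zen3:   %u\n", imb(16, 256));	/* 16*16/256 = 1  */
	return 0;
}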

> Last week, I tried a modification on top of Mel's current patch where
> we spread tasks between the LLCs of the groups within each NUMA domain
> and compute the value of imb_numa_nr per NUMA domain. The idea is to set
> 
>     sd->imb_numa_nr = min(1U,
>     		         (Number of LLCs in each sd group / Number of sd groups))

s/min/max/

Which is basically something like:

for_each (sd in NUMA):
  llc_per_group = child->span / llc->span;
  nr_group = sd->span / child->span;
  imb = max(1, llc_per_group / nr_group);
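
Modelled in userspace against the Zen3 span weights quoted in this
thread (LLC=16), that reproduces the per-NUMA-domain values you list
below; again only a sketch with plain ints:

#include <stdio.h>

/* max(1, llcs-per-group / nr-groups) for one NUMA domain */
static unsigned int imb(unsigned int sd_span, unsigned int child_span,
			unsigned int llc_span)
{
	unsigned int llc_per_group = child_span / llc_span;
	unsigned int nr_groups = sd_span / child_span;
	unsigned int v = llc_per_group / nr_groups;
	return v ? v : 1;
}

int main(void)
{
	printf("NPS1 NUMA:     %u\n", imb(256, 128, 16));	/* 8/2 = 4 */
	printf("NPS4 1st NUMA: %u\n", imb(128,  32, 16));	/* max(1, 2/4) = 1 */
	printf("NPS4 2nd NUMA: %u\n", imb(256, 128, 16));	/* 8/2 = 4 */
	return 0;
}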


> This won't work for processors which have a single LLC in a socket,
> since the sd->imb_numa_nr will be 1, which is probably too low.

Right.

> FWIW,
> with this heuristic, the imb_numa_nr across the different NPS
> configurations of a Zen3 server is as follows
> 
> NPS1:
>     NUMA domain: nr_llcs_per_group = 8. nr_groups = 2. imb_numa_nr = 4
> 
> NPS2:
>     1st NUMA domain: nr_llcs_per_group = 4. nr_groups = 2. imb_numa_nr = 2.
>     2nd NUMA domain: nr_llcs_per_group = 8. nr_groups = 2. imb_numa_nr = 4.
> 
> NPS4:
>     1st NUMA domain: nr_llcs_per_group = 2. nr_groups = 4. imb_numa_nr = min(1, 2/4) = 1.
>     2nd NUMA domain: nr_llcs_per_group = 8. nr_groups = 2. imb_numa_nr = 4.
> 
> Thus, at the highest NUMA level (socket), we don't spread across the
> two sockets until there are 4 tasks within the socket. If there is
> only a pair of communicating tasks in the socket, they will be left
> alone within that socket. 

Something that might work:

imb = 0;
imb_span = 1;

for_each_sd(sd) {
	struct sched_domain *child = sd->child;

	if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
	    (child->flags & SD_SHARE_PKG_RESOURCES)) {

		/* First domain above the LLC: establish the base value. */
		imb = /* initial magic */;
		imb_span = sd->span_weight;
		sd->imb_numa_nr = imb;

	} else if (imb) {
		/* Larger domains: scale by how many boundary spans fit. */
		sd->imb_numa_nr = imb * (sd->span_weight / imb_span);
	}
}

Where we calculate the initial imbalance for the LLC boundary, and then
increase that for subsequent domains based on how often that boundary sd
fits in it. That gives the same progression you have, but also works for
NODE==LLC I think.
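
For the NPS4 topology quoted above, where NODE (span 32) is the first
domain without SD_SHARE_PKG_RESOURCES, that scaling works out as in this
little trace; the initial magic is left as a placeholder of 1 since it
isn't pinned down yet:

#include <stdio.h>

int main(void)
{
	unsigned int spans[] = { 32, 128, 256 };	/* NODE, NUMA, NUMA */
	unsigned int imb = 1;		/* placeholder for the initial magic */
	unsigned int imb_span = spans[0];
	int i;

	printf("NODE(%u):  imb = %u\n", spans[0], imb);
	for (i = 1; i < 3; i++)
		printf("NUMA(%u): imb = %u\n", spans[i],
		       imb * (spans[i] / imb_span));
	return 0;
}

i.e. the allowed imbalance grows with how many times the boundary domain
fits: x4 at the intra-socket NUMA level, x8 across both sockets.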
