From: Peter Zijlstra <peterz@infradead.org>
To: "Gautham R. Shenoy" <gautham.shenoy@amd.com>
Cc: Mel Gorman <mgorman@techsingularity.net>,
	Ingo Molnar <mingo@kernel.org>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Valentin Schneider <valentin.schneider@arm.com>,
	Aubrey Li <aubrey.li@linux.intel.com>,
	Barry Song <song.bao.hua@hisilicon.com>,
	Mike Galbraith <efault@gmx.de>,
	Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
	LKML <linux-kernel@vger.kernel.org>,
	Sadagopan Srinivasan <Sadagopan.Srinivasan@amd.com>,
	Krupa Ramakrishnan <Krupa.Ramakrishnan@amd.com>
Subject: Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
Date: Mon, 6 Dec 2021 15:51:58 +0100
Message-ID: <Ya4jjgnejp3XP4yi@hirez.programming.kicks-ass.net>
In-Reply-To: <Ya3OVWUftTL5a2C6@BLR-5CG11610CF.amd.com>

On Mon, Dec 06, 2021 at 02:18:21PM +0530, Gautham R. Shenoy wrote:
> On Sat, Dec 04, 2021 at 11:40:56AM +0100, Peter Zijlstra wrote:
> > On Wed, Dec 01, 2021 at 03:18:44PM +0000, Mel Gorman wrote:
> > > +	/* Calculate allowed NUMA imbalance */
> > > +	for_each_cpu(i, cpu_map) {
> > > +		int imb_numa_nr = 0;
> > > +
> > > +		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> > > +			struct sched_domain *child = sd->child;
> > > +
> > > +			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> > > +			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
> > > +				int nr_groups;
> > > +
> > > +				nr_groups = sd->span_weight / child->span_weight;
> > > +				imb_numa_nr = max(1U, ((child->span_weight) >> 1) /
> > > +						(nr_groups * num_online_nodes()));
> > > +			}
> > > +
> > > +			sd->imb_numa_nr = imb_numa_nr;
> > > +		}
> > 
> > OK, so let's see. All domains with SHARE_PKG_RESOURCES set will have
> > imb_numa_nr = 0, all domains above it will have the same value
> > calculated here.
> > 
> > So far so good I suppose :-)
> 
> Well, we will still have the same imb_numa_nr set for different NUMA
> domains which have different distances!

Fair enough; that would need making the computation depend on more
things, but that shouldn't be too hard.

> > Then nr_groups is what it says on the tin; we could've equally well
> > iterated sd->groups and gotten the same number, but this is simpler.
> > 
> > Now, imb_numa_nr is where the magic happens; the way it's written
> > doesn't help, but it's something like:
> > 
> > 	(child->span_weight / 2) / (nr_groups * num_online_nodes())
> > 
> > With a minimum value of 1. So the larger the system is, or the smaller
> > the LLCs, the smaller this number gets, right?
> > 
> > So my ivb-ep that has 20 cpus in a LLC and 2 nodes, will get: (20 / 2)
> > / (1 * 2) = 5, while the ivb-ex will get: (20/2) / (1*4) = 2.
> > 
> > But a Zen box that has only like 4 CPUs per LLC will have 1, regardless
> > of how many nodes it has.
> 
> That's correct. On a Zen3 box with 2 sockets with 64 cores per
> sockets, we can configure it with either 1/2/4 Nodes Per Socket
> (NPS). The imb_numa_nr value for each of the NPS configurations is as
> follows:

Cute; that's similar to the whole Intel sub-numa-cluster stuff then;
perhaps update the comment that goes with x86_has_numa_in_package?
Currently that only mentions AMD Magny-Cours, which is a few generations
old by now.
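
Something like this, perhaps; x86_has_numa_in_package itself lives in
arch/x86/kernel/smpboot.c, but the wording below is only a suggestion:

	/*
	 * Set when the NUMA topology puts multiple nodes inside one
	 * physical package: AMD Magny-Cours, AMD Zen with NPS2/NPS4,
	 * Intel Sub-NUMA Clustering.
	 */
	static bool x86_has_numa_in_package;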


> NPS 4:
> ~~~~~~~
> SMT [span_wt=2]
>    --> MC [span_wt=16, LLC]
>        --> NODE [span_wt=32]
>            --> NUMA [span_wt=128, SD_NUMA]
>                --> NUMA [span_wt=256, SD_NUMA]

OK, so at max nodes you still have at least 2 LLCs per node.

> While the imb_numa_nr = 1 is good for the NUMA domain within a socket
> (the lower NUMA domains in NPS2 and NPS4 modes), it appears to be a
> little bit aggressive for the NUMA domain spanning the two sockets. If
> we have only a pair of communicating tasks in a socket, we will end up
> spreading them across the two sockets with this patch.
> 
> > 
> > Now, I'm thinking this assumes (fairly reasonably) that the level above
> > LLC is a node, but I don't think we need to assume this, while also not
> > assuming the balance domain spans the whole machine (yay partitions!).
> > 
> > 	for (top = sd; top->parent; top = top->parent)
> > 		;
> > 
> > 	nr_llcs = top->span_weight / child->span_weight;
> > 	imb_numa_nr = max(1, child->span_weight / nr_llcs);
> > 
> > which for my ivb-ep gets me:  20 / (40 / 20) = 10
> > and the Zen system will have:  4 / (huge number) = 1
> > 
> > Now, the exp: a / (b / a) is equivalent to a * (a / b) or a^2/b, so we
> > can also write the above as:
> > 
> > 	(child->span_weight * child->span_weight) / top->span_weight;
> 
> 
> Assuming that "child" here refers to the LLC domain, on Zen3 we would have
> (a) child->span_weight = 16. (b) top->span_weight = 256.
> 
> So we get a^2/b = 1.

Yes, it would be in the same place as the current imb_numa_nr
calculation, so child would be the largest domain having
SHARE_PKG_RESOURCES, aka. LLC.
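
To sanity-check that a^2/b form against the machines above, a quick
userspace model (plain ints, numbers taken from this thread; just a
sketch, not kernel code):

#include <stdio.h>

/* imb = max(1, llc^2 / top): the a^2/b expression from above */
static unsigned int imb(unsigned int llc, unsigned int top)
{
	unsigned int v = (llc * llc) / top;
	return v ? v : 1;
}

int main(void)
{
	printf("ivb-ep: %u\n", imb(20, 40));	/* 20*20/40  = 10 */
	printf("zen3:   %u\n", imb(16, 256));	/* 16*16/256 = 1  */
	return 0;
}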

> Last week, I tried a modification on top of Mel's current patch where
> we spread tasks between the LLCs of the groups within each NUMA domain
> and compute the value of imb_numa_nr per NUMA domain. The idea is to set
> 
>     sd->imb_numa_nr = min(1U,
>     		         (Number of LLCs in each sd group / Number of sd groups))

s/min/max/

Which is basically something like:

for_each (sd in NUMA):
  llc_per_group = child->span / llc->span;
  nr_group = sd->span / child->span;
  imb = max(1, llc_per_group / nr_group);
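
Modelled in userspace against the Zen3 span weights quoted in this
thread (LLC=16), that reproduces the per-NUMA-domain values you list
below; again only a sketch with plain ints:

#include <stdio.h>

/* max(1, llcs-per-group / nr-groups) for one NUMA domain */
static unsigned int imb(unsigned int sd_span, unsigned int child_span,
			unsigned int llc_span)
{
	unsigned int llc_per_group = child_span / llc_span;
	unsigned int nr_groups = sd_span / child_span;
	unsigned int v = llc_per_group / nr_groups;
	return v ? v : 1;
}

int main(void)
{
	printf("NPS1 NUMA:     %u\n", imb(256, 128, 16));	/* 8/2 = 4 */
	printf("NPS4 1st NUMA: %u\n", imb(128,  32, 16));	/* max(1, 2/4) = 1 */
	printf("NPS4 2nd NUMA: %u\n", imb(256, 128, 16));	/* 8/2 = 4 */
	return 0;
}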


> This won't work for processors which have a single LLC in a socket,
> since the sd->imb_numa_nr will be 1, which is probably too low.

Right.

> FWIW,
> with this heuristic, the imb_numa_nr across the different NPS
> configurations of a Zen3 server is as follows
> 
> NPS1:
>     NUMA domain: nr_llcs_per_group = 8. nr_groups = 2. imb_numa_nr = 4
> 
> NPS2:
>     1st NUMA domain: nr_llcs_per_group = 4. nr_groups = 2. imb_numa_nr = 2.
>     2nd NUMA domain: nr_llcs_per_group = 8. nr_groups = 2. imb_numa_nr = 4.
> 
> NPS4:
>     1st NUMA domain: nr_llcs_per_group = 2. nr_groups = 4. imb_numa_nr = min(1, 2/4) = 1.
>     2nd NUMA domain: nr_llcs_per_group = 8. nr_groups = 2. imb_numa_nr = 4.
> 
> Thus, at the highest NUMA level (socket), we don't spread across the
> two sockets until there are 4 tasks within the socket. If there is
> only a pair of communicating tasks in the socket, they will be left
> alone within that socket. 

Something that might work:

imb = 0;
imb_span = 1;

for_each_sd(sd) {
	struct sched_domain *child = sd->child;

	if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
	    (child->flags & SD_SHARE_PKG_RESOURCES)) {

		/* First domain above the LLC: establish the base value. */
		imb = /* initial magic */;
		imb_span = sd->span_weight;
		sd->imb_numa_nr = imb;

	} else if (imb) {
		/* Larger domains: scale by how many boundary spans fit. */
		sd->imb_numa_nr = imb * (sd->span_weight / imb_span);
	}
}

Where we calculate the initial imbalance for the LLC boundary, and then
increase that for subsequent domains based on how often that boundary sd
fits in it. That gives the same progression you have, but also works for
NODE==LLC I think.
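
For the NPS4 topology quoted above, where NODE (span 32) is the first
domain without SD_SHARE_PKG_RESOURCES, that scaling works out as in this
little trace; the initial magic is left as a placeholder of 1 since it
isn't pinned down yet:

#include <stdio.h>

int main(void)
{
	unsigned int spans[] = { 32, 128, 256 };	/* NODE, NUMA, NUMA */
	unsigned int imb = 1;		/* placeholder for the initial magic */
	unsigned int imb_span = spans[0];
	int i;

	printf("NODE(%u):  imb = %u\n", spans[0], imb);
	for (i = 1; i < 3; i++)
		printf("NUMA(%u): imb = %u\n", spans[i],
		       imb * (spans[i] / imb_span));
	return 0;
}

i.e. the allowed imbalance grows with how many times the boundary domain
fits: x4 at the intra-socket NUMA level, x8 across both sockets.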
