From: "Gautham R. Shenoy" <gautham.shenoy@amd.com>
To: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@kernel.org>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Valentin Schneider <valentin.schneider@arm.com>,
	Aubrey Li <aubrey.li@linux.intel.com>,
	Barry Song <song.bao.hua@hisilicon.com>,
	Mike Galbraith <efault@gmx.de>,
	Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
Date: Tue, 21 Dec 2021 20:33:59 +0530
Message-ID: <YcHs37STv71n4erJ@BLR-5CG11610CF.amd.com>
In-Reply-To: <20211220111243.GS3366@techsingularity.net>

On Mon, Dec 20, 2021 at 11:12:43AM +0000, Mel Gorman wrote:
> (sorry for the delay, was offline for a few days)
> 
> On Fri, Dec 17, 2021 at 12:03:06AM +0530, Gautham R. Shenoy wrote:
> > Hello Mel,
> > 
> > On Wed, Dec 15, 2021 at 12:25:50PM +0000, Mel Gorman wrote:
> > > On Wed, Dec 15, 2021 at 05:22:30PM +0530, Gautham R. Shenoy wrote:
> > 
> > [..SNIP..]
> > 
> > > > On a 2 Socket Zen3:
> > > > 
> > > > NPS=1
> > > >    child=MC, llc_weight=16, sd=DIE. sd->span_weight=128 imb=max(2U, (16*16/128) / 4)=2
> > > >    top_p = NUMA, imb_span = 256.
> > > > 
> > > >    NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/256) = 2
> > > > 
> > > > NPS=2
> > > >    child=MC, llc_weight=16, sd=NODE. sd->span_weight=64 imb=max(2U, (16*16/64) / 4) = 2
> > > >    top_p = NUMA, imb_span = 128.
> > > > 
> > > >    NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2
> > > >    NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4
> > > > 
> > > > NPS=4:
> > > >    child=MC, llc_weight=16, sd=NODE. sd->span_weight=32 imb=max(2U, (16*16/32) / 4) = 2
> > > >    top_p = NUMA, imb_span = 128.
> > > > 
> > > >    NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2
> > > >    NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4
> > > > 
> > > > Again, we will be more aggressively load balancing across the two
> > > > sockets in NPS=1 mode compared to NPS=2/4.
> > > > 
> > > 
> > > Yes, but I felt it was reasonable behaviour because we have to strike
> > > some sort of balance between allowing a NUMA imbalance up to a point
> > > to prevent communicating tasks being pulled apart and v3 broke that
> > > completely. There will always be a tradeoff between tasks that want to
> > > remain local to each other and others that prefer to spread as wide as
> > > possible as quickly as possible.
> > 
> > I agree with this argument that we want to be conservative while
> > pulling tasks across NUMA domains. My point was that the threshold at
> > the NUMA domain that spans the 2 sockets is lower for NPS=1
> > (imb_numa_nr = 2) when compared to the threshold for the same NUMA
> > domain when NPS=2/4 (imb_numa_nr = 4).
> > 
> 
> Is that a problem though? On an Intel machine with sub-numa clustering,
> the distances are 11 and 21 for a "node" that is the split cache and the
> remote node respectively.

So my question was: on an Intel machine, with sub-NUMA clustering
enabled vs disabled, is the value of imb_numa_nr for the NUMA domain
which spans the remote nodes (distance=21) the same or different? And
if it is different, what is the rationale behind that? I am fully on
board with the idea that the imb_numa_nr should be different for the
different NUMA levels.

In case I was not making myself clear earlier: on Zen3, the NUMA-A
sched-domain in the figures below has groups where each group spans a
socket in all the NPS configurations. However, only on NPS=1 do we
have sd->imb_numa_nr=2 for NUMA-A, while on NPS=2/4 the value of
sd->imb_numa_nr is 4 for the NUMA-A domain. Thus, if we had 4 tasks
sharing data, on NPS=2/4 they would all reside on the same socket,
while on NPS=1 we would have 2 tasks on one socket and the other 2
migrated to the other socket.

That said, I have not been able to observe any significant difference
with a real-world workload like MongoDB run on NPS=1 with imb_numa_nr
set to 2 vs 4.
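
For completeness, here is a minimal userspace sketch (not kernel code;
it merely replays the v4.1 formula quoted above with the 2-socket Zen3
numbers) that reproduces these thresholds:

#include <stdio.h>

/* Replays the v4.1 calculation quoted above:
 * imb = max(2U, (llc_weight * llc_weight / llc_sd_span) / 4), then
 * imb_numa_nr = imb * (numa_span / imb_span) at each NUMA level. */
static unsigned int imb_base(unsigned int llc, unsigned int llc_sd_span)
{
	unsigned int imb = (llc * llc / llc_sd_span) / 4;

	return imb < 2 ? 2 : imb;
}

int main(void)
{
	/* llc_weight, span of the domain above the LLC, imb_span and the
	 * spans of the NUMA levels for a 2-socket Zen3 (256 CPUs). */
	struct cfg {
		const char *name;
		unsigned int llc, llc_sd_span, imb_span, numa_spans[2];
	} cfgs[] = {
		{ "NPS=1", 16, 128, 256, { 256, 0 } },
		{ "NPS=2", 16,  64, 128, { 128, 256 } },
		{ "NPS=4", 16,  32, 128, { 128, 256 } },
	};

	for (int i = 0; i < 3; i++) {
		unsigned int imb = imb_base(cfgs[i].llc, cfgs[i].llc_sd_span);

		for (int j = 0; j < 2 && cfgs[i].numa_spans[j]; j++)
			printf("%s: NUMA span=%u imb_numa_nr=%u\n",
			       cfgs[i].name, cfgs[i].numa_spans[j],
			       imb * (cfgs[i].numa_spans[j] / cfgs[i].imb_span));
	}
	return 0;
}

This prints imb_numa_nr=2 for the socket-spanning level on NPS=1, but
imb_numa_nr=4 for the same level on NPS=2/4, which is the asymmetry
described above.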



Zen3, NPS=1
------------------------------------------------------------------
|                                                                |
|  NUMA-A : sd->imb_numa_nr = 2   			     	 |
|     ------------------------     ------------------------  	 |
|     |DIE                   |     |DIE                   |  	 |
|     |                      |     |                      |  	 |
|     |    ------   ------   |     |    ------   ------   |  	 |
|     |    |MC  |   |MC  |   |     |    |MC  |   |MC  |   |  	 |
|     |    ------   ------   |	   |    ------   ------   |  	 |
|     |    ------   ------   |	   |    ------   ------   |  	 |
|     |    |MC  |   |MC  |   |	   |    |MC  |   |MC  |   |  	 |
|     |    ------   ------   |	   |    ------   ------   |  	 |
|     |                      |	   |                      |  	 |
|     |    ------   ------   |	   |    ------   ------   |  	 |
|     |    |MC  |   |MC  |   |	   |    |MC  |   |MC  |   |  	 |
|     |    ------   ------   |	   |    ------   ------   |  	 |
|     |    ------   ------   |	   |    ------   ------   |  	 |
|     |    |MC  |   |MC  |   |     |    |MC  |   |MC  |   |  	 |
|     |    ------   ------   |	   |    ------   ------   |  	 |
|     |                      |	   |                      |  	 |
|     ------------------------	   ------------------------  	 |
|							     	 |
------------------------------------------------------------------



Zen3, NPS=2
------------------------------------------------------------------
|                                                                |
|  NUMA-A: sd->imb_numa_nr = 4   				 |
|    ---------------------------  --------------------------- 	 |
|    |NUMA-B :sd->imb_numa_nr=2|  |NUMA-B :sd->imb_numa_nr=2|	 |
|    | ----------- ----------- |  | ----------- ----------- |	 |
|    | |NODE     | |NODE     | |  | |NODE     | |NODE     | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  |MC  | | |  |MC  | | |  | |  |MC  | | |  |MC  | | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  |MC  | | |  |MC  | | |  | |  |MC  | | |  |MC  | | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |         | |         | |  | |         | |         | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  |MC  | | |  |MC  | | |  | |  |MC  | | |  |MC  | | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  |MC  | | |  |MC  | | |  | |  |MC  | | |  |MC  | | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | ----------- ----------- |  | ----------- ----------- |	 |
|    |                         |  |                         |	 |
|    ---------------------------  ---------------------------	 |
|							     	 |
------------------------------------------------------------------


Zen3, NPS=4
------------------------------------------------------------------
|                                                                |
|  NUMA-A: sd->imb_numa_nr = 4   				 |
|    ---------------------------  --------------------------- 	 |
|    |NUMA-B :sd->imb_numa_nr=2|  |NUMA-B :sd->imb_numa_nr=2|	 |
|    | ----------- ----------- |  | ----------- ----------- |	 |
|    | |NODE     | |NODE     | |  | |NODE     | |NODE     | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  |MC  | | |  |MC  | | |  | |  |MC  | | |  |MC  | | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  |MC  | | |  |MC  | | |  | |  |MC  | | |  |MC  | | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | ----------- ----------- |  | ----------- ----------- |    |
|    | ----------- ----------- |  | ----------- ----------- |    |
|    | |NODE     | |NODE     | |  | |NODE     | |NODE     | |    |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  |MC  | | |  |MC  | | |  | |  |MC  | | |  |MC  | | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  |MC  | | |  |MC  | | |  | |  |MC  | | |  |MC  | | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | ----------- ----------- |  | ----------- ----------- |	 |
|    |                         |  |                         |	 |
|    ---------------------------  ---------------------------	 |
|							     	 |
------------------------------------------------------------------




> > Irrespective of what NPS mode we are operating in, the NUMA distance
> > between the two sockets is 32 on Zen3 systems. Hence shouldn't the
> > thresholds be the same for that level of NUMA? 
> > 
> 
> Maybe, but then it is not a property of the sched_domain and instead
> needs to account for distance when balancing between two nodes that may
> be varying distances from each other.
> 
> > Would something like the following work ?
> > 
> > if (sd->flags & SD_NUMA) {
> > 
> >    /* We are using the child as a proxy for the group. */
> >    group_span = sd->child->span_weight;
> >    sd_distance = /* NUMA distance at this sd level */
> > 
> 
> NUMA distance relative to what? On Zen, the distance to a remote node may
> be fixed but on topologies with multiple nodes that are not fully linked
> to every other node by one hop, the same is not true.

Fair enough. The "sd_distance" I was referring to the node_distance()
between the CPUs of any two groups in this NUMA domain. However, this
was assuming that the node_distance() between the CPUs of any two
groups would be the same, which is not the case for certain
platforms. So this wouldn't work.
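
As a purely hypothetical illustration of why (the distance table below
is made up, not taken from any real machine):

/* Hypothetical SLIT for a 4-node topology that is not fully connected
 * by one hop: */
static const int hypothetical_distance[4][4] = {
	{ 10, 12, 20, 22 },
	{ 12, 10, 22, 20 },
	{ 20, 22, 10, 12 },
	{ 22, 20, 12, 10 },
};
/* If the groups at the top NUMA level are {0,1} and {2,3}, the
 * node_distance() between the CPUs of those two groups is 20 or 22
 * depending on which nodes the CPUs sit on, so a single "sd_distance"
 * for that sd level is not well defined. */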



> 
> >    /* By default we set the threshold to 1/4th the sched-group span. */
> >    imb_numa_shift = 2;
> > 
> >    /*
> >     * We can be a little aggressive if the cost of migrating tasks
> >     * across groups of this NUMA level is not high.
> >     * Assuming 
> >     */
> >    
> >    if (sd_distance < REMOTE_DISTANCE)
> >       imb_numa_shift++;
> > 
> 
> The distance would have to be accounted for elsewhere because here we
> are considering one node in isolation, not relative to other nodes.
> 
> >    /*
> >     * Compute the number of LLCs in each group.
> >     * More the LLCs, more aggressively we migrate across
> >     * the groups at this NUMA sd.
> >     */
> >     nr_llcs = group_span/llc_size;
> > 
> >     sd->imb_numa_nr = max(2U, (group_span / nr_llcs) >> imb_numa_shift);
> > }
> > 
> 
> Same, any adjustment would have to happen during load balancing taking
> into account the relatively NUMA distances. I'm not necessarily opposed
> but it would be a separate patch.


Sure, we can look into this separately.


> 
> > > > <SNIP>
> > > > If we retain the (2,4) thresholds from v4.1 but use them in
> > > > allow_numa_imbalance() as in v3 we get
> > > > 
> > > > NPS=4
> > > > Test:	 mel-v4.2
> > > >  Copy:	 225860.12 (498.11%)
> > > > Scale:	 227869.07 (572.58%)
> > > >   Add:	 278365.58 (624.93%)
> > > > Triad:	 264315.44 (596.62%)
> > > > 
> > > 
> > > The potential problem with this is that it probably will work for
> > > netperf when it's a single communicating pair but may not work as well
> > > when there are multiple communicating pairs or a number of communicating
> > > tasks that exceed numa_imb_nr.
> > 
> > 
> > Yes that's true. I think what you are doing in v4 is the right thing.
> > 
> > In case of stream in NPS=4, it just manages to hit the corner case for
> > this heuristic which results in a suboptimal behaviour. Description
> > follows:
> > 
> 
> To avoid the corner case, we'd need to explicitly favour spreading early
> and assume wakeup will pull communicating tasks together and NUMA
> balancing migrate the data after some time which looks like


Actually, I was able to root-cause the drop in stream performance on
NPS=4. I have already responded earlier in another thread:
https://lore.kernel.org/lkml/Ybzq%2FA+HS%2FGxGYha@BLR-5CG11610CF.amd.com/
I am appending the patch here:


---
 kernel/sched/fair.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ec354bf88b0d..c1b2a422a877 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9191,13 +9191,14 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 				return idlest;
 #endif
 			/*
-			 * Otherwise, keep the task on this node to stay local
-			 * to its wakeup source if the number of running tasks
-			 * are below the allowed imbalance. If there is a real
-			 * need of migration, periodic load balance will take
-			 * care of it.
+			 * Otherwise, keep the task on this node to
+			 * stay local to its wakeup source if the
+			 * number of running tasks (including the new
+			 * one) are below the allowed imbalance. If
+			 * there is a real need of migration, periodic
+			 * load balance will take care of it.
 			 */
-			if (local_sgs.sum_nr_running <= sd->imb_numa_nr)
+			if (local_sgs.sum_nr_running + 1 <= sd->imb_numa_nr)
 				return NULL;
 		}
 
-- 
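
To spell out what the "+1" changes, here is a toy userspace sketch (not
kernel code), using imb_numa_nr=2 as in the figures above:

#include <stdio.h>

static int stays_local_old(int sum_nr_running, int imb_numa_nr)
{
	/* Pre-fix: count only the tasks already running locally. */
	return sum_nr_running <= imb_numa_nr;
}

static int stays_local_new(int sum_nr_running, int imb_numa_nr)
{
	/* Patched: also count the task that is about to be placed. */
	return sum_nr_running + 1 <= imb_numa_nr;
}

int main(void)
{
	int imb_numa_nr = 2;	/* e.g. the imb_numa_nr=2 case above */

	for (int running = 0; running <= 3; running++)
		printf("local running=%d  old stays local=%d  new stays local=%d\n",
		       running,
		       stays_local_old(running, imb_numa_nr),
		       stays_local_new(running, imb_numa_nr));
	return 0;
}

With 2 tasks already running locally, the old check (2 <= 2) still
returns NULL and a third task piles onto the local node, whereas with
the patched check (2 + 1 <= 2 is false) find_idlest_group() is allowed
to spread it, which is what stream needs on NPS=4.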

With this fix on top of your fix to compute the imb_numa_nr at the
relevant level (v4.1:
https://lore.kernel.org/lkml/20211213130131.GQ3366@techsingularity.net/),
the stream regression for NPS=4 is no longer there.


Test:	tip-core	    v4		       v4.1		    v4.1-find_idlest_group_fix
 Copy:	 37762.14 (0.00%)   48748.60 (29.09%)  164765.83 (336.32%)  205963.99 (445.42%)
Scale:	 33879.66 (0.00%)   48317.66 (42.61%)  159641.67 (371.20%)  218136.57 (543.85%)
  Add:	 38398.90 (0.00%)   54259.56 (41.30%)  184583.70 (380.70%)  257857.90 (571.52%)
Triad:	 37942.38 (0.00%)   54503.74 (43.64%)  181250.80 (377.70%)  251410.28 (562.61%)

--
Thanks and Regards
gautham.



