From: "Gautham R. Shenoy" <gautham.shenoy@amd.com>
To: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@kernel.org>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Valentin Schneider <valentin.schneider@arm.com>,
	Aubrey Li <aubrey.li@linux.intel.com>,
	Barry Song <song.bao.hua@hisilicon.com>,
	Mike Galbraith <efault@gmx.de>,
	Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
Date: Tue, 21 Dec 2021 20:33:59 +0530
Message-ID: <YcHs37STv71n4erJ@BLR-5CG11610CF.amd.com>
In-Reply-To: <20211220111243.GS3366@techsingularity.net>

On Mon, Dec 20, 2021 at 11:12:43AM +0000, Mel Gorman wrote:
> (sorry for the delay, was offline for a few days)
> 
> On Fri, Dec 17, 2021 at 12:03:06AM +0530, Gautham R. Shenoy wrote:
> > Hello Mel,
> > 
> > On Wed, Dec 15, 2021 at 12:25:50PM +0000, Mel Gorman wrote:
> > > On Wed, Dec 15, 2021 at 05:22:30PM +0530, Gautham R. Shenoy wrote:
> > 
> > [..SNIP..]
> > 
> > > > On a 2 Socket Zen3:
> > > > 
> > > > NPS=1
> > > >    child=MC, llc_weight=16, sd=DIE. sd->span_weight=128 imb=max(2U, (16*16/128) / 4)=2
> > > >    top_p = NUMA, imb_span = 256.
> > > > 
> > > >    NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/256) = 2
> > > > 
> > > > NPS=2
> > > >    child=MC, llc_weight=16, sd=NODE. sd->span_weight=64 imb=max(2U, (16*16/64) / 4) = 2
> > > >    top_p = NUMA, imb_span = 128.
> > > > 
> > > >    NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2
> > > >    NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4
> > > > 
> > > > NPS=4:
> > > >    child=MC, llc_weight=16, sd=NODE. sd->span_weight=32 imb=max(2U, (16*16/32) / 4) = 2
> > > >    top_p = NUMA, imb_span = 128.
> > > > 
> > > >    NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2
> > > >    NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4
> > > > 
> > > > Again, we will be more aggressively load balancing across the two
> > > > sockets in NPS=1 mode compared to NPS=2/4.
> > > > 
> > > 
> > > Yes, but I felt it was reasonable behaviour because we have to strike
> > > some sort of balance between allowing a NUMA imbalance up to a point
> > > to prevent communicating tasks being pulled apart and v3 broke that
> > > completely. There will always be a tradeoff between tasks that want to
> > > remain local to each other and others that prefer to spread as wide as
> > > possible as quickly as possible.
> > 
> > I agree with this argument that we want to be conservative while
> > pulling tasks across NUMA domains. My point was that the threshold at
> > the NUMA domain that spans the 2 sockets is lower for NPS=1
> > (imb_numa_nr = 2) when compared to the threshold for the same NUMA
> > domain when NPS=2/4 (imb_numa_nr = 4).
> > 
> 
> Is that a problem though? On an Intel machine with sub-numa clustering,
> the distances are 11 and 21 for a "node" that is the split cache and the
> remote node respectively.

So my question was: on an Intel machine, with sub-NUMA clustering
enabled vs disabled, is the value of imb_numa_nr for the NUMA domain
which spans the remote nodes (distance=21) the same or different? And
if it is different, what is the rationale behind that? I am fully on
board with the idea that the imb_numa_nr should be different for the
different NUMA levels.

In case I was not making myself clear earlier: on Zen3, the NUMA-A
sched-domain in the figures below has groups where each group spans a
socket in all the NPS configurations. However, only on NPS=1 do we
have sd->imb_numa_nr=2 for NUMA-A, while on NPS=2/4 the value of
sd->imb_numa_nr is 4 for the NUMA-A domain. Thus, if we had 4 tasks
sharing data, on NPS=2/4 they would all reside on the same socket,
while on NPS=1 we would have 2 tasks on one socket and the other 2
migrated to the other socket.

That said, I have not been able to observe any significant difference
with a real-world workload like MongoDB run on NPS=1 with imb_numa_nr
set to 2 vs 4.
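
For completeness, here is a minimal userspace sketch (not kernel code;
it merely replays the v4.1 formula quoted above with the 2-socket Zen3
numbers) that reproduces these thresholds:

#include <stdio.h>

/* Replays the v4.1 calculation quoted above:
 * imb = max(2U, (llc_weight * llc_weight / llc_sd_span) / 4), then
 * imb_numa_nr = imb * (numa_span / imb_span) at each NUMA level. */
static unsigned int imb_base(unsigned int llc, unsigned int llc_sd_span)
{
	unsigned int imb = (llc * llc / llc_sd_span) / 4;

	return imb < 2 ? 2 : imb;
}

int main(void)
{
	/* llc_weight, span of the domain above the LLC, imb_span and the
	 * spans of the NUMA levels for a 2-socket Zen3 (256 CPUs). */
	struct cfg {
		const char *name;
		unsigned int llc, llc_sd_span, imb_span, numa_spans[2];
	} cfgs[] = {
		{ "NPS=1", 16, 128, 256, { 256, 0 } },
		{ "NPS=2", 16,  64, 128, { 128, 256 } },
		{ "NPS=4", 16,  32, 128, { 128, 256 } },
	};

	for (int i = 0; i < 3; i++) {
		unsigned int imb = imb_base(cfgs[i].llc, cfgs[i].llc_sd_span);

		for (int j = 0; j < 2 && cfgs[i].numa_spans[j]; j++)
			printf("%s: NUMA span=%u imb_numa_nr=%u\n",
			       cfgs[i].name, cfgs[i].numa_spans[j],
			       imb * (cfgs[i].numa_spans[j] / cfgs[i].imb_span));
	}
	return 0;
}

This prints imb_numa_nr=2 for the socket-spanning level on NPS=1, but
imb_numa_nr=4 for the same level on NPS=2/4, which is the asymmetry
described above.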



Zen3, NPS=1
------------------------------------------------------------------
|                                                                |
|  NUMA-A : sd->imb_numa_nr = 2   			     	 |
|     ------------------------     ------------------------  	 |
|     |DIE                   |     |DIE                   |  	 |
|     |                      |     |                      |  	 |
|     |    ------   ------   |     |    ------   ------   |  	 |
|     |    |MC  |   |MC  |   |     |    |MC  |   |MC  |   |  	 |
|     |    ------   ------   |	   |    ------   ------   |  	 |
|     |    ------   ------   |	   |    ------   ------   |  	 |
|     |    |MC  |   |MC  |   |	   |    |MC  |   |MC  |   |  	 |
|     |    ------   ------   |	   |    ------   ------   |  	 |
|     |                      |	   |                      |  	 |
|     |    ------   ------   |	   |    ------   ------   |  	 |
|     |    |MC  |   |MC  |   |	   |    |MC  |   |MC  |   |  	 |
|     |    ------   ------   |	   |    ------   ------   |  	 |
|     |    ------   ------   |	   |    ------   ------   |  	 |
|     |    |MC  |   |MC  |   |     |    |MC  |   |MC  |   |  	 |
|     |    ------   ------   |	   |    ------   ------   |  	 |
|     |                      |	   |                      |  	 |
|     ------------------------	   ------------------------  	 |
|							     	 |
------------------------------------------------------------------



Zen3, NPS=2
------------------------------------------------------------------
|                                                                |
|  NUMA-A: sd->imb_numa_nr = 4   				 |
|    ---------------------------  --------------------------- 	 |
|    |NUMA-B :sd->imb_numa_nr=2|  |NUMA-B :sd->imb_numa_nr=2|	 |
|    | ----------- ----------- |  | ----------- ----------- |	 |
|    | |NODE     | |NODE     | |  | |NODE     | |NODE     | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  |MC  | | |  |MC  | | |  | |  |MC  | | |  |MC  | | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  |MC  | | |  |MC  | | |  | |  |MC  | | |  |MC  | | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |         | |         | |  | |         | |         | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  |MC  | | |  |MC  | | |  | |  |MC  | | |  |MC  | | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  |MC  | | |  |MC  | | |  | |  |MC  | | |  |MC  | | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | ----------- ----------- |  | ----------- ----------- |	 |
|    |                         |  |                         |	 |
|    ---------------------------  ---------------------------	 |
|							     	 |
------------------------------------------------------------------


Zen3, NPS=4
------------------------------------------------------------------
|                                                                |
|  NUMA-A: sd->imb_numa_nr = 4   				 |
|    ---------------------------  --------------------------- 	 |
|    |NUMA-B :sd->imb_numa_nr=2|  |NUMA-B :sd->imb_numa_nr=2|	 |
|    | ----------- ----------- |  | ----------- ----------- |	 |
|    | |NODE     | |NODE     | |  | |NODE     | |NODE     | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  |MC  | | |  |MC  | | |  | |  |MC  | | |  |MC  | | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  |MC  | | |  |MC  | | |  | |  |MC  | | |  |MC  | | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | ----------- ----------- |  | ----------- ----------- |    |
|    | ----------- ----------- |  | ----------- ----------- |    |
|    | |NODE     | |NODE     | |  | |NODE     | |NODE     | |    |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  |MC  | | |  |MC  | | |  | |  |MC  | | |  |MC  | | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  |MC  | | |  |MC  | | |  | |  |MC  | | |  |MC  | | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | ----------- ----------- |  | ----------- ----------- |	 |
|    |                         |  |                         |	 |
|    ---------------------------  ---------------------------	 |
|							     	 |
------------------------------------------------------------------




> > Irrespective of what NPS mode we are operating in, the NUMA distance
> > between the two sockets is 32 on Zen3 systems. Hence shouldn't the
> > thresholds be the same for that level of NUMA? 
> > 
> 
> Maybe, but then it is not a property of the sched_domain and instead
> needs to account for distance when balancing between two nodes that may
> be varying distances from each other.
> 
> > Would something like the following work ?
> > 
> > if (sd->flags & SD_NUMA) {
> > 
> >    /* We are using the child as a proxy for the group. */
> >    group_span = sd->child->span_weight;
> >    sd_distance = /* NUMA distance at this sd level */
> > 
> 
> NUMA distance relative to what? On Zen, the distance to a remote node may
> be fixed but on topologies with multiple nodes that are not fully linked
> to every other node by one hop, the same is not true.

Fair enough. The "sd_distance" I was referring to the node_distance()
between the CPUs of any two groups in this NUMA domain. However, this
was assuming that the node_distance() between the CPUs of any two
groups would be the same, which is not the case for certain
platforms. So this wouldn't work.
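
As a purely hypothetical illustration of why (the distance table below
is made up, not taken from any real machine):

/* Hypothetical SLIT for a 4-node topology that is not fully connected
 * by one hop: */
static const int hypothetical_distance[4][4] = {
	{ 10, 12, 20, 22 },
	{ 12, 10, 22, 20 },
	{ 20, 22, 10, 12 },
	{ 22, 20, 12, 10 },
};
/* If the groups at the top NUMA level are {0,1} and {2,3}, the
 * node_distance() between the CPUs of those two groups is 20 or 22
 * depending on which nodes the CPUs sit on, so a single "sd_distance"
 * for that sd level is not well defined. */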



> 
> >    /* By default we set the threshold to 1/4th the sched-group span. */
> >    imb_numa_shift = 2;
> > 
> >    /*
> >     * We can be a little aggressive if the cost of migrating tasks
> >     * across groups of this NUMA level is not high.
> >     * Assuming 
> >     */
> >    
> >    if (sd_distance < REMOTE_DISTANCE)
> >       imb_numa_shift++;
> > 
> 
> The distance would have to be accounted for elsewhere because here we
> are considering one node in isolation, not relative to other nodes.
> 
> >    /*
> >     * Compute the number of LLCs in each group.
> >     * More the LLCs, more aggressively we migrate across
> >     * the groups at this NUMA sd.
> >     */
> >     nr_llcs = group_span/llc_size;
> > 
> >     sd->imb_numa_nr = max(2U, (group_span / nr_llcs) >> imb_numa_shift);
> > }
> > 
> 
> Same, any adjustment would have to happen during load balancing taking
> into account the relatively NUMA distances. I'm not necessarily opposed
> but it would be a separate patch.


Sure, we can look into this separately.


> 
> > > > <SNIP>
> > > > If we retain the (2,4) thresholds from v4.1 but use them in
> > > > allow_numa_imbalance() as in v3 we get
> > > > 
> > > > NPS=4
> > > > Test:	 mel-v4.2
> > > >  Copy:	 225860.12 (498.11%)
> > > > Scale:	 227869.07 (572.58%)
> > > >   Add:	 278365.58 (624.93%)
> > > > Triad:	 264315.44 (596.62%)
> > > > 
> > > 
> > > The potential problem with this is that it probably will work for
> > > netperf when it's a single communicating pair but may not work as well
> > > when there are multiple communicating pairs or a number of communicating
> > > tasks that exceed numa_imb_nr.
> > 
> > 
> > Yes that's true. I think what you are doing in v4 is the right thing.
> > 
> > In case of stream in NPS=4, it just manages to hit the corner case for
> > this heuristic which results in a suboptimal behaviour. Description
> > follows:
> > 
> 
> To avoid the corner case, we'd need to explicitly favour spreading early
> and assume wakeup will pull communicating tasks together and NUMA
> balancing migrate the data after some time which looks like


Actually, I was able to root-cause the drop in stream performance on
NPS=4. I have already responded earlier in another thread:
https://lore.kernel.org/lkml/Ybzq%2FA+HS%2FGxGYha@BLR-5CG11610CF.amd.com/
I am appending the patch here:


---
 kernel/sched/fair.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ec354bf88b0d..c1b2a422a877 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9191,13 +9191,14 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 				return idlest;
 #endif
 			/*
-			 * Otherwise, keep the task on this node to stay local
-			 * to its wakeup source if the number of running tasks
-			 * are below the allowed imbalance. If there is a real
-			 * need of migration, periodic load balance will take
-			 * care of it.
+			 * Otherwise, keep the task on this node to
+			 * stay local to its wakeup source if the
+			 * number of running tasks (including the new
+			 * one) are below the allowed imbalance. If
+			 * there is a real need of migration, periodic
+			 * load balance will take care of it.
 			 */
-			if (local_sgs.sum_nr_running <= sd->imb_numa_nr)
+			if (local_sgs.sum_nr_running + 1 <= sd->imb_numa_nr)
 				return NULL;
 		}
 
-- 
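
To spell out what the "+1" changes, here is a toy userspace sketch (not
kernel code), using imb_numa_nr=2 as in the figures above:

#include <stdio.h>

static int stays_local_old(int sum_nr_running, int imb_numa_nr)
{
	/* Pre-fix: count only the tasks already running locally. */
	return sum_nr_running <= imb_numa_nr;
}

static int stays_local_new(int sum_nr_running, int imb_numa_nr)
{
	/* Patched: also count the task that is about to be placed. */
	return sum_nr_running + 1 <= imb_numa_nr;
}

int main(void)
{
	int imb_numa_nr = 2;	/* e.g. the imb_numa_nr=2 case above */

	for (int running = 0; running <= 3; running++)
		printf("local running=%d  old stays local=%d  new stays local=%d\n",
		       running,
		       stays_local_old(running, imb_numa_nr),
		       stays_local_new(running, imb_numa_nr));
	return 0;
}

With 2 tasks already running locally, the old check (2 <= 2) still
returns NULL and a third task piles onto the local node, whereas with
the patched check (2 + 1 <= 2 is false) find_idlest_group() is allowed
to spread it, which is what stream needs on NPS=4.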

With this fix on top of your fix to compute the imb_numa_nr at the
relevant level (v4.1:
https://lore.kernel.org/lkml/20211213130131.GQ3366@techsingularity.net/),
the stream regression for NPS=4 is no longer there.


Test:	tip-core	    v4		       v4.1		    v4.1-find_idlest_group_fix
 Copy:	 37762.14 (0.00%)   48748.60 (29.09%)  164765.83 (336.32%)  205963.99 (445.42%)
Scale:	 33879.66 (0.00%)   48317.66 (42.61%)  159641.67 (371.20%)  218136.57 (543.85%)
  Add:	 38398.90 (0.00%)   54259.56 (41.30%)  184583.70 (380.70%)  257857.90 (571.52%)
Triad:	 37942.38 (0.00%)   54503.74 (43.64%)  181250.80 (377.70%)  251410.28 (562.61%)

--
Thanks and Regards
gautham.



