Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs

From: "Gautham R. Shenoy" <gautham.shenoy@amd.com>
To: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@kernel.org>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Valentin Schneider <valentin.schneider@arm.com>,
	Aubrey Li <aubrey.li@linux.intel.com>,
	Barry Song <song.bao.hua@hisilicon.com>,
	Mike Galbraith <efault@gmx.de>,
	Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
Date: Fri, 17 Dec 2021 00:03:06 +0530
Message-ID: <YbuGYtxRSqVkOdbj@BLR-5CG11610CF.amd.com>
In-Reply-To: <20211215122550.GR3366@techsingularity.net>

Hello Mel,

On Wed, Dec 15, 2021 at 12:25:50PM +0000, Mel Gorman wrote:
> On Wed, Dec 15, 2021 at 05:22:30PM +0530, Gautham R. Shenoy wrote:

[..SNIP..]

> > On a 2 Socket Zen3:
> > 
> > NPS=1
> >    child=MC, llc_weight=16, sd=DIE. sd->span_weight=128 imb=max(2U, (16*16/128) / 4)=2
> >    top_p = NUMA, imb_span = 256.
> > 
> >    NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/256) = 2
> > 
> > NPS=2
> >    child=MC, llc_weight=16, sd=NODE. sd->span_weight=64 imb=max(2U, (16*16/64) / 4) = 2
> >    top_p = NUMA, imb_span = 128.
> > 
> >    NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2
> >    NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4
> > 
> > NPS=4:
> >    child=MC, llc_weight=16, sd=NODE. sd->span_weight=32 imb=max(2U, (16*16/32) / 4) = 2
> >    top_p = NUMA, imb_span = 128.
> > 
> >    NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2
> >    NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4
> > 
> > Again, we will be more aggressively load balancing across the two
> > sockets in NPS=1 mode compared to NPS=2/4.
> > 
> 
> Yes, but I felt it was reasonable behaviour because we have to strike
> some sort of balance between allowing a NUMA imbalance up to a point
> to prevent communicating tasks being pulled apart and v3 broke that
> completely. There will always be a tradeoff between tasks that want to
> remain local to each other and others that prefer to spread as wide as
> possible as quickly as possible.

I agree with this argument that we want to be conservative while
pulling tasks across NUMA domains. My point was that the threshold at
the NUMA domain that spans the 2 sockets is lower for NPS=1
(imb_numa_nr = 2) when compared to the threshold for the same NUMA
domain when NPS=2/4 (imb_numa_nr = 4).

Irrespective of what NPS mode we are operating in, the NUMA distance
between the two sockets is 32 on Zen3 systems. Hence shouldn't the
thresholds be the same for that level of NUMA? 
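
To make the mismatch concrete, here is a minimal sketch that transcribes the
arithmetic quoted above. It is not the patch's code: the function names
v4_base_imb() and v4_imb_numa_nr() are made up for illustration, and the inputs
are the Zen3 values from the quoted text.

```c
#include <assert.h>

/* Larger of two unsigned values; stands in for the kernel's max(). */
unsigned int max_u(unsigned int a, unsigned int b)
{
	return a > b ? a : b;
}

/*
 * Base imbalance at the first domain above the LLC, transcribed from the
 * quoted arithmetic: max(2U, (llc_weight * llc_weight / span_weight) / 4).
 * Hypothetical helper, not a kernel function.
 */
unsigned int v4_base_imb(unsigned int llc_weight, unsigned int span_weight)
{
	assert(span_weight > 0);
	return max_u(2U, (llc_weight * llc_weight / span_weight) / 4);
}

/*
 * Threshold at a NUMA domain, transcribed from the quoted arithmetic:
 * imb * (numa_span / imb_span). Hypothetical helper.
 */
unsigned int v4_imb_numa_nr(unsigned int imb, unsigned int numa_span,
			    unsigned int imb_span)
{
	assert(imb_span > 0);
	return imb * (numa_span / imb_span);
}
```

For the NUMA domain spanning both sockets (span_weight = 256),
v4_imb_numa_nr(v4_base_imb(16, 128), 256, 256) gives 2 for NPS=1, while
v4_imb_numa_nr(v4_base_imb(16, 64), 256, 128) and
v4_imb_numa_nr(v4_base_imb(16, 32), 256, 128) both give 4 for NPS=2 and NPS=4.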

Would something like the following work?

if (sd->flags & SD_NUMA) {

   /* We are using the child as a proxy for the group. */
   group_span = sd->child->span_weight;
   sd_distance = /* NUMA distance at this sd level */

   /* By default we set the threshold to 1/4th the sched-group span. */
   imb_numa_shift = 2;

   /*
    * We can be a little aggressive if the cost of migrating tasks
    * across groups of this NUMA level is not high.
    * Assuming 
    */

   if (sd_distance < REMOTE_DISTANCE)
      imb_numa_shift++;

   /*
    * Compute the number of LLCs in each group.
    * The more LLCs there are, the more aggressively we migrate
    * across the groups at this NUMA sd.
    */
   nr_llcs = group_span / llc_size;

   sd->imb_numa_nr = max(2U, (group_span / nr_llcs) >> imb_numa_shift);
}

With this, on Intel platforms, we will get sd->imb_numa_nr = (span of socket)/4,
because nr_llcs = 1 when a socket has a single LLC.

On Zen3,

NPS=1, Inter-socket NUMA : sd->imb_numa_nr = max(2U, (128/8) >> 2) = 4

NPS=2, Intra-socket NUMA: sd->imb_numa_nr = max(2U, (64/4) >> (2+1)) = 2
       Inter-socket NUMA: sd->imb_numa_nr = max(2U, (128/8) >> 2) = 4

NPS=4, Intra-socket NUMA: sd->imb_numa_nr = max(2U, (32/2) >> (2+1)) = 2
       Inter-socket NUMA: sd->imb_numa_nr = max(2U, (128/8) >> 2) = 4
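
The proposal can be checked in userspace with a minimal sketch like the one
below. It is only an illustration under stated assumptions: the function name
proposed_imb_numa_nr() is made up, SKETCH_REMOTE_DISTANCE mirrors the generic
kernel default of 20 for REMOTE_DISTANCE, and the intra-socket distance of 12
used in the usage note is an assumed value (the only distance stated here is
32 between the sockets).

```c
#include <assert.h>

/* Assumption: the generic kernel default for REMOTE_DISTANCE is 20. */
#define SKETCH_REMOTE_DISTANCE 20

/*
 * Userspace rendering of the proposed threshold computation.
 * group_span:  span of the child domain, used as a proxy for the group
 * llc_size:    number of CPUs sharing one LLC
 * sd_distance: NUMA distance at this sched-domain level
 */
unsigned int proposed_imb_numa_nr(unsigned int group_span,
				  unsigned int llc_size,
				  int sd_distance)
{
	unsigned int imb_numa_shift = 2;	/* default: 1/4th */
	unsigned int nr_llcs, imb;

	assert(llc_size > 0 && group_span >= llc_size);

	/* Be a little more aggressive when the groups are close together. */
	if (sd_distance < SKETCH_REMOTE_DISTANCE)
		imb_numa_shift++;

	nr_llcs = group_span / llc_size;
	imb = (group_span / nr_llcs) >> imb_numa_shift;

	/* Never go below 2, matching max(2U, ...) in the proposal. */
	return imb > 2U ? imb : 2U;
}
```

With llc_size = 16, proposed_imb_numa_nr(128, 16, 32) gives 4 for the
inter-socket domain in every NPS mode, while proposed_imb_numa_nr(64, 16, 12)
and proposed_imb_numa_nr(32, 16, 12) give 2 for the intra-socket domains in
NPS=2 and NPS=4.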

> 
> > <SNIP>
> > If we retain the (2,4) thresholds from v4.1 but use them in
> > allow_numa_imbalance() as in v3 we get
> > 
> > NPS=4
> > Test:	 mel-v4.2
> >  Copy:	 225860.12 (498.11%)
> > Scale:	 227869.07 (572.58%)
> >   Add:	 278365.58 (624.93%)
> > Triad:	 264315.44 (596.62%)
> > 
> 
> The potential problem with this is that it probably will work for
> netperf when it's a single communicating pair but may not work as well
> when there are multiple communicating pairs or a number of communicating
> tasks that exceed imb_numa_nr.

Yes, that's true. I think what you are doing in v4 is the right thing.

In case of stream in NPS=4, it just manages to hit the corner case for
this heuristic, which results in suboptimal behaviour. A description
follows:

On NPS=4, if we run 8 stream tasks bound to a socket with v4.1, we get
the following initial placement based on data obtained via the
sched:sched_wakeup_new tracepoint. This behaviour is consistently
reproducible.

-------------------------------------------------------
| NUMA                                                |
|   ----------------------- ------------------------  |
|   | NODE0               | | NODE1                |  |
|   |   -------------     | |    -------------     |  |
|   |   |  0 tasks  | MC0 | |    |  1 tasks  | MC2 |  |
|   |   -------------     | |    -------------     |  |
|   |   -------------     | |    -------------     |  |
|   |   |  1 tasks  | MC1 | |    |  1 tasks  | MC3 |  |
|   |   -------------     | |    -------------     |  |
|   |                     | |                      |  |
|   ----------------------- ------------------------  |
|   ----------------------- ------------------------  |
|   | NODE2               | | NODE3                |  |
|   |   -------------     | |    -------------     |  |
|   |   |  1 tasks  | MC4 | |    |  1 tasks  | MC6 |  |
|   |   -------------     | |    -------------     |  |
|   |   -------------     | |    -------------     |  |
|   |   |  2 tasks  | MC5 | |    |  1 tasks  | MC7 |  |
|   |   -------------     | |    -------------     |  |
|   |                     | |                      |  |
|   ----------------------- ------------------------  |
|                                                     |
-------------------------------------------------------

From the trace data obtained for sched:sched_wakeup_new and
sched:sched_migrate_task, we see

PID 106089 : timestamp 35607.831040 : was running  in MC5
PID 106090 : timestamp 35607.831040 : first placed in MC4
PID 106091 : timestamp 35607.831081 : first placed in MC5
PID 106092 : timestamp 35607.831155 : first placed in MC7
PID 106093 : timestamp 35607.831209 : first placed in MC3
PID 106094 : timestamp 35607.831254 : first placed in MC1
PID 106095 : timestamp 35607.831300 : first placed in MC6
PID 106096 : timestamp 35607.831344 : first placed in MC2

Subsequently we do not see any migrations for stream tasks (via the
sched:sched_migrate_task tracepoint), even though they run for nearly
10 seconds. The reasons:

  - No load-balancing is possible at any of the NODE sched-domains
    since the groups are more or less balanced within each NODE.

  - At NUMA sched-domain, busiest group would be NODE2.  When any CPU
    in NODE0 performs load-balancing at NUMA level, it can pull tasks
    only if the imbalance between NODE0 and NODE2 is greater than
    imb_numa_nr = 2, which isn't the case here.
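
The second point can be illustrated with a small sketch that tallies the
initial placement per node and applies the condition as described above. This
is a simplification, not the kernel's load-balancing code: the helper names
are made up, and the MC-to-node mapping (MC n belongs to NODE n / 2) is taken
from the diagram.

```c
#include <assert.h>
#include <stddef.h>

#define NR_NODES	4
#define MCS_PER_NODE	2

/*
 * Count tasks per node from a list of MC indices, one entry per task.
 * Per the diagram, MC n belongs to NODE (n / MCS_PER_NODE).
 */
void count_tasks_per_node(const int *mc_of_task, size_t nr_tasks,
			  unsigned int nr_per_node[NR_NODES])
{
	size_t i;

	for (i = 0; i < NR_NODES; i++)
		nr_per_node[i] = 0;

	for (i = 0; i < nr_tasks; i++) {
		assert(mc_of_task[i] >= 0 &&
		       mc_of_task[i] < NR_NODES * MCS_PER_NODE);
		nr_per_node[mc_of_task[i] / MCS_PER_NODE]++;
	}
}

/*
 * Simplified condition from the text: a pull is allowed only if the task
 * difference between the busiest and the local node exceeds imb_numa_nr.
 */
int numa_pull_allowed(unsigned int busiest_nr, unsigned int local_nr,
		      unsigned int imb_numa_nr)
{
	return busiest_nr > local_nr && busiest_nr - local_nr > imb_numa_nr;
}
```

Feeding in the traced placement {5, 4, 5, 7, 3, 1, 6, 2} yields 1, 2, 3 and 2
tasks on NODE0 to NODE3. The difference between NODE2 and NODE0 is exactly 2,
so numa_pull_allowed(3, 1, 2) is false and nothing moves.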

Hence, with v4.1, we get the following numbers which are better than
the current upstream, but are still not the best.
Copy:           78182.7
Scale:          76344.1
Add:            87638.7
Triad:          86388.9

However, if I run an "mpstat 1 10 > /tmp/mpstat.log&" just before
kickstarting stream-8, the performance improves significantly, by
roughly 51% to 57% over the numbers above (again, consistently
reproducible).

Copy:          122804.6
Scale:         115192.9
Add:           137191.6
Triad:         133338.5

In this case, from the trace data for stream, we see:
PID 105174 : timestamp 35547.526816 : was running  in  MC4
PID 105174 : timestamp 35547.577635 : moved to         MC5

PID 105175 : timestamp 35547.526816 : first placed in  MC4
PID 105176 : timestamp 35547.526846 : first placed in  MC3
PID 105177 : timestamp 35547.526893 : first placed in  MC7
PID 105178 : timestamp 35547.526928 : first placed in  MC1
PID 105179 : timestamp 35547.526961 : first placed in  MC2
PID 105180 : timestamp 35547.527001 : first placed in  MC6
PID 105181 : timestamp 35547.527032 : first placed in  MC0

In this case, at the time of the initial placement
(find_idlest_group() ?), we are able to spread the tasks out farther.
The subsequent load-balance at the NODE2 domain is then able to balance
the tasks between MC4 and MC5.
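
A quick way to see the effect on the LLCs is to count how many stream tasks
share an MC domain before and after that migration. The sketch below is only
an illustration; max_tasks_per_mc() is a made-up helper, and the MC indices
are copied from the trace above.

```c
#include <assert.h>
#include <stddef.h>

#define NR_MCS 8

/* Return the largest number of tasks sharing one MC (LLC) domain. */
unsigned int max_tasks_per_mc(const int *mc_of_task, size_t nr_tasks)
{
	unsigned int count[NR_MCS] = { 0 };
	unsigned int max = 0;
	size_t i;

	for (i = 0; i < nr_tasks; i++) {
		assert(mc_of_task[i] >= 0 && mc_of_task[i] < NR_MCS);
		if (++count[mc_of_task[i]] > max)
			max = count[mc_of_task[i]];
	}

	return max;
}
```

With the initial placement {4, 4, 3, 7, 1, 2, 6, 0} two tasks share MC4. Once
PID 105174 moves to MC5 the placement becomes {5, 4, 3, 7, 1, 2, 6, 0} and
every MC holds exactly one task.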

> 
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > NPS=1
> > ======
> > Clients: tip-core   mel-v3    mel-v4    mel-v4.1
> >     1	 633.19     619.16    632.94    619.27
> >     	 (0.00%)    (-2.21%)  (-0.03%)	(-2.19%)
> > 	 
> >     2	 1152.48    1189.88   1184.82   1189.19
> >     	 (0.00%)    (3.24%)   (2.80%)	(3.18%)
> > 	 
> >     4	 1946.46    2177.40   1979.56	2196.09
> >     	 (0.00%)    (11.86%)  (1.70%)	(12.82%)
> > 	 
> >     8	 3553.29    3564.50   3678.07	3668.77
> >     	 (0.00%)    (0.31%)   (3.51%)	(3.24%)
> > 	 
> >    16	 6217.03    6484.58   6249.29	6534.73
> >    	 (0.00%)    (4.30%)   (0.51%)	(5.11%)
> > 	 
> >    32	 11702.59   12185.77  12005.99	11917.57
> >    	 (0.00%)    (4.12%)   (2.59%)	(1.83%)
> > 	 
> >    64	 18394.56   19535.11  19080.19	19500.55
> >    	 (0.00%)    (6.20%)   (3.72%)	(6.01%)
> > 	 
> >   128	 27231.02   31759.92  27200.52	30358.99
> >   	 (0.00%)    (16.63%)  (-0.11%)	(11.48%)
> > 	 
> >   256	 33166.10   24474.30  31639.98	24788.12
> >   	 (0.00%)    (-26.20%) (-4.60%)	(-25.26%)
> > 	 
> >   512	 41605.44   54823.57  46684.48	54559.02
> >   	 (0.00%)    (31.77%)  (12.20%)	(31.13%)
> > 	 
> >  1024	 53650.54   56329.39  44422.99	56320.66
> >  	 (0.00%)    (4.99%)   (-17.19%)	(4.97%) 
> > 
> > 
> > We see that the v4.1 performs better than v4 in most cases except when
> > the number of clients=256 where the spread strategy seems to be
> > hurting as we see degradation in both v3 and v4.1. This is true even
> > for NPS=2 and NPS=4 cases (see below).
> > 
> 
> The 256 client case is a bit of a crapshoot. At that point, the NUMA
> imbalancing is disabled and the machine is overloaded.

Yup. 

[..snip..]

> Most likely because v4.2 is disabling the allowed NUMA imbalance too
> soon. This is the trade-off between favouring communicating tasks over
> embarrassingly parallel problems.

v4.1 does allow the NUMA imbalance for a longer duration. But since
the thresholds are small enough, I guess it should be OK for most
workloads.

--
Thanks and Regards
gautham.