* [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2
@ 2019-12-20  8:42 Mel Gorman
  2019-12-20 12:40 ` Valentin Schneider
                   ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Mel Gorman @ 2019-12-20  8:42 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, pauld, valentin.schneider, srikar,
	quentin.perret, dietmar.eggemann, Morten.Rasmussen, hdanton,
	parth, riel, LKML, Mel Gorman

Changelog since V1
o Alter code flow 						vincent.guittot
o Use idle CPUs for comparison instead of sum_nr_running	vincent.guittot
o Note that the division is still in place. Without it and taking
  imbalance_adj into account before the cutoff, two NUMA domains
  do not converge as being equally balanced when the number of
  busy tasks equals the size of one domain (50% of the sum).
  Some data is in the changelog.

The CPU load balancer balances between different domains to spread load
and strives to have equal balance everywhere. Communicating tasks can
migrate so they are topologically close to each other but these decisions
are independent. On a lightly loaded NUMA machine, two communicating tasks
pulled together at wakeup time can be pushed apart by the load balancer.
In isolation, the load balancer decision is fine but it ignores the tasks'
data locality and the wakeup/LB paths continually conflict. NUMA balancing
is also a factor, but it too simply conflicts with the load balancer.

This patch allows a degree of imbalance to exist between NUMA domains
based on the imbalance_pct defined by the scheduler domain. This slight
imbalance is allowed until the scheduler domain reaches almost 50%
utilisation at which point other factors like HT utilisation and memory
bandwidth come into play. While not commented upon in the code, the cutoff
is important for memory-bound parallelised non-communicating workloads
that do not fully utilise the entire machine. This is not necessarily the
best universal cut-off point but it appeared appropriate for a variety
of workloads and machines.
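
As a rough worked example of what this means in practice (a standalone
sketch only, using a hypothetical 24-CPU NUMA node and an assumed
imbalance_pct of 125 rather than the exact topology of the machines below),
the arithmetic from the hunk at the end of this mail works out as follows:

#include <stdio.h>

static unsigned int max_u(unsigned int a, unsigned int b)
{
	return a > b ? a : b;
}

int main(void)
{
	unsigned int group_weight = 24;		/* hypothetical CPUs per NUMA node */
	unsigned int imbalance_pct = 125;	/* assumed SD_NUMA imbalance_pct */
	unsigned int imbalance_adj, imbalance_max;

	/* Same arithmetic as the SD_NUMA block added below */
	imbalance_adj = max_u(2U, (group_weight * (imbalance_pct - 100) / 100) >> 1);
	imbalance_max = (group_weight >> 1) + imbalance_adj;

	printf("imbalance_adj = %u\n", imbalance_adj);	/* 3 */
	printf("imbalance_max = %u\n", imbalance_max);	/* 15 */
	return 0;
}

In other words, for that hypothetical node the imbalance is tolerated for
as long as at least 15 CPUs on the busiest node remain idle.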

The most obvious impact is on netperf TCP_STREAM -- two simple
communicating tasks with some softirq offloaded depending on the
transmission rate.

2-socket Haswell machine 48 core, HT enabled
netperf-tcp -- mmtests config config-network-netperf-unbound
                       	      baseline              lbnuma-v1
Hmean     64         666.68 (   0.00%)      667.31 (   0.09%)
Hmean     128       1276.18 (   0.00%)     1288.92 *   1.00%*
Hmean     256       2366.78 (   0.00%)     2422.22 *   2.34%*
Hmean     1024      8123.94 (   0.00%)     8464.15 *   4.19%*
Hmean     2048     12962.45 (   0.00%)    13693.79 *   5.64%*
Hmean     3312     17709.24 (   0.00%)    17494.23 (  -1.21%)
Hmean     4096     19756.01 (   0.00%)    19472.58 (  -1.43%)
Hmean     8192     27469.59 (   0.00%)    27787.32 (   1.16%)
Hmean     16384    30062.82 (   0.00%)    30657.62 *   1.98%*
Stddev    64           2.64 (   0.00%)        2.09 (  20.76%)
Stddev    128          6.22 (   0.00%)        6.48 (  -4.28%)
Stddev    256          9.75 (   0.00%)       22.85 (-134.30%)
Stddev    1024        69.62 (   0.00%)       58.41 (  16.11%)
Stddev    2048        72.73 (   0.00%)       83.47 ( -14.77%)
Stddev    3312       412.35 (   0.00%)       75.77 (  81.63%)
Stddev    4096       345.02 (   0.00%)      297.01 (  13.91%)
Stddev    8192       280.09 (   0.00%)      485.36 ( -73.29%)
Stddev    16384      452.99 (   0.00%)      250.21 (  44.76%)

Fairly small impact on average performance but note how much the standard
deviation is reduced in many cases. A clearer story is visible from the
NUMA Balancing stats

Ops NUMA base-page range updates       21596.00         282.00
Ops NUMA PTE updates                   21596.00         282.00
Ops NUMA PMD updates                       0.00           0.00
Ops NUMA hint faults                   17786.00         137.00
Ops NUMA hint local faults %            9916.00         137.00
Ops NUMA hint local percent               55.75         100.00
Ops NUMA pages migrated                 4231.00           0.00

Without the patch, only 55.75% of sampled accesses are local.
With the patch, 100% of sampled accesses are local. A 2-socket
Broadwell showed better results on average but they are not presented
for brevity. The patch holds up for 4-socket boxes as well.

4-socket Haswell machine, 144 core, HT enabled
netperf-tcp

                       	      baseline              lbnuma-v1
Hmean     64         953.51 (   0.00%)      977.27 *   2.49%*
Hmean     128       1826.48 (   0.00%)     1863.37 *   2.02%*
Hmean     256       3295.19 (   0.00%)     3329.37 (   1.04%)
Hmean     1024     10915.40 (   0.00%)    11339.60 *   3.89%*
Hmean     2048     17833.82 (   0.00%)    19066.12 *   6.91%*
Hmean     3312     22690.72 (   0.00%)    24048.92 *   5.99%*
Hmean     4096     24422.23 (   0.00%)    26606.60 *   8.94%*
Hmean     8192     31250.11 (   0.00%)    33374.62 *   6.80%*
Hmean     16384    37033.70 (   0.00%)    38684.28 *   4.46%*
Hmean     16384    37033.70 (   0.00%)    38732.22 *   4.59%*

On this machine, the baseline measured 58.11% locality for sampled accesses;
with the patch, 100% of sampled accesses were local. Similarly, the patch
holds up for 2-socket machines with multiple L3 caches such as the AMD EPYC 2

2-socket EPYC-2 machine, 256 cores
netperf-tcp
Hmean     64        1564.63 (   0.00%)     1550.59 (  -0.90%)
Hmean     128       3028.83 (   0.00%)     3030.48 (   0.05%)
Hmean     256       5733.47 (   0.00%)     5769.51 (   0.63%)
Hmean     1024     18936.04 (   0.00%)    19216.15 *   1.48%*
Hmean     2048     27589.77 (   0.00%)    28200.45 *   2.21%*
Hmean     3312     35361.97 (   0.00%)    35881.94 *   1.47%*
Hmean     4096     37965.59 (   0.00%)    38702.01 *   1.94%*
Hmean     8192     48499.92 (   0.00%)    49530.62 *   2.13%*
Hmean     16384    54249.96 (   0.00%)    55937.24 *   3.11%*

For amusement purposes, here are two graphs showing CPU utilisation on
the 2-socket Haswell machine over time based on mpstat with the ordering
of the CPUs based on topology.

http://www.skynet.ie/~mel/postings/lbnuma-20191218/netperf-tcp-mpstat-baseline.png
http://www.skynet.ie/~mel/postings/lbnuma-20191218/netperf-tcp-mpstat-lbnuma-v1r1.png

The lines on the left match up CPUs that are HT siblings or on the same
node. The machine has only one L3 cache per NUMA node or that would also
be shown.  It should be very clear from the images that the baseline
kernel spread the load with lighter utilisation across nodes while the
patched kernel had heavy utilisation of fewer CPUs on one node.

Hackbench generally shows good results across machines with some
differences depending on whether processes or threads are used as well as
pipes or sockets.  This is the *worst* result from the 2-socket Haswell
machine

2-socket Haswell machine 48 core, HT enabled
hackbench-process-pipes -- mmtests config config-scheduler-unbound
                           5.5.0-rc1              5.5.0-rc1
                     	    baseline              lbnuma-v1
Amean     1        1.2580 (   0.00%)      1.2393 (   1.48%)
Amean     4        5.3293 (   0.00%)      5.2683 *   1.14%*
Amean     7        8.9067 (   0.00%)      8.7130 *   2.17%*
Amean     12      14.9577 (   0.00%)     14.5773 *   2.54%*
Amean     21      25.9570 (   0.00%)     25.6657 *   1.12%*
Amean     30      37.7287 (   0.00%)     37.1277 *   1.59%*
Amean     48      61.6757 (   0.00%)     60.0433 *   2.65%*
Amean     79     100.4740 (   0.00%)     98.4507 (   2.01%)
Amean     110    141.2450 (   0.00%)    136.8900 *   3.08%*
Amean     141    179.7747 (   0.00%)    174.5110 *   2.93%*
Amean     172    221.0700 (   0.00%)    214.7857 *   2.84%*
Amean     192    245.2007 (   0.00%)    238.3680 *   2.79%*

An earlier prototype of the patch showed major regressions for NAS C-class
when running with only half of the available CPUs -- 20-30% performance
hits were measured at the time. With this version of the patch, the impact
is marginal. In this case, the patch is lbnuma-v2 whereas nodivide is a
patch discussed during review that avoids a divide by putting the cutoff
at exactly 50% instead of accounting for imbalance_adj.

NAS-C class OMP -- mmtests config hpc-nas-c-class-omp-half
                     	     baseline               nodivide              lbnuma-v1
Amean     bt.C       64.29 (   0.00%)       76.33 * -18.72%*       69.55 *  -8.17%*
Amean     cg.C       26.33 (   0.00%)       26.26 (   0.27%)       26.36 (  -0.11%)
Amean     ep.C       10.26 (   0.00%)       10.29 (  -0.31%)       10.26 (  -0.04%)
Amean     ft.C       17.98 (   0.00%)       19.73 *  -9.71%*       19.51 *  -8.52%*
Amean     is.C        0.99 (   0.00%)        0.99 (   0.40%)        0.99 (   0.00%)
Amean     lu.C       51.72 (   0.00%)       48.57 (   6.09%)       48.68 *   5.88%*
Amean     mg.C        8.12 (   0.00%)        8.27 (  -1.82%)        8.24 (  -1.50%)
Amean     sp.C       82.76 (   0.00%)       86.06 *  -3.99%*       83.42 (  -0.80%)
Amean     ua.C       58.64 (   0.00%)       57.66 (   1.67%)       57.79 (   1.45%)

There is some impact but there is a degree of variability and the ones
showing impact are mainly workloads that are mostly parallelised
and communicate infrequently between tasks. It's a corner case where
the workload benefits heavily from spreading wide and early, which is
not common. This is intended to illustrate the worst case measured.

In general, the patch simply seeks to avoid unnecessary cross-node
migrations when a machine is lightly loaded, but it shows benefits for other
workloads. While tests are still running, so far it seems to benefit
light-utilisation smaller workloads on large machines and does not appear
to do any harm to larger or parallelised workloads.

[valentin.schneider@arm.com: Reformat code flow, correct comment, use idle_cpus]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 37 +++++++++++++++++++++++++++++++++----
 1 file changed, 33 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 08a233e97a01..60a780e1420e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8637,10 +8637,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	/*
 	 * Try to use spare capacity of local group without overloading it or
 	 * emptying busiest.
-	 * XXX Spreading tasks across NUMA nodes is not always the best policy
-	 * and special care should be taken for SD_NUMA domain level before
-	 * spreading the tasks. For now, load_balance() fully relies on
-	 * NUMA_BALANCING and fbq_classify_group/rq to override the decision.
 	 */
 	if (local->group_type == group_has_spare) {
 		if (busiest->group_type > group_fully_busy) {
@@ -8671,6 +8667,39 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 			return;
 		}
 
+		/* Consider allowing a small imbalance between NUMA groups */
+		if (env->sd->flags & SD_NUMA) {
+			unsigned int imbalance_adj, imbalance_max;
+
+			/*
+			 * imbalance_adj is the allowable degree of imbalance
+			 * to exist between two NUMA domains. It's calculated
+			 * relative to imbalance_pct with a minimum of two
+			 * tasks or idle CPUs. The choice of two is due to
+			 * the most basic case of two communicating tasks
+			 * that should remain on the same NUMA node after
+			 * wakeup.
+			 */
+			imbalance_adj = max(2U, (busiest->group_weight *
+				(env->sd->imbalance_pct - 100) / 100) >> 1);
+
+			/*
+			 * Ignore small imbalances unless the busiest sd has
+			 * almost half as many busy CPUs as there are
+			 * available CPUs in the busiest group. Note that
+			 * it is not exactly half as imbalance_adj must be
+			 * accounted for or the two domains do not converge
+			 * as equally balanced if the number of busy tasks is
+			 * roughly the size of one NUMA domain.
+			 */
+			imbalance_max = (busiest->group_weight >> 1) + imbalance_adj;
+			if (env->imbalance <= imbalance_adj &&
+			    busiest->idle_cpus >= imbalance_max) {
+				env->imbalance = 0;
+				return;
+			}
+		}
+
 		if (busiest->group_weight == 1 || sds->prefer_sibling) {
 			unsigned int nr_diff = busiest->sum_nr_running;
 			/*


* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2
  2019-12-20  8:42 [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2 Mel Gorman
@ 2019-12-20 12:40 ` Valentin Schneider
  2019-12-20 14:22   ` Mel Gorman
  2019-12-21 11:25   ` Mel Gorman
  2019-12-22 12:00 ` Srikar Dronamraju
  2019-12-23 13:31 ` Vincent Guittot
  2 siblings, 2 replies; 28+ messages in thread
From: Valentin Schneider @ 2019-12-20 12:40 UTC (permalink / raw)
  To: Mel Gorman, Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, pauld, srikar, quentin.perret,
	dietmar.eggemann, Morten.Rasmussen, hdanton, parth, riel, LKML

On 20/12/2019 08:42, Mel Gorman wrote:
> In general, the patch simply seeks to avoid unnecessary cross-node
> migrations when a machine is lightly loaded, but it shows benefits for other
> workloads. While tests are still running, so far it seems to benefit
> light-utilisation smaller workloads on large machines and does not appear
> to do any harm to larger or parallelised workloads.
> 
> [valentin.schneider@arm.com: Reformat code flow, correct comment, use idle_cpus]

I think only the comment bit is still there in this version and it's not
really worth mentioning (but I do thank you for doing it!).

> @@ -8671,6 +8667,39 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>  			return;
>  		}
>  
> +		/* Consider allowing a small imbalance between NUMA groups */
> +		if (env->sd->flags & SD_NUMA) {
> +			unsigned int imbalance_adj, imbalance_max;
> +
> +			/*
> +			 * imbalance_adj is the allowable degree of imbalance
> +			 * to exist between two NUMA domains. It's calculated
> +			 * relative to imbalance_pct with a minimum of two
> +			 * tasks or idle CPUs. The choice of two is due to
> +			 * the most basic case of two communicating tasks
> +			 * that should remain on the same NUMA node after
> +			 * wakeup.
> +			 */
> +			imbalance_adj = max(2U, (busiest->group_weight *
> +				(env->sd->imbalance_pct - 100) / 100) >> 1);
> +
> +			/*
> +			 * Ignore small imbalances unless the busiest sd has
> +			 * almost half as many busy CPUs as there are
> +			 * available CPUs in the busiest group. Note that

This is all on the busiest group, so this should be more like:

			 * Ignore small imbalances unless almost half of the
			 * busiest sg's CPUs are busy.

> +			 * it is not exactly half as imbalance_adj must be
> +			 * accounted for or the two domains do not converge
> +			 * as equally balanced if the number of busy tasks is
> +			 * roughly the size of one NUMA domain.
> +			 */
> +			imbalance_max = (busiest->group_weight >> 1) + imbalance_adj;
> +			if (env->imbalance <= imbalance_adj &&

I'm confused now, have we set env->imbalance to anything at this point? AIUI
Vincent's suggestion was to hinge this purely on the weight vs idle_cpus /
nr_running, IOW not use imbalance:

if (sd->flags & SD_NUMA) {                                                                         
	imbalance_adj = max(2U, (busiest->group_weight *                                           
				 (env->sd->imbalance_pct - 100) / 100) >> 1);                      
	imbalance_max = (busiest->group_weight >> 1) + imbalance_adj;                              
                                                                                                     
	if (busiest->idle_cpus >= imbalance_max) {                                                 
		env->imbalance = 0;                                                                
		return;                                                                            
	}                                                                                          
}                                                                                                  
       
Now, I have to say I'm not sold on the idle_cpus thing, I'd much rather use
the number of runnable tasks. We are setting up a threshold for how far we
are willing to ignore imbalances; if we have overloaded CPUs we *really*
should try to solve this. Number of tasks is the safer option IMO: when we
do have one task per CPU, it'll be the same as if we had used idle_cpus, and
when we don't have one task per CPU we'll load-balance more often than we
would have with idle_cpus.

> +			    busiest->idle_cpus >= imbalance_max) {
> +				env->imbalance = 0;
> +				return;
> +			}
> +		}
> +
>  		if (busiest->group_weight == 1 || sds->prefer_sibling) {
>  			unsigned int nr_diff = busiest->sum_nr_running;
>  			/*
> 


* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2
  2019-12-20 12:40 ` Valentin Schneider
@ 2019-12-20 14:22   ` Mel Gorman
  2019-12-20 15:32     ` Valentin Schneider
  2019-12-21 11:25   ` Mel Gorman
  1 sibling, 1 reply; 28+ messages in thread
From: Mel Gorman @ 2019-12-20 14:22 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Vincent Guittot, Ingo Molnar, Peter Zijlstra, pauld, srikar,
	quentin.perret, dietmar.eggemann, Morten.Rasmussen, hdanton,
	parth, riel, LKML

On Fri, Dec 20, 2019 at 12:40:02PM +0000, Valentin Schneider wrote:
> > +		/* Consider allowing a small imbalance between NUMA groups */
> > +		if (env->sd->flags & SD_NUMA) {
> > +			unsigned int imbalance_adj, imbalance_max;
> > +
> > +			/*
> > +			 * imbalance_adj is the allowable degree of imbalance
> > +			 * to exist between two NUMA domains. It's calculated
> > +			 * relative to imbalance_pct with a minimum of two
> > +			 * tasks or idle CPUs. The choice of two is due to
> > +			 * the most basic case of two communicating tasks
> > +			 * that should remain on the same NUMA node after
> > +			 * wakeup.
> > +			 */
> > +			imbalance_adj = max(2U, (busiest->group_weight *
> > +				(env->sd->imbalance_pct - 100) / 100) >> 1);
> > +
> > +			/*
> > +			 * Ignore small imbalances unless the busiest sd has
> > +			 * almost half as many busy CPUs as there are
> > +			 * available CPUs in the busiest group. Note that
> 
> This is all on the busiest group, so this should be more like:
> 
> 			 * Ignore small imbalances unless almost half of the
> 			 * busiest sg's CPUs are busy.
> 

Updated.

> > +			 * it is not exactly half as imbalance_adj must be
> > +			 * accounted for or the two domains do not converge
> > +			 * as equally balanced if the number of busy tasks is
> > +			 * roughly the size of one NUMA domain.
> > +			 */
> > +			imbalance_max = (busiest->group_weight >> 1) + imbalance_adj;
> > +			if (env->imbalance <= imbalance_adj &&
> 
> I'm confused now, have we set env->imbalance to anything at this point?

No, it's a no-op in this context and I meant to take the check out.

> AIUI
> Vincent's suggestion was to hinge this purely on the weight vs idle_cpus /
> nr_running, IOW not use imbalance:
> 
> if (sd->flags & SD_NUMA) {                                                                         
> 	imbalance_adj = max(2U, (busiest->group_weight *                                           
> 				 (env->sd->imbalance_pct - 100) / 100) >> 1);                      
> 	imbalance_max = (busiest->group_weight >> 1) + imbalance_adj;                              
>                                                                                                      
> 	if (busiest->idle_cpus >= imbalance_max) {                                                 
> 		env->imbalance = 0;                                                                
> 		return;                                                                            
> 	}                                                                                          
> }                                                                                                  
>        

And that's what it is now. Functionally it's equivalent, other than
imbalance could potentially be anything (but should be zero).

> Now, I have to say I'm not sold on the idle_cpus thing, I'd much rather use
> the number of runnable tasks. We are setting up a threshold for how far we
> are willing to ignore imbalances; if we have overloaded CPUs we *really*
> should try to solve this. Number of tasks is the safer option IMO: when we
> do have one task per CPU, it'll be the same as if we had used idle_cpus, and
> when we don't have one task per CPU we'll load-balance more often than we
> would have with idle_cpus.
> 

I couldn't convince myself to really push back hard on the sum_nr_running
versus idle_cpus.  If the local group has spare capacity and the busiest
group has multiple tasks stacked on CPUs then it's most likely due to
CPU affinity. In that case, there is no guarantee tasks can move to the
local group either. In that case, the difference between sum_nr_running
and idle_cpus is almost moot.  There may be common use cases where the
distinction really matters but right now, I'm at the point where I think
such a change could be a separate patch with the use case included and
supporting data on why it must be sum_nr_running.  Right now, I feel it's
mostly a cosmetic issue given the context and intent of the patch.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2
  2019-12-20 14:22   ` Mel Gorman
@ 2019-12-20 15:32     ` Valentin Schneider
  0 siblings, 0 replies; 28+ messages in thread
From: Valentin Schneider @ 2019-12-20 15:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Vincent Guittot, Ingo Molnar, Peter Zijlstra, pauld, srikar,
	quentin.perret, dietmar.eggemann, Morten.Rasmussen, hdanton,
	parth, riel, LKML

On 20/12/2019 14:22, Mel Gorman wrote:
>> Now, I have to say I'm not sold on the idle_cpus thing, I'd much rather use
>> the number of runnable tasks. We are setting up a threshold for how far we
>> are willing to ignore imbalances; if we have overloaded CPUs we *really*
>> should try to solve this. Number of tasks is the safer option IMO: when we
>> do have one task per CPU, it'll be the same as if we had used idle_cpus, and
>> when we don't have one task per CPU we'll load-balance more often than we
>> would have with idle_cpus.
>>
> 
> I couldn't convince myself to really push back hard on the sum_nr_running
> versus idle_cpus.  If the local group has spare capacity and the busiest
> group has multiple tasks stacked on CPUs then it's most likely due to
> CPU affinity.

Not necessarily, for instance wakeup balancing (select_idle_sibling()) could
end up packing stuff within a node if said node spans more than one LLC
domain, which IIRC is the case on some AMD chips.

Or, still with the same LLC < node topology, you could start with the node
being completely utilized, then some tasks on some LLC domains terminate but
there's an LLC that still has a bunch of tasks running, and then you're left
with an imbalance between LLC domains that the wakeup balance cannot solve.

> In that case, there is no guarantee tasks can move to the
> local group either. In that case, the difference between sum_nr_running
> and idle_cpus is almost moot.  There may be common use cases where the
> distinction really matters but right now, I'm at the point where I think
> such a change could be a separate patch with the use case included and
> supporting data on why it must be sum_nr_running.  Right now, I feel it's
> mostly a cosmetic issue given the context and intent of the patch.
> 

Let me spin it this way: do we need to push this ignoring of the imbalance
as far as possible, or are we okay with it only happening when there are just a
few tasks running? The latter is achieved with sum_nr_running and is the
safer option IMO.
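
To make the one-task-per-CPU point above concrete, here is a tiny
standalone sketch (hypothetical numbers; the nr_running-based condition is
written as group_weight - sum_nr_running purely for illustration and is
only one possible form):

#include <stdio.h>

int main(void)
{
	/* Hypothetical busiest group: 24 CPUs, threshold as in the patch */
	unsigned int group_weight = 24;
	unsigned int imbalance_max = 15;	/* (24 >> 1) + imbalance_adj of 3 */
	unsigned int sum_nr_running, idle_cpus;

	/* One task per CPU: 6 tasks on 6 CPUs -- both checks agree */
	sum_nr_running = 6;
	idle_cpus = 18;
	printf("spread:  idle_cpus=%d nr_running=%d\n",
	       idle_cpus >= imbalance_max,
	       group_weight - sum_nr_running >= imbalance_max);

	/* Stacked: 12 tasks on 3 CPUs -- idle_cpus stays high, but the
	 * nr_running-based form stops ignoring the imbalance */
	sum_nr_running = 12;
	idle_cpus = 21;
	printf("stacked: idle_cpus=%d nr_running=%d\n",
	       idle_cpus >= imbalance_max,
	       group_weight - sum_nr_running >= imbalance_max);

	return 0;
}

With the tasks spread out both checks agree (18 >= 15 and 24 - 6 = 18 >= 15);
with the tasks stacked, the idle_cpus check still ignores the imbalance
(21 >= 15) while the nr_running-based form does not (24 - 12 = 12 < 15),
i.e. it balances more often.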


* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2
  2019-12-20 12:40 ` Valentin Schneider
  2019-12-20 14:22   ` Mel Gorman
@ 2019-12-21 11:25   ` Mel Gorman
  1 sibling, 0 replies; 28+ messages in thread
From: Mel Gorman @ 2019-12-21 11:25 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Vincent Guittot, Ingo Molnar, Peter Zijlstra, pauld, srikar,
	quentin.perret, dietmar.eggemann, Morten.Rasmussen, hdanton,
	parth, riel, LKML

On Fri, Dec 20, 2019 at 12:40:02PM +0000, Valentin Schneider wrote:
> > +			 * it is not exactly half as imbalance_adj must be
> > +			 * accounted for or the two domains do not converge
> > +			 * as equally balanced if the number of busy tasks is
> > +			 * roughly the size of one NUMA domain.
> > +			 */
> > +			imbalance_max = (busiest->group_weight >> 1) + imbalance_adj;
> > +			if (env->imbalance <= imbalance_adj &&
> 
> I'm confused now, have we set env->imbalance to anything at this point? AIUI
> Vincent's suggestion was to hinge this purely on the weight vs idle_cpus /
> nr_running, IOW not use imbalance:
> 
> if (sd->flags & SD_NUMA) {                                                                         
> 	imbalance_adj = max(2U, (busiest->group_weight *                                           
> 				 (env->sd->imbalance_pct - 100) / 100) >> 1);                      
> 	imbalance_max = (busiest->group_weight >> 1) + imbalance_adj;                              
>                                                                                                      
> 	if (busiest->idle_cpus >= imbalance_max) {                                                 
> 		env->imbalance = 0;                                                                
> 		return;                                                                            
> 	}                                                                                          
> }                                                                                                  
>        

Ok, I tried this just in case and it does not work out when the number
of CPUs per NUMA node gets very large. It allows very large imbalances
because of the gap between the number of CPUs calculated based on
imbalance_pct and the cutoff based on group_weight >> 1, and it caused
some very large regressions on a modern AMD machine. The version I had
that calculated an imbalance first and then decided whether to ignore it
has consistently been the best approach across multiple machines and
workloads.
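
As a rough illustration of that gap (a sketch only, using a hypothetical
128-CPU NUMA node and an assumed imbalance_pct of 125, not the actual
topology of the AMD machine):

#include <stdio.h>

int main(void)
{
	unsigned int group_weight = 128;	/* hypothetical CPUs per NUMA node */
	unsigned int imbalance_pct = 125;	/* assumed value */
	unsigned int imbalance_adj, imbalance_max;

	imbalance_adj = (group_weight * (imbalance_pct - 100) / 100) >> 1;
	if (imbalance_adj < 2)
		imbalance_adj = 2;
	imbalance_max = (group_weight >> 1) + imbalance_adj;

	/* The imbalance_pct-based allowance stays modest... */
	printf("imbalance_adj = %u\n", imbalance_adj);	/* 16 */

	/* ...but idle_cpus >= imbalance_max on its own keeps ignoring the
	 * imbalance until this many CPUs on the node are busy */
	printf("busy CPUs tolerated = %u\n", group_weight - imbalance_max);	/* 48 */

	return 0;
}

So on a node that size, the idle_cpus check on its own tolerates roughly
three times the imbalance that imbalance_pct alone would suggest.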

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2
  2019-12-20  8:42 [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2 Mel Gorman
  2019-12-20 12:40 ` Valentin Schneider
@ 2019-12-22 12:00 ` Srikar Dronamraju
  2019-12-23 13:31 ` Vincent Guittot
  2 siblings, 0 replies; 28+ messages in thread
From: Srikar Dronamraju @ 2019-12-22 12:00 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Vincent Guittot, Ingo Molnar, Peter Zijlstra, pauld,
	valentin.schneider, quentin.perret, dietmar.eggemann,
	Morten.Rasmussen, hdanton, parth, riel, LKML

Hi Mel,

* Mel Gorman <mgorman@techsingularity.net> [2019-12-20 08:42:52]:

> Changelog since V1
> o Alter code flow 						vincent.guittot
> o Use idle CPUs for comparison instead of sum_nr_running	vincent.guittot
> o Note that the division is still in place. Without it and taking
>   imbalance_adj into account before the cutoff, two NUMA domains
>   do not converge as being equally balanced when the number of
>   busy tasks equals the size of one domain (50% of the sum).
>   Some data is in the changelog.

I still plan to run SpecJBB on the same setup but for now I have a few perf
bench test results. The setup and results are below.
Note that except for numa01, in all other cases the system was not
fully loaded.
Summary: We seem to be regressing in all 5 situations I tried.

lscpu o/p
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              256
On-line CPU(s) list: 0-255
Thread(s) per core:  8
Core(s) per socket:  1
Socket(s):           32
NUMA node(s):        8
Model:               2.1 (pvr 004b 0201)
Model name:          POWER8 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:           64K
L1i cache:           32K
L2 cache:            512K
L3 cache:            8192K
NUMA node0 CPU(s):   0-31
NUMA node1 CPU(s):   32-63
NUMA node2 CPU(s):   64-95
NUMA node3 CPU(s):   96-127
NUMA node4 CPU(s):   128-159
NUMA node5 CPU(s):   160-191
NUMA node6 CPU(s):   192-223
NUMA node7 CPU(s):   224-255

numa01 is a set of 2 processes, each running 128 threads;
each thread doing 50 loops of operations on 3GB of process-shared memory.

numa02 is a single process with 32 threads;
each thread doing 800 loops of operations on 32MB of thread-local memory.

numa03 is a single process with 32 threads;
each thread doing 50 loops of operations on 3GB of process-shared memory.

numa04 is a set of 8 processes (as many as there are nodes), each running 4 threads;
each thread doing 50 loops of operations on 3GB of process-shared memory.

numa05 is a set of 16 processes (twice as many as there are nodes), each running 2 threads;
each thread doing 50 loops of operations on 3GB of process-shared memory.
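
Roughly, each of these boils down to a pattern like the minimal sketch
below (illustrative only, not the actual numa0X.sh scripts; numa02 would
use a per-thread buffer instead of a shared one, and the sizes here are
shrunk so it runs quickly):

#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define NR_THREADS	4		/* the real tests use 2-128 threads */
#define NR_LOOPS	50
#define BUF_SIZE	(16UL << 20)	/* the real tests use 3GB or 32MB */

static char *shared_buf;

static void *worker(void *arg)
{
	unsigned int i;

	/* Each thread repeatedly walks the shared buffer, so where the
	 * threads run relative to the memory dominates the runtime */
	for (i = 0; i < NR_LOOPS; i++)
		memset(shared_buf, i, BUF_SIZE);
	return NULL;
}

int main(void)
{
	pthread_t threads[NR_THREADS];
	int i;

	shared_buf = malloc(BUF_SIZE);
	if (!shared_buf)
		return 1;
	for (i = 0; i < NR_THREADS; i++)
		pthread_create(&threads[i], NULL, worker, NULL);
	for (i = 0; i < NR_THREADS; i++)
		pthread_join(threads[i], NULL);
	free(shared_buf);
	return 0;
}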

(Since we are measuring duration in seconds, lower is better;
-ve %Change would mean duration has increased)

v5.5-rc2
Testcase       Time:  Min        Max        Avg        StdDev
numa01.sh      Real:  479.46     501.07     490.26     10.80
numa01.sh      Sys:   79.74      128.24     103.99     24.25
numa01.sh      User:  118577.01  121002.82  119789.92  1212.90
numa02.sh      Real:  12.53      14.80      13.66      1.14
numa02.sh      Sys:   0.89       1.49       1.19       0.30
numa02.sh      User:  318.74     354.86     336.80     18.06
numa03.sh      Real:  80.19      83.64      81.91      1.73
numa03.sh      Sys:   3.87       3.91       3.89       0.02
numa03.sh      User:  2501.34    2574.68    2538.01    36.67
numa04.sh      Real:  59.62      65.79      62.70      3.09
numa04.sh      Sys:   26.99      29.36      28.17      1.19
numa04.sh      User:  1637.57    1657.55    1647.56    9.99
numa05.sh      Real:  58.03      60.54      59.28      1.26
numa05.sh      Sys:   66.01      66.48      66.25      0.24
numa05.sh      User:  1522.61    1526.41    1524.51    1.90

v5.5-rc2 + patch
Testcase       Time:  Min        Max        Avg        StdDev   %Change
numa01.sh      Real:  518.80     526.75     522.77     3.97     -6.2188%
numa01.sh      Sys:   133.63     154.35     143.99     10.36    -27.7797%
numa01.sh      User:  123964.16  127670.38  125817.27  1853.11  -4.79056%
numa02.sh      Real:  23.70      24.58      24.14      0.44     -43.4134%
numa02.sh      Sys:   1.66       1.99       1.82       0.17     -34.6154%
numa02.sh      User:  525.45     550.45     537.95     12.50    -37.392%
numa03.sh      Real:  111.58     139.86     125.72     14.14    -34.8473%
numa03.sh      Sys:   4.32       4.34       4.33       0.01     -10.1617%
numa03.sh      User:  2837.44    3581.76    3209.60    372.16   -20.9244%
numa04.sh      Real:  102.01     110.14     106.08     4.06     -40.8937%
numa04.sh      Sys:   41.62      41.63      41.62      0.01     -32.3162%
numa04.sh      User:  2402.66    2465.31    2433.98    31.33    -32.31%
numa05.sh      Real:  97.96      109.59     103.78     5.81     -42.8792%
numa05.sh      Sys:   76.57      84.83      80.70      4.13     -17.9058%
numa05.sh      User:  2196.67    2689.63    2443.15    246.48   -37.6006%

vmstat counters

(+ve %Change would mean hits have increased)
vmstat counters for numa01
param                   v5.5-rc2    with_patch  %Change
-----                   ----------  ----------  -------
numa_hint_faults        2075794     2162993     4.20075%
numa_hint_faults_local  363105      393400      8.34332%
numa_hit                1333086     1190972     -10.6605%
numa_local              1332945     1190789     -10.6648%
numa_other              141         183         29.7872%
numa_pages_migrated     303268      316709      4.43205%
numa_pte_updates        2076930     2164327     4.20799%
pgfault                 4773768     4378411     -8.28186%
pgmajfault              288         266         -7.63889%
pgmigrate_success       303268      316709      4.43205%

vmstat counters for numa02
param                   v5.5-rc2    with_patch  %Change
-----                   ----------  ----------  -------
numa_hint_faults        61234       90455       47.7202%
numa_hint_faults_local  31024       69953       125.48%
numa_hit                78931       68528       -13.1799%
numa_local              78931       68528       -13.1799%
numa_pages_migrated     30005       20281       -32.4079%
numa_pte_updates        63694       92503       45.2303%
pgfault                 139196      161149      15.7713%
pgmajfault              65          70          7.69231%
pgmigrate_success       30005       20281       -32.4079%

vmstat counters for numa03
param                   v5.5-rc2    with_patch  %Change
-----                   ----------  ----------  -------
numa_hint_faults        692956      799162      15.3265%
numa_hint_faults_local  95121       167989      76.6056%
numa_hit                256729      223528      -12.9323%
numa_local              256729      223528      -12.9323%
numa_pages_migrated     106499      116825      9.69587%
numa_pte_updates        693020      803373      15.9235%
pgfault                 1003178     1004960     0.177635%
pgmajfault              84          79          -5.95238%
pgmigrate_success       106499      116825      9.69587%

vmstat counters for numa04
param                   v5.5-rc2    with_patch  %Change
-----                   ----------  ----------  -------
numa_hint_faults        4766686     5512065     15.6373%
numa_hint_faults_local  2574830     3040466     18.0841%
numa_hit                1797339     1975390     9.90637%
numa_local              1797330     1975357     9.90508%
numa_pages_migrated     958674      1144133     19.3454%
numa_pte_updates        4784502     5533075     15.6458%
pgfault                 5855953     6546520     11.7926%
pgmajfault              124         118         -4.83871%
pgmigrate_success       958674      1144133     19.3454%

vmstat counters for numa05
param                   v5.5-rc2    with_patch  %Change
-----                   ----------  ----------  -------
numa_hint_faults        11252825    10988104    -2.35248%
numa_hint_faults_local  7379141     8290124     12.3454%
numa_hit                3669519     2964756     -19.2059%
numa_local              3669518     2964702     -19.2073%
numa_pages_migrated     2050570     1696810     -17.2518%
numa_pte_updates        11326394    11003544    -2.85042%
pgfault                 13059535    12747105    -2.39235%
pgmajfault              83          433         421.687%
pgmigrate_success       2050570     1696810     -17.2518%

-- 
Thanks and Regards
Srikar Dronamraju



* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2
  2019-12-20  8:42 [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2 Mel Gorman
  2019-12-20 12:40 ` Valentin Schneider
  2019-12-22 12:00 ` Srikar Dronamraju
@ 2019-12-23 13:31 ` Vincent Guittot
  2019-12-23 13:41   ` Vincent Guittot
  2020-01-03 14:31   ` Mel Gorman
  2 siblings, 2 replies; 28+ messages in thread
From: Vincent Guittot @ 2019-12-23 13:31 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Peter Zijlstra, Phil Auld, Valentin Schneider,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML

On Fri, 20 Dec 2019 at 09:42, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> [...]
> @@ -8671,6 +8667,39 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>                         return;
>                 }
>
> +               /* Consider allowing a small imbalance between NUMA groups */
> +               if (env->sd->flags & SD_NUMA) {
> +                       unsigned int imbalance_adj, imbalance_max;
> +
> +                       /*
> +                        * imbalance_adj is the allowable degree of imbalance
> +                        * to exist between two NUMA domains. It's calculated
> +                        * relative to imbalance_pct with a minimum of two
> +                        * tasks or idle CPUs. The choice of two is due to
> +                        * the most basic case of two communicating tasks
> +                        * that should remain on the same NUMA node after
> +                        * wakeup.
> +                        */
> +                       imbalance_adj = max(2U, (busiest->group_weight *
> +                               (env->sd->imbalance_pct - 100) / 100) >> 1);
> +
> +                       /*
> +                        * Ignore small imbalances unless the busiest sd has
> +                        * almost half as many busy CPUs as there are
> +                        * available CPUs in the busiest group. Note that
> +                        * it is not exactly half as imbalance_adj must be
> +                        * accounted for or the two domains do not converge
> +                        * as equally balanced if the number of busy tasks is
> +                        * roughly the size of one NUMA domain.
> +                        */
> +                       imbalance_max = (busiest->group_weight >> 1) + imbalance_adj;
> +                       if (env->imbalance <= imbalance_adj &&

AFAICT, env->imbalance is undefined there. I have tried your patch
with the below instead

-                       if (env->imbalance <= imbalance_adj &&
-                           busiest->idle_cpus >= imbalance_max) {
+                       if (busiest->idle_cpus >= imbalance_max) {

Sorry for the delay but running tests took more time than expected. I
have applied your patch on top of v5.5-rc3+apparmor fix
I can see an improvement for
hackbench -l (256000/#grp) -g #grp
  1 groups    14.197 +/-0.95%   12.127 +/-1.19% (+14.58%)

I haven't seen any difference otherwise

> +                           busiest->idle_cpus >= imbalance_max) {
> +                               env->imbalance = 0;
> +                               return;
> +                       }
> +               }
> +
>                 if (busiest->group_weight == 1 || sds->prefer_sibling) {
>                         unsigned int nr_diff = busiest->sum_nr_running;
>                         /*


* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2
  2019-12-23 13:31 ` Vincent Guittot
@ 2019-12-23 13:41   ` Vincent Guittot
  2020-01-03 14:31   ` Mel Gorman
  1 sibling, 0 replies; 28+ messages in thread
From: Vincent Guittot @ 2019-12-23 13:41 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Peter Zijlstra, Phil Auld, Valentin Schneider,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML

On Mon, 23 Dec 2019 at 14:31, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Fri, 20 Dec 2019 at 09:42, Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > [...]
> >
> > [valentin.schneider@arm.com: Reformat code flow, correct comment, use idle_cpus]
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > ---
> >  kernel/sched/fair.c | 37 +++++++++++++++++++++++++++++++++----
> >  1 file changed, 33 insertions(+), 4 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 08a233e97a01..60a780e1420e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8637,10 +8637,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> >         /*
> >          * Try to use spare capacity of local group without overloading it or
> >          * emptying busiest.
> > -        * XXX Spreading tasks across NUMA nodes is not always the best policy
> > -        * and special care should be taken for SD_NUMA domain level before
> > -        * spreading the tasks. For now, load_balance() fully relies on
> > -        * NUMA_BALANCING and fbq_classify_group/rq to override the decision.
> >          */
> >         if (local->group_type == group_has_spare) {
> >                 if (busiest->group_type > group_fully_busy) {
> > @@ -8671,6 +8667,39 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> >                         return;
> >                 }
> >
> > +               /* Consider allowing a small imbalance between NUMA groups */
> > +               if (env->sd->flags & SD_NUMA) {
> > +                       unsigned int imbalance_adj, imbalance_max;
> > +
> > +                       /*
> > +                        * imbalance_adj is the allowable degree of imbalance
> > +                        * to exist between two NUMA domains. It's calculated
> > +                        * relative to imbalance_pct with a minimum of two
> > +                        * tasks or idle CPUs. The choice of two is due to
> > +                        * the most basic case of two communicating tasks
> > +                        * that should remain on the same NUMA node after
> > +                        * wakeup.
> > +                        */
> > +                       imbalance_adj = max(2U, (busiest->group_weight *
> > +                               (env->sd->imbalance_pct - 100) / 100) >> 1);
> > +
> > +                       /*
> > +                        * Ignore small imbalances unless the busiest sd has
> > +                        * almost half as many busy CPUs as there are
> > +                        * available CPUs in the busiest group. Note that
> > +                        * it is not exactly half as imbalance_adj must be
> > +                        * accounted for or the two domains do not converge
> > +                        * as equally balanced if the number of busy tasks is
> > +                        * roughly the size of one NUMA domain.
> > +                        */
> > +                       imbalance_max = (busiest->group_weight >> 1) + imbalance_adj;
> > +                       if (env->imbalance <= imbalance_adj &&
>
> AFAICT, env->imbalance is undefined there. I have tried your patch
> with the below instead
>
> -                       if (env->imbalance <= imbalance_adj &&
> -                           busiest->idle_cpus >= imbalance_max) {
> +                       if (busiest->idle_cpus >= imbalance_max) {
>
> Sorry for the delay, but running the tests took more time than expected. I
> have applied your patch on top of v5.5-rc3 + apparmor fix

I forgot to mention that the platform used for the tests is a 2-node,
224-CPU arm64.

> I can see an improvement for
> hackbench -l (256000/#grp) -g #grp
>   1 groups    14.197 +/-0.95%   12.127 +/-1.19% (+14.58%)
>
> I haven't seen any difference otherwise
>
> > +                           busiest->idle_cpus >= imbalance_max) {
> > +                               env->imbalance = 0;
> > +                               return;
> > +                       }
> > +               }
> > +
> >                 if (busiest->group_weight == 1 || sds->prefer_sibling) {
> >                         unsigned int nr_diff = busiest->sum_nr_running;
> >                         /*

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2
  2019-12-23 13:31 ` Vincent Guittot
  2019-12-23 13:41   ` Vincent Guittot
@ 2020-01-03 14:31   ` Mel Gorman
  2020-01-06 13:55     ` Vincent Guittot
  1 sibling, 1 reply; 28+ messages in thread
From: Mel Gorman @ 2020-01-03 14:31 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Phil Auld, Valentin Schneider,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML

On Mon, Dec 23, 2019 at 02:31:40PM +0100, Vincent Guittot wrote:
> > +               /* Consider allowing a small imbalance between NUMA groups */
> > +               if (env->sd->flags & SD_NUMA) {
> > +                       unsigned int imbalance_adj, imbalance_max;
> > +
> > +                       /*
> > +                        * imbalance_adj is the allowable degree of imbalance
> > +                        * to exist between two NUMA domains. It's calculated
> > +                        * relative to imbalance_pct with a minimum of two
> > +                        * tasks or idle CPUs. The choice of two is due to
> > +                        * the most basic case of two communicating tasks
> > +                        * that should remain on the same NUMA node after
> > +                        * wakeup.
> > +                        */
> > +                       imbalance_adj = max(2U, (busiest->group_weight *
> > +                               (env->sd->imbalance_pct - 100) / 100) >> 1);
> > +
> > +                       /*
> > +                        * Ignore small imbalances unless the busiest sd has
> > +                        * almost half as many busy CPUs as there are
> > +                        * available CPUs in the busiest group. Note that
> > +                        * it is not exactly half as imbalance_adj must be
> > +                        * accounted for or the two domains do not converge
> > +                        * as equally balanced if the number of busy tasks is
> > +                        * roughly the size of one NUMA domain.
> > +                        */
> > +                       imbalance_max = (busiest->group_weight >> 1) + imbalance_adj;
> > +                       if (env->imbalance <= imbalance_adj &&
> 
> AFAICT, env->imbalance is undefined there. I have tried your patch
> with the below instead
> 

Even when fixed, other corner cases were hit for parallelised loads that
benefit from spreading early. At the moment I'm testing a variant of the
first approach, except that it adjusts small imbalances at low utilisation
and otherwise balances as normal. It appears to work for basic
communicating tasks at relatively low utilisation that fits within a node
without obviously impacting higher-utilisation non-communicating
workloads, but more testing time will be needed to be sure.

It's still based on sum_nr_running as a cut-off instead of idle_cpus
because, at low utilisation, there is not much of a material difference
between the cut-offs given that either approach can be misleading
depending on the load of the individual tasks, any CPU binding
configurations and the degree to which the tasks are memory-bound.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ba749f579714..58ba2f0c6363 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8648,10 +8648,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	/*
 	 * Try to use spare capacity of local group without overloading it or
 	 * emptying busiest.
-	 * XXX Spreading tasks across NUMA nodes is not always the best policy
-	 * and special care should be taken for SD_NUMA domain level before
-	 * spreading the tasks. For now, load_balance() fully relies on
-	 * NUMA_BALANCING and fbq_classify_group/rq to override the decision.
 	 */
 	if (local->group_type == group_has_spare) {
 		if (busiest->group_type > group_fully_busy) {
@@ -8691,16 +8686,39 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 			env->migration_type = migrate_task;
 			lsub_positive(&nr_diff, local->sum_nr_running);
 			env->imbalance = nr_diff >> 1;
-			return;
-		}
+		} else {
 
-		/*
-		 * If there is no overload, we just want to even the number of
-		 * idle cpus.
-		 */
-		env->migration_type = migrate_task;
-		env->imbalance = max_t(long, 0, (local->idle_cpus -
+			/*
+			 * If there is no overload, we just want to even the number of
+			 * idle cpus.
+			 */
+			env->migration_type = migrate_task;
+			env->imbalance = max_t(long, 0, (local->idle_cpus -
 						 busiest->idle_cpus) >> 1);
+		}
+
+		/* Consider allowing a small imbalance between NUMA groups */
+		if (env->sd->flags & SD_NUMA) {
+			long imbalance_adj, imbalance_max;
+
+			/*
+			 * imbalance_adj is the allowable degree of imbalance
+			 * to exist between two NUMA domains. imbalance_pct
+			 * is used to estimate the number of active tasks
+			 * needed before memory bandwidth may be as important
+			 * as memory locality.
+			 */
+			imbalance_adj = (100 / (env->sd->imbalance_pct - 100)) - 1;
+
+			/*
+			 * Allow small imbalances when the busiest group has
+			 * low utilisation.
+			 */
+			imbalance_max = imbalance_adj << 1;
+			if (busiest->sum_nr_running < imbalance_max)
+				env->imbalance -= min(env->imbalance, imbalance_adj);
+		}
+
 		return;
 	}
 

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2
  2020-01-03 14:31   ` Mel Gorman
@ 2020-01-06 13:55     ` Vincent Guittot
  2020-01-06 14:52       ` Mel Gorman
  0 siblings, 1 reply; 28+ messages in thread
From: Vincent Guittot @ 2020-01-06 13:55 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Peter Zijlstra, Phil Auld, Valentin Schneider,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML

On Fri, 3 Jan 2020 at 15:31, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Mon, Dec 23, 2019 at 02:31:40PM +0100, Vincent Guittot wrote:
> > > +               /* Consider allowing a small imbalance between NUMA groups */
> > > +               if (env->sd->flags & SD_NUMA) {
> > > +                       unsigned int imbalance_adj, imbalance_max;
> > > +
> > > +                       /*
> > > +                        * imbalance_adj is the allowable degree of imbalance
> > > +                        * to exist between two NUMA domains. It's calculated
> > > +                        * relative to imbalance_pct with a minimum of two
> > > +                        * tasks or idle CPUs. The choice of two is due to
> > > +                        * the most basic case of two communicating tasks
> > > +                        * that should remain on the same NUMA node after
> > > +                        * wakeup.
> > > +                        */
> > > +                       imbalance_adj = max(2U, (busiest->group_weight *
> > > +                               (env->sd->imbalance_pct - 100) / 100) >> 1);
> > > +
> > > +                       /*
> > > +                        * Ignore small imbalances unless the busiest sd has
> > > +                        * almost half as many busy CPUs as there are
> > > +                        * available CPUs in the busiest group. Note that
> > > +                        * it is not exactly half as imbalance_adj must be
> > > +                        * accounted for or the two domains do not converge
> > > +                        * as equally balanced if the number of busy tasks is
> > > +                        * roughly the size of one NUMA domain.
> > > +                        */
> > > +                       imbalance_max = (busiest->group_weight >> 1) + imbalance_adj;
> > > +                       if (env->imbalance <= imbalance_adj &&
> >
> > AFAICT, env->imbalance is undefined there. I have tried your patch
> > with the below instead
> >
>
> Even when fixed, other corner cases were hit for parallelised loads that
> benefit from spreading early. At the moment I'm testing a variant of the
> first approach except it adjust small balances at low utilisation and
> otherwise balances at normal. It appears to work for basic communicating
> tasks at relatively low utitisation that fits within a node without
> obviously impacting higher utilisation non-communicating workloads but
> more testing time will be needed to be sure.
>
> It's still based on sum_nr_running as a cut-off instead of idle_cpus as
> at low utilisation, there is not much of a material difference in the
> cut-offs given that either approach can be misleading depending on the
> load of the individual tasks, any cpu binding configurations and the
> degree the tasks are memory-bound.
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ba749f579714..58ba2f0c6363 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8648,10 +8648,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>         /*
>          * Try to use spare capacity of local group without overloading it or
>          * emptying busiest.
> -        * XXX Spreading tasks across NUMA nodes is not always the best policy
> -        * and special care should be taken for SD_NUMA domain level before
> -        * spreading the tasks. For now, load_balance() fully relies on
> -        * NUMA_BALANCING and fbq_classify_group/rq to override the decision.
>          */
>         if (local->group_type == group_has_spare) {
>                 if (busiest->group_type > group_fully_busy) {
> @@ -8691,16 +8686,39 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>                         env->migration_type = migrate_task;
>                         lsub_positive(&nr_diff, local->sum_nr_running);
>                         env->imbalance = nr_diff >> 1;
> -                       return;
> -               }
> +               } else {
>
> -               /*
> -                * If there is no overload, we just want to even the number of
> -                * idle cpus.
> -                */
> -               env->migration_type = migrate_task;
> -               env->imbalance = max_t(long, 0, (local->idle_cpus -
> +                       /*
> +                        * If there is no overload, we just want to even the number of
> +                        * idle cpus.
> +                        */
> +                       env->migration_type = migrate_task;
> +                       env->imbalance = max_t(long, 0, (local->idle_cpus -
>                                                  busiest->idle_cpus) >> 1);
> +               }
> +
> +               /* Consider allowing a small imbalance between NUMA groups */
> +               if (env->sd->flags & SD_NUMA) {
> +                       long imbalance_adj, imbalance_max;
> +
> +                       /*
> +                        * imbalance_adj is the allowable degree of imbalance
> +                        * to exist between two NUMA domains. imbalance_pct
> +                        * is used to estimate the number of active tasks
> +                        * needed before memory bandwidth may be as important
> +                        * as memory locality.
> +                        */
> +                       imbalance_adj = (100 / (env->sd->imbalance_pct - 100)) - 1;

This looks weird to me because you use imbalance_pct, which is only
meaningful for comparing a ratio, to define a number that will then be
compared to a number of tasks without taking into account the weight
of the node. So whatever the node size, 32 or 128 CPUs, the
imbalance_adj will be the same: 3 with the default imbalance_pct at
NUMA level, which is 125 AFAICT

> +
> +                       /*
> +                        * Allow small imbalances when the busiest group has
> +                        * low utilisation.
> +                        */
> +                       imbalance_max = imbalance_adj << 1;

Why do you add this shift?

So according to the above, imbalance_max = 6 whatever the size of the node
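
Spelling that arithmetic out in case it helps anyone following along (my
own working, just plugging in the default value):

	imbalance_adj = (100 / (125 - 100)) - 1;	/* (100 / 25) - 1 = 3 */
	imbalance_max = imbalance_adj << 1;		/* 3 << 1         = 6 */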


Regards,
Vincent

> +                       if (busiest->sum_nr_running < imbalance_max)
> +                               env->imbalance -= min(env->imbalance, imbalance_adj);
> +               }
> +
>                 return;
>         }
>
>
> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2
  2020-01-06 13:55     ` Vincent Guittot
@ 2020-01-06 14:52       ` Mel Gorman
  2020-01-07  8:38         ` Vincent Guittot
  0 siblings, 1 reply; 28+ messages in thread
From: Mel Gorman @ 2020-01-06 14:52 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Phil Auld, Valentin Schneider,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML

Sorry I sent out v3 before seeing this email as my mail only synchronises
periodically.

On Mon, Jan 06, 2020 at 02:55:00PM +0100, Vincent Guittot wrote:
> > -                       return;
> > -               }
> > +               } else {
> >
> > -               /*
> > -                * If there is no overload, we just want to even the number of
> > -                * idle cpus.
> > -                */
> > -               env->migration_type = migrate_task;
> > -               env->imbalance = max_t(long, 0, (local->idle_cpus -
> > +                       /*
> > +                        * If there is no overload, we just want to even the number of
> > +                        * idle cpus.
> > +                        */
> > +                       env->migration_type = migrate_task;
> > +                       env->imbalance = max_t(long, 0, (local->idle_cpus -
> >                                                  busiest->idle_cpus) >> 1);
> > +               }
> > +
> > +               /* Consider allowing a small imbalance between NUMA groups */
> > +               if (env->sd->flags & SD_NUMA) {
> > +                       long imbalance_adj, imbalance_max;
> > +
> > +                       /*
> > +                        * imbalance_adj is the allowable degree of imbalance
> > +                        * to exist between two NUMA domains. imbalance_pct
> > +                        * is used to estimate the number of active tasks
> > +                        * needed before memory bandwidth may be as important
> > +                        * as memory locality.
> > +                        */
> > +                       imbalance_adj = (100 / (env->sd->imbalance_pct - 100)) - 1;
> 
> This looks weird to me because you use imbalance_pct, which is
> meaningful only compare a ratio, to define a number that will be then
> compared to a number of tasks without taking into account the weight
> of the node. So whatever the node size, 32 or 128 CPUs, the
> imbalance_adj will be the same: 3 with the default imbalance_pct of
> NUMA level which is 125 AFAICT
> 

The intent in this version was to only cover the low utilisation case
regardless of the NUMA node size. There were too many corner cases
where the tradeoff of local memory latency versus local memory bandwidth
could not be quantified. See Srikar's report as an example but it's a
general problem. The use of imbalance_pct was simply to find the smallest
number of running tasks where (imbalance_pct - 100) would be 1 running task
and to limit the patch to addressing the low utilisation case only. It could
simply be hard-coded but that would ignore cases where an architecture
overrides imbalance_pct. I'm open to suggestions on how we could identify
the point where imbalances can be ignored without hitting other corner cases.

> > +
> > +                       /*
> > +                        * Allow small imbalances when the busiest group has
> > +                        * low utilisation.
> > +                        */
> > +                       imbalance_max = imbalance_adj << 1;
> 
> Why do you add this shift ?
> 

For very low utilisation, there is no balancing between nodes. For slightly
above that, there is limited balancing. After that, the load balancing
behaviour is unchanged as I believe we cannot determine if memory latency
or memory bandwidth is more important for arbitrary workloads.
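
As a rough worked example against the hunk above (my own illustrative
numbers, assuming the default imbalance_pct of 125 so that imbalance_adj
is 3 and imbalance_max is 6):

	/*
	 * busiest->sum_nr_running = 2, env->imbalance = 1:
	 *	1 - min(1, 3) = 0	-> no balancing between nodes
	 * busiest->sum_nr_running = 5, env->imbalance = 4:
	 *	4 - min(4, 3) = 1	-> limited balancing
	 * busiest->sum_nr_running = 8 (not < imbalance_max):
	 *	imbalance untouched	-> behaviour unchanged
	 */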

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2
  2020-01-06 14:52       ` Mel Gorman
@ 2020-01-07  8:38         ` Vincent Guittot
  2020-01-07  9:56           ` Mel Gorman
  0 siblings, 1 reply; 28+ messages in thread
From: Vincent Guittot @ 2020-01-07  8:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Peter Zijlstra, Phil Auld, Valentin Schneider,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML

On Mon, 6 Jan 2020 at 15:52, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> Sorry I sent out v3 before seeing this email as my mail only synchronises
> periodically.
>
> On Mon, Jan 06, 2020 at 02:55:00PM +0100, Vincent Guittot wrote:
> > > -                       return;
> > > -               }
> > > +               } else {
> > >
> > > -               /*
> > > -                * If there is no overload, we just want to even the number of
> > > -                * idle cpus.
> > > -                */
> > > -               env->migration_type = migrate_task;
> > > -               env->imbalance = max_t(long, 0, (local->idle_cpus -
> > > +                       /*
> > > +                        * If there is no overload, we just want to even the number of
> > > +                        * idle cpus.
> > > +                        */
> > > +                       env->migration_type = migrate_task;
> > > +                       env->imbalance = max_t(long, 0, (local->idle_cpus -
> > >                                                  busiest->idle_cpus) >> 1);
> > > +               }
> > > +
> > > +               /* Consider allowing a small imbalance between NUMA groups */
> > > +               if (env->sd->flags & SD_NUMA) {
> > > +                       long imbalance_adj, imbalance_max;
> > > +
> > > +                       /*
> > > +                        * imbalance_adj is the allowable degree of imbalance
> > > +                        * to exist between two NUMA domains. imbalance_pct
> > > +                        * is used to estimate the number of active tasks
> > > +                        * needed before memory bandwidth may be as important
> > > +                        * as memory locality.
> > > +                        */
> > > +                       imbalance_adj = (100 / (env->sd->imbalance_pct - 100)) - 1;
> >
> > This looks weird to me because you use imbalance_pct, which is
> > meaningful only compare a ratio, to define a number that will be then
> > compared to a number of tasks without taking into account the weight
> > of the node. So whatever the node size, 32 or 128 CPUs, the
> > imbalance_adj will be the same: 3 with the default imbalance_pct of
> > NUMA level which is 125 AFAICT
> >
>
> The intent in this version was to only cover the low utilisation case
> regardless of the NUMA node size. There were too many corner cases
> where the tradeoff of local memory latency versus local memory bandwidth
> cannot be quantified. See Srikar's report as an example but it's a general
> problem. The use of imbalance_pct was simply to find the smallest number of
> running tasks were (imbalance_pct - 100) would be 1 running task and limit

But using imbalance_pct alone doesn't mean anything. Using something similar to the below

    busiest->group_weight * (env->sd->imbalance_pct - 100) / 100

would be more meaningful

Or you could use util_avg, so you will take into account whether the
tasks are short-running or long-running ones.
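
Something along these lines, as a very rough and untested sketch where
the 1/4 threshold is purely for illustration:

	/*
	 * Treat the busiest group as lightly utilised when its aggregate
	 * utilisation is well below its capacity. The divisor is arbitrary
	 * and only here to show the shape of the check.
	 */
	if (busiest->group_util * 4 < busiest->group_capacity)
		env->imbalance = 0;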

> the patch to address the low utilisation case only. It could be simply
> hard-coded but that would ignore cases where an architecture overrides
> imbalance_pct. I'm open to suggestion on how we could identify the point
> where imbalances can be ignored without hitting other corner cases.
>
> > > +
> > > +                       /*
> > > +                        * Allow small imbalances when the busiest group has
> > > +                        * low utilisation.
> > > +                        */
> > > +                       imbalance_max = imbalance_adj << 1;
> >
> > Why do you add this shift ?
> >
>
> For very low utilisation, there is no balancing between nodes. For slightly
> above that, there is limited balancing. After that, the load balancing
> behaviour is unchanged as I believe we cannot determine if memory latency
> or memory bandwidth is more important for arbitrary workloads.
>
> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2
  2020-01-07  8:38         ` Vincent Guittot
@ 2020-01-07  9:56           ` Mel Gorman
  2020-01-07 11:17             ` Vincent Guittot
                               ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Mel Gorman @ 2020-01-07  9:56 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Phil Auld, Valentin Schneider,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML

On Tue, Jan 07, 2020 at 09:38:26AM +0100, Vincent Guittot wrote:
> > > This looks weird to me because you use imbalance_pct, which is
> > > meaningful only compare a ratio, to define a number that will be then
> > > compared to a number of tasks without taking into account the weight
> > > of the node. So whatever the node size, 32 or 128 CPUs, the
> > > imbalance_adj will be the same: 3 with the default imbalance_pct of
> > > NUMA level which is 125 AFAICT
> > >
> >
> > The intent in this version was to only cover the low utilisation case
> > regardless of the NUMA node size. There were too many corner cases
> > where the tradeoff of local memory latency versus local memory bandwidth
> > cannot be quantified. See Srikar's report as an example but it's a general
> > problem. The use of imbalance_pct was simply to find the smallest number of
> > running tasks were (imbalance_pct - 100) would be 1 running task and limit
> 
> But using imbalance_pct alone doesn't mean anything.

Other than figuring out "how many running tasks are required before
imbalance_pct is roughly equivalent to one fully active CPU?". Even then,
it's a bit weak as imbalance_pct makes hard-coded assumptions on what
the tradeoff of cross-domain migration is without considering the hardware.

> Using similar to the below
> 
>     busiest->group_weight * (env->sd->imbalance_pct - 100) / 100
> 
> would be more meaningful
> 

It's meaningful in some sense from a conceptual point of view, but
setting the low utilisation cutoff depending on the number of CPUs on
the node does not account for any local memory latency vs bandwidth,
i.e. on a small or mid-range machine the cutoff will make sense. On
larger machines, the cutoff could be at the point where memory bandwidth
is saturated, leading to a scenario whereby upgrading to a larger
machine performs worse than the smaller machine.

Much more importantly, doing what you suggest allows an imbalance
of more CPUs than are backed by a single LLC. On high-end AMD EPYC 2
machines, busiest->group_weight scaled by imbalance_pct spans multiple L3
caches. That is going to have side-effects. While I also do not account
for the LLC group_weight, it's unlikely that the cut-off I used would be
larger than an LLC cache on a large machine.

These two points are why I didn't take the group weight into account.

Now if you want, I can do what you suggest anyway, as long as you are happy
that the child domain weight is also taken into account and that the
largest possible allowed imbalance is bounded to deal with the case of a
node having multiple small LLC caches. That means that some machines will
be using the size of the node and some machines will use the size of an
LLC. It's less predictable overall, as some machines will be "special"
relative to others, making it harder to reproduce certain problems locally,
but it would take imbalance_pct into account in a way that you're happy with.

Also bear in mind that whether LLC is accounted for or not, the final
result should be halved similar to the other imbalance calculations to
avoid over or under load balancing.

> Or you could use the util_avg so you will take into account if the
> tasks are short running ones or long running ones
> 

util_avg can be skewed if there are big outliers. Even then, it's not
a great metric for the low utilisation cutoff. Large numbers of mostly
idle but running tasks would be treated similarly to small numbers of
fully active tasks. It's less predictable and harder to reason about how
load balancing behaves across a variety of workloads.

Based on what you suggest, the result looks like this (build tested
only)


diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ba749f579714..1b2c7bed2db5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8648,10 +8648,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	/*
 	 * Try to use spare capacity of local group without overloading it or
 	 * emptying busiest.
-	 * XXX Spreading tasks across NUMA nodes is not always the best policy
-	 * and special care should be taken for SD_NUMA domain level before
-	 * spreading the tasks. For now, load_balance() fully relies on
-	 * NUMA_BALANCING and fbq_classify_group/rq to override the decision.
 	 */
 	if (local->group_type == group_has_spare) {
 		if (busiest->group_type > group_fully_busy) {
@@ -8691,16 +8687,41 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 			env->migration_type = migrate_task;
 			lsub_positive(&nr_diff, local->sum_nr_running);
 			env->imbalance = nr_diff >> 1;
-			return;
-		}
+		} else {
 
-		/*
-		 * If there is no overload, we just want to even the number of
-		 * idle cpus.
-		 */
-		env->migration_type = migrate_task;
-		env->imbalance = max_t(long, 0, (local->idle_cpus -
+			/*
+			 * If there is no overload, we just want to even the number of
+			 * idle cpus.
+			 */
+			env->migration_type = migrate_task;
+			env->imbalance = max_t(long, 0, (local->idle_cpus -
 						 busiest->idle_cpus) >> 1);
+		}
+
+		/* Consider allowing a small imbalance between NUMA groups */
+		if (env->sd->flags & SD_NUMA) {
+			struct sched_domain *child = env->sd->child;
+			unsigned int imbalance_adj;
+
+			/*
+			 * Calculate an acceptable degree of imbalance based
+			 * on imbalance_adj. However, do not allow a greater
+			 * imbalance than the child domains weight to avoid
+			 * a case where the allowed imbalance spans multiple
+			 * LLCs.
+			 */
+			imbalance_adj = busiest->group_weight * (env->sd->imbalance_pct - 100) / 100;
+			imbalance_adj = min(imbalance_adj, child->span_weight);
+			imbalance_adj >>= 1;
+
+			/*
+			 * Ignore small imbalances when the busiest group has
+			 * low utilisation.
+			 */
+			if (busiest->sum_nr_running < imbalance_adj)
+				env->imbalance = 0;
+		}
+
 		return;
 	}
 
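For a sense of scale, a back-of-envelope run of the above on a 128-CPU
node with the default imbalance_pct of 125 (illustrative numbers only,
not measured):

	/*
	 * imbalance_adj = 128 * (125 - 100) / 100	= 32
	 * imbalance_adj = min(32, child->span_weight)
	 * imbalance_adj >>= 1
	 *
	 * If the child domain spans the whole node, the min() does not bite
	 * and the low utilisation cutoff ends up at 16 running tasks.
	 */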

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2
  2020-01-07  9:56           ` Mel Gorman
@ 2020-01-07 11:17             ` Vincent Guittot
  2020-01-07 11:56               ` Mel Gorman
  2020-01-07 11:22             ` Peter Zijlstra
  2020-01-07 19:26             ` Phil Auld
  2 siblings, 1 reply; 28+ messages in thread
From: Vincent Guittot @ 2020-01-07 11:17 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Peter Zijlstra, Phil Auld, Valentin Schneider,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML

On Tue, 7 Jan 2020 at 10:56, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Tue, Jan 07, 2020 at 09:38:26AM +0100, Vincent Guittot wrote:
> > > > This looks weird to me because you use imbalance_pct, which is
> > > > meaningful only compare a ratio, to define a number that will be then
> > > > compared to a number of tasks without taking into account the weight
> > > > of the node. So whatever the node size, 32 or 128 CPUs, the
> > > > imbalance_adj will be the same: 3 with the default imbalance_pct of
> > > > NUMA level which is 125 AFAICT
> > > >
> > >
> > > The intent in this version was to only cover the low utilisation case
> > > regardless of the NUMA node size. There were too many corner cases
> > > where the tradeoff of local memory latency versus local memory bandwidth
> > > cannot be quantified. See Srikar's report as an example but it's a general
> > > problem. The use of imbalance_pct was simply to find the smallest number of
> > > running tasks were (imbalance_pct - 100) would be 1 running task and limit
> >
> > But using imbalance_pct alone doesn't mean anything.
>
> Other than figuring out "how many running tasks are required before
> imbalance_pct is roughly equivalent to one fully active CPU?". Even then,

Sorry, I don't see how you deduce this from only using imbalance_pct,
which is mainly there to say what percentage of difference is
significant

> it's a bit weak as imbalance_pct makes hard-coded assumptions on what
> the tradeoff of cross-domain migration is without considering the hardware.
>
> > Using similar to the below
> >
> >     busiest->group_weight * (env->sd->imbalance_pct - 100) / 100
> >
> > would be more meaningful
> >
>
> It's meaningful to some sense from a conceptual point of view but
> setting the low utilisation cutoff depending on the number of CPUs on
> the node does not account for any local memory latency vs bandwidth.
> i.e. on a small or mid-range machine the cutoff will make sense. On
> larger machines, the cutoff could be at the point where memory bandwidth
> is saturated leading to a scenario whereby upgrading to a larger
> machine performs worse than the smaller machine.
>
> Much more importantly, doing what you suggest allows an imbalance
> of more CPUs than are backed by a single LLC. On high-end AMD EPYC 2
> machines, busiest->group_weight scaled by imbalance_pct spans multiple L3
> caches. That is going to have side-effects. While I also do not account
> for the LLC group_weight, it's unlikely the cut-off I used would be
> smaller than an LLC cache on a large machine as the cache.
>
> These two points are why I didn't take the group weight into account.
>
> Now if you want, I can do what you suggest anyway as long as you are happy
> that the child domain weight is also taken into account and to bound the

Taking the child domain into account makes sense to me, but shouldn't we
take into account the number of child groups instead? This should
reflect the number of different LLC caches.
IIUC your reasoning, we want to make sure that tasks will not start to
fight over the same resources, which in this case is the connection
between the LLC cache and memory

> largest possible allowed imbalance to deal with the case of a node having
> multiple small LLC caches. That means that some machines will be using the
> size of the node and some machines will use the size of an LLC. It's less
> predictable overall as some machines will be "special" relative to others
> making it harder to reproduce certain problems locally but it would take
> imbalance_pct into account in a way that you're happy with.
>
> Also bear in mind that whether LLC is accounted for or not, the final
> result should be halved similar to the other imbalance calculations to
> avoid over or under load balancing.
>
> > Or you could use the util_avg so you will take into account if the
> > tasks are short running ones or long running ones
> >
>
> util_avg can be skewed if there are big outliers. Even then, it's not
> a great metric for the low utilisation cutoff. Large numbers of mostly
> idle but running tasks would be treated similarly to small numbers of
> fully active tasks. It's less predictable and harder to reason about how

Yes, but this also has the advantage of reflecting more accurately how
the system is used.
With nr_running, we consider that mostly idle and fully active tasks
will have the exact same impact on the memory

> load balancing behaves across a variety of workloads.
>
> Based on what you suggest, the result looks like this (build tested
> only)

I'm going to give this patch a try

>
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ba749f579714..1b2c7bed2db5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8648,10 +8648,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>         /*
>          * Try to use spare capacity of local group without overloading it or
>          * emptying busiest.
> -        * XXX Spreading tasks across NUMA nodes is not always the best policy
> -        * and special care should be taken for SD_NUMA domain level before
> -        * spreading the tasks. For now, load_balance() fully relies on
> -        * NUMA_BALANCING and fbq_classify_group/rq to override the decision.
>          */
>         if (local->group_type == group_has_spare) {
>                 if (busiest->group_type > group_fully_busy) {
> @@ -8691,16 +8687,41 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>                         env->migration_type = migrate_task;
>                         lsub_positive(&nr_diff, local->sum_nr_running);
>                         env->imbalance = nr_diff >> 1;
> -                       return;
> -               }
> +               } else {
>
> -               /*
> -                * If there is no overload, we just want to even the number of
> -                * idle cpus.
> -                */
> -               env->migration_type = migrate_task;
> -               env->imbalance = max_t(long, 0, (local->idle_cpus -
> +                       /*
> +                        * If there is no overload, we just want to even the number of
> +                        * idle cpus.
> +                        */
> +                       env->migration_type = migrate_task;
> +                       env->imbalance = max_t(long, 0, (local->idle_cpus -
>                                                  busiest->idle_cpus) >> 1);
> +               }
> +
> +               /* Consider allowing a small imbalance between NUMA groups */
> +               if (env->sd->flags & SD_NUMA) {
> +                       struct sched_domain *child = env->sd->child;
> +                       unsigned int imbalance_adj;
> +
> +                       /*
> +                        * Calculate an acceptable degree of imbalance based
> +                        * on imbalance_adj. However, do not allow a greater
> +                        * imbalance than the child domains weight to avoid
> +                        * a case where the allowed imbalance spans multiple
> +                        * LLCs.
> +                        */
> +                       imbalance_adj = busiest->group_weight * (env->sd->imbalance_pct - 100) / 100;
> +                       imbalance_adj = min(imbalance_adj, child->span_weight);
> +                       imbalance_adj >>= 1;
> +
> +                       /*
> +                        * Ignore small imbalances when the busiest group has
> +                        * low utilisation.
> +                        */
> +                       if (busiest->sum_nr_running < imbalance_adj)
> +                               env->imbalance = 0;
> +               }
> +
>                 return;
>         }
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2
  2020-01-07  9:56           ` Mel Gorman
  2020-01-07 11:17             ` Vincent Guittot
@ 2020-01-07 11:22             ` Peter Zijlstra
  2020-01-07 11:42               ` Mel Gorman
  2020-01-07 12:28               ` Peter Zijlstra
  2020-01-07 19:26             ` Phil Auld
  2 siblings, 2 replies; 28+ messages in thread
From: Peter Zijlstra @ 2020-01-07 11:22 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Vincent Guittot, Ingo Molnar, Phil Auld, Valentin Schneider,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML

On Tue, Jan 07, 2020 at 09:56:55AM +0000, Mel Gorman wrote:

> Much more importantly, doing what you suggest allows an imbalance
> of more CPUs than are backed by a single LLC. On high-end AMD EPYC 2
> machines, busiest->group_weight scaled by imbalance_pct spans multiple L3
> caches. That is going to have side-effects. While I also do not account
> for the LLC group_weight, it's unlikely the cut-off I used would be
> smaller than an LLC cache on a large machine as the cache.
> 
> These two points are why I didn't take the group weight into account.
> 
> Now if you want, I can do what you suggest anyway as long as you are happy
> that the child domain weight is also taken into account and to bound the
> largest possible allowed imbalance to deal with the case of a node having
> multiple small LLC caches. That means that some machines will be using the
> size of the node and some machines will use the size of an LLC. It's less
> predictable overall as some machines will be "special" relative to others
> making it harder to reproduce certain problems locally but it would take
> imbalance_pct into account in a way that you're happy with.
> 
> Also bear in mind that whether LLC is accounted for or not, the final
> result should be halved similar to the other imbalance calculations to
> avoid over or under load balancing.

> +		/* Consider allowing a small imbalance between NUMA groups */
> +		if (env->sd->flags & SD_NUMA) {
> +			struct sched_domain *child = env->sd->child;

This assumes sd->child exists, which should be true for NUMA domains I
suppose.

> +			unsigned int imbalance_adj;
> +
> +			/*
> +			 * Calculate an acceptable degree of imbalance based
> +			 * on imbalance_adj. However, do not allow a greater
> +			 * imbalance than the child domains weight to avoid
> +			 * a case where the allowed imbalance spans multiple
> +			 * LLCs.
> +			 */

That comment is a wee bit misleading; @child is not an LLC per se. This
could be the NUMA distance-2 domain, in which case @child is the NUMA
distance-1 group.

That said, even then it probably makes sense to ensure you don't idle a
whole smaller distance group.

> +			imbalance_adj = busiest->group_weight * (env->sd->imbalance_pct - 100) / 100;
> +			imbalance_adj = min(imbalance_adj, child->span_weight);
> +			imbalance_adj >>= 1;
> +
> +			/*
> +			 * Ignore small imbalances when the busiest group has
> +			 * low utilisation.
> +			 */
> +			if (busiest->sum_nr_running < imbalance_adj)
> +				env->imbalance = 0;
> +		}
> +
>  		return;
>  	}
>  

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2
  2020-01-07 11:22             ` Peter Zijlstra
@ 2020-01-07 11:42               ` Mel Gorman
  2020-01-07 12:29                 ` Peter Zijlstra
  2020-01-07 12:28               ` Peter Zijlstra
  1 sibling, 1 reply; 28+ messages in thread
From: Mel Gorman @ 2020-01-07 11:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Ingo Molnar, Phil Auld, Valentin Schneider,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML

On Tue, Jan 07, 2020 at 12:22:55PM +0100, Peter Zijlstra wrote:
> > Much more importantly, doing what you suggest allows an imbalance
> > of more CPUs than are backed by a single LLC. On high-end AMD EPYC 2
> > machines, busiest->group_weight scaled by imbalance_pct spans multiple L3
> > caches. That is going to have side-effects. While I also do not account
> > for the LLC group_weight, it's unlikely the cut-off I used would be
> > smaller than an LLC cache on a large machine as the cache.
> > 
> > These two points are why I didn't take the group weight into account.
> > 
> > Now if you want, I can do what you suggest anyway as long as you are happy
> > that the child domain weight is also taken into account and to bound the
> > largest possible allowed imbalance to deal with the case of a node having
> > multiple small LLC caches. That means that some machines will be using the
> > size of the node and some machines will use the size of an LLC. It's less
> > predictable overall as some machines will be "special" relative to others
> > making it harder to reproduce certain problems locally but it would take
> > imbalance_pct into account in a way that you're happy with.
> > 
> > Also bear in mind that whether LLC is accounted for or not, the final
> > result should be halved similar to the other imbalance calculations to
> > avoid over or under load balancing.
> 
> > +		/* Consider allowing a small imbalance between NUMA groups */
> > +		if (env->sd->flags & SD_NUMA) {
> > +			struct sched_domain *child = env->sd->child;
> 
> This assumes sd-child exists, which should be true for NUMA domains I
> suppose.
> 

I would be stunned if it was not. What sort of NUMA domain would not have
child domains? Does a memory-only NUMA node with no CPUs even generate
a scheduler domain? If it does, then I guess the check is necessary.

> > +			unsigned int imbalance_adj;
> > +
> > +			/*
> > +			 * Calculate an acceptable degree of imbalance based
> > +			 * on imbalance_adj. However, do not allow a greater
> > +			 * imbalance than the child domains weight to avoid
> > +			 * a case where the allowed imbalance spans multiple
> > +			 * LLCs.
> > +			 */
> 
> That comment is a wee misleading, @child is not an LLC per se. This
> could be the NUMA distance 2 domain, in which case @child is the NUMA
> distance 1 group.
> 
> That said, even then it probably makes sense to ensure you don't idle a
> whole smaller distance group.
> 

I hadn't considered that case but even then, it's just a comment fix.
Thanks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2
  2020-01-07 11:17             ` Vincent Guittot
@ 2020-01-07 11:56               ` Mel Gorman
  2020-01-07 16:00                 ` Vincent Guittot
  0 siblings, 1 reply; 28+ messages in thread
From: Mel Gorman @ 2020-01-07 11:56 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Phil Auld, Valentin Schneider,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML

On Tue, Jan 07, 2020 at 12:17:12PM +0100, Vincent Guittot wrote:
> On Tue, 7 Jan 2020 at 10:56, Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > On Tue, Jan 07, 2020 at 09:38:26AM +0100, Vincent Guittot wrote:
> > > > > This looks weird to me because you use imbalance_pct, which is
> > > > > meaningful only compare a ratio, to define a number that will be then
> > > > > compared to a number of tasks without taking into account the weight
> > > > > of the node. So whatever the node size, 32 or 128 CPUs, the
> > > > > imbalance_adj will be the same: 3 with the default imbalance_pct of
> > > > > NUMA level which is 125 AFAICT
> > > > >
> > > >
> > > > The intent in this version was to only cover the low utilisation case
> > > > regardless of the NUMA node size. There were too many corner cases
> > > > where the tradeoff of local memory latency versus local memory bandwidth
> > > > cannot be quantified. See Srikar's report as an example but it's a general
> > > > problem. The use of imbalance_pct was simply to find the smallest number of
> > > > running tasks were (imbalance_pct - 100) would be 1 running task and limit
> > >
> > > But using imbalance_pct alone doesn't mean anything.
> >
> > Other than figuring out "how many running tasks are required before
> > imbalance_pct is roughly equivalent to one fully active CPU?". Even then,
> 
> sorry, I don't see how you deduct this from only using imbalance_pct
> which is mainly there to say how much percent of difference is
> significant
> 

Because if the difference is 25% then 1 CPU out of 4 active is enough
for imbalance_pct to potentially be a factor. Anyway, the approach seems
almost universally disliked so even if I had reasons for not scaling
this by the group_weight, no one appears to agree with them :P

> > it's a bit weak as imbalance_pct makes hard-coded assumptions on what
> > the tradeoff of cross-domain migration is without considering the hardware.
> >
> > > Using similar to the below
> > >
> > >     busiest->group_weight * (env->sd->imbalance_pct - 100) / 100
> > >
> > > would be more meaningful
> > >
> >
> > It's meaningful to some sense from a conceptual point of view but
> > setting the low utilisation cutoff depending on the number of CPUs on
> > the node does not account for any local memory latency vs bandwidth.
> > i.e. on a small or mid-range machine the cutoff will make sense. On
> > larger machines, the cutoff could be at the point where memory bandwidth
> > is saturated leading to a scenario whereby upgrading to a larger
> > machine performs worse than the smaller machine.
> >
> > Much more importantly, doing what you suggest allows an imbalance
> > of more CPUs than are backed by a single LLC. On high-end AMD EPYC 2
> > machines, busiest->group_weight scaled by imbalance_pct spans multiple L3
> > caches. That is going to have side-effects. While I also do not account
> > for the LLC group_weight, it's unlikely the cut-off I used would be
> > smaller than an LLC cache on a large machine as the cache.
> >
> > These two points are why I didn't take the group weight into account.
> >
> > Now if you want, I can do what you suggest anyway as long as you are happy
> > that the child domain weight is also taken into account and to bound the
> 
> Taking into account child domain makes sense to me, but shouldn't we
> take into account the number of child group instead ? This should
> reflect the number of different LLC caches.

I guess it would, but why is it inherently better? The number of child
groups would yield a similar result if we assume that all the lower
domains have equal weight, so it is simply the weight of the SD_NUMA
domain divided by the number of child domains.
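
To spell out the equivalence I'm assuming (my own sketch, and it only
holds if every child group has the same weight):

	/*
	 * With N equal child groups of weight W, the NUMA domain weight is
	 * N * W, so:
	 *	child->span_weight		== W
	 *	env->sd->span_weight / N	== W
	 * i.e. capping by one child's weight and dividing the node weight
	 * by the number of child groups give the same bound.
	 */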

Now, I could be missing something with asymmetric setups. I don't know
if it's possible for child domains of a NUMA domain to have different
sizes. I would be somewhat surprised if they did but I also do not work
on such machines nor have I ever accessed one (to my knowledge).

> IIUC your reasoning, we want to make sure that tasks will not start to
> fight for using same resources which is the connection between LLC
> cache and memory in this case
> 

Yep. I don't want a case where the allowed imbalance causes the load
balancer to have to balance between the lower domains. *Maybe* that is
actually better in some cases, but it's far from intuitive, so I would
prefer that change to be a patch on its own with a big fat comment
explaining the reasoning behind the additional complexity.

> > largest possible allowed imbalance to deal with the case of a node having
> > multiple small LLC caches. That means that some machines will be using the
> > size of the node and some machines will use the size of an LLC. It's less
> > predictable overall as some machines will be "special" relative to others
> > making it harder to reproduce certain problems locally but it would take
> > imbalance_pct into account in a way that you're happy with.
> >
> > Also bear in mind that whether LLC is accounted for or not, the final
> > result should be halved similar to the other imbalance calculations to
> > avoid over or under load balancing.
> >
> > > Or you could use the util_avg so you will take into account if the
> > > tasks are short running ones or long running ones
> > >
> >
> > util_avg can be skewed if there are big outliers. Even then, it's not
> > a great metric for the low utilisation cutoff. Large numbers of mostly
> > idle but running tasks would be treated similarly to small numbers of
> > fully active tasks. It's less predictable and harder to reason about how
> 
> Yes but this also have the advantage of reflecting more accurately how
> the system is used.
> with nr_running, we consider that mostly idle and fully active tasks
> will have the exact same impact on the memory
> 

Maybe, maybe not. When there is spare capacity in the domain overall and
we are only interested in the low utilisation case, it seems to me that
we should consider the most obvious and understandable metric. Now, if we
were talking about a nearly fully loaded domain or an overloaded domain
then I would fully agree with you as balancing utilisation in that case
becomes critical.

> > load balancing behaves across a variety of workloads.
> >
> > Based on what you suggest, the result looks like this (build tested
> > only)
> 
> I'm going to give this patch a try
> 

Thanks. I've queued the same patch on one machine to see what falls out.
I don't want to tie up half my test grid until we get some sort of
consensus.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2
  2020-01-07 11:22             ` Peter Zijlstra
  2020-01-07 11:42               ` Mel Gorman
@ 2020-01-07 12:28               ` Peter Zijlstra
  1 sibling, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2020-01-07 12:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Vincent Guittot, Ingo Molnar, Phil Auld, Valentin Schneider,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML

On Tue, Jan 07, 2020 at 12:22:55PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 07, 2020 at 09:56:55AM +0000, Mel Gorman wrote:

> > +			unsigned int imbalance_adj;
> > +
> > +			/*
> > +			 * Calculate an acceptable degree of imbalance based
> > +			 * on imbalance_adj. However, do not allow a greater
> > +			 * imbalance than the child domains weight to avoid
> > +			 * a case where the allowed imbalance spans multiple
> > +			 * LLCs.
> > +			 */
> 
> That comment is a wee bit misleading, @child is not an LLC per se. This
> could be the NUMA distance 2 domain, in which case @child is the NUMA
> distance 1 group.
> 
> That said, even then it probably makes sense to ensure you don't idle a
> whole smaller distance group.

Hmm, one more thing. On AMD EPYC, with its multiple LLCs, you'll have
the single NODE domain in between, and that is not marked with SD_NUMA
(iirc).

So specifically the case you want to handle is not in fact handled. The
first SD_NUMA (distance-1) will have all NODE children, which on EPYC
are not LLCs.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2
  2020-01-07 11:42               ` Mel Gorman
@ 2020-01-07 12:29                 ` Peter Zijlstra
  0 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2020-01-07 12:29 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Vincent Guittot, Ingo Molnar, Phil Auld, Valentin Schneider,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML

On Tue, Jan 07, 2020 at 11:42:11AM +0000, Mel Gorman wrote:
> On Tue, Jan 07, 2020 at 12:22:55PM +0100, Peter Zijlstra wrote:

> > > +		/* Consider allowing a small imbalance between NUMA groups */
> > > +		if (env->sd->flags & SD_NUMA) {
> > > +			struct sched_domain *child = env->sd->child;
> > 
> > This assumes sd->child exists, which should be true for NUMA domains I
> > suppose.
> > 
> 
> I would be stunned if it was not. What sort of NUMA domain would not have
> child domains? Does a memory-only NUMA node with no CPUs even generate
> a scheduler domain? If it does, then I guess the check is necessary.

I think it's fine, it was just my paranoia triggering. At the very least
we'll have the single CPU domain there.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2
  2020-01-07 11:56               ` Mel Gorman
@ 2020-01-07 16:00                 ` Vincent Guittot
  2020-01-07 20:24                   ` Mel Gorman
  0 siblings, 1 reply; 28+ messages in thread
From: Vincent Guittot @ 2020-01-07 16:00 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Peter Zijlstra, Phil Auld, Valentin Schneider,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML

 busiest->group_weight * (env->sd->imbalance_pct - 100) / 100;

On Tue, 7 Jan 2020 at 12:56, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Tue, Jan 07, 2020 at 12:17:12PM +0100, Vincent Guittot wrote:
> > On Tue, 7 Jan 2020 at 10:56, Mel Gorman <mgorman@techsingularity.net> wrote:
> > >
> > > On Tue, Jan 07, 2020 at 09:38:26AM +0100, Vincent Guittot wrote:
> > > > > > This looks weird to me because you use imbalance_pct, which is
> > > > > > meaningful only to compare a ratio, to define a number that will then be
> > > > > > compared to a number of tasks without taking into account the weight
> > > > > > of the node. So whatever the node size, 32 or 128 CPUs, the
> > > > > > imbalance_adj will be the same: 3 with the default imbalance_pct of
> > > > > > NUMA level which is 125 AFAICT
> > > > > >
> > > > >
> > > > > The intent in this version was to only cover the low utilisation case
> > > > > regardless of the NUMA node size. There were too many corner cases
> > > > > where the tradeoff of local memory latency versus local memory bandwidth
> > > > > cannot be quantified. See Srikar's report as an example but it's a general
> > > > > problem. The use of imbalance_pct was simply to find the smallest number of
> > > > > running tasks where (imbalance_pct - 100) would be 1 running task and limit
> > > >
> > > > But using imbalance_pct alone doesn't mean anything.
> > >
> > > Other than figuring out "how many running tasks are required before
> > > imbalance_pct is roughly equivalent to one fully active CPU?". Even then,
> >
> > sorry, I don't see how you deduce this from only using imbalance_pct
> > which is mainly there to say what percentage of difference is
> > significant
> >
>
> Because if the difference is 25% then 1 CPU out of 4 active is enough
> for imbalance_pct to potentially be a factor. Anyway, the approach seems
> almost universally disliked so even if I had reasons for not scaling
> this by the group_weight, no one appears to agree with them :P
>
> > > it's a bit weak as imbalance_pct makes hard-coded assumptions on what
> > > the tradeoff of cross-domain migration is without considering the hardware.
> > >
> > > > Using similar to the below
> > > >
> > > >     busiest->group_weight * (env->sd->imbalance_pct - 100) / 100
> > > >
> > > > would be more meaningful
> > > >
> > >
> > > It's meaningful in some sense from a conceptual point of view but
> > > setting the low utilisation cutoff depending on the number of CPUs on
> > > the node does not account for any local memory latency vs bandwidth.
> > > i.e. on a small or mid-range machine the cutoff will make sense. On
> > > larger machines, the cutoff could be at the point where memory bandwidth
> > > is saturated leading to a scenario whereby upgrading to a larger
> > > machine performs worse than the smaller machine.
> > >
> > > Much more importantly, doing what you suggest allows an imbalance
> > > of more CPUs than are backed by a single LLC. On high-end AMD EPYC 2
> > > machines, busiest->group_weight scaled by imbalance_pct spans multiple L3
> > > caches. That is going to have side-effects. While I also do not account
> > > for the LLC group_weight, it's unlikely the cut-off I used would be
> > > larger than an LLC cache on a large machine.
> > >
> > > These two points are why I didn't take the group weight into account.
> > >
> > > Now if you want, I can do what you suggest anyway as long as you are happy
> > > that the child domain weight is also taken into account and to bound the
> >
> > Taking into account the child domain makes sense to me, but shouldn't we
> > take into account the number of child groups instead? This should
> > reflect the number of different LLC caches.
>
> I guess it would but why is it inherently better? The number of domains
> would yield a similar result if we assume that all the lower domains
> have equal weight, as it simply becomes the weight of the SD_NUMA domain
> divided by the number of child domains.

but that's not what you are doing in your proposal. You are directly
using child->span_weight, which reflects the number of CPUs in the
child and not the number of groups

you should do something like  sds->busiest->span_weight /
sds->busiest->child->span_weight which gives you an approximation of
the number of independent groups inside the busiest NUMA node from a
shared resource point of view
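
For illustration only, inside the SD_NUMA block of the patch being
discussed that count could be derived and used something like this (the
capping at the end is just one possible way to apply it):

	unsigned int nr_child_groups;

	/*
	 * Approximate number of independent child groups (LLCs on most
	 * machines) in the busiest node, assuming a symmetric topology
	 * so that the local child domain is representative of it.
	 */
	nr_child_groups = max(1U, busiest->group_weight /
				  env->sd->child->span_weight);

	/* One possible use: cap the allowed imbalance by that count */
	imbalance_adj = min(imbalance_adj, nr_child_groups);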

>
> Now, I could be missing something with asymmetric setups. I don't know
> if it's possible for child domains of a NUMA domain to have different
> sizes. I would be somewhat surprised if they did but I also do not work
> on such machines nor have I ever accessed one (to my knowledge).
>
> > IIUC your reasoning, we want to make sure that tasks will not start to
> > fight over the same resources, which in this case is the connection
> > between the LLC cache and memory
> >
>
> Yep. I don't want a case where the allowed imbalance causes the load
> balancer to have to balance between the lower domains. *Maybe* that is
> actually better in some cases but it's far from intuitive so I would
> prefer that change to be a patch on its own with a big fat comment
> explaining the reasoning behind the additional complexity.
>
> > > largest possible allowed imbalance to deal with the case of a node having
> > > multiple small LLC caches. That means that some machines will be using the
> > > size of the node and some machines will use the size of an LLC. It's less
> > > predictable overall as some machines will be "special" relative to others
> > > making it harder to reproduce certain problems locally but it would take
> > > imbalance_pct into account in a way that you're happy with.
> > >
> > > Also bear in mind that whether LLC is accounted for or not, the final
> > > result should be halved similar to the other imbalance calculations to
> > > avoid over or under load balancing.
> > >
> > > > Or you could use the util_avg so you will take into account if the
> > > > tasks are short running ones or long running ones
> > > >
> > >
> > > util_avg can be skewed if there are big outliers. Even then, it's not
> > > a great metric for the low utilisation cutoff. Large numbers of mostly
> > > idle but running tasks would be treated similarly to small numbers of
> > > fully active tasks. It's less predictable and harder to reason about how
> >
> > Yes but this also has the advantage of reflecting more accurately how
> > the system is used.
> > With nr_running, we consider that mostly idle and fully active tasks
> > will have the exact same impact on memory
> >
>
> Maybe, maybe not. When there is spare capacity in the domain overall and
> we are only interested in the low utilisation case, it seems to me that
> we should consider the most obvious and understandable metric. Now, if we
> were talking about a nearly fully loaded domain or an overloaded domain
> then I would fully agree with you as balancing utilisation in that case
> becomes critical.
>
> > > load balancing behaves across a variety of workloads.
> > >
> > > Based on what you suggest, the result looks like this (build tested
> > > only)
> >
> > I'm going to give this patch a try
> >
>
> Thanks. I've queued the same patch on one machine to see what falls out.
> I don't want to tie up half my test grid until we get some sort of
> consensus.
>
> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2
  2020-01-07  9:56           ` Mel Gorman
  2020-01-07 11:17             ` Vincent Guittot
  2020-01-07 11:22             ` Peter Zijlstra
@ 2020-01-07 19:26             ` Phil Auld
  2 siblings, 0 replies; 28+ messages in thread
From: Phil Auld @ 2020-01-07 19:26 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Vincent Guittot, Ingo Molnar, Peter Zijlstra, Valentin Schneider,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML


Hi,

On Tue, Jan 07, 2020 at 09:56:55AM +0000 Mel Gorman wrote:
> 
> util_avg can be skewed if there are big outliers. Even then, it's not
> a great metric for the low utilisation cutoff. Large numbers of mostly
> idle but running tasks would be treated similarly to small numbers of
> fully active tasks. It's less predictable and harder to reason about how
> load balancing behaves across a variety of workloads.
> 
> Based on what you suggest, the result looks like this (build tested
> only)

(Here I'm calling the below patch v4 for lack of a better name.)

One of my concerns is to have the group imbalance issue addressed. This
is the one remaining issue from the wasted cores paper. I have a setup
that is designed to illustrate this case. I ran a number of tests with
the small imbalance patches (v3 and v4 in this case) and both before
and after Vincent's load balancing rework. 

The basic test is to run an LU.c benchmark from the NAS parallel benchmark 
suite along with a couple of other cpu burning tasks. The GROUP case runs LU
and each cpu hog in separate cgroups. The NORMAL case runs all of these in
one cgroup.  This shows the problem with averaging the group scheduling
load, which fails to balance the jobs across the NUMA nodes.
It ends up with idle CPUs in the nodes where the cpu hogs are running while
overloading LU.c threads on others, with a big impact on the benchmark's
performance.  This test benefits from getting balanced well quickly.


The test machine is a 4-node 80 cpu x86_64 system (smt on).  There are 76 
threads in the LU.c test and 2 stress cpu jobs.  Each row shows the numbers
for 10 runs to smooth it out and make the mean more, well, meaningful.  
It's still got a fair bit of variance as you can see from the 3 sets of
data points for each kernel. 

5.4.0 is before load balancing rework (the really bad case).
5.5-rc2  is with the load balancing rework.
lbv3  is Mel's posted v3 patch on top of 5.5-rc2
lbv4 is Mel's experimental v4 from the email discussion with Vincent.


lbv4 appears a little worse for the GROUP case. v3 and 5.5-rc2 are pretty 
close to the same.  

All of the post 5.4.0 cases lose a little on the NORMAL case. lbv3 seems
to get a fair bit of that loss back on average but with a bit higher 
variability. 
 

This test can be pretty variable though so the minor differences probably
don't mean that much. In all the post re-work cases we are still showing
vast improvement in the GROUP case, which given the common use of cgroups
in modern workloads is a good thing. 

----------------------------------

GROUP - LU.c and cpu hogs in separate cgroups
Mop/s - Higher is better
============76_GROUP========Mop/s===================================
	min	q1	median	q3	max
5.4.0	 1671.8	 4211.2	 6103.0	 6934.1	 7865.4
5.4.0	 1777.1	 3719.9	 4861.8	 5822.5	13479.6
5.4.0	 2015.3	 2716.2	 5007.1	 6214.5	 9491.7
5.5-rc2	27641.0	30684.7	32091.8	33417.3	38118.1
5.5-rc2	27386.0	29795.2	32484.1	36004.0	37704.3
5.5-rc2	26649.6	29485.0	30379.7	33116.0	36832.8
lbv3	28496.3	29716.0	30634.8	32998.4	40945.2
lbv3	27294.7	29336.4	30186.0	31888.3	35839.1
lbv3	27099.3	29325.3	31680.1	35973.5	39000.0
lbv4	27936.4	30109.0	31724.8	33150.7	35905.1
lbv4	26431.0	29355.6	29850.1	32704.4	36060.3
lbv4	27436.6	29945.9	31076.9	32207.8	35401.5

Runtime - Lower is better
============76_GROUP========time====================================
	min	q1	median	q3	max
5.4.0	259.2	294.92	335.39	484.33	1219.61
5.4.0	151.3	351.1	419.4	551.99	1147.3
5.4.0	214.8	328.16	407.27	751.03	1011.77
5.5-rc2	 53.49	 61.03	 63.56	 66.46	  73.77
5.5-rc2  54.08	 56.67	 62.78	 68.44	  74.45
5.5-rc2	 55.36	 61.61	 67.14	 69.16	  76.51
lbv3	 49.8	 61.8	 66.59	 68.62	  71.55
lbv3	 56.89	 63.95	 67.55	 69.51	  74.7
lbv3	 52.28	 56.68	 64.38	 69.54	  75.24
lbv4	 56.79	 61.52	 64.3	 67.73	  72.99
lbv4	 56.54	 62.36	 68.31	 69.47	  77.14
lbv4	 57.6	 63.33	 65.64	 68.11	  74.32

NORMAL - LU.c and cpu hogs all in one cgroup
Mop/s - Higher is better
============76_NORMAL========Mop/s===================================
	min	q1	median	q3	max
5.4.0	32912.6	34047.5	36739.4	39124.1	41592.5
5.4.0	29937.7	33060.6	34860.8	39528.8	43328.1
5.4.0	31851.2	34281.1	35284.4	36016.8	38847.4
5.5-rc2	30475.6	32505.1	33977.3	34876	36233.8
5.5-rc2	30657.7	31301.1	32059.4	34396.7	38661.8
5.5-rc2	31022	32247.6	32628.9	33245	38572.3
lbv3	30606.4	32794.4	34258.6	35699	38669.2
lbv3	29722.7	30558.9	32731.2	36412	40752.3
lbv3	30297.7	32568.3	36654.6	38066.2	38988.3
lbv4	30084.9	31227.5	32312.8	33222.8	36039.7
lbv4	29875.9	32903.6	33803.1	34519.3	38663.5
lbv4	27923.3	30631.1	32666.9	33516.7	36663.4

Runtime - Lower is better
============76_NORMAL========time====================================
	min	q1	median	q3	max
5.4.0	49.02	52.115	55.58	59.89	61.95
5.4.0	47.06	51.615	58.57	61.68	68.11
5.4.0	52.49	56.615	57.795	59.48	64.02
5.5-rc2	56.27	58.47	60.02	62.735	66.91
5.5-rc2	52.74	59.295	63.605	65.145	66.51
5.5-rc2	52.86	61.335	62.495	63.23	65.73
lbv3	52.73	57.12	59.52	62.19	66.62
lbv3	50.03	56.02	62.39	66.725	68.6
lbv3	52.3	53.565	55.65	62.645	67.3
lbv4	56.58	61.375	63.135	65.3	67.77
lbv4	52.74	59.07	60.335	61.97	68.25
lbv4	55.61	60.835	62.42	66.635	73.02


So aside from the theoretical disputes, the posted v3 seems reasonable.
When a final version comes together I'll have the perf team run a
fuller set of tests. 


Cheers,
Phil


> 
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ba749f579714..1b2c7bed2db5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8648,10 +8648,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>  	/*
>  	 * Try to use spare capacity of local group without overloading it or
>  	 * emptying busiest.
> -	 * XXX Spreading tasks across NUMA nodes is not always the best policy
> -	 * and special care should be taken for SD_NUMA domain level before
> -	 * spreading the tasks. For now, load_balance() fully relies on
> -	 * NUMA_BALANCING and fbq_classify_group/rq to override the decision.
>  	 */
>  	if (local->group_type == group_has_spare) {
>  		if (busiest->group_type > group_fully_busy) {
> @@ -8691,16 +8687,41 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>  			env->migration_type = migrate_task;
>  			lsub_positive(&nr_diff, local->sum_nr_running);
>  			env->imbalance = nr_diff >> 1;
> -			return;
> -		}
> +		} else {
>  
> -		/*
> -		 * If there is no overload, we just want to even the number of
> -		 * idle cpus.
> -		 */
> -		env->migration_type = migrate_task;
> -		env->imbalance = max_t(long, 0, (local->idle_cpus -
> +			/*
> +			 * If there is no overload, we just want to even the number of
> +			 * idle cpus.
> +			 */
> +			env->migration_type = migrate_task;
> +			env->imbalance = max_t(long, 0, (local->idle_cpus -
>  						 busiest->idle_cpus) >> 1);
> +		}
> +
> +		/* Consider allowing a small imbalance between NUMA groups */
> +		if (env->sd->flags & SD_NUMA) {
> +			struct sched_domain *child = env->sd->child;
> +			unsigned int imbalance_adj;
> +
> +			/*
> +			 * Calculate an acceptable degree of imbalance based
> +			 * on imbalance_adj. However, do not allow a greater
> +			 * imbalance than the child domains weight to avoid
> +			 * a case where the allowed imbalance spans multiple
> +			 * LLCs.
> +			 */
> +			imbalance_adj = busiest->group_weight * (env->sd->imbalance_pct - 100) / 100;
> +			imbalance_adj = min(imbalance_adj, child->span_weight);
> +			imbalance_adj >>= 1;
> +
> +			/*
> +			 * Ignore small imbalances when the busiest group has
> +			 * low utilisation.
> +			 */
> +			if (busiest->sum_nr_running < imbalance_adj)
> +				env->imbalance = 0;
> +		}
> +
>  		return;
>  	}
>  
> 

-- 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2
  2020-01-07 16:00                 ` Vincent Guittot
@ 2020-01-07 20:24                   ` Mel Gorman
  2020-01-08  8:25                     ` Vincent Guittot
  2020-01-08 13:18                     ` Peter Zijlstra
  0 siblings, 2 replies; 28+ messages in thread
From: Mel Gorman @ 2020-01-07 20:24 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Phil Auld, Valentin Schneider,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML

On Tue, Jan 07, 2020 at 05:00:29PM +0100, Vincent Guittot wrote:
> > > Taking into account the child domain makes sense to me, but shouldn't we
> > > take into account the number of child groups instead? This should
> > > reflect the number of different LLC caches.
> >
> > I guess it would but why is it inherently better? The number of domains
> > would yield a similar result if we assume that all the lower domains
> > have equal weight, as it simply becomes the weight of the SD_NUMA domain
> > divided by the number of child domains.
> 
> but that's not what you are doing in your proposal. You are directly
> using child->span_weight, which reflects the number of CPUs in the
> child and not the number of groups
> 
> you should do something like  sds->busiest->span_weight /
> sds->busiest->child->span_weight which gives you an approximation of
> the number of independent groups inside the busiest NUMA node from a
> shared resource point of view
> 

Now I get you, but unfortunately it also would not work out. The number
of groups is not related to the LLC except in some specific cases.
It's possible to use the first CPU to find the size of an LLC but now I
worry that it would lead to unpredictable behaviour. AMD has different
numbers of LLCs per node depending on the CPU family and while Intel
generally has one LLC per node, I imagine there are counter examples.
This means that load balancing on different machines with similar core
counts will behave differently due to the LLC size. It might be possible
to infer it if the intermediate domain was DIE instead of MC but I doubt
that's guaranteed and it would still be unpredictable. It may be the type
of complexity that should only be introduced with a separate patch with
clear rationale as to why it's necessary and we are not at that threshold
so I withdraw the suggestion.
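
For reference, the first-CPU lookup alluded to above would be along these
lines, slotted into the same SD_NUMA block (sd_llc_size is the per-CPU
value the scheduler already maintains; whether consulting it here is
cheap or desirable is exactly the doubt above, so treat it as a sketch
only):

	int cpu = cpumask_first(sched_group_span(sds->busiest));
	unsigned int llc_size = per_cpu(sd_llc_size, cpu);

	/* Never allow an imbalance wider than one LLC of the busiest node */
	imbalance_adj = min(imbalance_adj, llc_size);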

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2
  2020-01-07 20:24                   ` Mel Gorman
@ 2020-01-08  8:25                     ` Vincent Guittot
  2020-01-08  8:49                       ` Mel Gorman
  2020-01-08 13:18                     ` Peter Zijlstra
  1 sibling, 1 reply; 28+ messages in thread
From: Vincent Guittot @ 2020-01-08  8:25 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Peter Zijlstra, Phil Auld, Valentin Schneider,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML

On Tue, 7 Jan 2020 at 21:24, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Tue, Jan 07, 2020 at 05:00:29PM +0100, Vincent Guittot wrote:
> > > > Taking into account the child domain makes sense to me, but shouldn't we
> > > > take into account the number of child groups instead? This should
> > > > reflect the number of different LLC caches.
> > >
> > > I guess it would but why is it inherently better? The number of domains
> > > would yield a similar result if we assume that all the lower domains
> > > have equal weight, as it simply becomes the weight of the SD_NUMA domain
> > > divided by the number of child domains.
> >
> > but that's not what you are doing in your proposal. You are directly
> > using child->span_weight, which reflects the number of CPUs in the
> > child and not the number of groups
> >
> > you should do something like  sds->busiest->span_weight /
> > sds->busiest->child->span_weight which gives you an approximation of
> > the number of independent groups inside the busiest NUMA node from a
> > shared resource point of view
> >
>
> Now I get you, but unfortunately it also would not work out. The number
> of groups is not related to the LLC except in some specific cases.
> It's possible to use the first CPU to find the size of an LLC but now I
> worry that it would lead to unpredictable behaviour. AMD has different
> numbers of LLCs per node depending on the CPU family and while Intel
> generally has one LLC per node, I imagine there are counter examples.
> This means that load balancing on different machines with similar core
> counts will behave differently due to the LLC size. It might be possible

But the degree of allowed imbalance is related to this topology, so
using the same value for those different machines will generate
different behavior because they don't have the same HW topology but we
use the same threshold

> to infer it if the intermediate domain was DIE instead of MC but I doubt
> that's guaranteed and it would still be unpredictable. It may be the type
> of complexity that should only be introduced with a separate patch with
> clear rationale as to why it's necessary and we are not at that threshold
> so I withdraw the suggestion.

The problem is that your proposal is not aligned with what you would like
to do: you want to take into account the number of groups but you use
the number of CPUs per group instead

Vincent
>
> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2
  2020-01-08  8:25                     ` Vincent Guittot
@ 2020-01-08  8:49                       ` Mel Gorman
  0 siblings, 0 replies; 28+ messages in thread
From: Mel Gorman @ 2020-01-08  8:49 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, Phil Auld, Valentin Schneider,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML

On Wed, Jan 08, 2020 at 09:25:38AM +0100, Vincent Guittot wrote:
> On Tue, 7 Jan 2020 at 21:24, Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > On Tue, Jan 07, 2020 at 05:00:29PM +0100, Vincent Guittot wrote:
> > > > > Taking into account the child domain makes sense to me, but shouldn't we
> > > > > take into account the number of child groups instead? This should
> > > > > reflect the number of different LLC caches.
> > > >
> > > > I guess it would but why is it inherently better? The number of domains
> > > > would yield a similar result if we assume that all the lower domains
> > > > have equal weight, as it simply becomes the weight of the SD_NUMA domain
> > > > divided by the number of child domains.
> > >
> > > but that's not what you are doing in your proposal. You are directly
> > > using child->span_weight, which reflects the number of CPUs in the
> > > child and not the number of groups
> > >
> > > you should do something like  sds->busiest->span_weight /
> > > sds->busiest->child->span_weight which gives you an approximation of
> > > the number of independent groups inside the busiest NUMA node from a
> > > shared resource point of view
> > >
> >
> > Now I get you, but unfortunately it also would not work out. The number
> > of groups is not related to the LLC except in some specific cases.
> > It's possible to use the first CPU to find the size of an LLC but now I
> > worry that it would lead to unpredictable behaviour. AMD has different
> > numbers of LLCs per node depending on the CPU family and while Intel
> > generally has one LLC per node, I imagine there are counter examples.
> > This means that load balancing on different machines with similar core
> > counts will behave differently due to the LLC size. It might be possible
> 
> But the degree of allowed imbalance is related to this topology, so
> using the same value for those different machines will generate
> different behavior because they don't have the same HW topology but we
> use the same threshold
> 

The differences in behaviour would be marginal given that the original
fixed value for the v3 patch would generally be smaller than an LLC. For
the moment, I'm assuming that v4 will be based on the number of CPUs in
the node.
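
To put numbers on that with a made-up symmetric topology (not one of the
test machines): a 24-CPU node whose child domain also spans 24 CPUs, with
the default NUMA-level imbalance_pct of 125, gives

	imbalance_adj = 24 * (125 - 100) / 100 = 6
	imbalance_adj = min(6, 24) = 6
	imbalance_adj >>= 1;	/* 3 */

so the busiest node has to be running at least 3 tasks before the normal
evening-out of idle CPUs applies.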

> > to infer it if the intermediate domain was DIE instead of MC but I doubt
> > that's guaranteed and it would still be unpredictable. It may be the type
> > of complexity that should only be introduced with a separate patch with
> > clear rationale as to why it's necessary and we are not at that threshold
> > so I withdraw the suggestion.
> 
> The problem is that your proposal is not aligned with what you would like
> to do: you want to take into account the number of groups but you use
> the number of CPUs per group instead
> 

I'm dropping the check of the child domain entirely. The lookups to get
the LLC size are relatively expensive without any data indicating it's
worthwhile.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2
  2020-01-07 20:24                   ` Mel Gorman
  2020-01-08  8:25                     ` Vincent Guittot
@ 2020-01-08 13:18                     ` Peter Zijlstra
  2020-01-08 14:03                       ` Mel Gorman
  1 sibling, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2020-01-08 13:18 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Vincent Guittot, Ingo Molnar, Phil Auld, Valentin Schneider,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML

On Tue, Jan 07, 2020 at 08:24:06PM +0000, Mel Gorman wrote:
> Now I get you, but unfortunately it also would not work out. The number
> of groups is not related to the LLC except in some specific cases.
> It's possible to use the first CPU to find the size of an LLC but now I
> worry that it would lead to unpredictable behaviour. AMD has different
> numbers of LLCs per node depending on the CPU family and while Intel
> generally has one LLC per node, I imagine there are counter examples.

Intel has the 'fun' case of an LLC spanning nodes :-), although Linux
pretends this isn't so and truncates the LLC topology information to be
the node again -- see arch/x86/kernel/smpboot.c:match_llc().

And of course, in the Core2 era we had the Core2Quad chips, which were a
dual-die solution and therefore also had multiple LLCs, and I think the
Xeon variant of that would allow the multiple LLC per node situation
too, although this is of course ancient hardware nobody really cares
about anymore.

> This means that load balancing on different machines with similar core
> counts will behave differently due to the LLC size.

That sounds like perfectly fine/expected behaviour to me.

> It might be possible
> to infer it if the intermediate domain was DIE instead of MC but I doubt
> that's guaranteed and it would still be unpredictable. It may be the type
> of complexity that should only be introduced with a separate patch with
> clear rationale as to why it's necessary and we are not at that threshold
> so I withdraw the suggestion.

So IIRC the initial patch(es) had the idea to allow for 1 extra task
imbalance to get 1-1 pairs on the same node, instead of across nodes. I
don't immediately see that in these later patches.

Would that be something to go back to? Would that not side-step much of
the issues under discussion here?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2
  2020-01-08 13:18                     ` Peter Zijlstra
@ 2020-01-08 14:03                       ` Mel Gorman
  2020-01-08 16:46                         ` Vincent Guittot
  0 siblings, 1 reply; 28+ messages in thread
From: Mel Gorman @ 2020-01-08 14:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Ingo Molnar, Phil Auld, Valentin Schneider,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML

On Wed, Jan 08, 2020 at 02:18:58PM +0100, Peter Zijlstra wrote:
> On Tue, Jan 07, 2020 at 08:24:06PM +0000, Mel Gorman wrote:
> > Now I get you, but unfortunately it also would not work out. The number
> > of groups is not related to the LLC except in some specific cases.
> > It's possible to use the first CPU to find the size of an LLC but now I
> > worry that it would lead to unpredictable behaviour. AMD has different
> > numbers of LLCs per node depending on the CPU family and while Intel
> > generally has one LLC per node, I imagine there are counter examples.
> 
> Intel has the 'fun' case of an LLC spanning nodes :-), although Linux
> pretends this isn't so and truncates the LLC topology information to be
> the node again -- see arch/x86/kernel/smpboot.c:match_llc().
> 

That's the Sub-NUMA Clustering, right? I thought that manifested as a
separate NUMA node in Linux. Not exactly right from a topology point of
view, but close enough. Still, you're right in that identifying the LLC
specifically would not necessarily be the way to go. FWIW, I have one
machine with SNC enabled and didn't notice anything problematic in the
versions of the patch released to date.

> And of course, in the Core2 era we had the Core2Quad chips, which were a
> dual-die solution and therefore also had multiple LLCs, and I think the
> Xeon variant of that would allow the multiple LLC per node situation
> too, although this is of course ancient hardware nobody really cares
> about anymore.
> 

Ok, even though they're older, it's enough counter-examples to be concerned
about. I'm definitely dropping the LLC considerations for the moment :)

> > This means that load balancing on different machines with similar core
> > counts will behave differently due to the LLC size.
> 
> That sounds like perfectly fine/expected behaviour to me.
> 

Even so, I think it should not be part of the initial patch. I would only
be happy if I had enough different machine types to prove that specific
special casing is required and we cannot simply rely on standard load
balancing of the domains below SD_NUMA.

> > It might be possible
> > to infer it if the intermediate domain was DIE instead of MC but I doubt
> > that's guaranteed and it would still be unpredictable. It may be the type
> > of complexity that should only be introduced with a separate patch with
> > clear rationale as to why it's necessary and we are not at that threshold
> > so I withdraw the suggestion.
> 
> So IIRC the initial patch(es) had the idea to allow for 1 extra task
> imbalance to get 1-1 pairs on the same node, instead of across nodes. I
> don't immediately see that in these later patches.
> 

Not quite -- I had a minimum allowed imbalance of 2 tasks -- mostly for
very small NUMA domains. By v3, it was not necessary because the value
was hard-coded regardless of the number of CPUs. I think I'll leave it
out because I don't think it's worth worrying about the imbalance between
NUMA domains of less than 4 CPUs (do they even exist any more?)

> Would that be something to go back to? Would that not side-step much of
> the issues under discussion here?

Allowing just 1 extra task would work for netperf in some cases except when
softirq is involved. It would partially work for IO on ext4 as it's only
communicating with one journal thread but a bit more borderline for XFS
due to workqueue usage. XFS is not a massive concern in this context as
the workqueue is close to the IO issuer and short-lived so I don't think
it would crop up much for load balancing unlike ext4 where jbd2 could be
very active.

If v4 of the patch fails to meet approval then I'll try a patch that
allows a hard-coded imbalance of 2 tasks (one communicating task and
one kthread) regardless of NUMA domain span up to 50% of utilisation
and see what that gets. One way or the other, I would like to get the
basic load balancing issue out of the way before even looking at NUMA balancing
and whether it needs to be adjusted based on the load balancing rewrite.
NUMA balancing already has some logic that fights load balancer decisions
and I don't want to make that worse, I would be delighted if we could
even delete that migration retry check that overrides the load balancer
because it always was a bit nasty.
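
Purely as a sketch of that fallback (the hard-coded 2 and the 50% cutoff
are the values under discussion, and expressing the cutoff as
sum_nr_running versus group_weight is an assumption, so nothing here is
final):

	/* Consider allowing a small imbalance between NUMA groups */
	if (env->sd->flags & SD_NUMA &&
	    busiest->sum_nr_running * 2 < busiest->group_weight) {
		/* One communicating task plus one kthread */
		long imbalance_adj = 2;

		env->imbalance -= min(env->imbalance, imbalance_adj);
	}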

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2
  2020-01-08 14:03                       ` Mel Gorman
@ 2020-01-08 16:46                         ` Vincent Guittot
  2020-01-08 18:03                           ` Mel Gorman
  0 siblings, 1 reply; 28+ messages in thread
From: Vincent Guittot @ 2020-01-08 16:46 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Phil Auld, Valentin Schneider,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML

On Wednesday 08 Jan 2020 at 14:03:49 (+0000), Mel Gorman wrote:
> On Wed, Jan 08, 2020 at 02:18:58PM +0100, Peter Zijlstra wrote:
> > On Tue, Jan 07, 2020 at 08:24:06PM +0000, Mel Gorman wrote:
> > > Now I get you, but unfortunately it also would not work out. The number
> > > of groups is not related to the LLC except in some specific cases.
> > > It's possible to use the first CPU to find the size of an LLC but now I
> > > worry that it would lead to unpredictable behaviour. AMD has different
> > > numbers of LLCs per node depending on the CPU family and while Intel
> > > generally has one LLC per node, I imagine there are counter examples.
> > 
> > Intel has the 'fun' case of an LLC spanning nodes :-), although Linux
> > pretends this isn't so and truncates the LLC topology information to be
> > the node again -- see arch/x86/kernel/smpboot.c:match_llc().
> > 
> 
> That's the Sub-NUMA Clustering, right? I thought that manifested as a
> separate NUMA node in Linux. Not exactly right from a topology point of
> view, but close enough. Still, you're right in that identifying the LLC
> specifically would not necessarily be the way to go. FWIW, I have one
> machine with SNC enabled and didn't notice anything problematic in the
> versions of the patch released to date.
> 
> > And of course, in the Core2 era we had the Core2Quad chips, which were a
> > dual-die solution and therefore also had multiple LLCs, and I think the
> > Xeon variant of that would allow the multiple LLC per node situation
> > too, although this is of course ancient hardware nobody really cares
> > about anymore.
> > 
> 
> Ok, even though they're older, it's enough counter-examples to be concerned
> about. I'm definitely dropping the LLC considerations for the moment :)
> 
> > > This means that load balancing on different machines with similar core
> > > counts will behave differently due to the LLC size.
> > 
> > That sounds like perfectly fine/expected behaviour to me.
> > 
> 
> Even so, I think it should not be part of the initial patch. I would only
> be happy if I had enough different machine types to prove that specific
> special casing is required and we cannot simply rely on standard load
> balancing of the domains below SD_NUMA.
> 
> > > It might be possible
> > > to infer it if the intermediate domain was DIE instead of MC but I doubt
> > > that's guaranteed and it would still be unpredictable. It may be the type
> > > of complexity that should only be introduced with a separate patch with
> > > clear rationale as to why it's necessary and we are not at that threshold
> > > so I withdraw the suggestion.
> > 
> > So IIRC the initial patch(es) had the idea to allow for 1 extra task
> > imbalance to get 1-1 pairs on the same node, instead of across nodes. I
> > don't immediately see that in these later patches.
> > 
> 
> Not quite -- I had a minimum allowed imbalance of 2 tasks -- mostly for
> very small NUMA domains. By v3, it was not necessary because the value
> was hard-coded regardless of the number of CPUs. I think I'll leave it
> out because I don't think it's worth worrying about the imbalance between
> NUMA domains of less than 4 CPUs (do they even exist any more?)
> 
> > Would that be something to go back to? Would that not side-step much of
> > the issues under discussion here?
> 
> Allowing just 1 extra task would work for netperf in some cases except when
> softirq is involved. It would partially work for IO on ext4 as it's only
> communicating with one journal thread but a bit more borderline for XFS
> due to workqueue usage. XFS is not a massive concern in this context as
> the workqueue is close to the IO issuer and short-lived so I don't think
> it would crop up much for load balancing unlike ext4 where jbd2 could be
> very active.
> 
> If v4 of the patch fails to meet approval then I'll try a patch that

My main concern with v4 was the mismatch between the computed value and the goal to not overload the LLCs

> allows a hard-coded imbalance of 2 tasks (one communicating task and

If there is no good way to compute the allowed imbalance, a hard-coded value of 2 is probably a simple value to start with

> one kthread) regardless of NUMA domain span up to 50% of utilisation

Are you sure that it's necessary? This degree of imbalance already applies only if the group has spare capacity

something like

+               /* Consider allowing a small imbalance between NUMA groups */
+               if (env->sd->flags & SD_NUMA) {
+
+                       /*
+                        * Until we find a good way to compute an acceptable
+                        * degree of imbalance linked to the system topology,
+                        * one that will not impact memory bandwidth and latency,
+                        * start with a fixed small value.
+                        */
+                       imbalance_adj = 2;
+
+                       /*
+                        * Ignore small imbalances when the busiest group has
+                        * low utilisation.
+                        */
+                        env->imbalance -= min(env->imbalance, imbalance_adj);
+               }
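
For what that subtraction does in practice: an env->imbalance of 1 or 2 is
cleared entirely (e.g. 2 - min(2, 2) = 0), while a larger imbalance is only
reduced by 2, so anything substantial is still balanced away.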

> and see what that gets. One way or the other, I would like to get the
> basic load balancing issue out of the way before even looking at NUMA balancing
> and whether it needs to be adjusted based on the load balancing rewrite.
> NUMA balancing already has some logic that fights load balancer decisions
> and I don't want to make that worse, I would be delighted if we could
> even delete that migration retry check that overrides the load balancer
> because it always was a bit nasty.
> 
> -- 
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2
  2020-01-08 16:46                         ` Vincent Guittot
@ 2020-01-08 18:03                           ` Mel Gorman
  0 siblings, 0 replies; 28+ messages in thread
From: Mel Gorman @ 2020-01-08 18:03 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, Phil Auld, Valentin Schneider,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML

On Wed, Jan 08, 2020 at 05:46:57PM +0100, Vincent Guittot wrote:
> > Allowing just 1 extra task would work for netperf in some cases except when
> > softirq is involved. It would partially work for IO on ext4 as it's only
> > communicating with one journal thread but a bit more borderline for XFS
> > due to workqueue usage. XFS is not a massive concern in this context as
> > the workqueue is close to the IO issuer and short-lived so I don't think
> > it would crop up much for load balancing unlike ext4 where jbd2 could be
> > very active.
> > 
> > If v4 of the patch fails to meet approval then I'll try a patch that
> 
> My main concern with v4 was the mismatch between the computed value and the goal to not overload the LLCs
> 

Fair enough.

> > allows a hard-coded imbalance of 2 tasks (one communicating task and
> 
> If there is no good way to compute the allowed imbalance, a hard-coded
> value of 2 is probably a simple value to start with

Indeed.

> 
> > one kthread) regardless of NUMA domain span up to 50% of utilisation
> 
> Are you sure that it's necessary? This degree of imbalance already applies only if the group has spare capacity
> 
> something like
> 
> +               /* Consider allowing a small imbalance between NUMA groups */
> +               if (env->sd->flags & SD_NUMA) {
> +
> +                       /*
> +                        * Until we find a good way to compute an acceptable
> +                        * degree of imbalance linked to the system topology,
> +                        * one that will not impact memory bandwidth and latency,
> +                        * start with a fixed small value.
> +                        */
> +                       imbalance_adj = 2;
> +
> +                       /*
> +                        * Ignore small imbalances when the busiest group has
> +                        * low utilisation.
> +                        */
> +                        env->imbalance -= min(env->imbalance, imbalance_adj);
> +               }
> 

This is more or less what I had in mind with the exception that the "low
utilisation" part of the comment would go away. The 50% utilisation may
be unnecessary and was based simply on the idea that at that point memory
bandwidth, HT considerations or both would also be dominating factors. I
can leave out the check and add it in as a separate patch if proven to
be necessary.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2020-01-08 18:03 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-12-20  8:42 [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2 Mel Gorman
2019-12-20 12:40 ` Valentin Schneider
2019-12-20 14:22   ` Mel Gorman
2019-12-20 15:32     ` Valentin Schneider
2019-12-21 11:25   ` Mel Gorman
2019-12-22 12:00 ` Srikar Dronamraju
2019-12-23 13:31 ` Vincent Guittot
2019-12-23 13:41   ` Vincent Guittot
2020-01-03 14:31   ` Mel Gorman
2020-01-06 13:55     ` Vincent Guittot
2020-01-06 14:52       ` Mel Gorman
2020-01-07  8:38         ` Vincent Guittot
2020-01-07  9:56           ` Mel Gorman
2020-01-07 11:17             ` Vincent Guittot
2020-01-07 11:56               ` Mel Gorman
2020-01-07 16:00                 ` Vincent Guittot
2020-01-07 20:24                   ` Mel Gorman
2020-01-08  8:25                     ` Vincent Guittot
2020-01-08  8:49                       ` Mel Gorman
2020-01-08 13:18                     ` Peter Zijlstra
2020-01-08 14:03                       ` Mel Gorman
2020-01-08 16:46                         ` Vincent Guittot
2020-01-08 18:03                           ` Mel Gorman
2020-01-07 11:22             ` Peter Zijlstra
2020-01-07 11:42               ` Mel Gorman
2020-01-07 12:29                 ` Peter Zijlstra
2020-01-07 12:28               ` Peter Zijlstra
2020-01-07 19:26             ` Phil Auld
