* [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains
@ 2019-12-18 15:44 Mel Gorman
  2019-12-18 18:50 ` Valentin Schneider
                   ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: Mel Gorman @ 2019-12-18 15:44 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, pauld, valentin.schneider, srikar,
	quentin.perret, dietmar.eggemann, Morten.Rasmussen, hdanton,
	parth, riel, LKML, Mel Gorman

The CPU load balancer balances between different domains to spread load
and strives to have equal balance everywhere. Communicating tasks can
migrate so they are topologically close to each other but these decisions
are independent. On a lightly loaded NUMA machine, two communicating tasks
pulled together at wakeup time can be pushed apart by the load balancer.
In isolation, the load balancer decision is fine but it ignores the tasks'
data locality and the wakeup/LB paths continually conflict. NUMA balancing
is also a factor but it, too, simply conflicts with the load balancer.

This patch allows a degree of imbalance to exist between NUMA domains
based on the imbalance_pct defined by the scheduler domain to take into
account that data locality is also important. This slight imbalance is
allowed until the scheduler domain reaches almost 50% utilisation at which
point other factors like HT utilisation and memory bandwidth come into
play. While not commented upon in the code, the cutoff is important for
memory-bound parallelised non-communicating workloads that do not fully
utilise the entire machine. This is not necessarily the best universal
cut-off point but it appeared appropriate for a variety of workloads
and machines.
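
For illustration only (not part of the patch), the following stand-alone
sketch mirrors the allowed-imbalance arithmetic described above, assuming
a 48-CPU NUMA node and the default NUMA imbalance_pct of 125 -- both
values are assumptions chosen for the example:

#include <stdio.h>

int main(void)
{
	/* Assumed values for illustration: 48 CPUs per NUMA node, default pct */
	unsigned int group_weight = 48;
	unsigned int imbalance_pct = 125;
	unsigned int imbalance_adj, imbalance_max;

	/* Allowable imbalance: half of the imbalance_pct margin, minimum 2 */
	imbalance_adj = (group_weight * (imbalance_pct - 100) / 100) >> 1;
	if (imbalance_adj < 2)
		imbalance_adj = 2;

	/* Cutoff just below half of the node's CPUs */
	imbalance_max = (group_weight >> 1) - imbalance_adj;

	/* Prints imbalance_adj=6 imbalance_max=18 for the values above */
	printf("imbalance_adj=%u imbalance_max=%u\n", imbalance_adj, imbalance_max);
	return 0;
}

In other words, for those assumed values an imbalance of up to 6 tasks is
tolerated as long as the busiest node is running fewer than 18 tasks.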

The most obvious impact is on netperf TCP_STREAM -- two simple
communicating tasks with some softirq offloaded depending on the
transmission rate.

2-socket Haswell machine 48 core, HT enabled
netperf-tcp -- mmtests config config-network-netperf-unbound
                       	      baseline              lbnuma-v1
Hmean     64         666.68 (   0.00%)      669.00 (   0.35%)
Hmean     128       1276.18 (   0.00%)     1285.59 *   0.74%*
Hmean     256       2366.78 (   0.00%)     2419.42 *   2.22%*
Hmean     1024      8123.94 (   0.00%)     8494.92 *   4.57%*
Hmean     2048     12962.45 (   0.00%)    13430.37 *   3.61%*
Hmean     3312     17709.24 (   0.00%)    17317.23 *  -2.21%*
Hmean     4096     19756.01 (   0.00%)    19480.56 (  -1.39%)
Hmean     8192     27469.59 (   0.00%)    27208.17 (  -0.95%)
Hmean     16384    30062.82 (   0.00%)    31135.21 *   3.57%*
Stddev    64           2.64 (   0.00%)        1.19 (  54.86%)
Stddev    128          6.22 (   0.00%)        0.65 (  89.51%)
Stddev    256          9.75 (   0.00%)       11.81 ( -21.07%)
Stddev    1024        69.62 (   0.00%)       38.48 (  44.74%)
Stddev    2048        72.73 (   0.00%)       58.22 (  19.94%)
Stddev    3312       412.35 (   0.00%)       67.77 (  83.57%)
Stddev    4096       345.02 (   0.00%)       81.07 (  76.50%)
Stddev    8192       280.09 (   0.00%)      250.19 (  10.68%)
Stddev    16384      452.99 (   0.00%)      222.97 (  50.78%)

Fairly small impact on average performance but note how much the standard
deviation is reduced, showing much more stable results. A clearer story
is visible from the NUMA Balancing stats:

Ops NUMA base-page range updates       21596.00         282.00
Ops NUMA PTE updates                   21596.00         282.00
Ops NUMA PMD updates                       0.00           0.00
Ops NUMA hint faults                   17786.00         134.00
Ops NUMA hint local faults %            9916.00         134.00
Ops NUMA hint local percent               55.75         100.00
Ops NUMA pages migrated                 4231.00           0.00

Without the patch, only 55.75% of sampled accesses are local.
With the patch, 100% of sampled accesses are local. A 2-socket
Broadwell showed better results on average but they are not presented
for brevity. The patch holds up for 4-socket boxes as well.

4-socket Haswell machine, 144 core, HT enabled
netperf-tcp

                       	      baseline              lbnuma-v1
Hmean     64         953.51 (   0.00%)      986.63 *   3.47%*
Hmean     128       1826.48 (   0.00%)     1887.48 *   3.34%*
Hmean     256       3295.19 (   0.00%)     3402.08 *   3.24%*
Hmean     1024     10915.40 (   0.00%)    11482.92 *   5.20%*
Hmean     2048     17833.82 (   0.00%)    19033.89 *   6.73%*
Hmean     3312     22690.72 (   0.00%)    24101.77 *   6.22%*
Hmean     4096     24422.23 (   0.00%)    26665.46 *   9.19%*
Hmean     8192     31250.11 (   0.00%)    33514.74 *   7.25%*
Hmean     16384    37033.70 (   0.00%)    38732.22 *   4.59%*

On this machine, the baseline measured 58.11% locality for sampled accesses
while the patched kernel measured 100% local accesses. Similarly, the patch
holds up for 2-socket machines with multiple L3 caches such as the AMD Epyc 2.

2-socket EPYC-2 machine, 256 cores
netperf-tcp
Hmean     64        1564.63 (   0.00%)     1550.59 (  -0.90%)
Hmean     128       3028.83 (   0.00%)     3030.48 (   0.05%)
Hmean     256       5733.47 (   0.00%)     5769.51 (   0.63%)
Hmean     1024     18936.04 (   0.00%)    19216.15 *   1.48%*
Hmean     2048     27589.77 (   0.00%)    28200.45 *   2.21%*
Hmean     3312     35361.97 (   0.00%)    35881.94 *   1.47%*
Hmean     4096     37965.59 (   0.00%)    38702.01 *   1.94%*
Hmean     8192     48499.92 (   0.00%)    49530.62 *   2.13%*
Hmean     16384    54249.96 (   0.00%)    55937.24 *   3.11%*

For amusement purposes, here are two graphs showing CPU utilisation on
the 2-socket Haswell machine over time based on mpstat with the ordering
of the CPUs based on topology.

http://www.skynet.ie/~mel/postings/lbnuma-20191218/netperf-tcp-mpstat-baseline.png
http://www.skynet.ie/~mel/postings/lbnuma-20191218/netperf-tcp-mpstat-lbnuma-v1r1.png

The lines on the left match up CPUs that are HT siblings or on the same
node. The machine has only one L3 cache per NUMA node or that would also
be shown.  It should be very clear from the images that the baseline
kernel spread the load with lighter utilisation across nodes while the
patched kernel had heavy utilisation of fewer CPUs on one node.

Hackbench generally shows good results across machines with some
differences depending on whether processes or threads are used as well as
pipes or sockets.  This is the *worst* result from the 2-socket Haswell
machine.

2-socket Haswell machine 48 core, HT enabled
hackbench-process-pipes -- mmtests config config-scheduler-unbound
                           5.5.0-rc1              5.5.0-rc1
                     	    baseline              lbnuma-v1
Amean     1        1.2580 (   0.00%)      1.2393 (   1.48%)
Amean     4        5.3293 (   0.00%)      5.2683 *   1.14%*
Amean     7        8.9067 (   0.00%)      8.7130 *   2.17%*
Amean     12      14.9577 (   0.00%)     14.5773 *   2.54%*
Amean     21      25.9570 (   0.00%)     25.6657 *   1.12%*
Amean     30      37.7287 (   0.00%)     37.1277 *   1.59%*
Amean     48      61.6757 (   0.00%)     60.0433 *   2.65%*
Amean     79     100.4740 (   0.00%)     98.4507 (   2.01%)
Amean     110    141.2450 (   0.00%)    136.8900 *   3.08%*
Amean     141    179.7747 (   0.00%)    174.5110 *   2.93%*
Amean     172    221.0700 (   0.00%)    214.7857 *   2.84%*
Amean     192    245.2007 (   0.00%)    238.3680 *   2.79%*

An earlier prototype of the patch showed major regressions for NAS C-class
when running with only half of the available CPUs -- 20-30% performance
hits were measured at the time. With this version of the patch, the impact
is marginal.

NAS-C class OMP -- mmtests config hpc-nas-c-class-omp-half
                     	     baseline              lbnuma-v1
Amean     bt.C       64.29 (   0.00%)       70.31 *  -9.36%*
Amean     cg.C       26.33 (   0.00%)       25.73 (   2.31%)
Amean     ep.C       10.26 (   0.00%)       10.27 (  -0.10%)
Amean     ft.C       17.98 (   0.00%)       19.03 (  -5.84%)
Amean     is.C        0.99 (   0.00%)        0.99 (   0.40%)
Amean     lu.C       51.72 (   0.00%)       49.11 (   5.04%)
Amean     mg.C        8.12 (   0.00%)        8.13 (  -0.15%)
Amean     sp.C       82.76 (   0.00%)       84.52 (  -2.13%)
Amean     ua.C       58.64 (   0.00%)       57.57 (   1.82%)

There is some impact but there is a degree of variability and the ones
showing impact are mainly workloads that are mostly parallelised and
communicate infrequently between tasks. It's a corner case where
the workload benefits heavily from spreading wide and early, which is
not common. This is intended to illustrate the worst case measured.

In general, the patch simply seeks to avoid unnecessary cross-node
migrations when a machine is lightly loaded but shows benefits for other
workloads. While tests are still running, so far it seems to benefit
light-utilisation smaller workloads on large machines and does not appear
to do any harm to larger or parallelised workloads.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 38 +++++++++++++++++++++++++++++++++-----
 1 file changed, 33 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 08a233e97a01..1dc8c7800fc0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8637,10 +8637,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	/*
 	 * Try to use spare capacity of local group without overloading it or
 	 * emptying busiest.
-	 * XXX Spreading tasks across NUMA nodes is not always the best policy
-	 * and special care should be taken for SD_NUMA domain level before
-	 * spreading the tasks. For now, load_balance() fully relies on
-	 * NUMA_BALANCING and fbq_classify_group/rq to override the decision.
 	 */
 	if (local->group_type == group_has_spare) {
 		if (busiest->group_type > group_fully_busy) {
@@ -8680,7 +8676,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 			env->migration_type = migrate_task;
 			lsub_positive(&nr_diff, local->sum_nr_running);
 			env->imbalance = nr_diff >> 1;
-			return;
+			goto out_spare;
 		}
 
 		/*
@@ -8690,6 +8686,38 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		env->migration_type = migrate_task;
 		env->imbalance = max_t(long, 0, (local->idle_cpus -
 						 busiest->idle_cpus) >> 1);
+
+out_spare:
+		/*
+		 * Whether balancing the number of running tasks or the number
+		 * of idle CPUs, consider allowing some degree of imbalance if
+		 * migrating between NUMA domains.
+		 */
+		if (env->sd->flags & SD_NUMA) {
+			unsigned int imbalance_adj, imbalance_max;
+
+			/*
+			 * imbalance_adj is the allowable degree of imbalance
+			 * to exist between two NUMA domains. It's calculated
+			 * relative to imbalance_pct with a minimum of two
+			 * tasks or idle CPUs.
+			 */
+			imbalance_adj = (busiest->group_weight *
+				(env->sd->imbalance_pct - 100) / 100) >> 1;
+			imbalance_adj = max(imbalance_adj, 2U);
+
+			/*
+			 * Ignore imbalance unless busiest sd is close to 50%
+			 * utilisation. At that point balancing for memory
+			 * bandwidth and potentially avoiding unnecessary use
+			 * of HT siblings is as relevant as memory locality.
+			 */
+			imbalance_max = (busiest->group_weight >> 1) - imbalance_adj;
+			if (env->imbalance <= imbalance_adj &&
+			    busiest->sum_nr_running < imbalance_max) {
+				env->imbalance = 0;
+			}
+		}
 		return;
 	}
 

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains
  2019-12-18 15:44 [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains Mel Gorman
@ 2019-12-18 18:50 ` Valentin Schneider
  2019-12-18 22:50   ` Mel Gorman
  2019-12-19 10:02   ` Peter Zijlstra
  2019-12-18 18:54 ` Valentin Schneider
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 18+ messages in thread
From: Valentin Schneider @ 2019-12-18 18:50 UTC (permalink / raw)
  To: Mel Gorman, Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, pauld, srikar, quentin.perret,
	dietmar.eggemann, Morten.Rasmussen, hdanton, parth, riel, LKML

Hi Mel,

On 18/12/2019 15:44, Mel Gorman wrote:
> The CPU load balancer balances between different domains to spread load
> and strives to have equal balance everywhere. Communicating tasks can
> migrate so they are topologically close to each other but these decisions
> are independent. On a lightly loaded NUMA machine, two communicating tasks
> pulled together at wakeup time can be pushed apart by the load balancer.
> In isolation, the load balancer decision is fine but it ignores the tasks
> data locality and the wakeup/LB paths continually conflict. NUMA balancing
> is also a factor but it also simply conflicts with the load balancer.
> 
> This patch allows a degree of imbalance to exist between NUMA domains
> based on the imbalance_pct defined by the scheduler domain to take into
> account that data locality is also important. This slight imbalance is
> allowed until the scheduler domain reaches almost 50% utilisation at which
> point other factors like HT utilisation and memory bandwidth come into
> play. While not commented upon in the code, the cutoff is important for
> memory-bound parallelised non-communicating workloads that do not fully
> utilise the entire machine. This is not necessarily the best universal
> cut-off point but it appeared appropriate for a variety of workloads
> and machines.
> 
> The most obvious impact is on netperf TCP_STREAM -- two simple
> communicating tasks with some softirq offloaded depending on the
> transmission rate.
> 

<snip>

> In general, the patch simply seeks to avoid unnecessarily cross-node
> migrations when a machine is lightly loaded but shows benefits for other
> workloads. While tests are still running, so far it seems to benefit
> light-utilisation smaller workloads on large machines and does not appear
> to do any harm to larger or parallelised workloads.
> 

Thanks for the detailed testing, I haven't digested it entirely yet but I
appreciate the effort.

> @@ -8690,6 +8686,38 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>  		env->migration_type = migrate_task;
>  		env->imbalance = max_t(long, 0, (local->idle_cpus -
>  						 busiest->idle_cpus) >> 1);
> +
> +out_spare:
> +		/*
> +		 * Whether balancing the number of running tasks or the number
> +		 * of idle CPUs, consider allowing some degree of imbalance if
> +		 * migrating between NUMA domains.
> +		 */
> +		if (env->sd->flags & SD_NUMA) {
> +			unsigned int imbalance_adj, imbalance_max;
> +
> +			/*
> +			 * imbalance_adj is the allowable degree of imbalance
> +			 * to exist between two NUMA domains. It's calculated
> +			 * relative to imbalance_pct with a minimum of two
> +			 * tasks or idle CPUs.
> +			 */
> +			imbalance_adj = (busiest->group_weight *
> +				(env->sd->imbalance_pct - 100) / 100) >> 1;

IIRC imbalance_pct for NUMA domains uses the default 125, so I read this as
"allow an imbalance of 1 task per 8 CPU in the source group" (just making
sure I follow).

> +			imbalance_adj = max(imbalance_adj, 2U);
> +
> +			/*
> +			 * Ignore imbalance unless busiest sd is close to 50%
> +			 * utilisation. At that point balancing for memory
> +			 * bandwidth and potentially avoiding unnecessary use
> +			 * of HT siblings is as relevant as memory locality.
> +			 */
> +			imbalance_max = (busiest->group_weight >> 1) - imbalance_adj;
> +			if (env->imbalance <= imbalance_adj &&
> +			    busiest->sum_nr_running < imbalance_max) {

The code does "unless busiest group has half as many runnable tasks (or more)
as it has CPUs (modulo the adj thing)", is that what you mean by "unless
busiest sd is close to 50% utilisation" in the comment? It's somewhat
different IMO.

> +				env->imbalance = 0;
> +			}
> +		}
>  		return;
>  	}
>  
> 
I'm quite sure you have reasons to have written it that way, but I was
hoping we could squash it down to something like:
---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 08a233e97a01..f05d09a8452e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8680,16 +8680,27 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 			env->migration_type = migrate_task;
 			lsub_positive(&nr_diff, local->sum_nr_running);
 			env->imbalance = nr_diff >> 1;
-			return;
+		} else {
+
+			/*
+			 * If there is no overload, we just want to even the number of
+			 * idle cpus.
+			 */
+			env->migration_type = migrate_task;
+			env->imbalance = max_t(long, 0, (local->idle_cpus -
+							 busiest->idle_cpus) >> 1);
 		}
 
 		/*
-		 * If there is no overload, we just want to even the number of
-		 * idle cpus.
+		 * Allow for a small imbalance between NUMA groups; don't do any
+		 * of it if there is at least half as many tasks / busy CPUs as
+		 * there are available CPUs in the busiest group
 		 */
-		env->migration_type = migrate_task;
-		env->imbalance = max_t(long, 0, (local->idle_cpus -
-						 busiest->idle_cpus) >> 1);
+		if (env->sd->flags & SD_NUMA &&
+		    (busiest->sum_nr_running < busiest->group_weight >> 1) &&
+		    (env->imbalance < busiest->group_weight * (env->sd->imbalance_pct - 100) / 100))
+				env->imbalance = 0;
+
 		return;
 	}
 

* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains
  2019-12-18 18:50 ` Valentin Schneider
@ 2019-12-18 22:50   ` Mel Gorman
  2019-12-19 11:56     ` Valentin Schneider
  2019-12-19 10:02   ` Peter Zijlstra
  1 sibling, 1 reply; 18+ messages in thread
From: Mel Gorman @ 2019-12-18 22:50 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Vincent Guittot, Ingo Molnar, Peter Zijlstra, pauld, srikar,
	quentin.perret, dietmar.eggemann, Morten.Rasmussen, hdanton,
	parth, riel, LKML

On Wed, Dec 18, 2019 at 06:50:52PM +0000, Valentin Schneider wrote:
> > In general, the patch simply seeks to avoid unnecessarily cross-node
> > migrations when a machine is lightly loaded but shows benefits for other
> > workloads. While tests are still running, so far it seems to benefit
> > light-utilisation smaller workloads on large machines and does not appear
> > to do any harm to larger or parallelised workloads.
> > 
> 
> Thanks for the detailed testing, I haven't digested it entirely yet but I
> appreciate the effort.
> 

No problem, this is one of those patches where it's best to test a bunch
of loads and machines. I haven't presented it all because the changelog
would be beyond ridiculous.

> > @@ -8690,6 +8686,38 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> >  		env->migration_type = migrate_task;
> >  		env->imbalance = max_t(long, 0, (local->idle_cpus -
> >  						 busiest->idle_cpus) >> 1);
> > +
> > +out_spare:
> > +		/*
> > +		 * Whether balancing the number of running tasks or the number
> > +		 * of idle CPUs, consider allowing some degree of imbalance if
> > +		 * migrating between NUMA domains.
> > +		 */
> > +		if (env->sd->flags & SD_NUMA) {
> > +			unsigned int imbalance_adj, imbalance_max;
> > +
> > +			/*
> > +			 * imbalance_adj is the allowable degree of imbalance
> > +			 * to exist between two NUMA domains. It's calculated
> > +			 * relative to imbalance_pct with a minimum of two
> > +			 * tasks or idle CPUs.
> > +			 */
> > +			imbalance_adj = (busiest->group_weight *
> > +				(env->sd->imbalance_pct - 100) / 100) >> 1;
> 
> IIRC imbalance_pct for NUMA domains uses the default 125, so I read this as
> "allow an imbalance of 1 task per 8 CPU in the source group" (just making
> sure I follow).
> 

That is how it works out in most cases. As imbalance_pct can be anything
in theory, I didn't specify what it usually breaks down to. The >> 1 is
to go "half way" similar to how imbalance itself is calculated.

> > +			imbalance_adj = max(imbalance_adj, 2U);
> > +
> > +			/*
> > +			 * Ignore imbalance unless busiest sd is close to 50%
> > +			 * utilisation. At that point balancing for memory
> > +			 * bandwidth and potentially avoiding unnecessary use
> > +			 * of HT siblings is as relevant as memory locality.
> > +			 */
> > +			imbalance_max = (busiest->group_weight >> 1) - imbalance_adj;
> > +			if (env->imbalance <= imbalance_adj &&
> > +			    busiest->sum_nr_running < imbalance_max) {
> 
> The code does "unless busiest group has half as many runnable tasks (or more)
> as it has CPUs (modulo the adj thing)", is that what you mean by "unless
> busiest sd is close to 50% utilisation" in the comment? It's somewhat
> different IMO.
> 

Crap, yes. At the time of writing, I was thinking in terms of running
tasks that were fully active, hence the misleading comment.

> > +				env->imbalance = 0;
> > +			}
> > +		}
> >  		return;
> >  	}
> >  
> > 
> I'm quite sure you have reasons to have written it that way, but I was
> hoping we could squash it down to something like:

I wrote it that way to make it clear exactly what has changed, the
thinking behind the checks and to avoid 80-col limits to make review
easier overall. It's a force of habit and I'm happy to reformat it as
you suggest except....

> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 08a233e97a01..f05d09a8452e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8680,16 +8680,27 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>  			env->migration_type = migrate_task;
>  			lsub_positive(&nr_diff, local->sum_nr_running);
>  			env->imbalance = nr_diff >> 1;
> -			return;
> +		} else {
> +
> +			/*
> +			 * If there is no overload, we just want to even the number of
> +			 * idle cpus.
> +			 */
> +			env->migration_type = migrate_task;
> +			env->imbalance = max_t(long, 0, (local->idle_cpus -
> +							 busiest->idle_cpus) >> 1);
>  		}
>  
>  		/*
> -		 * If there is no overload, we just want to even the number of
> -		 * idle cpus.
> +		 * Allow for a small imbalance between NUMA groups; don't do any
> +		 * of it if there is at least half as many tasks / busy CPUs as
> +		 * there are available CPUs in the busiest group
>  		 */
> -		env->migration_type = migrate_task;
> -		env->imbalance = max_t(long, 0, (local->idle_cpus -
> -						 busiest->idle_cpus) >> 1);
> +		if (env->sd->flags & SD_NUMA &&
> +		    (busiest->sum_nr_running < busiest->group_weight >> 1) &&

This last line is not exactly equivalent to what I wrote. It would need
to be

	(busiest->sum_nr_running < (busiest->group_weight >> 1) - imbalance_adj) &&

I can test as you suggest to see if it's roughly equivalent in terms of
performance. The intent was to have a cutoff just before we reached 50%
running tasks / busy CPUs.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains
  2019-12-18 15:44 [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains Mel Gorman
  2019-12-18 18:50 ` Valentin Schneider
  2019-12-18 18:54 ` Valentin Schneider
@ 2019-12-19  2:58 ` Rik van Riel
  2019-12-19  8:41   ` Mel Gorman
  2019-12-19 10:04 ` Peter Zijlstra
  2019-12-19 14:45 ` Vincent Guittot
  4 siblings, 1 reply; 18+ messages in thread
From: Rik van Riel @ 2019-12-19  2:58 UTC (permalink / raw)
  To: Mel Gorman, Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, pauld, valentin.schneider, srikar,
	quentin.perret, dietmar.eggemann, Morten.Rasmussen, hdanton,
	parth, LKML

On Wed, 2019-12-18 at 15:44 +0000, Mel Gorman wrote:

> +			/*
> +			 * Ignore imbalance unless busiest sd is close to 50%
> +			 * utilisation. At that point balancing for memory
> +			 * bandwidth and potentially avoiding unnecessary use
> +			 * of HT siblings is as relevant as memory locality.
> +			 */
> +			imbalance_max = (busiest->group_weight >> 1) - imbalance_adj;
> +			if (env->imbalance <= imbalance_adj &&
> +			    busiest->sum_nr_running < imbalance_max) {
> +				env->imbalance = 0;
> +			}
> +		}
>  		return;
>  	}

I can see how the 50% point is often great for HT,
but I wonder if that is also the case for SMT4 and
SMT8 systems...

-- 
All Rights Reversed.



* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains
  2019-12-19  2:58 ` Rik van Riel
@ 2019-12-19  8:41   ` Mel Gorman
  0 siblings, 0 replies; 18+ messages in thread
From: Mel Gorman @ 2019-12-19  8:41 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Vincent Guittot, Ingo Molnar, Peter Zijlstra, pauld,
	valentin.schneider, srikar, quentin.perret, dietmar.eggemann,
	Morten.Rasmussen, hdanton, parth, LKML

On Wed, Dec 18, 2019 at 09:58:01PM -0500, Rik van Riel wrote:
> On Wed, 2019-12-18 at 15:44 +0000, Mel Gorman wrote:
> 
> > +			/*
> > +			 * Ignore imbalance unless busiest sd is close to 50%
> > +			 * utilisation. At that point balancing for memory
> > +			 * bandwidth and potentially avoiding unnecessary use
> > +			 * of HT siblings is as relevant as memory locality.
> > +			 */
> > +			imbalance_max = (busiest->group_weight >> 1) - imbalance_adj;
> > +			if (env->imbalance <= imbalance_adj &&
> > +			    busiest->sum_nr_running < imbalance_max) {
> > +				env->imbalance = 0;
> > +			}
> > +		}
> >  		return;
> >  	}
> 
> I can see how the 50% point is often great for HT,
> but I wonder if that is also the case for SMT4 and
> SMT8 systems...
> 

Maybe, maybe not, but it's not the most important concern. The highlight
in the comment was about memory bandwidth and HT was simply an additional
concern. Ideally memory bandwidth and consumption would be taken into
account but we know nothing about either. Even if peak memory bandwidth
was known, the reference pattern matters a *lot*, which can be readily
illustrated by using STREAM and observing the different bandwidths for
different reference patterns. Similarly, while we might know which pages
were referenced, we do not know the bandwidth consumption without incurring
additional overhead with a PMU. Hence, it makes sense to at least hope
that the active tasks have similar memory bandwidth requirements and load
balance as normal when we are near 50% active tasks/busy CPUs. If
SMT4 or SMT8 have different requirements or it matters for memory
bandwidth then it would need to be carefully examined by someone with
access to such hardware to determine an arch-specific and maybe even a
per-CPU-family cutoff.

In the context of this patch, it unconditionally makes sense that, in the
basic case of two communicating tasks, they are not migrated cross-node at
wakeup and then again at load balance.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains
  2019-12-18 18:50 ` Valentin Schneider
  2019-12-18 22:50   ` Mel Gorman
@ 2019-12-19 10:02   ` Peter Zijlstra
  2019-12-19 11:46     ` Valentin Schneider
  2019-12-19 15:23     ` Mel Gorman
  1 sibling, 2 replies; 18+ messages in thread
From: Peter Zijlstra @ 2019-12-19 10:02 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Mel Gorman, Vincent Guittot, Ingo Molnar, pauld, srikar,
	quentin.perret, dietmar.eggemann, Morten.Rasmussen, hdanton,
	parth, riel, LKML

On Wed, Dec 18, 2019 at 06:50:52PM +0000, Valentin Schneider wrote:
> I'm quite sure you have reasons to have written it that way, but I was
> hoping we could squash it down to something like:
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 08a233e97a01..f05d09a8452e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8680,16 +8680,27 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>  			env->migration_type = migrate_task;
>  			lsub_positive(&nr_diff, local->sum_nr_running);
>  			env->imbalance = nr_diff >> 1;
> -			return;
> +		} else {
> +
> +			/*
> +			 * If there is no overload, we just want to even the number of
> +			 * idle cpus.
> +			 */
> +			env->migration_type = migrate_task;
> +			env->imbalance = max_t(long, 0, (local->idle_cpus -
> +							 busiest->idle_cpus) >> 1);
>  		}
>  
>  		/*
> -		 * If there is no overload, we just want to even the number of
> -		 * idle cpus.
> +		 * Allow for a small imbalance between NUMA groups; don't do any
> +		 * of it if there is at least half as many tasks / busy CPUs as
> +		 * there are available CPUs in the busiest group
>  		 */
> -		env->migration_type = migrate_task;
> -		env->imbalance = max_t(long, 0, (local->idle_cpus -
> -						 busiest->idle_cpus) >> 1);
> +		if (env->sd->flags & SD_NUMA &&
> +		    (busiest->sum_nr_running < busiest->group_weight >> 1) &&
> +		    (env->imbalance < busiest->group_weight * (env->sd->imbalance_pct - 100) / 100))

Note that this form allows avoiding the division. Every time I see that
/100 I'm thinking we should rename and make imbalance_pct a base-2
thing.

> +				env->imbalance = 0;
> +
>  		return;
>  	}
>  


* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains
  2019-12-18 15:44 [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains Mel Gorman
                   ` (2 preceding siblings ...)
  2019-12-19  2:58 ` Rik van Riel
@ 2019-12-19 10:04 ` Peter Zijlstra
  2019-12-19 14:45 ` Vincent Guittot
  4 siblings, 0 replies; 18+ messages in thread
From: Peter Zijlstra @ 2019-12-19 10:04 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Vincent Guittot, Ingo Molnar, pauld, valentin.schneider, srikar,
	quentin.perret, dietmar.eggemann, Morten.Rasmussen, hdanton,
	parth, riel, LKML

On Wed, Dec 18, 2019 at 03:44:02PM +0000, Mel Gorman wrote:
> @@ -8690,6 +8686,38 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>  		env->migration_type = migrate_task;
>  		env->imbalance = max_t(long, 0, (local->idle_cpus -
>  						 busiest->idle_cpus) >> 1);
> +
> +out_spare:
> +		/*
> +		 * Whether balancing the number of running tasks or the number
> +		 * of idle CPUs, consider allowing some degree of imbalance if
> +		 * migrating between NUMA domains.
> +		 */
> +		if (env->sd->flags & SD_NUMA) {
> +			unsigned int imbalance_adj, imbalance_max;
> +
> +			/*
> +			 * imbalance_adj is the allowable degree of imbalance
> +			 * to exist between two NUMA domains. It's calculated
> +			 * relative to imbalance_pct with a minimum of two
> +			 * tasks or idle CPUs.
> +			 */
> +			imbalance_adj = (busiest->group_weight *
> +				(env->sd->imbalance_pct - 100) / 100) >> 1;
> +			imbalance_adj = max(imbalance_adj, 2U);

The '2' here comes from a 'pair of communicating tasks' right? Perhaps
more clearly detail that in the comment, such that when we're looking at
this code again in a few years time, we're not left wondering wtf that 2
is about :-)

> +
> +			/*
> +			 * Ignore imbalance unless busiest sd is close to 50%
> +			 * utilisation. At that point balancing for memory
> +			 * bandwidth and potentially avoiding unnecessary use
> +			 * of HT siblings is as relevant as memory locality.
> +			 */
> +			imbalance_max = (busiest->group_weight >> 1) - imbalance_adj;
> +			if (env->imbalance <= imbalance_adj &&
> +			    busiest->sum_nr_running < imbalance_max) {
> +				env->imbalance = 0;
> +			}
> +		}
>  		return;
>  	}
>  
> 
> -- 
> Mel Gorman
> SUSE Labs


* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains
  2019-12-19 10:02   ` Peter Zijlstra
@ 2019-12-19 11:46     ` Valentin Schneider
  2019-12-19 14:23       ` Valentin Schneider
  2019-12-19 15:23     ` Mel Gorman
  1 sibling, 1 reply; 18+ messages in thread
From: Valentin Schneider @ 2019-12-19 11:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Vincent Guittot, Ingo Molnar, pauld, srikar,
	quentin.perret, dietmar.eggemann, Morten.Rasmussen, hdanton,
	parth, riel, LKML

On 19/12/2019 10:02, Peter Zijlstra wrote:
> On Wed, Dec 18, 2019 at 06:50:52PM +0000, Valentin Schneider wrote:
>> I'm quite sure you have reasons to have written it that way, but I was
>> hoping we could squash it down to something like:
>> ---
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 08a233e97a01..f05d09a8452e 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -8680,16 +8680,27 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>>  			env->migration_type = migrate_task;
>>  			lsub_positive(&nr_diff, local->sum_nr_running);
>>  			env->imbalance = nr_diff >> 1;
>> -			return;
>> +		} else {
>> +
>> +			/*
>> +			 * If there is no overload, we just want to even the number of
>> +			 * idle cpus.
>> +			 */
>> +			env->migration_type = migrate_task;
>> +			env->imbalance = max_t(long, 0, (local->idle_cpus -
>> +							 busiest->idle_cpus) >> 1);
>>  		}
>>  
>>  		/*
>> -		 * If there is no overload, we just want to even the number of
>> -		 * idle cpus.
>> +		 * Allow for a small imbalance between NUMA groups; don't do any
>> +		 * of it if there is at least half as many tasks / busy CPUs as
>> +		 * there are available CPUs in the busiest group
>>  		 */
>> -		env->migration_type = migrate_task;
>> -		env->imbalance = max_t(long, 0, (local->idle_cpus -
>> -						 busiest->idle_cpus) >> 1);
>> +		if (env->sd->flags & SD_NUMA &&
>> +		    (busiest->sum_nr_running < busiest->group_weight >> 1) &&
>> +		    (env->imbalance < busiest->group_weight * (env->sd->imbalance_pct - 100) / 100))
> 
> Note that this form allows avoiding the division. Every time I see that
> /100 I'm thinking we should rename and make imbalance_pct a base-2
> thing.
> 

Right, I kept the original form but we can turn that into

  env->imbalance * 100 < busiest->group_weight * (env->sd->imbalance_pct - 100)



As for the base-2 imbalance; I think you've mentioned that in the past.
Looking at check_cpu_capacity() as a lambda imbalance_pct user, we could
turn that from:

  rq->cpu_capacity * sd->imbalance_pct < rq->cpu_capacity_orig * 100

to:

  rq->cpu_capacity_orig - rq->cpu_capacity < rq->cpu_capacity_orig >> sd->imbalance_shift


And here we could just go with

  env->imbalance < busiest->group_weight >> sd->imbalance_shift


As for picking values, right now we have

  125 (default) / 117 (LLC domain) / 110 (SMT domain)

We could have

  >> 2 (25%), >> 3 (12.5%), >> 4 (6.25%).

It's not strictly equivalent but IMO the whole imbalance_pct thing isn't
very precise anyway; just needs to be good enough on a sufficient number of
topologies.
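
(For comparison, assuming those substitutions: 125 is a 25% margin and >> 2
is also 25%, but 117 is a 17% margin vs 12.5% for >> 3, and 110 is 10% vs
6.25% for >> 4, so only the default maps exactly and the LLC/SMT margins
get somewhat tighter.)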



>> +				env->imbalance = 0;
>> +
>>  		return;
>>  	}
>>  


* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains
  2019-12-18 22:50   ` Mel Gorman
@ 2019-12-19 11:56     ` Valentin Schneider
  0 siblings, 0 replies; 18+ messages in thread
From: Valentin Schneider @ 2019-12-19 11:56 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Vincent Guittot, Ingo Molnar, Peter Zijlstra, pauld, srikar,
	quentin.perret, dietmar.eggemann, Morten.Rasmussen, hdanton,
	parth, riel, LKML

On 18/12/2019 22:50, Mel Gorman wrote:
>> I'm quite sure you have reasons to have written it that way, but I was
>> hoping we could squash it down to something like:
> 
> I wrote it that way to make it clear exactly what has changed, the
> thinking behind the checks and to avoid 80-col limits to make review
> easier overall. It's a force of habit and I'm happy to reformat it as
> you suggest except....
> 

I tend to disregard the 80 col limit, so I might not be the best example
here :D

>> ---
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 08a233e97a01..f05d09a8452e 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -8680,16 +8680,27 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>>  			env->migration_type = migrate_task;
>>  			lsub_positive(&nr_diff, local->sum_nr_running);
>>  			env->imbalance = nr_diff >> 1;
>> -			return;
>> +		} else {
>> +
>> +			/*
>> +			 * If there is no overload, we just want to even the number of
>> +			 * idle cpus.
>> +			 */
>> +			env->migration_type = migrate_task;
>> +			env->imbalance = max_t(long, 0, (local->idle_cpus -
>> +							 busiest->idle_cpus) >> 1);
>>  		}
>>  
>>  		/*
>> -		 * If there is no overload, we just want to even the number of
>> -		 * idle cpus.
>> +		 * Allow for a small imbalance between NUMA groups; don't do any
>> +		 * of it if there is at least half as many tasks / busy CPUs as
>> +		 * there are available CPUs in the busiest group
>>  		 */
>> -		env->migration_type = migrate_task;
>> -		env->imbalance = max_t(long, 0, (local->idle_cpus -
>> -						 busiest->idle_cpus) >> 1);
>> +		if (env->sd->flags & SD_NUMA &&
>> +		    (busiest->sum_nr_running < busiest->group_weight >> 1) &&
> 
> This last line is not exactly equivalent to what I wrote. It would need
> to be
> 
> 	(busiest->sum_nr_running < (busiest->group_weight >> 1) - imbalance_adj) &&
> 

Right, I was implicitly suggesting that maybe we could forgo the
imbalance_adj computation and just roll with the imbalance_pct (with perhaps
an extra shift here and there). IMO the important thing here is the
half-way cutoff.

> I can test as you suggest to see if it's roughly equivalent in terms of
> performance. The intent was to have a cutoff just before we reached 50%
> running tasks / busy CPUs.
> 

I think that cutoff makes sense; it's also important that it isn't purely
busy CPU-based because we're not guaranteed to have 1 task per CPU (due to
affinity or otherwise), so I think the "half as many tasks as available CPUs"
thing has some merit.


* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains
  2019-12-19 11:46     ` Valentin Schneider
@ 2019-12-19 14:23       ` Valentin Schneider
  0 siblings, 0 replies; 18+ messages in thread
From: Valentin Schneider @ 2019-12-19 14:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Vincent Guittot, Ingo Molnar, pauld, srikar,
	quentin.perret, dietmar.eggemann, Morten.Rasmussen, hdanton,
	parth, riel, LKML

On 19/12/2019 11:46, Valentin Schneider wrote:
> As for picking values, right now we have
> 
>   125 (default) / 117 (LLC domain) / 110 (SMT domain)
> 
> We could have
> 
>   >> 2 (25%), >> 3 (12.5%), >> 4 (6.25%).
> 


Hmph, I see that task_numa_migrate() starts with a slightly different value
(112), and does the same halving pattern as wake_affine_weight():

  x = 100 + (sd->imbalance_pct - 100) / 2;

The 112 could use >> 3 (12.5%); the halving is just an extra shift with the
suggested changes.
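
(Worked out with the default 125, assuming that is the value in play:
x = 100 + (125 - 100) / 2 = 112, i.e. a 12% margin, which >> 3 (12.5%)
approximates closely.)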

> It's not strictly equivalent but IMO the whole imbalance_pct thing isn't
> very precise anyway; just needs to be good enough on a sufficient number of
> topologies.
> 
> 
> 
>>> +				env->imbalance = 0;
>>> +
>>>  		return;
>>>  	}
>>>  


* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains
  2019-12-18 15:44 [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains Mel Gorman
                   ` (3 preceding siblings ...)
  2019-12-19 10:04 ` Peter Zijlstra
@ 2019-12-19 14:45 ` Vincent Guittot
  2019-12-19 15:16   ` Valentin Schneider
                     ` (2 more replies)
  4 siblings, 3 replies; 18+ messages in thread
From: Vincent Guittot @ 2019-12-19 14:45 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Peter Zijlstra, pauld, valentin.schneider, srikar,
	quentin.perret, dietmar.eggemann, Morten.Rasmussen, hdanton,
	parth, riel, LKML

Hi Mel,

Thanks for looking at this NUMA locality vs spreading tasks point.

On Wednesday 18 Dec 2019 at 15:44:02 (+0000), Mel Gorman wrote:
> The CPU load balancer balances between different domains to spread load
> and strives to have equal balance everywhere. Communicating tasks can
> migrate so they are topologically close to each other but these decisions
> are independent. On a lightly loaded NUMA machine, two communicating tasks
> pulled together at wakeup time can be pushed apart by the load balancer.
> In isolation, the load balancer decision is fine but it ignores the tasks
> data locality and the wakeup/LB paths continually conflict. NUMA balancing
> is also a factor but it also simply conflicts with the load balancer.
> 

[snip]

> There is some impact but there is a degree of variability and the ones
> showing impact are mainly workloads that are mostly parallelised
> and communicate infrequently between tests. It's a corner case where
> the workload benefits heavily from spreading wide and early which is
> not common. This is intended to illustrate the worst case measured.
> 
> In general, the patch simply seeks to avoid unnecessarily cross-node
> migrations when a machine is lightly loaded but shows benefits for other
> workloads. While tests are still running, so far it seems to benefit
> light-utilisation smaller workloads on large machines and does not appear
> to do any harm to larger or parallelised workloads.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  kernel/sched/fair.c | 38 +++++++++++++++++++++++++++++++++-----
>  1 file changed, 33 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 08a233e97a01..1dc8c7800fc0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8637,10 +8637,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>  	/*
>  	 * Try to use spare capacity of local group without overloading it or
>  	 * emptying busiest.
> -	 * XXX Spreading tasks across NUMA nodes is not always the best policy
> -	 * and special care should be taken for SD_NUMA domain level before
> -	 * spreading the tasks. For now, load_balance() fully relies on
> -	 * NUMA_BALANCING and fbq_classify_group/rq to override the decision.
>  	 */
>  	if (local->group_type == group_has_spare) {
>  		if (busiest->group_type > group_fully_busy) {
> @@ -8680,7 +8676,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>  			env->migration_type = migrate_task;
>  			lsub_positive(&nr_diff, local->sum_nr_running);
>  			env->imbalance = nr_diff >> 1;
> -			return;
> +			goto out_spare;

Why are you doing this only for the prefer_sibling case? That's probably the
default case on most NUMA systems but you should also consider the other
cases too.

So you should probably add your

> +                * Whether balancing the number of running tasks or the number
> +                * of idle CPUs, consider allowing some degree of imbalance if
> +                * migrating between NUMA domains.
> +                */
> +               if (env->sd->flags & SD_NUMA) {
> +                       unsigned int imbalance_adj, imbalance_max;

...

> +               }

before the prefer_sibling case:

		if (busiest->group_weight == 1 || sds->prefer_sibling) {
			unsigned int nr_diff = busiest->sum_nr_running;
			/*
			 * When prefer sibling, evenly spread running tasks on
			 * groups.
			 */


>
>  		}
>  
>  		/*
> @@ -8690,6 +8686,38 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>  		env->migration_type = migrate_task;
>  		env->imbalance = max_t(long, 0, (local->idle_cpus -
>  						 busiest->idle_cpus) >> 1);
> +
> +out_spare:
> +		/*
> +		 * Whether balancing the number of running tasks or the number
> +		 * of idle CPUs, consider allowing some degree of imbalance if
> +		 * migrating between NUMA domains.
> +		 */
> +		if (env->sd->flags & SD_NUMA) {
> +			unsigned int imbalance_adj, imbalance_max;
> +
> +			/*
> +			 * imbalance_adj is the allowable degree of imbalance
> +			 * to exist between two NUMA domains. It's calculated
> +			 * relative to imbalance_pct with a minimum of two
> +			 * tasks or idle CPUs.
> +			 */
> +			imbalance_adj = (busiest->group_weight *
> +				(env->sd->imbalance_pct - 100) / 100) >> 1;
> +			imbalance_adj = max(imbalance_adj, 2U);
> +
> +			/*
> +			 * Ignore imbalance unless busiest sd is close to 50%
> +			 * utilisation. At that point balancing for memory
> +			 * bandwidth and potentially avoiding unnecessary use
> +			 * of HT siblings is as relevant as memory locality.
> +			 */
> +			imbalance_max = (busiest->group_weight >> 1) - imbalance_adj;
> +			if (env->imbalance <= imbalance_adj &&
> +			    busiest->sum_nr_running < imbalance_max) {i

Shouldn't you consider the number of busiest->idle_cpus instead of
busiest->sum_nr_running?

and you could simplify by 


	if ((env->sd->flags & SD_NUMA) &&
		((100 * busiest->group_weight) <= (env->sd->imbalance_pct * (busiest->idle_cpus << 1)))) {
			env->imbalance = 0;
			return;
	}

And otherwise it will continue with the current path
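
(A rough reading of that condition, assuming the default NUMA imbalance_pct
of 125: 100 * group_weight <= 250 * idle_cpus, i.e. the imbalance is ignored
while at least 40% of the busiest group's CPUs are idle.)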

Also I'm a bit worried about using a 50% threshold; it looks a bit like a
heuristic which can change depending on the platform and the use cases that
run on the system.

In fact I was hoping that we could use the numa_preferred_nid? During the
detach of tasks, we don't detach a task if busiest has spare capacity and
the preferred_nid of the task is busiest.

I'm going to run some tests to see the impact on my platform 

Regards,
Vincent


> +				env->imbalance = 0;
> +			}
> +		}
>  		return;
>  	}
>  
> 
> -- 
> Mel Gorman
> SUSE Labs


* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains
  2019-12-19 14:45 ` Vincent Guittot
@ 2019-12-19 15:16   ` Valentin Schneider
  2019-12-19 15:18   ` Mel Gorman
  2019-12-20 13:00   ` Srikar Dronamraju
  2 siblings, 0 replies; 18+ messages in thread
From: Valentin Schneider @ 2019-12-19 15:16 UTC (permalink / raw)
  To: Vincent Guittot, Mel Gorman
  Cc: Ingo Molnar, Peter Zijlstra, pauld, srikar, quentin.perret,
	dietmar.eggemann, Morten.Rasmussen, hdanton, parth, riel, LKML

On 19/12/2019 14:45, Vincent Guittot wrote:
>> @@ -8680,7 +8676,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>>  			env->migration_type = migrate_task;
>>  			lsub_positive(&nr_diff, local->sum_nr_running);
>>  			env->imbalance = nr_diff >> 1;
>> -			return;
>> +			goto out_spare;
> 
> Why are you doing this only for prefer_sibling case ? That's probably the default case of most of numa system but you should also consider others case too.
> 

I got confused by that as well but it's not just prefer_sibling actually;
there are cases where we enter the group_has_spare branch but none of its
nested if blocks, so we fall through to out_spare.

>> @@ -8690,6 +8686,38 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>>  		env->migration_type = migrate_task;
>>  		env->imbalance = max_t(long, 0, (local->idle_cpus -
>>  						 busiest->idle_cpus) >> 1);
>> +
>> +out_spare:
>> +		/*
>> +		 * Whether balancing the number of running tasks or the number
>> +		 * of idle CPUs, consider allowing some degree of imbalance if
>> +		 * migrating between NUMA domains.
>> +		 */
>> +		if (env->sd->flags & SD_NUMA) {
>> +			unsigned int imbalance_adj, imbalance_max;
>> +
>> +			/*
>> +			 * imbalance_adj is the allowable degree of imbalance
>> +			 * to exist between two NUMA domains. It's calculated
>> +			 * relative to imbalance_pct with a minimum of two
>> +			 * tasks or idle CPUs.
>> +			 */
>> +			imbalance_adj = (busiest->group_weight *
>> +				(env->sd->imbalance_pct - 100) / 100) >> 1;
>> +			imbalance_adj = max(imbalance_adj, 2U);
>> +
>> +			/*
>> +			 * Ignore imbalance unless busiest sd is close to 50%
>> +			 * utilisation. At that point balancing for memory
>> +			 * bandwidth and potentially avoiding unnecessary use
>> +			 * of HT siblings is as relevant as memory locality.
>> +			 */
>> +			imbalance_max = (busiest->group_weight >> 1) - imbalance_adj;
>> +			if (env->imbalance <= imbalance_adj &&
>> +			    busiest->sum_nr_running < imbalance_max) {
> 
> Shouldn't you consider the number of busiest->idle_cpus instead of the busiest->sum_nr_running ?
> 

I think it's better to hinge the cutoff on the busiest->sum_nr_running than
on busiest->idle_cpus. If you're balancing between big NUMA groups, you
could end up with a busiest->group_type == group_has_spare despite having
*some* of its CPUs overloaded (but still with
sg->sum_nr_running > sg->group_weight; simply because there's tons of CPUs).

> and you could simplify by 
> 
> 
> 	if ((env->sd->flags & SD_NUMA) &&
> 		((100 * busiest->group_weight) <= (env->sd->imbalance_pct * (busiest->idle_cpus << 1)))) {
> 			env->imbalance = 0;
> 			return;
> 	}
> 
> And otherwise it will continue with the current path
> 
> Also I'm a bit worry about using a 50% threshold that look a bit like a
> heuristic which can change depending of platform and the UCs that run of the
> system.
> 
> In fact i was hoping that we could use the numa_preferred_nid ? During the
> detach of tasks, we don't detach the task if busiest has spare capacity and
> preferred_nid of the task is busiest.
> 
> I'm going to run some tests to see the impact on my platform 
> 
> Regards,
> Vincent
> 
> 
>> +				env->imbalance = 0;
>> +			}
>> +		}
>>  		return;
>>  	}
>>  
>>
>> -- 
>> Mel Gorman
>> SUSE Labs


* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains
  2019-12-19 14:45 ` Vincent Guittot
  2019-12-19 15:16   ` Valentin Schneider
@ 2019-12-19 15:18   ` Mel Gorman
  2019-12-19 15:41     ` Vincent Guittot
  2019-12-20 13:00   ` Srikar Dronamraju
  2 siblings, 1 reply; 18+ messages in thread
From: Mel Gorman @ 2019-12-19 15:18 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, pauld, valentin.schneider, srikar,
	quentin.perret, dietmar.eggemann, Morten.Rasmussen, hdanton,
	parth, riel, LKML

On Thu, Dec 19, 2019 at 03:45:39PM +0100, Vincent Guittot wrote:
> Hi Mel,
> 
> Thanks for looking at this NUMA locality vs spreading tasks point.
> 

No problem.

> > @@ -8680,7 +8676,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> >  			env->migration_type = migrate_task;
> >  			lsub_positive(&nr_diff, local->sum_nr_running);
> >  			env->imbalance = nr_diff >> 1;
> > -			return;
> > +			goto out_spare;
> 
> Why are you doing this only for the prefer_sibling case? That's probably the default case for most NUMA systems but you should also consider other cases too.
> 

It's the common case for NUMA machines I'm aware of and from the
perspective of allowing a slight imbalance when there are spare CPUs, I
felt it was the same whether we were considering idle CPUs or the number
of tasks running.

The prefer_sibling case applies to the children and the corner case is
that balancing NUMA domains takes into account whether the MC domain
prefers siblings which is a bit odd. I believe, but don't know, that the
reasoning may have been to spread load for memory bandwidth usage.

> So you should probably add your
> 
> > +                * Whether balancing the number of running tasks or the number
> > +                * of idle CPUs, consider allowing some degree of imbalance if
> > +                * migrating between NUMA domains.
> > +                */
> > +               if (env->sd->flags & SD_NUMA) {
> > +                       unsigned int imbalance_adj, imbalance_max;
> 
> ...
> 
> > +               }
> 
> before the prefer_sibling case :
> 
> 		if (busiest->group_weight == 1 || sds->prefer_sibling) {
> 			unsigned int nr_diff = busiest->sum_nr_running;
> 			/*
> 			 * When prefer sibling, evenly spread running tasks on
> 			 * groups.
> 			 */
> 

I don't understand. If I move SD_NUMA checks above the imbalance
calculation, how do I know whether the imbalance should be ignored?

> 
> >
> >  		}
> >  
> >  		/*
> > @@ -8690,6 +8686,38 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> >  		env->migration_type = migrate_task;
> >  		env->imbalance = max_t(long, 0, (local->idle_cpus -
> >  						 busiest->idle_cpus) >> 1);
> > +
> > +out_spare:
> > +		/*
> > +		 * Whether balancing the number of running tasks or the number
> > +		 * of idle CPUs, consider allowing some degree of imbalance if
> > +		 * migrating between NUMA domains.
> > +		 */
> > +		if (env->sd->flags & SD_NUMA) {
> > +			unsigned int imbalance_adj, imbalance_max;
> > +
> > +			/*
> > +			 * imbalance_adj is the allowable degree of imbalance
> > +			 * to exist between two NUMA domains. It's calculated
> > +			 * relative to imbalance_pct with a minimum of two
> > +			 * tasks or idle CPUs.
> > +			 */
> > +			imbalance_adj = (busiest->group_weight *
> > +				(env->sd->imbalance_pct - 100) / 100) >> 1;
> > +			imbalance_adj = max(imbalance_adj, 2U);
> > +
> > +			/*
> > +			 * Ignore imbalance unless busiest sd is close to 50%
> > +			 * utilisation. At that point balancing for memory
> > +			 * bandwidth and potentially avoiding unnecessary use
> > +			 * of HT siblings is as relevant as memory locality.
> > +			 */
> > +			imbalance_max = (busiest->group_weight >> 1) - imbalance_adj;
> > +			if (env->imbalance <= imbalance_adj &&
> > +			    busiest->sum_nr_running < imbalance_max) {
> 
> Shouldn't you consider the number of busiest->idle_cpus instead of the busiest->sum_nr_running ?
> 

Why? CPU affinity could have stacked multiple tasks on one CPU whereas
I'm looking for a proxy hint on the amount of bandwidth required.
sum_nr_running does not give me an accurate estimate but it's better than
idle cpus.

> and you could simplify by 
> 
> 
> 	if ((env->sd->flags & SD_NUMA) &&
> 		((100 * busiest->group_weight) <= (env->sd->imbalance_pct * (busiest->idle_cpus << 1)))) {
> 			env->imbalance = 0;
> 			return;
> 	}
> 
> And otherwise it will continue with the current path
> 

I ended up doing something similar to this in v2 but it's a bit more
expanded so I can put in comments on why the comparisons are the way
they are. The multiplications are in the slow path.

> Also I'm a bit worried about using a 50% threshold that looks a bit like a
> heuristic which can change depending on the platform and the UCs that run on
> the system.
> 

UCs?

And yes, it's a heuristic. In this case, I'm as concerned about memory
bandwidth availability as I am about improper locality due to aggressive
balancing. We do not know the available memory bandwidth and we do not
know how much bandwidth the tasks require, so 50% was as good a threshold
as any. I do not know of any way that can cheaply measure either bandwidth
usage (PMUs are not cheap) or available bandwidth (theoretical bandwidth !=
actual bandwidth).

In an earlier version that I never posted, I had no cutoff at all and
NAS took a roughly 30% performance penalty across all computational
kernels. Debug tracing led me to this cutoff and running a battery
of workloads led me to believe that it was a reasonable cutoff. It's
important.
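
As an aside, the arithmetic works out as follows. This is a standalone
userspace sketch of the v1 cutoff, assuming an imbalance_pct of 125 and
a 48-CPU NUMA group to match the 2-socket Haswell; it is illustrative
only, not kernel code.

#include <stdio.h>

/* Model of the v1 SD_NUMA cutoff; all values are assumptions. */
int main(void)
{
	unsigned int group_weight = 48;		/* assumed CPUs per NUMA node */
	unsigned int imbalance_pct = 125;	/* assumed SD_NUMA imbalance_pct */
	unsigned int imbalance_adj, imbalance_max;

	imbalance_adj = (group_weight * (imbalance_pct - 100) / 100) >> 1;
	if (imbalance_adj < 2)
		imbalance_adj = 2;

	/* Only tolerate the imbalance while well below ~50% utilisation. */
	imbalance_max = (group_weight >> 1) - imbalance_adj;

	/* Prints 6 and 18 for the assumed values. */
	printf("allow imbalance <= %u while sum_nr_running < %u\n",
	       imbalance_adj, imbalance_max);
	return 0;
}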

> In fact I was hoping that we could use the numa_preferred_nid?

Unfortunately not. For some tasks, they are not long-lived enough for NUMA
balancing to make a decision. For longer-lived tasks, if load balancing is
spreading the load across nodes and wakeups are pulling tasks together,
NUMA balancing will get a mix of remote/local samples and will be unable
to pick a node properly.

In the netperf figures I put in the changelog, I pointed out that NUMA
balancing sampled roughly 50% of accesses as remote. With the patch,
100% of the samples are local.

> During the
> detach of tasks, we don't detach the task if busiest has spare capacity and
> preferred_nid of the task is busiest.
> 

Sure, but again if load balancing and waker/wakees are fighting each
other, NUMA balancing gets caught in the crossfire and cannot make a
sensible decision.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains
  2019-12-19 10:02   ` Peter Zijlstra
  2019-12-19 11:46     ` Valentin Schneider
@ 2019-12-19 15:23     ` Mel Gorman
  1 sibling, 0 replies; 18+ messages in thread
From: Mel Gorman @ 2019-12-19 15:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Valentin Schneider, Vincent Guittot, Ingo Molnar, pauld, srikar,
	quentin.perret, dietmar.eggemann, Morten.Rasmussen, hdanton,
	parth, riel, LKML

On Thu, Dec 19, 2019 at 11:02:32AM +0100, Peter Zijlstra wrote:
> On Wed, Dec 18, 2019 at 06:50:52PM +0000, Valentin Schneider wrote:
> > I'm quite sure you have reasons to have written it that way, but I was
> > hoping we could squash it down to something like:
> > ---
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 08a233e97a01..f05d09a8452e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8680,16 +8680,27 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> >  			env->migration_type = migrate_task;
> >  			lsub_positive(&nr_diff, local->sum_nr_running);
> >  			env->imbalance = nr_diff >> 1;
> > -			return;
> > +		} else {
> > +
> > +			/*
> > +			 * If there is no overload, we just want to even the number of
> > +			 * idle cpus.
> > +			 */
> > +			env->migration_type = migrate_task;
> > +			env->imbalance = max_t(long, 0, (local->idle_cpus -
> > +							 busiest->idle_cpus) >> 1);
> >  		}
> >  
> >  		/*
> > -		 * If there is no overload, we just want to even the number of
> > -		 * idle cpus.
> > +		 * Allow for a small imbalance between NUMA groups; don't do any
> > +		 * of it if there is at least half as many tasks / busy CPUs as
> > +		 * there are available CPUs in the busiest group
> >  		 */
> > -		env->migration_type = migrate_task;
> > -		env->imbalance = max_t(long, 0, (local->idle_cpus -
> > -						 busiest->idle_cpus) >> 1);
> > +		if (env->sd->flags & SD_NUMA &&
> > +		    (busiest->sum_nr_running < busiest->group_weight >> 1) &&
> > +		    (env->imbalance < busiest->group_weight * (env->sd->imbalance_pct - 100) / 100))
> 
> Note that this form allows avoiding the division. Every time I see that
> /100 I'm thinking we should rename and make imbalance_pct a base-2
> thing.
> 

Yeah, in this case simply (busiest->group_weight >> 2) would mostly be
equivalent to using imbalance_pct. I was tempted to use it but if
imbalance_pct ever changed, it would be inconsistent.
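
For reference, this is roughly why the two forms coincide. The helper
names below are made up for illustration and the imbalance_pct of 125
is an assumption; it is a sketch rather than a proposal.

/*
 * With an assumed imbalance_pct of 125:
 *
 *   group_weight * (imbalance_pct - 100) / 100
 *     == group_weight * 25 / 100
 *     == group_weight / 4
 *     == group_weight >> 2	(modulo rounding when the weight is not
 *				 a multiple of 4)
 *
 * The shift avoids the division but hard-codes the 25% margin, which
 * is why it drifts if imbalance_pct ever changes.
 */
static inline unsigned int numa_margin_pct(unsigned int group_weight,
					   unsigned int imbalance_pct)
{
	return group_weight * (imbalance_pct - 100) / 100;	/* exact */
}

static inline unsigned int numa_margin_shift(unsigned int group_weight)
{
	return group_weight >> 2;				/* approximation */
}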

Running NAS OMP C-Class with 50% of the CPUs turned out to be one of the
more adverse cases I encountered during testing (although that is still
ongoing). This is the current comparison I have

			     baseline		 lbnuma-v2r1		lbnuma-v2r2
Amean     bt.C       64.29 (   0.00%)       76.33 * -18.72%*       63.82 (   0.73%)
Amean     cg.C       26.33 (   0.00%)       26.26 (   0.27%)       27.23 (  -3.39%)
Amean     ep.C       10.26 (   0.00%)       10.29 (  -0.31%)       10.27 (  -0.12%)
Amean     ft.C       17.98 (   0.00%)       19.73 *  -9.71%*       18.61 (  -3.47%)
Amean     is.C        0.99 (   0.00%)        0.99 (   0.40%)        0.98 *   1.01%*
Amean     lu.C       51.72 (   0.00%)       48.57 (   6.09%)       48.46 *   6.30%*
Amean     mg.C        8.12 (   0.00%)        8.27 (  -1.82%)        7.93 (   2.31%)
Amean     sp.C       82.76 (   0.00%)       86.06 *  -3.99%*       89.26 *  -7.86%*
Amean     ua.C       58.64 (   0.00%)       57.66 (   1.67%)       58.27 (   0.62%)

lbnuma-v2r1 is the new form that avoids the division and lbnuma-v2r2 is
a revised version of my own patch with slightly different formatting.
The results are inconclusive though. Avoiding the division took a big hit
on bt.C but my own patch is essentially the same as the first version and
ran better this time around. Similarly with sp.C, my own patch performed
better. However, in both cases those workloads depend heavily on tasks
being spread wide and spread early so there is some luck involved. While
not presented here, the variability is quite high.

While it's irrelevant on this test machine, the new form also misses
env->imbalance being compared against at least 2 for the basic case of
two communicating tasks on NUMA domains that are very small. That is
trivially fixed.
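
To illustrate that corner case, here is a sketch with assumed numbers
(a hypothetical 2-CPU NUMA node and imbalance_pct of 125); it is not
the eventual fix, just the reasoning behind the floor of 2.

/*
 * On a very small node the percentage-based allowance rounds down to 0:
 *
 *   group_weight * (imbalance_pct - 100) / 100 = 2 * 25 / 100 = 0
 *
 * so the small imbalance created by two communicating tasks sharing a
 * node would still be balanced away. Clamping the allowance to at
 * least 2, as the earlier version did, covers that basic case.
 */
static inline unsigned int numa_allowed_imbalance(unsigned int group_weight,
						  unsigned int imbalance_pct)
{
	unsigned int allowed = group_weight * (imbalance_pct - 100) / 100;

	return allowed < 2 ? 2 : allowed;
}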

I'm currently testing the following which uses a straight 50% cutoff.
While I liked the idea of taking the allowed imbalance into account
before allowing load to spread, I do not have hard data that says it's
required. It could always be revisited if this patch introduced a
regression.

(The figures are not updated in the changelog yet)

---8<---
sched, fair: Allow a small degree of load imbalance between SD_NUMA domains

The CPU load balancer balances between different domains to spread load
and strives to have equal balance everywhere. Communicating tasks can
migrate so they are topologically close to each other but these decisions
are independent. On a lightly loaded NUMA machine, two communicating tasks
pulled together at wakeup time can be pushed apart by the load balancer.
In isolation, the load balancer decision is fine but it ignores the tasks
data locality and the wakeup/LB paths continually conflict. NUMA balancing
is also a factor but it also simply conflicts with the load balancer.

This patch allows a degree of imbalance to exist between NUMA domains
based on the imbalance_pct defined by the scheduler domain. This slight
imbalance is allowed until the scheduler domain reaches almost 50%
utilisation at which point other factors like HT utilisation and memory
bandwidth come into play. While not commented upon in the code, the cutoff
is important for memory-bound parallelised non-communicating workloads
that do not fully utilise the entire machine. This is not necessarily the
best universal cut-off point but it appeared appropriate for a variety
of workloads and machines.

The most obvious impact is on netperf TCP_STREAM -- two simple
communicating tasks with some softirq offloaded depending on the
transmission rate.

2-socket Haswell machine 48 core, HT enabled
netperf-tcp -- mmtests config config-network-netperf-unbound
                       	      baseline              lbnuma-v1
Hmean     64         666.68 (   0.00%)      669.00 (   0.35%)
Hmean     128       1276.18 (   0.00%)     1285.59 *   0.74%*
Hmean     256       2366.78 (   0.00%)     2419.42 *   2.22%*
Hmean     1024      8123.94 (   0.00%)     8494.92 *   4.57%*
Hmean     2048     12962.45 (   0.00%)    13430.37 *   3.61%*
Hmean     3312     17709.24 (   0.00%)    17317.23 *  -2.21%*
Hmean     4096     19756.01 (   0.00%)    19480.56 (  -1.39%)
Hmean     8192     27469.59 (   0.00%)    27208.17 (  -0.95%)
Hmean     16384    30062.82 (   0.00%)    31135.21 *   3.57%*
Stddev    64           2.64 (   0.00%)        1.19 (  54.86%)
Stddev    128          6.22 (   0.00%)        0.65 (  89.51%)
Stddev    256          9.75 (   0.00%)       11.81 ( -21.07%)
Stddev    1024        69.62 (   0.00%)       38.48 (  44.74%)
Stddev    2048        72.73 (   0.00%)       58.22 (  19.94%)
Stddev    3312       412.35 (   0.00%)       67.77 (  83.57%)
Stddev    4096       345.02 (   0.00%)       81.07 (  76.50%)
Stddev    8192       280.09 (   0.00%)      250.19 (  10.68%)
Stddev    16384      452.99 (   0.00%)      222.97 (  50.78%)

Fairly small impact on average performance but note how much the standard
deviation is reduced showing much more stable results. A clearer story
is visible from the NUMA Balancing stats

Ops NUMA base-page range updates       21596.00         282.00
Ops NUMA PTE updates                   21596.00         282.00
Ops NUMA PMD updates                       0.00           0.00
Ops NUMA hint faults                   17786.00         134.00
Ops NUMA hint local faults %            9916.00         134.00
Ops NUMA hint local percent               55.75         100.00
Ops NUMA pages migrated                 4231.00           0.00

Without the patch, only 55.75% of sampled accesses are local.
With the patch, 100% of sampled accesses are local. A 2-socket
Broadwell showed better results on average but they are not presented
for brevity. The patch holds up for 4-socket boxes as well

4-socket Haswell machine, 144 core, HT enabled
netperf-tcp

                       	      baseline              lbnuma-v1
Hmean     64         953.51 (   0.00%)      986.63 *   3.47%*
Hmean     128       1826.48 (   0.00%)     1887.48 *   3.34%*
Hmean     256       3295.19 (   0.00%)     3402.08 *   3.24%*
Hmean     1024     10915.40 (   0.00%)    11482.92 *   5.20%*
Hmean     2048     17833.82 (   0.00%)    19033.89 *   6.73%*
Hmean     3312     22690.72 (   0.00%)    24101.77 *   6.22%*
Hmean     4096     24422.23 (   0.00%)    26665.46 *   9.19%*
Hmean     8192     31250.11 (   0.00%)    33514.74 *   7.25%*
Hmean     16384    37033.70 (   0.00%)    38732.22 *   4.59%*

On this machine, the baseline measured 58.11% locality for sampled accesses
and 100% local accesses with the patch. Similarly, the patch holds up
for 2-socket machines with multiple L3 caches such as the AMD Epyc 2

2-socket EPYC-2 machine, 256 cores
netperf-tcp
Hmean     64        1564.63 (   0.00%)     1550.59 (  -0.90%)
Hmean     128       3028.83 (   0.00%)     3030.48 (   0.05%)
Hmean     256       5733.47 (   0.00%)     5769.51 (   0.63%)
Hmean     1024     18936.04 (   0.00%)    19216.15 *   1.48%*
Hmean     2048     27589.77 (   0.00%)    28200.45 *   2.21%*
Hmean     3312     35361.97 (   0.00%)    35881.94 *   1.47%*
Hmean     4096     37965.59 (   0.00%)    38702.01 *   1.94%*
Hmean     8192     48499.92 (   0.00%)    49530.62 *   2.13%*
Hmean     16384    54249.96 (   0.00%)    55937.24 *   3.11%*

For amusement purposes, here are two graphs showing CPU utilisation on
the 2-socket Haswell machine over time based on mpstat with the ordering
of the CPUs based on topology.

http://www.skynet.ie/~mel/postings/lbnuma-20191218/netperf-tcp-mpstat-baseline.png
http://www.skynet.ie/~mel/postings/lbnuma-20191218/netperf-tcp-mpstat-lbnuma-v1r1.png

The lines on the left match up CPUs that are HT siblings or on the same
node. The machine has only one L3 cache per NUMA node or that would also
be shown.  It should be very clear from the images that the baseline
kernel spread the load with lighter utilisation across nodes while the
patched kernel had heavy utilisation of fewer CPUs on one node.

Hackbench generally shows good results across machines with some
differences depending on whether threads or processes are used as well as
pipes or sockets.  This is the *worst* result from the 2-socket Haswell
machine

2-socket Haswell machine 48 core, HT enabled
hackbench-process-pipes -- mmtests config config-scheduler-unbound
                           5.5.0-rc1              5.5.0-rc1
                     	    baseline              lbnuma-v1
Amean     1        1.2580 (   0.00%)      1.2393 (   1.48%)
Amean     4        5.3293 (   0.00%)      5.2683 *   1.14%*
Amean     7        8.9067 (   0.00%)      8.7130 *   2.17%*
Amean     12      14.9577 (   0.00%)     14.5773 *   2.54%*
Amean     21      25.9570 (   0.00%)     25.6657 *   1.12%*
Amean     30      37.7287 (   0.00%)     37.1277 *   1.59%*
Amean     48      61.6757 (   0.00%)     60.0433 *   2.65%*
Amean     79     100.4740 (   0.00%)     98.4507 (   2.01%)
Amean     110    141.2450 (   0.00%)    136.8900 *   3.08%*
Amean     141    179.7747 (   0.00%)    174.5110 *   2.93%*
Amean     172    221.0700 (   0.00%)    214.7857 *   2.84%*
Amean     192    245.2007 (   0.00%)    238.3680 *   2.79%*

An earlier prototype of the patch showed major regressions for NAS C-class
when running with only half of the available CPUs -- 20-30% performance
hits were measured at the time. With this version of the patch, the impact
is marginal

NAS-C class OMP -- mmtests config hpc-nas-c-class-omp-half
                     	     baseline              lbnuma-v1
Amean     bt.C       64.29 (   0.00%)       70.31 *  -9.36%*
Amean     cg.C       26.33 (   0.00%)       25.73 (   2.31%)
Amean     ep.C       10.26 (   0.00%)       10.27 (  -0.10%)
Amean     ft.C       17.98 (   0.00%)       19.03 (  -5.84%)
Amean     is.C        0.99 (   0.00%)        0.99 (   0.40%)
Amean     lu.C       51.72 (   0.00%)       49.11 (   5.04%)
Amean     mg.C        8.12 (   0.00%)        8.13 (  -0.15%)
Amean     sp.C       82.76 (   0.00%)       84.52 (  -2.13%)
Amean     ua.C       58.64 (   0.00%)       57.57 (   1.82%)

There is some impact but there is a degree of variability and the ones
showing impact are mainly workloads that are mostly parallelised
and communicate infrequently between tasks. It's a corner case where
the workload benefits heavily from spreading wide and early which is
not common. This is intended to illustrate the worst case measured.

In general, the patch simply seeks to avoid unnecessary cross-node
migrations when a machine is lightly loaded but shows benefits for other
workloads. While tests are still running, so far it seems to benefit
light-utilisation smaller workloads on large machines and does not appear
to do any harm to larger or parallelised workloads.

[valentin.schneider@arm.com: Reformat code flow and correct comment]
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 40 ++++++++++++++++++++++++++++++----------
 1 file changed, 30 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 08a233e97a01..36eac7adf1cd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8637,10 +8637,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	/*
 	 * Try to use spare capacity of local group without overloading it or
 	 * emptying busiest.
-	 * XXX Spreading tasks across NUMA nodes is not always the best policy
-	 * and special care should be taken for SD_NUMA domain level before
-	 * spreading the tasks. For now, load_balance() fully relies on
-	 * NUMA_BALANCING and fbq_classify_group/rq to override the decision.
 	 */
 	if (local->group_type == group_has_spare) {
 		if (busiest->group_type > group_fully_busy) {
@@ -8680,16 +8676,40 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 			env->migration_type = migrate_task;
 			lsub_positive(&nr_diff, local->sum_nr_running);
 			env->imbalance = nr_diff >> 1;
-			return;
+		} else {
+
+			/*
+			 * If there is no overload, we just want to even the number of
+			 * idle cpus.
+			 */
+			env->migration_type = migrate_task;
+			env->imbalance = max_t(long, 0, (local->idle_cpus -
+							 busiest->idle_cpus) >> 1);
 		}
 
 		/*
-		 * If there is no overload, we just want to even the number of
-		 * idle cpus.
+		 * Consider allowing a small imbalance between NUMA groups
+		 * unless the busiest sd has half as many tasks / busy CPUs
+		 * as there are available CPUs in the busiest group.
 		 */
-		env->migration_type = migrate_task;
-		env->imbalance = max_t(long, 0, (local->idle_cpus -
-						 busiest->idle_cpus) >> 1);
+		if (env->sd->flags & SD_NUMA &&
+		    busiest->sum_nr_running < (busiest->group_weight >> 1)) {
+			unsigned int imbalance_max;
+
+			/*
+			 * imbalance_max is the allowed degree of imbalance
+			 * to exist between two NUMA domains when the SD is
+			 * lightly loaded. It's related to imbalance_pct with
+			 * a minimum of two tasks or idle CPUs to account
+			 * for the basic case of two communicating tasks
+			 * that should reside on the same node.
+			 */
+			imbalance_max = max(2U, (busiest->group_weight *
+				(env->sd->imbalance_pct - 100)) >> 1);
+
+			if (env->imbalance * 100 <= imbalance_max)
+				env->imbalance = 0;
+		}
 		return;
 	}
 

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains
  2019-12-19 15:18   ` Mel Gorman
@ 2019-12-19 15:41     ` Vincent Guittot
  2019-12-19 15:58       ` Mel Gorman
  0 siblings, 1 reply; 18+ messages in thread
From: Vincent Guittot @ 2019-12-19 15:41 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Peter Zijlstra, pauld, valentin.schneider, srikar,
	quentin.perret, dietmar.eggemann, Morten.Rasmussen, hdanton,
	parth, riel, LKML

On Thursday 19 Dec 2019 at 15:18:24 (+0000), Mel Gorman wrote:
> On Thu, Dec 19, 2019 at 03:45:39PM +0100, Vincent Guittot wrote:
> > Hi Mel,
> > 
> > Thanks for looking at this NUMA locality vs spreading tasks point.
> > 
> 
> No problem.
> 
> > > @@ -8680,7 +8676,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> > >  			env->migration_type = migrate_task;
> > >  			lsub_positive(&nr_diff, local->sum_nr_running);
> > >  			env->imbalance = nr_diff >> 1;
> > > -			return;
> > > +			goto out_spare;
> > 
> > Why are you doing this only for the prefer_sibling case? That's probably the default case for most NUMA systems but you should also consider other cases too.
> > 
> 
> It's the common case for NUMA machines I'm aware of and from the
> perspective of allowing a slight imbalance when there are spare CPUs, I
> felt it was the same whether we were considering idle CPUs or the number
> of tasks running.
> 
> The prefer_sibling case applies to the children and the corner case is
> that balancing NUMA domains takes into account whether the MC domain
> prefers siblings which is a bit odd. I believe, but don't know, that the
> reasoning may have been to spread load for memory bandwidth usage.
> 
> > So you should probably add your
> > 
> > > +                * Whether balancing the number of running tasks or the number
> > > +                * of idle CPUs, consider allowing some degree of imbalance if
> > > +                * migrating between NUMA domains.
> > > +                */
> > > +               if (env->sd->flags & SD_NUMA) {
> > > +                       unsigned int imbalance_adj, imbalance_max;
> > 
> > ...
> > 
> > > +               }
> > 
> > before the prefer_sibling case :
> > 
> > 		if (busiest->group_weight == 1 || sds->prefer_sibling) {
> > 			unsigned int nr_diff = busiest->sum_nr_running;
> > 			/*
> > 			 * When prefer sibling, evenly spread running tasks on
> > 			 * groups.
> > 			 */
> > 
> 
> I don't understand. If I move SD_NUMA checks above the imbalance
> calculation, how do I know whether the imbalance should be ignored?

You are only clearing env->imbalance before returning if the condition
comparing sum_nr_running with the group weight doesn't match, so you don't
care about the value of env->imbalance in the other case and you can have

		if ((env->sd->flags & SD_NUMA) &&
			( allow some degrees of imbalance )) {
				env->imbalance = 0
				return;
		}

		if (busiest->group_weight == 1 || sds->prefer_sibling) {
			unsigned int nr_diff = busiest->sum_nr_running;
			/*
			 * When prefer sibling, evenly spread running tasks on
			 * groups.
			 */
			env->migration_type = migrate_task;
			lsub_positive(&nr_diff, local->sum_nr_running);
			env->imbalance = nr_diff >> 1;
			return;
		}

		/*
		 * If there is no overload, we just want to even the number of
		 * idle cpus.
		 */
		env->migration_type = migrate_task;
		env->imbalance = max_t(long, 0, (local->idle_cpus -
						 busiest->idle_cpus) >> 1);
		return;
	}
	
		
> 
> > 
> > >
> > >  		}
> > >  
> > >  		/*
> > > @@ -8690,6 +8686,38 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> > >  		env->migration_type = migrate_task;
> > >  		env->imbalance = max_t(long, 0, (local->idle_cpus -
> > >  						 busiest->idle_cpus) >> 1);
> > > +
> > > +out_spare:
> > > +		/*
> > > +		 * Whether balancing the number of running tasks or the number
> > > +		 * of idle CPUs, consider allowing some degree of imbalance if
> > > +		 * migrating between NUMA domains.
> > > +		 */
> > > +		if (env->sd->flags & SD_NUMA) {
> > > +			unsigned int imbalance_adj, imbalance_max;
> > > +
> > > +			/*
> > > +			 * imbalance_adj is the allowable degree of imbalance
> > > +			 * to exist between two NUMA domains. It's calculated
> > > +			 * relative to imbalance_pct with a minimum of two
> > > +			 * tasks or idle CPUs.
> > > +			 */
> > > +			imbalance_adj = (busiest->group_weight *
> > > +				(env->sd->imbalance_pct - 100) / 100) >> 1;
> > > +			imbalance_adj = max(imbalance_adj, 2U);
> > > +
> > > +			/*
> > > +			 * Ignore imbalance unless busiest sd is close to 50%
> > > +			 * utilisation. At that point balancing for memory
> > > +			 * bandwidth and potentially avoiding unnecessary use
> > > +			 * of HT siblings is as relevant as memory locality.
> > > +			 */
> > > +			imbalance_max = (busiest->group_weight >> 1) - imbalance_adj;
> > > +			if (env->imbalance <= imbalance_adj &&
> > > +			    busiest->sum_nr_running < imbalance_max) {
> > 
> > Shouldn't you consider the number of busiest->idle_cpus instead of the busiest->sum_nr_running ?
> > 
> 
> Why? CPU affinity could have stacked multiple tasks on one CPU whereas
> I'm looking for a proxy hint on the amount of bandwidth required.
> sum_nr_running does not give me an accurate estimate but it's better than
> idle cpus.

Because even if you have multiple tasks on one CPU, only one will run at a
time on the CPU and the others will wait, so the bandwidth is effectively
linked to the number of running CPUs more than to the number of runnable tasks.

> 
> > and you could simplify by 
> > 
> > 
> > 	if ((env->sd->flags & SD_NUMA) &&
> > 		((100 * busiest->group_weight) <= (env->sd->imbalance_pct * (busiest->idle_cpus << 1)))) {
> > 			env->imbalance = 0;
> > 			return;
> > 	}
> > 
> > And otherwise it will continue with the current path
> > 
> 
> I ended up doing something similar to this in v2 but it's a bit more
> expanded so I can put in comments on why the comparisons are the way
> they are. The multiplications are in the slow path.
> 
> > Also I'm a bit worried about using a 50% threshold that looks a bit like a
> > heuristic which can change depending on the platform and the UCs that run on
> > the system.
> > 
> 
> UCs?

Use Cases

> 
> And yes, it's a heuristic. In this case, I'm as concerned about memory
> bandwidth availability as I am about improper locality due to aggressive
> balancing. We do not know the available memory bandwidth and we do not
> know how much bandwidth the tasks require, so 50% was as good a threshold
> as any. I do not know of any way that can cheaply measure either bandwidth
> usage (PMUs are not cheap) or available bandwidth (theoretical bandwidth !=
> actual bandwidth).
> 
> In an earlier version that I never posted, I had no cutoff at all and
> NAS took a roughly 30% performance penalty across all computational
> kernels. Debug tracing led me to this cutoff and running a battery
> of workloads led me to believe that it was a reasonable cutoff. It's
> important.
> 
> > In fact I was hoping that we could use the numa_preferred_nid?
> 
> Unfortunately not. For some tasks, they are not long-lived enough for NUMA
> balancing to make a decision. For longer-lived tasks, if load balancing is
> spreading the load across nodes and wakeups are pulling tasks together,
> NUMA balancing will get a mix of remote/local samples and will be unable
> to pick a node properly.
> 
> In the netperf figures I put in the changelog, I pointed out that NUMA
> balancing sampled roughly 50% of accesses as remote. With the patch,
> 100% of the samples are local.
> 
> > During the
> > detach of tasks, we don't detach the task if busiest has spare capacity and
> > preferred_nid of the task is busiest.
> > 
> 
> Sure, but again if load balancing and waker/wakees are fighting each
> other, NUMA balancing gets caught in the crossfire and cannot make a
> sensible decision.
> 
> -- 
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains
  2019-12-19 15:41     ` Vincent Guittot
@ 2019-12-19 15:58       ` Mel Gorman
  0 siblings, 0 replies; 18+ messages in thread
From: Mel Gorman @ 2019-12-19 15:58 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, Peter Zijlstra, pauld, valentin.schneider, srikar,
	quentin.perret, dietmar.eggemann, Morten.Rasmussen, hdanton,
	parth, riel, LKML

On Thu, Dec 19, 2019 at 04:41:17PM +0100, Vincent Guittot wrote:
> > I don't understand. If I move SD_NUMA checks above the imbalance
> > calculation, how do I know whether the imbalance should be ignored?
> 
> You are only clearing env->imbalance before returning if the condition
> comparing sum_nr_running with the group weight doesn't match, so you don't
> care about the value of env->imbalance in the other case and you can have
> 
> 		if ((env->sd->flags & SD_NUMA) &&
> 			( allow some degrees of imbalance )) {
> 				env->imbalance = 0
> 				return;
> 		}
> 
> 		if (busiest->group_weight == 1 || sds->prefer_sibling) {
> 			unsigned int nr_diff = busiest->sum_nr_running;
> 			/*
> 			 * When prefer sibling, evenly spread running tasks on
> 			 * groups.
> 			 */
> 			env->migration_type = migrate_task;
> 			lsub_positive(&nr_diff, local->sum_nr_running);
> 			env->imbalance = nr_diff >> 1;
> 			return;
> 		}
> 
> 		/*
> 		 * If there is no overload, we just want to even the number of
> 		 * idle cpus.
> 		 */
> 		env->migration_type = migrate_task;
> 		env->imbalance = max_t(long, 0, (local->idle_cpus -
> 						 busiest->idle_cpus) >> 1);
> 		return;
> 	}
> 	

Ok, that's clear. In an earlier version, I was not just resetting it, I
was adjusting the estimated imbalance but it was fragile. I can move the
check as you suggest.

> > > > @@ -8690,6 +8686,38 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> > > >  		env->migration_type = migrate_task;
> > > >  		env->imbalance = max_t(long, 0, (local->idle_cpus -
> > > >  						 busiest->idle_cpus) >> 1);
> > > > +
> > > > +out_spare:
> > > > +		/*
> > > > +		 * Whether balancing the number of running tasks or the number
> > > > +		 * of idle CPUs, consider allowing some degree of imbalance if
> > > > +		 * migrating between NUMA domains.
> > > > +		 */
> > > > +		if (env->sd->flags & SD_NUMA) {
> > > > +			unsigned int imbalance_adj, imbalance_max;
> > > > +
> > > > +			/*
> > > > +			 * imbalance_adj is the allowable degree of imbalance
> > > > +			 * to exist between two NUMA domains. It's calculated
> > > > +			 * relative to imbalance_pct with a minimum of two
> > > > +			 * tasks or idle CPUs.
> > > > +			 */
> > > > +			imbalance_adj = (busiest->group_weight *
> > > > +				(env->sd->imbalance_pct - 100) / 100) >> 1;
> > > > +			imbalance_adj = max(imbalance_adj, 2U);
> > > > +
> > > > +			/*
> > > > +			 * Ignore imbalance unless busiest sd is close to 50%
> > > > +			 * utilisation. At that point balancing for memory
> > > > +			 * bandwidth and potentially avoiding unnecessary use
> > > > +			 * of HT siblings is as relevant as memory locality.
> > > > +			 */
> > > > +			imbalance_max = (busiest->group_weight >> 1) - imbalance_adj;
> > > > +			if (env->imbalance <= imbalance_adj &&
> > > > +			    busiest->sum_nr_running < imbalance_max) {
> > > 
> > > Shouldn't you consider the number of busiest->idle_cpus instead of the busiest->sum_nr_running ?
> > > 
> > 
> > Why? CPU affinity could have stacked multiple tasks on one CPU whereas
> > I'm looking for a proxy hint on the amount of bandwidth required.
> > sum_nr_running does not give me an accurate estimate but it's better than
> > idle cpus.
> 
> Because even if you have multiple tasks on one CPU, only one will run at a
> time on the CPU and the others will wait, so the bandwidth is effectively
> linked to the number of running CPUs more than to the number of runnable tasks.
> 

Ok, I can try that. There is the corner case where tasks can be stacked on
the one CPU without affinities being involved but it's rare. That might
change again in the future if unbound workqueues get special-cased to
allow a wakee to stack on top of the workqueue's CPU as it's essentially
sync, but it's not likely to make that much of a difference.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains
  2019-12-19 14:45 ` Vincent Guittot
  2019-12-19 15:16   ` Valentin Schneider
  2019-12-19 15:18   ` Mel Gorman
@ 2019-12-20 13:00   ` Srikar Dronamraju
  2 siblings, 0 replies; 18+ messages in thread
From: Srikar Dronamraju @ 2019-12-20 13:00 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Mel Gorman, Ingo Molnar, Peter Zijlstra, pauld,
	valentin.schneider, quentin.perret, dietmar.eggemann,
	Morten.Rasmussen, hdanton, parth, riel, LKML

* Vincent Guittot <vincent.guittot@linaro.org> [2019-12-19 15:45:39]:

> Hi Mel,
> 
> Thanks for looking at this NUMA locality vs spreading tasks point.
> 
> 
> Shouldn't you consider the number of busiest->idle_cpus instead of the busiest->sum_nr_running ?
> and you could simplify by 
> 
> 
> 	if ((env->sd->flags & SD_NUMA) &&
> 		((100 * busiest->group_weight) <= (env->sd->imbalance_pct * (busiest->idle_cpus << 1)))) {
> 			env->imbalance = 0;
> 			return;
> 	}

Are idle_cpus and sum_nr_running good enough metrics to look at a NUMA
level? We could have an asymmetric NUMA topology where one DIE/MC/group may
have more cores than the other. In such a case, looking at idle_cpus (or
sum_nr_running) of the group may not always lead us to the right load
balancing solution.


-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2019-12-20 13:01 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-12-18 15:44 [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains Mel Gorman
2019-12-18 18:50 ` Valentin Schneider
2019-12-18 22:50   ` Mel Gorman
2019-12-19 11:56     ` Valentin Schneider
2019-12-19 10:02   ` Peter Zijlstra
2019-12-19 11:46     ` Valentin Schneider
2019-12-19 14:23       ` Valentin Schneider
2019-12-19 15:23     ` Mel Gorman
2019-12-18 18:54 ` Valentin Schneider
2019-12-19  2:58 ` Rik van Riel
2019-12-19  8:41   ` Mel Gorman
2019-12-19 10:04 ` Peter Zijlstra
2019-12-19 14:45 ` Vincent Guittot
2019-12-19 15:16   ` Valentin Schneider
2019-12-19 15:18   ` Mel Gorman
2019-12-19 15:41     ` Vincent Guittot
2019-12-19 15:58       ` Mel Gorman
2019-12-20 13:00   ` Srikar Dronamraju
