linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v6 0/2] Adjust NUMA imbalance for multiple LLCs
@ 2022-02-08  9:43 Mel Gorman
  2022-02-08  9:43 ` [PATCH 1/2] sched/fair: Improve consistency of allowed NUMA balance calculations Mel Gorman
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Mel Gorman @ 2022-02-08  9:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy,
	K Prateek Nayak, LKML, Mel Gorman

Changelog sinve v5
o Fix off-by-one error

Changelog since V4
o Scale imbalance based on the top domain that prefers siblings
o Keep allowed imbalance as 2 up to the point where LLCs can be overloaded

Changelog since V3
o Calculate imb_numa_nr for multiple SD_NUMA domains
o Restore behaviour where communicating pairs remain on the same node

Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA
nodes") allowed an imbalance between NUMA nodes such that communicating
tasks would not be pulled apart by the load balancer. This works fine when
there is a 1:1 relationship between LLC and node but can be suboptimal
for multiple LLCs if independent tasks prematurely use CPUs sharing cache.

The series addresses two problems -- inconsistent logic when allowing a
NUMA imbalance and sub-optimal performance when there are many LLCs per
NUMA node.

 include/linux/sched/topology.h |  1 +
 kernel/sched/fair.c            | 30 ++++++++++---------
 kernel/sched/topology.c        | 53 ++++++++++++++++++++++++++++++++++
 3 files changed, 71 insertions(+), 13 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/2] sched/fair: Improve consistency of allowed NUMA balance calculations
  2022-02-08  9:43 [PATCH v6 0/2] Adjust NUMA imbalance for multiple LLCs Mel Gorman
@ 2022-02-08  9:43 ` Mel Gorman
  2022-02-08 15:06   ` Gautham R. Shenoy
                     ` (3 more replies)
  2022-02-08  9:43 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs Mel Gorman
  2022-02-09  9:38 ` [PATCH v6 0/2] Adjust NUMA imbalance for " Peter Zijlstra
  2 siblings, 4 replies; 15+ messages in thread
From: Mel Gorman @ 2022-02-08  9:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy,
	K Prateek Nayak, LKML, Mel Gorman

There are inconsistencies when determining if a NUMA imbalance is allowed
that should be corrected.

o allow_numa_imbalance changes types and is not always examining
  the destination group so both the type should be corrected as
  well as the naming.
o find_idlest_group uses the sched_domain's weight instead of the
  group weight which is different to find_busiest_group
o find_busiest_group uses the source group instead of the destination
  which is different to task_numa_find_cpu
o Both find_idlest_group and find_busiest_group should account
  for the number of running tasks if a move was allowed to be
  consistent with task_numa_find_cpu

Fixes: 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA nodes")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 095b0aa378df..4592ccf82c34 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9003,9 +9003,10 @@ static bool update_pick_idlest(struct sched_group *idlest,
  * This is an approximation as the number of running tasks may not be
  * related to the number of busy CPUs due to sched_setaffinity.
  */
-static inline bool allow_numa_imbalance(int dst_running, int dst_weight)
+static inline bool
+allow_numa_imbalance(unsigned int running, unsigned int weight)
 {
-	return (dst_running < (dst_weight >> 2));
+	return (running < (weight >> 2));
 }
 
 /*
@@ -9139,12 +9140,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 				return idlest;
 #endif
 			/*
-			 * Otherwise, keep the task on this node to stay close
-			 * its wakeup source and improve locality. If there is
-			 * a real need of migration, periodic load balance will
-			 * take care of it.
+			 * Otherwise, keep the task close to the wakeup source
+			 * and improve locality if the number of running tasks
+			 * would remain below threshold where an imbalance is
+			 * allowed. If there is a real need of migration,
+			 * periodic load balance will take care of it.
 			 */
-			if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight))
+			if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, local_sgs.group_weight))
 				return NULL;
 		}
 
@@ -9350,7 +9352,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		/* Consider allowing a small imbalance between NUMA groups */
 		if (env->sd->flags & SD_NUMA) {
 			env->imbalance = adjust_numa_imbalance(env->imbalance,
-				busiest->sum_nr_running, busiest->group_weight);
+				local->sum_nr_running + 1, local->group_weight);
 		}
 
 		return;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2022-02-08  9:43 [PATCH v6 0/2] Adjust NUMA imbalance for multiple LLCs Mel Gorman
  2022-02-08  9:43 ` [PATCH 1/2] sched/fair: Improve consistency of allowed NUMA balance calculations Mel Gorman
@ 2022-02-08  9:43 ` Mel Gorman
  2022-02-08 16:19   ` Gautham R. Shenoy
                     ` (4 more replies)
  2022-02-09  9:38 ` [PATCH v6 0/2] Adjust NUMA imbalance for " Peter Zijlstra
  2 siblings, 5 replies; 15+ messages in thread
From: Mel Gorman @ 2022-02-08  9:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy,
	K Prateek Nayak, LKML, Mel Gorman

Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA
nodes") allowed an imbalance between NUMA nodes such that communicating
tasks would not be pulled apart by the load balancer. This works fine when
there is a 1:1 relationship between LLC and node but can be suboptimal
for multiple LLCs if independent tasks prematurely use CPUs sharing cache.

Zen* has multiple LLCs per node with local memory channels and due to
the allowed imbalance, it's far harder to tune some workloads to run
optimally than it is on hardware that has 1 LLC per node. This patch
allows an imbalance to exist up to the point where LLCs should be balanced
between nodes.

On a Zen3 machine running STREAM parallelised with OMP to have on instance
per LLC the results and without binding, the results are

                            5.17.0-rc0             5.17.0-rc0
                               vanilla       sched-numaimb-v6
MB/sec copy-16    162596.94 (   0.00%)   580559.74 ( 257.05%)
MB/sec scale-16   136901.28 (   0.00%)   374450.52 ( 173.52%)
MB/sec add-16     157300.70 (   0.00%)   564113.76 ( 258.62%)
MB/sec triad-16   151446.88 (   0.00%)   564304.24 ( 272.61%)

STREAM can use directives to force the spread if the OpenMP is new
enough but that doesn't help if an application uses threads and
it's not known in advance how many threads will be created.

Coremark is a CPU and cache intensive benchmark parallelised with
threads. When running with 1 thread per core, the vanilla kernel
allows threads to contend on cache. With the patch;

                               5.17.0-rc0             5.17.0-rc0
                                  vanilla       sched-numaimb-v5
Min       Score-16   368239.36 (   0.00%)   389816.06 (   5.86%)
Hmean     Score-16   388607.33 (   0.00%)   427877.08 *  10.11%*
Max       Score-16   408945.69 (   0.00%)   481022.17 (  17.62%)
Stddev    Score-16    15247.04 (   0.00%)    24966.82 ( -63.75%)
CoeffVar  Score-16        3.92 (   0.00%)        5.82 ( -48.48%)

It can also make a big difference for semi-realistic workloads
like specjbb which can execute arbitrary numbers of threads without
advance knowledge of how they should be placed. Even in cases where
the average performance is neutral, the results are more stable.

                               5.17.0-rc0             5.17.0-rc0
                                  vanilla       sched-numaimb-v6
Hmean     tput-1      71631.55 (   0.00%)    73065.57 (   2.00%)
Hmean     tput-8     582758.78 (   0.00%)   556777.23 (  -4.46%)
Hmean     tput-16   1020372.75 (   0.00%)  1009995.26 (  -1.02%)
Hmean     tput-24   1416430.67 (   0.00%)  1398700.11 (  -1.25%)
Hmean     tput-32   1687702.72 (   0.00%)  1671357.04 (  -0.97%)
Hmean     tput-40   1798094.90 (   0.00%)  2015616.46 *  12.10%*
Hmean     tput-48   1972731.77 (   0.00%)  2333233.72 (  18.27%)
Hmean     tput-56   2386872.38 (   0.00%)  2759483.38 (  15.61%)
Hmean     tput-64   2909475.33 (   0.00%)  2925074.69 (   0.54%)
Hmean     tput-72   2585071.36 (   0.00%)  2962443.97 (  14.60%)
Hmean     tput-80   2994387.24 (   0.00%)  3015980.59 (   0.72%)
Hmean     tput-88   3061408.57 (   0.00%)  3010296.16 (  -1.67%)
Hmean     tput-96   3052394.82 (   0.00%)  2784743.41 (  -8.77%)
Hmean     tput-104  2997814.76 (   0.00%)  2758184.50 (  -7.99%)
Hmean     tput-112  2955353.29 (   0.00%)  2859705.09 (  -3.24%)
Hmean     tput-120  2889770.71 (   0.00%)  2764478.46 (  -4.34%)
Hmean     tput-128  2871713.84 (   0.00%)  2750136.73 (  -4.23%)
Stddev    tput-1       5325.93 (   0.00%)     2002.53 (  62.40%)
Stddev    tput-8       6630.54 (   0.00%)    10905.00 ( -64.47%)
Stddev    tput-16     25608.58 (   0.00%)     6851.16 (  73.25%)
Stddev    tput-24     12117.69 (   0.00%)     4227.79 (  65.11%)
Stddev    tput-32     27577.16 (   0.00%)     8761.05 (  68.23%)
Stddev    tput-40     59505.86 (   0.00%)     2048.49 (  96.56%)
Stddev    tput-48    168330.30 (   0.00%)    93058.08 (  44.72%)
Stddev    tput-56    219540.39 (   0.00%)    30687.02 (  86.02%)
Stddev    tput-64    121750.35 (   0.00%)     9617.36 (  92.10%)
Stddev    tput-72    223387.05 (   0.00%)    34081.13 (  84.74%)
Stddev    tput-80    128198.46 (   0.00%)    22565.19 (  82.40%)
Stddev    tput-88    136665.36 (   0.00%)    27905.97 (  79.58%)
Stddev    tput-96    111925.81 (   0.00%)    99615.79 (  11.00%)
Stddev    tput-104   146455.96 (   0.00%)    28861.98 (  80.29%)
Stddev    tput-112    88740.49 (   0.00%)    58288.23 (  34.32%)
Stddev    tput-120   186384.86 (   0.00%)    45812.03 (  75.42%)
Stddev    tput-128    78761.09 (   0.00%)    57418.48 (  27.10%)

Similarly, for embarassingly parallel problems like NPB-ep, there are
improvements due to better spreading across LLC when the machine is not
fully utilised.

                              vanilla       sched-numaimb-v6
Min       ep.D       31.79 (   0.00%)       26.11 (  17.87%)
Amean     ep.D       31.86 (   0.00%)       26.17 *  17.86%*
Stddev    ep.D        0.07 (   0.00%)        0.05 (  24.41%)
CoeffVar  ep.D        0.22 (   0.00%)        0.20 (   7.97%)
Max       ep.D       31.93 (   0.00%)       26.21 (  17.91%)

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/sched/topology.h |  1 +
 kernel/sched/fair.c            | 22 +++++++-------
 kernel/sched/topology.c        | 53 ++++++++++++++++++++++++++++++++++
 3 files changed, 66 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 8054641c0a7b..56cffe42abbc 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -93,6 +93,7 @@ struct sched_domain {
 	unsigned int busy_factor;	/* less balancing by factor if busy */
 	unsigned int imbalance_pct;	/* No balance until over watermark */
 	unsigned int cache_nice_tries;	/* Leave cache hot tasks for # tries */
+	unsigned int imb_numa_nr;	/* Nr running tasks that allows a NUMA imbalance */
 
 	int nohz_idle;			/* NOHZ IDLE status */
 	int flags;			/* See SD_* */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4592ccf82c34..538756bd8e7f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1489,6 +1489,7 @@ struct task_numa_env {
 
 	int src_cpu, src_nid;
 	int dst_cpu, dst_nid;
+	int imb_numa_nr;
 
 	struct numa_stats src_stats, dst_stats;
 
@@ -1503,7 +1504,7 @@ struct task_numa_env {
 static unsigned long cpu_load(struct rq *rq);
 static unsigned long cpu_runnable(struct rq *rq);
 static inline long adjust_numa_imbalance(int imbalance,
-					int dst_running, int dst_weight);
+					int dst_running, int imb_numa_nr);
 
 static inline enum
 numa_type numa_classify(unsigned int imbalance_pct,
@@ -1884,7 +1885,7 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 		dst_running = env->dst_stats.nr_running + 1;
 		imbalance = max(0, dst_running - src_running);
 		imbalance = adjust_numa_imbalance(imbalance, dst_running,
-							env->dst_stats.weight);
+						  env->imb_numa_nr);
 
 		/* Use idle CPU if there is no imbalance */
 		if (!imbalance) {
@@ -1949,8 +1950,10 @@ static int task_numa_migrate(struct task_struct *p)
 	 */
 	rcu_read_lock();
 	sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu));
-	if (sd)
+	if (sd) {
 		env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
+		env.imb_numa_nr = sd->imb_numa_nr;
+	}
 	rcu_read_unlock();
 
 	/*
@@ -9003,10 +9006,9 @@ static bool update_pick_idlest(struct sched_group *idlest,
  * This is an approximation as the number of running tasks may not be
  * related to the number of busy CPUs due to sched_setaffinity.
  */
-static inline bool
-allow_numa_imbalance(unsigned int running, unsigned int weight)
+static inline bool allow_numa_imbalance(int running, int imb_numa_nr)
 {
-	return (running < (weight >> 2));
+	return running <= imb_numa_nr;
 }
 
 /*
@@ -9146,7 +9148,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 			 * allowed. If there is a real need of migration,
 			 * periodic load balance will take care of it.
 			 */
-			if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, local_sgs.group_weight))
+			if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr))
 				return NULL;
 		}
 
@@ -9238,9 +9240,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 #define NUMA_IMBALANCE_MIN 2
 
 static inline long adjust_numa_imbalance(int imbalance,
-				int dst_running, int dst_weight)
+				int dst_running, int imb_numa_nr)
 {
-	if (!allow_numa_imbalance(dst_running, dst_weight))
+	if (!allow_numa_imbalance(dst_running, imb_numa_nr))
 		return imbalance;
 
 	/*
@@ -9352,7 +9354,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		/* Consider allowing a small imbalance between NUMA groups */
 		if (env->sd->flags & SD_NUMA) {
 			env->imbalance = adjust_numa_imbalance(env->imbalance,
-				local->sum_nr_running + 1, local->group_weight);
+				local->sum_nr_running + 1, env->sd->imb_numa_nr);
 		}
 
 		return;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index d201a7052a29..e6cd55951304 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2242,6 +2242,59 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 		}
 	}
 
+	/*
+	 * Calculate an allowed NUMA imbalance such that LLCs do not get
+	 * imbalanced.
+	 */
+	for_each_cpu(i, cpu_map) {
+		unsigned int imb = 0;
+		unsigned int imb_span = 1;
+
+		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
+			struct sched_domain *child = sd->child;
+
+			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
+			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
+				struct sched_domain *top, *top_p;
+				unsigned int nr_llcs;
+
+				/*
+				 * For a single LLC per node, allow an
+				 * imbalance up to 25% of the node. This is an
+				 * arbitrary cutoff based on SMT-2 to balance
+				 * between memory bandwidth and avoiding
+				 * premature sharing of HT resources and SMT-4
+				 * or SMT-8 *may* benefit from a different
+				 * cutoff.
+				 *
+				 * For multiple LLCs, allow an imbalance
+				 * until multiple tasks would share an LLC
+				 * on one node while LLCs on another node
+				 * remain idle.
+				 */
+				nr_llcs = sd->span_weight / child->span_weight;
+				if (nr_llcs == 1)
+					imb = sd->span_weight >> 2;
+				else
+					imb = nr_llcs;
+				sd->imb_numa_nr = imb;
+
+				/* Set span based on the first NUMA domain. */
+				top = sd;
+				top_p = top->parent;
+				while (top_p && !(top_p->flags & SD_NUMA)) {
+					top = top->parent;
+					top_p = top->parent;
+				}
+				imb_span = top_p ? top_p->span_weight : sd->span_weight;
+			} else {
+				int factor = max(1U, (sd->span_weight / imb_span));
+
+				sd->imb_numa_nr = imb * factor;
+			}
+		}
+	}
+
 	/* Calculate CPU capacity for physical packages and nodes */
 	for (i = nr_cpumask_bits-1; i >= 0; i--) {
 		if (!cpumask_test_cpu(i, cpu_map))
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] sched/fair: Improve consistency of allowed NUMA balance calculations
  2022-02-08  9:43 ` [PATCH 1/2] sched/fair: Improve consistency of allowed NUMA balance calculations Mel Gorman
@ 2022-02-08 15:06   ` Gautham R. Shenoy
  2022-02-14  9:48   ` Vincent Guittot
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 15+ messages in thread
From: Gautham R. Shenoy @ 2022-02-08 15:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju,
	K Prateek Nayak, LKML

Hello Mel,

On Tue, Feb 08, 2022 at 09:43:33AM +0000, Mel Gorman wrote:
> There are inconsistencies when determining if a NUMA imbalance is allowed
> that should be corrected.
> 
> o allow_numa_imbalance changes types and is not always examining
>   the destination group so both the type should be corrected as
>   well as the naming.
> o find_idlest_group uses the sched_domain's weight instead of the
>   group weight which is different to find_busiest_group
> o find_busiest_group uses the source group instead of the destination
>   which is different to task_numa_find_cpu
> o Both find_idlest_group and find_busiest_group should account
>   for the number of running tasks if a move was allowed to be
>   consistent with task_numa_find_cpu
>

This patch looks good to me. Using the local_sgs.group_weight instead
of sd->span_weight is definitely helping improve the spread of the
stream threads on AMD EPYC processors.

Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>


> Fixes: 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA nodes")
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  kernel/sched/fair.c | 18 ++++++++++--------
>  1 file changed, 10 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 095b0aa378df..4592ccf82c34 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9003,9 +9003,10 @@ static bool update_pick_idlest(struct sched_group *idlest,
>   * This is an approximation as the number of running tasks may not be
>   * related to the number of busy CPUs due to sched_setaffinity.
>   */
> -static inline bool allow_numa_imbalance(int dst_running, int dst_weight)
> +static inline bool
> +allow_numa_imbalance(unsigned int running, unsigned int weight)
>  {
> -	return (dst_running < (dst_weight >> 2));
> +	return (running < (weight >> 2));
>  }
>  
>  /*
> @@ -9139,12 +9140,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>  				return idlest;
>  #endif
>  			/*
> -			 * Otherwise, keep the task on this node to stay close
> -			 * its wakeup source and improve locality. If there is
> -			 * a real need of migration, periodic load balance will
> -			 * take care of it.
> +			 * Otherwise, keep the task close to the wakeup source
> +			 * and improve locality if the number of running tasks
> +			 * would remain below threshold where an imbalance is
> +			 * allowed. If there is a real need of migration,
> +			 * periodic load balance will take care of it.
>  			 */
> -			if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight))
> +			if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, local_sgs.group_weight))
>  				return NULL;
>  		}
>  
> @@ -9350,7 +9352,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>  		/* Consider allowing a small imbalance between NUMA groups */
>  		if (env->sd->flags & SD_NUMA) {
>  			env->imbalance = adjust_numa_imbalance(env->imbalance,
> -				busiest->sum_nr_running, busiest->group_weight);
> +				local->sum_nr_running + 1, local->group_weight);
>  		}
>  
>  		return;
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2022-02-08  9:43 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs Mel Gorman
@ 2022-02-08 16:19   ` Gautham R. Shenoy
  2022-02-09  5:10   ` K Prateek Nayak
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 15+ messages in thread
From: Gautham R. Shenoy @ 2022-02-08 16:19 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju,
	K Prateek Nayak, LKML

On Tue, Feb 08, 2022 at 09:43:34AM +0000, Mel Gorman wrote:

[..snip..]

> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index d201a7052a29..e6cd55951304 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2242,6 +2242,59 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>  		}
>  	}
>  
> +	/*
> +	 * Calculate an allowed NUMA imbalance such that LLCs do not get
> +	 * imbalanced.
> +	 */
> +	for_each_cpu(i, cpu_map) {
> +		unsigned int imb = 0;
> +		unsigned int imb_span = 1;
> +
> +		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> +			struct sched_domain *child = sd->child;
> +
> +			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> +			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
> +				struct sched_domain *top, *top_p;
> +				unsigned int nr_llcs;
> +
> +				/*
> +				 * For a single LLC per node, allow an
> +				 * imbalance up to 25% of the node. This is an
> +				 * arbitrary cutoff based on SMT-2 to balance
> +				 * between memory bandwidth and avoiding
> +				 * premature sharing of HT resources and SMT-4
> +				 * or SMT-8 *may* benefit from a different
> +				 * cutoff.
> +				 *
> +				 * For multiple LLCs, allow an imbalance
> +				 * until multiple tasks would share an LLC
> +				 * on one node while LLCs on another node
> +				 * remain idle.
> +				 */
> +				nr_llcs = sd->span_weight / child->span_weight;
> +				if (nr_llcs == 1)
> +					imb = sd->span_weight >> 2;
> +				else
> +					imb = nr_llcs;
> +				sd->imb_numa_nr = imb;
> +
> +				/* Set span based on the first NUMA domain. */
> +				top = sd;
> +				top_p = top->parent;
> +				while (top_p && !(top_p->flags & SD_NUMA)) {
> +					top = top->parent;
> +					top_p = top->parent;
> +				}
> +				imb_span = top_p ? top_p->span_weight : sd->span_weight;
> +			} else {
> +				int factor = max(1U, (sd->span_weight / imb_span));
> +
> +				sd->imb_numa_nr = imb * factor;
> +			}
> +		}
> +	}


On a 2 Socket Zen3 servers with 64 cores per socket, the imb_numa_nr
works out to be as follows for different Node Per Socket (NPS) modes

NPS = 1:
======
SMT(span = 2) -- > MC (span = 16) --> DIE (span = 128) --> NUMA (span = 256)
	 Parent of LLC is DIE. nr_llcs = 128/16 = 8. imb = 8.
	 top_p = NUMA. imb_span = 256.

for NUMA doman, factor = max(1U, 256/256) = 1. Thus sd->imb_numa_nr = 8.



NPS = 2
========
SMT(span=2)--> MC(span=16)--> NODE(span=64)--> NUMA1(span=128)--> NUMA2(span=256)

Parent of LLC = NODE. nr_llcs = 64/16 = 4. imb = 4.
top_p = NUMA1. imb_span = 128.

For NUMA1 domain, factor = 1. sd->imb_numa_nr = 4.
For NUMA2 domain, factor = 2. sd->imb_numa_nr = 8


NPS = 4
========
SMT(span=2)--> MC(span=16)--> NODE(span=32)--> NUMA1(span=128)--> NUMA2(span=256)

Parent of LLC = NODE. nr_llcs = 32/16 = 2. imb = 2.
top_p = NUMA1. imb_span = 128.

For NUMA1 domain, factor = 1. sd->imb_numa_nr = 2.
For NUMA2 domain, factor = 2. sd->imb_numa_nr = 4


The imb_numa_nr looks good for all the NPS modes. Furthermore, running
stream with 16 threads (equal to the number of LLCs in the system)
yields good results on all the NPS modes with this imb_numa_nr.

Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>

--
Thanks and Regards
gautham.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2022-02-08  9:43 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs Mel Gorman
  2022-02-08 16:19   ` Gautham R. Shenoy
@ 2022-02-09  5:10   ` K Prateek Nayak
  2022-02-09 10:33     ` Mel Gorman
  2022-02-14 10:27   ` Srikar Dronamraju
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 15+ messages in thread
From: K Prateek Nayak @ 2022-02-09  5:10 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju,
	Gautham Shenoy, LKML

Hello Mel,

On 2/8/2022 3:13 PM, Mel Gorman wrote:
[..snip..]
> On a Zen3 machine running STREAM parallelised with OMP to have on instance
> per LLC the results and without binding, the results are
>
>                             5.17.0-rc0             5.17.0-rc0
>                                vanilla       sched-numaimb-v6
> MB/sec copy-16    162596.94 (   0.00%)   580559.74 ( 257.05%)
> MB/sec scale-16   136901.28 (   0.00%)   374450.52 ( 173.52%)
> MB/sec add-16     157300.70 (   0.00%)   564113.76 ( 258.62%)
> MB/sec triad-16   151446.88 (   0.00%)   564304.24 ( 272.61%)

I was able to test STREAM without binding on different
NPS configurations of two socket Zen3 machine.

The results look good:

sched-tip 	- 5.17.0-rc1 tip sched/core
mel-v6 		- 5.17.0-rc1 tip sched/core + this patch

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Stream with 16 threads.
built with -DSTREAM_ARRAY_SIZE=128000000, -DNTIMES=10
Zen3, 64C128T per socket, 2 sockets,
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

NPS1

Test:                sched-tip                  mel-v6
 Copy:   114470.18 (0.00 pct)    152806.94 (33.49 pct)
Scale:   111575.12 (0.00 pct)    189784.57 (70.09 pct)
  Add:   125436.15 (0.00 pct)    213371.05 (70.10 pct)
Triad:   123068.86 (0.00 pct)    209809.11 (70.48 pct)

NPS2

Test:               sched-tip                   mel-v6
 Copy:   57936.28 (0.00 pct)     155038.70 (167.60 pct)
Scale:   55599.30 (0.00 pct)     192601.59 (246.41 pct)
  Add:   63096.96 (0.00 pct)     211462.58 (235.13 pct)
Triad:   61983.39 (0.00 pct)     208909.34 (237.04 pct)

NPS4

Test:               sched-tip                   mel-v6
 Copy:   43946.42 (0.00 pct)     119583.69 (172.11 pct)
Scale:   33750.96 (0.00 pct)     180130.83 (433.70 pct)
  Add:   39109.72 (0.00 pct)     170296.68 (335.43 pct)
Triad:   36598.88 (0.00 pct)     169953.47 (364.36 pct)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Stream with 16 threads.
built with -DSTREAM_ARRAY_SIZE=128000000, -DNTIMES=100
Zen3, 64C128T per socket, 2 sockets,
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

NPS1

Test:               sched-tip                   mel-v6
 Copy:   132402.79 (0.00 pct)    225587.85 (70.37 pct)
Scale:   126923.02 (0.00 pct)    214363.58 (68.89 pct)
  Add:   145596.55 (0.00 pct)    260901.92 (79.19 pct)
Triad:   143092.91 (0.00 pct)    249081.79 (74.06 pct)

NPS 2

Test:               sched-tip                   mel-v6
 Copy:   107386.27 (0.00 pct)    227623.31 (111.96 pct)
Scale:   100941.44 (0.00 pct)    218116.63 (116.08 pct)
  Add:   115854.52 (0.00 pct)    272756.95 (135.43 pct)
Triad:   113369.96 (0.00 pct)    260235.32 (129.54 pct)

NPS4

Test:               sched-tip                   mel-v6
 Copy:   91083.07 (0.00 pct)     247163.90 (171.36 pct)
Scale:   90352.54 (0.00 pct)     223914.31 (147.82 pct)
  Add:   101973.98 (0.00 pct)    272842.42 (167.56 pct)
Triad:   99773.65 (0.00 pct)     258904.54 (159.49 pct)


There is a significant improvement throughout the board
with v6 outperforming tip/sched/core in every case!

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

--
Thanks and Regards
Prateek


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v6 0/2] Adjust NUMA imbalance for multiple LLCs
  2022-02-08  9:43 [PATCH v6 0/2] Adjust NUMA imbalance for multiple LLCs Mel Gorman
  2022-02-08  9:43 ` [PATCH 1/2] sched/fair: Improve consistency of allowed NUMA balance calculations Mel Gorman
  2022-02-08  9:43 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs Mel Gorman
@ 2022-02-09  9:38 ` Peter Zijlstra
  2 siblings, 0 replies; 15+ messages in thread
From: Peter Zijlstra @ 2022-02-09  9:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy,
	K Prateek Nayak, LKML

On Tue, Feb 08, 2022 at 09:43:32AM +0000, Mel Gorman wrote:
> Changelog sinve v5
> o Fix off-by-one error
> 
> Changelog since V4
> o Scale imbalance based on the top domain that prefers siblings
> o Keep allowed imbalance as 2 up to the point where LLCs can be overloaded
> 
> Changelog since V3
> o Calculate imb_numa_nr for multiple SD_NUMA domains
> o Restore behaviour where communicating pairs remain on the same node
> 
> Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA
> nodes") allowed an imbalance between NUMA nodes such that communicating
> tasks would not be pulled apart by the load balancer. This works fine when
> there is a 1:1 relationship between LLC and node but can be suboptimal
> for multiple LLCs if independent tasks prematurely use CPUs sharing cache.
> 
> The series addresses two problems -- inconsistent logic when allowing a
> NUMA imbalance and sub-optimal performance when there are many LLCs per
> NUMA node.
> 
>  include/linux/sched/topology.h |  1 +
>  kernel/sched/fair.c            | 30 ++++++++++---------
>  kernel/sched/topology.c        | 53 ++++++++++++++++++++++++++++++++++
>  3 files changed, 71 insertions(+), 13 deletions(-)

Thanks!

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2022-02-09  5:10   ` K Prateek Nayak
@ 2022-02-09 10:33     ` Mel Gorman
  2022-02-11 19:02       ` Jirka Hladky
  0 siblings, 1 reply; 15+ messages in thread
From: Mel Gorman @ 2022-02-09 10:33 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju,
	Gautham Shenoy, LKML

On Wed, Feb 09, 2022 at 10:40:15AM +0530, K Prateek Nayak wrote:
> There is a significant improvement throughout the board
> with v6 outperforming tip/sched/core in every case!
> 
> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> 

Thanks very much Prateek and Gautham for reviewing and testing!

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2022-02-09 10:33     ` Mel Gorman
@ 2022-02-11 19:02       ` Jirka Hladky
  0 siblings, 0 replies; 15+ messages in thread
From: Jirka Hladky @ 2022-02-11 19:02 UTC (permalink / raw)
  To: Mel Gorman
  Cc: K Prateek Nayak, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith,
	Srikar Dronamraju, Gautham Shenoy, LKML

Hi Mel,

we have tested 5.17.0_rc2 + v6 of your patch series and the results
are very good!

We see performance gains up to 25% for the hackbench (32 processes).
Even more importantly, we don't see any performance losses that we
have experienced with the previous versions of the patch series for
NAS and SPECjbb2005 workloads.

Tested-by: Jirka Hladky <jhladky@redhat.com>

Thanks a lot
Jirka


On Wed, Feb 9, 2022 at 12:08 PM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Wed, Feb 09, 2022 at 10:40:15AM +0530, K Prateek Nayak wrote:
> > There is a significant improvement throughout the board
> > with v6 outperforming tip/sched/core in every case!
> >
> > Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> >
>
> Thanks very much Prateek and Gautham for reviewing and testing!
>
> --
> Mel Gorman
> SUSE Labs
>


-- 
-Jirka


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] sched/fair: Improve consistency of allowed NUMA balance calculations
  2022-02-08  9:43 ` [PATCH 1/2] sched/fair: Improve consistency of allowed NUMA balance calculations Mel Gorman
  2022-02-08 15:06   ` Gautham R. Shenoy
@ 2022-02-14  9:48   ` Vincent Guittot
  2022-02-14 10:26   ` Srikar Dronamraju
  2022-02-14 10:30   ` [tip: sched/core] " tip-bot2 for Mel Gorman
  3 siblings, 0 replies; 15+ messages in thread
From: Vincent Guittot @ 2022-02-14  9:48 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy,
	K Prateek Nayak, LKML

On Tue, 8 Feb 2022 at 10:43, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> There are inconsistencies when determining if a NUMA imbalance is allowed
> that should be corrected.
>
> o allow_numa_imbalance changes types and is not always examining
>   the destination group so both the type should be corrected as
>   well as the naming.
> o find_idlest_group uses the sched_domain's weight instead of the
>   group weight which is different to find_busiest_group
> o find_busiest_group uses the source group instead of the destination
>   which is different to task_numa_find_cpu
> o Both find_idlest_group and find_busiest_group should account
>   for the number of running tasks if a move was allowed to be
>   consistent with task_numa_find_cpu
>
> Fixes: 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA nodes")
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>

> ---
>  kernel/sched/fair.c | 18 ++++++++++--------
>  1 file changed, 10 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 095b0aa378df..4592ccf82c34 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9003,9 +9003,10 @@ static bool update_pick_idlest(struct sched_group *idlest,
>   * This is an approximation as the number of running tasks may not be
>   * related to the number of busy CPUs due to sched_setaffinity.
>   */
> -static inline bool allow_numa_imbalance(int dst_running, int dst_weight)
> +static inline bool
> +allow_numa_imbalance(unsigned int running, unsigned int weight)
>  {
> -       return (dst_running < (dst_weight >> 2));
> +       return (running < (weight >> 2));
>  }
>
>  /*
> @@ -9139,12 +9140,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>                                 return idlest;
>  #endif
>                         /*
> -                        * Otherwise, keep the task on this node to stay close
> -                        * its wakeup source and improve locality. If there is
> -                        * a real need of migration, periodic load balance will
> -                        * take care of it.
> +                        * Otherwise, keep the task close to the wakeup source
> +                        * and improve locality if the number of running tasks
> +                        * would remain below threshold where an imbalance is
> +                        * allowed. If there is a real need of migration,
> +                        * periodic load balance will take care of it.
>                          */
> -                       if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight))
> +                       if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, local_sgs.group_weight))
>                                 return NULL;
>                 }
>
> @@ -9350,7 +9352,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>                 /* Consider allowing a small imbalance between NUMA groups */
>                 if (env->sd->flags & SD_NUMA) {
>                         env->imbalance = adjust_numa_imbalance(env->imbalance,
> -                               busiest->sum_nr_running, busiest->group_weight);
> +                               local->sum_nr_running + 1, local->group_weight);
>                 }
>
>                 return;
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] sched/fair: Improve consistency of allowed NUMA balance calculations
  2022-02-08  9:43 ` [PATCH 1/2] sched/fair: Improve consistency of allowed NUMA balance calculations Mel Gorman
  2022-02-08 15:06   ` Gautham R. Shenoy
  2022-02-14  9:48   ` Vincent Guittot
@ 2022-02-14 10:26   ` Srikar Dronamraju
  2022-02-14 10:30   ` [tip: sched/core] " tip-bot2 for Mel Gorman
  3 siblings, 0 replies; 15+ messages in thread
From: Srikar Dronamraju @ 2022-02-14 10:26 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Gautham Shenoy,
	K Prateek Nayak, LKML

* Mel Gorman <mgorman@techsingularity.net> [2022-02-08 09:43:33]:

> There are inconsistencies when determining if a NUMA imbalance is allowed
> that should be corrected.
> 
> o allow_numa_imbalance changes types and is not always examining
>   the destination group so both the type should be corrected as
>   well as the naming.
> o find_idlest_group uses the sched_domain's weight instead of the
>   group weight which is different to find_busiest_group
> o find_busiest_group uses the source group instead of the destination
>   which is different to task_numa_find_cpu
> o Both find_idlest_group and find_busiest_group should account
>   for the number of running tasks if a move was allowed to be
>   consistent with task_numa_find_cpu
> 
> Fixes: 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA nodes")
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  kernel/sched/fair.c | 18 ++++++++++--------
>  1 file changed, 10 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 095b0aa378df..4592ccf82c34 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9003,9 +9003,10 @@ static bool update_pick_idlest(struct sched_group *idlest,
>   * This is an approximation as the number of running tasks may not be
>   * related to the number of busy CPUs due to sched_setaffinity.
>   */
> -static inline bool allow_numa_imbalance(int dst_running, int dst_weight)
> +static inline bool
> +allow_numa_imbalance(unsigned int running, unsigned int weight)
>  {
> -	return (dst_running < (dst_weight >> 2));
> +	return (running < (weight >> 2));
>  }
> 
>  /*
> @@ -9139,12 +9140,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>  				return idlest;
>  #endif
>  			/*
> -			 * Otherwise, keep the task on this node to stay close
> -			 * its wakeup source and improve locality. If there is
> -			 * a real need of migration, periodic load balance will
> -			 * take care of it.
> +			 * Otherwise, keep the task close to the wakeup source
> +			 * and improve locality if the number of running tasks
> +			 * would remain below threshold where an imbalance is
> +			 * allowed. If there is a real need of migration,
> +			 * periodic load balance will take care of it.
>  			 */
> -			if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight))
> +			if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, local_sgs.group_weight))
>  				return NULL;
>  		}
> 
> @@ -9350,7 +9352,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>  		/* Consider allowing a small imbalance between NUMA groups */
>  		if (env->sd->flags & SD_NUMA) {
>  			env->imbalance = adjust_numa_imbalance(env->imbalance,
> -				busiest->sum_nr_running, busiest->group_weight);
> +				local->sum_nr_running + 1, local->group_weight);
>  		}
> 
>  		return;
> -- 
> 2.31.1
> 

Looks good to me.

Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2022-02-08  9:43 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs Mel Gorman
  2022-02-08 16:19   ` Gautham R. Shenoy
  2022-02-09  5:10   ` K Prateek Nayak
@ 2022-02-14 10:27   ` Srikar Dronamraju
  2022-02-14 10:30   ` [tip: sched/core] " tip-bot2 for Mel Gorman
  2022-02-14 11:03   ` [PATCH 2/2] " Vincent Guittot
  4 siblings, 0 replies; 15+ messages in thread
From: Srikar Dronamraju @ 2022-02-14 10:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Gautham Shenoy,
	K Prateek Nayak, LKML

* Mel Gorman <mgorman@techsingularity.net> [2022-02-08 09:43:34]:

> Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA
> nodes") allowed an imbalance between NUMA nodes such that communicating
> tasks would not be pulled apart by the load balancer. This works fine when
> there is a 1:1 relationship between LLC and node but can be suboptimal
> for multiple LLCs if independent tasks prematurely use CPUs sharing cache.
> 
> Zen* has multiple LLCs per node with local memory channels and due to
> the allowed imbalance, it's far harder to tune some workloads to run
> optimally than it is on hardware that has 1 LLC per node. This patch
> allows an imbalance to exist up to the point where LLCs should be balanced
> between nodes.
> 
> On a Zen3 machine running STREAM parallelised with OMP to have on instance
> per LLC the results and without binding, the results are
> 
>                             5.17.0-rc0             5.17.0-rc0
> 

Looks good to me.

Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [tip: sched/core] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2022-02-08  9:43 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs Mel Gorman
                     ` (2 preceding siblings ...)
  2022-02-14 10:27   ` Srikar Dronamraju
@ 2022-02-14 10:30   ` tip-bot2 for Mel Gorman
  2022-02-14 11:03   ` [PATCH 2/2] " Vincent Guittot
  4 siblings, 0 replies; 15+ messages in thread
From: tip-bot2 for Mel Gorman @ 2022-02-14 10:30 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Mel Gorman, Peter Zijlstra (Intel),
	Gautham R. Shenoy, K Prateek Nayak, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     e496132ebedd870b67f1f6d2428f9bb9d7ae27fd
Gitweb:        https://git.kernel.org/tip/e496132ebedd870b67f1f6d2428f9bb9d7ae27fd
Author:        Mel Gorman <mgorman@techsingularity.net>
AuthorDate:    Tue, 08 Feb 2022 09:43:34 
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 11 Feb 2022 23:30:08 +01:00

sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs

Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA
nodes") allowed an imbalance between NUMA nodes such that communicating
tasks would not be pulled apart by the load balancer. This works fine when
there is a 1:1 relationship between LLC and node but can be suboptimal
for multiple LLCs if independent tasks prematurely use CPUs sharing cache.

Zen* has multiple LLCs per node with local memory channels and due to
the allowed imbalance, it's far harder to tune some workloads to run
optimally than it is on hardware that has 1 LLC per node. This patch
allows an imbalance to exist up to the point where LLCs should be balanced
between nodes.

On a Zen3 machine running STREAM parallelised with OMP to have on instance
per LLC the results and without binding, the results are

                            5.17.0-rc0             5.17.0-rc0
                               vanilla       sched-numaimb-v6
MB/sec copy-16    162596.94 (   0.00%)   580559.74 ( 257.05%)
MB/sec scale-16   136901.28 (   0.00%)   374450.52 ( 173.52%)
MB/sec add-16     157300.70 (   0.00%)   564113.76 ( 258.62%)
MB/sec triad-16   151446.88 (   0.00%)   564304.24 ( 272.61%)

STREAM can use directives to force the spread if the OpenMP is new
enough but that doesn't help if an application uses threads and
it's not known in advance how many threads will be created.

Coremark is a CPU and cache intensive benchmark parallelised with
threads. When running with 1 thread per core, the vanilla kernel
allows threads to contend on cache. With the patch;

                               5.17.0-rc0             5.17.0-rc0
                                  vanilla       sched-numaimb-v5
Min       Score-16   368239.36 (   0.00%)   389816.06 (   5.86%)
Hmean     Score-16   388607.33 (   0.00%)   427877.08 *  10.11%*
Max       Score-16   408945.69 (   0.00%)   481022.17 (  17.62%)
Stddev    Score-16    15247.04 (   0.00%)    24966.82 ( -63.75%)
CoeffVar  Score-16        3.92 (   0.00%)        5.82 ( -48.48%)

It can also make a big difference for semi-realistic workloads
like specjbb which can execute arbitrary numbers of threads without
advance knowledge of how they should be placed. Even in cases where
the average performance is neutral, the results are more stable.

                               5.17.0-rc0             5.17.0-rc0
                                  vanilla       sched-numaimb-v6
Hmean     tput-1      71631.55 (   0.00%)    73065.57 (   2.00%)
Hmean     tput-8     582758.78 (   0.00%)   556777.23 (  -4.46%)
Hmean     tput-16   1020372.75 (   0.00%)  1009995.26 (  -1.02%)
Hmean     tput-24   1416430.67 (   0.00%)  1398700.11 (  -1.25%)
Hmean     tput-32   1687702.72 (   0.00%)  1671357.04 (  -0.97%)
Hmean     tput-40   1798094.90 (   0.00%)  2015616.46 *  12.10%*
Hmean     tput-48   1972731.77 (   0.00%)  2333233.72 (  18.27%)
Hmean     tput-56   2386872.38 (   0.00%)  2759483.38 (  15.61%)
Hmean     tput-64   2909475.33 (   0.00%)  2925074.69 (   0.54%)
Hmean     tput-72   2585071.36 (   0.00%)  2962443.97 (  14.60%)
Hmean     tput-80   2994387.24 (   0.00%)  3015980.59 (   0.72%)
Hmean     tput-88   3061408.57 (   0.00%)  3010296.16 (  -1.67%)
Hmean     tput-96   3052394.82 (   0.00%)  2784743.41 (  -8.77%)
Hmean     tput-104  2997814.76 (   0.00%)  2758184.50 (  -7.99%)
Hmean     tput-112  2955353.29 (   0.00%)  2859705.09 (  -3.24%)
Hmean     tput-120  2889770.71 (   0.00%)  2764478.46 (  -4.34%)
Hmean     tput-128  2871713.84 (   0.00%)  2750136.73 (  -4.23%)
Stddev    tput-1       5325.93 (   0.00%)     2002.53 (  62.40%)
Stddev    tput-8       6630.54 (   0.00%)    10905.00 ( -64.47%)
Stddev    tput-16     25608.58 (   0.00%)     6851.16 (  73.25%)
Stddev    tput-24     12117.69 (   0.00%)     4227.79 (  65.11%)
Stddev    tput-32     27577.16 (   0.00%)     8761.05 (  68.23%)
Stddev    tput-40     59505.86 (   0.00%)     2048.49 (  96.56%)
Stddev    tput-48    168330.30 (   0.00%)    93058.08 (  44.72%)
Stddev    tput-56    219540.39 (   0.00%)    30687.02 (  86.02%)
Stddev    tput-64    121750.35 (   0.00%)     9617.36 (  92.10%)
Stddev    tput-72    223387.05 (   0.00%)    34081.13 (  84.74%)
Stddev    tput-80    128198.46 (   0.00%)    22565.19 (  82.40%)
Stddev    tput-88    136665.36 (   0.00%)    27905.97 (  79.58%)
Stddev    tput-96    111925.81 (   0.00%)    99615.79 (  11.00%)
Stddev    tput-104   146455.96 (   0.00%)    28861.98 (  80.29%)
Stddev    tput-112    88740.49 (   0.00%)    58288.23 (  34.32%)
Stddev    tput-120   186384.86 (   0.00%)    45812.03 (  75.42%)
Stddev    tput-128    78761.09 (   0.00%)    57418.48 (  27.10%)

Similarly, for embarassingly parallel problems like NPB-ep, there are
improvements due to better spreading across LLC when the machine is not
fully utilised.

                              vanilla       sched-numaimb-v6
Min       ep.D       31.79 (   0.00%)       26.11 (  17.87%)
Amean     ep.D       31.86 (   0.00%)       26.17 *  17.86%*
Stddev    ep.D        0.07 (   0.00%)        0.05 (  24.41%)
CoeffVar  ep.D        0.22 (   0.00%)        0.20 (   7.97%)
Max       ep.D       31.93 (   0.00%)       26.21 (  17.91%)

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20220208094334.16379-3-mgorman@techsingularity.net
---
 include/linux/sched/topology.h |  1 +-
 kernel/sched/fair.c            | 22 +++++++-------
 kernel/sched/topology.c        | 53 +++++++++++++++++++++++++++++++++-
 3 files changed, 66 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 8054641..56cffe4 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -93,6 +93,7 @@ struct sched_domain {
 	unsigned int busy_factor;	/* less balancing by factor if busy */
 	unsigned int imbalance_pct;	/* No balance until over watermark */
 	unsigned int cache_nice_tries;	/* Leave cache hot tasks for # tries */
+	unsigned int imb_numa_nr;	/* Nr running tasks that allows a NUMA imbalance */
 
 	int nohz_idle;			/* NOHZ IDLE status */
 	int flags;			/* See SD_* */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ea71016..5c4bfff 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1489,6 +1489,7 @@ struct task_numa_env {
 
 	int src_cpu, src_nid;
 	int dst_cpu, dst_nid;
+	int imb_numa_nr;
 
 	struct numa_stats src_stats, dst_stats;
 
@@ -1503,7 +1504,7 @@ struct task_numa_env {
 static unsigned long cpu_load(struct rq *rq);
 static unsigned long cpu_runnable(struct rq *rq);
 static inline long adjust_numa_imbalance(int imbalance,
-					int dst_running, int dst_weight);
+					int dst_running, int imb_numa_nr);
 
 static inline enum
 numa_type numa_classify(unsigned int imbalance_pct,
@@ -1884,7 +1885,7 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 		dst_running = env->dst_stats.nr_running + 1;
 		imbalance = max(0, dst_running - src_running);
 		imbalance = adjust_numa_imbalance(imbalance, dst_running,
-							env->dst_stats.weight);
+						  env->imb_numa_nr);
 
 		/* Use idle CPU if there is no imbalance */
 		if (!imbalance) {
@@ -1949,8 +1950,10 @@ static int task_numa_migrate(struct task_struct *p)
 	 */
 	rcu_read_lock();
 	sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu));
-	if (sd)
+	if (sd) {
 		env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
+		env.imb_numa_nr = sd->imb_numa_nr;
+	}
 	rcu_read_unlock();
 
 	/*
@@ -9005,10 +9008,9 @@ static bool update_pick_idlest(struct sched_group *idlest,
  * This is an approximation as the number of running tasks may not be
  * related to the number of busy CPUs due to sched_setaffinity.
  */
-static inline bool
-allow_numa_imbalance(unsigned int running, unsigned int weight)
+static inline bool allow_numa_imbalance(int running, int imb_numa_nr)
 {
-	return (running < (weight >> 2));
+	return running <= imb_numa_nr;
 }
 
 /*
@@ -9148,7 +9150,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 			 * allowed. If there is a real need of migration,
 			 * periodic load balance will take care of it.
 			 */
-			if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, local_sgs.group_weight))
+			if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr))
 				return NULL;
 		}
 
@@ -9240,9 +9242,9 @@ next_group:
 #define NUMA_IMBALANCE_MIN 2
 
 static inline long adjust_numa_imbalance(int imbalance,
-				int dst_running, int dst_weight)
+				int dst_running, int imb_numa_nr)
 {
-	if (!allow_numa_imbalance(dst_running, dst_weight))
+	if (!allow_numa_imbalance(dst_running, imb_numa_nr))
 		return imbalance;
 
 	/*
@@ -9354,7 +9356,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		/* Consider allowing a small imbalance between NUMA groups */
 		if (env->sd->flags & SD_NUMA) {
 			env->imbalance = adjust_numa_imbalance(env->imbalance,
-				local->sum_nr_running + 1, local->group_weight);
+				local->sum_nr_running + 1, env->sd->imb_numa_nr);
 		}
 
 		return;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index d201a70..e6cd559 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2242,6 +2242,59 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 		}
 	}
 
+	/*
+	 * Calculate an allowed NUMA imbalance such that LLCs do not get
+	 * imbalanced.
+	 */
+	for_each_cpu(i, cpu_map) {
+		unsigned int imb = 0;
+		unsigned int imb_span = 1;
+
+		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
+			struct sched_domain *child = sd->child;
+
+			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
+			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
+				struct sched_domain *top, *top_p;
+				unsigned int nr_llcs;
+
+				/*
+				 * For a single LLC per node, allow an
+				 * imbalance up to 25% of the node. This is an
+				 * arbitrary cutoff based on SMT-2 to balance
+				 * between memory bandwidth and avoiding
+				 * premature sharing of HT resources and SMT-4
+				 * or SMT-8 *may* benefit from a different
+				 * cutoff.
+				 *
+				 * For multiple LLCs, allow an imbalance
+				 * until multiple tasks would share an LLC
+				 * on one node while LLCs on another node
+				 * remain idle.
+				 */
+				nr_llcs = sd->span_weight / child->span_weight;
+				if (nr_llcs == 1)
+					imb = sd->span_weight >> 2;
+				else
+					imb = nr_llcs;
+				sd->imb_numa_nr = imb;
+
+				/* Set span based on the first NUMA domain. */
+				top = sd;
+				top_p = top->parent;
+				while (top_p && !(top_p->flags & SD_NUMA)) {
+					top = top->parent;
+					top_p = top->parent;
+				}
+				imb_span = top_p ? top_p->span_weight : sd->span_weight;
+			} else {
+				int factor = max(1U, (sd->span_weight / imb_span));
+
+				sd->imb_numa_nr = imb * factor;
+			}
+		}
+	}
+
 	/* Calculate CPU capacity for physical packages and nodes */
 	for (i = nr_cpumask_bits-1; i >= 0; i--) {
 		if (!cpumask_test_cpu(i, cpu_map))

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [tip: sched/core] sched/fair: Improve consistency of allowed NUMA balance calculations
  2022-02-08  9:43 ` [PATCH 1/2] sched/fair: Improve consistency of allowed NUMA balance calculations Mel Gorman
                     ` (2 preceding siblings ...)
  2022-02-14 10:26   ` Srikar Dronamraju
@ 2022-02-14 10:30   ` tip-bot2 for Mel Gorman
  3 siblings, 0 replies; 15+ messages in thread
From: tip-bot2 for Mel Gorman @ 2022-02-14 10:30 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Mel Gorman, Peter Zijlstra (Intel), Gautham R. Shenoy, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     2cfb7a1b031b0e816af7a6ee0c6ab83b0acdf05a
Gitweb:        https://git.kernel.org/tip/2cfb7a1b031b0e816af7a6ee0c6ab83b0acdf05a
Author:        Mel Gorman <mgorman@techsingularity.net>
AuthorDate:    Tue, 08 Feb 2022 09:43:33 
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 11 Feb 2022 23:30:08 +01:00

sched/fair: Improve consistency of allowed NUMA balance calculations

There are inconsistencies when determining if a NUMA imbalance is allowed
that should be corrected.

o allow_numa_imbalance changes types and is not always examining
  the destination group so both the type should be corrected as
  well as the naming.
o find_idlest_group uses the sched_domain's weight instead of the
  group weight which is different to find_busiest_group
o find_busiest_group uses the source group instead of the destination
  which is different to task_numa_find_cpu
o Both find_idlest_group and find_busiest_group should account
  for the number of running tasks if a move was allowed to be
  consistent with task_numa_find_cpu

Fixes: 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA nodes")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Link: https://lore.kernel.org/r/20220208094334.16379-2-mgorman@techsingularity.net
---
 kernel/sched/fair.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5dca13f..ea71016 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9005,9 +9005,10 @@ static bool update_pick_idlest(struct sched_group *idlest,
  * This is an approximation as the number of running tasks may not be
  * related to the number of busy CPUs due to sched_setaffinity.
  */
-static inline bool allow_numa_imbalance(int dst_running, int dst_weight)
+static inline bool
+allow_numa_imbalance(unsigned int running, unsigned int weight)
 {
-	return (dst_running < (dst_weight >> 2));
+	return (running < (weight >> 2));
 }
 
 /*
@@ -9141,12 +9142,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 				return idlest;
 #endif
 			/*
-			 * Otherwise, keep the task on this node to stay close
-			 * its wakeup source and improve locality. If there is
-			 * a real need of migration, periodic load balance will
-			 * take care of it.
+			 * Otherwise, keep the task close to the wakeup source
+			 * and improve locality if the number of running tasks
+			 * would remain below threshold where an imbalance is
+			 * allowed. If there is a real need of migration,
+			 * periodic load balance will take care of it.
 			 */
-			if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight))
+			if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, local_sgs.group_weight))
 				return NULL;
 		}
 
@@ -9352,7 +9354,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		/* Consider allowing a small imbalance between NUMA groups */
 		if (env->sd->flags & SD_NUMA) {
 			env->imbalance = adjust_numa_imbalance(env->imbalance,
-				busiest->sum_nr_running, busiest->group_weight);
+				local->sum_nr_running + 1, local->group_weight);
 		}
 
 		return;

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2022-02-08  9:43 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs Mel Gorman
                     ` (3 preceding siblings ...)
  2022-02-14 10:30   ` [tip: sched/core] " tip-bot2 for Mel Gorman
@ 2022-02-14 11:03   ` Vincent Guittot
  4 siblings, 0 replies; 15+ messages in thread
From: Vincent Guittot @ 2022-02-14 11:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy,
	K Prateek Nayak, LKML

On Tue, 8 Feb 2022 at 10:44, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA
> nodes") allowed an imbalance between NUMA nodes such that communicating
> tasks would not be pulled apart by the load balancer. This works fine when
> there is a 1:1 relationship between LLC and node but can be suboptimal
> for multiple LLCs if independent tasks prematurely use CPUs sharing cache.
>
> Zen* has multiple LLCs per node with local memory channels and due to
> the allowed imbalance, it's far harder to tune some workloads to run
> optimally than it is on hardware that has 1 LLC per node. This patch
> allows an imbalance to exist up to the point where LLCs should be balanced
> between nodes.
>
> On a Zen3 machine running STREAM parallelised with OMP to have on instance
> per LLC the results and without binding, the results are
>
>                             5.17.0-rc0             5.17.0-rc0
>                                vanilla       sched-numaimb-v6
> MB/sec copy-16    162596.94 (   0.00%)   580559.74 ( 257.05%)
> MB/sec scale-16   136901.28 (   0.00%)   374450.52 ( 173.52%)
> MB/sec add-16     157300.70 (   0.00%)   564113.76 ( 258.62%)
> MB/sec triad-16   151446.88 (   0.00%)   564304.24 ( 272.61%)
>
> STREAM can use directives to force the spread if the OpenMP is new
> enough but that doesn't help if an application uses threads and
> it's not known in advance how many threads will be created.
>
> Coremark is a CPU and cache intensive benchmark parallelised with
> threads. When running with 1 thread per core, the vanilla kernel
> allows threads to contend on cache. With the patch;
>
>                                5.17.0-rc0             5.17.0-rc0
>                                   vanilla       sched-numaimb-v5
> Min       Score-16   368239.36 (   0.00%)   389816.06 (   5.86%)
> Hmean     Score-16   388607.33 (   0.00%)   427877.08 *  10.11%*
> Max       Score-16   408945.69 (   0.00%)   481022.17 (  17.62%)
> Stddev    Score-16    15247.04 (   0.00%)    24966.82 ( -63.75%)
> CoeffVar  Score-16        3.92 (   0.00%)        5.82 ( -48.48%)
>
> It can also make a big difference for semi-realistic workloads
> like specjbb which can execute arbitrary numbers of threads without
> advance knowledge of how they should be placed. Even in cases where
> the average performance is neutral, the results are more stable.
>
>                                5.17.0-rc0             5.17.0-rc0
>                                   vanilla       sched-numaimb-v6
> Hmean     tput-1      71631.55 (   0.00%)    73065.57 (   2.00%)
> Hmean     tput-8     582758.78 (   0.00%)   556777.23 (  -4.46%)
> Hmean     tput-16   1020372.75 (   0.00%)  1009995.26 (  -1.02%)
> Hmean     tput-24   1416430.67 (   0.00%)  1398700.11 (  -1.25%)
> Hmean     tput-32   1687702.72 (   0.00%)  1671357.04 (  -0.97%)
> Hmean     tput-40   1798094.90 (   0.00%)  2015616.46 *  12.10%*
> Hmean     tput-48   1972731.77 (   0.00%)  2333233.72 (  18.27%)
> Hmean     tput-56   2386872.38 (   0.00%)  2759483.38 (  15.61%)
> Hmean     tput-64   2909475.33 (   0.00%)  2925074.69 (   0.54%)
> Hmean     tput-72   2585071.36 (   0.00%)  2962443.97 (  14.60%)
> Hmean     tput-80   2994387.24 (   0.00%)  3015980.59 (   0.72%)
> Hmean     tput-88   3061408.57 (   0.00%)  3010296.16 (  -1.67%)
> Hmean     tput-96   3052394.82 (   0.00%)  2784743.41 (  -8.77%)
> Hmean     tput-104  2997814.76 (   0.00%)  2758184.50 (  -7.99%)
> Hmean     tput-112  2955353.29 (   0.00%)  2859705.09 (  -3.24%)
> Hmean     tput-120  2889770.71 (   0.00%)  2764478.46 (  -4.34%)
> Hmean     tput-128  2871713.84 (   0.00%)  2750136.73 (  -4.23%)
> Stddev    tput-1       5325.93 (   0.00%)     2002.53 (  62.40%)
> Stddev    tput-8       6630.54 (   0.00%)    10905.00 ( -64.47%)
> Stddev    tput-16     25608.58 (   0.00%)     6851.16 (  73.25%)
> Stddev    tput-24     12117.69 (   0.00%)     4227.79 (  65.11%)
> Stddev    tput-32     27577.16 (   0.00%)     8761.05 (  68.23%)
> Stddev    tput-40     59505.86 (   0.00%)     2048.49 (  96.56%)
> Stddev    tput-48    168330.30 (   0.00%)    93058.08 (  44.72%)
> Stddev    tput-56    219540.39 (   0.00%)    30687.02 (  86.02%)
> Stddev    tput-64    121750.35 (   0.00%)     9617.36 (  92.10%)
> Stddev    tput-72    223387.05 (   0.00%)    34081.13 (  84.74%)
> Stddev    tput-80    128198.46 (   0.00%)    22565.19 (  82.40%)
> Stddev    tput-88    136665.36 (   0.00%)    27905.97 (  79.58%)
> Stddev    tput-96    111925.81 (   0.00%)    99615.79 (  11.00%)
> Stddev    tput-104   146455.96 (   0.00%)    28861.98 (  80.29%)
> Stddev    tput-112    88740.49 (   0.00%)    58288.23 (  34.32%)
> Stddev    tput-120   186384.86 (   0.00%)    45812.03 (  75.42%)
> Stddev    tput-128    78761.09 (   0.00%)    57418.48 (  27.10%)
>
> Similarly, for embarassingly parallel problems like NPB-ep, there are
> improvements due to better spreading across LLC when the machine is not
> fully utilised.
>
>                               vanilla       sched-numaimb-v6
> Min       ep.D       31.79 (   0.00%)       26.11 (  17.87%)
> Amean     ep.D       31.86 (   0.00%)       26.17 *  17.86%*
> Stddev    ep.D        0.07 (   0.00%)        0.05 (  24.41%)
> CoeffVar  ep.D        0.22 (   0.00%)        0.20 (   7.97%)
> Max       ep.D       31.93 (   0.00%)       26.21 (  17.91%)
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Good to see that you have been able to move on SD_NUMA instead of
SD_PREFER_SIBLING.
The allowed imbalance looks also more consistent whatever the number of LLC

Reviewed-by: Vincent Guittot <vincent.guitto@linaro.org>


> ---
>  include/linux/sched/topology.h |  1 +
>  kernel/sched/fair.c            | 22 +++++++-------
>  kernel/sched/topology.c        | 53 ++++++++++++++++++++++++++++++++++
>  3 files changed, 66 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 8054641c0a7b..56cffe42abbc 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -93,6 +93,7 @@ struct sched_domain {
>         unsigned int busy_factor;       /* less balancing by factor if busy */
>         unsigned int imbalance_pct;     /* No balance until over watermark */
>         unsigned int cache_nice_tries;  /* Leave cache hot tasks for # tries */
> +       unsigned int imb_numa_nr;       /* Nr running tasks that allows a NUMA imbalance */
>
>         int nohz_idle;                  /* NOHZ IDLE status */
>         int flags;                      /* See SD_* */
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4592ccf82c34..538756bd8e7f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1489,6 +1489,7 @@ struct task_numa_env {
>
>         int src_cpu, src_nid;
>         int dst_cpu, dst_nid;
> +       int imb_numa_nr;
>
>         struct numa_stats src_stats, dst_stats;
>
> @@ -1503,7 +1504,7 @@ struct task_numa_env {
>  static unsigned long cpu_load(struct rq *rq);
>  static unsigned long cpu_runnable(struct rq *rq);
>  static inline long adjust_numa_imbalance(int imbalance,
> -                                       int dst_running, int dst_weight);
> +                                       int dst_running, int imb_numa_nr);
>
>  static inline enum
>  numa_type numa_classify(unsigned int imbalance_pct,
> @@ -1884,7 +1885,7 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>                 dst_running = env->dst_stats.nr_running + 1;
>                 imbalance = max(0, dst_running - src_running);
>                 imbalance = adjust_numa_imbalance(imbalance, dst_running,
> -                                                       env->dst_stats.weight);
> +                                                 env->imb_numa_nr);
>
>                 /* Use idle CPU if there is no imbalance */
>                 if (!imbalance) {
> @@ -1949,8 +1950,10 @@ static int task_numa_migrate(struct task_struct *p)
>          */
>         rcu_read_lock();
>         sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu));
> -       if (sd)
> +       if (sd) {
>                 env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
> +               env.imb_numa_nr = sd->imb_numa_nr;
> +       }
>         rcu_read_unlock();
>
>         /*
> @@ -9003,10 +9006,9 @@ static bool update_pick_idlest(struct sched_group *idlest,
>   * This is an approximation as the number of running tasks may not be
>   * related to the number of busy CPUs due to sched_setaffinity.
>   */
> -static inline bool
> -allow_numa_imbalance(unsigned int running, unsigned int weight)
> +static inline bool allow_numa_imbalance(int running, int imb_numa_nr)
>  {
> -       return (running < (weight >> 2));
> +       return running <= imb_numa_nr;
>  }
>
>  /*
> @@ -9146,7 +9148,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>                          * allowed. If there is a real need of migration,
>                          * periodic load balance will take care of it.
>                          */
> -                       if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, local_sgs.group_weight))
> +                       if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr))
>                                 return NULL;
>                 }
>
> @@ -9238,9 +9240,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  #define NUMA_IMBALANCE_MIN 2
>
>  static inline long adjust_numa_imbalance(int imbalance,
> -                               int dst_running, int dst_weight)
> +                               int dst_running, int imb_numa_nr)
>  {
> -       if (!allow_numa_imbalance(dst_running, dst_weight))
> +       if (!allow_numa_imbalance(dst_running, imb_numa_nr))
>                 return imbalance;
>
>         /*
> @@ -9352,7 +9354,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>                 /* Consider allowing a small imbalance between NUMA groups */
>                 if (env->sd->flags & SD_NUMA) {
>                         env->imbalance = adjust_numa_imbalance(env->imbalance,
> -                               local->sum_nr_running + 1, local->group_weight);
> +                               local->sum_nr_running + 1, env->sd->imb_numa_nr);
>                 }
>
>                 return;
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index d201a7052a29..e6cd55951304 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2242,6 +2242,59 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>                 }
>         }
>
> +       /*
> +        * Calculate an allowed NUMA imbalance such that LLCs do not get
> +        * imbalanced.
> +        */
> +       for_each_cpu(i, cpu_map) {
> +               unsigned int imb = 0;
> +               unsigned int imb_span = 1;
> +
> +               for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> +                       struct sched_domain *child = sd->child;
> +
> +                       if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> +                           (child->flags & SD_SHARE_PKG_RESOURCES)) {
> +                               struct sched_domain *top, *top_p;
> +                               unsigned int nr_llcs;
> +
> +                               /*
> +                                * For a single LLC per node, allow an
> +                                * imbalance up to 25% of the node. This is an
> +                                * arbitrary cutoff based on SMT-2 to balance
> +                                * between memory bandwidth and avoiding
> +                                * premature sharing of HT resources and SMT-4
> +                                * or SMT-8 *may* benefit from a different
> +                                * cutoff.
> +                                *
> +                                * For multiple LLCs, allow an imbalance
> +                                * until multiple tasks would share an LLC
> +                                * on one node while LLCs on another node
> +                                * remain idle.
> +                                */
> +                               nr_llcs = sd->span_weight / child->span_weight;
> +                               if (nr_llcs == 1)
> +                                       imb = sd->span_weight >> 2;
> +                               else
> +                                       imb = nr_llcs;
> +                               sd->imb_numa_nr = imb;
> +
> +                               /* Set span based on the first NUMA domain. */
> +                               top = sd;
> +                               top_p = top->parent;
> +                               while (top_p && !(top_p->flags & SD_NUMA)) {
> +                                       top = top->parent;
> +                                       top_p = top->parent;
> +                               }
> +                               imb_span = top_p ? top_p->span_weight : sd->span_weight;
> +                       } else {
> +                               int factor = max(1U, (sd->span_weight / imb_span));
> +
> +                               sd->imb_numa_nr = imb * factor;
> +                       }
> +               }
> +       }
> +
>         /* Calculate CPU capacity for physical packages and nodes */
>         for (i = nr_cpumask_bits-1; i >= 0; i--) {
>                 if (!cpumask_test_cpu(i, cpu_map))
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2022-02-14 11:27 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-08  9:43 [PATCH v6 0/2] Adjust NUMA imbalance for multiple LLCs Mel Gorman
2022-02-08  9:43 ` [PATCH 1/2] sched/fair: Improve consistency of allowed NUMA balance calculations Mel Gorman
2022-02-08 15:06   ` Gautham R. Shenoy
2022-02-14  9:48   ` Vincent Guittot
2022-02-14 10:26   ` Srikar Dronamraju
2022-02-14 10:30   ` [tip: sched/core] " tip-bot2 for Mel Gorman
2022-02-08  9:43 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs Mel Gorman
2022-02-08 16:19   ` Gautham R. Shenoy
2022-02-09  5:10   ` K Prateek Nayak
2022-02-09 10:33     ` Mel Gorman
2022-02-11 19:02       ` Jirka Hladky
2022-02-14 10:27   ` Srikar Dronamraju
2022-02-14 10:30   ` [tip: sched/core] " tip-bot2 for Mel Gorman
2022-02-14 11:03   ` [PATCH 2/2] " Vincent Guittot
2022-02-09  9:38 ` [PATCH v6 0/2] Adjust NUMA imbalance for " Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).