* [PATCH v4 0/2] Adjust NUMA imbalance for multiple LLCs
@ 2021-12-10  9:33 Mel Gorman
  2021-12-10  9:33 ` [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group Mel Gorman
  2021-12-10  9:33 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs Mel Gorman
  0 siblings, 2 replies; 27+ messages in thread
From: Mel Gorman @ 2021-12-10  9:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy,
	LKML, Mel Gorman

Changelog since V3
o Calculate imb_numa_nr for multiple SD_NUMA domains
o Restore behaviour where communicating pairs remain on the same node

Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA
nodes") allowed an imbalance between NUMA nodes such that communicating
tasks would not be pulled apart by the load balancer. This works fine when
there is a 1:1 relationship between LLC and node but can be suboptimal
for multiple LLCs if independent tasks prematurely use CPUs sharing cache.

The series addresses two problems -- inconsistent use of scheduler domain
weights and sub-optimal performance when there are many LLCs per NUMA node.

 include/linux/sched/topology.h |  1 +
 kernel/sched/fair.c            | 36 ++++++++++++++++---------------
 kernel/sched/topology.c        | 39 ++++++++++++++++++++++++++++++++++
 3 files changed, 59 insertions(+), 17 deletions(-)

-- 
2.31.1

Mel Gorman (2):
  sched/fair: Use weight of SD_NUMA domain in find_busiest_group
  sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans
    multiple LLCs

 include/linux/sched/topology.h |  1 +
 kernel/sched/fair.c            | 36 +++++++++++++++++----------------
 kernel/sched/topology.c        | 37 ++++++++++++++++++++++++++++++++++
 3 files changed, 57 insertions(+), 17 deletions(-)

-- 
2.31.1



* [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group
  2021-12-10  9:33 [PATCH v4 0/2] Adjust NUMA imbalance for multiple LLCs Mel Gorman
@ 2021-12-10  9:33 ` Mel Gorman
  2021-12-21 10:53   ` Vincent Guittot
  2021-12-10  9:33 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs Mel Gorman
  1 sibling, 1 reply; 27+ messages in thread
From: Mel Gorman @ 2021-12-10  9:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy,
	LKML, Mel Gorman

find_busiest_group uses the child domain's group weight instead of
the sched_domain's weight that has SD_NUMA set when calculating the
allowed imbalance between NUMA nodes. This is wrong and inconsistent
with find_idlest_group.

This patch uses the SD_NUMA weight in both.

Fixes: 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA nodes")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6e476f6d9435..0a969affca76 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9397,7 +9397,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		/* Consider allowing a small imbalance between NUMA groups */
 		if (env->sd->flags & SD_NUMA) {
 			env->imbalance = adjust_numa_imbalance(env->imbalance,
-				busiest->sum_nr_running, busiest->group_weight);
+				busiest->sum_nr_running, env->sd->span_weight);
 		}
 
 		return;
-- 
2.31.1



* [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-10  9:33 [PATCH v4 0/2] Adjust NUMA imbalance for multiple LLCs Mel Gorman
  2021-12-10  9:33 ` [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group Mel Gorman
@ 2021-12-10  9:33 ` Mel Gorman
  2021-12-13  8:28   ` Gautham R. Shenoy
  2021-12-17 19:54   ` Gautham R. Shenoy
  1 sibling, 2 replies; 27+ messages in thread
From: Mel Gorman @ 2021-12-10  9:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy,
	LKML, Mel Gorman

Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA
nodes") allowed an imbalance between NUMA nodes such that communicating
tasks would not be pulled apart by the load balancer. This works fine when
there is a 1:1 relationship between LLC and node but can be suboptimal
for multiple LLCs if independent tasks prematurely use CPUs sharing cache.

Zen* has multiple LLCs per node with local memory channels and due to
the allowed imbalance, it's far harder to tune some workloads to run
optimally than it is on hardware that has 1 LLC per node. This patch
adjusts the imbalance on multi-LLC machines to allow an imbalance up to
the point where LLCs should be balanced between nodes.

On a Zen3 machine running STREAM parallelised with OMP to have one instance
per LLC and without binding, the results are

                            5.16.0-rc1             5.16.0-rc1
                               vanilla       sched-numaimb-v4
MB/sec copy-16    166712.18 (   0.00%)   651540.22 ( 290.82%)
MB/sec scale-16   140109.66 (   0.00%)   382254.74 ( 172.83%)
MB/sec add-16     160791.18 (   0.00%)   623073.98 ( 287.51%)
MB/sec triad-16   160043.84 (   0.00%)   633964.52 ( 296.12%)

STREAM can use directives to force the spread if the OpenMP runtime is new
enough, but that does not help if an application uses threads and it is not
known in advance how many threads will be created.

Coremark is a CPU- and cache-intensive benchmark parallelised with
threads. When running with 1 thread per instance, the vanilla kernel
allows threads to contend on cache. With the patch:

                               5.16.0-rc1             5.16.0-rc1
                                  vanilla    sched-numaimb-v4r24
Min       Score-16   367816.09 (   0.00%)   384015.36 (   4.40%)
Hmean     Score-16   389627.78 (   0.00%)   431907.14 *  10.85%*
Max       Score-16   416178.96 (   0.00%)   480120.03 (  15.36%)
Stddev    Score-16    17361.82 (   0.00%)    32505.34 ( -87.22%)
CoeffVar  Score-16        4.45 (   0.00%)        7.49 ( -68.30%)

It can also make a big difference for semi-realistic workloads
like specjbb, which can execute arbitrary numbers of threads without
advance knowledge of how they should be placed:

                               5.16.0-rc1             5.16.0-rc1
                                  vanilla       sched-numaimb-v4
Hmean     tput-1      73743.05 (   0.00%)    70258.27 *  -4.73%*
Hmean     tput-8     563036.51 (   0.00%)   591187.39 (   5.00%)
Hmean     tput-16   1016590.61 (   0.00%)  1032311.78 (   1.55%)
Hmean     tput-24   1418558.41 (   0.00%)  1424005.80 (   0.38%)
Hmean     tput-32   1608794.22 (   0.00%)  1907855.80 *  18.59%*
Hmean     tput-40   1761338.13 (   0.00%)  2108162.23 *  19.69%*
Hmean     tput-48   2290646.54 (   0.00%)  2214383.47 (  -3.33%)
Hmean     tput-56   2463345.12 (   0.00%)  2780216.58 *  12.86%*
Hmean     tput-64   2650213.53 (   0.00%)  2598196.66 (  -1.96%)
Hmean     tput-72   2497253.28 (   0.00%)  2998882.47 *  20.09%*
Hmean     tput-80   2820786.72 (   0.00%)  2951655.27 (   4.64%)
Hmean     tput-88   2813541.68 (   0.00%)  3045450.86 *   8.24%*
Hmean     tput-96   2604158.67 (   0.00%)  3035311.91 *  16.56%*
Hmean     tput-104  2713810.62 (   0.00%)  2984270.04 (   9.97%)
Hmean     tput-112  2558425.37 (   0.00%)  2894737.46 *  13.15%*
Hmean     tput-120  2611434.93 (   0.00%)  2781661.01 (   6.52%)
Hmean     tput-128  2706103.22 (   0.00%)  2811447.85 (   3.89%)

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/sched/topology.h |  1 +
 kernel/sched/fair.c            | 36 +++++++++++++++++----------------
 kernel/sched/topology.c        | 37 ++++++++++++++++++++++++++++++++++
 3 files changed, 57 insertions(+), 17 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index c07bfa2d80f2..54f5207154d3 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -93,6 +93,7 @@ struct sched_domain {
 	unsigned int busy_factor;	/* less balancing by factor if busy */
 	unsigned int imbalance_pct;	/* No balance until over watermark */
 	unsigned int cache_nice_tries;	/* Leave cache hot tasks for # tries */
+	unsigned int imb_numa_nr;	/* Nr imbalanced tasks allowed between nodes */
 
 	int nohz_idle;			/* NOHZ IDLE status */
 	int flags;			/* See SD_* */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0a969affca76..972ba586b113 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1489,6 +1489,7 @@ struct task_numa_env {
 
 	int src_cpu, src_nid;
 	int dst_cpu, dst_nid;
+	int imb_numa_nr;
 
 	struct numa_stats src_stats, dst_stats;
 
@@ -1504,7 +1505,8 @@ static unsigned long cpu_load(struct rq *rq);
 static unsigned long cpu_runnable(struct rq *rq);
 static unsigned long cpu_util(int cpu);
 static inline long adjust_numa_imbalance(int imbalance,
-					int dst_running, int dst_weight);
+					int dst_running, int dst_weight,
+					int imb_numa_nr);
 
 static inline enum
 numa_type numa_classify(unsigned int imbalance_pct,
@@ -1885,7 +1887,8 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 		dst_running = env->dst_stats.nr_running + 1;
 		imbalance = max(0, dst_running - src_running);
 		imbalance = adjust_numa_imbalance(imbalance, dst_running,
-							env->dst_stats.weight);
+						  env->dst_stats.weight,
+						  env->imb_numa_nr);
 
 		/* Use idle CPU if there is no imbalance */
 		if (!imbalance) {
@@ -1950,8 +1953,10 @@ static int task_numa_migrate(struct task_struct *p)
 	 */
 	rcu_read_lock();
 	sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu));
-	if (sd)
+	if (sd) {
 		env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
+		env.imb_numa_nr = sd->imb_numa_nr;
+	}
 	rcu_read_unlock();
 
 	/*
@@ -9186,12 +9191,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 				return idlest;
 #endif
 			/*
-			 * Otherwise, keep the task on this node to stay close
-			 * its wakeup source and improve locality. If there is
-			 * a real need of migration, periodic load balance will
-			 * take care of it.
+			 * Otherwise, keep the task on this node to stay local
+			 * to its wakeup source if the number of running tasks
+			 * are below the allowed imbalance. If there is a real
+			 * need of migration, periodic load balance will take
+			 * care of it.
 			 */
-			if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight))
+			if (local_sgs.sum_nr_running <= sd->imb_numa_nr)
 				return NULL;
 		}
 
@@ -9280,19 +9286,14 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	}
 }
 
-#define NUMA_IMBALANCE_MIN 2
-
 static inline long adjust_numa_imbalance(int imbalance,
-				int dst_running, int dst_weight)
+				int dst_running, int dst_weight,
+				int imb_numa_nr)
 {
 	if (!allow_numa_imbalance(dst_running, dst_weight))
 		return imbalance;
 
-	/*
-	 * Allow a small imbalance based on a simple pair of communicating
-	 * tasks that remain local when the destination is lightly loaded.
-	 */
-	if (imbalance <= NUMA_IMBALANCE_MIN)
+	if (imbalance <= imb_numa_nr)
 		return 0;
 
 	return imbalance;
@@ -9397,7 +9398,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		/* Consider allowing a small imbalance between NUMA groups */
 		if (env->sd->flags & SD_NUMA) {
 			env->imbalance = adjust_numa_imbalance(env->imbalance,
-				busiest->sum_nr_running, env->sd->span_weight);
+				busiest->sum_nr_running, env->sd->span_weight,
+				env->sd->imb_numa_nr);
 		}
 
 		return;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index d201a7052a29..bacec575ade2 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2242,6 +2242,43 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 		}
 	}
 
+	/*
+	 * Calculate an allowed NUMA imbalance such that LLCs do not get
+	 * imbalanced.
+	 */
+	for_each_cpu(i, cpu_map) {
+		unsigned int imb = 0;
+		unsigned int imb_span = 1;
+
+		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
+			struct sched_domain *child = sd->child;
+
+			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
+			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
+				struct sched_domain *top = sd;
+				unsigned int llc_sq;
+
+				/*
+				 * nr_llcs = (top->span_weight / llc_weight);
+				 * imb = (child_weight / nr_llcs) >> 2
+				 *
+				 * is equivalent to
+				 *
+				 * imb = (llc_weight^2 / top->span_weight) >> 2
+				 *
+				 */
+				llc_sq = child->span_weight * child->span_weight;
+
+				imb = max(2U, ((llc_sq / top->span_weight) >> 2));
+				imb_span = sd->span_weight;
+
+				sd->imb_numa_nr = imb;
+			} else {
+				sd->imb_numa_nr = imb * (sd->span_weight / imb_span);
+			}
+		}
+	}
+
 	/* Calculate CPU capacity for physical packages and nodes */
 	for (i = nr_cpumask_bits-1; i >= 0; i--) {
 		if (!cpumask_test_cpu(i, cpu_map))
-- 
2.31.1



* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-10  9:33 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs Mel Gorman
@ 2021-12-13  8:28   ` Gautham R. Shenoy
  2021-12-13 13:01     ` Mel Gorman
  2021-12-17 19:54   ` Gautham R. Shenoy
  1 sibling, 1 reply; 27+ messages in thread
From: Gautham R. Shenoy @ 2021-12-13  8:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML

Hello Mel,

On Fri, Dec 10, 2021 at 09:33:07AM +0000, Mel Gorman wrote:
> Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA
> nodes") allowed an imbalance between NUMA nodes such that communicating
> tasks would not be pulled apart by the load balancer. This works fine when
> there is a 1:1 relationship between LLC and node but can be suboptimal
> for multiple LLCs if independent tasks prematurely use CPUs sharing cache.
> 
> Zen* has multiple LLCs per node with local memory channels and due to
> the allowed imbalance, it's far harder to tune some workloads to run
> optimally than it is on hardware that has 1 LLC per node. This patch
> adjusts the imbalance on multi-LLC machines to allow an imbalance up to
> the point where LLCs should be balanced between nodes.
> 
> On a Zen3 machine running STREAM parallelised with OMP to have one instance
> per LLC and without binding, the results are
> 
>                             5.16.0-rc1             5.16.0-rc1
>                                vanilla       sched-numaimb-v4
> MB/sec copy-16    166712.18 (   0.00%)   651540.22 ( 290.82%)
> MB/sec scale-16   140109.66 (   0.00%)   382254.74 ( 172.83%)
> MB/sec add-16     160791.18 (   0.00%)   623073.98 ( 287.51%)
> MB/sec triad-16   160043.84 (   0.00%)   633964.52 ( 296.12%)


Could you please share the size of the stream array? These numbers
are higher than what I am observing.

> 
> STREAM can use directives to force the spread if the OpenMP is new
> enough but that doesn't help if an application uses threads and
> it's not known in advance how many threads will be created.
> 
> Coremark is a CPU and cache intensive benchmark parallelised with
> threads. When running with 1 thread per instance, the vanilla kernel
> allows threads to contend on cache. With the patch;
> 
>                                5.16.0-rc1             5.16.0-rc1
>                                   vanilla    sched-numaimb-v4r24
> Min       Score-16   367816.09 (   0.00%)   384015.36 (   4.40%)
> Hmean     Score-16   389627.78 (   0.00%)   431907.14 *  10.85%*
> Max       Score-16   416178.96 (   0.00%)   480120.03 (  15.36%)
> Stddev    Score-16    17361.82 (   0.00%)    32505.34 ( -87.22%)
> CoeffVar  Score-16        4.45 (   0.00%)        7.49 ( -68.30%)
> 
> It can also make a big difference for semi-realistic workloads
> like specjbb which can execute arbitrary numbers of threads without
> advance knowledge of how they should be placed
> 
>                                5.16.0-rc1             5.16.0-rc1
>                                   vanilla       sched-numaimb-v4
> Hmean     tput-1      73743.05 (   0.00%)    70258.27 *  -4.73%*
> Hmean     tput-8     563036.51 (   0.00%)   591187.39 (   5.00%)
> Hmean     tput-16   1016590.61 (   0.00%)  1032311.78 (   1.55%)
> Hmean     tput-24   1418558.41 (   0.00%)  1424005.80 (   0.38%)
> Hmean     tput-32   1608794.22 (   0.00%)  1907855.80 *  18.59%*
> Hmean     tput-40   1761338.13 (   0.00%)  2108162.23 *  19.69%*
> Hmean     tput-48   2290646.54 (   0.00%)  2214383.47 (  -3.33%)
> Hmean     tput-56   2463345.12 (   0.00%)  2780216.58 *  12.86%*
> Hmean     tput-64   2650213.53 (   0.00%)  2598196.66 (  -1.96%)
> Hmean     tput-72   2497253.28 (   0.00%)  2998882.47 *  20.09%*
> Hmean     tput-80   2820786.72 (   0.00%)  2951655.27 (   4.64%)
> Hmean     tput-88   2813541.68 (   0.00%)  3045450.86 *   8.24%*
> Hmean     tput-96   2604158.67 (   0.00%)  3035311.91 *  16.56%*
> Hmean     tput-104  2713810.62 (   0.00%)  2984270.04 (   9.97%)
> Hmean     tput-112  2558425.37 (   0.00%)  2894737.46 *  13.15%*
> Hmean     tput-120  2611434.93 (   0.00%)  2781661.01 (   6.52%)
> Hmean     tput-128  2706103.22 (   0.00%)  2811447.85 (   3.89%)


> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  include/linux/sched/topology.h |  1 +
>  kernel/sched/fair.c            | 36 +++++++++++++++++----------------
>  kernel/sched/topology.c        | 37 ++++++++++++++++++++++++++++++++++
>  3 files changed, 57 insertions(+), 17 deletions(-)
> 
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index c07bfa2d80f2..54f5207154d3 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -93,6 +93,7 @@ struct sched_domain {
>  	unsigned int busy_factor;	/* less balancing by factor if busy */
>  	unsigned int imbalance_pct;	/* No balance until over watermark */
>  	unsigned int cache_nice_tries;	/* Leave cache hot tasks for # tries */
> +	unsigned int imb_numa_nr;	/* Nr imbalanced tasks allowed between nodes */
>  
>  	int nohz_idle;			/* NOHZ IDLE status */
>  	int flags;			/* See SD_* */
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0a969affca76..972ba586b113 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1489,6 +1489,7 @@ struct task_numa_env {
>  
>  	int src_cpu, src_nid;
>  	int dst_cpu, dst_nid;
> +	int imb_numa_nr;
>  
>  	struct numa_stats src_stats, dst_stats;
>  
> @@ -1504,7 +1505,8 @@ static unsigned long cpu_load(struct rq *rq);
>  static unsigned long cpu_runnable(struct rq *rq);
>  static unsigned long cpu_util(int cpu);
>  static inline long adjust_numa_imbalance(int imbalance,
> -					int dst_running, int dst_weight);
> +					int dst_running, int dst_weight,
> +					int imb_numa_nr);
>  
>  static inline enum
>  numa_type numa_classify(unsigned int imbalance_pct,
> @@ -1885,7 +1887,8 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>  		dst_running = env->dst_stats.nr_running + 1;
>  		imbalance = max(0, dst_running - src_running);
>  		imbalance = adjust_numa_imbalance(imbalance, dst_running,
> -							env->dst_stats.weight);
> +						  env->dst_stats.weight,
> +						  env->imb_numa_nr);
>  
>  		/* Use idle CPU if there is no imbalance */
>  		if (!imbalance) {
> @@ -1950,8 +1953,10 @@ static int task_numa_migrate(struct task_struct *p)
>  	 */
>  	rcu_read_lock();
>  	sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu));
> -	if (sd)
> +	if (sd) {
>  		env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
> +		env.imb_numa_nr = sd->imb_numa_nr;
> +	}
>  	rcu_read_unlock();
>  
>  	/*
> @@ -9186,12 +9191,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>  				return idlest;
>  #endif
>  			/*
> -			 * Otherwise, keep the task on this node to stay close
> -			 * its wakeup source and improve locality. If there is
> -			 * a real need of migration, periodic load balance will
> -			 * take care of it.
> +			 * Otherwise, keep the task on this node to stay local
> +			 * to its wakeup source if the number of running tasks
> +			 * are below the allowed imbalance. If there is a real
> +			 * need of migration, periodic load balance will take
> +			 * care of it.
>  			 */
> -			if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight))
> +			if (local_sgs.sum_nr_running <= sd->imb_numa_nr)
>  				return NULL;
>  		}
>  
> @@ -9280,19 +9286,14 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  	}
>  }
>  
> -#define NUMA_IMBALANCE_MIN 2
> -
>  static inline long adjust_numa_imbalance(int imbalance,
> -				int dst_running, int dst_weight)
> +				int dst_running, int dst_weight,
> +				int imb_numa_nr)
>  {
>  	if (!allow_numa_imbalance(dst_running, dst_weight))
>  		return imbalance;
>

if (4 * dst_running >= dst_weight) we return imbalance here. The
dst_weight here corresponds to the span of the domain, while
dst_running is the nr_running in busiest.

On Zen3, at the topmost NUMA domain, the dst_weight = 256 across all
the configurations of Nodes Per Socket (NPS) = 1/2/4. There are two
groups, where each group is a socket. So, unless there are at least
64 tasks running in one of the sockets, we would not return imbalance
here and would go on to the next step.
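
In other words, the gate being discussed here boils down to something
like the following (a sketch of my reading of allow_numa_imbalance(),
not quoted verbatim from fair.c):

	/*
	 * Sketch: the imbalance is only "adjusted" while fewer than a
	 * quarter of the CPUs in the domain span are running tasks.
	 * With dst_weight = 256 that means fewer than 64 tasks.
	 */
	static inline bool allow_numa_imbalance(int dst_running, int dst_weight)
	{
		return 4 * dst_running < dst_weight;
	}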


> -	/*
> -	 * Allow a small imbalance based on a simple pair of communicating
> -	 * tasks that remain local when the destination is lightly loaded.
> -	 */
> -	if (imbalance <= NUMA_IMBALANCE_MIN)
> +	if (imbalance <= imb_numa_nr)

In NPS=1 mode, imb_numa_nr would be 4. Since NUMA domains don't have
PREFER_SIBLING, we would be balancing the number of idle CPUs. We will
end up rebalancing only once the difference between the idle CPUs is
at least 8.

In NPS=2, imb_numa_nr = 8 for this topmost NUMA domain. So here, we
will not rebalance unless the difference between the idle CPUs is 16.

In NPS=4, imb_numa_nr = 16 for this topmost NUMA domain. So, the
threshold is now bumped up to 32.
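
(The factor of two between imb_numa_nr and the idle-CPU difference
comes, as far as I can tell, from the balancer computing the imbalance
as about half of the idle-CPU delta when there is spare capacity. A
sketch with a hypothetical helper name; the halving is my assumption
rather than a quote from fair.c:)

	/* sketch (assumption): group_has_spare path, no PREFER_SIBLING */
	static long numa_idle_imbalance(int local_idle_cpus, int busiest_idle_cpus)
	{
		return (local_idle_cpus - busiest_idle_cpus) / 2;
	}

	/*
	 * adjust_numa_imbalance() then treats the result as 0 while it is
	 * <= imb_numa_nr, so a rebalance only happens once the idle-CPU
	 * difference exceeds roughly 2 * imb_numa_nr:
	 * 8 (NPS=1), 16 (NPS=2), 32 (NPS=4).
	 */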

>  		return 0;



>  
>  	return imbalance;
> @@ -9397,7 +9398,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>  		/* Consider allowing a small imbalance between NUMA groups */
>  		if (env->sd->flags & SD_NUMA) {
>  			env->imbalance = adjust_numa_imbalance(env->imbalance,
> -				busiest->sum_nr_running, env->sd->span_weight);
> +				busiest->sum_nr_running, env->sd->span_weight,
> +				env->sd->imb_numa_nr);
>  		}
>  
>  		return;
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index d201a7052a29..bacec575ade2 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2242,6 +2242,43 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>  		}
>  	}
>  
> +	/*
> +	 * Calculate an allowed NUMA imbalance such that LLCs do not get
> +	 * imbalanced.
> +	 */
> +	for_each_cpu(i, cpu_map) {
> +		unsigned int imb = 0;
> +		unsigned int imb_span = 1;
> +
> +		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> +			struct sched_domain *child = sd->child;
> +
> +			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> +			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
> +				struct sched_domain *top = sd;


We don't seem to be using top anywhere that sd could not be used
instead, since we already have the variables imb and imb_span to
record top->imb_numa_nr and top->span_weight.


> +				unsigned int llc_sq;
> +
> +				/*
> +				 * nr_llcs = (top->span_weight / llc_weight);
> +				 * imb = (child_weight / nr_llcs) >> 2

child here is the llc. So can we use imb = (llc_weight / nr_llcs) >> 2 instead?

> +				 *
> +				 * is equivalent to
> +				 *
> +				 * imb = (llc_weight^2 / top->span_weight) >> 2
> +				 *
> +				 */
> +				llc_sq = child->span_weight * child->span_weight;
> +
> +				imb = max(2U, ((llc_sq / top->span_weight) >> 2));
> +				imb_span = sd->span_weight;

On Zen3, child_weight (or llc_weight) = 16. llc_sq = 256.
   with NPS=1
      top = DIE.
      top->span_weight = 128. imb = max(2, (256/128) >> 2) = 2. imb_span = 128.

   with NPS=2
      top = NODE.
      top->span_weight = 64. imb = max(2, (256/64) >> 2) = 2. imb_span = 64.

   with NPS=4      
      top = NODE.
      top->span_weight = 32. imb = max(2, (256/32) >> 2) = 2. imb_span = 32.

On Zen2, child_weight (or llc_weight) = 8. llc_sq = 64.
   with NPS=1
      top = DIE.
      top->span_weight = 128. imb = max(2, (64/128) >> 2) = 2. imb_span = 128.

   with NPS=2
      top = NODE.
      top->span_weight = 64. imb = max(2, (64/64) >> 2) = 2. imb_span = 64.

   with NPS=4      
      top = NODE.
      top->span_weight = 32. imb = max(2, (64/32) >> 2) = 2. imb_span = 32.


> +
> +				sd->imb_numa_nr = imb;
> +			} else {
> +				sd->imb_numa_nr = imb * (sd->span_weight / imb_span);
> +			}

On Zen3,
   with NPS=1
        sd=NUMA, sd->span_weight = 256. sd->imb_numa_nr = 2 * (256/128) = 4.

   with NPS=2
        sd=NUMA, sd->span_weight = 128. sd->imb_numa_nr = 2 * (128/64) = 4
	sd=NUMA, sd->span_weight = 256. sd->imb_numa_nr = 2 * (256/64) = 8

   with NPS=4
        sd=NUMA, sd->span_weight = 128. sd->imb_numa_nr = 2 * (128/32) = 8
	sd=NUMA, sd->span_weight = 256. sd->imb_numa_nr = 2 * (256/32) = 16


For Zen2, since the imb_span and imb values are the same as the
corresponding NPS=x values on Zen3, the imb_numa_nr values are the
same as well since the corresponding sd->span_weight is the same.
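
The same arithmetic as a standalone sketch (a hypothetical helper that
only mirrors the hunk above for the NUMA parents of the "top" domain,
purely for illustration):

	/*
	 * Illustrative only: imb_numa_nr for a NUMA domain spanning
	 * numa_span CPUs, given the LLC weight and the span of the
	 * "top" domain that holds the LLCs.
	 */
	static unsigned int calc_imb_numa_nr(unsigned int llc_weight,
					     unsigned int top_span,
					     unsigned int numa_span)
	{
		unsigned int imb = (llc_weight * llc_weight / top_span) >> 2;

		if (imb < 2)
			imb = 2;

		return imb * (numa_span / top_span);
	}

	/*
	 * Zen3, NPS=1: calc_imb_numa_nr(16, 128, 256) = 2 * 2 = 4
	 * Zen3, NPS=4: calc_imb_numa_nr(16,  32, 256) = 2 * 8 = 16
	 */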


If we look at the highest NUMA domain, there are two groups in all the
NPS configurations. There are the same number of LLCs in each of these
groups across the different NPS configurations (nr_llcs=8 on Zen3, 16
on Zen2) . However, the imb_numa_nr at this domain varies with the NPS
value, since we compute the imb_numa_nr value relative to the number
of "top" domains that can be fit within this NUMA domain. This is
because the size of the "top" domain varies with the NPS value. This
shows up in the benchmark results.



The numbers with stream, tbench and YCSB +
Mongodb are as follows:


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Stream with 16 threads.
built with -DSTREAM_ARRAY_SIZE=128000000, -DNTIMES=10
Zen3, 64C128T per socket, 2 sockets,
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

NPS=1
Test:     tip/sched/core                 mel-v3                    mel-v4
 Copy:    113716.62 (0.00 pct)     218961.59 (92.55 pct)     217130.07 (90.93 pct)
Scale:    110996.89 (0.00 pct)     216674.73 (95.20 pct)     220765.94 (98.89 pct)
  Add:    124504.19 (0.00 pct)     253461.32 (103.57 pct)    260273.88 (109.04 pct)
Triad:    122890.43 (0.00 pct)     247552.00 (101.44 pct)    252615.62 (105.56 pct)


NPS=2
Test:     tip/sched/core                 mel-v3                     mel-v4
 Copy:    58217.00 (0.00 pct)      204630.34 (251.49 pct)     191312.73 (228.62 pct)
Scale:    55004.76 (0.00 pct)      212142.88 (285.68 pct)     175499.15 (219.06 pct)
  Add:    63269.04 (0.00 pct)      254752.56 (302.64 pct)     203571.50 (221.75 pct)
Triad:    62178.25 (0.00 pct)      247290.80 (297.71 pct)     198988.70 (220.02 pct)

NPS=4
Test:     tip/sched/core                 mel-v3                     mel-v4
 Copy:    37986.66 (0.00 pct)      254183.87 (569.13 pct)     48748.87 (28.33 pct)
Scale:    35471.22 (0.00 pct)      237804.76 (570.41 pct)     48317.82 (36.21 pct)
  Add:    39303.25 (0.00 pct)      292285.20 (643.66 pct)     54259.59 (38.05 pct)
Triad:    39319.85 (0.00 pct)      285284.30 (625.54 pct)     54503.98 (38.61 pct)


We can see that with the v4 patch, for NPS=2 and NPS=4, the gains
start diminishing since the thresholds are higher than at NPS=1.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Stream with 16 threads.
built with -DSTREAM_ARRAY_SIZE=128000000, -DNTIMES=100
Zen3, 64C128T per socket, 2 sockets,
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

NPS=1
Test:     tip/sched/core                 mel-v3                    mel-v4
 Copy:    137362.66 (0.00 pct)     236661.65 (72.28 pct)     241148.65 (75.55 pct)
Scale:    126742.24 (0.00 pct)     214568.17 (69.29 pct)     226416.41 (78.64 pct)
  Add:    148236.33 (0.00 pct)     257114.42 (73.44 pct)     272030.50 (83.51 pct)
Triad:    146913.25 (0.00 pct)     241880.88 (64.64 pct)     259873.61 (76.88 pct)

NPS=2
Test:     tip/sched/core                 mel-v3                    mel-v4
 Copy:    107143.94 (0.00 pct)     244922.66 (128.59 pct)    198299.91 (85.07 pct)
Scale:    102004.90 (0.00 pct)     218738.55 (114.43 pct)    177890.23 (74.39 pct)
  Add:    117760.23 (0.00 pct)     270516.24 (129.71 pct)    211458.30 (79.56 pct)
Triad:    115927.92 (0.00 pct)     255985.20 (120.81 pct)    197812.60 (70.63 pct)


NPS=4
Test:     tip/sched/core                 mel-v3                    mel-v4
 Copy:    111653.17 (0.00 pct)     253912.17 (127.41 pct)     48898.34 (-56.20 pct)
Scale:    105289.35 (0.00 pct)     223710.85 (112.47 pct)     48426.03 (-54.00 pct)
  Add:    120927.64 (0.00 pct)     277701.20 (129.64 pct)     54425.48 (-54.99 pct)
Triad:    117659.97 (0.00 pct)     259473.84 (120.52 pct)     54622.82 (-53.57 pct)

With -DNTIMES=100, each of the Copy, Scale, Add and Triad kernels runs
for a longer duration. So the test takes longer (6-10 seconds), giving
the load-balancer sufficient time to place the tasks and balance them.
In this configuration we see that v4 shows some degradation on NPS=4.
This is due to the imb_numa_nr being higher compared to v3.

While Stream benefits from spreading, it is fair to understand the
gains that we make with benchmarks that would prefer the tasks
co-located instead of spread out. I chose tbench and YCSB+Mongodb as
representatives of these. The numbers are as follows:


~~~~~~~~~~~~~~~~~~~~~~~~
tbench
Zen3, 64C128T per socket, 2 sockets,
~~~~~~~~~~~~~~~~~~~~~~~~

NPS=1
Clients:     tip/sched/core                 mel-v3                  mel-v4
    1        633.25 (0.00 pct)        619.18 (-2.22 pct)      632.96 (-0.04 pct)
    2        1152.54 (0.00 pct)       1189.91 (3.24 pct)      1184.84 (2.80 pct)
    4        1946.53 (0.00 pct)       2177.45 (11.86 pct)     1979.62 (1.69 pct)
    8        3554.65 (0.00 pct)       3565.16 (0.29 pct)      3678.13 (3.47 pct)
   16        6222.00 (0.00 pct)       6484.89 (4.22 pct)      6256.02 (0.54 pct)
   32        11707.57 (0.00 pct)      12185.93 (4.08 pct)     12006.63 (2.55 pct)
   64        18433.50 (0.00 pct)      19537.03 (5.98 pct)     19088.57 (3.55 pct)
  128        27400.07 (0.00 pct)      31771.53 (15.95 pct)    27265.00 (-0.49 pct)
  256        33195.27 (0.00 pct)      24478.67 (-26.25 pct)   34065.60 (2.62 pct)
  512        41633.10 (0.00 pct)      54833.20 (31.70 pct)    46724.00 (12.22 pct)
 1024        53877.23 (0.00 pct)      56363.37 (4.61 pct)     44813.10 (-16.82 pct)


NPS=2
Clients:     tip/sched/core                 mel-v3                  mel-v4
    1        629.76 (0.00 pct)        620.94 (-1.40 pct)      629.22 (-0.08 pct)
    2        1177.01 (0.00 pct)       1203.27 (2.23 pct)      1169.12 (-0.66 pct)
    4        1990.97 (0.00 pct)       2228.18 (11.91 pct)     1888.39 (-5.15 pct)
    8        3535.45 (0.00 pct)       3620.76 (2.41 pct)      3662.72 (3.59 pct)
   16        6309.02 (0.00 pct)       6548.66 (3.79 pct)      6508.67 (3.16 pct)
   32        12038.73 (0.00 pct)      12145.97 (0.89 pct)     11411.50 (-5.21 pct)
   64        18599.67 (0.00 pct)      19448.87 (4.56 pct)     17146.07 (-7.81 pct)
  128        27861.57 (0.00 pct)      30630.53 (9.93 pct)     28217.30 (1.27 pct)
  256        28215.80 (0.00 pct)      26864.67 (-4.78 pct)    29330.47 (3.95 pct)
  512        44239.67 (0.00 pct)      52822.47 (19.40 pct)    42652.63 (-3.58 pct)
 1024        54403.53 (0.00 pct)      53905.57 (-0.91 pct)    48490.30 (-10.86 pct)



NPS=4
Clients:     tip/sched/core                 mel-v3                  mel-v4
    1        622.68 (0.00 pct)        617.87 (-0.77 pct)      667.38 (7.17 pct)
    2        1160.74 (0.00 pct)       1182.40 (1.86 pct)      1294.12 (11.49 pct)
    4        1961.29 (0.00 pct)       2172.41 (10.76 pct)     2477.76 (26.33 pct)
    8        3664.25 (0.00 pct)       3450.80 (-5.82 pct)     4067.42 (11.00 pct)
   16        6495.53 (0.00 pct)       5873.41 (-9.57 pct)     6931.66 (6.71 pct)
   32        11833.27 (0.00 pct)      12010.43 (1.49 pct)     12710.60 (7.41 pct)
   64        17723.50 (0.00 pct)      18416.23 (3.90 pct)     18793.47 (6.03 pct)
  128        27724.83 (0.00 pct)      27894.50 (0.61 pct)     27948.60 (0.80 pct)
  256        31351.70 (0.00 pct)      23944.43 (-23.62 pct)   35430.17 (13.00 pct)
  512        43383.43 (0.00 pct)      49830.63 (14.86 pct)    43877.83 (1.13 pct)
 1024        46974.27 (0.00 pct)      53583.83 (14.07 pct)    50563.23 (7.64 pct)


With NPS=4, with v4, we see no regressions with tbench compared to
tip/sched/core and there is a considerable improvement in most
cases. So, the higher imb_numa_nr helps pack the tasks, which is
beneficial to tbench.



~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
YCSB + Mongodb.

4 client instances, 256 threads per client instance.  These threads
have a very low utilization. The overall system utilization was in the
range of 16-20%.

YCSB workload type : A
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

NPS=1
               tip/sched/core   mel-v3        mel-v4
Throughput     351611.0        314981.33     329026.33
                               (-10.42 pct)  (-6.42 pct)



NPS=4
               tip/sched/core   mel-v3        mel-v4
Throughput     315808.0         316600.67     331093.67
                                (0.25 pct)    (4.84 pct)

Since, at NPS=4, imb_numa_nr is 8 and 16 at the lower and higher NUMA
domains respectively, the task spreading happens more reluctantly
compared to v3, where the imb_numa_nr was 1 in both domains.


--
Thanks and Regards
gautham.


* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-13  8:28   ` Gautham R. Shenoy
@ 2021-12-13 13:01     ` Mel Gorman
  2021-12-13 14:47       ` Gautham R. Shenoy
  0 siblings, 1 reply; 27+ messages in thread
From: Mel Gorman @ 2021-12-13 13:01 UTC (permalink / raw)
  To: Gautham R. Shenoy
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML

On Mon, Dec 13, 2021 at 01:58:03PM +0530, Gautham R. Shenoy wrote:
> > On a Zen3 machine running STREAM parallelised with OMP to have one instance
> > per LLC and without binding, the results are
> > 
> >                             5.16.0-rc1             5.16.0-rc1
> >                                vanilla       sched-numaimb-v4
> > MB/sec copy-16    166712.18 (   0.00%)   651540.22 ( 290.82%)
> > MB/sec scale-16   140109.66 (   0.00%)   382254.74 ( 172.83%)
> > MB/sec add-16     160791.18 (   0.00%)   623073.98 ( 287.51%)
> > MB/sec triad-16   160043.84 (   0.00%)   633964.52 ( 296.12%)
> 
> 
> Could you please share the size of the stream array ? These numbers
> are higher than what I am observing.
> 

512MB

> > @@ -9280,19 +9286,14 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> >  	}
> >  }
> >  
> > -#define NUMA_IMBALANCE_MIN 2
> > -
> >  static inline long adjust_numa_imbalance(int imbalance,
> > -				int dst_running, int dst_weight)
> > +				int dst_running, int dst_weight,
> > +				int imb_numa_nr)
> >  {
> >  	if (!allow_numa_imbalance(dst_running, dst_weight))
> >  		return imbalance;
> >
> 
> if (4 * dst_running >= dst_weight) we return imbalance here. The
> dst_weight here corresponds to the span of the domain, while
> dst_running is the nr_running in busiest.
> 

Yes, once dst_running is high enough, no imbalance is allowed. In
previous versions I changed this but that was a mistake and in this
version, the threshold where imbalance is not allowed remains the same.

> On Zen3, at the top most NUMA domain, the dst_weight = 256 across in
> all the configurations of Nodes Per Socket (NPS) = 1/2/4. There are
> two groups, where each group is a socket. So, unless there are at
> least 64 tasks running in one of the sockets, we would not return
> imbalance here and go to the next step.
> 

Yes

> 
> > -	/*
> > -	 * Allow a small imbalance based on a simple pair of communicating
> > -	 * tasks that remain local when the destination is lightly loaded.
> > -	 */
> > -	if (imbalance <= NUMA_IMBALANCE_MIN)
> > +	if (imbalance <= imb_numa_nr)
> 
> imb_numa_nr in NPS=1 mode, imb_numa_nr would be 4. Since NUMA domains
> don't have PREFER_SIBLING, we would be balancing the number of idle
> CPUs. We will end up doing the imbalance, as long as the difference
> between the idle CPUs is at least 8.
> 
> In NPS=2, imb_numa_nr = 8 for this topmost NUMA domain. So here, we
> will not rebalance unless the difference between the idle CPUs is 16.
> 
> In NPS=4, imb_numa_nr = 16 for this topmost NUMA domain. So, the
> threshold is now bumped up to 32.
> 
> >  		return 0;
> 
> 
> 
> >  
> >  	return imbalance;
> > @@ -9397,7 +9398,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> >  		/* Consider allowing a small imbalance between NUMA groups */
> >  		if (env->sd->flags & SD_NUMA) {
> >  			env->imbalance = adjust_numa_imbalance(env->imbalance,
> > -				busiest->sum_nr_running, env->sd->span_weight);
> > +				busiest->sum_nr_running, env->sd->span_weight,
> > +				env->sd->imb_numa_nr);
> >  		}
> >  
> >  		return;
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index d201a7052a29..bacec575ade2 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -2242,6 +2242,43 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> >  		}
> >  	}
> >  
> > +	/*
> > +	 * Calculate an allowed NUMA imbalance such that LLCs do not get
> > +	 * imbalanced.
> > +	 */
> > +	for_each_cpu(i, cpu_map) {
> > +		unsigned int imb = 0;
> > +		unsigned int imb_span = 1;
> > +
> > +		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> > +			struct sched_domain *child = sd->child;
> > +
> > +			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> > +			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
> > +				struct sched_domain *top = sd;
> 
> 
> We don't seem to be using top anywhere where sd may not be used since
> we already have variables imb and imb_span to record the
> top->imb_numa_nr and top->span_weight.
> 

Top could have been removed but we might still need it.

> 
> > +				unsigned int llc_sq;
> > +
> > +				/*
> > +				 * nr_llcs = (top->span_weight / llc_weight);
> > +				 * imb = (child_weight / nr_llcs) >> 2
> 
> child here is the llc. So can we use imb = (llc_weight / nr_llcs) >> 2.
> 

That would be clearer.

> > +				 *
> > +				 * is equivalent to
> > +				 *
> > +				 * imb = (llc_weight^2 / top->span_weight) >> 2
> > +				 *
> > +				 */
> > +				llc_sq = child->span_weight * child->span_weight;
> > +
> > +				imb = max(2U, ((llc_sq / top->span_weight) >> 2));
> > +				imb_span = sd->span_weight;
> 
> On Zen3, child_weight (or llc_weight) = 16. llc_sq = 256.
>    with NPS=1
>       top = DIE.
>       top->span_weight = 128. imb = max(2, (256/128) >> 2) = 2. imb_span = 128.
> 
>    with NPS=2
>       top = NODE.
>       top->span_weight = 64. imb = max(2, (256/64) >> 2) = 2. imb_span = 64.
> 
>    with NPS=4      
>       top = NODE.
>       top->span_weight = 32. imb = max(2, (256/32) >> 2) = 2. imb_span = 32.
> 
> On Zen2, child_weight (or llc_weight) = 8. llc_sq = 64.
>    with NPS=1
>       top = DIE.
>       top->span_weight = 128. imb = max(2, (64/128) >> 2) = 2. imb_span = 128.
> 
>    with NPS=2
>       top = NODE.
>       top->span_weight = 64. imb = max(2, (64/64) >> 2) = 2. imb_span = 64.
> 
>    with NPS=4      
>       top = NODE.
>       top->span_weight = 32. imb = max(2, (64/32) >> 2) = 2. imb_span = 32.
> 
> 
> > +
> > +				sd->imb_numa_nr = imb;
> > +			} else {
> > +				sd->imb_numa_nr = imb * (sd->span_weight / imb_span);
> > +			}
> 
> On Zen3,
>    with NPS=1
>         sd=NUMA, sd->span_weight = 256. sd->imb_numa_nr = 2 * (256/128) = 4.
> 
>    with NPS=2
>         sd=NUMA, sd->span_weight = 128. sd->imb_numa_nr = 2 * (128/64) = 4
> 	sd=NUMA, sd->span_weight = 256. sd->imb_numa_nr = 2 * (256/64) = 8
> 
>    with NPS=4
>         sd=NUMA, sd->span_weight = 128. sd->imb_numa_nr = 2 * (128/32) = 8
> 	sd=NUMA, sd->span_weight = 256. sd->imb_numa_nr = 2 * (256/32) = 16
> 
> 
> For Zen2, since the imb_span and imb values are the same as the
> corresponding NPS=x values on Zen3, the imb_numa_nr values are the
> same as well since the corresponding sd->span_weight is the same.
> 
> If we look at the highest NUMA domain, there are two groups in all the
> NPS configurations. There are the same number of LLCs in each of these
> groups across the different NPS configurations (nr_llcs=8 on Zen3, 16
> on Zen2) . However, the imb_numa_nr at this domain varies with the NPS
> value, since we compute the imb_numa_nr value relative to the number
> of "top" domains that can be fit within this NUMA domain. This is
> because the size of the "top" domain varies with the NPS value. This
> shows up in the benchmark results.
> 

This was intentional to have some scaling but based on your results, the
scaling might be at the wrong level.

> 
> 
> The numbers with stream, tbench and YCSB +
> Mongodb are as follows:
> 
> 
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Stream with 16 threads.
> built with -DSTREAM_ARRAY_SIZE=128000000, -DNTIMES=10
> Zen3, 64C128T per socket, 2 sockets,
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> NPS=1
> Test:     tip/sched/core                 mel-v3                    mel-v4
>  Copy:    113716.62 (0.00 pct)     218961.59 (92.55 pct)     217130.07 (90.93 pct)
> Scale:    110996.89 (0.00 pct)     216674.73 (95.20 pct)     220765.94 (98.89 pct)
>   Add:    124504.19 (0.00 pct)     253461.32 (103.57 pct     260273.88 (109.04 pct)
> Triad:    122890.43 (0.00 pct)     247552.00 (101.44 pct     252615.62 (105.56 pct)
> 
> 
> NPS=2
> Test:     tip/sched/core                 mel-v3                     mel-v4
>  Copy:    58217.00 (0.00 pct)      204630.34 (251.49 pct)     191312.73 (228.62 pct)
> Scale:    55004.76 (0.00 pct)      212142.88 (285.68 pct)     175499.15 (219.06 pct)
>   Add:    63269.04 (0.00 pct)      254752.56 (302.64 pct)     203571.50 (221.75 pct)
> Triad:    62178.25 (0.00 pct)      247290.80 (297.71 pct)     198988.70 (220.02 pct)
> 
> NPS=4
> Test:     tip/sched/core                 mel-v3                     mel-v4
>  Copy:    37986.66 (0.00 pct)      254183.87 (569.13 pct)     48748.87 (28.33 pct)
> Scale:    35471.22 (0.00 pct)      237804.76 (570.41 pct)     48317.82 (36.21 pct)
>   Add:    39303.25 (0.00 pct)      292285.20 (643.66 pct)     54259.59 (38.05 pct)
> Triad:    39319.85 (0.00 pct)      285284.30 (625.54 pct)     54503.98 (38.61 pct)
> 

At minimum, v3 is a failure because a single pair of communicating tasks
were getting split across NUMA domains and the allowed numa imbalance
gets cut off too early because of the change to allow_numa_imbalance.
So while it's a valid comparison, it's definitely not the fix.

Given how you describe NPS, maybe the scaling should only start at the
point where tasks are no longer balanced between sibling domains. Can
you try this? I've only boot tested it at this point. It should work for
STREAM at least but probably not great for tbench.


diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index bacec575ade2..1fa3e977521d 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2255,26 +2255,38 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 
 			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
 			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
-				struct sched_domain *top = sd;
+				struct sched_domain *top, *top_p;
 				unsigned int llc_sq;
 
 				/*
-				 * nr_llcs = (top->span_weight / llc_weight);
-				 * imb = (child_weight / nr_llcs) >> 2
+				 * nr_llcs = (sd->span_weight / llc_weight);
+				 * imb = (llc_weight / nr_llcs) >> 2
 				 *
 				 * is equivalent to
 				 *
-				 * imb = (llc_weight^2 / top->span_weight) >> 2
+				 * imb = (llc_weight^2 / sd->span_weight) >> 2
 				 *
 				 */
 				llc_sq = child->span_weight * child->span_weight;
 
-				imb = max(2U, ((llc_sq / top->span_weight) >> 2));
-				imb_span = sd->span_weight;
-
+				imb = max(2U, ((llc_sq / sd->span_weight) >> 2));
 				sd->imb_numa_nr = imb;
+
+				/*
+				 * Set span based on top domain that places
+				 * tasks in sibling domains.
+				 */
+				top = sd;
+				top_p = top->parent;
+				while (top_p && (top_p->flags & SD_PREFER_SIBLING)) {
+					top = top->parent;
+					top_p = top->parent;
+				}
+				imb_span = top_p ? top_p->span_weight : sd->span_weight;
 			} else {
-				sd->imb_numa_nr = imb * (sd->span_weight / imb_span);
+				int factor = max(1U, (sd->span_weight / imb_span));
+
+				sd->imb_numa_nr = imb * factor;
 			}
 		}
 	}


* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-13 13:01     ` Mel Gorman
@ 2021-12-13 14:47       ` Gautham R. Shenoy
  2021-12-15 11:52         ` Gautham R. Shenoy
  0 siblings, 1 reply; 27+ messages in thread
From: Gautham R. Shenoy @ 2021-12-13 14:47 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML

Hello Mel,

On Mon, Dec 13, 2021 at 01:01:31PM +0000, Mel Gorman wrote:
> On Mon, Dec 13, 2021 at 01:58:03PM +0530, Gautham R. Shenoy wrote:
> > > On a Zen3 machine running STREAM parallelised with OMP to have one instance
> > > per LLC and without binding, the results are
> > > 
> > >                             5.16.0-rc1             5.16.0-rc1
> > >                                vanilla       sched-numaimb-v4
> > > MB/sec copy-16    166712.18 (   0.00%)   651540.22 ( 290.82%)
> > > MB/sec scale-16   140109.66 (   0.00%)   382254.74 ( 172.83%)
> > > MB/sec add-16     160791.18 (   0.00%)   623073.98 ( 287.51%)
> > > MB/sec triad-16   160043.84 (   0.00%)   633964.52 ( 296.12%)
> > 
> > 
> > Could you please share the size of the stream array ? These numbers
> > are higher than what I am observing.
> > 
> 
> 512MB

Thanks, I will try with this one.

> 
> > > @@ -9280,19 +9286,14 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> > >  	}
> > >  }
> > >  
> > > -#define NUMA_IMBALANCE_MIN 2
> > > -
> > >  static inline long adjust_numa_imbalance(int imbalance,
> > > -				int dst_running, int dst_weight)
> > > +				int dst_running, int dst_weight,
> > > +				int imb_numa_nr)
> > >  {
> > >  	if (!allow_numa_imbalance(dst_running, dst_weight))
> > >  		return imbalance;
> > >
> > 
> > if (4 * dst_running >= dst_weight) we return imbalance here. The
> > dst_weight here corresponds to the span of the domain, while
> > dst_running is the nr_running in busiest.
> > 
> 
> Yes, once dst_running is high enough, no imbalance is allowed. In
> previous versions I changed this but that was a mistake and in this
> version, the threshold where imbalance is not allowed remains the same.
> 
> > On Zen3, at the top most NUMA domain, the dst_weight = 256 across in
> > all the configurations of Nodes Per Socket (NPS) = 1/2/4. There are
> > two groups, where each group is a socket. So, unless there are at
> > least 64 tasks running in one of the sockets, we would not return
> > imbalance here and go to the next step.
> > 
> 
> Yes
> 
> > 
> > > -	/*
> > > -	 * Allow a small imbalance based on a simple pair of communicating
> > > -	 * tasks that remain local when the destination is lightly loaded.
> > > -	 */
> > > -	if (imbalance <= NUMA_IMBALANCE_MIN)
> > > +	if (imbalance <= imb_numa_nr)
> > 
> > imb_numa_nr in NPS=1 mode, imb_numa_nr would be 4. Since NUMA domains
> > don't have PREFER_SIBLING, we would be balancing the number of idle
> > CPUs. We will end up doing the imbalance, as long as the difference
> > between the idle CPUs is at least 8.
> > 
> > In NPS=2, imb_numa_nr = 8 for this topmost NUMA domain. So here, we
> > will not rebalance unless the difference between the idle CPUs is 16.
> > 
> > In NPS=4, imb_numa_nr = 16 for this topmost NUMA domain. So, the
> > threshold is now bumped up to 32.
> > 
> > >  		return 0;
> > 
> > 
> > 
> > >  
> > >  	return imbalance;
> > > @@ -9397,7 +9398,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> > >  		/* Consider allowing a small imbalance between NUMA groups */
> > >  		if (env->sd->flags & SD_NUMA) {
> > >  			env->imbalance = adjust_numa_imbalance(env->imbalance,
> > > -				busiest->sum_nr_running, env->sd->span_weight);
> > > +				busiest->sum_nr_running, env->sd->span_weight,
> > > +				env->sd->imb_numa_nr);
> > >  		}
> > >  
> > >  		return;
> > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > > index d201a7052a29..bacec575ade2 100644
> > > --- a/kernel/sched/topology.c
> > > +++ b/kernel/sched/topology.c
> > > @@ -2242,6 +2242,43 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> > >  		}
> > >  	}
> > >  
> > > +	/*
> > > +	 * Calculate an allowed NUMA imbalance such that LLCs do not get
> > > +	 * imbalanced.
> > > +	 */
> > > +	for_each_cpu(i, cpu_map) {
> > > +		unsigned int imb = 0;
> > > +		unsigned int imb_span = 1;
> > > +
> > > +		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> > > +			struct sched_domain *child = sd->child;
> > > +
> > > +			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> > > +			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
> > > +				struct sched_domain *top = sd;
> > 
> > 
> > We don't seem to be using top anywhere where sd may not be used since
> > we already have variables imb and imb_span to record the
> > top->imb_numa_nr and top->span_weight.
> > 
> 
> Top could have been removed but we might still need it.
> 
> > 
> > > +				unsigned int llc_sq;
> > > +
> > > +				/*
> > > +				 * nr_llcs = (top->span_weight / llc_weight);
> > > +				 * imb = (child_weight / nr_llcs) >> 2
> > 
> > child here is the llc. So can we use imb = (llc_weight / nr_llcs) >> 2.
> > 
> 
> That is be clearer.
> 
> > > +				 *
> > > +				 * is equivalent to
> > > +				 *
> > > +				 * imb = (llc_weight^2 / top->span_weight) >> 2
> > > +				 *
> > > +				 */
> > > +				llc_sq = child->span_weight * child->span_weight;
> > > +
> > > +				imb = max(2U, ((llc_sq / top->span_weight) >> 2));
> > > +				imb_span = sd->span_weight;
> > 
> > On Zen3, child_weight (or llc_weight) = 16. llc_sq = 256.
> >    with NPS=1
> >       top = DIE.
> >       top->span_weight = 128. imb = max(2, (256/128) >> 2) = 2. imb_span = 128.
> > 
> >    with NPS=2
> >       top = NODE.
> >       top->span_weight = 64. imb = max(2, (256/64) >> 2) = 2. imb_span = 64.
> > 
> >    with NPS=4      
> >       top = NODE.
> >       top->span_weight = 32. imb = max(2, (256/32) >> 2) = 2. imb_span = 32.
> > 
> > On Zen2, child_weight (or llc_weight) = 8. llc_sq = 64.
> >    with NPS=1
> >       top = DIE.
> >       top->span_weight = 128. imb = max(2, (64/128) >> 2) = 2. imb_span = 128.
> > 
> >    with NPS=2
> >       top = NODE.
> >       top->span_weight = 64. imb = max(2, (64/64) >> 2) = 2. imb_span = 64.
> > 
> >    with NPS=4      
> >       top = NODE.
> >       top->span_weight = 32. imb = max(2, (64/32) >> 2) = 2. imb_span = 32.
> > 
> > 
> > > +
> > > +				sd->imb_numa_nr = imb;
> > > +			} else {
> > > +				sd->imb_numa_nr = imb * (sd->span_weight / imb_span);
> > > +			}
> > 
> > On Zen3,
> >    with NPS=1
> >         sd=NUMA, sd->span_weight = 256. sd->imb_numa_nr = 2 * (256/128) = 4.
> > 
> >    with NPS=2
> >         sd=NUMA, sd->span_weight = 128. sd->imb_numa_nr = 2 * (128/64) = 4
> > 	sd=NUMA, sd->span_weight = 256. sd->imb_numa_nr = 2 * (256/64) = 8
> > 
> >    with NPS=4
> >         sd=NUMA, sd->span_weight = 128. sd->imb_numa_nr = 2 * (128/32) = 8
> > 	sd=NUMA, sd->span_weight = 256. sd->imb_numa_nr = 2 * (256/32) = 16
> > 
> > 
> > For Zen2, since the imb_span and imb values are the same as the
> > corresponding NPS=x values on Zen3, the imb_numa_nr values are the
> > same as well since the corresponding sd->span_weight is the same.
> > 
> > If we look at the highest NUMA domain, there are two groups in all the
> > NPS configurations. There are the same number of LLCs in each of these
> > groups across the different NPS configurations (nr_llcs=8 on Zen3, 16
> > on Zen2) . However, the imb_numa_nr at this domain varies with the NPS
> > value, since we compute the imb_numa_nr value relative to the number
> > of "top" domains that can be fit within this NUMA domain. This is
> > because the size of the "top" domain varies with the NPS value. This
> > shows up in the benchmark results.
> > 
> 
> This was intentional to have some scaling but based on your results, the
> scaling might be at the wrong level.

Ok. 


> 
> > 
> > 
> > The numbers with stream, tbench and YCSB +
> > Mongodb are as follows:
> > 
> > 
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > Stream with 16 threads.
> > built with -DSTREAM_ARRAY_SIZE=128000000, -DNTIMES=10
> > Zen3, 64C128T per socket, 2 sockets,
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > 
> > NPS=1
> > Test:     tip/sched/core                 mel-v3                    mel-v4
> >  Copy:    113716.62 (0.00 pct)     218961.59 (92.55 pct)     217130.07 (90.93 pct)
> > Scale:    110996.89 (0.00 pct)     216674.73 (95.20 pct)     220765.94 (98.89 pct)
> >   Add:    124504.19 (0.00 pct)     253461.32 (103.57 pct     260273.88 (109.04 pct)
> > Triad:    122890.43 (0.00 pct)     247552.00 (101.44 pct     252615.62 (105.56 pct)
> > 
> > 
> > NPS=2
> > Test:     tip/sched/core                 mel-v3                     mel-v4
> >  Copy:    58217.00 (0.00 pct)      204630.34 (251.49 pct)     191312.73 (228.62 pct)
> > Scale:    55004.76 (0.00 pct)      212142.88 (285.68 pct)     175499.15 (219.06 pct)
> >   Add:    63269.04 (0.00 pct)      254752.56 (302.64 pct)     203571.50 (221.75 pct)
> > Triad:    62178.25 (0.00 pct)      247290.80 (297.71 pct)     198988.70 (220.02 pct)
> > 
> > NPS=4
> > Test:     tip/sched/core                 mel-v3                     mel-v4
> >  Copy:    37986.66 (0.00 pct)      254183.87 (569.13 pct)     48748.87 (28.33 pct)
> > Scale:    35471.22 (0.00 pct)      237804.76 (570.41 pct)     48317.82 (36.21 pct)
> >   Add:    39303.25 (0.00 pct)      292285.20 (643.66 pct)     54259.59 (38.05 pct)
> > Triad:    39319.85 (0.00 pct)      285284.30 (625.54 pct)     54503.98 (38.61 pct)
> > 
> 
> At minimum, v3 is a failure because a single pair of communicating tasks
> were getting split across NUMA domains and the allowed numa imbalance
> gets cut off too early because of the change to allow_numa_imbalance.
> So while it's a valid comparison, it's definitely not the fix.

v3 is definitely not a fix. I wasn't hinting at that. It was just to
point out the opportunity that we have.


> 
> Given how you describe NPS, maybe the scaling should only start at the
> point where tasks are no longer balanced between sibling domains. Can
> you try this? I've only boot tested it at this point. It should work for
> STREAM at least but probably not great for tbench.

Thanks for the patch. I will queue this one for tonight.


> 
> 
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index bacec575ade2..1fa3e977521d 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2255,26 +2255,38 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>  
>  			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
>  			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
> -				struct sched_domain *top = sd;
> +				struct sched_domain *top, *top_p;
>  				unsigned int llc_sq;
>  
>  				/*
> -				 * nr_llcs = (top->span_weight / llc_weight);
> -				 * imb = (child_weight / nr_llcs) >> 2
> +				 * nr_llcs = (sd->span_weight / llc_weight);
> +				 * imb = (llc_weight / nr_llcs) >> 2
>  				 *
>  				 * is equivalent to
>  				 *
> -				 * imb = (llc_weight^2 / top->span_weight) >> 2
> +				 * imb = (llc_weight^2 / sd->span_weight) >> 2
>  				 *
>  				 */
>  				llc_sq = child->span_weight * child->span_weight;
>  
> -				imb = max(2U, ((llc_sq / top->span_weight) >> 2));
> -				imb_span = sd->span_weight;
> -
> +				imb = max(2U, ((llc_sq / sd->span_weight) >> 2));
>  				sd->imb_numa_nr = imb;
> +
> +				/*
> +				 * Set span based on top domain that places
> +				 * tasks in sibling domains.
> +				 */
> +				top = sd;
> +				top_p = top->parent;
> +				while (top_p && (top_p->flags & SD_PREFER_SIBLING)) {
> +					top = top->parent;
> +					top_p = top->parent;
> +				}
> +				imb_span = top_p ? top_p->span_weight : sd->span_weight;
>  			} else {
> -				sd->imb_numa_nr = imb * (sd->span_weight / imb_span);
> +				int factor = max(1U, (sd->span_weight / imb_span));
> +
> +				sd->imb_numa_nr = imb * factor;
>  			}
>  		}
>  	}

--
Thanks and Regards
gautham.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-13 14:47       ` Gautham R. Shenoy
@ 2021-12-15 11:52         ` Gautham R. Shenoy
  2021-12-15 12:25           ` Mel Gorman
  0 siblings, 1 reply; 27+ messages in thread
From: Gautham R. Shenoy @ 2021-12-15 11:52 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML

Hello Mel,


On Mon, Dec 13, 2021 at 08:17:37PM +0530, Gautham R. Shenoy wrote:

> 
> Thanks for the patch. I will queue this one for tonight.
>

Getting the numbers took a bit longer than I expected.

> 
> > 
> > 
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index bacec575ade2..1fa3e977521d 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -2255,26 +2255,38 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> >  
> >  			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> >  			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
> > -				struct sched_domain *top = sd;
> > +				struct sched_domain *top, *top_p;
> >  				unsigned int llc_sq;
> >  
> >  				/*
> > -				 * nr_llcs = (top->span_weight / llc_weight);
> > -				 * imb = (child_weight / nr_llcs) >> 2
> > +				 * nr_llcs = (sd->span_weight / llc_weight);
> > +				 * imb = (llc_weight / nr_llcs) >> 2
> >  				 *
> >  				 * is equivalent to
> >  				 *
> > -				 * imb = (llc_weight^2 / top->span_weight) >> 2
> > +				 * imb = (llc_weight^2 / sd->span_weight) >> 2
> >  				 *
> >  				 */
> >  				llc_sq = child->span_weight * child->span_weight;
> >  
> > -				imb = max(2U, ((llc_sq / top->span_weight) >> 2));
> > -				imb_span = sd->span_weight;
> > -
> > +				imb = max(2U, ((llc_sq / sd->span_weight) >> 2));
> >  				sd->imb_numa_nr = imb;
> > +
> > +				/*
> > +				 * Set span based on top domain that places
> > +				 * tasks in sibling domains.
> > +				 */
> > +				top = sd;
> > +				top_p = top->parent;
> > +				while (top_p && (top_p->flags & SD_PREFER_SIBLING)) {
> > +					top = top->parent;
> > +					top_p = top->parent;
> > +				}
> > +				imb_span = top_p ? top_p->span_weight : sd->span_weight;
> >  			} else {
> > -				sd->imb_numa_nr = imb * (sd->span_weight / imb_span);
> > +				int factor = max(1U, (sd->span_weight / imb_span));
> > +


So for the first NUMA domain, the sd->imb_numa_nr will be imb, which
turns out to be 2 for Zen2 and Zen3 processors across all Nodes Per
Socket (NPS) settings.

On a 2 Socket Zen3:

NPS=1
   child=MC, llc_weight=16, sd=DIE. sd->span_weight=128 imb=max(2U, (16*16/128) / 4)=2
   top_p = NUMA, imb_span = 256.

   NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/256) = 2

NPS=2
   child=MC, llc_weight=16, sd=NODE. sd->span_weight=64 imb=max(2U, (16*16/64) / 4) = 2
   top_p = NUMA, imb_span = 128.

   NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2
   NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4

NPS=4:
   child=MC, llc_weight=16, sd=NODE. sd->span_weight=32 imb=max(2U, (16*16/32) / 4) = 2
   top_p = NUMA, imb_span = 128.

   NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2
   NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4

Again, we will be more aggressively load balancing across the two
sockets in NPS=1 mode compared to NPS=2/4.
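
For reference, the following userspace sketch just replays the v4.1
arithmetic above for the three NPS modes, assuming the Zen3 numbers
quoted here (LLC weight 16, 256 CPUs in total). It is not kernel code,
only a way to sanity-check the imb_numa_nr values:

	#include <stdio.h>

	static unsigned int max_u(unsigned int a, unsigned int b)
	{
		return a > b ? a : b;
	}

	static void show(const char *nps, unsigned int llc_weight,
			 unsigned int sd_span, unsigned int imb_span,
			 const unsigned int *numa_spans, int nr_numa)
	{
		/* imb = (llc_weight^2 / sd->span_weight) >> 2, floored at 2 */
		unsigned int imb = max_u(2U, (llc_weight * llc_weight / sd_span) >> 2);
		int i;

		printf("%s: imb=%u\n", nps, imb);
		for (i = 0; i < nr_numa; i++) {
			/* sd->imb_numa_nr = imb * (sd->span_weight / imb_span) */
			unsigned int factor = max_u(1U, numa_spans[i] / imb_span);

			printf("  NUMA span=%3u -> imb_numa_nr=%u\n",
			       numa_spans[i], imb * factor);
		}
	}

	int main(void)
	{
		const unsigned int nps1[] = { 256 };
		const unsigned int nps24[] = { 128, 256 };

		show("NPS=1", 16, 128, 256, nps1, 1);	/* DIE spans the socket */
		show("NPS=2", 16,  64, 128, nps24, 2);	/* NODE is half a socket */
		show("NPS=4", 16,  32, 128, nps24, 2);	/* NODE is a quarter */
		return 0;
	}

This prints 2 for the socket-spanning NUMA domain on NPS=1, and 2/4
for the two NUMA domains on NPS=2 and NPS=4, matching the values
worked out above.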

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Stream with 16 threads.
built with -DSTREAM_ARRAY_SIZE=128000000, -DNTIMES=10
Zen3, 64C128T per socket, 2 sockets,
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

NPS=1
Test:	 tip-core	mel-v3		mel-v4		mel-v4.1
 Copy:	 113666.77      214885.89 	212162.63 	226065.79
 	 (0.00%)	(89.04%)	(86.65%)	(98.88%)
	 
Scale:	 110962.00	215475.05	220002.56	227610.64
	 (0.00%)	(94.18%)	(98.26%)	(105.12%)
	 
  Add:	 124440.44	250338.11	258569.94	271368.51
  	 (0.00%)	(101.17%)	(107.78%)	(118.07%)
	 
Triad:	 122836.42 	244993.16	251075.19	265465.57
	 (0.00%)	(99.44%)	(104.39%)	(116.11%)


NPS=2
Test:	 tip-core	mel-v3		mel-v4		mel-v4.1
 Copy:	 58193.29	203242.78	191171.22	212785.96
 	 (0.00%)	(249.25%) 	(228.51%)	(265.65%)
	 
Scale:	 54980.48	209056.28 	175497.10 	223452.99
	 (0.00%)	(280.23%)	(219.19%)	(306.42%)
	 
  Add:	 63244.24	249610.19	203569.35	270487.22
  	 (0.00%)	(294.67%)	(221.87%)	(327.68%)
	 
Triad:	 62166.77 	242688.72	198971.26	262757.16
	 (0.00%)	(290.38%)	(220.06%)	(322.66%)


NPS=4
Test:	 tip-core	mel-v3		mel-v4		mel-v4.1
 Copy:	 37762.14	253412.82	48748.60	164765.83
 	 (0.00%)	(571.07%)	(29.09%)	(336.32%)
	 
Scale:	 33879.66	237736.93	48317.66	159641.67
	 (0.00%)	(601.70%)	(42.61%)	(371.20%)
	 
  Add:	 38398.90	292186.56	54259.56	184583.70
  	 (0.00%)	(660.92%)	(41.30%)	(380.70%)
	 
Triad:	 37942.38	285124.76	54503.74	181250.80
	 (0.00%)	(651.46%)	(43.64%)	(377.70%)
	 


So while in NPS=1 and NPS=2 we are able to recover with v4.1 the
performance that was obtained with v3, in NPS=4 we are not able to
recover as much.

However, this is not only because in v4 the imb_numa_nr thresholds
for NPS=4 were (1,1) for the two NUMA domains while in v4.1 the
imb_numa_nr was (2,4), but also because in v3 we used the imb_numa_nr
thresholds in allow_numa_imbalance() while in v4 and v4.1 we are
using them in adjust_numa_imbalance().

The distinction is that in v3, we will trigger load balance as soon as
the number of tasks in the busiest group in the NUMA domain is greater
than or equal to imb_numa_nr at that domain.

In v4 and v4.1, we will trigger the load-balance if

 - the number of tasks in the busiest group is greater than 1/4th the
   CPUs in the NUMA domain.

OR

- the difference between the idle CPUs in the busiest and this group
  is greater than imb_numa_nr.
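
To make the distinction concrete, here is a toy sketch of the two
policies as described above (the helper names and the sample numbers
are made up for illustration; this is a paraphrase, not the actual
kernel code):

	#include <stdio.h>

	/* v3: spread once the busiest group holds imb_numa_nr tasks */
	int v3_triggers_balance(int busiest_nr_running, int imb_numa_nr)
	{
		return busiest_nr_running >= imb_numa_nr;
	}

	/*
	 * v4/v4.1: spread if the busiest group uses more than a quarter
	 * of the CPUs in the NUMA domain, or if the idle-CPU gap between
	 * this group and the busiest one exceeds imb_numa_nr.
	 */
	int v4_triggers_balance(int busiest_nr_running, int numa_weight,
				int local_idle, int busiest_idle,
				int imb_numa_nr)
	{
		if (busiest_nr_running > (numa_weight >> 2))
			return 1;
		return (local_idle - busiest_idle) > imb_numa_nr;
	}

	int main(void)
	{
		/*
		 * Hypothetical snapshot: 4 tasks in the busiest group of
		 * a 128-CPU NUMA domain, imb_numa_nr = 2.
		 */
		printf("v3 balances: %d\n", v3_triggers_balance(4, 2));
		printf("v4 balances: %d\n",
		       v4_triggers_balance(4, 128, 30, 28, 2));
		return 0;
	}

With those hypothetical numbers v3 already spreads while v4/v4.1 does
not, which is the behavioural difference being described here.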


If we retain the (2,4) thresholds from v4.1 but use them in
allow_numa_imbalance() as in v3 we get

NPS=4
Test:	 mel-v4.2
 Copy:	 225860.12 (498.11%)
Scale:	 227869.07 (572.58%)
  Add:	 278365.58 (624.93%)
Triad:	 264315.44 (596.62%)

It shouldn't have made so much of a difference considering the fact
that with 16 stream tasks, we should have hit "imbalance >
imb_numa_nr" in adjust_numa_imbalance() eventually. But these are the
numbers!


The trend is similar with -DNTIMES=100. Even in this case, with v4.2
we can recover the stream performance in the NPS=4 case.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Stream with 16 threads.
built with -DSTREAM_ARRAY_SIZE=128000000, -DNTIMES=100
Zen3, 64C128T per socket, 2 sockets,
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NPS=1
Test:	tip-core   mel-v3     mel-v4	mel-v4.1
 Copy:	137281.89  235886.13  240588.80  249005.99
 	(0.00 %)   (71.82%)   (75.25%)	 (81.38%)

Scale:	126559.48  213667.75  226411.39  228523.78
	(0.00 %)   (68.82%)   (78.89%)   (80.56%)

  Add:	148144.23  255780.73  272027.93  272872.49
	(0.00 %)   (72.65%)   (83.62%)	 (84.19%)

Triad:	146829.78  240400.29  259854.84  265640.70
	(0.00 %)   (63.72%)   (76.97%)	 (80.91%)


NPS=2
Test:	tip-core   mel-v3     mel-v4	mel-v4.1
 Copy:	 105873.29 243513.40  198299.43 258266.99
 	 (0.00%)   (130.00%)  (87.29%)	(143.93%)
	 
Scale:	 100510.54 217361.59  177890.00 231509.69
	 (0.00%)   (116.25%)  (76.98%)	(130.33%)
	 
  Add:	 115932.61 268488.03  211436.57	288396.07
  	 (0.00%)   (131.58%)  (82.37%)	(148.76%)
	 
Triad:	 113349.09 253865.68  197810.83	272219.89
	 (0.00%)   (123.96%)  (74.51%)	(140.16%)


NPS=4
Test:	tip-core   mel-v3     mel-v4	mel-v4.1   mel-v4.2
 Copy:	106798.80  251503.78  48898.24	171769.82  266788.03
 	(0.00%)	   (135.49%)  (-54.21%)	(60.83%)   (149.80%)
	
Scale:	101377.71  221135.30  48425.90	160748.21  232480.83
	(0.00%)	   (118.13%)  (-52.23%)	(58.56%)   (129.32%)
	
  Add:	116131.56  275008.74  54425.29  186387.91  290787.31
  	(0.00%)	   (136.80%)  (-53.13%)	(60.49%)   (150.39%)
	
Triad:	113443.70  256027.20  54622.68  180936.47  277456.83
	(0.00%)	   (125.68%)  (-51.85%)	(59.49%)   (144.57%)



~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
tbench
Zen3, 64C128T per socket, 2 sockets,
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NPS=1
======
Clients: tip-core   mel-v3    mel-v4    mel-v4.1
    1	 633.19     619.16    632.94    619.27
    	 (0.00%)    (-2.21%)  (-0.03%)	(-2.19%)
	 
    2	 1152.48    1189.88   1184.82   1189.19
    	 (0.00%)    (3.24%)   (2.80%)	(3.18%)
	 
    4	 1946.46    2177.40   1979.56	2196.09
    	 (0.00%)    (11.86%)  (1.70%)	(12.82%)
	 
    8	 3553.29    3564.50   3678.07	3668.77
    	 (0.00%)    (0.31%)   (3.51%)	(3.24%)
	 
   16	 6217.03    6484.58   6249.29	6534.73
   	 (0.00%)    (4.30%)   (0.51%)	(5.11%)
	 
   32	 11702.59   12185.77  12005.99	11917.57
   	 (0.00%)    (4.12%)   (2.59%)	(1.83%)
	 
   64	 18394.56   19535.11  19080.19	19500.55
   	 (0.00%)    (6.20%)   (3.72%)	(6.01%)
	 
  128	 27231.02   31759.92  27200.52	30358.99
  	 (0.00%)    (16.63%)  (-0.11%)	(11.48%)
	 
  256	 33166.10   24474.30  31639.98	24788.12
  	 (0.00%)    (-26.20%) (-4.60%)	(-25.26%)
	 
  512	 41605.44   54823.57  46684.48	54559.02
  	 (0.00%)    (31.77%)  (12.20%)	(31.13%)
	 
 1024	 53650.54   56329.39  44422.99	56320.66
 	 (0.00%)    (4.99%)   (-17.19%)	(4.97%) 


We see that v4.1 performs better than v4 in most cases except when
the number of clients is 256, where the spread strategy seems to be
hurting, as we see degradation in both v3 and v4.1. This is true even
for the NPS=2 and NPS=4 cases (see below).

NPS=2
=====
Clients: tip-core   mel-v3    mel-v4    mel-v4.1
    1	 629.76	    620.91    629.11	631.95
    	 (0.00%)    (-1.40%)  (-0.10%)	(0.34%)
	 
    2	 1176.96    1203.12   1169.09	1186.74
    	 (0.00%)    (2.22%)   (-0.66%)	(0.83%)
	 
    4	 1990.97    2228.04   1888.19	1995.21
    	 (0.00%)    (11.90%)  (-5.16%)	(0.21%)
	 
    8	 3534.57    3617.16   3660.30	3548.09
    	 (0.00%)    (2.33%)   (3.55%)	(0.38%)
	 
   16	 6294.71    6547.80   6504.13	6470.34
   	 (0.00%)    (4.02%)   (3.32%)	(2.79%)
	 
   32	 12035.73   12143.03  11396.26	11860.91
   	 (0.00%)    (0.89%)   (-5.31%)	(-1.45%)
	 
   64	 18583.39   19439.12  17126.47	18799.54
   	 (0.00%)    (4.60%)   (-7.83%)	(1.16%)
	 
  128	 27811.89   30562.84  28090.29	27468.94
  	 (0.00%)    (9.89%)   (1.00%)	(-1.23%)
	 
  256	 28148.95   26488.57  29117.13	23628.29
  	 (0.00%)    (-5.89%)  (3.43%)	(-16.05%)
	 
  512	 43934.15   52796.38  42603.49	41725.75
  	 (0.00%)    (20.17%)  (-3.02%)	(-5.02%)
	 
 1024	 54391.65   53891.83  48419.09	43913.40
 	 (0.00%)    (-0.91%)  (-10.98%)	(-19.26%)

In this case, v4.1 performs as well as v4 up to 64 clients. But after
that we see degradation. The degradation is significant in the 1024
clients case.

NPS=4
=====
Clients: tip-core   mel-v3    mel-v4    mel-v4.1    mel-v4.2
    1	 622.65	    617.83    667.34	644.76	    617.58
    	 (0.00%)    (-0.77%)  (7.17%)	(3.55%)	    (-0.81%)
	 
    2	 1160.62    1182.30   1294.08	1193.88	    1182.55
    	 (0.00%)    (1.86%)   (11.49%)	(2.86%)	    (1.88%)
	 
    4	 1961.14    2171.91   2477.71	1929.56	    2116.01
    	 (0.00%)    (10.74%)  (26.34%)	(-1.61%)    (7.89%)
	 
    8	 3662.94    3447.98   4067.40	3627.43	    3580.32
    	 (0.00%)    (-5.86%)  (11.04%)	(-0.96%)    (-2.25%)
	 
   16	 6490.92    5871.93   6924.32	6660.13	    6413.34
   	 (0.00%)    (-9.53%)  (6.67%)	(2.60%)	    (-1.19%)
	 
   32	 11831.81   12004.30  12709.06	12187.78    11767.46
   	 (0.00%)    (1.45%)   (7.41%)	(3.00%)	    (-0.54%)
	 
   64	 17717.36   18406.79  18785.41	18820.33    18197.86
   	 (0.00%)    (3.89%)   (6.02%)	(6.22%)	    (2.71%)
	 
  128	 27723.35   27777.34  27939.63	27399.64    24310.93
  	 (0.00%)    (0.19%)   (0.78%)	(-1.16%)    (-12.30%)
	 
  256	 30919.69   23937.03  35412.26	26780.37    24642.24
  	 (0.00%)    (-22.58%) (14.52%)	(-13.38%)   (-20.30%)
	 
  512	 43366.03   49570.65  43830.84	43654.42    41031.90
  	 (0.00%)    (14.30%)  (1.07%)	(0.66%)	    (-5.38%)
	 
 1024	 46960.83   53576.16  50557.19	43743.07    40884.98
 	 (0.00%)    (14.08%)  (7.65%)	(-6.85%)    (-12.93%)


In the NPS=4 case, clearly v4 provides the best results.

v4.1 does better than v4.2 since it is able to hold off spreading for
a longer period.

--
Thanks and Regards
gautham.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-15 11:52         ` Gautham R. Shenoy
@ 2021-12-15 12:25           ` Mel Gorman
  2021-12-16 18:33             ` Gautham R. Shenoy
  0 siblings, 1 reply; 27+ messages in thread
From: Mel Gorman @ 2021-12-15 12:25 UTC (permalink / raw)
  To: Gautham R. Shenoy
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML

On Wed, Dec 15, 2021 at 05:22:30PM +0530, Gautham R. Shenoy wrote:
> Hello Mel,
> 
> 
> On Mon, Dec 13, 2021 at 08:17:37PM +0530, Gautham R. Shenoy wrote:
> 
> > 
> > Thanks for the patch. I will queue this one for tonight.
> >
> 
> Getting the numbers took a bit longer than I expected.
> 

No worries.

> > > <SNIP>
> > > +				/*
> > > +				 * Set span based on top domain that places
> > > +				 * tasks in sibling domains.
> > > +				 */
> > > +				top = sd;
> > > +				top_p = top->parent;
> > > +				while (top_p && (top_p->flags & SD_PREFER_SIBLING)) {
> > > +					top = top->parent;
> > > +					top_p = top->parent;
> > > +				}
> > > +				imb_span = top_p ? top_p->span_weight : sd->span_weight;
> > >  			} else {
> > > -				sd->imb_numa_nr = imb * (sd->span_weight / imb_span);
> > > +				int factor = max(1U, (sd->span_weight / imb_span));
> > > +
> 
> 
> So for the first NUMA domain, the sd->imb_numa_nr will be imb, which
> turns out to be 2 for Zen2 and Zen3 processors across all Nodes Per Socket Settings.
> 
> On a 2 Socket Zen3:
> 
> NPS=1
>    child=MC, llc_weight=16, sd=DIE. sd->span_weight=128 imb=max(2U, (16*16/128) / 4)=2
>    top_p = NUMA, imb_span = 256.
> 
>    NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/256) = 2
> 
> NPS=2
>    child=MC, llc_weight=16, sd=NODE. sd->span_weight=64 imb=max(2U, (16*16/64) / 4) = 2
>    top_p = NUMA, imb_span = 128.
> 
>    NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2
>    NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4
> 
> NPS=4:
>    child=MC, llc_weight=16, sd=NODE. sd->span_weight=32 imb=max(2U, (16*16/32) / 4) = 2
>    top_p = NUMA, imb_span = 128.
> 
>    NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2
>    NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4
> 
> Again, we will be more aggressively load balancing across the two
> sockets in NPS=1 mode compared to NPS=2/4.
> 

Yes, but I felt it was reasonable behaviour because we have to strike
some sort of balance, allowing a NUMA imbalance up to a point to
prevent communicating tasks being pulled apart, and v3 broke that
completely. There will always be a tradeoff between tasks that want to
remain local to each other and others that prefer to spread as wide as
possible as quickly as possible.

> <SNIP>
> If we retain the (2,4) thresholds from v4.1 but use them in
> allow_numa_imbalance() as in v3 we get
> 
> NPS=4
> Test:	 mel-v4.2
>  Copy:	 225860.12 (498.11%)
> Scale:	 227869.07 (572.58%)
>   Add:	 278365.58 (624.93%)
> Triad:	 264315.44 (596.62%)
> 

The potential problem with this is that it probably will work for
netperf when it's a single communicating pair but may not work as well
when there are multiple communicating pairs or a number of
communicating tasks that exceeds imb_numa_nr.

> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> NPS=1
> ======
> Clients: tip-core   mel-v3    mel-v4    mel-v4.1
>     1	 633.19     619.16    632.94    619.27
>     	 (0.00%)    (-2.21%)  (-0.03%)	(-2.19%)
> 	 
>     2	 1152.48    1189.88   1184.82   1189.19
>     	 (0.00%)    (3.24%)   (2.80%)	(3.18%)
> 	 
>     4	 1946.46    2177.40   1979.56	2196.09
>     	 (0.00%)    (11.86%)  (1.70%)	(12.82%)
> 	 
>     8	 3553.29    3564.50   3678.07	3668.77
>     	 (0.00%)    (0.31%)   (3.51%)	(3.24%)
> 	 
>    16	 6217.03    6484.58   6249.29	6534.73
>    	 (0.00%)    (4.30%)   (0.51%)	(5.11%)
> 	 
>    32	 11702.59   12185.77  12005.99	11917.57
>    	 (0.00%)    (4.12%)   (2.59%)	(1.83%)
> 	 
>    64	 18394.56   19535.11  19080.19	19500.55
>    	 (0.00%)    (6.20%)   (3.72%)	(6.01%)
> 	 
>   128	 27231.02   31759.92  27200.52	30358.99
>   	 (0.00%)    (16.63%)  (-0.11%)	(11.48%)
> 	 
>   256	 33166.10   24474.30  31639.98	24788.12
>   	 (0.00%)    (-26.20%) (-4.60%)	(-25.26%)
> 	 
>   512	 41605.44   54823.57  46684.48	54559.02
>   	 (0.00%)    (31.77%)  (12.20%)	(31.13%)
> 	 
>  1024	 53650.54   56329.39  44422.99	56320.66
>  	 (0.00%)    (4.99%)   (-17.19%)	(4.97%) 
> 
> 
> We see that the v4.1 performs better than v4 in most cases except when
> the number of clients=256 where the spread strategy seems to be
> hurting as we see degradation in both v3 and v4.1. This is true even
> for NPS=2 and NPS=4 cases (see below).
> 

The 256 client case is a bit of a crapshoot. At that point, the NUMA
imbalancing is disabled and the machine is overloaded.

> NPS=2
> =====
> Clients: tip-core   mel-v3    mel-v4    mel-v4.1
>     1	 629.76	    620.91    629.11	631.95
>     	 (0.00%)    (-1.40%)  (-0.10%)	(0.34%)
> 	 
>     2	 1176.96    1203.12   1169.09	1186.74
>     	 (0.00%)    (2.22%)   (-0.66%)	(0.83%)
> 	 
>     4	 1990.97    2228.04   1888.19	1995.21
>     	 (0.00%)    (11.90%)  (-5.16%)	(0.21%)
> 	 
>     8	 3534.57    3617.16   3660.30	3548.09
>     	 (0.00%)    (2.33%)   (3.55%)	(0.38%)
> 	 
>    16	 6294.71    6547.80   6504.13	6470.34
>    	 (0.00%)    (4.02%)   (3.32%)	(2.79%)
> 	 
>    32	 12035.73   12143.03  11396.26	11860.91
>    	 (0.00%)    (0.89%)   (-5.31%)	(-1.45%)
> 	 
>    64	 18583.39   19439.12  17126.47	18799.54
>    	 (0.00%)    (4.60%)   (-7.83%)	(1.16%)
> 	 
>   128	 27811.89   30562.84  28090.29	27468.94
>   	 (0.00%)    (9.89%)   (1.00%)	(-1.23%)
> 	 
>   256	 28148.95   26488.57  29117.13	23628.29
>   	 (0.00%)    (-5.89%)  (3.43%)	(-16.05%)
> 	 
>   512	 43934.15   52796.38  42603.49	41725.75
>   	 (0.00%)    (20.17%)  (-3.02%)	(-5.02%)
> 	 
>  1024	 54391.65   53891.83  48419.09	43913.40
>  	 (0.00%)    (-0.91%)  (-10.98%)	(-19.26%)
> 
> In this case, v4.1 performs as good as v4 upto 64 clients. But after
> that we see degradation. The degradation is significant in 1024
> clients case.
> 

Kinda the same, it's more likely to be run-to-run variance because the
machine is overloaded.

> NPS=4
> =====
> Clients: tip-core   mel-v3    mel-v4    mel-v4.1    mel-v4.2
>     1	 622.65	    617.83    667.34	644.76	    617.58
>     	 (0.00%)    (-0.77%)  (7.17%)	(3.55%)	    (-0.81%)
> 	 
>     2	 1160.62    1182.30   1294.08	1193.88	    1182.55
>     	 (0.00%)    (1.86%)   (11.49%)	(2.86%)	    (1.88%)
> 	 
>     4	 1961.14    2171.91   2477.71	1929.56	    2116.01
>     	 (0.00%)    (10.74%)  (26.34%)	(-1.61%)    (7.89%)
> 	 
>     8	 3662.94    3447.98   4067.40	3627.43	    3580.32
>     	 (0.00%)    (-5.86%)  (11.04%)	(-0.96%)    (-2.25%)
> 	 
>    16	 6490.92    5871.93   6924.32	6660.13	    6413.34
>    	 (0.00%)    (-9.53%)  (6.67%)	(2.60%)	    (-1.19%)
> 	 
>    32	 11831.81   12004.30  12709.06	12187.78    11767.46
>    	 (0.00%)    (1.45%)   (7.41%)	(3.00%)	    (-0.54%)
> 	 
>    64	 17717.36   18406.79  18785.41	18820.33    18197.86
>    	 (0.00%)    (3.89%)   (6.02%)	(6.22%)	    (2.71%)
> 	 
>   128	 27723.35   27777.34  27939.63	27399.64    24310.93
>   	 (0.00%)    (0.19%)   (0.78%)	(-1.16%)    (-12.30%)
> 	 
>   256	 30919.69   23937.03  35412.26	26780.37    24642.24
>   	 (0.00%)    (-22.58%) (14.52%)	(-13.38%)   (-20.30%)
> 	 
>   512	 43366.03   49570.65  43830.84	43654.42    41031.90
>   	 (0.00%)    (14.30%)  (1.07%)	(0.66%)	    (-5.38%)
> 	 
>  1024	 46960.83   53576.16  50557.19	43743.07    40884.98
>  	 (0.00%)    (14.08%)  (7.65%)	(-6.85%)    (-12.93%)
> 
> 
> In the NPS=4 case, clearly v4 provides the best results.
> 
> v4.1 does better v4.2 since it is able to hold off spreading for a
> longer period compared to v4.2.
> 

Most likely because v4.2 is disabling the allowed NUMA imbalance too
soon. This is the trade-off between favouring communicating tasks and
embarrassingly parallel problems.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-15 12:25           ` Mel Gorman
@ 2021-12-16 18:33             ` Gautham R. Shenoy
  2021-12-20 11:12               ` Mel Gorman
  0 siblings, 1 reply; 27+ messages in thread
From: Gautham R. Shenoy @ 2021-12-16 18:33 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML

Hello Mel,

On Wed, Dec 15, 2021 at 12:25:50PM +0000, Mel Gorman wrote:
> On Wed, Dec 15, 2021 at 05:22:30PM +0530, Gautham R. Shenoy wrote:

[..SNIP..]

> > On a 2 Socket Zen3:
> > 
> > NPS=1
> >    child=MC, llc_weight=16, sd=DIE. sd->span_weight=128 imb=max(2U, (16*16/128) / 4)=2
> >    top_p = NUMA, imb_span = 256.
> > 
> >    NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/256) = 2
> > 
> > NPS=2
> >    child=MC, llc_weight=16, sd=NODE. sd->span_weight=64 imb=max(2U, (16*16/64) / 4) = 2
> >    top_p = NUMA, imb_span = 128.
> > 
> >    NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2
> >    NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4
> > 
> > NPS=4:
> >    child=MC, llc_weight=16, sd=NODE. sd->span_weight=32 imb=max(2U, (16*16/32) / 4) = 2
> >    top_p = NUMA, imb_span = 128.
> > 
> >    NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2
> >    NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4
> > 
> > Again, we will be more aggressively load balancing across the two
> > sockets in NPS=1 mode compared to NPS=2/4.
> > 
> 
> Yes, but I felt it was reasonable behaviour because we have to strike
> some sort of balance between allowing a NUMA imbalance up to a point
> to prevent communicating tasks being pulled apart and v3 broke that
> completely. There will always be a tradeoff between tasks that want to
> remain local to each other and others that prefer to spread as wide as
> possible as quickly as possible.

I agree with this argument that we want to be conservative while
pulling tasks across NUMA domains. My point was that the threshold at
the NUMA domain that spans the 2 sockets is lower for NPS=1
(imb_numa_nr = 2) when compared to the threshold for the same NUMA
domain when NPS=2/4 (imb_numa_nr = 4).

Irrespective of what NPS mode we are operating in, the NUMA distance
between the two sockets is 32 on Zen3 systems. Hence shouldn't the
thresholds be the same for that level of NUMA? 

Would something like the following work ?

if (sd->flags & SD_NUMA) {

   /* We are using the child as a proxy for the group. */
   group_span = sd->child->span_weight;
   sd_distance = /* NUMA distance at this sd level */

   /* By default we set the threshold to 1/4th the sched-group span. */
   imb_numa_shift = 2;

   /*
    * We can be a little aggressive if the cost of migrating tasks
    * across groups of this NUMA level is not high.
    * Assuming that a distance below REMOTE_DISTANCE implies a low
    * migration cost.
    */
   
   if (sd_distance < REMOTE_DISTANCE)
      imb_numa_shift++;

   /*
    * Compute the number of LLCs in each group.
    * More the LLCs, more aggressively we migrate across
    * the groups at this NUMA sd.
    */
    nr_llcs = group_span/llc_size;

    sd->imb_numa_nr = max(2U, (group_span / nr_llcs) >> imb_numa_shift);
}

With this, on Intel platforms, we will get sd->imb_numa_nr = (span of socket)/4

On Zen3,

NPS=1, Inter-socket NUMA : sd->imb_numa_nr = max(2U, (128/8) >> 2) = 4

NPS=2, Intra-socket NUMA: sd->imb_numa_nr = max(2U, (64/4) >> (2+1)) = 2
       Inter-socket NUMA: sd->imb_numa_nr = max(2U, (128/8) >> 2) = 4

NPS=4, Intra-socket NUMA: sd->imb_numa_nr = max(2U, (32/2) >> (2+1)) = 2
       Inter-socket NUMA: sd->imb_numa_nr = max(2U, (128/8) >> 2) = 4




> 
> > <SNIP>
> > If we retain the (2,4) thresholds from v4.1 but use them in
> > allow_numa_imbalance() as in v3 we get
> > 
> > NPS=4
> > Test:	 mel-v4.2
> >  Copy:	 225860.12 (498.11%)
> > Scale:	 227869.07 (572.58%)
> >   Add:	 278365.58 (624.93%)
> > Triad:	 264315.44 (596.62%)
> > 
> 
> The potential problem with this is that it probably will work for
> netperf when it's a single communicating pair but may not work as well
> when there are multiple communicating pairs or a number of communicating
> tasks that exceed numa_imb_nr.


Yes that's true. I think what you are doing in v4 is the right thing.

In the case of stream in NPS=4, it just manages to hit the corner
case of this heuristic, which results in suboptimal behaviour.
Description follows:

On NPS=4, if we run 8 stream tasks bound to a socket with v4.1, we get
the following initial placement based on data obtained via the
sched:sched_wakeup_new tracepoint. This behaviour is consistently
reproducible.

-------------------------------------------------------
| NUMA                                                |
|   ----------------------- ------------------------  |
|   | NODE0               | | NODE1                |  |
|   |   -------------     | |    -------------     |  |
|   |   |  0 tasks  | MC0 | |    |  1 tasks  | MC2 |  |
|   |   -------------     | |    -------------     |  |
|   |   -------------     | |    -------------     |  |
|   |   |  1 tasks  | MC1 | |    |  1 tasks  | MC3 |  |
|   |   -------------     | |    -------------     |  |
|   |                     | |                      |  |
|   ----------------------- ------------------------  |
|   ----------------------- ------------------------  |
|   | NODE2               | | NODE3                |  |
|   |   -------------     | |    -------------     |  |
|   |   |  1 tasks  | MC4 | |    |  1 tasks  | MC6 |  |
|   |   -------------     | |    -------------     |  |
|   |   -------------     | |    -------------     |  |
|   |   |  2 tasks  | MC5 | |    |  1 tasks  | MC7 |  |
|   |   -------------     | |    -------------     |  |
|   |                     | |                      |  |
|   ----------------------- ------------------------  |
|                                                     |
-------------------------------------------------------

From the trace data obtained for sched:sched_wakeup_new and
sched:sched_migrate_task, we see

PID 106089 : timestamp 35607.831040 : was running  in MC5
PID 106090 : timestamp 35607.831040 : first placed in MC4
PID 106091 : timestamp 35607.831081 : first placed in MC5
PID 106092 : timestamp 35607.831155 : first placed in MC7
PID 106093 : timestamp 35607.831209 : first placed in MC3
PID 106094 : timestamp 35607.831254 : first placed in MC1
PID 106095 : timestamp 35607.831300 : first placed in MC6
PID 106096 : timestamp 35607.831344 : first placed in MC2

Subsequently we do not see any migrations for stream tasks (via the
sched:sched_migrate_task tracepoint), even though they run for nearly
10 seconds. The reasons:

  - No load-balancing is possible at any of the NODE sched-domains
    since the groups are more or less balanced within each NODE.

  - At NUMA sched-domain, busiest group would be NODE2.  When any CPU
    in NODE0 performs load-balancing at NUMA level, it can pull tasks
    only if the imbalance between NODE0 and NODE2 is greater than
    imb_numa_nr = 2, which isn't the case here.
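
For concreteness, reading the task counts off the placement diagram
above: NODE0 has 1 running task while NODE2 has 3, so the imbalance
seen at the NUMA level works out to 3 - 1 = 2, which is not greater
than imb_numa_nr = 2, and hence nothing gets pulled.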

Hence, with v4.1, we get the following numbers which are better than
the current upstream, but are still not the best.
Copy:           78182.7
Scale:          76344.1
Add:            87638.7
Triad:          86388.9

However, if I run an "mpstat 1 10 > /tmp/mpstat.log&" just before
kickstarting stream-8, the performance significantly improves (again,
consistently reproducible).

Copy:          122804.6
Scale:         115192.9
Add:           137191.6
Triad:         133338.5

In this case, from the trace data for stream, we see:
PID 105174 : timestamp 35547.526816 : was running  in  MC4
PID 105174 : timestamp 35547.577635 : moved to         MC5

PID 105175 : timestamp 35547.526816 : first placed in  MC4
PID 105176 : timestamp 35547.526846 : first placed in  MC3
PID 105177 : timestamp 35547.526893 : first placed in  MC7
PID 105178 : timestamp 35547.526928 : first placed in  MC1
PID 105179 : timestamp 35547.526961 : first placed in  MC2
PID 105180 : timestamp 35547.527001 : first placed in  MC6
PID 105181 : timestamp 35547.527032 : first placed in  MC0

In this case, at the time of the initial placement
(find_idlest_group() ?), we are able to spread out farther away. The
subsequent load-balance at the NODE2 domain is able to balance the
tasks between MC4 and MC5.

> 
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > NPS=1
> > ======
> > Clients: tip-core   mel-v3    mel-v4    mel-v4.1
> >     1	 633.19     619.16    632.94    619.27
> >     	 (0.00%)    (-2.21%)  (-0.03%)	(-2.19%)
> > 	 
> >     2	 1152.48    1189.88   1184.82   1189.19
> >     	 (0.00%)    (3.24%)   (2.80%)	(3.18%)
> > 	 
> >     4	 1946.46    2177.40   1979.56	2196.09
> >     	 (0.00%)    (11.86%)  (1.70%)	(12.82%)
> > 	 
> >     8	 3553.29    3564.50   3678.07	3668.77
> >     	 (0.00%)    (0.31%)   (3.51%)	(3.24%)
> > 	 
> >    16	 6217.03    6484.58   6249.29	6534.73
> >    	 (0.00%)    (4.30%)   (0.51%)	(5.11%)
> > 	 
> >    32	 11702.59   12185.77  12005.99	11917.57
> >    	 (0.00%)    (4.12%)   (2.59%)	(1.83%)
> > 	 
> >    64	 18394.56   19535.11  19080.19	19500.55
> >    	 (0.00%)    (6.20%)   (3.72%)	(6.01%)
> > 	 
> >   128	 27231.02   31759.92  27200.52	30358.99
> >   	 (0.00%)    (16.63%)  (-0.11%)	(11.48%)
> > 	 
> >   256	 33166.10   24474.30  31639.98	24788.12
> >   	 (0.00%)    (-26.20%) (-4.60%)	(-25.26%)
> > 	 
> >   512	 41605.44   54823.57  46684.48	54559.02
> >   	 (0.00%)    (31.77%)  (12.20%)	(31.13%)
> > 	 
> >  1024	 53650.54   56329.39  44422.99	56320.66
> >  	 (0.00%)    (4.99%)   (-17.19%)	(4.97%) 
> > 
> > 
> > We see that the v4.1 performs better than v4 in most cases except when
> > the number of clients=256 where the spread strategy seems to be
> > hurting as we see degradation in both v3 and v4.1. This is true even
> > for NPS=2 and NPS=4 cases (see below).
> > 
> 
> The 256 client case is a bit of a crapshoot. At that point, the NUMA
> imbalancing is disabled and the machine is overloaded.

Yup. 

[..snip..]

> Most likely because v4.2 is disabling the allowed NUMA imbalance too
> soon. This is the trade-off between favouring communicating tasks over
> embararassingly parallel problems.

v4.1 does allow the NUMA imbalance for a longer duration. But since
the thresholds are small enough, I guess it should be ok for most
workloads.

--
Thanks and Regards
gautham.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-10  9:33 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs Mel Gorman
  2021-12-13  8:28   ` Gautham R. Shenoy
@ 2021-12-17 19:54   ` Gautham R. Shenoy
  1 sibling, 0 replies; 27+ messages in thread
From: Gautham R. Shenoy @ 2021-12-17 19:54 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML

On Fri, Dec 10, 2021 at 09:33:07AM +0000, Mel Gorman wrote:
[..snip..]

> @@ -9186,12 +9191,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>  				return idlest;
>  #endif
>  			/*
> -			 * Otherwise, keep the task on this node to stay close
> -			 * its wakeup source and improve locality. If there is
> -			 * a real need of migration, periodic load balance will
> -			 * take care of it.
> +			 * Otherwise, keep the task on this node to stay local
> +			 * to its wakeup source if the number of running tasks
> +			 * are below the allowed imbalance. If there is a real
> +			 * need of migration, periodic load balance will take
> +			 * care of it.
>  			 */
> -			if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight))
> +			if (local_sgs.sum_nr_running <= sd->imb_numa_nr)
>  				return NULL;

Considering the fact that we want to check whether or not the
imb_numa_nr threshold is going to be crossed if we let the new task
stay local, this should be

      	      	    	if (local_sgs.sum_nr_running + 1 <= sd->imb_numa_nr)
			   	return NULL;



Without this change, on a Zen3 configured with Nodes Per Socket
(NPS)=4, the lower NUMA domain with sd->imb_numa_nr = 2 has 4 groups
(each group corresponds to a NODE sched-domain). When we run stream
with 8 threads, we see 3 of them being initially placed in the local
group and the remaining 5 distributed across the other 3 groups. None
of these 3 tasks will ever get migrated to any of the other 3 groups,
because each of those groups has at least one task.

Eg:

PID 157811 : timestamp 108921.267293 : first placed in NODE 1
PID 157812 : timestamp 108921.269877 : first placed in NODE 1
PID 157813 : timestamp 108921.269921 : first placed in NODE 1
PID 157814 : timestamp 108921.270007 : first placed in NODE 2
PID 157815 : timestamp 108921.270065 : first placed in NODE 3
PID 157816 : timestamp 108921.270118 : first placed in NODE 0
PID 157817 : timestamp 108921.270168 : first placed in NODE 2
PID 157818 : timestamp 108921.270216 : first placed in NODE 3
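
To spell out the arithmetic (with sd->imb_numa_nr = 2 at this domain):
the existing check "local_sgs.sum_nr_running <= sd->imb_numa_nr" keeps
a new task local while the local group already has 0, 1 or 2 running
tasks, which is how three threads pile up on NODE 1 above. With the
"+ 1" the third thread sees 2 + 1 > 2 and gets placed elsewhere.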

With the fix mentioned above, we see the 8 threads uniformly
distributed across the 4 groups.

PID 7500 : timestamp 436.156429 : first placed in     NODE   1
PID 7501 : timestamp 436.159058 : first placed in     NODE   1
PID 7502 : timestamp 436.159106 : first placed in     NODE   2
PID 7503 : timestamp 436.159173 : first placed in     NODE   3
PID 7504 : timestamp 436.159219 : first placed in     NODE   0
PID 7505 : timestamp 436.159263 : first placed in     NODE   2
PID 7506 : timestamp 436.159305 : first placed in     NODE   3
PID 7507 : timestamp 436.159348 : first placed in     NODE   0


--
Thanks and Regards
gautham.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-16 18:33             ` Gautham R. Shenoy
@ 2021-12-20 11:12               ` Mel Gorman
  2021-12-21 15:03                 ` Gautham R. Shenoy
  2021-12-21 17:13                 ` Vincent Guittot
  0 siblings, 2 replies; 27+ messages in thread
From: Mel Gorman @ 2021-12-20 11:12 UTC (permalink / raw)
  To: Gautham R. Shenoy
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML

(sorry for the delay, was offline for a few days)

On Fri, Dec 17, 2021 at 12:03:06AM +0530, Gautham R. Shenoy wrote:
> Hello Mel,
> 
> On Wed, Dec 15, 2021 at 12:25:50PM +0000, Mel Gorman wrote:
> > On Wed, Dec 15, 2021 at 05:22:30PM +0530, Gautham R. Shenoy wrote:
> 
> [..SNIP..]
> 
> > > On a 2 Socket Zen3:
> > > 
> > > NPS=1
> > >    child=MC, llc_weight=16, sd=DIE. sd->span_weight=128 imb=max(2U, (16*16/128) / 4)=2
> > >    top_p = NUMA, imb_span = 256.
> > > 
> > >    NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/256) = 2
> > > 
> > > NPS=2
> > >    child=MC, llc_weight=16, sd=NODE. sd->span_weight=64 imb=max(2U, (16*16/64) / 4) = 2
> > >    top_p = NUMA, imb_span = 128.
> > > 
> > >    NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2
> > >    NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4
> > > 
> > > NPS=4:
> > >    child=MC, llc_weight=16, sd=NODE. sd->span_weight=32 imb=max(2U, (16*16/32) / 4) = 2
> > >    top_p = NUMA, imb_span = 128.
> > > 
> > >    NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2
> > >    NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4
> > > 
> > > Again, we will be more aggressively load balancing across the two
> > > sockets in NPS=1 mode compared to NPS=2/4.
> > > 
> > 
> > Yes, but I felt it was reasonable behaviour because we have to strike
> > some sort of balance between allowing a NUMA imbalance up to a point
> > to prevent communicating tasks being pulled apart and v3 broke that
> > completely. There will always be a tradeoff between tasks that want to
> > remain local to each other and others that prefer to spread as wide as
> > possible as quickly as possible.
> 
> I agree with this argument that we want to be conservative while
> pulling tasks across NUMA domains. My point was that the threshold at
> the NUMA domain that spans the 2 sockets is lower for NPS=1
> (imb_numa_nr = 2) when compared to the threshold for the same NUMA
> domain when NPS=2/4 (imb_numa_nr = 4).
> 

Is that a problem though? On an Intel machine with sub-numa clustering,
the distances are 11 and 21 for a "node" that is the split cache and the
remote node respectively.

> Irrespective of what NPS mode we are operating in, the NUMA distance
> between the two sockets is 32 on Zen3 systems. Hence shouldn't the
> thresholds be the same for that level of NUMA? 
> 

Maybe, but then it is not a property of the sched_domain and instead
needs to account for distance when balancing between two nodes that
may be at varying distances from each other.

> Would something like the following work ?
> 
> if (sd->flags & SD_NUMA) {
> 
>    /* We are using the child as a proxy for the group. */
>    group_span = sd->child->span_weight;
>    sd_distance = /* NUMA distance at this sd level */
> 

NUMA distance relative to what? On Zen, the distance to a remote node may
be fixed but on topologies with multiple nodes that are not fully linked
to every other node by one hop, the same is not true.

>    /* By default we set the threshold to 1/4th the sched-group span. */
>    imb_numa_shift = 2;
> 
>    /*
>     * We can be a little aggressive if the cost of migrating tasks
>     * across groups of this NUMA level is not high.
>     * Assuming 
>     */
>    
>    if (sd_distance < REMOTE_DISTANCE)
>       imb_numa_shift++;
> 

The distance would have to be accounted for elsewhere because here we
are considering one node in isolation, not relative to other nodes.

>    /*
>     * Compute the number of LLCs in each group.
>     * More the LLCs, more aggressively we migrate across
>     * the groups at this NUMA sd.
>     */
>     nr_llcs = group_span/llc_size;
> 
>     sd->imb_numa_nr = max(2U, (group_span / nr_llcs) >> imb_numa_shift);
> }
> 

Same, any adjustment would have to happen during load balancing,
taking into account the relative NUMA distances. I'm not necessarily
opposed but it would be a separate patch.

> > > <SNIP>
> > > If we retain the (2,4) thresholds from v4.1 but use them in
> > > allow_numa_imbalance() as in v3 we get
> > > 
> > > NPS=4
> > > Test:	 mel-v4.2
> > >  Copy:	 225860.12 (498.11%)
> > > Scale:	 227869.07 (572.58%)
> > >   Add:	 278365.58 (624.93%)
> > > Triad:	 264315.44 (596.62%)
> > > 
> > 
> > The potential problem with this is that it probably will work for
> > netperf when it's a single communicating pair but may not work as well
> > when there are multiple communicating pairs or a number of communicating
> > tasks that exceed numa_imb_nr.
> 
> 
> Yes that's true. I think what you are doing in v4 is the right thing.
> 
> In case of stream in NPS=4, it just manages to hit the corner case for
> this heuristic which results in a suboptimal behaviour. Description
> follows:
> 

To avoid the corner case, we'd need to explicitly favour spreading
early and assume that wakeup will pull communicating tasks together
and that NUMA balancing will migrate the data after some time, which
looks like:

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index c07bfa2d80f2..54f5207154d3 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -93,6 +93,7 @@ struct sched_domain {
 	unsigned int busy_factor;	/* less balancing by factor if busy */
 	unsigned int imbalance_pct;	/* No balance until over watermark */
 	unsigned int cache_nice_tries;	/* Leave cache hot tasks for # tries */
+	unsigned int imb_numa_nr;	/* Nr imbalanced tasks allowed between nodes */
 
 	int nohz_idle;			/* NOHZ IDLE status */
 	int flags;			/* See SD_* */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0a969affca76..df0e84462e62 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1489,6 +1489,7 @@ struct task_numa_env {
 
 	int src_cpu, src_nid;
 	int dst_cpu, dst_nid;
+	int imb_numa_nr;
 
 	struct numa_stats src_stats, dst_stats;
 
@@ -1504,7 +1505,8 @@ static unsigned long cpu_load(struct rq *rq);
 static unsigned long cpu_runnable(struct rq *rq);
 static unsigned long cpu_util(int cpu);
 static inline long adjust_numa_imbalance(int imbalance,
-					int dst_running, int dst_weight);
+					int dst_running,
+					int imb_numa_nr);
 
 static inline enum
 numa_type numa_classify(unsigned int imbalance_pct,
@@ -1885,7 +1887,7 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 		dst_running = env->dst_stats.nr_running + 1;
 		imbalance = max(0, dst_running - src_running);
 		imbalance = adjust_numa_imbalance(imbalance, dst_running,
-							env->dst_stats.weight);
+						  env->imb_numa_nr);
 
 		/* Use idle CPU if there is no imbalance */
 		if (!imbalance) {
@@ -1950,8 +1952,10 @@ static int task_numa_migrate(struct task_struct *p)
 	 */
 	rcu_read_lock();
 	sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu));
-	if (sd)
+	if (sd) {
 		env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
+		env.imb_numa_nr = sd->imb_numa_nr;
+	}
 	rcu_read_unlock();
 
 	/*
@@ -9050,9 +9054,9 @@ static bool update_pick_idlest(struct sched_group *idlest,
  * This is an approximation as the number of running tasks may not be
  * related to the number of busy CPUs due to sched_setaffinity.
  */
-static inline bool allow_numa_imbalance(int dst_running, int dst_weight)
+static inline bool allow_numa_imbalance(int dst_running, int imb_numa_nr)
 {
-	return (dst_running < (dst_weight >> 2));
+	return dst_running < imb_numa_nr;
 }
 
 /*
@@ -9186,12 +9190,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 				return idlest;
 #endif
 			/*
-			 * Otherwise, keep the task on this node to stay close
-			 * its wakeup source and improve locality. If there is
-			 * a real need of migration, periodic load balance will
-			 * take care of it.
+			 * Otherwise, keep the task on this node to stay local
+			 * to its wakeup source if the number of running tasks
+			 * are below the allowed imbalance. If there is a real
+			 * need of migration, periodic load balance will take
+			 * care of it.
 			 */
-			if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight))
+			if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->imb_numa_nr))
 				return NULL;
 		}
 
@@ -9280,19 +9285,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	}
 }
 
-#define NUMA_IMBALANCE_MIN 2
-
 static inline long adjust_numa_imbalance(int imbalance,
-				int dst_running, int dst_weight)
+				int dst_running, int imb_numa_nr)
 {
-	if (!allow_numa_imbalance(dst_running, dst_weight))
+	if (!allow_numa_imbalance(dst_running, imb_numa_nr))
 		return imbalance;
 
-	/*
-	 * Allow a small imbalance based on a simple pair of communicating
-	 * tasks that remain local when the destination is lightly loaded.
-	 */
-	if (imbalance <= NUMA_IMBALANCE_MIN)
+	if (imbalance <= imb_numa_nr)
 		return 0;
 
 	return imbalance;
@@ -9397,7 +9396,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		/* Consider allowing a small imbalance between NUMA groups */
 		if (env->sd->flags & SD_NUMA) {
 			env->imbalance = adjust_numa_imbalance(env->imbalance,
-				busiest->sum_nr_running, env->sd->span_weight);
+				busiest->sum_nr_running, env->sd->imb_numa_nr);
 		}
 
 		return;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index d201a7052a29..1fa3e977521d 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2242,6 +2242,55 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 		}
 	}
 
+	/*
+	 * Calculate an allowed NUMA imbalance such that LLCs do not get
+	 * imbalanced.
+	 */
+	for_each_cpu(i, cpu_map) {
+		unsigned int imb = 0;
+		unsigned int imb_span = 1;
+
+		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
+			struct sched_domain *child = sd->child;
+
+			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
+			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
+				struct sched_domain *top, *top_p;
+				unsigned int llc_sq;
+
+				/*
+				 * nr_llcs = (sd->span_weight / llc_weight);
+				 * imb = (llc_weight / nr_llcs) >> 2
+				 *
+				 * is equivalent to
+				 *
+				 * imb = (llc_weight^2 / sd->span_weight) >> 2
+				 *
+				 */
+				llc_sq = child->span_weight * child->span_weight;
+
+				imb = max(2U, ((llc_sq / sd->span_weight) >> 2));
+				sd->imb_numa_nr = imb;
+
+				/*
+				 * Set span based on top domain that places
+				 * tasks in sibling domains.
+				 */
+				top = sd;
+				top_p = top->parent;
+				while (top_p && (top_p->flags & SD_PREFER_SIBLING)) {
+					top = top->parent;
+					top_p = top->parent;
+				}
+				imb_span = top_p ? top_p->span_weight : sd->span_weight;
+			} else {
+				int factor = max(1U, (sd->span_weight / imb_span));
+
+				sd->imb_numa_nr = imb * factor;
+			}
+		}
+	}
+
 	/* Calculate CPU capacity for physical packages and nodes */
 	for (i = nr_cpumask_bits-1; i >= 0; i--) {
 		if (!cpumask_test_cpu(i, cpu_map))

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group
  2021-12-10  9:33 ` [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group Mel Gorman
@ 2021-12-21 10:53   ` Vincent Guittot
  2021-12-21 11:32     ` Mel Gorman
  0 siblings, 1 reply; 27+ messages in thread
From: Vincent Guittot @ 2021-12-21 10:53 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy,
	LKML

On Fri, 10 Dec 2021 at 10:33, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> find_busiest_group uses the child domain's group weight instead of
> the sched_domain's weight that has SD_NUMA set when calculating the
> allowed imbalance between NUMA nodes. This is wrong and inconsistent
> with find_idlest_group.

I agree that find_busiest_group and find_idlest_group should be
consistent and use the same parameters but I wonder if sched_domain's
weight is the right one to use instead of the target group's weight.

IIRC, the goal of adjust_numa_imbalance is to keep some threads on the
same node as long as we consider that there is no performance impact
because of sharing resources, as they can even take advantage of
locality if they interact. So we consider that tasks will not be
impacted by sharing resources if they use less than 25% of the CPUs of
a node. If we use the sd->span_weight instead, we consider that we can
pack threads in the same node as long as they use less than 25% of the
CPUs in all nodes.

>
> This patch uses the SD_NUMA weight in both.
>
> Fixes: 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA nodes")
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  kernel/sched/fair.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6e476f6d9435..0a969affca76 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9397,7 +9397,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>                 /* Consider allowing a small imbalance between NUMA groups */
>                 if (env->sd->flags & SD_NUMA) {
>                         env->imbalance = adjust_numa_imbalance(env->imbalance,
> -                               busiest->sum_nr_running, busiest->group_weight);
> +                               busiest->sum_nr_running, env->sd->span_weight);
>                 }
>
>                 return;
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group
  2021-12-21 10:53   ` Vincent Guittot
@ 2021-12-21 11:32     ` Mel Gorman
  2021-12-21 13:05       ` Vincent Guittot
  0 siblings, 1 reply; 27+ messages in thread
From: Mel Gorman @ 2021-12-21 11:32 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy,
	LKML

On Tue, Dec 21, 2021 at 11:53:50AM +0100, Vincent Guittot wrote:
> On Fri, 10 Dec 2021 at 10:33, Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > find_busiest_group uses the child domain's group weight instead of
> > the sched_domain's weight that has SD_NUMA set when calculating the
> > allowed imbalance between NUMA nodes. This is wrong and inconsistent
> > with find_idlest_group.
> 
> I agree that find_busiest_group and find_idlest_group should be
> consistent and use the same parameters but I wonder if sched_domain's
> weight is the right one to use instead of the target group's weight.
> 

Ok

> IIRC, the goal of adjust_numa_imbalance is to keep some threads on the
> same node as long as we consider that there is no performance impact
> because of sharing  resources as they can even take advantage of
> locality if they interact.

Yes.

> So we consider that tasks will not be
> impacted by sharing resources if they use less than 25% of the CPUs of
> a node. If we use the sd->span_weight instead, we consider that we can
> pack threads in the same node as long as it uses less than 25% of the
> CPUs in all nodes.
> 

I assume you mean the target group weight instead of the node. The
primary resource we are concerned with is memory bandwidth, and it's
a guess because we do not know for sure where memory channels are or
how they are configured in this context, and it may or may not be
correlated with groups. I think using the group instead would deserve
a series on its own after settling on an imbalance number when there
are multiple LLCs per node.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group
  2021-12-21 11:32     ` Mel Gorman
@ 2021-12-21 13:05       ` Vincent Guittot
  0 siblings, 0 replies; 27+ messages in thread
From: Vincent Guittot @ 2021-12-21 13:05 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy,
	LKML

On Tue, 21 Dec 2021 at 12:32, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Tue, Dec 21, 2021 at 11:53:50AM +0100, Vincent Guittot wrote:
> > On Fri, 10 Dec 2021 at 10:33, Mel Gorman <mgorman@techsingularity.net> wrote:
> > >
> > > find_busiest_group uses the child domain's group weight instead of
> > > the sched_domain's weight that has SD_NUMA set when calculating the
> > > allowed imbalance between NUMA nodes. This is wrong and inconsistent
> > > with find_idlest_group.
> >
> > I agree that find_busiest_group and find_idlest_group should be
> > consistent and use the same parameters but I wonder if sched_domain's
> > weight is the right one to use instead of the target group's weight.
> >
>
> Ok
>
> > IIRC, the goal of adjust_numa_imbalance is to keep some threads on the
> > same node as long as we consider that there is no performance impact
> > because of sharing  resources as they can even take advantage of
> > locality if they interact.
>
> Yes.
>
> > So we consider that tasks will not be
> > impacted by sharing resources if they use less than 25% of the CPUs of
> > a node. If we use the sd->span_weight instead, we consider that we can
> > pack threads in the same node as long as it uses less than 25% of the
> > CPUs in all nodes.
> >
>
> I assume you mean the target group weight instead of the node. The

I wanted to say that with this patch, we consider the imbalance
acceptable if the number of threads in a node is less than 25% of all
CPUs of all nodes (for this NUMA level), but 25% of all CPUs of all
nodes can be more than the number of CPUs in the group.

So I would have changed find_idlest_group instead of changing find_busiest_group

> primary resource we are concerned with is memory bandwidth and it's a
> guess because we do not know for sure where memory channels are or how
> they are configured in this context and it may or may not be correlated
> with groups. I think using the group instead would deserve a series on
> its own after settling on an imbalance number when there are multiple
> LLCs per node.

I haven't looked yet at patch 2 for multiple LLCs per node.

>
> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-20 11:12               ` Mel Gorman
@ 2021-12-21 15:03                 ` Gautham R. Shenoy
  2021-12-21 17:13                 ` Vincent Guittot
  1 sibling, 0 replies; 27+ messages in thread
From: Gautham R. Shenoy @ 2021-12-21 15:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML

On Mon, Dec 20, 2021 at 11:12:43AM +0000, Mel Gorman wrote:
> (sorry for the delay, was offline for a few days)
> 
> On Fri, Dec 17, 2021 at 12:03:06AM +0530, Gautham R. Shenoy wrote:
> > Hello Mel,
> > 
> > On Wed, Dec 15, 2021 at 12:25:50PM +0000, Mel Gorman wrote:
> > > On Wed, Dec 15, 2021 at 05:22:30PM +0530, Gautham R. Shenoy wrote:
> > 
> > [..SNIP..]
> > 
> > > > On a 2 Socket Zen3:
> > > > 
> > > > NPS=1
> > > >    child=MC, llc_weight=16, sd=DIE. sd->span_weight=128 imb=max(2U, (16*16/128) / 4)=2
> > > >    top_p = NUMA, imb_span = 256.
> > > > 
> > > >    NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/256) = 2
> > > > 
> > > > NPS=2
> > > >    child=MC, llc_weight=16, sd=NODE. sd->span_weight=64 imb=max(2U, (16*16/64) / 4) = 2
> > > >    top_p = NUMA, imb_span = 128.
> > > > 
> > > >    NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2
> > > >    NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4
> > > > 
> > > > NPS=4:
> > > >    child=MC, llc_weight=16, sd=NODE. sd->span_weight=32 imb=max(2U, (16*16/32) / 4) = 2
> > > >    top_p = NUMA, imb_span = 128.
> > > > 
> > > >    NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2
> > > >    NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4
> > > > 
> > > > Again, we will be more aggressively load balancing across the two
> > > > sockets in NPS=1 mode compared to NPS=2/4.
> > > > 
> > > 
> > > Yes, but I felt it was reasonable behaviour because we have to strike
> > > some sort of balance between allowing a NUMA imbalance up to a point
> > > to prevent communicating tasks being pulled apart and v3 broke that
> > > completely. There will always be a tradeoff between tasks that want to
> > > remain local to each other and others that prefer to spread as wide as
> > > possible as quickly as possible.
> > 
> > I agree with this argument that we want to be conservative while
> > pulling tasks across NUMA domains. My point was that the threshold at
> > the NUMA domain that spans the 2 sockets is lower for NPS=1
> > (imb_numa_nr = 2) when compared to the threshold for the same NUMA
> > domain when NPS=2/4 (imb_numa_nr = 4).
> > 
> 
> Is that a problem though? On an Intel machine with sub-numa clustering,
> the distances are 11 and 21 for a "node" that is the split cache and the
> remote node respectively.

So, my question was: on an Intel machine with sub-numa clustering
enabled vs disabled, is the value of imb_numa_nr for the NUMA domain
which spans the remote nodes (distance=21) the same or different? And
if it is different, what is the rationale behind that? I am totally
on board with the idea that for the different NUMA levels, the
corresponding imb_numa_nr should be different.

Just in case I was not making myself clear earlier: on Zen3, the
NUMA-A sched-domain in the figures below has groups where each group
spans a socket in all the NPS configurations. However, only on NPS=1
do we have sd->imb_numa_nr=2 for NUMA-A, while on NPS=2/4 the value of
sd->imb_numa_nr is 4 for the NUMA-A domain. Thus if we had 4 tasks
sharing data, on NPS=2/4 they would reside on the same socket, while
on NPS=1 we would have 2 tasks on one socket, and the other 2 would be
migrated to the other socket.

That said, I have not been able to observe any significant difference
with a real-world workload like Mongodb run on NPS=1 with imb_numa_nr
set to 2 vs 4.



Zen3, NPS=1
------------------------------------------------------------------
|                                                                |
|  NUMA-A : sd->imb_numa_nr = 2   			     	 |
|     ------------------------     ------------------------  	 |
|     |DIE                   |     |DIE                   |  	 |
|     |                      |     |                      |  	 |
|     |    ------   ------   |     |    ------   ------   |  	 |
|     |    |MC  |   |MC  |   |     |    |MC  |   |MC  |   |  	 |
|     |    ------   ------   |	   |    ------   ------   |  	 |
|     |    ------   ------   |	   |    ------   ------   |  	 |
|     |    |MC  |   |MC  |   |	   |    |MC  |   |MC  |   |  	 |
|     |    ------   ------   |	   |    ------   ------   |  	 |
|     |                      |	   |                      |  	 |
|     |    ------   ------   |	   |    ------   ------   |  	 |
|     |    |MC  |   |MC  |   |	   |    |MC  |   |MC  |   |  	 |
|     |    ------   ------   |	   |    ------   ------   |  	 |
|     |    ------   ------   |	   |    ------   ------   |  	 |
|     |    |MC  |   |MC  |   |     |    |MC  |   |MC  |   |  	 |
|     |    ------   ------   |	   |    ------   ------   |  	 |
|     |                      |	   |                      |  	 |
|     ------------------------	   ------------------------  	 |
|							     	 |
------------------------------------------------------------------



Zen3, NPS=2
------------------------------------------------------------------
|                                                                |
|  NUMA-A: sd->imb_numa_nr = 4   				 |
|    ---------------------------  --------------------------- 	 |
|    |NUMA-B :sd->imb_numa_nr=2|  |NUMA-B :sd->imb_numa_nr=2|	 |
|    | ----------- ----------- |  | ----------- ----------- |	 |
|    | |NODE     | |NODE     | |  | |NODE     | |NODE     | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  |MC  | | |  |MC  | | |  | |  |MC  | | |  |MC  | | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  |MC  | | |  |MC  | | |  | |  |MC  | | |  |MC  | | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |         | |         | |  | |         | |         | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  |MC  | | |  |MC  | | |  | |  |MC  | | |  |MC  | | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  |MC  | | |  |MC  | | |  | |  |MC  | | |  |MC  | | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | ----------- ----------- |  | ----------- ----------- |	 |
|    |                         |  |                         |	 |
|    ---------------------------  ---------------------------	 |
|							     	 |
------------------------------------------------------------------


Zen3, NPS=4
------------------------------------------------------------------
|                                                                |
|  NUMA-A: sd->imb_numa_nr = 4   				 |
|    ---------------------------  --------------------------- 	 |
|    |NUMA-B :sd->imb_numa_nr=2|  |NUMA-B :sd->imb_numa_nr=2|	 |
|    | ----------- ----------- |  | ----------- ----------- |	 |
|    | |NODE     | |NODE     | |  | |NODE     | |NODE     | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  |MC  | | |  |MC  | | |  | |  |MC  | | |  |MC  | | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  |MC  | | |  |MC  | | |  | |  |MC  | | |  |MC  | | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | ----------- ----------- |  | ----------- ----------- |    |
|    | ----------- ----------- |  | ----------- ----------- |    |
|    | |NODE     | |NODE     | |  | |NODE     | |NODE     | |    |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  |MC  | | |  |MC  | | |  | |  |MC  | | |  |MC  | | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  |MC  | | |  |MC  | | |  | |  |MC  | | |  |MC  | | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | ----------- ----------- |  | ----------- ----------- |	 |
|    |                         |  |                         |	 |
|    ---------------------------  ---------------------------	 |
|							     	 |
------------------------------------------------------------------




> > Irrespective of what NPS mode we are operating in, the NUMA distance
> > between the two sockets is 32 on Zen3 systems. Hence shouldn't the
> > thresholds be the same for that level of NUMA? 
> > 
> 
> Maybe, but then it is not a property of the sched_domain and instead
> needs to account for distance when balancing between two nodes that may
> be varying distances from each other.
> 
> > Would something like the following work ?
> > 
> > if (sd->flags & SD_NUMA) {
> > 
> >    /* We are using the child as a proxy for the group. */
> >    group_span = sd->child->span_weight;
> >    sd_distance = /* NUMA distance at this sd level */
> > 
> 
> NUMA distance relative to what? On Zen, the distance to a remote node may
> be fixed but on topologies with multiple nodes that are not fully linked
> to every other node by one hop, the same is not true.

Fair enough. By "sd_distance" I was referring to the node_distance()
between the CPUs of any two groups in this NUMA domain. However, this
was assuming that the node_distance() between the CPUs of any two
groups would be the same, which is not the case for certain
platforms. So this wouldn't work.



> 
> >    /* By default we set the threshold to 1/4th the sched-group span. */
> >    imb_numa_shift = 2;
> > 
> >    /*
> >     * We can be a little aggressive if the cost of migrating tasks
> >     * across groups of this NUMA level is not high.
> >     * Assuming 
> >     */
> >    
> >    if (sd_distance < REMOTE_DISTANCE)
> >       imb_numa_shift++;
> > 
> 
> The distance would have to be accounted for elsewhere because here we
> are considering one node in isolation, not relative to other nodes.
> 
> >    /*
> >     * Compute the number of LLCs in each group.
> >     * More the LLCs, more aggressively we migrate across
> >     * the groups at this NUMA sd.
> >     */
> >     nr_llcs = group_span/llc_size;
> > 
> >     sd->imb_numa_nr = max(2U, (group_span / nr_llcs) >> imb_numa_shift);
> > }
> > 
> 
> Same, any adjustment would have to happen during load balancing taking
> into account the relatively NUMA distances. I'm not necessarily opposed
> but it would be a separate patch.


Sure, we can look into this separately.


> 
> > > > <SNIP>
> > > > If we retain the (2,4) thresholds from v4.1 but use them in
> > > > allow_numa_imbalance() as in v3 we get
> > > > 
> > > > NPS=4
> > > > Test:	 mel-v4.2
> > > >  Copy:	 225860.12 (498.11%)
> > > > Scale:	 227869.07 (572.58%)
> > > >   Add:	 278365.58 (624.93%)
> > > > Triad:	 264315.44 (596.62%)
> > > > 
> > > 
> > > The potential problem with this is that it probably will work for
> > > netperf when it's a single communicating pair but may not work as well
> > > when there are multiple communicating pairs or a number of communicating
> > > tasks that exceed numa_imb_nr.
> > 
> > 
> > Yes that's true. I think what you are doing in v4 is the right thing.
> > 
> > In case of stream in NPS=4, it just manages to hit the corner case for
> > this heuristic which results in a suboptimal behaviour. Description
> > follows:
> > 
> 
> To avoid the corner case, we'd need to explicitly favour spreading early
> and assume wakeup will pull communicating tasks together and NUMA
> balancing migrate the data after some time which looks like


Actually, I was able to root-cause the reason behind the drop in the
performance of stream on NPS-4. I have already responded earlier in
another thread: https://lore.kernel.org/lkml/Ybzq%2FA+HS%2FGxGYha@BLR-5CG11610CF.amd.com/
Appending the patch here:


---
 kernel/sched/fair.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ec354bf88b0d..c1b2a422a877 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9191,13 +9191,14 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 				return idlest;
 #endif
 			/*
-			 * Otherwise, keep the task on this node to stay local
-			 * to its wakeup source if the number of running tasks
-			 * are below the allowed imbalance. If there is a real
-			 * need of migration, periodic load balance will take
-			 * care of it.
+			 * Otherwise, keep the task on this node to
+			 * stay local to its wakeup source if the
+			 * number of running tasks (including the new
+			 * one) are below the allowed imbalance. If
+			 * there is a real need of migration, periodic
+			 * load balance will take care of it.
 			 */
-			if (local_sgs.sum_nr_running <= sd->imb_numa_nr)
+			if (local_sgs.sum_nr_running + 1 <= sd->imb_numa_nr)
 				return NULL;
 		}
 
-- 

With this fix on top of your fix to compute the imb_numa_nr at the
relevant level (v4.1:
https://lore.kernel.org/lkml/20211213130131.GQ3366@techsingularity.net/),
the stream regression for NPS4 is no longer there.


Test:	tip-core	    v4		       v4.1		    v4.1-find_idlest_group_fix
 Copy:	 37762.14 (0.00%)   48748.60 (29.09%)  164765.83 (336.32%)  205963.99 (445.42%)
Scale:	 33879.66 (0.00%)   48317.66 (42.61%)  159641.67 (371.20%)  218136.57 (543.85%)
  Add:	 38398.90 (0.00%)   54259.56 (41.30%)  184583.70 (380.70%)  257857.90 (571.52%)
Triad:	 37942.38 (0.00%)   54503.74 (43.64%)  181250.80 (377.70%)  251410.28 (562.61%)
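
To spell out why the +1 in the patch above matters, a hedged
walk-through assuming sd->imb_numa_nr = 2 (the NODE/NUMA-B value in
the NPS figures earlier) and local_sgs.sum_nr_running = 2 when a new
task is being placed:

    old check:  2     <= 2  -> return NULL, keep the task local,
                               leaving 3 tasks stacked on the node
    new check:  2 + 1 <= 2  -> false, so find_idlest_group() may
                               place the new task on another node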

--
Thanks and Regards
gautham.




^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-20 11:12               ` Mel Gorman
  2021-12-21 15:03                 ` Gautham R. Shenoy
@ 2021-12-21 17:13                 ` Vincent Guittot
  2021-12-22  8:52                   ` Jirka Hladky
  2022-01-05 10:42                   ` Mel Gorman
  1 sibling, 2 replies; 27+ messages in thread
From: Vincent Guittot @ 2021-12-21 17:13 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Gautham R. Shenoy, Peter Zijlstra, Ingo Molnar,
	Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith,
	Srikar Dronamraju, LKML

On Mon, 20 Dec 2021 at 12:12, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> (sorry for the delay, was offline for a few days)
>
> On Fri, Dec 17, 2021 at 12:03:06AM +0530, Gautham R. Shenoy wrote:
> > Hello Mel,
> >
> > On Wed, Dec 15, 2021 at 12:25:50PM +0000, Mel Gorman wrote:
> > > On Wed, Dec 15, 2021 at 05:22:30PM +0530, Gautham R. Shenoy wrote:
> >
> > [..SNIP..]
> >

[snip]

>
> To avoid the corner case, we'd need to explicitly favour spreading early
> and assume wakeup will pull communicating tasks together and NUMA
> balancing migrate the data after some time which looks like
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index c07bfa2d80f2..54f5207154d3 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -93,6 +93,7 @@ struct sched_domain {
>         unsigned int busy_factor;       /* less balancing by factor if busy */
>         unsigned int imbalance_pct;     /* No balance until over watermark */
>         unsigned int cache_nice_tries;  /* Leave cache hot tasks for # tries */
> +       unsigned int imb_numa_nr;       /* Nr imbalanced tasks allowed between nodes */

So now you compute an allowed imbalance level instead of using
25% of sd->span_weight
or
25% of busiest->group_weight

And you adjust this new imb_numa_nr according to the topology.

That makes sense.

>
>         int nohz_idle;                  /* NOHZ IDLE status */
>         int flags;                      /* See SD_* */
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0a969affca76..df0e84462e62 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1489,6 +1489,7 @@ struct task_numa_env {
>
>         int src_cpu, src_nid;
>         int dst_cpu, dst_nid;
> +       int imb_numa_nr;
>
>         struct numa_stats src_stats, dst_stats;
>
> @@ -1504,7 +1505,8 @@ static unsigned long cpu_load(struct rq *rq);
>  static unsigned long cpu_runnable(struct rq *rq);
>  static unsigned long cpu_util(int cpu);
>  static inline long adjust_numa_imbalance(int imbalance,
> -                                       int dst_running, int dst_weight);
> +                                       int dst_running,
> +                                       int imb_numa_nr);
>
>  static inline enum
>  numa_type numa_classify(unsigned int imbalance_pct,
> @@ -1885,7 +1887,7 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>                 dst_running = env->dst_stats.nr_running + 1;
>                 imbalance = max(0, dst_running - src_running);
>                 imbalance = adjust_numa_imbalance(imbalance, dst_running,
> -                                                       env->dst_stats.weight);
> +                                                 env->imb_numa_nr);
>
>                 /* Use idle CPU if there is no imbalance */
>                 if (!imbalance) {
> @@ -1950,8 +1952,10 @@ static int task_numa_migrate(struct task_struct *p)
>          */
>         rcu_read_lock();
>         sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu));
> -       if (sd)
> +       if (sd) {
>                 env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
> +               env.imb_numa_nr = sd->imb_numa_nr;
> +       }
>         rcu_read_unlock();
>
>         /*
> @@ -9050,9 +9054,9 @@ static bool update_pick_idlest(struct sched_group *idlest,
>   * This is an approximation as the number of running tasks may not be
>   * related to the number of busy CPUs due to sched_setaffinity.
>   */
> -static inline bool allow_numa_imbalance(int dst_running, int dst_weight)
> +static inline bool allow_numa_imbalance(int dst_running, int imb_numa_nr)
>  {
> -       return (dst_running < (dst_weight >> 2));
> +       return dst_running < imb_numa_nr;
>  }
>
>  /*
> @@ -9186,12 +9190,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>                                 return idlest;
>  #endif
>                         /*
> -                        * Otherwise, keep the task on this node to stay close
> -                        * its wakeup source and improve locality. If there is
> -                        * a real need of migration, periodic load balance will
> -                        * take care of it.
> +                        * Otherwise, keep the task on this node to stay local
> +                        * to its wakeup source if the number of running tasks
> +                        * are below the allowed imbalance. If there is a real
> +                        * need of migration, periodic load balance will take
> +                        * care of it.
>                          */
> -                       if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight))
> +                       if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->imb_numa_nr))
>                                 return NULL;
>                 }
>
> @@ -9280,19 +9285,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>         }
>  }
>
> -#define NUMA_IMBALANCE_MIN 2
> -
>  static inline long adjust_numa_imbalance(int imbalance,
> -                               int dst_running, int dst_weight)
> +                               int dst_running, int imb_numa_nr)
>  {
> -       if (!allow_numa_imbalance(dst_running, dst_weight))
> +       if (!allow_numa_imbalance(dst_running, imb_numa_nr))
>                 return imbalance;
>
> -       /*
> -        * Allow a small imbalance based on a simple pair of communicating
> -        * tasks that remain local when the destination is lightly loaded.
> -        */
> -       if (imbalance <= NUMA_IMBALANCE_MIN)
> +       if (imbalance <= imb_numa_nr)

Isn't this always true?

imbalance is "always" < dst_running, as imbalance is usually the number
of those tasks that we would like to migrate


>                 return 0;
>
>         return imbalance;
> @@ -9397,7 +9396,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>                 /* Consider allowing a small imbalance between NUMA groups */
>                 if (env->sd->flags & SD_NUMA) {
>                         env->imbalance = adjust_numa_imbalance(env->imbalance,
> -                               busiest->sum_nr_running, env->sd->span_weight);
> +                               busiest->sum_nr_running, env->sd->imb_numa_nr);
>                 }
>
>                 return;
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index d201a7052a29..1fa3e977521d 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2242,6 +2242,55 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>                 }
>         }
>
> +       /*
> +        * Calculate an allowed NUMA imbalance such that LLCs do not get
> +        * imbalanced.
> +        */
> +       for_each_cpu(i, cpu_map) {
> +               unsigned int imb = 0;
> +               unsigned int imb_span = 1;
> +
> +               for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> +                       struct sched_domain *child = sd->child;
> +
> +                       if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> +                           (child->flags & SD_SHARE_PKG_RESOURCES)) {

The sched_domains have not been degenerated yet, so what you find here
is the DIE domain

> +                               struct sched_domain *top, *top_p;
> +                               unsigned int llc_sq;
> +
> +                               /*
> +                                * nr_llcs = (sd->span_weight / llc_weight);
> +                                * imb = (llc_weight / nr_llcs) >> 2

It would be good to add a comment to explain why 25% of the LLC weight
divided by the number of LLCs in a node is the right value.
For example, why is it better than just 25% of the LLC weight?
Do you want to allow the same imbalance at node level whatever the
number of LLCs in the node?

> +                                *
> +                                * is equivalent to
> +                                *
> +                                * imb = (llc_weight^2 / sd->span_weight) >> 2
> +                                *
> +                                */
> +                               llc_sq = child->span_weight * child->span_weight;
> +
> +                               imb = max(2U, ((llc_sq / sd->span_weight) >> 2));
> +                               sd->imb_numa_nr = imb;
> +
> +                               /*
> +                                * Set span based on top domain that places
> +                                * tasks in sibling domains.
> +                                */
> +                               top = sd;
> +                               top_p = top->parent;
> +                               while (top_p && (top_p->flags & SD_PREFER_SIBLING)) {

Why are you looping on SD_PREFER_SIBLING instead of SD_NUMA?
Apart from the heterogeneous domains (SD_ASYM_CPUCAPACITY), which I'm
not sure you want to take into account, only NUMA nodes don't have
SD_PREFER_SIBLING

> +                                       top = top->parent;
> +                                       top_p = top->parent;
> +                               }
> +                               imb_span = top_p ? top_p->span_weight : sd->span_weight;
> +                       } else {
> +                               int factor = max(1U, (sd->span_weight / imb_span));
> +
> +                               sd->imb_numa_nr = imb * factor;
> +                       }
> +               }
> +       }
> +
>         /* Calculate CPU capacity for physical packages and nodes */
>         for (i = nr_cpumask_bits-1; i >= 0; i--) {
>                 if (!cpumask_test_cpu(i, cpu_map))

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-21 17:13                 ` Vincent Guittot
@ 2021-12-22  8:52                   ` Jirka Hladky
  2022-01-04 19:52                     ` Jirka Hladky
  2022-01-05 10:42                   ` Mel Gorman
  1 sibling, 1 reply; 27+ messages in thread
From: Jirka Hladky @ 2021-12-22  8:52 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Mel Gorman, Gautham R. Shenoy, Peter Zijlstra, Ingo Molnar,
	Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith,
	Srikar Dronamraju, LKML, Philip Auld

Hi Mel,

we have tested the performance impact of this patch and we see
performance drops up to 60% with the OpenMP implementation of the NAS
parallel benchmark.

Example of results:
2x AMD EPYC 7313 16-Core server (total 32 cores, 64 CPUs)

                      5.16.0-rc5     5.16.0-rc5            Difference
                      (vanilla)      + v4 of the patch     in %
                      Mop/s total    Mop/s total
sp_C_x,  8 threads    36316.1        12939.6               -64%
sp_C_x, 16 threads    64790.2        23968.0               -63%
sp_C_x, 32 threads    67205.5        48891.4               -27%

Other NAS subtests (bt_C_x, is_D_x, ua_C_x) show similar results.

It seems like the allowed imbalance is too large and the negative
impact on workloads that prefer to span wide across all NUMA nodes is
too large.

Thanks
Jirka


On Tue, Dec 21, 2021 at 6:13 PM Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Mon, 20 Dec 2021 at 12:12, Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > (sorry for the delay, was offline for a few days)
> >
> > On Fri, Dec 17, 2021 at 12:03:06AM +0530, Gautham R. Shenoy wrote:
> > > Hello Mel,
> > >
> > > On Wed, Dec 15, 2021 at 12:25:50PM +0000, Mel Gorman wrote:
> > > > On Wed, Dec 15, 2021 at 05:22:30PM +0530, Gautham R. Shenoy wrote:
> > >
> > > [..SNIP..]
> > >
>
> [snip]
>
> >
> > To avoid the corner case, we'd need to explicitly favour spreading early
> > and assume wakeup will pull communicating tasks together and NUMA
> > balancing migrate the data after some time which looks like
> >
> > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> > index c07bfa2d80f2..54f5207154d3 100644
> > --- a/include/linux/sched/topology.h
> > +++ b/include/linux/sched/topology.h
> > @@ -93,6 +93,7 @@ struct sched_domain {
> >         unsigned int busy_factor;       /* less balancing by factor if busy */
> >         unsigned int imbalance_pct;     /* No balance until over watermark */
> >         unsigned int cache_nice_tries;  /* Leave cache hot tasks for # tries */
> > +       unsigned int imb_numa_nr;       /* Nr imbalanced tasks allowed between nodes */
>
> So now you compute an allowed imbalance level instead of using
> 25% of sd->span_weight
> or
> 25% of busiest->group_weight
>
> And you adjust this new imb_numa_nr according to the topology.
>
> That makes sense.
>
> >
> >         int nohz_idle;                  /* NOHZ IDLE status */
> >         int flags;                      /* See SD_* */
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 0a969affca76..df0e84462e62 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1489,6 +1489,7 @@ struct task_numa_env {
> >
> >         int src_cpu, src_nid;
> >         int dst_cpu, dst_nid;
> > +       int imb_numa_nr;
> >
> >         struct numa_stats src_stats, dst_stats;
> >
> > @@ -1504,7 +1505,8 @@ static unsigned long cpu_load(struct rq *rq);
> >  static unsigned long cpu_runnable(struct rq *rq);
> >  static unsigned long cpu_util(int cpu);
> >  static inline long adjust_numa_imbalance(int imbalance,
> > -                                       int dst_running, int dst_weight);
> > +                                       int dst_running,
> > +                                       int imb_numa_nr);
> >
> >  static inline enum
> >  numa_type numa_classify(unsigned int imbalance_pct,
> > @@ -1885,7 +1887,7 @@ static void task_numa_find_cpu(struct task_numa_env *env,
> >                 dst_running = env->dst_stats.nr_running + 1;
> >                 imbalance = max(0, dst_running - src_running);
> >                 imbalance = adjust_numa_imbalance(imbalance, dst_running,
> > -                                                       env->dst_stats.weight);
> > +                                                 env->imb_numa_nr);
> >
> >                 /* Use idle CPU if there is no imbalance */
> >                 if (!imbalance) {
> > @@ -1950,8 +1952,10 @@ static int task_numa_migrate(struct task_struct *p)
> >          */
> >         rcu_read_lock();
> >         sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu));
> > -       if (sd)
> > +       if (sd) {
> >                 env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
> > +               env.imb_numa_nr = sd->imb_numa_nr;
> > +       }
> >         rcu_read_unlock();
> >
> >         /*
> > @@ -9050,9 +9054,9 @@ static bool update_pick_idlest(struct sched_group *idlest,
> >   * This is an approximation as the number of running tasks may not be
> >   * related to the number of busy CPUs due to sched_setaffinity.
> >   */
> > -static inline bool allow_numa_imbalance(int dst_running, int dst_weight)
> > +static inline bool allow_numa_imbalance(int dst_running, int imb_numa_nr)
> >  {
> > -       return (dst_running < (dst_weight >> 2));
> > +       return dst_running < imb_numa_nr;
> >  }
> >
> >  /*
> > @@ -9186,12 +9190,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> >                                 return idlest;
> >  #endif
> >                         /*
> > -                        * Otherwise, keep the task on this node to stay close
> > -                        * its wakeup source and improve locality. If there is
> > -                        * a real need of migration, periodic load balance will
> > -                        * take care of it.
> > +                        * Otherwise, keep the task on this node to stay local
> > +                        * to its wakeup source if the number of running tasks
> > +                        * are below the allowed imbalance. If there is a real
> > +                        * need of migration, periodic load balance will take
> > +                        * care of it.
> >                          */
> > -                       if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight))
> > +                       if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->imb_numa_nr))
> >                                 return NULL;
> >                 }
> >
> > @@ -9280,19 +9285,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> >         }
> >  }
> >
> > -#define NUMA_IMBALANCE_MIN 2
> > -
> >  static inline long adjust_numa_imbalance(int imbalance,
> > -                               int dst_running, int dst_weight)
> > +                               int dst_running, int imb_numa_nr)
> >  {
> > -       if (!allow_numa_imbalance(dst_running, dst_weight))
> > +       if (!allow_numa_imbalance(dst_running, imb_numa_nr))
> >                 return imbalance;
> >
> > -       /*
> > -        * Allow a small imbalance based on a simple pair of communicating
> > -        * tasks that remain local when the destination is lightly loaded.
> > -        */
> > -       if (imbalance <= NUMA_IMBALANCE_MIN)
> > +       if (imbalance <= imb_numa_nr)
>
> Isn't this always true ?
>
> imbalance is "always" < dst_running as imbalance is usually the number
> of these tasks that we would like to migrate
>
>
> >                 return 0;
> >
> >         return imbalance;
> > @@ -9397,7 +9396,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> >                 /* Consider allowing a small imbalance between NUMA groups */
> >                 if (env->sd->flags & SD_NUMA) {
> >                         env->imbalance = adjust_numa_imbalance(env->imbalance,
> > -                               busiest->sum_nr_running, env->sd->span_weight);
> > +                               busiest->sum_nr_running, env->sd->imb_numa_nr);
> >                 }
> >
> >                 return;
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index d201a7052a29..1fa3e977521d 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -2242,6 +2242,55 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> >                 }
> >         }
> >
> > +       /*
> > +        * Calculate an allowed NUMA imbalance such that LLCs do not get
> > +        * imbalanced.
> > +        */
> > +       for_each_cpu(i, cpu_map) {
> > +               unsigned int imb = 0;
> > +               unsigned int imb_span = 1;
> > +
> > +               for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> > +                       struct sched_domain *child = sd->child;
> > +
> > +                       if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> > +                           (child->flags & SD_SHARE_PKG_RESOURCES)) {
>
> sched_domains have not been degenerated yet so you found here the DIE domain
>
> > +                               struct sched_domain *top, *top_p;
> > +                               unsigned int llc_sq;
> > +
> > +                               /*
> > +                                * nr_llcs = (sd->span_weight / llc_weight);
> > +                                * imb = (llc_weight / nr_llcs) >> 2
>
> it would be good to add a comment to explain why 25% of LLC weight /
> number of LLC in a node is the right value.
> For example, why is it better than just 25% of the LLC weight ?
> Do you want to allow the same imbalance at node level whatever the
> number of LLC in the node ?
>
> > +                                *
> > +                                * is equivalent to
> > +                                *
> > +                                * imb = (llc_weight^2 / sd->span_weight) >> 2
> > +                                *
> > +                                */
> > +                               llc_sq = child->span_weight * child->span_weight;
> > +
> > +                               imb = max(2U, ((llc_sq / sd->span_weight) >> 2));
> > +                               sd->imb_numa_nr = imb;
> > +
> > +                               /*
> > +                                * Set span based on top domain that places
> > +                                * tasks in sibling domains.
> > +                                */
> > +                               top = sd;
> > +                               top_p = top->parent;
> > +                               while (top_p && (top_p->flags & SD_PREFER_SIBLING)) {
>
> Why are you looping on SD_PREFER_SIBLING  instead of SD_NUMA  ?
> Apart the heterogeneous domain (SD_ASYM_CPUCAPACITY) but I'm not sure
> that you want to take this case into account, only numa node don't
> have SD_PREFER_SIBLING
>
> > +                                       top = top->parent;
> > +                                       top_p = top->parent;
> > +                               }
> > +                               imb_span = top_p ? top_p->span_weight : sd->span_weight;
> > +                       } else {
> > +                               int factor = max(1U, (sd->span_weight / imb_span));
> > +
> > +                               sd->imb_numa_nr = imb * factor;
> > +                       }
> > +               }
> > +       }
> > +
> >         /* Calculate CPU capacity for physical packages and nodes */
> >         for (i = nr_cpumask_bits-1; i >= 0; i--) {
> >                 if (!cpumask_test_cpu(i, cpu_map))
>


-- 
-Jirka


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-22  8:52                   ` Jirka Hladky
@ 2022-01-04 19:52                     ` Jirka Hladky
  0 siblings, 0 replies; 27+ messages in thread
From: Jirka Hladky @ 2022-01-04 19:52 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Mel Gorman, Gautham R. Shenoy, Peter Zijlstra, Ingo Molnar,
	Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith,
	Srikar Dronamraju, LKML, Philip Auld

Hi Mel,

the table with results is badly formatted. Let me send it again:

We have tested the performance impact of this patch and we see
performance drops up to 60% with the OpenMP implementation of the NAS
parallel benchmark.

Example of results:
2x AMD EPYC 7313 16-Core server (total 32 cores, 64 CPUs)

Kernel                5.16.0-rc5    5.16.0-rc5+patch_v4    Diff
                      Mop/s tot     Mop/s tot
sp_C_x,  8 threads    36316.1       12939.6                -64%
sp_C_x, 16 threads    64790.2       23968.0                -63%
sp_C_x, 32 threads    67205.5       48891.4                -27%

Other NAS subtests (bt_C_x, is_D_x, ua_C_x) show similar results.

It seems like the allowed imbalance is too large and the negative
impact on workloads that prefer to span wide across all NUMA nodes is
too large.

Thanks
Jirka


On Wed, Dec 22, 2021 at 9:52 AM Jirka Hladky <jhladky@redhat.com> wrote:
>
> Hi Mel,
>
> we have tested the performance impact of this patch and we see
> performance drops up to 60% with the OpenMP implementation of the NAS
> parallel benchmark.
>
> Example of results:
> 2x AMD EPYC 7313 16-Core server (total 32 cores, 64 CPUs)
>
>                       5.16.0-rc5     5.16.0-rc5            Difference
>                       (vanilla)      + v4 of the patch     in %
>                       Mop/s total    Mop/s total
> sp_C_x,  8 threads    36316.1        12939.6               -64%
> sp_C_x, 16 threads    64790.2        23968.0               -63%
> sp_C_x, 32 threads    67205.5        48891.4               -27%
>
> Other NAS subtests (bt_C_x, is_D_x, ua_C_x) show similar results.
>
> It seems like the allowed imbalance is too large and the negative
> impact on workloads that prefer to span wide across all NUMA nodes is
> too large.
>
> Thanks
> Jirka
>
>
> On Tue, Dec 21, 2021 at 6:13 PM Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
> >
> > On Mon, 20 Dec 2021 at 12:12, Mel Gorman <mgorman@techsingularity.net> wrote:
> > >
> > > (sorry for the delay, was offline for a few days)
> > >
> > > On Fri, Dec 17, 2021 at 12:03:06AM +0530, Gautham R. Shenoy wrote:
> > > > Hello Mel,
> > > >
> > > > On Wed, Dec 15, 2021 at 12:25:50PM +0000, Mel Gorman wrote:
> > > > > On Wed, Dec 15, 2021 at 05:22:30PM +0530, Gautham R. Shenoy wrote:
> > > >
> > > > [..SNIP..]
> > > >
> >
> > [snip]
> >
> > >
> > > To avoid the corner case, we'd need to explicitly favour spreading early
> > > and assume wakeup will pull communicating tasks together and NUMA
> > > balancing migrate the data after some time which looks like
> > >
> > > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> > > index c07bfa2d80f2..54f5207154d3 100644
> > > --- a/include/linux/sched/topology.h
> > > +++ b/include/linux/sched/topology.h
> > > @@ -93,6 +93,7 @@ struct sched_domain {
> > >         unsigned int busy_factor;       /* less balancing by factor if busy */
> > >         unsigned int imbalance_pct;     /* No balance until over watermark */
> > >         unsigned int cache_nice_tries;  /* Leave cache hot tasks for # tries */
> > > +       unsigned int imb_numa_nr;       /* Nr imbalanced tasks allowed between nodes */
> >
> > So now you compute an allowed imbalance level instead of using
> > 25% of sd->span_weight
> > or
> > 25% of busiest->group_weight
> >
> > And you adjust this new imb_numa_nr according to the topology.
> >
> > That makes sense.
> >
> > >
> > >         int nohz_idle;                  /* NOHZ IDLE status */
> > >         int flags;                      /* See SD_* */
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 0a969affca76..df0e84462e62 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -1489,6 +1489,7 @@ struct task_numa_env {
> > >
> > >         int src_cpu, src_nid;
> > >         int dst_cpu, dst_nid;
> > > +       int imb_numa_nr;
> > >
> > >         struct numa_stats src_stats, dst_stats;
> > >
> > > @@ -1504,7 +1505,8 @@ static unsigned long cpu_load(struct rq *rq);
> > >  static unsigned long cpu_runnable(struct rq *rq);
> > >  static unsigned long cpu_util(int cpu);
> > >  static inline long adjust_numa_imbalance(int imbalance,
> > > -                                       int dst_running, int dst_weight);
> > > +                                       int dst_running,
> > > +                                       int imb_numa_nr);
> > >
> > >  static inline enum
> > >  numa_type numa_classify(unsigned int imbalance_pct,
> > > @@ -1885,7 +1887,7 @@ static void task_numa_find_cpu(struct task_numa_env *env,
> > >                 dst_running = env->dst_stats.nr_running + 1;
> > >                 imbalance = max(0, dst_running - src_running);
> > >                 imbalance = adjust_numa_imbalance(imbalance, dst_running,
> > > -                                                       env->dst_stats.weight);
> > > +                                                 env->imb_numa_nr);
> > >
> > >                 /* Use idle CPU if there is no imbalance */
> > >                 if (!imbalance) {
> > > @@ -1950,8 +1952,10 @@ static int task_numa_migrate(struct task_struct *p)
> > >          */
> > >         rcu_read_lock();
> > >         sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu));
> > > -       if (sd)
> > > +       if (sd) {
> > >                 env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
> > > +               env.imb_numa_nr = sd->imb_numa_nr;
> > > +       }
> > >         rcu_read_unlock();
> > >
> > >         /*
> > > @@ -9050,9 +9054,9 @@ static bool update_pick_idlest(struct sched_group *idlest,
> > >   * This is an approximation as the number of running tasks may not be
> > >   * related to the number of busy CPUs due to sched_setaffinity.
> > >   */
> > > -static inline bool allow_numa_imbalance(int dst_running, int dst_weight)
> > > +static inline bool allow_numa_imbalance(int dst_running, int imb_numa_nr)
> > >  {
> > > -       return (dst_running < (dst_weight >> 2));
> > > +       return dst_running < imb_numa_nr;
> > >  }
> > >
> > >  /*
> > > @@ -9186,12 +9190,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> > >                                 return idlest;
> > >  #endif
> > >                         /*
> > > -                        * Otherwise, keep the task on this node to stay close
> > > -                        * its wakeup source and improve locality. If there is
> > > -                        * a real need of migration, periodic load balance will
> > > -                        * take care of it.
> > > +                        * Otherwise, keep the task on this node to stay local
> > > +                        * to its wakeup source if the number of running tasks
> > > +                        * are below the allowed imbalance. If there is a real
> > > +                        * need of migration, periodic load balance will take
> > > +                        * care of it.
> > >                          */
> > > -                       if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight))
> > > +                       if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->imb_numa_nr))
> > >                                 return NULL;
> > >                 }
> > >
> > > @@ -9280,19 +9285,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> > >         }
> > >  }
> > >
> > > -#define NUMA_IMBALANCE_MIN 2
> > > -
> > >  static inline long adjust_numa_imbalance(int imbalance,
> > > -                               int dst_running, int dst_weight)
> > > +                               int dst_running, int imb_numa_nr)
> > >  {
> > > -       if (!allow_numa_imbalance(dst_running, dst_weight))
> > > +       if (!allow_numa_imbalance(dst_running, imb_numa_nr))
> > >                 return imbalance;
> > >
> > > -       /*
> > > -        * Allow a small imbalance based on a simple pair of communicating
> > > -        * tasks that remain local when the destination is lightly loaded.
> > > -        */
> > > -       if (imbalance <= NUMA_IMBALANCE_MIN)
> > > +       if (imbalance <= imb_numa_nr)
> >
> > Isn't this always true ?
> >
> > imbalance is "always" < dst_running as imbalance is usually the number
> > of these tasks that we would like to migrate
> >
> >
> > >                 return 0;
> > >
> > >         return imbalance;
> > > @@ -9397,7 +9396,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> > >                 /* Consider allowing a small imbalance between NUMA groups */
> > >                 if (env->sd->flags & SD_NUMA) {
> > >                         env->imbalance = adjust_numa_imbalance(env->imbalance,
> > > -                               busiest->sum_nr_running, env->sd->span_weight);
> > > +                               busiest->sum_nr_running, env->sd->imb_numa_nr);
> > >                 }
> > >
> > >                 return;
> > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > > index d201a7052a29..1fa3e977521d 100644
> > > --- a/kernel/sched/topology.c
> > > +++ b/kernel/sched/topology.c
> > > @@ -2242,6 +2242,55 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> > >                 }
> > >         }
> > >
> > > +       /*
> > > +        * Calculate an allowed NUMA imbalance such that LLCs do not get
> > > +        * imbalanced.
> > > +        */
> > > +       for_each_cpu(i, cpu_map) {
> > > +               unsigned int imb = 0;
> > > +               unsigned int imb_span = 1;
> > > +
> > > +               for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> > > +                       struct sched_domain *child = sd->child;
> > > +
> > > +                       if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> > > +                           (child->flags & SD_SHARE_PKG_RESOURCES)) {
> >
> > sched_domains have not been degenerated yet so you found here the DIE domain
> >
> > > +                               struct sched_domain *top, *top_p;
> > > +                               unsigned int llc_sq;
> > > +
> > > +                               /*
> > > +                                * nr_llcs = (sd->span_weight / llc_weight);
> > > +                                * imb = (llc_weight / nr_llcs) >> 2
> >
> > it would be good to add a comment to explain why 25% of LLC weight /
> > number of LLC in a node is the right value.
> > For example, why is it better than just 25% of the LLC weight ?
> > Do you want to allow the same imbalance at node level whatever the
> > number of LLC in the node ?
> >
> > > +                                *
> > > +                                * is equivalent to
> > > +                                *
> > > +                                * imb = (llc_weight^2 / sd->span_weight) >> 2
> > > +                                *
> > > +                                */
> > > +                               llc_sq = child->span_weight * child->span_weight;
> > > +
> > > +                               imb = max(2U, ((llc_sq / sd->span_weight) >> 2));
> > > +                               sd->imb_numa_nr = imb;
> > > +
> > > +                               /*
> > > +                                * Set span based on top domain that places
> > > +                                * tasks in sibling domains.
> > > +                                */
> > > +                               top = sd;
> > > +                               top_p = top->parent;
> > > +                               while (top_p && (top_p->flags & SD_PREFER_SIBLING)) {
> >
> > Why are you looping on SD_PREFER_SIBLING  instead of SD_NUMA  ?
> > Apart the heterogeneous domain (SD_ASYM_CPUCAPACITY) but I'm not sure
> > that you want to take this case into account, only numa node don't
> > have SD_PREFER_SIBLING
> >
> > > +                                       top = top->parent;
> > > +                                       top_p = top->parent;
> > > +                               }
> > > +                               imb_span = top_p ? top_p->span_weight : sd->span_weight;
> > > +                       } else {
> > > +                               int factor = max(1U, (sd->span_weight / imb_span));
> > > +
> > > +                               sd->imb_numa_nr = imb * factor;
> > > +                       }
> > > +               }
> > > +       }
> > > +
> > >         /* Calculate CPU capacity for physical packages and nodes */
> > >         for (i = nr_cpumask_bits-1; i >= 0; i--) {
> > >                 if (!cpumask_test_cpu(i, cpu_map))
> >
>
>
> --
> -Jirka



-- 
-Jirka


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-21 17:13                 ` Vincent Guittot
  2021-12-22  8:52                   ` Jirka Hladky
@ 2022-01-05 10:42                   ` Mel Gorman
  2022-01-05 10:49                     ` Mel Gorman
  2022-01-10 15:53                     ` Vincent Guittot
  1 sibling, 2 replies; 27+ messages in thread
From: Mel Gorman @ 2022-01-05 10:42 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Gautham R. Shenoy, Peter Zijlstra, Ingo Molnar,
	Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith,
	Srikar Dronamraju, LKML

On Tue, Dec 21, 2021 at 06:13:15PM +0100, Vincent Guittot wrote:
> > <SNIP>
> >
> > @@ -9050,9 +9054,9 @@ static bool update_pick_idlest(struct sched_group *idlest,
> >   * This is an approximation as the number of running tasks may not be
> >   * related to the number of busy CPUs due to sched_setaffinity.
> >   */
> > -static inline bool allow_numa_imbalance(int dst_running, int dst_weight)
> > +static inline bool allow_numa_imbalance(int dst_running, int imb_numa_nr)
> >  {
> > -       return (dst_running < (dst_weight >> 2));
> > +       return dst_running < imb_numa_nr;
> >  }
> >
> >  /*
> >
> > <SNIP>
> >
> > @@ -9280,19 +9285,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> >         }
> >  }
> >
> > -#define NUMA_IMBALANCE_MIN 2
> > -
> >  static inline long adjust_numa_imbalance(int imbalance,
> > -                               int dst_running, int dst_weight)
> > +                               int dst_running, int imb_numa_nr)
> >  {
> > -       if (!allow_numa_imbalance(dst_running, dst_weight))
> > +       if (!allow_numa_imbalance(dst_running, imb_numa_nr))
> >                 return imbalance;
> >
> > -       /*
> > -        * Allow a small imbalance based on a simple pair of communicating
> > -        * tasks that remain local when the destination is lightly loaded.
> > -        */
> > -       if (imbalance <= NUMA_IMBALANCE_MIN)
> > +       if (imbalance <= imb_numa_nr)
> 
> Isn't this always true ?
> 
> imbalance is "always" < dst_running as imbalance is usually the number
> of these tasks that we would like to migrate
> 

It's not necessarily true. allow_numa_imbalance() is checking whether
dst_running < imb_numa_nr and adjust_numa_imbalance() is checking the
imbalance.

imb_numa_nr = 4
dst_running = 2
imbalance   = 1

In that case, imbalance of 1 is ok, but 2 is not.

> 
> >                 return 0;
> >
> >         return imbalance;
> > @@ -9397,7 +9396,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> >                 /* Consider allowing a small imbalance between NUMA groups */
> >                 if (env->sd->flags & SD_NUMA) {
> >                         env->imbalance = adjust_numa_imbalance(env->imbalance,
> > -                               busiest->sum_nr_running, env->sd->span_weight);
> > +                               busiest->sum_nr_running, env->sd->imb_numa_nr);
> >                 }
> >
> >                 return;
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index d201a7052a29..1fa3e977521d 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -2242,6 +2242,55 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> >                 }
> >         }
> >
> > +       /*
> > +        * Calculate an allowed NUMA imbalance such that LLCs do not get
> > +        * imbalanced.
> > +        */
> > +       for_each_cpu(i, cpu_map) {
> > +               unsigned int imb = 0;
> > +               unsigned int imb_span = 1;
> > +
> > +               for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> > +                       struct sched_domain *child = sd->child;
> > +
> > +                       if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> > +                           (child->flags & SD_SHARE_PKG_RESOURCES)) {
> 
> sched_domains have not been degenerated yet so you found here the DIE domain
> 

Yes

> > +                               struct sched_domain *top, *top_p;
> > +                               unsigned int llc_sq;
> > +
> > +                               /*
> > +                                * nr_llcs = (sd->span_weight / llc_weight);
> > +                                * imb = (llc_weight / nr_llcs) >> 2
> 
> it would be good to add a comment to explain why 25% of LLC weight /
> number of LLC in a node is the right value.

This?

                                 * The 25% imbalance is an arbitrary cutoff
                                 * based on SMT-2 to balance between memory
                                 * bandwidth and avoiding premature sharing
                                 * of HT resources and SMT-4 or SMT-8 *may*
                                 * benefit from a different cutoff. nr_llcs
                                 * are accounted for to mitigate premature
                                 * cache eviction due to multiple tasks
                                 * using one cache while a sibling cache
                                 * remains relatively idle.

> For example, why is it better than just 25% of the LLC weight ?

Because, let's say there are 2 LLCs; then an imbalance based on just
the LLC weight might allow 2 tasks to share one cache while another is
idle. This is the original problem whereby the vanilla imbalance
allowed multiple LLCs on the same node to be overloaded, which hurt
workloads that prefer to spread wide.
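
A minimal userspace sketch of that arithmetic (the 16-CPU LLC and
32-CPU node are assumed numbers for illustration; it only mirrors the
formula in the quoted hunk, it is not kernel code):

    #include <stdio.h>

    int main(void)
    {
            unsigned int llc_weight = 16; /* CPUs per LLC (assumed) */
            unsigned int node_span  = 32; /* CPUs per node, i.e. 2 LLCs */

            /* 25% of the LLC weight alone */
            unsigned int imb_llc_only = llc_weight >> 2;

            /* (llc_weight^2 / span_weight) >> 2, clamped to at least 2 */
            unsigned int imb_scaled = (llc_weight * llc_weight / node_span) >> 2;

            if (imb_scaled < 2)
                    imb_scaled = 2;

            /* prints: llc only: 4, scaled by nr_llcs: 2 */
            printf("llc only: %u, scaled by nr_llcs: %u\n",
                   imb_llc_only, imb_scaled);
            return 0;
    }

With the LLC-only cut-off of 4, two communicating pairs can be kept on
the node and may end up on one LLC while the sibling LLC idles; the
scaled cut-off of 2 tolerates only a single pair.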

> Do you want to allow the same imbalance at node level whatever the
> number of LLC in the node ?
> 

At this point, it's less clear how the larger domains should be
balanced and the initial scaling is as good an option as any.
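
For reference, tracing that scaling with the Zen3 NPS=2 numbers quoted
earlier in the thread (llc_weight=16, node-level span_weight=64,
imb_span=128):

    imb at the node-covering domain = max(2U, (16*16/64) >> 2) = 2
    NUMA sd, span 128: imb_numa_nr  = 2 * (128/128) = 2
    NUMA sd, span 256: imb_numa_nr  = 2 * (256/128) = 4

so the per-node allowance is kept as-is within the socket and doubled
for the domain that spans both sockets.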

> > +                                *
> > +                                * is equivalent to
> > +                                *
> > +                                * imb = (llc_weight^2 / sd->span_weight) >> 2
> > +                                *
> > +                                */
> > +                               llc_sq = child->span_weight * child->span_weight;
> > +
> > +                               imb = max(2U, ((llc_sq / sd->span_weight) >> 2));
> > +                               sd->imb_numa_nr = imb;
> > +
> > +                               /*
> > +                                * Set span based on top domain that places
> > +                                * tasks in sibling domains.
> > +                                */
> > +                               top = sd;
> > +                               top_p = top->parent;
> > +                               while (top_p && (top_p->flags & SD_PREFER_SIBLING)) {
> 
> Why are you looping on SD_PREFER_SIBLING  instead of SD_NUMA  ?

Because on AMD Zen3, I saw inconsistent treatment of SD_NUMA prior to
degeneration depending on whether it was NPS-1, NPS-2 or NPS-4 and only
SD_PREFER_SIBLING gave the current results.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2022-01-05 10:42                   ` Mel Gorman
@ 2022-01-05 10:49                     ` Mel Gorman
  2022-01-10 15:53                     ` Vincent Guittot
  1 sibling, 0 replies; 27+ messages in thread
From: Mel Gorman @ 2022-01-05 10:49 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Gautham R. Shenoy, Peter Zijlstra, Ingo Molnar,
	Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith,
	Srikar Dronamraju, LKML

On Wed, Jan 05, 2022 at 10:42:07AM +0000, Mel Gorman wrote:
> On Tue, Dec 21, 2021 at 06:13:15PM +0100, Vincent Guittot wrote:
> > > <SNIP>
> > >
> > > @@ -9050,9 +9054,9 @@ static bool update_pick_idlest(struct sched_group *idlest,
> > >   * This is an approximation as the number of running tasks may not be
> > >   * related to the number of busy CPUs due to sched_setaffinity.
> > >   */
> > > -static inline bool allow_numa_imbalance(int dst_running, int dst_weight)
> > > +static inline bool allow_numa_imbalance(int dst_running, int imb_numa_nr)
> > >  {
> > > -       return (dst_running < (dst_weight >> 2));
> > > +       return dst_running < imb_numa_nr;
> > >  }
> > >
> > >  /*
> > >
> > > <SNIP>
> > >
> > > @@ -9280,19 +9285,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> > >         }
> > >  }
> > >
> > > -#define NUMA_IMBALANCE_MIN 2
> > > -
> > >  static inline long adjust_numa_imbalance(int imbalance,
> > > -                               int dst_running, int dst_weight)
> > > +                               int dst_running, int imb_numa_nr)
> > >  {
> > > -       if (!allow_numa_imbalance(dst_running, dst_weight))
> > > +       if (!allow_numa_imbalance(dst_running, imb_numa_nr))
> > >                 return imbalance;
> > >
> > > -       /*
> > > -        * Allow a small imbalance based on a simple pair of communicating
> > > -        * tasks that remain local when the destination is lightly loaded.
> > > -        */
> > > -       if (imbalance <= NUMA_IMBALANCE_MIN)
> > > +       if (imbalance <= imb_numa_nr)
> > 
> > Isn't this always true ?
> > 
> > imbalance is "always" < dst_running as imbalance is usually the number
> > of these tasks that we would like to migrate
> > 
> 
> It's not necessarily true. allow_numa_imbalanced is checking if
> dst_running < imb_numa_nr and adjust_numa_imbalance is checking the
> imbalance.
> 
> imb_numa_nr = 4
> dst_running = 2
> imbalance   = 1
> 
> In that case, imbalance of 1 is ok, but 2 is not.
> 

My bad, this is based on v5 which I just queued for testing.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2022-01-05 10:42                   ` Mel Gorman
  2022-01-05 10:49                     ` Mel Gorman
@ 2022-01-10 15:53                     ` Vincent Guittot
  2022-01-12 10:24                       ` Mel Gorman
  1 sibling, 1 reply; 27+ messages in thread
From: Vincent Guittot @ 2022-01-10 15:53 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Gautham R. Shenoy, Peter Zijlstra, Ingo Molnar,
	Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith,
	Srikar Dronamraju, LKML

On Wed, 5 Jan 2022 at 11:42, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Tue, Dec 21, 2021 at 06:13:15PM +0100, Vincent Guittot wrote:
> > > <SNIP>
> > >
> > > @@ -9050,9 +9054,9 @@ static bool update_pick_idlest(struct sched_group *idlest,
> > >   * This is an approximation as the number of running tasks may not be
> > >   * related to the number of busy CPUs due to sched_setaffinity.
> > >   */
> > > -static inline bool allow_numa_imbalance(int dst_running, int dst_weight)
> > > +static inline bool allow_numa_imbalance(int dst_running, int imb_numa_nr)
> > >  {
> > > -       return (dst_running < (dst_weight >> 2));
> > > +       return dst_running < imb_numa_nr;
> > >  }
> > >
> > >  /*
> > >
> > > <SNIP>
> > >
> > > @@ -9280,19 +9285,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> > >         }
> > >  }
> > >
> > > -#define NUMA_IMBALANCE_MIN 2
> > > -
> > >  static inline long adjust_numa_imbalance(int imbalance,
> > > -                               int dst_running, int dst_weight)
> > > +                               int dst_running, int imb_numa_nr)
> > >  {
> > > -       if (!allow_numa_imbalance(dst_running, dst_weight))
> > > +       if (!allow_numa_imbalance(dst_running, imb_numa_nr))
> > >                 return imbalance;
> > >
> > > -       /*
> > > -        * Allow a small imbalance based on a simple pair of communicating
> > > -        * tasks that remain local when the destination is lightly loaded.
> > > -        */
> > > -       if (imbalance <= NUMA_IMBALANCE_MIN)
> > > +       if (imbalance <= imb_numa_nr)
> >
> > Isn't this always true ?
> >
> > imbalance is "always" < dst_running as imbalance is usually the number
> > of these tasks that we would like to migrate
> >
>
> It's not necessarily true. allow_numa_imbalance is checking if
> dst_running < imb_numa_nr and adjust_numa_imbalance is checking the
> imbalance.
>
> imb_numa_nr = 4
> dst_running = 2
> imbalance   = 1
>
> In that case, imbalance of 1 is ok, but 2 is not.

I don't follow your example. Why is an imbalance of 2 not ok in your
example above? allow_numa_imbalance still returns true (dst_running <
imb_numa_nr) and we still have imbalance <= imb_numa_nr.

Also, the name dst_running is quite confusing; in the case of
calculate_imbalance, busiest->nr_running is passed as the dst_running
argument, but the busiest group is the src, not the dst, of the balance.

Then, imbalance < busiest->nr_running in load_balance because we try to
even out the number of tasks running in each group without emptying it,
and allow_numa_imbalance checks that dst_running < imb_numa_nr. So we
have imbalance < dst_running < imb_numa_nr.
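
To make it concrete, here is a minimal user-space sketch of the two
helpers from the hunks quoted above (all the lb_env plumbing dropped,
so this is only an illustration):

#include <stdio.h>
#include <stdbool.h>

/* Copies of the two helpers from the hunks quoted above, simplified */
static bool allow_numa_imbalance(int dst_running, int imb_numa_nr)
{
        return dst_running < imb_numa_nr;
}

static long adjust_numa_imbalance(int imbalance, int dst_running,
                                  int imb_numa_nr)
{
        if (!allow_numa_imbalance(dst_running, imb_numa_nr))
                return imbalance;

        /* "<=" means any imbalance up to imb_numa_nr is tolerated */
        if (imbalance <= imb_numa_nr)
                return 0;

        return imbalance;
}

int main(void)
{
        /* imb_numa_nr = 4, dst_running = 2 as in the example above */
        printf("%ld\n", adjust_numa_imbalance(1, 2, 4)); /* 0: tolerated */
        printf("%ld\n", adjust_numa_imbalance(2, 2, 4)); /* 0: also tolerated */
        printf("%ld\n", adjust_numa_imbalance(5, 2, 4)); /* 5: corrected */
        return 0;
}

With these values an imbalance of 1 and an imbalance of 2 are both left
alone, which is why the distinction in the example is not clear to me.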

>
> >
> > >                 return 0;
> > >
> > >         return imbalance;
> > > @@ -9397,7 +9396,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> > >                 /* Consider allowing a small imbalance between NUMA groups */
> > >                 if (env->sd->flags & SD_NUMA) {
> > >                         env->imbalance = adjust_numa_imbalance(env->imbalance,
> > > -                               busiest->sum_nr_running, env->sd->span_weight);
> > > +                               busiest->sum_nr_running, env->sd->imb_numa_nr);
> > >                 }
> > >
> > >                 return;
> > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > > index d201a7052a29..1fa3e977521d 100644
> > > --- a/kernel/sched/topology.c
> > > +++ b/kernel/sched/topology.c
> > > @@ -2242,6 +2242,55 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> > >                 }
> > >         }
> > >
> > > +       /*
> > > +        * Calculate an allowed NUMA imbalance such that LLCs do not get
> > > +        * imbalanced.
> > > +        */
> > > +       for_each_cpu(i, cpu_map) {
> > > +               unsigned int imb = 0;
> > > +               unsigned int imb_span = 1;
> > > +
> > > +               for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> > > +                       struct sched_domain *child = sd->child;
> > > +
> > > +                       if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> > > +                           (child->flags & SD_SHARE_PKG_RESOURCES)) {
> >
> > sched_domains have not been degenerated yet so you found here the DIE domain
> >
>
> Yes
>
> > > +                               struct sched_domain *top, *top_p;
> > > +                               unsigned int llc_sq;
> > > +
> > > +                               /*
> > > +                                * nr_llcs = (sd->span_weight / llc_weight);
> > > +                                * imb = (llc_weight / nr_llcs) >> 2
> >
> > it would be good to add a comment to explain why 25% of LLC weight /
> > number of LLC in a node is the right value.
>
> This?
>
>                                  * The 25% imbalance is an arbitrary cutoff
>                                  * based on SMT-2 to balance between memory
>                                  * bandwidth and avoiding premature sharing
>                                  * of HT resources and SMT-4 or SMT-8 *may*
>                                  * benefit from a different cutoff. nr_llcs
>                                  * are accounted for to mitigate premature
>                                  * cache eviction due to multiple tasks
>                                  * using one cache while a sibling cache
>                                  * remains relatively idle.
>
> > For example, why is it better than just 25% of the LLC weight ?
>
> Because, let's say, there are 2 LLCs: an imbalance based on just the LLC
> weight might allow 2 tasks to share one cache while another is idle. This
> is the original problem, whereby the vanilla imbalance allowed multiple
> LLCs on the same node to be overloaded, which hurt workloads that prefer
> to spread wide.

In this case, shouldn't it be (llc_weight >> 2) * nr_llcs, to fill each
LLC up to 25%, instead of dividing by nr_llcs?

As an example:
1 node with 1 LLC of 128 CPUs gets an imb_numa_nr = 32
1 node with 2 LLCs of 64 CPUs each gets an imb_numa_nr = 8
1 node with 4 LLCs of 32 CPUs each gets an imb_numa_nr = 2

sd->imb_numa_nr is used at the NUMA level, so the more LLCs you have,
the lower the imbalance that is allowed
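
Those numbers fall straight out of the imb calculation quoted above; a
small user-space sketch of just that arithmetic (no sched_domain walk,
illustration only) reproduces them:

#include <stdio.h>

/* imb = max(2, (llc_weight^2 / node_weight) >> 2), as in the hunk above */
static unsigned int calc_imb_numa_nr(unsigned int node_weight,
                                     unsigned int llc_weight)
{
        unsigned int llc_sq = llc_weight * llc_weight;
        unsigned int imb = (llc_sq / node_weight) >> 2;

        return imb > 2 ? imb : 2;
}

int main(void)
{
        printf("%u\n", calc_imb_numa_nr(128, 128)); /* 1 LLC of 128 CPUs:  32 */
        printf("%u\n", calc_imb_numa_nr(128,  64)); /* 2 LLCs of 64 CPUs:   8 */
        printf("%u\n", calc_imb_numa_nr(128,  32)); /* 4 LLCs of 32 CPUs:   2 */
        return 0;
}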

>
> > Do you want to allow the same imbalance at node level whatever the
> > number of LLC in the node ?
> >
>
> At this point, it's less clear how the larger domains should be
> balanced and the initial scaling is as good an option as any.
>
> > > +                                *
> > > +                                * is equivalent to
> > > +                                *
> > > +                                * imb = (llc_weight^2 / sd->span_weight) >> 2
> > > +                                *
> > > +                                */
> > > +                               llc_sq = child->span_weight * child->span_weight;
> > > +
> > > +                               imb = max(2U, ((llc_sq / sd->span_weight) >> 2));
> > > +                               sd->imb_numa_nr = imb;
> > > +
> > > +                               /*
> > > +                                * Set span based on top domain that places
> > > +                                * tasks in sibling domains.
> > > +                                */
> > > +                               top = sd;
> > > +                               top_p = top->parent;
> > > +                               while (top_p && (top_p->flags & SD_PREFER_SIBLING)) {
> >
> > Why are you looping on SD_PREFER_SIBLING instead of SD_NUMA?
>
> Because on AMD Zen3, I saw inconsistent treatment of SD_NUMA prior to
> degeneration depending on whether it was NPS-1, NPS-2 or NPS-4 and only
> SD_PREFER_SIBLING gave the current results.

SD_PREFER_SIBLING is not mandatory in the children of a NUMA node (look
at heterogeneous systems as an example), so relying on this flag seems
quite fragile.

The fact that you see inconsistency with SD_NUMA depending on the NPS-1,
NPS-2 or NPS-4 topology probably means that you are not looking at the
right domain level, or that you are trying to compensate for a side
effect of the formula above.

>
> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2022-01-10 15:53                     ` Vincent Guittot
@ 2022-01-12 10:24                       ` Mel Gorman
  0 siblings, 0 replies; 27+ messages in thread
From: Mel Gorman @ 2022-01-12 10:24 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Gautham R. Shenoy, Peter Zijlstra, Ingo Molnar,
	Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith,
	Srikar Dronamraju, LKML

On Mon, Jan 10, 2022 at 04:53:26PM +0100, Vincent Guittot wrote:
> On Wed, 5 Jan 2022 at 11:42, Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > On Tue, Dec 21, 2021 at 06:13:15PM +0100, Vincent Guittot wrote:
> > > > <SNIP>
> > > >
> > > > @@ -9050,9 +9054,9 @@ static bool update_pick_idlest(struct sched_group *idlest,
> > > >   * This is an approximation as the number of running tasks may not be
> > > >   * related to the number of busy CPUs due to sched_setaffinity.
> > > >   */
> > > > -static inline bool allow_numa_imbalance(int dst_running, int dst_weight)
> > > > +static inline bool allow_numa_imbalance(int dst_running, int imb_numa_nr)
> > > >  {
> > > > -       return (dst_running < (dst_weight >> 2));
> > > > +       return dst_running < imb_numa_nr;
> > > >  }
> > > >
> > > >  /*
> > > >
> > > > <SNIP>
> > > >
> > > > @@ -9280,19 +9285,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> > > >         }
> > > >  }
> > > >
> > > > -#define NUMA_IMBALANCE_MIN 2
> > > > -
> > > >  static inline long adjust_numa_imbalance(int imbalance,
> > > > -                               int dst_running, int dst_weight)
> > > > +                               int dst_running, int imb_numa_nr)
> > > >  {
> > > > -       if (!allow_numa_imbalance(dst_running, dst_weight))
> > > > +       if (!allow_numa_imbalance(dst_running, imb_numa_nr))
> > > >                 return imbalance;
> > > >
> > > > -       /*
> > > > -        * Allow a small imbalance based on a simple pair of communicating
> > > > -        * tasks that remain local when the destination is lightly loaded.
> > > > -        */
> > > > -       if (imbalance <= NUMA_IMBALANCE_MIN)
> > > > +       if (imbalance <= imb_numa_nr)
> > >
> > > Isn't this always true ?
> > >
> > > imbalance is "always" < dst_running as imbalance is usually the number
> > > of these tasks that we would like to migrate
> > >
> >
> > It's not necessarily true. allow_numa_imbalance is checking if
> > dst_running < imb_numa_nr and adjust_numa_imbalance is checking the
> > imbalance.
> >
> > imb_numa_nr = 4
> > dst_running = 2
> > imbalance   = 1
> >
> > In that case, imbalance of 1 is ok, but 2 is not.
> 
> I don't follow your example. Why is an imbalance of 2 not ok in your
> example above? allow_numa_imbalance still returns true (dst_running <
> imb_numa_nr) and we still have imbalance <= imb_numa_nr.
> 

At the time I wrote it, the comparison looked like < instead of <=.

> Also, the name dst_running is quite confusing; in the case of
> calculate_imbalance, busiest->nr_running is passed as the dst_running
> argument, but the busiest group is the src, not the dst, of the balance.
> 
> Then, imbalance < busiest->nr_running in load_balance because we try to
> even out the number of tasks running in each group without emptying it,
> and allow_numa_imbalance checks that dst_running < imb_numa_nr. So we
> have imbalance < dst_running < imb_numa_nr.
> 

But either way, you have a valid point. The patch as-is is too complex,
is doing too much and is failing to make progress as a result. I'm going
to go back to the drawing board and come up with a simpler version that
adjusts the cut-off depending on topology but only allows an imbalance of
NUMA_IMBALANCE_MIN, and tidy up the inconsistencies.
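
For the sake of discussion, a rough sketch of the shape I have in mind,
illustration only and not a tested patch:

#define NUMA_IMBALANCE_MIN	2

static inline long adjust_numa_imbalance(int imbalance, int dst_running,
                                         int imb_numa_nr)
{
        /*
         * Topology-dependent cut-off: only consider tolerating an
         * imbalance while the destination runs fewer than imb_numa_nr
         * tasks.
         */
        if (dst_running >= imb_numa_nr)
                return imbalance;

        /* Never tolerate more than the fixed minimum imbalance. */
        if (imbalance <= NUMA_IMBALANCE_MIN)
                return 0;

        return imbalance;
}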

> > This?
> >
> >                                  * The 25% imbalance is an arbitrary cutoff
> >                                  * based on SMT-2 to balance between memory
> >                                  * bandwidth and avoiding premature sharing
> >                                  * of HT resources and SMT-4 or SMT-8 *may*
> >                                  * benefit from a different cutoff. nr_llcs
> >                                  * are accounted for to mitigate premature
> >                                  * cache eviction due to multiple tasks
> >                                  * using one cache while a sibling cache
> >                                  * remains relatively idle.
> >
> > > For example, why is it better than just 25% of the LLC weight ?
> >
> > Because, let's say, there are 2 LLCs: an imbalance based on just the LLC
> > weight might allow 2 tasks to share one cache while another is idle. This
> > is the original problem, whereby the vanilla imbalance allowed multiple
> > LLCs on the same node to be overloaded, which hurt workloads that prefer
> > to spread wide.
> 
> In this case, shouldn't it be (llc_weight >> 2) * nr_llcs, to fill each
> LLC up to 25%, instead of dividing by nr_llcs?
> 
> As an example:
> 1 node with 1 LLC of 128 CPUs gets an imb_numa_nr = 32
> 1 node with 2 LLCs of 64 CPUs each gets an imb_numa_nr = 8
> 1 node with 4 LLCs of 32 CPUs each gets an imb_numa_nr = 2
> 
> sd->imb_numa_nr is used at the NUMA level, so the more LLCs you have,
> the lower the imbalance that is allowed
> 

The more LLCs there are, the lower the threshold at which an imbalance
is allowed; that is deliberate, given that the motivating problem was
embarrassingly parallel workloads on AMD suffering because some LLCs
were overloaded while others remained idle.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group
  2021-12-03  8:38   ` Barry Song
  2021-12-03  9:51     ` Gautham R. Shenoy
@ 2021-12-03 10:53     ` Mel Gorman
  1 sibling, 0 replies; 27+ messages in thread
From: Mel Gorman @ 2021-12-03 10:53 UTC (permalink / raw)
  To: Barry Song
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML

On Fri, Dec 03, 2021 at 09:38:33PM +1300, Barry Song wrote:
> On Fri, Dec 3, 2021 at 8:27 PM Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > find_busiest_group uses the child domain's group weight instead of
> > the sched_domain's weight that has SD_NUMA set when calculating the
> > allowed imbalance between NUMA nodes. This is wrong and inconsistent
> > with find_idlest_group.
> >
> > This patch uses the SD_NUMA weight in both.
> >
> > Fixes: c4e8f691d926 ("sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCS")
> 
> Hi Mel,
> 
> sorry I might be missing something. but I have failed to figure out
> where this commit is.
> 

Stupid cut&paste error and failing to check it properly

7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA nodes")

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group
  2021-12-03  8:38   ` Barry Song
@ 2021-12-03  9:51     ` Gautham R. Shenoy
  2021-12-03 10:53     ` Mel Gorman
  1 sibling, 0 replies; 27+ messages in thread
From: Gautham R. Shenoy @ 2021-12-03  9:51 UTC (permalink / raw)
  To: Barry Song
  Cc: Mel Gorman, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith,
	Srikar Dronamraju, LKML

Hello Barry,

On Fri, Dec 03, 2021 at 09:38:33PM +1300, Barry Song wrote:
> On Fri, Dec 3, 2021 at 8:27 PM Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > find_busiest_group uses the child domain's group weight instead of
> > the sched_domain's weight that has SD_NUMA set when calculating the
> > allowed imbalance between NUMA nodes. This is wrong and inconsistent
> > with find_idlest_group.
> >
> > This patch uses the SD_NUMA weight in both.
> >
> > Fixes: c4e8f691d926 ("sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCS")

> 
> Hi Mel,
> 
> sorry I might be missing something. but I have failed to figure out
> where this commit is.


I think it is supposed to be

Fixes: 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA nodes")
since it was the commit which introduced the "busiest->group_weight" change. 

> 
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > ---
> >  kernel/sched/fair.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 6e476f6d9435..0a969affca76 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9397,7 +9397,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> >                 /* Consider allowing a small imbalance between NUMA groups */
> >                 if (env->sd->flags & SD_NUMA) {
> >                         env->imbalance = adjust_numa_imbalance(env->imbalance,
> > -                               busiest->sum_nr_running, busiest->group_weight);
> > +                               busiest->sum_nr_running, env->sd->span_weight);



> >                 }
> >
> >                 return;
> > --
> > 2.31.1
> >
> 
> Thanks
> Barry

--
Thanks and Regards
gautham.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group
  2021-12-01 15:18 ` [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group Mel Gorman
@ 2021-12-03  8:38   ` Barry Song
  2021-12-03  9:51     ` Gautham R. Shenoy
  2021-12-03 10:53     ` Mel Gorman
  0 siblings, 2 replies; 27+ messages in thread
From: Barry Song @ 2021-12-03  8:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML

On Fri, Dec 3, 2021 at 8:27 PM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> find_busiest_group uses the child domain's group weight instead of
> the sched_domain's weight that has SD_NUMA set when calculating the
> allowed imbalance between NUMA nodes. This is wrong and inconsistent
> with find_idlest_group.
>
> This patch uses the SD_NUMA weight in both.
>
> Fixes: c4e8f691d926 ("sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCS")

Hi Mel,

sorry I might be missing something. but I have failed to figure out
where this commit is.

> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  kernel/sched/fair.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6e476f6d9435..0a969affca76 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9397,7 +9397,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>                 /* Consider allowing a small imbalance between NUMA groups */
>                 if (env->sd->flags & SD_NUMA) {
>                         env->imbalance = adjust_numa_imbalance(env->imbalance,
> -                               busiest->sum_nr_running, busiest->group_weight);
> +                               busiest->sum_nr_running, env->sd->span_weight);
>                 }
>
>                 return;
> --
> 2.31.1
>

Thanks
Barry

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group
  2021-12-01 15:18 [PATCH v3 0/2] Adjust NUMA imbalance for " Mel Gorman
@ 2021-12-01 15:18 ` Mel Gorman
  2021-12-03  8:38   ` Barry Song
  0 siblings, 1 reply; 27+ messages in thread
From: Mel Gorman @ 2021-12-01 15:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, LKML, Mel Gorman

find_busiest_group uses the child domain's group weight instead of
the sched_domain's weight that has SD_NUMA set when calculating the
allowed imbalance between NUMA nodes. This is wrong and inconsistent
with find_idlest_group.

This patch uses the SD_NUMA weight in both.

Fixes: c4e8f691d926 ("sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCS")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6e476f6d9435..0a969affca76 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9397,7 +9397,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		/* Consider allowing a small imbalance between NUMA groups */
 		if (env->sd->flags & SD_NUMA) {
 			env->imbalance = adjust_numa_imbalance(env->imbalance,
-				busiest->sum_nr_running, busiest->group_weight);
+				busiest->sum_nr_running, env->sd->span_weight);
 		}
 
 		return;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group
  2021-11-25 15:19 [PATCH 0/2] Adjust NUMA imbalance for multiple LLCs Mel Gorman
@ 2021-11-25 15:19 ` Mel Gorman
  0 siblings, 0 replies; 27+ messages in thread
From: Mel Gorman @ 2021-11-25 15:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, LKML, Mel Gorman

find_busiest_group uses the child domain's group weight instead of
the sched_domain's weight that has SD_NUMA set when calculating the
allowed imbalance between NUMA nodes. This is wrong and inconsistent
with find_idlest_group.

This patch uses the SD_NUMA weight in both.

Fixes: c4e8f691d926 ("sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCS")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6e476f6d9435..0a969affca76 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9397,7 +9397,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		/* Consider allowing a small imbalance between NUMA groups */
 		if (env->sd->flags & SD_NUMA) {
 			env->imbalance = adjust_numa_imbalance(env->imbalance,
-				busiest->sum_nr_running, busiest->group_weight);
+				busiest->sum_nr_running, env->sd->span_weight);
 		}
 
 		return;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2022-01-12 10:24 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-10  9:33 [PATCH v4 0/2] Adjust NUMA imbalance for multiple LLCs Mel Gorman
2021-12-10  9:33 ` [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group Mel Gorman
2021-12-21 10:53   ` Vincent Guittot
2021-12-21 11:32     ` Mel Gorman
2021-12-21 13:05       ` Vincent Guittot
2021-12-10  9:33 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs Mel Gorman
2021-12-13  8:28   ` Gautham R. Shenoy
2021-12-13 13:01     ` Mel Gorman
2021-12-13 14:47       ` Gautham R. Shenoy
2021-12-15 11:52         ` Gautham R. Shenoy
2021-12-15 12:25           ` Mel Gorman
2021-12-16 18:33             ` Gautham R. Shenoy
2021-12-20 11:12               ` Mel Gorman
2021-12-21 15:03                 ` Gautham R. Shenoy
2021-12-21 17:13                 ` Vincent Guittot
2021-12-22  8:52                   ` Jirka Hladky
2022-01-04 19:52                     ` Jirka Hladky
2022-01-05 10:42                   ` Mel Gorman
2022-01-05 10:49                     ` Mel Gorman
2022-01-10 15:53                     ` Vincent Guittot
2022-01-12 10:24                       ` Mel Gorman
2021-12-17 19:54   ` Gautham R. Shenoy
  -- strict thread matches above, loose matches on Subject: below --
2021-12-01 15:18 [PATCH v3 0/2] Adjust NUMA imbalance for " Mel Gorman
2021-12-01 15:18 ` [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group Mel Gorman
2021-12-03  8:38   ` Barry Song
2021-12-03  9:51     ` Gautham R. Shenoy
2021-12-03 10:53     ` Mel Gorman
2021-11-25 15:19 [PATCH 0/2] Adjust NUMA imbalance for multiple LLCs Mel Gorman
2021-11-25 15:19 ` [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group Mel Gorman
