* [PATCH v3 0/2] Adjust NUMA imbalance for multiple LLCs
@ 2021-12-01 15:18 Mel Gorman
  2021-12-01 15:18 ` [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group Mel Gorman
  2021-12-01 15:18 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs Mel Gorman
  0 siblings, 2 replies; 46+ messages in thread
From: Mel Gorman @ 2021-12-01 15:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, LKML, Mel Gorman

Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA
nodes") allowed an imbalance between NUMA nodes such that communicating
tasks would not be pulled apart by the load balancer. This works fine when
there is a 1:1 relationship between LLC and node but can be suboptimal
for multiple LLCs if independent tasks prematurely use CPUs sharing cache.

The series addresses two problems -- inconsistent use of scheduler domain
weights and sub-optimal performance when there are many LLCs per NUMA node.

-- 
2.31.1



* [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group
  2021-12-01 15:18 [PATCH v3 0/2] Adjust NUMA imbalance for multiple LLCs Mel Gorman
@ 2021-12-01 15:18 ` Mel Gorman
  2021-12-03  8:38   ` Barry Song
  2021-12-01 15:18 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs Mel Gorman
  1 sibling, 1 reply; 46+ messages in thread
From: Mel Gorman @ 2021-12-01 15:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, LKML, Mel Gorman

find_busiest_group uses the child domain's group weight instead of
the sched_domain's weight that has SD_NUMA set when calculating the
allowed imbalance between NUMA nodes. This is wrong and inconsistent
with find_idlest_group.

This patch uses the SD_NUMA weight in both.
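
As an illustration (the numbers are hypothetical, not from the
changelog): on a two-node machine with 64 CPUs per node, busiest at the
SD_NUMA level is a single node, so busiest->group_weight is 64 while
env->sd->span_weight is 128, and the two paths end up with different
imbalance cut-offs (16 versus 32 running tasks under the 25% rule) for
the same domain.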

Fixes: c4e8f691d926 ("sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCS")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6e476f6d9435..0a969affca76 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9397,7 +9397,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		/* Consider allowing a small imbalance between NUMA groups */
 		if (env->sd->flags & SD_NUMA) {
 			env->imbalance = adjust_numa_imbalance(env->imbalance,
-				busiest->sum_nr_running, busiest->group_weight);
+				busiest->sum_nr_running, env->sd->span_weight);
 		}
 
 		return;
-- 
2.31.1



* [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-01 15:18 [PATCH v3 0/2] Adjust NUMA imbalance for multiple LLCs Mel Gorman
  2021-12-01 15:18 ` [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group Mel Gorman
@ 2021-12-01 15:18 ` Mel Gorman
  2021-12-03  8:15   ` Barry Song
  2021-12-04 10:40   ` Peter Zijlstra
  1 sibling, 2 replies; 46+ messages in thread
From: Mel Gorman @ 2021-12-01 15:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, LKML, Mel Gorman

Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA
nodes") allowed an imbalance between NUMA nodes such that communicating
tasks would not be pulled apart by the load balancer. This works fine when
there is a 1:1 relationship between LLC and node but can be suboptimal
for multiple LLCs if independent tasks prematurely use CPUs sharing cache.

Zen* has multiple LLCs per node with local memory channels and due to
the allowed imbalance, it's far harder to tune some workloads to run
optimally than it is on hardware that has 1 LLC per node. This patch
adjusts the imbalance on multi-LLC machines to allow an imbalance up to
the point where LLCs should be balanced between nodes.
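
As an illustration (example numbers only): on a machine with two nodes,
8 LLCs per node and 16 CPUs per LLC, the allowed imbalance works out as
imb_numa_nr = max(1, (16 >> 1) / (8 * 2)) = 1, so a pair of
communicating tasks can still share a node while independent tasks are
spread between LLCs far sooner than the previous 25%-of-the-domain
cut-off allowed.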

On a Zen3 machine running STREAM parallelised with OMP so that there is
one instance per LLC and without binding, the results are

                            5.16.0-rc1             5.16.0-rc1
                               vanilla     sched-numaimb-v3r1
MB/sec copy-16    166712.18 (   0.00%)   587662.60 ( 252.50%)
MB/sec scale-16   140109.66 (   0.00%)   393528.14 ( 180.87%)
MB/sec add-16     160791.18 (   0.00%)   618622.00 ( 284.74%)
MB/sec triad-16   160043.84 (   0.00%)   589188.40 ( 268.14%)

STREAM can use directives to force the spread if the OpenMP runtime is
new enough, but that doesn't help if an application uses threads and
it's not known in advance how many threads will be created.

Coremark is a CPU- and cache-intensive benchmark parallelised with
threads. When running with 1 thread per instance, the vanilla kernel
allows threads to contend on cache. With the patch:

                               5.16.0-rc1             5.16.0-rc1
                                  vanilla     sched-numaimb-v3r1
Min       Score-16   367816.09 (   0.00%)   403429.15 (   9.68%)
Hmean     Score-16   389627.78 (   0.00%)   451015.49 *  15.76%*
Max       Score-16   416178.96 (   0.00%)   480012.00 (  15.34%)
Stddev    Score-16    17361.82 (   0.00%)    32378.08 ( -86.49%)
CoeffVar  Score-16        4.45 (   0.00%)        7.14 ( -60.57%)

It can also make a big difference for semi-realistic workloads
like specjbb, which can execute arbitrary numbers of threads without
advance knowledge of how they should be placed.

                               5.16.0-rc1             5.16.0-rc1
                                  vanilla     sched-numaimb-v3r1
Hmean     tput-1      73743.05 (   0.00%)    70258.27 *  -4.73%*
Hmean     tput-8     563036.51 (   0.00%)   591187.39 (   5.00%)
Hmean     tput-16   1016590.61 (   0.00%)  1032311.78 (   1.55%)
Hmean     tput-24   1418558.41 (   0.00%)  1424005.80 (   0.38%)
Hmean     tput-32   1608794.22 (   0.00%)  1907855.80 *  18.59%*
Hmean     tput-40   1761338.13 (   0.00%)  2108162.23 *  19.69%*
Hmean     tput-48   2290646.54 (   0.00%)  2214383.47 (  -3.33%)
Hmean     tput-56   2463345.12 (   0.00%)  2780216.58 *  12.86%*
Hmean     tput-64   2650213.53 (   0.00%)  2598196.66 (  -1.96%)
Hmean     tput-72   2497253.28 (   0.00%)  2998882.47 *  20.09%*
Hmean     tput-80   2820786.72 (   0.00%)  2951655.27 (   4.64%)
Hmean     tput-88   2813541.68 (   0.00%)  3045450.86 *   8.24%*
Hmean     tput-96   2604158.67 (   0.00%)  3035311.91 *  16.56%*
Hmean     tput-104  2713810.62 (   0.00%)  2984270.04 (   9.97%)
Hmean     tput-112  2558425.37 (   0.00%)  2894737.46 *  13.15%*
Hmean     tput-120  2611434.93 (   0.00%)  2781661.01 (   6.52%)
Hmean     tput-128  2706103.22 (   0.00%)  2811447.85 (   3.89%)

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/sched/topology.h |  1 +
 kernel/sched/fair.c            | 26 +++++++++++++++-----------
 kernel/sched/topology.c        | 20 ++++++++++++++++++++
 3 files changed, 36 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index c07bfa2d80f2..54f5207154d3 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -93,6 +93,7 @@ struct sched_domain {
 	unsigned int busy_factor;	/* less balancing by factor if busy */
 	unsigned int imbalance_pct;	/* No balance until over watermark */
 	unsigned int cache_nice_tries;	/* Leave cache hot tasks for # tries */
+	unsigned int imb_numa_nr;	/* Nr imbalanced tasks allowed between nodes */
 
 	int nohz_idle;			/* NOHZ IDLE status */
 	int flags;			/* See SD_* */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0a969affca76..64f211879e43 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1489,6 +1489,7 @@ struct task_numa_env {
 
 	int src_cpu, src_nid;
 	int dst_cpu, dst_nid;
+	int imb_numa_nr;
 
 	struct numa_stats src_stats, dst_stats;
 
@@ -1885,7 +1886,7 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 		dst_running = env->dst_stats.nr_running + 1;
 		imbalance = max(0, dst_running - src_running);
 		imbalance = adjust_numa_imbalance(imbalance, dst_running,
-							env->dst_stats.weight);
+						  env->imb_numa_nr);
 
 		/* Use idle CPU if there is no imbalance */
 		if (!imbalance) {
@@ -1950,8 +1951,10 @@ static int task_numa_migrate(struct task_struct *p)
 	 */
 	rcu_read_lock();
 	sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu));
-	if (sd)
+	if (sd) {
 		env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
+		env.imb_numa_nr = sd->imb_numa_nr;
+	}
 	rcu_read_unlock();
 
 	/*
@@ -9046,13 +9049,14 @@ static bool update_pick_idlest(struct sched_group *idlest,
 }
 
 /*
- * Allow a NUMA imbalance if busy CPUs is less than 25% of the domain.
- * This is an approximation as the number of running tasks may not be
- * related to the number of busy CPUs due to sched_setaffinity.
+ * Allow a NUMA imbalance if busy CPUs is less than the allowed
+ * imbalance. This is an approximation as the number of running
+ * tasks may not be related to the number of busy CPUs due to
+ * sched_setaffinity.
  */
-static inline bool allow_numa_imbalance(int dst_running, int dst_weight)
+static inline bool allow_numa_imbalance(int dst_running, int imb_numa_nr)
 {
-	return (dst_running < (dst_weight >> 2));
+	return dst_running < imb_numa_nr;
 }
 
 /*
@@ -9191,7 +9195,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 			 * a real need of migration, periodic load balance will
 			 * take care of it.
 			 */
-			if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight))
+			if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->imb_numa_nr))
 				return NULL;
 		}
 
@@ -9283,9 +9287,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 #define NUMA_IMBALANCE_MIN 2
 
 static inline long adjust_numa_imbalance(int imbalance,
-				int dst_running, int dst_weight)
+				int dst_running, int imb_numa_nr)
 {
-	if (!allow_numa_imbalance(dst_running, dst_weight))
+	if (!allow_numa_imbalance(dst_running, imb_numa_nr))
 		return imbalance;
 
 	/*
@@ -9397,7 +9401,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		/* Consider allowing a small imbalance between NUMA groups */
 		if (env->sd->flags & SD_NUMA) {
 			env->imbalance = adjust_numa_imbalance(env->imbalance,
-				busiest->sum_nr_running, env->sd->span_weight);
+				busiest->sum_nr_running, env->sd->imb_numa_nr);
 		}
 
 		return;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index d201a7052a29..fee2930745ab 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2242,6 +2242,26 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 		}
 	}
 
+	/* Calculate allowed NUMA imbalance */
+	for_each_cpu(i, cpu_map) {
+		int imb_numa_nr = 0;
+
+		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
+			struct sched_domain *child = sd->child;
+
+			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
+			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
+				int nr_groups;
+
+				nr_groups = sd->span_weight / child->span_weight;
+				imb_numa_nr = max(1U, ((child->span_weight) >> 1) /
+						(nr_groups * num_online_nodes()));
+			}
+
+			sd->imb_numa_nr = imb_numa_nr;
+		}
+	}
+
 	/* Calculate CPU capacity for physical packages and nodes */
 	for (i = nr_cpumask_bits-1; i >= 0; i--) {
 		if (!cpumask_test_cpu(i, cpu_map))
-- 
2.31.1



* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-01 15:18 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs Mel Gorman
@ 2021-12-03  8:15   ` Barry Song
  2021-12-03 10:50     ` Mel Gorman
  2021-12-04 10:40   ` Peter Zijlstra
  1 sibling, 1 reply; 46+ messages in thread
From: Barry Song @ 2021-12-03  8:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML

On Fri, Dec 3, 2021 at 8:27 PM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index d201a7052a29..fee2930745ab 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2242,6 +2242,26 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>                 }
>         }
>
> +       /* Calculate allowed NUMA imbalance */
> +       for_each_cpu(i, cpu_map) {
> +               int imb_numa_nr = 0;
> +
> +               for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> +                       struct sched_domain *child = sd->child;
> +
> +                       if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> +                           (child->flags & SD_SHARE_PKG_RESOURCES)) {
> +                               int nr_groups;
> +
> +                               nr_groups = sd->span_weight / child->span_weight;
> +                               imb_numa_nr = max(1U, ((child->span_weight) >> 1) /
> +                                               (nr_groups * num_online_nodes()));

Hi Mel, you used to have 25% * numa_weight if a node has only one LLC.
For a system with 4 NUMA nodes, in case sd spans 2 nodes and the child is
1 NUMA node, then nr_groups=2, num_online_nodes()=4, and imb_numa_nr will
be child->span_weight/2/2/4?

Does this patch change the behaviour for machines where a NUMA node equals an LLC?

> +                       }
> +
> +                       sd->imb_numa_nr = imb_numa_nr;
> +               }
> +       }
> +
>         /* Calculate CPU capacity for physical packages and nodes */
>         for (i = nr_cpumask_bits-1; i >= 0; i--) {
>                 if (!cpumask_test_cpu(i, cpu_map))
> --
> 2.31.1
>

Thanks
Barry


* Re: [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group
  2021-12-01 15:18 ` [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group Mel Gorman
@ 2021-12-03  8:38   ` Barry Song
  2021-12-03  9:51     ` Gautham R. Shenoy
  2021-12-03 10:53     ` Mel Gorman
  0 siblings, 2 replies; 46+ messages in thread
From: Barry Song @ 2021-12-03  8:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML

On Fri, Dec 3, 2021 at 8:27 PM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> find_busiest_group uses the child domain's group weight instead of
> the sched_domain's weight that has SD_NUMA set when calculating the
> allowed imbalance between NUMA nodes. This is wrong and inconsistent
> with find_idlest_group.
>
> This patch uses the SD_NUMA weight in both.
>
> Fixes: c4e8f691d926 ("sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCS")

Hi Mel,

Sorry, I might be missing something, but I have failed to figure out
where this commit is.

> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  kernel/sched/fair.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6e476f6d9435..0a969affca76 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9397,7 +9397,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>                 /* Consider allowing a small imbalance between NUMA groups */
>                 if (env->sd->flags & SD_NUMA) {
>                         env->imbalance = adjust_numa_imbalance(env->imbalance,
> -                               busiest->sum_nr_running, busiest->group_weight);
> +                               busiest->sum_nr_running, env->sd->span_weight);
>                 }
>
>                 return;
> --
> 2.31.1
>

Thanks
Barry


* Re: [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group
  2021-12-03  8:38   ` Barry Song
@ 2021-12-03  9:51     ` Gautham R. Shenoy
  2021-12-03 10:53     ` Mel Gorman
  1 sibling, 0 replies; 46+ messages in thread
From: Gautham R. Shenoy @ 2021-12-03  9:51 UTC (permalink / raw)
  To: Barry Song
  Cc: Mel Gorman, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith,
	Srikar Dronamraju, LKML

Hello Barry,

On Fri, Dec 03, 2021 at 09:38:33PM +1300, Barry Song wrote:
> On Fri, Dec 3, 2021 at 8:27 PM Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > find_busiest_group uses the child domain's group weight instead of
> > the sched_domain's weight that has SD_NUMA set when calculating the
> > allowed imbalance between NUMA nodes. This is wrong and inconsistent
> > with find_idlest_group.
> >
> > This patch uses the SD_NUMA weight in both.
> >
> > Fixes: c4e8f691d926 ("sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCS")

> 
> Hi Mel,
> 
> sorry I might be missing something. but I have failed to figure out
> where this commit is.


I think it is supposed to be

Fixes: 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA nodes")
since it was the commit which introduced the "busiest->group_weight" change. 

> 
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > ---
> >  kernel/sched/fair.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 6e476f6d9435..0a969affca76 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9397,7 +9397,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> >                 /* Consider allowing a small imbalance between NUMA groups */
> >                 if (env->sd->flags & SD_NUMA) {
> >                         env->imbalance = adjust_numa_imbalance(env->imbalance,
> > -                               busiest->sum_nr_running, busiest->group_weight);
> > +                               busiest->sum_nr_running, env->sd->span_weight);



> >                 }
> >
> >                 return;
> > --
> > 2.31.1
> >
> 
> Thanks
> Barry

--
Thanks and Regards
gautham.


* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-03  8:15   ` Barry Song
@ 2021-12-03 10:50     ` Mel Gorman
  2021-12-03 11:14       ` Barry Song
  0 siblings, 1 reply; 46+ messages in thread
From: Mel Gorman @ 2021-12-03 10:50 UTC (permalink / raw)
  To: Barry Song
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML

On Fri, Dec 03, 2021 at 09:15:15PM +1300, Barry Song wrote:
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index d201a7052a29..fee2930745ab 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -2242,6 +2242,26 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> >                 }
> >         }
> >
> > +       /* Calculate allowed NUMA imbalance */
> > +       for_each_cpu(i, cpu_map) {
> > +               int imb_numa_nr = 0;
> > +
> > +               for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> > +                       struct sched_domain *child = sd->child;
> > +
> > +                       if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> > +                           (child->flags & SD_SHARE_PKG_RESOURCES)) {
> > +                               int nr_groups;
> > +
> > +                               nr_groups = sd->span_weight / child->span_weight;
> > +                               imb_numa_nr = max(1U, ((child->span_weight) >> 1) /
> > +                                               (nr_groups * num_online_nodes()));
> 
> Hi Mel, you used to have 25% * numa_weight if node has only one LLC.
> for a system with 4 numa,  In case sd has 2 nodes, child is 1 numa node,
> then  nr_groups=2, num_online_nodes()=4,  imb_numa_nr will be
> child->span_weight/2/2/4?
> 
> Does this patch change the behaviour for machines whose numa equals LLC?
> 

Yes, it changes behaviour. Instead of a flat 25%, it takes into account
the number of LLCs per node and the number of nodes overall.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group
  2021-12-03  8:38   ` Barry Song
  2021-12-03  9:51     ` Gautham R. Shenoy
@ 2021-12-03 10:53     ` Mel Gorman
  1 sibling, 0 replies; 46+ messages in thread
From: Mel Gorman @ 2021-12-03 10:53 UTC (permalink / raw)
  To: Barry Song
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML

On Fri, Dec 03, 2021 at 09:38:33PM +1300, Barry Song wrote:
> On Fri, Dec 3, 2021 at 8:27 PM Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > find_busiest_group uses the child domain's group weight instead of
> > the sched_domain's weight that has SD_NUMA set when calculating the
> > allowed imbalance between NUMA nodes. This is wrong and inconsistent
> > with find_idlest_group.
> >
> > This patch uses the SD_NUMA weight in both.
> >
> > Fixes: c4e8f691d926 ("sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCS")
> 
> Hi Mel,
> 
> sorry I might be missing something. but I have failed to figure out
> where this commit is.
> 

Stupid cut&paste error and failing to check it properly

7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA nodes")

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-03 10:50     ` Mel Gorman
@ 2021-12-03 11:14       ` Barry Song
  2021-12-03 13:27         ` Mel Gorman
  0 siblings, 1 reply; 46+ messages in thread
From: Barry Song @ 2021-12-03 11:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML

On Fri, Dec 3, 2021 at 11:50 PM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Fri, Dec 03, 2021 at 09:15:15PM +1300, Barry Song wrote:
> > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > > index d201a7052a29..fee2930745ab 100644
> > > --- a/kernel/sched/topology.c
> > > +++ b/kernel/sched/topology.c
> > > @@ -2242,6 +2242,26 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> > >                 }
> > >         }
> > >
> > > +       /* Calculate allowed NUMA imbalance */
> > > +       for_each_cpu(i, cpu_map) {
> > > +               int imb_numa_nr = 0;
> > > +
> > > +               for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> > > +                       struct sched_domain *child = sd->child;
> > > +
> > > +                       if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> > > +                           (child->flags & SD_SHARE_PKG_RESOURCES)) {
> > > +                               int nr_groups;
> > > +
> > > +                               nr_groups = sd->span_weight / child->span_weight;
> > > +                               imb_numa_nr = max(1U, ((child->span_weight) >> 1) /
> > > +                                               (nr_groups * num_online_nodes()));
> >
> > Hi Mel, you used to have 25% * numa_weight if node has only one LLC.
> > for a system with 4 numa,  In case sd has 2 nodes, child is 1 numa node,
> > then  nr_groups=2, num_online_nodes()=4,  imb_numa_nr will be
> > child->span_weight/2/2/4?
> >
> > Does this patch change the behaviour for machines whose numa equals LLC?
> >
>
> Yes, it changes behaviour. Instead of a flat 25%, it takes into account
> the number of LLCs per node and the number of nodes overall.

Considering the number of nodes overall seems quite weird to me.
For example, for the machines below:

1P * 2DIE = 2NUMA:    node1 - node0
2P * 2DIE = 4NUMA:    node1 - node0  ------ node2 - node3
4P * 2DIE = 8NUMA:    node1 - node0  ------ node2 - node3
                                      node5 - node4  ------ node6 - node7

If one service pins node1 and node0 in all of the above configurations, it
seems the app will end up with different behavior on each of these machines.

The other example is:
in a 2P machine, if one app pins the first two NUMA nodes and the other app
pins the last two NUMA nodes, why would num_online_nodes() matter to them?
There is no balance requirement between the two packages.

>
> --
> Mel Gorman
> SUSE Labs

Thanks
Barry


* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-03 11:14       ` Barry Song
@ 2021-12-03 13:27         ` Mel Gorman
  0 siblings, 0 replies; 46+ messages in thread
From: Mel Gorman @ 2021-12-03 13:27 UTC (permalink / raw)
  To: Barry Song
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML

On Sat, Dec 04, 2021 at 12:14:33AM +1300, Barry Song wrote:
> > > Hi Mel, you used to have 25% * numa_weight if node has only one LLC.
> > > for a system with 4 numa,  In case sd has 2 nodes, child is 1 numa node,
> > > then  nr_groups=2, num_online_nodes()=4,  imb_numa_nr will be
> > > child->span_weight/2/2/4?
> > >
> > > Does this patch change the behaviour for machines whose numa equals LLC?
> > >
> >
> > Yes, it changes behaviour. Instead of a flat 25%, it takes into account
> > the number of LLCs per node and the number of nodes overall.
> 
> Considering  the number of nodes overall seems to be quite weird to me.
> for example, for the below machines
> 
> 1P * 2DIE = 2NUMA:    node1 - node0
> 2P * 2DIE = 4NUMA:    node1 - node0  ------ node2 - node3
> 4P * 2DIE = 8NUMA:    node1 - node0  ------ node2 - node3
>                                       node5 - node4  ------ node6 - node7
> 
> if one service pins node1 and node0 in all above configurations, it seems in all
> different machines, the app will result in different behavior.
> 

The intent is to balance between LLCs across the whole machine, hence
accounting for the number of online nodes.

> the other example is:
> in a 2P machine, if one app pins the first two NUMAs, the other app pins
> the last two NUMAs, why would the  num_online_nodes() matter to them?
> there is no balance requirement between the two P.
> 

The previous 25% imbalance also did not take pinning into account and
the choice was somewhat arbitrary.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-01 15:18 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs Mel Gorman
  2021-12-03  8:15   ` Barry Song
@ 2021-12-04 10:40   ` Peter Zijlstra
  2021-12-06  8:48     ` Gautham R. Shenoy
  2021-12-06 15:12     ` Mel Gorman
  1 sibling, 2 replies; 46+ messages in thread
From: Peter Zijlstra @ 2021-12-04 10:40 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, LKML

On Wed, Dec 01, 2021 at 03:18:44PM +0000, Mel Gorman wrote:
> +	/* Calculate allowed NUMA imbalance */
> +	for_each_cpu(i, cpu_map) {
> +		int imb_numa_nr = 0;
> +
> +		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> +			struct sched_domain *child = sd->child;
> +
> +			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> +			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
> +				int nr_groups;
> +
> +				nr_groups = sd->span_weight / child->span_weight;
> +				imb_numa_nr = max(1U, ((child->span_weight) >> 1) /
> +						(nr_groups * num_online_nodes()));
> +			}
> +
> +			sd->imb_numa_nr = imb_numa_nr;
> +		}

OK, so let's see. All domains with SHARE_PKG_RESOURCES set will have
imb_numa_nr = 0, all domains above it will have the same value
calculated here.

So far so good I suppose :-)

Then nr_groups is what it says on the tin; we could've equally well
iterated sd->groups and gotten the same number, but this is simpler.
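
(For reference, a sketch of what that iteration would look like, walking
the circular sd->groups list; it should arrive at the same number as the
span_weight division:)

	int nr_groups = 0;
	struct sched_group *sg = sd->groups;

	do {
		nr_groups++;
		sg = sg->next;
	} while (sg != sd->groups);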

Now, imb_numa_nr is where the magic happens, the way it's written
doesn't help, but it's something like:

	(child->span_weight / 2) / (nr_groups * num_online_nodes())

With a minimum value of 1. So the larger the system is, or the smaller
the LLCs, the smaller this number gets, right?

So my ivb-ep that has 20 cpus in a LLC and 2 nodes, will get: (20 / 2)
/ (1 * 2) = 10, while the ivb-ex will get: (20/2) / (1*4) = 5.

But a Zen box that has only like 4 CPUs per LLC will have 1, regardless
of how many nodes it has.

Now, I'm thinking this assumes (fairly reasonable) that the level above
LLC is a node, but I don't think we need to assume this, while also not
assuming the balance domain spans the whole machine (yay partitions!).

	for (top = sd; top->parent; top = top->parent)
		;

	nr_llcs = top->span_weight / child->span_weight;
	imb_numa_nr = max(1, child->span_weight / nr_llcs);

which for my ivb-ep gets me:  20 / (40 / 20) = 10
and the Zen system will have:  4 / (huge number) = 1

Now, the exp: a / (b / a) is equivalent to a * (a / b) or a^2/b, so we
can also write the above as:

	(child->span_weight * child->span_weight) / top->span_weight;

Hmm?
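
(For illustration, a quick userspace sketch, not kernel code, that plugs
the example topologies mentioned above into both forms; the exact CPU
counts are assumptions based on the numbers in this thread. With integer
division the two forms are not identical in general, but they agree for
these cases:)

	#include <stdio.h>

	/* Compare child / (top / child) with child^2 / top, clamped to 1. */
	static void imb(int llc_weight, int top_weight)
	{
		int nr_llcs = top_weight / llc_weight;
		int form_a = llc_weight / nr_llcs;
		int form_b = (llc_weight * llc_weight) / top_weight;

		printf("llc=%3d top=%3d: %d vs %d -> imb_numa_nr=%d\n",
		       llc_weight, top_weight, form_a, form_b,
		       form_a > 1 ? form_a : 1);
	}

	int main(void)
	{
		imb(20, 40);	/* ivb-ep: 20-CPU LLC, 40 CPUs total -> 10 */
		imb(20, 80);	/* ivb-ex: 20-CPU LLC, 80 CPUs total -> 5  */
		imb(4, 128);	/* small LLCs on a big box -> clamps to 1  */
		return 0;
	}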


> +	}
> +
>  	/* Calculate CPU capacity for physical packages and nodes */
>  	for (i = nr_cpumask_bits-1; i >= 0; i--) {
>  		if (!cpumask_test_cpu(i, cpu_map))
> -- 
> 2.31.1
> 


* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-04 10:40   ` Peter Zijlstra
@ 2021-12-06  8:48     ` Gautham R. Shenoy
  2021-12-06 14:51       ` Peter Zijlstra
  2021-12-06 15:12     ` Mel Gorman
  1 sibling, 1 reply; 46+ messages in thread
From: Gautham R. Shenoy @ 2021-12-06  8:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML,
	Sadagopan Srinivasan, Krupa Ramakrishnan

Hello Peter, Mel,


On Sat, Dec 04, 2021 at 11:40:56AM +0100, Peter Zijlstra wrote:
> On Wed, Dec 01, 2021 at 03:18:44PM +0000, Mel Gorman wrote:
> > +	/* Calculate allowed NUMA imbalance */
> > +	for_each_cpu(i, cpu_map) {
> > +		int imb_numa_nr = 0;
> > +
> > +		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> > +			struct sched_domain *child = sd->child;
> > +
> > +			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> > +			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
> > +				int nr_groups;
> > +
> > +				nr_groups = sd->span_weight / child->span_weight;
> > +				imb_numa_nr = max(1U, ((child->span_weight) >> 1) /
> > +						(nr_groups * num_online_nodes()));
> > +			}
> > +
> > +			sd->imb_numa_nr = imb_numa_nr;
> > +		}
> 
> OK, so let's see. All domains with SHARE_PKG_RESOURCES set will have
> imb_numa_nr = 0, all domains above it will have the same value
> calculated here.
> 
> So far so good I suppose :-)

Well, we will still have the same imb_numa_nr set for different NUMA
domains which have different distances!

> 
> Then nr_groups is what it says on the tin; we could've equally well
> iterated sd->groups and gotten the same number, but this is simpler.
> 
> Now, imb_numa_nr is where the magic happens, the way it's written
> doesn't help, but it's something like:
> 
> 	(child->span_weight / 2) / (nr_groups * num_online_nodes())
> 
> With a minimum value of 1. So the larger the system is, or the smaller
> the LLCs, the smaller this number gets, right?
> 
> So my ivb-ep that has 20 cpus in a LLC and 2 nodes, will get: (20 / 2)
> / (1 * 2) = 10, while the ivb-ex will get: (20/2) / (1*4) = 5.
> 
> But a Zen box that has only like 4 CPUs per LLC will have 1, regardless
> of how many nodes it has.

That's correct. On a Zen3 box with 2 sockets and 64 cores per
socket, we can configure it with either 1/2/4 Nodes Per Socket
(NPS). The imb_numa_nr value for each of the NPS configurations is as
follows:


NPS1 :
~~~~~~~~
SMT [span_wt=2]
  --> MC [span_wt=16, LLC]
     --> DIE[span_wt=128]
         --> NUMA [span_wt=256, SD_NUMA]

sd->span = 128, child->span = 16, nr_groups = 8, num_online_nodes() = 2
imb_numa_nr = max(1, (16 >> 1)/(8*2)) = max(1, 0.5) = 1.



NPS2 :
~~~~~~~~
SMT [span_wt=2]
 --> MC [span_wt=16,LLC]
     --> NODE[span_wt=64]
         --> NUMA [span_wt=128, SD_NUMA]
             --> NUMA [span_wt=256, SD_NUMA]

sd->span = 64, child->span = 16, nr_groups = 4, num_online_nodes() = 4
imb_numa_nr = max(1, (16 >> 1)/(4*4)) = max(1, 0.5) = 1.


NPS 4:
~~~~~~~
SMT [span_wt=2]
   --> MC [span_wt=16, LLC]
       --> NODE [span_wt=32]
           --> NUMA [span_wt=128, SD_NUMA]
	       --> NUMA [span_wt=256, SD_NUMA]

sd->span = 32, child->span = 16, nr_groups = 2, num_online_nodes() = 8
imb_numa_nr = max(1, (16 >> 1)/(2*8)) = max(1, 0.5) = 1.


While the imb_numa_nr = 1 is good for the NUMA domain within a socket
(the lower NUMA domains in NPS2 and NPS4 modes), it appears to be a
little bit aggressive for the NUMA domain spanning the two sockets. If
we have only a pair of communicating tasks in a socket, we will end up
spreading them across the two sockets with this patch.

> 
> Now, I'm thinking this assumes (fairly reasonable) that the level above
> LLC is a node, but I don't think we need to assume this, while also not
> assuming the balance domain spans the whole machine (yay paritions!).
> 
> 	for (top = sd; top->parent; top = top->parent)
> 		;
> 
> 	nr_llcs = top->span_weight / child->span_weight;
> 	imb_numa_nr = max(1, child->span_weight / nr_llcs);
> 
> which for my ivb-ep gets me:  20 / (40 / 20) = 10
> and the Zen system will have:  4 / (huge number) = 1
> 
> Now, the exp: a / (b / a) is equivalent to a * (a / b) or a^2/b, so we
> can also write the above as:
> 
> 	(child->span_weight * child->span_weight) / top->span_weight;


Assuming that "child" here refers to the LLC domain, on Zen3 we would have
(a) child->span_weight = 16. (b) top->span_weight = 256.

So we get a^2/b = 1.

>
> Hmm?

Last week, I tried a modification on top of Mel's current patch where
we spread tasks between the LLCs of the groups within each NUMA domain
and compute the value of imb_numa_nr per NUMA domain. The idea is to set

    sd->imb_numa_nr = min(1U,
    		         (Number of LLCs in each sd group / Number of sd groups))

This won't work for processors which have a single LLC in a socket,
since the sd->imb_numa_nr will be 1 which is probably too low. FWIW,
with this heuristic, the imb_numa_nr across the different NPS
configurations of a Zen3 server is as follows

NPS1:
    NUMA domain: nr_llcs_per_group = 8. nr_groups = 2. imb_numa_nr = 4

NPS2:
    1st NUMA domain: nr_llcs_per_group = 4. nr_groups = 2. imb_numa_nr = 2.
    2nd NUMA domain: nr_llcs_per_group = 8. nr_groups = 2. imb_numa_nr = 4.

NPS4:
    1st NUMA domain: nr_llcs_per_group = 2. nr_groups = 4. imb_numa_nr = min(1, 2/4) = 1.
    2nd NUMA domain: nr_llcs_per_group = 8. nr_groups = 2. imb_numa_nr = 4.

Thus, at the highest NUMA level (socket), we don't spread across the
two sockets until there are 4 tasks within the socket. If there is
only a pair of communicating tasks in the socket, they will be left
alone within that socket. The STREAM numbers (average of 10 runs; these
are the Triad numbers, and the Copy, Scale and Add numbers show the
same trend) are presented in the tables below. We do see some
degradation for the 4-thread case in NPS2 and NPS4 modes with the
aforementioned approach, but there are also gains for the 16- and
32-thread cases in NPS4 mode.

NPS1:

==========+===========+================+=================
| Nr      |  Mel v3   | tip/sched/core |   Spread across|
| Stream  |           |                |   LLCs of NUMA |
| Threads |           |                |   groups       |
==========+===========+================+=================
|       4 | 111106.14 |       94849.77 |  111820.02     |
|         |           |      (-14.63%) |  (+0.64%)      | 
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+
|       8 | 175633.00 |      128268.22 |  189705.48     |
|         |           |      (-26.97%) |  (+8.01%)      |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+
|      16 | 252812.87 |      136745.98 |  255577.34     |
|         |           |      (-45.91%) |  (+1.09%)      |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+
|      32 | 248198.43 |      130120.30 |  253266.86     |
|         |           |      (-47.57%) |  (+2.04%)      |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+
|      64 | 244202.33 |      133773.03 |  249449.53     |
|         |           |      (-45.22%) |  (+2.15%)      |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+
|     128 | 248459.85 |      249450.61 |  250346.09     |
|         |           |      (+0.40%)  |  (+0.76%)      |
~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+

NPS2:
==========+===========+================+=================
| Nr      |  Mel v3   | tip/sched/core |   Spread across|
| Stream  |           |                |   LLCs of NUMA |
| Threads |           |                |   groups       |
==========+===========+================+=================
|       4 | 110888.35 |       63067.26 |  104971.36     |
|         |           |      (-43.12%) |   (-5.34%)     |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+
|       8 | 174983.85 |       96226.39 |  177558.65     |
|         |           |      (-45.01%) |   (+1.47%)     |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+
|      16 | 252943.21 |       106474.3 |  260749.60     |
|         |           |      (-57.90%) |  (+1.47%)      |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+
|      32 | 248540.52 |      113864.09 |  254141.33     |
|         |           |      (-54.19%) |  (+2.25%)      |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+
|      64 | 248383.17 |      137101.85 |  255018.52     |
|         |           |      (-44.80%) |  (+2.67%)      |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+
|     128 | 250123.31 |      257031.29 |  254457.13     |
|         |           |      (+2.76%)  |  (+1.73%)      |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+

NPS4:
==========+===========+================+=================
| Nr      |  Mel v3   | tip/sched/core |   Spread across|
| Stream  |           |                |   LLCs of NUMA |
| Threads |           |                |   groups       |
==========+===========+================+=================
|       4 | 108580.91 |       31746.06 |   97585.53	|
|         |           |      (-70.76%) |  (-10.12%)     |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+
|       8 | 150259.94 |       64841.89 |  154954.75     |
|         |           |      (-56.84%) |  (+3.12%)      |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+
|      16 | 234137.41 |      106780.26 |  261005.27     |
|         |           |      (-54.39%) |  (+11.48%)     |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+
|      32 | 241583.06 |      147572.50 |  257004.22     |
|         |           |      (-38.91%) |  (+6.38%)      |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+
|      64 | 248511.64 |      166183.06 |  259599.32     |
|         |           |      (-33.12%) |  (+4.46%)      |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+
|     128 | 252227.34 |      239270.85 |  259117.18     |
|         |           |      (-5.13%)  |  (2.73%)       | 
~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+

--
Thanks and Regards
gautham.


* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-06  8:48     ` Gautham R. Shenoy
@ 2021-12-06 14:51       ` Peter Zijlstra
  0 siblings, 0 replies; 46+ messages in thread
From: Peter Zijlstra @ 2021-12-06 14:51 UTC (permalink / raw)
  To: Gautham R. Shenoy
  Cc: Mel Gorman, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML,
	Sadagopan Srinivasan, Krupa Ramakrishnan

On Mon, Dec 06, 2021 at 02:18:21PM +0530, Gautham R. Shenoy wrote:
> On Sat, Dec 04, 2021 at 11:40:56AM +0100, Peter Zijlstra wrote:
> > On Wed, Dec 01, 2021 at 03:18:44PM +0000, Mel Gorman wrote:
> > > +	/* Calculate allowed NUMA imbalance */
> > > +	for_each_cpu(i, cpu_map) {
> > > +		int imb_numa_nr = 0;
> > > +
> > > +		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> > > +			struct sched_domain *child = sd->child;
> > > +
> > > +			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> > > +			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
> > > +				int nr_groups;
> > > +
> > > +				nr_groups = sd->span_weight / child->span_weight;
> > > +				imb_numa_nr = max(1U, ((child->span_weight) >> 1) /
> > > +						(nr_groups * num_online_nodes()));
> > > +			}
> > > +
> > > +			sd->imb_numa_nr = imb_numa_nr;
> > > +		}
> > 
> > OK, so let's see. All domains with SHARE_PKG_RESOURCES set will have
> > imb_numa_nr = 0, all domains above it will have the same value
> > calculated here.
> > 
> > So far so good I suppose :-)
> 
> Well, we will still have the same imb_numa_nr set for different NUMA
> domains which have different distances!

Fair enough; that would need making the computation depend on more
things, but that shouldn't be too hard.

> > Then nr_groups is what it says on the tin; we could've equally well
> > iterated sd->groups and gotten the same number, but this is simpler.
> > 
> > Now, imb_numa_nr is where the magic happens, the way it's written
> > doesn't help, but it's something like:
> > 
> > 	(child->span_weight / 2) / (nr_groups * num_online_nodes())
> > 
> > With a minimum value of 1. So the larger the system is, or the smaller
> > the LLCs, the smaller this number gets, right?
> > 
> > So my ivb-ep that has 20 cpus in a LLC and 2 nodes, will get: (20 / 2)
> > / (1 * 2) = 10, while the ivb-ex will get: (20/2) / (1*4) = 5.
> > 
> > But a Zen box that has only like 4 CPUs per LLC will have 1, regardless
> > of how many nodes it has.
> 
> That's correct. On a Zen3 box with 2 sockets with 64 cores per
> sockets, we can configure it with either 1/2/4 Nodes Per Socket
> (NPS). The imb_numa_nr value for each of the NPS configurations is as
> follows:

Cute; that's similar to the whole Intel sub-numa-cluster stuff then;
perhaps update the comment that goes with x86_has_numa_in_package?
Currently that only mentions AMD Magny-Cours, which was a few generations
ago.


> NPS 4:
> ~~~~~~~
> SMT [span_wt=2]
>    --> MC [span_wt=16, LLC]
>        --> NODE [span_wt=32]
>            --> NUMA [span_wt=128, SD_NUMA]
> 	       --> NUMA [span_wt=256, SD_NUMA]

OK, so at max nodes you still have at least 2 LLCs per node.

> While the imb_numa_nr = 1 is good for the NUMA domain within a socket
> (the lower NUMA domains in in NPS2 and NPS4 modes), it appears to be a
> little bit aggressive for the NUMA domain spanning the two sockets. If
> we have only a pair of communicating tasks in a socket, we will end up
> spreading them across the two sockets with this patch.
> 
> > 
> > Now, I'm thinking this assumes (fairly reasonable) that the level above
> > LLC is a node, but I don't think we need to assume this, while also not
> > assuming the balance domain spans the whole machine (yay paritions!).
> > 
> > 	for (top = sd; top->parent; top = top->parent)
> > 		;
> > 
> > 	nr_llcs = top->span_weight / child->span_weight;
> > 	imb_numa_nr = max(1, child->span_weight / nr_llcs);
> > 
> > which for my ivb-ep gets me:  20 / (40 / 20) = 10
> > and the Zen system will have:  4 / (huge number) = 1
> > 
> > Now, the exp: a / (b / a) is equivalent to a * (a / b) or a^2/b, so we
> > can also write the above as:
> > 
> > 	(child->span_weight * child->span_weight) / top->span_weight;
> 
> 
> Assuming that "child" here refers to the LLC domain, on Zen3 we would have
> (a) child->span_weight = 16. (b) top->span_weight = 256.
> 
> So we get a^2/b = 1.

Yes, it would be in the same place as the current imb_numa_nr
calculation, so child would be the largest domain having
SHARE_PKG_RESOURCES, aka. LLC.

> Last week, I tried a modification on top of Mel's current patch where
> we spread tasks between the LLCs of the groups within each NUMA domain
> and compute the value of imb_numa_nr per NUMA domain. The idea is to set
> 
>     sd->imb_numa_nr = min(1U,
>     		         (Number of LLCs in each sd group / Number of sd groups))

s/min/max/

Which is basically something like:

for_each (sd in NUMA):
  llc_per_group = child->span / llc->span;
  nr_group = sd->span / child->span;
  imb = max(1, llc_per_group / nr_group);
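
(As a quick check against the NPS2 numbers above: for the first NUMA
domain, child span = 64 and LLC span = 16, so llc_per_group = 4,
nr_group = 128 / 64 = 2 and imb = max(1, 4 / 2) = 2; for the second,
child span = 128, so llc_per_group = 8, nr_group = 256 / 128 = 2 and
imb = 4, matching the values listed above.)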


> This won't work for processors which have a single LLC in a socket,
> since the sd->imb_numa_nr will be 1 which is probably too low. 

Right.

> FWIW,
> with this heuristic, the imb_numa_nr across the different NPS
> configurations of a Zen3 server is as follows
> 
> NPS1:
>     NUMA domain: nr_llcs_per_group = 8. nr_groups = 2. imb_numa_nr = 4
> 
> NPS2:
>     1st NUMA domain: nr_llcs_per_group = 4. nr_groups = 2. imb_numa_nr = 2.
>     2nd NUMA domain: nr_llcs_per_group = 8. nr_groups = 2. imb_numa_nr = 4.
> 
> NPS4:
>     1st NUMA domain: nr_llcs_per_group = 2. nr_groups = 4. imb_numa_nr = min(1, 2/4) = 1.
>     2nd NUMA domain: nr_llcs_per_group = 8. nr_groups = 2. imb_numa_nr = 4.
> 
> Thus, at the highest NUMA level (socket), we don't spread across the
> two sockets until there are 4 tasks within the socket. If there is
> only a pair of communicating tasks in the socket, they will be left
> alone within that socket. 

Something that might work:

imb = 0;
imb_span = 1;

for_each_sd(sd) {
	child = sd->child;

	if (!(sd->flags & SD_SPR) && child && (child->flags & SD_SPR)) {

		imb = /* initial magic */
		imb_span = sd->span;
		sd->imb = imb;

	} else if (imb) {
		sd->imb = imb * (sd->span / imb_span);
	}
}

Where we calculate the initial imbalance for the LLC boundary, and then
increase that for subsequent domains based on how often that boundary sd
fits in it. That gives the same progression you have, but also works for
NODE==LLC I think.
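
(For instance, if the initial value at the LLC boundary came out as 1 on
the NPS2 topology above, where that boundary domain spans 64 CPUs, the
first NUMA domain (span 128) would get 1 * (128 / 64) = 2 and the second
(span 256) would get 1 * (256 / 64) = 4, i.e. the same 2/4 progression as
the heuristic above.)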


* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-04 10:40   ` Peter Zijlstra
  2021-12-06  8:48     ` Gautham R. Shenoy
@ 2021-12-06 15:12     ` Mel Gorman
  2021-12-09 14:23       ` Valentin Schneider
  1 sibling, 1 reply; 46+ messages in thread
From: Mel Gorman @ 2021-12-06 15:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham R. Shenoy,
	LKML

On Sat, Dec 04, 2021 at 11:40:56AM +0100, Peter Zijlstra wrote:
> On Wed, Dec 01, 2021 at 03:18:44PM +0000, Mel Gorman wrote:
> > +	/* Calculate allowed NUMA imbalance */
> > +	for_each_cpu(i, cpu_map) {
> > +		int imb_numa_nr = 0;
> > +
> > +		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> > +			struct sched_domain *child = sd->child;
> > +
> > +			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> > +			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
> > +				int nr_groups;
> > +
> > +				nr_groups = sd->span_weight / child->span_weight;
> > +				imb_numa_nr = max(1U, ((child->span_weight) >> 1) /
> > +						(nr_groups * num_online_nodes()));
> > +			}
> > +
> > +			sd->imb_numa_nr = imb_numa_nr;
> > +		}
> 
> OK, so let's see. All domains with SHARE_PKG_RESOURCES set will have
> imb_numa_nr = 0, all domains above it will have the same value
> calculated here.
> 
> So far so good I suppose :-)
> 

Good start :)

> Then nr_groups is what it says on the tin; we could've equally well
> iterated sd->groups and gotten the same number, but this is simpler.
> 

I also thought it would be clearer.

> Now, imb_numa_nr is where the magic happens, the way it's written
> doesn't help, but it's something like:
> 
> 	(child->span_weight / 2) / (nr_groups * num_online_nodes())
> 
> With a minimum value of 1. So the larger the system is, or the smaller
> the LLCs, the smaller this number gets, right?
> 

Correct.

> So my ivb-ep that has 20 cpus in a LLC and 2 nodes, will get: (20 / 2)
> / (1 * 2) = 10, while the ivb-ex will get: (20/2) / (1*4) = 5.
> 
> But a Zen box that has only like 4 CPUs per LLC will have 1, regardless
> of how many nodes it has.
> 

The minimum of one was to allow a pair of communicating tasks to remain
on one node even if it's imbalanced.

> Now, I'm thinking this assumes (fairly reasonable) that the level above
> LLC is a node, but I don't think we need to assume this, while also not
> assuming the balance domain spans the whole machine (yay partitions!).
> 
> 	for (top = sd; top->parent; top = top->parent)
> 		;
> 
> 	nr_llcs = top->span_weight / child->span_weight;
> 	imb_numa_nr = max(1, child->span_weight / nr_llcs);
> 
> which for my ivb-ep gets me:  20 / (40 / 20) = 10
> and the Zen system will have:  4 / (huge number) = 1
> 
> Now, the exp: a / (b / a) is equivalent to a * (a / b) or a^2/b, so we
> can also write the above as:
> 
> 	(child->span_weight * child->span_weight) / top->span_weight;
> 

Gautham had similar reasoning to calculate the imbalance at each
higher-level domain instead of using a static value throughout, and
it does make sense. Calculating it for each level and splitting the
imbalance between two domains, this works out as


	/*
	 * Calculate an allowed NUMA imbalance such that LLCs do not get
	 * imbalanced.
	 */
	for_each_cpu(i, cpu_map) {
		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
			struct sched_domain *child = sd->child;

			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
				struct sched_domain *top = sd;
				unsigned int llc_sq;

				/*
				 * nr_llcs = (top->span_weight / llc_weight);
				 * imb = (child_weight / nr_llcs) >> 1
				 *
				 * is equivalent to
				 *
				 * imb = (llc_weight^2 / top->span_weight) >> 1
				 *
				 */
				llc_sq = child->span_weight * child->span_weight;
				while (top) {
					top->imb_numa_nr = max(1U,
						(llc_sq / top->span_weight) >> 1);
					top = top->parent;
				}

				break;
			}
		}
	}

I'll test this and should have results tomorrow.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-06 15:12     ` Mel Gorman
@ 2021-12-09 14:23       ` Valentin Schneider
  2021-12-09 15:43         ` Mel Gorman
  0 siblings, 1 reply; 46+ messages in thread
From: Valentin Schneider @ 2021-12-09 14:23 UTC (permalink / raw)
  To: Mel Gorman, Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Aubrey Li, Barry Song,
	Mike Galbraith, Srikar Dronamraju, Gautham R. Shenoy, LKML

On 06/12/21 15:12, Mel Gorman wrote:
> Gautham had similar reasoning to calculate the imbalance at each
> higher-level domain instead of using a static value throughout, and
> it does make sense. Calculating it for each level and splitting the
> imbalance between two domains, this works out as
>
>
>       /*
>        * Calculate an allowed NUMA imbalance such that LLCs do not get
>        * imbalanced.
>        */
>       for_each_cpu(i, cpu_map) {
>               for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
>                       struct sched_domain *child = sd->child;
>
>                       if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
>                           (child->flags & SD_SHARE_PKG_RESOURCES)) {
>                               struct sched_domain *top = sd;
>                               unsigned int llc_sq;
>
>                               /*
>                                * nr_llcs = (top->span_weight / llc_weight);
>                                * imb = (child_weight / nr_llcs) >> 1
>                                *
>                                * is equivalent to
>                                *
>                                * imb = (llc_weight^2 / top->span_weight) >> 1
>                                *
>                                */
>                               llc_sq = child->span_weight * child->span_weight;
>                               while (top) {
>                                       top->imb_numa_nr = max(1U,
>                                               (llc_sq / top->span_weight) >> 1);
>                                       top = top->parent;
>                               }
>
>                               break;
>                       }
>               }
>       }
>

IIRC Peter suggested punting that logic to before domains get degenerated,
but I don't see how that helps here. If you just want to grab the LLC
domain (aka highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES)) and compare
its span with that of its parents, that can happen after the degeneration,
no?
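
A minimal sketch of that alternative, reusing the imb formula quoted
above (untested, purely illustrative, and the helper name is made up):

static void set_imb_numa_nr(int cpu)
{
        struct sched_domain *llc, *sd;
        unsigned int llc_weight;

        llc = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
        if (!llc)
                return;

        /* Walk the parents of the LLC domain and apply the same formula */
        llc_weight = llc->span_weight;
        for (sd = llc->parent; sd; sd = sd->parent)
                sd->imb_numa_nr = max(1U,
                        (llc_weight * llc_weight / sd->span_weight) >> 1);
}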

> I'll test this and should have results tomorrow.
>
> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-09 14:23       ` Valentin Schneider
@ 2021-12-09 15:43         ` Mel Gorman
  0 siblings, 0 replies; 46+ messages in thread
From: Mel Gorman @ 2021-12-09 15:43 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham R. Shenoy,
	LKML

On Thu, Dec 09, 2021 at 02:23:40PM +0000, Valentin Schneider wrote:
> On 06/12/21 15:12, Mel Gorman wrote:
> > Gautham had similar reasoning to calculate the imbalance at each
> > higher-level domain instead of using a static value throughout, and
> > it does make sense. Calculating it for each level and splitting the
> > imbalance between two domains, this works out as
> >
> >
> >       /*
> >        * Calculate an allowed NUMA imbalance such that LLCs do not get
> >        * imbalanced.
> >        */
> >       for_each_cpu(i, cpu_map) {
> >               for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> >                       struct sched_domain *child = sd->child;
> >
> >                       if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> >                           (child->flags & SD_SHARE_PKG_RESOURCES)) {
> >                               struct sched_domain *top = sd;
> >                               unsigned int llc_sq;
> >
> >                               /*
> >                                * nr_llcs = (top->span_weight / llc_weight);
> >                                * imb = (child_weight / nr_llcs) >> 1
> >                                *
> >                                * is equivalent to
> >                                *
> >                                * imb = (llc_weight^2 / top->span_weight) >> 1
> >                                *
> >                                */
> >                               llc_sq = child->span_weight * child->span_weight;
> >                               while (top) {
> >                                       top->imb_numa_nr = max(1U,
> >                                               (llc_sq / top->span_weight) >> 1);
> >                                       top = top->parent;
> >                               }
> >
> >                               break;
> >                       }
> >               }
> >       }
> >
> 
> IIRC Peter suggested punting that logic to before domains get degenerated,
> but I don't see how that helps here. If you just want to grab the LLC
> domain (aka highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES)) and compare
> its span with that of its parents, that can happen after the degeneration,
> no?
> 

I guess we could but I don't see any specific advantage to doing that.

> > I'll test this and should have results tomorrow.
> >

The test results indicated that there was still a problem with
communicating tasks being pulled apart so am testing a new version.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2022-02-08  9:43 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans " Mel Gorman
                     ` (2 preceding siblings ...)
  2022-02-14 10:27   ` Srikar Dronamraju
@ 2022-02-14 11:03   ` Vincent Guittot
  3 siblings, 0 replies; 46+ messages in thread
From: Vincent Guittot @ 2022-02-14 11:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy,
	K Prateek Nayak, LKML

On Tue, 8 Feb 2022 at 10:44, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA
> nodes") allowed an imbalance between NUMA nodes such that communicating
> tasks would not be pulled apart by the load balancer. This works fine when
> there is a 1:1 relationship between LLC and node but can be suboptimal
> for multiple LLCs if independent tasks prematurely use CPUs sharing cache.
>
> Zen* has multiple LLCs per node with local memory channels and due to
> the allowed imbalance, it's far harder to tune some workloads to run
> optimally than it is on hardware that has 1 LLC per node. This patch
> allows an imbalance to exist up to the point where LLCs should be balanced
> between nodes.
>
> On a Zen3 machine running STREAM parallelised with OMP to have one instance
> per LLC and without binding, the results are
>
>                             5.17.0-rc0             5.17.0-rc0
>                                vanilla       sched-numaimb-v6
> MB/sec copy-16    162596.94 (   0.00%)   580559.74 ( 257.05%)
> MB/sec scale-16   136901.28 (   0.00%)   374450.52 ( 173.52%)
> MB/sec add-16     157300.70 (   0.00%)   564113.76 ( 258.62%)
> MB/sec triad-16   151446.88 (   0.00%)   564304.24 ( 272.61%)
>
> STREAM can use directives to force the spread if the OpenMP is new
> enough but that doesn't help if an application uses threads and
> it's not known in advance how many threads will be created.
>
> Coremark is a CPU and cache intensive benchmark parallelised with
> threads. When running with 1 thread per core, the vanilla kernel
> allows threads to contend on cache. With the patch;
>
>                                5.17.0-rc0             5.17.0-rc0
>                                   vanilla       sched-numaimb-v5
> Min       Score-16   368239.36 (   0.00%)   389816.06 (   5.86%)
> Hmean     Score-16   388607.33 (   0.00%)   427877.08 *  10.11%*
> Max       Score-16   408945.69 (   0.00%)   481022.17 (  17.62%)
> Stddev    Score-16    15247.04 (   0.00%)    24966.82 ( -63.75%)
> CoeffVar  Score-16        3.92 (   0.00%)        5.82 ( -48.48%)
>
> It can also make a big difference for semi-realistic workloads
> like specjbb which can execute arbitrary numbers of threads without
> advance knowledge of how they should be placed. Even in cases where
> the average performance is neutral, the results are more stable.
>
>                                5.17.0-rc0             5.17.0-rc0
>                                   vanilla       sched-numaimb-v6
> Hmean     tput-1      71631.55 (   0.00%)    73065.57 (   2.00%)
> Hmean     tput-8     582758.78 (   0.00%)   556777.23 (  -4.46%)
> Hmean     tput-16   1020372.75 (   0.00%)  1009995.26 (  -1.02%)
> Hmean     tput-24   1416430.67 (   0.00%)  1398700.11 (  -1.25%)
> Hmean     tput-32   1687702.72 (   0.00%)  1671357.04 (  -0.97%)
> Hmean     tput-40   1798094.90 (   0.00%)  2015616.46 *  12.10%*
> Hmean     tput-48   1972731.77 (   0.00%)  2333233.72 (  18.27%)
> Hmean     tput-56   2386872.38 (   0.00%)  2759483.38 (  15.61%)
> Hmean     tput-64   2909475.33 (   0.00%)  2925074.69 (   0.54%)
> Hmean     tput-72   2585071.36 (   0.00%)  2962443.97 (  14.60%)
> Hmean     tput-80   2994387.24 (   0.00%)  3015980.59 (   0.72%)
> Hmean     tput-88   3061408.57 (   0.00%)  3010296.16 (  -1.67%)
> Hmean     tput-96   3052394.82 (   0.00%)  2784743.41 (  -8.77%)
> Hmean     tput-104  2997814.76 (   0.00%)  2758184.50 (  -7.99%)
> Hmean     tput-112  2955353.29 (   0.00%)  2859705.09 (  -3.24%)
> Hmean     tput-120  2889770.71 (   0.00%)  2764478.46 (  -4.34%)
> Hmean     tput-128  2871713.84 (   0.00%)  2750136.73 (  -4.23%)
> Stddev    tput-1       5325.93 (   0.00%)     2002.53 (  62.40%)
> Stddev    tput-8       6630.54 (   0.00%)    10905.00 ( -64.47%)
> Stddev    tput-16     25608.58 (   0.00%)     6851.16 (  73.25%)
> Stddev    tput-24     12117.69 (   0.00%)     4227.79 (  65.11%)
> Stddev    tput-32     27577.16 (   0.00%)     8761.05 (  68.23%)
> Stddev    tput-40     59505.86 (   0.00%)     2048.49 (  96.56%)
> Stddev    tput-48    168330.30 (   0.00%)    93058.08 (  44.72%)
> Stddev    tput-56    219540.39 (   0.00%)    30687.02 (  86.02%)
> Stddev    tput-64    121750.35 (   0.00%)     9617.36 (  92.10%)
> Stddev    tput-72    223387.05 (   0.00%)    34081.13 (  84.74%)
> Stddev    tput-80    128198.46 (   0.00%)    22565.19 (  82.40%)
> Stddev    tput-88    136665.36 (   0.00%)    27905.97 (  79.58%)
> Stddev    tput-96    111925.81 (   0.00%)    99615.79 (  11.00%)
> Stddev    tput-104   146455.96 (   0.00%)    28861.98 (  80.29%)
> Stddev    tput-112    88740.49 (   0.00%)    58288.23 (  34.32%)
> Stddev    tput-120   186384.86 (   0.00%)    45812.03 (  75.42%)
> Stddev    tput-128    78761.09 (   0.00%)    57418.48 (  27.10%)
>
> Similarly, for embarrassingly parallel problems like NPB-ep, there are
> improvements due to better spreading across LLC when the machine is not
> fully utilised.
>
>                               vanilla       sched-numaimb-v6
> Min       ep.D       31.79 (   0.00%)       26.11 (  17.87%)
> Amean     ep.D       31.86 (   0.00%)       26.17 *  17.86%*
> Stddev    ep.D        0.07 (   0.00%)        0.05 (  24.41%)
> CoeffVar  ep.D        0.22 (   0.00%)        0.20 (   7.97%)
> Max       ep.D       31.93 (   0.00%)       26.21 (  17.91%)
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Good to see that you have been able to move on SD_NUMA instead of
SD_PREFER_SIBLING.
The allowed imbalance looks also more consistent whatever the number of LLC

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>


> ---
>  include/linux/sched/topology.h |  1 +
>  kernel/sched/fair.c            | 22 +++++++-------
>  kernel/sched/topology.c        | 53 ++++++++++++++++++++++++++++++++++
>  3 files changed, 66 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 8054641c0a7b..56cffe42abbc 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -93,6 +93,7 @@ struct sched_domain {
>         unsigned int busy_factor;       /* less balancing by factor if busy */
>         unsigned int imbalance_pct;     /* No balance until over watermark */
>         unsigned int cache_nice_tries;  /* Leave cache hot tasks for # tries */
> +       unsigned int imb_numa_nr;       /* Nr running tasks that allows a NUMA imbalance */
>
>         int nohz_idle;                  /* NOHZ IDLE status */
>         int flags;                      /* See SD_* */
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4592ccf82c34..538756bd8e7f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1489,6 +1489,7 @@ struct task_numa_env {
>
>         int src_cpu, src_nid;
>         int dst_cpu, dst_nid;
> +       int imb_numa_nr;
>
>         struct numa_stats src_stats, dst_stats;
>
> @@ -1503,7 +1504,7 @@ struct task_numa_env {
>  static unsigned long cpu_load(struct rq *rq);
>  static unsigned long cpu_runnable(struct rq *rq);
>  static inline long adjust_numa_imbalance(int imbalance,
> -                                       int dst_running, int dst_weight);
> +                                       int dst_running, int imb_numa_nr);
>
>  static inline enum
>  numa_type numa_classify(unsigned int imbalance_pct,
> @@ -1884,7 +1885,7 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>                 dst_running = env->dst_stats.nr_running + 1;
>                 imbalance = max(0, dst_running - src_running);
>                 imbalance = adjust_numa_imbalance(imbalance, dst_running,
> -                                                       env->dst_stats.weight);
> +                                                 env->imb_numa_nr);
>
>                 /* Use idle CPU if there is no imbalance */
>                 if (!imbalance) {
> @@ -1949,8 +1950,10 @@ static int task_numa_migrate(struct task_struct *p)
>          */
>         rcu_read_lock();
>         sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu));
> -       if (sd)
> +       if (sd) {
>                 env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
> +               env.imb_numa_nr = sd->imb_numa_nr;
> +       }
>         rcu_read_unlock();
>
>         /*
> @@ -9003,10 +9006,9 @@ static bool update_pick_idlest(struct sched_group *idlest,
>   * This is an approximation as the number of running tasks may not be
>   * related to the number of busy CPUs due to sched_setaffinity.
>   */
> -static inline bool
> -allow_numa_imbalance(unsigned int running, unsigned int weight)
> +static inline bool allow_numa_imbalance(int running, int imb_numa_nr)
>  {
> -       return (running < (weight >> 2));
> +       return running <= imb_numa_nr;
>  }
>
>  /*
> @@ -9146,7 +9148,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>                          * allowed. If there is a real need of migration,
>                          * periodic load balance will take care of it.
>                          */
> -                       if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, local_sgs.group_weight))
> +                       if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr))
>                                 return NULL;
>                 }
>
> @@ -9238,9 +9240,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  #define NUMA_IMBALANCE_MIN 2
>
>  static inline long adjust_numa_imbalance(int imbalance,
> -                               int dst_running, int dst_weight)
> +                               int dst_running, int imb_numa_nr)
>  {
> -       if (!allow_numa_imbalance(dst_running, dst_weight))
> +       if (!allow_numa_imbalance(dst_running, imb_numa_nr))
>                 return imbalance;
>
>         /*
> @@ -9352,7 +9354,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>                 /* Consider allowing a small imbalance between NUMA groups */
>                 if (env->sd->flags & SD_NUMA) {
>                         env->imbalance = adjust_numa_imbalance(env->imbalance,
> -                               local->sum_nr_running + 1, local->group_weight);
> +                               local->sum_nr_running + 1, env->sd->imb_numa_nr);
>                 }
>
>                 return;
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index d201a7052a29..e6cd55951304 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2242,6 +2242,59 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>                 }
>         }
>
> +       /*
> +        * Calculate an allowed NUMA imbalance such that LLCs do not get
> +        * imbalanced.
> +        */
> +       for_each_cpu(i, cpu_map) {
> +               unsigned int imb = 0;
> +               unsigned int imb_span = 1;
> +
> +               for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> +                       struct sched_domain *child = sd->child;
> +
> +                       if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> +                           (child->flags & SD_SHARE_PKG_RESOURCES)) {
> +                               struct sched_domain *top, *top_p;
> +                               unsigned int nr_llcs;
> +
> +                               /*
> +                                * For a single LLC per node, allow an
> +                                * imbalance up to 25% of the node. This is an
> +                                * arbitrary cutoff based on SMT-2 to balance
> +                                * between memory bandwidth and avoiding
> +                                * premature sharing of HT resources and SMT-4
> +                                * or SMT-8 *may* benefit from a different
> +                                * cutoff.
> +                                *
> +                                * For multiple LLCs, allow an imbalance
> +                                * until multiple tasks would share an LLC
> +                                * on one node while LLCs on another node
> +                                * remain idle.
> +                                */
> +                               nr_llcs = sd->span_weight / child->span_weight;
> +                               if (nr_llcs == 1)
> +                                       imb = sd->span_weight >> 2;
> +                               else
> +                                       imb = nr_llcs;
> +                               sd->imb_numa_nr = imb;
> +
> +                               /* Set span based on the first NUMA domain. */
> +                               top = sd;
> +                               top_p = top->parent;
> +                               while (top_p && !(top_p->flags & SD_NUMA)) {
> +                                       top = top->parent;
> +                                       top_p = top->parent;
> +                               }
> +                               imb_span = top_p ? top_p->span_weight : sd->span_weight;
> +                       } else {
> +                               int factor = max(1U, (sd->span_weight / imb_span));
> +
> +                               sd->imb_numa_nr = imb * factor;
> +                       }
> +               }
> +       }
> +
>         /* Calculate CPU capacity for physical packages and nodes */
>         for (i = nr_cpumask_bits-1; i >= 0; i--) {
>                 if (!cpumask_test_cpu(i, cpu_map))
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2022-02-08  9:43 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans " Mel Gorman
  2022-02-08 16:19   ` Gautham R. Shenoy
  2022-02-09  5:10   ` K Prateek Nayak
@ 2022-02-14 10:27   ` Srikar Dronamraju
  2022-02-14 11:03   ` Vincent Guittot
  3 siblings, 0 replies; 46+ messages in thread
From: Srikar Dronamraju @ 2022-02-14 10:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Gautham Shenoy,
	K Prateek Nayak, LKML

* Mel Gorman <mgorman@techsingularity.net> [2022-02-08 09:43:34]:

> Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA
> nodes") allowed an imbalance between NUMA nodes such that communicating
> tasks would not be pulled apart by the load balancer. This works fine when
> there is a 1:1 relationship between LLC and node but can be suboptimal
> for multiple LLCs if independent tasks prematurely use CPUs sharing cache.
> 
> Zen* has multiple LLCs per node with local memory channels and due to
> the allowed imbalance, it's far harder to tune some workloads to run
> optimally than it is on hardware that has 1 LLC per node. This patch
> allows an imbalance to exist up to the point where LLCs should be balanced
> between nodes.
> 
> On a Zen3 machine running STREAM parallelised with OMP to have on instance
> per LLC the results and without binding, the results are
> 
>                             5.17.0-rc0             5.17.0-rc0
> 

Looks good to me.

Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2022-02-09 10:33     ` Mel Gorman
@ 2022-02-11 19:02       ` Jirka Hladky
  0 siblings, 0 replies; 46+ messages in thread
From: Jirka Hladky @ 2022-02-11 19:02 UTC (permalink / raw)
  To: Mel Gorman
  Cc: K Prateek Nayak, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith,
	Srikar Dronamraju, Gautham Shenoy, LKML

Hi Mel,

we have tested 5.17.0_rc2 + v6 of your patch series and the results
are very good!

We see performance gains up to 25% for the hackbench (32 processes).
Even more importantly, we don't see any performance losses that we
have experienced with the previous versions of the patch series for
NAS and SPECjbb2005 workloads.

Tested-by: Jirka Hladky <jhladky@redhat.com>

Thanks a lot
Jirka


On Wed, Feb 9, 2022 at 12:08 PM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Wed, Feb 09, 2022 at 10:40:15AM +0530, K Prateek Nayak wrote:
> > There is a significant improvement across the board
> > with v6 outperforming tip/sched/core in every case!
> >
> > Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> >
>
> Thanks very much Prateek and Gautham for reviewing and testing!
>
> --
> Mel Gorman
> SUSE Labs
>


-- 
-Jirka


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2022-02-09  5:10   ` K Prateek Nayak
@ 2022-02-09 10:33     ` Mel Gorman
  2022-02-11 19:02       ` Jirka Hladky
  0 siblings, 1 reply; 46+ messages in thread
From: Mel Gorman @ 2022-02-09 10:33 UTC (permalink / raw)
  To: K Prateek Nayak
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju,
	Gautham Shenoy, LKML

On Wed, Feb 09, 2022 at 10:40:15AM +0530, K Prateek Nayak wrote:
> There is a significant improvement across the board
> with v6 outperforming tip/sched/core in every case!
> 
> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> 

Thanks very much Prateek and Gautham for reviewing and testing!

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2022-02-08  9:43 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans " Mel Gorman
  2022-02-08 16:19   ` Gautham R. Shenoy
@ 2022-02-09  5:10   ` K Prateek Nayak
  2022-02-09 10:33     ` Mel Gorman
  2022-02-14 10:27   ` Srikar Dronamraju
  2022-02-14 11:03   ` Vincent Guittot
  3 siblings, 1 reply; 46+ messages in thread
From: K Prateek Nayak @ 2022-02-09  5:10 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju,
	Gautham Shenoy, LKML

Hello Mel,

On 2/8/2022 3:13 PM, Mel Gorman wrote:
[..snip..]
> On a Zen3 machine running STREAM parallelised with OMP to have one instance
> per LLC and without binding, the results are
>
>                             5.17.0-rc0             5.17.0-rc0
>                                vanilla       sched-numaimb-v6
> MB/sec copy-16    162596.94 (   0.00%)   580559.74 ( 257.05%)
> MB/sec scale-16   136901.28 (   0.00%)   374450.52 ( 173.52%)
> MB/sec add-16     157300.70 (   0.00%)   564113.76 ( 258.62%)
> MB/sec triad-16   151446.88 (   0.00%)   564304.24 ( 272.61%)

I was able to test STREAM without binding on different
NPS configurations of a two-socket Zen3 machine.

The results look good:

sched-tip 	- 5.17.0-rc1 tip sched/core
mel-v6 		- 5.17.0-rc1 tip sched/core + this patch

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Stream with 16 threads.
built with -DSTREAM_ARRAY_SIZE=128000000, -DNTIMES=10
Zen3, 64C128T per socket, 2 sockets,
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

NPS1

Test:                sched-tip                  mel-v6
 Copy:   114470.18 (0.00 pct)    152806.94 (33.49 pct)
Scale:   111575.12 (0.00 pct)    189784.57 (70.09 pct)
  Add:   125436.15 (0.00 pct)    213371.05 (70.10 pct)
Triad:   123068.86 (0.00 pct)    209809.11 (70.48 pct)

NPS2

Test:               sched-tip                   mel-v6
 Copy:   57936.28 (0.00 pct)     155038.70 (167.60 pct)
Scale:   55599.30 (0.00 pct)     192601.59 (246.41 pct)
  Add:   63096.96 (0.00 pct)     211462.58 (235.13 pct)
Triad:   61983.39 (0.00 pct)     208909.34 (237.04 pct)

NPS4

Test:               sched-tip                   mel-v6
 Copy:   43946.42 (0.00 pct)     119583.69 (172.11 pct)
Scale:   33750.96 (0.00 pct)     180130.83 (433.70 pct)
  Add:   39109.72 (0.00 pct)     170296.68 (335.43 pct)
Triad:   36598.88 (0.00 pct)     169953.47 (364.36 pct)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Stream with 16 threads.
built with -DSTREAM_ARRAY_SIZE=128000000, -DNTIMES=100
Zen3, 64C128T per socket, 2 sockets,
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

NPS1

Test:               sched-tip                   mel-v6
 Copy:   132402.79 (0.00 pct)    225587.85 (70.37 pct)
Scale:   126923.02 (0.00 pct)    214363.58 (68.89 pct)
  Add:   145596.55 (0.00 pct)    260901.92 (79.19 pct)
Triad:   143092.91 (0.00 pct)    249081.79 (74.06 pct)

NPS 2

Test:               sched-tip                   mel-v6
 Copy:   107386.27 (0.00 pct)    227623.31 (111.96 pct)
Scale:   100941.44 (0.00 pct)    218116.63 (116.08 pct)
  Add:   115854.52 (0.00 pct)    272756.95 (135.43 pct)
Triad:   113369.96 (0.00 pct)    260235.32 (129.54 pct)

NPS4

Test:               sched-tip                   mel-v6
 Copy:   91083.07 (0.00 pct)     247163.90 (171.36 pct)
Scale:   90352.54 (0.00 pct)     223914.31 (147.82 pct)
  Add:   101973.98 (0.00 pct)    272842.42 (167.56 pct)
Triad:   99773.65 (0.00 pct)     258904.54 (159.49 pct)


There is a significant improvement across the board
with v6 outperforming tip/sched/core in every case!

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

--
Thanks and Regards
Prateek


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2022-02-08  9:43 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans " Mel Gorman
@ 2022-02-08 16:19   ` Gautham R. Shenoy
  2022-02-09  5:10   ` K Prateek Nayak
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 46+ messages in thread
From: Gautham R. Shenoy @ 2022-02-08 16:19 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju,
	K Prateek Nayak, LKML

On Tue, Feb 08, 2022 at 09:43:34AM +0000, Mel Gorman wrote:

[..snip..]

> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index d201a7052a29..e6cd55951304 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2242,6 +2242,59 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>  		}
>  	}
>  
> +	/*
> +	 * Calculate an allowed NUMA imbalance such that LLCs do not get
> +	 * imbalanced.
> +	 */
> +	for_each_cpu(i, cpu_map) {
> +		unsigned int imb = 0;
> +		unsigned int imb_span = 1;
> +
> +		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> +			struct sched_domain *child = sd->child;
> +
> +			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> +			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
> +				struct sched_domain *top, *top_p;
> +				unsigned int nr_llcs;
> +
> +				/*
> +				 * For a single LLC per node, allow an
> +				 * imbalance up to 25% of the node. This is an
> +				 * arbitrary cutoff based on SMT-2 to balance
> +				 * between memory bandwidth and avoiding
> +				 * premature sharing of HT resources and SMT-4
> +				 * or SMT-8 *may* benefit from a different
> +				 * cutoff.
> +				 *
> +				 * For multiple LLCs, allow an imbalance
> +				 * until multiple tasks would share an LLC
> +				 * on one node while LLCs on another node
> +				 * remain idle.
> +				 */
> +				nr_llcs = sd->span_weight / child->span_weight;
> +				if (nr_llcs == 1)
> +					imb = sd->span_weight >> 2;
> +				else
> +					imb = nr_llcs;
> +				sd->imb_numa_nr = imb;
> +
> +				/* Set span based on the first NUMA domain. */
> +				top = sd;
> +				top_p = top->parent;
> +				while (top_p && !(top_p->flags & SD_NUMA)) {
> +					top = top->parent;
> +					top_p = top->parent;
> +				}
> +				imb_span = top_p ? top_p->span_weight : sd->span_weight;
> +			} else {
> +				int factor = max(1U, (sd->span_weight / imb_span));
> +
> +				sd->imb_numa_nr = imb * factor;
> +			}
> +		}
> +	}


On a 2 Socket Zen3 servers with 64 cores per socket, the imb_numa_nr
works out to be as follows for different Node Per Socket (NPS) modes

NPS = 1:
======
SMT(span = 2) -- > MC (span = 16) --> DIE (span = 128) --> NUMA (span = 256)
	 Parent of LLC is DIE. nr_llcs = 128/16 = 8. imb = 8.
	 top_p = NUMA. imb_span = 256.

for NUMA doman, factor = max(1U, 256/256) = 1. Thus sd->imb_numa_nr = 8.



NPS = 2
========
SMT(span=2)--> MC(span=16)--> NODE(span=64)--> NUMA1(span=128)--> NUMA2(span=256)

Parent of LLC = NODE. nr_llcs = 64/16 = 4. imb = 4.
top_p = NUMA1. imb_span = 128.

For NUMA1 domain, factor = 1. sd->imb_numa_nr = 4.
For NUMA2 domain, factor = 2. sd->imb_numa_nr = 8


NPS = 4
========
SMT(span=2)--> MC(span=16)--> NODE(span=32)--> NUMA1(span=128)--> NUMA2(span=256)

Parent of LLC = NODE. nr_llcs = 32/16 = 2. imb = 2.
top_p = NUMA1. imb_span = 128.

For NUMA1 domain, factor = 1. sd->imb_numa_nr = 2.
For NUMA2 domain, factor = 2. sd->imb_numa_nr = 4
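
The same numbers fall out of a tiny user-space check (the spans are
the ones listed above for each NPS mode, hard-coded as assumptions
rather than read from the live topology):

#include <stdio.h>

static unsigned int max_u(unsigned int a, unsigned int b)
{
        return a > b ? a : b;
}

/* llc/parent: spans of the LLC and of its parent (DIE or NODE);
 * numa[]: spans of the NUMA domains above it, numa[0] is imb_span. */
static void walk(const char *mode, unsigned int llc, unsigned int parent,
                 const unsigned int *numa, int nr_numa)
{
        unsigned int nr_llcs = parent / llc;
        unsigned int imb = (nr_llcs == 1) ? parent >> 2 : nr_llcs;
        unsigned int imb_span = numa[0];
        int i;

        printf("%s: imb=%u", mode, imb);
        for (i = 0; i < nr_numa; i++)
                printf(" NUMA%d=%u", i + 1, imb * max_u(1, numa[i] / imb_span));
        printf("\n");
}

int main(void)
{
        const unsigned int nps1[] = { 256 };
        const unsigned int nps24[] = { 128, 256 };

        walk("NPS1", 16, 128, nps1, 1);         /* MC=16, DIE=128 */
        walk("NPS2", 16, 64, nps24, 2);         /* MC=16, NODE=64 */
        walk("NPS4", 16, 32, nps24, 2);         /* MC=16, NODE=32 */
        return 0;
}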


The imb_numa_nr looks good for all the NPS modes. Furthermore, running
stream with 16 threads (equal to the number of LLCs in the system)
yields good results on all the NPS modes with this imb_numa_nr.

Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>

--
Thanks and Regards
gautham.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2022-02-08  9:43 [PATCH v6 0/2] Adjust NUMA imbalance for " Mel Gorman
@ 2022-02-08  9:43 ` Mel Gorman
  2022-02-08 16:19   ` Gautham R. Shenoy
                     ` (3 more replies)
  0 siblings, 4 replies; 46+ messages in thread
From: Mel Gorman @ 2022-02-08  9:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy,
	K Prateek Nayak, LKML, Mel Gorman

Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA
nodes") allowed an imbalance between NUMA nodes such that communicating
tasks would not be pulled apart by the load balancer. This works fine when
there is a 1:1 relationship between LLC and node but can be suboptimal
for multiple LLCs if independent tasks prematurely use CPUs sharing cache.

Zen* has multiple LLCs per node with local memory channels and due to
the allowed imbalance, it's far harder to tune some workloads to run
optimally than it is on hardware that has 1 LLC per node. This patch
allows an imbalance to exist up to the point where LLCs should be balanced
between nodes.

On a Zen3 machine running STREAM parallelised with OMP to have one instance
per LLC and without binding, the results are

                            5.17.0-rc0             5.17.0-rc0
                               vanilla       sched-numaimb-v6
MB/sec copy-16    162596.94 (   0.00%)   580559.74 ( 257.05%)
MB/sec scale-16   136901.28 (   0.00%)   374450.52 ( 173.52%)
MB/sec add-16     157300.70 (   0.00%)   564113.76 ( 258.62%)
MB/sec triad-16   151446.88 (   0.00%)   564304.24 ( 272.61%)

STREAM can use directives to force the spread if the OpenMP is new
enough but that doesn't help if an application uses threads and
it's not known in advance how many threads will be created.

Coremark is a CPU and cache intensive benchmark parallelised with
threads. When running with 1 thread per core, the vanilla kernel
allows threads to contend on cache. With the patch;

                               5.17.0-rc0             5.17.0-rc0
                                  vanilla       sched-numaimb-v5
Min       Score-16   368239.36 (   0.00%)   389816.06 (   5.86%)
Hmean     Score-16   388607.33 (   0.00%)   427877.08 *  10.11%*
Max       Score-16   408945.69 (   0.00%)   481022.17 (  17.62%)
Stddev    Score-16    15247.04 (   0.00%)    24966.82 ( -63.75%)
CoeffVar  Score-16        3.92 (   0.00%)        5.82 ( -48.48%)

It can also make a big difference for semi-realistic workloads
like specjbb which can execute arbitrary numbers of threads without
advance knowledge of how they should be placed. Even in cases where
the average performance is neutral, the results are more stable.

                               5.17.0-rc0             5.17.0-rc0
                                  vanilla       sched-numaimb-v6
Hmean     tput-1      71631.55 (   0.00%)    73065.57 (   2.00%)
Hmean     tput-8     582758.78 (   0.00%)   556777.23 (  -4.46%)
Hmean     tput-16   1020372.75 (   0.00%)  1009995.26 (  -1.02%)
Hmean     tput-24   1416430.67 (   0.00%)  1398700.11 (  -1.25%)
Hmean     tput-32   1687702.72 (   0.00%)  1671357.04 (  -0.97%)
Hmean     tput-40   1798094.90 (   0.00%)  2015616.46 *  12.10%*
Hmean     tput-48   1972731.77 (   0.00%)  2333233.72 (  18.27%)
Hmean     tput-56   2386872.38 (   0.00%)  2759483.38 (  15.61%)
Hmean     tput-64   2909475.33 (   0.00%)  2925074.69 (   0.54%)
Hmean     tput-72   2585071.36 (   0.00%)  2962443.97 (  14.60%)
Hmean     tput-80   2994387.24 (   0.00%)  3015980.59 (   0.72%)
Hmean     tput-88   3061408.57 (   0.00%)  3010296.16 (  -1.67%)
Hmean     tput-96   3052394.82 (   0.00%)  2784743.41 (  -8.77%)
Hmean     tput-104  2997814.76 (   0.00%)  2758184.50 (  -7.99%)
Hmean     tput-112  2955353.29 (   0.00%)  2859705.09 (  -3.24%)
Hmean     tput-120  2889770.71 (   0.00%)  2764478.46 (  -4.34%)
Hmean     tput-128  2871713.84 (   0.00%)  2750136.73 (  -4.23%)
Stddev    tput-1       5325.93 (   0.00%)     2002.53 (  62.40%)
Stddev    tput-8       6630.54 (   0.00%)    10905.00 ( -64.47%)
Stddev    tput-16     25608.58 (   0.00%)     6851.16 (  73.25%)
Stddev    tput-24     12117.69 (   0.00%)     4227.79 (  65.11%)
Stddev    tput-32     27577.16 (   0.00%)     8761.05 (  68.23%)
Stddev    tput-40     59505.86 (   0.00%)     2048.49 (  96.56%)
Stddev    tput-48    168330.30 (   0.00%)    93058.08 (  44.72%)
Stddev    tput-56    219540.39 (   0.00%)    30687.02 (  86.02%)
Stddev    tput-64    121750.35 (   0.00%)     9617.36 (  92.10%)
Stddev    tput-72    223387.05 (   0.00%)    34081.13 (  84.74%)
Stddev    tput-80    128198.46 (   0.00%)    22565.19 (  82.40%)
Stddev    tput-88    136665.36 (   0.00%)    27905.97 (  79.58%)
Stddev    tput-96    111925.81 (   0.00%)    99615.79 (  11.00%)
Stddev    tput-104   146455.96 (   0.00%)    28861.98 (  80.29%)
Stddev    tput-112    88740.49 (   0.00%)    58288.23 (  34.32%)
Stddev    tput-120   186384.86 (   0.00%)    45812.03 (  75.42%)
Stddev    tput-128    78761.09 (   0.00%)    57418.48 (  27.10%)

Similarly, for embarrassingly parallel problems like NPB-ep, there are
improvements due to better spreading across LLC when the machine is not
fully utilised.

                              vanilla       sched-numaimb-v6
Min       ep.D       31.79 (   0.00%)       26.11 (  17.87%)
Amean     ep.D       31.86 (   0.00%)       26.17 *  17.86%*
Stddev    ep.D        0.07 (   0.00%)        0.05 (  24.41%)
CoeffVar  ep.D        0.22 (   0.00%)        0.20 (   7.97%)
Max       ep.D       31.93 (   0.00%)       26.21 (  17.91%)

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/sched/topology.h |  1 +
 kernel/sched/fair.c            | 22 +++++++-------
 kernel/sched/topology.c        | 53 ++++++++++++++++++++++++++++++++++
 3 files changed, 66 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 8054641c0a7b..56cffe42abbc 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -93,6 +93,7 @@ struct sched_domain {
 	unsigned int busy_factor;	/* less balancing by factor if busy */
 	unsigned int imbalance_pct;	/* No balance until over watermark */
 	unsigned int cache_nice_tries;	/* Leave cache hot tasks for # tries */
+	unsigned int imb_numa_nr;	/* Nr running tasks that allows a NUMA imbalance */
 
 	int nohz_idle;			/* NOHZ IDLE status */
 	int flags;			/* See SD_* */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4592ccf82c34..538756bd8e7f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1489,6 +1489,7 @@ struct task_numa_env {
 
 	int src_cpu, src_nid;
 	int dst_cpu, dst_nid;
+	int imb_numa_nr;
 
 	struct numa_stats src_stats, dst_stats;
 
@@ -1503,7 +1504,7 @@ struct task_numa_env {
 static unsigned long cpu_load(struct rq *rq);
 static unsigned long cpu_runnable(struct rq *rq);
 static inline long adjust_numa_imbalance(int imbalance,
-					int dst_running, int dst_weight);
+					int dst_running, int imb_numa_nr);
 
 static inline enum
 numa_type numa_classify(unsigned int imbalance_pct,
@@ -1884,7 +1885,7 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 		dst_running = env->dst_stats.nr_running + 1;
 		imbalance = max(0, dst_running - src_running);
 		imbalance = adjust_numa_imbalance(imbalance, dst_running,
-							env->dst_stats.weight);
+						  env->imb_numa_nr);
 
 		/* Use idle CPU if there is no imbalance */
 		if (!imbalance) {
@@ -1949,8 +1950,10 @@ static int task_numa_migrate(struct task_struct *p)
 	 */
 	rcu_read_lock();
 	sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu));
-	if (sd)
+	if (sd) {
 		env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
+		env.imb_numa_nr = sd->imb_numa_nr;
+	}
 	rcu_read_unlock();
 
 	/*
@@ -9003,10 +9006,9 @@ static bool update_pick_idlest(struct sched_group *idlest,
  * This is an approximation as the number of running tasks may not be
  * related to the number of busy CPUs due to sched_setaffinity.
  */
-static inline bool
-allow_numa_imbalance(unsigned int running, unsigned int weight)
+static inline bool allow_numa_imbalance(int running, int imb_numa_nr)
 {
-	return (running < (weight >> 2));
+	return running <= imb_numa_nr;
 }
 
 /*
@@ -9146,7 +9148,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 			 * allowed. If there is a real need of migration,
 			 * periodic load balance will take care of it.
 			 */
-			if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, local_sgs.group_weight))
+			if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr))
 				return NULL;
 		}
 
@@ -9238,9 +9240,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 #define NUMA_IMBALANCE_MIN 2
 
 static inline long adjust_numa_imbalance(int imbalance,
-				int dst_running, int dst_weight)
+				int dst_running, int imb_numa_nr)
 {
-	if (!allow_numa_imbalance(dst_running, dst_weight))
+	if (!allow_numa_imbalance(dst_running, imb_numa_nr))
 		return imbalance;
 
 	/*
@@ -9352,7 +9354,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		/* Consider allowing a small imbalance between NUMA groups */
 		if (env->sd->flags & SD_NUMA) {
 			env->imbalance = adjust_numa_imbalance(env->imbalance,
-				local->sum_nr_running + 1, local->group_weight);
+				local->sum_nr_running + 1, env->sd->imb_numa_nr);
 		}
 
 		return;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index d201a7052a29..e6cd55951304 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2242,6 +2242,59 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 		}
 	}
 
+	/*
+	 * Calculate an allowed NUMA imbalance such that LLCs do not get
+	 * imbalanced.
+	 */
+	for_each_cpu(i, cpu_map) {
+		unsigned int imb = 0;
+		unsigned int imb_span = 1;
+
+		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
+			struct sched_domain *child = sd->child;
+
+			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
+			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
+				struct sched_domain *top, *top_p;
+				unsigned int nr_llcs;
+
+				/*
+				 * For a single LLC per node, allow an
+				 * imbalance up to 25% of the node. This is an
+				 * arbitrary cutoff based on SMT-2 to balance
+				 * between memory bandwidth and avoiding
+				 * premature sharing of HT resources and SMT-4
+				 * or SMT-8 *may* benefit from a different
+				 * cutoff.
+				 *
+				 * For multiple LLCs, allow an imbalance
+				 * until multiple tasks would share an LLC
+				 * on one node while LLCs on another node
+				 * remain idle.
+				 */
+				nr_llcs = sd->span_weight / child->span_weight;
+				if (nr_llcs == 1)
+					imb = sd->span_weight >> 2;
+				else
+					imb = nr_llcs;
+				sd->imb_numa_nr = imb;
+
+				/* Set span based on the first NUMA domain. */
+				top = sd;
+				top_p = top->parent;
+				while (top_p && !(top_p->flags & SD_NUMA)) {
+					top = top->parent;
+					top_p = top->parent;
+				}
+				imb_span = top_p ? top_p->span_weight : sd->span_weight;
+			} else {
+				int factor = max(1U, (sd->span_weight / imb_span));
+
+				sd->imb_numa_nr = imb * factor;
+			}
+		}
+	}
+
 	/* Calculate CPU capacity for physical packages and nodes */
 	for (i = nr_cpumask_bits-1; i >= 0; i--) {
 		if (!cpumask_test_cpu(i, cpu_map))
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2022-02-04 15:07   ` Nayak, KPrateek (K Prateek)
@ 2022-02-04 16:45     ` Mel Gorman
  0 siblings, 0 replies; 46+ messages in thread
From: Mel Gorman @ 2022-02-04 16:45 UTC (permalink / raw)
  To: Nayak, KPrateek (K Prateek)
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju,
	Gautham Shenoy, LKML

On Fri, Feb 04, 2022 at 08:37:53PM +0530, Nayak, KPrateek (K Prateek) wrote:
> On 2/3/2022 8:16 PM, Mel Gorman wrote:
> > @@ -9003,10 +9006,9 @@ static bool update_pick_idlest(struct sched_group *idlest,
> >   * This is an approximation as the number of running tasks may not be
> >   * related to the number of busy CPUs due to sched_setaffinity.
> >   */
> > -static inline bool
> > -allow_numa_imbalance(unsigned int running, unsigned int weight)
> > +static inline bool allow_numa_imbalance(int running, int imb_numa_nr)
> >  {
> > -	return (running < (weight >> 2));
> > +	return running < imb_numa_nr;
> >  }
> >  
> >  /*
> > @@ -9146,7 +9148,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> >  			 * allowed. If there is a real need of migration,
> >  			 * periodic load balance will take care of it.
> >  			 */
> > -			if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, local_sgs.group_weight))
> > +			if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr))
> 
> Could you please clarify why we are adding 1 to local_sgs.sum_nr_running while allowing imbalance?

To account for the new task, similar to what task_numa_find_cpu does before
calling adjust_numa_imbalance.

> allow_numa_imbalance allows the imbalance based on the following inequality:
> 
> 	running < imb_numa_nr
> 
> Consider on a Zen3 CPU with 8 LLCs in the sched group of the NUMA domain.
> Assume the group is running 7 tasks and we are finding the idlest group for the 8th task:
> 
>  	sd->imb_numa_nr = 8
> 	local_sgs.sum_nr_running = 7
> 
> In this case, local_sgs.sum_nr_running + 1 is equal to sd->imb_numa_nr and if we allow NUMA imbalance
> and place the task in the same group, each task can be given one LLC.
> However, allow_numa_imbalance returns 0 for the above case and can lead to the task being placed on a different
> NUMA group.
> 
> In case of Gautham's suggested fix (https://lore.kernel.org/lkml/YcHs37STv71n4erJ@BLR-5CG11610CF.amd.com/),
> the v4 patch in question (https://lore.kernel.org/lkml/20211210093307.31701-3-mgorman@techsingularity.net/)
> used the inequality "<=" to allow NUMA imbalance where we needed to consider the additional load CPU had
> to bear. However that doesn't seem to be the case here.
> 

I failed to change < to <= in allow_numa_imbalance, I'll fix and retest.
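
For reference, the boundary case being discussed is just the
off-by-one at running == imb_numa_nr; a trivial standalone check with
the example numbers from above:

#include <stdio.h>

int main(void)
{
        int imb_numa_nr = 8;    /* 8 LLCs in the local NUMA group */
        int running = 7 + 1;    /* local_sgs.sum_nr_running + 1 */

        /* '<' rejects the imbalance, so the 8th task may spill to another node */
        printf("running <  imb_numa_nr: %d\n", running < imb_numa_nr);
        /* '<=' allows it, keeping one task per LLC on the local node */
        printf("running <= imb_numa_nr: %d\n", running <= imb_numa_nr);
        return 0;
}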

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2022-02-03 14:46 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans " Mel Gorman
  2022-02-04  7:06   ` Srikar Dronamraju
@ 2022-02-04 15:07   ` Nayak, KPrateek (K Prateek)
  2022-02-04 16:45     ` Mel Gorman
  1 sibling, 1 reply; 46+ messages in thread
From: Nayak, KPrateek (K Prateek) @ 2022-02-04 15:07 UTC (permalink / raw)
  To: Mel Gorman, Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy,
	LKML

Hello Mel,

On 2/3/2022 8:16 PM, Mel Gorman wrote:
> Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA
> nodes") allowed an imbalance between NUMA nodes such that communicating
> tasks would not be pulled apart by the load balancer. This works fine when
> there is a 1:1 relationship between LLC and node but can be suboptimal
> for multiple LLCs if independent tasks prematurely use CPUs sharing cache.
> 
> Zen* has multiple LLCs per node with local memory channels and due to
> the allowed imbalance, it's far harder to tune some workloads to run
> optimally than it is on hardware that has 1 LLC per node. This patch
> allows an imbalance to exist up to the point where LLCs should be balanced
> between nodes.
> 
> On a Zen3 machine running STREAM parallelised with OMP to have one instance
> per LLC and without binding, the results are
> 
>                             5.17.0-rc0             5.17.0-rc0
>                                vanilla       sched-numaimb-v5
> MB/sec copy-16    162596.94 (   0.00%)   501967.12 ( 208.72%)
> MB/sec scale-16   136901.28 (   0.00%)   376531.50 ( 175.04%)
> MB/sec add-16     157300.70 (   0.00%)   569997.42 ( 262.36%)
> MB/sec triad-16   151446.88 (   0.00%)   553204.54 ( 265.28%)
> 
> STREAM can use directives to force the spread if the OpenMP is new
> enough but that doesn't help if an application uses threads and
> it's not known in advance how many threads will be created.
> 
> Coremark is a CPU and cache intensive benchmark parallelised with
> threads. When running with 1 thread per core, the vanilla kernel
> allows threads to contend on cache. With the patch;
> 
>                                5.17.0-rc0             5.17.0-rc0
>                                   vanilla       sched-numaimb-v5
> Min       Score-16   368239.36 (   0.00%)   400876.92 (   8.86%)
> Hmean     Score-16   388607.33 (   0.00%)   441447.30 *  13.60%*
> Max       Score-16   408945.69 (   0.00%)   478826.87 (  17.09%)
> Stddev    Score-16    15247.04 (   0.00%)    34061.76 (-123.40%)
> CoeffVar  Score-16        3.92 (   0.00%)        7.67 ( -95.82%)
> 
> It can also make a big difference for semi-realistic workloads
> like specjbb which can execute arbitrary numbers of threads without
> advance knowledge of how they should be placed
> 
>                                5.17.0-rc0             5.17.0-rc0
>                                   vanilla       sched-numaimb-v5
> Hmean     tput-1      71631.55 (   0.00%)    70383.46 (  -1.74%)
> Hmean     tput-8     582758.78 (   0.00%)   607290.89 *   4.21%*
> Hmean     tput-16   1020372.75 (   0.00%)  1031257.25 (   1.07%)
> Hmean     tput-24   1416430.67 (   0.00%)  1587576.33 *  12.08%*
> Hmean     tput-32   1687702.72 (   0.00%)  1724207.51 (   2.16%)
> Hmean     tput-40   1798094.90 (   0.00%)  1983053.56 *  10.29%*
> Hmean     tput-48   1972731.77 (   0.00%)  2157461.70 (   9.36%)
> Hmean     tput-56   2386872.38 (   0.00%)  2193237.42 (  -8.11%)
> Hmean     tput-64   2536954.17 (   0.00%)  2588741.08 (   2.04%)
> Hmean     tput-72   2585071.36 (   0.00%)  2654776.36 (   2.70%)
> Hmean     tput-80   2960523.94 (   0.00%)  2894657.12 (  -2.22%)
> Hmean     tput-88   3061408.57 (   0.00%)  2903167.72 (  -5.17%)
> Hmean     tput-96   3052394.82 (   0.00%)  2872605.46 (  -5.89%)
> Hmean     tput-104  2997814.76 (   0.00%)  3013660.26 (   0.53%)
> Hmean     tput-112  2955353.29 (   0.00%)  3029122.16 (   2.50%)
> Hmean     tput-120  2889770.71 (   0.00%)  2957739.88 (   2.35%)
> Hmean     tput-128  2871713.84 (   0.00%)  2912410.18 (   1.42%)
> 
> In general, the standard deviation figures also are a lot more
> stable.
> 
> Similarly, for embarrassingly parallel problems like NPB-ep, there are
> improvements due to better spreading across LLC when the machine is not
> fully utilised.
> 
>                               vanilla    sched-numaimb-v5r12
> Min       ep.D       31.79 (   0.00%)       26.11 (  17.87%)
> Amean     ep.D       31.86 (   0.00%)       26.26 *  17.58%*
> Stddev    ep.D        0.07 (   0.00%)        0.18 (-157.54%)
> CoeffVar  ep.D        0.22 (   0.00%)        0.69 (-212.46%)
> Max       ep.D       31.93 (   0.00%)       26.46 (  17.13%)
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  include/linux/sched/topology.h |  1 +
>  kernel/sched/fair.c            | 22 +++++++-------
>  kernel/sched/topology.c        | 53 ++++++++++++++++++++++++++++++++++
>  3 files changed, 66 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 8054641c0a7b..56cffe42abbc 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -93,6 +93,7 @@ struct sched_domain {
>  	unsigned int busy_factor;	/* less balancing by factor if busy */
>  	unsigned int imbalance_pct;	/* No balance until over watermark */
>  	unsigned int cache_nice_tries;	/* Leave cache hot tasks for # tries */
> +	unsigned int imb_numa_nr;	/* Nr running tasks that allows a NUMA imbalance */
>  
>  	int nohz_idle;			/* NOHZ IDLE status */
>  	int flags;			/* See SD_* */
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4592ccf82c34..86abf97a8df6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1489,6 +1489,7 @@ struct task_numa_env {
>  
>  	int src_cpu, src_nid;
>  	int dst_cpu, dst_nid;
> +	int imb_numa_nr;
>  
>  	struct numa_stats src_stats, dst_stats;
>  
> @@ -1503,7 +1504,7 @@ struct task_numa_env {
>  static unsigned long cpu_load(struct rq *rq);
>  static unsigned long cpu_runnable(struct rq *rq);
>  static inline long adjust_numa_imbalance(int imbalance,
> -					int dst_running, int dst_weight);
> +					int dst_running, int imb_numa_nr);
>  
>  static inline enum
>  numa_type numa_classify(unsigned int imbalance_pct,
> @@ -1884,7 +1885,7 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>  		dst_running = env->dst_stats.nr_running + 1;
>  		imbalance = max(0, dst_running - src_running);
>  		imbalance = adjust_numa_imbalance(imbalance, dst_running,
> -							env->dst_stats.weight);
> +						  env->imb_numa_nr);
>  
>  		/* Use idle CPU if there is no imbalance */
>  		if (!imbalance) {
> @@ -1949,8 +1950,10 @@ static int task_numa_migrate(struct task_struct *p)
>  	 */
>  	rcu_read_lock();
>  	sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu));
> -	if (sd)
> +	if (sd) {
>  		env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
> +		env.imb_numa_nr = sd->imb_numa_nr;
> +	}
>  	rcu_read_unlock();
>  
>  	/*
> @@ -9003,10 +9006,9 @@ static bool update_pick_idlest(struct sched_group *idlest,
>   * This is an approximation as the number of running tasks may not be
>   * related to the number of busy CPUs due to sched_setaffinity.
>   */
> -static inline bool
> -allow_numa_imbalance(unsigned int running, unsigned int weight)
> +static inline bool allow_numa_imbalance(int running, int imb_numa_nr)
>  {
> -	return (running < (weight >> 2));
> +	return running < imb_numa_nr;
>  }
>  
>  /*
> @@ -9146,7 +9148,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>  			 * allowed. If there is a real need of migration,
>  			 * periodic load balance will take care of it.
>  			 */
> -			if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, local_sgs.group_weight))
> +			if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr))

Could you please clarify why we are adding 1 to local_sgs.sum_nr_running while allowing the imbalance?
allow_numa_imbalance allows the imbalance based on the following inequality:

	running < imb_numa_nr

Consider a Zen3 CPU with 8 LLCs in the sched group of the NUMA domain.
Assume the group is running 7 tasks and we are finding the idlest group for the 8th task:

 	sd->imb_numa_nr = 8
	local_sgs.sum_nr_running = 7

In this case, local_sgs.sum_nr_running + 1 is equal to sd->imb_numa_nr, and if we allow the NUMA imbalance
and place the task in the same group, each task can be given one LLC.
However, allow_numa_imbalance returns 0 for the above case and can lead to the task being placed on a different
NUMA group.

In case of Gautham's suggested fix (https://lore.kernel.org/lkml/YcHs37STv71n4erJ@BLR-5CG11610CF.amd.com/),
the v4 patch in question (https://lore.kernel.org/lkml/20211210093307.31701-3-mgorman@techsingularity.net/)
used the inequality "<=" to allow the NUMA imbalance where we needed to account for the additional load the
CPU had to bear. However, that doesn't seem to be the case here.
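
For illustration, a minimal userspace sketch of the boundary condition being
described (this is not kernel code; allow_numa_imbalance() is copied from the
hunk quoted above, and the numbers are the ones from this Zen3 example, i.e.
imb_numa_nr = 8 with 7 tasks already running in the local group):

#include <stdbool.h>
#include <stdio.h>

/* copied from the patch: keep the task local while running < imb_numa_nr */
static bool allow_numa_imbalance(int running, int imb_numa_nr)
{
	return running < imb_numa_nr;
}

int main(void)
{
	int imb_numa_nr = 8;	/* one slot per LLC in the node */
	int sum_nr_running = 7;	/* tasks already running in the local group */

	/* find_idlest_group() passes sum_nr_running + 1 for the waking task */
	bool keep_local = allow_numa_imbalance(sum_nr_running + 1, imb_numa_nr);

	printf("running+1=%d imb_numa_nr=%d -> %s\n",
	       sum_nr_running + 1, imb_numa_nr,
	       keep_local ? "kept on the local node"
			  : "may be spread to another node");
	return 0;
}

Compiled and run, it reports that the 8th task is not kept local even though
each task could still have an LLC to itself, which is the boundary being
questioned here.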

>  				return NULL;
>  		}
>  
> @@ -9238,9 +9240,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  #define NUMA_IMBALANCE_MIN 2
>  
>  static inline long adjust_numa_imbalance(int imbalance,
> -				int dst_running, int dst_weight)
> +				int dst_running, int imb_numa_nr)
>  {
> -	if (!allow_numa_imbalance(dst_running, dst_weight))
> +	if (!allow_numa_imbalance(dst_running, imb_numa_nr))
>  		return imbalance;
>  
>  	/*
> @@ -9352,7 +9354,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>  		/* Consider allowing a small imbalance between NUMA groups */
>  		if (env->sd->flags & SD_NUMA) {
>  			env->imbalance = adjust_numa_imbalance(env->imbalance,
> -				local->sum_nr_running + 1, local->group_weight);
> +				local->sum_nr_running + 1, env->sd->imb_numa_nr);
>  		}
>  
>  		return;
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index d201a7052a29..e6cd55951304 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2242,6 +2242,59 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>  		}
>  	}
>  
> +	/*
> +	 * Calculate an allowed NUMA imbalance such that LLCs do not get
> +	 * imbalanced.
> +	 */
> +	for_each_cpu(i, cpu_map) {
> +		unsigned int imb = 0;
> +		unsigned int imb_span = 1;
> +
> +		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> +			struct sched_domain *child = sd->child;
> +
> +			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> +			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
> +				struct sched_domain *top, *top_p;
> +				unsigned int nr_llcs;
> +
> +				/*
> +				 * For a single LLC per node, allow an
> +				 * imbalance up to 25% of the node. This is an
> +				 * arbitrary cutoff based on SMT-2 to balance
> +				 * between memory bandwidth and avoiding
> +				 * premature sharing of HT resources and SMT-4
> +				 * or SMT-8 *may* benefit from a different
> +				 * cutoff.
> +				 *
> +				 * For multiple LLCs, allow an imbalance
> +				 * until multiple tasks would share an LLC
> +				 * on one node while LLCs on another node
> +				 * remain idle.
> +				 */

To add to my point above, the comment here says
"allow an imbalance until multiple tasks would share an LLC on one node",
whereas in the case I highlighted above we see balancing kick in with
one LLC possibly left unaccounted for.

> +				nr_llcs = sd->span_weight / child->span_weight;
> +				if (nr_llcs == 1)
> +					imb = sd->span_weight >> 2;
> +				else
> +					imb = nr_llcs;
> +				sd->imb_numa_nr = imb;
> +
> +				/* Set span based on the first NUMA domain. */
> +				top = sd;
> +				top_p = top->parent;
> +				while (top_p && !(top_p->flags & SD_NUMA)) {
> +					top = top->parent;
> +					top_p = top->parent;
> +				}
> +				imb_span = top_p ? top_p->span_weight : sd->span_weight;
> +			} else {
> +				int factor = max(1U, (sd->span_weight / imb_span));
> +
> +				sd->imb_numa_nr = imb * factor;
> +			}
> +		}
> +	}
> +
>  	/* Calculate CPU capacity for physical packages and nodes */
>  	for (i = nr_cpumask_bits-1; i >= 0; i--) {
>  		if (!cpumask_test_cpu(i, cpu_map))

Please correct me if I'm wrong.

Thanks and Regards
Prateek

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2022-02-04  7:06   ` Srikar Dronamraju
@ 2022-02-04  9:04     ` Mel Gorman
  0 siblings, 0 replies; 46+ messages in thread
From: Mel Gorman @ 2022-02-04  9:04 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Gautham Shenoy, LKML

On Fri, Feb 04, 2022 at 12:36:54PM +0530, Srikar Dronamraju wrote:
> * Mel Gorman <mgorman@techsingularity.net> [2022-02-03 14:46:52]:
> 
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index d201a7052a29..e6cd55951304 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -2242,6 +2242,59 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> >  		}
> >  	}
> > 
> > +	/*
> > +	 * Calculate an allowed NUMA imbalance such that LLCs do not get
> > +	 * imbalanced.
> > +	 */
> 
> We seem to be adding this hunk before the sched_domains may get degenerated.
> Wondering if we really want to do it before degeneration.
> 

There was no obvious advantage versus doing it at the same time that
characteristics like groups were being determined.

> Let's say we have 3 sched domains and we calculated the sd->imb_numa_nr for
> all the 3 domains, then let's say the middle sched_domain gets degenerated.
> Would the sd->imb_numa_nr's still be relevant?
> 

It's expected that it is still relevant as the ratios with respect to
SD_SHARE_PKG_RESOURCES should still be consistent.

> 
> > +	for_each_cpu(i, cpu_map) {
> > +		unsigned int imb = 0;
> > +		unsigned int imb_span = 1;
> > +
> > +		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> > +			struct sched_domain *child = sd->child;
> > +
> > +			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> > +			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
> > +				struct sched_domain *top, *top_p;
> > +				unsigned int nr_llcs;
> > +
> > +				/*
> > +				 * For a single LLC per node, allow an
> > +				 * imbalance up to 25% of the node. This is an
> > +				 * arbitrary cutoff based on SMT-2 to balance
> > +				 * between memory bandwidth and avoiding
> > +				 * premature sharing of HT resources and SMT-4
> > +				 * or SMT-8 *may* benefit from a different
> > +				 * cutoff.
> > +				 *
> > +				 * For multiple LLCs, allow an imbalance
> > +				 * until multiple tasks would share an LLC
> > +				 * on one node while LLCs on another node
> > +				 * remain idle.
> > +				 */
> > +				nr_llcs = sd->span_weight / child->span_weight;
> > +				if (nr_llcs == 1)
> > +					imb = sd->span_weight >> 2;
> > +				else
> > +					imb = nr_llcs;
> > +				sd->imb_numa_nr = imb;
> > +
> > +				/* Set span based on the first NUMA domain. */
> > +				top = sd;
> > +				top_p = top->parent;
> > +				while (top_p && !(top_p->flags & SD_NUMA)) {
> > +					top = top->parent;
> > +					top_p = top->parent;
> > +				}
> > +				imb_span = top_p ? top_p->span_weight : sd->span_weight;
> 
> I am getting confused by imb_span.
> Let's say we have a topology of SMT -> MC -> DIE -> NUMA -> NUMA, with the SMT
> and MC domains having the SD_SHARE_PKG_RESOURCES flag set.
> We come here only for the DIE domain.
> 
> The imb_span set here is being used for both of the subsequent sched domains;
> most likely they will be NUMA domains. Right?
> 

Right.

> > +			} else {
> > +				int factor = max(1U, (sd->span_weight / imb_span));
> > +
> > +				sd->imb_numa_nr = imb * factor;
> 
> For SMT (or any sched domain below the LLC), factor would be
> sd->span_weight but imb_numa_nr and imb would be 0.

Yes.

> For NUMA (or any sched domain just above DIE), factor would be 1 and
> sd->imb_numa_nr would be nr_llcs.
> For subsequent sched_domains, the sd->imb_numa_nr would be some multiple of
> nr_llcs. Right?
> 

Right.
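
To make the confirmed behaviour easier to follow, here is a standalone sketch
of the walk, assuming the v5 hunk as posted. The topology below (SMT-2, 16-CPU
LLCs, a 128-CPU DIE and NUMA levels spanning 256 and 512 CPUs) is invented
purely for illustration and does not describe any particular machine:

#include <stdio.h>

#define SD_SHARE_PKG_RESOURCES	0x1
#define SD_NUMA			0x2

struct dom {
	const char *name;
	unsigned int span_weight;
	int flags;
};

int main(void)
{
	/* d[i]'s child is d[i - 1] and its parent is d[i + 1] */
	struct dom d[] = {
		{ "SMT",   2, SD_SHARE_PKG_RESOURCES },
		{ "MC",   16, SD_SHARE_PKG_RESOURCES },	/* the LLC */
		{ "DIE", 128, 0 },
		{ "NUMA", 256, SD_NUMA },
		{ "NUMA", 512, SD_NUMA },
	};
	int ndom = sizeof(d) / sizeof(d[0]);
	unsigned int imb = 0, imb_span = 1;
	int i;

	for (i = 0; i < ndom; i++) {
		struct dom *sd = &d[i];
		struct dom *child = i ? &d[i - 1] : NULL;
		unsigned int imb_numa_nr;

		if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
		    (child->flags & SD_SHARE_PKG_RESOURCES)) {
			/* first domain above the LLC, i.e. DIE here */
			unsigned int nr_llcs = sd->span_weight / child->span_weight;
			int j = i + 1;

			imb = (nr_llcs == 1) ? sd->span_weight >> 2 : nr_llcs;
			imb_numa_nr = imb;

			/* imb_span is the span of the first SD_NUMA parent */
			while (j < ndom && !(d[j].flags & SD_NUMA))
				j++;
			imb_span = (j < ndom) ? d[j].span_weight : sd->span_weight;
		} else {
			unsigned int factor = sd->span_weight / imb_span;

			imb_numa_nr = imb * (factor ? factor : 1);
		}
		printf("%-4s span=%3u imb_numa_nr=%u\n",
		       sd->name, sd->span_weight, imb_numa_nr);
	}
	return 0;
}

It prints imb_numa_nr = 0 for SMT and MC, 8 (nr_llcs) for DIE and the first
NUMA level, and 16 (2 * nr_llcs) for the wider NUMA level, matching the
breakdown above.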

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2022-02-03 14:46 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans " Mel Gorman
@ 2022-02-04  7:06   ` Srikar Dronamraju
  2022-02-04  9:04     ` Mel Gorman
  2022-02-04 15:07   ` Nayak, KPrateek (K Prateek)
  1 sibling, 1 reply; 46+ messages in thread
From: Srikar Dronamraju @ 2022-02-04  7:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Gautham Shenoy, LKML

* Mel Gorman <mgorman@techsingularity.net> [2022-02-03 14:46:52]:

> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index d201a7052a29..e6cd55951304 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2242,6 +2242,59 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>  		}
>  	}
> 
> +	/*
> +	 * Calculate an allowed NUMA imbalance such that LLCs do not get
> +	 * imbalanced.
> +	 */

We seem to be adding this hunk before the sched_domains may get degenerated.
Wondering if we really want to do it before degeneration.

Let's say we have 3 sched domains and we calculated the sd->imb_numa_nr for
all the 3 domains, then let's say the middle sched_domain gets degenerated.
Would the sd->imb_numa_nr's still be relevant?


> +	for_each_cpu(i, cpu_map) {
> +		unsigned int imb = 0;
> +		unsigned int imb_span = 1;
> +
> +		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> +			struct sched_domain *child = sd->child;
> +
> +			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> +			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
> +				struct sched_domain *top, *top_p;
> +				unsigned int nr_llcs;
> +
> +				/*
> +				 * For a single LLC per node, allow an
> +				 * imbalance up to 25% of the node. This is an
> +				 * arbitrary cutoff based on SMT-2 to balance
> +				 * between memory bandwidth and avoiding
> +				 * premature sharing of HT resources and SMT-4
> +				 * or SMT-8 *may* benefit from a different
> +				 * cutoff.
> +				 *
> +				 * For multiple LLCs, allow an imbalance
> +				 * until multiple tasks would share an LLC
> +				 * on one node while LLCs on another node
> +				 * remain idle.
> +				 */
> +				nr_llcs = sd->span_weight / child->span_weight;
> +				if (nr_llcs == 1)
> +					imb = sd->span_weight >> 2;
> +				else
> +					imb = nr_llcs;
> +				sd->imb_numa_nr = imb;
> +
> +				/* Set span based on the first NUMA domain. */
> +				top = sd;
> +				top_p = top->parent;
> +				while (top_p && !(top_p->flags & SD_NUMA)) {
> +					top = top->parent;
> +					top_p = top->parent;
> +				}
> +				imb_span = top_p ? top_p->span_weight : sd->span_weight;

I am getting confused by imb_span.
Let's say we have a topology of SMT -> MC -> DIE -> NUMA -> NUMA, with the SMT
and MC domains having the SD_SHARE_PKG_RESOURCES flag set.
We come here only for the DIE domain.

The imb_span set here is being used for both of the subsequent sched domains;
most likely they will be NUMA domains. Right?

> +			} else {
> +				int factor = max(1U, (sd->span_weight / imb_span));
> +
> +				sd->imb_numa_nr = imb * factor;

For SMT (or any sched domain below the LLC), factor would be
sd->span_weight but imb_numa_nr and imb would be 0.
For NUMA (or any sched domain just above DIE), factor would be 1 and
sd->imb_numa_nr would be nr_llcs.
For subsequent sched_domains, the sd->imb_numa_nr would be some multiple of
nr_llcs. Right?


> +			}
> +		}
> +	}
> +
>  	/* Calculate CPU capacity for physical packages and nodes */
>  	for (i = nr_cpumask_bits-1; i >= 0; i--) {
>  		if (!cpumask_test_cpu(i, cpu_map))
> -- 
> 2.31.1
> 

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2022-02-03 14:46 [PATCH v5 0/2] Adjust NUMA imbalance for " Mel Gorman
@ 2022-02-03 14:46 ` Mel Gorman
  2022-02-04  7:06   ` Srikar Dronamraju
  2022-02-04 15:07   ` Nayak, KPrateek (K Prateek)
  0 siblings, 2 replies; 46+ messages in thread
From: Mel Gorman @ 2022-02-03 14:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy,
	LKML, Mel Gorman

Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA
nodes") allowed an imbalance between NUMA nodes such that communicating
tasks would not be pulled apart by the load balancer. This works fine when
there is a 1:1 relationship between LLC and node but can be suboptimal
for multiple LLCs if independent tasks prematurely use CPUs sharing cache.

Zen* has multiple LLCs per node with local memory channels and due to
the allowed imbalance, it's far harder to tune some workloads to run
optimally than it is on hardware that has 1 LLC per node. This patch
allows an imbalance to exist up to the point where LLCs should be balanced
between nodes.

On a Zen3 machine running STREAM parallelised with OMP to have one instance
per LLC and without binding, the results are

                            5.17.0-rc0             5.17.0-rc0
                               vanilla       sched-numaimb-v5
MB/sec copy-16    162596.94 (   0.00%)   501967.12 ( 208.72%)
MB/sec scale-16   136901.28 (   0.00%)   376531.50 ( 175.04%)
MB/sec add-16     157300.70 (   0.00%)   569997.42 ( 262.36%)
MB/sec triad-16   151446.88 (   0.00%)   553204.54 ( 265.28%)

STREAM can use directives to force the spread if the OpenMP runtime is new
enough but that doesn't help if an application uses threads and
it's not known in advance how many threads will be created.

Coremark is a CPU and cache intensive benchmark parallelised with
threads. When running with 1 thread per core, the vanilla kernel
allows threads to contend on cache. With the patch:

                               5.17.0-rc0             5.17.0-rc0
                                  vanilla       sched-numaimb-v5
Min       Score-16   368239.36 (   0.00%)   400876.92 (   8.86%)
Hmean     Score-16   388607.33 (   0.00%)   441447.30 *  13.60%*
Max       Score-16   408945.69 (   0.00%)   478826.87 (  17.09%)
Stddev    Score-16    15247.04 (   0.00%)    34061.76 (-123.40%)
CoeffVar  Score-16        3.92 (   0.00%)        7.67 ( -95.82%)

It can also make a big difference for semi-realistic workloads
like specjbb which can execute arbitrary numbers of threads without
advance knowledge of how they should be placed

                               5.17.0-rc0             5.17.0-rc0
                                  vanilla       sched-numaimb-v5
Hmean     tput-1      71631.55 (   0.00%)    70383.46 (  -1.74%)
Hmean     tput-8     582758.78 (   0.00%)   607290.89 *   4.21%*
Hmean     tput-16   1020372.75 (   0.00%)  1031257.25 (   1.07%)
Hmean     tput-24   1416430.67 (   0.00%)  1587576.33 *  12.08%*
Hmean     tput-32   1687702.72 (   0.00%)  1724207.51 (   2.16%)
Hmean     tput-40   1798094.90 (   0.00%)  1983053.56 *  10.29%*
Hmean     tput-48   1972731.77 (   0.00%)  2157461.70 (   9.36%)
Hmean     tput-56   2386872.38 (   0.00%)  2193237.42 (  -8.11%)
Hmean     tput-64   2536954.17 (   0.00%)  2588741.08 (   2.04%)
Hmean     tput-72   2585071.36 (   0.00%)  2654776.36 (   2.70%)
Hmean     tput-80   2960523.94 (   0.00%)  2894657.12 (  -2.22%)
Hmean     tput-88   3061408.57 (   0.00%)  2903167.72 (  -5.17%)
Hmean     tput-96   3052394.82 (   0.00%)  2872605.46 (  -5.89%)
Hmean     tput-104  2997814.76 (   0.00%)  3013660.26 (   0.53%)
Hmean     tput-112  2955353.29 (   0.00%)  3029122.16 (   2.50%)
Hmean     tput-120  2889770.71 (   0.00%)  2957739.88 (   2.35%)
Hmean     tput-128  2871713.84 (   0.00%)  2912410.18 (   1.42%)

In general, the standard deviation figures also are a lot more
stable.

Similarly, for embarrassingly parallel problems like NPB-ep, there are
improvements due to better spreading across LLCs when the machine is not
fully utilised.

                              vanilla    sched-numaimb-v5r12
Min       ep.D       31.79 (   0.00%)       26.11 (  17.87%)
Amean     ep.D       31.86 (   0.00%)       26.26 *  17.58%*
Stddev    ep.D        0.07 (   0.00%)        0.18 (-157.54%)
CoeffVar  ep.D        0.22 (   0.00%)        0.69 (-212.46%)
Max       ep.D       31.93 (   0.00%)       26.46 (  17.13%)

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/sched/topology.h |  1 +
 kernel/sched/fair.c            | 22 +++++++-------
 kernel/sched/topology.c        | 53 ++++++++++++++++++++++++++++++++++
 3 files changed, 66 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 8054641c0a7b..56cffe42abbc 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -93,6 +93,7 @@ struct sched_domain {
 	unsigned int busy_factor;	/* less balancing by factor if busy */
 	unsigned int imbalance_pct;	/* No balance until over watermark */
 	unsigned int cache_nice_tries;	/* Leave cache hot tasks for # tries */
+	unsigned int imb_numa_nr;	/* Nr running tasks that allows a NUMA imbalance */
 
 	int nohz_idle;			/* NOHZ IDLE status */
 	int flags;			/* See SD_* */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4592ccf82c34..86abf97a8df6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1489,6 +1489,7 @@ struct task_numa_env {
 
 	int src_cpu, src_nid;
 	int dst_cpu, dst_nid;
+	int imb_numa_nr;
 
 	struct numa_stats src_stats, dst_stats;
 
@@ -1503,7 +1504,7 @@ struct task_numa_env {
 static unsigned long cpu_load(struct rq *rq);
 static unsigned long cpu_runnable(struct rq *rq);
 static inline long adjust_numa_imbalance(int imbalance,
-					int dst_running, int dst_weight);
+					int dst_running, int imb_numa_nr);
 
 static inline enum
 numa_type numa_classify(unsigned int imbalance_pct,
@@ -1884,7 +1885,7 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 		dst_running = env->dst_stats.nr_running + 1;
 		imbalance = max(0, dst_running - src_running);
 		imbalance = adjust_numa_imbalance(imbalance, dst_running,
-							env->dst_stats.weight);
+						  env->imb_numa_nr);
 
 		/* Use idle CPU if there is no imbalance */
 		if (!imbalance) {
@@ -1949,8 +1950,10 @@ static int task_numa_migrate(struct task_struct *p)
 	 */
 	rcu_read_lock();
 	sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu));
-	if (sd)
+	if (sd) {
 		env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
+		env.imb_numa_nr = sd->imb_numa_nr;
+	}
 	rcu_read_unlock();
 
 	/*
@@ -9003,10 +9006,9 @@ static bool update_pick_idlest(struct sched_group *idlest,
  * This is an approximation as the number of running tasks may not be
  * related to the number of busy CPUs due to sched_setaffinity.
  */
-static inline bool
-allow_numa_imbalance(unsigned int running, unsigned int weight)
+static inline bool allow_numa_imbalance(int running, int imb_numa_nr)
 {
-	return (running < (weight >> 2));
+	return running < imb_numa_nr;
 }
 
 /*
@@ -9146,7 +9148,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 			 * allowed. If there is a real need of migration,
 			 * periodic load balance will take care of it.
 			 */
-			if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, local_sgs.group_weight))
+			if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr))
 				return NULL;
 		}
 
@@ -9238,9 +9240,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 #define NUMA_IMBALANCE_MIN 2
 
 static inline long adjust_numa_imbalance(int imbalance,
-				int dst_running, int dst_weight)
+				int dst_running, int imb_numa_nr)
 {
-	if (!allow_numa_imbalance(dst_running, dst_weight))
+	if (!allow_numa_imbalance(dst_running, imb_numa_nr))
 		return imbalance;
 
 	/*
@@ -9352,7 +9354,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		/* Consider allowing a small imbalance between NUMA groups */
 		if (env->sd->flags & SD_NUMA) {
 			env->imbalance = adjust_numa_imbalance(env->imbalance,
-				local->sum_nr_running + 1, local->group_weight);
+				local->sum_nr_running + 1, env->sd->imb_numa_nr);
 		}
 
 		return;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index d201a7052a29..e6cd55951304 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2242,6 +2242,59 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 		}
 	}
 
+	/*
+	 * Calculate an allowed NUMA imbalance such that LLCs do not get
+	 * imbalanced.
+	 */
+	for_each_cpu(i, cpu_map) {
+		unsigned int imb = 0;
+		unsigned int imb_span = 1;
+
+		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
+			struct sched_domain *child = sd->child;
+
+			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
+			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
+				struct sched_domain *top, *top_p;
+				unsigned int nr_llcs;
+
+				/*
+				 * For a single LLC per node, allow an
+				 * imbalance up to 25% of the node. This is an
+				 * arbitrary cutoff based on SMT-2 to balance
+				 * between memory bandwidth and avoiding
+				 * premature sharing of HT resources and SMT-4
+				 * or SMT-8 *may* benefit from a different
+				 * cutoff.
+				 *
+				 * For multiple LLCs, allow an imbalance
+				 * until multiple tasks would share an LLC
+				 * on one node while LLCs on another node
+				 * remain idle.
+				 */
+				nr_llcs = sd->span_weight / child->span_weight;
+				if (nr_llcs == 1)
+					imb = sd->span_weight >> 2;
+				else
+					imb = nr_llcs;
+				sd->imb_numa_nr = imb;
+
+				/* Set span based on the first NUMA domain. */
+				top = sd;
+				top_p = top->parent;
+				while (top_p && !(top_p->flags & SD_NUMA)) {
+					top = top->parent;
+					top_p = top->parent;
+				}
+				imb_span = top_p ? top_p->span_weight : sd->span_weight;
+			} else {
+				int factor = max(1U, (sd->span_weight / imb_span));
+
+				sd->imb_numa_nr = imb * factor;
+			}
+		}
+	}
+
 	/* Calculate CPU capacity for physical packages and nodes */
 	for (i = nr_cpumask_bits-1; i >= 0; i--) {
 		if (!cpumask_test_cpu(i, cpu_map))
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2022-01-10 15:53                     ` Vincent Guittot
@ 2022-01-12 10:24                       ` Mel Gorman
  0 siblings, 0 replies; 46+ messages in thread
From: Mel Gorman @ 2022-01-12 10:24 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Gautham R. Shenoy, Peter Zijlstra, Ingo Molnar,
	Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith,
	Srikar Dronamraju, LKML

On Mon, Jan 10, 2022 at 04:53:26PM +0100, Vincent Guittot wrote:
> On Wed, 5 Jan 2022 at 11:42, Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > On Tue, Dec 21, 2021 at 06:13:15PM +0100, Vincent Guittot wrote:
> > > > <SNIP>
> > > >
> > > > @@ -9050,9 +9054,9 @@ static bool update_pick_idlest(struct sched_group *idlest,
> > > >   * This is an approximation as the number of running tasks may not be
> > > >   * related to the number of busy CPUs due to sched_setaffinity.
> > > >   */
> > > > -static inline bool allow_numa_imbalance(int dst_running, int dst_weight)
> > > > +static inline bool allow_numa_imbalance(int dst_running, int imb_numa_nr)
> > > >  {
> > > > -       return (dst_running < (dst_weight >> 2));
> > > > +       return dst_running < imb_numa_nr;
> > > >  }
> > > >
> > > >  /*
> > > >
> > > > <SNIP>
> > > >
> > > > @@ -9280,19 +9285,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> > > >         }
> > > >  }
> > > >
> > > > -#define NUMA_IMBALANCE_MIN 2
> > > > -
> > > >  static inline long adjust_numa_imbalance(int imbalance,
> > > > -                               int dst_running, int dst_weight)
> > > > +                               int dst_running, int imb_numa_nr)
> > > >  {
> > > > -       if (!allow_numa_imbalance(dst_running, dst_weight))
> > > > +       if (!allow_numa_imbalance(dst_running, imb_numa_nr))
> > > >                 return imbalance;
> > > >
> > > > -       /*
> > > > -        * Allow a small imbalance based on a simple pair of communicating
> > > > -        * tasks that remain local when the destination is lightly loaded.
> > > > -        */
> > > > -       if (imbalance <= NUMA_IMBALANCE_MIN)
> > > > +       if (imbalance <= imb_numa_nr)
> > >
> > > Isn't this always true ?
> > >
> > > imbalance is "always" < dst_running as imbalance is usually the number
> > > of these tasks that we would like to migrate
> > >
> >
> > It's not necessarily true. allow_numa_imbalance is checking if
> > dst_running < imb_numa_nr and adjust_numa_imbalance is checking the
> > imbalance.
> >
> > imb_numa_nr = 4
> > dst_running = 2
> > imbalance   = 1
> >
> > In that case, imbalance of 1 is ok, but 2 is not.
> 
> I don't follow your example. Why is imbalance = 2 not ok in your
> example above? allow_numa_imbalance still returns true (dst_running <
> imb_numa_nr) and we still have imbalance <= imb_numa_nr
> 

At the time I wrote it, the comparison looked like < instead of <=.

> Also the name dst_running is quite confusing; in the case of
> calculate_imbalance, busiest->nr_running is passed as the dst_running
> argument. But the busiest group is the src, not the dst, of the balance.
> 
> Then, imbalance < busiest->nr_running in load_balance because we try
> to even the number of tasks running in each group without emptying it,
> and allow_numa_imbalance checks that dst_running < imb_numa_nr. So we
> have imbalance < dst_running < imb_numa_nr
> 

But either way, you have a valid point. The patch as-is is too complex, is
doing too much and is failing to make progress as a result. I'm going
to go back to the drawing board and come up with a simpler version that
adjusts the cut-off depending on topology but only allows an imbalance
of NUMA_IMBALANCE_MIN, and tidy up the inconsistencies.

> > This?
> >
> >                                  * The 25% imbalance is an arbitrary cutoff
> >                                  * based on SMT-2 to balance between memory
> >                                  * bandwidth and avoiding premature sharing
> >                                  * of HT resources and SMT-4 or SMT-8 *may*
> >                                  * benefit from a different cutoff. nr_llcs
> >                                  * are accounted for to mitigate premature
> >                                  * cache eviction due to multiple tasks
> >                                  * using one cache while a sibling cache
> >                                  * remains relatively idle.
> >
> > > For example, why is it better than just 25% of the LLC weight ?
> >
> > Because let's say there are 2 LLCs; then an imbalance based on just the LLC
> > weight might allow 2 tasks to share one cache while another is idle. This
> > is the original problem whereby the vanilla imbalance allowed multiple
> > LLCs on the same node to be overloaded which hurt workloads that prefer
> > to spread wide.
> 
> In this case, shouldn't it be (llc_weight >> 2) * nr_llcs to fill each
> LLC up to 25%, instead of dividing by nr_llcs?
> 
> As an example:
> 1 node with 1 LLC of 128 CPUs will get an imb_numa_nr = 32
> 1 node with 2 LLCs of 64 CPUs each will get an imb_numa_nr = 8
> 1 node with 4 LLCs of 32 CPUs each will get an imb_numa_nr = 2
> 
> sd->imb_numa_nr is used at the NUMA level, so the more LLCs you have,
> the lower the allowed imbalance.
> 

The more LLCs, the lower the threshold at which an imbalance is allowed;
this is deliberate given that the motivating problem was that embarrassingly
parallel problems on AMD suffer due to overloading some LLCs while
others remain idle.
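
To make the difference in scaling concrete, a small standalone sketch (not
from any posted patch) evaluating both expressions for the example node of
128 CPUs split into 1, 2 or 4 LLCs; the max(2U, ...) clamp in the posted code
is left out because it does not change these particular values:

#include <stdio.h>

int main(void)
{
	unsigned int node = 128;			/* CPUs in the node */
	unsigned int nr_llcs_list[] = { 1, 2, 4 };
	int i;

	for (i = 0; i < 3; i++) {
		unsigned int nr_llcs = nr_llcs_list[i];
		unsigned int llc = node / nr_llcs;	/* CPUs per LLC */

		/* as posted: imb = (llc_weight^2 / span_weight) >> 2 */
		unsigned int imb_posted = (llc * llc / node) >> 2;
		/* the alternative above: fill each LLC up to 25% */
		unsigned int imb_alt = (llc >> 2) * nr_llcs;

		printf("%u LLC(s) of %3u CPUs: posted imb_numa_nr=%2u, alternative=%2u\n",
		       nr_llcs, llc, imb_posted, imb_alt);
	}
	return 0;
}

The posted scaling gives 32, 8 and 2 while the alternative stays at 32 for
each layout, which is exactly the trade-off being debated here.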

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2022-01-05 10:42                   ` Mel Gorman
  2022-01-05 10:49                     ` Mel Gorman
@ 2022-01-10 15:53                     ` Vincent Guittot
  2022-01-12 10:24                       ` Mel Gorman
  1 sibling, 1 reply; 46+ messages in thread
From: Vincent Guittot @ 2022-01-10 15:53 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Gautham R. Shenoy, Peter Zijlstra, Ingo Molnar,
	Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith,
	Srikar Dronamraju, LKML

On Wed, 5 Jan 2022 at 11:42, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Tue, Dec 21, 2021 at 06:13:15PM +0100, Vincent Guittot wrote:
> > > <SNIP>
> > >
> > > @@ -9050,9 +9054,9 @@ static bool update_pick_idlest(struct sched_group *idlest,
> > >   * This is an approximation as the number of running tasks may not be
> > >   * related to the number of busy CPUs due to sched_setaffinity.
> > >   */
> > > -static inline bool allow_numa_imbalance(int dst_running, int dst_weight)
> > > +static inline bool allow_numa_imbalance(int dst_running, int imb_numa_nr)
> > >  {
> > > -       return (dst_running < (dst_weight >> 2));
> > > +       return dst_running < imb_numa_nr;
> > >  }
> > >
> > >  /*
> > >
> > > <SNIP>
> > >
> > > @@ -9280,19 +9285,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> > >         }
> > >  }
> > >
> > > -#define NUMA_IMBALANCE_MIN 2
> > > -
> > >  static inline long adjust_numa_imbalance(int imbalance,
> > > -                               int dst_running, int dst_weight)
> > > +                               int dst_running, int imb_numa_nr)
> > >  {
> > > -       if (!allow_numa_imbalance(dst_running, dst_weight))
> > > +       if (!allow_numa_imbalance(dst_running, imb_numa_nr))
> > >                 return imbalance;
> > >
> > > -       /*
> > > -        * Allow a small imbalance based on a simple pair of communicating
> > > -        * tasks that remain local when the destination is lightly loaded.
> > > -        */
> > > -       if (imbalance <= NUMA_IMBALANCE_MIN)
> > > +       if (imbalance <= imb_numa_nr)
> >
> > Isn't this always true ?
> >
> > imbalance is "always" < dst_running as imbalance is usually the number
> > of these tasks that we would like to migrate
> >
>
> It's not necessarily true. allow_numa_imbalance is checking if
> dst_running < imb_numa_nr and adjust_numa_imbalance is checking the
> imbalance.
>
> imb_numa_nr = 4
> dst_running = 2
> imbalance   = 1
>
> In that case, imbalance of 1 is ok, but 2 is not.

I don't follow your example. Why is imbalance = 2 not ok in your
example above? allow_numa_imbalance still returns true (dst_running <
imb_numa_nr) and we still have imbalance <= imb_numa_nr

Also the name dst_running is quite confusing; in the case of
calculate_imbalance, busiest->nr_running is passed as the dst_running
argument. But the busiest group is the src, not the dst, of the balance.

Then, imbalance < busiest->nr_running in load_balance because we try
to even the number of tasks running in each group without emptying it,
and allow_numa_imbalance checks that dst_running < imb_numa_nr. So we
have imbalance < dst_running < imb_numa_nr

>
> >
> > >                 return 0;
> > >
> > >         return imbalance;
> > > @@ -9397,7 +9396,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> > >                 /* Consider allowing a small imbalance between NUMA groups */
> > >                 if (env->sd->flags & SD_NUMA) {
> > >                         env->imbalance = adjust_numa_imbalance(env->imbalance,
> > > -                               busiest->sum_nr_running, env->sd->span_weight);
> > > +                               busiest->sum_nr_running, env->sd->imb_numa_nr);
> > >                 }
> > >
> > >                 return;
> > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > > index d201a7052a29..1fa3e977521d 100644
> > > --- a/kernel/sched/topology.c
> > > +++ b/kernel/sched/topology.c
> > > @@ -2242,6 +2242,55 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> > >                 }
> > >         }
> > >
> > > +       /*
> > > +        * Calculate an allowed NUMA imbalance such that LLCs do not get
> > > +        * imbalanced.
> > > +        */
> > > +       for_each_cpu(i, cpu_map) {
> > > +               unsigned int imb = 0;
> > > +               unsigned int imb_span = 1;
> > > +
> > > +               for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> > > +                       struct sched_domain *child = sd->child;
> > > +
> > > +                       if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> > > +                           (child->flags & SD_SHARE_PKG_RESOURCES)) {
> >
> > sched_domains have not been degenerated yet so you found here the DIE domain
> >
>
> Yes
>
> > > +                               struct sched_domain *top, *top_p;
> > > +                               unsigned int llc_sq;
> > > +
> > > +                               /*
> > > +                                * nr_llcs = (sd->span_weight / llc_weight);
> > > +                                * imb = (llc_weight / nr_llcs) >> 2
> >
> > it would be good to add a comment to explain why 25% of LLC weight /
> > number of LLC in a node is the right value.
>
> This?
>
>                                  * The 25% imbalance is an arbitrary cutoff
>                                  * based on SMT-2 to balance between memory
>                                  * bandwidth and avoiding premature sharing
>                                  * of HT resources and SMT-4 or SMT-8 *may*
>                                  * benefit from a different cutoff. nr_llcs
>                                  * are accounted for to mitigate premature
>                                  * cache eviction due to multiple tasks
>                                  * using one cache while a sibling cache
>                                  * remains relatively idle.
>
> > For example, why is it better than just 25% of the LLC weight ?
>
> Because let's say there are 2 LLCs; then an imbalance based on just the LLC
> weight might allow 2 tasks to share one cache while another is idle. This
> is the original problem whereby the vanilla imbalance allowed multiple
> LLCs on the same node to be overloaded which hurt workloads that prefer
> to spread wide.

In this case, shouldn't it be (llc_weight >> 2) * nr_llcs to fill each
LLC up to 25%, instead of dividing by nr_llcs?

As an example:
1 node with 1 LLC of 128 CPUs will get an imb_numa_nr = 32
1 node with 2 LLCs of 64 CPUs each will get an imb_numa_nr = 8
1 node with 4 LLCs of 32 CPUs each will get an imb_numa_nr = 2

sd->imb_numa_nr is used at the NUMA level, so the more LLCs you have,
the lower the allowed imbalance.

>
> > Do you want to allow the same imbalance at node level whatever the
> > number of LLC in the node ?
> >
>
> At this point, it's less clear how the larger domains should be
> balanced and the initial scaling is as good an option as any.
>
> > > +                                *
> > > +                                * is equivalent to
> > > +                                *
> > > +                                * imb = (llc_weight^2 / sd->span_weight) >> 2
> > > +                                *
> > > +                                */
> > > +                               llc_sq = child->span_weight * child->span_weight;
> > > +
> > > +                               imb = max(2U, ((llc_sq / sd->span_weight) >> 2));
> > > +                               sd->imb_numa_nr = imb;
> > > +
> > > +                               /*
> > > +                                * Set span based on top domain that places
> > > +                                * tasks in sibling domains.
> > > +                                */
> > > +                               top = sd;
> > > +                               top_p = top->parent;
> > > +                               while (top_p && (top_p->flags & SD_PREFER_SIBLING)) {
> >
> > Why are you looping on SD_PREFER_SIBLING  instead of SD_NUMA  ?
>
> Because on AMD Zen3, I saw inconsistent treatment of SD_NUMA prior to
> degeneration depending on whether it was NPS-1, NPS-2 or NPS-4 and only
> SD_PREFER_SIBLING gave the current results.

SD_PREFER_SIBLING is not mandatory in children of a NUMA node (look at
heterogeneous systems as an example), so relying on this flag seems quite
fragile.
The fact that you see inconsistency with SD_NUMA depending on the NPS-1,
NPS-2 or NPS-4 topology probably means that you don't look for the
right domain level, or that you try to compensate for a side effect of your
formula above.

>
> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2022-01-05 10:42                   ` Mel Gorman
@ 2022-01-05 10:49                     ` Mel Gorman
  2022-01-10 15:53                     ` Vincent Guittot
  1 sibling, 0 replies; 46+ messages in thread
From: Mel Gorman @ 2022-01-05 10:49 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Gautham R. Shenoy, Peter Zijlstra, Ingo Molnar,
	Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith,
	Srikar Dronamraju, LKML

On Wed, Jan 05, 2022 at 10:42:07AM +0000, Mel Gorman wrote:
> On Tue, Dec 21, 2021 at 06:13:15PM +0100, Vincent Guittot wrote:
> > > <SNIP>
> > >
> > > @@ -9050,9 +9054,9 @@ static bool update_pick_idlest(struct sched_group *idlest,
> > >   * This is an approximation as the number of running tasks may not be
> > >   * related to the number of busy CPUs due to sched_setaffinity.
> > >   */
> > > -static inline bool allow_numa_imbalance(int dst_running, int dst_weight)
> > > +static inline bool allow_numa_imbalance(int dst_running, int imb_numa_nr)
> > >  {
> > > -       return (dst_running < (dst_weight >> 2));
> > > +       return dst_running < imb_numa_nr;
> > >  }
> > >
> > >  /*
> > >
> > > <SNIP>
> > >
> > > @@ -9280,19 +9285,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> > >         }
> > >  }
> > >
> > > -#define NUMA_IMBALANCE_MIN 2
> > > -
> > >  static inline long adjust_numa_imbalance(int imbalance,
> > > -                               int dst_running, int dst_weight)
> > > +                               int dst_running, int imb_numa_nr)
> > >  {
> > > -       if (!allow_numa_imbalance(dst_running, dst_weight))
> > > +       if (!allow_numa_imbalance(dst_running, imb_numa_nr))
> > >                 return imbalance;
> > >
> > > -       /*
> > > -        * Allow a small imbalance based on a simple pair of communicating
> > > -        * tasks that remain local when the destination is lightly loaded.
> > > -        */
> > > -       if (imbalance <= NUMA_IMBALANCE_MIN)
> > > +       if (imbalance <= imb_numa_nr)
> > 
> > Isn't this always true ?
> > 
> > imbalance is "always" < dst_running as imbalance is usually the number
> > of these tasks that we would like to migrate
> > 
> 
> It's not necessarily true. allow_numa_imbalance is checking if
> dst_running < imb_numa_nr and adjust_numa_imbalance is checking the
> imbalance.
> 
> imb_numa_nr = 4
> dst_running = 2
> imbalance   = 1
> 
> In that case, imbalance of 1 is ok, but 2 is not.
> 

My bad, this is based on v5 which I just queued for testing.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-21 17:13                 ` Vincent Guittot
  2021-12-22  8:52                   ` Jirka Hladky
@ 2022-01-05 10:42                   ` Mel Gorman
  2022-01-05 10:49                     ` Mel Gorman
  2022-01-10 15:53                     ` Vincent Guittot
  1 sibling, 2 replies; 46+ messages in thread
From: Mel Gorman @ 2022-01-05 10:42 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Gautham R. Shenoy, Peter Zijlstra, Ingo Molnar,
	Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith,
	Srikar Dronamraju, LKML

On Tue, Dec 21, 2021 at 06:13:15PM +0100, Vincent Guittot wrote:
> > <SNIP>
> >
> > @@ -9050,9 +9054,9 @@ static bool update_pick_idlest(struct sched_group *idlest,
> >   * This is an approximation as the number of running tasks may not be
> >   * related to the number of busy CPUs due to sched_setaffinity.
> >   */
> > -static inline bool allow_numa_imbalance(int dst_running, int dst_weight)
> > +static inline bool allow_numa_imbalance(int dst_running, int imb_numa_nr)
> >  {
> > -       return (dst_running < (dst_weight >> 2));
> > +       return dst_running < imb_numa_nr;
> >  }
> >
> >  /*
> >
> > <SNIP>
> >
> > @@ -9280,19 +9285,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> >         }
> >  }
> >
> > -#define NUMA_IMBALANCE_MIN 2
> > -
> >  static inline long adjust_numa_imbalance(int imbalance,
> > -                               int dst_running, int dst_weight)
> > +                               int dst_running, int imb_numa_nr)
> >  {
> > -       if (!allow_numa_imbalance(dst_running, dst_weight))
> > +       if (!allow_numa_imbalance(dst_running, imb_numa_nr))
> >                 return imbalance;
> >
> > -       /*
> > -        * Allow a small imbalance based on a simple pair of communicating
> > -        * tasks that remain local when the destination is lightly loaded.
> > -        */
> > -       if (imbalance <= NUMA_IMBALANCE_MIN)
> > +       if (imbalance <= imb_numa_nr)
> 
> Isn't this always true ?
> 
> imbalance is "always" < dst_running as imbalance is usually the number
> of these tasks that we would like to migrate
> 

It's not necessarily true. allow_numa_imbalance is checking if
dst_running < imb_numa_nr and adjust_numa_imbalance is checking the
imbalance.

imb_numa_nr = 4
dst_running = 2
imbalance   = 1

In that case, imbalance of 1 is ok, but 2 is not.

> 
> >                 return 0;
> >
> >         return imbalance;
> > @@ -9397,7 +9396,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> >                 /* Consider allowing a small imbalance between NUMA groups */
> >                 if (env->sd->flags & SD_NUMA) {
> >                         env->imbalance = adjust_numa_imbalance(env->imbalance,
> > -                               busiest->sum_nr_running, env->sd->span_weight);
> > +                               busiest->sum_nr_running, env->sd->imb_numa_nr);
> >                 }
> >
> >                 return;
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index d201a7052a29..1fa3e977521d 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -2242,6 +2242,55 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> >                 }
> >         }
> >
> > +       /*
> > +        * Calculate an allowed NUMA imbalance such that LLCs do not get
> > +        * imbalanced.
> > +        */
> > +       for_each_cpu(i, cpu_map) {
> > +               unsigned int imb = 0;
> > +               unsigned int imb_span = 1;
> > +
> > +               for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> > +                       struct sched_domain *child = sd->child;
> > +
> > +                       if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> > +                           (child->flags & SD_SHARE_PKG_RESOURCES)) {
> 
> sched_domains have not been degenerated yet so you found here the DIE domain
> 

Yes

> > +                               struct sched_domain *top, *top_p;
> > +                               unsigned int llc_sq;
> > +
> > +                               /*
> > +                                * nr_llcs = (sd->span_weight / llc_weight);
> > +                                * imb = (llc_weight / nr_llcs) >> 2
> 
> it would be good to add a comment to explain why 25% of LLC weight /
> number of LLC in a node is the right value.

This?

                                 * The 25% imbalance is an arbitrary cutoff
                                 * based on SMT-2 to balance between memory
                                 * bandwidth and avoiding premature sharing
                                 * of HT resources and SMT-4 or SMT-8 *may*
                                 * benefit from a different cutoff. nr_llcs
                                 * are accounted for to mitigate premature
                                 * cache eviction due to multiple tasks
                                 * using one cache while a sibling cache
                                 * remains relatively idle.

> For example, why is it better than just 25% of the LLC weight ?

Because let's say there are 2 LLCs; then an imbalance based on just the LLC
weight might allow 2 tasks to share one cache while another is idle. This
is the original problem whereby the vanilla imbalance allowed multiple
LLCs on the same node to be overloaded which hurt workloads that prefer
to spread wide.

> Do you want to allow the same imbalance at node level whatever the
> number of LLC in the node ?
> 

At this point, it's less clear how the larger domains should be
balanced and the initial scaling is as good an option as any.

> > +                                *
> > +                                * is equivalent to
> > +                                *
> > +                                * imb = (llc_weight^2 / sd->span_weight) >> 2
> > +                                *
> > +                                */
> > +                               llc_sq = child->span_weight * child->span_weight;
> > +
> > +                               imb = max(2U, ((llc_sq / sd->span_weight) >> 2));
> > +                               sd->imb_numa_nr = imb;
> > +
> > +                               /*
> > +                                * Set span based on top domain that places
> > +                                * tasks in sibling domains.
> > +                                */
> > +                               top = sd;
> > +                               top_p = top->parent;
> > +                               while (top_p && (top_p->flags & SD_PREFER_SIBLING)) {
> 
> Why are you looping on SD_PREFER_SIBLING  instead of SD_NUMA  ?

Because on AMD Zen3, I saw inconsistent treatment of SD_NUMA prior to
degeneration depending on whether it was NPS-1, NPS-2 or NPS-4 and only
SD_PREFER_SIBLING gave the current results.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-22  8:52                   ` Jirka Hladky
@ 2022-01-04 19:52                     ` Jirka Hladky
  0 siblings, 0 replies; 46+ messages in thread
From: Jirka Hladky @ 2022-01-04 19:52 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Mel Gorman, Gautham R. Shenoy, Peter Zijlstra, Ingo Molnar,
	Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith,
	Srikar Dronamraju, LKML, Philip Auld

Hi Mel,

the table with results is badly formatted. Let me send it again:

We have tested the performance impact of this patch and we see
performance drops of up to 60% with the OpenMP implementation of the NAS
parallel benchmark.

Example of results:
2x AMD EPYC 7313 16-Core server (total 32 cores, 64 CPUs)

Kernel              5.16.0-rc5   5.16.0-rc5+patch_v4
                    Mop/s tot    Mop/s tot              Diff
sp_C_x,  8 threads     36316.1       12939.6            -64%
sp_C_x, 16 threads     64790.2       23968.0            -63%
sp_C_x, 32 threads     67205.5       48891.4            -27%

Other NAS subtests (bt_C_x, is_D_x, ua_C_x) show similar results.

It seems like the allowed imbalance is too large; the negative
impact on workloads that prefer to span wide across all NUMA nodes is
too big.

Thanks
Jirka


On Wed, Dec 22, 2021 at 9:52 AM Jirka Hladky <jhladky@redhat.com> wrote:
>
> Hi Mel,
>
> we have tested the performance impact of this patch and we see
> performance drops up to 60% with the OpenMP implementation of the NAS
> parallel benchmark.
>
> Example of results:
> 2x AMD EPYC 7313 16-Core server (total 32 cores, 64 CPUs)
>
>                       5.16.0-rc5   5.16.0-rc5 + v4 of the patch   Difference in %
>                       (vanilla)
>                       Mop/s total  Mop/s total
> sp_C_x,  8 threads       36316.1      12939.6                     -64%
> sp_C_x, 16 threads       64790.2      23968.0                     -63%
> sp_C_x, 32 threads       67205.5      48891.4                     -27%
>
> Other NAS subtests (bt_C_x, is_D_x, ua_C_x) show similar results.
>
> It seems like the allowed imbalance is too large and the negative
> impact on workloads that prefer to span wide across all NUMA nodes is
> too large.
>
> Thanks
> Jirka
>
>
> On Tue, Dec 21, 2021 at 6:13 PM Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
> >
> > On Mon, 20 Dec 2021 at 12:12, Mel Gorman <mgorman@techsingularity.net> wrote:
> > >
> > > (sorry for the delay, was offline for a few days)
> > >
> > > On Fri, Dec 17, 2021 at 12:03:06AM +0530, Gautham R. Shenoy wrote:
> > > > Hello Mel,
> > > >
> > > > On Wed, Dec 15, 2021 at 12:25:50PM +0000, Mel Gorman wrote:
> > > > > On Wed, Dec 15, 2021 at 05:22:30PM +0530, Gautham R. Shenoy wrote:
> > > >
> > > > [..SNIP..]
> > > >
> >
> > [snip]
> >
> > >
> > > To avoid the corner case, we'd need to explicitly favour spreading early
> > > and assume wakeup will pull communicating tasks together and NUMA
> > > balancing migrate the data after some time which looks like
> > >
> > > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> > > index c07bfa2d80f2..54f5207154d3 100644
> > > --- a/include/linux/sched/topology.h
> > > +++ b/include/linux/sched/topology.h
> > > @@ -93,6 +93,7 @@ struct sched_domain {
> > >         unsigned int busy_factor;       /* less balancing by factor if busy */
> > >         unsigned int imbalance_pct;     /* No balance until over watermark */
> > >         unsigned int cache_nice_tries;  /* Leave cache hot tasks for # tries */
> > > +       unsigned int imb_numa_nr;       /* Nr imbalanced tasks allowed between nodes */
> >
> > So now you compute an allowed imbalance level instead of using
> > 25% of sd->span_weight
> > or
> > 25% of busiest->group_weight
> >
> > And you adjust this new imb_numa_nr according to the topology.
> >
> > That makes sense.
> >
> > >
> > >         int nohz_idle;                  /* NOHZ IDLE status */
> > >         int flags;                      /* See SD_* */
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 0a969affca76..df0e84462e62 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -1489,6 +1489,7 @@ struct task_numa_env {
> > >
> > >         int src_cpu, src_nid;
> > >         int dst_cpu, dst_nid;
> > > +       int imb_numa_nr;
> > >
> > >         struct numa_stats src_stats, dst_stats;
> > >
> > > @@ -1504,7 +1505,8 @@ static unsigned long cpu_load(struct rq *rq);
> > >  static unsigned long cpu_runnable(struct rq *rq);
> > >  static unsigned long cpu_util(int cpu);
> > >  static inline long adjust_numa_imbalance(int imbalance,
> > > -                                       int dst_running, int dst_weight);
> > > +                                       int dst_running,
> > > +                                       int imb_numa_nr);
> > >
> > >  static inline enum
> > >  numa_type numa_classify(unsigned int imbalance_pct,
> > > @@ -1885,7 +1887,7 @@ static void task_numa_find_cpu(struct task_numa_env *env,
> > >                 dst_running = env->dst_stats.nr_running + 1;
> > >                 imbalance = max(0, dst_running - src_running);
> > >                 imbalance = adjust_numa_imbalance(imbalance, dst_running,
> > > -                                                       env->dst_stats.weight);
> > > +                                                 env->imb_numa_nr);
> > >
> > >                 /* Use idle CPU if there is no imbalance */
> > >                 if (!imbalance) {
> > > @@ -1950,8 +1952,10 @@ static int task_numa_migrate(struct task_struct *p)
> > >          */
> > >         rcu_read_lock();
> > >         sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu));
> > > -       if (sd)
> > > +       if (sd) {
> > >                 env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
> > > +               env.imb_numa_nr = sd->imb_numa_nr;
> > > +       }
> > >         rcu_read_unlock();
> > >
> > >         /*
> > > @@ -9050,9 +9054,9 @@ static bool update_pick_idlest(struct sched_group *idlest,
> > >   * This is an approximation as the number of running tasks may not be
> > >   * related to the number of busy CPUs due to sched_setaffinity.
> > >   */
> > > -static inline bool allow_numa_imbalance(int dst_running, int dst_weight)
> > > +static inline bool allow_numa_imbalance(int dst_running, int imb_numa_nr)
> > >  {
> > > -       return (dst_running < (dst_weight >> 2));
> > > +       return dst_running < imb_numa_nr;
> > >  }
> > >
> > >  /*
> > > @@ -9186,12 +9190,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> > >                                 return idlest;
> > >  #endif
> > >                         /*
> > > -                        * Otherwise, keep the task on this node to stay close
> > > -                        * its wakeup source and improve locality. If there is
> > > -                        * a real need of migration, periodic load balance will
> > > -                        * take care of it.
> > > +                        * Otherwise, keep the task on this node to stay local
> > > +                        * to its wakeup source if the number of running tasks
> > > +                        * are below the allowed imbalance. If there is a real
> > > +                        * need of migration, periodic load balance will take
> > > +                        * care of it.
> > >                          */
> > > -                       if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight))
> > > +                       if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->imb_numa_nr))
> > >                                 return NULL;
> > >                 }
> > >
> > > @@ -9280,19 +9285,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> > >         }
> > >  }
> > >
> > > -#define NUMA_IMBALANCE_MIN 2
> > > -
> > >  static inline long adjust_numa_imbalance(int imbalance,
> > > -                               int dst_running, int dst_weight)
> > > +                               int dst_running, int imb_numa_nr)
> > >  {
> > > -       if (!allow_numa_imbalance(dst_running, dst_weight))
> > > +       if (!allow_numa_imbalance(dst_running, imb_numa_nr))
> > >                 return imbalance;
> > >
> > > -       /*
> > > -        * Allow a small imbalance based on a simple pair of communicating
> > > -        * tasks that remain local when the destination is lightly loaded.
> > > -        */
> > > -       if (imbalance <= NUMA_IMBALANCE_MIN)
> > > +       if (imbalance <= imb_numa_nr)
> >
> > Isn't this always true ?
> >
> > imbalance is "always" < dst_running as imbalance is usually the number
> > of these tasks that we would like to migrate
> >
> >
> > >                 return 0;
> > >
> > >         return imbalance;
> > > @@ -9397,7 +9396,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> > >                 /* Consider allowing a small imbalance between NUMA groups */
> > >                 if (env->sd->flags & SD_NUMA) {
> > >                         env->imbalance = adjust_numa_imbalance(env->imbalance,
> > > -                               busiest->sum_nr_running, env->sd->span_weight);
> > > +                               busiest->sum_nr_running, env->sd->imb_numa_nr);
> > >                 }
> > >
> > >                 return;
> > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > > index d201a7052a29..1fa3e977521d 100644
> > > --- a/kernel/sched/topology.c
> > > +++ b/kernel/sched/topology.c
> > > @@ -2242,6 +2242,55 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> > >                 }
> > >         }
> > >
> > > +       /*
> > > +        * Calculate an allowed NUMA imbalance such that LLCs do not get
> > > +        * imbalanced.
> > > +        */
> > > +       for_each_cpu(i, cpu_map) {
> > > +               unsigned int imb = 0;
> > > +               unsigned int imb_span = 1;
> > > +
> > > +               for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> > > +                       struct sched_domain *child = sd->child;
> > > +
> > > +                       if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> > > +                           (child->flags & SD_SHARE_PKG_RESOURCES)) {
> >
> > sched_domains have not been degenerated yet, so what you find here is the DIE domain
> >
> > > +                               struct sched_domain *top, *top_p;
> > > +                               unsigned int llc_sq;
> > > +
> > > +                               /*
> > > +                                * nr_llcs = (sd->span_weight / llc_weight);
> > > +                                * imb = (llc_weight / nr_llcs) >> 2
> >
> > it would be good to add a comment to explain why 25% of the LLC weight /
> > the number of LLCs in a node is the right value.
> > For example, why is it better than just 25% of the LLC weight?
> > Do you want to allow the same imbalance at the node level regardless of
> > the number of LLCs in the node?
> >
> > > +                                *
> > > +                                * is equivalent to
> > > +                                *
> > > +                                * imb = (llc_weight^2 / sd->span_weight) >> 2
> > > +                                *
> > > +                                */
> > > +                               llc_sq = child->span_weight * child->span_weight;
> > > +
> > > +                               imb = max(2U, ((llc_sq / sd->span_weight) >> 2));
> > > +                               sd->imb_numa_nr = imb;
> > > +
> > > +                               /*
> > > +                                * Set span based on top domain that places
> > > +                                * tasks in sibling domains.
> > > +                                */
> > > +                               top = sd;
> > > +                               top_p = top->parent;
> > > +                               while (top_p && (top_p->flags & SD_PREFER_SIBLING)) {
> >
> > Why are you looping on SD_PREFER_SIBLING instead of SD_NUMA?
> > Apart from the heterogeneous (SD_ASYM_CPUCAPACITY) case, which I'm not
> > sure you want to take into account, only NUMA domains don't have
> > SD_PREFER_SIBLING.
> >
> > > +                                       top = top->parent;
> > > +                                       top_p = top->parent;
> > > +                               }
> > > +                               imb_span = top_p ? top_p->span_weight : sd->span_weight;
> > > +                       } else {
> > > +                               int factor = max(1U, (sd->span_weight / imb_span));
> > > +
> > > +                               sd->imb_numa_nr = imb * factor;
> > > +                       }
> > > +               }
> > > +       }
> > > +
> > >         /* Calculate CPU capacity for physical packages and nodes */
> > >         for (i = nr_cpumask_bits-1; i >= 0; i--) {
> > >                 if (!cpumask_test_cpu(i, cpu_map))
> >
>
>
> --
> -Jirka



-- 
-Jirka


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-21 17:13                 ` Vincent Guittot
@ 2021-12-22  8:52                   ` Jirka Hladky
  2022-01-04 19:52                     ` Jirka Hladky
  2022-01-05 10:42                   ` Mel Gorman
  1 sibling, 1 reply; 46+ messages in thread
From: Jirka Hladky @ 2021-12-22  8:52 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Mel Gorman, Gautham R. Shenoy, Peter Zijlstra, Ingo Molnar,
	Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith,
	Srikar Dronamraju, LKML, Philip Auld

Hi Mel,

we have tested the performance impact of this patch and we see
performance drops of up to 60% with the OpenMP implementation of the
NAS parallel benchmark.

Example of results:
2x AMD EPYC 7313 16-Core server (total 32 cores, 64 CPUs)

                      5.16.0-rc5    5.16.0-rc5          Difference
                      (vanilla)     + v4 of the patch   in %
                      Mop/s total   Mop/s total
sp_C_x,  8 threads    36316.1       12939.6             -64%
sp_C_x, 16 threads    64790.2       23968.0             -63%
sp_C_x, 32 threads    67205.5       48891.4             -27%

Other NAS subtests (bt_C_x, is_D_x, ua_C_x) show similar results.

It seems the allowed imbalance is too large, and the negative impact
on workloads that prefer to spread wide across all NUMA nodes is too
severe.

Thanks
Jirka


On Tue, Dec 21, 2021 at 6:13 PM Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Mon, 20 Dec 2021 at 12:12, Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > (sorry for the delay, was offline for a few days)
> >
> > On Fri, Dec 17, 2021 at 12:03:06AM +0530, Gautham R. Shenoy wrote:
> > > Hello Mel,
> > >
> > > On Wed, Dec 15, 2021 at 12:25:50PM +0000, Mel Gorman wrote:
> > > > On Wed, Dec 15, 2021 at 05:22:30PM +0530, Gautham R. Shenoy wrote:
> > >
> > > [..SNIP..]
> > >
>
> [snip]
>
> >
> > To avoid the corner case, we'd need to explicitly favour spreading early
> > and assume wakeup will pull communicating tasks together and NUMA
> > balancing migrate the data after some time which looks like
> >
> > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> > index c07bfa2d80f2..54f5207154d3 100644
> > --- a/include/linux/sched/topology.h
> > +++ b/include/linux/sched/topology.h
> > @@ -93,6 +93,7 @@ struct sched_domain {
> >         unsigned int busy_factor;       /* less balancing by factor if busy */
> >         unsigned int imbalance_pct;     /* No balance until over watermark */
> >         unsigned int cache_nice_tries;  /* Leave cache hot tasks for # tries */
> > +       unsigned int imb_numa_nr;       /* Nr imbalanced tasks allowed between nodes */
>
> So now you compute an allowed imbalance level instead of using
> 25% of sd->span_weight
> or
> 25% of busiest->group_weight
>
> And you adjust this new imb_numa_nr according to the topology.
>
> That makes sense.
>
> >
> >         int nohz_idle;                  /* NOHZ IDLE status */
> >         int flags;                      /* See SD_* */
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 0a969affca76..df0e84462e62 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1489,6 +1489,7 @@ struct task_numa_env {
> >
> >         int src_cpu, src_nid;
> >         int dst_cpu, dst_nid;
> > +       int imb_numa_nr;
> >
> >         struct numa_stats src_stats, dst_stats;
> >
> > @@ -1504,7 +1505,8 @@ static unsigned long cpu_load(struct rq *rq);
> >  static unsigned long cpu_runnable(struct rq *rq);
> >  static unsigned long cpu_util(int cpu);
> >  static inline long adjust_numa_imbalance(int imbalance,
> > -                                       int dst_running, int dst_weight);
> > +                                       int dst_running,
> > +                                       int imb_numa_nr);
> >
> >  static inline enum
> >  numa_type numa_classify(unsigned int imbalance_pct,
> > @@ -1885,7 +1887,7 @@ static void task_numa_find_cpu(struct task_numa_env *env,
> >                 dst_running = env->dst_stats.nr_running + 1;
> >                 imbalance = max(0, dst_running - src_running);
> >                 imbalance = adjust_numa_imbalance(imbalance, dst_running,
> > -                                                       env->dst_stats.weight);
> > +                                                 env->imb_numa_nr);
> >
> >                 /* Use idle CPU if there is no imbalance */
> >                 if (!imbalance) {
> > @@ -1950,8 +1952,10 @@ static int task_numa_migrate(struct task_struct *p)
> >          */
> >         rcu_read_lock();
> >         sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu));
> > -       if (sd)
> > +       if (sd) {
> >                 env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
> > +               env.imb_numa_nr = sd->imb_numa_nr;
> > +       }
> >         rcu_read_unlock();
> >
> >         /*
> > @@ -9050,9 +9054,9 @@ static bool update_pick_idlest(struct sched_group *idlest,
> >   * This is an approximation as the number of running tasks may not be
> >   * related to the number of busy CPUs due to sched_setaffinity.
> >   */
> > -static inline bool allow_numa_imbalance(int dst_running, int dst_weight)
> > +static inline bool allow_numa_imbalance(int dst_running, int imb_numa_nr)
> >  {
> > -       return (dst_running < (dst_weight >> 2));
> > +       return dst_running < imb_numa_nr;
> >  }
> >
> >  /*
> > @@ -9186,12 +9190,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> >                                 return idlest;
> >  #endif
> >                         /*
> > -                        * Otherwise, keep the task on this node to stay close
> > -                        * its wakeup source and improve locality. If there is
> > -                        * a real need of migration, periodic load balance will
> > -                        * take care of it.
> > +                        * Otherwise, keep the task on this node to stay local
> > +                        * to its wakeup source if the number of running tasks
> > +                        * are below the allowed imbalance. If there is a real
> > +                        * need of migration, periodic load balance will take
> > +                        * care of it.
> >                          */
> > -                       if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight))
> > +                       if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->imb_numa_nr))
> >                                 return NULL;
> >                 }
> >
> > @@ -9280,19 +9285,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> >         }
> >  }
> >
> > -#define NUMA_IMBALANCE_MIN 2
> > -
> >  static inline long adjust_numa_imbalance(int imbalance,
> > -                               int dst_running, int dst_weight)
> > +                               int dst_running, int imb_numa_nr)
> >  {
> > -       if (!allow_numa_imbalance(dst_running, dst_weight))
> > +       if (!allow_numa_imbalance(dst_running, imb_numa_nr))
> >                 return imbalance;
> >
> > -       /*
> > -        * Allow a small imbalance based on a simple pair of communicating
> > -        * tasks that remain local when the destination is lightly loaded.
> > -        */
> > -       if (imbalance <= NUMA_IMBALANCE_MIN)
> > +       if (imbalance <= imb_numa_nr)
>
> Isn't this always true ?
>
> imbalance is "always" < dst_running as imbalance is usually the number
> of these tasks that we would like to migrate
>
>
> >                 return 0;
> >
> >         return imbalance;
> > @@ -9397,7 +9396,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> >                 /* Consider allowing a small imbalance between NUMA groups */
> >                 if (env->sd->flags & SD_NUMA) {
> >                         env->imbalance = adjust_numa_imbalance(env->imbalance,
> > -                               busiest->sum_nr_running, env->sd->span_weight);
> > +                               busiest->sum_nr_running, env->sd->imb_numa_nr);
> >                 }
> >
> >                 return;
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index d201a7052a29..1fa3e977521d 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -2242,6 +2242,55 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> >                 }
> >         }
> >
> > +       /*
> > +        * Calculate an allowed NUMA imbalance such that LLCs do not get
> > +        * imbalanced.
> > +        */
> > +       for_each_cpu(i, cpu_map) {
> > +               unsigned int imb = 0;
> > +               unsigned int imb_span = 1;
> > +
> > +               for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> > +                       struct sched_domain *child = sd->child;
> > +
> > +                       if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> > +                           (child->flags & SD_SHARE_PKG_RESOURCES)) {
>
> sched_domains have not been degenerated yet, so what you find here is the DIE domain
>
> > +                               struct sched_domain *top, *top_p;
> > +                               unsigned int llc_sq;
> > +
> > +                               /*
> > +                                * nr_llcs = (sd->span_weight / llc_weight);
> > +                                * imb = (llc_weight / nr_llcs) >> 2
>
> it would be good to add a comment to explain why 25% of the LLC weight /
> the number of LLCs in a node is the right value.
> For example, why is it better than just 25% of the LLC weight?
> Do you want to allow the same imbalance at the node level regardless of
> the number of LLCs in the node?
>
> > +                                *
> > +                                * is equivalent to
> > +                                *
> > +                                * imb = (llc_weight^2 / sd->span_weight) >> 2
> > +                                *
> > +                                */
> > +                               llc_sq = child->span_weight * child->span_weight;
> > +
> > +                               imb = max(2U, ((llc_sq / sd->span_weight) >> 2));
> > +                               sd->imb_numa_nr = imb;
> > +
> > +                               /*
> > +                                * Set span based on top domain that places
> > +                                * tasks in sibling domains.
> > +                                */
> > +                               top = sd;
> > +                               top_p = top->parent;
> > +                               while (top_p && (top_p->flags & SD_PREFER_SIBLING)) {
>
> Why are you looping on SD_PREFER_SIBLING instead of SD_NUMA?
> Apart from the heterogeneous (SD_ASYM_CPUCAPACITY) case, which I'm not
> sure you want to take into account, only NUMA domains don't have
> SD_PREFER_SIBLING.
>
> > +                                       top = top->parent;
> > +                                       top_p = top->parent;
> > +                               }
> > +                               imb_span = top_p ? top_p->span_weight : sd->span_weight;
> > +                       } else {
> > +                               int factor = max(1U, (sd->span_weight / imb_span));
> > +
> > +                               sd->imb_numa_nr = imb * factor;
> > +                       }
> > +               }
> > +       }
> > +
> >         /* Calculate CPU capacity for physical packages and nodes */
> >         for (i = nr_cpumask_bits-1; i >= 0; i--) {
> >                 if (!cpumask_test_cpu(i, cpu_map))
>


-- 
-Jirka


^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-20 11:12               ` Mel Gorman
  2021-12-21 15:03                 ` Gautham R. Shenoy
@ 2021-12-21 17:13                 ` Vincent Guittot
  2021-12-22  8:52                   ` Jirka Hladky
  2022-01-05 10:42                   ` Mel Gorman
  1 sibling, 2 replies; 46+ messages in thread
From: Vincent Guittot @ 2021-12-21 17:13 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Gautham R. Shenoy, Peter Zijlstra, Ingo Molnar,
	Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith,
	Srikar Dronamraju, LKML

On Mon, 20 Dec 2021 at 12:12, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> (sorry for the delay, was offline for a few days)
>
> On Fri, Dec 17, 2021 at 12:03:06AM +0530, Gautham R. Shenoy wrote:
> > Hello Mel,
> >
> > On Wed, Dec 15, 2021 at 12:25:50PM +0000, Mel Gorman wrote:
> > > On Wed, Dec 15, 2021 at 05:22:30PM +0530, Gautham R. Shenoy wrote:
> >
> > [..SNIP..]
> >

[snip]

>
> To avoid the corner case, we'd need to explicitly favour spreading early
> and assume wakeup will pull communicating tasks together and NUMA
> balancing migrate the data after some time which looks like
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index c07bfa2d80f2..54f5207154d3 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -93,6 +93,7 @@ struct sched_domain {
>         unsigned int busy_factor;       /* less balancing by factor if busy */
>         unsigned int imbalance_pct;     /* No balance until over watermark */
>         unsigned int cache_nice_tries;  /* Leave cache hot tasks for # tries */
> +       unsigned int imb_numa_nr;       /* Nr imbalanced tasks allowed between nodes */

So now you compute an allowed imbalance level instead of using
25% of sd->span_weight
or
25% of busiest->group_weight

And you adjust this new imb_numa_nr according to the topology.

That makes sense.
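
Just to double-check my reading of the arithmetic, here is a quick
user-space sketch (the topology below is assumed: 16-CPU LLCs and two
128-CPU sockets; it only mirrors the formula in the patch, it is not
the kernel code):

#include <stdio.h>

/* Mirrors imb = max(2U, ((llc_sq / sd->span_weight) >> 2)) from the patch */
static unsigned int calc_imb(unsigned int llc_weight, unsigned int sd_weight)
{
	unsigned int llc_sq = llc_weight * llc_weight;
	unsigned int imb = (llc_sq / sd_weight) >> 2;

	return imb < 2 ? 2 : imb;
}

int main(void)
{
	/* Assumed topology: 16-CPU LLCs, 128-CPU sockets, 256 CPUs in total */
	unsigned int imb = calc_imb(16, 128);	/* (256 / 128) >> 2 = 0 -> clamped to 2 */
	unsigned int imb_span = 256;		/* first parent without SD_PREFER_SIBLING */
	unsigned int factor = 256 / imb_span;	/* for the 256-CPU NUMA level: 1 */

	printf("DIE  imb_numa_nr = %u\n", imb);			/* 2 */
	printf("NUMA imb_numa_nr = %u\n", imb * factor);	/* 2 */
	return 0;
}

which, for what it's worth, lines up with the NPS=1 numbers Gautham
worked out for Zen3 earlier in the thread.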

>
>         int nohz_idle;                  /* NOHZ IDLE status */
>         int flags;                      /* See SD_* */
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0a969affca76..df0e84462e62 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1489,6 +1489,7 @@ struct task_numa_env {
>
>         int src_cpu, src_nid;
>         int dst_cpu, dst_nid;
> +       int imb_numa_nr;
>
>         struct numa_stats src_stats, dst_stats;
>
> @@ -1504,7 +1505,8 @@ static unsigned long cpu_load(struct rq *rq);
>  static unsigned long cpu_runnable(struct rq *rq);
>  static unsigned long cpu_util(int cpu);
>  static inline long adjust_numa_imbalance(int imbalance,
> -                                       int dst_running, int dst_weight);
> +                                       int dst_running,
> +                                       int imb_numa_nr);
>
>  static inline enum
>  numa_type numa_classify(unsigned int imbalance_pct,
> @@ -1885,7 +1887,7 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>                 dst_running = env->dst_stats.nr_running + 1;
>                 imbalance = max(0, dst_running - src_running);
>                 imbalance = adjust_numa_imbalance(imbalance, dst_running,
> -                                                       env->dst_stats.weight);
> +                                                 env->imb_numa_nr);
>
>                 /* Use idle CPU if there is no imbalance */
>                 if (!imbalance) {
> @@ -1950,8 +1952,10 @@ static int task_numa_migrate(struct task_struct *p)
>          */
>         rcu_read_lock();
>         sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu));
> -       if (sd)
> +       if (sd) {
>                 env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
> +               env.imb_numa_nr = sd->imb_numa_nr;
> +       }
>         rcu_read_unlock();
>
>         /*
> @@ -9050,9 +9054,9 @@ static bool update_pick_idlest(struct sched_group *idlest,
>   * This is an approximation as the number of running tasks may not be
>   * related to the number of busy CPUs due to sched_setaffinity.
>   */
> -static inline bool allow_numa_imbalance(int dst_running, int dst_weight)
> +static inline bool allow_numa_imbalance(int dst_running, int imb_numa_nr)
>  {
> -       return (dst_running < (dst_weight >> 2));
> +       return dst_running < imb_numa_nr;
>  }
>
>  /*
> @@ -9186,12 +9190,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>                                 return idlest;
>  #endif
>                         /*
> -                        * Otherwise, keep the task on this node to stay close
> -                        * its wakeup source and improve locality. If there is
> -                        * a real need of migration, periodic load balance will
> -                        * take care of it.
> +                        * Otherwise, keep the task on this node to stay local
> +                        * to its wakeup source if the number of running tasks
> +                        * are below the allowed imbalance. If there is a real
> +                        * need of migration, periodic load balance will take
> +                        * care of it.
>                          */
> -                       if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight))
> +                       if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->imb_numa_nr))
>                                 return NULL;
>                 }
>
> @@ -9280,19 +9285,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>         }
>  }
>
> -#define NUMA_IMBALANCE_MIN 2
> -
>  static inline long adjust_numa_imbalance(int imbalance,
> -                               int dst_running, int dst_weight)
> +                               int dst_running, int imb_numa_nr)
>  {
> -       if (!allow_numa_imbalance(dst_running, dst_weight))
> +       if (!allow_numa_imbalance(dst_running, imb_numa_nr))
>                 return imbalance;
>
> -       /*
> -        * Allow a small imbalance based on a simple pair of communicating
> -        * tasks that remain local when the destination is lightly loaded.
> -        */
> -       if (imbalance <= NUMA_IMBALANCE_MIN)
> +       if (imbalance <= imb_numa_nr)

Isn't this always true ?

imbalance is "always" < dst_running as imbalance is usually the number
of these tasks that we would like to migrate
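
(e.g. in the task_numa_find_cpu() path with imb_numa_nr = 4: the
allow_numa_imbalance() test above only passes when dst_running < 4,
and imbalance = max(0, dst_running - src_running) can never exceed
dst_running, so the new test would always seem to be satisfied there.)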


>                 return 0;
>
>         return imbalance;
> @@ -9397,7 +9396,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>                 /* Consider allowing a small imbalance between NUMA groups */
>                 if (env->sd->flags & SD_NUMA) {
>                         env->imbalance = adjust_numa_imbalance(env->imbalance,
> -                               busiest->sum_nr_running, env->sd->span_weight);
> +                               busiest->sum_nr_running, env->sd->imb_numa_nr);
>                 }
>
>                 return;
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index d201a7052a29..1fa3e977521d 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2242,6 +2242,55 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>                 }
>         }
>
> +       /*
> +        * Calculate an allowed NUMA imbalance such that LLCs do not get
> +        * imbalanced.
> +        */
> +       for_each_cpu(i, cpu_map) {
> +               unsigned int imb = 0;
> +               unsigned int imb_span = 1;
> +
> +               for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> +                       struct sched_domain *child = sd->child;
> +
> +                       if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> +                           (child->flags & SD_SHARE_PKG_RESOURCES)) {

sched_domains have not been degenerated yet, so what you find here is the DIE domain

> +                               struct sched_domain *top, *top_p;
> +                               unsigned int llc_sq;
> +
> +                               /*
> +                                * nr_llcs = (sd->span_weight / llc_weight);
> +                                * imb = (llc_weight / nr_llcs) >> 2

it would be good to add a comment to explain why 25% of the LLC weight /
the number of LLCs in a node is the right value.
For example, why is it better than just 25% of the LLC weight?
Do you want to allow the same imbalance at the node level regardless of
the number of LLCs in the node?
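
(To make the question concrete: with a 16-CPU LLC and 8 LLCs in the
node, 25% of the LLC weight would allow an imbalance of 4 tasks,
whereas (llc_weight / nr_llcs) >> 2 works out to (16 / 8) >> 2 = 0 and
only the max(2U, ...) clamp keeps it at 2.)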

> +                                *
> +                                * is equivalent to
> +                                *
> +                                * imb = (llc_weight^2 / sd->span_weight) >> 2
> +                                *
> +                                */
> +                               llc_sq = child->span_weight * child->span_weight;
> +
> +                               imb = max(2U, ((llc_sq / sd->span_weight) >> 2));
> +                               sd->imb_numa_nr = imb;
> +
> +                               /*
> +                                * Set span based on top domain that places
> +                                * tasks in sibling domains.
> +                                */
> +                               top = sd;
> +                               top_p = top->parent;
> +                               while (top_p && (top_p->flags & SD_PREFER_SIBLING)) {

Why are you looping on SD_PREFER_SIBLING instead of SD_NUMA?
Apart from the heterogeneous (SD_ASYM_CPUCAPACITY) case, which I'm not
sure you want to take into account, only NUMA domains don't have
SD_PREFER_SIBLING.

> +                                       top = top->parent;
> +                                       top_p = top->parent;
> +                               }
> +                               imb_span = top_p ? top_p->span_weight : sd->span_weight;
> +                       } else {
> +                               int factor = max(1U, (sd->span_weight / imb_span));
> +
> +                               sd->imb_numa_nr = imb * factor;
> +                       }
> +               }
> +       }
> +
>         /* Calculate CPU capacity for physical packages and nodes */
>         for (i = nr_cpumask_bits-1; i >= 0; i--) {
>                 if (!cpumask_test_cpu(i, cpu_map))

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-20 11:12               ` Mel Gorman
@ 2021-12-21 15:03                 ` Gautham R. Shenoy
  2021-12-21 17:13                 ` Vincent Guittot
  1 sibling, 0 replies; 46+ messages in thread
From: Gautham R. Shenoy @ 2021-12-21 15:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML

On Mon, Dec 20, 2021 at 11:12:43AM +0000, Mel Gorman wrote:
> (sorry for the delay, was offline for a few days)
> 
> On Fri, Dec 17, 2021 at 12:03:06AM +0530, Gautham R. Shenoy wrote:
> > Hello Mel,
> > 
> > On Wed, Dec 15, 2021 at 12:25:50PM +0000, Mel Gorman wrote:
> > > On Wed, Dec 15, 2021 at 05:22:30PM +0530, Gautham R. Shenoy wrote:
> > 
> > [..SNIP..]
> > 
> > > > On a 2 Socket Zen3:
> > > > 
> > > > NPS=1
> > > >    child=MC, llc_weight=16, sd=DIE. sd->span_weight=128 imb=max(2U, (16*16/128) / 4)=2
> > > >    top_p = NUMA, imb_span = 256.
> > > > 
> > > >    NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/256) = 2
> > > > 
> > > > NPS=2
> > > >    child=MC, llc_weight=16, sd=NODE. sd->span_weight=64 imb=max(2U, (16*16/64) / 4) = 2
> > > >    top_p = NUMA, imb_span = 128.
> > > > 
> > > >    NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2
> > > >    NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4
> > > > 
> > > > NPS=4:
> > > >    child=MC, llc_weight=16, sd=NODE. sd->span_weight=32 imb=max(2U, (16*16/32) / 4) = 2
> > > >    top_p = NUMA, imb_span = 128.
> > > > 
> > > >    NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2
> > > >    NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4
> > > > 
> > > > Again, we will be more aggressively load balancing across the two
> > > > sockets in NPS=1 mode compared to NPS=2/4.
> > > > 
> > > 
> > > Yes, but I felt it was reasonable behaviour because we have to strike
> > > some sort of balance between allowing a NUMA imbalance up to a point
> > > to prevent communicating tasks being pulled apart and v3 broke that
> > > completely. There will always be a tradeoff between tasks that want to
> > > remain local to each other and others that prefer to spread as wide as
> > > possible as quickly as possible.
> > 
> > I agree with this argument that we want to be conservative while
> > pulling tasks across NUMA domains. My point was that the threshold at
> > the NUMA domain that spans the 2 sockets is lower for NPS=1
> > (imb_numa_nr = 2) when compared to the threshold for the same NUMA
> > domain when NPS=2/4 (imb_numa_nr = 4).
> > 
> 
> Is that a problem though? On an Intel machine with sub-numa clustering,
> the distances are 11 and 21 for a "node" that is the split cache and the
> remote node respectively.

So my question was: on an Intel machine with sub-NUMA clustering
enabled vs disabled, is the value of imb_numa_nr for the NUMA domain
which spans the remote nodes (distance=21) the same or different? And
if it is different, what is the rationale behind that? I am totally
on board with the idea that for the different NUMA levels, the
corresponding imb_numa_nr should be different.

Just in case I was not making myself clear earlier: on Zen3, the
NUMA-A sched-domain in the figures below has groups where each group
spans a socket in all the NPS configurations. However, only on NPS=1
do we have sd->imb_numa_nr=2 for NUMA-A, while on NPS=2/4 the value of
sd->imb_numa_nr is 4 for the NUMA-A domain. Thus if we had 4 tasks
sharing data, on NPS=2/4 they would reside on the same socket, while
on NPS=1 we would have 2 tasks on one socket and the other 2 would be
migrated to the other socket.

That said, I have not been able to observe any significant difference
with a real-world workload like Mongodb run on NPS=1 with imb_numa_nr
set to 2 vs 4.



Zen3, NPS=1
------------------------------------------------------------------
|                                                                |
|  NUMA-A : sd->imb_numa_nr = 2   			     	 |
|     ------------------------     ------------------------  	 |
|     |DIE                   |     |DIE                   |  	 |
|     |                      |     |                      |  	 |
|     |    ------   ------   |     |    ------   ------   |  	 |
|     |    |MC  |   |MC  |   |     |    |MC  |   |MC  |   |  	 |
|     |    ------   ------   |	   |    ------   ------   |  	 |
|     |    ------   ------   |	   |    ------   ------   |  	 |
|     |    |MC  |   |MC  |   |	   |    |MC  |   |MC  |   |  	 |
|     |    ------   ------   |	   |    ------   ------   |  	 |
|     |                      |	   |                      |  	 |
|     |    ------   ------   |	   |    ------   ------   |  	 |
|     |    |MC  |   |MC  |   |	   |    |MC  |   |MC  |   |  	 |
|     |    ------   ------   |	   |    ------   ------   |  	 |
|     |    ------   ------   |	   |    ------   ------   |  	 |
|     |    |MC  |   |MC  |   |     |    |MC  |   |MC  |   |  	 |
|     |    ------   ------   |	   |    ------   ------   |  	 |
|     |                      |	   |                      |  	 |
|     ------------------------	   ------------------------  	 |
|							     	 |
------------------------------------------------------------------



Zen3, NPS=2
------------------------------------------------------------------
|                                                                |
|  NUMA-A: sd->imb_numa_nr = 4   				 |
|    ---------------------------  --------------------------- 	 |
|    |NUMA-B :sd->imb_numa_nr=2|  |NUMA-B :sd->imb_numa_nr=2|	 |
|    | ----------- ----------- |  | ----------- ----------- |	 |
|    | |NODE     | |NODE     | |  | |NODE     | |NODE     | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  |MC  | | |  |MC  | | |  | |  |MC  | | |  |MC  | | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  |MC  | | |  |MC  | | |  | |  |MC  | | |  |MC  | | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |         | |         | |  | |         | |         | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  |MC  | | |  |MC  | | |  | |  |MC  | | |  |MC  | | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  |MC  | | |  |MC  | | |  | |  |MC  | | |  |MC  | | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | ----------- ----------- |  | ----------- ----------- |	 |
|    |                         |  |                         |	 |
|    ---------------------------  ---------------------------	 |
|							     	 |
------------------------------------------------------------------


Zen3, NPS=4
------------------------------------------------------------------
|                                                                |
|  NUMA-A: sd->imb_numa_nr = 4   				 |
|    ---------------------------  --------------------------- 	 |
|    |NUMA-B :sd->imb_numa_nr=2|  |NUMA-B :sd->imb_numa_nr=2|	 |
|    | ----------- ----------- |  | ----------- ----------- |	 |
|    | |NODE     | |NODE     | |  | |NODE     | |NODE     | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  |MC  | | |  |MC  | | |  | |  |MC  | | |  |MC  | | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  |MC  | | |  |MC  | | |  | |  |MC  | | |  |MC  | | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | ----------- ----------- |  | ----------- ----------- |    |
|    | ----------- ----------- |  | ----------- ----------- |    |
|    | |NODE     | |NODE     | |  | |NODE     | |NODE     | |    |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  |MC  | | |  |MC  | | |  | |  |MC  | | |  |MC  | | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | |  |MC  | | |  |MC  | | |  | |  |MC  | | |  |MC  | | |	 |
|    | |  ------ | |  ------ | |  | |  ------ | |  ------ | |	 |
|    | ----------- ----------- |  | ----------- ----------- |	 |
|    |                         |  |                         |	 |
|    ---------------------------  ---------------------------	 |
|							     	 |
------------------------------------------------------------------




> > Irrespective of what NPS mode we are operating in, the NUMA distance
> > between the two sockets is 32 on Zen3 systems. Hence shouldn't the
> > thresholds be the same for that level of NUMA? 
> > 
> 
> Maybe, but then it is not a property of the sched_domain and instead
> needs to account for distance when balancing between two nodes that may
> be at varying distances from each other.
> 
> > Would something like the following work ?
> > 
> > if (sd->flags & SD_NUMA) {
> > 
> >    /* We are using the child as a proxy for the group. */
> >    group_span = sd->child->span_weight;
> >    sd_distance = /* NUMA distance at this sd level */
> > 
> 
> NUMA distance relative to what? On Zen, the distance to a remote node may
> be fixed but on topologies with multiple nodes that are not fully linked
> to every other node by one hop, the same is not true.

Fair enough. The "sd_distance" I was referring to is the node_distance()
between the CPUs of any two groups in this NUMA domain. However, this
was assuming that the node_distance() between the CPUs of any two
groups would be the same, which is not the case for certain
platforms. So this wouldn't work.



> 
> >    /* By default we set the threshold to 1/4th the sched-group span. */
> >    imb_numa_shift = 2;
> > 
> >    /*
> >     * We can be a little aggressive if the cost of migrating tasks
> >     * across groups of this NUMA level is not high.
> >     * Assuming 
> >     */
> >    
> >    if (sd_distance < REMOTE_DISTANCE)
> >       imb_numa_shift++;
> > 
> 
> The distance would have to be accounted for elsewhere because here we
> are considering one node in isolation, not relative to other nodes.
> 
> >    /*
> >     * Compute the number of LLCs in each group.
> >     * More the LLCs, more aggressively we migrate across
> >     * the groups at this NUMA sd.
> >     */
> >     nr_llcs = group_span/llc_size;
> > 
> >     sd->imb_numa_nr = max(2U, (group_span / nr_llcs) >> imb_numa_shift);
> > }
> > 
> 
> Same, any adjustment would have to happen during load balancing taking
> into account the relative NUMA distances. I'm not necessarily opposed
> but it would be a separate patch.


Sure, we can look into this separately.


> 
> > > > <SNIP>
> > > > If we retain the (2,4) thresholds from v4.1 but use them in
> > > > allow_numa_imbalance() as in v3 we get
> > > > 
> > > > NPS=4
> > > > Test:	 mel-v4.2
> > > >  Copy:	 225860.12 (498.11%)
> > > > Scale:	 227869.07 (572.58%)
> > > >   Add:	 278365.58 (624.93%)
> > > > Triad:	 264315.44 (596.62%)
> > > > 
> > > 
> > > The potential problem with this is that it probably will work for
> > > netperf when it's a single communicating pair but may not work as well
> > > when there are multiple communicating pairs or a number of communicating
> > tasks that exceeds imb_numa_nr.
> > 
> > 
> > Yes that's true. I think what you are doing in v4 is the right thing.
> > 
> > In case of stream in NPS=4, it just manages to hit the corner case for
> > this heuristic which results in a suboptimal behaviour. Description
> > follows:
> > 
> 
> To avoid the corner case, we'd need to explicitly favour spreading early
> and assume wakeup will pull communicating tasks together and NUMA
> balancing migrate the data after some time which looks like


Actually I was able to root-cause the reason behind the drop in the
performance of stream on NPS-4. I have already responded earlier in
another thread : https://lore.kernel.org/lkml/Ybzq%2FA+HS%2FGxGYha@BLR-5CG11610CF.amd.com/
Appending the patch here:


---
 kernel/sched/fair.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ec354bf88b0d..c1b2a422a877 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9191,13 +9191,14 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 				return idlest;
 #endif
 			/*
-			 * Otherwise, keep the task on this node to stay local
-			 * to its wakeup source if the number of running tasks
-			 * are below the allowed imbalance. If there is a real
-			 * need of migration, periodic load balance will take
-			 * care of it.
+			 * Otherwise, keep the task on this node to
+			 * stay local to its wakeup source if the
+			 * number of running tasks (including the new
+			 * one) are below the allowed imbalance. If
+			 * there is a real need of migration, periodic
+			 * load balance will take care of it.
 			 */
-			if (local_sgs.sum_nr_running <= sd->imb_numa_nr)
+			if (local_sgs.sum_nr_running + 1 <= sd->imb_numa_nr)
 				return NULL;
 		}
 
-- 

With this fix on top of your fix to compute the imb_numa_nr at the
relevant level (v4.1:
https://lore.kernel.org/lkml/20211213130131.GQ3366@techsingularity.net/),
the stream regression for NPS4 is no longer there.


Test:	tip-core	    v4		       v4.1		    v4.1-find_idlest_group_fix
 Copy:	 37762.14 (0.00%)   48748.60 (29.09%)  164765.83 (336.32%)  205963.99 (445.42%)
Scale:	 33879.66 (0.00%)   48317.66 (42.61%)  159641.67 (371.20%)  218136.57 (543.85%)
  Add:	 38398.90 (0.00%)   54259.56 (41.30%)  184583.70 (380.70%)  257857.90 (571.52%)
Triad:	 37942.38 (0.00%)   54503.74 (43.64%)  181250.80 (377.70%)  251410.28 (562.61%)

--
Thanks and Regards
gautham.




^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-16 18:33             ` Gautham R. Shenoy
@ 2021-12-20 11:12               ` Mel Gorman
  2021-12-21 15:03                 ` Gautham R. Shenoy
  2021-12-21 17:13                 ` Vincent Guittot
  0 siblings, 2 replies; 46+ messages in thread
From: Mel Gorman @ 2021-12-20 11:12 UTC (permalink / raw)
  To: Gautham R. Shenoy
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML

(sorry for the delay, was offline for a few days)

On Fri, Dec 17, 2021 at 12:03:06AM +0530, Gautham R. Shenoy wrote:
> Hello Mel,
> 
> On Wed, Dec 15, 2021 at 12:25:50PM +0000, Mel Gorman wrote:
> > On Wed, Dec 15, 2021 at 05:22:30PM +0530, Gautham R. Shenoy wrote:
> 
> [..SNIP..]
> 
> > > On a 2 Socket Zen3:
> > > 
> > > NPS=1
> > >    child=MC, llc_weight=16, sd=DIE. sd->span_weight=128 imb=max(2U, (16*16/128) / 4)=2
> > >    top_p = NUMA, imb_span = 256.
> > > 
> > >    NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/256) = 2
> > > 
> > > NPS=2
> > >    child=MC, llc_weight=16, sd=NODE. sd->span_weight=64 imb=max(2U, (16*16/64) / 4) = 2
> > >    top_p = NUMA, imb_span = 128.
> > > 
> > >    NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2
> > >    NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4
> > > 
> > > NPS=4:
> > >    child=MC, llc_weight=16, sd=NODE. sd->span_weight=32 imb=max(2U, (16*16/32) / 4) = 2
> > >    top_p = NUMA, imb_span = 128.
> > > 
> > >    NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2
> > >    NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4
> > > 
> > > Again, we will be more aggressively load balancing across the two
> > > sockets in NPS=1 mode compared to NPS=2/4.
> > > 
> > 
> > Yes, but I felt it was reasonable behaviour because we have to strike
> > some sort of balance between allowing a NUMA imbalance up to a point
> > to prevent communicating tasks being pulled apart and v3 broke that
> > completely. There will always be a tradeoff between tasks that want to
> > remain local to each other and others that prefer to spread as wide as
> > possible as quickly as possible.
> 
> I agree with this argument that we want to be conservative while
> pulling tasks across NUMA domains. My point was that the threshold at
> the NUMA domain that spans the 2 sockets is lower for NPS=1
> (imb_numa_nr = 2) when compared to the threshold for the same NUMA
> domain when NPS=2/4 (imb_numa_nr = 4).
> 

Is that a problem though? On an Intel machine with sub-numa clustering,
the distances are 11 and 21 for a "node" that is the split cache and the
remote node respectively.

> Irrespective of what NPS mode we are operating in, the NUMA distance
> between the two sockets is 32 on Zen3 systems. Hence shouldn't the
> thresholds be the same for that level of NUMA? 
> 

Maybe, but then it is not a property of the sched_domain and instead
needs to account for distance when balancing between two nodes that may
be at varying distances from each other.

> Would something like the following work ?
> 
> if (sd->flags & SD_NUMA) {
> 
>    /* We are using the child as a proxy for the group. */
>    group_span = sd->child->span_weight;
>    sd_distance = /* NUMA distance at this sd level */
> 

NUMA distance relative to what? On Zen, the distance to a remote node may
be fixed but on topologies with multiple nodes that are not fully linked
to every other node by one hop, the same is not true.

>    /* By default we set the threshold to 1/4th the sched-group span. */
>    imb_numa_shift = 2;
> 
>    /*
>     * We can be a little aggressive if the cost of migrating tasks
>     * across groups of this NUMA level is not high.
>     * Assuming 
>     */
>    
>    if (sd_distance < REMOTE_DISTANCE)
>       imb_numa_shift++;
> 

The distance would have to be accounted for elsewhere because here we
are considering one node in isolation, not relative to other nodes.

>    /*
>     * Compute the number of LLCs in each group.
>     * More the LLCs, more aggressively we migrate across
>     * the groups at this NUMA sd.
>     */
>     nr_llcs = group_span/llc_size;
> 
>     sd->imb_numa_nr = max(2U, (group_span / nr_llcs) >> imb_numa_shift);
> }
> 

Same, any adjustment would have to happen during load balancing taking
into account the relative NUMA distances. I'm not necessarily opposed
but it would be a separate patch.

> > > <SNIP>
> > > If we retain the (2,4) thresholds from v4.1 but use them in
> > > allow_numa_imbalance() as in v3 we get
> > > 
> > > NPS=4
> > > Test:	 mel-v4.2
> > >  Copy:	 225860.12 (498.11%)
> > > Scale:	 227869.07 (572.58%)
> > >   Add:	 278365.58 (624.93%)
> > > Triad:	 264315.44 (596.62%)
> > > 
> > 
> > The potential problem with this is that it probably will work for
> > netperf when it's a single communicating pair but may not work as well
> > when there are multiple communicating pairs or a number of communicating
> tasks that exceeds imb_numa_nr.
> 
> 
> Yes that's true. I think what you are doing in v4 is the right thing.
> 
> In case of stream in NPS=4, it just manages to hit the corner case for
> this heuristic which results in a suboptimal behaviour. Description
> follows:
> 

To avoid the corner case, we'd need to explicitly favour spreading early
and assume wakeup will pull communicating tasks together and NUMA
balancing migrate the data after some time which looks like

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index c07bfa2d80f2..54f5207154d3 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -93,6 +93,7 @@ struct sched_domain {
 	unsigned int busy_factor;	/* less balancing by factor if busy */
 	unsigned int imbalance_pct;	/* No balance until over watermark */
 	unsigned int cache_nice_tries;	/* Leave cache hot tasks for # tries */
+	unsigned int imb_numa_nr;	/* Nr imbalanced tasks allowed between nodes */
 
 	int nohz_idle;			/* NOHZ IDLE status */
 	int flags;			/* See SD_* */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0a969affca76..df0e84462e62 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1489,6 +1489,7 @@ struct task_numa_env {
 
 	int src_cpu, src_nid;
 	int dst_cpu, dst_nid;
+	int imb_numa_nr;
 
 	struct numa_stats src_stats, dst_stats;
 
@@ -1504,7 +1505,8 @@ static unsigned long cpu_load(struct rq *rq);
 static unsigned long cpu_runnable(struct rq *rq);
 static unsigned long cpu_util(int cpu);
 static inline long adjust_numa_imbalance(int imbalance,
-					int dst_running, int dst_weight);
+					int dst_running,
+					int imb_numa_nr);
 
 static inline enum
 numa_type numa_classify(unsigned int imbalance_pct,
@@ -1885,7 +1887,7 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 		dst_running = env->dst_stats.nr_running + 1;
 		imbalance = max(0, dst_running - src_running);
 		imbalance = adjust_numa_imbalance(imbalance, dst_running,
-							env->dst_stats.weight);
+						  env->imb_numa_nr);
 
 		/* Use idle CPU if there is no imbalance */
 		if (!imbalance) {
@@ -1950,8 +1952,10 @@ static int task_numa_migrate(struct task_struct *p)
 	 */
 	rcu_read_lock();
 	sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu));
-	if (sd)
+	if (sd) {
 		env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
+		env.imb_numa_nr = sd->imb_numa_nr;
+	}
 	rcu_read_unlock();
 
 	/*
@@ -9050,9 +9054,9 @@ static bool update_pick_idlest(struct sched_group *idlest,
  * This is an approximation as the number of running tasks may not be
  * related to the number of busy CPUs due to sched_setaffinity.
  */
-static inline bool allow_numa_imbalance(int dst_running, int dst_weight)
+static inline bool allow_numa_imbalance(int dst_running, int imb_numa_nr)
 {
-	return (dst_running < (dst_weight >> 2));
+	return dst_running < imb_numa_nr;
 }
 
 /*
@@ -9186,12 +9190,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 				return idlest;
 #endif
 			/*
-			 * Otherwise, keep the task on this node to stay close
-			 * its wakeup source and improve locality. If there is
-			 * a real need of migration, periodic load balance will
-			 * take care of it.
+			 * Otherwise, keep the task on this node to stay local
+			 * to its wakeup source if the number of running tasks
+			 * are below the allowed imbalance. If there is a real
+			 * need of migration, periodic load balance will take
+			 * care of it.
 			 */
-			if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight))
+			if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->imb_numa_nr))
 				return NULL;
 		}
 
@@ -9280,19 +9285,13 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	}
 }
 
-#define NUMA_IMBALANCE_MIN 2
-
 static inline long adjust_numa_imbalance(int imbalance,
-				int dst_running, int dst_weight)
+				int dst_running, int imb_numa_nr)
 {
-	if (!allow_numa_imbalance(dst_running, dst_weight))
+	if (!allow_numa_imbalance(dst_running, imb_numa_nr))
 		return imbalance;
 
-	/*
-	 * Allow a small imbalance based on a simple pair of communicating
-	 * tasks that remain local when the destination is lightly loaded.
-	 */
-	if (imbalance <= NUMA_IMBALANCE_MIN)
+	if (imbalance <= imb_numa_nr)
 		return 0;
 
 	return imbalance;
@@ -9397,7 +9396,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		/* Consider allowing a small imbalance between NUMA groups */
 		if (env->sd->flags & SD_NUMA) {
 			env->imbalance = adjust_numa_imbalance(env->imbalance,
-				busiest->sum_nr_running, env->sd->span_weight);
+				busiest->sum_nr_running, env->sd->imb_numa_nr);
 		}
 
 		return;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index d201a7052a29..1fa3e977521d 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2242,6 +2242,55 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 		}
 	}
 
+	/*
+	 * Calculate an allowed NUMA imbalance such that LLCs do not get
+	 * imbalanced.
+	 */
+	for_each_cpu(i, cpu_map) {
+		unsigned int imb = 0;
+		unsigned int imb_span = 1;
+
+		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
+			struct sched_domain *child = sd->child;
+
+			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
+			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
+				struct sched_domain *top, *top_p;
+				unsigned int llc_sq;
+
+				/*
+				 * nr_llcs = (sd->span_weight / llc_weight);
+				 * imb = (llc_weight / nr_llcs) >> 2
+				 *
+				 * is equivalent to
+				 *
+				 * imb = (llc_weight^2 / sd->span_weight) >> 2
+				 *
+				 */
+				llc_sq = child->span_weight * child->span_weight;
+
+				imb = max(2U, ((llc_sq / sd->span_weight) >> 2));
+				sd->imb_numa_nr = imb;
+
+				/*
+				 * Set span based on top domain that places
+				 * tasks in sibling domains.
+				 */
+				top = sd;
+				top_p = top->parent;
+				while (top_p && (top_p->flags & SD_PREFER_SIBLING)) {
+					top = top->parent;
+					top_p = top->parent;
+				}
+				imb_span = top_p ? top_p->span_weight : sd->span_weight;
+			} else {
+				int factor = max(1U, (sd->span_weight / imb_span));
+
+				sd->imb_numa_nr = imb * factor;
+			}
+		}
+	}
+
 	/* Calculate CPU capacity for physical packages and nodes */
 	for (i = nr_cpumask_bits-1; i >= 0; i--) {
 		if (!cpumask_test_cpu(i, cpu_map))

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-10  9:33 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans " Mel Gorman
  2021-12-13  8:28   ` Gautham R. Shenoy
@ 2021-12-17 19:54   ` Gautham R. Shenoy
  1 sibling, 0 replies; 46+ messages in thread
From: Gautham R. Shenoy @ 2021-12-17 19:54 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML

On Fri, Dec 10, 2021 at 09:33:07AM +0000, Mel Gorman wrote:
[..snip..]

> @@ -9186,12 +9191,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>  				return idlest;
>  #endif
>  			/*
> -			 * Otherwise, keep the task on this node to stay close
> -			 * its wakeup source and improve locality. If there is
> -			 * a real need of migration, periodic load balance will
> -			 * take care of it.
> +			 * Otherwise, keep the task on this node to stay local
> +			 * to its wakeup source if the number of running tasks
> +			 * are below the allowed imbalance. If there is a real
> +			 * need of migration, periodic load balance will take
> +			 * care of it.
>  			 */
> -			if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight))
> +			if (local_sgs.sum_nr_running <= sd->imb_numa_nr)
>  				return NULL;

Considering the fact that we want to check whether or not the
imb_numa_nr threshold is going to be crossed if we let the new task
stay local, this should be

      	      	    	if (local_sgs.sum_nr_running + 1 <= sd->imb_numa_nr)
			   	return NULL;



Without this change, on a Zen3 configured with Nodes Per Socket
(NPS)=4, the lower NUMA domain with sd->imb_numa_nr = 2 has 4 groups
(each group corresponds to a NODE sched-domain). When we run stream
with 8 threads, we see 3 of them being initially placed in the local
group and the remaining 5 distributed across the other 3 groups. None
of these 3 tasks will ever get migrated to any of the other 3 groups,
because those others have at least one task.

Eg:

PID 157811 : timestamp 108921.267293 : first placed in NODE 1
PID 157812 : timestamp 108921.269877 : first placed in NODE 1
PID 157813 : timestamp 108921.269921 : first placed in NODE 1
PID 157814 : timestamp 108921.270007 : first placed in NODE 2
PID 157815 : timestamp 108921.270065 : first placed in NODE 3
PID 157816 : timestamp 108921.270118 : first placed in NODE 0
PID 157817 : timestamp 108921.270168 : first placed in NODE 2
PID 157818 : timestamp 108921.270216 : first placed in NODE 3

With the fix mentioned above, we see the 8 threads uniformly
distributed across the 4 groups.

PID 7500 : timestamp 436.156429 : first placed in     NODE   1
PID 7501 : timestamp 436.159058 : first placed in     NODE   1
PID 7502 : timestamp 436.159106 : first placed in     NODE   2
PID 7503 : timestamp 436.159173 : first placed in     NODE   3
PID 7504 : timestamp 436.159219 : first placed in     NODE   0
PID 7505 : timestamp 436.159263 : first placed in     NODE   2
PID 7506 : timestamp 436.159305 : first placed in     NODE   3
PID 7507 : timestamp 436.159348 : first placed in     NODE   0
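
To spell out the fork-time arithmetic (illustrative only, ignoring the
parent task and any other load on the node), with sd->imb_numa_nr = 2
and the check as it is in the patch (no "+ 1"):

  task 1: local_sgs.sum_nr_running = 0, 0 <= 2 -> stays local
  task 2: local_sgs.sum_nr_running = 1, 1 <= 2 -> stays local
  task 3: local_sgs.sum_nr_running = 2, 2 <= 2 -> stays local
  task 4: local_sgs.sum_nr_running = 3, 3 >  2 -> spread to the idlest group

whereas with the "+ 1" the third task already fails the test
(2 + 1 > 2) and gets spread, which matches the two traces above.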


--
Thanks and Regards
gautham.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-15 12:25           ` Mel Gorman
@ 2021-12-16 18:33             ` Gautham R. Shenoy
  2021-12-20 11:12               ` Mel Gorman
  0 siblings, 1 reply; 46+ messages in thread
From: Gautham R. Shenoy @ 2021-12-16 18:33 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML

Hello Mel,

On Wed, Dec 15, 2021 at 12:25:50PM +0000, Mel Gorman wrote:
> On Wed, Dec 15, 2021 at 05:22:30PM +0530, Gautham R. Shenoy wrote:

[..SNIP..]

> > On a 2 Socket Zen3:
> > 
> > NPS=1
> >    child=MC, llc_weight=16, sd=DIE. sd->span_weight=128 imb=max(2U, (16*16/128) / 4)=2
> >    top_p = NUMA, imb_span = 256.
> > 
> >    NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/256) = 2
> > 
> > NPS=2
> >    child=MC, llc_weight=16, sd=NODE. sd->span_weight=64 imb=max(2U, (16*16/64) / 4) = 2
> >    top_p = NUMA, imb_span = 128.
> > 
> >    NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2
> >    NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4
> > 
> > NPS=4:
> >    child=MC, llc_weight=16, sd=NODE. sd->span_weight=32 imb=max(2U, (16*16/32) / 4) = 2
> >    top_p = NUMA, imb_span = 128.
> > 
> >    NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2
> >    NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4
> > 
> > Again, we will be more aggressively load balancing across the two
> > sockets in NPS=1 mode compared to NPS=2/4.
> > 
> 
> Yes, but I felt it was reasonable behaviour because we have to strike
> some sort of balance between allowing a NUMA imbalance up to a point
> to prevent communicating tasks being pulled apart and v3 broke that
> completely. There will always be a tradeoff between tasks that want to
> remain local to each other and others that prefer to spread as wide as
> possible as quickly as possible.

I agree with this argument that we want to be conservative while
pulling tasks across NUMA domains. My point was that the threshold at
the NUMA domain that spans the 2 sockets is lower for NPS=1
(imb_numa_nr = 2) when compared to the threshold for the same NUMA
domain when NPS=2/4 (imb_numa_nr = 4).

Irrespective of what NPS mode we are operating in, the NUMA distance
between the two sockets is 32 on Zen3 systems. Hence shouldn't the
thresholds be the same for that level of NUMA? 

Would something like the following work ?

if (sd->flags & SD_NUMA) {

   /* We are using the child as a proxy for the group. */
   group_span = sd->child->span_weight;
   sd_distance = /* NUMA distance at this sd level */

   /* By default we set the threshold to 1/4th the sched-group span. */
   imb_numa_shift = 2;

   /*
    * We can be a little aggressive if the cost of migrating tasks
    * across groups of this NUMA level is not high, assuming that a
    * NUMA distance below REMOTE_DISTANCE implies a relatively cheap
    * migration.
    */
   if (sd_distance < REMOTE_DISTANCE)
      imb_numa_shift++;

   /*
    * Compute the number of LLCs in each group.
    * The more LLCs there are, the more aggressively we migrate
    * across the groups at this NUMA sd.
    */
    nr_llcs = group_span/llc_size;

    sd->imb_numa_nr = max(2U, (group_span / nr_llcs) >> imb_numa_shift);
}

With this, on Intel platforms, we will get sd->imb_numa_nr = (span of socket)/4

On Zen3,

NPS=1, Inter-socket NUMA : sd->imb_numa_nr = max(2U, (128/8) >> 2) = 4

NPS=2, Intra-socket NUMA: sd->imb_numa_nr = max(2U, (64/4) >> (2+1)) = 2
       Inter-socket NUMA: sd->imb_numa_nr = max(2U, (128/8) >> 2) = 4

NPS=4, Intra-socket NUMA: sd->imb_numa_nr = max(2U, (32/2) >> (2+1)) = 2
       Inter-socket NUMA: sd->imb_numa_nr = max(2U, (128/8) >> 2) = 4




> 
> > <SNIP>
> > If we retain the (2,4) thresholds from v4.1 but use them in
> > allow_numa_imbalance() as in v3 we get
> > 
> > NPS=4
> > Test:	 mel-v4.2
> >  Copy:	 225860.12 (498.11%)
> > Scale:	 227869.07 (572.58%)
> >   Add:	 278365.58 (624.93%)
> > Triad:	 264315.44 (596.62%)
> > 
> 
> The potential problem with this is that it probably will work for
> netperf when it's a single communicating pair but may not work as well
> when there are multiple communicating pairs or a number of communicating
> tasks that exceed imb_numa_nr.


Yes that's true. I think what you are doing in v4 is the right thing.

In the case of stream in NPS=4, it just manages to hit a corner case of
this heuristic, which results in suboptimal behaviour. Description
follows:

On NPS=4, if we run 8 stream tasks bound to a socket with v4.1, we get
the following initial placement based on data obtained via the
sched:sched_wakeup_new tracepoint. This behaviour is consistently
reproducible.

-------------------------------------------------------
| NUMA                                                |
|   ----------------------- ------------------------  |
|   | NODE0               | | NODE1                |  |
|   |   -------------     | |    -------------     |  |
|   |   |  0 tasks  | MC0 | |    |  1 tasks  | MC2 |  |
|   |   -------------     | |    -------------     |  |
|   |   -------------     | |    -------------     |  |
|   |   |  1 tasks  | MC1 | |    |  1 tasks  | MC3 |  |
|   |   -------------     | |    -------------     |  |
|   |                     | |                      |  |
|   ----------------------- ------------------------  |
|   ----------------------- ------------------------  |
|   | NODE2               | | NODE3                |  |
|   |   -------------     | |    -------------     |  |
|   |   |  1 tasks  | MC4 | |    |  1 tasks  | MC6 |  |
|   |   -------------     | |    -------------     |  |
|   |   -------------     | |    -------------     |  |
|   |   |  2 tasks  | MC5 | |    |  1 tasks  | MC7 |  |
|   |   -------------     | |    -------------     |  |
|   |                     | |                      |  |
|   ----------------------- ------------------------  |
|                                                     |
-------------------------------------------------------

From the trace data obtained for sched:sched_wakeup_new and
sched:sched_migrate_task, we see

PID 106089 : timestamp 35607.831040 : was running  in MC5
PID 106090 : timestamp 35607.831040 : first placed in MC4
PID 106091 : timestamp 35607.831081 : first placed in MC5
PID 106092 : timestamp 35607.831155 : first placed in MC7
PID 106093 : timestamp 35607.831209 : first placed in MC3
PID 106094 : timestamp 35607.831254 : first placed in MC1
PID 106095 : timestamp 35607.831300 : first placed in MC6
PID 106096 : timestamp 35607.831344 : first placed in MC2

Subsequently we do not see any migrations for stream tasks (via the
sched:sched_migrate_task tracepoint), even though they run for nearly
10 seconds. The reasons:

  - No load-balancing is possible at any of the NODE sched-domains
    since the groups are more or less balanced within each NODE.

  - At NUMA sched-domain, busiest group would be NODE2.  When any CPU
    in NODE0 performs load-balancing at NUMA level, it can pull tasks
    only if the imbalance between NODE0 and NODE2 is greater than
    imb_numa_nr = 2, which isn't the case here.
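
To put rough numbers on that second point (assuming 32 CPUs per node in
NPS=4, i.e. 128 CPUs per socket split across 4 nodes, and that without
SD_PREFER_SIBLING the balancer evens out idle CPUs by comparing half
the idle-CPU difference against imb_numa_nr, as discussed elsewhere in
this thread):

    NODE0: 1 task  -> 31 idle CPUs
    NODE2: 3 tasks -> 29 idle CPUs

    imbalance = (31 - 29) / 2 = 1, which is not greater than
    imb_numa_nr = 2, so the NUMA-level balance leaves the placement
    untouched.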

Hence, with v4.1, we get the following numbers which are better than
the current upstream, but are still not the best.
Copy:           78182.7
Scale:          76344.1
Add:            87638.7
Triad:          86388.9

However, if I run an "mpstat 1 10 > /tmp/mpstat.log&" just before
kickstarting stream-8, the performance significantly improves (again,
consistently reproducible).

Copy:          122804.6
Scale:         115192.9
Add:           137191.6
Triad:         133338.5

In this case, from the trace data for stream, we see:
PID 105174 : timestamp 35547.526816 : was running  in  MC4
PID 105174 : timestamp 35547.577635 : moved to         MC5

PID 105175 : timestamp 35547.526816 : first placed in  MC4
PID 105176 : timestamp 35547.526846 : first placed in  MC3
PID 105177 : timestamp 35547.526893 : first placed in  MC7
PID 105178 : timestamp 35547.526928 : first placed in  MC1
PID 105179 : timestamp 35547.526961 : first placed in  MC2
PID 105180 : timestamp 35547.527001 : first placed in  MC6
PID 105181 : timestamp 35547.527032 : first placed in  MC0

In this case, at the time of the initial placement
(find_idlest_group() ?), we are able to spread out farther away. The
subsequent load-balance at the NODE2 domain is able to balance the
tasks between MC4 and MC5.

> 
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > NPS=1
> > ======
> > Clients: tip-core   mel-v3    mel-v4    mel-v4.1
> >     1	 633.19     619.16    632.94    619.27
> >     	 (0.00%)    (-2.21%)  (-0.03%)	(-2.19%)
> > 	 
> >     2	 1152.48    1189.88   1184.82   1189.19
> >     	 (0.00%)    (3.24%)   (2.80%)	(3.18%)
> > 	 
> >     4	 1946.46    2177.40   1979.56	2196.09
> >     	 (0.00%)    (11.86%)  (1.70%)	(12.82%)
> > 	 
> >     8	 3553.29    3564.50   3678.07	3668.77
> >     	 (0.00%)    (0.31%)   (3.51%)	(3.24%)
> > 	 
> >    16	 6217.03    6484.58   6249.29	6534.73
> >    	 (0.00%)    (4.30%)   (0.51%)	(5.11%)
> > 	 
> >    32	 11702.59   12185.77  12005.99	11917.57
> >    	 (0.00%)    (4.12%)   (2.59%)	(1.83%)
> > 	 
> >    64	 18394.56   19535.11  19080.19	19500.55
> >    	 (0.00%)    (6.20%)   (3.72%)	(6.01%)
> > 	 
> >   128	 27231.02   31759.92  27200.52	30358.99
> >   	 (0.00%)    (16.63%)  (-0.11%)	(11.48%)
> > 	 
> >   256	 33166.10   24474.30  31639.98	24788.12
> >   	 (0.00%)    (-26.20%) (-4.60%)	(-25.26%)
> > 	 
> >   512	 41605.44   54823.57  46684.48	54559.02
> >   	 (0.00%)    (31.77%)  (12.20%)	(31.13%)
> > 	 
> >  1024	 53650.54   56329.39  44422.99	56320.66
> >  	 (0.00%)    (4.99%)   (-17.19%)	(4.97%) 
> > 
> > 
> > We see that the v4.1 performs better than v4 in most cases except when
> > the number of clients=256 where the spread strategy seems to be
> > hurting as we see degradation in both v3 and v4.1. This is true even
> > for NPS=2 and NPS=4 cases (see below).
> > 
> 
> The 256 client case is a bit of a crapshoot. At that point, the NUMA
> imbalancing is disabled and the machine is overloaded.

Yup. 

[..snip..]

> Most likely because v4.2 is disabling the allowed NUMA imbalance too
> soon. This is the trade-off of favouring communicating tasks over
> embarrassingly parallel problems.

v4.1 does allow the NUMA imbalance for a longer duration. But since
the thresholds are small enough, I guess it should be ok for most
workloads.

--
Thanks and Regards
gautham.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-15 11:52         ` Gautham R. Shenoy
@ 2021-12-15 12:25           ` Mel Gorman
  2021-12-16 18:33             ` Gautham R. Shenoy
  0 siblings, 1 reply; 46+ messages in thread
From: Mel Gorman @ 2021-12-15 12:25 UTC (permalink / raw)
  To: Gautham R. Shenoy
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML

On Wed, Dec 15, 2021 at 05:22:30PM +0530, Gautham R. Shenoy wrote:
> Hello Mel,
> 
> 
> On Mon, Dec 13, 2021 at 08:17:37PM +0530, Gautham R. Shenoy wrote:
> 
> > 
> > Thanks for the patch. I will queue this one for tonight.
> >
> 
> Getting the numbers took a bit longer than I expected.
> 

No worries.

> > > <SNIP>
> > > +				/*
> > > +				 * Set span based on top domain that places
> > > +				 * tasks in sibling domains.
> > > +				 */
> > > +				top = sd;
> > > +				top_p = top->parent;
> > > +				while (top_p && (top_p->flags & SD_PREFER_SIBLING)) {
> > > +					top = top->parent;
> > > +					top_p = top->parent;
> > > +				}
> > > +				imb_span = top_p ? top_p->span_weight : sd->span_weight;
> > >  			} else {
> > > -				sd->imb_numa_nr = imb * (sd->span_weight / imb_span);
> > > +				int factor = max(1U, (sd->span_weight / imb_span));
> > > +
> 
> 
> So for the first NUMA domain, the sd->imb_numa_nr will be imb, which
> turns out to be 2 for Zen2 and Zen3 processors across all Nodes Per Socket Settings.
> 
> On a 2 Socket Zen3:
> 
> NPS=1
>    child=MC, llc_weight=16, sd=DIE. sd->span_weight=128 imb=max(2U, (16*16/128) / 4)=2
>    top_p = NUMA, imb_span = 256.
> 
>    NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/256) = 2
> 
> NPS=2
>    child=MC, llc_weight=16, sd=NODE. sd->span_weight=64 imb=max(2U, (16*16/64) / 4) = 2
>    top_p = NUMA, imb_span = 128.
> 
>    NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2
>    NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4
> 
> NPS=4:
>    child=MC, llc_weight=16, sd=NODE. sd->span_weight=32 imb=max(2U, (16*16/32) / 4) = 2
>    top_p = NUMA, imb_span = 128.
> 
>    NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2
>    NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4
> 
> Again, we will be more aggressively load balancing across the two
> sockets in NPS=1 mode compared to NPS=2/4.
> 

Yes, but I felt it was reasonable behaviour because we have to strike
some sort of balance, allowing a NUMA imbalance up to a point to
prevent communicating tasks from being pulled apart, and v3 broke that
completely. There will always be a tradeoff between tasks that want to
remain local to each other and others that prefer to spread as wide as
possible as quickly as possible.

> <SNIP>
> If we retain the (2,4) thresholds from v4.1 but use them in
> allow_numa_imbalance() as in v3 we get
> 
> NPS=4
> Test:	 mel-v4.2
>  Copy:	 225860.12 (498.11%)
> Scale:	 227869.07 (572.58%)
>   Add:	 278365.58 (624.93%)
> Triad:	 264315.44 (596.62%)
> 

The potential problem with this is that it probably will work for
netperf when it's a single communicating pair but may not work as well
when there are multiple communicating pairs or a number of communicating
tasks that exceed imb_numa_nr.

> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> NPS=1
> ======
> Clients: tip-core   mel-v3    mel-v4    mel-v4.1
>     1	 633.19     619.16    632.94    619.27
>     	 (0.00%)    (-2.21%)  (-0.03%)	(-2.19%)
> 	 
>     2	 1152.48    1189.88   1184.82   1189.19
>     	 (0.00%)    (3.24%)   (2.80%)	(3.18%)
> 	 
>     4	 1946.46    2177.40   1979.56	2196.09
>     	 (0.00%)    (11.86%)  (1.70%)	(12.82%)
> 	 
>     8	 3553.29    3564.50   3678.07	3668.77
>     	 (0.00%)    (0.31%)   (3.51%)	(3.24%)
> 	 
>    16	 6217.03    6484.58   6249.29	6534.73
>    	 (0.00%)    (4.30%)   (0.51%)	(5.11%)
> 	 
>    32	 11702.59   12185.77  12005.99	11917.57
>    	 (0.00%)    (4.12%)   (2.59%)	(1.83%)
> 	 
>    64	 18394.56   19535.11  19080.19	19500.55
>    	 (0.00%)    (6.20%)   (3.72%)	(6.01%)
> 	 
>   128	 27231.02   31759.92  27200.52	30358.99
>   	 (0.00%)    (16.63%)  (-0.11%)	(11.48%)
> 	 
>   256	 33166.10   24474.30  31639.98	24788.12
>   	 (0.00%)    (-26.20%) (-4.60%)	(-25.26%)
> 	 
>   512	 41605.44   54823.57  46684.48	54559.02
>   	 (0.00%)    (31.77%)  (12.20%)	(31.13%)
> 	 
>  1024	 53650.54   56329.39  44422.99	56320.66
>  	 (0.00%)    (4.99%)   (-17.19%)	(4.97%) 
> 
> 
> We see that the v4.1 performs better than v4 in most cases except when
> the number of clients=256 where the spread strategy seems to be
> hurting as we see degradation in both v3 and v4.1. This is true even
> for NPS=2 and NPS=4 cases (see below).
> 

The 256 client case is a bit of a crapshoot. At that point, the NUMA
imbalancing is disabled and the machine is overloaded.

> NPS=2
> =====
> Clients: tip-core   mel-v3    mel-v4    mel-v4.1
>     1	 629.76	    620.91    629.11	631.95
>     	 (0.00%)    (-1.40%)  (-0.10%)	(0.34%)
> 	 
>     2	 1176.96    1203.12   1169.09	1186.74
>     	 (0.00%)    (2.22%)   (-0.66%)	(0.83%)
> 	 
>     4	 1990.97    2228.04   1888.19	1995.21
>     	 (0.00%)    (11.90%)  (-5.16%)	(0.21%)
> 	 
>     8	 3534.57    3617.16   3660.30	3548.09
>     	 (0.00%)    (2.33%)   (3.55%)	(0.38%)
> 	 
>    16	 6294.71    6547.80   6504.13	6470.34
>    	 (0.00%)    (4.02%)   (3.32%)	(2.79%)
> 	 
>    32	 12035.73   12143.03  11396.26	11860.91
>    	 (0.00%)    (0.89%)   (-5.31%)	(-1.45%)
> 	 
>    64	 18583.39   19439.12  17126.47	18799.54
>    	 (0.00%)    (4.60%)   (-7.83%)	(1.16%)
> 	 
>   128	 27811.89   30562.84  28090.29	27468.94
>   	 (0.00%)    (9.89%)   (1.00%)	(-1.23%)
> 	 
>   256	 28148.95   26488.57  29117.13	23628.29
>   	 (0.00%)    (-5.89%)  (3.43%)	(-16.05%)
> 	 
>   512	 43934.15   52796.38  42603.49	41725.75
>   	 (0.00%)    (20.17%)  (-3.02%)	(-5.02%)
> 	 
>  1024	 54391.65   53891.83  48419.09	43913.40
>  	 (0.00%)    (-0.91%)  (-10.98%)	(-19.26%)
> 
> In this case, v4.1 performs as well as v4 up to 64 clients. But after
> that we see degradation. The degradation is significant in the 1024
> clients case.
> 

Kinda the same, it's more likely to be run-to-run variance because the
machine is overloaded.

> NPS=4
> =====
> Clients: tip-core   mel-v3    mel-v4    mel-v4.1    mel-v4.2
>     1	 622.65	    617.83    667.34	644.76	    617.58
>     	 (0.00%)    (-0.77%)  (7.17%)	(3.55%)	    (-0.81%)
> 	 
>     2	 1160.62    1182.30   1294.08	1193.88	    1182.55
>     	 (0.00%)    (1.86%)   (11.49%)	(2.86%)	    (1.88%)
> 	 
>     4	 1961.14    2171.91   2477.71	1929.56	    2116.01
>     	 (0.00%)    (10.74%)  (26.34%)	(-1.61%)    (7.89%)
> 	 
>     8	 3662.94    3447.98   4067.40	3627.43	    3580.32
>     	 (0.00%)    (-5.86%)  (11.04%)	(-0.96%)    (-2.25%)
> 	 
>    16	 6490.92    5871.93   6924.32	6660.13	    6413.34
>    	 (0.00%)    (-9.53%)  (6.67%)	(2.60%)	    (-1.19%)
> 	 
>    32	 11831.81   12004.30  12709.06	12187.78    11767.46
>    	 (0.00%)    (1.45%)   (7.41%)	(3.00%)	    (-0.54%)
> 	 
>    64	 17717.36   18406.79  18785.41	18820.33    18197.86
>    	 (0.00%)    (3.89%)   (6.02%)	(6.22%)	    (2.71%)
> 	 
>   128	 27723.35   27777.34  27939.63	27399.64    24310.93
>   	 (0.00%)    (0.19%)   (0.78%)	(-1.16%)    (-12.30%)
> 	 
>   256	 30919.69   23937.03  35412.26	26780.37    24642.24
>   	 (0.00%)    (-22.58%) (14.52%)	(-13.38%)   (-20.30%)
> 	 
>   512	 43366.03   49570.65  43830.84	43654.42    41031.90
>   	 (0.00%)    (14.30%)  (1.07%)	(0.66%)	    (-5.38%)
> 	 
>  1024	 46960.83   53576.16  50557.19	43743.07    40884.98
>  	 (0.00%)    (14.08%)  (7.65%)	(-6.85%)    (-12.93%)
> 
> 
> In the NPS=4 case, clearly v4 provides the best results.
> 
> v4.1 does better than v4.2 since it is able to hold off spreading for
> a longer period.
> 

Most likely because v4.2 is disabling the allowed NUMA imbalance too
soon. This is the trade-off of favouring communicating tasks over
embarrassingly parallel problems.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-13 14:47       ` Gautham R. Shenoy
@ 2021-12-15 11:52         ` Gautham R. Shenoy
  2021-12-15 12:25           ` Mel Gorman
  0 siblings, 1 reply; 46+ messages in thread
From: Gautham R. Shenoy @ 2021-12-15 11:52 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML

Hello Mel,


On Mon, Dec 13, 2021 at 08:17:37PM +0530, Gautham R. Shenoy wrote:

> 
> Thanks for the patch. I will queue this one for tonight.
>

Getting the numbers took a bit longer than I expected.

> 
> > 
> > 
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index bacec575ade2..1fa3e977521d 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -2255,26 +2255,38 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> >  
> >  			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> >  			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
> > -				struct sched_domain *top = sd;
> > +				struct sched_domain *top, *top_p;
> >  				unsigned int llc_sq;
> >  
> >  				/*
> > -				 * nr_llcs = (top->span_weight / llc_weight);
> > -				 * imb = (child_weight / nr_llcs) >> 2
> > +				 * nr_llcs = (sd->span_weight / llc_weight);
> > +				 * imb = (llc_weight / nr_llcs) >> 2
> >  				 *
> >  				 * is equivalent to
> >  				 *
> > -				 * imb = (llc_weight^2 / top->span_weight) >> 2
> > +				 * imb = (llc_weight^2 / sd->span_weight) >> 2
> >  				 *
> >  				 */
> >  				llc_sq = child->span_weight * child->span_weight;
> >  
> > -				imb = max(2U, ((llc_sq / top->span_weight) >> 2));
> > -				imb_span = sd->span_weight;
> > -
> > +				imb = max(2U, ((llc_sq / sd->span_weight) >> 2));
> >  				sd->imb_numa_nr = imb;
> > +
> > +				/*
> > +				 * Set span based on top domain that places
> > +				 * tasks in sibling domains.
> > +				 */
> > +				top = sd;
> > +				top_p = top->parent;
> > +				while (top_p && (top_p->flags & SD_PREFER_SIBLING)) {
> > +					top = top->parent;
> > +					top_p = top->parent;
> > +				}
> > +				imb_span = top_p ? top_p->span_weight : sd->span_weight;
> >  			} else {
> > -				sd->imb_numa_nr = imb * (sd->span_weight / imb_span);
> > +				int factor = max(1U, (sd->span_weight / imb_span));
> > +


So for the first NUMA domain, the sd->imb_numa_nr will be imb, which
turns out to be 2 for Zen2 and Zen3 processors across all Nodes Per Socket Settings.

On a 2 Socket Zen3:

NPS=1
   child=MC, llc_weight=16, sd=DIE. sd->span_weight=128 imb=max(2U, (16*16/128) / 4)=2
   top_p = NUMA, imb_span = 256.

   NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/256) = 2

NPS=2
   child=MC, llc_weight=16, sd=NODE. sd->span_weight=64 imb=max(2U, (16*16/64) / 4) = 2
   top_p = NUMA, imb_span = 128.

   NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2
   NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4

NPS=4:
   child=MC, llc_weight=16, sd=NODE. sd->span_weight=32 imb=max(2U, (16*16/32) / 4) = 2
   top_p = NUMA, imb_span = 128.

   NUMA: sd->span_weight =128; sd->imb_numa_nr = 2 * (128/128) = 2
   NUMA: sd->span_weight =256; sd->imb_numa_nr = 2 * (256/128) = 4

Again, we will be more aggressively load balancing across the two
sockets in NPS=1 mode compared to NPS=2/4.
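
As a sanity check on the arithmetic, the values above can be reproduced
with a small throwaway userspace program (not kernel code; the per-mode
parent span, imb_span and NUMA span values are simply the ones worked
out above for a 2-socket Zen3 with llc_weight=16):

#include <stdio.h>

#define MAX2(a, b)	((a) > (b) ? (a) : (b))

int main(void)
{
	const unsigned int llc_weight = 16;
	const struct {
		const char *nps;
		unsigned int parent_span;	/* DIE or NODE above the LLC */
		unsigned int imb_span;		/* from the SD_PREFER_SIBLING walk */
		unsigned int numa_spans[2];	/* NUMA domains above that */
	} cfg[] = {
		{ "NPS=1", 128, 256, { 256, 0 } },
		{ "NPS=2",  64, 128, { 128, 256 } },
		{ "NPS=4",  32, 128, { 128, 256 } },
	};

	for (int i = 0; i < 3; i++) {
		unsigned int imb = MAX2(2U,
			(llc_weight * llc_weight / cfg[i].parent_span) >> 2);

		for (int j = 0; j < 2 && cfg[i].numa_spans[j]; j++) {
			unsigned int factor =
				MAX2(1U, cfg[i].numa_spans[j] / cfg[i].imb_span);

			printf("%s: NUMA span=%u imb_numa_nr=%u\n",
			       cfg[i].nps, cfg[i].numa_spans[j], imb * factor);
		}
	}
	return 0;
}

which prints imb_numa_nr = 2 for NPS=1, and 2 (span 128) / 4 (span 256)
for NPS=2 and NPS=4, matching the derivation above.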

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Stream with 16 threads.
built with -DSTREAM_ARRAY_SIZE=128000000, -DNTIMES=10
Zen3, 64C128T per socket, 2 sockets,
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

NPS=1
Test:	 tip-core	mel-v3		mel-v4		mel-v4.1
 Copy:	 113666.77      214885.89 	212162.63 	226065.79
 	 (0.00%)	(89.04%)	(86.65%)	(98.88%)
	 
Scale:	 110962.00	215475.05	220002.56	227610.64
	 (0.00%)	(94.18%)	(98.26%)	(105.12%)
	 
  Add:	 124440.44	250338.11	258569.94	271368.51
  	 (0.00%)	(101.17%)	(107.78%)	(118.07%)
	 
Triad:	 122836.42 	244993.16	251075.19	265465.57
	 (0.00%)	(99.44%)	(104.39%)	(116.11%)


NPS=2
Test:	 tip-core	mel-v3		mel-v4		mel-v4.1
 Copy:	 58193.29	203242.78	191171.22	212785.96
 	 (0.00%)	(249.25%) 	(228.51%)	(265.65%)
	 
Scale:	 54980.48	209056.28 	175497.10 	223452.99
	 (0.00%)	(280.23%)	(219.19%)	(306.42%)
	 
  Add:	 63244.24	249610.19	203569.35	270487.22
  	 (0.00%)	(294.67%)	(221.87%)	(327.68%)
	 
Triad:	 62166.77 	242688.72	198971.26	262757.16
	 (0.00%)	(290.38%)	(220.06%)	(322.66%)


NPS=4
Test:	 tip-core	mel-v3		mel-v4		mel-v4.1
 Copy:	 37762.14	253412.82	48748.60	164765.83
 	 (0.00%)	(571.07%)	(29.09%)	(336.32%)
	 
Scale:	 33879.66	237736.93	48317.66	159641.67
	 (0.00%)	(601.70%)	(42.61%)	(371.20%)
	 
  Add:	 38398.90	292186.56	54259.56	184583.70
  	 (0.00%)	(660.92%)	(41.30%)	(380.70%)
	 
Triad:	 37942.38	285124.76	54503.74	181250.80
	 (0.00%)	(651.46%)	(43.64%)	(377.70%)
	 


So while in NPS=1 and NPS=2 we are able to recover the performance
that was obtained with v3, in NPS=4 we are not able to recover as much.

However, this is not only because in v3 the imb_numa_nr thresholds for
NPS=4 were (1,1) for the two NUMA domains, while in v4.1 the
imb_numa_nr was (2,4); it is also because in v3 we used the
imb_numa_nr thresholds in allow_numa_imbalance(), while in v4 and v4.1
we are using them in adjust_numa_imbalance().

The distinction is that in v3, we will trigger load balance as soon as
the number of tasks in the busiest group in the NUMA domain is greater
than or equal to imb_numa_nr at that domain.

In v4 and v4.1, we will trigger the load-balance if

 - the number of tasks in the busiest group is greater than 1/4th the
   CPUs in the NUMA domain.

OR

- the difference between the idle CPUs in the busiest and this group
  is greater than imb_numa_nr.
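
To make the v4/v4.1 gate concrete, this is roughly the shape of the
check at a NUMA domain (not the literal patch; allow_numa_imbalance()
is folded in using the 4 * dst_running >= dst_weight form discussed
earlier in the thread):

static inline long adjust_numa_imbalance(int imbalance, int dst_running,
					 int dst_weight, int imb_numa_nr)
{
	/* Busiest group already runs ~1/4th of the domain's CPUs: balance as usual. */
	if (4 * dst_running >= dst_weight)
		return imbalance;

	/* Otherwise tolerate a small imbalance, up to imb_numa_nr. */
	if (imbalance <= imb_numa_nr)
		return 0;

	return imbalance;
}

(v3, by contrast, applied its thresholds in allow_numa_imbalance()
against the busiest group's task count, as described above.)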


If we retain the (2,4) thresholds from v4.1 but use them in
allow_numa_imbalance() as in v3 we get

NPS=4
Test:	 mel-v4.2
 Copy:	 225860.12 (498.11%)
Scale:	 227869.07 (572.58%)
  Add:	 278365.58 (624.93%)
Triad:	 264315.44 (596.62%)

It shouldn't have made so much of a difference considering the fact
that with 16 stream tasks, we should have hit "imbalance >
imb_numa_nr" in adjust_numa_imbalance() eventually. But these are the
numbers!


The trend is similar with -DNTIMES=100. Even in this case with v4.2 we
can recover the stream performance in NPS4 case.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Stream with 16 threads.
built with -DSTREAM_ARRAY_SIZE=128000000, -DNTIMES=100
Zen3, 64C128T per socket, 2 sockets,
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NPS=1
Test:	tip-core   mel-v3     mel-v4	mel-v4.1
 Copy:	137281.89  235886.13  240588.80  249005.99
 	(0.00 %)   (71.82%)   (75.25%)	 (81.38%)

Scale:	126559.48  213667.75  226411.39  228523.78
	(0.00 %)   (68.82%)   (78.89%)   (80.56%)

  Add:	148144.23  255780.73  272027.93  272872.49
	(0.00 %)   (72.65%)   (83.62%)	 (84.19%)

Triad:	146829.78  240400.29  259854.84  265640.70
	(0.00 %)   (63.72%)   (76.97%)	 (80.91%)


NPS=2
Test:	tip-core   mel-v3     mel-v4	mel-v4.1
 Copy:	 105873.29 243513.40  198299.43 258266.99
 	 (0.00%)   (130.00%)  (87.29%)	(143.93%)
	 
Scale:	 100510.54 217361.59  177890.00 231509.69
	 (0.00%)   (116.25%)  (76.98%)	(130.33%)
	 
  Add:	 115932.61 268488.03  211436.57	288396.07
  	 (0.00%)   (131.58%)  (82.37%)	(148.76%)
	 
Triad:	 113349.09 253865.68  197810.83	272219.89
	 (0.00%)   (123.96%)  (74.51%)	(140.16%)


NPS=4
Test:	tip-core   mel-v3     mel-v4	mel-v4.1   mel-v4.2
 Copy:	106798.80  251503.78  48898.24	171769.82  266788.03
 	(0.00%)	   (135.49%)  (-54.21%)	(60.83%)   (149.80%)
	
Scale:	101377.71  221135.30  48425.90	160748.21  232480.83
	(0.00%)	   (118.13%)  (-52.23%)	(58.56%)   (129.32%)
	
  Add:	116131.56  275008.74  54425.29  186387.91  290787.31
  	(0.00%)	   (136.80%)  (-53.13%)	(60.49%)   (150.39%)
	
Triad:	113443.70  256027.20  54622.68  180936.47  277456.83
	(0.00%)	   (125.68%)  (-51.85%)	(59.49%)   (144.57%)



~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
tbench
Zen3, 64C128T per socket, 2 sockets,
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
NPS=1
======
Clients: tip-core   mel-v3    mel-v4    mel-v4.1
    1	 633.19     619.16    632.94    619.27
    	 (0.00%)    (-2.21%)  (-0.03%)	(-2.19%)
	 
    2	 1152.48    1189.88   1184.82   1189.19
    	 (0.00%)    (3.24%)   (2.80%)	(3.18%)
	 
    4	 1946.46    2177.40   1979.56	2196.09
    	 (0.00%)    (11.86%)  (1.70%)	(12.82%)
	 
    8	 3553.29    3564.50   3678.07	3668.77
    	 (0.00%)    (0.31%)   (3.51%)	(3.24%)
	 
   16	 6217.03    6484.58   6249.29	6534.73
   	 (0.00%)    (4.30%)   (0.51%)	(5.11%)
	 
   32	 11702.59   12185.77  12005.99	11917.57
   	 (0.00%)    (4.12%)   (2.59%)	(1.83%)
	 
   64	 18394.56   19535.11  19080.19	19500.55
   	 (0.00%)    (6.20%)   (3.72%)	(6.01%)
	 
  128	 27231.02   31759.92  27200.52	30358.99
  	 (0.00%)    (16.63%)  (-0.11%)	(11.48%)
	 
  256	 33166.10   24474.30  31639.98	24788.12
  	 (0.00%)    (-26.20%) (-4.60%)	(-25.26%)
	 
  512	 41605.44   54823.57  46684.48	54559.02
  	 (0.00%)    (31.77%)  (12.20%)	(31.13%)
	 
 1024	 53650.54   56329.39  44422.99	56320.66
 	 (0.00%)    (4.99%)   (-17.19%)	(4.97%) 


We see that the v4.1 performs better than v4 in most cases except when
the number of clients=256 where the spread strategy seems to be
hurting as we see degradation in both v3 and v4.1. This is true even
for NPS=2 and NPS=4 cases (see below).

NPS=2
=====
Clients: tip-core   mel-v3    mel-v4    mel-v4.1
    1	 629.76	    620.91    629.11	631.95
    	 (0.00%)    (-1.40%)  (-0.10%)	(0.34%)
	 
    2	 1176.96    1203.12   1169.09	1186.74
    	 (0.00%)    (2.22%)   (-0.66%)	(0.83%)
	 
    4	 1990.97    2228.04   1888.19	1995.21
    	 (0.00%)    (11.90%)  (-5.16%)	(0.21%)
	 
    8	 3534.57    3617.16   3660.30	3548.09
    	 (0.00%)    (2.33%)   (3.55%)	(0.38%)
	 
   16	 6294.71    6547.80   6504.13	6470.34
   	 (0.00%)    (4.02%)   (3.32%)	(2.79%)
	 
   32	 12035.73   12143.03  11396.26	11860.91
   	 (0.00%)    (0.89%)   (-5.31%)	(-1.45%)
	 
   64	 18583.39   19439.12  17126.47	18799.54
   	 (0.00%)    (4.60%)   (-7.83%)	(1.16%)
	 
  128	 27811.89   30562.84  28090.29	27468.94
  	 (0.00%)    (9.89%)   (1.00%)	(-1.23%)
	 
  256	 28148.95   26488.57  29117.13	23628.29
  	 (0.00%)    (-5.89%)  (3.43%)	(-16.05%)
	 
  512	 43934.15   52796.38  42603.49	41725.75
  	 (0.00%)    (20.17%)  (-3.02%)	(-5.02%)
	 
 1024	 54391.65   53891.83  48419.09	43913.40
 	 (0.00%)    (-0.91%)  (-10.98%)	(-19.26%)

In this case, v4.1 performs as well as v4 up to 64 clients. But after
that we see degradation. The degradation is significant in the 1024
clients case.

NPS=4
=====
Clients: tip-core   mel-v3    mel-v4    mel-v4.1    mel-v4.2
    1	 622.65	    617.83    667.34	644.76	    617.58
    	 (0.00%)    (-0.77%)  (7.17%)	(3.55%)	    (-0.81%)
	 
    2	 1160.62    1182.30   1294.08	1193.88	    1182.55
    	 (0.00%)    (1.86%)   (11.49%)	(2.86%)	    (1.88%)
	 
    4	 1961.14    2171.91   2477.71	1929.56	    2116.01
    	 (0.00%)    (10.74%)  (26.34%)	(-1.61%)    (7.89%)
	 
    8	 3662.94    3447.98   4067.40	3627.43	    3580.32
    	 (0.00%)    (-5.86%)  (11.04%)	(-0.96%)    (-2.25%)
	 
   16	 6490.92    5871.93   6924.32	6660.13	    6413.34
   	 (0.00%)    (-9.53%)  (6.67%)	(2.60%)	    (-1.19%)
	 
   32	 11831.81   12004.30  12709.06	12187.78    11767.46
   	 (0.00%)    (1.45%)   (7.41%)	(3.00%)	    (-0.54%)
	 
   64	 17717.36   18406.79  18785.41	18820.33    18197.86
   	 (0.00%)    (3.89%)   (6.02%)	(6.22%)	    (2.71%)
	 
  128	 27723.35   27777.34  27939.63	27399.64    24310.93
  	 (0.00%)    (0.19%)   (0.78%)	(-1.16%)    (-12.30%)
	 
  256	 30919.69   23937.03  35412.26	26780.37    24642.24
  	 (0.00%)    (-22.58%) (14.52%)	(-13.38%)   (-20.30%)
	 
  512	 43366.03   49570.65  43830.84	43654.42    41031.90
  	 (0.00%)    (14.30%)  (1.07%)	(0.66%)	    (-5.38%)
	 
 1024	 46960.83   53576.16  50557.19	43743.07    40884.98
 	 (0.00%)    (14.08%)  (7.65%)	(-6.85%)    (-12.93%)


In the NPS=4 case, clearly v4 provides the best results.

v4.1 does better than v4.2 since it is able to hold off spreading for
a longer period.

--
Thanks and Regards
gautham.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-13 13:01     ` Mel Gorman
@ 2021-12-13 14:47       ` Gautham R. Shenoy
  2021-12-15 11:52         ` Gautham R. Shenoy
  0 siblings, 1 reply; 46+ messages in thread
From: Gautham R. Shenoy @ 2021-12-13 14:47 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML

Hello Mel,

On Mon, Dec 13, 2021 at 01:01:31PM +0000, Mel Gorman wrote:
> On Mon, Dec 13, 2021 at 01:58:03PM +0530, Gautham R. Shenoy wrote:
> > > On a Zen3 machine running STREAM parallelised with OMP to have one instance
> > > per LLC and without binding, the results are
> > > 
> > >                             5.16.0-rc1             5.16.0-rc1
> > >                                vanilla       sched-numaimb-v4
> > > MB/sec copy-16    166712.18 (   0.00%)   651540.22 ( 290.82%)
> > > MB/sec scale-16   140109.66 (   0.00%)   382254.74 ( 172.83%)
> > > MB/sec add-16     160791.18 (   0.00%)   623073.98 ( 287.51%)
> > > MB/sec triad-16   160043.84 (   0.00%)   633964.52 ( 296.12%)
> > 
> > 
> > Could you please share the size of the stream array ? These numbers
> > are higher than what I am observing.
> > 
> 
> 512MB

Thanks, I will try with this one.

> 
> > > @@ -9280,19 +9286,14 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> > >  	}
> > >  }
> > >  
> > > -#define NUMA_IMBALANCE_MIN 2
> > > -
> > >  static inline long adjust_numa_imbalance(int imbalance,
> > > -				int dst_running, int dst_weight)
> > > +				int dst_running, int dst_weight,
> > > +				int imb_numa_nr)
> > >  {
> > >  	if (!allow_numa_imbalance(dst_running, dst_weight))
> > >  		return imbalance;
> > >
> > 
> > if (4 * dst_running >= dst_weight) we return imbalance here. The
> > dst_weight here corresponds to the span of the domain, while
> > dst_running is the nr_running in busiest.
> > 
> 
> Yes, once dst_running is high enough, no imbalance is allowed. In
> previous versions I changed this but that was a mistake and in this
> version, the threshold where imbalance is not allowed remains the same.
> 
> > On Zen3, at the topmost NUMA domain, the dst_weight = 256 across
> > all the configurations of Nodes Per Socket (NPS) = 1/2/4. There are
> > two groups, where each group is a socket. So, unless there are at
> > least 64 tasks running in one of the sockets, we would not return
> > imbalance here and go to the next step.
> > 
> 
> Yes
> 
> > 
> > > -	/*
> > > -	 * Allow a small imbalance based on a simple pair of communicating
> > > -	 * tasks that remain local when the destination is lightly loaded.
> > > -	 */
> > > -	if (imbalance <= NUMA_IMBALANCE_MIN)
> > > +	if (imbalance <= imb_numa_nr)
> > 
> > In NPS=1 mode, imb_numa_nr would be 4. Since NUMA domains
> > don't have PREFER_SIBLING, we would be balancing the number of idle
> > CPUs. We will end up doing the imbalance, as long as the difference
> > between the idle CPUs is at least 8.
> > 
> > In NPS=2, imb_numa_nr = 8 for this topmost NUMA domain. So here, we
> > will not rebalance unless the difference between the idle CPUs is 16.
> > 
> > In NPS=4, imb_numa_nr = 16 for this topmost NUMA domain. So, the
> > threshold is now bumped up to 32.
> > 
> > >  		return 0;
> > 
> > 
> > 
> > >  
> > >  	return imbalance;
> > > @@ -9397,7 +9398,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> > >  		/* Consider allowing a small imbalance between NUMA groups */
> > >  		if (env->sd->flags & SD_NUMA) {
> > >  			env->imbalance = adjust_numa_imbalance(env->imbalance,
> > > -				busiest->sum_nr_running, env->sd->span_weight);
> > > +				busiest->sum_nr_running, env->sd->span_weight,
> > > +				env->sd->imb_numa_nr);
> > >  		}
> > >  
> > >  		return;
> > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > > index d201a7052a29..bacec575ade2 100644
> > > --- a/kernel/sched/topology.c
> > > +++ b/kernel/sched/topology.c
> > > @@ -2242,6 +2242,43 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> > >  		}
> > >  	}
> > >  
> > > +	/*
> > > +	 * Calculate an allowed NUMA imbalance such that LLCs do not get
> > > +	 * imbalanced.
> > > +	 */
> > > +	for_each_cpu(i, cpu_map) {
> > > +		unsigned int imb = 0;
> > > +		unsigned int imb_span = 1;
> > > +
> > > +		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> > > +			struct sched_domain *child = sd->child;
> > > +
> > > +			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> > > +			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
> > > +				struct sched_domain *top = sd;
> > 
> > 
> > We don't seem to be using top anywhere where sd may not be used since
> > we already have variables imb and imb_span to record the
> > top->imb_numa_nr and top->span_weight.
> > 
> 
> Top could have been removed but we might still need it.
> 
> > 
> > > +				unsigned int llc_sq;
> > > +
> > > +				/*
> > > +				 * nr_llcs = (top->span_weight / llc_weight);
> > > +				 * imb = (child_weight / nr_llcs) >> 2
> > 
> > child here is the llc. So can we use imb = (llc_weight / nr_llcs) >> 2.
> > 
> 
> That would be clearer.
> 
> > > +				 *
> > > +				 * is equivalent to
> > > +				 *
> > > +				 * imb = (llc_weight^2 / top->span_weight) >> 2
> > > +				 *
> > > +				 */
> > > +				llc_sq = child->span_weight * child->span_weight;
> > > +
> > > +				imb = max(2U, ((llc_sq / top->span_weight) >> 2));
> > > +				imb_span = sd->span_weight;
> > 
> > On Zen3, child_weight (or llc_weight) = 16. llc_sq = 256.
> >    with NPS=1
> >       top = DIE.
> >       top->span_weight = 128. imb = max(2, (256/128) >> 2) = 2. imb_span = 128.
> > 
> >    with NPS=2
> >       top = NODE.
> >       top->span_weight = 64. imb = max(2, (256/64) >> 2) = 2. imb_span = 64.
> > 
> >    with NPS=4      
> >       top = NODE.
> >       top->span_weight = 32. imb = max(2, (256/32) >> 2) = 2. imb_span = 32.
> > 
> > On Zen2, child_weight (or llc_weight) = 8. llc_sq = 64.
> >    with NPS=1
> >       top = DIE.
> >       top->span_weight = 128. imb = max(2, (64/128) >> 2) = 2. imb_span = 128.
> > 
> >    with NPS=2
> >       top = NODE.
> >       top->span_weight = 64. imb = max(2, (64/64) >> 2) = 2. imb_span = 64.
> > 
> >    with NPS=4      
> >       top = NODE.
> >       top->span_weight = 32. imb = max(2, (64/32) >> 2) = 2. imb_span = 32.
> > 
> > 
> > > +
> > > +				sd->imb_numa_nr = imb;
> > > +			} else {
> > > +				sd->imb_numa_nr = imb * (sd->span_weight / imb_span);
> > > +			}
> > 
> > On Zen3,
> >    with NPS=1
> >         sd=NUMA, sd->span_weight = 256. sd->imb_numa_nr = 2 * (256/128) = 4.
> > 
> >    with NPS=2
> >         sd=NUMA, sd->span_weight = 128. sd->imb_numa_nr = 2 * (128/64) = 4
> > 	sd=NUMA, sd->span_weight = 256. sd->imb_numa_nr = 2 * (256/64) = 8
> > 
> >    with NPS=4
> >         sd=NUMA, sd->span_weight = 128. sd->imb_numa_nr = 2 * (128/32) = 8
> > 	sd=NUMA, sd->span_weight = 256. sd->imb_numa_nr = 2 * (256/32) = 16
> > 
> > 
> > For Zen2, since the imb_span and imb values are the same as the
> > corresponding NPS=x values on Zen3, the imb_numa_nr values are the
> > same as well since the corresponding sd->span_weight is the same.
> > 
> > If we look at the highest NUMA domain, there are two groups in all the
> > NPS configurations. There are the same number of LLCs in each of these
> > groups across the different NPS configurations (nr_llcs=8 on Zen3, 16
> > on Zen2) . However, the imb_numa_nr at this domain varies with the NPS
> > value, since we compute the imb_numa_nr value relative to the number
> > of "top" domains that can be fit within this NUMA domain. This is
> > because the size of the "top" domain varies with the NPS value. This
> > shows up in the benchmark results.
> > 
> 
> This was intentional to have some scaling but based on your results, the
> scaling might be at the wrong level.

Ok. 


> 
> > 
> > 
> > The numbers with stream, tbench and YCSB +
> > Mongodb are as follows:
> > 
> > 
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > Stream with 16 threads.
> > built with -DSTREAM_ARRAY_SIZE=128000000, -DNTIMES=10
> > Zen3, 64C128T per socket, 2 sockets,
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > 
> > NPS=1
> > Test:     tip/sched/core                 mel-v3                    mel-v4
> >  Copy:    113716.62 (0.00 pct)     218961.59 (92.55 pct)     217130.07 (90.93 pct)
> > Scale:    110996.89 (0.00 pct)     216674.73 (95.20 pct)     220765.94 (98.89 pct)
> >   Add:    124504.19 (0.00 pct)     253461.32 (103.57 pct)    260273.88 (109.04 pct)
> > Triad:    122890.43 (0.00 pct)     247552.00 (101.44 pct)    252615.62 (105.56 pct)
> > 
> > 
> > NPS=2
> > Test:     tip/sched/core                 mel-v3                     mel-v4
> >  Copy:    58217.00 (0.00 pct)      204630.34 (251.49 pct)     191312.73 (228.62 pct)
> > Scale:    55004.76 (0.00 pct)      212142.88 (285.68 pct)     175499.15 (219.06 pct)
> >   Add:    63269.04 (0.00 pct)      254752.56 (302.64 pct)     203571.50 (221.75 pct)
> > Triad:    62178.25 (0.00 pct)      247290.80 (297.71 pct)     198988.70 (220.02 pct)
> > 
> > NPS=4
> > Test:     tip/sched/core                 mel-v3                     mel-v4
> >  Copy:    37986.66 (0.00 pct)      254183.87 (569.13 pct)     48748.87 (28.33 pct)
> > Scale:    35471.22 (0.00 pct)      237804.76 (570.41 pct)     48317.82 (36.21 pct)
> >   Add:    39303.25 (0.00 pct)      292285.20 (643.66 pct)     54259.59 (38.05 pct)
> > Triad:    39319.85 (0.00 pct)      285284.30 (625.54 pct)     54503.98 (38.61 pct)
> > 
> 
> At minimum, v3 is a failure because a single pair of communicating tasks
> were getting split across NUMA domains and the allowed numa imbalance
> gets cut off too early because of the change to allow_numa_imbalance.
> So while it's a valid comparison, it's definitely not the fix.

v3 is definitely not a fix. I wasn't hinting at that. It was just to
point out the opportunity that we have.


> 
> Given how you describe NPS, maybe the scaling should only start at the
> point where tasks are no longer balanced between sibling domains. Can
> you try this? I've only boot tested it at this point. It should work for
> STREAM at least but probably not great for tbench.

Thanks for the patch. I will queue this one for tonight.


> 
> 
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index bacec575ade2..1fa3e977521d 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2255,26 +2255,38 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>  
>  			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
>  			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
> -				struct sched_domain *top = sd;
> +				struct sched_domain *top, *top_p;
>  				unsigned int llc_sq;
>  
>  				/*
> -				 * nr_llcs = (top->span_weight / llc_weight);
> -				 * imb = (child_weight / nr_llcs) >> 2
> +				 * nr_llcs = (sd->span_weight / llc_weight);
> +				 * imb = (llc_weight / nr_llcs) >> 2
>  				 *
>  				 * is equivalent to
>  				 *
> -				 * imb = (llc_weight^2 / top->span_weight) >> 2
> +				 * imb = (llc_weight^2 / sd->span_weight) >> 2
>  				 *
>  				 */
>  				llc_sq = child->span_weight * child->span_weight;
>  
> -				imb = max(2U, ((llc_sq / top->span_weight) >> 2));
> -				imb_span = sd->span_weight;
> -
> +				imb = max(2U, ((llc_sq / sd->span_weight) >> 2));
>  				sd->imb_numa_nr = imb;
> +
> +				/*
> +				 * Set span based on top domain that places
> +				 * tasks in sibling domains.
> +				 */
> +				top = sd;
> +				top_p = top->parent;
> +				while (top_p && (top_p->flags & SD_PREFER_SIBLING)) {
> +					top = top->parent;
> +					top_p = top->parent;
> +				}
> +				imb_span = top_p ? top_p->span_weight : sd->span_weight;
>  			} else {
> -				sd->imb_numa_nr = imb * (sd->span_weight / imb_span);
> +				int factor = max(1U, (sd->span_weight / imb_span));
> +
> +				sd->imb_numa_nr = imb * factor;
>  			}
>  		}
>  	}

--
Thanks and Regards
gautham.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-13  8:28   ` Gautham R. Shenoy
@ 2021-12-13 13:01     ` Mel Gorman
  2021-12-13 14:47       ` Gautham R. Shenoy
  0 siblings, 1 reply; 46+ messages in thread
From: Mel Gorman @ 2021-12-13 13:01 UTC (permalink / raw)
  To: Gautham R. Shenoy
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML

On Mon, Dec 13, 2021 at 01:58:03PM +0530, Gautham R. Shenoy wrote:
> > On a Zen3 machine running STREAM parallelised with OMP to have one instance
> > per LLC and without binding, the results are
> > 
> >                             5.16.0-rc1             5.16.0-rc1
> >                                vanilla       sched-numaimb-v4
> > MB/sec copy-16    166712.18 (   0.00%)   651540.22 ( 290.82%)
> > MB/sec scale-16   140109.66 (   0.00%)   382254.74 ( 172.83%)
> > MB/sec add-16     160791.18 (   0.00%)   623073.98 ( 287.51%)
> > MB/sec triad-16   160043.84 (   0.00%)   633964.52 ( 296.12%)
> 
> 
> Could you please share the size of the stream array ? These numbers
> are higher than what I am observing.
> 

512MB

> > @@ -9280,19 +9286,14 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> >  	}
> >  }
> >  
> > -#define NUMA_IMBALANCE_MIN 2
> > -
> >  static inline long adjust_numa_imbalance(int imbalance,
> > -				int dst_running, int dst_weight)
> > +				int dst_running, int dst_weight,
> > +				int imb_numa_nr)
> >  {
> >  	if (!allow_numa_imbalance(dst_running, dst_weight))
> >  		return imbalance;
> >
> 
> if (4 * dst_running >= dst_weight) we return imbalance here. The
> dst_weight here corresponds to the span of the domain, while
> dst_running is the nr_running in busiest.
> 

Yes, once dst_running is high enough, no imbalance is allowed. In
previous versions I changed this but that was a mistake and in this
version, the threshold where imbalance is not allowed remains the same.

> On Zen3, at the topmost NUMA domain, the dst_weight = 256 across
> all the configurations of Nodes Per Socket (NPS) = 1/2/4. There are
> two groups, where each group is a socket. So, unless there are at
> least 64 tasks running in one of the sockets, we would not return
> imbalance here and go to the next step.
> 

Yes

> 
> > -	/*
> > -	 * Allow a small imbalance based on a simple pair of communicating
> > -	 * tasks that remain local when the destination is lightly loaded.
> > -	 */
> > -	if (imbalance <= NUMA_IMBALANCE_MIN)
> > +	if (imbalance <= imb_numa_nr)
> 
> In NPS=1 mode, imb_numa_nr would be 4. Since NUMA domains
> don't have PREFER_SIBLING, we would be balancing the number of idle
> CPUs. We will end up doing the imbalance, as long as the difference
> between the idle CPUs is at least 8.
> 
> In NPS=2, imb_numa_nr = 8 for this topmost NUMA domain. So here, we
> will not rebalance unless the difference between the idle CPUs is 16.
> 
> In NPS=4, imb_numa_nr = 16 for this topmost NUMA domain. So, the
> threshold is now bumped up to 32.
> 
> >  		return 0;
> 
> 
> 
> >  
> >  	return imbalance;
> > @@ -9397,7 +9398,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> >  		/* Consider allowing a small imbalance between NUMA groups */
> >  		if (env->sd->flags & SD_NUMA) {
> >  			env->imbalance = adjust_numa_imbalance(env->imbalance,
> > -				busiest->sum_nr_running, env->sd->span_weight);
> > +				busiest->sum_nr_running, env->sd->span_weight,
> > +				env->sd->imb_numa_nr);
> >  		}
> >  
> >  		return;
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index d201a7052a29..bacec575ade2 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -2242,6 +2242,43 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> >  		}
> >  	}
> >  
> > +	/*
> > +	 * Calculate an allowed NUMA imbalance such that LLCs do not get
> > +	 * imbalanced.
> > +	 */
> > +	for_each_cpu(i, cpu_map) {
> > +		unsigned int imb = 0;
> > +		unsigned int imb_span = 1;
> > +
> > +		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> > +			struct sched_domain *child = sd->child;
> > +
> > +			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> > +			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
> > +				struct sched_domain *top = sd;
> 
> 
> We don't seem to be using top anywhere where sd may not be used since
> we already have variables imb and imb_span to record the
> top->imb_numa_nr and top->span_weight.
> 

Top could have been removed but we might still need it.

> 
> > +				unsigned int llc_sq;
> > +
> > +				/*
> > +				 * nr_llcs = (top->span_weight / llc_weight);
> > +				 * imb = (child_weight / nr_llcs) >> 2
> 
> child here is the llc. So can we use imb = (llc_weight / nr_llcs) >> 2.
> 

That would be clearer.

> > +				 *
> > +				 * is equivalent to
> > +				 *
> > +				 * imb = (llc_weight^2 / top->span_weight) >> 2
> > +				 *
> > +				 */
> > +				llc_sq = child->span_weight * child->span_weight;
> > +
> > +				imb = max(2U, ((llc_sq / top->span_weight) >> 2));
> > +				imb_span = sd->span_weight;
> 
> On Zen3, child_weight (or llc_weight) = 16. llc_sq = 256.
>    with NPS=1
>       top = DIE.
>       top->span_weight = 128. imb = max(2, (256/128) >> 2) = 2. imb_span = 128.
> 
>    with NPS=2
>       top = NODE.
>       top->span_weight = 64. imb = max(2, (256/64) >> 2) = 2. imb_span = 64.
> 
>    with NPS=4      
>       top = NODE.
>       top->span_weight = 32. imb = max(2, (256/32) >> 2) = 2. imb_span = 32.
> 
> On Zen2, child_weight (or llc_weight) = 8. llc_sq = 64.
>    with NPS=1
>       top = DIE.
>       top->span_weight = 128. imb = max(2, (64/128) >> 2) = 2. imb_span = 128.
> 
>    with NPS=2
>       top = NODE.
>       top->span_weight = 64. imb = max(2, (64/64) >> 2) = 2. imb_span = 64.
> 
>    with NPS=4      
>       top = NODE.
>       top->span_weight = 32. imb = max(2, (64/32) >> 2) = 2. imb_span = 32.
> 
> 
> > +
> > +				sd->imb_numa_nr = imb;
> > +			} else {
> > +				sd->imb_numa_nr = imb * (sd->span_weight / imb_span);
> > +			}
> 
> On Zen3,
>    with NPS=1
>         sd=NUMA, sd->span_weight = 256. sd->imb_numa_nr = 2 * (256/128) = 4.
> 
>    with NPS=2
>         sd=NUMA, sd->span_weight = 128. sd->imb_numa_nr = 2 * (128/64) = 4
> 	sd=NUMA, sd->span_weight = 256. sd->imb_numa_nr = 2 * (256/64) = 8
> 
>    with NPS=4
>         sd=NUMA, sd->span_weight = 128. sd->imb_numa_nr = 2 * (128/32) = 8
> 	sd=NUMA, sd->span_weight = 256. sd->imb_numa_nr = 2 * (256/32) = 16
> 
> 
> For Zen2, since the imb_span and imb values are the same as the
> corresponding NPS=x values on Zen3, the imb_numa_nr values are the
> same as well since the corresponding sd->span_weight is the same.
> 
> If we look at the highest NUMA domain, there are two groups in all the
> NPS configurations. There are the same number of LLCs in each of these
> groups across the different NPS configurations (nr_llcs=8 on Zen3, 16
> on Zen2) . However, the imb_numa_nr at this domain varies with the NPS
> value, since we compute the imb_numa_nr value relative to the number
> of "top" domains that can be fit within this NUMA domain. This is
> because the size of the "top" domain varies with the NPS value. This
> shows up in the benchmark results.
> 

This was intentional to have some scaling but based on your results, the
scaling might be at the wrong level.

> 
> 
> The numbers with stream, tbench and YCSB +
> Mongodb are as follows:
> 
> 
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Stream with 16 threads.
> built with -DSTREAM_ARRAY_SIZE=128000000, -DNTIMES=10
> Zen3, 64C128T per socket, 2 sockets,
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> NPS=1
> Test:     tip/sched/core                 mel-v3                    mel-v4
>  Copy:    113716.62 (0.00 pct)     218961.59 (92.55 pct)     217130.07 (90.93 pct)
> Scale:    110996.89 (0.00 pct)     216674.73 (95.20 pct)     220765.94 (98.89 pct)
>   Add:    124504.19 (0.00 pct)     253461.32 (103.57 pct)    260273.88 (109.04 pct)
> Triad:    122890.43 (0.00 pct)     247552.00 (101.44 pct)    252615.62 (105.56 pct)
> 
> 
> NPS=2
> Test:     tip/sched/core                 mel-v3                     mel-v4
>  Copy:    58217.00 (0.00 pct)      204630.34 (251.49 pct)     191312.73 (228.62 pct)
> Scale:    55004.76 (0.00 pct)      212142.88 (285.68 pct)     175499.15 (219.06 pct)
>   Add:    63269.04 (0.00 pct)      254752.56 (302.64 pct)     203571.50 (221.75 pct)
> Triad:    62178.25 (0.00 pct)      247290.80 (297.71 pct)     198988.70 (220.02 pct)
> 
> NPS=4
> Test:     tip/sched/core                 mel-v3                     mel-v4
>  Copy:    37986.66 (0.00 pct)      254183.87 (569.13 pct)     48748.87 (28.33 pct)
> Scale:    35471.22 (0.00 pct)      237804.76 (570.41 pct)     48317.82 (36.21 pct)
>   Add:    39303.25 (0.00 pct)      292285.20 (643.66 pct)     54259.59 (38.05 pct)
> Triad:    39319.85 (0.00 pct)      285284.30 (625.54 pct)     54503.98 (38.61 pct)
> 

At minimum, v3 is a failure because a single pair of communicating tasks
were getting split across NUMA domains and the allowed numa imbalance
gets cut off too early because of the change to allow_numa_imbalance.
So while it's a valid comparison, it's definitely not the fix.

Given how you describe NPS, maybe the scaling should only start at the
point where tasks are no longer balanced between sibling domains. Can
you try this? I've only boot tested it at this point. It should work for
STREAM at least but probably not great for tbench.


diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index bacec575ade2..1fa3e977521d 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2255,26 +2255,38 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 
 			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
 			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
-				struct sched_domain *top = sd;
+				struct sched_domain *top, *top_p;
 				unsigned int llc_sq;
 
 				/*
-				 * nr_llcs = (top->span_weight / llc_weight);
-				 * imb = (child_weight / nr_llcs) >> 2
+				 * nr_llcs = (sd->span_weight / llc_weight);
+				 * imb = (llc_weight / nr_llcs) >> 2
 				 *
 				 * is equivalent to
 				 *
-				 * imb = (llc_weight^2 / top->span_weight) >> 2
+				 * imb = (llc_weight^2 / sd->span_weight) >> 2
 				 *
 				 */
 				llc_sq = child->span_weight * child->span_weight;
 
-				imb = max(2U, ((llc_sq / top->span_weight) >> 2));
-				imb_span = sd->span_weight;
-
+				imb = max(2U, ((llc_sq / sd->span_weight) >> 2));
 				sd->imb_numa_nr = imb;
+
+				/*
+				 * Set span based on top domain that places
+				 * tasks in sibling domains.
+				 */
+				top = sd;
+				top_p = top->parent;
+				while (top_p && (top_p->flags & SD_PREFER_SIBLING)) {
+					top = top->parent;
+					top_p = top->parent;
+				}
+				imb_span = top_p ? top_p->span_weight : sd->span_weight;
 			} else {
-				sd->imb_numa_nr = imb * (sd->span_weight / imb_span);
+				int factor = max(1U, (sd->span_weight / imb_span));
+
+				sd->imb_numa_nr = imb * factor;
 			}
 		}
 	}

^ permalink raw reply related	[flat|nested] 46+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-10  9:33 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans " Mel Gorman
@ 2021-12-13  8:28   ` Gautham R. Shenoy
  2021-12-13 13:01     ` Mel Gorman
  2021-12-17 19:54   ` Gautham R. Shenoy
  1 sibling, 1 reply; 46+ messages in thread
From: Gautham R. Shenoy @ 2021-12-13  8:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML

Hello Mel,

On Fri, Dec 10, 2021 at 09:33:07AM +0000, Mel Gorman wrote:
> Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA
> nodes") allowed an imbalance between NUMA nodes such that communicating
> tasks would not be pulled apart by the load balancer. This works fine when
> there is a 1:1 relationship between LLC and node but can be suboptimal
> for multiple LLCs if independent tasks prematurely use CPUs sharing cache.
> 
> Zen* has multiple LLCs per node with local memory channels and due to
> the allowed imbalance, it's far harder to tune some workloads to run
> optimally than it is on hardware that has 1 LLC per node. This patch
> adjusts the imbalance on multi-LLC machines to allow an imbalance up to
> the point where LLCs should be balanced between nodes.
> 
> On a Zen3 machine running STREAM parallelised with OMP to have one instance
> per LLC and without binding, the results are
> 
>                             5.16.0-rc1             5.16.0-rc1
>                                vanilla       sched-numaimb-v4
> MB/sec copy-16    166712.18 (   0.00%)   651540.22 ( 290.82%)
> MB/sec scale-16   140109.66 (   0.00%)   382254.74 ( 172.83%)
> MB/sec add-16     160791.18 (   0.00%)   623073.98 ( 287.51%)
> MB/sec triad-16   160043.84 (   0.00%)   633964.52 ( 296.12%)


Could you please share the size of the stream array ? These numbers
are higher than what I am observing.

> 
> STREAM can use directives to force the spread if the OpenMP is new
> enough but that doesn't help if an application uses threads and
> it's not known in advance how many threads will be created.
> 
> Coremark is a CPU and cache intensive benchmark parallelised with
> threads. When running with 1 thread per instance, the vanilla kernel
> allows threads to contend on cache. With the patch:
> 
>                                5.16.0-rc1             5.16.0-rc1
>                                   vanilla    sched-numaimb-v4r24
> Min       Score-16   367816.09 (   0.00%)   384015.36 (   4.40%)
> Hmean     Score-16   389627.78 (   0.00%)   431907.14 *  10.85%*
> Max       Score-16   416178.96 (   0.00%)   480120.03 (  15.36%)
> Stddev    Score-16    17361.82 (   0.00%)    32505.34 ( -87.22%)
> CoeffVar  Score-16        4.45 (   0.00%)        7.49 ( -68.30%)
> 
> It can also make a big difference for semi-realistic workloads
> like specjbb which can execute arbitrary numbers of threads without
> advance knowledge of how they should be placed
> 
>                                5.16.0-rc1             5.16.0-rc1
>                                   vanilla       sched-numaimb-v4
> Hmean     tput-1      73743.05 (   0.00%)    70258.27 *  -4.73%*
> Hmean     tput-8     563036.51 (   0.00%)   591187.39 (   5.00%)
> Hmean     tput-16   1016590.61 (   0.00%)  1032311.78 (   1.55%)
> Hmean     tput-24   1418558.41 (   0.00%)  1424005.80 (   0.38%)
> Hmean     tput-32   1608794.22 (   0.00%)  1907855.80 *  18.59%*
> Hmean     tput-40   1761338.13 (   0.00%)  2108162.23 *  19.69%*
> Hmean     tput-48   2290646.54 (   0.00%)  2214383.47 (  -3.33%)
> Hmean     tput-56   2463345.12 (   0.00%)  2780216.58 *  12.86%*
> Hmean     tput-64   2650213.53 (   0.00%)  2598196.66 (  -1.96%)
> Hmean     tput-72   2497253.28 (   0.00%)  2998882.47 *  20.09%*
> Hmean     tput-80   2820786.72 (   0.00%)  2951655.27 (   4.64%)
> Hmean     tput-88   2813541.68 (   0.00%)  3045450.86 *   8.24%*
> Hmean     tput-96   2604158.67 (   0.00%)  3035311.91 *  16.56%*
> Hmean     tput-104  2713810.62 (   0.00%)  2984270.04 (   9.97%)
> Hmean     tput-112  2558425.37 (   0.00%)  2894737.46 *  13.15%*
> Hmean     tput-120  2611434.93 (   0.00%)  2781661.01 (   6.52%)
> Hmean     tput-128  2706103.22 (   0.00%)  2811447.85 (   3.89%)


> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  include/linux/sched/topology.h |  1 +
>  kernel/sched/fair.c            | 36 +++++++++++++++++----------------
>  kernel/sched/topology.c        | 37 ++++++++++++++++++++++++++++++++++
>  3 files changed, 57 insertions(+), 17 deletions(-)
> 
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index c07bfa2d80f2..54f5207154d3 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -93,6 +93,7 @@ struct sched_domain {
>  	unsigned int busy_factor;	/* less balancing by factor if busy */
>  	unsigned int imbalance_pct;	/* No balance until over watermark */
>  	unsigned int cache_nice_tries;	/* Leave cache hot tasks for # tries */
> +	unsigned int imb_numa_nr;	/* Nr imbalanced tasks allowed between nodes */
>  
>  	int nohz_idle;			/* NOHZ IDLE status */
>  	int flags;			/* See SD_* */
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0a969affca76..972ba586b113 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1489,6 +1489,7 @@ struct task_numa_env {
>  
>  	int src_cpu, src_nid;
>  	int dst_cpu, dst_nid;
> +	int imb_numa_nr;
>  
>  	struct numa_stats src_stats, dst_stats;
>  
> @@ -1504,7 +1505,8 @@ static unsigned long cpu_load(struct rq *rq);
>  static unsigned long cpu_runnable(struct rq *rq);
>  static unsigned long cpu_util(int cpu);
>  static inline long adjust_numa_imbalance(int imbalance,
> -					int dst_running, int dst_weight);
> +					int dst_running, int dst_weight,
> +					int imb_numa_nr);
>  
>  static inline enum
>  numa_type numa_classify(unsigned int imbalance_pct,
> @@ -1885,7 +1887,8 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>  		dst_running = env->dst_stats.nr_running + 1;
>  		imbalance = max(0, dst_running - src_running);
>  		imbalance = adjust_numa_imbalance(imbalance, dst_running,
> -							env->dst_stats.weight);
> +						  env->dst_stats.weight,
> +						  env->imb_numa_nr);
>  
>  		/* Use idle CPU if there is no imbalance */
>  		if (!imbalance) {
> @@ -1950,8 +1953,10 @@ static int task_numa_migrate(struct task_struct *p)
>  	 */
>  	rcu_read_lock();
>  	sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu));
> -	if (sd)
> +	if (sd) {
>  		env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
> +		env.imb_numa_nr = sd->imb_numa_nr;
> +	}
>  	rcu_read_unlock();
>  
>  	/*
> @@ -9186,12 +9191,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>  				return idlest;
>  #endif
>  			/*
> -			 * Otherwise, keep the task on this node to stay close
> -			 * its wakeup source and improve locality. If there is
> -			 * a real need of migration, periodic load balance will
> -			 * take care of it.
> +			 * Otherwise, keep the task on this node to stay local
> +			 * to its wakeup source if the number of running tasks
> +			 * are below the allowed imbalance. If there is a real
> +			 * need of migration, periodic load balance will take
> +			 * care of it.
>  			 */
> -			if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight))
> +			if (local_sgs.sum_nr_running <= sd->imb_numa_nr)
>  				return NULL;
>  		}
>  
> @@ -9280,19 +9286,14 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  	}
>  }
>  
> -#define NUMA_IMBALANCE_MIN 2
> -
>  static inline long adjust_numa_imbalance(int imbalance,
> -				int dst_running, int dst_weight)
> +				int dst_running, int dst_weight,
> +				int imb_numa_nr)
>  {
>  	if (!allow_numa_imbalance(dst_running, dst_weight))
>  		return imbalance;
>

if (4 * dst_running >= dst_weight) we return imbalance here. The
dst_weight here corresponds to the span of the domain, while
dst_running is the nr_running in busiest.

On Zen3, at the topmost NUMA domain, dst_weight = 256 across all the
configurations of Nodes Per Socket (NPS) = 1/2/4. There are two
groups, each group being a socket. So, unless there are at least 64
tasks running in one of the sockets, we would not return the
imbalance here and would go on to the next step.
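
For reference, the combined check as it reads with this patch applied,
pulled out as a standalone sketch (not the exact kernel context; the
comment about 64 tasks assumes the Zen3 topmost NUMA domain with
span_weight = 256 and NPS=1, where imb_numa_nr = 4):

static inline bool allow_numa_imbalance(int dst_running, int dst_weight)
{
	return (dst_running < (dst_weight >> 2));
}

static inline long adjust_numa_imbalance(int imbalance, int dst_running,
					 int dst_weight, int imb_numa_nr)
{
	/* >= 64 busy tasks in the busiest socket: balance normally */
	if (!allow_numa_imbalance(dst_running, dst_weight))
		return imbalance;

	/* otherwise tolerate an imbalance of up to imb_numa_nr tasks */
	if (imbalance <= imb_numa_nr)
		return 0;

	return imbalance;
}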


> -	/*
> -	 * Allow a small imbalance based on a simple pair of communicating
> -	 * tasks that remain local when the destination is lightly loaded.
> -	 */
> -	if (imbalance <= NUMA_IMBALANCE_MIN)
> +	if (imbalance <= imb_numa_nr)

In NPS=1 mode, imb_numa_nr would be 4 for this topmost NUMA domain.
Since NUMA domains don't have PREFER_SIBLING, we would be balancing
the number of idle CPUs, so we will not rebalance unless the
difference between the idle CPUs is at least 8.

In NPS=2, imb_numa_nr = 8 for this topmost NUMA domain, so we will
not rebalance unless the difference between the idle CPUs is at
least 16.

In NPS=4, imb_numa_nr = 16 for this topmost NUMA domain. So, the
threshold is now bumped up to 32.
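
A rough sketch of where these idle-CPU thresholds come from, assuming
the group_has_spare path where the imbalance handed to
adjust_numa_imbalance() is roughly half of the idle-CPU difference
(illustrative helper only, not kernel code):

/* Returns 1 if the NUMA fixup would zero out this imbalance */
static int numa_imbalance_tolerated(int local_idle, int busiest_idle,
				    int imb_numa_nr)
{
	int imbalance = (local_idle - busiest_idle) >> 1;

	return imbalance <= imb_numa_nr;
}

/*
 * NPS=1: imb_numa_nr = 4  -> an idle-CPU difference of ~8 is tolerated
 * NPS=2: imb_numa_nr = 8  -> a difference of ~16 is tolerated
 * NPS=4: imb_numa_nr = 16 -> a difference of ~32 is tolerated
 */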

>  		return 0;



>  
>  	return imbalance;
> @@ -9397,7 +9398,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>  		/* Consider allowing a small imbalance between NUMA groups */
>  		if (env->sd->flags & SD_NUMA) {
>  			env->imbalance = adjust_numa_imbalance(env->imbalance,
> -				busiest->sum_nr_running, env->sd->span_weight);
> +				busiest->sum_nr_running, env->sd->span_weight,
> +				env->sd->imb_numa_nr);
>  		}
>  
>  		return;
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index d201a7052a29..bacec575ade2 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2242,6 +2242,43 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>  		}
>  	}
>  
> +	/*
> +	 * Calculate an allowed NUMA imbalance such that LLCs do not get
> +	 * imbalanced.
> +	 */
> +	for_each_cpu(i, cpu_map) {
> +		unsigned int imb = 0;
> +		unsigned int imb_span = 1;
> +
> +		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> +			struct sched_domain *child = sd->child;
> +
> +			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> +			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
> +				struct sched_domain *top = sd;


We don't seem to be using 'top' anywhere that 'sd' could not be used
instead, since we already have the variables imb and imb_span to
record top->imb_numa_nr and top->span_weight.


> +				unsigned int llc_sq;
> +
> +				/*
> +				 * nr_llcs = (top->span_weight / llc_weight);
> +				 * imb = (child_weight / nr_llcs) >> 2

'child' here is the LLC domain. So can we use imb = (llc_weight / nr_llcs) >> 2 in the comment?
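
(For reference, the two forms in the comment are equivalent because
nr_llcs = top->span_weight / llc_weight, so child_weight / nr_llcs =
llc_weight / (top->span_weight / llc_weight) =
llc_weight^2 / top->span_weight.)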

> +				 *
> +				 * is equivalent to
> +				 *
> +				 * imb = (llc_weight^2 / top->span_weight) >> 2
> +				 *
> +				 */
> +				llc_sq = child->span_weight * child->span_weight;
> +
> +				imb = max(2U, ((llc_sq / top->span_weight) >> 2));
> +				imb_span = sd->span_weight;

On Zen3, child_weight (or llc_weight) = 16. llc_sq = 256.
   with NPS=1
      top = DIE.
      top->span_weight = 128. imb = max(2, (256/128) >> 2) = 2. imb_span = 128.

   with NPS=2
      top = NODE.
      top->span_weight = 64. imb = max(2, (256/64) >> 2) = 2. imb_span = 64.

   with NPS=4      
      top = NODE.
      top->span_weight = 32. imb = max(2, (256/32) >> 2) = 2. imb_span = 32.

On Zen2, child_weight (or llc_weight) = 8. llc_sq = 64.
   with NPS=1
      top = DIE.
      top->span_weight = 128. imb = max(2, (64/128) >> 2) = 2. imb_span = 128.

   with NPS=2
      top = NODE.
      top->span_weight = 64. imb = max(2, (64/64) >> 2) = 2. imb_span = 64.

   with NPS=4      
      top = NODE.
      top->span_weight = 32. imb = max(2, (64/32) >> 2) = 2. imb_span = 32.
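
The numbers above can be reproduced with a trivial userspace sketch of
the same formula; llc_weight and top_weight are the only assumed
inputs (hypothetical helper names):

#include <stdio.h>

/* Mirrors imb = max(2U, ((llc_weight^2 / top->span_weight) >> 2)) */
static unsigned int calc_imb(unsigned int llc_weight, unsigned int top_weight)
{
	unsigned int llc_sq = llc_weight * llc_weight;
	unsigned int imb = (llc_sq / top_weight) >> 2;

	return imb < 2 ? 2 : imb;
}

int main(void)
{
	/* Zen3: llc_weight = 16; NPS=1/2/4 -> top_weight = 128/64/32 */
	printf("%u %u %u\n", calc_imb(16, 128), calc_imb(16, 64),
	       calc_imb(16, 32));		/* 2 2 2 */
	/* Zen2: llc_weight = 8 */
	printf("%u %u %u\n", calc_imb(8, 128), calc_imb(8, 64),
	       calc_imb(8, 32));		/* 2 2 2 */
	return 0;
}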


> +
> +				sd->imb_numa_nr = imb;
> +			} else {
> +				sd->imb_numa_nr = imb * (sd->span_weight / imb_span);
> +			}

On Zen3,
   with NPS=1
        sd=NUMA, sd->span_weight = 256. sd->imb_numa_nr = 2 * (256/128) = 4.

   with NPS=2
        sd=NUMA, sd->span_weight = 128. sd->imb_numa_nr = 2 * (128/64) = 4
	sd=NUMA, sd->span_weight = 256. sd->imb_numa_nr = 2 * (256/64) = 8

   with NPS=4
        sd=NUMA, sd->span_weight = 128. sd->imb_numa_nr = 2 * (128/32) = 8
	sd=NUMA, sd->span_weight = 256. sd->imb_numa_nr = 2 * (256/32) = 16


For Zen2, since the imb_span and imb values are the same as the
corresponding NPS=x values on Zen3, the imb_numa_nr values are the
same as well since the corresponding sd->span_weight is the same.
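
Similarly, the sd->imb_numa_nr values above follow from the else
branch in the patch; a hypothetical helper just to make the scaling
explicit (imb and imb_span come from the LLC-boundary domain handled
earlier in the loop):

static unsigned int scale_imb(unsigned int imb, unsigned int imb_span,
			      unsigned int sd_span_weight)
{
	return imb * (sd_span_weight / imb_span);
}

/*
 * Zen3 NPS=4: imb = 2, imb_span = 32:
 *   scale_imb(2, 32, 128) = 8, scale_imb(2, 32, 256) = 16
 */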


If we look at the highest NUMA domain, there are two groups in all
the NPS configurations, and each of these groups contains the same
number of LLCs across the different NPS configurations (nr_llcs=8 on
Zen3, 16 on Zen2). However, the imb_numa_nr at this domain varies
with the NPS value, since we compute the imb_numa_nr value relative
to the number of "top" domains that can fit within this NUMA domain,
and the size of the "top" domain varies with the NPS value. This
shows up in the benchmark results.



The numbers with stream, tbench and YCSB+MongoDB are as follows:


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Stream with 16 threads.
built with -DSTREAM_ARRAY_SIZE=128000000, -DNTIMES=10
Zen3, 64C128T per socket, 2 sockets,
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

NPS=1
Test:     tip/sched/core                 mel-v3                    mel-v4
 Copy:    113716.62 (0.00 pct)     218961.59 (92.55 pct)     217130.07 (90.93 pct)
Scale:    110996.89 (0.00 pct)     216674.73 (95.20 pct)     220765.94 (98.89 pct)
  Add:    124504.19 (0.00 pct)     253461.32 (103.57 pct     260273.88 (109.04 pct)
Triad:    122890.43 (0.00 pct)     247552.00 (101.44 pct     252615.62 (105.56 pct)


NPS=2
Test:     tip/sched/core                 mel-v3                     mel-v4
 Copy:    58217.00 (0.00 pct)      204630.34 (251.49 pct)     191312.73 (228.62 pct)
Scale:    55004.76 (0.00 pct)      212142.88 (285.68 pct)     175499.15 (219.06 pct)
  Add:    63269.04 (0.00 pct)      254752.56 (302.64 pct)     203571.50 (221.75 pct)
Triad:    62178.25 (0.00 pct)      247290.80 (297.71 pct)     198988.70 (220.02 pct)

NPS=4
Test:     tip/sched/core                 mel-v3                     mel-v4
 Copy:    37986.66 (0.00 pct)      254183.87 (569.13 pct)     48748.87 (28.33 pct)
Scale:    35471.22 (0.00 pct)      237804.76 (570.41 pct)     48317.82 (36.21 pct)
  Add:    39303.25 (0.00 pct)      292285.20 (643.66 pct)     54259.59 (38.05 pct)
Triad:    39319.85 (0.00 pct)      285284.30 (625.54 pct)     54503.98 (38.61 pct)


We can see that with the v4 patch, for NPS=2 and NPS=4, the gains
start diminishing since the thresholds are higher than in NPS=1.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Stream with 16 threads.
built with -DSTREAM_ARRAY_SIZE=128000000, -DNTIMES=100
Zen3, 64C128T per socket, 2 sockets,
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

NPS=1
Test:     tip/sched/core                 mel-v3                    mel-v4
 Copy:    137362.66 (0.00 pct)     236661.65 (72.28 pct)     241148.65 (75.55 pct)
Scale:    126742.24 (0.00 pct)     214568.17 (69.29 pct)     226416.41 (78.64 pct)
  Add:    148236.33 (0.00 pct)     257114.42 (73.44 pct)     272030.50 (83.51 pct)
Triad:    146913.25 (0.00 pct)     241880.88 (64.64 pct)     259873.61 (76.88 pct)

NPS=2
Test:     tip/sched/core                 mel-v3                    mel-v4
 Copy:    107143.94 (0.00 pct)     244922.66 (128.59 pct)    198299.91 (85.07 pct)
Scale:    102004.90 (0.00 pct)     218738.55 (114.43 pct)    177890.23 (74.39 pct)
  Add:    117760.23 (0.00 pct)     270516.24 (129.71 pct)    211458.30 (79.56 pct)
Triad:    115927.92 (0.00 pct)     255985.20 (120.81 pct)    197812.60 (70.63 pct)


NPS=4
Test:     tip/sched/core                 mel-v3                    mel-v4
 Copy:    111653.17 (0.00 pct)     253912.17 (127.41 pct)     48898.34 (-56.20 pct)
Scale:    105289.35 (0.00 pct)     223710.85 (112.47 pct)     48426.03 (-54.00 pct)
  Add:    120927.64 (0.00 pct)     277701.20 (129.64 pct)     54425.48 (-54.99 pct)
Triad:    117659.97 (0.00 pct)     259473.84 (120.52 pct)     54622.82 (-53.57 pct)

With -DNTIMES=100, each of the Copy, Scale, Add and Triad kernels
runs for a longer duration, so the test takes longer (6-10 seconds),
giving the load balancer sufficient time to place and balance the
tasks. In this configuration we see that v4 shows some degradation
on NPS=4. This is due to the imb_numa_nr being higher compared to v3.

While Stream benefits from spreading, it is fair to also look at
benchmarks that prefer the tasks to be co-located instead of spread
out. I chose tbench and YCSB+MongoDB as representatives of these.
The numbers are as follows:


~~~~~~~~~~~~~~~~~~~~~~~~
tbench
Zen3, 64C128T per socket, 2 sockets,
~~~~~~~~~~~~~~~~~~~~~~~~

NPS=1
Clients:     tip/sched/core                 mel-v3                  mel-v4
    1        633.25 (0.00 pct)        619.18 (-2.22 pct)      632.96 (-0.04 pct)
    2        1152.54 (0.00 pct)       1189.91 (3.24 pct)      1184.84 (2.80 pct)
    4        1946.53 (0.00 pct)       2177.45 (11.86 pct)     1979.62 (1.69 pct)
    8        3554.65 (0.00 pct)       3565.16 (0.29 pct)      3678.13 (3.47 pct)
   16        6222.00 (0.00 pct)       6484.89 (4.22 pct)      6256.02 (0.54 pct)
   32        11707.57 (0.00 pct)      12185.93 (4.08 pct)     12006.63 (2.55 pct)
   64        18433.50 (0.00 pct)      19537.03 (5.98 pct)     19088.57 (3.55 pct)
  128        27400.07 (0.00 pct)      31771.53 (15.95 pct)    27265.00 (-0.49 pct)
  256        33195.27 (0.00 pct)      24478.67 (-26.25 pct)   34065.60 (2.62 pct)
  512        41633.10 (0.00 pct)      54833.20 (31.70 pct)    46724.00 (12.22 pct)
 1024        53877.23 (0.00 pct)      56363.37 (4.61 pct)     44813.10 (-16.82 pct)


NPS=2
Clients:     tip/sched/core                 mel-v3                  mel-v4
    1        629.76 (0.00 pct)        620.94 (-1.40 pct)      629.22 (-0.08 pct)
    2        1177.01 (0.00 pct)       1203.27 (2.23 pct)      1169.12 (-0.66 pct)
    4        1990.97 (0.00 pct)       2228.18 (11.91 pct)     1888.39 (-5.15 pct)
    8        3535.45 (0.00 pct)       3620.76 (2.41 pct)      3662.72 (3.59 pct)
   16        6309.02 (0.00 pct)       6548.66 (3.79 pct)      6508.67 (3.16 pct)
   32        12038.73 (0.00 pct)      12145.97 (0.89 pct)     11411.50 (-5.21 pct)
   64        18599.67 (0.00 pct)      19448.87 (4.56 pct)     17146.07 (-7.81 pct)
  128        27861.57 (0.00 pct)      30630.53 (9.93 pct)     28217.30 (1.27 pct)
  256        28215.80 (0.00 pct)      26864.67 (-4.78 pct)    29330.47 (3.95 pct)
  512        44239.67 (0.00 pct)      52822.47 (19.40 pct)    42652.63 (-3.58 pct)
 1024        54403.53 (0.00 pct)      53905.57 (-0.91 pct)    48490.30 (-10.86 pct)



NPS=4
Clients:     tip/sched/core                 mel-v3                  mel-v4
    1        622.68 (0.00 pct)        617.87 (-0.77 pct)      667.38 (7.17 pct)
    2        1160.74 (0.00 pct)       1182.40 (1.86 pct)      1294.12 (11.49 pct)
    4        1961.29 (0.00 pct)       2172.41 (10.76 pct)     2477.76 (26.33 pct)
    8        3664.25 (0.00 pct)       3450.80 (-5.82 pct)     4067.42 (11.00 pct)
   16        6495.53 (0.00 pct)       5873.41 (-9.57 pct)     6931.66 (6.71 pct)
   32        11833.27 (0.00 pct)      12010.43 (1.49 pct)     12710.60 (7.41 pct)
   64        17723.50 (0.00 pct)      18416.23 (3.90 pct)     18793.47 (6.03 pct)
  128        27724.83 (0.00 pct)      27894.50 (0.61 pct)     27948.60 (0.80 pct)
  256        31351.70 (0.00 pct)      23944.43 (-23.62 pct)   35430.17 (13.00 pct)
  512        43383.43 (0.00 pct)      49830.63 (14.86 pct)    43877.83 (1.13 pct)
 1024        46974.27 (0.00 pct)      53583.83 (14.07 pct)    50563.23 (7.64 pct)


With NPS=4 and v4, we see no regressions with tbench compared to
tip/sched/core, and there is a considerable improvement in most
cases. So, the higher imb_numa_nr helps pack the tasks, which is
beneficial to tbench.



~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
YCSB + Mongodb.

4 client instances, 256 threads per client instance.  These threads
have a very low utilization. The overall system utilization was in the
range of 16-20%.

YCSB workload type : A
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

NPS=1
               tip/sched/core   mel-v3        mel-v4
Throughput     351611.0        314981.33     329026.33
                               (-10.42 pct)  (-6.42 pct)



NPS=4
               tip/sched/core   mel-v3        mel-v4
Throughput     315808.0         316600.67     331093.67
                                (0.25 pct)    (4.84 pct)

Since at NPS=4 the imb_numa_nr is 8 and 16 at the lower and higher
NUMA domains respectively, the task spreading happens more
reluctantly compared to v3, where the imb_numa_nr was 1 in both
domains.


--
Thanks and Regards
gautham.

^ permalink raw reply	[flat|nested] 46+ messages in thread

* [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-10  9:33 [PATCH v4 0/2] Adjust NUMA imbalance for " Mel Gorman
@ 2021-12-10  9:33 ` Mel Gorman
  2021-12-13  8:28   ` Gautham R. Shenoy
  2021-12-17 19:54   ` Gautham R. Shenoy
  0 siblings, 2 replies; 46+ messages in thread
From: Mel Gorman @ 2021-12-10  9:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy,
	LKML, Mel Gorman

Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA
nodes") allowed an imbalance between NUMA nodes such that communicating
tasks would not be pulled apart by the load balancer. This works fine when
there is a 1:1 relationship between LLC and node but can be suboptimal
for multiple LLCs if independent tasks prematurely use CPUs sharing cache.

Zen* has multiple LLCs per node with local memory channels and due to
the allowed imbalance, it's far harder to tune some workloads to run
optimally than it is on hardware that has 1 LLC per node. This patch
adjusts the imbalance on multi-LLC machines to allow an imbalance up to
the point where LLCs should be balanced between nodes.

On a Zen3 machine running STREAM parallelised with OMP to have one
instance per LLC and without binding, the results are

                            5.16.0-rc1             5.16.0-rc1
                               vanilla       sched-numaimb-v4
MB/sec copy-16    166712.18 (   0.00%)   651540.22 ( 290.82%)
MB/sec scale-16   140109.66 (   0.00%)   382254.74 ( 172.83%)
MB/sec add-16     160791.18 (   0.00%)   623073.98 ( 287.51%)
MB/sec triad-16   160043.84 (   0.00%)   633964.52 ( 296.12%)

STREAM can use directives to force the spread if the OpenMP is new
enough but that doesn't help if an application uses threads and
it's not known in advance how many threads will be created.

Coremark is a CPU- and cache-intensive benchmark parallelised with
threads. When running with 1 thread per instance, the vanilla kernel
allows threads to contend on cache. With the patch:

                               5.16.0-rc1             5.16.0-rc1
                                  vanilla    sched-numaimb-v4r24
Min       Score-16   367816.09 (   0.00%)   384015.36 (   4.40%)
Hmean     Score-16   389627.78 (   0.00%)   431907.14 *  10.85%*
Max       Score-16   416178.96 (   0.00%)   480120.03 (  15.36%)
Stddev    Score-16    17361.82 (   0.00%)    32505.34 ( -87.22%)
CoeffVar  Score-16        4.45 (   0.00%)        7.49 ( -68.30%)

It can also make a big difference for semi-realistic workloads
like specjbb which can execute arbitrary numbers of threads without
advance knowledge of how they should be placed

                               5.16.0-rc1             5.16.0-rc1
                                  vanilla       sched-numaimb-v4
Hmean     tput-1      73743.05 (   0.00%)    70258.27 *  -4.73%*
Hmean     tput-8     563036.51 (   0.00%)   591187.39 (   5.00%)
Hmean     tput-16   1016590.61 (   0.00%)  1032311.78 (   1.55%)
Hmean     tput-24   1418558.41 (   0.00%)  1424005.80 (   0.38%)
Hmean     tput-32   1608794.22 (   0.00%)  1907855.80 *  18.59%*
Hmean     tput-40   1761338.13 (   0.00%)  2108162.23 *  19.69%*
Hmean     tput-48   2290646.54 (   0.00%)  2214383.47 (  -3.33%)
Hmean     tput-56   2463345.12 (   0.00%)  2780216.58 *  12.86%*
Hmean     tput-64   2650213.53 (   0.00%)  2598196.66 (  -1.96%)
Hmean     tput-72   2497253.28 (   0.00%)  2998882.47 *  20.09%*
Hmean     tput-80   2820786.72 (   0.00%)  2951655.27 (   4.64%)
Hmean     tput-88   2813541.68 (   0.00%)  3045450.86 *   8.24%*
Hmean     tput-96   2604158.67 (   0.00%)  3035311.91 *  16.56%*
Hmean     tput-104  2713810.62 (   0.00%)  2984270.04 (   9.97%)
Hmean     tput-112  2558425.37 (   0.00%)  2894737.46 *  13.15%*
Hmean     tput-120  2611434.93 (   0.00%)  2781661.01 (   6.52%)
Hmean     tput-128  2706103.22 (   0.00%)  2811447.85 (   3.89%)

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/sched/topology.h |  1 +
 kernel/sched/fair.c            | 36 +++++++++++++++++----------------
 kernel/sched/topology.c        | 37 ++++++++++++++++++++++++++++++++++
 3 files changed, 57 insertions(+), 17 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index c07bfa2d80f2..54f5207154d3 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -93,6 +93,7 @@ struct sched_domain {
 	unsigned int busy_factor;	/* less balancing by factor if busy */
 	unsigned int imbalance_pct;	/* No balance until over watermark */
 	unsigned int cache_nice_tries;	/* Leave cache hot tasks for # tries */
+	unsigned int imb_numa_nr;	/* Nr imbalanced tasks allowed between nodes */
 
 	int nohz_idle;			/* NOHZ IDLE status */
 	int flags;			/* See SD_* */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0a969affca76..972ba586b113 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1489,6 +1489,7 @@ struct task_numa_env {
 
 	int src_cpu, src_nid;
 	int dst_cpu, dst_nid;
+	int imb_numa_nr;
 
 	struct numa_stats src_stats, dst_stats;
 
@@ -1504,7 +1505,8 @@ static unsigned long cpu_load(struct rq *rq);
 static unsigned long cpu_runnable(struct rq *rq);
 static unsigned long cpu_util(int cpu);
 static inline long adjust_numa_imbalance(int imbalance,
-					int dst_running, int dst_weight);
+					int dst_running, int dst_weight,
+					int imb_numa_nr);
 
 static inline enum
 numa_type numa_classify(unsigned int imbalance_pct,
@@ -1885,7 +1887,8 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 		dst_running = env->dst_stats.nr_running + 1;
 		imbalance = max(0, dst_running - src_running);
 		imbalance = adjust_numa_imbalance(imbalance, dst_running,
-							env->dst_stats.weight);
+						  env->dst_stats.weight,
+						  env->imb_numa_nr);
 
 		/* Use idle CPU if there is no imbalance */
 		if (!imbalance) {
@@ -1950,8 +1953,10 @@ static int task_numa_migrate(struct task_struct *p)
 	 */
 	rcu_read_lock();
 	sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu));
-	if (sd)
+	if (sd) {
 		env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
+		env.imb_numa_nr = sd->imb_numa_nr;
+	}
 	rcu_read_unlock();
 
 	/*
@@ -9186,12 +9191,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 				return idlest;
 #endif
 			/*
-			 * Otherwise, keep the task on this node to stay close
-			 * its wakeup source and improve locality. If there is
-			 * a real need of migration, periodic load balance will
-			 * take care of it.
+			 * Otherwise, keep the task on this node to stay local
+			 * to its wakeup source if the number of running tasks
+			 * are below the allowed imbalance. If there is a real
+			 * need of migration, periodic load balance will take
+			 * care of it.
 			 */
-			if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight))
+			if (local_sgs.sum_nr_running <= sd->imb_numa_nr)
 				return NULL;
 		}
 
@@ -9280,19 +9286,14 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	}
 }
 
-#define NUMA_IMBALANCE_MIN 2
-
 static inline long adjust_numa_imbalance(int imbalance,
-				int dst_running, int dst_weight)
+				int dst_running, int dst_weight,
+				int imb_numa_nr)
 {
 	if (!allow_numa_imbalance(dst_running, dst_weight))
 		return imbalance;
 
-	/*
-	 * Allow a small imbalance based on a simple pair of communicating
-	 * tasks that remain local when the destination is lightly loaded.
-	 */
-	if (imbalance <= NUMA_IMBALANCE_MIN)
+	if (imbalance <= imb_numa_nr)
 		return 0;
 
 	return imbalance;
@@ -9397,7 +9398,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		/* Consider allowing a small imbalance between NUMA groups */
 		if (env->sd->flags & SD_NUMA) {
 			env->imbalance = adjust_numa_imbalance(env->imbalance,
-				busiest->sum_nr_running, env->sd->span_weight);
+				busiest->sum_nr_running, env->sd->span_weight,
+				env->sd->imb_numa_nr);
 		}
 
 		return;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index d201a7052a29..bacec575ade2 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2242,6 +2242,43 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 		}
 	}
 
+	/*
+	 * Calculate an allowed NUMA imbalance such that LLCs do not get
+	 * imbalanced.
+	 */
+	for_each_cpu(i, cpu_map) {
+		unsigned int imb = 0;
+		unsigned int imb_span = 1;
+
+		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
+			struct sched_domain *child = sd->child;
+
+			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
+			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
+				struct sched_domain *top = sd;
+				unsigned int llc_sq;
+
+				/*
+				 * nr_llcs = (top->span_weight / llc_weight);
+				 * imb = (child_weight / nr_llcs) >> 2
+				 *
+				 * is equivalent to
+				 *
+				 * imb = (llc_weight^2 / top->span_weight) >> 2
+				 *
+				 */
+				llc_sq = child->span_weight * child->span_weight;
+
+				imb = max(2U, ((llc_sq / top->span_weight) >> 2));
+				imb_span = sd->span_weight;
+
+				sd->imb_numa_nr = imb;
+			} else {
+				sd->imb_numa_nr = imb * (sd->span_weight / imb_span);
+			}
+		}
+	}
+
 	/* Calculate CPU capacity for physical packages and nodes */
 	for (i = nr_cpumask_bits-1; i >= 0; i--) {
 		if (!cpumask_test_cpu(i, cpu_map))
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

* [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-11-25 15:19 [PATCH 0/2] Adjust NUMA imbalance for " Mel Gorman
@ 2021-11-25 15:19 ` Mel Gorman
  0 siblings, 0 replies; 46+ messages in thread
From: Mel Gorman @ 2021-11-25 15:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, LKML, Mel Gorman

Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA
nodes") allowed an imbalance between NUMA nodes such that communicating
tasks would not be pulled apart by the load balancer. This works fine when
there is a 1:1 relationship between LLC and node but can be suboptimal
for multiple LLCs if independent tasks prematurely use CPUs sharing cache.

Zen* has multiple LLCs per node with local memory channels and due to
the allowed imbalance, it's far harder to tune some workloads to run
optimally than it is on hardware that has 1 LLC per node. This patch
adjusts the imbalance on multi-LLC machines to allow an imbalance up to
the point where LLCs should be balanced between nodes.

On a Zen3 machine running STREAM parallelised with OMP to have one
instance per LLC and without binding, the results are

                               vanilla     sched-numaimb-v2r4
MB/sec copy-16    164279.50 (   0.00%)   702962.88 ( 327.91%)
MB/sec scale-16   137487.08 (   0.00%)   397132.98 ( 188.85%)
MB/sec add-16     157561.68 (   0.00%)   638006.32 ( 304.92%)
MB/sec triad-16   154562.04 (   0.00%)   641408.02 ( 314.98%)

STREAM can use directives to force the spread if the OpenMP is new
enough but that doesn't help if an application uses threads and
it's not known in advance how many threads will be created.

                                  vanilla     sched-numaimb-v1r2
Min       Score-16   366090.84 (   0.00%)   401505.65 (   9.67%)
Hmean     Score-16   391416.56 (   0.00%)   452546.28 *  15.62%*
Stddev    Score-16    16452.12 (   0.00%)    31480.31 ( -91.35%)
CoeffVar  Score-16        4.20 (   0.00%)        6.92 ( -64.99%)
Max       Score-16   416666.67 (   0.00%)   483529.77 (  16.05%)

It can also make a big difference for semi-realistic workloads
like specjbb which can execute arbitrary numbers of threads without
advance knowledge of how they should be placed

                                  vanilla     sched-numaimb-v2r5
Hmean     tput-1      73743.05 (   0.00%)    72517.86 (  -1.66%)
Hmean     tput-8     563036.51 (   0.00%)   619505.85 *  10.03%*
Hmean     tput-16   1016590.61 (   0.00%)  1084022.36 (   6.63%)
Hmean     tput-24   1418558.41 (   0.00%)  1443296.06 (   1.74%)
Hmean     tput-32   1608794.22 (   0.00%)  1869822.05 *  16.23%*
Hmean     tput-40   1761338.13 (   0.00%)  2154415.40 *  22.32%*
Hmean     tput-48   2290646.54 (   0.00%)  2561031.20 *  11.80%*
Hmean     tput-56   2463345.12 (   0.00%)  2731874.84 *  10.90%*
Hmean     tput-64   2650213.53 (   0.00%)  2867054.47 (   8.18%)
Hmean     tput-72   2497253.28 (   0.00%)  3017637.28 *  20.84%*
Hmean     tput-80   2820786.72 (   0.00%)  3018947.39 (   7.03%)
Hmean     tput-88   2813541.68 (   0.00%)  3008805.43 *   6.94%*
Hmean     tput-96   2604158.67 (   0.00%)  2948056.40 *  13.21%*
Hmean     tput-104  2713810.62 (   0.00%)  2952327.00 (   8.79%)
Hmean     tput-112  2558425.37 (   0.00%)  2909089.90 *  13.71%*
Hmean     tput-120  2611434.93 (   0.00%)  2773024.11 *   6.19%*
Hmean     tput-128  2706103.22 (   0.00%)  2765678.84 (   2.20%)

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/sched/topology.h |  1 +
 kernel/sched/fair.c            | 26 +++++++++++++++-----------
 kernel/sched/topology.c        | 24 ++++++++++++++++++++++++
 3 files changed, 40 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index c07bfa2d80f2..54f5207154d3 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -93,6 +93,7 @@ struct sched_domain {
 	unsigned int busy_factor;	/* less balancing by factor if busy */
 	unsigned int imbalance_pct;	/* No balance until over watermark */
 	unsigned int cache_nice_tries;	/* Leave cache hot tasks for # tries */
+	unsigned int imb_numa_nr;	/* Nr imbalanced tasks allowed between nodes */
 
 	int nohz_idle;			/* NOHZ IDLE status */
 	int flags;			/* See SD_* */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0a969affca76..64f211879e43 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1489,6 +1489,7 @@ struct task_numa_env {
 
 	int src_cpu, src_nid;
 	int dst_cpu, dst_nid;
+	int imb_numa_nr;
 
 	struct numa_stats src_stats, dst_stats;
 
@@ -1885,7 +1886,7 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 		dst_running = env->dst_stats.nr_running + 1;
 		imbalance = max(0, dst_running - src_running);
 		imbalance = adjust_numa_imbalance(imbalance, dst_running,
-							env->dst_stats.weight);
+						  env->imb_numa_nr);
 
 		/* Use idle CPU if there is no imbalance */
 		if (!imbalance) {
@@ -1950,8 +1951,10 @@ static int task_numa_migrate(struct task_struct *p)
 	 */
 	rcu_read_lock();
 	sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu));
-	if (sd)
+	if (sd) {
 		env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
+		env.imb_numa_nr = sd->imb_numa_nr;
+	}
 	rcu_read_unlock();
 
 	/*
@@ -9046,13 +9049,14 @@ static bool update_pick_idlest(struct sched_group *idlest,
 }
 
 /*
- * Allow a NUMA imbalance if busy CPUs is less than 25% of the domain.
- * This is an approximation as the number of running tasks may not be
- * related to the number of busy CPUs due to sched_setaffinity.
+ * Allow a NUMA imbalance if busy CPUs is less than the allowed
+ * imbalance. This is an approximation as the number of running
+ * tasks may not be related to the number of busy CPUs due to
+ * sched_setaffinity.
  */
-static inline bool allow_numa_imbalance(int dst_running, int dst_weight)
+static inline bool allow_numa_imbalance(int dst_running, int imb_numa_nr)
 {
-	return (dst_running < (dst_weight >> 2));
+	return dst_running < imb_numa_nr;
 }
 
 /*
@@ -9191,7 +9195,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 			 * a real need of migration, periodic load balance will
 			 * take care of it.
 			 */
-			if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight))
+			if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->imb_numa_nr))
 				return NULL;
 		}
 
@@ -9283,9 +9287,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 #define NUMA_IMBALANCE_MIN 2
 
 static inline long adjust_numa_imbalance(int imbalance,
-				int dst_running, int dst_weight)
+				int dst_running, int imb_numa_nr)
 {
-	if (!allow_numa_imbalance(dst_running, dst_weight))
+	if (!allow_numa_imbalance(dst_running, imb_numa_nr))
 		return imbalance;
 
 	/*
@@ -9397,7 +9401,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		/* Consider allowing a small imbalance between NUMA groups */
 		if (env->sd->flags & SD_NUMA) {
 			env->imbalance = adjust_numa_imbalance(env->imbalance,
-				busiest->sum_nr_running, env->sd->span_weight);
+				busiest->sum_nr_running, env->sd->imb_numa_nr);
 		}
 
 		return;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index d201a7052a29..9adeaa89ccb4 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2242,6 +2242,30 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 		}
 	}
 
+	/* Calculate allowed NUMA imbalance */
+	for_each_cpu(i, cpu_map) {
+		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
+			struct sched_domain *child = sd->child;
+
+			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) &&
+			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
+				struct sched_domain *sd_numa = sd;
+				int imb_numa_nr, nr_groups;
+
+				nr_groups = sd->span_weight / child->span_weight;
+				imb_numa_nr = nr_groups / num_online_nodes();
+
+				while (sd_numa) {
+					if (sd_numa->flags & SD_NUMA) {
+						sd_numa->imb_numa_nr = imb_numa_nr;
+						break;
+					}
+					sd_numa = sd_numa->parent;
+				}
+			}
+		}
+	}
+
 	/* Calculate CPU capacity for physical packages and nodes */
 	for (i = nr_cpumask_bits-1; i >= 0; i--) {
 		if (!cpumask_test_cpu(i, cpu_map))
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 46+ messages in thread

end of thread, other threads:[~2022-02-14 11:27 UTC | newest]

Thread overview: 46+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-01 15:18 [PATCH v3 0/2] Adjust NUMA imbalance for multiple LLCs Mel Gorman
2021-12-01 15:18 ` [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group Mel Gorman
2021-12-03  8:38   ` Barry Song
2021-12-03  9:51     ` Gautham R. Shenoy
2021-12-03 10:53     ` Mel Gorman
2021-12-01 15:18 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs Mel Gorman
2021-12-03  8:15   ` Barry Song
2021-12-03 10:50     ` Mel Gorman
2021-12-03 11:14       ` Barry Song
2021-12-03 13:27         ` Mel Gorman
2021-12-04 10:40   ` Peter Zijlstra
2021-12-06  8:48     ` Gautham R. Shenoy
2021-12-06 14:51       ` Peter Zijlstra
2021-12-06 15:12     ` Mel Gorman
2021-12-09 14:23       ` Valentin Schneider
2021-12-09 15:43         ` Mel Gorman
  -- strict thread matches above, loose matches on Subject: below --
2022-02-08  9:43 [PATCH v6 0/2] Adjust NUMA imbalance for " Mel Gorman
2022-02-08  9:43 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans " Mel Gorman
2022-02-08 16:19   ` Gautham R. Shenoy
2022-02-09  5:10   ` K Prateek Nayak
2022-02-09 10:33     ` Mel Gorman
2022-02-11 19:02       ` Jirka Hladky
2022-02-14 10:27   ` Srikar Dronamraju
2022-02-14 11:03   ` Vincent Guittot
2022-02-03 14:46 [PATCH v5 0/2] Adjust NUMA imbalance for " Mel Gorman
2022-02-03 14:46 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans " Mel Gorman
2022-02-04  7:06   ` Srikar Dronamraju
2022-02-04  9:04     ` Mel Gorman
2022-02-04 15:07   ` Nayak, KPrateek (K Prateek)
2022-02-04 16:45     ` Mel Gorman
2021-12-10  9:33 [PATCH v4 0/2] Adjust NUMA imbalance for " Mel Gorman
2021-12-10  9:33 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans " Mel Gorman
2021-12-13  8:28   ` Gautham R. Shenoy
2021-12-13 13:01     ` Mel Gorman
2021-12-13 14:47       ` Gautham R. Shenoy
2021-12-15 11:52         ` Gautham R. Shenoy
2021-12-15 12:25           ` Mel Gorman
2021-12-16 18:33             ` Gautham R. Shenoy
2021-12-20 11:12               ` Mel Gorman
2021-12-21 15:03                 ` Gautham R. Shenoy
2021-12-21 17:13                 ` Vincent Guittot
2021-12-22  8:52                   ` Jirka Hladky
2022-01-04 19:52                     ` Jirka Hladky
2022-01-05 10:42                   ` Mel Gorman
2022-01-05 10:49                     ` Mel Gorman
2022-01-10 15:53                     ` Vincent Guittot
2022-01-12 10:24                       ` Mel Gorman
2021-12-17 19:54   ` Gautham R. Shenoy
2021-11-25 15:19 [PATCH 0/2] Adjust NUMA imbalance for " Mel Gorman
2021-11-25 15:19 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans " Mel Gorman
