* [PATCH v3 0/2] Adjust NUMA imbalance for multiple LLCs
@ 2021-12-01 15:18 Mel Gorman
  2021-12-01 15:18 ` [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group Mel Gorman
  2021-12-01 15:18 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs Mel Gorman
  0 siblings, 2 replies; 21+ messages in thread
From: Mel Gorman @ 2021-12-01 15:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, LKML, Mel Gorman

Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA
nodes") allowed an imbalance between NUMA nodes such that communicating
tasks would not be pulled apart by the load balancer. This works fine when
there is a 1:1 relationship between LLC and node but can be suboptimal
for multiple LLCs if independent tasks prematurely use CPUs sharing cache.

The series addresses two problems -- inconsistent use of scheduler domain
weights and sub-optimal performance when there are many LLCs per NUMA node.

-- 
2.31.1



* [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group
  2021-12-01 15:18 [PATCH v3 0/2] Adjust NUMA imbalance for multiple LLCs Mel Gorman
@ 2021-12-01 15:18 ` Mel Gorman
  2021-12-03  8:38   ` Barry Song
  2021-12-01 15:18 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs Mel Gorman
  1 sibling, 1 reply; 21+ messages in thread
From: Mel Gorman @ 2021-12-01 15:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, LKML, Mel Gorman

find_busiest_group uses the child domain's group weight instead of
the sched_domain's weight that has SD_NUMA set when calculating the
allowed imbalance between NUMA nodes. This is wrong and inconsistent
with find_idlest_group.

This patch uses the SD_NUMA weight in both.

Fixes: c4e8f691d926 ("sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCS")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6e476f6d9435..0a969affca76 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9397,7 +9397,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		/* Consider allowing a small imbalance between NUMA groups */
 		if (env->sd->flags & SD_NUMA) {
 			env->imbalance = adjust_numa_imbalance(env->imbalance,
-				busiest->sum_nr_running, busiest->group_weight);
+				busiest->sum_nr_running, env->sd->span_weight);
 		}
 
 		return;
-- 
2.31.1



* [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-01 15:18 [PATCH v3 0/2] Adjust NUMA imbalance for multiple LLCs Mel Gorman
  2021-12-01 15:18 ` [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group Mel Gorman
@ 2021-12-01 15:18 ` Mel Gorman
  2021-12-03  8:15   ` Barry Song
  2021-12-04 10:40   ` Peter Zijlstra
  1 sibling, 2 replies; 21+ messages in thread
From: Mel Gorman @ 2021-12-01 15:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, LKML, Mel Gorman

Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA
nodes") allowed an imbalance between NUMA nodes such that communicating
tasks would not be pulled apart by the load balancer. This works fine when
there is a 1:1 relationship between LLC and node but can be suboptimal
for multiple LLCs if independent tasks prematurely use CPUs sharing cache.

Zen* has multiple LLCs per node with local memory channels and due to
the allowed imbalance, it's far harder to tune some workloads to run
optimally than it is on hardware that has 1 LLC per node. This patch
adjusts the imbalance on multi-LLC machines to allow an imbalance up to
the point where LLCs should be balanced between nodes.
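
As a rough illustration of the new calculation, here is a minimal
userspace sketch that mirrors the topology.c hunk below; the topology
figures are hypothetical and chosen only to show the arithmetic:

	#include <stdio.h>

	/* max(1, (llc_weight / 2) / (nr_llcs * nr_online_nodes)) */
	static unsigned int imb_numa_nr(unsigned int sd_weight,
					unsigned int llc_weight,
					unsigned int nr_online_nodes)
	{
		/* sd_weight is the span of the first domain above the LLC */
		unsigned int nr_groups = sd_weight / llc_weight;
		unsigned int imb = (llc_weight >> 1) /
				   (nr_groups * nr_online_nodes);

		return imb ? imb : 1;
	}

	int main(void)
	{
		/* a node with 8 LLCs of 16 CPUs, 2 nodes online */
		printf("%u\n", imb_numa_nr(128, 16, 2));	/* -> 1 */
		/* 2 nodes of 20 CPUs with one LLC each, sd spans both */
		printf("%u\n", imb_numa_nr(40, 20, 2));		/* -> 2 */
		return 0;
	}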

On a Zen3 machine running STREAM parallelised with OMP to have one instance
per LLC and without binding, the results are

                            5.16.0-rc1             5.16.0-rc1
                               vanilla     sched-numaimb-v3r1
MB/sec copy-16    166712.18 (   0.00%)   587662.60 ( 252.50%)
MB/sec scale-16   140109.66 (   0.00%)   393528.14 ( 180.87%)
MB/sec add-16     160791.18 (   0.00%)   618622.00 ( 284.74%)
MB/sec triad-16   160043.84 (   0.00%)   589188.40 ( 268.14%)

STREAM can use directives to force the spread if the OpenMP is new
enough but that doesn't help if an application uses threads and
it's not known in advance how many threads will be created.

Coremark is a CPU and cache intensive benchmark parallelised with
threads. When running with 1 thread per instance, the vanilla kernel
allows threads to contend on cache. With the patch:

                               5.16.0-rc1             5.16.0-rc1
                                  vanilla     sched-numaimb-v3r1
Min       Score-16   367816.09 (   0.00%)   403429.15 (   9.68%)
Hmean     Score-16   389627.78 (   0.00%)   451015.49 *  15.76%*
Max       Score-16   416178.96 (   0.00%)   480012.00 (  15.34%)
Stddev    Score-16    17361.82 (   0.00%)    32378.08 ( -86.49%)
CoeffVar  Score-16        4.45 (   0.00%)        7.14 ( -60.57%)

It can also make a big difference for semi-realistic workloads
like specjbb which can execute arbitrary numbers of threads without
advance knowledge of how they should be placed

                               5.16.0-rc1             5.16.0-rc1
                                  vanilla     sched-numaimb-v3r1
Hmean     tput-1      73743.05 (   0.00%)    70258.27 *  -4.73%*
Hmean     tput-8     563036.51 (   0.00%)   591187.39 (   5.00%)
Hmean     tput-16   1016590.61 (   0.00%)  1032311.78 (   1.55%)
Hmean     tput-24   1418558.41 (   0.00%)  1424005.80 (   0.38%)
Hmean     tput-32   1608794.22 (   0.00%)  1907855.80 *  18.59%*
Hmean     tput-40   1761338.13 (   0.00%)  2108162.23 *  19.69%*
Hmean     tput-48   2290646.54 (   0.00%)  2214383.47 (  -3.33%)
Hmean     tput-56   2463345.12 (   0.00%)  2780216.58 *  12.86%*
Hmean     tput-64   2650213.53 (   0.00%)  2598196.66 (  -1.96%)
Hmean     tput-72   2497253.28 (   0.00%)  2998882.47 *  20.09%*
Hmean     tput-80   2820786.72 (   0.00%)  2951655.27 (   4.64%)
Hmean     tput-88   2813541.68 (   0.00%)  3045450.86 *   8.24%*
Hmean     tput-96   2604158.67 (   0.00%)  3035311.91 *  16.56%*
Hmean     tput-104  2713810.62 (   0.00%)  2984270.04 (   9.97%)
Hmean     tput-112  2558425.37 (   0.00%)  2894737.46 *  13.15%*
Hmean     tput-120  2611434.93 (   0.00%)  2781661.01 (   6.52%)
Hmean     tput-128  2706103.22 (   0.00%)  2811447.85 (   3.89%)

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/sched/topology.h |  1 +
 kernel/sched/fair.c            | 26 +++++++++++++++-----------
 kernel/sched/topology.c        | 20 ++++++++++++++++++++
 3 files changed, 36 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index c07bfa2d80f2..54f5207154d3 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -93,6 +93,7 @@ struct sched_domain {
 	unsigned int busy_factor;	/* less balancing by factor if busy */
 	unsigned int imbalance_pct;	/* No balance until over watermark */
 	unsigned int cache_nice_tries;	/* Leave cache hot tasks for # tries */
+	unsigned int imb_numa_nr;	/* Nr imbalanced tasks allowed between nodes */
 
 	int nohz_idle;			/* NOHZ IDLE status */
 	int flags;			/* See SD_* */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0a969affca76..64f211879e43 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1489,6 +1489,7 @@ struct task_numa_env {
 
 	int src_cpu, src_nid;
 	int dst_cpu, dst_nid;
+	int imb_numa_nr;
 
 	struct numa_stats src_stats, dst_stats;
 
@@ -1885,7 +1886,7 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 		dst_running = env->dst_stats.nr_running + 1;
 		imbalance = max(0, dst_running - src_running);
 		imbalance = adjust_numa_imbalance(imbalance, dst_running,
-							env->dst_stats.weight);
+						  env->imb_numa_nr);
 
 		/* Use idle CPU if there is no imbalance */
 		if (!imbalance) {
@@ -1950,8 +1951,10 @@ static int task_numa_migrate(struct task_struct *p)
 	 */
 	rcu_read_lock();
 	sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu));
-	if (sd)
+	if (sd) {
 		env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
+		env.imb_numa_nr = sd->imb_numa_nr;
+	}
 	rcu_read_unlock();
 
 	/*
@@ -9046,13 +9049,14 @@ static bool update_pick_idlest(struct sched_group *idlest,
 }
 
 /*
- * Allow a NUMA imbalance if busy CPUs is less than 25% of the domain.
- * This is an approximation as the number of running tasks may not be
- * related to the number of busy CPUs due to sched_setaffinity.
+ * Allow a NUMA imbalance if busy CPUs is less than the allowed
+ * imbalance. This is an approximation as the number of running
+ * tasks may not be related to the number of busy CPUs due to
+ * sched_setaffinity.
  */
-static inline bool allow_numa_imbalance(int dst_running, int dst_weight)
+static inline bool allow_numa_imbalance(int dst_running, int imb_numa_nr)
 {
-	return (dst_running < (dst_weight >> 2));
+	return dst_running < imb_numa_nr;
 }
 
 /*
@@ -9191,7 +9195,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 			 * a real need of migration, periodic load balance will
 			 * take care of it.
 			 */
-			if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight))
+			if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->imb_numa_nr))
 				return NULL;
 		}
 
@@ -9283,9 +9287,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 #define NUMA_IMBALANCE_MIN 2
 
 static inline long adjust_numa_imbalance(int imbalance,
-				int dst_running, int dst_weight)
+				int dst_running, int imb_numa_nr)
 {
-	if (!allow_numa_imbalance(dst_running, dst_weight))
+	if (!allow_numa_imbalance(dst_running, imb_numa_nr))
 		return imbalance;
 
 	/*
@@ -9397,7 +9401,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		/* Consider allowing a small imbalance between NUMA groups */
 		if (env->sd->flags & SD_NUMA) {
 			env->imbalance = adjust_numa_imbalance(env->imbalance,
-				busiest->sum_nr_running, env->sd->span_weight);
+				busiest->sum_nr_running, env->sd->imb_numa_nr);
 		}
 
 		return;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index d201a7052a29..fee2930745ab 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2242,6 +2242,26 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 		}
 	}
 
+	/* Calculate allowed NUMA imbalance */
+	for_each_cpu(i, cpu_map) {
+		int imb_numa_nr = 0;
+
+		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
+			struct sched_domain *child = sd->child;
+
+			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
+			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
+				int nr_groups;
+
+				nr_groups = sd->span_weight / child->span_weight;
+				imb_numa_nr = max(1U, ((child->span_weight) >> 1) /
+						(nr_groups * num_online_nodes()));
+			}
+
+			sd->imb_numa_nr = imb_numa_nr;
+		}
+	}
+
 	/* Calculate CPU capacity for physical packages and nodes */
 	for (i = nr_cpumask_bits-1; i >= 0; i--) {
 		if (!cpumask_test_cpu(i, cpu_map))
-- 
2.31.1



* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-01 15:18 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs Mel Gorman
@ 2021-12-03  8:15   ` Barry Song
  2021-12-03 10:50     ` Mel Gorman
  2021-12-04 10:40   ` Peter Zijlstra
  1 sibling, 1 reply; 21+ messages in thread
From: Barry Song @ 2021-12-03  8:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML

On Fri, Dec 3, 2021 at 8:27 PM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA
> nodes") allowed an imbalance between NUMA nodes such that communicating
> tasks would not be pulled apart by the load balancer. This works fine when
> there is a 1:1 relationship between LLC and node but can be suboptimal
> for multiple LLCs if independent tasks prematurely use CPUs sharing cache.
>
> Zen* has multiple LLCs per node with local memory channels and due to
> the allowed imbalance, it's far harder to tune some workloads to run
> optimally than it is on hardware that has 1 LLC per node. This patch
> adjusts the imbalance on multi-LLC machines to allow an imbalance up to
> the point where LLCs should be balanced between nodes.
>
> On a Zen3 machine running STREAM parallelised with OMP to have one instance
> per LLC and without binding, the results are
>
>                             5.16.0-rc1             5.16.0-rc1
>                                vanilla     sched-numaimb-v3r1
> MB/sec copy-16    166712.18 (   0.00%)   587662.60 ( 252.50%)
> MB/sec scale-16   140109.66 (   0.00%)   393528.14 ( 180.87%)
> MB/sec add-16     160791.18 (   0.00%)   618622.00 ( 284.74%)
> MB/sec triad-16   160043.84 (   0.00%)   589188.40 ( 268.14%)
>
> STREAM can use directives to force the spread if the OpenMP is new
> enough but that doesn't help if an application uses threads and
> it's not known in advance how many threads will be created.
>
> Coremark is a CPU and cache intensive benchmark parallelised with
> threads. When running with 1 thread per instance, the vanilla kernel
> allows threads to contend on cache. With the patch:
>
>                                5.16.0-rc1             5.16.0-rc1
>                                   vanilla     sched-numaimb-v3r1
> Min       Score-16   367816.09 (   0.00%)   403429.15 (   9.68%)
> Hmean     Score-16   389627.78 (   0.00%)   451015.49 *  15.76%*
> Max       Score-16   416178.96 (   0.00%)   480012.00 (  15.34%)
> Stddev    Score-16    17361.82 (   0.00%)    32378.08 ( -86.49%)
> CoeffVar  Score-16        4.45 (   0.00%)        7.14 ( -60.57%)
>
> It can also make a big difference for semi-realistic workloads
> like specjbb which can execute arbitrary numbers of threads without
> advance knowledge of how they should be placed
>
>                                5.16.0-rc1             5.16.0-rc1
>                                   vanilla     sched-numaimb-v3r1
> Hmean     tput-1      73743.05 (   0.00%)    70258.27 *  -4.73%*
> Hmean     tput-8     563036.51 (   0.00%)   591187.39 (   5.00%)
> Hmean     tput-16   1016590.61 (   0.00%)  1032311.78 (   1.55%)
> Hmean     tput-24   1418558.41 (   0.00%)  1424005.80 (   0.38%)
> Hmean     tput-32   1608794.22 (   0.00%)  1907855.80 *  18.59%*
> Hmean     tput-40   1761338.13 (   0.00%)  2108162.23 *  19.69%*
> Hmean     tput-48   2290646.54 (   0.00%)  2214383.47 (  -3.33%)
> Hmean     tput-56   2463345.12 (   0.00%)  2780216.58 *  12.86%*
> Hmean     tput-64   2650213.53 (   0.00%)  2598196.66 (  -1.96%)
> Hmean     tput-72   2497253.28 (   0.00%)  2998882.47 *  20.09%*
> Hmean     tput-80   2820786.72 (   0.00%)  2951655.27 (   4.64%)
> Hmean     tput-88   2813541.68 (   0.00%)  3045450.86 *   8.24%*
> Hmean     tput-96   2604158.67 (   0.00%)  3035311.91 *  16.56%*
> Hmean     tput-104  2713810.62 (   0.00%)  2984270.04 (   9.97%)
> Hmean     tput-112  2558425.37 (   0.00%)  2894737.46 *  13.15%*
> Hmean     tput-120  2611434.93 (   0.00%)  2781661.01 (   6.52%)
> Hmean     tput-128  2706103.22 (   0.00%)  2811447.85 (   3.89%)
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  include/linux/sched/topology.h |  1 +
>  kernel/sched/fair.c            | 26 +++++++++++++++-----------
>  kernel/sched/topology.c        | 20 ++++++++++++++++++++
>  3 files changed, 36 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index c07bfa2d80f2..54f5207154d3 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -93,6 +93,7 @@ struct sched_domain {
>         unsigned int busy_factor;       /* less balancing by factor if busy */
>         unsigned int imbalance_pct;     /* No balance until over watermark */
>         unsigned int cache_nice_tries;  /* Leave cache hot tasks for # tries */
> +       unsigned int imb_numa_nr;       /* Nr imbalanced tasks allowed between nodes */
>
>         int nohz_idle;                  /* NOHZ IDLE status */
>         int flags;                      /* See SD_* */
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0a969affca76..64f211879e43 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1489,6 +1489,7 @@ struct task_numa_env {
>
>         int src_cpu, src_nid;
>         int dst_cpu, dst_nid;
> +       int imb_numa_nr;
>
>         struct numa_stats src_stats, dst_stats;
>
> @@ -1885,7 +1886,7 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>                 dst_running = env->dst_stats.nr_running + 1;
>                 imbalance = max(0, dst_running - src_running);
>                 imbalance = adjust_numa_imbalance(imbalance, dst_running,
> -                                                       env->dst_stats.weight);
> +                                                 env->imb_numa_nr);
>
>                 /* Use idle CPU if there is no imbalance */
>                 if (!imbalance) {
> @@ -1950,8 +1951,10 @@ static int task_numa_migrate(struct task_struct *p)
>          */
>         rcu_read_lock();
>         sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu));
> -       if (sd)
> +       if (sd) {
>                 env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
> +               env.imb_numa_nr = sd->imb_numa_nr;
> +       }
>         rcu_read_unlock();
>
>         /*
> @@ -9046,13 +9049,14 @@ static bool update_pick_idlest(struct sched_group *idlest,
>  }
>
>  /*
> - * Allow a NUMA imbalance if busy CPUs is less than 25% of the domain.
> - * This is an approximation as the number of running tasks may not be
> - * related to the number of busy CPUs due to sched_setaffinity.
> + * Allow a NUMA imbalance if busy CPUs is less than the allowed
> + * imbalance. This is an approximation as the number of running
> + * tasks may not be related to the number of busy CPUs due to
> + * sched_setaffinity.
>   */
> -static inline bool allow_numa_imbalance(int dst_running, int dst_weight)
> +static inline bool allow_numa_imbalance(int dst_running, int imb_numa_nr)
>  {
> -       return (dst_running < (dst_weight >> 2));
> +       return dst_running < imb_numa_nr;
>  }
>
>  /*
> @@ -9191,7 +9195,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>                          * a real need of migration, periodic load balance will
>                          * take care of it.
>                          */
> -                       if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight))
> +                       if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->imb_numa_nr))
>                                 return NULL;
>                 }
>
> @@ -9283,9 +9287,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  #define NUMA_IMBALANCE_MIN 2
>
>  static inline long adjust_numa_imbalance(int imbalance,
> -                               int dst_running, int dst_weight)
> +                               int dst_running, int imb_numa_nr)
>  {
> -       if (!allow_numa_imbalance(dst_running, dst_weight))
> +       if (!allow_numa_imbalance(dst_running, imb_numa_nr))
>                 return imbalance;
>
>         /*
> @@ -9397,7 +9401,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>                 /* Consider allowing a small imbalance between NUMA groups */
>                 if (env->sd->flags & SD_NUMA) {
>                         env->imbalance = adjust_numa_imbalance(env->imbalance,
> -                               busiest->sum_nr_running, env->sd->span_weight);
> +                               busiest->sum_nr_running, env->sd->imb_numa_nr);
>                 }
>
>                 return;
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index d201a7052a29..fee2930745ab 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2242,6 +2242,26 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>                 }
>         }
>
> +       /* Calculate allowed NUMA imbalance */
> +       for_each_cpu(i, cpu_map) {
> +               int imb_numa_nr = 0;
> +
> +               for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> +                       struct sched_domain *child = sd->child;
> +
> +                       if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> +                           (child->flags & SD_SHARE_PKG_RESOURCES)) {
> +                               int nr_groups;
> +
> +                               nr_groups = sd->span_weight / child->span_weight;
> +                               imb_numa_nr = max(1U, ((child->span_weight) >> 1) /
> +                                               (nr_groups * num_online_nodes()));

Hi Mel, you used to have 25% * numa_weight if node has only one LLC.
for a system with 4 numa,  In case sd has 2 nodes, child is 1 numa node,
then  nr_groups=2, num_online_nodes()=4,  imb_numa_nr will be
child->span_weight/2/2/4?

Does this patch change the behaviour for machines whose numa equals LLC?

> +                       }
> +
> +                       sd->imb_numa_nr = imb_numa_nr;
> +               }
> +       }
> +
>         /* Calculate CPU capacity for physical packages and nodes */
>         for (i = nr_cpumask_bits-1; i >= 0; i--) {
>                 if (!cpumask_test_cpu(i, cpu_map))
> --
> 2.31.1
>

Thanks
Barry


* Re: [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group
  2021-12-01 15:18 ` [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group Mel Gorman
@ 2021-12-03  8:38   ` Barry Song
  2021-12-03  9:51     ` Gautham R. Shenoy
  2021-12-03 10:53     ` Mel Gorman
  0 siblings, 2 replies; 21+ messages in thread
From: Barry Song @ 2021-12-03  8:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML

On Fri, Dec 3, 2021 at 8:27 PM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> find_busiest_group uses the child domain's group weight instead of
> the sched_domain's weight that has SD_NUMA set when calculating the
> allowed imbalance between NUMA nodes. This is wrong and inconsistent
> with find_idlest_group.
>
> This patch uses the SD_NUMA weight in both.
>
> Fixes: c4e8f691d926 ("sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCS")

Hi Mel,

sorry I might be missing something. but I have failed to figure out
where this commit is.

> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  kernel/sched/fair.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6e476f6d9435..0a969affca76 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9397,7 +9397,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>                 /* Consider allowing a small imbalance between NUMA groups */
>                 if (env->sd->flags & SD_NUMA) {
>                         env->imbalance = adjust_numa_imbalance(env->imbalance,
> -                               busiest->sum_nr_running, busiest->group_weight);
> +                               busiest->sum_nr_running, env->sd->span_weight);
>                 }
>
>                 return;
> --
> 2.31.1
>

Thanks
Barry


* Re: [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group
  2021-12-03  8:38   ` Barry Song
@ 2021-12-03  9:51     ` Gautham R. Shenoy
  2021-12-03 10:53     ` Mel Gorman
  1 sibling, 0 replies; 21+ messages in thread
From: Gautham R. Shenoy @ 2021-12-03  9:51 UTC (permalink / raw)
  To: Barry Song
  Cc: Mel Gorman, Peter Zijlstra, Ingo Molnar, Vincent Guittot,
	Valentin Schneider, Aubrey Li, Barry Song, Mike Galbraith,
	Srikar Dronamraju, LKML

Hello Barry,

On Fri, Dec 03, 2021 at 09:38:33PM +1300, Barry Song wrote:
> On Fri, Dec 3, 2021 at 8:27 PM Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > find_busiest_group uses the child domain's group weight instead of
> > the sched_domain's weight that has SD_NUMA set when calculating the
> > allowed imbalance between NUMA nodes. This is wrong and inconsistent
> > with find_idlest_group.
> >
> > This patch uses the SD_NUMA weight in both.
> >
> > Fixes: c4e8f691d926 ("sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCS")

> 
> Hi Mel,
> 
> sorry I might be missing something. but I have failed to figure out
> where this commit is.


I think it is supposed to be

Fixes: 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA nodes")
since it was the commit which introduced the "busiest->group_weight" change. 

> 
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > ---
> >  kernel/sched/fair.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 6e476f6d9435..0a969affca76 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9397,7 +9397,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> >                 /* Consider allowing a small imbalance between NUMA groups */
> >                 if (env->sd->flags & SD_NUMA) {
> >                         env->imbalance = adjust_numa_imbalance(env->imbalance,
> > -                               busiest->sum_nr_running, busiest->group_weight);
> > +                               busiest->sum_nr_running, env->sd->span_weight);



> >                 }
> >
> >                 return;
> > --
> > 2.31.1
> >
> 
> Thanks
> Barry

--
Thanks and Regards
gautham.


* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-03  8:15   ` Barry Song
@ 2021-12-03 10:50     ` Mel Gorman
  2021-12-03 11:14       ` Barry Song
  0 siblings, 1 reply; 21+ messages in thread
From: Mel Gorman @ 2021-12-03 10:50 UTC (permalink / raw)
  To: Barry Song
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML

On Fri, Dec 03, 2021 at 09:15:15PM +1300, Barry Song wrote:
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index d201a7052a29..fee2930745ab 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -2242,6 +2242,26 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> >                 }
> >         }
> >
> > +       /* Calculate allowed NUMA imbalance */
> > +       for_each_cpu(i, cpu_map) {
> > +               int imb_numa_nr = 0;
> > +
> > +               for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> > +                       struct sched_domain *child = sd->child;
> > +
> > +                       if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> > +                           (child->flags & SD_SHARE_PKG_RESOURCES)) {
> > +                               int nr_groups;
> > +
> > +                               nr_groups = sd->span_weight / child->span_weight;
> > +                               imb_numa_nr = max(1U, ((child->span_weight) >> 1) /
> > +                                               (nr_groups * num_online_nodes()));
> 
> Hi Mel, you used to have 25% * numa_weight if node has only one LLC.
> for a system with 4 numa,  In case sd has 2 nodes, child is 1 numa node,
> then  nr_groups=2, num_online_nodes()=4,  imb_numa_nr will be
> child->span_weight/2/2/4?
> 
> Does this patch change the behaviour for machines whose numa equals LLC?
> 

Yes, it changes behaviour. Instead of a flat 25%, it takes into account
the number of LLCs per node and the number of nodes overall.
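
As a purely illustrative example, assume a 2-node machine with 32 CPUs
per node and a single LLC spanning each node:

	old: allow_numa_imbalance(running, sd->span_weight)
	     => imbalance allowed while running < (64 >> 2) = 16
	new: nr_groups = 64 / 32 = 2
	     imb_numa_nr = max(1, (32 >> 1) / (2 * 2)) = 4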

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group
  2021-12-03  8:38   ` Barry Song
  2021-12-03  9:51     ` Gautham R. Shenoy
@ 2021-12-03 10:53     ` Mel Gorman
  1 sibling, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2021-12-03 10:53 UTC (permalink / raw)
  To: Barry Song
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML

On Fri, Dec 03, 2021 at 09:38:33PM +1300, Barry Song wrote:
> On Fri, Dec 3, 2021 at 8:27 PM Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > find_busiest_group uses the child domain's group weight instead of
> > the sched_domain's weight that has SD_NUMA set when calculating the
> > allowed imbalance between NUMA nodes. This is wrong and inconsistent
> > with find_idlest_group.
> >
> > This patch uses the SD_NUMA weight in both.
> >
> > Fixes: c4e8f691d926 ("sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCS")
> 
> Hi Mel,
> 
> sorry I might be missing something. but I have failed to figure out
> where this commit is.
> 

Stupid cut&paste error and failing to check it properly

7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA nodes")

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-03 10:50     ` Mel Gorman
@ 2021-12-03 11:14       ` Barry Song
  2021-12-03 13:27         ` Mel Gorman
  0 siblings, 1 reply; 21+ messages in thread
From: Barry Song @ 2021-12-03 11:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML

On Fri, Dec 3, 2021 at 11:50 PM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Fri, Dec 03, 2021 at 09:15:15PM +1300, Barry Song wrote:
> > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > > index d201a7052a29..fee2930745ab 100644
> > > --- a/kernel/sched/topology.c
> > > +++ b/kernel/sched/topology.c
> > > @@ -2242,6 +2242,26 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> > >                 }
> > >         }
> > >
> > > +       /* Calculate allowed NUMA imbalance */
> > > +       for_each_cpu(i, cpu_map) {
> > > +               int imb_numa_nr = 0;
> > > +
> > > +               for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> > > +                       struct sched_domain *child = sd->child;
> > > +
> > > +                       if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> > > +                           (child->flags & SD_SHARE_PKG_RESOURCES)) {
> > > +                               int nr_groups;
> > > +
> > > +                               nr_groups = sd->span_weight / child->span_weight;
> > > +                               imb_numa_nr = max(1U, ((child->span_weight) >> 1) /
> > > +                                               (nr_groups * num_online_nodes()));
> >
> > Hi Mel, you used to have 25% * numa_weight if node has only one LLC.
> > for a system with 4 numa,  In case sd has 2 nodes, child is 1 numa node,
> > then  nr_groups=2, num_online_nodes()=4,  imb_numa_nr will be
> > child->span_weight/2/2/4?
> >
> > Does this patch change the behaviour for machines whose numa equals LLC?
> >
>
> Yes, it changes behaviour. Instead of a flat 25%, it takes into account
> the number of LLCs per node and the number of nodes overall.

Considering  the number of nodes overall seems to be quite weird to me.
for example, for the below machines

1P * 2DIE = 2NUMA:    node1 - node0
2P * 2DIE = 4NUMA:    node1 - node0  ------ node2 - node3
4P * 2DIE = 8NUMA:    node1 - node0  ------ node2 - node3
                                      node5 - node4  ------ node6 - node7

if one service pins node1 and node0 in all above configurations, it seems in all
different machines, the app will result in different behavior.

the other example is:
in a 2P machine, if one app pins the first two NUMAs, the other app pins
the last two NUMAs, why would the  num_online_nodes() matter to them?
there is no balance requirement between the two P.

>
> --
> Mel Gorman
> SUSE Labs

Thanks
Barry


* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-03 11:14       ` Barry Song
@ 2021-12-03 13:27         ` Mel Gorman
  0 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2021-12-03 13:27 UTC (permalink / raw)
  To: Barry Song
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML

On Sat, Dec 04, 2021 at 12:14:33AM +1300, Barry Song wrote:
> > > Hi Mel, you used to have 25% * numa_weight if node has only one LLC.
> > > for a system with 4 numa,  In case sd has 2 nodes, child is 1 numa node,
> > > then  nr_groups=2, num_online_nodes()=4,  imb_numa_nr will be
> > > child->span_weight/2/2/4?
> > >
> > > Does this patch change the behaviour for machines whose numa equals LLC?
> > >
> >
> > Yes, it changes behaviour. Instead of a flat 25%, it takes into account
> > the number of LLCs per node and the number of nodes overall.
> 
> Considering  the number of nodes overall seems to be quite weird to me.
> for example, for the below machines
> 
> 1P * 2DIE = 2NUMA:    node1 - node0
> 2P * 2DIE = 4NUMA:    node1 - node0  ------ node2 - node3
> 4P * 2DIE = 8NUMA:    node1 - node0  ------ node2 - node3
>                                       node5 - node4  ------ node6 - node7
> 
> if one service pins node1 and node0 in all above configurations, it seems in all
> different machines, the app will result in different behavior.
> 

The intent is to balance between LLCs across the whole machine, hence
accounting for the number of online nodes.

> the other example is:
> in a 2P machine, if one app pins the first two NUMAs, the other app pins
> the last two NUMAs, why would the  num_online_nodes() matter to them?
> there is no balance requirement between the two P.
> 

The previous 25% imbalance also did not take pinning into account and
the choice was somewhat arbitrary.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-01 15:18 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs Mel Gorman
  2021-12-03  8:15   ` Barry Song
@ 2021-12-04 10:40   ` Peter Zijlstra
  2021-12-06  8:48     ` Gautham R. Shenoy
  2021-12-06 15:12     ` Mel Gorman
  1 sibling, 2 replies; 21+ messages in thread
From: Peter Zijlstra @ 2021-12-04 10:40 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, LKML

On Wed, Dec 01, 2021 at 03:18:44PM +0000, Mel Gorman wrote:
> +	/* Calculate allowed NUMA imbalance */
> +	for_each_cpu(i, cpu_map) {
> +		int imb_numa_nr = 0;
> +
> +		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> +			struct sched_domain *child = sd->child;
> +
> +			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> +			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
> +				int nr_groups;
> +
> +				nr_groups = sd->span_weight / child->span_weight;
> +				imb_numa_nr = max(1U, ((child->span_weight) >> 1) /
> +						(nr_groups * num_online_nodes()));
> +			}
> +
> +			sd->imb_numa_nr = imb_numa_nr;
> +		}

OK, so let's see. All domains with SHARE_PKG_RESOURCES set will have
imb_numa_nr = 0, all domains above it will have the same value
calculated here.

So far so good I suppose :-)

Then nr_groups is what it says on the tin; we could've equally well
iterated sd->groups and gotten the same number, but this is simpler.

Now, imb_numa_nr is where the magic happens, the way it's written
doesn't help, but it's something like:

	(child->span_weight / 2) / (nr_groups * num_online_nodes())

With a minimum value of 1. So the larger the system is, or the smaller
the LLCs, the smaller this number gets, right?

So my ivb-ep that has 20 cpus in a LLC and 2 nodes, will get: (20 / 2)
/ (1 * 2) = 10, while the ivb-ex will get: (20/2) / (1*4) = 5.

But a Zen box that has only like 4 CPUs per LLC will have 1, regardless
of how many nodes it has.

Now, I'm thinking this assumes (fairly reasonable) that the level above
LLC is a node, but I don't think we need to assume this, while also not
assuming the balance domain spans the whole machine (yay partitions!).

	for (top = sd; top->parent; top = top->parent)
		;

	nr_llcs = top->span_weight / child->span_weight;
	imb_numa_nr = max(1, child->span_weight / nr_llcs);

which for my ivb-ep gets me:  20 / (40 / 20) = 10
and the Zen system will have:  4 / (huge number) = 1

Now, the exp: a / (b / a) is equivalent to a * (a / b) or a^2/b, so we
can also write the above as:

	(child->span_weight * child->span_weight) / top->span_weight;

Hmm?
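
As a quick userspace check of that identity with the figures above (the
80-CPU ivb-ex total and the 256-CPU box with 4-CPU LLCs are assumptions
for illustration):

	#include <stdio.h>

	/*
	 * child / (top / child) versus child^2 / top -- identical over the
	 * reals, although integer division can make them differ slightly.
	 */
	static void imb(int llc, int top)
	{
		int nr_llcs = top / llc;
		int a = llc / nr_llcs;
		int b = (llc * llc) / top;

		/* both clamped to a minimum of 1 */
		printf("llc=%3d top=%3d -> %2d %2d\n",
		       llc, top, a > 1 ? a : 1, b > 1 ? b : 1);
	}

	int main(void)
	{
		imb(20, 40);	/* ivb-ep, 2 nodes:  10 10 */
		imb(20, 80);	/* ivb-ex, 4 nodes:   5  5 */
		imb(4, 256);	/* 4-CPU LLC box:     1  1 */
		return 0;
	}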


> +	}
> +
>  	/* Calculate CPU capacity for physical packages and nodes */
>  	for (i = nr_cpumask_bits-1; i >= 0; i--) {
>  		if (!cpumask_test_cpu(i, cpu_map))
> -- 
> 2.31.1
> 


* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-04 10:40   ` Peter Zijlstra
@ 2021-12-06  8:48     ` Gautham R. Shenoy
  2021-12-06 14:51       ` Peter Zijlstra
  2021-12-06 15:12     ` Mel Gorman
  1 sibling, 1 reply; 21+ messages in thread
From: Gautham R. Shenoy @ 2021-12-06  8:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Mel Gorman, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML,
	Sadagopan Srinivasan, Krupa Ramakrishnan

Hello Peter, Mel,


On Sat, Dec 04, 2021 at 11:40:56AM +0100, Peter Zijlstra wrote:
> On Wed, Dec 01, 2021 at 03:18:44PM +0000, Mel Gorman wrote:
> > +	/* Calculate allowed NUMA imbalance */
> > +	for_each_cpu(i, cpu_map) {
> > +		int imb_numa_nr = 0;
> > +
> > +		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> > +			struct sched_domain *child = sd->child;
> > +
> > +			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> > +			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
> > +				int nr_groups;
> > +
> > +				nr_groups = sd->span_weight / child->span_weight;
> > +				imb_numa_nr = max(1U, ((child->span_weight) >> 1) /
> > +						(nr_groups * num_online_nodes()));
> > +			}
> > +
> > +			sd->imb_numa_nr = imb_numa_nr;
> > +		}
> 
> OK, so let's see. All domains with SHARE_PKG_RESOURCES set will have
> imb_numa_nr = 0, all domains above it will have the same value
> calculated here.
> 
> So far so good I suppose :-)

Well, we will still have the same imb_numa_nr set for different NUMA
domains which have different distances!

> 
> Then nr_groups is what it says on the tin; we could've equally well
> iterated sd->groups and gotten the same number, but this is simpler.
> 
> Now, imb_numa_nr is where the magic happens, the way it's written
> doesn't help, but it's something like:
> 
> 	(child->span_weight / 2) / (nr_groups * num_online_nodes())
> 
> With a minimum value of 1. So the larger the system is, or the smaller
> the LLCs, the smaller this number gets, right?
> 
> So my ivb-ep that has 20 cpus in a LLC and 2 nodes, will get: (20 / 2)
> / (1 * 2) = 10, while the ivb-ex will get: (20/2) / (1*4) = 5.
> 
> But a Zen box that has only like 4 CPUs per LLC will have 1, regardless
> of how many nodes it has.

That's correct. On a Zen3 box with 2 sockets and 64 cores per
socket, we can configure it with either 1/2/4 Nodes Per Socket
(NPS). The imb_numa_nr value for each of the NPS configurations is as
follows:


NPS1 :
~~~~~~~~
SMT [span_wt=2]
  --> MC [span_wt=16, LLC]
     --> DIE[span_wt=128]
         --> NUMA [span_wt=256, SD_NUMA]

sd->span = 128, child->span = 16, nr_groups = 8, num_online_nodes() = 2
imb_numa_nr = max(1, (16 >> 1)/(8*2)) = max(1, 0.5) = 1.



NPS2 :
~~~~~~~~
SMT [span_wt=2]
 --> MC [span_wt=16,LLC]
     --> NODE[span_wt=64]
         --> NUMA [span_wt=128, SD_NUMA]
             --> NUMA [span_wt=256, SD_NUMA]

sd->span = 64, child->span = 16, nr_groups = 4, num_online_nodes() = 4
imb_numa_nr = max(1, (16 >> 1)/(4*4)) = max(1, 0.5) = 1.


NPS 4:
~~~~~~~
SMT [span_wt=2]
   --> MC [span_wt=16, LLC]
       --> NODE [span_wt=32]
           --> NUMA [span_wt=128, SD_NUMA]
	       --> NUMA [span_wt=256, SD_NUMA]

sd->span = 32, child->span = 16, nr_groups = 2, num_online_nodes() = 8
imb_numa_nr = max(1, (16 >> 1)/(2*8)) = max(1, 0.5) = 1.


While the imb_numa_nr = 1 is good for the NUMA domain within a socket
(the lower NUMA domains in NPS2 and NPS4 modes), it appears to be a
little bit aggressive for the NUMA domain spanning the two sockets. If
we have only a pair of communicating tasks in a socket, we will end up
spreading them across the two sockets with this patch.

> 
> Now, I'm thinking this assumes (fairly reasonable) that the level above
> LLC is a node, but I don't think we need to assume this, while also not
> assuming the balance domain spans the whole machine (yay partitions!).
> 
> 	for (top = sd; top->parent; top = top->parent)
> 		;
> 
> 	nr_llcs = top->span_weight / child->span_weight;
> 	imb_numa_nr = max(1, child->span_weight / nr_llcs);
> 
> which for my ivb-ep gets me:  20 / (40 / 20) = 10
> and the Zen system will have:  4 / (huge number) = 1
> 
> Now, the exp: a / (b / a) is equivalent to a * (a / b) or a^2/b, so we
> can also write the above as:
> 
> 	(child->span_weight * child->span_weight) / top->span_weight;


Assuming that "child" here refers to the LLC domain, on Zen3 we would have
(a) child->span_weight = 16. (b) top->span_weight = 256.

So we get a^2/b = 1.

>
> Hmm?

Last week, I tried a modification on top of Mel's current patch where
we spread tasks between the LLCs of the groups within each NUMA domain
and compute the value of imb_numa_nr per NUMA domain. The idea is to set

    sd->imb_numa_nr = min(1U,
    		         (Number of LLCs in each sd group / Number of sd groups))

This won't work for processors which have a single LLC in a socket,
since the sd->imb_numa_nr will be 1 which is probably too low. FWIW,
with this heuristic, the imb_numa_nr across the different NPS
configurations of a Zen3 server is as follows

NPS1:
    NUMA domain: nr_llcs_per_group = 8. nr_groups = 2. imb_numa_nr = 4

NPS2:
    1st NUMA domain: nr_llcs_per_group = 4. nr_groups = 2. imb_numa_nr = 2.
    2nd NUMA domain: nr_llcs_per_group = 8. nr_groups = 2. imb_numa_nr = 4.

NPS4:
    1st NUMA domain: nr_llcs_per_group = 2. nr_groups = 4. imb_numa_nr = min(1, 2/4) = 1.
    2nd NUMA domain: nr_llcs_per_group = 8. nr_groups = 2. imb_numa_nr = 4.

Thus, at the highest NUMA level (socket), we don't spread across the
two sockets until there are 4 tasks within the socket. If there is
only a pair of communicating tasks in the socket, they will be left
alone within that socket. The STREAM numbers (average of 10 runs; the
following are Triad numbers, and the Copy, Scale and Add numbers show the
same trend) are presented in the tables below. We do see some
degradation for the 4 thread case in NPS2 and NPS4 modes with the
aforementioned approach, but there are gains as well for the 16 and 32
thread cases in NPS4 mode.

NPS1:

==========+===========+================+=================
| Nr      |  Mel v3   | tip/sched/core |   Spread across|
| Stream  |           |                |   LLCs of NUMA |
| Threads |           |                |   groups       |
==========+===========+================+=================
|       4 | 111106.14 |       94849.77 |  111820.02     |
|         |           |      (-14.63%) |  (+0.64%)      | 
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+
|       8 | 175633.00 |      128268.22 |  189705.48     |
|         |           |      (-26.97%) |  (+8.01%)      |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+
|      16 | 252812.87 |      136745.98 |  255577.34     |
|         |           |      (-45.91%) |  (+1.09%)      |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+
|      32 | 248198.43 |      130120.30 |  253266.86     |
|         |           |      (-47.57%) |  (+2.04%)      |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+
|      64 | 244202.33 |      133773.03 |  249449.53     |
|         |           |      (-45.22%) |  (+2.15%)      |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+
|     128 | 248459.85 |      249450.61 |  250346.09     |
|         |           |      (+0.40%)  |  (+0.76%)      |
~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+

NPS2:
==========+===========+================+=================
| Nr      |  Mel v3   | tip/sched/core |   Spread across|
| Stream  |           |                |   LLCs of NUMA |
| Threads |           |                |   groups       |
==========+===========+================+=================
|       4 | 110888.35 |       63067.26 |  104971.36     |
|         |           |      (-43.12%) |   (-5.34%)     |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+
|       8 | 174983.85 |       96226.39 |  177558.65     |
|         |           |      (-45.01%) |   (+1.47%)     |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+
|      16 | 252943.21 |       106474.3 |  260749.60     |
|         |           |      (-57.90%) |  (+1.47%)      |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+
|      32 | 248540.52 |      113864.09 |  254141.33     |
|         |           |      (-54.19%) |  (+2.25%)      |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+
|      64 | 248383.17 |      137101.85 |  255018.52     |
|         |           |      (-44.80%) |  (+2.67%)      |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+
|     128 | 250123.31 |      257031.29 |  254457.13     |
|         |           |      (+2.76%)  |  (+1.73%)      |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+

NPS4:
==========+===========+================+=================
| Nr      |  Mel v3   | tip/sched/core |   Spread across|
| Stream  |           |                |   LLCs of NUMA |
| Threads |           |                |   groups       |
==========+===========+================+=================
|       4 | 108580.91 |       31746.06 |   97585.53	|
|         |           |      (-70.76%) |  (-10.12%)     |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+
|       8 | 150259.94 |       64841.89 |  154954.75     |
|         |           |      (-56.84%) |  (+3.12%)      |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+
|      16 | 234137.41 |      106780.26 |  261005.27     |
|         |           |      (-54.39%) |  (+11.48%)     |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+
|      32 | 241583.06 |      147572.50 |  257004.22     |
|         |           |      (-38.91%) |  (+6.38%)      |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+
|      64 | 248511.64 |      166183.06 |  259599.32     |
|         |           |      (-33.12%) |  (+4.46%)      |
~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~+
|     128 | 252227.34 |      239270.85 |  259117.18     |
|         |           |      (-5.13%)  |  (2.73%)       | 
~~~~~~~~~~+~~~~~~~~~~~~~~~~+~~~~~~~~~~~+~~~~~~~~~~~~~~~~+

--
Thanks and Regards
gautham.


* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-06  8:48     ` Gautham R. Shenoy
@ 2021-12-06 14:51       ` Peter Zijlstra
  0 siblings, 0 replies; 21+ messages in thread
From: Peter Zijlstra @ 2021-12-06 14:51 UTC (permalink / raw)
  To: Gautham R. Shenoy
  Cc: Mel Gorman, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju, LKML,
	Sadagopan Srinivasan, Krupa Ramakrishnan

On Mon, Dec 06, 2021 at 02:18:21PM +0530, Gautham R. Shenoy wrote:
> On Sat, Dec 04, 2021 at 11:40:56AM +0100, Peter Zijlstra wrote:
> > On Wed, Dec 01, 2021 at 03:18:44PM +0000, Mel Gorman wrote:
> > > +	/* Calculate allowed NUMA imbalance */
> > > +	for_each_cpu(i, cpu_map) {
> > > +		int imb_numa_nr = 0;
> > > +
> > > +		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> > > +			struct sched_domain *child = sd->child;
> > > +
> > > +			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> > > +			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
> > > +				int nr_groups;
> > > +
> > > +				nr_groups = sd->span_weight / child->span_weight;
> > > +				imb_numa_nr = max(1U, ((child->span_weight) >> 1) /
> > > +						(nr_groups * num_online_nodes()));
> > > +			}
> > > +
> > > +			sd->imb_numa_nr = imb_numa_nr;
> > > +		}
> > 
> > OK, so let's see. All domains with SHARE_PKG_RESOURCES set will have
> > imb_numa_nr = 0, all domains above it will have the same value
> > calculated here.
> > 
> > So far so good I suppose :-)
> 
> Well, we will still have the same imb_numa_nr set for different NUMA
> domains which have different distances!

Fair enough; that would need making the computation depend on more
things, but that shouldn't be too hard.

> > Then nr_groups is what it says on the tin; we could've equally well
> > iterated sd->groups and gotten the same number, but this is simpler.
> > 
> > Now, imb_numa_nr is where the magic happens, the way it's written
> > doesn't help, but it's something like:
> > 
> > 	(child->span_weight / 2) / (nr_groups * num_online_nodes())
> > 
> > With a minimum value of 1. So the larger the system is, or the smaller
> > the LLCs, the smaller this number gets, right?
> > 
> > So my ivb-ep that has 20 cpus in a LLC and 2 nodes, will get: (20 / 2)
> > / (1 * 2) = 10, while the ivb-ex will get: (20/2) / (1*4) = 5.
> > 
> > But a Zen box that has only like 4 CPUs per LLC will have 1, regardless
> > of how many nodes it has.
> 
> That's correct. On a Zen3 box with 2 sockets and 64 cores per
> socket, we can configure it with either 1/2/4 Nodes Per Socket
> (NPS). The imb_numa_nr value for each of the NPS configurations is as
> follows:

Cute; that's similar to the whole Intel sub-numa-cluster stuff then;
perhaps update the comment that goes with x86_has_numa_in_package ?
Currently that only mentions AMD Magny-Cours which is a few generations
ago.


> NPS 4:
> ~~~~~~~
> SMT [span_wt=2]
>    --> MC [span_wt=16, LLC]
>        --> NODE [span_wt=32]
>            --> NUMA [span_wt=128, SD_NUMA]
> 	       --> NUMA [span_wt=256, SD_NUMA]

OK, so at max nodes you still have at least 2 LLCs per node.

> While the imb_numa_nr = 1 is good for the NUMA domain within a socket
> (the lower NUMA domains in NPS2 and NPS4 modes), it appears to be a
> little bit aggressive for the NUMA domain spanning the two sockets. If
> we have only a pair of communicating tasks in a socket, we will end up
> spreading them across the two sockets with this patch.
> 
> > 
> > Now, I'm thinking this assumes (fairly reasonable) that the level above
> > LLC is a node, but I don't think we need to assume this, while also not
> > assuming the balance domain spans the whole machine (yay partitions!).
> > 
> > 	for (top = sd; top->parent; top = top->parent)
> > 		;
> > 
> > 	nr_llcs = top->span_weight / child->span_weight;
> > 	imb_numa_nr = max(1, child->span_weight / nr_llcs);
> > 
> > which for my ivb-ep gets me:  20 / (40 / 20) = 10
> > and the Zen system will have:  4 / (huge number) = 1
> > 
> > Now, the exp: a / (b / a) is equivalent to a * (a / b) or a^2/b, so we
> > can also write the above as:
> > 
> > 	(child->span_weight * child->span_weight) / top->span_weight;
> 
> 
> Assuming that "child" here refers to the LLC domain, on Zen3 we would have
> (a) child->span_weight = 16. (b) top->span_weight = 256.
> 
> So we get a^2/b = 1.

Yes, it would be in the same place as the current imb_numa_nr
calculation, so child would be the largest domain having
SHARE_PKG_RESOURCES, aka. LLC.

> Last week, I tried a modification on top of Mel's current patch where
> we spread tasks between the LLCs of the groups within each NUMA domain
> and compute the value of imb_numa_nr per NUMA domain. The idea is to set
> 
>     sd->imb_numa_nr = min(1U,
>     		         (Number of LLCs in each sd group / Number of sd groups))

s/min/max/

Which is basically something like:

for_each (sd in NUMA):
  llc_per_group = child->span / llc->span;
  nr_group = sd->span / child->span;
  imb = max(1, llc_per_group / nr_group);
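
(Sanity check with the NPS1 figures above: llc_per_group = 128 / 16 = 8,
nr_group = 256 / 128 = 2, so imb = max(1, 8 / 2) = 4, matching the NPS1
value quoted below.)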


> This won't work for processors which have a single LLC in a socket,
> since the sd->imb_numa_nr will be 1 which is probably too low. 

Right.

> FWIW,
> with this heuristic, the imb_numa_nr across the different NPS
> configurations of a Zen3 server is as follows
> 
> NPS1:
>     NUMA domain: nr_llcs_per_group = 8. nr_groups = 2. imb_numa_nr = 4
> 
> NPS2:
>     1st NUMA domain: nr_llcs_per_group = 4. nr_groups = 2. imb_numa_nr = 2.
>     2nd NUMA domain: nr_llcs_per_group = 8. nr_groups = 2. imb_numa_nr = 4.
> 
> NPS4:
>     1st NUMA domain: nr_llcs_per_group = 2. nr_groups = 4. imb_numa_nr = min(1, 2/4) = 1.
>     2nd NUMA domain: nr_llcs_per_group = 8. nr_groups = 2. imb_numa_nr = 4.
> 
> Thus, at the highest NUMA level (socket), we don't spread across the
> two sockets until there are 4 tasks within the socket. If there is
> only a pair of communicating tasks in the socket, they will be left
> alone within that socket. 

Something that might work:

imb = 0;
imb_span = 1;

for_each_sd(sd) {
	child = sd->child;

	if (!(sd->flags & SD_SPR) && child && (child->flags & SD_SPR)) {

		imb = /* initial magic */
		imb_span = sd->span;
		sd->imb = imb;

	} else if (imb) {
		sd->imb = imb * (sd->span / imb_span);
	}
}

Where we calculate the initial imbalance for the LLC boundary, and then
increase that for subsequent domains based on how often that boundary sd
fits in it. That gives the same progression you have, but also works for
NODE==LLC I think.
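
With the NPS4 spans from earlier and, purely as a stand-in for the
initial magic, an imbalance of 2 at the boundary domain, that
progression would be:

	NODE(32):  imb = 2, imb_span = 32
	NUMA(128): 2 * (128 / 32) = 8
	NUMA(256): 2 * (256 / 32) = 16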


* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-04 10:40   ` Peter Zijlstra
  2021-12-06  8:48     ` Gautham R. Shenoy
@ 2021-12-06 15:12     ` Mel Gorman
  2021-12-09 14:23       ` Valentin Schneider
  1 sibling, 1 reply; 21+ messages in thread
From: Mel Gorman @ 2021-12-06 15:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham R. Shenoy,
	LKML

On Sat, Dec 04, 2021 at 11:40:56AM +0100, Peter Zijlstra wrote:
> On Wed, Dec 01, 2021 at 03:18:44PM +0000, Mel Gorman wrote:
> > +	/* Calculate allowed NUMA imbalance */
> > +	for_each_cpu(i, cpu_map) {
> > +		int imb_numa_nr = 0;
> > +
> > +		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> > +			struct sched_domain *child = sd->child;
> > +
> > +			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> > +			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
> > +				int nr_groups;
> > +
> > +				nr_groups = sd->span_weight / child->span_weight;
> > +				imb_numa_nr = max(1U, ((child->span_weight) >> 1) /
> > +						(nr_groups * num_online_nodes()));
> > +			}
> > +
> > +			sd->imb_numa_nr = imb_numa_nr;
> > +		}
> 
> OK, so let's see. All domains with SHARE_PKG_RESOURCES set will have
> imb_numa_nr = 0, all domains above it will have the same value
> calculated here.
> 
> So far so good I suppose :-)
> 

Good start :)

> Then nr_groups is what it says on the tin; we could've equally well
> iterated sd->groups and gotten the same number, but this is simpler.
> 

I also thought it would be clearer.

> Now, imb_numa_nr is where the magic happens; the way it's written
> doesn't help, but it's something like:
> 
> 	(child->span_weight / 2) / (nr_groups * num_online_nodes())
> 
> With a minimum value of 1. So the larger the system is, or the smaller
> the LLCs, the smaller this number gets, right?
> 

Correct.

> So my ivb-ep that has 20 cpus in a LLC and 2 nodes, will get: (20 / 2)
> / (1 * 2) = 10, while the ivb-ex will get: (20/2) / (1*4) = 5.
> 
> But a Zen box that has only like 4 CPUs per LLC will have 1, regardless
> of how many nodes it has.
> 

The minimum of one was to allow a pair of communicating tasks to remain
on one node even if it's imbalanced.

> Now, I'm thinking this assumes (fairly reasonably) that the level above
> LLC is a node, but I don't think we need to assume this, while also not
> assuming the balance domain spans the whole machine (yay partitions!).
> 
> 	for (top = sd; top->parent; top = top->parent)
> 		;
> 
> 	nr_llcs = top->span_weight / child->span_weight;
> 	imb_numa_nr = max(1, child->span_weight / nr_llcs);
> 
> which for my ivb-ep gets me:  20 / (40 / 20) = 10
> and the Zen system will have:  4 / (huge number) = 1
> 
> Now, the exp: a / (b / a) is equivalent to a * (a / b) or a^2/b, so we
> can also write the above as:
> 
> 	(child->span_weight * child->span_weight) / top->span_weight;
> 

Gautham had similar reasoning for calculating the imbalance at each
higher-level domain instead of using a static value throughout, and
it does make sense. Calculating it at each level and splitting the
imbalance between two domains, this works out as


	/*
	 * Calculate an allowed NUMA imbalance such that LLCs do not get
	 * imbalanced.
	 */
	for_each_cpu(i, cpu_map) {
		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
			struct sched_domain *child = sd->child;

			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
				struct sched_domain *top = sd;
				unsigned int llc_sq;

				/*
				 * nr_llcs = (top->span_weight / llc_weight);
				 * imb = (child_weight / nr_llcs) >> 1
				 *
				 * is equivalent to
				 *
				 * imb = (llc_weight^2 / top->span_weight) >> 1
				 *
				 */
				llc_sq = child->span_weight * child->span_weight;
				while (top) {
					top->imb_numa_nr = max(1U,
						(llc_sq / top->span_weight) >> 1);
					top = top->parent;
				}

				break;
			}
		}
	}

I'll test this and should have results tomorrow.
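
For reference, a quick user-space check of what that loop yields for the
topologies mentioned earlier in the thread (the ivb-ex is assumed to have
4 nodes and the last machine size is made up):

#include <stdio.h>

/* imb = max(1, (llc_weight^2 / sd_weight) >> 1), as in the loop above */
static unsigned int imb_numa_nr(unsigned int llc_weight, unsigned int sd_weight)
{
	unsigned int v = (llc_weight * llc_weight / sd_weight) >> 1;

	return v > 1 ? v : 1;
}

int main(void)
{
	printf("ivb-ep, LLC 20, 40 CPUs: %u\n", imb_numa_nr(20, 40));	/* 5 */
	printf("ivb-ex, LLC 20, 80 CPUs: %u\n", imb_numa_nr(20, 80));	/* 2 */
	printf("LLC 4,     128 CPUs:     %u\n", imb_numa_nr(4, 128));	/* 1 */
	return 0;
}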

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-06 15:12     ` Mel Gorman
@ 2021-12-09 14:23       ` Valentin Schneider
  2021-12-09 15:43         ` Mel Gorman
  0 siblings, 1 reply; 21+ messages in thread
From: Valentin Schneider @ 2021-12-09 14:23 UTC (permalink / raw)
  To: Mel Gorman, Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Aubrey Li, Barry Song,
	Mike Galbraith, Srikar Dronamraju, Gautham R. Shenoy, LKML

On 06/12/21 15:12, Mel Gorman wrote:
> Gautham had similar reasoning for calculating the imbalance at each
> higher-level domain instead of using a static value throughout, and
> it does make sense. Calculating it at each level and splitting the
> imbalance between two domains, this works out as
>
>
>       /*
>        * Calculate an allowed NUMA imbalance such that LLCs do not get
>        * imbalanced.
>        */
>       for_each_cpu(i, cpu_map) {
>               for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
>                       struct sched_domain *child = sd->child;
>
>                       if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
>                           (child->flags & SD_SHARE_PKG_RESOURCES)) {
>                               struct sched_domain *top = sd;
>                               unsigned int llc_sq;
>
>                               /*
>                                * nr_llcs = (top->span_weight / llc_weight);
>                                * imb = (child_weight / nr_llcs) >> 1
>                                *
>                                * is equivalent to
>                                *
>                                * imb = (llc_weight^2 / top->span_weight) >> 1
>                                *
>                                */
>                               llc_sq = child->span_weight * child->span_weight;
>                               while (top) {
>                                       top->imb_numa_nr = max(1U,
>                                               (llc_sq / top->span_weight) >> 1);
>                                       top = top->parent;
>                               }
>
>                               break;
>                       }
>               }
>       }
>

IIRC Peter suggested punting that logic to before domains get degenerated,
but I don't see how that helps here. If you just want to grab the LLC
domain (aka highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES)) and compare
its span with that of its parents, that can happen after the degeneration,
no?
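
Roughly, that would look something like the below -- untested and purely
schematic, reusing the llc_weight^2 formula from above and assuming it
runs once the domains have been attached:

	int cpu;

	for_each_cpu(cpu, cpu_map) {
		struct sched_domain *llc, *sd;
		unsigned int llc_sq;

		llc = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
		if (!llc)
			continue;

		llc_sq = llc->span_weight * llc->span_weight;
		for (sd = llc->parent; sd; sd = sd->parent)
			sd->imb_numa_nr = max(1U, (llc_sq / sd->span_weight) >> 1);
	}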

> I'll test this and should have results tomorrow.
>
> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2021-12-09 14:23       ` Valentin Schneider
@ 2021-12-09 15:43         ` Mel Gorman
  0 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2021-12-09 15:43 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham R. Shenoy,
	LKML

On Thu, Dec 09, 2021 at 02:23:40PM +0000, Valentin Schneider wrote:
> On 06/12/21 15:12, Mel Gorman wrote:
> > Gautham had similar reasoning for calculating the imbalance at each
> > higher-level domain instead of using a static value throughout, and
> > it does make sense. Calculating it at each level and splitting the
> > imbalance between two domains, this works out as
> >
> >
> >       /*
> >        * Calculate an allowed NUMA imbalance such that LLCs do not get
> >        * imbalanced.
> >        */
> >       for_each_cpu(i, cpu_map) {
> >               for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> >                       struct sched_domain *child = sd->child;
> >
> >                       if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> >                           (child->flags & SD_SHARE_PKG_RESOURCES)) {
> >                               struct sched_domain *top = sd;
> >                               unsigned int llc_sq;
> >
> >                               /*
> >                                * nr_llcs = (top->span_weight / llc_weight);
> >                                * imb = (child_weight / nr_llcs) >> 1
> >                                *
> >                                * is equivalent to
> >                                *
> >                                * imb = (llc_weight^2 / top->span_weight) >> 1
> >                                *
> >                                */
> >                               llc_sq = child->span_weight * child->span_weight;
> >                               while (top) {
> >                                       top->imb_numa_nr = max(1U,
> >                                               (llc_sq / top->span_weight) >> 1);
> >                                       top = top->parent;
> >                               }
> >
> >                               break;
> >                       }
> >               }
> >       }
> >
> 
> IIRC Peter suggested punting that logic to before domains get degenerated,
> but I don't see how that helps here. If you just want to grab the LLC
> domain (aka highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES)) and compare
> its span with that of its parents, that can happen after the degeneration,
> no?
> 

I guess we could but I don't see any specific advantage to doing that.

> > I'll test this and should have results tomorrow.
> >

The test results indicated that there was still a problem with
communicating tasks being pulled apart, so I am testing a new version.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group
  2021-12-21 11:32     ` Mel Gorman
@ 2021-12-21 13:05       ` Vincent Guittot
  0 siblings, 0 replies; 21+ messages in thread
From: Vincent Guittot @ 2021-12-21 13:05 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy,
	LKML

On Tue, 21 Dec 2021 at 12:32, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Tue, Dec 21, 2021 at 11:53:50AM +0100, Vincent Guittot wrote:
> > On Fri, 10 Dec 2021 at 10:33, Mel Gorman <mgorman@techsingularity.net> wrote:
> > >
> > > find_busiest_group uses the child domain's group weight instead of
> > > the sched_domain's weight that has SD_NUMA set when calculating the
> > > allowed imbalance between NUMA nodes. This is wrong and inconsistent
> > > with find_idlest_group.
> >
> > I agree that find_busiest_group and find_idlest_group should be
> > consistent and use the same parameters but I wonder if sched_domain's
> > weight is the right one to use instead of the target group's weight.
> >
>
> Ok
>
> > IIRC, the goal of adjust_numa_imbalance is to keep some threads on the
> > same node as long as we consider that there is no performance impact
> > because of sharing  resources as they can even take advantage of
> > locality if they interact.
>
> Yes.
>
> > So we consider that tasks will not be
> > impacted by sharing resources if they use less than 25% of the CPUs of
> > a node. If we use the sd->span_weight instead, we consider that we can
> > pack threads in the same node as long as it uses less than 25% of the
> > CPUs in all nodes.
> >
>
> I assume you mean the target group weight instead of the node. The

I wanted to say that with this patch, we consider the imbalance
acceptable if the number of threads in a node is less than 25% of all
CPUs of all nodes (for this NUMA level), but 25% of all CPUs of all
nodes can be more than the number of CPUs in the group.

So I would have changed find_idlest_group instead of changing find_busiest_group
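
To put rough numbers on that (an invented topology, only for
illustration): with 8 nodes of 16 CPUs each, sd->span_weight is 128, so
the 25% cut-off becomes 32 tasks even though the busiest group only has
16 CPUs, whereas the group weight would keep the cut-off at 4.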

> primary resource we are concerned with is memory bandwidth and it's a
> guess because we do not know for sure where memory channels are or how
> they are configured in this context and it may or may not be correlated
> with groups. I think using the group instead would deserve a series on
> its own after settling on an imbalance number when there are multiple
> LLCs per node.

I haven't looked yet at patch 2 for multiple LLCs per node

>
> --
> Mel Gorman
> SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group
  2021-12-21 10:53   ` Vincent Guittot
@ 2021-12-21 11:32     ` Mel Gorman
  2021-12-21 13:05       ` Vincent Guittot
  0 siblings, 1 reply; 21+ messages in thread
From: Mel Gorman @ 2021-12-21 11:32 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy,
	LKML

On Tue, Dec 21, 2021 at 11:53:50AM +0100, Vincent Guittot wrote:
> On Fri, 10 Dec 2021 at 10:33, Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > find_busiest_group uses the child domain's group weight instead of
> > the sched_domain's weight that has SD_NUMA set when calculating the
> > allowed imbalance between NUMA nodes. This is wrong and inconsistent
> > with find_idlest_group.
> 
> I agree that find_busiest_group and find_idlest_group should be
> consistent and use the same parameters but I wonder if sched_domain's
> weight is the right one to use instead of the target group's weight.
> 

Ok

> IIRC, the goal of adjust_numa_imbalance is to keep some threads on the
> same node as long as we consider that there is no performance impact
> because of sharing  resources as they can even take advantage of
> locality if they interact.

Yes.

> So we consider that tasks will not be
> impacted by sharing resources if they use less than 25% of the CPUs of
> a node. If we use the sd->span_weight instead, we consider that we can
> pack threads in the same node as long as it uses less than 25% of the
> CPUs in all nodes.
> 

I assume you mean the target group weight instead of the node. The
primary resource we are concerned with is memory bandwidth and it's a
guess because we do not know for sure where memory channels are or how
they are configured in this context and it may or may not be correlated
with groups. I think using the group instead would deserve a series on
its own after settling on an imbalance number when there are multiple
LLCs per node.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group
  2021-12-10  9:33 ` [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group Mel Gorman
@ 2021-12-21 10:53   ` Vincent Guittot
  2021-12-21 11:32     ` Mel Gorman
  0 siblings, 1 reply; 21+ messages in thread
From: Vincent Guittot @ 2021-12-21 10:53 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy,
	LKML

On Fri, 10 Dec 2021 at 10:33, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> find_busiest_group uses the child domain's group weight instead of
> the sched_domain's weight that has SD_NUMA set when calculating the
> allowed imbalance between NUMA nodes. This is wrong and inconsistent
> with find_idlest_group.

I agree that find_busiest_group and find_idlest_group should be
consistent and use the same parameters but I wonder if sched_domain's
weight is the right one to use instead of the target group's weight.

IIRC, the goal of adjust_numa_imbalance is to keep some threads on the
same node as long as we consider that there is no performance impact
because of sharing  resources as they can even take advantage of
locality if they interact. So we consider that tasks will not be
impacted by sharing resources if they use less than 25% of the CPUs of
a node. If we use the sd->span_weight instead, we consider that we can
pack threads in the same node as long as it uses less than 25% of the
CPUs in all nodes.
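
For reference, the 25% rule being discussed reads roughly like this (a
simplified sketch of the current fair.c logic from memory, so the exact
code may differ):

#define NUMA_IMBALANCE_MIN 2

/* The destination counts as lightly loaded below 25% of dst_weight */
static bool allow_numa_imbalance(int dst_running, int dst_weight)
{
	return dst_running < (dst_weight >> 2);
}

static long adjust_numa_imbalance(long imbalance, int dst_running,
				  int dst_weight)
{
	if (!allow_numa_imbalance(dst_running, dst_weight))
		return imbalance;

	/* Leave a pair of communicating tasks where they are */
	if (imbalance <= NUMA_IMBALANCE_MIN)
		return 0;

	return imbalance;
}
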

>
> This patch uses the SD_NUMA weight in both.
>
> Fixes: 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA nodes")
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  kernel/sched/fair.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6e476f6d9435..0a969affca76 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9397,7 +9397,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>                 /* Consider allowing a small imbalance between NUMA groups */
>                 if (env->sd->flags & SD_NUMA) {
>                         env->imbalance = adjust_numa_imbalance(env->imbalance,
> -                               busiest->sum_nr_running, busiest->group_weight);
> +                               busiest->sum_nr_running, env->sd->span_weight);
>                 }
>
>                 return;
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group
  2021-12-10  9:33 [PATCH v4 0/2] Adjust NUMA imbalance for " Mel Gorman
@ 2021-12-10  9:33 ` Mel Gorman
  2021-12-21 10:53   ` Vincent Guittot
  0 siblings, 1 reply; 21+ messages in thread
From: Mel Gorman @ 2021-12-10  9:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy,
	LKML, Mel Gorman

find_busiest_group uses the child domain's group weight instead of
the sched_domain's weight that has SD_NUMA set when calculating the
allowed imbalance between NUMA nodes. This is wrong and inconsistent
with find_idlest_group.

This patch uses the SD_NUMA weight in both.

Fixes: 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA nodes")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6e476f6d9435..0a969affca76 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9397,7 +9397,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		/* Consider allowing a small imbalance between NUMA groups */
 		if (env->sd->flags & SD_NUMA) {
 			env->imbalance = adjust_numa_imbalance(env->imbalance,
-				busiest->sum_nr_running, busiest->group_weight);
+				busiest->sum_nr_running, env->sd->span_weight);
 		}
 
 		return;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group
  2021-11-25 15:19 [PATCH 0/2] Adjust NUMA imbalance for multiple LLCs Mel Gorman
@ 2021-11-25 15:19 ` Mel Gorman
  0 siblings, 0 replies; 21+ messages in thread
From: Mel Gorman @ 2021-11-25 15:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, LKML, Mel Gorman

find_busiest_group uses the child domain's group weight instead of
the sched_domain's weight that has SD_NUMA set when calculating the
allowed imbalance between NUMA nodes. This is wrong and inconsistent
with find_idlest_group.

This patch uses the SD_NUMA weight in both.

Fixes: c4e8f691d926 ("sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCS")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6e476f6d9435..0a969affca76 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9397,7 +9397,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		/* Consider allowing a small imbalance between NUMA groups */
 		if (env->sd->flags & SD_NUMA) {
 			env->imbalance = adjust_numa_imbalance(env->imbalance,
-				busiest->sum_nr_running, busiest->group_weight);
+				busiest->sum_nr_running, env->sd->span_weight);
 		}
 
 		return;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2021-12-21 13:06 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-01 15:18 [PATCH v3 0/2] Adjust NUMA imbalance for multiple LLCs Mel Gorman
2021-12-01 15:18 ` [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group Mel Gorman
2021-12-03  8:38   ` Barry Song
2021-12-03  9:51     ` Gautham R. Shenoy
2021-12-03 10:53     ` Mel Gorman
2021-12-01 15:18 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs Mel Gorman
2021-12-03  8:15   ` Barry Song
2021-12-03 10:50     ` Mel Gorman
2021-12-03 11:14       ` Barry Song
2021-12-03 13:27         ` Mel Gorman
2021-12-04 10:40   ` Peter Zijlstra
2021-12-06  8:48     ` Gautham R. Shenoy
2021-12-06 14:51       ` Peter Zijlstra
2021-12-06 15:12     ` Mel Gorman
2021-12-09 14:23       ` Valentin Schneider
2021-12-09 15:43         ` Mel Gorman
  -- strict thread matches above, loose matches on Subject: below --
2021-12-10  9:33 [PATCH v4 0/2] Adjust NUMA imbalance for " Mel Gorman
2021-12-10  9:33 ` [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group Mel Gorman
2021-12-21 10:53   ` Vincent Guittot
2021-12-21 11:32     ` Mel Gorman
2021-12-21 13:05       ` Vincent Guittot
2021-11-25 15:19 [PATCH 0/2] Adjust NUMA imbalance for multiple LLCs Mel Gorman
2021-11-25 15:19 ` [PATCH 1/2] sched/fair: Use weight of SD_NUMA domain in find_busiest_group Mel Gorman
