All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v5 0/2] Adjust NUMA imbalance for multiple LLCs
@ 2022-02-03 14:46 Mel Gorman
  2022-02-03 14:46 ` [PATCH 1/2] sched/fair: Improve consistency of allowed NUMA balance calculations Mel Gorman
  2022-02-03 14:46 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs Mel Gorman
  0 siblings, 2 replies; 8+ messages in thread
From: Mel Gorman @ 2022-02-03 14:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy,
	LKML, Mel Gorman

Changelog since V4
o Scale imbalance based on the top domain that prefers siblings
o Keep allowed imbalance as 2 up to the point where LLCs can be overloaded

Changelog since V3
o Calculate imb_numa_nr for multiple SD_NUMA domains
o Restore behaviour where communicating pairs remain on the same node

Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA
nodes") allowed an imbalance between NUMA nodes such that communicating
tasks would not be pulled apart by the load balancer. This works fine when
there is a 1:1 relationship between LLC and node but can be suboptimal
for multiple LLCs if independent tasks prematurely use CPUs sharing cache.

The series addresses two problems -- inconsistent logic when allowing a
NUMA imbalance and sub-optimal performance when there are many LLCs per
NUMA node.

 include/linux/sched/topology.h |  1 +
 kernel/sched/fair.c            | 30 ++++++++++---------
 kernel/sched/topology.c        | 53 ++++++++++++++++++++++++++++++++++
 3 files changed, 71 insertions(+), 13 deletions(-)

-- 
2.31.1

Mel Gorman (2):
  sched/fair: Improve consistency of allowed NUMA balance calculations
  sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans
    multiple LLCs

 include/linux/sched/topology.h |  1 +
 kernel/sched/fair.c            | 30 ++++++++++---------
 kernel/sched/topology.c        | 53 ++++++++++++++++++++++++++++++++++
 3 files changed, 71 insertions(+), 13 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 1/2] sched/fair: Improve consistency of allowed NUMA balance calculations
  2022-02-03 14:46 [PATCH v5 0/2] Adjust NUMA imbalance for multiple LLCs Mel Gorman
@ 2022-02-03 14:46 ` Mel Gorman
  2022-02-03 14:46 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs Mel Gorman
  1 sibling, 0 replies; 8+ messages in thread
From: Mel Gorman @ 2022-02-03 14:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy,
	LKML, Mel Gorman

There are inconsistencies when determining if a NUMA imbalance is allowed
that should be corrected.

o allow_numa_imbalance changes types and is not always examining
  the destination group so both the type should be corrected as
  well as the naming.
o find_idlest_group uses the sched_domain's weight instead of the
  group weight which is different to find_busiest_group
o find_busiest_group uses the source group instead of the destination
  which is different to task_numa_find_cpu
o Both find_idlest_group and find_busiest_group should account
  for the number of running tasks if a move was allowed to be
  consistent with task_numa_find_cpu

Fixes: 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA nodes")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 095b0aa378df..4592ccf82c34 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9003,9 +9003,10 @@ static bool update_pick_idlest(struct sched_group *idlest,
  * This is an approximation as the number of running tasks may not be
  * related to the number of busy CPUs due to sched_setaffinity.
  */
-static inline bool allow_numa_imbalance(int dst_running, int dst_weight)
+static inline bool
+allow_numa_imbalance(unsigned int running, unsigned int weight)
 {
-	return (dst_running < (dst_weight >> 2));
+	return (running < (weight >> 2));
 }
 
 /*
@@ -9139,12 +9140,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 				return idlest;
 #endif
 			/*
-			 * Otherwise, keep the task on this node to stay close
-			 * its wakeup source and improve locality. If there is
-			 * a real need of migration, periodic load balance will
-			 * take care of it.
+			 * Otherwise, keep the task close to the wakeup source
+			 * and improve locality if the number of running tasks
+			 * would remain below threshold where an imbalance is
+			 * allowed. If there is a real need of migration,
+			 * periodic load balance will take care of it.
 			 */
-			if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight))
+			if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, local_sgs.group_weight))
 				return NULL;
 		}
 
@@ -9350,7 +9352,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		/* Consider allowing a small imbalance between NUMA groups */
 		if (env->sd->flags & SD_NUMA) {
 			env->imbalance = adjust_numa_imbalance(env->imbalance,
-				busiest->sum_nr_running, busiest->group_weight);
+				local->sum_nr_running + 1, local->group_weight);
 		}
 
 		return;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2022-02-03 14:46 [PATCH v5 0/2] Adjust NUMA imbalance for multiple LLCs Mel Gorman
  2022-02-03 14:46 ` [PATCH 1/2] sched/fair: Improve consistency of allowed NUMA balance calculations Mel Gorman
@ 2022-02-03 14:46 ` Mel Gorman
  2022-02-04  1:30   ` kernel test robot
                     ` (2 more replies)
  1 sibling, 3 replies; 8+ messages in thread
From: Mel Gorman @ 2022-02-03 14:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy,
	LKML, Mel Gorman

Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA
nodes") allowed an imbalance between NUMA nodes such that communicating
tasks would not be pulled apart by the load balancer. This works fine when
there is a 1:1 relationship between LLC and node but can be suboptimal
for multiple LLCs if independent tasks prematurely use CPUs sharing cache.

Zen* has multiple LLCs per node with local memory channels and due to
the allowed imbalance, it's far harder to tune some workloads to run
optimally than it is on hardware that has 1 LLC per node. This patch
allows an imbalance to exist up to the point where LLCs should be balanced
between nodes.

On a Zen3 machine running STREAM parallelised with OMP to have on instance
per LLC the results and without binding, the results are

                            5.17.0-rc0             5.17.0-rc0
                               vanilla       sched-numaimb-v5
MB/sec copy-16    162596.94 (   0.00%)   501967.12 ( 208.72%)
MB/sec scale-16   136901.28 (   0.00%)   376531.50 ( 175.04%)
MB/sec add-16     157300.70 (   0.00%)   569997.42 ( 262.36%)
MB/sec triad-16   151446.88 (   0.00%)   553204.54 ( 265.28%)

STREAM can use directives to force the spread if the OpenMP is new
enough but that doesn't help if an application uses threads and
it's not known in advance how many threads will be created.

Coremark is a CPU and cache intensive benchmark parallelised with
threads. When running with 1 thread per core, the vanilla kernel
allows threads to contend on cache. With the patch;

                               5.17.0-rc0             5.17.0-rc0
                                  vanilla       sched-numaimb-v5
Min       Score-16   368239.36 (   0.00%)   400876.92 (   8.86%)
Hmean     Score-16   388607.33 (   0.00%)   441447.30 *  13.60%*
Max       Score-16   408945.69 (   0.00%)   478826.87 (  17.09%)
Stddev    Score-16    15247.04 (   0.00%)    34061.76 (-123.40%)
CoeffVar  Score-16        3.92 (   0.00%)        7.67 ( -95.82%)

It can also make a big difference for semi-realistic workloads
like specjbb which can execute arbitrary numbers of threads without
advance knowledge of how they should be placed

                               5.17.0-rc0             5.17.0-rc0
                                  vanilla       sched-numaimb-v5
Hmean     tput-1      71631.55 (   0.00%)    70383.46 (  -1.74%)
Hmean     tput-8     582758.78 (   0.00%)   607290.89 *   4.21%*
Hmean     tput-16   1020372.75 (   0.00%)  1031257.25 (   1.07%)
Hmean     tput-24   1416430.67 (   0.00%)  1587576.33 *  12.08%*
Hmean     tput-32   1687702.72 (   0.00%)  1724207.51 (   2.16%)
Hmean     tput-40   1798094.90 (   0.00%)  1983053.56 *  10.29%*
Hmean     tput-48   1972731.77 (   0.00%)  2157461.70 (   9.36%)
Hmean     tput-56   2386872.38 (   0.00%)  2193237.42 (  -8.11%)
Hmean     tput-64   2536954.17 (   0.00%)  2588741.08 (   2.04%)
Hmean     tput-72   2585071.36 (   0.00%)  2654776.36 (   2.70%)
Hmean     tput-80   2960523.94 (   0.00%)  2894657.12 (  -2.22%)
Hmean     tput-88   3061408.57 (   0.00%)  2903167.72 (  -5.17%)
Hmean     tput-96   3052394.82 (   0.00%)  2872605.46 (  -5.89%)
Hmean     tput-104  2997814.76 (   0.00%)  3013660.26 (   0.53%)
Hmean     tput-112  2955353.29 (   0.00%)  3029122.16 (   2.50%)
Hmean     tput-120  2889770.71 (   0.00%)  2957739.88 (   2.35%)
Hmean     tput-128  2871713.84 (   0.00%)  2912410.18 (   1.42%)

In general, the standard deviation figures also are a lot more
stable.

Similarly, for embarassingly parallel problems like NPB-ep, there are
improvements due to better spreading across LLC when the machine is not
fully utilised.

                              vanilla    sched-numaimb-v5r12
Min       ep.D       31.79 (   0.00%)       26.11 (  17.87%)
Amean     ep.D       31.86 (   0.00%)       26.26 *  17.58%*
Stddev    ep.D        0.07 (   0.00%)        0.18 (-157.54%)
CoeffVar  ep.D        0.22 (   0.00%)        0.69 (-212.46%)
Max       ep.D       31.93 (   0.00%)       26.46 (  17.13%)

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 include/linux/sched/topology.h |  1 +
 kernel/sched/fair.c            | 22 +++++++-------
 kernel/sched/topology.c        | 53 ++++++++++++++++++++++++++++++++++
 3 files changed, 66 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 8054641c0a7b..56cffe42abbc 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -93,6 +93,7 @@ struct sched_domain {
 	unsigned int busy_factor;	/* less balancing by factor if busy */
 	unsigned int imbalance_pct;	/* No balance until over watermark */
 	unsigned int cache_nice_tries;	/* Leave cache hot tasks for # tries */
+	unsigned int imb_numa_nr;	/* Nr running tasks that allows a NUMA imbalance */
 
 	int nohz_idle;			/* NOHZ IDLE status */
 	int flags;			/* See SD_* */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4592ccf82c34..86abf97a8df6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1489,6 +1489,7 @@ struct task_numa_env {
 
 	int src_cpu, src_nid;
 	int dst_cpu, dst_nid;
+	int imb_numa_nr;
 
 	struct numa_stats src_stats, dst_stats;
 
@@ -1503,7 +1504,7 @@ struct task_numa_env {
 static unsigned long cpu_load(struct rq *rq);
 static unsigned long cpu_runnable(struct rq *rq);
 static inline long adjust_numa_imbalance(int imbalance,
-					int dst_running, int dst_weight);
+					int dst_running, int imb_numa_nr);
 
 static inline enum
 numa_type numa_classify(unsigned int imbalance_pct,
@@ -1884,7 +1885,7 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 		dst_running = env->dst_stats.nr_running + 1;
 		imbalance = max(0, dst_running - src_running);
 		imbalance = adjust_numa_imbalance(imbalance, dst_running,
-							env->dst_stats.weight);
+						  env->imb_numa_nr);
 
 		/* Use idle CPU if there is no imbalance */
 		if (!imbalance) {
@@ -1949,8 +1950,10 @@ static int task_numa_migrate(struct task_struct *p)
 	 */
 	rcu_read_lock();
 	sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu));
-	if (sd)
+	if (sd) {
 		env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
+		env.imb_numa_nr = sd->imb_numa_nr;
+	}
 	rcu_read_unlock();
 
 	/*
@@ -9003,10 +9006,9 @@ static bool update_pick_idlest(struct sched_group *idlest,
  * This is an approximation as the number of running tasks may not be
  * related to the number of busy CPUs due to sched_setaffinity.
  */
-static inline bool
-allow_numa_imbalance(unsigned int running, unsigned int weight)
+static inline bool allow_numa_imbalance(int running, int imb_numa_nr)
 {
-	return (running < (weight >> 2));
+	return running < imb_numa_nr;
 }
 
 /*
@@ -9146,7 +9148,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 			 * allowed. If there is a real need of migration,
 			 * periodic load balance will take care of it.
 			 */
-			if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, local_sgs.group_weight))
+			if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr))
 				return NULL;
 		}
 
@@ -9238,9 +9240,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 #define NUMA_IMBALANCE_MIN 2
 
 static inline long adjust_numa_imbalance(int imbalance,
-				int dst_running, int dst_weight)
+				int dst_running, int imb_numa_nr)
 {
-	if (!allow_numa_imbalance(dst_running, dst_weight))
+	if (!allow_numa_imbalance(dst_running, imb_numa_nr))
 		return imbalance;
 
 	/*
@@ -9352,7 +9354,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		/* Consider allowing a small imbalance between NUMA groups */
 		if (env->sd->flags & SD_NUMA) {
 			env->imbalance = adjust_numa_imbalance(env->imbalance,
-				local->sum_nr_running + 1, local->group_weight);
+				local->sum_nr_running + 1, env->sd->imb_numa_nr);
 		}
 
 		return;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index d201a7052a29..e6cd55951304 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -2242,6 +2242,59 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 		}
 	}
 
+	/*
+	 * Calculate an allowed NUMA imbalance such that LLCs do not get
+	 * imbalanced.
+	 */
+	for_each_cpu(i, cpu_map) {
+		unsigned int imb = 0;
+		unsigned int imb_span = 1;
+
+		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
+			struct sched_domain *child = sd->child;
+
+			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
+			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
+				struct sched_domain *top, *top_p;
+				unsigned int nr_llcs;
+
+				/*
+				 * For a single LLC per node, allow an
+				 * imbalance up to 25% of the node. This is an
+				 * arbitrary cutoff based on SMT-2 to balance
+				 * between memory bandwidth and avoiding
+				 * premature sharing of HT resources and SMT-4
+				 * or SMT-8 *may* benefit from a different
+				 * cutoff.
+				 *
+				 * For multiple LLCs, allow an imbalance
+				 * until multiple tasks would share an LLC
+				 * on one node while LLCs on another node
+				 * remain idle.
+				 */
+				nr_llcs = sd->span_weight / child->span_weight;
+				if (nr_llcs == 1)
+					imb = sd->span_weight >> 2;
+				else
+					imb = nr_llcs;
+				sd->imb_numa_nr = imb;
+
+				/* Set span based on the first NUMA domain. */
+				top = sd;
+				top_p = top->parent;
+				while (top_p && !(top_p->flags & SD_NUMA)) {
+					top = top->parent;
+					top_p = top->parent;
+				}
+				imb_span = top_p ? top_p->span_weight : sd->span_weight;
+			} else {
+				int factor = max(1U, (sd->span_weight / imb_span));
+
+				sd->imb_numa_nr = imb * factor;
+			}
+		}
+	}
+
 	/* Calculate CPU capacity for physical packages and nodes */
 	for (i = nr_cpumask_bits-1; i >= 0; i--) {
 		if (!cpumask_test_cpu(i, cpu_map))
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2022-02-03 14:46 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs Mel Gorman
@ 2022-02-04  1:30   ` kernel test robot
  2022-02-04  7:06   ` Srikar Dronamraju
  2022-02-04 15:07   ` Nayak, KPrateek (K Prateek)
  2 siblings, 0 replies; 8+ messages in thread
From: kernel test robot @ 2022-02-04  1:30 UTC (permalink / raw)
  To: kbuild-all

[-- Attachment #1: Type: text/plain, Size: 18268 bytes --]

Hi Mel,

I love your patch! Perhaps something to improve:

[auto build test WARNING on tip/sched/core]
[also build test WARNING on tip/master linux/master linus/master v5.17-rc2 next-20220203]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Mel-Gorman/Adjust-NUMA-imbalance-for-multiple-LLCs/20220203-225129
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git ec2444530612a886b406e2830d7f314d1a07d4bb
config: parisc-randconfig-s031-20220131 (https://download.01.org/0day-ci/archive/20220204/202202040638.UP73q8Dl-lkp(a)intel.com/config)
compiler: hppa-linux-gcc (GCC) 11.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # apt-get install sparse
        # sparse version: v0.6.4-dirty
        # https://github.com/0day-ci/linux/commit/81a6d8d3e9199b22ab27ea3ade91a0b0c18d0811
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Mel-Gorman/Adjust-NUMA-imbalance-for-multiple-LLCs/20220203-225129
        git checkout 81a6d8d3e9199b22ab27ea3ade91a0b0c18d0811
        # save the config file to linux build tree
        mkdir build_dir
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=parisc SHELL=/bin/bash kernel/sched/

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>


sparse warnings: (new ones prefixed by >>)
   kernel/sched/topology.c:461:19: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct perf_domain *pd @@     got struct perf_domain [noderef] __rcu *pd @@
   kernel/sched/topology.c:461:19: sparse:     expected struct perf_domain *pd
   kernel/sched/topology.c:461:19: sparse:     got struct perf_domain [noderef] __rcu *pd
   kernel/sched/topology.c:623:49: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct sched_domain *parent @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/topology.c:623:49: sparse:     expected struct sched_domain *parent
   kernel/sched/topology.c:623:49: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c:694:50: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct sched_domain *parent @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/topology.c:694:50: sparse:     expected struct sched_domain *parent
   kernel/sched/topology.c:694:50: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c:701:55: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain [noderef] __rcu *[noderef] __rcu child @@     got struct sched_domain *[assigned] tmp @@
   kernel/sched/topology.c:701:55: sparse:     expected struct sched_domain [noderef] __rcu *[noderef] __rcu child
   kernel/sched/topology.c:701:55: sparse:     got struct sched_domain *[assigned] tmp
   kernel/sched/topology.c:711:29: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] tmp @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/topology.c:711:29: sparse:     expected struct sched_domain *[assigned] tmp
   kernel/sched/topology.c:711:29: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c:716:20: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/topology.c:716:20: sparse:     expected struct sched_domain *sd
   kernel/sched/topology.c:716:20: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c:737:13: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] tmp @@     got struct sched_domain [noderef] __rcu *sd @@
   kernel/sched/topology.c:737:13: sparse:     expected struct sched_domain *[assigned] tmp
   kernel/sched/topology.c:737:13: sparse:     got struct sched_domain [noderef] __rcu *sd
   kernel/sched/topology.c:899:70: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct sched_domain *sd @@     got struct sched_domain [noderef] __rcu *child @@
   kernel/sched/topology.c:899:70: sparse:     expected struct sched_domain *sd
   kernel/sched/topology.c:899:70: sparse:     got struct sched_domain [noderef] __rcu *child
   kernel/sched/topology.c:928:59: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct sched_domain *sd @@     got struct sched_domain [noderef] __rcu *child @@
   kernel/sched/topology.c:928:59: sparse:     expected struct sched_domain *sd
   kernel/sched/topology.c:928:59: sparse:     got struct sched_domain [noderef] __rcu *child
   kernel/sched/topology.c:974:57: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct sched_domain *sd @@     got struct sched_domain [noderef] __rcu *child @@
   kernel/sched/topology.c:974:57: sparse:     expected struct sched_domain *sd
   kernel/sched/topology.c:974:57: sparse:     got struct sched_domain [noderef] __rcu *child
   kernel/sched/topology.c:976:25: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *sibling @@     got struct sched_domain [noderef] __rcu *child @@
   kernel/sched/topology.c:976:25: sparse:     expected struct sched_domain *sibling
   kernel/sched/topology.c:976:25: sparse:     got struct sched_domain [noderef] __rcu *child
   kernel/sched/topology.c:984:55: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct sched_domain *sd @@     got struct sched_domain [noderef] __rcu *child @@
   kernel/sched/topology.c:984:55: sparse:     expected struct sched_domain *sd
   kernel/sched/topology.c:984:55: sparse:     got struct sched_domain [noderef] __rcu *child
   kernel/sched/topology.c:986:25: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *sibling @@     got struct sched_domain [noderef] __rcu *child @@
   kernel/sched/topology.c:986:25: sparse:     expected struct sched_domain *sibling
   kernel/sched/topology.c:986:25: sparse:     got struct sched_domain [noderef] __rcu *child
   kernel/sched/topology.c:1056:62: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct sched_domain *sd @@     got struct sched_domain [noderef] __rcu *child @@
   kernel/sched/topology.c:1056:62: sparse:     expected struct sched_domain *sd
   kernel/sched/topology.c:1056:62: sparse:     got struct sched_domain [noderef] __rcu *child
   kernel/sched/topology.c:1160:40: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct sched_domain *child @@     got struct sched_domain [noderef] __rcu *child @@
   kernel/sched/topology.c:1160:40: sparse:     expected struct sched_domain *child
   kernel/sched/topology.c:1160:40: sparse:     got struct sched_domain [noderef] __rcu *child
   kernel/sched/topology.c:1571:43: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct sched_domain [noderef] __rcu *child @@     got struct sched_domain *child @@
   kernel/sched/topology.c:1571:43: sparse:     expected struct sched_domain [noderef] __rcu *child
   kernel/sched/topology.c:1571:43: sparse:     got struct sched_domain *child
   kernel/sched/topology.c:2130:31: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain [noderef] __rcu *parent @@     got struct sched_domain *sd @@
   kernel/sched/topology.c:2130:31: sparse:     expected struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c:2130:31: sparse:     got struct sched_domain *sd
   kernel/sched/topology.c:2233:57: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/topology.c:2233:57: sparse:     expected struct sched_domain *[assigned] sd
   kernel/sched/topology.c:2233:57: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c:2254:56: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct sched_domain *child @@     got struct sched_domain [noderef] __rcu *child @@
   kernel/sched/topology.c:2254:56: sparse:     expected struct sched_domain *child
   kernel/sched/topology.c:2254:56: sparse:     got struct sched_domain [noderef] __rcu *child
>> kernel/sched/topology.c:2284:39: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *top_p @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/topology.c:2284:39: sparse:     expected struct sched_domain *top_p
   kernel/sched/topology.c:2284:39: sparse:     got struct sched_domain [noderef] __rcu *parent
>> kernel/sched/topology.c:2286:45: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] top @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/topology.c:2286:45: sparse:     expected struct sched_domain *[assigned] top
   kernel/sched/topology.c:2286:45: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c:2287:47: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *top_p @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/topology.c:2287:47: sparse:     expected struct sched_domain *top_p
   kernel/sched/topology.c:2287:47: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c:2253:57: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/topology.c:2253:57: sparse:     expected struct sched_domain *[assigned] sd
   kernel/sched/topology.c:2253:57: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c:2303:57: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/topology.c:2303:57: sparse:     expected struct sched_domain *[assigned] sd
   kernel/sched/topology.c:2303:57: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c: note: in included file:
   kernel/sched/sched.h:1744:9: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/sched.h:1744:9: sparse:     expected struct sched_domain *[assigned] sd
   kernel/sched/sched.h:1744:9: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/sched.h:1757:9: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/sched.h:1757:9: sparse:     expected struct sched_domain *[assigned] sd
   kernel/sched/sched.h:1757:9: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/sched.h:1744:9: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/sched.h:1744:9: sparse:     expected struct sched_domain *[assigned] sd
   kernel/sched/sched.h:1744:9: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/sched.h:1757:9: sparse: sparse: incorrect type in assignment (different address spaces) @@     expected struct sched_domain *[assigned] sd @@     got struct sched_domain [noderef] __rcu *parent @@
   kernel/sched/sched.h:1757:9: sparse:     expected struct sched_domain *[assigned] sd
   kernel/sched/sched.h:1757:9: sparse:     got struct sched_domain [noderef] __rcu *parent
   kernel/sched/topology.c:929:31: sparse: sparse: dereference of noderef expression
   kernel/sched/topology.c:1592:19: sparse: sparse: dereference of noderef expression

vim +2284 kernel/sched/topology.c

  2186	
  2187	/*
  2188	 * Build sched domains for a given set of CPUs and attach the sched domains
  2189	 * to the individual CPUs
  2190	 */
  2191	static int
  2192	build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *attr)
  2193	{
  2194		enum s_alloc alloc_state = sa_none;
  2195		struct sched_domain *sd;
  2196		struct s_data d;
  2197		struct rq *rq = NULL;
  2198		int i, ret = -ENOMEM;
  2199		bool has_asym = false;
  2200	
  2201		if (WARN_ON(cpumask_empty(cpu_map)))
  2202			goto error;
  2203	
  2204		alloc_state = __visit_domain_allocation_hell(&d, cpu_map);
  2205		if (alloc_state != sa_rootdomain)
  2206			goto error;
  2207	
  2208		/* Set up domains for CPUs specified by the cpu_map: */
  2209		for_each_cpu(i, cpu_map) {
  2210			struct sched_domain_topology_level *tl;
  2211	
  2212			sd = NULL;
  2213			for_each_sd_topology(tl) {
  2214	
  2215				if (WARN_ON(!topology_span_sane(tl, cpu_map, i)))
  2216					goto error;
  2217	
  2218				sd = build_sched_domain(tl, cpu_map, attr, sd, i);
  2219	
  2220				has_asym |= sd->flags & SD_ASYM_CPUCAPACITY;
  2221	
  2222				if (tl == sched_domain_topology)
  2223					*per_cpu_ptr(d.sd, i) = sd;
  2224				if (tl->flags & SDTL_OVERLAP)
  2225					sd->flags |= SD_OVERLAP;
  2226				if (cpumask_equal(cpu_map, sched_domain_span(sd)))
  2227					break;
  2228			}
  2229		}
  2230	
  2231		/* Build the groups for the domains */
  2232		for_each_cpu(i, cpu_map) {
  2233			for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
  2234				sd->span_weight = cpumask_weight(sched_domain_span(sd));
  2235				if (sd->flags & SD_OVERLAP) {
  2236					if (build_overlap_sched_groups(sd, i))
  2237						goto error;
  2238				} else {
  2239					if (build_sched_groups(sd, i))
  2240						goto error;
  2241				}
  2242			}
  2243		}
  2244	
  2245		/*
  2246		 * Calculate an allowed NUMA imbalance such that LLCs do not get
  2247		 * imbalanced.
  2248		 */
  2249		for_each_cpu(i, cpu_map) {
  2250			unsigned int imb = 0;
  2251			unsigned int imb_span = 1;
  2252	
  2253			for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
  2254				struct sched_domain *child = sd->child;
  2255	
  2256				if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
  2257				    (child->flags & SD_SHARE_PKG_RESOURCES)) {
  2258					struct sched_domain *top, *top_p;
  2259					unsigned int nr_llcs;
  2260	
  2261					/*
  2262					 * For a single LLC per node, allow an
  2263					 * imbalance up to 25% of the node. This is an
  2264					 * arbitrary cutoff based on SMT-2 to balance
  2265					 * between memory bandwidth and avoiding
  2266					 * premature sharing of HT resources and SMT-4
  2267					 * or SMT-8 *may* benefit from a different
  2268					 * cutoff.
  2269					 *
  2270					 * For multiple LLCs, allow an imbalance
  2271					 * until multiple tasks would share an LLC
  2272					 * on one node while LLCs on another node
  2273					 * remain idle.
  2274					 */
  2275					nr_llcs = sd->span_weight / child->span_weight;
  2276					if (nr_llcs == 1)
  2277						imb = sd->span_weight >> 2;
  2278					else
  2279						imb = nr_llcs;
  2280					sd->imb_numa_nr = imb;
  2281	
  2282					/* Set span based on the first NUMA domain. */
  2283					top = sd;
> 2284					top_p = top->parent;
  2285					while (top_p && !(top_p->flags & SD_NUMA)) {
> 2286						top = top->parent;
  2287						top_p = top->parent;
  2288					}
  2289					imb_span = top_p ? top_p->span_weight : sd->span_weight;
  2290				} else {
  2291					int factor = max(1U, (sd->span_weight / imb_span));
  2292	
  2293					sd->imb_numa_nr = imb * factor;
  2294				}
  2295			}
  2296		}
  2297	
  2298		/* Calculate CPU capacity for physical packages and nodes */
  2299		for (i = nr_cpumask_bits-1; i >= 0; i--) {
  2300			if (!cpumask_test_cpu(i, cpu_map))
  2301				continue;
  2302	
  2303			for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
  2304				claim_allocations(i, sd);
  2305				init_sched_groups_capacity(i, sd);
  2306			}
  2307		}
  2308	
  2309		/* Attach the domains */
  2310		rcu_read_lock();
  2311		for_each_cpu(i, cpu_map) {
  2312			rq = cpu_rq(i);
  2313			sd = *per_cpu_ptr(d.sd, i);
  2314	
  2315			/* Use READ_ONCE()/WRITE_ONCE() to avoid load/store tearing: */
  2316			if (rq->cpu_capacity_orig > READ_ONCE(d.rd->max_cpu_capacity))
  2317				WRITE_ONCE(d.rd->max_cpu_capacity, rq->cpu_capacity_orig);
  2318	
  2319			cpu_attach_domain(sd, d.rd, i);
  2320		}
  2321		rcu_read_unlock();
  2322	
  2323		if (has_asym)
  2324			static_branch_inc_cpuslocked(&sched_asym_cpucapacity);
  2325	
  2326		if (rq && sched_debug_verbose) {
  2327			pr_info("root domain span: %*pbl (max cpu_capacity = %lu)\n",
  2328				cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);
  2329		}
  2330	
  2331		ret = 0;
  2332	error:
  2333		__free_domain_allocs(&d, alloc_state, cpu_map);
  2334	
  2335		return ret;
  2336	}
  2337	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2022-02-03 14:46 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs Mel Gorman
  2022-02-04  1:30   ` kernel test robot
@ 2022-02-04  7:06   ` Srikar Dronamraju
  2022-02-04  9:04     ` Mel Gorman
  2022-02-04 15:07   ` Nayak, KPrateek (K Prateek)
  2 siblings, 1 reply; 8+ messages in thread
From: Srikar Dronamraju @ 2022-02-04  7:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Gautham Shenoy, LKML

* Mel Gorman <mgorman@techsingularity.net> [2022-02-03 14:46:52]:

> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index d201a7052a29..e6cd55951304 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2242,6 +2242,59 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>  		}
>  	}
> 
> +	/*
> +	 * Calculate an allowed NUMA imbalance such that LLCs do not get
> +	 * imbalanced.
> +	 */

We seem to adding this hunk before the sched_domains may be degenerated.
Wondering if we really want to do it before degeneration.

Let say we have 3 sched domains and we calculated the sd->imb_numa_nr for
all the 3 domains, then lets say the middle sched_domain gets degenerated. 
Would the sd->imb_numa_nr's still be relevant?


> +	for_each_cpu(i, cpu_map) {
> +		unsigned int imb = 0;
> +		unsigned int imb_span = 1;
> +
> +		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> +			struct sched_domain *child = sd->child;
> +
> +			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> +			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
> +				struct sched_domain *top, *top_p;
> +				unsigned int nr_llcs;
> +
> +				/*
> +				 * For a single LLC per node, allow an
> +				 * imbalance up to 25% of the node. This is an
> +				 * arbitrary cutoff based on SMT-2 to balance
> +				 * between memory bandwidth and avoiding
> +				 * premature sharing of HT resources and SMT-4
> +				 * or SMT-8 *may* benefit from a different
> +				 * cutoff.
> +				 *
> +				 * For multiple LLCs, allow an imbalance
> +				 * until multiple tasks would share an LLC
> +				 * on one node while LLCs on another node
> +				 * remain idle.
> +				 */
> +				nr_llcs = sd->span_weight / child->span_weight;
> +				if (nr_llcs == 1)
> +					imb = sd->span_weight >> 2;
> +				else
> +					imb = nr_llcs;
> +				sd->imb_numa_nr = imb;
> +
> +				/* Set span based on the first NUMA domain. */
> +				top = sd;
> +				top_p = top->parent;
> +				while (top_p && !(top_p->flags & SD_NUMA)) {
> +					top = top->parent;
> +					top_p = top->parent;
> +				}
> +				imb_span = top_p ? top_p->span_weight : sd->span_weight;

I am getting confused by imb_span.
Let say we have a topology of SMT -> MC -> DIE -> NUMA -> NUMA, with SMT and
MC domains having SD_SHARE_PKG_RESOURCES flag set.
We come here only for DIE domain.

imb_span set here is being used for both the subsequent sched domains
most likely they will be NUMA domains. Right?

> +			} else {
> +				int factor = max(1U, (sd->span_weight / imb_span));
> +
> +				sd->imb_numa_nr = imb * factor;

For SMT, (or any sched domains below the llcs) factor would be
sd->span_weight but imb_numa_nr and imb would be 0.
For NUMA (or any sched domain just above DIE), factor would be
sd->imb_numa_nr would be nr_llcs.
For subsequent sched_domains, the sd->imb_numa_nr would be some multiple of
nr_llcs. Right?


> +			}
> +		}
> +	}
> +
>  	/* Calculate CPU capacity for physical packages and nodes */
>  	for (i = nr_cpumask_bits-1; i >= 0; i--) {
>  		if (!cpumask_test_cpu(i, cpu_map))
> -- 
> 2.31.1
> 

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2022-02-04  7:06   ` Srikar Dronamraju
@ 2022-02-04  9:04     ` Mel Gorman
  0 siblings, 0 replies; 8+ messages in thread
From: Mel Gorman @ 2022-02-04  9:04 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Gautham Shenoy, LKML

On Fri, Feb 04, 2022 at 12:36:54PM +0530, Srikar Dronamraju wrote:
> * Mel Gorman <mgorman@techsingularity.net> [2022-02-03 14:46:52]:
> 
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index d201a7052a29..e6cd55951304 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -2242,6 +2242,59 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
> >  		}
> >  	}
> > 
> > +	/*
> > +	 * Calculate an allowed NUMA imbalance such that LLCs do not get
> > +	 * imbalanced.
> > +	 */
> 
> We seem to adding this hunk before the sched_domains may be degenerated.
> Wondering if we really want to do it before degeneration.
> 

There was no obvious advantage versus doing it at the same time
characteristics like groups were being determined.

> Let say we have 3 sched domains and we calculated the sd->imb_numa_nr for
> all the 3 domains, then lets say the middle sched_domain gets degenerated. 
> Would the sd->imb_numa_nr's still be relevant?
> 

It's expected that it is still relevant as the ratios with respect to
SD_SHARE_PKG_RESOURCES should still be consistent.

> 
> > +	for_each_cpu(i, cpu_map) {
> > +		unsigned int imb = 0;
> > +		unsigned int imb_span = 1;
> > +
> > +		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> > +			struct sched_domain *child = sd->child;
> > +
> > +			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> > +			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
> > +				struct sched_domain *top, *top_p;
> > +				unsigned int nr_llcs;
> > +
> > +				/*
> > +				 * For a single LLC per node, allow an
> > +				 * imbalance up to 25% of the node. This is an
> > +				 * arbitrary cutoff based on SMT-2 to balance
> > +				 * between memory bandwidth and avoiding
> > +				 * premature sharing of HT resources and SMT-4
> > +				 * or SMT-8 *may* benefit from a different
> > +				 * cutoff.
> > +				 *
> > +				 * For multiple LLCs, allow an imbalance
> > +				 * until multiple tasks would share an LLC
> > +				 * on one node while LLCs on another node
> > +				 * remain idle.
> > +				 */
> > +				nr_llcs = sd->span_weight / child->span_weight;
> > +				if (nr_llcs == 1)
> > +					imb = sd->span_weight >> 2;
> > +				else
> > +					imb = nr_llcs;
> > +				sd->imb_numa_nr = imb;
> > +
> > +				/* Set span based on the first NUMA domain. */
> > +				top = sd;
> > +				top_p = top->parent;
> > +				while (top_p && !(top_p->flags & SD_NUMA)) {
> > +					top = top->parent;
> > +					top_p = top->parent;
> > +				}
> > +				imb_span = top_p ? top_p->span_weight : sd->span_weight;
> 
> I am getting confused by imb_span.
> Let say we have a topology of SMT -> MC -> DIE -> NUMA -> NUMA, with SMT and
> MC domains having SD_SHARE_PKG_RESOURCES flag set.
> We come here only for DIE domain.
> 
> imb_span set here is being used for both the subsequent sched domains
> most likely they will be NUMA domains. Right?
> 

Right.

> > +			} else {
> > +				int factor = max(1U, (sd->span_weight / imb_span));
> > +
> > +				sd->imb_numa_nr = imb * factor;
> 
> For SMT, (or any sched domains below the llcs) factor would be
> sd->span_weight but imb_numa_nr and imb would be 0.

Yes.

> For NUMA (or any sched domain just above DIE), factor would be
> sd->imb_numa_nr would be nr_llcs.
> For subsequent sched_domains, the sd->imb_numa_nr would be some multiple of
> nr_llcs. Right?
> 

Right.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2022-02-03 14:46 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs Mel Gorman
  2022-02-04  1:30   ` kernel test robot
  2022-02-04  7:06   ` Srikar Dronamraju
@ 2022-02-04 15:07   ` Nayak, KPrateek (K Prateek)
  2022-02-04 16:45     ` Mel Gorman
  2 siblings, 1 reply; 8+ messages in thread
From: Nayak, KPrateek (K Prateek) @ 2022-02-04 15:07 UTC (permalink / raw)
  To: Mel Gorman, Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Aubrey Li,
	Barry Song, Mike Galbraith, Srikar Dronamraju, Gautham Shenoy,
	LKML

Hello Mel,

On 2/3/2022 8:16 PM, Mel Gorman wrote:
> Commit 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA
> nodes") allowed an imbalance between NUMA nodes such that communicating
> tasks would not be pulled apart by the load balancer. This works fine when
> there is a 1:1 relationship between LLC and node but can be suboptimal
> for multiple LLCs if independent tasks prematurely use CPUs sharing cache.
> 
> Zen* has multiple LLCs per node with local memory channels and due to
> the allowed imbalance, it's far harder to tune some workloads to run
> optimally than it is on hardware that has 1 LLC per node. This patch
> allows an imbalance to exist up to the point where LLCs should be balanced
> between nodes.
> 
> On a Zen3 machine running STREAM parallelised with OMP to have on instance
> per LLC the results and without binding, the results are
> 
>                             5.17.0-rc0             5.17.0-rc0
>                                vanilla       sched-numaimb-v5
> MB/sec copy-16    162596.94 (   0.00%)   501967.12 ( 208.72%)
> MB/sec scale-16   136901.28 (   0.00%)   376531.50 ( 175.04%)
> MB/sec add-16     157300.70 (   0.00%)   569997.42 ( 262.36%)
> MB/sec triad-16   151446.88 (   0.00%)   553204.54 ( 265.28%)
> 
> STREAM can use directives to force the spread if the OpenMP is new
> enough but that doesn't help if an application uses threads and
> it's not known in advance how many threads will be created.
> 
> Coremark is a CPU and cache intensive benchmark parallelised with
> threads. When running with 1 thread per core, the vanilla kernel
> allows threads to contend on cache. With the patch;
> 
>                                5.17.0-rc0             5.17.0-rc0
>                                   vanilla       sched-numaimb-v5
> Min       Score-16   368239.36 (   0.00%)   400876.92 (   8.86%)
> Hmean     Score-16   388607.33 (   0.00%)   441447.30 *  13.60%*
> Max       Score-16   408945.69 (   0.00%)   478826.87 (  17.09%)
> Stddev    Score-16    15247.04 (   0.00%)    34061.76 (-123.40%)
> CoeffVar  Score-16        3.92 (   0.00%)        7.67 ( -95.82%)
> 
> It can also make a big difference for semi-realistic workloads
> like specjbb which can execute arbitrary numbers of threads without
> advance knowledge of how they should be placed
> 
>                                5.17.0-rc0             5.17.0-rc0
>                                   vanilla       sched-numaimb-v5
> Hmean     tput-1      71631.55 (   0.00%)    70383.46 (  -1.74%)
> Hmean     tput-8     582758.78 (   0.00%)   607290.89 *   4.21%*
> Hmean     tput-16   1020372.75 (   0.00%)  1031257.25 (   1.07%)
> Hmean     tput-24   1416430.67 (   0.00%)  1587576.33 *  12.08%*
> Hmean     tput-32   1687702.72 (   0.00%)  1724207.51 (   2.16%)
> Hmean     tput-40   1798094.90 (   0.00%)  1983053.56 *  10.29%*
> Hmean     tput-48   1972731.77 (   0.00%)  2157461.70 (   9.36%)
> Hmean     tput-56   2386872.38 (   0.00%)  2193237.42 (  -8.11%)
> Hmean     tput-64   2536954.17 (   0.00%)  2588741.08 (   2.04%)
> Hmean     tput-72   2585071.36 (   0.00%)  2654776.36 (   2.70%)
> Hmean     tput-80   2960523.94 (   0.00%)  2894657.12 (  -2.22%)
> Hmean     tput-88   3061408.57 (   0.00%)  2903167.72 (  -5.17%)
> Hmean     tput-96   3052394.82 (   0.00%)  2872605.46 (  -5.89%)
> Hmean     tput-104  2997814.76 (   0.00%)  3013660.26 (   0.53%)
> Hmean     tput-112  2955353.29 (   0.00%)  3029122.16 (   2.50%)
> Hmean     tput-120  2889770.71 (   0.00%)  2957739.88 (   2.35%)
> Hmean     tput-128  2871713.84 (   0.00%)  2912410.18 (   1.42%)
> 
> In general, the standard deviation figures also are a lot more
> stable.
> 
> Similarly, for embarassingly parallel problems like NPB-ep, there are
> improvements due to better spreading across LLC when the machine is not
> fully utilised.
> 
>                               vanilla    sched-numaimb-v5r12
> Min       ep.D       31.79 (   0.00%)       26.11 (  17.87%)
> Amean     ep.D       31.86 (   0.00%)       26.26 *  17.58%*
> Stddev    ep.D        0.07 (   0.00%)        0.18 (-157.54%)
> CoeffVar  ep.D        0.22 (   0.00%)        0.69 (-212.46%)
> Max       ep.D       31.93 (   0.00%)       26.46 (  17.13%)
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  include/linux/sched/topology.h |  1 +
>  kernel/sched/fair.c            | 22 +++++++-------
>  kernel/sched/topology.c        | 53 ++++++++++++++++++++++++++++++++++
>  3 files changed, 66 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 8054641c0a7b..56cffe42abbc 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -93,6 +93,7 @@ struct sched_domain {
>  	unsigned int busy_factor;	/* less balancing by factor if busy */
>  	unsigned int imbalance_pct;	/* No balance until over watermark */
>  	unsigned int cache_nice_tries;	/* Leave cache hot tasks for # tries */
> +	unsigned int imb_numa_nr;	/* Nr running tasks that allows a NUMA imbalance */
>  
>  	int nohz_idle;			/* NOHZ IDLE status */
>  	int flags;			/* See SD_* */
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4592ccf82c34..86abf97a8df6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1489,6 +1489,7 @@ struct task_numa_env {
>  
>  	int src_cpu, src_nid;
>  	int dst_cpu, dst_nid;
> +	int imb_numa_nr;
>  
>  	struct numa_stats src_stats, dst_stats;
>  
> @@ -1503,7 +1504,7 @@ struct task_numa_env {
>  static unsigned long cpu_load(struct rq *rq);
>  static unsigned long cpu_runnable(struct rq *rq);
>  static inline long adjust_numa_imbalance(int imbalance,
> -					int dst_running, int dst_weight);
> +					int dst_running, int imb_numa_nr);
>  
>  static inline enum
>  numa_type numa_classify(unsigned int imbalance_pct,
> @@ -1884,7 +1885,7 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>  		dst_running = env->dst_stats.nr_running + 1;
>  		imbalance = max(0, dst_running - src_running);
>  		imbalance = adjust_numa_imbalance(imbalance, dst_running,
> -							env->dst_stats.weight);
> +						  env->imb_numa_nr);
>  
>  		/* Use idle CPU if there is no imbalance */
>  		if (!imbalance) {
> @@ -1949,8 +1950,10 @@ static int task_numa_migrate(struct task_struct *p)
>  	 */
>  	rcu_read_lock();
>  	sd = rcu_dereference(per_cpu(sd_numa, env.src_cpu));
> -	if (sd)
> +	if (sd) {
>  		env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
> +		env.imb_numa_nr = sd->imb_numa_nr;
> +	}
>  	rcu_read_unlock();
>  
>  	/*
> @@ -9003,10 +9006,9 @@ static bool update_pick_idlest(struct sched_group *idlest,
>   * This is an approximation as the number of running tasks may not be
>   * related to the number of busy CPUs due to sched_setaffinity.
>   */
> -static inline bool
> -allow_numa_imbalance(unsigned int running, unsigned int weight)
> +static inline bool allow_numa_imbalance(int running, int imb_numa_nr)
>  {
> -	return (running < (weight >> 2));
> +	return running < imb_numa_nr;
>  }
>  
>  /*
> @@ -9146,7 +9148,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>  			 * allowed. If there is a real need of migration,
>  			 * periodic load balance will take care of it.
>  			 */
> -			if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, local_sgs.group_weight))
> +			if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr))

Could you please clarify why are we adding 1 to local_sgs.sum_nr_running while allowing imbalance?
allow_numa_imbalance allows the imbalance based on the following inequality:

	running < imb_numa_nr

Consider on a Zen3 CPU with 8 LLCs in the sched group of the NUMA domain.
Assume the group is running 7 task and we are finding the idlest group for the 8th task:

 	sd->imb_numa_nr = 8
	local_sgs.sum_nr_running = 7

In this case, local_sgs.sum_nr_running + 1 is equal to sd->imb_numa_nr and if we allow NUMA imbalance
and place the task in the same group, each task can be given one LLC.
However, allow_numa_imbalance returns 0 for the above case and can lead to task being placed on a different
NUMA group.

In case of Gautham's suggested fix (https://lore.kernel.org/lkml/YcHs37STv71n4erJ@BLR-5CG11610CF.amd.com/),
the v4 patch in question (https://lore.kernel.org/lkml/20211210093307.31701-3-mgorman@techsingularity.net/)
used the inequality "<=" to allow NUMA imbalance where we needed to consider the additional load CPU had
to bear. However that doesn't seem to be the case here.

>  				return NULL;
>  		}
>  
> @@ -9238,9 +9240,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  #define NUMA_IMBALANCE_MIN 2
>  
>  static inline long adjust_numa_imbalance(int imbalance,
> -				int dst_running, int dst_weight)
> +				int dst_running, int imb_numa_nr)
>  {
> -	if (!allow_numa_imbalance(dst_running, dst_weight))
> +	if (!allow_numa_imbalance(dst_running, imb_numa_nr))
>  		return imbalance;
>  
>  	/*
> @@ -9352,7 +9354,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>  		/* Consider allowing a small imbalance between NUMA groups */
>  		if (env->sd->flags & SD_NUMA) {
>  			env->imbalance = adjust_numa_imbalance(env->imbalance,
> -				local->sum_nr_running + 1, local->group_weight);
> +				local->sum_nr_running + 1, env->sd->imb_numa_nr);
>  		}
>  
>  		return;
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index d201a7052a29..e6cd55951304 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -2242,6 +2242,59 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>  		}
>  	}
>  
> +	/*
> +	 * Calculate an allowed NUMA imbalance such that LLCs do not get
> +	 * imbalanced.
> +	 */
> +	for_each_cpu(i, cpu_map) {
> +		unsigned int imb = 0;
> +		unsigned int imb_span = 1;
> +
> +		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
> +			struct sched_domain *child = sd->child;
> +
> +			if (!(sd->flags & SD_SHARE_PKG_RESOURCES) && child &&
> +			    (child->flags & SD_SHARE_PKG_RESOURCES)) {
> +				struct sched_domain *top, *top_p;
> +				unsigned int nr_llcs;
> +
> +				/*
> +				 * For a single LLC per node, allow an
> +				 * imbalance up to 25% of the node. This is an
> +				 * arbitrary cutoff based on SMT-2 to balance
> +				 * between memory bandwidth and avoiding
> +				 * premature sharing of HT resources and SMT-4
> +				 * or SMT-8 *may* benefit from a different
> +				 * cutoff.
> +				 *
> +				 * For multiple LLCs, allow an imbalance
> +				 * until multiple tasks would share an LLC
> +				 * on one node while LLCs on another node
> +				 * remain idle.
> +				 */

To add to my point above, the comment here says -
"allow an imbalance until multiple tasks would share an LLC on one node"
Whereas, in the case I highlighted above, we see balancing kick in with possibly
one LLC being unaccounted for. 

> +				nr_llcs = sd->span_weight / child->span_weight;
> +				if (nr_llcs == 1)
> +					imb = sd->span_weight >> 2;
> +				else
> +					imb = nr_llcs;
> +				sd->imb_numa_nr = imb;
> +
> +				/* Set span based on the first NUMA domain. */
> +				top = sd;
> +				top_p = top->parent;
> +				while (top_p && !(top_p->flags & SD_NUMA)) {
> +					top = top->parent;
> +					top_p = top->parent;
> +				}
> +				imb_span = top_p ? top_p->span_weight : sd->span_weight;
> +			} else {
> +				int factor = max(1U, (sd->span_weight / imb_span));
> +
> +				sd->imb_numa_nr = imb * factor;
> +			}
> +		}
> +	}
> +
>  	/* Calculate CPU capacity for physical packages and nodes */
>  	for (i = nr_cpumask_bits-1; i >= 0; i--) {
>  		if (!cpumask_test_cpu(i, cpu_map))

Please correct me if I'm wrong.

Thanks and Regards
Prateek

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs
  2022-02-04 15:07   ` Nayak, KPrateek (K Prateek)
@ 2022-02-04 16:45     ` Mel Gorman
  0 siblings, 0 replies; 8+ messages in thread
From: Mel Gorman @ 2022-02-04 16:45 UTC (permalink / raw)
  To: Nayak, KPrateek (K Prateek)
  Cc: Peter Zijlstra, Ingo Molnar, Vincent Guittot, Valentin Schneider,
	Aubrey Li, Barry Song, Mike Galbraith, Srikar Dronamraju,
	Gautham Shenoy, LKML

On Fri, Feb 04, 2022 at 08:37:53PM +0530, Nayak, KPrateek (K Prateek) wrote:
> On 2/3/2022 8:16 PM, Mel Gorman wrote:
> > @@ -9003,10 +9006,9 @@ static bool update_pick_idlest(struct sched_group *idlest,
> >   * This is an approximation as the number of running tasks may not be
> >   * related to the number of busy CPUs due to sched_setaffinity.
> >   */
> > -static inline bool
> > -allow_numa_imbalance(unsigned int running, unsigned int weight)
> > +static inline bool allow_numa_imbalance(int running, int imb_numa_nr)
> >  {
> > -	return (running < (weight >> 2));
> > +	return running < imb_numa_nr;
> >  }
> >  
> >  /*
> > @@ -9146,7 +9148,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
> >  			 * allowed. If there is a real need of migration,
> >  			 * periodic load balance will take care of it.
> >  			 */
> > -			if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, local_sgs.group_weight))
> > +			if (allow_numa_imbalance(local_sgs.sum_nr_running + 1, sd->imb_numa_nr))
> 
> Could you please clarify why are we adding 1 to local_sgs.sum_nr_running while allowing imbalance?

To account for the new task similar to what task_numa_find_cpu before
calling adjust_numa_imbalance.

> allow_numa_imbalance allows the imbalance based on the following inequality:
> 
> 	running < imb_numa_nr
> 
> Consider on a Zen3 CPU with 8 LLCs in the sched group of the NUMA domain.
> Assume the group is running 7 task and we are finding the idlest group for the 8th task:
> 
>  	sd->imb_numa_nr = 8
> 	local_sgs.sum_nr_running = 7
> 
> In this case, local_sgs.sum_nr_running + 1 is equal to sd->imb_numa_nr and if we allow NUMA imbalance
> and place the task in the same group, each task can be given one LLC.
> However, allow_numa_imbalance returns 0 for the above case and can lead to task being placed on a different
> NUMA group.
> 
> In case of Gautham's suggested fix (https://lore.kernel.org/lkml/YcHs37STv71n4erJ@BLR-5CG11610CF.amd.com/),
> the v4 patch in question (https://lore.kernel.org/lkml/20211210093307.31701-3-mgorman@techsingularity.net/)
> used the inequality "<=" to allow NUMA imbalance where we needed to consider the additional load CPU had
> to bear. However that doesn't seem to be the case here.
> 

I failed to change < to <= in allow_numa_imbalance, I'll fix and retest.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2022-02-04 16:46 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-03 14:46 [PATCH v5 0/2] Adjust NUMA imbalance for multiple LLCs Mel Gorman
2022-02-03 14:46 ` [PATCH 1/2] sched/fair: Improve consistency of allowed NUMA balance calculations Mel Gorman
2022-02-03 14:46 ` [PATCH 2/2] sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs Mel Gorman
2022-02-04  1:30   ` kernel test robot
2022-02-04  7:06   ` Srikar Dronamraju
2022-02-04  9:04     ` Mel Gorman
2022-02-04 15:07   ` Nayak, KPrateek (K Prateek)
2022-02-04 16:45     ` Mel Gorman

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.