linux-kernel.vger.kernel.org archive mirror
* [PATCH v3 0/4] Revisit NUMA imbalance tolerance and fork balancing
@ 2020-11-20  9:06 Mel Gorman
  2020-11-20  9:06 ` [PATCH 1/4] sched/numa: Rename nr_running and break out the magic number Mel Gorman
                   ` (4 more replies)
  0 siblings, 5 replies; 15+ messages in thread
From: Mel Gorman @ 2020-11-20  9:06 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Vincent Guittot, Valentin Schneider, Juri Lelli, LKML, Mel Gorman

Changelog since v2
o Build fix for !NUMA_BALANCING configurations

Changelog since v1
o Split out patch that moves imbalance calculation
o Strongly connect fork imbalance considerations with adjust_numa_imbalance

When NUMA and CPU balancing were reconciled, there was an attempt to allow
a degree of imbalance but it caused more problems than it solved. Instead,
imbalance was only allowed with an almost idle NUMA domain. A lot of the
problems have since been addressed, so it's time for a revisit. There is
also an issue with how threads are balanced at fork time. It's mentioned
in this context as patches 3 and 4 should share similar behaviour in terms
of a node's utilisation.

Patch 1 is just a cosmetic rename

Patch 2 moves an imbalance calculation. It is partly a micro-optimisation
	but mainly avoids confusing what imbalance means for different group
	types.

Patch 3 allows a "floating" imbalance to exist so communicating tasks can
	remain on the same domain until utilisation is higher. It aims
	to balance compute availability with memory bandwidth.

Patch 4 is the interesting one. Currently, fork keeps placing tasks on
	a node as long as it has idle CPUs, so a NUMA node can become
	completely utilised before the load balancer gets involved. This
	caused serious problems with a real workload that unfortunately
	I cannot share many details about, but there is a proxy reproducer.
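
To make the end state of the series concrete, the following is a minimal
standalone sketch in plain C of the tolerance that patches 3 and 4
converge on. It mirrors the helpers added to kernel/sched/fair.c but is
illustrative only, not the kernel code itself:

/* Standalone sketch (not kernel code) of the tolerance after patches 3 and 4 */
#include <stdbool.h>

#define NUMA_IMBALANCE_MIN	2

/* Tolerate an imbalance while less than 25% of the domain's CPUs are busy */
static inline bool allow_numa_imbalance(int dst_running, int dst_weight)
{
	return dst_running < (dst_weight >> 2);
}

/* Below that threshold, a pair of communicating tasks may stay local */
static inline long adjust_numa_imbalance(long imbalance, int dst_running,
					 int dst_weight)
{
	if (!allow_numa_imbalance(dst_running, dst_weight))
		return imbalance;

	if (imbalance <= NUMA_IMBALANCE_MIN)
		return 0;

	return imbalance;
}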

-- 
2.26.2

Mel Gorman (4):
  sched/numa: Rename nr_running and break out the magic number
  sched: Avoid unnecessary calculation of load imbalance at clone time
  sched/numa: Allow a floating imbalance between NUMA nodes
  sched: Limit the amount of NUMA imbalance that can exist at fork time

 kernel/sched/fair.c | 44 +++++++++++++++++++++++++++++++-------------
 1 file changed, 31 insertions(+), 13 deletions(-)

-- 
2.26.2



* [PATCH 1/4] sched/numa: Rename nr_running and break out the magic number
  2020-11-20  9:06 [PATCH v3 0/4] Revisit NUMA imbalance tolerance and fork balancing Mel Gorman
@ 2020-11-20  9:06 ` Mel Gorman
  2020-11-20 13:32   ` Vincent Guittot
  2020-11-25 14:02   ` [tip: sched/core] " tip-bot2 for Mel Gorman
  2020-11-20  9:06 ` [PATCH 2/4] sched: Avoid unnecessary calculation of load imbalance at clone time Mel Gorman
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 15+ messages in thread
From: Mel Gorman @ 2020-11-20  9:06 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Vincent Guittot, Valentin Schneider, Juri Lelli, LKML, Mel Gorman

This is simply a preparation patch to make the following patches easier
to read. No functional change.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6d78b68847f9..5fbed29e4001 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1550,7 +1550,7 @@ struct task_numa_env {
 static unsigned long cpu_load(struct rq *rq);
 static unsigned long cpu_runnable(struct rq *rq);
 static unsigned long cpu_util(int cpu);
-static inline long adjust_numa_imbalance(int imbalance, int nr_running);
+static inline long adjust_numa_imbalance(int imbalance, int dst_running);
 
 static inline enum
 numa_type numa_classify(unsigned int imbalance_pct,
@@ -8991,7 +8991,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	}
 }
 
-static inline long adjust_numa_imbalance(int imbalance, int nr_running)
+#define NUMA_IMBALANCE_MIN 2
+
+static inline long adjust_numa_imbalance(int imbalance, int dst_running)
 {
 	unsigned int imbalance_min;
 
@@ -8999,8 +9001,8 @@ static inline long adjust_numa_imbalance(int imbalance, int nr_running)
 	 * Allow a small imbalance based on a simple pair of communicating
 	 * tasks that remain local when the source domain is almost idle.
 	 */
-	imbalance_min = 2;
-	if (nr_running <= imbalance_min)
+	imbalance_min = NUMA_IMBALANCE_MIN;
+	if (dst_running <= imbalance_min)
 		return 0;
 
 	return imbalance;
-- 
2.26.2



* [PATCH 2/4] sched: Avoid unnecessary calculation of load imbalance at clone time
  2020-11-20  9:06 [PATCH v3 0/4] Revisit NUMA imbalance tolerance and fork balancing Mel Gorman
  2020-11-20  9:06 ` [PATCH 1/4] sched/numa: Rename nr_running and break out the magic number Mel Gorman
@ 2020-11-20  9:06 ` Mel Gorman
  2020-11-20 13:32   ` Vincent Guittot
  2020-11-25 14:02   ` [tip: sched/core] " tip-bot2 for Mel Gorman
  2020-11-20  9:06 ` [PATCH 3/4] sched/numa: Allow a floating imbalance between NUMA nodes Mel Gorman
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 15+ messages in thread
From: Mel Gorman @ 2020-11-20  9:06 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Vincent Guittot, Valentin Schneider, Juri Lelli, LKML, Mel Gorman

In find_idlest_group(), the load imbalance is only relevant when the group
is either overloaded or fully busy but it is calculated unconditionally.
This patch moves the imbalance calculation to the context where it is required.
Technically, it is a micro-optimisation but really the benefit is avoiding
confusing one type of imbalance with another depending on the group_type
in the next patch.

No functional change.
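
For orientation, a purely illustrative worked example of the value being
moved (the inputs are assumptions, not taken from this patch): if
scale_load_down(NICE_0_LOAD) resolves to 1024 and the domain's
imbalance_pct is 125, then

	imbalance = 1024 * (125 - 100) / 100 = 256

and only the group_overloaded and group_fully_busy comparisons use that
slack, which is why the calculation can move inside the switch statement.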

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5fbed29e4001..9aded12aaa90 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8777,9 +8777,6 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 			.group_type = group_overloaded,
 	};
 
-	imbalance = scale_load_down(NICE_0_LOAD) *
-				(sd->imbalance_pct-100) / 100;
-
 	do {
 		int local_group;
 
@@ -8833,6 +8830,11 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 	switch (local_sgs.group_type) {
 	case group_overloaded:
 	case group_fully_busy:
+
+		/* Calculate allowed imbalance based on load */
+		imbalance = scale_load_down(NICE_0_LOAD) *
+				(sd->imbalance_pct-100) / 100;
+
 		/*
 		 * When comparing groups across NUMA domains, it's possible for
 		 * the local domain to be very lightly loaded relative to the
-- 
2.26.2



* [PATCH 3/4] sched/numa: Allow a floating imbalance between NUMA nodes
  2020-11-20  9:06 [PATCH v3 0/4] Revisit NUMA imbalance tolerance and fork balancing Mel Gorman
  2020-11-20  9:06 ` [PATCH 1/4] sched/numa: Rename nr_running and break out the magic number Mel Gorman
  2020-11-20  9:06 ` [PATCH 2/4] sched: Avoid unnecessary calculation of load imbalance at clone time Mel Gorman
@ 2020-11-20  9:06 ` Mel Gorman
  2020-11-20 13:33   ` Vincent Guittot
  2020-11-25 14:02   ` [tip: sched/core] " tip-bot2 for Mel Gorman
  2020-11-20  9:06 ` [PATCH 4/4] sched: Limit the amount of NUMA imbalance that can exist at fork time Mel Gorman
  2020-11-20 12:58 ` [PATCH v3 0/4] Revisit NUMA imbalance tolerance and fork balancing Peter Zijlstra
  4 siblings, 2 replies; 15+ messages in thread
From: Mel Gorman @ 2020-11-20  9:06 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Vincent Guittot, Valentin Schneider, Juri Lelli, LKML, Mel Gorman

Currently, an imbalance is only allowed when a destination node
is almost completely idle. This solved one basic class of problems
and was the cautious approach.

This patch revisits the possibility that NUMA nodes can be imbalanced
until 25% of the CPUs are occupied. The reasoning behind 25% is somewhat
superficial -- it's half the cores when HT is enabled.  At higher
utilisations, balancing should continue as normal and keep things even
until scheduler domains are fully busy or over utilised.
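
As a worked example (the node size is an assumption for illustration,
not something defined by this patch): on a node with 40 logical CPUs,
i.e. 20 cores with HT, the threshold works out as

	dst_weight >> 2 = 40 >> 2 = 10

so an imbalance of up to NUMA_IMBALANCE_MIN tasks is only tolerated while
fewer than 10 tasks are running on the destination; beyond that point,
balancing behaves as it did before.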

Note that this is not expected to be a universal win. Any benchmark
that prefers spreading as wide as possible with limited communication
will favour the old behaviour as there is more memory bandwidth.
Workloads that communicate heavily in pairs such as netperf or tbench
benefit. For the tests I ran, the vast majority of workloads saw
a benefit so it seems to be a worthwhile trade-off.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9aded12aaa90..e17e6c5da1d5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1550,7 +1550,8 @@ struct task_numa_env {
 static unsigned long cpu_load(struct rq *rq);
 static unsigned long cpu_runnable(struct rq *rq);
 static unsigned long cpu_util(int cpu);
-static inline long adjust_numa_imbalance(int imbalance, int dst_running);
+static inline long adjust_numa_imbalance(int imbalance,
+					int dst_running, int dst_weight);
 
 static inline enum
 numa_type numa_classify(unsigned int imbalance_pct,
@@ -1930,7 +1931,8 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 		src_running = env->src_stats.nr_running - 1;
 		dst_running = env->dst_stats.nr_running + 1;
 		imbalance = max(0, dst_running - src_running);
-		imbalance = adjust_numa_imbalance(imbalance, dst_running);
+		imbalance = adjust_numa_imbalance(imbalance, dst_running,
+							env->dst_stats.weight);
 
 		/* Use idle CPU if there is no imbalance */
 		if (!imbalance) {
@@ -8995,16 +8997,14 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 
 #define NUMA_IMBALANCE_MIN 2
 
-static inline long adjust_numa_imbalance(int imbalance, int dst_running)
+static inline long adjust_numa_imbalance(int imbalance,
+				int dst_running, int dst_weight)
 {
-	unsigned int imbalance_min;
-
 	/*
 	 * Allow a small imbalance based on a simple pair of communicating
-	 * tasks that remain local when the source domain is almost idle.
+	 * tasks that remain local when the destination is lightly loaded.
 	 */
-	imbalance_min = NUMA_IMBALANCE_MIN;
-	if (dst_running <= imbalance_min)
+	if (dst_running < (dst_weight >> 2) && imbalance <= NUMA_IMBALANCE_MIN)
 		return 0;
 
 	return imbalance;
@@ -9107,9 +9107,10 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		}
 
 		/* Consider allowing a small imbalance between NUMA groups */
-		if (env->sd->flags & SD_NUMA)
+		if (env->sd->flags & SD_NUMA) {
 			env->imbalance = adjust_numa_imbalance(env->imbalance,
-						busiest->sum_nr_running);
+				busiest->sum_nr_running, busiest->group_weight);
+		}
 
 		return;
 	}
-- 
2.26.2



* [PATCH 4/4] sched: Limit the amount of NUMA imbalance that can exist at fork time
  2020-11-20  9:06 [PATCH v3 0/4] Revisit NUMA imbalance tolerance and fork balancing Mel Gorman
                   ` (2 preceding siblings ...)
  2020-11-20  9:06 ` [PATCH 3/4] sched/numa: Allow a floating imbalance between NUMA nodes Mel Gorman
@ 2020-11-20  9:06 ` Mel Gorman
  2020-11-20 13:33   ` Vincent Guittot
  2020-11-25 14:02   ` [tip: sched/core] " tip-bot2 for Mel Gorman
  2020-11-20 12:58 ` [PATCH v3 0/4] Revisit NUMA imbalance tolerance and fork balancing Peter Zijlstra
  4 siblings, 2 replies; 15+ messages in thread
From: Mel Gorman @ 2020-11-20  9:06 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: Vincent Guittot, Valentin Schneider, Juri Lelli, LKML, Mel Gorman

At fork time currently, a local node can be allowed to fill completely,
leaving the periodic load balancer to fix the problem. This can be
problematic in cases where a task creates lots of threads that idle until
woken as part of a worker pool, causing a memory bandwidth problem.

However, a "real" workload suffers badly from this behaviour. The workload
in question is mostly NUMA aware but spawns large numbers of threads
that act as a worker pool that can be called from anywhere. These need
to spread early to get reasonable behaviour.

This patch limits how much a local node can fill before spilling over
to another node and it will not be a universal win. Specifically,
very short-lived workloads that fit within a NUMA node would prefer
the memory bandwidth.

As I cannot describe the "real" workload, the best proxy measure I found
for illustration was a page fault microbenchmark. It's not representative
of the workload but demonstrates the hazard of the current behaviour.

pft timings
                                 5.10.0-rc2             5.10.0-rc2
                          imbalancefloat-v2          forkspread-v2
Amean     elapsed-1        46.37 (   0.00%)       46.05 *   0.69%*
Amean     elapsed-4        12.43 (   0.00%)       12.49 *  -0.47%*
Amean     elapsed-7         7.61 (   0.00%)        7.55 *   0.81%*
Amean     elapsed-12        4.79 (   0.00%)        4.80 (  -0.17%)
Amean     elapsed-21        3.13 (   0.00%)        2.89 *   7.74%*
Amean     elapsed-30        3.65 (   0.00%)        2.27 *  37.62%*
Amean     elapsed-48        3.08 (   0.00%)        2.13 *  30.69%*
Amean     elapsed-79        2.00 (   0.00%)        1.90 *   4.95%*
Amean     elapsed-80        2.00 (   0.00%)        1.90 *   4.70%*

This is showing the time to fault regions belonging to threads. The target
machine has 80 logical CPUs and two nodes. Note the ~30% gain at
approximately the point where one node becomes fully utilised. The
slower results are borderline noise.

Kernel building shows similar benefits around the same balance point.
Generally performance was either neutral or better in the tests conducted.
The main consideration with this patch is the point at which fork stops
spreading a task; some workloads may benefit from different balance
points, but it would be a risky tuning parameter.
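
For anyone wanting to sanity-check the balance point, below is a small
self-contained userspace program (not kernel code) applying the same 25%
check. Assuming the fork path sees a NUMA domain spanning all 80 logical
CPUs of the test machine, the cut-over lands at 20 running tasks, which
is roughly where the gains start to appear in the table above:

#include <stdio.h>
#include <stdbool.h>

/* Same check as allow_numa_imbalance() in the patch, standalone */
static bool allow_numa_imbalance(int dst_running, int dst_weight)
{
	return dst_running < (dst_weight >> 2);
}

int main(void)
{
	const int span_weight = 80;	/* assumed: 80 logical CPUs, two nodes */

	for (int running = 16; running <= 24; running++)
		printf("running=%2d -> %s\n", running,
		       allow_numa_imbalance(running, span_weight) ?
		       "allow imbalance (stay local)" : "spread at fork");

	return 0;
}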

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e17e6c5da1d5..6d1c24708664 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8761,6 +8761,16 @@ static bool update_pick_idlest(struct sched_group *idlest,
 	return true;
 }
 
+/*
+ * Allow a NUMA imbalance if busy CPUs is less than 25% of the domain.
+ * This is an approximation as the number of running tasks may not be
+ * related to the number of busy CPUs due to sched_setaffinity.
+ */
+static inline bool allow_numa_imbalance(int dst_running, int dst_weight)
+{
+	return (dst_running < (dst_weight >> 2));
+}
+
 /*
  * find_idlest_group() finds and returns the least busy CPU group within the
  * domain.
@@ -8893,7 +8903,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 			 * a real need of migration, periodic load balance will
 			 * take care of it.
 			 */
-			if (local_sgs.idle_cpus)
+			if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight))
 				return NULL;
 		}
 
@@ -9000,11 +9010,14 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 static inline long adjust_numa_imbalance(int imbalance,
 				int dst_running, int dst_weight)
 {
+	if (!allow_numa_imbalance(dst_running, dst_weight))
+		return imbalance;
+
 	/*
 	 * Allow a small imbalance based on a simple pair of communicating
 	 * tasks that remain local when the destination is lightly loaded.
 	 */
-	if (dst_running < (dst_weight >> 2) && imbalance <= NUMA_IMBALANCE_MIN)
+	if (imbalance <= NUMA_IMBALANCE_MIN)
 		return 0;
 
 	return imbalance;
-- 
2.26.2



* Re: [PATCH v3 0/4] Revisit NUMA imbalance tolerance and fork balancing
  2020-11-20  9:06 [PATCH v3 0/4] Revisit NUMA imbalance tolerance and fork balancing Mel Gorman
                   ` (3 preceding siblings ...)
  2020-11-20  9:06 ` [PATCH 4/4] sched: Limit the amount of NUMA imbalance that can exist at fork time Mel Gorman
@ 2020-11-20 12:58 ` Peter Zijlstra
  2020-11-20 14:02   ` Mel Gorman
  4 siblings, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2020-11-20 12:58 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Juri Lelli, LKML

On Fri, Nov 20, 2020 at 09:06:26AM +0000, Mel Gorman wrote:

> Mel Gorman (4):
>   sched/numa: Rename nr_running and break out the magic number
>   sched: Avoid unnecessary calculation of load imbalance at clone time
>   sched/numa: Allow a floating imbalance between NUMA nodes
>   sched: Limit the amount of NUMA imbalance that can exist at fork time
> 
>  kernel/sched/fair.c | 44 +++++++++++++++++++++++++++++++-------------
>  1 file changed, 31 insertions(+), 13 deletions(-)

OK, let's give this another go :-)

Thanks!


* Re: [PATCH 1/4] sched/numa: Rename nr_running and break out the magic number
  2020-11-20  9:06 ` [PATCH 1/4] sched/numa: Rename nr_running and break out the magic number Mel Gorman
@ 2020-11-20 13:32   ` Vincent Guittot
  2020-11-25 14:02   ` [tip: sched/core] " tip-bot2 for Mel Gorman
  1 sibling, 0 replies; 15+ messages in thread
From: Vincent Guittot @ 2020-11-20 13:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider, Juri Lelli, LKML

On Fri, 20 Nov 2020 at 10:06, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> This is simply a preparation patch to make the following patches easier
> to read. No functional change.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>

> ---
>  kernel/sched/fair.c | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6d78b68847f9..5fbed29e4001 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1550,7 +1550,7 @@ struct task_numa_env {
>  static unsigned long cpu_load(struct rq *rq);
>  static unsigned long cpu_runnable(struct rq *rq);
>  static unsigned long cpu_util(int cpu);
> -static inline long adjust_numa_imbalance(int imbalance, int nr_running);
> +static inline long adjust_numa_imbalance(int imbalance, int dst_running);
>
>  static inline enum
>  numa_type numa_classify(unsigned int imbalance_pct,
> @@ -8991,7 +8991,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>         }
>  }
>
> -static inline long adjust_numa_imbalance(int imbalance, int nr_running)
> +#define NUMA_IMBALANCE_MIN 2
> +
> +static inline long adjust_numa_imbalance(int imbalance, int dst_running)
>  {
>         unsigned int imbalance_min;
>
> @@ -8999,8 +9001,8 @@ static inline long adjust_numa_imbalance(int imbalance, int nr_running)
>          * Allow a small imbalance based on a simple pair of communicating
>          * tasks that remain local when the source domain is almost idle.
>          */
> -       imbalance_min = 2;
> -       if (nr_running <= imbalance_min)
> +       imbalance_min = NUMA_IMBALANCE_MIN;
> +       if (dst_running <= imbalance_min)
>                 return 0;
>
>         return imbalance;
> --
> 2.26.2
>


* Re: [PATCH 2/4] sched: Avoid unnecessary calculation of load imbalance at clone time
  2020-11-20  9:06 ` [PATCH 2/4] sched: Avoid unnecessary calculation of load imbalance at clone time Mel Gorman
@ 2020-11-20 13:32   ` Vincent Guittot
  2020-11-25 14:02   ` [tip: sched/core] " tip-bot2 for Mel Gorman
  1 sibling, 0 replies; 15+ messages in thread
From: Vincent Guittot @ 2020-11-20 13:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider, Juri Lelli, LKML

On Fri, 20 Nov 2020 at 10:06, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> In find_idlest_group(), the load imbalance is only relevant when the group
> is either overloaded or fully busy but it is calculated unconditionally.
> This patch moves the imbalance calculation to the context where it is required.
> Technically, it is a micro-optimisation but really the benefit is avoiding
> confusing one type of imbalance with another depending on the group_type
> in the next patch.
>
> No functional change.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>

> ---
>  kernel/sched/fair.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5fbed29e4001..9aded12aaa90 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8777,9 +8777,6 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>                         .group_type = group_overloaded,
>         };
>
> -       imbalance = scale_load_down(NICE_0_LOAD) *
> -                               (sd->imbalance_pct-100) / 100;
> -
>         do {
>                 int local_group;
>
> @@ -8833,6 +8830,11 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>         switch (local_sgs.group_type) {
>         case group_overloaded:
>         case group_fully_busy:
> +
> +               /* Calculate allowed imbalance based on load */
> +               imbalance = scale_load_down(NICE_0_LOAD) *
> +                               (sd->imbalance_pct-100) / 100;
> +
>                 /*
>                  * When comparing groups across NUMA domains, it's possible for
>                  * the local domain to be very lightly loaded relative to the
> --
> 2.26.2
>


* Re: [PATCH 3/4] sched/numa: Allow a floating imbalance between NUMA nodes
  2020-11-20  9:06 ` [PATCH 3/4] sched/numa: Allow a floating imbalance between NUMA nodes Mel Gorman
@ 2020-11-20 13:33   ` Vincent Guittot
  2020-11-25 14:02   ` [tip: sched/core] " tip-bot2 for Mel Gorman
  1 sibling, 0 replies; 15+ messages in thread
From: Vincent Guittot @ 2020-11-20 13:33 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider, Juri Lelli, LKML

On Fri, 20 Nov 2020 at 10:06, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> Currently, an imbalance is only allowed when a destination node
> is almost completely idle. This solved one basic class of problems
> and was the cautious approach.
>
> This patch revisits the possibility that NUMA nodes can be imbalanced
> until 25% of the CPUs are occupied. The reasoning behind 25% is somewhat
> superficial -- it's half the cores when HT is enabled.  At higher
> utilisations, balancing should continue as normal and keep things even
> until scheduler domains are fully busy or over utilised.
>
> Note that this is not expected to be a universal win. Any benchmark
> that prefers spreading as wide as possible with limited communication
> will favour the old behaviour as there is more memory bandwidth.
> Workloads that communicate heavily in pairs such as netperf or tbench
> benefit. For the tests I ran, the vast majority of workloads saw
> a benefit so it seems to be a worthwhile trade-off.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>

> ---
>  kernel/sched/fair.c | 21 +++++++++++----------
>  1 file changed, 11 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9aded12aaa90..e17e6c5da1d5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1550,7 +1550,8 @@ struct task_numa_env {
>  static unsigned long cpu_load(struct rq *rq);
>  static unsigned long cpu_runnable(struct rq *rq);
>  static unsigned long cpu_util(int cpu);
> -static inline long adjust_numa_imbalance(int imbalance, int dst_running);
> +static inline long adjust_numa_imbalance(int imbalance,
> +                                       int dst_running, int dst_weight);
>
>  static inline enum
>  numa_type numa_classify(unsigned int imbalance_pct,
> @@ -1930,7 +1931,8 @@ static void task_numa_find_cpu(struct task_numa_env *env,
>                 src_running = env->src_stats.nr_running - 1;
>                 dst_running = env->dst_stats.nr_running + 1;
>                 imbalance = max(0, dst_running - src_running);
> -               imbalance = adjust_numa_imbalance(imbalance, dst_running);
> +               imbalance = adjust_numa_imbalance(imbalance, dst_running,
> +                                                       env->dst_stats.weight);
>
>                 /* Use idle CPU if there is no imbalance */
>                 if (!imbalance) {
> @@ -8995,16 +8997,14 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>
>  #define NUMA_IMBALANCE_MIN 2
>
> -static inline long adjust_numa_imbalance(int imbalance, int dst_running)
> +static inline long adjust_numa_imbalance(int imbalance,
> +                               int dst_running, int dst_weight)
>  {
> -       unsigned int imbalance_min;
> -
>         /*
>          * Allow a small imbalance based on a simple pair of communicating
> -        * tasks that remain local when the source domain is almost idle.
> +        * tasks that remain local when the destination is lightly loaded.
>          */
> -       imbalance_min = NUMA_IMBALANCE_MIN;
> -       if (dst_running <= imbalance_min)
> +       if (dst_running < (dst_weight >> 2) && imbalance <= NUMA_IMBALANCE_MIN)
>                 return 0;
>
>         return imbalance;
> @@ -9107,9 +9107,10 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>                 }
>
>                 /* Consider allowing a small imbalance between NUMA groups */
> -               if (env->sd->flags & SD_NUMA)
> +               if (env->sd->flags & SD_NUMA) {
>                         env->imbalance = adjust_numa_imbalance(env->imbalance,
> -                                               busiest->sum_nr_running);
> +                               busiest->sum_nr_running, busiest->group_weight);
> +               }
>
>                 return;
>         }
> --
> 2.26.2
>


* Re: [PATCH 4/4] sched: Limit the amount of NUMA imbalance that can exist at fork time
  2020-11-20  9:06 ` [PATCH 4/4] sched: Limit the amount of NUMA imbalance that can exist at fork time Mel Gorman
@ 2020-11-20 13:33   ` Vincent Guittot
  2020-11-25 14:02   ` [tip: sched/core] " tip-bot2 for Mel Gorman
  1 sibling, 0 replies; 15+ messages in thread
From: Vincent Guittot @ 2020-11-20 13:33 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider, Juri Lelli, LKML

On Fri, 20 Nov 2020 at 10:06, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> At fork time currently, a local node can be allowed to fill completely,
> leaving the periodic load balancer to fix the problem. This can be
> problematic in cases where a task creates lots of threads that idle until
> woken as part of a worker pool, causing a memory bandwidth problem.
>
> However, a "real" workload suffers badly from this behaviour. The workload
> in question is mostly NUMA aware but spawns large numbers of threads
> that act as a worker pool that can be called from anywhere. These need
> to spread early to get reasonable behaviour.
>
> This patch limits how much a local node can fill before spilling over
> to another node and it will not be a universal win. Specifically,
> very short-lived workloads that fit within a NUMA node would prefer
> the memory bandwidth.
>
> As I cannot describe the "real" workload, the best proxy measure I found
> for illustration was a page fault microbenchmark. It's not representative
> of the workload but demonstrates the hazard of the current behaviour.
>
> pft timings
>                                  5.10.0-rc2             5.10.0-rc2
>                           imbalancefloat-v2          forkspread-v2
> Amean     elapsed-1        46.37 (   0.00%)       46.05 *   0.69%*
> Amean     elapsed-4        12.43 (   0.00%)       12.49 *  -0.47%*
> Amean     elapsed-7         7.61 (   0.00%)        7.55 *   0.81%*
> Amean     elapsed-12        4.79 (   0.00%)        4.80 (  -0.17%)
> Amean     elapsed-21        3.13 (   0.00%)        2.89 *   7.74%*
> Amean     elapsed-30        3.65 (   0.00%)        2.27 *  37.62%*
> Amean     elapsed-48        3.08 (   0.00%)        2.13 *  30.69%*
> Amean     elapsed-79        2.00 (   0.00%)        1.90 *   4.95%*
> Amean     elapsed-80        2.00 (   0.00%)        1.90 *   4.70%*
>
> This is showing the time to fault regions belonging to threads. The target
> machine has 80 logical CPUs and two nodes. Note the ~30% gain at
> approximately the point where one node becomes fully utilised. The
> slower results are borderline noise.
>
> Kernel building shows similar benefits around the same balance point.
> Generally performance was either neutral or better in the tests conducted.
> The main consideration with this patch is the point at which fork stops
> spreading a task; some workloads may benefit from different balance
> points, but it would be a risky tuning parameter.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>

> ---
>  kernel/sched/fair.c | 17 +++++++++++++++--
>  1 file changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e17e6c5da1d5..6d1c24708664 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8761,6 +8761,16 @@ static bool update_pick_idlest(struct sched_group *idlest,
>         return true;
>  }
>
> +/*
> + * Allow a NUMA imbalance if busy CPUs is less than 25% of the domain.
> + * This is an approximation as the number of running tasks may not be
> + * related to the number of busy CPUs due to sched_setaffinity.
> + */
> +static inline bool allow_numa_imbalance(int dst_running, int dst_weight)
> +{
> +       return (dst_running < (dst_weight >> 2));
> +}
> +
>  /*
>   * find_idlest_group() finds and returns the least busy CPU group within the
>   * domain.
> @@ -8893,7 +8903,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
>                          * a real need of migration, periodic load balance will
>                          * take care of it.
>                          */
> -                       if (local_sgs.idle_cpus)
> +                       if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight))
>                                 return NULL;
>                 }
>
> @@ -9000,11 +9010,14 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  static inline long adjust_numa_imbalance(int imbalance,
>                                 int dst_running, int dst_weight)
>  {
> +       if (!allow_numa_imbalance(dst_running, dst_weight))
> +               return imbalance;
> +
>         /*
>          * Allow a small imbalance based on a simple pair of communicating
>          * tasks that remain local when the destination is lightly loaded.
>          */
> -       if (dst_running < (dst_weight >> 2) && imbalance <= NUMA_IMBALANCE_MIN)
> +       if (imbalance <= NUMA_IMBALANCE_MIN)
>                 return 0;
>
>         return imbalance;
> --
> 2.26.2
>


* Re: [PATCH v3 0/4] Revisit NUMA imbalance tolerance and fork balancing
  2020-11-20 12:58 ` [PATCH v3 0/4] Revisit NUMA imbalance tolerance and fork balancing Peter Zijlstra
@ 2020-11-20 14:02   ` Mel Gorman
  0 siblings, 0 replies; 15+ messages in thread
From: Mel Gorman @ 2020-11-20 14:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Valentin Schneider, Juri Lelli, LKML

On Fri, Nov 20, 2020 at 01:58:11PM +0100, Peter Zijlstra wrote:
> On Fri, Nov 20, 2020 at 09:06:26AM +0000, Mel Gorman wrote:
> 
> > Mel Gorman (4):
> >   sched/numa: Rename nr_running and break out the magic number
> >   sched: Avoid unnecessary calculation of load imbalance at clone time
> >   sched/numa: Allow a floating imbalance between NUMA nodes
> >   sched: Limit the amount of NUMA imbalance that can exist at fork time
> > 
> >  kernel/sched/fair.c | 44 +++++++++++++++++++++++++++++++-------------
> >  1 file changed, 31 insertions(+), 13 deletions(-)
> 
> OK, let's give this another go :-)
> 

Weeeeeeeee!

My expectations are that NAS will show some glitches depending on the
subtest, core usage and whether the subtest prefers packing closely or
spreading wide, for both patches 3 and 4. I'm not *too* concerned about
that as HPC workloads are more likely to specify "places", be it OMP or
MPI. Ordinarily I would disagree with myself as NAS has been used as one
standard for scheduler behaviour and NUMA balancing in particular, but
the series favours allowing communicating tasks to remain local while
spreading for memory bandwidth when the number of busy CPUs is higher.
I think that's a reasonable balance.

In this case the main motive for patch 4 is the "real workload" that
is both memory and CPU intensive, runs on large machines and is latency
sensitive. I'm favouring the real workload over being able to pick a NAS
configuration that would show a regression.

The main negative corner case I'm anticipating is parallel loads (like
NAS) that are not memory bound, where the degree of parallelisation is
between 25% and 100% of one node's compute capacity. Once such a load
passes the 25% threshold, it'll start getting spread, which may manifest
as a regression, with patch 3 contributing slightly and patch 4
contributing a lot.

Communicating tasks like tbench with varying thread counts will show
minor gains and losses depending on thread count.

Single communicators like netperf or perfpipe should be ok or at
least within noise.

Hackbench should be fine because it typically saturates the machine so
any glitches there will likely be due to timing artifacts on initial
placement during clone.

Putting the predictions in writing should summon the regression demons
faster to prove me wrong due to Murphy's Law :P

Thanks.

-- 
Mel Gorman
SUSE Labs


* [tip: sched/core] sched: Avoid unnecessary calculation of load imbalance at clone time
  2020-11-20  9:06 ` [PATCH 2/4] sched: Avoid unnecessary calculation of load imbalance at clone time Mel Gorman
  2020-11-20 13:32   ` Vincent Guittot
@ 2020-11-25 14:02   ` tip-bot2 for Mel Gorman
  1 sibling, 0 replies; 15+ messages in thread
From: tip-bot2 for Mel Gorman @ 2020-11-25 14:02 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Mel Gorman, Peter Zijlstra (Intel), Vincent Guittot, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     5c339005f854fa75aa46078ad640919425658b3e
Gitweb:        https://git.kernel.org/tip/5c339005f854fa75aa46078ad640919425658b3e
Author:        Mel Gorman <mgorman@techsingularity.net>
AuthorDate:    Fri, 20 Nov 2020 09:06:28 
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 24 Nov 2020 16:47:47 +01:00

sched: Avoid unnecessary calculation of load imbalance at clone time

In find_idlest_group(), the load imbalance is only relevant when the group
is either overloaded or fully busy but it is calculated unconditionally.
This patch moves the imbalance calculation to the context where it is required.
Technically, it is a micro-optimisation but really the benefit is avoiding
confusing one type of imbalance with another depending on the group_type
in the next patch.

No functional change.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20201120090630.3286-3-mgorman@techsingularity.net
---
 kernel/sched/fair.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9d10abe..2626c6b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8777,9 +8777,6 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 			.group_type = group_overloaded,
 	};
 
-	imbalance = scale_load_down(NICE_0_LOAD) *
-				(sd->imbalance_pct-100) / 100;
-
 	do {
 		int local_group;
 
@@ -8833,6 +8830,11 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 	switch (local_sgs.group_type) {
 	case group_overloaded:
 	case group_fully_busy:
+
+		/* Calculate allowed imbalance based on load */
+		imbalance = scale_load_down(NICE_0_LOAD) *
+				(sd->imbalance_pct-100) / 100;
+
 		/*
 		 * When comparing groups across NUMA domains, it's possible for
 		 * the local domain to be very lightly loaded relative to the


* [tip: sched/core] sched/numa: Rename nr_running and break out the magic number
  2020-11-20  9:06 ` [PATCH 1/4] sched/numa: Rename nr_running and break out the magic number Mel Gorman
  2020-11-20 13:32   ` Vincent Guittot
@ 2020-11-25 14:02   ` tip-bot2 for Mel Gorman
  1 sibling, 0 replies; 15+ messages in thread
From: tip-bot2 for Mel Gorman @ 2020-11-25 14:02 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Mel Gorman, Peter Zijlstra (Intel), Vincent Guittot, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     abeae76a47005aa3f07c9be12d8076365622e25c
Gitweb:        https://git.kernel.org/tip/abeae76a47005aa3f07c9be12d8076365622e25c
Author:        Mel Gorman <mgorman@techsingularity.net>
AuthorDate:    Fri, 20 Nov 2020 09:06:27 
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 24 Nov 2020 16:47:47 +01:00

sched/numa: Rename nr_running and break out the magic number

This is simply a preparation patch to make the following patches easier
to read. No functional change.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20201120090630.3286-2-mgorman@techsingularity.net
---
 kernel/sched/fair.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6691e28..9d10abe 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1559,7 +1559,7 @@ struct task_numa_env {
 static unsigned long cpu_load(struct rq *rq);
 static unsigned long cpu_runnable(struct rq *rq);
 static unsigned long cpu_util(int cpu);
-static inline long adjust_numa_imbalance(int imbalance, int nr_running);
+static inline long adjust_numa_imbalance(int imbalance, int dst_running);
 
 static inline enum
 numa_type numa_classify(unsigned int imbalance_pct,
@@ -8991,7 +8991,9 @@ next_group:
 	}
 }
 
-static inline long adjust_numa_imbalance(int imbalance, int nr_running)
+#define NUMA_IMBALANCE_MIN 2
+
+static inline long adjust_numa_imbalance(int imbalance, int dst_running)
 {
 	unsigned int imbalance_min;
 
@@ -8999,8 +9001,8 @@ static inline long adjust_numa_imbalance(int imbalance, int nr_running)
 	 * Allow a small imbalance based on a simple pair of communicating
 	 * tasks that remain local when the source domain is almost idle.
 	 */
-	imbalance_min = 2;
-	if (nr_running <= imbalance_min)
+	imbalance_min = NUMA_IMBALANCE_MIN;
+	if (dst_running <= imbalance_min)
 		return 0;
 
 	return imbalance;


* [tip: sched/core] sched/numa: Allow a floating imbalance between NUMA nodes
  2020-11-20  9:06 ` [PATCH 3/4] sched/numa: Allow a floating imbalance between NUMA nodes Mel Gorman
  2020-11-20 13:33   ` Vincent Guittot
@ 2020-11-25 14:02   ` tip-bot2 for Mel Gorman
  1 sibling, 0 replies; 15+ messages in thread
From: tip-bot2 for Mel Gorman @ 2020-11-25 14:02 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Mel Gorman, Peter Zijlstra (Intel), Vincent Guittot, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     7d2b5dd0bcc48095651f1b85f751eef610b3e034
Gitweb:        https://git.kernel.org/tip/7d2b5dd0bcc48095651f1b85f751eef610b3e034
Author:        Mel Gorman <mgorman@techsingularity.net>
AuthorDate:    Fri, 20 Nov 2020 09:06:29 
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 24 Nov 2020 16:47:47 +01:00

sched/numa: Allow a floating imbalance between NUMA nodes

Currently, an imbalance is only allowed when a destination node
is almost completely idle. This solved one basic class of problems
and was the cautious approach.

This patch revisits the possibility that NUMA nodes can be imbalanced
until 25% of the CPUs are occupied. The reasoning behind 25% is somewhat
superficial -- it's half the cores when HT is enabled.  At higher
utilisations, balancing should continue as normal and keep things even
until scheduler domains are fully busy or over utilised.

Note that this is not expected to be a universal win. Any benchmark
that prefers spreading as wide as possible with limited communication
will favour the old behaviour as there is more memory bandwidth.
Workloads that communicate heavily in pairs such as netperf or tbench
benefit. For the tests I ran, the vast majority of workloads saw
a benefit so it seems to be a worthwhile trade-off.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20201120090630.3286-4-mgorman@techsingularity.net
---
 kernel/sched/fair.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2626c6b..377c77b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1559,7 +1559,8 @@ struct task_numa_env {
 static unsigned long cpu_load(struct rq *rq);
 static unsigned long cpu_runnable(struct rq *rq);
 static unsigned long cpu_util(int cpu);
-static inline long adjust_numa_imbalance(int imbalance, int dst_running);
+static inline long adjust_numa_imbalance(int imbalance,
+					int dst_running, int dst_weight);
 
 static inline enum
 numa_type numa_classify(unsigned int imbalance_pct,
@@ -1939,7 +1940,8 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 		src_running = env->src_stats.nr_running - 1;
 		dst_running = env->dst_stats.nr_running + 1;
 		imbalance = max(0, dst_running - src_running);
-		imbalance = adjust_numa_imbalance(imbalance, dst_running);
+		imbalance = adjust_numa_imbalance(imbalance, dst_running,
+							env->dst_stats.weight);
 
 		/* Use idle CPU if there is no imbalance */
 		if (!imbalance) {
@@ -8995,16 +8997,14 @@ next_group:
 
 #define NUMA_IMBALANCE_MIN 2
 
-static inline long adjust_numa_imbalance(int imbalance, int dst_running)
+static inline long adjust_numa_imbalance(int imbalance,
+				int dst_running, int dst_weight)
 {
-	unsigned int imbalance_min;
-
 	/*
 	 * Allow a small imbalance based on a simple pair of communicating
-	 * tasks that remain local when the source domain is almost idle.
+	 * tasks that remain local when the destination is lightly loaded.
 	 */
-	imbalance_min = NUMA_IMBALANCE_MIN;
-	if (dst_running <= imbalance_min)
+	if (dst_running < (dst_weight >> 2) && imbalance <= NUMA_IMBALANCE_MIN)
 		return 0;
 
 	return imbalance;
@@ -9106,9 +9106,10 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		}
 
 		/* Consider allowing a small imbalance between NUMA groups */
-		if (env->sd->flags & SD_NUMA)
+		if (env->sd->flags & SD_NUMA) {
 			env->imbalance = adjust_numa_imbalance(env->imbalance,
-						busiest->sum_nr_running);
+				busiest->sum_nr_running, busiest->group_weight);
+		}
 
 		return;
 	}


* [tip: sched/core] sched: Limit the amount of NUMA imbalance that can exist at fork time
  2020-11-20  9:06 ` [PATCH 4/4] sched: Limit the amount of NUMA imbalance that can exist at fork time Mel Gorman
  2020-11-20 13:33   ` Vincent Guittot
@ 2020-11-25 14:02   ` tip-bot2 for Mel Gorman
  1 sibling, 0 replies; 15+ messages in thread
From: tip-bot2 for Mel Gorman @ 2020-11-25 14:02 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Mel Gorman, Peter Zijlstra (Intel), Vincent Guittot, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     23e6082a522e32232f7377540b4d42d8304253b8
Gitweb:        https://git.kernel.org/tip/23e6082a522e32232f7377540b4d42d8304253b8
Author:        Mel Gorman <mgorman@techsingularity.net>
AuthorDate:    Fri, 20 Nov 2020 09:06:30 
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 24 Nov 2020 16:47:48 +01:00

sched: Limit the amount of NUMA imbalance that can exist at fork time

At fork time currently, a local node can be allowed to fill completely,
leaving the periodic load balancer to fix the problem. This can be
problematic in cases where a task creates lots of threads that idle until
woken as part of a worker pool, causing a memory bandwidth problem.

However, a "real" workload suffers badly from this behaviour. The workload
in question is mostly NUMA aware but spawns large numbers of threads
that act as a worker pool that can be called from anywhere. These need
to spread early to get reasonable behaviour.

This patch limits how much a local node can fill before spilling over
to another node and it will not be a universal win. Specifically,
very short-lived workloads that fit within a NUMA node would prefer
the memory bandwidth.

As I cannot describe the "real" workload, the best proxy measure I found
for illustration was a page fault microbenchmark. It's not representative
of the workload but demonstrates the hazard of the current behaviour.

pft timings
                                 5.10.0-rc2             5.10.0-rc2
                          imbalancefloat-v2          forkspread-v2
Amean     elapsed-1        46.37 (   0.00%)       46.05 *   0.69%*
Amean     elapsed-4        12.43 (   0.00%)       12.49 *  -0.47%*
Amean     elapsed-7         7.61 (   0.00%)        7.55 *   0.81%*
Amean     elapsed-12        4.79 (   0.00%)        4.80 (  -0.17%)
Amean     elapsed-21        3.13 (   0.00%)        2.89 *   7.74%*
Amean     elapsed-30        3.65 (   0.00%)        2.27 *  37.62%*
Amean     elapsed-48        3.08 (   0.00%)        2.13 *  30.69%*
Amean     elapsed-79        2.00 (   0.00%)        1.90 *   4.95%*
Amean     elapsed-80        2.00 (   0.00%)        1.90 *   4.70%*

This is showing the time to fault regions belonging to threads. The target
machine has 80 logical CPUs and two nodes. Note the ~30% gain at
approximately the point where one node becomes fully utilised. The
slower results are borderline noise.

Kernel building shows similar benefits around the same balance point.
Generally performance was either neutral or better in the tests conducted.
The main consideration with this patch is the point at which fork stops
spreading a task; some workloads may benefit from different balance
points, but it would be a risky tuning parameter.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20201120090630.3286-5-mgorman@techsingularity.net
---
 kernel/sched/fair.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 377c77b..2e8aade 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8762,6 +8762,16 @@ static bool update_pick_idlest(struct sched_group *idlest,
 }
 
 /*
+ * Allow a NUMA imbalance if busy CPUs is less than 25% of the domain.
+ * This is an approximation as the number of running tasks may not be
+ * related to the number of busy CPUs due to sched_setaffinity.
+ */
+static inline bool allow_numa_imbalance(int dst_running, int dst_weight)
+{
+	return (dst_running < (dst_weight >> 2));
+}
+
+/*
  * find_idlest_group() finds and returns the least busy CPU group within the
  * domain.
  *
@@ -8893,7 +8903,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 			 * a real need of migration, periodic load balance will
 			 * take care of it.
 			 */
-			if (local_sgs.idle_cpus)
+			if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight))
 				return NULL;
 		}
 
@@ -9000,11 +9010,14 @@ next_group:
 static inline long adjust_numa_imbalance(int imbalance,
 				int dst_running, int dst_weight)
 {
+	if (!allow_numa_imbalance(dst_running, dst_weight))
+		return imbalance;
+
 	/*
 	 * Allow a small imbalance based on a simple pair of communicating
 	 * tasks that remain local when the destination is lightly loaded.
 	 */
-	if (dst_running < (dst_weight >> 2) && imbalance <= NUMA_IMBALANCE_MIN)
+	if (imbalance <= NUMA_IMBALANCE_MIN)
 		return 0;
 
 	return imbalance;


