* [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
@ 2020-01-14 10:13 Mel Gorman
  2020-01-16 16:35 ` Mel Gorman
                   ` (6 more replies)
  0 siblings, 7 replies; 24+ messages in thread
From: Mel Gorman @ 2020-01-14 10:13 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Phil Auld, Ingo Molnar, Peter Zijlstra, Valentin Schneider,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML,
	Mel Gorman

Changelog since V3
o Allow a fixed imbalance based on a basic comparison with 2 tasks. This
  turned out to be as good as or better than allowing an imbalance based
  on the group weight, without worrying about potential spillover into
  the lower scheduler domains.

Changelog since V2
o Only allow a small imbalance when utilisation is low to address reports that
  higher utilisation workloads were hitting corner cases.

Changelog since V1
o Alter code flow 						vincent.guittot
o Use idle CPUs for comparison instead of sum_nr_running	vincent.guittot
o Note that the division is still in place. Without it and taking
  imbalance_adj into account before the cutoff, two NUMA domains
  do not converge as being equally balanced when the number of
  busy tasks equals the size of one domain (50% of the sum).

The CPU load balancer balances between different domains to spread load
and strives to have equal balance everywhere. Communicating tasks can
migrate so they are topologically close to each other but these decisions
are independent. On a lightly loaded NUMA machine, two communicating tasks
pulled together at wakeup time can be pushed apart by the load balancer.
In isolation, the load balancer decision is fine but it ignores the tasks'
data locality and the wakeup/LB paths continually conflict. NUMA balancing
is also a factor, but it too simply conflicts with the load balancer.

This patch allows a fixed degree of imbalance of two tasks to exist
between NUMA domains regardless of utilisation levels. In many cases,
this prevents communicating tasks being pulled apart. It was evaluated
whether the imbalance should be scaled to the domain size. However, no
additional benefit was measured across a range of workloads and machines
and scaling adds the risk that lower domains have to be rebalanced. While
this could change again in the future, such a change should specify the
use case and benefit.
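
As an illustration only, the rule the patch applies reduces to the sketch
below. This is a standalone model, not kernel code: group_stats,
numa_imbalance() and IMBALANCE_MIN are made-up names that loosely mirror
sg_lb_stats and calculate_imbalance() in kernel/sched/fair.c (see the
hunk below).

/* Standalone sketch of the allowed NUMA imbalance. */
struct group_stats {
	unsigned int sum_nr_running;	/* tasks running in the group */
	unsigned int idle_cpus;		/* idle CPUs in the group */
};

#define IMBALANCE_MIN	2	/* tasks allowed to stay imbalanced */

long numa_imbalance(const struct group_stats *local,
		    const struct group_stats *busiest)
{
	/* No overload: try to even out the number of idle CPUs. */
	long diff = (long)local->idle_cpus - (long)busiest->idle_cpus;
	long imbalance = diff > 0 ? diff / 2 : 0;

	/*
	 * Ignore the imbalance entirely when the busiest node runs no
	 * more than a single pair of (presumably communicating) tasks.
	 */
	if (busiest->sum_nr_running <= IMBALANCE_MIN)
		imbalance = 0;

	return imbalance;
}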

The most obvious impact is on netperf TCP_STREAM -- two simple
communicating tasks with some softirq offload depending on the
transmission rate.

2-socket Haswell machine 48 core, HT enabled
netperf-tcp -- mmtests config config-network-netperf-unbound
                       	      baseline              lbnuma-v3
Hmean     64         568.73 (   0.00%)      577.56 *   1.55%*
Hmean     128       1089.98 (   0.00%)     1128.06 *   3.49%*
Hmean     256       2061.72 (   0.00%)     2104.39 *   2.07%*
Hmean     1024      7254.27 (   0.00%)     7557.52 *   4.18%*
Hmean     2048     11729.20 (   0.00%)    13350.67 *  13.82%*
Hmean     3312     15309.08 (   0.00%)    18058.95 *  17.96%*
Hmean     4096     17338.75 (   0.00%)    20483.66 *  18.14%*
Hmean     8192     25047.12 (   0.00%)    27806.84 *  11.02%*
Hmean     16384    27359.55 (   0.00%)    33071.88 *  20.88%*
Stddev    64           2.16 (   0.00%)        2.02 (   6.53%)
Stddev    128          2.31 (   0.00%)        2.19 (   5.05%)
Stddev    256         11.88 (   0.00%)        3.22 (  72.88%)
Stddev    1024        23.68 (   0.00%)        7.24 (  69.43%)
Stddev    2048        79.46 (   0.00%)       71.49 (  10.03%)
Stddev    3312        26.71 (   0.00%)       57.80 (-116.41%)
Stddev    4096       185.57 (   0.00%)       96.15 (  48.19%)
Stddev    8192       245.80 (   0.00%)      100.73 (  59.02%)
Stddev    16384      207.31 (   0.00%)      141.65 (  31.67%)

In this case, there was a sizable improvement to performance and
a general reduction in variance. However, this is not universal.
For most machines, the impact was roughly a 3% performance gain.

Ops NUMA base-page range updates       19796.00         292.00
Ops NUMA PTE updates                   19796.00         292.00
Ops NUMA PMD updates                       0.00           0.00
Ops NUMA hint faults                   16113.00         143.00
Ops NUMA hint local faults %            8407.00         142.00
Ops NUMA hint local percent               52.18          99.30
Ops NUMA pages migrated                 4244.00           1.00

Without the patch, only 52.18% of sampled accesses are local.  In an
earlier changelog, 100% of sampled accesses were local and indeed on
most machines, this was still the case. In this specific case, the
local sampled rate was 99.3% but note the "base-page range updates"
and "PTE updates".  The activity with the patch is negligible, as was
the number of faults. The small number of pages migrated were related to
shared libraries.  A 2-socket Broadwell showed better results on average
but they are not presented for brevity as the performance was similar,
except that it showed 100% of the sampled NUMA hints were local. The patch
holds up for a 4-socket Haswell, an AMD EPYC and an AMD EPYC 2 machine.

For dbench, the impact depends on the filesystem used and the number of
clients. On XFS, there is little difference as the clients typically
communicate with workqueues which have a separate class of scheduler
problem at the moment. For ext4, performance is generally better,
particularly for small numbers of clients as NUMA balancing activity is
negligible with the patch applied.

A more interesting example is the Facebook schbench which uses a
number of messaging threads to communicate with worker threads. In this
configuration, one messaging thread is used per NUMA node and the number of
worker threads is varied. The 50, 75, 90, 95, 99, 99.5 and 99.9 percentiles
for response latency are then reported.

Lat 50.00th-qrtle-1        44.00 (   0.00%)       37.00 (  15.91%)
Lat 75.00th-qrtle-1        53.00 (   0.00%)       41.00 (  22.64%)
Lat 90.00th-qrtle-1        57.00 (   0.00%)       42.00 (  26.32%)
Lat 95.00th-qrtle-1        63.00 (   0.00%)       43.00 (  31.75%)
Lat 99.00th-qrtle-1        76.00 (   0.00%)       51.00 (  32.89%)
Lat 99.50th-qrtle-1        89.00 (   0.00%)       52.00 (  41.57%)
Lat 99.90th-qrtle-1        98.00 (   0.00%)       55.00 (  43.88%)
Lat 50.00th-qrtle-2        42.00 (   0.00%)       42.00 (   0.00%)
Lat 75.00th-qrtle-2        48.00 (   0.00%)       47.00 (   2.08%)
Lat 90.00th-qrtle-2        53.00 (   0.00%)       52.00 (   1.89%)
Lat 95.00th-qrtle-2        55.00 (   0.00%)       53.00 (   3.64%)
Lat 99.00th-qrtle-2        62.00 (   0.00%)       60.00 (   3.23%)
Lat 99.50th-qrtle-2        63.00 (   0.00%)       63.00 (   0.00%)
Lat 99.90th-qrtle-2        68.00 (   0.00%)       66.00 (   2.94%)

For higher worker thread counts, the differences become negligible but it's
interesting to note the difference in wakeup latency at low utilisation
and mpstat confirms that activity was almost all on one node until
the number of worker threads increases.

Hackbench generally showed neutral results across a range of machines.
This is different to earlier versions of the patch which allowed imbalances
for higher degrees of utilisation. perf bench pipe showed negligible
differences in overall performance as the differences are very close to
the noise.

An earlier prototype of the patch showed major regressions for NAS C-class
when running with only half of the available CPUs -- 20-30% performance
hits were measured at the time. With this version of the patch, the impact
is negligible with small gains/losses within the noise measured. This is
because the number of threads far exceeds the small imbalance the patch
cares about. Similarly, there were reports of regressions for the autonuma
benchmark against earlier versions but again, normal load balancing now
applies for that workload.

In general, the patch simply seeks to avoid unnecessary cross-node
migrations in the basic case where imbalances are very small.  For low
utilisation communicating workloads, this patch generally behaves better
with less NUMA balancing activity. For high utilisation, there is no
change in behaviour.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 41 +++++++++++++++++++++++++++++------------
 1 file changed, 29 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ba749f579714..ade7a8dca5e4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8648,10 +8648,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	/*
 	 * Try to use spare capacity of local group without overloading it or
 	 * emptying busiest.
-	 * XXX Spreading tasks across NUMA nodes is not always the best policy
-	 * and special care should be taken for SD_NUMA domain level before
-	 * spreading the tasks. For now, load_balance() fully relies on
-	 * NUMA_BALANCING and fbq_classify_group/rq to override the decision.
 	 */
 	if (local->group_type == group_has_spare) {
 		if (busiest->group_type > group_fully_busy) {
@@ -8691,16 +8687,37 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 			env->migration_type = migrate_task;
 			lsub_positive(&nr_diff, local->sum_nr_running);
 			env->imbalance = nr_diff >> 1;
-			return;
-		}
+		} else {
 
-		/*
-		 * If there is no overload, we just want to even the number of
-		 * idle cpus.
-		 */
-		env->migration_type = migrate_task;
-		env->imbalance = max_t(long, 0, (local->idle_cpus -
+			/*
+			 * If there is no overload, we just want to even the number of
+			 * idle cpus.
+			 */
+			env->migration_type = migrate_task;
+			env->imbalance = max_t(long, 0, (local->idle_cpus -
 						 busiest->idle_cpus) >> 1);
+		}
+
+		/* Consider allowing a small imbalance between NUMA groups */
+		if (env->sd->flags & SD_NUMA) {
+			unsigned int imbalance_min;
+
+			/*
+			 * Compute an allowed imbalance based on a simple
+			 * pair of communicating tasks that should remain
+			 * local and ignore them.
+			 *
+			 * NOTE: Generally this would have been based on
+			 * the domain size and this was evaluated. However,
+			 * the benefit is similar across a range of workloads
+			 * and machines but scaling by the domain size adds
+			 * the risk that lower domains have to be rebalanced.
+			 */
+			imbalance_min = 2;
+			if (busiest->sum_nr_running <= imbalance_min)
+				env->imbalance = 0;
+		}
+
 		return;
 	}
 

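To make the new SD_NUMA branch concrete, here is a hypothetical worked
example. The numbers are invented for illustration (a 2-socket box with
24 CPUs per node and a single netperf pair on the busiest node); they are
not taken from the results above.

#include <assert.h>
#include <stdio.h>

int main(void)
{
	long local_idle = 24, busiest_idle = 22;
	long busiest_nr_running = 2, imbalance_min = 2;

	/* Pre-patch: even out idle CPUs, pulling one task cross-node. */
	long imbalance = (local_idle - busiest_idle) >> 1;
	assert(imbalance == 1);

	/* Post-patch: a lone pair of tasks is below the cutoff. */
	if (busiest_nr_running <= imbalance_min)
		imbalance = 0;

	printf("allowed imbalance: %ld\n", imbalance);	/* prints 0 */
	return 0;
}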

* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
  2020-01-14 10:13 [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4 Mel Gorman
@ 2020-01-16 16:35 ` Mel Gorman
  2020-01-17 13:08   ` Vincent Guittot
  2020-01-17 14:37   ` Valentin Schneider
  2020-01-17 13:16 ` Vincent Guittot
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 24+ messages in thread
From: Mel Gorman @ 2020-01-16 16:35 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Phil Auld, Ingo Molnar, Peter Zijlstra, Valentin Schneider,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML

On Tue, Jan 14, 2020 at 10:13:20AM +0000, Mel Gorman wrote:
> Changelog since V3
> o Allow a fixed imbalance based on a basic comparison with 2 tasks. This
>   turned out to be as good as or better than allowing an imbalance based
>   on the group weight, without worrying about potential spillover into
>   the lower scheduler domains.
> 
> Changelog since V2
> o Only allow a small imbalance when utilisation is low to address reports that
>   higher utilisation workloads were hitting corner cases.
> 
> Changelog since V1
> o Alter code flow 						vincent.guittot
> o Use idle CPUs for comparison instead of sum_nr_running	vincent.guittot
> o Note that the division is still in place. Without it and taking
>   imbalance_adj into account before the cutoff, two NUMA domains
>   do not converge as being equally balanced when the number of
>   busy tasks equals the size of one domain (50% of the sum).
> 
> The CPU load balancer balances between different domains to spread load
> and strives to have equal balance everywhere. Communicating tasks can
> migrate so they are topologically close to each other but these decisions
> are independent. On a lightly loaded NUMA machine, two communicating tasks
> pulled together at wakeup time can be pushed apart by the load balancer.
> In isolation, the load balancer decision is fine but it ignores the tasks'
> data locality and the wakeup/LB paths continually conflict. NUMA balancing
> is also a factor, but it too simply conflicts with the load balancer.
> 
> This patch allows a fixed degree of imbalance of two tasks to exist
> between NUMA domains regardless of utilisation levels. In many cases,
> this prevents communicating tasks being pulled apart. It was evaluated
> whether the imbalance should be scaled to the domain size. However, no
> additional benefit was measured across a range of workloads and machines
> and scaling adds the risk that lower domains have to be rebalanced. While
> this could change again in the future, such a change should specify the
> use case and benefit.
> 

Any thoughts on whether this is ok for tip or are there suggestions on
an alternative approach?

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
  2020-01-16 16:35 ` Mel Gorman
@ 2020-01-17 13:08   ` Vincent Guittot
  2020-01-17 14:15     ` Mel Gorman
  2020-01-17 14:23     ` Phil Auld
  2020-01-17 14:37   ` Valentin Schneider
  1 sibling, 2 replies; 24+ messages in thread
From: Vincent Guittot @ 2020-01-17 13:08 UTC (permalink / raw)
  To: Mel Gorman, Phil Auld
  Cc: Ingo Molnar, Peter Zijlstra, Valentin Schneider,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML

Hi Mel,


On Thu, 16 Jan 2020 at 17:35, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Tue, Jan 14, 2020 at 10:13:20AM +0000, Mel Gorman wrote:
> > Changelog since V3
> > o Allow a fixed imbalance based on a basic comparison with 2 tasks. This
> >   turned out to be as good as or better than allowing an imbalance based
> >   on the group weight, without worrying about potential spillover into
> >   the lower scheduler domains.
> >
> > Changelog since V2
> > o Only allow a small imbalance when utilisation is low to address reports that
> >   higher utilisation workloads were hitting corner cases.
> >
> > Changelog since V1
> > o Alter code flow                                             vincent.guittot
> > o Use idle CPUs for comparison instead of sum_nr_running      vincent.guittot
> > o Note that the division is still in place. Without it and taking
> >   imbalance_adj into account before the cutoff, two NUMA domains
> >   do not converge as being equally balanced when the number of
> >   busy tasks equals the size of one domain (50% of the sum).
> >
> > The CPU load balancer balances between different domains to spread load
> > and strives to have equal balance everywhere. Communicating tasks can
> > migrate so they are topologically close to each other but these decisions
> > are independent. On a lightly loaded NUMA machine, two communicating tasks
> > pulled together at wakeup time can be pushed apart by the load balancer.
> > In isolation, the load balancer decision is fine but it ignores the tasks'
> > data locality and the wakeup/LB paths continually conflict. NUMA balancing
> > is also a factor, but it too simply conflicts with the load balancer.
> >
> > This patch allows a fixed degree of imbalance of two tasks to exist
> > between NUMA domains regardless of utilisation levels. In many cases,
> > this prevents communicating tasks being pulled apart. It was evaluated
> > whether the imbalance should be scaled to the domain size. However, no
> > additional benefit was measured across a range of workloads and machines
> > and scaling adds the risk that lower domains have to be rebalanced. While
> > this could change again in the future, such a change should specify the
> > use case and benefit.
> >
>
> Any thoughts on whether this is ok for tip or are there suggestions on
> an alternative approach?

I have just finished running some tests on my system with your patch
and I haven't seen any noticeable changes so far, which was a bit
expected. The tests that I usually run use more than 4 tasks on my
2-node system; the only exception is perf sched pipe, and the results
for this test stay the same with and without your patch. I'm curious
whether this impacts Phil's tests, which run the LU.c benchmark with
some CPU-burning tasks.

>
> --
> Mel Gorman
> SUSE Labs


* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
  2020-01-14 10:13 [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4 Mel Gorman
  2020-01-16 16:35 ` Mel Gorman
@ 2020-01-17 13:16 ` Vincent Guittot
  2020-01-17 14:26   ` Mel Gorman
  2020-01-17 15:09 ` Vincent Guittot
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 24+ messages in thread
From: Vincent Guittot @ 2020-01-17 13:16 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Phil Auld, Ingo Molnar, Peter Zijlstra, Valentin Schneider,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML

On Tue, 14 Jan 2020 at 11:13, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> Changelog since V3
> o Allow a fixed imbalance based on a basic comparison with 2 tasks. This
>   turned out to be as good as or better than allowing an imbalance based
>   on the group weight, without worrying about potential spillover into
>   the lower scheduler domains.
>
> Changelog since V2
> o Only allow a small imbalance when utilisation is low to address reports that
>   higher utilisation workloads were hitting corner cases.
>
> Changelog since V1
> o Alter code flow                                               vincent.guittot
> o Use idle CPUs for comparison instead of sum_nr_running        vincent.guittot
> o Note that the division is still in place. Without it and taking
>   imbalance_adj into account before the cutoff, two NUMA domains
>   do not converge as being equally balanced when the number of
>   busy tasks equals the size of one domain (50% of the sum).
>
> The CPU load balancer balances between different domains to spread load
> and strives to have equal balance everywhere. Communicating tasks can
> migrate so they are topologically close to each other but these decisions
> are independent. On a lightly loaded NUMA machine, two communicating tasks
> pulled together at wakeup time can be pushed apart by the load balancer.
> In isolation, the load balancer decision is fine but it ignores the tasks'
> data locality and the wakeup/LB paths continually conflict. NUMA balancing
> is also a factor, but it too simply conflicts with the load balancer.
>
> This patch allows a fixed degree of imbalance of two tasks to exist
> between NUMA domains regardless of utilisation levels. In many cases,
> this prevents communicating tasks being pulled apart. It was evaluated
> whether the imbalance should be scaled to the domain size. However, no
> additional benefit was measured across a range of workloads and machines
> and scaling adds the risk that lower domains have to be rebalanced. While
> this could change again in the future, such a change should specify the
> use case and benefit.
>
> The most obvious impact is on netperf TCP_STREAM -- two simple
> communicating tasks with some softirq offload depending on the
> transmission rate.
>
> 2-socket Haswell machine 48 core, HT enabled
> netperf-tcp -- mmtests config config-network-netperf-unbound
>                               baseline              lbnuma-v3
> Hmean     64         568.73 (   0.00%)      577.56 *   1.55%*
> Hmean     128       1089.98 (   0.00%)     1128.06 *   3.49%*
> Hmean     256       2061.72 (   0.00%)     2104.39 *   2.07%*
> Hmean     1024      7254.27 (   0.00%)     7557.52 *   4.18%*
> Hmean     2048     11729.20 (   0.00%)    13350.67 *  13.82%*
> Hmean     3312     15309.08 (   0.00%)    18058.95 *  17.96%*
> Hmean     4096     17338.75 (   0.00%)    20483.66 *  18.14%*
> Hmean     8192     25047.12 (   0.00%)    27806.84 *  11.02%*
> Hmean     16384    27359.55 (   0.00%)    33071.88 *  20.88%*
> Stddev    64           2.16 (   0.00%)        2.02 (   6.53%)
> Stddev    128          2.31 (   0.00%)        2.19 (   5.05%)
> Stddev    256         11.88 (   0.00%)        3.22 (  72.88%)
> Stddev    1024        23.68 (   0.00%)        7.24 (  69.43%)
> Stddev    2048        79.46 (   0.00%)       71.49 (  10.03%)
> Stddev    3312        26.71 (   0.00%)       57.80 (-116.41%)
> Stddev    4096       185.57 (   0.00%)       96.15 (  48.19%)
> Stddev    8192       245.80 (   0.00%)      100.73 (  59.02%)
> Stddev    16384      207.31 (   0.00%)      141.65 (  31.67%)
>
> In this case, there was a sizable improvement to performance and
> a general reduction in variance. However, this is not universal.
> For most machines, the impact was roughly a 3% performance gain.
>
> Ops NUMA base-page range updates       19796.00         292.00
> Ops NUMA PTE updates                   19796.00         292.00
> Ops NUMA PMD updates                       0.00           0.00
> Ops NUMA hint faults                   16113.00         143.00
> Ops NUMA hint local faults %            8407.00         142.00
> Ops NUMA hint local percent               52.18          99.30
> Ops NUMA pages migrated                 4244.00           1.00
>
> Without the patch, only 52.18% of sampled accesses are local.  In an
> earlier changelog, 100% of sampled accesses were local and indeed on
> most machines, this was still the case. In this specific case, the
> local sampled rate was 99.3% but note the "base-page range updates"
> and "PTE updates".  The activity with the patch is negligible, as was
> the number of faults. The small number of pages migrated were related to
> shared libraries.  A 2-socket Broadwell showed better results on average
> but they are not presented for brevity as the performance was similar,
> except that it showed 100% of the sampled NUMA hints were local. The patch
> holds up for a 4-socket Haswell, an AMD EPYC and an AMD EPYC 2 machine.
>
> For dbench, the impact depends on the filesystem used and the number of
> clients. On XFS, there is little difference as the clients typically
> communicate with workqueues which have a separate class of scheduler
> problem at the moment. For ext4, performance is generally better,
> particularly for small numbers of clients as NUMA balancing activity is
> negligible with the patch applied.
>
> A more interesting example is the Facebook schbench which uses a
> number of messaging threads to communicate with worker threads. In this
> configuration, one messaging thread is used per NUMA node and the number of
> worker threads is varied. The 50, 75, 90, 95, 99, 99.5 and 99.9 percentiles
> for response latency are then reported.
>
> Lat 50.00th-qrtle-1        44.00 (   0.00%)       37.00 (  15.91%)
> Lat 75.00th-qrtle-1        53.00 (   0.00%)       41.00 (  22.64%)
> Lat 90.00th-qrtle-1        57.00 (   0.00%)       42.00 (  26.32%)
> Lat 95.00th-qrtle-1        63.00 (   0.00%)       43.00 (  31.75%)
> Lat 99.00th-qrtle-1        76.00 (   0.00%)       51.00 (  32.89%)
> Lat 99.50th-qrtle-1        89.00 (   0.00%)       52.00 (  41.57%)
> Lat 99.90th-qrtle-1        98.00 (   0.00%)       55.00 (  43.88%)

Which parameter changes between the above and below tests?

> Lat 50.00th-qrtle-2        42.00 (   0.00%)       42.00 (   0.00%)
> Lat 75.00th-qrtle-2        48.00 (   0.00%)       47.00 (   2.08%)
> Lat 90.00th-qrtle-2        53.00 (   0.00%)       52.00 (   1.89%)
> Lat 95.00th-qrtle-2        55.00 (   0.00%)       53.00 (   3.64%)
> Lat 99.00th-qrtle-2        62.00 (   0.00%)       60.00 (   3.23%)
> Lat 99.50th-qrtle-2        63.00 (   0.00%)       63.00 (   0.00%)
> Lat 99.90th-qrtle-2        68.00 (   0.00%)       66.00 (   2.94%)
>
> For higher worker thread counts, the differences become negligible but it's
> interesting to note the difference in wakeup latency at low utilisation
> and mpstat confirms that activity was almost all on one node until
> the number of worker threads increases.
>
> Hackbench generally showed neutral results across a range of machines.
> This is different to earlier versions of the patch which allowed imbalances
> for higher degrees of utilisation. perf bench pipe showed negligible
> differences in overall performance as the differences are very close to
> the noise.
>
> An earlier prototype of the patch showed major regressions for NAS C-class
> when running with only half of the available CPUs -- 20-30% performance
> hits were measured at the time. With this version of the patch, the impact
> is negligible with small gains/losses within the noise measured. This is
> because the number of threads far exceeds the small imbalance the patch
> cares about. Similarly, there were reports of regressions for the autonuma
> benchmark against earlier versions but again, normal load balancing now
> applies for that workload.
>
> In general, the patch simply seeks to avoid unnecessary cross-node
> migrations in the basic case where imbalances are very small.  For low
> utilisation communicating workloads, this patch generally behaves better
> with less NUMA balancing activity. For high utilisation, there is no
> change in behaviour.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  kernel/sched/fair.c | 41 +++++++++++++++++++++++++++++------------
>  1 file changed, 29 insertions(+), 12 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ba749f579714..ade7a8dca5e4 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8648,10 +8648,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>         /*
>          * Try to use spare capacity of local group without overloading it or
>          * emptying busiest.
> -        * XXX Spreading tasks across NUMA nodes is not always the best policy
> -        * and special care should be taken for SD_NUMA domain level before
> -        * spreading the tasks. For now, load_balance() fully relies on
> -        * NUMA_BALANCING and fbq_classify_group/rq to override the decision.
>          */
>         if (local->group_type == group_has_spare) {
>                 if (busiest->group_type > group_fully_busy) {
> @@ -8691,16 +8687,37 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>                         env->migration_type = migrate_task;
>                         lsub_positive(&nr_diff, local->sum_nr_running);
>                         env->imbalance = nr_diff >> 1;
> -                       return;
> -               }
> +               } else {
>
> -               /*
> -                * If there is no overload, we just want to even the number of
> -                * idle cpus.
> -                */
> -               env->migration_type = migrate_task;
> -               env->imbalance = max_t(long, 0, (local->idle_cpus -
> +                       /*
> +                        * If there is no overload, we just want to even the number of
> +                        * idle cpus.
> +                        */
> +                       env->migration_type = migrate_task;
> +                       env->imbalance = max_t(long, 0, (local->idle_cpus -
>                                                  busiest->idle_cpus) >> 1);
> +               }
> +
> +               /* Consider allowing a small imbalance between NUMA groups */
> +               if (env->sd->flags & SD_NUMA) {
> +                       unsigned int imbalance_min;
> +
> +                       /*
> +                        * Compute an allowed imbalance based on a simple
> +                        * pair of communicating tasks that should remain
> +                        * local and ignore them.
> +                        *
> +                        * NOTE: Generally this would have been based on
> +                        * the domain size and this was evaluated. However,
> +                        * the benefit is similar across a range of workloads
> +                        * and machines but scaling by the domain size adds
> +                        * the risk that lower domains have to be rebalanced.
> +                        */
> +                       imbalance_min = 2;
> +                       if (busiest->sum_nr_running <= imbalance_min)
> +                               env->imbalance = 0;

Out of curiosity, why have you decided to use the above instead of
  env->imbalance -= min(env->imbalance, imbalance_adj);

Have you seen a perf regression with the min?

That being said, the proposal looks good to me. It is self-contained
and provides a perf improvement for some targeted use cases.

> +               }
> +
>                 return;
>         }
>


* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
  2020-01-17 13:08   ` Vincent Guittot
@ 2020-01-17 14:15     ` Mel Gorman
  2020-01-17 14:32       ` Phil Auld
  2020-01-17 14:23     ` Phil Auld
  1 sibling, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2020-01-17 14:15 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Phil Auld, Ingo Molnar, Peter Zijlstra, Valentin Schneider,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML

On Fri, Jan 17, 2020 at 02:08:13PM +0100, Vincent Guittot wrote:
> > > This patch allows a fixed degree of imbalance of two tasks to exist
> > > between NUMA domains regardless of utilisation levels. In many cases,
> > > this prevents communicating tasks being pulled apart. It was evaluated
> > > whether the imbalance should be scaled to the domain size. However, no
> > > additional benefit was measured across a range of workloads and machines
> > > and scaling adds the risk that lower domains have to be rebalanced. While
> > > this could change again in the future, such a change should specify the
> > > use case and benefit.
> > >
> >
> > Any thoughts on whether this is ok for tip or are there suggestions on
> > an alternative approach?
> 
> I have just finished running some tests on my system with your patch
> and I haven't seen any noticeable changes so far, which was a bit
> expected. The tests that I usually run use more than 4 tasks on my
> 2-node system;

This is indeed expected. With more active tasks, normal load balancing
applies.

> the only exception is perf sched pipe, and the results
> for this test stay the same with and without your patch.

I never saw much difference with perf sched pipe either. It was
generally within the noise.

> I'm curious
> whether this impacts Phil's tests, which run the LU.c benchmark with
> some CPU-burning tasks.

I didn't see any problem with LU.c whether parallelised by OpenMPI or
OpenMP, but an independent check would be nice.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
  2020-01-17 13:08   ` Vincent Guittot
  2020-01-17 14:15     ` Mel Gorman
@ 2020-01-17 14:23     ` Phil Auld
  1 sibling, 0 replies; 24+ messages in thread
From: Phil Auld @ 2020-01-17 14:23 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Mel Gorman, Ingo Molnar, Peter Zijlstra, Valentin Schneider,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML

Hi,

On Fri, Jan 17, 2020 at 02:08:13PM +0100 Vincent Guittot wrote:
> Hi Mel,
> 
> 
> On Thu, 16 Jan 2020 at 17:35, Mel Gorman <mgorman@techsingularity.net> wrote:
> >
> > On Tue, Jan 14, 2020 at 10:13:20AM +0000, Mel Gorman wrote:
> > > Changelog since V3
> > > o Allow a fixed imbalance based on a basic comparison with 2 tasks. This
> > >   turned out to be as good as or better than allowing an imbalance based
> > >   on the group weight, without worrying about potential spillover into
> > >   the lower scheduler domains.
> > >
> > > Changelog since V2
> > > o Only allow a small imbalance when utilisation is low to address reports that
> > >   higher utilisation workloads were hitting corner cases.
> > >
> > > Changelog since V1
> > > o Alter code flow                                             vincent.guittot
> > > o Use idle CPUs for comparison instead of sum_nr_running      vincent.guittot
> > > o Note that the division is still in place. Without it and taking
> > >   imbalance_adj into account before the cutoff, two NUMA domains
> > >   do not converge as being equally balanced when the number of
> > >   busy tasks equals the size of one domain (50% of the sum).
> > >
> > > The CPU load balancer balances between different domains to spread load
> > > and strives to have equal balance everywhere. Communicating tasks can
> > > migrate so they are topologically close to each other but these decisions
> > > are independent. On a lightly loaded NUMA machine, two communicating tasks
> > > pulled together at wakeup time can be pushed apart by the load balancer.
> > > In isolation, the load balancer decision is fine but it ignores the tasks'
> > > data locality and the wakeup/LB paths continually conflict. NUMA balancing
> > > is also a factor, but it too simply conflicts with the load balancer.
> > >
> > > This patch allows a fixed degree of imbalance of two tasks to exist
> > > between NUMA domains regardless of utilisation levels. In many cases,
> > > this prevents communicating tasks being pulled apart. It was evaluated
> > > whether the imbalance should be scaled to the domain size. However, no
> > > additional benefit was measured across a range of workloads and machines
> > > and scaling adds the risk that lower domains have to be rebalanced. While
> > > this could change again in the future, such a change should specify the
> > > use case and benefit.
> > >
> >
> > Any thoughts on whether this is ok for tip or are there suggestions on
> > an alternative approach?
> 
> I have just finished running some tests on my system with your patch
> and I haven't seen any noticeable changes so far, which was a bit
> expected. The tests that I usually run use more than 4 tasks on my
> 2-node system; the only exception is perf sched pipe, and the results
> for this test stay the same with and without your patch. I'm curious
> whether this impacts Phil's tests, which run the LU.c benchmark with
> some CPU-burning tasks.
>

I'm not seeing much meaningful difference with this real v4 versus what I 
posted earlier. It seems to have tightened up the range in some cases, but 
otherwise it's hard to see much difference, which is a good thing. I'll 
put the eye chart below for the curious. 

I'll see if the perf team has time to run a full suite on it. But for this
case it looks fine.


Cheers,
Phil


The lbv4* one is the non-official v4 from the email thread. The other v4
is the real posted one. Otherwise the rest of this is the same as I posted
the other day, which described the test. https://lkml.org/lkml/2020/1/7/840


GROUP - LU.c and cpu hogs in separate cgroups
Mop/s - Higher is better
============76_GROUP========Mop/s===================================
	min	q1	median	q3	max
5.4.0	 1671.8	 4211.2	 6103.0	 6934.1	 7865.4
5.4.0	 1777.1	 3719.9	 4861.8	 5822.5	13479.6
5.4.0	 2015.3	 2716.2	 5007.1	 6214.5	 9491.7
5.5-rc2	27641.0	30684.7	32091.8	33417.3	38118.1
5.5-rc2	27386.0	29795.2	32484.1	36004.0	37704.3
5.5-rc2	26649.6	29485.0	30379.7	33116.0	36832.8
lbv3	28496.3	29716.0	30634.8	32998.4	40945.2
lbv3	27294.7	29336.4	30186.0	31888.3	35839.1
lbv3	27099.3	29325.3	31680.1	35973.5	39000.0
lbv4*	27936.4	30109.0	31724.8	33150.7	35905.1
lbv4*	26431.0	29355.6	29850.1	32704.4	36060.3
lbv4*	27436.6	29945.9	31076.9	32207.8	35401.5
lbv4	28006.3	29861.1	31993.1	33469.3	34060.7
lbv4	28468.2	30057.7	31606.3	31963.5	35348.5
lbv4	25016.3	28897.5	29274.4	30229.2	36862.7

Runtime - Lower is better
============76_GROUP========time====================================
	min	q1	median	q3	max
5.4.0	259.2	294.92	335.39	484.33	1219.61
5.4.0	151.3	351.1	419.4	551.99	1147.3
5.4.0	214.8	328.16	407.27	751.03	1011.77
5.5-rc2	 53.49	 61.03	 63.56	 66.46	  73.77
5.5-rc2  54.08	 56.67	 62.78	 68.44	  74.45
5.5-rc2	 55.36	 61.61	 67.14	 69.16	  76.51
lbv3	 49.8	 61.8	 66.59	 68.62	  71.55
lbv3	 56.89	 63.95	 67.55	 69.51	  74.7
lbv3	 52.28	 56.68	 64.38	 69.54	  75.24
lbv4*	 56.79	 61.52	 64.3	 67.73	  72.99
lbv4*	 56.54	 62.36	 68.31	 69.47	  77.14
lbv4*	 57.6	 63.33	 65.64	 68.11	  74.32
lbv4	 59.86	 60.93	 63.74	 68.28	  72.81
lbv4	 57.68	 63.79	 64.52	 67.84	  71.62
lbv4	 55.31	 67.46	 69.66	 70.56	  81.51

NORMAL - LU.c and cpu hogs all in one cgroup
Mop/s - Higher is better
============76_NORMAL========Mop/s===================================
	min	q1	median	q3	max
5.4.0	32912.6	34047.5	36739.4	39124.1	41592.5
5.4.0	29937.7	33060.6	34860.8	39528.8	43328.1
5.4.0	31851.2	34281.1	35284.4	36016.8	38847.4
5.5-rc2	30475.6	32505.1	33977.3	34876	36233.8
5.5-rc2	30657.7	31301.1	32059.4	34396.7	38661.8
5.5-rc2	31022	32247.6	32628.9	33245	38572.3
lbv3	30606.4	32794.4	34258.6	35699	38669.2
lbv3	29722.7	30558.9	32731.2	36412	40752.3
lbv3	30297.7	32568.3	36654.6	38066.2	38988.3
lbv4*	30084.9	31227.5	32312.8	33222.8	36039.7
lbv4*	29875.9	32903.6	33803.1	34519.3	38663.5
lbv4*	27923.3	30631.1	32666.9	33516.7	36663.4
lbv4	30401.4	32559.5	33268.3	35012.9	35953.9
lbv4	29372.5	30677	32081.7	33734.2	39326.8
lbv4	29583.7	30432.5	32542.9	33170.5	34123.1

Runtime - Lower is better
============76_NORMAL========time====================================
	min	q1	median	q3	max
5.4.0	49.02	52.115	55.58	59.89	61.95
5.4.0	47.06	51.615	58.57	61.68	68.11
5.4.0	52.49	56.615	57.795	59.48	64.02
5.5-rc2	56.27	58.47	60.02	62.735	66.91
5.5-rc2	52.74	59.295	63.605	65.145	66.51
5.5-rc2	52.86	61.335	62.495	63.23	65.73
lbv3	52.73	57.12	59.52	62.19	66.62
lbv3	50.03	56.02	62.39	66.725	68.6
lbv3	52.3	53.565	55.65	62.645	67.3
lbv4*	56.58	61.375	63.135	65.3	67.77
lbv4*	52.74	59.07	60.335	61.97	68.25
lbv4*	55.61	60.835	62.42	66.635	73.02
lbv4	56.71	58.235	61.295	62.63	67.07
lbv4	51.85	60.535	63.56	66.54	69.42
lbv4	59.75	61.48	62.655	67	68.92



 
> >
> > --
> > Mel Gorman
> > SUSE Labs
> 

-- 



* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
  2020-01-17 13:16 ` Vincent Guittot
@ 2020-01-17 14:26   ` Mel Gorman
  2020-01-17 14:29     ` Vincent Guittot
  0 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2020-01-17 14:26 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Phil Auld, Ingo Molnar, Peter Zijlstra, Valentin Schneider,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML

On Fri, Jan 17, 2020 at 02:16:15PM +0100, Vincent Guittot wrote:
> > A more interesting example is the Facebook schbench which uses a
> > number of messaging threads to communicate with worker threads. In this
> > configuration, one messaging thread is used per NUMA node and the number of
> > worker threads is varied. The 50, 75, 90, 95, 99, 99.5 and 99.9 percentiles
> > for response latency are then reported.
> >
> > Lat 50.00th-qrtle-1        44.00 (   0.00%)       37.00 (  15.91%)
> > Lat 75.00th-qrtle-1        53.00 (   0.00%)       41.00 (  22.64%)
> > Lat 90.00th-qrtle-1        57.00 (   0.00%)       42.00 (  26.32%)
> > Lat 95.00th-qrtle-1        63.00 (   0.00%)       43.00 (  31.75%)
> > Lat 99.00th-qrtle-1        76.00 (   0.00%)       51.00 (  32.89%)
> > Lat 99.50th-qrtle-1        89.00 (   0.00%)       52.00 (  41.57%)
> > Lat 99.90th-qrtle-1        98.00 (   0.00%)       55.00 (  43.88%)
> 
> > Which parameter changes between the above and below tests?
> 
> > Lat 50.00th-qrtle-2        42.00 (   0.00%)       42.00 (   0.00%)
> > Lat 75.00th-qrtle-2        48.00 (   0.00%)       47.00 (   2.08%)
> > Lat 90.00th-qrtle-2        53.00 (   0.00%)       52.00 (   1.89%)
> > Lat 95.00th-qrtle-2        55.00 (   0.00%)       53.00 (   3.64%)
> > Lat 99.00th-qrtle-2        62.00 (   0.00%)       60.00 (   3.23%)
> > Lat 99.50th-qrtle-2        63.00 (   0.00%)       63.00 (   0.00%)
> > > Lat 99.90th-qrtle-2        68.00 (   0.00%)       66.00 (   2.94%)
> >

The number of worker pool threads. Above is 1 worker thread, below is 2.

> > @@ -8691,16 +8687,37 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> >                         env->migration_type = migrate_task;
> >                         lsub_positive(&nr_diff, local->sum_nr_running);
> >                         env->imbalance = nr_diff >> 1;
> > -                       return;
> > -               }
> > +               } else {
> >
> > -               /*
> > -                * If there is no overload, we just want to even the number of
> > -                * idle cpus.
> > -                */
> > -               env->migration_type = migrate_task;
> > -               env->imbalance = max_t(long, 0, (local->idle_cpus -
> > +                       /*
> > +                        * If there is no overload, we just want to even the number of
> > +                        * idle cpus.
> > +                        */
> > +                       env->migration_type = migrate_task;
> > +                       env->imbalance = max_t(long, 0, (local->idle_cpus -
> >                                                  busiest->idle_cpus) >> 1);
> > +               }
> > +
> > +               /* Consider allowing a small imbalance between NUMA groups */
> > +               if (env->sd->flags & SD_NUMA) {
> > +                       unsigned int imbalance_min;
> > +
> > +                       /*
> > +                        * Compute an allowed imbalance based on a simple
> > +                        * pair of communicating tasks that should remain
> > +                        * local and ignore them.
> > +                        *
> > +                        * NOTE: Generally this would have been based on
> > +                        * the domain size and this was evaluated. However,
> > +                        * the benefit is similar across a range of workloads
> > +                        * and machines but scaling by the domain size adds
> > +                        * the risk that lower domains have to be rebalanced.
> > +                        */
> > +                       imbalance_min = 2;
> > +                       if (busiest->sum_nr_running <= imbalance_min)
> > +                               env->imbalance = 0;
> 
> > Out of curiosity, why have you decided to use the above instead of
>   env->imbalance -= min(env->imbalance, imbalance_adj);
> 
> > Have you seen a perf regression with the min?
> 

I didn't see a regression with min() but at this point, we're only
dealing with the case of ignoring a small imbalance when the busiest
group is almost completely idle. The distinction between using min and
just ignoring the imbalance is almost irrelevant in that case.
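
For reference, a standalone sketch of the two formulations being compared
(not kernel code; imbalance_adj is the allowance from earlier versions of
the patch and all values are placeholders picked for illustration):

#include <stdio.h>

#define MIN(a, b)	((a) < (b) ? (a) : (b))

int main(void)
{
	/*
	 * Nearly idle busiest group: two tasks, so the idle-CPU
	 * difference, and hence the raw imbalance, is tiny.
	 */
	long busiest_nr_running = 2;
	long imbalance = 1;
	long imbalance_adj = 2;	/* allowance, earlier versions */
	long imbalance_min = 2;	/* task cutoff, this version */

	/* Shave the allowance off the computed imbalance. */
	long a = imbalance - MIN(imbalance, imbalance_adj);

	/* Clear the imbalance below the task cutoff, as in the patch. */
	long b = busiest_nr_running <= imbalance_min ? 0 : imbalance;

	printf("%ld %ld\n", a, b);	/* both print 0 in this regime */
	return 0;
}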

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
  2020-01-17 14:26   ` Mel Gorman
@ 2020-01-17 14:29     ` Vincent Guittot
  0 siblings, 0 replies; 24+ messages in thread
From: Vincent Guittot @ 2020-01-17 14:29 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Phil Auld, Ingo Molnar, Peter Zijlstra, Valentin Schneider,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML

On Fri, 17 Jan 2020 at 15:26, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Fri, Jan 17, 2020 at 02:16:15PM +0100, Vincent Guittot wrote:
> > > A more interesting example is the Facebook schbench which uses a
> > > number of messaging threads to communicate with worker threads. In this
> > > configuration, one messaging thread is used per NUMA node and the number of
> > > worker threads is varied. The 50, 75, 90, 95, 99, 99.5 and 99.9 percentiles
> > > for response latency are then reported.
> > >
> > > Lat 50.00th-qrtle-1        44.00 (   0.00%)       37.00 (  15.91%)
> > > Lat 75.00th-qrtle-1        53.00 (   0.00%)       41.00 (  22.64%)
> > > Lat 90.00th-qrtle-1        57.00 (   0.00%)       42.00 (  26.32%)
> > > Lat 95.00th-qrtle-1        63.00 (   0.00%)       43.00 (  31.75%)
> > > Lat 99.00th-qrtle-1        76.00 (   0.00%)       51.00 (  32.89%)
> > > Lat 99.50th-qrtle-1        89.00 (   0.00%)       52.00 (  41.57%)
> > > Lat 99.90th-qrtle-1        98.00 (   0.00%)       55.00 (  43.88%)
> >
> > Which parameter changes between the above and below tests?
> >
> > > Lat 50.00th-qrtle-2        42.00 (   0.00%)       42.00 (   0.00%)
> > > Lat 75.00th-qrtle-2        48.00 (   0.00%)       47.00 (   2.08%)
> > > Lat 90.00th-qrtle-2        53.00 (   0.00%)       52.00 (   1.89%)
> > > Lat 95.00th-qrtle-2        55.00 (   0.00%)       53.00 (   3.64%)
> > > Lat 99.00th-qrtle-2        62.00 (   0.00%)       60.00 (   3.23%)
> > > Lat 99.50th-qrtle-2        63.00 (   0.00%)       63.00 (   0.00%)
> > > Lat 99.90th-qrtle-2        68.00 (   0.00%)       66.00 (   2.94%)
> > >
>
> The number of worker pool threads. Above is 1 worker thread, below is 2.
>
> > > @@ -8691,16 +8687,37 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> > >                         env->migration_type = migrate_task;
> > >                         lsub_positive(&nr_diff, local->sum_nr_running);
> > >                         env->imbalance = nr_diff >> 1;
> > > -                       return;
> > > -               }
> > > +               } else {
> > >
> > > -               /*
> > > -                * If there is no overload, we just want to even the number of
> > > -                * idle cpus.
> > > -                */
> > > -               env->migration_type = migrate_task;
> > > -               env->imbalance = max_t(long, 0, (local->idle_cpus -
> > > +                       /*
> > > +                        * If there is no overload, we just want to even the number of
> > > +                        * idle cpus.
> > > +                        */
> > > +                       env->migration_type = migrate_task;
> > > +                       env->imbalance = max_t(long, 0, (local->idle_cpus -
> > >                                                  busiest->idle_cpus) >> 1);
> > > +               }
> > > +
> > > +               /* Consider allowing a small imbalance between NUMA groups */
> > > +               if (env->sd->flags & SD_NUMA) {
> > > +                       unsigned int imbalance_min;
> > > +
> > > +                       /*
> > > +                        * Compute an allowed imbalance based on a simple
> > > +                        * pair of communicating tasks that should remain
> > > +                        * local and ignore them.
> > > +                        *
> > > +                        * NOTE: Generally this would have been based on
> > > +                        * the domain size and this was evaluated. However,
> > > +                        * the benefit is similar across a range of workloads
> > > +                        * and machines but scaling by the domain size adds
> > > +                        * the risk that lower domains have to be rebalanced.
> > > +                        */
> > > +                       imbalance_min = 2;
> > > +                       if (busiest->sum_nr_running <= imbalance_min)
> > > +                               env->imbalance = 0;
> >
> > Out of curiosity, why have you decided to use the above instead of
> >   env->imbalance -= min(env->imbalance, imbalance_adj);
> >
> > Have you seen a perf regression with the min?
> >
>
> I didn't see a regression with min() but at this point, we're only
> dealing with the case of ignoring a small imbalance when the busiest
> group is almost completely idle. The distinction between using min and
> just ignoring the imbalance is almost irrelevant in that case.

yes you're right

>
> --
> Mel Gorman
> SUSE Labs


* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
  2020-01-17 14:15     ` Mel Gorman
@ 2020-01-17 14:32       ` Phil Auld
  0 siblings, 0 replies; 24+ messages in thread
From: Phil Auld @ 2020-01-17 14:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Vincent Guittot, Ingo Molnar, Peter Zijlstra, Valentin Schneider,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML

On Fri, Jan 17, 2020 at 02:15:03PM +0000 Mel Gorman wrote:
> On Fri, Jan 17, 2020 at 02:08:13PM +0100, Vincent Guittot wrote:
> > > > This patch allows a fixed degree of imbalance of two tasks to exist
> > > > between NUMA domains regardless of utilisation levels. In many cases,
> > > > this prevents communicating tasks being pulled apart. It was evaluated
> > > > whether the imbalance should be scaled to the domain size. However, no
> > > > additional benefit was measured across a range of workloads and machines
> > > > and scaling adds the risk that lower domains have to be rebalanced. While
> > > > this could change again in the future, such a change should specify the
> > > > use case and benefit.
> > > >
> > >
> > > Any thoughts on whether this is ok for tip or are there suggestions on
> > > an alternative approach?
> > 
> > I have just finished running some tests on my system with your patch
> > and I haven't seen any noticeable changes so far, which was a bit
> > expected. The tests that I usually run use more than 4 tasks on my
> > 2-node system;
> 
> This is indeed expected. With more active tasks, normal load balancing
> applies.
> 
> > the only exception is perf sched pipe, and the results
> > for this test stay the same with and without your patch.
> 
> I never saw much difference with perf sched pipe either. It was
> generally within the noise.
> 
> > I'm curious
> > whether this impacts Phil's tests, which run the LU.c benchmark with
> > some CPU-burning tasks.
> 
> I didn't see any problem with LU.c whether parallelised by OpenMPI or
> OpenMP, but an independent check would be nice.
> 

My particular case is not straight up LU.c. It's the group imbalance 
setup which was totally borken before Vincent's work. The test setup
is designed to show how the load balancer (used to) fail by using group
scaled "load" at larger (NUMA) domain levels. It's very susceptible to 
imbalances, so I wanted to make sure your patch allowing imbalances
didn't re-break it.


Cheers,
Phil


> -- 
> Mel Gorman
> SUSE Labs
> 

-- 



* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
  2020-01-16 16:35 ` Mel Gorman
  2020-01-17 13:08   ` Vincent Guittot
@ 2020-01-17 14:37   ` Valentin Schneider
  1 sibling, 0 replies; 24+ messages in thread
From: Valentin Schneider @ 2020-01-17 14:37 UTC (permalink / raw)
  To: Mel Gorman, Vincent Guittot
  Cc: Phil Auld, Ingo Molnar, Peter Zijlstra, Srikar Dronamraju,
	Quentin Perret, Dietmar Eggemann, Morten Rasmussen, Hillf Danton,
	Parth Shah, Rik van Riel, LKML

On 16/01/2020 16:35, Mel Gorman wrote:
> Any thoughts on whether this is ok for tip or are there suggestions on
> an alternative approach?
> 

My main concern was about using the number of tasks instead of the number
of busy CPUs, which you're doing here, so I'm happy with that side of things.

As for the simpler imbalance heuristic, I don't have any issue with it
either. It's obvious that it caters to pairs of communicating tasks, and
we can try to extend it later on if required.

So yeah, FWIW:
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>


* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
  2020-01-14 10:13 [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4 Mel Gorman
  2020-01-16 16:35 ` Mel Gorman
  2020-01-17 13:16 ` Vincent Guittot
@ 2020-01-17 15:09 ` Vincent Guittot
  2020-01-17 15:11   ` Peter Zijlstra
  2020-01-17 15:21 ` Phil Auld
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 24+ messages in thread
From: Vincent Guittot @ 2020-01-17 15:09 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Phil Auld, Ingo Molnar, Peter Zijlstra, Valentin Schneider,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML

On Tue, 14 Jan 2020 at 11:13, Mel Gorman <mgorman@techsingularity.net> wrote:
>
> Changelog since V3
> o Allow a fixed imbalance based on a basic comparison with 2 tasks. This
>   turned out to be as good as or better than allowing an imbalance based
>   on the group weight, without worrying about potential spillover into
>   the lower scheduler domains.
>
> Changelog since V2
> o Only allow a small imbalance when utilisation is low to address reports that
>   higher utilisation workloads were hitting corner cases.
>
> Changelog since V1
> o Alter code flow                                               vincent.guittot
> o Use idle CPUs for comparison instead of sum_nr_running        vincent.guittot
> o Note that the division is still in place. Without it and taking
>   imbalance_adj into account before the cutoff, two NUMA domains
>   do not converge as being equally balanced when the number of
>   busy tasks equals the size of one domain (50% of the sum).
>
> The CPU load balancer balances between different domains to spread load
> and strives to have equal balance everywhere. Communicating tasks can
> migrate so they are topologically close to each other but these decisions
> are independent. On a lightly loaded NUMA machine, two communicating tasks
> pulled together at wakeup time can be pushed apart by the load balancer.
> In isolation, the load balancer decision is fine but it ignores the tasks'
> data locality and the wakeup/LB paths continually conflict. NUMA balancing
> is also a factor, but it too simply conflicts with the load balancer.
>
> This patch allows a fixed degree of imbalance of two tasks to exist
> between NUMA domains regardless of utilisation levels. In many cases,
> this prevents communicating tasks being pulled apart. It was evaluated
> whether the imbalance should be scaled to the domain size. However, no
> additional benefit was measured across a range of workloads and machines
> and scaling adds the risk that lower domains have to be rebalanced. While
> this could change again in the future, such a change should specify the
> use case and benefit.
>
> The most obvious impact is on netperf TCP_STREAM -- two simple
> communicating tasks with some softirq offload depending on the
> transmission rate.
>
> 2-socket Haswell machine 48 core, HT enabled
> netperf-tcp -- mmtests config config-network-netperf-unbound
>                               baseline              lbnuma-v3
> Hmean     64         568.73 (   0.00%)      577.56 *   1.55%*
> Hmean     128       1089.98 (   0.00%)     1128.06 *   3.49%*
> Hmean     256       2061.72 (   0.00%)     2104.39 *   2.07%*
> Hmean     1024      7254.27 (   0.00%)     7557.52 *   4.18%*
> Hmean     2048     11729.20 (   0.00%)    13350.67 *  13.82%*
> Hmean     3312     15309.08 (   0.00%)    18058.95 *  17.96%*
> Hmean     4096     17338.75 (   0.00%)    20483.66 *  18.14%*
> Hmean     8192     25047.12 (   0.00%)    27806.84 *  11.02%*
> Hmean     16384    27359.55 (   0.00%)    33071.88 *  20.88%*
> Stddev    64           2.16 (   0.00%)        2.02 (   6.53%)
> Stddev    128          2.31 (   0.00%)        2.19 (   5.05%)
> Stddev    256         11.88 (   0.00%)        3.22 (  72.88%)
> Stddev    1024        23.68 (   0.00%)        7.24 (  69.43%)
> Stddev    2048        79.46 (   0.00%)       71.49 (  10.03%)
> Stddev    3312        26.71 (   0.00%)       57.80 (-116.41%)
> Stddev    4096       185.57 (   0.00%)       96.15 (  48.19%)
> Stddev    8192       245.80 (   0.00%)      100.73 (  59.02%)
> Stddev    16384      207.31 (   0.00%)      141.65 (  31.67%)
>
> In this case, there was a sizable improvement in performance and
> a general reduction in variance. However, this is not universal.
> For most machines, the impact was roughly a 3% performance gain.
>
> Ops NUMA base-page range updates       19796.00         292.00
> Ops NUMA PTE updates                   19796.00         292.00
> Ops NUMA PMD updates                       0.00           0.00
> Ops NUMA hint faults                   16113.00         143.00
> Ops NUMA hint local faults %            8407.00         142.00
> Ops NUMA hint local percent               52.18          99.30
> Ops NUMA pages migrated                 4244.00           1.00
>
> Without the patch, only 52.18% of sampled accesses are local.  In an
> earlier changelog, 100% of sampled accesses were local and indeed on
> most machines, this was still the case. In this specific case, the
> local sampled rate was 99.3% but note the "base-page range updates"
> and "PTE updates".  The activity with the patch is negligible, as was
> the number of faults. The small number of pages migrated was related to
> shared libraries.  A 2-socket Broadwell showed better results on average
> but they are not presented for brevity as the performance was similar
> except that 100% of the sampled NUMA hints were local. The patch holds
> up for a 4-socket Haswell, an AMD EPYC and an AMD EPYC 2 machine.
>
> For dbench, the impact depends on the filesystem used and the number of
> clients. On XFS, there is little difference as the clients typically
> communicate with workqueues which have a separate class of scheduler
> problem at the moment. For ext4, performance is generally better,
> particularly for small numbers of clients as NUMA balancing activity is
> negligible with the patch applied.
>
> A more interesting example is the Facebook schbench, which uses a
> number of messaging threads to communicate with worker threads. In this
> configuration, one messaging thread is used per NUMA node and the number
> of worker threads is varied. The 50, 75, 90, 95, 99, 99.5 and 99.9
> percentiles for response latency are then reported.
>
> Lat 50.00th-qrtle-1        44.00 (   0.00%)       37.00 (  15.91%)
> Lat 75.00th-qrtle-1        53.00 (   0.00%)       41.00 (  22.64%)
> Lat 90.00th-qrtle-1        57.00 (   0.00%)       42.00 (  26.32%)
> Lat 95.00th-qrtle-1        63.00 (   0.00%)       43.00 (  31.75%)
> Lat 99.00th-qrtle-1        76.00 (   0.00%)       51.00 (  32.89%)
> Lat 99.50th-qrtle-1        89.00 (   0.00%)       52.00 (  41.57%)
> Lat 99.90th-qrtle-1        98.00 (   0.00%)       55.00 (  43.88%)
> Lat 50.00th-qrtle-2        42.00 (   0.00%)       42.00 (   0.00%)
> Lat 75.00th-qrtle-2        48.00 (   0.00%)       47.00 (   2.08%)
> Lat 90.00th-qrtle-2        53.00 (   0.00%)       52.00 (   1.89%)
> Lat 95.00th-qrtle-2        55.00 (   0.00%)       53.00 (   3.64%)
> Lat 99.00th-qrtle-2        62.00 (   0.00%)       60.00 (   3.23%)
> Lat 99.50th-qrtle-2        63.00 (   0.00%)       63.00 (   0.00%)
> Lat 99.90th-qrtle-2        68.00 (   0.00%)       66.00 (   2.94%)
>
> For higher worker thread counts, the differences become negligible but
> it's interesting to note the difference in wakeup latency at low
> utilisation, and mpstat confirms that activity was almost all on one
> node until the number of worker threads increases.
>
> Hackbench generally showed neutral results across a range of machines.
> This is different to earlier versions of the patch which allowed imbalances
> for higher degrees of utilisation. perf bench pipe showed negligible
> differences in overall performance as the differences are very close to
> the noise.
>
> An earlier prototype of the patch showed major regressions for NAS C-class
> when running with only half of the available CPUs -- 20-30% performance
> hits were measured at the time. With this version of the patch, the impact
> is negligible with small gains/losses within the noise measured. This is
> because the number of threads far exceeds the small imbalance the patch
> cares about. Similarly, there were reports of regressions for the autonuma
> benchmark against earlier versions but again, normal load balancing now
> applies for that workload.
>
> In general, the patch simply seeks to avoid unnecessary cross-node
> migrations in the basic case where imbalances are very small.  For low
> utilisation communicating workloads, this patch generally behaves better
> with less NUMA balancing activity. For high utilisation, there is no
> change in behaviour.
>
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>

Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
  2020-01-17 15:09 ` Vincent Guittot
@ 2020-01-17 15:11   ` Peter Zijlstra
  0 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2020-01-17 15:11 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Mel Gorman, Phil Auld, Ingo Molnar, Valentin Schneider,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML

On Fri, Jan 17, 2020 at 04:09:33PM +0100, Vincent Guittot wrote:
> On Tue, 14 Jan 2020 at 11:13, Mel Gorman <mgorman@techsingularity.net> wrote:

> > In general, the patch simply seeks to avoid unnecessary cross-node
> > migrations in the basic case where imbalances are very small.  For low
> > utilisation communicating workloads, this patch generally behaves better
> > with less NUMA balancing activity. For high utilisation, there is no
> > change in behaviour.
> >
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> 
> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>

Thanks all!

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
  2020-01-14 10:13 [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4 Mel Gorman
                   ` (2 preceding siblings ...)
  2020-01-17 15:09 ` Vincent Guittot
@ 2020-01-17 15:21 ` Phil Auld
  2020-01-17 17:56 ` Srikar Dronamraju
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: Phil Auld @ 2020-01-17 15:21 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Vincent Guittot, Ingo Molnar, Peter Zijlstra, Valentin Schneider,
	Srikar Dronamraju, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML

On Tue, Jan 14, 2020 at 10:13:20AM +0000 Mel Gorman wrote:
> Changelog since V3
> o Allow a fixed imbalance based on a basic comparison with 2 tasks. This turned out to
>   be as good or better than allowing an imbalance based on the group weight
>   without worrying about potential spillover of the lower scheduler domains.
> 
> Changelog since V2
> o Only allow a small imbalance when utilisation is low to address reports that
>   higher utilisation workloads were hitting corner cases.
> 
> Changelog since V1
> o Alter code flow 						vincent.guittot
> o Use idle CPUs for comparison instead of sum_nr_running	vincent.guittot
> o Note that the division is still in place. Without it and taking
>   imbalance_adj into account before the cutoff, two NUMA domains
>   do not converge as being equally balanced when the number of
>   busy tasks equals the size of one domain (50% of the sum).
> 
> The CPU load balancer balances between different domains to spread load
> and strives to have equal balance everywhere. Communicating tasks can
> migrate so they are topologically close to each other but these decisions
> are independent. On a lightly loaded NUMA machine, two communicating tasks
> pulled together at wakeup time can be pushed apart by the load balancer.
> In isolation, the load balancer decision is fine but it ignores the tasks'
> data locality and the wakeup/LB paths continually conflict. NUMA balancing
> is also a factor but it simply conflicts with the load balancer as well.
> 
> This patch allows a fixed degree of imbalance of two tasks to exist
> between NUMA domains regardless of utilisation levels. In many cases,
> this prevents communicating tasks being pulled apart. It was evaluated
> whether the imbalance should be scaled to the domain size. However, no
> additional benefit was measured across a range of workloads and machines
> and scaling adds the risk that lower domains have to be rebalanced. While
> this could change again in the future, such a change should specify the
> use case and benefit.
> 
> The most obvious impact is on netperf TCP_STREAM -- two simple
> communicating tasks with some softirq offload depending on the
> transmission rate.
> 
> 2-socket Haswell machine 48 core, HT enabled
> netperf-tcp -- mmtests config config-network-netperf-unbound
>                        	      baseline              lbnuma-v3
> Hmean     64         568.73 (   0.00%)      577.56 *   1.55%*
> Hmean     128       1089.98 (   0.00%)     1128.06 *   3.49%*
> Hmean     256       2061.72 (   0.00%)     2104.39 *   2.07%*
> Hmean     1024      7254.27 (   0.00%)     7557.52 *   4.18%*
> Hmean     2048     11729.20 (   0.00%)    13350.67 *  13.82%*
> Hmean     3312     15309.08 (   0.00%)    18058.95 *  17.96%*
> Hmean     4096     17338.75 (   0.00%)    20483.66 *  18.14%*
> Hmean     8192     25047.12 (   0.00%)    27806.84 *  11.02%*
> Hmean     16384    27359.55 (   0.00%)    33071.88 *  20.88%*
> Stddev    64           2.16 (   0.00%)        2.02 (   6.53%)
> Stddev    128          2.31 (   0.00%)        2.19 (   5.05%)
> Stddev    256         11.88 (   0.00%)        3.22 (  72.88%)
> Stddev    1024        23.68 (   0.00%)        7.24 (  69.43%)
> Stddev    2048        79.46 (   0.00%)       71.49 (  10.03%)
> Stddev    3312        26.71 (   0.00%)       57.80 (-116.41%)
> Stddev    4096       185.57 (   0.00%)       96.15 (  48.19%)
> Stddev    8192       245.80 (   0.00%)      100.73 (  59.02%)
> Stddev    16384      207.31 (   0.00%)      141.65 (  31.67%)
> 
> In this case, there was a sizable improvement in performance and
> a general reduction in variance. However, this is not universal.
> For most machines, the impact was roughly a 3% performance gain.
> 
> Ops NUMA base-page range updates       19796.00         292.00
> Ops NUMA PTE updates                   19796.00         292.00
> Ops NUMA PMD updates                       0.00           0.00
> Ops NUMA hint faults                   16113.00         143.00
> Ops NUMA hint local faults %            8407.00         142.00
> Ops NUMA hint local percent               52.18          99.30
> Ops NUMA pages migrated                 4244.00           1.00
> 
> Without the patch, only 52.18% of sampled accesses are local.  In an
> earlier changelog, 100% of sampled accesses were local and indeed on
> most machines, this was still the case. In this specific case, the
> local sampled rate was 99.3% but note the "base-page range updates"
> and "PTE updates".  The activity with the patch is negligible, as was
> the number of faults. The small number of pages migrated was related to
> shared libraries.  A 2-socket Broadwell showed better results on average
> but they are not presented for brevity as the performance was similar
> except that 100% of the sampled NUMA hints were local. The patch holds
> up for a 4-socket Haswell, an AMD EPYC and an AMD EPYC 2 machine.
> 
> For dbench, the impact depends on the filesystem used and the number of
> clients. On XFS, there is little difference as the clients typically
> communicate with workqueues which have a separate class of scheduler
> problem at the moment. For ext4, performance is generally better,
> particularly for small numbers of clients as NUMA balancing activity is
> negligible with the patch applied.
> 
> A more interesting example is the Facebook schbench, which uses a
> number of messaging threads to communicate with worker threads. In this
> configuration, one messaging thread is used per NUMA node and the number
> of worker threads is varied. The 50, 75, 90, 95, 99, 99.5 and 99.9
> percentiles for response latency are then reported.
> 
> Lat 50.00th-qrtle-1        44.00 (   0.00%)       37.00 (  15.91%)
> Lat 75.00th-qrtle-1        53.00 (   0.00%)       41.00 (  22.64%)
> Lat 90.00th-qrtle-1        57.00 (   0.00%)       42.00 (  26.32%)
> Lat 95.00th-qrtle-1        63.00 (   0.00%)       43.00 (  31.75%)
> Lat 99.00th-qrtle-1        76.00 (   0.00%)       51.00 (  32.89%)
> Lat 99.50th-qrtle-1        89.00 (   0.00%)       52.00 (  41.57%)
> Lat 99.90th-qrtle-1        98.00 (   0.00%)       55.00 (  43.88%)
> Lat 50.00th-qrtle-2        42.00 (   0.00%)       42.00 (   0.00%)
> Lat 75.00th-qrtle-2        48.00 (   0.00%)       47.00 (   2.08%)
> Lat 90.00th-qrtle-2        53.00 (   0.00%)       52.00 (   1.89%)
> Lat 95.00th-qrtle-2        55.00 (   0.00%)       53.00 (   3.64%)
> Lat 99.00th-qrtle-2        62.00 (   0.00%)       60.00 (   3.23%)
> Lat 99.50th-qrtle-2        63.00 (   0.00%)       63.00 (   0.00%)
> Lat 99.90th-qrtle-2        68.00 (   0.00%)       66.00 (   2.94%)
> 
> For higher worker thread counts, the differences become negligible but
> it's interesting to note the difference in wakeup latency at low
> utilisation, and mpstat confirms that activity was almost all on one
> node until the number of worker threads increases.
> 
> Hackbench generally showed neutral results across a range of machines.
> This is different to earlier versions of the patch which allowed imbalances
> for higher degrees of utilisation. perf bench pipe showed negligible
> differences in overall performance as the differences are very close to
> the noise.
> 
> An earlier prototype of the patch showed major regressions for NAS C-class
> when running with only half of the available CPUs -- 20-30% performance
> hits were measured at the time. With this version of the patch, the impact
> is negligible with small gains/losses within the noise measured. This is
> because the number of threads far exceeds the small imbalance the patch
> cares about. Similarly, there were reports of regressions for the autonuma
> benchmark against earlier versions but again, normal load balancing now
> applies for that workload.
> 
> In general, the patch simply seeks to avoid unnecessary cross-node
> migrations in the basic case where imbalances are very small.  For low
> utilisation communicating workloads, this patch generally behaves better
> with less NUMA balancing activity. For high utilisation, there is no
> change in behaviour.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> ---
>  kernel/sched/fair.c | 41 +++++++++++++++++++++++++++++------------
>  1 file changed, 29 insertions(+), 12 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ba749f579714..ade7a8dca5e4 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8648,10 +8648,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>  	/*
>  	 * Try to use spare capacity of local group without overloading it or
>  	 * emptying busiest.
> -	 * XXX Spreading tasks across NUMA nodes is not always the best policy
> -	 * and special care should be taken for SD_NUMA domain level before
> -	 * spreading the tasks. For now, load_balance() fully relies on
> -	 * NUMA_BALANCING and fbq_classify_group/rq to override the decision.
>  	 */
>  	if (local->group_type == group_has_spare) {
>  		if (busiest->group_type > group_fully_busy) {
> @@ -8691,16 +8687,37 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>  			env->migration_type = migrate_task;
>  			lsub_positive(&nr_diff, local->sum_nr_running);
>  			env->imbalance = nr_diff >> 1;
> -			return;
> -		}
> +		} else {
>  
> -		/*
> -		 * If there is no overload, we just want to even the number of
> -		 * idle cpus.
> -		 */
> -		env->migration_type = migrate_task;
> -		env->imbalance = max_t(long, 0, (local->idle_cpus -
> +			/*
> +			 * If there is no overload, we just want to even the number of
> +			 * idle cpus.
> +			 */
> +			env->migration_type = migrate_task;
> +			env->imbalance = max_t(long, 0, (local->idle_cpus -
>  						 busiest->idle_cpus) >> 1);
> +		}
> +
> +		/* Consider allowing a small imbalance between NUMA groups */
> +		if (env->sd->flags & SD_NUMA) {
> +			unsigned int imbalance_min;
> +
> +			/*
> +			 * Compute an allowed imbalance based on a simple
> +			 * pair of communicating tasks that should remain
> +			 * local and ignore them.
> +			 *
> +			 * NOTE: Generally this would have been based on
> +			 * the domain size and this was evaluated. However,
> +			 * the benefit is similar across a range of workloads
> +			 * and machines but scaling by the domain size adds
> +			 * the risk that lower domains have to be rebalanced.
> +			 */
> +			imbalance_min = 2;
> +			if (busiest->sum_nr_running <= imbalance_min)
> +				env->imbalance = 0;
> +		}
> +
>  		return;
>  	}
>  
> 

Works for me. I like this simplified version.
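
In case it helps anyone reading along, the decision the new SD_NUMA hunk
makes boils down to something like this minimal standalone sketch
(simplified names and a plain helper, not the actual kernel function; it
assumes the group_has_spare branch has already computed env->imbalance):

	/*
	 * Allow a fixed imbalance of up to two tasks between NUMA
	 * groups so a pair of communicating tasks can stay local.
	 */
	static long allow_numa_imbalance(long imbalance,
					 unsigned int busiest_nr_running,
					 int sd_is_numa)
	{
		const unsigned int imbalance_min = 2;

		if (sd_is_numa && busiest_nr_running <= imbalance_min)
			return 0;	/* ignore the small imbalance */

		return imbalance;
	}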

Acked-by: Phil Auld <pauld@redhat.com>

  and/or

Tested-by: Phil Auld <pauld@redhat.com>


-- 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
  2020-01-14 10:13 [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4 Mel Gorman
                   ` (3 preceding siblings ...)
  2020-01-17 15:21 ` Phil Auld
@ 2020-01-17 17:56 ` Srikar Dronamraju
  2020-01-17 21:58   ` Mel Gorman
  2020-01-21  9:59 ` Srikar Dronamraju
  2020-01-29 11:32 ` [tip: sched/core] sched/fair: Allow a small load imbalance between low utilisation SD_NUMA domains tip-bot2 for Mel Gorman
  6 siblings, 1 reply; 24+ messages in thread
From: Srikar Dronamraju @ 2020-01-17 17:56 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Vincent Guittot, Phil Auld, Ingo Molnar, Peter Zijlstra,
	Valentin Schneider, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML

* Mel Gorman <mgorman@techsingularity.net> [2020-01-14 10:13:20]:

> Changelog since V3
> o Allow a fixed imbalance based on a basic comparison with 2 tasks. This turned out to
>   be as good or better than allowing an imbalance based on the group weight
>   without worrying about potential spillover of the lower scheduler domains.
> 

We certainly are seeing better results than v1.
However numa02, numa03, numa05, numa09 and numa10 still seem to be regressing, while
the others are improving.

While numa04 improves by 14%, numa02 regresses by around 12%.

Setup:
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              256
On-line CPU(s) list: 0-255
Thread(s) per core:  8
Core(s) per socket:  1
Socket(s):           32
NUMA node(s):        8
Model:               2.1 (pvr 004b 0201)
Model name:          POWER8 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:           64K
L1i cache:           32K
L2 cache:            512K
L3 cache:            8192K
NUMA node0 CPU(s):   0-31
NUMA node1 CPU(s):   32-63
NUMA node2 CPU(s):   64-95
NUMA node3 CPU(s):   96-127
NUMA node4 CPU(s):   128-159
NUMA node5 CPU(s):   160-191
NUMA node6 CPU(s):   192-223
NUMA node7 CPU(s):   224-255

numa01 is a set of 2 processes, each running 128 threads;
each thread doing 50 loops on 3GB process shared memory operations.

numa02 is a single process with 256 threads;
each thread doing 800 loops on 32MB thread local memory operations.

numa03 is a single process with 256 threads;
each thread doing 50 loops on 3GB process shared memory operations.

numa04 is a set of 8 processes (as many as there are nodes), each running 32 threads;
each thread doing 50 loops on 3GB process shared memory operations.

numa05 is a set of 16 processes (twice the number of nodes), each running 16 threads;
each thread doing 50 loops on 3GB process shared memory operations.
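
For reference, a minimal pthread sketch of the simplest of these, a
numa02-style worker, might look as follows (an illustrative reconstruction
from the description above, not the actual benchmark source; buffer size
and loop count are taken from the numa02 description):

	#include <pthread.h>
	#include <stdlib.h>
	#include <string.h>

	#define NR_THREADS	256
	#define BUF_SIZE	(32UL << 20)	/* 32MB thread-local buffer */
	#define NR_LOOPS	800

	/* Each thread touches only its own buffer, so nothing is shared */
	static void *worker(void *arg)
	{
		char *buf = malloc(BUF_SIZE);
		int i;

		for (i = 0; i < NR_LOOPS; i++)
			memset(buf, i, BUF_SIZE);

		free(buf);
		return arg;
	}

	int main(void)
	{
		pthread_t threads[NR_THREADS];
		int i;

		for (i = 0; i < NR_THREADS; i++)
			pthread_create(&threads[i], NULL, worker, NULL);
		for (i = 0; i < NR_THREADS; i++)
			pthread_join(threads[i], NULL);
		return 0;
	}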

Details below:
Testcase         Time:  Min        Max        Avg        StdDev
./numa01.sh      Real:  513.12     547.37     530.25     17.12
./numa01.sh      Sys:   107.73     146.26     127.00     19.26
./numa01.sh      User:  122812.39  129136.61  125974.50  3162.11
./numa02.sh      Real:  68.23      72.44      70.34      2.10
./numa02.sh      Sys:   52.35      55.65      54.00      1.65
./numa02.sh      User:  14334.37   14907.14   14620.76   286.38
./numa03.sh      Real:  471.36     485.19     478.27     6.92
./numa03.sh      Sys:   74.91      77.03      75.97      1.06
./numa03.sh      User:  118197.30  121238.68  119717.99  1520.69
./numa04.sh      Real:  450.35     454.93     452.64     2.29
./numa04.sh      Sys:   362.49     397.95     380.22     17.73
./numa04.sh      User:  93150.82   93300.60   93225.71   74.89
./numa05.sh      Real:  361.18     366.32     363.75     2.57
./numa05.sh      Sys:   678.72     726.32     702.52     23.80
./numa05.sh      User:  82634.58   85103.97   83869.27   1234.70
Testcase         Time:  Min        Max        Avg        StdDev   %Change
./numa01.sh      Real:  485.45     530.20     507.83     22.37    4.41486%
./numa01.sh      Sys:   123.45     130.62     127.03     3.59     -0.0236165%
./numa01.sh      User:  119152.08  127121.14  123136.61  3984.53  2.30467%
./numa02.sh      Real:  78.87      82.31      80.59      1.72     -12.7187%
./numa02.sh      Sys:   81.18      85.07      83.12      1.94     -35.0337%
./numa02.sh      User:  16303.70   17122.14   16712.92   409.22   -12.5182%
./numa03.sh      Real:  477.20     528.12     502.66     25.46    -4.85219%
./numa03.sh      Sys:   88.93      115.36     102.15     13.21    -25.629%
./numa03.sh      User:  119120.73  129829.89  124475.31  5354.58  -3.8219%
./numa04.sh      Real:  374.70     414.76     394.73     20.03    14.6708%
./numa04.sh      Sys:   357.14     379.20     368.17     11.03    3.27294%
./numa04.sh      User:  87830.73   88547.21   88188.97   358.24   5.7113%
./numa05.sh      Real:  369.50     401.56     385.53     16.03    -5.64937%
./numa05.sh      Sys:   718.99     741.02     730.00     11.01    -3.76438%
./numa05.sh      User:  84989.07   85271.75   85130.41   141.34   -1.48142%

vmstat for numa01
param                   last_patch  with_patch  %Change
-----                   ----------  ----------  -------
numa_foreign            0           0           NA
numa_hint_faults        2170524     2021927     -6.84613%
numa_hint_faults_local  376099      337768      -10.1917%
numa_hit                1177785     1149206     -2.4265%
numa_huge_pte_updates   0           0           NA
numa_interleave         0           0           NA
numa_local              1176900     1149095     -2.36256%
numa_miss               0           0           NA
numa_other              885         111         -87.4576%
numa_pages_migrated     304670      292963      -3.84252%
numa_pte_updates        2171627     2022996     -6.84422%
pgfault                 4469999     4266785     -4.54618%
pgmajfault              280         247         -11.7857%
pgmigrate_fail          1           0           -100%
pgmigrate_success       304670      292963      -3.84252%

vmstat for numa02
param                   last_patch  with_patch  %Change
-----                   ----------  ----------  -------
numa_foreign            0           0           NA
numa_hint_faults        496508      508975      2.51094%
numa_hint_faults_local  295974      282634      -4.50715%
numa_hit                585706      642712      9.73287%
numa_huge_pte_updates   0           0           NA
numa_interleave         0           0           NA
numa_local              585700      642677      9.72802%
numa_miss               0           0           NA
numa_other              6           35          483.333%
numa_pages_migrated     199884      224448      12.2891%
numa_pte_updates        513146      525354      2.37905%
pgfault                 1111950     1238982     11.4243%
pgmajfault              121         141         16.5289%
pgmigrate_fail          0           0           NA
pgmigrate_success       199884      224448      12.2891%

vmstat for numa03
param                   last_patch  with_patch  %Change
-----                   ----------  ----------  -------
numa_foreign            0           0           NA
numa_hint_faults        863404      951850      10.2439%
numa_hint_faults_local  108422      120466      11.1084%
numa_hit                612432      592068      -3.3251%
numa_huge_pte_updates   0           0           NA
numa_interleave         0           0           NA
numa_local              612384      592059      -3.319%
numa_miss               0           0           NA
numa_other              48          9           -81.25%
numa_pages_migrated     118517      121945      2.89241%
numa_pte_updates        865936      952055      9.94519%
pgfault                 2291712     2325598     1.47863%
pgmajfault              155         113         -27.0968%
pgmigrate_fail          0           2           NA
pgmigrate_success       118517      121945      2.89241%

vmstat for numa04
param                   last_patch  with_patch  %Change
-----                   ----------  ----------  -------
numa_foreign            0           0           NA
numa_hint_faults        8122814     7678754     -5.46682%
numa_hint_faults_local  3965028     4202779     5.9962%
numa_hit                2453692     2412929     -1.66129%
numa_huge_pte_updates   0           0           NA
numa_interleave         0           0           NA
numa_local              2453668     2412815     -1.66498%
numa_miss               0           0           NA
numa_other              24          114         375%
numa_pages_migrated     1302687     1249958     -4.04771%
numa_pte_updates        8139895     7683560     -5.60615%
pgfault                 10420191    10002382    -4.00961%
pgmajfault              145         166         14.4828%
pgmigrate_fail          0           1           NA
pgmigrate_success       1302687     1249958     -4.04771%

vmstat for numa05
param                   last_patch  with_patch  %Change
-----                   ----------  ----------  -------
numa_foreign            0           252995      NA
numa_hint_faults        16968844    16706026    -1.54883%
numa_hint_faults_local  10525364    10167507    -3.39995%
numa_hit                4354639     3947252     -9.35524%
numa_huge_pte_updates   0           0           NA
numa_interleave         0           0           NA
numa_local              4354568     3947234     -9.35418%
numa_miss               0           252995      NA
numa_other              71          253013      356256%
numa_pages_migrated     2398713     2288409     -4.59847%
numa_pte_updates        16997456    16760448    -1.39437%
pgfault                 20471213    19945264    -2.56921%
pgmajfault              166         261         57.2289%
pgmigrate_fail          4           2           -50%
pgmigrate_success       2398713     2288409     -4.59847%


numa06 is a set of 2 processes, each running 32 threads;
each thread doing 50 loops on 3GB process shared memory operations.

numa07 is a single process with 32 threads;
each thread doing 800 loops on 32MB thread local memory operations.

numa08 is a single process with 32 threads;
each thread doing 50 loops on 3GB process shared memory operations.

numa09 is a set of 8 processes (as many as there are nodes), each running 4 threads;
each thread doing 50 loops on 3GB process shared memory operations.

numa10 is a set of 16 processes (twice the number of nodes), each running 2 threads;
each thread doing 50 loops on 3GB process shared memory operations.


Testcase         Time:  Min      Max      Avg      StdDev
./numa06.sh      Real:  81.30    85.29    83.30    2.00
./numa06.sh      Sys:   6.15     8.64     7.40     1.24
./numa06.sh      User:  2493.87  2499.31  2496.59  2.72
./numa07.sh      Real:  17.01    18.47    17.74    0.73
./numa07.sh      Sys:   2.08     2.33     2.21     0.13
./numa07.sh      User:  396.38   427.87   412.12   15.74
./numa08.sh      Real:  77.89    79.05    78.47    0.58
./numa08.sh      Sys:   3.76     4.66     4.21     0.45
./numa08.sh      User:  2396.50  2443.64  2420.07  23.57
./numa09.sh      Real:  60.64    65.37    63.01    2.37
./numa09.sh      Sys:   31.28    33.10    32.19    0.91
./numa09.sh      User:  1666.04  1685.55  1675.80  9.75
./numa10.sh      Real:  56.48    56.64    56.56    0.08
./numa10.sh      Sys:   56.59    63.25    59.92    3.33
./numa10.sh      User:  1487.83  1492.53  1490.18  2.35
Testcase         Time:  Min      Max      Avg      StdDev  %Change
./numa06.sh      Real:  74.43    79.30    76.87    2.43    8.36477%
./numa06.sh      Sys:   8.64     9.16     8.90     0.26    -16.8539%
./numa06.sh      User:  2278.98  2376.25  2327.61  48.64   7.25981%
./numa07.sh      Real:  14.32    14.59    14.46    0.14    22.6833%
./numa07.sh      Sys:   2.02     2.09     2.05     0.04    7.80488%
./numa07.sh      User:  338.27   349.57   343.92   5.65    19.8302%
./numa08.sh      Real:  75.19    81.25    78.22    3.03    0.319611%
./numa08.sh      Sys:   3.92     3.98     3.95     0.03    6.58228%
./numa08.sh      User:  2320.61  2509.58  2415.10  94.48   0.205789%
./numa09.sh      Real:  64.44    64.65    64.55    0.10    -2.38575%
./numa09.sh      Sys:   32.11    39.12    35.61    3.51    -9.60404%
./numa09.sh      User:  1700.54  1771.65  1736.10  35.56   -3.4733%
./numa10.sh      Real:  56.78    57.61    57.20    0.42    -1.11888%
./numa10.sh      Sys:   67.30    67.82    67.56    0.26    -11.3085%
./numa10.sh      User:  1502.38  1502.95  1502.66  0.29    -0.830527%

vmstat for numa06
param                   last_patch  with_patch  %Change
-----                   ----------  ----------  -------
numa_foreign            0           0           NA
numa_hint_faults        1401846     1317738     -5.9998%
numa_hint_faults_local  291501      254441      -12.7135%
numa_hit                490509      495083      0.932501%
numa_huge_pte_updates   0           0           NA
numa_interleave         0           0           NA
numa_local              490506      495068      0.93006%
numa_miss               0           0           NA
numa_other              3           15          400%
numa_pages_migrated     224869      237124      5.44984%
numa_pte_updates        1401947     1317899     -5.99509%
pgfault                 1817481     1775118     -2.33086%
pgmajfault              175         178         1.71429%
pgmigrate_fail          0           0           NA
pgmigrate_success       224869      237124      5.44984%

vmstat for numa07
param                   last_patch  with_patch  %Change
-----                   ----------  ----------  -------
numa_foreign            0           0           NA
numa_hint_faults        90935       87129       -4.18541%
numa_hint_faults_local  52864       49110       -7.10124%
numa_hit                94632       91902       -2.88486%
numa_huge_pte_updates   0           0           NA
numa_interleave         0           0           NA
numa_local              94632       91902       -2.88486%
numa_miss               0           0           NA
numa_other              0           0           NA
numa_pages_migrated     37232       37744       1.37516%
numa_pte_updates        92987       89177       -4.09735%
pgfault                 171811      177212      3.14357%
pgmajfault              65          72          10.7692%
pgmigrate_fail          0           0           NA
pgmigrate_success       37232       37744       1.37516%

vmstat for numa08
param                   last_patch  with_patch  %Change
-----                   ----------  ----------  -------
numa_foreign            0           0           NA
numa_hint_faults        656205      578320      -11.869%
numa_hint_faults_local  77425       85553       10.4979%
numa_hit                262903      246913      -6.08209%
numa_huge_pte_updates   0           0           NA
numa_interleave         0           0           NA
numa_local              262902      246902      -6.08592%
numa_miss               0           0           NA
numa_other              1           11          1000%
numa_pages_migrated     115615      94939       -17.8835%
numa_pte_updates        656300      578399      -11.8697%
pgfault                 1000775     879013      -12.1668%
pgmajfault              80          173         116.25%
pgmigrate_fail          0           0           NA
pgmigrate_success       115615      94939       -17.8835%

vmstat for numa09
param                   last_patch  with_patch  %Change
-----                   ----------  ----------  -------
numa_foreign            0           0           NA
numa_hint_faults        5292059     5086197     -3.89002%
numa_hint_faults_local  2771125     2463519     -11.1004%
numa_hit                1993632     2043106     2.4816%
numa_huge_pte_updates   0           0           NA
numa_interleave         0           0           NA
numa_local              1993631     2043076     2.48015%
numa_miss               0           0           NA
numa_other              1           30          2900%
numa_pages_migrated     1154157     1223564     6.01365%
numa_pte_updates        5313698     5098234     -4.05488%
pgfault                 6531964     6196370     -5.13772%
pgmajfault              83          121         45.7831%
pgmigrate_fail          0           0           NA
pgmigrate_success       1154157     1223564     6.01365%

vmstat for numa10
param                   last_patch  with_patch  %Change
-----                   ----------  ----------  -------
numa_foreign            0           195343      NA
numa_hint_faults        9745914     10968959    12.5493%
numa_hint_faults_local  6331681     7146416     12.8676%
numa_hit                3533392     3466916     -1.88136%
numa_huge_pte_updates   0           0           NA
numa_interleave         0           0           NA
numa_local              3533392     3466908     -1.88159%
numa_miss               0           195343      NA
numa_other              0           195351      NA
numa_pages_migrated     1930180     2050279     6.22217%
numa_pte_updates        9798861     11018095    12.4426%
pgfault                 11544963    12744348    10.3888%
pgmajfault              83          154         85.5422%
pgmigrate_fail          0           0           NA
pgmigrate_success       1930180     2050279     6.22217%

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
  2020-01-17 17:56 ` Srikar Dronamraju
@ 2020-01-17 21:58   ` Mel Gorman
  2020-01-20  8:09     ` Srikar Dronamraju
  0 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2020-01-17 21:58 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Vincent Guittot, Phil Auld, Ingo Molnar, Peter Zijlstra,
	Valentin Schneider, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML

On Fri, Jan 17, 2020 at 11:26:31PM +0530, Srikar Dronamraju wrote:
> * Mel Gorman <mgorman@techsingularity.net> [2020-01-14 10:13:20]:
> 
> > Changelog since V3
> > o Allow a fixed imbalance based on a basic comparison with 2 tasks. This turned out to
> >   be as good or better than allowing an imbalance based on the group weight
> >   without worrying about potential spillover of the lower scheduler domains.
> > 
> 
> We certainly are seeing better results than v1.
> > However numa02, numa03, numa05, numa09 and numa10 still seem to be regressing, while
> the others are improving.
> 
> > While numa04 improves by 14%, numa02 regresses by around 12%.
> 

Ok, so it's both a win and a loss. It is curious that this patch would
be the primary factor given that the logic only triggers when the
local group has spare capacity and the busiest group is nearly idle. The
test cases you describe should have fairly busy local groups.

> Setup:
> Architecture:        ppc64le
> Byte Order:          Little Endian
> CPU(s):              256
> On-line CPU(s) list: 0-255
> Thread(s) per core:  8
> Core(s) per socket:  1
> Socket(s):           32
> NUMA node(s):        8
> Model:               2.1 (pvr 004b 0201)
> Model name:          POWER8 (architected), altivec supported
> Hypervisor vendor:   pHyp
> Virtualization type: para
> L1d cache:           64K
> L1i cache:           32K
> L2 cache:            512K
> L3 cache:            8192K
> NUMA node0 CPU(s):   0-31
> NUMA node1 CPU(s):   32-63
> NUMA node2 CPU(s):   64-95
> NUMA node3 CPU(s):   96-127
> NUMA node4 CPU(s):   128-159
> NUMA node5 CPU(s):   160-191
> NUMA node6 CPU(s):   192-223
> NUMA node7 CPU(s):   224-255
> 
> numa01 is a set of 2 processes, each running 128 threads;
> each thread doing 50 loops on 3GB process shared memory operations.

Are the shared operations shared between the 2 processes? 256 threads
in total would more than exceed the capacity of a local group; even 128
threads per process would exceed the capacity of the local group. In such
a situation, much would depend on the locality of the accesses as well
as any shared accesses.

> numa02 is a single process with 256 threads;
> each thread doing 800 loops on 32MB thread local memory operations.
> 

This one is more interesting. False sharing shouldn't be an issue so the
threads should be independent.

> numa03 is a single process with 256 threads;
> each thread doing 50 loops on 3GB process shared memory operations.
> 

Similar.

> numa04 is a set of 8 processes (as many as there are nodes), each running 32 threads;
> each thread doing 50 loops on 3GB process shared memory operations.
> 

Less clear as you don't say what is sharing the memory operations.

> numa05 is a set of 16 processes (twice the number of nodes), each running 16 threads;
> each thread doing 50 loops on 3GB process shared memory operations.
> 

Again, hard to tell because the shared memory operations are not described.

Of all of these, numa02 is the most interesting as it's the simplest
case showing a problem.

> Details below:

How many iterations for each test? 

> Testcase         Time:  Min        Max        Avg        StdDev
> ./numa01.sh      Real:  513.12     547.37     530.25     17.12
> ./numa01.sh      Sys:   107.73     146.26     127.00     19.26
> ./numa01.sh      User:  122812.39  129136.61  125974.50  3162.11
> ./numa02.sh      Real:  68.23      72.44      70.34      2.10
> ./numa02.sh      Sys:   52.35      55.65      54.00      1.65
> ./numa02.sh      User:  14334.37   14907.14   14620.76   286.38
> ./numa03.sh      Real:  471.36     485.19     478.27     6.92
> ./numa03.sh      Sys:   74.91      77.03      75.97      1.06
> ./numa03.sh      User:  118197.30  121238.68  119717.99  1520.69
> ./numa04.sh      Real:  450.35     454.93     452.64     2.29
> ./numa04.sh      Sys:   362.49     397.95     380.22     17.73
> ./numa04.sh      User:  93150.82   93300.60   93225.71   74.89
> ./numa05.sh      Real:  361.18     366.32     363.75     2.57
> ./numa05.sh      Sys:   678.72     726.32     702.52     23.80
> ./numa05.sh      User:  82634.58   85103.97   83869.27   1234.70
> Testcase         Time:  Min        Max        Avg        StdDev   %Change
> ./numa01.sh      Real:  485.45     530.20     507.83     22.37    4.41486%
> ./numa01.sh      Sys:   123.45     130.62     127.03     3.59     -0.0236165%
> ./numa01.sh      User:  119152.08  127121.14  123136.61  3984.53  2.30467%

The number of iterations is unknown in general but there is a lot of
overlap between the min and max ranges and the range is wide. It may or
may not be a gain overall.

Before range: 513 to 547
After  range: 485 to 530


> ./numa02.sh      Real:  78.87      82.31      80.59      1.72     -12.7187%
> ./numa02.sh      Sys:   81.18      85.07      83.12      1.94     -35.0337%
> ./numa02.sh      User:  16303.70   17122.14   16712.92   409.22   -12.5182%

Before range: 68 to 72
After range: 78 to 82

This one is more interesting in general. Can you add trace_printks to
the check for SD_NUMA the patch introduces and dump the sum_nr_running
for both local and busiest when the imbalance is ignored please? That
might give some hint as to the improper conditions where imbalance is
ignored.

However, knowing the number of iterations would be helpful. Can you also
tell me if this is consistent between boots or is it always roughly 12%
regression regardless of the number of iterations?

> ./numa03.sh      Real:  477.20     528.12     502.66     25.46    -4.85219%
> ./numa03.sh      Sys:   88.93      115.36     102.15     13.21    -25.629%
> ./numa03.sh      User:  119120.73  129829.89  124475.31  5354.58  -3.8219%

Range before: 471 to 485
Range after: 477 to 528

> ./numa04.sh      Real:  374.70     414.76     394.73     20.03    14.6708%
> ./numa04.sh      Sys:   357.14     379.20     368.17     11.03    3.27294%
> ./numa04.sh      User:  87830.73   88547.21   88188.97   358.24   5.7113%

Range before: 450 to 454
Range after:  374 to 414

Big gain there but the fact the range changed so much is a concern and
makes me wonder if this case is stable from boot to boot. 

> ./numa05.sh      Real:  369.50     401.56     385.53     16.03    -5.64937%
> ./numa05.sh      Sys:   718.99     741.02     730.00     11.01    -3.76438%
> ./numa05.sh      User:  84989.07   85271.75   85130.41   141.34   -1.48142%
> 

Big range changes again but the shared memory operations complicate
matters. I think it's best to focus on numa02 and identify whether there
is an improper condition where the patch has an impact: the local group
has high utilisation but spare capacity while the busiest group is
almost completely idle.

> vmstat for numa01

I'm not going to comment in detail on these other than noting that NUMA
balancing is heavily active in all cases, which may be masking any effect
of the patch and may make the results unstable in general.

> <SNIP vmstat>
> <SNIP description of loads that showed gains>
>
> numa09 is a set of 8 processes (as many as there are nodes), each running 4 threads;
> each thread doing 50 loops on 3GB process shared memory operations.
> 

No description of shared operations but NUMA balancing is very active so
sharing is probably between processes.

> numa10 is a set of 16 processes (twice the number of nodes), each running 2 threads;
> each thread doing 50 loops on 3GB process shared memory operations.
> 

Again, shared accesses without description and heavy NUMA balancing
activity.

So bottom line, a lot of these cases have shared operations where NUMA
balancing decisions should dominate and make it hard to detect any impact
from the patch. The exception is numa02 so please add tracing and dump
out local and busiest sum_nr_running when the imbalance is ignored. I
want to see if it's as simple as the local group is very busy but has
capacity where the busiest group is almost idle. I also want to see how
many times over the course of the numa02 workload that the conditions
for the patch are even met.

Thanks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
  2020-01-17 21:58   ` Mel Gorman
@ 2020-01-20  8:09     ` Srikar Dronamraju
  2020-01-20  8:33       ` Mel Gorman
  0 siblings, 1 reply; 24+ messages in thread
From: Srikar Dronamraju @ 2020-01-20  8:09 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Vincent Guittot, Phil Auld, Ingo Molnar, Peter Zijlstra,
	Valentin Schneider, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML

* Mel Gorman <mgorman@techsingularity.net> [2020-01-17 21:58:53]:

> On Fri, Jan 17, 2020 at 11:26:31PM +0530, Srikar Dronamraju wrote:
> > * Mel Gorman <mgorman@techsingularity.net> [2020-01-14 10:13:20]:
> > 
> > We certainly are seeing better results than v1.
> > However numa02, numa03, numa05, numa09 and numa10 still seem to be regressing, while
> > the others are improving.
> > 
> > While numa04 improves by 14%, numa02 regresses by around 12%.
> > 

> Ok, so it's both a win and a loss. It is curious that this patch would
> be the primary factor given that the logic only triggers when the
> local group has spare capacity and the busiest group is nearly idle. The
> test cases you describe should have fairly busy local groups.
> 

Right, your code only seems to take effect when the local group has spare
capacity and busiest->sum_nr_running <= 2.

> > 
> > numa01 is a set of 2 processes, each running 128 threads;
> > each thread doing 50 loops on 3GB process shared memory operations.
> 
> Are the shared operations shared between the 2 processes? 256 threads
> in total would more than exceed the capacity of a local group; even 128
> threads per process would exceed the capacity of the local group. In such
> a situation, much would depend on the locality of the accesses as well
> as any shared accesses.

Except for numa02 and numa07 (both handle thread-local memory operations),
all shared operations are within the process, i.e. per-process sharing.

> 
> > numa02 is a single process with 256 threads;
> > each thread doing 800 loops on 32MB thread local memory operations.
> > 
> 
> This one is more interesting. False sharing shouldn't be an issue so the
> threads should be independent.
> 
> > numa03 is a single process with 256 threads;
> > each thread doing 50 loops on 3GB process shared memory operations.
> > 
> 
> Similar.

This is similar to numa01, except now all threads belong to just one
process.

> 
> > numa04 is a set of 8 processes (as many as there are nodes), each running 32 threads;
> > each thread doing 50 loops on 3GB process shared memory operations.
> > 
> 
> Less clear as you don't say what is sharing the memory operations.

All sharing is within the process. In numa04/numa09, I try to spawn as many
processes as there are nodes; other than that, it's the same as numa02.

> 
> > numa05 is a set of 16 processes (twice the number of nodes), each running 16 threads;
> > each thread doing 50 loops on 3GB process shared memory operations.
> > 
> 
> > Details below:
> 
> How many iterations for each test? 

I run 5 iterations. Want me to run with more iterations?

> 
> 
> > ./numa02.sh      Real:  78.87      82.31      80.59      1.72     -12.7187%
> > ./numa02.sh      Sys:   81.18      85.07      83.12      1.94     -35.0337%
> > ./numa02.sh      User:  16303.70   17122.14   16712.92   409.22   -12.5182%
> 
> Before range: 68 to 72
> After range: 78 to 82
> 
> This one is more interesting in general. Can you add trace_printks to
> the check for SD_NUMA the patch introduces and dump the sum_nr_running
> for both local and busiest when the imbalance is ignored please? That
> might give some hint as to the improper conditions where imbalance is
> ignored.

Can be done. Will get back with the results. But do let me know if you want
me to run with more iterations or rerun the tests.

> 
> However, knowing the number of iterations would be helpful. Can you also
> tell me if this is consistent between boots or is it always roughly 12%
> regression regardless of the number of iterations?
> 

I have only measured 5 iterations and I haven't repeated the runs to see
if the numbers are consistent.

> > ./numa03.sh      Real:  477.20     528.12     502.66     25.46    -4.85219%
> > ./numa03.sh      Sys:   88.93      115.36     102.15     13.21    -25.629%
> > ./numa03.sh      User:  119120.73  129829.89  124475.31  5354.58  -3.8219%
> 
> Range before: 471 to 485
> Range after: 477 to 528
> 
> > ./numa04.sh      Real:  374.70     414.76     394.73     20.03    14.6708%
> > ./numa04.sh      Sys:   357.14     379.20     368.17     11.03    3.27294%
> > ./numa04.sh      User:  87830.73   88547.21   88188.97   358.24   5.7113%
> 
> Range before: 450 to 454
> Range after:  374 to 414
> 
> Big gain there but the fact the range changed so much is a concern and
> makes me wonder if this case is stable from boot to boot. 
> 
> > ./numa05.sh      Real:  369.50     401.56     385.53     16.03    -5.64937%
> > ./numa05.sh      Sys:   718.99     741.02     730.00     11.01    -3.76438%
> > ./numa05.sh      User:  84989.07   85271.75   85130.41   141.34   -1.48142%
> > 
> 
> Big range changes again but the shared memory operations complicate
> matters. I think it's best to focus on numa02 and identify whether there
> is an improper condition where the patch has an impact: the local group
> has high utilisation but spare capacity while the busiest group is
> almost completely idle.
> 
> > vmstat for numa01
> 
> I'm not going to comment in detail on these other than noting that NUMA
> balancing is heavily active in all cases, which may be masking any effect
> of the patch and may make the results unstable in general.
> 
> > <SNIP vmstat>
> > <SNIP description of loads that showed gains>
> >
> > numa09 is a set of 8 process (as many nodes) each running 4 threads;
> > each thread doing 50 loops on 3GB process shared memory operations.
> > 
> 
> No description of shared operations but NUMA balancing is very active so
> sharing is probably between processes.
> 
> > numa10 is a set of 16 process (twice as many nodes) each running 2 threads;
> > each thread doing 50 loops on 3GB process shared memory operations.
> > 
> 
> Again, shared accesses without description and heavy NUMA balancing
> activity.
> 
> So bottom line, a lot of these cases have shared operations where NUMA
> balancing decisions should dominate and make it hard to detect any impact
> from the patch. The exception is numa02 so please add tracing and dump
> out local and busiest sum_nr_running when the imbalance is ignored. I
> want to see if it's as simple as the local group is very busy but has
> capacity where the busiest group is almost idle. I also want to see how
> many times over the course of the numa02 workload that the conditions
> for the patch are even met.
> 

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
  2020-01-20  8:09     ` Srikar Dronamraju
@ 2020-01-20  8:33       ` Mel Gorman
  2020-01-20 17:27         ` Srikar Dronamraju
  0 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2020-01-20  8:33 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Vincent Guittot, Phil Auld, Ingo Molnar, Peter Zijlstra,
	Valentin Schneider, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML

On Mon, Jan 20, 2020 at 01:39:35PM +0530, Srikar Dronamraju wrote:
> * Mel Gorman <mgorman@techsingularity.net> [2020-01-17 21:58:53]:
> 
> > On Fri, Jan 17, 2020 at 11:26:31PM +0530, Srikar Dronamraju wrote:
> > > * Mel Gorman <mgorman@techsingularity.net> [2020-01-14 10:13:20]:
> > > 
> > > We certainly are seeing better results than v1.
> > > However numa02, numa03, numa05, numa09 and numa10 still seem to be regressing, while
> > > the others are improving.
> > > 
> > > While numa04 improves by 14%, numa02 regresses by around 12%.
> > > 
> 
> > Ok, so it's both a win and a loss. It is curious that this patch would
> > be the primary factor given that the logic only triggers when the
> > local group has spare capacity and the busiest group is nearly idle. The
> > test cases you describe should have fairly busy local groups.
> > 
> 
> Right, your code only seems to take effect when the local group has spare
> capacity and busiest->sum_nr_running <= 2.
> 

And this is why I'm curious as to why your workload is affected at all
because it uses many tasks.  I stopped allowing an imbalance for higher
task counts partially on the basis of your previous report.

> > This one is more interesting. False sharing shouldn't be an issue so the
> > threads should be independent.
> > 
> > > numa03 is a single process with 256 threads;
> > > each thread doing 50 loops on 3GB process shared memory operations.
> > > 
> > 
> > Similar.
> 
> This is similar to numa01. Except now all threads belong to just one
> process.
> 

My concern is that the shared memory options mean that NUMA balancing
and false sharing can dominate and hide any impact of the patch itself.
Whether it has good or bad results may be partially down to luck.

> > 
> > > numa05 is a set of 16 processes (twice the number of nodes), each running 16 threads;
> > > each thread doing 50 loops on 3GB process shared memory operations.
> > > 
> > 
> > > Details below:
> > 
> > How many iterations for each test? 
> 
> I run 5 iterations. Want me to run with more iterations?
> 

5 should be enough for now. I'm more interested in hearing if the
regression/gain is consistent when the patch is applied and a confirmation
that the patch really makes a difference to this set of workloads.

> > 
> > 
> > > ./numa02.sh      Real:  78.87      82.31      80.59      1.72     -12.7187%
> > > ./numa02.sh      Sys:   81.18      85.07      83.12      1.94     -35.0337%
> > > ./numa02.sh      User:  16303.70   17122.14   16712.92   409.22   -12.5182%
> > 
> > Before range: 68 to 72
> > After range: 78 to 82
> > 
> > This one is more interesting in general. Can you add trace_printks to
> > the check for SD_NUMA the patch introduces and dump the sum_nr_running
> > for both local and busiest when the imbalance is ignored please? That
> > might give some hint as to the improper conditions where imbalance is
> > ignored.
> 
> Can be done. Will get back with the results. But do let me know if you want
> me to run with more iterations or rerun the tests.
> 

The results of this will be interesting in themselves. I'm particularly
interested in seeing what the traces look like for a good and bad result.

> > 
> > However, knowing the number of iterations would be helpful. Can you also
> > tell me if this is consistent between boots or is it always roughly 12%
> > regression regardless of the number of iterations?
> > 
> 
> I have only measured 5 iterations and I haven't repeated the runs to see
> if the numbers are consistent.
> 

Ok, that is quite a problem as the assertion at the moment is that the
patch causes a mix of regressions/gains. It's currently unclear to me
why the patch would have a major impact on this workload at all given the
number of active tasks and the nature of the patch.  I'm concerned that
the workload may be naturally unstable but tracing may be able to help.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
  2020-01-20  8:33       ` Mel Gorman
@ 2020-01-20 17:27         ` Srikar Dronamraju
  2020-01-20 18:21           ` Mel Gorman
  0 siblings, 1 reply; 24+ messages in thread
From: Srikar Dronamraju @ 2020-01-20 17:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Vincent Guittot, Phil Auld, Ingo Molnar, Peter Zijlstra,
	Valentin Schneider, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML

> And this is why I'm curious as to why your workload is affected at all
> because it uses many tasks.  I stopped allowing an imbalance for higher
> task counts partially on the basis of your previous report.
> 

With this hunk on top of your patch and 5 runs of numa02, there were 0
traces.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ade7a8dca5e4..7506cf67bde8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8714,8 +8714,10 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 			 * the risk that lower domains have to be rebalanced.
 			 */
 			imbalance_min = 2;
-			if (busiest->sum_nr_running <= imbalance_min)
+			if (busiest->sum_nr_running <= imbalance_min) {
+				trace_printk("Resetting imbalance: busiest->sum_nr_running=%d, local->sum_nr_running=%d\n", busiest->sum_nr_running, local->sum_nr_running);
 				env->imbalance = 0;
+			}
 		}
 
 		return;
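
(For anyone reproducing this: trace_printk() writes into the ftrace ring
buffer, so "0 traces" means nothing showed up when reading the buffer back
after each run, e.g. with:

	# cat /sys/kernel/debug/tracing/trace

or the equivalent path wherever tracefs is mounted.)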


perf stat for the 5 iterations this time shows: 
77.817 +- 0.995 seconds time elapsed  ( +-  1.28% )
which I think is significantly less than last time around.
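
(That summary line is perf stat's repeat-mode output, i.e. presumably from
an invocation along the lines of:

	# perf stat -r 5 ./numa02.sh

which reports the mean +- stddev of the elapsed time across the five runs.)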

So I think it may be some other noise that contributed to the jump last
time. Also, since the runtime of numa02 is very short, a small disturbance
can show up as a big number in percentage terms.

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
  2020-01-20 17:27         ` Srikar Dronamraju
@ 2020-01-20 18:21           ` Mel Gorman
  2020-01-21  8:55             ` Srikar Dronamraju
  0 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2020-01-20 18:21 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Vincent Guittot, Phil Auld, Ingo Molnar, Peter Zijlstra,
	Valentin Schneider, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML

On Mon, Jan 20, 2020 at 10:57:06PM +0530, Srikar Dronamraju wrote:
> > And this is why I'm curious as to why your workload is affected at all
> > because it uses many tasks.  I stopped allowing an imbalance for higher
> > task counts partially on the basis of your previous report.
> > 
> 
> With this hunk on top of your patch and 5 runs of numa02, there were 0
> traces.
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ade7a8dca5e4..7506cf67bde8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8714,8 +8714,10 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>  			 * the risk that lower domains have to be rebalanced.
>  			 */
>  			imbalance_min = 2;
> -			if (busiest->sum_nr_running <= imbalance_min)
> +			if (busiest->sum_nr_running <= imbalance_min) {
> +				trace_printk("Resetting imbalance: busiest->sum_nr_running=%d, local->sum_nr_running=%d\n", busiest->sum_nr_running, local->sum_nr_running);
>  				env->imbalance = 0;
> +			}
>  		}
>  
>  		return;
> 

Ok, thanks. The absence of traces indicates that the patch should have
no effect at all and any difference in performance is a coincidence.
What about the other test programs?

> 
> perf stat for the 5 iterations this time shows: 
> 77.817 +- 0.995 seconds time elapsed  ( +-  1.28% )
> which I think is significantly less than last time around.
> 
> So I think some other noise may have contributed to the jump last time.
> Also, since numa02's runtime is very short, a small disturbance can show
> up as a big number in percentage terms.

Understood. At the moment, I'm going to assume that the patch has zero
impact on your workload but confirmation that the other test programs
trigger no traces would be appreciated.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
  2020-01-20 18:21           ` Mel Gorman
@ 2020-01-21  8:55             ` Srikar Dronamraju
  2020-01-21  9:11               ` Mel Gorman
  0 siblings, 1 reply; 24+ messages in thread
From: Srikar Dronamraju @ 2020-01-21  8:55 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Vincent Guittot, Phil Auld, Ingo Molnar, Peter Zijlstra,
	Valentin Schneider, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML

* Mel Gorman <mgorman@techsingularity.net> [2020-01-20 18:21:00]:

> Understood. At the moment, I'm going to assume that the patch has zero
> impact on your workload but confirmation that the other test programs
> trigger no traces would be appreciated.
> 

Yes, I confirm there were no traces when run with other test programs too.

> -- 
> Mel Gorman
> SUSE Labs

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
  2020-01-21  8:55             ` Srikar Dronamraju
@ 2020-01-21  9:11               ` Mel Gorman
  2020-01-21 10:42                 ` Peter Zijlstra
  0 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2020-01-21  9:11 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Vincent Guittot, Phil Auld, Ingo Molnar, Peter Zijlstra,
	Valentin Schneider, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML

On Tue, Jan 21, 2020 at 02:25:01PM +0530, Srikar Dronamraju wrote:
> * Mel Gorman <mgorman@techsingularity.net> [2020-01-20 18:21:00]:
> 
> > Understood. At the moment, I'm going to assume that the patch has zero
> > impact on your workload but confirmation that the other test programs
> > trigger no traces would be appreciated.
> > 
> 
> Yes, I confirm there were no traces when run with other test programs too.
> 

Ok, great, thanks for confirming that!

Peter or Ingo, I think at this point all review comments have been
addressed. Is there anything else you'd like before picking the patch
up?

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
  2020-01-14 10:13 [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4 Mel Gorman
                   ` (4 preceding siblings ...)
  2020-01-17 17:56 ` Srikar Dronamraju
@ 2020-01-21  9:59 ` Srikar Dronamraju
  2020-01-29 11:32 ` [tip: sched/core] sched/fair: Allow a small load imbalance between low utilisation SD_NUMA domains tip-bot2 for Mel Gorman
  6 siblings, 0 replies; 24+ messages in thread
From: Srikar Dronamraju @ 2020-01-21  9:59 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Vincent Guittot, Phil Auld, Ingo Molnar, Peter Zijlstra,
	Valentin Schneider, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML

* Mel Gorman <mgorman@techsingularity.net> [2020-01-14 10:13:20]:

> Changelog since V3
> o Allow a fixed imbalance a basic comparison with 2 tasks. This turned out to
>   be as good or better than allowing an imbalance based on the group weight
>   without worrying about potential spillover of the lower scheduler domains.
> 

Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4
  2020-01-21  9:11               ` Mel Gorman
@ 2020-01-21 10:42                 ` Peter Zijlstra
  0 siblings, 0 replies; 24+ messages in thread
From: Peter Zijlstra @ 2020-01-21 10:42 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Srikar Dronamraju, Vincent Guittot, Phil Auld, Ingo Molnar,
	Valentin Schneider, Quentin Perret, Dietmar Eggemann,
	Morten Rasmussen, Hillf Danton, Parth Shah, Rik van Riel, LKML

On Tue, Jan 21, 2020 at 09:11:48AM +0000, Mel Gorman wrote:
> On Tue, Jan 21, 2020 at 02:25:01PM +0530, Srikar Dronamraju wrote:
> > * Mel Gorman <mgorman@techsingularity.net> [2020-01-20 18:21:00]:
> > 
> > > Understood. At the moment, I'm going to assume that the patch has zero
> > > impact on your workload but confirmation that the other test programs
> > > trigger no traces would be appreciated.
> > > 
> > 
> > Yes, I confirm there were no traces when run with other test programs too.
> > 
> 
> Ok, great, thanks for confirming that!
> 
> Peter or Ingo, I think at this point all review comments have been
> addressed. Is there anything else you'd like before picking the patch
> up?

I already queued it a few days ago; it should show up in tip soonish :-)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [tip: sched/core] sched/fair: Allow a small load imbalance between low utilisation SD_NUMA domains
  2020-01-14 10:13 [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4 Mel Gorman
                   ` (5 preceding siblings ...)
  2020-01-21  9:59 ` Srikar Dronamraju
@ 2020-01-29 11:32 ` tip-bot2 for Mel Gorman
  6 siblings, 0 replies; 24+ messages in thread
From: tip-bot2 for Mel Gorman @ 2020-01-29 11:32 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Mel Gorman, Peter Zijlstra (Intel),
	Ingo Molnar, Valentin Schneider, Vincent Guittot,
	Srikar Dronamraju, Phil Auld, x86, LKML

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     b396f52326de20ec974471b7b19168867b365cbf
Gitweb:        https://git.kernel.org/tip/b396f52326de20ec974471b7b19168867b365cbf
Author:        Mel Gorman <mgorman@techsingularity.net>
AuthorDate:    Tue, 14 Jan 2020 10:13:20 
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Tue, 28 Jan 2020 21:36:55 +01:00

sched/fair: Allow a small load imbalance between low utilisation SD_NUMA domains

The CPU load balancer balances between different domains to spread load
and strives to have equal balance everywhere. Communicating tasks can
migrate so they are topologically close to each other but these decisions
are independent. On a lightly loaded NUMA machine, two communicating tasks
pulled together at wakeup time can be pushed apart by the load balancer.
In isolation, the load balancer decision is fine but it ignores the tasks'
data locality and the wakeup/LB paths continually conflict. NUMA balancing
is also a factor but it also simply conflicts with the load balancer.

This patch allows a fixed degree of imbalance of two tasks to exist
between NUMA domains regardless of utilisation levels. In many cases,
this prevents communicating tasks being pulled apart. It was evaluated
whether the imbalance should be scaled to the domain size. However, no
additional benefit was measured across a range of workloads and machines
and scaling adds the risk that lower domains have to be rebalanced. While
this could change again in the future, such a change should specify the
use case and benefit.

The most obvious impact is on netperf TCP_STREAM -- two simple
communicating tasks with some softirq offload depending on the
transmission rate.

 2-socket Haswell machine 48 core, HT enabled
 netperf-tcp -- mmtests config config-network-netperf-unbound
			      baseline              lbnuma-v3
 Hmean     64         568.73 (   0.00%)      577.56 *   1.55%*
 Hmean     128       1089.98 (   0.00%)     1128.06 *   3.49%*
 Hmean     256       2061.72 (   0.00%)     2104.39 *   2.07%*
 Hmean     1024      7254.27 (   0.00%)     7557.52 *   4.18%*
 Hmean     2048     11729.20 (   0.00%)    13350.67 *  13.82%*
 Hmean     3312     15309.08 (   0.00%)    18058.95 *  17.96%*
 Hmean     4096     17338.75 (   0.00%)    20483.66 *  18.14%*
 Hmean     8192     25047.12 (   0.00%)    27806.84 *  11.02%*
 Hmean     16384    27359.55 (   0.00%)    33071.88 *  20.88%*
 Stddev    64           2.16 (   0.00%)        2.02 (   6.53%)
 Stddev    128          2.31 (   0.00%)        2.19 (   5.05%)
 Stddev    256         11.88 (   0.00%)        3.22 (  72.88%)
 Stddev    1024        23.68 (   0.00%)        7.24 (  69.43%)
 Stddev    2048        79.46 (   0.00%)       71.49 (  10.03%)
 Stddev    3312        26.71 (   0.00%)       57.80 (-116.41%)
 Stddev    4096       185.57 (   0.00%)       96.15 (  48.19%)
 Stddev    8192       245.80 (   0.00%)      100.73 (  59.02%)
 Stddev    16384      207.31 (   0.00%)      141.65 (  31.67%)

In this case, there was a sizable improvement to performance and
a general reduction in variance. However, this is not universal.
For most machines, the impact was roughly a 3% performance gain.

 Ops NUMA base-page range updates       19796.00         292.00
 Ops NUMA PTE updates                   19796.00         292.00
 Ops NUMA PMD updates                       0.00           0.00
 Ops NUMA hint faults                   16113.00         143.00
 Ops NUMA hint local faults %            8407.00         142.00
 Ops NUMA hint local percent               52.18          99.30
 Ops NUMA pages migrated                 4244.00           1.00

Without the patch (first column), only 52.18% of sampled accesses are
local (8407 of 16113 hint faults). In an earlier changelog, 100% of
sampled accesses were local and indeed on most machines, this was still
the case. In this specific case, the local sampled rate was 99.3%, but
note the "base-page range updates" and "PTE updates". The activity with
the patch is negligible, as was the number of faults. The small number
of pages migrated were related to shared libraries. A 2-socket Broadwell
showed better results on average but is not presented for brevity; the
performance was similar except that 100% of the sampled NUMA hints were
local. The patch holds up for a 4-socket Haswell, an AMD EPYC and an
AMD EPYC 2 machine.
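
These figures come from the kernel's NUMA balancing counters. As a
minimal user-space sketch (illustrative, not the mmtests reporting
code), the "hint local percent" can be derived from /proc/vmstat; the
counters are cumulative since boot, so a harness would sample them
before and after a run and diff the two:

#include <stdio.h>
#include <string.h>

/*
 * Derive the local hint-fault percentage from /proc/vmstat.
 * Error handling elided for brevity.
 */
int main(void)
{
	FILE *f = fopen("/proc/vmstat", "r");
	char name[64];
	unsigned long long val, faults = 0, local = 0;

	while (f && fscanf(f, "%63s %llu", name, &val) == 2) {
		if (!strcmp(name, "numa_hint_faults"))
			faults = val;
		else if (!strcmp(name, "numa_hint_faults_local"))
			local = val;
	}
	if (f)
		fclose(f);
	if (faults)
		printf("NUMA hint local percent: %.2f\n",
		       100.0 * local / faults);
	return 0;
}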

For dbench, the impact depends on the filesystem used and the number of
clients. On XFS, there is little difference as the clients typically
communicate with workqueues which have a separate class of scheduler
problem at the moment. For ext4, performance is generally better,
particularly for small numbers of clients as NUMA balancing activity is
negligible with the patch applied.

A more interesting example is the Facebook schbench which uses a
number of messaging threads to communicate with worker threads. In this
configuration, one messaging thread is used per NUMA node and the number of
worker threads is varied. The 50, 75, 90, 95, 99, 99.5 and 99.9 percentiles
for response latency are then reported.

 Lat 50.00th-qrtle-1        44.00 (   0.00%)       37.00 (  15.91%)
 Lat 75.00th-qrtle-1        53.00 (   0.00%)       41.00 (  22.64%)
 Lat 90.00th-qrtle-1        57.00 (   0.00%)       42.00 (  26.32%)
 Lat 95.00th-qrtle-1        63.00 (   0.00%)       43.00 (  31.75%)
 Lat 99.00th-qrtle-1        76.00 (   0.00%)       51.00 (  32.89%)
 Lat 99.50th-qrtle-1        89.00 (   0.00%)       52.00 (  41.57%)
 Lat 99.90th-qrtle-1        98.00 (   0.00%)       55.00 (  43.88%)
 Lat 50.00th-qrtle-2        42.00 (   0.00%)       42.00 (   0.00%)
 Lat 75.00th-qrtle-2        48.00 (   0.00%)       47.00 (   2.08%)
 Lat 90.00th-qrtle-2        53.00 (   0.00%)       52.00 (   1.89%)
 Lat 95.00th-qrtle-2        55.00 (   0.00%)       53.00 (   3.64%)
 Lat 99.00th-qrtle-2        62.00 (   0.00%)       60.00 (   3.23%)
 Lat 99.50th-qrtle-2        63.00 (   0.00%)       63.00 (   0.00%)
 Lat 99.90th-qrtle-2        68.00 (   0.00%)       66.00 (   2.94%)

For higher worker-thread counts, the differences become negligible but
it's interesting to note the difference in wakeup latency at low
utilisation, and mpstat confirms that activity was almost all on one
node until the number of worker threads increased.

Hackbench generally showed neutral results across a range of machines.
This is different to earlier versions of the patch which allowed imbalances
for higher degrees of utilisation. perf bench pipe showed negligible
differences in overall performance as they were very close to the
noise.

An earlier prototype of the patch showed major regressions for NAS C-class
when running with only half of the available CPUs -- 20-30% performance
hits were measured at the time. With this version of the patch, the impact
is negligible with small gains/losses within the noise measured. This is
because the number of threads far exceeds the small imbalance the patch
cares about. Similarly, there were reports of regressions for the autonuma
benchmark against earlier versions but again, normal load balancing now
applies for that workload.

In general, the patch simply seeks to avoid unnecessary cross-node
migrations in the basic case where imbalances are very small.  For low
utilisation communicating workloads, this patch generally behaves better
with less NUMA balancing activity. For high utilisation, there is no
change in behaviour.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Phil Auld <pauld@redhat.com>
Tested-by: Phil Auld <pauld@redhat.com>
Link: https://lkml.kernel.org/r/20200114101319.GO3466@techsingularity.net
---
 kernel/sched/fair.c | 41 +++++++++++++++++++++++++++++------------
 1 file changed, 29 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fe4e0d7..25dffc0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8658,10 +8658,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	/*
 	 * Try to use spare capacity of local group without overloading it or
 	 * emptying busiest.
-	 * XXX Spreading tasks across NUMA nodes is not always the best policy
-	 * and special care should be taken for SD_NUMA domain level before
-	 * spreading the tasks. For now, load_balance() fully relies on
-	 * NUMA_BALANCING and fbq_classify_group/rq to override the decision.
 	 */
 	if (local->group_type == group_has_spare) {
 		if (busiest->group_type > group_fully_busy) {
@@ -8701,16 +8697,37 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 			env->migration_type = migrate_task;
 			lsub_positive(&nr_diff, local->sum_nr_running);
 			env->imbalance = nr_diff >> 1;
-			return;
-		}
+		} else {
 
-		/*
-		 * If there is no overload, we just want to even the number of
-		 * idle cpus.
-		 */
-		env->migration_type = migrate_task;
-		env->imbalance = max_t(long, 0, (local->idle_cpus -
+			/*
+			 * If there is no overload, we just want to even the number of
+			 * idle cpus.
+			 */
+			env->migration_type = migrate_task;
+			env->imbalance = max_t(long, 0, (local->idle_cpus -
 						 busiest->idle_cpus) >> 1);
+		}
+
+		/* Consider allowing a small imbalance between NUMA groups */
+		if (env->sd->flags & SD_NUMA) {
+			unsigned int imbalance_min;
+
+			/*
+			 * Compute an allowed imbalance based on a simple
+			 * pair of communicating tasks that should remain
+			 * local and ignore them.
+			 *
+			 * NOTE: Generally this would have been based on
+			 * the domain size and this was evaluated. However,
+			 * the benefit is similar across a range of workloads
+			 * and machines but scaling by the domain size adds
+			 * the risk that lower domains have to be rebalanced.
+			 */
+			imbalance_min = 2;
+			if (busiest->sum_nr_running <= imbalance_min)
+				env->imbalance = 0;
+		}
+
 		return;
 	}
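
To make the new branch concrete, here is a stand-alone model of the
arithmetic (plain C with the lb_env/sd_lb_stats plumbing replaced by
ints, and only the non-overloaded path modelled; illustrative, not
kernel code):

#include <stdio.h>

static long spare_imbalance(int local_idle, int busiest_idle,
			    int busiest_running, int is_numa)
{
	/* No overload: even out the number of idle CPUs. */
	long imbalance = (local_idle - busiest_idle) / 2;

	if (imbalance < 0)
		imbalance = 0;

	/*
	 * On SD_NUMA domains, ignore the imbalance entirely while the
	 * busiest group runs no more than a communicating pair of
	 * tasks, so wakeup placement is not undone by the balancer.
	 */
	if (is_numa && busiest_running <= 2)
		imbalance = 0;

	return imbalance;
}

int main(void)
{
	/* Two tasks on one node of an otherwise idle 2x24-CPU box. */
	printf("SD_NUMA, 2 tasks:  imbalance=%ld\n",
	       spare_imbalance(24, 22, 2, 1));
	/* The same spread at a non-NUMA level is still balanced. */
	printf("!SD_NUMA, 2 tasks: imbalance=%ld\n",
	       spare_imbalance(24, 22, 2, 0));
	return 0;
}

With a communicating pair on an otherwise idle machine, the SD_NUMA
case leaves the imbalance at 0 so the pair stays on one node, while
lower domains still see an imbalance of 1 and balance as before.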
 

^ permalink raw reply related	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2020-01-29 11:33 UTC | newest]

Thread overview: 24+ messages
2020-01-14 10:13 [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v4 Mel Gorman
2020-01-16 16:35 ` Mel Gorman
2020-01-17 13:08   ` Vincent Guittot
2020-01-17 14:15     ` Mel Gorman
2020-01-17 14:32       ` Phil Auld
2020-01-17 14:23     ` Phil Auld
2020-01-17 14:37   ` Valentin Schneider
2020-01-17 13:16 ` Vincent Guittot
2020-01-17 14:26   ` Mel Gorman
2020-01-17 14:29     ` Vincent Guittot
2020-01-17 15:09 ` Vincent Guittot
2020-01-17 15:11   ` Peter Zijlstra
2020-01-17 15:21 ` Phil Auld
2020-01-17 17:56 ` Srikar Dronamraju
2020-01-17 21:58   ` Mel Gorman
2020-01-20  8:09     ` Srikar Dronamraju
2020-01-20  8:33       ` Mel Gorman
2020-01-20 17:27         ` Srikar Dronamraju
2020-01-20 18:21           ` Mel Gorman
2020-01-21  8:55             ` Srikar Dronamraju
2020-01-21  9:11               ` Mel Gorman
2020-01-21 10:42                 ` Peter Zijlstra
2020-01-21  9:59 ` Srikar Dronamraju
2020-01-29 11:32 ` [tip: sched/core] sched/fair: Allow a small load imbalance between low utilisation SD_NUMA domains tip-bot2 for Mel Gorman
