* [PATCH 0/6] Reduce migrations and conflicts with automatic NUMA balancing v2
@ 2018-02-13 13:37 Mel Gorman
  2018-02-13 13:37 ` [PATCH 1/6] sched/fair: Avoid an unnecessary lookup of current CPU ID during wake_affine Mel Gorman
                   ` (5 more replies)
  0 siblings, 6 replies; 22+ messages in thread
From: Mel Gorman @ 2018-02-13 13:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Mike Galbraith, Matt Fleming, Giovanni Gherdovich,
	LKML, Mel Gorman

This is a combination of the "Reduce migrations due to load imbalance and
process exits" and "Stop wake_affine fighting with automatic NUMA balancing"
series.

There was a request for data on NAS OMP and how it compares with hard-binding
of threads to CPUs; that data is included in the changelog of the relevant
patch. I did not update most of the comparisons as a partial rerun of the
workloads showed similar results.

Changelog since v1
o Replace patch 4 with Peter's version
o Correct comments
o Add data on NAS OMP
o Rebase to 4.16-rc1

The following series was motivated by the observation that more migrations
were occurring than expected. The first two patches are minor cleanups. The
third patch avoids migrations on wakeup when two CPUs are equally loaded,
which is particularly important in the case where the measured load is 0. The
fourth patch avoids cross-node migrations of a parent process when a child
exits. The fifth patch resists new children being woken for the first time
on a remote node, but does not bar it completely as that may lead to
saturation of a single node. The final patch forces automatic NUMA balancing
to back off if it disagrees with wake_affine about where a task should be
running.

 kernel/sched/fair.c | 89 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 82 insertions(+), 7 deletions(-)

-- 
2.15.1

* [PATCH 1/6] sched/fair: Avoid an unnecessary lookup of current CPU ID during wake_affine
  2018-02-13 13:37 [PATCH 0/6] Reduce migrations and conflicts with automatic NUMA balancing v2 Mel Gorman
@ 2018-02-13 13:37 ` Mel Gorman
  2018-02-21 10:27   ` [tip:sched/core] " tip-bot for Mel Gorman
  2018-02-13 13:37 ` [PATCH 2/6] sched/fair: Defer calculation of prev_eff_load in wake_affine until needed Mel Gorman
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2018-02-13 13:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Mike Galbraith, Matt Fleming, Giovanni Gherdovich,
	LKML, Mel Gorman

The only caller of wake_affine() knows the CPU ID. Pass it in instead of
rechecking it.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5eb3ffc9be84..4b3e86eb2222 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5751,9 +5751,8 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
 }
 
 static int wake_affine(struct sched_domain *sd, struct task_struct *p,
-		       int prev_cpu, int sync)
+		       int this_cpu, int prev_cpu, int sync)
 {
-	int this_cpu = smp_processor_id();
 	int target = nr_cpumask_bits;
 
 	if (sched_feat(WA_IDLE))
@@ -6376,7 +6375,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 		if (cpu == prev_cpu)
 			goto pick_cpu;
 
-		new_cpu = wake_affine(affine_sd, p, prev_cpu, sync);
+		new_cpu = wake_affine(affine_sd, p, cpu, prev_cpu, sync);
 	}
 
 	if (sd && !(sd_flag & SD_BALANCE_FORK)) {
-- 
2.15.1

* [PATCH 2/6] sched/fair: Defer calculation of prev_eff_load in wake_affine until needed
  2018-02-13 13:37 [PATCH 0/6] Reduce migrations and conflicts with automatic NUMA balancing v2 Mel Gorman
  2018-02-13 13:37 ` [PATCH 1/6] sched/fair: Avoid an unnecessary lookup of current CPU ID during wake_affine Mel Gorman
@ 2018-02-13 13:37 ` Mel Gorman
  2018-02-21 10:28   ` [tip:sched/core] sched/fair: Defer calculation of 'prev_eff_load' in wake_affine_weight() " tip-bot for Mel Gorman
  2018-02-13 13:37 ` [PATCH 3/6] sched/fair: Do not migrate on wake_affine_weight if weights are equal Mel Gorman
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2018-02-13 13:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Mike Galbraith, Matt Fleming, Giovanni Gherdovich,
	LKML, Mel Gorman

On sync wakeups, the effective load of the previous CPU may not be used, so
defer its calculation until it is needed.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4b3e86eb2222..c1091cb023c4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5724,7 +5724,6 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
 	unsigned long task_load;
 
 	this_eff_load = target_load(this_cpu, sd->wake_idx);
-	prev_eff_load = source_load(prev_cpu, sd->wake_idx);
 
 	if (sync) {
 		unsigned long current_load = task_h_load(current);
@@ -5742,6 +5741,7 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
 		this_eff_load *= 100;
 	this_eff_load *= capacity_of(prev_cpu);
 
+	prev_eff_load = source_load(prev_cpu, sd->wake_idx);
 	prev_eff_load -= task_load;
 	if (sched_feat(WA_BIAS))
 		prev_eff_load *= 100 + (sd->imbalance_pct - 100) / 2;
-- 
2.15.1

* [PATCH 3/6] sched/fair: Do not migrate on wake_affine_weight if weights are equal
  2018-02-13 13:37 [PATCH 0/6] Reduce migrations and conflicts with automatic NUMA balancing v2 Mel Gorman
  2018-02-13 13:37 ` [PATCH 1/6] sched/fair: Avoid an unnecessary lookup of current CPU ID during wake_affine Mel Gorman
  2018-02-13 13:37 ` [PATCH 2/6] sched/fair: Defer calculation of prev_eff_load in wake_affine until needed Mel Gorman
@ 2018-02-13 13:37 ` Mel Gorman
  2018-02-21 10:28   ` [tip:sched/core] sched/fair: Do not migrate on wake_affine_weight() " tip-bot for Mel Gorman
  2018-02-13 13:37 ` [PATCH 4/6] sched/fair: Do not migrate due to a sync wakeup on exit Mel Gorman
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2018-02-13 13:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Mike Galbraith, Matt Fleming, Giovanni Gherdovich,
	LKML, Mel Gorman

wake_affine_weight() will consider migrating a task to, or near, the current
CPU if there is a load imbalance. If the CPUs share LLC then either CPU
is valid as a search-for-idle-sibling target and equally appropriate for
stacking two tasks on one CPU if an idle sibling is unavailable. If they do
not share cache then a cross-node migration potentially impacts locality
so while they are equal from a CPU capacity point of view, they are not
equal in terms of memory locality. In either case, it's more appropriate
to migrate only if there is a difference in their effective load.

This patch modifies wake_affine_weight() to consider migrating a task only
if there is a load imbalance for normal wakeups, but still allows potential
stacking if the loads are equal and it's a sync wakeup.

For the most part, the difference in performance is marginal. For example,
on a 4-socket server running netperf UDP_STREAM on localhost the differences
are as follows

                                     4.15.0                 4.15.0
                                      16rc0          noequal-v1r23
Hmean     send-64         355.47 (   0.00%)      349.50 (  -1.68%)
Hmean     send-128        697.98 (   0.00%)      693.35 (  -0.66%)
Hmean     send-256       1328.02 (   0.00%)     1318.77 (  -0.70%)
Hmean     send-1024      5051.83 (   0.00%)     5051.11 (  -0.01%)
Hmean     send-2048      9637.02 (   0.00%)     9601.34 (  -0.37%)
Hmean     send-3312     14355.37 (   0.00%)    14414.51 (   0.41%)
Hmean     send-4096     16464.97 (   0.00%)    16301.37 (  -0.99%)
Hmean     send-8192     26722.42 (   0.00%)    26428.95 (  -1.10%)
Hmean     send-16384    38137.81 (   0.00%)    38046.11 (  -0.24%)
Hmean     recv-64         355.47 (   0.00%)      349.50 (  -1.68%)
Hmean     recv-128        697.98 (   0.00%)      693.35 (  -0.66%)
Hmean     recv-256       1328.02 (   0.00%)     1318.77 (  -0.70%)
Hmean     recv-1024      5051.83 (   0.00%)     5051.11 (  -0.01%)
Hmean     recv-2048      9636.95 (   0.00%)     9601.30 (  -0.37%)
Hmean     recv-3312     14355.32 (   0.00%)    14414.48 (   0.41%)
Hmean     recv-4096     16464.74 (   0.00%)    16301.16 (  -0.99%)
Hmean     recv-8192     26721.63 (   0.00%)    26428.17 (  -1.10%)
Hmean     recv-16384    38136.00 (   0.00%)    38044.88 (  -0.24%)
Stddev    send-64           7.30 (   0.00%)        4.75 (  34.96%)
Stddev    send-128         15.15 (   0.00%)       22.38 ( -47.66%)
Stddev    send-256         13.99 (   0.00%)       19.14 ( -36.81%)
Stddev    send-1024       105.73 (   0.00%)       67.38 (  36.27%)
Stddev    send-2048       294.57 (   0.00%)      223.88 (  24.00%)
Stddev    send-3312       302.28 (   0.00%)      271.74 (  10.10%)
Stddev    send-4096       195.92 (   0.00%)      121.10 (  38.19%)
Stddev    send-8192       399.71 (   0.00%)      563.77 ( -41.04%)
Stddev    send-16384     1163.47 (   0.00%)     1103.68 (   5.14%)
Stddev    recv-64           7.30 (   0.00%)        4.75 (  34.96%)
Stddev    recv-128         15.15 (   0.00%)       22.38 ( -47.66%)
Stddev    recv-256         13.99 (   0.00%)       19.14 ( -36.81%)
Stddev    recv-1024       105.73 (   0.00%)       67.38 (  36.27%)
Stddev    recv-2048       294.59 (   0.00%)      223.89 (  24.00%)
Stddev    recv-3312       302.24 (   0.00%)      271.75 (  10.09%)
Stddev    recv-4096       196.03 (   0.00%)      121.14 (  38.20%)
Stddev    recv-8192       399.86 (   0.00%)      563.65 ( -40.96%)
Stddev    recv-16384     1163.79 (   0.00%)     1103.86 (   5.15%)

The difference in overall performance is marginal but note that most
measurements are less variable. There were similar observations for other
netperf comparisons. hackbench, with sockets or pipes and with processes
or threads, showed minor differences with some reduction in migrations.
tbench showed only marginal differences that were within the noise. dbench,
regardless of filesystem, showed minor differences, all of which were
within the noise. Multiple machines, both UMA and NUMA, were tested without
any regressions showing up.

The biggest risk with a patch like this is affecting wakeup latencies.
However, the schbench load from Facebook, which is very sensitive to wakeup
latency, showed mixed results with mostly improvements in wakeup latency

                                     4.15.0                 4.15.0
                                      16rc0          noequal-v1r23
Lat 50.00th-qrtle-1        38.00 (   0.00%)       38.00 (   0.00%)
Lat 75.00th-qrtle-1        49.00 (   0.00%)       41.00 (  16.33%)
Lat 90.00th-qrtle-1        52.00 (   0.00%)       50.00 (   3.85%)
Lat 95.00th-qrtle-1        54.00 (   0.00%)       51.00 (   5.56%)
Lat 99.00th-qrtle-1        63.00 (   0.00%)       60.00 (   4.76%)
Lat 99.50th-qrtle-1        66.00 (   0.00%)       61.00 (   7.58%)
Lat 99.90th-qrtle-1        78.00 (   0.00%)       65.00 (  16.67%)
Lat 50.00th-qrtle-2        38.00 (   0.00%)       38.00 (   0.00%)
Lat 75.00th-qrtle-2        42.00 (   0.00%)       43.00 (  -2.38%)
Lat 90.00th-qrtle-2        46.00 (   0.00%)       48.00 (  -4.35%)
Lat 95.00th-qrtle-2        49.00 (   0.00%)       50.00 (  -2.04%)
Lat 99.00th-qrtle-2        55.00 (   0.00%)       57.00 (  -3.64%)
Lat 99.50th-qrtle-2        58.00 (   0.00%)       60.00 (  -3.45%)
Lat 99.90th-qrtle-2        65.00 (   0.00%)       68.00 (  -4.62%)
Lat 50.00th-qrtle-4        41.00 (   0.00%)       41.00 (   0.00%)
Lat 75.00th-qrtle-4        45.00 (   0.00%)       46.00 (  -2.22%)
Lat 90.00th-qrtle-4        50.00 (   0.00%)       50.00 (   0.00%)
Lat 95.00th-qrtle-4        54.00 (   0.00%)       53.00 (   1.85%)
Lat 99.00th-qrtle-4        61.00 (   0.00%)       61.00 (   0.00%)
Lat 99.50th-qrtle-4        65.00 (   0.00%)       64.00 (   1.54%)
Lat 99.90th-qrtle-4        76.00 (   0.00%)       82.00 (  -7.89%)
Lat 50.00th-qrtle-8        48.00 (   0.00%)       46.00 (   4.17%)
Lat 75.00th-qrtle-8        55.00 (   0.00%)       54.00 (   1.82%)
Lat 90.00th-qrtle-8        60.00 (   0.00%)       59.00 (   1.67%)
Lat 95.00th-qrtle-8        63.00 (   0.00%)       63.00 (   0.00%)
Lat 99.00th-qrtle-8        71.00 (   0.00%)       69.00 (   2.82%)
Lat 99.50th-qrtle-8        74.00 (   0.00%)       73.00 (   1.35%)
Lat 99.90th-qrtle-8        98.00 (   0.00%)       90.00 (   8.16%)
Lat 50.00th-qrtle-16       56.00 (   0.00%)       55.00 (   1.79%)
Lat 75.00th-qrtle-16       68.00 (   0.00%)       67.00 (   1.47%)
Lat 90.00th-qrtle-16       77.00 (   0.00%)       78.00 (  -1.30%)
Lat 95.00th-qrtle-16       82.00 (   0.00%)       84.00 (  -2.44%)
Lat 99.00th-qrtle-16       90.00 (   0.00%)       93.00 (  -3.33%)
Lat 99.50th-qrtle-16       93.00 (   0.00%)       97.00 (  -4.30%)
Lat 99.90th-qrtle-16      110.00 (   0.00%)      110.00 (   0.00%)
Lat 50.00th-qrtle-32       68.00 (   0.00%)       62.00 (   8.82%)
Lat 75.00th-qrtle-32       90.00 (   0.00%)       83.00 (   7.78%)
Lat 90.00th-qrtle-32      110.00 (   0.00%)      100.00 (   9.09%)
Lat 95.00th-qrtle-32      122.00 (   0.00%)      111.00 (   9.02%)
Lat 99.00th-qrtle-32      145.00 (   0.00%)      133.00 (   8.28%)
Lat 99.50th-qrtle-32      154.00 (   0.00%)      143.00 (   7.14%)
Lat 99.90th-qrtle-32     2316.00 (   0.00%)      515.00 (  77.76%)
Lat 50.00th-qrtle-35       69.00 (   0.00%)       72.00 (  -4.35%)
Lat 75.00th-qrtle-35       92.00 (   0.00%)       95.00 (  -3.26%)
Lat 90.00th-qrtle-35      111.00 (   0.00%)      114.00 (  -2.70%)
Lat 95.00th-qrtle-35      122.00 (   0.00%)      124.00 (  -1.64%)
Lat 99.00th-qrtle-35      142.00 (   0.00%)      144.00 (  -1.41%)
Lat 99.50th-qrtle-35      150.00 (   0.00%)      154.00 (  -2.67%)
Lat 99.90th-qrtle-35     6104.00 (   0.00%)     5640.00 (   7.60%)

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c1091cb023c4..28c8d9c91955 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5747,7 +5747,16 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
 		prev_eff_load *= 100 + (sd->imbalance_pct - 100) / 2;
 	prev_eff_load *= capacity_of(this_cpu);
 
-	return this_eff_load <= prev_eff_load ? this_cpu : nr_cpumask_bits;
+	/*
+	 * If sync, adjust the weight of prev_eff_load such that if
+	 * prev_eff == this_eff that select_idle_sibling will consider
+	 * stacking the wakee on top of the waker if no other CPU is
+	 * idle.
+	 */
+	if (sync)
+		prev_eff_load += 1;
+
+	return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
 }
 
 static int wake_affine(struct sched_domain *sd, struct task_struct *p,
-- 
2.15.1

* [PATCH 4/6] sched/fair: Do not migrate due to a sync wakeup on exit
  2018-02-13 13:37 [PATCH 0/6] Reduce migrations and conflicts with automatic NUMA balancing v2 Mel Gorman
                   ` (2 preceding siblings ...)
  2018-02-13 13:37 ` [PATCH 3/6] sched/fair: Do not migrate on wake_affine_weight if weights are equal Mel Gorman
@ 2018-02-13 13:37 ` Mel Gorman
  2018-02-21 10:29   ` [tip:sched/core] " tip-bot for Peter Zijlstra
  2018-02-13 13:37 ` [PATCH 5/6] sched/fair: Consider SD_NUMA when selecting the most idle group to schedule on Mel Gorman
  2018-02-13 13:37 ` [PATCH 6/6] sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine Mel Gorman
  5 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2018-02-13 13:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Mike Galbraith, Matt Fleming, Giovanni Gherdovich,
	LKML, Mel Gorman

From: Peter Zijlstra <peterz@infradead.org>

When a task exits, it notifies its parent that it has exited. This is a
sync wakeup and the exiting task may pull the parent towards the waker's
CPU. For simple workloads like using a shell, it was observed that the
shell is pulled across nodes by exiting processes. This is daft as the
parent may be long-lived and properly placed. This patch special-cases a
sync wakeup on exit to avoid pulling tasks across nodes. Testing on a range
of workloads and machines showed very little difference in performance,
although there was a small 3% boost on some machines running a
shell-script-intensive workload (the git regression test suite).

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 28c8d9c91955..e5bbcbefd01b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6350,7 +6350,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 	int cpu = smp_processor_id();
 	int new_cpu = prev_cpu;
 	int want_affine = 0;
-	int sync = wake_flags & WF_SYNC;
+	int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
 
 	if (sd_flag & SD_BALANCE_WAKE) {
 		record_wakee(p);
-- 
2.15.1

* [PATCH 5/6] sched/fair: Consider SD_NUMA when selecting the most idle group to schedule on
  2018-02-13 13:37 [PATCH 0/6] Reduce migrations and conflicts with automatic NUMA balancing v2 Mel Gorman
                   ` (3 preceding siblings ...)
  2018-02-13 13:37 ` [PATCH 4/6] sched/fair: Do not migrate due to a sync wakeup on exit Mel Gorman
@ 2018-02-13 13:37 ` Mel Gorman
  2018-02-21 10:29   ` [tip:sched/core] " tip-bot for Mel Gorman
  2018-02-13 13:37 ` [PATCH 6/6] sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine Mel Gorman
  5 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2018-02-13 13:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Mike Galbraith, Matt Fleming, Giovanni Gherdovich,
	LKML, Mel Gorman

find_idlest_group() compares the local group with each of the other groups
to select the one that is most idle. When comparing groups in different NUMA
domains, a very slight imbalance is enough to select a remote NUMA node even
if the runnable load on both groups is 0 or close to 0. This ignores the
cost of remote accesses entirely and is a problem when selecting the CPU for
a newly forked task to run on: a forking server is almost guaranteed to run
on a remote node, incurring numerous remote accesses and potentially causing
automatic NUMA balancing to try to migrate the task back or migrate the data
to another node. Similar weirdness is observed if a basic shell command
pipes output to another, as each process in the pipeline is likely to start
on a different node and then get adjusted later by wake_affine.

This patch adds imbalance to the remote group's load when considering
whether to select CPUs from remote domains. If the local domain is selected,
imbalance will still be used to try to select a CPU from a lower scheduler
domain's group instead of stacking tasks on the same CPU.

A variety of workloads and machines were tested and as expected, there is no
difference on UMA. The difference on NUMA can be dramatic. This is a comparison
of elapsed times running the git regression test suite. It's fork-intensive with
short-lived processes

                                 4.15.0                 4.15.0
                           noexit-v1r23           sdnuma-v1r23
Elapsed min          1706.06 (   0.00%)     1435.94 (  15.83%)
Elapsed mean         1709.53 (   0.00%)     1436.98 (  15.94%)
Elapsed stddev          2.16 (   0.00%)        1.01 (  53.38%)
Elapsed coeffvar        0.13 (   0.00%)        0.07 (  44.54%)
Elapsed max          1711.59 (   0.00%)     1438.01 (  15.98%)

              4.15.0      4.15.0
        noexit-v1r23 sdnuma-v1r23
User         5434.12     5188.41
System       4878.77     3467.09
Elapsed     10259.06     8624.21

That shows a considerable reduction in elapsed times. It's important to
note that automatic NUMA balancing does not affect this load as processes
are too short-lived.

There is also a noticeable impact on hackbench, such as this example using
processes and pipes

hackbench-process-pipes
                              4.15.0                 4.15.0
                        noexit-v1r23           sdnuma-v1r23
Amean     1        1.0973 (   0.00%)      0.9393 (  14.40%)
Amean     4        1.3427 (   0.00%)      1.3730 (  -2.26%)
Amean     7        1.4233 (   0.00%)      1.6670 ( -17.12%)
Amean     12       3.0250 (   0.00%)      3.3013 (  -9.13%)
Amean     21       9.0860 (   0.00%)      9.5343 (  -4.93%)
Amean     30      14.6547 (   0.00%)     13.2433 (   9.63%)
Amean     48      22.5447 (   0.00%)     20.4303 (   9.38%)
Amean     79      29.2010 (   0.00%)     26.7853 (   8.27%)
Amean     110     36.7443 (   0.00%)     35.8453 (   2.45%)
Amean     141     45.8533 (   0.00%)     42.6223 (   7.05%)
Amean     172     55.1317 (   0.00%)     50.6473 (   8.13%)
Amean     203     64.4420 (   0.00%)     58.3957 (   9.38%)
Amean     234     73.2293 (   0.00%)     67.1047 (   8.36%)
Amean     265     80.5220 (   0.00%)     75.7330 (   5.95%)
Amean     296     88.7567 (   0.00%)     82.1533 (   7.44%)

It's not a universal win as there are occasions when spreading wide and
quickly is a benefit, but it's more of a win than a loss. For other
workloads there is little difference, but netperf is interesting. Without
the patch, the server and client start on different nodes but quickly get
migrated due to wake_affine. Hence, the difference in overall performance
is marginal but detectable

                                     4.15.0                 4.15.0
                               noexit-v1r23           sdnuma-v1r23
Hmean     send-64         349.09 (   0.00%)      354.67 (   1.60%)
Hmean     send-128        699.16 (   0.00%)      702.91 (   0.54%)
Hmean     send-256       1316.34 (   0.00%)     1350.07 (   2.56%)
Hmean     send-1024      5063.99 (   0.00%)     5124.38 (   1.19%)
Hmean     send-2048      9705.19 (   0.00%)     9687.44 (  -0.18%)
Hmean     send-3312     14359.48 (   0.00%)    14577.64 (   1.52%)
Hmean     send-4096     16324.20 (   0.00%)    16393.62 (   0.43%)
Hmean     send-8192     26112.61 (   0.00%)    26877.26 (   2.93%)
Hmean     send-16384    37208.44 (   0.00%)    38683.43 (   3.96%)
Hmean     recv-64         349.09 (   0.00%)      354.67 (   1.60%)
Hmean     recv-128        699.16 (   0.00%)      702.91 (   0.54%)
Hmean     recv-256       1316.34 (   0.00%)     1350.07 (   2.56%)
Hmean     recv-1024      5063.99 (   0.00%)     5124.38 (   1.19%)
Hmean     recv-2048      9705.16 (   0.00%)     9687.43 (  -0.18%)
Hmean     recv-3312     14359.42 (   0.00%)    14577.59 (   1.52%)
Hmean     recv-4096     16323.98 (   0.00%)    16393.55 (   0.43%)
Hmean     recv-8192     26111.85 (   0.00%)    26876.96 (   2.93%)
Hmean     recv-16384    37206.99 (   0.00%)    38682.41 (   3.97%)

However, what is very interesting is how automatic NUMA balancing behaves.
Each netperf instance runs long enough for balancing to activate.

NUMA base PTE updates             4620        1473
NUMA huge PMD updates                0           0
NUMA page range updates           4620        1473
NUMA hint faults                  4301        1383
NUMA hint local faults            1309         451
NUMA hint local percent             30          32
NUMA pages migrated               1335         491
AutoNUMA cost                      21%          6%

There is an unfortunate number of remote faults although tracing indicated
that the vast majority are in shared libraries. However, the tendency to
start tasks on the same node if there is capacity means that there were
far fewer PTE updates and faults incurred overall.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e5bbcbefd01b..b1efd7570c88 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5911,6 +5911,18 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 	if (!idlest)
 		return NULL;
 
+	/*
+	 * When comparing groups across NUMA domains, it's possible for the
+	 * local domain to be very lightly loaded relative to the remote
+	 * domains but "imbalance" skews the comparison making remote CPUs
+	 * look much more favourable. When considering cross-domain, add
+	 * imbalance to the runnable load on the remote node and consider
+	 * staying local.
+	 */
+	if ((sd->flags & SD_NUMA) &&
+	    min_runnable_load + imbalance >= this_runnable_load)
+		return NULL;
+
 	if (min_runnable_load > (this_runnable_load + imbalance))
 		return NULL;
 
-- 
2.15.1

* [PATCH 6/6] sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine
  2018-02-13 13:37 [PATCH 0/6] Reduce migrations and conflicts with automatic NUMA balancing v2 Mel Gorman
                   ` (4 preceding siblings ...)
  2018-02-13 13:37 ` [PATCH 5/6] sched/fair: Consider SD_NUMA when selecting the most idle group to schedule on Mel Gorman
@ 2018-02-13 13:37 ` Mel Gorman
  2018-02-13 14:01   ` Peter Zijlstra
  2018-02-21 10:30   ` [tip:sched/core] sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine() tip-bot for Mel Gorman
  5 siblings, 2 replies; 22+ messages in thread
From: Mel Gorman @ 2018-02-13 13:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Mike Galbraith, Matt Fleming, Giovanni Gherdovich,
	LKML, Mel Gorman

If wake_affine() pulls a task to another node for any reason and that node
is no longer the preferred node, temporarily stop automatic NUMA balancing
from pulling the task back. Otherwise, tasks with a strong waker/wakee
relationship may constantly fight automatic NUMA balancing over where the
task should be placed.

Once again, netperf is instructive here. The performance barely changes,
but the automatic NUMA balancing activity is interesting.

Hmean     send-64         354.67 (   0.00%)      352.15 (  -0.71%)
Hmean     send-128        702.91 (   0.00%)      693.84 (  -1.29%)
Hmean     send-256       1350.07 (   0.00%)     1344.19 (  -0.44%)
Hmean     send-1024      5124.38 (   0.00%)     4941.24 (  -3.57%)
Hmean     send-2048      9687.44 (   0.00%)     9624.45 (  -0.65%)
Hmean     send-3312     14577.64 (   0.00%)    14514.35 (  -0.43%)
Hmean     send-4096     16393.62 (   0.00%)    16488.30 (   0.58%)
Hmean     send-8192     26877.26 (   0.00%)    26431.63 (  -1.66%)
Hmean     send-16384    38683.43 (   0.00%)    38264.91 (  -1.08%)
Hmean     recv-64         354.67 (   0.00%)      352.15 (  -0.71%)
Hmean     recv-128        702.91 (   0.00%)      693.84 (  -1.29%)
Hmean     recv-256       1350.07 (   0.00%)     1344.19 (  -0.44%)
Hmean     recv-1024      5124.38 (   0.00%)     4941.24 (  -3.57%)
Hmean     recv-2048      9687.43 (   0.00%)     9624.45 (  -0.65%)
Hmean     recv-3312     14577.59 (   0.00%)    14514.35 (  -0.43%)
Hmean     recv-4096     16393.55 (   0.00%)    16488.20 (   0.58%)
Hmean     recv-8192     26876.96 (   0.00%)    26431.29 (  -1.66%)
Hmean     recv-16384    38682.41 (   0.00%)    38263.94 (  -1.08%)

NUMA alloc hit                 1465986     1423090
NUMA alloc miss                      0           0
NUMA interleave hit                  0           0
NUMA alloc local               1465897     1423003
NUMA base PTE updates             1473        1420
NUMA huge PMD updates                0           0
NUMA page range updates           1473        1420
NUMA hint faults                  1383        1312
NUMA hint local faults             451         124
NUMA hint local percent             32           9

There is a slight degradation in performance but there are slightly fewer
NUMA faults. There is a large drop in the percentage of local faults, but
the bulk of migrations for netperf are in small shared libraries, so it's
reflecting the fact that automatic NUMA balancing has backed off. This is
a case where, despite wake_affine and automatic NUMA balancing fighting
for placement, there is a marginal benefit to rescheduling to local
data quickly. However, it should be noted that wake_affine and automatic
NUMA balancing constantly fighting each other is undesirable.

However, the benefit in other cases is large. This is the result for NAS
with the D class sizing on a 4-socket machine

nas-mpi
                          4.15.0                 4.15.0
                    sdnuma-v1r23       delayretry-v1r23
Time cg.D      557.00 (   0.00%)      431.82 (  22.47%)
Time ep.D       77.83 (   0.00%)       79.01 (  -1.52%)
Time is.D       26.46 (   0.00%)       26.64 (  -0.68%)
Time lu.D      727.14 (   0.00%)      597.94 (  17.77%)
Time mg.D      191.35 (   0.00%)      146.85 (  23.26%)

              4.15.0      4.15.0
        sdnuma-v1r23  delayretry-v1r23
User        75665.20    70413.30
System      20321.59     8861.67
Elapsed       766.13      634.92

Minor Faults                  16528502     7127941
Major Faults                      4553        5068
NUMA alloc local               6963197     6749135
NUMA base PTE updates        366409093   107491434
NUMA huge PMD updates           687556      198880
NUMA page range updates      718437765   209317994
NUMA hint faults              13643410     4601187
NUMA hint local faults         9212593     3063996
NUMA hint local percent             67          66

Note the massive reduction in system CPU usage even though the percentage
of local faults is barely affected. There is a massive reduction in the
number of PTE updates showing that automatic NUMA balancing has backed off.
A critical observation is also that there is a massive reduction in minor
faults which is due to far fewer NUMA hinting faults being trapped.

There were questions about NAS OMP and how it behaves relative to threads
being bound to CPUs. First, there are more gains than losses with this
patch applied, along with a reduction in system CPU usage.

nas-omp
                      4.16.0-rc1             4.16.0-rc1
                     sdnuma-v2r1        delayretry-v2r1
Time bt.D      436.71 (   0.00%)      430.05 (   1.53%)
Time cg.D      201.02 (   0.00%)      180.87 (  10.02%)
Time ep.D       32.84 (   0.00%)       32.68 (   0.49%)
Time is.D        9.63 (   0.00%)        9.64 (  -0.10%)
Time lu.D      331.20 (   0.00%)      304.80 (   7.97%)
Time mg.D       54.87 (   0.00%)       52.72 (   3.92%)
Time sp.D     1108.78 (   0.00%)      917.10 (  17.29%)
Time ua.D      378.81 (   0.00%)      398.83 (  -5.28%)

          4.16.0-rc1  4.16.0-rc1
         sdnuma-v2r1  delayretry-v2r1
User       305633.08   296751.91
System        451.75      357.80
Elapsed      2595.73     2368.13

However, it does not close the gap between binding and being unbound. There
is a negligible difference between the performance of the baseline and a
patched kernel when threads are bound, so those results are not presented
here.

                      4.16.0-rc1             4.16.0-rc1
                 delayretry-bind     delayretry-unbound
Time bt.D      385.02 (   0.00%)      430.05 ( -11.70%)
Time cg.D      144.02 (   0.00%)      180.87 ( -25.59%)
Time ep.D       32.85 (   0.00%)       32.68 (   0.52%)
Time is.D       10.52 (   0.00%)        9.64 (   8.37%)
Time lu.D      285.31 (   0.00%)      304.80 (  -6.83%)
Time mg.D       43.21 (   0.00%)       52.72 ( -22.01%)
Time sp.D      820.24 (   0.00%)      917.10 ( -11.81%)
Time ua.D      337.09 (   0.00%)      398.83 ( -18.32%)

          4.16.0-rc1  4.16.0-rc1
        delayretry-bind delayretry-unbound
User       277731.25   296751.91
System        261.29      357.80
Elapsed      2100.55     2368.13

Unfortunately, while performance is improved by the patch, there is still
quite a long way to go before it's equivalent to hard binding.

Other workloads like hackbench, tbench, dbench and schbench are barely
affected. dbench shows a mix of gains and losses depending on the machine
although in general, the results are more stable.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 56 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b1efd7570c88..b8778924cef5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1869,6 +1869,7 @@ static int task_numa_migrate(struct task_struct *p)
 static void numa_migrate_preferred(struct task_struct *p)
 {
 	unsigned long interval = HZ;
+	unsigned long numa_migrate_retry;
 
 	/* This task has no NUMA fault statistics yet */
 	if (unlikely(p->numa_preferred_nid == -1 || !p->numa_faults))
@@ -1876,7 +1877,18 @@ static void numa_migrate_preferred(struct task_struct *p)
 
 	/* Periodically retry migrating the task to the preferred node */
 	interval = min(interval, msecs_to_jiffies(p->numa_scan_period) / 16);
-	p->numa_migrate_retry = jiffies + interval;
+	numa_migrate_retry = jiffies + interval;
+
+	/*
+	 * Check that the new retry threshold is after the current one. If
+	 * the retry is in the future, it implies that wake_affine has
+	 * temporarily asked NUMA balancing to back off from placement.
+	 */
+	if (numa_migrate_retry > p->numa_migrate_retry)
+		return;
+
+	/* Safe to try placing the task on the preferred node */
+	p->numa_migrate_retry = numa_migrate_retry;
 
 	/* Success if task is already running on preferred CPU */
 	if (task_node(p) == p->numa_preferred_nid)
@@ -5759,6 +5771,48 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
 	return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+static void
+update_wa_numa_placement(struct task_struct *p, int prev_cpu, int target)
+{
+	unsigned long interval;
+
+	if (!static_branch_likely(&sched_numa_balancing))
+		return;
+
+	/* If balancing has no preference then continue gathering data */
+	if (p->numa_preferred_nid == -1)
+		return;
+
+	/*
+	 * If the wakeup is not affecting locality then it is neutral from
+	 * the perspective of NUMA balancing so continue gathering data.
+	 */
+	if (cpus_share_cache(prev_cpu, target))
+		return;
+
+	/*
+	 * Temporarily prevent NUMA balancing trying to place waker/wakee after
+	 * wakee has been moved by wake_affine. This will potentially allow
+	 * related tasks to converge and update their data placement. The
+	 * 4 * numa_scan_period is to allow the two-pass filter to migrate
+	 * hot data to the waker's node.
+	 */
+	interval = max(sysctl_numa_balancing_scan_delay,
+			 p->numa_scan_period << 2);
+	p->numa_migrate_retry = jiffies + msecs_to_jiffies(interval);
+
+	interval = max(sysctl_numa_balancing_scan_delay,
+			 current->numa_scan_period << 2);
+	current->numa_migrate_retry = jiffies + msecs_to_jiffies(interval);
+}
+#else
+static void
+update_wa_numa_placement(struct task_struct *p, int prev_cpu, int target)
+{
+}
+#endif
+
 static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 		       int this_cpu, int prev_cpu, int sync)
 {
@@ -5774,6 +5828,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 	if (target == nr_cpumask_bits)
 		return prev_cpu;
 
+	update_wa_numa_placement(p, prev_cpu, target);
 	schedstat_inc(sd->ttwu_move_affine);
 	schedstat_inc(p->se.statistics.nr_wakeups_affine);
 	return target;
-- 
2.15.1

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 6/6] sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine
  2018-02-13 13:37 ` [PATCH 6/6] sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine Mel Gorman
@ 2018-02-13 14:01   ` Peter Zijlstra
  2018-02-13 14:18     ` Mel Gorman
  2018-02-21 10:30   ` [tip:sched/core] sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine() tip-bot for Mel Gorman
  1 sibling, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2018-02-13 14:01 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Mike Galbraith, Matt Fleming, Giovanni Gherdovich, LKML

On Tue, Feb 13, 2018 at 01:37:30PM +0000, Mel Gorman wrote:
> +static void
> +update_wa_numa_placement(struct task_struct *p, int prev_cpu, int target)
> +{
> +	unsigned long interval;
> +
> +	if (!static_branch_likely(&sched_numa_balancing))
> +		return;
> +
> +	/* If balancing has no preference then continue gathering data */
> +	if (p->numa_preferred_nid == -1)
> +		return;
> +
> +	/*
> +	 * If the wakeup is not affecting locality then it is neutral from
> +	 * the perspective of NUMA balancing so continue gathering data.
> +	 */
> +	if (cpus_share_cache(prev_cpu, target))
> +		return;

Dang, I wanted to mention this before, but it slipped my mind. The
comment and code don't match.

Did you want to write:

	if (cpu_to_node(prev_cpu) == cpu_to_node(target))
		return;

> +	/*
> +	 * Temporarily prevent NUMA balancing trying to place waker/wakee after
> +	 * wakee has been moved by wake_affine. This will potentially allow
> +	 * related tasks to converge and update their data placement. The
> +	 * 4 * numa_scan_period is to allow the two-pass filter to migrate
> +	 * hot data to the waker's node.
> +	 */
> +	interval = max(sysctl_numa_balancing_scan_delay,
> +			 p->numa_scan_period << 2);
> +	p->numa_migrate_retry = jiffies + msecs_to_jiffies(interval);
> +
> +	interval = max(sysctl_numa_balancing_scan_delay,
> +			 current->numa_scan_period << 2);
> +	current->numa_migrate_retry = jiffies + msecs_to_jiffies(interval);
> +}

* Re: [PATCH 6/6] sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine
  2018-02-13 14:01   ` Peter Zijlstra
@ 2018-02-13 14:18     ` Mel Gorman
  2018-02-13 14:43       ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2018-02-13 14:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Mike Galbraith, Matt Fleming, Giovanni Gherdovich, LKML

On Tue, Feb 13, 2018 at 03:01:37PM +0100, Peter Zijlstra wrote:
> On Tue, Feb 13, 2018 at 01:37:30PM +0000, Mel Gorman wrote:
> > +static void
> > +update_wa_numa_placement(struct task_struct *p, int prev_cpu, int target)
> > +{
> > +	unsigned long interval;
> > +
> > +	if (!static_branch_likely(&sched_numa_balancing))
> > +		return;
> > +
> > +	/* If balancing has no preference then continue gathering data */
> > +	if (p->numa_preferred_nid == -1)
> > +		return;
> > +
> > +	/*
> > +	 * If the wakeup is not affecting locality then it is neutral from
> > +	 * the perspective of NUMA balancing so continue gathering data.
> > +	 */
> > +	if (cpus_share_cache(prev_cpu, target))
> > +		return;
> 
> Dang, I wanted to mention this before, but it slipped my mind. The
> comment and code don't match.
> 
> Did you want to write:
> 
> 	if (cpu_to_node(prev_cpu) == cpu_to_node(target))
> 		return;
> 

Well, it was deliberate. While it's possible to be on the same memory
node and not sharing cache, the scheduler typically is more concerned with
the LLC than NUMA per-se. If they share LLC, then I also assume that they
share memory locality.

-- 
Mel Gorman
SUSE Labs

* Re: [PATCH 6/6] sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine
  2018-02-13 14:18     ` Mel Gorman
@ 2018-02-13 14:43       ` Peter Zijlstra
  2018-02-13 15:00         ` Mel Gorman
  0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2018-02-13 14:43 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Mike Galbraith, Matt Fleming, Giovanni Gherdovich, LKML

On Tue, Feb 13, 2018 at 02:18:12PM +0000, Mel Gorman wrote:
> On Tue, Feb 13, 2018 at 03:01:37PM +0100, Peter Zijlstra wrote:
> > On Tue, Feb 13, 2018 at 01:37:30PM +0000, Mel Gorman wrote:
> > > +static void
> > > +update_wa_numa_placement(struct task_struct *p, int prev_cpu, int target)
> > > +{
> > > +	unsigned long interval;
> > > +
> > > +	if (!static_branch_likely(&sched_numa_balancing))
> > > +		return;
> > > +
> > > +	/* If balancing has no preference then continue gathering data */
> > > +	if (p->numa_preferred_nid == -1)
> > > +		return;
> > > +
> > > +	/*
> > > +	 * If the wakeup is not affecting locality then it is neutral from
> > > +	 * the perspective of NUMA balancing so continue gathering data.
> > > +	 */
> > > +	if (cpus_share_cache(prev_cpu, target))
> > > +		return;
> > 
> > Dang, I wanted to mention this before, but it slipped my mind. The
> > comment and code don't match.
> > 
> > Did you want to write:
> > 
> > 	if (cpu_to_node(prev_cpu) == cpu_to_node(target))
> > 		return;
> > 
> 
> Well, it was deliberate. While it's possible to be on the same memory
> node and not sharing cache, the scheduler typically is more concerned with
> the LLC than NUMA per-se. If they share LLC, then I also assume that they
> share memory locality.

True, but the remaining code only has effect for numa balance, which is
concerned with nodes. So I don't see the point of using something
potentially smaller.

Suppose someone did hardware where a node has 2 cache clusters, then
we'd still set a wake_affine back-off for numa-balance, even though it
remains on the same node.

How would that be useful?

* Re: [PATCH 6/6] sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine
  2018-02-13 14:43       ` Peter Zijlstra
@ 2018-02-13 15:00         ` Mel Gorman
  2018-02-13 15:10           ` Peter Zijlstra
  0 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2018-02-13 15:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Mike Galbraith, Matt Fleming, Giovanni Gherdovich, LKML

On Tue, Feb 13, 2018 at 03:43:26PM +0100, Peter Zijlstra wrote:
> > 
> > Well, it was deliberate. While it's possible to be on the same memory
> > node and not sharing cache, the scheduler typically is more concerned with
> > the LLC than NUMA per-se. If they share LLC, then I also assume that they
> > share memory locality.
> 
> True, but the remaining code only has effect for numa balance, which is
> concerned with nodes. So I don't see the point of using something
> potentially smaller.
> 
> Suppose someone did hardware where a node has 2 cache clusters, then
> we'd still set a wake_affine back-off for numa-balance, even though it
> remains on the same node.
> 
> How would that be useful?

Fair point, it could be unexpected from a NUMA balancing perspective and
sub-numa clustering does exist so it's a potential issue. I'm happy to
change it to cpu_to_node. I can resend the series if you prefer but feel
free to change it in-place if you're picking it up. I do not expect any
change on the machines I tested with as for all of them LLC was equivalent
to checking the node ID.

-- 
Mel Gorman
SUSE Labs

* Re: [PATCH 6/6] sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine
  2018-02-13 15:00         ` Mel Gorman
@ 2018-02-13 15:10           ` Peter Zijlstra
  0 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2018-02-13 15:10 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Mike Galbraith, Matt Fleming, Giovanni Gherdovich, LKML

On Tue, Feb 13, 2018 at 03:00:20PM +0000, Mel Gorman wrote:
> On Tue, Feb 13, 2018 at 03:43:26PM +0100, Peter Zijlstra wrote:
> > > 
> > > Well, it was deliberate. While it's possible to be on the same memory
> > > node and not sharing cache, the scheduler typically is more concerned with
> > > the LLC than NUMA per-se. If they share LLC, then I also assume that they
> > > share memory locality.
> > 
> > True, but the remaining code only has effect for numa balance, which is
> > concerned with nodes. So I don't see the point of using something
> > potentially smaller.
> > 
> > Suppose someone did hardware where a node has 2 cache clusters, then
> > we'd still set a wake_affine back-off for numa-balance, even though it
> > remains on the same node.
> > 
> > How would that be useful?
> 
> Fair point, it could be unexpected from a NUMA balancing perspective and
> sub-numa clustering does exist so it's a potential issue. I'm happy to
> change it to cpu_to_node. I can resend the series if you prefer but feel
> free to change it in-place if you're picking it up. I do not expect any
> change on the machines I tested with as for all of them LLC was equivalent
> to checking the node ID.

OK, changed it locally. Thanks!

* [tip:sched/core] sched/fair: Avoid an unnecessary lookup of current CPU ID during wake_affine
  2018-02-13 13:37 ` [PATCH 1/6] sched/fair: Avoid an unnecessary lookup of current CPU ID during wake_affine Mel Gorman
@ 2018-02-21 10:27   ` tip-bot for Mel Gorman
  0 siblings, 0 replies; 22+ messages in thread
From: tip-bot for Mel Gorman @ 2018-02-21 10:27 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: peterz, torvalds, ggherdovich, efault, tglx, mgorman, mingo,
	matt, hpa, linux-kernel

Commit-ID:  7ebb66a12f85bc375beaf45ca900427fe47aa8f7
Gitweb:     https://git.kernel.org/tip/7ebb66a12f85bc375beaf45ca900427fe47aa8f7
Author:     Mel Gorman <mgorman@techsingularity.net>
AuthorDate: Tue, 13 Feb 2018 13:37:25 +0000
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 21 Feb 2018 08:49:07 +0100

sched/fair: Avoid an unnecessary lookup of current CPU ID during wake_affine

The only caller of wake_affine() knows the CPU ID. Pass it in instead of
rechecking it.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180213133730.24064-2-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 820f94c..0132572 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5751,9 +5751,8 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
 }
 
 static int wake_affine(struct sched_domain *sd, struct task_struct *p,
-		       int prev_cpu, int sync)
+		       int this_cpu, int prev_cpu, int sync)
 {
-	int this_cpu = smp_processor_id();
 	int target = nr_cpumask_bits;
 
 	if (sched_feat(WA_IDLE))
@@ -6376,7 +6375,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 		if (cpu == prev_cpu)
 			goto pick_cpu;
 
-		new_cpu = wake_affine(affine_sd, p, prev_cpu, sync);
+		new_cpu = wake_affine(affine_sd, p, cpu, prev_cpu, sync);
 	}
 
 	if (sd && !(sd_flag & SD_BALANCE_FORK)) {

* [tip:sched/core] sched/fair: Defer calculation of 'prev_eff_load' in wake_affine_weight() until needed
  2018-02-13 13:37 ` [PATCH 2/6] sched/fair: Defer calculation of prev_eff_load in wake_affine until needed Mel Gorman
@ 2018-02-21 10:28   ` tip-bot for Mel Gorman
  0 siblings, 0 replies; 22+ messages in thread
From: tip-bot for Mel Gorman @ 2018-02-21 10:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: hpa, efault, matt, torvalds, mingo, ggherdovich, mgorman, peterz,
	linux-kernel, tglx

Commit-ID:  eeb60398639143c11ff2c8b509e3a471411bb5d3
Gitweb:     https://git.kernel.org/tip/eeb60398639143c11ff2c8b509e3a471411bb5d3
Author:     Mel Gorman <mgorman@techsingularity.net>
AuthorDate: Tue, 13 Feb 2018 13:37:26 +0000
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 21 Feb 2018 08:49:07 +0100

sched/fair: Defer calculation of 'prev_eff_load' in wake_affine_weight() until needed

On sync wakeups, the previous CPU effective load may not be used so delay
the calculation until it's needed.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180213133730.24064-3-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0132572..ae3e6f8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5724,7 +5724,6 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
 	unsigned long task_load;
 
 	this_eff_load = target_load(this_cpu, sd->wake_idx);
-	prev_eff_load = source_load(prev_cpu, sd->wake_idx);
 
 	if (sync) {
 		unsigned long current_load = task_h_load(current);
@@ -5742,6 +5741,7 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
 		this_eff_load *= 100;
 	this_eff_load *= capacity_of(prev_cpu);
 
+	prev_eff_load = source_load(prev_cpu, sd->wake_idx);
 	prev_eff_load -= task_load;
 	if (sched_feat(WA_BIAS))
 		prev_eff_load *= 100 + (sd->imbalance_pct - 100) / 2;

* [tip:sched/core] sched/fair: Do not migrate on wake_affine_weight() if weights are equal
  2018-02-13 13:37 ` [PATCH 3/6] sched/fair: Do not migrate on wake_affine_weight if weights are equal Mel Gorman
@ 2018-02-21 10:28   ` tip-bot for Mel Gorman
  0 siblings, 0 replies; 22+ messages in thread
From: tip-bot for Mel Gorman @ 2018-02-21 10:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, linux-kernel, tglx, matt, peterz, mingo, mgorman, hpa,
	efault, ggherdovich

Commit-ID:  082f764a2f3f2968afa1a0b04a1ccb1b70633844
Gitweb:     https://git.kernel.org/tip/082f764a2f3f2968afa1a0b04a1ccb1b70633844
Author:     Mel Gorman <mgorman@techsingularity.net>
AuthorDate: Tue, 13 Feb 2018 13:37:27 +0000
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 21 Feb 2018 08:49:08 +0100

sched/fair: Do not migrate on wake_affine_weight() if weights are equal

wake_affine_weight() will consider migrating a task to, or near, the current
CPU if there is a load imbalance. If the CPUs share LLC then either CPU
is valid as a search-for-idle-sibling target and equally appropriate for
stacking two tasks on one CPU if an idle sibling is unavailable. If they do
not share cache then a cross-node migration potentially impacts locality
so while they are equal from a CPU capacity point of view, they are not
equal in terms of memory locality. In either case, it's more appropriate
to migrate only if there is a difference in their effective load.

This patch modifies wake_affine_weight() to only consider migrating a task
if there is a load imbalance for normal wakeups but will allow potential
stacking if the loads are equal and it's a sync wakeup.

For the most part, the difference in performance is marginal. For example,
on a 4-socket server running netperf UDP_STREAM on localhost the differences
are as follows:

                                      4.15.0                 4.15.0
                                       16rc0          noequal-v1r23
 Hmean     send-64         355.47 (   0.00%)      349.50 (  -1.68%)
 Hmean     send-128        697.98 (   0.00%)      693.35 (  -0.66%)
 Hmean     send-256       1328.02 (   0.00%)     1318.77 (  -0.70%)
 Hmean     send-1024      5051.83 (   0.00%)     5051.11 (  -0.01%)
 Hmean     send-2048      9637.02 (   0.00%)     9601.34 (  -0.37%)
 Hmean     send-3312     14355.37 (   0.00%)    14414.51 (   0.41%)
 Hmean     send-4096     16464.97 (   0.00%)    16301.37 (  -0.99%)
 Hmean     send-8192     26722.42 (   0.00%)    26428.95 (  -1.10%)
 Hmean     send-16384    38137.81 (   0.00%)    38046.11 (  -0.24%)
 Hmean     recv-64         355.47 (   0.00%)      349.50 (  -1.68%)
 Hmean     recv-128        697.98 (   0.00%)      693.35 (  -0.66%)
 Hmean     recv-256       1328.02 (   0.00%)     1318.77 (  -0.70%)
 Hmean     recv-1024      5051.83 (   0.00%)     5051.11 (  -0.01%)
 Hmean     recv-2048      9636.95 (   0.00%)     9601.30 (  -0.37%)
 Hmean     recv-3312     14355.32 (   0.00%)    14414.48 (   0.41%)
 Hmean     recv-4096     16464.74 (   0.00%)    16301.16 (  -0.99%)
 Hmean     recv-8192     26721.63 (   0.00%)    26428.17 (  -1.10%)
 Hmean     recv-16384    38136.00 (   0.00%)    38044.88 (  -0.24%)
 Stddev    send-64           7.30 (   0.00%)        4.75 (  34.96%)
 Stddev    send-128         15.15 (   0.00%)       22.38 ( -47.66%)
 Stddev    send-256         13.99 (   0.00%)       19.14 ( -36.81%)
 Stddev    send-1024       105.73 (   0.00%)       67.38 (  36.27%)
 Stddev    send-2048       294.57 (   0.00%)      223.88 (  24.00%)
 Stddev    send-3312       302.28 (   0.00%)      271.74 (  10.10%)
 Stddev    send-4096       195.92 (   0.00%)      121.10 (  38.19%)
 Stddev    send-8192       399.71 (   0.00%)      563.77 ( -41.04%)
 Stddev    send-16384     1163.47 (   0.00%)     1103.68 (   5.14%)
 Stddev    recv-64           7.30 (   0.00%)        4.75 (  34.96%)
 Stddev    recv-128         15.15 (   0.00%)       22.38 ( -47.66%)
 Stddev    recv-256         13.99 (   0.00%)       19.14 ( -36.81%)
 Stddev    recv-1024       105.73 (   0.00%)       67.38 (  36.27%)
 Stddev    recv-2048       294.59 (   0.00%)      223.89 (  24.00%)
 Stddev    recv-3312       302.24 (   0.00%)      271.75 (  10.09%)
 Stddev    recv-4096       196.03 (   0.00%)      121.14 (  38.20%)
 Stddev    recv-8192       399.86 (   0.00%)      563.65 ( -40.96%)
 Stddev    recv-16384     1163.79 (   0.00%)     1103.86 (   5.15%)

The difference in overall performance is marginal but note that most
measurements are less variable. There were similar observations for other
netperf comparisons. hackbench with sockets or pipes, using processes or
threads, showed minor differences with some reduction in migrations. tbench
showed only marginal differences that were within the noise. dbench,
regardless of filesystem, showed minor differences all of which are
within noise. Multiple machines, both UMA and NUMA were tested without
any regressions showing up.

The biggest risk with a patch like this is affecting wakeup latencies.
However, the schbench load from Facebook, which is very sensitive to wakeup
latency, showed mixed results with mostly improvements in wakeup latency:

                                      4.15.0                 4.15.0
                                       16rc0          noequal-v1r23
 Lat 50.00th-qrtle-1        38.00 (   0.00%)       38.00 (   0.00%)
 Lat 75.00th-qrtle-1        49.00 (   0.00%)       41.00 (  16.33%)
 Lat 90.00th-qrtle-1        52.00 (   0.00%)       50.00 (   3.85%)
 Lat 95.00th-qrtle-1        54.00 (   0.00%)       51.00 (   5.56%)
 Lat 99.00th-qrtle-1        63.00 (   0.00%)       60.00 (   4.76%)
 Lat 99.50th-qrtle-1        66.00 (   0.00%)       61.00 (   7.58%)
 Lat 99.90th-qrtle-1        78.00 (   0.00%)       65.00 (  16.67%)
 Lat 50.00th-qrtle-2        38.00 (   0.00%)       38.00 (   0.00%)
 Lat 75.00th-qrtle-2        42.00 (   0.00%)       43.00 (  -2.38%)
 Lat 90.00th-qrtle-2        46.00 (   0.00%)       48.00 (  -4.35%)
 Lat 95.00th-qrtle-2        49.00 (   0.00%)       50.00 (  -2.04%)
 Lat 99.00th-qrtle-2        55.00 (   0.00%)       57.00 (  -3.64%)
 Lat 99.50th-qrtle-2        58.00 (   0.00%)       60.00 (  -3.45%)
 Lat 99.90th-qrtle-2        65.00 (   0.00%)       68.00 (  -4.62%)
 Lat 50.00th-qrtle-4        41.00 (   0.00%)       41.00 (   0.00%)
 Lat 75.00th-qrtle-4        45.00 (   0.00%)       46.00 (  -2.22%)
 Lat 90.00th-qrtle-4        50.00 (   0.00%)       50.00 (   0.00%)
 Lat 95.00th-qrtle-4        54.00 (   0.00%)       53.00 (   1.85%)
 Lat 99.00th-qrtle-4        61.00 (   0.00%)       61.00 (   0.00%)
 Lat 99.50th-qrtle-4        65.00 (   0.00%)       64.00 (   1.54%)
 Lat 99.90th-qrtle-4        76.00 (   0.00%)       82.00 (  -7.89%)
 Lat 50.00th-qrtle-8        48.00 (   0.00%)       46.00 (   4.17%)
 Lat 75.00th-qrtle-8        55.00 (   0.00%)       54.00 (   1.82%)
 Lat 90.00th-qrtle-8        60.00 (   0.00%)       59.00 (   1.67%)
 Lat 95.00th-qrtle-8        63.00 (   0.00%)       63.00 (   0.00%)
 Lat 99.00th-qrtle-8        71.00 (   0.00%)       69.00 (   2.82%)
 Lat 99.50th-qrtle-8        74.00 (   0.00%)       73.00 (   1.35%)
 Lat 99.90th-qrtle-8        98.00 (   0.00%)       90.00 (   8.16%)
 Lat 50.00th-qrtle-16       56.00 (   0.00%)       55.00 (   1.79%)
 Lat 75.00th-qrtle-16       68.00 (   0.00%)       67.00 (   1.47%)
 Lat 90.00th-qrtle-16       77.00 (   0.00%)       78.00 (  -1.30%)
 Lat 95.00th-qrtle-16       82.00 (   0.00%)       84.00 (  -2.44%)
 Lat 99.00th-qrtle-16       90.00 (   0.00%)       93.00 (  -3.33%)
 Lat 99.50th-qrtle-16       93.00 (   0.00%)       97.00 (  -4.30%)
 Lat 99.90th-qrtle-16      110.00 (   0.00%)      110.00 (   0.00%)
 Lat 50.00th-qrtle-32       68.00 (   0.00%)       62.00 (   8.82%)
 Lat 75.00th-qrtle-32       90.00 (   0.00%)       83.00 (   7.78%)
 Lat 90.00th-qrtle-32      110.00 (   0.00%)      100.00 (   9.09%)
 Lat 95.00th-qrtle-32      122.00 (   0.00%)      111.00 (   9.02%)
 Lat 99.00th-qrtle-32      145.00 (   0.00%)      133.00 (   8.28%)
 Lat 99.50th-qrtle-32      154.00 (   0.00%)      143.00 (   7.14%)
 Lat 99.90th-qrtle-32     2316.00 (   0.00%)      515.00 (  77.76%)
 Lat 50.00th-qrtle-35       69.00 (   0.00%)       72.00 (  -4.35%)
 Lat 75.00th-qrtle-35       92.00 (   0.00%)       95.00 (  -3.26%)
 Lat 90.00th-qrtle-35      111.00 (   0.00%)      114.00 (  -2.70%)
 Lat 95.00th-qrtle-35      122.00 (   0.00%)      124.00 (  -1.64%)
 Lat 99.00th-qrtle-35      142.00 (   0.00%)      144.00 (  -1.41%)
 Lat 99.50th-qrtle-35      150.00 (   0.00%)      154.00 (  -2.67%)
 Lat 99.90th-qrtle-35     6104.00 (   0.00%)     5640.00 (   7.60%)

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180213133730.24064-4-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ae3e6f8..a07920f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5747,7 +5747,16 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
 		prev_eff_load *= 100 + (sd->imbalance_pct - 100) / 2;
 	prev_eff_load *= capacity_of(this_cpu);
 
-	return this_eff_load <= prev_eff_load ? this_cpu : nr_cpumask_bits;
+	/*
+	 * If sync, adjust the weight of prev_eff_load such that if
+	 * prev_eff == this_eff that select_idle_sibling() will consider
+	 * stacking the wakee on top of the waker if no other CPU is
+	 * idle.
+	 */
+	if (sync)
+		prev_eff_load += 1;
+
+	return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
 }
 
 static int wake_affine(struct sched_domain *sd, struct task_struct *p,

* [tip:sched/core] sched/fair: Do not migrate due to a sync wakeup on exit
  2018-02-13 13:37 ` [PATCH 4/6] sched/fair: Do not migrate due to a sync wakeup on exit Mel Gorman
@ 2018-02-21 10:29   ` tip-bot for Peter Zijlstra
  0 siblings, 0 replies; 22+ messages in thread
From: tip-bot for Peter Zijlstra @ 2018-02-21 10:29 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, hpa, ggherdovich, efault, mgorman, tglx, peterz, mingo,
	linux-kernel, matt

Commit-ID:  24d0c1d6e65f635b2c0684d0a42ff6c0674aa0e6
Gitweb:     https://git.kernel.org/tip/24d0c1d6e65f635b2c0684d0a42ff6c0674aa0e6
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Tue, 13 Feb 2018 13:37:28 +0000
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 21 Feb 2018 08:49:42 +0100

sched/fair: Do not migrate due to a sync wakeup on exit

When a task exits, it notifies the parent that it has exited. This is a
sync wakeup and the exiting task may pull the parent towards the waker's
CPU. For simple workloads like using a shell, it was observed that the
shell is pulled across nodes by exiting processes. This is daft as the
parent may be long-lived and properly placed. This patch special cases a
sync wakeup on exit to avoid pulling tasks across nodes. Testing on a range
of workloads and machines showed very little difference in performance
although there was a small 3% boost on some machines running a shellscript
intensive workload (git regression test suite).

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180213133730.24064-5-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a07920f..302dda8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6350,7 +6350,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 	int cpu = smp_processor_id();
 	int new_cpu = prev_cpu;
 	int want_affine = 0;
-	int sync = wake_flags & WF_SYNC;
+	int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
 
 	if (sd_flag & SD_BALANCE_WAKE) {
 		record_wakee(p);

* [tip:sched/core] sched/fair: Consider SD_NUMA when selecting the most idle group to schedule on
  2018-02-13 13:37 ` [PATCH 5/6] sched/fair: Consider SD_NUMA when selecting the most idle group to schedule on Mel Gorman
@ 2018-02-21 10:29   ` tip-bot for Mel Gorman
  0 siblings, 0 replies; 22+ messages in thread
From: tip-bot for Mel Gorman @ 2018-02-21 10:29 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, matt, hpa, torvalds, peterz, mingo, mgorman, tglx,
	efault, ggherdovich

Commit-ID:  2c83362734dad8e48ccc0710b5cd2436a0323893
Gitweb:     https://git.kernel.org/tip/2c83362734dad8e48ccc0710b5cd2436a0323893
Author:     Mel Gorman <mgorman@techsingularity.net>
AuthorDate: Tue, 13 Feb 2018 13:37:29 +0000
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 21 Feb 2018 08:49:43 +0100

sched/fair: Consider SD_NUMA when selecting the most idle group to schedule on

find_idlest_group() compares a local group with each other group to select
the one that is most idle. When comparing groups in different NUMA domains,
a very slight imbalance is enough to select a remote NUMA node even if the
runnable load on both groups is 0 or close to 0. This ignores the cost of
remote accesses entirely and is a problem when selecting the CPU for a
newly forked task to run on. A child of a forking server is almost
guaranteed to start on a remote node, incurring numerous remote accesses
and potentially causing automatic NUMA balancing to try to migrate the
task back or migrate the data to another node. Similar weirdness is
observed if a basic shell command pipes output to another, as each process
in the pipeline is likely to start on a different node and then get
adjusted later by wake_affine().

This patch adds imbalance to remote domains when considering whether to
select CPUs from remote domains. If the local domain is selected, imbalance
will still be used to try to select a CPU from a lower scheduler domain's
group instead of stacking tasks on the same CPU.

A variety of workloads and machines were tested and as expected, there is no
difference on UMA. The difference on NUMA can be dramatic. This is a comparison
of elapsed times running the git regression test suite. It's fork-intensive with
short-lived processes:

                                  4.15.0                 4.15.0
                            noexit-v1r23           sdnuma-v1r23
 Elapsed min          1706.06 (   0.00%)     1435.94 (  15.83%)
 Elapsed mean         1709.53 (   0.00%)     1436.98 (  15.94%)
 Elapsed stddev          2.16 (   0.00%)        1.01 (  53.38%)
 Elapsed coeffvar        0.13 (   0.00%)        0.07 (  44.54%)
 Elapsed max          1711.59 (   0.00%)     1438.01 (  15.98%)

               4.15.0      4.15.0
         noexit-v1r23 sdnuma-v1r23
 User         5434.12     5188.41
 System       4878.77     3467.09
 Elapsed     10259.06     8624.21

That shows a considerable reduction in elapsed times. It's important to
note that automatic NUMA balancing does not affect this load as processes
are too short-lived.

There is also a noticeable impact on hackbench, such as this example using
processes and pipes:

 hackbench-process-pipes
                               4.15.0                 4.15.0
                         noexit-v1r23           sdnuma-v1r23
 Amean     1        1.0973 (   0.00%)      0.9393 (  14.40%)
 Amean     4        1.3427 (   0.00%)      1.3730 (  -2.26%)
 Amean     7        1.4233 (   0.00%)      1.6670 ( -17.12%)
 Amean     12       3.0250 (   0.00%)      3.3013 (  -9.13%)
 Amean     21       9.0860 (   0.00%)      9.5343 (  -4.93%)
 Amean     30      14.6547 (   0.00%)     13.2433 (   9.63%)
 Amean     48      22.5447 (   0.00%)     20.4303 (   9.38%)
 Amean     79      29.2010 (   0.00%)     26.7853 (   8.27%)
 Amean     110     36.7443 (   0.00%)     35.8453 (   2.45%)
 Amean     141     45.8533 (   0.00%)     42.6223 (   7.05%)
 Amean     172     55.1317 (   0.00%)     50.6473 (   8.13%)
 Amean     203     64.4420 (   0.00%)     58.3957 (   9.38%)
 Amean     234     73.2293 (   0.00%)     67.1047 (   8.36%)
 Amean     265     80.5220 (   0.00%)     75.7330 (   5.95%)
 Amean     296     88.7567 (   0.00%)     82.1533 (   7.44%)

It's not a universal win as there are occasions when spreading wide and
quickly is a benefit but it's more of a win than it is a loss. For other
workloads, there is little difference but netperf is interesting. Without
the patch, the server and client start on different nodes but are quickly
migrated due to wake_affine. Hence, the difference in overall performance
is marginal but detectable:

                                      4.15.0                 4.15.0
                                noexit-v1r23           sdnuma-v1r23
 Hmean     send-64         349.09 (   0.00%)      354.67 (   1.60%)
 Hmean     send-128        699.16 (   0.00%)      702.91 (   0.54%)
 Hmean     send-256       1316.34 (   0.00%)     1350.07 (   2.56%)
 Hmean     send-1024      5063.99 (   0.00%)     5124.38 (   1.19%)
 Hmean     send-2048      9705.19 (   0.00%)     9687.44 (  -0.18%)
 Hmean     send-3312     14359.48 (   0.00%)    14577.64 (   1.52%)
 Hmean     send-4096     16324.20 (   0.00%)    16393.62 (   0.43%)
 Hmean     send-8192     26112.61 (   0.00%)    26877.26 (   2.93%)
 Hmean     send-16384    37208.44 (   0.00%)    38683.43 (   3.96%)
 Hmean     recv-64         349.09 (   0.00%)      354.67 (   1.60%)
 Hmean     recv-128        699.16 (   0.00%)      702.91 (   0.54%)
 Hmean     recv-256       1316.34 (   0.00%)     1350.07 (   2.56%)
 Hmean     recv-1024      5063.99 (   0.00%)     5124.38 (   1.19%)
 Hmean     recv-2048      9705.16 (   0.00%)     9687.43 (  -0.18%)
 Hmean     recv-3312     14359.42 (   0.00%)    14577.59 (   1.52%)
 Hmean     recv-4096     16323.98 (   0.00%)    16393.55 (   0.43%)
 Hmean     recv-8192     26111.85 (   0.00%)    26876.96 (   2.93%)
 Hmean     recv-16384    37206.99 (   0.00%)    38682.41 (   3.97%)

However, what is very interesting is how automatic NUMA balancing behaves.
Each netperf instance runs long enough for balancing to activate:

 NUMA base PTE updates             4620        1473
 NUMA huge PMD updates                0           0
 NUMA page range updates           4620        1473
 NUMA hint faults                  4301        1383
 NUMA hint local faults            1309         451
 NUMA hint local percent             30          32
 NUMA pages migrated               1335         491
 AutoNUMA cost                      21%          6%

There is an unfortunate number of remote faults although tracing indicated
that the vast majority are in shared libraries. However, the tendency to
start tasks on the same node if there is capacity means that there were
far fewer PTE updates and faults incurred overall.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180213133730.24064-6-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 302dda8..94aea5b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5911,6 +5911,18 @@ skip_spare:
 	if (!idlest)
 		return NULL;
 
+	/*
+	 * When comparing groups across NUMA domains, it's possible for the
+	 * local domain to be very lightly loaded relative to the remote
+	 * domains but "imbalance" skews the comparison making remote CPUs
+	 * look much more favourable. When considering cross-domain, add
+	 * imbalance to the runnable load on the remote node and consider
+	 * staying local.
+	 */
+	if ((sd->flags & SD_NUMA) &&
+	    min_runnable_load + imbalance >= this_runnable_load)
+		return NULL;
+
 	if (min_runnable_load > (this_runnable_load + imbalance))
 		return NULL;
 


* [tip:sched/core] sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine()
  2018-02-13 13:37 ` [PATCH 6/6] sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine Mel Gorman
  2018-02-13 14:01   ` Peter Zijlstra
@ 2018-02-21 10:30   ` tip-bot for Mel Gorman
  2018-05-07 11:06     ` Srikar Dronamraju
  1 sibling, 1 reply; 22+ messages in thread
From: tip-bot for Mel Gorman @ 2018-02-21 10:30 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, mgorman, torvalds, ggherdovich, peterz, matt, linux-kernel,
	efault, hpa, mingo

Commit-ID:  7347fc87dfe6b7315e74310ee1243dc222c68086
Gitweb:     https://git.kernel.org/tip/7347fc87dfe6b7315e74310ee1243dc222c68086
Author:     Mel Gorman <mgorman@techsingularity.net>
AuthorDate: Tue, 13 Feb 2018 13:37:30 +0000
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 21 Feb 2018 08:49:45 +0100

sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine()

If wake_affine() pulls a task to another node for any reason and the node is
no longer preferred then temporarily stop automatic NUMA balancing pulling
the task back. Otherwise, tasks with a strong waker/wakee relationship
may constantly fight automatic NUMA balancing over where a task should
be placed.

Once again, netperf is worth examining here. The performance barely changes
but the automatic NUMA balancing activity does:

 Hmean     send-64         354.67 (   0.00%)      352.15 (  -0.71%)
 Hmean     send-128        702.91 (   0.00%)      693.84 (  -1.29%)
 Hmean     send-256       1350.07 (   0.00%)     1344.19 (  -0.44%)
 Hmean     send-1024      5124.38 (   0.00%)     4941.24 (  -3.57%)
 Hmean     send-2048      9687.44 (   0.00%)     9624.45 (  -0.65%)
 Hmean     send-3312     14577.64 (   0.00%)    14514.35 (  -0.43%)
 Hmean     send-4096     16393.62 (   0.00%)    16488.30 (   0.58%)
 Hmean     send-8192     26877.26 (   0.00%)    26431.63 (  -1.66%)
 Hmean     send-16384    38683.43 (   0.00%)    38264.91 (  -1.08%)
 Hmean     recv-64         354.67 (   0.00%)      352.15 (  -0.71%)
 Hmean     recv-128        702.91 (   0.00%)      693.84 (  -1.29%)
 Hmean     recv-256       1350.07 (   0.00%)     1344.19 (  -0.44%)
 Hmean     recv-1024      5124.38 (   0.00%)     4941.24 (  -3.57%)
 Hmean     recv-2048      9687.43 (   0.00%)     9624.45 (  -0.65%)
 Hmean     recv-3312     14577.59 (   0.00%)    14514.35 (  -0.43%)
 Hmean     recv-4096     16393.55 (   0.00%)    16488.20 (   0.58%)
 Hmean     recv-8192     26876.96 (   0.00%)    26431.29 (  -1.66%)
 Hmean     recv-16384    38682.41 (   0.00%)    38263.94 (  -1.08%)

 NUMA alloc hit                 1465986     1423090
 NUMA alloc miss                      0           0
 NUMA interleave hit                  0           0
 NUMA alloc local               1465897     1423003
 NUMA base PTE updates             1473        1420
 NUMA huge PMD updates                0           0
 NUMA page range updates           1473        1420
 NUMA hint faults                  1383        1312
 NUMA hint local faults             451         124
 NUMA hint local percent             32           9

There is a slight degradation in performance and slightly fewer NUMA
faults. There is a large drop in the percentage of local faults, but the
bulk of migrations for netperf are in small shared libraries, so this
reflects the fact that automatic NUMA balancing has backed off. This is
a case where, despite wake_affine() and automatic NUMA balancing fighting
over placement, there is a marginal benefit to rescheduling close to
local data quickly. Even so, wake_affine() and automatic NUMA balancing
constantly fighting each other is undesirable.

However, the benefit in other cases is large. This is the result for NAS
with the D class sizing on a 4-socket machine:

 nas-mpi
                           4.15.0                 4.15.0
                     sdnuma-v1r23       delayretry-v1r23
 Time cg.D      557.00 (   0.00%)      431.82 (  22.47%)
 Time ep.D       77.83 (   0.00%)       79.01 (  -1.52%)
 Time is.D       26.46 (   0.00%)       26.64 (  -0.68%)
 Time lu.D      727.14 (   0.00%)      597.94 (  17.77%)
 Time mg.D      191.35 (   0.00%)      146.85 (  23.26%)

               4.15.0      4.15.0
         sdnuma-v1r23delayretry-v1r23
 User        75665.20    70413.30
 System      20321.59     8861.67
 Elapsed       766.13      634.92

 Minor Faults                  16528502     7127941
 Major Faults                      4553        5068
 NUMA alloc local               6963197     6749135
 NUMA base PTE updates        366409093   107491434
 NUMA huge PMD updates           687556      198880
 NUMA page range updates      718437765   209317994
 NUMA hint faults              13643410     4601187
 NUMA hint local faults         9212593     3063996
 NUMA hint local percent             67          66

Note the massive reduction in system CPU usage even though the percentage
of local faults is barely affected. There is a massive reduction in the
number of PTE updates showing that automatic NUMA balancing has backed off.
A critical observation is also that there is a massive reduction in minor
faults which is due to far fewer NUMA hinting faults being trapped.

There were questions about NAS OMP and how it behaves when threads are
bound to CPUs. First, there are more gains than losses with this patch
applied, and a reduction in system CPU usage:

nas-omp
                      4.16.0-rc1             4.16.0-rc1
                     sdnuma-v2r1        delayretry-v2r1
Time bt.D      436.71 (   0.00%)      430.05 (   1.53%)
Time cg.D      201.02 (   0.00%)      180.87 (  10.02%)
Time ep.D       32.84 (   0.00%)       32.68 (   0.49%)
Time is.D        9.63 (   0.00%)        9.64 (  -0.10%)
Time lu.D      331.20 (   0.00%)      304.80 (   7.97%)
Time mg.D       54.87 (   0.00%)       52.72 (   3.92%)
Time sp.D     1108.78 (   0.00%)      917.10 (  17.29%)
Time ua.D      378.81 (   0.00%)      398.83 (  -5.28%)

          4.16.0-rc1  4.16.0-rc1
         sdnuma-v2r1delayretry-v2r1
User       305633.08   296751.91
System        451.75      357.80
Elapsed      2595.73     2368.13

However, it does not close the gap between bound and unbound threads. The
difference between the baseline and a patched kernel is negligible when
threads are bound, so that comparison is not presented; the bound versus
unbound comparison for the patched kernel is:

                      4.16.0-rc1             4.16.0-rc1
                 delayretry-bind     delayretry-unbound
Time bt.D      385.02 (   0.00%)      430.05 ( -11.70%)
Time cg.D      144.02 (   0.00%)      180.87 ( -25.59%)
Time ep.D       32.85 (   0.00%)       32.68 (   0.52%)
Time is.D       10.52 (   0.00%)        9.64 (   8.37%)
Time lu.D      285.31 (   0.00%)      304.80 (  -6.83%)
Time mg.D       43.21 (   0.00%)       52.72 ( -22.01%)
Time sp.D      820.24 (   0.00%)      917.10 ( -11.81%)
Time ua.D      337.09 (   0.00%)      398.83 ( -18.32%)

          4.16.0-rc1  4.16.0-rc1
        delayretry-binddelayretry-unbound
User       277731.25   296751.91
System        261.29      357.80
Elapsed      2100.55     2368.13

Unfortunately, while performance is improved by the patch, there is still
quite a long way to go before it's equivalent to hard binding.

Other workloads like hackbench, tbench, dbench and schbench are barely
affected. dbench shows a mix of gains and losses depending on the machine
although in general, the results are more stable.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180213133730.24064-7-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 56 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 94aea5b..33662a3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1869,6 +1869,7 @@ static int task_numa_migrate(struct task_struct *p)
 static void numa_migrate_preferred(struct task_struct *p)
 {
 	unsigned long interval = HZ;
+	unsigned long numa_migrate_retry;
 
 	/* This task has no NUMA fault statistics yet */
 	if (unlikely(p->numa_preferred_nid == -1 || !p->numa_faults))
@@ -1876,7 +1877,18 @@ static void numa_migrate_preferred(struct task_struct *p)
 
 	/* Periodically retry migrating the task to the preferred node */
 	interval = min(interval, msecs_to_jiffies(p->numa_scan_period) / 16);
-	p->numa_migrate_retry = jiffies + interval;
+	numa_migrate_retry = jiffies + interval;
+
+	/*
+	 * Check that the new retry threshold is after the current one. If
+	 * the retry is in the future, it implies that wake_affine has
+	 * temporarily asked NUMA balancing to backoff from placement.
+	 */
+	if (numa_migrate_retry > p->numa_migrate_retry)
+		return;
+
+	/* Safe to try placing the task on the preferred node */
+	p->numa_migrate_retry = numa_migrate_retry;
 
 	/* Success if task is already running on preferred CPU */
 	if (task_node(p) == p->numa_preferred_nid)
@@ -5759,6 +5771,48 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
 	return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+static void
+update_wa_numa_placement(struct task_struct *p, int prev_cpu, int target)
+{
+	unsigned long interval;
+
+	if (!static_branch_likely(&sched_numa_balancing))
+		return;
+
+	/* If balancing has no preference then continue gathering data */
+	if (p->numa_preferred_nid == -1)
+		return;
+
+	/*
+	 * If the wakeup is not affecting locality then it is neutral from
+	 * the perspective of NUMA balancing so continue gathering data.
+	 */
+	if (cpu_to_node(prev_cpu) == cpu_to_node(target))
+		return;
+
+	/*
+	 * Temporarily prevent NUMA balancing trying to place waker/wakee after
+	 * wakee has been moved by wake_affine. This will potentially allow
+	 * related tasks to converge and update their data placement. The
+	 * 4 * numa_scan_period is to allow the two-pass filter to migrate
+	 * hot data to the waker's node.
+	 */
+	interval = max(sysctl_numa_balancing_scan_delay,
+			 p->numa_scan_period << 2);
+	p->numa_migrate_retry = jiffies + msecs_to_jiffies(interval);
+
+	interval = max(sysctl_numa_balancing_scan_delay,
+			 current->numa_scan_period << 2);
+	current->numa_migrate_retry = jiffies + msecs_to_jiffies(interval);
+}
+#else
+static void
+update_wa_numa_placement(struct task_struct *p, int prev_cpu, int target)
+{
+}
+#endif
+
 static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 		       int this_cpu, int prev_cpu, int sync)
 {
@@ -5774,6 +5828,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 	if (target == nr_cpumask_bits)
 		return prev_cpu;
 
+	update_wa_numa_placement(p, prev_cpu, target);
 	schedstat_inc(sd->ttwu_move_affine);
 	schedstat_inc(p->se.statistics.nr_wakeups_affine);
 	return target;


* Re: [tip:sched/core] sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine()
  2018-02-21 10:30   ` [tip:sched/core] sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine() tip-bot for Mel Gorman
@ 2018-05-07 11:06     ` Srikar Dronamraju
  2018-05-09  8:41       ` Mel Gorman
  0 siblings, 1 reply; 22+ messages in thread
From: Srikar Dronamraju @ 2018-05-07 11:06 UTC (permalink / raw)
  To: mgorman, torvalds, tglx, mingo, hpa, efault, linux-kernel, matt,
	peterz, ggherdovich
  Cc: linux-tip-commits, mpe

Hi Mel,

I do see performance improving with this commit 7347fc87df "sched/numa:
Delay retrying placement for automatic NUMA balance after wake_affine()"
even on powerpc where we have SD_WAKE_AFFINE *disabled* on numa sched
domains. Ideally this commit should not have affected powerpc machines.
That made me to look a bit deeper.

> @@ -1876,7 +1877,18 @@ static void numa_migrate_preferred(struct task_struct *p)
>
>  	/* Periodically retry migrating the task to the preferred node */
>  	interval = min(interval, msecs_to_jiffies(p->numa_scan_period) / 16);
> -	p->numa_migrate_retry = jiffies + interval;
> +	numa_migrate_retry = jiffies + interval;
> +
> +	/*
> +	 * Check that the new retry threshold is after the current one. If
> +	 * the retry is in the future, it implies that wake_affine has
> +	 * temporarily asked NUMA balancing to backoff from placement.
> +	 */
> +	if (numa_migrate_retry > p->numa_migrate_retry)
> +		return;

The above check looks wrong. This check will almost always be true, since
numa_migrate_preferred() itself is called either when jiffies >
p->numa_migrate_retry or when the task's numa_preferred_nid has changed.

Hence we never end up calling task_numa_migrate(), i.e. we never go through
the active CPU balancing path in NUMA balancing.

Reading the comment just above the check makes me think the check
should have been:

	if (numa_migrate_retry < p->numa_migrate_retry)
		return;

Here is perf stat output with 7347fc87df running perf bench numa mem
--no-data_rand_walk 96 -p 2 -t 48 -G 0 -P 3072 -T 0 -l 50 -c -s 1000

          2,13,898      cs                                                            ( +-  2.65% )
            10,228      migrations                                                    ( +- 14.61% )
         21,86,406      faults                                                        ( +-  9.69% )
   40,65,84,68,026      cache-misses                                                  ( +-  0.31% )
                 0      sched:sched_move_numa   <---------------
                 0      sched:sched_stick_numa   <---------------
                 0      sched:sched_swap_numa   <---------------
          1,41,780      migrate:mm_migrate_pages                                      ( +- 24.11% )
                 0      migrate:mm_numa_migrate_ratelimit

     778.331602169 seconds time elapsed


If you look at the sched_move_numa, sched_stick_numa and sched_swap_numa
numbers, it's very clear that we did not try any active CPU migrations.

Same command with the commit reverted

       2,38,685      cs                                                            ( +-  2.93% )
            25,127      migrations                                                    ( +- 13.22% )
         17,27,858      faults                                                        ( +-  2.61% )
   34,77,06,21,298      cache-misses                                                  ( +-  0.61% )
               560      sched:sched_move_numa                                         ( +-  2.05% )
                16      sched:sched_stick_numa                                        ( +- 33.33% )
               310      sched:sched_swap_numa                                         ( +- 15.16% )
          1,25,062      migrate:mm_migrate_pages                                      ( +-  0.91% )
                 0      migrate:mm_numa_migrate_ratelimit

     916.777315465 seconds time elapsed

(numbers are almost same with just that check commented/modified)

So we are seeing an improvement, but the improvement comes from bypassing
the active CPU balancing. Do we really want to bypass this code?

> +
> +	/* Safe to try placing the task on the preferred node */
> +	p->numa_migrate_retry = numa_migrate_retry;
>
>  	/* Success if task is already running on preferred CPU */
>  	if (task_node(p) == p->numa_preferred_nid)
> @@ -5759,6 +5771,48 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,


* Re: [tip:sched/core] sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine()
  2018-05-07 11:06     ` Srikar Dronamraju
@ 2018-05-09  8:41       ` Mel Gorman
  2018-05-09 10:58         ` Srikar Dronamraju
  0 siblings, 1 reply; 22+ messages in thread
From: Mel Gorman @ 2018-05-09  8:41 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: torvalds, tglx, mingo, hpa, efault, linux-kernel, matt, peterz,
	ggherdovich, linux-tip-commits, mpe

On Mon, May 07, 2018 at 04:06:07AM -0700, Srikar Dronamraju wrote:
> > @@ -1876,7 +1877,18 @@ static void numa_migrate_preferred(struct task_struct *p)
> >
> >  	/* Periodically retry migrating the task to the preferred node */
> >  	interval = min(interval, msecs_to_jiffies(p->numa_scan_period) / 16);
> > -	p->numa_migrate_retry = jiffies + interval;
> > +	numa_migrate_retry = jiffies + interval;
> > +
> > +	/*
> > +	 * Check that the new retry threshold is after the current one. If
> > +	 * the retry is in the future, it implies that wake_affine has
> > +	 * temporarily asked NUMA balancing to backoff from placement.
> > +	 */
> > +	if (numa_migrate_retry > p->numa_migrate_retry)
> > +		return;
> 
> The above check looks wrong. This check will almost always be true, since
> numa_migrate_preferred() itself is called either when jiffies >
> p->numa_migrate_retry or when the task's numa_preferred_nid has changed.
> 

Sorry for the delay getting back -- viral infections combined with a public
day off are slowing me down.

You're right: without affine wakeups, a wakeup-intensive workload may
never hit the path, and with the current code it effectively acts as a
broken throttling mechanism. However, I've confirmed that "fixing" it has
mixed results, with many regressions on x86 for both 2 and 4 socket
boxes. I need time to think about it and see whether this can be fixed
without introducing another regression. I'll also check whether a plain
revert is the way to go for a short-term fix and then revisit it.

Thanks Srikar.

-- 
Mel Gorman
SUSE Labs


* Re: [tip:sched/core] sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine()
  2018-05-09  8:41       ` Mel Gorman
@ 2018-05-09 10:58         ` Srikar Dronamraju
  2018-05-09 16:34           ` Mel Gorman
  0 siblings, 1 reply; 22+ messages in thread
From: Srikar Dronamraju @ 2018-05-09 10:58 UTC (permalink / raw)
  To: Mel Gorman
  Cc: torvalds, tglx, mingo, hpa, efault, linux-kernel, matt, peterz,
	ggherdovich, linux-tip-commits, mpe

* Mel Gorman <mgorman@techsingularity.net> [2018-05-09 09:41:48]:

> On Mon, May 07, 2018 at 04:06:07AM -0700, Srikar Dronamraju wrote:
> > > @@ -1876,7 +1877,18 @@ static void numa_migrate_preferred(struct task_struct *p)
> > >
> > >  	/* Periodically retry migrating the task to the preferred node */
> > >  	interval = min(interval, msecs_to_jiffies(p->numa_scan_period) / 16);
> > > -	p->numa_migrate_retry = jiffies + interval;
> > > +	numa_migrate_retry = jiffies + interval;
> > > +
> > > +	/*
> > > +	 * Check that the new retry threshold is after the current one. If
> > > +	 * the retry is in the future, it implies that wake_affine has
> > > +	 * temporarily asked NUMA balancing to backoff from placement.
> > > +	 */
> > > +	if (numa_migrate_retry > p->numa_migrate_retry)
> > > +		return;
> > 
> > The above check looks wrong. This check will almost always be true, since
> > numa_migrate_preferred() itself is called either when jiffies >
> > p->numa_migrate_retry or when the task's numa_preferred_nid has changed.
> > 
> 
> You're right: without affine wakeups, a wakeup-intensive workload may
> never hit the path, and with the current code it effectively acts as a
> broken throttling mechanism.

I haven't tried on an x86 box, but I'm still trying to get my head around
that check. How do affine wakeups differ for this check?

Let's say p->numa_migrate_retry was set by wake_affine() and the task has
crossed that temporary period where we don't want the task to undergo
NUMA balancing.

Now the task is back at numa_migrate_preferred(); p->numa_migrate_retry
is less than jiffies (something like "current jiffies - 100"). It
would always return from that check.

In the other scenario, where wake_affine() set p->numa_migrate_retry to a
bigger value, the task calls numa_migrate_preferred(); the new
numa_migrate_retry could be before p->numa_migrate_retry. In
such a case, we should have stopped the task from migrating.
However we overwrite p->numa_migrate_retry and do the
task_numa_migrate(). Somehow this doesn't seem to achieve what the
commit intended.

Or did I misunderstand?

--
Thanks and Regards
Srikar


* Re: [tip:sched/core] sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine()
  2018-05-09 10:58         ` Srikar Dronamraju
@ 2018-05-09 16:34           ` Mel Gorman
  0 siblings, 0 replies; 22+ messages in thread
From: Mel Gorman @ 2018-05-09 16:34 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: torvalds, tglx, mingo, hpa, efault, linux-kernel, matt, peterz,
	ggherdovich, linux-tip-commits, mpe

On Wed, May 09, 2018 at 03:58:14AM -0700, Srikar Dronamraju wrote:
> In the other scenario, where wake_affine() set p->numa_migrate_retry to a
> bigger value, the task calls numa_migrate_preferred(); the new
> numa_migrate_retry could be before p->numa_migrate_retry. In
> such a case, we should have stopped the task from migrating.
> However we overwrite p->numa_migrate_retry and do the
> task_numa_migrate(). Somehow this doesn't seem to achieve what the
> commit intended.
> 
> Or did I misunderstand?
> 

Nope, the logic is broken. While it can be "fixed", the end result adds
complexity for very little gain. I believe the right path for the moment
is a revert and retry at a future date.

Thanks.

-- 
Mel Gorman
SUSE Labs


end of thread, other threads:[~2018-05-09 16:34 UTC | newest]

Thread overview: 22+ messages
2018-02-13 13:37 [PATCH 0/6] Reduce migrations and conflicts with automatic NUMA balancing v2 Mel Gorman
2018-02-13 13:37 ` [PATCH 1/6] sched/fair: Avoid an unnecessary lookup of current CPU ID during wake_affine Mel Gorman
2018-02-21 10:27   ` [tip:sched/core] " tip-bot for Mel Gorman
2018-02-13 13:37 ` [PATCH 2/6] sched/fair: Defer calculation of prev_eff_load in wake_affine until needed Mel Gorman
2018-02-21 10:28   ` [tip:sched/core] sched/fair: Defer calculation of 'prev_eff_load' in wake_affine_weight() " tip-bot for Mel Gorman
2018-02-13 13:37 ` [PATCH 3/6] sched/fair: Do not migrate on wake_affine_weight if weights are equal Mel Gorman
2018-02-21 10:28   ` [tip:sched/core] sched/fair: Do not migrate on wake_affine_weight() " tip-bot for Mel Gorman
2018-02-13 13:37 ` [PATCH 4/6] sched/fair: Do not migrate due to a sync wakeup on exit Mel Gorman
2018-02-21 10:29   ` [tip:sched/core] " tip-bot for Peter Zijlstra
2018-02-13 13:37 ` [PATCH 5/6] sched/fair: Consider SD_NUMA when selecting the most idle group to schedule on Mel Gorman
2018-02-21 10:29   ` [tip:sched/core] " tip-bot for Mel Gorman
2018-02-13 13:37 ` [PATCH 6/6] sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine Mel Gorman
2018-02-13 14:01   ` Peter Zijlstra
2018-02-13 14:18     ` Mel Gorman
2018-02-13 14:43       ` Peter Zijlstra
2018-02-13 15:00         ` Mel Gorman
2018-02-13 15:10           ` Peter Zijlstra
2018-02-21 10:30   ` [tip:sched/core] sched/numa: Delay retrying placement for automatic NUMA balance after wake_affine() tip-bot for Mel Gorman
2018-05-07 11:06     ` Srikar Dronamraju
2018-05-09  8:41       ` Mel Gorman
2018-05-09 10:58         ` Srikar Dronamraju
2018-05-09 16:34           ` Mel Gorman
