* [PATCH 0/4] Reduce migrations due to load imbalance and process exits
@ 2018-02-12 14:58 Mel Gorman
2018-02-12 14:58 ` [PATCH 1/4] sched/fair: Avoid an unnecessary lookup of current CPU ID during wake_affine Mel Gorman
From: Mel Gorman @ 2018-02-12 14:58 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Mike Galbraith, Matt Fleming, LKML, Mel Gorman
The following series was motivated by observing more task migrations than
expected. The first two patches are minor cleanups. The third patch avoids
migrations on wakeup when two CPUs are equally loaded, which is particularly
important in the case where the measured load is 0. The last patch avoids
cross-node migrations of a parent process when a child exits. Overall, there
is a minor boost in performance on some workloads, with the details included
in the changelogs.
The series was tested based on Linus's tree during the 4.16 merge window
as of Feb 8th but applies cleanly to 4.16-rc1.
kernel/sched/fair.c | 26 ++++++++++++++++++++------
1 file changed, 20 insertions(+), 6 deletions(-)
--
2.15.1
* [PATCH 1/4] sched/fair: Avoid an unnecessary lookup of current CPU ID during wake_affine
2018-02-12 14:58 [PATCH 0/4] Reduce migrations due to load imbalance and process exits Mel Gorman
@ 2018-02-12 14:58 ` Mel Gorman
2018-02-12 14:58 ` [PATCH 2/4] sched/fair: Defer calculation of prev_eff_load in wake_affine until needed Mel Gorman
From: Mel Gorman @ 2018-02-12 14:58 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Mike Galbraith, Matt Fleming, LKML, Mel Gorman
The only caller of wake_affine() knows the CPU ID. Pass it in instead of
rechecking it.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
kernel/sched/fair.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5eb3ffc9be84..4b3e86eb2222 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5751,9 +5751,8 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
}
static int wake_affine(struct sched_domain *sd, struct task_struct *p,
- int prev_cpu, int sync)
+ int this_cpu, int prev_cpu, int sync)
{
- int this_cpu = smp_processor_id();
int target = nr_cpumask_bits;
if (sched_feat(WA_IDLE))
@@ -6376,7 +6375,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
if (cpu == prev_cpu)
goto pick_cpu;
- new_cpu = wake_affine(affine_sd, p, prev_cpu, sync);
+ new_cpu = wake_affine(affine_sd, p, cpu, prev_cpu, sync);
}
if (sd && !(sd_flag & SD_BALANCE_FORK)) {
--
2.15.1
* [PATCH 2/4] sched/fair: Defer calculation of prev_eff_load in wake_affine until needed
2018-02-12 14:58 [PATCH 0/4] Reduce migrations due to load imbalance and process exits Mel Gorman
2018-02-12 14:58 ` [PATCH 1/4] sched/fair: Avoid an unnecessary lookup of current CPU ID during wake_affine Mel Gorman
@ 2018-02-12 14:58 ` Mel Gorman
2018-02-12 14:58 ` [PATCH 3/4] sched/fair: Do not migrate on wake_affine_weight if weights are equal Mel Gorman
2018-02-12 14:58 ` [PATCH 4/4] sched/fair: Do not migrate due to a sync wakeup on exit Mel Gorman
From: Mel Gorman @ 2018-02-12 14:58 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Mike Galbraith, Matt Fleming, LKML, Mel Gorman
On sync wakeups, the effective load of the previous CPU may not be used, so
defer the calculation until it is needed.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4b3e86eb2222..c1091cb023c4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5724,7 +5724,6 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
unsigned long task_load;
this_eff_load = target_load(this_cpu, sd->wake_idx);
- prev_eff_load = source_load(prev_cpu, sd->wake_idx);
if (sync) {
unsigned long current_load = task_h_load(current);
@@ -5742,6 +5741,7 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
this_eff_load *= 100;
this_eff_load *= capacity_of(prev_cpu);
+ prev_eff_load = source_load(prev_cpu, sd->wake_idx);
prev_eff_load -= task_load;
if (sched_feat(WA_BIAS))
prev_eff_load *= 100 + (sd->imbalance_pct - 100) / 2;
--
2.15.1
* [PATCH 3/4] sched/fair: Do not migrate on wake_affine_weight if weights are equal
2018-02-12 14:58 [PATCH 0/4] Reduce migrations due to load imbalance and process exits Mel Gorman
2018-02-12 14:58 ` [PATCH 1/4] sched/fair: Avoid an unnecessary lookup of current CPU ID during wake_affine Mel Gorman
2018-02-12 14:58 ` [PATCH 2/4] sched/fair: Defer calculation of prev_eff_load in wake_affine until needed Mel Gorman
@ 2018-02-12 14:58 ` Mel Gorman
2018-02-12 17:29 ` Peter Zijlstra
2018-02-12 14:58 ` [PATCH 4/4] sched/fair: Do not migrate due to a sync wakeup on exit Mel Gorman
From: Mel Gorman @ 2018-02-12 14:58 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Mike Galbraith, Matt Fleming, LKML, Mel Gorman
wake_affine_weight() will consider migrating a task to, or near, the current
CPU if there is a load imbalance. If the CPUs share LLC then either CPU
is valid as a search-for-idle-sibling target and equally appropriate for
stacking two tasks on one CPU if an idle sibling is unavailable. If they do
not share cache then a cross-node migration potentially impacts locality,
so while they are equal from a CPU capacity point of view, they are not
equal in terms of memory locality. In either case, it's more appropriate
to migrate only if there is a difference in their effective load.
This patch modifies wake_affine_weight() to consider migrating a task on a
normal wakeup only if there is a load imbalance, while still allowing
potential stacking when the loads are equal and it's a sync wakeup.
For the most part, the difference in performance is marginal. For example,
on a 4-socket server running netperf UDP_STREAM on localhost the differences
are as follows:
4.15.0 4.15.0
16rc0 noequal-v1r23
Hmean send-64 355.47 ( 0.00%) 349.50 ( -1.68%)
Hmean send-128 697.98 ( 0.00%) 693.35 ( -0.66%)
Hmean send-256 1328.02 ( 0.00%) 1318.77 ( -0.70%)
Hmean send-1024 5051.83 ( 0.00%) 5051.11 ( -0.01%)
Hmean send-2048 9637.02 ( 0.00%) 9601.34 ( -0.37%)
Hmean send-3312 14355.37 ( 0.00%) 14414.51 ( 0.41%)
Hmean send-4096 16464.97 ( 0.00%) 16301.37 ( -0.99%)
Hmean send-8192 26722.42 ( 0.00%) 26428.95 ( -1.10%)
Hmean send-16384 38137.81 ( 0.00%) 38046.11 ( -0.24%)
Hmean recv-64 355.47 ( 0.00%) 349.50 ( -1.68%)
Hmean recv-128 697.98 ( 0.00%) 693.35 ( -0.66%)
Hmean recv-256 1328.02 ( 0.00%) 1318.77 ( -0.70%)
Hmean recv-1024 5051.83 ( 0.00%) 5051.11 ( -0.01%)
Hmean recv-2048 9636.95 ( 0.00%) 9601.30 ( -0.37%)
Hmean recv-3312 14355.32 ( 0.00%) 14414.48 ( 0.41%)
Hmean recv-4096 16464.74 ( 0.00%) 16301.16 ( -0.99%)
Hmean recv-8192 26721.63 ( 0.00%) 26428.17 ( -1.10%)
Hmean recv-16384 38136.00 ( 0.00%) 38044.88 ( -0.24%)
Stddev send-64 7.30 ( 0.00%) 4.75 ( 34.96%)
Stddev send-128 15.15 ( 0.00%) 22.38 ( -47.66%)
Stddev send-256 13.99 ( 0.00%) 19.14 ( -36.81%)
Stddev send-1024 105.73 ( 0.00%) 67.38 ( 36.27%)
Stddev send-2048 294.57 ( 0.00%) 223.88 ( 24.00%)
Stddev send-3312 302.28 ( 0.00%) 271.74 ( 10.10%)
Stddev send-4096 195.92 ( 0.00%) 121.10 ( 38.19%)
Stddev send-8192 399.71 ( 0.00%) 563.77 ( -41.04%)
Stddev send-16384 1163.47 ( 0.00%) 1103.68 ( 5.14%)
Stddev recv-64 7.30 ( 0.00%) 4.75 ( 34.96%)
Stddev recv-128 15.15 ( 0.00%) 22.38 ( -47.66%)
Stddev recv-256 13.99 ( 0.00%) 19.14 ( -36.81%)
Stddev recv-1024 105.73 ( 0.00%) 67.38 ( 36.27%)
Stddev recv-2048 294.59 ( 0.00%) 223.89 ( 24.00%)
Stddev recv-3312 302.24 ( 0.00%) 271.75 ( 10.09%)
Stddev recv-4096 196.03 ( 0.00%) 121.14 ( 38.20%)
Stddev recv-8192 399.86 ( 0.00%) 563.65 ( -40.96%)
Stddev recv-16384 1163.79 ( 0.00%) 1103.86 ( 5.15%)
The difference in overall performance is marginal, but note that most
measurements are less variable. There were similar observations for other
netperf comparisons. hackbench, with either sockets or pipes and either
processes or threads, showed minor differences with some reduction in
migrations. tbench showed only marginal differences that were within the
noise. dbench, regardless of filesystem, showed minor differences, all of
which were within the noise. Multiple machines, both UMA and NUMA, were
tested without any regressions showing up.
The biggest risk with a patch like this is affecting wakeup latencies.
However, the schbench load from Facebook, which is very sensitive to wakeup
latency, showed a mixed result with mostly improvements in wakeup latency:
4.15.0 4.15.0
16rc0 noequal-v1r23
Lat 50.00th-qrtle-1 38.00 ( 0.00%) 38.00 ( 0.00%)
Lat 75.00th-qrtle-1 49.00 ( 0.00%) 41.00 ( 16.33%)
Lat 90.00th-qrtle-1 52.00 ( 0.00%) 50.00 ( 3.85%)
Lat 95.00th-qrtle-1 54.00 ( 0.00%) 51.00 ( 5.56%)
Lat 99.00th-qrtle-1 63.00 ( 0.00%) 60.00 ( 4.76%)
Lat 99.50th-qrtle-1 66.00 ( 0.00%) 61.00 ( 7.58%)
Lat 99.90th-qrtle-1 78.00 ( 0.00%) 65.00 ( 16.67%)
Lat 50.00th-qrtle-2 38.00 ( 0.00%) 38.00 ( 0.00%)
Lat 75.00th-qrtle-2 42.00 ( 0.00%) 43.00 ( -2.38%)
Lat 90.00th-qrtle-2 46.00 ( 0.00%) 48.00 ( -4.35%)
Lat 95.00th-qrtle-2 49.00 ( 0.00%) 50.00 ( -2.04%)
Lat 99.00th-qrtle-2 55.00 ( 0.00%) 57.00 ( -3.64%)
Lat 99.50th-qrtle-2 58.00 ( 0.00%) 60.00 ( -3.45%)
Lat 99.90th-qrtle-2 65.00 ( 0.00%) 68.00 ( -4.62%)
Lat 50.00th-qrtle-4 41.00 ( 0.00%) 41.00 ( 0.00%)
Lat 75.00th-qrtle-4 45.00 ( 0.00%) 46.00 ( -2.22%)
Lat 90.00th-qrtle-4 50.00 ( 0.00%) 50.00 ( 0.00%)
Lat 95.00th-qrtle-4 54.00 ( 0.00%) 53.00 ( 1.85%)
Lat 99.00th-qrtle-4 61.00 ( 0.00%) 61.00 ( 0.00%)
Lat 99.50th-qrtle-4 65.00 ( 0.00%) 64.00 ( 1.54%)
Lat 99.90th-qrtle-4 76.00 ( 0.00%) 82.00 ( -7.89%)
Lat 50.00th-qrtle-8 48.00 ( 0.00%) 46.00 ( 4.17%)
Lat 75.00th-qrtle-8 55.00 ( 0.00%) 54.00 ( 1.82%)
Lat 90.00th-qrtle-8 60.00 ( 0.00%) 59.00 ( 1.67%)
Lat 95.00th-qrtle-8 63.00 ( 0.00%) 63.00 ( 0.00%)
Lat 99.00th-qrtle-8 71.00 ( 0.00%) 69.00 ( 2.82%)
Lat 99.50th-qrtle-8 74.00 ( 0.00%) 73.00 ( 1.35%)
Lat 99.90th-qrtle-8 98.00 ( 0.00%) 90.00 ( 8.16%)
Lat 50.00th-qrtle-16 56.00 ( 0.00%) 55.00 ( 1.79%)
Lat 75.00th-qrtle-16 68.00 ( 0.00%) 67.00 ( 1.47%)
Lat 90.00th-qrtle-16 77.00 ( 0.00%) 78.00 ( -1.30%)
Lat 95.00th-qrtle-16 82.00 ( 0.00%) 84.00 ( -2.44%)
Lat 99.00th-qrtle-16 90.00 ( 0.00%) 93.00 ( -3.33%)
Lat 99.50th-qrtle-16 93.00 ( 0.00%) 97.00 ( -4.30%)
Lat 99.90th-qrtle-16 110.00 ( 0.00%) 110.00 ( 0.00%)
Lat 50.00th-qrtle-32 68.00 ( 0.00%) 62.00 ( 8.82%)
Lat 75.00th-qrtle-32 90.00 ( 0.00%) 83.00 ( 7.78%)
Lat 90.00th-qrtle-32 110.00 ( 0.00%) 100.00 ( 9.09%)
Lat 95.00th-qrtle-32 122.00 ( 0.00%) 111.00 ( 9.02%)
Lat 99.00th-qrtle-32 145.00 ( 0.00%) 133.00 ( 8.28%)
Lat 99.50th-qrtle-32 154.00 ( 0.00%) 143.00 ( 7.14%)
Lat 99.90th-qrtle-32 2316.00 ( 0.00%) 515.00 ( 77.76%)
Lat 50.00th-qrtle-35 69.00 ( 0.00%) 72.00 ( -4.35%)
Lat 75.00th-qrtle-35 92.00 ( 0.00%) 95.00 ( -3.26%)
Lat 90.00th-qrtle-35 111.00 ( 0.00%) 114.00 ( -2.70%)
Lat 95.00th-qrtle-35 122.00 ( 0.00%) 124.00 ( -1.64%)
Lat 99.00th-qrtle-35 142.00 ( 0.00%) 144.00 ( -1.41%)
Lat 99.50th-qrtle-35 150.00 ( 0.00%) 154.00 ( -2.67%)
Lat 99.90th-qrtle-35 6104.00 ( 0.00%) 5640.00 ( 7.60%)
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
kernel/sched/fair.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c1091cb023c4..28c8d9c91955 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5747,7 +5747,16 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
prev_eff_load *= 100 + (sd->imbalance_pct - 100) / 2;
prev_eff_load *= capacity_of(this_cpu);
- return this_eff_load <= prev_eff_load ? this_cpu : nr_cpumask_bits;
+ /*
+ * If sync, adjust the weight of prev_eff_load such that if
+ * prev_eff == this_eff that select_idle_sibling will consider
+ * stacking the wakee on top of the waker if no other CPU is
+ * idle.
+ */
+ if (sync)
+ prev_eff_load += 1;
+
+ return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
}
static int wake_affine(struct sched_domain *sd, struct task_struct *p,
--
2.15.1
* [PATCH 4/4] sched/fair: Do not migrate due to a sync wakeup on exit
2018-02-12 14:58 [PATCH 0/4] Reduce migrations due to load imbalance and process exits Mel Gorman
2018-02-12 14:58 ` [PATCH 3/4] sched/fair: Do not migrate on wake_affine_weight if weights are equal Mel Gorman
@ 2018-02-12 14:58 ` Mel Gorman
2018-02-12 17:31 ` Peter Zijlstra
2018-02-12 20:01 ` Andreas Mohr
From: Mel Gorman @ 2018-02-12 14:58 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Mike Galbraith, Matt Fleming, LKML, Mel Gorman
When a task exits, it notifies its parent that it has exited. This is a sync
wakeup and the exiting task may pull the parent towards the waker's CPU. For
even simple workloads like using a shell, it was observed that the shell
is pulled across nodes by exiting processes. This is daft as the parent
may be long-lived and properly placed. This patch special-cases a sync
wakeup on exit to avoid pulling tasks across nodes. Testing on a range
of workloads and machines showed very little difference in performance,
although there was a small 3% boost on some machines running a
shellscript-intensive workload (the git regression test suite).
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
kernel/sched/fair.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 28c8d9c91955..50442697b455 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5710,8 +5710,14 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
if (idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
return idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
- if (sync && cpu_rq(this_cpu)->nr_running == 1)
+ if (sync && cpu_rq(this_cpu)->nr_running == 1) {
+ /* Avoid tasks exiting pulling parents to new nodes */
+ if ((current->flags & PF_EXITING) &&
+ !cpus_share_cache(this_cpu, prev_cpu))
+ return prev_cpu;
+
return this_cpu;
+ }
return nr_cpumask_bits;
}
--
2.15.1
* Re: [PATCH 3/4] sched/fair: Do not migrate on wake_affine_weight if weights are equal
2018-02-12 14:58 ` [PATCH 3/4] sched/fair: Do not migrate on wake_affine_weight if weights are equal Mel Gorman
@ 2018-02-12 17:29 ` Peter Zijlstra
2018-02-12 17:43 ` Mel Gorman
2018-02-12 19:14 ` Mike Galbraith
From: Peter Zijlstra @ 2018-02-12 17:29 UTC (permalink / raw)
To: Mel Gorman; +Cc: Mike Galbraith, Matt Fleming, LKML
On Mon, Feb 12, 2018 at 02:58:56PM +0000, Mel Gorman wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c1091cb023c4..28c8d9c91955 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5747,7 +5747,16 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
> prev_eff_load *= 100 + (sd->imbalance_pct - 100) / 2;
> prev_eff_load *= capacity_of(this_cpu);
>
> - return this_eff_load <= prev_eff_load ? this_cpu : nr_cpumask_bits;
> + /*
> + * If sync, adjust the weight of prev_eff_load such that if
> + * prev_eff == this_eff that select_idle_sibling will consider
> + * stacking the wakee on top of the waker if no other CPU is
> + * idle.
> + */
> + if (sync)
> + prev_eff_load += 1;
So where we had <= and would consistently favour pulling the task to the
waking CPU when all else was equal, you now switch to <, such that when
things are equal we do not pull.
That makes sense I suppose.
Except for sync wakeups, where you say, if everything else is equal,
pull, which also makes sense, because sync says 'current' promises to go
away.
OK.
> +
> + return this_eff_load < prev_eff_load ? this_cpu : nr_cpumask_bits;
> }
>
> static int wake_affine(struct sched_domain *sd, struct task_struct *p,
> --
> 2.15.1
>
* Re: [PATCH 4/4] sched/fair: Do not migrate due to a sync wakeup on exit
2018-02-12 14:58 ` [PATCH 4/4] sched/fair: Do not migrate due to a sync wakeup on exit Mel Gorman
@ 2018-02-12 17:31 ` Peter Zijlstra
2018-02-12 17:47 ` Mel Gorman
2018-02-12 20:01 ` Andreas Mohr
From: Peter Zijlstra @ 2018-02-12 17:31 UTC (permalink / raw)
To: Mel Gorman; +Cc: Mike Galbraith, Matt Fleming, LKML
On Mon, Feb 12, 2018 at 02:58:57PM +0000, Mel Gorman wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 28c8d9c91955..50442697b455 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5710,8 +5710,14 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
> if (idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
> return idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
>
> - if (sync && cpu_rq(this_cpu)->nr_running == 1)
> + if (sync && cpu_rq(this_cpu)->nr_running == 1) {
> + /* Avoid tasks exiting pulling parents to new nodes */
> + if ((current->flags & PF_EXITING) &&
> + !cpus_share_cache(this_cpu, prev_cpu))
> + return prev_cpu;
> +
Cute, but should we not kill @sync right at the source in this case?
Something a little like this?
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5eb3ffc9be84..568ea4ce5b36 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6342,7 +6342,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
int cpu = smp_processor_id();
int new_cpu = prev_cpu;
int want_affine = 0;
- int sync = wake_flags & WF_SYNC;
+ int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
if (sd_flag & SD_BALANCE_WAKE) {
record_wakee(p);
* Re: [PATCH 3/4] sched/fair: Do not migrate on wake_affine_weight if weights are equal
2018-02-12 17:29 ` Peter Zijlstra
@ 2018-02-12 17:43 ` Mel Gorman
2018-02-12 19:14 ` Mike Galbraith
From: Mel Gorman @ 2018-02-12 17:43 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Mike Galbraith, Matt Fleming, LKML
On Mon, Feb 12, 2018 at 06:29:26PM +0100, Peter Zijlstra wrote:
> On Mon, Feb 12, 2018 at 02:58:56PM +0000, Mel Gorman wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index c1091cb023c4..28c8d9c91955 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5747,7 +5747,16 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
> > prev_eff_load *= 100 + (sd->imbalance_pct - 100) / 2;
> > prev_eff_load *= capacity_of(this_cpu);
> >
> > - return this_eff_load <= prev_eff_load ? this_cpu : nr_cpumask_bits;
> > + /*
> > + * If sync, adjust the weight of prev_eff_load such that if
> > + * prev_eff == this_eff that select_idle_sibling will consider
> > + * stacking the wakee on top of the waker if no other CPU is
> > + * idle.
> > + */
> > + if (sync)
> > + prev_eff_load += 1;
>
> So where we had <= and would consistently favour pulling the task to the
> waking CPU when all else what equal, you now switch to <, such that when
> things are equal we do not pull.
>
> That makes sense I suppose.
>
Yep, with the addendum that when CPU load is equal, it does not
necessarily mean they are equal in terms of memory locality. It might
make more sense to use <= if there were more cases where we stacked
tasks on the same CPU but we avoid that as much as possible for good
reasons.
--
Mel Gorman
SUSE Labs
* Re: [PATCH 4/4] sched/fair: Do not migrate due to a sync wakeup on exit
2018-02-12 17:31 ` Peter Zijlstra
@ 2018-02-12 17:47 ` Mel Gorman
2018-02-12 18:01 ` Mel Gorman
From: Mel Gorman @ 2018-02-12 17:47 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Mike Galbraith, Matt Fleming, LKML
On Mon, Feb 12, 2018 at 06:31:47PM +0100, Peter Zijlstra wrote:
> On Mon, Feb 12, 2018 at 02:58:57PM +0000, Mel Gorman wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 28c8d9c91955..50442697b455 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5710,8 +5710,14 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
> > if (idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
> > return idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
> >
> > - if (sync && cpu_rq(this_cpu)->nr_running == 1)
> > + if (sync && cpu_rq(this_cpu)->nr_running == 1) {
> > + /* Avoid tasks exiting pulling parents to new nodes */
> > + if ((current->flags & PF_EXITING) &&
> > + !cpus_share_cache(this_cpu, prev_cpu))
> > + return prev_cpu;
> > +
>
> Cute, but should we not kill @sync right at the source in this case?
>
> Something a little like this?
>
That works too, but note that it has a slight disadvantage in that we
add weight to the select_task_rq_fair fast path for all wakeups, not just
those that are affine. It's also not 100% functionally equivalent, as I was
still allowing parents to migrate to the CPU if it shared cache, although I
expect any advantage there is marginal. I can replace my patch with yours
if you really prefer it, as the overhead in the fast path should be marginal.
--
Mel Gorman
SUSE Labs
* Re: [PATCH 4/4] sched/fair: Do not migrate due to a sync wakeup on exit
2018-02-12 17:47 ` Mel Gorman
@ 2018-02-12 18:01 ` Mel Gorman
From: Mel Gorman @ 2018-02-12 18:01 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Mike Galbraith, Matt Fleming, LKML
On Mon, Feb 12, 2018 at 05:47:29PM +0000, Mel Gorman wrote:
> On Mon, Feb 12, 2018 at 06:31:47PM +0100, Peter Zijlstra wrote:
> > On Mon, Feb 12, 2018 at 02:58:57PM +0000, Mel Gorman wrote:
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 28c8d9c91955..50442697b455 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -5710,8 +5710,14 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
> > > if (idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
> > > return idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
> > >
> > > - if (sync && cpu_rq(this_cpu)->nr_running == 1)
> > > + if (sync && cpu_rq(this_cpu)->nr_running == 1) {
> > > + /* Avoid tasks exiting pulling parents to new nodes */
> > > + if ((current->flags & PF_EXITING) &&
> > > + !cpus_share_cache(this_cpu, prev_cpu))
> > > + return prev_cpu;
> > > +
> >
> > Cute, but should we not kill @sync right at the source in this case?
> >
> > Something a little like this?
> >
>
> That works too but note that it has a slightly disadvantage in that we
> add weight to the select_task_rq_fair fast path for all wakeups, not just
> those that are affine. It's also not 100% functionally equivalent as I was
> still allowing parents to migrate to the CPU if it shared cache although I
> expect any advantage there is marginal. I can replace my patch with yours
> if you really prefer it as the overhead in the fast path should be marginal.
>
Clearly my brains leaked out my ears at the end of a work day. The overhead
only applies to sync wakeups, and is very small in the case where this_cpu
or prev_cpu was idle, so I'm happy with either version.
--
Mel Gorman
SUSE Labs
* Re: [PATCH 3/4] sched/fair: Do not migrate on wake_affine_weight if weights are equal
2018-02-12 17:29 ` Peter Zijlstra
2018-02-12 17:43 ` Mel Gorman
@ 2018-02-12 19:14 ` Mike Galbraith
2018-02-12 19:57 ` Peter Zijlstra
2018-02-12 20:20 ` Mel Gorman
From: Mike Galbraith @ 2018-02-12 19:14 UTC (permalink / raw)
To: Peter Zijlstra, Mel Gorman; +Cc: Matt Fleming, LKML
On Mon, 2018-02-12 at 18:29 +0100, Peter Zijlstra wrote:
> On Mon, Feb 12, 2018 at 02:58:56PM +0000, Mel Gorman wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index c1091cb023c4..28c8d9c91955 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5747,7 +5747,16 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
> > prev_eff_load *= 100 + (sd->imbalance_pct - 100) / 2;
> > prev_eff_load *= capacity_of(this_cpu);
> >
> > - return this_eff_load <= prev_eff_load ? this_cpu : nr_cpumask_bits;
> > + /*
> > + * If sync, adjust the weight of prev_eff_load such that if
> > + * prev_eff == this_eff that select_idle_sibling will consider
> > + * stacking the wakee on top of the waker if no other CPU is
> > + * idle.
> > + */
> > + if (sync)
> > + prev_eff_load += 1;
>
> So where we had <= and would consistently favour pulling the task to the
> waking CPU when all else was equal, you now switch to <, such that when
> things are equal we do not pull.
>
> That makes sense I suppose.
>
> Except for sync wakeups, where you say, if everything else is equal,
> pull, which also makes sense, because sync says 'current' promises to go
> away.
>
> OK.
Tasks tend to not honor that promise.. a lot. Even if the sync hint
were a golden promise, it wouldn't be bullet proof: migrating a compute
hog based on a single "phoned home" wakeup remains a bad idea whether
schedule() is called immediately after the wakeup or not. It's a pain; the
only useful informational content is "this is a communication wakeup".
-Mike
* Re: [PATCH 3/4] sched/fair: Do not migrate on wake_affine_weight if weights are equal
2018-02-12 19:14 ` Mike Galbraith
@ 2018-02-12 19:57 ` Peter Zijlstra
2018-02-12 20:20 ` Mel Gorman
From: Peter Zijlstra @ 2018-02-12 19:57 UTC (permalink / raw)
To: Mike Galbraith; +Cc: Mel Gorman, Matt Fleming, LKML
On Mon, Feb 12, 2018 at 08:14:49PM +0100, Mike Galbraith wrote:
> On Mon, 2018-02-12 at 18:29 +0100, Peter Zijlstra wrote:
> > Except for sync wakeups, where you say, if everything else is equal,
> > pull, which also makes sense, because sync says 'current' promises to go
> > away.
>
> Tasks tend to not honor that promise.. a lot.
Yes I know.. still its something and its a very small bias in this case.
* Re: [PATCH 4/4] sched/fair: Do not migrate due to a sync wakeup on exit
2018-02-12 14:58 ` [PATCH 4/4] sched/fair: Do not migrate due to a sync wakeup on exit Mel Gorman
2018-02-12 17:31 ` Peter Zijlstra
@ 2018-02-12 20:01 ` Andreas Mohr
2018-02-12 20:09 ` Mel Gorman
From: Andreas Mohr @ 2018-02-12 20:01 UTC (permalink / raw)
To: Mel Gorman; +Cc: Peter Zijlstra, Mike Galbraith, Matt Fleming, LKML
Hello,
On Mon, Feb 12, 2018 at 02:58:57PM +0000, Mel Gorman wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 28c8d9c91955..50442697b455 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5710,8 +5710,14 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
> if (idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
> return idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
>
> - if (sync && cpu_rq(this_cpu)->nr_running == 1)
> + if (sync && cpu_rq(this_cpu)->nr_running == 1) {
> + /* Avoid tasks exiting pulling parents to new nodes */
Semi-confusing wording (WTH are "pulling parents"? :) - possibly better variant?:
Avoid having exiting tasks pull their parents to new nodes
Good catch, it seems.
HTH,
Andreas Mohr
--
GNU/Linux. It's not the software that's free, it's you.
* Re: [PATCH 4/4] sched/fair: Do not migrate due to a sync wakeup on exit
2018-02-12 20:01 ` Andreas Mohr
@ 2018-02-12 20:09 ` Mel Gorman
From: Mel Gorman @ 2018-02-12 20:09 UTC (permalink / raw)
To: Andreas Mohr; +Cc: Peter Zijlstra, Mike Galbraith, Matt Fleming, LKML
On Mon, Feb 12, 2018 at 09:01:27PM +0100, Andreas Mohr wrote:
> Hello,
>
> On Mon, Feb 12, 2018 at 02:58:57PM +0000, Mel Gorman wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 28c8d9c91955..50442697b455 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5710,8 +5710,14 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
> > if (idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
> > return idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
> >
> > - if (sync && cpu_rq(this_cpu)->nr_running == 1)
> > + if (sync && cpu_rq(this_cpu)->nr_running == 1) {
> > + /* Avoid tasks exiting pulling parents to new nodes */
>
> Semi-confusing wording (WTH are "pulling parents"? :) - possibly better variant?:
> Avoid having exiting tasks pull their parents to new nodes
That is better wording. If I don't replace the patch with Peter's
version, I'll update it.
> Good catch, it seems.
>
Thanks.
--
Mel Gorman
SUSE Labs
* Re: [PATCH 3/4] sched/fair: Do not migrate on wake_affine_weight if weights are equal
2018-02-12 19:14 ` Mike Galbraith
2018-02-12 19:57 ` Peter Zijlstra
@ 2018-02-12 20:20 ` Mel Gorman
From: Mel Gorman @ 2018-02-12 20:20 UTC (permalink / raw)
To: Mike Galbraith; +Cc: Peter Zijlstra, Matt Fleming, LKML
On Mon, Feb 12, 2018 at 08:14:49PM +0100, Mike Galbraith wrote:
> On Mon, 2018-02-12 at 18:29 +0100, Peter Zijlstra wrote:
> > On Mon, Feb 12, 2018 at 02:58:56PM +0000, Mel Gorman wrote:
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index c1091cb023c4..28c8d9c91955 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -5747,7 +5747,16 @@ wake_affine_weight(struct sched_domain *sd, struct task_struct *p,
> > > prev_eff_load *= 100 + (sd->imbalance_pct - 100) / 2;
> > > prev_eff_load *= capacity_of(this_cpu);
> > >
> > > - return this_eff_load <= prev_eff_load ? this_cpu : nr_cpumask_bits;
> > > + /*
> > > + * If sync, adjust the weight of prev_eff_load such that if
> > > + * prev_eff == this_eff that select_idle_sibling will consider
> > > + * stacking the wakee on top of the waker if no other CPU is
> > > + * idle.
> > > + */
> > > + if (sync)
> > > + prev_eff_load += 1;
> >
> > So where we had <= and would consistently favour pulling the task to the
> > waking CPU when all else was equal, you now switch to <, such that when
> > things are equal we do not pull.
> >
> > That makes sense I suppose.
> >
> > Except for sync wakeups, where you say, if everything else is equal,
> > pull, which also makes sense, because sync says 'current' promises to go
> > away.
> >
> > OK.
>
> Tasks tend to not honor that promise.. a lot. Even if the sync hint
> were a golden promise, it wouldn't be bullet proof: migrating a compute
> hog based on a single "phoned home" wakeup remains a bad idea whether
> schedule() is called immediately after the wakeup or not. It's a pain; the
> only useful informational content is "this is a communication wakeup".
>
Agreed, and I'm aware of the hazard of sync wakeups being no guarantee
that the task will immediately sleep. If there is any delay at all then
stacking incurs a wakeup latency penalty, but in many cases an idle
sibling will be used. The sync hint does give a stronger hint that the
tasks are closely related though, which is why I special-cased it slightly,
and I feel it's justified. I think in all cases where it mattered, it was
due to prev_eff_load and this_eff_load being equal to 0 when waking a task
via a socket.
--
Mel Gorman
SUSE Labs