* [PATCH 1/6] sched/numa: Stop multiple tasks from moving to the cpu at the same time
2018-08-03 6:13 [PATCH 0/6] numa-balancing patches Srikar Dronamraju
@ 2018-08-03 6:13 ` Srikar Dronamraju
2018-09-10 8:42 ` Ingo Molnar
2018-08-03 6:13 ` [PATCH 2/6] mm/migrate: Use trylock while resetting rate limit Srikar Dronamraju
` (5 subsequent siblings)
6 siblings, 1 reply; 16+ messages in thread
From: Srikar Dronamraju @ 2018-08-03 6:13 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra
Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner
Task migration under NUMA balancing can happen in parallel: more than
one task might choose to migrate to the same CPU at the same time. This
can result in:
- During task swap, choosing a task that was not part of the evaluation.
- During task swap, a task that just moved to its preferred node being
moved to a completely different node.
- During task swap, a task that fails to move to its preferred node
having to wait an extra interval for the next migration opportunity.
- During task movement, multiple task movements causing load imbalance.
This problem is more likely if there are more cores per node or more
nodes in the system.
Use a per-run-queue variable to check whether NUMA balancing is active
on the run-queue.
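The claim/release protocol on that per-run-queue variable can be sketched in userspace with C11 atomics (this is an illustration of the pattern, not kernel code; the struct and helper names are invented here):

```c
#include <stdatomic.h>

/* Userspace model of the per-run-queue numa_migrate_on flag: an
 * atomic exchange claims the destination run-queue, a plain store
 * releases it once the migration attempt is done. */
struct rq_model {
	atomic_int numa_migrate_on;
};

/* Returns 1 if the caller won the right to migrate a task to @rq.
 * Mirrors "if (xchg(&rq->numa_migrate_on, 1)) return;": only the
 * first of several racing migrators sees the old value 0. */
int claim_rq(struct rq_model *rq)
{
	return atomic_exchange(&rq->numa_migrate_on, 1) == 0;
}

/* Like WRITE_ONCE(rq->numa_migrate_on, 0) after the migration. */
void release_rq(struct rq_model *rq)
{
	atomic_store(&rq->numa_migrate_on, 0);
}
```

Because the exchange is atomic, at most one migrator at a time can target a given run-queue; everyone else bails out instead of piling onto the same CPU.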
specjbb2005 / bops/JVM / higher bops are better
on 2 Socket/2 Node Intel
JVMS Prev Current %Change
4 199709 206350 3.32534
1 330830 319963 -3.28477
on 2 Socket/4 Node Power8 (PowerNV)
JVMS Prev Current %Change
8 89011.9 89627.8 0.69193
1 218946 211338 -3.47483
on 2 Socket/2 Node Power9 (PowerNV)
JVMS Prev Current %Change
4 180473 186539 3.36117
1 212805 220344 3.54268
on 4 Socket/4 Node Power7
JVMS Prev Current %Change
8 56941.8 56836 -0.185804
1 111686 112970 1.14965
dbench / transactions / higher numbers are better
on 2 Socket/2 Node Intel
count Min Max Avg Variance %Change
5 12029.8 12124.6 12060.9 34.0076
5 13136.1 13170.2 13150.2 14.7482 9.03166
on 2 Socket/4 Node Power8 (PowerNV)
count Min Max Avg Variance %Change
5 4968.51 5006.62 4981.31 13.4151
5 4319.79 4998.19 4836.53 261.109 -2.90646
on 2 Socket/2 Node Power9 (PowerNV)
count Min Max Avg Variance %Change
5 9342.92 9381.44 9363.92 12.8587
5 9325.56 9402.7 9362.49 25.9638 -0.0152714
on 4 Socket/4 Node Power7
count Min Max Avg Variance %Change
5 143.4 188.892 170.225 16.9929
5 132.581 191.072 170.554 21.6444 0.193274
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v2->v3:
Add comments as requested by Peter.
kernel/sched/fair.c | 22 ++++++++++++++++++++++
kernel/sched/sched.h | 1 +
2 files changed, 23 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 309c93f..5cf921a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1514,6 +1514,21 @@ struct task_numa_env {
static void task_numa_assign(struct task_numa_env *env,
struct task_struct *p, long imp)
{
+ struct rq *rq = cpu_rq(env->dst_cpu);
+
+ /* Bail out if run-queue part of active numa balance. */
+ if (xchg(&rq->numa_migrate_on, 1))
+ return;
+
+ /*
+ * Clear previous best_cpu/rq numa-migrate flag, since task now
+ * found a better cpu to move/swap.
+ */
+ if (env->best_cpu != -1) {
+ rq = cpu_rq(env->best_cpu);
+ WRITE_ONCE(rq->numa_migrate_on, 0);
+ }
+
if (env->best_task)
put_task_struct(env->best_task);
if (p)
@@ -1569,6 +1584,9 @@ static void task_numa_compare(struct task_numa_env *env,
long moveimp = imp;
int dist = env->dist;
+ if (READ_ONCE(dst_rq->numa_migrate_on))
+ return;
+
rcu_read_lock();
cur = task_rcu_dereference(&dst_rq->curr);
if (cur && ((cur->flags & PF_EXITING) || is_idle_task(cur)))
@@ -1710,6 +1728,7 @@ static int task_numa_migrate(struct task_struct *p)
.best_cpu = -1,
};
struct sched_domain *sd;
+ struct rq *best_rq;
unsigned long taskweight, groupweight;
int nid, ret, dist;
long taskimp, groupimp;
@@ -1811,14 +1830,17 @@ static int task_numa_migrate(struct task_struct *p)
*/
p->numa_scan_period = task_scan_start(p);
+ best_rq = cpu_rq(env.best_cpu);
if (env.best_task == NULL) {
ret = migrate_task_to(p, env.best_cpu);
+ WRITE_ONCE(best_rq->numa_migrate_on, 0);
if (ret != 0)
trace_sched_stick_numa(p, env.src_cpu, env.best_cpu);
return ret;
}
ret = migrate_swap(p, env.best_task, env.best_cpu, env.src_cpu);
+ WRITE_ONCE(best_rq->numa_migrate_on, 0);
if (ret != 0)
trace_sched_stick_numa(p, env.src_cpu, task_cpu(env.best_task));
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4a2e8ca..0b91612 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -783,6 +783,7 @@ struct rq {
#ifdef CONFIG_NUMA_BALANCING
unsigned int nr_numa_running;
unsigned int nr_preferred_running;
+ unsigned int numa_migrate_on;
#endif
#define CPU_LOAD_IDX_MAX 5
unsigned long cpu_load[CPU_LOAD_IDX_MAX];
--
1.8.3.1
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH 1/6] sched/numa: Stop multiple tasks from moving to the cpu at the same time
2018-08-03 6:13 ` [PATCH 1/6] sched/numa: Stop multiple tasks from moving to the cpu at the same time Srikar Dronamraju
@ 2018-09-10 8:42 ` Ingo Molnar
0 siblings, 0 replies; 16+ messages in thread
From: Ingo Molnar @ 2018-09-10 8:42 UTC (permalink / raw)
To: Srikar Dronamraju
Cc: Peter Zijlstra, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner
* Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:
> Task migration under numa balancing can happen in parallel. More than
> one task might choose to migrate to the same cpu at the same time. This
> can result in
> - During task swap, choosing a task that was not part of the evaluation.
> - During task swap, task which just got moved into its preferred node,
> moving to a completely different node.
> - During task swap, task failing to move to the preferred node, will have
> to wait an extra interval for the next migrate opportunity.
> - During task movement, multiple task movements can cause load imbalance.
Please capitalize both 'CPU' and 'NUMA' in changelogs and code comments.
> This problem is more likely if there are more cores per node or more
> nodes in the system.
>
> Use a per run-queue variable to check if numa-balance is active on the
> run-queue.
>
> specjbb2005 / bops/JVM / higher bops are better
> on 2 Socket/2 Node Intel
> JVMS Prev Current %Change
> 4 199709 206350 3.32534
> 1 330830 319963 -3.28477
>
>
> on 2 Socket/4 Node Power8 (PowerNV)
> JVMS Prev Current %Change
> 8 89011.9 89627.8 0.69193
> 1 218946 211338 -3.47483
>
>
> on 2 Socket/2 Node Power9 (PowerNV)
> JVMS Prev Current %Change
> 4 180473 186539 3.36117
> 1 212805 220344 3.54268
>
>
> on 4 Socket/4 Node Power7
> JVMS Prev Current %Change
> 8 56941.8 56836 -0.185804
> 1 111686 112970 1.14965
>
>
> dbench / transactions / higher numbers are better
> on 2 Socket/2 Node Intel
> count Min Max Avg Variance %Change
> 5 12029.8 12124.6 12060.9 34.0076
> 5 13136.1 13170.2 13150.2 14.7482 9.03166
>
>
> on 2 Socket/4 Node Power8 (PowerNV)
> count Min Max Avg Variance %Change
> 5 4968.51 5006.62 4981.31 13.4151
> 5 4319.79 4998.19 4836.53 261.109 -2.90646
>
>
> on 2 Socket/2 Node Power9 (PowerNV)
> count Min Max Avg Variance %Change
> 5 9342.92 9381.44 9363.92 12.8587
> 5 9325.56 9402.7 9362.49 25.9638 -0.0152714
>
>
> on 4 Socket/4 Node Power7
> count Min Max Avg Variance %Change
> 5 143.4 188.892 170.225 16.9929
> 5 132.581 191.072 170.554 21.6444 0.193274
I have applied this patch, but the zero-comments benchmark dump is annoying, as the numbers do
not show unconditional advantages - there are some increases in performance and some regressions.
In particular this:
> dbench / transactions / higher numbers are better
> on 2 Socket/4 Node Power8 (PowerNV)
> count Min Max Avg Variance %Change
> 5 4968.51 5006.62 4981.31 13.4151
> 5 4319.79 4998.19 4836.53 261.109 -2.90646
is concerning: not only did we lose some performance, variance went up by a *lot*. Is this just
a measurement fluke? We cannot know and you didn't comment.
Thanks,
Ingo
* [PATCH 2/6] mm/migrate: Use trylock while resetting rate limit
2018-08-03 6:13 [PATCH 0/6] numa-balancing patches Srikar Dronamraju
2018-08-03 6:13 ` [PATCH 1/6] sched/numa: Stop multiple tasks from moving to the cpu at the same time Srikar Dronamraju
@ 2018-08-03 6:13 ` Srikar Dronamraju
2018-09-06 11:48 ` Peter Zijlstra
2018-09-10 8:39 ` Ingo Molnar
2018-08-03 6:13 ` [PATCH 3/6] sched/numa: Avoid task migration for small numa improvement Srikar Dronamraju
` (4 subsequent siblings)
6 siblings, 2 replies; 16+ messages in thread
From: Srikar Dronamraju @ 2018-08-03 6:13 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra
Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner
Since this spinlock only serializes migrate rate limiting, convert it
to a trylock. If another task races ahead of this task, this task can
simply move on.
While here, also correct two abnormalities:
- Avoid time being stretched for every interval.
- Use READ/WRITE_ONCE with next window.
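The "interval stretch" fix can be sketched on its own: instead of setting the next window to `jiffies + interval` (which stretches each window by however long the update was delayed), the window is advanced in whole intervals until it is past "now". A minimal model with plain integers standing in for jiffies:

```c
/* Sketch of the rate-limit window update from the patch:
 *	do { next_window += interval; }
 *	while (unlikely(time_after(jiffies, next_window)));
 * Advancing in whole intervals keeps window boundaries aligned,
 * rather than restarting the interval from the moment the update
 * happened to run. Names and types are simplified for illustration. */
unsigned long advance_window(unsigned long next_window,
			     unsigned long interval,
			     unsigned long now)
{
	do {
		next_window += interval;
	} while (now > next_window);	/* models time_after(now, nw) */
	return next_window;
}
```

For example, with a window ending at 100 and an interval of 50, an update running at time 120 yields 150 (the next aligned boundary), and one running at 260 skips forward to 300 rather than to 310.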
specjbb2005 / bops/JVM / higher bops are better
on 2 Socket/2 Node Intel
JVMS Prev Current %Change
4 206350 200892 -2.64502
1 319963 325766 1.81365
on 2 Socket/2 Node Power9 (PowerNV)
JVMS Prev Current %Change
4 186539 190261 1.99529
1 220344 195305 -11.3636
on 4 Socket/4 Node Power7
JVMS Prev Current %Change
8 56836 57651.1 1.43413
1 112970 111351 -1.43312
dbench / transactions / higher numbers are better
on 2 Socket/2 Node Intel
count Min Max Avg Variance %Change
5 13136.1 13170.2 13150.2 14.7482
5 12254.7 12331.9 12297.8 28.1846 -6.48203
on 2 Socket/4 Node Power8 (PowerNV)
count Min Max Avg Variance %Change
5 4319.79 4998.19 4836.53 261.109
5 4997.83 5030.14 5015.54 12.947 3.70121
on 2 Socket/2 Node Power9 (PowerNV)
count Min Max Avg Variance %Change
5 9325.56 9402.7 9362.49 25.9638
5 9331.84 9375.11 9352.04 16.0703 -0.111616
on 4 Socket/4 Node Power7
count Min Max Avg Variance %Change
5 132.581 191.072 170.554 21.6444
5 147.55 181.605 168.963 11.3513 -0.932842
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
---
Changelog v1->v2:
Fix stretch every interval pointed by Peter Zijlstra.
Verified that some of the regression is due to fixing interval stretch.
mm/migrate.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)
diff --git a/mm/migrate.c b/mm/migrate.c
index 8c0af0f..dbc2cb7 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1868,16 +1868,24 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
static bool numamigrate_update_ratelimit(pg_data_t *pgdat,
unsigned long nr_pages)
{
+ unsigned long next_window, interval;
+
+ next_window = READ_ONCE(pgdat->numabalancing_migrate_next_window);
+ interval = msecs_to_jiffies(migrate_interval_millisecs);
+
/*
* Rate-limit the amount of data that is being migrated to a node.
* Optimal placement is no good if the memory bus is saturated and
* all the time is being spent migrating!
*/
- if (time_after(jiffies, next_window)) {
- spin_lock(&pgdat->numabalancing_migrate_lock);
+ if (time_after(jiffies, next_window) &&
+ spin_trylock(&pgdat->numabalancing_migrate_lock)) {
pgdat->numabalancing_migrate_nr_pages = 0;
- pgdat->numabalancing_migrate_next_window = jiffies +
- msecs_to_jiffies(migrate_interval_millisecs);
+ do {
+ next_window += interval;
+ } while (unlikely(time_after(jiffies, next_window)));
+
+ WRITE_ONCE(pgdat->numabalancing_migrate_next_window, next_window);
spin_unlock(&pgdat->numabalancing_migrate_lock);
}
if (pgdat->numabalancing_migrate_nr_pages > ratelimit_pages) {
--
1.8.3.1
* Re: [PATCH 2/6] mm/migrate: Use trylock while resetting rate limit
2018-08-03 6:13 ` [PATCH 2/6] mm/migrate: Use trylock while resetting rate limit Srikar Dronamraju
@ 2018-09-06 11:48 ` Peter Zijlstra
2018-09-10 8:39 ` Ingo Molnar
1 sibling, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2018-09-06 11:48 UTC (permalink / raw)
To: Srikar Dronamraju
Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner
On Fri, Aug 03, 2018 at 11:43:57AM +0530, Srikar Dronamraju wrote:
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 8c0af0f..dbc2cb7 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1868,16 +1868,24 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
> static bool numamigrate_update_ratelimit(pg_data_t *pgdat,
> unsigned long nr_pages)
> {
> + unsigned long next_window, interval;
> +
> + next_window = READ_ONCE(pgdat->numabalancing_migrate_next_window);
> + interval = msecs_to_jiffies(migrate_interval_millisecs);
> +
> /*
> * Rate-limit the amount of data that is being migrated to a node.
> * Optimal placement is no good if the memory bus is saturated and
> * all the time is being spent migrating!
> */
> - if (time_after(jiffies, next_window)) {
> - spin_lock(&pgdat->numabalancing_migrate_lock);
> + if (time_after(jiffies, next_window) &&
> + spin_trylock(&pgdat->numabalancing_migrate_lock)) {
This patch doesn't apply cleanly; also you introduce @next_window with
this patch, so how can it remove a user of it.
I fixed this up, still weird.
* Re: [PATCH 2/6] mm/migrate: Use trylock while resetting rate limit
2018-08-03 6:13 ` [PATCH 2/6] mm/migrate: Use trylock while resetting rate limit Srikar Dronamraju
2018-09-06 11:48 ` Peter Zijlstra
@ 2018-09-10 8:39 ` Ingo Molnar
1 sibling, 0 replies; 16+ messages in thread
From: Ingo Molnar @ 2018-09-10 8:39 UTC (permalink / raw)
To: Srikar Dronamraju
Cc: Peter Zijlstra, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner
* Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:
> Since this spinlock will only serialize migrate rate limiting,
> convert the spinlock to a trylock. If another task races ahead of this task
> then this task can simply move on.
>
> While here, add correct two abnormalities.
> - Avoid time being stretched for every interval.
> - Use READ/WRITE_ONCE with next window.
>
> specjbb2005 / bops/JVM / higher bops are better
> on 2 Socket/2 Node Intel
> JVMS Prev Current %Change
> 4 206350 200892 -2.64502
> 1 319963 325766 1.81365
>
>
> on 2 Socket/2 Node Power9 (PowerNV)
> JVMS Prev Current %Change
> 4 186539 190261 1.99529
> 1 220344 195305 -11.3636
>
>
> on 4 Socket/4 Node Power7
> JVMS Prev Current %Change
> 8 56836 57651.1 1.43413
> 1 112970 111351 -1.43312
>
>
> dbench / transactions / higher numbers are better
> on 2 Socket/2 Node Intel
> count Min Max Avg Variance %Change
> 5 13136.1 13170.2 13150.2 14.7482
> 5 12254.7 12331.9 12297.8 28.1846 -6.48203
>
>
> on 2 Socket/4 Node Power8 (PowerNV)
> count Min Max Avg Variance %Change
> 5 4319.79 4998.19 4836.53 261.109
> 5 4997.83 5030.14 5015.54 12.947 3.70121
>
>
> on 2 Socket/2 Node Power9 (PowerNV)
> count Min Max Avg Variance %Change
> 5 9325.56 9402.7 9362.49 25.9638
> 5 9331.84 9375.11 9352.04 16.0703 -0.111616
>
>
> on 4 Socket/4 Node Power7
> count Min Max Avg Variance %Change
> 5 132.581 191.072 170.554 21.6444
> 5 147.55 181.605 168.963 11.3513 -0.932842
Firstly, *please* always characterize benchmark runs. What did you find? How should we
interpret the result? Are there any tradeoffs?
*Don't* just dump them on us.
Because in this particular case the results are not obvious, at all:
> specjbb2005 / bops/JVM / higher bops are better
> on 2 Socket/2 Node Intel
> JVMS Prev Current %Change
> 4 206350 200892 -2.64502
> 1 319963 325766 1.81365
>
>
> on 2 Socket/2 Node Power9 (PowerNV)
> JVMS Prev Current %Change
> 4 186539 190261 1.99529
> 1 220344 195305 -11.3636
>
>
> on 4 Socket/4 Node Power7
> JVMS Prev Current %Change
> 8 56836 57651.1 1.43413
> 1 112970 111351 -1.43312
Why is this better? The largest drop is 11% which seems significant.
Thanks,
Ingo
* [PATCH 3/6] sched/numa: Avoid task migration for small numa improvement
2018-08-03 6:13 [PATCH 0/6] numa-balancing patches Srikar Dronamraju
2018-08-03 6:13 ` [PATCH 1/6] sched/numa: Stop multiple tasks from moving to the cpu at the same time Srikar Dronamraju
2018-08-03 6:13 ` [PATCH 2/6] mm/migrate: Use trylock while resetting rate limit Srikar Dronamraju
@ 2018-08-03 6:13 ` Srikar Dronamraju
2018-09-10 8:46 ` Ingo Molnar
2018-08-03 6:13 ` [PATCH 4/6] sched/numa: Pass destination cpu as a parameter to migrate_task_rq Srikar Dronamraju
` (3 subsequent siblings)
6 siblings, 1 reply; 16+ messages in thread
From: Srikar Dronamraju @ 2018-08-03 6:13 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra
Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner
If the NUMA improvement from a task migration is going to be very
minimal, avoid the migration.
specjbb2005 / bops/JVM / higher bops are better
on 2 Socket/2 Node Intel
JVMS Prev Current %Change
4 200892 210118 4.59252
1 325766 313171 -3.86627
on 2 Socket/4 Node Power8 (PowerNV)
JVMS Prev Current %Change
8 89011.9 91027.5 2.26442
1 211338 216460 2.42361
on 2 Socket/2 Node Power9 (PowerNV)
JVMS Prev Current %Change
4 190261 191918 0.870909
1 195305 207043 6.01009
on 4 Socket/4 Node Power7
JVMS Prev Current %Change
8 57651.1 58462.1 1.40674
1 111351 108334 -2.70945
dbench / transactions / higher numbers are better
on 2 Socket/2 Node Intel
count Min Max Avg Variance %Change
5 12254.7 12331.9 12297.8 28.1846
5 11851.8 11937.3 11890.9 33.5169 -3.30872
on 2 Socket/4 Node Power8 (PowerNV)
count Min Max Avg Variance %Change
5 4997.83 5030.14 5015.54 12.947
5 4791 5016.08 4962.55 85.9625 -1.05652
on 2 Socket/2 Node Power9 (PowerNV)
count Min Max Avg Variance %Change
5 9331.84 9375.11 9352.04 16.0703
5 9353.43 9380.49 9369.6 9.04361 0.187767
on 4 Socket/4 Node Power7
count Min Max Avg Variance %Change
5 147.55 181.605 168.963 11.3513
5 149.518 215.412 179.083 21.5903 5.98948
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v1->v2:
- Handle trivial changes due to variable name change. (Rik Van Riel)
- Drop changes where subsequent better cpu find was rejected for
small numa improvement (Rik Van Riel).
kernel/sched/fair.c | 23 ++++++++++++++++++-----
1 file changed, 18 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5cf921a..a717870 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1568,6 +1568,13 @@ static bool load_too_imbalanced(long src_load, long dst_load,
}
/*
+ * Maximum numa importance can be 1998 (2*999);
+ * SMALLIMP @ 30 would be close to 1998/64.
+ * Used to deter task migration.
+ */
+#define SMALLIMP 30
+
+/*
* This checks if the overall compute and NUMA accesses of the system would
* be improved if the source tasks was migrated to the target dst_cpu taking
* into account that it might be best if task running on the dst_cpu should
@@ -1600,7 +1607,7 @@ static void task_numa_compare(struct task_numa_env *env,
goto unlock;
if (!cur) {
- if (maymove || imp > env->best_imp)
+ if (maymove && moveimp >= env->best_imp)
goto assign;
else
goto unlock;
@@ -1643,16 +1650,22 @@ static void task_numa_compare(struct task_numa_env *env,
task_weight(cur, env->dst_nid, dist);
}
- if (imp <= env->best_imp)
- goto unlock;
-
if (maymove && moveimp > imp && moveimp > env->best_imp) {
- imp = moveimp - 1;
+ imp = moveimp;
cur = NULL;
goto assign;
}
/*
+ * If the numa importance is less than SMALLIMP,
+ * task migration might only result in ping pong
+ * of tasks and also hurt performance due to cache
+ * misses.
+ */
+ if (imp < SMALLIMP || imp <= env->best_imp + SMALLIMP / 2)
+ goto unlock;
+
+ /*
* In the overloaded case, try and keep the load balanced.
*/
load = task_h_load(env->p) - task_h_load(cur);
--
1.8.3.1
* Re: [PATCH 3/6] sched/numa: Avoid task migration for small numa improvement
2018-08-03 6:13 ` [PATCH 3/6] sched/numa: Avoid task migration for small numa improvement Srikar Dronamraju
@ 2018-09-10 8:46 ` Ingo Molnar
2018-09-12 15:17 ` Srikar Dronamraju
0 siblings, 1 reply; 16+ messages in thread
From: Ingo Molnar @ 2018-09-10 8:46 UTC (permalink / raw)
To: Srikar Dronamraju
Cc: Peter Zijlstra, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner
* Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:
> If numa improvement from the task migration is going to be very
> minimal, then avoid task migration.
>
> specjbb2005 / bops/JVM / higher bops are better
> on 2 Socket/2 Node Intel
> JVMS Prev Current %Change
> 4 200892 210118 4.59252
> 1 325766 313171 -3.86627
>
>
> on 2 Socket/4 Node Power8 (PowerNV)
> JVMS Prev Current %Change
> 8 89011.9 91027.5 2.26442
> 1 211338 216460 2.42361
>
>
> on 2 Socket/2 Node Power9 (PowerNV)
> JVMS Prev Current %Change
> 4 190261 191918 0.870909
> 1 195305 207043 6.01009
>
>
> on 4 Socket/4 Node Power7
> JVMS Prev Current %Change
> 8 57651.1 58462.1 1.40674
> 1 111351 108334 -2.70945
>
>
> dbench / transactions / higher numbers are better
> on 2 Socket/2 Node Intel
> count Min Max Avg Variance %Change
> 5 12254.7 12331.9 12297.8 28.1846
> 5 11851.8 11937.3 11890.9 33.5169 -3.30872
>
>
> on 2 Socket/4 Node Power8 (PowerNV)
> count Min Max Avg Variance %Change
> 5 4997.83 5030.14 5015.54 12.947
> 5 4791 5016.08 4962.55 85.9625 -1.05652
>
>
> on 2 Socket/2 Node Power9 (PowerNV)
> count Min Max Avg Variance %Change
> 5 9331.84 9375.11 9352.04 16.0703
> 5 9353.43 9380.49 9369.6 9.04361 0.187767
>
>
> on 4 Socket/4 Node Power7
> count Min Max Avg Variance %Change
> 5 147.55 181.605 168.963 11.3513
> 5 149.518 215.412 179.083 21.5903 5.98948
>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---
> Changelog v1->v2:
> - Handle trivial changes due to variable name change. (Rik Van Riel)
> - Drop changes where subsequent better cpu find was rejected for
> small numa improvement (Rik Van Riel).
>
> kernel/sched/fair.c | 23 ++++++++++++++++++-----
> 1 file changed, 18 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5cf921a..a717870 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1568,6 +1568,13 @@ static bool load_too_imbalanced(long src_load, long dst_load,
> }
>
> /*
> + * Maximum numa importance can be 1998 (2*999);
> + * SMALLIMP @ 30 would be close to 1998/64.
> + * Used to deter task migration.
> + */
> +#define SMALLIMP 30
> +
> +/*
> * This checks if the overall compute and NUMA accesses of the system would
> * be improved if the source tasks was migrated to the target dst_cpu taking
> * into account that it might be best if task running on the dst_cpu should
> @@ -1600,7 +1607,7 @@ static void task_numa_compare(struct task_numa_env *env,
> goto unlock;
>
> if (!cur) {
> - if (maymove || imp > env->best_imp)
> + if (maymove && moveimp >= env->best_imp)
> goto assign;
> else
> goto unlock;
> @@ -1643,16 +1650,22 @@ static void task_numa_compare(struct task_numa_env *env,
> task_weight(cur, env->dst_nid, dist);
> }
>
> - if (imp <= env->best_imp)
> - goto unlock;
> -
> if (maymove && moveimp > imp && moveimp > env->best_imp) {
> - imp = moveimp - 1;
> + imp = moveimp;
> cur = NULL;
> goto assign;
> }
>
> /*
> + * If the numa importance is less than SMALLIMP,
> + * task migration might only result in ping pong
> + * of tasks and also hurt performance due to cache
> + * misses.
> + */
> + if (imp < SMALLIMP || imp <= env->best_imp + SMALLIMP / 2)
> + goto unlock;
> +
> + /*
> * In the overloaded case, try and keep the load balanced.
> */
> load = task_h_load(env->p) - task_h_load(cur);
So what is this 'NUMA importance'? Seems just like a random parameter which generally isn't a
good idea.
Also, same review feedback as I gave for the previous patches.
Thanks,
Ingo
* Re: [PATCH 3/6] sched/numa: Avoid task migration for small numa improvement
2018-09-10 8:46 ` Ingo Molnar
@ 2018-09-12 15:17 ` Srikar Dronamraju
0 siblings, 0 replies; 16+ messages in thread
From: Srikar Dronamraju @ 2018-09-12 15:17 UTC (permalink / raw)
To: Ingo Molnar
Cc: Peter Zijlstra, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner
> >
> > /*
> > + * Maximum numa importance can be 1998 (2*999);
> > + * SMALLIMP @ 30 would be close to 1998/64.
> > + * Used to deter task migration.
> > + */
> > +#define SMALLIMP 30
> > +
> > +/*
> >
> > /*
> > + * If the numa importance is less than SMALLIMP,
> > + * task migration might only result in ping pong
> > + * of tasks and also hurt performance due to cache
> > + * misses.
> > + */
> > + if (imp < SMALLIMP || imp <= env->best_imp + SMALLIMP / 2)
> > + goto unlock;
> > +
> > + /*
> > * In the overloaded case, try and keep the load balanced.
> > */
> > load = task_h_load(env->p) - task_h_load(cur);
>
> So what is this 'NUMA importance'? Seems just like a random parameter which generally isn't a
> good idea.
>
I refer to the weight that is used to compare the suitability of a task to a
node as NUMA importance. It varies between -999 and 1000. This is not
something introduced by this patch; it was introduced as part of NUMA
balancing a couple of years ago. group_imp, task_imp and best_imp all refer
to the NUMA importance. Maybe I am using the wrong term here; maybe imp
stands for something other than importance.
In this patch, we are trying to limit task migration when the NUMA
importance is small. I.e. if the NUMA importance of moving/swapping tasks is
only 10, should we drop all cache affinity for NUMA affinity? Maybe we need
to wait for the trend to stabilize.
I have chosen 30 as the weight below which we refuse to consider NUMA
importance; it is based on the maximum NUMA importance / 64.
Please do suggest if you have a better method to limit task migrations for
small NUMA gains.
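The SMALLIMP filter from the patch can be isolated as a small predicate (the constant matches the patch; the helper name is invented for this sketch): a candidate is rejected either because its NUMA importance is absolutely small, or because it is not a clear enough improvement over the current best.

```c
/* Maximum NUMA importance can be 1998 (2 * 999);
 * SMALLIMP @ 30 is close to 1998 / 64. */
#define SMALLIMP 30

/* Mirrors the patch's rejection condition:
 *	if (imp < SMALLIMP || imp <= env->best_imp + SMALLIMP / 2)
 *		goto unlock;
 * Returns 1 only when the candidate is both absolutely significant
 * and a meaningful improvement over the current best. */
int worth_migrating(long imp, long best_imp)
{
	if (imp < SMALLIMP || imp <= best_imp + SMALLIMP / 2)
		return 0;
	return 1;
}
```

So an importance of 10 is always rejected, 40 is rejected against a best of 30 (it is within SMALLIMP/2 = 15 of it), and 100 against a best of 50 is accepted.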
* [PATCH 4/6] sched/numa: Pass destination cpu as a parameter to migrate_task_rq
2018-08-03 6:13 [PATCH 0/6] numa-balancing patches Srikar Dronamraju
` (2 preceding siblings ...)
2018-08-03 6:13 ` [PATCH 3/6] sched/numa: Avoid task migration for small numa improvement Srikar Dronamraju
@ 2018-08-03 6:13 ` Srikar Dronamraju
2018-08-03 6:14 ` [PATCH 5/6] sched/numa: Reset scan rate whenever task moves across nodes Srikar Dronamraju
` (2 subsequent siblings)
6 siblings, 0 replies; 16+ messages in thread
From: Srikar Dronamraju @ 2018-08-03 6:13 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra
Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner
This additional parameter (new_cpu) is used later to identify whether a
task migration is across nodes.
No functional change.
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
kernel/sched/core.c | 2 +-
kernel/sched/deadline.c | 2 +-
kernel/sched/fair.c | 2 +-
kernel/sched/sched.h | 2 +-
4 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index deafa9f..fdab290 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1167,7 +1167,7 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
if (task_cpu(p) != new_cpu) {
if (p->sched_class->migrate_task_rq)
- p->sched_class->migrate_task_rq(p);
+ p->sched_class->migrate_task_rq(p, new_cpu);
p->se.nr_migrations++;
rseq_migrate(p);
perf_event_task_migrate(p);
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 997ea7b..91e4202 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1607,7 +1607,7 @@ static void yield_task_dl(struct rq *rq)
return cpu;
}
-static void migrate_task_rq_dl(struct task_struct *p)
+static void migrate_task_rq_dl(struct task_struct *p, int new_cpu __maybe_unused)
{
struct rq *rq;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a717870..a5936ed 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6308,7 +6308,7 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
* cfs_rq_of(p) references at time of call are still valid and identify the
* previous CPU. The caller guarantees p->pi_lock or task_rq(p)->lock is held.
*/
-static void migrate_task_rq_fair(struct task_struct *p)
+static void migrate_task_rq_fair(struct task_struct *p, int new_cpu __maybe_unused)
{
/*
* As blocked tasks retain absolute vruntime the migration needs to
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0b91612..455fa33 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1524,7 +1524,7 @@ struct sched_class {
#ifdef CONFIG_SMP
int (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
- void (*migrate_task_rq)(struct task_struct *p);
+ void (*migrate_task_rq)(struct task_struct *p, int new_cpu);
void (*task_woken)(struct rq *this_rq, struct task_struct *task);
--
1.8.3.1
* [PATCH 5/6] sched/numa: Reset scan rate whenever task moves across nodes
2018-08-03 6:13 [PATCH 0/6] numa-balancing patches Srikar Dronamraju
` (3 preceding siblings ...)
2018-08-03 6:13 ` [PATCH 4/6] sched/numa: Pass destination cpu as a parameter to migrate_task_rq Srikar Dronamraju
@ 2018-08-03 6:14 ` Srikar Dronamraju
2018-09-10 8:48 ` Ingo Molnar
2018-08-03 6:14 ` [PATCH 6/6] sched/numa: Limit the conditions where scan period is reset Srikar Dronamraju
2018-08-21 12:01 ` [PATCH 0/6] numa-balancing patches Srikar Dronamraju
6 siblings, 1 reply; 16+ messages in thread
From: Srikar Dronamraju @ 2018-08-03 6:14 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra
Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner
Currently the task scan rate is reset when the NUMA balancer migrates a
task to a different node. If the NUMA balancer initiates a swap, the
reset only applies to the task that initiated the swap. Similarly, no
scan rate reset is done if the task is migrated across nodes by the
traditional load balancer.
Instead, move the scan reset to migrate_task_rq(). This ensures that a
task moved out of its preferred node either gets back to its preferred
node quickly or finds a new preferred node. Doing so is fair to all
tasks migrating across nodes.
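The condition that now gates the reset in migrate_task_rq_fair() is simply whether the migration crosses node boundaries. A sketch with cpu_to_node() modeled as a lookup table (the topology and helper name are invented for this illustration):

```c
/* Toy topology for the sketch: 4 CPUs, 2 nodes. */
int node_of_cpu[] = { 0, 0, 1, 1 };

/* Mirrors the patch's check in migrate_task_rq_fair():
 *	src_nid = cpu_to_node(task_cpu(p));
 *	dst_nid = cpu_to_node(new_cpu);
 *	if (src_nid != dst_nid)
 *		p->numa_scan_period = task_scan_start(p);
 * The scan-period reset fires only for cross-node moves, regardless
 * of whether the NUMA balancer or the load balancer initiated them. */
int migration_crosses_node(int src_cpu, int dst_cpu)
{
	return node_of_cpu[src_cpu] != node_of_cpu[dst_cpu];
}
```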
specjbb2005 / bops/JVM / higher bops are better
on 2 Socket/2 Node Intel
JVMS Prev Current %Change
4 210118 208862 -0.597759
1 313171 307007 -1.96825
on 2 Socket/4 Node Power8 (PowerNV)
JVMS Prev Current %Change
8 91027.5 89911.4 -1.22611
1 216460 216176 -0.131202
on 2 Socket/2 Node Power9 (PowerNV)
JVMS Prev Current %Change
4 191918 196078 2.16759
1 207043 214664 3.68088
on 4 Socket/4 Node Power7
JVMS Prev Current %Change
8 58462.1 60719.2 3.86079
1 108334 112615 3.95167
dbench / transactions / higher numbers are better
on 2 Socket/2 Node Intel
count Min Max Avg Variance %Change
5 11851.8 11937.3 11890.9 33.5169
5 12511.7 12559.4 12539.5 15.5883 5.45459
on 2 Socket/4 Node Power8 (PowerNV)
count Min Max Avg Variance %Change
5 4791 5016.08 4962.55 85.9625
5 4709.28 4979.28 4919.32 105.126 -0.871125
on 2 Socket/2 Node Power9 (PowerNV)
count Min Max Avg Variance %Change
5 9353.43 9380.49 9369.6 9.04361
5 9388.38 9406.29 9395.1 5.98959 0.272157
on 4 Socket/4 Node Power7
count Min Max Avg Variance %Change
5 149.518 215.412 179.083 21.5903
5 157.71 184.929 174.754 10.7275 -2.41731
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
kernel/sched/fair.c | 19 +++++++++++++------
1 file changed, 13 insertions(+), 6 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a5936ed..4ea0eff 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1837,12 +1837,6 @@ static int task_numa_migrate(struct task_struct *p)
if (env.best_cpu == -1)
return -EAGAIN;
- /*
- * Reset the scan period if the task is being rescheduled on an
- * alternative node to recheck if the tasks is now properly placed.
- */
- p->numa_scan_period = task_scan_start(p);
-
best_rq = cpu_rq(env.best_cpu);
if (env.best_task == NULL) {
ret = migrate_task_to(p, env.best_cpu);
@@ -6361,6 +6355,19 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu __maybe_unus
/* We have migrated, no longer consider this task hot */
p->se.exec_start = 0;
+
+#ifdef CONFIG_NUMA_BALANCING
+ if (!p->mm || (p->flags & PF_EXITING))
+ return;
+
+ if (p->numa_faults) {
+ int src_nid = cpu_to_node(task_cpu(p));
+ int dst_nid = cpu_to_node(new_cpu);
+
+ if (src_nid != dst_nid)
+ p->numa_scan_period = task_scan_start(p);
+ }
+#endif
}
static void task_dead_fair(struct task_struct *p)
--
1.8.3.1
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH 5/6] sched/numa: Reset scan rate whenever task moves across nodes
2018-08-03 6:14 ` [PATCH 5/6] sched/numa: Reset scan rate whenever task moves across nodes Srikar Dronamraju
@ 2018-09-10 8:48 ` Ingo Molnar
2018-09-12 15:19 ` Srikar Dronamraju
0 siblings, 1 reply; 16+ messages in thread
From: Ingo Molnar @ 2018-09-10 8:48 UTC (permalink / raw)
To: Srikar Dronamraju
Cc: Peter Zijlstra, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner
* Srikar Dronamraju <srikar@linux.vnet.ibm.com> wrote:
> Currently task scan rate is reset when numa balancer migrates the task
> to a different node. If numa balancer initiates a swap, reset is only
> applicable to the task that initiates the swap. Similarly no scan rate
> reset is done if the task is migrated across nodes by traditional load
> balancer.
>
> Instead move the scan reset to the migrate_task_rq. This ensures the
> task moved out of its preferred node, either gets back to its preferred
> node quickly or finds a new preferred node. Doing so would be fair to
> all tasks migrating across nodes.
>
> specjbb2005 / bops/JVM / higher bops are better
> on 2 Socket/2 Node Intel
> JVMS Prev Current %Change
> 4 210118 208862 -0.597759
> 1 313171 307007 -1.96825
>
>
> on 2 Socket/4 Node Power8 (PowerNV)
> JVMS Prev Current %Change
> 8 91027.5 89911.4 -1.22611
> 1 216460 216176 -0.131202
>
>
> on 2 Socket/2 Node Power9 (PowerNV)
> JVMS Prev Current %Change
> 4 191918 196078 2.16759
> 1 207043 214664 3.68088
>
>
> on 4 Socket/4 Node Power7
> JVMS Prev Current %Change
> 8 58462.1 60719.2 3.86079
> 1 108334 112615 3.95167
>
>
> dbench / transactions / higher numbers are better
> on 2 Socket/2 Node Intel
> count Min Max Avg Variance %Change
> 5 11851.8 11937.3 11890.9 33.5169
> 5 12511.7 12559.4 12539.5 15.5883 5.45459
>
>
> on 2 Socket/4 Node Power8 (PowerNV)
> count Min Max Avg Variance %Change
> 5 4791 5016.08 4962.55 85.9625
> 5 4709.28 4979.28 4919.32 105.126 -0.871125
>
>
> on 2 Socket/2 Node Power9 (PowerNV)
> count Min Max Avg Variance %Change
> 5 9353.43 9380.49 9369.6 9.04361
> 5 9388.38 9406.29 9395.1 5.98959 0.272157
>
>
> on 4 Socket/4 Node Power7
> count Min Max Avg Variance %Change
> 5 149.518 215.412 179.083 21.5903
> 5 157.71 184.929 174.754 10.7275 -2.41731
>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---
> kernel/sched/fair.c | 19 +++++++++++++------
> 1 file changed, 13 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a5936ed..4ea0eff 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1837,12 +1837,6 @@ static int task_numa_migrate(struct task_struct *p)
> if (env.best_cpu == -1)
> return -EAGAIN;
>
> - /*
> - * Reset the scan period if the task is being rescheduled on an
> - * alternative node to recheck if the tasks is now properly placed.
> - */
> - p->numa_scan_period = task_scan_start(p);
> -
> best_rq = cpu_rq(env.best_cpu);
> if (env.best_task == NULL) {
> ret = migrate_task_to(p, env.best_cpu);
> @@ -6361,6 +6355,19 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu __maybe_unus
>
> /* We have migrated, no longer consider this task hot */
> p->se.exec_start = 0;
> +
> +#ifdef CONFIG_NUMA_BALANCING
> + if (!p->mm || (p->flags & PF_EXITING))
> + return;
> +
> + if (p->numa_faults) {
> + int src_nid = cpu_to_node(task_cpu(p));
> + int dst_nid = cpu_to_node(new_cpu);
> +
> + if (src_nid != dst_nid)
> + p->numa_scan_period = task_scan_start(p);
> + }
> +#endif
Please don't add #ifdeffery inside functions, especially not if they do weird flow control like
a 'return' from the middle of a block.
A properly named inline helper would work I suppose.
Thanks,
Ingo
* Re: [PATCH 5/6] sched/numa: Reset scan rate whenever task moves across nodes
2018-09-10 8:48 ` Ingo Molnar
@ 2018-09-12 15:19 ` Srikar Dronamraju
0 siblings, 0 replies; 16+ messages in thread
From: Srikar Dronamraju @ 2018-09-12 15:19 UTC (permalink / raw)
To: Ingo Molnar
Cc: Peter Zijlstra, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner
> > +#ifdef CONFIG_NUMA_BALANCING
> > + if (!p->mm || (p->flags & PF_EXITING))
> > + return;
> > +
> > + if (p->numa_faults) {
> > + int src_nid = cpu_to_node(task_cpu(p));
> > + int dst_nid = cpu_to_node(new_cpu);
> > +
> > + if (src_nid != dst_nid)
> > + p->numa_scan_period = task_scan_start(p);
> > + }
> > +#endif
>
> Please don't add #ifdeffery inside functions, especially not if they do weird flow control like
> a 'return' from the middle of a block.
>
> A properly named inline helper would work I suppose.
>
Okay, will take care.
* [PATCH 6/6] sched/numa: Limit the conditions where scan period is reset
2018-08-03 6:13 [PATCH 0/6] numa-balancing patches Srikar Dronamraju
` (4 preceding siblings ...)
2018-08-03 6:14 ` [PATCH 5/6] sched/numa: Reset scan rate whenever task moves across nodes Srikar Dronamraju
@ 2018-08-03 6:14 ` Srikar Dronamraju
2018-08-21 12:01 ` [PATCH 0/6] numa-balancing patches Srikar Dronamraju
6 siblings, 0 replies; 16+ messages in thread
From: Srikar Dronamraju @ 2018-08-03 6:14 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra
Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner
From: Mel Gorman <mgorman@techsingularity.net>
migrate_task_rq_fair resets the scan rate for NUMA balancing on every
cross-node migration. In the event of excessive load balancing due to
saturation, this may result in the scan rate being pegged at maximum and
further overloading the machine.
This patch only resets the scan if NUMA balancing is active, a preferred
node has been selected and the task is being migrated from the preferred
node as these are the most harmful. For example, a migration to the preferred
node does not justify a faster scan rate. Similarly, a migration between two
nodes that are not preferred is probably bouncing due to over-saturation of
the machine. In that case, scanning faster and trapping more NUMA faults
will further overload the machine.
specjbb2005 / bops/JVM / higher bops are better
on 2 Socket/2 Node Intel
JVMS Prev Current %Change
4 208862 209029 0.0799571
1 307007 326585 6.37705
on 2 Socket/4 Node Power8 (PowerNV)
JVMS Prev Current %Change
8 89911.4 89627.8 -0.315422
1 216176 221299 2.36983
on 2 Socket/2 Node Power9 (PowerNV)
JVMS Prev Current %Change
4 196078 195444 -0.323341
1 214664 222390 3.59911
on 4 Socket/4 Node Power7
JVMS Prev Current %Change
8 60719.2 60152.4 -0.933477
1 112615 111458 -1.02739
dbench / transactions / higher numbers are better
on 2 Socket/2 Node Intel
count Min Max Avg Variance %Change
5 12511.7 12559.4 12539.5 15.5883
5 12904.6 12969 12942.6 23.9053 3.21464
on 2 Socket/4 Node Power8 (PowerNV)
count Min Max Avg Variance %Change
5 4709.28 4979.28 4919.32 105.126
5 4984.25 5025.95 5004.5 14.2253 1.73154
on 2 Socket/2 Node Power9 (PowerNV)
count Min Max Avg Variance %Change
5 9388.38 9406.29 9395.1 5.98959
5 9277.64 9357.22 9322.07 26.3558 -0.77732
on 4 Socket/4 Node Power7
count Min Max Avg Variance %Change
5 157.71 184.929 174.754 10.7275
5 160.632 175.558 168.655 5.26823 -3.49005
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
kernel/sched/fair.c | 25 +++++++++++++++++++++++--
1 file changed, 23 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4ea0eff..6e251e6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6357,6 +6357,9 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu __maybe_unus
p->se.exec_start = 0;
#ifdef CONFIG_NUMA_BALANCING
+ if (!static_branch_likely(&sched_numa_balancing))
+ return;
+
if (!p->mm || (p->flags & PF_EXITING))
return;
@@ -6364,8 +6367,26 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu __maybe_unus
int src_nid = cpu_to_node(task_cpu(p));
int dst_nid = cpu_to_node(new_cpu);
- if (src_nid != dst_nid)
- p->numa_scan_period = task_scan_start(p);
+ if (src_nid == dst_nid)
+ return;
+
+ /*
+ * Allow resets if faults have been trapped before one scan
+ * has completed. This is most likely due to a new task that
+ * is pulled cross-node due to wakeups or load balancing.
+ */
+ if (p->numa_scan_seq) {
+ /*
+ * Avoid scan adjustments if moving to the preferred
+ * node or if the task was not previously running on
+ * the preferred node.
+ */
+ if (dst_nid == p->numa_preferred_nid ||
+ (p->numa_preferred_nid != -1 && src_nid != p->numa_preferred_nid))
+ return;
+ }
+
+ p->numa_scan_period = task_scan_start(p);
}
#endif
}
--
1.8.3.1
* Re: [PATCH 0/6] numa-balancing patches
2018-08-03 6:13 [PATCH 0/6] numa-balancing patches Srikar Dronamraju
` (5 preceding siblings ...)
2018-08-03 6:14 ` [PATCH 6/6] sched/numa: Limit the conditions where scan period is reset Srikar Dronamraju
@ 2018-08-21 12:01 ` Srikar Dronamraju
2018-09-06 12:17 ` Peter Zijlstra
6 siblings, 1 reply; 16+ messages in thread
From: Srikar Dronamraju @ 2018-08-21 12:01 UTC (permalink / raw)
To: Ingo Molnar, Peter Zijlstra
Cc: LKML, Mel Gorman, Rik van Riel, Thomas Gleixner
* Srikar Dronamraju <srikar@linux.vnet.ibm.com> [2018-08-03 11:43:55]:
> This patchset, based on current tip/sched/core, provides the left-out patches
> from the previous series. This version handles the comments given to some of
> the patches. It drops "sched/numa: Restrict migrating in parallel to the same
> node." It adds an additional patch from Mel Gorman.
> It also provides specjbb2005 /dbench/ perf bench numa numbers on a patch
> basis on 4 node and 2 node systems.
>
> v2: http://lkml.kernel.org/r/1529514181-9842-1-git-send-email-srikar@linux.vnet.ibm.com
> v1: http://lkml.kernel.org/r/1528106428-19992-1-git-send-email-srikar@linux.vnet.ibm.com
>
Hi Peter, Ingo
Can you please respond with your comments and suggestions?
Mel and Rik have acked most of the patches.
--
Thanks and Regards
Srikar Dronamraju
* Re: [PATCH 0/6] numa-balancing patches
2018-08-21 12:01 ` [PATCH 0/6] numa-balancing patches Srikar Dronamraju
@ 2018-09-06 12:17 ` Peter Zijlstra
0 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2018-09-06 12:17 UTC (permalink / raw)
To: Srikar Dronamraju
Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner
On Tue, Aug 21, 2018 at 05:01:50AM -0700, Srikar Dronamraju wrote:
> * Srikar Dronamraju <srikar@linux.vnet.ibm.com> [2018-08-03 11:43:55]:
>
> > This patchset, based on current tip/sched/core, provides the left-out patches
> > from the previous series. This version handles the comments given to some of
> > the patches. It drops "sched/numa: Restrict migrating in parallel to the same
> > node." It adds an additional patch from Mel Gorman.
> > It also provides specjbb2005 /dbench/ perf bench numa numbers on a patch
> > basis on 4 node and 2 node systems.
> >
> > v2: http://lkml.kernel.org/r/1529514181-9842-1-git-send-email-srikar@linux.vnet.ibm.com
> > v1: http://lkml.kernel.org/r/1528106428-19992-1-git-send-email-srikar@linux.vnet.ibm.com
> >
>
> Hi Peter, Ingo
>
> Can you please respond with your comments suggestions.
>
> Mel and Rik have acked most of the patches.
The patches do not contain a single ack.
But I picked them up now. Thanks.