linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 00/19]  Fixes for sched/numa_balancing
@ 2018-06-20 17:02 Srikar Dronamraju
  2018-06-20 17:02 ` [PATCH v2 01/19] sched/numa: Remove redundant field Srikar Dronamraju
                   ` (20 more replies)
  0 siblings, 21 replies; 53+ messages in thread
From: Srikar Dronamraju @ 2018-06-20 17:02 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

This patchset, based on v4.17, provides a few simple cleanups and fixes in
the sched/numa_balancing code. Some of these fixes are specific to systems
with more than 2 nodes. A few patches add per-rq and per-node complexity
to solve what I feel are fairness/correctness issues.

This version addresses the comments given on some of the patches.
It also provides SPECjbb2005 numbers on a per-patch basis for a 4 node
and a 16 node system.

Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
(higher bops are better)
JVMS  v4.17       v4.17+patch   %CHANGE
16    25705.2     26158.1       1.731
1     74433       72725         -2.34

Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
(higher bops are better)
JVMS  v4.17       v4.17+patch   %CHANGE
8     96589.6     113992        15.26
1     181830      174947        -3.93

Only patches 2, 4, 13 and 16 have code changes; the rest of the patches
are unchanged.

For overall numbers with v1 running perf bench, please see
https://lwn.net/ml/linux-kernel/1528106428-19992-1-git-send-email-srikar@linux.vnet.ibm.com

Srikar Dronamraju (19):
  sched/numa: Remove redundant field.
  sched/numa: Evaluate move once per node
  sched/numa: Simplify load_too_imbalanced
  sched/numa: Set preferred_node based on best_cpu
  sched/numa: Use task faults only if numa_group is not yet setup
  sched/debug: Reverse the order of printing faults
  sched/numa: Skip nodes that are at hoplimit
  sched/numa: Remove unused task_capacity from numa_stats
  sched/numa: Modify migrate_swap to accept additional params
  sched/numa: Stop multiple tasks from moving to the cpu at the same time
  sched/numa: Restrict migrating in parallel to the same node.
  sched/numa: Remove numa_has_capacity
  mm/migrate: Use xchg instead of spinlock
  sched/numa: Updation of scan period need not be in lock
  sched/numa: Use group_weights to identify if migration degrades locality
  sched/numa: Detect if node actively handling migration
  sched/numa: Pass destination cpu as a parameter to migrate_task_rq
  sched/numa: Reset scan rate whenever task moves across nodes
  sched/numa: Move task_placement closer to numa_migrate_preferred

 include/linux/mmzone.h  |   4 +-
 include/linux/sched.h   |   1 -
 kernel/sched/core.c     |  11 +-
 kernel/sched/deadline.c |   2 +-
 kernel/sched/debug.c    |   4 +-
 kernel/sched/fair.c     | 325 +++++++++++++++++++++++-------------------------
 kernel/sched/sched.h    |   6 +-
 mm/migrate.c            |  20 ++-
 mm/page_alloc.c         |   2 +-
 9 files changed, 187 insertions(+), 188 deletions(-)

-- 
1.8.3.1



* [PATCH v2 01/19] sched/numa: Remove redundant field.
  2018-06-20 17:02 [PATCH v2 00/19] Fixes for sched/numa_balancing Srikar Dronamraju
@ 2018-06-20 17:02 ` Srikar Dronamraju
  2018-07-25 14:23   ` [tip:sched/core] " tip-bot for Srikar Dronamraju
  2018-06-20 17:02 ` [PATCH v2 02/19] sched/numa: Evaluate move once per node Srikar Dronamraju
                   ` (19 subsequent siblings)
  20 siblings, 1 reply; 53+ messages in thread
From: Srikar Dronamraju @ 2018-06-20 17:02 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

numa_entry is a list_head defined in task_struct, but never used.

No functional change

Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 include/linux/sched.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ca3f3ea..6207ad2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1016,7 +1016,6 @@ struct task_struct {
 	u64				last_sum_exec_runtime;
 	struct callback_head		numa_work;
 
-	struct list_head		numa_entry;
 	struct numa_group		*numa_group;
 
 	/*
-- 
1.8.3.1



* [PATCH v2 02/19] sched/numa: Evaluate move once per node
  2018-06-20 17:02 [PATCH v2 00/19] Fixes for sched/numa_balancing Srikar Dronamraju
  2018-06-20 17:02 ` [PATCH v2 01/19] sched/numa: Remove redundant field Srikar Dronamraju
@ 2018-06-20 17:02 ` Srikar Dronamraju
  2018-06-21  9:06   ` Mel Gorman
  2018-07-25 14:24   ` [tip:sched/core] " tip-bot for Srikar Dronamraju
  2018-06-20 17:02 ` [PATCH v2 03/19] sched/numa: Simplify load_too_imbalanced Srikar Dronamraju
                   ` (18 subsequent siblings)
  20 siblings, 2 replies; 53+ messages in thread
From: Srikar Dronamraju @ 2018-06-20 17:02 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

task_numa_compare() helps choose the best cpu to move or swap the
selected task. To achieve this, task_numa_compare() is called for every
cpu in the node. Currently it evaluates whether the task can be moved or
swapped for each of those cpus. However, the move evaluation is mostly
independent of the cpu. Evaluating the move logic once per node provides
scope for simplifying task_numa_compare().
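
Below is a stand-alone user-space sketch of the restructuring (not part of
the patch; the numbers and the may_move() helper are invented for
illustration): the node-level "can the task simply move here?" decision is
computed once, outside the per-cpu loop, and handed to the per-cpu
comparison:

#include <stdbool.h>
#include <stdio.h>

/* Invented node-level statistics; stand-ins for struct numa_stats. */
struct node_stats { long load; long capacity; };

/* Hypothetical placeholder for a load_too_imbalanced() style check. */
static bool may_move(struct node_stats src, struct node_stats dst, long task_load)
{
        return (dst.load + task_load) * src.capacity <=
               (src.load - task_load) * dst.capacity;
}

int main(void)
{
        struct node_stats src = { .load = 900, .capacity = 1024 };
        struct node_stats dst = { .load = 300, .capacity = 1024 };
        long task_load = 200;

        /* Evaluated once per destination node, not once per cpu. */
        bool maymove = may_move(src, dst, task_load);

        for (int cpu = 0; cpu < 4; cpu++) {
                /*
                 * The per-cpu step only needs to score a potential swap
                 * partner; an empty cpu is taken directly when maymove
                 * is true.
                 */
                printf("cpu %d: maymove=%d\n", cpu, maymove);
        }
        return 0;
}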

Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
16    25705.2     25058.2     -2.51
1     74433       72950       -1.99

Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
8     96589.6     105930      9.670
1     181830      178624      -1.76

(numbers from v1 based on v4.17-rc5)
Testcase       Time:         Min         Max         Avg      StdDev
numa01.sh      Real:      440.65      941.32      758.98      189.17
numa01.sh       Sys:      183.48      320.07      258.42       50.09
numa01.sh      User:    37384.65    71818.14    60302.51    13798.96
numa02.sh      Real:       61.24       65.35       62.49        1.49
numa02.sh       Sys:       16.83       24.18       21.40        2.60
numa02.sh      User:     5219.59     5356.34     5264.03       49.07
numa03.sh      Real:      822.04      912.40      873.55       37.35
numa03.sh       Sys:      118.80      140.94      132.90        7.60
numa03.sh      User:    62485.19    70025.01    67208.33     2967.10
numa04.sh      Real:      690.66      872.12      778.49       65.44
numa04.sh       Sys:      459.26      563.03      494.03       42.39
numa04.sh      User:    51116.44    70527.20    58849.44     8461.28
numa05.sh      Real:      418.37      562.28      525.77       54.27
numa05.sh       Sys:      299.45      481.00      392.49       64.27
numa05.sh      User:    34115.09    41324.02    39105.30     2627.68

Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
numa01.sh      Real:      516.14      892.41      739.84      151.32 	 2.587%
numa01.sh       Sys:      153.16      192.99      177.70       14.58 	 45.42%
numa01.sh      User:    39821.04    69528.92    57193.87    10989.48 	 5.435%
numa02.sh      Real:       60.91       62.35       61.58        0.63 	 1.477%
numa02.sh       Sys:       16.47       26.16       21.20        3.85 	 0.943%
numa02.sh      User:     5227.58     5309.61     5265.17       31.04 	 -0.02%
numa03.sh      Real:      739.07      917.73      795.75       64.45 	 9.776%
numa03.sh       Sys:       94.46      136.08      109.48       14.58 	 21.39%
numa03.sh      User:    57478.56    72014.09    61764.48     5343.69 	 8.813%
numa04.sh      Real:      442.61      715.43      530.31       96.12 	 46.79%
numa04.sh       Sys:      224.90      348.63      285.61       48.83 	 72.97%
numa04.sh      User:    35836.84    47522.47    40235.41     3985.26 	 46.26%
numa05.sh      Real:      386.13      489.17      434.94       43.59 	 20.88%
numa05.sh       Sys:      144.29      438.56      278.80      105.78 	 40.77%
numa05.sh      User:    33255.86    36890.82    34879.31     1641.98 	 12.11%

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v1->v2:
Style changes and a variable rename, as suggested by Rik.

 kernel/sched/fair.c | 128 +++++++++++++++++++++++-----------------------------
 1 file changed, 57 insertions(+), 71 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 79f574d..69136e9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1541,9 +1541,8 @@ static bool load_too_imbalanced(long src_load, long dst_load,
  * be exchanged with the source task
  */
 static void task_numa_compare(struct task_numa_env *env,
-			      long taskimp, long groupimp)
+			      long taskimp, long groupimp, bool maymove)
 {
-	struct rq *src_rq = cpu_rq(env->src_cpu);
 	struct rq *dst_rq = cpu_rq(env->dst_cpu);
 	struct task_struct *cur;
 	long src_load, dst_load;
@@ -1564,97 +1563,73 @@ static void task_numa_compare(struct task_numa_env *env,
 	if (cur == env->p)
 		goto unlock;
 
+	if (!cur) {
+		if (maymove || imp > env->best_imp)
+			goto assign;
+		else
+			goto unlock;
+	}
+
 	/*
 	 * "imp" is the fault differential for the source task between the
 	 * source and destination node. Calculate the total differential for
 	 * the source task and potential destination task. The more negative
-	 * the value is, the more rmeote accesses that would be expected to
+	 * the value is, the more remote accesses that would be expected to
 	 * be incurred if the tasks were swapped.
 	 */
-	if (cur) {
-		/* Skip this swap candidate if cannot move to the source CPU: */
-		if (!cpumask_test_cpu(env->src_cpu, &cur->cpus_allowed))
-			goto unlock;
+	/* Skip this swap candidate if cannot move to the source cpu */
+	if (!cpumask_test_cpu(env->src_cpu, &cur->cpus_allowed))
+		goto unlock;
 
+	/*
+	 * If dst and source tasks are in the same NUMA group, or not
+	 * in any group then look only at task weights.
+	 */
+	if (cur->numa_group == env->p->numa_group) {
+		imp = taskimp + task_weight(cur, env->src_nid, dist) -
+		      task_weight(cur, env->dst_nid, dist);
 		/*
-		 * If dst and source tasks are in the same NUMA group, or not
-		 * in any group then look only at task weights.
+		 * Add some hysteresis to prevent swapping the
+		 * tasks within a group over tiny differences.
 		 */
-		if (cur->numa_group == env->p->numa_group) {
-			imp = taskimp + task_weight(cur, env->src_nid, dist) -
-			      task_weight(cur, env->dst_nid, dist);
-			/*
-			 * Add some hysteresis to prevent swapping the
-			 * tasks within a group over tiny differences.
-			 */
-			if (cur->numa_group)
-				imp -= imp/16;
-		} else {
-			/*
-			 * Compare the group weights. If a task is all by
-			 * itself (not part of a group), use the task weight
-			 * instead.
-			 */
-			if (cur->numa_group)
-				imp += group_weight(cur, env->src_nid, dist) -
-				       group_weight(cur, env->dst_nid, dist);
-			else
-				imp += task_weight(cur, env->src_nid, dist) -
-				       task_weight(cur, env->dst_nid, dist);
-		}
+		if (cur->numa_group)
+			imp -= imp / 16;
+	} else {
+		/*
+		 * Compare the group weights. If a task is all by itself
+		 * (not part of a group), use the task weight instead.
+		 */
+		if (cur->numa_group && env->p->numa_group)
+			imp += group_weight(cur, env->src_nid, dist) -
+			       group_weight(cur, env->dst_nid, dist);
+		else
+			imp += task_weight(cur, env->src_nid, dist) -
+			       task_weight(cur, env->dst_nid, dist);
 	}
 
-	if (imp <= env->best_imp && moveimp <= env->best_imp)
+	if (imp <= env->best_imp)
 		goto unlock;
 
-	if (!cur) {
-		/* Is there capacity at our destination? */
-		if (env->src_stats.nr_running <= env->src_stats.task_capacity &&
-		    !env->dst_stats.has_free_capacity)
-			goto unlock;
-
-		goto balance;
-	}
-
-	/* Balance doesn't matter much if we're running a task per CPU: */
-	if (imp > env->best_imp && src_rq->nr_running == 1 &&
-			dst_rq->nr_running == 1)
+	if (maymove && moveimp > imp && moveimp > env->best_imp) {
+		imp = moveimp - 1;
+		cur = NULL;
 		goto assign;
+	}
 
 	/*
 	 * In the overloaded case, try and keep the load balanced.
 	 */
-balance:
-	load = task_h_load(env->p);
+	load = task_h_load(env->p) - task_h_load(cur);
+	if (!load)
+		goto assign;
+
 	dst_load = env->dst_stats.load + load;
 	src_load = env->src_stats.load - load;
 
-	if (moveimp > imp && moveimp > env->best_imp) {
-		/*
-		 * If the improvement from just moving env->p direction is
-		 * better than swapping tasks around, check if a move is
-		 * possible. Store a slightly smaller score than moveimp,
-		 * so an actually idle CPU will win.
-		 */
-		if (!load_too_imbalanced(src_load, dst_load, env)) {
-			imp = moveimp - 1;
-			cur = NULL;
-			goto assign;
-		}
-	}
-
-	if (imp <= env->best_imp)
-		goto unlock;
-
-	if (cur) {
-		load = task_h_load(cur);
-		dst_load -= load;
-		src_load += load;
-	}
-
 	if (load_too_imbalanced(src_load, dst_load, env))
 		goto unlock;
 
+assign:
 	/*
 	 * One idle CPU per node is evaluated for a task numa move.
 	 * Call select_idle_sibling to maybe find a better one.
@@ -1670,7 +1645,6 @@ static void task_numa_compare(struct task_numa_env *env,
 		local_irq_enable();
 	}
 
-assign:
 	task_numa_assign(env, cur, imp);
 unlock:
 	rcu_read_unlock();
@@ -1679,15 +1653,27 @@ static void task_numa_compare(struct task_numa_env *env,
 static void task_numa_find_cpu(struct task_numa_env *env,
 				long taskimp, long groupimp)
 {
+	long src_load, dst_load, load;
+	bool maymove = false;
 	int cpu;
 
+	load = task_h_load(env->p);
+	dst_load = env->dst_stats.load + load;
+	src_load = env->src_stats.load - load;
+
+	/*
+	 * If the improvement from just moving env->p direction is better
+	 * than swapping tasks around, check if a move is possible.
+	 */
+	maymove = !load_too_imbalanced(src_load, dst_load, env);
+
 	for_each_cpu(cpu, cpumask_of_node(env->dst_nid)) {
 		/* Skip this CPU if the source task cannot migrate */
 		if (!cpumask_test_cpu(cpu, &env->p->cpus_allowed))
 			continue;
 
 		env->dst_cpu = cpu;
-		task_numa_compare(env, taskimp, groupimp);
+		task_numa_compare(env, taskimp, groupimp, maymove);
 	}
 }
 
-- 
1.8.3.1



* [PATCH v2 03/19] sched/numa: Simplify load_too_imbalanced
  2018-06-20 17:02 [PATCH v2 00/19] Fixes for sched/numa_balancing Srikar Dronamraju
  2018-06-20 17:02 ` [PATCH v2 01/19] sched/numa: Remove redundant field Srikar Dronamraju
  2018-06-20 17:02 ` [PATCH v2 02/19] sched/numa: Evaluate move once per node Srikar Dronamraju
@ 2018-06-20 17:02 ` Srikar Dronamraju
  2018-07-25 14:24   ` [tip:sched/core] sched/numa: Simplify load_too_imbalanced() tip-bot for Srikar Dronamraju
  2018-06-20 17:02 ` [PATCH v2 04/19] sched/numa: Set preferred_node based on best_cpu Srikar Dronamraju
                   ` (17 subsequent siblings)
  20 siblings, 1 reply; 53+ messages in thread
From: Srikar Dronamraju @ 2018-06-20 17:02 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

Currently load_too_imbalanced() cares about the slope of the imbalance.
It does not care about the direction of the imbalance.

However, this may not work if the nodes being compared have dissimilar
capacities. A few nodes might have more cores than other nodes in the
system. Also, unlike traditional load balancing at a NUMA sched domain,
multiple requests to migrate from the same source node to the same
destination node may run in parallel. This can cause a huge load
imbalance. This is especially true on larger machines with either more
cores per node or a larger number of nodes in the system. Hence allow a
move/swap only if the imbalance is going to be reduced.
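
To make the direction argument concrete, here is a stand-alone user-space
model of the new check (the same shape as the hunk below: compare
|dst_load * src_capacity - src_load * dst_capacity| before and after the
move); the capacity and load numbers are invented:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Accept a move only if the capacity-scaled imbalance does not grow. */
static bool too_imbalanced(long src_load, long dst_load,
                           long orig_src_load, long orig_dst_load,
                           long src_cap, long dst_cap)
{
        long imb     = labs(dst_load * src_cap - src_load * dst_cap);
        long old_imb = labs(orig_dst_load * src_cap - orig_src_load * dst_cap);

        return imb > old_imb;   /* would this change make things worse? */
}

int main(void)
{
        /* The source node has twice the compute capacity of the destination. */
        long src_cap = 2048, dst_cap = 1024;
        long src_load = 400, dst_load = 500, task_load = 200;

        bool worse = too_imbalanced(src_load - task_load, dst_load + task_load,
                                    src_load, dst_load, src_cap, dst_cap);

        /*
         * The destination is already the relatively busier node, so moving
         * more load there is rejected even though its absolute load looks
         * comparable.
         */
        printf("moving %ld of load would make things %s\n",
               task_load, worse ? "worse" : "no worse");
        return 0;
}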

Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
16    25058.2     25122.9     0.25
1     72950       73850       1.23

(numbers from v1 based on v4.17-rc5)
Testcase       Time:         Min         Max         Avg      StdDev
numa01.sh      Real:      516.14      892.41      739.84      151.32
numa01.sh       Sys:      153.16      192.99      177.70       14.58
numa01.sh      User:    39821.04    69528.92    57193.87    10989.48
numa02.sh      Real:       60.91       62.35       61.58        0.63
numa02.sh       Sys:       16.47       26.16       21.20        3.85
numa02.sh      User:     5227.58     5309.61     5265.17       31.04
numa03.sh      Real:      739.07      917.73      795.75       64.45
numa03.sh       Sys:       94.46      136.08      109.48       14.58
numa03.sh      User:    57478.56    72014.09    61764.48     5343.69
numa04.sh      Real:      442.61      715.43      530.31       96.12
numa04.sh       Sys:      224.90      348.63      285.61       48.83
numa04.sh      User:    35836.84    47522.47    40235.41     3985.26
numa05.sh      Real:      386.13      489.17      434.94       43.59
numa05.sh       Sys:      144.29      438.56      278.80      105.78
numa05.sh      User:    33255.86    36890.82    34879.31     1641.98

Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
numa01.sh      Real:      435.78      653.81      534.58       83.20 	 38.39%
numa01.sh       Sys:      121.93      187.18      145.90       23.47 	 21.79%
numa01.sh      User:    37082.81    51402.80    43647.60     5409.75 	 31.03%
numa02.sh      Real:       60.64       61.63       61.19        0.40 	 0.637%
numa02.sh       Sys:       14.72       25.68       19.06        4.03 	 11.22%
numa02.sh      User:     5210.95     5266.69     5233.30       20.82 	 0.608%
numa03.sh      Real:      746.51      808.24      780.36       23.88 	 1.972%
numa03.sh       Sys:       97.26      108.48      105.07        4.28 	 4.197%
numa03.sh      User:    58956.30    61397.05    60162.95     1050.82 	 2.661%
numa04.sh      Real:      465.97      519.27      484.81       19.62 	 9.385%
numa04.sh       Sys:      304.43      359.08      334.68       20.64 	 -14.6%
numa04.sh      User:    37544.16    41186.15    39262.44     1314.91 	 2.478%
numa05.sh      Real:      411.57      457.20      433.29       16.58 	 0.380%
numa05.sh       Sys:      230.05      435.48      339.95       67.58 	 -17.9%
numa05.sh      User:    33325.54    36896.31    35637.84     1222.64 	 -2.12%

Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 20 ++------------------
 1 file changed, 2 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 69136e9..285d7ae 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1507,28 +1507,12 @@ static bool load_too_imbalanced(long src_load, long dst_load,
 	src_capacity = env->src_stats.compute_capacity;
 	dst_capacity = env->dst_stats.compute_capacity;
 
-	/* We care about the slope of the imbalance, not the direction. */
-	if (dst_load < src_load)
-		swap(dst_load, src_load);
-
-	/* Is the difference below the threshold? */
-	imb = dst_load * src_capacity * 100 -
-	      src_load * dst_capacity * env->imbalance_pct;
-	if (imb <= 0)
-		return false;
+	imb = abs(dst_load * src_capacity - src_load * dst_capacity);
 
-	/*
-	 * The imbalance is above the allowed threshold.
-	 * Compare it with the old imbalance.
-	 */
 	orig_src_load = env->src_stats.load;
 	orig_dst_load = env->dst_stats.load;
 
-	if (orig_dst_load < orig_src_load)
-		swap(orig_dst_load, orig_src_load);
-
-	old_imb = orig_dst_load * src_capacity * 100 -
-		  orig_src_load * dst_capacity * env->imbalance_pct;
+	old_imb = abs(orig_dst_load * src_capacity - orig_src_load * dst_capacity);
 
 	/* Would this change make things worse? */
 	return (imb > old_imb);
-- 
1.8.3.1



* [PATCH v2 04/19] sched/numa: Set preferred_node based on best_cpu
  2018-06-20 17:02 [PATCH v2 00/19] Fixes for sched/numa_balancing Srikar Dronamraju
                   ` (2 preceding siblings ...)
  2018-06-20 17:02 ` [PATCH v2 03/19] sched/numa: Simplify load_too_imbalanced Srikar Dronamraju
@ 2018-06-20 17:02 ` Srikar Dronamraju
  2018-06-21  9:17   ` Mel Gorman
  2018-07-25 14:25   ` [tip:sched/core] " tip-bot for Srikar Dronamraju
  2018-06-20 17:02 ` [PATCH v2 05/19] sched/numa: Use task faults only if numa_group is not yet setup Srikar Dronamraju
                   ` (16 subsequent siblings)
  20 siblings, 2 replies; 53+ messages in thread
From: Srikar Dronamraju @ 2018-06-20 17:02 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

Currently the preferred node is set to dst_nid, which is the last node in
the iteration whose group weight or task weight is greater than that of
the current node. However, this does not guarantee that dst_nid has the
NUMA capacity to accept the task. It also does not guarantee that dst_nid
holds best_cpu, which is the cpu/node that is ideal for the migration.

Let's consider faults on a 4 node system with the group weight numbers of
the different nodes being in 0 < 1 < 2 < 3 proportion. Consider the task
is running on node 3 and node 0 is its preferred node, but node 0's
capacity is full. Consider nodes 1, 2 and 3 have capacity. Then the task
should be migrated to node 1; currently the task gets moved to node 2,
because env.dst_nid points to the last node whose faults were greater
than those of the current node.

Modify the code to set the preferred node based on best_cpu. Earlier,
setting the preferred node was skipped if nr_active_nodes is 1. This
could result in the task being moved out of the preferred node to a
random node during regular load balancing.

Also, while modifying task_numa_migrate(), use sched_setnuma() to set the
preferred node. This ensures our NUMA accounting is correct.
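
As a toy stand-alone model of the example above (not kernel code; the
weight numbers, the has_room[] array and the loop are invented, and the
example is read with node 0 carrying the highest group weight since it is
the preferred node): tracking only the last node whose weight beats the
current node lands on node 2, while tracking the node of the best
candidate cpu lands on node 1:

#include <stdbool.h>
#include <stdio.h>

int main(void)
{
        int weight[4]    = { 40, 30, 20, 10 };  /* node 0 is preferred */
        bool has_room[4] = { false, true, true, true }; /* node 0 is full */
        int task_node = 3;

        int last_dst_nid = -1, best_cpu_nid = -1, best_score = -1;

        for (int nid = 0; nid < 4; nid++) {
                if (nid == task_node)
                        continue;
                if (weight[nid] <= weight[task_node])
                        continue;
                last_dst_nid = nid;             /* what env.dst_nid ends up as */

                if (!has_room[nid])             /* no best_cpu found here */
                        continue;
                if (weight[nid] > best_score) {
                        best_score = weight[nid];
                        best_cpu_nid = nid;     /* node of env.best_cpu */
                }
        }

        /* Prints "last dst_nid: 2, node of best_cpu: 1". */
        printf("last dst_nid: %d, node of best_cpu: %d\n",
               last_dst_nid, best_cpu_nid);
        return 0;
}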

Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
16    25122.9     25549.6     1.698
1     73850       73190       -0.89

Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
8     105930      113437      7.08676
1     178624      196130      9.80047

(numbers from v1 based on v4.17-rc5)
Testcase       Time:         Min         Max         Avg      StdDev
numa01.sh      Real:      435.78      653.81      534.58       83.20
numa01.sh       Sys:      121.93      187.18      145.90       23.47
numa01.sh      User:    37082.81    51402.80    43647.60     5409.75
numa02.sh      Real:       60.64       61.63       61.19        0.40
numa02.sh       Sys:       14.72       25.68       19.06        4.03
numa02.sh      User:     5210.95     5266.69     5233.30       20.82
numa03.sh      Real:      746.51      808.24      780.36       23.88
numa03.sh       Sys:       97.26      108.48      105.07        4.28
numa03.sh      User:    58956.30    61397.05    60162.95     1050.82
numa04.sh      Real:      465.97      519.27      484.81       19.62
numa04.sh       Sys:      304.43      359.08      334.68       20.64
numa04.sh      User:    37544.16    41186.15    39262.44     1314.91
numa05.sh      Real:      411.57      457.20      433.29       16.58
numa05.sh       Sys:      230.05      435.48      339.95       67.58
numa05.sh      User:    33325.54    36896.31    35637.84     1222.64

Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
numa01.sh      Real:      506.35      794.46      599.06      104.26 	 -10.76%
numa01.sh       Sys:      150.37      223.56      195.99       24.94 	 -25.55%
numa01.sh      User:    43450.69    61752.04    49281.50     6635.33 	 -11.43%
numa02.sh      Real:       60.33       62.40       61.31        0.90 	 -0.195%
numa02.sh       Sys:       18.12       31.66       24.28        5.89 	 -21.49%
numa02.sh      User:     5203.91     5325.32     5260.29       49.98 	 -0.513%
numa03.sh      Real:      696.47      853.62      745.80       57.28 	 4.6339%
numa03.sh       Sys:       85.68      123.71       97.89       13.48 	 7.3347%
numa03.sh      User:    55978.45    66418.63    59254.94     3737.97 	 1.5323%
numa04.sh      Real:      444.05      514.83      497.06       26.85 	 -2.464%
numa04.sh       Sys:      230.39      375.79      316.23       48.58 	 5.8343%
numa04.sh      User:    35403.12    41004.10    39720.80     2163.08 	 -1.153%
numa05.sh      Real:      423.09      460.41      439.57       13.92 	 -1.428%
numa05.sh       Sys:      287.38      480.15      369.37       68.52 	 -7.964%
numa05.sh      User:    34732.12    38016.80    36255.85     1070.51 	 -1.704%

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v1->v2:
Fix setting sched_setnuma() under !sd, as pointed out by Peter Zijlstra.
Modify the commit message to describe the reason for the change.

 kernel/sched/fair.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 285d7ae..2366fda2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1726,7 +1726,7 @@ static int task_numa_migrate(struct task_struct *p)
 	 * elsewhere, so there is no point in (re)trying.
 	 */
 	if (unlikely(!sd)) {
-		p->numa_preferred_nid = task_node(p);
+		sched_setnuma(p, task_node(p));
 		return -EINVAL;
 	}
 
@@ -1785,15 +1785,13 @@ static int task_numa_migrate(struct task_struct *p)
 	 * trying for a better one later. Do not set the preferred node here.
 	 */
 	if (p->numa_group) {
-		struct numa_group *ng = p->numa_group;
-
 		if (env.best_cpu == -1)
 			nid = env.src_nid;
 		else
-			nid = env.dst_nid;
+			nid = cpu_to_node(env.best_cpu);
 
-		if (ng->active_nodes > 1 && numa_is_active_node(env.dst_nid, ng))
-			sched_setnuma(p, env.dst_nid);
+		if (nid != p->numa_preferred_nid)
+			sched_setnuma(p, nid);
 	}
 
 	/* No better CPU than the current one was found. */
-- 
1.8.3.1



* [PATCH v2 05/19] sched/numa: Use task faults only if numa_group is not yet setup
  2018-06-20 17:02 [PATCH v2 00/19] Fixes for sched/numa_balancing Srikar Dronamraju
                   ` (3 preceding siblings ...)
  2018-06-20 17:02 ` [PATCH v2 04/19] sched/numa: Set preferred_node based on best_cpu Srikar Dronamraju
@ 2018-06-20 17:02 ` Srikar Dronamraju
  2018-06-21  9:38   ` Mel Gorman
  2018-07-25 14:25   ` [tip:sched/core] sched/numa: Use task faults only if numa_group is not yet set up tip-bot for Srikar Dronamraju
  2018-06-20 17:02 ` [PATCH v2 06/19] sched/debug: Reverse the order of printing faults Srikar Dronamraju
                   ` (15 subsequent siblings)
  20 siblings, 2 replies; 53+ messages in thread
From: Srikar Dronamraju @ 2018-06-20 17:02 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

When numa_group faults are available, task_numa_placement() only uses
numa_group faults to evaluate the preferred node. However, it still
accounts task faults and even evaluates the preferred node based on task
faults alone, just to discard that result in favour of the preferred node
chosen on the basis of the numa_group faults.

Instead, use task faults only if a numa_group is not yet set up.
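
A minimal stand-alone sketch of the placement decision after this change
(not kernel code; the fault counts are invented): a single maximum is
tracked, fed by task faults when the task has no numa_group and by group
faults otherwise:

#include <stdbool.h>
#include <stdio.h>

#define NR_NODES 4

int main(void)
{
        unsigned long task_faults[NR_NODES]  = { 10, 40, 25, 5 };
        unsigned long group_faults[NR_NODES] = { 80, 20, 30, 10 };
        bool has_numa_group = true;     /* pretend p->numa_group != NULL */

        unsigned long max_faults = 0;
        int max_nid = -1;

        for (int nid = 0; nid < NR_NODES; nid++) {
                unsigned long faults = has_numa_group ?
                                       group_faults[nid] : task_faults[nid];

                if (faults > max_faults) {
                        max_faults = faults;
                        max_nid = nid;  /* one winner, no separate group pass */
                }
        }

        printf("preferred node candidate: %d\n", max_nid);      /* node 0 */
        return 0;
}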

Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
16    25549.6     25215.7     -1.30
1     73190       72107       -1.47

Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
8     113437      113372      -0.05
1     196130      177403      -9.54

(numbers from v1 based on v4.17-rc5)
Testcase       Time:         Min         Max         Avg      StdDev
numa01.sh      Real:      506.35      794.46      599.06      104.26
numa01.sh       Sys:      150.37      223.56      195.99       24.94
numa01.sh      User:    43450.69    61752.04    49281.50     6635.33
numa02.sh      Real:       60.33       62.40       61.31        0.90
numa02.sh       Sys:       18.12       31.66       24.28        5.89
numa02.sh      User:     5203.91     5325.32     5260.29       49.98
numa03.sh      Real:      696.47      853.62      745.80       57.28
numa03.sh       Sys:       85.68      123.71       97.89       13.48
numa03.sh      User:    55978.45    66418.63    59254.94     3737.97
numa04.sh      Real:      444.05      514.83      497.06       26.85
numa04.sh       Sys:      230.39      375.79      316.23       48.58
numa04.sh      User:    35403.12    41004.10    39720.80     2163.08
numa05.sh      Real:      423.09      460.41      439.57       13.92
numa05.sh       Sys:      287.38      480.15      369.37       68.52
numa05.sh      User:    34732.12    38016.80    36255.85     1070.51

Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
numa01.sh      Real:      478.45      565.90      515.11       30.87 	 16.29%
numa01.sh       Sys:      207.79      271.04      232.94       21.33 	 -15.8%
numa01.sh      User:    39763.93    47303.12    43210.73     2644.86 	 14.04%
numa02.sh      Real:       60.00       61.46       60.78        0.49 	 0.871%
numa02.sh       Sys:       15.71       25.31       20.69        3.42 	 17.35%
numa02.sh      User:     5175.92     5265.86     5235.97       32.82 	 0.464%
numa03.sh      Real:      776.42      834.85      806.01       23.22 	 -7.47%
numa03.sh       Sys:      114.43      128.75      121.65        5.49 	 -19.5%
numa03.sh      User:    60773.93    64855.25    62616.91     1576.39 	 -5.36%
numa04.sh      Real:      456.93      511.95      482.91       20.88 	 2.930%
numa04.sh       Sys:      178.09      460.89      356.86       94.58 	 -11.3%
numa04.sh      User:    36312.09    42553.24    39623.21     2247.96 	 0.246%
numa05.sh      Real:      393.98      493.48      436.61       35.59 	 0.677%
numa05.sh       Sys:      164.49      329.15      265.87       61.78 	 38.92%
numa05.sh      User:    33182.65    36654.53    35074.51     1187.71 	 3.368%

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2366fda2..23c39fb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2071,8 +2071,8 @@ static int preferred_group_nid(struct task_struct *p, int nid)
 
 static void task_numa_placement(struct task_struct *p)
 {
-	int seq, nid, max_nid = -1, max_group_nid = -1;
-	unsigned long max_faults = 0, max_group_faults = 0;
+	int seq, nid, max_nid = -1;
+	unsigned long max_faults = 0;
 	unsigned long fault_types[2] = { 0, 0 };
 	unsigned long total_faults;
 	u64 runtime, period;
@@ -2151,15 +2151,15 @@ static void task_numa_placement(struct task_struct *p)
 			}
 		}
 
-		if (faults > max_faults) {
-			max_faults = faults;
+		if (!p->numa_group) {
+			if (faults > max_faults) {
+				max_faults = faults;
+				max_nid = nid;
+			}
+		} else if (group_faults > max_faults) {
+			max_faults = group_faults;
 			max_nid = nid;
 		}
-
-		if (group_faults > max_group_faults) {
-			max_group_faults = group_faults;
-			max_group_nid = nid;
-		}
 	}
 
 	update_task_scan_period(p, fault_types[0], fault_types[1]);
@@ -2167,7 +2167,7 @@ static void task_numa_placement(struct task_struct *p)
 	if (p->numa_group) {
 		numa_group_count_active_nodes(p->numa_group);
 		spin_unlock_irq(group_lock);
-		max_nid = preferred_group_nid(p, max_group_nid);
+		max_nid = preferred_group_nid(p, max_nid);
 	}
 
 	if (max_faults) {
-- 
1.8.3.1



* [PATCH v2 06/19] sched/debug: Reverse the order of printing faults
  2018-06-20 17:02 [PATCH v2 00/19] Fixes for sched/numa_balancing Srikar Dronamraju
                   ` (4 preceding siblings ...)
  2018-06-20 17:02 ` [PATCH v2 05/19] sched/numa: Use task faults only if numa_group is not yet setup Srikar Dronamraju
@ 2018-06-20 17:02 ` Srikar Dronamraju
  2018-07-25 14:26   ` [tip:sched/core] " tip-bot for Srikar Dronamraju
  2018-06-20 17:02 ` [PATCH v2 07/19] sched/numa: Skip nodes that are at hoplimit Srikar Dronamraju
                   ` (14 subsequent siblings)
  20 siblings, 1 reply; 53+ messages in thread
From: Srikar Dronamraju @ 2018-06-20 17:02 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

Fix the order in which the private and shared NUMA faults are printed.

No functional changes.

Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
16    25215.7     25375.3     0.63
1     72107       72617       0.70

Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/debug.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 15b10e2..82ac522 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -869,8 +869,8 @@ void print_numa_stats(struct seq_file *m, int node, unsigned long tsf,
 		unsigned long tpf, unsigned long gsf, unsigned long gpf)
 {
 	SEQ_printf(m, "numa_faults node=%d ", node);
-	SEQ_printf(m, "task_private=%lu task_shared=%lu ", tsf, tpf);
-	SEQ_printf(m, "group_private=%lu group_shared=%lu\n", gsf, gpf);
+	SEQ_printf(m, "task_private=%lu task_shared=%lu ", tpf, tsf);
+	SEQ_printf(m, "group_private=%lu group_shared=%lu\n", gpf, gsf);
 }
 #endif
 
-- 
1.8.3.1



* [PATCH v2 07/19] sched/numa: Skip nodes that are at hoplimit
  2018-06-20 17:02 [PATCH v2 00/19] Fixes for sched/numa_balancing Srikar Dronamraju
                   ` (5 preceding siblings ...)
  2018-06-20 17:02 ` [PATCH v2 06/19] sched/debug: Reverse the order of printing faults Srikar Dronamraju
@ 2018-06-20 17:02 ` Srikar Dronamraju
  2018-07-25 14:27   ` [tip:sched/core] sched/numa: Skip nodes that are at 'hoplimit' tip-bot for Srikar Dronamraju
  2018-06-20 17:02 ` [PATCH v2 08/19] sched/numa: Remove unused task_capacity from numa_stats Srikar Dronamraju
                   ` (13 subsequent siblings)
  20 siblings, 1 reply; 53+ messages in thread
From: Srikar Dronamraju @ 2018-06-20 17:02 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

When comparing two nodes at a distance of hoplimit, we should consider
nodes only up to hoplimit. Currently we also consider nodes at hoplimit
distance. Hence two nodes at a distance of hoplimit end up with the same
group weight. Fix this by skipping nodes at hoplimit distance.
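
A simplified stand-alone sketch of the changed condition (not kernel code;
the distance table and fault counts are invented, and the real
score_nearby_nodes() additionally weights faults by distance): nodes at
exactly hoplimit distance are now skipped, so a node no longer inherits
the faults of nodes that are a full hoplimit away:

#include <stdio.h>

#define NR_NODES 4

int main(void)
{
        int dist[NR_NODES][NR_NODES] = {
                { 10, 20, 40, 40 },
                { 20, 10, 40, 40 },
                { 40, 40, 10, 20 },
                { 40, 40, 20, 10 },
        };
        unsigned long faults[NR_NODES] = { 100, 50, 60, 30 };
        int nid = 0, maxdist = 40;      /* "hoplimit" distance */

        unsigned long score = 0;
        for (int node = 0; node < NR_NODES; node++) {
                if (dist[nid][node] >= maxdist) /* was: dist > maxdist */
                        continue;               /* skip nodes at hoplimit */
                score += faults[node];
        }

        /* Prints 150; the old '>' check would have counted 240. */
        printf("score for node %d: %lu\n", nid, score);
        return 0;
}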

Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
16    25375.3     25308.6     -0.26
1     72617       72964       0.477

Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
8     113372      108750      -4.07684
1     177403      183115      3.21979

(numbers from v1 based on v4.17-rc5)
Testcase       Time:         Min         Max         Avg      StdDev
numa01.sh      Real:      478.45      565.90      515.11       30.87
numa01.sh       Sys:      207.79      271.04      232.94       21.33
numa01.sh      User:    39763.93    47303.12    43210.73     2644.86
numa02.sh      Real:       60.00       61.46       60.78        0.49
numa02.sh       Sys:       15.71       25.31       20.69        3.42
numa02.sh      User:     5175.92     5265.86     5235.97       32.82
numa03.sh      Real:      776.42      834.85      806.01       23.22
numa03.sh       Sys:      114.43      128.75      121.65        5.49
numa03.sh      User:    60773.93    64855.25    62616.91     1576.39
numa04.sh      Real:      456.93      511.95      482.91       20.88
numa04.sh       Sys:      178.09      460.89      356.86       94.58
numa04.sh      User:    36312.09    42553.24    39623.21     2247.96
numa05.sh      Real:      393.98      493.48      436.61       35.59
numa05.sh       Sys:      164.49      329.15      265.87       61.78
numa05.sh      User:    33182.65    36654.53    35074.51     1187.71

Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
numa01.sh      Real:      414.64      819.20      556.08      147.70 	 -7.36%
numa01.sh       Sys:       77.52      205.04      139.40       52.05 	 67.10%
numa01.sh      User:    37043.24    61757.88    45517.48     9290.38 	 -5.06%
numa02.sh      Real:       60.80       63.32       61.63        0.88 	 -1.37%
numa02.sh       Sys:       17.35       39.37       25.71        7.33 	 -19.5%
numa02.sh      User:     5213.79     5374.73     5268.90       55.09 	 -0.62%
numa03.sh      Real:      780.09      948.64      831.43       63.02 	 -3.05%
numa03.sh       Sys:      104.96      136.92      116.31       11.34 	 4.591%
numa03.sh      User:    60465.42    73339.78    64368.03     4700.14 	 -2.72%
numa04.sh      Real:      412.60      681.92      521.29       96.64 	 -7.36%
numa04.sh       Sys:      210.32      314.10      251.77       37.71 	 41.74%
numa04.sh      User:    34026.38    45581.20    38534.49     4198.53 	 2.825%
numa05.sh      Real:      394.79      439.63      411.35       16.87 	 6.140%
numa05.sh       Sys:      238.32      330.09      292.31       38.32 	 -9.04%
numa05.sh      User:    33456.45    34876.07    34138.62      609.45 	 2.741%

While there is a regression with this change, it is needed from a
correctness perspective. It also helps consolidation, as seen from the
perf bench output.

Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 23c39fb..ed60fbd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1273,7 +1273,7 @@ static unsigned long score_nearby_nodes(struct task_struct *p, int nid,
 		 * of each group. Skip other nodes.
 		 */
 		if (sched_numa_topology_type == NUMA_BACKPLANE &&
-					dist > maxdist)
+					dist >= maxdist)
 			continue;
 
 		/* Add up the faults from nearby nodes. */
-- 
1.8.3.1



* [PATCH v2 08/19] sched/numa: Remove unused task_capacity from numa_stats
  2018-06-20 17:02 [PATCH v2 00/19] Fixes for sched/numa_balancing Srikar Dronamraju
                   ` (6 preceding siblings ...)
  2018-06-20 17:02 ` [PATCH v2 07/19] sched/numa: Skip nodes that are at hoplimit Srikar Dronamraju
@ 2018-06-20 17:02 ` Srikar Dronamraju
  2018-07-25 14:27   ` [tip:sched/core] sched/numa: Remove unused task_capacity from 'struct numa_stats' tip-bot for Srikar Dronamraju
  2018-06-20 17:02 ` [PATCH v2 09/19] sched/numa: Modify migrate_swap to accept additional params Srikar Dronamraju
                   ` (12 subsequent siblings)
  20 siblings, 1 reply; 53+ messages in thread
From: Srikar Dronamraju @ 2018-06-20 17:02 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

The task_capacity field in struct numa_stats is redundant.
Also, move nr_running for better packing within the struct.

No functional changes.

Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
16    25308.6     25377.3     0.271
1     72964       72287       -0.92

Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ed60fbd..0580a27 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1411,14 +1411,12 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 
 /* Cached statistics for all CPUs within a node */
 struct numa_stats {
-	unsigned long nr_running;
 	unsigned long load;
 
 	/* Total compute capacity of CPUs on a node */
 	unsigned long compute_capacity;
 
-	/* Approximate capacity in terms of runnable tasks on a node */
-	unsigned long task_capacity;
+	unsigned int nr_running;
 	int has_free_capacity;
 };
 
@@ -1456,9 +1454,9 @@ static void update_numa_stats(struct numa_stats *ns, int nid)
 	smt = DIV_ROUND_UP(SCHED_CAPACITY_SCALE * cpus, ns->compute_capacity);
 	capacity = cpus / smt; /* cores */
 
-	ns->task_capacity = min_t(unsigned, capacity,
+	capacity = min_t(unsigned, capacity,
 		DIV_ROUND_CLOSEST(ns->compute_capacity, SCHED_CAPACITY_SCALE));
-	ns->has_free_capacity = (ns->nr_running < ns->task_capacity);
+	ns->has_free_capacity = (ns->nr_running < capacity);
 }
 
 struct task_numa_env {
-- 
1.8.3.1



* [PATCH v2 09/19] sched/numa: Modify migrate_swap to accept additional params
  2018-06-20 17:02 [PATCH v2 00/19] Fixes for sched/numa_balancing Srikar Dronamraju
                   ` (7 preceding siblings ...)
  2018-06-20 17:02 ` [PATCH v2 08/19] sched/numa: Remove unused task_capacity from numa_stats Srikar Dronamraju
@ 2018-06-20 17:02 ` Srikar Dronamraju
  2018-07-25 14:28   ` [tip:sched/core] sched/numa: Modify migrate_swap() to accept additional parameters tip-bot for Srikar Dronamraju
  2018-06-20 17:02 ` [PATCH v2 10/19] sched/numa: Stop multiple tasks from moving to the cpu at the same time Srikar Dronamraju
                   ` (11 subsequent siblings)
  20 siblings, 1 reply; 53+ messages in thread
From: Srikar Dronamraju @ 2018-06-20 17:02 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

There are checks in migrate_swap_stop() that verify that the task/cpu
combination matches migrate_swap_arg before migrating.

However, at least one of the two tasks to be swapped by migrate_swap()
could have migrated to a completely different cpu before migrate_swap_arg
is updated. The new cpu where the task is currently running could also be
on a different node. If the task has migrated, the NUMA balancer might end
up placing the task on the wrong node. Instead of achieving node
consolidation, it may end up spreading the load across nodes.

To avoid that, pass the cpus as additional parameters.

While here, place migrate_swap() under CONFIG_NUMA_BALANCING.
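
A stand-alone illustration of the idea (not kernel code; the toy_task type
and the helper are invented): the cpus chosen at evaluation time are
carried into the swap arguments instead of being re-read from the task,
which may already have moved:

#include <stdio.h>

struct toy_task { int cpu; };

/*
 * After this change the caller passes the cpus it based its decision on;
 * the tasks' current cpus are deliberately not consulted again.
 */
static void build_swap_args(int target_cpu, int curr_cpu,
                            int *src_cpu, int *dst_cpu)
{
        *src_cpu = curr_cpu;
        *dst_cpu = target_cpu;
}

int main(void)
{
        struct toy_task cur = { .cpu = 8 };
        int decided_src = 8, decided_dst = 0;   /* snapshot at decision time */

        /* cur migrates after the decision but before the swap is set up. */
        cur.cpu = 17;

        int src, dst;
        build_swap_args(decided_dst, decided_src, &src, &dst);

        /*
         * The swap is still attempted between cpu 8 and cpu 0; the
         * consistency checks in migrate_swap_stop() can then notice that
         * cur is no longer on cpu 8 and abort, instead of silently
         * migrating towards cpu 17's node.
         */
        printf("swap between cpu %d and cpu %d (cur now on cpu %d)\n",
               src, dst, cur.cpu);
        return 0;
}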

Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
16    25377.3     25226.6     -0.59
1     72287       73326       1.437

Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/core.c  | 9 ++++++---
 kernel/sched/fair.c  | 3 ++-
 kernel/sched/sched.h | 3 ++-
 3 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 211890e..36f1c7c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1197,6 +1197,7 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 	__set_task_cpu(p, new_cpu);
 }
 
+#ifdef CONFIG_NUMA_BALANCING
 static void __migrate_swap_task(struct task_struct *p, int cpu)
 {
 	if (task_on_rq_queued(p)) {
@@ -1278,16 +1279,17 @@ static int migrate_swap_stop(void *data)
 /*
  * Cross migrate two tasks
  */
-int migrate_swap(struct task_struct *cur, struct task_struct *p)
+int migrate_swap(struct task_struct *cur, struct task_struct *p,
+		int target_cpu, int curr_cpu)
 {
 	struct migration_swap_arg arg;
 	int ret = -EINVAL;
 
 	arg = (struct migration_swap_arg){
 		.src_task = cur,
-		.src_cpu = task_cpu(cur),
+		.src_cpu = curr_cpu,
 		.dst_task = p,
-		.dst_cpu = task_cpu(p),
+		.dst_cpu = target_cpu,
 	};
 
 	if (arg.src_cpu == arg.dst_cpu)
@@ -1312,6 +1314,7 @@ int migrate_swap(struct task_struct *cur, struct task_struct *p)
 out:
 	return ret;
 }
+#endif /* CONFIG_NUMA_BALANCING */
 
 /*
  * wait_task_inactive - wait for a thread to unschedule.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0580a27..0d0248b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1809,7 +1809,8 @@ static int task_numa_migrate(struct task_struct *p)
 		return ret;
 	}
 
-	ret = migrate_swap(p, env.best_task);
+	ret = migrate_swap(p, env.best_task, env.best_cpu, env.src_cpu);
+
 	if (ret != 0)
 		trace_sched_stick_numa(p, env.src_cpu, task_cpu(env.best_task));
 	put_task_struct(env.best_task);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cb467c22..52ba2d6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1068,7 +1068,8 @@ enum numa_faults_stats {
 };
 extern void sched_setnuma(struct task_struct *p, int node);
 extern int migrate_task_to(struct task_struct *p, int cpu);
-extern int migrate_swap(struct task_struct *, struct task_struct *);
+extern int migrate_swap(struct task_struct *p, struct task_struct *t,
+			int cpu, int scpu);
 #endif /* CONFIG_NUMA_BALANCING */
 
 #ifdef CONFIG_SMP
-- 
1.8.3.1



* [PATCH v2 10/19] sched/numa: Stop multiple tasks from moving to the cpu at the same time
  2018-06-20 17:02 [PATCH v2 00/19] Fixes for sched/numa_balancing Srikar Dronamraju
                   ` (8 preceding siblings ...)
  2018-06-20 17:02 ` [PATCH v2 09/19] sched/numa: Modify migrate_swap to accept additional params Srikar Dronamraju
@ 2018-06-20 17:02 ` Srikar Dronamraju
  2018-06-20 17:02 ` [PATCH v2 11/19] sched/numa: Restrict migrating in parallel to the same node Srikar Dronamraju
                   ` (10 subsequent siblings)
  20 siblings, 0 replies; 53+ messages in thread
From: Srikar Dronamraju @ 2018-06-20 17:02 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

Task migration under NUMA balancing can happen in parallel. More than
one task might choose to migrate to the same cpu at the same time. This
can result in:
- During a task swap, choosing a task that was not part of the evaluation.
- During a task swap, a task which just got moved to its preferred node
  being moved to a completely different node.
- During a task swap, a task that failed to move to its preferred node
  having to wait an extra interval for the next migrate opportunity.
- During task movement, multiple task movements can cause a load imbalance.

This problem is more likely if there are more cores per node or more
nodes in the system.

Use a per-run-queue variable to check whether NUMA balancing is active
on the run queue.
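
A user-space model of the claim/release protocol on the destination run
queue (not kernel code; C11 atomics stand in for the kernel's
xchg()/WRITE_ONCE(), and the toy_rq type is an invented stand-in for
struct rq):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct toy_rq { atomic_int numa_migrate_on; };

/* Mirrors: if (xchg(&rq->numa_migrate_on, 1)) return; */
static bool claim_dst_rq(struct toy_rq *rq)
{
        return atomic_exchange(&rq->numa_migrate_on, 1) == 0;
}

/* Mirrors: WRITE_ONCE(rq->numa_migrate_on, 0); */
static void release_rq(struct toy_rq *rq)
{
        atomic_store(&rq->numa_migrate_on, 0);
}

int main(void)
{
        struct toy_rq dst = { .numa_migrate_on = 0 };

        /*
         * Two balancers race to assign a target on the same cpu;
         * only the first exchange sees 0 and wins.
         */
        printf("first claim:  %d\n", claim_dst_rq(&dst));
        printf("second claim: %d\n", claim_dst_rq(&dst));

        /* The winner drops the flag once its migration is issued. */
        release_rq(&dst);
        return 0;
}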

Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
16    25226.6     25436.1     0.83
1     73326       74031       0.96

Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
8     108750      110355      1.475
1     183115      178401      -2.57

(numbers from v1 based on v4.17-rc5)
Testcase       Time:         Min         Max         Avg      StdDev
numa01.sh      Real:      414.64      819.20      556.08      147.70
numa01.sh       Sys:       77.52      205.04      139.40       52.05
numa01.sh      User:    37043.24    61757.88    45517.48     9290.38
numa02.sh      Real:       60.80       63.32       61.63        0.88
numa02.sh       Sys:       17.35       39.37       25.71        7.33
numa02.sh      User:     5213.79     5374.73     5268.90       55.09
numa03.sh      Real:      780.09      948.64      831.43       63.02
numa03.sh       Sys:      104.96      136.92      116.31       11.34
numa03.sh      User:    60465.42    73339.78    64368.03     4700.14
numa04.sh      Real:      412.60      681.92      521.29       96.64
numa04.sh       Sys:      210.32      314.10      251.77       37.71
numa04.sh      User:    34026.38    45581.20    38534.49     4198.53
numa05.sh      Real:      394.79      439.63      411.35       16.87
numa05.sh       Sys:      238.32      330.09      292.31       38.32
numa05.sh      User:    33456.45    34876.07    34138.62      609.45

Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
numa01.sh      Real:      434.84      676.90      550.53      106.24 	 1.008%
numa01.sh       Sys:      125.98      217.34      179.41       30.35 	 -22.3%
numa01.sh      User:    38318.48    53789.56    45864.17     6620.80 	 -0.75%
numa02.sh      Real:       60.06       61.27       60.59        0.45 	 1.716%
numa02.sh       Sys:       14.25       17.86       16.09        1.28 	 59.78%
numa02.sh      User:     5190.13     5225.67     5209.24       13.19 	 1.145%
numa03.sh      Real:      748.21      960.25      823.15       73.51 	 1.005%
numa03.sh       Sys:       96.68      122.10      110.42       11.29 	 5.334%
numa03.sh      User:    58222.16    72595.27    63552.22     5048.87 	 1.283%
numa04.sh      Real:      433.08      630.55      499.30       68.15 	 4.404%
numa04.sh       Sys:      245.22      386.75      306.09       63.32 	 -17.7%
numa04.sh      User:    35014.68    46151.72    38530.26     3924.65 	 0.010%
numa05.sh      Real:      394.77      410.07      401.41        5.99 	 2.476%
numa05.sh       Sys:      212.40      301.82      256.23       35.41 	 14.08%
numa05.sh      User:    33224.86    34201.40    33665.61      313.40 	 1.405%

Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c  | 17 +++++++++++++++++
 kernel/sched/sched.h |  1 +
 2 files changed, 18 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0d0248b..50c7727 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1478,6 +1478,16 @@ struct task_numa_env {
 static void task_numa_assign(struct task_numa_env *env,
 			     struct task_struct *p, long imp)
 {
+	struct rq *rq = cpu_rq(env->dst_cpu);
+
+	if (xchg(&rq->numa_migrate_on, 1))
+		return;
+
+	if (env->best_cpu != -1) {
+		rq = cpu_rq(env->best_cpu);
+		WRITE_ONCE(rq->numa_migrate_on, 0);
+	}
+
 	if (env->best_task)
 		put_task_struct(env->best_task);
 	if (p)
@@ -1533,6 +1543,9 @@ static void task_numa_compare(struct task_numa_env *env,
 	long moveimp = imp;
 	int dist = env->dist;
 
+	if (READ_ONCE(dst_rq->numa_migrate_on))
+		return;
+
 	rcu_read_lock();
 	cur = task_rcu_dereference(&dst_rq->curr);
 	if (cur && ((cur->flags & PF_EXITING) || is_idle_task(cur)))
@@ -1699,6 +1712,7 @@ static int task_numa_migrate(struct task_struct *p)
 		.best_cpu = -1,
 	};
 	struct sched_domain *sd;
+	struct rq *best_rq;
 	unsigned long taskweight, groupweight;
 	int nid, ret, dist;
 	long taskimp, groupimp;
@@ -1802,14 +1816,17 @@ static int task_numa_migrate(struct task_struct *p)
 	 */
 	p->numa_scan_period = task_scan_start(p);
 
+	best_rq = cpu_rq(env.best_cpu);
 	if (env.best_task == NULL) {
 		ret = migrate_task_to(p, env.best_cpu);
+		WRITE_ONCE(best_rq->numa_migrate_on, 0);
 		if (ret != 0)
 			trace_sched_stick_numa(p, env.src_cpu, env.best_cpu);
 		return ret;
 	}
 
 	ret = migrate_swap(p, env.best_task, env.best_cpu, env.src_cpu);
+	WRITE_ONCE(best_rq->numa_migrate_on, 0);
 
 	if (ret != 0)
 		trace_sched_stick_numa(p, env.src_cpu, task_cpu(env.best_task));
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 52ba2d6..5b15c52 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -756,6 +756,7 @@ struct rq {
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned int		nr_numa_running;
 	unsigned int		nr_preferred_running;
+	unsigned int		numa_migrate_on;
 #endif
 	#define CPU_LOAD_IDX_MAX 5
 	unsigned long		cpu_load[CPU_LOAD_IDX_MAX];
-- 
1.8.3.1



* [PATCH v2 11/19] sched/numa: Restrict migrating in parallel to the same node.
  2018-06-20 17:02 [PATCH v2 00/19] Fixes for sched/numa_balancing Srikar Dronamraju
                   ` (9 preceding siblings ...)
  2018-06-20 17:02 ` [PATCH v2 10/19] sched/numa: Stop multiple tasks from moving to the cpu at the same time Srikar Dronamraju
@ 2018-06-20 17:02 ` Srikar Dronamraju
  2018-07-23 10:38   ` Peter Zijlstra
  2018-06-20 17:02 ` [PATCH v2 12/19] sched/numa: Remove numa_has_capacity Srikar Dronamraju
                   ` (9 subsequent siblings)
  20 siblings, 1 reply; 53+ messages in thread
From: Srikar Dronamraju @ 2018-06-20 17:02 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

Since task migration under NUMA balancing can happen in parallel, more
than one task might choose to move to the same node at the same time.
This can cause load imbalances at the node level.

The problem is more likely if there are more cores per node or more
nodes in the system.

Use a per-node variable to indicate whether a task migration to the node
under NUMA balancing is currently active. This per-node variable does not
track swapping of tasks.
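
Building on the per-rq flag from the previous patch, here is a user-space
sketch of the extra node-wide gate (not kernel code; C11 atomics and the
toy types are invented stand-ins): a plain move must also win the per-node
flag, and a failed node claim releases the rq flag again, while swaps only
need the rq gate:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct toy_rq   { atomic_int numa_migrate_on; };
struct toy_node { atomic_int active_node_migrate; };

static bool try_assign(struct toy_rq *dst_rq, struct toy_node *dst_node,
                       bool is_swap)
{
        if (atomic_exchange(&dst_rq->numa_migrate_on, 1))
                return false;   /* another balancer owns this cpu */

        /* Only plain moves (no swap partner) take the node-wide gate. */
        if (!is_swap && atomic_exchange(&dst_node->active_node_migrate, 1)) {
                atomic_store(&dst_rq->numa_migrate_on, 0);
                return false;   /* the node is already receiving a move */
        }
        return true;
}

int main(void)
{
        struct toy_rq rq0 = { .numa_migrate_on = 0 };
        struct toy_rq rq1 = { .numa_migrate_on = 0 };
        struct toy_node node = { .active_node_migrate = 0 };

        /*
         * Two plain moves aimed at different cpus of the same node:
         * only the first gets through.
         */
        printf("move to cpu0: %d\n", try_assign(&rq0, &node, false));
        printf("move to cpu1: %d\n", try_assign(&rq1, &node, false));

        /* A swap targeting cpu1 only needs the rq gate, so it succeeds. */
        printf("swap to cpu1: %d\n", try_assign(&rq1, &node, true));
        return 0;
}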

Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
16    25436.1     25657.9     0.87
1     74031       74435       0.54

Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
8     110355      101748      -7.79
1     178401      170818      -4.25

(numbers from v1 based on v4.17-rc5)
Testcase       Time:         Min         Max         Avg      StdDev
numa01.sh      Real:      414.64      819.20      556.08      147.70
numa01.sh       Sys:       77.52      205.04      139.40       52.05
numa01.sh      User:    37043.24    61757.88    45517.48     9290.38
numa02.sh      Real:       60.80       63.32       61.63        0.88
numa02.sh       Sys:       17.35       39.37       25.71        7.33
numa02.sh      User:     5213.79     5374.73     5268.90       55.09
numa03.sh      Real:      780.09      948.64      831.43       63.02
numa03.sh       Sys:      104.96      136.92      116.31       11.34
numa03.sh      User:    60465.42    73339.78    64368.03     4700.14
numa04.sh      Real:      412.60      681.92      521.29       96.64
numa04.sh       Sys:      210.32      314.10      251.77       37.71
numa04.sh      User:    34026.38    45581.20    38534.49     4198.53
numa05.sh      Real:      394.79      439.63      411.35       16.87
numa05.sh       Sys:      238.32      330.09      292.31       38.32
numa05.sh      User:    33456.45    34876.07    34138.62      609.45

Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
numa01.sh      Real:      434.84      676.90      550.53      106.24 	 1.008%
numa01.sh       Sys:      125.98      217.34      179.41       30.35 	 -22.3%
numa01.sh      User:    38318.48    53789.56    45864.17     6620.80 	 -0.75%
numa02.sh      Real:       60.06       61.27       60.59        0.45 	 1.716%
numa02.sh       Sys:       14.25       17.86       16.09        1.28 	 59.78%
numa02.sh      User:     5190.13     5225.67     5209.24       13.19 	 1.145%
numa03.sh      Real:      748.21      960.25      823.15       73.51 	 1.005%
numa03.sh       Sys:       96.68      122.10      110.42       11.29 	 5.334%
numa03.sh      User:    58222.16    72595.27    63552.22     5048.87 	 1.283%
numa04.sh      Real:      433.08      630.55      499.30       68.15 	 4.404%
numa04.sh       Sys:      245.22      386.75      306.09       63.32 	 -17.7%
numa04.sh      User:    35014.68    46151.72    38530.26     3924.65 	 0.010%
numa05.sh      Real:      394.77      410.07      401.41        5.99 	 2.476%
numa05.sh       Sys:      212.40      301.82      256.23       35.41 	 14.08%
numa05.sh      User:    33224.86    34201.40    33665.61      313.40 	 1.405%

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 include/linux/mmzone.h |  1 +
 kernel/sched/fair.c    | 14 ++++++++++++++
 mm/page_alloc.c        |  1 +
 3 files changed, 16 insertions(+)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 32699b2..b0767703 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -677,6 +677,7 @@ struct zonelist {
 
 	/* Number of pages migrated during the rate limiting time interval */
 	unsigned long numabalancing_migrate_nr_pages;
+	int active_node_migrate;
 #endif
 	/*
 	 * This is a per-node reserve of pages that are not available
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 50c7727..87fb20e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1478,11 +1478,22 @@ struct task_numa_env {
 static void task_numa_assign(struct task_numa_env *env,
 			     struct task_struct *p, long imp)
 {
+	pg_data_t *pgdat = NODE_DATA(cpu_to_node(env->dst_cpu));
 	struct rq *rq = cpu_rq(env->dst_cpu);
 
 	if (xchg(&rq->numa_migrate_on, 1))
 		return;
 
+	if (!env->best_task && env->best_cpu != -1)
+		WRITE_ONCE(pgdat->active_node_migrate, 0);
+
+	if (!p) {
+		if (xchg(&pgdat->active_node_migrate, 1)) {
+			WRITE_ONCE(rq->numa_migrate_on, 0);
+			return;
+		}
+	}
+
 	if (env->best_cpu != -1) {
 		rq = cpu_rq(env->best_cpu);
 		WRITE_ONCE(rq->numa_migrate_on, 0);
@@ -1818,8 +1829,11 @@ static int task_numa_migrate(struct task_struct *p)
 
 	best_rq = cpu_rq(env.best_cpu);
 	if (env.best_task == NULL) {
+		pg_data_t *pgdat = NODE_DATA(cpu_to_node(env.dst_cpu));
+
 		ret = migrate_task_to(p, env.best_cpu);
 		WRITE_ONCE(best_rq->numa_migrate_on, 0);
+		WRITE_ONCE(pgdat->active_node_migrate, 0);
 		if (ret != 0)
 			trace_sched_stick_numa(p, env.src_cpu, env.best_cpu);
 		return ret;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 22320ea27..8a522d2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6209,6 +6209,7 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 #ifdef CONFIG_NUMA_BALANCING
 	spin_lock_init(&pgdat->numabalancing_migrate_lock);
 	pgdat->numabalancing_migrate_nr_pages = 0;
+	pgdat->active_node_migrate = 0;
 	pgdat->numabalancing_migrate_next_window = jiffies;
 #endif
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v2 12/19] sched/numa: Remove numa_has_capacity
  2018-06-20 17:02 [PATCH v2 00/19] Fixes for sched/numa_balancing Srikar Dronamraju
                   ` (10 preceding siblings ...)
  2018-06-20 17:02 ` [PATCH v2 11/19] sched/numa: Restrict migrating in parallel to the same node Srikar Dronamraju
@ 2018-06-20 17:02 ` Srikar Dronamraju
  2018-07-25 14:28   ` [tip:sched/core] sched/numa: Remove numa_has_capacity() tip-bot for Srikar Dronamraju
  2018-06-20 17:02 ` [PATCH v2 13/19] mm/migrate: Use xchg instead of spinlock Srikar Dronamraju
                   ` (8 subsequent siblings)
  20 siblings, 1 reply; 53+ messages in thread
From: Srikar Dronamraju @ 2018-06-20 17:02 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

task_numa_find_cpu() helps to find the cpu to swap/move the task to.
It is guarded by numa_has_capacity(). However, a node not having
capacity shouldn't deter a task swap if it helps NUMA placement.

Further, load_too_imbalanced(), which evaluates the possibility of a
move/swap, provides checks similar to numa_has_capacity().

Hence remove numa_has_capacity() to enhance the possibility of task
swapping even if the load is imbalanced.

Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
16    25657.9     25804.1     0.569
1     74435       73413       -1.37

Acked-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 36 +++---------------------------------
 1 file changed, 3 insertions(+), 33 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 87fb20e..10b6886 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1417,7 +1417,6 @@ struct numa_stats {
 	unsigned long compute_capacity;
 
 	unsigned int nr_running;
-	int has_free_capacity;
 };
 
 /*
@@ -1444,8 +1443,7 @@ static void update_numa_stats(struct numa_stats *ns, int nid)
 	 * the @ns structure is NULL'ed and task_numa_compare() will
 	 * not find this node attractive.
 	 *
-	 * We'll either bail at !has_free_capacity, or we'll detect a huge
-	 * imbalance and bail there.
+	 * We'll detect a huge imbalance and bail there.
 	 */
 	if (!cpus)
 		return;
@@ -1456,7 +1454,6 @@ static void update_numa_stats(struct numa_stats *ns, int nid)
 
 	capacity = min_t(unsigned, capacity,
 		DIV_ROUND_CLOSEST(ns->compute_capacity, SCHED_CAPACITY_SCALE));
-	ns->has_free_capacity = (ns->nr_running < capacity);
 }
 
 struct task_numa_env {
@@ -1683,31 +1680,6 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 	}
 }
 
-/* Only move tasks to a NUMA node less busy than the current node. */
-static bool numa_has_capacity(struct task_numa_env *env)
-{
-	struct numa_stats *src = &env->src_stats;
-	struct numa_stats *dst = &env->dst_stats;
-
-	if (src->has_free_capacity && !dst->has_free_capacity)
-		return false;
-
-	/*
-	 * Only consider a task move if the source has a higher load
-	 * than the destination, corrected for CPU capacity on each node.
-	 *
-	 *      src->load                dst->load
-	 * --------------------- vs ---------------------
-	 * src->compute_capacity    dst->compute_capacity
-	 */
-	if (src->load * dst->compute_capacity * env->imbalance_pct >
-
-	    dst->load * src->compute_capacity * 100)
-		return true;
-
-	return false;
-}
-
 static int task_numa_migrate(struct task_struct *p)
 {
 	struct task_numa_env env = {
@@ -1763,8 +1735,7 @@ static int task_numa_migrate(struct task_struct *p)
 	update_numa_stats(&env.dst_stats, env.dst_nid);
 
 	/* Try to find a spot on the preferred nid. */
-	if (numa_has_capacity(&env))
-		task_numa_find_cpu(&env, taskimp, groupimp);
+	task_numa_find_cpu(&env, taskimp, groupimp);
 
 	/*
 	 * Look at other nodes in these cases:
@@ -1794,8 +1765,7 @@ static int task_numa_migrate(struct task_struct *p)
 			env.dist = dist;
 			env.dst_nid = nid;
 			update_numa_stats(&env.dst_stats, env.dst_nid);
-			if (numa_has_capacity(&env))
-				task_numa_find_cpu(&env, taskimp, groupimp);
+			task_numa_find_cpu(&env, taskimp, groupimp);
 		}
 	}
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v2 13/19] mm/migrate: Use xchg instead of spinlock
  2018-06-20 17:02 [PATCH v2 00/19] Fixes for sched/numa_balancing Srikar Dronamraju
                   ` (11 preceding siblings ...)
  2018-06-20 17:02 ` [PATCH v2 12/19] sched/numa: Remove numa_has_capacity Srikar Dronamraju
@ 2018-06-20 17:02 ` Srikar Dronamraju
  2018-06-21  9:51   ` Mel Gorman
  2018-07-23 10:54   ` Peter Zijlstra
  2018-06-20 17:02 ` [PATCH v2 14/19] sched/numa: Updation of scan period need not be in lock Srikar Dronamraju
                   ` (7 subsequent siblings)
  20 siblings, 2 replies; 53+ messages in thread
From: Srikar Dronamraju @ 2018-06-20 17:02 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

Currently, resetting the migrate rate-limit window is done under a
spinlock. The spinlock only serializes the rate-limit window reset, and
the same can be achieved with a simpler xchg.
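
The window-reset pattern boils down to the following standalone sketch
(the rl_* names and maybe_reset_window() are made up for illustration;
GCC atomic builtins stand in for the kernel's xchg(), READ_ONCE() and
WRITE_ONCE() helpers):

  #include <stdio.h>

  static unsigned long rl_nr_pages;     /* pages migrated in this window */
  static unsigned long rl_next_window;  /* end of the current window */

  /* Returns nonzero if this caller won the race to reset the window. */
  static int maybe_reset_window(unsigned long now, unsigned long interval)
  {
          unsigned long next = __atomic_load_n(&rl_next_window, __ATOMIC_RELAXED);

          if (now < next)
                  return 0;

          /* Whoever zeroes the page counter also advances the window. */
          if (__atomic_exchange_n(&rl_nr_pages, 0, __ATOMIC_RELAXED)) {
                  do {
                          next += interval;
                  } while (now >= next);
                  __atomic_store_n(&rl_next_window, next, __ATOMIC_RELAXED);
                  return 1;
          }
          return 0;
  }

  int main(void)
  {
          int won;

          rl_nr_pages = 42;
          rl_next_window = 100;
          won = maybe_reset_window(350, 100);
          printf("won reset: %d, next window ends at %lu\n",
                 won, rl_next_window);
          return 0;
  }

Losing the race is harmless in this sketch: the loser sees a zero
counter and just continues, while only the winner advances the window.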

Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
16    25804.1     25355.9     -1.73
1     73413       72812       -0.81

Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
8     101748      110199      8.30
1     170818      176303      3.21

(numbers from v1 based on v4.17-rc5)
Testcase       Time:         Min         Max         Avg      StdDev
numa01.sh      Real:      435.67      707.28      527.49       97.85
numa01.sh       Sys:       76.41      231.19      162.49       56.13
numa01.sh      User:    38247.36    59033.52    45129.31     7642.69
numa02.sh      Real:       60.35       62.09       61.09        0.69
numa02.sh       Sys:       15.01       30.20       20.64        5.56
numa02.sh      User:     5195.93     5294.82     5240.99       40.55
numa03.sh      Real:      752.04      919.89      836.81       63.29
numa03.sh       Sys:      115.10      133.35      125.46        7.78
numa03.sh      User:    58736.44    70084.26    65103.67     4416.10
numa04.sh      Real:      418.43      709.69      512.53      104.17
numa04.sh       Sys:      242.99      370.47      297.39       42.20
numa04.sh      User:    34916.14    48429.54    38955.65     4928.05
numa05.sh      Real:      379.27      434.05      403.70       17.79
numa05.sh       Sys:      145.94      344.50      268.72       68.53
numa05.sh      User:    32679.32    35449.75    33989.10      913.19

Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
numa01.sh      Real:      490.04      774.86      596.26       96.46 	 -11.5%
numa01.sh       Sys:      151.52      242.88      184.82       31.71 	 -12.0%
numa01.sh      User:    41418.41    60844.59    48776.09     6564.27 	 -7.47%
numa02.sh      Real:       60.14       62.94       60.98        1.00 	 0.180%
numa02.sh       Sys:       16.11       30.77       21.20        5.28 	 -2.64%
numa02.sh      User:     5184.33     5311.09     5228.50       44.24 	 0.238%
numa03.sh      Real:      790.95      856.35      826.41       24.11 	 1.258%
numa03.sh       Sys:      114.93      118.85      117.05        1.63 	 7.184%
numa03.sh      User:    60990.99    64959.28    63470.43     1415.44 	 2.573%
numa04.sh      Real:      434.37      597.92      504.87       59.70 	 1.517%
numa04.sh       Sys:      237.63      397.40      289.74       55.98 	 2.640%
numa04.sh      User:    34854.87    41121.83    38572.52     2615.84 	 0.993%
numa05.sh      Real:      386.77      448.90      417.22       22.79 	 -3.24%
numa05.sh       Sys:      149.23      379.95      303.04       79.55 	 -11.3%
numa05.sh      User:    32951.76    35959.58    34562.18     1034.05 	 -1.65%

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v1->v2:
Fix stretching the window every interval, as pointed out by Peter Zijlstra.

 include/linux/mmzone.h |  3 ---
 mm/migrate.c           | 20 ++++++++++++++------
 mm/page_alloc.c        |  1 -
 3 files changed, 14 insertions(+), 10 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b0767703..0dbe1d5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -669,9 +669,6 @@ struct zonelist {
 	struct task_struct *kcompactd;
 #endif
 #ifdef CONFIG_NUMA_BALANCING
-	/* Lock serializing the migrate rate limiting window */
-	spinlock_t numabalancing_migrate_lock;
-
 	/* Rate limiting time interval */
 	unsigned long numabalancing_migrate_next_window;
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 8c0af0f..c774990 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1868,17 +1868,25 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
 static bool numamigrate_update_ratelimit(pg_data_t *pgdat,
 					unsigned long nr_pages)
 {
+	unsigned long next_window, interval;
+
+	next_window = READ_ONCE(pgdat->numabalancing_migrate_next_window);
+	interval = msecs_to_jiffies(migrate_interval_millisecs);
+
 	/*
 	 * Rate-limit the amount of data that is being migrated to a node.
 	 * Optimal placement is no good if the memory bus is saturated and
 	 * all the time is being spent migrating!
 	 */
-	if (time_after(jiffies, pgdat->numabalancing_migrate_next_window)) {
-		spin_lock(&pgdat->numabalancing_migrate_lock);
-		pgdat->numabalancing_migrate_nr_pages = 0;
-		pgdat->numabalancing_migrate_next_window = jiffies +
-			msecs_to_jiffies(migrate_interval_millisecs);
-		spin_unlock(&pgdat->numabalancing_migrate_lock);
+	if (time_after(jiffies, next_window)) {
+		if (xchg(&pgdat->numabalancing_migrate_nr_pages, 0)) {
+			do {
+				next_window += interval;
+			} while (unlikely(time_after(jiffies, next_window)));
+
+			WRITE_ONCE(pgdat->numabalancing_migrate_next_window,
+							       next_window);
+		}
 	}
 	if (pgdat->numabalancing_migrate_nr_pages > ratelimit_pages) {
 		trace_mm_numa_migrate_ratelimit(current, pgdat->node_id,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8a522d2..ff8e730 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6207,7 +6207,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 
 	pgdat_resize_init(pgdat);
 #ifdef CONFIG_NUMA_BALANCING
-	spin_lock_init(&pgdat->numabalancing_migrate_lock);
 	pgdat->numabalancing_migrate_nr_pages = 0;
 	pgdat->active_node_migrate = 0;
 	pgdat->numabalancing_migrate_next_window = jiffies;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v2 14/19] sched/numa: Updation of scan period need not be in lock
  2018-06-20 17:02 [PATCH v2 00/19] Fixes for sched/numa_balancing Srikar Dronamraju
                   ` (12 preceding siblings ...)
  2018-06-20 17:02 ` [PATCH v2 13/19] mm/migrate: Use xchg instead of spinlock Srikar Dronamraju
@ 2018-06-20 17:02 ` Srikar Dronamraju
  2018-06-21  9:51   ` Mel Gorman
  2018-07-25 14:29   ` [tip:sched/core] sched/numa: Update the scan period without holding the numa_group lock tip-bot for Srikar Dronamraju
  2018-06-20 17:02 ` [PATCH v2 15/19] sched/numa: Use group_weights to identify if migration degrades locality Srikar Dronamraju
                   ` (6 subsequent siblings)
  20 siblings, 2 replies; 53+ messages in thread
From: Srikar Dronamraju @ 2018-06-20 17:02 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

The metrics for updating scan periods are local or task specific.
Currently this update happens under the numa_group lock, which seems
unnecessary. Hence move the update outside the lock.

Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
16    25355.9     25645.4     1.141
1     72812       72142       -0.92

Reviewed-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 10b6886..711b533 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2162,8 +2162,6 @@ static void task_numa_placement(struct task_struct *p)
 		}
 	}
 
-	update_task_scan_period(p, fault_types[0], fault_types[1]);
-
 	if (p->numa_group) {
 		numa_group_count_active_nodes(p->numa_group);
 		spin_unlock_irq(group_lock);
@@ -2178,6 +2176,8 @@ static void task_numa_placement(struct task_struct *p)
 		if (task_node(p) != p->numa_preferred_nid)
 			numa_migrate_preferred(p);
 	}
+
+	update_task_scan_period(p, fault_types[0], fault_types[1]);
 }
 
 static inline int get_numa_group(struct numa_group *grp)
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v2 15/19] sched/numa: Use group_weights to identify if migration degrades locality
  2018-06-20 17:02 [PATCH v2 00/19] Fixes for sched/numa_balancing Srikar Dronamraju
                   ` (13 preceding siblings ...)
  2018-06-20 17:02 ` [PATCH v2 14/19] sched/numa: Updation of scan period need not be in lock Srikar Dronamraju
@ 2018-06-20 17:02 ` Srikar Dronamraju
  2018-07-25 14:29   ` [tip:sched/core] " tip-bot for Srikar Dronamraju
  2018-06-20 17:02 ` [PATCH v2 16/19] sched/numa: Detect if node actively handling migration Srikar Dronamraju
                   ` (5 subsequent siblings)
  20 siblings, 1 reply; 53+ messages in thread
From: Srikar Dronamraju @ 2018-06-20 17:02 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

On NUMA_BACKPLANE and NUMA_GLUELESS_MESH systems, tasks/memory should be
consolidated to the closest group of nodes. In such a case, relying on
the group_faults metric may not always help consolidation. There can
always be a case where a node closer to the preferred node has fewer
faults than a node further away from the preferred node. In that case,
moving to the node with more faults works against NUMA consolidation.

Using group_weight instead helps to consolidate tasks/memory around the
preferred node.
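
A rough standalone illustration of the difference (the fault counts, the
line topology and the halve-per-hop scaling below are made up; the
kernel's group_weight()/score_nearby_nodes() arithmetic is more
involved, but the idea is the same):

  #include <stdio.h>

  #define NR_NODES 4

  /* Illustrative group fault counts; node 0 is the (full) preferred node. */
  static const unsigned long faults[NR_NODES] = { 900, 40, 60, 10 };

  /* Hop counts for a line topology 0-1-2-3. */
  static const int hops[NR_NODES][NR_NODES] = {
          { 0, 1, 2, 3 },
          { 1, 0, 1, 2 },
          { 2, 1, 0, 1 },
          { 3, 2, 1, 0 },
  };

  /*
   * Distance-scaled weight: also credit faults on nearby nodes, halved
   * per hop.  A node adjacent to the heavily faulted preferred node now
   * beats a farther node even if the farther node has more local faults.
   */
  static unsigned long weight(int nid)
  {
          unsigned long score = 0;
          int n;

          for (n = 0; n < NR_NODES; n++)
                  score += faults[n] >> hops[nid][n];
          return score;
  }

  int main(void)
  {
          int nid;

          for (nid = 1; nid < NR_NODES; nid++)
                  printf("node %d: faults %lu, weight %lu\n",
                         nid, faults[nid], weight(nid));
          return 0;
  }

Going by raw faults, node 2 (60) looks better than node 1 (40); going by
the distance-scaled weight, node 1 (522) wins over node 2 (310), pulling
the task towards the preferred node.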

While here, to be on the conservative side, don't override the
"migration degrades locality" logic for CPU_NEWLY_IDLE load balancing.

Note: Similar problems exist with should_numa_migrate_memory() and will
be dealt with separately.

Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
16    25645.4     25960       1.22
1     72142       73550       1.95

Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
8     110199      120071      8.958
1     176303      176249      -0.03

(numbers from v1 based on v4.17-rc5)
Testcase       Time:         Min         Max         Avg      StdDev
numa01.sh      Real:      490.04      774.86      596.26       96.46
numa01.sh       Sys:      151.52      242.88      184.82       31.71
numa01.sh      User:    41418.41    60844.59    48776.09     6564.27
numa02.sh      Real:       60.14       62.94       60.98        1.00
numa02.sh       Sys:       16.11       30.77       21.20        5.28
numa02.sh      User:     5184.33     5311.09     5228.50       44.24
numa03.sh      Real:      790.95      856.35      826.41       24.11
numa03.sh       Sys:      114.93      118.85      117.05        1.63
numa03.sh      User:    60990.99    64959.28    63470.43     1415.44
numa04.sh      Real:      434.37      597.92      504.87       59.70
numa04.sh       Sys:      237.63      397.40      289.74       55.98
numa04.sh      User:    34854.87    41121.83    38572.52     2615.84
numa05.sh      Real:      386.77      448.90      417.22       22.79
numa05.sh       Sys:      149.23      379.95      303.04       79.55
numa05.sh      User:    32951.76    35959.58    34562.18     1034.05

Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
numa01.sh      Real:      493.19      672.88      597.51       59.38 	 -0.20%
numa01.sh       Sys:      150.09      245.48      207.76       34.26 	 -11.0%
numa01.sh      User:    41928.51    53779.17    48747.06     3901.39 	 0.059%
numa02.sh      Real:       60.63       62.87       61.22        0.83 	 -0.39%
numa02.sh       Sys:       16.64       27.97       20.25        4.06 	 4.691%
numa02.sh      User:     5222.92     5309.60     5254.03       29.98 	 -0.48%
numa03.sh      Real:      821.52      902.15      863.60       32.41 	 -4.30%
numa03.sh       Sys:      112.04      130.66      118.35        7.08 	 -1.09%
numa03.sh      User:    62245.16    69165.14    66443.04     2450.32 	 -4.47%
numa04.sh      Real:      414.53      519.57      476.25       37.00 	 6.009%
numa04.sh       Sys:      181.84      335.67      280.41       54.07 	 3.327%
numa04.sh      User:    33924.50    39115.39    37343.78     1934.26 	 3.290%
numa05.sh      Real:      408.30      441.45      417.90       12.05 	 -0.16%
numa05.sh       Sys:      233.41      381.60      295.58       57.37 	 2.523%
numa05.sh      User:    33301.31    35972.50    34335.19      938.94 	 0.661%

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 711b533..9db09a6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7222,8 +7222,8 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
 static int migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
 {
 	struct numa_group *numa_group = rcu_dereference(p->numa_group);
-	unsigned long src_faults, dst_faults;
-	int src_nid, dst_nid;
+	unsigned long src_weight, dst_weight;
+	int src_nid, dst_nid, dist;
 
 	if (!static_branch_likely(&sched_numa_balancing))
 		return -1;
@@ -7250,18 +7250,19 @@ static int migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
 		return 0;
 
 	/* Leaving a core idle is often worse than degrading locality. */
-	if (env->idle != CPU_NOT_IDLE)
+	if (env->idle == CPU_IDLE)
 		return -1;
 
+	dist = node_distance(src_nid, dst_nid);
 	if (numa_group) {
-		src_faults = group_faults(p, src_nid);
-		dst_faults = group_faults(p, dst_nid);
+		src_weight = group_weight(p, src_nid, dist);
+		dst_weight = group_weight(p, dst_nid, dist);
 	} else {
-		src_faults = task_faults(p, src_nid);
-		dst_faults = task_faults(p, dst_nid);
+		src_weight = task_weight(p, src_nid, dist);
+		dst_weight = task_weight(p, dst_nid, dist);
 	}
 
-	return dst_faults < src_faults;
+	return dst_weight < src_weight;
 }
 
 #else
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v2 16/19] sched/numa: Detect if node actively handling migration
  2018-06-20 17:02 [PATCH v2 00/19] Fixes for sched/numa_balancing Srikar Dronamraju
                   ` (14 preceding siblings ...)
  2018-06-20 17:02 ` [PATCH v2 15/19] sched/numa: Use group_weights to identify if migration degrades locality Srikar Dronamraju
@ 2018-06-20 17:02 ` Srikar Dronamraju
  2018-06-20 17:02 ` [PATCH v2 17/19] sched/numa: Pass destination cpu as a parameter to migrate_task_rq Srikar Dronamraju
                   ` (4 subsequent siblings)
  20 siblings, 0 replies; 53+ messages in thread
From: Srikar Dronamraju @ 2018-06-20 17:02 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

If a node is the destination for a task migration under NUMA balancing,
then any parallel movements to that node are restricted. In such a
scenario, detect this as early as possible and avoid evaluating a task
movement altogether.

While here, avoid task migration if the NUMA imbalance is very small.
For example, consider two tasks A and B racing with each other to find
the best cpu to swap with. Task A has already found one task/cpu pair
to swap with and is trying to find a better cpu. Task B is yet to find
a cpu/task to swap with. Task A can race with task B and deprive it of
getting a task/cpu to swap with.
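
A minimal standalone sketch of the per-node gate (active_node_migrate is
the flag added earlier in this series; here it is a plain static int,
the helper names are made up, and GCC builtins stand in for the kernel's
xchg()/READ_ONCE()/WRITE_ONCE()):

  #include <stdio.h>

  static int active_node_migrate;   /* one per node in the kernel */

  /* Cheap early check: is a move to this node already in flight? */
  static int node_busy(void)
  {
          return __atomic_load_n(&active_node_migrate, __ATOMIC_RELAXED);
  }

  /* Claim the node for a task move; 0 means another move won the race. */
  static int claim_node(void)
  {
          return !__atomic_exchange_n(&active_node_migrate, 1, __ATOMIC_RELAXED);
  }

  static void release_node(void)
  {
          __atomic_store_n(&active_node_migrate, 0, __ATOMIC_RELAXED);
  }

  int main(void)
  {
          if (!node_busy() && claim_node()) {
                  printf("move the task, then release the node\n");
                  release_node();
          } else {
                  printf("node busy, skip evaluating a plain move\n");
          }
          return 0;
  }

The node_busy()-style check is what this patch adds to
task_numa_find_cpu(): when another move to the node is already in
flight, the plain-move path is disabled early instead of the conflict
being discovered only at assign time.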

Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
16    25960       26015.6     0.214
1     73550       73484       -0.08

Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
8     120071      120453      0.31
1     176249      181140      2.77

Testcase       Time:         Min         Max         Avg      StdDev
numa01.sh      Real:      493.19      672.88      597.51       59.38
numa01.sh       Sys:      150.09      245.48      207.76       34.26
numa01.sh      User:    41928.51    53779.17    48747.06     3901.39
numa02.sh      Real:       60.63       62.87       61.22        0.83
numa02.sh       Sys:       16.64       27.97       20.25        4.06
numa02.sh      User:     5222.92     5309.60     5254.03       29.98
numa03.sh      Real:      821.52      902.15      863.60       32.41
numa03.sh       Sys:      112.04      130.66      118.35        7.08
numa03.sh      User:    62245.16    69165.14    66443.04     2450.32
numa04.sh      Real:      414.53      519.57      476.25       37.00
numa04.sh       Sys:      181.84      335.67      280.41       54.07
numa04.sh      User:    33924.50    39115.39    37343.78     1934.26
numa05.sh      Real:      408.30      441.45      417.90       12.05
numa05.sh       Sys:      233.41      381.60      295.58       57.37
numa05.sh      User:    33301.31    35972.50    34335.19      938.94

Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
numa01.sh      Real:      428.48      837.17      700.45      162.77 	 -14.6%
numa01.sh       Sys:       78.64      247.70      164.45       58.32 	 26.33%
numa01.sh      User:    37487.25    63728.06    54399.27    10088.13 	 -10.3%
numa02.sh      Real:       60.07       62.65       61.41        0.85 	 -0.30%
numa02.sh       Sys:       15.83       29.36       21.04        4.48 	 -3.75%
numa02.sh      User:     5194.27     5280.60     5236.55       28.01 	 0.333%
numa03.sh      Real:      814.33      881.93      849.69       27.06 	 1.637%
numa03.sh       Sys:      111.45      134.02      125.28        7.69 	 -5.53%
numa03.sh      User:    63007.36    68013.46    65590.46     2023.37 	 1.299%
numa04.sh      Real:      412.19      438.75      424.43        9.28 	 12.20%
numa04.sh       Sys:      232.97      315.77      268.98       26.98 	 4.249%
numa04.sh      User:    33997.30    35292.88    34711.66      415.78 	 7.582%
numa05.sh      Real:      394.88      449.45      424.30       22.53 	 -1.50%
numa05.sh       Sys:      262.03      390.10      314.53       51.01 	 -6.02%
numa05.sh      User:    33389.03    35684.40    34561.34      942.34 	 -0.65%

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v1->v2:
 - Handle trivial changes due to a variable name change.
 - Also reevaluate node active migration for every cpu.
 - Now detect active migration in task_numa_find_cpu() as
	suggested by Rik van Riel.

 kernel/sched/fair.c | 28 ++++++++++++++++++++++------
 1 file changed, 22 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9db09a6..c07ac30 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1535,6 +1535,13 @@ static bool load_too_imbalanced(long src_load, long dst_load,
 }
 
 /*
+ * Maximum numa importance can be 1998 (2*999);
+ * SMALLIMP @ 30 would be close to 1998/64.
+ * Used to deter task migration.
+ */
+#define SMALLIMP	30
+
+/*
  * This checks if the overall compute and NUMA accesses of the system would
  * be improved if the source tasks was migrated to the target dst_cpu taking
  * into account that it might be best if task running on the dst_cpu should
@@ -1567,7 +1574,7 @@ static void task_numa_compare(struct task_numa_env *env,
 		goto unlock;
 
 	if (!cur) {
-		if (maymove || imp > env->best_imp)
+		if (maymove && moveimp >= env->best_imp)
 			goto assign;
 		else
 			goto unlock;
@@ -1610,16 +1617,22 @@ static void task_numa_compare(struct task_numa_env *env,
 			       task_weight(cur, env->dst_nid, dist);
 	}
 
-	if (imp <= env->best_imp)
-		goto unlock;
-
 	if (maymove && moveimp > imp && moveimp > env->best_imp) {
-		imp = moveimp - 1;
+		imp = moveimp;
 		cur = NULL;
 		goto assign;
 	}
 
 	/*
+	 * If the numa importance is less than SMALLIMP,
+	 * task migration might only result in ping pong
+	 * of tasks and also hurt performance due to cache
+	 * misses.
+	 */
+	if (imp < SMALLIMP)
+		goto unlock;
+
+	/*
 	 * In the overloaded case, try and keep the load balanced.
 	 */
 	load = task_h_load(env->p) - task_h_load(cur);
@@ -1656,6 +1669,7 @@ static void task_numa_compare(struct task_numa_env *env,
 static void task_numa_find_cpu(struct task_numa_env *env,
 				long taskimp, long groupimp)
 {
+	pg_data_t *pgdat = NODE_DATA(cpu_to_node(env->dst_cpu));
 	long src_load, dst_load, load;
 	bool maymove = false;
 	int cpu;
@@ -1671,12 +1685,14 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 	maymove = !load_too_imbalanced(src_load, dst_load, env);
 
 	for_each_cpu(cpu, cpumask_of_node(env->dst_nid)) {
+		bool move = maymove && !READ_ONCE(pgdat->active_node_migrate);
+
 		/* Skip this CPU if the source task cannot migrate */
 		if (!cpumask_test_cpu(cpu, &env->p->cpus_allowed))
 			continue;
 
 		env->dst_cpu = cpu;
-		task_numa_compare(env, taskimp, groupimp, maymove);
+		task_numa_compare(env, taskimp, groupimp, move);
 	}
 }
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v2 17/19] sched/numa: Pass destination cpu as a parameter to migrate_task_rq
  2018-06-20 17:02 [PATCH v2 00/19] Fixes for sched/numa_balancing Srikar Dronamraju
                   ` (15 preceding siblings ...)
  2018-06-20 17:02 ` [PATCH v2 16/19] sched/numa: Detect if node actively handling migration Srikar Dronamraju
@ 2018-06-20 17:02 ` Srikar Dronamraju
  2018-06-20 17:02 ` [PATCH v2 18/19] sched/numa: Reset scan rate whenever task moves across nodes Srikar Dronamraju
                   ` (3 subsequent siblings)
  20 siblings, 0 replies; 53+ messages in thread
From: Srikar Dronamraju @ 2018-06-20 17:02 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

This additional parameter (new_cpu) is used later to identify whether
the task migration is across nodes.

No functional change.

Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
16    26015.6     26251.7     0.90
1     73484       74108       0.84

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/core.c     | 2 +-
 kernel/sched/deadline.c | 2 +-
 kernel/sched/fair.c     | 2 +-
 kernel/sched/sched.h    | 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 36f1c7c..d6b5a64 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1189,7 +1189,7 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 
 	if (task_cpu(p) != new_cpu) {
 		if (p->sched_class->migrate_task_rq)
-			p->sched_class->migrate_task_rq(p);
+			p->sched_class->migrate_task_rq(p, new_cpu);
 		p->se.nr_migrations++;
 		perf_event_task_migrate(p);
 	}
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index fbfc3f1..1c1fbaa 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1608,7 +1608,7 @@ static void yield_task_dl(struct rq *rq)
 	return cpu;
 }
 
-static void migrate_task_rq_dl(struct task_struct *p)
+static void migrate_task_rq_dl(struct task_struct *p, int new_cpu __maybe_unused)
 {
 	struct rq *rq;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c07ac30..7350f09 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6615,7 +6615,7 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
  * cfs_rq_of(p) references at time of call are still valid and identify the
  * previous CPU. The caller guarantees p->pi_lock or task_rq(p)->lock is held.
  */
-static void migrate_task_rq_fair(struct task_struct *p)
+static void migrate_task_rq_fair(struct task_struct *p, int new_cpu __maybe_unused)
 {
 	/*
 	 * As blocked tasks retain absolute vruntime the migration needs to
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5b15c52..5b10d24 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1487,7 +1487,7 @@ struct sched_class {
 
 #ifdef CONFIG_SMP
 	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
-	void (*migrate_task_rq)(struct task_struct *p);
+	void (*migrate_task_rq)(struct task_struct *p, int new_cpu);
 
 	void (*task_woken)(struct rq *this_rq, struct task_struct *task);
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v2 18/19] sched/numa: Reset scan rate whenever task moves across nodes
  2018-06-20 17:02 [PATCH v2 00/19] Fixes for sched/numa_balancing Srikar Dronamraju
                   ` (16 preceding siblings ...)
  2018-06-20 17:02 ` [PATCH v2 17/19] sched/numa: Pass destination cpu as a parameter to migrate_task_rq Srikar Dronamraju
@ 2018-06-20 17:02 ` Srikar Dronamraju
  2018-06-21 10:05   ` Mel Gorman
  2018-06-20 17:03 ` [PATCH v2 19/19] sched/numa: Move task_placement closer to numa_migrate_preferred Srikar Dronamraju
                   ` (2 subsequent siblings)
  20 siblings, 1 reply; 53+ messages in thread
From: Srikar Dronamraju @ 2018-06-20 17:02 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

Currently the task scan rate is reset when the NUMA balancer migrates
the task to a different node. If the NUMA balancer initiates a swap,
the reset is only applied to the task that initiates the swap.
Similarly, no scan rate reset is done if the task is migrated across
nodes by the traditional load balancer.

Instead, move the scan reset to migrate_task_rq(). This ensures that a
task moved out of its preferred node either gets back to its preferred
node quickly or finds a new preferred node. Doing so is fair to all
tasks migrating across nodes.

Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
16    26251.7     25862.6     -1.48
1     74108       74357       0.335

Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
8     120453      117019      -2.85
1     181140      179095      -1.12

(numbers from v1 based on v4.17-rc5)
Testcase       Time:         Min         Max         Avg      StdDev
numa01.sh      Real:      428.48      837.17      700.45      162.77
numa01.sh       Sys:       78.64      247.70      164.45       58.32
numa01.sh      User:    37487.25    63728.06    54399.27    10088.13
numa02.sh      Real:       60.07       62.65       61.41        0.85
numa02.sh       Sys:       15.83       29.36       21.04        4.48
numa02.sh      User:     5194.27     5280.60     5236.55       28.01
numa03.sh      Real:      814.33      881.93      849.69       27.06
numa03.sh       Sys:      111.45      134.02      125.28        7.69
numa03.sh      User:    63007.36    68013.46    65590.46     2023.37
numa04.sh      Real:      412.19      438.75      424.43        9.28
numa04.sh       Sys:      232.97      315.77      268.98       26.98
numa04.sh      User:    33997.30    35292.88    34711.66      415.78
numa05.sh      Real:      394.88      449.45      424.30       22.53
numa05.sh       Sys:      262.03      390.10      314.53       51.01
numa05.sh      User:    33389.03    35684.40    34561.34      942.34

Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
numa01.sh      Real:      449.46      770.77      615.22      101.70 	 13.85%
numa01.sh       Sys:      132.72      208.17      170.46       24.96 	 -3.52%
numa01.sh      User:    39185.26    60290.89    50066.76     6807.84 	 8.653%
numa02.sh      Real:       60.85       61.79       61.28        0.37 	 0.212%
numa02.sh       Sys:       15.34       24.71       21.08        3.61 	 -0.18%
numa02.sh      User:     5204.41     5249.85     5231.21       17.60 	 0.102%
numa03.sh      Real:      785.50      916.97      840.77       44.98 	 1.060%
numa03.sh       Sys:      108.08      133.60      119.43        8.82 	 4.898%
numa03.sh      User:    61422.86    70919.75    64720.87     3310.61 	 1.343%
numa04.sh      Real:      429.57      587.37      480.80       57.40 	 -11.7%
numa04.sh       Sys:      240.61      321.97      290.84       33.58 	 -7.51%
numa04.sh      User:    34597.65    40498.99    37079.48     2060.72 	 -6.38%
numa05.sh      Real:      392.09      431.25      414.65       13.82 	 2.327%
numa05.sh       Sys:      229.41      372.48      297.54       53.14 	 5.710%
numa05.sh      User:    33390.86    34697.49    34222.43      556.42 	 0.990%

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7350f09..36d1414 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1807,12 +1807,6 @@ static int task_numa_migrate(struct task_struct *p)
 	if (env.best_cpu == -1)
 		return -EAGAIN;
 
-	/*
-	 * Reset the scan period if the task is being rescheduled on an
-	 * alternative node to recheck if the tasks is now properly placed.
-	 */
-	p->numa_scan_period = task_scan_start(p);
-
 	best_rq = cpu_rq(env.best_cpu);
 	if (env.best_task == NULL) {
 		pg_data_t *pgdat = NODE_DATA(cpu_to_node(env.dst_cpu));
@@ -6668,6 +6662,19 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu __maybe_unus
 
 	/* We have migrated, no longer consider this task hot */
 	p->se.exec_start = 0;
+
+#ifdef CONFIG_NUMA_BALANCING
+	if (!p->mm || (p->flags & PF_EXITING))
+		return;
+
+	if (p->numa_faults) {
+		int src_nid = cpu_to_node(task_cpu(p));
+		int dst_nid = cpu_to_node(new_cpu);
+
+		if (src_nid != dst_nid)
+			p->numa_scan_period = task_scan_start(p);
+	}
+#endif
 }
 
 static void task_dead_fair(struct task_struct *p)
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v2 19/19] sched/numa: Move task_placement closer to numa_migrate_preferred
  2018-06-20 17:02 [PATCH v2 00/19] Fixes for sched/numa_balancing Srikar Dronamraju
                   ` (17 preceding siblings ...)
  2018-06-20 17:02 ` [PATCH v2 18/19] sched/numa: Reset scan rate whenever task moves across nodes Srikar Dronamraju
@ 2018-06-20 17:03 ` Srikar Dronamraju
  2018-06-21 10:06   ` Mel Gorman
  2018-07-25 14:30   ` [tip:sched/core] sched/numa: Move task_numa_placement() closer to numa_migrate_preferred() tip-bot for Srikar Dronamraju
  2018-06-20 17:03 ` [PATCH v2 00/19] Fixes for sched/numa_balancing Srikar Dronamraju
  2018-07-23 13:57 ` Peter Zijlstra
  20 siblings, 2 replies; 53+ messages in thread
From: Srikar Dronamraju @ 2018-06-20 17:03 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

numa_migrate_preferred() is called periodically or when the task's
preferred node changes. Preferred node evaluation happens once per scan
sequence.

If the scan completes just after the periodic NUMA migration, then we
try to migrate to the preferred node, but the preferred node might
change, requiring yet another migration.

Avoid this by checking for scan sequence completion only when checking
for periodic migration.

Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
16    25862.6     26158.1     1.14258
1     74357       72725       -2.19482

Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
8     117019      113992      -2.58
1     179095      174947      -2.31

(numbers from v1 based on v4.17-rc5)
Testcase       Time:         Min         Max         Avg      StdDev
numa01.sh      Real:      449.46      770.77      615.22      101.70
numa01.sh       Sys:      132.72      208.17      170.46       24.96
numa01.sh      User:    39185.26    60290.89    50066.76     6807.84
numa02.sh      Real:       60.85       61.79       61.28        0.37
numa02.sh       Sys:       15.34       24.71       21.08        3.61
numa02.sh      User:     5204.41     5249.85     5231.21       17.60
numa03.sh      Real:      785.50      916.97      840.77       44.98
numa03.sh       Sys:      108.08      133.60      119.43        8.82
numa03.sh      User:    61422.86    70919.75    64720.87     3310.61
numa04.sh      Real:      429.57      587.37      480.80       57.40
numa04.sh       Sys:      240.61      321.97      290.84       33.58
numa04.sh      User:    34597.65    40498.99    37079.48     2060.72
numa05.sh      Real:      392.09      431.25      414.65       13.82
numa05.sh       Sys:      229.41      372.48      297.54       53.14
numa05.sh      User:    33390.86    34697.49    34222.43      556.42


Testcase       Time:         Min         Max         Avg      StdDev 	%Change
numa01.sh      Real:      424.63      566.18      498.12       59.26 	 23.50%
numa01.sh       Sys:      160.19      256.53      208.98       37.02 	 -18.4%
numa01.sh      User:    37320.00    46225.58    42001.57     3482.45 	 19.20%
numa02.sh      Real:       60.17       62.47       60.91        0.85 	 0.607%
numa02.sh       Sys:       15.30       22.82       17.04        2.90 	 23.70%
numa02.sh      User:     5202.13     5255.51     5219.08       20.14 	 0.232%
numa03.sh      Real:      823.91      844.89      833.86        8.46 	 0.828%
numa03.sh       Sys:      130.69      148.29      140.47        6.21 	 -14.9%
numa03.sh      User:    62519.15    64262.20    63613.38      620.05 	 1.740%
numa04.sh      Real:      515.30      603.74      548.56       30.93 	 -12.3%
numa04.sh       Sys:      459.73      525.48      489.18       21.63 	 -40.5%
numa04.sh      User:    40561.96    44919.18    42047.87     1526.85 	 -11.8%
numa05.sh      Real:      396.58      454.37      421.13       19.71 	 -1.53%
numa05.sh       Sys:      208.72      422.02      348.90       73.60 	 -14.7%
numa05.sh      User:    33124.08    36109.35    34846.47     1089.74 	 -1.79%

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 kernel/sched/fair.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 36d1414..f29d59f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2182,9 +2182,6 @@ static void task_numa_placement(struct task_struct *p)
 		/* Set the new preferred node */
 		if (max_nid != p->numa_preferred_nid)
 			sched_setnuma(p, max_nid);
-
-		if (task_node(p) != p->numa_preferred_nid)
-			numa_migrate_preferred(p);
 	}
 
 	update_task_scan_period(p, fault_types[0], fault_types[1]);
@@ -2387,14 +2384,14 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
 				numa_is_active_node(mem_node, ng))
 		local = 1;
 
-	task_numa_placement(p);
-
 	/*
 	 * Retry task to preferred node migration periodically, in case it
 	 * case it previously failed, or the scheduler moved us.
 	 */
-	if (time_after(jiffies, p->numa_migrate_retry))
+	if (time_after(jiffies, p->numa_migrate_retry)) {
+		task_numa_placement(p);
 		numa_migrate_preferred(p);
+	}
 
 	if (migrated)
 		p->numa_pages_migrated += pages;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v2 00/19]  Fixes for sched/numa_balancing
  2018-06-20 17:02 [PATCH v2 00/19] Fixes for sched/numa_balancing Srikar Dronamraju
                   ` (18 preceding siblings ...)
  2018-06-20 17:03 ` [PATCH v2 19/19] sched/numa: Move task_placement closer to numa_migrate_preferred Srikar Dronamraju
@ 2018-06-20 17:03 ` Srikar Dronamraju
  2018-07-23 13:57 ` Peter Zijlstra
  20 siblings, 0 replies; 53+ messages in thread
From: Srikar Dronamraju @ 2018-06-20 17:03 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra
  Cc: LKML, Mel Gorman, Rik van Riel, Srikar Dronamraju, Thomas Gleixner

This patchset based on v4.17, provides few simple cleanups and fixes in
the sched/numa_balancing code. Some of these fixes are specific to systems
having more than 2 nodes. Few patches add per-rq and per-node complexities
to solve what I feel are a fairness/correctness issues.

This version handles the comments given to some of the patches.
It also provides specjbb2005 numbers on a patch basis on a 4 node and
16 node system.

Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
(higher bops are better)
JVMS  v4.17  	v4.17+patch   %CHANGE
16    25705.2     26158.1     1.731
1     74433	  72725       -2.34

Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
(higher bops are better)
JVMS  v4.17  	v4.17+patch   %CHANGE
8     96589.6     113992      15.26
1     181830      174947      -3.93

Only patches 2, 4, 13 and 16 have code changes. The rest of the patches
are unchanged.

For overall numbers with v1 running perf-bench please look at
https://lwn.net/ml/linux-kernel/1528106428-19992-1-git-send-email-srikar@linux.vnet.ibm.com

Srikar Dronamraju (19):
  sched/numa: Remove redundant field.
  sched/numa: Evaluate move once per node
  sched/numa: Simplify load_too_imbalanced
  sched/numa: Set preferred_node based on best_cpu
  sched/numa: Use task faults only if numa_group is not yet setup
  sched/debug: Reverse the order of printing faults
  sched/numa: Skip nodes that are at hoplimit
  sched/numa: Remove unused task_capacity from numa_stats
  sched/numa: Modify migrate_swap to accept additional params
  sched/numa: Stop multiple tasks from moving to the cpu at the same time
  sched/numa: Restrict migrating in parallel to the same node.
  sched/numa: Remove numa_has_capacity
  mm/migrate: Use xchg instead of spinlock
  sched/numa: Updation of scan period need not be in lock
  sched/numa: Use group_weights to identify if migration degrades locality
  sched/numa: Detect if node actively handling migration
  sched/numa: Pass destination cpu as a parameter to migrate_task_rq
  sched/numa: Reset scan rate whenever task moves across nodes
  sched/numa: Move task_placement closer to numa_migrate_preferred

 include/linux/mmzone.h  |   4 +-
 include/linux/sched.h   |   1 -
 kernel/sched/core.c     |  11 +-
 kernel/sched/deadline.c |   2 +-
 kernel/sched/debug.c    |   4 +-
 kernel/sched/fair.c     | 325 +++++++++++++++++++++++-------------------------
 kernel/sched/sched.h    |   6 +-
 mm/migrate.c            |  20 ++-
 mm/page_alloc.c         |   2 +-
 9 files changed, 187 insertions(+), 188 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 02/19] sched/numa: Evaluate move once per node
  2018-06-20 17:02 ` [PATCH v2 02/19] sched/numa: Evaluate move once per node Srikar Dronamraju
@ 2018-06-21  9:06   ` Mel Gorman
  2018-07-25 14:24   ` [tip:sched/core] " tip-bot for Srikar Dronamraju
  1 sibling, 0 replies; 53+ messages in thread
From: Mel Gorman @ 2018-06-21  9:06 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Rik van Riel, Thomas Gleixner

On Wed, Jun 20, 2018 at 10:32:43PM +0530, Srikar Dronamraju wrote:
> task_numa_compare() helps choose the best cpu to move or swap the
> selected task. To achieve this task_numa_compare() is called for every
> cpu in the node. Currently it evaluates if the task can be moved/swapped
> for each of the cpus. However the move evaluation is mostly independent
> of the cpu. Evaluating the move logic once per node, provides scope for
> simplifying task_numa_compare().
> 
> Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
> JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
> 16    25705.2     25058.2     -2.51
> 1     74433       72950       -1.99
> 
> Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
> JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
> 8     96589.6     105930      9.670
> 1     181830      178624      -1.76
> 
> (numbers from v1 based on v4.17-rc5)
> Testcase       Time:         Min         Max         Avg      StdDev
> numa01.sh      Real:      440.65      941.32      758.98      189.17
> numa01.sh       Sys:      183.48      320.07      258.42       50.09
> numa01.sh      User:    37384.65    71818.14    60302.51    13798.96
> numa02.sh      Real:       61.24       65.35       62.49        1.49
> numa02.sh       Sys:       16.83       24.18       21.40        2.60
> numa02.sh      User:     5219.59     5356.34     5264.03       49.07
> numa03.sh      Real:      822.04      912.40      873.55       37.35
> numa03.sh       Sys:      118.80      140.94      132.90        7.60
> numa03.sh      User:    62485.19    70025.01    67208.33     2967.10
> numa04.sh      Real:      690.66      872.12      778.49       65.44
> numa04.sh       Sys:      459.26      563.03      494.03       42.39
> numa04.sh      User:    51116.44    70527.20    58849.44     8461.28
> numa05.sh      Real:      418.37      562.28      525.77       54.27
> numa05.sh       Sys:      299.45      481.00      392.49       64.27
> numa05.sh      User:    34115.09    41324.02    39105.30     2627.68
> 
> Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
> numa01.sh      Real:      516.14      892.41      739.84      151.32 	 2.587%
> numa01.sh       Sys:      153.16      192.99      177.70       14.58 	 45.42%
> numa01.sh      User:    39821.04    69528.92    57193.87    10989.48 	 5.435%
> numa02.sh      Real:       60.91       62.35       61.58        0.63 	 1.477%
> numa02.sh       Sys:       16.47       26.16       21.20        3.85 	 0.943%
> numa02.sh      User:     5227.58     5309.61     5265.17       31.04 	 -0.02%
> numa03.sh      Real:      739.07      917.73      795.75       64.45 	 9.776%
> numa03.sh       Sys:       94.46      136.08      109.48       14.58 	 21.39%
> numa03.sh      User:    57478.56    72014.09    61764.48     5343.69 	 8.813%
> numa04.sh      Real:      442.61      715.43      530.31       96.12 	 46.79%
> numa04.sh       Sys:      224.90      348.63      285.61       48.83 	 72.97%
> numa04.sh      User:    35836.84    47522.47    40235.41     3985.26 	 46.26%
> numa05.sh      Real:      386.13      489.17      434.94       43.59 	 20.88%
> numa05.sh       Sys:      144.29      438.56      278.80      105.78 	 40.77%
> numa05.sh      User:    33255.86    36890.82    34879.31     1641.98 	 12.11%
> 
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 04/19] sched/numa: Set preferred_node based on best_cpu
  2018-06-20 17:02 ` [PATCH v2 04/19] sched/numa: Set preferred_node based on best_cpu Srikar Dronamraju
@ 2018-06-21  9:17   ` Mel Gorman
  2018-07-25 14:25   ` [tip:sched/core] " tip-bot for Srikar Dronamraju
  1 sibling, 0 replies; 53+ messages in thread
From: Mel Gorman @ 2018-06-21  9:17 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Rik van Riel, Thomas Gleixner

On Wed, Jun 20, 2018 at 10:32:45PM +0530, Srikar Dronamraju wrote:
> Currently preferred node is set to dst_nid which is the last node in the
> iteration whose group weight or task weight is greater than the current
> node. However it doesn't guarantee that dst_nid has the numa capacity
> to move. It also doesn't guarantee that dst_nid has the best_cpu which
> is the cpu/node ideal for node migration.
> 
> Lets consider faults on a 4 node system with group weight numbers
> in different nodes being in 0 < 1 < 2 < 3 proportion. Consider the task
> is running on 3 and 0 is its preferred node but its capacity is full.
> Consider nodes 1, 2 and 3 have capacity. Then the task should be
> migrated to node 1. Currently the task gets moved to node 2. env.dst_nid
> points to the last node whose faults were greater than current node.
> 
> Modify to set the preferred node based of best_cpu. Earlier setting
> preferred node was skipped if nr_active_nodes is 1. This could result in
> the task being moved out of the preferred node to a random node during
> regular load balancing.
> 
> Also while modifying task_numa_migrate(), use sched_setnuma to set
> preferred node. This ensures out numa accounting is correct.
> 
> Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
> JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
> 16    25122.9     25549.6     1.698
> 1     73850       73190       -0.89
> 
> Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
> JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
> 8     105930      113437      7.08676
> 1     178624      196130      9.80047
> 
> (numbers from v1 based on v4.17-rc5)
> Testcase       Time:         Min         Max         Avg      StdDev
> numa01.sh      Real:      435.78      653.81      534.58       83.20
> numa01.sh       Sys:      121.93      187.18      145.90       23.47
> numa01.sh      User:    37082.81    51402.80    43647.60     5409.75
> numa02.sh      Real:       60.64       61.63       61.19        0.40
> numa02.sh       Sys:       14.72       25.68       19.06        4.03
> numa02.sh      User:     5210.95     5266.69     5233.30       20.82
> numa03.sh      Real:      746.51      808.24      780.36       23.88
> numa03.sh       Sys:       97.26      108.48      105.07        4.28
> numa03.sh      User:    58956.30    61397.05    60162.95     1050.82
> numa04.sh      Real:      465.97      519.27      484.81       19.62
> numa04.sh       Sys:      304.43      359.08      334.68       20.64
> numa04.sh      User:    37544.16    41186.15    39262.44     1314.91
> numa05.sh      Real:      411.57      457.20      433.29       16.58
> numa05.sh       Sys:      230.05      435.48      339.95       67.58
> numa05.sh      User:    33325.54    36896.31    35637.84     1222.64
> 
> Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
> numa01.sh      Real:      506.35      794.46      599.06      104.26 	 -10.76%
> numa01.sh       Sys:      150.37      223.56      195.99       24.94 	 -25.55%
> numa01.sh      User:    43450.69    61752.04    49281.50     6635.33 	 -11.43%
> numa02.sh      Real:       60.33       62.40       61.31        0.90 	 -0.195%
> numa02.sh       Sys:       18.12       31.66       24.28        5.89 	 -21.49%
> numa02.sh      User:     5203.91     5325.32     5260.29       49.98 	 -0.513%
> numa03.sh      Real:      696.47      853.62      745.80       57.28 	 4.6339%
> numa03.sh       Sys:       85.68      123.71       97.89       13.48 	 7.3347%
> numa03.sh      User:    55978.45    66418.63    59254.94     3737.97 	 1.5323%
> numa04.sh      Real:      444.05      514.83      497.06       26.85 	 -2.464%
> numa04.sh       Sys:      230.39      375.79      316.23       48.58 	 5.8343%
> numa04.sh      User:    35403.12    41004.10    39720.80     2163.08 	 -1.153%
> numa05.sh      Real:      423.09      460.41      439.57       13.92 	 -1.428%
> numa05.sh       Sys:      287.38      480.15      369.37       68.52 	 -7.964%
> numa05.sh      User:    34732.12    38016.80    36255.85     1070.51 	 -1.704%
> 
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

Acked-by: Mel Gorman <mgorman@techsingularity.net>

Also, a minor comment below:

> ---
> Changelog v1->v2:
> Fix setting sched_setnuma under !sd pointed by Peter Zijlstra.
> Modify commit message to describe the reason for change.
> 
>  kernel/sched/fair.c | 10 ++++------
>  1 file changed, 4 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 285d7ae..2366fda2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1726,7 +1726,7 @@ static int task_numa_migrate(struct task_struct *p)
>  	 * elsewhere, so there is no point in (re)trying.
>  	 */
>  	if (unlikely(!sd)) {
> -		p->numa_preferred_nid = task_node(p);
> +		sched_setnuma(p, task_node(p));
>  		return -EINVAL;
>  	}
>  

That looks like it had the potential to corrupt the stats managed by
account_numa_enqueue/dequeue :/

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 05/19] sched/numa: Use task faults only if numa_group is not yet setup
  2018-06-20 17:02 ` [PATCH v2 05/19] sched/numa: Use task faults only if numa_group is not yet setup Srikar Dronamraju
@ 2018-06-21  9:38   ` Mel Gorman
  2018-07-25 14:25   ` [tip:sched/core] sched/numa: Use task faults only if numa_group is not yet set up tip-bot for Srikar Dronamraju
  1 sibling, 0 replies; 53+ messages in thread
From: Mel Gorman @ 2018-06-21  9:38 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Rik van Riel, Thomas Gleixner

On Wed, Jun 20, 2018 at 10:32:46PM +0530, Srikar Dronamraju wrote:
> When numa_group faults are available, task_numa_placement only uses
> numa_group faults to evaluate preferred node. However it still accounts
> task faults and even evaluates the preferred node just based on task
> faults just to discard it in favour of preferred node chosen on the
> basis of numa_group.
> 
> Instead use task faults only if numa_group is not set.
> 
> Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
> JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
> 16    25549.6     25215.7     -1.30
> 1     73190       72107       -1.47
> 
> Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
> JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
> 8     113437      113372      -0.05
> 1     196130      177403      -9.54
> 
> (numbers from v1 based on v4.17-rc5)
> Testcase       Time:         Min         Max         Avg      StdDev
> numa01.sh      Real:      506.35      794.46      599.06      104.26
> numa01.sh       Sys:      150.37      223.56      195.99       24.94
> numa01.sh      User:    43450.69    61752.04    49281.50     6635.33
> numa02.sh      Real:       60.33       62.40       61.31        0.90
> numa02.sh       Sys:       18.12       31.66       24.28        5.89
> numa02.sh      User:     5203.91     5325.32     5260.29       49.98
> numa03.sh      Real:      696.47      853.62      745.80       57.28
> numa03.sh       Sys:       85.68      123.71       97.89       13.48
> numa03.sh      User:    55978.45    66418.63    59254.94     3737.97
> numa04.sh      Real:      444.05      514.83      497.06       26.85
> numa04.sh       Sys:      230.39      375.79      316.23       48.58
> numa04.sh      User:    35403.12    41004.10    39720.80     2163.08
> numa05.sh      Real:      423.09      460.41      439.57       13.92
> numa05.sh       Sys:      287.38      480.15      369.37       68.52
> numa05.sh      User:    34732.12    38016.80    36255.85     1070.51
> 
> Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
> numa01.sh      Real:      478.45      565.90      515.11       30.87 	 16.29%
> numa01.sh       Sys:      207.79      271.04      232.94       21.33 	 -15.8%
> numa01.sh      User:    39763.93    47303.12    43210.73     2644.86 	 14.04%
> numa02.sh      Real:       60.00       61.46       60.78        0.49 	 0.871%
> numa02.sh       Sys:       15.71       25.31       20.69        3.42 	 17.35%
> numa02.sh      User:     5175.92     5265.86     5235.97       32.82 	 0.464%
> numa03.sh      Real:      776.42      834.85      806.01       23.22 	 -7.47%
> numa03.sh       Sys:      114.43      128.75      121.65        5.49 	 -19.5%
> numa03.sh      User:    60773.93    64855.25    62616.91     1576.39 	 -5.36%
> numa04.sh      Real:      456.93      511.95      482.91       20.88 	 2.930%
> numa04.sh       Sys:      178.09      460.89      356.86       94.58 	 -11.3%
> numa04.sh      User:    36312.09    42553.24    39623.21     2247.96 	 0.246%
> numa05.sh      Real:      393.98      493.48      436.61       35.59 	 0.677%
> numa05.sh       Sys:      164.49      329.15      265.87       61.78 	 38.92%
> numa05.sh      User:    33182.65    36654.53    35074.51     1187.71 	 3.368%
> 
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 13/19] mm/migrate: Use xchg instead of spinlock
  2018-06-20 17:02 ` [PATCH v2 13/19] mm/migrate: Use xchg instead of spinlock Srikar Dronamraju
@ 2018-06-21  9:51   ` Mel Gorman
  2018-07-23 10:54   ` Peter Zijlstra
  1 sibling, 0 replies; 53+ messages in thread
From: Mel Gorman @ 2018-06-21  9:51 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Rik van Riel, Thomas Gleixner

On Wed, Jun 20, 2018 at 10:32:54PM +0530, Srikar Dronamraju wrote:
> Currently resetting the migrate rate limit is under a spinlock.
> The spinlock will only serialize the migrate rate limiting and something
> similar can actually be achieved by a simpler xchg.
> 
> Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
> JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
> 16    25804.1     25355.9     -1.73
> 1     73413       72812       -0.81
> 
> Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
> JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
> 8     101748      110199      8.30
> 1     170818      176303      3.21
> 
> (numbers from v1 based on v4.17-rc5)
> Testcase       Time:         Min         Max         Avg      StdDev
> numa01.sh      Real:      435.67      707.28      527.49       97.85
> numa01.sh       Sys:       76.41      231.19      162.49       56.13
> numa01.sh      User:    38247.36    59033.52    45129.31     7642.69
> numa02.sh      Real:       60.35       62.09       61.09        0.69
> numa02.sh       Sys:       15.01       30.20       20.64        5.56
> numa02.sh      User:     5195.93     5294.82     5240.99       40.55
> numa03.sh      Real:      752.04      919.89      836.81       63.29
> numa03.sh       Sys:      115.10      133.35      125.46        7.78
> numa03.sh      User:    58736.44    70084.26    65103.67     4416.10
> numa04.sh      Real:      418.43      709.69      512.53      104.17
> numa04.sh       Sys:      242.99      370.47      297.39       42.20
> numa04.sh      User:    34916.14    48429.54    38955.65     4928.05
> numa05.sh      Real:      379.27      434.05      403.70       17.79
> numa05.sh       Sys:      145.94      344.50      268.72       68.53
> numa05.sh      User:    32679.32    35449.75    33989.10      913.19
> 
> Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
> numa01.sh      Real:      490.04      774.86      596.26       96.46 	 -11.5%
> numa01.sh       Sys:      151.52      242.88      184.82       31.71 	 -12.0%
> numa01.sh      User:    41418.41    60844.59    48776.09     6564.27 	 -7.47%
> numa02.sh      Real:       60.14       62.94       60.98        1.00 	 0.180%
> numa02.sh       Sys:       16.11       30.77       21.20        5.28 	 -2.64%
> numa02.sh      User:     5184.33     5311.09     5228.50       44.24 	 0.238%
> numa03.sh      Real:      790.95      856.35      826.41       24.11 	 1.258%
> numa03.sh       Sys:      114.93      118.85      117.05        1.63 	 7.184%
> numa03.sh      User:    60990.99    64959.28    63470.43     1415.44 	 2.573%
> numa04.sh      Real:      434.37      597.92      504.87       59.70 	 1.517%
> numa04.sh       Sys:      237.63      397.40      289.74       55.98 	 2.640%
> numa04.sh      User:    34854.87    41121.83    38572.52     2615.84 	 0.993%
> numa05.sh      Real:      386.77      448.90      417.22       22.79 	 -3.24%
> numa05.sh       Sys:      149.23      379.95      303.04       79.55 	 -11.3%
> numa05.sh      User:    32951.76    35959.58    34562.18     1034.05 	 -1.65%
> 
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

Acked-by: Mel Gorman <mgorman@techsingularity.net>

However, I'm actively considering removing rate limiting entirely. It
has been reported elsewhere that it was a limiting factor for some
workloads. When it was introduced, it was to avoid worst-case migration
storms but I believe that those ping-pong style problems should now be
ok.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 14/19] sched/numa: Updation of scan period need not be in lock
  2018-06-20 17:02 ` [PATCH v2 14/19] sched/numa: Updation of scan period need not be in lock Srikar Dronamraju
@ 2018-06-21  9:51   ` Mel Gorman
  2018-07-25 14:29   ` [tip:sched/core] sched/numa: Update the scan period without holding the numa_group lock tip-bot for Srikar Dronamraju
  1 sibling, 0 replies; 53+ messages in thread
From: Mel Gorman @ 2018-06-21  9:51 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Rik van Riel, Thomas Gleixner

On Wed, Jun 20, 2018 at 10:32:55PM +0530, Srikar Dronamraju wrote:
> The metrics for updating scan periods are local or task specific.
> Currently this update happens under the numa_group lock, which seems
> unnecessary. Hence move this update outside the lock.
> 
> Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
> JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
> 16    25355.9     25645.4     1.141
> 1     72812       72142       -0.92
> 
> Reviewed-by: Rik van Riel <riel@surriel.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 18/19] sched/numa: Reset scan rate whenever task moves across nodes
  2018-06-20 17:02 ` [PATCH v2 18/19] sched/numa: Reset scan rate whenever task moves across nodes Srikar Dronamraju
@ 2018-06-21 10:05   ` Mel Gorman
  2018-07-04 11:19     ` Srikar Dronamraju
  0 siblings, 1 reply; 53+ messages in thread
From: Mel Gorman @ 2018-06-21 10:05 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Rik van Riel, Thomas Gleixner

On Wed, Jun 20, 2018 at 10:32:59PM +0530, Srikar Dronamraju wrote:
> @@ -6668,6 +6662,19 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu __maybe_unus
>  
>  	/* We have migrated, no longer consider this task hot */
>  	p->se.exec_start = 0;
> +
> +#ifdef CONFIG_NUMA_BALANCING
> +	if (!p->mm || (p->flags & PF_EXITING))
> +		return;
> +
> +	if (p->numa_faults) {
> +		int src_nid = cpu_to_node(task_cpu(p));
> +		int dst_nid = cpu_to_node(new_cpu);
> +
> +		if (src_nid != dst_nid)
> +			p->numa_scan_period = task_scan_start(p);
> +	}
> +#endif
>  }
>  

We talked about this before but I would at least suggest that you not
reset the scan if moving to the preferred node or if the node movement
has nothing to do with the preferred nid. e.g.

	/*
	 * Ignore if the migration is not changing node, if it is migrating to
	 * the preferred node or moving between two nodes that are not preferred
	 */

	if (p->numa_faults) {
		int src_nid = cpu_to_node(task_cpu(p));
		int dst_nid = cpu_to_node(new_cpu);

		if (src_nid == dst_nid || dst_nid == p->numa_preferred_nid ||
		    (p->numa_preferred_nid != -1 && src_nid != p->numa_preferred_nid))
			return;

		p->numa_scan_period = task_scan_start(p);

Note too that the next scan can be an arbitrary amount of time in the
future. Consider as an alternative to schedule an immediate scan instead
of adjusting the rate with

		p->mm->numa_next_scan = jiffies;

That might be less harmful in terms of overhead while still collecting
some data in the short-term.
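
Putting the two together, the shape I have in mind is something like
(untested, just to illustrate):

	if (p->numa_faults) {
		int src_nid = cpu_to_node(task_cpu(p));
		int dst_nid = cpu_to_node(new_cpu);

		/*
		 * Ignore if not changing node, if migrating to the
		 * preferred node, or if moving between two nodes that
		 * are not the preferred one.
		 */
		if (src_nid == dst_nid || dst_nid == p->numa_preferred_nid ||
		    (p->numa_preferred_nid != -1 && src_nid != p->numa_preferred_nid))
			return;

		/* Kick an immediate scan instead of bumping the scan rate. */
		p->mm->numa_next_scan = jiffies;
	}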

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 19/19] sched/numa: Move task_placement closer to numa_migrate_preferred
  2018-06-20 17:03 ` [PATCH v2 19/19] sched/numa: Move task_placement closer to numa_migrate_preferred Srikar Dronamraju
@ 2018-06-21 10:06   ` Mel Gorman
  2018-07-25 14:30   ` [tip:sched/core] sched/numa: Move task_numa_placement() closer to numa_migrate_preferred() tip-bot for Srikar Dronamraju
  1 sibling, 0 replies; 53+ messages in thread
From: Mel Gorman @ 2018-06-21 10:06 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Rik van Riel, Thomas Gleixner

On Wed, Jun 20, 2018 at 10:33:00PM +0530, Srikar Dronamraju wrote:
> numa_migrate_preferred is called periodically or when the task's preferred
> node changes. Preferred node evaluations happen once per scan sequence.
> 
> If the scan completes just after the periodic numa migration,
> then we try to migrate to the preferred node, and the preferred node might
> then change, needing yet another migration.
> 
> Avoid this by checking for scan sequence completion only when checking
> for periodic migration.
> 
> Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
> JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
> 16    25862.6     26158.1     1.14258
> 1     74357       72725       -2.19482
> 
> Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
> JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
> 8     117019      113992      -2.58
> 1     179095      174947      -2.31
> 
> (numbers from v1 based on v4.17-rc5)
> Testcase       Time:         Min         Max         Avg      StdDev
> numa01.sh      Real:      449.46      770.77      615.22      101.70
> numa01.sh       Sys:      132.72      208.17      170.46       24.96
> numa01.sh      User:    39185.26    60290.89    50066.76     6807.84
> numa02.sh      Real:       60.85       61.79       61.28        0.37
> numa02.sh       Sys:       15.34       24.71       21.08        3.61
> numa02.sh      User:     5204.41     5249.85     5231.21       17.60
> numa03.sh      Real:      785.50      916.97      840.77       44.98
> numa03.sh       Sys:      108.08      133.60      119.43        8.82
> numa03.sh      User:    61422.86    70919.75    64720.87     3310.61
> numa04.sh      Real:      429.57      587.37      480.80       57.40
> numa04.sh       Sys:      240.61      321.97      290.84       33.58
> numa04.sh      User:    34597.65    40498.99    37079.48     2060.72
> numa05.sh      Real:      392.09      431.25      414.65       13.82
> numa05.sh       Sys:      229.41      372.48      297.54       53.14
> numa05.sh      User:    33390.86    34697.49    34222.43      556.42
> 
> 
> Testcase       Time:         Min         Max         Avg      StdDev 	%Change
> numa01.sh      Real:      424.63      566.18      498.12       59.26 	 23.50%
> numa01.sh       Sys:      160.19      256.53      208.98       37.02 	 -18.4%
> numa01.sh      User:    37320.00    46225.58    42001.57     3482.45 	 19.20%
> numa02.sh      Real:       60.17       62.47       60.91        0.85 	 0.607%
> numa02.sh       Sys:       15.30       22.82       17.04        2.90 	 23.70%
> numa02.sh      User:     5202.13     5255.51     5219.08       20.14 	 0.232%
> numa03.sh      Real:      823.91      844.89      833.86        8.46 	 0.828%
> numa03.sh       Sys:      130.69      148.29      140.47        6.21 	 -14.9%
> numa03.sh      User:    62519.15    64262.20    63613.38      620.05 	 1.740%
> numa04.sh      Real:      515.30      603.74      548.56       30.93 	 -12.3%
> numa04.sh       Sys:      459.73      525.48      489.18       21.63 	 -40.5%
> numa04.sh      User:    40561.96    44919.18    42047.87     1526.85 	 -11.8%
> numa05.sh      Real:      396.58      454.37      421.13       19.71 	 -1.53%
> numa05.sh       Sys:      208.72      422.02      348.90       73.60 	 -14.7%
> numa05.sh      User:    33124.08    36109.35    34846.47     1089.74 	 -1.79%
> 
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>

Acked-by: Mel Gorman <mgorman@techsingularity.net>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 18/19] sched/numa: Reset scan rate whenever task moves across nodes
  2018-06-21 10:05   ` Mel Gorman
@ 2018-07-04 11:19     ` Srikar Dronamraju
  0 siblings, 0 replies; 53+ messages in thread
From: Srikar Dronamraju @ 2018-07-04 11:19 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Peter Zijlstra, LKML, Rik van Riel, Thomas Gleixner

> We talked about this before but I would at least suggest that you not
> reset the scan if moving to the preferred node or if the node movement
> has nothing to do with the preferred nid. e.g.
> 

I understand your concern and am okay to drop this patch for now.
I will try to rework it and come back later.

> 	/*
> 	 * Ignore if the migration is not changing node, if it is migrating to
> 	 * the preferred node or moving between two nodes that are not preferred
> 	 */
> 
> 	if (p->numa_faults) {
> 		int src_nid = cpu_to_node(task_cpu(p));
> 		int dst_nid = cpu_to_node(new_cpu);
> 
> 		if (src_nid == dst_nid || dst_nid == p->numa_preferred_nid ||
> 		    (p->numa_preferred_nid != -1 && src_nid != p->numa_preferred_nid))

Without this change/patch, we used to reduce the scan period (i.e. scan
more often) whenever the numa balancer moved a task to a preferred node.
That was done to verify that the task movement was correct. The check
(dst_nid == p->numa_preferred_nid) would skip that verification. I
thought the whole point of reducing the scan period was to correct
ourselves quickly if we chose a wrong node.

> 			return;
> 
> 		p->numa_scan_period = task_scan_start(p);
> 
> Note too that the next scan can be an arbitrary amount of time in the
> future. Consider as an alternative to schedule an immediate scan instead
> of adjusting the rate with
> 
> 		p->mm->numa_next_scan = jiffies;
> 

I will try to work along these lines. Though the scan happens
immediately, the task placement will not happen immediately, so I am not
sure how much it would help.

> That might be less harmful in terms of overhead while still collecting
> some data in the short-term.
> 

-- 
Thanks and Regards
Srikar


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 11/19] sched/numa: Restrict migrating in parallel to the same node.
  2018-06-20 17:02 ` [PATCH v2 11/19] sched/numa: Restrict migrating in parallel to the same node Srikar Dronamraju
@ 2018-07-23 10:38   ` Peter Zijlstra
  2018-07-23 11:16     ` Srikar Dronamraju
  0 siblings, 1 reply; 53+ messages in thread
From: Peter Zijlstra @ 2018-07-23 10:38 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner

On Wed, Jun 20, 2018 at 10:32:52PM +0530, Srikar Dronamraju wrote:
> Since task migration under numa balancing can happen in parallel, more
> than one task might choose to move to the same node at the same time.
> This can cause load imbalances at the node level.
> 
> The problem is more likely if there are more cores per node or more
> nodes in the system.
> 
> Use a per-node variable to indicate whether a task migration
> to the node under numa balancing is currently active.
> This per-node variable does not track swapping of tasks.


> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 50c7727..87fb20e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1478,11 +1478,22 @@ struct task_numa_env {
>  static void task_numa_assign(struct task_numa_env *env,
>  			     struct task_struct *p, long imp)
>  {
> +	pg_data_t *pgdat = NODE_DATA(cpu_to_node(env->dst_cpu));
>  	struct rq *rq = cpu_rq(env->dst_cpu);
>  
>  	if (xchg(&rq->numa_migrate_on, 1))
>  		return;
>  
> +	if (!env->best_task && env->best_cpu != -1)
> +		WRITE_ONCE(pgdat->active_node_migrate, 0);
> +
> +	if (!p) {
> +		if (xchg(&pgdat->active_node_migrate, 1)) {
> +			WRITE_ONCE(rq->numa_migrate_on, 0);
> +			return;
> +		}
> +	}
> +
>  	if (env->best_cpu != -1) {
>  		rq = cpu_rq(env->best_cpu);
>  		WRITE_ONCE(rq->numa_migrate_on, 0);


Urgh, that's pretty magical code. And it doesn't even have a comment.

For instance, I cannot tell why we clear that active_node_migrate thing
right there.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 13/19] mm/migrate: Use xchg instead of spinlock
  2018-06-20 17:02 ` [PATCH v2 13/19] mm/migrate: Use xchg instead of spinlock Srikar Dronamraju
  2018-06-21  9:51   ` Mel Gorman
@ 2018-07-23 10:54   ` Peter Zijlstra
  2018-07-23 11:20     ` Srikar Dronamraju
  1 sibling, 1 reply; 53+ messages in thread
From: Peter Zijlstra @ 2018-07-23 10:54 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner

On Wed, Jun 20, 2018 at 10:32:54PM +0530, Srikar Dronamraju wrote:
> Currently resetting the migrate rate limit is under a spinlock.
> The spinlock will only serialize the migrate rate limiting and something
> similar can actually be achieved by a simpler xchg.

You're again not explaining things right. The xchg isn't simpler in any
way. It just happens to be faster, esp. so on PPC.

> diff --git a/mm/migrate.c b/mm/migrate.c
> index 8c0af0f..c774990 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1868,17 +1868,25 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
>  static bool numamigrate_update_ratelimit(pg_data_t *pgdat,
>  					unsigned long nr_pages)
>  {
> +	unsigned long next_window, interval;
> +
> +	next_window = READ_ONCE(pgdat->numabalancing_migrate_next_window);
> +	interval = msecs_to_jiffies(migrate_interval_millisecs);
> +
>  	/*
>  	 * Rate-limit the amount of data that is being migrated to a node.
>  	 * Optimal placement is no good if the memory bus is saturated and
>  	 * all the time is being spent migrating!
>  	 */
> +	if (time_after(jiffies, next_window)) {
> +		if (xchg(&pgdat->numabalancing_migrate_nr_pages, 0)) {
> +			do {
> +				next_window += interval;
> +			} while (unlikely(time_after(jiffies, next_window)));
> +
> +			WRITE_ONCE(pgdat->numabalancing_migrate_next_window,
> +							       next_window);
> +		}
>  	}
>  	if (pgdat->numabalancing_migrate_nr_pages > ratelimit_pages) {
>  		trace_mm_numa_migrate_ratelimit(current, pgdat->node_id,

If you maybe write that like:

	if (time_after(jiffies, next_window) &&
	    xchg(&pgdat->numabalancing_migrate_nr_pages, 0UL)) {

		do {
			next_window += interval;
		} while (unlikely(time_after(jiffies, next_window)));

		WRITE_ONCE(pgdat->numabalancing_migrate_next_window, next_window);
	}

Then you avoid an indent level and line-wrap, resulting imo easier to
read code.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 11/19] sched/numa: Restrict migrating in parallel to the same node.
  2018-07-23 10:38   ` Peter Zijlstra
@ 2018-07-23 11:16     ` Srikar Dronamraju
  0 siblings, 0 replies; 53+ messages in thread
From: Srikar Dronamraju @ 2018-07-23 11:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner

* Peter Zijlstra <peterz@infradead.org> [2018-07-23 12:38:30]:

> On Wed, Jun 20, 2018 at 10:32:52PM +0530, Srikar Dronamraju wrote:
> > Since task migration under numa balancing can happen in parallel, more
> > than one task might choose to move to the same node at the same time.
> > This can cause load imbalances at the node level.
> > 
> > The problem is more likely if there are more cores per node or more
> > nodes in the system.
> > 
> > Use a per-node variable to indicate whether a task migration
> > to the node under numa balancing is currently active.
> > This per-node variable does not track swapping of tasks.
> 
> 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 50c7727..87fb20e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1478,11 +1478,22 @@ struct task_numa_env {
> >  static void task_numa_assign(struct task_numa_env *env,
> >  			     struct task_struct *p, long imp)
> >  {
> > +	pg_data_t *pgdat = NODE_DATA(cpu_to_node(env->dst_cpu));
> >  	struct rq *rq = cpu_rq(env->dst_cpu);
> >  
> >  	if (xchg(&rq->numa_migrate_on, 1))
> >  		return;
> >  
> > +	if (!env->best_task && env->best_cpu != -1)
> > +		WRITE_ONCE(pgdat->active_node_migrate, 0);
> > +
> > +	if (!p) {
> > +		if (xchg(&pgdat->active_node_migrate, 1)) {
> > +			WRITE_ONCE(rq->numa_migrate_on, 0);
> > +			return;
> > +		}
> > +	}
> > +
> >  	if (env->best_cpu != -1) {
> >  		rq = cpu_rq(env->best_cpu);
> >  		WRITE_ONCE(rq->numa_migrate_on, 0);
> 
> 
> Urgh, that's prertty magical code. And it doesn't even have a comment.
> 
> For isntance, I cannot tell why we clear that active_node_migrate thing
> right there.
> 

active_node_migrate doesn't track swaps, it only tracks task movement to
a node. Here the task first finds a cpu that is idle, so it would have
set pgdat->active_node_migrate. At that point env->best_task is NULL but
env->best_cpu is set.

Next the task might find another cpu where a swap is more beneficial
than a move, i.e. there is a pair of tasks to be swapped. Now
we have to reset pgdat->active_node_migrate. The test on best_task and
best_cpu tells us whether we had set active_node_migrate.
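
To make that concrete, here is the same hunk again with comments, only
as a sketch of the intent (not a tested respin):

	/*
	 * An earlier pass picked an idle cpu (best_cpu set, best_task
	 * NULL), i.e. a move, and so took the node-level claim.  We
	 * are about to record a different candidate, so drop that
	 * claim before possibly retaking it below.
	 */
	if (!env->best_task && env->best_cpu != -1)
		WRITE_ONCE(pgdat->active_node_migrate, 0);

	/*
	 * p == NULL means a plain move (no swap partner).  Only moves
	 * are throttled per node; if another move to this node is
	 * already in flight, back off and also release the rq claim
	 * taken via rq->numa_migrate_on just above.
	 */
	if (!p) {
		if (xchg(&pgdat->active_node_migrate, 1)) {
			WRITE_ONCE(rq->numa_migrate_on, 0);
			return;
		}
	}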

-- 
Thanks and Regards
Srikar Dronamraju


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 13/19] mm/migrate: Use xchg instead of spinlock
  2018-07-23 10:54   ` Peter Zijlstra
@ 2018-07-23 11:20     ` Srikar Dronamraju
  2018-07-23 14:04       ` Peter Zijlstra
  0 siblings, 1 reply; 53+ messages in thread
From: Srikar Dronamraju @ 2018-07-23 11:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner

> If you maybe write that like:
> 
> 	if (time_after(jiffies, next_window) &&
> 	    xchg(&pgdat->numabalancing_migrate_nr_pages, 0UL)) {
> 
> 		do {
> 			next_window += interval;
> 		} while (unlikely(time_after(jiffies, next_window)));
> 
> 		WRITE_ONCE(pgdat->numabalancing_migrate_next_window, next_window);
> 	}
> 
> Then you avoid an indent level and line-wrap, resulting imo easier to
> read code.
> 

Okay will do.


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 00/19]  Fixes for sched/numa_balancing
  2018-06-20 17:02 [PATCH v2 00/19] Fixes for sched/numa_balancing Srikar Dronamraju
                   ` (19 preceding siblings ...)
  2018-06-20 17:03 ` [PATCH v2 00/19] Fixes for sched/numa_balancing Srikar Dronamraju
@ 2018-07-23 13:57 ` Peter Zijlstra
  2018-07-23 15:09   ` Srikar Dronamraju
  20 siblings, 1 reply; 53+ messages in thread
From: Peter Zijlstra @ 2018-07-23 13:57 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner

On Wed, Jun 20, 2018 at 10:32:41PM +0530, Srikar Dronamraju wrote:
> Srikar Dronamraju (19):
>   sched/numa: Remove redundant field.
>   sched/numa: Evaluate move once per node
>   sched/numa: Simplify load_too_imbalanced
>   sched/numa: Set preferred_node based on best_cpu
>   sched/numa: Use task faults only if numa_group is not yet setup
>   sched/debug: Reverse the order of printing faults
>   sched/numa: Skip nodes that are at hoplimit
>   sched/numa: Remove unused task_capacity from numa_stats
>   sched/numa: Modify migrate_swap to accept additional params
>   sched/numa: Restrict migrating in parallel to the same node.
>   sched/numa: Remove numa_has_capacity
>   sched/numa: Use group_weights to identify if migration degrades locality
>   sched/numa: Move task_placement closer to numa_migrate_preferred

I took the above, but left the below for next time.

>   sched/numa: Stop multiple tasks from moving to the cpu at the same time
>   mm/migrate: Use xchg instead of spinlock
>   sched/numa: Updation of scan period need not be in lock
>   sched/numa: Detect if node actively handling migration
>   sched/numa: Pass destination cpu as a parameter to migrate_task_rq
>   sched/numa: Reset scan rate whenever task moves across nodes

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 13/19] mm/migrate: Use xchg instead of spinlock
  2018-07-23 11:20     ` Srikar Dronamraju
@ 2018-07-23 14:04       ` Peter Zijlstra
  0 siblings, 0 replies; 53+ messages in thread
From: Peter Zijlstra @ 2018-07-23 14:04 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner

On Mon, Jul 23, 2018 at 04:20:32AM -0700, Srikar Dronamraju wrote:
> > If you maybe write that like:
> > 
> > 	if (time_after(jiffies, next_window) &&
> > 	    xchg(&pgdat->numabalancing_migrate_nr_pages, 0UL)) {
> > 
> > 		do {
> > 			next_window += interval;
> > 		} while (unlikely(time_after(jiffies, next_window)));
> > 
> > 		WRITE_ONCE(pgdat->numabalancing_migrate_next_window, next_window);
> > 	}
> > 
> > Then you avoid an indent level and line-wrap, resulting imo easier to
> > read code.
> > 
> 
> Okay will do.

FWIW, that code seems to rely on @nr_pages != 0, otherwise you can have
the xchg fail even though time_after is true.

Probably not a problem, but it is a semantic change vs the spinlock.
Also, the spinlock thing could probably have been changed to a trylock and
you'd have seen similar 'gains' I suppose.

Another difference vs the lock+unlock is that you lost the release
ordering. Again, probably not a big deal, but it does make the whole
things a little dodgy.
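
For comparison, the trylock variant I mean would look roughly like the
below (completely untested, assuming the existing
pgdat->numabalancing_migrate_lock is kept around); it also keeps the
release ordering from the unlock:

	if (time_after(jiffies, pgdat->numabalancing_migrate_next_window) &&
	    spin_trylock(&pgdat->numabalancing_migrate_lock)) {
		/* Contending CPUs simply skip the window update. */
		pgdat->numabalancing_migrate_nr_pages = 0;
		do {
			pgdat->numabalancing_migrate_next_window +=
				msecs_to_jiffies(migrate_interval_millisecs);
		} while (time_after(jiffies,
			 pgdat->numabalancing_migrate_next_window));
		spin_unlock(&pgdat->numabalancing_migrate_lock);
	}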

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 00/19]  Fixes for sched/numa_balancing
  2018-07-23 13:57 ` Peter Zijlstra
@ 2018-07-23 15:09   ` Srikar Dronamraju
  2018-07-23 15:21     ` Peter Zijlstra
  2018-07-23 15:33     ` Rik van Riel
  0 siblings, 2 replies; 53+ messages in thread
From: Srikar Dronamraju @ 2018-07-23 15:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner

* Peter Zijlstra <peterz@infradead.org> [2018-07-23 15:57:00]:

> On Wed, Jun 20, 2018 at 10:32:41PM +0530, Srikar Dronamraju wrote:
> > Srikar Dronamraju (19):
> >   sched/numa: Remove redundant field.
> >   sched/numa: Evaluate move once per node
> >   sched/numa: Simplify load_too_imbalanced
> >   sched/numa: Set preferred_node based on best_cpu
> >   sched/numa: Use task faults only if numa_group is not yet setup
> >   sched/debug: Reverse the order of printing faults
> >   sched/numa: Skip nodes that are at hoplimit
> >   sched/numa: Remove unused task_capacity from numa_stats
> >   sched/numa: Modify migrate_swap to accept additional params
> >   sched/numa: Restrict migrating in parallel to the same node.
> >   sched/numa: Remove numa_has_capacity
> >   sched/numa: Use group_weights to identify if migration degrades locality
> >   sched/numa: Move task_placement closer to numa_migrate_preferred
>
> I took the above, but left the below for next time.


>
> >   sched/numa: Stop multiple tasks from moving to the cpu at the same time

This patch has go-ahead from Mel and Rik and no outstanding comments.

In my analysis, I did find a lot of cases where the same cpu ended up
being the target. + I am not sure you can apply "sched/numa: Restrict
migrating in parallel to the same node" cleanly without this patch.

So I am a bit confused. If possible, please clarify.

> >   mm/migrate: Use xchg instead of spinlock

Will try with spin_trylock and get back.

> >   sched/numa: Updation of scan period need not be in lock

I didn't see any comments for this apart from an ack from Rik.
+ I thought it was trivial and shouldn't have any side-effects.

> >   sched/numa: Detect if node actively handling migration
> >   sched/numa: Pass destination cpu as a parameter to migrate_task_rq
> >   sched/numa: Reset scan rate whenever task moves across nodes
>


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 00/19]  Fixes for sched/numa_balancing
  2018-07-23 15:09   ` Srikar Dronamraju
@ 2018-07-23 15:21     ` Peter Zijlstra
  2018-07-23 16:29       ` Srikar Dronamraju
  2018-07-23 15:33     ` Rik van Riel
  1 sibling, 1 reply; 53+ messages in thread
From: Peter Zijlstra @ 2018-07-23 15:21 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner

On Mon, Jul 23, 2018 at 08:09:55AM -0700, Srikar Dronamraju wrote:

> > >   sched/numa: Stop multiple tasks from moving to the cpu at the same time
> 
> This patch has go-ahead from Mel and Rik and no outstanding comments.

I left it out because it's part of the big xchg() mess.

In particular:

+       if (xchg(&rq->numa_migrate_on, 1))
+               return;
+
+       if (env->best_cpu != -1) {
+               rq = cpu_rq(env->best_cpu);
+               WRITE_ONCE(rq->numa_migrate_on, 0);
+       }

I'm again confused by clearing numa_migrate_on at this point..

> > >   sched/numa: Updation of scan period need not be in lock
> 
> I didn't see any comments for this apart from an ack from Rik.
> + I thought it was trivial and shouldn't have any side-effects.

Oh, my bad I actually have this one.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 00/19]  Fixes for sched/numa_balancing
  2018-07-23 15:09   ` Srikar Dronamraju
  2018-07-23 15:21     ` Peter Zijlstra
@ 2018-07-23 15:33     ` Rik van Riel
  1 sibling, 0 replies; 53+ messages in thread
From: Rik van Riel @ 2018-07-23 15:33 UTC (permalink / raw)
  To: Srikar Dronamraju, Peter Zijlstra
  Cc: Ingo Molnar, LKML, Mel Gorman, Thomas Gleixner

[-- Attachment #1: Type: text/plain, Size: 879 bytes --]

On Mon, 2018-07-23 at 08:09 -0700, Srikar Dronamraju wrote:
> * Peter Zijlstra <peterz@infradead.org> [2018-07-23 15:57:00]:
> 
> > On Wed, Jun 20, 2018 at 10:32:41PM +0530, Srikar Dronamraju wrote:
> > > Srikar Dronamraju (19):
> > 
> > >   sched/numa: Stop multiple tasks from moving to the cpu at the
> > > same time
> 
> This patch has go-ahead from Mel and Rik and no outstanding comments.
> 
> In my analysis, I did find a lot of cases where the same cpu ended up
> being the target. + I am not sure you can apply "sched/numa: Restrict
> migrating in parallel to the same node" cleanly without this patch.
> 
> So I am a bit confused. If possible, please clarify.

I believe that patch fixes a real issue, but it would
be nice if the code could be documented better so it
does not confuse people who look at the code later.


-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 00/19]  Fixes for sched/numa_balancing
  2018-07-23 15:21     ` Peter Zijlstra
@ 2018-07-23 16:29       ` Srikar Dronamraju
  2018-07-23 16:47         ` Peter Zijlstra
  0 siblings, 1 reply; 53+ messages in thread
From: Srikar Dronamraju @ 2018-07-23 16:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner

* Peter Zijlstra <peterz@infradead.org> [2018-07-23 17:21:47]:

> On Mon, Jul 23, 2018 at 08:09:55AM -0700, Srikar Dronamraju wrote:
> 
> > > >   sched/numa: Stop multiple tasks from moving to the cpu at the same time
> > 
> > This patch has go-ahead from Mel and Rik and no outstanding comments.
> 
> I left it out because it's part of the big xchg() mess.
> 
> In particular:
> 
> +       if (xchg(&rq->numa_migrate_on, 1))
> +               return;
> +
> +       if (env->best_cpu != -1) {
> +               rq = cpu_rq(env->best_cpu);
> +               WRITE_ONCE(rq->numa_migrate_on, 0);
> +       }
> 
> I'm again confused by clearing numa_migrate_on at this point..

First the task chooses a cpu to swap/migrate to, records it as best_cpu
and also sets that cpu's numa_migrate_on. Next it finds a better cpu to
swap/move to. If the task is able to move to the better cpu, then it
should clear numa_migrate_on on the previous best_cpu.

If we don't reset numa_migrate_on on finding a better cpu,
numa_migrate_on stays set for the previous cpu, causing the previous cpu
to never again be a target of numa balancing.
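
Or, as a commented sketch of that hunk (illustrative only):

	/* Claim the new destination cpu; bail if someone else holds it. */
	if (xchg(&rq->numa_migrate_on, 1))
		return;

	/*
	 * We had already claimed an earlier best_cpu.  Now that a
	 * better destination is claimed, release the old one so other
	 * numa-balance attempts can target it again.
	 */
	if (env->best_cpu != -1) {
		rq = cpu_rq(env->best_cpu);
		WRITE_ONCE(rq->numa_migrate_on, 0);
	}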


> 
> > > >   sched/numa: Updation of scan period need not be in lock
> > 
> > I didn't see any comments for this apart from an ack from Rik.
> > + I thought it was trivial and shouldn't have any side-effects.
> 
> Oh, my bad I actually have this one.
> 


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 00/19]  Fixes for sched/numa_balancing
  2018-07-23 16:29       ` Srikar Dronamraju
@ 2018-07-23 16:47         ` Peter Zijlstra
  0 siblings, 0 replies; 53+ messages in thread
From: Peter Zijlstra @ 2018-07-23 16:47 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Ingo Molnar, LKML, Mel Gorman, Rik van Riel, Thomas Gleixner

On Mon, Jul 23, 2018 at 09:29:54AM -0700, Srikar Dronamraju wrote:
> * Peter Zijlstra <peterz@infradead.org> [2018-07-23 17:21:47]:
> 
> > On Mon, Jul 23, 2018 at 08:09:55AM -0700, Srikar Dronamraju wrote:
> > 
> > > > >   sched/numa: Stop multiple tasks from moving to the cpu at the same time
> > > 
> > > This patch has go-ahead from Mel and Rik and no outstanding comments.
> > 
> > I left it out because it's part of the big xchg() mess.
> > 
> > In particular:
> > 
> > +       if (xchg(&rq->numa_migrate_on, 1))
> > +               return;
> > +
> > +       if (env->best_cpu != -1) {
> > +               rq = cpu_rq(env->best_cpu);
> > +               WRITE_ONCE(rq->numa_migrate_on, 0);
> > +       }
> > 
> > I'm again confused by clearing numa_migrate_on at this point..
> 
> First the task chooses a cpu to swap/migrate to, records it as best_cpu
> and also sets that cpu's numa_migrate_on. Next it finds a better cpu to
> swap/move to. If the task is able to move to the better cpu, then it
> should clear numa_migrate_on on the previous best_cpu.
> 
> If we don't reset numa_migrate_on on finding a better cpu,
> numa_migrate_on stays set for the previous cpu, causing the previous cpu
> to never again be a target of numa balancing.

Don't tell me, write better patches.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* [tip:sched/core] sched/numa: Remove redundant field
  2018-06-20 17:02 ` [PATCH v2 01/19] sched/numa: Remove redundant field Srikar Dronamraju
@ 2018-07-25 14:23   ` tip-bot for Srikar Dronamraju
  0 siblings, 0 replies; 53+ messages in thread
From: tip-bot for Srikar Dronamraju @ 2018-07-25 14:23 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: riel, torvalds, mingo, tglx, hpa, peterz, mgorman, srikar, linux-kernel

Commit-ID:  6e30396767508101eacec8b93b068e8905e660dc
Gitweb:     https://git.kernel.org/tip/6e30396767508101eacec8b93b068e8905e660dc
Author:     Srikar Dronamraju <srikar@linux.vnet.ibm.com>
AuthorDate: Wed, 20 Jun 2018 22:32:42 +0530
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 25 Jul 2018 11:41:06 +0200

sched/numa: Remove redundant field

'numa_entry' is a struct list_head defined in task_struct, but never used.

No functional change.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-2-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 43731fe51c97..e0f4f56c9310 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1017,7 +1017,6 @@ struct task_struct {
 	u64				last_sum_exec_runtime;
 	struct callback_head		numa_work;
 
-	struct list_head		numa_entry;
 	struct numa_group		*numa_group;
 
 	/*

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [tip:sched/core] sched/numa: Evaluate move once per node
  2018-06-20 17:02 ` [PATCH v2 02/19] sched/numa: Evaluate move once per node Srikar Dronamraju
  2018-06-21  9:06   ` Mel Gorman
@ 2018-07-25 14:24   ` tip-bot for Srikar Dronamraju
  1 sibling, 0 replies; 53+ messages in thread
From: tip-bot for Srikar Dronamraju @ 2018-07-25 14:24 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: hpa, linux-kernel, mingo, mgorman, peterz, srikar, riel, tglx, torvalds

Commit-ID:  305c1fac3225dfa7eeb89bfe91b7335a6edd5172
Gitweb:     https://git.kernel.org/tip/305c1fac3225dfa7eeb89bfe91b7335a6edd5172
Author:     Srikar Dronamraju <srikar@linux.vnet.ibm.com>
AuthorDate: Wed, 20 Jun 2018 22:32:43 +0530
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 25 Jul 2018 11:41:06 +0200

sched/numa: Evaluate move once per node

task_numa_compare() helps choose the best CPU to move or swap the
selected task to. To achieve this, task_numa_compare() is called for
every CPU in the node. Currently it evaluates whether the task can be
moved/swapped for each of the CPUs. However the move evaluation is
mostly independent of the CPU. Evaluating the move logic once per node
provides scope for simplifying task_numa_compare().

Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
16    25705.2     25058.2     -2.51
1     74433       72950       -1.99

Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
8     96589.6     105930      9.670
1     181830      178624      -1.76

(numbers from v1 based on v4.17-rc5)
Testcase       Time:         Min         Max         Avg      StdDev
numa01.sh      Real:      440.65      941.32      758.98      189.17
numa01.sh       Sys:      183.48      320.07      258.42       50.09
numa01.sh      User:    37384.65    71818.14    60302.51    13798.96
numa02.sh      Real:       61.24       65.35       62.49        1.49
numa02.sh       Sys:       16.83       24.18       21.40        2.60
numa02.sh      User:     5219.59     5356.34     5264.03       49.07
numa03.sh      Real:      822.04      912.40      873.55       37.35
numa03.sh       Sys:      118.80      140.94      132.90        7.60
numa03.sh      User:    62485.19    70025.01    67208.33     2967.10
numa04.sh      Real:      690.66      872.12      778.49       65.44
numa04.sh       Sys:      459.26      563.03      494.03       42.39
numa04.sh      User:    51116.44    70527.20    58849.44     8461.28
numa05.sh      Real:      418.37      562.28      525.77       54.27
numa05.sh       Sys:      299.45      481.00      392.49       64.27
numa05.sh      User:    34115.09    41324.02    39105.30     2627.68

Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
numa01.sh      Real:      516.14      892.41      739.84      151.32 	 2.587%
numa01.sh       Sys:      153.16      192.99      177.70       14.58 	 45.42%
numa01.sh      User:    39821.04    69528.92    57193.87    10989.48 	 5.435%
numa02.sh      Real:       60.91       62.35       61.58        0.63 	 1.477%
numa02.sh       Sys:       16.47       26.16       21.20        3.85 	 0.943%
numa02.sh      User:     5227.58     5309.61     5265.17       31.04 	 -0.02%
numa03.sh      Real:      739.07      917.73      795.75       64.45 	 9.776%
numa03.sh       Sys:       94.46      136.08      109.48       14.58 	 21.39%
numa03.sh      User:    57478.56    72014.09    61764.48     5343.69 	 8.813%
numa04.sh      Real:      442.61      715.43      530.31       96.12 	 46.79%
numa04.sh       Sys:      224.90      348.63      285.61       48.83 	 72.97%
numa04.sh      User:    35836.84    47522.47    40235.41     3985.26 	 46.26%
numa05.sh      Real:      386.13      489.17      434.94       43.59 	 20.88%
numa05.sh       Sys:      144.29      438.56      278.80      105.78 	 40.77%
numa05.sh      User:    33255.86    36890.82    34879.31     1641.98 	 12.11%

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-3-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 128 +++++++++++++++++++++++-----------------------------
 1 file changed, 57 insertions(+), 71 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 14c3fddf822a..b10e0663a49e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1580,9 +1580,8 @@ static bool load_too_imbalanced(long src_load, long dst_load,
  * be exchanged with the source task
  */
 static void task_numa_compare(struct task_numa_env *env,
-			      long taskimp, long groupimp)
+			      long taskimp, long groupimp, bool maymove)
 {
-	struct rq *src_rq = cpu_rq(env->src_cpu);
 	struct rq *dst_rq = cpu_rq(env->dst_cpu);
 	struct task_struct *cur;
 	long src_load, dst_load;
@@ -1603,97 +1602,73 @@ static void task_numa_compare(struct task_numa_env *env,
 	if (cur == env->p)
 		goto unlock;
 
+	if (!cur) {
+		if (maymove || imp > env->best_imp)
+			goto assign;
+		else
+			goto unlock;
+	}
+
 	/*
 	 * "imp" is the fault differential for the source task between the
 	 * source and destination node. Calculate the total differential for
 	 * the source task and potential destination task. The more negative
-	 * the value is, the more rmeote accesses that would be expected to
+	 * the value is, the more remote accesses that would be expected to
 	 * be incurred if the tasks were swapped.
 	 */
-	if (cur) {
-		/* Skip this swap candidate if cannot move to the source CPU: */
-		if (!cpumask_test_cpu(env->src_cpu, &cur->cpus_allowed))
-			goto unlock;
+	/* Skip this swap candidate if cannot move to the source cpu */
+	if (!cpumask_test_cpu(env->src_cpu, &cur->cpus_allowed))
+		goto unlock;
 
+	/*
+	 * If dst and source tasks are in the same NUMA group, or not
+	 * in any group then look only at task weights.
+	 */
+	if (cur->numa_group == env->p->numa_group) {
+		imp = taskimp + task_weight(cur, env->src_nid, dist) -
+		      task_weight(cur, env->dst_nid, dist);
 		/*
-		 * If dst and source tasks are in the same NUMA group, or not
-		 * in any group then look only at task weights.
+		 * Add some hysteresis to prevent swapping the
+		 * tasks within a group over tiny differences.
 		 */
-		if (cur->numa_group == env->p->numa_group) {
-			imp = taskimp + task_weight(cur, env->src_nid, dist) -
-			      task_weight(cur, env->dst_nid, dist);
-			/*
-			 * Add some hysteresis to prevent swapping the
-			 * tasks within a group over tiny differences.
-			 */
-			if (cur->numa_group)
-				imp -= imp/16;
-		} else {
-			/*
-			 * Compare the group weights. If a task is all by
-			 * itself (not part of a group), use the task weight
-			 * instead.
-			 */
-			if (cur->numa_group)
-				imp += group_weight(cur, env->src_nid, dist) -
-				       group_weight(cur, env->dst_nid, dist);
-			else
-				imp += task_weight(cur, env->src_nid, dist) -
-				       task_weight(cur, env->dst_nid, dist);
-		}
+		if (cur->numa_group)
+			imp -= imp / 16;
+	} else {
+		/*
+		 * Compare the group weights. If a task is all by itself
+		 * (not part of a group), use the task weight instead.
+		 */
+		if (cur->numa_group && env->p->numa_group)
+			imp += group_weight(cur, env->src_nid, dist) -
+			       group_weight(cur, env->dst_nid, dist);
+		else
+			imp += task_weight(cur, env->src_nid, dist) -
+			       task_weight(cur, env->dst_nid, dist);
 	}
 
-	if (imp <= env->best_imp && moveimp <= env->best_imp)
+	if (imp <= env->best_imp)
 		goto unlock;
 
-	if (!cur) {
-		/* Is there capacity at our destination? */
-		if (env->src_stats.nr_running <= env->src_stats.task_capacity &&
-		    !env->dst_stats.has_free_capacity)
-			goto unlock;
-
-		goto balance;
-	}
-
-	/* Balance doesn't matter much if we're running a task per CPU: */
-	if (imp > env->best_imp && src_rq->nr_running == 1 &&
-			dst_rq->nr_running == 1)
+	if (maymove && moveimp > imp && moveimp > env->best_imp) {
+		imp = moveimp - 1;
+		cur = NULL;
 		goto assign;
+	}
 
 	/*
 	 * In the overloaded case, try and keep the load balanced.
 	 */
-balance:
-	load = task_h_load(env->p);
+	load = task_h_load(env->p) - task_h_load(cur);
+	if (!load)
+		goto assign;
+
 	dst_load = env->dst_stats.load + load;
 	src_load = env->src_stats.load - load;
 
-	if (moveimp > imp && moveimp > env->best_imp) {
-		/*
-		 * If the improvement from just moving env->p direction is
-		 * better than swapping tasks around, check if a move is
-		 * possible. Store a slightly smaller score than moveimp,
-		 * so an actually idle CPU will win.
-		 */
-		if (!load_too_imbalanced(src_load, dst_load, env)) {
-			imp = moveimp - 1;
-			cur = NULL;
-			goto assign;
-		}
-	}
-
-	if (imp <= env->best_imp)
-		goto unlock;
-
-	if (cur) {
-		load = task_h_load(cur);
-		dst_load -= load;
-		src_load += load;
-	}
-
 	if (load_too_imbalanced(src_load, dst_load, env))
 		goto unlock;
 
+assign:
 	/*
 	 * One idle CPU per node is evaluated for a task numa move.
 	 * Call select_idle_sibling to maybe find a better one.
@@ -1709,7 +1684,6 @@ balance:
 		local_irq_enable();
 	}
 
-assign:
 	task_numa_assign(env, cur, imp);
 unlock:
 	rcu_read_unlock();
@@ -1718,15 +1692,27 @@ unlock:
 static void task_numa_find_cpu(struct task_numa_env *env,
 				long taskimp, long groupimp)
 {
+	long src_load, dst_load, load;
+	bool maymove = false;
 	int cpu;
 
+	load = task_h_load(env->p);
+	dst_load = env->dst_stats.load + load;
+	src_load = env->src_stats.load - load;
+
+	/*
+	 * If the improvement from just moving env->p direction is better
+	 * than swapping tasks around, check if a move is possible.
+	 */
+	maymove = !load_too_imbalanced(src_load, dst_load, env);
+
 	for_each_cpu(cpu, cpumask_of_node(env->dst_nid)) {
 		/* Skip this CPU if the source task cannot migrate */
 		if (!cpumask_test_cpu(cpu, &env->p->cpus_allowed))
 			continue;
 
 		env->dst_cpu = cpu;
-		task_numa_compare(env, taskimp, groupimp);
+		task_numa_compare(env, taskimp, groupimp, maymove);
 	}
 }
 

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [tip:sched/core] sched/numa: Simplify load_too_imbalanced()
  2018-06-20 17:02 ` [PATCH v2 03/19] sched/numa: Simplify load_too_imbalanced Srikar Dronamraju
@ 2018-07-25 14:24   ` tip-bot for Srikar Dronamraju
  0 siblings, 0 replies; 53+ messages in thread
From: tip-bot for Srikar Dronamraju @ 2018-07-25 14:24 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mgorman, torvalds, mingo, linux-kernel, srikar, riel, tglx, hpa, peterz

Commit-ID:  5f95ba7a43057f28a349ea1f03ee8d04e0f445ea
Gitweb:     https://git.kernel.org/tip/5f95ba7a43057f28a349ea1f03ee8d04e0f445ea
Author:     Srikar Dronamraju <srikar@linux.vnet.ibm.com>
AuthorDate: Wed, 20 Jun 2018 22:32:44 +0530
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 25 Jul 2018 11:41:06 +0200

sched/numa: Simplify load_too_imbalanced()

Currently load_too_imbalanced() cares about the slope of the imbalance.
It doesn't care about the direction of the imbalance.

However this may not work if the nodes being compared have dissimilar
capacities. A few nodes might have more cores than other nodes in the
system. Also, unlike traditional load balancing at a NUMA sched domain,
multiple requests to migrate from the same source node to the same
destination node may run in parallel. This can cause a huge load
imbalance. This is especially true on larger machines with either more
cores per node or more nodes in the system. Hence allow a move/swap
only if the imbalance is going to reduce.

Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
16    25058.2     25122.9     0.25
1     72950       73850       1.23

(numbers from v1 based on v4.17-rc5)
Testcase       Time:         Min         Max         Avg      StdDev
numa01.sh      Real:      516.14      892.41      739.84      151.32
numa01.sh       Sys:      153.16      192.99      177.70       14.58
numa01.sh      User:    39821.04    69528.92    57193.87    10989.48
numa02.sh      Real:       60.91       62.35       61.58        0.63
numa02.sh       Sys:       16.47       26.16       21.20        3.85
numa02.sh      User:     5227.58     5309.61     5265.17       31.04
numa03.sh      Real:      739.07      917.73      795.75       64.45
numa03.sh       Sys:       94.46      136.08      109.48       14.58
numa03.sh      User:    57478.56    72014.09    61764.48     5343.69
numa04.sh      Real:      442.61      715.43      530.31       96.12
numa04.sh       Sys:      224.90      348.63      285.61       48.83
numa04.sh      User:    35836.84    47522.47    40235.41     3985.26
numa05.sh      Real:      386.13      489.17      434.94       43.59
numa05.sh       Sys:      144.29      438.56      278.80      105.78
numa05.sh      User:    33255.86    36890.82    34879.31     1641.98

Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
numa01.sh      Real:      435.78      653.81      534.58       83.20 	 38.39%
numa01.sh       Sys:      121.93      187.18      145.90       23.47 	 21.79%
numa01.sh      User:    37082.81    51402.80    43647.60     5409.75 	 31.03%
numa02.sh      Real:       60.64       61.63       61.19        0.40 	 0.637%
numa02.sh       Sys:       14.72       25.68       19.06        4.03 	 11.22%
numa02.sh      User:     5210.95     5266.69     5233.30       20.82 	 0.608%
numa03.sh      Real:      746.51      808.24      780.36       23.88 	 1.972%
numa03.sh       Sys:       97.26      108.48      105.07        4.28 	 4.197%
numa03.sh      User:    58956.30    61397.05    60162.95     1050.82 	 2.661%
numa04.sh      Real:      465.97      519.27      484.81       19.62 	 9.385%
numa04.sh       Sys:      304.43      359.08      334.68       20.64 	 -14.6%
numa04.sh      User:    37544.16    41186.15    39262.44     1314.91 	 2.478%
numa05.sh      Real:      411.57      457.20      433.29       16.58 	 0.380%
numa05.sh       Sys:      230.05      435.48      339.95       67.58 	 -17.9%
numa05.sh      User:    33325.54    36896.31    35637.84     1222.64 	 -2.12%

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-4-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 20 ++------------------
 1 file changed, 2 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b10e0663a49e..226837960ec0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1546,28 +1546,12 @@ static bool load_too_imbalanced(long src_load, long dst_load,
 	src_capacity = env->src_stats.compute_capacity;
 	dst_capacity = env->dst_stats.compute_capacity;
 
-	/* We care about the slope of the imbalance, not the direction. */
-	if (dst_load < src_load)
-		swap(dst_load, src_load);
-
-	/* Is the difference below the threshold? */
-	imb = dst_load * src_capacity * 100 -
-	      src_load * dst_capacity * env->imbalance_pct;
-	if (imb <= 0)
-		return false;
+	imb = abs(dst_load * src_capacity - src_load * dst_capacity);
 
-	/*
-	 * The imbalance is above the allowed threshold.
-	 * Compare it with the old imbalance.
-	 */
 	orig_src_load = env->src_stats.load;
 	orig_dst_load = env->dst_stats.load;
 
-	if (orig_dst_load < orig_src_load)
-		swap(orig_dst_load, orig_src_load);
-
-	old_imb = orig_dst_load * src_capacity * 100 -
-		  orig_src_load * dst_capacity * env->imbalance_pct;
+	old_imb = abs(orig_dst_load * src_capacity - orig_src_load * dst_capacity);
 
 	/* Would this change make things worse? */
 	return (imb > old_imb);

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [tip:sched/core] sched/numa: Set preferred_node based on best_cpu
  2018-06-20 17:02 ` [PATCH v2 04/19] sched/numa: Set preferred_node based on best_cpu Srikar Dronamraju
  2018-06-21  9:17   ` Mel Gorman
@ 2018-07-25 14:25   ` tip-bot for Srikar Dronamraju
  1 sibling, 0 replies; 53+ messages in thread
From: tip-bot for Srikar Dronamraju @ 2018-07-25 14:25 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: peterz, tglx, linux-kernel, riel, torvalds, srikar, mingo, hpa, mgorman

Commit-ID:  8cd45eee43bd46b933158b25aa7c742e0f3e811f
Gitweb:     https://git.kernel.org/tip/8cd45eee43bd46b933158b25aa7c742e0f3e811f
Author:     Srikar Dronamraju <srikar@linux.vnet.ibm.com>
AuthorDate: Wed, 20 Jun 2018 22:32:45 +0530
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 25 Jul 2018 11:41:06 +0200

sched/numa: Set preferred_node based on best_cpu

Currently the preferred node is set to dst_nid, which is the last node
in the iteration whose group weight or task weight is greater than that
of the current node. However this doesn't guarantee that dst_nid has
the numa capacity to accept the move. It also doesn't guarantee that
dst_nid has the best_cpu, which is the CPU/node ideal for the migration.

Let's consider faults on a 4 node system with the group weight numbers
in different nodes being in 0 < 1 < 2 < 3 proportion. Consider the task
is running on node 3 and node 0 is its preferred node but its capacity
is full. Consider nodes 1, 2 and 3 have capacity. Then the task should
be migrated to node 1. Currently the task gets moved to node 2, because
env.dst_nid points to the last node whose faults were greater than the
current node's.

Modify to set the preferred node based on best_cpu. Earlier, setting
the preferred node was skipped if nr_active_nodes was 1. This could
result in the task being moved out of the preferred node to a random
node during regular load balancing.

Also, while modifying task_numa_migrate(), use sched_setnuma() to set
the preferred node. This ensures our numa accounting is correct.

Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
16    25122.9     25549.6     1.698
1     73850       73190       -0.89

Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
8     105930      113437      7.08676
1     178624      196130      9.80047

(numbers from v1 based on v4.17-rc5)
Testcase       Time:         Min         Max         Avg      StdDev
numa01.sh      Real:      435.78      653.81      534.58       83.20
numa01.sh       Sys:      121.93      187.18      145.90       23.47
numa01.sh      User:    37082.81    51402.80    43647.60     5409.75
numa02.sh      Real:       60.64       61.63       61.19        0.40
numa02.sh       Sys:       14.72       25.68       19.06        4.03
numa02.sh      User:     5210.95     5266.69     5233.30       20.82
numa03.sh      Real:      746.51      808.24      780.36       23.88
numa03.sh       Sys:       97.26      108.48      105.07        4.28
numa03.sh      User:    58956.30    61397.05    60162.95     1050.82
numa04.sh      Real:      465.97      519.27      484.81       19.62
numa04.sh       Sys:      304.43      359.08      334.68       20.64
numa04.sh      User:    37544.16    41186.15    39262.44     1314.91
numa05.sh      Real:      411.57      457.20      433.29       16.58
numa05.sh       Sys:      230.05      435.48      339.95       67.58
numa05.sh      User:    33325.54    36896.31    35637.84     1222.64

Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
numa01.sh      Real:      506.35      794.46      599.06      104.26 	 -10.76%
numa01.sh       Sys:      150.37      223.56      195.99       24.94 	 -25.55%
numa01.sh      User:    43450.69    61752.04    49281.50     6635.33 	 -11.43%
numa02.sh      Real:       60.33       62.40       61.31        0.90 	 -0.195%
numa02.sh       Sys:       18.12       31.66       24.28        5.89 	 -21.49%
numa02.sh      User:     5203.91     5325.32     5260.29       49.98 	 -0.513%
numa03.sh      Real:      696.47      853.62      745.80       57.28 	 4.6339%
numa03.sh       Sys:       85.68      123.71       97.89       13.48 	 7.3347%
numa03.sh      User:    55978.45    66418.63    59254.94     3737.97 	 1.5323%
numa04.sh      Real:      444.05      514.83      497.06       26.85 	 -2.464%
numa04.sh       Sys:      230.39      375.79      316.23       48.58 	 5.8343%
numa04.sh      User:    35403.12    41004.10    39720.80     2163.08 	 -1.153%
numa05.sh      Real:      423.09      460.41      439.57       13.92 	 -1.428%
numa05.sh       Sys:      287.38      480.15      369.37       68.52 	 -7.964%
numa05.sh      User:    34732.12    38016.80    36255.85     1070.51 	 -1.704%

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-5-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 226837960ec0..0532195c38d0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1765,7 +1765,7 @@ static int task_numa_migrate(struct task_struct *p)
 	 * elsewhere, so there is no point in (re)trying.
 	 */
 	if (unlikely(!sd)) {
-		p->numa_preferred_nid = task_node(p);
+		sched_setnuma(p, task_node(p));
 		return -EINVAL;
 	}
 
@@ -1824,15 +1824,13 @@ static int task_numa_migrate(struct task_struct *p)
 	 * trying for a better one later. Do not set the preferred node here.
 	 */
 	if (p->numa_group) {
-		struct numa_group *ng = p->numa_group;
-
 		if (env.best_cpu == -1)
 			nid = env.src_nid;
 		else
-			nid = env.dst_nid;
+			nid = cpu_to_node(env.best_cpu);
 
-		if (ng->active_nodes > 1 && numa_is_active_node(env.dst_nid, ng))
-			sched_setnuma(p, env.dst_nid);
+		if (nid != p->numa_preferred_nid)
+			sched_setnuma(p, nid);
 	}
 
 	/* No better CPU than the current one was found. */

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [tip:sched/core] sched/numa: Use task faults only if numa_group is not yet set up
  2018-06-20 17:02 ` [PATCH v2 05/19] sched/numa: Use task faults only if numa_group is not yet setup Srikar Dronamraju
  2018-06-21  9:38   ` Mel Gorman
@ 2018-07-25 14:25   ` tip-bot for Srikar Dronamraju
  1 sibling, 0 replies; 53+ messages in thread
From: tip-bot for Srikar Dronamraju @ 2018-07-25 14:25 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, torvalds, tglx, mgorman, srikar, peterz, riel, hpa, linux-kernel

Commit-ID:  f03bb6760b8e5e2bcecc88d2a2ef41c09adcab39
Gitweb:     https://git.kernel.org/tip/f03bb6760b8e5e2bcecc88d2a2ef41c09adcab39
Author:     Srikar Dronamraju <srikar@linux.vnet.ibm.com>
AuthorDate: Wed, 20 Jun 2018 22:32:46 +0530
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 25 Jul 2018 11:41:06 +0200

sched/numa: Use task faults only if numa_group is not yet set up

When numa_group faults are available, task_numa_placement() only uses
numa_group faults to evaluate the preferred node. However, it still
accounts task faults and even evaluates a preferred node based on task
faults, only to discard it in favour of the preferred node chosen on
the basis of numa_group faults.

Instead, use task faults only if numa_group is not set up.

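A userspace sketch of the resulting single max-tracking loop; has_group
and the per-node fault arrays below are toy stand-ins, not the kernel's
data structures:

#include <stdbool.h>
#include <stdio.h>

int main(void)
{
        unsigned long task_faults[4]  = { 10, 50, 20,  5 };
        unsigned long group_faults[4] = { 70, 30, 90, 10 };
        bool has_group = true;
        unsigned long max_faults = 0, faults;
        int nid, max_nid = -1;

        for (nid = 0; nid < 4; nid++) {
                /* Only one metric feeds the max once a group exists. */
                faults = has_group ? group_faults[nid] : task_faults[nid];
                if (faults > max_faults) {
                        max_faults = faults;
                        max_nid = nid;
                }
        }
        /* Prints 2 with a group; would print 1 without one. */
        printf("preferred nid = %d\n", max_nid);
        return 0;
}
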
Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
16    25549.6     25215.7     -1.30
1     73190       72107       -1.47

Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
8     113437      113372      -0.05
1     196130      177403      -9.54

(numbers from v1 based on v4.17-rc5)
Testcase       Time:         Min         Max         Avg      StdDev
numa01.sh      Real:      506.35      794.46      599.06      104.26
numa01.sh       Sys:      150.37      223.56      195.99       24.94
numa01.sh      User:    43450.69    61752.04    49281.50     6635.33
numa02.sh      Real:       60.33       62.40       61.31        0.90
numa02.sh       Sys:       18.12       31.66       24.28        5.89
numa02.sh      User:     5203.91     5325.32     5260.29       49.98
numa03.sh      Real:      696.47      853.62      745.80       57.28
numa03.sh       Sys:       85.68      123.71       97.89       13.48
numa03.sh      User:    55978.45    66418.63    59254.94     3737.97
numa04.sh      Real:      444.05      514.83      497.06       26.85
numa04.sh       Sys:      230.39      375.79      316.23       48.58
numa04.sh      User:    35403.12    41004.10    39720.80     2163.08
numa05.sh      Real:      423.09      460.41      439.57       13.92
numa05.sh       Sys:      287.38      480.15      369.37       68.52
numa05.sh      User:    34732.12    38016.80    36255.85     1070.51

Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
numa01.sh      Real:      478.45      565.90      515.11       30.87 	 16.29%
numa01.sh       Sys:      207.79      271.04      232.94       21.33 	 -15.8%
numa01.sh      User:    39763.93    47303.12    43210.73     2644.86 	 14.04%
numa02.sh      Real:       60.00       61.46       60.78        0.49 	 0.871%
numa02.sh       Sys:       15.71       25.31       20.69        3.42 	 17.35%
numa02.sh      User:     5175.92     5265.86     5235.97       32.82 	 0.464%
numa03.sh      Real:      776.42      834.85      806.01       23.22 	 -7.47%
numa03.sh       Sys:      114.43      128.75      121.65        5.49 	 -19.5%
numa03.sh      User:    60773.93    64855.25    62616.91     1576.39 	 -5.36%
numa04.sh      Real:      456.93      511.95      482.91       20.88 	 2.930%
numa04.sh       Sys:      178.09      460.89      356.86       94.58 	 -11.3%
numa04.sh      User:    36312.09    42553.24    39623.21     2247.96 	 0.246%
numa05.sh      Real:      393.98      493.48      436.61       35.59 	 0.677%
numa05.sh       Sys:      164.49      329.15      265.87       61.78 	 38.92%
numa05.sh      User:    33182.65    36654.53    35074.51     1187.71 	 3.368%

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-6-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0532195c38d0..a10c4f8f47e8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2110,8 +2110,8 @@ static int preferred_group_nid(struct task_struct *p, int nid)
 
 static void task_numa_placement(struct task_struct *p)
 {
-	int seq, nid, max_nid = -1, max_group_nid = -1;
-	unsigned long max_faults = 0, max_group_faults = 0;
+	int seq, nid, max_nid = -1;
+	unsigned long max_faults = 0;
 	unsigned long fault_types[2] = { 0, 0 };
 	unsigned long total_faults;
 	u64 runtime, period;
@@ -2190,15 +2190,15 @@ static void task_numa_placement(struct task_struct *p)
 			}
 		}
 
-		if (faults > max_faults) {
-			max_faults = faults;
+		if (!p->numa_group) {
+			if (faults > max_faults) {
+				max_faults = faults;
+				max_nid = nid;
+			}
+		} else if (group_faults > max_faults) {
+			max_faults = group_faults;
 			max_nid = nid;
 		}
-
-		if (group_faults > max_group_faults) {
-			max_group_faults = group_faults;
-			max_group_nid = nid;
-		}
 	}
 
 	update_task_scan_period(p, fault_types[0], fault_types[1]);
@@ -2206,7 +2206,7 @@ static void task_numa_placement(struct task_struct *p)
 	if (p->numa_group) {
 		numa_group_count_active_nodes(p->numa_group);
 		spin_unlock_irq(group_lock);
-		max_nid = preferred_group_nid(p, max_group_nid);
+		max_nid = preferred_group_nid(p, max_nid);
 	}
 
 	if (max_faults) {

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [tip:sched/core] sched/debug: Reverse the order of printing faults
  2018-06-20 17:02 ` [PATCH v2 06/19] sched/debug: Reverse the order of printing faults Srikar Dronamraju
@ 2018-07-25 14:26   ` tip-bot for Srikar Dronamraju
  0 siblings, 0 replies; 53+ messages in thread
From: tip-bot for Srikar Dronamraju @ 2018-07-25 14:26 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, peterz, srikar, mgorman, linux-kernel, riel, hpa, tglx, torvalds

Commit-ID:  67d9f6c256cd66e15f85c92670f52a7ad4689cff
Gitweb:     https://git.kernel.org/tip/67d9f6c256cd66e15f85c92670f52a7ad4689cff
Author:     Srikar Dronamraju <srikar@linux.vnet.ibm.com>
AuthorDate: Wed, 20 Jun 2018 22:32:47 +0530
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 25 Jul 2018 11:41:07 +0200

sched/debug: Reverse the order of printing faults

Fix the order in which the private and shared NUMA faults are printed,
so that each value matches its label.

No functional changes.

Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
16    25215.7     25375.3     0.63
1     72107       72617       0.70

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-7-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/debug.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index c96e89cc4bc7..870d4f3da285 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -842,8 +842,8 @@ void print_numa_stats(struct seq_file *m, int node, unsigned long tsf,
 		unsigned long tpf, unsigned long gsf, unsigned long gpf)
 {
 	SEQ_printf(m, "numa_faults node=%d ", node);
-	SEQ_printf(m, "task_private=%lu task_shared=%lu ", tsf, tpf);
-	SEQ_printf(m, "group_private=%lu group_shared=%lu\n", gsf, gpf);
+	SEQ_printf(m, "task_private=%lu task_shared=%lu ", tpf, tsf);
+	SEQ_printf(m, "group_private=%lu group_shared=%lu\n", gpf, gsf);
 }
 #endif
 

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [tip:sched/core] sched/numa: Skip nodes that are at 'hoplimit'
  2018-06-20 17:02 ` [PATCH v2 07/19] sched/numa: Skip nodes that are at hoplimit Srikar Dronamraju
@ 2018-07-25 14:27   ` tip-bot for Srikar Dronamraju
  0 siblings, 0 replies; 53+ messages in thread
From: tip-bot for Srikar Dronamraju @ 2018-07-25 14:27 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, hpa, mgorman, srikar, riel, tglx, mingo, peterz, linux-kernel

Commit-ID:  0ee7e74dc0dc64d9900751d03c5c22dfdd173fb8
Gitweb:     https://git.kernel.org/tip/0ee7e74dc0dc64d9900751d03c5c22dfdd173fb8
Author:     Srikar Dronamraju <srikar@linux.vnet.ibm.com>
AuthorDate: Wed, 20 Jun 2018 22:32:48 +0530
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 25 Jul 2018 11:41:07 +0200

sched/numa: Skip nodes that are at 'hoplimit'

When comparing two nodes at a distance of 'hoplimit', we should consider
nodes only at distances strictly less than 'hoplimit'. Currently nodes
at exactly 'hoplimit' distance are considered as well, so two nodes at a
distance of 'hoplimit' end up with the same group weight. Fix this by
skipping nodes at 'hoplimit'.

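A standalone sketch of just the boundary condition; the distances below
are made up and score_nearby_nodes() itself is not reproduced:

#include <stdio.h>

int main(void)
{
        int dists[] = { 10, 20, 30, 40 };       /* toy distances to 4 nodes */
        int maxdist = 30;                       /* the 'hoplimit' distance  */
        int i, before = 0, after = 0;

        for (i = 0; i < 4; i++) {
                if (!(dists[i] > maxdist))      /* old check: keeps dist == 30 */
                        before++;
                if (!(dists[i] >= maxdist))     /* new check: skips dist == 30 */
                        after++;
        }
        /* Prints "nodes scored: before=3 after=2". */
        printf("nodes scored: before=%d after=%d\n", before, after);
        return 0;
}
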
Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
16    25375.3     25308.6     -0.26
1     72617       72964       0.477

Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
8     113372      108750      -4.07684
1     177403      183115      3.21979

(numbers from v1 based on v4.17-rc5)
Testcase       Time:         Min         Max         Avg      StdDev
numa01.sh      Real:      478.45      565.90      515.11       30.87
numa01.sh       Sys:      207.79      271.04      232.94       21.33
numa01.sh      User:    39763.93    47303.12    43210.73     2644.86
numa02.sh      Real:       60.00       61.46       60.78        0.49
numa02.sh       Sys:       15.71       25.31       20.69        3.42
numa02.sh      User:     5175.92     5265.86     5235.97       32.82
numa03.sh      Real:      776.42      834.85      806.01       23.22
numa03.sh       Sys:      114.43      128.75      121.65        5.49
numa03.sh      User:    60773.93    64855.25    62616.91     1576.39
numa04.sh      Real:      456.93      511.95      482.91       20.88
numa04.sh       Sys:      178.09      460.89      356.86       94.58
numa04.sh      User:    36312.09    42553.24    39623.21     2247.96
numa05.sh      Real:      393.98      493.48      436.61       35.59
numa05.sh       Sys:      164.49      329.15      265.87       61.78
numa05.sh      User:    33182.65    36654.53    35074.51     1187.71

Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
numa01.sh      Real:      414.64      819.20      556.08      147.70 	 -7.36%
numa01.sh       Sys:       77.52      205.04      139.40       52.05 	 67.10%
numa01.sh      User:    37043.24    61757.88    45517.48     9290.38 	 -5.06%
numa02.sh      Real:       60.80       63.32       61.63        0.88 	 -1.37%
numa02.sh       Sys:       17.35       39.37       25.71        7.33 	 -19.5%
numa02.sh      User:     5213.79     5374.73     5268.90       55.09 	 -0.62%
numa03.sh      Real:      780.09      948.64      831.43       63.02 	 -3.05%
numa03.sh       Sys:      104.96      136.92      116.31       11.34 	 4.591%
numa03.sh      User:    60465.42    73339.78    64368.03     4700.14 	 -2.72%
numa04.sh      Real:      412.60      681.92      521.29       96.64 	 -7.36%
numa04.sh       Sys:      210.32      314.10      251.77       37.71 	 41.74%
numa04.sh      User:    34026.38    45581.20    38534.49     4198.53 	 2.825%
numa05.sh      Real:      394.79      439.63      411.35       16.87 	 6.140%
numa05.sh       Sys:      238.32      330.09      292.31       38.32 	 -9.04%
numa05.sh      User:    33456.45    34876.07    34138.62      609.45 	 2.741%

While these numbers show a regression, the change is needed from a
correctness perspective. It also helps consolidation, as seen from the
perf bench output.

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-8-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a10c4f8f47e8..e5f39e8dfe53 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1312,7 +1312,7 @@ static unsigned long score_nearby_nodes(struct task_struct *p, int nid,
 		 * of each group. Skip other nodes.
 		 */
 		if (sched_numa_topology_type == NUMA_BACKPLANE &&
-					dist > maxdist)
+					dist >= maxdist)
 			continue;
 
 		/* Add up the faults from nearby nodes. */

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [tip:sched/core] sched/numa: Remove unused task_capacity from 'struct numa_stats'
  2018-06-20 17:02 ` [PATCH v2 08/19] sched/numa: Remove unused task_capacity from numa_stats Srikar Dronamraju
@ 2018-07-25 14:27   ` tip-bot for Srikar Dronamraju
  0 siblings, 0 replies; 53+ messages in thread
From: tip-bot for Srikar Dronamraju @ 2018-07-25 14:27 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: srikar, peterz, hpa, torvalds, mingo, linux-kernel, riel, tglx, mgorman

Commit-ID:  10864a9e222048a862da2c21efa28929a4dfed15
Gitweb:     https://git.kernel.org/tip/10864a9e222048a862da2c21efa28929a4dfed15
Author:     Srikar Dronamraju <srikar@linux.vnet.ibm.com>
AuthorDate: Wed, 20 Jun 2018 22:32:49 +0530
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 25 Jul 2018 11:41:07 +0200

sched/numa: Remove unused task_capacity from 'struct numa_stats'

The task_capacity field in 'struct numa_stats' is redundant: it is only
used to compute has_free_capacity, so a local variable suffices.
Also move nr_running for better packing within the struct.

No functional changes.

Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
16    25308.6     25377.3     0.271
1     72964       72287       -0.92

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-9-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e5f39e8dfe53..4ac60b296d96 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1450,14 +1450,12 @@ static unsigned long capacity_of(int cpu);
 
 /* Cached statistics for all CPUs within a node */
 struct numa_stats {
-	unsigned long nr_running;
 	unsigned long load;
 
 	/* Total compute capacity of CPUs on a node */
 	unsigned long compute_capacity;
 
-	/* Approximate capacity in terms of runnable tasks on a node */
-	unsigned long task_capacity;
+	unsigned int nr_running;
 	int has_free_capacity;
 };
 
@@ -1495,9 +1493,9 @@ static void update_numa_stats(struct numa_stats *ns, int nid)
 	smt = DIV_ROUND_UP(SCHED_CAPACITY_SCALE * cpus, ns->compute_capacity);
 	capacity = cpus / smt; /* cores */
 
-	ns->task_capacity = min_t(unsigned, capacity,
+	capacity = min_t(unsigned, capacity,
 		DIV_ROUND_CLOSEST(ns->compute_capacity, SCHED_CAPACITY_SCALE));
-	ns->has_free_capacity = (ns->nr_running < ns->task_capacity);
+	ns->has_free_capacity = (ns->nr_running < capacity);
 }
 
 struct task_numa_env {

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [tip:sched/core] sched/numa: Modify migrate_swap() to accept additional parameters
  2018-06-20 17:02 ` [PATCH v2 09/19] sched/numa: Modify migrate_swap to accept additional params Srikar Dronamraju
@ 2018-07-25 14:28   ` tip-bot for Srikar Dronamraju
  0 siblings, 0 replies; 53+ messages in thread
From: tip-bot for Srikar Dronamraju @ 2018-07-25 14:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: srikar, peterz, linux-kernel, torvalds, hpa, riel, tglx, mingo, mgorman

Commit-ID:  0ad4e3dfe6cf3f207e61cbd8e3e4a943f8c1ad20
Gitweb:     https://git.kernel.org/tip/0ad4e3dfe6cf3f207e61cbd8e3e4a943f8c1ad20
Author:     Srikar Dronamraju <srikar@linux.vnet.ibm.com>
AuthorDate: Wed, 20 Jun 2018 22:32:50 +0530
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 25 Jul 2018 11:41:07 +0200

sched/numa: Modify migrate_swap() to accept additional parameters

migrate_swap_stop() checks whether the task/CPU combination matches
migrate_swap_arg before migrating.

However, at least one of the two tasks to be swapped by migrate_swap()
could have migrated to a completely different CPU before
migrate_swap_arg is filled in. The new CPU where the task is currently
running could be on a different node too. If the task has migrated, the
NUMA balancer might end up placing the task on the wrong node. Instead
of achieving node consolidation, it may end up spreading the load
across nodes.

To avoid that, pass the CPUs as additional parameters.

While here, place migrate_swap() under CONFIG_NUMA_BALANCING.

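A userspace sketch of the window being closed, using a fake task struct;
nothing here is kernel code, it only mimics the timing problem:

#include <stdio.h>

struct fake_task { int cpu; };

/* Old behaviour: read the CPU at swap time. */
static void swap_old(struct fake_task *cur, struct fake_task *p)
{
        printf("old: swap built from CPUs %d <-> %d\n", cur->cpu, p->cpu);
}

/* New behaviour: use the CPUs captured when the decision was made. */
static void swap_new(int curr_cpu, int target_cpu)
{
        printf("new: swap built from CPUs %d <-> %d\n", curr_cpu, target_cpu);
}

int main(void)
{
        struct fake_task cur = { .cpu = 8 }, p = { .cpu = 3 };
        int curr_cpu = cur.cpu, target_cpu = p.cpu;     /* decision point */

        p.cpu = 12;     /* p moves to another CPU/node before the swap */

        swap_old(&cur, &p);             /* 8 <-> 12: not what was decided */
        swap_new(curr_cpu, target_cpu); /* 8 <-> 3: matches the decision,
                                         * so a stale swap can be detected */
        return 0;
}
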
Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
16    25377.3     25226.6     -0.59
1     72287       73326       1.437

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-10-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c  | 9 ++++++---
 kernel/sched/fair.c  | 3 ++-
 kernel/sched/sched.h | 3 ++-
 3 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2bc391a574e6..deafa9fe602b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1176,6 +1176,7 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 	__set_task_cpu(p, new_cpu);
 }
 
+#ifdef CONFIG_NUMA_BALANCING
 static void __migrate_swap_task(struct task_struct *p, int cpu)
 {
 	if (task_on_rq_queued(p)) {
@@ -1257,16 +1258,17 @@ unlock:
 /*
  * Cross migrate two tasks
  */
-int migrate_swap(struct task_struct *cur, struct task_struct *p)
+int migrate_swap(struct task_struct *cur, struct task_struct *p,
+		int target_cpu, int curr_cpu)
 {
 	struct migration_swap_arg arg;
 	int ret = -EINVAL;
 
 	arg = (struct migration_swap_arg){
 		.src_task = cur,
-		.src_cpu = task_cpu(cur),
+		.src_cpu = curr_cpu,
 		.dst_task = p,
-		.dst_cpu = task_cpu(p),
+		.dst_cpu = target_cpu,
 	};
 
 	if (arg.src_cpu == arg.dst_cpu)
@@ -1291,6 +1293,7 @@ int migrate_swap(struct task_struct *cur, struct task_struct *p)
 out:
 	return ret;
 }
+#endif /* CONFIG_NUMA_BALANCING */
 
 /*
  * wait_task_inactive - wait for a thread to unschedule.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4ac60b296d96..7b4eddec3ccc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1848,7 +1848,8 @@ static int task_numa_migrate(struct task_struct *p)
 		return ret;
 	}
 
-	ret = migrate_swap(p, env.best_task);
+	ret = migrate_swap(p, env.best_task, env.best_cpu, env.src_cpu);
+
 	if (ret != 0)
 		trace_sched_stick_numa(p, env.src_cpu, task_cpu(env.best_task));
 	put_task_struct(env.best_task);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 614170d9b1aa..4a2e8cae63c4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1099,7 +1099,8 @@ enum numa_faults_stats {
 };
 extern void sched_setnuma(struct task_struct *p, int node);
 extern int migrate_task_to(struct task_struct *p, int cpu);
-extern int migrate_swap(struct task_struct *, struct task_struct *);
+extern int migrate_swap(struct task_struct *p, struct task_struct *t,
+			int cpu, int scpu);
 extern void init_numa_balancing(unsigned long clone_flags, struct task_struct *p);
 #else
 static inline void

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [tip:sched/core] sched/numa: Remove numa_has_capacity()
  2018-06-20 17:02 ` [PATCH v2 12/19] sched/numa: Remove numa_has_capacity Srikar Dronamraju
@ 2018-07-25 14:28   ` tip-bot for Srikar Dronamraju
  0 siblings, 0 replies; 53+ messages in thread
From: tip-bot for Srikar Dronamraju @ 2018-07-25 14:28 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: torvalds, tglx, hpa, mingo, srikar, linux-kernel, mgorman, peterz, riel

Commit-ID:  2d4056fafa196e1ab4e7161bae4df76f9602d56d
Gitweb:     https://git.kernel.org/tip/2d4056fafa196e1ab4e7161bae4df76f9602d56d
Author:     Srikar Dronamraju <srikar@linux.vnet.ibm.com>
AuthorDate: Wed, 20 Jun 2018 22:32:53 +0530
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 25 Jul 2018 11:41:08 +0200

sched/numa: Remove numa_has_capacity()

task_numa_find_cpu() helps to find the CPU to swap/move the task to.
It's guarded by numa_has_capacity(). However, a node not having capacity
shouldn't prevent a task swap if the swap helps NUMA placement.

Further, load_too_imbalanced(), which evaluates the possibility of a
move/swap, already provides checks similar to numa_has_capacity().

Hence remove numa_has_capacity() to increase the chances of a task swap
even if the load is imbalanced.

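For reference, the load/capacity comparison the removed helper made
(visible in the hunk below) boils down to a cross-multiplied ratio
check; a minimal userspace sketch with made-up numbers:

#include <stdbool.h>
#include <stdio.h>

/*
 * Cross-multiplied form of:
 *   src_load/src_cap * imbalance_pct/100  >  dst_load/dst_cap
 * which is the shape of the check the removed helper used.
 */
static bool src_exceeds_dst(unsigned long src_load, unsigned long src_cap,
                            unsigned long dst_load, unsigned long dst_cap,
                            unsigned long imbalance_pct)
{
        return src_load * dst_cap * imbalance_pct > dst_load * src_cap * 100;
}

int main(void)
{
        /* Prints 1 for these made-up loads and capacities. */
        printf("%d\n", src_exceeds_dst(900, 1024, 700, 1024, 125));
        return 0;
}
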
Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
16    25657.9     25804.1     0.569
1     74435       73413       -1.37

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-13-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 36 +++---------------------------------
 1 file changed, 3 insertions(+), 33 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7b4eddec3ccc..3bcf0e864613 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1456,7 +1456,6 @@ struct numa_stats {
 	unsigned long compute_capacity;
 
 	unsigned int nr_running;
-	int has_free_capacity;
 };
 
 /*
@@ -1483,8 +1482,7 @@ static void update_numa_stats(struct numa_stats *ns, int nid)
 	 * the @ns structure is NULL'ed and task_numa_compare() will
 	 * not find this node attractive.
 	 *
-	 * We'll either bail at !has_free_capacity, or we'll detect a huge
-	 * imbalance and bail there.
+	 * We'll detect a huge imbalance and bail there.
 	 */
 	if (!cpus)
 		return;
@@ -1495,7 +1493,6 @@ static void update_numa_stats(struct numa_stats *ns, int nid)
 
 	capacity = min_t(unsigned, capacity,
 		DIV_ROUND_CLOSEST(ns->compute_capacity, SCHED_CAPACITY_SCALE));
-	ns->has_free_capacity = (ns->nr_running < capacity);
 }
 
 struct task_numa_env {
@@ -1698,31 +1695,6 @@ static void task_numa_find_cpu(struct task_numa_env *env,
 	}
 }
 
-/* Only move tasks to a NUMA node less busy than the current node. */
-static bool numa_has_capacity(struct task_numa_env *env)
-{
-	struct numa_stats *src = &env->src_stats;
-	struct numa_stats *dst = &env->dst_stats;
-
-	if (src->has_free_capacity && !dst->has_free_capacity)
-		return false;
-
-	/*
-	 * Only consider a task move if the source has a higher load
-	 * than the destination, corrected for CPU capacity on each node.
-	 *
-	 *      src->load                dst->load
-	 * --------------------- vs ---------------------
-	 * src->compute_capacity    dst->compute_capacity
-	 */
-	if (src->load * dst->compute_capacity * env->imbalance_pct >
-
-	    dst->load * src->compute_capacity * 100)
-		return true;
-
-	return false;
-}
-
 static int task_numa_migrate(struct task_struct *p)
 {
 	struct task_numa_env env = {
@@ -1777,8 +1749,7 @@ static int task_numa_migrate(struct task_struct *p)
 	update_numa_stats(&env.dst_stats, env.dst_nid);
 
 	/* Try to find a spot on the preferred nid. */
-	if (numa_has_capacity(&env))
-		task_numa_find_cpu(&env, taskimp, groupimp);
+	task_numa_find_cpu(&env, taskimp, groupimp);
 
 	/*
 	 * Look at other nodes in these cases:
@@ -1808,8 +1779,7 @@ static int task_numa_migrate(struct task_struct *p)
 			env.dist = dist;
 			env.dst_nid = nid;
 			update_numa_stats(&env.dst_stats, env.dst_nid);
-			if (numa_has_capacity(&env))
-				task_numa_find_cpu(&env, taskimp, groupimp);
+			task_numa_find_cpu(&env, taskimp, groupimp);
 		}
 	}
 

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [tip:sched/core] sched/numa: Update the scan period without holding the numa_group lock
  2018-06-20 17:02 ` [PATCH v2 14/19] sched/numa: Updation of scan period need not be in lock Srikar Dronamraju
  2018-06-21  9:51   ` Mel Gorman
@ 2018-07-25 14:29   ` tip-bot for Srikar Dronamraju
  1 sibling, 0 replies; 53+ messages in thread
From: tip-bot for Srikar Dronamraju @ 2018-07-25 14:29 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mgorman, srikar, peterz, mingo, riel, torvalds, tglx, linux-kernel, hpa

Commit-ID:  30619c89b17d46808b4cdf5b3f81b6a01ade1473
Gitweb:     https://git.kernel.org/tip/30619c89b17d46808b4cdf5b3f81b6a01ade1473
Author:     Srikar Dronamraju <srikar@linux.vnet.ibm.com>
AuthorDate: Wed, 20 Jun 2018 22:32:55 +0530
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 25 Jul 2018 11:41:08 +0200

sched/numa: Update the scan period without holding the numa_group lock

The metrics for updating scan periods are local or task specific.
Currently this update happens under the numa_group lock, which seems
unnecessary. Hence move this update outside the lock.

Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
16    25355.9     25645.4     1.141
1     72812       72142       -0.92

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-15-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3bcf0e864613..fc33a4b40a09 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2170,8 +2170,6 @@ static void task_numa_placement(struct task_struct *p)
 		}
 	}
 
-	update_task_scan_period(p, fault_types[0], fault_types[1]);
-
 	if (p->numa_group) {
 		numa_group_count_active_nodes(p->numa_group);
 		spin_unlock_irq(group_lock);
@@ -2186,6 +2184,8 @@ static void task_numa_placement(struct task_struct *p)
 		if (task_node(p) != p->numa_preferred_nid)
 			numa_migrate_preferred(p);
 	}
+
+	update_task_scan_period(p, fault_types[0], fault_types[1]);
 }
 
 static inline int get_numa_group(struct numa_group *grp)

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [tip:sched/core] sched/numa: Use group_weights to identify if migration degrades locality
  2018-06-20 17:02 ` [PATCH v2 15/19] sched/numa: Use group_weights to identify if migration degrades locality Srikar Dronamraju
@ 2018-07-25 14:29   ` tip-bot for Srikar Dronamraju
  0 siblings, 0 replies; 53+ messages in thread
From: tip-bot for Srikar Dronamraju @ 2018-07-25 14:29 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: peterz, tglx, mingo, torvalds, riel, hpa, mgorman, linux-kernel, srikar

Commit-ID:  f35678b6a17063f3b0d391af5ab8f8c83cf31b0c
Gitweb:     https://git.kernel.org/tip/f35678b6a17063f3b0d391af5ab8f8c83cf31b0c
Author:     Srikar Dronamraju <srikar@linux.vnet.ibm.com>
AuthorDate: Wed, 20 Jun 2018 22:32:56 +0530
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 25 Jul 2018 11:41:08 +0200

sched/numa: Use group_weights to identify if migration degrades locality

On NUMA_BACKPLANE and NUMA_GLUELESS_MESH systems, tasks/memory should be
consolidated on the closest group of nodes. In such a case, relying on
the group_faults metric may not always help consolidation: a node closer
to the preferred node may have fewer faults than a node further away
from the preferred node, and moving to the node with more faults would
then work against NUMA consolidation.

Using group_weight helps to consolidate tasks/memory around the
preferred node; a toy illustration of the idea follows below.

While here, to be on the conservative side, don't override the
"migration degrades locality" logic for CPU_NEWLY_IDLE load balancing.

Note: Similar problems exist with should_numa_migrate_memory() and will
be dealt with separately.

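Toy illustration only: a distance-discounted score, loosely in the
spirit of group_weight(); the discount formula below is an assumption
made for illustration, and the kernel's actual scoring
(score_nearby_nodes() and friends) is not reproduced:

#include <stdio.h>

/* Assumed toy formula: discount raw faults by hop distance. */
static unsigned long toy_weight(unsigned long faults, int dist)
{
        return faults * 10 / dist;
}

int main(void)
{
        /* Node A is closer (dist 10) with fewer faults than node B (dist 20). */
        unsigned long fa = 400, fb = 600;

        printf("raw faults:  A=%lu B=%lu\n", fa, fb);
        printf("toy weights: A=%lu B=%lu\n",
               toy_weight(fa, 10), toy_weight(fb, 20));
        return 0;
}

With raw faults node B wins; with the distance-discounted weight node A
wins, which is the consolidation-friendly choice.
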
Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
16    25645.4     25960       1.22
1     72142       73550       1.95

Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
8     110199      120071      8.958
1     176303      176249      -0.03

(numbers from v1 based on v4.17-rc5)
Testcase       Time:         Min         Max         Avg      StdDev
numa01.sh      Real:      490.04      774.86      596.26       96.46
numa01.sh       Sys:      151.52      242.88      184.82       31.71
numa01.sh      User:    41418.41    60844.59    48776.09     6564.27
numa02.sh      Real:       60.14       62.94       60.98        1.00
numa02.sh       Sys:       16.11       30.77       21.20        5.28
numa02.sh      User:     5184.33     5311.09     5228.50       44.24
numa03.sh      Real:      790.95      856.35      826.41       24.11
numa03.sh       Sys:      114.93      118.85      117.05        1.63
numa03.sh      User:    60990.99    64959.28    63470.43     1415.44
numa04.sh      Real:      434.37      597.92      504.87       59.70
numa04.sh       Sys:      237.63      397.40      289.74       55.98
numa04.sh      User:    34854.87    41121.83    38572.52     2615.84
numa05.sh      Real:      386.77      448.90      417.22       22.79
numa05.sh       Sys:      149.23      379.95      303.04       79.55
numa05.sh      User:    32951.76    35959.58    34562.18     1034.05

Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
numa01.sh      Real:      493.19      672.88      597.51       59.38 	 -0.20%
numa01.sh       Sys:      150.09      245.48      207.76       34.26 	 -11.0%
numa01.sh      User:    41928.51    53779.17    48747.06     3901.39 	 0.059%
numa02.sh      Real:       60.63       62.87       61.22        0.83 	 -0.39%
numa02.sh       Sys:       16.64       27.97       20.25        4.06 	 4.691%
numa02.sh      User:     5222.92     5309.60     5254.03       29.98 	 -0.48%
numa03.sh      Real:      821.52      902.15      863.60       32.41 	 -4.30%
numa03.sh       Sys:      112.04      130.66      118.35        7.08 	 -1.09%
numa03.sh      User:    62245.16    69165.14    66443.04     2450.32 	 -4.47%
numa04.sh      Real:      414.53      519.57      476.25       37.00 	 6.009%
numa04.sh       Sys:      181.84      335.67      280.41       54.07 	 3.327%
numa04.sh      User:    33924.50    39115.39    37343.78     1934.26 	 3.290%
numa05.sh      Real:      408.30      441.45      417.90       12.05 	 -0.16%
numa05.sh       Sys:      233.41      381.60      295.58       57.37 	 2.523%
numa05.sh      User:    33301.31    35972.50    34335.19      938.94 	 0.661%

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-16-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fc33a4b40a09..9c9e54ea65d9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6899,8 +6899,8 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
 static int migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
 {
 	struct numa_group *numa_group = rcu_dereference(p->numa_group);
-	unsigned long src_faults, dst_faults;
-	int src_nid, dst_nid;
+	unsigned long src_weight, dst_weight;
+	int src_nid, dst_nid, dist;
 
 	if (!static_branch_likely(&sched_numa_balancing))
 		return -1;
@@ -6927,18 +6927,19 @@ static int migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
 		return 0;
 
 	/* Leaving a core idle is often worse than degrading locality. */
-	if (env->idle != CPU_NOT_IDLE)
+	if (env->idle == CPU_IDLE)
 		return -1;
 
+	dist = node_distance(src_nid, dst_nid);
 	if (numa_group) {
-		src_faults = group_faults(p, src_nid);
-		dst_faults = group_faults(p, dst_nid);
+		src_weight = group_weight(p, src_nid, dist);
+		dst_weight = group_weight(p, dst_nid, dist);
 	} else {
-		src_faults = task_faults(p, src_nid);
-		dst_faults = task_faults(p, dst_nid);
+		src_weight = task_weight(p, src_nid, dist);
+		dst_weight = task_weight(p, dst_nid, dist);
 	}
 
-	return dst_faults < src_faults;
+	return dst_weight < src_weight;
 }
 
 #else

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [tip:sched/core] sched/numa: Move task_numa_placement() closer to numa_migrate_preferred()
  2018-06-20 17:03 ` [PATCH v2 19/19] sched/numa: Move task_placement closer to numa_migrate_preferred Srikar Dronamraju
  2018-06-21 10:06   ` Mel Gorman
@ 2018-07-25 14:30   ` tip-bot for Srikar Dronamraju
  1 sibling, 0 replies; 53+ messages in thread
From: tip-bot for Srikar Dronamraju @ 2018-07-25 14:30 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, torvalds, hpa, tglx, riel, mingo, mgorman, peterz, srikar

Commit-ID:  b6a60cf36d497e7fbde9dd5b86fabd96850249f6
Gitweb:     https://git.kernel.org/tip/b6a60cf36d497e7fbde9dd5b86fabd96850249f6
Author:     Srikar Dronamraju <srikar@linux.vnet.ibm.com>
AuthorDate: Wed, 20 Jun 2018 22:33:00 +0530
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 25 Jul 2018 11:41:08 +0200

sched/numa: Move task_numa_placement() closer to numa_migrate_preferred()

numa_migrate_preferred() is called periodically or when the task's
preferred node changes. Preferred node evaluation happens once per scan
sequence.

If the scan completes just after the periodic NUMA migration, we may
migrate to the preferred node only for the preferred node to change
right away, requiring yet another migration.

Avoid this by checking for scan sequence completion only when checking
for periodic migration.

Running SPECjbb2005 on a 4 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
16    25862.6     26158.1     1.14258
1     74357       72725       -2.19482

Running SPECjbb2005 on a 16 node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
8     117019      113992      -2.58
1     179095      174947      -2.31

(numbers from v1 based on v4.17-rc5)
Testcase       Time:         Min         Max         Avg      StdDev
numa01.sh      Real:      449.46      770.77      615.22      101.70
numa01.sh       Sys:      132.72      208.17      170.46       24.96
numa01.sh      User:    39185.26    60290.89    50066.76     6807.84
numa02.sh      Real:       60.85       61.79       61.28        0.37
numa02.sh       Sys:       15.34       24.71       21.08        3.61
numa02.sh      User:     5204.41     5249.85     5231.21       17.60
numa03.sh      Real:      785.50      916.97      840.77       44.98
numa03.sh       Sys:      108.08      133.60      119.43        8.82
numa03.sh      User:    61422.86    70919.75    64720.87     3310.61
numa04.sh      Real:      429.57      587.37      480.80       57.40
numa04.sh       Sys:      240.61      321.97      290.84       33.58
numa04.sh      User:    34597.65    40498.99    37079.48     2060.72
numa05.sh      Real:      392.09      431.25      414.65       13.82
numa05.sh       Sys:      229.41      372.48      297.54       53.14
numa05.sh      User:    33390.86    34697.49    34222.43      556.42

Testcase       Time:         Min         Max         Avg      StdDev 	%Change
numa01.sh      Real:      424.63      566.18      498.12       59.26 	 23.50%
numa01.sh       Sys:      160.19      256.53      208.98       37.02 	 -18.4%
numa01.sh      User:    37320.00    46225.58    42001.57     3482.45 	 19.20%
numa02.sh      Real:       60.17       62.47       60.91        0.85 	 0.607%
numa02.sh       Sys:       15.30       22.82       17.04        2.90 	 23.70%
numa02.sh      User:     5202.13     5255.51     5219.08       20.14 	 0.232%
numa03.sh      Real:      823.91      844.89      833.86        8.46 	 0.828%
numa03.sh       Sys:      130.69      148.29      140.47        6.21 	 -14.9%
numa03.sh      User:    62519.15    64262.20    63613.38      620.05 	 1.740%
numa04.sh      Real:      515.30      603.74      548.56       30.93 	 -12.3%
numa04.sh       Sys:      459.73      525.48      489.18       21.63 	 -40.5%
numa04.sh      User:    40561.96    44919.18    42047.87     1526.85 	 -11.8%
numa05.sh      Real:      396.58      454.37      421.13       19.71 	 -1.53%
numa05.sh       Sys:      208.72      422.02      348.90       73.60 	 -14.7%
numa05.sh      User:    33124.08    36109.35    34846.47     1089.74 	 -1.79%

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1529514181-9842-20-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9c9e54ea65d9..309c93fcc604 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2180,9 +2180,6 @@ static void task_numa_placement(struct task_struct *p)
 		/* Set the new preferred node */
 		if (max_nid != p->numa_preferred_nid)
 			sched_setnuma(p, max_nid);
-
-		if (task_node(p) != p->numa_preferred_nid)
-			numa_migrate_preferred(p);
 	}
 
 	update_task_scan_period(p, fault_types[0], fault_types[1]);
@@ -2385,14 +2382,14 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
 				numa_is_active_node(mem_node, ng))
 		local = 1;
 
-	task_numa_placement(p);
-
 	/*
 	 * Retry task to preferred node migration periodically, in case it
 	 * case it previously failed, or the scheduler moved us.
 	 */
-	if (time_after(jiffies, p->numa_migrate_retry))
+	if (time_after(jiffies, p->numa_migrate_retry)) {
+		task_numa_placement(p);
 		numa_migrate_preferred(p);
+	}
 
 	if (migrated)
 		p->numa_pages_migrated += pages;

^ permalink raw reply related	[flat|nested] 53+ messages in thread

end of thread, other threads:[~2018-07-25 14:39 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-20 17:02 [PATCH v2 00/19] Fixes for sched/numa_balancing Srikar Dronamraju
2018-06-20 17:02 ` [PATCH v2 01/19] sched/numa: Remove redundant field Srikar Dronamraju
2018-07-25 14:23   ` [tip:sched/core] " tip-bot for Srikar Dronamraju
2018-06-20 17:02 ` [PATCH v2 02/19] sched/numa: Evaluate move once per node Srikar Dronamraju
2018-06-21  9:06   ` Mel Gorman
2018-07-25 14:24   ` [tip:sched/core] " tip-bot for Srikar Dronamraju
2018-06-20 17:02 ` [PATCH v2 03/19] sched/numa: Simplify load_too_imbalanced Srikar Dronamraju
2018-07-25 14:24   ` [tip:sched/core] sched/numa: Simplify load_too_imbalanced() tip-bot for Srikar Dronamraju
2018-06-20 17:02 ` [PATCH v2 04/19] sched/numa: Set preferred_node based on best_cpu Srikar Dronamraju
2018-06-21  9:17   ` Mel Gorman
2018-07-25 14:25   ` [tip:sched/core] " tip-bot for Srikar Dronamraju
2018-06-20 17:02 ` [PATCH v2 05/19] sched/numa: Use task faults only if numa_group is not yet setup Srikar Dronamraju
2018-06-21  9:38   ` Mel Gorman
2018-07-25 14:25   ` [tip:sched/core] sched/numa: Use task faults only if numa_group is not yet set up tip-bot for Srikar Dronamraju
2018-06-20 17:02 ` [PATCH v2 06/19] sched/debug: Reverse the order of printing faults Srikar Dronamraju
2018-07-25 14:26   ` [tip:sched/core] " tip-bot for Srikar Dronamraju
2018-06-20 17:02 ` [PATCH v2 07/19] sched/numa: Skip nodes that are at hoplimit Srikar Dronamraju
2018-07-25 14:27   ` [tip:sched/core] sched/numa: Skip nodes that are at 'hoplimit' tip-bot for Srikar Dronamraju
2018-06-20 17:02 ` [PATCH v2 08/19] sched/numa: Remove unused task_capacity from numa_stats Srikar Dronamraju
2018-07-25 14:27   ` [tip:sched/core] sched/numa: Remove unused task_capacity from 'struct numa_stats' tip-bot for Srikar Dronamraju
2018-06-20 17:02 ` [PATCH v2 09/19] sched/numa: Modify migrate_swap to accept additional params Srikar Dronamraju
2018-07-25 14:28   ` [tip:sched/core] sched/numa: Modify migrate_swap() to accept additional parameters tip-bot for Srikar Dronamraju
2018-06-20 17:02 ` [PATCH v2 10/19] sched/numa: Stop multiple tasks from moving to the cpu at the same time Srikar Dronamraju
2018-06-20 17:02 ` [PATCH v2 11/19] sched/numa: Restrict migrating in parallel to the same node Srikar Dronamraju
2018-07-23 10:38   ` Peter Zijlstra
2018-07-23 11:16     ` Srikar Dronamraju
2018-06-20 17:02 ` [PATCH v2 12/19] sched/numa: Remove numa_has_capacity Srikar Dronamraju
2018-07-25 14:28   ` [tip:sched/core] sched/numa: Remove numa_has_capacity() tip-bot for Srikar Dronamraju
2018-06-20 17:02 ` [PATCH v2 13/19] mm/migrate: Use xchg instead of spinlock Srikar Dronamraju
2018-06-21  9:51   ` Mel Gorman
2018-07-23 10:54   ` Peter Zijlstra
2018-07-23 11:20     ` Srikar Dronamraju
2018-07-23 14:04       ` Peter Zijlstra
2018-06-20 17:02 ` [PATCH v2 14/19] sched/numa: Updation of scan period need not be in lock Srikar Dronamraju
2018-06-21  9:51   ` Mel Gorman
2018-07-25 14:29   ` [tip:sched/core] sched/numa: Update the scan period without holding the numa_group lock tip-bot for Srikar Dronamraju
2018-06-20 17:02 ` [PATCH v2 15/19] sched/numa: Use group_weights to identify if migration degrades locality Srikar Dronamraju
2018-07-25 14:29   ` [tip:sched/core] " tip-bot for Srikar Dronamraju
2018-06-20 17:02 ` [PATCH v2 16/19] sched/numa: Detect if node actively handling migration Srikar Dronamraju
2018-06-20 17:02 ` [PATCH v2 17/19] sched/numa: Pass destination cpu as a parameter to migrate_task_rq Srikar Dronamraju
2018-06-20 17:02 ` [PATCH v2 18/19] sched/numa: Reset scan rate whenever task moves across nodes Srikar Dronamraju
2018-06-21 10:05   ` Mel Gorman
2018-07-04 11:19     ` Srikar Dronamraju
2018-06-20 17:03 ` [PATCH v2 19/19] sched/numa: Move task_placement closer to numa_migrate_preferred Srikar Dronamraju
2018-06-21 10:06   ` Mel Gorman
2018-07-25 14:30   ` [tip:sched/core] sched/numa: Move task_numa_placement() closer to numa_migrate_preferred() tip-bot for Srikar Dronamraju
2018-06-20 17:03 ` [PATCH v2 00/19] Fixes for sched/numa_balancing Srikar Dronamraju
2018-07-23 13:57 ` Peter Zijlstra
2018-07-23 15:09   ` Srikar Dronamraju
2018-07-23 15:21     ` Peter Zijlstra
2018-07-23 16:29       ` Srikar Dronamraju
2018-07-23 16:47         ` Peter Zijlstra
2018-07-23 15:33     ` Rik van Riel
