From: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
To: Ingo Molnar <mingo@kernel.org>, Peter Zijlstra <peterz@infradead.org>
Cc: LKML <linux-kernel@vger.kernel.org>,
	Mel Gorman <mgorman@techsingularity.net>,
	Rik van Riel <riel@surriel.com>,
	Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
	Thomas Gleixner <tglx@linutronix.de>
Subject: [PATCH v2 02/19] sched/numa: Evaluate move once per node
Date: Wed, 20 Jun 2018 22:32:43 +0530
Message-ID: <1529514181-9842-3-git-send-email-srikar@linux.vnet.ibm.com>
In-Reply-To: <1529514181-9842-1-git-send-email-srikar@linux.vnet.ibm.com>

task_numa_compare() helps choose the best CPU to move or swap the
selected task to. To achieve this, task_numa_compare() is called for
every CPU in the node, and it currently evaluates whether the task can
be moved/swapped for each of those CPUs. However, the move evaluation
is mostly independent of the CPU. Evaluating the move logic once per
node provides scope for simplifying task_numa_compare().
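
In rough pseudo-C, the restructuring hoists the cpu-independent move
check out of the per-cpu loop (a sketch only; evaluate_swap() and
evaluate_move() are illustrative names, not kernel functions -- the
real code is in the diff below):

	/* Before: move feasibility re-evaluated for every cpu in the node. */
	for_each_cpu(cpu, cpumask_of_node(env->dst_nid)) {
		evaluate_swap(env, cpu);
		evaluate_move(env, cpu);	/* mostly independent of cpu */
	}

	/* After: compute it once per node, pass the result into the loop. */
	maymove = !load_too_imbalanced(src_load, dst_load, env);
	for_each_cpu(cpu, cpumask_of_node(env->dst_nid))
		task_numa_compare(env, taskimp, groupimp, maymove);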

Running SPECjbb2005 on a 4-node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
16    25705.2     25058.2     -2.51
1     74433       72950       -1.99

Running SPECjbb2005 on a 16-node machine and comparing bops/JVM
JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
8     96589.6     105930      9.670
1     181830      178624      -1.76
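
Here %CHANGE is (WITH_PATCH - LAST_PATCH) / LAST_PATCH; e.g. for the
4-node, 16-JVM case: (25058.2 - 25705.2) / 25705.2 ~= -2.51%.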

(numbers from v1 based on v4.17-rc5)
Without the patch:
Testcase       Time:         Min         Max         Avg      StdDev
numa01.sh      Real:      440.65      941.32      758.98      189.17
numa01.sh       Sys:      183.48      320.07      258.42       50.09
numa01.sh      User:    37384.65    71818.14    60302.51    13798.96
numa02.sh      Real:       61.24       65.35       62.49        1.49
numa02.sh       Sys:       16.83       24.18       21.40        2.60
numa02.sh      User:     5219.59     5356.34     5264.03       49.07
numa03.sh      Real:      822.04      912.40      873.55       37.35
numa03.sh       Sys:      118.80      140.94      132.90        7.60
numa03.sh      User:    62485.19    70025.01    67208.33     2967.10
numa04.sh      Real:      690.66      872.12      778.49       65.44
numa04.sh       Sys:      459.26      563.03      494.03       42.39
numa04.sh      User:    51116.44    70527.20    58849.44     8461.28
numa05.sh      Real:      418.37      562.28      525.77       54.27
numa05.sh       Sys:      299.45      481.00      392.49       64.27
numa05.sh      User:    34115.09    41324.02    39105.30     2627.68

With the patch:
Testcase       Time:         Min         Max         Avg      StdDev 	 %Change
numa01.sh      Real:      516.14      892.41      739.84      151.32 	 2.587%
numa01.sh       Sys:      153.16      192.99      177.70       14.58 	 45.42%
numa01.sh      User:    39821.04    69528.92    57193.87    10989.48 	 5.435%
numa02.sh      Real:       60.91       62.35       61.58        0.63 	 1.477%
numa02.sh       Sys:       16.47       26.16       21.20        3.85 	 0.943%
numa02.sh      User:     5227.58     5309.61     5265.17       31.04 	 -0.02%
numa03.sh      Real:      739.07      917.73      795.75       64.45 	 9.776%
numa03.sh       Sys:       94.46      136.08      109.48       14.58 	 21.39%
numa03.sh      User:    57478.56    72014.09    61764.48     5343.69 	 8.813%
numa04.sh      Real:      442.61      715.43      530.31       96.12 	 46.79%
numa04.sh       Sys:      224.90      348.63      285.61       48.83 	 72.97%
numa04.sh      User:    35836.84    47522.47    40235.41     3985.26 	 46.26%
numa05.sh      Real:      386.13      489.17      434.94       43.59 	 20.88%
numa05.sh       Sys:      144.29      438.56      278.80      105.78 	 40.77%
numa05.sh      User:    33255.86    36890.82    34879.31     1641.98 	 12.11%
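
Note that %Change here is relative to the patched average; e.g. for
numa04.sh Real: (778.49 - 530.31) / 530.31 ~= 46.79%.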

Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v1->v2:
Style change and variable rename as suggested by Rik.

 kernel/sched/fair.c | 128 +++++++++++++++++++++++-----------------------------
 1 file changed, 57 insertions(+), 71 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 79f574d..69136e9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1541,9 +1541,8 @@ static bool load_too_imbalanced(long src_load, long dst_load,
  * be exchanged with the source task
  */
 static void task_numa_compare(struct task_numa_env *env,
-			      long taskimp, long groupimp)
+			      long taskimp, long groupimp, bool maymove)
 {
-	struct rq *src_rq = cpu_rq(env->src_cpu);
 	struct rq *dst_rq = cpu_rq(env->dst_cpu);
 	struct task_struct *cur;
 	long src_load, dst_load;
@@ -1564,97 +1563,73 @@ static void task_numa_compare(struct task_numa_env *env,
 	if (cur == env->p)
 		goto unlock;
 
+	if (!cur) {
+		if (maymove || imp > env->best_imp)
+			goto assign;
+		else
+			goto unlock;
+	}
+
 	/*
 	 * "imp" is the fault differential for the source task between the
 	 * source and destination node. Calculate the total differential for
 	 * the source task and potential destination task. The more negative
-	 * the value is, the more rmeote accesses that would be expected to
+	 * the value is, the more remote accesses that would be expected to
 	 * be incurred if the tasks were swapped.
 	 */
-	if (cur) {
-		/* Skip this swap candidate if cannot move to the source CPU: */
-		if (!cpumask_test_cpu(env->src_cpu, &cur->cpus_allowed))
-			goto unlock;
+	/* Skip this swap candidate if cannot move to the source cpu */
+	if (!cpumask_test_cpu(env->src_cpu, &cur->cpus_allowed))
+		goto unlock;
 
+	/*
+	 * If dst and source tasks are in the same NUMA group, or not
+	 * in any group then look only at task weights.
+	 */
+	if (cur->numa_group == env->p->numa_group) {
+		imp = taskimp + task_weight(cur, env->src_nid, dist) -
+		      task_weight(cur, env->dst_nid, dist);
 		/*
-		 * If dst and source tasks are in the same NUMA group, or not
-		 * in any group then look only at task weights.
+		 * Add some hysteresis to prevent swapping the
+		 * tasks within a group over tiny differences.
 		 */
-		if (cur->numa_group == env->p->numa_group) {
-			imp = taskimp + task_weight(cur, env->src_nid, dist) -
-			      task_weight(cur, env->dst_nid, dist);
-			/*
-			 * Add some hysteresis to prevent swapping the
-			 * tasks within a group over tiny differences.
-			 */
-			if (cur->numa_group)
-				imp -= imp/16;
-		} else {
-			/*
-			 * Compare the group weights. If a task is all by
-			 * itself (not part of a group), use the task weight
-			 * instead.
-			 */
-			if (cur->numa_group)
-				imp += group_weight(cur, env->src_nid, dist) -
-				       group_weight(cur, env->dst_nid, dist);
-			else
-				imp += task_weight(cur, env->src_nid, dist) -
-				       task_weight(cur, env->dst_nid, dist);
-		}
+		if (cur->numa_group)
+			imp -= imp / 16;
+	} else {
+		/*
+		 * Compare the group weights. If a task is all by itself
+		 * (not part of a group), use the task weight instead.
+		 */
+		if (cur->numa_group && env->p->numa_group)
+			imp += group_weight(cur, env->src_nid, dist) -
+			       group_weight(cur, env->dst_nid, dist);
+		else
+			imp += task_weight(cur, env->src_nid, dist) -
+			       task_weight(cur, env->dst_nid, dist);
 	}
 
-	if (imp <= env->best_imp && moveimp <= env->best_imp)
+	if (imp <= env->best_imp)
 		goto unlock;
 
-	if (!cur) {
-		/* Is there capacity at our destination? */
-		if (env->src_stats.nr_running <= env->src_stats.task_capacity &&
-		    !env->dst_stats.has_free_capacity)
-			goto unlock;
-
-		goto balance;
-	}
-
-	/* Balance doesn't matter much if we're running a task per CPU: */
-	if (imp > env->best_imp && src_rq->nr_running == 1 &&
-			dst_rq->nr_running == 1)
+	if (maymove && moveimp > imp && moveimp > env->best_imp) {
+		imp = moveimp - 1;
+		cur = NULL;
 		goto assign;
+	}
 
 	/*
 	 * In the overloaded case, try and keep the load balanced.
 	 */
-balance:
-	load = task_h_load(env->p);
+	load = task_h_load(env->p) - task_h_load(cur);
+	if (!load)
+		goto assign;
+
 	dst_load = env->dst_stats.load + load;
 	src_load = env->src_stats.load - load;
 
-	if (moveimp > imp && moveimp > env->best_imp) {
-		/*
-		 * If the improvement from just moving env->p direction is
-		 * better than swapping tasks around, check if a move is
-		 * possible. Store a slightly smaller score than moveimp,
-		 * so an actually idle CPU will win.
-		 */
-		if (!load_too_imbalanced(src_load, dst_load, env)) {
-			imp = moveimp - 1;
-			cur = NULL;
-			goto assign;
-		}
-	}
-
-	if (imp <= env->best_imp)
-		goto unlock;
-
-	if (cur) {
-		load = task_h_load(cur);
-		dst_load -= load;
-		src_load += load;
-	}
-
 	if (load_too_imbalanced(src_load, dst_load, env))
 		goto unlock;
 
+assign:
 	/*
 	 * One idle CPU per node is evaluated for a task numa move.
 	 * Call select_idle_sibling to maybe find a better one.
@@ -1670,7 +1645,6 @@ static void task_numa_compare(struct task_numa_env *env,
 		local_irq_enable();
 	}
 
-assign:
 	task_numa_assign(env, cur, imp);
 unlock:
 	rcu_read_unlock();
@@ -1679,15 +1653,27 @@ static void task_numa_compare(struct task_numa_env *env,
 static void task_numa_find_cpu(struct task_numa_env *env,
 				long taskimp, long groupimp)
 {
+	long src_load, dst_load, load;
+	bool maymove = false;
 	int cpu;
 
+	load = task_h_load(env->p);
+	dst_load = env->dst_stats.load + load;
+	src_load = env->src_stats.load - load;
+
+	/*
+	 * If the improvement from just moving env->p direction is better
+	 * than swapping tasks around, check if a move is possible.
+	 */
+	maymove = !load_too_imbalanced(src_load, dst_load, env);
+
 	for_each_cpu(cpu, cpumask_of_node(env->dst_nid)) {
 		/* Skip this CPU if the source task cannot migrate */
 		if (!cpumask_test_cpu(cpu, &env->p->cpus_allowed))
 			continue;
 
 		env->dst_cpu = cpu;
-		task_numa_compare(env, taskimp, groupimp);
+		task_numa_compare(env, taskimp, groupimp, maymove);
 	}
 }
 
-- 
1.8.3.1

