From: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
To: Ingo Molnar <mingo@kernel.org>, Peter Zijlstra <peterz@infradead.org>
Cc: LKML <linux-kernel@vger.kernel.org>,
	Mel Gorman <mgorman@techsingularity.net>,
	Rik van Riel <riel@surriel.com>,
	Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
	Thomas Gleixner <tglx@linutronix.de>
Subject: [PATCH 1/6] sched/numa: Stop multiple tasks from moving to the cpu at the same time
Date: Fri,  3 Aug 2018 11:43:56 +0530
Message-ID: <1533276841-16341-2-git-send-email-srikar@linux.vnet.ibm.com>
In-Reply-To: <1533276841-16341-1-git-send-email-srikar@linux.vnet.ibm.com>

Task migration under NUMA balancing can happen in parallel. More than
one task might choose to migrate to the same CPU at the same time. This
can result in:
- During task swap, choosing a task that was not part of the evaluation.
- During task swap, a task that has just moved to its preferred node
  being moved to a completely different node.
- During task swap, a task that fails to move to its preferred node
  having to wait an extra interval for the next migrate opportunity.
- During task movement, multiple task movements causing load imbalance.

This problem is more likely if there are more cores per node or more
nodes in the system.

Use a per-run-queue variable, numa_migrate_on, to check whether NUMA
balancing is already active on the destination run-queue, and skip that
run-queue if it is.
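
The claim/release idea can be modelled in userspace. The following is a
minimal sketch, assuming C11 atomics in place of the kernel's xchg() and
WRITE_ONCE(); struct demo_rq, demo_claim() and demo_release() are
hypothetical names used only for illustration and are not part of this patch.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the per-run-queue flag; not the kernel's struct rq. */
struct demo_rq {
	atomic_uint numa_migrate_on;
};

/*
 * Try to mark the run-queue as a NUMA-migration target.
 * atomic_exchange() returns the previous value, so only the caller that
 * sees 0 has claimed the run-queue; every other caller must back off.
 */
static int demo_claim(struct demo_rq *rq)
{
	return atomic_exchange(&rq->numa_migrate_on, 1) == 0;
}

/* Drop the claim once the migration attempt has finished. */
static void demo_release(struct demo_rq *rq)
{
	/* Sanity check: only a run-queue that was claimed may be released. */
	assert(atomic_load(&rq->numa_migrate_on) == 1);
	atomic_store(&rq->numa_migrate_on, 0);
}
```

In this model a second demo_claim() on the same run-queue fails until
demo_release() has run, which is the property the flag is meant to provide.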

specjbb2005 / bops/JVM / higher bops are better
on 2 Socket/2 Node Intel
JVMS  Prev    Current  %Change
4     199709  206350   3.32534
1     330830  319963   -3.28477


on 2 Socket/4 Node Power8 (PowerNV)
JVMS  Prev     Current  %Change
8     89011.9  89627.8  0.69193
1     218946   211338   -3.47483


on 2 Socket/2 Node Power9 (PowerNV)
JVMS  Prev    Current  %Change
4     180473  186539   3.36117
1     212805  220344   3.54268


on 4 Socket/4 Node Power7
JVMS  Prev     Current  %Change
8     56941.8  56836    -0.185804
1     111686   112970   1.14965


dbench / transactions / higher numbers are better
on 2 Socket/2 Node Intel
count  Min      Max      Avg      Variance  %Change
5      12029.8  12124.6  12060.9  34.0076
5      13136.1  13170.2  13150.2  14.7482   9.03166


on 2 Socket/4 Node Power8 (PowerNV)
count  Min      Max      Avg      Variance  %Change
5      4968.51  5006.62  4981.31  13.4151
5      4319.79  4998.19  4836.53  261.109   -2.90646


on 2 Socket/2 Node Power9 (PowerNV)
count  Min      Max      Avg      Variance  %Change
5      9342.92  9381.44  9363.92  12.8587
5      9325.56  9402.7   9362.49  25.9638   -0.0152714


on 4 Socket/4 Node Power7
count  Min      Max      Avg      Variance  %Change
5      143.4    188.892  170.225  16.9929
5      132.581  191.072  170.554  21.6444   0.193274
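
The %Change columns above are consistent with a plain relative change
against the baseline: Current vs Prev for specjbb2005, and the second row's
Avg vs the first row's Avg for dbench. That reading is inferred from the
numbers and is not stated in the changelog. A small sketch of the
arithmetic, with pct_change() as a hypothetical helper name:

```c
#include <assert.h>

/*
 * Relative change of `current` against `prev`, in percent.
 * Positive means `current` is higher than `prev`.
 */
static double pct_change(double prev, double current)
{
	assert(prev != 0.0);	/* the baseline must be non-zero */
	return (current - prev) / prev * 100.0;
}
```

For example, pct_change(199709, 206350) gives roughly 3.325, which agrees
with the 4-JVM row of the 2 Socket/2 Node Intel specjbb2005 table.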

Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
Changelog v2->v3:
Add comments as requested by Peter.

 kernel/sched/fair.c  | 22 ++++++++++++++++++++++
 kernel/sched/sched.h |  1 +
 2 files changed, 23 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 309c93f..5cf921a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1514,6 +1514,21 @@ struct task_numa_env {
 static void task_numa_assign(struct task_numa_env *env,
 			     struct task_struct *p, long imp)
 {
+	struct rq *rq = cpu_rq(env->dst_cpu);
+
+	/* Bail out if run-queue part of active numa balance. */
+	if (xchg(&rq->numa_migrate_on, 1))
+		return;
+
+	/*
+	 * Clear previous best_cpu/rq numa-migrate flag, since task now
+	 * found a better cpu to move/swap.
+	 */
+	if (env->best_cpu != -1) {
+		rq = cpu_rq(env->best_cpu);
+		WRITE_ONCE(rq->numa_migrate_on, 0);
+	}
+
 	if (env->best_task)
 		put_task_struct(env->best_task);
 	if (p)
@@ -1569,6 +1584,9 @@ static void task_numa_compare(struct task_numa_env *env,
 	long moveimp = imp;
 	int dist = env->dist;
 
+	if (READ_ONCE(dst_rq->numa_migrate_on))
+		return;
+
 	rcu_read_lock();
 	cur = task_rcu_dereference(&dst_rq->curr);
 	if (cur && ((cur->flags & PF_EXITING) || is_idle_task(cur)))
@@ -1710,6 +1728,7 @@ static int task_numa_migrate(struct task_struct *p)
 		.best_cpu = -1,
 	};
 	struct sched_domain *sd;
+	struct rq *best_rq;
 	unsigned long taskweight, groupweight;
 	int nid, ret, dist;
 	long taskimp, groupimp;
@@ -1811,14 +1830,17 @@ static int task_numa_migrate(struct task_struct *p)
 	 */
 	p->numa_scan_period = task_scan_start(p);
 
+	best_rq = cpu_rq(env.best_cpu);
 	if (env.best_task == NULL) {
 		ret = migrate_task_to(p, env.best_cpu);
+		WRITE_ONCE(best_rq->numa_migrate_on, 0);
 		if (ret != 0)
 			trace_sched_stick_numa(p, env.src_cpu, env.best_cpu);
 		return ret;
 	}
 
 	ret = migrate_swap(p, env.best_task, env.best_cpu, env.src_cpu);
+	WRITE_ONCE(best_rq->numa_migrate_on, 0);
 
 	if (ret != 0)
 		trace_sched_stick_numa(p, env.src_cpu, task_cpu(env.best_task));
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4a2e8ca..0b91612 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -783,6 +783,7 @@ struct rq {
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned int		nr_numa_running;
 	unsigned int		nr_preferred_running;
+	unsigned int		numa_migrate_on;
 #endif
 	#define CPU_LOAD_IDX_MAX 5
 	unsigned long		cpu_load[CPU_LOAD_IDX_MAX];
-- 
1.8.3.1
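
For readers following the flag through task_numa_assign() and
task_numa_migrate(), the hand-off can be modelled in userspace as below.
This is only a sketch under stated assumptions: C11 atomics replace xchg()
and WRITE_ONCE(), a small fixed array replaces the per-CPU run-queues, and
demo_env, demo_assign() and demo_finish() are hypothetical names. The
assignment to best_cpu mirrors what the unchanged remainder of
task_numa_assign() is assumed to do.

```c
#include <assert.h>
#include <stdatomic.h>

#define DEMO_NR_CPUS 4

/* Hypothetical per-CPU flags standing in for rq->numa_migrate_on. */
static atomic_uint demo_migrate_on[DEMO_NR_CPUS];

/* Hypothetical, trimmed-down analogue of struct task_numa_env. */
struct demo_env {
	int best_cpu;	/* -1 until a candidate CPU has been chosen */
};

/*
 * Model of the flag handling added to task_numa_assign(): claim the new
 * destination first, and only then release the previously chosen CPU.
 * Returns 1 if dst_cpu became the new best candidate, 0 if it was busy.
 */
static int demo_assign(struct demo_env *env, int dst_cpu)
{
	assert(dst_cpu >= 0 && dst_cpu < DEMO_NR_CPUS);

	/* Bail out if another task already targets this CPU. */
	if (atomic_exchange(&demo_migrate_on[dst_cpu], 1))
		return 0;

	/* A better CPU was found, so free the previous candidate. */
	if (env->best_cpu != -1)
		atomic_store(&demo_migrate_on[env->best_cpu], 0);

	env->best_cpu = dst_cpu;
	return 1;
}

/* Model of the end of task_numa_migrate(): drop the flag after the attempt. */
static void demo_finish(struct demo_env *env)
{
	if (env->best_cpu != -1)
		atomic_store(&demo_migrate_on[env->best_cpu], 0);
}
```

At most one flag is held per migrating task at any time in this model: a
task that upgrades from one candidate CPU to another releases the old one,
so other tasks can pick it up again.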