* [GIT TREE] Unified NUMA balancing tree, v3
@ 2012-12-07  0:19 Ingo Molnar
  2012-12-07  0:19 ` [PATCH 1/9] numa, sched: Fix NUMA tick ->numa_shared setting Ingo Molnar
                   ` (9 more replies)
  0 siblings, 10 replies; 19+ messages in thread
From: Ingo Molnar @ 2012-12-07  0:19 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

I'm pleased to announce the -v3 version of the unified NUMA tree,
which can be accessed at the following Git address:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git master

[ To test this tree, just pick up the Git tree and enable
  CONFIG_NUMA_BALANCING=y. On a NUMA system with at least 2 nodes
  you should see speedups in many types of long-running,
  memory-intensive user-space workloads with this feature enabled.
  Or a slowdown if the plan does not work out. Please report
  both cases. ]

The focus of the -v3 release is regression fixes. Half of the
regressions fixed were related to the unification, the other half
were due to prior bugs.

Main changes since -v2:

  - Implement last-CPU+PID hash tracking
  - Improve staggered convergence
  - Improve directed convergence
  - Fix !THP, 4K-pte "2M-emu" NUMA fault handling

In particular the new CPU+PID hashing code works very well, and
I'd be curious whether others can confirm that they are seeing
speedups as well.

Some performance figures. Here's the comparison to mainline:

 ##############
 # res-v3.6-vanilla.log vs res-numaunified-v3.log:
 #------------------------------------------------------------------------------------>
   autonuma benchmark                run time (lower is better)         speedup %
 ------------------------------------------------------------------------------------->
   numa01                           :   337.29  vs.  195.47   |           +72.5 %
   numa01_THREAD_ALLOC              :   428.79  vs.  119.97   |          +257.4 %
   numa02                           :    56.32  vs.   16.82   |          +234.8 %
   numa02_SMT                       :    56.55  vs.   16.98   |          +233.0 %
   ------------------------------------------------------------

Still much better, all around.

Comparison to -v17, the last non-regressing pre-unification
tree:

 ##############
 # res-numacore-v18b.log vs res-numaunified-v3.log:
 #------------------------------------------------------------------------------------>
   autonuma benchmark                run time (lower is better)         speedup %
 ------------------------------------------------------------------------------------->
   numa01                           :   177.64  vs.  195.47   |            -9.1 %
   numa01_THREAD_ALLOC              :   127.07  vs.  119.97   |            +5.9 %
   numa02                           :    18.08  vs.   16.82   |            +7.4 %
   numa02_SMT                       :    36.97  vs.   16.98   |          +117.7 %
   ------------------------------------------------------------

[ Note: the 'numa01' result is a bit slower, due to us not
  taking node distances into account on larger than 2-node
  systems, and due to this run spreading the tasks in a
  suboptimal A-B-A-B order instead of A-A-B-B. There's a 50%
  chance of that outcome and this run got the worse convergence
  layout.

  That behavior due to node asymmetry will be improved in future
  versions. Note that even in the less ideal layout it's faster
  than mainline. ]

- The twice-as-fast numa02_SMT result is a regression fix.

- 'numa02' and 'numa01_THREAD_ALLOC' got genuinely faster - and
  that's good news because those are our prime target 'good'
  NUMA workloads.

The SPECjbb 4x JVM numbers are still very close to the
hard-binding results:

  Fri Dec  7 02:08:42 CET 2012
  spec1.txt:           throughput =     188667.94 SPECjbb2005 bops
  spec2.txt:           throughput =     190109.31 SPECjbb2005 bops
  spec3.txt:           throughput =     191438.13 SPECjbb2005 bops
  spec4.txt:           throughput =     192508.34 SPECjbb2005 bops
                                      --------------------------
        SUM:           throughput =     762723.72 SPECjbb2005 bops

The same is true for !THP as well.

( In case you have sent a regression report, please re-test this
  version - I'll try to work down some of my email backlog and
  reply to any mails I have not yet replied to. )

Reports, fixes, suggestions are welcome, as always!

Thanks,

	Ingo

------------------------------------------------->
Ingo Molnar (9):
  numa, sched: Fix NUMA tick ->numa_shared setting
  numa, sched: Add tracking of runnable NUMA tasks
  numa, sched: Implement wake-cpu migration support
  numa, mm, sched: Implement last-CPU+PID hash tracking
  numa, mm, sched: Fix NUMA affinity tracking logic
  numa, mm: Fix !THP, 4K-pte "2M-emu" NUMA fault handling
  numa, sched: Improve staggered convergence
  numa, sched: Improve directed convergence
  numa, sched: Streamline and fix numa_allow_migration() use

 include/linux/init_task.h         |   4 +-
 include/linux/mempolicy.h         |   4 +-
 include/linux/mm.h                |  79 +++++---
 include/linux/mm_types.h          |   4 +-
 include/linux/page-flags-layout.h |  23 ++-
 include/linux/sched.h             |   9 +-
 kernel/sched/core.c               |  29 ++-
 kernel/sched/fair.c               | 370 +++++++++++++++++++++++++++-----------
 kernel/sched/features.h           |   2 +
 kernel/sched/sched.h              |   4 +
 kernel/sysctl.c                   |   8 +
 mm/huge_memory.c                  |  25 +--
 mm/memory.c                       | 175 +++++++++++++-----
 mm/mempolicy.c                    |  50 ++++--
 mm/migrate.c                      |   4 +-
 mm/mprotect.c                     |   4 +-
 mm/page_alloc.c                   |   6 +-
 17 files changed, 574 insertions(+), 226 deletions(-)

-- 
1.7.11.7



* [PATCH 1/9] numa, sched: Fix NUMA tick ->numa_shared setting
  2012-12-07  0:19 [GIT TREE] Unified NUMA balancing tree, v3 Ingo Molnar
@ 2012-12-07  0:19 ` Ingo Molnar
  2012-12-07  0:19 ` [PATCH 2/9] numa, sched: Add tracking of runnable NUMA tasks Ingo Molnar
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: Ingo Molnar @ 2012-12-07  0:19 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

Split out __sched_setnuma(), an unlocked variant of
sched_setnuma(), and use it from the NUMA tick when we are
modifying p->numa_shared.
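
For illustration (not part of the patch): a minimal userspace sketch
of the locked-wrapper/unlocked-core pattern this split follows, with
a pthread mutex standing in for the runqueue lock and made-up
struct/field names (build with -pthread):

#include <pthread.h>
#include <stdio.h>

struct task {
	pthread_mutex_t rq_lock;	/* stands in for the rq lock */
	int numa_shared;
	int numa_max_node;
};

/* Unlocked core: the caller must already hold t->rq_lock. */
static void __set_numa(struct task *t, int node, int shared)
{
	t->numa_max_node = node;
	t->numa_shared   = shared;
}

/* Locking wrapper for callers that do not hold the lock yet. */
static void set_numa(struct task *t, int node, int shared)
{
	pthread_mutex_lock(&t->rq_lock);
	__set_numa(t, node, shared);
	pthread_mutex_unlock(&t->rq_lock);
}

int main(void)
{
	struct task t = {
		.rq_lock = PTHREAD_MUTEX_INITIALIZER,
		.numa_shared = -1, .numa_max_node = -1,
	};

	set_numa(&t, 0, 1);		/* lock not held: use the wrapper  */

	pthread_mutex_lock(&t.rq_lock);	/* tick-like context: lock held,   */
	__set_numa(&t, 1, 0);		/* so call the __ variant directly */
	pthread_mutex_unlock(&t.rq_lock);

	printf("node=%d shared=%d\n", t.numa_max_node, t.numa_shared);
	return 0;
}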

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c  | 19 +++++++++++++++----
 kernel/sched/fair.c  |  2 +-
 kernel/sched/sched.h |  1 +
 3 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 69b18b3..cce84c3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6091,13 +6091,10 @@ static struct sched_domain_topology_level *sched_domain_topology = default_topol
 /*
  * Change a task's NUMA state - called from the placement tick.
  */
-void sched_setnuma(struct task_struct *p, int node, int shared)
+void __sched_setnuma(struct rq *rq, struct task_struct *p, int node, int shared)
 {
-	unsigned long flags;
 	int on_rq, running;
-	struct rq *rq;
 
-	rq = task_rq_lock(p, &flags);
 	on_rq = p->on_rq;
 	running = task_current(rq, p);
 
@@ -6113,6 +6110,20 @@ void sched_setnuma(struct task_struct *p, int node, int shared)
 		p->sched_class->set_curr_task(rq);
 	if (on_rq)
 		enqueue_task(rq, p, 0);
+}
+
+/*
+ * Change a task's NUMA state - called from the placement tick.
+ */
+void sched_setnuma(struct task_struct *p, int node, int shared)
+{
+	unsigned long flags;
+	struct rq *rq;
+
+	rq = task_rq_lock(p, &flags);
+
+	__sched_setnuma(rq, p, node, shared);
+
 	task_rq_unlock(rq, p, &flags);
 
 	/*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 21c10f7..0c83689 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2540,7 +2540,7 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
 	/* Cheap checks first: */
 	if (!task_numa_candidate(curr)) {
 		if (curr->numa_shared >= 0)
-			curr->numa_shared = -1;
+			__sched_setnuma(rq, curr, -1, -1);
 		return;
 	}
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0fdd304..f75bf06 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -513,6 +513,7 @@ DECLARE_PER_CPU(struct rq, runqueues);
 #define raw_rq()		(&__raw_get_cpu_var(runqueues))
 
 #ifdef CONFIG_NUMA_BALANCING
+extern void __sched_setnuma(struct rq *rq, struct task_struct *p, int node, int shared);
 extern void sched_setnuma(struct task_struct *p, int node, int shared);
 static inline void task_numa_free(struct task_struct *p)
 {
-- 
1.7.11.7



* [PATCH 2/9] numa, sched: Add tracking of runnable NUMA tasks
  2012-12-07  0:19 [GIT TREE] Unified NUMA balancing tree, v3 Ingo Molnar
  2012-12-07  0:19 ` [PATCH 1/9] numa, sched: Fix NUMA tick ->numa_shared setting Ingo Molnar
@ 2012-12-07  0:19 ` Ingo Molnar
  2012-12-07  0:19 ` [PATCH 3/9] numa, sched: Implement wake-cpu migration support Ingo Molnar
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: Ingo Molnar @ 2012-12-07  0:19 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

This is mostly taken from:

  sched: Add adaptive NUMA affinity support

  Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
  Date:   Sun Nov 11 15:09:59 2012 +0100

With some robustness changes to make sure we will dequeue
the same state we enqueued.
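
The robustness aspect amounts to snapshotting the accounting-relevant
state at enqueue time and undoing exactly that snapshot at dequeue
time, so a ->numa_shared flip while the task is runnable cannot
unbalance the counters. A minimal standalone sketch of the idea
(the counter and field names are illustrative, not the scheduler's):

#include <assert.h>
#include <stdio.h>

struct rq_stats  { long nr_numa, nr_shared; };
struct task_info {
	int shared;		/* live state, may change at any time   */
	int shared_enqueue;	/* snapshot taken when we were enqueued */
};

static void numa_enqueue(struct rq_stats *rq, struct task_info *t)
{
	t->shared_enqueue = t->shared;		/* take the snapshot */
	if (t->shared_enqueue != -1) {
		rq->nr_numa++;
		rq->nr_shared += t->shared_enqueue;
	}
}

static void numa_dequeue(struct rq_stats *rq, struct task_info *t)
{
	if (t->shared_enqueue != -1) {		/* undo the snapshot */
		rq->nr_numa--;
		rq->nr_shared -= t->shared_enqueue;
	}
	t->shared_enqueue = -1;
}

int main(void)
{
	struct rq_stats rq = { 0, 0 };
	struct task_info t = { .shared = 1, .shared_enqueue = -1 };

	numa_enqueue(&rq, &t);
	t.shared = 0;		/* state flips while runnable ...         */
	numa_dequeue(&rq, &t);	/* ... yet the counters still balance out */

	assert(rq.nr_numa == 0 && rq.nr_shared == 0);
	printf("balanced: nr_numa=%ld nr_shared=%ld\n",
	       rq.nr_numa, rq.nr_shared);
	return 0;
}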

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/init_task.h |  3 ++-
 include/linux/sched.h     |  2 ++
 kernel/sched/core.c       |  6 ++++++
 kernel/sched/fair.c       | 48 +++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h      |  3 +++
 5 files changed, 59 insertions(+), 3 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index ed98982..a5da0fc 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -145,7 +145,8 @@ extern struct task_group root_task_group;
 
 #ifdef CONFIG_NUMA_BALANCING
 # define INIT_TASK_NUMA(tsk)						\
-	.numa_shared = -1,
+	.numa_shared = -1,						\
+	.numa_shared_enqueue = -1
 #else
 # define INIT_TASK_NUMA(tsk)
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6a29dfd..ee39f6b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1504,6 +1504,7 @@ struct task_struct {
 #endif
 #ifdef CONFIG_NUMA_BALANCING
 	int numa_shared;
+	int numa_shared_enqueue;
 	int numa_max_node;
 	int numa_scan_seq;
 	unsigned long numa_scan_ts_secs;
@@ -1511,6 +1512,7 @@ struct task_struct {
 	u64 node_stamp;			/* migration stamp  */
 	unsigned long convergence_strength;
 	int convergence_node;
+	unsigned long numa_weight;
 	unsigned long *numa_faults;
 	unsigned long *numa_faults_curr;
 	struct callback_head numa_scan_work;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cce84c3..a7f0000 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1554,6 +1554,9 @@ static void __sched_fork(struct task_struct *p)
 	}
 
 	p->numa_shared = -1;
+	p->numa_weight = 0;
+	p->numa_shared_enqueue = -1;
+	p->numa_max_node = -1;
 	p->node_stamp = 0ULL;
 	p->convergence_strength		= 0;
 	p->convergence_node		= -1;
@@ -6103,6 +6106,9 @@ void __sched_setnuma(struct rq *rq, struct task_struct *p, int node, int shared)
 	if (running)
 		p->sched_class->put_prev_task(rq, p);
 
+	WARN_ON_ONCE(p->numa_shared_enqueue != -1);
+	WARN_ON_ONCE(p->numa_weight);
+
 	p->numa_shared = shared;
 	p->numa_max_node = node;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0c83689..8cdbfde 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -801,6 +801,45 @@ static unsigned long task_h_load(struct task_struct *p);
 #endif
 
 #ifdef CONFIG_NUMA_BALANCING
+static void account_numa_enqueue(struct rq *rq, struct task_struct *p)
+{
+	int shared = task_numa_shared(p);
+
+	WARN_ON_ONCE(p->numa_weight);
+
+	if (shared != -1) {
+		p->numa_weight = p->se.load.weight;
+		WARN_ON_ONCE(!p->numa_weight);
+		p->numa_shared_enqueue = shared;
+
+		rq->nr_numa_running++;
+		rq->nr_shared_running += shared;
+		rq->nr_ideal_running += (cpu_to_node(task_cpu(p)) == p->numa_max_node);
+		rq->numa_weight += p->numa_weight;
+	} else {
+		if (p->numa_weight) {
+			WARN_ON_ONCE(p->numa_weight);
+			p->numa_weight = 0;
+		}
+	}
+}
+
+static void account_numa_dequeue(struct rq *rq, struct task_struct *p)
+{
+	if (p->numa_shared_enqueue != -1) {
+		rq->nr_numa_running--;
+		rq->nr_shared_running -= p->numa_shared_enqueue;
+		rq->nr_ideal_running -= (cpu_to_node(task_cpu(p)) == p->numa_max_node);
+		rq->numa_weight -= p->numa_weight;
+		p->numa_weight = 0;
+		p->numa_shared_enqueue = -1;
+	} else {
+		if (p->numa_weight) {
+			WARN_ON_ONCE(p->numa_weight);
+			p->numa_weight = 0;
+		}
+	}
+}
 
 /*
  * Scan @scan_size MB every @scan_period after an initial @scan_delay.
@@ -2551,8 +2590,11 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
 #else /* !CONFIG_NUMA_BALANCING: */
 #ifdef CONFIG_SMP
 static inline int task_ideal_cpu(struct task_struct *p)				{ return -1; }
+static inline void account_numa_enqueue(struct rq *rq, struct task_struct *p)	{ }
 #endif
+static inline void account_numa_dequeue(struct rq *rq, struct task_struct *p)	{ }
 static inline void task_tick_numa(struct rq *rq, struct task_struct *curr)	{ }
+static inline void task_numa_migrate(struct task_struct *p, int next_cpu)	{ }
 #endif /* CONFIG_NUMA_BALANCING */
 
 /**************************************************
@@ -2569,6 +2611,7 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	if (entity_is_task(se)) {
 		struct rq *rq = rq_of(cfs_rq);
 
+		account_numa_enqueue(rq, task_of(se));
 		list_add(&se->group_node, &rq->cfs_tasks);
 	}
 #endif /* CONFIG_SMP */
@@ -2581,9 +2624,10 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	update_load_sub(&cfs_rq->load, se->load.weight);
 	if (!parent_entity(se))
 		update_load_sub(&rq_of(cfs_rq)->load, se->load.weight);
-	if (entity_is_task(se))
+	if (entity_is_task(se)) {
 		list_del_init(&se->group_node);
-
+		account_numa_dequeue(rq_of(cfs_rq), task_of(se));
+	}
 	cfs_rq->nr_running--;
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f75bf06..f00eb80 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -438,6 +438,9 @@ struct rq {
 	struct list_head cfs_tasks;
 
 #ifdef CONFIG_NUMA_BALANCING
+	unsigned long numa_weight;
+	unsigned long nr_numa_running;
+	unsigned long nr_ideal_running;
 	struct task_struct *curr_buddy;
 #endif
 	unsigned long nr_shared_running;	/* 0 on non-NUMA */
-- 
1.7.11.7



* [PATCH 3/9] numa, sched: Implement wake-cpu migration support
  2012-12-07  0:19 [GIT TREE] Unified NUMA balancing tree, v3 Ingo Molnar
  2012-12-07  0:19 ` [PATCH 1/9] numa, sched: Fix NUMA tick ->numa_shared setting Ingo Molnar
  2012-12-07  0:19 ` [PATCH 2/9] numa, sched: Add tracking of runnable NUMA tasks Ingo Molnar
@ 2012-12-07  0:19 ` Ingo Molnar
  2012-12-07  0:19 ` [PATCH 4/9] numa, mm, sched: Implement last-CPU+PID hash tracking Ingo Molnar
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: Ingo Molnar @ 2012-12-07  0:19 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

Task flipping via sched_rebalance_to() was only partially successful
in a number of cases, especially with modified preemption options.

The reason was that when we'd migrate over to the target node,
with the right timing the source task might already be sleeping
waiting for the migration thread to run - which prevented it
from changing its target CPU.

But we cannot simply set the CPU in the migration handler, because
our per-entity load average calculations rely on tasks spending
their sleeping time on the CPU they went to sleep on, and only
being requeued at wakeup.

So introduce a ->wake_cpu construct to allow the migration at
wakeup time. This gives us maximum information while still
preserving the task-flipping destination.

( Also make sure we don't wake up on CPUs that are outside
  the hard affinity ->cpus_allowed CPU mask. )
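
Conceptually ->wake_cpu is a one-shot "resume here" hint: the
migration side records the destination for a task that is already
sleeping, and the wakeup side consumes the hint once, subject to the
allowed-CPU mask. A minimal standalone sketch of that hand-off, with
made-up names and a plain bitmask for the affinity mask:

#include <stdio.h>

struct task {
	int on_rq;			/* 1 = runnable, 0 = sleeping */
	int wake_cpu;			/* -1 = no pending hint       */
	unsigned int cpus_allowed;	/* bitmask of permitted CPUs  */
};

/* Migration side: the task is already asleep, so just leave a hint. */
static void migrate_to(struct task *t, int dest_cpu)
{
	if (!t->on_rq)
		t->wake_cpu = dest_cpu;
	/* (a runnable task would be moved right away instead) */
}

/* Wakeup side: consume the hint if it is set and allowed. */
static int select_wake_cpu(struct task *t, int prev_cpu)
{
	int cpu = t->wake_cpu;

	if (cpu != -1 && (t->cpus_allowed & (1u << cpu))) {
		t->wake_cpu = -1;	/* one-shot */
		return cpu;
	}
	return prev_cpu;
}

int main(void)
{
	struct task t = { .on_rq = 0, .wake_cpu = -1, .cpus_allowed = 0x0f };

	migrate_to(&t, 2);		/* CPU 2 is allowed: hint is taken */
	printf("wake on CPU %d\n", select_wake_cpu(&t, 0));

	migrate_to(&t, 6);		/* CPU 6 is outside cpus_allowed   */
	printf("wake on CPU %d\n", select_wake_cpu(&t, 0));
	return 0;
}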

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/init_task.h |  3 ++-
 include/linux/sched.h     |  1 +
 kernel/sched/core.c       |  3 +++
 kernel/sched/fair.c       | 14 +++++++++++---
 4 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index a5da0fc..ec31d7b 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -146,7 +146,8 @@ extern struct task_group root_task_group;
 #ifdef CONFIG_NUMA_BALANCING
 # define INIT_TASK_NUMA(tsk)						\
 	.numa_shared = -1,						\
-	.numa_shared_enqueue = -1
+	.numa_shared_enqueue = -1,					\
+	.wake_cpu = -1,
 #else
 # define INIT_TASK_NUMA(tsk)
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ee39f6b..1c3cc50 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1502,6 +1502,7 @@ struct task_struct {
 	short il_next;
 	short pref_node_fork;
 #endif
+	int wake_cpu;
 #ifdef CONFIG_NUMA_BALANCING
 	int numa_shared;
 	int numa_shared_enqueue;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a7f0000..cfa8426 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1546,6 +1546,7 @@ static void __sched_fork(struct task_struct *p)
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 	INIT_HLIST_HEAD(&p->preempt_notifiers);
 #endif
+	p->wake_cpu = -1;
 
 #ifdef CONFIG_NUMA_BALANCING
 	if (p->mm && atomic_read(&p->mm->mm_users) == 1) {
@@ -4782,6 +4783,8 @@ static int __migrate_task(struct task_struct *p, int src_cpu, int dest_cpu)
 		set_task_cpu(p, dest_cpu);
 		enqueue_task(rq_dest, p, 0);
 		check_preempt_curr(rq_dest, p, 0);
+	} else {
+		p->wake_cpu = dest_cpu;
 	}
 done:
 	ret = 1;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8cdbfde..8664f39 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4963,6 +4963,12 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 	int new_cpu = cpu;
 	int want_affine = 0;
 	int sync = wake_flags & WF_SYNC;
+	int wake_cpu = p->wake_cpu;
+
+	if (wake_cpu != -1 && cpumask_test_cpu(wake_cpu, tsk_cpus_allowed(p))) {
+		p->wake_cpu = -1;
+		return wake_cpu;
+	}
 
 	if (p->nr_cpus_allowed == 1)
 		return prev_cpu;
@@ -5044,10 +5050,12 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 		/* while loop will break here if sd == NULL */
 	}
 unlock:
-	rcu_read_unlock();
+	if (!numa_allow_migration(p, prev0_cpu, new_cpu)) {
+		if (cpumask_test_cpu(prev0_cpu, tsk_cpus_allowed(p)))
+			new_cpu = prev0_cpu;
+	}
 
-	if (!numa_allow_migration(p, prev0_cpu, new_cpu))
-		return prev0_cpu;
+	rcu_read_unlock();
 
 	return new_cpu;
 }
-- 
1.7.11.7



* [PATCH 4/9] numa, mm, sched: Implement last-CPU+PID hash tracking
  2012-12-07  0:19 [GIT TREE] Unified NUMA balancing tree, v3 Ingo Molnar
                   ` (2 preceding siblings ...)
  2012-12-07  0:19 ` [PATCH 3/9] numa, sched: Implement wake-cpu migration support Ingo Molnar
@ 2012-12-07  0:19 ` Ingo Molnar
  2012-12-07  0:19 ` [PATCH 5/9] numa, mm, sched: Fix NUMA affinity tracking logic Ingo Molnar
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: Ingo Molnar @ 2012-12-07  0:19 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

We rely on the page::last_cpu field (embedded in the remaining bits
of the page flags field) to drive NUMA placement: last_cpu tells us
which tasks access memory on which node, and it also tells us
which tasks relate to each other, in that they access the same
pages.

last_cpu was a constant source of statistics skew: if a task
migrated from one CPU to another, it would still see last_cpu
accesses from its previous CPU (the pages it had accessed there),
but would account them as a 'shared' memory access relationship.

So in a sense a phantom task on the previous CPU would haunt our
task, even if this task never interacts with other tasks in
any serious manner.

Another skew of the statistics was that of preemption: if a task
ran on another CPU and accessed our pages but then descheduled,
we'd suspect that CPU - and the next task running on it - of being
in a sharing relationship with us.

To solve these skews and to improve the quality of the statistics,
hash the last 8 bits of the PID into the last_cpu field. We name
this 'cpupid' because it's cheaper to handle it as a single integer
in most places. Wherever code needs to take an actual look at the
embedded last_cpu and last_pid information, it can do so via
simple shifting and masking.

Propagate this all the way through the code and make use of it.
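
For illustration (not part of the patch), the shifting and masking
boils down to packing the CPU number and the low 8 bits of the PID
into one integer. The standalone sketch below mirrors the
cpu_pid_to_cpupid()/cpupid_to_cpu()/cpupid_to_pid() scheme, with an
8-bit CPU field picked just for the example (the kernel derives the
CPU width from NR_CPUS_BITS):

#include <stdio.h>

#define CPUPID_PID_BITS	8
#define CPUPID_PID_MASK	((1 << CPUPID_PID_BITS) - 1)
#define CPUPID_CPU_BITS	8			/* example width only */
#define CPUPID_CPU_MASK	((1 << CPUPID_CPU_BITS) - 1)

static int cpu_pid_to_cpupid(int cpu, int pid)
{
	return ((cpu & CPUPID_CPU_MASK) << CPUPID_PID_BITS) |
		(pid & CPUPID_PID_MASK);
}

static int cpupid_to_cpu(int cpupid)
{
	return (cpupid >> CPUPID_PID_BITS) & CPUPID_CPU_MASK;
}

static int cpupid_to_pid(int cpupid)
{
	return cpupid & CPUPID_PID_MASK;
}

int main(void)
{
	int cpupid = cpu_pid_to_cpupid(13, 4321);

	/* 4321 % 256 == 225: only the low 8 PID bits survive, which is
	 * where the "PIDs equal modulo 256" caveat below comes from. */
	printf("cpu=%d pid8=%d\n",
	       cpupid_to_cpu(cpupid), cpupid_to_pid(cpupid));
	return 0;
}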

As a result of this change the shared/private fault statistics
stabilized and improved markedly: convergence is faster
and less prone to workload noise. 4x JVM runs come to within
2-3% of the theoretical performance maximum:

 Thu Dec  6 16:10:34 CET 2012
 spec1.txt:           throughput =     190191.50 SPECjbb2005 bops
 spec2.txt:           throughput =     194783.63 SPECjbb2005 bops
 spec3.txt:           throughput =     192812.69 SPECjbb2005 bops
 spec4.txt:           throughput =     193898.09 SPECjbb2005 bops
                                      --------------------------
       SUM:           throughput =     771685.91 SPECjbb2005 bops

The cost is 8 more bits used from the page flags - this space
is still available on 64-bit systems when a common distro
config (Fedora) is compiled.

There is the potential of false sharing if the PIDs of two tasks
are equal modulo 256 - this degrades the statistics somewhat but
does not invalidate them completely. Related tasks are typically
launched close to each other, so I don't expect this to be a
problem in practice - if it is, then we can do some better (maybe
wider) PID hashing in the future.

This mechanism is only used on (default-off) CONFIG_NUMA_BALANCING=y
kernels.

Also, while at it, pass the 'migrated' information to the
task_numa_fault() handler consistently.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/mm.h                | 79 ++++++++++++++++++++++++++-------------
 include/linux/mm_types.h          |  4 +-
 include/linux/page-flags-layout.h | 23 ++++++++----
 include/linux/sched.h             |  4 +-
 kernel/sched/fair.c               | 26 +++++++++++--
 mm/huge_memory.c                  | 23 ++++++------
 mm/memory.c                       | 26 +++++++------
 mm/mempolicy.c                    | 23 ++++++++++--
 mm/migrate.c                      |  4 +-
 mm/page_alloc.c                   |  6 ++-
 10 files changed, 148 insertions(+), 70 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a9454ca..c576b43 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -585,7 +585,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 #define SECTIONS_PGOFF		((sizeof(unsigned long)*8) - SECTIONS_WIDTH)
 #define NODES_PGOFF		(SECTIONS_PGOFF - NODES_WIDTH)
 #define ZONES_PGOFF		(NODES_PGOFF - ZONES_WIDTH)
-#define LAST_CPU_PGOFF		(ZONES_PGOFF - LAST_CPU_WIDTH)
+#define LAST_CPUPID_PGOFF	(ZONES_PGOFF - LAST_CPUPID_WIDTH)
 
 /*
  * Define the bit shifts to access each section.  For non-existent
@@ -595,7 +595,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 #define SECTIONS_PGSHIFT	(SECTIONS_PGOFF * (SECTIONS_WIDTH != 0))
 #define NODES_PGSHIFT		(NODES_PGOFF * (NODES_WIDTH != 0))
 #define ZONES_PGSHIFT		(ZONES_PGOFF * (ZONES_WIDTH != 0))
-#define LAST_CPU_PGSHIFT	(LAST_CPU_PGOFF * (LAST_CPU_WIDTH != 0))
+#define LAST_CPUPID_PGSHIFT	(LAST_CPUPID_PGOFF * (LAST_CPUPID_WIDTH != 0))
 
 /* NODE:ZONE or SECTION:ZONE is used to ID a zone for the buddy allocator */
 #ifdef NODE_NOT_IN_PAGE_FLAGS
@@ -617,7 +617,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 #define ZONES_MASK		((1UL << ZONES_WIDTH) - 1)
 #define NODES_MASK		((1UL << NODES_WIDTH) - 1)
 #define SECTIONS_MASK		((1UL << SECTIONS_WIDTH) - 1)
-#define LAST_CPU_MASK		((1UL << LAST_CPU_WIDTH) - 1)
+#define LAST_CPUPID_MASK	((1UL << LAST_CPUPID_WIDTH) - 1)
 #define ZONEID_MASK		((1UL << ZONEID_SHIFT) - 1)
 
 static inline enum zone_type page_zonenum(const struct page *page)
@@ -657,64 +657,93 @@ static inline int page_to_nid(const struct page *page)
 #endif
 
 #ifdef CONFIG_NUMA_BALANCING
-#ifdef LAST_CPU_NOT_IN_PAGE_FLAGS
-static inline int page_xchg_last_cpu(struct page *page, int cpu)
+
+static inline int cpupid_to_cpu(int cpupid)
+{
+	return (cpupid >> CPUPID_PID_BITS) & CPUPID_CPU_MASK;
+}
+
+static inline int cpupid_to_pid(int cpupid)
+{
+	return cpupid & CPUPID_PID_MASK;
+}
+
+static inline int cpu_pid_to_cpupid(int cpu, int pid)
+{
+	return ((cpu & CPUPID_CPU_MASK) << CPUPID_CPU_BITS) | (pid & CPUPID_PID_MASK);
+}
+
+#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
+static inline int page_xchg_last_cpupid(struct page *page, int cpupid)
 {
-	return xchg(&page->_last_cpu, cpu);
+	return xchg(&page->_last_cpupid, cpupid);
 }
 
-static inline int page_last_cpu(struct page *page)
+static inline int page_last__cpupid(struct page *page)
 {
-	return page->_last_cpu;
+	return page->_last_cpupid;
 }
 
-static inline void reset_page_last_cpu(struct page *page)
+static inline void reset_page_last_cpupid(struct page *page)
 {
-	page->_last_cpu = -1;
+	page->_last_cpupid = -1;
 }
+
 #else
-static inline int page_xchg_last_cpu(struct page *page, int cpu)
+static inline int page_xchg_last_cpupid(struct page *page, int cpupid)
 {
 	unsigned long old_flags, flags;
-	int last_cpu;
+	int last_cpupid;
 
 	do {
 		old_flags = flags = page->flags;
-		last_cpu = (flags >> LAST_CPU_PGSHIFT) & LAST_CPU_MASK;
+		last_cpupid = (flags >> LAST_CPUPID_PGSHIFT) & LAST_CPUPID_MASK;
+
+		flags &= ~(LAST_CPUPID_MASK << LAST_CPUPID_PGSHIFT);
+		flags |= (cpupid & LAST_CPUPID_MASK) << LAST_CPUPID_PGSHIFT;
 
-		flags &= ~(LAST_CPU_MASK << LAST_CPU_PGSHIFT);
-		flags |= (cpu & LAST_CPU_MASK) << LAST_CPU_PGSHIFT;
 	} while (unlikely(cmpxchg(&page->flags, old_flags, flags) != old_flags));
 
-	return last_cpu;
+	return last_cpupid;
+}
+
+static inline int page_last__cpupid(struct page *page)
+{
+	return (page->flags >> LAST_CPUPID_PGSHIFT) & LAST_CPUPID_MASK;
+}
+
+static inline void reset_page_last_cpupid(struct page *page)
+{
+	page_xchg_last_cpupid(page, -1);
 }
+#endif /* LAST_CPUPID_NOT_IN_PAGE_FLAGS */
 
-static inline int page_last_cpu(struct page *page)
+static inline int page_last__cpu(struct page *page)
 {
-	return (page->flags >> LAST_CPU_PGSHIFT) & LAST_CPU_MASK;
+	return cpupid_to_cpu(page_last__cpupid(page));
 }
 
-static inline void reset_page_last_cpu(struct page *page)
+static inline int page_last__pid(struct page *page)
 {
+	return cpupid_to_pid(page_last__cpupid(page));
 }
 
-#endif /* LAST_CPU_NOT_IN_PAGE_FLAGS */
-#else /* CONFIG_NUMA_BALANCING */
-static inline int page_xchg_last_cpu(struct page *page, int cpu)
+#else /* !CONFIG_NUMA_BALANCING: */
+static inline int page_xchg_last_cpupid(struct page *page, int cpu)
 {
 	return page_to_nid(page);
 }
 
-static inline int page_last_cpu(struct page *page)
+static inline int page_last__cpupid(struct page *page)
 {
 	return page_to_nid(page);
 }
 
-static inline void reset_page_last_cpu(struct page *page)
+static inline void reset_page_last_cpupid(struct page *page)
 {
 }
 
-#endif /* CONFIG_NUMA_BALANCING */
+#endif /* !CONFIG_NUMA_BALANCING */
 
 static inline struct zone *page_zone(const struct page *page)
 {
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index cd2be76..ba08f34 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -178,8 +178,8 @@ struct page {
 	void *shadow;
 #endif
 
-#ifdef LAST_CPU_NOT_IN_PAGE_FLAGS
-	int _last_cpu;
+#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
+	int _last_cpupid;
 #endif
 }
 /*
diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h
index b258132..9435d64 100644
--- a/include/linux/page-flags-layout.h
+++ b/include/linux/page-flags-layout.h
@@ -56,16 +56,23 @@
 #define NODES_WIDTH		0
 #endif
 
+/* Reduce false sharing: */
+#define CPUPID_PID_BITS		8
+#define CPUPID_PID_MASK		((1 << CPUPID_PID_BITS)-1)
+
+#define CPUPID_CPU_BITS		NR_CPUS_BITS
+#define CPUPID_CPU_MASK		((1 << CPUPID_CPU_BITS)-1)
+
 #ifdef CONFIG_NUMA_BALANCING
-#define LAST_CPU_SHIFT	NR_CPUS_BITS
+# define LAST_CPUPID_SHIFT	(CPUPID_CPU_BITS+CPUPID_PID_BITS)
 #else
-#define LAST_CPU_SHIFT	0
+# define LAST_CPUPID_SHIFT	0
 #endif
 
-#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_CPU_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
-#define LAST_CPU_WIDTH	LAST_CPU_SHIFT
+#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_CPUPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
+# define LAST_CPUPID_WIDTH	LAST_CPUPID_SHIFT
 #else
-#define LAST_CPU_WIDTH	0
+# define LAST_CPUPID_WIDTH	0
 #endif
 
 /*
@@ -73,11 +80,11 @@
  * there.  This includes the case where there is no node, so it is implicit.
  */
 #if !(NODES_WIDTH > 0 || NODES_SHIFT == 0)
-#define NODE_NOT_IN_PAGE_FLAGS
+# define NODE_NOT_IN_PAGE_FLAGS
 #endif
 
-#if defined(CONFIG_NUMA_BALANCING) && LAST_CPU_WIDTH == 0
-#define LAST_CPU_NOT_IN_PAGE_FLAGS
+#if defined(CONFIG_NUMA_BALANCING) && LAST_CPUPID_WIDTH == 0
+# define LAST_CPUPID_NOT_IN_PAGE_FLAGS
 #endif
 
 #endif /* _LINUX_PAGE_FLAGS_LAYOUT */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1c3cc50..1041c0d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1601,9 +1601,9 @@ struct task_struct {
 #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
 
 #ifdef CONFIG_NUMA_BALANCING
-extern void task_numa_fault(int node, int cpu, int pages);
+extern void task_numa_fault(unsigned long addr, int node, int cpupid, int pages, bool migrated);
 #else
-static inline void task_numa_fault(int node, int cpu, int pages) { }
+static inline void task_numa_fault(unsigned long addr, int node, int cpupid, int pages, bool migrated) { }
 #endif /* CONFIG_NUMA_BALANCING */
 
 /*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8664f39..1547d66 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2267,11 +2267,31 @@ out_hit:
 /*
  * Got a PROT_NONE fault for a page on @node.
  */
-void task_numa_fault(int node, int last_cpu, int pages)
+void task_numa_fault(unsigned long addr, int node, int last_cpupid, int pages, bool migrated)
 {
 	struct task_struct *p = current;
-	int priv = (task_cpu(p) == last_cpu);
-	int idx = 2*node + priv;
+	int this_cpu = raw_smp_processor_id();
+	int last_cpu = cpupid_to_cpu(last_cpupid);
+	int last_pid = cpupid_to_pid(last_cpupid);
+	int this_pid = current->pid & CPUPID_PID_MASK;
+	int priv;
+	int idx;
+
+	if (last_cpupid != cpu_pid_to_cpupid(-1, -1)) {
+		/* Did we access it last time around? */
+		if (last_pid == this_pid) {
+			priv = 1;
+		} else {
+			priv = 0;
+		}
+	} else {
+		/* The default for fresh pages is private: */
+		priv = 1;
+		last_cpu = this_cpu;
+		node = cpu_to_node(this_cpu);
+	}
+
+	idx = 2*node + priv;
 
 	WARN_ON_ONCE(last_cpu == -1 || node == -1);
 	BUG_ON(!p->numa_faults);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 53e08a2..e6820aa 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1024,10 +1024,10 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	struct page *page;
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
-	int last_cpu;
+	int last_cpupid;
 	int target_nid;
-	int current_nid = -1;
-	bool migrated;
+	int page_nid = -1;
+	bool migrated = false;
 	bool page_locked = false;
 
 	spin_lock(&mm->page_table_lock);
@@ -1036,10 +1036,11 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	page = pmd_page(pmd);
 	get_page(page);
-	current_nid = page_to_nid(page);
-	last_cpu = page_last_cpu(page);
+	page_nid = page_to_nid(page);
+	last_cpupid = page_last__cpupid(page);
+
 	count_vm_numa_event(NUMA_HINT_FAULTS);
-	if (current_nid == numa_node_id())
+	if (page_nid == numa_node_id())
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
 
 	target_nid = mpol_misplaced(page, vma, haddr);
@@ -1067,7 +1068,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 				pmdp, pmd, addr,
 				page, target_nid);
 	if (migrated)
-		current_nid = target_nid;
+		page_nid = target_nid;
 	else {
 		spin_lock(&mm->page_table_lock);
 		if (unlikely(!pmd_same(pmd, *pmdp))) {
@@ -1077,7 +1078,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto clear_pmdnuma;
 	}
 
-	task_numa_fault(current_nid, last_cpu, HPAGE_PMD_NR);
+	task_numa_fault(addr, page_nid, last_cpupid, HPAGE_PMD_NR, migrated);
 	return 0;
 
 clear_pmdnuma:
@@ -1090,8 +1091,8 @@ clear_pmdnuma:
 
 out_unlock:
 	spin_unlock(&mm->page_table_lock);
-	if (current_nid != -1)
-		task_numa_fault(current_nid, last_cpu, HPAGE_PMD_NR);
+	if (page_nid != -1)
+		task_numa_fault(addr, page_nid, last_cpupid, HPAGE_PMD_NR, migrated);
 	return 0;
 }
 
@@ -1384,7 +1385,7 @@ static void __split_huge_page_refcount(struct page *page)
 		page_tail->mapping = page->mapping;
 
 		page_tail->index = page->index + i;
-		page_xchg_last_cpu(page_tail, page_last_cpu(page));
+		page_xchg_last_cpupid(page_tail, page_last__cpupid(page));
 
 		BUG_ON(!PageAnon(page_tail));
 		BUG_ON(!PageUptodate(page_tail));
diff --git a/mm/memory.c b/mm/memory.c
index cca216e..6ebfbbe 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -70,8 +70,8 @@
 
 #include "internal.h"
 
-#ifdef LAST_CPU_NOT_IN_PAGE_FLAGS
-#warning Unfortunate NUMA config, growing page-frame for last_cpu.
+#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
+# warning Large NUMA config, growing page-frame for last_cpu+pid.
 #endif
 
 #ifndef CONFIG_NEED_MULTIPLE_NODES
@@ -3472,7 +3472,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	bool migrated = false;
 	spinlock_t *ptl;
 	int target_nid;
-	int last_cpu;
+	int last_cpupid;
 	int page_nid;
 
 	/*
@@ -3505,14 +3505,14 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	WARN_ON_ONCE(page_nid == -1);
 
 	/* Get it before mpol_misplaced() flips it: */
-	last_cpu = page_last_cpu(page);
-	WARN_ON_ONCE(last_cpu == -1);
+	last_cpupid = page_last__cpupid(page);
 
 	target_nid = numa_migration_target(page, vma, addr, page_nid);
 	if (target_nid == -1) {
 		pte_unmap_unlock(ptep, ptl);
 		goto out;
 	}
+	WARN_ON_ONCE(target_nid == page_nid);
 
 	/* Get a reference for migration: */
 	get_page(page);
@@ -3524,7 +3524,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		page_nid = target_nid;
 out:
 	/* Always account where the page currently is, physically: */
-	task_numa_fault(page_nid, last_cpu, 1);
+	task_numa_fault(addr, page_nid, last_cpupid, 1, migrated);
 
 	return 0;
 }
@@ -3562,7 +3562,7 @@ int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		struct page *page;
 		int page_nid;
 		int target_nid;
-		int last_cpu;
+		int last_cpupid;
 		bool migrated;
 		pte_t pteval;
 
@@ -3592,12 +3592,16 @@ int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		page_nid = page_to_nid(page);
 		WARN_ON_ONCE(page_nid == -1);
 
-		last_cpu = page_last_cpu(page);
-		WARN_ON_ONCE(last_cpu == -1);
+		last_cpupid = page_last__cpupid(page);
 
 		target_nid = numa_migration_target(page, vma, addr, page_nid);
-		if (target_nid == -1)
+		if (target_nid == -1) {
+			/* Always account where the page currently is, physically: */
+			task_numa_fault(addr, page_nid, last_cpupid, 1, 0);
+
 			continue;
+		}
+		WARN_ON_ONCE(target_nid == page_nid);
 
 		/* Get a reference for the migration: */
 		get_page(page);
@@ -3609,7 +3613,7 @@ int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			page_nid = target_nid;
 
 		/* Always account where the page currently is, physically: */
-		task_numa_fault(page_nid, last_cpu, 1);
+		task_numa_fault(addr, page_nid, last_cpupid, 1, migrated);
 
 		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 42da0f2..2f2095c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2338,6 +2338,10 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 	struct zone *zone;
 	int page_nid = page_to_nid(page);
 	int target_node = page_nid;
+#ifdef CONFIG_NUMA_BALANCING
+	int cpupid_last_access = -1;
+	int cpu_last_access = -1;
+#endif
 
 	BUG_ON(!vma);
 
@@ -2394,15 +2398,18 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		BUG();
 	}
 
+#ifdef CONFIG_NUMA_BALANCING
 	/* Migrate the page towards the node whose CPU is referencing it */
 	if (pol->flags & MPOL_F_MORON) {
-		int cpu_last_access;
+		int this_cpupid;
 		int this_cpu;
 		int this_node;
 
 		this_cpu = raw_smp_processor_id();
 		this_node = numa_node_id();
 
+		this_cpupid = cpu_pid_to_cpupid(this_cpu, current->pid);
+
 		/*
 		 * Multi-stage node selection is used in conjunction
 		 * with a periodic migration fault to build a temporal
@@ -2424,12 +2431,20 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		 * it less likely we act on an unlikely task<->page
 		 * relation.
 		 */
-		cpu_last_access = page_xchg_last_cpu(page, this_cpu);
+		cpupid_last_access = page_xchg_last_cpupid(page, this_cpupid);
 
-		/* Migrate towards us: */
-		if (cpu_last_access == this_cpu)
+		/* Freshly allocated pages not accessed by anyone else yet: */
+		if (cpupid_last_access == cpu_pid_to_cpupid(-1, -1)) {
+			cpu_last_access = this_cpu;
 			target_node = this_node;
+		} else {
+			cpu_last_access = cpupid_to_cpu(cpupid_last_access);
+			/* Migrate towards us in the default policy: */
+			if (cpu_last_access == this_cpu)
+				target_node = this_node;
+		}
 	}
+#endif
 out_keep_page:
 	mpol_cond_put(pol);
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 14202e7..9562fa8 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1458,7 +1458,7 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
 					  __GFP_NOWARN) &
 					 ~GFP_IOFS, 0);
 	if (newpage)
-		page_xchg_last_cpu(newpage, page_last_cpu(page));
+		page_xchg_last_cpupid(newpage, page_last__cpupid(page));
 
 	return newpage;
 }
@@ -1567,7 +1567,7 @@ int migrate_misplaced_transhuge_page_put(struct mm_struct *mm,
 		count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
 		goto out_dropref;
 	}
-	page_xchg_last_cpu(new_page, page_last_cpu(page));
+	page_xchg_last_cpupid(new_page, page_last__cpupid(page));
 
 	isolated = numamigrate_isolate_page(pgdat, page);
 	if (!isolated) {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 92e88bd..6d72372 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -608,7 +608,7 @@ static inline int free_pages_check(struct page *page)
 		bad_page(page);
 		return 1;
 	}
-	reset_page_last_cpu(page);
+	reset_page_last_cpupid(page);
 	if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
 		page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
 	return 0;
@@ -850,6 +850,8 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
 	arch_alloc_page(page, order);
 	kernel_map_pages(page, 1 << order, 1);
 
+	reset_page_last_cpupid(page);
+
 	if (gfp_flags & __GFP_ZERO)
 		prep_zero_page(page, order, gfp_flags);
 
@@ -3827,7 +3829,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 		mminit_verify_page_links(page, zone, nid, pfn);
 		init_page_count(page);
 		reset_page_mapcount(page);
-		reset_page_last_cpu(page);
+		reset_page_last_cpupid(page);
 		SetPageReserved(page);
 		/*
 		 * Mark the block movable so that blocks are reserved for
-- 
1.7.11.7



* [PATCH 5/9] numa, mm, sched: Fix NUMA affinity tracking logic
  2012-12-07  0:19 [GIT TREE] Unified NUMA balancing tree, v3 Ingo Molnar
                   ` (3 preceding siblings ...)
  2012-12-07  0:19 ` [PATCH 4/9] numa, mm, sched: Implement last-CPU+PID hash tracking Ingo Molnar
@ 2012-12-07  0:19 ` Ingo Molnar
  2012-12-07  0:19 ` [PATCH 6/9] numa, mm: Fix !THP, 4K-pte "2M-emu" NUMA fault handling Ingo Molnar
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: Ingo Molnar @ 2012-12-07  0:19 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

Support for the p->numa_policy affinity tracking by the scheduler
went missing during the mm/ unification: revive and integrate it
properly.

( This in particular fixes NUMA_POLICY_MANYBUDDIES - the bug
  caused a few regressions in various workloads such as
  numa01, and regressed !THP workloads in particular. )
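
A rough standalone sketch of the check this restores: use the task's
NUMA policy only when it actually spans at least two nodes, otherwise
fall back to the local default policy. The types and names below are
illustrative stand-ins for the mm/mempolicy.c ones, with a popcount
mimicking nodes_weight():

#include <stdio.h>

struct fake_policy { unsigned long nodes; };	/* bit N = node N allowed */

static int nodes_weight(unsigned long mask)
{
	return __builtin_popcountl(mask);	/* GCC/clang builtin */
}

/* Return the task policy only when it is a real multi-node policy. */
static const struct fake_policy *
pick_policy(const struct fake_policy *task_pol,
	    const struct fake_policy *local_pol, int task_is_shared)
{
	if (task_is_shared && nodes_weight(task_pol->nodes) >= 2)
		return task_pol;
	return local_pol;
}

int main(void)
{
	struct fake_policy task_pol  = { .nodes = 0x3 };  /* nodes 0 and 1 */
	struct fake_policy local_pol = { .nodes = 0x1 };  /* local node    */

	printf("task policy spans %d node(s) -> use %s policy\n",
	       nodes_weight(task_pol.nodes),
	       pick_policy(&task_pol, &local_pol, 1) == &task_pol ?
	       "the task's" : "the local default");
	return 0;
}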

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 mm/mempolicy.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 2f2095c..6bb9fd0 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -121,8 +121,10 @@ static struct mempolicy default_policy_local = {
 static struct mempolicy *default_policy(void)
 {
 #ifdef CONFIG_NUMA_BALANCING
-	if (task_numa_shared(current) == 1)
-		return &current->numa_policy;
+	struct mempolicy *pol = &current->numa_policy;
+
+	if (task_numa_shared(current) == 1 && nodes_weight(pol->v.nodes) >= 2)
+		return pol;
 #endif
 	return &default_policy_local;
 }
@@ -135,6 +137,11 @@ static struct mempolicy *get_task_policy(struct task_struct *p)
 	int node;
 
 	if (!pol) {
+#ifdef CONFIG_NUMA_BALANCING
+		pol = default_policy();
+		if (pol != &default_policy_local)
+			return pol;
+#endif
 		node = numa_node_id();
 		if (node != -1)
 			pol = &preferred_node_policy[node];
@@ -2367,7 +2374,8 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 			shift = PAGE_SHIFT;
 
 		target_node = interleave_nid(pol, vma, addr, shift);
-		break;
+
+		goto out_keep_page;
 	}
 
 	case MPOL_PREFERRED:
-- 
1.7.11.7



* [PATCH 6/9] numa, mm: Fix !THP, 4K-pte "2M-emu" NUMA fault handling
  2012-12-07  0:19 [GIT TREE] Unified NUMA balancing tree, v3 Ingo Molnar
                   ` (4 preceding siblings ...)
  2012-12-07  0:19 ` [PATCH 5/9] numa, mm, sched: Fix NUMA affinity tracking logic Ingo Molnar
@ 2012-12-07  0:19 ` Ingo Molnar
  2012-12-07  0:19 ` [PATCH 7/9] numa, sched: Improve staggered convergence Ingo Molnar
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: Ingo Molnar @ 2012-12-07  0:19 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

The !THP pte_numa code from the unified tree is not working very well
for me: I suspect it would work better with migration bandwidth throttling
in place, but without that (and in the form of my port to the unified tree)
it performs badly in a number of situations:

 - when for whatever reason the numa_pmd entry is not established
   yet and threads are hitting the 4K ptes then the pte lock can
   kill performance quickly:

    19.29%        process 1  [kernel.kallsyms]          [k] do_raw_spin_lock
                  |
                  --- do_raw_spin_lock
                     |
                     |--99.67%-- _raw_spin_lock
                     |          |
                     |          |--34.47%-- remove_migration_pte
                     |          |          rmap_walk
                     |          |          move_to_new_page
                     |          |          migrate_pages
                     |          |          migrate_misplaced_page_put
                     |          |          __do_numa_page.isra.56
                     |          |          handle_pte_fault
                     |          |          handle_mm_fault
                     |          |          __do_page_fault
                     |          |          do_page_fault
                     |          |          page_fault
                     |          |          __memset_sse2
                     |          |
                     |          |--34.32%-- __page_check_address
                     |          |          try_to_unmap_one
                     |          |          try_to_unmap_anon
                     |          |          try_to_unmap
                     |          |          migrate_pages
                     |          |          migrate_misplaced_page_put
                     |          |          __do_numa_page.isra.56
                     |          |          handle_pte_fault
                     |          |          handle_mm_fault
                     |          |          __do_page_fault
                     |          |          do_page_fault
                     |          |          page_fault
                     |          |          __memset_sse2
                     |          |
    [...]

 - even if the pmd entry is established we'd hit ptes in a loop while
   other CPUs do it too, seeing the migration ptes as they are being
   established and torn down - resulting in up to 1 million page faults
   per second on my test-system. Not a happy sight and you really don't
   want me to cite that profile here.

So import the 2M-EMU handling code from the v17 numa/core tree, which
was working reasonably well, and add a few other goodies as well:

 - let the first page of an emulated large page determine the target
   node - and also pass down the expected interleaving shift to
   mpol_misplaced(), for overload situations where one group of threads
   spans multiple nodes.

 - turn off the pmd clustering in change_protection() - because the
   2M-emu code works better at the moment. We can re-establish it if
   it's enhanced. I kept both variants for the time being, feedback
   is welcome on this issue.

 - instead of calling mpol_misplaced() 512 times per emulated hugepage,
   extract the cpupid operation from it. Results in measurably lower
   CPU overhead for this functionality.

4K-intensive workloads are immediately much happier: 3-5K pagefaults/sec
on my 32-way test-box and far fewer migrations all around.
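
As a rough illustration of the 2M-emu batching described above
(independent of the real fault path): round the faulting address down
to the 2MB pmd boundary, let the first 4K page of that range pick the
target node, then walk the remaining pages of the range in the same
pass so they all follow one placement decision. The helpers and the
small fake address space below are made up for the sketch:

#include <stdio.h>

#define PAGE_SHIFT	12
#define PMD_SHIFT	21
#define PAGES_PER_PMD	(1 << (PMD_SHIFT - PAGE_SHIFT))	/* 512 */
#define FAKE_PAGES	(4 * PAGES_PER_PMD)	/* fake 8MB address space */

static int page_node[FAKE_PAGES];		/* fake per-page node info */

/* Stand-in for the mpol_misplaced(page0, vma, addr, PMD_SHIFT) call. */
static int pick_target_node(unsigned long first_pfn)
{
	return (int)(first_pfn / PAGES_PER_PMD) % 2;	/* arbitrary */
}

static void numa_fault_2m_emu(unsigned long fault_addr)
{
	unsigned long first_pfn =
		(fault_addr >> PMD_SHIFT) << (PMD_SHIFT - PAGE_SHIFT);
	int target_nid = pick_target_node(first_pfn);
	unsigned long pfn;

	/* One pass over all 512 "ptes" of this pmd range: */
	for (pfn = first_pfn; pfn < first_pfn + PAGES_PER_PMD; pfn++)
		page_node[pfn] = target_nid;
}

int main(void)
{
	/* Two faults inside the same 2MB range, one placement decision: */
	numa_fault_2m_emu(0x200000UL + 0x03000);
	numa_fault_2m_emu(0x200000UL + 0x7f000);

	printf("node of the 2MB range at 0x200000: %d\n",
	       page_node[0x200000UL >> PAGE_SHIFT]);
	return 0;
}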

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/mempolicy.h |   4 +-
 mm/huge_memory.c          |   2 +-
 mm/memory.c               | 153 +++++++++++++++++++++++++++++++++++-----------
 mm/mempolicy.c            |  13 +---
 mm/mprotect.c             |   4 +-
 5 files changed, 127 insertions(+), 49 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index f44b7f3..8bb6ab5 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -161,7 +161,7 @@ static inline int vma_migratable(struct vm_area_struct *vma)
 	return 1;
 }
 
-extern int mpol_misplaced(struct page *, struct vm_area_struct *, unsigned long);
+extern int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long addr, int shift);
 
 #else
 
@@ -289,7 +289,7 @@ static inline int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol,
 }
 
 static inline int mpol_misplaced(struct page *page, struct vm_area_struct *vma,
-				 unsigned long address)
+				 unsigned long address, int shift)
 {
 	return -1; /* no node preference */
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e6820aa..7c82f28 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1043,7 +1043,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (page_nid == numa_node_id())
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
 
-	target_nid = mpol_misplaced(page, vma, haddr);
+	target_nid = mpol_misplaced(page, vma, haddr, HPAGE_SHIFT);
 	if (target_nid == -1) {
 		put_page(page);
 		goto clear_pmdnuma;
diff --git a/mm/memory.c b/mm/memory.c
index 6ebfbbe..fc0026e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3455,6 +3455,7 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
 }
 
+#ifdef CONFIG_NUMA_BALANCING
 static int numa_migration_target(struct page *page, struct vm_area_struct *vma,
 			       unsigned long addr, int page_nid)
 {
@@ -3462,57 +3463,50 @@ static int numa_migration_target(struct page *page, struct vm_area_struct *vma,
 	if (page_nid == numa_node_id())
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
 
-	return mpol_misplaced(page, vma, addr);
+	return mpol_misplaced(page, vma, addr, PAGE_SHIFT);
 }
 
-int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
-		   unsigned long addr, pte_t pte, pte_t *ptep, pmd_t *pmd)
+static int __do_numa_page(int target_nid, struct mm_struct *mm, struct vm_area_struct *vma,
+			unsigned long addr, pte_t *ptep, pmd_t *pmd,
+			unsigned int flags, pte_t pte, spinlock_t *ptl)
 {
 	struct page *page = NULL;
 	bool migrated = false;
-	spinlock_t *ptl;
-	int target_nid;
 	int last_cpupid;
 	int page_nid;
 
-	/*
-	* The "pte" at this point cannot be used safely without
-	* validation through pte_unmap_same(). It's of NUMA type but
-	* the pfn may be screwed if the read is non atomic.
-	*
-	* ptep_modify_prot_start is not called as this is clearing
-	* the _PAGE_NUMA bit and it is not really expected that there
-	* would be concurrent hardware modifications to the PTE.
-	*/
-	ptl = pte_lockptr(mm, pmd);
-	spin_lock(ptl);
-	if (unlikely(!pte_same(*ptep, pte))) {
-		pte_unmap_unlock(ptep, ptl);
-		return 0;
-	}
-
+	/* Mark it non-NUMA first: */
 	pte = pte_mknonnuma(pte);
 	set_pte_at(mm, addr, ptep, pte);
 	update_mmu_cache(vma, addr, ptep);
 
 	page = vm_normal_page(vma, addr, pte);
-	if (!page) {
-		pte_unmap_unlock(ptep, ptl);
+	if (!page)
 		return 0;
-	}
 
 	page_nid = page_to_nid(page);
 	WARN_ON_ONCE(page_nid == -1);
 
-	/* Get it before mpol_misplaced() flips it: */
-	last_cpupid = page_last__cpupid(page);
+	/*
+	 * Propagate the last_cpupid access info, even though
+	 * the target_nid has already been established for
+	 * this NID range:
+	 */
+	{
+		int this_cpupid;
+		int this_cpu;
+		int this_node;
+
+		this_cpu = raw_smp_processor_id();
+		this_node = numa_node_id();
 
-	target_nid = numa_migration_target(page, vma, addr, page_nid);
-	if (target_nid == -1) {
-		pte_unmap_unlock(ptep, ptl);
-		goto out;
+		this_cpupid = cpu_pid_to_cpupid(this_cpu, current->pid);
+
+		last_cpupid = page_xchg_last_cpupid(page, this_cpupid);
 	}
-	WARN_ON_ONCE(target_nid == page_nid);
+
+	if (target_nid == -1 || target_nid == page_nid)
+		goto out;
 
 	/* Get a reference for migration: */
 	get_page(page);
@@ -3522,6 +3516,8 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	migrated = migrate_misplaced_page_put(page, target_nid); /* Drops the reference */
 	if (migrated)
 		page_nid = target_nid;
+
+	spin_lock(ptl);
 out:
 	/* Always account where the page currently is, physically: */
 	task_numa_fault(addr, page_nid, last_cpupid, 1, migrated);
@@ -3529,9 +3525,81 @@ out:
 	return 0;
 }
 
+/*
+ * Also fault over nearby ptes from within the same pmd and vma,
+ * in order to minimize the overhead from page fault exceptions:
+ */
+static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+			unsigned long addr0, pte_t *ptep0, pmd_t *pmd,
+			unsigned int flags, pte_t entry0)
+{
+	unsigned long addr0_pmd;
+	unsigned long addr_start;
+	unsigned long addr;
+	struct page *page0;
+	spinlock_t *ptl;
+	pte_t *ptep_start;
+	pte_t *ptep;
+	pte_t entry;
+	int target_nid;
+
+	WARN_ON_ONCE(addr0 < vma->vm_start || addr0 >= vma->vm_end);
+
+	addr0_pmd = addr0 & PMD_MASK;
+	addr_start = max(addr0_pmd, vma->vm_start);
+
+	ptep_start = pte_offset_map(pmd, addr_start);
+	ptl = pte_lockptr(mm, pmd);
+	spin_lock(ptl);
+
+	ptep = ptep_start+1;
+
+	/*
+	 * The first page of the range represents the NUMA
+	 * placement of the range. This way we get consistent
+	 * placement even if the faults themselves might hit
+	 * this area at different offsets:
+	 */
+	target_nid = -1;
+	entry = ACCESS_ONCE(*ptep_start);
+	if (pte_present(entry)) {
+		page0 = vm_normal_page(vma, addr_start, entry);
+		if (page0) {
+			target_nid = mpol_misplaced(page0, vma, addr_start, PMD_SHIFT);
+			if (target_nid == -1)
+				target_nid = page_to_nid(page0);
+		}
+		if (WARN_ON_ONCE(target_nid == -1))
+			target_nid = numa_node_id();
+	}
+
+	for (addr = addr_start+PAGE_SIZE; addr < vma->vm_end; addr += PAGE_SIZE, ptep++) {
+
+		if ((addr & PMD_MASK) != addr0_pmd)
+			break;
+
+		entry = ACCESS_ONCE(*ptep);
+
+		if (!pte_present(entry))
+			continue;
+		if (!pte_numa(entry))
+			continue;
+
+		__do_numa_page(target_nid, mm, vma, addr, ptep, pmd, flags, entry, ptl);
+	}
+
+	entry = ACCESS_ONCE(*ptep_start);
+	if (pte_present(entry) && pte_numa(entry))
+		__do_numa_page(target_nid, mm, vma, addr_start, ptep_start, pmd, flags, entry, ptl);
+
+	pte_unmap_unlock(ptep_start, ptl);
+
+	return 0;
+}
+
 /* NUMA hinting page fault entry point for regular pmds */
-int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
-		     unsigned long addr, pmd_t *pmdp)
+static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+			    unsigned long addr, pmd_t *pmdp)
 {
 	pmd_t pmd;
 	pte_t *pte, *orig_pte;
@@ -3558,6 +3626,7 @@ int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	VM_BUG_ON(offset >= PMD_SIZE);
 	orig_pte = pte = pte_offset_map_lock(mm, pmdp, _addr, &ptl);
 	pte += offset >> PAGE_SHIFT;
+
 	for (addr = _addr + offset; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {
 		struct page *page;
 		int page_nid;
@@ -3581,6 +3650,9 @@ int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (pte_numa(pteval)) {
 			pteval = pte_mknonnuma(pteval);
 			set_pte_at(mm, addr, pte, pteval);
+		} else {
+			/* Should not happen */
+			WARN_ON_ONCE(1);
 		}
 		page = vm_normal_page(vma, addr, pteval);
 		if (unlikely(!page))
@@ -3621,6 +3693,19 @@ int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	return 0;
 }
+#else
+static inline int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+			unsigned long addr0, pte_t *ptep0, pmd_t *pmd,
+			unsigned int flags, pte_t entry0)
+{
+	return 0;
+}
+static inline int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+		     unsigned long addr, pmd_t *pmdp)
+{
+	return 0;
+}
+#endif
 
 /*
  * These routines also need to handle stuff like marking pages dirty
@@ -3661,7 +3746,7 @@ int handle_pte_fault(struct mm_struct *mm,
 	}
 
 	if (pte_numa(entry))
-		return do_numa_page(mm, vma, address, entry, pte, pmd);
+		return do_numa_page(mm, vma, address, pte, pmd, flags, entry);
 
 	ptl = pte_lockptr(mm, pmd);
 	spin_lock(ptl);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 6bb9fd0..128e2e7 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2339,7 +2339,7 @@ static void sp_free(struct sp_node *n)
  * Policy determination "mimics" alloc_page_vma().
  * Called from fault path where we know the vma and faulting address.
  */
-int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long addr)
+int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long addr, int shift)
 {
 	struct mempolicy *pol;
 	struct zone *zone;
@@ -2353,6 +2353,7 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 	BUG_ON(!vma);
 
 	pol = get_vma_policy(current, vma, addr);
+
 	if (!(pol->flags & MPOL_F_MOF))
 		goto out_keep_page;
 	if (task_numa_shared(current) < 0)
@@ -2360,23 +2361,13 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 	
 	switch (pol->mode) {
 	case MPOL_INTERLEAVE:
-	{
-		int shift;
 
 		BUG_ON(addr >= vma->vm_end);
 		BUG_ON(addr < vma->vm_start);
 
-#ifdef CONFIG_HUGETLB_PAGE
-		if (transparent_hugepage_enabled(vma) || vma->vm_flags & VM_HUGETLB)
-			shift = HPAGE_SHIFT;
-		else
-#endif
-			shift = PAGE_SHIFT;
-
 		target_node = interleave_nid(pol, vma, addr, shift);
 
 		goto out_keep_page;
-	}
 
 	case MPOL_PREFERRED:
 		if (pol->flags & MPOL_F_LOCAL)
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 47335a9..b5be3f1 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -138,19 +138,21 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, pud_t *
 		pages += change_pte_range(vma, pmd, addr, next, newprot,
 				 dirty_accountable, prot_numa, &all_same_node);
 
+#ifdef CONFIG_NUMA_BALANCING
 		/*
 		 * If we are changing protections for NUMA hinting faults then
 		 * set pmd_numa if the examined pages were all on the same
 		 * node. This allows a regular PMD to be handled as one fault
 		 * and effectively batches the taking of the PTL
 		 */
-		if (prot_numa && all_same_node) {
+		if (prot_numa && all_same_node && 0) {
 			struct mm_struct *mm = vma->vm_mm;
 
 			spin_lock(&mm->page_table_lock);
 			set_pmd_at(mm, addr & PMD_MASK, pmd, pmd_mknuma(*pmd));
 			spin_unlock(&mm->page_table_lock);
 		}
+#endif
 	} while (pmd++, addr = next, addr != end);
 
 	return pages;
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 7/9] numa, sched: Improve staggered convergence
  2012-12-07  0:19 [GIT TREE] Unified NUMA balancing tree, v3 Ingo Molnar
                   ` (5 preceding siblings ...)
  2012-12-07  0:19 ` [PATCH 6/9] numa, mm: Fix !THP, 4K-pte "2M-emu" NUMA fault handling Ingo Molnar
@ 2012-12-07  0:19 ` Ingo Molnar
  2012-12-07  0:19 ` [PATCH 8/9] numa, sched: Improve directed convergence Ingo Molnar
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: Ingo Molnar @ 2012-12-07  0:19 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

Add/tune three convergence mechanisms:

 - add a sysctl for the fault moving average weight factor
   and change it to 3

 - add an initial settlement delay of 2 periods

 - add back parts of the throttling that trigger after a
   task migrates, to let it settle - with a sysctl that is
   set to 0 for now.

This tunes the code to be in harmony with our changed and
more precise 'cpupid' fault statistics, which allows us to
converge more aggressively without introducing destabilizing
turbulence into the convergence flow.
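
For illustration, here is a minimal stand-alone sketch of the new
weight-3 moving average update - the same arithmetic used for
p->numa_faults[] in the patch below, with made-up fault counts,
not kernel code:

  #include <stdio.h>

  int main(void)
  {
          unsigned int weight = 3;      /* sysctl_sched_numa_fault_weight */
          unsigned long avg = 0;        /* plays the role of p->numa_faults[idx] */
          unsigned long new_faults[] = { 300, 300, 300, 0, 0, 0 };
          unsigned int i;

          for (i = 0; i < 6; i++) {
                  /* Exponential moving average with the new weight: */
                  avg = avg*(weight-1) + new_faults[i];
                  avg /= weight;
                  printf("period %u: new=%3lu avg=%3lu\n", i, new_faults[i], avg);
          }
          return 0;
  }

With weight 3 the average reacts to a change in fault behavior within
2-3 scan periods, while the previous hard-coded *15/16 running average
needed on the order of 16 periods to settle.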

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h   |  2 ++
 kernel/sched/core.c     |  1 +
 kernel/sched/fair.c     | 78 ++++++++++++++++++++++++++++++++++++-------------
 kernel/sched/features.h |  2 ++
 kernel/sysctl.c         |  8 +++++
 5 files changed, 71 insertions(+), 20 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1041c0d..6e63022 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1509,6 +1509,7 @@ struct task_struct {
 	int numa_max_node;
 	int numa_scan_seq;
 	unsigned long numa_scan_ts_secs;
+	int numa_migrate_seq;
 	unsigned int numa_scan_period;
 	u64 node_stamp;			/* migration stamp  */
 	unsigned long convergence_strength;
@@ -2064,6 +2065,7 @@ extern unsigned int sysctl_sched_numa_scan_size_min;
 extern unsigned int sysctl_sched_numa_scan_size_max;
 extern unsigned int sysctl_sched_numa_rss_threshold;
 extern unsigned int sysctl_sched_numa_settle_count;
+extern unsigned int sysctl_sched_numa_fault_weight;
 
 #ifdef CONFIG_SCHED_DEBUG
 extern unsigned int sysctl_sched_migration_cost;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cfa8426..3c74af7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1562,6 +1562,7 @@ static void __sched_fork(struct task_struct *p)
 	p->convergence_strength		= 0;
 	p->convergence_node		= -1;
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
+	p->numa_migrate_seq = 2;
 	p->numa_faults = NULL;
 	p->numa_scan_period = sysctl_sched_numa_scan_delay;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1547d66..fd49920 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -854,9 +854,20 @@ unsigned int sysctl_sched_numa_scan_size_max	__read_mostly = 512;	/* MB */
 unsigned int sysctl_sched_numa_rss_threshold	__read_mostly = 128;	/* MB */
 
 /*
- * Wait for the 2-sample stuff to settle before migrating again
+ * Wait for the 3-sample stuff to settle before migrating again
  */
-unsigned int sysctl_sched_numa_settle_count	__read_mostly = 2;
+unsigned int sysctl_sched_numa_settle_count	__read_mostly = 0;
+
+/*
+ * Weight of decay of the fault stats:
+ */
+unsigned int sysctl_sched_numa_fault_weight	__read_mostly = 3;
+
+static void task_numa_migrate(struct task_struct *p, int next_cpu)
+{
+	if (cpu_to_node(next_cpu) != cpu_to_node(task_cpu(p)))
+		p->numa_migrate_seq = 0;
+}
 
 static int task_ideal_cpu(struct task_struct *p)
 {
@@ -2051,7 +2062,9 @@ static void task_numa_placement_tick(struct task_struct *p)
 {
 	unsigned long total[2] = { 0, 0 };
 	unsigned long faults, max_faults = 0;
-	int node, priv, shared, ideal_node = -1;
+	unsigned long total_priv, total_shared;
+	int node, priv, new_shared, prev_shared, ideal_node = -1;
+	int settle_limit;
 	int flip_tasks;
 	int this_node;
 	int this_cpu;
@@ -2065,14 +2078,17 @@ static void task_numa_placement_tick(struct task_struct *p)
 		for (priv = 0; priv < 2; priv++) {
 			unsigned int new_faults;
 			unsigned int idx;
+			unsigned int weight;
 
 			idx = 2*node + priv;
 			new_faults = p->numa_faults_curr[idx];
 			p->numa_faults_curr[idx] = 0;
 
-			/* Keep a simple running average: */
-			p->numa_faults[idx] = p->numa_faults[idx]*15 + new_faults;
-			p->numa_faults[idx] /= 16;
+			/* Keep a simple exponential moving average: */
+			weight = sysctl_sched_numa_fault_weight;
+
+			p->numa_faults[idx] = p->numa_faults[idx]*(weight-1) + new_faults;
+			p->numa_faults[idx] /= weight;
 
 			faults += p->numa_faults[idx];
 			total[priv] += p->numa_faults[idx];
@@ -2092,23 +2108,38 @@ static void task_numa_placement_tick(struct task_struct *p)
 	 * we might want to consider a different equation below to reduce
 	 * the impact of a little private memory accesses.
 	 */
-	shared = p->numa_shared;
-
-	if (shared < 0) {
-		shared = (total[0] >= total[1]);
-	} else if (shared == 0) {
-		/* If it was private before, make it harder to become shared: */
-		if (total[0] >= total[1]*2)
-			shared = 1;
-	} else if (shared == 1 ) {
+	prev_shared = p->numa_shared;
+	new_shared = prev_shared;
+
+	settle_limit = sysctl_sched_numa_settle_count;
+
+	/*
+	 * Note: shared is spread across multiple tasks and in the future
+	 * we might want to consider a different equation below to reduce
+	 * the impact of a little private memory accesses.
+	 */
+	total_priv = total[1] / 2;
+	total_shared = total[0];
+
+	if (prev_shared < 0) {
+		/* Start out as private: */
+		new_shared = 0;
+	} else if (prev_shared == 0 && p->numa_migrate_seq >= settle_limit) {
+		/*
+		 * Hysteresis: if it was private before, make it harder to
+		 * become shared:
+		 */
+		if (total_shared*2 >= total_priv*3)
+			new_shared = 1;
+	} else if (prev_shared == 1 && p->numa_migrate_seq >= settle_limit) {
 		 /* If it was shared before, make it harder to become private: */
-		if (total[0]*2 <= total[1])
-			shared = 0;
+		if (total_shared*3 <= total_priv*2)
+			new_shared = 0;
 	}
 
 	flip_tasks = 0;
 
-	if (shared)
+	if (new_shared)
 		p->ideal_cpu = sched_update_ideal_cpu_shared(p, &flip_tasks);
 	else
 		p->ideal_cpu = sched_update_ideal_cpu_private(p);
@@ -2126,7 +2157,9 @@ static void task_numa_placement_tick(struct task_struct *p)
 			ideal_node = p->numa_max_node;
 	}
 
-	if (shared != task_numa_shared(p) || (ideal_node != -1 && ideal_node != p->numa_max_node)) {
+	if (new_shared != prev_shared || (ideal_node != -1 && ideal_node != p->numa_max_node)) {
+
+		p->numa_migrate_seq = 0;
 		/*
 		 * Fix up node migration fault statistics artifact, as we
 		 * migrate to another node we'll soon bring over our private
@@ -2141,7 +2174,7 @@ static void task_numa_placement_tick(struct task_struct *p)
 			p->numa_faults[idx_newnode] += p->numa_faults[idx_oldnode];
 			p->numa_faults[idx_oldnode] = 0;
 		}
-		sched_setnuma(p, ideal_node, shared);
+		sched_setnuma(p, ideal_node, new_shared);
 
 		/* Allocate only the maximum node: */
 		if (sched_feat(NUMA_POLICY_MAXNODE)) {
@@ -2323,6 +2356,10 @@ void task_numa_placement_work(struct callback_head *work)
 	if (p->flags & PF_EXITING)
 		return;
 
+	p->numa_migrate_seq++;
+	if (sched_feat(NUMA_SETTLE) && p->numa_migrate_seq < sysctl_sched_numa_settle_count)
+		return;
+
 	task_numa_placement_tick(p);
 }
 
@@ -5116,6 +5153,7 @@ static void
 migrate_task_rq_fair(struct task_struct *p, int next_cpu)
 {
 	migrate_task_rq_entity(p, next_cpu);
+	task_numa_migrate(p, next_cpu);
 }
 #endif /* CONFIG_SMP */
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 5598f63..c2f137f 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -63,6 +63,8 @@ SCHED_FEAT(NONTASK_POWER, true)
  */
 SCHED_FEAT(TTWU_QUEUE, true)
 
+SCHED_FEAT(NUMA_SETTLE,			true)
+
 SCHED_FEAT(FORCE_SD_OVERLAP,		false)
 SCHED_FEAT(RT_RUNTIME_SHARE,		true)
 SCHED_FEAT(LB_MIN,			false)
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 75ab895..254a2b4 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -401,6 +401,14 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname	= "sched_numa_fault_weight",
+		.data		= &sysctl_sched_numa_fault_weight,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+		.extra1		= &two,	/* a weight minimum of 2 */
+	},
 #endif /* CONFIG_NUMA_BALANCING */
 #endif /* CONFIG_SCHED_DEBUG */
 	{
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 8/9] numa, sched: Improve directed convergence
  2012-12-07  0:19 [GIT TREE] Unified NUMA balancing tree, v3 Ingo Molnar
                   ` (6 preceding siblings ...)
  2012-12-07  0:19 ` [PATCH 7/9] numa, sched: Improve staggered convergence Ingo Molnar
@ 2012-12-07  0:19 ` Ingo Molnar
  2012-12-07  0:19 ` [PATCH 9/9] numa, sched: Streamline and fix numa_allow_migration() use Ingo Molnar
  2012-12-10 18:22 ` [GIT TREE] Unified NUMA balancing tree, v3 Thomas Gleixner
  9 siblings, 0 replies; 19+ messages in thread
From: Ingo Molnar @ 2012-12-07  0:19 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

Improve a few aspects of directed convergence, whose problems
have become more visible with the improved (and thus more
aggressive) task-flipping code.

We also have the new 'cpupid' sharing info which converges more
precisely and thus highlights weaknesses in group balancing more
visibly:

 - We should only balance over buddy groups that are smaller
   than the other (not fully filled) buddy groups

 - Do not 'spread' buddy groups that fully fill a node

 - Do not 'spread' singular buddy groups

These bugs were prominently visible with certain preemption
options and timings in the previous code as well, so this
is also a regression fix.
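
In sketch form the new compress/spread decisions boil down to checks
along these lines - a simplified, stand-alone approximation of the
logic in the patch below, not the exact kernel code:

  #include <stdbool.h>

  /* Compression: only pull towards a larger, not fully filled buddy group: */
  bool should_compress(int our_group_size, int max_group_size, int node_size)
  {
          if (max_group_size == node_size)        /* target node already full */
                  return false;
          if (our_group_size == node_size)        /* we fill our own node */
                  return false;
          if (our_group_size > max_group_size)    /* we are not the smaller group */
                  return false;
          return true;
  }

  /* Spreading: only the smallest of at least two groups gives way: */
  bool should_spread(int group_count, bool our_group_smallest)
  {
          if (group_count <= 1)                   /* singular group, nothing to spread */
                  return false;
          if (!our_group_smallest)
                  return false;
          return true;
  }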

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 101 ++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 71 insertions(+), 30 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fd49920..c393fba 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -926,6 +926,7 @@ static long calc_node_load(int node, bool use_higher)
 	long cpu_load_highfreq;
 	long cpu_load_lowfreq;
 	long cpu_load_curr;
+	long cpu_load_numa;
 	long min_cpu_load;
 	long max_cpu_load;
 	long node_load;
@@ -935,18 +936,22 @@ static long calc_node_load(int node, bool use_higher)
 
 	for_each_cpu(cpu, cpumask_of_node(node)) {
 		struct rq *rq = cpu_rq(cpu);
+		long cpu_load;
 
+		cpu_load_numa		= rq->numa_weight;
 		cpu_load_curr		= rq->load.weight;
 		cpu_load_lowfreq	= rq->cpu_load[NUMA_LOAD_IDX_LOWFREQ];
 		cpu_load_highfreq	= rq->cpu_load[NUMA_LOAD_IDX_HIGHFREQ];
 
-		min_cpu_load = min(min(cpu_load_curr, cpu_load_lowfreq), cpu_load_highfreq);
-		max_cpu_load = max(max(cpu_load_curr, cpu_load_lowfreq), cpu_load_highfreq);
+		min_cpu_load = min(min(min(cpu_load_numa, cpu_load_curr), cpu_load_lowfreq), cpu_load_highfreq);
+		max_cpu_load = max(max(max(cpu_load_numa, cpu_load_curr), cpu_load_lowfreq), cpu_load_highfreq);
 
 		if (use_higher)
-			node_load += max_cpu_load;
+			cpu_load = max_cpu_load;
 		else
-			node_load += min_cpu_load;
+			cpu_load = min_cpu_load;
+
+		node_load += cpu_load;
 	}
 
 	return node_load;
@@ -1087,6 +1092,7 @@ static int find_intranode_imbalance(int this_node, int this_cpu)
 	long cpu_load_lowfreq;
 	long this_cpu_load;
 	long cpu_load_curr;
+	long cpu_load_numa;
 	long min_cpu_load;
 	long cpu_load;
 	int min_cpu;
@@ -1102,14 +1108,15 @@ static int find_intranode_imbalance(int this_node, int this_cpu)
 	for_each_cpu(cpu, cpumask_of_node(this_node)) {
 		struct rq *rq = cpu_rq(cpu);
 
+		cpu_load_numa		= rq->numa_weight;
 		cpu_load_curr		= rq->load.weight;
 		cpu_load_lowfreq	= rq->cpu_load[NUMA_LOAD_IDX_LOWFREQ];
 		cpu_load_highfreq	= rq->cpu_load[NUMA_LOAD_IDX_HIGHFREQ];
 
-		if (cpu == this_cpu) {
-			this_cpu_load = min(min(cpu_load_curr, cpu_load_lowfreq), cpu_load_highfreq);
-		}
-		cpu_load = max(max(cpu_load_curr, cpu_load_lowfreq), cpu_load_highfreq);
+		if (cpu == this_cpu)
+			this_cpu_load = min(min(min(cpu_load_numa, cpu_load_curr), cpu_load_lowfreq), cpu_load_highfreq);
+
+		cpu_load = max(max(max(cpu_load_numa, cpu_load_curr), cpu_load_lowfreq), cpu_load_highfreq);
 
 		/* Find the idlest CPU: */
 		if (cpu_load < min_cpu_load) {
@@ -1128,16 +1135,18 @@ static int find_intranode_imbalance(int this_node, int this_cpu)
 /*
  * Search a node for the smallest-group task and return
  * it plus the size of the group it is in.
+ *
+ * TODO: can be done with a single pass.
  */
-static int buddy_group_size(int node, struct task_struct *p)
+static int buddy_group_size(int node, struct task_struct *p, bool *our_group_p)
 {
 	const cpumask_t *node_cpus_mask = cpumask_of_node(node);
 	cpumask_t cpus_to_check_mask;
-	bool our_group_found;
+	bool buddies_group_found;
 	int cpu1, cpu2;
 
 	cpumask_copy(&cpus_to_check_mask, node_cpus_mask);
-	our_group_found = false;
+	buddies_group_found = false;
 
 	if (WARN_ON_ONCE(cpumask_empty(&cpus_to_check_mask)))
 		return 0;
@@ -1156,7 +1165,12 @@ static int buddy_group_size(int node, struct task_struct *p)
 
 			group_size = 1;
 			if (tasks_buddies(group_head, p))
-				our_group_found = true;
+				buddies_group_found = true;
+
+			if (group_head == p) {
+				*our_group_p = true;
+				buddies_group_found = true;
+			}
 
 			/* Non-NUMA-shared tasks are 1-task groups: */
 			if (task_numa_shared(group_head) != 1)
@@ -1169,21 +1183,22 @@ static int buddy_group_size(int node, struct task_struct *p)
 				struct task_struct *p2 = rq2->curr;
 
 				WARN_ON_ONCE(rq1 == rq2);
+				if (p2 == p)
+					*our_group_p = true;
 				if (tasks_buddies(group_head, p2)) {
 					/* 'group_head' and 'rq2->curr' are in the same group: */
 					cpumask_clear_cpu(cpu2, &cpus_to_check_mask);
 					group_size++;
 					if (tasks_buddies(p2, p))
-						our_group_found = true;
+						buddies_group_found = true;
 				}
 			}
 next:
-
 			/*
 			 * If we just found our group and checked all
 			 * node local CPUs then return the result:
 			 */
-			if (our_group_found)
+			if (buddies_group_found)
 				return group_size;
 		}
 	} while (!cpumask_empty(&cpus_to_check_mask));
@@ -1261,8 +1276,15 @@ pick_non_numa_task:
 	return min_group_cpu;
 }
 
-static int find_max_node(struct task_struct *p, int *our_group_size)
+/*
+ * Find the node with the biggest buddy group of ours, but exclude
+ * our own local group on this node and also exclude fully filled
+ * nodes:
+ */
+static int
+find_max_node(struct task_struct *p, int *our_group_size_p, int *max_group_size_p, int full_size)
 {
+	bool our_group = false;
 	int max_group_size;
 	int group_size;
 	int max_node;
@@ -1272,9 +1294,12 @@ static int find_max_node(struct task_struct *p, int *our_group_size)
 	max_node = -1;
 
 	for_each_node(node) {
-		int full_size = cpumask_weight(cpumask_of_node(node));
 
-		group_size = buddy_group_size(node, p);
+		group_size = buddy_group_size(node, p, &our_group);
+		if (our_group) {
+			our_group = false;
+			*our_group_size_p = group_size;
+		}
 		if (group_size == full_size)
 			continue;
 
@@ -1284,7 +1309,7 @@ static int find_max_node(struct task_struct *p, int *our_group_size)
 		}
 	}
 
-	*our_group_size = max_group_size;
+	*max_group_size_p = max_group_size;
 
 	return max_node;
 }
@@ -1460,19 +1485,23 @@ static int find_max_node(struct task_struct *p, int *our_group_size)
  */
 static int improve_group_balance_compress(struct task_struct *p, int this_cpu, int this_node)
 {
-	int our_group_size = -1;
+	int full_size = cpumask_weight(cpumask_of_node(this_node));
+	int max_group_size = -1;
 	int min_group_size = -1;
+	int our_group_size = -1;
 	int max_node;
 	int min_cpu;
 
 	if (!sched_feat(NUMA_GROUP_LB_COMPRESS))
 		return -1;
 
-	max_node = find_max_node(p, &our_group_size);
+	max_node = find_max_node(p, &our_group_size, &max_group_size, full_size);
 	if (max_node == -1)
 		return -1;
+	if (our_group_size == -1)
+		return -1;
 
-	if (WARN_ON_ONCE(our_group_size == -1))
+	if (our_group_size == full_size || our_group_size > max_group_size)
 		return -1;
 
 	/* We are already in the right spot: */
@@ -1517,6 +1546,7 @@ static int improve_group_balance_spread(struct task_struct *p, int this_cpu, int
 	long this_node_load = -1;
 	long delta_load_before;
 	long delta_load_after;
+	int group_count = 0;
 	int idlest_cpu = -1;
 	int cpu1, cpu2;
 
@@ -1585,6 +1615,7 @@ static int improve_group_balance_spread(struct task_struct *p, int this_cpu, int
 				min_group_size = group_size;
 			else
 				min_group_size = min(group_size, min_group_size);
+			group_count++;
 		}
 	} while (!cpumask_empty(&cpus_to_check_mask));
 
@@ -1594,13 +1625,23 @@ static int improve_group_balance_spread(struct task_struct *p, int this_cpu, int
 	 */
 	if (!found_our_group)
 		return -1;
+
 	if (!our_group_smallest)
 		return -1;
+
 	if (WARN_ON_ONCE(min_group_size == -1))
 		return -1;
 	if (WARN_ON_ONCE(our_group_size == -1))
 		return -1;
 
+	/* Since the current task is shared, this should not happen: */
+	if (WARN_ON_ONCE(group_count < 1))
+		return -1;
+
+	/* No point in moving if we are a single group: */
+	if (group_count <= 1)
+		return -1;
+
 	idlest_node = find_idlest_node(&idlest_cpu);
 	if (idlest_node == -1)
 		return -1;
@@ -1622,7 +1663,7 @@ static int improve_group_balance_spread(struct task_struct *p, int this_cpu, int
 	 */
 	delta_load_before = this_node_load - idlest_node_load;
 	delta_load_after = (this_node_load-this_group_load) - (idlest_node_load+this_group_load);
-	
+
 	if (abs(delta_load_after)+SCHED_LOAD_SCALE > abs(delta_load_before))
 		return -1;
 
@@ -1806,7 +1847,7 @@ static int sched_update_ideal_cpu_shared(struct task_struct *p, int *flip_tasks)
 	this_node_capacity	= calc_node_capacity(this_node);
 
 	this_node_overloaded = false;
-	if (this_node_load > this_node_capacity + 512)
+	if (this_node_load > this_node_capacity + SCHED_LOAD_SCALE/2)
 		this_node_overloaded = true;
 
 	/* If we'd stay within this node then stay put: */
@@ -1922,7 +1963,7 @@ static int sched_update_ideal_cpu_private(struct task_struct *p)
 	this_node_capacity	= calc_node_capacity(this_node);
 
 	this_node_overloaded = false;
-	if (this_node_load > this_node_capacity + 512)
+	if (this_node_load > this_node_capacity + SCHED_LOAD_SCALE/2)
 		this_node_overloaded = true;
 
 	if (this_node == min_node)
@@ -1934,8 +1975,8 @@ static int sched_update_ideal_cpu_private(struct task_struct *p)
 
 	WARN_ON_ONCE(max_node_load < min_node_load);
 
-	/* Is the load difference at least 125% of one standard task load? */
-	if (this_node_load - min_node_load < 1536)
+	/* Is the load difference at least 150% of one standard task load? */
+	if (this_node_load - min_node_load < SCHED_LOAD_SCALE*3/2)
 		goto out_check_intranode;
 
 	/*
@@ -5476,7 +5517,7 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p, bool preemp
  *
  * The adjacency matrix of the resulting graph is given by:
  *
- *             log_2 n     
+ *             log_2 n
  *   A_i,j = \Union     (i % 2^k == 0) && i / 2^(k+1) == j / 2^(k+1)  (6)
  *             k = 0
  *
@@ -5522,7 +5563,7 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p, bool preemp
  *
  * [XXX write more on how we solve this.. _after_ merging pjt's patches that
  *      rewrite all of this once again.]
- */ 
+ */
 
 static unsigned long __read_mostly max_load_balance_interval = HZ/10;
 
@@ -6133,7 +6174,7 @@ void update_group_power(struct sched_domain *sd, int cpu)
 		/*
 		 * !SD_OVERLAP domains can assume that child groups
 		 * span the current group.
-		 */ 
+		 */
 
 		group = child->groups;
 		do {
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 9/9] numa, sched: Streamline and fix numa_allow_migration() use
  2012-12-07  0:19 [GIT TREE] Unified NUMA balancing tree, v3 Ingo Molnar
                   ` (7 preceding siblings ...)
  2012-12-07  0:19 ` [PATCH 8/9] numa, sched: Improve directed convergence Ingo Molnar
@ 2012-12-07  0:19 ` Ingo Molnar
  2012-12-10 18:22 ` [GIT TREE] Unified NUMA balancing tree, v3 Thomas Gleixner
  9 siblings, 0 replies; 19+ messages in thread
From: Ingo Molnar @ 2012-12-07  0:19 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

There were a few inconsistencies in how numa_allow_migration() was
used; in particular, it did not always take into account
high-imbalance scenarios, where affinity preferences are generally
overridden.

To fix this, make the use of numa_allow_migration() more consistent
and also pass the load-balancing environment into the function, where
it can look at env->failed and env->sd->cache_nice_tries.

Also add a NUMA check to ALB (active load balancing).
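
The intent of the new env argument, in stand-alone sketch form (the
struct names below are simplified stand-ins for the kernel's lb_env
and sched_domain fields; a NULL env is the wakeup path, which has no
balancing context):

  #include <stdbool.h>
  #include <stddef.h>

  struct sd_stub     { unsigned int cache_nice_tries; };
  struct lb_env_stub { struct sd_stub *sd; unsigned int failed; };

  /*
   * NUMA affinity preferences are only enforced while the balancer has
   * not failed more than sd->cache_nice_tries times on this domain;
   * after that, fixing the imbalance takes priority:
   */
  bool numa_affinity_applies(const struct lb_env_stub *env)
  {
          if (!env)       /* task placement / wakeup path: always apply */
                  return true;

          return env->failed <= env->sd->cache_nice_tries;
  }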

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 103 +++++++++++++++++++++++++++++-----------------------
 1 file changed, 57 insertions(+), 46 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c393fba..503ec29 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4792,6 +4792,39 @@ static inline unsigned long effective_load(struct task_group *tg, int cpu,
 
 #endif
 
+#define LBF_ALL_PINNED	0x01
+#define LBF_NEED_BREAK	0x02
+#define LBF_SOME_PINNED	0x04
+
+struct lb_env {
+	struct sched_domain	*sd;
+
+	struct rq		*src_rq;
+	int			src_cpu;
+
+	int			dst_cpu;
+	struct rq		*dst_rq;
+
+	struct cpumask		*dst_grpmask;
+	int			new_dst_cpu;
+	enum cpu_idle_type	idle;
+	long			imbalance;
+	/* The set of CPUs under consideration for load-balancing */
+	struct cpumask		*cpus;
+
+	unsigned int		flags;
+	unsigned int		failed;
+	unsigned int		iteration;
+
+	unsigned int		loop;
+	unsigned int		loop_break;
+	unsigned int		loop_max;
+
+	struct rq *		(*find_busiest_queue)(struct lb_env *,
+						      struct sched_group *);
+};
+
+
 static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 {
 	s64 this_load, load;
@@ -5011,30 +5044,35 @@ done:
 	return target;
 }
 
-static bool numa_allow_migration(struct task_struct *p, int prev_cpu, int new_cpu)
+static bool numa_allow_migration(struct task_struct *p, int prev_cpu, int new_cpu,
+				 struct lb_env *env)
 {
 #ifdef CONFIG_NUMA_BALANCING
+
 	if (sched_feat(NUMA_CONVERGE_MIGRATIONS)) {
 		/* Help in the direction of expected convergence: */
 		if (p->convergence_node >= 0 && (cpu_to_node(new_cpu) != p->convergence_node))
 			return false;
 
-		return true;
-	}
-
-	if (sched_feat(NUMA_BALANCE_ALL)) {
- 		if (task_numa_shared(p) >= 0)
-			return false;
-
-		return true;
+		if (!env || env->failed <= env->sd->cache_nice_tries) {
+			if (task_numa_shared(p) >= 0 &&
+					cpu_to_node(prev_cpu) != cpu_to_node(new_cpu))
+				return false;
+		}
 	}
 
 	if (sched_feat(NUMA_BALANCE_INTERNODE)) {
 		if (task_numa_shared(p) >= 0) {
- 			if (cpu_to_node(prev_cpu) != cpu_to_node(new_cpu))
+			if (cpu_to_node(prev_cpu) != cpu_to_node(new_cpu))
 				return false;
 		}
 	}
+
+	if (sched_feat(NUMA_BALANCE_ALL)) {
+		if (task_numa_shared(p) >= 0)
+			return false;
+	}
+
 #endif
 	return true;
 }
@@ -5148,7 +5186,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 		/* while loop will break here if sd == NULL */
 	}
 unlock:
-	if (!numa_allow_migration(p, prev0_cpu, new_cpu)) {
+	if (!numa_allow_migration(p, prev0_cpu, new_cpu, NULL)) {
 		if (cpumask_test_cpu(prev0_cpu, tsk_cpus_allowed(p)))
 			new_cpu = prev0_cpu;
 	}
@@ -5567,38 +5605,6 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p, bool preemp
 
 static unsigned long __read_mostly max_load_balance_interval = HZ/10;
 
-#define LBF_ALL_PINNED	0x01
-#define LBF_NEED_BREAK	0x02
-#define LBF_SOME_PINNED	0x04
-
-struct lb_env {
-	struct sched_domain	*sd;
-
-	struct rq		*src_rq;
-	int			src_cpu;
-
-	int			dst_cpu;
-	struct rq		*dst_rq;
-
-	struct cpumask		*dst_grpmask;
-	int			new_dst_cpu;
-	enum cpu_idle_type	idle;
-	long			imbalance;
-	/* The set of CPUs under consideration for load-balancing */
-	struct cpumask		*cpus;
-
-	unsigned int		flags;
-	unsigned int		failed;
-	unsigned int		iteration;
-
-	unsigned int		loop;
-	unsigned int		loop_break;
-	unsigned int		loop_max;
-
-	struct rq *		(*find_busiest_queue)(struct lb_env *,
-						      struct sched_group *);
-};
-
 /*
  * move_task - move a task from one runqueue to another runqueue.
  * Both runqueues must be locked.
@@ -5699,7 +5705,7 @@ static int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	/* We do NUMA balancing elsewhere: */
 
 	if (env->failed <= env->sd->cache_nice_tries) {
-		if (!numa_allow_migration(p, env->src_rq->cpu, env->dst_cpu))
+		if (!numa_allow_migration(p, env->src_rq->cpu, env->dst_cpu, env))
 			return false;
 	}
 
@@ -5760,7 +5766,7 @@ static int move_one_task(struct lb_env *env)
 		if (!can_migrate_task(p, env))
 			continue;
 
-		if (!numa_allow_migration(p, env->src_rq->cpu, env->dst_cpu))
+		if (!numa_allow_migration(p, env->src_rq->cpu, env->dst_cpu, env))
 			continue;
 
 		move_task(p, env);
@@ -5823,7 +5829,7 @@ static int move_tasks(struct lb_env *env)
 		if (!can_migrate_task(p, env))
 			goto next;
 
-		if (!numa_allow_migration(p, env->src_rq->cpu, env->dst_cpu))
+		if (!numa_allow_migration(p, env->src_rq->cpu, env->dst_cpu, env))
 			goto next;
 
 		move_task(p, env);
@@ -6944,6 +6950,11 @@ more_balance:
 			goto out_pinned;
 		}
 
+		/* Is this active load-balancing NUMA-beneficial? */
+		if (!numa_allow_migration(busiest->curr, env.src_rq->cpu, env.dst_cpu, &env)) {
+			raw_spin_unlock_irqrestore(&busiest->lock, flags);
+			goto out;
+		}
 		/*
 		 * ->active_balance synchronizes accesses to
 		 * ->active_balance_work.  Once set, it's cleared
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [GIT TREE] Unified NUMA balancing tree, v3
  2012-12-07  0:19 [GIT TREE] Unified NUMA balancing tree, v3 Ingo Molnar
                   ` (8 preceding siblings ...)
  2012-12-07  0:19 ` [PATCH 9/9] numa, sched: Streamline and fix numa_allow_migration() use Ingo Molnar
@ 2012-12-10 18:22 ` Thomas Gleixner
  2012-12-10 18:41   ` Rik van Riel
  2012-12-10 19:32   ` Mel Gorman
  9 siblings, 2 replies; 19+ messages in thread
From: Thomas Gleixner @ 2012-12-10 18:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-mm, Peter Zijlstra, Paul Turner,
	Lee Schermerhorn, Christoph Lameter, Rik van Riel, Mel Gorman,
	Andrew Morton, Andrea Arcangeli, Linus Torvalds, Johannes Weiner,
	Hugh Dickins

On Fri, 7 Dec 2012, Ingo Molnar wrote:
> The SPECjbb 4x JVM numbers are still very close to the
> hard-binding results:
> 
>   Fri Dec  7 02:08:42 CET 2012
>   spec1.txt:           throughput =     188667.94 SPECjbb2005 bops
>   spec2.txt:           throughput =     190109.31 SPECjbb2005 bops
>   spec3.txt:           throughput =     191438.13 SPECjbb2005 bops
>   spec4.txt:           throughput =     192508.34 SPECjbb2005 bops
>                                       --------------------------
>         SUM:           throughput =     762723.72 SPECjbb2005 bops
> 
> And the same is true for !THP as well.

I could not resist throwing all the relevant trees on my own 4-node
machine and running a SPECjbb 4x JVM comparison. All results have been
averaged over 10 runs.

mainline:	v3.7-rc8
autonuma:	mm-autonuma-v28fastr4-mels-rebase
balancenuma:	mm-balancenuma-v10r3
numacore:	Unified NUMA balancing tree, v3

The config is based on an F16 config with CONFIG_PREEMPT_NONE=y and the
relevant NUMA options enabled for the 4 trees.

THP off: manual placement result:     125239

		Auto result	Auto/Manual	Auto/Mainline	Variance
mainline    :	     93945	0.750		1.000		 5.91%
autonuma    :	    123651	0.987		1.316		 5.15%
balancenuma :	     97327	0.777		1.036		 5.19%
numacore    :	    123009	0.982		1.309		 5.73%


THP on: manual placement result:     143170

		Auto result	Auto/Manual	Auto/Mainline	Variance
mainline    :	    104462	0.730		1.000		 8.47%
autonuma    :	    137363	0.959		1.315		 5.81%
balancenuma :	    112183	0.784		1.074		11.58%
numacore    :	    142728	0.997		1.366		 2.94%

So autonuma and numacore are basically on the same page, with a slight
advantage for numacore in the THP enabled case. balancenuma is closer
to mainline than to autonuma/numacore.

Thanks,

	tglx



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [GIT TREE] Unified NUMA balancing tree, v3
  2012-12-10 18:22 ` [GIT TREE] Unified NUMA balancing tree, v3 Thomas Gleixner
@ 2012-12-10 18:41   ` Rik van Riel
  2012-12-10 19:15     ` Ingo Molnar
  2012-12-10 19:32   ` Mel Gorman
  1 sibling, 1 reply; 19+ messages in thread
From: Rik van Riel @ 2012-12-10 18:41 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, linux-kernel, linux-mm, Peter Zijlstra, Paul Turner,
	Lee Schermerhorn, Christoph Lameter, Mel Gorman, Andrew Morton,
	Andrea Arcangeli, Linus Torvalds, Johannes Weiner, Hugh Dickins

On 12/10/2012 01:22 PM, Thomas Gleixner wrote:

> So autonuma and numacore are basically on the same page, with a slight
> advantage for numacore in the THP enabled case. balancenuma is closer
> to mainline than to autonuma/numacore.

Indeed, when the system is fully loaded, numacore does very well.

The main issues that have been observed with numacore are when
the system is only partially loaded. Something strange seems to
be going on that causes performance regressions in that situation.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [GIT TREE] Unified NUMA balancing tree, v3
  2012-12-10 18:41   ` Rik van Riel
@ 2012-12-10 19:15     ` Ingo Molnar
  2012-12-10 19:28       ` Mel Gorman
  0 siblings, 1 reply; 19+ messages in thread
From: Ingo Molnar @ 2012-12-10 19:15 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Thomas Gleixner, linux-kernel, linux-mm, Peter Zijlstra,
	Paul Turner, Lee Schermerhorn, Christoph Lameter, Mel Gorman,
	Andrew Morton, Andrea Arcangeli, Linus Torvalds, Johannes Weiner,
	Hugh Dickins


* Rik van Riel <riel@redhat.com> wrote:

> On 12/10/2012 01:22 PM, Thomas Gleixner wrote:
> 
> > So autonuma and numacore are basically on the same page, 
> > with a slight advantage for numacore in the THP enabled 
> > case. balancenuma is closer to mainline than to 
> > autonuma/numacore.
> 
> Indeed, when the system is fully loaded, numacore does very 
> well.

Note that the latest (-v3) code also does well in under-loaded 
situations:

   http://lkml.org/lkml/2012/12/7/331

Here's the 'perf bench numa' comparison to 'balancenuma':

                            balancenuma  | NUMA-tip
 [test unit]            :          -v10  |    -v3
------------------------------------------------------------
 2x1-bw-process         :         6.136  |  9.647:  57.2%
 3x1-bw-process         :         7.250  | 14.528: 100.4%
 4x1-bw-process         :         6.867  | 18.903: 175.3%
 8x1-bw-process         :         7.974  | 26.829: 236.5%
 8x1-bw-process-NOTHP   :         5.937  | 22.237: 274.5%
 16x1-bw-process        :         5.592  | 29.294: 423.9%
 4x1-bw-thread          :        13.598  | 19.290:  41.9%
 8x1-bw-thread          :        16.356  | 26.391:  61.4%
 16x1-bw-thread         :        24.608  | 29.557:  20.1%
 32x1-bw-thread         :        25.477  | 30.232:  18.7%
 2x3-bw-thread          :         8.785  | 15.327:  74.5%
 4x4-bw-thread          :         6.366  | 27.957: 339.2%
 4x6-bw-thread          :         6.287  | 27.877: 343.4%
 4x8-bw-thread          :         5.860  | 28.439: 385.3%
 4x8-bw-thread-NOTHP    :         6.167  | 25.067: 306.5%
 3x3-bw-thread          :         8.235  | 21.560: 161.8%
 5x5-bw-thread          :         5.762  | 26.081: 352.6%
 2x16-bw-thread         :         5.920  | 23.269: 293.1%
 1x32-bw-thread         :         5.828  | 18.985: 225.8%
 numa02-bw              :        29.054  | 31.431:   8.2%
 numa02-bw-NOTHP        :        27.064  | 29.104:   7.5%
 numa01-bw-thread	:        20.338  | 28.607:  40.7%
 numa01-bw-thread-NOTHP :        18.528  | 21.119:  14.0%
------------------------------------------------------------

More than half of these testcases are under-loaded situations.

> The main issues that have been observed with numacore are when 
> the system is only partially loaded. Something strange seems 
> to be going on that causes performance regressions in that 
> situation.

I haven't seen such reports with -v3 yet, which is what Thomas 
tested. Mel has not tested -v3 yet AFAICS.

If there are any such instances left then I'll investigate, but 
right now it's looking pretty good.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [GIT TREE] Unified NUMA balancing tree, v3
  2012-12-10 19:15     ` Ingo Molnar
@ 2012-12-10 19:28       ` Mel Gorman
  2012-12-10 20:07         ` Ingo Molnar
  0 siblings, 1 reply; 19+ messages in thread
From: Mel Gorman @ 2012-12-10 19:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Rik van Riel, Thomas Gleixner, linux-kernel, linux-mm,
	Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Andrew Morton, Andrea Arcangeli, Linus Torvalds, Johannes Weiner,
	Hugh Dickins

On Mon, Dec 10, 2012 at 08:15:45PM +0100, Ingo Molnar wrote:
> 
> * Rik van Riel <riel@redhat.com> wrote:
> 
> > On 12/10/2012 01:22 PM, Thomas Gleixner wrote:
> > 
> > > So autonuma and numacore are basically on the same page, 
> > > with a slight advantage for numacore in the THP enabled 
> > > case. balancenuma is closer to mainline than to 
> > > autonuma/numacore.
> > 
> > Indeed, when the system is fully loaded, numacore does very 
> > well.
> 
> Note that the latest (-v3) code also does well in under-loaded 
> situations:
> 
>    http://lkml.org/lkml/2012/12/7/331
> 
> Here's the 'perf bench numa' comparison to 'balancenuma':
> 
>                             balancenuma  | NUMA-tip
>  [test unit]            :          -v10  |    -v3
> ------------------------------------------------------------
>  2x1-bw-process         :         6.136  |  9.647:  57.2%
>  3x1-bw-process         :         7.250  | 14.528: 100.4%
>  4x1-bw-process         :         6.867  | 18.903: 175.3%
>  8x1-bw-process         :         7.974  | 26.829: 236.5%
>  8x1-bw-process-NOTHP   :         5.937  | 22.237: 274.5%
>  16x1-bw-process        :         5.592  | 29.294: 423.9%
>  4x1-bw-thread          :        13.598  | 19.290:  41.9%
>  8x1-bw-thread          :        16.356  | 26.391:  61.4%
>  16x1-bw-thread         :        24.608  | 29.557:  20.1%
>  32x1-bw-thread         :        25.477  | 30.232:  18.7%
>  2x3-bw-thread          :         8.785  | 15.327:  74.5%
>  4x4-bw-thread          :         6.366  | 27.957: 339.2%
>  4x6-bw-thread          :         6.287  | 27.877: 343.4%
>  4x8-bw-thread          :         5.860  | 28.439: 385.3%
>  4x8-bw-thread-NOTHP    :         6.167  | 25.067: 306.5%
>  3x3-bw-thread          :         8.235  | 21.560: 161.8%
>  5x5-bw-thread          :         5.762  | 26.081: 352.6%
>  2x16-bw-thread         :         5.920  | 23.269: 293.1%
>  1x32-bw-thread         :         5.828  | 18.985: 225.8%
>  numa02-bw              :        29.054  | 31.431:   8.2%
>  numa02-bw-NOTHP        :        27.064  | 29.104:   7.5%
>  numa01-bw-thread	:        20.338  | 28.607:  40.7%
>  numa01-bw-thread-NOTHP :        18.528  | 21.119:  14.0%
> ------------------------------------------------------------
> 
> More than half of these testcases are under-loaded situations.
> 
> > The main issues that have been observed with numacore are when 
> > the system is only partially loaded. Something strange seems 
> > to be going on that causes performance regressions in that 
> > situation.
> 
> I haven't seen such reports with -v3 yet, which is what Thomas 
> tested. Mel has not tested -v3 yet AFAICS.
> 

Yes, I have. The drop I took and the results I posted to you were based
on a tip/master pull from December 9th. v3 was released on December
7th and your release said to test based on tip/master. The results are
here https://lkml.org/lkml/2012/12/9/108 . Look at the columns marked
numafix-20121209 which is tip/master with a bodge on top to remove the "if
(p->nr_cpus_allowed != num_online_cpus())" check.

To my continued frustration, the results begin at the line "Here is the
comparison on the rough off-chance you actually read it this time." I
guess you didn't feel the need.

> If there are any such instances left then I'll investigate, but 
> right now it's looking pretty good.
> 

If you had read that report, you would know that I didn't have results
for specjbb with THP enabled due to the JVM crashing with null pointer
exceptions.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [GIT TREE] Unified NUMA balancing tree, v3
  2012-12-10 18:22 ` [GIT TREE] Unified NUMA balancing tree, v3 Thomas Gleixner
  2012-12-10 18:41   ` Rik van Riel
@ 2012-12-10 19:32   ` Mel Gorman
  1 sibling, 0 replies; 19+ messages in thread
From: Mel Gorman @ 2012-12-10 19:32 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, linux-kernel, linux-mm, Peter Zijlstra, Paul Turner,
	Lee Schermerhorn, Christoph Lameter, Rik van Riel, Andrew Morton,
	Andrea Arcangeli, Linus Torvalds, Johannes Weiner, Hugh Dickins

On Mon, Dec 10, 2012 at 07:22:37PM +0100, Thomas Gleixner wrote:
> On Fri, 7 Dec 2012, Ingo Molnar wrote:
> > The SPECjbb 4x JVM numbers are still very close to the
> > hard-binding results:
> > 
> >   Fri Dec  7 02:08:42 CET 2012
> >   spec1.txt:           throughput =     188667.94 SPECjbb2005 bops
> >   spec2.txt:           throughput =     190109.31 SPECjbb2005 bops
> >   spec3.txt:           throughput =     191438.13 SPECjbb2005 bops
> >   spec4.txt:           throughput =     192508.34 SPECjbb2005 bops
> >                                       --------------------------
> >         SUM:           throughput =     762723.72 SPECjbb2005 bops
> > 
> > And the same is true for !THP as well.
> 
> I could not resist to throw all relevant trees on my own 4node machine
> and run a SPECjbb 4x JVM comparison. All results have been averaged
> over 10 runs.
> 
> mainline:	v3.7-rc8
> autonuma:	mm-autonuma-v28fastr4-mels-rebase
> balancenuma:	mm-balancenuma-v10r3
> numacore:	Unified NUMA balancing tree, v3
> 
> The config is based on a F16 config with CONFIG_PREEMPT_NONE=y and the
> relevant NUMA options enabled for the 4 trees.
> 

Ok, I had PREEMPT enabled so we differ on that at least. I don't know if
that would be enough to hide the problems that led to the JVM crashing on
me with the latest version of numacore.

> THP off: manual placement result:     125239
> 
> 		Auto result	Man/Auto	Mainline/Auto	Variance
> mainline    :	     93945	0.750		1.000		 5.91%
> autonuma    :	    123651	0.987		1.316		 5.15%
> balancenuma :	     97327	0.777		1.036		 5.19%
> numacore    :	    123009	0.982		1.309		 5.73%
> 
> 
> THP on: manual placement result:     143170
> 
> 		Auto result	Auto/Manual	Auto/Mainline	Variance
> mainline    :	    104462	0.730		1.000		 8.47%
> autonuma    :	    137363	0.959		1.315		 5.81%
> balancenuma :	    112183	0.784		1.074		11.58%
> numacore    :	    142728	0.997		1.366		 2.94%
> 
> So autonuma and numacore are basically on the same page, with a slight
> advantage for numacore in the THP enabled case. balancenuma is closer
> to mainline than to autonuma/numacore.
> 

I would expect balancenuma to be closer to mainline than autonuma,
whatever about numacore, for which I get mixed results. balancenuma's
objective was not to be the best; it was meant to be a baseline on which
either autonuma or numacore could compete based on scheduler policies,
while the MM portions would be common to either. If I thought otherwise
I would have spent the last 2 weeks working on the scheduler aspects,
which would have been generally unhelpful.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [GIT TREE] Unified NUMA balancing tree, v3
  2012-12-10 19:28       ` Mel Gorman
@ 2012-12-10 20:07         ` Ingo Molnar
  2012-12-10 20:10           ` Ingo Molnar
                             ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Ingo Molnar @ 2012-12-10 20:07 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Rik van Riel, Thomas Gleixner, linux-kernel, linux-mm,
	Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Andrew Morton, Andrea Arcangeli, Linus Torvalds, Johannes Weiner,
	Hugh Dickins


* Mel Gorman <mgorman@suse.de> wrote:

> On Mon, Dec 10, 2012 at 08:15:45PM +0100, Ingo Molnar wrote:
> > 
> > * Rik van Riel <riel@redhat.com> wrote:
> > 
> > > On 12/10/2012 01:22 PM, Thomas Gleixner wrote:
> > > 
> > > > So autonuma and numacore are basically on the same page, 
> > > > with a slight advantage for numacore in the THP enabled 
> > > > case. balancenuma is closer to mainline than to 
> > > > autonuma/numacore.
> > > 
> > > Indeed, when the system is fully loaded, numacore does very 
> > > well.
> > 
> > Note that the latest (-v3) code also does well in under-loaded 
> > situations:
> > 
> >    http://lkml.org/lkml/2012/12/7/331
> > 
> > Here's the 'perf bench numa' comparison to 'balancenuma':
> > 
> >                             balancenuma  | NUMA-tip
> >  [test unit]            :          -v10  |    -v3
> > ------------------------------------------------------------
> >  2x1-bw-process         :         6.136  |  9.647:  57.2%
> >  3x1-bw-process         :         7.250  | 14.528: 100.4%
> >  4x1-bw-process         :         6.867  | 18.903: 175.3%
> >  8x1-bw-process         :         7.974  | 26.829: 236.5%
> >  8x1-bw-process-NOTHP   :         5.937  | 22.237: 274.5%
> >  16x1-bw-process        :         5.592  | 29.294: 423.9%
> >  4x1-bw-thread          :        13.598  | 19.290:  41.9%
> >  8x1-bw-thread          :        16.356  | 26.391:  61.4%
> >  16x1-bw-thread         :        24.608  | 29.557:  20.1%
> >  32x1-bw-thread         :        25.477  | 30.232:  18.7%
> >  2x3-bw-thread          :         8.785  | 15.327:  74.5%
> >  4x4-bw-thread          :         6.366  | 27.957: 339.2%
> >  4x6-bw-thread          :         6.287  | 27.877: 343.4%
> >  4x8-bw-thread          :         5.860  | 28.439: 385.3%
> >  4x8-bw-thread-NOTHP    :         6.167  | 25.067: 306.5%
> >  3x3-bw-thread          :         8.235  | 21.560: 161.8%
> >  5x5-bw-thread          :         5.762  | 26.081: 352.6%
> >  2x16-bw-thread         :         5.920  | 23.269: 293.1%
> >  1x32-bw-thread         :         5.828  | 18.985: 225.8%
> >  numa02-bw              :        29.054  | 31.431:   8.2%
> >  numa02-bw-NOTHP        :        27.064  | 29.104:   7.5%
> >  numa01-bw-thread	:        20.338  | 28.607:  40.7%
> >  numa01-bw-thread-NOTHP :        18.528  | 21.119:  14.0%
> > ------------------------------------------------------------
> > 
> > More than half of these testcases are under-loaded situations.
> > 
> > > The main issues that have been observed with numacore are when 
> > > the system is only partially loaded. Something strange seems 
> > > to be going on that causes performance regressions in that 
> > > situation.
> > 
> > I haven't seen such reports with -v3 yet, which is what Thomas 
> > tested. Mel has not tested -v3 yet AFAICS.
> > 
> 
> Yes, I have. The drop I took and the results I posted to you 
> were based on a tip/master pull from December 9th. v3 was 
> released on December 7th and your release said to test based 
> on tip/master. The results are here 
> https://lkml.org/lkml/2012/12/9/108 . Look at the columns 
> marked numafix-20121209 which is tip/master with a bodge on 
> top to remove the "if (p->nr_cpus_allowed != 
> num_online_cpus())" check.

Ah, indeed - I saw those results but the 'numafix' tag threw me 
off.

Looks like at least in terms of AutoNUMA-benchmark numbers you 
measured the best-ever results with the -v3 tree? That aspect is 
obviously good news.

This part isn't:

> > If there are any such instances left then I'll investigate, 
> > but right now it's looking pretty good.
> 
> If you had read that report, you would know that I didn't have 
> results for specjbb with THP enabled due to the JVM crashing 
> with null pointer exceptions.

Hm, it's the unified tree where most of the mm/ bits are the 
AutoNUMA bits from your tree. (It does not match 100%, because 
your tree has an ancient version of key memory usage statistics 
that the scheduler needs for its convergence model. I'll take a 
look at the differences.)

Given how well the unified kernel performs, and given that the 
segfaults occur on your box, would you be willing to debug this 
a bit and help me out fixing the bug? Thanks!

	Ingo

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [GIT TREE] Unified NUMA balancing tree, v3
  2012-12-10 20:07         ` Ingo Molnar
@ 2012-12-10 20:10           ` Ingo Molnar
  2012-12-10 21:03           ` Ingo Molnar
  2012-12-10 22:19           ` Mel Gorman
  2 siblings, 0 replies; 19+ messages in thread
From: Ingo Molnar @ 2012-12-10 20:10 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Rik van Riel, Thomas Gleixner, linux-kernel, linux-mm,
	Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Andrew Morton, Andrea Arcangeli, Linus Torvalds, Johannes Weiner,
	Hugh Dickins


* Ingo Molnar <mingo@kernel.org> wrote:

> 
> * Mel Gorman <mgorman@suse.de> wrote:
> 
> > On Mon, Dec 10, 2012 at 08:15:45PM +0100, Ingo Molnar wrote:
> > > 
> > > * Rik van Riel <riel@redhat.com> wrote:
> > > 
> > > > On 12/10/2012 01:22 PM, Thomas Gleixner wrote:
> > > > 
> > > > > So autonuma and numacore are basically on the same page, 
> > > > > with a slight advantage for numacore in the THP enabled 
> > > > > case. balancenuma is closer to mainline than to 
> > > > > autonuma/numacore.
> > > > 
> > > > Indeed, when the system is fully loaded, numacore does very 
> > > > well.
> > > 
> > > Note that the latest (-v3) code also does well in under-loaded 
> > > situations:
> > > 
> > >    http://lkml.org/lkml/2012/12/7/331
> > > 
> > > Here's the 'perf bench numa' comparison to 'balancenuma':
> > > 
> > >                             balancenuma  | NUMA-tip
> > >  [test unit]            :          -v10  |    -v3
> > > ------------------------------------------------------------
> > >  2x1-bw-process         :         6.136  |  9.647:  57.2%
> > >  3x1-bw-process         :         7.250  | 14.528: 100.4%
> > >  4x1-bw-process         :         6.867  | 18.903: 175.3%
> > >  8x1-bw-process         :         7.974  | 26.829: 236.5%
> > >  8x1-bw-process-NOTHP   :         5.937  | 22.237: 274.5%
> > >  16x1-bw-process        :         5.592  | 29.294: 423.9%
> > >  4x1-bw-thread          :        13.598  | 19.290:  41.9%
> > >  8x1-bw-thread          :        16.356  | 26.391:  61.4%
> > >  16x1-bw-thread         :        24.608  | 29.557:  20.1%
> > >  32x1-bw-thread         :        25.477  | 30.232:  18.7%
> > >  2x3-bw-thread          :         8.785  | 15.327:  74.5%
> > >  4x4-bw-thread          :         6.366  | 27.957: 339.2%
> > >  4x6-bw-thread          :         6.287  | 27.877: 343.4%
> > >  4x8-bw-thread          :         5.860  | 28.439: 385.3%
> > >  4x8-bw-thread-NOTHP    :         6.167  | 25.067: 306.5%
> > >  3x3-bw-thread          :         8.235  | 21.560: 161.8%
> > >  5x5-bw-thread          :         5.762  | 26.081: 352.6%
> > >  2x16-bw-thread         :         5.920  | 23.269: 293.1%
> > >  1x32-bw-thread         :         5.828  | 18.985: 225.8%
> > >  numa02-bw              :        29.054  | 31.431:   8.2%
> > >  numa02-bw-NOTHP        :        27.064  | 29.104:   7.5%
> > >  numa01-bw-thread	:        20.338  | 28.607:  40.7%
> > >  numa01-bw-thread-NOTHP :        18.528  | 21.119:  14.0%
> > > ------------------------------------------------------------
> > > 
> > > More than half of these testcases are under-loaded situations.
> > > 
> > > > The main issues that have been observed with numacore are when 
> > > > the system is only partially loaded. Something strange seems 
> > > > to be going on that causes performance regressions in that 
> > > > situation.
> > > 
> > > I haven't seen such reports with -v3 yet, which is what Thomas 
> > > tested. Mel has not tested -v3 yet AFAICS.
> > > 
> > 
> > Yes, I have. The drop I took and the results I posted to you 
> > were based on a tip/master pull from December 9th. v3 was 
> > released on December 7th and your release said to test based 
> > on tip/master. The results are here 
> > https://lkml.org/lkml/2012/12/9/108 . Look at the columns 
> > marked numafix-20121209 which is tip/master with a bodge on 
> > top to remove the "if (p->nr_cpus_allowed != 
> > num_online_cpus())" check.
> 
> Ah, indeed - I saw those results but the 'numafix' tag threw me 
> off.
> 
> Looks like at least in terms of AutoNUMA-benchmark numbers you 
> measured the best-ever results with the -v3 tree? That aspect 
> is obviously good news.

... at least for the numa01 row. numa02 and numa01-THREAD_ALLOC 
aren't as good yet in your tests.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [GIT TREE] Unified NUMA balancing tree, v3
  2012-12-10 20:07         ` Ingo Molnar
  2012-12-10 20:10           ` Ingo Molnar
@ 2012-12-10 21:03           ` Ingo Molnar
  2012-12-10 22:19           ` Mel Gorman
  2 siblings, 0 replies; 19+ messages in thread
From: Ingo Molnar @ 2012-12-10 21:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Rik van Riel, Thomas Gleixner, linux-kernel, linux-mm,
	Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Andrew Morton, Andrea Arcangeli, Linus Torvalds, Johannes Weiner,
	Hugh Dickins


* Ingo Molnar <mingo@kernel.org> wrote:

> > If you had read that report, you would know that I didn't 
> > have results for specjbb with THP enabled due to the JVM 
> > crashing with null pointer exceptions.
> 
> Hm, it's the unified tree where most of the mm/ bits are the 
> AutoNUMA bits from your tree. (It does not match 100%, because 
> your tree has an ancient version of key memory usage 
> statistics that the scheduler needs for its convergence model. 
> I'll take a look at the differences.)

Beyond the difference in page frame statistics and the 
difference in the handling of "4K-EMU", the bits below are the 
differences I found (on the THP side) between numa/base-v3 and 
your -v10 tree - but I'm not sure they should have any effect on 
your JVM segfault under THP ...

I tried with preemption on/off and debugging on/off, and also 
tried your .config - none of these triggers JVM segfaults with 
4x JVM or 1x JVM SPECjbb tests.

Thanks,

	Ingo

------------------------->
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c25e37c..409b2f3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -711,8 +711,7 @@ out:
 	 * run pte_offset_map on the pmd, if an huge pmd could
 	 * materialize from under us from a different thread.
 	 */
-	if (unlikely(pmd_none(*pmd)) &&
-	    unlikely(__pte_alloc(mm, vma, pmd, address)))
+	if (unlikely(__pte_alloc(mm, vma, pmd, address)))
 		return VM_FAULT_OOM;
 	/* if an huge pmd materialized from under us just retry later */
 	if (unlikely(pmd_trans_huge(*pmd)))
diff --git a/mm/memory.c b/mm/memory.c
index 8022526..30e1335 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3845,8 +3750,7 @@ retry:
 	 * run pte_offset_map on the pmd, if an huge pmd could
 	 * materialize from under us from a different thread.
 	 */
-	if (unlikely(pmd_none(*pmd)) &&
-	    unlikely(__pte_alloc(mm, vma, pmd, address)))
+	if (unlikely(pmd_none(*pmd)) && __pte_alloc(mm, vma, pmd, address))
 		return VM_FAULT_OOM;
 	/* if an huge pmd materialized from under us just retry later */
 	if (unlikely(pmd_trans_huge(*pmd)))


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [GIT TREE] Unified NUMA balancing tree, v3
  2012-12-10 20:07         ` Ingo Molnar
  2012-12-10 20:10           ` Ingo Molnar
  2012-12-10 21:03           ` Ingo Molnar
@ 2012-12-10 22:19           ` Mel Gorman
  2 siblings, 0 replies; 19+ messages in thread
From: Mel Gorman @ 2012-12-10 22:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Rik van Riel, Thomas Gleixner, linux-kernel, linux-mm,
	Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Andrew Morton, Andrea Arcangeli, Linus Torvalds, Johannes Weiner,
	Hugh Dickins

On Mon, Dec 10, 2012 at 09:07:55PM +0100, Ingo Molnar wrote:
> > > 
> > 
> > Yes, I have. The drop I took and the results I posted to you 
> > were based on a tip/master pull from December 9th. v3 was 
> > released on December 7th and your release said to test based 
> > on tip/master. The results are here 
> > https://lkml.org/lkml/2012/12/9/108 . Look at the columns 
> > marked numafix-20121209 which is tip/master with a bodge on 
> > top to remove the "if (p->nr_cpus_allowed != 
> > num_online_cpus())" check.
> 
> Ah, indeed - I saw those results but the 'numafix' tag threw me 
> off.
> 
> Looks like at least in terms of AutoNUMA-benchmark numbers you 
> measured the best-ever results with the -v3 tree? That aspect is 
> obviously good news.
> 

It's still regressing specjbb for lower numbers of warehouses, and the
single JVM with THP disabled performed very poorly. System CPU usage is
still through the roof for a number of tests, and the rate of migration
looks excessive in parts. Maybe that rate of migration really is
necessary, but it seems doubtful that so much bandwidth should be
consumed by the kernel moving data around.
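
For reference, a minimal user-space sketch of one way to estimate that
migration bandwidth from outside the kernel. It assumes the balancing
code exports a numa_pages_migrated counter in /proc/vmstat (as later
mainline kernels with CONFIG_NUMA_BALANCING do); the counter name in
this particular tree may differ:

/*
 * Rough bandwidth estimate for NUMA-balancing page migration.
 * Assumes a "numa_pages_migrated" counter in /proc/vmstat.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Read one named counter from /proc/vmstat, or -1 if it is absent. */
static long long read_vmstat(const char *name)
{
        char line[256];
        long long val = -1;
        FILE *f = fopen("/proc/vmstat", "r");

        if (!f)
                return -1;
        while (fgets(line, sizeof(line), f)) {
                char key[128];
                long long v;

                if (sscanf(line, "%127s %lld", key, &v) == 2 &&
                    !strcmp(key, name)) {
                        val = v;
                        break;
                }
        }
        fclose(f);
        return val;
}

int main(void)
{
        const long page_size = sysconf(_SC_PAGESIZE);
        long long prev = read_vmstat("numa_pages_migrated");

        if (prev < 0) {
                fprintf(stderr, "numa_pages_migrated not found in /proc/vmstat\n");
                return 1;
        }
        for (;;) {
                long long cur;

                sleep(1);
                cur = read_vmstat("numa_pages_migrated");
                /* pages migrated per second -> MB/s moved by the kernel */
                printf("%.1f MB/s migrated\n",
                       (double)(cur - prev) * page_size / (1024.0 * 1024.0));
                prev = cur;
        }
        return 0;
}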

> This part isn't:
> 
> > > If there are any such instances left then I'll investigate, 
> > > but right now it's looking pretty good.
> > 
> > If you had read that report, you would know that I didn't have 
> > results for specjbb with THP enabled due to the JVM crashing 
> > with null pointer exceptions.
> 
> Hm, it's the unified tree where most of the mm/ bits are the 
> AutoNUMA bits from your tree.

The handling of groups of PTEs as an effective hugepage is a major
difference. Holding the PTL across task_numa_fault() is another major
difference, and could be a significant contributor to the PTL-related
bottlenecks you are complaining about. The fault stats are busted, but
that's a minor issue. All of this is already in another mail:
http://www.spinics.net/lists/linux-mm/msg47888.html
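
To make the PTL point concrete, here is a stand-alone model of the
ordering difference only. The PTL is modelled by a pthread mutex and
task_numa_fault() is just a stub, so none of this matches the real code
in either tree; it only shows which work ends up inside the lock hold:

/*
 * Toy model: account the NUMA hinting fault while the "PTL" is still
 * held, versus after it has been dropped.  Everything here is a
 * stand-in for illustration, not kernel code.
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t ptl = PTHREAD_MUTEX_INITIALIZER;  /* stands in for the PTL */

static void task_numa_fault(int nid, int pages)          /* stub, not the kernel function */
{
        /* in the kernel this updates per-task fault statistics */
        (void)nid; (void)pages;
}

static void fixup_pte(void)                              /* stand-in for clearing pte_numa */
{
}

/* Variant A: accounting done under the lock (longer hold time). */
static void numa_fault_locked(int nid)
{
        pthread_mutex_lock(&ptl);
        fixup_pte();
        task_numa_fault(nid, 1);        /* extends the PTL hold time */
        pthread_mutex_unlock(&ptl);
}

/* Variant B: lock dropped first, accounting done outside it. */
static void numa_fault_unlocked(int nid)
{
        pthread_mutex_lock(&ptl);
        fixup_pte();
        pthread_mutex_unlock(&ptl);
        task_numa_fault(nid, 1);        /* no longer contributes to PTL contention */
}

int main(void)
{
        numa_fault_locked(0);
        numa_fault_unlocked(0);
        puts("both variants modeled; only the lock hold time differs");
        return 0;
}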

> (It does not match 100%, because 
> your tree has an ancient version of key memory usage statistics 
> that the scheduler needs for its convergence model. I'll take a 
> look at the differences.)
> 

I'm assuming you are referring to the last_cpuid versus last_nid
information that is fed in. That should have been a fairly minor delta
between balancenuma and numacore. It would also affect what mpol_misplaced()
returned.
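
As an illustration of that delta only (the field width and hash below
are made up, not the encoding used by either tree), the difference is
roughly whether a bare node id or a folded CPU+PID value is remembered
per page:

/*
 * Illustrative encodings: a plain last_nid (node id only) versus a
 * hashed last-CPU+PID value packed into the same narrow field.
 */
#include <stdint.h>
#include <stdio.h>

#define LAST_FIELD_BITS 16
#define LAST_FIELD_MASK ((1u << LAST_FIELD_BITS) - 1)

/* last_nid style: remember only the node that last faulted. */
static uint16_t encode_last_nid(int nid)
{
        return (uint16_t)(nid & LAST_FIELD_MASK);
}

/*
 * last-CPU+PID style: fold the faulting CPU and the low PID bits into
 * the same field, so a later fault can distinguish "same task touched
 * this before" from merely "same node touched this before".
 */
static uint16_t encode_last_cpupid(int cpu, int pid)
{
        return (uint16_t)(((unsigned)cpu ^ ((unsigned)pid << 3)) & LAST_FIELD_MASK);
}

int main(void)
{
        printf("last_nid    for node 1           : %u\n", (unsigned)encode_last_nid(1));
        printf("last_cpupid for cpu 5, pid 4321  : %u\n", (unsigned)encode_last_cpupid(5, 4321));
        printf("last_cpupid for cpu 5, pid 8765  : %u\n", (unsigned)encode_last_cpupid(5, 8765));
        return 0;
}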

> Given how well the unified kernel performs,

Except for the places where it doesn't, such as the single JVM with THP
disabled. Maybe I have a spectacularly unlucky machine.

> and given that the 
> segfaults occur on your box, would you be willing to debug this 
> a bit and help me out fixing the bug? Thanks!
> 

The machine is currently occupied running current tip/master. When it
frees up, I'll try to find the time to debug it. My strong suspicion is
that the bug is in the patch that treats batches of 4K faults as
effective hugepage faults, particularly as an earlier version of that
patch had serious problems.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2012-12-10 22:19 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-12-07  0:19 [GIT TREE] Unified NUMA balancing tree, v3 Ingo Molnar
2012-12-07  0:19 ` [PATCH 1/9] numa, sched: Fix NUMA tick ->numa_shared setting Ingo Molnar
2012-12-07  0:19 ` [PATCH 2/9] numa, sched: Add tracking of runnable NUMA tasks Ingo Molnar
2012-12-07  0:19 ` [PATCH 3/9] numa, sched: Implement wake-cpu migration support Ingo Molnar
2012-12-07  0:19 ` [PATCH 4/9] numa, mm, sched: Implement last-CPU+PID hash tracking Ingo Molnar
2012-12-07  0:19 ` [PATCH 5/9] numa, mm, sched: Fix NUMA affinity tracking logic Ingo Molnar
2012-12-07  0:19 ` [PATCH 6/9] numa, mm: Fix !THP, 4K-pte "2M-emu" NUMA fault handling Ingo Molnar
2012-12-07  0:19 ` [PATCH 7/9] numa, sched: Improve staggered convergence Ingo Molnar
2012-12-07  0:19 ` [PATCH 8/9] numa, sched: Improve directed convergence Ingo Molnar
2012-12-07  0:19 ` [PATCH 9/9] numa, sched: Streamline and fix numa_allow_migration() use Ingo Molnar
2012-12-10 18:22 ` [GIT TREE] Unified NUMA balancing tree, v3 Thomas Gleixner
2012-12-10 18:41   ` Rik van Riel
2012-12-10 19:15     ` Ingo Molnar
2012-12-10 19:28       ` Mel Gorman
2012-12-10 20:07         ` Ingo Molnar
2012-12-10 20:10           ` Ingo Molnar
2012-12-10 21:03           ` Ingo Molnar
2012-12-10 22:19           ` Mel Gorman
2012-12-10 19:32   ` Mel Gorman
