* [PATCH 00/10] Latest numa/core release, v18
@ 2012-11-30 19:58 Ingo Molnar
  2012-11-30 19:58 ` [PATCH 01/10] sched: Add "task flipping" support Ingo Molnar
                   ` (13 more replies)
  0 siblings, 14 replies; 39+ messages in thread
From: Ingo Molnar @ 2012-11-30 19:58 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

I'm pleased to announce the latest, -v18 numa/core release.

This release fixes regressions and improves NUMA performance.
It has the following main changes:

  - Introduce directed NUMA convergence, which is based on
    the 'task buddy' relation introduced in -v17, and make
    use of the new "task flipping" facility.

  - Add "related task group" balancing notion to the scheduler, to
    be able to 'compress' and 'spread' NUMA workloads
    based on which tasks relate to each other via their
    working set (i.e. which tasks access the same memory areas).

  - Track the quality and strength of NUMA convergence and
    create a feedback loop with the scheduler:

     - use it to direct migrations

     - use it to slow down and speed up the rate of the
       NUMA hinting page faults

  - Turn 4K pte NUMA faults into effective hugepage ones

  - Refine the 'shared tasks' memory interleaving logic

  - Improve CONFIG_NUMA_BALANCING=y OOM behavior

One key practical area of improvement is the NUMA convergence of
"multiple JVM" style workloads.

As a recap, this was -v17 performance with 4x SPECjbb instances
on a 4-node system (32 CPUs, 4 instances, 8 warehouses each, 240
seconds runtime, +THP):

     spec1.txt:           throughput =     177460.44 SPECjbb2005 bops
     spec2.txt:           throughput =     176175.08 SPECjbb2005 bops
     spec3.txt:           throughput =     175053.91 SPECjbb2005 bops
     spec4.txt:           throughput =     171383.52 SPECjbb2005 bops
                                      --------------------------
           SUM:           throughput =     700072.95 SPECjbb2005 bops

The new -v18 figures are:

     spec1.txt:           throughput =     191415.52 SPECjbb2005 bops 
     spec2.txt:           throughput =     193481.96 SPECjbb2005 bops 
     spec3.txt:           throughput =     192865.30 SPECjbb2005 bops 
     spec4.txt:           throughput =     191627.40 SPECjbb2005 bops 
                                           --------------------------
           SUM:           throughput =     769390.18 SPECjbb2005 bops

This is 10% faster than -v17, 22% faster than mainline, and within
1% of the hard-binding results (where each JVM is explicitly
memory- and CPU-bound to a single node).
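
( For reference, the percentages come straight from the SUM lines:

      769390.18 / 700072.95 = 1.099   ->  ~10% over -v17
      769390.18 / 630253.85 = 1.221   ->  ~22% over mainline

  using the mainline SUM quoted further below. )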

According to my measurements the -v18 NUMA kernel is also faster than
AutoNUMA (+THP-fix):

     spec1.txt:           throughput =     184327.49 SPECjbb2005 bops
     spec2.txt:           throughput =     187508.83 SPECjbb2005 bops
     spec3.txt:           throughput =     186206.44 SPECjbb2005 bops
     spec4.txt:           throughput =     188739.22 SPECjbb2005 bops
                                           --------------------------
           SUM:           throughput =     746781.98 SPECjbb2005 bops

Mainline has the following 4x JVM performance:

     spec1.txt:           throughput =     157839.25 SPECjbb2005 bops
     spec2.txt:           throughput =     156969.15 SPECjbb2005 bops
     spec3.txt:           throughput =     157571.59 SPECjbb2005 bops
     spec4.txt:           throughput =     157873.86 SPECjbb2005 bops
                                      --------------------------
           SUM:           throughput =     630253.85 SPECjbb2005 bops

Another key area of improvement is !THP (4K pages) performance.

Mainline 4x SPECjbb !THP JVM results:

     spec1.txt:           throughput =     128575.47 SPECjbb2005 bops 
     spec2.txt:           throughput =     125767.24 SPECjbb2005 bops 
     spec3.txt:           throughput =     130042.30 SPECjbb2005 bops 
     spec4.txt:           throughput =     128155.32 SPECjbb2005 bops 
                                       --------------------------
           SUM:           throughput =     512540.33 SPECjbb2005 bops


numa/core -v18 4x SPECjbb JVM !THP results:

     spec1.txt:           throughput =     158023.05 SPECjbb2005 bops 
     spec2.txt:           throughput =     156895.51 SPECjbb2005 bops 
     spec3.txt:           throughput =     156158.11 SPECjbb2005 bops 
     spec4.txt:           throughput =     157414.52 SPECjbb2005 bops 
                                      --------------------------
           SUM:           throughput =     628491.19 SPECjbb2005 bops

That too is roughly 22% faster than mainline - the !THP regression
that was reported by Mel Gorman appears to be fixed.

AutoNUMA-benchmark comparison to the mainline kernel:

 ##############
 # res-v3.6-vanilla.log vs res-numacore-v18b.log:
 #------------------------------------------------------------------------------------>
   autonuma benchmark                run time (lower is better)         speedup %
 ------------------------------------------------------------------------------------->
   numa01                           :   337.29  vs.  177.64   |           +89.8 %
   numa01_THREAD_ALLOC              :   428.79  vs.  127.07   |          +237.4 %
   numa02                           :    56.32  vs.   18.08   |          +211.5 %
   ------------------------------------------------------------

(this is similar to -v17, within noise.)

Comparison to AutoNUMA-v28 (+THP-fix):

 ##############
 # res-autonuma-v28-THP.log vs res-numacore-v18b.log:
 #------------------------------------------------------------------------------------>
   autonuma benchmark                run time (lower is better)         speedup %
 ------------------------------------------------------------------------------------->
   numa01                           :   235.77  vs.  177.64   |           +32.7 %
   numa01_THREAD_ALLOC              :   134.53  vs.  127.07   |            +5.8 %
   numa02                           :    19.49  vs.   18.08   |            +7.7 %
   ------------------------------------------------------------

One caveat: I'm still seeing problems with !THP.

Here's the analysis of one of the last remaining !THP regressions I'm
still seeing on larger systems. I have identified its source, and I
see how the AutoNUMA and 'balancenuma' trees solved this problem - but
I disagree with their solution.

When pushed hard enough via threaded workloads (for example via the
numa02 test), the upstream page migration code in mm/migrate.c
becomes unscalable, resulting in a lot of scheduling on the anon_vma
mutex and a subsequent drop in performance.

When the points of scheduling are call-graph profiled, the
unscalability appears to be due to interaction between the
following page migration code paths:

    96.43%        process 0  [kernel.kallsyms]  [k] perf_trace_sched_switch
                  |
                  --- perf_trace_sched_switch
                      __schedule
                      schedule
                      schedule_preempt_disabled
                      __mutex_lock_common.isra.6
                      __mutex_lock_slowpath
                      mutex_lock
                     |
                     |--50.61%-- rmap_walk
                     |          move_to_new_page
                     |          migrate_pages
                     |          migrate_misplaced_page
                     |          __do_numa_page.isra.69
                     |          handle_pte_fault
                     |          handle_mm_fault
                     |          __do_page_fault
                     |          do_page_fault
                     |          page_fault
                     |          __memset_sse2
                     |          |
                     |           --100.00%-- worker_thread
                     |                     |
                     |                      --100.00%-- start_thread
                     |
                      --49.39%-- page_lock_anon_vma
                                try_to_unmap_anon
                                try_to_unmap
                                migrate_pages
                                migrate_misplaced_page
                                __do_numa_page.isra.69
                                handle_pte_fault
                                handle_mm_fault
                                __do_page_fault
                                do_page_fault
                                page_fault
                                __memset_sse2
                                |
                                 --100.00%-- worker_thread
                                           start_thread

From what I can see, the AutoNUMA and 'balancenuma' kernels work
around this !THP scalability issue by rate-limiting migrations.
For example, balancenuma rate-limits migrations to about 1.2 GB/sec
of bandwidth.
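
To illustrate what I mean, here is a minimal sketch of a per-node,
window-based migration rate limiter. All names and fields below are
hypothetical (the sketch only assumes standard kernel primitives such
as jiffies, time_after() and msecs_to_jiffies()) - this is not the
actual balancenuma code:

  struct numa_ratelimit {
          unsigned long   window_end;     /* jiffies when the current window expires */
          unsigned long   nr_pages;       /* pages migrated in the current window */
  };

  /* ~1.2 GB/sec expressed as 4K pages per 100 msec window: */
  #define NUMA_RATELIMIT_WINDOW_MS        100
  #define NUMA_RATELIMIT_PAGES            ((1200UL << (20 - PAGE_SHIFT)) / 10)

  static bool numa_migration_allowed(struct numa_ratelimit *rl, unsigned long nr_pages)
  {
          /* Open a new accounting window if the previous one expired: */
          if (time_after(jiffies, rl->window_end)) {
                  rl->window_end = jiffies + msecs_to_jiffies(NUMA_RATELIMIT_WINDOW_MS);
                  rl->nr_pages = 0;
          }

          /* Refuse to migrate once this window's budget is used up: */
          if (rl->nr_pages + nr_pages > NUMA_RATELIMIT_PAGES)
                  return false;

          rl->nr_pages += nr_pages;
          return true;
  }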

Rate-limiting to paper over a scalability limit is not the right
solution IMO, because it hurts cases where migration is justified.
Migrating the working set itself is not a problem - it would in
fact be beneficial - but our implementation of it does not scale
beyond a certain rate.

( THP, which has a 512 times (2 MB / 4 KB) lower natural rate of
  migration page faults, does not run into this scalability limit. )

So this issue is still open and testers are encouraged to use THP
if they can.

These patches are on top of the -v17 tree (no point in resending
those), and they can all be found in the tip:master tree as well:

  git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git master

Please re-report any bugs and regressions that you can still see.

Reports, fixes, suggestions are welcome, as always!

Thanks,

	Ingo

--------------------->

Ingo Molnar (10):
  sched: Add "task flipping" support
  sched: Move the NUMA placement logic to a worklet
  numa, mempolicy: Improve CONFIG_NUMA_BALANCING=y OOM behavior
  mm, numa: Turn 4K pte NUMA faults into effective hugepage ones
  sched: Introduce directed NUMA convergence
  sched: Remove statistical NUMA scheduling
  sched: Track quality and strength of convergence
  sched: Converge NUMA migrations
  sched: Add convergence strength based adaptive NUMA page fault rate
  sched: Refine the 'shared tasks' memory interleaving logic

 include/linux/migrate.h        |    6 +
 include/linux/sched.h          |   12 +-
 include/uapi/linux/mempolicy.h |    1 +
 init/Kconfig                   |    1 +
 kernel/sched/core.c            |   99 ++-
 kernel/sched/fair.c            | 1913 ++++++++++++++++++++++++++++------------
 kernel/sched/features.h        |   24 +-
 kernel/sched/sched.h           |   19 +-
 kernel/sysctl.c                |   11 +-
 mm/huge_memory.c               |   50 +-
 mm/memory.c                    |  151 +++-
 mm/mempolicy.c                 |   86 +-
 mm/migrate.c                   |    3 +-
 mm/mprotect.c                  |   24 +-
 14 files changed, 1699 insertions(+), 701 deletions(-)

-- 
1.7.11.7



* [PATCH 01/10] sched: Add "task flipping" support
  2012-11-30 19:58 [PATCH 00/10] Latest numa/core release, v18 Ingo Molnar
@ 2012-11-30 19:58 ` Ingo Molnar
  2012-11-30 19:58 ` [PATCH 02/10] sched: Move the NUMA placement logic to a worklet Ingo Molnar
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 39+ messages in thread
From: Ingo Molnar @ 2012-11-30 19:58 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

NUMA balancing will make use of the new sched_rebalance_to() mode:
the ability to 'flip' two tasks.

When two tasks have a similar weight but one of them executes on
the wrong CPU or node, it's beneficial to do a quick flipping
operation. This will not change the general load of the source
and the target CPUs, so it won't disturb the scheduling balance.

With this we can do NUMA placement while the system is otherwise
in equilibrium.

The code has to be careful about races and whether the source and
target CPUs are allowed for the tasks in question.

This method is also faster: in essence it can execute two migrations
via a single migration-thread call - instead of two such calls. The
thread on the target CPU acts as the 'migration thread' for the
replaced task.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h |  1 -
 kernel/sched/core.c   | 68 +++++++++++++++++++++++++++++++++++++--------------
 kernel/sched/fair.c   |  2 +-
 kernel/sched/sched.h  |  6 +++++
 4 files changed, 57 insertions(+), 20 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8bc3a03..696492e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2020,7 +2020,6 @@ task_sched_runtime(struct task_struct *task);
 /* sched_exec is called by processes performing an exec */
 #ifdef CONFIG_SMP
 extern void sched_exec(void);
-extern void sched_rebalance_to(int dest_cpu);
 #else
 #define sched_exec()   {}
 #endif
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 93f2561..cad6c89 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -963,8 +963,8 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 }
 
 struct migration_arg {
-	struct task_struct *task;
-	int dest_cpu;
+	struct task_struct	*task;
+	int			dest_cpu;
 };
 
 static int migration_cpu_stop(void *data);
@@ -2596,22 +2596,6 @@ unlock:
 	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
 }
 
-/*
- * sched_rebalance_to()
- *
- * Active load-balance to a target CPU.
- */
-void sched_rebalance_to(int dest_cpu)
-{
-	struct task_struct *p = current;
-	struct migration_arg arg = { p, dest_cpu };
-
-	if (!cpumask_test_cpu(dest_cpu, tsk_cpus_allowed(p)))
-		return;
-
-	stop_one_cpu(raw_smp_processor_id(), migration_cpu_stop, &arg);
-}
-
 #endif
 
 DEFINE_PER_CPU(struct kernel_stat, kstat);
@@ -4778,6 +4762,54 @@ fail:
 }
 
 /*
+ * sched_rebalance_to()
+ *
+ * Active load-balance to a target CPU.
+ */
+void sched_rebalance_to(int dst_cpu, int flip_tasks)
+{
+	struct task_struct *p_src = current;
+	struct task_struct *p_dst;
+	int src_cpu = raw_smp_processor_id();
+	struct migration_arg arg = { p_src, dst_cpu };
+	struct rq *dst_rq;
+
+	if (!cpumask_test_cpu(dst_cpu, tsk_cpus_allowed(p_src)))
+		return;
+
+	if (flip_tasks) {
+		dst_rq = cpu_rq(dst_cpu);
+
+		local_irq_disable();
+		raw_spin_lock(&dst_rq->lock);
+
+		p_dst = dst_rq->curr;
+		get_task_struct(p_dst);
+
+		raw_spin_unlock(&dst_rq->lock);
+		local_irq_enable();
+	}
+
+	stop_one_cpu(src_cpu, migration_cpu_stop, &arg);
+	/*
+	 * Task-flipping.
+	 *
+	 * We are now on the new CPU - check whether we can migrate
+	 * the task we just preempted, to where we came from:
+	 */
+	if (flip_tasks) {
+		local_irq_disable();
+		if (raw_smp_processor_id() == dst_cpu) {
+ 			/* Note that the arguments flip: */
+			__migrate_task(p_dst, dst_cpu, src_cpu);
+		}
+		local_irq_enable();
+
+		put_task_struct(p_dst);
+	}
+}
+
+/*
  * migration_cpu_stop - this will be executed by a highprio stopper thread
  * and performs thread migration by bumping thread off CPU then
  * 'pushing' onto another runqueue.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5cc3620..54c1e7b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1176,7 +1176,7 @@ static void task_numa_placement(struct task_struct *p)
 		struct rq *rq = cpu_rq(p->ideal_cpu);
 
 		rq->curr_buddy = p;
-		sched_rebalance_to(p->ideal_cpu);
+		sched_rebalance_to(p->ideal_cpu, 0);
 	}
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 810a1a0..f3a284e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1259,4 +1259,10 @@ static inline u64 irq_time_read(int cpu)
 	return per_cpu(cpu_softirq_time, cpu) + per_cpu(cpu_hardirq_time, cpu);
 }
 #endif /* CONFIG_64BIT */
+#ifdef CONFIG_SMP
+extern void sched_rebalance_to(int dest_cpu, int flip_tasks);
+#else
+static inline void sched_rebalance_to(int dest_cpu, int flip_tasks) { }
+#endif
+
 #endif /* CONFIG_IRQ_TIME_ACCOUNTING */
-- 
1.7.11.7



* [PATCH 02/10] sched: Move the NUMA placement logic to a worklet
  2012-11-30 19:58 [PATCH 00/10] Latest numa/core release, v18 Ingo Molnar
  2012-11-30 19:58 ` [PATCH 01/10] sched: Add "task flipping" support Ingo Molnar
@ 2012-11-30 19:58 ` Ingo Molnar
  2012-11-30 19:58 ` [PATCH 03/10] numa, mempolicy: Improve CONFIG_NUMA_BALANCING=y OOM behavior Ingo Molnar
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 39+ messages in thread
From: Ingo Molnar @ 2012-11-30 19:58 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

As an implementation detail, to be able to do directed task placement
we have to change how task_numa_fault() interfaces with the scheduler:
instead of executing the placement logic directly from the fault
path, we now trigger a worklet, similar to how we do the NUMA
hinting fault work.

This moves placement into process context and allows the execution of the
directed task-flipping code via sched_rebalance_to().

This further decouples the NUMA hinting fault engine from
the actual NUMA placement logic.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h |   3 +-
 kernel/sched/core.c   |  21 ++++++-
 kernel/sched/fair.c   | 151 +++++++++++++++++++++++++++++++-------------------
 kernel/sched/sched.h  |   6 ++
 4 files changed, 123 insertions(+), 58 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 696492e..ce9ccd7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1512,7 +1512,8 @@ struct task_struct {
 	unsigned long numa_weight;
 	unsigned long *numa_faults;
 	unsigned long *numa_faults_curr;
-	struct callback_head numa_work;
+	struct callback_head numa_scan_work;
+	struct callback_head numa_placement_work;
 
 	struct task_struct *shared_buddy, *shared_buddy_curr;
 	unsigned long shared_buddy_faults, shared_buddy_faults_curr;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cad6c89..0324d5e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -39,6 +39,7 @@
 #include <linux/kernel_stat.h>
 #include <linux/debug_locks.h>
 #include <linux/perf_event.h>
+#include <linux/task_work.h>
 #include <linux/security.h>
 #include <linux/notifier.h>
 #include <linux/profile.h>
@@ -1558,7 +1559,6 @@ static void __sched_fork(struct task_struct *p)
 	p->numa_migrate_seq = 2;
 	p->numa_faults = NULL;
 	p->numa_scan_period = sysctl_sched_numa_scan_delay;
-	p->numa_work.next = &p->numa_work;
 
 	p->shared_buddy = NULL;
 	p->shared_buddy_faults = 0;
@@ -1570,6 +1570,25 @@ static void __sched_fork(struct task_struct *p)
 	p->numa_policy.v.preferred_node = 0;
 	p->numa_policy.v.nodes = node_online_map;
 
+	init_task_work(&p->numa_scan_work, task_numa_scan_work);
+	p->numa_scan_work.next = &p->numa_scan_work;
+
+	init_task_work(&p->numa_placement_work, task_numa_placement_work);
+	p->numa_placement_work.next = &p->numa_placement_work;
+
+	if (p->mm) {
+		int entries = 2*nr_node_ids;
+		int size = sizeof(*p->numa_faults) * entries;
+
+		/*
+		 * For efficiency reasons we allocate ->numa_faults[]
+		 * and ->numa_faults_curr[] at once and split the
+		 * buffer we get. They are separate otherwise.
+		 */
+		p->numa_faults = kzalloc(2*size, GFP_KERNEL);
+		if (p->numa_faults)
+			p->numa_faults_curr = p->numa_faults + entries;
+	}
 #endif /* CONFIG_NUMA_BALANCING */
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 54c1e7b..fda1b63 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1063,19 +1063,18 @@ clear_buddy:
 	p->ideal_cpu_curr		= -1;
 }
 
-static void task_numa_placement(struct task_struct *p)
+/*
+ * Called every couple of hundred milliseconds in the task's
+ * execution life-time, this function decides whether to
+ * change placement parameters:
+ */
+static void task_numa_placement_tick(struct task_struct *p)
 {
-	int seq = ACCESS_ONCE(p->mm->numa_scan_seq);
 	unsigned long total[2] = { 0, 0 };
 	unsigned long faults, max_faults = 0;
 	int node, priv, shared, max_node = -1;
 	int this_node;
 
-	if (p->numa_scan_seq == seq)
-		return;
-
-	p->numa_scan_seq = seq;
-
 	/*
 	 * Update the fault average with the result of the latest
 	 * scan:
@@ -1280,43 +1279,24 @@ void task_numa_fault(int node, int last_cpu, int pages)
 	int idx = 2*node + priv;
 
 	WARN_ON_ONCE(last_cpu == -1 || node == -1);
-
-	if (unlikely(!p->numa_faults)) {
-		int entries = 2*nr_node_ids;
-		int size = sizeof(*p->numa_faults) * entries;
-
-		p->numa_faults = kzalloc(2*size, GFP_KERNEL);
-		if (!p->numa_faults)
-			return;
-		/*
-		 * For efficiency reasons we allocate ->numa_faults[]
-		 * and ->numa_faults_curr[] at once and split the
-		 * buffer we get. They are separate otherwise.
-		 */
-		p->numa_faults_curr = p->numa_faults + entries;
-	}
+	BUG_ON(!p->numa_faults);
 
 	p->numa_faults_curr[idx] += pages;
 	shared_fault_tick(p, node, last_cpu, pages);
-	task_numa_placement(p);
 }
 
 /*
  * The expensive part of numa migration is done from task_work context.
  * Triggered from task_tick_numa().
  */
-void task_numa_work(struct callback_head *work)
+void task_numa_placement_work(struct callback_head *work)
 {
-	long pages_total, pages_left, pages_changed;
-	unsigned long migrate, next_scan, now = jiffies;
-	unsigned long start0, start, end;
 	struct task_struct *p = current;
-	struct mm_struct *mm = p->mm;
-	struct vm_area_struct *vma;
 
-	WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
+	WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_placement_work));
 
 	work->next = work; /* protect against double add */
+
 	/*
 	 * Who cares about NUMA placement when they're dying.
 	 *
@@ -1328,6 +1308,29 @@ void task_numa_work(struct callback_head *work)
 	if (p->flags & PF_EXITING)
 		return;
 
+	task_numa_placement_tick(p);
+}
+
+/*
+ * The expensive part of numa migration is done from task_work context.
+ * Triggered from task_tick_numa().
+ */
+void task_numa_scan_work(struct callback_head *work)
+{
+	long pages_total, pages_left, pages_changed;
+	unsigned long migrate, next_scan, now = jiffies;
+	unsigned long start0, start, end;
+	struct task_struct *p = current;
+	struct mm_struct *mm = p->mm;
+	struct vm_area_struct *vma;
+
+	WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_scan_work));
+
+	work->next = work; /* protect against double add */
+
+	if (p->flags & PF_EXITING)
+		return;
+
 	/*
 	 * Enforce maximal scan/migration frequency..
 	 */
@@ -1383,15 +1386,12 @@ out:
 /*
  * Drive the periodic memory faults..
  */
-void task_tick_numa(struct rq *rq, struct task_struct *curr)
+static void task_tick_numa_scan(struct rq *rq, struct task_struct *curr)
 {
-	struct callback_head *work = &curr->numa_work;
+	struct callback_head *work = &curr->numa_scan_work;
 	u64 period, now;
 
-	/*
-	 * We don't care about NUMA placement if we don't have memory.
-	 */
-	if (!curr->mm || (curr->flags & PF_EXITING) || work->next != work)
+	if (work->next != work)
 		return;
 
 	/*
@@ -1403,28 +1403,67 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
 	now = curr->se.sum_exec_runtime;
 	period = (u64)curr->numa_scan_period * NSEC_PER_MSEC;
 
-	if (now - curr->node_stamp > period) {
-		curr->node_stamp += period;
-		curr->numa_scan_period = sysctl_sched_numa_scan_period_min;
+	if (now - curr->node_stamp <= period)
+		return;
 
-		/*
-		 * We are comparing runtime to wall clock time here, which
-		 * puts a maximum scan frequency limit on the task work.
-		 *
-		 * This, together with the limits in task_numa_work() filters
-		 * us from over-sampling if there are many threads: if all
-		 * threads happen to come in at the same time we don't create a
-		 * spike in overhead.
-		 *
-		 * We also avoid multiple threads scanning at once in parallel to
-		 * each other.
-		 */
-		if (!time_before(jiffies, curr->mm->numa_next_scan)) {
-			init_task_work(work, task_numa_work); /* TODO: move this into sched_fork() */
-			task_work_add(curr, work, true);
-		}
-	}
+	curr->node_stamp += period;
+	curr->numa_scan_period = sysctl_sched_numa_scan_period_min;
+
+	/*
+	 * We are comparing runtime to wall clock time here, which
+	 * puts a maximum scan frequency limit on the task work.
+	 *
+	 * This, together with the limits in task_numa_work() filters
+	 * us from over-sampling if there are many threads: if all
+	 * threads happen to come in at the same time we don't create a
+	 * spike in overhead.
+	 *
+	 * We also avoid multiple threads scanning at once in parallel to
+	 * each other.
+	 */
+	if (time_before(jiffies, curr->mm->numa_next_scan))
+		return;
+
+	task_work_add(curr, work, true);
 }
+
+/*
+ * Drive the placement logic:
+ */
+static void task_tick_numa_placement(struct rq *rq, struct task_struct *curr)
+{
+	struct callback_head *work = &curr->numa_placement_work;
+	int seq;
+
+	if (work->next != work)
+		return;
+
+	/*
+	 * Check whether we should run task_numa_placement(),
+	 * and if yes, activate the worklet:
+	 */
+	seq = ACCESS_ONCE(curr->mm->numa_scan_seq);
+
+	if (curr->numa_scan_seq == seq)
+		return;
+
+	curr->numa_scan_seq = seq;
+	task_work_add(curr, work, true);
+}
+
+static void task_tick_numa(struct rq *rq, struct task_struct *curr)
+{
+	/*
+	 * We don't care about NUMA placement if we don't have memory
+	 * or are exiting:
+	 */
+	if (!curr->mm || (curr->flags & PF_EXITING))
+		return;
+
+	task_tick_numa_scan(rq, curr);
+	task_tick_numa_placement(rq, curr);
+}
+
 #else /* !CONFIG_NUMA_BALANCING: */
 #ifdef CONFIG_SMP
 static inline int task_ideal_cpu(struct task_struct *p)				{ return -1; }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f3a284e..7e53cbf 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1259,6 +1259,12 @@ static inline u64 irq_time_read(int cpu)
 	return per_cpu(cpu_softirq_time, cpu) + per_cpu(cpu_hardirq_time, cpu);
 }
 #endif /* CONFIG_64BIT */
+
+#ifdef CONFIG_NUMA_BALANCING
+extern void task_numa_scan_work(struct callback_head *work);
+extern void task_numa_placement_work(struct callback_head *work);
+#endif
+
 #ifdef CONFIG_SMP
 extern void sched_rebalance_to(int dest_cpu, int flip_tasks);
 #else
-- 
1.7.11.7



* [PATCH 03/10] numa, mempolicy: Improve CONFIG_NUMA_BALANCING=y OOM behavior
  2012-11-30 19:58 [PATCH 00/10] Latest numa/core release, v18 Ingo Molnar
  2012-11-30 19:58 ` [PATCH 01/10] sched: Add "task flipping" support Ingo Molnar
  2012-11-30 19:58 ` [PATCH 02/10] sched: Move the NUMA placement logic to a worklet Ingo Molnar
@ 2012-11-30 19:58 ` Ingo Molnar
  2012-11-30 19:58 ` [PATCH 04/10] mm, numa: Turn 4K pte NUMA faults into effective hugepage ones Ingo Molnar
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 39+ messages in thread
From: Ingo Molnar @ 2012-11-30 19:58 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

Zhouping Liu reported worse out-of-memory behavior with
CONFIG_NUMA_BALANCING=y, compared to the mainline kernel.

One reason for that change in behavior is that with typical
applications the mainline kernel allocates memory essentially
randomly, and leaves it where it was.

"Random" placement is not the worst possible placement - in fact
it's a pretty good placement strategy. It's definitely possible
for a NUMA-aware kernel to do worse than that, and
CONFIG_NUMA_BALANCING=y regressed because it's very opinionated
about which node tasks should execute on and which node they
should allocate memory from.

One such problematic case is when a node has already used up
most of its memory - in that case it's pointless trying to
allocate even more memory from there. Doing so would trigger
OOMs even though the system still has free memory on other nodes.

The migration code is already trying to be nice when allocating
memory for NUMA purposes - extend this concept to mempolicy
driven allocations as well.

Expose migrate_balanced_pgdat() and use it. If all else fails, try
just as hard as the old code would.

Hopefully this improves behavior in memory allocation corner
cases.

[ migrate_balanced_pgdat() should probably be moved to
  mm/page_alloc.c and be renamed to balanced_pgdat() or
  so - but this patch tries to be minimalistic. ]

Reported-by: Zhouping Liu <zliu@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/migrate.h        |  6 +++
 include/uapi/linux/mempolicy.h |  1 +
 kernel/sched/core.c            |  2 +-
 mm/huge_memory.c               |  9 +++++
 mm/mempolicy.c                 | 86 +++++++++++++++++++++++++++++++++++-------
 mm/migrate.c                   |  3 +-
 6 files changed, 90 insertions(+), 17 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 72665c9..e5c900f 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -31,6 +31,7 @@ extern void migrate_page_copy(struct page *newpage, struct page *page);
 extern int migrate_huge_page_move_mapping(struct address_space *mapping,
 				  struct page *newpage, struct page *page);
 extern int migrate_misplaced_page(struct page *page, int node);
+extern bool migrate_balanced_pgdat(struct pglist_data *pgdat, int nr_migrate_pages);
 #else
 
 static inline void putback_lru_pages(struct list_head *l) {}
@@ -60,6 +61,11 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 	return -ENOSYS;
 }
 
+static inline bool migrate_balanced_pgdat(struct pglist_data *pgdat, int nr_migrate_pages)
+{
+	return true;
+}
+
 /* Possible settings for the migrate_page() method in address_operations */
 #define migrate_page NULL
 #define fail_migrate_page NULL
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 23e62e0..5accdc3 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -44,6 +44,7 @@ enum mpol_rebind_step {
 #define MPOL_F_NODE	(1<<0)	/* return next IL mode instead of node mask */
 #define MPOL_F_ADDR	(1<<1)	/* look up vma using address */
 #define MPOL_F_MEMS_ALLOWED (1<<2) /* return allowed memories */
+#define MPOL_F_MOF	(1<<3)	/* Migrate On Fault */
 
 /* Flags for mbind */
 #define MPOL_MF_STRICT	(1<<0)	/* Verify existing pages in the mapping */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0324d5e..129924a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1566,7 +1566,7 @@ static void __sched_fork(struct task_struct *p)
 	p->ideal_cpu_curr = -1;
 	atomic_set(&p->numa_policy.refcnt, 1);
 	p->numa_policy.mode = MPOL_INTERLEAVE;
-	p->numa_policy.flags = 0;
+	p->numa_policy.flags = MPOL_F_MOF;
 	p->numa_policy.v.preferred_node = 0;
 	p->numa_policy.v.nodes = node_online_map;
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 92e101f..977834c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -788,6 +788,15 @@ unlock:
 migrate:
 	spin_unlock(&mm->page_table_lock);
 
+	/*
+	 * If this node is getting full then don't migrate even
+ 	 * more pages here:
+ 	 */
+	if (!migrate_balanced_pgdat(NODE_DATA(node), HPAGE_PMD_NR)) {
+		put_page(page);
+		return;
+	}
+
 	lock_page(page);
 	spin_lock(&mm->page_table_lock);
 	if (unlikely(!pmd_same(*pmd, entry))) {
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d71a93d..081a505 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -115,7 +115,7 @@ enum zone_type policy_zone = 0;
 static struct mempolicy default_policy_local = {
 	.refcnt		= ATOMIC_INIT(1), /* never free it */
 	.mode		= MPOL_PREFERRED,
-	.flags		= MPOL_F_LOCAL,
+	.flags		= MPOL_F_LOCAL | MPOL_F_MOF,
 };
 
 static struct mempolicy *default_policy(void)
@@ -1675,11 +1675,14 @@ unsigned slab_node(void)
 		struct zonelist *zonelist;
 		struct zone *zone;
 		enum zone_type highest_zoneidx = gfp_zone(GFP_KERNEL);
+		int node;
+
 		zonelist = &NODE_DATA(numa_node_id())->node_zonelists[0];
 		(void)first_zones_zonelist(zonelist, highest_zoneidx,
 							&policy->v.nodes,
 							&zone);
-		return zone ? zone->node : numa_node_id();
+		node = zone ? zone->node : numa_node_id();
+		return node;
 	}
 
 	default:
@@ -1889,6 +1892,62 @@ static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
 	return page;
 }
 
+static struct page *
+alloc_pages_nice(gfp_t gfp, int order, struct mempolicy *pol, int best_nid)
+{
+	struct zonelist *zl = policy_zonelist(gfp, pol, best_nid);
+#ifdef CONFIG_NUMA_BALANCING
+	unsigned int pages = 1 << order;
+	gfp_t gfp_nice = gfp | GFP_THISNODE;
+#endif
+	struct page *page = NULL;
+	nodemask_t *nodemask;
+
+	nodemask = policy_nodemask(gfp, pol);
+
+#ifdef CONFIG_NUMA_BALANCING
+	if (migrate_balanced_pgdat(NODE_DATA(best_nid), pages)) {
+		page = alloc_pages_node(best_nid, gfp_nice, order);
+		if (page)
+			return page;
+	}
+
+	/*
+	 * For non-hard-bound tasks, see whether there's another node
+	 * before trying harder:
+	 */
+	if (current->nr_cpus_allowed > 1) {
+		int nid;
+
+		if (nodemask) {
+			int first_nid = find_first_bit(nodemask->bits, MAX_NUMNODES);
+
+			page = alloc_pages_node(first_nid, gfp_nice, order);
+			if (page)
+				return page;
+		}
+
+		/*
+		 * Pick a less loaded node, if possible:
+		 */
+		for_each_node(nid) {
+			if (!migrate_balanced_pgdat(NODE_DATA(nid), pages))
+				continue;
+
+			page = alloc_pages_node(nid, gfp_nice, order);
+			if (page)
+				return page;
+		}
+	}
+#endif
+
+	/* If all failed then try the original plan: */
+	if (!page)
+		page = __alloc_pages_nodemask(gfp, order, zl, nodemask);
+
+	return page;
+}
+
 /**
  * 	alloc_pages_vma	- Allocate a page for a VMA.
  *
@@ -1917,8 +1976,7 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		unsigned long addr, int node)
 {
 	struct mempolicy *pol;
-	struct zonelist *zl;
-	struct page *page;
+	struct page *page = NULL;
 	unsigned int cpuset_mems_cookie;
 
 retry_cpuset:
@@ -1936,13 +1994,12 @@ retry_cpuset:
 
 		return page;
 	}
-	zl = policy_zonelist(gfp, pol, node);
 	if (unlikely(mpol_needs_cond_ref(pol))) {
 		/*
 		 * slow path: ref counted shared policy
 		 */
-		struct page *page =  __alloc_pages_nodemask(gfp, order,
-						zl, policy_nodemask(gfp, pol));
+		page = alloc_pages_nice(gfp, order, pol, node);
+
 		__mpol_put(pol);
 		if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
 			goto retry_cpuset;
@@ -1951,10 +2008,10 @@ retry_cpuset:
 	/*
 	 * fast path:  default or task policy
 	 */
-	page = __alloc_pages_nodemask(gfp, order, zl,
-				      policy_nodemask(gfp, pol));
+	page = alloc_pages_nice(gfp, order, pol, node);
 	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
 		goto retry_cpuset;
+
 	return page;
 }
 
@@ -1980,8 +2037,8 @@ retry_cpuset:
 struct page *alloc_pages_current(gfp_t gfp, unsigned order)
 {
 	struct mempolicy *pol = current->mempolicy;
-	struct page *page;
 	unsigned int cpuset_mems_cookie;
+	struct page *page;
 
 	if (!pol || in_interrupt() || (gfp & __GFP_THISNODE))
 		pol = default_policy();
@@ -1996,9 +2053,7 @@ retry_cpuset:
 	if (pol->mode == MPOL_INTERLEAVE)
 		page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
 	else
-		page = __alloc_pages_nodemask(gfp, order,
-				policy_zonelist(gfp, pol, numa_node_id()),
-				policy_nodemask(gfp, pol));
+		page = alloc_pages_nice(gfp, order, pol, numa_node_id());
 
 	if (unlikely(!put_mems_allowed(cpuset_mems_cookie) && !page))
 		goto retry_cpuset;
@@ -2275,7 +2330,10 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 	cpu_last_access = page_xchg_last_cpu(page, this_cpu);
 
 	pol = get_vma_policy(current, vma, addr);
-	if (!(task_numa_shared(current) >= 0))
+
+	if (task_numa_shared(current) < 0)
+		goto out_keep_page;
+	if (!(pol->flags & MPOL_F_MOF))
 		goto out_keep_page;
 
 	switch (pol->mode) {
diff --git a/mm/migrate.c b/mm/migrate.c
index 16a4709..3db0543 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1408,8 +1408,7 @@ int migrate_vmas(struct mm_struct *mm, const nodemask_t *to,
  * Returns true if this is a safe migration target node for misplaced NUMA
  * pages. Currently it only checks the watermarks which is a bit crude.
  */
-static bool migrate_balanced_pgdat(struct pglist_data *pgdat,
-				   int nr_migrate_pages)
+bool migrate_balanced_pgdat(struct pglist_data *pgdat, int nr_migrate_pages)
 {
 	int z;
 
-- 
1.7.11.7



* [PATCH 04/10] mm, numa: Turn 4K pte NUMA faults into effective hugepage ones
  2012-11-30 19:58 [PATCH 00/10] Latest numa/core release, v18 Ingo Molnar
                   ` (2 preceding siblings ...)
  2012-11-30 19:58 ` [PATCH 03/10] numa, mempolicy: Improve CONFIG_NUMA_BALANCING=y OOM behavior Ingo Molnar
@ 2012-11-30 19:58 ` Ingo Molnar
  2012-11-30 19:58 ` [PATCH 05/10] sched: Introduce directed NUMA convergence Ingo Molnar
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 39+ messages in thread
From: Ingo Molnar @ 2012-11-30 19:58 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

Reduce the 4K page fault count by looking around and processing
nearby pages if possible.

To keep the logic and cache overhead simple and straightforward,
we make a couple of simplifications:

 - we only scan within the HPAGE_SIZE range around the faulting address
 - we only go as far as the vma allows

While at it, also simplify the do_numa_page() flow and fix the
previous double faulting we incurred due to not properly fixing
up freshly migrated ptes.

Furthermore, simplify the THP fault processing code and make
the change_protection() code more robust.

Suggested-by: Mel Gorman <mgorman@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 mm/huge_memory.c |  51 ++++++++++++-------
 mm/memory.c      | 151 ++++++++++++++++++++++++++++++++++++++++++-------------
 mm/mprotect.c    |  24 +++++++--
 3 files changed, 167 insertions(+), 59 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 977834c..5c8de10 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -739,6 +739,7 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct mem_cgroup *memcg = NULL;
 	struct page *new_page;
 	struct page *page = NULL;
+	int page_nid = -1;
 	int last_cpu;
 	int node = -1;
 
@@ -754,12 +755,11 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	page = pmd_page(entry);
 	if (page) {
-		int page_nid = page_to_nid(page);
+		page_nid = page_to_nid(page);
 
 		VM_BUG_ON(!PageCompound(page) || !PageHead(page));
 		last_cpu = page_last_cpu(page);
 
-		get_page(page);
 		/*
 		 * Note that migrating pages shared by others is safe, since
 		 * get_user_pages() or GUP fast would have to fault this page
@@ -769,6 +769,8 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		node = mpol_misplaced(page, vma, haddr);
 		if (node != -1 && node != page_nid)
 			goto migrate;
+
+		task_numa_fault(page_nid, last_cpu, HPAGE_PMD_NR);
 	}
 
 fixup:
@@ -779,32 +781,33 @@ fixup:
 
 unlock:
 	spin_unlock(&mm->page_table_lock);
-	if (page) {
-		task_numa_fault(page_to_nid(page), last_cpu, HPAGE_PMD_NR);
-		put_page(page);
-	}
 	return;
 
 migrate:
-	spin_unlock(&mm->page_table_lock);
-
 	/*
 	 * If this node is getting full then don't migrate even
  	 * more pages here:
  	 */
-	if (!migrate_balanced_pgdat(NODE_DATA(node), HPAGE_PMD_NR)) {
-		put_page(page);
-		return;
-	}
+	if (!migrate_balanced_pgdat(NODE_DATA(node), HPAGE_PMD_NR))
+		goto fixup;
 
-	lock_page(page);
-	spin_lock(&mm->page_table_lock);
-	if (unlikely(!pmd_same(*pmd, entry))) {
+	get_page(page);
+
+	/*
+	 * If we cannot lock the page immediately then wait for it
+	 * to migrate and re-take the fault (which might not be
+	 * necessary if the migrating task fixed up the pmd):
+	 */
+	if (!trylock_page(page)) {
 		spin_unlock(&mm->page_table_lock);
+
+		lock_page(page);
 		unlock_page(page);
 		put_page(page);
+
 		return;
 	}
+
 	spin_unlock(&mm->page_table_lock);
 
 	new_page = alloc_pages_node(node,
@@ -884,12 +887,13 @@ migrate:
 
 alloc_fail:
 	unlock_page(page);
+
 	spin_lock(&mm->page_table_lock);
-	if (unlikely(!pmd_same(*pmd, entry))) {
-		put_page(page);
-		page = NULL;
+	put_page(page);
+
+	if (unlikely(!pmd_same(*pmd, entry)))
 		goto unlock;
-	}
+
 	goto fixup;
 }
 #endif
@@ -1275,9 +1279,18 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 
 	if (__pmd_trans_huge_lock(pmd, vma) == 1) {
 		pmd_t entry;
+
 		entry = pmdp_get_and_clear(mm, addr, pmd);
 		entry = pmd_modify(entry, newprot);
+
+		if (pmd_numa(vma, entry)) {
+			struct page *page = pmd_page(*pmd);
+
+ 			if (page_mapcount(page) != 1)
+				goto skip;
+		}
 		set_pmd_at(mm, addr, pmd, entry);
+skip:
 		spin_unlock(&vma->vm_mm->page_table_lock);
 		ret = 1;
 	}
diff --git a/mm/memory.c b/mm/memory.c
index 1f733dc..c6884e8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3457,64 +3457,143 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
 }
 
-static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+static int __do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, pte_t *ptep, pmd_t *pmd,
-			unsigned int flags, pte_t entry)
+			unsigned int flags, pte_t entry, spinlock_t *ptl)
 {
-	struct page *page = NULL;
-	int node, page_nid = -1;
-	int last_cpu = -1;
-	spinlock_t *ptl;
+	struct page *page;
+	int best_node;
+	int last_cpu;
+	int page_nid;
 
-	ptl = pte_lockptr(mm, pmd);
-	spin_lock(ptl);
-	if (unlikely(!pte_same(*ptep, entry)))
-		goto out_unlock;
+	WARN_ON_ONCE(pmd_trans_splitting(*pmd));
 
 	page = vm_normal_page(vma, address, entry);
-	if (page) {
-		get_page(page);
-		page_nid = page_to_nid(page);
-		last_cpu = page_last_cpu(page);
-		node = mpol_misplaced(page, vma, address);
-		if (node != -1 && node != page_nid)
-			goto migrate;
-	}
 
-out_pte_upgrade_unlock:
 	flush_cache_page(vma, address, pte_pfn(entry));
-
 	ptep_modify_prot_start(mm, address, ptep);
 	entry = pte_modify(entry, vma->vm_page_prot);
-	ptep_modify_prot_commit(mm, address, ptep, entry);
 
-	/* No TLB flush needed because we upgraded the PTE */
+	/* Be careful: */
+	if (pte_dirty(entry) && page && PageAnon(page) && (page_mapcount(page) == 1))
+		entry = pte_mkwrite(entry);
 
+	ptep_modify_prot_commit(mm, address, ptep, entry);
+	/* No TLB flush needed because we upgraded the PTE */
 	update_mmu_cache(vma, address, ptep);
 
-out_unlock:
-	pte_unmap_unlock(ptep, ptl);
+	if (!page)
+		return 0;
 
-	if (page) {
+	page_nid = page_to_nid(page);
+	last_cpu = page_last_cpu(page);
+	best_node = mpol_misplaced(page, vma, address);
+
+	if (best_node == -1 || best_node == page_nid || page_mapcount(page) != 1) {
 		task_numa_fault(page_nid, last_cpu, 1);
-		put_page(page);
+		return 0;
 	}
-out:
-	return 0;
 
-migrate:
+	/* Start the migration: */
+
+	get_page(page);
 	pte_unmap_unlock(ptep, ptl);
 
-	if (migrate_misplaced_page(page, node)) {
-		goto out;
+	/* Drops the page reference */
+	if (migrate_misplaced_page(page, best_node))
+		task_numa_fault(best_node, last_cpu, 1);
+
+	spin_lock(ptl);
+	return 0;
+}
+
+/*
+ * Also fault over nearby ptes from within the same pmd and vma,
+ * in order to minimize the overhead from page fault exceptions:
+ */
+static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+			unsigned long addr0, pte_t *ptep0, pmd_t *pmd,
+			unsigned int flags, pte_t entry0)
+{
+	unsigned long addr0_pmd;
+	unsigned long addr_start;
+	unsigned long addr;
+	struct page *page;
+	spinlock_t *ptl;
+	pte_t *ptep_start;
+	pte_t *ptep;
+	pte_t entry;
+
+	WARN_ON_ONCE(addr0 < vma->vm_start || addr0 >= vma->vm_end);
+
+	addr0_pmd = addr0 & PMD_MASK;
+	addr_start = max(addr0_pmd, vma->vm_start);
+
+	/*
+	 * Serialize the 2MB clustering of this NUMA probing
+	 * pte by taking the lock of the pmd level page.
+	 *
+	 * This allows the whole HPAGE_SIZE-sized NUMA operation
+	 * that was already started by another thread to be
+	 * finished, without us interfering.
+	 *
+	 * It's not like that we are likely to make any meaningful
+	 * progress while the NUMA pte handling logic is running
+	 * in another thread, so we (and other threads) don't
+	 * waste CPU time taking the ptl lock and excessive page
+	 * faults and scheduling.
+	 *
+	 * ( This is also roughly analogous to the serialization of
+	 *   a real 2MB huge page fault. )
+	 */
+	spin_lock(&mm->page_table_lock);
+	page = pmd_page(*pmd);
+	WARN_ON_ONCE(!page);
+	get_page(page);
+	spin_unlock(&mm->page_table_lock);
+
+	if (!trylock_page(page)) {
+
+		lock_page(page);
+		unlock_page(page);
+		put_page(page);
+
+		/* The pte has most likely been resolved by another thread meanwhile */
+
+		return 0;
 	}
-	page = NULL;
 
-	ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
-	if (!pte_same(*ptep, entry))
-		goto out_unlock;
+	ptep_start = pte_offset_map(pmd, addr_start);
+	ptl = pte_lockptr(mm, pmd);
+	spin_lock(ptl);
+
+	ptep = ptep_start+1;
 
-	goto out_pte_upgrade_unlock;
+	for (addr = addr_start+PAGE_SIZE; addr < vma->vm_end; addr += PAGE_SIZE, ptep++) {
+
+		if ((addr & PMD_MASK) != addr0_pmd)
+			break;
+
+		entry = ACCESS_ONCE(*ptep);
+
+		if (!pte_present(entry))
+			continue;
+		if (!pte_numa(vma, entry))
+			continue;
+
+		__do_numa_page(mm, vma, addr, ptep, pmd, flags, entry, ptl);
+	}
+
+	entry = ACCESS_ONCE(*ptep_start);
+	if (pte_present(entry) && pte_numa(vma, entry))
+		__do_numa_page(mm, vma, addr_start, ptep_start, pmd, flags, entry, ptl);
+
+	pte_unmap_unlock(ptep_start, ptl);
+
+	unlock_page(page);
+	put_page(page);
+
+	return 0;
 }
 
 /*
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 6ff2d5e..7bb3536 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -28,16 +28,19 @@
 #include <asm/cacheflush.h>
 #include <asm/tlbflush.h>
 
-static unsigned long change_pte_range(struct mm_struct *mm, pmd_t *pmd,
+static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long addr, unsigned long end, pgprot_t newprot,
 		int dirty_accountable)
 {
+	struct mm_struct *mm = vma->vm_mm;
 	pte_t *pte, oldpte;
+	struct page *page;
 	spinlock_t *ptl;
 	unsigned long pages = 0;
 
 	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
 	arch_enter_lazy_mmu_mode();
+
 	do {
 		oldpte = *pte;
 		if (pte_present(oldpte)) {
@@ -46,6 +49,18 @@ static unsigned long change_pte_range(struct mm_struct *mm, pmd_t *pmd,
 			ptent = ptep_modify_prot_start(mm, addr, pte);
 			ptent = pte_modify(ptent, newprot);
 
+			/* Are we turning it into a NUMA entry? */
+			if (pte_numa(vma, ptent)) {
+				page = vm_normal_page(vma, addr, oldpte);
+
+				/* Skip all but private pages: */
+				if (!page || !PageAnon(page) || page_mapcount(page) != 1)
+					ptent = oldpte;
+				else
+					pages++;
+			} else {
+				pages++;
+			}
 			/*
 			 * Avoid taking write faults for pages we know to be
 			 * dirty.
@@ -54,7 +69,6 @@ static unsigned long change_pte_range(struct mm_struct *mm, pmd_t *pmd,
 				ptent = pte_mkwrite(ptent);
 
 			ptep_modify_prot_commit(mm, addr, pte, ptent);
-			pages++;
 		} else if (IS_ENABLED(CONFIG_MIGRATION) && !pte_file(oldpte)) {
 			swp_entry_t entry = pte_to_swp_entry(oldpte);
 
@@ -98,7 +112,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, pud_t *
 		}
 		if (pmd_none_or_clear_bad(pmd))
 			continue;
-		pages += change_pte_range(vma->vm_mm, pmd, addr, next, newprot,
+		pages += change_pte_range(vma, pmd, addr, next, newprot,
 				 dirty_accountable);
 	} while (pmd++, addr = next, addr != end);
 
@@ -135,7 +149,9 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
 	unsigned long start = addr;
 	unsigned long pages = 0;
 
-	BUG_ON(addr >= end);
+	if (WARN_ON_ONCE(addr >= end))
+		return 0;
+
 	pgd = pgd_offset(mm, addr);
 	flush_cache_range(vma, addr, end);
 	do {
-- 
1.7.11.7



* [PATCH 05/10] sched: Introduce directed NUMA convergence
  2012-11-30 19:58 [PATCH 00/10] Latest numa/core release, v18 Ingo Molnar
                   ` (3 preceding siblings ...)
  2012-11-30 19:58 ` [PATCH 04/10] mm, numa: Turn 4K pte NUMA faults into effective hugepage ones Ingo Molnar
@ 2012-11-30 19:58 ` Ingo Molnar
  2012-11-30 19:58 ` [PATCH 06/10] sched: Remove statistical NUMA scheduling Ingo Molnar
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 39+ messages in thread
From: Ingo Molnar @ 2012-11-30 19:58 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

This patch replaces the purely statistical NUMA convergence
method with a directed one.

These balancing functions get called when CPU loads are otherwise
more or less in balance; at that point we check whether a NUMA task
wants to migrate to another node, to improve NUMA task group clustering.

Our general goal is to converge the load in a way that
minimizes inter-node memory access traffic. We do this
in a 'directed', non-statistical way, which drastically
improves the speed of convergence.

We do this directed convergence by using the 'memory buddy' task
relation, which we build out of the page::last_cpu statistics driven
by the NUMA hinting page faults, plus the following two sets of
directed task migration rules.

The first set of rules 'compresses' groups by moving related
tasks closer to each other.

The second set of rules 'spreads' groups, when compression
creates a stable but not yet converging (optimal) layout
of tasks.

1) "group spreading"

This rule checks whether the smallest group on the current node
could move to another node.

This logic is implemented in the improve_group_balance_spread()
function.

2) "group compression"

This logic triggers if the 'spreading' logic finds no more
work to do.

First we search for the 'maximum node', i.e. the node on
which we have the most buddies, but which is not yet
completely full with our buddies.

If this 'maximum node' is the current node, then we stop.

If this 'maximum node' is a different node from the current
node then we check the size of the smallest buddy group on
it.

If our own buddy group size on that node is equal to or larger
than the minimum buddy group size, then we can disrupt the
smallest group and migrate to one of their CPUs - even if
that CPU is not idle.

Special cases: idle tasks, non-NUMA tasks or NUMA-private
tasks are special 1-task 'buddy groups' and are preferred
over NUMA-shared tasks, in that order.

If we replace a busy task then once we migrate to the
destination CPU we try to migrate that task to our original
CPU. It will not be able to replace us in the next round of
balancing because the flipping rule is not symmetric: our
group will be larger there than theirs.

This logic is implemented in the improve_group_balance_compress()
function.

An example of a valid group convergence transition:

( G1 is a buddy group of tasks  T1, T2, T3 on node 1 and
  T6, T7, T8 on node 2, G2 is a second buddy group on node 1
  with tasks T4, T5, G3 is a third buddy group on
  node 2 with tasks T9 and T10, G4 and G5 are two one-task
  groups of singleton tasks T11 and T12. Both nodes are equally
  loaded with 6 tasks and are at full capacity.):

    node 1                                   node 2
    G1(T1, T2, T3), G2(T4, T5), G4(T11)      G1(T6, T7, T8) G3(T9, T10), G5(T12)

                     ||
                    \||/
                     \/

    node 1                                   node 2
    G1(T1, T2, T3, T6), G2(T4, T5)           G1(T7, T8), G3(T9, T10), G4(T11), G5(T12)

I.e. task T6 migrated from node 2 to node 1, flipping task T11.
This was a valid migration that increased the size of group G1
on node 1, at the expense of (smaller) group G4.

The next valid migration step would be for T7 and T8 to
migrate from node 2 to node 1 as well:

                     ||
                    \||/
                     \/

    node 1                            node 2
    G1(T1, T2, T3, T6, T7, T8)        G2(T4, T5), G3(T9, T10), G4(T11), G5(T12)

Which is fully converged, with all 5 groups running on
their own node with no cross-node traffic between group
member tasks.

These two migrations were valid too because group G2 is
smaller than group G1, so it can be disrupted by G1.
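
To make the compression rule concrete, here is a small stand-alone
user-space model of the decision, applied to the example above. This
is only an illustration of the rule as described - not the actual
improve_group_balance_compress() kernel code:

  #include <stdio.h>

  #define NR_NODES        2
  #define NR_GROUPS       5

  /* tasks[node][group]: how many tasks of each buddy group run on each node */
  static int tasks[NR_NODES][NR_GROUPS] = {
          { 3, 2, 0, 1, 0 },      /* node 1: G1=3, G2=2, G4=1 */
          { 3, 0, 2, 0, 1 },      /* node 2: G1=3, G3=2, G5=1 */
  };
  static int node_capacity[NR_NODES] = { 6, 6 };

  /* Should a task of group 'g', running on 'this_node', migrate - and where? */
  static int compress_target(int g, int this_node)
  {
          int max_node = -1, max_buddies = -1;
          int min_size, node, i;

          /* 1) The node with the most of our buddies, not yet full of them: */
          for (node = 0; node < NR_NODES; node++) {
                  if (tasks[node][g] > max_buddies && tasks[node][g] < node_capacity[node]) {
                          max_buddies = tasks[node][g];
                          max_node = node;
                  }
          }
          if (max_node == -1 || max_node == this_node)
                  return -1;                      /* nothing to do */

          /* 2) The smallest other (non-empty) group on that node: */
          min_size = node_capacity[max_node] + 1;
          for (i = 0; i < NR_GROUPS; i++) {
                  if (i != g && tasks[max_node][i] && tasks[max_node][i] < min_size)
                          min_size = tasks[max_node][i];
          }

          /* 3) Only disrupt a group that is not larger than ours on that node: */
          if (tasks[max_node][g] >= min_size)
                  return max_node;                /* migrate (flip) to max_node */

          return -1;
  }

  int main(void)
  {
          /* T6 is a G1 task (group index 0) running on node 2 (node index 1): */
          printf("G1 task on node 2 -> migrate to node %d\n",
                 compress_target(0, 1) + 1);
          return 0;
  }

This prints 'migrate to node 1', matching the first transition above:
T6 moves over to node 1 by disrupting the smallest group there (G4).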

"Resonance"

On its face, 'compression' and 'spreading' are opposing forces
and are thus subject to bad resonance feedback effects: what
'compression' does could be undone by 'spreading', all the
time, without it stopping.

But because 'compression' only gets called when 'spreading'
can find no more work to do, and because 'compression'
creates larger groups on otherwise balanced nodes, which then
cannot be torn apart by 'spreading', no permanent resonance
should occur between the two.

Transients can occur, as both algorithms are lockless and can
(although typically and statistically don't) run at once on
parallel CPUs.

Choice of the convergence algorithms:
=====================================

I went for directed convergence instead of statistical convergence,
because especially on larger systems convergence speed was getting
very slow with statistical methods, only converging the most trivial,
often artificial benchmark workloads.

The mathematical proof looks somewhat difficult to outline (I have
not even tried to formally construct one), but the above logic will
monotonically sort the system until it converges into a fully
and ideally sorted state, and will do that in a finite number of
steps.

In the final state each group is the largest possible and each CPU
is loaded to the fullest: i.e. inter-node traffic is minimized.

This direct path of convergence is very fast (much faster than
the statistical, Monte-Carlo / Brownian motion convergence) but
it is not the mathematically shortest possible route to equilibrium.

By my thinking, finding the minimum-steps route would have
O(NR_CPUS^3) complexity or worse, with memory allocation and
complexity constraints that are impractical and unacceptable
for kernel space ...

[ If you thought that the lockdep dependency graph logic was complex,
  then such a routine would be a true monster in comparison! ]

Especially with fast moving workloads it's also doubtful whether
it's worth spending kernel resources to calculate the best path
to begin with - by the time we calculate it the scheduling situation
might have changed already ;-)

This routine fits into our existing load-balancing patterns
and algorithm complexity pretty well: it is essentially O(NR_CPUS),
it runs only rarely and tries hard to never run on multiple
CPUs at once.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h   |    1 +
 init/Kconfig            |    1 +
 kernel/sched/core.c     |    3 -
 kernel/sched/fair.c     | 1185 +++++++++++++++++++++++++++++++++++++++++++----
 kernel/sched/features.h |   20 +-
 5 files changed, 1099 insertions(+), 111 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ce9ccd7..3bc69b7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1506,6 +1506,7 @@ struct task_struct {
 	int numa_shared;
 	int numa_max_node;
 	int numa_scan_seq;
+	unsigned long numa_scan_ts_secs;
 	int numa_migrate_seq;
 	unsigned int numa_scan_period;
 	u64 node_stamp;			/* migration stamp  */
diff --git a/init/Kconfig b/init/Kconfig
index 9511f0d..3db61b0 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1109,6 +1109,7 @@ config UIDGID_STRICT_TYPE_CHECKS
 
 config SCHED_AUTOGROUP
 	bool "Automatic process group scheduling"
+	depends on !NUMA_BALANCING
 	select EVENTFD
 	select CGROUPS
 	select CGROUP_SCHED
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 129924a..48f69a0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4772,9 +4772,6 @@ static int __migrate_task(struct task_struct *p, int src_cpu, int dest_cpu)
 done:
 	ret = 1;
 fail:
-#ifdef CONFIG_NUMA_BALANCING
-	rq_dest->curr_buddy = NULL;
-#endif
 	double_rq_unlock(rq_src, rq_dest);
 	raw_spin_unlock(&p->pi_lock);
 	return ret;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fda1b63..417c7bb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -848,16 +848,777 @@ static int task_ideal_cpu(struct task_struct *p)
 	return p->ideal_cpu;
 }
 
+/*
+ * Check whether two tasks are probably in the same
+ * shared memory access group:
+ */
+static bool tasks_buddies(struct task_struct *p1, struct task_struct *p2)
+{
+	struct task_struct *p1b, *p2b;
+
+	if (p1 == p2)
+		return true;
+
+	p1b = NULL;
+	if ((task_ideal_cpu(p1) >= 0) && (p1->shared_buddy_faults > 1000))
+		p1b = p1->shared_buddy;
+
+	p2b = NULL;
+	if ((task_ideal_cpu(p2) >= 0) && (p2->shared_buddy_faults > 1000))
+		p2b = p2->shared_buddy;
+
+	if (p1b && p2b) {
+		if (p1b == p2)
+			return true;
+		if (p2b == p1)
+			return true;
+		if (p1b == p2b)
+			return true;
+	}
+
+	/* Are they both NUMA-shared and in the same mm? */
+	if (task_numa_shared(p1) == 1 && task_numa_shared(p2) == 1 && p1->mm == p2->mm)
+		return true;
+
+	return false;
+}
+
+#define NUMA_LOAD_IDX_HIGHFREQ	0
+#define NUMA_LOAD_IDX_LOWFREQ	3
+
+#define LOAD_HIGHER		true
+#define LOAD_LOWER		false
+
+/*
+ * Load of all tasks:
+ */
+static long calc_node_load(int node, bool use_higher)
+{
+	long cpu_load_highfreq;
+	long cpu_load_lowfreq;
+	long cpu_load_curr;
+	long min_cpu_load;
+	long max_cpu_load;
+	long node_load;
+	int cpu;
+
+	node_load = 0;
+
+	for_each_cpu(cpu, cpumask_of_node(node)) {
+		struct rq *rq = cpu_rq(cpu);
+
+		cpu_load_curr		= rq->load.weight;
+		cpu_load_lowfreq	= rq->cpu_load[NUMA_LOAD_IDX_LOWFREQ];
+		cpu_load_highfreq	= rq->cpu_load[NUMA_LOAD_IDX_HIGHFREQ];
+
+		min_cpu_load = min(min(cpu_load_curr, cpu_load_lowfreq), cpu_load_highfreq);
+		max_cpu_load = max(max(cpu_load_curr, cpu_load_lowfreq), cpu_load_highfreq);
+
+		if (use_higher)
+			node_load += max_cpu_load;
+		else
+			node_load += min_cpu_load;
+	}
+
+	return node_load;
+}
+
+/*
+ * The capacity of this node:
+ */
+static long calc_node_capacity(int node)
+{
+	return cpumask_weight(cpumask_of_node(node)) * SCHED_LOAD_SCALE;
+}
+
+/*
+ * Load of shared NUMA tasks:
+ */
+static long calc_node_shared_load(int node)
+{
+	long node_load = 0;
+	int cpu;
+
+	for_each_cpu(cpu, cpumask_of_node(node)) {
+		struct rq *rq = cpu_rq(cpu);
+		struct task_struct *curr;
+
+		curr = ACCESS_ONCE(rq->curr);
+
+		if (task_numa_shared(curr) == 1)
+			node_load += rq->cpu_load[NUMA_LOAD_IDX_HIGHFREQ];
+	}
+
+	return node_load;
+}
+
+/*
+ * Find the least busy CPU that is below a limit load,
+ * on a specific node:
+ */
+static int __find_min_cpu(int node, long load_threshold)
+{
+	long min_cpu_load;
+	int min_cpu;
+	long cpu_load_highfreq;
+	long cpu_load_lowfreq;
+	long cpu_load;
+	int cpu;
+
+	min_cpu_load = LONG_MAX;
+	min_cpu = -1;
+
+	for_each_cpu(cpu, cpumask_of_node(node)) {
+		struct rq *rq = cpu_rq(cpu);
+
+		cpu_load_highfreq = rq->cpu_load[NUMA_LOAD_IDX_HIGHFREQ];
+		cpu_load_lowfreq = rq->cpu_load[NUMA_LOAD_IDX_LOWFREQ];
+
+		/* Be conservative: */
+		cpu_load = max(cpu_load_highfreq, cpu_load_lowfreq);
+
+		if (cpu_load < min_cpu_load) {
+			min_cpu_load = cpu_load;
+			min_cpu = cpu;
+		}
+	}
+
+	if (min_cpu_load > load_threshold)
+		return -1;
+
+	return min_cpu;
+}
+
+/*
+ * Find an idle CPU:
+ */
+static int find_idle_cpu(int node)
+{
+	return __find_min_cpu(node, SCHED_LOAD_SCALE/2);
+}
+
+/*
+ * Find the least loaded CPU on a node:
+ */
+static int find_min_cpu(int node)
+{
+	return __find_min_cpu(node, LONG_MAX);
+}
+
+/*
+ * Find the most idle node:
+ */
+static int find_idlest_node(int *idlest_cpu)
+{
+	int idlest_node;
+	int max_idle_cpus;
+	int target_cpu = -1;
+	int idle_cpus;
+	int node;
+	int cpu;
+
+	idlest_node = -1;
+	max_idle_cpus = 0;
+
+	for_each_online_node(node) {
+
+		idle_cpus = 0;
+		target_cpu = -1;
+
+		for_each_cpu(cpu, cpumask_of_node(node)) {
+			struct rq *rq = cpu_rq(cpu);
+			struct task_struct *curr;
+
+			curr = ACCESS_ONCE(rq->curr);
+
+			if (curr == rq->idle) {
+				idle_cpus++;
+				if (target_cpu == -1)
+					target_cpu = cpu;
+			}
+		}
+		if (idle_cpus > max_idle_cpus) {
+			max_idle_cpus = idle_cpus;
+
+			idlest_node = node;
+			*idlest_cpu = target_cpu;
+		}
+	}
+
+	return idlest_node;
+}
+
+/*
+ * Find the minimally loaded CPU on this node and see whether
+ * we can balance to it:
+ */
+static int find_intranode_imbalance(int this_node, int this_cpu)
+{
+	long cpu_load_highfreq;
+	long cpu_load_lowfreq;
+	long this_cpu_load;
+	long cpu_load_curr;
+	long min_cpu_load;
+	long cpu_load;
+	int min_cpu;
+	int cpu;
+
+	if (WARN_ON_ONCE(cpu_to_node(this_cpu) != this_node))
+		return -1;
+
+	min_cpu_load = LONG_MAX;
+	this_cpu_load = 0;
+	min_cpu = -1;
+
+	for_each_cpu(cpu, cpumask_of_node(this_node)) {
+		struct rq *rq = cpu_rq(cpu);
+
+		cpu_load_curr		= rq->load.weight;
+		cpu_load_lowfreq	= rq->cpu_load[NUMA_LOAD_IDX_LOWFREQ];
+		cpu_load_highfreq	= rq->cpu_load[NUMA_LOAD_IDX_HIGHFREQ];
+
+		if (cpu == this_cpu) {
+			this_cpu_load = min(min(cpu_load_curr, cpu_load_lowfreq), cpu_load_highfreq);
+		}
+		cpu_load = max(max(cpu_load_curr, cpu_load_lowfreq), cpu_load_highfreq);
+
+		/* Find the idlest CPU: */
+		if (cpu_load < min_cpu_load) {
+			min_cpu_load = cpu_load;
+			min_cpu = cpu;
+		}
+	}
+
+	if (this_cpu_load - min_cpu_load < 1536)
+		return -1;
+
+	return min_cpu;
+}
+
+
+/*
+ * Search a node for the buddy group that task @p belongs
+ * to, and return the size of that group (0 if not found).
+ */
+static int buddy_group_size(int node, struct task_struct *p)
+{
+	const cpumask_t *node_cpus_mask = cpumask_of_node(node);
+	cpumask_t cpus_to_check_mask;
+	bool our_group_found;
+	int cpu1, cpu2;
+
+	cpumask_copy(&cpus_to_check_mask, node_cpus_mask);
+	our_group_found = false;
+
+	if (WARN_ON_ONCE(cpumask_empty(&cpus_to_check_mask)))
+		return 0;
+
+	/* Iterate over all buddy groups: */
+	do {
+		for_each_cpu(cpu1, &cpus_to_check_mask) {
+			struct task_struct *group_head;
+			struct rq *rq1 = cpu_rq(cpu1);
+			int group_size;
+			int head_cpu;
+
+			group_head = rq1->curr;
+			head_cpu = cpu1;
+			cpumask_clear_cpu(cpu1, &cpus_to_check_mask);
+
+			group_size = 1;
+			if (tasks_buddies(group_head, p))
+				our_group_found = true;
+
+			/* Non-NUMA-shared tasks are 1-task groups: */
+			if (task_numa_shared(group_head) != 1)
+				goto next;
+
+			WARN_ON_ONCE(group_head == rq1->idle);
+
+			for_each_cpu(cpu2, &cpus_to_check_mask) {
+				struct rq *rq2 = cpu_rq(cpu2);
+				struct task_struct *p2 = rq2->curr;
+
+				WARN_ON_ONCE(rq1 == rq2);
+				if (tasks_buddies(group_head, p2)) {
+					/* 'group_head' and 'rq2->curr' are in the same group: */
+					cpumask_clear_cpu(cpu2, &cpus_to_check_mask);
+					group_size++;
+					if (tasks_buddies(p2, p))
+						our_group_found = true;
+				}
+			}
+next:
+
+			/*
+			 * If we just found our group and checked all
+			 * node local CPUs then return the result:
+			 */
+			if (our_group_found)
+				return group_size;
+		}
+	} while (!cpumask_empty(&cpus_to_check_mask));
+
+	return 0;
+}
+
+/*
+ * Search a node for the task in the smallest buddy group and
+ * return its CPU, plus the size of that group via @group_size.
+ */
+static int find_group_min_cpu(int node, int *group_size)
+{
+	const cpumask_t *node_cpus_mask = cpumask_of_node(node);
+	cpumask_t cpus_to_check_mask;
+	int min_group_size;
+	int min_group_cpu;
+	int group_cpu;
+	int cpu1, cpu2;
+
+	min_group_size = INT_MAX;
+	min_group_cpu = -1;
+	cpumask_copy(&cpus_to_check_mask, node_cpus_mask);
+
+	WARN_ON_ONCE(cpumask_empty(&cpus_to_check_mask));
+
+	/* Iterate over all buddy groups: */
+	do {
+		group_cpu = -1;
+
+		for_each_cpu(cpu1, &cpus_to_check_mask) {
+			struct task_struct *group_head;
+			struct rq *rq1 = cpu_rq(cpu1);
+			int group_size;
+			int head_cpu;
+
+			group_head = rq1->curr;
+			head_cpu = cpu1;
+			cpumask_clear_cpu(cpu1, &cpus_to_check_mask);
+
+			group_size = 1;
+			group_cpu = cpu1;
+
+			/* Non-NUMA-shared tasks are 1-task groups: */
+			if (task_numa_shared(group_head) != 1)
+				goto pick_non_numa_task;
+
+			WARN_ON_ONCE(group_head == rq1->idle);
+
+			for_each_cpu(cpu2, &cpus_to_check_mask) {
+				struct rq *rq2 = cpu_rq(cpu2);
+				struct task_struct *p2 = rq2->curr;
+
+				WARN_ON_ONCE(rq1 == rq2);
+				if (tasks_buddies(group_head, p2)) {
+					/* 'group_head' and 'rq2->curr' are in the same group: */
+					cpumask_clear_cpu(cpu2, &cpus_to_check_mask);
+					group_size++;
+				}
+			}
+
+			if (group_size < min_group_size) {
+pick_non_numa_task:
+				min_group_size = group_size;
+				min_group_cpu = head_cpu;
+			}
+		}
+	} while (!cpumask_empty(&cpus_to_check_mask));
+
+	if (min_group_cpu != -1)
+		*group_size = min_group_size;
+	else
+		*group_size = 0;
+
+	return min_group_cpu;
+}
+
+static int find_max_node(struct task_struct *p, int *our_group_size)
+{
+	int max_group_size;
+	int group_size;
+	int max_node;
+	int node;
+
+	max_group_size = -1;
+	max_node = -1;
+
+	for_each_node(node) {
+		int full_size = cpumask_weight(cpumask_of_node(node));
+
+		group_size = buddy_group_size(node, p);
+		if (group_size == full_size)
+			continue;
+
+		if (group_size > max_group_size) {
+			max_group_size = group_size;
+			max_node = node;
+		}
+	}
+
+	*our_group_size = max_group_size;
+
+	return max_node;
+}
+
+/*
+ * NUMA convergence.
+ *
+ * This is the heart of the CONFIG_NUMA_BALANCING=y NUMA balancing logic.
+ *
+ * These functions get called when CPU loads are otherwise more or
+ * less in balance, so we check whether this NUMA task wants to migrate
+ * to another node, to improve NUMA task group clustering.
+ *
+ * Our general goal is to converge the load in such a way that
+ * minimizes internode memory access traffic. We do this
+ * in a 'directed', non-statistical way, which drastically
+ * improves the speed of convergence.
+ *
+ * We do this directed convergence by using the 'memory buddy'
+ * task relation which we build out of the page::last_cpu NUMA
+ * hinting page fault driven statistics, plus the following
+ * two sets of directed task migration rules.
+ *
+ * The first set of rules 'compresses' groups by moving related
+ * tasks closer to each other.
+ *
+ * The second set of rules 'spreads' groups, when compression
+ * creates a stable but not yet converging (optimal) layout
+ * of tasks.
+ *
+ * 1) "group spreading"
+ *
+ * This rule checks whether the smallest group on the current node
+ * could move to another node.
+ *
+ * This logic is implemented in the improve_group_balance_spread()
+ * function.
+ *
+ * ============================================================
+ *
+ * 2) "group compression"
+ *
+ * This logic triggers if the 'spreading' logic finds no more
+ * work to do.
+ *
+ * First we search for the 'maximum node', i.e. the node on
+ * which we have the most buddies, but which node is not yet
+ * completely full with our buddies.
+ *
+ * If this 'maximum node' is the current node, then we stop.
+ *
+ * If this 'maximum node' is a different node from the current
+ * node then we check the size of the smallest buddy group on
+ * it.
+ *
+ * If our own buddy group size on that node is equal to or larger
+ * than the minimum buddy group size, then we can disrupt the
+ * smallest group and migrate to one of their CPUs - even if
+ * that CPU is not idle.
+ *
+ * Special cases: idle tasks, non-NUMA tasks or NUMA-private
+ * tasks are special 1-task 'buddy groups' and are preferred
+ * over NUMA-shared tasks, in that order.
+ *
+ * If we replace a busy task then once we migrate to the
+ * destination CPU we try to migrate that task to our original
+ * CPU. It will not be able to replace us in the next round of
+ * balancing because the flipping rule is not symmetric: our
+ * group will be larger there than theirs.
+ *
+ * This logic is implemented in the improve_group_balance_compress()
+ * function.
+ *
+ * ============================================================
+ *
+ * An example of a valid group convergence transition:
+ *
+ * ( G1 is a buddy group of tasks  T1, T2, T3 on node 1 and
+ *   T6, T7, T8 on node 2, G2 is a second buddy group on node 1
+ *   with tasks T4, T5, G3 is a third buddy group on
+ *   node 2 with tasks T9 and T10, G4 and G5 are two one-task
+ *   groups of singleton tasks T11 and T12. Both nodes are equally
+ *   loaded with 6 tasks and are at full capacity.):
+ *
+ *     node 1                                   node 2
+ *     G1(T1, T2, T3), G2(T4, T5), G4(T11)      G1(T6, T7, T8) G3(T9, T10), G5(T12)
+ *
+ *                      ||
+ *                     \||/
+ *                      \/
+ *
+ *     node 1                                   node 2
+ *     G1(T1, T2, T3, T6), G2(T4, T5)           G1(T7, T8), G3(T9, T10), G4(T11), G5(T12)
+ *
+ * I.e. task T6 migrated from node 2 to node 1, flipping task T11.
+ * This was a valid migration that increased the size of group G1
+ * on node 1, at the expense of (smaller) group G4.
+ *
+ * The next valid migration step would be for T7 and T8 to
+ * migrate from node 2 to node 1 as well:
+ *
+ *                      ||
+ *                     \||/
+ *                      \/
+ *
+ *     node 1                            node 2
+ *     G1(T1, T2, T3, T6, T7, T8)        G2(T4, T5), G3(T9, T10), G4(T11), G5(T12)
+ *
+ * Which is fully converged, with all 5 groups running on
+ * their own node with no cross-node traffic between group
+ * member tasks.
+ *
+ * These two migrations were valid too because group G2 is
+ * smaller than group G1, so it can be disrupted by G1.
+ *
+ * ============================================================
+ *
+ * "Resonance"
+ *
+ * On its face, 'compression' and 'spreading' are opposing forces
+ * and are thus subject to bad resonance feedback effects: what
+ * 'compression' does could be continually undone by 'spreading',
+ * without the cycle ever stopping.
+ *
+ * But because 'compression' only gets called when 'spreading'
+ * can find no more work to do, and because 'compression'
+ * creates larger groups on otherwise balanced nodes, which then
+ * cannot be torn apart by 'spreading', no permanent resonance
+ * should occur between the two.
+ *
+ * Transients can occur, as both algorithms are lockless and can
+ * (although typically and statistically don't) run at once on
+ * parallel CPUs.
+ *
+ * ============================================================
+ *
+ * Choice of the convergence algorithms:
+ *
+ * I went for directed convergence instead of statistical convergence,
+ * because especially on larger systems convergence speed was getting
+ * very slow with statistical methods, only converging the most trivial,
+ * often artificial benchmark workloads.
+ *
+ * A mathematical proof looks somewhat difficult to outline (I have
+ * not even tried to construct one formally), but the above logic will
+ * monotonically sort the system until it converges into a fully
+ * and ideally sorted state, and will do so in a finite number of
+ * steps.
+ *
+ * In the final state each group is the largest possible and each CPU
+ * is loaded to the fullest: i.e. inter-node traffic is minimized.
+ *
+ * This direct path of convergence is very fast (much faster than
+ * the statistical, Monte-Carlo / Brownian motion convergence) but
+ * it is not the mathematically shortest possible route to equilibrium.
+ *
+ * By my thinking, finding the minimum-steps route would have
+ * O(NR_CPUS^3) complexity or worse, with memory allocation and
+ * complexity constraints that are impractical and unacceptable for kernel space ...
+ *
+ * [ If you thought that the lockdep dependency graph logic was complex,
+ *   then such a routine would be a true monster in comparison! ]
+ *
+ * Especially with fast-moving workloads it's also doubtful whether
+ * it's worth spending kernel resources to calculate the best path
+ * to begin with - by the time we calculate it, the scheduling situation
+ * might have changed already ;-)
+ *
+ * This routine fits into our existing load-balancing patterns
+ * and algorithm complexity pretty well: it is essentially O(NR_CPUS),
+ * it runs only rarely and it tries hard never to run on multiple
+ * CPUs at once.
+ */
+static int improve_group_balance_compress(struct task_struct *p, int this_cpu, int this_node)
+{
+	int our_group_size = -1;
+	int min_group_size = -1;
+	int max_node;
+	int min_cpu;
+
+	if (!sched_feat(NUMA_GROUP_LB_COMPRESS))
+		return -1;
+
+	max_node = find_max_node(p, &our_group_size);
+	if (max_node == -1)
+		return -1;
+
+	if (WARN_ON_ONCE(our_group_size == -1))
+		return -1;
+
+	/* We are already in the right spot: */
+	if (max_node == this_node)
+		return -1;
+
+	/* Special case, if all CPUs are fully loaded with our buddies: */
+	if (our_group_size == 0)
+		return -1;
+
+	/* Ok, we desire to go to the max node, now see whether we can do it: */
+	min_cpu = find_group_min_cpu(max_node, &min_group_size);
+	if (min_cpu == -1)
+		return -1;
+
+	if (WARN_ON_ONCE(min_group_size <= 0))
+		return -1;
+
+	/*
+	 * If the minimum group is larger than ours then skip it:
+	 */
+	if (min_group_size > our_group_size)
+		return -1;
+
+	/*
+	 * Go pick the minimum CPU:
+	 */
+	return min_cpu;
+}
+
+static int improve_group_balance_spread(struct task_struct *p, int this_cpu, int this_node)
+{
+	const cpumask_t *node_cpus_mask = cpumask_of_node(this_node);
+	cpumask_t cpus_to_check_mask;
+	bool found_our_group = false;
+	bool our_group_smallest = false;
+	int our_group_size = -1;
+	int min_group_size;
+	int idlest_node;
+	long this_group_load;
+	long idlest_node_load = -1;
+	long this_node_load = -1;
+	long delta_load_before;
+	long delta_load_after;
+	int idlest_cpu = -1;
+	int cpu1, cpu2;
+
+	if (!sched_feat(NUMA_GROUP_LB_SPREAD))
+		return -1;
+
+	/* Only consider shared tasks: */
+	if (task_numa_shared(p) != 1)
+		return -1;
+
+	min_group_size = INT_MAX;
+	cpumask_copy(&cpus_to_check_mask, node_cpus_mask);
+
+	WARN_ON_ONCE(cpumask_empty(&cpus_to_check_mask));
+
+	/* Iterate over all buddy groups: */
+	do {
+		for_each_cpu(cpu1, &cpus_to_check_mask) {
+			struct task_struct *group_head;
+			struct rq *rq1 = cpu_rq(cpu1);
+			bool our_group = false;
+			int group_size;
+			int head_cpu;
+
+			group_head = rq1->curr;
+			head_cpu = cpu1;
+			cpumask_clear_cpu(cpu1, &cpus_to_check_mask);
+
+			/* Only NUMA-shared tasks are parts of groups: */
+			if (task_numa_shared(group_head) != 1)
+				continue;
+
+			WARN_ON_ONCE(group_head == rq1->idle);
+			group_size = 1;
+
+			if (group_head == p)
+				our_group = true;
+
+			for_each_cpu(cpu2, &cpus_to_check_mask) {
+				struct rq *rq2 = cpu_rq(cpu2);
+				struct task_struct *p2 = rq2->curr;
+
+				WARN_ON_ONCE(rq1 == rq2);
+				if (tasks_buddies(group_head, p2)) {
+					/* 'group_head' and 'rq2->curr' are in the same group: */
+					cpumask_clear_cpu(cpu2, &cpus_to_check_mask);
+					group_size++;
+					if (p == p2)
+						our_group = true;
+				}
+			}
+
+			if (our_group) {
+				found_our_group = true;
+				our_group_size = group_size;
+				if (group_size <= min_group_size)
+					our_group_smallest = true;
+			} else {
+				if (found_our_group) {
+					if (group_size < our_group_size)
+						our_group_smallest = false;
+				}
+			}
 
-static int sched_update_ideal_cpu_shared(struct task_struct *p)
+			if (min_group_size == -1)
+				min_group_size = group_size;
+			else
+				min_group_size = min(group_size, min_group_size);
+		}
+	} while (!cpumask_empty(&cpus_to_check_mask));
+
+	/*
+	 * Now we know what size our group has and whether we
+	 * are the smallest one:
+	 */
+	if (!found_our_group)
+		return -1;
+	if (!our_group_smallest)
+		return -1;
+	if (WARN_ON_ONCE(min_group_size == -1))
+		return -1;
+	if (WARN_ON_ONCE(our_group_size == -1))
+		return -1;
+
+	idlest_node = find_idlest_node(&idlest_cpu);
+	if (idlest_node == -1)
+		return -1;
+
+	WARN_ON_ONCE(idlest_cpu == -1);
+
+	this_node_load		= calc_node_shared_load(this_node);
+	idlest_node_load	= calc_node_shared_load(idlest_node);
+	this_group_load		= our_group_size*SCHED_LOAD_SCALE;
+
+	/*
+	 * Let's check whether it would make sense to move this smallest
+	 * group - whether it increases system-wide balance.
+	 *
+	 * this node right now has "this_node_load", the potential
+	 * target node has "idlest_node_load". Does the difference
+	 * in load improve if we move over "this_group_load" to that
+	 * node?
+	 */
+	delta_load_before = this_node_load - idlest_node_load;
+	delta_load_after = (this_node_load-this_group_load) - (idlest_node_load+this_group_load);
+	
+	if (abs(delta_load_after)+SCHED_LOAD_SCALE > abs(delta_load_before))
+		return -1;
+
+	return idlest_cpu;
+
+}
+
+static int sched_update_ideal_cpu_shared(struct task_struct *p, int *flip_tasks)
 {
-	int full_buddies;
+	bool idle_target;
 	int max_buddies;
+	long node_load;
+	long this_node_load;
+	long this_node_capacity;
+	bool this_node_overloaded;
+	int min_node;
+	long min_node_load;
+	long ideal_node_load;
+	long ideal_node_capacity;
+	long node_capacity;
 	int target_cpu;
 	int ideal_cpu;
-	int this_cpu;
 	int this_node;
-	int best_node;
+	int ideal_node;
+	int this_cpu;
 	int buddies;
 	int node;
 	int cpu;
@@ -866,16 +1627,23 @@ static int sched_update_ideal_cpu_shared(struct task_struct *p)
 		return -1;
 
 	ideal_cpu = -1;
-	best_node = -1;
+	ideal_node = -1;
 	max_buddies = 0;
 	this_cpu = task_cpu(p);
 	this_node = cpu_to_node(this_cpu);
+	min_node_load = LONG_MAX;
+	min_node = -1;
 
+	/*
+	 * Map out our maximum buddies layout:
+	 */
 	for_each_online_node(node) {
-		full_buddies = cpumask_weight(cpumask_of_node(node));
-
 		buddies = 0;
 		target_cpu = -1;
+		idle_target = false;
+
+		node_load = calc_node_load(node, LOAD_HIGHER);
+		node_capacity = calc_node_capacity(node);
 
 		for_each_cpu(cpu, cpumask_of_node(node)) {
 			struct task_struct *curr;
@@ -892,140 +1660,267 @@ static int sched_update_ideal_cpu_shared(struct task_struct *p)
 			curr = ACCESS_ONCE(rq->curr);
 
 			if (curr == p) {
-				buddies += 1;
+				buddies++;
 				continue;
 			}
 
-			/* Pick up idle tasks immediately: */
-			if (curr == rq->idle && !rq->curr_buddy)
-				target_cpu = cpu;
+			if (curr == rq->idle) {
+				/* Pick up idle CPUs immediately: */
+				if (!rq->curr_buddy) {
+					target_cpu = cpu;
+					idle_target = true;
+				}
+				continue;
+			}
 
 			/* Leave alone non-NUMA tasks: */
 			if (task_numa_shared(curr) < 0)
 				continue;
 
-			/* Private task is an easy target: */
+			/* Private tasks are an easy target: */
 			if (task_numa_shared(curr) == 0) {
-				if (!rq->curr_buddy)
+				if (!rq->curr_buddy && !idle_target)
 					target_cpu = cpu;
 				continue;
 			}
 			if (curr->mm != p->mm) {
 				/* Leave alone different groups on their ideal node: */
-				if (cpu_to_node(curr->ideal_cpu) == node)
+				if (curr->ideal_cpu >= 0 && cpu_to_node(curr->ideal_cpu) == node)
 					continue;
-				if (!rq->curr_buddy)
+				if (!rq->curr_buddy && !idle_target)
 					target_cpu = cpu;
 				continue;
 			}
 
 			buddies++;
 		}
-		WARN_ON_ONCE(buddies > full_buddies);
+
+		if (node_load < min_node_load) {
+			min_node_load = node_load;
+			min_node = node;
+		}
+
 		if (buddies)
 			node_set(node, p->numa_policy.v.nodes);
 		else
 			node_clear(node, p->numa_policy.v.nodes);
 
-		/* Don't go to a node that is already at full capacity: */
-		if (buddies == full_buddies)
+		/* Don't go to a node that is near its capacity limit: */
+		if (node_load + SCHED_LOAD_SCALE > node_capacity)
 			continue;
-
 		if (!buddies)
 			continue;
 
 		if (buddies > max_buddies && target_cpu != -1) {
 			max_buddies = buddies;
-			best_node = node;
+			ideal_node = node;
 			WARN_ON_ONCE(target_cpu == -1);
 			ideal_cpu = target_cpu;
 		}
 	}
 
-	WARN_ON_ONCE(best_node == -1 && ideal_cpu != -1);
-	WARN_ON_ONCE(best_node != -1 && ideal_cpu == -1);
+	if (WARN_ON_ONCE(ideal_node == -1 && ideal_cpu != -1))
+		return this_cpu;
+	if (WARN_ON_ONCE(ideal_node != -1 && ideal_cpu == -1))
+		return this_cpu;
+	if (WARN_ON_ONCE(min_node == -1))
+		return this_cpu;
 
-	this_node = cpu_to_node(this_cpu);
+	ideal_cpu = ideal_node = -1;
+
+	/*
+	 * If things are more or less in balance, check now
+	 * whether we can improve balance by moving larger
+	 * groups than single tasks:
+	 */
+	if (ideal_cpu == -1 || cpu_to_node(ideal_cpu) == this_node) {
+		if (ideal_node == this_node || ideal_node == -1) {
+			target_cpu = improve_group_balance_spread(p, this_cpu, this_node);
+			if (target_cpu == -1) {
+				target_cpu = improve_group_balance_compress(p, this_cpu, this_node);
+				/* In compression we override (most) overload concerns: */
+				if (target_cpu != -1) {
+					*flip_tasks = 1;
+					return target_cpu;
+				}
+			}
+			if (target_cpu != -1) {
+				ideal_cpu = target_cpu;
+				ideal_node = cpu_to_node(ideal_cpu);
+			}
+		}
+	}
+
+	this_node_load		= calc_node_load(this_node, LOAD_LOWER);
+	this_node_capacity	= calc_node_capacity(this_node);
+
+	this_node_overloaded = false;
+	if (this_node_load > this_node_capacity + 512)
+		this_node_overloaded = true;
 
 	/* If we'd stay within this node then stay put: */
 	if (ideal_cpu == -1 || cpu_to_node(ideal_cpu) == this_node)
-		ideal_cpu = this_cpu;
+		goto out_handle_overload;
+
+	ideal_node = cpu_to_node(ideal_cpu);
+
+	ideal_node_load		= calc_node_load(ideal_node, LOAD_HIGHER);
+	ideal_node_capacity	= calc_node_capacity(ideal_node);
+
+	/* Don't move if the target node is near its capacity limit: */
+	if (ideal_node_load + SCHED_LOAD_SCALE > ideal_node_capacity)
+		goto out_handle_overload;
+
+	/* Only move if we can see an idle CPU: */
+	ideal_cpu = find_min_cpu(ideal_node);
+	if (ideal_cpu == -1)
+		goto out_check_intranode;
+
+	return ideal_cpu;
+
+out_handle_overload:
+	if (!this_node_overloaded)
+		goto out_check_intranode;
+
+	if (this_node == min_node)
+		goto out_check_intranode;
+
+	ideal_cpu = find_idle_cpu(min_node);
+	if (ideal_cpu == -1)
+		goto out_check_intranode;
 
 	return ideal_cpu;
+	/*
+	 * Check for imbalance within this otherwise balanced node:
+	 */
+out_check_intranode:
+	ideal_cpu = find_intranode_imbalance(this_node, this_cpu);
+	if (ideal_cpu != -1 && ideal_cpu != this_cpu)
+		return ideal_cpu;
+
+	return this_cpu;
 }
 
+/*
+ * Private tasks are not part of any groups, so they try to place
+ * themselves to improve NUMA load in general.
+ *
+ * For that they simply want to find the least loaded node
+ * in the system, and check whether they can migrate there.
+ *
+ * To speed up convergence and to avoid a thundering herd of
+ * private tasks, we move from the busiest node (which still
+ * has private tasks) to the idlest node.
+ */
 static int sched_update_ideal_cpu_private(struct task_struct *p)
 {
-	int full_idles;
-	int this_idles;
-	int max_idles;
-	int target_cpu;
+	long this_node_load;
+	long this_node_capacity;
+	bool this_node_overloaded;
+	long ideal_node_load;
+	long ideal_node_capacity;
+	long min_node_load;
+	long max_node_load;
+	long node_load;
+	int ideal_node;
 	int ideal_cpu;
-	int best_node;
 	int this_node;
 	int this_cpu;
-	int idles;
+	int min_node;
+	int max_node;
 	int node;
-	int cpu;
 
 	if (!sched_feat(PUSH_PRIVATE_BUDDIES))
 		return -1;
 
 	ideal_cpu = -1;
-	best_node = -1;
-	max_idles = 0;
-	this_idles = 0;
+	ideal_node = -1;
 	this_cpu = task_cpu(p);
 	this_node = cpu_to_node(this_cpu);
 
-	for_each_online_node(node) {
-		full_idles = cpumask_weight(cpumask_of_node(node));
+	min_node_load = LONG_MAX;
+	max_node = -1;
+	max_node_load = 0;
+	min_node = -1;
 
-		idles = 0;
-		target_cpu = -1;
-
-		for_each_cpu(cpu, cpumask_of_node(node)) {
-			struct rq *rq;
+	/*
+	 * Calculate:
+	 *
+	 *  - the most loaded node
+	 *  - the least loaded node
+	 *  - our load
+	 */
+	for_each_online_node(node) {
+		node_load = calc_node_load(node, LOAD_HIGHER);
 
-			WARN_ON_ONCE(cpu_to_node(cpu) != node);
+		if (node_load > max_node_load) {
+			max_node_load = node_load;
+			max_node = node;
+		}
 
-			rq = cpu_rq(cpu);
-			if (rq->curr == rq->idle) {
-				if (!rq->curr_buddy)
-					target_cpu = cpu;
-				idles++;
-			}
+		if (node_load < min_node_load) {
+			min_node_load = node_load;
+			min_node = node;
 		}
-		WARN_ON_ONCE(idles > full_idles);
 
 		if (node == this_node)
-			this_idles = idles;
+			this_node_load = node_load;
+	}
 
-		if (!idles)
-			continue;
+	this_node_load		= calc_node_load(this_node, LOAD_LOWER);
+	this_node_capacity	= calc_node_capacity(this_node);
 
-		if (idles > max_idles && target_cpu != -1) {
-			max_idles = idles;
-			best_node = node;
-			WARN_ON_ONCE(target_cpu == -1);
-			ideal_cpu = target_cpu;
-		}
-	}
+	this_node_overloaded = false;
+	if (this_node_load > this_node_capacity + 512)
+		this_node_overloaded = true;
+
+	if (this_node == min_node)
+		goto out_check_intranode;
+
+	/* When not overloaded, only balance from the busiest node: */
+	if (0 && !this_node_overloaded && this_node != max_node)
+		return this_cpu;
+
+	WARN_ON_ONCE(max_node_load < min_node_load);
+
+	/* Is the load difference at least 150% of one standard task load? */
+	if (this_node_load - min_node_load < 1536)
+		goto out_check_intranode;
+
+	/*
+	 * Ok, the min node is a viable target for us,
+	 * find a target CPU on it, if any:
+	 */
+	ideal_node = min_node;
+	ideal_cpu = find_idle_cpu(ideal_node);
+	if (ideal_cpu == -1)
+		return this_cpu;
 
-	WARN_ON_ONCE(best_node == -1 && ideal_cpu != -1);
-	WARN_ON_ONCE(best_node != -1 && ideal_cpu == -1);
+	ideal_node = cpu_to_node(ideal_cpu);
 
-	/* If the target is not idle enough, skip: */
-	if (max_idles <= this_idles+1)
+	ideal_node_load		= calc_node_load(ideal_node, LOAD_HIGHER);
+	ideal_node_capacity	= calc_node_capacity(ideal_node);
+
+	/* Only move if the target node is less loaded than us: */
+	if (ideal_node_load > this_node_load)
 		ideal_cpu = this_cpu;
-		
-	/* If we'd stay within this node then stay put: */
-	if (ideal_cpu == -1 || cpu_to_node(ideal_cpu) == this_node)
+
+	/* And if the target node is not over capacity: */
+	if (ideal_node_load + SCHED_LOAD_SCALE > ideal_node_capacity)
 		ideal_cpu = this_cpu;
 
 	return ideal_cpu;
-}
 
+	/*
+	 * Check for imbalance within this otherwise balanced node:
+	 */
+out_check_intranode:
+	ideal_cpu = find_intranode_imbalance(this_node, this_cpu);
+	if (ideal_cpu != -1 && ideal_cpu != this_cpu)
+		return ideal_cpu;
+
+	return this_cpu;
+}
 
 /*
  * Called for every full scan - here we consider switching to a new
@@ -1072,8 +1967,10 @@ static void task_numa_placement_tick(struct task_struct *p)
 {
 	unsigned long total[2] = { 0, 0 };
 	unsigned long faults, max_faults = 0;
-	int node, priv, shared, max_node = -1;
+	int node, priv, shared, ideal_node = -1;
+	int flip_tasks;
 	int this_node;
+	int this_cpu;
 
 	/*
 	 * Update the fault average with the result of the latest
@@ -1090,25 +1987,20 @@ static void task_numa_placement_tick(struct task_struct *p)
 			p->numa_faults_curr[idx] = 0;
 
 			/* Keep a simple running average: */
-			p->numa_faults[idx] = p->numa_faults[idx]*7 + new_faults;
-			p->numa_faults[idx] /= 8;
+			p->numa_faults[idx] = p->numa_faults[idx]*15 + new_faults;
+			p->numa_faults[idx] /= 16;
 
 			faults += p->numa_faults[idx];
 			total[priv] += p->numa_faults[idx];
 		}
 		if (faults > max_faults) {
 			max_faults = faults;
-			max_node = node;
+			ideal_node = node;
 		}
 	}
 
 	shared_fault_full_scan_done(p);
 
-	p->numa_migrate_seq++;
-	if (sched_feat(NUMA_SETTLE) &&
-	    p->numa_migrate_seq < sysctl_sched_numa_settle_count)
-		return;
-
 	/*
 	 * Note: shared is spread across multiple tasks and in the future
 	 * we might want to consider a different equation below to reduce
@@ -1128,25 +2020,27 @@ static void task_numa_placement_tick(struct task_struct *p)
 			shared = 0;
 	}
 
+	flip_tasks = 0;
+
 	if (shared)
-		p->ideal_cpu = sched_update_ideal_cpu_shared(p);
+		p->ideal_cpu = sched_update_ideal_cpu_shared(p, &flip_tasks);
 	else
 		p->ideal_cpu = sched_update_ideal_cpu_private(p);
 
 	if (p->ideal_cpu >= 0) {
 		/* Filter migrations a bit - the same target twice in a row is picked: */
-		if (p->ideal_cpu == p->ideal_cpu_candidate) {
-			max_node = cpu_to_node(p->ideal_cpu);
+		if (1 || p->ideal_cpu == p->ideal_cpu_candidate) {
+			ideal_node = cpu_to_node(p->ideal_cpu);
 		} else {
 			p->ideal_cpu_candidate = p->ideal_cpu;
-			max_node = -1;
+			ideal_node = -1;
 		}
 	} else {
-		if (max_node < 0)
-			max_node = p->numa_max_node;
+		if (ideal_node < 0)
+			ideal_node = p->numa_max_node;
 	}
 
-	if (shared != task_numa_shared(p) || (max_node != -1 && max_node != p->numa_max_node)) {
+	if (shared != task_numa_shared(p) || (ideal_node != -1 && ideal_node != p->numa_max_node)) {
 
 		p->numa_migrate_seq = 0;
 		/*
@@ -1156,26 +2050,28 @@ static void task_numa_placement_tick(struct task_struct *p)
 		 * To counter-balance this effect, move this node's private
 		 * stats to the new node.
 		 */
-		if (max_node != -1 && p->numa_max_node != -1 && max_node != p->numa_max_node) {
+		if (sched_feat(MIGRATE_FAULT_STATS) && ideal_node != -1 && p->numa_max_node != -1 && ideal_node != p->numa_max_node) {
 			int idx_oldnode = p->numa_max_node*2 + 1;
-			int idx_newnode = max_node*2 + 1;
+			int idx_newnode = ideal_node*2 + 1;
 
 			p->numa_faults[idx_newnode] += p->numa_faults[idx_oldnode];
 			p->numa_faults[idx_oldnode] = 0;
 		}
-		sched_setnuma(p, max_node, shared);
+		sched_setnuma(p, ideal_node, shared);
 	} else {
 		/* node unchanged, back off: */
 		p->numa_scan_period = min(p->numa_scan_period * 2, sysctl_sched_numa_scan_period_max);
 	}
 
-	this_node = cpu_to_node(task_cpu(p));
+	this_cpu = task_cpu(p);
+	this_node = cpu_to_node(this_cpu);
 
-	if (max_node >= 0 && p->ideal_cpu >= 0 && max_node != this_node) {
+	if (ideal_node >= 0 && p->ideal_cpu >= 0 && p->ideal_cpu != this_cpu) {
 		struct rq *rq = cpu_rq(p->ideal_cpu);
 
 		rq->curr_buddy = p;
-		sched_rebalance_to(p->ideal_cpu, 0);
+		sched_rebalance_to(p->ideal_cpu, flip_tasks);
+		rq->curr_buddy = NULL;
 	}
 }
 
@@ -1317,7 +2213,7 @@ void task_numa_placement_work(struct callback_head *work)
  */
 void task_numa_scan_work(struct callback_head *work)
 {
-	long pages_total, pages_left, pages_changed;
+	long pages_total, pages_left, pages_changed, sum_pages_scanned;
 	unsigned long migrate, next_scan, now = jiffies;
 	unsigned long start0, start, end;
 	struct task_struct *p = current;
@@ -1331,6 +2227,13 @@ void task_numa_scan_work(struct callback_head *work)
 	if (p->flags & PF_EXITING)
 		return;
 
+	p->numa_migrate_seq++;
+	if (sched_feat(NUMA_SETTLE) &&
+	    p->numa_migrate_seq < sysctl_sched_numa_settle_count) {
+		trace_printk("NUMA TICK: placement, return to let it settle, task %s:%d\n", p->comm, p->pid);
+		return;
+	}
+
 	/*
 	 * Enforce maximal scan/migration frequency..
 	 */
@@ -1350,37 +2253,69 @@ void task_numa_scan_work(struct callback_head *work)
 	if (!pages_total)
 		return;
 
-	pages_left	= pages_total;
+	sum_pages_scanned = 0;
+	pages_left = pages_total;
 
-	down_write(&mm->mmap_sem);
+	down_read(&mm->mmap_sem);
 	vma = find_vma(mm, start);
 	if (!vma) {
 		ACCESS_ONCE(mm->numa_scan_seq)++;
-		end = 0;
-		vma = find_vma(mm, end);
+		start = end = 0;
+		vma = find_vma(mm, start);
 	}
+
 	for (; vma; vma = vma->vm_next) {
-		if (!vma_migratable(vma))
+		if (!vma_migratable(vma)) {
+			end = vma->vm_end;
+			continue;
+		}
+
+		/* Skip small VMAs. They are not likely to be of relevance */
+		if (((vma->vm_end - vma->vm_start) >> PAGE_SHIFT) < HPAGE_PMD_NR) {
+			end = vma->vm_end;
 			continue;
+		}
 
 		do {
+			unsigned long pages_scanned;
+
 			start = max(end, vma->vm_start);
 			end = ALIGN(start + (pages_left << PAGE_SHIFT), HPAGE_SIZE);
 			end = min(end, vma->vm_end);
+			pages_scanned = (end - start) >> PAGE_SHIFT;
+
+			if (WARN_ON_ONCE(start >= end)) {
+				printk_once("vma->vm_start: %016lx\n", vma->vm_start);
+				printk_once("vma->vm_end:   %016lx\n", vma->vm_end);
+				continue;
+			}
+			if (WARN_ON_ONCE(start < vma->vm_start))
+				continue;
+			if (WARN_ON_ONCE(end > vma->vm_end))
+				continue;
+
 			pages_changed = change_prot_numa(vma, start, end);
 
-			WARN_ON_ONCE(pages_changed > pages_total);
-			BUG_ON(pages_changed < 0);
+			WARN_ON_ONCE(pages_changed > pages_total + HPAGE_SIZE/PAGE_SIZE);
+			WARN_ON_ONCE(pages_changed < 0);
+			WARN_ON_ONCE(pages_changed > pages_scanned);
 
 			pages_left -= pages_changed;
 			if (pages_left <= 0)
 				goto out;
-		} while (end != vma->vm_end);
+
+			sum_pages_scanned += pages_scanned;
+
+			/* Don't overscan: */
+			if (sum_pages_scanned >= 2*pages_total)
+				goto out;
+
+		} while (end < vma->vm_end);
 	}
 out:
 	mm->numa_scan_offset = end;
 
-	up_write(&mm->mmap_sem);
+	up_read(&mm->mmap_sem);
 }
 
 /*
@@ -1433,6 +2368,8 @@ static void task_tick_numa_scan(struct rq *rq, struct task_struct *curr)
 static void task_tick_numa_placement(struct rq *rq, struct task_struct *curr)
 {
 	struct callback_head *work = &curr->numa_placement_work;
+	unsigned long now_secs;
+	unsigned long jiffies_offset;
 	int seq;
 
 	if (work->next != work)
@@ -1444,10 +2381,26 @@ static void task_tick_numa_placement(struct rq *rq, struct task_struct *curr)
 	 */
 	seq = ACCESS_ONCE(curr->mm->numa_scan_seq);
 
-	if (curr->numa_scan_seq == seq)
+	/*
+	 * Smear out the NUMA placement ticks by CPU position.
+	 * We get called about once per jiffy so we can test
+	 * for precisely meeting the jiffies offset.
+	 */
+	jiffies_offset = (jiffies % num_online_cpus());
+	if (jiffies_offset != rq->cpu)
+		return;
+
+	/*
+	 * Recalculate placement at least once per second:
+	 */
+	now_secs = jiffies/HZ;
+
+	if ((curr->numa_scan_seq == seq) && (curr->numa_scan_ts_secs == now_secs))
 		return;
 
+	curr->numa_scan_ts_secs = now_secs;
 	curr->numa_scan_seq = seq;
+
 	task_work_add(curr, work, true);
 }
 
@@ -1457,7 +2410,7 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
 	 * We don't care about NUMA placement if we don't have memory
 	 * or are exiting:
 	 */
-	if (!curr->mm || (curr->flags & PF_EXITING))
+	if (!curr->mm || (curr->flags & PF_EXITING) || !curr->numa_faults)
 		return;
 
 	task_tick_numa_scan(rq, curr);
@@ -3815,6 +4768,10 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 		return prev_cpu;
 
 #ifdef CONFIG_NUMA_BALANCING
+	/* We do NUMA balancing elsewhere: */
+	if (sched_feat(NUMA_BALANCE_ALL) && task_numa_shared(p) >= 0)
+		return prev_cpu;
+
 	if (sched_feat(WAKE_ON_IDEAL_CPU) && p->ideal_cpu >= 0)
 		return p->ideal_cpu;
 #endif
@@ -3893,6 +4850,9 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 unlock:
 	rcu_read_unlock();
 
+	if (sched_feat(NUMA_BALANCE_INTERNODE) && task_numa_shared(p) >= 0 && (cpu_to_node(prev_cpu) != cpu_to_node(new_cpu)))
+		return prev_cpu;
+
 	return new_cpu;
 }
 
@@ -4584,6 +5544,10 @@ try_migrate:
  */
 static int can_migrate_task(struct task_struct *p, struct lb_env *env)
 {
+	/* We do NUMA balancing elsewhere: */
+	if (sched_feat(NUMA_BALANCE_ALL) && task_numa_shared(p) > 0 && env->failed <= env->sd->cache_nice_tries)
+		return false;
+
 	if (!can_migrate_pinned_task(p, env))
 		return false;
 
@@ -4600,6 +5564,7 @@ static int can_migrate_task(struct task_struct *p, struct lb_env *env)
 		int dst_node;
 
 		BUG_ON(env->dst_cpu < 0);
+		WARN_ON_ONCE(p->ideal_cpu < 0);
 
 		ideal_node = cpu_to_node(p->ideal_cpu);
 		dst_node = cpu_to_node(env->dst_cpu);
@@ -4643,6 +5608,12 @@ static int move_one_task(struct lb_env *env)
 		if (!can_migrate_task(p, env))
 			continue;
 
+		if (sched_feat(NUMA_BALANCE_ALL) && task_numa_shared(p) >= 0)
+			continue;
+
+		if (sched_feat(NUMA_BALANCE_INTERNODE) && task_numa_shared(p) >= 0 && (cpu_to_node(env->src_rq->cpu) != cpu_to_node(env->dst_cpu)))
+			continue;
+
 		move_task(p, env);
 
 		/*
@@ -4703,6 +5674,12 @@ static int move_tasks(struct lb_env *env)
 		if (!can_migrate_task(p, env))
 			goto next;
 
+		if (sched_feat(NUMA_BALANCE_ALL) && task_numa_shared(p) >= 0)
+			continue;
+
+		if (sched_feat(NUMA_BALANCE_INTERNODE) && task_numa_shared(p) >= 0 && (cpu_to_node(env->src_rq->cpu) != cpu_to_node(env->dst_cpu)))
+			goto next;
+
 		move_task(p, env);
 		pulled++;
 		env->imbalance -= load;
@@ -5074,6 +6051,9 @@ static bool can_do_numa_run(struct lb_env *env, struct sd_lb_stats *sds)
  */
 static int check_numa_busiest_group(struct lb_env *env, struct sd_lb_stats *sds)
 {
+	if (!sched_feat(NUMA_LB))
+		return 0;
+
 	if (!sds->numa || !sds->numa_numa_running)
 		return 0;
 
@@ -5918,6 +6898,9 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		.iteration          = 0,
 	};
 
+	if (sched_feat(NUMA_BALANCE_ALL))
+		return 1;
+
 	cpumask_copy(cpus, cpu_active_mask);
 	max_lb_iterations = cpumask_weight(env.dst_grpmask);
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index c868a66..2529f05 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -63,9 +63,9 @@ SCHED_FEAT(NONTASK_POWER, true)
  */
 SCHED_FEAT(TTWU_QUEUE, true)
 
-SCHED_FEAT(FORCE_SD_OVERLAP, false)
-SCHED_FEAT(RT_RUNTIME_SHARE, true)
-SCHED_FEAT(LB_MIN, false)
+SCHED_FEAT(FORCE_SD_OVERLAP,		false)
+SCHED_FEAT(RT_RUNTIME_SHARE,		true)
+SCHED_FEAT(LB_MIN,			false)
 SCHED_FEAT(IDEAL_CPU,			true)
 SCHED_FEAT(IDEAL_CPU_THREAD_BIAS,	false)
 SCHED_FEAT(PUSH_PRIVATE_BUDDIES,	true)
@@ -74,8 +74,14 @@ SCHED_FEAT(WAKE_ON_IDEAL_CPU,		false)
 
 #ifdef CONFIG_NUMA_BALANCING
 /* Do the working set probing faults: */
-SCHED_FEAT(NUMA,             true)
-SCHED_FEAT(NUMA_FAULTS_UP,   false)
-SCHED_FEAT(NUMA_FAULTS_DOWN, false)
-SCHED_FEAT(NUMA_SETTLE,      true)
+SCHED_FEAT(NUMA,			true)
+SCHED_FEAT(NUMA_FAULTS_UP,		false)
+SCHED_FEAT(NUMA_FAULTS_DOWN,		false)
+SCHED_FEAT(NUMA_SETTLE,			false)
+SCHED_FEAT(NUMA_BALANCE_ALL,		false)
+SCHED_FEAT(NUMA_BALANCE_INTERNODE,		false)
+SCHED_FEAT(NUMA_LB,			false)
+SCHED_FEAT(NUMA_GROUP_LB_COMPRESS,	true)
+SCHED_FEAT(NUMA_GROUP_LB_SPREAD,	true)
+SCHED_FEAT(MIGRATE_FAULT_STATS,		false)
 #endif
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 06/10] sched: Remove statistical NUMA scheduling
  2012-11-30 19:58 [PATCH 00/10] Latest numa/core release, v18 Ingo Molnar
                   ` (4 preceding siblings ...)
  2012-11-30 19:58 ` [PATCH 05/10] sched: Introduce directed NUMA convergence Ingo Molnar
@ 2012-11-30 19:58 ` Ingo Molnar
  2012-11-30 19:58 ` [PATCH 07/10] sched: Track quality and strength of convergence Ingo Molnar
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 39+ messages in thread
From: Ingo Molnar @ 2012-11-30 19:58 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

Remove leftovers of the (now inactive) statistical NUMA scheduling code.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h   |   2 -
 kernel/sched/core.c     |   1 -
 kernel/sched/fair.c     | 436 +-----------------------------------------------
 kernel/sched/features.h |   3 -
 kernel/sched/sched.h    |   3 -
 5 files changed, 6 insertions(+), 439 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3bc69b7..8eeb866 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1507,10 +1507,8 @@ struct task_struct {
 	int numa_max_node;
 	int numa_scan_seq;
 	unsigned long numa_scan_ts_secs;
-	int numa_migrate_seq;
 	unsigned int numa_scan_period;
 	u64 node_stamp;			/* migration stamp  */
-	unsigned long numa_weight;
 	unsigned long *numa_faults;
 	unsigned long *numa_faults_curr;
 	struct callback_head numa_scan_work;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 48f69a0..c5a707c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1556,7 +1556,6 @@ static void __sched_fork(struct task_struct *p)
 	p->numa_shared = -1;
 	p->node_stamp = 0ULL;
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
-	p->numa_migrate_seq = 2;
 	p->numa_faults = NULL;
 	p->numa_scan_period = sysctl_sched_numa_scan_delay;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 417c7bb..7af89b7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -801,26 +801,6 @@ static unsigned long task_h_load(struct task_struct *p);
 #endif
 
 #ifdef CONFIG_NUMA_BALANCING
-static void account_numa_enqueue(struct rq *rq, struct task_struct *p)
-{
-	if (task_numa_shared(p) != -1) {
-		p->numa_weight = task_h_load(p);
-		rq->nr_numa_running++;
-		rq->nr_shared_running += task_numa_shared(p);
-		rq->nr_ideal_running += (cpu_to_node(task_cpu(p)) == p->numa_max_node);
-		rq->numa_weight += p->numa_weight;
-	}
-}
-
-static void account_numa_dequeue(struct rq *rq, struct task_struct *p)
-{
-	if (task_numa_shared(p) != -1) {
-		rq->nr_numa_running--;
-		rq->nr_shared_running -= task_numa_shared(p);
-		rq->nr_ideal_running -= (cpu_to_node(task_cpu(p)) == p->numa_max_node);
-		rq->numa_weight -= p->numa_weight;
-	}
-}
 
 /*
  * Scan @scan_size MB every @scan_period after an initial @scan_delay.
@@ -835,11 +815,6 @@ unsigned int sysctl_sched_numa_scan_size = 256;		/* MB */
  */
 unsigned int sysctl_sched_numa_settle_count = 2;
 
-static void task_numa_migrate(struct task_struct *p, int next_cpu)
-{
-	p->numa_migrate_seq = 0;
-}
-
 static int task_ideal_cpu(struct task_struct *p)
 {
 	if (!sched_feat(IDEAL_CPU))
@@ -2041,8 +2016,6 @@ static void task_numa_placement_tick(struct task_struct *p)
 	}
 
 	if (shared != task_numa_shared(p) || (ideal_node != -1 && ideal_node != p->numa_max_node)) {
-
-		p->numa_migrate_seq = 0;
 		/*
 		 * Fix up node migration fault statistics artifact, as we
 		 * migrate to another node we'll soon bring over our private
@@ -2227,13 +2200,6 @@ void task_numa_scan_work(struct callback_head *work)
 	if (p->flags & PF_EXITING)
 		return;
 
-	p->numa_migrate_seq++;
-	if (sched_feat(NUMA_SETTLE) &&
-	    p->numa_migrate_seq < sysctl_sched_numa_settle_count) {
-		trace_printk("NUMA TICK: placement, return to let it settle, task %s:%d\n", p->comm, p->pid);
-		return;
-	}
-
 	/*
 	 * Enforce maximal scan/migration frequency..
 	 */
@@ -2420,11 +2386,8 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
 #else /* !CONFIG_NUMA_BALANCING: */
 #ifdef CONFIG_SMP
 static inline int task_ideal_cpu(struct task_struct *p)				{ return -1; }
-static inline void account_numa_enqueue(struct rq *rq, struct task_struct *p)	{ }
 #endif
-static inline void account_numa_dequeue(struct rq *rq, struct task_struct *p)	{ }
 static inline void task_tick_numa(struct rq *rq, struct task_struct *curr)	{ }
-static inline void task_numa_migrate(struct task_struct *p, int next_cpu)	{ }
 #endif /* CONFIG_NUMA_BALANCING */
 
 /**************************************************
@@ -2441,7 +2404,6 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	if (entity_is_task(se)) {
 		struct rq *rq = rq_of(cfs_rq);
 
-		account_numa_enqueue(rq, task_of(se));
 		list_add(&se->group_node, &rq->cfs_tasks);
 	}
 #endif /* CONFIG_SMP */
@@ -2454,10 +2416,9 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	update_load_sub(&cfs_rq->load, se->load.weight);
 	if (!parent_entity(se))
 		update_load_sub(&rq_of(cfs_rq)->load, se->load.weight);
-	if (entity_is_task(se)) {
+	if (entity_is_task(se))
 		list_del_init(&se->group_node);
-		account_numa_dequeue(rq_of(cfs_rq), task_of(se));
-	}
+
 	cfs_rq->nr_running--;
 }
 
@@ -4892,7 +4853,6 @@ static void
 migrate_task_rq_fair(struct task_struct *p, int next_cpu)
 {
 	migrate_task_rq_entity(p, next_cpu);
-	task_numa_migrate(p, next_cpu);
 }
 #endif /* CONFIG_SMP */
 
@@ -5268,9 +5228,6 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
 #define LBF_ALL_PINNED	0x01
 #define LBF_NEED_BREAK	0x02
 #define LBF_SOME_PINNED	0x04
-#define LBF_NUMA_RUN	0x08
-#define LBF_NUMA_SHARED	0x10
-#define LBF_KEEP_SHARED	0x20
 
 struct lb_env {
 	struct sched_domain	*sd;
@@ -5313,82 +5270,6 @@ static void move_task(struct task_struct *p, struct lb_env *env)
 	check_preempt_curr(env->dst_rq, p, 0);
 }
 
-#ifdef CONFIG_NUMA_BALANCING
-
-static inline unsigned long task_node_faults(struct task_struct *p, int node)
-{
-	return p->numa_faults[2*node] + p->numa_faults[2*node + 1];
-}
-
-static int task_faults_down(struct task_struct *p, struct lb_env *env)
-{
-	int src_node, dst_node, node, down_node = -1;
-	unsigned long faults, src_faults, max_faults = 0;
-
-	if (!sched_feat_numa(NUMA_FAULTS_DOWN) || !p->numa_faults)
-		return 1;
-
-	src_node = cpu_to_node(env->src_cpu);
-	dst_node = cpu_to_node(env->dst_cpu);
-
-	if (src_node == dst_node)
-		return 1;
-
-	src_faults = task_node_faults(p, src_node);
-
-	for (node = 0; node < nr_node_ids; node++) {
-		if (node == src_node)
-			continue;
-
-		faults = task_node_faults(p, node);
-
-		if (faults > max_faults && faults <= src_faults) {
-			max_faults = faults;
-			down_node = node;
-		}
-	}
-
-	if (down_node == dst_node)
-		return 1; /* move towards the next node down */
-
-	return 0; /* stay here */
-}
-
-static int task_faults_up(struct task_struct *p, struct lb_env *env)
-{
-	unsigned long src_faults, dst_faults;
-	int src_node, dst_node;
-
-	if (!sched_feat_numa(NUMA_FAULTS_UP) || !p->numa_faults)
-		return 0; /* can't say it improved */
-
-	src_node = cpu_to_node(env->src_cpu);
-	dst_node = cpu_to_node(env->dst_cpu);
-
-	if (src_node == dst_node)
-		return 0; /* pointless, don't do that */
-
-	src_faults = task_node_faults(p, src_node);
-	dst_faults = task_node_faults(p, dst_node);
-
-	if (dst_faults > src_faults)
-		return 1; /* move to dst */
-
-	return 0; /* stay where we are */
-}
-
-#else /* !CONFIG_NUMA_BALANCING: */
-static inline int task_faults_up(struct task_struct *p, struct lb_env *env)
-{
-	return 0;
-}
-
-static inline int task_faults_down(struct task_struct *p, struct lb_env *env)
-{
-	return 0;
-}
-#endif
-
 /*
  * Is this task likely cache-hot:
  */
@@ -5469,77 +5350,6 @@ static bool can_migrate_running_task(struct task_struct *p, struct lb_env *env)
 }
 
 /*
- * Can we migrate a NUMA task? The rules are rather involved:
- */
-static bool can_migrate_numa_task(struct task_struct *p, struct lb_env *env)
-{
-	/*
-	 * iteration:
-	 *   0		   -- only allow improvement, or !numa
-	 *   1		   -- + worsen !ideal
-	 *   2                         priv
-	 *   3                         shared (everything)
-	 *
-	 * NUMA_HOT_DOWN:
-	 *   1 .. nodes    -- allow getting worse by step
-	 *   nodes+1	   -- punt, everything goes!
-	 *
-	 * LBF_NUMA_RUN    -- numa only, only allow improvement
-	 * LBF_NUMA_SHARED -- shared only
-	 * LBF_NUMA_IDEAL  -- ideal only
-	 *
-	 * LBF_KEEP_SHARED -- do not touch shared tasks
-	 */
-
-	/* a numa run can only move numa tasks about to improve things */
-	if (env->flags & LBF_NUMA_RUN) {
-		if (task_numa_shared(p) < 0 && task_ideal_cpu(p) < 0)
-			return false;
-
-		/* If we are only allowed to pull shared tasks: */
-		if ((env->flags & LBF_NUMA_SHARED) && !task_numa_shared(p))
-			return false;
-	} else {
-		if (task_numa_shared(p) < 0)
-			goto try_migrate;
-	}
-
-	/* can not move shared tasks */
-	if ((env->flags & LBF_KEEP_SHARED) && task_numa_shared(p) == 1)
-		return false;
-
-	if (task_faults_up(p, env))
-		return true; /* memory locality beats cache hotness */
-
-	if (env->iteration < 1)
-		return false;
-
-#ifdef CONFIG_NUMA_BALANCING
-	if (p->numa_max_node != cpu_to_node(task_cpu(p))) /* !ideal */
-		goto demote;
-#endif
-
-	if (env->iteration < 2)
-		return false;
-
-	if (task_numa_shared(p) == 0) /* private */
-		goto demote;
-
-	if (env->iteration < 3)
-		return false;
-
-demote:
-	if (env->iteration < 5)
-		return task_faults_down(p, env);
-
-try_migrate:
-	if (env->failed > env->sd->cache_nice_tries)
-		return true;
-
-	return !task_hot(p, env);
-}
-
-/*
  * can_migrate_task() - may task p from runqueue rq be migrated to this_cpu?
  */
 static int can_migrate_task(struct task_struct *p, struct lb_env *env)
@@ -5559,7 +5369,7 @@ static int can_migrate_task(struct task_struct *p, struct lb_env *env)
 
 #ifdef CONFIG_NUMA_BALANCING
 	/* If we are only allowed to pull ideal tasks: */
-	if ((task_ideal_cpu(p) >= 0) && (p->shared_buddy_faults > 1000)) {
+	if (0 && (task_ideal_cpu(p) >= 0) && (p->shared_buddy_faults > 1000)) {
 		int ideal_node;
 		int dst_node;
 
@@ -5575,9 +5385,6 @@ static int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	}
 #endif
 
-	if (env->sd->flags & SD_NUMA)
-		return can_migrate_numa_task(p, env);
-
 	if (env->failed > env->sd->cache_nice_tries)
 		return true;
 
@@ -5867,24 +5674,6 @@ struct sd_lb_stats {
 	unsigned int  busiest_group_weight;
 
 	int group_imb; /* Is there imbalance in this sd */
-
-#ifdef CONFIG_NUMA_BALANCING
-	unsigned long this_numa_running;
-	unsigned long this_numa_weight;
-	unsigned long this_shared_running;
-	unsigned long this_ideal_running;
-	unsigned long this_group_capacity;
-
-	struct sched_group *numa;
-	unsigned long numa_load;
-	unsigned long numa_nr_running;
-	unsigned long numa_numa_running;
-	unsigned long numa_shared_running;
-	unsigned long numa_ideal_running;
-	unsigned long numa_numa_weight;
-	unsigned long numa_group_capacity;
-	unsigned int  numa_has_capacity;
-#endif
 };
 
 /*
@@ -5900,13 +5689,6 @@ struct sg_lb_stats {
 	unsigned long group_weight;
 	int group_imb; /* Is there an imbalance in the group ? */
 	int group_has_capacity; /* Is there extra capacity in the group? */
-
-#ifdef CONFIG_NUMA_BALANCING
-	unsigned long sum_ideal_running;
-	unsigned long sum_numa_running;
-	unsigned long sum_numa_weight;
-#endif
-	unsigned long sum_shared_running;	/* 0 on non-NUMA */
 };
 
 /**
@@ -5935,158 +5717,6 @@ static inline int get_sd_load_idx(struct sched_domain *sd,
 	return load_idx;
 }
 
-#ifdef CONFIG_NUMA_BALANCING
-
-static inline bool pick_numa_rand(int n)
-{
-	return !(get_random_int() % n);
-}
-
-static inline void update_sg_numa_stats(struct sg_lb_stats *sgs, struct rq *rq)
-{
-	sgs->sum_ideal_running += rq->nr_ideal_running;
-	sgs->sum_shared_running += rq->nr_shared_running;
-	sgs->sum_numa_running += rq->nr_numa_running;
-	sgs->sum_numa_weight += rq->numa_weight;
-}
-
-static inline
-void update_sd_numa_stats(struct sched_domain *sd, struct sched_group *sg,
-			  struct sd_lb_stats *sds, struct sg_lb_stats *sgs,
-			  int local_group)
-{
-	if (!(sd->flags & SD_NUMA))
-		return;
-
-	if (local_group) {
-		sds->this_numa_running   = sgs->sum_numa_running;
-		sds->this_numa_weight    = sgs->sum_numa_weight;
-		sds->this_shared_running = sgs->sum_shared_running;
-		sds->this_ideal_running  = sgs->sum_ideal_running;
-		sds->this_group_capacity = sgs->group_capacity;
-
-	} else if (sgs->sum_numa_running - sgs->sum_ideal_running) {
-		if (!sds->numa || pick_numa_rand(sd->span_weight / sg->group_weight)) {
-			sds->numa = sg;
-			sds->numa_load		 = sgs->avg_load;
-			sds->numa_nr_running     = sgs->sum_nr_running;
-			sds->numa_numa_running   = sgs->sum_numa_running;
-			sds->numa_shared_running = sgs->sum_shared_running;
-			sds->numa_ideal_running  = sgs->sum_ideal_running;
-			sds->numa_numa_weight    = sgs->sum_numa_weight;
-			sds->numa_has_capacity	 = sgs->group_has_capacity;
-			sds->numa_group_capacity = sgs->group_capacity;
-		}
-	}
-}
-
-static struct rq *
-find_busiest_numa_queue(struct lb_env *env, struct sched_group *sg)
-{
-	struct rq *rq, *busiest = NULL;
-	int cpu;
-
-	for_each_cpu_and(cpu, sched_group_cpus(sg), env->cpus) {
-		rq = cpu_rq(cpu);
-
-		if (!rq->nr_numa_running)
-			continue;
-
-		if (!(rq->nr_numa_running - rq->nr_ideal_running))
-			continue;
-
-		if ((env->flags & LBF_KEEP_SHARED) && !(rq->nr_running - rq->nr_shared_running))
-			continue;
-
-		if (!busiest || pick_numa_rand(sg->group_weight))
-			busiest = rq;
-	}
-
-	return busiest;
-}
-
-static bool can_do_numa_run(struct lb_env *env, struct sd_lb_stats *sds)
-{
-	/*
-	 * if we're overloaded; don't pull when:
-	 *   - the other guy isn't
-	 *   - imbalance would become too great
-	 */
-	if (!sds->this_has_capacity) {
-		if (sds->numa_has_capacity)
-			return false;
-	}
-
-	/*
-	 * pull if we got easy trade
-	 */
-	if (sds->this_nr_running - sds->this_numa_running)
-		return true;
-
-	/*
-	 * If we got capacity allow stacking up on shared tasks.
-	 */
-	if ((sds->this_shared_running < sds->this_group_capacity) && sds->numa_shared_running) {
-		/* There's no point in trying to move if all are here already: */
-		if (sds->numa_shared_running == sds->this_shared_running)
-			return false;
-
-		env->flags |= LBF_NUMA_SHARED;
-		return true;
-	}
-
-	/*
-	 * pull if we could possibly trade
-	 */
-	if (sds->this_numa_running - sds->this_ideal_running)
-		return true;
-
-	return false;
-}
-
-/*
- * introduce some controlled imbalance to perturb the state so we allow the
- * state to improve should be tightly controlled/co-ordinated with
- * can_migrate_task()
- */
-static int check_numa_busiest_group(struct lb_env *env, struct sd_lb_stats *sds)
-{
-	if (!sched_feat(NUMA_LB))
-		return 0;
-
-	if (!sds->numa || !sds->numa_numa_running)
-		return 0;
-
-	if (!can_do_numa_run(env, sds))
-		return 0;
-
-	env->flags |= LBF_NUMA_RUN;
-	env->flags &= ~LBF_KEEP_SHARED;
-	env->imbalance = sds->numa_numa_weight / sds->numa_numa_running;
-	sds->busiest = sds->numa;
-	env->find_busiest_queue = find_busiest_numa_queue;
-
-	return 1;
-}
-
-#else /* !CONFIG_NUMA_BALANCING: */
-static inline
-void update_sd_numa_stats(struct sched_domain *sd, struct sched_group *sg,
-			  struct sd_lb_stats *sds, struct sg_lb_stats *sgs,
-			  int local_group)
-{
-}
-
-static inline void update_sg_numa_stats(struct sg_lb_stats *sgs, struct rq *rq)
-{
-}
-
-static inline int check_numa_busiest_group(struct lb_env *env, struct sd_lb_stats *sds)
-{
-	return 0;
-}
-#endif
-
 unsigned long default_scale_freq_power(struct sched_domain *sd, int cpu)
 {
 	return SCHED_POWER_SCALE;
@@ -6301,8 +5931,6 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		sgs->sum_nr_running += nr_running;
 		sgs->sum_weighted_load += weighted_cpuload(i);
 
-		update_sg_numa_stats(sgs, rq);
-
 		if (idle_cpu(i))
 			sgs->idle_cpus++;
 	}
@@ -6394,13 +6022,6 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	return false;
 }
 
-static void update_src_keep_shared(struct lb_env *env, bool keep_shared)
-{
-	env->flags &= ~LBF_KEEP_SHARED;
-	if (keep_shared)
-		env->flags |= LBF_KEEP_SHARED;
-}
-
 /**
  * update_sd_lb_stats - Update sched_domain's statistics for load balancing.
  * @env: The load balancing environment.
@@ -6433,23 +6054,6 @@ static inline void update_sd_lb_stats(struct lb_env *env,
 		sds->total_load += sgs.group_load;
 		sds->total_pwr += sg->sgp->power;
 
-#ifdef CONFIG_NUMA_BALANCING
-		/*
-		 * In case the child domain prefers tasks go to siblings
-		 * first, lower the sg capacity to one so that we'll try
-		 * and move all the excess tasks away. We lower the capacity
-		 * of a group only if the local group has the capacity to fit
-		 * these excess tasks, i.e. nr_running < group_capacity. The
-		 * extra check prevents the case where you always pull from the
-		 * heaviest group when it is already under-utilized (possible
-		 * with a large weight task outweighs the tasks on the system).
-		 */
-		if (0 && prefer_sibling && !local_group && sds->this_has_capacity) {
-			sgs.group_capacity = clamp_val(sgs.sum_shared_running,
-					1UL, sgs.group_capacity);
-		}
-#endif
-
 		if (local_group) {
 			sds->this_load = sgs.avg_load;
 			sds->this = sg;
@@ -6467,13 +6071,8 @@ static inline void update_sd_lb_stats(struct lb_env *env,
 			sds->busiest_has_capacity = sgs.group_has_capacity;
 			sds->busiest_group_weight = sgs.group_weight;
 			sds->group_imb = sgs.group_imb;
-
-			update_src_keep_shared(env,
-				sgs.sum_shared_running <= sgs.group_capacity);
 		}
 
-		update_sd_numa_stats(env->sd, sg, sds, &sgs, local_group);
-
 		sg = sg->next;
 	} while (sg != env->sd->groups);
 }
@@ -6765,9 +6364,6 @@ out_imbalanced:
 		goto ret;
 
 out_balanced:
-	if (check_numa_busiest_group(env, &sds))
-		return sds.busiest;
-
 ret:
 	env->imbalance = 0;
 
@@ -6806,9 +6402,6 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		if (capacity && rq->nr_running == 1 && wl > env->imbalance)
 			continue;
 
-		if ((env->flags & LBF_KEEP_SHARED) && !(rq->nr_running - rq->nr_shared_running))
-			continue;
-
 		/*
 		 * For the load comparisons with the other cpu's, consider
 		 * the weighted_cpuload() scaled with the cpu power, so that
@@ -6847,30 +6440,13 @@ static void update_sd_failed(struct lb_env *env, int ld_moved)
 		 * frequent, pollute the failure counter causing
 		 * excessive cache_hot migrations and active balances.
 		 */
-		if (env->idle != CPU_NEWLY_IDLE && !(env->flags & LBF_NUMA_RUN))
+		if (env->idle != CPU_NEWLY_IDLE)
 			env->sd->nr_balance_failed++;
 	} else
 		env->sd->nr_balance_failed = 0;
 }
 
 /*
- * See can_migrate_numa_task()
- */
-static int lb_max_iteration(struct lb_env *env)
-{
-	if (!(env->sd->flags & SD_NUMA))
-		return 0;
-
-	if (env->flags & LBF_NUMA_RUN)
-		return 0; /* NUMA_RUN may only improve */
-
-	if (sched_feat_numa(NUMA_FAULTS_DOWN))
-		return 5; /* nodes^2 would suck */
-
-	return 3;
-}
-
-/*
  * Check this_cpu to ensure it is balanced within domain. Attempt to move
  * tasks if there is an imbalance.
  */
@@ -7006,7 +6582,7 @@ more_balance:
 		if (unlikely(env.flags & LBF_ALL_PINNED))
 			goto out_pinned;
 
-		if (!ld_moved && env.iteration < lb_max_iteration(&env)) {
+		if (!ld_moved && env.iteration < 3) {
 			env.iteration++;
 			env.loop = 0;
 			goto more_balance;
@@ -7192,7 +6768,7 @@ static int active_load_balance_cpu_stop(void *data)
 			.failed		= busiest_rq->ab_failed,
 			.idle		= busiest_rq->ab_idle,
 		};
-		env.iteration = lb_max_iteration(&env);
+		env.iteration = 3;
 
 		schedstat_inc(sd, alb_count);
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 2529f05..fd9db0b 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -75,9 +75,6 @@ SCHED_FEAT(WAKE_ON_IDEAL_CPU,		false)
 #ifdef CONFIG_NUMA_BALANCING
 /* Do the working set probing faults: */
 SCHED_FEAT(NUMA,			true)
-SCHED_FEAT(NUMA_FAULTS_UP,		false)
-SCHED_FEAT(NUMA_FAULTS_DOWN,		false)
-SCHED_FEAT(NUMA_SETTLE,			false)
 SCHED_FEAT(NUMA_BALANCE_ALL,		false)
 SCHED_FEAT(NUMA_BALANCE_INTERNODE,		false)
 SCHED_FEAT(NUMA_LB,			false)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7e53cbf..ca92adf 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -438,9 +438,6 @@ struct rq {
 	struct list_head cfs_tasks;
 
 #ifdef CONFIG_NUMA_BALANCING
-	unsigned long numa_weight;
-	unsigned long nr_numa_running;
-	unsigned long nr_ideal_running;
 	struct task_struct *curr_buddy;
 #endif
 	unsigned long nr_shared_running;	/* 0 on non-NUMA */
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 07/10] sched: Track quality and strength of convergence
  2012-11-30 19:58 [PATCH 00/10] Latest numa/core release, v18 Ingo Molnar
                   ` (5 preceding siblings ...)
  2012-11-30 19:58 ` [PATCH 06/10] sched: Remove statistical NUMA scheduling Ingo Molnar
@ 2012-11-30 19:58 ` Ingo Molnar
  2012-11-30 19:58 ` [PATCH 08/10] sched: Converge NUMA migrations Ingo Molnar
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 39+ messages in thread
From: Ingo Molnar @ 2012-11-30 19:58 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

Track strength of convergence, which is a value between 1 and 1024.
This will be used by the placement logic later on.

A strength value of 1024 means that the workload has fully
converged: all faults after the last scan period came from a
single node.

A value of 1024/nr_nodes means a totally spread out working set.

'max_faults' is the number of faults observed on the highest-faulting node.
'sum_faults' is the sum of all faults from the last scan, averaged over ~16 periods.

The goal of the scheduler is to maximize convergence system-wide.
Once a task has converged, it carries with it a non-trivial amount
of working set. If such a task is migrated to another node later
on then its working set will migrate there as well, which is a
non-trivial cost.

So the ultimate goal of NUMA scheduling is to let as many tasks
converge as possible, and to run them as close to their memory
as possible.

( Note: we could also sample migration activities to directly measure
  how much convergence influx there is. )
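
For illustration, a minimal user-space sketch of the calculation
above - the per-node fault counts are made-up numbers, not measured
data:

  #include <stdio.h>

  int main(void)
  {
          /* NUMA hinting faults per node from the last scan (made up): */
          unsigned long faults[4] = { 80, 10, 5, 5 };
          unsigned long max_faults = 0, sum_faults = 0, strength;
          int i, max_node = -1;

          for (i = 0; i < 4; i++) {
                  sum_faults += faults[i];
                  if (faults[i] > max_faults) {
                          max_faults = faults[i];
                          max_node = i;
                  }
          }

          /* 1024 == fully converged, 1024/nr_nodes == fully spread out: */
          strength = 1024UL * max_faults / sum_faults;

          /* 819 here - below the ~90% threshold of 921 used by the patch: */
          printf("max_node=%d strength=%lu converged=%d\n",
                 max_node, strength, strength >= 921);

          return 0;
  }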

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h |  2 ++
 kernel/sched/core.c   |  2 ++
 kernel/sched/fair.c   | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 50 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8eeb866..5b2cf2e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1509,6 +1509,8 @@ struct task_struct {
 	unsigned long numa_scan_ts_secs;
 	unsigned int numa_scan_period;
 	u64 node_stamp;			/* migration stamp  */
+	unsigned long convergence_strength;
+	int convergence_node;
 	unsigned long *numa_faults;
 	unsigned long *numa_faults_curr;
 	struct callback_head numa_scan_work;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c5a707c..47b14d1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1555,6 +1555,8 @@ static void __sched_fork(struct task_struct *p)
 
 	p->numa_shared = -1;
 	p->node_stamp = 0ULL;
+	p->convergence_strength		= 0;
+	p->convergence_node		= -1;
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
 	p->numa_faults = NULL;
 	p->numa_scan_period = sysctl_sched_numa_scan_delay;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7af89b7..1f6104a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1934,6 +1934,50 @@ clear_buddy:
 }
 
 /*
+ * Update the p->convergence_strength info, which is a value between 1 and 1024.
+ *
+ * A strength value of 1024 means that the workload has fully
+ * converged, all faults after the last scan period came from a
+ * single node.
+ *
+ * A value of 1024/nr_nodes means a totally spread out working set.
+ *
+ * 'max_faults' is the number of faults observed on the highest-faulting node.
+ * 'sum_faults' are all faults from the last scan, averaged over ~8 periods.
+ *
+ * The goal of the scheduler is to maximize convergence system-wide.
+ * Once a task has converged, it carries with it a non-trivial amount
+ * of working set. If such a task is migrated to another node later
+ * on then its working set will migrate there as well, which is a
+ * non-trivial cost.
+ *
+ * So the ultimate goal of NUMA scheduling is to let as many tasks
+ * converge as possible, and to run them as close to their memory
+ * as possible.
+ *
+ * ( Note: we could also sample migration activities to directly measure
+ *   how much convergence influx there is. )
+ */
+static void
+shared_fault_calc_convergence(struct task_struct *p, int max_node,
+			      unsigned long max_faults, unsigned long sum_faults)
+{
+	/*
+	 * If sum_faults is 0 then leave the convergence alone:
+	 */
+	if (sum_faults) {
+		p->convergence_strength = 1024L * max_faults / sum_faults;
+
+		if (p->convergence_strength >= 921) {
+			WARN_ON_ONCE(max_node == -1);
+			p->convergence_node = max_node;
+		} else {
+			p->convergence_node = -1;
+		}
+	}
+}
+
+/*
  * Called every couple of hundred milliseconds in the task's
  * execution life-time, this function decides whether to
  * change placement parameters:
@@ -1974,6 +2018,8 @@ static void task_numa_placement_tick(struct task_struct *p)
 		}
 	}
 
+	shared_fault_calc_convergence(p, ideal_node, max_faults, total[0] + total[1]);
+
 	shared_fault_full_scan_done(p);
 
 	/*
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 08/10] sched: Converge NUMA migrations
  2012-11-30 19:58 [PATCH 00/10] Latest numa/core release, v18 Ingo Molnar
                   ` (6 preceding siblings ...)
  2012-11-30 19:58 ` [PATCH 07/10] sched: Track quality and strength of convergence Ingo Molnar
@ 2012-11-30 19:58 ` Ingo Molnar
  2012-11-30 19:58 ` [PATCH 09/10] sched: Add convergence strength based adaptive NUMA page fault rate Ingo Molnar
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 39+ messages in thread
From: Ingo Molnar @ 2012-11-30 19:58 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

Consolidate the various convergence models and add a new one: when
a strongly converged NUMA task migrates, prefer to migrate it in
the direction of its preferred node.
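
Condensed to its core, the convergence-directed rule added below in
numa_allow_migration() is (sketch of the NUMA_CONVERGE_MIGRATIONS
case only):

  /* A strongly converged task (convergence_node >= 0) is only
   * allowed to move towards its convergence node; unconverged
   * tasks (convergence_node == -1) are not restricted:
   */
  if (p->convergence_node >= 0 &&
      cpu_to_node(new_cpu) != p->convergence_node)
          return false;

  return true;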

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c     | 59 +++++++++++++++++++++++++++++++++++--------------
 kernel/sched/features.h |  3 ++-
 2 files changed, 44 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1f6104a..10cbfa3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4750,6 +4750,35 @@ done:
 	return target;
 }
 
+static bool numa_allow_migration(struct task_struct *p, int prev_cpu, int new_cpu)
+{
+#ifdef CONFIG_NUMA_BALANCING
+	if (sched_feat(NUMA_CONVERGE_MIGRATIONS)) {
+		/* Help in the direction of expected convergence: */
+		if (p->convergence_node >= 0 && (cpu_to_node(new_cpu) != p->convergence_node))
+			return false;
+
+		return true;
+	}
+
+	if (sched_feat(NUMA_BALANCE_ALL)) {
+ 		if (task_numa_shared(p) >= 0)
+			return false;
+
+		return true;
+	}
+
+	if (sched_feat(NUMA_BALANCE_INTERNODE)) {
+		if (task_numa_shared(p) >= 0) {
+ 			if (cpu_to_node(prev_cpu) != cpu_to_node(new_cpu))
+				return false;
+		}
+	}
+#endif
+	return true;
+}
+
+
 /*
  * sched_balance_self: balance the current task (running on cpu) in domains
  * that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
@@ -4766,7 +4795,8 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 {
 	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
 	int cpu = smp_processor_id();
-	int prev_cpu = task_cpu(p);
+	int prev0_cpu = task_cpu(p);
+	int prev_cpu = prev0_cpu;
 	int new_cpu = cpu;
 	int want_affine = 0;
 	int sync = wake_flags & WF_SYNC;
@@ -4775,10 +4805,6 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 		return prev_cpu;
 
 #ifdef CONFIG_NUMA_BALANCING
-	/* We do NUMA balancing elsewhere: */
-	if (sched_feat(NUMA_BALANCE_ALL) && task_numa_shared(p) >= 0)
-		return prev_cpu;
-
 	if (sched_feat(WAKE_ON_IDEAL_CPU) && p->ideal_cpu >= 0)
 		return p->ideal_cpu;
 #endif
@@ -4857,8 +4883,8 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 unlock:
 	rcu_read_unlock();
 
-	if (sched_feat(NUMA_BALANCE_INTERNODE) && task_numa_shared(p) >= 0 && (cpu_to_node(prev_cpu) != cpu_to_node(new_cpu)))
-		return prev_cpu;
+	if (!numa_allow_migration(p, prev0_cpu, new_cpu))
+		return prev0_cpu;
 
 	return new_cpu;
 }
@@ -5401,8 +5427,11 @@ static bool can_migrate_running_task(struct task_struct *p, struct lb_env *env)
 static int can_migrate_task(struct task_struct *p, struct lb_env *env)
 {
 	/* We do NUMA balancing elsewhere: */
-	if (sched_feat(NUMA_BALANCE_ALL) && task_numa_shared(p) > 0 && env->failed <= env->sd->cache_nice_tries)
-		return false;
+
+	if (env->failed <= env->sd->cache_nice_tries) {
+		if (!numa_allow_migration(p, env->src_rq->cpu, env->dst_cpu))
+			return false;
+	}
 
 	if (!can_migrate_pinned_task(p, env))
 		return false;
@@ -5461,10 +5490,7 @@ static int move_one_task(struct lb_env *env)
 		if (!can_migrate_task(p, env))
 			continue;
 
-		if (sched_feat(NUMA_BALANCE_ALL) && task_numa_shared(p) >= 0)
-			continue;
-
-		if (sched_feat(NUMA_BALANCE_INTERNODE) && task_numa_shared(p) >= 0 && (cpu_to_node(env->src_rq->cpu) != cpu_to_node(env->dst_cpu)))
+		if (!numa_allow_migration(p, env->src_rq->cpu, env->dst_cpu))
 			continue;
 
 		move_task(p, env);
@@ -5527,10 +5553,7 @@ static int move_tasks(struct lb_env *env)
 		if (!can_migrate_task(p, env))
 			goto next;
 
-		if (sched_feat(NUMA_BALANCE_ALL) && task_numa_shared(p) >= 0)
-			continue;
-
-		if (sched_feat(NUMA_BALANCE_INTERNODE) && task_numa_shared(p) >= 0 && (cpu_to_node(env->src_rq->cpu) != cpu_to_node(env->dst_cpu)))
+		if (!numa_allow_migration(p, env->src_rq->cpu, env->dst_cpu))
 			goto next;
 
 		move_task(p, env);
@@ -6520,8 +6543,10 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		.iteration          = 0,
 	};
 
+#ifdef CONFIG_NUMA_BALANCING
 	if (sched_feat(NUMA_BALANCE_ALL))
 		return 1;
+#endif
 
 	cpumask_copy(cpus, cpu_active_mask);
 	max_lb_iterations = cpumask_weight(env.dst_grpmask);
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index fd9db0b..9075faf 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -76,9 +76,10 @@ SCHED_FEAT(WAKE_ON_IDEAL_CPU,		false)
 /* Do the working set probing faults: */
 SCHED_FEAT(NUMA,			true)
 SCHED_FEAT(NUMA_BALANCE_ALL,		false)
-SCHED_FEAT(NUMA_BALANCE_INTERNODE,		false)
+SCHED_FEAT(NUMA_BALANCE_INTERNODE,	false)
 SCHED_FEAT(NUMA_LB,			false)
 SCHED_FEAT(NUMA_GROUP_LB_COMPRESS,	true)
 SCHED_FEAT(NUMA_GROUP_LB_SPREAD,	true)
 SCHED_FEAT(MIGRATE_FAULT_STATS,		false)
+SCHED_FEAT(NUMA_CONVERGE_MIGRATIONS,	true)
 #endif
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 09/10] sched: Add convergence strength based adaptive NUMA page fault rate
  2012-11-30 19:58 [PATCH 00/10] Latest numa/core release, v18 Ingo Molnar
                   ` (7 preceding siblings ...)
  2012-11-30 19:58 ` [PATCH 08/10] sched: Converge NUMA migrations Ingo Molnar
@ 2012-11-30 19:58 ` Ingo Molnar
  2012-11-30 19:58 ` [PATCH 10/10] sched: Refine the 'shared tasks' memory interleaving logic Ingo Molnar
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 39+ messages in thread
From: Ingo Molnar @ 2012-11-30 19:58 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

Mel Gorman reported that the NUMA code is system-time intensive even
after a workload has converged.

To remedy this, turn sched_numa_scan_size into a range:

   sched_numa_scan_size_min        [default:  32 MB]
   sched_numa_scan_size_max        [default: 512 MB]

As workloads converge, their scanning activity is reduced accordingly.
If they unconverge again - for example because the system load
changes - their scanning picks up again.
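
The effective scan size is interpolated linearly between the two
limits based on the task's convergence strength (see
task_numa_scan_work() below). With the defaults, for example:

   scan_size = scan_size_min +
               (scan_size_max - scan_size_min) * (1024 - convergence_strength) / 1024

   convergence_strength == 1024 (fully converged):        32 MB per scan
   convergence_strength ==  256 (spread over 4 nodes):   392 MB per scan
   convergence_strength ==    0 (no fault data yet):     512 MB per scan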

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h |  3 ++-
 kernel/sched/fair.c   | 57 +++++++++++++++++++++++++++++++++++++++++++--------
 kernel/sysctl.c       | 11 ++++++++--
 3 files changed, 59 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5b2cf2e..ce834e7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2057,7 +2057,8 @@ extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
 extern unsigned int sysctl_sched_numa_scan_delay;
 extern unsigned int sysctl_sched_numa_scan_period_min;
 extern unsigned int sysctl_sched_numa_scan_period_max;
-extern unsigned int sysctl_sched_numa_scan_size;
+extern unsigned int sysctl_sched_numa_scan_size_min;
+extern unsigned int sysctl_sched_numa_scan_size_max;
 extern unsigned int sysctl_sched_numa_settle_count;
 
 #ifdef CONFIG_SCHED_DEBUG
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 10cbfa3..9262692 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -805,15 +805,17 @@ static unsigned long task_h_load(struct task_struct *p);
 /*
  * Scan @scan_size MB every @scan_period after an initial @scan_delay.
  */
-unsigned int sysctl_sched_numa_scan_delay = 1000;	/* ms */
-unsigned int sysctl_sched_numa_scan_period_min = 100;	/* ms */
-unsigned int sysctl_sched_numa_scan_period_max = 100*16;/* ms */
-unsigned int sysctl_sched_numa_scan_size = 256;		/* MB */
+unsigned int sysctl_sched_numa_scan_delay	__read_mostly = 1000;	/* ms */
+unsigned int sysctl_sched_numa_scan_period_min	__read_mostly = 100;	/* ms */
+unsigned int sysctl_sched_numa_scan_period_max	__read_mostly = 100*16;	/* ms */
+
+unsigned int sysctl_sched_numa_scan_size_min	__read_mostly =  32;	/* MB */
+unsigned int sysctl_sched_numa_scan_size_max	__read_mostly = 512;	/* MB */
 
 /*
  * Wait for the 2-sample stuff to settle before migrating again
  */
-unsigned int sysctl_sched_numa_settle_count = 2;
+unsigned int sysctl_sched_numa_settle_count	__read_mostly = 2;
 
 static int task_ideal_cpu(struct task_struct *p)
 {
@@ -2077,9 +2079,15 @@ static void task_numa_placement_tick(struct task_struct *p)
 			p->numa_faults[idx_oldnode] = 0;
 		}
 		sched_setnuma(p, ideal_node, shared);
+		/*
+		 * We changed a node, start scanning more frequently again
+		 * to map out the working set:
+		 */
+		p->numa_scan_period = sysctl_sched_numa_scan_period_min;
 	} else {
 		/* node unchanged, back off: */
-		p->numa_scan_period = min(p->numa_scan_period * 2, sysctl_sched_numa_scan_period_max);
+		p->numa_scan_period = min(p->numa_scan_period*2,
+						sysctl_sched_numa_scan_period_max);
 	}
 
 	this_cpu = task_cpu(p);
@@ -2238,6 +2246,7 @@ void task_numa_scan_work(struct callback_head *work)
 	struct task_struct *p = current;
 	struct mm_struct *mm = p->mm;
 	struct vm_area_struct *vma;
+	long pages_min, pages_max;
 
 	WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_scan_work));
 
@@ -2260,10 +2269,40 @@ void task_numa_scan_work(struct callback_head *work)
 	current->numa_scan_period += jiffies_to_msecs(2);
 
 	start0 = start = end = mm->numa_scan_offset;
-	pages_total = sysctl_sched_numa_scan_size;
-	pages_total <<= 20 - PAGE_SHIFT; /* MB in pages */
-	if (!pages_total)
+
+	pages_max = sysctl_sched_numa_scan_size_max;
+	pages_max <<= 20 - PAGE_SHIFT; /* MB in pages */
+	if (!pages_max)
+		return;
+
+	pages_min = sysctl_sched_numa_scan_size_min;
+	pages_min <<= 20 - PAGE_SHIFT; /* MB in pages */
+	if (!pages_min)
+		return;
+
+	if (WARN_ON_ONCE(p->convergence_strength < 0 || p->convergence_strength > 1024))
 		return;
+	if (WARN_ON_ONCE(pages_min > pages_max))
+		return;
+
+	/*
+	 * Convergence strength is a number in the range of
+	 * 0 ... 1024.
+	 *
+	 * As tasks converge, scale down our scanning to the minimum
+	 * of the allowed range. Shortly after they get unsettled
+	 * (because the workload changes or the system is loaded
+	 * differently), scanning revs up again.
+	 *
+	 * The important thing is that when the system is in an
+	 * equilibrium, we do the minimum amount of scanning.
+	 */
+
+	pages_total = pages_min;
+	pages_total += (pages_max - pages_min)*(1024-p->convergence_strength)/1024;
+
+	pages_total = max(pages_min, pages_total);
+	pages_total = min(pages_max, pages_total);
 
 	sum_pages_scanned = 0;
 	pages_left = pages_total;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 6d2fe5b..b6ddfae 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -374,8 +374,15 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 	{
-		.procname	= "sched_numa_scan_size_mb",
-		.data		= &sysctl_sched_numa_scan_size,
+		.procname	= "sched_numa_scan_size_min_mb",
+		.data		= &sysctl_sched_numa_scan_size_min,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
+		.procname	= "sched_numa_scan_size_max_mb",
+		.data		= &sysctl_sched_numa_scan_size_max,
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH 10/10] sched: Refine the 'shared tasks' memory interleaving logic
  2012-11-30 19:58 [PATCH 00/10] Latest numa/core release, v18 Ingo Molnar
                   ` (8 preceding siblings ...)
  2012-11-30 19:58 ` [PATCH 09/10] sched: Add convergence strength based adaptive NUMA page fault rate Ingo Molnar
@ 2012-11-30 19:58 ` Ingo Molnar
  2012-11-30 20:37 ` [PATCH 00/10] Latest numa/core release, v18 Linus Torvalds
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 39+ messages in thread
From: Ingo Molnar @ 2012-11-30 19:58 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

Change the adaptive memory policy code to take a majority of buddies
on a node into account. Previously, since this commit:

  "sched: Track shared task's node groups and interleave their memory allocations"

we would include any node that has run a buddy in the past. That was
too aggressive: it spread the allocations of 'mostly converged'
workloads too much and prevented their further convergence.

Add a few other variants for testing:

  NUMA_POLICY_ADAPTIVE:		use memory on every node that runs a buddy of this task

  NUMA_POLICY_SYSWIDE:		use a simple, static, system-wide mask

  NUMA_POLICY_MAXNODE:		use memory on this task's 'maximum node'

  NUMA_POLICY_MAXBUDDIES:	use memory on the node with the most buddies

  NUMA_POLICY_MANYBUDDIES:	this is the default, a quorum of buddies
				determines the allocation mask

The 'many buddies' quorum logic appears to work best in practice,
but the 'maxnode' and 'syswide' ones are good, robust policies too.
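
A made-up example of how the default quorum policy differs from the
old adaptive behaviour, for a shared task on a 4-node box (using the
>= 3 buddies quorum from the patch):

   buddies per node:          node0=5  node1=2  node2=0  node3=1

   NUMA_POLICY_ADAPTIVE:      interleave over nodes { 0, 1, 3 }  (any node with a buddy)
   NUMA_POLICY_MANYBUDDIES:   interleave over node  { 0 }        (only nodes with >= 3 buddies)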

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c     |  2 +-
 kernel/sched/fair.c     | 43 +++++++++++++++++++++++++++++++++++++------
 kernel/sched/features.h |  6 ++++++
 kernel/sched/sched.h    |  4 ++--
 4 files changed, 46 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 47b14d1..9fef0d3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -134,7 +134,7 @@ void update_rq_clock(struct rq *rq)
 #define SCHED_FEAT(name, enabled)	\
 	(1UL << __SCHED_FEAT_##name) * enabled |
 
-const_debug unsigned int sysctl_sched_features =
+const_debug u64 sysctl_sched_features =
 #include "features.h"
 	0;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9262692..18d732f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1611,6 +1611,9 @@ static int sched_update_ideal_cpu_shared(struct task_struct *p, int *flip_tasks)
 	min_node_load = LONG_MAX;
 	min_node = -1;
 
+	if (sched_feat(NUMA_POLICY_MANYBUDDIES))
+		nodes_clear(p->numa_policy.v.nodes);
+
 	/*
 	 * Map out our maximum buddies layout:
 	 */
@@ -1677,16 +1680,28 @@ static int sched_update_ideal_cpu_shared(struct task_struct *p, int *flip_tasks)
 			min_node = node;
 		}
 
-		if (buddies)
-			node_set(node, p->numa_policy.v.nodes);
-		else
-			node_clear(node, p->numa_policy.v.nodes);
+		if (sched_feat(NUMA_POLICY_ADAPTIVE)) {
+			if (buddies)
+				node_set(node, p->numa_policy.v.nodes);
+			else
+				node_clear(node, p->numa_policy.v.nodes);
+		}
+
+		if (!buddies) {
+			if (sched_feat(NUMA_POLICY_MANYBUDDIES))
+				node_clear(node, p->numa_policy.v.nodes);
+			continue;
+		}
+
+		/* A majority of buddies attracts memory: */
+		if (sched_feat(NUMA_POLICY_MANYBUDDIES)) {
+			if (buddies >= 3)
+				node_set(node, p->numa_policy.v.nodes);
+		}
 
 		/* Don't go to a node that is near its capacity limit: */
 		if (node_load + SCHED_LOAD_SCALE > node_capacity)
 			continue;
-		if (!buddies)
-			continue;
 
 		if (buddies > max_buddies && target_cpu != -1) {
 			max_buddies = buddies;
@@ -1696,6 +1711,13 @@ static int sched_update_ideal_cpu_shared(struct task_struct *p, int *flip_tasks)
 		}
 	}
 
+	/* Cluster memory around the buddies maximum: */
+	if (sched_feat(NUMA_POLICY_MAXBUDDIES)) {
+		if (ideal_node != -1) {
+			nodes_clear(p->numa_policy.v.nodes);
+			node_set(ideal_node, p->numa_policy.v.nodes);
+		}
+	}
 	if (WARN_ON_ONCE(ideal_node == -1 && ideal_cpu != -1))
 		return this_cpu;
 	if (WARN_ON_ONCE(ideal_node != -1 && ideal_cpu == -1))
@@ -2079,6 +2101,15 @@ static void task_numa_placement_tick(struct task_struct *p)
 			p->numa_faults[idx_oldnode] = 0;
 		}
 		sched_setnuma(p, ideal_node, shared);
+
+		/* Allocate only the maximum node: */
+		if (sched_feat(NUMA_POLICY_MAXNODE)) {
+			nodes_clear(p->numa_policy.v.nodes);
+			node_set(ideal_node, p->numa_policy.v.nodes);
+		}
+		/* Allocate system-wide: */
+		if (sched_feat(NUMA_POLICY_SYSWIDE))
+			p->numa_policy.v.nodes = node_online_map;
 		/*
 		 * We changed a node, start scanning more frequently again
 		 * to map out the working set:
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 9075faf..1775b80 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -81,5 +81,11 @@ SCHED_FEAT(NUMA_LB,			false)
 SCHED_FEAT(NUMA_GROUP_LB_COMPRESS,	true)
 SCHED_FEAT(NUMA_GROUP_LB_SPREAD,	true)
 SCHED_FEAT(MIGRATE_FAULT_STATS,		false)
+SCHED_FEAT(NUMA_POLICY_ADAPTIVE,	false)
+SCHED_FEAT(NUMA_POLICY_SYSWIDE,		false)
+SCHED_FEAT(NUMA_POLICY_MAXNODE,		false)
+SCHED_FEAT(NUMA_POLICY_MAXBUDDIES,	false)
+SCHED_FEAT(NUMA_POLICY_MANYBUDDIES,	true)
+
 SCHED_FEAT(NUMA_CONVERGE_MIGRATIONS,	true)
 #endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ca92adf..ace1159 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -648,7 +648,7 @@ static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
 # define const_debug const
 #endif
 
-extern const_debug unsigned int sysctl_sched_features;
+extern const_debug u64 sysctl_sched_features;
 
 #define SCHED_FEAT(name, enabled)	\
 	__SCHED_FEAT_##name ,
@@ -684,7 +684,7 @@ static __always_inline bool static_branch_##name(struct static_key *key) \
 extern struct static_key sched_feat_keys[__SCHED_FEAT_NR];
 #define sched_feat(x) (static_branch_##x(&sched_feat_keys[__SCHED_FEAT_##x]))
 #else /* !(SCHED_DEBUG && HAVE_JUMP_LABEL) */
-#define sched_feat(x) (sysctl_sched_features & (1UL << __SCHED_FEAT_##x))
+#define sched_feat(x) (sysctl_sched_features & (1ULL << __SCHED_FEAT_##x))
 #endif /* SCHED_DEBUG && HAVE_JUMP_LABEL */
 
 #ifdef CONFIG_NUMA_BALANCING
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH 00/10] Latest numa/core release, v18
  2012-11-30 19:58 [PATCH 00/10] Latest numa/core release, v18 Ingo Molnar
                   ` (9 preceding siblings ...)
  2012-11-30 19:58 ` [PATCH 10/10] sched: Refine the 'shared tasks' memory interleaving logic Ingo Molnar
@ 2012-11-30 20:37 ` Linus Torvalds
  2012-12-01  9:49   ` [RFC PATCH] mm/migration: Don't lock anon vmas in rmap_walk_anon() Ingo Molnar
  2012-12-03 13:41   ` [PATCH 00/10] Latest numa/core release, v18 Mel Gorman
  2012-12-03 10:43 ` Mel Gorman
                   ` (2 subsequent siblings)
  13 siblings, 2 replies; 39+ messages in thread
From: Linus Torvalds @ 2012-11-30 20:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linux Kernel Mailing List, linux-mm, Peter Zijlstra, Paul Turner,
	Lee Schermerhorn, Christoph Lameter, Rik van Riel, Mel Gorman,
	Andrew Morton, Andrea Arcangeli, Thomas Gleixner,
	Johannes Weiner, Hugh Dickins

On Fri, Nov 30, 2012 at 11:58 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> When pushed hard enough via threaded workloads (for example via the
> numa02 test) then the upstream page migration code in mm/migration.c
> becomes unscalable, resulting in lot of scheduling on the anon vma
> mutex and a subsequent drop in performance.

Ugh.

I wonder if migration really needs that thing to be a mutex? I may be
wrong, but the anon_vma lock only protects the actual rmap chains, and
migration only ever changes the pte *contents*, not the actual chains
of pte's themselves, right?

So if this is a migration-specific scalability issue, then it might be
possible to solve by making the mutex be a rwsem instead, and have
migration only take it for reading.

Of course, I'm quite possibly wrong, and the code depends on full
mutual exclusion.

Just a thought, in case it makes somebody go "Hmm.."

               Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [RFC PATCH] mm/migration: Don't lock anon vmas in rmap_walk_anon()
  2012-11-30 20:37 ` [PATCH 00/10] Latest numa/core release, v18 Linus Torvalds
@ 2012-12-01  9:49   ` Ingo Molnar
  2012-12-01 12:26     ` [RFC PATCH] mm/migration: Remove anon vma locking from try_to_unmap() use Ingo Molnar
                       ` (2 more replies)
  2012-12-03 13:41   ` [PATCH 00/10] Latest numa/core release, v18 Mel Gorman
  1 sibling, 3 replies; 39+ messages in thread
From: Ingo Molnar @ 2012-12-01  9:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, linux-mm, Peter Zijlstra, Paul Turner,
	Lee Schermerhorn, Christoph Lameter, Rik van Riel, Mel Gorman,
	Andrew Morton, Andrea Arcangeli, Thomas Gleixner,
	Johannes Weiner, Hugh Dickins


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, Nov 30, 2012 at 11:58 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > When pushed hard enough via threaded workloads (for example 
> > via the numa02 test) then the upstream page migration code 
> > in mm/migration.c becomes unscalable, resulting in lot of 
> > scheduling on the anon vma mutex and a subsequent drop in 
> > performance.
> 
> Ugh.
>
> I wonder if migration really needs that thing to be a mutex? I 
> may be wrong, but the anon_vma lock only protects the actual 
> rmap chains, and migration only ever changes the pte 
> *contents*, not the actual chains of pte's themselves, right?
> 
> So if this is a migration-specific scalability issue, then it 
> might be possible to solve by making the mutex be a rwsem 
> instead, and have migration only take it for reading.
> 
> Of course, I'm quite possibly wrong, and the code depends on 
> full mutual exclusion.
> 
> Just a thought, in case it makes somebody go "Hmm.."

I *think* you are right that for this type of migration that we 
are using here we indeed don't need to take an exclusive vma 
lock - in fact I think we don't need to take it at all:

The main goal in the migration code is to unmap the pte from all 
thread's MMU visibility, before we copy its contents into 
another page [located on another node] and map that page into 
the page tables instead of the old page.

No other thread must have a write reference to the old page when 
the copying [migrate_page_copy()] is performed, or we corrupt 
user-space memory subtly via copying a slightly older version of 
user-space memory.

rmap_walk() OTOH appears to have been written as a general 
purpose function, to be usable without holding the mmap_sem() as 
well, so it is written to protect against the disappearance of 
anon vmas.

But ... in all upstream and NUMA-migration codepaths I could 
find - and AFAICS in all other page-migration codepaths as well, 
including sys_move_pages() - anon vmas cannot disappear from 
under us, because we are already holding the mmap_sem.

[ Initially I assumed that swapout or filesystem code could 
  somehow call this without holding the mmap sem - but could not 
  find any such code path. ]

So I think we could get away rather simply, with something like 
the (entirely and utterly untested!) patch below.

But ... judging from the code my feeling is this can only be the 
first (and easiest) step:

1)

This patch might solve the remapping (remove_migration_ptes()), 
but does not solve the anon-vma locking done in the first, 
unmapping step of pte-migration - which is done via 
try_to_unmap(): which is a generic VM function used by swapout 
too, so callers do not necessarily hold the mmap_sem.

A new TTU flag might solve it although I detest flag-driven 
locking semantics with a passion:

Splitting out unlocked versions of try_to_unmap_anon(), 
try_to_unmap_ksm(), try_to_unmap_file() and constructing an 
unlocked try_to_unmap() out of them, to be used by the migration 
code, would be the cleaner option.

2)

Taking a process-global mutex 1024 times per 2MB was indeed very 
expensive - and let's assume that we manage to sort that out - 
but then we are AFAICS exposed to the next layer: the 
finegrained migrate_pages() model where the migration code 
flushes the TLB 512 times per 2MB to unmap and remap it again 
and again at 4K granularity ...
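
( Rough arithmetic behind those figures, assuming 4K base pages:
  2MB / 4KB = 512 ptes per 2MB range; migrating them one by one
  takes the anon vma lock twice per page - once in try_to_unmap()
  and once in the rmap_walk() done by remove_migration_ptes() -
  i.e. ~1024 lock acquisitions, plus one TLB flush per unmapped
  pte, i.e. ~512 flushes per 2MB. )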

Assuming the simpler patch goes fine I'll try to come up with 
something intelligent for the TLB flushing sub-problem too: we 
could in theory batch the migration TLB flushes as well, by 
first doing an array of 2MB granular unmaps, then copying up to 
512x 4K pages, then doing the final 2MB granular [but still 
4K-represented in the page tables] remap.

2MB granular TLB flushing is OK for these workloads, I can see 
that in +THP tests.

I will keep you updated about how far I manage to get down this 
road.

Thanks,

	Ingo

---------------------------->
Subject: mm/migration: Don't lock anon vmas in rmap_walk_anon()
From: Ingo Molnar <mingo@kernel.org>
Date: Thu Nov 22 14:16:26 CET 2012

rmap_walk_anon() appears to be too careful about locking the anon
vma for its own good - since all callers are holding the mmap_sem
no vma can go away from under us:

 - sys_move_pages() is doing down_read(&mm->mmap_sem) in the
   sys_move_pages() -> do_pages_move() -> do_move_page_to_node_array()
   code path, which then calls migrate_pages(pagelist), which then
   does unmap_and_move() for every page in the list, which does
   remove_migration_ptes() which calls rmap.c::try_to_unmap().

 - the NUMA migration code's migrate_misplaced_page(), which calls
   migrate_pages() ... try_to_unmap(), is holding the mm->mmap_sem
   read-locked by virtue of the low level page fault handler taking
   it before calling handle_mm_fault().

Removing this lock removes a global mutex from the hot path of
migration-happy threaded workloads which can cause pathological
performance like this:

    96.43%        process 0  [kernel.kallsyms]  [k] perf_trace_sched_switch
                  |
                  --- perf_trace_sched_switch
                      __schedule
                      schedule
                      schedule_preempt_disabled
                      __mutex_lock_common.isra.6
                      __mutex_lock_slowpath
                      mutex_lock
                     |
                     |--50.61%-- rmap_walk
                     |          move_to_new_page
                     |          migrate_pages
                     |          migrate_misplaced_page
                     |          __do_numa_page.isra.69
                     |          handle_pte_fault
                     |          handle_mm_fault
                     |          __do_page_fault
                     |          do_page_fault
                     |          page_fault
                     |          __memset_sse2
                     |          |
                     |           --100.00%-- worker_thread
                     |                     |
                     |                      --100.00%-- start_thread
                     |
                      --49.39%-- page_lock_anon_vma
                                try_to_unmap_anon
                                try_to_unmap
                                migrate_pages
                                migrate_misplaced_page
                                __do_numa_page.isra.69
                                handle_pte_fault
                                handle_mm_fault
                                __do_page_fault
                                do_page_fault
                                page_fault
                                __memset_sse2
                                |
                                 --100.00%-- worker_thread
                                           start_thread

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
Not-Yet-Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 mm/rmap.c |   13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

Index: linux/mm/rmap.c
===================================================================
--- linux.orig/mm/rmap.c
+++ linux/mm/rmap.c
@@ -1686,6 +1686,9 @@ void __put_anon_vma(struct anon_vma *ano
 /*
  * rmap_walk() and its helpers rmap_walk_anon() and rmap_walk_file():
  * Called by migrate.c to remove migration ptes, but might be used more later.
+ *
+ * Note: callers are expected to protect against anon vmas disappearing
+ *       under us - by holding the mmap_sem read or write locked.
  */
 static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
 		struct vm_area_struct *, unsigned long, void *), void *arg)
@@ -1695,16 +1698,10 @@ static int rmap_walk_anon(struct page *p
 	struct anon_vma_chain *avc;
 	int ret = SWAP_AGAIN;
 
-	/*
-	 * Note: remove_migration_ptes() cannot use page_lock_anon_vma()
-	 * because that depends on page_mapped(); but not all its usages
-	 * are holding mmap_sem. Users without mmap_sem are required to
-	 * take a reference count to prevent the anon_vma disappearing
-	 */
 	anon_vma = page_anon_vma(page);
 	if (!anon_vma)
 		return ret;
-	anon_vma_lock(anon_vma);
+
 	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
 		struct vm_area_struct *vma = avc->vma;
 		unsigned long address = vma_address(page, vma);
@@ -1712,7 +1709,7 @@ static int rmap_walk_anon(struct page *p
 		if (ret != SWAP_AGAIN)
 			break;
 	}
-	anon_vma_unlock(anon_vma);
+
 	return ret;
 }
 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [RFC PATCH] mm/migration: Remove anon vma locking from try_to_unmap() use
  2012-12-01  9:49   ` [RFC PATCH] mm/migration: Don't lock anon vmas in rmap_walk_anon() Ingo Molnar
@ 2012-12-01 12:26     ` Ingo Molnar
  2012-12-01 18:38       ` Linus Torvalds
  2012-12-01 16:19     ` [RFC PATCH] mm/migration: Don't lock anon vmas in rmap_walk_anon() Rik van Riel
  2012-12-01 17:55     ` Linus Torvalds
  2 siblings, 1 reply; 39+ messages in thread
From: Ingo Molnar @ 2012-12-01 12:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, linux-mm, Peter Zijlstra, Paul Turner,
	Lee Schermerhorn, Christoph Lameter, Rik van Riel, Mel Gorman,
	Andrew Morton, Andrea Arcangeli, Thomas Gleixner,
	Johannes Weiner, Hugh Dickins


* Ingo Molnar <mingo@kernel.org> wrote:

> 1)
> 
> This patch might solve the remapping 
> (remove_migration_ptes()), but does not solve the anon-vma 
> locking done in the first, unmapping step of pte-migration - 
> which is done via try_to_unmap(): which is a generic VM 
> function used by swapout too, so callers do not necessarily 
> hold the mmap_sem.
> 
> A new TTU flag might solve it although I detest flag-driven 
> locking semantics with a passion:
> 
> Splitting out unlocked versions of try_to_unmap_anon(), 
> try_to_unmap_ksm(), try_to_unmap_file() and constructing an 
> unlocked try_to_unmap() out of them, to be used by the 
> migration code, would be the cleaner option.

So as a quick concept hack I wrote the patch attached below. 
(It's not signed off, see the patch description text for the 
reason.)

With this applied I get the same good 4x JVM performance:

     spec1.txt:           throughput =     157471.10 SPECjbb2005 bops 
     spec2.txt:           throughput =     157817.09 SPECjbb2005 bops 
     spec3.txt:           throughput =     157581.79 SPECjbb2005 bops 
     spec4.txt:           throughput =     157890.26 SPECjbb2005 bops 
                                           --------------------------
           SUM:           throughput =     630760.24 SPECjbb2005 bops

... because the JVM workload did not trigger the migration 
scalability threshold to begin with.

Mainline 4xJVM SPECjbb performance:

     spec1.txt:           throughput =     128575.47 SPECjbb2005 bops
     spec2.txt:           throughput =     125767.24 SPECjbb2005 bops
     spec3.txt:           throughput =     130042.30 SPECjbb2005 bops
     spec4.txt:           throughput =     128155.32 SPECjbb2005 bops
                                       --------------------------
           SUM:           throughput =     512540.33 SPECjbb2005 bops

     # (32 CPUs, 4 instances, 8 warehouses each, 240 seconds runtime, !THP)

But !THP/4K numa02 performance went through the roof!

Mainline !THP numa02 performance:

         40.918 secs runtime/thread
         26.051 secs fastest (min) thread time
         59.229 secs elapsed (max) thread time [ spread: -28.0% ]
         26.844 GB data processed, per thread
        858.993 GB data processed, total
          2.206 nsecs/byte/thread
          0.453 GB/sec/thread
         14.503 GB/sec total

numa/core v18 + migration-locking-enhancements, !THP:

         18.543 secs runtime/thread
         17.721 secs fastest (min) thread time
         19.262 secs elapsed (max) thread time [ spread: -4.0% ]
         26.844 GB data processed, per thread
        858.993 GB data processed, total
          0.718 nsecs/byte/thread
          1.394 GB/sec/thread
         44.595 GB/sec total

As you can see, the performance of each of the 32 threads is 
within a tight bound:

         17.721 secs fastest (min) thread time
         19.262 secs elapsed (max) thread time [ spread: -4.0% ]

... with very little spread between them.

So this is roughly as good as it can get without hard binding - 
and according to my limited testing the numa02 workload is 
20-30% faster than the AutoNUMA or balancenuma kernels on the 
same hardware/kernel combo. The above numa02 result now also 
gets reasonably close to the numa/core +THP numa02 numbers (to 
within 10%).

As expected there's a lot of TLB flushing going on, but, and 
this was unexpected to me, even maximally pushing the migration 
code does not trigger anything pathological on this 4-node 
system - so while the TLB optimization will be a welcome 
enhancement, it's not a must-have at this stage.

I'll do a cleaner version of this patch and I'll test on a 
larger system with a large NUMA factor too to make sure we don't 
need the TLB optimization on !THP.

So I think (assuming that I have not overlooked something 
critical in these patches!), with these two fixes all the 
difficult known regressions in numa/core are fixed.

I'll do more testing with broader workloads and on more systems 
to ascertain this.

Thanks,

	Ingo

---------------->
Subject: mm/migration: Remove anon vma locking from try_to_unmap() use
From: Ingo Molnar <mingo@kernel.org>
Date: Sat Dec 1 11:22:09 CET 2012

As outlined in:

    mm/migration: Don't lock anon vmas in rmap_walk_anon()

the process-global anon vma mutex locking of the page migration
code can be very expensive.

This removes the second (and last) use of that mutex from the
migration code: try_to_unmap().

Since try_to_unmap() is used by swapout and filesystem code
as well, which does not hold the mmap_sem, we only want to
do this optimization from the migration path.

This patch is ugly and should be replaced via a
try_to_unmap_locked() variant instead which offers us the
unlocked codepath, but it's good enough for testing purposes.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
Not-Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/rmap.h |    2 +-
 mm/huge_memory.c     |    2 +-
 mm/memory-failure.c  |    2 +-
 mm/rmap.c            |   13 ++++++++++---
 4 files changed, 13 insertions(+), 6 deletions(-)

Index: linux/include/linux/rmap.h
===================================================================
--- linux.orig/include/linux/rmap.h
+++ linux/include/linux/rmap.h
@@ -220,7 +220,7 @@ int try_to_munlock(struct page *);
 /*
  * Called by memory-failure.c to kill processes.
  */
-struct anon_vma *page_lock_anon_vma(struct page *page);
+struct anon_vma *page_lock_anon_vma(struct page *page, enum ttu_flags flags);
 void page_unlock_anon_vma(struct anon_vma *anon_vma);
 int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma);
 
Index: linux/mm/huge_memory.c
===================================================================
--- linux.orig/mm/huge_memory.c
+++ linux/mm/huge_memory.c
@@ -1645,7 +1645,7 @@ int split_huge_page(struct page *page)
 	int ret = 1;
 
 	BUG_ON(!PageAnon(page));
-	anon_vma = page_lock_anon_vma(page);
+	anon_vma = page_lock_anon_vma(page, 0);
 	if (!anon_vma)
 		goto out;
 	ret = 0;
Index: linux/mm/memory-failure.c
===================================================================
--- linux.orig/mm/memory-failure.c
+++ linux/mm/memory-failure.c
@@ -402,7 +402,7 @@ static void collect_procs_anon(struct pa
 	struct anon_vma *av;
 	pgoff_t pgoff;
 
-	av = page_lock_anon_vma(page);
+	av = page_lock_anon_vma(page, 0);
 	if (av == NULL)	/* Not actually mapped anymore */
 		return;
 
Index: linux/mm/rmap.c
===================================================================
--- linux.orig/mm/rmap.c
+++ linux/mm/rmap.c
@@ -442,7 +442,7 @@ out:
  * atomic op -- the trylock. If we fail the trylock, we fall back to getting a
  * reference like with page_get_anon_vma() and then block on the mutex.
  */
-struct anon_vma *page_lock_anon_vma(struct page *page)
+struct anon_vma *page_lock_anon_vma(struct page *page, enum ttu_flags flags)
 {
 	struct anon_vma *anon_vma = NULL;
 	struct anon_vma *root_anon_vma;
@@ -456,6 +456,13 @@ struct anon_vma *page_lock_anon_vma(stru
 		goto out;
 
 	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
+	/*
+	 * The migration code paths are already holding the mmap_sem,
+	 * so the anon vma cannot go away from under us - return it:
+	 */
+	if (flags & TTU_MIGRATION)
+		goto out;
+
 	root_anon_vma = ACCESS_ONCE(anon_vma->root);
 	if (mutex_trylock(&root_anon_vma->mutex)) {
 		/*
@@ -732,7 +739,7 @@ static int page_referenced_anon(struct p
 	struct anon_vma_chain *avc;
 	int referenced = 0;
 
-	anon_vma = page_lock_anon_vma(page);
+	anon_vma = page_lock_anon_vma(page, 0);
 	if (!anon_vma)
 		return referenced;
 
@@ -1474,7 +1481,7 @@ static int try_to_unmap_anon(struct page
 	struct anon_vma_chain *avc;
 	int ret = SWAP_AGAIN;
 
-	anon_vma = page_lock_anon_vma(page);
+	anon_vma = page_lock_anon_vma(page, flags);
 	if (!anon_vma)
 		return ret;
 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH] mm/migration: Don't lock anon vmas in rmap_walk_anon()
  2012-12-01  9:49   ` [RFC PATCH] mm/migration: Don't lock anon vmas in rmap_walk_anon() Ingo Molnar
  2012-12-01 12:26     ` [RFC PATCH] mm/migration: Remove anon vma locking from try_to_unmap() use Ingo Molnar
@ 2012-12-01 16:19     ` Rik van Riel
  2012-12-01 17:55     ` Linus Torvalds
  2 siblings, 0 replies; 39+ messages in thread
From: Rik van Riel @ 2012-12-01 16:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-mm,
	Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Mel Gorman, Andrew Morton, Andrea Arcangeli, Thomas Gleixner,
	Johannes Weiner, Hugh Dickins

On 12/01/2012 04:49 AM, Ingo Molnar wrote:
>
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
>> On Fri, Nov 30, 2012 at 11:58 AM, Ingo Molnar <mingo@kernel.org> wrote:
>>>
>>> When pushed hard enough via threaded workloads (for example
>>> via the numa02 test) then the upstream page migration code
>>> in mm/migration.c becomes unscalable, resulting in lot of
>>> scheduling on the anon vma mutex and a subsequent drop in
>>> performance.
>>
>> Ugh.
>>
>> I wonder if migration really needs that thing to be a mutex? I
>> may be wrong, but the anon_vma lock only protects the actual
>> rmap chains, and migration only ever changes the pte
>> *contents*, not the actual chains of pte's themselves, right?
>>
>> So if this is a migration-specific scalability issue, then it
>> might be possible to solve by making the mutex be a rwsem
>> instead, and have migration only take it for reading.
>>
>> Of course, I'm quite possibly wrong, and the code depends on
>> full mutual exclusion.
>>
>> Just a thought, in case it makes somebody go "Hmm.."
>
> I *think* you are right that for this type of migration that we
> are using here we indeed don't need to take an exclusive vma
> lock - in fact I think we don't need to take it at all:
>
> The main goal in the migration code is to unmap the pte from all
> thread's MMU visibility, before we copy its contents into
> another page [located on another node] and map that page into
> the page tables instead of the old page.
>
> No other thread must have a write reference to the old page when
> the copying [migrate_page_copy()] is performed, or we corrupt
> user-space memory subtly via copying a slightly older version of
> user-space memory.
>
> rmap_walk() OTOH appears to have been written as a general
> purpose function, to be usable without holding the mmap_sem() as
> well, so it is written to protect against the disappearance of
> anon vmas.
>
> But ... in all upstream and NUMA-migration codepaths I could
> find - and AFAICS in all other page-migration codepaths as well,
> including sys_move_pages() - anon vmas cannot disappear from
> under us, because we are already holding the mmap_sem.
>
> [ Initially I assumed that swapout or filesystem code could
>    somehow call this without holding the mmap sem - but could not
>    find any such code path. ]
>
> So I think we could get away rather simply, with something like
> the (entirely and utterly untested!) patch below.
>
> But ... judging from the code my feeling is this can only be the
> first (and easiest) step:
>
> 1)
>
> This patch might solve the remapping (remove_migration_ptes()),
> but does not solve the anon-vma locking done in the first,
> unmapping step of pte-migration - which is done via
> try_to_unmap(): which is a generic VM function used by swapout
> too, so callers do not necessarily hold the mmap_sem.
>
> A new TTU flag might solve it although I detest flag-driven
> locking semantics with a passion:
>
> Splitting out unlocked versions of try_to_unmap_anon(),
> try_to_unmap_ksm(), try_to_unmap_file() and constructing an
> unlocked try_to_unmap() out of them, to be used by the migration
> code, would be the cleaner option.
>
> 2)
>
> Taking a process-global mutex 1024 times per 2MB was indeed very
> expensive - and let's assume that we manage to sort that out -
> but then we are AFAICS exposed to the next layer: the
> fine-grained migrate_pages() model where the migration code
> flushes the TLB 512 times per 2MB to unmap and remap it again
> and again at 4K granularity ...
>
> Assuming the simpler patch goes fine I'll try to come up with
> something intelligent for the TLB flushing sub-problem too: we
> could in theory batch the migration TLB flushes as well, by
> first doing an array of 2MB granular unmaps, then copying up to
> 512x 4K pages, then doing the final 2MB granular [but still
> 4K-represented in the page tables] remap.
>
> 2MB granular TLB flushing is OK for these workloads, I can see
> that in +THP tests.
>
> I will keep you updated about how far I manage to get down this
> road.
>
> Thanks,
>
> 	Ingo
>
> ---------------------------->
> Subject: mm/migration: Don't lock anon vmas in rmap_walk_anon()
> From: Ingo Molnar <mingo@kernel.org>
> Date: Thu Nov 22 14:16:26 CET 2012
>
> rmap_walk_anon() appears to be too careful about locking the anon
> vma for its own good - since all callers are holding the mmap_sem
> no vma can go away from under us:
>
>   - sys_move_pages() is doing down_read(&mm->mmap_sem) in the
>     sys_move_pages() -> do_pages_move() -> do_move_page_to_node_array()
>     code path, which then calls migrate_pages(pagelist), which then
>     does unmap_and_move() for every page in the list, which does
>     remove_migration_ptes(), which calls rmap.c::rmap_walk().
>
>   - the NUMA migration code's migrate_misplaced_page(), which calls
>     migrate_pages() ... try_to_unmap(), is holding the mm->mmap_sem
>     read-locked by virtue of the low level page fault handler taking
>     it before calling handle_mm_fault().
>
> Removing this lock removes a global mutex from the hot path of
> migration-happy threaded workloads which can cause pathological
> performance like this:
>
>      96.43%        process 0  [kernel.kallsyms]  [k] perf_trace_sched_switch
>                    |
>                    --- perf_trace_sched_switch
>                        __schedule
>                        schedule
>                        schedule_preempt_disabled
>                        __mutex_lock_common.isra.6
>                        __mutex_lock_slowpath
>                        mutex_lock
>                       |
>                       |--50.61%-- rmap_walk
>                       |          move_to_new_page
>                       |          migrate_pages
>                       |          migrate_misplaced_page
>                       |          __do_numa_page.isra.69
>                       |          handle_pte_fault
>                       |          handle_mm_fault
>                       |          __do_page_fault
>                       |          do_page_fault
>                       |          page_fault
>                       |          __memset_sse2
>                       |          |
>                       |           --100.00%-- worker_thread
>                       |                     |
>                       |                      --100.00%-- start_thread
>                       |
>                        --49.39%-- page_lock_anon_vma
>                                  try_to_unmap_anon
>                                  try_to_unmap
>                                  migrate_pages
>                                  migrate_misplaced_page
>                                  __do_numa_page.isra.69
>                                  handle_pte_fault
>                                  handle_mm_fault
>                                  __do_page_fault
>                                  do_page_fault
>                                  page_fault
>                                  __memset_sse2
>                                  |
>                                   --100.00%-- worker_thread
>                                             start_thread
>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Hugh Dickins <hughd@google.com>
> Not-Yet-Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>   mm/rmap.c |   13 +++++--------
>   1 file changed, 5 insertions(+), 8 deletions(-)
>
> Index: linux/mm/rmap.c
> ===================================================================
> --- linux.orig/mm/rmap.c
> +++ linux/mm/rmap.c
> @@ -1686,6 +1686,9 @@ void __put_anon_vma(struct anon_vma *ano
>   /*
>    * rmap_walk() and its helpers rmap_walk_anon() and rmap_walk_file():
>    * Called by migrate.c to remove migration ptes, but might be used more later.
> + *
> + * Note: callers are expected to protect against anon vmas disappearing
> + *       under us - by holding the mmap_sem read or write locked.
>    */
>   static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
>   		struct vm_area_struct *, unsigned long, void *), void *arg)

I am not convinced this is enough.

The same anonymous page could be mapped into multiple processes
that inherited memory from the same (grand)parent process.

Holding the mmap_sem for one process does not protect against
manipulations of the anon_vma chain by sibling, child, or parent
processes.
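
To make the scenario concrete, here is a minimal userspace sketch
(purely illustrative - nothing in it is kernel code):

  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)
  {
          /* One anonymous page, faulted in (and anon_vma-linked) by the parent. */
          char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          p[0] = 1;

          if (fork() == 0) {
                  /*
                   * The child shares the page via COW and hangs off the
                   * same anon_vma chain, but it has its own mm and its
                   * own mmap_sem.  A COW fault, munmap() or exit here
                   * manipulates that shared chain while the parent's
                   * mmap_sem is not held at all.
                   */
                  munmap(p, 4096);
                  _exit(0);
          }

          pause();
          return 0;
  }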

We may need to turn the anon_vma lock into a rwsem, like Linus
suggested.


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH] mm/migration: Don't lock anon vmas in rmap_walk_anon()
  2012-12-01  9:49   ` [RFC PATCH] mm/migration: Don't lock anon vmas in rmap_walk_anon() Ingo Molnar
  2012-12-01 12:26     ` [RFC PATCH] mm/migration: Remove anon vma locking from try_to_unmap() use Ingo Molnar
  2012-12-01 16:19     ` [RFC PATCH] mm/migration: Don't lock anon vmas in rmap_walk_anon() Rik van Riel
@ 2012-12-01 17:55     ` Linus Torvalds
  2012-12-01 18:30       ` Ingo Molnar
  2 siblings, 1 reply; 39+ messages in thread
From: Linus Torvalds @ 2012-12-01 17:55 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linux Kernel Mailing List, linux-mm, Peter Zijlstra, Paul Turner,
	Lee Schermerhorn, Christoph Lameter, Rik van Riel, Mel Gorman,
	Andrew Morton, Andrea Arcangeli, Thomas Gleixner,
	Johannes Weiner, Hugh Dickins

On Sat, Dec 1, 2012 at 1:49 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> I *think* you are right that, for the type of migration we are
> using here, we indeed don't need to take an exclusive vma
> lock - in fact I think we don't need to take it at all:

I'm pretty sure we do need at least a read-side reference.

Even if no other MM can contain that particular pte, the vma lock
protects the chain that is created by fork and exit and vma splitting
etc.

So it's enough that another thread does a fork() at the same time. Or
a partial unmap of the vma (that splits it in two), for the rmap chain
to be modified.

Besides, there's absolutely nothing that prevents that vma from being
part of the same vma chain in entirely unrelated processes. The vma chain
can get quite long over multiple forks (it's even a performance
problem under some extreme loads).

And we do walk the rmap chain - so we need the lock.

But we walk it read-only afaik, which is why I think the semaphore
could be an rwsem.

Now, there *are* likely cases where we could avoid anon_vma locking
entirely, but they are very specialized. They'd be along the lines of

 - we hold the page table lock
 - we see that vma->anon_vma == vma->anon_vma->root
 - we see that vma->anon_vma->refcount == 1

or similar, because then we can guarantee that the anon-vma chain has
a length of one without even locking, and holding the page table lock
means that any concurrent fork or mmap/munmap from another thread will
block on this particular pte.
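
In sketch form, and with a completely made-up helper name (so take it
with the same grain of salt as the rules themselves):

  /* Caller must hold the page table lock for the pte it cares about. */
  static bool pte_exclusive_to_this_mm(struct vm_area_struct *vma)
  {
          struct anon_vma *anon_vma = vma->anon_vma;

          if (!anon_vma)
                  return false;
          /* A non-root anon_vma means the chain already spans a fork. */
          if (anon_vma != anon_vma->root)
                  return false;
          /* A single reference means nobody else can be walking the chain. */
          return atomic_read(&anon_vma->refcount) == 1;
  }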

So I suspect that in the above kind of special case (which might be a
somewhat common case for normal page faults, for example) we could
make a "we have exclusive pte access to this page" argument. But quite
frankly, I completely made the above rules up in my head, they may be
bogus too.

For the general migration case, it's definitely not possible to just
drop the anon_vma lock.

                  Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH] mm/migration: Don't lock anon vmas in rmap_walk_anon()
  2012-12-01 17:55     ` Linus Torvalds
@ 2012-12-01 18:30       ` Ingo Molnar
  0 siblings, 0 replies; 39+ messages in thread
From: Ingo Molnar @ 2012-12-01 18:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, linux-mm, Peter Zijlstra, Paul Turner,
	Lee Schermerhorn, Christoph Lameter, Rik van Riel, Mel Gorman,
	Andrew Morton, Andrea Arcangeli, Thomas Gleixner,
	Johannes Weiner, Hugh Dickins


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sat, Dec 1, 2012 at 1:49 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > I *think* you are right that, for the type of migration we 
> > are using here, we indeed don't need to take an exclusive 
> > vma lock - in fact I think we don't need to take it at all:
> 
> I'm pretty sure we do need at least a read-side reference.

Ok.

> Even if no other MM can contain that particular pte, the vma 
> lock protects the chain that is created by fork and exit and 
> vma splitting etc.
> 
> So it's enough that another thread does a fork() at the same 
> time. Or a partial unmap of the vma (that splits it in two), 
> for the rmap chain to be modified.
> 
> Besides, there's absolutely nothing that prevents that vma 
> from being part of the same vma chain in entirely unrelated processes. 
> The vma chain can get quite long over multiple forks (it's 
> even a performance problem under some extreme loads).
> 
> And we do walk the rmap chain - so we need the lock.
> 
> But we walk it read-only afaik, which is why I think the 
> semaphore could be an rwsem.
> 
> Now, there *are* likely cases where we could avoid anon_vma 
> locking entirely, but they are very specialized. They'd be 
> along the lines of
> 
>  - we hold the page table lock
>  - we see that vma->anon_vma == vma->anon_vma->root
>  - we see that vma->anon_vma->refcount == 1
> 
> or similar, because then we can guarantee that the anon-vma 
> chain has a length of one without even locking, and holding 
> the page table lock means that any concurrent fork or 
> mmap/munmap from another thread will block on this particular 
> pte.

Hm. These conditions may be true for some pretty common cases, 
but it's difficult to discover that information from the 
migration code due to the way we discover all anon vmas and walk 
the anon vma list: we first lock the anon vma, and only then do we 
get to iterate over the individual vmas and do the pte changes.

So it's the wrong way around.
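
For reference, the walk has roughly this shape (abridged from
rmap_walk_anon()):

  anon_vma = page_anon_vma(page);
  if (!anon_vma)
          return ret;
  anon_vma_lock(anon_vma);        /* taken before any individual vma is seen */
  anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
          struct vm_area_struct *vma = avc->vma;
          unsigned long address = vma_address(page, vma);

          /* only here could a per-vma special-case check be made */
          ret = rmap_one(page, vma, address, arg);
          if (ret != SWAP_AGAIN)
                  break;
  }
  anon_vma_unlock(anon_vma);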

I think your rwsem suggestion is a lot more generic and more 
robust as well.

> So I suspect that in the above kind of special case (which 
> might be a somewhat common case for normal page faults, for 
> example) we could make a "we have exclusive pte access to this 
> page" argument. But quite frankly, I completely made the above 
> rules up in my head, they may be bogus too.
> 
> For the general migration case, it's definitely not possible 
> to just drop the anon_vma lock.

Ok, I see - I'll redo this part then and try out how an rwsem 
fares. I suspect it would give a small speedup to a fair number 
of workloads, so it's worthwhile to spend some time on it.

Thanks for the suggestions!

	Ingo

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH] mm/migration: Remove anon vma locking from try_to_unmap() use
  2012-12-01 12:26     ` [RFC PATCH] mm/migration: Remove anon vma locking from try_to_unmap() use Ingo Molnar
@ 2012-12-01 18:38       ` Linus Torvalds
  2012-12-01 18:41         ` Ingo Molnar
  2012-12-01 18:55         ` [RFC PATCH] mm/migration: Remove anon vma locking from try_to_unmap() use Rik van Riel
  0 siblings, 2 replies; 39+ messages in thread
From: Linus Torvalds @ 2012-12-01 18:38 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linux Kernel Mailing List, linux-mm, Peter Zijlstra, Paul Turner,
	Lee Schermerhorn, Christoph Lameter, Rik van Riel, Mel Gorman,
	Andrew Morton, Andrea Arcangeli, Thomas Gleixner,
	Johannes Weiner, Hugh Dickins

On Sat, Dec 1, 2012 at 4:26 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
>
> So as a quick concept hack I wrote the patch attached below.
> (It's not signed off, see the patch description text for the
> reason.)

Well, it confirms that anon_vma locking is a big problem, but as
outlined in my other email it's completely incorrect from an actual
behavior standpoint.

Btw, I think the anon_vma lock could be made a spinlock instead of a
mutex or rwsem, but that would probably take more work. We *shouldn't*
be doing anything that needs IO inside the anon_vma lock, though, so
it *should* be doable. But there are probably quite a few
allocations inside the lock, and I know it covers huge areas, so a
spinlock might not only be hard to convert to, it quite likely has
latency issues too.

Oh, btw, MUTEX_SPIN_ON_OWNER may well improve performance too, but it
gets disabled by DEBUG_MUTEXES. So some of the performance impact of
the vma locking may be *very* kernel-config dependent.
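
(IIRC the Kconfig dependency is literally along these lines:

  config MUTEX_SPIN_ON_OWNER
          def_bool SMP && !DEBUG_MUTEXES

so a lock-debugging config silently loses the owner-spinning.)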

                  Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH] mm/migration: Remove anon vma locking from try_to_unmap() use
  2012-12-01 18:38       ` Linus Torvalds
@ 2012-12-01 18:41         ` Ingo Molnar
  2012-12-01 18:50           ` Linus Torvalds
  2012-12-01 18:55         ` [RFC PATCH] mm/migration: Remove anon vma locking from try_to_unmap() use Rik van Riel
  1 sibling, 1 reply; 39+ messages in thread
From: Ingo Molnar @ 2012-12-01 18:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, linux-mm, Peter Zijlstra, Paul Turner,
	Lee Schermerhorn, Christoph Lameter, Rik van Riel, Mel Gorman,
	Andrew Morton, Andrea Arcangeli, Thomas Gleixner,
	Johannes Weiner, Hugh Dickins


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sat, Dec 1, 2012 at 4:26 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> >
> > So as a quick concept hack I wrote the patch attached below.
> > (It's not signed off, see the patch description text for the
> > reason.)
> 
> Well, it confirms that anon_vma locking is a big problem, but 
> as outlined in my other email it's completely incorrect from 
> an actual behavior standpoint.

Yeah.

> Btw, I think the anon_vma lock could be made a spinlock 
> instead of a mutex or rwsem, but that would probably take more 
> work. We *shouldn't* be doing anything that needs IO inside 
> the anon_vma lock, though, so it *should* be doable. But there 
> are probably quite a few allocations inside the lock, and I 
> know it covers huge areas, so a spinlock might not only be 
> hard to convert to, it quite likely has latency issues too.

I'll try the rwsem and see how it goes?

> Oh, btw, MUTEX_SPIN_ON_OWNER may well improve performance too, 
> but it gets disabled by DEBUG_MUTEXES. So some of the 
> performance impact of the vma locking may be *very* 
> kernel-config dependent.

Hm, indeed. For performance runs I typically disable lock 
debugging - which might have made me not directly notice some of 
the performance problems.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH] mm/migration: Remove anon vma locking from try_to_unmap() use
  2012-12-01 18:41         ` Ingo Molnar
@ 2012-12-01 18:50           ` Linus Torvalds
  2012-12-01 20:10             ` [PATCH 1/2] mm/rmap: Convert the struct anon_vma::mutex to an rwsem Ingo Molnar
  2012-12-01 20:15             ` [PATCH 2/2] mm/migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable Ingo Molnar
  0 siblings, 2 replies; 39+ messages in thread
From: Linus Torvalds @ 2012-12-01 18:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linux Kernel Mailing List, linux-mm, Peter Zijlstra, Paul Turner,
	Lee Schermerhorn, Christoph Lameter, Rik van Riel, Mel Gorman,
	Andrew Morton, Andrea Arcangeli, Thomas Gleixner,
	Johannes Weiner, Hugh Dickins

On Sat, Dec 1, 2012 at 10:41 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> I'll try the rwsem and see how it goes?

Yeah. That should be an easy conversion (just convert everything to
use the write-lock first, and then you can make one or two migration
places use the read version).
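
In sketch form (the real conversion obviously has to touch every
locking site):

  /* step 1: mechanical conversion, old mutex_lock() sites become writers */
  static inline void anon_vma_lock(struct anon_vma *anon_vma)
  {
          down_write(&anon_vma->root->rwsem);
  }

  /* step 2: the migration-side walkers then take it for reading */
  static inline void anon_vma_lock_read(struct anon_vma *anon_vma)
  {
          down_read(&anon_vma->root->rwsem);
  }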

Side note: The mutex code tends to potentially generate slightly
faster noncontended locks than rwsems, and it does have the
MUTEX_SPIN_ON_OWNER feature that makes the contention case often
*much* better, so there are real downsides to rw-semaphores.

But for this load, it does seem like the scalability advantages of an
rwsem *might* be worth it.

Side note: in contrast, the rwlock spinning reader-writer locks are
basically never a win - the downsides just about always negate any
theoretical scalability advantage. rwsems can work well - we already
use one for mmap_sem, for example, to allow concurrent page faults, and
it was a *big* scalability win there. Although then we did the "drop
mmap_sem over IO and retry", and that might have negated many of the
advantages of the mmap_sem.

> Hm, indeed. For performance runs I typically disable lock
> debugging - which might have made me not directly notice some of
> the performance problems.

Yeah, lock debugging really tends to make anything that is close to
contended be absolutely *horribly* contended. Doubly so for the
mutexes because it disables the spinning code, but it's true in
general too.

                Linus

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [RFC PATCH] mm/migration: Remove anon vma locking from try_to_unmap() use
  2012-12-01 18:38       ` Linus Torvalds
  2012-12-01 18:41         ` Ingo Molnar
@ 2012-12-01 18:55         ` Rik van Riel
  1 sibling, 0 replies; 39+ messages in thread
From: Rik van Riel @ 2012-12-01 18:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Linux Kernel Mailing List, linux-mm, Peter Zijlstra,
	Paul Turner, Lee Schermerhorn, Christoph Lameter, Mel Gorman,
	Andrew Morton, Andrea Arcangeli, Thomas Gleixner,
	Johannes Weiner, Hugh Dickins

On 12/01/2012 01:38 PM, Linus Torvalds wrote:
> On Sat, Dec 1, 2012 at 4:26 AM, Ingo Molnar <mingo@kernel.org> wrote:
>>
>>
>> So as a quick concept hack I wrote the patch attached below.
>> (It's not signed off, see the patch description text for the
>> reason.)
>
> Well, it confirms that anon_vma locking is a big problem, but as
> outlined in my other email it's completely incorrect from an actual
> behavior standpoint.
>
> Btw, I think the anon_vma lock could be made a spinlock

The anon_vma lock used to be a spinlock, and was turned into a
mutex by Peter, as part of an effort to make more of the VM
preemptible.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 1/2] mm/rmap: Convert the struct anon_vma::mutex to an rwsem
  2012-12-01 18:50           ` Linus Torvalds
@ 2012-12-01 20:10             ` Ingo Molnar
  2012-12-01 20:19               ` Rik van Riel
  2012-12-03 13:59               ` Mel Gorman
  2012-12-01 20:15             ` [PATCH 2/2] mm/migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable Ingo Molnar
  1 sibling, 2 replies; 39+ messages in thread
From: Ingo Molnar @ 2012-12-01 20:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, linux-mm, Peter Zijlstra, Paul Turner,
	Lee Schermerhorn, Christoph Lameter, Rik van Riel, Mel Gorman,
	Andrew Morton, Andrea Arcangeli, Thomas Gleixner,
	Johannes Weiner, Hugh Dickins


Convert the struct anon_vma::mutex to an rwsem, which will help
in solving a page-migration scalability problem. (Addressed in
a separate patch.)

The conversion is simple and straightforward: in every case
where we mutex_lock()ed we'll now down_write().

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/rmap.h |   16 ++++++++--------
 mm/huge_memory.c     |    4 ++--
 mm/mmap.c            |    8 ++++----
 mm/rmap.c            |   22 +++++++++++-----------
 4 files changed, 25 insertions(+), 25 deletions(-)

Index: linux/include/linux/rmap.h
===================================================================
--- linux.orig/include/linux/rmap.h
+++ linux/include/linux/rmap.h
@@ -7,7 +7,7 @@
 #include <linux/list.h>
 #include <linux/slab.h>
 #include <linux/mm.h>
-#include <linux/mutex.h>
+#include <linux/rwsem.h>
 #include <linux/memcontrol.h>
 
 /*
@@ -25,8 +25,8 @@
  * pointing to this anon_vma once its vma list is empty.
  */
 struct anon_vma {
-	struct anon_vma *root;	/* Root of this anon_vma tree */
-	struct mutex mutex;	/* Serialize access to vma list */
+	struct anon_vma *root;		/* Root of this anon_vma tree */
+	struct rw_semaphore rwsem;	/* W: modification, R: walking the list */
 	/*
 	 * The refcount is taken on an anon_vma when there is no
 	 * guarantee that the vma of page tables will exist for
@@ -64,7 +64,7 @@ struct anon_vma_chain {
 	struct vm_area_struct *vma;
 	struct anon_vma *anon_vma;
 	struct list_head same_vma;   /* locked by mmap_sem & page_table_lock */
-	struct rb_node rb;			/* locked by anon_vma->mutex */
+	struct rb_node rb;			/* locked by anon_vma->rwsem */
 	unsigned long rb_subtree_last;
 #ifdef CONFIG_DEBUG_VM_RB
 	unsigned long cached_vma_start, cached_vma_last;
@@ -108,24 +108,24 @@ static inline void vma_lock_anon_vma(str
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
 	if (anon_vma)
-		mutex_lock(&anon_vma->root->mutex);
+		down_write(&anon_vma->root->rwsem);
 }
 
 static inline void vma_unlock_anon_vma(struct vm_area_struct *vma)
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
 	if (anon_vma)
-		mutex_unlock(&anon_vma->root->mutex);
+		up_write(&anon_vma->root->rwsem);
 }
 
 static inline void anon_vma_lock(struct anon_vma *anon_vma)
 {
-	mutex_lock(&anon_vma->root->mutex);
+	down_write(&anon_vma->root->rwsem);
 }
 
 static inline void anon_vma_unlock(struct anon_vma *anon_vma)
 {
-	mutex_unlock(&anon_vma->root->mutex);
+	up_write(&anon_vma->root->rwsem);
 }
 
 /*
Index: linux/mm/huge_memory.c
===================================================================
--- linux.orig/mm/huge_memory.c
+++ linux/mm/huge_memory.c
@@ -1388,7 +1388,7 @@ static int __split_huge_page_splitting(s
 		 * We can't temporarily set the pmd to null in order
 		 * to split it, the pmd must remain marked huge at all
 		 * times or the VM won't take the pmd_trans_huge paths
-		 * and it won't wait on the anon_vma->root->mutex to
+		 * and it won't wait on the anon_vma->root->rwsem to
 		 * serialize against split_huge_page*.
 		 */
 		pmdp_splitting_flush(vma, address, pmd);
@@ -1591,7 +1591,7 @@ static int __split_huge_page_map(struct
 	return ret;
 }
 
-/* must be called with anon_vma->root->mutex hold */
+/* must be called with anon_vma->root->rwsem held */
 static void __split_huge_page(struct page *page,
 			      struct anon_vma *anon_vma)
 {
Index: linux/mm/mmap.c
===================================================================
--- linux.orig/mm/mmap.c
+++ linux/mm/mmap.c
@@ -2561,15 +2561,15 @@ static void vm_lock_anon_vma(struct mm_s
 		 * The LSB of head.next can't change from under us
 		 * because we hold the mm_all_locks_mutex.
 		 */
-		mutex_lock_nest_lock(&anon_vma->root->mutex, &mm->mmap_sem);
+		down_write(&anon_vma->root->rwsem);
 		/*
 		 * We can safely modify head.next after taking the
-		 * anon_vma->root->mutex. If some other vma in this mm shares
+		 * anon_vma->root->rwsem. If some other vma in this mm shares
 		 * the same anon_vma we won't take it again.
 		 *
 		 * No need of atomic instructions here, head.next
 		 * can't change from under us thanks to the
-		 * anon_vma->root->mutex.
+		 * anon_vma->root->rwsem.
 		 */
 		if (__test_and_set_bit(0, (unsigned long *)
 				       &anon_vma->root->rb_root.rb_node))
@@ -2671,7 +2671,7 @@ static void vm_unlock_anon_vma(struct an
 		 *
 		 * No need of atomic instructions here, head.next
 		 * can't change from under us until we release the
-		 * anon_vma->root->mutex.
+		 * anon_vma->root->rwsem.
 		 */
 		if (!__test_and_clear_bit(0, (unsigned long *)
 					  &anon_vma->root->rb_root.rb_node))
Index: linux/mm/rmap.c
===================================================================
--- linux.orig/mm/rmap.c
+++ linux/mm/rmap.c
@@ -24,7 +24,7 @@
  *   mm->mmap_sem
  *     page->flags PG_locked (lock_page)
  *       mapping->i_mmap_mutex
- *         anon_vma->mutex
+ *         anon_vma->rwsem
  *           mm->page_table_lock or pte_lock
  *             zone->lru_lock (in mark_page_accessed, isolate_lru_page)
  *             swap_lock (in swap_duplicate, swap_info_get)
@@ -37,7 +37,7 @@
  *                           in arch-dependent flush_dcache_mmap_lock,
  *                           within bdi.wb->list_lock in __sync_single_inode)
  *
- * anon_vma->mutex,mapping->i_mutex      (memory_failure, collect_procs_anon)
+ * anon_vma->rwsem,mapping->i_mutex      (memory_failure, collect_procs_anon)
  *   ->tasklist_lock
  *     pte map lock
  */
@@ -103,7 +103,7 @@ static inline void anon_vma_free(struct
 	 * LOCK should suffice since the actual taking of the lock must
 	 * happen _before_ what follows.
 	 */
-	if (mutex_is_locked(&anon_vma->root->mutex)) {
+	if (rwsem_is_locked(&anon_vma->root->rwsem)) {
 		anon_vma_lock(anon_vma);
 		anon_vma_unlock(anon_vma);
 	}
@@ -219,9 +219,9 @@ static inline struct anon_vma *lock_anon
 	struct anon_vma *new_root = anon_vma->root;
 	if (new_root != root) {
 		if (WARN_ON_ONCE(root))
-			mutex_unlock(&root->mutex);
+			up_write(&root->rwsem);
 		root = new_root;
-		mutex_lock(&root->mutex);
+		down_write(&root->rwsem);
 	}
 	return root;
 }
@@ -229,7 +229,7 @@ static inline struct anon_vma *lock_anon
 static inline void unlock_anon_vma_root(struct anon_vma *root)
 {
 	if (root)
-		mutex_unlock(&root->mutex);
+		up_write(&root->rwsem);
 }
 
 /*
@@ -349,7 +349,7 @@ void unlink_anon_vmas(struct vm_area_str
 	/*
 	 * Iterate the list once more, it now only contains empty and unlinked
 	 * anon_vmas, destroy them. Could not do before due to __put_anon_vma()
-	 * needing to acquire the anon_vma->root->mutex.
+	 * needing to write-acquire the anon_vma->root->rwsem.
 	 */
 	list_for_each_entry_safe(avc, next, &vma->anon_vma_chain, same_vma) {
 		struct anon_vma *anon_vma = avc->anon_vma;
@@ -365,7 +365,7 @@ static void anon_vma_ctor(void *data)
 {
 	struct anon_vma *anon_vma = data;
 
-	mutex_init(&anon_vma->mutex);
+	init_rwsem(&anon_vma->rwsem);
 	atomic_set(&anon_vma->refcount, 0);
 	anon_vma->rb_root = RB_ROOT;
 }
@@ -457,14 +457,14 @@ struct anon_vma *page_lock_anon_vma(stru
 
 	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
 	root_anon_vma = ACCESS_ONCE(anon_vma->root);
-	if (mutex_trylock(&root_anon_vma->mutex)) {
+	if (down_write_trylock(&root_anon_vma->rwsem)) {
 		/*
 		 * If the page is still mapped, then this anon_vma is still
 		 * its anon_vma, and holding the mutex ensures that it will
 		 * not go away, see anon_vma_free().
 		 */
 		if (!page_mapped(page)) {
-			mutex_unlock(&root_anon_vma->mutex);
+			up_write(&root_anon_vma->rwsem);
 			anon_vma = NULL;
 		}
 		goto out;
@@ -1299,7 +1299,7 @@ out_mlock:
 	/*
 	 * We need mmap_sem locking, Otherwise VM_LOCKED check makes
 	 * unstable result and race. Plus, We can't wait here because
-	 * we now hold anon_vma->mutex or mapping->i_mmap_mutex.
+	 * we now hold anon_vma->rwsem or mapping->i_mmap_mutex.
 	 * if trylock failed, the page remain in evictable lru and later
 	 * vmscan could retry to move the page to unevictable lru if the
 	 * page is actually mlocked.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 2/2] mm/migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
  2012-12-01 18:50           ` Linus Torvalds
  2012-12-01 20:10             ` [PATCH 1/2] mm/rmap: Convert the struct anon_vma::mutex to an rwsem Ingo Molnar
@ 2012-12-01 20:15             ` Ingo Molnar
  2012-12-01 20:33               ` Rik van Riel
  2012-12-03 14:17               ` [PATCH 2/2] " Mel Gorman
  1 sibling, 2 replies; 39+ messages in thread
From: Ingo Molnar @ 2012-12-01 20:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, linux-mm, Peter Zijlstra, Paul Turner,
	Lee Schermerhorn, Christoph Lameter, Rik van Riel, Mel Gorman,
	Andrew Morton, Andrea Arcangeli, Thomas Gleixner,
	Johannes Weiner, Hugh Dickins


Note, with this optimization I went a bit farther than the 
boundaries of the migration code - it seemed worthwhile to do, and 
I've reviewed all the other users of page_lock_anon_vma() as 
well; none seemed to be modifying the list inside that lock.

Please review this patch carefully - in particular the SMP races 
outlined in anon_vma_free() are exciting: I have updated the 
reasoning and it still appears to hold, but please double check 
the changes nevertheless ...

Thanks,

	Ingo

------------------->
From: Ingo Molnar <mingo@kernel.org>
Date: Sat Dec 1 20:43:04 CET 2012

rmap_walk_anon() and try_to_unmap_anon() appear to be too careful
about locking the anon vma: while they need protection against anon
vma list modifications, they do not need exclusive access to the
list itself.

Transforming this exclusive lock to a read-locked rwsem removes a
global lock from the hot path of page-migration-intense threaded
workloads, where it can cause pathological performance like this:

    96.43%        process 0  [kernel.kallsyms]  [k] perf_trace_sched_switch
                  |
                  --- perf_trace_sched_switch
                      __schedule
                      schedule
                      schedule_preempt_disabled
                      __mutex_lock_common.isra.6
                      __mutex_lock_slowpath
                      mutex_lock
                     |
                     |--50.61%-- rmap_walk
                     |          move_to_new_page
                     |          migrate_pages
                     |          migrate_misplaced_page
                     |          __do_numa_page.isra.69
                     |          handle_pte_fault
                     |          handle_mm_fault
                     |          __do_page_fault
                     |          do_page_fault
                     |          page_fault
                     |          __memset_sse2
                     |          |
                     |           --100.00%-- worker_thread
                     |                     |
                     |                      --100.00%-- start_thread
                     |
                      --49.39%-- page_lock_anon_vma
                                try_to_unmap_anon
                                try_to_unmap
                                migrate_pages
                                migrate_misplaced_page
                                __do_numa_page.isra.69
                                handle_pte_fault
                                handle_mm_fault
                                __do_page_fault
                                do_page_fault
                                page_fault
                                __memset_sse2
                                |
                                 --100.00%-- worker_thread
                                           start_thread

With this change applied the profile is now nicely flat
and there's no anon-vma related scheduling/blocking.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/rmap.h |   15 +++++++++++++--
 mm/huge_memory.c     |    4 ++--
 mm/memory-failure.c  |    4 ++--
 mm/migrate.c         |    2 +-
 mm/rmap.c            |   40 ++++++++++++++++++++--------------------
 5 files changed, 38 insertions(+), 27 deletions(-)

Index: linux/include/linux/rmap.h
===================================================================
--- linux.orig/include/linux/rmap.h
+++ linux/include/linux/rmap.h
@@ -128,6 +128,17 @@ static inline void anon_vma_unlock(struc
 	up_write(&anon_vma->root->rwsem);
 }
 
+static inline void anon_vma_lock_read(struct anon_vma *anon_vma)
+{
+	down_read(&anon_vma->root->rwsem);
+}
+
+static inline void anon_vma_unlock_read(struct anon_vma *anon_vma)
+{
+	up_read(&anon_vma->root->rwsem);
+}
+
+
 /*
  * anon_vma helper functions.
  */
@@ -220,8 +231,8 @@ int try_to_munlock(struct page *);
 /*
  * Called by memory-failure.c to kill processes.
  */
-struct anon_vma *page_lock_anon_vma(struct page *page);
-void page_unlock_anon_vma(struct anon_vma *anon_vma);
+struct anon_vma *page_lock_anon_vma_read(struct page *page);
+void page_unlock_anon_vma_read(struct anon_vma *anon_vma);
 int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma);
 
 /*
Index: linux/mm/huge_memory.c
===================================================================
--- linux.orig/mm/huge_memory.c
+++ linux/mm/huge_memory.c
@@ -1645,7 +1645,7 @@ int split_huge_page(struct page *page)
 	int ret = 1;
 
 	BUG_ON(!PageAnon(page));
-	anon_vma = page_lock_anon_vma(page);
+	anon_vma = page_lock_anon_vma_read(page);
 	if (!anon_vma)
 		goto out;
 	ret = 0;
@@ -1658,7 +1658,7 @@ int split_huge_page(struct page *page)
 
 	BUG_ON(PageCompound(page));
 out_unlock:
-	page_unlock_anon_vma(anon_vma);
+	page_unlock_anon_vma_read(anon_vma);
 out:
 	return ret;
 }
Index: linux/mm/memory-failure.c
===================================================================
--- linux.orig/mm/memory-failure.c
+++ linux/mm/memory-failure.c
@@ -402,7 +402,7 @@ static void collect_procs_anon(struct pa
 	struct anon_vma *av;
 	pgoff_t pgoff;
 
-	av = page_lock_anon_vma(page);
+	av = page_lock_anon_vma_read(page);
 	if (av == NULL)	/* Not actually mapped anymore */
 		return;
 
@@ -423,7 +423,7 @@ static void collect_procs_anon(struct pa
 		}
 	}
 	read_unlock(&tasklist_lock);
-	page_unlock_anon_vma(av);
+	page_unlock_anon_vma_read(av);
 }
 
 /*
Index: linux/mm/migrate.c
===================================================================
--- linux.orig/mm/migrate.c
+++ linux/mm/migrate.c
@@ -751,7 +751,7 @@ static int __unmap_and_move(struct page
 	 */
 	if (PageAnon(page)) {
 		/*
-		 * Only page_lock_anon_vma() understands the subtleties of
+		 * Only page_lock_anon_vma_read() understands the subtleties of
 		 * getting a hold on an anon_vma from outside one of its mms.
 		 */
 		anon_vma = page_get_anon_vma(page);
Index: linux/mm/rmap.c
===================================================================
--- linux.orig/mm/rmap.c
+++ linux/mm/rmap.c
@@ -87,18 +87,18 @@ static inline void anon_vma_free(struct
 	VM_BUG_ON(atomic_read(&anon_vma->refcount));
 
 	/*
-	 * Synchronize against page_lock_anon_vma() such that
+	 * Synchronize against page_lock_anon_vma_read() such that
 	 * we can safely hold the lock without the anon_vma getting
 	 * freed.
 	 *
 	 * Relies on the full mb implied by the atomic_dec_and_test() from
 	 * put_anon_vma() against the acquire barrier implied by
-	 * mutex_trylock() from page_lock_anon_vma(). This orders:
+	 * down_read_trylock() from page_lock_anon_vma_read(). This orders:
 	 *
-	 * page_lock_anon_vma()		VS	put_anon_vma()
-	 *   mutex_trylock()			  atomic_dec_and_test()
+	 * page_lock_anon_vma_read()	VS	put_anon_vma()
+	 *   down_read_trylock()		  atomic_dec_and_test()
 	 *   LOCK				  MB
-	 *   atomic_read()			  mutex_is_locked()
+	 *   atomic_read()			  rwsem_is_locked()
 	 *
 	 * LOCK should suffice since the actual taking of the lock must
 	 * happen _before_ what follows.
@@ -146,7 +146,7 @@ static void anon_vma_chain_link(struct v
  * allocate a new one.
  *
  * Anon-vma allocations are very subtle, because we may have
- * optimistically looked up an anon_vma in page_lock_anon_vma()
+ * optimistically looked up an anon_vma in page_lock_anon_vma_read()
  * and that may actually touch the spinlock even in the newly
  * allocated vma (it depends on RCU to make sure that the
  * anon_vma isn't actually destroyed).
@@ -442,7 +442,7 @@ out:
  * atomic op -- the trylock. If we fail the trylock, we fall back to getting a
  * reference like with page_get_anon_vma() and then block on the mutex.
  */
-struct anon_vma *page_lock_anon_vma(struct page *page)
+struct anon_vma *page_lock_anon_vma_read(struct page *page)
 {
 	struct anon_vma *anon_vma = NULL;
 	struct anon_vma *root_anon_vma;
@@ -457,14 +457,14 @@ struct anon_vma *page_lock_anon_vma(stru
 
 	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
 	root_anon_vma = ACCESS_ONCE(anon_vma->root);
-	if (down_write_trylock(&root_anon_vma->rwsem)) {
+	if (down_read_trylock(&root_anon_vma->rwsem)) {
 		/*
 		 * If the page is still mapped, then this anon_vma is still
 		 * its anon_vma, and holding the mutex ensures that it will
 		 * not go away, see anon_vma_free().
 		 */
 		if (!page_mapped(page)) {
-			up_write(&root_anon_vma->rwsem);
+			up_read(&root_anon_vma->rwsem);
 			anon_vma = NULL;
 		}
 		goto out;
@@ -484,7 +484,7 @@ struct anon_vma *page_lock_anon_vma(stru
 
 	/* we pinned the anon_vma, its safe to sleep */
 	rcu_read_unlock();
-	anon_vma_lock(anon_vma);
+	anon_vma_lock_read(anon_vma);
 
 	if (atomic_dec_and_test(&anon_vma->refcount)) {
 		/*
@@ -492,7 +492,7 @@ struct anon_vma *page_lock_anon_vma(stru
 		 * and bail -- can't simply use put_anon_vma() because
 		 * we'll deadlock on the anon_vma_lock() recursion.
 		 */
-		anon_vma_unlock(anon_vma);
+		anon_vma_unlock_read(anon_vma);
 		__put_anon_vma(anon_vma);
 		anon_vma = NULL;
 	}
@@ -504,9 +504,9 @@ out:
 	return anon_vma;
 }
 
-void page_unlock_anon_vma(struct anon_vma *anon_vma)
+void page_unlock_anon_vma_read(struct anon_vma *anon_vma)
 {
-	anon_vma_unlock(anon_vma);
+	anon_vma_unlock_read(anon_vma);
 }
 
 /*
@@ -732,7 +732,7 @@ static int page_referenced_anon(struct p
 	struct anon_vma_chain *avc;
 	int referenced = 0;
 
-	anon_vma = page_lock_anon_vma(page);
+	anon_vma = page_lock_anon_vma_read(page);
 	if (!anon_vma)
 		return referenced;
 
@@ -754,7 +754,7 @@ static int page_referenced_anon(struct p
 			break;
 	}
 
-	page_unlock_anon_vma(anon_vma);
+	page_unlock_anon_vma_read(anon_vma);
 	return referenced;
 }
 
@@ -1474,7 +1474,7 @@ static int try_to_unmap_anon(struct page
 	struct anon_vma_chain *avc;
 	int ret = SWAP_AGAIN;
 
-	anon_vma = page_lock_anon_vma(page);
+	anon_vma = page_lock_anon_vma_read(page);
 	if (!anon_vma)
 		return ret;
 
@@ -1501,7 +1501,7 @@ static int try_to_unmap_anon(struct page
 			break;
 	}
 
-	page_unlock_anon_vma(anon_vma);
+	page_unlock_anon_vma_read(anon_vma);
 	return ret;
 }
 
@@ -1696,7 +1696,7 @@ static int rmap_walk_anon(struct page *p
 	int ret = SWAP_AGAIN;
 
 	/*
-	 * Note: remove_migration_ptes() cannot use page_lock_anon_vma()
+	 * Note: remove_migration_ptes() cannot use page_lock_anon_vma_read()
 	 * because that depends on page_mapped(); but not all its usages
 	 * are holding mmap_sem. Users without mmap_sem are required to
 	 * take a reference count to prevent the anon_vma disappearing
@@ -1704,7 +1704,7 @@ static int rmap_walk_anon(struct page *p
 	anon_vma = page_anon_vma(page);
 	if (!anon_vma)
 		return ret;
-	anon_vma_lock(anon_vma);
+	anon_vma_lock_read(anon_vma);
 	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
 		struct vm_area_struct *vma = avc->vma;
 		unsigned long address = vma_address(page, vma);
@@ -1712,7 +1712,7 @@ static int rmap_walk_anon(struct page *p
 		if (ret != SWAP_AGAIN)
 			break;
 	}
-	anon_vma_unlock(anon_vma);
+	anon_vma_unlock_read(anon_vma);
 	return ret;
 }
 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/2] mm/rmap: Convert the struct anon_vma::mutex to an rwsem
  2012-12-01 20:10             ` [PATCH 1/2] mm/rmap: Convert the struct anon_vma::mutex to an rwsem Ingo Molnar
@ 2012-12-01 20:19               ` Rik van Riel
  2012-12-02 15:10                 ` Ingo Molnar
  2012-12-03 13:59               ` Mel Gorman
  1 sibling, 1 reply; 39+ messages in thread
From: Rik van Riel @ 2012-12-01 20:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-mm,
	Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Mel Gorman, Andrew Morton, Andrea Arcangeli, Thomas Gleixner,
	Johannes Weiner, Hugh Dickins

On 12/01/2012 03:10 PM, Ingo Molnar wrote:
>
> Convert the struct anon_vma::mutex to an rwsem, which will help
> in solving a page-migration scalability problem. (Addressed in
> a separate patch.)
>
> The conversion is simple and straightforward: in every case
> where we mutex_lock()ed we'll now down_write().
>
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Hugh Dickins <hughd@google.com>
> Signed-off-by: Ingo Molnar <mingo@kernel.org>

Reviewed-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 2/2] mm/migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
  2012-12-01 20:15             ` [PATCH 2/2] mm/migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable Ingo Molnar
@ 2012-12-01 20:33               ` Rik van Riel
  2012-12-02 15:12                 ` [PATCH 2/2, v2] " Ingo Molnar
  2012-12-03 14:17               ` [PATCH 2/2] " Mel Gorman
  1 sibling, 1 reply; 39+ messages in thread
From: Rik van Riel @ 2012-12-01 20:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-mm,
	Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Mel Gorman, Andrew Morton, Andrea Arcangeli, Thomas Gleixner,
	Johannes Weiner, Hugh Dickins

On 12/01/2012 03:15 PM, Ingo Molnar wrote:

> Index: linux/include/linux/rmap.h
> ===================================================================
> --- linux.orig/include/linux/rmap.h
> +++ linux/include/linux/rmap.h
> @@ -128,6 +128,17 @@ static inline void anon_vma_unlock(struc
>   	up_write(&anon_vma->root->rwsem);
>   }
>
> +static inline void anon_vma_lock_read(struct anon_vma *anon_vma)
> +{
> +	down_read(&anon_vma->root->rwsem);
> +}

I see you did not rename anon_vma_lock and anon_vma_unlock
to anon_vma_lock_write and anon_vma_unlock_write.

That could get confusing to people touching that code in
the future.

The patch looks correct, though.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 1/2] mm/rmap: Convert the struct anon_vma::mutex to an rwsem
  2012-12-01 20:19               ` Rik van Riel
@ 2012-12-02 15:10                 ` Ingo Molnar
  0 siblings, 0 replies; 39+ messages in thread
From: Ingo Molnar @ 2012-12-02 15:10 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-mm,
	Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Mel Gorman, Andrew Morton, Andrea Arcangeli, Thomas Gleixner,
	Johannes Weiner, Hugh Dickins


* Rik van Riel <riel@redhat.com> wrote:

> On 12/01/2012 03:10 PM, Ingo Molnar wrote:
> >
> >Convert the struct anon_vma::mutex to an rwsem, which will help
> >in solving a page-migration scalability problem. (Addressed in
> >a separate patch.)
> >
> >The conversion is simple and straightforward: in every case
> >where we mutex_lock()ed we'll now down_write().
> >
> >Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> >Cc: Andrew Morton <akpm@linux-foundation.org>
> >Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> >Cc: Andrea Arcangeli <aarcange@redhat.com>
> >Cc: Rik van Riel <riel@redhat.com>
> >Cc: Mel Gorman <mgorman@suse.de>
> >Cc: Hugh Dickins <hughd@google.com>
> >Signed-off-by: Ingo Molnar <mingo@kernel.org>
> 
> Reviewed-by: Rik van Riel <riel@redhat.com>

Thanks Rik!

	Ingo

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH 2/2, v2] mm/migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
  2012-12-01 20:33               ` Rik van Riel
@ 2012-12-02 15:12                 ` Ingo Molnar
  2012-12-02 17:53                   ` Rik van Riel
                                     ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Ingo Molnar @ 2012-12-02 15:12 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-mm,
	Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Mel Gorman, Andrew Morton, Andrea Arcangeli, Thomas Gleixner,
	Johannes Weiner, Hugh Dickins


* Rik van Riel <riel@redhat.com> wrote:

> >+static inline void anon_vma_lock_read(struct anon_vma *anon_vma)
> >+{
> >+	down_read(&anon_vma->root->rwsem);
> >+}
> 
> I see you did not rename anon_vma_lock and anon_vma_unlock to 
> anon_vma_lock_write and anon_vma_unlock_write.
> 
> That could get confusing to people touching that code in the 
> future.

Agreed, doing that rename makes perfect sense - I've done that 
in the v2 version attached below.

Thanks,

	Ingo

----------------------->
>From 21469dcb225b9cf3160f839b7a823448f5ce5afa Mon Sep 17 00:00:00 2001
From: Ingo Molnar <mingo@kernel.org>
Date: Sat, 1 Dec 2012 21:15:38 +0100
Subject: [PATCH] mm/rmap, migration: Make rmap_walk_anon() and
 try_to_unmap_anon() more scalable

rmap_walk_anon() and try_to_unmap_anon() appear to be too
careful about locking the anon vma: while they need protection
against anon vma list modifications, they do not need exclusive
access to the list itself.

Transforming this exclusive lock to a read-locked rwsem removes
a global lock from the hot path of page-migration-intense
threaded workloads, where it can cause pathological performance
like this:

    96.43%        process 0  [kernel.kallsyms]  [k] perf_trace_sched_switch
                  |
                  --- perf_trace_sched_switch
                      __schedule
                      schedule
                      schedule_preempt_disabled
                      __mutex_lock_common.isra.6
                      __mutex_lock_slowpath
                      mutex_lock
                     |
                     |--50.61%-- rmap_walk
                     |          move_to_new_page
                     |          migrate_pages
                     |          migrate_misplaced_page
                     |          __do_numa_page.isra.69
                     |          handle_pte_fault
                     |          handle_mm_fault
                     |          __do_page_fault
                     |          do_page_fault
                     |          page_fault
                     |          __memset_sse2
                     |          |
                     |           --100.00%-- worker_thread
                     |                     |
                     |                      --100.00%-- start_thread
                     |
                      --49.39%-- page_lock_anon_vma
                                try_to_unmap_anon
                                try_to_unmap
                                migrate_pages
                                migrate_misplaced_page
                                __do_numa_page.isra.69
                                handle_pte_fault
                                handle_mm_fault
                                __do_page_fault
                                do_page_fault
                                page_fault
                                __memset_sse2
                                |
                                 --100.00%-- worker_thread
                                           start_thread

With this change applied the profile is now nicely flat
and there's no anon-vma related scheduling/blocking.

Rename anon_vma_[un]lock() => anon_vma_[un]lock_write(),
to make it clearer that it's an exclusive write-lock in
that case - suggested by Rik van Riel.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Turner <pjt@google.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/huge_mm.h |  2 +-
 include/linux/rmap.h    | 17 ++++++++++++++---
 mm/huge_memory.c        |  6 +++---
 mm/ksm.c                |  6 +++---
 mm/memory-failure.c     |  4 ++--
 mm/migrate.c            |  2 +-
 mm/mmap.c               |  2 +-
 mm/mremap.c             |  2 +-
 mm/rmap.c               | 48 ++++++++++++++++++++++++------------------------
 9 files changed, 50 insertions(+), 39 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 7f5a552..81a9dee 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -101,7 +101,7 @@ extern void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd);
 #define wait_split_huge_page(__anon_vma, __pmd)				\
 	do {								\
 		pmd_t *____pmd = (__pmd);				\
-		anon_vma_lock(__anon_vma);				\
+		anon_vma_lock_write(__anon_vma);			\
 		anon_vma_unlock(__anon_vma);				\
 		BUG_ON(pmd_trans_splitting(*____pmd) ||			\
 		       pmd_trans_huge(*____pmd));			\
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index f3f41d2..c20635c 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -118,7 +118,7 @@ static inline void vma_unlock_anon_vma(struct vm_area_struct *vma)
 		up_write(&anon_vma->root->rwsem);
 }
 
-static inline void anon_vma_lock(struct anon_vma *anon_vma)
+static inline void anon_vma_lock_write(struct anon_vma *anon_vma)
 {
 	down_write(&anon_vma->root->rwsem);
 }
@@ -128,6 +128,17 @@ static inline void anon_vma_unlock(struct anon_vma *anon_vma)
 	up_write(&anon_vma->root->rwsem);
 }
 
+static inline void anon_vma_lock_read(struct anon_vma *anon_vma)
+{
+	down_read(&anon_vma->root->rwsem);
+}
+
+static inline void anon_vma_unlock_read(struct anon_vma *anon_vma)
+{
+	up_read(&anon_vma->root->rwsem);
+}
+
+
 /*
  * anon_vma helper functions.
  */
@@ -220,8 +231,8 @@ int try_to_munlock(struct page *);
 /*
  * Called by memory-failure.c to kill processes.
  */
-struct anon_vma *page_lock_anon_vma(struct page *page);
-void page_unlock_anon_vma(struct anon_vma *anon_vma);
+struct anon_vma *page_lock_anon_vma_read(struct page *page);
+void page_unlock_anon_vma_read(struct anon_vma *anon_vma);
 int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma);
 
 /*
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 25929c1..265667e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1644,7 +1644,7 @@ int split_huge_page(struct page *page)
 	int ret = 1;
 
 	BUG_ON(!PageAnon(page));
-	anon_vma = page_lock_anon_vma(page);
+	anon_vma = page_lock_anon_vma_read(page);
 	if (!anon_vma)
 		goto out;
 	ret = 0;
@@ -1657,7 +1657,7 @@ int split_huge_page(struct page *page)
 
 	BUG_ON(PageCompound(page));
 out_unlock:
-	page_unlock_anon_vma(anon_vma);
+	page_unlock_anon_vma_read(anon_vma);
 out:
 	return ret;
 }
@@ -2169,7 +2169,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	if (!pmd_present(*pmd) || pmd_trans_huge(*pmd))
 		goto out;
 
-	anon_vma_lock(vma->anon_vma);
+	anon_vma_lock_write(vma->anon_vma);
 
 	pte = pte_offset_map(pmd, address);
 	ptl = pte_lockptr(mm, pmd);
diff --git a/mm/ksm.c b/mm/ksm.c
index ae539f0..7fa37de 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1634,7 +1634,7 @@ again:
 		struct anon_vma_chain *vmac;
 		struct vm_area_struct *vma;
 
-		anon_vma_lock(anon_vma);
+		anon_vma_lock_write(anon_vma);
 		anon_vma_interval_tree_foreach(vmac, &anon_vma->rb_root,
 					       0, ULONG_MAX) {
 			vma = vmac->vma;
@@ -1688,7 +1688,7 @@ again:
 		struct anon_vma_chain *vmac;
 		struct vm_area_struct *vma;
 
-		anon_vma_lock(anon_vma);
+		anon_vma_lock_write(anon_vma);
 		anon_vma_interval_tree_foreach(vmac, &anon_vma->rb_root,
 					       0, ULONG_MAX) {
 			vma = vmac->vma;
@@ -1741,7 +1741,7 @@ again:
 		struct anon_vma_chain *vmac;
 		struct vm_area_struct *vma;
 
-		anon_vma_lock(anon_vma);
+		anon_vma_lock_write(anon_vma);
 		anon_vma_interval_tree_foreach(vmac, &anon_vma->rb_root,
 					       0, ULONG_MAX) {
 			vma = vmac->vma;
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 6c5899b..6b4460c 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -402,7 +402,7 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
 	struct anon_vma *av;
 	pgoff_t pgoff;
 
-	av = page_lock_anon_vma(page);
+	av = page_lock_anon_vma_read(page);
 	if (av == NULL)	/* Not actually mapped anymore */
 		return;
 
@@ -423,7 +423,7 @@ static void collect_procs_anon(struct page *page, struct list_head *to_kill,
 		}
 	}
 	read_unlock(&tasklist_lock);
-	page_unlock_anon_vma(av);
+	page_unlock_anon_vma_read(av);
 }
 
 /*
diff --git a/mm/migrate.c b/mm/migrate.c
index 3db0543..138cb34 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -751,7 +751,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 	 */
 	if (PageAnon(page)) {
 		/*
-		 * Only page_lock_anon_vma() understands the subtleties of
+		 * Only page_lock_anon_vma_read() understands the subtleties of
 		 * getting a hold on an anon_vma from outside one of its mms.
 		 */
 		anon_vma = page_get_anon_vma(page);
diff --git a/mm/mmap.c b/mm/mmap.c
index 27951e4..964a85c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -600,7 +600,7 @@ again:			remove_next = 1 + (end > next->vm_end);
 	if (anon_vma) {
 		VM_BUG_ON(adjust_next && next->anon_vma &&
 			  anon_vma != next->anon_vma);
-		anon_vma_lock(anon_vma);
+		anon_vma_lock_write(anon_vma);
 		anon_vma_interval_tree_pre_update_vma(vma);
 		if (adjust_next)
 			anon_vma_interval_tree_pre_update_vma(next);
diff --git a/mm/mremap.c b/mm/mremap.c
index 1b61c2d..3dabd17 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -104,7 +104,7 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd,
 		}
 		if (vma->anon_vma) {
 			anon_vma = vma->anon_vma;
-			anon_vma_lock(anon_vma);
+			anon_vma_lock_write(anon_vma);
 		}
 	}
 
diff --git a/mm/rmap.c b/mm/rmap.c
index 6e3ee3b..b0f612d 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -87,24 +87,24 @@ static inline void anon_vma_free(struct anon_vma *anon_vma)
 	VM_BUG_ON(atomic_read(&anon_vma->refcount));
 
 	/*
-	 * Synchronize against page_lock_anon_vma() such that
+	 * Synchronize against page_lock_anon_vma_read() such that
 	 * we can safely hold the lock without the anon_vma getting
 	 * freed.
 	 *
 	 * Relies on the full mb implied by the atomic_dec_and_test() from
 	 * put_anon_vma() against the acquire barrier implied by
-	 * mutex_trylock() from page_lock_anon_vma(). This orders:
+	 * down_read_trylock() from page_lock_anon_vma_read(). This orders:
 	 *
-	 * page_lock_anon_vma()		VS	put_anon_vma()
-	 *   mutex_trylock()			  atomic_dec_and_test()
+	 * page_lock_anon_vma_read()	VS	put_anon_vma()
+	 *   down_read_trylock()		  atomic_dec_and_test()
 	 *   LOCK				  MB
-	 *   atomic_read()			  mutex_is_locked()
+	 *   atomic_read()			  rwsem_is_locked()
 	 *
 	 * LOCK should suffice since the actual taking of the lock must
 	 * happen _before_ what follows.
 	 */
 	if (rwsem_is_locked(&anon_vma->root->rwsem)) {
-		anon_vma_lock(anon_vma);
+		anon_vma_lock_write(anon_vma);
 		anon_vma_unlock(anon_vma);
 	}
 
@@ -146,7 +146,7 @@ static void anon_vma_chain_link(struct vm_area_struct *vma,
  * allocate a new one.
  *
  * Anon-vma allocations are very subtle, because we may have
- * optimistically looked up an anon_vma in page_lock_anon_vma()
+ * optimistically looked up an anon_vma in page_lock_anon_vma_read()
  * and that may actually touch the spinlock even in the newly
  * allocated vma (it depends on RCU to make sure that the
  * anon_vma isn't actually destroyed).
@@ -181,7 +181,7 @@ int anon_vma_prepare(struct vm_area_struct *vma)
 			allocated = anon_vma;
 		}
 
-		anon_vma_lock(anon_vma);
+		anon_vma_lock_write(anon_vma);
 		/* page_table_lock to protect against threads */
 		spin_lock(&mm->page_table_lock);
 		if (likely(!vma->anon_vma)) {
@@ -306,7 +306,7 @@ int anon_vma_fork(struct vm_area_struct *vma, struct vm_area_struct *pvma)
 	get_anon_vma(anon_vma->root);
 	/* Mark this anon_vma as the one where our new (COWed) pages go. */
 	vma->anon_vma = anon_vma;
-	anon_vma_lock(anon_vma);
+	anon_vma_lock_write(anon_vma);
 	anon_vma_chain_link(vma, avc, anon_vma);
 	anon_vma_unlock(anon_vma);
 
@@ -442,7 +442,7 @@ out:
  * atomic op -- the trylock. If we fail the trylock, we fall back to getting a
  * reference like with page_get_anon_vma() and then block on the mutex.
  */
-struct anon_vma *page_lock_anon_vma(struct page *page)
+struct anon_vma *page_lock_anon_vma_read(struct page *page)
 {
 	struct anon_vma *anon_vma = NULL;
 	struct anon_vma *root_anon_vma;
@@ -457,14 +457,14 @@ struct anon_vma *page_lock_anon_vma(struct page *page)
 
 	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
 	root_anon_vma = ACCESS_ONCE(anon_vma->root);
-	if (down_write_trylock(&root_anon_vma->rwsem)) {
+	if (down_read_trylock(&root_anon_vma->rwsem)) {
 		/*
 		 * If the page is still mapped, then this anon_vma is still
 		 * its anon_vma, and holding the mutex ensures that it will
 		 * not go away, see anon_vma_free().
 		 */
 		if (!page_mapped(page)) {
-			up_write(&root_anon_vma->rwsem);
+			up_read(&root_anon_vma->rwsem);
 			anon_vma = NULL;
 		}
 		goto out;
@@ -484,15 +484,15 @@ struct anon_vma *page_lock_anon_vma(struct page *page)
 
 	/* we pinned the anon_vma, its safe to sleep */
 	rcu_read_unlock();
-	anon_vma_lock(anon_vma);
+	anon_vma_lock_read(anon_vma);
 
 	if (atomic_dec_and_test(&anon_vma->refcount)) {
 		/*
 		 * Oops, we held the last refcount, release the lock
 		 * and bail -- can't simply use put_anon_vma() because
-		 * we'll deadlock on the anon_vma_lock() recursion.
+		 * we'll deadlock on the anon_vma_lock_write() recursion.
 		 */
-		anon_vma_unlock(anon_vma);
+		anon_vma_unlock_read(anon_vma);
 		__put_anon_vma(anon_vma);
 		anon_vma = NULL;
 	}
@@ -504,9 +504,9 @@ out:
 	return anon_vma;
 }
 
-void page_unlock_anon_vma(struct anon_vma *anon_vma)
+void page_unlock_anon_vma_read(struct anon_vma *anon_vma)
 {
-	anon_vma_unlock(anon_vma);
+	anon_vma_unlock_read(anon_vma);
 }
 
 /*
@@ -732,7 +732,7 @@ static int page_referenced_anon(struct page *page,
 	struct anon_vma_chain *avc;
 	int referenced = 0;
 
-	anon_vma = page_lock_anon_vma(page);
+	anon_vma = page_lock_anon_vma_read(page);
 	if (!anon_vma)
 		return referenced;
 
@@ -754,7 +754,7 @@ static int page_referenced_anon(struct page *page,
 			break;
 	}
 
-	page_unlock_anon_vma(anon_vma);
+	page_unlock_anon_vma_read(anon_vma);
 	return referenced;
 }
 
@@ -1474,7 +1474,7 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
 	struct anon_vma_chain *avc;
 	int ret = SWAP_AGAIN;
 
-	anon_vma = page_lock_anon_vma(page);
+	anon_vma = page_lock_anon_vma_read(page);
 	if (!anon_vma)
 		return ret;
 
@@ -1501,7 +1501,7 @@ static int try_to_unmap_anon(struct page *page, enum ttu_flags flags)
 			break;
 	}
 
-	page_unlock_anon_vma(anon_vma);
+	page_unlock_anon_vma_read(anon_vma);
 	return ret;
 }
 
@@ -1696,7 +1696,7 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
 	int ret = SWAP_AGAIN;
 
 	/*
-	 * Note: remove_migration_ptes() cannot use page_lock_anon_vma()
+	 * Note: remove_migration_ptes() cannot use page_lock_anon_vma_read()
 	 * because that depends on page_mapped(); but not all its usages
 	 * are holding mmap_sem. Users without mmap_sem are required to
 	 * take a reference count to prevent the anon_vma disappearing
@@ -1704,7 +1704,7 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
 	anon_vma = page_anon_vma(page);
 	if (!anon_vma)
 		return ret;
-	anon_vma_lock(anon_vma);
+	anon_vma_lock_read(anon_vma);
 	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
 		struct vm_area_struct *vma = avc->vma;
 		unsigned long address = vma_address(page, vma);
@@ -1712,7 +1712,7 @@ static int rmap_walk_anon(struct page *page, int (*rmap_one)(struct page *,
 		if (ret != SWAP_AGAIN)
 			break;
 	}
-	anon_vma_unlock(anon_vma);
+	anon_vma_unlock_read(anon_vma);
 	return ret;
 }
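
For reference, a minimal sketch (not from the patch itself; both example
functions are hypothetical and assume the usual <linux/rmap.h> and
<linux/mm.h> declarations) of how the renamed API is meant to be used:
callers that modify the anon_vma interval tree take the rwsem exclusively
via anon_vma_lock_write(), while rmap walkers take it shared via
page_lock_anon_vma_read(), which is what allows several walkers to run
in parallel:

static void example_adjust_vma(struct vm_area_struct *vma)
{
	struct anon_vma *anon_vma = vma->anon_vma;

	/* exclusive: the interval tree is about to change */
	anon_vma_lock_write(anon_vma);
	/* ... insert/remove entries in the anon_vma interval tree ... */
	anon_vma_unlock(anon_vma);
}

static int example_walk_rmap(struct page *page)
{
	/* shared: walk the mappings of the page without modifying them */
	struct anon_vma *anon_vma = page_lock_anon_vma_read(page);

	if (!anon_vma)
		return 0;
	/* ... iterate over the VMAs mapping the page, read-only ... */
	page_unlock_anon_vma_read(anon_vma);
	return 1;
}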
 

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH 2/2, v2] mm/migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
  2012-12-02 15:12                 ` [PATCH 2/2, v2] " Ingo Molnar
@ 2012-12-02 17:53                   ` Rik van Riel
  2012-12-04 14:42                   ` Michel Lespinasse
  2012-12-05  2:59                   ` Michel Lespinasse
  2 siblings, 0 replies; 39+ messages in thread
From: Rik van Riel @ 2012-12-02 17:53 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-mm,
	Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Mel Gorman, Andrew Morton, Andrea Arcangeli, Thomas Gleixner,
	Johannes Weiner, Hugh Dickins

On 12/02/2012 10:12 AM, Ingo Molnar wrote:

> Rename anon_vma_[un]lock() => anon_vma_[un]lock_write(),
> to make it clearer that it's an exclusive write-lock in
> that case - suggested by Rik van Riel.

... close, but you forgot to actually rename the unlock function :)

> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 7f5a552..81a9dee 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -101,7 +101,7 @@ extern void __split_huge_page_pmd(struct mm_struct *mm, pmd_t *pmd);
>   #define wait_split_huge_page(__anon_vma, __pmd)				\
>   	do {								\
>   		pmd_t *____pmd = (__pmd);				\
> -		anon_vma_lock(__anon_vma);				\
> +		anon_vma_lock_write(__anon_vma);			\
>   		anon_vma_unlock(__anon_vma);				\
>   		BUG_ON(pmd_trans_splitting(*____pmd) ||			\
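
Presumably the intended end state is a consistently named pair along
these lines (the _write suffix on the unlock side is an assumption here,
not something posted in this thread):

 		anon_vma_lock_write(__anon_vma);			\
 		anon_vma_unlock_write(__anon_vma);			\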

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 00/10] Latest numa/core release, v18
  2012-11-30 19:58 [PATCH 00/10] Latest numa/core release, v18 Ingo Molnar
                   ` (10 preceding siblings ...)
  2012-11-30 20:37 ` [PATCH 00/10] Latest numa/core release, v18 Linus Torvalds
@ 2012-12-03 10:43 ` Mel Gorman
  2012-12-03 11:32 ` Mel Gorman
  2012-12-04 22:49 ` Mel Gorman
  13 siblings, 0 replies; 39+ messages in thread
From: Mel Gorman @ 2012-12-03 10:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-mm, Peter Zijlstra, Paul Turner,
	Lee Schermerhorn, Christoph Lameter, Rik van Riel, Andrew Morton,
	Andrea Arcangeli, Linus Torvalds, Thomas Gleixner,
	Johannes Weiner, Hugh Dickins

I was away for the weekend so did not see this until Sunday night. I
queued up tip/master as it looked at that time and ran it overnight. In
general, I have not looked closely at any of the patches.

As a heads-up, I'm also flying very early tomorrow morning and will be
travelling for the week. I'll have intermittent access to email and *should*
be able to access my test machine remotely but my responsiveness will vary.

On Fri, Nov 30, 2012 at 08:58:31PM +0100, Ingo Molnar wrote:
> I'm pleased to announce the latest, -v18 numa/core release.
> 
> This release fixes regressions and improves NUMA performance.
> It has the following main changes:
> 
>   - Introduce directed NUMA convergence, which is based on
>     the 'task buddy' relation introduced in -v17, and make
>     use of the new "task flipping" facility.
> 
>   - Add "related task group" balancing notion to the scheduler, to
>     be able to 'compress' and 'spread' NUMA workloads
>     based on which tasks relate to each other via their
>     working set (i.e. which tasks access the same memory areas).
> 
>   - Track the quality and strength of NUMA convergence and
>     create a feedback loop with the scheduler:
> 
>      - use it to direct migrations
> 
>      - use it to slow down and speed up the rate of the
>        NUMA hinting page faults
> 
>   - Turn 4K pte NUMA faults into effective hugepage ones
> 

This one piqued my interest and I took a closer look.

It does multiple things, including a cleanup, but at a glance it looks
like it has similar problems to the earlier version of this patch, which
I reviewed here: https://lkml.org/lkml/2012/11/21/238. It looks like
you'll still incur a PMD's worth of work even if the workload has not
converged within that PMD boundary. The trylock, lock, unlock, put,
refault sequence is new as well and it's not clear what it's for. It's
not clear why you need the page lock in this path, nor why you always
refault if it's contended instead of rechecking the PTE.

I know the page lock is taken in the transhuge patch but it's a massive
hack and not required here as such. Migration does take the page lock,
but it does so later.

This was based on just a quick glance and I likely missed a bunch of
obvious things that may have alleviated my concerns after the last
review.

>   - Refine the 'shared tasks' memory interleaving logic
> 
>   - Improve CONFIG_NUMA_BALANCING=y OOM behavior
> 
> One key practical area of improvement are enhancements to
> the NUMA convergence of "multiple JVM" kind of workloads.
> 
> As a recap, this was -v17 performance with 4x SPECjbb instances
> on a 4-node system (32 CPUs, 4 instances, 8 warehouses each, 240
> seconds runtime, +THP):
> 
>      spec1.txt:           throughput =     177460.44 SPECjbb2005 bops
>      spec2.txt:           throughput =     176175.08 SPECjbb2005 bops
>      spec3.txt:           throughput =     175053.91 SPECjbb2005 bops
>      spec4.txt:           throughput =     171383.52 SPECjbb2005 bops
>                                       --------------------------
>            SUM:           throughput =     700072.95 SPECjbb2005 bops
> 
> The new -v18 figures are:
> 
>      spec1.txt:           throughput =     191415.52 SPECjbb2005 bops 
>      spec2.txt:           throughput =     193481.96 SPECjbb2005 bops 
>      spec3.txt:           throughput =     192865.30 SPECjbb2005 bops 
>      spec4.txt:           throughput =     191627.40 SPECjbb2005 bops 
>                                            --------------------------
>            SUM:           throughput =     769390.18 SPECjbb2005 bops
> 
> Which is 10% faster than -v17, 22% faster than mainline and it is
> within 1% of the hard-binding results (where each JVM is explicitly
> memory and CPU-bound to a single node each).
> 
> Occording to my measurements the -v18 NUMA kernel is also faster than
> AutoNUMA (+THP-fix):
> 
>      spec1.txt:           throughput =     184327.49 SPECjbb2005 bops
>      spec2.txt:           throughput =     187508.83 SPECjbb2005 bops
>      spec3.txt:           throughput =     186206.44 SPECjbb2005 bops
>      spec4.txt:           throughput =     188739.22 SPECjbb2005 bops
>                                            --------------------------
>            SUM:           throughput =     746781.98 SPECjbb2005 bops
> 
> Mainline has the following 4x JVM performance:
> 
>      spec1.txt:           throughput =     157839.25 SPECjbb2005 bops
>      spec2.txt:           throughput =     156969.15 SPECjbb2005 bops
>      spec3.txt:           throughput =     157571.59 SPECjbb2005 bops
>      spec4.txt:           throughput =     157873.86 SPECjbb2005 bops
>                                       --------------------------
>            SUM:           throughput =     630253.85 SPECjbb2005 bops
> 
> Another key area of improvement is !THP (4K pages) performance.
> 
> Mainline 4x SPECjbb !THP JVM results:
> 
>      spec1.txt:           throughput =     128575.47 SPECjbb2005 bops 
>      spec2.txt:           throughput =     125767.24 SPECjbb2005 bops 
>      spec3.txt:           throughput =     130042.30 SPECjbb2005 bops 
>      spec4.txt:           throughput =     128155.32 SPECjbb2005 bops 
>                                        --------------------------
>            SUM:           throughput =     512540.33 SPECjbb2005 bops
> 
> 
> numa/core -v18 4x SPECjbb JVM !THP results:
> 
>      spec1.txt:           throughput =     158023.05 SPECjbb2005 bops 
>      spec2.txt:           throughput =     156895.51 SPECjbb2005 bops 
>      spec3.txt:           throughput =     156158.11 SPECjbb2005 bops 
>      spec4.txt:           throughput =     157414.52 SPECjbb2005 bops 
>                                       --------------------------
>            SUM:           throughput =     628491.19 SPECjbb2005 bops
> 
> That too is roughly 22% faster than mainline - the !THP regression
> that was reported by Mel Gorman appears to be fixed.
> 

Ok, luckily I had queued a full set of tests over the weekend and adding
tip/master as of last night was not an issue. It looks like it completed
an hour ago so I'll go through it shortly and report what I see.

> AutoNUMA-benchmark comparison to the mainline kernel:
> 
>  ##############
>  # res-v3.6-vanilla.log vs res-numacore-v18b.log:
>  #------------------------------------------------------------------------------------>
>    autonuma benchmark                run time (lower is better)         speedup %
>  ------------------------------------------------------------------------------------->
>    numa01                           :   337.29  vs.  177.64   |           +89.8 %
>    numa01_THREAD_ALLOC              :   428.79  vs.  127.07   |          +237.4 %
>    numa02                           :    56.32  vs.   18.08   |          +211.5 %
>    ------------------------------------------------------------
> 
> (this is similar to -v17, within noise.)
> 
> Comparison to AutoNUMA-v28 (+THP-fix):
> 
>  ##############
>  # res-autonuma-v28-THP.log vs res-numacore-v18b.log:
>  #------------------------------------------------------------------------------------>
>    autonuma benchmark                run time (lower is better)         speedup %
>  ------------------------------------------------------------------------------------->
>    numa01                           :   235.77  vs.  177.64   |           +32.7 %
>    numa01_THREAD_ALLOC              :   134.53  vs.  127.07   |            +5.8 %
>    numa02                           :    19.49  vs.   18.08   |            +7.7 %
>    ------------------------------------------------------------
> 
> A few caveats: I'm still seeing problems on !THP.
> 
> Here's the analysis of one of the last regression sources I'm still
> seeing with it on larger systems. I have identified the source
> of the regression, and I see how the AutoNUMA and 'balancenuma' trees
> solved this problem - but I disagree with the solution.
> 
> When pushed hard enough via threaded workloads (for example via the
> numa02 test) then the upstream page migration code in mm/migration.c
> becomes unscalable, resulting in lot of scheduling on the anon vma
> mutex and a subsequent drop in performance.
> 
> When the points of scheduling are call-graph profiled, the
> unscalability appears to be due to interaction between the
> following page migration code paths:
> 
>     96.43%        process 0  [kernel.kallsyms]  [k] perf_trace_sched_switch
>                   |
>                   --- perf_trace_sched_switch
>                       __schedule
>                       schedule
>                       schedule_preempt_disabled
>                       __mutex_lock_common.isra.6
>                       __mutex_lock_slowpath
>                       mutex_lock
>                      |
>                      |--50.61%-- rmap_walk
>                      |          move_to_new_page
>                      |          migrate_pages
>                      |          migrate_misplaced_page
>                      |          __do_numa_page.isra.69
>                      |          handle_pte_fault
>                      |          handle_mm_fault
>                      |          __do_page_fault
>                      |          do_page_fault
>                      |          page_fault
>                      |          __memset_sse2
>                      |          |
>                      |           --100.00%-- worker_thread
>                      |                     |
>                      |                      --100.00%-- start_thread
>                      |
>                       --49.39%-- page_lock_anon_vma
>                                 try_to_unmap_anon
>                                 try_to_unmap
>                                 migrate_pages
>                                 migrate_misplaced_page
>                                 __do_numa_page.isra.69
>                                 handle_pte_fault
>                                 handle_mm_fault
>                                 __do_page_fault
>                                 do_page_fault
>                                 page_fault
>                                 __memset_sse2
>                                 |
>                                  --100.00%-- worker_thread
>                                            start_thread
> 
> From what I can see the AutoNUMA and 'balancenuma' kernels work
> around this !THP scalability issue by rate-limiting migrations.
> For example balancenuma rate-limits migrations to about 1.2 GB/sec
> bandwidth.
> 

This is not what rate limiting was concerned with. Rate limiting addressed
two concerns.

1. NUMA balancing should not consume a high percentage of memory
   bandwidth
2. If the policy encounters an adverse workload, the machine should not
   drastically slow down due to spending all its time migrating.

The two concerns are related. The first one is basically saying that
perfect balancing is pointless if the actual workload is not able to
access memory because the bus is congested. The second is more
important: I was basically assuming that no matter how smart a policy
is, it will eventually encounter a workload it simply cannot handle
properly, and will break down. When that happens, we do not want the
user's machine to fall apart. Rate limiting is a backstop for how *bad*
we can get.

Consider a deliberately adverse workload that creates a process per node
and allocates an amount of memory per process. Every scan interval, it
binds its CPUs to the next node. The intention of this workload would
be to maximise the amount of migration NUMA balancing does. Without rate
limiting, a few instances of this workload could keep the memory bus filled
with migration traffic and potentially act as a local DoS.
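
A minimal userspace sketch of such an adverse workload (purely
illustrative; the node count, CPUs per node, allocation size and rebind
interval below are arbitrary assumptions about the test machine, and you
would run one instance per node):

/* adverse-numa.c: keep rebinding to a different node to force migrations */
#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NODES		4			/* assumed node count */
#define CPUS_PER_NODE	8			/* assumed topology */
#define WSS		(512UL << 20)		/* per-process working set */

static void bind_to_node(int node)
{
	cpu_set_t set;
	int cpu;

	CPU_ZERO(&set);
	for (cpu = 0; cpu < CPUS_PER_NODE; cpu++)
		CPU_SET(node * CPUS_PER_NODE + cpu, &set);
	sched_setaffinity(0, sizeof(set), &set);
}

int main(void)
{
	char *mem = malloc(WSS);
	int node = 0;

	for (;;) {
		bind_to_node(node);
		/* touch the whole working set from the "wrong" node */
		memset(mem, node, WSS);
		node = (node + 1) % NODES;
		sleep(1);		/* roughly one scan interval */
	}
}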

That said, I agree that getting bottlenecked here is unfortunate and
should be addressed but it does not obviate the need for rate limiting.

> Rate-limiting to solve scalability limits is not the right
> solution IMO, because it hurts cases where migration is justified.
> The migration of the working set itself is not a problem, it would
> in fact be beneficial - but our implementation of it does not scale
> beyond a certain rate.
> 

Which would be logical if scalability was the problem it was addressing
but it's not. It's to stop the machine going to pot if there is a hostile
user of a shared machine or the policy breaks down.

> ( THP, which has a 512 times lower natural rate of migration page
>   faults, does not run into this scalability limit. )
> 
> So this issue is still open and testers are encouraged to use THP
> if they can.
> 

As before, not everything can use THP. For example, on a local machine
OpenMPI by default communicates through shared mappings in /tmp/.
Granted, this is only important during communication, so one would hope
it's only a factor during the initial setup and the final reduction.
Also remember that THP is not always available, either due to
fragmentation or because the watermarks are not met if the NUMA node has
most of its memory allocated.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH 00/10] Latest numa/core release, v18
  2012-11-30 19:58 [PATCH 00/10] Latest numa/core release, v18 Ingo Molnar
                   ` (11 preceding siblings ...)
  2012-12-03 10:43 ` Mel Gorman
@ 2012-12-03 11:32 ` Mel Gorman
  2012-12-04 22:49 ` Mel Gorman
  13 siblings, 0 replies; 39+ messages in thread
From: Mel Gorman @ 2012-12-03 11:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-mm, Peter Zijlstra, Paul Turner,
	Lee Schermerhorn, Christoph Lameter, Rik van Riel, Andrew Morton,
	Andrea Arcangeli, Linus Torvalds, Thomas Gleixner,
	Johannes Weiner, Hugh Dickins

On Fri, Nov 30, 2012 at 08:58:31PM +0100, Ingo Molnar wrote:
> I'm pleased to announce the latest, -v18 numa/core release.
> 

This is a very similar report to the previous one, which covered V8 of
my own tree and v17 of numacore. There are two differences. One, I've
added tip/master as of the night of December 2nd 2012. The second is
that, unlike earlier reports, there was no monitoring active this time,
to avoid any possible interference due to reading numa_maps. I did not
change any of the defaults for numacore.

AUTONUMA BENCH
                                      3.7.0-rc7             3.7.0-rc6             3.7.0-rc7             3.7.0-rc7             3.7.0-rc7
                                     stats-v8r6     numacore-20121130     numacore-20121202    autonuma-v28fastr4       thpmigrate-v8r6
User    NUMA01               65230.85 (  0.00%)    24835.22 ( 61.93%)    69344.37 ( -6.31%)    30410.22 ( 53.38%)    54614.75 ( 16.27%)
User    NUMA01_THEADLOCAL    60794.67 (  0.00%)    17856.17 ( 70.63%)    53416.06 ( 12.14%)    17185.34 ( 71.73%)    17389.52 ( 71.40%)
User    NUMA02                7031.50 (  0.00%)     2084.38 ( 70.36%)     6726.17 (  4.34%)     2238.73 ( 68.16%)     2088.37 ( 70.30%)
User    NUMA02_SMT            2916.19 (  0.00%)     1009.28 ( 65.39%)     3207.30 ( -9.98%)     1037.07 ( 64.44%)      987.84 ( 66.13%)
System  NUMA01                  39.66 (  0.00%)      926.55 (-2236.23%)      333.49 (-740.87%)      236.83 (-497.15%)      272.38 (-586.79%)
System  NUMA01_THEADLOCAL       42.33 (  0.00%)      513.99 (-1114.25%)       40.59 (  4.11%)       70.90 (-67.49%)       97.20 (-129.62%)
System  NUMA02                   1.25 (  0.00%)       18.57 (-1385.60%)        1.04 ( 16.80%)        6.39 (-411.20%)        9.27 (-641.60%)
System  NUMA02_SMT              16.66 (  0.00%)       12.32 ( 26.05%)        0.95 ( 94.30%)        3.17 ( 80.97%)        3.44 ( 79.35%)
Elapsed NUMA01                1511.76 (  0.00%)      575.93 ( 61.90%)     1644.63 ( -8.79%)      701.62 ( 53.59%)     1229.74 ( 18.66%)
Elapsed NUMA01_THEADLOCAL     1387.17 (  0.00%)      398.55 ( 71.27%)     1260.92 (  9.10%)      378.47 ( 72.72%)      390.62 ( 71.84%)
Elapsed NUMA02                 176.81 (  0.00%)       51.14 ( 71.08%)      180.80 ( -2.26%)       53.45 ( 69.77%)       50.25 ( 71.58%)
Elapsed NUMA02_SMT             163.96 (  0.00%)       48.92 ( 70.16%)      166.96 ( -1.83%)       48.17 ( 70.62%)       45.99 ( 71.95%)
CPU     NUMA01                4317.00 (  0.00%)     4473.00 ( -3.61%)     4236.00 (  1.88%)     4368.00 ( -1.18%)     4463.00 ( -3.38%)
CPU     NUMA01_THEADLOCAL     4385.00 (  0.00%)     4609.00 ( -5.11%)     4239.00 (  3.33%)     4559.00 ( -3.97%)     4476.00 ( -2.08%)
CPU     NUMA02                3977.00 (  0.00%)     4111.00 ( -3.37%)     3720.00 (  6.46%)     4200.00 ( -5.61%)     4173.00 ( -4.93%)
CPU     NUMA02_SMT            1788.00 (  0.00%)     2087.00 (-16.72%)     1921.00 ( -7.44%)     2159.00 (-20.75%)     2155.00 (-20.53%)

I'm seeing very different results from Ingo for some reason.
numacore-20121130 was pretty good in terms of elapsed time for numa01,
but numacore-20121202 is a little worse than mainline and much worse
than v17. All the other tests suffered too, which is a major surprise.
If you were looking at just the elapsed times you might conclude that
numacore was not enabled, but I just checked:

compass:/usr/src/linux-3.7-rc7-numacore-20121202 # grep BALANC .config
CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
CONFIG_NUMA_BALANCING=y
CONFIG_NUMA_BALANCING_HUGEPAGE=y
# CONFIG_NET_TEAM_MODE_LOADBALANCE is not set

And besides, you can see differences in the system CPU usage. I see
there are a bunch of new knobs available for the scheduler. Did you
tweak any of them? Was perf profiling running during your tests?

FWIW, the System CPU usage of numacore has improved a *lot* according
to this test. I note from one of the patch subjects that a bunch of work
is moved into a task worklet. How is this accounted for? Is it reflected
in System CPU usage, or is it included in the User CPU usage because of
when the worklets are executed? Is the cost hidden entirely and not
accounted anywhere?

In general, balancenuma does very well in comparison to numacore-20121202
and both kernels were using identical configurations and scripts.

MMTests Statistics: duration
           3.7.0-rc7   3.7.0-rc6   3.7.0-rc7   3.7.0-rc7   3.7.0-rc7
          stats-v8r6numacore-20121130numacore-20121202autonuma-v28fastr4thpmigrate-v8r6
User       135980.38    45792.55   132701.13    50878.50    75087.34
System        100.53     1472.19      376.74      317.89      382.96
Elapsed      3248.36     1084.63     3262.62     1191.85     1726.87

numacore-20121202 adds some system CPU overhead but as you can see it's
much better than it was before.


MMTests Statistics: vmstat
                             3.7.0-rc7   3.7.0-rc6   3.7.0-rc7   3.7.0-rc7   3.7.0-rc7
                            stats-v8r6numacore-20121130numacore-20121202autonuma-v28fastr4thpmigrate-v8r6
Page Ins                         42320       41628       40624       41592       42100
Page Outs                        16516        8032       17064        8596       10432
Swap Ins                             0           0           0           0           0
Swap Outs                            0           0           0           0           0
Direct pages scanned                 0           0           0           0           0
Kswapd pages scanned                 0           0           0           0           0
Kswapd pages reclaimed               0           0           0           0           0
Direct pages reclaimed               0           0           0           0           0
Kswapd efficiency                 100%        100%        100%        100%        100%
Kswapd velocity                  0.000       0.000       0.000       0.000       0.000
Direct efficiency                 100%        100%        100%        100%        100%
Direct velocity                  0.000       0.000       0.000       0.000       0.000
Percentage direct scans             0%          0%          0%          0%          0%
Page writes by reclaim               0           0           0           0           0
Page writes file                     0           0           0           0           0
Page writes anon                     0           0           0           0           0
Page reclaim immediate               0           0           0           0           0
Page rescued immediate               0           0           0           0           0
Slabs scanned                        0           0           0           0           0
Direct inode steals                  0           0           0           0           0
Kswapd inode steals                  0           0           0           0           0
Kswapd skipped wait                  0           0           0           0           0
THP fault alloc                  17801       13484       19107       20032       17152
THP collapse alloc                  14           0           6          54           7
THP splits                           5           0           5           7           2
THP fault fallback                   0           0           0           0           0
THP collapse fail                    0           0           0           0           0
Compaction stalls                    0           0           0           0           0
Compaction success                   0           0           0           0           0
Compaction failures                  0           0           0           0           0
Page migrate success                 0           0           0           0     9398512
Page migrate failure                 0           0           0           0           0
Compaction pages isolated            0           0           0           0           0
Compaction migrate scanned           0           0           0           0           0
Compaction free scanned              0           0           0           0           0
Compaction cost                      0           0           0           0        9755
NUMA PTE updates                     0           0           0           0   140192190
NUMA hint faults                     0           0           0           0      748819
NUMA hint local faults               0           0           0           0      570031
NUMA pages migrated                  0           0           0           0     9398512
AutoNUMA cost                        0           0           0           0        4904

THP was certainly in use; you can see the fault allocs.

Next is the specjbb. There are 4 separate configurations

multiple JVMs, THP
multiple JVMs, no THP
single JVM, THP
single JVM, no THP

SPECJBB: Multiple JVMs (one per node, 4 nodes), THP is enabled
                      3.7.0-rc7             3.7.0-rc6             3.7.0-rc7             3.7.0-rc7             3.7.0-rc7
                     stats-v8r6     numacore-20121130     numacore-20121202    autonuma-v28fastr4       thpmigrate-v8r6
Mean   1      31311.75 (  0.00%)     27938.00 (-10.77%)     29681.25 ( -5.21%)     31474.25 (  0.52%)     29616.25 ( -5.41%)
Mean   2      62972.75 (  0.00%)     51899.00 (-17.58%)     60403.00 ( -4.08%)     66654.00 (  5.85%)     61311.25 ( -2.64%)
Mean   3      91292.00 (  0.00%)     80908.00 (-11.37%)     86570.25 ( -5.17%)     97177.50 (  6.45%)     89499.00 ( -1.96%)
Mean   4     115768.75 (  0.00%)     99497.25 (-14.06%)    105982.25 ( -8.45%)    125596.00 (  8.49%)    115788.75 (  0.02%)
Mean   5     137248.50 (  0.00%)     92837.75 (-32.36%)    115640.50 (-15.74%)    152795.25 ( 11.33%)    139499.25 (  1.64%)
Mean   6     155528.50 (  0.00%)    105554.50 (-32.13%)    124614.75 (-19.88%)    177455.25 ( 14.10%)    159322.25 (  2.44%)
Mean   7     156747.50 (  0.00%)    122582.25 (-21.80%)    133205.00 (-15.02%)    184578.75 ( 17.76%)    161256.75 (  2.88%)
Mean   8     152069.50 (  0.00%)    122439.00 (-19.48%)    132939.25 (-12.58%)    186619.25 ( 22.72%)    163891.50 (  7.77%)
Mean   9     146609.75 (  0.00%)    112410.00 (-23.33%)    123667.25 (-15.65%)    186165.00 ( 26.98%)    160291.00 (  9.33%)
Mean   10    142819.00 (  0.00%)    111456.00 (-21.96%)    117609.00 (-17.65%)    182569.75 ( 27.83%)    155353.50 (  8.78%)
Mean   11    128292.25 (  0.00%)     98027.00 (-23.59%)    112410.25 (-12.38%)    176104.75 ( 37.27%)    146795.00 ( 14.42%)
Mean   12    128769.75 (  0.00%)    129469.50 (  0.54%)    106629.50 (-17.19%)    169003.00 ( 31.24%)    140294.75 (  8.95%)
Mean   13    126488.50 (  0.00%)    110133.75 (-12.93%)    106878.25 (-15.50%)    162725.75 ( 28.65%)    138068.75 (  9.16%)
Mean   14    123400.00 (  0.00%)    117929.75 ( -4.43%)    105558.25 (-14.46%)    163781.25 ( 32.72%)    138708.25 ( 12.41%)
Mean   15    122139.50 (  0.00%)    122404.25 (  0.22%)    102829.25 (-15.81%)    160800.25 ( 31.65%)    136569.25 ( 11.81%)
Mean   16    116413.50 (  0.00%)    124573.50 (  7.01%)    100475.75 (-13.69%)    160882.75 ( 38.20%)    134798.75 ( 15.79%)
Mean   17    117263.25 (  0.00%)    121937.25 (  3.99%)     97237.75 (-17.08%)    159069.75 ( 35.65%)    124516.25 (  6.19%)
Mean   18    117277.00 (  0.00%)    116633.75 ( -0.55%)     96547.00 (-17.68%)    158694.75 ( 35.32%)    127399.50 (  8.63%)
Mean   19    113231.00 (  0.00%)    111035.75 ( -1.94%)     97683.00 (-13.73%)    155563.25 ( 37.39%)    124921.00 ( 10.32%)
Mean   20    113628.75 (  0.00%)    113451.25 ( -0.16%)     96311.75 (-15.24%)    154779.75 ( 36.22%)    123689.25 (  8.85%)
Mean   21    110982.50 (  0.00%)    107660.50 ( -2.99%)     93732.50 (-15.54%)    151147.25 ( 36.19%)    124793.25 ( 12.44%)
Mean   22    107660.25 (  0.00%)    104771.50 ( -2.68%)     91888.75 (-14.65%)    151180.50 ( 40.42%)    122143.50 ( 13.45%)
Mean   23    105320.50 (  0.00%)     88275.25 (-16.18%)     91594.75 (-13.03%)    147032.00 ( 39.60%)    118244.25 ( 12.27%)
Mean   24    110900.50 (  0.00%)     85169.00 (-23.20%)     87782.75 (-20.85%)    147407.00 ( 32.92%)    119124.75 (  7.42%)
Stddev 1        720.83 (  0.00%)       982.31 (-36.28%)      1738.11 (-141.13%)       942.80 (-30.79%)       281.42 ( 60.96%)
Stddev 2        466.00 (  0.00%)      1770.75 (-279.99%)       437.94 (  6.02%)      1327.32 (-184.83%)       935.13 (-100.67%)
Stddev 3        509.61 (  0.00%)      4849.62 (-851.63%)      1892.19 (-271.30%)      1803.72 (-253.94%)      1464.93 (-187.46%)
Stddev 4       1750.10 (  0.00%)     10708.16 (-511.86%)      5762.55 (-229.27%)      2010.11 (-14.86%)       989.59 ( 43.45%)
Stddev 5        700.05 (  0.00%)     16497.79 (-2256.66%)      4658.04 (-565.39%)      2354.70 (-236.36%)      2983.62 (-326.20%)
Stddev 6       2259.33 (  0.00%)     24221.98 (-972.09%)      6618.94 (-192.96%)      1516.32 ( 32.89%)      1268.19 ( 43.87%)
Stddev 7       3390.99 (  0.00%)      4721.80 (-39.25%)      7337.14 (-116.37%)      2398.34 ( 29.27%)      2522.91 ( 25.60%)
Stddev 8       7533.18 (  0.00%)      8609.90 (-14.29%)      9431.33 (-25.20%)      2895.55 ( 61.56%)      4162.07 ( 44.75%)
Stddev 9       9223.98 (  0.00%)     10731.70 (-16.35%)     10681.30 (-15.80%)      4726.23 ( 48.76%)      3866.54 ( 58.08%)
Stddev 10      4578.09 (  0.00%)     11136.27 (-143.25%)     12513.13 (-173.33%)      6705.48 (-46.47%)      4869.86 ( -6.37%)
Stddev 11      8201.30 (  0.00%)      3580.27 ( 56.35%)     18390.50 (-124.24%)     10915.90 (-33.10%)      6316.98 ( 22.98%)
Stddev 12      5713.70 (  0.00%)     13923.12 (-143.68%)     15228.05 (-166.52%)     16555.64 (-189.75%)      3540.96 ( 38.03%)
Stddev 13      5878.95 (  0.00%)     10471.09 (-78.11%)     14014.88 (-138.39%)     18628.01 (-216.86%)      3024.19 ( 48.56%)
Stddev 14      4783.95 (  0.00%)      4051.35 ( 15.31%)     13764.72 (-187.73%)     18324.63 (-283.04%)      1203.75 ( 74.84%)
Stddev 15      6281.48 (  0.00%)      3357.07 ( 46.56%)     11925.69 (-89.85%)     17654.58 (-181.06%)      1014.51 ( 83.85%)
Stddev 16      6948.12 (  0.00%)      3763.32 ( 45.84%)     13658.66 (-96.58%)     18280.52 (-163.10%)      2171.19 ( 68.75%)
Stddev 17      5603.77 (  0.00%)      1452.04 ( 74.09%)     12618.33 (-125.18%)     18230.53 (-225.33%)      1228.38 ( 78.08%)
Stddev 18      6200.90 (  0.00%)      1870.12 ( 69.84%)     11261.01 (-81.60%)     18486.73 (-198.13%)      1143.03 ( 81.57%)
Stddev 19      6726.31 (  0.00%)      1045.21 ( 84.46%)     10748.09 (-59.79%)     18465.25 (-174.52%)      1281.55 ( 80.95%)
Stddev 20      5713.58 (  0.00%)      2066.90 ( 63.82%)     12195.08 (-113.44%)     19947.77 (-249.13%)      1076.69 ( 81.16%)
Stddev 21      4566.92 (  0.00%)      2460.40 ( 46.13%)     14089.14 (-208.50%)     21189.08 (-363.97%)      2861.40 ( 37.35%)
Stddev 22      6168.05 (  0.00%)      2770.81 ( 55.08%)     10037.19 (-62.73%)     20033.82 (-224.80%)      3628.02 ( 41.18%)
Stddev 23      6295.45 (  0.00%)      1337.32 ( 78.76%)     13290.13 (-111.11%)     22610.91 (-259.16%)      2651.23 ( 57.89%)
Stddev 24      3108.17 (  0.00%)      1381.20 ( 55.56%)     12637.15 (-306.58%)     21243.56 (-583.47%)      1156.63 ( 62.79%)
TPut   1     125247.00 (  0.00%)    111752.00 (-10.77%)    118725.00 ( -5.21%)    125897.00 (  0.52%)    118465.00 ( -5.41%)
TPut   2     251891.00 (  0.00%)    207596.00 (-17.58%)    241612.00 ( -4.08%)    266616.00 (  5.85%)    245245.00 ( -2.64%)
TPut   3     365168.00 (  0.00%)    323632.00 (-11.37%)    346281.00 ( -5.17%)    388710.00 (  6.45%)    357996.00 ( -1.96%)
TPut   4     463075.00 (  0.00%)    397989.00 (-14.06%)    423929.00 ( -8.45%)    502384.00 (  8.49%)    463155.00 (  0.02%)
TPut   5     548994.00 (  0.00%)    371351.00 (-32.36%)    462562.00 (-15.74%)    611181.00 ( 11.33%)    557997.00 (  1.64%)
TPut   6     622114.00 (  0.00%)    422218.00 (-32.13%)    498459.00 (-19.88%)    709821.00 ( 14.10%)    637289.00 (  2.44%)
TPut   7     626990.00 (  0.00%)    490329.00 (-21.80%)    532820.00 (-15.02%)    738315.00 ( 17.76%)    645027.00 (  2.88%)
TPut   8     608278.00 (  0.00%)    489756.00 (-19.48%)    531757.00 (-12.58%)    746477.00 ( 22.72%)    655566.00 (  7.77%)
TPut   9     586439.00 (  0.00%)    449640.00 (-23.33%)    494669.00 (-15.65%)    744660.00 ( 26.98%)    641164.00 (  9.33%)
TPut   10    571276.00 (  0.00%)    445824.00 (-21.96%)    470436.00 (-17.65%)    730279.00 ( 27.83%)    621414.00 (  8.78%)
TPut   11    513169.00 (  0.00%)    392108.00 (-23.59%)    449641.00 (-12.38%)    704419.00 ( 37.27%)    587180.00 ( 14.42%)
TPut   12    515079.00 (  0.00%)    517878.00 (  0.54%)    426518.00 (-17.19%)    676012.00 ( 31.24%)    561179.00 (  8.95%)
TPut   13    505954.00 (  0.00%)    440535.00 (-12.93%)    427513.00 (-15.50%)    650903.00 ( 28.65%)    552275.00 (  9.16%)
TPut   14    493600.00 (  0.00%)    471719.00 ( -4.43%)    422233.00 (-14.46%)    655125.00 ( 32.72%)    554833.00 ( 12.41%)
TPut   15    488558.00 (  0.00%)    489617.00 (  0.22%)    411317.00 (-15.81%)    643201.00 ( 31.65%)    546277.00 ( 11.81%)
TPut   16    465654.00 (  0.00%)    498294.00 (  7.01%)    401903.00 (-13.69%)    643531.00 ( 38.20%)    539195.00 ( 15.79%)
TPut   17    469053.00 (  0.00%)    487749.00 (  3.99%)    388951.00 (-17.08%)    636279.00 ( 35.65%)    498065.00 (  6.19%)
TPut   18    469108.00 (  0.00%)    466535.00 ( -0.55%)    386188.00 (-17.68%)    634779.00 ( 35.32%)    509598.00 (  8.63%)
TPut   19    452924.00 (  0.00%)    444143.00 ( -1.94%)    390732.00 (-13.73%)    622253.00 ( 37.39%)    499684.00 ( 10.32%)
TPut   20    454515.00 (  0.00%)    453805.00 ( -0.16%)    385247.00 (-15.24%)    619119.00 ( 36.22%)    494757.00 (  8.85%)
TPut   21    443930.00 (  0.00%)    430642.00 ( -2.99%)    374930.00 (-15.54%)    604589.00 ( 36.19%)    499173.00 ( 12.44%)
TPut   22    430641.00 (  0.00%)    419086.00 ( -2.68%)    367555.00 (-14.65%)    604722.00 ( 40.42%)    488574.00 ( 13.45%)
TPut   23    421282.00 (  0.00%)    353101.00 (-16.18%)    366379.00 (-13.03%)    588128.00 ( 39.60%)    472977.00 ( 12.27%)
TPut   24    443602.00 (  0.00%)    340676.00 (-23.20%)    351131.00 (-20.85%)    589628.00 ( 32.92%)    476499.00 (  7.42%)

I'm not seeing the same benefit at all. Ingo reported results for one
warehouse count, which was 8 warehouses per JVM and 32 warehouses
overall. This is the maximally loaded configuration for his machine. The
equivalent on my machine would be 12 warehouses per JVM (TPut 12 above),
or 48 warehouses overall, but I'm not seeing large gains there; I'm
seeing a 17.19% regression for the latest numacore.

autonuma does fairly well.

balancenuma does all right.

SPECJBB PEAKS
                                   3.7.0-rc7                  3.7.0-rc6                  3.7.0-rc7                  3.7.0-rc7                  3.7.0-rc7
                                  stats-v8r6          numacore-20121130          numacore-20121202         autonuma-v28fastr4            thpmigrate-v8r6
 Expctd Warehouse            12.00 (  0.00%)            12.00 (  0.00%)            12.00 (  0.00%)            12.00 (  0.00%)            12.00 (  0.00%)
 Expctd Peak Bops        515079.00 (  0.00%)        517878.00 (  0.54%)        426518.00 (-17.19%)        676012.00 ( 31.24%)        561179.00 (  8.95%)
 Actual Warehouse             7.00 (  0.00%)            12.00 ( 71.43%)             7.00 (  0.00%)             8.00 ( 14.29%)             8.00 ( 14.29%)
 Actual Peak Bops        626990.00 (  0.00%)        517878.00 (-17.40%)        532820.00 (-15.02%)        746477.00 ( 19.06%)        655566.00 (  4.56%)
 SpecJBB Bops            465685.00 (  0.00%)        447214.00 ( -3.97%)        392353.00 (-15.75%)        628328.00 ( 34.93%)        514853.00 ( 10.56%)
 SpecJBB Bops/JVM        116421.00 (  0.00%)        111804.00 ( -3.97%)         98088.00 (-15.75%)        157082.00 ( 34.93%)        128713.00 ( 10.56%)

The latest numacore shows a 15.02% regression at the peak and a 15.75%
regression on the overall specjbb score.

MMTests Statistics: duration
           3.7.0-rc7   3.7.0-rc6   3.7.0-rc7   3.7.0-rc7   3.7.0-rc7
          stats-v8r6numacore-20121130numacore-20121202autonuma-v28fastr4thpmigrate-v8r6
User       177835.94   171938.81   177810.87   177457.20   177439.06
System        166.79     5814.00      168.00      207.74      515.54
Elapsed      4037.12     4038.74     4030.32     4037.22     4027.60

Once again, the system CPU usage is far better. Is that the worklet
accounting at work, or is the improvement just that good?

MMTests Statistics: vmstat
                             3.7.0-rc7   3.7.0-rc6   3.7.0-rc7   3.7.0-rc7   3.7.0-rc7
                            stats-v8r6numacore-20121130numacore-20121202autonuma-v28fastr4thpmigrate-v8r6
Page Ins                         37116       36404       36064       36740       44440
Page Outs                        30340       33624       30256       29428       29540
Swap Ins                             0           0           0           0           0
Swap Outs                            0           0           0           0           0
Direct pages scanned                 0           0           0           0           0
Kswapd pages scanned                 0           0           0           0           0
Kswapd pages reclaimed               0           0           0           0           0
Direct pages reclaimed               0           0           0           0           0
Kswapd efficiency                 100%        100%        100%        100%        100%
Kswapd velocity                  0.000       0.000       0.000       0.000       0.000
Direct efficiency                 100%        100%        100%        100%        100%
Direct velocity                  0.000       0.000       0.000       0.000       0.000
Percentage direct scans             0%          0%          0%          0%          0%
Page writes by reclaim               0           0           0           0           0
Page writes file                     0           0           0           0           0
Page writes anon                     0           0           0           0           0
Page reclaim immediate               0           0           0           0           0
Page rescued immediate               0           0           0           0           0
Slabs scanned                        0           0           0           0           0
Direct inode steals                  0           0           0           0           0
Kswapd inode steals                  0           0           0           0           0
Kswapd skipped wait                  0           0           0           0           0
THP fault alloc                  63322       49889       65564       52514       67970
THP collapse alloc                 130          53         130         463         122
THP splits                         355         192         344         376         372
THP fault fallback                   0           0           0           0           0
THP collapse fail                    0           0           0           0           0
Compaction stalls                    0           0           0           0           0
Compaction success                   0           0           0           0           0
Compaction failures                  0           0           0           0           0
Page migrate success                 0           0           0           0    52980376
Page migrate failure                 0           0           0           0           0
Compaction pages isolated            0           0           0           0           0
Compaction migrate scanned           0           0           0           0           0
Compaction free scanned              0           0           0           0           0
Compaction cost                      0           0           0           0       54993
NUMA PTE updates                     0           0           0           0   418971067
NUMA hint faults                     0           0           0           0     3179614
NUMA hint local faults               0           0           0           0     1030391
NUMA pages migrated                  0           0           0           0    52980376
AutoNUMA cost                        0           0           0           0       19837

THP was in use; all kernels split THP at roughly the same rates.

SPECJBB: Multiple JVMs (one per node, 4 nodes), THP is disabled
                      3.7.0-rc7             3.7.0-rc6             3.7.0-rc7             3.7.0-rc7             3.7.0-rc7
                     stats-v8r6     numacore-20121130     numacore-20121202    autonuma-v28fastr4       thpmigrate-v8r6
Mean   1      26036.50 (  0.00%)     19595.00 (-24.74%)     24601.50 ( -5.51%)     24738.25 ( -4.99%)     25782.75 ( -0.97%)
Mean   2      53629.75 (  0.00%)     38481.50 (-28.25%)     52351.25 ( -2.38%)     55646.75 (  3.76%)     53155.50 ( -0.88%)
Mean   3      77385.00 (  0.00%)     53685.50 (-30.63%)     75993.00 ( -1.80%)     82714.75 (  6.89%)     76386.00 ( -1.29%)
Mean   4     100097.75 (  0.00%)     68253.50 (-31.81%)     92149.50 ( -7.94%)    107883.25 (  7.78%)     98124.00 ( -1.97%)
Mean   5     119012.75 (  0.00%)     74164.50 (-37.68%)    112056.00 ( -5.85%)    130260.25 (  9.45%)    119199.75 (  0.16%)
Mean   6     137419.25 (  0.00%)     86158.50 (-37.30%)    133604.50 ( -2.78%)    154244.50 ( 12.24%)    136583.25 ( -0.61%)
Mean   7     138018.25 (  0.00%)     96059.25 (-30.40%)    136477.50 ( -1.12%)    159501.00 ( 15.57%)    139283.50 (  0.92%)
Mean   8     136774.00 (  0.00%)     97003.50 (-29.08%)    137033.75 (  0.19%)    162868.00 ( 19.08%)    139270.75 (  1.83%)
Mean   9     127966.50 (  0.00%)     95261.00 (-25.56%)    135496.00 (  5.88%)    163008.00 ( 27.38%)    136763.25 (  6.87%)
Mean   10    124628.75 (  0.00%)     96202.25 (-22.81%)    128704.25 (  3.27%)    159696.50 ( 28.14%)    132052.75 (  5.96%)
Mean   11    117269.00 (  0.00%)     95924.25 (-18.20%)    119718.50 (  2.09%)    154701.50 ( 31.92%)    127392.75 (  8.63%)
Mean   12    111962.25 (  0.00%)     94247.25 (-15.82%)    115400.75 (  3.07%)    150936.50 ( 34.81%)    120972.25 (  8.05%)
Mean   13    111595.50 (  0.00%)    106538.50 ( -4.53%)    110988.50 ( -0.54%)    147193.25 ( 31.90%)    120090.25 (  7.61%)
Mean   14    110881.00 (  0.00%)    103549.00 ( -6.61%)    111549.00 (  0.60%)    144584.00 ( 30.40%)    112823.50 (  1.75%)
Mean   15    109337.50 (  0.00%)    101729.00 ( -6.96%)    108927.25 ( -0.38%)    143333.00 ( 31.09%)    116505.25 (  6.56%)
Mean   16    107031.75 (  0.00%)    101983.75 ( -4.72%)    106160.75 ( -0.81%)    141907.75 ( 32.58%)    112731.25 (  5.33%)
Mean   17    105491.25 (  0.00%)    100205.75 ( -5.01%)    104268.75 ( -1.16%)    140691.00 ( 33.37%)    113183.00 (  7.29%)
Mean   18    101102.75 (  0.00%)     96635.50 ( -4.42%)    104045.75 (  2.91%)    137784.25 ( 36.28%)    113993.50 ( 12.75%)
Mean   19    103907.25 (  0.00%)     94578.25 ( -8.98%)    102897.50 ( -0.97%)    135719.25 ( 30.62%)    113725.00 (  9.45%)
Mean   20    100496.00 (  0.00%)     92683.75 ( -7.77%)     98143.50 ( -2.34%)    135264.25 ( 34.60%)    110990.25 ( 10.44%)
Mean   21     99570.00 (  0.00%)     92955.75 ( -6.64%)     97375.00 ( -2.20%)    133891.00 ( 34.47%)    108527.75 (  9.00%)
Mean   22     98611.75 (  0.00%)     89781.75 ( -8.95%)     98287.00 ( -0.33%)    132399.75 ( 34.26%)    107813.75 (  9.33%)
Mean   23     98173.00 (  0.00%)     88846.00 ( -9.50%)     98131.00 ( -0.04%)    130726.00 ( 33.16%)    106859.25 (  8.85%)
Mean   24     92074.75 (  0.00%)     88581.00 ( -3.79%)     96459.75 (  4.76%)    127552.25 ( 38.53%)    109180.75 ( 18.58%)
Stddev 1        735.13 (  0.00%)       538.24 ( 26.78%)       973.28 (-32.40%)       121.08 ( 83.53%)       850.05 (-15.63%)
Stddev 2        406.26 (  0.00%)      3458.87 (-751.39%)      1082.66 (-166.49%)       477.32 (-17.49%)      1070.30 (-163.45%)
Stddev 3        644.20 (  0.00%)      1360.89 (-111.25%)      1334.10 (-107.09%)       922.47 (-43.20%)      1668.38 (-158.98%)
Stddev 4        743.93 (  0.00%)      2149.34 (-188.92%)      2267.12 (-204.75%)      1385.42 (-86.23%)      1244.82 (-67.33%)
Stddev 5        898.53 (  0.00%)      2521.01 (-180.57%)      1948.30 (-116.83%)       763.24 ( 15.06%)      1021.24 (-13.66%)
Stddev 6       1126.61 (  0.00%)      3818.22 (-238.91%)       917.32 ( 18.58%)      1527.03 (-35.54%)      1405.79 (-24.78%)
Stddev 7       2907.61 (  0.00%)      4419.29 (-51.99%)      2486.28 ( 14.49%)      1536.66 ( 47.15%)      4825.30 (-65.95%)
Stddev 8       3200.64 (  0.00%)       382.01 ( 88.06%)      5978.31 (-86.78%)      1228.09 ( 61.63%)      5415.14 (-69.19%)
Stddev 9       2907.92 (  0.00%)      1813.39 ( 37.64%)      4583.53 (-57.62%)      1502.61 ( 48.33%)      4569.07 (-57.12%)
Stddev 10      5093.23 (  0.00%)      1313.58 ( 74.21%)      8194.93 (-60.90%)      2763.19 ( 45.75%)      2876.58 ( 43.52%)
Stddev 11      4982.41 (  0.00%)      1163.02 ( 76.66%)      1899.45 ( 61.88%)      4776.28 (  4.14%)      2729.19 ( 45.22%)
Stddev 12      3051.38 (  0.00%)      2117.59 ( 30.60%)      2404.89 ( 21.19%)      9252.59 (-203.23%)      1797.38 ( 41.10%)
Stddev 13      2918.03 (  0.00%)      2252.11 ( 22.82%)      3889.75 (-33.30%)      9384.83 (-221.62%)      2832.98 (  2.91%)
Stddev 14      3178.97 (  0.00%)      2337.49 ( 26.47%)      3612.00 (-13.62%)      9353.03 (-194.22%)      3499.23 (-10.07%)
Stddev 15      2438.31 (  0.00%)      1707.72 ( 29.96%)      2925.87 (-20.00%)     10494.03 (-330.38%)      2752.56 (-12.89%)
Stddev 16      2682.25 (  0.00%)       840.47 ( 68.67%)      3118.36 (-16.26%)     10343.25 (-285.62%)      2685.02 ( -0.10%)
Stddev 17      2807.66 (  0.00%)      1546.16 ( 44.93%)      3750.42 (-33.58%)     11446.15 (-307.68%)      2421.25 ( 13.76%)
Stddev 18      3049.27 (  0.00%)       934.11 ( 69.37%)      3382.16 (-10.92%)     11779.80 (-286.31%)      2488.60 ( 18.39%)
Stddev 19      2782.65 (  0.00%)       735.28 ( 73.58%)      2853.22 ( -2.54%)     11416.35 (-310.27%)       298.00 ( 89.29%)
Stddev 20      2379.12 (  0.00%)       956.25 ( 59.81%)      2876.85 (-20.92%)     10511.63 (-341.83%)      2077.10 ( 12.69%)
Stddev 21      2975.22 (  0.00%)       438.31 ( 85.27%)      2627.61 ( 11.68%)     11292.91 (-279.57%)      1369.17 ( 53.98%)
Stddev 22      2260.61 (  0.00%)       718.23 ( 68.23%)      2706.69 (-19.73%)     11993.84 (-430.56%)      1584.55 ( 29.91%)
Stddev 23      2900.85 (  0.00%)       275.47 ( 90.50%)      2348.16 ( 19.05%)     12234.80 (-321.77%)       878.53 ( 69.71%)
Stddev 24      2578.98 (  0.00%)       481.68 ( 81.32%)      3346.30 (-29.75%)     12769.61 (-395.14%)      1037.50 ( 59.77%)
TPut   1     104146.00 (  0.00%)     78380.00 (-24.74%)     98406.00 ( -5.51%)     98953.00 ( -4.99%)    103131.00 ( -0.97%)
TPut   2     214519.00 (  0.00%)    153926.00 (-28.25%)    209405.00 ( -2.38%)    222587.00 (  3.76%)    212622.00 ( -0.88%)
TPut   3     309540.00 (  0.00%)    214742.00 (-30.63%)    303972.00 ( -1.80%)    330859.00 (  6.89%)    305544.00 ( -1.29%)
TPut   4     400391.00 (  0.00%)    273014.00 (-31.81%)    368598.00 ( -7.94%)    431533.00 (  7.78%)    392496.00 ( -1.97%)
TPut   5     476051.00 (  0.00%)    296658.00 (-37.68%)    448224.00 ( -5.85%)    521041.00 (  9.45%)    476799.00 (  0.16%)
TPut   6     549677.00 (  0.00%)    344634.00 (-37.30%)    534418.00 ( -2.78%)    616978.00 ( 12.24%)    546333.00 ( -0.61%)
TPut   7     552073.00 (  0.00%)    384237.00 (-30.40%)    545910.00 ( -1.12%)    638004.00 ( 15.57%)    557134.00 (  0.92%)
TPut   8     547096.00 (  0.00%)    388014.00 (-29.08%)    548135.00 (  0.19%)    651472.00 ( 19.08%)    557083.00 (  1.83%)
TPut   9     511866.00 (  0.00%)    381044.00 (-25.56%)    541984.00 (  5.88%)    652032.00 ( 27.38%)    547053.00 (  6.87%)
TPut   10    498515.00 (  0.00%)    384809.00 (-22.81%)    514817.00 (  3.27%)    638786.00 ( 28.14%)    528211.00 (  5.96%)
TPut   11    469076.00 (  0.00%)    383697.00 (-18.20%)    478874.00 (  2.09%)    618806.00 ( 31.92%)    509571.00 (  8.63%)
TPut   12    447849.00 (  0.00%)    376989.00 (-15.82%)    461603.00 (  3.07%)    603746.00 ( 34.81%)    483889.00 (  8.05%)
TPut   13    446382.00 (  0.00%)    426154.00 ( -4.53%)    443954.00 ( -0.54%)    588773.00 ( 31.90%)    480361.00 (  7.61%)
TPut   14    443524.00 (  0.00%)    414196.00 ( -6.61%)    446196.00 (  0.60%)    578336.00 ( 30.40%)    451294.00 (  1.75%)
TPut   15    437350.00 (  0.00%)    406916.00 ( -6.96%)    435709.00 ( -0.38%)    573332.00 ( 31.09%)    466021.00 (  6.56%)
TPut   16    428127.00 (  0.00%)    407935.00 ( -4.72%)    424643.00 ( -0.81%)    567631.00 ( 32.58%)    450925.00 (  5.33%)
TPut   17    421965.00 (  0.00%)    400823.00 ( -5.01%)    417075.00 ( -1.16%)    562764.00 ( 33.37%)    452732.00 (  7.29%)
TPut   18    404411.00 (  0.00%)    386542.00 ( -4.42%)    416183.00 (  2.91%)    551137.00 ( 36.28%)    455974.00 ( 12.75%)
TPut   19    415629.00 (  0.00%)    378313.00 ( -8.98%)    411590.00 ( -0.97%)    542877.00 ( 30.62%)    454900.00 (  9.45%)
TPut   20    401984.00 (  0.00%)    370735.00 ( -7.77%)    392574.00 ( -2.34%)    541057.00 ( 34.60%)    443961.00 ( 10.44%)
TPut   21    398280.00 (  0.00%)    371823.00 ( -6.64%)    389500.00 ( -2.20%)    535564.00 ( 34.47%)    434111.00 (  9.00%)
TPut   22    394447.00 (  0.00%)    359127.00 ( -8.95%)    393148.00 ( -0.33%)    529599.00 ( 34.26%)    431255.00 (  9.33%)
TPut   23    392692.00 (  0.00%)    355384.00 ( -9.50%)    392524.00 ( -0.04%)    522904.00 ( 33.16%)    427437.00 (  8.85%)
TPut   24    368299.00 (  0.00%)    354324.00 ( -3.79%)    385839.00 (  4.76%)    510209.00 ( 38.53%)    436723.00 ( 18.58%)

The !THP case has indeed improved for numacore -- it's now in line with
the mainline kernel but still worse than autonuma or balancenuma :(

SPECJBB PEAKS
                                   3.7.0-rc7                  3.7.0-rc6                  3.7.0-rc7                  3.7.0-rc7                  3.7.0-rc7
                                  stats-v8r6          numacore-20121130          numacore-20121202         autonuma-v28fastr4            thpmigrate-v8r6
 Expctd Warehouse            12.00 (  0.00%)            12.00 (  0.00%)            12.00 (  0.00%)            12.00 (  0.00%)            12.00 (  0.00%)
 Expctd Peak Bops        447849.00 (  0.00%)        376989.00 (-15.82%)        461603.00 (  3.07%)        603746.00 ( 34.81%)        483889.00 (  8.05%)
 Actual Warehouse             7.00 (  0.00%)            13.00 ( 85.71%)             8.00 ( 14.29%)             9.00 ( 28.57%)             7.00 (  0.00%)
 Actual Peak Bops        552073.00 (  0.00%)        426154.00 (-22.81%)        548135.00 ( -0.71%)        652032.00 ( 18.11%)        557134.00 (  0.92%)
 SpecJBB Bops            415458.00 (  0.00%)        385328.00 ( -7.25%)        416195.00 (  0.18%)        554456.00 ( 33.46%)        451505.00 (  8.68%)
 SpecJBB Bops/JVM        103865.00 (  0.00%)         96332.00 ( -7.25%)        104049.00 (  0.18%)        138614.00 ( 33.46%)        112876.00 (  8.68%)

Same as above really - peaks and specjbb scores are in line with mainline
but no improvement.

MMTests Statistics: duration
           3.7.0-rc7   3.7.0-rc6   3.7.0-rc7   3.7.0-rc7   3.7.0-rc7
          stats-v8r6numacore-20121130numacore-20121202autonuma-v28fastr4thpmigrate-v8r6
User       177832.71   148340.09   177834.85   177337.90   176423.81
System         89.07    28052.02       86.81      287.31     1468.74
Elapsed      4035.81     4041.26     4041.24     4028.05     4041.25

But the same massive improvement on system CPU usage.


MMTests Statistics: vmstat
                             3.7.0-rc7   3.7.0-rc6   3.7.0-rc7   3.7.0-rc7   3.7.0-rc7
                            stats-v8r6numacore-20121130numacore-20121202autonuma-v28fastr4thpmigrate-v8r6
Page Ins                         37380       66040       35792       36416       33076
Page Outs                        29224       46900       30544       29584       24048
Swap Ins                             0           0           0           0           0
Swap Outs                            0           0           0           0           0
Direct pages scanned                 0           0           0           0           0
Kswapd pages scanned                 0           0           0           0           0
Kswapd pages reclaimed               0           0           0           0           0
Direct pages reclaimed               0           0           0           0           0
Kswapd efficiency                 100%        100%        100%        100%        100%
Kswapd velocity                  0.000       0.000       0.000       0.000       0.000
Direct efficiency                 100%        100%        100%        100%        100%
Direct velocity                  0.000       0.000       0.000       0.000       0.000
Percentage direct scans             0%          0%          0%          0%          0%
Page writes by reclaim               0           0           0           0           0
Page writes file                     0           0           0           0           0
Page writes anon                     0           0           0           0           0
Page reclaim immediate               0           0           0           0           0
Page rescued immediate               0           0           0           0           0
Slabs scanned                        0           0           0           0           0
Direct inode steals                  0           0           0           0           0
Kswapd inode steals                  0           0           0           0           0
Kswapd skipped wait                  0           0           0           0           0
THP fault alloc                      2           3           2           2           3
THP collapse alloc                   0           0           0           0           0
THP splits                           0           0           0           0           0
THP fault fallback                   0           0           0           0           0
THP collapse fail                    0           0           0           0           0
Compaction stalls                    0           0           0           0           0
Compaction success                   0           0           0           0           0
Compaction failures                  0           0           0           0           0
Page migrate success                 0           0           0           0    37081585
Page migrate failure                 0           0           0           0           0
Compaction pages isolated            0           0           0           0           0
Compaction migrate scanned           0           0           0           0           0
Compaction free scanned              0           0           0           0           0
Compaction cost                      0           0           0           0       38490
NUMA PTE updates                     0           0           0           0   286584227
NUMA hint faults                     0           0           0           0   268498687
NUMA hint local faults               0           0           0           0    69756047
NUMA pages migrated                  0           0           0           0    37081585
AutoNUMA cost                        0           0           0           0     1345204

THP was definitely not in use.
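
The NUMA counters for balancenuma can be reduced to a few more digestible
figures. A small worked example using the thpmigrate-v8r6 column above,
assuming 4K pages throughout since THP was not used:

    # Derived figures from the vmstat NUMA counters (thpmigrate-v8r6).
    pte_updates  = 286584227
    hint_faults  = 268498687
    local_faults = 69756047
    migrated     = 37081585
    elapsed      = 4041.25                       # seconds
    print(local_faults / hint_faults)            # ~26% of hinting faults were local
    print(migrated / hint_faults)                # ~14% of hinting faults migrated a page
    print(migrated * 4096 / elapsed / 2**20)     # ~36 MiB/s of page copying on average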

SPECJBB: Single JVM, THP is enabled
                    3.7.0-rc7             3.7.0-rc6             3.7.0-rc7             3.7.0-rc7             3.7.0-rc7
                   stats-v8r6     numacore-20121130     numacore-20121202    autonuma-v28fastr4       thpmigrate-v8r6
TPut 1      25550.00 (  0.00%)     25491.00 ( -0.23%)     26438.00 (  3.48%)     24233.00 ( -5.15%)     25223.00 ( -1.28%)
TPut 2      55943.00 (  0.00%)     51630.00 ( -7.71%)     57004.00 (  1.90%)     55312.00 ( -1.13%)     55168.00 ( -1.39%)
TPut 3      87707.00 (  0.00%)     74497.00 (-15.06%)     88852.00 (  1.31%)     88569.00 (  0.98%)     85852.00 ( -2.11%)
TPut 4     117911.00 (  0.00%)     98435.00 (-16.52%)    104955.00 (-10.99%)    118561.00 (  0.55%)    115467.00 ( -2.07%)
TPut 5     143285.00 (  0.00%)    133964.00 ( -6.51%)    126238.00 (-11.90%)    145703.00 (  1.69%)    144859.00 (  1.10%)
TPut 6     171208.00 (  0.00%)    152795.00 (-10.75%)    160028.00 ( -6.53%)    171006.00 ( -0.12%)    170328.00 ( -0.51%)
TPut 7     195635.00 (  0.00%)    162517.00 (-16.93%)    172973.00 (-11.58%)    198699.00 (  1.57%)    194656.00 ( -0.50%)
TPut 8     222655.00 (  0.00%)    168679.00 (-24.24%)    179260.00 (-19.49%)    224903.00 (  1.01%)    227247.00 (  2.06%)
TPut 9     244787.00 (  0.00%)    193394.00 (-20.99%)    238823.00 ( -2.44%)    248313.00 (  1.44%)    248613.00 (  1.56%)
TPut 10    271565.00 (  0.00%)    237987.00 (-12.36%)    247724.00 ( -8.78%)    272148.00 (  0.21%)    274387.00 (  1.04%)
TPut 11    298270.00 (  0.00%)    207908.00 (-30.30%)    277513.00 ( -6.96%)    303749.00 (  1.84%)    301764.00 (  1.17%)
TPut 12    320867.00 (  0.00%)    257937.00 (-19.61%)    281723.00 (-12.20%)    327808.00 (  2.16%)    329886.00 (  2.81%)
TPut 13    343514.00 (  0.00%)    248474.00 (-27.67%)    301710.00 (-12.17%)    349080.00 (  1.62%)    350488.00 (  2.03%)
TPut 14    365321.00 (  0.00%)    298876.00 (-18.19%)    314066.00 (-14.03%)    370026.00 (  1.29%)    362375.00 ( -0.81%)
TPut 15    377071.00 (  0.00%)    296562.00 (-21.35%)    334810.00 (-11.21%)    329847.00 (-12.52%)    396381.00 (  5.12%)
TPut 16    404979.00 (  0.00%)    287964.00 (-28.89%)    347142.00 (-14.28%)    411066.00 (  1.50%)    400409.00 ( -1.13%)
TPut 17    420593.00 (  0.00%)    342590.00 (-18.55%)    352738.00 (-16.13%)    428242.00 (  1.82%)    431148.00 (  2.51%)
TPut 18    440178.00 (  0.00%)    377508.00 (-14.24%)    344421.00 (-21.75%)    440392.00 (  0.05%)    453333.00 (  2.99%)
TPut 19    448876.00 (  0.00%)    397727.00 (-11.39%)    367002.00 (-18.24%)    462036.00 (  2.93%)    464066.00 (  3.38%)
TPut 20    460513.00 (  0.00%)    411831.00 (-10.57%)    370870.00 (-19.47%)    476437.00 (  3.46%)    468465.00 (  1.73%)
TPut 21    474161.00 (  0.00%)    442153.00 ( -6.75%)    374835.00 (-20.95%)    487513.00 (  2.82%)    469695.00 ( -0.94%)
TPut 22    474493.00 (  0.00%)    429921.00 ( -9.39%)    371022.00 (-21.81%)    487920.00 (  2.83%)    493468.00 (  4.00%)
TPut 23    489559.00 (  0.00%)    460354.00 ( -5.97%)    377444.00 (-22.90%)    508298.00 (  3.83%)    496034.00 (  1.32%)
TPut 24    495378.00 (  0.00%)    486826.00 ( -1.73%)    376551.00 (-23.99%)    514403.00 (  3.84%)    499233.00 (  0.78%)
TPut 25    491795.00 (  0.00%)    520474.00 (  5.83%)    370872.00 (-24.59%)    507373.00 (  3.17%)    487478.00 ( -0.88%)
TPut 26    490038.00 (  0.00%)    465587.00 ( -4.99%)    370093.00 (-24.48%)    376322.00 (-23.21%)    475886.00 ( -2.89%)
TPut 27    491233.00 (  0.00%)    469764.00 ( -4.37%)    371915.00 (-24.29%)    366225.00 (-25.45%)    472389.00 ( -3.84%)
TPut 28    489058.00 (  0.00%)    489561.00 (  0.10%)    364465.00 (-25.48%)    414027.00 (-15.34%)    440279.00 ( -9.97%)
TPut 29    471539.00 (  0.00%)    492496.00 (  4.44%)    353470.00 (-25.04%)    400529.00 (-15.06%)    444113.00 ( -5.82%)
TPut 30    480343.00 (  0.00%)    488349.00 (  1.67%)    355023.00 (-26.09%)    405612.00 (-15.56%)    439365.00 ( -8.53%)
TPut 31    478109.00 (  0.00%)    460043.00 ( -3.78%)    352440.00 (-26.28%)    401471.00 (-16.03%)    438124.00 ( -8.36%)
TPut 32    475736.00 (  0.00%)    472007.00 ( -0.78%)    341509.00 (-28.21%)    401075.00 (-15.69%)    424440.00 (-10.78%)
TPut 33    470758.00 (  0.00%)    474348.00 (  0.76%)    337127.00 (-28.39%)    399592.00 (-15.12%)    426780.00 ( -9.34%)
TPut 34    467304.00 (  0.00%)    475878.00 (  1.83%)    332477.00 (-28.85%)    394589.00 (-15.56%)    415856.00 (-11.01%)
TPut 35    466391.00 (  0.00%)    487411.00 (  4.51%)    335639.00 (-28.03%)    382799.00 (-17.92%)    419039.00 (-10.15%)
TPut 36    452722.00 (  0.00%)    478050.00 (  5.59%)    316889.00 (-30.00%)    381120.00 (-15.82%)    412743.00 ( -8.83%)
TPut 37    447878.00 (  0.00%)    478467.00 (  6.83%)    326939.00 (-27.00%)    382803.00 (-14.53%)    408997.00 ( -8.68%)
TPut 38    447907.00 (  0.00%)    455542.00 (  1.70%)    315719.00 (-29.51%)    341693.00 (-23.71%)    397839.00 (-11.18%)
TPut 39    428322.00 (  0.00%)    367921.00 (-14.10%)    310519.00 (-27.50%)    404210.00 ( -5.63%)    397204.00 ( -7.27%)
TPut 40    429157.00 (  0.00%)    394277.00 ( -8.13%)    302742.00 (-29.46%)    378554.00 (-11.79%)    369994.00 (-13.79%)
TPut 41    424339.00 (  0.00%)    415413.00 ( -2.10%)    304680.00 (-28.20%)    399220.00 ( -5.92%)    390125.00 ( -8.06%)
TPut 42    397440.00 (  0.00%)    421027.00 (  5.93%)    298298.00 (-24.95%)    372161.00 ( -6.36%)    383341.00 ( -3.55%)
TPut 43    405391.00 (  0.00%)    433900.00 (  7.03%)    286294.00 (-29.38%)    383936.00 ( -5.29%)    375542.00 ( -7.36%)
TPut 44    400692.00 (  0.00%)    427504.00 (  6.69%)    282819.00 (-29.42%)    374757.00 ( -6.47%)    370109.00 ( -7.63%)
TPut 45    399623.00 (  0.00%)    372622.00 ( -6.76%)    273593.00 (-31.54%)    379797.00 ( -4.96%)    367059.00 ( -8.15%)
TPut 46    391920.00 (  0.00%)    351205.00 (-10.39%)    277380.00 (-29.23%)    368042.00 ( -6.09%)    359599.00 ( -8.25%)
TPut 47    378199.00 (  0.00%)    358150.00 ( -5.30%)    273560.00 (-27.67%)    368744.00 ( -2.50%)    345383.00 ( -8.68%)
TPut 48    379346.00 (  0.00%)    387287.00 (  2.09%)    274168.00 (-27.73%)    373581.00 ( -1.52%)    370346.00 ( -2.37%)
TPut 49    373614.00 (  0.00%)    395793.00 (  5.94%)    270794.00 (-27.52%)    372621.00 ( -0.27%)    369080.00 ( -1.21%)
TPut 50    372494.00 (  0.00%)    366488.00 ( -1.61%)    271465.00 (-27.12%)    388778.00 (  4.37%)    378187.00 (  1.53%)
TPut 51    382195.00 (  0.00%)    381771.00 ( -0.11%)    272796.00 (-28.62%)    387687.00 (  1.44%)    380752.00 ( -0.38%)
TPut 52    369118.00 (  0.00%)    429441.00 ( 16.34%)    272019.00 (-26.31%)    390226.00 (  5.72%)    369854.00 (  0.20%)
TPut 53    366453.00 (  0.00%)    445744.00 ( 21.64%)    267952.00 (-26.88%)    399257.00 (  8.95%)    377122.00 (  2.91%)
TPut 54    366571.00 (  0.00%)    375762.00 (  2.51%)    268229.00 (-26.83%)    395098.00 (  7.78%)    364819.00 ( -0.48%)
TPut 55    367580.00 (  0.00%)    336113.00 ( -8.56%)    267474.00 (-27.23%)    400550.00 (  8.97%)    357908.00 ( -2.63%)
TPut 56    367056.00 (  0.00%)    375635.00 (  2.34%)    263577.00 (-28.19%)    385743.00 (  5.09%)    345996.00 ( -5.74%)
TPut 57    359163.00 (  0.00%)    354001.00 ( -1.44%)    261130.00 (-27.29%)    389827.00 (  8.54%)    362023.00 (  0.80%)
TPut 58    360552.00 (  0.00%)    353312.00 ( -2.01%)    261140.00 (-27.57%)    394099.00 (  9.30%)    373974.00 (  3.72%)
TPut 59    354967.00 (  0.00%)    368534.00 (  3.82%)    262418.00 (-26.07%)    390746.00 ( 10.08%)    378369.00 (  6.59%)
TPut 60    362976.00 (  0.00%)    388472.00 (  7.02%)    267468.00 (-26.31%)    383073.00 (  5.54%)    344892.00 ( -4.98%)
TPut 61    368072.00 (  0.00%)    399476.00 (  8.53%)    265659.00 (-27.82%)    380807.00 (  3.46%)    353035.00 ( -4.09%)
TPut 62    356938.00 (  0.00%)    385648.00 (  8.04%)    253107.00 (-29.09%)    387736.00 (  8.63%)    372224.00 (  4.28%)
TPut 63    357491.00 (  0.00%)    404325.00 ( 13.10%)    259404.00 (-27.44%)    396672.00 ( 10.96%)    375653.00 (  5.08%)
TPut 64    357322.00 (  0.00%)    389552.00 (  9.02%)    260333.00 (-27.14%)    386826.00 (  8.26%)    366878.00 (  2.67%)
TPut 65    341262.00 (  0.00%)    394964.00 ( 15.74%)    258149.00 (-24.35%)    380271.00 ( 11.43%)    353347.00 (  3.54%)
TPut 66    357807.00 (  0.00%)    384846.00 (  7.56%)    259279.00 (-27.54%)    362723.00 (  1.37%)    375644.00 (  4.99%)
TPut 67    345092.00 (  0.00%)    376842.00 (  9.20%)    259350.00 (-24.85%)    364193.00 (  5.54%)    372053.00 (  7.81%)
TPut 68    350334.00 (  0.00%)    358330.00 (  2.28%)    259332.00 (-25.98%)    359368.00 (  2.58%)    378979.00 (  8.18%)
TPut 69    348372.00 (  0.00%)    356188.00 (  2.24%)    263076.00 (-24.48%)    364449.00 (  4.61%)    369852.00 (  6.17%)
TPut 70    335077.00 (  0.00%)    359313.00 (  7.23%)    259983.00 (-22.41%)    356418.00 (  6.37%)    353911.00 (  5.62%)
TPut 71    341197.00 (  0.00%)    364168.00 (  6.73%)    254622.00 (-25.37%)    343847.00 (  0.78%)    358404.00 (  5.04%)
TPut 72    345032.00 (  0.00%)    356934.00 (  3.45%)    261060.00 (-24.34%)    345007.00 ( -0.01%)    349968.00 (  1.43%)

According to this, the latest numacore is much worse than the previous release.

SPECJBB PEAKS
                                   3.7.0-rc7                  3.7.0-rc6                  3.7.0-rc7                  3.7.0-rc7                  3.7.0-rc7
                                  stats-v8r6          numacore-20121130          numacore-20121202         autonuma-v28fastr4            thpmigrate-v8r6
 Expctd Warehouse                   48.00 (  0.00%)                   48.00 (  0.00%)                   48.00 (  0.00%)                   48.00 (  0.00%)                   48.00 (  0.00%)
 Expctd Peak Bops               379346.00 (  0.00%)               387287.00 (  2.09%)               274168.00 (-27.73%)               373581.00 ( -1.52%)               370346.00 ( -2.37%)
 Actual Warehouse                   24.00 (  0.00%)                   25.00 (  4.17%)                   23.00 ( -4.17%)                   24.00 (  0.00%)                   24.00 (  0.00%)
 Actual Peak Bops               495378.00 (  0.00%)               520474.00 (  5.07%)               377444.00 (-23.81%)               514403.00 (  3.84%)               499233.00 (  0.78%)
 SpecJBB Bops                   183389.00 (  0.00%)               193652.00 (  5.60%)               134571.00 (-26.62%)               193461.00 (  5.49%)               186801.00 (  1.86%)
 SpecJBB Bops/JVM               183389.00 (  0.00%)               193652.00 (  5.60%)               134571.00 (-26.62%)               193461.00 (  5.49%)               186801.00 (  1.86%)

numacore is now showing a 23.81% regression at the peak and a 26.62% regression in its overall specjbb score.

MMTests Statistics: duration
           3.7.0-rc7   3.7.0-rc6   3.7.0-rc7   3.7.0-rc7   3.7.0-rc7
          stats-v8r6  numacore-20121130  numacore-20121202  autonuma-v28fastr4  thpmigrate-v8r6
User       316340.52   311420.23   317791.75   314589.64   315737.97
System        102.08     3067.27      102.89      352.70      455.15
Elapsed      7433.22     7436.63     7434.49     7434.74     7434.42

System CPU usage has improved and is now in line with the baseline kernel.

MMTests Statistics: vmstat
                             3.7.0-rc7   3.7.0-rc6   3.7.0-rc7   3.7.0-rc7   3.7.0-rc7
                            stats-v8r6  numacore-20121130  numacore-20121202  autonuma-v28fastr4  thpmigrate-v8r6
Page Ins                         66212       36180       35476       36152       36572
Page Outs                        31248       35544       29372       28388       29096
Swap Ins                             0           0           0           0           0
Swap Outs                            0           0           0           0           0
Direct pages scanned                 0           0           0           0           0
Kswapd pages scanned                 0           0           0           0           0
Kswapd pages reclaimed               0           0           0           0           0
Direct pages reclaimed               0           0           0           0           0
Kswapd efficiency                 100%        100%        100%        100%        100%
Kswapd velocity                  0.000       0.000       0.000       0.000       0.000
Direct efficiency                 100%        100%        100%        100%        100%
Direct velocity                  0.000       0.000       0.000       0.000       0.000
Percentage direct scans             0%          0%          0%          0%          0%
Page writes by reclaim               0           0           0           0           0
Page writes file                     0           0           0           0           0
Page writes anon                     0           0           0           0           0
Page reclaim immediate               0           0           0           0           0
Page rescued immediate               0           0           0           0           0
Slabs scanned                        0           0           0           0           0
Direct inode steals                  0           0           0           0           0
Kswapd inode steals                  0           0           0           0           0
Kswapd skipped wait                  0           0           0           0           0
THP fault alloc                  48874       45657       54164       48296       52738
THP collapse alloc                  51           2          50         157          47
THP splits                          70          37          83          83          77
THP fault fallback                   0           0           0           0           0
THP collapse fail                    0           0           0           0           0
Compaction stalls                    0           0           0           0           0
Compaction success                   0           0           0           0           0
Compaction failures                  0           0           0           0           0
Page migrate success                 0           0           0           0    46460155
Page migrate failure                 0           0           0           0           0
Compaction pages isolated            0           0           0           0           0
Compaction migrate scanned           0           0           0           0           0
Compaction free scanned              0           0           0           0           0
Compaction cost                      0           0           0           0       48225
NUMA PTE updates                     0           0           0           0   354564017
NUMA hint faults                     0           0           0           0     1861516
NUMA hint local faults               0           0           0           0      594273
NUMA pages migrated                  0           0           0           0    46460155
AutoNUMA cost                        0           0           0           0       12672

THP was in use.

SPECJBB: Single JVM, THP is disabled
                    3.7.0-rc7             3.7.0-rc6             3.7.0-rc7             3.7.0-rc7             3.7.0-rc7
                   stats-v8r6     numacore-20121130     numacore-20121202    autonuma-v28fastr4       thpmigrate-v8r6
TPut 1      19861.00 (  0.00%)     18255.00 ( -8.09%)     21307.00 (  7.28%)     19636.00 ( -1.13%)     20429.00 (  2.86%)
TPut 2      47613.00 (  0.00%)     37136.00 (-22.00%)     47861.00 (  0.52%)     47153.00 ( -0.97%)     48608.00 (  2.09%)
TPut 3      72438.00 (  0.00%)     55692.00 (-23.12%)     72271.00 ( -0.23%)     69394.00 ( -4.20%)     73723.00 (  1.77%)
TPut 4      98455.00 (  0.00%)     81301.00 (-17.42%)     91079.00 ( -7.49%)     98577.00 (  0.12%)     98563.00 (  0.11%)
TPut 5     120831.00 (  0.00%)     89067.00 (-26.29%)    118381.00 ( -2.03%)    120805.00 ( -0.02%)    121179.00 (  0.29%)
TPut 6     140013.00 (  0.00%)    108349.00 (-22.62%)    141994.00 (  1.41%)    125079.00 (-10.67%)    145750.00 (  4.10%)
TPut 7     163553.00 (  0.00%)    116192.00 (-28.96%)    133084.00 (-18.63%)    164368.00 (  0.50%)    162133.00 ( -0.87%)
TPut 8     190148.00 (  0.00%)    125955.00 (-33.76%)    177239.00 ( -6.79%)    188906.00 ( -0.65%)    188252.00 ( -1.00%)
TPut 9     211343.00 (  0.00%)    144068.00 (-31.83%)    180903.00 (-14.40%)    206645.00 ( -2.22%)    214450.00 (  1.47%)
TPut 10    233190.00 (  0.00%)    148098.00 (-36.49%)    215595.00 ( -7.55%)    234533.00 (  0.58%)    224369.00 ( -3.78%)
TPut 11    253333.00 (  0.00%)    146043.00 (-42.35%)    224514.00 (-11.38%)    254167.00 (  0.33%)    260926.00 (  3.00%)
TPut 12    270661.00 (  0.00%)    131739.00 (-51.33%)    245812.00 ( -9.18%)    271490.00 (  0.31%)    258382.00 ( -4.54%)
TPut 13    299807.00 (  0.00%)    169396.00 (-43.50%)    253075.00 (-15.59%)    299758.00 ( -0.02%)    294586.00 ( -1.74%)
TPut 14    319243.00 (  0.00%)    150705.00 (-52.79%)    256078.00 (-19.79%)    318481.00 ( -0.24%)    327111.00 (  2.46%)
TPut 15    339054.00 (  0.00%)    116872.00 (-65.53%)    268646.00 (-20.77%)    331534.00 ( -2.22%)    301219.00 (-11.16%)
TPut 16    354315.00 (  0.00%)    124346.00 (-64.91%)    291148.00 (-17.83%)    352600.00 ( -0.48%)    309066.00 (-12.77%)
TPut 17    371306.00 (  0.00%)    118493.00 (-68.09%)    299399.00 (-19.37%)    368260.00 ( -0.82%)    353726.00 ( -4.73%)
TPut 18    386361.00 (  0.00%)    138571.00 (-64.13%)    303185.00 (-21.53%)    374358.00 ( -3.11%)    361347.00 ( -6.47%)
TPut 19    401827.00 (  0.00%)    118855.00 (-70.42%)    320630.00 (-20.21%)    399476.00 ( -0.59%)    389834.00 ( -2.98%)
TPut 20    411130.00 (  0.00%)    144024.00 (-64.97%)    315391.00 (-23.29%)    407799.00 ( -0.81%)    401569.00 ( -2.33%)
TPut 21    425352.00 (  0.00%)    154264.00 (-63.73%)    326734.00 (-23.19%)    429226.00 (  0.91%)    410586.00 ( -3.47%)
TPut 22    438150.00 (  0.00%)    153892.00 (-64.88%)    329531.00 (-24.79%)    385827.00 (-11.94%)    426994.00 ( -2.55%)
TPut 23    438425.00 (  0.00%)    146506.00 (-66.58%)    336454.00 (-23.26%)    433963.00 ( -1.02%)    425391.00 ( -2.97%)
TPut 24    461598.00 (  0.00%)    138869.00 (-69.92%)    330113.00 (-28.48%)    439691.00 ( -4.75%)    449185.00 ( -2.69%)
TPut 25    459475.00 (  0.00%)    141698.00 (-69.16%)    333545.00 (-27.41%)    431373.00 ( -6.12%)    461066.00 (  0.35%)
TPut 26    452651.00 (  0.00%)    142844.00 (-68.44%)    325634.00 (-28.06%)    447517.00 ( -1.13%)    458937.00 (  1.39%)
TPut 27    450436.00 (  0.00%)    140870.00 (-68.73%)    324881.00 (-27.87%)    430805.00 ( -4.36%)    453923.00 (  0.77%)
TPut 28    459770.00 (  0.00%)    143078.00 (-68.88%)    312547.00 (-32.02%)    432260.00 ( -5.98%)    440465.00 ( -4.20%)
TPut 29    450347.00 (  0.00%)    142076.00 (-68.45%)    318785.00 (-29.21%)    440423.00 ( -2.20%)    444150.00 ( -1.38%)
TPut 30    449252.00 (  0.00%)    146900.00 (-67.30%)    310301.00 (-30.93%)    435082.00 ( -3.15%)    460002.00 (  2.39%)
TPut 31    446802.00 (  0.00%)    148008.00 (-66.87%)    304119.00 (-31.93%)    418684.00 ( -6.29%)    412117.00 ( -7.76%)
TPut 32    439701.00 (  0.00%)    149591.00 (-65.98%)    297625.00 (-32.31%)    421866.00 ( -4.06%)    446009.00 (  1.43%)
TPut 33    434477.00 (  0.00%)    142801.00 (-67.13%)    293405.00 (-32.47%)    420631.00 ( -3.19%)    392769.00 ( -9.60%)
TPut 34    423014.00 (  0.00%)    152308.00 (-63.99%)    288639.00 (-31.77%)    415202.00 ( -1.85%)    443175.00 (  4.77%)
TPut 35    429012.00 (  0.00%)    154116.00 (-64.08%)    283797.00 (-33.85%)    402395.00 ( -6.20%)    409016.00 ( -4.66%)
TPut 36    421097.00 (  0.00%)    157571.00 (-62.58%)    276038.00 (-34.45%)    404770.00 ( -3.88%)    427704.00 (  1.57%)
TPut 37    414815.00 (  0.00%)    150771.00 (-63.65%)    272498.00 (-34.31%)    388842.00 ( -6.26%)    443787.00 (  6.98%)
TPut 38    412361.00 (  0.00%)    157070.00 (-61.91%)    270972.00 (-34.29%)    398947.00 ( -3.25%)    426349.00 (  3.39%)
TPut 39    402234.00 (  0.00%)    161487.00 (-59.85%)    258636.00 (-35.70%)    382645.00 ( -4.87%)    393389.00 ( -2.20%)
TPut 40    380278.00 (  0.00%)    165947.00 (-56.36%)    256492.00 (-32.55%)    394039.00 (  3.62%)    439362.00 ( 15.54%)
TPut 41    393204.00 (  0.00%)    160540.00 (-59.17%)    254896.00 (-35.17%)    385605.00 ( -1.93%)    416757.00 (  5.99%)
TPut 42    380622.00 (  0.00%)    151946.00 (-60.08%)    248167.00 (-34.80%)    374843.00 ( -1.52%)    360885.00 ( -5.19%)
TPut 43    371566.00 (  0.00%)    162369.00 (-56.30%)    238268.00 (-35.87%)    347951.00 ( -6.36%)    410625.00 ( 10.51%)
TPut 44    365538.00 (  0.00%)    161127.00 (-55.92%)    239926.00 (-34.36%)    355070.00 ( -2.86%)    399653.00 (  9.33%)
TPut 45    359305.00 (  0.00%)    159062.00 (-55.73%)    237676.00 (-33.85%)    350973.00 ( -2.32%)    358310.00 ( -0.28%)
TPut 46    343160.00 (  0.00%)    163889.00 (-52.24%)    231272.00 (-32.61%)    347960.00 (  1.40%)    354509.00 (  3.31%)
TPut 47    346983.00 (  0.00%)    168666.00 (-51.39%)    228060.00 (-34.27%)    313612.00 ( -9.62%)    362578.00 (  4.49%)
TPut 48    338143.00 (  0.00%)    153448.00 (-54.62%)    224598.00 (-33.58%)    341809.00 (  1.08%)    356291.00 (  5.37%)
TPut 49    333941.00 (  0.00%)    142784.00 (-57.24%)    224568.00 (-32.75%)    336174.00 (  0.67%)    374367.00 ( 12.11%)
TPut 50    334001.00 (  0.00%)    135713.00 (-59.37%)    221381.00 (-33.72%)    322489.00 ( -3.45%)    364590.00 (  9.16%)
TPut 51    338310.00 (  0.00%)    133402.00 (-60.57%)    219870.00 (-35.01%)    354805.00 (  4.88%)    377493.00 ( 11.58%)
TPut 52    322897.00 (  0.00%)    150293.00 (-53.45%)    217427.00 (-32.66%)    353169.00 (  9.38%)    368626.00 ( 14.16%)
TPut 53    329801.00 (  0.00%)    160792.00 (-51.25%)    224019.00 (-32.07%)    353588.00 (  7.21%)    369336.00 ( 11.99%)
TPut 54    336610.00 (  0.00%)    164696.00 (-51.07%)    214752.00 (-36.20%)    361189.00 (  7.30%)    375852.00 ( 11.66%)
TPut 55    325920.00 (  0.00%)    172380.00 (-47.11%)    219529.00 (-32.64%)    365678.00 ( 12.20%)    382457.00 ( 17.35%)
TPut 56    318997.00 (  0.00%)    176071.00 (-44.80%)    218120.00 (-31.62%)    367048.00 ( 15.06%)    367083.00 ( 15.07%)
TPut 57    321776.00 (  0.00%)    174531.00 (-45.76%)    214685.00 (-33.28%)    341874.00 (  6.25%)    359388.00 ( 11.69%)
TPut 58    308532.00 (  0.00%)    174202.00 (-43.54%)    208226.00 (-32.51%)    348156.00 ( 12.84%)    371239.00 ( 20.32%)
TPut 59    318974.00 (  0.00%)    175343.00 (-45.03%)    214260.00 (-32.83%)    358252.00 ( 12.31%)    370477.00 ( 16.15%)
TPut 60    325465.00 (  0.00%)    173694.00 (-46.63%)    213290.00 (-34.47%)    360808.00 ( 10.86%)    365093.00 ( 12.18%)
TPut 61    319151.00 (  0.00%)    172320.00 (-46.01%)    206197.00 (-35.39%)    350597.00 (  9.85%)    361934.00 ( 13.41%)
TPut 62    320837.00 (  0.00%)    172312.00 (-46.29%)    211186.00 (-34.18%)    359062.00 ( 11.91%)    356332.00 ( 11.06%)
TPut 63    318198.00 (  0.00%)    172297.00 (-45.85%)    215174.00 (-32.38%)    356137.00 ( 11.92%)    352316.00 ( 10.72%)
TPut 64    321438.00 (  0.00%)    171894.00 (-46.52%)    212493.00 (-33.89%)    347376.00 (  8.07%)    350834.00 (  9.15%)
TPut 65    314482.00 (  0.00%)    169147.00 (-46.21%)    204809.00 (-34.87%)    351726.00 ( 11.84%)    348002.00 ( 10.66%)
TPut 66    316802.00 (  0.00%)    170234.00 (-46.26%)    199708.00 (-36.96%)    344548.00 (  8.76%)    343706.00 (  8.49%)
TPut 67    312139.00 (  0.00%)    168180.00 (-46.12%)    208644.00 (-33.16%)    329030.00 (  5.41%)    347701.00 ( 11.39%)
TPut 68    323918.00 (  0.00%)    168392.00 (-48.01%)    206120.00 (-36.37%)    319985.00 ( -1.21%)    342289.00 (  5.67%)
TPut 69    307506.00 (  0.00%)    167082.00 (-45.67%)    204703.00 (-33.43%)    340673.00 ( 10.79%)    344829.00 ( 12.14%)
TPut 70    306799.00 (  0.00%)    165764.00 (-45.97%)    201529.00 (-34.31%)    331678.00 (  8.11%)    324939.00 (  5.91%)
TPut 71    304232.00 (  0.00%)    165289.00 (-45.67%)    203291.00 (-33.18%)    319824.00 (  5.13%)    336540.00 ( 10.62%)
TPut 72    301619.00 (  0.00%)    163909.00 (-45.66%)    203306.00 (-32.60%)    326875.00 (  8.37%)    342014.00 ( 13.39%)

numacore has improved in the !THP case but it's still worse than the
baseline kernel, autonuma or balancenuma.

SPECJBB PEAKS
                                   3.7.0-rc7                  3.7.0-rc6                  3.7.0-rc7                  3.7.0-rc7                  3.7.0-rc7
                                  stats-v8r6          numacore-20121130          numacore-20121202         autonuma-v28fastr4            thpmigrate-v8r6
 Expctd Warehouse            48.00 (  0.00%)            48.00 (  0.00%)            48.00 (  0.00%)            48.00 (  0.00%)            48.00 (  0.00%)
 Expctd Peak Bops        338143.00 (  0.00%)        153448.00 (-54.62%)        224598.00 (-33.58%)        341809.00 (  1.08%)        356291.00 (  5.37%)
 Actual Warehouse            24.00 (  0.00%)            56.00 (133.33%)            23.00 ( -4.17%)            26.00 (  8.33%)            25.00 (  4.17%)
 Actual Peak Bops        461598.00 (  0.00%)        176071.00 (-61.86%)        336454.00 (-27.11%)        447517.00 ( -3.05%)        461066.00 ( -0.12%)
 SpecJBB Bops            163683.00 (  0.00%)         83963.00 (-48.70%)        108406.00 (-33.77%)        176379.00 (  7.76%)        182729.00 ( 11.64%)
 SpecJBB Bops/JVM        163683.00 (  0.00%)         83963.00 (-48.70%)        108406.00 (-33.77%)        176379.00 (  7.76%)        182729.00 ( 11.64%)

Again, the improvement in the peak and specjbb scores is impressive compared
to v17, but they are still worse than the baseline kernel.


MMTests Statistics: duration
           3.7.0-rc7   3.7.0-rc6   3.7.0-rc7   3.7.0-rc7   3.7.0-rc7
          stats-v8r6  numacore-20121130  numacore-20121202  autonuma-v28fastr4  thpmigrate-v8r6
User       316751.91   167098.56   318360.63   307598.67   308390.57
System         60.28   122511.08       59.60     4411.81     1809.94
Elapsed      7434.08     7451.36     7436.99     7437.52     7438.60

It does smack the face off the previous numacore release's system CPU usage
though, bringing it back in line with the baseline kernel.

MMTests Statistics: vmstat
                             3.7.0-rc7   3.7.0-rc6   3.7.0-rc7   3.7.0-rc7   3.7.0-rc7
                            stats-v8r6  numacore-20121130  numacore-20121202  autonuma-v28fastr4  thpmigrate-v8r6
Page Ins                         37112       36416       37492       37436       36744
Page Outs                        29252       35664       29636       28120       29080
Swap Ins                             0           0           0           0           0
Swap Outs                            0           0           0           0           0
Direct pages scanned                 0           0           0           0           0
Kswapd pages scanned                 0           0           0           0           0
Kswapd pages reclaimed               0           0           0           0           0
Direct pages reclaimed               0           0           0           0           0
Kswapd efficiency                 100%        100%        100%        100%        100%
Kswapd velocity                  0.000       0.000       0.000       0.000       0.000
Direct efficiency                 100%        100%        100%        100%        100%
Direct velocity                  0.000       0.000       0.000       0.000       0.000
Percentage direct scans             0%          0%          0%          0%          0%
Page writes by reclaim               0           0           0           0           0
Page writes file                     0           0           0           0           0
Page writes anon                     0           0           0           0           0
Page reclaim immediate               0           0           0           0           0
Page rescued immediate               0           0           0           0           0
Slabs scanned                        0           0           0           0           0
Direct inode steals                  0           0           0           0           0
Kswapd inode steals                  0           0           0           0           0
Kswapd skipped wait                  0           0           0           0           0
THP fault alloc                      3           2           2           2           3
THP collapse alloc                   0           0           0           4           0
THP splits                           0           0           0           1           0
THP fault fallback                   0           0           0           0           0
THP collapse fail                    0           0           0           0           0
Compaction stalls                    0           0           0           0           0
Compaction success                   0           0           0           0           0
Compaction failures                  0           0           0           0           0
Page migrate success                 0           0           0           0    25125765
Page migrate failure                 0           0           0           0           0
Compaction pages isolated            0           0           0           0           0
Compaction migrate scanned           0           0           0           0           0
Compaction free scanned              0           0           0           0           0
Compaction cost                      0           0           0           0       26080
NUMA PTE updates                     0           0           0           0   207309372
NUMA hint faults                     0           0           0           0   201463304
NUMA hint local faults               0           0           0           0    52170319
NUMA pages migrated                  0           0           0           0    25125765
AutoNUMA cost                        0           0           0           0     1009245

THP was not in use.
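
Comparing the two single-JVM balancenuma runs, the ratio of PTE updates to
hinting faults is strikingly different with and without THP. A plausible
reading -- and it is only a reading -- is that with THP a single hinting
fault on a huge PMD accounts for up to 512 base pages worth of PTE updates
(2M/4K on x86-64), while without THP almost every updated PTE ends up
faulting individually:

    # Hinting-fault density for the single-JVM thpmigrate-v8r6 runs.
    print(354564017 / 1861516)     # THP enabled:  ~190 PTE updates per hinting fault
    print(207309372 / 201463304)   # THP disabled: ~1.03 PTE updates per hinting fault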

Ordinary benchmarks follow.

KERNBENCH
                           3.7.0-rc7             3.7.0-rc6             3.7.0-rc7             3.7.0-rc7             3.7.0-rc7
                          stats-v8r6     numacore-20121130     numacore-20121202    autonuma-v28fastr4       thpmigrate-v8r6
User    min        1306.06 (  0.00%)     1319.85 ( -1.06%)     1306.15 ( -0.01%)     1303.15 (  0.22%)     1304.14 (  0.15%)
User    mean       1308.87 (  0.00%)     1321.89 ( -0.99%)     1311.67 ( -0.21%)     1305.35 (  0.27%)     1309.67 ( -0.06%)
User    stddev        2.24 (  0.00%)        1.71 ( 23.89%)        3.32 (-48.18%)        1.80 ( 19.79%)        3.75 (-67.16%)
User    max        1312.45 (  0.00%)     1324.45 ( -0.91%)     1315.52 ( -0.23%)     1308.40 (  0.31%)     1314.85 ( -0.18%)
System  min         120.47 (  0.00%)      132.77 (-10.21%)      123.36 ( -2.40%)      124.01 ( -2.94%)      121.08 ( -0.51%)
System  mean        121.17 (  0.00%)      133.28 ( -9.99%)      124.13 ( -2.45%)      124.81 ( -3.00%)      121.89 ( -0.59%)
System  stddev        0.40 (  0.00%)        0.62 (-55.98%)        0.55 (-37.03%)        0.63 (-57.36%)        0.56 (-40.59%)
System  max         121.64 (  0.00%)      134.49 (-10.56%)      124.85 ( -2.64%)      125.61 ( -3.26%)      122.46 ( -0.67%)
Elapsed min          40.42 (  0.00%)       41.19 ( -1.90%)       41.01 ( -1.46%)       41.59 ( -2.89%)       40.61 ( -0.47%)
Elapsed mean         41.35 (  0.00%)       43.00 ( -4.00%)       41.84 ( -1.20%)       42.76 ( -3.43%)       41.77 ( -1.03%)
Elapsed stddev        0.56 (  0.00%)        1.59 (-184.71%)        0.84 (-50.26%)        0.67 (-20.64%)        1.17 (-109.48%)
Elapsed max          42.11 (  0.00%)       45.30 ( -7.58%)       43.43 ( -3.13%)       43.43 ( -3.13%)       43.43 ( -3.13%)
CPU     min        3391.00 (  0.00%)     3208.00 (  5.40%)     3302.00 (  2.62%)     3295.00 (  2.83%)     3288.00 (  3.04%)
CPU     mean       3458.60 (  0.00%)     3387.80 (  2.05%)     3432.00 (  0.77%)     3344.20 (  3.31%)     3429.40 (  0.84%)
CPU     stddev       52.09 (  0.00%)      126.66 (-143.17%)       71.31 (-36.91%)       49.90 (  4.20%)      103.82 (-99.32%)
CPU     max        3546.00 (  0.00%)     3528.00 (  0.51%)     3511.00 (  0.99%)     3433.00 (  3.19%)     3537.00 (  0.25%)

Nothing interesting here.
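
As a sanity check on the columns: the CPU figure is essentially
100 * (User + System) / Elapsed averaged over the iterations, and the means
above reproduce it closely:

    # KERNBENCH column check: CPU% ~= 100 * (User + System) / Elapsed.
    user, system, elapsed = 1308.87, 121.17, 41.35   # stats-v8r6 means
    print(100 * (user + system) / elapsed)           # ~3458 vs the reported 3458.60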

AIM9
                             3.7.0-rc7             3.7.0-rc6             3.7.0-rc7             3.7.0-rc7             3.7.0-rc7
                            stats-v8r6     numacore-20121130     numacore-20121202    autonuma-v28fastr4       thpmigrate-v8r6
Min    page_test   419560.00 (  0.00%)   277028.65 (-33.97%)   277821.45 (-33.78%)   323793.33 (-22.83%)   295716.19 (-29.52%)
Min    brk_test   3277600.00 (  0.00%)  3210859.43 ( -2.04%)  3279866.67 (  0.07%)  3146302.47 ( -4.01%)  3147266.67 ( -3.98%)
Min    exec_test      245.00 (  0.00%)      248.17 (  1.29%)      242.01 ( -1.22%)      257.33 (  5.03%)      243.34 ( -0.68%)
Min    fork_test     1514.95 (  0.00%)     1540.00 (  1.65%)     1527.22 (  0.81%)     1793.33 ( 18.38%)     1512.33 ( -0.17%)
Mean   page_test   427488.29 (  0.00%)   284838.06 (-33.37%)   316690.82 (-25.92%)   400555.24 ( -6.30%)   355496.92 (-16.84%)
Mean   brk_test   3313599.82 (  0.00%)  3237952.27 ( -2.28%)  3321144.18 (  0.23%)  3255156.20 ( -1.76%)  3215398.23 ( -2.96%)
Mean   exec_test      249.28 (  0.00%)      249.25 ( -0.01%)      248.09 ( -0.48%)      263.10 (  5.55%)      245.32 ( -1.59%)
Mean   fork_test     1535.85 (  0.00%)     1566.98 (  2.03%)     1563.50 (  1.80%)     1870.91 ( 21.82%)     1524.94 ( -0.71%)
Stddev page_test     4544.68 (  0.00%)     3785.28 (-16.71%)    44187.93 (872.30%)    26460.03 (482.22%)    35256.95 (675.79%)
Stddev brk_test     19204.63 (  0.00%)    12268.84 (-36.12%)    28452.66 ( 48.16%)    58909.97 (206.75%)    27591.49 ( 43.67%)
Stddev exec_test        2.45 (  0.00%)        0.80 (-67.49%)        2.14 (-12.54%)        3.21 ( 31.07%)        1.30 (-47.13%)
Stddev fork_test       12.76 (  0.00%)       12.66 ( -0.83%)       20.65 ( 61.80%)       51.77 (305.73%)        7.68 (-39.79%)
Max    page_test   432820.00 (  0.00%)   289226.67 (-33.18%)   411740.00 ( -4.87%)   433097.93 (  0.06%)   409653.56 ( -5.35%)
Max    brk_test   3339933.33 (  0.00%)  3257200.00 ( -2.48%)  3366000.00 (  0.78%)  3335400.00 ( -0.14%)  3248234.51 ( -2.75%)
Max    exec_test      253.33 (  0.00%)      250.50 ( -1.12%)      250.00 ( -1.31%)      268.15 (  5.85%)      247.83 ( -2.17%)
Max    fork_test     1561.46 (  0.00%)     1589.10 (  1.77%)     1593.33 (  2.04%)     1952.03 ( 25.01%)     1534.88 ( -1.70%)

page_test is badly damaged by a few of the trees, but the deviations are
really high and it would take 30+ runs to be sure.
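
As a rough illustration of why so many runs would be needed: the standard
error of the mean only shrinks with sqrt(n), and with the kind of deviation
numacore-20121202 shows on page_test the mean is not pinned down to within a
couple of percent until n reaches the tens. The figures below are taken from
the table; the 2.5% threshold is just a rule of thumb, not a formal test:

    # Rough sample-size intuition for page_test variability.
    import math
    mean, stddev = 316690.82, 44187.93      # numacore-20121202 page_test
    for n in (5, 10, 30, 50):
        sem = stddev / math.sqrt(n)
        print(n, sem, 100 * sem / mean)     # SEM falls to ~2.5% of the mean around n=30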

HACKBENCH PIPES
                     3.7.0-rc7             3.7.0-rc6             3.7.0-rc7             3.7.0-rc7             3.7.0-rc7
                    stats-v8r6     numacore-20121130     numacore-20121202    autonuma-v28fastr4       thpmigrate-v8r6
Procs 1       0.0250 (  0.00%)      0.0260 ( -4.00%)      0.0237 (  5.07%)      0.0261 ( -4.27%)      0.0247 (  1.14%)
Procs 4       0.0696 (  0.00%)      0.0702 ( -0.84%)      0.0588 ( 15.54%)      0.0707 ( -1.65%)      0.0729 ( -4.77%)
Procs 8       0.0836 (  0.00%)      0.0973 (-16.43%)      0.0931 (-11.39%)      0.1030 (-23.21%)      0.0906 ( -8.40%)
Procs 12      0.0971 (  0.00%)      0.0969 (  0.21%)      0.1536 (-58.16%)      0.1235 (-27.19%)      0.0890 (  8.34%)
Procs 16      0.1218 (  0.00%)      0.1286 ( -5.52%)      0.1892 (-55.31%)      0.1775 (-45.69%)      0.1168 (  4.10%)
Procs 20      0.1472 (  0.00%)      0.1508 ( -2.48%)      0.2827 (-92.05%)      0.1584 ( -7.64%)      0.1415 (  3.88%)
Procs 24      0.1684 (  0.00%)      0.1823 ( -8.20%)      0.3418 (-102.93%)      0.4648 (-175.96%)      0.1654 (  1.82%)
Procs 28      0.1919 (  0.00%)      0.1969 ( -2.61%)      0.4519 (-135.50%)      0.5287 (-175.57%)      0.1763 (  8.10%)
Procs 32      0.2256 (  0.00%)      0.2163 (  4.12%)      0.5399 (-139.30%)      0.4607 (-104.23%)      0.2148 (  4.81%)
Procs 36      0.2228 (  0.00%)      0.2658 (-19.29%)      0.5221 (-134.34%)      0.6190 (-177.83%)      0.2320 ( -4.13%)
Procs 40      0.2811 (  0.00%)      0.2906 ( -3.37%)      0.6629 (-135.83%)      0.2595 (  7.69%)      0.2587 (  7.98%)

numacore now really hurts hackbench-pipes scores. Is this worklets again?

HACKBENCH SOCKETS
                     3.7.0-rc7             3.7.0-rc6             3.7.0-rc7             3.7.0-rc7             3.7.0-rc7
                    stats-v8r6     numacore-20121130     numacore-20121202    autonuma-v28fastr4       thpmigrate-v8r6
Procs 1       0.0220 (  0.00%)      0.0220 (  0.00%)      0.0226 ( -2.89%)      0.0283 (-28.66%)      0.0222 ( -1.05%)
Procs 4       0.0456 (  0.00%)      0.0513 (-12.51%)      0.0600 (-31.51%)      0.0820 (-79.73%)      0.0457 ( -0.20%)
Procs 8       0.0679 (  0.00%)      0.0714 ( -5.20%)      0.1499 (-120.78%)      0.2772 (-308.32%)      0.0707 ( -4.18%)
Procs 12      0.0940 (  0.00%)      0.0973 ( -3.56%)      0.2183 (-132.18%)      0.1155 (-22.87%)      0.0977 ( -3.93%)
Procs 16      0.1181 (  0.00%)      0.1263 ( -6.96%)      0.2586 (-118.90%)      0.4467 (-278.19%)      0.1242 ( -5.10%)
Procs 20      0.1504 (  0.00%)      0.1531 ( -1.83%)      0.4029 (-167.90%)      0.4917 (-226.94%)      0.1530 ( -1.71%)
Procs 24      0.1757 (  0.00%)      0.1826 ( -3.92%)      0.4248 (-141.69%)      0.5142 (-192.57%)      0.1841 ( -4.75%)
Procs 28      0.2044 (  0.00%)      0.2166 ( -5.93%)      0.5702 (-178.91%)      0.6600 (-222.85%)      0.2150 ( -5.17%)
Procs 32      0.2456 (  0.00%)      0.2501 ( -1.86%)      0.6433 (-161.93%)      0.6391 (-160.22%)      0.2500 ( -1.79%)
Procs 36      0.2649 (  0.00%)      0.2747 ( -3.70%)      0.7377 (-178.45%)      0.5775 (-117.97%)      0.2772 ( -4.63%)
Procs 40      0.3067 (  0.00%)      0.3114 ( -1.56%)      0.8349 (-172.25%)      0.7517 (-145.12%)      0.3091 ( -0.80%)

Same. The impact is really high.

PAGE FAULT TEST
                          3.7.0-rc7             3.7.0-rc6             3.7.0-rc7             3.7.0-rc7             3.7.0-rc7
                         stats-v8r6     numacore-20121130     numacore-20121202    autonuma-v28fastr4       thpmigrate-v8r6
System     1        7.8930 (  0.00%)       8.2460 ( -4.47%)       8.0560 ( -2.07%)       7.9015 ( -0.11%)       8.0490 ( -1.98%)
System     2        7.8590 (  0.00%)       8.1550 ( -3.77%)       8.0285 ( -2.16%)       8.2245 ( -4.65%)       7.9990 ( -1.78%)
System     3        8.0960 (  0.00%)       8.1905 ( -1.17%)       8.1670 ( -0.88%)       8.5495 ( -5.60%)       8.1030 ( -0.09%)
System     4        8.1915 (  0.00%)       8.3430 ( -1.85%)       8.2715 ( -0.98%)       9.7845 (-19.45%)       8.1745 (  0.21%)
System     5        8.4780 (  0.00%)       8.6035 ( -1.48%)       8.5840 ( -1.25%)      10.1610 (-19.85%)       8.5015 ( -0.28%)
System     6        8.8070 (  0.00%)       8.8245 ( -0.20%)       8.8915 ( -0.96%)      10.1365 (-15.10%)       8.7475 (  0.68%)
System     7        8.8075 (  0.00%)       8.8410 ( -0.38%)       8.8820 ( -0.85%)      10.5590 (-19.89%)       8.8330 ( -0.29%)
System     8        8.8155 (  0.00%)       8.8680 ( -0.60%)       8.8885 ( -0.83%)      10.6645 (-20.97%)       8.7550 (  0.69%)
System     9        9.1815 (  0.00%)       9.2985 ( -1.27%)       9.2560 ( -0.81%)      11.1265 (-21.18%)       9.3000 ( -1.29%)
System     10       9.6165 (  0.00%)       9.4230 (  2.01%)       9.5640 (  0.55%)      13.3825 (-39.16%)       9.4725 (  1.50%)
System     11       9.6765 (  0.00%)       9.6625 (  0.14%)       9.6245 (  0.54%)      12.9340 (-33.66%)       9.5180 (  1.64%)
System     12       9.4720 (  0.00%)       9.7155 ( -2.57%)       9.8235 ( -3.71%)      14.4390 (-52.44%)       9.6475 ( -1.85%)
System     13      10.2250 (  0.00%)      10.2560 ( -0.30%)      10.2690 ( -0.43%)      15.0545 (-47.23%)      10.1920 (  0.32%)
System     14      10.6750 (  0.00%)      10.6535 (  0.20%)      10.5300 (  1.36%)      14.1645 (-32.69%)      10.6290 (  0.43%)
System     15      10.7360 (  0.00%)      10.7430 ( -0.07%)      10.8550 ( -1.11%)      16.0740 (-49.72%)      10.6950 (  0.38%)
System     16      11.2250 (  0.00%)      11.0270 (  1.76%)      11.2555 ( -0.27%)      14.7390 (-31.31%)      11.2240 (  0.01%)
System     17      11.7730 (  0.00%)      11.9705 ( -1.68%)      11.9325 ( -1.35%)      15.8845 (-34.92%)      11.8345 ( -0.52%)
System     18      12.3605 (  0.00%)      12.4940 ( -1.08%)      12.5050 ( -1.17%)      17.5500 (-41.98%)      12.3875 ( -0.22%)
System     19      12.8335 (  0.00%)      12.9170 ( -0.65%)      12.8095 (  0.19%)      16.3220 (-27.18%)      12.8060 (  0.21%)
System     20      13.3895 (  0.00%)      13.2975 (  0.69%)      13.1655 (  1.67%)      17.2120 (-28.55%)      13.1775 (  1.58%)
System     21      13.8665 (  0.00%)      13.9600 ( -0.67%)      13.9265 ( -0.43%)      17.0055 (-22.64%)      13.8585 (  0.06%)
System     22      14.6870 (  0.00%)      14.5585 (  0.87%)      14.5055 (  1.24%)      18.0800 (-23.10%)      14.7035 ( -0.11%)
System     23      15.0375 (  0.00%)      15.1435 ( -0.70%)      15.0815 ( -0.29%)      21.8590 (-45.36%)      15.0750 ( -0.25%)
System     24      15.5720 (  0.00%)      15.6860 ( -0.73%)      15.6425 ( -0.45%)      20.0280 (-28.62%)      15.5605 (  0.07%)
System     25      16.1850 (  0.00%)      16.2990 ( -0.70%)      16.2380 ( -0.33%)      19.7815 (-22.22%)      16.1590 (  0.16%)
System     26      16.7550 (  0.00%)      16.8390 ( -0.50%)      16.8285 ( -0.44%)      20.5915 (-22.90%)      16.7820 ( -0.16%)
System     27      17.3460 (  0.00%)      17.4390 ( -0.54%)      17.3790 ( -0.19%)      20.8030 (-19.93%)      17.3155 (  0.18%)
System     28      17.8385 (  0.00%)      17.9415 ( -0.58%)      17.9220 ( -0.47%)      20.1675 (-13.06%)      17.9130 ( -0.42%)
System     29      18.4795 (  0.00%)      18.6130 ( -0.72%)      18.5240 ( -0.24%)      20.9970 (-13.62%)      18.4685 (  0.06%)
System     30      19.1615 (  0.00%)      19.1630 ( -0.01%)      19.1145 (  0.25%)      21.4265 (-11.82%)      19.3905 ( -1.20%)
System     31      19.6885 (  0.00%)      20.0420 ( -1.80%)      19.6155 (  0.37%)      22.0990 (-12.24%)      19.6275 (  0.31%)
System     32      20.2815 (  0.00%)      20.2560 (  0.13%)      20.3200 ( -0.19%)      23.4670 (-15.71%)      20.2700 (  0.06%)
System     33      20.9190 (  0.00%)      20.9980 ( -0.38%)      20.9795 ( -0.29%)      22.6925 ( -8.48%)      21.0735 ( -0.74%)
System     34      21.6390 (  0.00%)      21.6170 (  0.10%)      21.5710 (  0.31%)      23.4910 ( -8.56%)      21.5180 (  0.56%)
System     35      22.5430 (  0.00%)      22.2740 (  1.19%)      22.2290 (  1.39%)      23.6340 ( -4.84%)      22.6760 ( -0.59%)
System     36      23.2625 (  0.00%)      22.8940 (  1.58%)      22.8545 (  1.75%)      25.6300 (-10.18%)      22.9365 (  1.40%)
System     37      23.6060 (  0.00%)      23.6410 ( -0.15%)      23.5440 (  0.26%)      25.1180 ( -6.41%)      23.6345 ( -0.12%)
System     38      24.4005 (  0.00%)      24.4825 ( -0.34%)      24.3875 (  0.05%)      25.9235 ( -6.24%)      24.3450 (  0.23%)
System     39      25.2360 (  0.00%)      25.0845 (  0.60%)      24.9805 (  1.01%)      27.0445 ( -7.17%)      25.0550 (  0.72%)
System     40      25.8580 (  0.00%)      25.8060 (  0.20%)      25.6715 (  0.72%)      26.6250 ( -2.97%)      25.8710 ( -0.05%)
System     41      26.4400 (  0.00%)      26.5940 ( -0.58%)      26.5045 ( -0.24%)      27.0575 ( -2.34%)      26.5910 ( -0.57%)
System     42      27.3705 (  0.00%)      27.3795 ( -0.03%)      27.2985 (  0.26%)      26.9820 (  1.42%)      27.3340 (  0.13%)
System     43      28.0780 (  0.00%)      28.1835 ( -0.38%)      28.0395 (  0.14%)      27.4950 (  2.08%)      28.0755 (  0.01%)
System     44      28.7705 (  0.00%)      29.0155 ( -0.85%)      28.7800 ( -0.03%)      28.2340 (  1.86%)      28.7445 (  0.09%)
System     45      29.5025 (  0.00%)      29.7590 ( -0.87%)      29.6000 ( -0.33%)      28.9635 (  1.83%)      29.4805 (  0.07%)
System     46      30.2505 (  0.00%)      30.5680 ( -1.05%)      30.2940 ( -0.14%)      29.5335 (  2.37%)      30.3190 ( -0.23%)
System     47      31.0195 (  0.00%)      31.3730 ( -1.14%)      31.1710 ( -0.49%)      29.2280 (  5.78%)      31.0505 ( -0.10%)
System     48      31.5845 (  0.00%)      32.0685 ( -1.53%)      31.8725 ( -0.91%)      29.6715 (  6.06%)      31.7560 ( -0.54%)
Elapsed    1        8.5845 (  0.00%)       8.9555 ( -4.32%)       8.7530 ( -1.96%)       8.5860 ( -0.02%)       8.7500 ( -1.93%)
Elapsed    2        4.3255 (  0.00%)       4.5035 ( -4.12%)       4.4180 ( -2.14%)       4.5275 ( -4.67%)       4.4025 ( -1.78%)
Elapsed    3        2.9835 (  0.00%)       3.0050 ( -0.72%)       3.0095 ( -0.87%)       3.1375 ( -5.16%)       2.9880 ( -0.15%)
Elapsed    4        2.2810 (  0.00%)       2.3130 ( -1.40%)       2.2915 ( -0.46%)       2.6595 (-16.59%)       2.2605 (  0.90%)
Elapsed    5        1.9100 (  0.00%)       1.9340 ( -1.26%)       1.9220 ( -0.63%)       2.2235 (-16.41%)       1.9145 ( -0.24%)
Elapsed    6        1.6625 (  0.00%)       1.6625 (  0.00%)       1.6700 ( -0.45%)       1.8840 (-13.32%)       1.6425 (  1.20%)
Elapsed    7        1.4180 (  0.00%)       1.4330 ( -1.06%)       1.4330 ( -1.06%)       1.7080 (-20.45%)       1.4235 ( -0.39%)
Elapsed    8        1.2550 (  0.00%)       1.2490 (  0.48%)       1.2525 (  0.20%)       1.5030 (-19.76%)       1.2280 (  2.15%)
Elapsed    9        1.1765 (  0.00%)       1.2040 ( -2.34%)       1.2000 ( -2.00%)       1.3990 (-18.91%)       1.1960 ( -1.66%)
Elapsed    10       1.1145 (  0.00%)       1.0905 (  2.15%)       1.1240 ( -0.85%)       1.5260 (-36.92%)       1.0990 (  1.39%)
Elapsed    11       1.0260 (  0.00%)       1.0095 (  1.61%)       1.0085 (  1.71%)       1.3505 (-31.63%)       1.0025 (  2.29%)
Elapsed    12       0.8830 (  0.00%)       0.9200 ( -4.19%)       0.9495 ( -7.53%)       1.4225 (-61.10%)       0.9290 ( -5.21%)
Elapsed    13       0.9360 (  0.00%)       0.9325 (  0.37%)       0.9320 (  0.43%)       1.4145 (-51.12%)       0.9225 (  1.44%)
Elapsed    14       0.8840 (  0.00%)       0.8975 ( -1.53%)       0.8725 (  1.30%)       1.3215 (-49.49%)       0.8900 ( -0.68%)
Elapsed    15       0.8210 (  0.00%)       0.8215 ( -0.06%)       0.8315 ( -1.28%)       1.4400 (-75.40%)       0.8205 (  0.06%)
Elapsed    16       0.8110 (  0.00%)       0.7710 (  4.93%)       0.8070 (  0.49%)       1.2700 (-56.60%)       0.8090 (  0.25%)
Elapsed    17       0.8340 (  0.00%)       0.8470 ( -1.56%)       0.8350 ( -0.12%)       1.3785 (-65.29%)       0.8330 (  0.12%)
Elapsed    18       0.8045 (  0.00%)       0.8140 ( -1.18%)       0.8015 (  0.37%)       1.5235 (-89.37%)       0.8065 ( -0.25%)
Elapsed    19       0.7600 (  0.00%)       0.7650 ( -0.66%)       0.7525 (  0.99%)       1.3085 (-72.17%)       0.7690 ( -1.18%)
Elapsed    20       0.7495 (  0.00%)       0.7435 (  0.80%)       0.7140 (  4.74%)       1.3520 (-80.39%)       0.7255 (  3.20%)
Elapsed    21       0.8010 (  0.00%)       0.8050 ( -0.50%)       0.7940 (  0.87%)       1.3145 (-64.11%)       0.8040 ( -0.37%)
Elapsed    22       0.7910 (  0.00%)       0.7790 (  1.52%)       0.7625 (  3.60%)       1.3560 (-71.43%)       0.7960 ( -0.63%)
Elapsed    23       0.7630 (  0.00%)       0.7700 ( -0.92%)       0.7470 (  2.10%)       1.7525 (-129.69%)       0.7640 ( -0.13%)
Elapsed    24       0.7470 (  0.00%)       0.7385 (  1.14%)       0.7165 (  4.08%)       1.4650 (-96.12%)       0.7625 ( -2.07%)
Elapsed    25       0.8470 (  0.00%)       0.8760 ( -3.42%)       0.8620 ( -1.77%)       1.4205 (-67.71%)       0.8420 (  0.59%)
Elapsed    26       0.8235 (  0.00%)       0.8295 ( -0.73%)       0.8275 ( -0.49%)       1.4475 (-75.77%)       0.8325 ( -1.09%)
Elapsed    27       0.8130 (  0.00%)       0.8110 (  0.25%)       0.8085 (  0.55%)       1.4050 (-72.82%)       0.8090 (  0.49%)
Elapsed    28       0.7815 (  0.00%)       0.7815 (  0.00%)       0.7790 (  0.32%)       1.2610 (-61.36%)       0.8040 ( -2.88%)
Elapsed    29       0.7955 (  0.00%)       0.7930 (  0.31%)       0.7830 (  1.57%)       1.3340 (-67.69%)       0.7960 ( -0.06%)
Elapsed    30       0.7930 (  0.00%)       0.7820 (  1.39%)       0.7750 (  2.27%)       1.2825 (-61.73%)       0.8050 ( -1.51%)
Elapsed    31       0.7790 (  0.00%)       0.7895 ( -1.35%)       0.7670 (  1.54%)       1.3090 (-68.04%)       0.7865 ( -0.96%)
Elapsed    32       0.7800 (  0.00%)       0.7590 (  2.69%)       0.7665 (  1.73%)       1.4570 (-86.79%)       0.7905 ( -1.35%)
Elapsed    33       0.7690 (  0.00%)       0.7690 (  0.00%)       0.7795 ( -1.37%)       1.2140 (-57.87%)       0.7975 ( -3.71%)
Elapsed    34       0.7840 (  0.00%)       0.7695 (  1.85%)       0.7580 (  3.32%)       1.2925 (-64.86%)       0.7715 (  1.59%)
Elapsed    35       0.7890 (  0.00%)       0.7635 (  3.23%)       0.7600 (  3.68%)       1.1940 (-51.33%)       0.7915 ( -0.32%)
Elapsed    36       0.7995 (  0.00%)       0.7515 (  6.00%)       0.7560 (  5.44%)       1.5095 (-88.81%)       0.7685 (  3.88%)
Elapsed    37       0.7720 (  0.00%)       0.7560 (  2.07%)       0.7390 (  4.27%)       1.2695 (-64.44%)       0.7690 (  0.39%)
Elapsed    38       0.7775 (  0.00%)       0.7735 (  0.51%)       0.7835 ( -0.77%)       1.3650 (-75.56%)       0.7755 (  0.26%)
Elapsed    39       0.7790 (  0.00%)       0.7645 (  1.86%)       0.7660 (  1.67%)       1.4875 (-90.95%)       0.7710 (  1.03%)
Elapsed    40       0.7710 (  0.00%)       0.7515 (  2.53%)       0.7495 (  2.79%)       1.3105 (-69.97%)       0.7865 ( -2.01%)
Elapsed    41       0.7730 (  0.00%)       0.7650 (  1.03%)       0.7560 (  2.20%)       1.1790 (-52.52%)       0.7785 ( -0.71%)
Elapsed    42       0.7725 (  0.00%)       0.7725 ( -0.00%)       0.7695 (  0.39%)       1.0820 (-40.06%)       0.7875 ( -1.94%)
Elapsed    43       0.7760 (  0.00%)       0.7625 (  1.74%)       0.7705 (  0.71%)       1.0720 (-38.14%)       0.7755 (  0.06%)
Elapsed    44       0.7595 (  0.00%)       0.7570 (  0.33%)       0.7470 (  1.65%)       1.1220 (-47.73%)       0.7565 (  0.39%)
Elapsed    45       0.7600 (  0.00%)       0.7485 (  1.51%)       0.7440 (  2.11%)       1.2230 (-60.92%)       0.7520 (  1.05%)
Elapsed    46       0.7600 (  0.00%)       0.7620 ( -0.26%)       0.7680 ( -1.05%)       1.1900 (-56.58%)       0.7650 ( -0.66%)
Elapsed    47       0.7620 (  0.00%)       0.7690 ( -0.92%)       0.7645 ( -0.33%)       1.0135 (-33.01%)       0.7755 ( -1.77%)
Elapsed    48       0.7565 (  0.00%)       0.7645 ( -1.06%)       0.7500 (  0.86%)       1.0880 (-43.82%)       0.7560 (  0.07%)
Faults/cpu 1   385797.2651 (  0.00%)  370009.1820 ( -4.09%)  378447.2470 ( -1.91%)  385558.9224 ( -0.06%)  377688.6771 ( -2.10%)
Faults/cpu 2   386867.0633 (  0.00%)  372835.1850 ( -3.63%)  378839.1779 ( -2.08%)  370354.5700 ( -4.27%)  379344.3060 ( -1.94%)
Faults/cpu 3   374747.5776 (  0.00%)  371157.5386 ( -0.96%)  371260.9744 ( -0.93%)  356079.5284 ( -4.98%)  373118.1340 ( -0.43%)
Faults/cpu 4   369360.9848 (  0.00%)  363677.6646 ( -1.54%)  366267.6219 ( -0.84%)  317331.0922 (-14.09%)  369773.6799 (  0.11%)
Faults/cpu 5   357210.2234 (  0.00%)  352037.7022 ( -1.45%)  353371.1702 ( -1.07%)  310278.3499 (-13.14%)  355518.4149 ( -0.47%)
Faults/cpu 6   343452.0730 (  0.00%)  343300.6903 ( -0.04%)  340794.7856 ( -0.77%)  307986.8198 (-10.33%)  345375.2079 (  0.56%)
Faults/cpu 7   344267.5779 (  0.00%)  343135.2329 ( -0.33%)  341333.2393 ( -0.85%)  294837.4550 (-14.36%)  342875.7646 ( -0.40%)
Faults/cpu 8   344729.2311 (  0.00%)  342392.7834 ( -0.68%)  342100.6461 ( -0.76%)  292681.8453 (-15.10%)  346478.0993 (  0.51%)
Faults/cpu 9   331315.0698 (  0.00%)  326679.1633 ( -1.40%)  328379.3401 ( -0.89%)  277153.4340 (-16.35%)  325584.9047 ( -1.73%)
Faults/cpu 10  315686.3870 (  0.00%)  323588.4396 (  2.50%)  318443.3297 (  0.87%)  248109.7147 (-21.41%)  320327.6421 (  1.47%)
Faults/cpu 11  314434.9205 (  0.00%)  314943.7061 (  0.16%)  316982.1353 (  0.81%)  252301.2616 (-19.76%)  319064.9472 (  1.47%)
Faults/cpu 12  323781.3933 (  0.00%)  314494.0025 ( -2.87%)  310998.5164 ( -3.95%)  223380.8262 (-31.01%)  315265.1156 ( -2.63%)
Faults/cpu 13  298348.7601 (  0.00%)  298278.7461 ( -0.02%)  298425.8366 (  0.03%)  222713.6991 (-25.35%)  299557.7287 (  0.41%)
Faults/cpu 14  286932.6713 (  0.00%)  287489.8255 (  0.19%)  291403.3643 (  1.56%)  221548.5814 (-22.79%)  288096.1843 (  0.41%)
Faults/cpu 15  286722.6127 (  0.00%)  287065.1515 (  0.12%)  283720.5742 ( -1.05%)  207401.8106 (-27.66%)  287383.0602 (  0.23%)
Faults/cpu 16  273295.6746 (  0.00%)  280964.9626 (  2.81%)  272915.0388 ( -0.14%)  211096.7291 (-22.76%)  272681.3514 ( -0.22%)
Faults/cpu 17  260911.0336 (  0.00%)  256209.7475 ( -1.80%)  257178.1940 ( -1.43%)  197018.3121 (-24.49%)  258463.4380 ( -0.94%)
Faults/cpu 18  248932.2863 (  0.00%)  246396.4580 ( -1.02%)  244519.0507 ( -1.77%)  186688.3694 (-25.00%)  247319.4065 ( -0.65%)
Faults/cpu 19  240209.9640 (  0.00%)  239694.9962 ( -0.21%)  241668.2142 (  0.61%)  196175.4561 (-18.33%)  241274.5498 (  0.44%)
Faults/cpu 20  233606.9640 (  0.00%)  234567.9983 (  0.41%)  237101.6767 (  1.50%)  183173.7872 (-21.59%)  236261.5842 (  1.14%)
Faults/cpu 21  222450.9746 (  0.00%)  222763.7028 (  0.14%)  222768.7097 (  0.14%)  184378.4972 (-17.11%)  222502.3479 (  0.02%)
Faults/cpu 22  212470.6526 (  0.00%)  213873.2886 (  0.66%)  214961.4201 (  1.17%)  174322.1760 (-17.95%)  211664.6775 ( -0.38%)
Faults/cpu 23  208049.7330 (  0.00%)  206332.3011 ( -0.83%)  207950.2886 ( -0.05%)  153090.7052 (-26.42%)  207247.1747 ( -0.39%)
Faults/cpu 24  201679.6538 (  0.00%)  199995.7393 ( -0.83%)  201550.5940 ( -0.06%)  160410.2463 (-20.46%)  200847.7533 ( -0.41%)
Faults/cpu 25  192498.2867 (  0.00%)  190214.9558 ( -1.19%)  191517.8214 ( -0.51%)  161893.8849 (-15.90%)  192452.0624 ( -0.02%)
Faults/cpu 26  186665.4593 (  0.00%)  185379.9941 ( -0.69%)  186275.6707 ( -0.21%)  157300.4300 (-15.73%)  185979.4752 ( -0.37%)
Faults/cpu 27  181066.1031 (  0.00%)  179912.4533 ( -0.64%)  181050.8398 ( -0.01%)  154603.8416 (-14.61%)  180950.5179 ( -0.06%)
Faults/cpu 28  176715.0459 (  0.00%)  175829.3044 ( -0.50%)  176349.1828 ( -0.21%)  156364.7159 (-11.52%)  175741.4427 ( -0.55%)
Faults/cpu 29  170851.0284 (  0.00%)  169571.4289 ( -0.75%)  170335.5602 ( -0.30%)  150767.5486 (-11.75%)  169718.5679 ( -0.66%)
Faults/cpu 30  164787.7507 (  0.00%)  164617.4073 ( -0.10%)  165217.0889 (  0.26%)  147529.2740 (-10.47%)  162722.1423 ( -1.25%)
Faults/cpu 31  160430.2907 (  0.00%)  158345.5797 ( -1.30%)  161176.8077 (  0.47%)  144067.5971 (-10.20%)  160040.1744 ( -0.24%)
Faults/cpu 32  155790.5891 (  0.00%)  156267.8212 (  0.31%)  156027.3350 (  0.15%)  137157.8782 (-11.96%)  155208.7521 ( -0.37%)
Faults/cpu 33  151635.0920 (  0.00%)  150804.1899 ( -0.55%)  150660.8347 ( -0.64%)  139846.4798 ( -7.77%)  149742.2915 ( -1.25%)
Faults/cpu 34  146038.4974 (  0.00%)  146271.6506 (  0.16%)  146751.1769 (  0.49%)  135764.9435 ( -7.03%)  146394.1750 (  0.24%)
Faults/cpu 35  140732.3535 (  0.00%)  141685.1152 (  0.68%)  142430.2661 (  1.21%)  133655.4438 ( -5.03%)  139784.6519 ( -0.67%)
Faults/cpu 36  136166.1072 (  0.00%)  138214.4111 (  1.50%)  138620.7831 (  1.80%)  125246.5874 ( -8.02%)  137717.0072 (  1.14%)
Faults/cpu 37  133937.2426 (  0.00%)  133987.5859 (  0.04%)  134867.5627 (  0.69%)  126717.8556 ( -5.39%)  133528.6312 ( -0.31%)
Faults/cpu 38  129806.2879 (  0.00%)  129484.1027 ( -0.25%)  129867.2448 (  0.05%)  123151.6348 ( -5.13%)  129684.0139 ( -0.09%)
Faults/cpu 39  125884.7197 (  0.00%)  126292.1515 (  0.32%)  126882.4466 (  0.79%)  118738.0142 ( -5.68%)  126367.3210 (  0.38%)
Faults/cpu 40  122900.8619 (  0.00%)  123025.8443 (  0.10%)  123587.2927 (  0.56%)  119741.7381 ( -2.57%)  122339.5679 ( -0.46%)
Faults/cpu 41  119576.6679 (  0.00%)  119415.0157 ( -0.14%)  120053.6669 (  0.40%)  117525.4248 ( -1.72%)  119084.8607 ( -0.41%)
Faults/cpu 42  115959.5010 (  0.00%)  115737.9229 ( -0.19%)  116300.5934 (  0.29%)  117496.2246 (  1.33%)  115714.2527 ( -0.21%)
Faults/cpu 43  113100.2406 (  0.00%)  112783.1393 ( -0.28%)  113218.7913 (  0.10%)  115384.3053 (  2.02%)  112956.7744 ( -0.13%)
Faults/cpu 44  110685.8811 (  0.00%)  109868.9324 ( -0.74%)  110723.3392 (  0.03%)  112456.8917 (  1.60%)  110614.4621 ( -0.06%)
Faults/cpu 45  108052.3179 (  0.00%)  107211.6811 ( -0.78%)  107942.8820 ( -0.10%)  109776.3508 (  1.60%)  107751.8875 ( -0.28%)
Faults/cpu 46  105486.0553 (  0.00%)  104438.6404 ( -0.99%)  105227.3887 ( -0.25%)  107794.7237 (  2.19%)  105004.5390 ( -0.46%)
Faults/cpu 47  102974.9203 (  0.00%)  101941.3971 ( -1.00%)  102383.4542 ( -0.57%)  108478.5733 (  5.34%)  102722.9070 ( -0.24%)
Faults/cpu 48  101380.6453 (  0.00%)   99779.1561 ( -1.58%)  100509.2559 ( -0.86%)  107012.5013 (  5.56%)  100575.1345 ( -0.79%)
Faults/sec 1   385061.9036 (  0.00%)  369180.6824 ( -4.12%)  377729.1950 ( -1.90%)  384819.2327 ( -0.06%)  376970.8477 ( -2.10%)
Faults/sec 2   764226.0132 (  0.00%)  734211.0512 ( -3.93%)  748384.7009 ( -2.07%)  730072.8971 ( -4.47%)  749269.4182 ( -1.96%)
Faults/sec 3  1107501.3719 (  0.00%) 1100252.0430 ( -0.65%) 1097892.0288 ( -0.87%) 1053883.6349 ( -4.84%) 1104152.6367 ( -0.30%)
Faults/sec 4  1449626.8352 (  0.00%) 1429699.7936 ( -1.37%) 1442112.3242 ( -0.52%) 1256242.1498 (-13.34%) 1459113.6594 (  0.65%)
Faults/sec 5  1732002.4008 (  0.00%) 1709515.9035 ( -1.30%) 1720035.6650 ( -0.69%) 1520601.0948 (-12.21%) 1723793.1758 ( -0.47%)
Faults/sec 6  1987676.5357 (  0.00%) 1988958.7810 (  0.06%) 1978458.8821 ( -0.46%) 1787818.8176 (-10.05%) 2008881.3247 (  1.07%)
Faults/sec 7  2333954.1295 (  0.00%) 2306792.1741 ( -1.16%) 2308590.2310 ( -1.09%) 1981176.0202 (-15.12%) 2321139.7761 ( -0.55%)
Faults/sec 8  2634079.1969 (  0.00%) 2648701.1517 (  0.56%) 2639066.6993 (  0.19%) 2262332.1874 (-14.11%) 2688602.1301 (  2.07%)
Faults/sec 9  2812978.3108 (  0.00%) 2750677.5523 ( -2.21%) 2758799.5992 ( -1.93%) 2369151.0162 (-15.78%) 2764307.6039 ( -1.73%)
Faults/sec 10 2972098.8912 (  0.00%) 3040290.9285 (  2.29%) 2947819.8726 ( -0.82%) 2326833.4239 (-21.71%) 3004154.6467 (  1.08%)
Faults/sec 11 3229763.9751 (  0.00%) 3276271.4629 (  1.44%) 3284407.0479 (  1.69%) 2589504.6882 (-19.82%) 3292091.4135 (  1.93%)
Faults/sec 12 3750563.1088 (  0.00%) 3610624.8943 ( -3.73%) 3485259.8243 ( -7.07%) 2427789.3718 (-35.27%) 3558548.4222 ( -5.12%)
Faults/sec 13 3527986.1494 (  0.00%) 3547764.9300 (  0.56%) 3554466.5848 (  0.75%) 2545322.0856 (-27.85%) 3578525.4813 (  1.43%)
Faults/sec 14 3744462.9757 (  0.00%) 3695499.8728 ( -1.31%) 3792594.5449 (  1.29%) 2554380.1216 (-31.78%) 3710059.3946 ( -0.92%)
Faults/sec 15 4035725.6748 (  0.00%) 4028300.7944 ( -0.18%) 3977239.8857 ( -1.45%) 2512491.8119 (-37.74%) 4024326.1784 ( -0.28%)
Faults/sec 16 4082775.7769 (  0.00%) 4305014.9042 (  5.44%) 4087407.6182 (  0.11%) 2611306.0114 (-36.04%) 4089621.8672 (  0.17%)
Faults/sec 17 3961506.5231 (  0.00%) 3911699.0614 ( -1.26%) 3955307.9095 ( -0.16%) 2435984.2387 (-38.51%) 3954971.6225 ( -0.16%)
Faults/sec 18 4109294.2629 (  0.00%) 4063567.0776 ( -1.11%) 4122394.7311 (  0.32%) 2324571.5799 (-43.43%) 4100267.0651 ( -0.22%)
Faults/sec 19 4352529.7769 (  0.00%) 4320169.3481 ( -0.74%) 4390034.0874 (  0.86%) 2652463.0170 (-39.06%) 4299145.4846 ( -1.23%)
Faults/sec 20 4445682.4720 (  0.00%) 4460703.5633 (  0.34%) 4619212.3458 (  3.90%) 2498266.4820 (-43.80%) 4545455.4580 (  2.24%)
Faults/sec 21 4131984.5843 (  0.00%) 4118395.9252 ( -0.33%) 4162005.7007 (  0.73%) 2540251.8566 (-38.52%) 4111807.9229 ( -0.49%)
Faults/sec 22 4188729.2437 (  0.00%) 4251815.6793 (  1.51%) 4344514.0758 (  3.72%) 2470008.1414 (-41.03%) 4176667.1190 ( -0.29%)
Faults/sec 23 4350539.7029 (  0.00%) 4305851.5811 ( -1.03%) 4445627.9803 (  2.19%) 2100441.5050 (-51.72%) 4315733.9925 ( -0.80%)
Faults/sec 24 4432124.0788 (  0.00%) 4486121.1678 (  1.22%) 4617892.4194 (  4.19%) 2334966.8294 (-47.32%) 4341933.4781 ( -2.03%)
Faults/sec 25 3905711.5675 (  0.00%) 3780279.3884 ( -3.21%) 3835315.4941 ( -1.80%) 2450342.7439 (-37.26%) 3919793.7148 (  0.36%)
Faults/sec 26 4014336.5674 (  0.00%) 3983206.8788 ( -0.78%) 4001945.8549 ( -0.31%) 2473461.2355 (-38.38%) 3964715.6542 ( -1.24%)
Faults/sec 27 4065466.2002 (  0.00%) 4072036.9033 (  0.16%) 4091477.7799 (  0.64%) 2486770.5426 (-38.83%) 4079939.4587 (  0.36%)
Faults/sec 28 4228877.7902 (  0.00%) 4221680.2792 ( -0.17%) 4249396.4296 (  0.49%) 2637206.2964 (-37.64%) 4115256.3394 ( -2.69%)
Faults/sec 29 4165815.5539 (  0.00%) 4181083.7955 (  0.37%) 4219026.9458 (  1.28%) 2541629.2498 (-38.99%) 4143539.5645 ( -0.53%)
Faults/sec 30 4172433.9028 (  0.00%) 4232788.6280 (  1.45%) 4273745.8569 (  2.43%) 2602412.4120 (-37.63%) 4111186.6312 ( -1.47%)
Faults/sec 31 4240172.5042 (  0.00%) 4202647.6180 ( -0.88%) 4307691.5000 (  1.59%) 2628430.8761 (-38.01%) 4198363.3015 ( -0.99%)
Faults/sec 32 4252574.8449 (  0.00%) 4368813.9111 (  2.73%) 4311615.6983 (  1.39%) 2508987.6152 (-41.00%) 4180702.4236 ( -1.69%)
Faults/sec 33 4314834.5142 (  0.00%) 4300479.9148 ( -0.33%) 4253617.4375 ( -1.42%) 2768165.8763 (-35.85%) 4149830.2426 ( -3.82%)
Faults/sec 34 4221722.1201 (  0.00%) 4305669.7970 (  1.99%) 4370779.7502 (  3.53%) 2664936.7912 (-36.88%) 4281006.2531 (  1.40%)
Faults/sec 35 4202390.0406 (  0.00%) 4329839.4498 (  3.03%) 4347794.6136 (  3.46%) 2778155.9528 (-33.89%) 4173135.9166 ( -0.70%)
Faults/sec 36 4165128.7518 (  0.00%) 4401592.8642 (  5.68%) 4380704.7567 (  5.18%) 2457370.2456 (-41.00%) 4299049.1481 (  3.22%)
Faults/sec 37 4292251.7710 (  0.00%) 4366957.1507 (  1.74%) 4486600.4386 (  4.53%) 2734551.4751 (-36.29%) 4299181.8790 (  0.16%)
Faults/sec 38 4260325.3309 (  0.00%) 4284727.5988 (  0.57%) 4239258.8528 ( -0.49%) 2778317.8830 (-34.79%) 4259795.0447 ( -0.01%)
Faults/sec 39 4247599.0136 (  0.00%) 4327326.6158 (  1.88%) 4318635.2121 (  1.67%) 2583097.0121 (-39.19%) 4289769.9908 (  0.99%)
Faults/sec 40 4291155.1100 (  0.00%) 4396284.3000 (  2.45%) 4407292.9279 (  2.71%) 2777111.1971 (-35.28%) 4206468.2232 ( -1.97%)
Faults/sec 41 4292508.3942 (  0.00%) 4330351.2177 (  0.88%) 4370343.8776 (  1.81%) 2957005.3922 (-31.11%) 4248354.1350 ( -1.03%)
Faults/sec 42 4287561.7374 (  0.00%) 4289134.7484 (  0.04%) 4306375.5332 (  0.44%) 3066909.7047 (-28.47%) 4188926.4606 ( -2.30%)
Faults/sec 43 4265879.9967 (  0.00%) 4333505.6536 (  1.59%) 4299079.1884 (  0.78%) 3100949.2312 (-27.31%) 4262288.5979 ( -0.08%)
Faults/sec 44 4356451.0096 (  0.00%) 4365045.2799 (  0.20%) 4433069.0671 (  1.76%) 3140365.7636 (-27.91%) 4363013.1609 (  0.15%)
Faults/sec 45 4348783.2119 (  0.00%) 4414799.2678 (  1.52%) 4449419.1956 (  2.31%) 2980161.5064 (-31.47%) 4402440.2661 (  1.23%)
Faults/sec 46 4351251.6124 (  0.00%) 4347660.5404 ( -0.08%) 4305854.6178 ( -1.04%) 3065424.1546 (-29.55%) 4316402.4842 ( -0.80%)
Faults/sec 47 4339586.8853 (  0.00%) 4303221.9106 ( -0.84%) 4324971.9936 ( -0.34%) 3270057.9687 (-24.65%) 4264177.3478 ( -1.74%)
Faults/sec 48 4370579.8512 (  0.00%) 4328591.3197 ( -0.96%) 4392533.0247 (  0.50%) 3237854.7263 (-25.92%) 4367322.6197 ( -0.07%)

Nothing new to report here.


So overall the system CPU usage is much improved and the !THP case is also
better. It would be very interesting to know how worklets are accounted
for to make sure the system CPU usage is not just being hidden.

While the !THP case has improved, there are still a number of cases where
numacore is doing worse than mainline, whereas balancenuma does "all
right" for the most part.

I have no explanation as to why I see such different results. specjbb I
can sort of understand because I'm running for ranges of warehouses instead
of just the peak, but even the peak results are different, which is harder
to explain. The autonumabench scores are harder to reconcile.

As before, all the scripts used are from mmtests and I released 0.08 to be
sure it was up to date. I can post the configs used if you want to check
whether there is something completely screwed in the scripts but even if
there is, all the tested kernels got screwed in the same way.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 00/10] Latest numa/core release, v18
  2012-11-30 20:37 ` [PATCH 00/10] Latest numa/core release, v18 Linus Torvalds
  2012-12-01  9:49   ` [RFC PATCH] mm/migration: Don't lock anon vmas in rmap_walk_anon() Ingo Molnar
@ 2012-12-03 13:41   ` Mel Gorman
  2012-12-04 17:30     ` Thomas Gleixner
  1 sibling, 1 reply; 39+ messages in thread
From: Mel Gorman @ 2012-12-03 13:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Linux Kernel Mailing List, linux-mm, Peter Zijlstra,
	Paul Turner, Lee Schermerhorn, Christoph Lameter, Rik van Riel,
	Andrew Morton, Andrea Arcangeli, Thomas Gleixner,
	Johannes Weiner, Hugh Dickins

On Fri, Nov 30, 2012 at 12:37:49PM -0800, Linus Torvalds wrote:
> On Fri, Nov 30, 2012 at 11:58 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > When pushed hard enough via threaded workloads (for example via the
> > numa02 test) then the upstream page migration code in mm/migration.c
> > becomes unscalable, resulting in lot of scheduling on the anon vma
> > mutex and a subsequent drop in performance.
> 
> Ugh.
> 
> I wonder if migration really needs that thing to be a mutex? I may be
> wrong, but the anon_vma lock only protects the actual rmap chains, and
> migration only ever changes the pte *contents*, not the actual chains
> of pte's themselves, right?
> 

Pretty much. As far as migration is concerned, all that is critical is
that it finds all the old migration ptes and restores them. If any of them
are missed then it will likely BUG later when the page is faulted. If a
process happened to exit while the anon_vma mutex was not held and the
migration pte and anon_vma disappeared during migration, it would not
matter as such. If the protection were an rwsem then migration might cause
delays in a parallel unmap or exit until the migration completed, but I
doubt it would ever be noticed.
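
To put that invariant in concrete terms, here is a stripped-down
userspace model (not kernel code; every name in it is invented for
illustration): migration parks each mapping behind a marker, and the
restore walk has to visit every slot, otherwise a later access trips an
assertion, much as a missed migration pte would BUG on the next fault.

#include <assert.h>
#include <stdio.h>

#define NENTRIES		8
#define MIGRATION_MARKER	(~0UL)	/* stands in for a migration pte */

static unsigned long table[NENTRIES];	/* the "ptes" mapping one page */

/* Start of migration: every mapping is replaced by a marker. */
static void install_migration_entries(void)
{
        for (int i = 0; i < NENTRIES; i++)
                table[i] = MIGRATION_MARKER;
}

/* End of migration: the walk must visit every slot. */
static void remove_migration_entries(unsigned long new_mapping, int miss_one)
{
        for (int i = 0; i < NENTRIES; i++) {
                if (miss_one && i == 3)
                        continue;	/* a slot the walk failed to find */
                table[i] = new_mapping;
        }
}

/* A later "fault": a leftover marker is a bug, not a recoverable state. */
static void fault_in(int i)
{
        assert(table[i] != MIGRATION_MARKER);
}

int main(void)
{
        install_migration_entries();
        remove_migration_entries(0x1234, 0);	/* complete walk: faults are fine */
        for (int i = 0; i < NENTRIES; i++)
                fault_in(i);

        install_migration_entries();
        remove_migration_entries(0x5678, 1);	/* one entry missed ...       */
        fault_in(3);				/* ... the next fault asserts */
        printf("not reached\n");
        return 0;
}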

> So if this is a migration-specific scalability issue, then it might be
> possible to solve by making the mutex be a rwsem instead, and have
> migration only take it for reading.
> 
> Of course, I'm quite possibly wrong, and the code depends on full
> mutual exclusion.
> 
> Just a thought, in case it makes somebody go "Hmm.."
> 

Offhand, I cannot think of a reason why an rwsem would not work. This
thing originally became a mutex because the RT people (Peter in
particular) cared about being able to preempt faster. It'd be nice if
they confirmed that an rwsem would not be a problem for them.
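
For a feel of why read-locking helps here, a minimal userspace sketch
(POSIX rwlocks standing in for the kernel rwsem, so nothing below is
kernel API): several migration-style walkers hold the lock shared at
the same time, while a chain modifier still gets exclusive access.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t anon_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Migration-style walker: only reads the chain, a shared lock is enough. */
static void *walker(void *arg)
{
        pthread_rwlock_rdlock(&anon_lock);
        usleep(1000);			/* pretend to walk the vma chain */
        pthread_rwlock_unlock(&anon_lock);
        return NULL;
}

/* Chain modifier (fork/mmap style): still takes the lock exclusively. */
static void *modifier(void *arg)
{
        pthread_rwlock_wrlock(&anon_lock);
        /* add or remove chain entries here */
        pthread_rwlock_unlock(&anon_lock);
        return NULL;
}

int main(void)
{
        pthread_t w[8], m;

        for (int i = 0; i < 8; i++)
                pthread_create(&w[i], NULL, walker, NULL);
        pthread_create(&m, NULL, modifier, NULL);

        for (int i = 0; i < 8; i++)
                pthread_join(w[i], NULL);
        pthread_join(m, NULL);

        printf("walkers ran concurrently; the modifier ran alone\n");
        return 0;
}

Built with "cc -pthread", the eight walkers overlap instead of queueing
behind a single mutex, which is the whole of the scalability argument.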

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 1/2] mm/rmap: Convert the struct anon_vma::mutex to an rwsem
  2012-12-01 20:10             ` [PATCH 1/2] mm/rmap: Convert the struct anon_vma::mutex to an rwsem Ingo Molnar
  2012-12-01 20:19               ` Rik van Riel
@ 2012-12-03 13:59               ` Mel Gorman
  1 sibling, 0 replies; 39+ messages in thread
From: Mel Gorman @ 2012-12-03 13:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-mm,
	Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Andrew Morton, Andrea Arcangeli, Thomas Gleixner,
	Johannes Weiner, Hugh Dickins

On Sat, Dec 01, 2012 at 09:10:30PM +0100, Ingo Molnar wrote:
> 
> Convert the struct anon_vma::mutex to an rwsem, which will help
> in solving a page-migration scalability problem. (Addressed in
> a separate patch.)
> 
> The conversion is simple and straightforward: in every case
> where we mutex_lock()ed we'll now down_write().
> 
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Hugh Dickins <hughd@google.com>
> Signed-off-by: Ingo Molnar <mingo@kernel.org>

Confirmation from the RT people that they're ok with this would be nice
but otherwise

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 2/2] mm/migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
  2012-12-01 20:15             ` [PATCH 2/2] mm/migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable Ingo Molnar
  2012-12-01 20:33               ` Rik van Riel
@ 2012-12-03 14:17               ` Mel Gorman
  2012-12-04 14:37                 ` Michel Lespinasse
  1 sibling, 1 reply; 39+ messages in thread
From: Mel Gorman @ 2012-12-03 14:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Linux Kernel Mailing List, linux-mm,
	Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Andrew Morton, Andrea Arcangeli, Thomas Gleixner,
	Johannes Weiner, Hugh Dickins

On Sat, Dec 01, 2012 at 09:15:38PM +0100, Ingo Molnar wrote:
> 
> Note, with this optimization I went farther than the
> boundaries of the migration code - it seemed worthwhile to do and
> I've reviewed all the other users of page_lock_anon_vma() as
> well and none seemed to be modifying the list inside that lock.
> 
> Please review this patch carefully - in particular the SMP races 
> outlined in anon_vma_free() are exciting: I have updated the 
> reasoning and it still appears to hold, but please double check 
> the changes nevertheless ...
> 
> Thanks,
> 
> 	Ingo
> 
> ------------------->
> From: Ingo Molnar <mingo@kernel.org>
> Date: Sat Dec 1 20:43:04 CET 2012
> 
> rmap_walk_anon() and try_to_unmap_anon() appear to be too careful
> about locking the anon vma: while they need protection against anon
> vma list modifications, they do not need exclusive access to the
> list itself.
> 
> Transforming this exclusive lock to a read-locked rwsem removes a
> global lock from the hot path of page-migration intense threaded
> workloads which can cause pathological performance like this:
> 
>     96.43%        process 0  [kernel.kallsyms]  [k] perf_trace_sched_switch
>                   |
>                   --- perf_trace_sched_switch
>                       __schedule
>                       schedule
>                       schedule_preempt_disabled
>                       __mutex_lock_common.isra.6
>                       __mutex_lock_slowpath
>                       mutex_lock
>                      |
>                      |--50.61%-- rmap_walk
>                      |          move_to_new_page
>                      |          migrate_pages
>                      |          migrate_misplaced_page
>                      |          __do_numa_page.isra.69
>                      |          handle_pte_fault
>                      |          handle_mm_fault
>                      |          __do_page_fault
>                      |          do_page_fault
>                      |          page_fault
>                      |          __memset_sse2
>                      |          |
>                      |           --100.00%-- worker_thread
>                      |                     |
>                      |                      --100.00%-- start_thread
>                      |
>                       --49.39%-- page_lock_anon_vma
>                                 try_to_unmap_anon
>                                 try_to_unmap
>                                 migrate_pages
>                                 migrate_misplaced_page
>                                 __do_numa_page.isra.69
>                                 handle_pte_fault
>                                 handle_mm_fault
>                                 __do_page_fault
>                                 do_page_fault
>                                 page_fault
>                                 __memset_sse2
>                                 |
>                                  --100.00%-- worker_thread
>                                            start_thread
> 
> With this change applied the profile is now nicely flat
> and there's no anon-vma related scheduling/blocking.
> 
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Hugh Dickins <hughd@google.com>
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  include/linux/rmap.h |   15 +++++++++++++--
>  mm/huge_memory.c     |    4 ++--
>  mm/memory-failure.c  |    4 ++--
>  mm/migrate.c         |    2 +-
>  mm/rmap.c            |   40 ++++++++++++++++++++--------------------
>  5 files changed, 38 insertions(+), 27 deletions(-)
> 
> Index: linux/include/linux/rmap.h
> ===================================================================
> --- linux.orig/include/linux/rmap.h
> +++ linux/include/linux/rmap.h
> @@ -128,6 +128,17 @@ static inline void anon_vma_unlock(struc
>  	up_write(&anon_vma->root->rwsem);
>  }
>  
> +static inline void anon_vma_lock_read(struct anon_vma *anon_vma)
> +{
> +	down_read(&anon_vma->root->rwsem);
> +}
> +
> +static inline void anon_vma_unlock_read(struct anon_vma *anon_vma)
> +{
> +	up_read(&anon_vma->root->rwsem);
> +}
> +
> +
>  /*
>   * anon_vma helper functions.
>   */
> @@ -220,8 +231,8 @@ int try_to_munlock(struct page *);
>  /*
>   * Called by memory-failure.c to kill processes.
>   */
> -struct anon_vma *page_lock_anon_vma(struct page *page);
> -void page_unlock_anon_vma(struct anon_vma *anon_vma);
> +struct anon_vma *page_lock_anon_vma_read(struct page *page);
> +void page_unlock_anon_vma_read(struct anon_vma *anon_vma);
>  int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma);
>  
>  /*
> Index: linux/mm/huge_memory.c
> ===================================================================
> --- linux.orig/mm/huge_memory.c
> +++ linux/mm/huge_memory.c
> @@ -1645,7 +1645,7 @@ int split_huge_page(struct page *page)
>  	int ret = 1;
>  
>  	BUG_ON(!PageAnon(page));
> -	anon_vma = page_lock_anon_vma(page);
> +	anon_vma = page_lock_anon_vma_read(page);
>  	if (!anon_vma)
>  		goto out;
>  	ret = 0;
> @@ -1658,7 +1658,7 @@ int split_huge_page(struct page *page)
>  
>  	BUG_ON(PageCompound(page));
>  out_unlock:
> -	page_unlock_anon_vma(anon_vma);
> +	page_unlock_anon_vma_read(anon_vma);
>  out:
>  	return ret;
>  }
> Index: linux/mm/memory-failure.c
> ===================================================================
> --- linux.orig/mm/memory-failure.c
> +++ linux/mm/memory-failure.c
> @@ -402,7 +402,7 @@ static void collect_procs_anon(struct pa
>  	struct anon_vma *av;
>  	pgoff_t pgoff;
>  
> -	av = page_lock_anon_vma(page);
> +	av = page_lock_anon_vma_read(page);
>  	if (av == NULL)	/* Not actually mapped anymore */
>  		return;
>  

Probably no real benefit on this one. It takes the tasklist_lock just
after it which is a much heavier lock anyway. I don't think there is
anything wrong with this though.

> @@ -423,7 +423,7 @@ static void collect_procs_anon(struct pa
>  		}
>  	}
>  	read_unlock(&tasklist_lock);
> -	page_unlock_anon_vma(av);
> +	page_unlock_anon_vma_read(av);
>  }
>  
>  /*
> Index: linux/mm/migrate.c
> ===================================================================
> --- linux.orig/mm/migrate.c
> +++ linux/mm/migrate.c
> @@ -751,7 +751,7 @@ static int __unmap_and_move(struct page
>  	 */
>  	if (PageAnon(page)) {
>  		/*
> -		 * Only page_lock_anon_vma() understands the subtleties of
> +		 * Only page_lock_anon_vma_read() understands the subtleties of
>  		 * getting a hold on an anon_vma from outside one of its mms.
>  		 */
>  		anon_vma = page_get_anon_vma(page);
> Index: linux/mm/rmap.c
> ===================================================================
> --- linux.orig/mm/rmap.c
> +++ linux/mm/rmap.c
> @@ -87,18 +87,18 @@ static inline void anon_vma_free(struct
>  	VM_BUG_ON(atomic_read(&anon_vma->refcount));
>  
>  	/*
> -	 * Synchronize against page_lock_anon_vma() such that
> +	 * Synchronize against page_lock_anon_vma_read() such that
>  	 * we can safely hold the lock without the anon_vma getting
>  	 * freed.
>  	 *
>  	 * Relies on the full mb implied by the atomic_dec_and_test() from
>  	 * put_anon_vma() against the acquire barrier implied by
> -	 * mutex_trylock() from page_lock_anon_vma(). This orders:
> +	 * down_read_trylock() from page_lock_anon_vma_read(). This orders:
>  	 *
> -	 * page_lock_anon_vma()		VS	put_anon_vma()
> -	 *   mutex_trylock()			  atomic_dec_and_test()
> +	 * page_lock_anon_vma_read()	VS	put_anon_vma()
> +	 *   down_read_trylock()		  atomic_dec_and_test()
>  	 *   LOCK				  MB
> -	 *   atomic_read()			  mutex_is_locked()
> +	 *   atomic_read()			  rwsem_is_locked()
>  	 *
>  	 * LOCK should suffice since the actual taking of the lock must
>  	 * happen _before_ what follows.
> @@ -146,7 +146,7 @@ static void anon_vma_chain_link(struct v
>   * allocate a new one.
>   *
>   * Anon-vma allocations are very subtle, because we may have
> - * optimistically looked up an anon_vma in page_lock_anon_vma()
> + * optimistically looked up an anon_vma in page_lock_anon_vma_read()
>   * and that may actually touch the spinlock even in the newly
>   * allocated vma (it depends on RCU to make sure that the
>   * anon_vma isn't actually destroyed).
> @@ -442,7 +442,7 @@ out:
>   * atomic op -- the trylock. If we fail the trylock, we fall back to getting a
>   * reference like with page_get_anon_vma() and then block on the mutex.
>   */
> -struct anon_vma *page_lock_anon_vma(struct page *page)
> +struct anon_vma *page_lock_anon_vma_read(struct page *page)
>  {
>  	struct anon_vma *anon_vma = NULL;
>  	struct anon_vma *root_anon_vma;
> @@ -457,14 +457,14 @@ struct anon_vma *page_lock_anon_vma(stru
>  
>  	anon_vma = (struct anon_vma *) (anon_mapping - PAGE_MAPPING_ANON);
>  	root_anon_vma = ACCESS_ONCE(anon_vma->root);
> -	if (down_write_trylock(&root_anon_vma->rwsem)) {
> +	if (down_read_trylock(&root_anon_vma->rwsem)) {
>  		/*
>  		 * If the page is still mapped, then this anon_vma is still
>  		 * its anon_vma, and holding the mutex ensures that it will
>  		 * not go away, see anon_vma_free().
>  		 */
>  		if (!page_mapped(page)) {
> -			up_write(&root_anon_vma->rwsem);
> +			up_read(&root_anon_vma->rwsem);
>  			anon_vma = NULL;
>  		}
>  		goto out;
> @@ -484,7 +484,7 @@ struct anon_vma *page_lock_anon_vma(stru
>  
>  	/* we pinned the anon_vma, its safe to sleep */
>  	rcu_read_unlock();
> -	anon_vma_lock(anon_vma);
> +	anon_vma_lock_read(anon_vma);
>  
>  	if (atomic_dec_and_test(&anon_vma->refcount)) {
>  		/*
> @@ -492,7 +492,7 @@ struct anon_vma *page_lock_anon_vma(stru
>  		 * and bail -- can't simply use put_anon_vma() because
>  		 * we'll deadlock on the anon_vma_lock() recursion.
>  		 */
> -		anon_vma_unlock(anon_vma);
> +		anon_vma_unlock_read(anon_vma);
>  		__put_anon_vma(anon_vma);
>  		anon_vma = NULL;
>  	}
> @@ -504,9 +504,9 @@ out:
>  	return anon_vma;
>  }
>  
> -void page_unlock_anon_vma(struct anon_vma *anon_vma)
> +void page_unlock_anon_vma_read(struct anon_vma *anon_vma)
>  {
> -	anon_vma_unlock(anon_vma);
> +	anon_vma_unlock_read(anon_vma);
>  }
>  
>  /*
> @@ -732,7 +732,7 @@ static int page_referenced_anon(struct p
>  	struct anon_vma_chain *avc;
>  	int referenced = 0;
>  
> -	anon_vma = page_lock_anon_vma(page);
> +	anon_vma = page_lock_anon_vma_read(page);
>  	if (!anon_vma)
>  		return referenced;
>  

This is a slightly trickier one as this path is called from reclaim. It does
open the possibility that reclaim can stall something like a parallel fork
or anything that requires the anon_vma rwsem for a period of time. I very
severely doubt it'll really be a problem but keep an eye out for bug reports
related to delayed mmap/fork/anything_needing_write_lock during page reclaim.
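
As a rough model of that concern (plain pthreads again, with made-up
hold times, not the kernel locking API): a long shared hold in a
reclaim-style walker delays a writer such as fork or mmap until the
reader lets go, but the writer is delayed rather than starved.

#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static pthread_rwlock_t anon_lock = PTHREAD_RWLOCK_INITIALIZER;

/* "reclaim": holds the lock shared for an artificially long time */
static void *reclaim_reader(void *arg)
{
        pthread_rwlock_rdlock(&anon_lock);
        usleep(200 * 1000);		/* pretend rmap walk over many vmas */
        pthread_rwlock_unlock(&anon_lock);
        return NULL;
}

/* "fork": needs the lock exclusive and has to wait the reader out */
static void *fork_writer(void *arg)
{
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        pthread_rwlock_wrlock(&anon_lock);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        pthread_rwlock_unlock(&anon_lock);

        printf("writer waited %ld ms\n",
               (t1.tv_sec - t0.tv_sec) * 1000 +
               (t1.tv_nsec - t0.tv_nsec) / 1000000);
        return NULL;
}

int main(void)
{
        pthread_t r, w;

        pthread_create(&r, NULL, reclaim_reader, NULL);
        usleep(10 * 1000);		/* let the reader get in first */
        pthread_create(&w, NULL, fork_writer, NULL);

        pthread_join(r, NULL);
        pthread_join(w, NULL);
        return 0;
}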

> @@ -754,7 +754,7 @@ static int page_referenced_anon(struct p
>  			break;
>  	}
>  
> -	page_unlock_anon_vma(anon_vma);
> +	page_unlock_anon_vma_read(anon_vma);
>  	return referenced;
>  }
>  
> @@ -1474,7 +1474,7 @@ static int try_to_unmap_anon(struct page
>  	struct anon_vma_chain *avc;
>  	int ret = SWAP_AGAIN;
>  
> -	anon_vma = page_lock_anon_vma(page);
> +	anon_vma = page_lock_anon_vma_read(page);
>  	if (!anon_vma)
>  		return ret;
>  
> @@ -1501,7 +1501,7 @@ static int try_to_unmap_anon(struct page
>  			break;
>  	}
>  
> -	page_unlock_anon_vma(anon_vma);
> +	page_unlock_anon_vma_read(anon_vma);
>  	return ret;
>  }
>  
> @@ -1696,7 +1696,7 @@ static int rmap_walk_anon(struct page *p
>  	int ret = SWAP_AGAIN;
>  
>  	/*
> -	 * Note: remove_migration_ptes() cannot use page_lock_anon_vma()
> +	 * Note: remove_migration_ptes() cannot use page_lock_anon_vma_read()
>  	 * because that depends on page_mapped(); but not all its usages
>  	 * are holding mmap_sem. Users without mmap_sem are required to
>  	 * take a reference count to prevent the anon_vma disappearing
> @@ -1704,7 +1704,7 @@ static int rmap_walk_anon(struct page *p
>  	anon_vma = page_anon_vma(page);
>  	if (!anon_vma)
>  		return ret;
> -	anon_vma_lock(anon_vma);
> +	anon_vma_lock_read(anon_vma);
>  	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
>  		struct vm_area_struct *vma = avc->vma;
>  		unsigned long address = vma_address(page, vma);
> @@ -1712,7 +1712,7 @@ static int rmap_walk_anon(struct page *p
>  		if (ret != SWAP_AGAIN)
>  			break;
>  	}
> -	anon_vma_unlock(anon_vma);
> +	anon_vma_unlock_read(anon_vma);
>  	return ret;
>  }
>  

I can't think of any reason why this would not work. Good stuff!

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 2/2] mm/migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
  2012-12-03 14:17               ` [PATCH 2/2] " Mel Gorman
@ 2012-12-04 14:37                 ` Michel Lespinasse
  2012-12-04 18:17                   ` Mel Gorman
  0 siblings, 1 reply; 39+ messages in thread
From: Michel Lespinasse @ 2012-12-04 14:37 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Ingo Molnar, Linus Torvalds, Linux Kernel Mailing List, linux-mm,
	Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Andrew Morton, Andrea Arcangeli, Thomas Gleixner,
	Johannes Weiner, Hugh Dickins

On Mon, Dec 3, 2012 at 6:17 AM, Mel Gorman <mgorman@suse.de> wrote:
> On Sat, Dec 01, 2012 at 09:15:38PM +0100, Ingo Molnar wrote:
>> @@ -732,7 +732,7 @@ static int page_referenced_anon(struct p
>>       struct anon_vma_chain *avc;
>>       int referenced = 0;
>>
>> -     anon_vma = page_lock_anon_vma(page);
>> +     anon_vma = page_lock_anon_vma_read(page);
>>       if (!anon_vma)
>>               return referenced;
>
> This is a slightly trickier one as this path is called from reclaim. It does
> open the possibility that reclaim can stall something like a parallel fork
> or anything that requires the anon_vma rwsem for a period of time. I very
> severely doubt it'll really be a problem but keep an eye out for bug reports
> related to delayed mmap/fork/anything_needing_write_lock during page reclaim.

I don't see why this would be a problem - rwsem does implement
reader/writer fairness, so having some sites take a read lock instead of
a write lock shouldn't cause the write-lock sites to starve. Is this
what you were worried about?

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


* Re: [PATCH 2/2, v2] mm/migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
  2012-12-02 15:12                 ` [PATCH 2/2, v2] " Ingo Molnar
  2012-12-02 17:53                   ` Rik van Riel
@ 2012-12-04 14:42                   ` Michel Lespinasse
  2012-12-05  2:59                   ` Michel Lespinasse
  2 siblings, 0 replies; 39+ messages in thread
From: Michel Lespinasse @ 2012-12-04 14:42 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Rik van Riel, Linus Torvalds, Linux Kernel Mailing List,
	linux-mm, Peter Zijlstra, Paul Turner, Lee Schermerhorn,
	Christoph Lameter, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Thomas Gleixner, Johannes Weiner, Hugh Dickins

On Sun, Dec 2, 2012 at 7:12 AM, Ingo Molnar <mingo@kernel.org> wrote:
> * Rik van Riel <riel@redhat.com> wrote:
>
>> >+static inline void anon_vma_lock_read(struct anon_vma *anon_vma)
>> >+{
>> >+    down_read(&anon_vma->root->rwsem);
>> >+}
>>
>> I see you did not rename anon_vma_lock and anon_vma_unlock to
>> anon_vma_lock_write and anon_vma_unlock_write.
>>
>> That could get confusing to people touching that code in the
>> future.
>
> Agreed, doing that rename makes perfect sense - I've done that
> in the v2 version attached below.
>
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index f3f41d2..c20635c 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -118,7 +118,7 @@ static inline void vma_unlock_anon_vma(struct vm_area_struct *vma)
>                 up_write(&anon_vma->root->rwsem);
>  }

Note that you haven't changed the names for vma_lock_anon_vma() and
vma_unlock_anon_vma().
I don't have any real good names to suggest though.

> -static inline void anon_vma_lock(struct anon_vma *anon_vma)
> +static inline void anon_vma_lock_write(struct anon_vma *anon_vma)
>  {
>         down_write(&anon_vma->root->rwsem);
>  }
> @@ -128,6 +128,17 @@ static inline void anon_vma_unlock(struct anon_vma *anon_vma)
>         up_write(&anon_vma->root->rwsem);
>  }

And as Rik noticed, you forgot to rename anon_vma_unlock() too.

But really, this is nitpicking. I like the idea behind the patch, and
after giving it a close look, I couldn't find anything wrong with it.

Reviewed-by: Michel Lespinasse <walken@google.com>

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


* Re: [PATCH 00/10] Latest numa/core release, v18
  2012-12-03 13:41   ` [PATCH 00/10] Latest numa/core release, v18 Mel Gorman
@ 2012-12-04 17:30     ` Thomas Gleixner
  0 siblings, 0 replies; 39+ messages in thread
From: Thomas Gleixner @ 2012-12-04 17:30 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linus Torvalds, Ingo Molnar, Linux Kernel Mailing List, linux-mm,
	Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Andrew Morton, Andrea Arcangeli, Johannes Weiner,
	Hugh Dickins

On Mon, 3 Dec 2012, Mel Gorman wrote:
> On Fri, Nov 30, 2012 at 12:37:49PM -0800, Linus Torvalds wrote:
> > So if this is a migration-specific scalability issue, then it might be
> > possible to solve by making the mutex be a rwsem instead, and have
> > migration only take it for reading.
> > 
> > Of course, I'm quite possibly wrong, and the code depends on full
> > mutual exclusion.
> > 
> > Just a thought, in case it makes somebody go "Hmm.."
> > 
> 
> Offhand, I cannot think of a reason why a rwsem would not work. This
> thing originally became a mutex because the RT people (Peter in
> particular) cared about being able to preempt faster. It'd be nice if
> they confirmed that rwsem is not be a problem for them.

rwsems are preemptible as well, so I don't think this was Peter's main
concern. If it works with an rwsem, then go ahead.

rwsems degrade on RT because we cannot do multiple reader boosting, so
they allow only a single reader, which can take the lock recursively. But
that's an RT-specific issue and nothing you should worry about.

Thanks,

	tglx


* Re: [PATCH 2/2] mm/migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
  2012-12-04 14:37                 ` Michel Lespinasse
@ 2012-12-04 18:17                   ` Mel Gorman
  0 siblings, 0 replies; 39+ messages in thread
From: Mel Gorman @ 2012-12-04 18:17 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Ingo Molnar, Linus Torvalds, Linux Kernel Mailing List, linux-mm,
	Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Andrew Morton, Andrea Arcangeli, Thomas Gleixner,
	Johannes Weiner, Hugh Dickins

On Tue, Dec 04, 2012 at 06:37:41AM -0800, Michel Lespinasse wrote:
> On Mon, Dec 3, 2012 at 6:17 AM, Mel Gorman <mgorman@suse.de> wrote:
> > On Sat, Dec 01, 2012 at 09:15:38PM +0100, Ingo Molnar wrote:
> >> @@ -732,7 +732,7 @@ static int page_referenced_anon(struct p
> >>       struct anon_vma_chain *avc;
> >>       int referenced = 0;
> >>
> >> -     anon_vma = page_lock_anon_vma(page);
> >> +     anon_vma = page_lock_anon_vma_read(page);
> >>       if (!anon_vma)
> >>               return referenced;
> >
> > This is a slightly trickier one as this path is called from reclaim. It does
> > open the possibility that reclaim can stall something like a parallel fork
> > or anything that requires the anon_vma rwsem for a period of time. I very
> > severely doubt it'll really be a problem but keep an eye out for bug reports
> > related to delayed mmap/fork/anything_needing_write_lock during page reclaim.
> 
> I don't see why this would be a problem - rwsem does implement
> reader/writer fairness, so having some sites do a read lock instead of
> a write lock shouldn't cause the write lock sites to starve. Is this
> what you were worried about ?
> 

Yes. I did not expect they would be starved forever, just delayed longer
than they might have been before. I would be very surprised if there is
anything other than a synthetic case that will really care but I've been
"very surprised" before :)

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 00/10] Latest numa/core release, v18
  2012-11-30 19:58 [PATCH 00/10] Latest numa/core release, v18 Ingo Molnar
                   ` (12 preceding siblings ...)
  2012-12-03 11:32 ` Mel Gorman
@ 2012-12-04 22:49 ` Mel Gorman
  13 siblings, 0 replies; 39+ messages in thread
From: Mel Gorman @ 2012-12-04 22:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-mm, Peter Zijlstra, Paul Turner,
	Lee Schermerhorn, Christoph Lameter, Rik van Riel, Andrew Morton,
	Andrea Arcangeli, Linus Torvalds, Thomas Gleixner,
	Johannes Weiner, Hugh Dickins

On Fri, Nov 30, 2012 at 08:58:31PM +0100, Ingo Molnar wrote:
> I'm pleased to announce the latest, -v18 numa/core release.
> 

I collected the results for the following kernels

stats-v8r6	  TLB flush optimisations, stats from balancenuma tree
numacore-20121130 numacore v17 (tip/master as of Nov 30th)
numacore-20121202 numacore v18 (tip/master as of Dec  2nd)
numabase-20121203 unified tree (tip/numa/base as of Dec 3rd)
autonuma-v28fastr4 autonuma rebased with the THP patch on top
balancenuma-v9r2  Almost identical to balancenuma v8 but with a build fix for mips
balancenuma-v10r1 v9 + Ingo's migration optimisation on top

Unfortunately, I did not get very far with the comparison. On looking
at just the first set of results, I noticed something screwy with the
numacore-20121202 and numabase-20121203 results. It becomes obvious if
you look at the autonuma benchmark.

AUTONUMA BENCH
                                      3.7.0-rc7             3.7.0-rc6             3.7.0-rc7             3.7.0-rc7             3.7.0-rc7             3.7.0-rc7             3.7.0-rc7
                                     stats-v8r6     numacore-20121130     numacore-20121202     numabase-20121203    autonuma-v28fastr4      balancenuma-v9r2     balancenuma-v10r1
User    NUMA01               65230.85 (  0.00%)    24835.22 ( 61.93%)    69344.37 ( -6.31%)    62845.76 (  3.66%)    30410.22 ( 53.38%)    52436.65 ( 19.61%)    42111.49 ( 35.44%)
User    NUMA01_THEADLOCAL    60794.67 (  0.00%)    17856.17 ( 70.63%)    53416.06 ( 12.14%)    50088.06 ( 17.61%)    17185.34 ( 71.73%)    17829.96 ( 70.67%)    17820.65 ( 70.69%)
User    NUMA02                7031.50 (  0.00%)     2084.38 ( 70.36%)     6726.17 (  4.34%)     6713.99 (  4.52%)     2238.73 ( 68.16%)     2079.48 ( 70.43%)     2068.27 ( 70.59%)
User    NUMA02_SMT            2916.19 (  0.00%)     1009.28 ( 65.39%)     3207.30 ( -9.98%)     3150.35 ( -8.03%)     1037.07 ( 64.44%)      997.57 ( 65.79%)      990.41 ( 66.04%)
System  NUMA01                  39.66 (  0.00%)      926.55 (-2236.23%)      333.49 (-740.87%)      283.49 (-614.80%)      236.83 (-497.15%)      275.09 (-593.62%)      329.73 (-731.39%)
System  NUMA01_THEADLOCAL       42.33 (  0.00%)      513.99 (-1114.25%)       40.59 (  4.11%)       38.80 (  8.34%)       70.90 (-67.49%)      110.82 (-161.80%)      114.57 (-170.66%)
System  NUMA02                   1.25 (  0.00%)       18.57 (-1385.60%)        1.04 ( 16.80%)        1.06 ( 15.20%)        6.39 (-411.20%)        6.42 (-413.60%)        6.97 (-457.60%)
System  NUMA02_SMT              16.66 (  0.00%)       12.32 ( 26.05%)        0.95 ( 94.30%)        0.93 ( 94.42%)        3.17 ( 80.97%)        3.58 ( 78.51%)        5.75 ( 65.49%)
Elapsed NUMA01                1511.76 (  0.00%)      575.93 ( 61.90%)     1644.63 ( -8.79%)     1508.19 (  0.24%)      701.62 ( 53.59%)     1185.53 ( 21.58%)      950.50 ( 37.13%)
Elapsed NUMA01_THEADLOCAL     1387.17 (  0.00%)      398.55 ( 71.27%)     1260.92 (  9.10%)     1257.44 (  9.35%)      378.47 ( 72.72%)      397.37 ( 71.35%)      399.97 ( 71.17%)
Elapsed NUMA02                 176.81 (  0.00%)       51.14 ( 71.08%)      180.80 ( -2.26%)      180.59 ( -2.14%)       53.45 ( 69.77%)       49.51 ( 72.00%)       50.93 ( 71.20%)
Elapsed NUMA02_SMT             163.96 (  0.00%)       48.92 ( 70.16%)      166.96 ( -1.83%)      163.94 (  0.01%)       48.17 ( 70.62%)       47.71 ( 70.90%)       46.76 ( 71.48%)
CPU     NUMA01                4317.00 (  0.00%)     4473.00 ( -3.61%)     4236.00 (  1.88%)     4185.00 (  3.06%)     4368.00 ( -1.18%)     4446.00 ( -2.99%)     4465.00 ( -3.43%)
CPU     NUMA01_THEADLOCAL     4385.00 (  0.00%)     4609.00 ( -5.11%)     4239.00 (  3.33%)     3986.00 (  9.10%)     4559.00 ( -3.97%)     4514.00 ( -2.94%)     4484.00 ( -2.26%)
CPU     NUMA02                3977.00 (  0.00%)     4111.00 ( -3.37%)     3720.00 (  6.46%)     3718.00 (  6.51%)     4200.00 ( -5.61%)     4212.00 ( -5.91%)     4074.00 ( -2.44%)
CPU     NUMA02_SMT            1788.00 (  0.00%)     2087.00 (-16.72%)     1921.00 ( -7.44%)     1922.00 ( -7.49%)     2159.00 (-20.75%)     2098.00 (-17.34%)     2130.00 (-19.13%)

While numacore-v17 did quite well for the range of workloads, v18 does
not. It's just about comparable to mainline and the unified tree is more
or less the same.

balancenuma does reasonably well. It does not do a great job on numa01,
but it is better than mainline, and it has already been explained why
balancenuma without a placement policy is not able to interleave the way
this adverse workload requires.

MMTests Statistics: duration
           3.7.0-rc7   3.7.0-rc6   3.7.0-rc7   3.7.0-rc7   3.7.0-rc7   3.7.0-rc7   3.7.0-rc7
           stats-v8r6  numacore-20121130  numacore-20121202  numabase-20121203  autonuma-v28fastr4  balancenuma-v9r2  balancenuma-v10r1
User       135980.38    45792.55   132701.13   122805.28    50878.50    73350.91    62997.72
System        100.53     1472.19      376.74      324.98      317.89      396.58      457.66
Elapsed      3248.36     1084.63     3262.62     3118.70     1191.85     1689.70     1456.66

Every tree adds system CPU overhead. numacore-v18 has lower overhead than
v17, and I thought it might be due to how worklets were accounted for, but
then I looked at the vmstats.

MMTests Statistics: vmstat
                             3.7.0-rc7   3.7.0-rc6   3.7.0-rc7   3.7.0-rc7   3.7.0-rc7   3.7.0-rc7   3.7.0-rc7
                            stats-v8r6  numacore-20121130  numacore-20121202  numabase-20121203  autonuma-v28fastr4  balancenuma-v9r2  balancenuma-v10r1
Page Ins                         42320       41628       40624       40404       41592       40524       40800
Page Outs                        16516        8032       17064       16320        8596       10712        9652
Swap Ins                             0           0           0           0           0           0           0
Swap Outs                            0           0           0           0           0           0           0
Direct pages scanned                 0           0           0           0           0           0           0
Kswapd pages scanned                 0           0           0           0           0           0           0
Kswapd pages reclaimed               0           0           0           0           0           0           0
Direct pages reclaimed               0           0           0           0           0           0           0
Kswapd efficiency                 100%        100%        100%        100%        100%        100%        100%
Kswapd velocity                  0.000       0.000       0.000       0.000       0.000       0.000       0.000
Direct efficiency                 100%        100%        100%        100%        100%        100%        100%
Direct velocity                  0.000       0.000       0.000       0.000       0.000       0.000       0.000
Percentage direct scans             0%          0%          0%          0%          0%          0%          0%
Page writes by reclaim               0           0           0           0           0           0           0
Page writes file                     0           0           0           0           0           0           0
Page writes anon                     0           0           0           0           0           0           0
Page reclaim immediate               0           0           0           0           0           0           0
Page rescued immediate               0           0           0           0           0           0           0
Slabs scanned                        0           0           0           0           0           0           0
Direct inode steals                  0           0           0           0           0           0           0
Kswapd inode steals                  0           0           0           0           0           0           0
Kswapd skipped wait                  0           0           0           0           0           0           0
THP fault alloc                  17801       13484       19107       19323       20032       18691       17880
THP collapse alloc                  14           0           6          11          54           9           5
THP splits                           5           0           5           6           7           2           8
THP fault fallback                   0           0           0           0           0           0           0
THP collapse fail                    0           0           0           0           0           0           0
Compaction stalls                    0           0           0           0           0           0           0
Compaction success                   0           0           0           0           0           0           0
Compaction failures                  0           0           0           0           0           0           0
Page migrate success                 0           0           0           0           0     9599473     9266463
Page migrate failure                 0           0           0           0           0           0           0
Compaction pages isolated            0           0           0           0           0           0           0
Compaction migrate scanned           0           0           0           0           0           0           0
Compaction free scanned              0           0           0           0           0           0           0
Compaction cost                      0           0           0           0           0        9964        9618
NUMA PTE updates                     0           0           0           0           0   132800892   130575725
NUMA hint faults                     0           0           0           0           0      606294      501532
NUMA hint local faults               0           0           0           0           0      453880      370744
NUMA pages migrated                  0           0           0           0           0     9599473     9266463
AutoNUMA cost                        0           0           0           0           0        4143        3597

The unified tree numabase-20121203 should have had some NUMA PTE activity
and the stat code looked ok at a glance. However, zero activity there
implies that numacore is completely disabled or non-existent. I checked:
the patch had applied and it was certainly enabled in the kernel config,
so I looked closer and saw that task_tick_numa looks like this.

static void task_tick_numa(struct rq *rq, struct task_struct *curr)
{
        /* Cheap checks first: */
        if (!task_numa_candidate(curr)) {
                if (curr->numa_shared >= 0)
                        curr->numa_shared = -1;
                return;
        }

        task_tick_numa_scan(rq, curr);
        task_tick_numa_placement(rq, curr);
}

Ok, so task_numa_candidate() is meant to shortcut the expensive steps,
fair enough, but it begins with this check.

        /* kthreads don't have any user-space memory to scan: */
        if (!p->mm || !p->numa_faults)
                return false;

How is numa_faults ever meant to become positive if task_tick_numa_scan()
never even gets the chance to run and mark a PTE pte_numa? Is numacore not
effectively disabled? I'm also not 100% sure that the "/* Don't disturb
hard-bound tasks: */" check is correct either. A task could be bound to the
CPUs of two nodes, just not all nodes, and still want to do balancing.
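
To make the numa_faults chicken-and-egg explicit, here is a stripped-down
userspace rendering of that gating structure (the names and types are
invented for illustration, not lifted from numacore): the candidate check
demands a fault count that only the gated scan path could ever lead to,
so the scan never runs and the count stays at zero.

#include <stdbool.h>
#include <stdio.h>

struct task {
        unsigned long numa_faults;	/* only ever raised via the scan below */
};

/* mirrors the "!p->numa_faults" short-circuit in the candidate check */
static bool numa_candidate(struct task *p)
{
        return p->numa_faults != 0;
}

/* stands in for the scan that would eventually lead to hinting faults */
static void tick_numa_scan(struct task *p)
{
        p->numa_faults++;
}

static void tick_numa(struct task *p)
{
        if (!numa_candidate(p))
                return;			/* always taken: the count starts at 0 */
        tick_numa_scan(p);
}

int main(void)
{
        struct task t = { .numa_faults = 0 };

        for (int tick = 0; tick < 1000; tick++)
                tick_numa(&t);

        /* prints 0: the gate can never open on its own */
        printf("faults after 1000 ticks: %lu\n", t.numa_faults);
        return 0;
}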

Ingo, you reported that you were seeing results within 1% of
hard-binding. What were you testing with, and are you sure that is what
you pushed to tip/master? The damage appears to be caused by "sched: Add
RSS filter to NUMA-balancing", which does more than just RSS filtering.
If so, it's not clear how you saw the good results you reported, unless
you accidentally merged the wrong version of that patch.

I'll stop the analysis for now. FWIW, very broadly speaking it looked like
the migration scalability patches help balancenuma a bit for some of the
tests, although they increase system CPU usage a little.

-- 
Mel Gorman
SUSE Labs


* Re: [PATCH 2/2, v2] mm/migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
  2012-12-02 15:12                 ` [PATCH 2/2, v2] " Ingo Molnar
  2012-12-02 17:53                   ` Rik van Riel
  2012-12-04 14:42                   ` Michel Lespinasse
@ 2012-12-05  2:59                   ` Michel Lespinasse
  2 siblings, 0 replies; 39+ messages in thread
From: Michel Lespinasse @ 2012-12-05  2:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Rik van Riel, Linus Torvalds, Linux Kernel Mailing List,
	linux-mm, Peter Zijlstra, Paul Turner, Lee Schermerhorn,
	Christoph Lameter, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Thomas Gleixner, Johannes Weiner, Hugh Dickins

On Sun, Dec 2, 2012 at 7:12 AM, Ingo Molnar <mingo@kernel.org> wrote:
> Subject: [PATCH] mm/rmap, migration: Make rmap_walk_anon() and
>  try_to_unmap_anon() more scalable
>
> rmap_walk_anon() and try_to_unmap_anon() appear to be too
> careful about locking the anon vma: while they need protection
> against anon vma list modifications, they do not need exclusive
> access to the list itself.
>
> Transforming this exclusive lock to a read-locked rwsem removes
> a global lock from the hot path of page-migration intense
> threaded workloads which can cause pathological performance like
> this:
>
>     96.43%        process 0  [kernel.kallsyms]  [k] perf_trace_sched_switch
>                   |
>                   --- perf_trace_sched_switch
>                       __schedule
>                       schedule
>                       schedule_preempt_disabled
>                       __mutex_lock_common.isra.6
>                       __mutex_lock_slowpath
>                       mutex_lock
>                      |
>                      |--50.61%-- rmap_walk
>                      |          move_to_new_page
>                      |          migrate_pages
>                      |          migrate_misplaced_page
>                      |          __do_numa_page.isra.69
>                      |          handle_pte_fault
>                      |          handle_mm_fault
>                      |          __do_page_fault
>                      |          do_page_fault
>                      |          page_fault
>                      |          __memset_sse2
>                      |          |
>                      |           --100.00%-- worker_thread
>                      |                     |
>                      |                      --100.00%-- start_thread
>                      |
>                       --49.39%-- page_lock_anon_vma
>                                 try_to_unmap_anon
>                                 try_to_unmap
>                                 migrate_pages
>                                 migrate_misplaced_page
>                                 __do_numa_page.isra.69
>                                 handle_pte_fault
>                                 handle_mm_fault
>                                 __do_page_fault
>                                 do_page_fault
>                                 page_fault
>                                 __memset_sse2
>                                 |
>                                  --100.00%-- worker_thread
>                                            start_thread
>
> With this change applied the profile is now nicely flat
> and there's no anon-vma related scheduling/blocking.

Wouldn't the same reasoning apply to i_mmap_mutex? Should we make
that an rwsem as well? I take it that Ingo's test case does not show
this, but i_mmap_mutex's role with file pages is actually quite
similar to that of the anon_vma lock with anon pages...

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


end of thread

Thread overview: 39+ messages
2012-11-30 19:58 [PATCH 00/10] Latest numa/core release, v18 Ingo Molnar
2012-11-30 19:58 ` [PATCH 01/10] sched: Add "task flipping" support Ingo Molnar
2012-11-30 19:58 ` [PATCH 02/10] sched: Move the NUMA placement logic to a worklet Ingo Molnar
2012-11-30 19:58 ` [PATCH 03/10] numa, mempolicy: Improve CONFIG_NUMA_BALANCING=y OOM behavior Ingo Molnar
2012-11-30 19:58 ` [PATCH 04/10] mm, numa: Turn 4K pte NUMA faults into effective hugepage ones Ingo Molnar
2012-11-30 19:58 ` [PATCH 05/10] sched: Introduce directed NUMA convergence Ingo Molnar
2012-11-30 19:58 ` [PATCH 06/10] sched: Remove statistical NUMA scheduling Ingo Molnar
2012-11-30 19:58 ` [PATCH 07/10] sched: Track quality and strength of convergence Ingo Molnar
2012-11-30 19:58 ` [PATCH 08/10] sched: Converge NUMA migrations Ingo Molnar
2012-11-30 19:58 ` [PATCH 09/10] sched: Add convergence strength based adaptive NUMA page fault rate Ingo Molnar
2012-11-30 19:58 ` [PATCH 10/10] sched: Refine the 'shared tasks' memory interleaving logic Ingo Molnar
2012-11-30 20:37 ` [PATCH 00/10] Latest numa/core release, v18 Linus Torvalds
2012-12-01  9:49   ` [RFC PATCH] mm/migration: Don't lock anon vmas in rmap_walk_anon() Ingo Molnar
2012-12-01 12:26     ` [RFC PATCH] mm/migration: Remove anon vma locking from try_to_unmap() use Ingo Molnar
2012-12-01 18:38       ` Linus Torvalds
2012-12-01 18:41         ` Ingo Molnar
2012-12-01 18:50           ` Linus Torvalds
2012-12-01 20:10             ` [PATCH 1/2] mm/rmap: Convert the struct anon_vma::mutex to an rwsem Ingo Molnar
2012-12-01 20:19               ` Rik van Riel
2012-12-02 15:10                 ` Ingo Molnar
2012-12-03 13:59               ` Mel Gorman
2012-12-01 20:15             ` [PATCH 2/2] mm/migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable Ingo Molnar
2012-12-01 20:33               ` Rik van Riel
2012-12-02 15:12                 ` [PATCH 2/2, v2] " Ingo Molnar
2012-12-02 17:53                   ` Rik van Riel
2012-12-04 14:42                   ` Michel Lespinasse
2012-12-05  2:59                   ` Michel Lespinasse
2012-12-03 14:17               ` [PATCH 2/2] " Mel Gorman
2012-12-04 14:37                 ` Michel Lespinasse
2012-12-04 18:17                   ` Mel Gorman
2012-12-01 18:55         ` [RFC PATCH] mm/migration: Remove anon vma locking from try_to_unmap() use Rik van Riel
2012-12-01 16:19     ` [RFC PATCH] mm/migration: Don't lock anon vmas in rmap_walk_anon() Rik van Riel
2012-12-01 17:55     ` Linus Torvalds
2012-12-01 18:30       ` Ingo Molnar
2012-12-03 13:41   ` [PATCH 00/10] Latest numa/core release, v18 Mel Gorman
2012-12-04 17:30     ` Thomas Gleixner
2012-12-03 10:43 ` Mel Gorman
2012-12-03 11:32 ` Mel Gorman
2012-12-04 22:49 ` Mel Gorman
