* [GIT TREE] Unified NUMA balancing tree, v3
@ 2012-12-07  0:19 Ingo Molnar
  2012-12-07  0:19 ` [PATCH 1/9] numa, sched: Fix NUMA tick ->numa_shared setting Ingo Molnar
                   ` (9 more replies)
  0 siblings, 10 replies; 19+ messages in thread
From: Ingo Molnar @ 2012-12-07  0:19 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

I'm pleased to announce the -v3 version of the unified NUMA tree,
which can be accessed at the following Git address:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git master

[ To test this tree, just pick up the Git tree and enable
  CONFIG_NUMA_BALANCING=y. On a NUMA system with at least 2 nodes
  you should see speedups in many types of long-running,
  memory-intensive user-space workloads with this feature enabled.
  Or a slowdown if the plan does not work out. Please report
  both cases. ]

The focus of the -v3 release is regression fixes. Half of the
regressions fixed were related to the unification, the other half
were due to prior bugs.

Main changes since -v2:

  - Implement last-CPU+PID hash tracking
  - Improve staggered convergence
  - Improve directed convergence
  - Fix !THP, 4K-pte "2M-emu" NUMA fault handling

In particular the new CPU+PID hashing code works very well, and
I'd be curious whether others can confirm that they are seeing
speedups as well.

Some performance figures. Here's the comparison to mainline:

 ##############
 # res-v3.6-vanilla.log vs res-numaunified-v3.log:
 #------------------------------------------------------------------------------------>
   autonuma benchmark                run time (lower is better)         speedup %
 ------------------------------------------------------------------------------------->
   numa01                           :   337.29  vs.  195.47   |           +72.5 %
   numa01_THREAD_ALLOC              :   428.79  vs.  119.97   |          +257.4 %
   numa02                           :    56.32  vs.   16.82   |          +234.8 %
   numa02_SMT                       :    56.55  vs.   16.98   |          +233.0 %
   ------------------------------------------------------------

Still much better, all around.

Comparison to -v17, the last non-regressing pre-unification
tree:

 ##############
 # res-numacore-v18b.log vs res-numaunified-v3.log:
 #------------------------------------------------------------------------------------>
   autonuma benchmark                run time (lower is better)         speedup %
 ------------------------------------------------------------------------------------->
   numa01                           :   177.64  vs.  195.47   |            -9.1 %
   numa01_THREAD_ALLOC              :   127.07  vs.  119.97   |            +5.9 %
   numa02                           :    18.08  vs.   16.82   |            +7.4 %
   numa02_SMT                       :    36.97  vs.   16.98   |          +117.7 %
   ------------------------------------------------------------

[ Note: the 'numa01' result is a bit slower, due to us not
  taking node distances into account on larger than 2-node
  systems, and due to this run spreading the tasks in a
  suboptimal A-B-A-B order instead of A-A-B-B. There's a 50%
  chance of that outcome and this run got the worse convergence
  layout.

  That behavior due to node asymmetry will be improved in future
  versions. Note that even in the less ideal layout it's faster
  than mainline. ]

- The twice-as-fast numa02_SMT result is a regression fix.

- 'numa02' and 'numa01_THREAD_ALLOC' got genuinely faster - and
  that's good news because those are our prime target 'good'
  NUMA workloads.

The SPECjbb 4x JVM numbers are still very close to the
hard-binding results:

  Fri Dec  7 02:08:42 CET 2012
  spec1.txt:           throughput =     188667.94 SPECjbb2005 bops
  spec2.txt:           throughput =     190109.31 SPECjbb2005 bops
  spec3.txt:           throughput =     191438.13 SPECjbb2005 bops
  spec4.txt:           throughput =     192508.34 SPECjbb2005 bops
                                      --------------------------
        SUM:           throughput =     762723.72 SPECjbb2005 bops

The same is true for !THP as well.

( In case you have sent a regression report, please re-test this
  version - I'll try to work down some of my email backlog and
  reply to any mails I have not yet replied to. )

Reports, fixes, suggestions are welcome, as always!

Thanks,

	Ingo

------------------------------------------------->
Ingo Molnar (9):
  numa, sched: Fix NUMA tick ->numa_shared setting
  numa, sched: Add tracking of runnable NUMA tasks
  numa, sched: Implement wake-cpu migration support
  numa, mm, sched: Implement last-CPU+PID hash tracking
  numa, mm, sched: Fix NUMA affinity tracking logic
  numa, mm: Fix !THP, 4K-pte "2M-emu" NUMA fault handling
  numa, sched: Improve staggered convergence
  numa, sched: Improve directed convergence
  numa, sched: Streamline and fix numa_allow_migration() use

 include/linux/init_task.h         |   4 +-
 include/linux/mempolicy.h         |   4 +-
 include/linux/mm.h                |  79 +++++---
 include/linux/mm_types.h          |   4 +-
 include/linux/page-flags-layout.h |  23 ++-
 include/linux/sched.h             |   9 +-
 kernel/sched/core.c               |  29 ++-
 kernel/sched/fair.c               | 370 +++++++++++++++++++++++++++-----------
 kernel/sched/features.h           |   2 +
 kernel/sched/sched.h              |   4 +
 kernel/sysctl.c                   |   8 +
 mm/huge_memory.c                  |  25 +--
 mm/memory.c                       | 175 +++++++++++++-----
 mm/mempolicy.c                    |  50 ++++--
 mm/migrate.c                      |   4 +-
 mm/mprotect.c                     |   4 +-
 mm/page_alloc.c                   |   6 +-
 17 files changed, 574 insertions(+), 226 deletions(-)

-- 
1.7.11.7



* [PATCH 1/9] numa, sched: Fix NUMA tick ->numa_shared setting
  2012-12-07  0:19 [GIT TREE] Unified NUMA balancing tree, v3 Ingo Molnar
@ 2012-12-07  0:19 ` Ingo Molnar
  2012-12-07  0:19 ` [PATCH 2/9] numa, sched: Add tracking of runnable NUMA tasks Ingo Molnar
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: Ingo Molnar @ 2012-12-07  0:19 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

Split out __sched_setnuma(), an unlocked variant of
sched_setnuma(), and use it from the NUMA tick when we are
modifying p->numa_shared.
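
For illustration (not part of the patch): a minimal userspace sketch
of the locked-wrapper/unlocked-core pattern this split follows, with
a pthread mutex standing in for the runqueue lock and made-up
struct/field names (build with -pthread):

#include <pthread.h>
#include <stdio.h>

struct task {
	pthread_mutex_t rq_lock;	/* stands in for the rq lock */
	int numa_shared;
	int numa_max_node;
};

/* Unlocked core: the caller must already hold t->rq_lock. */
static void __set_numa(struct task *t, int node, int shared)
{
	t->numa_max_node = node;
	t->numa_shared   = shared;
}

/* Locking wrapper for callers that do not hold the lock yet. */
static void set_numa(struct task *t, int node, int shared)
{
	pthread_mutex_lock(&t->rq_lock);
	__set_numa(t, node, shared);
	pthread_mutex_unlock(&t->rq_lock);
}

int main(void)
{
	struct task t = {
		.rq_lock = PTHREAD_MUTEX_INITIALIZER,
		.numa_shared = -1, .numa_max_node = -1,
	};

	set_numa(&t, 0, 1);		/* lock not held: use the wrapper  */

	pthread_mutex_lock(&t.rq_lock);	/* tick-like context: lock held,   */
	__set_numa(&t, 1, 0);		/* so call the __ variant directly */
	pthread_mutex_unlock(&t.rq_lock);

	printf("node=%d shared=%d\n", t.numa_max_node, t.numa_shared);
	return 0;
}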

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c  | 19 +++++++++++++++----
 kernel/sched/fair.c  |  2 +-
 kernel/sched/sched.h |  1 +
 3 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 69b18b3..cce84c3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6091,13 +6091,10 @@ static struct sched_domain_topology_level *sched_domain_topology = default_topol
 /*
  * Change a task's NUMA state - called from the placement tick.
  */
-void sched_setnuma(struct task_struct *p, int node, int shared)
+void __sched_setnuma(struct rq *rq, struct task_struct *p, int node, int shared)
 {
-	unsigned long flags;
 	int on_rq, running;
-	struct rq *rq;
 
-	rq = task_rq_lock(p, &flags);
 	on_rq = p->on_rq;
 	running = task_current(rq, p);
 
@@ -6113,6 +6110,20 @@ void sched_setnuma(struct task_struct *p, int node, int shared)
 		p->sched_class->set_curr_task(rq);
 	if (on_rq)
 		enqueue_task(rq, p, 0);
+}
+
+/*
+ * Change a task's NUMA state - called from the placement tick.
+ */
+void sched_setnuma(struct task_struct *p, int node, int shared)
+{
+	unsigned long flags;
+	struct rq *rq;
+
+	rq = task_rq_lock(p, &flags);
+
+	__sched_setnuma(rq, p, node, shared);
+
 	task_rq_unlock(rq, p, &flags);
 
 	/*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 21c10f7..0c83689 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2540,7 +2540,7 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
 	/* Cheap checks first: */
 	if (!task_numa_candidate(curr)) {
 		if (curr->numa_shared >= 0)
-			curr->numa_shared = -1;
+			__sched_setnuma(rq, curr, -1, -1);
 		return;
 	}
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0fdd304..f75bf06 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -513,6 +513,7 @@ DECLARE_PER_CPU(struct rq, runqueues);
 #define raw_rq()		(&__raw_get_cpu_var(runqueues))
 
 #ifdef CONFIG_NUMA_BALANCING
+extern void __sched_setnuma(struct rq *rq, struct task_struct *p, int node, int shared);
 extern void sched_setnuma(struct task_struct *p, int node, int shared);
 static inline void task_numa_free(struct task_struct *p)
 {
-- 
1.7.11.7



* [PATCH 2/9] numa, sched: Add tracking of runnable NUMA tasks
  2012-12-07  0:19 [GIT TREE] Unified NUMA balancing tree, v3 Ingo Molnar
  2012-12-07  0:19 ` [PATCH 1/9] numa, sched: Fix NUMA tick ->numa_shared setting Ingo Molnar
@ 2012-12-07  0:19 ` Ingo Molnar
  2012-12-07  0:19 ` [PATCH 3/9] numa, sched: Implement wake-cpu migration support Ingo Molnar
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: Ingo Molnar @ 2012-12-07  0:19 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

This is mostly taken from:

  sched: Add adaptive NUMA affinity support

  Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
  Date:   Sun Nov 11 15:09:59 2012 +0100

With some robustness changes to make sure we will dequeue
the same state we enqueued.
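
The robustness aspect amounts to snapshotting the accounting-relevant
state at enqueue time and undoing exactly that snapshot at dequeue
time, so a ->numa_shared flip while the task is runnable cannot
unbalance the counters. A minimal standalone sketch of the idea
(the counter and field names are illustrative, not the scheduler's):

#include <assert.h>
#include <stdio.h>

struct rq_stats  { long nr_numa, nr_shared; };
struct task_info {
	int shared;		/* live state, may change at any time   */
	int shared_enqueue;	/* snapshot taken when we were enqueued */
};

static void numa_enqueue(struct rq_stats *rq, struct task_info *t)
{
	t->shared_enqueue = t->shared;		/* take the snapshot */
	if (t->shared_enqueue != -1) {
		rq->nr_numa++;
		rq->nr_shared += t->shared_enqueue;
	}
}

static void numa_dequeue(struct rq_stats *rq, struct task_info *t)
{
	if (t->shared_enqueue != -1) {		/* undo the snapshot */
		rq->nr_numa--;
		rq->nr_shared -= t->shared_enqueue;
	}
	t->shared_enqueue = -1;
}

int main(void)
{
	struct rq_stats rq = { 0, 0 };
	struct task_info t = { .shared = 1, .shared_enqueue = -1 };

	numa_enqueue(&rq, &t);
	t.shared = 0;		/* state flips while runnable ...         */
	numa_dequeue(&rq, &t);	/* ... yet the counters still balance out */

	assert(rq.nr_numa == 0 && rq.nr_shared == 0);
	printf("balanced: nr_numa=%ld nr_shared=%ld\n",
	       rq.nr_numa, rq.nr_shared);
	return 0;
}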

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/init_task.h |  3 ++-
 include/linux/sched.h     |  2 ++
 kernel/sched/core.c       |  6 ++++++
 kernel/sched/fair.c       | 48 +++++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h      |  3 +++
 5 files changed, 59 insertions(+), 3 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index ed98982..a5da0fc 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -145,7 +145,8 @@ extern struct task_group root_task_group;
 
 #ifdef CONFIG_NUMA_BALANCING
 # define INIT_TASK_NUMA(tsk)						\
-	.numa_shared = -1,
+	.numa_shared = -1,						\
+	.numa_shared_enqueue = -1
 #else
 # define INIT_TASK_NUMA(tsk)
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6a29dfd..ee39f6b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1504,6 +1504,7 @@ struct task_struct {
 #endif
 #ifdef CONFIG_NUMA_BALANCING
 	int numa_shared;
+	int numa_shared_enqueue;
 	int numa_max_node;
 	int numa_scan_seq;
 	unsigned long numa_scan_ts_secs;
@@ -1511,6 +1512,7 @@ struct task_struct {
 	u64 node_stamp;			/* migration stamp  */
 	unsigned long convergence_strength;
 	int convergence_node;
+	unsigned long numa_weight;
 	unsigned long *numa_faults;
 	unsigned long *numa_faults_curr;
 	struct callback_head numa_scan_work;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cce84c3..a7f0000 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1554,6 +1554,9 @@ static void __sched_fork(struct task_struct *p)
 	}
 
 	p->numa_shared = -1;
+	p->numa_weight = 0;
+	p->numa_shared_enqueue = -1;
+	p->numa_max_node = -1;
 	p->node_stamp = 0ULL;
 	p->convergence_strength		= 0;
 	p->convergence_node		= -1;
@@ -6103,6 +6106,9 @@ void __sched_setnuma(struct rq *rq, struct task_struct *p, int node, int shared)
 	if (running)
 		p->sched_class->put_prev_task(rq, p);
 
+	WARN_ON_ONCE(p->numa_shared_enqueue != -1);
+	WARN_ON_ONCE(p->numa_weight);
+
 	p->numa_shared = shared;
 	p->numa_max_node = node;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0c83689..8cdbfde 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -801,6 +801,45 @@ static unsigned long task_h_load(struct task_struct *p);
 #endif
 
 #ifdef CONFIG_NUMA_BALANCING
+static void account_numa_enqueue(struct rq *rq, struct task_struct *p)
+{
+	int shared = task_numa_shared(p);
+
+	WARN_ON_ONCE(p->numa_weight);
+
+	if (shared != -1) {
+		p->numa_weight = p->se.load.weight;
+		WARN_ON_ONCE(!p->numa_weight);
+		p->numa_shared_enqueue = shared;
+
+		rq->nr_numa_running++;
+		rq->nr_shared_running += shared;
+		rq->nr_ideal_running += (cpu_to_node(task_cpu(p)) == p->numa_max_node);
+		rq->numa_weight += p->numa_weight;
+	} else {
+		if (p->numa_weight) {
+			WARN_ON_ONCE(p->numa_weight);
+			p->numa_weight = 0;
+		}
+	}
+}
+
+static void account_numa_dequeue(struct rq *rq, struct task_struct *p)
+{
+	if (p->numa_shared_enqueue != -1) {
+		rq->nr_numa_running--;
+		rq->nr_shared_running -= p->numa_shared_enqueue;
+		rq->nr_ideal_running -= (cpu_to_node(task_cpu(p)) == p->numa_max_node);
+		rq->numa_weight -= p->numa_weight;
+		p->numa_weight = 0;
+		p->numa_shared_enqueue = -1;
+	} else {
+		if (p->numa_weight) {
+			WARN_ON_ONCE(p->numa_weight);
+			p->numa_weight = 0;
+		}
+	}
+}
 
 /*
  * Scan @scan_size MB every @scan_period after an initial @scan_delay.
@@ -2551,8 +2590,11 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
 #else /* !CONFIG_NUMA_BALANCING: */
 #ifdef CONFIG_SMP
 static inline int task_ideal_cpu(struct task_struct *p)				{ return -1; }
+static inline void account_numa_enqueue(struct rq *rq, struct task_struct *p)	{ }
 #endif
+static inline void account_numa_dequeue(struct rq *rq, struct task_struct *p)	{ }
 static inline void task_tick_numa(struct rq *rq, struct task_struct *curr)	{ }
+static inline void task_numa_migrate(struct task_struct *p, int next_cpu)	{ }
 #endif /* CONFIG_NUMA_BALANCING */
 
 /**************************************************
@@ -2569,6 +2611,7 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	if (entity_is_task(se)) {
 		struct rq *rq = rq_of(cfs_rq);
 
+		account_numa_enqueue(rq, task_of(se));
 		list_add(&se->group_node, &rq->cfs_tasks);
 	}
 #endif /* CONFIG_SMP */
@@ -2581,9 +2624,10 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	update_load_sub(&cfs_rq->load, se->load.weight);
 	if (!parent_entity(se))
 		update_load_sub(&rq_of(cfs_rq)->load, se->load.weight);
-	if (entity_is_task(se))
+	if (entity_is_task(se)) {
 		list_del_init(&se->group_node);
-
+		account_numa_dequeue(rq_of(cfs_rq), task_of(se));
+	}
 	cfs_rq->nr_running--;
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f75bf06..f00eb80 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -438,6 +438,9 @@ struct rq {
 	struct list_head cfs_tasks;
 
 #ifdef CONFIG_NUMA_BALANCING
+	unsigned long numa_weight;
+	unsigned long nr_numa_running;
+	unsigned long nr_ideal_running;
 	struct task_struct *curr_buddy;
 #endif
 	unsigned long nr_shared_running;	/* 0 on non-NUMA */
-- 
1.7.11.7



* [PATCH 3/9] numa, sched: Implement wake-cpu migration support
  2012-12-07  0:19 [GIT TREE] Unified NUMA balancing tree, v3 Ingo Molnar
  2012-12-07  0:19 ` [PATCH 1/9] numa, sched: Fix NUMA tick ->numa_shared setting Ingo Molnar
  2012-12-07  0:19 ` [PATCH 2/9] numa, sched: Add tracking of runnable NUMA tasks Ingo Molnar
@ 2012-12-07  0:19 ` Ingo Molnar
  2012-12-07  0:19 ` [PATCH 4/9] numa, mm, sched: Implement last-CPU+PID hash tracking Ingo Molnar
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: Ingo Molnar @ 2012-12-07  0:19 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

Task flipping via sched_rebalance_to() was only partially successful
in a number of cases, especially with modified preemption options.

The reason was that when we'd migrate over to the target node,
with the right timing the source task might already be sleeping
waiting for the migration thread to run - which prevented it
from changing its target CPU.

But we cannot simply set the CPU in the migration handler, because
our per-entity load average calculations rely on tasks spending
their sleeping time on the CPU they went to sleep on, and only
being requeued at wakeup.

So introduce a ->wake_cpu construct to allow the migration at
wakeup time. This gives us maximum information while still
preserving the task-flipping destination.

( Also make sure we don't wake up on CPUs that are outside
  the hard affinity ->cpus_allowed CPU mask. )
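
Conceptually ->wake_cpu is a one-shot "resume here" hint: the
migration side records the destination for a task that is already
sleeping, and the wakeup side consumes the hint once, subject to the
allowed-CPU mask. A minimal standalone sketch of that hand-off, with
made-up names and a plain bitmask for the affinity mask:

#include <stdio.h>

struct task {
	int on_rq;			/* 1 = runnable, 0 = sleeping */
	int wake_cpu;			/* -1 = no pending hint       */
	unsigned int cpus_allowed;	/* bitmask of permitted CPUs  */
};

/* Migration side: the task is already asleep, so just leave a hint. */
static void migrate_to(struct task *t, int dest_cpu)
{
	if (!t->on_rq)
		t->wake_cpu = dest_cpu;
	/* (a runnable task would be moved right away instead) */
}

/* Wakeup side: consume the hint if it is set and allowed. */
static int select_wake_cpu(struct task *t, int prev_cpu)
{
	int cpu = t->wake_cpu;

	if (cpu != -1 && (t->cpus_allowed & (1u << cpu))) {
		t->wake_cpu = -1;	/* one-shot */
		return cpu;
	}
	return prev_cpu;
}

int main(void)
{
	struct task t = { .on_rq = 0, .wake_cpu = -1, .cpus_allowed = 0x0f };

	migrate_to(&t, 2);		/* CPU 2 is allowed: hint is taken */
	printf("wake on CPU %d\n", select_wake_cpu(&t, 0));

	migrate_to(&t, 6);		/* CPU 6 is outside cpus_allowed   */
	printf("wake on CPU %d\n", select_wake_cpu(&t, 0));
	return 0;
}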

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/init_task.h |  3 ++-
 include/linux/sched.h     |  1 +
 kernel/sched/core.c       |  3 +++
 kernel/sched/fair.c       | 14 +++++++++++---
 4 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/include/linux/init_task.h b/include/linux/init_task.h
index a5da0fc..ec31d7b 100644
--- a/include/linux/init_task.h
+++ b/include/linux/init_task.h
@@ -146,7 +146,8 @@ extern struct task_group root_task_group;
 #ifdef CONFIG_NUMA_BALANCING
 # define INIT_TASK_NUMA(tsk)						\
 	.numa_shared = -1,						\
-	.numa_shared_enqueue = -1
+	.numa_shared_enqueue = -1,					\
+	.wake_cpu = -1,
 #else
 # define INIT_TASK_NUMA(tsk)
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ee39f6b..1c3cc50 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1502,6 +1502,7 @@ struct task_struct {
 	short il_next;
 	short pref_node_fork;
 #endif
+	int wake_cpu;
 #ifdef CONFIG_NUMA_BALANCING
 	int numa_shared;
 	int numa_shared_enqueue;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a7f0000..cfa8426 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1546,6 +1546,7 @@ static void __sched_fork(struct task_struct *p)
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 	INIT_HLIST_HEAD(&p->preempt_notifiers);
 #endif
+	p->wake_cpu = -1;
 
 #ifdef CONFIG_NUMA_BALANCING
 	if (p->mm && atomic_read(&p->mm->mm_users) == 1) {
@@ -4782,6 +4783,8 @@ static int __migrate_task(struct task_struct *p, int src_cpu, int dest_cpu)
 		set_task_cpu(p, dest_cpu);
 		enqueue_task(rq_dest, p, 0);
 		check_preempt_curr(rq_dest, p, 0);
+	} else {
+		p->wake_cpu = dest_cpu;
 	}
 done:
 	ret = 1;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8cdbfde..8664f39 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4963,6 +4963,12 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 	int new_cpu = cpu;
 	int want_affine = 0;
 	int sync = wake_flags & WF_SYNC;
+	int wake_cpu = p->wake_cpu;
+
+	if (wake_cpu != -1 && cpumask_test_cpu(wake_cpu, tsk_cpus_allowed(p))) {
+		p->wake_cpu = -1;
+		return wake_cpu;
+	}
 
 	if (p->nr_cpus_allowed == 1)
 		return prev_cpu;
@@ -5044,10 +5050,12 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 		/* while loop will break here if sd == NULL */
 	}
 unlock:
-	rcu_read_unlock();
+	if (!numa_allow_migration(p, prev0_cpu, new_cpu)) {
+		if (cpumask_test_cpu(prev0_cpu, tsk_cpus_allowed(p)))
+			new_cpu = prev0_cpu;
+	}
 
-	if (!numa_allow_migration(p, prev0_cpu, new_cpu))
-		return prev0_cpu;
+	rcu_read_unlock();
 
 	return new_cpu;
 }
-- 
1.7.11.7



* [PATCH 4/9] numa, mm, sched: Implement last-CPU+PID hash tracking
  2012-12-07  0:19 [GIT TREE] Unified NUMA balancing tree, v3 Ingo Molnar
                   ` (2 preceding siblings ...)
  2012-12-07  0:19 ` [PATCH 3/9] numa, sched: Implement wake-cpu migration support Ingo Molnar
@ 2012-12-07  0:19 ` Ingo Molnar
  2012-12-07  0:19 ` [PATCH 5/9] numa, mm, sched: Fix NUMA affinity tracking logic Ingo Molnar
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: Ingo Molnar @ 2012-12-07  0:19 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

We rely on the page::last_cpu field (embedded in the remaining bits
of the page flags field) to drive NUMA placement: last_cpu tells us
which tasks access memory on which node, and it also tells us
which tasks relate to each other, in that they access the same
pages.

last_cpu was a constant source of statistics skew: if a task
migrated from one CPU to another, it would still see last_cpu
accesses from its previous CPU (the pages it had accessed there),
but would account them as a 'shared' memory access relationship.

So in a sense a phantom task on the previous CPU would haunt our
task, even if this task never interacts with other tasks in
any serious manner.

Another skew of the statistics was that of preemption: if a task
ran on another CPU and accessed our pages but then descheduled,
we'd suspect that CPU - and the next task running on it - of being
in a sharing relationship with us.

To solve these skews and to improve the quality of the statistics,
hash the last 8 bits of the PID into the last_cpu field. We name
this 'cpupid' because it's cheaper to handle it as a single integer
in most places. Wherever code needs to take an actual look at the
embedded last_cpu and last_pid information, it can do so via
simple shifting and masking.

Propagate this all the way through the code and make use of it.
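
For illustration (not part of the patch), the shifting and masking
boils down to packing the CPU number and the low 8 bits of the PID
into one integer. The standalone sketch below mirrors the
cpu_pid_to_cpupid()/cpupid_to_cpu()/cpupid_to_pid() scheme, with an
8-bit CPU field picked just for the example (the kernel derives the
CPU width from NR_CPUS_BITS):

#include <stdio.h>

#define CPUPID_PID_BITS	8
#define CPUPID_PID_MASK	((1 << CPUPID_PID_BITS) - 1)
#define CPUPID_CPU_BITS	8			/* example width only */
#define CPUPID_CPU_MASK	((1 << CPUPID_CPU_BITS) - 1)

static int cpu_pid_to_cpupid(int cpu, int pid)
{
	return ((cpu & CPUPID_CPU_MASK) << CPUPID_PID_BITS) |
		(pid & CPUPID_PID_MASK);
}

static int cpupid_to_cpu(int cpupid)
{
	return (cpupid >> CPUPID_PID_BITS) & CPUPID_CPU_MASK;
}

static int cpupid_to_pid(int cpupid)
{
	return cpupid & CPUPID_PID_MASK;
}

int main(void)
{
	int cpupid = cpu_pid_to_cpupid(13, 4321);

	/* 4321 % 256 == 225: only the low 8 PID bits survive, which is
	 * where the "PIDs equal modulo 256" caveat below comes from. */
	printf("cpu=%d pid8=%d\n",
	       cpupid_to_cpu(cpupid), cpupid_to_pid(cpupid));
	return 0;
}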

As a result of this change the shared/private fault statistics
stabilized and improved markedly: convergence is faster
and less prone to workload noise. 4x JVM runs come to within
2-3% of the theoretical performance maximum:

 Thu Dec  6 16:10:34 CET 2012
 spec1.txt:           throughput =     190191.50 SPECjbb2005 bops
 spec2.txt:           throughput =     194783.63 SPECjbb2005 bops
 spec3.txt:           throughput =     192812.69 SPECjbb2005 bops
 spec4.txt:           throughput =     193898.09 SPECjbb2005 bops
                                      --------------------------
       SUM:           throughput =     771685.91 SPECjbb2005 bops

The cost is 8 more bits used from the page flags - this space
is still available on 64-bit systems when a common distro
config (Fedora) is compiled.

There is the potential of false sharing if the PIDs of two tasks
are equal modulo 256 - this degrades the statistics somewhat but
does not invalidate them completely. Related tasks are typically
launched close to each other, so I don't expect this to be a
problem in practice - if it is, then we can do some better (maybe
wider) PID hashing in the future.

This mechanism is only used on (default-off) CONFIG_NUMA_BALANCING=y
kernels.

Also, while at it, pass the 'migrated' information to the
task_numa_fault() handler consistently.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/mm.h                | 79 ++++++++++++++++++++++++++-------------
 include/linux/mm_types.h          |  4 +-
 include/linux/page-flags-layout.h | 23 ++++++++----
 include/linux/sched.h             |  4 +-
 kernel/sched/fair.c               | 26 +++++++++++--
 mm/huge_memory.c                  | 23 ++++++------
 mm/memory.c                       | 26 +++++++------
 mm/mempolicy.c                    | 23 ++++++++++--
 mm/migrate.c                      |  4 +-
 mm/page_alloc.c                   |  6 ++-
 10 files changed, 148 insertions(+), 70 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a9454ca..c576b43 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -585,7 +585,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 #define SECTIONS_PGOFF		((sizeof(unsigned long)*8) - SECTIONS_WIDTH)
 #define NODES_PGOFF		(SECTIONS_PGOFF - NODES_WIDTH)
 #define ZONES_PGOFF		(NODES_PGOFF - ZONES_WIDTH)
-#define LAST_CPU_PGOFF		(ZONES_PGOFF - LAST_CPU_WIDTH)
+#define LAST_CPUPID_PGOFF	(ZONES_PGOFF - LAST_CPUPID_WIDTH)
 
 /*
  * Define the bit shifts to access each section.  For non-existent
@@ -595,7 +595,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 #define SECTIONS_PGSHIFT	(SECTIONS_PGOFF * (SECTIONS_WIDTH != 0))
 #define NODES_PGSHIFT		(NODES_PGOFF * (NODES_WIDTH != 0))
 #define ZONES_PGSHIFT		(ZONES_PGOFF * (ZONES_WIDTH != 0))
-#define LAST_CPU_PGSHIFT	(LAST_CPU_PGOFF * (LAST_CPU_WIDTH != 0))
+#define LAST_CPUPID_PGSHIFT	(LAST_CPUPID_PGOFF * (LAST_CPUPID_WIDTH != 0))
 
 /* NODE:ZONE or SECTION:ZONE is used to ID a zone for the buddy allocator */
 #ifdef NODE_NOT_IN_PAGE_FLAGS
@@ -617,7 +617,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
 #define ZONES_MASK		((1UL << ZONES_WIDTH) - 1)
 #define NODES_MASK		((1UL << NODES_WIDTH) - 1)
 #define SECTIONS_MASK		((1UL << SECTIONS_WIDTH) - 1)
-#define LAST_CPU_MASK		((1UL << LAST_CPU_WIDTH) - 1)
+#define LAST_CPUPID_MASK	((1UL << LAST_CPUPID_WIDTH) - 1)
 #define ZONEID_MASK		((1UL << ZONEID_SHIFT) - 1)
 
 static inline enum zone_type page_zonenum(const struct page *page)
@@ -657,64 +657,93 @@ static inline int page_to_nid(const struct page *page)
 #endif
 
 #ifdef CONFIG_NUMA_BALANCING
-#ifdef LAST_CPU_NOT_IN_PAGE_FLAGS
-static inline int page_xchg_last_cpu(struct page *page, int cpu)
+
+static inline int cpupid_to_cpu(int cpupid)
+{
+	return (cpupid >> CPUPID_PID_BITS) & CPUPID_CPU_MASK;
+}
+
+static inline int cpupid_to_pid(int cpupid)
+{
+	return cpupid & CPUPID_PID_MASK;
+}
+
+static inline int cpu_pid_to_cpupid(int cpu, int pid)
+{
+	return ((cpu & CPUPID_CPU_MASK) << CPUPID_CPU_BITS) | (pid & CPUPID_PID_MASK);
+}
+
+#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
+static inline int page_xchg_last_cpupid(struct page *page, int cpupid)
 {
-	return xchg(&page->_last_cpu, cpu);
+	return xchg(&page->_last_cpupid, cpupid);
 }
 
-static inline int page_last_cpu(struct page *page)
+static inline int page_last__cpupid(struct page *page)
 {
-	return page->_last_cpu;
+	return page->_last_cpupid;
 }
 
-static inline void reset_page_last_cpu(struct page *page)
+static inline void reset_page_last_cpupid(struct page *page)
 {
-	page->_last_cpu = -1;
+	page->_last_cpupid = -1;
 }
+
 #else
-static inline int page_xchg_last_cpu(struct page *page, int cpu)
+static inline int page_xchg_last_cpupid(struct page *page, int cpupid)
 {
 	unsigned long old_flags, flags;
-	int last_cpu;
+	int last_cpupid;
 
 	do {
 		old_flags = flags = page->flags;
-		last_cpu = (flags >> LAST_CPU_PGSHIFT) & LAST_CPU_MASK;
+		last_cpupid = (flags >> LAST_CPUPID_PGSHIFT) & LAST_CPUPID_MASK;
+
+		flags &= ~(LAST_CPUPID_MASK << LAST_CPUPID_PGSHIFT);
+		flags |= (cpupid & LAST_CPUPID_MASK) << LAST_CPUPID_PGSHIFT;
 
-		flags &= ~(LAST_CPU_MASK << LAST_CPU_PGSHIFT);
-		flags |= (cpu & LAST_CPU_MASK) << LAST_CPU_PGSHIFT;
 	} while (unlikely(cmpxchg(&page->flags, old_flags, flags) != old_flags));
 
-	return last_cpu;
+	return last_cpupid;
+}
+
+static inline int page_last__cpupid(struct page *page)
+{
+	return (page->flags >> LAST_CPUPID_PGSHIFT) & LAST_CPUPID_MASK;
+}
+
+static inline void reset_page_last_cpupid(struct page *page)
+{
+	page_xchg_last_cpupid(page, -1);
 }
+#endif /* LAST_CPUPID_NOT_IN_PAGE_FLAGS */
 
-static inline int page_last_cpu(struct page *page)
+static inline int page_last__cpu(struct page *page)
 {
-	return (page->flags >> LAST_CPU_PGSHIFT) & LAST_CPU_MASK;
+	return cpupid_to_cpu(page_last__cpupid(page));
 }
 
-static inline void reset_page_last_cpu(struct page *page)
+static inline int page_last__pid(struct page *page)
 {
+	return cpupid_to_pid(page_last__cpupid(page));
 }
 
-#endif /* LAST_CPU_NOT_IN_PAGE_FLAGS */
-#else /* CONFIG_NUMA_BALANCING */
-static inline int page_xchg_last_cpu(struct page *page, int cpu)
+#else /* !CONFIG_NUMA_BALANCING: */
+static inline int page_xchg_last_cpupid(struct page *page, int cpu)
 {
 	return page_to_nid(page);
 }
 
-static inline int page_last_cpu(struct page *page)
+static inline int page_last__cpupid(struct page *page)
 {
 	return page_to_nid(page);
 }
 
-static inline void reset_page_last_cpu(struct page *page)
+static inline void reset_page_last_cpupid(struct page *page)
 {
 }
 
-#endif /* CONFIG_NUMA_BALANCING */
+#endif /* !CONFIG_NUMA_BALANCING */
 
 static inline struct zone *page_zone(const struct page *page)
 {
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index cd2be76..ba08f34 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -178,8 +178,8 @@ struct page {
 	void *shadow;
 #endif
 
-#ifdef LAST_CPU_NOT_IN_PAGE_FLAGS
-	int _last_cpu;
+#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
+	int _last_cpupid;
 #endif
 }
 /*
diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h
index b258132..9435d64 100644
--- a/include/linux/page-flags-layout.h
+++ b/include/linux/page-flags-layout.h
@@ -56,16 +56,23 @@
 #define NODES_WIDTH		0
 #endif
 
+/* Reduce false sharing: */
+#define CPUPID_PID_BITS		8
+#define CPUPID_PID_MASK		((1 << CPUPID_PID_BITS)-1)
+
+#define CPUPID_CPU_BITS		NR_CPUS_BITS
+#define CPUPID_CPU_MASK		((1 << CPUPID_CPU_BITS)-1)
+
 #ifdef CONFIG_NUMA_BALANCING
-#define LAST_CPU_SHIFT	NR_CPUS_BITS
+# define LAST_CPUPID_SHIFT	(CPUPID_CPU_BITS+CPUPID_PID_BITS)
 #else
-#define LAST_CPU_SHIFT	0
+# define LAST_CPUPID_SHIFT	0
 #endif
 
-#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_CPU_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
-#define LAST_CPU_WIDTH	LAST_CPU_SHIFT
+#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_CPUPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
+# define LAST_CPUPID_WIDTH	LAST_CPUPID_SHIFT
 #else
-#define LAST_CPU_WIDTH	0
+# define LAST_CPUPID_WIDTH	0
 #endif
 
 /*
@@ -73,11 +80,11 @@
  * there.  This includes the case where there is no node, so it is implicit.
  */
 #if !(NODES_WIDTH > 0 || NODES_SHIFT == 0)
-#define NODE_NOT_IN_PAGE_FLAGS
+# define NODE_NOT_IN_PAGE_FLAGS
 #endif
 
-#if defined(CONFIG_NUMA_BALANCING) && LAST_CPU_WIDTH == 0
-#define LAST_CPU_NOT_IN_PAGE_FLAGS
+#if defined(CONFIG_NUMA_BALANCING) && LAST_CPUPID_WIDTH == 0
+# define LAST_CPUPID_NOT_IN_PAGE_FLAGS
 #endif
 
 #endif /* _LINUX_PAGE_FLAGS_LAYOUT */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1c3cc50..1041c0d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1601,9 +1601,9 @@ struct task_struct {
 #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
 
 #ifdef CONFIG_NUMA_BALANCING
-extern void task_numa_fault(int node, int cpu, int pages);
+extern void task_numa_fault(unsigned long addr, int node, int cpupid, int pages, bool migrated);
 #else
-static inline void task_numa_fault(int node, int cpu, int pages) { }
+static inline void task_numa_fault(unsigned long addr, int node, int cpupid, int pages, bool migrated) { }
 #endif /* CONFIG_NUMA_BALANCING */
 
 /*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8664f39..1547d66 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2267,11 +2267,31 @@ out_hit:
 /*
  * Got a PROT_NONE fault for a page on @node.
  */
-void task_numa_fault(int node, int last_cpu, int pages)
+void task_numa_fault(unsigned long addr, int node, int last_cpupid, int pages, bool migrated)
 {
 	struct task_struct *p = current;
-	int priv = (task_cpu(p) == last_cpu);
-	int idx = 2*node + priv;
+	int this_cpu = raw_smp_processor_id();
+	int last_cpu = cpupid_to_cpu(last_cpupid);
+	int last_pid = cpupid_to_pid(last_cpupid);
+	int this_pid = current->pid & CPUPID_PID_MASK;
+	int priv;
+	int idx;
+
+	if (last_cpupid != cpu_pid_to_cpupid(-1, -1)) {
+		/* Did we access it last time around? */
+		if (last_pid == this_pid) {
+			priv = 1;
+		} else {
+			priv = 0;
+		}
+	} else {
+		/* The default for fresh pages is private: */
+		priv = 1;
+		last_cpu = this_cpu;
+		node = cpu_to_node(this_cpu);
+	}
+
+	idx = 2*node + priv;
 
 	WARN_ON_ONCE(last_cpu == -1 || node == -1);
 	BUG_ON(!p->numa_faults);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 53e08a2..e6820aa 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1024,10 +1024,10 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	struct page *page;
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
-	int last_cpu;
+	int last_cpupid;
 	int target_nid;
-	int current_nid = -1;
-	bool migrated;
+	int page_nid = -1;
+	bool migrated = false;
 	bool page_locked = false;
 
 	spin_lock(&mm->page_table_lock);
@@ -1036,10 +1036,11 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	page = pmd_page(pmd);
 	get_page(page);
-	current_nid = page_to_nid(page);
-	last_cpu = page_last_cpu(page);
+	page_nid = page_to_nid(page);
+	last_cpupid = page_last__cpupid(page);
+
 	count_vm_numa_event(NUMA_HINT_FAULTS);
-	if (current_nid == numa_node_id())
+	if (page_nid == numa_node_id())
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
 
 	target_nid = mpol_misplaced(page, vma, haddr);
@@ -1067,7 +1068,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 				pmdp, pmd, addr,
 				page, target_nid);
 	if (migrated)
-		current_nid = target_nid;
+		page_nid = target_nid;
 	else {
 		spin_lock(&mm->page_table_lock);
 		if (unlikely(!pmd_same(pmd, *pmdp))) {
@@ -1077,7 +1078,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		goto clear_pmdnuma;
 	}
 
-	task_numa_fault(current_nid, last_cpu, HPAGE_PMD_NR);
+	task_numa_fault(addr, page_nid, last_cpupid, HPAGE_PMD_NR, migrated);
 	return 0;
 
 clear_pmdnuma:
@@ -1090,8 +1091,8 @@ clear_pmdnuma:
 
 out_unlock:
 	spin_unlock(&mm->page_table_lock);
-	if (current_nid != -1)
-		task_numa_fault(current_nid, last_cpu, HPAGE_PMD_NR);
+	if (page_nid != -1)
+		task_numa_fault(addr, page_nid, last_cpupid, HPAGE_PMD_NR, migrated);
 	return 0;
 }
 
@@ -1384,7 +1385,7 @@ static void __split_huge_page_refcount(struct page *page)
 		page_tail->mapping = page->mapping;
 
 		page_tail->index = page->index + i;
-		page_xchg_last_cpu(page_tail, page_last_cpu(page));
+		page_xchg_last_cpupid(page_tail, page_last__cpupid(page));
 
 		BUG_ON(!PageAnon(page_tail));
 		BUG_ON(!PageUptodate(page_tail));
diff --git a/mm/memory.c b/mm/memory.c
index cca216e..6ebfbbe 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -70,8 +70,8 @@
 
 #include "internal.h"
 
-#ifdef LAST_CPU_NOT_IN_PAGE_FLAGS
-#warning Unfortunate NUMA config, growing page-frame for last_cpu.
+#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
+# warning Large NUMA config, growing page-frame for last_cpu+pid.
 #endif
 
 #ifndef CONFIG_NEED_MULTIPLE_NODES
@@ -3472,7 +3472,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	bool migrated = false;
 	spinlock_t *ptl;
 	int target_nid;
-	int last_cpu;
+	int last_cpupid;
 	int page_nid;
 
 	/*
@@ -3505,14 +3505,14 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	WARN_ON_ONCE(page_nid == -1);
 
 	/* Get it before mpol_misplaced() flips it: */
-	last_cpu = page_last_cpu(page);
-	WARN_ON_ONCE(last_cpu == -1);
+	last_cpupid = page_last__cpupid(page);
 
 	target_nid = numa_migration_target(page, vma, addr, page_nid);
 	if (target_nid == -1) {
 		pte_unmap_unlock(ptep, ptl);
 		goto out;
 	}
+	WARN_ON_ONCE(target_nid == page_nid);
 
 	/* Get a reference for migration: */
 	get_page(page);
@@ -3524,7 +3524,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		page_nid = target_nid;
 out:
 	/* Always account where the page currently is, physically: */
-	task_numa_fault(page_nid, last_cpu, 1);
+	task_numa_fault(addr, page_nid, last_cpupid, 1, migrated);
 
 	return 0;
 }
@@ -3562,7 +3562,7 @@ int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		struct page *page;
 		int page_nid;
 		int target_nid;
-		int last_cpu;
+		int last_cpupid;
 		bool migrated;
 		pte_t pteval;
 
@@ -3592,12 +3592,16 @@ int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		page_nid = page_to_nid(page);
 		WARN_ON_ONCE(page_nid == -1);
 
-		last_cpu = page_last_cpu(page);
-		WARN_ON_ONCE(last_cpu == -1);
+		last_cpupid = page_last__cpupid(page);
 
 		target_nid = numa_migration_target(page, vma, addr, page_nid);
-		if (target_nid == -1)
+		if (target_nid == -1) {
+			/* Always account where the page currently is, physically: */
+			task_numa_fault(addr, page_nid, last_cpupid, 1, 0);
+
 			continue;
+		}
+		WARN_ON_ONCE(target_nid == page_nid);
 
 		/* Get a reference for the migration: */
 		get_page(page);
@@ -3609,7 +3613,7 @@ int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			page_nid = target_nid;
 
 		/* Always account where the page currently is, physically: */
-		task_numa_fault(page_nid, last_cpu, 1);
+		task_numa_fault(addr, page_nid, last_cpupid, 1, migrated);
 
 		pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
 	}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 42da0f2..2f2095c 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2338,6 +2338,10 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 	struct zone *zone;
 	int page_nid = page_to_nid(page);
 	int target_node = page_nid;
+#ifdef CONFIG_NUMA_BALANCING
+	int cpupid_last_access = -1;
+	int cpu_last_access = -1;
+#endif
 
 	BUG_ON(!vma);
 
@@ -2394,15 +2398,18 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		BUG();
 	}
 
+#ifdef CONFIG_NUMA_BALANCING
 	/* Migrate the page towards the node whose CPU is referencing it */
 	if (pol->flags & MPOL_F_MORON) {
-		int cpu_last_access;
+		int this_cpupid;
 		int this_cpu;
 		int this_node;
 
 		this_cpu = raw_smp_processor_id();
 		this_node = numa_node_id();
 
+		this_cpupid = cpu_pid_to_cpupid(this_cpu, current->pid);
+
 		/*
 		 * Multi-stage node selection is used in conjunction
 		 * with a periodic migration fault to build a temporal
@@ -2424,12 +2431,20 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 		 * it less likely we act on an unlikely task<->page
 		 * relation.
 		 */
-		cpu_last_access = page_xchg_last_cpu(page, this_cpu);
+		cpupid_last_access = page_xchg_last_cpupid(page, this_cpupid);
 
-		/* Migrate towards us: */
-		if (cpu_last_access == this_cpu)
+		/* Freshly allocated pages not accessed by anyone else yet: */
+		if (cpupid_last_access == cpu_pid_to_cpupid(-1, -1)) {
+			cpu_last_access = this_cpu;
 			target_node = this_node;
+		} else {
+			cpu_last_access = cpupid_to_cpu(cpupid_last_access);
+			/* Migrate towards us in the default policy: */
+			if (cpu_last_access == this_cpu)
+				target_node = this_node;
+		}
 	}
+#endif
 out_keep_page:
 	mpol_cond_put(pol);
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 14202e7..9562fa8 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1458,7 +1458,7 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
 					  __GFP_NOWARN) &
 					 ~GFP_IOFS, 0);
 	if (newpage)
-		page_xchg_last_cpu(newpage, page_last_cpu(page));
+		page_xchg_last_cpupid(newpage, page_last__cpupid(page));
 
 	return newpage;
 }
@@ -1567,7 +1567,7 @@ int migrate_misplaced_transhuge_page_put(struct mm_struct *mm,
 		count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
 		goto out_dropref;
 	}
-	page_xchg_last_cpu(new_page, page_last_cpu(page));
+	page_xchg_last_cpupid(new_page, page_last__cpupid(page));
 
 	isolated = numamigrate_isolate_page(pgdat, page);
 	if (!isolated) {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 92e88bd..6d72372 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -608,7 +608,7 @@ static inline int free_pages_check(struct page *page)
 		bad_page(page);
 		return 1;
 	}
-	reset_page_last_cpu(page);
+	reset_page_last_cpupid(page);
 	if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
 		page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
 	return 0;
@@ -850,6 +850,8 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags)
 	arch_alloc_page(page, order);
 	kernel_map_pages(page, 1 << order, 1);
 
+	reset_page_last_cpupid(page);
+
 	if (gfp_flags & __GFP_ZERO)
 		prep_zero_page(page, order, gfp_flags);
 
@@ -3827,7 +3829,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 		mminit_verify_page_links(page, zone, nid, pfn);
 		init_page_count(page);
 		reset_page_mapcount(page);
-		reset_page_last_cpu(page);
+		reset_page_last_cpupid(page);
 		SetPageReserved(page);
 		/*
 		 * Mark the block movable so that blocks are reserved for
-- 
1.7.11.7



* [PATCH 5/9] numa, mm, sched: Fix NUMA affinity tracking logic
  2012-12-07  0:19 [GIT TREE] Unified NUMA balancing tree, v3 Ingo Molnar
                   ` (3 preceding siblings ...)
  2012-12-07  0:19 ` [PATCH 4/9] numa, mm, sched: Implement last-CPU+PID hash tracking Ingo Molnar
@ 2012-12-07  0:19 ` Ingo Molnar
  2012-12-07  0:19 ` [PATCH 6/9] numa, mm: Fix !THP, 4K-pte "2M-emu" NUMA fault handling Ingo Molnar
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: Ingo Molnar @ 2012-12-07  0:19 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

Support for the p->numa_policy affinity tracking by the scheduler
went missing during the mm/ unification: revive and integrate it
properly.

( This in particular fixes NUMA_POLICY_MANYBUDDIES - the bug
  caused a few regressions in various workloads such as
  numa01, and regressed !THP workloads in particular. )
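
A rough standalone sketch of the check this restores: use the task's
NUMA policy only when it actually spans at least two nodes, otherwise
fall back to the local default policy. The types and names below are
illustrative stand-ins for the mm/mempolicy.c ones, with a popcount
mimicking nodes_weight():

#include <stdio.h>

struct fake_policy { unsigned long nodes; };	/* bit N = node N allowed */

static int nodes_weight(unsigned long mask)
{
	return __builtin_popcountl(mask);	/* GCC/clang builtin */
}

/* Return the task policy only when it is a real multi-node policy. */
static const struct fake_policy *
pick_policy(const struct fake_policy *task_pol,
	    const struct fake_policy *local_pol, int task_is_shared)
{
	if (task_is_shared && nodes_weight(task_pol->nodes) >= 2)
		return task_pol;
	return local_pol;
}

int main(void)
{
	struct fake_policy task_pol  = { .nodes = 0x3 };  /* nodes 0 and 1 */
	struct fake_policy local_pol = { .nodes = 0x1 };  /* local node    */

	printf("task policy spans %d node(s) -> use %s policy\n",
	       nodes_weight(task_pol.nodes),
	       pick_policy(&task_pol, &local_pol, 1) == &task_pol ?
	       "the task's" : "the local default");
	return 0;
}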

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 mm/mempolicy.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 2f2095c..6bb9fd0 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -121,8 +121,10 @@ static struct mempolicy default_policy_local = {
 static struct mempolicy *default_policy(void)
 {
 #ifdef CONFIG_NUMA_BALANCING
-	if (task_numa_shared(current) == 1)
-		return &current->numa_policy;
+	struct mempolicy *pol = &current->numa_policy;
+
+	if (task_numa_shared(current) == 1 && nodes_weight(pol->v.nodes) >= 2)
+		return pol;
 #endif
 	return &default_policy_local;
 }
@@ -135,6 +137,11 @@ static struct mempolicy *get_task_policy(struct task_struct *p)
 	int node;
 
 	if (!pol) {
+#ifdef CONFIG_NUMA_BALANCING
+		pol = default_policy();
+		if (pol != &default_policy_local)
+			return pol;
+#endif
 		node = numa_node_id();
 		if (node != -1)
 			pol = &preferred_node_policy[node];
@@ -2367,7 +2374,8 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 			shift = PAGE_SHIFT;
 
 		target_node = interleave_nid(pol, vma, addr, shift);
-		break;
+
+		goto out_keep_page;
 	}
 
 	case MPOL_PREFERRED:
-- 
1.7.11.7



* [PATCH 6/9] numa, mm: Fix !THP, 4K-pte "2M-emu" NUMA fault handling
  2012-12-07  0:19 [GIT TREE] Unified NUMA balancing tree, v3 Ingo Molnar
                   ` (4 preceding siblings ...)
  2012-12-07  0:19 ` [PATCH 5/9] numa, mm, sched: Fix NUMA affinity tracking logic Ingo Molnar
@ 2012-12-07  0:19 ` Ingo Molnar
  2012-12-07  0:19 ` [PATCH 7/9] numa, sched: Improve staggered convergence Ingo Molnar
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: Ingo Molnar @ 2012-12-07  0:19 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

The !THP pte_numa code from the unified tree is not working very well
for me: I suspect it would work better with migration bandwidth throttling
in place, but without that (and in the form of my port to the unified tree)
it performs badly in a number of situations:

 - when for whatever reason the numa_pmd entry is not established
   yet and threads are hitting the 4K ptes then the pte lock can
   kill performance quickly:

    19.29%        process 1  [kernel.kallsyms]          [k] do_raw_spin_lock
                  |
                  --- do_raw_spin_lock
                     |
                     |--99.67%-- _raw_spin_lock
                     |          |
                     |          |--34.47%-- remove_migration_pte
                     |          |          rmap_walk
                     |          |          move_to_new_page
                     |          |          migrate_pages
                     |          |          migrate_misplaced_page_put
                     |          |          __do_numa_page.isra.56
                     |          |          handle_pte_fault
                     |          |          handle_mm_fault
                     |          |          __do_page_fault
                     |          |          do_page_fault
                     |          |          page_fault
                     |          |          __memset_sse2
                     |          |
                     |          |--34.32%-- __page_check_address
                     |          |          try_to_unmap_one
                     |          |          try_to_unmap_anon
                     |          |          try_to_unmap
                     |          |          migrate_pages
                     |          |          migrate_misplaced_page_put
                     |          |          __do_numa_page.isra.56
                     |          |          handle_pte_fault
                     |          |          handle_mm_fault
                     |          |          __do_page_fault
                     |          |          do_page_fault
                     |          |          page_fault
                     |          |          __memset_sse2
                     |          |
    [...]

 - even if the pmd entry is established we'd hit ptes in a loop while
   other CPUs do it too, seeing the migration ptes as they are being
   established and torn down - resulting in up to 1 million page faults
   per second on my test-system. Not a happy sight and you really don't
   want me to cite that profile here.

So import the 2M-EMU handling code from the v17 numa/core tree, which
was working reasonably well, and add a few other goodies as well:

 - let the first page of an emulated large page determine the target
   node - and also pass down the expected interleaving shift to
   mpol_misplaced(), for overload situations where one group of threads
   spans multiple nodes.

 - turn off the pmd clustering in change_protection() - because the
   2M-emu code works better at the moment. We can re-establish it if
   it's enhanced. I kept both variants for the time being, feedback
   is welcome on this issue.

 - instead of calling mpol_misplaced() 512 times per emulated hugepage,
   extract the cpupid operation from it. Results in measurably lower
   CPU overhead for this functionality.

4K-intensive workloads are immediately much happier: 3-5K pagefaults/sec
on my 32-way test-box and far fewer migrations all around.
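
As a rough illustration of the 2M-emu batching described above
(independent of the real fault path): round the faulting address down
to the 2MB pmd boundary, let the first 4K page of that range pick the
target node, then walk the remaining pages of the range in the same
pass so they all follow one placement decision. The helpers and the
small fake address space below are made up for the sketch:

#include <stdio.h>

#define PAGE_SHIFT	12
#define PMD_SHIFT	21
#define PAGES_PER_PMD	(1 << (PMD_SHIFT - PAGE_SHIFT))	/* 512 */
#define FAKE_PAGES	(4 * PAGES_PER_PMD)	/* fake 8MB address space */

static int page_node[FAKE_PAGES];		/* fake per-page node info */

/* Stand-in for the mpol_misplaced(page0, vma, addr, PMD_SHIFT) call. */
static int pick_target_node(unsigned long first_pfn)
{
	return (int)(first_pfn / PAGES_PER_PMD) % 2;	/* arbitrary */
}

static void numa_fault_2m_emu(unsigned long fault_addr)
{
	unsigned long first_pfn =
		(fault_addr >> PMD_SHIFT) << (PMD_SHIFT - PAGE_SHIFT);
	int target_nid = pick_target_node(first_pfn);
	unsigned long pfn;

	/* One pass over all 512 "ptes" of this pmd range: */
	for (pfn = first_pfn; pfn < first_pfn + PAGES_PER_PMD; pfn++)
		page_node[pfn] = target_nid;
}

int main(void)
{
	/* Two faults inside the same 2MB range, one placement decision: */
	numa_fault_2m_emu(0x200000UL + 0x03000);
	numa_fault_2m_emu(0x200000UL + 0x7f000);

	printf("node of the 2MB range at 0x200000: %d\n",
	       page_node[0x200000UL >> PAGE_SHIFT]);
	return 0;
}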

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/mempolicy.h |   4 +-
 mm/huge_memory.c          |   2 +-
 mm/memory.c               | 153 +++++++++++++++++++++++++++++++++++-----------
 mm/mempolicy.c            |  13 +---
 mm/mprotect.c             |   4 +-
 5 files changed, 127 insertions(+), 49 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index f44b7f3..8bb6ab5 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -161,7 +161,7 @@ static inline int vma_migratable(struct vm_area_struct *vma)
 	return 1;
 }
 
-extern int mpol_misplaced(struct page *, struct vm_area_struct *, unsigned long);
+extern int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long addr, int shift);
 
 #else
 
@@ -289,7 +289,7 @@ static inline int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol,
 }
 
 static inline int mpol_misplaced(struct page *page, struct vm_area_struct *vma,
-				 unsigned long address)
+				 unsigned long address, int shift)
 {
 	return -1; /* no node preference */
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e6820aa..7c82f28 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1043,7 +1043,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (page_nid == numa_node_id())
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
 
-	target_nid = mpol_misplaced(page, vma, haddr);
+	target_nid = mpol_misplaced(page, vma, haddr, HPAGE_SHIFT);
 	if (target_nid == -1) {
 		put_page(page);
 		goto clear_pmdnuma;
diff --git a/mm/memory.c b/mm/memory.c
index 6ebfbbe..fc0026e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3455,6 +3455,7 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
 }
 
+#ifdef CONFIG_NUMA_BALANCING
 static int numa_migration_target(struct page *page, struct vm_area_struct *vma,
 			       unsigned long addr, int page_nid)
 {
@@ -3462,57 +3463,50 @@ static int numa_migration_target(struct page *page, struct vm_area_struct *vma,
 	if (page_nid == numa_node_id())
 		count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
 
-	return mpol_misplaced(page, vma, addr);
+	return mpol_misplaced(page, vma, addr, PAGE_SHIFT);
 }
 
-int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
-		   unsigned long addr, pte_t pte, pte_t *ptep, pmd_t *pmd)
+static int __do_numa_page(int target_nid, struct mm_struct *mm, struct vm_area_struct *vma,
+			unsigned long addr, pte_t *ptep, pmd_t *pmd,
+			unsigned int flags, pte_t pte, spinlock_t *ptl)
 {
 	struct page *page = NULL;
 	bool migrated = false;
-	spinlock_t *ptl;
-	int target_nid;
 	int last_cpupid;
 	int page_nid;
 
-	/*
-	* The "pte" at this point cannot be used safely without
-	* validation through pte_unmap_same(). It's of NUMA type but
-	* the pfn may be screwed if the read is non atomic.
-	*
-	* ptep_modify_prot_start is not called as this is clearing
-	* the _PAGE_NUMA bit and it is not really expected that there
-	* would be concurrent hardware modifications to the PTE.
-	*/
-	ptl = pte_lockptr(mm, pmd);
-	spin_lock(ptl);
-	if (unlikely(!pte_same(*ptep, pte))) {
-		pte_unmap_unlock(ptep, ptl);
-		return 0;
-	}
-
+	/* Mark it non-NUMA first: */
 	pte = pte_mknonnuma(pte);
 	set_pte_at(mm, addr, ptep, pte);
 	update_mmu_cache(vma, addr, ptep);
 
 	page = vm_normal_page(vma, addr, pte);
-	if (!page) {
-		pte_unmap_unlock(ptep, ptl);
+	if (!page)
 		return 0;
-	}
 
 	page_nid = page_to_nid(page);
 	WARN_ON_ONCE(page_nid == -1);
 
-	/* Get it before mpol_misplaced() flips it: */
-	last_cpupid = page_last__cpupid(page);
+	/*
+	 * Propagate the last_cpupid access info, even though
+	 * the target_nid has already been established for
+	 * this NID range:
+	 */
+	{
+		int this_cpupid;
+		int this_cpu;
+		int this_node;
+
+		this_cpu = raw_smp_processor_id();
+		this_node = numa_node_id();
 
-	target_nid = numa_migration_target(page, vma, addr, page_nid);
-	if (target_nid == -1) {
-		pte_unmap_unlock(ptep, ptl);
-		goto out;
+		this_cpupid = cpu_pid_to_cpupid(this_cpu, current->pid);
+
+		last_cpupid = page_xchg_last_cpupid(page, this_cpupid);
 	}
-	WARN_ON_ONCE(target_nid == page_nid);
+
+	if (target_nid == -1 || target_nid == page_nid)
+		goto out;
 
 	/* Get a reference for migration: */
 	get_page(page);
@@ -3522,6 +3516,8 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	migrated = migrate_misplaced_page_put(page, target_nid); /* Drops the reference */
 	if (migrated)
 		page_nid = target_nid;
+
+	spin_lock(ptl);
 out:
 	/* Always account where the page currently is, physically: */
 	task_numa_fault(addr, page_nid, last_cpupid, 1, migrated);
@@ -3529,9 +3525,81 @@ out:
 	return 0;
 }
 
+/*
+ * Also fault over nearby ptes from within the same pmd and vma,
+ * in order to minimize the overhead from page fault exceptions:
+ */
+static int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+			unsigned long addr0, pte_t *ptep0, pmd_t *pmd,
+			unsigned int flags, pte_t entry0)
+{
+	unsigned long addr0_pmd;
+	unsigned long addr_start;
+	unsigned long addr;
+	struct page *page0;
+	spinlock_t *ptl;
+	pte_t *ptep_start;
+	pte_t *ptep;
+	pte_t entry;
+	int target_nid;
+
+	WARN_ON_ONCE(addr0 < vma->vm_start || addr0 >= vma->vm_end);
+
+	addr0_pmd = addr0 & PMD_MASK;
+	addr_start = max(addr0_pmd, vma->vm_start);
+
+	ptep_start = pte_offset_map(pmd, addr_start);
+	ptl = pte_lockptr(mm, pmd);
+	spin_lock(ptl);
+
+	ptep = ptep_start+1;
+
+	/*
+	 * The first page of the range represents the NUMA
+	 * placement of the range. This way we get consistent
+	 * placement even if the faults themselves might hit
+	 * this area at different offsets:
+	 */
+	target_nid = -1;
+	entry = ACCESS_ONCE(*ptep_start);
+	if (pte_present(entry)) {
+		page0 = vm_normal_page(vma, addr_start, entry);
+		if (page0) {
+			target_nid = mpol_misplaced(page0, vma, addr_start, PMD_SHIFT);
+			if (target_nid == -1)
+				target_nid = page_to_nid(page0);
+		}
+		if (WARN_ON_ONCE(target_nid == -1))
+			target_nid = numa_node_id();
+	}
+
+	for (addr = addr_start+PAGE_SIZE; addr < vma->vm_end; addr += PAGE_SIZE, ptep++) {
+
+		if ((addr & PMD_MASK) != addr0_pmd)
+			break;
+
+		entry = ACCESS_ONCE(*ptep);
+
+		if (!pte_present(entry))
+			continue;
+		if (!pte_numa(entry))
+			continue;
+
+		__do_numa_page(target_nid, mm, vma, addr, ptep, pmd, flags, entry, ptl);
+	}
+
+	entry = ACCESS_ONCE(*ptep_start);
+	if (pte_present(entry) && pte_numa(entry))
+		__do_numa_page(target_nid, mm, vma, addr_start, ptep_start, pmd, flags, entry, ptl);
+
+	pte_unmap_unlock(ptep_start, ptl);
+
+	return 0;
+}
+
 /* NUMA hinting page fault entry point for regular pmds */
-int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
-		     unsigned long addr, pmd_t *pmdp)
+static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+			    unsigned long addr, pmd_t *pmdp)
 {
 	pmd_t pmd;
 	pte_t *pte, *orig_pte;
@@ -3558,6 +3626,7 @@ int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	VM_BUG_ON(offset >= PMD_SIZE);
 	orig_pte = pte = pte_offset_map_lock(mm, pmdp, _addr, &ptl);
 	pte += offset >> PAGE_SHIFT;
+
 	for (addr = _addr + offset; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {
 		struct page *page;
 		int page_nid;
@@ -3581,6 +3650,9 @@ int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		if (pte_numa(pteval)) {
 			pteval = pte_mknonnuma(pteval);
 			set_pte_at(mm, addr, pte, pteval);
+		} else {
+			/* Should not happen */
+			WARN_ON_ONCE(1);
 		}
 		page = vm_normal_page(vma, addr, pteval);
 		if (unlikely(!page))
@@ -3621,6 +3693,19 @@ int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	return 0;
 }
+#else
+static inline int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+			unsigned long addr0, pte_t *ptep0, pmd_t *pmd,
+			unsigned int flags, pte_t entry0)
+{
+	return 0;
+}
+static inline int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+		     unsigned long addr, pmd_t *pmdp)
+{
+	return 0;
+}
+#endif
 
 /*
  * These routines also need to handle stuff like marking pages dirty
@@ -3661,7 +3746,7 @@ int handle_pte_fault(struct mm_struct *mm,
 	}
 
 	if (pte_numa(entry))
-		return do_numa_page(mm, vma, address, entry, pte, pmd);
+		return do_numa_page(mm, vma, address, pte, pmd, flags, entry);
 
 	ptl = pte_lockptr(mm, pmd);
 	spin_lock(ptl);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 6bb9fd0..128e2e7 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2339,7 +2339,7 @@ static void sp_free(struct sp_node *n)
  * Policy determination "mimics" alloc_page_vma().
  * Called from fault path where we know the vma and faulting address.
  */
-int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long addr)
+int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long addr, int shift)
 {
 	struct mempolicy *pol;
 	struct zone *zone;
@@ -2353,6 +2353,7 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 	BUG_ON(!vma);
 
 	pol = get_vma_policy(current, vma, addr);
+
 	if (!(pol->flags & MPOL_F_MOF))
 		goto out_keep_page;
 	if (task_numa_shared(current) < 0)
@@ -2360,23 +2361,13 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 	
 	switch (pol->mode) {
 	case MPOL_INTERLEAVE:
-	{
-		int shift;
 
 		BUG_ON(addr >= vma->vm_end);
 		BUG_ON(addr < vma->vm_start);
 
-#ifdef CONFIG_HUGETLB_PAGE
-		if (transparent_hugepage_enabled(vma) || vma->vm_flags & VM_HUGETLB)
-			shift = HPAGE_SHIFT;
-		else
-#endif
-			shift = PAGE_SHIFT;
-
 		target_node = interleave_nid(pol, vma, addr, shift);
 
 		goto out_keep_page;
-	}
 
 	case MPOL_PREFERRED:
 		if (pol->flags & MPOL_F_LOCAL)
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 47335a9..b5be3f1 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -138,19 +138,21 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma, pud_t *
 		pages += change_pte_range(vma, pmd, addr, next, newprot,
 				 dirty_accountable, prot_numa, &all_same_node);
 
+#ifdef CONFIG_NUMA_BALANCING
 		/*
 		 * If we are changing protections for NUMA hinting faults then
 		 * set pmd_numa if the examined pages were all on the same
 		 * node. This allows a regular PMD to be handled as one fault
 		 * and effectively batches the taking of the PTL
 		 */
-		if (prot_numa && all_same_node) {
+		if (prot_numa && all_same_node && 0) {
 			struct mm_struct *mm = vma->vm_mm;
 
 			spin_lock(&mm->page_table_lock);
 			set_pmd_at(mm, addr & PMD_MASK, pmd, pmd_mknuma(*pmd));
 			spin_unlock(&mm->page_table_lock);
 		}
+#endif
 	} while (pmd++, addr = next, addr != end);
 
 	return pages;
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 7/9] numa, sched: Improve staggered convergence
  2012-12-07  0:19 [GIT TREE] Unified NUMA balancing tree, v3 Ingo Molnar
                   ` (5 preceding siblings ...)
  2012-12-07  0:19 ` [PATCH 6/9] numa, mm: Fix !THP, 4K-pte "2M-emu" NUMA fault handling Ingo Molnar
@ 2012-12-07  0:19 ` Ingo Molnar
  2012-12-07  0:19 ` [PATCH 8/9] numa, sched: Improve directed convergence Ingo Molnar
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 19+ messages in thread
From: Ingo Molnar @ 2012-12-07  0:19 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

Add/tune three convergence mechanisms:

 - add a sysctl for the fault moving average weight factor
   and change it to 3

 - add an initial settlement delay of 2 periods

 - add back parts of the throttling that trigger after a
   task migrates, to let it settle - with a sysctl that is
   set to 0 for now.

This tunes the code to be in harmony with our changed and
more precise 'cpupid' fault statistics, which allows us to
converge more aggressively without introducing destabilizing
turbulence into the convergence flow.
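
For illustration, here is a minimal stand-alone sketch of the new
weight-3 moving average update - the same arithmetic used for
p->numa_faults[] in the patch below, with made-up fault counts,
not kernel code:

  #include <stdio.h>

  int main(void)
  {
          unsigned int weight = 3;      /* sysctl_sched_numa_fault_weight */
          unsigned long avg = 0;        /* plays the role of p->numa_faults[idx] */
          unsigned long new_faults[] = { 300, 300, 300, 0, 0, 0 };
          unsigned int i;

          for (i = 0; i < 6; i++) {
                  /* Exponential moving average with the new weight: */
                  avg = avg*(weight-1) + new_faults[i];
                  avg /= weight;
                  printf("period %u: new=%3lu avg=%3lu\n", i, new_faults[i], avg);
          }
          return 0;
  }

With weight 3 the average reacts to a change in fault behavior within
2-3 scan periods, while the previous hard-coded *15/16 running average
needed on the order of 16 periods to settle.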

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h   |  2 ++
 kernel/sched/core.c     |  1 +
 kernel/sched/fair.c     | 78 ++++++++++++++++++++++++++++++++++++-------------
 kernel/sched/features.h |  2 ++
 kernel/sysctl.c         |  8 +++++
 5 files changed, 71 insertions(+), 20 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1041c0d..6e63022 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1509,6 +1509,7 @@ struct task_struct {
 	int numa_max_node;
 	int numa_scan_seq;
 	unsigned long numa_scan_ts_secs;
+	int numa_migrate_seq;
 	unsigned int numa_scan_period;
 	u64 node_stamp;			/* migration stamp  */
 	unsigned long convergence_strength;
@@ -2064,6 +2065,7 @@ extern unsigned int sysctl_sched_numa_scan_size_min;
 extern unsigned int sysctl_sched_numa_scan_size_max;
 extern unsigned int sysctl_sched_numa_rss_threshold;
 extern unsigned int sysctl_sched_numa_settle_count;
+extern unsigned int sysctl_sched_numa_fault_weight;
 
 #ifdef CONFIG_SCHED_DEBUG
 extern unsigned int sysctl_sched_migration_cost;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cfa8426..3c74af7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1562,6 +1562,7 @@ static void __sched_fork(struct task_struct *p)
 	p->convergence_strength		= 0;
 	p->convergence_node		= -1;
 	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
+	p->numa_migrate_seq = 2;
 	p->numa_faults = NULL;
 	p->numa_scan_period = sysctl_sched_numa_scan_delay;
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1547d66..fd49920 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -854,9 +854,20 @@ unsigned int sysctl_sched_numa_scan_size_max	__read_mostly = 512;	/* MB */
 unsigned int sysctl_sched_numa_rss_threshold	__read_mostly = 128;	/* MB */
 
 /*
- * Wait for the 2-sample stuff to settle before migrating again
+ * Wait for the 3-sample stuff to settle before migrating again
  */
-unsigned int sysctl_sched_numa_settle_count	__read_mostly = 2;
+unsigned int sysctl_sched_numa_settle_count	__read_mostly = 0;
+
+/*
+ * Weight of decay of the fault stats:
+ */
+unsigned int sysctl_sched_numa_fault_weight	__read_mostly = 3;
+
+static void task_numa_migrate(struct task_struct *p, int next_cpu)
+{
+	if (cpu_to_node(next_cpu) != cpu_to_node(task_cpu(p)))
+		p->numa_migrate_seq = 0;
+}
 
 static int task_ideal_cpu(struct task_struct *p)
 {
@@ -2051,7 +2062,9 @@ static void task_numa_placement_tick(struct task_struct *p)
 {
 	unsigned long total[2] = { 0, 0 };
 	unsigned long faults, max_faults = 0;
-	int node, priv, shared, ideal_node = -1;
+	unsigned long total_priv, total_shared;
+	int node, priv, new_shared, prev_shared, ideal_node = -1;
+	int settle_limit;
 	int flip_tasks;
 	int this_node;
 	int this_cpu;
@@ -2065,14 +2078,17 @@ static void task_numa_placement_tick(struct task_struct *p)
 		for (priv = 0; priv < 2; priv++) {
 			unsigned int new_faults;
 			unsigned int idx;
+			unsigned int weight;
 
 			idx = 2*node + priv;
 			new_faults = p->numa_faults_curr[idx];
 			p->numa_faults_curr[idx] = 0;
 
-			/* Keep a simple running average: */
-			p->numa_faults[idx] = p->numa_faults[idx]*15 + new_faults;
-			p->numa_faults[idx] /= 16;
+			/* Keep a simple exponential moving average: */
+			weight = sysctl_sched_numa_fault_weight;
+
+			p->numa_faults[idx] = p->numa_faults[idx]*(weight-1) + new_faults;
+			p->numa_faults[idx] /= weight;
 
 			faults += p->numa_faults[idx];
 			total[priv] += p->numa_faults[idx];
@@ -2092,23 +2108,38 @@ static void task_numa_placement_tick(struct task_struct *p)
 	 * we might want to consider a different equation below to reduce
 	 * the impact of a little private memory accesses.
 	 */
-	shared = p->numa_shared;
-
-	if (shared < 0) {
-		shared = (total[0] >= total[1]);
-	} else if (shared == 0) {
-		/* If it was private before, make it harder to become shared: */
-		if (total[0] >= total[1]*2)
-			shared = 1;
-	} else if (shared == 1 ) {
+	prev_shared = p->numa_shared;
+	new_shared = prev_shared;
+
+	settle_limit = sysctl_sched_numa_settle_count;
+
+	/*
+	 * Note: shared is spread across multiple tasks and in the future
+	 * we might want to consider a different equation below to reduce
+	 * the impact of a little private memory accesses.
+	 */
+	total_priv = total[1] / 2;
+	total_shared = total[0];
+
+	if (prev_shared < 0) {
+		/* Start out as private: */
+		new_shared = 0;
+	} else if (prev_shared == 0 && p->numa_migrate_seq >= settle_limit) {
+		/*
+		 * Hysteresis: if it was private before, make it harder to
+		 * become shared:
+		 */
+		if (total_shared*2 >= total_priv*3)
+			new_shared = 1;
+	} else if (prev_shared == 1 && p->numa_migrate_seq >= settle_limit) {
 		 /* If it was shared before, make it harder to become private: */
-		if (total[0]*2 <= total[1])
-			shared = 0;
+		if (total_shared*3 <= total_priv*2)
+			new_shared = 0;
 	}
 
 	flip_tasks = 0;
 
-	if (shared)
+	if (new_shared)
 		p->ideal_cpu = sched_update_ideal_cpu_shared(p, &flip_tasks);
 	else
 		p->ideal_cpu = sched_update_ideal_cpu_private(p);
@@ -2126,7 +2157,9 @@ static void task_numa_placement_tick(struct task_struct *p)
 			ideal_node = p->numa_max_node;
 	}
 
-	if (shared != task_numa_shared(p) || (ideal_node != -1 && ideal_node != p->numa_max_node)) {
+	if (new_shared != prev_shared || (ideal_node != -1 && ideal_node != p->numa_max_node)) {
+
+		p->numa_migrate_seq = 0;
 		/*
 		 * Fix up node migration fault statistics artifact, as we
 		 * migrate to another node we'll soon bring over our private
@@ -2141,7 +2174,7 @@ static void task_numa_placement_tick(struct task_struct *p)
 			p->numa_faults[idx_newnode] += p->numa_faults[idx_oldnode];
 			p->numa_faults[idx_oldnode] = 0;
 		}
-		sched_setnuma(p, ideal_node, shared);
+		sched_setnuma(p, ideal_node, new_shared);
 
 		/* Allocate only the maximum node: */
 		if (sched_feat(NUMA_POLICY_MAXNODE)) {
@@ -2323,6 +2356,10 @@ void task_numa_placement_work(struct callback_head *work)
 	if (p->flags & PF_EXITING)
 		return;
 
+	p->numa_migrate_seq++;
+	if (sched_feat(NUMA_SETTLE) && p->numa_migrate_seq < sysctl_sched_numa_settle_count)
+		return;
+
 	task_numa_placement_tick(p);
 }
 
@@ -5116,6 +5153,7 @@ static void
 migrate_task_rq_fair(struct task_struct *p, int next_cpu)
 {
 	migrate_task_rq_entity(p, next_cpu);
+	task_numa_migrate(p, next_cpu);
 }
 #endif /* CONFIG_SMP */
 
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 5598f63..c2f137f 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -63,6 +63,8 @@ SCHED_FEAT(NONTASK_POWER, true)
  */
 SCHED_FEAT(TTWU_QUEUE, true)
 
+SCHED_FEAT(NUMA_SETTLE,			true)
+
 SCHED_FEAT(FORCE_SD_OVERLAP,		false)
 SCHED_FEAT(RT_RUNTIME_SHARE,		true)
 SCHED_FEAT(LB_MIN,			false)
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 75ab895..254a2b4 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -401,6 +401,14 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+	{
+		.procname	= "sched_numa_fault_weight",
+		.data		= &sysctl_sched_numa_fault_weight,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+		.extra1		= &two,	/* a weight minimum of 2 */
+	},
 #endif /* CONFIG_NUMA_BALANCING */
 #endif /* CONFIG_SCHED_DEBUG */
 	{
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 8/9] numa, sched: Improve directed convergence
  2012-12-07  0:19 [GIT TREE] Unified NUMA balancing tree, v3 Ingo Molnar
                   ` (6 preceding siblings ...)
  2012-12-07  0:19 ` [PATCH 7/9] numa, sched: Improve staggered convergence Ingo Molnar
@ 2012-12-07  0:19 ` Ingo Molnar
  2012-12-07  0:19 ` [PATCH 9/9] numa, sched: Streamline and fix numa_allow_migration() use Ingo Molnar
  2012-12-10 18:22 ` [GIT TREE] Unified NUMA balancing tree, v3 Thomas Gleixner
  9 siblings, 0 replies; 19+ messages in thread
From: Ingo Molnar @ 2012-12-07  0:19 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

Improve a few aspects of directed convergence, whose problems
have become more visible with the improved (and thus more
aggressive) task-flipping code.

We also have the new 'cpupid' sharing info which converges more
precisely and thus highlights weaknesses in group balancing more
visibly:

 - We should only balance over buddy groups that are smaller
   than the other (not fully filled) buddy groups

 - Do not 'spread' buddy groups that fully fill a node

 - Do not 'spread' singular buddy groups

These bugs were prominently visible with certain preemption
options and timings in the previous code as well, so this
is also a regression fix.
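
In sketch form the new compress/spread decisions boil down to checks
along these lines - a simplified, stand-alone approximation of the
logic in the patch below, not the exact kernel code:

  #include <stdbool.h>

  /* Compression: only pull towards a larger, not fully filled buddy group: */
  bool should_compress(int our_group_size, int max_group_size, int node_size)
  {
          if (max_group_size == node_size)        /* target node already full */
                  return false;
          if (our_group_size == node_size)        /* we fill our own node */
                  return false;
          if (our_group_size > max_group_size)    /* we are not the smaller group */
                  return false;
          return true;
  }

  /* Spreading: only the smallest of at least two groups gives way: */
  bool should_spread(int group_count, bool our_group_smallest)
  {
          if (group_count <= 1)                   /* singular group, nothing to spread */
                  return false;
          if (!our_group_smallest)
                  return false;
          return true;
  }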

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 101 ++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 71 insertions(+), 30 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fd49920..c393fba 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -926,6 +926,7 @@ static long calc_node_load(int node, bool use_higher)
 	long cpu_load_highfreq;
 	long cpu_load_lowfreq;
 	long cpu_load_curr;
+	long cpu_load_numa;
 	long min_cpu_load;
 	long max_cpu_load;
 	long node_load;
@@ -935,18 +936,22 @@ static long calc_node_load(int node, bool use_higher)
 
 	for_each_cpu(cpu, cpumask_of_node(node)) {
 		struct rq *rq = cpu_rq(cpu);
+		long cpu_load;
 
+		cpu_load_numa		= rq->numa_weight;
 		cpu_load_curr		= rq->load.weight;
 		cpu_load_lowfreq	= rq->cpu_load[NUMA_LOAD_IDX_LOWFREQ];
 		cpu_load_highfreq	= rq->cpu_load[NUMA_LOAD_IDX_HIGHFREQ];
 
-		min_cpu_load = min(min(cpu_load_curr, cpu_load_lowfreq), cpu_load_highfreq);
-		max_cpu_load = max(max(cpu_load_curr, cpu_load_lowfreq), cpu_load_highfreq);
+		min_cpu_load = min(min(min(cpu_load_numa, cpu_load_curr), cpu_load_lowfreq), cpu_load_highfreq);
+		max_cpu_load = max(max(max(cpu_load_numa, cpu_load_curr), cpu_load_lowfreq), cpu_load_highfreq);
 
 		if (use_higher)
-			node_load += max_cpu_load;
+			cpu_load = max_cpu_load;
 		else
-			node_load += min_cpu_load;
+			cpu_load = min_cpu_load;
+
+		node_load += cpu_load;
 	}
 
 	return node_load;
@@ -1087,6 +1092,7 @@ static int find_intranode_imbalance(int this_node, int this_cpu)
 	long cpu_load_lowfreq;
 	long this_cpu_load;
 	long cpu_load_curr;
+	long cpu_load_numa;
 	long min_cpu_load;
 	long cpu_load;
 	int min_cpu;
@@ -1102,14 +1108,15 @@ static int find_intranode_imbalance(int this_node, int this_cpu)
 	for_each_cpu(cpu, cpumask_of_node(this_node)) {
 		struct rq *rq = cpu_rq(cpu);
 
+		cpu_load_numa		= rq->numa_weight;
 		cpu_load_curr		= rq->load.weight;
 		cpu_load_lowfreq	= rq->cpu_load[NUMA_LOAD_IDX_LOWFREQ];
 		cpu_load_highfreq	= rq->cpu_load[NUMA_LOAD_IDX_HIGHFREQ];
 
-		if (cpu == this_cpu) {
-			this_cpu_load = min(min(cpu_load_curr, cpu_load_lowfreq), cpu_load_highfreq);
-		}
-		cpu_load = max(max(cpu_load_curr, cpu_load_lowfreq), cpu_load_highfreq);
+		if (cpu == this_cpu)
+			this_cpu_load = min(min(min(cpu_load_numa, cpu_load_curr), cpu_load_lowfreq), cpu_load_highfreq);
+
+		cpu_load = max(max(max(cpu_load_numa, cpu_load_curr), cpu_load_lowfreq), cpu_load_highfreq);
 
 		/* Find the idlest CPU: */
 		if (cpu_load < min_cpu_load) {
@@ -1128,16 +1135,18 @@ static int find_intranode_imbalance(int this_node, int this_cpu)
 /*
  * Search a node for the smallest-group task and return
  * it plus the size of the group it is in.
+ *
+ * TODO: can be done with a single pass.
  */
-static int buddy_group_size(int node, struct task_struct *p)
+static int buddy_group_size(int node, struct task_struct *p, bool *our_group_p)
 {
 	const cpumask_t *node_cpus_mask = cpumask_of_node(node);
 	cpumask_t cpus_to_check_mask;
-	bool our_group_found;
+	bool buddies_group_found;
 	int cpu1, cpu2;
 
 	cpumask_copy(&cpus_to_check_mask, node_cpus_mask);
-	our_group_found = false;
+	buddies_group_found = false;
 
 	if (WARN_ON_ONCE(cpumask_empty(&cpus_to_check_mask)))
 		return 0;
@@ -1156,7 +1165,12 @@ static int buddy_group_size(int node, struct task_struct *p)
 
 			group_size = 1;
 			if (tasks_buddies(group_head, p))
-				our_group_found = true;
+				buddies_group_found = true;
+
+			if (group_head == p) {
+				*our_group_p = true;
+				buddies_group_found = true;
+			}
 
 			/* Non-NUMA-shared tasks are 1-task groups: */
 			if (task_numa_shared(group_head) != 1)
@@ -1169,21 +1183,22 @@ static int buddy_group_size(int node, struct task_struct *p)
 				struct task_struct *p2 = rq2->curr;
 
 				WARN_ON_ONCE(rq1 == rq2);
+				if (p2 == p)
+					*our_group_p = true;
 				if (tasks_buddies(group_head, p2)) {
 					/* 'group_head' and 'rq2->curr' are in the same group: */
 					cpumask_clear_cpu(cpu2, &cpus_to_check_mask);
 					group_size++;
 					if (tasks_buddies(p2, p))
-						our_group_found = true;
+						buddies_group_found = true;
 				}
 			}
 next:
-
 			/*
 			 * If we just found our group and checked all
 			 * node local CPUs then return the result:
 			 */
-			if (our_group_found)
+			if (buddies_group_found)
 				return group_size;
 		}
 	} while (!cpumask_empty(&cpus_to_check_mask));
@@ -1261,8 +1276,15 @@ pick_non_numa_task:
 	return min_group_cpu;
 }
 
-static int find_max_node(struct task_struct *p, int *our_group_size)
+/*
+ * Find the node with the biggest buddy group of ours, but exclude
+ * our own local group on this node and also exclude fully filled
+ * nodes:
+ */
+static int
+find_max_node(struct task_struct *p, int *our_group_size_p, int *max_group_size_p, int full_size)
 {
+	bool our_group = false;
 	int max_group_size;
 	int group_size;
 	int max_node;
@@ -1272,9 +1294,12 @@ static int find_max_node(struct task_struct *p, int *our_group_size)
 	max_node = -1;
 
 	for_each_node(node) {
-		int full_size = cpumask_weight(cpumask_of_node(node));
 
-		group_size = buddy_group_size(node, p);
+		group_size = buddy_group_size(node, p, &our_group);
+		if (our_group) {
+			our_group = false;
+			*our_group_size_p = group_size;
+		}
 		if (group_size == full_size)
 			continue;
 
@@ -1284,7 +1309,7 @@ static int find_max_node(struct task_struct *p, int *our_group_size)
 		}
 	}
 
-	*our_group_size = max_group_size;
+	*max_group_size_p = max_group_size;
 
 	return max_node;
 }
@@ -1460,19 +1485,23 @@ static int find_max_node(struct task_struct *p, int *our_group_size)
  */
 static int improve_group_balance_compress(struct task_struct *p, int this_cpu, int this_node)
 {
-	int our_group_size = -1;
+	int full_size = cpumask_weight(cpumask_of_node(this_node));
+	int max_group_size = -1;
 	int min_group_size = -1;
+	int our_group_size = -1;
 	int max_node;
 	int min_cpu;
 
 	if (!sched_feat(NUMA_GROUP_LB_COMPRESS))
 		return -1;
 
-	max_node = find_max_node(p, &our_group_size);
+	max_node = find_max_node(p, &our_group_size, &max_group_size, full_size);
 	if (max_node == -1)
 		return -1;
+	if (our_group_size == -1)
+		return -1;
 
-	if (WARN_ON_ONCE(our_group_size == -1))
+	if (our_group_size == full_size || our_group_size > max_group_size)
 		return -1;
 
 	/* We are already in the right spot: */
@@ -1517,6 +1546,7 @@ static int improve_group_balance_spread(struct task_struct *p, int this_cpu, int
 	long this_node_load = -1;
 	long delta_load_before;
 	long delta_load_after;
+	int group_count = 0;
 	int idlest_cpu = -1;
 	int cpu1, cpu2;
 
@@ -1585,6 +1615,7 @@ static int improve_group_balance_spread(struct task_struct *p, int this_cpu, int
 				min_group_size = group_size;
 			else
 				min_group_size = min(group_size, min_group_size);
+			group_count++;
 		}
 	} while (!cpumask_empty(&cpus_to_check_mask));
 
@@ -1594,13 +1625,23 @@ static int improve_group_balance_spread(struct task_struct *p, int this_cpu, int
 	 */
 	if (!found_our_group)
 		return -1;
+
 	if (!our_group_smallest)
 		return -1;
+
 	if (WARN_ON_ONCE(min_group_size == -1))
 		return -1;
 	if (WARN_ON_ONCE(our_group_size == -1))
 		return -1;
 
+	/* Since the current task is shared, this should not happen: */
+	if (WARN_ON_ONCE(group_count < 1))
+		return -1;
+
+	/* No point in moving if we are a single group: */
+	if (group_count <= 1)
+		return -1;
+
 	idlest_node = find_idlest_node(&idlest_cpu);
 	if (idlest_node == -1)
 		return -1;
@@ -1622,7 +1663,7 @@ static int improve_group_balance_spread(struct task_struct *p, int this_cpu, int
 	 */
 	delta_load_before = this_node_load - idlest_node_load;
 	delta_load_after = (this_node_load-this_group_load) - (idlest_node_load+this_group_load);
-	
+
 	if (abs(delta_load_after)+SCHED_LOAD_SCALE > abs(delta_load_before))
 		return -1;
 
@@ -1806,7 +1847,7 @@ static int sched_update_ideal_cpu_shared(struct task_struct *p, int *flip_tasks)
 	this_node_capacity	= calc_node_capacity(this_node);
 
 	this_node_overloaded = false;
-	if (this_node_load > this_node_capacity + 512)
+	if (this_node_load > this_node_capacity + SCHED_LOAD_SCALE/2)
 		this_node_overloaded = true;
 
 	/* If we'd stay within this node then stay put: */
@@ -1922,7 +1963,7 @@ static int sched_update_ideal_cpu_private(struct task_struct *p)
 	this_node_capacity	= calc_node_capacity(this_node);
 
 	this_node_overloaded = false;
-	if (this_node_load > this_node_capacity + 512)
+	if (this_node_load > this_node_capacity + SCHED_LOAD_SCALE/2)
 		this_node_overloaded = true;
 
 	if (this_node == min_node)
@@ -1934,8 +1975,8 @@ static int sched_update_ideal_cpu_private(struct task_struct *p)
 
 	WARN_ON_ONCE(max_node_load < min_node_load);
 
-	/* Is the load difference at least 125% of one standard task load? */
-	if (this_node_load - min_node_load < 1536)
+	/* Is the load difference at least 150% of one standard task load? */
+	if (this_node_load - min_node_load < SCHED_LOAD_SCALE*3/2)
 		goto out_check_intranode;
 
 	/*
@@ -5476,7 +5517,7 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p, bool preemp
  *
  * The adjacency matrix of the resulting graph is given by:
  *
- *             log_2 n     
+ *             log_2 n
  *   A_i,j = \Union     (i % 2^k == 0) && i / 2^(k+1) == j / 2^(k+1)  (6)
  *             k = 0
  *
@@ -5522,7 +5563,7 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p, bool preemp
  *
  * [XXX write more on how we solve this.. _after_ merging pjt's patches that
  *      rewrite all of this once again.]
- */ 
+ */
 
 static unsigned long __read_mostly max_load_balance_interval = HZ/10;
 
@@ -6133,7 +6174,7 @@ void update_group_power(struct sched_domain *sd, int cpu)
 		/*
 		 * !SD_OVERLAP domains can assume that child groups
 		 * span the current group.
-		 */ 
+		 */
 
 		group = child->groups;
 		do {
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 9/9] numa, sched: Streamline and fix numa_allow_migration() use
  2012-12-07  0:19 [GIT TREE] Unified NUMA balancing tree, v3 Ingo Molnar
                   ` (7 preceding siblings ...)
  2012-12-07  0:19 ` [PATCH 8/9] numa, sched: Improve directed convergence Ingo Molnar
@ 2012-12-07  0:19 ` Ingo Molnar
  2012-12-10 18:22 ` [GIT TREE] Unified NUMA balancing tree, v3 Thomas Gleixner
  9 siblings, 0 replies; 19+ messages in thread
From: Ingo Molnar @ 2012-12-07  0:19 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Rik van Riel, Mel Gorman, Andrew Morton, Andrea Arcangeli,
	Linus Torvalds, Thomas Gleixner, Johannes Weiner, Hugh Dickins

There were a few inconsistencies in how numa_allow_migration() was
used; in particular, it did not always take into account
high-imbalance scenarios, where affinity preferences are generally
overridden.

To fix this, make the use of numa_allow_migration() more consistent
and also pass the load-balancing environment into the function, where
it can look at env->failed and env->sd->cache_nice_tries.

Also add a NUMA check to ALB (active load balancing).
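
The intent of the new env argument, in stand-alone sketch form (the
struct names below are simplified stand-ins for the kernel's lb_env
and sched_domain fields; a NULL env is the wakeup path, which has no
balancing context):

  #include <stdbool.h>
  #include <stddef.h>

  struct sd_stub     { unsigned int cache_nice_tries; };
  struct lb_env_stub { struct sd_stub *sd; unsigned int failed; };

  /*
   * NUMA affinity preferences are only enforced while the balancer has
   * not failed more than sd->cache_nice_tries times on this domain;
   * after that, fixing the imbalance takes priority:
   */
  bool numa_affinity_applies(const struct lb_env_stub *env)
  {
          if (!env)       /* task placement / wakeup path: always apply */
                  return true;

          return env->failed <= env->sd->cache_nice_tries;
  }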

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 103 +++++++++++++++++++++++++++++-----------------------
 1 file changed, 57 insertions(+), 46 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c393fba..503ec29 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4792,6 +4792,39 @@ static inline unsigned long effective_load(struct task_group *tg, int cpu,
 
 #endif
 
+#define LBF_ALL_PINNED	0x01
+#define LBF_NEED_BREAK	0x02
+#define LBF_SOME_PINNED	0x04
+
+struct lb_env {
+	struct sched_domain	*sd;
+
+	struct rq		*src_rq;
+	int			src_cpu;
+
+	int			dst_cpu;
+	struct rq		*dst_rq;
+
+	struct cpumask		*dst_grpmask;
+	int			new_dst_cpu;
+	enum cpu_idle_type	idle;
+	long			imbalance;
+	/* The set of CPUs under consideration for load-balancing */
+	struct cpumask		*cpus;
+
+	unsigned int		flags;
+	unsigned int		failed;
+	unsigned int		iteration;
+
+	unsigned int		loop;
+	unsigned int		loop_break;
+	unsigned int		loop_max;
+
+	struct rq *		(*find_busiest_queue)(struct lb_env *,
+						      struct sched_group *);
+};
+
+
 static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 {
 	s64 this_load, load;
@@ -5011,30 +5044,35 @@ done:
 	return target;
 }
 
-static bool numa_allow_migration(struct task_struct *p, int prev_cpu, int new_cpu)
+static bool numa_allow_migration(struct task_struct *p, int prev_cpu, int new_cpu,
+				 struct lb_env *env)
 {
 #ifdef CONFIG_NUMA_BALANCING
+
 	if (sched_feat(NUMA_CONVERGE_MIGRATIONS)) {
 		/* Help in the direction of expected convergence: */
 		if (p->convergence_node >= 0 && (cpu_to_node(new_cpu) != p->convergence_node))
 			return false;
 
-		return true;
-	}
-
-	if (sched_feat(NUMA_BALANCE_ALL)) {
- 		if (task_numa_shared(p) >= 0)
-			return false;
-
-		return true;
+		if (!env || env->failed <= env->sd->cache_nice_tries) {
+			if (task_numa_shared(p) >= 0 &&
+					cpu_to_node(prev_cpu) != cpu_to_node(new_cpu))
+				return false;
+		}
 	}
 
 	if (sched_feat(NUMA_BALANCE_INTERNODE)) {
 		if (task_numa_shared(p) >= 0) {
- 			if (cpu_to_node(prev_cpu) != cpu_to_node(new_cpu))
+			if (cpu_to_node(prev_cpu) != cpu_to_node(new_cpu))
 				return false;
 		}
 	}
+
+	if (sched_feat(NUMA_BALANCE_ALL)) {
+		if (task_numa_shared(p) >= 0)
+			return false;
+	}
+
 #endif
 	return true;
 }
@@ -5148,7 +5186,7 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
 		/* while loop will break here if sd == NULL */
 	}
 unlock:
-	if (!numa_allow_migration(p, prev0_cpu, new_cpu)) {
+	if (!numa_allow_migration(p, prev0_cpu, new_cpu, NULL)) {
 		if (cpumask_test_cpu(prev0_cpu, tsk_cpus_allowed(p)))
 			new_cpu = prev0_cpu;
 	}
@@ -5567,38 +5605,6 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p, bool preemp
 
 static unsigned long __read_mostly max_load_balance_interval = HZ/10;
 
-#define LBF_ALL_PINNED	0x01
-#define LBF_NEED_BREAK	0x02
-#define LBF_SOME_PINNED	0x04
-
-struct lb_env {
-	struct sched_domain	*sd;
-
-	struct rq		*src_rq;
-	int			src_cpu;
-
-	int			dst_cpu;
-	struct rq		*dst_rq;
-
-	struct cpumask		*dst_grpmask;
-	int			new_dst_cpu;
-	enum cpu_idle_type	idle;
-	long			imbalance;
-	/* The set of CPUs under consideration for load-balancing */
-	struct cpumask		*cpus;
-
-	unsigned int		flags;
-	unsigned int		failed;
-	unsigned int		iteration;
-
-	unsigned int		loop;
-	unsigned int		loop_break;
-	unsigned int		loop_max;
-
-	struct rq *		(*find_busiest_queue)(struct lb_env *,
-						      struct sched_group *);
-};
-
 /*
  * move_task - move a task from one runqueue to another runqueue.
  * Both runqueues must be locked.
@@ -5699,7 +5705,7 @@ static int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	/* We do NUMA balancing elsewhere: */
 
 	if (env->failed <= env->sd->cache_nice_tries) {
-		if (!numa_allow_migration(p, env->src_rq->cpu, env->dst_cpu))
+		if (!numa_allow_migration(p, env->src_rq->cpu, env->dst_cpu, env))
 			return false;
 	}
 
@@ -5760,7 +5766,7 @@ static int move_one_task(struct lb_env *env)
 		if (!can_migrate_task(p, env))
 			continue;
 
-		if (!numa_allow_migration(p, env->src_rq->cpu, env->dst_cpu))
+		if (!numa_allow_migration(p, env->src_rq->cpu, env->dst_cpu, env))
 			continue;
 
 		move_task(p, env);
@@ -5823,7 +5829,7 @@ static int move_tasks(struct lb_env *env)
 		if (!can_migrate_task(p, env))
 			goto next;
 
-		if (!numa_allow_migration(p, env->src_rq->cpu, env->dst_cpu))
+		if (!numa_allow_migration(p, env->src_rq->cpu, env->dst_cpu, env))
 			goto next;
 
 		move_task(p, env);
@@ -6944,6 +6950,11 @@ more_balance:
 			goto out_pinned;
 		}
 
+		/* Is this active load-balancing NUMA-beneficial? */
+		if (!numa_allow_migration(busiest->curr, env.src_rq->cpu, env.dst_cpu, &env)) {
+			raw_spin_unlock_irqrestore(&busiest->lock, flags);
+			goto out;
+		}
 		/*
 		 * ->active_balance synchronizes accesses to
 		 * ->active_balance_work.  Once set, it's cleared
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [GIT TREE] Unified NUMA balancing tree, v3
  2012-12-07  0:19 [GIT TREE] Unified NUMA balancing tree, v3 Ingo Molnar
                   ` (8 preceding siblings ...)
  2012-12-07  0:19 ` [PATCH 9/9] numa, sched: Streamline and fix numa_allow_migration() use Ingo Molnar
@ 2012-12-10 18:22 ` Thomas Gleixner
  2012-12-10 18:41   ` Rik van Riel
  2012-12-10 19:32   ` Mel Gorman
  9 siblings, 2 replies; 19+ messages in thread
From: Thomas Gleixner @ 2012-12-10 18:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-mm, Peter Zijlstra, Paul Turner,
	Lee Schermerhorn, Christoph Lameter, Rik van Riel, Mel Gorman,
	Andrew Morton, Andrea Arcangeli, Linus Torvalds, Johannes Weiner,
	Hugh Dickins

On Fri, 7 Dec 2012, Ingo Molnar wrote:
> The SPECjbb 4x JVM numbers are still very close to the
> hard-binding results:
> 
>   Fri Dec  7 02:08:42 CET 2012
>   spec1.txt:           throughput =     188667.94 SPECjbb2005 bops
>   spec2.txt:           throughput =     190109.31 SPECjbb2005 bops
>   spec3.txt:           throughput =     191438.13 SPECjbb2005 bops
>   spec4.txt:           throughput =     192508.34 SPECjbb2005 bops
>                                       --------------------------
>         SUM:           throughput =     762723.72 SPECjbb2005 bops
> 
> And the same is true for !THP as well.

I could not resist throwing all the relevant trees on my own 4-node
machine and running a SPECjbb 4x JVM comparison. All results have been
averaged over 10 runs.

mainline:	v3.7-rc8
autonuma:	mm-autonuma-v28fastr4-mels-rebase
balancenuma:	mm-balancenuma-v10r3
numacore:	Unified NUMA balancing tree, v3

The config is based on an F16 config with CONFIG_PREEMPT_NONE=y and the
relevant NUMA options enabled for the 4 trees.

THP off: manual placement result:     125239

		Auto result	Auto/Manual	Auto/Mainline	Variance
mainline    :	     93945	0.750		1.000		 5.91%
autonuma    :	    123651	0.987		1.316		 5.15%
balancenuma :	     97327	0.777		1.036		 5.19%
numacore    :	    123009	0.982		1.309		 5.73%


THP on: manual placement result:     143170

		Auto result	Auto/Manual	Auto/Mainline	Variance
mainline    :	    104462	0.730		1.000		 8.47%
autonuma    :	    137363	0.959		1.315		 5.81%
balancenuma :	    112183	0.784		1.074		11.58%
numacore    :	    142728	0.997		1.366		 2.94%

So autonuma and numacore are basically on the same page, with a slight
advantage for numacore in the THP enabled case. balancenuma is closer
to mainline than to autonuma/numacore.

Thanks,

	tglx



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [GIT TREE] Unified NUMA balancing tree, v3
  2012-12-10 18:22 ` [GIT TREE] Unified NUMA balancing tree, v3 Thomas Gleixner
@ 2012-12-10 18:41   ` Rik van Riel
  2012-12-10 19:15     ` Ingo Molnar
  2012-12-10 19:32   ` Mel Gorman
  1 sibling, 1 reply; 19+ messages in thread
From: Rik van Riel @ 2012-12-10 18:41 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, linux-kernel, linux-mm, Peter Zijlstra, Paul Turner,
	Lee Schermerhorn, Christoph Lameter, Mel Gorman, Andrew Morton,
	Andrea Arcangeli, Linus Torvalds, Johannes Weiner, Hugh Dickins

On 12/10/2012 01:22 PM, Thomas Gleixner wrote:

> So autonuma and numacore are basically on the same page, with a slight
> advantage for numacore in the THP enabled case. balancenuma is closer
> to mainline than to autonuma/numacore.

Indeed, when the system is fully loaded, numacore does very well.

The main issues that have been observed with numacore are when
the system is only partially loaded. Something strange seems to
be going on that causes performance regressions in that situation.

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [GIT TREE] Unified NUMA balancing tree, v3
  2012-12-10 18:41   ` Rik van Riel
@ 2012-12-10 19:15     ` Ingo Molnar
  2012-12-10 19:28       ` Mel Gorman
  0 siblings, 1 reply; 19+ messages in thread
From: Ingo Molnar @ 2012-12-10 19:15 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Thomas Gleixner, linux-kernel, linux-mm, Peter Zijlstra,
	Paul Turner, Lee Schermerhorn, Christoph Lameter, Mel Gorman,
	Andrew Morton, Andrea Arcangeli, Linus Torvalds, Johannes Weiner,
	Hugh Dickins


* Rik van Riel <riel@redhat.com> wrote:

> On 12/10/2012 01:22 PM, Thomas Gleixner wrote:
> 
> > So autonuma and numacore are basically on the same page, 
> > with a slight advantage for numacore in the THP enabled 
> > case. balancenuma is closer to mainline than to 
> > autonuma/numacore.
> 
> Indeed, when the system is fully loaded, numacore does very 
> well.

Note that the latest (-v3) code also does well in under-loaded 
situations:

   http://lkml.org/lkml/2012/12/7/331

Here's the 'perf bench numa' comparison to 'balancenuma':

                            balancenuma  | NUMA-tip
 [test unit]            :          -v10  |    -v3
------------------------------------------------------------
 2x1-bw-process         :         6.136  |  9.647:  57.2%
 3x1-bw-process         :         7.250  | 14.528: 100.4%
 4x1-bw-process         :         6.867  | 18.903: 175.3%
 8x1-bw-process         :         7.974  | 26.829: 236.5%
 8x1-bw-process-NOTHP   :         5.937  | 22.237: 274.5%
 16x1-bw-process        :         5.592  | 29.294: 423.9%
 4x1-bw-thread          :        13.598  | 19.290:  41.9%
 8x1-bw-thread          :        16.356  | 26.391:  61.4%
 16x1-bw-thread         :        24.608  | 29.557:  20.1%
 32x1-bw-thread         :        25.477  | 30.232:  18.7%
 2x3-bw-thread          :         8.785  | 15.327:  74.5%
 4x4-bw-thread          :         6.366  | 27.957: 339.2%
 4x6-bw-thread          :         6.287  | 27.877: 343.4%
 4x8-bw-thread          :         5.860  | 28.439: 385.3%
 4x8-bw-thread-NOTHP    :         6.167  | 25.067: 306.5%
 3x3-bw-thread          :         8.235  | 21.560: 161.8%
 5x5-bw-thread          :         5.762  | 26.081: 352.6%
 2x16-bw-thread         :         5.920  | 23.269: 293.1%
 1x32-bw-thread         :         5.828  | 18.985: 225.8%
 numa02-bw              :        29.054  | 31.431:   8.2%
 numa02-bw-NOTHP        :        27.064  | 29.104:   7.5%
 numa01-bw-thread	:        20.338  | 28.607:  40.7%
 numa01-bw-thread-NOTHP :        18.528  | 21.119:  14.0%
------------------------------------------------------------

More than half of these testcases are under-loaded situations.

> The main issues that have been observed with numacore are when 
> the system is only partially loaded. Something strange seems 
> to be going on that causes performance regressions in that 
> situation.

I haven't seen such reports with -v3 yet, which is what Thomas 
tested. Mel has not tested -v3 yet AFAICS.

If there are any such instances left then I'll investigate, but 
right now it's looking pretty good.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [GIT TREE] Unified NUMA balancing tree, v3
  2012-12-10 19:15     ` Ingo Molnar
@ 2012-12-10 19:28       ` Mel Gorman
  2012-12-10 20:07         ` Ingo Molnar
  0 siblings, 1 reply; 19+ messages in thread
From: Mel Gorman @ 2012-12-10 19:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Rik van Riel, Thomas Gleixner, linux-kernel, linux-mm,
	Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Andrew Morton, Andrea Arcangeli, Linus Torvalds, Johannes Weiner,
	Hugh Dickins

On Mon, Dec 10, 2012 at 08:15:45PM +0100, Ingo Molnar wrote:
> 
> * Rik van Riel <riel@redhat.com> wrote:
> 
> > On 12/10/2012 01:22 PM, Thomas Gleixner wrote:
> > 
> > > So autonuma and numacore are basically on the same page, 
> > > with a slight advantage for numacore in the THP enabled 
> > > case. balancenuma is closer to mainline than to 
> > > autonuma/numacore.
> > 
> > Indeed, when the system is fully loaded, numacore does very 
> > well.
> 
> Note that the latest (-v3) code also does well in under-loaded 
> situations:
> 
>    http://lkml.org/lkml/2012/12/7/331
> 
> Here's the 'perf bench numa' comparison to 'balancenuma':
> 
>                             balancenuma  | NUMA-tip
>  [test unit]            :          -v10  |    -v3
> ------------------------------------------------------------
>  2x1-bw-process         :         6.136  |  9.647:  57.2%
>  3x1-bw-process         :         7.250  | 14.528: 100.4%
>  4x1-bw-process         :         6.867  | 18.903: 175.3%
>  8x1-bw-process         :         7.974  | 26.829: 236.5%
>  8x1-bw-process-NOTHP   :         5.937  | 22.237: 274.5%
>  16x1-bw-process        :         5.592  | 29.294: 423.9%
>  4x1-bw-thread          :        13.598  | 19.290:  41.9%
>  8x1-bw-thread          :        16.356  | 26.391:  61.4%
>  16x1-bw-thread         :        24.608  | 29.557:  20.1%
>  32x1-bw-thread         :        25.477  | 30.232:  18.7%
>  2x3-bw-thread          :         8.785  | 15.327:  74.5%
>  4x4-bw-thread          :         6.366  | 27.957: 339.2%
>  4x6-bw-thread          :         6.287  | 27.877: 343.4%
>  4x8-bw-thread          :         5.860  | 28.439: 385.3%
>  4x8-bw-thread-NOTHP    :         6.167  | 25.067: 306.5%
>  3x3-bw-thread          :         8.235  | 21.560: 161.8%
>  5x5-bw-thread          :         5.762  | 26.081: 352.6%
>  2x16-bw-thread         :         5.920  | 23.269: 293.1%
>  1x32-bw-thread         :         5.828  | 18.985: 225.8%
>  numa02-bw              :        29.054  | 31.431:   8.2%
>  numa02-bw-NOTHP        :        27.064  | 29.104:   7.5%
>  numa01-bw-thread	:        20.338  | 28.607:  40.7%
>  numa01-bw-thread-NOTHP :        18.528  | 21.119:  14.0%
> ------------------------------------------------------------
> 
> More than half of these testcases are under-loaded situations.
> 
> > The main issues that have been observed with numacore are when 
> > the system is only partially loaded. Something strange seems 
> > to be going on that causes performance regressions in that 
> > situation.
> 
> I haven't seen such reports with -v3 yet, which is what Thomas 
> tested. Mel has not tested -v3 yet AFAICS.
> 

Yes, I have. The drop I took and the results I posted to you were based
on a tip/master pull from December 9th. v3 was released on December
7th and your release said to test based on tip/master. The results are
here https://lkml.org/lkml/2012/12/9/108 . Look at the columns marked
numafix-20121209 which is tip/master with a bodge on top to remove the "if
(p->nr_cpus_allowed != num_online_cpus())" check.

To my continued frustration, the results begin at the line "Here is the
comparison on the rough off-chance you actually read it this time." I
guess you didn't feel the need.

> If there are any such instances left then I'll investigate, but 
> right now it's looking pretty good.
> 

If you had read that report, you would know that I didn't have results
for specjbb with THP enabled due to the JVM crashing with null pointer
exceptions.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [GIT TREE] Unified NUMA balancing tree, v3
  2012-12-10 18:22 ` [GIT TREE] Unified NUMA balancing tree, v3 Thomas Gleixner
  2012-12-10 18:41   ` Rik van Riel
@ 2012-12-10 19:32   ` Mel Gorman
  1 sibling, 0 replies; 19+ messages in thread
From: Mel Gorman @ 2012-12-10 19:32 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, linux-kernel, linux-mm, Peter Zijlstra, Paul Turner,
	Lee Schermerhorn, Christoph Lameter, Rik van Riel, Andrew Morton,
	Andrea Arcangeli, Linus Torvalds, Johannes Weiner, Hugh Dickins

On Mon, Dec 10, 2012 at 07:22:37PM +0100, Thomas Gleixner wrote:
> On Fri, 7 Dec 2012, Ingo Molnar wrote:
> > The SPECjbb 4x JVM numbers are still very close to the
> > hard-binding results:
> > 
> >   Fri Dec  7 02:08:42 CET 2012
> >   spec1.txt:           throughput =     188667.94 SPECjbb2005 bops
> >   spec2.txt:           throughput =     190109.31 SPECjbb2005 bops
> >   spec3.txt:           throughput =     191438.13 SPECjbb2005 bops
> >   spec4.txt:           throughput =     192508.34 SPECjbb2005 bops
> >                                       --------------------------
> >         SUM:           throughput =     762723.72 SPECjbb2005 bops
> > 
> > And the same is true for !THP as well.
> 
> I could not resist to throw all relevant trees on my own 4node machine
> and run a SPECjbb 4x JVM comparison. All results have been averaged
> over 10 runs.
> 
> mainline:	v3.7-rc8
> autonuma:	mm-autonuma-v28fastr4-mels-rebase
> balancenuma:	mm-balancenuma-v10r3
> numacore:	Unified NUMA balancing tree, v3
> 
> The config is based on a F16 config with CONFIG_PREEMPT_NONE=y and the
> relevant NUMA options enabled for the 4 trees.
> 

Ok, I had PREEMPT enabled so we differ on that at least. I don't know if
that would be enough to hide the problems that led to the JVM crashing on
me with the latest version of numacore.

> THP off: manual placement result:     125239
> 
> 		Auto result	Man/Auto	Mainline/Auto	Variance
> mainline    :	     93945	0.750		1.000		 5.91%
> autonuma    :	    123651	0.987		1.316		 5.15%
> balancenuma :	     97327	0.777		1.036		 5.19%
> numacore    :	    123009	0.982		1.309		 5.73%
> 
> 
> THP on: manual placement result:     143170
> 
> 		Auto result	Auto/Manual	Auto/Mainline	Variance
> mainline    :	    104462	0.730		1.000		 8.47%
> autonuma    :	    137363	0.959		1.315		 5.81%
> balancenuma :	    112183	0.784		1.074		11.58%
> numacore    :	    142728	0.997		1.366		 2.94%
> 
> So autonuma and numacore are basically on the same page, with a slight
> advantage for numacore in the THP enabled case. balancenuma is closer
> to mainline than to autonuma/numacore.
> 

I would expect balancenuma to be closer to mainline than autonuma,
whatever about numacore, for which I get mixed results. balancenuma's
objective was not to be the best; it was meant to be a baseline on which
either autonuma or numacore could compete based on scheduler policies,
while the MM portions would be common to either. If I thought otherwise
I would have spent the last 2 weeks working on the scheduler aspects,
which would have been generally unhelpful.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [GIT TREE] Unified NUMA balancing tree, v3
  2012-12-10 19:28       ` Mel Gorman
@ 2012-12-10 20:07         ` Ingo Molnar
  2012-12-10 20:10           ` Ingo Molnar
                             ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Ingo Molnar @ 2012-12-10 20:07 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Rik van Riel, Thomas Gleixner, linux-kernel, linux-mm,
	Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Andrew Morton, Andrea Arcangeli, Linus Torvalds, Johannes Weiner,
	Hugh Dickins


* Mel Gorman <mgorman@suse.de> wrote:

> On Mon, Dec 10, 2012 at 08:15:45PM +0100, Ingo Molnar wrote:
> > 
> > * Rik van Riel <riel@redhat.com> wrote:
> > 
> > > On 12/10/2012 01:22 PM, Thomas Gleixner wrote:
> > > 
> > > > So autonuma and numacore are basically on the same page, 
> > > > with a slight advantage for numacore in the THP enabled 
> > > > case. balancenuma is closer to mainline than to 
> > > > autonuma/numacore.
> > > 
> > > Indeed, when the system is fully loaded, numacore does very 
> > > well.
> > 
> > Note that the latest (-v3) code also does well in under-loaded 
> > situations:
> > 
> >    http://lkml.org/lkml/2012/12/7/331
> > 
> > Here's the 'perf bench numa' comparison to 'balancenuma':
> > 
> >                             balancenuma  | NUMA-tip
> >  [test unit]            :          -v10  |    -v3
> > ------------------------------------------------------------
> >  2x1-bw-process         :         6.136  |  9.647:  57.2%
> >  3x1-bw-process         :         7.250  | 14.528: 100.4%
> >  4x1-bw-process         :         6.867  | 18.903: 175.3%
> >  8x1-bw-process         :         7.974  | 26.829: 236.5%
> >  8x1-bw-process-NOTHP   :         5.937  | 22.237: 274.5%
> >  16x1-bw-process        :         5.592  | 29.294: 423.9%
> >  4x1-bw-thread          :        13.598  | 19.290:  41.9%
> >  8x1-bw-thread          :        16.356  | 26.391:  61.4%
> >  16x1-bw-thread         :        24.608  | 29.557:  20.1%
> >  32x1-bw-thread         :        25.477  | 30.232:  18.7%
> >  2x3-bw-thread          :         8.785  | 15.327:  74.5%
> >  4x4-bw-thread          :         6.366  | 27.957: 339.2%
> >  4x6-bw-thread          :         6.287  | 27.877: 343.4%
> >  4x8-bw-thread          :         5.860  | 28.439: 385.3%
> >  4x8-bw-thread-NOTHP    :         6.167  | 25.067: 306.5%
> >  3x3-bw-thread          :         8.235  | 21.560: 161.8%
> >  5x5-bw-thread          :         5.762  | 26.081: 352.6%
> >  2x16-bw-thread         :         5.920  | 23.269: 293.1%
> >  1x32-bw-thread         :         5.828  | 18.985: 225.8%
> >  numa02-bw              :        29.054  | 31.431:   8.2%
> >  numa02-bw-NOTHP        :        27.064  | 29.104:   7.5%
> >  numa01-bw-thread	:        20.338  | 28.607:  40.7%
> >  numa01-bw-thread-NOTHP :        18.528  | 21.119:  14.0%
> > ------------------------------------------------------------
> > 
> > More than half of these testcases are under-loaded situations.
> > 
> > > The main issues that have been observed with numacore are when 
> > > the system is only partially loaded. Something strange seems 
> > > to be going on that causes performance regressions in that 
> > > situation.
> > 
> > I haven't seen such reports with -v3 yet, which is what Thomas 
> > tested. Mel has not tested -v3 yet AFAICS.
> > 
> 
> Yes, I have. The drop I took and the results I posted to you 
> were based on a tip/master pull from December 9th. v3 was 
> released on December 7th and your release said to test based 
> on tip/master. The results are here 
> https://lkml.org/lkml/2012/12/9/108 . Look at the columns 
> marked numafix-20121209 which is tip/master with a bodge on 
> top to remove the "if (p->nr_cpus_allowed != 
> num_online_cpus())" check.

Ah, indeed - I saw those results but the 'numafix' tag threw me 
off.

Looks like at least in terms of AutoNUMA-benchmark numbers you 
measured the best-ever results with the -v3 tree? That aspect is 
obviously good news.

This part isn't:

> > If there are any such instances left then I'll investigate, 
> > but right now it's looking pretty good.
> 
> If you had read that report, you would know that I didn't have 
> results for specjbb with THP enabled due to the JVM crashing 
> with null pointer exceptions.

Hm, it's the unified tree where most of the mm/ bits are the 
AutoNUMA bits from your tree. (It does not match 100%, because 
your tree has an ancient version of key memory usage statistics 
that the scheduler needs for its convergence model. I'll take a 
look at the differences.)

Given how well the unified kernel performs, and given that the 
segfaults occur on your box, would you be willing to debug this 
a bit and help me out fixing the bug? Thanks!

	Ingo

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [GIT TREE] Unified NUMA balancing tree, v3
  2012-12-10 20:07         ` Ingo Molnar
@ 2012-12-10 20:10           ` Ingo Molnar
  2012-12-10 21:03           ` Ingo Molnar
  2012-12-10 22:19           ` Mel Gorman
  2 siblings, 0 replies; 19+ messages in thread
From: Ingo Molnar @ 2012-12-10 20:10 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Rik van Riel, Thomas Gleixner, linux-kernel, linux-mm,
	Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Andrew Morton, Andrea Arcangeli, Linus Torvalds, Johannes Weiner,
	Hugh Dickins


* Ingo Molnar <mingo@kernel.org> wrote:

> 
> * Mel Gorman <mgorman@suse.de> wrote:
> 
> > On Mon, Dec 10, 2012 at 08:15:45PM +0100, Ingo Molnar wrote:
> > > 
> > > * Rik van Riel <riel@redhat.com> wrote:
> > > 
> > > > On 12/10/2012 01:22 PM, Thomas Gleixner wrote:
> > > > 
> > > > > So autonuma and numacore are basically on the same page, 
> > > > > with a slight advantage for numacore in the THP enabled 
> > > > > case. balancenuma is closer to mainline than to 
> > > > > autonuma/numacore.
> > > > 
> > > > Indeed, when the system is fully loaded, numacore does very 
> > > > well.
> > > 
> > > Note that the latest (-v3) code also does well in under-loaded 
> > > situations:
> > > 
> > >    http://lkml.org/lkml/2012/12/7/331
> > > 
> > > Here's the 'perf bench numa' comparison to 'balancenuma':
> > > 
> > >                             balancenuma  | NUMA-tip
> > >  [test unit]            :          -v10  |    -v3
> > > ------------------------------------------------------------
> > >  2x1-bw-process         :         6.136  |  9.647:  57.2%
> > >  3x1-bw-process         :         7.250  | 14.528: 100.4%
> > >  4x1-bw-process         :         6.867  | 18.903: 175.3%
> > >  8x1-bw-process         :         7.974  | 26.829: 236.5%
> > >  8x1-bw-process-NOTHP   :         5.937  | 22.237: 274.5%
> > >  16x1-bw-process        :         5.592  | 29.294: 423.9%
> > >  4x1-bw-thread          :        13.598  | 19.290:  41.9%
> > >  8x1-bw-thread          :        16.356  | 26.391:  61.4%
> > >  16x1-bw-thread         :        24.608  | 29.557:  20.1%
> > >  32x1-bw-thread         :        25.477  | 30.232:  18.7%
> > >  2x3-bw-thread          :         8.785  | 15.327:  74.5%
> > >  4x4-bw-thread          :         6.366  | 27.957: 339.2%
> > >  4x6-bw-thread          :         6.287  | 27.877: 343.4%
> > >  4x8-bw-thread          :         5.860  | 28.439: 385.3%
> > >  4x8-bw-thread-NOTHP    :         6.167  | 25.067: 306.5%
> > >  3x3-bw-thread          :         8.235  | 21.560: 161.8%
> > >  5x5-bw-thread          :         5.762  | 26.081: 352.6%
> > >  2x16-bw-thread         :         5.920  | 23.269: 293.1%
> > >  1x32-bw-thread         :         5.828  | 18.985: 225.8%
> > >  numa02-bw              :        29.054  | 31.431:   8.2%
> > >  numa02-bw-NOTHP        :        27.064  | 29.104:   7.5%
> > >  numa01-bw-thread	:        20.338  | 28.607:  40.7%
> > >  numa01-bw-thread-NOTHP :        18.528  | 21.119:  14.0%
> > > ------------------------------------------------------------
> > > 
> > > More than half of these testcases are under-loaded situations.
> > > 
> > > > The main issues that have been observed with numacore are when 
> > > > the system is only partially loaded. Something strange seems 
> > > > to be going on that causes performance regressions in that 
> > > > situation.
> > > 
> > > I haven't seen such reports with -v3 yet, which is what Thomas 
> > > tested. Mel has not tested -v3 yet AFAICS.
> > > 
> > 
> > Yes, I have. The drop I took and the results I posted to you 
> > were based on a tip/master pull from December 9th. v3 was 
> > released on December 7th and your release said to test based 
> > on tip/master. The results are here 
> > https://lkml.org/lkml/2012/12/9/108 . Look at the columns 
> > marked numafix-20121209 which is tip/master with a bodge on 
> > top to remove the "if (p->nr_cpus_allowed != 
> > num_online_cpus())" check.
> 
> Ah, indeed - I saw those results but the 'numafix' tag threw me 
> off.
> 
> Looks like at least in terms of AutoNUMA-benchmark numbers you 
> measured the best-ever results with the -v3 tree? That aspect 
> is obviously good news.

... at least for the numa01 row. numa02 and numa01-THREAD_ALLOC 
aren't as good yet in your tests.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [GIT TREE] Unified NUMA balancing tree, v3
  2012-12-10 20:07         ` Ingo Molnar
  2012-12-10 20:10           ` Ingo Molnar
@ 2012-12-10 21:03           ` Ingo Molnar
  2012-12-10 22:19           ` Mel Gorman
  2 siblings, 0 replies; 19+ messages in thread
From: Ingo Molnar @ 2012-12-10 21:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Rik van Riel, Thomas Gleixner, linux-kernel, linux-mm,
	Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Andrew Morton, Andrea Arcangeli, Linus Torvalds, Johannes Weiner,
	Hugh Dickins


* Ingo Molnar <mingo@kernel.org> wrote:

> > If you had read that report, you would know that I didn't 
> > have results for specjbb with THP enabled due to the JVM 
> > crashing with null pointer exceptions.
> 
> Hm, it's the unified tree where most of the mm/ bits are the 
> AutoNUMA bits from your tree. (It does not match 100%, because 
> your tree has an ancient version of key memory usage 
> statistics that the scheduler needs for its convergence model. 
> I'll take a look at the differences.)

Beyond the difference in page frame statistics and the 
difference in the handling of "4K-EMU", the bits below are the 
differences I found (on the THP side) between numa/base-v3 and 
your -v10 tree - but I'm not sure they should have any effect on 
your JVM segfault under THP ...

I tried with preemption on/off and debugging on/off, and also 
tried your .config - none of these triggers JVM segfaults with 
4x JVM or 1x JVM SPECjbb tests.

Thanks,

	Ingo

------------------------->
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c25e37c..409b2f3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -711,8 +711,7 @@ out:
 	 * run pte_offset_map on the pmd, if an huge pmd could
 	 * materialize from under us from a different thread.
 	 */
-	if (unlikely(pmd_none(*pmd)) &&
-	    unlikely(__pte_alloc(mm, vma, pmd, address)))
+	if (unlikely(__pte_alloc(mm, vma, pmd, address)))
 		return VM_FAULT_OOM;
 	/* if an huge pmd materialized from under us just retry later */
 	if (unlikely(pmd_trans_huge(*pmd)))
diff --git a/mm/memory.c b/mm/memory.c
index 8022526..30e1335 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3845,8 +3750,7 @@ retry:
 	 * run pte_offset_map on the pmd, if an huge pmd could
 	 * materialize from under us from a different thread.
 	 */
-	if (unlikely(pmd_none(*pmd)) &&
-	    unlikely(__pte_alloc(mm, vma, pmd, address)))
+	if (unlikely(pmd_none(*pmd)) && __pte_alloc(mm, vma, pmd, address))
 		return VM_FAULT_OOM;
 	/* if an huge pmd materialized from under us just retry later */
 	if (unlikely(pmd_trans_huge(*pmd)))


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [GIT TREE] Unified NUMA balancing tree, v3
  2012-12-10 20:07         ` Ingo Molnar
  2012-12-10 20:10           ` Ingo Molnar
  2012-12-10 21:03           ` Ingo Molnar
@ 2012-12-10 22:19           ` Mel Gorman
  2 siblings, 0 replies; 19+ messages in thread
From: Mel Gorman @ 2012-12-10 22:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Rik van Riel, Thomas Gleixner, linux-kernel, linux-mm,
	Peter Zijlstra, Paul Turner, Lee Schermerhorn, Christoph Lameter,
	Andrew Morton, Andrea Arcangeli, Linus Torvalds, Johannes Weiner,
	Hugh Dickins

On Mon, Dec 10, 2012 at 09:07:55PM +0100, Ingo Molnar wrote:
> > > 
> > 
> > Yes, I have. The drop I took and the results I posted to you 
> > were based on a tip/master pull from December 9th. v3 was 
> > released on December 7th and your release said to test based 
> > on tip/master. The results are here 
> > https://lkml.org/lkml/2012/12/9/108 . Look at the columns 
> > marked numafix-20121209 which is tip/master with a bodge on 
> > top to remove the "if (p->nr_cpus_allowed != 
> > num_online_cpus())" check.
> 
> Ah, indeed - I saw those results but the 'numafix' tag threw me 
> off.
> 
> Looks like at least in terms of AutoNUMA-benchmark numbers you 
> measured the best-ever results with the -v3 tree? That aspect is 
> obviously good news.
> 

It's still regressing specjbb for lower numbers of warehouses, and the
single JVM with THP disabled performed very poorly. System CPU usage is
still through the roof for a number of tests, and the rate of migration
looks excessive in parts. Maybe that rate of migration really is
necessary, but it seems doubtful that so much bandwidth should be
consumed by the kernel moving data around.
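
For reference, a minimal user-space sketch of one way to estimate that
migration bandwidth from outside the kernel. It assumes the balancing
code exports a numa_pages_migrated counter in /proc/vmstat (as later
mainline kernels with CONFIG_NUMA_BALANCING do); the counter name in
this particular tree may differ:

/*
 * Rough bandwidth estimate for NUMA-balancing page migration.
 * Assumes a "numa_pages_migrated" counter in /proc/vmstat.
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Read one named counter from /proc/vmstat, or -1 if it is absent. */
static long long read_vmstat(const char *name)
{
        char line[256];
        long long val = -1;
        FILE *f = fopen("/proc/vmstat", "r");

        if (!f)
                return -1;
        while (fgets(line, sizeof(line), f)) {
                char key[128];
                long long v;

                if (sscanf(line, "%127s %lld", key, &v) == 2 &&
                    !strcmp(key, name)) {
                        val = v;
                        break;
                }
        }
        fclose(f);
        return val;
}

int main(void)
{
        const long page_size = sysconf(_SC_PAGESIZE);
        long long prev = read_vmstat("numa_pages_migrated");

        if (prev < 0) {
                fprintf(stderr, "numa_pages_migrated not found in /proc/vmstat\n");
                return 1;
        }
        for (;;) {
                long long cur;

                sleep(1);
                cur = read_vmstat("numa_pages_migrated");
                /* pages migrated per second -> MB/s moved by the kernel */
                printf("%.1f MB/s migrated\n",
                       (double)(cur - prev) * page_size / (1024.0 * 1024.0));
                prev = cur;
        }
        return 0;
}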

> This part isn't:
> 
> > > If there are any such instances left then I'll investigate, 
> > > but right now it's looking pretty good.
> > 
> > If you had read that report, you would know that I didn't have 
> > results for specjbb with THP enabled due to the JVM crashing 
> > with null pointer exceptions.
> 
> Hm, it's the unified tree where most of the mm/ bits are the 
> AutoNUMA bits from your tree.

The handling of groups of PTEs as an effective hugepage is a major
difference. Holding the PTL across task_numa_fault() is another major
difference, and could be a significant contributor to the PTL-related
bottlenecks you are complaining about. The fault stats are busted, but
that's a minor issue. All of this is already in another mail:
http://www.spinics.net/lists/linux-mm/msg47888.html
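
To make the PTL point concrete, here is a stand-alone model of the
ordering difference only. The PTL is modelled by a pthread mutex and
task_numa_fault() is just a stub, so none of this matches the real code
in either tree; it only shows which work ends up inside the lock hold:

/*
 * Toy model: account the NUMA hinting fault while the "PTL" is still
 * held, versus after it has been dropped.  Everything here is a
 * stand-in for illustration, not kernel code.
 */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t ptl = PTHREAD_MUTEX_INITIALIZER;  /* stands in for the PTL */

static void task_numa_fault(int nid, int pages)          /* stub, not the kernel function */
{
        /* in the kernel this updates per-task fault statistics */
        (void)nid; (void)pages;
}

static void fixup_pte(void)                              /* stand-in for clearing pte_numa */
{
}

/* Variant A: accounting done under the lock (longer hold time). */
static void numa_fault_locked(int nid)
{
        pthread_mutex_lock(&ptl);
        fixup_pte();
        task_numa_fault(nid, 1);        /* extends the PTL hold time */
        pthread_mutex_unlock(&ptl);
}

/* Variant B: lock dropped first, accounting done outside it. */
static void numa_fault_unlocked(int nid)
{
        pthread_mutex_lock(&ptl);
        fixup_pte();
        pthread_mutex_unlock(&ptl);
        task_numa_fault(nid, 1);        /* no longer contributes to PTL contention */
}

int main(void)
{
        numa_fault_locked(0);
        numa_fault_unlocked(0);
        puts("both variants modeled; only the lock hold time differs");
        return 0;
}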

> (It does not match 100%, because 
> your tree has an ancient version of key memory usage statistics 
> that the scheduler needs for its convergence model. I'll take a 
> look at the differences.)
> 

I'm assuming you are referring to the last_cpuid versus last_nid
information that is fed in. That should have been a fairly minor delta
between balancenuma and numacore. It would also affect what mpol_misplaced()
returned.
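
As an illustration of that delta only (the field width and hash below
are made up, not the encoding used by either tree), the difference is
roughly whether a bare node id or a folded CPU+PID value is remembered
per page:

/*
 * Illustrative encodings: a plain last_nid (node id only) versus a
 * hashed last-CPU+PID value packed into the same narrow field.
 */
#include <stdint.h>
#include <stdio.h>

#define LAST_FIELD_BITS 16
#define LAST_FIELD_MASK ((1u << LAST_FIELD_BITS) - 1)

/* last_nid style: remember only the node that last faulted. */
static uint16_t encode_last_nid(int nid)
{
        return (uint16_t)(nid & LAST_FIELD_MASK);
}

/*
 * last-CPU+PID style: fold the faulting CPU and the low PID bits into
 * the same field, so a later fault can distinguish "same task touched
 * this before" from merely "same node touched this before".
 */
static uint16_t encode_last_cpupid(int cpu, int pid)
{
        return (uint16_t)(((unsigned)cpu ^ ((unsigned)pid << 3)) & LAST_FIELD_MASK);
}

int main(void)
{
        printf("last_nid    for node 1           : %u\n", (unsigned)encode_last_nid(1));
        printf("last_cpupid for cpu 5, pid 4321  : %u\n", (unsigned)encode_last_cpupid(5, 4321));
        printf("last_cpupid for cpu 5, pid 8765  : %u\n", (unsigned)encode_last_cpupid(5, 8765));
        return 0;
}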

> Given how well the unified kernel performs,

Except for the places where it doesn't, such as the single JVM with THP
disabled. Maybe I have a spectacularly unlucky machine.

> and given that the 
> segfaults occur on your box, would you be willing to debug this 
> a bit and help me out fixing the bug? Thanks!
> 

The machine is currently occupied running current tip/master. When it
frees up, I'll try to find the time to debug it. My strong suspicion is
that the bug is in the patch that treats batches of 4K faults as
effective hugepage faults, particularly as an earlier version of that
patch had serious problems.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2012-12-10 22:19 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-12-07  0:19 [GIT TREE] Unified NUMA balancing tree, v3 Ingo Molnar
2012-12-07  0:19 ` [PATCH 1/9] numa, sched: Fix NUMA tick ->numa_shared setting Ingo Molnar
2012-12-07  0:19 ` [PATCH 2/9] numa, sched: Add tracking of runnable NUMA tasks Ingo Molnar
2012-12-07  0:19 ` [PATCH 3/9] numa, sched: Implement wake-cpu migration support Ingo Molnar
2012-12-07  0:19 ` [PATCH 4/9] numa, mm, sched: Implement last-CPU+PID hash tracking Ingo Molnar
2012-12-07  0:19 ` [PATCH 5/9] numa, mm, sched: Fix NUMA affinity tracking logic Ingo Molnar
2012-12-07  0:19 ` [PATCH 6/9] numa, mm: Fix !THP, 4K-pte "2M-emu" NUMA fault handling Ingo Molnar
2012-12-07  0:19 ` [PATCH 7/9] numa, sched: Improve staggered convergence Ingo Molnar
2012-12-07  0:19 ` [PATCH 8/9] numa, sched: Improve directed convergence Ingo Molnar
2012-12-07  0:19 ` [PATCH 9/9] numa, sched: Streamline and fix numa_allow_migration() use Ingo Molnar
2012-12-10 18:22 ` [GIT TREE] Unified NUMA balancing tree, v3 Thomas Gleixner
2012-12-10 18:41   ` Rik van Riel
2012-12-10 19:15     ` Ingo Molnar
2012-12-10 19:28       ` Mel Gorman
2012-12-10 20:07         ` Ingo Molnar
2012-12-10 20:10           ` Ingo Molnar
2012-12-10 21:03           ` Ingo Molnar
2012-12-10 22:19           ` Mel Gorman
2012-12-10 19:32   ` Mel Gorman
