From: Ingo Molnar <mingo@kernel.org>
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>,
Paul Turner <pjt@google.com>,
Lee Schermerhorn <Lee.Schermerhorn@hp.com>,
Christoph Lameter <cl@linux.com>, Rik van Riel <riel@redhat.com>,
Mel Gorman <mgorman@suse.de>,
Andrew Morton <akpm@linux-foundation.org>,
Andrea Arcangeli <aarcange@redhat.com>,
Linus Torvalds <torvalds@linux-foundation.org>,
Thomas Gleixner <tglx@linutronix.de>,
Johannes Weiner <hannes@cmpxchg.org>,
Hugh Dickins <hughd@google.com>
Subject: [PATCH 00/10] Latest numa/core release, v18
Date: Fri, 30 Nov 2012 20:58:31 +0100
Message-ID: <1354305521-11583-1-git-send-email-mingo@kernel.org>
I'm pleased to announce the latest, -v18 numa/core release.
This release fixes regressions and improves NUMA performance.
It has the following main changes:
 - Introduce directed NUMA convergence, which is based on
   the 'task buddy' relation introduced in -v17, and make
   use of the new "task flipping" facility.
 - Add "related task group" balancing notion to the scheduler, to
   be able to 'compress' and 'spread' NUMA workloads
   based on which tasks relate to each other via their
   working set (i.e. which tasks access the same memory areas).
 - Track the quality and strength of NUMA convergence and
   create a feedback loop with the scheduler:
     - use it to direct migrations
     - use it to slow down and speed up the rate of the
       NUMA hinting page faults
 - Turn 4K pte NUMA faults into effective hugepage ones
 - Refine the 'shared tasks' memory interleaving logic
 - Improve CONFIG_NUMA_BALANCING=y OOM behavior
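To make the convergence feedback loop in the list above a bit more concrete, here is a toy Python model. It is not taken from the patches: the convergence metric (the share of hinting faults that hit the task's preferred node), the function names and the scan-period bounds are all assumptions chosen for illustration.

```python
# Toy model of a convergence-driven NUMA hinting fault rate.
# All names and constants are hypothetical, not from kernel/sched/fair.c.

MIN_SCAN_PERIOD_MS = 100    # assumed fastest scan period (least converged)
MAX_SCAN_PERIOD_MS = 1600   # assumed slowest scan period (fully converged)


def convergence_strength(faults_per_node, preferred_node):
    """Fraction of hinting faults that landed on the preferred node (0.0..1.0)."""
    total = sum(faults_per_node)
    if total == 0:
        return 0.0
    return faults_per_node[preferred_node] / total


def next_scan_period_ms(strength):
    """Scan slowly when converged, quickly when not (linear interpolation)."""
    strength = min(max(strength, 0.0), 1.0)
    span = MAX_SCAN_PERIOD_MS - MIN_SCAN_PERIOD_MS
    return MIN_SCAN_PERIOD_MS + strength * span


if __name__ == "__main__":
    spread = [250, 250, 250, 250]     # faults evenly spread over 4 nodes
    converged = [970, 10, 10, 10]     # almost all faults on node 0
    for faults in (spread, converged):
        s = convergence_strength(faults, preferred_node=0)
        print(f"strength={s:.2f} -> scan period {next_scan_period_ms(s):.0f} ms")
```

The point of the sketch is only the direction of the loop: a weakly converged task is sampled more often, a strongly converged one is left alone.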
One key practical area of improvement is the NUMA convergence
of "multiple JVM" kinds of workloads.
As a recap, this was -v17 performance with 4x SPECjbb instances
on a 4-node system (32 CPUs, 4 instances, 8 warehouses each, 240
seconds runtime, +THP):
spec1.txt: throughput = 177460.44 SPECjbb2005 bops
spec2.txt: throughput = 176175.08 SPECjbb2005 bops
spec3.txt: throughput = 175053.91 SPECjbb2005 bops
spec4.txt: throughput = 171383.52 SPECjbb2005 bops
--------------------------
SUM: throughput = 700072.95 SPECjbb2005 bops
The new -v18 figures are:
spec1.txt: throughput = 191415.52 SPECjbb2005 bops
spec2.txt: throughput = 193481.96 SPECjbb2005 bops
spec3.txt: throughput = 192865.30 SPECjbb2005 bops
spec4.txt: throughput = 191627.40 SPECjbb2005 bops
--------------------------
SUM: throughput = 769390.18 SPECjbb2005 bops
That is about 10% faster than -v17 and 22% faster than mainline, and it is
within 1% of the hard-binding results (where each JVM is explicitly
memory- and CPU-bound to a single node).
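For readers who want to re-derive these percentages, the short Python sketch below recomputes them from the per-instance figures quoted in this mail (the mainline numbers appear further down). The helper names are made up for this example. The hard-binding results are not listed in this mail, so that comparison is not reproduced here.

```python
# Per-instance SPECjbb2005 bops, copied from the tables in this mail.
RESULTS = {
    "mainline": [157839.25, 156969.15, 157571.59, 157873.86],
    "v17":      [177460.44, 176175.08, 175053.91, 171383.52],
    "v18":      [191415.52, 193481.96, 192865.30, 191627.40],
}


def total_bops(kernel):
    """Sum of the four JVM instances, as in the SUM: lines."""
    return sum(RESULTS[kernel])


def percent_faster(new, old):
    """Relative throughput gain of `new` over `old`, in percent."""
    return (total_bops(new) / total_bops(old) - 1.0) * 100.0


if __name__ == "__main__":
    print(f"v18 vs v17:      {percent_faster('v18', 'v17'):+.1f}%")
    print(f"v18 vs mainline: {percent_faster('v18', 'mainline'):+.1f}%")
```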
According to my measurements the -v18 NUMA kernel is also faster than
AutoNUMA (+THP-fix):
spec1.txt: throughput = 184327.49 SPECjbb2005 bops
spec2.txt: throughput = 187508.83 SPECjbb2005 bops
spec3.txt: throughput = 186206.44 SPECjbb2005 bops
spec4.txt: throughput = 188739.22 SPECjbb2005 bops
--------------------------
SUM: throughput = 746781.98 SPECjbb2005 bops
Mainline has the following 4x JVM performance:
spec1.txt: throughput = 157839.25 SPECjbb2005 bops
spec2.txt: throughput = 156969.15 SPECjbb2005 bops
spec3.txt: throughput = 157571.59 SPECjbb2005 bops
spec4.txt: throughput = 157873.86 SPECjbb2005 bops
--------------------------
SUM: throughput = 630253.85 SPECjbb2005 bops
Another key area of improvement is !THP (4K pages) performance.
Mainline 4x SPECjbb !THP JVM results:
spec1.txt: throughput = 128575.47 SPECjbb2005 bops
spec2.txt: throughput = 125767.24 SPECjbb2005 bops
spec3.txt: throughput = 130042.30 SPECjbb2005 bops
spec4.txt: throughput = 128155.32 SPECjbb2005 bops
--------------------------
SUM: throughput = 512540.33 SPECjbb2005 bops
numa/core -v18 4x SPECjbb JVM !THP results:
spec1.txt: throughput = 158023.05 SPECjbb2005 bops
spec2.txt: throughput = 156895.51 SPECjbb2005 bops
spec3.txt: throughput = 156158.11 SPECjbb2005 bops
spec4.txt: throughput = 157414.52 SPECjbb2005 bops
--------------------------
SUM: throughput = 628491.19 SPECjbb2005 bops
That too is roughly 22% faster than mainline - the !THP regression
that was reported by Mel Gorman appears to be fixed.
AutoNUMA-benchmark comparison to the mainline kernel:
##############
# res-v3.6-vanilla.log vs res-numacore-v18b.log:
#------------------------------------------------------------------------------------>
   autonuma benchmark                run time (lower is better)         speedup %
------------------------------------------------------------------------------------->
   numa01                           :   337.29  vs.  177.64             |   +89.8 %
   numa01_THREAD_ALLOC              :   428.79  vs.  127.07             |  +237.4 %
   numa02                           :    56.32  vs.   18.08             |  +211.5 %
------------------------------------------------------------
(this is similar to -v17, within noise.)
Comparison to AutoNUMA-v28 (+THP-fix):
##############
# res-autonuma-v28-THP.log vs res-numacore-v18b.log:
#------------------------------------------------------------------------------------>
   autonuma benchmark                run time (lower is better)         speedup %
------------------------------------------------------------------------------------->
   numa01                           :   235.77  vs.  177.64             |   +32.7 %
   numa01_THREAD_ALLOC              :   134.53  vs.  127.07             |    +5.8 %
   numa02                           :    19.49  vs.   18.08             |    +7.7 %
------------------------------------------------------------
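The "speedup %" column in the two autonuma-benchmark tables is consistent with (baseline run time / numa/core run time - 1) * 100. That formula is inferred from the numbers and is not stated by the comparison script itself. A small Python sketch, using the mainline comparison as input:

```python
def speedup_percent(baseline_secs, numacore_secs):
    """Run-time speedup in percent: how much longer the baseline ran."""
    return (baseline_secs / numacore_secs - 1.0) * 100.0


# (test, v3.6 vanilla seconds, numa/core -v18 seconds), from the mainline table.
VS_MAINLINE = [
    ("numa01", 337.29, 177.64),
    ("numa01_THREAD_ALLOC", 428.79, 127.07),
    ("numa02", 56.32, 18.08),
]

if __name__ == "__main__":
    for name, base, new in VS_MAINLINE:
        pct = speedup_percent(base, new)
        print(f"{name:<22}: {base:7.2f} vs. {new:7.2f} | {pct:+6.1f} %")
```

The recomputed values agree with the tables to within 0.1 percentage points. The last digit can differ depending on whether the value is rounded or truncated.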
A few caveats: I'm still seeing problems on !THP.
Here's the analysis of one of the last regression sources I'm still
seeing with it on larger systems. I have identified the source
of the regression, and I see how the AutoNUMA and 'balancenuma' trees
solved this problem - but I disagree with the solution.
When pushed hard enough via threaded workloads (for example via the
numa02 test), the upstream page migration code in mm/migrate.c
becomes unscalable, resulting in a lot of scheduling on the anon vma
mutex and a subsequent drop in performance.
When the points of scheduling are call-graph profiled, the
unscalability appears to be due to interaction between the
following page migration code paths:
    96.43%        process 0  [kernel.kallsyms]  [k] perf_trace_sched_switch
                  |
                  --- perf_trace_sched_switch
                      __schedule
                      schedule
                      schedule_preempt_disabled
                      __mutex_lock_common.isra.6
                      __mutex_lock_slowpath
                      mutex_lock
                     |
                     |--50.61%-- rmap_walk
                     |          move_to_new_page
                     |          migrate_pages
                     |          migrate_misplaced_page
                     |          __do_numa_page.isra.69
                     |          handle_pte_fault
                     |          handle_mm_fault
                     |          __do_page_fault
                     |          do_page_fault
                     |          page_fault
                     |          __memset_sse2
                     |          |
                     |           --100.00%-- worker_thread
                     |                     |
                     |                      --100.00%-- start_thread
                     |
                      --49.39%-- page_lock_anon_vma
                                try_to_unmap_anon
                                try_to_unmap
                                migrate_pages
                                migrate_misplaced_page
                                __do_numa_page.isra.69
                                handle_pte_fault
                                handle_mm_fault
                                __do_page_fault
                                do_page_fault
                                page_fault
                                __memset_sse2
                                |
                                 --100.00%-- worker_thread
                                           start_thread
From what I can see, the AutoNUMA and 'balancenuma' kernels work
around this !THP scalability issue by rate-limiting migrations.
For example, balancenuma rate-limits migrations to about 1.2 GB/sec
of bandwidth.
Rate-limiting to solve scalability limits is not the right
solution IMO, because it hurts cases where migration is justified.
The migration of the working set itself is not a problem, it would
in fact be beneficial - but our implementation of it does not scale
beyond a certain rate.
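The rate-limiting approach criticized above can be pictured with a toy fixed-window limiter in Python. This is a sketch under assumptions, not the balancenuma or AutoNUMA code: the class name, the window length and the fixed-window scheme are invented for illustration, and only the 1.2 GB/sec figure comes from this mail.

```python
# Toy fixed-window migration rate limiter. Hypothetical names throughout;
# this is not the balancenuma or AutoNUMA implementation.

PAGE_SIZE = 4096


class MigrationRateLimiter:
    def __init__(self, bytes_per_sec, window_secs=0.1):
        self.window_secs = window_secs
        self.budget = int(bytes_per_sec * window_secs)  # bytes allowed per window
        self.window_start = 0.0
        self.used = 0

    def allow(self, now, nr_pages=1):
        """Return True if migrating nr_pages at time `now` fits the budget."""
        if now - self.window_start >= self.window_secs:
            self.window_start = now      # start a fresh window
            self.used = 0
        cost = nr_pages * PAGE_SIZE
        if self.used + cost > self.budget:
            return False                 # over budget: the page stays remote
        self.used += cost
        return True


if __name__ == "__main__":
    limiter = MigrationRateLimiter(bytes_per_sec=1.2e9)
    # A burst of 4K-page migration requests arriving within one window.
    requested = 50_000
    granted = sum(limiter.allow(now=0.0) for _ in range(requested))
    print(f"granted {granted} of {requested} migrations in the first window")
```

Once the budget for a window is spent, every further request is refused regardless of how beneficial that migration would have been, which is the objection raised above.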
( THP, which has a 512 times lower natural rate of migration page
faults, does not run into this scalability limit. )
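The factor of 512 is simply the ratio between the hugepage size and the base page size. The Python sketch below spells out the arithmetic, assuming 2M transparent hugepages (as on x86-64) and one migrated page per hinting fault. The function name is made up for this example.

```python
BASE_PAGE = 4 * 1024          # 4K pte-mapped page
HUGE_PAGE = 2 * 1024 * 1024   # 2M transparent hugepage (x86-64 assumption)


def faults_to_migrate(working_set_bytes, page_size):
    """Hinting faults needed if each fault migrates exactly one page."""
    return -(-working_set_bytes // page_size)   # ceiling division


if __name__ == "__main__":
    gib = 1 << 30
    small = faults_to_migrate(gib, BASE_PAGE)
    huge = faults_to_migrate(gib, HUGE_PAGE)
    print(f"1 GiB via 4K pages: {small} faults")
    print(f"1 GiB via 2M pages: {huge} faults")
    print(f"ratio: {small // huge}x")
```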
So this issue is still open and testers are encouraged to use THP
if they can.
These patches are on top of the "v17" tree (no point in resending those),
and they can all be found in the tip:master tree as well:
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git master
Please re-report any bugs and regressions that you can still see.
Reports, fixes, suggestions are welcome, as always!
Thanks,
Ingo
--------------------->
Ingo Molnar (10):
sched: Add "task flipping" support
sched: Move the NUMA placement logic to a worklet
numa, mempolicy: Improve CONFIG_NUMA_BALANCING=y OOM behavior
mm, numa: Turn 4K pte NUMA faults into effective hugepage ones
sched: Introduce directed NUMA convergence
sched: Remove statistical NUMA scheduling
sched: Track quality and strength of convergence
sched: Converge NUMA migrations
sched: Add convergence strength based adaptive NUMA page fault rate
sched: Refine the 'shared tasks' memory interleaving logic
include/linux/migrate.h | 6 +
include/linux/sched.h | 12 +-
include/uapi/linux/mempolicy.h | 1 +
init/Kconfig | 1 +
kernel/sched/core.c | 99 ++-
kernel/sched/fair.c | 1913 ++++++++++++++++++++++++++++------------
kernel/sched/features.h | 24 +-
kernel/sched/sched.h | 19 +-
kernel/sysctl.c | 11 +-
mm/huge_memory.c | 50 +-
mm/memory.c | 151 +++-
mm/mempolicy.c | 86 +-
mm/migrate.c | 3 +-
mm/mprotect.c | 24 +-
14 files changed, 1699 insertions(+), 701 deletions(-)
--
1.7.11.7