From: Ingo Molnar <mingo@kernel.org>
To: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
Paul Turner <pjt@google.com>,
Lee Schermerhorn <Lee.Schermerhorn@hp.com>,
Christoph Lameter <cl@linux.com>, Rik van Riel <riel@redhat.com>,
Mel Gorman <mgorman@suse.de>,
Andrew Morton <akpm@linux-foundation.org>,
Andrea Arcangeli <aarcange@redhat.com>,
Linus Torvalds <torvalds@linux-foundation.org>,
Thomas Gleixner <tglx@linutronix.de>
Subject: Benchmark results: "Enhanced NUMA scheduling with adaptive affinity"
Date: Mon, 12 Nov 2012 19:48:33 +0100
Message-ID: <20121112184833.GA17503@gmail.com>
In-Reply-To: <20121112160451.189715188@chello.nl>
* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> Hi,
>
> This series implements an improved version of NUMA scheduling,
> based on the review and testing feedback we got.
>
> [...]
>
> This new scheduler code is then able to group tasks that are
> "memory related" via their memory access patterns together: in
> the NUMA context moving them on the same node if possible, and
> spreading them amongst nodes if they use private memory.
Here are some preliminary performance figures, comparing the
vanilla kernel against the CONFIG_SCHED_NUMA=y kernel.
Java SPEC benchmark, running on a 4 node, 64 GB, 32-way server
system (higher numbers are better):
v3.7-vanilla:  run #1:    475630
               run #2:    538271
               run #3:    533888
               run #4:    431525
               ----------------------------------
               avg:       494828 transactions/sec

v3.7-NUMA:     run #1:    626692
               run #2:    622069
               run #3:    630335
               run #4:    629817
               ----------------------------------
               avg:       627228 transactions/sec [ +26.7% ]
Beyond the +26.7% performance improvement in throughput, the
standard deviation of the results is much lower as well with
NUMA scheduling enabled, by about an order of magnitude.
[ That is probably so because memory and task placement is more
balanced with NUMA scheduling enabled - while with the vanilla
kernel initial placement of the working set determines the
final performance figure. ]
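For anyone who wants to re-derive the summary figures, here is a minimal Python sketch that recomputes the averages, the relative gain and the run-to-run spread from the eight runs listed above. The variable and function names are made up for illustration, and using the sample standard deviation is an assumption; the message does not say which estimator was used.

```python
from statistics import mean, stdev

# Per-run throughput (transactions/sec), copied from the results above.
vanilla = [475630, 538271, 533888, 431525]
numa = [626692, 622069, 630335, 629817]


def summarize(runs):
    """Return (average, sample standard deviation) for a list of runs."""
    return mean(runs), stdev(runs)


vanilla_avg, vanilla_sd = summarize(vanilla)
numa_avg, numa_sd = summarize(numa)

# Relative throughput gain of the NUMA kernel over vanilla, in percent.
gain_pct = (numa_avg / vanilla_avg - 1.0) * 100.0

# How many times smaller the run-to-run spread is with NUMA scheduling.
sd_ratio = vanilla_sd / numa_sd

print(f"vanilla: avg {vanilla_avg:.0f}, stddev {vanilla_sd:.0f}")
print(f"NUMA:    avg {numa_avg:.0f}, stddev {numa_sd:.0f}")
print(f"gain: {gain_pct:+.1f}%, stddev ratio: {sd_ratio:.1f}x")
```

With only four runs per kernel the spread estimate is rough, but it should be enough to check the "order of magnitude" claim.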
I've also tested Andrea's 'autonumabench' benchmark suite
against vanilla and the NUMA kernel, because Mel reported that
the CONFIG_SCHED_NUMA=y code regressed. It does not regress
anymore:
#
# NUMA01
#
perf stat --null --repeat 3 ./numa01
v3.7-vanilla: 340.3 seconds ( +/- 0.31% )
v3.7-NUMA: 216.9 seconds [ +56% ] ( +/- 8.32% )
-------------------------------------
v3.7-HARD_BIND: 166.6 seconds
Here the new NUMA code is faster than vanilla by 56% - that is
because with the vanilla kernel all memory is allocated on
node0, overloading that node's memory bandwidth.
[ Standard deviation on the vanilla kernel is low, because the
autonuma test causes close to the worst-case placement for the
vanilla kernel - and there's not much space to deviate away
from the worst case. The stddev on the NUMA kernel, by contrast,
seems a tad high, suggesting further room for improvement. ]
#
# NUMA01_THREAD_ALLOC
#
perf stat --null --repeat 3 ./numa01_THREAD_ALLOC
v3.7-vanilla: 425.1 seconds ( +/- 1.04% )
v3.7-NUMA: 118.7 seconds [ +250% ] ( +/- 0.49% )
-------------------------------------
v3.7-HARD_BIND: 200.56 seconds
Here the NUMA kernel was able to go beyond the (naive)
hard-binding result and achieved 3.5x the performance of the
vanilla kernel, with a low stddev.
#
# NUMA02
#
perf stat --null --repeat 3 ./numa02
v3.7-vanilla: 56.1 seconds ( +/- 0.72% )
v3.7-NUMA: 17.0 seconds [ +230% ] ( +/- 0.18% )
-------------------------------------
v3.7-HARD_BIND: 14.9 seconds
Here the NUMA kernel runs the test much (3.3x) faster than the
vanilla kernel. The workload is able to converge very quickly
and approximate the hard-binding ideal number very closely. If
the runtime were a bit longer it would come even closer.
Relative standard deviation is also 4 times lower than vanilla
(0.18% vs. 0.72%), suggesting stable NUMA convergence.
#
# NUMA02_SMT
#
perf stat --null --repeat 3 ./numa02_SMT
v3.7-vanilla: 56.1 seconds ( +/- 0.42% )
v3.7-NUMA: 17.3 seconds [ +220% ] ( +/- 0.88% )
-------------------------------------
v3.7-HARD_BIND: 14.6 seconds
In this test too the NUMA kernel outperforms the vanilla kernel,
by a factor of 3.2x. It comes very close to the ideal
hard-binding convergence result. Standard deviation on the NUMA
kernel is a bit high, at 0.88% versus 0.42% for vanilla.
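The bracketed percentages and the "Nx" factors in the autonumabench results are both derived from the elapsed times. A small Python sketch of that arithmetic follows, using only the runtimes quoted above. The dictionary layout and the `speedup` helper are illustrative names, not part of any tool. Note that the figures in this mail look rounded down, so the computed values can come out slightly higher.

```python
# (vanilla seconds, NUMA seconds) per autonumabench test, from the results above.
runtimes = {
    "numa01": (340.3, 216.9),
    "numa01_THREAD_ALLOC": (425.1, 118.7),
    "numa02": (56.1, 17.0),
    "numa02_SMT": (56.1, 17.3),
}


def speedup(vanilla_s, numa_s):
    """Speedup factor: how many times faster the NUMA kernel finished."""
    return vanilla_s / numa_s


factors = {name: speedup(v, n) for name, (v, n) in runtimes.items()}

for name, factor in factors.items():
    # A factor of 3.3x corresponds to a +230% improvement: (3.3 - 1) * 100.
    print(f"{name}: {factor:.2f}x faster ({(factor - 1.0) * 100.0:+.0f}%)")
```

For time-based results lower is better, so the factor is vanilla time divided by NUMA time; for the throughput numbers elsewhere in this mail the ratio goes the other way around.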
I have also created a new perf benchmarking and workload
generation tool: 'perf bench numa' (I'll post it later in a
separate reply).
Via 'perf bench numa' we can generate arbitrary process and
thread layouts, with arbitrary memory sharing arrangements
between them.
Here are various comparisons to the vanilla kernel (higher
numbers are better):
#
# 4 processes with 4 threads per process, sharing 4x 1GB of
# process-wide memory:
#
# perf bench numa mem -l 100 -zZ0 -p 4 -t 4 -P 1024 -T 0
#
v3.7-vanilla: 14.8 GB/sec
v3.7-NUMA: 32.9 GB/sec [ +122.3% ]
2.2 times faster.
#
# 4 processes with 4 threads per process, each thread using 1GB
# of thread-local memory:
#
# perf bench numa mem -l 100 -zZ0 -p 4 -t 4 -P 0 -T 1024
#
v3.7-vanilla: 17.0 GB/sec
v3.7-NUMA: 36.3 GB/sec [ +113.5% ]
2.1 times faster.
So it's a nice improvement all around. With this version the
regressions that Mel Gorman reported a week ago appear to be
fixed as well.
Thanks,
Ingo
ps. If anyone is curious about further details, let me know.
The base kernel I used for measurement was commit
02743c9c03f1 + the 8 patches Peter sent out.