From: Ingo Molnar <mingo@kernel.org>
To: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Paul Turner <pjt@google.com>,
	Lee Schermerhorn <Lee.Schermerhorn@hp.com>,
	Christoph Lameter <cl@linux.com>, Rik van Riel <riel@redhat.com>,
	Mel Gorman <mgorman@suse.de>,
	Andrew Morton <akpm@linux-foundation.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Thomas Gleixner <tglx@linutronix.de>
Subject: Benchmark results: "Enhanced NUMA scheduling with adaptive affinity"
Date: Mon, 12 Nov 2012 19:48:33 +0100
Message-ID: <20121112184833.GA17503@gmail.com>
In-Reply-To: <20121112160451.189715188@chello.nl>


* Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:

> Hi,
> 
> This series implements an improved version of NUMA scheduling, 
> based on the review and testing feedback we got.
>
> [...]
>
> This new scheduler code is then able to group tasks that are 
> "memory related" via their memory access patterns together: in 
> the NUMA context moving them on the same node if possible, and 
> spreading them amongst nodes if they use private memory.

Here are some preliminary performance figures, comparing the 
vanilla kernel against the CONFIG_SCHED_NUMA=y kernel.

Java SPEC benchmark, running on a 4-node, 64 GB, 32-way server 
system (higher numbers are better):

   v3.7-vanilla:    run #1:    475630
                    run #2:    538271
                    run #3:    533888
                    run #4:    431525
                    ----------------------------------
                       avg:    494828 transactions/sec

   v3.7-NUMA:       run #1:    626692
                    run #2:    622069
                    run #3:    630335
                    run #4:    629817
                    ----------------------------------
                       avg:    627228 transactions/sec    [ +26.7% ]

Beyond the +26.7% improvement in throughput, the standard 
deviation of the results is also much lower with NUMA 
scheduling enabled - by about an order of magnitude.
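
[ Computed from the four runs above: the sample stddev is 
  roughly 51,000 transactions/sec for vanilla versus roughly 
  3,800 for NUMA - about a 13x reduction. ]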

[ That is probably because memory and task placement is more 
  balanced with NUMA scheduling enabled - while with the 
  vanilla kernel, the initial placement of the working set 
  determines the final performance figure. ]

I've also tested Andrea's 'autonumabench' benchmark suite 
against vanilla and the NUMA kernel, because Mel reported that 
the CONFIG_SCHED_NUMA=y code regressed. It does not regress 
anymore:

  #
  # NUMA01
  #
  perf stat --null --repeat 3 ./numa01

   v3.7-vanilla:           340.3 seconds           ( +/- 0.31% )
   v3.7-NUMA:              216.9 seconds  [ +56% ] ( +/- 8.32% )
   -------------------------------------
   v3.7-HARD_BIND:         166.6 seconds

Here the new NUMA code is faster than vanilla by 56% - that is 
because with the vanilla kernel all memory is allocated on 
node0, overloading that node's memory bandwidth.

[ Standard deviation on the vanilla kernel is low, because the 
  autonuma test causes close to the worst-case placement for the 
  vanilla kernel - and there's not much space to deviate away 
  from the worst-case. Despite that, stddev on the NUMA kernel 
  seems a tad high, suggesting further room for improvement. ]
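
[ Methodology note: 'perf stat --null --repeat 3' measures 
  wall-clock time only (no PMU counters) and reports the mean 
  +/- stddev over 3 runs. The HARD_BIND rows are the manually 
  bound baseline; as a sketch, such a binding can be expressed 
  via numactl (the exact node layout autonumabench uses may 
  differ):

    # pin a task and its memory to node 0:
    numactl --cpunodebind=0 --membind=0 ./numa01              ]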

  #
  # NUMA01_THREAD_ALLOC
  #
  perf stat --null --repeat 3 ./numa01_THREAD_ALLOC

   v3.7-vanilla:            425.1 seconds             ( +/- 1.04% )
   v3.7-NUMA:               118.7 seconds  [ +250% ]  ( +/- 0.49% )
   -------------------------------------
   v3.7-HARD_BIND:          200.56 seconds

Here the NUMA kernel went beyond the (naive) hard-binding 
result and achieved 3.5x the performance of the vanilla 
kernel, with a low stddev.

  #
  # NUMA02
  #
  perf stat --null --repeat 3 ./numa02

   v3.7-vanilla:           56.1 seconds               ( +/- 0.72% )
   v3.7-NUMA:              17.0 seconds    [ +230% ]  ( +/- 0.18% )
   -------------------------------------
   v3.7-HARD_BIND:         14.9 seconds

Here the NUMA kernel runs the test much (3.3x) faster than the 
vanilla kernel. The workload converges very quickly and comes 
very close to the hard-binding ideal number. If the runtime 
were a bit longer, it would come closer still.

Standard deviation is also 4 times lower than vanilla's, 
suggesting stable NUMA convergence.

  #
  # NUMA02_SMT
  #
  perf stat --null --repeat 3 ./numa02_SMT

   v3.7-vanilla:            56.1 seconds                 ( +/- 0.42% )
   v3.7-NUMA:               17.3 seconds     [ +220% ]   ( +/- 0.88% )
   -------------------------------------
   v3.7-HARD_BIND:          14.6 seconds

In this test too the NUMA kernel outperforms the vanilla kernel, 
by a factor of 3.2x. It comes very close to the ideal 
hard-binding convergence result. Standard deviation is a bit 
high.

I have also created a new perf benchmarking and workload 
generation tool: 'perf bench numa' (I'll post it later in a 
separate reply).

Via 'perf bench numa' we can generate arbitrary process and 
thread layouts, with arbitrary memory sharing arrangements 
between them.
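
To decode the command lines below: -p and -t set the number of 
processes and threads per process, -P and -T set the amount of 
process-wide and per-thread memory in MB, and -l sets the 
number of access loops; the -zZ0 cluster controls memory 
initialization. (The separate posting will have the 
authoritative documentation.) A hypothetical smaller run, for 
illustration:

  # 2 processes, 2 threads each, 512 MB of process-wide
  # memory per process:
  perf bench numa mem -l 100 -p 2 -t 2 -P 512 -T 0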

Here are various comparisons to the vanilla kernel (higher 
numbers are better):

  #
  # 4 processes with 4 threads per process, sharing 4x 1GB of 
  # process-wide memory:
  #
  # perf bench numa mem -l 100 -zZ0 -p 4 -t 4 -P 1024 -T    0
  #
           v3.7-vanilla:       14.8 GB/sec
           v3.7-NUMA:          32.9 GB/sec    [ +122.3% ]

2.2 times faster.

  #
  # 4 processes with 4 threads per process, each thread working 
  # on 1GB of thread-local memory:
  #
  # perf bench numa mem -l 100 -zZ0 -p 4 -t 4 -P    0 -T 1024
  #
           v3.7-vanilla:        17.0 GB/sec
           v3.7-NUMA:           36.3 GB/sec    [ +113.5% ]

2.1 times faster.

So it's a nice improvement all around. With this version the 
regressions that Mel Gorman reported a week ago appear to be 
fixed as well.

Thanks,

	Ingo

ps. If anyone is curious about further details, let me know.
    The base kernel I used for measurement was commit
    02743c9c03f1 + the 8 patches Peter sent out.
