linux-kernel.vger.kernel.org archive mirror
From: Jirka Hladky <jhladky@redhat.com>
To: linux-kernel@vger.kernel.org
Subject: sched : performance regression 24% between 4.4rc4 and 4.3 kernel
Date: Fri, 11 Dec 2015 15:17:50 +0100	[thread overview]
Message-ID: <CAE4VaGAziCRGRXPPO6YtiHLXqeLUMcuYCh3mkmGpDjfW8GaetQ@mail.gmail.com> (raw)

Hello,

we are doing performance testing of the new kernel scheduler (commit
53528695ff6d8b77011bc818407c13e30914a946). In most cases we see
performance improvements compared to the 4.3 kernel, with the exception
of the stream benchmark when running on a 4 NUMA node server.

When we run 4 stream benchmark processes on a 4 NUMA node server and
compare the total performance, we see a drop of about 24% compared to
the 4.3 kernel. This is caused by 2 stream instances running on the
same NUMA node while 1 NUMA node runs no stream instance at all. With
kernel 4.3, the load is distributed evenly among all 4 NUMA nodes.
When two stream instances run on the same NUMA node, their runtime is
almost twice as long as that of a single stream instance running alone
on one NUMA node. See the log files [1] below.

Please see the graph comparing stream benchmark results between
kernels 4.3 and 4.4rc4 (for the legend see [2] below).
https://jhladky.fedorapeople.org/sched_stream_kernel_4.3vs4.4rc4/Stream_benchmark_on_4_NUMA_node_server_4.3vs4.4rc4_kernel.png

Could you please help us identify the root cause of this regression?
We don't have the skills to fix the problem ourselves, but we will be
more than happy to test any proposed patch for this issue.

Thanks a lot for your help with this!
Jirka

Further details:

[1] Log files can be downloaded here:
https://jhladky.fedorapeople.org/sched_stream_kernel_4.3vs4.4rc4/4.4RC4_stream_log_files.tar.bz2

$ grep "User time" *log
stream.defaultRun.004streams.loop01.instance001.log:User time:  12.370 seconds
stream.defaultRun.004streams.loop01.instance002.log:User time:  10.560 seconds
stream.defaultRun.004streams.loop01.instance003.log:User time:  19.330 seconds
stream.defaultRun.004streams.loop01.instance004.log:User time:  17.820 seconds
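The slowdown can be quantified directly from these user times; a quick
sanity check in Python (instances 003 and 004 are the two that, per the
node listings below, share NUMA node #3):

```python
# User times (seconds) taken from the log excerpt above.
solo = [12.370, 10.560]    # instances 001 and 002, each alone on a NUMA node
shared = [19.330, 17.820]  # instances 003 and 004, both on NUMA node #3

avg_solo = sum(solo) / len(solo)
avg_shared = sum(shared) / len(shared)
slowdown = avg_shared / avg_solo   # average slowdown from sharing a node
worst = max(shared) / min(solo)    # worst colliding vs. best solo instance

print(f"average slowdown: {slowdown:.2f}x, worst case: {worst:.2f}x")
```

This gives roughly a 1.6x average slowdown for the colliding instances,
with the worst pairing close to 2x.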


$ grep "NUMA nodes:" *log
stream.defaultRun.004streams.loop01.instance001.log:NUMA nodes:     2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2
stream.defaultRun.004streams.loop01.instance002.log:NUMA nodes:     0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
stream.defaultRun.004streams.loop01.instance003.log:NUMA nodes:     3
3 3 3 3 3 3 3 3 0 0 0 0 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 0 0 0 0 0 0 0 0 0 0 3 3 3 3
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
3 3 3 3 3 3 3 3 3 3 3 3 3 3
stream.defaultRun.004streams.loop01.instance004.log:NUMA nodes:     3
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
3 3 0 0 0 0 0 0 0 0 0 0 0 0

=> Please note that NO benchmark is running on NUMA node #1, while
instances #3 and #4 are both running on NUMA node #3. This has a huge
performance impact: the stream instances on node #3 need 19 and 17
seconds to finish, compared to 10 and 12 seconds for the instances
each running alone on a NUMA node.
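The per-sample node IDs in the logs above can be summarized
programmatically. A minimal sketch, assuming each "NUMA nodes:" entry
is simply a list of the node ID sampled per interval (the exact log
format may differ):

```python
from collections import Counter

def summarize_nodes(samples):
    """Return (dominant node, fraction of samples spent on it)."""
    counts = Counter(samples)
    node, hits = counts.most_common(1)[0]
    return node, hits / len(samples)

# Rough shape of the instance003 trace above: mostly node 3, briefly node 0.
instance003 = [3] * 100 + [0] * 16
node, frac = summarize_nodes(instance003)
print(f"dominant node {node}, {frac:.0%} of samples")
```

Applying this per instance makes collisions obvious: two instances with
the same dominant node are competing for one node's memory bandwidth.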

[2] Graph:
https://jhladky.fedorapeople.org/sched_stream_kernel_4.3vs4.4rc4/Stream_benchmark_on_4_NUMA_node_server_4.3vs4.4rc4_kernel.png

Graph Legend:
GREEN line => kernel 4.3
BLUE line  => kernel 4.4rc4
x-axis     => number of parallel stream instances
y-axis     => Sum [1/runtime] over all stream instances
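For reference, the y-axis metric can be computed directly from the user
times in [1]; a minimal sketch for the 4.4rc4 run with 4 instances:

```python
# User times (seconds) of the four 4.4rc4 stream instances from [1].
runtimes = [12.370, 10.560, 19.330, 17.820]

# Aggregate throughput metric used on the y-axis: sum of 1/runtime.
throughput = sum(1.0 / t for t in runtimes)
print(f"aggregate throughput: {throughput:.4f} 1/s")
```

A perfectly balanced run (all instances near the solo runtimes) would
score noticeably higher on this metric, which is the gap the graph shows.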


Details on server: DELL PowerEdge R820, 4x E5-4607 0 @ 2.20GHz and 128GB RAM
http://ark.intel.com/products/64604


Thread overview: 13+ messages
2015-12-11 14:17 Jirka Hladky [this message]
2015-12-12  7:04 ` sched : performance regression 24% between 4.4rc4 and 4.3 kernel Mike Galbraith
2015-12-12 14:16   ` Jirka Hladky
2015-12-12 14:37     ` Mike Galbraith
2015-12-15  0:02       ` Jirka Hladky
2015-12-15  0:04       ` Jirka Hladky
     [not found]       ` <CAE4VaGCgAvvQXDsv=Gn8B0JtTzCnXe0oP63HLQWSCyY_QNOB7g@mail.gmail.com>
2015-12-15  2:12         ` Rik van Riel
2015-12-15  8:49           ` Jirka Hladky
2015-12-16 12:56             ` Jirka Hladky
2015-12-16 13:50               ` Peter Zijlstra
2015-12-16 17:04                 ` Jirka Hladky
     [not found]                   ` <CAE4VaGD49UAsBJn3jgg0kREWqjYz8UnvWOi8zU4d5HgNgNS-sQ@mail.gmail.com>
2015-12-17 15:43                     ` Jirka Hladky
2015-12-18  2:49                       ` Mike Galbraith
