Re: Kernel 4.7rc3 - Performance drop 30-40% for SPECjbb2005 and SPECjvm2008 benchmarks against 4.6 kernel

From: Jirka Hladky <jhladky@redhat.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org, Ingo Molnar <mingo@redhat.com>,
	Kamil Kolakowski <kkolakow@redhat.com>
Subject: Re: Kernel 4.7rc3 - Performance drop 30-40% for SPECjbb2005 and SPECjvm2008 benchmarks against 4.6 kernel
Date: Tue, 21 Jun 2016 15:17:52 +0200
Message-ID: <CAE4VaGC595XMvCxJBSso53t6pmsSySvQ2HKEbvABKfHCJAUnHQ@mail.gmail.com>
In-Reply-To: <CAE4VaGCYvAbvNRD65SPfLPGVCspRfUYDAtfPt-VmkukPbt-L4Q@mail.gmail.com>

Hi Peter,

I have an update for this performance issue. I have tested several
kernels and am now at the parent of

  2159197d6677 sched/core: Enable increased load resolution on 64-bit kernels

and I still see the performance regression for multithreaded workloads.

There are only 27 commits touching kernel/sched remaining between v4.6
(last known to be OK) and the current HEAD:

  6ecdd74962f246dfe8750b7bea481a1c0816315d sched/fair: Generalize the
  load/util averages resolution definition

See below [0].

Any hint as to which commit I should try next?

Thanks a lot!
Jirka

[0]
$ git log --pretty=oneline v4.6..HEAD kernel/sched
6ecdd74962f246dfe8750b7bea481a1c0816315d sched/fair: Generalize the load/util averages resolution definition
2159197d66770ec01f75c93fb11dc66df81fd45b sched/core: Enable increased load resolution on 64-bit kernels
e7904a28f5331c21d17af638cb477c83662e3cb6 locking/lockdep, sched/core: Implement a better lock pinning scheme
eb58075149b7f0300ff19142e6245fe75db2a081 sched/core: Introduce 'struct rq_flags'
3e71a462dd483ce508a723356b293731e7d788ea sched/core: Move task_rq_lock() out of line
64b7aad5798478ffff52e110878ccaae4c3aaa34 Merge branch 'sched/urgent' into sched/core, to pick up fixes before applying new changes
f98db6013c557c216da5038d9c52045be55cd039 sched/core: Add switch_mm_irqs_off() and use it in the scheduler
594dd290cf5403a9a5818619dfff42d8e8e0518e sched/cpufreq: Optimize cpufreq update kicker to avoid update multiple times
fec148c000d0f9ac21679601722811eb60b4cc52 sched/deadline: Fix a bug in dl_overflow()
9fd81dd5ce0b12341c9f83346f8d32ac68bd3841 sched/fair: Optimize !CONFIG_NO_HZ_COMMON CPU load updates
1f41906a6fda1114debd3898668bd7ab6470ee41 sched/fair: Correctly handle nohz ticks CPU load accounting
cee1afce3053e7aa0793fbd5f2e845fa2cef9e33 sched/fair: Gather CPU load functions under a more conventional namespace
a2c6c91f98247fef0fe75216d607812485aeb0df sched/fair: Call cpufreq hook in additional paths
41e0d37f7ac81297c07ba311e4ad39465b8c8295 sched/fair: Do not call cpufreq hook unless util changed
21e96f88776deead303ecd30a17d1d7c2a1776e3 sched/fair: Move cpufreq hook to update_cfs_rq_load_avg()
1f621e028baf391f6684003e32e009bc934b750f sched/fair: Fix asym packing to select correct CPU
bd92883051a0228cc34996b8e766111ba10c9aac sched/cpuacct: Check for NULL when using task_pt_regs()
2c923e94cd9c6acff3b22f0ae29cfe65e2658b40 sched/clock: Make local_clock()/cpu_clock() inline
c78b17e28cc2c2df74264afc408bdc6aaf3fbcc8 sched/clock: Remove pointless test in cpu_clock/local_clock
fb90a6e93c0684ab2629a42462400603aa829b9c sched/debug: Don't dump sched debug info in SysRq-W
2b8c41daba327c633228169e8bd8ec067ab443f8 sched/fair: Initiate a new task's util avg to a bounded value
1c3de5e19fc96206dd086e634129d08e5f7b1000 sched/fair: Update comments after a variable rename
47252cfbac03644ee4a3adfa50c77896aa94f2bb sched/core: Add preempt checks in preempt_schedule() code
bfdb198ccd99472c5bded689699eb30dd06316bb sched/numa: Remove unnecessary NUMA dequeue update from non-SMP kernels
d02c071183e1c01a76811c878c8a52322201f81f sched/fair: Reset nr_balance_failed after active balancing
d740037fac7052e49450f6fa1454f1144a103b55 sched/cpuacct: Split usage accounting into user_usage and sys_usage
5ca3726af7f66a8cc71ce4414cfeb86deb784491 sched/cpuacct: Show all possible CPUs in cpuacct output
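
The range in [0] could also be narrowed with a path-limited bisect. The
sketch below is only an illustration and is not taken from this thread:
the helper names bisect_rounds and start_sched_bisect are hypothetical,
and limiting the bisect to kernel/sched assumes the regression was
introduced by one of the scheduler commits listed above. It estimates
how many build/boot/benchmark rounds a binary search over N candidate
commits needs in the worst case, roughly log2(N).

```shell
#!/bin/sh
# Illustrative sketch; the helper names are hypothetical.

# Worst-case number of test rounds for a binary search over N candidate
# commits, i.e. ceil(log2(N)), computed by repeated halving.
bisect_rounds() {
    n=$1
    rounds=0
    while [ "$n" -gt 1 ]; do
        n=$(( (n + 1) / 2 ))
        rounds=$(( rounds + 1 ))
    done
    echo "$rounds"
}

# Start a bisect limited to commits touching kernel/sched.
# $1 = known-bad revision, $2 = known-good revision.
# Assumption: the regression comes from a scheduler commit.
start_sched_bisect() {
    git bisect start "$1" "$2" -- kernel/sched
}

# 27 commits are listed in [0].
bisect_rounds 27

# In a kernel tree one would then run, for example:
#   start_sched_bisect HEAD v4.6
#   (build, boot, run the benchmark)
#   git bisect good    # or: git bisect bad
#   git bisect reset   # when finished
```

For the 27 commits in [0] this gives about 5 rounds. The estimate that
git itself prints may differ slightly, because the history is not
strictly linear (it contains a merge).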

On Fri, Jun 17, 2016 at 1:04 AM, Jirka Hladky <jhladky@redhat.com> wrote:
>> > we see performance drop 30-40% for SPECjbb2005 and SPECjvm2008
>> Blergh, of course I don't have those.. :/
>
> SPECjvm2008 is publicly available.
> https://www.spec.org/download.html
>
> We will prepare a reproducer and attach it to the BZ.
>
>> What kind of config and userspace setup? Do you run this cruft in a
>> cgroup of sorts?
>
>  No, we don't do any special setup except to control the number of threads.
>
> Thanks for the hints which commits are most likely the root cause for
> this. We will try to find the commit which has caused it.
>
> Jirka
>
>
>
> On Thu, Jun 16, 2016 at 7:22 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Thu, Jun 16, 2016 at 06:38:50PM +0200, Jirka Hladky wrote:
>>> Hello,
>>>
>>> we see performance drop 30-40% for SPECjbb2005 and SPECjvm2008
>>
>> Blergh, of course I don't have those.. :/
>>
>>> benchmarks starting from 4.7.0-0.rc0 kernel compared to 4.6 kernel.
>>>
>>> We have tested kernels 4.7.0-0.rc1 and 4.7.0-0.rc3 and these are as
>>> well affected.
>>>
>>> We have observed the drop on variety of different x86_64 servers with
>>> different configuration (different CPU models, RAM sizes, both with
>>> Hyper Threading ON and OFF, different NUMA configurations (2 and 4
>>> NUMA nodes)
>>
>> What kind of config and userspace setup? Do you run this cruft in a
>> cgroup of sorts?
>>
>> If so, does it change anything if you run it in the root cgroup?
>>
>>> Linpack and Stream benchmarks do not show any performance drop.
>>>
>>> The performance drop increases with higher number of threads. The
>>> maximum number of threads in each benchmark is the same as number of
>>> CPUs.
>>>
>>> We have opened a BZ to track the progress:
>>> https://bugzilla.kernel.org/show_bug.cgi?id=120481
>>>
>>> You can find more details along with graphs and tables there.
>>>
>>> Do you have any hints which commit should we try to reverse?
>>
>> There were only 66 commits or so, and I think we can rule out the
>> hotplug changes, which should reduce it even further.
>>
>> You could see what the parent of this one does:
>>
>>   2159197d6677 sched/core: Enable increased load resolution on 64-bit kernels
>>
>> If not that, maybe the parent of:
>>
>>   c58d25f371f5 sched/fair: Move record_wakee()
>>
>> After that I suppose you'll have to go bisect.
>>