Re: [x86/asm] 0507503671: will-it-scale.per_process_ops -4.9% regression

From: Yin Fengwei <fengwei.yin@intel.com>
To: lkp@lists.01.org
Subject: Re: [x86/asm] 0507503671: will-it-scale.per_process_ops -4.9% regression
Date: Wed, 17 Nov 2021 09:27:35 +0800	[thread overview]
Message-ID: <82106f10-98e9-241e-38dc-7f766b0218b5@intel.com> (raw)
In-Reply-To: <8B87DFB3-FC34-443B-995F-DB54C37E0CF2@zytor.com>

[-- Attachment #1: Type: text/plain, Size: 14798 bytes --]


On 11/17/2021 9:00 AM, H. Peter Anvin wrote:
> My mistake, I confused r- and p-values (The r-value is useful but not as useful as p.)
No worries. Both are fresh to me. Thanks a lot for sharing the info. Let me learn
them first.

Regards
Yin, Fengwei

> 
> The p-value is a statistical measure of validity. It is typically produced either bya t-test or z-test.
> 
> https://en.wikipedia.org/wiki/P-value
> https://en.wikipedia.org/wiki/Student%27s_t-test
> https://en.wikipedia.org/wiki/Z-test
> 
> The z-test is more "mathematically pure", but as far as I understand the t-test is less likely to have false positives for a smaller number of data points.
> 
> On November 15, 2021 5:40:55 PM PST, Yin Fengwei <fengwei.yin@intel.com> wrote:
>> Hi,
>>
>> On 11/16/2021 3:20 AM, H. Peter Anvin wrote:
>>> [Cc: Peter Z.]
>>>
>>> This seems totally bizarre... that is an *enormous* change, and if I'm reading it right it seems like this somehow related to the performance monitoring framework itself?
>> We can rerun the benchmark with performance monitor totally disabled.
>>
>>>
>>> The lower-performance init code is all pushed into the pre-boot path, unless for some strange reason not all code gets patched e.g. at module loading time.
>>>
>>> A quick peek around made me notice a few minor possibilities, but none of them look particularly sane:
>>>
>>> 1. We don't use "asm inline" in asm_volatile_goto, and we probably
>>>    should; otherwise gcc might get the idea this is a more heavyweight
>>>    operation than it actually is.
>>> 2. There is a workaround in asm_volatile_goto for a bug which apparently
>>>    was fixed in gcc 4.8.x that might mislead gcc's code generator into
>>>    generating worse code.
>>>
>>> Did you see any functions for which the code got *bigger*?
>> I checked 7 or 8 functions. Most of them were same. None of them got bigger.
>> This intel_pmu_store_lbr was smaller. I could do a full comparing.
>>
>>>
>>>
>>> Not directly related, but it would be really helpful to get an r-value with your statistics.  It would greatly help avoiding chasing ghosts.
>> Can you share what's the "r-value" and how to get it? Thanks.
>>
>>
>> Regards
>> Yin, Fengwei
>>
>>>
>>>
>>>
>>>
>>> On 11/15/21 01:53, Yin Fengwei wrote:
>>>> Hi,
>>>>
>>>> On 11/15/2021 3:37 PM, kernel test robot wrote:
>>>>>
>>>>>
>>>>> Greeting,
>>>>>
>>>>> FYI, we noticed a -4.9% regression of will-it-scale.per_process_ops due to commit:
>>>>>
>>>>>
>>>>> commit: 0507503671f9b1c867e889cbec0f43abf904f23c ("x86/asm: Avoid adding register pressure for the init case in static_cpu_has()")
>>>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>>>>
>>>>> in testcase: will-it-scale
>>>>> on test machine: 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz with 256G memory
>>>>> with following parameters:
>>>>>
>>>>>     nr_task: 50%
>>>>>     mode: process
>>>>>     test: mmap2
>>>>>     cpufreq_governor: performance
>>>>>     ucode: 0xd000280
>>>>>
>>>>> test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
>>>>> test-url: https://github.com/antonblanchard/will-it-scale
>>>>>
>>>>>
>>>>> please be noted, since we don't have clue why this commit could cause
>>>>> performance drop, so we did further tests on other platforms or with
>>>>> different parameters, and got below results.
>>>> Add Kees in case he is interest to this behavior.
>>>>
>>>> Observation on this regression:
>>>>     After the patch, the better code is generated. De-assembled the function intel_pmu_store_lbr with
>>>>     vmlinux built from commit f87bc8dc7a7c and 0507503671f9 and got:
>>>>        With commit f87bc8dc7a7c (parent commit):
>>>>       https://zerobin.net/?22efb1114b097030#ryD/8LpasEIg8WrS6O/M+sHYJp7c/LoAXPfeB7BUqu4=
>>>>
>>>>     With commit 0507503671f9:
>>>>       https://zerobin.net/?e57652572b3ec83c#CvobUggve54SIHmlZ6jkzs3s4k8iQN4ophFoOs7LHMI=
>>>>
>>>>     The assembly code with commit 0507503671f9 is smaller than with parent commit. The
>>>>     register r12 in parent commit is strange IIUC.
>>>>
>>>>
>>>>     BTW, the reason that we picked up function intel_pmu_store_lbr is:
>>>>       It's the first function in System.map which has different size w/o the patch
>>>>
>>>>
>>>> Suppose the performance data with commit 0507503671f9 should be better. But the
>>>> test result showed it had improvement only on one test box. On other three test box,
>>>> it introduced regressions. Looks like strange.
>>>>
>>>>
>>>> Regards
>>>> Yin, Fengwei
>>>>
>>>>>
>>>>> except the 1% improvement from the first test on a 4 sockets Haswell-EX,
>>>>> others all show similar regression:
>>>>>
>>>>> +------------------+----------------------------------------------------------------------------------+
>>>>> | testcase: change | will-it-scale: will-it-scale.per_process_ops +1.0% improvement                   |
>>>>> | test machine     | 144 threads 4 sockets Intel(R) Xeon(R) CPU E7-8890 v3 @ 2.50GHz with 512G memory |
>>>>> | test parameters  | cpufreq_governor=performance                                                     |
>>>>> |                  | mode=process                                                                     |
>>>>> |                  | nr_task=50%                                                                      |
>>>>> |                  | test=mmap2                                                                       |
>>>>> |                  | ucode=0x16                                                                       |
>>>>> +------------------+----------------------------------------------------------------------------------+
>>>>> | testcase: change | will-it-scale: will-it-scale.per_process_ops -3.7% regression                    |
>>>>> | test machine     | 144 threads 4 sockets Intel(R) Xeon(R) Gold 5318H CPU @ 2.50GHz with 128G memory |
>>>>> | test parameters  | cpufreq_governor=performance                                                     |
>>>>> |                  | mode=process                                                                     |
>>>>> |                  | nr_task=50%                                                                      |
>>>>> |                  | test=mmap2                                                                       |
>>>>> |                  | ucode=0x700001e                                                                  |
>>>>> +------------------+----------------------------------------------------------------------------------+
>>>>> | testcase: change | will-it-scale: will-it-scale.per_process_ops -5.1% regression                    |
>>>>> | test machine     | 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz with 256G memory  |
>>>>> | test parameters  | cpufreq_governor=performance                                                     |
>>>>> |                  | mode=process                                                                     |
>>>>> |                  | nr_task=16                                                                       |
>>>>> |                  | test=mmap2                                                                       |
>>>>> |                  | ucode=0xd000280                                                                  |
>>>>> +------------------+----------------------------------------------------------------------------------+
>>>>> | testcase: change | will-it-scale: will-it-scale.per_process_ops -5.9% regression                    |
>>>>> | test machine     | 88 threads 2 sockets Intel(R) Xeon(R) Gold 6238M CPU @ 2.10GHz with 128G memory  |
>>>>> | test parameters  | cpufreq_governor=performance                                                     |
>>>>> |                  | mode=process                                                                     |
>>>>> |                  | nr_task=16                                                                       |
>>>>> |                  | test=mmap1                                                                       |
>>>>> |                  | ucode=0x5003006                                                                  |
>>>>> +------------------+----------------------------------------------------------------------------------+
>>>>> | testcase: change | will-it-scale: will-it-scale.per_process_ops -3.5% regression                    |
>>>>> | test machine     | 88 threads 2 sockets Intel(R) Xeon(R) Gold 6238M CPU @ 2.10GHz with 128G memory  |
>>>>> | test parameters  | cpufreq_governor=performance                                                     |
>>>>> |                  | mode=process                                                                     |
>>>>> |                  | nr_task=50%                                                                      |
>>>>> |                  | test=mmap2                                                                       |
>>>>> |                  | ucode=0x5003006                                                                  |
>>>>> +------------------+----------------------------------------------------------------------------------+
>>>>>
>>>>>
>>>>> If you fix the issue, kindly add following tag
>>>>> Reported-by: kernel test robot <oliver.sang@intel.com>
>>>>>
>>>>>
>>>>> Details are as below:
>>>>> -------------------------------------------------------------------------------------------------->
>>>>>
>>>>>
>>>>> To reproduce:
>>>>>
>>>>>          git clone https://github.com/intel/lkp-tests.git
>>>>>          cd lkp-tests
>>>>>          sudo bin/lkp install job.yaml           # job file is attached in this email
>>>>>          bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
>>>>>          sudo bin/lkp run generated-yaml-file
>>>>>
>>>>>          # if come across any failure that blocks the test,
>>>>>          # please remove ~/.lkp and /lkp dir to run from a clean state.
>>>>>
>>>>> =========================================================================================
>>>>> compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase/ucode:
>>>>>    gcc-9/performance/x86_64-rhel-8.3/process/50%/debian-10.4-x86_64-20200603.cgz/lkp-icl-2sp2/mmap2/will-it-scale/0xd000280
>>>>>
>>>>> commit:
>>>>>    f87bc8dc7a ("x86/asm: Add _ASM_RIP() macro for x86-64 (%rip) suffix")
>>>>>    0507503671 ("x86/asm: Avoid adding register pressure for the init case in static_cpu_has()")
>>>>>
>>>>> f87bc8dc7a7c438c 0507503671f9b1c867e889cbec0
>>>>> ---------------- ---------------------------
>>>>>           %stddev     %change         %stddev
>>>>>               \          |                \
>>>>>    41898923            -4.9%   39829159        will-it-scale.64.processes
>>>>>      654670            -4.9%     622330        will-it-scale.per_process_ops
>>>>>    41898923            -4.9%   39829159        will-it-scale.workload
>>>>>        6918 ± 54%    +116.5%      14975 ± 14%  softirqs.CPU20.SCHED
>>>>>      240.00 ± 18%     +57.8%     378.67 ± 20%  slabinfo.biovec-64.active_objs
>>>>>      240.00 ± 18%     +57.8%     378.67 ± 20%  slabinfo.biovec-64.num_objs
>>>>>        0.01 ± 28%     -36.1%       0.01 ± 14%  perf-sched.sch_delay.max.ms.__x64_sys_pause.do_syscall_64.entry_SYSCALL_64_after_hwframe.[unknown]
>>>>>        6114 ± 24%     -46.1%       3296 ± 46%  perf-sched.wait_and_delay.max.ms.schedule_hrtimeout_range_clock.poll_schedule_timeout.constprop.0.do_sys_poll
>>>>>        6114 ± 24%     -46.1%       3296 ± 46%  perf-sched.wait_time.max.ms.schedule_hrtimeout_range_clock.poll_schedule_timeout.constprop.0.do_sys_poll
>>>>>        1409 ± 30%     -30.7%     977.00 ± 23%  interrupts.CPU1.CAL:Function_call_interrupts
>>>>>        3001 ± 58%     -70.3%     892.50 ± 69%  interrupts.CPU1.RES:Rescheduling_interrupts
>>>>>      669.83 ±172%    +696.9%       5338 ±102%  interrupts.CPU108.NMI:Non-maskable_interrupts
>