All of lore.kernel.org
 help / color / mirror / Atom feed
From: Yin Fengwei <fengwei.yin@intel.com>
To: lkp@lists.01.org
Subject: Re: [x86/asm] 0507503671: will-it-scale.per_process_ops -4.9% regression
Date: Wed, 17 Nov 2021 09:27:35 +0800	[thread overview]
Message-ID: <82106f10-98e9-241e-38dc-7f766b0218b5@intel.com> (raw)
In-Reply-To: <8B87DFB3-FC34-443B-995F-DB54C37E0CF2@zytor.com>

[-- Attachment #1: Type: text/plain, Size: 14798 bytes --]


On 11/17/2021 9:00 AM, H. Peter Anvin wrote:
> My mistake, I confused r- and p-values (The r-value is useful but not as useful as p.)
No worries. Both are fresh to me. Thanks a lot for sharing the info. Let me learn
them first.

Regards
Yin, Fengwei

> 
> The p-value is a statistical measure of validity. It is typically produced either bya t-test or z-test.
> 
> https://en.wikipedia.org/wiki/P-value
> https://en.wikipedia.org/wiki/Student%27s_t-test
> https://en.wikipedia.org/wiki/Z-test
> 
> The z-test is more "mathematically pure", but as far as I understand the t-test is less likely to have false positives for a smaller number of data points.
> 
> On November 15, 2021 5:40:55 PM PST, Yin Fengwei <fengwei.yin@intel.com> wrote:
>> Hi,
>>
>> On 11/16/2021 3:20 AM, H. Peter Anvin wrote:
>>> [Cc: Peter Z.]
>>>
>>> This seems totally bizarre... that is an *enormous* change, and if I'm reading it right it seems like this somehow related to the performance monitoring framework itself?
>> We can rerun the benchmark with performance monitor totally disabled.
>>
>>>
>>> The lower-performance init code is all pushed into the pre-boot path, unless for some strange reason not all code gets patched e.g. at module loading time.
>>>
>>> A quick peek around made me notice a few minor possibilities, but none of them look particularly sane:
>>>
>>> 1. We don't use "asm inline" in asm_volatile_goto, and we probably
>>>    should; otherwise gcc might get the idea this is a more heavyweight
>>>    operation than it actually is.
>>> 2. There is a workaround in asm_volatile_goto for a bug which apparently
>>>    was fixed in gcc 4.8.x that might mislead gcc's code generator into
>>>    generating worse code.
>>>
>>> Did you see any functions for which the code got *bigger*?
>> I checked 7 or 8 functions. Most of them were same. None of them got bigger.
>> This intel_pmu_store_lbr was smaller. I could do a full comparing.
>>
>>>
>>>
>>> Not directly related, but it would be really helpful to get an r-value with your statistics.  It would greatly help avoiding chasing ghosts.
>> Can you share what's the "r-value" and how to get it? Thanks.
>>
>>
>> Regards
>> Yin, Fengwei
>>
>>>
>>>
>>>
>>>
>>> On 11/15/21 01:53, Yin Fengwei wrote:
>>>> Hi,
>>>>
>>>> On 11/15/2021 3:37 PM, kernel test robot wrote:
>>>>>
>>>>>
>>>>> Greeting,
>>>>>
>>>>> FYI, we noticed a -4.9% regression of will-it-scale.per_process_ops due to commit:
>>>>>
>>>>>
>>>>> commit: 0507503671f9b1c867e889cbec0f43abf904f23c ("x86/asm: Avoid adding register pressure for the init case in static_cpu_has()")
>>>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>>>>
>>>>> in testcase: will-it-scale
>>>>> on test machine: 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz with 256G memory
>>>>> with following parameters:
>>>>>
>>>>>     nr_task: 50%
>>>>>     mode: process
>>>>>     test: mmap2
>>>>>     cpufreq_governor: performance
>>>>>     ucode: 0xd000280
>>>>>
>>>>> test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
>>>>> test-url: https://github.com/antonblanchard/will-it-scale
>>>>>
>>>>>
>>>>> please be noted, since we don't have clue why this commit could cause
>>>>> performance drop, so we did further tests on other platforms or with
>>>>> different parameters, and got below results.
>>>> Add Kees in case he is interest to this behavior.
>>>>
>>>> Observation on this regression:
>>>>     After the patch, the better code is generated. De-assembled the function intel_pmu_store_lbr with
>>>>     vmlinux built from commit f87bc8dc7a7c and 0507503671f9 and got:
>>>>        With commit f87bc8dc7a7c (parent commit):
>>>>       https://zerobin.net/?22efb1114b097030#ryD/8LpasEIg8WrS6O/M+sHYJp7c/LoAXPfeB7BUqu4=
>>>>
>>>>     With commit 0507503671f9:
>>>>       https://zerobin.net/?e57652572b3ec83c#CvobUggve54SIHmlZ6jkzs3s4k8iQN4ophFoOs7LHMI=
>>>>
>>>>     The assembly code with commit 0507503671f9 is smaller than with parent commit. The
>>>>     register r12 in parent commit is strange IIUC.
>>>>
>>>>
>>>>     BTW, the reason that we picked up function intel_pmu_store_lbr is:
>>>>       It's the first function in System.map which has different size w/o the patch
>>>>
>>>>
>>>> Suppose the performance data with commit 0507503671f9 should be better. But the
>>>> test result showed it had improvement only on one test box. On other three test box,
>>>> it introduced regressions. Looks like strange.
>>>>
>>>>
>>>> Regards
>>>> Yin, Fengwei
>>>>
>>>>>
>>>>> except the 1% improvement from the first test on a 4 sockets Haswell-EX,
>>>>> others all show similar regression:
>>>>>
>>>>> +------------------+----------------------------------------------------------------------------------+
>>>>> | testcase: change | will-it-scale: will-it-scale.per_process_ops +1.0% improvement                   |
>>>>> | test machine     | 144 threads 4 sockets Intel(R) Xeon(R) CPU E7-8890 v3 @ 2.50GHz with 512G memory |
>>>>> | test parameters  | cpufreq_governor=performance                                                     |
>>>>> |                  | mode=process                                                                     |
>>>>> |                  | nr_task=50%                                                                      |
>>>>> |                  | test=mmap2                                                                       |
>>>>> |                  | ucode=0x16                                                                       |
>>>>> +------------------+----------------------------------------------------------------------------------+
>>>>> | testcase: change | will-it-scale: will-it-scale.per_process_ops -3.7% regression                    |
>>>>> | test machine     | 144 threads 4 sockets Intel(R) Xeon(R) Gold 5318H CPU @ 2.50GHz with 128G memory |
>>>>> | test parameters  | cpufreq_governor=performance                                                     |
>>>>> |                  | mode=process                                                                     |
>>>>> |                  | nr_task=50%                                                                      |
>>>>> |                  | test=mmap2                                                                       |
>>>>> |                  | ucode=0x700001e                                                                  |
>>>>> +------------------+----------------------------------------------------------------------------------+
>>>>> | testcase: change | will-it-scale: will-it-scale.per_process_ops -5.1% regression                    |
>>>>> | test machine     | 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz with 256G memory  |
>>>>> | test parameters  | cpufreq_governor=performance                                                     |
>>>>> |                  | mode=process                                                                     |
>>>>> |                  | nr_task=16                                                                       |
>>>>> |                  | test=mmap2                                                                       |
>>>>> |                  | ucode=0xd000280                                                                  |
>>>>> +------------------+----------------------------------------------------------------------------------+
>>>>> | testcase: change | will-it-scale: will-it-scale.per_process_ops -5.9% regression                    |
>>>>> | test machine     | 88 threads 2 sockets Intel(R) Xeon(R) Gold 6238M CPU @ 2.10GHz with 128G memory  |
>>>>> | test parameters  | cpufreq_governor=performance                                                     |
>>>>> |                  | mode=process                                                                     |
>>>>> |                  | nr_task=16                                                                       |
>>>>> |                  | test=mmap1                                                                       |
>>>>> |                  | ucode=0x5003006                                                                  |
>>>>> +------------------+----------------------------------------------------------------------------------+
>>>>> | testcase: change | will-it-scale: will-it-scale.per_process_ops -3.5% regression                    |
>>>>> | test machine     | 88 threads 2 sockets Intel(R) Xeon(R) Gold 6238M CPU @ 2.10GHz with 128G memory  |
>>>>> | test parameters  | cpufreq_governor=performance                                                     |
>>>>> |                  | mode=process                                                                     |
>>>>> |                  | nr_task=50%                                                                      |
>>>>> |                  | test=mmap2                                                                       |
>>>>> |                  | ucode=0x5003006                                                                  |
>>>>> +------------------+----------------------------------------------------------------------------------+
>>>>>
>>>>>
>>>>> If you fix the issue, kindly add following tag
>>>>> Reported-by: kernel test robot <oliver.sang@intel.com>
>>>>>
>>>>>
>>>>> Details are as below:
>>>>> -------------------------------------------------------------------------------------------------->
>>>>>
>>>>>
>>>>> To reproduce:
>>>>>
>>>>>          git clone https://github.com/intel/lkp-tests.git
>>>>>          cd lkp-tests
>>>>>          sudo bin/lkp install job.yaml           # job file is attached in this email
>>>>>          bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
>>>>>          sudo bin/lkp run generated-yaml-file
>>>>>
>>>>>          # if come across any failure that blocks the test,
>>>>>          # please remove ~/.lkp and /lkp dir to run from a clean state.
>>>>>
>>>>> =========================================================================================
>>>>> compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase/ucode:
>>>>>    gcc-9/performance/x86_64-rhel-8.3/process/50%/debian-10.4-x86_64-20200603.cgz/lkp-icl-2sp2/mmap2/will-it-scale/0xd000280
>>>>>
>>>>> commit:
>>>>>    f87bc8dc7a ("x86/asm: Add _ASM_RIP() macro for x86-64 (%rip) suffix")
>>>>>    0507503671 ("x86/asm: Avoid adding register pressure for the init case in static_cpu_has()")
>>>>>
>>>>> f87bc8dc7a7c438c 0507503671f9b1c867e889cbec0
>>>>> ---------------- ---------------------------
>>>>>           %stddev     %change         %stddev
>>>>>               \          |                \
>>>>>    41898923            -4.9%   39829159        will-it-scale.64.processes
>>>>>      654670            -4.9%     622330        will-it-scale.per_process_ops
>>>>>    41898923            -4.9%   39829159        will-it-scale.workload
>>>>>        6918 ± 54%    +116.5%      14975 ± 14%  softirqs.CPU20.SCHED
>>>>>      240.00 ± 18%     +57.8%     378.67 ± 20%  slabinfo.biovec-64.active_objs
>>>>>      240.00 ± 18%     +57.8%     378.67 ± 20%  slabinfo.biovec-64.num_objs
>>>>>        0.01 ± 28%     -36.1%       0.01 ± 14%  perf-sched.sch_delay.max.ms.__x64_sys_pause.do_syscall_64.entry_SYSCALL_64_after_hwframe.[unknown]
>>>>>        6114 ± 24%     -46.1%       3296 ± 46%  perf-sched.wait_and_delay.max.ms.schedule_hrtimeout_range_clock.poll_schedule_timeout.constprop.0.do_sys_poll
>>>>>        6114 ± 24%     -46.1%       3296 ± 46%  perf-sched.wait_time.max.ms.schedule_hrtimeout_range_clock.poll_schedule_timeout.constprop.0.do_sys_poll
>>>>>        1409 ± 30%     -30.7%     977.00 ± 23%  interrupts.CPU1.CAL:Function_call_interrupts
>>>>>        3001 ± 58%     -70.3%     892.50 ± 69%  interrupts.CPU1.RES:Rescheduling_interrupts
>>>>>      669.83 ±172%    +696.9%       5338 ±102%  interrupts.CPU108.NMI:Non-maskable_interrupts
> 

      reply	other threads:[~2021-11-17  1:27 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-11-15  7:37 [x86/asm] 0507503671: will-it-scale.per_process_ops -4.9% regression kernel test robot
2021-11-15  9:53 ` Yin Fengwei
2021-11-15 19:20   ` H. Peter Anvin
2021-11-15 20:39     ` Peter Zijlstra
2021-11-15 21:15       ` H. Peter Anvin
2021-11-15 21:46         ` Peter Zijlstra
2021-11-16  1:41           ` Feng Tang
2021-11-16  1:57         ` Yin Fengwei
2021-11-16  1:40     ` Yin Fengwei
2021-11-16  9:33       ` Yin Fengwei
2021-11-17  2:44         ` Oliver Sang
2021-11-17  1:00       ` H. Peter Anvin
2021-11-17  1:27         ` Yin Fengwei [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=82106f10-98e9-241e-38dc-7f766b0218b5@intel.com \
    --to=fengwei.yin@intel.com \
    --cc=lkp@lists.01.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.