My mistake, I confused r- and p-values (The r-value is useful but not as useful as p.) The p-value is a statistical measure of validity. It is typically produced either bya t-test or z-test. https://en.wikipedia.org/wiki/P-value https://en.wikipedia.org/wiki/Student%27s_t-test https://en.wikipedia.org/wiki/Z-test The z-test is more "mathematically pure", but as far as I understand the t-test is less likely to have false positives for a smaller number of data points. On November 15, 2021 5:40:55 PM PST, Yin Fengwei wrote: >Hi, > >On 11/16/2021 3:20 AM, H. Peter Anvin wrote: >> [Cc: Peter Z.] >> >> This seems totally bizarre... that is an *enormous* change, and if I'm reading it right it seems like this somehow related to the performance monitoring framework itself? >We can rerun the benchmark with performance monitor totally disabled. > >> >> The lower-performance init code is all pushed into the pre-boot path, unless for some strange reason not all code gets patched e.g. at module loading time. >> >> A quick peek around made me notice a few minor possibilities, but none of them look particularly sane: >> >> 1. We don't use "asm inline" in asm_volatile_goto, and we probably >>    should; otherwise gcc might get the idea this is a more heavyweight >>    operation than it actually is. >> 2. There is a workaround in asm_volatile_goto for a bug which apparently >>    was fixed in gcc 4.8.x that might mislead gcc's code generator into >>    generating worse code. >> >> Did you see any functions for which the code got *bigger*? >I checked 7 or 8 functions. Most of them were same. None of them got bigger. >This intel_pmu_store_lbr was smaller. I could do a full comparing. > >> >> >> Not directly related, but it would be really helpful to get an r-value with your statistics.  It would greatly help avoiding chasing ghosts. >Can you share what's the "r-value" and how to get it? Thanks. > > >Regards >Yin, Fengwei > >> >> >> >> >> On 11/15/21 01:53, Yin Fengwei wrote: >>> Hi, >>> >>> On 11/15/2021 3:37 PM, kernel test robot wrote: >>>> >>>> >>>> Greeting, >>>> >>>> FYI, we noticed a -4.9% regression of will-it-scale.per_process_ops due to commit: >>>> >>>> >>>> commit: 0507503671f9b1c867e889cbec0f43abf904f23c ("x86/asm: Avoid adding register pressure for the init case in static_cpu_has()") >>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master >>>> >>>> in testcase: will-it-scale >>>> on test machine: 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz with 256G memory >>>> with following parameters: >>>> >>>>     nr_task: 50% >>>>     mode: process >>>>     test: mmap2 >>>>     cpufreq_governor: performance >>>>     ucode: 0xd000280 >>>> >>>> test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two. >>>> test-url: https://github.com/antonblanchard/will-it-scale >>>> >>>> >>>> please be noted, since we don't have clue why this commit could cause >>>> performance drop, so we did further tests on other platforms or with >>>> different parameters, and got below results. >>> Add Kees in case he is interest to this behavior. >>> >>> Observation on this regression: >>>     After the patch, the better code is generated. De-assembled the function intel_pmu_store_lbr with >>>     vmlinux built from commit f87bc8dc7a7c and 0507503671f9 and got: >>>        With commit f87bc8dc7a7c (parent commit): >>>       https://zerobin.net/?22efb1114b097030#ryD/8LpasEIg8WrS6O/M+sHYJp7c/LoAXPfeB7BUqu4= >>> >>>     With commit 0507503671f9: >>>       https://zerobin.net/?e57652572b3ec83c#CvobUggve54SIHmlZ6jkzs3s4k8iQN4ophFoOs7LHMI= >>> >>>     The assembly code with commit 0507503671f9 is smaller than with parent commit. The >>>     register r12 in parent commit is strange IIUC. >>> >>> >>>     BTW, the reason that we picked up function intel_pmu_store_lbr is: >>>       It's the first function in System.map which has different size w/o the patch >>> >>> >>> Suppose the performance data with commit 0507503671f9 should be better. But the >>> test result showed it had improvement only on one test box. On other three test box, >>> it introduced regressions. Looks like strange. >>> >>> >>> Regards >>> Yin, Fengwei >>> >>>> >>>> except the 1% improvement from the first test on a 4 sockets Haswell-EX, >>>> others all show similar regression: >>>> >>>> +------------------+----------------------------------------------------------------------------------+ >>>> | testcase: change | will-it-scale: will-it-scale.per_process_ops +1.0% improvement                   | >>>> | test machine     | 144 threads 4 sockets Intel(R) Xeon(R) CPU E7-8890 v3 @ 2.50GHz with 512G memory | >>>> | test parameters  | cpufreq_governor=performance                                                     | >>>> |                  | mode=process                                                                     | >>>> |                  | nr_task=50%                                                                      | >>>> |                  | test=mmap2                                                                       | >>>> |                  | ucode=0x16                                                                       | >>>> +------------------+----------------------------------------------------------------------------------+ >>>> | testcase: change | will-it-scale: will-it-scale.per_process_ops -3.7% regression                    | >>>> | test machine     | 144 threads 4 sockets Intel(R) Xeon(R) Gold 5318H CPU @ 2.50GHz with 128G memory | >>>> | test parameters  | cpufreq_governor=performance                                                     | >>>> |                  | mode=process                                                                     | >>>> |                  | nr_task=50%                                                                      | >>>> |                  | test=mmap2                                                                       | >>>> |                  | ucode=0x700001e                                                                  | >>>> +------------------+----------------------------------------------------------------------------------+ >>>> | testcase: change | will-it-scale: will-it-scale.per_process_ops -5.1% regression                    | >>>> | test machine     | 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz with 256G memory  | >>>> | test parameters  | cpufreq_governor=performance                                                     | >>>> |                  | mode=process                                                                     | >>>> |                  | nr_task=16                                                                       | >>>> |                  | test=mmap2                                                                       | >>>> |                  | ucode=0xd000280                                                                  | >>>> +------------------+----------------------------------------------------------------------------------+ >>>> | testcase: change | will-it-scale: will-it-scale.per_process_ops -5.9% regression                    | >>>> | test machine     | 88 threads 2 sockets Intel(R) Xeon(R) Gold 6238M CPU @ 2.10GHz with 128G memory  | >>>> | test parameters  | cpufreq_governor=performance                                                     | >>>> |                  | mode=process                                                                     | >>>> |                  | nr_task=16                                                                       | >>>> |                  | test=mmap1                                                                       | >>>> |                  | ucode=0x5003006                                                                  | >>>> +------------------+----------------------------------------------------------------------------------+ >>>> | testcase: change | will-it-scale: will-it-scale.per_process_ops -3.5% regression                    | >>>> | test machine     | 88 threads 2 sockets Intel(R) Xeon(R) Gold 6238M CPU @ 2.10GHz with 128G memory  | >>>> | test parameters  | cpufreq_governor=performance                                                     | >>>> |                  | mode=process                                                                     | >>>> |                  | nr_task=50%                                                                      | >>>> |                  | test=mmap2                                                                       | >>>> |                  | ucode=0x5003006                                                                  | >>>> +------------------+----------------------------------------------------------------------------------+ >>>> >>>> >>>> If you fix the issue, kindly add following tag >>>> Reported-by: kernel test robot >>>> >>>> >>>> Details are as below: >>>> --------------------------------------------------------------------------------------------------> >>>> >>>> >>>> To reproduce: >>>> >>>>          git clone https://github.com/intel/lkp-tests.git >>>>          cd lkp-tests >>>>          sudo bin/lkp install job.yaml           # job file is attached in this email >>>>          bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run >>>>          sudo bin/lkp run generated-yaml-file >>>> >>>>          # if come across any failure that blocks the test, >>>>          # please remove ~/.lkp and /lkp dir to run from a clean state. >>>> >>>> ========================================================================================= >>>> compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase/ucode: >>>>    gcc-9/performance/x86_64-rhel-8.3/process/50%/debian-10.4-x86_64-20200603.cgz/lkp-icl-2sp2/mmap2/will-it-scale/0xd000280 >>>> >>>> commit: >>>>    f87bc8dc7a ("x86/asm: Add _ASM_RIP() macro for x86-64 (%rip) suffix") >>>>    0507503671 ("x86/asm: Avoid adding register pressure for the init case in static_cpu_has()") >>>> >>>> f87bc8dc7a7c438c 0507503671f9b1c867e889cbec0 >>>> ---------------- --------------------------- >>>>           %stddev     %change         %stddev >>>>               \          |                \ >>>>    41898923            -4.9%   39829159        will-it-scale.64.processes >>>>      654670            -4.9%     622330        will-it-scale.per_process_ops >>>>    41898923            -4.9%   39829159        will-it-scale.workload >>>>        6918 ± 54%    +116.5%      14975 ± 14%  softirqs.CPU20.SCHED >>>>      240.00 ± 18%     +57.8%     378.67 ± 20%  slabinfo.biovec-64.active_objs >>>>      240.00 ± 18%     +57.8%     378.67 ± 20%  slabinfo.biovec-64.num_objs >>>>        0.01 ± 28%     -36.1%       0.01 ± 14%  perf-sched.sch_delay.max.ms.__x64_sys_pause.do_syscall_64.entry_SYSCALL_64_after_hwframe.[unknown] >>>>        6114 ± 24%     -46.1%       3296 ± 46%  perf-sched.wait_and_delay.max.ms.schedule_hrtimeout_range_clock.poll_schedule_timeout.constprop.0.do_sys_poll >>>>        6114 ± 24%     -46.1%       3296 ± 46%  perf-sched.wait_time.max.ms.schedule_hrtimeout_range_clock.poll_schedule_timeout.constprop.0.do_sys_poll >>>>        1409 ± 30%     -30.7%     977.00 ± 23%  interrupts.CPU1.CAL:Function_call_interrupts >>>>        3001 ± 58%     -70.3%     892.50 ± 69%  interrupts.CPU1.RES:Rescheduling_interrupts >>>>      669.83 ±172%    +696.9%       5338 ±102%  interrupts.CPU108.NMI:Non-maskable_interrupts -- Sent from my Android device with K-9 Mail. Please excuse my brevity.