[Cc: Peter Z.]

This seems totally bizarre... that is an *enormous* change, and if I'm
reading it right it seems like this is somehow related to the
performance monitoring framework itself?

The lower-performance init code is all pushed into the pre-boot path,
unless for some strange reason not all code gets patched, e.g. at
module loading time.

A quick peek around made me notice a few minor possibilities, but none
of them look particularly sane:

1. We don't use "asm inline" in asm_volatile_goto, and we probably
   should; otherwise gcc might get the idea this is a more heavyweight
   operation than it actually is.

2. There is a workaround in asm_volatile_goto for a bug which
   apparently was fixed in gcc 4.8.x that might mislead gcc's code
   generator into generating worse code.

Did you see any functions for which the code got *bigger*?

Not directly related, but it would be really helpful to get an r-value
with your statistics. It would greatly help avoid chasing ghosts.

On 11/15/21 01:53, Yin Fengwei wrote:
> Hi,
>
> On 11/15/2021 3:37 PM, kernel test robot wrote:
>>
>>
>> Greeting,
>>
>> FYI, we noticed a -4.9% regression of will-it-scale.per_process_ops due to commit:
>>
>>
>> commit: 0507503671f9b1c867e889cbec0f43abf904f23c ("x86/asm: Avoid adding register pressure for the init case in static_cpu_has()")
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>
>> in testcase: will-it-scale
>> on test machine: 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz with 256G memory
>> with following parameters:
>>
>>         nr_task: 50%
>>         mode: process
>>         test: mmap2
>>         cpufreq_governor: performance
>>         ucode: 0xd000280
>>
>> test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
>> test-url: https://github.com/antonblanchard/will-it-scale
>>
>>
>> please be noted, since we don't have clue why this commit could cause
>> performance drop, so we did further tests on other platforms or with
>> different parameters, and got below results.
> Add Kees in case he is interest to this behavior.
>
> Observation on this regression:
> After the patch, the better code is generated. De-assembled the function intel_pmu_store_lbr with
> vmlinux built from commit f87bc8dc7a7c and 0507503671f9 and got:
>
> With commit f87bc8dc7a7c (parent commit):
> https://zerobin.net/?22efb1114b097030#ryD/8LpasEIg8WrS6O/M+sHYJp7c/LoAXPfeB7BUqu4=
>
> With commit 0507503671f9:
> https://zerobin.net/?e57652572b3ec83c#CvobUggve54SIHmlZ6jkzs3s4k8iQN4ophFoOs7LHMI=
>
> The assembly code with commit 0507503671f9 is smaller than with parent commit. The
> register r12 in parent commit is strange IIUC.
>
>
> BTW, the reason that we picked up function intel_pmu_store_lbr is:
> It's the first function in System.map which has different size w/o the patch
>
>
> Suppose the performance data with commit 0507503671f9 should be better. But the
> test result showed it had improvement only on one test box. On other three test box,
> it introduced regressions. Looks like strange.
>
>
> Regards
> Yin, Fengwei
>
>>
>> except the 1% improvement from the first test on a 4 sockets Haswell-EX,
>> others all show similar regression:
>>
>> +------------------+----------------------------------------------------------------------------------+
>> | testcase: change | will-it-scale: will-it-scale.per_process_ops +1.0% improvement                   |
>> | test machine     | 144 threads 4 sockets Intel(R) Xeon(R) CPU E7-8890 v3 @ 2.50GHz with 512G memory |
>> | test parameters  | cpufreq_governor=performance                                                     |
>> |                  | mode=process                                                                     |
>> |                  | nr_task=50%                                                                      |
>> |                  | test=mmap2                                                                       |
>> |                  | ucode=0x16                                                                       |
>> +------------------+----------------------------------------------------------------------------------+
>> | testcase: change | will-it-scale: will-it-scale.per_process_ops -3.7% regression                    |
>> | test machine     | 144 threads 4 sockets Intel(R) Xeon(R) Gold 5318H CPU @ 2.50GHz with 128G memory |
>> | test parameters  | cpufreq_governor=performance                                                     |
>> |                  | mode=process                                                                     |
>> |                  | nr_task=50%                                                                      |
>> |                  | test=mmap2                                                                       |
>> |                  | ucode=0x700001e                                                                  |
>> +------------------+----------------------------------------------------------------------------------+
>> | testcase: change | will-it-scale: will-it-scale.per_process_ops -5.1% regression                    |
>> | test machine     | 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz with 256G memory  |
>> | test parameters  | cpufreq_governor=performance                                                     |
>> |                  | mode=process                                                                     |
>> |                  | nr_task=16                                                                       |
>> |                  | test=mmap2                                                                       |
>> |                  | ucode=0xd000280                                                                  |
>> +------------------+----------------------------------------------------------------------------------+
>> | testcase: change | will-it-scale: will-it-scale.per_process_ops -5.9% regression                    |
>> | test machine     | 88 threads 2 sockets Intel(R) Xeon(R) Gold 6238M CPU @ 2.10GHz with 128G memory  |
>> | test parameters  | cpufreq_governor=performance                                                     |
>> |                  | mode=process                                                                     |
>> |                  | nr_task=16                                                                       |
>> |                  | test=mmap1                                                                       |
>> |                  | ucode=0x5003006                                                                  |
>> +------------------+----------------------------------------------------------------------------------+
>> | testcase: change | will-it-scale: will-it-scale.per_process_ops -3.5% regression                    |
>> | test machine     | 88 threads 2 sockets Intel(R) Xeon(R) Gold 6238M CPU @ 2.10GHz with 128G memory  |
>> | test parameters  | cpufreq_governor=performance                                                     |
>> |                  | mode=process                                                                     |
>> |                  | nr_task=50%                                                                      |
>> |                  | test=mmap2                                                                       |
>> |                  | ucode=0x5003006                                                                  |
>> +------------------+----------------------------------------------------------------------------------+
>>
>>
>> If you fix the issue, kindly add following tag
>> Reported-by: kernel test robot
>>
>>
>> Details are as below:
>> -------------------------------------------------------------------------------------------------->
>>
>>
>> To reproduce:
>>
>>         git clone https://github.com/intel/lkp-tests.git
>>         cd lkp-tests
>>         sudo bin/lkp install job.yaml           # job file is attached in this email
>>         bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
>>         sudo bin/lkp run generated-yaml-file
>>
>>         # if come across any failure that blocks the test,
>>         # please remove ~/.lkp and /lkp dir to run from a clean state.
>>
>> =========================================================================================
>> compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase/ucode:
>>   gcc-9/performance/x86_64-rhel-8.3/process/50%/debian-10.4-x86_64-20200603.cgz/lkp-icl-2sp2/mmap2/will-it-scale/0xd000280
>>
>> commit:
>>   f87bc8dc7a ("x86/asm: Add _ASM_RIP() macro for x86-64 (%rip) suffix")
>>   0507503671 ("x86/asm: Avoid adding register pressure for the init case in static_cpu_has()")
>>
>> f87bc8dc7a7c438c 0507503671f9b1c867e889cbec0
>> ---------------- ---------------------------
>>         %stddev  %change          %stddev
>>             \       |                 \
>>  41898923          -4.9%   39829159        will-it-scale.64.processes
>>    654670          -4.9%     622330        will-it-scale.per_process_ops
>>  41898923          -4.9%   39829159        will-it-scale.workload
>>      6918 ± 54%  +116.5%      14975 ± 14%  softirqs.CPU20.SCHED
>>    240.00 ± 18%   +57.8%     378.67 ± 20%  slabinfo.biovec-64.active_objs
>>    240.00 ± 18%   +57.8%     378.67 ± 20%  slabinfo.biovec-64.num_objs
>>      0.01 ± 28%   -36.1%       0.01 ± 14%  perf-sched.sch_delay.max.ms.__x64_sys_pause.do_syscall_64.entry_SYSCALL_64_after_hwframe.[unknown]
>>      6114 ± 24%   -46.1%       3296 ± 46%  perf-sched.wait_and_delay.max.ms.schedule_hrtimeout_range_clock.poll_schedule_timeout.constprop.0.do_sys_poll
>>      6114 ± 24%   -46.1%       3296 ± 46%  perf-sched.wait_time.max.ms.schedule_hrtimeout_range_clock.poll_schedule_timeout.constprop.0.do_sys_poll
>>      1409 ± 30%   -30.7%     977.00 ± 23%  interrupts.CPU1.CAL:Function_call_interrupts
>>      3001 ± 58%   -70.3%     892.50 ± 69%  interrupts.CPU1.RES:Rescheduling_interrupts
>>    669.83 ±172%  +696.9%       5338 ±102%  interrupts.CPU108.NMI:Non-maskable_interrupts