From: Yin Fengwei
To: lkp@lists.01.org
Subject: Re: [x86/asm] 0507503671: will-it-scale.per_process_ops -4.9% regression
Date: Mon, 15 Nov 2021 17:53:32 +0800
Message-ID: <0cfae054-3150-35d4-f95f-68b5ec1897e7@intel.com>
In-Reply-To: <20211115073738.GA23967@xsang-OptiPlex-9020>

Hi,

On 11/15/2021 3:37 PM, kernel test robot wrote:
>
>
> Greeting,
>
> FYI, we noticed a -4.9% regression of will-it-scale.per_process_ops due to commit:
>
>
> commit: 0507503671f9b1c867e889cbec0f43abf904f23c ("x86/asm: Avoid adding register pressure for the init case in static_cpu_has()")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>
> in testcase: will-it-scale
> on test machine: 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz with 256G memory
> with following parameters:
>
>         nr_task: 50%
>         mode: process
>         test: mmap2
>         cpufreq_governor: performance
>         ucode: 0xd000280
>
> test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
> test-url: https://github.com/antonblanchard/will-it-scale
>
>
> please be noted, since we don't have clue why this commit could cause
> performance drop, so we did further tests on other platforms or with
> different parameters, and got below results.

Adding Kees in case he is interested in this behavior.

Observations on this regression:
After the patch, better code is generated. I disassembled the function
intel_pmu_store_lbr from vmlinux built at commits f87bc8dc7a7c and
0507503671f9 and got:

With commit f87bc8dc7a7c (parent commit):
  https://zerobin.net/?22efb1114b097030#ryD/8LpasEIg8WrS6O/M+sHYJp7c/LoAXPfeB7BUqu4=

With commit 0507503671f9:
  https://zerobin.net/?e57652572b3ec83c#CvobUggve54SIHmlZ6jkzs3s4k8iQN4ophFoOs7LHMI=

The assembly code with commit 0507503671f9 is smaller than with the parent
commit. The use of register r12 in the parent commit looks strange, IIUC.

BTW, the reason we picked the function intel_pmu_store_lbr:
  It's the first function in System.map whose size differs with and
  without the patch.

I would expect the performance with commit 0507503671f9 to be better. But
the test results showed an improvement on only one test box. On the other
three test boxes, it introduced regressions. That looks strange.


Regards
Yin, Fengwei

>
> except the 1% improvement from the first test on a 4 sockets Haswell-EX,
> others all show similar regression:
>
> +------------------+----------------------------------------------------------------------------------+
> | testcase: change | will-it-scale: will-it-scale.per_process_ops +1.0% improvement                   |
> | test machine     | 144 threads 4 sockets Intel(R) Xeon(R) CPU E7-8890 v3 @ 2.50GHz with 512G memory |
> | test parameters  | cpufreq_governor=performance                                                     |
> |                  | mode=process                                                                     |
> |                  | nr_task=50%                                                                      |
> |                  | test=mmap2                                                                       |
> |                  | ucode=0x16                                                                       |
> +------------------+----------------------------------------------------------------------------------+
> | testcase: change | will-it-scale: will-it-scale.per_process_ops -3.7% regression                    |
> | test machine     | 144 threads 4 sockets Intel(R) Xeon(R) Gold 5318H CPU @ 2.50GHz with 128G memory |
> | test parameters  | cpufreq_governor=performance                                                     |
> |                  | mode=process                                                                     |
> |                  | nr_task=50%                                                                      |
> |                  | test=mmap2                                                                       |
> |                  | ucode=0x700001e                                                                  |
> +------------------+----------------------------------------------------------------------------------+
> | testcase: change | will-it-scale: will-it-scale.per_process_ops -5.1% regression                    |
> | test machine     | 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz with 256G memory  |
> | test parameters  | cpufreq_governor=performance                                                     |
> |                  | mode=process                                                                     |
> |                  | nr_task=16                                                                       |
> |                  | test=mmap2                                                                       |
> |                  | ucode=0xd000280                                                                  |
> +------------------+----------------------------------------------------------------------------------+
> | testcase: change | will-it-scale: will-it-scale.per_process_ops -5.9% regression                    |
> | test machine     | 88 threads 2 sockets Intel(R) Xeon(R) Gold 6238M CPU @ 2.10GHz with 128G memory  |
> | test parameters  | cpufreq_governor=performance                                                     |
> |                  | mode=process                                                                     |
> |                  | nr_task=16                                                                       |
> |                  | test=mmap1                                                                       |
> |                  | ucode=0x5003006                                                                  |
> +------------------+----------------------------------------------------------------------------------+
> | testcase: change | will-it-scale: will-it-scale.per_process_ops -3.5% regression                    |
> | test machine     | 88 threads 2 sockets Intel(R) Xeon(R) Gold 6238M CPU @ 2.10GHz with 128G memory  |
> | test parameters  | cpufreq_governor=performance                                                     |
> |                  | mode=process                                                                     |
> |                  | nr_task=50%                                                                      |
> |                  | test=mmap2                                                                       |
> |                  | ucode=0x5003006                                                                  |
> +------------------+----------------------------------------------------------------------------------+
>
>
> If you fix the issue, kindly add following tag
> Reported-by: kernel test robot
>
>
> Details are as below:
> -------------------------------------------------------------------------------------------------->
>
>
> To reproduce:
>
>         git clone https://github.com/intel/lkp-tests.git
>         cd lkp-tests
>         sudo bin/lkp install job.yaml           # job file is attached in this email
>         bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
>         sudo bin/lkp run generated-yaml-file
>
>         # if come across any failure that blocks the test,
>         # please remove ~/.lkp and /lkp dir to run from a clean state.
>
> =========================================================================================
> compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase/ucode:
>   gcc-9/performance/x86_64-rhel-8.3/process/50%/debian-10.4-x86_64-20200603.cgz/lkp-icl-2sp2/mmap2/will-it-scale/0xd000280
>
> commit:
>   f87bc8dc7a ("x86/asm: Add _ASM_RIP() macro for x86-64 (%rip) suffix")
>   0507503671 ("x86/asm: Avoid adding register pressure for the init case in static_cpu_has()")
>
> f87bc8dc7a7c438c 0507503671f9b1c867e889cbec0
> ---------------- ---------------------------
>          %stddev     %change         %stddev
>              \          |                \
>   41898923            -4.9%   39829159        will-it-scale.64.processes
>     654670            -4.9%     622330        will-it-scale.per_process_ops
>   41898923            -4.9%   39829159        will-it-scale.workload
>       6918 ± 54%    +116.5%      14975 ± 14%  softirqs.CPU20.SCHED
>     240.00 ± 18%     +57.8%     378.67 ± 20%  slabinfo.biovec-64.active_objs
>     240.00 ± 18%     +57.8%     378.67 ± 20%  slabinfo.biovec-64.num_objs
>       0.01 ± 28%     -36.1%       0.01 ± 14%  perf-sched.sch_delay.max.ms.__x64_sys_pause.do_syscall_64.entry_SYSCALL_64_after_hwframe.[unknown]
>       6114 ± 24%     -46.1%       3296 ± 46%  perf-sched.wait_and_delay.max.ms.schedule_hrtimeout_range_clock.poll_schedule_timeout.constprop.0.do_sys_poll
>       6114 ± 24%     -46.1%       3296 ± 46%  perf-sched.wait_time.max.ms.schedule_hrtimeout_range_clock.poll_schedule_timeout.constprop.0.do_sys_poll
>       1409 ± 30%     -30.7%     977.00 ± 23%  interrupts.CPU1.CAL:Function_call_interrupts
>       3001 ± 58%     -70.3%     892.50 ± 69%  interrupts.CPU1.RES:Rescheduling_interrupts
>     669.83 ±172%    +696.9%       5338 ±102%  interrupts.CPU108.NMI:Non-maskable_interrupts
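
The System.map comparison mentioned in the reply (picking the first function whose size differs between the two builds) could be done roughly as below. This is only a sketch, not the script that was actually used: it assumes a symbol's size can be approximated by the gap to the next symbol address, and the helper names and the two sample maps (addresses, func_a/func_b/func_c) are made up for illustration.

```python
from typing import Dict, List, Optional, Tuple

# System.map lines look like "<hex address> <type> <name>";
# 'T'/'t' mark global/local symbols in the text (code) section.
TEXT_TYPES = {"T", "t"}


def parse_system_map(text: str) -> List[Tuple[int, str, str]]:
    """Parse System.map content into (address, type, name) tuples."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        addr, sym_type, name = parts
        symbols.append((int(addr, 16), sym_type, name))
    return symbols


def function_sizes(symbols: List[Tuple[int, str, str]]) -> Dict[str, int]:
    """Approximate each text symbol's size as the gap to the next symbol.

    Assumption: good enough for a quick diff; aliases at the same address
    get size 0, and for duplicate static names only the first is kept.
    """
    ordered = sorted(symbols, key=lambda s: s[0])
    sizes: Dict[str, int] = {}
    for (addr, sym_type, name), (next_addr, _, _) in zip(ordered, ordered[1:]):
        if sym_type in TEXT_TYPES:
            sizes.setdefault(name, next_addr - addr)
    return sizes


def first_size_difference(
    before_map: str, after_map: str
) -> Optional[Tuple[str, int, int]]:
    """Return (name, size_before, size_after) for the first function, in
    address order of the 'before' map, whose size differs; None if none."""
    before = function_sizes(parse_system_map(before_map))
    after = function_sizes(parse_system_map(after_map))
    for name, size in before.items():  # dict keeps address order
        if name in after and after[name] != size:
            return name, size, after[name]
    return None


# Hypothetical sample data, not taken from the real kernel builds.
PARENT_MAP = """\
ffffffff81000000 T _text
ffffffff81000040 T func_a
ffffffff81000100 t func_b
ffffffff81000200 T func_c
ffffffff81000300 T _etext
"""

PATCHED_MAP = """\
ffffffff81000000 T _text
ffffffff81000040 T func_a
ffffffff81000100 t func_b
ffffffff810001e0 T func_c
ffffffff810002e0 T _etext
"""

if __name__ == "__main__":
    print(first_size_difference(PARENT_MAP, PATCHED_MAP))
```

For the real builds one would read the two System.map files from the respective build directories and then disassemble the reported function from each vmlinux to compare the generated code.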