Re: [x86/asm] 0507503671: will-it-scale.per_process_ops -4.9% regression

From: Yin Fengwei <fengwei.yin@intel.com>
To: lkp@lists.01.org
Subject: Re: [x86/asm] 0507503671: will-it-scale.per_process_ops -4.9% regression
Date: Mon, 15 Nov 2021 17:53:32 +0800
Message-ID: <0cfae054-3150-35d4-f95f-68b5ec1897e7@intel.com>
In-Reply-To: <20211115073738.GA23967@xsang-OptiPlex-9020>

Hi,

On 11/15/2021 3:37 PM, kernel test robot wrote:
> 
> 
> Greeting,
> 
> FYI, we noticed a -4.9% regression of will-it-scale.per_process_ops due to commit:
> 
> 
> commit: 0507503671f9b1c867e889cbec0f43abf904f23c ("x86/asm: Avoid adding register pressure for the init case in static_cpu_has()")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> 
> in testcase: will-it-scale
> on test machine: 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz with 256G memory
> with following parameters:
> 
> 	nr_task: 50%
> 	mode: process
> 	test: mmap2
> 	cpufreq_governor: performance
> 	ucode: 0xd000280
> 
> test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
> test-url: https://github.com/antonblanchard/will-it-scale
> 
> 
> please be noted, since we don't have clue why this commit could cause
> performance drop, so we did further tests on other platforms or with
> different parameters, and got below results.
Adding Kees in case he is interested in this behavior.

Observations on this regression:
   After the patch, better code is generated. I disassembled the function
   intel_pmu_store_lbr from vmlinux built at commits f87bc8dc7a7c and
   0507503671f9 and got:

   With commit f87bc8dc7a7c (parent commit):
     https://zerobin.net/?22efb1114b097030#ryD/8LpasEIg8WrS6O/M+sHYJp7c/LoAXPfeB7BUqu4=

   With commit 0507503671f9:
     https://zerobin.net/?e57652572b3ec83c#CvobUggve54SIHmlZ6jkzs3s4k8iQN4ophFoOs7LHMI=

   The assembly code with commit 0507503671f9 is smaller than with the parent
   commit. The use of register r12 in the parent commit looks strange, IIUC.

   BTW, the reason we picked the function intel_pmu_store_lbr is:
     It's the first function in System.map whose size differs with and
     without the patch.
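
   The selection step above could be scripted roughly as follows. This is
   only a sketch under assumptions: System.map itself lists addresses, not
   sizes, so the sketch assumes the input is `nm -S vmlinux` output
   (address, size, type, name) captured from each build. The helper names
   and the sample symbol data are hypothetical, not taken from the real
   kernels.

```python
def parse_nm_sizes(nm_output):
    """Parse `nm -S` style lines: '<addr> <size> <type> <name>' (hex fields)."""
    symbols = []
    for line in nm_output.splitlines():
        fields = line.split()
        # Symbols without a recorded size have only three fields; skip them.
        if len(fields) != 4:
            continue
        addr, size, sym_type, name = fields
        if sym_type in ("t", "T"):  # keep text (function) symbols only
            symbols.append((int(addr, 16), int(size, 16), name))
    symbols.sort()  # address order, as in System.map
    return symbols


def first_size_difference(old_nm, new_nm):
    """Return (name, old_size, new_size) for the first function, in address
    order of the old build, whose size differs; None if there is none.
    Note: duplicate static names collapse to the last one seen."""
    new_sizes = {name: size for _, size, name in parse_nm_sizes(new_nm)}
    for _, old_size, name in parse_nm_sizes(old_nm):
        if name in new_sizes and new_sizes[name] != old_size:
            return name, old_size, new_sizes[name]
    return None


# Made-up sample data for illustration only.
OLD_NM = """\
ffffffff81000000 0000000000000030 T startup_64
ffffffff81003a10 00000000000000e5 T intel_pmu_store_lbr
ffffffff81004000 0000000000000040 t helper
"""
NEW_NM = """\
ffffffff81000000 0000000000000030 T startup_64
ffffffff81003a10 00000000000000d9 T intel_pmu_store_lbr
ffffffff81003ff0 0000000000000040 t helper
"""

if __name__ == "__main__":
    print(first_size_difference(OLD_NM, NEW_NM))
```

   With real builds, the two strings would be replaced by the captured
   `nm -S` output of each vmlinux.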

So the performance with commit 0507503671f9 would be expected to be better. But
the test results show an improvement on only one test box. On the other three
test boxes, it introduced regressions. This looks strange.

Regards
Yin, Fengwei

> 
> except the 1% improvement from the first test on a 4 sockets Haswell-EX,
> others all show similar regression:
> 
> +------------------+----------------------------------------------------------------------------------+
> | testcase: change | will-it-scale: will-it-scale.per_process_ops +1.0% improvement                   |
> | test machine     | 144 threads 4 sockets Intel(R) Xeon(R) CPU E7-8890 v3 @ 2.50GHz with 512G memory |
> | test parameters  | cpufreq_governor=performance                                                     |
> |                  | mode=process                                                                     |
> |                  | nr_task=50%                                                                      |
> |                  | test=mmap2                                                                       |
> |                  | ucode=0x16                                                                       |
> +------------------+----------------------------------------------------------------------------------+
> | testcase: change | will-it-scale: will-it-scale.per_process_ops -3.7% regression                    |
> | test machine     | 144 threads 4 sockets Intel(R) Xeon(R) Gold 5318H CPU @ 2.50GHz with 128G memory |
> | test parameters  | cpufreq_governor=performance                                                     |
> |                  | mode=process                                                                     |
> |                  | nr_task=50%                                                                      |
> |                  | test=mmap2                                                                       |
> |                  | ucode=0x700001e                                                                  |
> +------------------+----------------------------------------------------------------------------------+
> | testcase: change | will-it-scale: will-it-scale.per_process_ops -5.1% regression                    |
> | test machine     | 128 threads 2 sockets Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz with 256G memory  |
> | test parameters  | cpufreq_governor=performance                                                     |
> |                  | mode=process                                                                     |
> |                  | nr_task=16                                                                       |
> |                  | test=mmap2                                                                       |
> |                  | ucode=0xd000280                                                                  |
> +------------------+----------------------------------------------------------------------------------+
> | testcase: change | will-it-scale: will-it-scale.per_process_ops -5.9% regression                    |
> | test machine     | 88 threads 2 sockets Intel(R) Xeon(R) Gold 6238M CPU @ 2.10GHz with 128G memory  |
> | test parameters  | cpufreq_governor=performance                                                     |
> |                  | mode=process                                                                     |
> |                  | nr_task=16                                                                       |
> |                  | test=mmap1                                                                       |
> |                  | ucode=0x5003006                                                                  |
> +------------------+----------------------------------------------------------------------------------+
> | testcase: change | will-it-scale: will-it-scale.per_process_ops -3.5% regression                    |
> | test machine     | 88 threads 2 sockets Intel(R) Xeon(R) Gold 6238M CPU @ 2.10GHz with 128G memory  |
> | test parameters  | cpufreq_governor=performance                                                     |
> |                  | mode=process                                                                     |
> |                  | nr_task=50%                                                                      |
> |                  | test=mmap2                                                                       |
> |                  | ucode=0x5003006                                                                  |
> +------------------+----------------------------------------------------------------------------------+
> 
> 
> If you fix the issue, kindly add following tag
> Reported-by: kernel test robot <oliver.sang@intel.com>
> 
> 
> Details are as below:
> -------------------------------------------------------------------------------------------------->
> 
> 
> To reproduce:
> 
>         git clone https://github.com/intel/lkp-tests.git
>         cd lkp-tests
>         sudo bin/lkp install job.yaml           # job file is attached in this email
>         bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
>         sudo bin/lkp run generated-yaml-file
> 
>         # if come across any failure that blocks the test,
>         # please remove ~/.lkp and /lkp dir to run from a clean state.
> 
> =========================================================================================
> compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase/ucode:
>   gcc-9/performance/x86_64-rhel-8.3/process/50%/debian-10.4-x86_64-20200603.cgz/lkp-icl-2sp2/mmap2/will-it-scale/0xd000280
> 
> commit: 
>   f87bc8dc7a ("x86/asm: Add _ASM_RIP() macro for x86-64 (%rip) suffix")
>   0507503671 ("x86/asm: Avoid adding register pressure for the init case in static_cpu_has()")
> 
> f87bc8dc7a7c438c 0507503671f9b1c867e889cbec0 
> ---------------- --------------------------- 
>          %stddev     %change         %stddev
>              \          |                \  
>   41898923            -4.9%   39829159        will-it-scale.64.processes
>     654670            -4.9%     622330        will-it-scale.per_process_ops
>   41898923            -4.9%   39829159        will-it-scale.workload
>       6918 ± 54%    +116.5%      14975 ± 14%  softirqs.CPU20.SCHED
>     240.00 ± 18%     +57.8%     378.67 ± 20%  slabinfo.biovec-64.active_objs
>     240.00 ± 18%     +57.8%     378.67 ± 20%  slabinfo.biovec-64.num_objs
>       0.01 ± 28%     -36.1%       0.01 ± 14%  perf-sched.sch_delay.max.ms.__x64_sys_pause.do_syscall_64.entry_SYSCALL_64_after_hwframe.[unknown]
>       6114 ± 24%     -46.1%       3296 ± 46%  perf-sched.wait_and_delay.max.ms.schedule_hrtimeout_range_clock.poll_schedule_timeout.constprop.0.do_sys_poll
>       6114 ± 24%     -46.1%       3296 ± 46%  perf-sched.wait_time.max.ms.schedule_hrtimeout_range_clock.poll_schedule_timeout.constprop.0.do_sys_poll
>       1409 ± 30%     -30.7%     977.00 ± 23%  interrupts.CPU1.CAL:Function_call_interrupts
>       3001 ± 58%     -70.3%     892.50 ± 69%  interrupts.CPU1.RES:Rescheduling_interrupts
>     669.83 ±172%    +696.9%       5338 ±102%  interrupts.CPU108.NMI:Non-maskable_interrupts