Greeting,

FYI, we noticed a 3.8% improvement of will-it-scale.per_thread_ops due to commit:

commit: de87ae29269664b890e5323ff9649ab7990960cc ("x86/pti/64: Remove the SYSCALL64 entry trampoline")
git://internal_merge_and_test_tree devel-catchup-201808280250

in testcase: will-it-scale
on test machine: 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 64G memory
with following parameters:

	nr_task: 16
	mode: thread
	test: futex3
	cpufreq_governor: performance

test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
test-url: https://github.com/antonblanchard/will-it-scale

In addition to that, the commit also has significant impact on the following tests:

+------------------+----------------------------------------------------------------------+
| testcase: change | will-it-scale: will-it-scale.per_thread_ops 4.3% improvement         |
| test machine     | 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 64G memory |
| test parameters  | cpufreq_governor=performance                                         |
|                  | mode=thread                                                          |
|                  | nr_task=16                                                           |
|                  | test=getppid1                                                        |
+------------------+----------------------------------------------------------------------+

Details are as below:
-------------------------------------------------------------------------------------------------->

To reproduce:

	git clone https://github.com/intel/lkp-tests.git
	cd lkp-tests
	bin/lkp install job.yaml  # job file is attached in this email
	bin/lkp run     job.yaml

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
  gcc-7/performance/x86_64-rhel-7.2/thread/16/debian-x86_64-2018-04-03.cgz/lkp-bdw-ep3d/futex3/will-it-scale

commit:
  cba614f88d ("x86/entry/64: Use the TSS sp2 slot for rsp_scratch")
  de87ae2926 ("x86/pti/64: Remove the SYSCALL64 entry trampoline")
cba614f88dfdeadd de87ae29269664b890e5323ff9
---------------- --------------------------
         %stddev     %change         %stddev
             \          |                \
   3518673            +3.8%    3650732        will-it-scale.per_thread_ops
      3182            -2.2%       3112        will-it-scale.time.system_time
      1633            +4.3%       1703        will-it-scale.time.user_time
  56298775            +3.8%   58411722        will-it-scale.workload
   4744393 ±120%     -71.5%    1351494 ±  2%  cpuidle.C1.time
   2328850 ±  6%     +10.4%    2570720 ±  7%  softirqs.TIMER
      1717 ± 13%     +29.9%       2231 ±  4%  numa-meminfo.node0.PageTables
     15164 ± 18%     -62.4%       5695 ± 86%  numa-meminfo.node0.Shmem
      2576 ±  9%     -20.8%       2041 ±  4%  numa-meminfo.node1.PageTables
     60960 ± 24%     -36.5%      38712 ± 19%  numa-numastat.node0.local_node
     62186 ± 23%     -31.4%      42639 ± 16%  numa-numastat.node0.numa_hit
      1226 ± 57%    +220.2%       3926 ± 20%  numa-numastat.node0.other_node
      3544 ± 20%     -76.3%     840.00 ± 95%  numa-numastat.node1.other_node
    429.25 ± 13%     +29.9%     557.50 ±  4%  numa-vmstat.node0.nr_page_table_pages
      3794 ± 18%     -62.5%       1423 ± 86%  numa-vmstat.node0.nr_shmem
      1373 ± 50%    +194.5%       4044 ± 20%  numa-vmstat.node0.numa_other
    643.25 ±  9%     -20.6%     510.75 ±  4%  numa-vmstat.node1.nr_page_table_pages
    650501            +2.2%     664689        proc-vmstat.numa_hit
    645729            +2.2%     659920        proc-vmstat.numa_local
     38.50 ± 81%   +11564.9%      4491 ± 55%  proc-vmstat.numa_pages_migrated
      2461 ± 66%    +622.7%      17786 ± 66%  proc-vmstat.numa_pte_updates
    699026            +2.3%     715240        proc-vmstat.pgalloc_normal
    783179            +1.6%     795471        proc-vmstat.pgfault
    683943            +1.9%     697109        proc-vmstat.pgfree
     38.50 ± 81%   +11564.9%      4491 ± 55%  proc-vmstat.pgmigrate_success
     23.57 ±  3%      -23.6       0.00        perf-profile.calltrace.cycles-pp.__entry_trampoline_start
      0.00            +25.7      25.66 ±  4%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64
     23.68 ±  3%      -23.7       0.00        perf-profile.children.cycles-pp.__entry_trampoline_start
      0.00             +1.1       1.14 ±  6%  perf-profile.children.cycles-pp.__x86_indirect_thunk_rax
      0.00            +25.7      25.69 ±  4%  perf-profile.children.cycles-pp.entry_SYSCALL_64
     23.68 ±  3%      -23.7       0.00        perf-profile.self.cycles-pp.__entry_trampoline_start
      4.62 ±  6%       -1.7       2.95 ±  5%  perf-profile.self.cycles-pp.do_syscall_64
      2.49 ±  5%       -0.3       2.23 ±  5%  perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
      0.00             +1.1       1.07 ±  5%  perf-profile.self.cycles-pp.__x86_indirect_thunk_rax
      0.00            +25.7      25.69 ±  4%  perf-profile.self.cycles-pp.entry_SYSCALL_64
    293.92 ± 22%     -27.7%     212.58 ±  6%  sched_debug.cfs_rq:/.load_avg.max
     81.81 ±  8%     -14.8%      69.67 ± 15%  sched_debug.cfs_rq:/.load_avg.stddev
     10489 ±  4%     +16.9%      12261 ±  6%  sched_debug.cfs_rq:/.min_vruntime.min
      3.27 ± 16%     -74.0%       0.85 ±101%  sched_debug.cfs_rq:/.removed.util_avg.avg
     69.08 ± 24%     -72.9%      18.75 ±100%  sched_debug.cfs_rq:/.removed.util_avg.max
     14.37 ± 19%     -73.8%       3.76 ±100%  sched_debug.cfs_rq:/.removed.util_avg.stddev
    169390 ± 15%     +41.4%     239515 ± 19%  sched_debug.cpu.avg_idle.min
      2349 ± 21%     -29.7%       1651 ± 18%  sched_debug.cpu.nr_load_updates.stddev
     62893 ± 28%     -30.1%      43970 ± 14%  sched_debug.cpu.sched_count.max
     14764 ± 17%     -17.6%      12171 ± 10%  sched_debug.cpu.sched_count.stddev
    279.25 ±  6%     -18.2%     228.54 ±  8%  sched_debug.cpu.ttwu_count.min
      4.42             -1.3       3.12        perf-stat.branch-miss-rate%
 5.174e+10           -29.9%  3.624e+10        perf-stat.branch-misses
  33184723 ±  4%     +15.1%   38191250 ± 12%  perf-stat.cache-misses
 8.231e+08            +7.9%  8.883e+08 ±  4%  perf-stat.cache-references
      1.83            -2.2%       1.79        perf-stat.cpi
 2.071e+12            +2.1%  2.114e+12        perf-stat.dTLB-loads
     27.80 ±  2%      +71.5      99.28        perf-stat.iTLB-load-miss-rate%
  9.74e+09 ±  4%     +94.9%  1.898e+10        perf-stat.iTLB-load-misses
 2.528e+10           -99.5%  1.383e+08 ± 16%  perf-stat.iTLB-loads
 8.152e+12            +2.4%  8.345e+12        perf-stat.instructions
    838.42 ±  4%     -47.6%     439.73        perf-stat.instructions-per-iTLB-miss
      0.55            +2.3%       0.56        perf-stat.ipc
    762567            +1.5%     773691        perf-stat.minor-faults
   4676629 ± 30%     -55.2%    2092806 ±  4%  perf-stat.node-stores
    762578            +1.5%     773695        perf-stat.page-faults
    144800            -1.3%     142864        perf-stat.path-length

[ASCII plot omitted: will-it-scale.per_thread_ops per sample]
[ASCII plot omitted: will-it-scale.workload per sample]
[ASCII plot omitted: will-it-scale.time.user_time per sample]
[ASCII plot omitted: will-it-scale.time.system_time per sample]

[*] bisect-good sample
[O] bisect-bad sample

***************************************************************************************************
lkp-bdw-ep3d: 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz with 64G memory
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
  gcc-7/performance/x86_64-rhel-7.2/thread/16/debian-x86_64-2018-04-03.cgz/lkp-bdw-ep3d/getppid1/will-it-scale

commit:
  cba614f88d ("x86/entry/64: Use the TSS sp2 slot for rsp_scratch")
  de87ae2926 ("x86/pti/64: Remove the SYSCALL64 entry trampoline")

cba614f88dfdeadd de87ae29269664b890e5323ff9
---------------- --------------------------
         %stddev     %change         %stddev
             \          |                \
   4145966            +4.3%    4324715        will-it-scale.per_thread_ops
  66335462            +4.3%   69195456        will-it-scale.workload
      0.00 ± 30%       +0.0       0.01 ± 27%  mpstat.cpu.soft%
      1570 ±141%     +51.1%       2372 ± 98%  numa-numastat.node0.other_node
      2070 ±  4%     +25.0%       2588 ± 17%  numa-meminfo.node1.PageTables
     25821 ±  8%      +8.7%      28067 ±  5%  numa-meminfo.node1.SReclaimable
      1208 ±  4%      +5.7%       1276 ±  5%  slabinfo.eventpoll_pwq.active_objs
      1208 ±  4%      +5.7%       1276 ±  5%  slabinfo.eventpoll_pwq.num_objs
      6962            +2.7%       7149        proc-vmstat.nr_shmem
    430.33          +770.7%       3747 ± 87%  proc-vmstat.numa_hint_faults_local
      3875            +5.8%       4099 ±  4%  proc-vmstat.pgactivate
     16121 ± 13%     -32.4%      10896 ± 11%  numa-vmstat.node0
      1687 ±129%     +47.6%       2489 ± 92%  numa-vmstat.node0.numa_other
      9365 ± 22%     +55.3%      14544 ±  8%  numa-vmstat.node1
    517.33 ±  4%     +25.1%     647.00 ± 17%  numa-vmstat.node1.nr_page_table_pages
      6454 ±  8%      +8.7%       7016 ±  5%  numa-vmstat.node1.nr_slab_reclaimable
   8466921 ± 97%     -83.8%    1374133 ±  2%  cpuidle.C1.time
    148474 ± 64%     -52.1%      71137 ±  2%  cpuidle.C1.usage
 2.471e+08 ± 29%     -98.4%    3942202        cpuidle.C1E.time
   1802745 ± 64%     -98.7%      24177        cpuidle.C1E.usage
 1.408e+08 ± 59%    +698.7%  1.124e+09 ±  4%  cpuidle.C3.time
    167633 ± 52%   +1466.7%    2626385 ±  2%  cpuidle.C3.usage
 1.971e+09 ±  4%     -34.8%  1.286e+09 ±  2%  cpuidle.C6.time
   2098510 ±  4%      +8.6%    2278392 ±  5%  cpuidle.C6.usage
  38191909 ±140%     -99.9%      21778 ±  6%  cpuidle.POLL.time
    617067 ±140%     -99.7%       2139        cpuidle.POLL.usage
    145146 ± 66%     -53.1%      68095        turbostat.C1
      0.12 ± 99%       -0.1       0.02        turbostat.C1%
   1802589 ± 64%     -98.7%      24071        turbostat.C1E
      3.40 ± 29%       -3.4       0.05        turbostat.C1E%
    167399 ± 52%   +1468.9%    2626260 ±  2%  turbostat.C3
      1.93 ± 59%      +13.5      15.45 ±  4%  turbostat.C3%
   2097688 ±  4%      +8.6%    2277836 ±  5%  turbostat.C6
     27.12 ±  4%       -9.5      17.66 ±  2%  turbostat.C6%
      0.01 ± 81%   +1550.0%       0.17 ± 69%  turbostat.CPU%c3
      0.39 ± 50%     -52.2%       0.18 ± 24%  turbostat.CPU%c6
 7.908e+11            -5.5%  7.469e+11        perf-stat.branch-instructions
      5.21 ±  2%       -2.3       2.89        perf-stat.branch-miss-rate%
  4.12e+10           -47.5%  2.162e+10        perf-stat.branch-misses
  40984991 ±  3%      -5.2%   38849817 ±  2%  perf-stat.cache-misses
      0.01 ± 23%       -0.0       0.00 ±  8%  perf-stat.dTLB-load-miss-rate%
 1.229e+08 ± 23%     -76.8%   28492076 ±  7%  perf-stat.dTLB-load-misses
      0.00 ± 19%       -0.0       0.00 ± 12%  perf-stat.dTLB-store-miss-rate%
   8136287 ± 19%     -38.6%    4992992 ± 11%  perf-stat.dTLB-store-misses
 9.911e+11            -2.1%  9.703e+11        perf-stat.dTLB-stores
      2.92 ± 22%      +90.8      93.67        perf-stat.iTLB-load-miss-rate%
 9.176e+08 ± 24%   +2336.6%  2.236e+10 ±  2%  perf-stat.iTLB-load-misses
 3.041e+10           -95.0%  1.511e+09        perf-stat.iTLB-loads
      4858 ± 26%     -96.2%     185.77 ±  2%  perf-stat.instructions-per-iTLB-miss
      0.28            +0.4%       0.28        perf-stat.ipc
     94.74             -2.1      92.67        perf-stat.node-load-miss-rate%
    631159 ±  4%     +34.9%     851575        perf-stat.node-loads
     62949            -4.7%      59990        perf-stat.path-length
    421.22 ± 21%     +24.4%     524.13 ±  8%  sched_debug.cfs_rq:/.exec_clock.min
     40629 ±  2%     +19.9%      48710 ± 10%  sched_debug.cfs_rq:/.load.avg
     65283 ±  2%    +136.6%     154433 ± 51%  sched_debug.cfs_rq:/.load.max
     25268 ±  3%     +63.6%      41328 ± 35%  sched_debug.cfs_rq:/.load.stddev
    262.06 ± 27%     +17.0%     306.67 ± 19%  sched_debug.cfs_rq:/.load_avg.max
      9970 ± 17%     +14.2%      11388 ±  8%  sched_debug.cfs_rq:/.min_vruntime.min
      0.69 ±  2%     +10.7%       0.77 ±  5%  sched_debug.cfs_rq:/.nr_running.avg
     39579 ±  2%     +16.3%      46028 ± 10%  sched_debug.cfs_rq:/.runnable_weight.avg
     56834 ±  2%    +160.2%     147884 ± 55%  sched_debug.cfs_rq:/.runnable_weight.max
     23517 ±  3%     +68.5%      39632 ± 39%  sched_debug.cfs_rq:/.runnable_weight.stddev
    333.16 ± 23%     +34.7%     448.79 ± 14%  sched_debug.cfs_rq:/.util_est_enqueued.avg
    791.83 ± 23%     +25.7%     995.42 ±  2%  sched_debug.cfs_rq:/.util_est_enqueued.max
    306.76 ± 26%     +27.2%     390.29 ±  7%  sched_debug.cfs_rq:/.util_est_enqueued.stddev
    258487 ± 15%     -36.7%     163619 ± 27%  sched_debug.cpu.avg_idle.min
    175699 ±  5%     +17.7%     206786 ±  3%  sched_debug.cpu.avg_idle.stddev
      0.11 ±141%    +425.0%       0.58 ± 14%  sched_debug.cpu.cpu_load[2].min
      0.33 ±141%    +150.0%       0.83 ± 40%  sched_debug.cpu.cpu_load[4].min
     40053 ±  2%     +17.4%      47018 ±  9%  sched_debug.cpu.load.avg
     65371 ±  2%    +136.2%     154433 ± 51%  sched_debug.cpu.load.max
     25663 ±  2%     +64.5%      42219 ± 36%  sched_debug.cpu.load.stddev
      0.00 ±  6%      +7.5%       0.00 ±  7%  sched_debug.cpu.next_balance.stddev
      1645 ±  9%     +43.0%       2353 ±  7%  sched_debug.cpu.nr_load_updates.stddev
      0.69 ±  2%     +11.6%       0.77 ±  3%  sched_debug.cpu.nr_running.avg
      0.01           -75.0%       0.00 ±100%  sched_debug.cpu.nr_uninterruptible.avg
      3.86 ±  5%      +5.4%       4.06 ±  5%  sched_debug.cpu.nr_uninterruptible.stddev
      8404 ±  7%      +8.5%       9118 ±  6%  sched_debug.cpu.sched_count.avg
    195.00 ± 32%     +34.9%     263.00 ±  6%  sched_debug.cpu.ttwu_count.min
      7183 ± 79%     -62.8%       2669 ± 10%  sched_debug.cpu.ttwu_local.max
     30.00            -30.0       0.00        perf-profile.calltrace.cycles-pp.__entry_trampoline_start
     19.13             -3.4      15.74 ±  2%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe
     16.13             -2.6      13.57        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe
     11.06             -1.0      10.02        perf-profile.calltrace.cycles-pp.__x64_sys_getppid.do_syscall_64.entry_SYSCALL_64_after_hwframe
     16.31             +5.2      21.48 ± 15%  perf-profile.calltrace.cycles-pp.secondary_startup_64
     16.31             +5.2      21.48 ± 15%  perf-profile.calltrace.cycles-pp.start_secondary.secondary_startup_64
     16.31             +5.2      21.48 ± 15%  perf-profile.calltrace.cycles-pp.cpu_startup_entry.start_secondary.secondary_startup_64
     16.31             +5.2      21.48 ± 15%  perf-profile.calltrace.cycles-pp.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64
     16.25             +5.2      21.46 ± 15%  perf-profile.calltrace.cycles-pp.cpuidle_enter_state.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64
     15.74 ±  4%       +5.7      21.45 ± 15%  perf-profile.calltrace.cycles-pp.intel_idle.cpuidle_enter_state.do_idle.cpu_startup_entry.start_secondary
      0.00            +30.8      30.82 ±  4%  perf-profile.calltrace.cycles-pp.entry_SYSCALL_64
     30.13            -30.1       0.00        perf-profile.children.cycles-pp.__entry_trampoline_start
     19.25             -3.2      16.05 ±  2%  perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
     16.30             -2.5      13.78        perf-profile.children.cycles-pp.do_syscall_64
     10.57             -0.4      10.15        perf-profile.children.cycles-pp.__x64_sys_getppid
      0.46 ±  4%       -0.2       0.30 ±  3%  perf-profile.children.cycles-pp.apic_timer_interrupt
      0.42 ±  4%       -0.1       0.28 ±  3%  perf-profile.children.cycles-pp.smp_apic_timer_interrupt
      0.34 ±  6%       -0.1       0.24 ±  2%  perf-profile.children.cycles-pp.hrtimer_interrupt
      0.21 ±  4%       -0.1       0.16 ±  6%  perf-profile.children.cycles-pp.__hrtimer_run_queues
      0.17 ±  4%       -0.1       0.12 ±  8%  perf-profile.children.cycles-pp.tick_sched_timer
      0.14 ±  3%       -0.0       0.11 ±  4%  perf-profile.children.cycles-pp.tick_sched_handle
      0.14 ±  5%       -0.0       0.11 ±  4%  perf-profile.children.cycles-pp.update_process_times
      0.08 ±  6%       -0.0       0.06 ±  9%  perf-profile.children.cycles-pp.clockevents_program_event
      0.00             +0.7       0.66        perf-profile.children.cycles-pp.__x86_indirect_thunk_rax
     16.31             +5.2      21.48 ± 15%  perf-profile.children.cycles-pp.secondary_startup_64
     16.31             +5.2      21.48 ± 15%  perf-profile.children.cycles-pp.cpu_startup_entry
     16.31             +5.2      21.48 ± 15%  perf-profile.children.cycles-pp.do_idle
     16.31             +5.2      21.48 ± 15%  perf-profile.children.cycles-pp.start_secondary
     16.26             +5.2      21.46 ± 15%  perf-profile.children.cycles-pp.cpuidle_enter_state
     15.75 ±  4%       +5.7      21.45 ± 15%  perf-profile.children.cycles-pp.intel_idle
      0.00            +30.9      30.86 ±  4%  perf-profile.children.cycles-pp.entry_SYSCALL_64
     30.13            -30.1       0.00        perf-profile.self.cycles-pp.__entry_trampoline_start
      5.43 ±  2%       -2.3       3.17 ±  4%  perf-profile.self.cycles-pp.do_syscall_64
      3.04             -0.5       2.54 ±  6%  perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
      0.00             +0.6       0.60        perf-profile.self.cycles-pp.__x86_indirect_thunk_rax
     15.75 ±  4%       +5.7      21.45 ± 15%  perf-profile.self.cycles-pp.intel_idle
      0.00            +30.9      30.86 ±  4%  perf-profile.self.cycles-pp.entry_SYSCALL_64

Disclaimer:
Results have been estimated based on internal Intel analysis and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.

Thanks,
Rong Chen