From: kernel test robot <oliver.sang@intel.com>
To: Raghavendra K T <raghavendra.kt@amd.com>
Cc: <oe-lkp@lists.linux.dev>, <lkp@intel.com>,
	Bharata B Rao <bharata@amd.com>, <linux-kernel@vger.kernel.org>,
	<ying.huang@intel.com>, <feng.tang@intel.com>,
	<fengwei.yin@intel.com>, <aubrey.li@linux.intel.com>,
	<yu.c.chen@intel.com>, <linux-mm@kvack.org>,
	Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Mel Gorman <mgorman@suse.de>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	David Hildenbrand <david@redhat.com>, <rppt@kernel.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Aithal Srikanth <sraithal@amd.com>,
	"kernel test robot" <oliver.sang@intel.com>,
	Raghavendra K T <raghavendra.kt@amd.com>,
	Sapkal Swapnil <Swapnil.Sapkal@amd.com>,
	K Prateek Nayak <kprateek.nayak@amd.com>
Subject: Re: [RFC PATCH V1 5/6] sched/numa: Allow recently accessed VMAs to be scanned
Date: Sun, 10 Sep 2023 23:29:28 +0800	[thread overview]
Message-ID: <202309102311.84b42068-oliver.sang@intel.com> (raw)
In-Reply-To: <109ca1ea59b9dd6f2daf7b7fbc74e83ae074fbdf.1693287931.git.raghavendra.kt@amd.com>



Hello,

kernel test robot noticed a -33.6% improvement of autonuma-benchmark.numa02.seconds on:


commit: af46f3c9ca2d16485912f8b9c896ef48bbfe1388 ("[RFC PATCH V1 5/6] sched/numa: Allow recently accessed VMAs to be scanned")
url: https://github.com/intel-lab-lkp/linux/commits/Raghavendra-K-T/sched-numa-Move-up-the-access-pid-reset-logic/20230829-141007
base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git 2f88c8e802c8b128a155976631f4eb2ce4f3c805
patch link: https://lore.kernel.org/all/109ca1ea59b9dd6f2daf7b7fbc74e83ae074fbdf.1693287931.git.raghavendra.kt@amd.com/
patch subject: [RFC PATCH V1 5/6] sched/numa: Allow recently accessed VMAs to be scanned

testcase: autonuma-benchmark
test machine: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz (Ice Lake) with 128G memory
parameters:

	iterations: 4x
	test: numa01_THREAD_ALLOC
	cpufreq_governor: performance



Details are as below:
-------------------------------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20230910/202309102311.84b42068-oliver.sang@intel.com

=========================================================================================
compiler/cpufreq_governor/iterations/kconfig/rootfs/tbox_group/test/testcase:
  gcc-12/performance/4x/x86_64-rhel-8.3/debian-11.1-x86_64-20220510.cgz/lkp-icl-2sp6/numa01_THREAD_ALLOC/autonuma-benchmark

commit: 
  167773d1dd ("sched/numa: Increase tasks' access history")
  af46f3c9ca ("sched/numa: Allow recently accessed VMAs to be scanned")

167773d1ddb5ffdd af46f3c9ca2d16485912f8b9c89 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
 2.534e+10 ± 10%     -13.0%  2.204e+10 ±  7%  cpuidle..time
  26431366 ± 10%     -13.2%   22948978 ±  7%  cpuidle..usage
      0.15 ±  4%      -0.0        0.12 ±  3%  mpstat.cpu.all.soft%
      2.92 ±  3%      +0.4        3.32 ±  4%  mpstat.cpu.all.sys%
      2243 ±  2%     -12.7%       1957 ±  3%  uptime.boot
     29811 ±  8%     -11.1%      26507 ±  6%  uptime.idle
      5.32 ± 79%     -64.2%       1.91 ± 60%  perf-sched.sch_delay.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_exc_page_fault
      2.70 ± 18%     +37.8%       3.72 ±  9%  perf-sched.sch_delay.max.ms.schedule_hrtimeout_range_clock.do_select.core_sys_select.kern_select
      0.64 ±137%  +26644.2%     169.91 ±220%  perf-sched.wait_time.avg.ms.__cond_resched.task_work_run.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode
      0.08 ± 20%      +0.0        0.12 ± 10%  perf-profile.children.cycles-pp.terminate_walk
      0.10 ± 25%      +0.0        0.14 ± 10%  perf-profile.children.cycles-pp.wake_up_q
      0.06 ± 50%      +0.0        0.10 ± 10%  perf-profile.children.cycles-pp.vfs_readlink
      0.15 ± 36%      +0.1        0.22 ± 13%  perf-profile.children.cycles-pp.readlink
      1.31 ± 19%      +0.4        1.69 ± 12%  perf-profile.children.cycles-pp.unmap_vmas
      2.46 ± 19%      +0.5        2.99 ±  4%  perf-profile.children.cycles-pp.exit_mmap
    311653 ± 10%     -23.7%     237884 ±  9%  turbostat.C1E
  26018024 ± 10%     -13.1%   22597563 ±  7%  turbostat.C6
      6.41 ±  9%     -13.6%       5.54 ±  8%  turbostat.CPU%c1
      2.47 ± 11%     +36.0%       3.36 ±  6%  turbostat.CPU%c6
 2.881e+08 ±  2%     -12.8%  2.513e+08 ±  3%  turbostat.IRQ
    212.86            +2.8%     218.84        turbostat.RAMWatt
    341.49            -4.1%     327.42 ±  2%  autonuma-benchmark.numa01.seconds
    186.67 ±  6%     -27.1%     136.12 ±  7%  autonuma-benchmark.numa01_THREAD_ALLOC.seconds
     21.17 ±  7%     -33.6%      14.05        autonuma-benchmark.numa02.seconds
      2200 ±  2%     -13.0%       1913 ±  3%  autonuma-benchmark.time.elapsed_time
      2200 ±  2%     -13.0%       1913 ±  3%  autonuma-benchmark.time.elapsed_time.max
   1159380 ±  2%     -12.0%    1019969 ±  3%  autonuma-benchmark.time.involuntary_context_switches
   3363550            -5.0%    3194802        autonuma-benchmark.time.minor_page_faults
    243046 ±  2%     -13.3%     210725 ±  3%  autonuma-benchmark.time.user_time
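The headline -33.6% figure is the relative change in mean numa02 runtime between the two commits. As a quick sanity check (assuming, as is usual for these lkp tables, that the two columns are per-commit mean values), it can be recomputed from the table:

```python
def pct_change(base, new):
    """Relative change of `new` vs. `base`, in percent (negative = faster)."""
    return (new - base) / base * 100.0

# autonuma-benchmark.numa02.seconds: base commit 167773d1dd vs. patched af46f3c9ca
base, patched = 21.17, 14.05
print(f"{pct_change(base, patched):+.1f}%")  # matches the reported -33.6%
```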
   7494239            -6.8%    6984234        proc-vmstat.numa_hit
    118829 ±  6%     +13.7%     135136 ±  6%  proc-vmstat.numa_huge_pte_updates
   6207618            -8.4%    5686795 ±  2%  proc-vmstat.numa_local
   8834573 ±  3%     +20.2%   10616944 ±  4%  proc-vmstat.numa_pages_migrated
  61094857 ±  6%     +13.6%   69409875 ±  6%  proc-vmstat.numa_pte_updates
   8602789            -9.0%    7827793 ±  2%  proc-vmstat.pgfault
   8834573 ±  3%     +20.2%   10616944 ±  4%  proc-vmstat.pgmigrate_success
    371818           -10.1%     334391 ±  2%  proc-vmstat.pgreuse
     17200 ±  3%     +20.3%      20686 ±  4%  proc-vmstat.thp_migration_success
  16401792 ±  2%     -12.7%   14322816 ±  3%  proc-vmstat.unevictable_pgs_scanned
 1.606e+08 ±  2%     -13.8%  1.385e+08 ±  3%  sched_debug.cfs_rq:/.avg_vruntime.avg
 1.666e+08 ±  2%     -14.0%  1.433e+08 ±  3%  sched_debug.cfs_rq:/.avg_vruntime.max
 1.364e+08 ±  2%     -11.7%  1.204e+08 ±  3%  sched_debug.cfs_rq:/.avg_vruntime.min
   4795327 ±  7%     -17.5%    3956991 ±  7%  sched_debug.cfs_rq:/.avg_vruntime.stddev
 1.606e+08 ±  2%     -13.8%  1.385e+08 ±  3%  sched_debug.cfs_rq:/.min_vruntime.avg
 1.666e+08 ±  2%     -14.0%  1.433e+08 ±  3%  sched_debug.cfs_rq:/.min_vruntime.max
 1.364e+08 ±  2%     -11.7%  1.204e+08 ±  3%  sched_debug.cfs_rq:/.min_vruntime.min
   4795327 ±  7%     -17.5%    3956991 ±  7%  sched_debug.cfs_rq:/.min_vruntime.stddev
    364.96 ±  6%     +16.6%     425.70 ±  5%  sched_debug.cfs_rq:/.util_est_enqueued.avg
   1099114           -13.0%     956021 ±  2%  sched_debug.cpu.clock.avg
   1099477           -13.0%     956344 ±  2%  sched_debug.cpu.clock.max
   1098702           -13.0%     955643 ±  2%  sched_debug.cpu.clock.min
   1080712           -13.0%     940415 ±  2%  sched_debug.cpu.clock_task.avg
   1085309           -13.1%     943557 ±  2%  sched_debug.cpu.clock_task.max
   1064613           -13.0%     925993 ±  2%  sched_debug.cpu.clock_task.min
     28890 ±  3%     -11.7%      25504 ±  3%  sched_debug.cpu.curr->pid.avg
     35200           -11.0%      31344        sched_debug.cpu.curr->pid.max
    862245 ±  3%      -8.7%     786984        sched_debug.cpu.max_idle_balance_cost.max
     74019 ±  9%     -28.2%      53158 ±  7%  sched_debug.cpu.max_idle_balance_cost.stddev
     15507           -11.9%      13667 ±  2%  sched_debug.cpu.nr_switches.avg
     57616 ±  6%     -19.0%      46642 ±  8%  sched_debug.cpu.nr_switches.max
      8460 ±  6%     -12.9%       7368 ±  5%  sched_debug.cpu.nr_switches.stddev
   1098689           -13.0%     955631 ±  2%  sched_debug.cpu_clk
   1097964           -13.0%     954907 ±  2%  sched_debug.ktime
      0.00           +15.0%       0.00 ±  2%  sched_debug.rt_rq:.rt_nr_migratory.avg
      0.03           +15.0%       0.03 ±  2%  sched_debug.rt_rq:.rt_nr_migratory.max
      0.00           +15.0%       0.00 ±  2%  sched_debug.rt_rq:.rt_nr_migratory.stddev
      0.00           +15.0%       0.00 ±  2%  sched_debug.rt_rq:.rt_nr_running.avg
      0.03           +15.0%       0.03 ±  2%  sched_debug.rt_rq:.rt_nr_running.max
      0.00           +15.0%       0.00 ±  2%  sched_debug.rt_rq:.rt_nr_running.stddev
   1099511           -13.0%     956501 ±  2%  sched_debug.sched_clk
      1162 ±  2%     +15.2%       1339 ±  3%  perf-stat.i.MPKI
 1.656e+08            +3.6%  1.716e+08        perf-stat.i.branch-instructions
      0.95 ±  4%      +0.1        1.03        perf-stat.i.branch-miss-rate%
   1538367 ±  6%     +11.0%    1707146 ±  2%  perf-stat.i.branch-misses
 6.327e+08 ±  3%     +18.7%  7.513e+08 ±  4%  perf-stat.i.cache-misses
 8.282e+08 ±  2%     +15.2%  9.542e+08 ±  3%  perf-stat.i.cache-references
    658.12 ±  3%     -11.4%     582.98 ±  6%  perf-stat.i.cycles-between-cache-misses
 2.201e+08            +2.8%  2.263e+08        perf-stat.i.dTLB-loads
    579771            +0.9%     584915        perf-stat.i.dTLB-store-misses
 1.122e+08            +1.4%  1.138e+08        perf-stat.i.dTLB-stores
 8.278e+08            +3.1%  8.538e+08        perf-stat.i.instructions
     13.98 ±  2%     +14.3%      15.98 ±  3%  perf-stat.i.metric.M/sec
      3797            +4.3%       3958        perf-stat.i.minor-faults
    258749            +8.0%     279391 ±  2%  perf-stat.i.node-load-misses
    261169 ±  2%      +7.4%     280417 ±  5%  perf-stat.i.node-loads
     40.91 ±  3%      -3.0       37.89 ±  3%  perf-stat.i.node-store-miss-rate%
 3.841e+08 ±  6%     +27.6%  4.902e+08 ±  7%  perf-stat.i.node-stores
      3797            +4.3%       3958        perf-stat.i.page-faults
    998.24 ±  2%     +11.8%       1116 ±  2%  perf-stat.overall.MPKI
    463.91            -3.2%     448.99        perf-stat.overall.cpi
    604.23 ±  3%     -15.9%     508.08 ±  4%  perf-stat.overall.cycles-between-cache-misses
      0.00            +3.3%       0.00        perf-stat.overall.ipc
     39.20 ±  5%      -4.5       34.70 ±  6%  perf-stat.overall.node-store-miss-rate%
 1.636e+08            +3.8%  1.698e+08        perf-stat.ps.branch-instructions
   1499760 ±  6%     +11.1%    1665855 ±  2%  perf-stat.ps.branch-misses
 6.296e+08 ±  3%     +19.0%  7.489e+08 ±  4%  perf-stat.ps.cache-misses
 8.178e+08 ±  2%     +15.5%  9.447e+08 ±  3%  perf-stat.ps.cache-references
  2.18e+08            +2.9%  2.244e+08        perf-stat.ps.dTLB-loads
    578148            +0.9%     583328        perf-stat.ps.dTLB-store-misses
 1.117e+08            +1.4%  1.132e+08        perf-stat.ps.dTLB-stores
 8.192e+08            +3.3%   8.46e+08        perf-stat.ps.instructions
      3744            +4.3%       3906        perf-stat.ps.minor-faults
    255974            +8.2%     276924 ±  2%  perf-stat.ps.node-load-misses
    263796 ±  2%      +7.7%     284110 ±  5%  perf-stat.ps.node-loads
  3.82e+08 ±  6%     +27.7%  4.879e+08 ±  7%  perf-stat.ps.node-stores
      3744            +4.3%       3906        perf-stat.ps.page-faults
 1.805e+12 ±  2%     -10.1%  1.622e+12 ±  2%  perf-stat.total.instructions
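The derived overall.MPKI row is consistent with cache references per thousand instructions computed from the per-second (ps.*) counters above; that derivation is an assumption about how lkp builds the metric, but the arithmetic checks out for the base-commit column:

```python
# Cross-check of the derived overall.MPKI value (base-commit column),
# assuming it is cache references per thousand instructions.
cache_references = 8.178e8  # perf-stat.ps.cache-references
instructions     = 8.192e8  # perf-stat.ps.instructions

mpki = cache_references / instructions * 1000
print(f"{mpki:.1f}")  # close to the reported 998.24 overall.MPKI
```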




Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


Thread overview: 25+ messages
2023-08-29  6:06 [RFC PATCH V1 0/6] sched/numa: Enhance disjoint VMA scanning Raghavendra K T
2023-08-29  6:06 ` [RFC PATCH V1 1/6] sched/numa: Move up the access pid reset logic Raghavendra K T
2023-08-29  6:06 ` [RFC PATCH V1 2/6] sched/numa: Add disjoint vma unconditional scan logic Raghavendra K T
2023-09-12  7:50   ` kernel test robot
2023-09-13  6:21     ` Raghavendra K T
2023-08-29  6:06 ` [RFC PATCH V1 3/6] sched/numa: Remove unconditional scan logic using mm numa_scan_seq Raghavendra K T
2023-08-29  6:06 ` [RFC PATCH V1 4/6] sched/numa: Increase tasks' access history Raghavendra K T
2023-09-12 14:24   ` kernel test robot
2023-09-13  6:15     ` Raghavendra K T
2023-09-13  7:34       ` Oliver Sang
2023-08-29  6:06 ` [RFC PATCH V1 5/6] sched/numa: Allow recently accessed VMAs to be scanned Raghavendra K T
2023-09-10 15:29   ` kernel test robot [this message]
2023-09-11 11:25     ` Raghavendra K T
2023-09-12  2:22       ` Oliver Sang
2023-09-12  6:43         ` Raghavendra K T
2023-08-29  6:06 ` [RFC PATCH V1 6/6] sched/numa: Allow scanning of shared VMAs Raghavendra K T
2023-09-13  5:28 ` [RFC PATCH V1 0/6] sched/numa: Enhance disjoint VMA scanning Swapnil Sapkal
2023-09-13  6:24   ` Raghavendra K T
2023-09-19  6:30 ` Raghavendra K T
2023-09-19  7:15   ` Ingo Molnar
2023-09-19  8:06     ` Raghavendra K T
2023-09-19  9:28 ` Peter Zijlstra
2023-09-19 16:22   ` Mel Gorman
2023-09-19 19:11     ` Peter Zijlstra
2023-09-20 10:42     ` Raghavendra K T
