* [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression
@ 2021-05-25  3:16 kernel test robot
  2021-05-25  3:11 ` Linus Torvalds
  0 siblings, 1 reply; 13+ messages in thread

From: kernel test robot @ 2021-05-25 3:16 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Linus Torvalds, John Hubbard, Jan Kara, Peter Xu, Andrea Arcangeli,
	Aneesh Kumar K.V, Christoph Hellwig, Hugh Dickins, Jann Horn,
	Kirill Shutemov, Kirill Tkhai, Leon Romanovsky, Michal Hocko,
	Oleg Nesterov, Andrew Morton, LKML, lkp, lkp, ying.huang,
	feng.tang, zhengjun.xing

[-- Attachment #1: Type: text/plain, Size: 27230 bytes --]

Greetings,

FYI, we noticed a -9.2% regression of will-it-scale.per_thread_ops due to commit:

commit: 57efa1fe5957694fa541c9062de0a127f0b9acb0 ("mm/gup: prevent gup_fast from racing with COW during fork")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

in testcase: will-it-scale
on test machine: 96 threads 2 sockets Ice Lake with 256G memory
with following parameters:

	nr_task: 50%
	mode: thread
	test: mmap1
	cpufreq_governor: performance
	ucode: 0xb000280

test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process-based and a thread-based test in order to see any differences between the two.
test-url: https://github.com/antonblanchard/will-it-scale

In addition to that, the commit also has significant impact on the following tests:

+------------------+---------------------------------------------------------------------------------+
| testcase: change | will-it-scale: will-it-scale.per_thread_ops 3.7% improvement                    |
| test machine     | 88 threads 2 sockets Intel(R) Xeon(R) Gold 6238M CPU @ 2.10GHz with 128G memory |
| test parameters  | cpufreq_governor=performance                                                    |
|                  | mode=thread                                                                     |
|                  | nr_task=50%                                                                     |
|                  | test=mmap1                                                                      |
|                  | ucode=0x5003006                                                                 |
+------------------+---------------------------------------------------------------------------------+

If you fix the issue, kindly add the following tag:
Reported-by: kernel test robot <oliver.sang@intel.com>

Details are as below:
-------------------------------------------------------------------------------------------------->

To reproduce:

        git clone https://github.com/intel/lkp-tests.git
        cd lkp-tests
        bin/lkp install job.yaml                 # job file is attached in this email
        bin/lkp split-job --compatible job.yaml  # generate the yaml file for lkp run
        bin/lkp run generated-yaml-file

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase/ucode:
  gcc-9/performance/x86_64-rhel-8.3/thread/50%/debian-10.4-x86_64-20200603.cgz/lkp-icl-2sp1/mmap1/will-it-scale/0xb000280

commit:
  c28b1fc703 ("mm/gup: reorganize internal_get_user_pages_fast()")
  57efa1fe59 ("mm/gup: prevent gup_fast from racing with COW during fork")

c28b1fc70390df32 57efa1fe5957694fa541c9062de
---------------- ---------------------------
         %stddev     %change         %stddev
             \          |                \
    342141            -9.2%     310805 ± 2%   will-it-scale.48.threads
      7127            -9.2%       6474 ± 2%   will-it-scale.per_thread_ops
    342141            -9.2%     310805 ± 2%   will-it-scale.workload
   2555927 ± 3%      +45.8%    3727702        meminfo.Committed_AS
     12108 ± 13%     -36.7%       7665 ± 7%   vmstat.system.cs
   1142492 ± 30%     -47.3%     602364 ± 11%  cpuidle.C1.usage
    282373 ± 13%     -45.6%     153684 ± 7%   cpuidle.POLL.usage
     48437 ± 3%       -5.9%      45563        proc-vmstat.nr_active_anon
     54617 ± 3%       -5.5%      51602        proc-vmstat.nr_shmem
     48437 ± 3%       -5.9%      45563        proc-vmstat.nr_zone_active_anon
     70511 ± 3%       -5.1%      66942 ± 2%   proc-vmstat.pgactivate
    278653 ± 8%      +23.4%     343904 ± 4%   sched_debug.cpu.avg_idle.stddev
     22572 ± 16%     -36.3%      14378 ± 4%   sched_debug.cpu.nr_switches.avg
     66177 ± 16%     -36.8%      41800 ± 21%  sched_debug.cpu.nr_switches.max
     11613 ± 15%     -41.4%       6810 ± 23%  sched_debug.cpu.nr_switches.stddev
     22.96 ± 15%     +55.6%      35.73 ± 12%  perf-sched.total_wait_and_delay.average.ms
     69713 ± 19%     -38.0%      43235 ± 12%  perf-sched.total_wait_and_delay.count.ms
     22.95 ± 15%     +55.6%      35.72 ± 12%  perf-sched.total_wait_time.average.ms
     29397 ± 23%     -35.3%      19030 ± 17%  perf-sched.wait_and_delay.count.rwsem_down_write_slowpath.down_write_killable.__vm_munmap.__x64_sys_munmap
     31964 ± 20%     -50.8%      15738 ± 14%  perf-sched.wait_and_delay.count.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff
      4.59 ± 6%      +12.2%       5.15 ± 4%   perf-stat.i.MPKI
 3.105e+09            -2.1%   3.04e+09        perf-stat.i.branch-instructions
     12033 ± 12%     -36.8%       7600 ± 7%   perf-stat.i.context-switches
     10.06            +1.9%      10.25        perf-stat.i.cpi
 4.067e+09            -1.3%  4.016e+09        perf-stat.i.dTLB-loads
 4.521e+08            -5.1%  4.291e+08 ± 2%   perf-stat.i.dTLB-stores
 1.522e+10            -1.6%  1.497e+10        perf-stat.i.instructions
      0.10            -1.9%       0.10        perf-stat.i.ipc
      0.19 ± 8%      -22.8%       0.15 ± 5%   perf-stat.i.metric.K/sec
     80.30            -1.7%      78.93        perf-stat.i.metric.M/sec
    167270 ± 6%      -14.9%     142312 ± 11%  perf-stat.i.node-loads
     49.76            -1.6       48.11        perf-stat.i.node-store-miss-rate%
   3945152            +6.2%    4189006        perf-stat.i.node-stores
      4.59 ± 6%      +12.1%       5.15 ± 4%   perf-stat.overall.MPKI
     10.04            +1.8%      10.23        perf-stat.overall.cpi
      0.10            -1.8%       0.10        perf-stat.overall.ipc
     49.76            -1.6       48.12        perf-stat.overall.node-store-miss-rate%
  13400506            +8.2%   14504566        perf-stat.overall.path-length
 3.094e+09            -2.1%   3.03e+09        perf-stat.ps.branch-instructions
     12087 ± 13%     -36.9%       7622 ± 7%   perf-stat.ps.context-switches
 4.054e+09            -1.3%  4.002e+09        perf-stat.ps.dTLB-loads
 4.508e+08            -5.1%  4.278e+08 ± 2%   perf-stat.ps.dTLB-stores
 1.516e+10            -1.6%  1.492e+10        perf-stat.ps.instructions
   3932404            +6.2%    4175831        perf-stat.ps.node-stores
 4.584e+12            -1.7%  4.506e+12        perf-stat.total.instructions
    364038 ± 6%      -40.3%     217265 ± 9%   interrupts.CAL:Function_call_interrupts
      5382 ± 33%     -63.4%       1970 ± 35%  interrupts.CPU44.CAL:Function_call_interrupts
      6325 ± 19%     -58.1%       2650 ± 37%  interrupts.CPU47.CAL:Function_call_interrupts
     11699 ± 19%     -60.6%       4610 ± 23%  interrupts.CPU48.CAL:Function_call_interrupts
     94.20 ± 22%     -45.8%      51.09 ± 46%  interrupts.CPU48.TLB:TLB_shootdowns
      9223 ± 24%     -52.5%       4383 ± 28%  interrupts.CPU49.CAL:Function_call_interrupts
      9507 ± 24%     -57.5%       4040 ± 27%  interrupts.CPU50.CAL:Function_call_interrupts
      4530 ± 18%     -33.9%       2993 ± 28%  interrupts.CPU62.CAL:Function_call_interrupts
     82.00 ± 21%     -41.9%      47.64 ± 38%  interrupts.CPU63.TLB:TLB_shootdowns
      4167 ± 16%     -45.4%       2276 ± 22%  interrupts.CPU64.CAL:Function_call_interrupts
    135.20 ± 31%     -58.4%      56.27 ± 51%  interrupts.CPU64.TLB:TLB_shootdowns
      4155 ± 17%     -42.5%       2387 ± 27%  interrupts.CPU65.CAL:Function_call_interrupts
     95.00 ± 48%     -53.8%      43.91 ± 42%  interrupts.CPU65.TLB:TLB_shootdowns
      4122 ± 20%     -39.4%       2497 ± 29%  interrupts.CPU66.CAL:Function_call_interrupts
      3954 ± 14%     -41.4%       2318 ± 28%  interrupts.CPU67.CAL:Function_call_interrupts
      3802 ± 17%     -41.9%       2209 ± 17%  interrupts.CPU70.CAL:Function_call_interrupts
      3787 ± 11%     -48.2%       1961 ± 29%  interrupts.CPU71.CAL:Function_call_interrupts
      3580 ± 14%     -45.1%       1964 ± 19%  interrupts.CPU72.CAL:Function_call_interrupts
      3711 ± 20%     -51.3%       1806 ± 25%  interrupts.CPU73.CAL:Function_call_interrupts
      3494 ± 21%     -40.6%       2076 ± 21%  interrupts.CPU76.CAL:Function_call_interrupts
      3416 ± 21%     -45.2%       1873 ± 26%  interrupts.CPU77.CAL:Function_call_interrupts
      3047 ± 24%     -38.0%       1890 ± 18%  interrupts.CPU78.CAL:Function_call_interrupts
      3102 ± 28%     -41.8%       1805 ± 16%  interrupts.CPU80.CAL:Function_call_interrupts
      2811 ± 23%     -36.5%       1785 ± 22%  interrupts.CPU83.CAL:Function_call_interrupts
      2617 ± 17%     -30.7%       1814 ± 30%  interrupts.CPU84.CAL:Function_call_interrupts
      3322 ± 25%     -38.1%       2055 ± 29%  interrupts.CPU87.CAL:Function_call_interrupts
      2941 ± 12%     -39.2%       1787 ± 27%  interrupts.CPU93.CAL:Function_call_interrupts
     72.56           -19.7       52.82        perf-profile.calltrace.cycles-pp.__mmap
     72.52           -19.7       52.78        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__mmap
     72.48           -19.7       52.74        perf-profile.calltrace.cycles-pp.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe.__mmap
     72.49           -19.7       52.76        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__mmap
     72.47           -19.7       52.74        perf-profile.calltrace.cycles-pp.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe.__mmap
     71.74           -19.7       52.04        perf-profile.calltrace.cycles-pp.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
     71.63           -19.7       51.95        perf-profile.calltrace.cycles-pp.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64
     71.52           -19.6       51.88        perf-profile.calltrace.cycles-pp.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff
     70.12           -19.2       50.92        perf-profile.calltrace.cycles-pp.osq_lock.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff
      0.91 ± 2%       -0.2        0.70        perf-profile.calltrace.cycles-pp.rwsem_spin_on_owner.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff
      0.87 ± 2%       +0.1        0.95 ± 2%   perf-profile.calltrace.cycles-pp.__do_munmap.__vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe
      0.00            +0.6        0.63 ± 2%   perf-profile.calltrace.cycles-pp.rwsem_spin_on_owner.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.__vm_munmap
     24.24 ± 3%      +19.4       43.62        perf-profile.calltrace.cycles-pp.osq_lock.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.__vm_munmap
     24.72 ± 3%      +19.8       44.47        perf-profile.calltrace.cycles-pp.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.__vm_munmap.__x64_sys_munmap
     24.87 ± 3%      +19.8       44.62        perf-profile.calltrace.cycles-pp.down_write_killable.__vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe
     24.78 ± 3%      +19.8       44.54        perf-profile.calltrace.cycles-pp.rwsem_down_write_slowpath.down_write_killable.__vm_munmap.__x64_sys_munmap.do_syscall_64
     25.94 ± 3%      +19.8       45.73        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__munmap
     25.97 ± 3%      +19.8       45.77        perf-profile.calltrace.cycles-pp.__munmap
     25.90 ± 3%      +19.8       45.70        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__munmap
     25.88 ± 3%      +19.8       45.68        perf-profile.calltrace.cycles-pp.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe.__munmap
     25.87 ± 3%      +19.8       45.67        perf-profile.calltrace.cycles-pp.__vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe.__munmap
     72.57           -19.7       52.83        perf-profile.children.cycles-pp.__mmap
     72.48           -19.7       52.74        perf-profile.children.cycles-pp.ksys_mmap_pgoff
     72.48           -19.7       52.74        perf-profile.children.cycles-pp.vm_mmap_pgoff
      0.22 ± 5%       -0.1        0.14 ± 6%   perf-profile.children.cycles-pp.unmap_region
      0.08 ± 23%      -0.0        0.04 ± 61%  perf-profile.children.cycles-pp.__schedule
      0.06 ± 7%       -0.0        0.03 ± 75%  perf-profile.children.cycles-pp.perf_event_mmap
      0.12 ± 4%       -0.0        0.09 ± 5%   perf-profile.children.cycles-pp.up_write
      0.09 ± 7%       -0.0        0.06 ± 16%  perf-profile.children.cycles-pp.unmap_vmas
      0.10 ± 4%       -0.0        0.08 ± 3%   perf-profile.children.cycles-pp.up_read
      0.18 ± 2%       +0.0        0.20 ± 3%   perf-profile.children.cycles-pp.vm_area_dup
      0.18 ± 5%       +0.0        0.21 ± 2%   perf-profile.children.cycles-pp.vma_merge
      0.12 ± 4%       +0.0        0.14 ± 4%   perf-profile.children.cycles-pp.kmem_cache_free
      0.19 ± 6%       +0.0        0.23 ± 2%   perf-profile.children.cycles-pp.get_unmapped_area
      0.16 ± 6%       +0.0        0.20 ± 2%   perf-profile.children.cycles-pp.vm_unmapped_area
      0.17 ± 6%       +0.0        0.21 ± 2%   perf-profile.children.cycles-pp.arch_get_unmapped_area_topdown
      0.07 ± 10%      +0.1        0.14 ± 16%  perf-profile.children.cycles-pp.find_vma
      0.27 ± 4%       +0.1        0.35 ± 2%   perf-profile.children.cycles-pp.__vma_adjust
      0.35 ± 2%       +0.1        0.43 ± 3%   perf-profile.children.cycles-pp.__split_vma
      0.87 ± 2%       +0.1        0.95 ± 2%   perf-profile.children.cycles-pp.__do_munmap
      1.23            +0.1        1.33        perf-profile.children.cycles-pp.rwsem_spin_on_owner
     25.98 ± 3%      +19.8       45.78        perf-profile.children.cycles-pp.__munmap
     25.87 ± 3%      +19.8       45.68        perf-profile.children.cycles-pp.__vm_munmap
     25.88 ± 3%      +19.8       45.68        perf-profile.children.cycles-pp.__x64_sys_munmap
      0.53 ± 2%       -0.2        0.35 ± 3%   perf-profile.self.cycles-pp.rwsem_optimistic_spin
      0.08 ± 5%       -0.1        0.03 ± 75%  perf-profile.self.cycles-pp.do_mmap
      0.11 ± 6%       -0.0        0.09 ± 5%   perf-profile.self.cycles-pp.up_write
      0.19 ± 4%       -0.0        0.16 ± 5%   perf-profile.self.cycles-pp.down_write_killable
      0.05 ± 8%       +0.0        0.07 ± 8%   perf-profile.self.cycles-pp.downgrade_write
      0.11 ± 4%       +0.0        0.14 ± 4%   perf-profile.self.cycles-pp.__vma_adjust
      0.16 ± 6%       +0.0        0.20 ± 3%   perf-profile.self.cycles-pp.vm_unmapped_area
      0.05 ± 9%       +0.0        0.10 ± 13%  perf-profile.self.cycles-pp.find_vma
      1.21            +0.1        1.31        perf-profile.self.cycles-pp.rwsem_spin_on_owner

                            will-it-scale.per_thread_ops

  [ASCII trend plot; y-axis 6000-7400 ops. Layout was garbled in transit:
   "+" (bisect-good) samples cluster roughly in the 6800-7300 range,
   "O" (bisect-bad) samples roughly in the 6000-6700 range.]

 [*] bisect-good sample
 [O] bisect-bad sample

***************************************************************************************************
lkp-csl-2sp9: 88 threads 2 sockets Intel(R) Xeon(R) Gold 6238M CPU @ 2.10GHz with 128G memory
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase/ucode:
  gcc-9/performance/x86_64-rhel-8.3/thread/50%/debian-10.4-x86_64-20200603.cgz/lkp-csl-2sp9/mmap1/will-it-scale/0x5003006

commit:
  c28b1fc703 ("mm/gup: reorganize internal_get_user_pages_fast()")
  57efa1fe59 ("mm/gup: prevent gup_fast from racing with COW during fork")

c28b1fc70390df32 57efa1fe5957694fa541c9062de
---------------- ---------------------------
         %stddev     %change         %stddev
             \          |                \
    247840            +3.7%     257132 ± 2%   will-it-scale.44.threads
      5632            +3.7%       5843 ± 2%   will-it-scale.per_thread_ops
    247840            +3.7%     257132 ± 2%   will-it-scale.workload
      0.10 ± 5%       +0.0        0.13 ± 8%   perf-profile.children.cycles-pp.find_vma
     14925 ± 19%     -48.2%       7724 ± 8%   softirqs.CPU87.SCHED
      9950 ± 3%      -36.1%       6355 ± 2%   vmstat.system.cs
   3312916 ± 4%      +13.9%    3774536 ± 9%   cpuidle.C1.time
   1675504 ± 5%      -36.6%    1061625        cpuidle.POLL.time
    987055 ± 5%      -41.8%     574757 ± 2%   cpuidle.POLL.usage
    165545 ± 3%      -12.2%     145358 ± 4%   meminfo.Active
    165235 ± 3%      -12.1%     145188 ± 4%   meminfo.Active(anon)
    180757 ± 3%      -11.7%     159538 ± 3%   meminfo.Shmem
   2877001 ± 11%     +16.2%    3342948 ± 10%  sched_debug.cfs_rq:/.min_vruntime.avg
   5545708 ± 11%      +9.8%    6086941 ± 8%   sched_debug.cfs_rq:/.min_vruntime.max
   2773178 ± 11%     +15.4%    3199941 ± 9%   sched_debug.cfs_rq:/.spread0.avg
    733740 ± 3%      -12.0%     646033 ± 5%   sched_debug.cpu.avg_idle.avg
     17167 ± 10%     -28.2%      12332 ± 7%   sched_debug.cpu.nr_switches.avg
     49180 ± 14%     -33.5%      32687 ± 22%  sched_debug.cpu.nr_switches.max
      9311 ± 18%     -36.2%       5943 ± 22%  sched_debug.cpu.nr_switches.stddev
     41257 ± 3%      -12.1%      36252 ± 4%   proc-vmstat.nr_active_anon
    339681            -1.6%     334294        proc-vmstat.nr_file_pages
     10395            -3.5%      10036        proc-vmstat.nr_mapped
     45130 ± 3%      -11.7%      39848 ± 3%   proc-vmstat.nr_shmem
     41257 ± 3%      -12.1%      36252 ± 4%   proc-vmstat.nr_zone_active_anon
    841530            -1.7%     826917        proc-vmstat.numa_local
     21515 ± 11%     -68.9%       6684 ± 70%  proc-vmstat.numa_pages_migrated
     60224 ± 3%      -11.1%      53513 ± 3%   proc-vmstat.pgactivate
    981265            -2.5%     956415        proc-vmstat.pgalloc_normal
    895893            -1.9%     878978        proc-vmstat.pgfree
     21515 ± 11%     -68.9%       6684 ± 70%  proc-vmstat.pgmigrate_success
      0.07 ±135%     -74.1%       0.02 ± 5%   perf-sched.sch_delay.max.ms.preempt_schedule_common._cond_resched.stop_one_cpu.__set_cpus_allowed_ptr.sched_setaffinity
     21.44 ± 5%      +80.9%      38.79 ± 3%   perf-sched.total_wait_and_delay.average.ms
     67273 ± 6%      -44.9%      37095 ± 5%   perf-sched.total_wait_and_delay.count.ms
     21.44 ± 5%      +80.9%      38.79 ± 3%   perf-sched.total_wait_time.average.ms
      0.08 ± 14%     +60.1%       0.13 ± 9%   perf-sched.wait_and_delay.avg.ms.rwsem_down_write_slowpath.down_write_killable.__vm_munmap.__x64_sys_munmap
      0.09 ± 12%     +58.0%       0.15 ± 15%  perf-sched.wait_and_delay.avg.ms.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff
    255.38 ± 14%     +22.1%     311.71 ± 17%  perf-sched.wait_and_delay.avg.ms.schedule_hrtimeout_range_clock.poll_schedule_timeout.constprop.0.do_sys_poll
     31877 ± 10%     -54.2%      14606 ± 13%  perf-sched.wait_and_delay.count.rwsem_down_write_slowpath.down_write_killable.__vm_munmap.__x64_sys_munmap
     27110 ± 7%      -47.3%      14280 ± 4%   perf-sched.wait_and_delay.count.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff
    138.60 ± 13%     -21.4%     109.00 ± 15%  perf-sched.wait_and_delay.count.schedule_hrtimeout_range_clock.poll_schedule_timeout.constprop.0.do_sys_poll
      1.00 ±199%     -99.9%       0.00 ±200%  perf-sched.wait_time.avg.ms.preempt_schedule_common._cond_resched.remove_vma.__do_munmap.__vm_munmap
      0.08 ± 14%     +60.9%       0.13 ± 9%   perf-sched.wait_time.avg.ms.rwsem_down_write_slowpath.down_write_killable.__vm_munmap.__x64_sys_munmap
      0.09 ± 12%     +58.2%       0.15 ± 15%  perf-sched.wait_time.avg.ms.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff
    255.38 ± 14%     +22.1%     311.71 ± 17%  perf-sched.wait_time.avg.ms.schedule_hrtimeout_range_clock.poll_schedule_timeout.constprop.0.do_sys_poll
      1.00 ±199%     -99.9%       0.00 ±200%  perf-sched.wait_time.max.ms.preempt_schedule_common._cond_resched.remove_vma.__do_munmap.__vm_munmap
      4.99           -36.1%       3.19 ± 36%  perf-sched.wait_time.max.ms.rcu_gp_kthread.kthread.ret_from_fork
      9869 ± 3%      -36.2%       6295 ± 2%   perf-stat.i.context-switches
      0.00 ± 7%       +0.0        0.00 ± 29%  perf-stat.i.dTLB-load-miss-rate%
     76953 ± 7%     +327.4%     328871 ± 29%  perf-stat.i.dTLB-load-misses
   4152320            -3.0%    4026365        perf-stat.i.iTLB-load-misses
   1665297            -2.2%    1628746        perf-stat.i.iTLB-loads
      8627            +3.5%       8933        perf-stat.i.instructions-per-iTLB-miss
      0.33 ± 3%      -11.0%       0.29 ± 6%   perf-stat.i.metric.K/sec
     87.42            +1.7       89.11        perf-stat.i.node-load-miss-rate%
   7507752            -9.2%    6814138        perf-stat.i.node-load-misses
   1078418 ± 2%      -22.9%     831563 ± 3%   perf-stat.i.node-loads
   3091445            -8.2%    2838247        perf-stat.i.node-store-misses
      0.00 ± 7%       +0.0        0.00 ± 29%  perf-stat.overall.dTLB-load-miss-rate%
      8599            +3.6%       8907        perf-stat.overall.instructions-per-iTLB-miss
     87.44            +1.7       89.13        perf-stat.overall.node-load-miss-rate%
  43415811            -3.3%   41994695 ± 2%   perf-stat.overall.path-length
      9895 ± 3%      -36.4%       6291 ± 2%   perf-stat.ps.context-switches
     76756 ± 7%     +327.0%     327716 ± 29%  perf-stat.ps.dTLB-load-misses
   4138410            -3.0%    4012712        perf-stat.ps.iTLB-load-misses
   1659653            -2.2%    1623167        perf-stat.ps.iTLB-loads
   7483002            -9.2%    6791226        perf-stat.ps.node-load-misses
   1074856 ± 2%      -22.9%     828780 ± 3%   perf-stat.ps.node-loads
   3081222            -8.2%    2828732        perf-stat.ps.node-store-misses
    335021 ± 2%      -27.9%     241715 ± 12%  interrupts.CAL:Function_call_interrupts
      3662 ± 31%     -61.3%       1417 ± 16%  interrupts.CPU10.CAL:Function_call_interrupts
      4671 ± 32%     -65.6%       1607 ± 30%  interrupts.CPU12.CAL:Function_call_interrupts
      4999 ± 34%     -68.1%       1592 ± 43%  interrupts.CPU14.CAL:Function_call_interrupts
    129.00 ± 30%     -46.8%      68.60 ± 34%  interrupts.CPU14.RES:Rescheduling_interrupts
      4531 ± 49%     -58.5%       1881 ± 39%  interrupts.CPU15.CAL:Function_call_interrupts
      4639 ± 28%     -37.6%       2893 ± 2%   interrupts.CPU18.NMI:Non-maskable_interrupts
      4639 ± 28%     -37.6%       2893 ± 2%   interrupts.CPU18.PMI:Performance_monitoring_interrupts
      6310 ± 49%     -68.5%       1988 ± 57%  interrupts.CPU21.CAL:Function_call_interrupts
    149.40 ± 49%     -49.3%      75.80 ± 42%  interrupts.CPU21.RES:Rescheduling_interrupts
      3592 ± 38%     -63.0%       1330 ± 14%  interrupts.CPU24.CAL:Function_call_interrupts
      5350 ± 21%     -30.5%       3720 ± 44%  interrupts.CPU24.NMI:Non-maskable_interrupts
      5350 ± 21%     -30.5%       3720 ± 44%  interrupts.CPU24.PMI:Performance_monitoring_interrupts
    139.00 ± 27%     -33.4%      92.60 ± 26%  interrupts.CPU24.RES:Rescheduling_interrupts
      3858 ± 42%     -53.7%       1785 ± 38%  interrupts.CPU26.CAL:Function_call_interrupts
      5964 ± 28%     -42.4%       3432 ± 55%  interrupts.CPU27.NMI:Non-maskable_interrupts
      5964 ± 28%     -42.4%       3432 ± 55%  interrupts.CPU27.PMI:Performance_monitoring_interrupts
      3429 ± 37%     -57.1%       1470 ± 44%  interrupts.CPU28.CAL:Function_call_interrupts
      3008 ± 35%     -37.6%       1877 ± 38%  interrupts.CPU29.CAL:Function_call_interrupts
      4684 ± 73%     -60.0%       1872 ± 34%  interrupts.CPU30.CAL:Function_call_interrupts
      4300 ± 46%     -54.7%       1949 ± 13%  interrupts.CPU43.CAL:Function_call_interrupts
     10255 ± 26%     -50.0%       5127 ± 29%  interrupts.CPU44.CAL:Function_call_interrupts
      5800 ± 20%     -28.3%       4158 ± 27%  interrupts.CPU52.CAL:Function_call_interrupts
      4802 ± 19%     -31.7%       3279 ± 18%  interrupts.CPU58.CAL:Function_call_interrupts
      4042 ± 32%     -65.6%       1391 ± 41%  interrupts.CPU6.CAL:Function_call_interrupts
    128.60 ± 31%     -52.9%      60.60 ± 38%  interrupts.CPU6.RES:Rescheduling_interrupts
      4065 ± 20%     -37.8%       2530 ± 6%   interrupts.CPU63.CAL:Function_call_interrupts
      4340 ± 24%     -36.2%       2771 ± 11%  interrupts.CPU64.CAL:Function_call_interrupts
      3983 ± 11%     -27.1%       2904 ± 19%  interrupts.CPU65.CAL:Function_call_interrupts
      3392 ± 25%     -55.2%       1518 ± 53%  interrupts.CPU7.CAL:Function_call_interrupts
    171.80 ± 67%     -62.5%      64.40 ± 32%  interrupts.CPU7.RES:Rescheduling_interrupts
      2942 ± 33%     -50.5%       1455 ± 25%  interrupts.CPU8.CAL:Function_call_interrupts
      7818           -27.3%       5681 ± 31%  interrupts.CPU85.NMI:Non-maskable_interrupts
      7818           -27.3%       5681 ± 31%  interrupts.CPU85.PMI:Performance_monitoring_interrupts
    320.80 ± 54%     -44.6%     177.80 ± 58%  interrupts.CPU87.TLB:TLB_shootdowns
      3212 ± 31%     -64.8%       1130 ± 36%  interrupts.CPU9.CAL:Function_call_interrupts

Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.

---
0DAY/LKP+ Test Infrastructure                   Open Source Technology Center
https://lists.01.org/hyperkitty/list/lkp@lists.01.org       Intel Corporation

Thanks,
Oliver Sang

[-- Attachment #2: job-script --]
[-- Type: text/plain, Size: 7928 bytes --]

#!/bin/sh

export_top_env()
{
	export suite='will-it-scale'
	export testcase='will-it-scale'
	export category='benchmark'
	export nr_task=48
	export job_origin='will-it-scale-part2.yaml'
	export queue_cmdline_keys=
	export queue='vip'
	export testbox='lkp-icl-2sp1'
	export tbox_group='lkp-icl-2sp1'
	export kconfig='x86_64-rhel-8.3'
	export submit_id='608271eb0b9a9366cf75aa8b'
	export job_file='/lkp/jobs/scheduled/lkp-icl-2sp1/will-it-scale-performance-thread-50%-mmap1-ucode=0xb000280-monitor=70d6d718-debian-10.4-x86_64-20200603.cgz-57efa1fe5957694fa541-20210423-26319-1fupkzh-3.yaml'
	export id='2d333079611b73199587392d819fd36dc870581a'
	export queuer_version='/lkp/xsang/.src-20210423-103236'
	export model='Ice Lake'
	export nr_node=2
	export nr_cpu=96
	export memory='256G'
	export nr_hdd_partitions=1
	export hdd_partitions='/dev/disk/by-id/ata-ST9500530NS_9SP1KLAR-part1'
	export ssd_partitions='/dev/nvme0n1p1'
	export swap_partitions=
	export kernel_cmdline_hw='acpi_rsdp=0x665fd014'
	export rootfs_partition='/dev/disk/by-id/ata-INTEL_SSDSC2BB800G4_PHWL4204005K800RGN-part3'
	export commit='57efa1fe5957694fa541c9062de0a127f0b9acb0'
	export ucode='0xb000280'
	export need_kconfig_hw='CONFIG_IGB=y
CONFIG_IXGBE=y
CONFIG_SATA_AHCI'
	export enqueue_time='2021-04-23 15:06:19 +0800'
	export _id='608271ef0b9a9366cf75aa8e'
	export _rt='/result/will-it-scale/performance-thread-50%-mmap1-ucode=0xb000280-monitor=70d6d718/lkp-icl-2sp1/debian-10.4-x86_64-20200603.cgz/x86_64-rhel-8.3/gcc-9/57efa1fe5957694fa541c9062de0a127f0b9acb0'
	export user='lkp'
	export compiler='gcc-9'
	export LKP_SERVER='internal-lkp-server'
	export head_commit='59d492ff832e57456a83d5652009434a44874a3e'
	export base_commit='f40ddce88593482919761f74910f42f4b84c004b'
	export branch='linus/master'
	export rootfs='debian-10.4-x86_64-20200603.cgz'
	export monitor_sha='70d6d718'
	export result_root='/result/will-it-scale/performance-thread-50%-mmap1-ucode=0xb000280-monitor=70d6d718/lkp-icl-2sp1/debian-10.4-x86_64-20200603.cgz/x86_64-rhel-8.3/gcc-9/57efa1fe5957694fa541c9062de0a127f0b9acb0/8'
	export scheduler_version='/lkp/lkp/.src-20210422-153727'
	export arch='x86_64'
	export max_uptime=2100
	export initrd='/osimage/debian/debian-10.4-x86_64-20200603.cgz'
	export bootloader_append='root=/dev/ram0 user=lkp job=/lkp/jobs/scheduled/lkp-icl-2sp1/will-it-scale-performance-thread-50%-mmap1-ucode=0xb000280-monitor=70d6d718-debian-10.4-x86_64-20200603.cgz-57efa1fe5957694fa541-20210423-26319-1fupkzh-3.yaml ARCH=x86_64 kconfig=x86_64-rhel-8.3 branch=linus/master commit=57efa1fe5957694fa541c9062de0a127f0b9acb0 BOOT_IMAGE=/pkg/linux/x86_64-rhel-8.3/gcc-9/57efa1fe5957694fa541c9062de0a127f0b9acb0/vmlinuz-5.10.0-00044-g57efa1fe5957 acpi_rsdp=0x665fd014 max_uptime=2100 RESULT_ROOT=/result/will-it-scale/performance-thread-50%-mmap1-ucode=0xb000280-monitor=70d6d718/lkp-icl-2sp1/debian-10.4-x86_64-20200603.cgz/x86_64-rhel-8.3/gcc-9/57efa1fe5957694fa541c9062de0a127f0b9acb0/8 LKP_SERVER=internal-lkp-server nokaslr selinux=0 debug apic=debug sysrq_always_enabled rcupdate.rcu_cpu_stall_timeout=100 net.ifnames=0 printk.devkmsg=on panic=-1 softlockup_panic=1 nmi_watchdog=panic oops=panic load_ramdisk=2 prompt_ramdisk=0 drbd.minor_count=8 systemd.log_level=err ignore_loglevel console=tty0 earlyprintk=ttyS0,115200 console=ttyS0,115200 vga=normal rw'
	export modules_initrd='/pkg/linux/x86_64-rhel-8.3/gcc-9/57efa1fe5957694fa541c9062de0a127f0b9acb0/modules.cgz'
	export bm_initrd='/osimage/deps/debian-10.4-x86_64-20200603.cgz/run-ipconfig_20200608.cgz,/osimage/deps/debian-10.4-x86_64-20200603.cgz/lkp_20201211.cgz,/osimage/deps/debian-10.4-x86_64-20200603.cgz/rsync-rootfs_20200608.cgz,/osimage/deps/debian-10.4-x86_64-20200603.cgz/will-it-scale_20210401.cgz,/osimage/pkg/debian-10.4-x86_64-20200603.cgz/will-it-scale-x86_64-a34a85c-1_20210401.cgz,/osimage/deps/debian-10.4-x86_64-20200603.cgz/mpstat_20200714.cgz,/osimage/deps/debian-10.4-x86_64-20200603.cgz/perf_20201126.cgz,/osimage/pkg/debian-10.4-x86_64-20200603.cgz/perf-x86_64-d19cc4bfbff1-1_20210401.cgz,/osimage/pkg/debian-10.4-x86_64-20200603.cgz/sar-x86_64-34c92ae-1_20200702.cgz,/osimage/deps/debian-10.4-x86_64-20200603.cgz/hw_20200715.cgz'
	export ucode_initrd='/osimage/ucode/intel-ucode-20201117.cgz'
	export lkp_initrd='/osimage/user/lkp/lkp-x86_64.cgz'
	export site='inn'
	export LKP_CGI_PORT=80
	export LKP_CIFS_PORT=139
	export last_kernel='5.10.0-00044-g57efa1fe5957'
	export good_samples='7084 6873 6971'
	export queue_at_least_once=1
	export kernel='/pkg/linux/x86_64-rhel-8.3/gcc-9/57efa1fe5957694fa541c9062de0a127f0b9acb0/vmlinuz-5.10.0-00044-g57efa1fe5957'
	export dequeue_time='2021-04-23 16:58:17 +0800'
	export job_initrd='/lkp/jobs/scheduled/lkp-icl-2sp1/will-it-scale-performance-thread-50%-mmap1-ucode=0xb000280-monitor=70d6d718-debian-10.4-x86_64-20200603.cgz-57efa1fe5957694fa541-20210423-26319-1fupkzh-3.cgz'

	[ -n "$LKP_SRC" ] ||
	export LKP_SRC=/lkp/${user:-lkp}/src
}

run_job()
{
	echo $$ > $TMP/run-job.pid

	. $LKP_SRC/lib/http.sh
	. $LKP_SRC/lib/job.sh
	. $LKP_SRC/lib/env.sh

	export_top_env

	run_setup $LKP_SRC/setup/cpufreq_governor 'performance'

	run_monitor $LKP_SRC/monitors/wrapper kmsg
	run_monitor $LKP_SRC/monitors/no-stdout/wrapper boot-time
	run_monitor $LKP_SRC/monitors/wrapper uptime
	run_monitor $LKP_SRC/monitors/wrapper iostat
	run_monitor $LKP_SRC/monitors/wrapper heartbeat
	run_monitor $LKP_SRC/monitors/wrapper vmstat
	run_monitor $LKP_SRC/monitors/wrapper numa-numastat
	run_monitor $LKP_SRC/monitors/wrapper numa-vmstat
	run_monitor $LKP_SRC/monitors/wrapper numa-meminfo
	run_monitor $LKP_SRC/monitors/wrapper proc-vmstat
	run_monitor $LKP_SRC/monitors/wrapper proc-stat
	run_monitor $LKP_SRC/monitors/wrapper meminfo
	run_monitor $LKP_SRC/monitors/wrapper slabinfo
	run_monitor $LKP_SRC/monitors/wrapper interrupts
	run_monitor $LKP_SRC/monitors/wrapper lock_stat
	run_monitor lite_mode=1 $LKP_SRC/monitors/wrapper perf-sched
	run_monitor $LKP_SRC/monitors/wrapper softirqs
	run_monitor $LKP_SRC/monitors/one-shot/wrapper bdi_dev_mapping
	run_monitor $LKP_SRC/monitors/wrapper diskstats
	run_monitor $LKP_SRC/monitors/wrapper nfsstat
	run_monitor $LKP_SRC/monitors/wrapper cpuidle
	run_monitor $LKP_SRC/monitors/wrapper cpufreq-stats
	run_monitor $LKP_SRC/monitors/wrapper sched_debug
	run_monitor $LKP_SRC/monitors/wrapper perf-stat
	run_monitor $LKP_SRC/monitors/wrapper mpstat
	run_monitor $LKP_SRC/monitors/no-stdout/wrapper perf-profile
	run_monitor pmeter_server='lkp-nhm-dp2' pmeter_device='yokogawa-wt310' $LKP_SRC/monitors/wrapper pmeter
	run_monitor $LKP_SRC/monitors/wrapper oom-killer
	run_monitor $LKP_SRC/monitors/plain/watchdog

	run_test mode='thread' test='mmap1' $LKP_SRC/tests/wrapper will-it-scale
}

extract_stats()
{
	export stats_part_begin=
	export stats_part_end=

	env mode='thread' test='mmap1' $LKP_SRC/stats/wrapper will-it-scale
	$LKP_SRC/stats/wrapper kmsg
	$LKP_SRC/stats/wrapper boot-time
	$LKP_SRC/stats/wrapper uptime
	$LKP_SRC/stats/wrapper iostat
	$LKP_SRC/stats/wrapper vmstat
	$LKP_SRC/stats/wrapper numa-numastat
	$LKP_SRC/stats/wrapper numa-vmstat
	$LKP_SRC/stats/wrapper numa-meminfo
	$LKP_SRC/stats/wrapper proc-vmstat
	$LKP_SRC/stats/wrapper meminfo
	$LKP_SRC/stats/wrapper slabinfo
	$LKP_SRC/stats/wrapper interrupts
	$LKP_SRC/stats/wrapper lock_stat
	env lite_mode=1 $LKP_SRC/stats/wrapper perf-sched
	$LKP_SRC/stats/wrapper softirqs
	$LKP_SRC/stats/wrapper diskstats
	$LKP_SRC/stats/wrapper nfsstat
	$LKP_SRC/stats/wrapper cpuidle
	$LKP_SRC/stats/wrapper sched_debug
	$LKP_SRC/stats/wrapper perf-stat
	$LKP_SRC/stats/wrapper mpstat
	$LKP_SRC/stats/wrapper perf-profile
	env pmeter_server='lkp-nhm-dp2' pmeter_device='yokogawa-wt310' $LKP_SRC/stats/wrapper pmeter
	$LKP_SRC/stats/wrapper time will-it-scale.time
	$LKP_SRC/stats/wrapper dmesg
	$LKP_SRC/stats/wrapper kmsg
	$LKP_SRC/stats/wrapper last_state
	$LKP_SRC/stats/wrapper stderr
	$LKP_SRC/stats/wrapper time
}

"$@"

[-- Attachment #3: job.yaml --]
[-- Type: text/plain, Size: 5114 bytes --]

---
suite: will-it-scale
testcase: will-it-scale
category: benchmark
nr_task: 50%
will-it-scale:
  mode: thread
  test: mmap1
job_origin: will-it-scale-part2.yaml
queue_cmdline_keys:
- branch
- commit
- queue_at_least_once
queue: bisect
testbox: lkp-icl-2sp1
tbox_group: lkp-icl-2sp1
kconfig: x86_64-rhel-8.3
submit_id: 6039018663f28a9f5549bb6e
job_file: "/lkp/jobs/scheduled/lkp-icl-2sp1/will-it-scale-performance-thread-50%-mmap1-ucode=0xb000280-monitor=70d6d718-debian-10.4-x86_64-20200603.cgz-57efa1fe5957694fa541-20210226-40789-1bw58rx-2.yaml"
id: 568034105811dd2aa8af615c3d1cbd509191f301
queuer_version: "/lkp-src"
model: Ice Lake
nr_node: 2
nr_cpu: 96
memory: 256G
nr_hdd_partitions: 1
hdd_partitions:
"/dev/disk/by-id/ata-ST9500530NS_9SP1KLAR-part1"
ssd_partitions: "/dev/nvme0n1p1"
swap_partitions:
kernel_cmdline_hw: acpi_rsdp=0x665fd014
rootfs_partition: "/dev/disk/by-id/ata-INTEL_SSDSC2BB800G4_PHWL4204005K800RGN-part3"
kmsg:
boot-time:
uptime:
iostat:
heartbeat:
vmstat:
numa-numastat:
numa-vmstat:
numa-meminfo:
proc-vmstat:
proc-stat:
meminfo:
slabinfo:
interrupts:
lock_stat:
perf-sched:
  lite_mode: 1
softirqs:
bdi_dev_mapping:
diskstats:
nfsstat:
cpuidle:
cpufreq-stats:
sched_debug:
perf-stat:
mpstat:
perf-profile:
cpufreq_governor: performance
commit: 57efa1fe5957694fa541c9062de0a127f0b9acb0
ucode: '0xb000280'
need_kconfig_hw:
- CONFIG_IGB=y
- CONFIG_IXGBE=y
- CONFIG_SATA_AHCI
pmeter:
  pmeter_server: lkp-nhm-dp2
  pmeter_device: yokogawa-wt310
enqueue_time: 2021-02-26 22:11:18.355364388 +08:00
_id: 603906b363f28a9f5549bb70
_rt: "/result/will-it-scale/performance-thread-50%-mmap1-ucode=0xb000280-monitor=70d6d718/lkp-icl-2sp1/debian-10.4-x86_64-20200603.cgz/x86_64-rhel-8.3/gcc-9/57efa1fe5957694fa541c9062de0a127f0b9acb0"
user: lkp
compiler: gcc-9
LKP_SERVER: internal-lkp-server
head_commit: 59d492ff832e57456a83d5652009434a44874a3e
base_commit: f40ddce88593482919761f74910f42f4b84c004b
branch: linus/master
rootfs: debian-10.4-x86_64-20200603.cgz
monitor_sha: 70d6d718
result_root: "/result/will-it-scale/performance-thread-50%-mmap1-ucode=0xb000280-monitor=70d6d718/lkp-icl-2sp1/debian-10.4-x86_64-20200603.cgz/x86_64-rhel-8.3/gcc-9/57efa1fe5957694fa541c9062de0a127f0b9acb0/0"
scheduler_version: "/lkp/lkp/.src-20210226-170207"
arch: x86_64
max_uptime: 2100
initrd: "/osimage/debian/debian-10.4-x86_64-20200603.cgz"
bootloader_append:
- root=/dev/ram0
- user=lkp
- job=/lkp/jobs/scheduled/lkp-icl-2sp1/will-it-scale-performance-thread-50%-mmap1-ucode=0xb000280-monitor=70d6d718-debian-10.4-x86_64-20200603.cgz-57efa1fe5957694fa541-20210226-40789-1bw58rx-2.yaml
- ARCH=x86_64
- kconfig=x86_64-rhel-8.3
- branch=linus/master
- commit=57efa1fe5957694fa541c9062de0a127f0b9acb0
- BOOT_IMAGE=/pkg/linux/x86_64-rhel-8.3/gcc-9/57efa1fe5957694fa541c9062de0a127f0b9acb0/vmlinuz-5.10.0-00044-g57efa1fe5957
- acpi_rsdp=0x665fd014
- max_uptime=2100
- RESULT_ROOT=/result/will-it-scale/performance-thread-50%-mmap1-ucode=0xb000280-monitor=70d6d718/lkp-icl-2sp1/debian-10.4-x86_64-20200603.cgz/x86_64-rhel-8.3/gcc-9/57efa1fe5957694fa541c9062de0a127f0b9acb0/0
- LKP_SERVER=internal-lkp-server
- nokaslr
- selinux=0
- debug
- apic=debug
- sysrq_always_enabled
- rcupdate.rcu_cpu_stall_timeout=100
- net.ifnames=0
- printk.devkmsg=on
- panic=-1
- softlockup_panic=1
- nmi_watchdog=panic
- oops=panic
- load_ramdisk=2
- prompt_ramdisk=0
- drbd.minor_count=8
- systemd.log_level=err
- ignore_loglevel
- console=tty0
- earlyprintk=ttyS0,115200
- console=ttyS0,115200
- vga=normal
- rw
modules_initrd: "/pkg/linux/x86_64-rhel-8.3/gcc-9/57efa1fe5957694fa541c9062de0a127f0b9acb0/modules.cgz"
bm_initrd: "/osimage/deps/debian-10.4-x86_64-20200603.cgz/run-ipconfig_20200608.cgz,/osimage/deps/debian-10.4-x86_64-20200603.cgz/lkp_20201211.cgz,/osimage/deps/debian-10.4-x86_64-20200603.cgz/rsync-rootfs_20200608.cgz,/osimage/deps/debian-10.4-x86_64-20200603.cgz/will-it-scale_20210108.cgz,/osimage/pkg/debian-10.4-x86_64-20200603.cgz/will-it-scale-x86_64-6b6f1f6-1_20210108.cgz,/osimage/deps/debian-10.4-x86_64-20200603.cgz/mpstat_20200714.cgz,/osimage/deps/debian-10.4-x86_64-20200603.cgz/perf_20201126.cgz,/osimage/pkg/debian-10.4-x86_64-20200603.cgz/perf-x86_64-e71ba9452f0b-1_20210106.cgz,/osimage/pkg/debian-10.4-x86_64-20200603.cgz/sar-x86_64-34c92ae-1_20200702.cgz,/osimage/deps/debian-10.4-x86_64-20200603.cgz/hw_20200715.cgz"
ucode_initrd: "/osimage/ucode/intel-ucode-20201117.cgz"
lkp_initrd: "/osimage/user/lkp/lkp-x86_64.cgz"
site: inn
LKP_CGI_PORT: 80
LKP_CIFS_PORT: 139
oom-killer:
watchdog:
last_kernel: 5.11.0-07287-g933a73780a7a
repeat_to: 3
good_samples:
- 7084
- 6873
- 6971

#! queue options
#! user overrides
queue_at_least_once: 0

#! schedule options
kernel: "/pkg/linux/x86_64-rhel-8.3/gcc-9/57efa1fe5957694fa541c9062de0a127f0b9acb0/vmlinuz-5.10.0-00044-g57efa1fe5957"
dequeue_time: 2021-02-26 22:36:05.668722260 +08:00

#! /lkp/lkp/.src-20210226-170207/include/site/inn

#! runtime status
job_state: finished
loadavg: 40.45 29.66 13.27 1/716 10184
start_time: '1614350233'
end_time: '1614350535'
version: "/lkp/lkp/.src-20210226-170239:f6d2b143:03255feb8"

[-- Attachment #4: reproduce --]
[-- Type: text/plain, Size: 335 bytes --]

for cpu_dir in /sys/devices/system/cpu/cpu[0-9]*
do
	online_file="$cpu_dir"/online
	[ -f "$online_file" ] && [ "$(cat "$online_file")" -eq 0 ] && continue

	file="$cpu_dir"/cpufreq/scaling_governor
	[ -f "$file" ] && echo "performance" > "$file"
done

"/lkp/benchmarks/python3/bin/python3" "./runtest.py" "mmap1" "295" "thread" "48"

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression
  2021-05-25  3:16 [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression kernel test robot
@ 2021-05-25  3:11 ` Linus Torvalds
  2021-06-04  7:04   ` Feng Tang
  2021-06-04  8:37   ` [LKP] " Xing Zhengjun
  0 siblings, 2 replies; 13+ messages in thread
From: Linus Torvalds @ 2021-05-25  3:11 UTC (permalink / raw)
To: kernel test robot
Cc: Jason Gunthorpe, John Hubbard, Jan Kara, Peter Xu,
	Andrea Arcangeli, Aneesh Kumar K.V, Christoph Hellwig,
	Hugh Dickins, Jann Horn, Kirill Shutemov, Kirill Tkhai,
	Leon Romanovsky, Michal Hocko, Oleg Nesterov, Andrew Morton,
	LKML, lkp, kernel test robot, Huang, Ying, Feng Tang, zhengjun.xing

On Mon, May 24, 2021 at 5:00 PM kernel test robot <oliver.sang@intel.com> wrote:
>
> FYI, we noticed a -9.2% regression of will-it-scale.per_thread_ops due to commit:
> commit: 57efa1fe5957694fa541c9062de0a127f0b9acb0 ("mm/gup: prevent gup_fast from racing with COW during fork")

Hmm. This looks like one of those "random fluctuations" things.

It would be good to hear if other test-cases also bisect to the same
thing, but this report already says:

> In addition to that, the commit also has significant impact on the following tests:
>
> +------------------+---------------------------------------------------------------------------------+
> | testcase: change | will-it-scale: will-it-scale.per_thread_ops 3.7% improvement |

which does kind of reinforce that "this benchmark gives unstable numbers".

The perf data doesn't even mention any of the GUP paths, and on the
pure fork path the biggest impact would be:

 (a) maybe "struct mm_struct" changed in size or had a different cache layout

 (b) two added (nonatomic) increment operations in the fork path due
     to the seqcount

and I'm not seeing what would cause that 9% change. Obviously cache
placement has done it before.

If somebody else sees something that I'm missing, please holler. But
I'll ignore this as "noise" otherwise.

               Linus

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression
  2021-05-25  3:11 ` Linus Torvalds
@ 2021-06-04  7:04   ` Feng Tang
  2021-06-04  7:52     ` Feng Tang
  2021-06-04  8:37   ` [LKP] " Xing Zhengjun
  1 sibling, 1 reply; 13+ messages in thread
From: Feng Tang @ 2021-06-04  7:04 UTC (permalink / raw)
To: Linus Torvalds
Cc: kernel test robot, Jason Gunthorpe, John Hubbard, Jan Kara,
	Peter Xu, Andrea Arcangeli, Aneesh Kumar K.V, Christoph Hellwig,
	Hugh Dickins, Jann Horn, Kirill Shutemov, Kirill Tkhai,
	Leon Romanovsky, Michal Hocko, Oleg Nesterov, Andrew Morton,
	LKML, lkp, kernel test robot, Huang, Ying, zhengjun.xing

Hi Linus,

Sorry for the late response.

On Mon, May 24, 2021 at 05:11:37PM -1000, Linus Torvalds wrote:
> On Mon, May 24, 2021 at 5:00 PM kernel test robot <oliver.sang@intel.com> wrote:
> >
> > FYI, we noticed a -9.2% regression of will-it-scale.per_thread_ops due to commit:
> > commit: 57efa1fe5957694fa541c9062de0a127f0b9acb0 ("mm/gup: prevent gup_fast from racing with COW during fork")
>
> Hmm. This looks like one of those "random fluctuations" things.
>
> It would be good to hear if other test-cases also bisect to the same
> thing, but this report already says:
>
> > In addition to that, the commit also has significant impact on the following tests:
> >
> > +------------------+---------------------------------------------------------------------------------+
> > | testcase: change | will-it-scale: will-it-scale.per_thread_ops 3.7% improvement |
>
> which does kind of reinforce that "this benchmark gives unstable numbers".
>
> The perf data doesn't even mention any of the GUP paths, and on the
> pure fork path the biggest impact would be:
>
> (a) maybe "struct mm_struct" changed in size or had a different cache layout

Yes, this seems to be the cause of the regression.

In the test case, many threads do map/unmap at the same time, so the
process's rw_semaphore 'mmap_lock' is highly contended.

Before the patch (with 0day's kconfig), 'mmap_lock' was split across 2
cachelines: its 'count' sat in one line, and the other members sat in
the next line, which luckily avoided some cache bouncing. After the
patch, 'mmap_lock' is pushed into one cacheline, which may cause the
regression.

Below is the pahole info:

- before the patch

	spinlock_t page_table_lock;      /* 116  4 */
	struct rw_semaphore mmap_lock;   /* 120 40 */
	/* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */
	struct list_head mmlist;         /* 160 16 */
	long unsigned int hiwater_rss;   /* 176  8 */

- after the patch

	spinlock_t page_table_lock;      /* 124  4 */
	/* --- cacheline 2 boundary (128 bytes) --- */
	struct rw_semaphore mmap_lock;   /* 128 40 */
	struct list_head mmlist;         /* 168 16 */
	long unsigned int hiwater_rss;   /* 184  8 */

A perf c2c log can also confirm this.

Thanks,
Feng

> (b) two added (nonatomic) increment operations in the fork path due
> to the seqcount
>
> and I'm not seeing what would cause that 9% change. Obviously cache
> placement has done it before.
>
> If somebody else sees something that I'm missing, please holler. But
> I'll ignore this as "noise" otherwise.
>
> Linus

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression
  2021-06-04  7:04 ` Feng Tang
@ 2021-06-04  7:52   ` Feng Tang
  2021-06-04 17:57     ` Linus Torvalds
  2021-06-04 17:58     ` John Hubbard
  0 siblings, 2 replies; 13+ messages in thread
From: Feng Tang @ 2021-06-04  7:52 UTC (permalink / raw)
To: Linus Torvalds, Jason Gunthorpe
Cc: kernel test robot, Jason Gunthorpe, John Hubbard, Jan Kara,
	Peter Xu, Andrea Arcangeli, Aneesh Kumar K.V, Christoph Hellwig,
	Hugh Dickins, Jann Horn, Kirill Shutemov, Kirill Tkhai,
	Leon Romanovsky, Michal Hocko, Oleg Nesterov, Andrew Morton,
	LKML, lkp, kernel test robot, Huang, Ying, zhengjun.xing

On Fri, Jun 04, 2021 at 03:04:11PM +0800, Feng Tang wrote:
> Hi Linus,
>
> Sorry for the late response.
>
> On Mon, May 24, 2021 at 05:11:37PM -1000, Linus Torvalds wrote:
> > On Mon, May 24, 2021 at 5:00 PM kernel test robot <oliver.sang@intel.com> wrote:
> > >
> > > FYI, we noticed a -9.2% regression of will-it-scale.per_thread_ops due to commit:
> > > commit: 57efa1fe5957694fa541c9062de0a127f0b9acb0 ("mm/gup: prevent gup_fast from racing with COW during fork")
> >
> > Hmm. This looks like one of those "random fluctuations" things.
> >
> > It would be good to hear if other test-cases also bisect to the same
> > thing, but this report already says:
> >
> > > In addition to that, the commit also has significant impact on the following tests:
> > >
> > > +------------------+---------------------------------------------------------------------------------+
> > > | testcase: change | will-it-scale: will-it-scale.per_thread_ops 3.7% improvement |
> >
> > which does kind of reinforce that "this benchmark gives unstable numbers".
> >
> > The perf data doesn't even mention any of the GUP paths, and on the
> > pure fork path the biggest impact would be:
> >
> > (a) maybe "struct mm_struct" changed in size or had a different cache layout
>
> Yes, this seems to be the cause of the regression.
>
> In the test case, many threads do map/unmap at the same time, so the
> process's rw_semaphore 'mmap_lock' is highly contended.
>
> Before the patch (with 0day's kconfig), 'mmap_lock' was split across 2
> cachelines: its 'count' sat in one line, and the other members sat in
> the next line, which luckily avoided some cache bouncing. After the
> patch, 'mmap_lock' is pushed into one cacheline, which may cause the
> regression.
>
> A perf c2c log can also confirm this.

We've tried a patch which resolves the regression. The newly added
member 'write_protect_seq' is 4 bytes long, and moving it into an
existing 4-byte hole restores the old layout of the hot members while
not affecting the alignment of most other members. Please review the
following patch, thanks!
- Feng

From 85ddc2c3d0f2bdcbad4edc5c392c7bc90bb1667e Mon Sep 17 00:00:00 2001
From: Feng Tang <feng.tang@intel.com>
Date: Fri, 4 Jun 2021 15:20:57 +0800
Subject: [PATCH RFC] mm: relocate 'write_protect_seq' in struct mm_struct

Before commit 57efa1fe5957 ("mm/gup: prevent gup_fast from racing with
COW during fork"), on 64-bit systems the hot rw_semaphore member
'mmap_lock' of 'mm_struct' was split across 2 cachelines: its member
'count' sat in one cacheline while all the other members sat in the
next one, which naturally reduced some cache bouncing. With the commit,
'mmap_lock' is pushed into a single cacheline, as shown in the pahole
info:

- before the commit

	spinlock_t page_table_lock;      /* 116  4 */
	struct rw_semaphore mmap_lock;   /* 120 40 */
	/* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */
	struct list_head mmlist;         /* 160 16 */
	long unsigned int hiwater_rss;   /* 176  8 */

- after the commit

	spinlock_t page_table_lock;      /* 124  4 */
	/* --- cacheline 2 boundary (128 bytes) --- */
	struct rw_semaphore mmap_lock;   /* 128 40 */
	struct list_head mmlist;         /* 168 16 */
	long unsigned int hiwater_rss;   /* 184  8 */

This causes a 9.2% regression in the 'mmap1' case of the will-it-scale
benchmark [1], in which 'mmap_lock' is highly contended (it occupies
90%+ of cpu cycles).

Re-laying out a structure is a double-edged sword: it may help some
cases while hurting others. Since the newly added 'seqcount_t' is
4 bytes long (when CONFIG_DEBUG_LOCK_ALLOC=n), placing it into an
existing 4-byte hole in 'mm_struct' leaves the alignment of most other
members untouched while resolving the regression.

[1].
https://lore.kernel.org/lkml/20210525031636.GB7744@xsang-OptiPlex-9020/

Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
---
 include/linux/mm_types.h | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5aacc1c..5b55f88 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -445,13 +445,6 @@ struct mm_struct {
 		 */
 		atomic_t has_pinned;
 
-		/**
-		 * @write_protect_seq: Locked when any thread is write
-		 * protecting pages mapped by this mm to enforce a later COW,
-		 * for instance during page table copying for fork().
-		 */
-		seqcount_t write_protect_seq;
-
 #ifdef CONFIG_MMU
 		atomic_long_t pgtables_bytes;	/* PTE page table pages */
 #endif
@@ -480,7 +473,15 @@ struct mm_struct {
 		unsigned long stack_vm;	   /* VM_STACK */
 		unsigned long def_flags;
 
+		/**
+		 * @write_protect_seq: Locked when any thread is write
+		 * protecting pages mapped by this mm to enforce a later COW,
+		 * for instance during page table copying for fork().
+		 */
+		seqcount_t write_protect_seq;
+
 		spinlock_t arg_lock;	/* protect the below fields */
+
 		unsigned long start_code, end_code, start_data, end_data;
 		unsigned long start_brk, brk, start_stack;
 		unsigned long arg_start, arg_end, env_start, env_end;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 13+ messages in thread
* Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression
  2021-06-04  7:52 ` Feng Tang
@ 2021-06-04 17:57   ` Linus Torvalds
  2021-06-06 10:16     ` Feng Tang
  0 siblings, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2021-06-04 17:57 UTC (permalink / raw)
To: Feng Tang
Cc: Jason Gunthorpe, kernel test robot, John Hubbard, Jan Kara,
	Peter Xu, Andrea Arcangeli, Aneesh Kumar K.V, Christoph Hellwig,
	Hugh Dickins, Jann Horn, Kirill Shutemov, Kirill Tkhai,
	Leon Romanovsky, Michal Hocko, Oleg Nesterov, Andrew Morton,
	LKML, lkp, kernel test robot, Huang, Ying, zhengjun.xing

On Fri, Jun 4, 2021 at 12:52 AM Feng Tang <feng.tang@intel.com> wrote:
>
> On Fri, Jun 04, 2021 at 03:04:11PM +0800, Feng Tang wrote:
> > >
> > > The perf data doesn't even mention any of the GUP paths, and on the
> > > pure fork path the biggest impact would be:
> > >
> > > (a) maybe "struct mm_struct" changed in size or had a different cache layout
> >
> > Yes, this seems to be the cause of the regression.
> >
> > In the test case, many threads do map/unmap at the same time, so the
> > process's rw_semaphore 'mmap_lock' is highly contended.
> >
> > Before the patch (with 0day's kconfig), 'mmap_lock' was split across 2
> > cachelines: its 'count' sat in one line, and the other members sat in
> > the next line, which luckily avoided some cache bouncing. After the
> > patch, 'mmap_lock' is pushed into one cacheline, which may cause the
> > regression.

Ok, thanks for following up on this.

> We've tried a patch which resolves the regression. The newly added
> member 'write_protect_seq' is 4 bytes long, and moving it into an
> existing 4-byte hole restores the old layout of the hot members while
> not affecting the alignment of most other members. Please review the
> following patch, thanks!

The patch looks fine to me.

At the same time, I do wonder if maybe it would be worth exploring if
it's a good idea to perhaps move the 'mmap_sem' thing instead.

Or at least add a big comment. It's not clear to me exactly _which_
other fields are the ones that are so hot that the contention on
mmap_sem then causes even more cacheline bouncing.

For example, is it either

 (a) we *want* the mmap_sem to be in the first 128-byte region,
because then when we get the mmap_sem, the other fields in that same
cacheline are hot

OR

 (b) we do *not* want mmap_sem to be in the *second* 128-byte region,
because there is something *else* in that region that is touched
independently of mmap_sem that is very very hot and now you get even
more bouncing?

but I can't tell which one it is. It would be great to have a comment
in the code - and in the commit message - about exactly which fields
are the critical ones. Because I doubt it is 'write_protect_seq'
itself that matters at all.

If it's "mmap_sem should be close to other commonly used fields",
maybe we should just move mmap_sem upwards in the structure?

               Linus

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression
  2021-06-04 17:57 ` Linus Torvalds
@ 2021-06-06 10:16   ` Feng Tang
  2021-06-06 19:20     ` Linus Torvalds
  0 siblings, 1 reply; 13+ messages in thread
From: Feng Tang @ 2021-06-06 10:16 UTC (permalink / raw)
To: Linus Torvalds
Cc: Jason Gunthorpe, kernel test robot, John Hubbard, Jan Kara,
	Peter Xu, Andrea Arcangeli, Aneesh Kumar K.V, Christoph Hellwig,
	Hugh Dickins, Jann Horn, Kirill Shutemov, Kirill Tkhai,
	Leon Romanovsky, Michal Hocko, Oleg Nesterov, Andrew Morton,
	LKML, lkp, kernel test robot, Huang, Ying, zhengjun.xing

[-- Attachment #1: Type: text/plain, Size: 4401 bytes --]

On Fri, Jun 04, 2021 at 10:57:44AM -0700, Linus Torvalds wrote:
> On Fri, Jun 4, 2021 at 12:52 AM Feng Tang <feng.tang@intel.com> wrote:
> >
> > On Fri, Jun 04, 2021 at 03:04:11PM +0800, Feng Tang wrote:
> > > >
> > > > The perf data doesn't even mention any of the GUP paths, and on the
> > > > pure fork path the biggest impact would be:
> > > >
> > > > (a) maybe "struct mm_struct" changed in size or had a different cache layout
> > >
> > > Yes, this seems to be the cause of the regression.
> > >
> > > In the test case, many threads do map/unmap at the same time, so the
> > > process's rw_semaphore 'mmap_lock' is highly contended.
> > >
> > > Before the patch (with 0day's kconfig), 'mmap_lock' was split across 2
> > > cachelines: its 'count' sat in one line, and the other members sat in
> > > the next line, which luckily avoided some cache bouncing. After the
> > > patch, 'mmap_lock' is pushed into one cacheline, which may cause the
> > > regression.
>
> Ok, thanks for following up on this.
>
> > We've tried a patch which resolves the regression. The newly added
> > member 'write_protect_seq' is 4 bytes long, and moving it into an
> > existing 4-byte hole restores the old layout of the hot members while
> > not affecting the alignment of most other members. Please review the
> > following patch, thanks!
> The patch looks fine to me.
>
> At the same time, I do wonder if maybe it would be worth exploring if
> it's a good idea to perhaps move the 'mmap_sem' thing instead.
>
> Or at least add a big comment. It's not clear to me exactly _which_
> other fields are the ones that are so hot that the contention on
> mmap_sem then causes even more cacheline bouncing.
>
> For example, is it either
>
>  (a) we *want* the mmap_sem to be in the first 128-byte region,
> because then when we get the mmap_sem, the other fields in that same
> cacheline are hot
>
> OR
>
>  (b) we do *not* want mmap_sem to be in the *second* 128-byte region,
> because there is something *else* in that region that is touched
> independently of mmap_sem that is very very hot and now you get even
> more bouncing?
>
> but I can't tell which one it is.

Yes, it would be better to know exactly which fields are hottest; below
are some detailed perf data. Let me know if more info is needed.

* perf-stat: we see more cache misses

    32158577 ± 7%  +9.0%  35060321 ± 6%  perf-stat.ps.cache-misses
    69612918 ± 6% +11.2%  77382336 ± 5%  perf-stat.ps.cache-references

* perf profile: the 'mmap_lock' paths are the hottest, though the
  mmap/munmap ratio changes from 72:24 to 52:45, and this is the part
  that I don't understand

  - old kernel (without commit 57efa1fe59)

    96.60%  0.19%  [kernel.kallsyms]  [k] down_write_killable
    72.46%  down_write_killable;vm_mmap_pgoff;ksys_mmap_pgoff;do_syscall_64;entry_SYSCALL_64_after_hwframe;__mmap
    24.14%  down_write_killable;__vm_munmap;__x64_sys_munmap;do_syscall_64;entry_SYSCALL_64_after_hwframe;__munmap

  - new kernel

    96.60%  0.16%  [kernel.kallsyms]  [k] down_write_killable
    51.85%  down_write_killable;vm_mmap_pgoff;ksys_mmap_pgoff;do_syscall_64;entry_SYSCALL_64_after_hwframe;__mmap
    44.74%  down_write_killable;__vm_munmap;__x64_sys_munmap;do_syscall_64;entry_SYSCALL_64_after_hwframe;__munmap

* perf-c2c: the hotspots (HITM) of the 2 kernels differ due to the data
  structure change
  - old kernel

    - first cacheline
        mmap_lock->count (75%)
        mm->mapcount (14%)
    - second cacheline
        mmap_lock->owner (97%)

  - new kernel

    mainly in the cacheline of 'mmap_lock'
        mmap_lock->count (~2%)
        mmap_lock->owner (95%)

I also attached the reduced pahole and perf-c2c logs for further
checking. (The absolute HITM event counts can be ignored, as the
recording times for the new/old kernels may differ.)

> It would be great to have a comment in the code - and in the commit
> message - about exactly which fields are the critical ones. Because I
> doubt it is 'write_protect_seq' itself that matters at all.
>
> If it's "mmap_sem should be close to other commonly used fields",
> maybe we should just move mmap_sem upwards in the structure?

Ok, will add more comments if the patch is still fine with the above
updated info.

Thanks,
Feng

> Linus

[-- Attachment #2: pah_new.log --]
[-- Type: text/plain, Size: 5610 bytes --]

struct rw_semaphore {
	atomic_long_t count;                     /*   0   0 */
	/* XXX 8 bytes hole, try to pack */
	atomic_long_t owner;                     /*   8   0 */
	/* XXX 8 bytes hole, try to pack */
	struct optimistic_spin_queue osq;        /*  16   0 */
	/* XXX 4 bytes hole, try to pack */
	raw_spinlock_t wait_lock;                /*  20   0 */
	/* XXX 4 bytes hole, try to pack */
	struct list_head wait_list;              /*  24   0 */

	/* size: 40, cachelines: 1, members: 5 */
	/* padding: 16 */
	/* last cacheline: 40 bytes */
};

struct mm_struct {
	struct {
		struct vm_area_struct * mmap;    /*   0   8 */
		struct rb_root mm_rb;            /*   8   8 */
		u64 vmacache_seqnum;             /*  16   8 */
		long unsigned int (*get_unmapped_area)(struct file *, long unsigned int, long unsigned int, long unsigned int, long unsigned int); /*  24   8 */
		long unsigned int mmap_base;     /*  32   8 */
		long unsigned int mmap_legacy_base; /*  40   8 */
		long unsigned int mmap_compat_base; /*  48   8 */
		long unsigned int mmap_compat_legacy_base; /*  56   8 */
		/* --- cacheline 1 boundary (64 bytes) --- */
		long unsigned int task_size;     /*  64   8 */
		long unsigned int highest_vm_end; /*  72   8 */
		pgd_t * pgd;                     /*  80   8 */
		atomic_t membarrier_state;       /*
88 4 */ atomic_t mm_users; /* 92 4 */ atomic_t mm_count; /* 96 4 */ atomic_t has_pinned; /* 100 4 */ seqcount_t write_protect_seq; /* 104 4 */ /* XXX 4 bytes hole, try to pack */ atomic_long_t pgtables_bytes; /* 112 8 */ int map_count; /* 120 4 */ spinlock_t page_table_lock; /* 124 4 */ /* --- cacheline 2 boundary (128 bytes) --- */ struct rw_semaphore mmap_lock; /* 128 40 */ struct list_head mmlist; /* 168 16 */ long unsigned int hiwater_rss; /* 184 8 */ /* --- cacheline 3 boundary (192 bytes) --- */ long unsigned int hiwater_vm; /* 192 8 */ long unsigned int total_vm; /* 200 8 */ long unsigned int locked_vm; /* 208 8 */ atomic64_t pinned_vm; /* 216 8 */ long unsigned int data_vm; /* 224 8 */ long unsigned int exec_vm; /* 232 8 */ long unsigned int stack_vm; /* 240 8 */ long unsigned int def_flags; /* 248 8 */ /* --- cacheline 4 boundary (256 bytes) --- */ spinlock_t arg_lock; /* 256 4 */ /* XXX 4 bytes hole, try to pack */ long unsigned int start_code; /* 264 8 */ long unsigned int end_code; /* 272 8 */ long unsigned int start_data; /* 280 8 */ long unsigned int end_data; /* 288 8 */ long unsigned int start_brk; /* 296 8 */ long unsigned int brk; /* 304 8 */ long unsigned int start_stack; /* 312 8 */ /* --- cacheline 5 boundary (320 bytes) --- */ long unsigned int arg_start; /* 320 8 */ long unsigned int arg_end; /* 328 8 */ long unsigned int env_start; /* 336 8 */ long unsigned int env_end; /* 344 8 */ long unsigned int saved_auxv[46]; /* 352 368 */ /* --- cacheline 11 boundary (704 bytes) was 16 bytes ago --- */ struct mm_rss_stat rss_stat; /* 720 32 */ struct linux_binfmt * binfmt; /* 752 8 */ mm_context_t context; /* 760 128 */ /* --- cacheline 13 boundary (832 bytes) was 56 bytes ago --- */ long unsigned int flags; /* 888 8 */ /* --- cacheline 14 boundary (896 bytes) --- */ struct core_state * core_state; /* 896 8 */ spinlock_t ioctx_lock; /* 904 4 */ /* XXX 4 bytes hole, try to pack */ struct kioctx_table * ioctx_table; /* 912 8 */ struct task_struct * 
owner; /* 920 8 */ struct user_namespace * user_ns; /* 928 8 */ struct file * exe_file; /* 936 8 */ struct mmu_notifier_subscriptions * notifier_subscriptions; /* 944 8 */ long unsigned int numa_next_scan; /* 952 8 */ /* --- cacheline 15 boundary (960 bytes) --- */ long unsigned int numa_scan_offset; /* 960 8 */ int numa_scan_seq; /* 968 4 */ atomic_t tlb_flush_pending; /* 972 4 */ bool tlb_flush_batched; /* 976 1 */ /* XXX 7 bytes hole, try to pack */ struct uprobes_state uprobes_state; /* 984 8 */ atomic_long_t hugetlb_usage; /* 992 8 */ struct work_struct async_put_work; /* 1000 32 */ /* --- cacheline 16 boundary (1024 bytes) was 8 bytes ago --- */ u32 pasid; /* 1032 4 */ }; /* 0 1040 */ /* XXX last struct has 4 bytes of padding */ long unsigned int cpu_bitmap[]; /* 1040 0 */ /* size: 1040, cachelines: 17, members: 2 */ /* paddings: 1, sum paddings: 4 */ /* last cacheline: 16 bytes */ }; [-- Attachment #3: pah_old.log --] [-- Type: text/plain, Size: 5508 bytes --] struct rw_semaphore { atomic_long_t count; /* 0 0 */ /* XXX 8 bytes hole, try to pack */ atomic_long_t owner; /* 8 0 */ /* XXX 8 bytes hole, try to pack */ struct optimistic_spin_queue osq; /* 16 0 */ /* XXX 4 bytes hole, try to pack */ raw_spinlock_t wait_lock; /* 20 0 */ /* XXX 4 bytes hole, try to pack */ struct list_head wait_list; /* 24 0 */ /* size: 40, cachelines: 1, members: 5 */ /* padding: 16 */ /* last cacheline: 40 bytes */ }; struct mm_struct { struct { struct vm_area_struct * mmap; /* 0 8 */ struct rb_root mm_rb; /* 8 8 */ u64 vmacache_seqnum; /* 16 8 */ long unsigned int (*get_unmapped_area)(struct file *, long unsigned int, long unsigned int, long unsigned int, long unsigned int); /* 24 8 */ long unsigned int mmap_base; /* 32 8 */ long unsigned int mmap_legacy_base; /* 40 8 */ long unsigned int mmap_compat_base; /* 48 8 */ long unsigned int mmap_compat_legacy_base; /* 56 8 */ /* --- cacheline 1 boundary (64 bytes) --- */ long unsigned int task_size; /* 64 8 */ long unsigned int 
highest_vm_end; /* 72 8 */ pgd_t * pgd; /* 80 8 */ atomic_t membarrier_state; /* 88 4 */ atomic_t mm_users; /* 92 4 */ atomic_t mm_count; /* 96 4 */ atomic_t has_pinned; /* 100 4 */ atomic_long_t pgtables_bytes; /* 104 8 */ int map_count; /* 112 4 */ spinlock_t page_table_lock; /* 116 4 */ struct rw_semaphore mmap_lock; /* 120 40 */ /* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */ struct list_head mmlist; /* 160 16 */ long unsigned int hiwater_rss; /* 176 8 */ long unsigned int hiwater_vm; /* 184 8 */ /* --- cacheline 3 boundary (192 bytes) --- */ long unsigned int total_vm; /* 192 8 */ long unsigned int locked_vm; /* 200 8 */ atomic64_t pinned_vm; /* 208 8 */ long unsigned int data_vm; /* 216 8 */ long unsigned int exec_vm; /* 224 8 */ long unsigned int stack_vm; /* 232 8 */ long unsigned int def_flags; /* 240 8 */ spinlock_t arg_lock; /* 248 4 */ /* XXX 4 bytes hole, try to pack */ /* --- cacheline 4 boundary (256 bytes) --- */ long unsigned int start_code; /* 256 8 */ long unsigned int end_code; /* 264 8 */ long unsigned int start_data; /* 272 8 */ long unsigned int end_data; /* 280 8 */ long unsigned int start_brk; /* 288 8 */ long unsigned int brk; /* 296 8 */ long unsigned int start_stack; /* 304 8 */ long unsigned int arg_start; /* 312 8 */ /* --- cacheline 5 boundary (320 bytes) --- */ long unsigned int arg_end; /* 320 8 */ long unsigned int env_start; /* 328 8 */ long unsigned int env_end; /* 336 8 */ long unsigned int saved_auxv[46]; /* 344 368 */ /* --- cacheline 11 boundary (704 bytes) was 8 bytes ago --- */ struct mm_rss_stat rss_stat; /* 712 32 */ struct linux_binfmt * binfmt; /* 744 8 */ mm_context_t context; /* 752 128 */ /* --- cacheline 13 boundary (832 bytes) was 48 bytes ago --- */ long unsigned int flags; /* 880 8 */ struct core_state * core_state; /* 888 8 */ /* --- cacheline 14 boundary (896 bytes) --- */ spinlock_t ioctx_lock; /* 896 4 */ /* XXX 4 bytes hole, try to pack */ struct kioctx_table * ioctx_table; /* 904 8 */ struct 
task_struct * owner; /* 912 8 */ struct user_namespace * user_ns; /* 920 8 */ struct file * exe_file; /* 928 8 */ struct mmu_notifier_subscriptions * notifier_subscriptions; /* 936 8 */ long unsigned int numa_next_scan; /* 944 8 */ long unsigned int numa_scan_offset; /* 952 8 */ /* --- cacheline 15 boundary (960 bytes) --- */ int numa_scan_seq; /* 960 4 */ atomic_t tlb_flush_pending; /* 964 4 */ bool tlb_flush_batched; /* 968 1 */ /* XXX 7 bytes hole, try to pack */ struct uprobes_state uprobes_state; /* 976 8 */ atomic_long_t hugetlb_usage; /* 984 8 */ struct work_struct async_put_work; /* 992 32 */ /* --- cacheline 16 boundary (1024 bytes) --- */ u32 pasid; /* 1024 4 */ }; /* 0 1032 */ /* XXX last struct has 4 bytes of padding */ long unsigned int cpu_bitmap[]; /* 1032 0 */ /* size: 1032, cachelines: 17, members: 2 */ /* paddings: 1, sum paddings: 4 */ /* last cacheline: 8 bytes */ }; [-- Attachment #4: c2c_new.log --] [-- Type: text/plain, Size: 37291 bytes --] ================================================= Trace Event Information ================================================= Total records : 293248 Locked Load/Store Operations : 6171 Load Operations : 52087 Loads - uncacheable : 0 Loads - IO : 0 Loads - Miss : 0 Loads - no mapping : 467 Load Fill Buffer Hit : 14960 Load L1D hit : 17949 Load L2D hit : 377 Load LLC hit : 11861 Load Local HITM : 5926 Load Remote HITM : 4183 Load Remote HIT : 0 Load Local DRAM : 580 Load Remote DRAM : 5893 Load MESI State Exclusive : 5893 Load MESI State Shared : 580 Load LLC Misses : 10656 Load access blocked by data : 0 Load access blocked by address : 0 LLC Misses to Local DRAM : 5.4% LLC Misses to Remote DRAM : 55.3% LLC Misses to Remote cache (HIT) : 0.0% LLC Misses to Remote cache (HITM) : 39.3% Store Operations : 0 Store - uncacheable : 0 Store - no mapping : 0 Store L1D Hit : 0 Store L1D Miss : 0 No Page Map Rejects : 50306 Unable to parse data source : 241161 ================================================= Global 
Shared Cache Line Event Information ================================================= Total Shared Cache Lines : 1441 Load HITs on shared lines : 32608 Fill Buffer Hits on shared lines : 10597 L1D hits on shared lines : 4777 L2D hits on shared lines : 22 LLC hits on shared lines : 11030 Locked Access on shared lines : 4481 Blocked Access on shared lines : 0 Store HITs on shared lines : 0 Store L1D hits on shared lines : 0 Total Merged records : 10109 ================================================= c2c details ================================================= Events : cpu/mem-loads,ldlat=30/P : cpu/mem-stores/P Cachelines sort on : Total HITMs Cacheline data grouping : offset,iaddr ================================================= Shared Data Cache Line Table ================================================= # # ----------- Cacheline ---------- Tot ------- Load Hitm ------- Total Total Total ---- Stores ---- ----- Core Load Hit ----- - LLC Load Hit -- - RMT Load Hit -- --- Load Dram ---- # Index Address Node PA cnt Hitm Total LclHitm RmtHitm records Loads Stores L1Hit L1Miss FB L1 L2 LclHit LclHitm RmtHit RmtHitm Lcl Rmt # ..... .................. .... ...... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ....... ........ ....... ........ ....... ........ ........ 
# 0 0xff110002a24ffc00 0 10811 28.14% 2845 1535 1310 10447 10447 0 0 0 2307 2819 0 412 1535 0 1310 292 1772 1 0xff110002a24ffb80 0 423 2.41% 244 136 108 722 722 0 0 0 328 3 0 38 136 0 108 0 109 2 0xff1100209ae58b00 1 1125 2.36% 239 128 111 1268 1268 0 0 0 754 94 0 67 128 0 111 0 114 3 0xff1100209ae58580 1 1227 2.20% 222 99 123 1029 1029 0 0 0 582 46 0 55 99 0 123 0 124 4 0xff110002a24ffc40 0 1386 1.69% 171 169 2 1205 1205 0 0 0 714 0 0 195 169 0 2 3 122 5 0xff11001fffeaba00 0 161 0.84% 85 44 41 237 237 0 0 0 46 19 0 42 44 0 41 2 43 6 0xff11001ffff2ba00 0 169 0.78% 79 32 47 252 252 0 0 0 48 26 0 37 32 0 47 8 54 7 0xff11003fdcc6ba00 1 116 0.78% 79 39 40 205 205 0 0 0 30 15 0 31 39 0 40 2 48 8 0xff110002a24ffbc0 0 64 0.77% 78 36 42 393 393 0 0 0 132 119 2 20 36 0 42 0 42 9 0xff11003fdc6eba00 1 144 0.76% 77 48 29 195 195 0 0 0 31 19 0 29 48 0 29 4 35 10 0xff11001fffe2ba00 0 156 0.74% 75 36 39 226 226 0 0 0 50 24 0 31 36 0 39 3 43 11 0xff11003fdc7aba00 1 137 0.74% 75 38 37 204 204 0 0 0 30 20 0 34 38 0 37 3 42 12 0xff11001fffc2ba00 0 128 0.73% 74 44 30 182 182 0 0 0 31 16 0 23 44 0 30 1 37 13 0xff110020001aba00 0 135 0.72% 73 41 32 198 198 0 0 0 28 20 0 33 41 0 32 4 40 14 0xff1100200012ba00 0 151 0.71% 72 36 36 221 221 0 0 0 35 21 0 45 36 0 36 6 42 15 0xff1100200002ba00 0 122 0.69% 70 41 29 184 184 0 0 0 31 16 0 32 41 0 29 3 32 16 0xff11003fdc92ba00 1 142 0.69% 70 32 38 204 204 0 0 0 40 17 0 29 32 0 38 3 45 17 0xff11003fdc96ba00 1 138 0.69% 70 37 33 207 207 0 0 0 33 18 0 43 37 0 33 4 39 18 0xff11003fdcd2ba00 1 109 0.69% 70 31 39 184 184 0 0 0 28 18 0 22 31 0 39 0 46 19 0xff11001ffff6ba00 0 141 0.68% 69 43 26 193 193 0 0 0 38 23 0 29 43 0 26 4 30 20 0xff11001ffffaba00 0 142 0.68% 69 36 33 204 204 0 0 0 35 23 0 38 36 0 33 2 37 21 0xff11001fffe6ba00 0 133 0.67% 68 42 26 195 195 0 0 0 39 18 0 38 42 0 26 2 30 22 0xff1100200016ba00 0 117 0.65% 66 42 24 185 185 0 0 0 37 10 0 40 42 0 24 5 27 23 0xff1100209ae58b40 1 1 0.65% 66 66 0 210 210 0 0 0 1 0 0 40 66 0 0 18 85 24 
0xff11003fdc8aba00 1 126 0.65% 66 32 34 192 192 0 0 0 37 17 0 30 32 0 34 3 39 25 0xff11001fffceba00 0 160 0.64% 65 30 35 222 222 0 0 0 40 26 0 48 30 0 35 3 40 26 0xff11003fdcdeba00 1 121 0.64% 65 37 28 182 182 0 0 0 34 11 0 34 37 0 28 3 35 27 0xff11001fffdaba00 0 92 0.63% 64 38 26 170 170 0 0 0 29 11 0 30 38 0 26 4 32 28 0xff110020000aba00 0 159 0.63% 64 37 27 189 189 0 0 0 43 17 0 30 37 0 27 2 33 29 0xff11003fdc82ba00 1 137 0.62% 63 30 33 200 200 0 0 0 43 20 0 33 30 0 33 5 36 30 0xff1100200006ba00 0 127 0.61% 62 34 28 184 184 0 0 0 33 20 0 34 34 0 28 3 32 31 0xff11003fdcbaba00 1 139 0.61% 62 35 27 180 180 0 0 0 39 19 0 26 35 0 27 3 31 32 0xff11001fffc6ba00 0 130 0.60% 61 37 24 185 185 0 0 0 41 23 0 30 37 0 24 3 27 33 0xff11001fffd2ba00 0 106 0.60% 61 29 32 167 167 0 0 0 26 17 0 26 29 0 32 1 36 34 0xff11001fffd6ba00 0 85 0.60% 61 31 30 166 166 0 0 0 29 7 0 33 31 0 30 1 35 35 0xff11001fffbeba00 0 118 0.59% 60 44 16 161 161 0 0 0 38 13 0 28 44 0 16 6 16 36 0xff11001ffffeba00 0 144 0.59% 60 41 19 194 194 0 0 0 35 23 0 46 41 0 19 4 26 37 0xff110020000eba00 0 133 0.59% 60 36 24 180 180 0 0 0 34 18 0 32 36 0 24 5 31 38 0xff11003fdcc2ba00 1 121 0.59% 60 29 31 188 188 0 0 0 37 13 0 37 29 0 31 6 35 39 0xff11003fdcbeba00 1 119 0.58% 59 32 27 171 171 0 0 0 30 15 0 33 32 0 27 3 31 40 0xff11003fdcd6ba00 1 116 0.58% 59 29 30 174 174 0 0 0 20 22 0 39 29 0 30 1 33 41 0xff11003fdceeba00 1 79 0.58% 59 35 24 151 151 0 0 0 28 4 0 28 35 0 24 0 32 42 0xff11003fdcfaba00 1 105 0.58% 59 34 25 155 155 0 0 0 26 13 0 29 34 0 25 0 28 43 0xff11001fffeeba00 0 113 0.57% 58 31 27 172 172 0 0 0 32 16 0 37 31 0 27 1 28 44 0xff11003fdcb6ba00 1 128 0.56% 57 32 25 177 177 0 0 0 43 14 0 32 32 0 25 2 29 45 0xff11003fdcfeba00 1 88 0.56% 57 33 24 144 144 0 0 0 29 8 0 16 33 0 24 4 30 46 0xff11003fdc42ba00 1 86 0.55% 56 28 28 159 159 0 0 0 34 12 0 23 28 0 28 3 31 47 0xff11003fdcaeba00 1 107 0.54% 55 22 33 154 154 0 0 0 24 11 0 24 22 0 33 1 39 48 0xff11001fffcaba00 0 111 0.53% 54 32 22 156 156 0 0 0 35 11 0 
28 32 0 22 1 27 49 0xff11003fdcaaba00 1 87 0.52% 53 29 24 141 141 0 0 0 21 9 0 26 29 0 24 3 29 50 0xff11003fdc86ba00 1 89 0.49% 50 30 20 147 147 0 0 0 27 11 0 31 30 0 20 2 26 51 0xff11001fffdeba00 0 96 0.47% 48 26 22 144 144 0 0 0 25 10 0 28 26 0 22 7 26 52 0xff11003fdc46ba00 1 98 0.47% 48 30 18 127 127 0 0 0 21 8 0 27 30 0 18 1 22 53 0xff11003fdc6aba00 1 80 0.41% 41 20 21 121 121 0 0 0 24 8 0 22 20 0 21 1 25 54 0xff11003fdcb2ba00 1 92 0.40% 40 20 20 137 137 0 0 0 29 12 0 27 20 0 20 2 27 55 0xff110020d9ecc000 1 37 0.31% 31 15 16 68 68 0 0 0 9 2 0 8 15 0 16 0 18 56 0xff1100406e91c000 1 57 0.27% 27 16 11 80 80 0 0 0 17 4 0 21 16 0 11 0 11 57 0xff1100406f884000 1 59 0.27% 27 12 15 76 76 0 0 0 17 4 0 13 12 0 15 0 15 58 0xff110001e3bd9bc0 0 125 0.26% 26 10 16 269 269 0 0 0 19 1 0 207 10 0 16 0 16 59 0xff110001085fd700 0 196 0.25% 25 15 10 316 316 0 0 0 7 39 0 235 15 0 10 0 10 60 0xff11000154dd8000 0 58 0.25% 25 12 13 87 87 0 0 0 24 5 0 20 12 0 13 0 13 61 0xff1100406a59c000 1 60 0.25% 25 11 14 81 81 0 0 0 21 4 0 17 11 0 14 0 14 62 0xff1100406d9f0000 1 40 0.25% 25 16 9 58 58 0 0 0 8 4 0 11 16 0 9 0 10 63 0xff11000156a24000 0 51 0.24% 24 10 14 71 71 0 0 0 18 3 0 12 10 0 14 0 14 64 0xff110001b4044000 0 66 0.24% 24 17 7 78 78 0 0 0 27 2 0 17 17 0 7 0 8 65 0xff11000212360000 0 69 0.24% 24 9 15 104 104 0 0 0 40 4 0 21 9 0 15 0 15 66 0xff1100209d074000 1 56 0.23% 23 11 12 67 67 0 0 0 19 5 0 7 11 0 12 0 13 67 0xff11004070a10000 1 47 0.23% 23 11 12 74 74 0 0 0 21 1 0 17 11 0 12 0 12 68 0xff11000109248000 0 50 0.22% 22 13 9 77 77 0 0 0 20 2 0 24 13 0 9 0 9 69 0xff1100012d114000 0 57 0.22% 22 16 6 76 76 0 0 0 30 3 0 15 16 0 6 0 6 70 0xff110020d9614000 1 31 0.22% 22 16 6 55 55 0 0 0 13 0 0 14 16 0 6 0 6 71 0xff1100406a600000 1 55 0.22% 22 10 12 75 75 0 0 0 24 4 0 11 10 0 12 0 14 72 0xffffffff835477c0 1 1 0.21% 21 9 12 33 33 0 0 0 0 0 0 0 9 0 12 0 12 73 0xff11000270708000 0 48 0.21% 21 9 12 64 64 0 0 0 15 6 0 9 9 0 12 0 13 74 0xff11000117070000 0 66 0.20% 20 8 12 95 95 0 0 0 39 5 0 
19 8 0 12 0 12 75 0xff1100012dd18000 0 57 0.20% 20 12 8 75 75 0 0 0 22 4 0 21 12 0 8 0 8 76 0xff11000242dbc000 0 64 0.20% 20 12 8 86 86 0 0 0 31 2 0 23 12 0 8 0 10 77 0xff1100027104c000 0 58 0.20% 20 9 11 82 82 0 0 0 24 4 0 21 9 0 11 0 13 78 0xff1100406bca4000 1 43 0.20% 20 5 15 56 56 0 0 0 8 2 0 11 5 0 15 0 15 79 0xff110001091c4000 0 60 0.19% 19 12 7 77 77 0 0 0 17 4 0 29 12 0 7 0 8 80 0xff1100012bab8000 0 69 0.19% 19 7 12 95 95 0 0 0 33 9 0 22 7 0 12 0 12 81 0xff11000155a60000 0 58 0.19% 19 6 13 84 84 0 0 0 27 6 0 19 6 0 13 0 13 82 0xff110001b4244000 0 49 0.19% 19 13 6 66 66 0 0 0 23 7 0 11 13 0 6 0 6 83 0xff110001e207c000 0 56 0.19% 19 14 5 68 68 0 0 0 22 3 0 19 14 0 5 0 5 84 0xff110002a24fff00 0 1 0.19% 19 19 0 197 197 0 0 0 8 0 0 160 19 0 0 0 10 85 0xff110020b736c000 1 56 0.19% 19 13 6 70 70 0 0 0 26 4 0 15 13 0 6 0 6 86 0xff110020ba320000 1 60 0.19% 19 11 8 68 68 0 0 0 18 3 0 20 11 0 8 0 8 87 0xff110040755c4000 1 58 0.19% 19 12 7 73 73 0 0 0 20 10 0 16 12 0 7 0 8 88 0xff110001092a0000 0 37 0.18% 18 12 6 56 56 0 0 0 16 0 0 15 12 0 6 0 7 89 0xff1100015696c000 0 58 0.18% 18 10 8 77 77 0 0 0 29 3 0 19 10 0 8 0 8 90 0xff110001866f8000 0 57 0.18% 18 7 11 84 84 0 0 0 35 4 0 16 7 0 11 0 11 91 0xff11004072f84000 1 40 0.18% 18 12 6 56 56 0 0 0 15 4 0 13 12 0 6 0 6 92 0xff110001163d8000 0 46 0.17% 17 4 13 63 63 0 0 0 18 1 0 13 4 0 13 0 14 93 0xff1100406bca4040 1 1 0.17% 17 17 0 34 34 0 0 0 1 0 0 5 17 0 0 0 11 94 0xff11004070424000 1 41 0.17% 17 8 9 64 64 0 0 0 20 5 0 13 8 0 9 0 9 95 0xff11004079f7c000 1 52 0.17% 17 11 6 52 52 0 0 0 8 4 0 17 11 0 6 0 6 96 0xff11000117b9c000 0 55 0.16% 16 5 11 76 76 0 0 0 26 7 0 16 5 0 11 0 11 97 0xff110002408b8000 0 78 0.16% 16 6 10 86 86 0 0 0 38 9 0 13 6 0 10 0 10 98 0xff110002a1a88000 0 48 0.16% 16 7 9 64 64 0 0 0 15 5 0 19 7 0 9 0 9 99 0xff11004072594000 1 53 0.16% 16 8 8 58 58 0 0 0 17 4 0 12 8 0 8 0 9 100 0xff11004073a30000 1 46 0.16% 16 9 7 61 61 0 0 0 21 4 0 12 9 0 7 0 8 101 0xff11000117b9c040 0 1 0.15% 15 15 0 31 31 0 0 0 0 0 0 
6 15 0 0 0 10 102 0xff1100012f064000 0 63 0.15% 15 10 5 90 90 0 0 0 49 7 0 13 10 0 5 0 6 103 0xff110020b736c040 1 1 0.15% 15 15 0 27 27 0 0 0 0 0 0 1 15 0 0 0 11 104 0xff1100406f161bc0 1 32 0.15% 15 11 4 131 131 0 0 0 10 0 1 101 11 0 4 0 4 105 0xff11000156a24040 0 3 0.14% 14 13 1 28 28 0 0 0 1 0 0 2 13 0 1 0 11 106 0xff11000300658000 0 55 0.14% 14 9 5 65 65 0 0 0 24 3 0 19 9 0 5 0 5 107 0xff1100406b92c040 1 1 0.14% 14 14 0 31 31 0 0 0 0 0 0 7 14 0 0 0 10 108 0xff1100406d770000 1 36 0.14% 14 5 9 43 43 0 0 0 7 1 0 12 5 0 9 0 9 109 0xff110001092a0040 0 1 0.13% 13 13 0 31 31 0 0 0 0 0 0 8 13 0 0 0 10 110 0xff110020d7e88000 1 48 0.13% 13 9 4 44 44 0 0 0 9 3 0 15 9 0 4 0 4 111 0xff1100406b92c000 1 40 0.13% 13 8 5 42 42 0 0 0 12 3 0 7 8 0 5 0 7 112 0xff1100407419c000 1 48 0.13% 13 11 2 60 60 0 0 0 24 7 0 13 11 0 2 0 3 113 0xff110001091c4040 0 1 0.12% 12 12 0 25 25 0 0 0 0 0 0 4 12 0 0 0 9 114 0xff11000300658040 0 1 0.12% 12 12 0 22 22 0 0 0 0 0 0 4 12 0 0 0 6 115 0xff110020ba320040 1 1 0.12% 12 12 0 33 33 0 0 0 2 0 0 3 12 0 0 0 16 116 0xff11004070424040 1 5 0.12% 12 12 0 32 32 0 0 0 0 1 0 3 12 0 0 0 16 117 0xff1100407bca8000 1 41 0.12% 12 8 4 42 42 0 0 0 8 4 0 13 8 0 4 0 5 118 0xff11000155a60040 0 1 0.11% 11 11 0 29 29 0 0 0 0 0 0 6 11 0 0 0 12 119 0xff110001e3bd9700 0 42 0.11% 11 5 6 82 82 0 0 0 43 0 0 22 5 0 6 0 6 ================================================= Shared Cache Line Distribution Pareto ================================================= # # ----- HITM ----- -- Store Refs -- --------- Data address --------- ---------- cycles ---------- Total cpu Shared # Num RmtHitm LclHitm L1 Hit L1 Miss Offset Node PA cnt Code address rmt hitm lcl hitm load records cnt Symbol Object Source:Line Node # ..... ....... ....... ....... ....... .................. .... ...... .................. ........ ........ ........ ....... ........ .............................. ................. ................. .... 
  -------------------------------------------------------------
      0     1310     1535        0        0      0xff110002a24ffc00
  -------------------------------------------------------------
    1.37%    0.20%    0.00%    0.00%     0x0     0       1  0xffffffff81157a52       854       620      429      737       49  [k] rwsem_optimistic_spin      [kernel.kallsyms]  atomic64_64.h:190    0  1
    0.69%    0.65%    0.00%    0.00%     0x0     0       1  0xffffffff81157a37       520       194      327      587       48  [k] rwsem_optimistic_spin      [kernel.kallsyms]  atomic64_64.h:22     0  1
    0.00%    0.13%    0.00%    0.00%     0x0     0       1  0xffffffff81157e89         0       762      203        5        5  [k] rwsem_down_write_slowpath  [kernel.kallsyms]  atomic64_64.h:22     0  1
   54.43%   58.18%    0.00%    0.00%     0x8     0       1  0xffffffff81157958       474       237      435     2611       50  [k] rwsem_spin_on_owner        [kernel.kallsyms]  atomic64_64.h:22     0  1
   40.76%   39.28%    0.00%    0.00%     0x8     0       1  0xffffffff811578c7       408       185      324     1831       49  [k] rwsem_spin_on_owner        [kernel.kallsyms]  atomic64_64.h:22     0  1
    0.23%    0.20%    0.00%    0.00%     0x8     0       1  0xffffffff811578fe       376       189      270      774       48  [k] rwsem_spin_on_owner        [kernel.kallsyms]  atomic64_64.h:22     0  1
    0.00%    0.07%    0.00%    0.00%     0x8     0       1  0xffffffff81157c28         0       157      332      547       48  [k] rwsem_down_write_slowpath  [kernel.kallsyms]  atomic64_64.h:22     0  1
    1.22%    0.65%    0.00%    0.00%    0x10     0       1  0xffffffff811587b5      1775       740     1033      592       49  [k] osq_lock                   [kernel.kallsyms]  atomic.h:208         0  1
    1.07%    0.33%    0.00%    0.00%    0x10     0       1  0xffffffff811588d3      1065       396      443      703       49  [k] osq_unlock                 [kernel.kallsyms]  atomic.h:196         0  1
    0.08%    0.07%    0.00%    0.00%    0x14     0       1  0xffffffff81cb8d19      1502       437      583       17       34  [k] _raw_spin_lock_irqsave     [kernel.kallsyms]  atomic.h:202         0  1
    0.00%    0.20%    0.00%    0.00%    0x18     0       1  0xffffffff811573d6         0       248      623       10       10  [k] rwsem_mark_wake            [kernel.kallsyms]  rwsem.c:414          0  1
    0.00%    0.07%    0.00%    0.00%    0x18     0       1  0xffffffff81157e67         0       789        0        1        1  [k] rwsem_down_write_slowpath  [kernel.kallsyms]  rwsem.c:1245         0
    0.15%    0.00%    0.00%    0.00%    0x38     0       1  0xffffffff812f72d4       458         0      186      305       48  [k] unmap_region               [kernel.kallsyms]  mm.h:1945            0  1

  -------------------------------------------------------------
      1      108      136        0        0      0xff110002a24ffb80
  -------------------------------------------------------------
   57.41%   51.47%    0.00%    0.00%    0x10     0       1  0xffffffff812e24a4       386       157       62      212       44  [k] vmacache_find              [kernel.kallsyms]  vmacache.c:49        0  1
   42.59%   48.53%    0.00%    0.00%    0x18     0       1  0xffffffff812f73cc       392       164       86      181       44  [k] get_unmapped_area          [kernel.kallsyms]  mmap.c:2270          0  1

  -------------------------------------------------------------
      2      111      128        0        0      0xff1100209ae58b00
  -------------------------------------------------------------
   21.62%   25.78%    0.00%    0.00%    0x20     1       1  0xffffffff812f80f9       410       166      146      106       39  [k] __vma_adjust               [kernel.kallsyms]  mmap.c:536           0  1
   16.22%   13.28%    0.00%    0.00%    0x20     1       1  0xffffffff812f7537       401       170      277       87       38  [k] find_vma                   [kernel.kallsyms]  mmap.c:2321          0  1
   49.55%   53.12%    0.00%    0.00%    0x28     1       1  0xffffffff812f92bb       406       168       84      204       47  [k] vm_unmapped_area           [kernel.kallsyms]  mmap.c:2065          0  1
   10.81%    7.81%    0.00%    0.00%    0x28     1       1  0xffffffff812f92bf       399       162       61       49       26  [k] vm_unmapped_area           [kernel.kallsyms]  mmap.c:2065          0  1
    1.80%    0.00%    0.00%    0.00%    0x28     1       1  0xffffffff812f7b99       424         0        0        4        2  [k] __vma_link_rb              [kernel.kallsyms]  mmap.c:434           0  1

  -------------------------------------------------------------
      3      123       99        0        0      0xff1100209ae58580
  -------------------------------------------------------------
   63.41%   40.40%    0.00%    0.00%     0x0     1       1  0xffffffff812f906b       391       199       94      233       46  [k] vm_unmapped_area           [kernel.kallsyms]  mmap.c:2060          0  1
   12.20%   14.14%    0.00%    0.00%     0x0     1       1  0xffffffff812f810d       387       159      159       62       29  [k] __vma_adjust               [kernel.kallsyms]  mmap.c:542           0  1
    1.63%   15.15%    0.00%    0.00%     0x0     1       1  0xffffffff812f7dd1       398       174       72       28       20  [k] __vma_adjust               [kernel.kallsyms]  mmap.c:756           0  1
    8.94%    5.05%    0.00%    0.00%     0x0     1       1  0xffffffff812f7544       384       163      236       41       23  [k] find_vma                   [kernel.kallsyms]  mmap.c:2317          0  1
    0.81%    0.00%    0.00%    0.00%     0x0     1       1  0xffffffff812f79a3       365         0      147        6        5  [k] __vma_rb_erase             [kernel.kallsyms]  mmap.c:301           0  1
    0.00%    1.01%    0.00%    0.00%     0x0     1       1  0xffffffff812f998a         0       160       87        9        9  [k] __do_munmap                [kernel.kallsyms]  mmap.c:301           0  1
    0.81%    0.00%    0.00%    0.00%    0x20     1       1  0xffffffff812f7b90       399         0      294       49       29  [k] __vma_link_rb              [kernel.kallsyms]  mmap.c:434           0  1
    0.00%    1.01%    0.00%    0.00%    0x28     1       1  0xffffffff812f80f9         0       184      193       42       23  [k] __vma_adjust               [kernel.kallsyms]  mmap.c:536           0  1
    8.94%    7.07%    0.00%    0.00%    0x30     1       1  0xffffffff812f7b99       378       167      126       34       16  [k] __vma_link_rb              [kernel.kallsyms]  mmap.c:434           0  1
    0.00%   13.13%    0.00%    0.00%    0x30     1       1  0xffffffff812f90a9         0       164      335       29       24  [k] vm_unmapped_area           [kernel.kallsyms]  mmap.c:2085          0  1
    3.25%    2.02%    0.00%    0.00%    0x30     1       1  0xffffffff812f9959       369       170      231       22       11  [k] __do_munmap                [kernel.kallsyms]  mmap.c:434           0  1
    0.00%    1.01%    0.00%    0.00%    0x30     1       1  0xffffffff812f796b         0       148      219        9        8  [k] __vma_rb_erase             [kernel.kallsyms]  mmap.c:434           0  1

[-- Attachment #5: c2c_old.log --]
[-- Type: text/plain, Size: 39473 bytes --]

=================================================
            Trace Event Information
=================================================
  Total records                     :     419820
  Locked Load/Store Operations      :       8851
  Load Operations                   :      73936
  Loads - uncacheable               :          0
  Loads - IO                        :          0
  Loads - Miss                      :          0
  Loads - no mapping                :        650
  Load Fill Buffer Hit              :      19371
  Load L1D hit                      :      25362
  Load L2D hit                      :        621
  Load LLC hit                      :      16164
  Load Local HITM                   :       7915
  Load Remote HITM                  :       7527
  Load Remote HIT                   :          0
  Load Local DRAM                   :        795
  Load Remote DRAM                  :      10973
  Load MESI State Exclusive         :      10973
  Load MESI State Shared            :        795
  Load LLC Misses                   :      19295
  Load access blocked by data      :           0
  Load access blocked by address   :           0
  LLC Misses to Local DRAM          :       4.1%
  LLC Misses to Remote DRAM         :      56.9%
  LLC Misses to Remote cache (HIT)  :       0.0%
  LLC Misses to Remote cache (HITM) :      39.0%
  Store Operations                  :          0
  Store - uncacheable               :          0
  Store - no mapping                :          0
  Store L1D Hit                     :          0
  Store L1D Miss                    :          0
  No Page Map Rejects               :      71173
  Unable to parse data source       :     345884

=================================================
    Global Shared Cache Line Event Information
=================================================
  Total Shared Cache Lines          :       1755
  Load HITs on shared lines         :      48124
  Fill Buffer Hits on shared lines  :      13949
  L1D hits on shared lines          :       7439
  L2D hits on shared lines          :         83
  LLC hits on shared lines          :      15246
  Locked Access on shared lines     :       6623
  Blocked Access on shared lines    :          0
  Store HITs on shared lines        :          0
  Store L1D hits on shared lines    :          0
  Total Merged records              :      15442

=================================================
                 c2c details
=================================================
  Events                            : cpu/mem-loads,ldlat=30/P
                                    : cpu/mem-stores/P
  Cachelines sort on                : Total HITMs
  Cacheline data grouping           : offset,iaddr

=================================================
           Shared Data Cache Line Table
=================================================
#
#        ----------- Cacheline ----------      Tot  ------- Load Hitm -------    Total    Total    Total  ---- Stores ----  ----- Core Load Hit -----  - LLC Load Hit --  - RMT Load Hit --  --- Load Dram ----
# Index             Address  Node  PA cnt     Hitm    Total  LclHitm  RmtHitm  records    Loads   Stores    L1Hit   L1Miss       FB       L1       L2   LclHit  LclHitm   RmtHit  RmtHitm      Lcl      Rmt
# .....  ..................  ....  ......  .......  .......  .......  .......  .......  .......  .......  .......  .......  .......  .......  .......  .......  .......  .......  .......  .......  .......
# 0 0xff11004077f52280 1 6828 25.29% 3905 1988 1917 10654 10654 0 0 0 1804 1048 2 757 1988 0 1917 158 2980 1 0xff11004077f52240 1 4806 18.11% 2796 909 1887 9484 9484 0 0 0 1020 2800 0 465 909 0 1887 223 2180 2 0xff11004077f52200 1 459 1.28% 197 114 83 748 748 0 0 0 377 1 1 89 114 0 83 0 83 3 0xff11004077f522c0 1 1929 1.13% 175 171 4 1704 1704 0 0 0 1117 0 1 209 171 0 4 13 189 4 0xff110002d1012400 0 857 1.02% 158 78 80 817 817 0 0 0 413 96 0 67 78 0 80 3 80 5 0xff1100407dadc980 1 724 0.93% 144 58 86 792 792 0 0 0 465 46 0 49 58 0 86 0 88 6 0xff1100407dadcd40 1 1036 0.87% 134 60 74 1157 1157 0 0 0 828 59 0 62 60 0 74 0 74 7 0xff11003fdcdaba00 1 200 0.66% 102 55 47 278 278 0 0 0 47 21 0 51 55 0 47 4 53 8 0xff110002d10123c0 0 158 0.63% 97 46 51 577 577 0 0 0 147 18 0 256 46 0 51 0 59 9 0xff11003fdca6ba00 1 147 0.58% 89 45 44 258 258 0 0 0 44 15 0 49 45 0 44 3 58 10 0xff110020001aba00 0 174 0.56% 86 46 40 253 253 0 0 0 31 28 0 57 46 0 40 4 47 11 0xff11001ffffaba00 0 155 0.54% 84 37 47 236 236 0 0 0 40 19 0 35 37 0 47 4 54 12 0xff11001fffdaba00 0 152 0.54% 83 40 43 237 237 0 0 0 34 21 0 40 40 0 43 3 56 13 0xff11003fdca2ba00 1 169 0.54% 83 52 31 237 237 0 0 0 48 22 0 45 52 0 31 3 36 14 0xff11003fdcdeba00 1 158 0.54% 83 40 43 232 232 0 0 0 38 21 0 33 40 0 43 5 52 15 0xff11001fffe2ba00 0 136 0.53% 82 43 39 210 210 0 0 0 32 14 0 25 43 0 39 9 48 16 0xff11001fffe6ba00 0 145 0.52% 81 43 38 228 228 0 0 0 47 12 0 40 43 0 38 4 44 17 0xff11003fdcb2ba00 1 133 0.52% 81 36 45 242 242 0 0 0 43 23 0 36 36 0 45 1 58 18 0xff11003fdcd6ba00 1 153 0.52% 80 40 40 237 237 0 0 0 42 15 0 43 40 0 40 6 51 19 0xff11001fffeaba00 0 138 0.51% 78 42 36 220 220 0 0 0 31 18 0 42 42 0 36 5 46 20 0xff11003fdcc2ba00 1 142 0.51% 78 40 38 217 217 0 0 0 40 18 0 31 40 0 38 3 47 21 0xff11001fffdeba00 0 143 0.50% 77 42 35 218 218 0 0 0 31 27 0 39 42 0 35 2 42 22 0xff11001fffeeba00 0 154 0.50% 77 34 43 226 226 0 0 0 37 19 0 38 34 0 43 4 51 23 0xff11004077f52580 1 1 0.50% 77 77 0 228 228 0 0 0 23 0 0 61 77 0 0 1 66 
24 0xff11001ffff2ba00 0 138 0.49% 76 41 35 223 223 0 0 0 37 17 0 40 41 0 35 5 48 25 0xff11003fdcc6ba00 1 155 0.49% 76 44 32 224 224 0 0 0 44 23 0 37 44 0 32 4 40 26 0xff11003fdccaba00 1 123 0.49% 76 48 28 195 195 0 0 0 31 20 0 29 48 0 28 2 37 27 0xff11001fffd6ba00 0 155 0.48% 74 41 33 232 232 0 0 0 36 26 0 51 41 0 33 8 37 28 0xff1100200002ba00 0 174 0.48% 74 34 40 240 240 0 0 0 42 21 0 49 34 0 40 0 54 29 0xff11003fdcbaba00 1 158 0.48% 74 33 41 235 235 0 0 0 40 17 0 50 33 0 41 3 51 30 0xff11003fdcd2ba00 1 167 0.48% 74 37 37 242 242 0 0 0 57 17 0 45 37 0 37 5 44 31 0xff11001ffff6ba00 0 146 0.47% 73 33 40 215 215 0 0 0 35 26 0 32 33 0 40 4 45 32 0xff11003fdcb6ba00 1 144 0.47% 73 32 41 217 217 0 0 0 46 13 0 33 32 0 41 4 48 33 0xff110020000aba00 0 146 0.47% 72 29 43 220 220 0 0 0 33 22 0 39 29 0 43 6 48 34 0xff11003fdceaba00 1 111 0.46% 71 36 35 189 189 0 0 0 18 18 0 38 36 0 35 3 41 35 0xff11001fffc2ba00 0 138 0.45% 70 33 37 217 217 0 0 0 38 21 0 37 33 0 37 3 48 36 0xff11001ffffeba00 0 152 0.45% 70 39 31 204 204 0 0 0 29 23 0 39 39 0 31 7 36 37 0xff1100200016ba00 0 154 0.45% 70 36 34 241 241 0 0 0 47 29 0 39 36 0 34 6 50 38 0xff11001fffceba00 0 129 0.44% 68 28 40 219 219 0 0 0 36 23 0 41 28 0 40 5 46 39 0xff11003fdcf2ba00 1 118 0.44% 68 32 36 206 206 0 0 0 33 18 0 41 32 0 36 1 45 40 0xff1100200012ba00 0 126 0.43% 67 37 30 203 203 0 0 0 27 25 0 37 37 0 30 8 39 41 0xff11003fdcceba00 1 142 0.43% 67 28 39 223 223 0 0 0 42 23 0 41 28 0 39 2 48 42 0xff11003fdcfaba00 1 119 0.43% 67 31 36 200 200 0 0 0 36 14 0 34 31 0 36 3 46 43 0xff110020000eba00 0 106 0.42% 65 36 29 174 174 0 0 0 25 16 0 23 36 0 29 2 43 44 0xff11003fdcaaba00 1 144 0.42% 65 32 33 216 216 0 0 0 46 23 0 36 32 0 33 2 44 45 0xff110020001eba00 0 136 0.41% 64 24 40 223 223 0 0 0 44 17 0 39 24 0 40 4 55 46 0xff11003fdceeba00 1 115 0.41% 63 32 31 200 200 0 0 0 37 12 0 46 32 0 31 4 38 47 0xff11003fdcaeba00 1 134 0.40% 61 32 29 196 196 0 0 0 37 24 0 27 32 0 29 5 42 48 0xff11003fdce6ba00 1 126 0.40% 61 36 25 184 184 0 0 
0 24 19 0 49 36 0 25 3 28 49 0xff11003fdcfeba00 1 107 0.39% 60 36 24 183 183 0 0 0 38 9 0 37 36 0 24 3 36 50 0xff11003fdce2ba00 1 126 0.38% 58 33 25 187 187 0 0 0 38 13 0 37 33 0 25 8 33 51 0xff11001fffd2ba00 0 108 0.36% 56 31 25 154 154 0 0 0 26 16 0 23 31 0 25 0 33 52 0xff11004077f52540 1 1 0.36% 56 56 0 155 155 0 0 0 6 0 0 40 56 0 0 2 51 53 0xff11001fffcaba00 0 100 0.36% 55 31 24 172 172 0 0 0 27 17 0 35 31 0 24 3 35 54 0xff11003fdcf6ba00 1 105 0.35% 54 29 25 163 163 0 0 0 23 20 0 25 29 0 25 5 36 55 0xff1100407dadcd80 1 183 0.34% 53 43 10 345 345 0 0 0 156 17 0 86 43 0 10 5 28 56 0xff11001fffc6ba00 0 129 0.34% 52 20 32 201 201 0 0 0 48 18 0 36 20 0 32 3 44 57 0xff1100200006ba00 0 152 0.34% 52 21 31 201 201 0 0 0 49 22 0 38 21 0 31 4 36 58 0xff11004077f524c0 1 397 0.34% 52 52 0 626 626 0 0 0 445 0 0 63 52 0 0 1 65 59 0xff11003fdcbeba00 1 82 0.32% 50 22 28 141 141 0 0 0 24 13 0 20 22 0 28 3 31 60 0xff110020db958000 1 85 0.32% 49 18 31 129 129 0 0 0 25 13 0 11 18 0 31 0 31 61 0xff110001b550c000 0 102 0.26% 40 15 25 131 131 0 0 0 23 20 0 23 15 0 25 0 25 62 0xff11000185270000 0 96 0.24% 37 16 21 123 123 0 0 0 30 11 0 24 16 0 21 0 21 63 0xff110040745f0000 1 100 0.24% 37 16 21 132 132 0 0 0 31 15 0 27 16 0 21 0 22 64 0xff11000116e68000 0 102 0.23% 36 18 18 131 131 0 0 0 36 20 0 21 18 0 18 0 18 65 0xff110002405e4000 0 100 0.23% 36 13 23 131 131 0 0 0 36 18 0 17 13 0 23 1 23 66 0xff1100209ebfc000 1 117 0.23% 36 13 23 146 146 0 0 0 38 16 0 33 13 0 23 0 23 67 0xff11004075200000 1 83 0.23% 36 15 21 132 132 0 0 0 34 18 0 23 15 0 21 0 21 68 0xff1100209c38c000 1 98 0.23% 35 17 18 139 139 0 0 0 37 20 0 29 17 0 18 0 18 69 0xff11003fdc5eba00 1 69 0.23% 35 20 15 104 104 0 0 0 21 14 0 10 20 0 15 2 22 70 0xff1100407546c000 1 101 0.23% 35 17 18 147 147 0 0 0 54 19 0 20 17 0 18 0 19 71 0xff1100012d434000 0 115 0.22% 34 17 17 147 147 0 0 0 43 22 0 29 17 0 17 0 19 72 0xff110002704b4000 0 81 0.22% 34 17 17 116 116 0 0 0 29 14 0 21 17 0 17 0 18 73 0xff110002d0b18000 0 90 0.22% 34 17 17 
114 114 0 0 0 34 13 0 14 17 0 17 0 19 74 0xff1100406c458000 1 105 0.22% 34 16 18 127 127 0 0 0 34 23 0 18 16 0 18 0 18 75 0xff1100407eeb8000 1 97 0.22% 34 16 18 133 133 0 0 0 46 19 0 16 16 0 18 0 18 76 0xff110020b69bc000 1 105 0.21% 33 16 17 134 134 0 0 0 45 16 0 22 16 0 17 0 18 77 0xffffffff835477c0 1 1 0.21% 32 13 19 51 51 0 0 0 0 0 0 0 13 0 19 0 19 78 0xff110001b2a1ba00 0 215 0.21% 32 19 13 330 330 0 0 0 4 37 0 244 19 0 13 0 13 79 0xff110002720d8000 0 98 0.21% 32 15 17 129 129 0 0 0 33 20 0 27 15 0 17 0 17 80 0xff1100209cf60000 1 92 0.21% 32 15 17 108 108 0 0 0 30 11 0 18 15 0 17 0 17 81 0xff1100011628c000 0 95 0.20% 31 17 14 112 112 0 0 0 30 12 0 25 17 0 14 0 14 82 0xff110020b5cac000 1 93 0.20% 31 14 17 133 133 0 0 0 49 13 0 22 14 0 17 0 18 83 0xff11000156b04000 0 90 0.19% 30 15 15 100 100 0 0 0 27 13 0 15 15 0 15 0 15 84 0xff110001570cc000 0 104 0.19% 30 18 12 106 106 0 0 0 24 19 0 21 18 0 12 0 12 85 0xff1100407ed74000 1 70 0.19% 30 14 16 89 89 0 0 0 23 7 0 13 14 0 16 0 16 86 0xff1100209d8fc000 1 87 0.19% 29 9 20 121 121 0 0 0 39 16 0 16 9 0 20 0 21 87 0xff1100407db84000 1 78 0.19% 29 14 15 103 103 0 0 0 32 8 0 19 14 0 15 0 15 88 0xff11000116638000 0 84 0.18% 28 15 13 116 116 0 0 0 32 17 0 26 15 0 13 0 13 89 0xff110001b2fb4000 0 59 0.18% 28 15 13 79 79 0 0 0 15 7 0 16 15 0 13 0 13 90 0xff1100010927c000 0 87 0.17% 27 13 14 134 134 0 0 0 60 15 0 17 13 0 14 0 15 91 0xff11000185274000 0 88 0.17% 27 13 14 112 112 0 0 0 24 20 0 26 13 0 14 0 15 92 0xff1100209a4f4000 1 91 0.17% 27 13 14 120 120 0 0 0 36 14 0 28 13 0 14 0 15 93 0xff110001b343c000 0 96 0.17% 26 9 17 119 119 0 0 0 29 27 0 19 9 0 17 0 18 94 0xff1100021189c000 0 87 0.17% 26 12 14 102 102 0 0 0 22 19 0 21 12 0 14 0 14 95 0xff110001e3704000 0 78 0.16% 25 17 8 80 80 0 0 0 17 15 0 13 17 0 8 0 10 96 0xff110002703b0000 0 95 0.16% 25 12 13 129 129 0 0 0 41 25 0 23 12 0 13 0 15 97 0xff110020b5ef0000 1 91 0.16% 25 10 15 111 111 0 0 0 29 16 0 26 10 0 15 0 15 98 0xff110020dd11c000 1 90 0.16% 24 12 12 106 106 0 0 0 32 
15 0 22 12 0 12 1 12 99 0xff11000116b38000 0 69 0.15% 23 9 14 103 103 0 0 0 30 20 0 16 9 0 14 0 14 100 0xff11000211070000 0 85 0.15% 23 9 14 100 100 0 0 0 26 18 0 18 9 0 14 0 15 101 0xff1100407303c000 1 90 0.15% 23 8 15 119 119 0 0 0 40 18 0 20 8 0 15 0 18 102 0xff1100407db80000 1 84 0.15% 23 10 13 109 109 0 0 0 44 8 1 20 10 0 13 0 13 103 0xff1100012df30000 0 102 0.14% 22 13 9 108 108 0 0 0 29 25 0 22 13 0 9 0 10 104 0xff110020b69b8000 1 84 0.14% 21 9 12 117 117 0 0 0 43 19 0 21 9 0 12 0 13 105 0xff110001e1ff0000 0 87 0.13% 20 9 11 110 110 0 0 0 42 21 0 14 9 0 11 0 13 106 0xff1100407348c000 1 75 0.13% 20 9 11 98 98 0 0 0 27 19 0 20 9 0 11 0 12 107 0xff11004075204000 1 81 0.13% 20 9 11 104 104 0 0 0 30 21 0 21 9 0 11 0 12 108 0xff110020b7eb8000 1 64 0.12% 19 7 12 93 93 0 0 0 45 5 0 12 7 0 12 0 12 109 0xff110001e47a0000 0 81 0.12% 18 5 13 93 93 0 0 0 24 21 0 16 5 0 13 0 14 110 0xff1100010923c000 0 91 0.11% 17 9 8 92 92 0 0 0 30 17 0 18 9 0 8 0 10 111 0xff110001b550c040 0 3 0.10% 16 16 0 40 40 0 0 0 4 0 0 2 16 0 0 0 18 112 0xff110002d10124c0 0 428 0.10% 16 14 2 646 646 0 0 0 357 3 0 236 14 0 2 0 34 113 0xff11003fdc6aba00 1 35 0.10% 16 9 7 47 47 0 0 0 10 8 0 3 9 0 7 0 10 ================================================= Shared Cache Line Distribution Pareto ================================================= # # ----- HITM ----- -- Store Refs -- --------- Data address --------- ---------- cycles ---------- Total cpu Shared # Num RmtHitm LclHitm L1 Hit L1 Miss Offset Node PA cnt Code address rmt hitm lcl hitm load records cnt Symbol Object Source:Line Node # ..... ....... ....... ....... ....... .................. .... ...... .................. ........ ........ ........ ....... ........ .............................. ................. ................ .... 
  -------------------------------------------------------------
      0     1917     1988        0        0      0xff11004077f52280
  -------------------------------------------------------------
   53.31%   49.04%    0.00%    0.00%     0x0     1       1  0xffffffff81157918       497       277      465     3558       56  [k] rwsem_spin_on_owner        [kernel.kallsyms]  atomic64_64.h:22     0  1
   30.57%   35.87%    0.00%    0.00%     0x0     1       1  0xffffffff81157887       413       228      392     2686       55  [k] rwsem_spin_on_owner        [kernel.kallsyms]  atomic64_64.h:22     0  1
   13.72%   12.02%    0.00%    0.00%     0x0     1       1  0xffffffff81157be8       483       187      328      836       52  [k] rwsem_down_write_slowpath  [kernel.kallsyms]  atomic64_64.h:22     0  1
    0.37%    0.55%    0.00%    0.00%     0x0     1       1  0xffffffff811578be       411       213      309     1156       52  [k] rwsem_spin_on_owner        [kernel.kallsyms]  atomic64_64.h:22     0  1
    0.21%    0.60%    0.00%    0.00%     0x0     1       1  0xffffffff81157a71       557       255      442       71       36  [k] rwsem_optimistic_spin      [kernel.kallsyms]  atomic64_64.h:22     0  1
    0.05%    0.15%    0.00%    0.00%     0x0     1       1  0xffffffff811577e8       379       445      247      194       52  [k] downgrade_write            [kernel.kallsyms]  atomic64_64.h:22     0  1
    0.05%    0.05%    0.00%    0.00%     0x0     1       1  0xffffffff81157760       386       172      300        5        4  [k] up_read                    [kernel.kallsyms]  atomic64_64.h:22     0  1
    1.10%    0.25%    0.00%    0.00%     0x8     1       1  0xffffffff81158775      1059       580      939      617       54  [k] osq_lock                   [kernel.kallsyms]  atomic.h:208         0  1
    0.26%    0.40%    0.00%    0.00%     0x8     1       1  0xffffffff81158893      1153       408      561      774       55  [k] osq_unlock                 [kernel.kallsyms]  atomic.h:196         0  1
    0.05%    0.00%    0.00%    0.00%     0xc     1       1  0xffffffff81cb8d19       560         0      548       34       46  [k] _raw_spin_lock_irqsave     [kernel.kallsyms]  atomic.h:202         0  1
    0.05%    0.15%    0.00%    0.00%    0x10     1       1  0xffffffff81157396       796       234      436       42       30  [k] rwsem_mark_wake            [kernel.kallsyms]  rwsem.c:414          0  1
    0.16%    0.05%    0.00%    0.00%    0x10     1       1  0xffffffff811576ba       543       209      554       26       20  [k] rwsem_wake.isra.0          [kernel.kallsyms]  list.h:282           0  1
    0.10%    0.86%    0.00%    0.00%    0x38     1       1  0xffffffff812f96c2       408       166      146      158       46  [k] __do_munmap                [kernel.kallsyms]  mm.h:1951            0  1

  -------------------------------------------------------------
      1     1887      909        0        0      0xff11004077f52240
  -------------------------------------------------------------
    0.58%    2.42%    0.00%    0.00%     0x8     1       1  0xffffffff812f8eea       449       214      266       83       33  [k] vm_unmapped_area           [kernel.kallsyms]  mmap.c:2048          0  1
    0.00%    1.87%    0.00%    0.00%    0x10     1       1  0xffffffff812eaec3         0       172      130      124       48  [k] free_pgd_range             [kernel.kallsyms]  pgtable.h:106        0  1
    0.32%    0.22%    0.00%    0.00%    0x10     1       1  0xffffffff812ed581       410       320       86      156       46  [k] unmap_page_range           [kernel.kallsyms]  pgtable.h:106        0  1
    0.48%    0.44%    0.00%    0.00%    0x18     1       1  0xffffffff81125dab       407       332       96       24       13  [k] finish_task_switch         [kernel.kallsyms]  atomic.h:29          0  1
    0.00%    0.22%    0.00%    0.00%    0x20     1       1  0xffffffff81cb2ed4         0       749      388       11       15  [k] __schedule                 [kernel.kallsyms]  atomic.h:95          0  1
    0.00%    0.11%    0.00%    0.00%    0x20     1       1  0xffffffff81125cad         0      1506      641       11       14  [k] finish_task_switch         [kernel.kallsyms]  atomic.h:123         0  1
    7.53%    5.39%    0.00%    0.00%    0x30     1       1  0xffffffff812fb7d5       619       216      667      466       52  [k] do_mmap                    [kernel.kallsyms]  mmap.c:1445          0  1
    4.72%    2.53%    0.00%    0.00%    0x30     1       1  0xffffffff812f9533       459       172      184      250       50  [k] __do_munmap                [kernel.kallsyms]  mmap.c:2853          0  1
    2.17%    2.86%    0.00%    0.00%    0x30     1       1  0xffffffff812f800f       405       169      220      233       52  [k] __vma_adjust               [kernel.kallsyms]  mmap.c:719           0  1
    0.05%    1.21%    0.00%    0.00%    0x30     1       1  0xffffffff812f7f3d       421       165      302       41       42  [k] __vma_adjust               [kernel.kallsyms]  mmap.c:958           0  1
    0.11%    0.22%    0.00%    0.00%    0x30     1       1  0xffffffff812f964c       487       156      335      123       55  [k] __do_munmap                [kernel.kallsyms]  mmap.c:2697          0  1
   69.79%   75.91%    0.00%    0.00%    0x38     1       1  0xffffffff811579f7       436       177      342     3585       54  [k] rwsem_optimistic_spin      [kernel.kallsyms]  atomic64_64.h:22     0  1
   13.78%    5.28%    0.00%    0.00%    0x38     1       1  0xffffffff81157a12      1240       483      484     1354       55  [k] rwsem_optimistic_spin      [kernel.kallsyms]  atomic64_64.h:190    0  1
    0.16%    0.44%    0.00%    0.00%    0x38     1       1  0xffffffff81157794      1128       695      539      545       54  [k] up_write                   [kernel.kallsyms]  atomic64_64.h:172    0  1
    0.00%    0.55%    0.00%    0.00%    0x38     1       1  0xffffffff81157a7d         0       221      314      552       52  [k] rwsem_optimistic_spin      [kernel.kallsyms]  atomic64_64.h:22     0  1
    0.11%    0.11%    0.00%    0.00%    0x38     1       1  0xffffffff811577e3      1386       415      381      355       54  [k] downgrade_write            [kernel.kallsyms]  atomic64_64.h:172    0  1
    0.11%    0.00%    0.00%    0.00%    0x38     1       1  0xffffffff81cb6168      1250         0      502      930       55  [k] down_write_killable        [kernel.kallsyms]  atomic64_64.h:190    0  1
    0.00%    0.11%    0.00%    0.00%    0x38     1       1  0xffffffff81157c9c         0       146        0        1        1  [k] rwsem_down_write_slowpath  [kernel.kallsyms]  atomic64_64.h:22     0
    0.00%    0.11%    0.00%    0.00%    0x38     1       1  0xffffffff81157ce6         0       230     2218        2        2  [k] rwsem_down_write_slowpath  [kernel.kallsyms]  atomic64_64.h:22     0
    0.05%    0.00%    0.00%    0.00%    0x38     1       1  0xffffffff81157e2c       394         0        0        2        1  [k] rwsem_down_write_slowpath  [kernel.kallsyms]  atomic64_64.h:22     0
    0.05%    0.00%    0.00%    0.00%    0x38     1       1  0xffffffff81157e49       501         0       82        5        4  [k] rwsem_down_write_slowpath  [kernel.kallsyms]  atomic64_64.h:22     0  1

  -------------------------------------------------------------
      2       83      114        0        0      0xff11004077f52200
  -------------------------------------------------------------
   49.40%   41.23%    0.00%    0.00%    0x10     1       1  0xffffffff812e2464       391       166      141      173       46  [k] vmacache_find              [kernel.kallsyms]  vmacache.c:49        0  1
   50.60%   58.77%    0.00%    0.00%    0x18     1       1  0xffffffff812f728c       394       165       68      203       48  [k] get_unmapped_area          [kernel.kallsyms]  mmap.c:2270          0  1

  -------------------------------------------------------------
      3        4      171        0        0      0xff11004077f522c0
  -------------------------------------------------------------
    0.00%    5.85%    0.00%    0.00%     0x0     1       1  0xffffffff812f96b0         0       170      247       90       39  [k] __do_munmap                [kernel.kallsyms]  mm.h:1951            0  1
    0.00%    4.68%    0.00%    0.00%     0x0     1       1  0xffffffff812fb228         0       176      236       47       54  [k] mmap_region                [kernel.kallsyms]  mmap.c:3384          0  1
   25.00%    1.17%    0.00%    0.00%     0x0     1       1  0xffffffff811f12f0       453       726      273       34       27  [k] __acct_update_integrals    [kernel.kallsyms]  tsacct.c:140         0  1
    0.00%    1.17%    0.00%    0.00%     0x0     1       1  0xffffffff812fa195         0       200      210      361       53  [k] may_expand_vm              [kernel.kallsyms]  mmap.c:3359          0  1
    0.00%   22.22%    0.00%    0.00%     0x8     1       1  0xffffffff812f959a         0       166      178      177       51  [k] __do_munmap                [kernel.kallsyms]  mmap.c:2889          0  1
    0.00%    0.58%    0.00%    0.00%    0x18     1       1  0xffffffff812fa1cf         0       177      208      306       52  [k] may_expand_vm              [kernel.kallsyms]  mmap.c:3363          0  1
   75.00%   64.33%    0.00%    0.00%    0x30     1       1  0xffffffff812fb836       371       234      237      329       51  [k] do_mmap                    [kernel.kallsyms]  mmap.c:1472          0  1

  -------------------------------------------------------------
      4       80       78        0        0      0xff110002d1012400
  -------------------------------------------------------------
    0.00%    2.56%    0.00%    0.00%     0x0     0       1  0xffffffff812f8f1e         0       232      235       74       34  [k] vm_unmapped_area           [kernel.kallsyms]  mmap.c:2060          0  1
   22.50%    6.41%    0.00%    0.00%    0x10     0       1  0xffffffff812f7406       382       188      177       75       35  [k] find_vma                   [kernel.kallsyms]  mmap.c:2323          0  1
    1.25%    0.00%    0.00%    0.00%    0x10     0       1  0xffffffff812fb12f        71         0      101        6        4  [k] mmap_region                [kernel.kallsyms]  mmap.c:536           0  1
   76.25%   91.03%    0.00%    0.00%    0x20     0       1  0xffffffff812f917b       393       172       69      239       51  [k] vm_unmapped_area           [kernel.kallsyms]  mmap.c:2065          0  1

^ permalink raw reply	[flat|nested] 13+ messages in thread
* Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression 2021-06-06 10:16 ` Feng Tang @ 2021-06-06 19:20 ` Linus Torvalds 2021-06-06 22:13 ` Waiman Long 2021-06-07 6:05 ` Feng Tang 0 siblings, 2 replies; 13+ messages in thread From: Linus Torvalds @ 2021-06-06 19:20 UTC (permalink / raw) To: Feng Tang, Waiman Long Cc: Jason Gunthorpe, kernel test robot, John Hubbard, Jan Kara, Peter Xu, Andrea Arcangeli, Aneesh Kumar K.V, Christoph Hellwig, Hugh Dickins, Jann Horn, Kirill Shutemov, Kirill Tkhai, Leon Romanovsky, Michal Hocko, Oleg Nesterov, Andrew Morton, LKML, lkp, kernel test robot, Huang, Ying, zhengjun.xing [ Adding Waiman Long to the participants, because this seems to be a very specific cacheline alignment behavior of rwsems, maybe Waiman has some comments ] On Sun, Jun 6, 2021 at 3:16 AM Feng Tang <feng.tang@intel.com> wrote: > > * perf-c2c: The hotspots(HITM) for 2 kernels are different due to the > data structure change > > - old kernel > > - first cacheline > mmap_lock->count (75%) > mm->mapcount (14%) > > - second cacheline > mmap_lock->owner (97%) > > - new kernel > > mainly in the cacheline of 'mmap_lock' > > mmap_lock->count (~2%) > mmap_lock->owner (95%) Oooh. It looks like pretty much all the contention is on mmap_lock, and the difference is that the old kernel just _happened_ to split the mmap_lock rwsem at *exactly* the right place. The rw_semaphore structure looks like this: struct rw_semaphore { atomic_long_t count; atomic_long_t owner; struct optimistic_spin_queue osq; /* spinner MCS lock */ ... and before the addition of the 'write_protect_seq' field, the mmap_sem was at offset 120 in 'struct mm_struct'. Which meant that count and owner were in two different cachelines, and then when you have contention and spend time in rwsem_down_write_slowpath(), this is probably *exactly* the kind of layout you want. 
Because first the rwsem_write_trylock() will do a cmpxchg on the first cacheline (for the optimistic fast-path), and then in the case of contention, rwsem_down_write_slowpath() will just access the second cacheline. Which is probably just optimal for a load that spends a lot of time contended - new waiters touch that first cacheline, and then they queue themselves up on the second cacheline. Waiman, does that sound believable? Anyway, I'm certainly ok with the patch that just moves 'write_protect_seq' down, it might be worth commenting about how this is about some very special cache layout of the mmap_sem part of the structure. That said, this means that it all is very subtle dependent on a lot of kernel config options, and I'm not sure how relevant the exact kernel options are that the test robot has been using. But even if this is just a "kernel test robot reports", I think it's an interesting case and worth a comment for when this happens next time... Linus ^ permalink raw reply [flat|nested] 13+ messages in thread
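[Editorial aside: the offset arithmetic in the message above can be checked with a small standalone C sketch. The struct and field names below are invented stand-ins; only the two rwsem words and the 120-vs-128 offsets come from the discussion, and a 64-byte cacheline is assumed.]

```c
#include <stddef.h>

/* First two words of the kernel's rw_semaphore, simplified. */
struct mock_rwsem {
	long count;  /* cmpxchg'd by the trylock fast path    */
	long owner;  /* read in a loop by optimistic spinners */
};

/* Stand-ins for mm_struct: the lock at offset 120 (before commit
 * 57efa1fe5957) vs. offset 128 (after it).  The char arrays are
 * fake padding, not the real preceding fields. */
struct mm_before { char pad[120]; struct mock_rwsem mmap_lock; };
struct mm_after  { char pad[128]; struct mock_rwsem mmap_lock; };

/* Index of the 64-byte cacheline holding a given byte offset. */
static size_t cacheline(size_t off) { return off / 64; }

/* Before: count is at byte 120 (cacheline 1), owner at byte 128
 * (cacheline 2) -- the trylock and the spin loop touch different
 * cachelines. */
_Static_assert(offsetof(struct mm_before, mmap_lock) / 64 == 1,
	       "count on cacheline 1");
_Static_assert((offsetof(struct mm_before, mmap_lock) + 8) / 64 == 2,
	       "owner on cacheline 2");

/* After: both hot words share cacheline 2. */
_Static_assert(offsetof(struct mm_after, mmap_lock) / 64 ==
	       (offsetof(struct mm_after, mmap_lock) + 8) / 64,
	       "count and owner share a cacheline");
```

The `_Static_assert`s make the compiler itself verify the split, so just compiling the sketch confirms the layout claim.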
* Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression 2021-06-06 19:20 ` Linus Torvalds @ 2021-06-06 22:13 ` Waiman Long 2021-06-07 6:05 ` Feng Tang 1 sibling, 0 replies; 13+ messages in thread From: Waiman Long @ 2021-06-06 22:13 UTC (permalink / raw) To: Linus Torvalds, Feng Tang Cc: Jason Gunthorpe, kernel test robot, John Hubbard, Jan Kara, Peter Xu, Andrea Arcangeli, Aneesh Kumar K.V, Christoph Hellwig, Hugh Dickins, Jann Horn, Kirill Shutemov, Kirill Tkhai, Leon Romanovsky, Michal Hocko, Oleg Nesterov, Andrew Morton, LKML, lkp, kernel test robot, Huang, Ying, zhengjun.xing On 6/6/21 3:20 PM, Linus Torvalds wrote: > [ Adding Waiman Long to the participants, because this seems to be a > very specific cacheline alignment behavior of rwsems, maybe Waiman has > some comments ] > > On Sun, Jun 6, 2021 at 3:16 AM Feng Tang <feng.tang@intel.com> wrote: >> * perf-c2c: The hotspots(HITM) for 2 kernels are different due to the >> data structure change >> >> - old kernel >> >> - first cacheline >> mmap_lock->count (75%) >> mm->mapcount (14%) >> >> - second cacheline >> mmap_lock->owner (97%) >> >> - new kernel >> >> mainly in the cacheline of 'mmap_lock' >> >> mmap_lock->count (~2%) >> mmap_lock->owner (95%) > Oooh. > > It looks like pretty much all the contention is on mmap_lock, and the > difference is that the old kernel just _happened_ to split the > mmap_lock rwsem at *exactly* the right place. > > The rw_semaphore structure looks like this: > > struct rw_semaphore { > atomic_long_t count; > atomic_long_t owner; > struct optimistic_spin_queue osq; /* spinner MCS lock */ > ... > > and before the addition of the 'write_protect_seq' field, the mmap_sem > was at offset 120 in 'struct mm_struct'. > > Which meant that count and owner were in two different cachelines, and > then when you have contention and spend time in > rwsem_down_write_slowpath(), this is probably *exactly* the kind of > layout you want. 
> > Because first the rwsem_write_trylock() will do a cmpxchg on the first > cacheline (for the optimistic fast-path), and then in the case of > contention, rwsem_down_write_slowpath() will just access the second > cacheline. > > Which is probably just optimal for a load that spends a lot of time > contended - new waiters touch that first cacheline, and then they > queue themselves up on the second cacheline. Waiman, does that sound > believable? Yes, I think so. The count field is accessed when a task tries to acquire the rwsem or when a owner releases the lock. If the trylock fails, the writer will go into the slowpath doing optimistic spinning on the owner field. As a result, a lot of reads to owner are issued relative to the read/write of count. Normally, there should only be one spinner that has the OSQ lock spinning on owner and the 9% performance degradation seems a bit high to me. In the rare case that the head waiter in the wait queue sets the handoff flag, the waiter may also spin on owner causing a bit more contention on the owner cacheline. I will do further investigation on this possibility when I have time. Cheers, Longman ^ permalink raw reply [flat|nested] 13+ messages in thread
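[Editorial aside: the access pattern Waiman describes — one cmpxchg on `count`, then repeated loads of `owner` while spinning — can be sketched as a toy model. This is an illustration with invented names, not the kernel's rwsem code, which additionally handles readers, the OSQ queue, and handoff.]

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy model of the two hot fields and who touches them. */
struct toy_rwsem {
	atomic_long count;  /* written by every acquire/release        */
	atomic_long owner;  /* mostly *read*, in a loop, by a spinner  */
};

static bool toy_write_trylock(struct toy_rwsem *sem, long me)
{
	long unlocked = 0;
	/* One cmpxchg on the 'count' cacheline: the optimistic fast path. */
	if (atomic_compare_exchange_strong(&sem->count, &unlocked, 1)) {
		atomic_store(&sem->owner, me);
		return true;
	}
	return false;
}

static void toy_write_lock(struct toy_rwsem *sem, long me)
{
	while (!toy_write_trylock(sem, me)) {
		/* Contended: spin issuing loads of 'owner' -- in the old
		 * layout these hit a different cacheline than 'count'. */
		while (atomic_load(&sem->owner) != 0)
			;
	}
}

static void toy_write_unlock(struct toy_rwsem *sem)
{
	atomic_store(&sem->owner, 0);
	atomic_store(&sem->count, 0);
}
```

In the old layout the single cmpxchg and the many spin loads land on different cachelines; in the new layout they share one, which is where the extra bouncing comes from.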
* Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression 2021-06-06 19:20 ` Linus Torvalds 2021-06-06 22:13 ` Waiman Long @ 2021-06-07 6:05 ` Feng Tang 2021-06-08 0:03 ` Linus Torvalds 1 sibling, 1 reply; 13+ messages in thread From: Feng Tang @ 2021-06-07 6:05 UTC (permalink / raw) To: Linus Torvalds Cc: Waiman Long, Jason Gunthorpe, kernel test robot, John Hubbard, Jan Kara, Peter Xu, Andrea Arcangeli, Aneesh Kumar K.V, Christoph Hellwig, Hugh Dickins, Jann Horn, Kirill Shutemov, Kirill Tkhai, Leon Romanovsky, Michal Hocko, Oleg Nesterov, Andrew Morton, LKML, lkp, kernel test robot, Huang, Ying, zhengjun.xing On Sun, Jun 06, 2021 at 12:20:46PM -0700, Linus Torvalds wrote: > [ Adding Waiman Long to the participants, because this seems to be a > very specific cacheline alignment behavior of rwsems, maybe Waiman has > some comments ] > > On Sun, Jun 6, 2021 at 3:16 AM Feng Tang <feng.tang@intel.com> wrote: > > > > * perf-c2c: The hotspots(HITM) for 2 kernels are different due to the > > data structure change > > > > - old kernel > > > > - first cacheline > > mmap_lock->count (75%) > > mm->mapcount (14%) > > > > - second cacheline > > mmap_lock->owner (97%) > > > > - new kernel > > > > mainly in the cacheline of 'mmap_lock' > > > > mmap_lock->count (~2%) > > mmap_lock->owner (95%) > > Oooh. > > It looks like pretty much all the contention is on mmap_lock, and the > difference is that the old kernel just _happened_ to split the > mmap_lock rwsem at *exactly* the right place. > > The rw_semaphore structure looks like this: > > struct rw_semaphore { > atomic_long_t count; > atomic_long_t owner; > struct optimistic_spin_queue osq; /* spinner MCS lock */ > ... > > and before the addition of the 'write_protect_seq' field, the mmap_sem > was at offset 120 in 'struct mm_struct'. 
> > Which meant that count and owner were in two different cachelines, and > then when you have contention and spend time in > rwsem_down_write_slowpath(), this is probably *exactly* the kind of > layout you want. > > Because first the rwsem_write_trylock() will do a cmpxchg on the first > cacheline (for the optimistic fast-path), and then in the case of > contention, rwsem_down_write_slowpath() will just access the second > cacheline. > > Which is probably just optimal for a load that spends a lot of time > contended - new waiters touch that first cacheline, and then they > queue themselves up on the second cacheline. Waiman, does that sound > believable? > > Anyway, I'm certainly ok with the patch that just moves > 'write_protect_seq' down, it might be worth commenting about how this > is about some very special cache layout of the mmap_sem part of the > structure. > > That said, this means that it all is very subtle dependent on a lot of > kernel config options, and I'm not sure how relevant the exact kernel > options are that the test robot has been using. But even if this is > just a "kernel test robot reports", I think it's an interesting case > and worth a comment for when this happens next time... There are 3 kernel config options before 'mmap_lock' (inside 'mm_struct'): CONFIG_MMU CONFIG_MEMBARRIER CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES 0day's default kernel config is similar to RHEL-8.3, which has all these three enabled. IIUC, the first 2 options are 'y' for many common configs, while 'CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES' is only available on x86 system. 
Please review the updated patch, thanks

- Feng

From cbdbe70fb9e5bab2988d645c6f0f614d51b2e386 Mon Sep 17 00:00:00 2001
From: Feng Tang <feng.tang@intel.com>
Date: Fri, 4 Jun 2021 15:20:57 +0800
Subject: [PATCH] mm: relocate 'write_protect_seq' in struct mm_struct

0day robot reports a 9.2% regression for the will-it-scale mmap1 test
case[1], caused by commit 57efa1fe5957 ("mm/gup: prevent gup_fast from
racing with COW during fork").

Further debugging shows the regression is due to the commit changing
the offset of the hot field 'mmap_lock' inside 'mm_struct', and thus
its cache alignment. From the perf data, the contention on 'mmap_lock'
is very severe, taking around 95% of cpu cycles, and it is a
rw_semaphore:

    struct rw_semaphore {
            atomic_long_t count;              /* 8 bytes */
            atomic_long_t owner;              /* 8 bytes */
            struct optimistic_spin_queue osq; /* spinner MCS lock */
            ...

Before commit 57efa1fe5957 added 'write_protect_seq', the structure
happened to have a very optimal cache alignment layout, as Linus
explained:

  "and before the addition of the 'write_protect_seq' field, the
  mmap_sem was at offset 120 in 'struct mm_struct'.

  Which meant that count and owner were in two different cachelines,
  and then when you have contention and spend time in
  rwsem_down_write_slowpath(), this is probably *exactly* the kind of
  layout you want.

  Because first the rwsem_write_trylock() will do a cmpxchg on the
  first cacheline (for the optimistic fast-path), and then in the case
  of contention, rwsem_down_write_slowpath() will just access the
  second cacheline.

  Which is probably just optimal for a load that spends a lot of time
  contended - new waiters touch that first cacheline, and then they
  queue themselves up on the second cacheline."

After the commit, the rw_semaphore is at offset 128, which means the
'count' and 'owner' fields are now in the same cacheline, causing more
cache bouncing.

Currently there are three "#ifdef CONFIG_XXX" blocks before
'mmap_lock' which will affect its offset:

  CONFIG_MMU
  CONFIG_MEMBARRIER
  CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES

The layout above is for a 64-bit system with 0day's default kernel
config (similar to RHEL-8.3's config), in which all three options are
'y'. The layout can vary with different kernel configs.

Re-laying out a structure is usually a double-edged sword, as it can
help one case but hurt others. For this case, one solution is: as the
newly added 'write_protect_seq' is a 4-byte seqcount_t (when
CONFIG_DEBUG_LOCK_ALLOC=n), placing it into an existing 4-byte hole in
'mm_struct' will not change the other fields' alignment, while fixing
the regression.

[1]. https://lore.kernel.org/lkml/20210525031636.GB7744@xsang-OptiPlex-9020/

Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/mm_types.h | 27 ++++++++++++++++++++-------
 1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5aacc1c..cba6022 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -445,13 +445,6 @@ struct mm_struct {
 		 */
 		atomic_t has_pinned;
 
-		/**
-		 * @write_protect_seq: Locked when any thread is write
-		 * protecting pages mapped by this mm to enforce a later COW,
-		 * for instance during page table copying for fork().
-		 */
-		seqcount_t write_protect_seq;
-
 #ifdef CONFIG_MMU
 		atomic_long_t pgtables_bytes;	/* PTE page table pages */
 #endif
@@ -460,6 +453,18 @@ struct mm_struct {
 		spinlock_t page_table_lock; /* Protects page tables and some
 					     * counters
 					     */
+		/*
+		 * With some kernel config, the current mmap_lock's offset
+		 * inside 'mm_struct' is at 0x120, which is very optimal, as
+		 * its two hot fields 'count' and 'owner' sit in 2 different
+		 * cachelines, and when mmap_lock is highly contended, both
+		 * of the 2 fields will be accessed frequently, current layout
+		 * will help to reduce cache bouncing.
+		 *
+		 * So please be careful with adding new fields before
+		 * mmap_lock, which can easily push the 2 fields into one
+		 * cacheline.
+		 */
 		struct rw_semaphore mmap_lock;
 
 		struct list_head mmlist; /* List of maybe swapped mm's.	These
@@ -480,7 +485,15 @@ struct mm_struct {
 		unsigned long stack_vm;	   /* VM_STACK */
 		unsigned long def_flags;
 
+		/**
+		 * @write_protect_seq: Locked when any thread is write
+		 * protecting pages mapped by this mm to enforce a later COW,
+		 * for instance during page table copying for fork().
+		 */
+		seqcount_t write_protect_seq;
+
 		spinlock_t arg_lock; /* protect the below fields */
+
 		unsigned long start_code, end_code, start_data, end_data;
 		unsigned long start_brk, brk, start_stack;
 		unsigned long arg_start, arg_end, env_start, env_end;
-- 
2.7.4

^ permalink raw reply related	[flat|nested] 13+ messages in thread
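[Editorial aside: the hole-filling argument in the patch can be illustrated with a toy layout. The field names are stand-ins and an LP64 ABI (8-byte longs, 4-byte ints) is assumed: a 4-byte field placed next to a 4-byte spinlock_t-sized field occupies what was previously compiler padding, so no later field moves.]

```c
#include <stddef.h>

/* Before the fix: the 4-byte lock is followed by a 4-byte padding
 * hole, because the next field needs 8-byte alignment. */
struct layout_old {
	unsigned long def_flags;   /* 8 bytes                     */
	int arg_lock;              /* 4 bytes + 4-byte hole       */
	unsigned long start_code;  /* 8-byte aligned, offset 16   */
};

/* After the fix: the 4-byte seqcount fills the former hole, so
 * start_code (and everything after it) keeps its offset. */
struct layout_new {
	unsigned long def_flags;
	int write_protect_seq;     /* occupies the old padding    */
	int arg_lock;
	unsigned long start_code;  /* still at offset 16          */
};

_Static_assert(sizeof(struct layout_old) == sizeof(struct layout_new),
	       "filling a hole adds no size");
_Static_assert(offsetof(struct layout_old, start_code) ==
	       offsetof(struct layout_new, start_code),
	       "later fields keep their offsets");
```

This is why the relocated `write_protect_seq` restores the original `mmap_lock` offset: the structure's total size and every subsequent offset are unchanged.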
* Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression 2021-06-07 6:05 ` Feng Tang @ 2021-06-08 0:03 ` Linus Torvalds 0 siblings, 0 replies; 13+ messages in thread From: Linus Torvalds @ 2021-06-08 0:03 UTC (permalink / raw) To: Feng Tang Cc: Waiman Long, Jason Gunthorpe, kernel test robot, John Hubbard, Jan Kara, Peter Xu, Andrea Arcangeli, Aneesh Kumar K.V, Christoph Hellwig, Hugh Dickins, Jann Horn, Kirill Shutemov, Kirill Tkhai, Leon Romanovsky, Michal Hocko, Oleg Nesterov, Andrew Morton, LKML, lkp, kernel test robot, Huang, Ying, zhengjun.xing On Sun, Jun 6, 2021 at 11:06 PM Feng Tang <feng.tang@intel.com> wrote: > > Please review the updated patch, thanks Looks good to me. Thanks, Linus ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression 2021-06-04 7:52 ` Feng Tang 2021-06-04 17:57 ` Linus Torvalds @ 2021-06-04 17:58 ` John Hubbard 2021-06-06 4:47 ` Feng Tang 1 sibling, 1 reply; 13+ messages in thread From: John Hubbard @ 2021-06-04 17:58 UTC (permalink / raw) To: Feng Tang, Linus Torvalds, Jason Gunthorpe Cc: kernel test robot, Jan Kara, Peter Xu, Andrea Arcangeli, Aneesh Kumar K.V, Christoph Hellwig, Hugh Dickins, Jann Horn, Kirill Shutemov, Kirill Tkhai, Leon Romanovsky, Michal Hocko, Oleg Nesterov, Andrew Morton, LKML, lkp, kernel test robot, Huang, Ying, zhengjun.xing On 6/4/21 12:52 AM, Feng Tang wrote: ... >>> The perf data doesn't even mention any of the GUP paths, and on the >>> pure fork path the biggest impact would be: >>> >>> (a) maybe "struct mm_struct" changed in size or had a different cache layout >> >> Yes, this seems to be the cause of the regression. >> >> The test case is many thread are doing map/unmap at the same time, >> so the process's rw_semaphore 'mmap_lock' is highly contended. >> >> Before the patch (with 0day's kconfig), the mmap_lock is separated >> into 2 cachelines, the 'count' is in one line, and the other members >> sit in the next line, so it luckily avoid some cache bouncing. After Wow! That's quite a fortunate layout to land on by accident. Almost makes me wonder if mmap_lock should be designed to do that, but it's probably even better to just keep working on having a less contended mmap_lock. I *suppose* it's worth trying to keep this fragile layout in place, but it is a landmine for anyone who touches mm_struct. And the struct is so large already that I'm not sure a comment warning would even be noticed. Anyway... >> the patch, the 'mmap_lock' is pushed into one cacheline, which may >> cause the regression. 
>>
>> Below is the pahole info:
>>
>> - before the patch
>>
>>     spinlock_t page_table_lock;      /* 116  4 */
>>     struct rw_semaphore mmap_lock;   /* 120 40 */
>>     /* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */
>>     struct list_head mmlist;         /* 160 16 */
>>     long unsigned int hiwater_rss;   /* 176  8 */
>>
>> - after the patch
>>
>>     spinlock_t page_table_lock;      /* 124  4 */
>>     /* --- cacheline 2 boundary (128 bytes) --- */
>>     struct rw_semaphore mmap_lock;   /* 128 40 */
>>     struct list_head mmlist;         /* 168 16 */
>>     long unsigned int hiwater_rss;   /* 184  8 */
>>
>> perf c2c log can also confirm this.
>
> We've tried a patch which can resolve the regression. As the
> newly added member 'write_protect_seq' is 4 bytes long, putting
> it into an existing 4-byte-long hole can resolve the regression,
> while not affecting most of the other members' alignment. Please
> review the following patch, thanks!
>

So, this is a neat little solution, if we agree that it's worth "fixing". I'm definitely on the fence, but leaning toward "go for it", because I like the "no cache effect" result of using up the hole.
Reviewed-by: John Hubbard <jhubbard@nvidia.com> thanks, -- John Hubbard NVIDIA > - Feng > > From 85ddc2c3d0f2bdcbad4edc5c392c7bc90bb1667e Mon Sep 17 00:00:00 2001 > From: Feng Tang <feng.tang@intel.com> > Date: Fri, 4 Jun 2021 15:20:57 +0800 > Subject: [PATCH RFC] mm: relocate 'write_protect_seq' in struct mm_struct > > Before commit 57efa1fe5957 ("mm/gup: prevent gup_fast from > racing with COW during fork), on 64bits system, the hot member > rw_semaphore 'mmap_lock' of 'mm_struct' could be separated into > 2 cachelines, that its member 'count' sits in one cacheline while > all other members in next cacheline, this naturally reduces some > cache bouncing, and with the commit, the 'mmap_lock' is pushed > into one cacheline, as shown in the pahole info: > > - before the commit > > spinlock_t page_table_lock; /* 116 4 */ > struct rw_semaphore mmap_lock; /* 120 40 */ > /* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */ > struct list_head mmlist; /* 160 16 */ > long unsigned int hiwater_rss; /* 176 8 */ > > - after the commit > > spinlock_t page_table_lock; /* 124 4 */ > /* --- cacheline 2 boundary (128 bytes) --- */ > struct rw_semaphore mmap_lock; /* 128 40 */ > struct list_head mmlist; /* 168 16 */ > long unsigned int hiwater_rss; /* 184 8 */ > > and it causes one 9.2% regression for 'mmap1' case of will-it-scale > benchmark[1], as in the case 'mmap_lock' is highly contented (occupies > 90%+ cpu cycles). > > Though relayouting a structure could be a double-edged sword, as it > helps some case, but may hurt other cases. So one solution is the > newly added 'seqcount_t' is 4 bytes long (when CONFIG_DEBUG_LOCK_ALLOC=n), > placing it into an existing 4 bytes hole in 'mm_struct' will not > affect most of other members's alignment, while restoring the > regression. > > [1]. 
https://lore.kernel.org/lkml/20210525031636.GB7744@xsang-OptiPlex-9020/ > Reported-by: kernel test robot <oliver.sang@intel.com> > Signed-off-by: Feng Tang <feng.tang@intel.com> > --- > include/linux/mm_types.h | 15 ++++++++------- > 1 file changed, 8 insertions(+), 7 deletions(-) > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > index 5aacc1c..5b55f88 100644 > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -445,13 +445,6 @@ struct mm_struct { > */ > atomic_t has_pinned; > > - /** > - * @write_protect_seq: Locked when any thread is write > - * protecting pages mapped by this mm to enforce a later COW, > - * for instance during page table copying for fork(). > - */ > - seqcount_t write_protect_seq; > - > #ifdef CONFIG_MMU > atomic_long_t pgtables_bytes; /* PTE page table pages */ > #endif > @@ -480,7 +473,15 @@ struct mm_struct { > unsigned long stack_vm; /* VM_STACK */ > unsigned long def_flags; > > + /** > + * @write_protect_seq: Locked when any thread is write > + * protecting pages mapped by this mm to enforce a later COW, > + * for instance during page table copying for fork(). > + */ > + seqcount_t write_protect_seq; > + > spinlock_t arg_lock; /* protect the below fields */ > + > unsigned long start_code, end_code, start_data, end_data; > unsigned long start_brk, brk, start_stack; > unsigned long arg_start, arg_end, env_start, env_end; > ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression 2021-06-04 17:58 ` John Hubbard @ 2021-06-06 4:47 ` Feng Tang 0 siblings, 0 replies; 13+ messages in thread From: Feng Tang @ 2021-06-06 4:47 UTC (permalink / raw) To: John Hubbard Cc: Linus Torvalds, Jason Gunthorpe, kernel test robot, Jan Kara, Peter Xu, Andrea Arcangeli, Aneesh Kumar K.V, Christoph Hellwig, Hugh Dickins, Jann Horn, Kirill Shutemov, Kirill Tkhai, Leon Romanovsky, Michal Hocko, Oleg Nesterov, Andrew Morton, LKML, lkp, kernel test robot, Huang, Ying, zhengjun.xing On Fri, Jun 04, 2021 at 10:58:14AM -0700, John Hubbard wrote: > On 6/4/21 12:52 AM, Feng Tang wrote: > ... > >>>The perf data doesn't even mention any of the GUP paths, and on the > >>>pure fork path the biggest impact would be: > >>> > >>> (a) maybe "struct mm_struct" changed in size or had a different cache layout > >> > >>Yes, this seems to be the cause of the regression. > >> > >>The test case is many thread are doing map/unmap at the same time, > >>so the process's rw_semaphore 'mmap_lock' is highly contended. > >> > >>Before the patch (with 0day's kconfig), the mmap_lock is separated > >>into 2 cachelines, the 'count' is in one line, and the other members > >>sit in the next line, so it luckily avoid some cache bouncing. After > > Wow! That's quite a fortunate layout to land on by accident. Almost > makes me wonder if mmap_lock should be designed to do that, but it's > probably even better to just keep working on having a less contended > mmap_lock. Yes, manipulating cache alignment is always tricky and fragile, as data structure keeps being changed and it is affected by differrent kernel config options, also different workloads will see different hot fields of it. Optimizing 'mmap_lock' is the better and ultimate solution. > I *suppose* it's worth trying to keep this fragile layout in place, > but it is a landmine for anyone who touches mm_struct. 
And the struct > is so large already that I'm not sure a comment warning would even > be noticed. Anyway... Linus also mentioned clear comment is needed. Will collect more info. > >>the patch, the 'mmap_lock' is pushed into one cacheline, which may > >>cause the regression. > >> > >>Below is the pahole info: > >> > >>- before the patch > >> > >> spinlock_t page_table_lock; /* 116 4 */ > >> struct rw_semaphore mmap_lock; /* 120 40 */ > >> /* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */ > >> struct list_head mmlist; /* 160 16 */ > >> long unsigned int hiwater_rss; /* 176 8 */ > >> > >>- after the patch > >> > >> spinlock_t page_table_lock; /* 124 4 */ > >> /* --- cacheline 2 boundary (128 bytes) --- */ > >> struct rw_semaphore mmap_lock; /* 128 40 */ > >> struct list_head mmlist; /* 168 16 */ > >> long unsigned int hiwater_rss; /* 184 8 */ > >> > >>perf c2c log can also confirm this. > >We've tried some patch, which can restore the regerssion. As the > >newly added member 'write_protect_seq' is 4 bytes long, and putting > >it into an existing 4 bytes long hole can restore the regeression, > >while not affecting most of other member's alignment. Please review > >the following patch, thanks! > > > > So, this is a neat little solution, if we agree that it's worth "fixing". > > I'm definitely on the fence, but leaning toward, "go for it", because > I like the "no cache effect" result of using up the hole. > > Reviewed-by: John Hubbard <jhubbard@nvidia.com> Thanks for the reviewing! 
- Feng > thanks, > -- > John Hubbard > NVIDIA > > >- Feng > > > > From 85ddc2c3d0f2bdcbad4edc5c392c7bc90bb1667e Mon Sep 17 00:00:00 2001 > >From: Feng Tang <feng.tang@intel.com> > >Date: Fri, 4 Jun 2021 15:20:57 +0800 > >Subject: [PATCH RFC] mm: relocate 'write_protect_seq' in struct mm_struct > > > >Before commit 57efa1fe5957 ("mm/gup: prevent gup_fast from > >racing with COW during fork), on 64bits system, the hot member > >rw_semaphore 'mmap_lock' of 'mm_struct' could be separated into > >2 cachelines, that its member 'count' sits in one cacheline while > >all other members in next cacheline, this naturally reduces some > >cache bouncing, and with the commit, the 'mmap_lock' is pushed > >into one cacheline, as shown in the pahole info: > > > > - before the commit > > > > spinlock_t page_table_lock; /* 116 4 */ > > struct rw_semaphore mmap_lock; /* 120 40 */ > > /* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */ > > struct list_head mmlist; /* 160 16 */ > > long unsigned int hiwater_rss; /* 176 8 */ > > > > - after the commit > > > > spinlock_t page_table_lock; /* 124 4 */ > > /* --- cacheline 2 boundary (128 bytes) --- */ > > struct rw_semaphore mmap_lock; /* 128 40 */ > > struct list_head mmlist; /* 168 16 */ > > long unsigned int hiwater_rss; /* 184 8 */ > > > >and it causes one 9.2% regression for 'mmap1' case of will-it-scale > >benchmark[1], as in the case 'mmap_lock' is highly contented (occupies > >90%+ cpu cycles). > > > >Though relayouting a structure could be a double-edged sword, as it > >helps some case, but may hurt other cases. So one solution is the > >newly added 'seqcount_t' is 4 bytes long (when CONFIG_DEBUG_LOCK_ALLOC=n), > >placing it into an existing 4 bytes hole in 'mm_struct' will not > >affect most of other members's alignment, while restoring the > >regression. > > > >[1]. 
https://lore.kernel.org/lkml/20210525031636.GB7744@xsang-OptiPlex-9020/ > >Reported-by: kernel test robot <oliver.sang@intel.com> > >Signed-off-by: Feng Tang <feng.tang@intel.com> > >--- > > include/linux/mm_types.h | 15 ++++++++------- > > 1 file changed, 8 insertions(+), 7 deletions(-) > > > >diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > >index 5aacc1c..5b55f88 100644 > >--- a/include/linux/mm_types.h > >+++ b/include/linux/mm_types.h > >@@ -445,13 +445,6 @@ struct mm_struct { > > */ > > atomic_t has_pinned; > >- /** > >- * @write_protect_seq: Locked when any thread is write > >- * protecting pages mapped by this mm to enforce a later COW, > >- * for instance during page table copying for fork(). > >- */ > >- seqcount_t write_protect_seq; > >- > > #ifdef CONFIG_MMU > > atomic_long_t pgtables_bytes; /* PTE page table pages */ > > #endif > >@@ -480,7 +473,15 @@ struct mm_struct { > > unsigned long stack_vm; /* VM_STACK */ > > unsigned long def_flags; > >+ /** > >+ * @write_protect_seq: Locked when any thread is write > >+ * protecting pages mapped by this mm to enforce a later COW, > >+ * for instance during page table copying for fork(). > >+ */ > >+ seqcount_t write_protect_seq; > >+ > > spinlock_t arg_lock; /* protect the below fields */ > >+ > > unsigned long start_code, end_code, start_data, end_data; > > unsigned long start_brk, brk, start_stack; > > unsigned long arg_start, arg_end, env_start, env_end; > > ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [LKP] Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression 2021-05-25 3:11 ` Linus Torvalds 2021-06-04 7:04 ` Feng Tang @ 2021-06-04 8:37 ` Xing Zhengjun 1 sibling, 0 replies; 13+ messages in thread From: Xing Zhengjun @ 2021-06-04 8:37 UTC (permalink / raw) To: Linus Torvalds, kernel test robot Cc: Jason Gunthorpe, John Hubbard, Jan Kara, Peter Xu, Andrea Arcangeli, Aneesh Kumar K.V, Christoph Hellwig, Hugh Dickins, Jann Horn, Kirill Shutemov, Kirill Tkhai, Leon Romanovsky, Michal Hocko, Oleg Nesterov, Andrew Morton, LKML, lkp, kernel test robot Hi Linus, On 5/25/2021 11:11 AM, Linus Torvalds wrote: > On Mon, May 24, 2021 at 5:00 PM kernel test robot <oliver.sang@intel.com> wrote: >> FYI, we noticed a -9.2% regression of will-it-scale.per_thread_ops due to commit: >> commit: 57efa1fe5957694fa541c9062de0a127f0b9acb0 ("mm/gup: prevent gup_fast from racing with COW during fork") > Hmm. This looks like one of those "random fluctuations" things. > > It would be good to hear if other test-cases also bisect to the same > thing, but this report already says: > >> In addition to that, the commit also has significant impact on the following tests: >> >> +------------------+---------------------------------------------------------------------------------+ >> | testcase: change | will-it-scale: will-it-scale.per_thread_ops 3.7% improvement | > which does kind of reinforce that "this benchmark gives unstable numbers". > > The perf data doesn't even mention any of the GUP paths, and on the > pure fork path the biggest impact would be: > > (a) maybe "struct mm_struct" changed in size or had a different cache layout I move "write_protect_seq" to the tail of the "struct mm_struct", the regression reduced to -3.6%. The regression should relate to the cache layout. 
========================================================================================= tbox_group/testcase/rootfs/kconfig/compiler/nr_task/mode/test/cpufreq_governor/ucode: lkp-icl-2sp1/will-it-scale/debian-10.4-x86_64-20200603.cgz/x86_64-rhel-8.3/gcc-9/50%/thread/mmap1/performance/0xb000280 commit: c28b1fc70390df32e29991eedd52bd86e7aba080 57efa1fe5957694fa541c9062de0a127f0b9acb0 f6a9c27882d51ff551e15522992d3725c342372d (the test patch) c28b1fc70390df32 57efa1fe5957694fa541c9062de f6a9c27882d51ff551e15522992 ---------------- --------------------------- --------------------------- %stddev %change %stddev %change %stddev \ | \ | \ 341938 -9.0% 311218 ± 2% -3.6% 329513 will-it-scale.48.threads 7123 -9.0% 6483 ± 2% -3.6% 6864 will-it-scale.per_thread_ops 341938 -9.0% 311218 ± 2% -3.6% 329513 will-it-scale.workload diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 915f4f100383..34bb2a01806c 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -447,13 +447,6 @@ struct mm_struct { */ atomic_t has_pinned; - /** - * @write_protect_seq: Locked when any thread is write - * protecting pages mapped by this mm to enforce a later COW, - * for instance during page table copying for fork(). - */ - seqcount_t write_protect_seq; - #ifdef CONFIG_MMU atomic_long_t pgtables_bytes; /* PTE page table pages */ #endif @@ -564,6 +557,12 @@ struct mm_struct { #ifdef CONFIG_IOMMU_SUPPORT u32 pasid; #endif + /** + * @write_protect_seq: Locked when any thread is write + * protecting pages mapped by this mm to enforce a later COW, + * for instance during page table copying for fork(). + */ + seqcount_t write_protect_seq; } __randomize_layout; /* > > (b) two added (nonatomic) increment operations in the fork path due > to the seqcount > > and I'm not seeing what would cause that 9% change. Obviously cache > placement has done it before. > > If somebody else sees something that I'm missing, please holler. But > I'll ignore this as "noise" otherwise. 
> > Linus > _______________________________________________ > LKP mailing list -- lkp@lists.01.org > To unsubscribe send an email to lkp-leave@lists.01.org -- Zhengjun Xing ^ permalink raw reply related [flat|nested] 13+ messages in thread
end of thread, other threads: [~2021-06-08 0:05 UTC | newest]

Thread overview: 13+ messages
2021-05-25  3:16 [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression kernel test robot
2021-05-25  3:11 ` Linus Torvalds
2021-06-04  7:04 ` Feng Tang
2021-06-04  7:52 ` Feng Tang
2021-06-04 17:57 ` Linus Torvalds
2021-06-06 10:16 ` Feng Tang
2021-06-06 19:20 ` Linus Torvalds
2021-06-06 22:13 ` Waiman Long
2021-06-07  6:05 ` Feng Tang
2021-06-08  0:03 ` Linus Torvalds
2021-06-04 17:58 ` John Hubbard
2021-06-06  4:47 ` Feng Tang
2021-06-04  8:37 ` [LKP] " Xing Zhengjun