linux-kernel.vger.kernel.org archive mirror
* Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression
  2021-05-25  3:16 [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression kernel test robot
@ 2021-05-25  3:11 ` Linus Torvalds
  2021-06-04  7:04   ` Feng Tang
  2021-06-04  8:37   ` [LKP] " Xing Zhengjun
  0 siblings, 2 replies; 13+ messages in thread
From: Linus Torvalds @ 2021-05-25  3:11 UTC (permalink / raw)
  To: kernel test robot
  Cc: Jason Gunthorpe, John Hubbard, Jan Kara, Peter Xu,
	Andrea Arcangeli, Aneesh Kumar K.V, Christoph Hellwig,
	Hugh Dickins, Jann Horn, Kirill Shutemov, Kirill Tkhai,
	Leon Romanovsky, Michal Hocko, Oleg Nesterov, Andrew Morton,
	LKML, lkp, kernel test robot, Huang, Ying, Feng Tang,
	zhengjun.xing

On Mon, May 24, 2021 at 5:00 PM kernel test robot <oliver.sang@intel.com> wrote:
>
> FYI, we noticed a -9.2% regression of will-it-scale.per_thread_ops due to commit:
> commit: 57efa1fe5957694fa541c9062de0a127f0b9acb0 ("mm/gup: prevent gup_fast from racing with COW during fork")

Hmm. This looks like one of those "random fluctuations" things.

It would be good to hear if other test-cases also bisect to the same
thing, but this report already says:

> In addition to that, the commit also has significant impact on the following tests:
>
> +------------------+---------------------------------------------------------------------------------+
> | testcase: change | will-it-scale: will-it-scale.per_thread_ops 3.7% improvement                    |

which does kind of reinforce that "this benchmark gives unstable numbers".

The perf data doesn't even mention any of the GUP paths, and on the
pure fork path the biggest impact would be:

 (a) maybe "struct mm_struct" changed in size or had a different cache layout

 (b) two added (nonatomic) increment operations in the fork path due
to the seqcount

and I'm not seeing what would cause that 9% change. Obviously cache
placement has done it before.
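
(For context, the two increments are the write side of the new 'write_protect_seq' seqcount, taken around page-table copying in fork(), paired with a lockless sample/recheck in gup_fast. Below is a minimal user-space analogue of that pattern, sketched with plain C11 atomics rather than the kernel's seqcount_t helpers and their finer-grained barriers:)

/* Illustrative sketch only -- not the kernel code from 57efa1fe5957. */
#include <stdatomic.h>
#include <stdbool.h>

static atomic_uint write_protect_seq;

/* Writer side (fork/dup_mmap, runs under mmap_lock): the count is odd
 * while parent PTEs are being write-protected, even again afterwards. */
static void fork_copy_page_tables(void)
{
	atomic_fetch_add(&write_protect_seq, 1);	/* -> odd */
	/* ... write-protect parent PTEs, copy page tables ... */
	atomic_fetch_add(&write_protect_seq, 1);	/* -> even */
}

/* Reader side (gup_fast): sample the count, walk the page tables
 * locklessly, and fall back to the slow (locking) path if a fork
 * ran in the meantime. */
static bool gup_fast_try_pin(void)
{
	unsigned int seq = atomic_load(&write_protect_seq);

	if (seq & 1)		/* fork in progress */
		return false;
	/* ... lockless page-table walk, pin the pages ... */
	return atomic_load(&write_protect_seq) == seq;
}

On a failed recheck the real code unpins whatever it grabbed before falling back, so a pin can never survive across fork's COW write-protection.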

If somebody else sees something that I'm missing, please holler. But
I'll ignore this as "noise" otherwise.

            Linus


* [mm/gup]  57efa1fe59:  will-it-scale.per_thread_ops -9.2% regression
@ 2021-05-25  3:16 kernel test robot
  2021-05-25  3:11 ` Linus Torvalds
  0 siblings, 1 reply; 13+ messages in thread
From: kernel test robot @ 2021-05-25  3:16 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Linus Torvalds, John Hubbard, Jan Kara, Peter Xu,
	Andrea Arcangeli, Aneesh Kumar K.V, Christoph Hellwig,
	Hugh Dickins, Jann Horn, Kirill Shutemov, Kirill Tkhai,
	Leon Romanovsky, Michal Hocko, Oleg Nesterov, Andrew Morton,
	LKML, lkp, lkp, ying.huang, feng.tang, zhengjun.xing

[-- Attachment #1: Type: text/plain, Size: 27230 bytes --]



Greetings,

FYI, we noticed a -9.2% regression of will-it-scale.per_thread_ops due to commit:


commit: 57efa1fe5957694fa541c9062de0a127f0b9acb0 ("mm/gup: prevent gup_fast from racing with COW during fork")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master


in testcase: will-it-scale
on test machine: 96 threads 2 sockets Ice Lake with 256G memory
with following parameters:

	nr_task: 50%
	mode: thread
	test: mmap1
	cpufreq_governor: performance
	ucode: 0xb000280

test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process-based and a threads-based test in order to see any differences between the two.
test-url: https://github.com/antonblanchard/will-it-scale
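
(For reference, the mmap1 testcase is essentially a tight per-thread mmap/munmap loop over anonymous memory, so nearly all time is spent taking mmap_lock for write. A minimal standalone analogue follows; it assumes the upstream 128MB mapping size and is not the exact upstream source:)

/* Illustrative analogue of will-it-scale's mmap1, not the upstream test. */
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define MEMSIZE (128UL * 1024 * 1024)

static void *worker(void *arg)
{
	unsigned long *iterations = arg;

	for (;;) {
		char *c = mmap(NULL, MEMSIZE, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		assert(c != MAP_FAILED);
		munmap(c, MEMSIZE);
		(*iterations)++;
	}
	return NULL;
}

int main(int argc, char **argv)
{
	int nr_tasks = argc > 1 ? atoi(argv[1]) : 48;	/* 50% of 96 CPUs */
	/* one counter per 64-byte cacheline to avoid false sharing */
	char *counts = calloc(nr_tasks, 64);
	pthread_t tid;

	for (int i = 0; i < nr_tasks; i++)
		pthread_create(&tid, NULL, worker, counts + i * 64);

	sleep(5);	/* racy-but-adequate sample of the counters */
	unsigned long total = 0;
	for (int i = 0; i < nr_tasks; i++)
		total += *(unsigned long *)(counts + i * 64);
	printf("ops/sec/thread ~ %lu\n", total / 5 / nr_tasks);
	return 0;
}

Run with e.g. './a.out 48' to mirror the nr_task=50% configuration on the 96-CPU box above.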

In addition to that, the commit also has significant impact on the following tests:

+------------------+---------------------------------------------------------------------------------+
| testcase: change | will-it-scale: will-it-scale.per_thread_ops 3.7% improvement                    |
| test machine     | 88 threads 2 sockets Intel(R) Xeon(R) Gold 6238M CPU @ 2.10GHz with 128G memory |
| test parameters  | cpufreq_governor=performance                                                    |
|                  | mode=thread                                                                     |
|                  | nr_task=50%                                                                     |
|                  | test=mmap1                                                                      |
|                  | ucode=0x5003006                                                                 |
+------------------+---------------------------------------------------------------------------------+


If you fix the issue, kindly add the following tag:
Reported-by: kernel test robot <oliver.sang@intel.com>


Details are as below:
-------------------------------------------------------------------------------------------------->


To reproduce:

        git clone https://github.com/intel/lkp-tests.git
        cd lkp-tests
        bin/lkp install                job.yaml  # job file is attached in this email
        bin/lkp split-job --compatible job.yaml  # generate the yaml file for lkp run
        bin/lkp run                    generated-yaml-file

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase/ucode:
  gcc-9/performance/x86_64-rhel-8.3/thread/50%/debian-10.4-x86_64-20200603.cgz/lkp-icl-2sp1/mmap1/will-it-scale/0xb000280

commit: 
  c28b1fc703 ("mm/gup: reorganize internal_get_user_pages_fast()")
  57efa1fe59 ("mm/gup: prevent gup_fast from racing with COW during fork")

c28b1fc70390df32 57efa1fe5957694fa541c9062de 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
    342141            -9.2%     310805 ±  2%  will-it-scale.48.threads
      7127            -9.2%       6474 ±  2%  will-it-scale.per_thread_ops
    342141            -9.2%     310805 ±  2%  will-it-scale.workload
   2555927 ±  3%     +45.8%    3727702        meminfo.Committed_AS
     12108 ± 13%     -36.7%       7665 ±  7%  vmstat.system.cs
   1142492 ± 30%     -47.3%     602364 ± 11%  cpuidle.C1.usage
    282373 ± 13%     -45.6%     153684 ±  7%  cpuidle.POLL.usage
     48437 ±  3%      -5.9%      45563        proc-vmstat.nr_active_anon
     54617 ±  3%      -5.5%      51602        proc-vmstat.nr_shmem
     48437 ±  3%      -5.9%      45563        proc-vmstat.nr_zone_active_anon
     70511 ±  3%      -5.1%      66942 ±  2%  proc-vmstat.pgactivate
    278653 ±  8%     +23.4%     343904 ±  4%  sched_debug.cpu.avg_idle.stddev
     22572 ± 16%     -36.3%      14378 ±  4%  sched_debug.cpu.nr_switches.avg
     66177 ± 16%     -36.8%      41800 ± 21%  sched_debug.cpu.nr_switches.max
     11613 ± 15%     -41.4%       6810 ± 23%  sched_debug.cpu.nr_switches.stddev
     22.96 ± 15%     +55.6%      35.73 ± 12%  perf-sched.total_wait_and_delay.average.ms
     69713 ± 19%     -38.0%      43235 ± 12%  perf-sched.total_wait_and_delay.count.ms
     22.95 ± 15%     +55.6%      35.72 ± 12%  perf-sched.total_wait_time.average.ms
     29397 ± 23%     -35.3%      19030 ± 17%  perf-sched.wait_and_delay.count.rwsem_down_write_slowpath.down_write_killable.__vm_munmap.__x64_sys_munmap
     31964 ± 20%     -50.8%      15738 ± 14%  perf-sched.wait_and_delay.count.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff
      4.59 ±  6%     +12.2%       5.15 ±  4%  perf-stat.i.MPKI
 3.105e+09            -2.1%   3.04e+09        perf-stat.i.branch-instructions
     12033 ± 12%     -36.8%       7600 ±  7%  perf-stat.i.context-switches
     10.06            +1.9%      10.25        perf-stat.i.cpi
 4.067e+09            -1.3%  4.016e+09        perf-stat.i.dTLB-loads
 4.521e+08            -5.1%  4.291e+08 ±  2%  perf-stat.i.dTLB-stores
 1.522e+10            -1.6%  1.497e+10        perf-stat.i.instructions
      0.10            -1.9%       0.10        perf-stat.i.ipc
      0.19 ±  8%     -22.8%       0.15 ±  5%  perf-stat.i.metric.K/sec
     80.30            -1.7%      78.93        perf-stat.i.metric.M/sec
    167270 ±  6%     -14.9%     142312 ± 11%  perf-stat.i.node-loads
     49.76            -1.6       48.11        perf-stat.i.node-store-miss-rate%
   3945152            +6.2%    4189006        perf-stat.i.node-stores
      4.59 ±  6%     +12.1%       5.15 ±  4%  perf-stat.overall.MPKI
     10.04            +1.8%      10.23        perf-stat.overall.cpi
      0.10            -1.8%       0.10        perf-stat.overall.ipc
     49.76            -1.6       48.12        perf-stat.overall.node-store-miss-rate%
  13400506            +8.2%   14504566        perf-stat.overall.path-length
 3.094e+09            -2.1%   3.03e+09        perf-stat.ps.branch-instructions
     12087 ± 13%     -36.9%       7622 ±  7%  perf-stat.ps.context-switches
 4.054e+09            -1.3%  4.002e+09        perf-stat.ps.dTLB-loads
 4.508e+08            -5.1%  4.278e+08 ±  2%  perf-stat.ps.dTLB-stores
 1.516e+10            -1.6%  1.492e+10        perf-stat.ps.instructions
   3932404            +6.2%    4175831        perf-stat.ps.node-stores
 4.584e+12            -1.7%  4.506e+12        perf-stat.total.instructions
    364038 ±  6%     -40.3%     217265 ±  9%  interrupts.CAL:Function_call_interrupts
      5382 ± 33%     -63.4%       1970 ± 35%  interrupts.CPU44.CAL:Function_call_interrupts
      6325 ± 19%     -58.1%       2650 ± 37%  interrupts.CPU47.CAL:Function_call_interrupts
     11699 ± 19%     -60.6%       4610 ± 23%  interrupts.CPU48.CAL:Function_call_interrupts
     94.20 ± 22%     -45.8%      51.09 ± 46%  interrupts.CPU48.TLB:TLB_shootdowns
      9223 ± 24%     -52.5%       4383 ± 28%  interrupts.CPU49.CAL:Function_call_interrupts
      9507 ± 24%     -57.5%       4040 ± 27%  interrupts.CPU50.CAL:Function_call_interrupts
      4530 ± 18%     -33.9%       2993 ± 28%  interrupts.CPU62.CAL:Function_call_interrupts
     82.00 ± 21%     -41.9%      47.64 ± 38%  interrupts.CPU63.TLB:TLB_shootdowns
      4167 ± 16%     -45.4%       2276 ± 22%  interrupts.CPU64.CAL:Function_call_interrupts
    135.20 ± 31%     -58.4%      56.27 ± 51%  interrupts.CPU64.TLB:TLB_shootdowns
      4155 ± 17%     -42.5%       2387 ± 27%  interrupts.CPU65.CAL:Function_call_interrupts
     95.00 ± 48%     -53.8%      43.91 ± 42%  interrupts.CPU65.TLB:TLB_shootdowns
      4122 ± 20%     -39.4%       2497 ± 29%  interrupts.CPU66.CAL:Function_call_interrupts
      3954 ± 14%     -41.4%       2318 ± 28%  interrupts.CPU67.CAL:Function_call_interrupts
      3802 ± 17%     -41.9%       2209 ± 17%  interrupts.CPU70.CAL:Function_call_interrupts
      3787 ± 11%     -48.2%       1961 ± 29%  interrupts.CPU71.CAL:Function_call_interrupts
      3580 ± 14%     -45.1%       1964 ± 19%  interrupts.CPU72.CAL:Function_call_interrupts
      3711 ± 20%     -51.3%       1806 ± 25%  interrupts.CPU73.CAL:Function_call_interrupts
      3494 ± 21%     -40.6%       2076 ± 21%  interrupts.CPU76.CAL:Function_call_interrupts
      3416 ± 21%     -45.2%       1873 ± 26%  interrupts.CPU77.CAL:Function_call_interrupts
      3047 ± 24%     -38.0%       1890 ± 18%  interrupts.CPU78.CAL:Function_call_interrupts
      3102 ± 28%     -41.8%       1805 ± 16%  interrupts.CPU80.CAL:Function_call_interrupts
      2811 ± 23%     -36.5%       1785 ± 22%  interrupts.CPU83.CAL:Function_call_interrupts
      2617 ± 17%     -30.7%       1814 ± 30%  interrupts.CPU84.CAL:Function_call_interrupts
      3322 ± 25%     -38.1%       2055 ± 29%  interrupts.CPU87.CAL:Function_call_interrupts
      2941 ± 12%     -39.2%       1787 ± 27%  interrupts.CPU93.CAL:Function_call_interrupts
     72.56           -19.7       52.82        perf-profile.calltrace.cycles-pp.__mmap
     72.52           -19.7       52.78        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__mmap
     72.48           -19.7       52.74        perf-profile.calltrace.cycles-pp.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe.__mmap
     72.49           -19.7       52.76        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__mmap
     72.47           -19.7       52.74        perf-profile.calltrace.cycles-pp.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe.__mmap
     71.74           -19.7       52.04        perf-profile.calltrace.cycles-pp.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
     71.63           -19.7       51.95        perf-profile.calltrace.cycles-pp.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64
     71.52           -19.6       51.88        perf-profile.calltrace.cycles-pp.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff
     70.12           -19.2       50.92        perf-profile.calltrace.cycles-pp.osq_lock.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff
      0.91 ±  2%      -0.2        0.70        perf-profile.calltrace.cycles-pp.rwsem_spin_on_owner.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff
      0.87 ±  2%      +0.1        0.95 ±  2%  perf-profile.calltrace.cycles-pp.__do_munmap.__vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe
      0.00            +0.6        0.63 ±  2%  perf-profile.calltrace.cycles-pp.rwsem_spin_on_owner.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.__vm_munmap
     24.24 ±  3%     +19.4       43.62        perf-profile.calltrace.cycles-pp.osq_lock.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.__vm_munmap
     24.72 ±  3%     +19.8       44.47        perf-profile.calltrace.cycles-pp.rwsem_optimistic_spin.rwsem_down_write_slowpath.down_write_killable.__vm_munmap.__x64_sys_munmap
     24.87 ±  3%     +19.8       44.62        perf-profile.calltrace.cycles-pp.down_write_killable.__vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe
     24.78 ±  3%     +19.8       44.54        perf-profile.calltrace.cycles-pp.rwsem_down_write_slowpath.down_write_killable.__vm_munmap.__x64_sys_munmap.do_syscall_64
     25.94 ±  3%     +19.8       45.73        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__munmap
     25.97 ±  3%     +19.8       45.77        perf-profile.calltrace.cycles-pp.__munmap
     25.90 ±  3%     +19.8       45.70        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__munmap
     25.88 ±  3%     +19.8       45.68        perf-profile.calltrace.cycles-pp.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe.__munmap
     25.87 ±  3%     +19.8       45.67        perf-profile.calltrace.cycles-pp.__vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe.__munmap
     72.57           -19.7       52.83        perf-profile.children.cycles-pp.__mmap
     72.48           -19.7       52.74        perf-profile.children.cycles-pp.ksys_mmap_pgoff
     72.48           -19.7       52.74        perf-profile.children.cycles-pp.vm_mmap_pgoff
      0.22 ±  5%      -0.1        0.14 ±  6%  perf-profile.children.cycles-pp.unmap_region
      0.08 ± 23%      -0.0        0.04 ± 61%  perf-profile.children.cycles-pp.__schedule
      0.06 ±  7%      -0.0        0.03 ± 75%  perf-profile.children.cycles-pp.perf_event_mmap
      0.12 ±  4%      -0.0        0.09 ±  5%  perf-profile.children.cycles-pp.up_write
      0.09 ±  7%      -0.0        0.06 ± 16%  perf-profile.children.cycles-pp.unmap_vmas
      0.10 ±  4%      -0.0        0.08 ±  3%  perf-profile.children.cycles-pp.up_read
      0.18 ±  2%      +0.0        0.20 ±  3%  perf-profile.children.cycles-pp.vm_area_dup
      0.18 ±  5%      +0.0        0.21 ±  2%  perf-profile.children.cycles-pp.vma_merge
      0.12 ±  4%      +0.0        0.14 ±  4%  perf-profile.children.cycles-pp.kmem_cache_free
      0.19 ±  6%      +0.0        0.23 ±  2%  perf-profile.children.cycles-pp.get_unmapped_area
      0.16 ±  6%      +0.0        0.20 ±  2%  perf-profile.children.cycles-pp.vm_unmapped_area
      0.17 ±  6%      +0.0        0.21 ±  2%  perf-profile.children.cycles-pp.arch_get_unmapped_area_topdown
      0.07 ± 10%      +0.1        0.14 ± 16%  perf-profile.children.cycles-pp.find_vma
      0.27 ±  4%      +0.1        0.35 ±  2%  perf-profile.children.cycles-pp.__vma_adjust
      0.35 ±  2%      +0.1        0.43 ±  3%  perf-profile.children.cycles-pp.__split_vma
      0.87 ±  2%      +0.1        0.95 ±  2%  perf-profile.children.cycles-pp.__do_munmap
      1.23            +0.1        1.33        perf-profile.children.cycles-pp.rwsem_spin_on_owner
     25.98 ±  3%     +19.8       45.78        perf-profile.children.cycles-pp.__munmap
     25.87 ±  3%     +19.8       45.68        perf-profile.children.cycles-pp.__vm_munmap
     25.88 ±  3%     +19.8       45.68        perf-profile.children.cycles-pp.__x64_sys_munmap
      0.53 ±  2%      -0.2        0.35 ±  3%  perf-profile.self.cycles-pp.rwsem_optimistic_spin
      0.08 ±  5%      -0.1        0.03 ± 75%  perf-profile.self.cycles-pp.do_mmap
      0.11 ±  6%      -0.0        0.09 ±  5%  perf-profile.self.cycles-pp.up_write
      0.19 ±  4%      -0.0        0.16 ±  5%  perf-profile.self.cycles-pp.down_write_killable
      0.05 ±  8%      +0.0        0.07 ±  8%  perf-profile.self.cycles-pp.downgrade_write
      0.11 ±  4%      +0.0        0.14 ±  4%  perf-profile.self.cycles-pp.__vma_adjust
      0.16 ±  6%      +0.0        0.20 ±  3%  perf-profile.self.cycles-pp.vm_unmapped_area
      0.05 ±  9%      +0.0        0.10 ± 13%  perf-profile.self.cycles-pp.find_vma
      1.21            +0.1        1.31        perf-profile.self.cycles-pp.rwsem_spin_on_owner


                                                                                
                           will-it-scale.per_thread_ops                         
                                                                                
  7400 +--------------------------------------------------------------------+   
       |                                                      +             |   
  7200 |.++.                  .+. +.+        .++.            : :.+ .+  +.   |   
       |    ++.+.++         ++   +   +.+.++.+    ++.+.++.++. : +  +  : : +  |   
  7000 |-+         + +     :                                +        ::     |   
       |            + + .+ :                                          +     |   
  6800 |-+             +  +                                                 |   
       |                                                          O         |   
  6600 |-+        O    O            O  O    O O  OO      OO    O            |   
       | OO  O   O  O    OO O        O   O     O    O  O    O O          O  |   
  6400 |-+     O     O       O O  O                   O               OO  O |   
       |                                  O                      O  O       |   
  6200 |-+                       O                                          |   
       |    O                                                               |   
  6000 +--------------------------------------------------------------------+   
                                                                                
                                                                                
[*] bisect-good sample
[O] bisect-bad  sample

***************************************************************************************************
lkp-csl-2sp9: 88 threads 2 sockets Intel(R) Xeon(R) Gold 6238M CPU @ 2.10GHz with 128G memory
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase/ucode:
  gcc-9/performance/x86_64-rhel-8.3/thread/50%/debian-10.4-x86_64-20200603.cgz/lkp-csl-2sp9/mmap1/will-it-scale/0x5003006

commit: 
  c28b1fc703 ("mm/gup: reorganize internal_get_user_pages_fast()")
  57efa1fe59 ("mm/gup: prevent gup_fast from racing with COW during fork")

c28b1fc70390df32 57efa1fe5957694fa541c9062de 
---------------- --------------------------- 
         %stddev     %change         %stddev
             \          |                \  
    247840            +3.7%     257132 ±  2%  will-it-scale.44.threads
      5632            +3.7%       5843 ±  2%  will-it-scale.per_thread_ops
    247840            +3.7%     257132 ±  2%  will-it-scale.workload
      0.10 ±  5%      +0.0        0.13 ±  8%  perf-profile.children.cycles-pp.find_vma
     14925 ± 19%     -48.2%       7724 ±  8%  softirqs.CPU87.SCHED
      9950 ±  3%     -36.1%       6355 ±  2%  vmstat.system.cs
   3312916 ±  4%     +13.9%    3774536 ±  9%  cpuidle.C1.time
   1675504 ±  5%     -36.6%    1061625        cpuidle.POLL.time
    987055 ±  5%     -41.8%     574757 ±  2%  cpuidle.POLL.usage
    165545 ±  3%     -12.2%     145358 ±  4%  meminfo.Active
    165235 ±  3%     -12.1%     145188 ±  4%  meminfo.Active(anon)
    180757 ±  3%     -11.7%     159538 ±  3%  meminfo.Shmem
   2877001 ± 11%     +16.2%    3342948 ± 10%  sched_debug.cfs_rq:/.min_vruntime.avg
   5545708 ± 11%      +9.8%    6086941 ±  8%  sched_debug.cfs_rq:/.min_vruntime.max
   2773178 ± 11%     +15.4%    3199941 ±  9%  sched_debug.cfs_rq:/.spread0.avg
    733740 ±  3%     -12.0%     646033 ±  5%  sched_debug.cpu.avg_idle.avg
     17167 ± 10%     -28.2%      12332 ±  7%  sched_debug.cpu.nr_switches.avg
     49180 ± 14%     -33.5%      32687 ± 22%  sched_debug.cpu.nr_switches.max
      9311 ± 18%     -36.2%       5943 ± 22%  sched_debug.cpu.nr_switches.stddev
     41257 ±  3%     -12.1%      36252 ±  4%  proc-vmstat.nr_active_anon
    339681            -1.6%     334294        proc-vmstat.nr_file_pages
     10395            -3.5%      10036        proc-vmstat.nr_mapped
     45130 ±  3%     -11.7%      39848 ±  3%  proc-vmstat.nr_shmem
     41257 ±  3%     -12.1%      36252 ±  4%  proc-vmstat.nr_zone_active_anon
    841530            -1.7%     826917        proc-vmstat.numa_local
     21515 ± 11%     -68.9%       6684 ± 70%  proc-vmstat.numa_pages_migrated
     60224 ±  3%     -11.1%      53513 ±  3%  proc-vmstat.pgactivate
    981265            -2.5%     956415        proc-vmstat.pgalloc_normal
    895893            -1.9%     878978        proc-vmstat.pgfree
     21515 ± 11%     -68.9%       6684 ± 70%  proc-vmstat.pgmigrate_success
      0.07 ±135%     -74.1%       0.02 ±  5%  perf-sched.sch_delay.max.ms.preempt_schedule_common._cond_resched.stop_one_cpu.__set_cpus_allowed_ptr.sched_setaffinity
     21.44 ±  5%     +80.9%      38.79 ±  3%  perf-sched.total_wait_and_delay.average.ms
     67273 ±  6%     -44.9%      37095 ±  5%  perf-sched.total_wait_and_delay.count.ms
     21.44 ±  5%     +80.9%      38.79 ±  3%  perf-sched.total_wait_time.average.ms
      0.08 ± 14%     +60.1%       0.13 ±  9%  perf-sched.wait_and_delay.avg.ms.rwsem_down_write_slowpath.down_write_killable.__vm_munmap.__x64_sys_munmap
      0.09 ± 12%     +58.0%       0.15 ± 15%  perf-sched.wait_and_delay.avg.ms.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff
    255.38 ± 14%     +22.1%     311.71 ± 17%  perf-sched.wait_and_delay.avg.ms.schedule_hrtimeout_range_clock.poll_schedule_timeout.constprop.0.do_sys_poll
     31877 ± 10%     -54.2%      14606 ± 13%  perf-sched.wait_and_delay.count.rwsem_down_write_slowpath.down_write_killable.__vm_munmap.__x64_sys_munmap
     27110 ±  7%     -47.3%      14280 ±  4%  perf-sched.wait_and_delay.count.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff
    138.60 ± 13%     -21.4%     109.00 ± 15%  perf-sched.wait_and_delay.count.schedule_hrtimeout_range_clock.poll_schedule_timeout.constprop.0.do_sys_poll
      1.00 ±199%     -99.9%       0.00 ±200%  perf-sched.wait_time.avg.ms.preempt_schedule_common._cond_resched.remove_vma.__do_munmap.__vm_munmap
      0.08 ± 14%     +60.9%       0.13 ±  9%  perf-sched.wait_time.avg.ms.rwsem_down_write_slowpath.down_write_killable.__vm_munmap.__x64_sys_munmap
      0.09 ± 12%     +58.2%       0.15 ± 15%  perf-sched.wait_time.avg.ms.rwsem_down_write_slowpath.down_write_killable.vm_mmap_pgoff.ksys_mmap_pgoff
    255.38 ± 14%     +22.1%     311.71 ± 17%  perf-sched.wait_time.avg.ms.schedule_hrtimeout_range_clock.poll_schedule_timeout.constprop.0.do_sys_poll
      1.00 ±199%     -99.9%       0.00 ±200%  perf-sched.wait_time.max.ms.preempt_schedule_common._cond_resched.remove_vma.__do_munmap.__vm_munmap
      4.99           -36.1%       3.19 ± 36%  perf-sched.wait_time.max.ms.rcu_gp_kthread.kthread.ret_from_fork
      9869 ±  3%     -36.2%       6295 ±  2%  perf-stat.i.context-switches
      0.00 ±  7%      +0.0        0.00 ± 29%  perf-stat.i.dTLB-load-miss-rate%
     76953 ±  7%    +327.4%     328871 ± 29%  perf-stat.i.dTLB-load-misses
   4152320            -3.0%    4026365        perf-stat.i.iTLB-load-misses
   1665297            -2.2%    1628746        perf-stat.i.iTLB-loads
      8627            +3.5%       8933        perf-stat.i.instructions-per-iTLB-miss
      0.33 ±  3%     -11.0%       0.29 ±  6%  perf-stat.i.metric.K/sec
     87.42            +1.7       89.11        perf-stat.i.node-load-miss-rate%
   7507752            -9.2%    6814138        perf-stat.i.node-load-misses
   1078418 ±  2%     -22.9%     831563 ±  3%  perf-stat.i.node-loads
   3091445            -8.2%    2838247        perf-stat.i.node-store-misses
      0.00 ±  7%      +0.0        0.00 ± 29%  perf-stat.overall.dTLB-load-miss-rate%
      8599            +3.6%       8907        perf-stat.overall.instructions-per-iTLB-miss
     87.44            +1.7       89.13        perf-stat.overall.node-load-miss-rate%
  43415811            -3.3%   41994695 ±  2%  perf-stat.overall.path-length
      9895 ±  3%     -36.4%       6291 ±  2%  perf-stat.ps.context-switches
     76756 ±  7%    +327.0%     327716 ± 29%  perf-stat.ps.dTLB-load-misses
   4138410            -3.0%    4012712        perf-stat.ps.iTLB-load-misses
   1659653            -2.2%    1623167        perf-stat.ps.iTLB-loads
   7483002            -9.2%    6791226        perf-stat.ps.node-load-misses
   1074856 ±  2%     -22.9%     828780 ±  3%  perf-stat.ps.node-loads
   3081222            -8.2%    2828732        perf-stat.ps.node-store-misses
    335021 ±  2%     -27.9%     241715 ± 12%  interrupts.CAL:Function_call_interrupts
      3662 ± 31%     -61.3%       1417 ± 16%  interrupts.CPU10.CAL:Function_call_interrupts
      4671 ± 32%     -65.6%       1607 ± 30%  interrupts.CPU12.CAL:Function_call_interrupts
      4999 ± 34%     -68.1%       1592 ± 43%  interrupts.CPU14.CAL:Function_call_interrupts
    129.00 ± 30%     -46.8%      68.60 ± 34%  interrupts.CPU14.RES:Rescheduling_interrupts
      4531 ± 49%     -58.5%       1881 ± 39%  interrupts.CPU15.CAL:Function_call_interrupts
      4639 ± 28%     -37.6%       2893 ±  2%  interrupts.CPU18.NMI:Non-maskable_interrupts
      4639 ± 28%     -37.6%       2893 ±  2%  interrupts.CPU18.PMI:Performance_monitoring_interrupts
      6310 ± 49%     -68.5%       1988 ± 57%  interrupts.CPU21.CAL:Function_call_interrupts
    149.40 ± 49%     -49.3%      75.80 ± 42%  interrupts.CPU21.RES:Rescheduling_interrupts
      3592 ± 38%     -63.0%       1330 ± 14%  interrupts.CPU24.CAL:Function_call_interrupts
      5350 ± 21%     -30.5%       3720 ± 44%  interrupts.CPU24.NMI:Non-maskable_interrupts
      5350 ± 21%     -30.5%       3720 ± 44%  interrupts.CPU24.PMI:Performance_monitoring_interrupts
    139.00 ± 27%     -33.4%      92.60 ± 26%  interrupts.CPU24.RES:Rescheduling_interrupts
      3858 ± 42%     -53.7%       1785 ± 38%  interrupts.CPU26.CAL:Function_call_interrupts
      5964 ± 28%     -42.4%       3432 ± 55%  interrupts.CPU27.NMI:Non-maskable_interrupts
      5964 ± 28%     -42.4%       3432 ± 55%  interrupts.CPU27.PMI:Performance_monitoring_interrupts
      3429 ± 37%     -57.1%       1470 ± 44%  interrupts.CPU28.CAL:Function_call_interrupts
      3008 ± 35%     -37.6%       1877 ± 38%  interrupts.CPU29.CAL:Function_call_interrupts
      4684 ± 73%     -60.0%       1872 ± 34%  interrupts.CPU30.CAL:Function_call_interrupts
      4300 ± 46%     -54.7%       1949 ± 13%  interrupts.CPU43.CAL:Function_call_interrupts
     10255 ± 26%     -50.0%       5127 ± 29%  interrupts.CPU44.CAL:Function_call_interrupts
      5800 ± 20%     -28.3%       4158 ± 27%  interrupts.CPU52.CAL:Function_call_interrupts
      4802 ± 19%     -31.7%       3279 ± 18%  interrupts.CPU58.CAL:Function_call_interrupts
      4042 ± 32%     -65.6%       1391 ± 41%  interrupts.CPU6.CAL:Function_call_interrupts
    128.60 ± 31%     -52.9%      60.60 ± 38%  interrupts.CPU6.RES:Rescheduling_interrupts
      4065 ± 20%     -37.8%       2530 ±  6%  interrupts.CPU63.CAL:Function_call_interrupts
      4340 ± 24%     -36.2%       2771 ± 11%  interrupts.CPU64.CAL:Function_call_interrupts
      3983 ± 11%     -27.1%       2904 ± 19%  interrupts.CPU65.CAL:Function_call_interrupts
      3392 ± 25%     -55.2%       1518 ± 53%  interrupts.CPU7.CAL:Function_call_interrupts
    171.80 ± 67%     -62.5%      64.40 ± 32%  interrupts.CPU7.RES:Rescheduling_interrupts
      2942 ± 33%     -50.5%       1455 ± 25%  interrupts.CPU8.CAL:Function_call_interrupts
      7818           -27.3%       5681 ± 31%  interrupts.CPU85.NMI:Non-maskable_interrupts
      7818           -27.3%       5681 ± 31%  interrupts.CPU85.PMI:Performance_monitoring_interrupts
    320.80 ± 54%     -44.6%     177.80 ± 58%  interrupts.CPU87.TLB:TLB_shootdowns
      3212 ± 31%     -64.8%       1130 ± 36%  interrupts.CPU9.CAL:Function_call_interrupts





Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


---
0DAY/LKP+ Test Infrastructure                   Open Source Technology Center
https://lists.01.org/hyperkitty/list/lkp@lists.01.org       Intel Corporation

Thanks,
Oliver Sang


[-- Attachment #2: job-script --]
[-- Type: text/plain, Size: 7928 bytes --]

#!/bin/sh

export_top_env()
{
	export suite='will-it-scale'
	export testcase='will-it-scale'
	export category='benchmark'
	export nr_task=48
	export job_origin='will-it-scale-part2.yaml'
	export queue_cmdline_keys=
	export queue='vip'
	export testbox='lkp-icl-2sp1'
	export tbox_group='lkp-icl-2sp1'
	export kconfig='x86_64-rhel-8.3'
	export submit_id='608271eb0b9a9366cf75aa8b'
	export job_file='/lkp/jobs/scheduled/lkp-icl-2sp1/will-it-scale-performance-thread-50%-mmap1-ucode=0xb000280-monitor=70d6d718-debian-10.4-x86_64-20200603.cgz-57efa1fe5957694fa541-20210423-26319-1fupkzh-3.yaml'
	export id='2d333079611b73199587392d819fd36dc870581a'
	export queuer_version='/lkp/xsang/.src-20210423-103236'
	export model='Ice Lake'
	export nr_node=2
	export nr_cpu=96
	export memory='256G'
	export nr_hdd_partitions=1
	export hdd_partitions='/dev/disk/by-id/ata-ST9500530NS_9SP1KLAR-part1'
	export ssd_partitions='/dev/nvme0n1p1'
	export swap_partitions=
	export kernel_cmdline_hw='acpi_rsdp=0x665fd014'
	export rootfs_partition='/dev/disk/by-id/ata-INTEL_SSDSC2BB800G4_PHWL4204005K800RGN-part3'
	export commit='57efa1fe5957694fa541c9062de0a127f0b9acb0'
	export ucode='0xb000280'
	export need_kconfig_hw='CONFIG_IGB=y
CONFIG_IXGBE=y
CONFIG_SATA_AHCI'
	export enqueue_time='2021-04-23 15:06:19 +0800'
	export _id='608271ef0b9a9366cf75aa8e'
	export _rt='/result/will-it-scale/performance-thread-50%-mmap1-ucode=0xb000280-monitor=70d6d718/lkp-icl-2sp1/debian-10.4-x86_64-20200603.cgz/x86_64-rhel-8.3/gcc-9/57efa1fe5957694fa541c9062de0a127f0b9acb0'
	export user='lkp'
	export compiler='gcc-9'
	export LKP_SERVER='internal-lkp-server'
	export head_commit='59d492ff832e57456a83d5652009434a44874a3e'
	export base_commit='f40ddce88593482919761f74910f42f4b84c004b'
	export branch='linus/master'
	export rootfs='debian-10.4-x86_64-20200603.cgz'
	export monitor_sha='70d6d718'
	export result_root='/result/will-it-scale/performance-thread-50%-mmap1-ucode=0xb000280-monitor=70d6d718/lkp-icl-2sp1/debian-10.4-x86_64-20200603.cgz/x86_64-rhel-8.3/gcc-9/57efa1fe5957694fa541c9062de0a127f0b9acb0/8'
	export scheduler_version='/lkp/lkp/.src-20210422-153727'
	export arch='x86_64'
	export max_uptime=2100
	export initrd='/osimage/debian/debian-10.4-x86_64-20200603.cgz'
	export bootloader_append='root=/dev/ram0
user=lkp
job=/lkp/jobs/scheduled/lkp-icl-2sp1/will-it-scale-performance-thread-50%-mmap1-ucode=0xb000280-monitor=70d6d718-debian-10.4-x86_64-20200603.cgz-57efa1fe5957694fa541-20210423-26319-1fupkzh-3.yaml
ARCH=x86_64
kconfig=x86_64-rhel-8.3
branch=linus/master
commit=57efa1fe5957694fa541c9062de0a127f0b9acb0
BOOT_IMAGE=/pkg/linux/x86_64-rhel-8.3/gcc-9/57efa1fe5957694fa541c9062de0a127f0b9acb0/vmlinuz-5.10.0-00044-g57efa1fe5957
acpi_rsdp=0x665fd014
max_uptime=2100
RESULT_ROOT=/result/will-it-scale/performance-thread-50%-mmap1-ucode=0xb000280-monitor=70d6d718/lkp-icl-2sp1/debian-10.4-x86_64-20200603.cgz/x86_64-rhel-8.3/gcc-9/57efa1fe5957694fa541c9062de0a127f0b9acb0/8
LKP_SERVER=internal-lkp-server
nokaslr
selinux=0
debug
apic=debug
sysrq_always_enabled
rcupdate.rcu_cpu_stall_timeout=100
net.ifnames=0
printk.devkmsg=on
panic=-1
softlockup_panic=1
nmi_watchdog=panic
oops=panic
load_ramdisk=2
prompt_ramdisk=0
drbd.minor_count=8
systemd.log_level=err
ignore_loglevel
console=tty0
earlyprintk=ttyS0,115200
console=ttyS0,115200
vga=normal
rw'
	export modules_initrd='/pkg/linux/x86_64-rhel-8.3/gcc-9/57efa1fe5957694fa541c9062de0a127f0b9acb0/modules.cgz'
	export bm_initrd='/osimage/deps/debian-10.4-x86_64-20200603.cgz/run-ipconfig_20200608.cgz,/osimage/deps/debian-10.4-x86_64-20200603.cgz/lkp_20201211.cgz,/osimage/deps/debian-10.4-x86_64-20200603.cgz/rsync-rootfs_20200608.cgz,/osimage/deps/debian-10.4-x86_64-20200603.cgz/will-it-scale_20210401.cgz,/osimage/pkg/debian-10.4-x86_64-20200603.cgz/will-it-scale-x86_64-a34a85c-1_20210401.cgz,/osimage/deps/debian-10.4-x86_64-20200603.cgz/mpstat_20200714.cgz,/osimage/deps/debian-10.4-x86_64-20200603.cgz/perf_20201126.cgz,/osimage/pkg/debian-10.4-x86_64-20200603.cgz/perf-x86_64-d19cc4bfbff1-1_20210401.cgz,/osimage/pkg/debian-10.4-x86_64-20200603.cgz/sar-x86_64-34c92ae-1_20200702.cgz,/osimage/deps/debian-10.4-x86_64-20200603.cgz/hw_20200715.cgz'
	export ucode_initrd='/osimage/ucode/intel-ucode-20201117.cgz'
	export lkp_initrd='/osimage/user/lkp/lkp-x86_64.cgz'
	export site='inn'
	export LKP_CGI_PORT=80
	export LKP_CIFS_PORT=139
	export last_kernel='5.10.0-00044-g57efa1fe5957'
	export good_samples='7084
6873
6971'
	export queue_at_least_once=1
	export kernel='/pkg/linux/x86_64-rhel-8.3/gcc-9/57efa1fe5957694fa541c9062de0a127f0b9acb0/vmlinuz-5.10.0-00044-g57efa1fe5957'
	export dequeue_time='2021-04-23 16:58:17 +0800'
	export job_initrd='/lkp/jobs/scheduled/lkp-icl-2sp1/will-it-scale-performance-thread-50%-mmap1-ucode=0xb000280-monitor=70d6d718-debian-10.4-x86_64-20200603.cgz-57efa1fe5957694fa541-20210423-26319-1fupkzh-3.cgz'

	[ -n "$LKP_SRC" ] ||
	export LKP_SRC=/lkp/${user:-lkp}/src
}

run_job()
{
	echo $$ > $TMP/run-job.pid

	. $LKP_SRC/lib/http.sh
	. $LKP_SRC/lib/job.sh
	. $LKP_SRC/lib/env.sh

	export_top_env

	run_setup $LKP_SRC/setup/cpufreq_governor 'performance'

	run_monitor $LKP_SRC/monitors/wrapper kmsg
	run_monitor $LKP_SRC/monitors/no-stdout/wrapper boot-time
	run_monitor $LKP_SRC/monitors/wrapper uptime
	run_monitor $LKP_SRC/monitors/wrapper iostat
	run_monitor $LKP_SRC/monitors/wrapper heartbeat
	run_monitor $LKP_SRC/monitors/wrapper vmstat
	run_monitor $LKP_SRC/monitors/wrapper numa-numastat
	run_monitor $LKP_SRC/monitors/wrapper numa-vmstat
	run_monitor $LKP_SRC/monitors/wrapper numa-meminfo
	run_monitor $LKP_SRC/monitors/wrapper proc-vmstat
	run_monitor $LKP_SRC/monitors/wrapper proc-stat
	run_monitor $LKP_SRC/monitors/wrapper meminfo
	run_monitor $LKP_SRC/monitors/wrapper slabinfo
	run_monitor $LKP_SRC/monitors/wrapper interrupts
	run_monitor $LKP_SRC/monitors/wrapper lock_stat
	run_monitor lite_mode=1 $LKP_SRC/monitors/wrapper perf-sched
	run_monitor $LKP_SRC/monitors/wrapper softirqs
	run_monitor $LKP_SRC/monitors/one-shot/wrapper bdi_dev_mapping
	run_monitor $LKP_SRC/monitors/wrapper diskstats
	run_monitor $LKP_SRC/monitors/wrapper nfsstat
	run_monitor $LKP_SRC/monitors/wrapper cpuidle
	run_monitor $LKP_SRC/monitors/wrapper cpufreq-stats
	run_monitor $LKP_SRC/monitors/wrapper sched_debug
	run_monitor $LKP_SRC/monitors/wrapper perf-stat
	run_monitor $LKP_SRC/monitors/wrapper mpstat
	run_monitor $LKP_SRC/monitors/no-stdout/wrapper perf-profile
	run_monitor pmeter_server='lkp-nhm-dp2' pmeter_device='yokogawa-wt310' $LKP_SRC/monitors/wrapper pmeter
	run_monitor $LKP_SRC/monitors/wrapper oom-killer
	run_monitor $LKP_SRC/monitors/plain/watchdog

	run_test mode='thread' test='mmap1' $LKP_SRC/tests/wrapper will-it-scale
}

extract_stats()
{
	export stats_part_begin=
	export stats_part_end=

	env mode='thread' test='mmap1' $LKP_SRC/stats/wrapper will-it-scale
	$LKP_SRC/stats/wrapper kmsg
	$LKP_SRC/stats/wrapper boot-time
	$LKP_SRC/stats/wrapper uptime
	$LKP_SRC/stats/wrapper iostat
	$LKP_SRC/stats/wrapper vmstat
	$LKP_SRC/stats/wrapper numa-numastat
	$LKP_SRC/stats/wrapper numa-vmstat
	$LKP_SRC/stats/wrapper numa-meminfo
	$LKP_SRC/stats/wrapper proc-vmstat
	$LKP_SRC/stats/wrapper meminfo
	$LKP_SRC/stats/wrapper slabinfo
	$LKP_SRC/stats/wrapper interrupts
	$LKP_SRC/stats/wrapper lock_stat
	env lite_mode=1 $LKP_SRC/stats/wrapper perf-sched
	$LKP_SRC/stats/wrapper softirqs
	$LKP_SRC/stats/wrapper diskstats
	$LKP_SRC/stats/wrapper nfsstat
	$LKP_SRC/stats/wrapper cpuidle
	$LKP_SRC/stats/wrapper sched_debug
	$LKP_SRC/stats/wrapper perf-stat
	$LKP_SRC/stats/wrapper mpstat
	$LKP_SRC/stats/wrapper perf-profile
	env pmeter_server='lkp-nhm-dp2' pmeter_device='yokogawa-wt310' $LKP_SRC/stats/wrapper pmeter

	$LKP_SRC/stats/wrapper time will-it-scale.time
	$LKP_SRC/stats/wrapper dmesg
	$LKP_SRC/stats/wrapper kmsg
	$LKP_SRC/stats/wrapper last_state
	$LKP_SRC/stats/wrapper stderr
	$LKP_SRC/stats/wrapper time
}

"$@"

[-- Attachment #3: job.yaml --]
[-- Type: text/plain, Size: 5114 bytes --]

---
suite: will-it-scale
testcase: will-it-scale
category: benchmark
nr_task: 50%
will-it-scale:
  mode: thread
  test: mmap1
job_origin: will-it-scale-part2.yaml
queue_cmdline_keys:
- branch
- commit
- queue_at_least_once
queue: bisect
testbox: lkp-icl-2sp1
tbox_group: lkp-icl-2sp1
kconfig: x86_64-rhel-8.3
submit_id: 6039018663f28a9f5549bb6e
job_file: "/lkp/jobs/scheduled/lkp-icl-2sp1/will-it-scale-performance-thread-50%-mmap1-ucode=0xb000280-monitor=70d6d718-debian-10.4-x86_64-20200603.cgz-57efa1fe5957694fa541-20210226-40789-1bw58rx-2.yaml"
id: 568034105811dd2aa8af615c3d1cbd509191f301
queuer_version: "/lkp-src"
model: Ice Lake
nr_node: 2
nr_cpu: 96
memory: 256G
nr_hdd_partitions: 1
hdd_partitions: "/dev/disk/by-id/ata-ST9500530NS_9SP1KLAR-part1"
ssd_partitions: "/dev/nvme0n1p1"
swap_partitions: 
kernel_cmdline_hw: acpi_rsdp=0x665fd014
rootfs_partition: "/dev/disk/by-id/ata-INTEL_SSDSC2BB800G4_PHWL4204005K800RGN-part3"
kmsg: 
boot-time: 
uptime: 
iostat: 
heartbeat: 
vmstat: 
numa-numastat: 
numa-vmstat: 
numa-meminfo: 
proc-vmstat: 
proc-stat: 
meminfo: 
slabinfo: 
interrupts: 
lock_stat: 
perf-sched:
  lite_mode: 1
softirqs: 
bdi_dev_mapping: 
diskstats: 
nfsstat: 
cpuidle: 
cpufreq-stats: 
sched_debug: 
perf-stat: 
mpstat: 
perf-profile: 
cpufreq_governor: performance
commit: 57efa1fe5957694fa541c9062de0a127f0b9acb0
ucode: '0xb000280'
need_kconfig_hw:
- CONFIG_IGB=y
- CONFIG_IXGBE=y
- CONFIG_SATA_AHCI
pmeter:
  pmeter_server: lkp-nhm-dp2
  pmeter_device: yokogawa-wt310
enqueue_time: 2021-02-26 22:11:18.355364388 +08:00
_id: 603906b363f28a9f5549bb70
_rt: "/result/will-it-scale/performance-thread-50%-mmap1-ucode=0xb000280-monitor=70d6d718/lkp-icl-2sp1/debian-10.4-x86_64-20200603.cgz/x86_64-rhel-8.3/gcc-9/57efa1fe5957694fa541c9062de0a127f0b9acb0"
user: lkp
compiler: gcc-9
LKP_SERVER: internal-lkp-server
head_commit: 59d492ff832e57456a83d5652009434a44874a3e
base_commit: f40ddce88593482919761f74910f42f4b84c004b
branch: linus/master
rootfs: debian-10.4-x86_64-20200603.cgz
monitor_sha: 70d6d718
result_root: "/result/will-it-scale/performance-thread-50%-mmap1-ucode=0xb000280-monitor=70d6d718/lkp-icl-2sp1/debian-10.4-x86_64-20200603.cgz/x86_64-rhel-8.3/gcc-9/57efa1fe5957694fa541c9062de0a127f0b9acb0/0"
scheduler_version: "/lkp/lkp/.src-20210226-170207"
arch: x86_64
max_uptime: 2100
initrd: "/osimage/debian/debian-10.4-x86_64-20200603.cgz"
bootloader_append:
- root=/dev/ram0
- user=lkp
- job=/lkp/jobs/scheduled/lkp-icl-2sp1/will-it-scale-performance-thread-50%-mmap1-ucode=0xb000280-monitor=70d6d718-debian-10.4-x86_64-20200603.cgz-57efa1fe5957694fa541-20210226-40789-1bw58rx-2.yaml
- ARCH=x86_64
- kconfig=x86_64-rhel-8.3
- branch=linus/master
- commit=57efa1fe5957694fa541c9062de0a127f0b9acb0
- BOOT_IMAGE=/pkg/linux/x86_64-rhel-8.3/gcc-9/57efa1fe5957694fa541c9062de0a127f0b9acb0/vmlinuz-5.10.0-00044-g57efa1fe5957
- acpi_rsdp=0x665fd014
- max_uptime=2100
- RESULT_ROOT=/result/will-it-scale/performance-thread-50%-mmap1-ucode=0xb000280-monitor=70d6d718/lkp-icl-2sp1/debian-10.4-x86_64-20200603.cgz/x86_64-rhel-8.3/gcc-9/57efa1fe5957694fa541c9062de0a127f0b9acb0/0
- LKP_SERVER=internal-lkp-server
- nokaslr
- selinux=0
- debug
- apic=debug
- sysrq_always_enabled
- rcupdate.rcu_cpu_stall_timeout=100
- net.ifnames=0
- printk.devkmsg=on
- panic=-1
- softlockup_panic=1
- nmi_watchdog=panic
- oops=panic
- load_ramdisk=2
- prompt_ramdisk=0
- drbd.minor_count=8
- systemd.log_level=err
- ignore_loglevel
- console=tty0
- earlyprintk=ttyS0,115200
- console=ttyS0,115200
- vga=normal
- rw
modules_initrd: "/pkg/linux/x86_64-rhel-8.3/gcc-9/57efa1fe5957694fa541c9062de0a127f0b9acb0/modules.cgz"
bm_initrd: "/osimage/deps/debian-10.4-x86_64-20200603.cgz/run-ipconfig_20200608.cgz,/osimage/deps/debian-10.4-x86_64-20200603.cgz/lkp_20201211.cgz,/osimage/deps/debian-10.4-x86_64-20200603.cgz/rsync-rootfs_20200608.cgz,/osimage/deps/debian-10.4-x86_64-20200603.cgz/will-it-scale_20210108.cgz,/osimage/pkg/debian-10.4-x86_64-20200603.cgz/will-it-scale-x86_64-6b6f1f6-1_20210108.cgz,/osimage/deps/debian-10.4-x86_64-20200603.cgz/mpstat_20200714.cgz,/osimage/deps/debian-10.4-x86_64-20200603.cgz/perf_20201126.cgz,/osimage/pkg/debian-10.4-x86_64-20200603.cgz/perf-x86_64-e71ba9452f0b-1_20210106.cgz,/osimage/pkg/debian-10.4-x86_64-20200603.cgz/sar-x86_64-34c92ae-1_20200702.cgz,/osimage/deps/debian-10.4-x86_64-20200603.cgz/hw_20200715.cgz"
ucode_initrd: "/osimage/ucode/intel-ucode-20201117.cgz"
lkp_initrd: "/osimage/user/lkp/lkp-x86_64.cgz"
site: inn
LKP_CGI_PORT: 80
LKP_CIFS_PORT: 139
oom-killer: 
watchdog: 
last_kernel: 5.11.0-07287-g933a73780a7a
repeat_to: 3
good_samples:
- 7084
- 6873
- 6971

#! queue options

#! user overrides
queue_at_least_once: 0

#! schedule options
kernel: "/pkg/linux/x86_64-rhel-8.3/gcc-9/57efa1fe5957694fa541c9062de0a127f0b9acb0/vmlinuz-5.10.0-00044-g57efa1fe5957"
dequeue_time: 2021-02-26 22:36:05.668722260 +08:00

#! /lkp/lkp/.src-20210226-170207/include/site/inn

#! runtime status
job_state: finished
loadavg: 40.45 29.66 13.27 1/716 10184
start_time: '1614350233'
end_time: '1614350535'
version: "/lkp/lkp/.src-20210226-170239:f6d2b143:03255feb8"

[-- Attachment #4: reproduce --]
[-- Type: text/plain, Size: 335 bytes --]


for cpu_dir in /sys/devices/system/cpu/cpu[0-9]*
do
	online_file="$cpu_dir"/online
	[ -f "$online_file" ] && [ "$(cat "$online_file")" -eq 0 ] && continue

	file="$cpu_dir"/cpufreq/scaling_governor
	[ -f "$file" ] && echo "performance" > "$file"
done

 "/lkp/benchmarks/python3/bin/python3" "./runtest.py" "mmap1" "295" "thread" "48"


* Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression
  2021-05-25  3:11 ` Linus Torvalds
@ 2021-06-04  7:04   ` Feng Tang
  2021-06-04  7:52     ` Feng Tang
  2021-06-04  8:37   ` [LKP] " Xing Zhengjun
  1 sibling, 1 reply; 13+ messages in thread
From: Feng Tang @ 2021-06-04  7:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: kernel test robot, Jason Gunthorpe, John Hubbard, Jan Kara,
	Peter Xu, Andrea Arcangeli, Aneesh Kumar K.V, Christoph Hellwig,
	Hugh Dickins, Jann Horn, Kirill Shutemov, Kirill Tkhai,
	Leon Romanovsky, Michal Hocko, Oleg Nesterov, Andrew Morton,
	LKML, lkp, kernel test robot, Huang, Ying, zhengjun.xing

Hi Linus,

Sorry for the late response.

On Mon, May 24, 2021 at 05:11:37PM -1000, Linus Torvalds wrote:
> On Mon, May 24, 2021 at 5:00 PM kernel test robot <oliver.sang@intel.com> wrote:
> >
> > FYI, we noticed a -9.2% regression of will-it-scale.per_thread_ops due to commit:
> > commit: 57efa1fe5957694fa541c9062de0a127f0b9acb0 ("mm/gup: prevent gup_fast from racing with COW during fork")
> 
> Hmm. This looks like one of those "random fluctuations" things.
> 
> It would be good to hear if other test-cases also bisect to the same
> thing, but this report already says:
> 
> > In addition to that, the commit also has significant impact on the following tests:
> >
> > +------------------+---------------------------------------------------------------------------------+
> > | testcase: change | will-it-scale: will-it-scale.per_thread_ops 3.7% improvement                    |
> 
> which does kind of reinforce that "this benchmark gives unstable numbers".
> 
> The perf data doesn't even mention any of the GUP paths, and on the
> pure fork path the biggest impact would be:
> 
>  (a) maybe "struct mm_struct" changed in size or had a different cache layout

Yes, this seems to be the cause of the regression.

In the test case, many threads do map/unmap at the same time,
so the process's rw_semaphore 'mmap_lock' is highly contended.

Before the patch (with 0day's kconfig), the mmap_lock was split
across 2 cachelines: the 'count' sat in one line, and the other
members sat in the next line, so it luckily avoided some cache
bouncing. After the patch, the 'mmap_lock' is pushed into one
cacheline, which may cause the regression.

Below is the pahole info:

- before the patch

	spinlock_t         page_table_lock;      /*   116     4 */
	struct rw_semaphore mmap_lock;           /*   120    40 */
	/* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */
	struct list_head   mmlist;               /*   160    16 */
	long unsigned int  hiwater_rss;          /*   176     8 */

- after the patch

	spinlock_t         page_table_lock;      /*   124     4 */
	/* --- cacheline 2 boundary (128 bytes) --- */
	struct rw_semaphore mmap_lock;           /*   128    40 */
	struct list_head   mmlist;               /*   168    16 */
	long unsigned int  hiwater_rss;          /*   184     8 */

perf c2c log can also confirm this.
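
(For anyone reproducing the analysis, both observations come from stock tooling; illustrative invocations, assuming a vmlinux built with debug info:)

# struct layout, cacheline boundaries and holes (illustrative):
pahole -C mm_struct vmlinux

# cache-to-cache (HITM) contention while the benchmark runs:
perf c2c record -a -- sleep 10
perf c2c report --stdio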

Thanks,
Feng

>  (b) two added (nonatomic) increment operations in the fork path due
> to the seqcount
> 
> and I'm not seeing what would cause that 9% change. Obviously cache
> placement has done it before.
> 
> If somebody else sees something that I'm missing, please holler. But
> I'll ignore this as "noise" otherwise.
> 
>             Linus


* Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression
  2021-06-04  7:04   ` Feng Tang
@ 2021-06-04  7:52     ` Feng Tang
  2021-06-04 17:57       ` Linus Torvalds
  2021-06-04 17:58       ` John Hubbard
  0 siblings, 2 replies; 13+ messages in thread
From: Feng Tang @ 2021-06-04  7:52 UTC (permalink / raw)
  To: Linus Torvalds, Jason Gunthorpe
  Cc: kernel test robot, Jason Gunthorpe, John Hubbard, Jan Kara,
	Peter Xu, Andrea Arcangeli, Aneesh Kumar K.V, Christoph Hellwig,
	Hugh Dickins, Jann Horn, Kirill Shutemov, Kirill Tkhai,
	Leon Romanovsky, Michal Hocko, Oleg Nesterov, Andrew Morton,
	LKML, lkp, kernel test robot, Huang, Ying, zhengjun.xing

On Fri, Jun 04, 2021 at 03:04:11PM +0800, Feng Tang wrote:
> Hi Linus,
> 
> Sorry for the late response.
> 
> On Mon, May 24, 2021 at 05:11:37PM -1000, Linus Torvalds wrote:
> > On Mon, May 24, 2021 at 5:00 PM kernel test robot <oliver.sang@intel.com> wrote:
> > >
> > > FYI, we noticed a -9.2% regression of will-it-scale.per_thread_ops due to commit:
> > > commit: 57efa1fe5957694fa541c9062de0a127f0b9acb0 ("mm/gup: prevent gup_fast from racing with COW during fork")
> > 
> > Hmm. This looks like one of those "random fluctuations" things.
> > 
> > It would be good to hear if other test-cases also bisect to the same
> > thing, but this report already says:
> > 
> > > In addition to that, the commit also has significant impact on the following tests:
> > >
> > > +------------------+---------------------------------------------------------------------------------+
> > > | testcase: change | will-it-scale: will-it-scale.per_thread_ops 3.7% improvement                    |
> > 
> > which does kind of reinforce that "this benchmark gives unstable numbers".
> > 
> > The perf data doesn't even mention any of the GUP paths, and on the
> > pure fork path the biggest impact would be:
> > 
> >  (a) maybe "struct mm_struct" changed in size or had a different cache layout
> 
> Yes, this seems to be the cause of the regression.
> 
> In the test case, many threads do map/unmap at the same time,
> so the process's rw_semaphore 'mmap_lock' is highly contended.
> 
> Before the patch (with 0day's kconfig), the mmap_lock was split
> across 2 cachelines: the 'count' sat in one line, and the other
> members sat in the next line, so it luckily avoided some cache
> bouncing. After the patch, the 'mmap_lock' is pushed into one
> cacheline, which may cause the regression.
> 
> Below is the pahole info:
> 
> - before the patch
> 
> 	spinlock_t         page_table_lock;      /*   116     4 */
> 	struct rw_semaphore mmap_lock;           /*   120    40 */
> 	/* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */
> 	struct list_head   mmlist;               /*   160    16 */
> 	long unsigned int  hiwater_rss;          /*   176     8 */
> 
> - after the patch
> 
> 	spinlock_t         page_table_lock;      /*   124     4 */
> 	/* --- cacheline 2 boundary (128 bytes) --- */
> 	struct rw_semaphore mmap_lock;           /*   128    40 */
> 	struct list_head   mmlist;               /*   168    16 */
> 	long unsigned int  hiwater_rss;          /*   184     8 */
> 
> perf c2c log can also confirm this.
 
We've tried a patch which can recover the regression. As the newly
added member 'write_protect_seq' is 4 bytes long, putting it into an
existing 4-byte hole recovers the lost performance while not
affecting the alignment of most other members. Please review the
following patch, thanks!

- Feng

From 85ddc2c3d0f2bdcbad4edc5c392c7bc90bb1667e Mon Sep 17 00:00:00 2001
From: Feng Tang <feng.tang@intel.com>
Date: Fri, 4 Jun 2021 15:20:57 +0800
Subject: [PATCH RFC] mm: relocate 'write_protect_seq' in struct mm_struct

Before commit 57efa1fe5957 ("mm/gup: prevent gup_fast from
racing with COW during fork"), on 64-bit systems the hot member
rw_semaphore 'mmap_lock' of 'mm_struct' could be split across
2 cachelines, so that its member 'count' sat in one cacheline
while all other members sat in the next one. This naturally
reduced some cache bouncing. With the commit, the 'mmap_lock'
is pushed into one cacheline, as shown in the pahole info:

 - before the commit

	spinlock_t         page_table_lock;      /*   116     4 */
	struct rw_semaphore mmap_lock;           /*   120    40 */
	/* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */
	struct list_head   mmlist;               /*   160    16 */
	long unsigned int  hiwater_rss;          /*   176     8 */

 - after the commit

	spinlock_t         page_table_lock;      /*   124     4 */
	/* --- cacheline 2 boundary (128 bytes) --- */
	struct rw_semaphore mmap_lock;           /*   128    40 */
	struct list_head   mmlist;               /*   168    16 */
	long unsigned int  hiwater_rss;          /*   184     8 */

and it causes a 9.2% regression for the 'mmap1' case of the
will-it-scale benchmark[1], as in that case 'mmap_lock' is highly
contended (occupying 90%+ of cpu cycles).

Re-laying out a structure is a double-edged sword, as it may help
some cases but hurt others. Since the newly added 'seqcount_t' is
4 bytes long (when CONFIG_DEBUG_LOCK_ALLOC=n), placing it into an
existing 4-byte hole in 'mm_struct' does not affect the alignment
of most other members, while fixing the regression.

[1]. https://lore.kernel.org/lkml/20210525031636.GB7744@xsang-OptiPlex-9020/
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
---
 include/linux/mm_types.h | 15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5aacc1c..5b55f88 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -445,13 +445,6 @@ struct mm_struct {
 		 */
 		atomic_t has_pinned;
 
-		/**
-		 * @write_protect_seq: Locked when any thread is write
-		 * protecting pages mapped by this mm to enforce a later COW,
-		 * for instance during page table copying for fork().
-		 */
-		seqcount_t write_protect_seq;
-
 #ifdef CONFIG_MMU
 		atomic_long_t pgtables_bytes;	/* PTE page table pages */
 #endif
@@ -480,7 +473,15 @@ struct mm_struct {
 		unsigned long stack_vm;	   /* VM_STACK */
 		unsigned long def_flags;
 
+		/**
+		 * @write_protect_seq: Locked when any thread is write
+		 * protecting pages mapped by this mm to enforce a later COW,
+		 * for instance during page table copying for fork().
+		 */
+		seqcount_t write_protect_seq;
+
 		spinlock_t arg_lock; /* protect the below fields */
+
 		unsigned long start_code, end_code, start_data, end_data;
 		unsigned long start_brk, brk, start_stack;
 		unsigned long arg_start, arg_end, env_start, env_end;
-- 
2.7.4
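
(An illustrative way to check that the relocation took effect, assuming a debug-info build; exact offsets depend on the kconfig:)

# illustrative check on the rebuilt kernel:
pahole -C mm_struct vmlinux
# 'write_protect_seq' should now fill the former 4-byte hole before
# 'arg_lock', and 'mmap_lock' should be back at its pre-57efa1fe5957
# offset, straddling the cacheline boundary as shown above.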





* Re: [LKP] Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression
  2021-05-25  3:11 ` Linus Torvalds
  2021-06-04  7:04   ` Feng Tang
@ 2021-06-04  8:37   ` Xing Zhengjun
  1 sibling, 0 replies; 13+ messages in thread
From: Xing Zhengjun @ 2021-06-04  8:37 UTC (permalink / raw)
  To: Linus Torvalds, kernel test robot
  Cc: Jason Gunthorpe, John Hubbard, Jan Kara, Peter Xu,
	Andrea Arcangeli, Aneesh Kumar K.V, Christoph Hellwig,
	Hugh Dickins, Jann Horn, Kirill Shutemov, Kirill Tkhai,
	Leon Romanovsky, Michal Hocko, Oleg Nesterov, Andrew Morton,
	LKML, lkp, kernel test robot

Hi Linus,

On 5/25/2021 11:11 AM, Linus Torvalds wrote:
> On Mon, May 24, 2021 at 5:00 PM kernel test robot <oliver.sang@intel.com> wrote:
>> FYI, we noticed a -9.2% regression of will-it-scale.per_thread_ops due to commit:
>> commit: 57efa1fe5957694fa541c9062de0a127f0b9acb0 ("mm/gup: prevent gup_fast from racing with COW during fork")
> Hmm. This looks like one of those "random fluctuations" things.
>
> It would be good to hear if other test-cases also bisect to the same
> thing, but this report already says:
>
>> In addition to that, the commit also has significant impact on the following tests:
>>
>> +------------------+---------------------------------------------------------------------------------+
>> | testcase: change | will-it-scale: will-it-scale.per_thread_ops 3.7% improvement                    |
> which does kind of reinforce that "this benchmark gives unstable numbers".
>
> The perf data doesn't even mention any of the GUP paths, and on the
> pure fork path the biggest impact would be:
>
>   (a) maybe "struct mm_struct" changed in size or had a different cache layout
I move "write_protect_seq" to the tail of the "struct mm_struct", the 
regression reduced to -3.6%. The regression should relate to the cache 
layout.
=========================================================================================
tbox_group/testcase/rootfs/kconfig/compiler/nr_task/mode/test/cpufreq_governor/ucode:
lkp-icl-2sp1/will-it-scale/debian-10.4-x86_64-20200603.cgz/x86_64-rhel-8.3/gcc-9/50%/thread/mmap1/performance/0xb000280

commit:
   c28b1fc70390df32e29991eedd52bd86e7aba080
   57efa1fe5957694fa541c9062de0a127f0b9acb0
   f6a9c27882d51ff551e15522992d3725c342372d  (the test patch)

c28b1fc70390df32 57efa1fe5957694fa541c9062de f6a9c27882d51ff551e15522992
---------------- --------------------------- ---------------------------
         %stddev     %change         %stddev     %change         %stddev
             \          |                \          |                \
    341938            -9.0%     311218 ±  2%      -3.6%     329513        will-it-scale.48.threads
      7123            -9.0%       6483 ±  2%      -3.6%       6864        will-it-scale.per_thread_ops
    341938            -9.0%     311218 ±  2%      -3.6%     329513        will-it-scale.workload

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 915f4f100383..34bb2a01806c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -447,13 +447,6 @@ struct mm_struct {
                  */
                 atomic_t has_pinned;

-               /**
-                * @write_protect_seq: Locked when any thread is write
-                * protecting pages mapped by this mm to enforce a later COW,
-                * for instance during page table copying for fork().
-                */
-               seqcount_t write_protect_seq;
-
  #ifdef CONFIG_MMU
                 atomic_long_t pgtables_bytes;   /* PTE page table pages */
  #endif
@@ -564,6 +557,12 @@ struct mm_struct {
  #ifdef CONFIG_IOMMU_SUPPORT
                 u32 pasid;
  #endif
+                /**
+                 * @write_protect_seq: Locked when any thread is write
+                 * protecting pages mapped by this mm to enforce a later COW,
+                 * for instance during page table copying for fork().
+                 */
+                seqcount_t write_protect_seq;
         } __randomize_layout;

         /*
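
For a quick sanity check of the cacheline math, here is a hedged userspace
sketch (the structs are stand-ins, not the real mm_struct; the 120 bytes of
filler reproduce the pre-commit offsets from the pahole output quoted later
in the thread):

#include <stdio.h>
#include <stddef.h>

struct demo_sem {		/* stand-in for rw_semaphore's hot members */
	long count;
	long owner;
};

struct demo_mm {		/* stand-in for the mm_struct layout question */
	char before[120];	/* the fields that precede mmap_lock pre-commit */
	struct demo_sem mmap_lock;
};

int main(void)
{
	/* pre-commit layout: 'count' fills the tail of 64-byte cacheline 1
	 * (offset 120) while 'owner' opens cacheline 2 (offset 128) --
	 * the lucky split described above */
	printf("count: offset %3zu, cacheline %zu\n",
	       offsetof(struct demo_mm, mmap_lock.count),
	       offsetof(struct demo_mm, mmap_lock.count) / 64);
	printf("owner: offset %3zu, cacheline %zu\n",
	       offsetof(struct demo_mm, mmap_lock.owner),
	       offsetof(struct demo_mm, mmap_lock.owner) / 64);
	return 0;
}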

>
>   (b) two added (nonatomic) increment operations in the fork path due
> to the seqcount
>
> and I'm not seeing what would cause that 9% change. Obviously cache
> placement has done it before.
>
> If somebody else sees something that I'm missing, please holler. But
> I'll ignore this as "noise" otherwise.
>
>              Linus
> _______________________________________________
> LKP mailing list -- lkp@lists.01.org
> To unsubscribe send an email to lkp-leave@lists.01.org

-- 
Zhengjun Xing


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression
  2021-06-04  7:52     ` Feng Tang
@ 2021-06-04 17:57       ` Linus Torvalds
  2021-06-06 10:16         ` Feng Tang
  2021-06-04 17:58       ` John Hubbard
  1 sibling, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2021-06-04 17:57 UTC (permalink / raw)
  To: Feng Tang
  Cc: Jason Gunthorpe, kernel test robot, John Hubbard, Jan Kara,
	Peter Xu, Andrea Arcangeli, Aneesh Kumar K.V, Christoph Hellwig,
	Hugh Dickins, Jann Horn, Kirill Shutemov, Kirill Tkhai,
	Leon Romanovsky, Michal Hocko, Oleg Nesterov, Andrew Morton,
	LKML, lkp, kernel test robot, Huang, Ying, zhengjun.xing

On Fri, Jun 4, 2021 at 12:52 AM Feng Tang <feng.tang@intel.com> wrote:
>
> On Fri, Jun 04, 2021 at 03:04:11PM +0800, Feng Tang wrote:
> > >
> > > The perf data doesn't even mention any of the GUP paths, and on the
> > > pure fork path the biggest impact would be:
> > >
> > >  (a) maybe "struct mm_struct" changed in size or had a different cache layout
> >
> > Yes, this seems to be the cause of the regression.
> >
> > The test case is many threads doing map/unmap at the same time,
> > so the process's rw_semaphore 'mmap_lock' is highly contended.
> >
> > Before the patch (with 0day's kconfig), the mmap_lock is separated
> > into 2 cachelines, the 'count' is in one line, and the other members
> > sit in the next line, so it luckily avoids some cache bouncing. After
> > the patch, the 'mmap_lock' is pushed into one cacheline, which may
> > cause the regression.

Ok, thanks for following up on this.

> We've tried a patch which can resolve the regression. As the
> newly added member 'write_protect_seq' is 4 bytes long, putting
> it into an existing 4-byte hole can resolve the regression
> while not affecting most other members' alignment. Please review
> the following patch, thanks!

The patch looks fine to me.

At the same time, I do wonder if maybe it would be worth exploring if
it's a good idea to perhaps move the 'mmap_sem' thing instead.

Or at least add a big comment. It's not clear to me exactly _which_
other fields are the ones that are so hot that the contention on
mmap_sem then causes even more cacheline bouncing.

For example, is it either

 (a) we *want* the mmap_sem to be in the first 128-byte region,
because then when we get the mmap_sem, the other fields in that same
cacheline are hot

OR

 (b) we do *not* want mmap_sem to be in the *second* 128-byte region,
because there is something *else* in that region that is touched
independently of mmap_sem that is very very hot and now you get even
more bouncing?

but I can't tell which one it is.

It would be great to have a comment in the code - and in the commit
message - about exactly which fields are the criticial ones. Because I
doubt it is 'write_protect_seq' itself that matters at all.

If it's "mmap_sem should be close to other commonly used fields",
maybe we should just move mmap_sem upwards in the structure?

              Linus
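
The bouncing in case (b) is easy to demonstrate in userspace. Below is a
hedged sketch (field names are stand-ins, not mm_struct members): two threads
doing atomic increments on *independent* words run measurably slower when
those words share a 64-byte cacheline than when they sit on separate lines.
Build with something like "gcc -O2 -pthread false_sharing.c".

#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define ITERS 50000000UL

struct layout {
	_Atomic uint64_t lock_word;	/* stand-in for a contended lock word */
	_Atomic uint64_t near_field;	/* same 64-byte line: it bounces */
	_Alignas(64) _Atomic uint64_t far_field; /* own line: no bouncing */
};

static struct layout s;

static void *bump(void *arg)	/* hammer one word with atomic RMWs */
{
	_Atomic uint64_t *p = arg;
	unsigned long i;

	for (i = 0; i < ITERS; i++)
		atomic_fetch_add(p, 1);
	return NULL;
}

static double run_pair(_Atomic uint64_t *x, _Atomic uint64_t *y)
{
	pthread_t a, b;
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	pthread_create(&a, NULL, bump, x);
	pthread_create(&b, NULL, bump, y);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	clock_gettime(CLOCK_MONOTONIC, &t1);
	return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
	printf("same cacheline:      %.2fs\n", run_pair(&s.lock_word, &s.near_field));
	printf("separate cachelines: %.2fs\n", run_pair(&s.lock_word, &s.far_field));
	return 0;
}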

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression
  2021-06-04  7:52     ` Feng Tang
  2021-06-04 17:57       ` Linus Torvalds
@ 2021-06-04 17:58       ` John Hubbard
  2021-06-06  4:47         ` Feng Tang
  1 sibling, 1 reply; 13+ messages in thread
From: John Hubbard @ 2021-06-04 17:58 UTC (permalink / raw)
  To: Feng Tang, Linus Torvalds, Jason Gunthorpe
  Cc: kernel test robot, Jan Kara, Peter Xu, Andrea Arcangeli,
	Aneesh Kumar K.V, Christoph Hellwig, Hugh Dickins, Jann Horn,
	Kirill Shutemov, Kirill Tkhai, Leon Romanovsky, Michal Hocko,
	Oleg Nesterov, Andrew Morton, LKML, lkp, kernel test robot,
	Huang, Ying, zhengjun.xing

On 6/4/21 12:52 AM, Feng Tang wrote:
...
>>> The perf data doesn't even mention any of the GUP paths, and on the
>>> pure fork path the biggest impact would be:
>>>
>>>   (a) maybe "struct mm_struct" changed in size or had a different cache layout
>>
>> Yes, this seems to be the cause of the regression.
>>
>> The test case is many threads doing map/unmap at the same time,
>> so the process's rw_semaphore 'mmap_lock' is highly contended.
>>
>> Before the patch (with 0day's kconfig), the mmap_lock is separated
>> into 2 cachelines, the 'count' is in one line, and the other members
>> sit in the next line, so it luckily avoids some cache bouncing. After

Wow! That's quite a fortunate layout to land on by accident. Almost
makes me wonder if mmap_lock should be designed to do that, but it's
probably even better to just keep working on having a less contended
mmap_lock.

I *suppose* it's worth trying to keep this fragile layout in place,
but it is a landmine for anyone who touches mm_struct. And the struct
is so large already that I'm not sure a comment warning would even
be noticed. Anyway...

>> the patch, the 'mmap_lock' is pushed into one cacheline, which may
>> cause the regression.
>>
>> Below is the pahole info:
>>
>> - before the patch
>>
>> 	spinlock_t         page_table_lock;      /*   116     4 */
>> 	struct rw_semaphore mmap_lock;           /*   120    40 */
>> 	/* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */
>> 	struct list_head   mmlist;               /*   160    16 */
>> 	long unsigned int  hiwater_rss;          /*   176     8 */
>>
>> - after the patch
>>
>> 	spinlock_t         page_table_lock;      /*   124     4 */
>> 	/* --- cacheline 2 boundary (128 bytes) --- */
>> 	struct rw_semaphore mmap_lock;           /*   128    40 */
>> 	struct list_head   mmlist;               /*   168    16 */
>> 	long unsigned int  hiwater_rss;          /*   184     8 */
>>
>> perf c2c log can also confirm this.
>   
> We've tried a patch which can resolve the regression. As the
> newly added member 'write_protect_seq' is 4 bytes long, putting
> it into an existing 4-byte hole can resolve the regression
> while not affecting most other members' alignment. Please review
> the following patch, thanks!
> 

So, this is a neat little solution, if we agree that it's worth "fixing".

I'm definitely on the fence, but leaning toward, "go for it", because
I like the "no cache effect" result of using up the hole.

Reviewed-by: John Hubbard <jhubbard@nvidia.com>

thanks,
-- 
John Hubbard
NVIDIA

> - Feng
> 
>  From 85ddc2c3d0f2bdcbad4edc5c392c7bc90bb1667e Mon Sep 17 00:00:00 2001
> From: Feng Tang <feng.tang@intel.com>
> Date: Fri, 4 Jun 2021 15:20:57 +0800
> Subject: [PATCH RFC] mm: relocate 'write_protect_seq' in struct mm_struct
> 
> Before commit 57efa1fe5957 ("mm/gup: prevent gup_fast from
> racing with COW during fork"), on 64-bit systems the hot member
> rw_semaphore 'mmap_lock' of 'mm_struct' could be separated into
> 2 cachelines, with its member 'count' in one cacheline while
> all other members sit in the next; this naturally reduces some
> cache bouncing. With the commit, the 'mmap_lock' is pushed
> into one cacheline, as shown in the pahole info:
> 
>   - before the commit
> 
> 	spinlock_t         page_table_lock;      /*   116     4 */
> 	struct rw_semaphore mmap_lock;           /*   120    40 */
> 	/* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */
> 	struct list_head   mmlist;               /*   160    16 */
> 	long unsigned int  hiwater_rss;          /*   176     8 */
> 
>   - after the commit
> 
> 	spinlock_t         page_table_lock;      /*   124     4 */
> 	/* --- cacheline 2 boundary (128 bytes) --- */
> 	struct rw_semaphore mmap_lock;           /*   128    40 */
> 	struct list_head   mmlist;               /*   168    16 */
> 	long unsigned int  hiwater_rss;          /*   184     8 */
> 
> and it causes a 9.2% regression for the 'mmap1' case of the
> will-it-scale benchmark[1], as in this case 'mmap_lock' is highly
> contended (occupying 90%+ of cpu cycles).
> 
> Though re-laying out a structure could be a double-edged sword,
> as it helps some cases but may hurt others, the newly added
> 'seqcount_t' is only 4 bytes long (when CONFIG_DEBUG_LOCK_ALLOC=n),
> so placing it into an existing 4-byte hole in 'mm_struct' will not
> affect most other members' alignment, while resolving the
> regression.
> 
> [1]. https://lore.kernel.org/lkml/20210525031636.GB7744@xsang-OptiPlex-9020/
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Signed-off-by: Feng Tang <feng.tang@intel.com>
> ---
>   include/linux/mm_types.h | 15 ++++++++-------
>   1 file changed, 8 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 5aacc1c..5b55f88 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -445,13 +445,6 @@ struct mm_struct {
>   		 */
>   		atomic_t has_pinned;
>   
> -		/**
> -		 * @write_protect_seq: Locked when any thread is write
> -		 * protecting pages mapped by this mm to enforce a later COW,
> -		 * for instance during page table copying for fork().
> -		 */
> -		seqcount_t write_protect_seq;
> -
>   #ifdef CONFIG_MMU
>   		atomic_long_t pgtables_bytes;	/* PTE page table pages */
>   #endif
> @@ -480,7 +473,15 @@ struct mm_struct {
>   		unsigned long stack_vm;	   /* VM_STACK */
>   		unsigned long def_flags;
>   
> +		/**
> +		 * @write_protect_seq: Locked when any thread is write
> +		 * protecting pages mapped by this mm to enforce a later COW,
> +		 * for instance during page table copying for fork().
> +		 */
> +		seqcount_t write_protect_seq;
> +
>   		spinlock_t arg_lock; /* protect the below fields */
> +
>   		unsigned long start_code, end_code, start_data, end_data;
>   		unsigned long start_brk, brk, start_stack;
>   		unsigned long arg_start, arg_end, env_start, env_end;
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression
  2021-06-04 17:58       ` John Hubbard
@ 2021-06-06  4:47         ` Feng Tang
  0 siblings, 0 replies; 13+ messages in thread
From: Feng Tang @ 2021-06-06  4:47 UTC (permalink / raw)
  To: John Hubbard
  Cc: Linus Torvalds, Jason Gunthorpe, kernel test robot, Jan Kara,
	Peter Xu, Andrea Arcangeli, Aneesh Kumar K.V, Christoph Hellwig,
	Hugh Dickins, Jann Horn, Kirill Shutemov, Kirill Tkhai,
	Leon Romanovsky, Michal Hocko, Oleg Nesterov, Andrew Morton,
	LKML, lkp, kernel test robot, Huang, Ying, zhengjun.xing

On Fri, Jun 04, 2021 at 10:58:14AM -0700, John Hubbard wrote:
> On 6/4/21 12:52 AM, Feng Tang wrote:
> ...
> >>>The perf data doesn't even mention any of the GUP paths, and on the
> >>>pure fork path the biggest impact would be:
> >>>
> >>>  (a) maybe "struct mm_struct" changed in size or had a different cache layout
> >>
> >>Yes, this seems to be the cause of the regression.
> >>
> >>The test case is many threads doing map/unmap at the same time,
> >>so the process's rw_semaphore 'mmap_lock' is highly contended.
> >>
> >>Before the patch (with 0day's kconfig), the mmap_lock is separated
> >>into 2 cachelines, the 'count' is in one line, and the other members
> >>sit in the next line, so it luckily avoids some cache bouncing. After
> 
> Wow! That's quite a fortunate layout to land on by accident. Almost
> makes me wonder if mmap_lock should be designed to do that, but it's
> probably even better to just keep working on having a less contended
> mmap_lock.

Yes, manipulating cache alignment is always tricky and fragile: the
data structure keeps changing, it is affected by different kernel
config options, and different workloads will see different hot
fields of it.

Optimizing 'mmap_lock' is the better and ultimate solution. 

> I *suppose* it's worth trying to keep this fragile layout in place,
> but it is a landmine for anyone who touches mm_struct. And the struct
> is so large already that I'm not sure a comment warning would even
> be noticed. Anyway...

Linus also mentioned that a clear comment is needed. I will collect more info.
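
(One heavier-handed direction, shown only as a hypothetical and not the
posted RFC patch: pin the lock to the start of its own cacheline with an
explicit alignment annotation, so later field churn can never split or
crowd it. In the kernel that spelling would be ____cacheline_aligned_in_smp;
the cost is up to a cacheline of padding. A minimal standalone sketch:)

#include <stddef.h>
#include <stdio.h>

struct mm_like {			/* illustrative, not mm_struct */
	char other_fields[100];		/* whatever precedes the lock */
	struct {
		long count;
		long owner;
	} mmap_lock __attribute__((aligned(64)));	/* always opens a line */
};

int main(void)
{
	/* the offset is rounded up to a multiple of 64 regardless of
	 * the fields above: here 100 -> 128 */
	printf("mmap_lock offset: %zu\n", offsetof(struct mm_like, mmap_lock));
	return 0;
}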


> >>the patch, the 'mmap_lock' is pushed into one cacheline, which may
> >>cause the regression.
> >>
> >>Below is the pahole info:
> >>
> >>- before the patch
> >>
> >>	spinlock_t         page_table_lock;      /*   116     4 */
> >>	struct rw_semaphore mmap_lock;           /*   120    40 */
> >>	/* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */
> >>	struct list_head   mmlist;               /*   160    16 */
> >>	long unsigned int  hiwater_rss;          /*   176     8 */
> >>
> >>- after the patch
> >>
> >>	spinlock_t         page_table_lock;      /*   124     4 */
> >>	/* --- cacheline 2 boundary (128 bytes) --- */
> >>	struct rw_semaphore mmap_lock;           /*   128    40 */
> >>	struct list_head   mmlist;               /*   168    16 */
> >>	long unsigned int  hiwater_rss;          /*   184     8 */
> >>
> >>perf c2c log can also confirm this.
> >We've tried a patch which can resolve the regression. As the
> >newly added member 'write_protect_seq' is 4 bytes long, putting
> >it into an existing 4-byte hole can resolve the regression
> >while not affecting most other members' alignment. Please review
> >the following patch, thanks!
> >
> 
> So, this is a neat little solution, if we agree that it's worth "fixing".
> 
> I'm definitely on the fence, but leaning toward, "go for it", because
> I like the "no cache effect" result of using up the hole.
> 
> Reviewed-by: John Hubbard <jhubbard@nvidia.com>

Thanks for the review!

- Feng


> thanks,
> -- 
> John Hubbard
> NVIDIA
> 
> >- Feng
> >
> > From 85ddc2c3d0f2bdcbad4edc5c392c7bc90bb1667e Mon Sep 17 00:00:00 2001
> >From: Feng Tang <feng.tang@intel.com>
> >Date: Fri, 4 Jun 2021 15:20:57 +0800
> >Subject: [PATCH RFC] mm: relocate 'write_protect_seq' in struct mm_struct
> >
> >Before commit 57efa1fe5957 ("mm/gup: prevent gup_fast from
> >racing with COW during fork"), on 64-bit systems the hot member
> >rw_semaphore 'mmap_lock' of 'mm_struct' could be separated into
> >2 cachelines, with its member 'count' in one cacheline while
> >all other members sit in the next; this naturally reduces some
> >cache bouncing. With the commit, the 'mmap_lock' is pushed
> >into one cacheline, as shown in the pahole info:
> >
> >  - before the commit
> >
> >	spinlock_t         page_table_lock;      /*   116     4 */
> >	struct rw_semaphore mmap_lock;           /*   120    40 */
> >	/* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */
> >	struct list_head   mmlist;               /*   160    16 */
> >	long unsigned int  hiwater_rss;          /*   176     8 */
> >
> >  - after the commit
> >
> >	spinlock_t         page_table_lock;      /*   124     4 */
> >	/* --- cacheline 2 boundary (128 bytes) --- */
> >	struct rw_semaphore mmap_lock;           /*   128    40 */
> >	struct list_head   mmlist;               /*   168    16 */
> >	long unsigned int  hiwater_rss;          /*   184     8 */
> >
> >and it causes a 9.2% regression for the 'mmap1' case of the
> >will-it-scale benchmark[1], as in this case 'mmap_lock' is highly
> >contended (occupying 90%+ of cpu cycles).
> >
> >Though re-laying out a structure could be a double-edged sword,
> >as it helps some cases but may hurt others, the newly added
> >'seqcount_t' is only 4 bytes long (when CONFIG_DEBUG_LOCK_ALLOC=n),
> >so placing it into an existing 4-byte hole in 'mm_struct' will not
> >affect most other members' alignment, while resolving the
> >regression.
> >
> >[1]. https://lore.kernel.org/lkml/20210525031636.GB7744@xsang-OptiPlex-9020/
> >Reported-by: kernel test robot <oliver.sang@intel.com>
> >Signed-off-by: Feng Tang <feng.tang@intel.com>
> >---
> >  include/linux/mm_types.h | 15 ++++++++-------
> >  1 file changed, 8 insertions(+), 7 deletions(-)
> >
> >diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> >index 5aacc1c..5b55f88 100644
> >--- a/include/linux/mm_types.h
> >+++ b/include/linux/mm_types.h
> >@@ -445,13 +445,6 @@ struct mm_struct {
> >  		 */
> >  		atomic_t has_pinned;
> >-		/**
> >-		 * @write_protect_seq: Locked when any thread is write
> >-		 * protecting pages mapped by this mm to enforce a later COW,
> >-		 * for instance during page table copying for fork().
> >-		 */
> >-		seqcount_t write_protect_seq;
> >-
> >  #ifdef CONFIG_MMU
> >  		atomic_long_t pgtables_bytes;	/* PTE page table pages */
> >  #endif
> >@@ -480,7 +473,15 @@ struct mm_struct {
> >  		unsigned long stack_vm;	   /* VM_STACK */
> >  		unsigned long def_flags;
> >+		/**
> >+		 * @write_protect_seq: Locked when any thread is write
> >+		 * protecting pages mapped by this mm to enforce a later COW,
> >+		 * for instance during page table copying for fork().
> >+		 */
> >+		seqcount_t write_protect_seq;
> >+
> >  		spinlock_t arg_lock; /* protect the below fields */
> >+
> >  		unsigned long start_code, end_code, start_data, end_data;
> >  		unsigned long start_brk, brk, start_stack;
> >  		unsigned long arg_start, arg_end, env_start, env_end;
> >

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression
  2021-06-04 17:57       ` Linus Torvalds
@ 2021-06-06 10:16         ` Feng Tang
  2021-06-06 19:20           ` Linus Torvalds
  0 siblings, 1 reply; 13+ messages in thread
From: Feng Tang @ 2021-06-06 10:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jason Gunthorpe, kernel test robot, John Hubbard, Jan Kara,
	Peter Xu, Andrea Arcangeli, Aneesh Kumar K.V, Christoph Hellwig,
	Hugh Dickins, Jann Horn, Kirill Shutemov, Kirill Tkhai,
	Leon Romanovsky, Michal Hocko, Oleg Nesterov, Andrew Morton,
	LKML, lkp, kernel test robot, Huang, Ying, zhengjun.xing

[-- Attachment #1: Type: text/plain, Size: 4401 bytes --]

On Fri, Jun 04, 2021 at 10:57:44AM -0700, Linus Torvalds wrote:
> On Fri, Jun 4, 2021 at 12:52 AM Feng Tang <feng.tang@intel.com> wrote:
> >
> > On Fri, Jun 04, 2021 at 03:04:11PM +0800, Feng Tang wrote:
> > > >
> > > > The perf data doesn't even mention any of the GUP paths, and on the
> > > > pure fork path the biggest impact would be:
> > > >
> > > >  (a) maybe "struct mm_struct" changed in size or had a different cache layout
> > >
> > > Yes, this seems to be the cause of the regression.
> > >
> > > The test case is many threads doing map/unmap at the same time,
> > > so the process's rw_semaphore 'mmap_lock' is highly contended.
> > >
> > > Before the patch (with 0day's kconfig), the mmap_lock is separated
> > > into 2 cachelines, the 'count' is in one line, and the other members
> > > sit in the next line, so it luckily avoids some cache bouncing. After
> > > the patch, the 'mmap_lock' is pushed into one cacheline, which may
> > > cause the regression.
> 
> Ok, thanks for following up on this.
> 
> > We've tried a patch which can resolve the regression. As the
> > newly added member 'write_protect_seq' is 4 bytes long, putting
> > it into an existing 4-byte hole can resolve the regression
> > while not affecting most other members' alignment. Please review
> > the following patch, thanks!
> 
> The patch looks fine to me.
> 
> At the same time, I do wonder if maybe it would be worth exploring if
> it's a good idea to perhaps move the 'mmap_sem' thing instead.
> 
> Or at least add a big comment. It's not clear to me exactly _which_
> other fields are the ones that are so hot that the contention on
> mmap_sem then causes even more cacheline bouncing.
>
> For example, is it either
> 
>  (a) we *want* the mmap_sem to be in the first 128-byte region,
> because then when we get the mmap_sem, the other fields in that same
> cacheline are hot
> 
> OR
> 
>  (b) we do *not* want mmap_sem to be in the *second* 128-byte region,
> because there is something *else* in that region that is touched
> independently of mmap_sem that is very very hot and now you get even
> more bouncing?
> 
> but I can't tell which one it is.

Yes, it's better to get more details on which fields are hottest;
the following are some perf data details. Let me know if more info
is needed.

* perf-stat: we see more cache-misses

  32158577 ±  7%      +9.0%   35060321 ±  6%  perf-stat.ps.cache-misses
  69612918 ±  6%     +11.2%   77382336 ±  5%  perf-stat.ps.cache-references


* perf profile: the 'mmap_lock' paths are the hottest, though the
  map:unmap ratio changes from 72:24 to 52:45, and this is the part
  that I don't understand
  
  - old kernel (without commit 57efa1fe59)

    96.60%     0.19%  [kernel.kallsyms]   [k] down_write_killable                           -      -            
72.46% down_write_killable;vm_mmap_pgoff;ksys_mmap_pgoff;do_syscall_64;entry_SYSCALL_64_after_hwframe;__mmap
24.14% down_write_killable;__vm_munmap;__x64_sys_munmap;do_syscall_64;entry_SYSCALL_64_after_hwframe;__munmap

  - new kernel 

    96.60%     0.16%  [kernel.kallsyms]    [k] down_write_killable                           -      -            
51.85% down_write_killable;vm_mmap_pgoff;ksys_mmap_pgoff;do_syscall_64;entry_SYSCALL_64_after_hwframe;__mmap
44.74% down_write_killable;__vm_munmap;__x64_sys_munmap;do_syscall_64;entry_SYSCALL_64_after_hwframe;__munmap


* perf-c2c: the hotspots (HITM) for the 2 kernels are different due
  to the data structure change

  - old kernel

    - first cacheline
	mmap_lock->count (75%)
	mm->map_count (14%)

    - second cacheline
	mmap_lock->owner (97%)

  - new kernel

    mainly in the cacheline of 'mmap_lock'     

    mmap_lock->count (~2%)
    mmap_lock->owner (95%)
    
I also attached the reduced pahole and perf-c2c logs for further
checking. (The absolute HITM event counts can be ignored, as the
recording times for the new/old kernels may differ.)


> It would be great to have a comment in the code - and in the commit
> message - about exactly which fields are the criticial ones. Because I
> doubt it is 'write_protect_seq' itself that matters at all.
> 
> If it's "mmap_sem should be close to other commonly used fields",
> maybe we should just move mmap_sem upwards in the structure?

Ok, I will add more comments if the patch is still fine given
the above updated info.

Thanks,
Feng

>               Linus

[-- Attachment #2: pah_new.log --]
[-- Type: text/plain, Size: 5610 bytes --]


struct rw_semaphore {
	atomic_long_t              count;                /*     0     0 */

	/* XXX 8 bytes hole, try to pack */

	atomic_long_t              owner;                /*     8     0 */

	/* XXX 8 bytes hole, try to pack */

	struct optimistic_spin_queue osq;                /*    16     0 */

	/* XXX 4 bytes hole, try to pack */

	raw_spinlock_t             wait_lock;            /*    20     0 */

	/* XXX 4 bytes hole, try to pack */

	struct list_head           wait_list;            /*    24     0 */

	/* size: 40, cachelines: 1, members: 5 */
	/* padding: 16 */
	/* last cacheline: 40 bytes */
};

struct mm_struct {
	struct {
		struct vm_area_struct * mmap;            /*     0     8 */
		struct rb_root     mm_rb;                /*     8     8 */
		u64                vmacache_seqnum;      /*    16     8 */
		long unsigned int  (*get_unmapped_area)(struct file *, long unsigned int, long unsigned int, long unsigned int, long unsigned int); /*    24     8 */
		long unsigned int  mmap_base;            /*    32     8 */
		long unsigned int  mmap_legacy_base;     /*    40     8 */
		long unsigned int  mmap_compat_base;     /*    48     8 */
		long unsigned int  mmap_compat_legacy_base; /*    56     8 */
		/* --- cacheline 1 boundary (64 bytes) --- */
		long unsigned int  task_size;            /*    64     8 */
		long unsigned int  highest_vm_end;       /*    72     8 */
		pgd_t *            pgd;                  /*    80     8 */
		atomic_t           membarrier_state;     /*    88     4 */
		atomic_t           mm_users;             /*    92     4 */
		atomic_t           mm_count;             /*    96     4 */
		atomic_t           has_pinned;           /*   100     4 */
		seqcount_t         write_protect_seq;    /*   104     4 */

		/* XXX 4 bytes hole, try to pack */

		atomic_long_t      pgtables_bytes;       /*   112     8 */
		int                map_count;            /*   120     4 */
		spinlock_t         page_table_lock;      /*   124     4 */
		/* --- cacheline 2 boundary (128 bytes) --- */
		struct rw_semaphore mmap_lock;           /*   128    40 */
		struct list_head   mmlist;               /*   168    16 */
		long unsigned int  hiwater_rss;          /*   184     8 */
		/* --- cacheline 3 boundary (192 bytes) --- */
		long unsigned int  hiwater_vm;           /*   192     8 */
		long unsigned int  total_vm;             /*   200     8 */
		long unsigned int  locked_vm;            /*   208     8 */
		atomic64_t         pinned_vm;            /*   216     8 */
		long unsigned int  data_vm;              /*   224     8 */
		long unsigned int  exec_vm;              /*   232     8 */
		long unsigned int  stack_vm;             /*   240     8 */
		long unsigned int  def_flags;            /*   248     8 */
		/* --- cacheline 4 boundary (256 bytes) --- */
		spinlock_t         arg_lock;             /*   256     4 */

		/* XXX 4 bytes hole, try to pack */

		long unsigned int  start_code;           /*   264     8 */
		long unsigned int  end_code;             /*   272     8 */
		long unsigned int  start_data;           /*   280     8 */
		long unsigned int  end_data;             /*   288     8 */
		long unsigned int  start_brk;            /*   296     8 */
		long unsigned int  brk;                  /*   304     8 */
		long unsigned int  start_stack;          /*   312     8 */
		/* --- cacheline 5 boundary (320 bytes) --- */
		long unsigned int  arg_start;            /*   320     8 */
		long unsigned int  arg_end;              /*   328     8 */
		long unsigned int  env_start;            /*   336     8 */
		long unsigned int  env_end;              /*   344     8 */
		long unsigned int  saved_auxv[46];       /*   352   368 */
		/* --- cacheline 11 boundary (704 bytes) was 16 bytes ago --- */
		struct mm_rss_stat rss_stat;             /*   720    32 */
		struct linux_binfmt * binfmt;            /*   752     8 */
		mm_context_t       context;              /*   760   128 */
		/* --- cacheline 13 boundary (832 bytes) was 56 bytes ago --- */
		long unsigned int  flags;                /*   888     8 */
		/* --- cacheline 14 boundary (896 bytes) --- */
		struct core_state * core_state;          /*   896     8 */
		spinlock_t         ioctx_lock;           /*   904     4 */

		/* XXX 4 bytes hole, try to pack */

		struct kioctx_table * ioctx_table;       /*   912     8 */
		struct task_struct * owner;              /*   920     8 */
		struct user_namespace * user_ns;         /*   928     8 */
		struct file *      exe_file;             /*   936     8 */
		struct mmu_notifier_subscriptions * notifier_subscriptions; /*   944     8 */
		long unsigned int  numa_next_scan;       /*   952     8 */
		/* --- cacheline 15 boundary (960 bytes) --- */
		long unsigned int  numa_scan_offset;     /*   960     8 */
		int                numa_scan_seq;        /*   968     4 */
		atomic_t           tlb_flush_pending;    /*   972     4 */
		bool               tlb_flush_batched;    /*   976     1 */

		/* XXX 7 bytes hole, try to pack */

		struct uprobes_state uprobes_state;      /*   984     8 */
		atomic_long_t      hugetlb_usage;        /*   992     8 */
		struct work_struct async_put_work;       /*  1000    32 */
		/* --- cacheline 16 boundary (1024 bytes) was 8 bytes ago --- */
		u32                pasid;                /*  1032     4 */
	};                                               /*     0  1040 */

	/* XXX last struct has 4 bytes of padding */

	long unsigned int          cpu_bitmap[];         /*  1040     0 */

	/* size: 1040, cachelines: 17, members: 2 */
	/* paddings: 1, sum paddings: 4 */
	/* last cacheline: 16 bytes */
};

[-- Attachment #3: pah_old.log --]
[-- Type: text/plain, Size: 5508 bytes --]


struct rw_semaphore {
	atomic_long_t              count;                /*     0     0 */

	/* XXX 8 bytes hole, try to pack */

	atomic_long_t              owner;                /*     8     0 */

	/* XXX 8 bytes hole, try to pack */

	struct optimistic_spin_queue osq;                /*    16     0 */

	/* XXX 4 bytes hole, try to pack */

	raw_spinlock_t             wait_lock;            /*    20     0 */

	/* XXX 4 bytes hole, try to pack */

	struct list_head           wait_list;            /*    24     0 */

	/* size: 40, cachelines: 1, members: 5 */
	/* padding: 16 */
	/* last cacheline: 40 bytes */
};

struct mm_struct {
	struct {
		struct vm_area_struct * mmap;            /*     0     8 */
		struct rb_root     mm_rb;                /*     8     8 */
		u64                vmacache_seqnum;      /*    16     8 */
		long unsigned int  (*get_unmapped_area)(struct file *, long unsigned int, long unsigned int, long unsigned int, long unsigned int); /*    24     8 */
		long unsigned int  mmap_base;            /*    32     8 */
		long unsigned int  mmap_legacy_base;     /*    40     8 */
		long unsigned int  mmap_compat_base;     /*    48     8 */
		long unsigned int  mmap_compat_legacy_base; /*    56     8 */
		/* --- cacheline 1 boundary (64 bytes) --- */
		long unsigned int  task_size;            /*    64     8 */
		long unsigned int  highest_vm_end;       /*    72     8 */
		pgd_t *            pgd;                  /*    80     8 */
		atomic_t           membarrier_state;     /*    88     4 */
		atomic_t           mm_users;             /*    92     4 */
		atomic_t           mm_count;             /*    96     4 */
		atomic_t           has_pinned;           /*   100     4 */
		atomic_long_t      pgtables_bytes;       /*   104     8 */
		int                map_count;            /*   112     4 */
		spinlock_t         page_table_lock;      /*   116     4 */
		struct rw_semaphore mmap_lock;           /*   120    40 */
		/* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */
		struct list_head   mmlist;               /*   160    16 */
		long unsigned int  hiwater_rss;          /*   176     8 */
		long unsigned int  hiwater_vm;           /*   184     8 */
		/* --- cacheline 3 boundary (192 bytes) --- */
		long unsigned int  total_vm;             /*   192     8 */
		long unsigned int  locked_vm;            /*   200     8 */
		atomic64_t         pinned_vm;            /*   208     8 */
		long unsigned int  data_vm;              /*   216     8 */
		long unsigned int  exec_vm;              /*   224     8 */
		long unsigned int  stack_vm;             /*   232     8 */
		long unsigned int  def_flags;            /*   240     8 */
		spinlock_t         arg_lock;             /*   248     4 */

		/* XXX 4 bytes hole, try to pack */

		/* --- cacheline 4 boundary (256 bytes) --- */
		long unsigned int  start_code;           /*   256     8 */
		long unsigned int  end_code;             /*   264     8 */
		long unsigned int  start_data;           /*   272     8 */
		long unsigned int  end_data;             /*   280     8 */
		long unsigned int  start_brk;            /*   288     8 */
		long unsigned int  brk;                  /*   296     8 */
		long unsigned int  start_stack;          /*   304     8 */
		long unsigned int  arg_start;            /*   312     8 */
		/* --- cacheline 5 boundary (320 bytes) --- */
		long unsigned int  arg_end;              /*   320     8 */
		long unsigned int  env_start;            /*   328     8 */
		long unsigned int  env_end;              /*   336     8 */
		long unsigned int  saved_auxv[46];       /*   344   368 */
		/* --- cacheline 11 boundary (704 bytes) was 8 bytes ago --- */
		struct mm_rss_stat rss_stat;             /*   712    32 */
		struct linux_binfmt * binfmt;            /*   744     8 */
		mm_context_t       context;              /*   752   128 */
		/* --- cacheline 13 boundary (832 bytes) was 48 bytes ago --- */
		long unsigned int  flags;                /*   880     8 */
		struct core_state * core_state;          /*   888     8 */
		/* --- cacheline 14 boundary (896 bytes) --- */
		spinlock_t         ioctx_lock;           /*   896     4 */

		/* XXX 4 bytes hole, try to pack */

		struct kioctx_table * ioctx_table;       /*   904     8 */
		struct task_struct * owner;              /*   912     8 */
		struct user_namespace * user_ns;         /*   920     8 */
		struct file *      exe_file;             /*   928     8 */
		struct mmu_notifier_subscriptions * notifier_subscriptions; /*   936     8 */
		long unsigned int  numa_next_scan;       /*   944     8 */
		long unsigned int  numa_scan_offset;     /*   952     8 */
		/* --- cacheline 15 boundary (960 bytes) --- */
		int                numa_scan_seq;        /*   960     4 */
		atomic_t           tlb_flush_pending;    /*   964     4 */
		bool               tlb_flush_batched;    /*   968     1 */

		/* XXX 7 bytes hole, try to pack */

		struct uprobes_state uprobes_state;      /*   976     8 */
		atomic_long_t      hugetlb_usage;        /*   984     8 */
		struct work_struct async_put_work;       /*   992    32 */
		/* --- cacheline 16 boundary (1024 bytes) --- */
		u32                pasid;                /*  1024     4 */
	};                                               /*     0  1032 */

	/* XXX last struct has 4 bytes of padding */

	long unsigned int          cpu_bitmap[];         /*  1032     0 */

	/* size: 1032, cachelines: 17, members: 2 */
	/* paddings: 1, sum paddings: 4 */
	/* last cacheline: 8 bytes */
};

[-- Attachment #4: c2c_new.log --]
[-- Type: text/plain, Size: 37291 bytes --]

=================================================
            Trace Event Information              
=================================================
  Total records                     :     293248
  Locked Load/Store Operations      :       6171
  Load Operations                   :      52087
  Loads - uncacheable               :          0
  Loads - IO                        :          0
  Loads - Miss                      :          0
  Loads - no mapping                :        467
  Load Fill Buffer Hit              :      14960
  Load L1D hit                      :      17949
  Load L2D hit                      :        377
  Load LLC hit                      :      11861
  Load Local HITM                   :       5926
  Load Remote HITM                  :       4183
  Load Remote HIT                   :          0
  Load Local DRAM                   :        580
  Load Remote DRAM                  :       5893
  Load MESI State Exclusive         :       5893
  Load MESI State Shared            :        580
  Load LLC Misses                   :      10656
  Load access blocked by data       :          0
  Load access blocked by address    :          0
  LLC Misses to Local DRAM          :        5.4%
  LLC Misses to Remote DRAM         :       55.3%
  LLC Misses to Remote cache (HIT)  :        0.0%
  LLC Misses to Remote cache (HITM) :       39.3%
  Store Operations                  :          0
  Store - uncacheable               :          0
  Store - no mapping                :          0
  Store L1D Hit                     :          0
  Store L1D Miss                    :          0
  No Page Map Rejects               :      50306
  Unable to parse data source       :     241161

=================================================
    Global Shared Cache Line Event Information   
=================================================
  Total Shared Cache Lines          :       1441
  Load HITs on shared lines         :      32608
  Fill Buffer Hits on shared lines  :      10597
  L1D hits on shared lines          :       4777
  L2D hits on shared lines          :         22
  LLC hits on shared lines          :      11030
  Locked Access on shared lines     :       4481
  Blocked Access on shared lines    :          0
  Store HITs on shared lines        :          0
  Store L1D hits on shared lines    :          0
  Total Merged records              :      10109

=================================================
                 c2c details                     
=================================================
  Events                            : cpu/mem-loads,ldlat=30/P
                                    : cpu/mem-stores/P
  Cachelines sort on                : Total HITMs
  Cacheline data grouping           : offset,iaddr

=================================================
           Shared Data Cache Line Table          
=================================================
#
#        ----------- Cacheline ----------      Tot  ------- Load Hitm -------    Total    Total    Total  ---- Stores ----  ----- Core Load Hit -----  - LLC Load Hit --  - RMT Load Hit --  --- Load Dram ----
# Index             Address  Node  PA cnt     Hitm    Total  LclHitm  RmtHitm  records    Loads   Stores    L1Hit   L1Miss       FB       L1       L2    LclHit  LclHitm    RmtHit  RmtHitm       Lcl       Rmt
# .....  ..................  ....  ......  .......  .......  .......  .......  .......  .......  .......  .......  .......  .......  .......  .......  ........  .......  ........  .......  ........  ........
#
      0  0xff110002a24ffc00     0   10811   28.14%     2845     1535     1310    10447    10447        0        0        0     2307     2819        0       412     1535         0     1310       292      1772
      1  0xff110002a24ffb80     0     423    2.41%      244      136      108      722      722        0        0        0      328        3        0        38      136         0      108         0       109
      2  0xff1100209ae58b00     1    1125    2.36%      239      128      111     1268     1268        0        0        0      754       94        0        67      128         0      111         0       114
      3  0xff1100209ae58580     1    1227    2.20%      222       99      123     1029     1029        0        0        0      582       46        0        55       99         0      123         0       124
      4  0xff110002a24ffc40     0    1386    1.69%      171      169        2     1205     1205        0        0        0      714        0        0       195      169         0        2         3       122
      5  0xff11001fffeaba00     0     161    0.84%       85       44       41      237      237        0        0        0       46       19        0        42       44         0       41         2        43
      6  0xff11001ffff2ba00     0     169    0.78%       79       32       47      252      252        0        0        0       48       26        0        37       32         0       47         8        54
      7  0xff11003fdcc6ba00     1     116    0.78%       79       39       40      205      205        0        0        0       30       15        0        31       39         0       40         2        48
      8  0xff110002a24ffbc0     0      64    0.77%       78       36       42      393      393        0        0        0      132      119        2        20       36         0       42         0        42
      9  0xff11003fdc6eba00     1     144    0.76%       77       48       29      195      195        0        0        0       31       19        0        29       48         0       29         4        35
     10  0xff11001fffe2ba00     0     156    0.74%       75       36       39      226      226        0        0        0       50       24        0        31       36         0       39         3        43
     11  0xff11003fdc7aba00     1     137    0.74%       75       38       37      204      204        0        0        0       30       20        0        34       38         0       37         3        42
     12  0xff11001fffc2ba00     0     128    0.73%       74       44       30      182      182        0        0        0       31       16        0        23       44         0       30         1        37
     13  0xff110020001aba00     0     135    0.72%       73       41       32      198      198        0        0        0       28       20        0        33       41         0       32         4        40
     14  0xff1100200012ba00     0     151    0.71%       72       36       36      221      221        0        0        0       35       21        0        45       36         0       36         6        42
     15  0xff1100200002ba00     0     122    0.69%       70       41       29      184      184        0        0        0       31       16        0        32       41         0       29         3        32
     16  0xff11003fdc92ba00     1     142    0.69%       70       32       38      204      204        0        0        0       40       17        0        29       32         0       38         3        45
     17  0xff11003fdc96ba00     1     138    0.69%       70       37       33      207      207        0        0        0       33       18        0        43       37         0       33         4        39
     18  0xff11003fdcd2ba00     1     109    0.69%       70       31       39      184      184        0        0        0       28       18        0        22       31         0       39         0        46
     19  0xff11001ffff6ba00     0     141    0.68%       69       43       26      193      193        0        0        0       38       23        0        29       43         0       26         4        30
     20  0xff11001ffffaba00     0     142    0.68%       69       36       33      204      204        0        0        0       35       23        0        38       36         0       33         2        37
     21  0xff11001fffe6ba00     0     133    0.67%       68       42       26      195      195        0        0        0       39       18        0        38       42         0       26         2        30
     22  0xff1100200016ba00     0     117    0.65%       66       42       24      185      185        0        0        0       37       10        0        40       42         0       24         5        27
     23  0xff1100209ae58b40     1       1    0.65%       66       66        0      210      210        0        0        0        1        0        0        40       66         0        0        18        85
     24  0xff11003fdc8aba00     1     126    0.65%       66       32       34      192      192        0        0        0       37       17        0        30       32         0       34         3        39
     25  0xff11001fffceba00     0     160    0.64%       65       30       35      222      222        0        0        0       40       26        0        48       30         0       35         3        40
     26  0xff11003fdcdeba00     1     121    0.64%       65       37       28      182      182        0        0        0       34       11        0        34       37         0       28         3        35
     27  0xff11001fffdaba00     0      92    0.63%       64       38       26      170      170        0        0        0       29       11        0        30       38         0       26         4        32
     28  0xff110020000aba00     0     159    0.63%       64       37       27      189      189        0        0        0       43       17        0        30       37         0       27         2        33
     29  0xff11003fdc82ba00     1     137    0.62%       63       30       33      200      200        0        0        0       43       20        0        33       30         0       33         5        36
     30  0xff1100200006ba00     0     127    0.61%       62       34       28      184      184        0        0        0       33       20        0        34       34         0       28         3        32
     31  0xff11003fdcbaba00     1     139    0.61%       62       35       27      180      180        0        0        0       39       19        0        26       35         0       27         3        31
     32  0xff11001fffc6ba00     0     130    0.60%       61       37       24      185      185        0        0        0       41       23        0        30       37         0       24         3        27
     33  0xff11001fffd2ba00     0     106    0.60%       61       29       32      167      167        0        0        0       26       17        0        26       29         0       32         1        36
     34  0xff11001fffd6ba00     0      85    0.60%       61       31       30      166      166        0        0        0       29        7        0        33       31         0       30         1        35
     35  0xff11001fffbeba00     0     118    0.59%       60       44       16      161      161        0        0        0       38       13        0        28       44         0       16         6        16
     36  0xff11001ffffeba00     0     144    0.59%       60       41       19      194      194        0        0        0       35       23        0        46       41         0       19         4        26
     37  0xff110020000eba00     0     133    0.59%       60       36       24      180      180        0        0        0       34       18        0        32       36         0       24         5        31
     38  0xff11003fdcc2ba00     1     121    0.59%       60       29       31      188      188        0        0        0       37       13        0        37       29         0       31         6        35
     39  0xff11003fdcbeba00     1     119    0.58%       59       32       27      171      171        0        0        0       30       15        0        33       32         0       27         3        31
     40  0xff11003fdcd6ba00     1     116    0.58%       59       29       30      174      174        0        0        0       20       22        0        39       29         0       30         1        33
     41  0xff11003fdceeba00     1      79    0.58%       59       35       24      151      151        0        0        0       28        4        0        28       35         0       24         0        32
     42  0xff11003fdcfaba00     1     105    0.58%       59       34       25      155      155        0        0        0       26       13        0        29       34         0       25         0        28
     43  0xff11001fffeeba00     0     113    0.57%       58       31       27      172      172        0        0        0       32       16        0        37       31         0       27         1        28
     44  0xff11003fdcb6ba00     1     128    0.56%       57       32       25      177      177        0        0        0       43       14        0        32       32         0       25         2        29
     45  0xff11003fdcfeba00     1      88    0.56%       57       33       24      144      144        0        0        0       29        8        0        16       33         0       24         4        30
     46  0xff11003fdc42ba00     1      86    0.55%       56       28       28      159      159        0        0        0       34       12        0        23       28         0       28         3        31
     47  0xff11003fdcaeba00     1     107    0.54%       55       22       33      154      154        0        0        0       24       11        0        24       22         0       33         1        39
     48  0xff11001fffcaba00     0     111    0.53%       54       32       22      156      156        0        0        0       35       11        0        28       32         0       22         1        27
     49  0xff11003fdcaaba00     1      87    0.52%       53       29       24      141      141        0        0        0       21        9        0        26       29         0       24         3        29
     50  0xff11003fdc86ba00     1      89    0.49%       50       30       20      147      147        0        0        0       27       11        0        31       30         0       20         2        26
     51  0xff11001fffdeba00     0      96    0.47%       48       26       22      144      144        0        0        0       25       10        0        28       26         0       22         7        26
     52  0xff11003fdc46ba00     1      98    0.47%       48       30       18      127      127        0        0        0       21        8        0        27       30         0       18         1        22
     53  0xff11003fdc6aba00     1      80    0.41%       41       20       21      121      121        0        0        0       24        8        0        22       20         0       21         1        25
     54  0xff11003fdcb2ba00     1      92    0.40%       40       20       20      137      137        0        0        0       29       12        0        27       20         0       20         2        27
     55  0xff110020d9ecc000     1      37    0.31%       31       15       16       68       68        0        0        0        9        2        0         8       15         0       16         0        18
     56  0xff1100406e91c000     1      57    0.27%       27       16       11       80       80        0        0        0       17        4        0        21       16         0       11         0        11
     57  0xff1100406f884000     1      59    0.27%       27       12       15       76       76        0        0        0       17        4        0        13       12         0       15         0        15
     58  0xff110001e3bd9bc0     0     125    0.26%       26       10       16      269      269        0        0        0       19        1        0       207       10         0       16         0        16
     59  0xff110001085fd700     0     196    0.25%       25       15       10      316      316        0        0        0        7       39        0       235       15         0       10         0        10
     60  0xff11000154dd8000     0      58    0.25%       25       12       13       87       87        0        0        0       24        5        0        20       12         0       13         0        13
     61  0xff1100406a59c000     1      60    0.25%       25       11       14       81       81        0        0        0       21        4        0        17       11         0       14         0        14
     62  0xff1100406d9f0000     1      40    0.25%       25       16        9       58       58        0        0        0        8        4        0        11       16         0        9         0        10
     63  0xff11000156a24000     0      51    0.24%       24       10       14       71       71        0        0        0       18        3        0        12       10         0       14         0        14
     64  0xff110001b4044000     0      66    0.24%       24       17        7       78       78        0        0        0       27        2        0        17       17         0        7         0         8
     65  0xff11000212360000     0      69    0.24%       24        9       15      104      104        0        0        0       40        4        0        21        9         0       15         0        15
     66  0xff1100209d074000     1      56    0.23%       23       11       12       67       67        0        0        0       19        5        0         7       11         0       12         0        13
     67  0xff11004070a10000     1      47    0.23%       23       11       12       74       74        0        0        0       21        1        0        17       11         0       12         0        12
     68  0xff11000109248000     0      50    0.22%       22       13        9       77       77        0        0        0       20        2        0        24       13         0        9         0         9
     69  0xff1100012d114000     0      57    0.22%       22       16        6       76       76        0        0        0       30        3        0        15       16         0        6         0         6
     70  0xff110020d9614000     1      31    0.22%       22       16        6       55       55        0        0        0       13        0        0        14       16         0        6         0         6
     71  0xff1100406a600000     1      55    0.22%       22       10       12       75       75        0        0        0       24        4        0        11       10         0       12         0        14
     72  0xffffffff835477c0     1       1    0.21%       21        9       12       33       33        0        0        0        0        0        0         0        9         0       12         0        12
     73  0xff11000270708000     0      48    0.21%       21        9       12       64       64        0        0        0       15        6        0         9        9         0       12         0        13
     74  0xff11000117070000     0      66    0.20%       20        8       12       95       95        0        0        0       39        5        0        19        8         0       12         0        12
     75  0xff1100012dd18000     0      57    0.20%       20       12        8       75       75        0        0        0       22        4        0        21       12         0        8         0         8
     76  0xff11000242dbc000     0      64    0.20%       20       12        8       86       86        0        0        0       31        2        0        23       12         0        8         0        10
     77  0xff1100027104c000     0      58    0.20%       20        9       11       82       82        0        0        0       24        4        0        21        9         0       11         0        13
     78  0xff1100406bca4000     1      43    0.20%       20        5       15       56       56        0        0        0        8        2        0        11        5         0       15         0        15
     79  0xff110001091c4000     0      60    0.19%       19       12        7       77       77        0        0        0       17        4        0        29       12         0        7         0         8
     80  0xff1100012bab8000     0      69    0.19%       19        7       12       95       95        0        0        0       33        9        0        22        7         0       12         0        12
     81  0xff11000155a60000     0      58    0.19%       19        6       13       84       84        0        0        0       27        6        0        19        6         0       13         0        13
     82  0xff110001b4244000     0      49    0.19%       19       13        6       66       66        0        0        0       23        7        0        11       13         0        6         0         6
     83  0xff110001e207c000     0      56    0.19%       19       14        5       68       68        0        0        0       22        3        0        19       14         0        5         0         5
     84  0xff110002a24fff00     0       1    0.19%       19       19        0      197      197        0        0        0        8        0        0       160       19         0        0         0        10
     85  0xff110020b736c000     1      56    0.19%       19       13        6       70       70        0        0        0       26        4        0        15       13         0        6         0         6
     86  0xff110020ba320000     1      60    0.19%       19       11        8       68       68        0        0        0       18        3        0        20       11         0        8         0         8
     87  0xff110040755c4000     1      58    0.19%       19       12        7       73       73        0        0        0       20       10        0        16       12         0        7         0         8
     88  0xff110001092a0000     0      37    0.18%       18       12        6       56       56        0        0        0       16        0        0        15       12         0        6         0         7
     89  0xff1100015696c000     0      58    0.18%       18       10        8       77       77        0        0        0       29        3        0        19       10         0        8         0         8
     90  0xff110001866f8000     0      57    0.18%       18        7       11       84       84        0        0        0       35        4        0        16        7         0       11         0        11
     91  0xff11004072f84000     1      40    0.18%       18       12        6       56       56        0        0        0       15        4        0        13       12         0        6         0         6
     92  0xff110001163d8000     0      46    0.17%       17        4       13       63       63        0        0        0       18        1        0        13        4         0       13         0        14
     93  0xff1100406bca4040     1       1    0.17%       17       17        0       34       34        0        0        0        1        0        0         5       17         0        0         0        11
     94  0xff11004070424000     1      41    0.17%       17        8        9       64       64        0        0        0       20        5        0        13        8         0        9         0         9
     95  0xff11004079f7c000     1      52    0.17%       17       11        6       52       52        0        0        0        8        4        0        17       11         0        6         0         6
     96  0xff11000117b9c000     0      55    0.16%       16        5       11       76       76        0        0        0       26        7        0        16        5         0       11         0        11
     97  0xff110002408b8000     0      78    0.16%       16        6       10       86       86        0        0        0       38        9        0        13        6         0       10         0        10
     98  0xff110002a1a88000     0      48    0.16%       16        7        9       64       64        0        0        0       15        5        0        19        7         0        9         0         9
     99  0xff11004072594000     1      53    0.16%       16        8        8       58       58        0        0        0       17        4        0        12        8         0        8         0         9
    100  0xff11004073a30000     1      46    0.16%       16        9        7       61       61        0        0        0       21        4        0        12        9         0        7         0         8
    101  0xff11000117b9c040     0       1    0.15%       15       15        0       31       31        0        0        0        0        0        0         6       15         0        0         0        10
    102  0xff1100012f064000     0      63    0.15%       15       10        5       90       90        0        0        0       49        7        0        13       10         0        5         0         6
    103  0xff110020b736c040     1       1    0.15%       15       15        0       27       27        0        0        0        0        0        0         1       15         0        0         0        11
    104  0xff1100406f161bc0     1      32    0.15%       15       11        4      131      131        0        0        0       10        0        1       101       11         0        4         0         4
    105  0xff11000156a24040     0       3    0.14%       14       13        1       28       28        0        0        0        1        0        0         2       13         0        1         0        11
    106  0xff11000300658000     0      55    0.14%       14        9        5       65       65        0        0        0       24        3        0        19        9         0        5         0         5
    107  0xff1100406b92c040     1       1    0.14%       14       14        0       31       31        0        0        0        0        0        0         7       14         0        0         0        10
    108  0xff1100406d770000     1      36    0.14%       14        5        9       43       43        0        0        0        7        1        0        12        5         0        9         0         9
    109  0xff110001092a0040     0       1    0.13%       13       13        0       31       31        0        0        0        0        0        0         8       13         0        0         0        10
    110  0xff110020d7e88000     1      48    0.13%       13        9        4       44       44        0        0        0        9        3        0        15        9         0        4         0         4
    111  0xff1100406b92c000     1      40    0.13%       13        8        5       42       42        0        0        0       12        3        0         7        8         0        5         0         7
    112  0xff1100407419c000     1      48    0.13%       13       11        2       60       60        0        0        0       24        7        0        13       11         0        2         0         3
    113  0xff110001091c4040     0       1    0.12%       12       12        0       25       25        0        0        0        0        0        0         4       12         0        0         0         9
    114  0xff11000300658040     0       1    0.12%       12       12        0       22       22        0        0        0        0        0        0         4       12         0        0         0         6
    115  0xff110020ba320040     1       1    0.12%       12       12        0       33       33        0        0        0        2        0        0         3       12         0        0         0        16
    116  0xff11004070424040     1       5    0.12%       12       12        0       32       32        0        0        0        0        1        0         3       12         0        0         0        16
    117  0xff1100407bca8000     1      41    0.12%       12        8        4       42       42        0        0        0        8        4        0        13        8         0        4         0         5
    118  0xff11000155a60040     0       1    0.11%       11       11        0       29       29        0        0        0        0        0        0         6       11         0        0         0        12
    119  0xff110001e3bd9700     0      42    0.11%       11        5        6       82       82        0        0        0       43        0        0        22        5         0        6         0         6

=================================================
      Shared Cache Line Distribution Pareto      
=================================================
#
#        ----- HITM -----  -- Store Refs --  --------- Data address ---------                      ---------- cycles ----------    Total       cpu                                             Shared                         
#   Num  RmtHitm  LclHitm   L1 Hit  L1 Miss              Offset  Node  PA cnt        Code address  rmt hitm  lcl hitm      load  records       cnt                          Symbol             Object        Source:Line  Node
# .....  .......  .......  .......  .......  ..................  ....  ......  ..................  ........  ........  ........  .......  ........  ..............................  .................  .................  ....
#
  -------------------------------------------------------------
      0     1310     1535        0        0  0xff110002a24ffc00
  -------------------------------------------------------------
           1.37%    0.20%    0.00%    0.00%                 0x0     0       1  0xffffffff81157a52       854       620       429      737        49  [k] rwsem_optimistic_spin       [kernel.kallsyms]  atomic64_64.h:190   0  1
           0.69%    0.65%    0.00%    0.00%                 0x0     0       1  0xffffffff81157a37       520       194       327      587        48  [k] rwsem_optimistic_spin       [kernel.kallsyms]  atomic64_64.h:22    0  1
           0.00%    0.13%    0.00%    0.00%                 0x0     0       1  0xffffffff81157e89         0       762       203        5         5  [k] rwsem_down_write_slowpath   [kernel.kallsyms]  atomic64_64.h:22    0  1
          54.43%   58.18%    0.00%    0.00%                 0x8     0       1  0xffffffff81157958       474       237       435     2611        50  [k] rwsem_spin_on_owner         [kernel.kallsyms]  atomic64_64.h:22    0  1
          40.76%   39.28%    0.00%    0.00%                 0x8     0       1  0xffffffff811578c7       408       185       324     1831        49  [k] rwsem_spin_on_owner         [kernel.kallsyms]  atomic64_64.h:22    0  1
           0.23%    0.20%    0.00%    0.00%                 0x8     0       1  0xffffffff811578fe       376       189       270      774        48  [k] rwsem_spin_on_owner         [kernel.kallsyms]  atomic64_64.h:22    0  1
           0.00%    0.07%    0.00%    0.00%                 0x8     0       1  0xffffffff81157c28         0       157       332      547        48  [k] rwsem_down_write_slowpath   [kernel.kallsyms]  atomic64_64.h:22    0  1
           1.22%    0.65%    0.00%    0.00%                0x10     0       1  0xffffffff811587b5      1775       740      1033      592        49  [k] osq_lock                    [kernel.kallsyms]  atomic.h:208        0  1
           1.07%    0.33%    0.00%    0.00%                0x10     0       1  0xffffffff811588d3      1065       396       443      703        49  [k] osq_unlock                  [kernel.kallsyms]  atomic.h:196        0  1
           0.08%    0.07%    0.00%    0.00%                0x14     0       1  0xffffffff81cb8d19      1502       437       583       17        34  [k] _raw_spin_lock_irqsave      [kernel.kallsyms]  atomic.h:202        0  1
           0.00%    0.20%    0.00%    0.00%                0x18     0       1  0xffffffff811573d6         0       248       623       10        10  [k] rwsem_mark_wake             [kernel.kallsyms]  rwsem.c:414         0  1
           0.00%    0.07%    0.00%    0.00%                0x18     0       1  0xffffffff81157e67         0       789         0        1         1  [k] rwsem_down_write_slowpath   [kernel.kallsyms]  rwsem.c:1245        0
           0.15%    0.00%    0.00%    0.00%                0x38     0       1  0xffffffff812f72d4       458         0       186      305        48  [k] unmap_region                [kernel.kallsyms]  mm.h:1945           0  1

  -------------------------------------------------------------
      1      108      136        0        0  0xff110002a24ffb80
  -------------------------------------------------------------
          57.41%   51.47%    0.00%    0.00%                0x10     0       1  0xffffffff812e24a4       386       157        62      212        44  [k] vmacache_find      [kernel.kallsyms]  vmacache.c:49   0  1
          42.59%   48.53%    0.00%    0.00%                0x18     0       1  0xffffffff812f73cc       392       164        86      181        44  [k] get_unmapped_area  [kernel.kallsyms]  mmap.c:2270     0  1

  -------------------------------------------------------------
      2      111      128        0        0  0xff1100209ae58b00
  -------------------------------------------------------------
          21.62%   25.78%    0.00%    0.00%                0x20     1       1  0xffffffff812f80f9       410       166       146      106        39  [k] __vma_adjust           [kernel.kallsyms]  mmap.c:536               0  1
          16.22%   13.28%    0.00%    0.00%                0x20     1       1  0xffffffff812f7537       401       170       277       87        38  [k] find_vma               [kernel.kallsyms]  mmap.c:2321              0  1
          49.55%   53.12%    0.00%    0.00%                0x28     1       1  0xffffffff812f92bb       406       168        84      204        47  [k] vm_unmapped_area       [kernel.kallsyms]  mmap.c:2065              0  1
          10.81%    7.81%    0.00%    0.00%                0x28     1       1  0xffffffff812f92bf       399       162        61       49        26  [k] vm_unmapped_area       [kernel.kallsyms]  mmap.c:2065              0  1
           1.80%    0.00%    0.00%    0.00%                0x28     1       1  0xffffffff812f7b99       424         0         0        4         2  [k] __vma_link_rb          [kernel.kallsyms]  mmap.c:434               0  1

  -------------------------------------------------------------
      3      123       99        0        0  0xff1100209ae58580
  -------------------------------------------------------------
          63.41%   40.40%    0.00%    0.00%                 0x0     1       1  0xffffffff812f906b       391       199        94      233        46  [k] vm_unmapped_area          [kernel.kallsyms]  mmap.c:2060              0  1
          12.20%   14.14%    0.00%    0.00%                 0x0     1       1  0xffffffff812f810d       387       159       159       62        29  [k] __vma_adjust              [kernel.kallsyms]  mmap.c:542               0  1
           1.63%   15.15%    0.00%    0.00%                 0x0     1       1  0xffffffff812f7dd1       398       174        72       28        20  [k] __vma_adjust              [kernel.kallsyms]  mmap.c:756               0  1
           8.94%    5.05%    0.00%    0.00%                 0x0     1       1  0xffffffff812f7544       384       163       236       41        23  [k] find_vma                  [kernel.kallsyms]  mmap.c:2317              0  1
           0.81%    0.00%    0.00%    0.00%                 0x0     1       1  0xffffffff812f79a3       365         0       147        6         5  [k] __vma_rb_erase            [kernel.kallsyms]  mmap.c:301               0  1
           0.00%    1.01%    0.00%    0.00%                 0x0     1       1  0xffffffff812f998a         0       160        87        9         9  [k] __do_munmap               [kernel.kallsyms]  mmap.c:301               0  1
           0.81%    0.00%    0.00%    0.00%                0x20     1       1  0xffffffff812f7b90       399         0       294       49        29  [k] __vma_link_rb             [kernel.kallsyms]  mmap.c:434               0  1
           0.00%    1.01%    0.00%    0.00%                0x28     1       1  0xffffffff812f80f9         0       184       193       42        23  [k] __vma_adjust              [kernel.kallsyms]  mmap.c:536               0  1
           8.94%    7.07%    0.00%    0.00%                0x30     1       1  0xffffffff812f7b99       378       167       126       34        16  [k] __vma_link_rb             [kernel.kallsyms]  mmap.c:434               0  1
           0.00%   13.13%    0.00%    0.00%                0x30     1       1  0xffffffff812f90a9         0       164       335       29        24  [k] vm_unmapped_area          [kernel.kallsyms]  mmap.c:2085              0  1
           3.25%    2.02%    0.00%    0.00%                0x30     1       1  0xffffffff812f9959       369       170       231       22        11  [k] __do_munmap               [kernel.kallsyms]  mmap.c:434               0  1
           0.00%    1.01%    0.00%    0.00%                0x30     1       1  0xffffffff812f796b         0       148       219        9         8  [k] __vma_rb_erase            [kernel.kallsyms]  mmap.c:434               0  1

[-- Attachment #5: c2c_old.log --]
[-- Type: text/plain, Size: 39473 bytes --]

=================================================
            Trace Event Information              
=================================================
  Total records                     :     419820
  Locked Load/Store Operations      :       8851
  Load Operations                   :      73936
  Loads - uncacheable               :          0
  Loads - IO                        :          0
  Loads - Miss                      :          0
  Loads - no mapping                :        650
  Load Fill Buffer Hit              :      19371
  Load L1D hit                      :      25362
  Load L2D hit                      :        621
  Load LLC hit                      :      16164
  Load Local HITM                   :       7915
  Load Remote HITM                  :       7527
  Load Remote HIT                   :          0
  Load Local DRAM                   :        795
  Load Remote DRAM                  :      10973
  Load MESI State Exclusive         :      10973
  Load MESI State Shared            :        795
  Load LLC Misses                   :      19295
  Load access blocked by data       :          0
  Load access blocked by address    :          0
  LLC Misses to Local DRAM          :        4.1%
  LLC Misses to Remote DRAM         :       56.9%
  LLC Misses to Remote cache (HIT)  :        0.0%
  LLC Misses to Remote cache (HITM) :       39.0%
  Store Operations                  :          0
  Store - uncacheable               :          0
  Store - no mapping                :          0
  Store L1D Hit                     :          0
  Store L1D Miss                    :          0
  No Page Map Rejects               :      71173
  Unable to parse data source       :     345884

=================================================
    Global Shared Cache Line Event Information   
=================================================
  Total Shared Cache Lines          :       1755
  Load HITs on shared lines         :      48124
  Fill Buffer Hits on shared lines  :      13949
  L1D hits on shared lines          :       7439
  L2D hits on shared lines          :         83
  LLC hits on shared lines          :      15246
  Locked Access on shared lines     :       6623
  Blocked Access on shared lines    :          0
  Store HITs on shared lines        :          0
  Store L1D hits on shared lines    :          0
  Total Merged records              :      15442

=================================================
                 c2c details                     
=================================================
  Events                            : cpu/mem-loads,ldlat=30/P
                                    : cpu/mem-stores/P
  Cachelines sort on                : Total HITMs
  Cacheline data grouping           : offset,iaddr
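
  Tables like the above come from the perf c2c tool.  The robot's exact
  command line is not part of this log; a plausible invocation matching
  the ldlat=30 event listed under "c2c details" would be:

        perf c2c record -a --ldlat 30 -- <workload>
        perf c2c report --stats --stdio   # the "Trace Event Information" part
        perf c2c report -NN --stdio       # cacheline tables with node info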

=================================================
           Shared Data Cache Line Table          
=================================================
#
#        ----------- Cacheline ----------      Tot  ------- Load Hitm -------    Total    Total    Total  ---- Stores ----  ----- Core Load Hit -----  - LLC Load Hit --  - RMT Load Hit --  --- Load Dram ----
# Index             Address  Node  PA cnt     Hitm    Total  LclHitm  RmtHitm  records    Loads   Stores    L1Hit   L1Miss       FB       L1       L2    LclHit  LclHitm    RmtHit  RmtHitm       Lcl       Rmt
# .....  ..................  ....  ......  .......  .......  .......  .......  .......  .......  .......  .......  .......  .......  .......  .......  ........  .......  ........  .......  ........  ........
#
      0  0xff11004077f52280     1    6828   25.29%     3905     1988     1917    10654    10654        0        0        0     1804     1048        2       757     1988         0     1917       158      2980
      1  0xff11004077f52240     1    4806   18.11%     2796      909     1887     9484     9484        0        0        0     1020     2800        0       465      909         0     1887       223      2180
      2  0xff11004077f52200     1     459    1.28%      197      114       83      748      748        0        0        0      377        1        1        89      114         0       83         0        83
      3  0xff11004077f522c0     1    1929    1.13%      175      171        4     1704     1704        0        0        0     1117        0        1       209      171         0        4        13       189
      4  0xff110002d1012400     0     857    1.02%      158       78       80      817      817        0        0        0      413       96        0        67       78         0       80         3        80
      5  0xff1100407dadc980     1     724    0.93%      144       58       86      792      792        0        0        0      465       46        0        49       58         0       86         0        88
      6  0xff1100407dadcd40     1    1036    0.87%      134       60       74     1157     1157        0        0        0      828       59        0        62       60         0       74         0        74
      7  0xff11003fdcdaba00     1     200    0.66%      102       55       47      278      278        0        0        0       47       21        0        51       55         0       47         4        53
      8  0xff110002d10123c0     0     158    0.63%       97       46       51      577      577        0        0        0      147       18        0       256       46         0       51         0        59
      9  0xff11003fdca6ba00     1     147    0.58%       89       45       44      258      258        0        0        0       44       15        0        49       45         0       44         3        58
     10  0xff110020001aba00     0     174    0.56%       86       46       40      253      253        0        0        0       31       28        0        57       46         0       40         4        47
     11  0xff11001ffffaba00     0     155    0.54%       84       37       47      236      236        0        0        0       40       19        0        35       37         0       47         4        54
     12  0xff11001fffdaba00     0     152    0.54%       83       40       43      237      237        0        0        0       34       21        0        40       40         0       43         3        56
     13  0xff11003fdca2ba00     1     169    0.54%       83       52       31      237      237        0        0        0       48       22        0        45       52         0       31         3        36
     14  0xff11003fdcdeba00     1     158    0.54%       83       40       43      232      232        0        0        0       38       21        0        33       40         0       43         5        52
     15  0xff11001fffe2ba00     0     136    0.53%       82       43       39      210      210        0        0        0       32       14        0        25       43         0       39         9        48
     16  0xff11001fffe6ba00     0     145    0.52%       81       43       38      228      228        0        0        0       47       12        0        40       43         0       38         4        44
     17  0xff11003fdcb2ba00     1     133    0.52%       81       36       45      242      242        0        0        0       43       23        0        36       36         0       45         1        58
     18  0xff11003fdcd6ba00     1     153    0.52%       80       40       40      237      237        0        0        0       42       15        0        43       40         0       40         6        51
     19  0xff11001fffeaba00     0     138    0.51%       78       42       36      220      220        0        0        0       31       18        0        42       42         0       36         5        46
     20  0xff11003fdcc2ba00     1     142    0.51%       78       40       38      217      217        0        0        0       40       18        0        31       40         0       38         3        47
     21  0xff11001fffdeba00     0     143    0.50%       77       42       35      218      218        0        0        0       31       27        0        39       42         0       35         2        42
     22  0xff11001fffeeba00     0     154    0.50%       77       34       43      226      226        0        0        0       37       19        0        38       34         0       43         4        51
     23  0xff11004077f52580     1       1    0.50%       77       77        0      228      228        0        0        0       23        0        0        61       77         0        0         1        66
     24  0xff11001ffff2ba00     0     138    0.49%       76       41       35      223      223        0        0        0       37       17        0        40       41         0       35         5        48
     25  0xff11003fdcc6ba00     1     155    0.49%       76       44       32      224      224        0        0        0       44       23        0        37       44         0       32         4        40
     26  0xff11003fdccaba00     1     123    0.49%       76       48       28      195      195        0        0        0       31       20        0        29       48         0       28         2        37
     27  0xff11001fffd6ba00     0     155    0.48%       74       41       33      232      232        0        0        0       36       26        0        51       41         0       33         8        37
     28  0xff1100200002ba00     0     174    0.48%       74       34       40      240      240        0        0        0       42       21        0        49       34         0       40         0        54
     29  0xff11003fdcbaba00     1     158    0.48%       74       33       41      235      235        0        0        0       40       17        0        50       33         0       41         3        51
     30  0xff11003fdcd2ba00     1     167    0.48%       74       37       37      242      242        0        0        0       57       17        0        45       37         0       37         5        44
     31  0xff11001ffff6ba00     0     146    0.47%       73       33       40      215      215        0        0        0       35       26        0        32       33         0       40         4        45
     32  0xff11003fdcb6ba00     1     144    0.47%       73       32       41      217      217        0        0        0       46       13        0        33       32         0       41         4        48
     33  0xff110020000aba00     0     146    0.47%       72       29       43      220      220        0        0        0       33       22        0        39       29         0       43         6        48
     34  0xff11003fdceaba00     1     111    0.46%       71       36       35      189      189        0        0        0       18       18        0        38       36         0       35         3        41
     35  0xff11001fffc2ba00     0     138    0.45%       70       33       37      217      217        0        0        0       38       21        0        37       33         0       37         3        48
     36  0xff11001ffffeba00     0     152    0.45%       70       39       31      204      204        0        0        0       29       23        0        39       39         0       31         7        36
     37  0xff1100200016ba00     0     154    0.45%       70       36       34      241      241        0        0        0       47       29        0        39       36         0       34         6        50
     38  0xff11001fffceba00     0     129    0.44%       68       28       40      219      219        0        0        0       36       23        0        41       28         0       40         5        46
     39  0xff11003fdcf2ba00     1     118    0.44%       68       32       36      206      206        0        0        0       33       18        0        41       32         0       36         1        45
     40  0xff1100200012ba00     0     126    0.43%       67       37       30      203      203        0        0        0       27       25        0        37       37         0       30         8        39
     41  0xff11003fdcceba00     1     142    0.43%       67       28       39      223      223        0        0        0       42       23        0        41       28         0       39         2        48
     42  0xff11003fdcfaba00     1     119    0.43%       67       31       36      200      200        0        0        0       36       14        0        34       31         0       36         3        46
     43  0xff110020000eba00     0     106    0.42%       65       36       29      174      174        0        0        0       25       16        0        23       36         0       29         2        43
     44  0xff11003fdcaaba00     1     144    0.42%       65       32       33      216      216        0        0        0       46       23        0        36       32         0       33         2        44
     45  0xff110020001eba00     0     136    0.41%       64       24       40      223      223        0        0        0       44       17        0        39       24         0       40         4        55
     46  0xff11003fdceeba00     1     115    0.41%       63       32       31      200      200        0        0        0       37       12        0        46       32         0       31         4        38
     47  0xff11003fdcaeba00     1     134    0.40%       61       32       29      196      196        0        0        0       37       24        0        27       32         0       29         5        42
     48  0xff11003fdce6ba00     1     126    0.40%       61       36       25      184      184        0        0        0       24       19        0        49       36         0       25         3        28
     49  0xff11003fdcfeba00     1     107    0.39%       60       36       24      183      183        0        0        0       38        9        0        37       36         0       24         3        36
     50  0xff11003fdce2ba00     1     126    0.38%       58       33       25      187      187        0        0        0       38       13        0        37       33         0       25         8        33
     51  0xff11001fffd2ba00     0     108    0.36%       56       31       25      154      154        0        0        0       26       16        0        23       31         0       25         0        33
     52  0xff11004077f52540     1       1    0.36%       56       56        0      155      155        0        0        0        6        0        0        40       56         0        0         2        51
     53  0xff11001fffcaba00     0     100    0.36%       55       31       24      172      172        0        0        0       27       17        0        35       31         0       24         3        35
     54  0xff11003fdcf6ba00     1     105    0.35%       54       29       25      163      163        0        0        0       23       20        0        25       29         0       25         5        36
     55  0xff1100407dadcd80     1     183    0.34%       53       43       10      345      345        0        0        0      156       17        0        86       43         0       10         5        28
     56  0xff11001fffc6ba00     0     129    0.34%       52       20       32      201      201        0        0        0       48       18        0        36       20         0       32         3        44
     57  0xff1100200006ba00     0     152    0.34%       52       21       31      201      201        0        0        0       49       22        0        38       21         0       31         4        36
     58  0xff11004077f524c0     1     397    0.34%       52       52        0      626      626        0        0        0      445        0        0        63       52         0        0         1        65
     59  0xff11003fdcbeba00     1      82    0.32%       50       22       28      141      141        0        0        0       24       13        0        20       22         0       28         3        31
     60  0xff110020db958000     1      85    0.32%       49       18       31      129      129        0        0        0       25       13        0        11       18         0       31         0        31
     61  0xff110001b550c000     0     102    0.26%       40       15       25      131      131        0        0        0       23       20        0        23       15         0       25         0        25
     62  0xff11000185270000     0      96    0.24%       37       16       21      123      123        0        0        0       30       11        0        24       16         0       21         0        21
     63  0xff110040745f0000     1     100    0.24%       37       16       21      132      132        0        0        0       31       15        0        27       16         0       21         0        22
     64  0xff11000116e68000     0     102    0.23%       36       18       18      131      131        0        0        0       36       20        0        21       18         0       18         0        18
     65  0xff110002405e4000     0     100    0.23%       36       13       23      131      131        0        0        0       36       18        0        17       13         0       23         1        23
     66  0xff1100209ebfc000     1     117    0.23%       36       13       23      146      146        0        0        0       38       16        0        33       13         0       23         0        23
     67  0xff11004075200000     1      83    0.23%       36       15       21      132      132        0        0        0       34       18        0        23       15         0       21         0        21
     68  0xff1100209c38c000     1      98    0.23%       35       17       18      139      139        0        0        0       37       20        0        29       17         0       18         0        18
     69  0xff11003fdc5eba00     1      69    0.23%       35       20       15      104      104        0        0        0       21       14        0        10       20         0       15         2        22
     70  0xff1100407546c000     1     101    0.23%       35       17       18      147      147        0        0        0       54       19        0        20       17         0       18         0        19
     71  0xff1100012d434000     0     115    0.22%       34       17       17      147      147        0        0        0       43       22        0        29       17         0       17         0        19
     72  0xff110002704b4000     0      81    0.22%       34       17       17      116      116        0        0        0       29       14        0        21       17         0       17         0        18
     73  0xff110002d0b18000     0      90    0.22%       34       17       17      114      114        0        0        0       34       13        0        14       17         0       17         0        19
     74  0xff1100406c458000     1     105    0.22%       34       16       18      127      127        0        0        0       34       23        0        18       16         0       18         0        18
     75  0xff1100407eeb8000     1      97    0.22%       34       16       18      133      133        0        0        0       46       19        0        16       16         0       18         0        18
     76  0xff110020b69bc000     1     105    0.21%       33       16       17      134      134        0        0        0       45       16        0        22       16         0       17         0        18
     77  0xffffffff835477c0     1       1    0.21%       32       13       19       51       51        0        0        0        0        0        0         0       13         0       19         0        19
     78  0xff110001b2a1ba00     0     215    0.21%       32       19       13      330      330        0        0        0        4       37        0       244       19         0       13         0        13
     79  0xff110002720d8000     0      98    0.21%       32       15       17      129      129        0        0        0       33       20        0        27       15         0       17         0        17
     80  0xff1100209cf60000     1      92    0.21%       32       15       17      108      108        0        0        0       30       11        0        18       15         0       17         0        17
     81  0xff1100011628c000     0      95    0.20%       31       17       14      112      112        0        0        0       30       12        0        25       17         0       14         0        14
     82  0xff110020b5cac000     1      93    0.20%       31       14       17      133      133        0        0        0       49       13        0        22       14         0       17         0        18
     83  0xff11000156b04000     0      90    0.19%       30       15       15      100      100        0        0        0       27       13        0        15       15         0       15         0        15
     84  0xff110001570cc000     0     104    0.19%       30       18       12      106      106        0        0        0       24       19        0        21       18         0       12         0        12
     85  0xff1100407ed74000     1      70    0.19%       30       14       16       89       89        0        0        0       23        7        0        13       14         0       16         0        16
     86  0xff1100209d8fc000     1      87    0.19%       29        9       20      121      121        0        0        0       39       16        0        16        9         0       20         0        21
     87  0xff1100407db84000     1      78    0.19%       29       14       15      103      103        0        0        0       32        8        0        19       14         0       15         0        15
     88  0xff11000116638000     0      84    0.18%       28       15       13      116      116        0        0        0       32       17        0        26       15         0       13         0        13
     89  0xff110001b2fb4000     0      59    0.18%       28       15       13       79       79        0        0        0       15        7        0        16       15         0       13         0        13
     90  0xff1100010927c000     0      87    0.17%       27       13       14      134      134        0        0        0       60       15        0        17       13         0       14         0        15
     91  0xff11000185274000     0      88    0.17%       27       13       14      112      112        0        0        0       24       20        0        26       13         0       14         0        15
     92  0xff1100209a4f4000     1      91    0.17%       27       13       14      120      120        0        0        0       36       14        0        28       13         0       14         0        15
     93  0xff110001b343c000     0      96    0.17%       26        9       17      119      119        0        0        0       29       27        0        19        9         0       17         0        18
     94  0xff1100021189c000     0      87    0.17%       26       12       14      102      102        0        0        0       22       19        0        21       12         0       14         0        14
     95  0xff110001e3704000     0      78    0.16%       25       17        8       80       80        0        0        0       17       15        0        13       17         0        8         0        10
     96  0xff110002703b0000     0      95    0.16%       25       12       13      129      129        0        0        0       41       25        0        23       12         0       13         0        15
     97  0xff110020b5ef0000     1      91    0.16%       25       10       15      111      111        0        0        0       29       16        0        26       10         0       15         0        15
     98  0xff110020dd11c000     1      90    0.16%       24       12       12      106      106        0        0        0       32       15        0        22       12         0       12         1        12
     99  0xff11000116b38000     0      69    0.15%       23        9       14      103      103        0        0        0       30       20        0        16        9         0       14         0        14
    100  0xff11000211070000     0      85    0.15%       23        9       14      100      100        0        0        0       26       18        0        18        9         0       14         0        15
    101  0xff1100407303c000     1      90    0.15%       23        8       15      119      119        0        0        0       40       18        0        20        8         0       15         0        18
    102  0xff1100407db80000     1      84    0.15%       23       10       13      109      109        0        0        0       44        8        1        20       10         0       13         0        13
    103  0xff1100012df30000     0     102    0.14%       22       13        9      108      108        0        0        0       29       25        0        22       13         0        9         0        10
    104  0xff110020b69b8000     1      84    0.14%       21        9       12      117      117        0        0        0       43       19        0        21        9         0       12         0        13
    105  0xff110001e1ff0000     0      87    0.13%       20        9       11      110      110        0        0        0       42       21        0        14        9         0       11         0        13
    106  0xff1100407348c000     1      75    0.13%       20        9       11       98       98        0        0        0       27       19        0        20        9         0       11         0        12
    107  0xff11004075204000     1      81    0.13%       20        9       11      104      104        0        0        0       30       21        0        21        9         0       11         0        12
    108  0xff110020b7eb8000     1      64    0.12%       19        7       12       93       93        0        0        0       45        5        0        12        7         0       12         0        12
    109  0xff110001e47a0000     0      81    0.12%       18        5       13       93       93        0        0        0       24       21        0        16        5         0       13         0        14
    110  0xff1100010923c000     0      91    0.11%       17        9        8       92       92        0        0        0       30       17        0        18        9         0        8         0        10
    111  0xff110001b550c040     0       3    0.10%       16       16        0       40       40        0        0        0        4        0        0         2       16         0        0         0        18
    112  0xff110002d10124c0     0     428    0.10%       16       14        2      646      646        0        0        0      357        3        0       236       14         0        2         0        34
    113  0xff11003fdc6aba00     1      35    0.10%       16        9        7       47       47        0        0        0       10        8        0         3        9         0        7         0        10

=================================================
      Shared Cache Line Distribution Pareto      
=================================================
#
#        ----- HITM -----  -- Store Refs --  --------- Data address ---------                      ---------- cycles ----------    Total       cpu                                             Shared                        
#   Num  RmtHitm  LclHitm   L1 Hit  L1 Miss              Offset  Node  PA cnt        Code address  rmt hitm  lcl hitm      load  records       cnt                          Symbol             Object       Source:Line  Node
# .....  .......  .......  .......  .......  ..................  ....  ......  ..................  ........  ........  ........  .......  ........  ..............................  .................  ................  ....
#
  -------------------------------------------------------------
      0     1917     1988        0        0  0xff11004077f52280
  -------------------------------------------------------------
          53.31%   49.04%    0.00%    0.00%                 0x0     1       1  0xffffffff81157918       497       277       465     3558        56  [k] rwsem_spin_on_owner         [kernel.kallsyms]  atomic64_64.h:22   0  1
          30.57%   35.87%    0.00%    0.00%                 0x0     1       1  0xffffffff81157887       413       228       392     2686        55  [k] rwsem_spin_on_owner         [kernel.kallsyms]  atomic64_64.h:22   0  1
          13.72%   12.02%    0.00%    0.00%                 0x0     1       1  0xffffffff81157be8       483       187       328      836        52  [k] rwsem_down_write_slowpath   [kernel.kallsyms]  atomic64_64.h:22   0  1
           0.37%    0.55%    0.00%    0.00%                 0x0     1       1  0xffffffff811578be       411       213       309     1156        52  [k] rwsem_spin_on_owner         [kernel.kallsyms]  atomic64_64.h:22   0  1
           0.21%    0.60%    0.00%    0.00%                 0x0     1       1  0xffffffff81157a71       557       255       442       71        36  [k] rwsem_optimistic_spin       [kernel.kallsyms]  atomic64_64.h:22   0  1
           0.05%    0.15%    0.00%    0.00%                 0x0     1       1  0xffffffff811577e8       379       445       247      194        52  [k] downgrade_write             [kernel.kallsyms]  atomic64_64.h:22   0  1
           0.05%    0.05%    0.00%    0.00%                 0x0     1       1  0xffffffff81157760       386       172       300        5         4  [k] up_read                     [kernel.kallsyms]  atomic64_64.h:22   0  1
           1.10%    0.25%    0.00%    0.00%                 0x8     1       1  0xffffffff81158775      1059       580       939      617        54  [k] osq_lock                    [kernel.kallsyms]  atomic.h:208       0  1
           0.26%    0.40%    0.00%    0.00%                 0x8     1       1  0xffffffff81158893      1153       408       561      774        55  [k] osq_unlock                  [kernel.kallsyms]  atomic.h:196       0  1
           0.05%    0.00%    0.00%    0.00%                 0xc     1       1  0xffffffff81cb8d19       560         0       548       34        46  [k] _raw_spin_lock_irqsave      [kernel.kallsyms]  atomic.h:202       0  1
           0.05%    0.15%    0.00%    0.00%                0x10     1       1  0xffffffff81157396       796       234       436       42        30  [k] rwsem_mark_wake             [kernel.kallsyms]  rwsem.c:414        0  1
           0.16%    0.05%    0.00%    0.00%                0x10     1       1  0xffffffff811576ba       543       209       554       26        20  [k] rwsem_wake.isra.0           [kernel.kallsyms]  list.h:282         0  1
           0.10%    0.86%    0.00%    0.00%                0x38     1       1  0xffffffff812f96c2       408       166       146      158        46  [k] __do_munmap                 [kernel.kallsyms]  mm.h:1951          0  1

  -------------------------------------------------------------
      1     1887      909        0        0  0xff11004077f52240
  -------------------------------------------------------------
           0.58%    2.42%    0.00%    0.00%                 0x8     1       1  0xffffffff812f8eea       449       214       266       83        33  [k] vm_unmapped_area           [kernel.kallsyms]  mmap.c:2048         0  1
           0.00%    1.87%    0.00%    0.00%                0x10     1       1  0xffffffff812eaec3         0       172       130      124        48  [k] free_pgd_range             [kernel.kallsyms]  pgtable.h:106       0  1
           0.32%    0.22%    0.00%    0.00%                0x10     1       1  0xffffffff812ed581       410       320        86      156        46  [k] unmap_page_range           [kernel.kallsyms]  pgtable.h:106       0  1
           0.48%    0.44%    0.00%    0.00%                0x18     1       1  0xffffffff81125dab       407       332        96       24        13  [k] finish_task_switch         [kernel.kallsyms]  atomic.h:29         0  1
           0.00%    0.22%    0.00%    0.00%                0x20     1       1  0xffffffff81cb2ed4         0       749       388       11        15  [k] __schedule                 [kernel.kallsyms]  atomic.h:95         0  1
           0.00%    0.11%    0.00%    0.00%                0x20     1       1  0xffffffff81125cad         0      1506       641       11        14  [k] finish_task_switch         [kernel.kallsyms]  atomic.h:123        0  1
           7.53%    5.39%    0.00%    0.00%                0x30     1       1  0xffffffff812fb7d5       619       216       667      466        52  [k] do_mmap                    [kernel.kallsyms]  mmap.c:1445         0  1
           4.72%    2.53%    0.00%    0.00%                0x30     1       1  0xffffffff812f9533       459       172       184      250        50  [k] __do_munmap                [kernel.kallsyms]  mmap.c:2853         0  1
           2.17%    2.86%    0.00%    0.00%                0x30     1       1  0xffffffff812f800f       405       169       220      233        52  [k] __vma_adjust               [kernel.kallsyms]  mmap.c:719          0  1
           0.05%    1.21%    0.00%    0.00%                0x30     1       1  0xffffffff812f7f3d       421       165       302       41        42  [k] __vma_adjust               [kernel.kallsyms]  mmap.c:958          0  1
           0.11%    0.22%    0.00%    0.00%                0x30     1       1  0xffffffff812f964c       487       156       335      123        55  [k] __do_munmap                [kernel.kallsyms]  mmap.c:2697         0  1
          69.79%   75.91%    0.00%    0.00%                0x38     1       1  0xffffffff811579f7       436       177       342     3585        54  [k] rwsem_optimistic_spin      [kernel.kallsyms]  atomic64_64.h:22    0  1
          13.78%    5.28%    0.00%    0.00%                0x38     1       1  0xffffffff81157a12      1240       483       484     1354        55  [k] rwsem_optimistic_spin      [kernel.kallsyms]  atomic64_64.h:190   0  1
           0.16%    0.44%    0.00%    0.00%                0x38     1       1  0xffffffff81157794      1128       695       539      545        54  [k] up_write                   [kernel.kallsyms]  atomic64_64.h:172   0  1
           0.00%    0.55%    0.00%    0.00%                0x38     1       1  0xffffffff81157a7d         0       221       314      552        52  [k] rwsem_optimistic_spin      [kernel.kallsyms]  atomic64_64.h:22    0  1
           0.11%    0.11%    0.00%    0.00%                0x38     1       1  0xffffffff811577e3      1386       415       381      355        54  [k] downgrade_write            [kernel.kallsyms]  atomic64_64.h:172   0  1
           0.11%    0.00%    0.00%    0.00%                0x38     1       1  0xffffffff81cb6168      1250         0       502      930        55  [k] down_write_killable        [kernel.kallsyms]  atomic64_64.h:190   0  1
           0.00%    0.11%    0.00%    0.00%                0x38     1       1  0xffffffff81157c9c         0       146         0        1         1  [k] rwsem_down_write_slowpath  [kernel.kallsyms]  atomic64_64.h:22    0
           0.00%    0.11%    0.00%    0.00%                0x38     1       1  0xffffffff81157ce6         0       230      2218        2         2  [k] rwsem_down_write_slowpath  [kernel.kallsyms]  atomic64_64.h:22    0
           0.05%    0.00%    0.00%    0.00%                0x38     1       1  0xffffffff81157e2c       394         0         0        2         1  [k] rwsem_down_write_slowpath  [kernel.kallsyms]  atomic64_64.h:22    0
           0.05%    0.00%    0.00%    0.00%                0x38     1       1  0xffffffff81157e49       501         0        82        5         4  [k] rwsem_down_write_slowpath  [kernel.kallsyms]  atomic64_64.h:22    0  1

  -------------------------------------------------------------
      2       83      114        0        0  0xff11004077f52200
  -------------------------------------------------------------
          49.40%   41.23%    0.00%    0.00%                0x10     1       1  0xffffffff812e2464       391       166       141      173        46  [k] vmacache_find      [kernel.kallsyms]  vmacache.c:49   0  1
          50.60%   58.77%    0.00%    0.00%                0x18     1       1  0xffffffff812f728c       394       165        68      203        48  [k] get_unmapped_area  [kernel.kallsyms]  mmap.c:2270     0  1

  -------------------------------------------------------------
      3        4      171        0        0  0xff11004077f522c0
  -------------------------------------------------------------
           0.00%    5.85%    0.00%    0.00%                 0x0     1       1  0xffffffff812f96b0         0       170       247       90        39  [k] __do_munmap              [kernel.kallsyms]  mm.h:1951      0  1
           0.00%    4.68%    0.00%    0.00%                 0x0     1       1  0xffffffff812fb228         0       176       236       47        54  [k] mmap_region              [kernel.kallsyms]  mmap.c:3384    0  1
          25.00%    1.17%    0.00%    0.00%                 0x0     1       1  0xffffffff811f12f0       453       726       273       34        27  [k] __acct_update_integrals  [kernel.kallsyms]  tsacct.c:140   0  1
           0.00%    1.17%    0.00%    0.00%                 0x0     1       1  0xffffffff812fa195         0       200       210      361        53  [k] may_expand_vm            [kernel.kallsyms]  mmap.c:3359    0  1
           0.00%   22.22%    0.00%    0.00%                 0x8     1       1  0xffffffff812f959a         0       166       178      177        51  [k] __do_munmap              [kernel.kallsyms]  mmap.c:2889    0  1
           0.00%    0.58%    0.00%    0.00%                0x18     1       1  0xffffffff812fa1cf         0       177       208      306        52  [k] may_expand_vm            [kernel.kallsyms]  mmap.c:3363    0  1
          75.00%   64.33%    0.00%    0.00%                0x30     1       1  0xffffffff812fb836       371       234       237      329        51  [k] do_mmap                  [kernel.kallsyms]  mmap.c:1472    0  1

  -------------------------------------------------------------
      4       80       78        0        0  0xff110002d1012400
  -------------------------------------------------------------
           0.00%    2.56%    0.00%    0.00%                 0x0     0       1  0xffffffff812f8f1e         0       232       235       74        34  [k] vm_unmapped_area          [kernel.kallsyms]  mmap.c:2060              0  1
          22.50%    6.41%    0.00%    0.00%                0x10     0       1  0xffffffff812f7406       382       188       177       75        35  [k] find_vma                  [kernel.kallsyms]  mmap.c:2323              0  1
           1.25%    0.00%    0.00%    0.00%                0x10     0       1  0xffffffff812fb12f        71         0       101        6         4  [k] mmap_region               [kernel.kallsyms]  mmap.c:536               0  1
          76.25%   91.03%    0.00%    0.00%                0x20     0       1  0xffffffff812f917b       393       172        69      239        51  [k] vm_unmapped_area          [kernel.kallsyms]  mmap.c:2065              0  1

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression
  2021-06-06 10:16         ` Feng Tang
@ 2021-06-06 19:20           ` Linus Torvalds
  2021-06-06 22:13             ` Waiman Long
  2021-06-07  6:05             ` Feng Tang
  0 siblings, 2 replies; 13+ messages in thread
From: Linus Torvalds @ 2021-06-06 19:20 UTC (permalink / raw)
  To: Feng Tang, Waiman Long
  Cc: Jason Gunthorpe, kernel test robot, John Hubbard, Jan Kara,
	Peter Xu, Andrea Arcangeli, Aneesh Kumar K.V, Christoph Hellwig,
	Hugh Dickins, Jann Horn, Kirill Shutemov, Kirill Tkhai,
	Leon Romanovsky, Michal Hocko, Oleg Nesterov, Andrew Morton,
	LKML, lkp, kernel test robot, Huang, Ying, zhengjun.xing

[ Adding Waiman Long to the participants, because this seems to be a
very specific cacheline alignment behavior of rwsems, maybe Waiman has
some comments ]

On Sun, Jun 6, 2021 at 3:16 AM Feng Tang <feng.tang@intel.com> wrote:
>
> * perf-c2c: The hotspots (HITM) for the two kernels are different due
>   to the data structure change
>
>   - old kernel
>
>     - first cacheline
>         mmap_lock->count (75%)
>         mm->map_count (14%)
>
>     - second cacheline
>         mmap_lock->owner (97%)
>
>   - new kernel
>
>     mainly in the cacheline of 'mmap_lock'
>
>     mmap_lock->count (~2%)
>     mmap_lock->owner (95%)

Oooh.

It looks like pretty much all the contention is on mmap_lock, and the
difference is that the old kernel just _happened_ to split the
mmap_lock rwsem at *exactly* the right place.

The rw_semaphore structure looks like this:

        struct rw_semaphore {
                atomic_long_t count;
                atomic_long_t owner;
                struct optimistic_spin_queue osq; /* spinner MCS lock */
                ...

and before the addition of the 'write_protect_seq' field, the mmap_sem
was at offset 120 in 'struct mm_struct'.

Which meant that count and owner were in two different cachelines, and
then when you have contention and spend time in
rwsem_down_write_slowpath(), this is probably *exactly* the kind of
layout you want.
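
To make the offset arithmetic concrete, here is a minimal user-space
sketch - simplified stand-in types rather than the real kernel
definitions, and assuming 64-byte cachelines with an 8-byte
atomic_long_t:

        #include <assert.h>
        #include <stddef.h>
        #include <stdio.h>

        /* simplified stand-ins, not the kernel's definitions */
        struct rwsem_sketch { long count; long owner; };

        struct mm_sketch {
                char before[120];              /* fields preceding mmap_sem */
                struct rwsem_sketch mmap_sem;  /* offset 120, old layout */
        };

        int main(void)
        {
                size_t count = offsetof(struct mm_sketch, mmap_sem.count);
                size_t owner = offsetof(struct mm_sketch, mmap_sem.owner);

                /* count ends cacheline 1 (bytes 64..127), and owner
                 * starts cacheline 2 (bytes 128..191) */
                printf("count at %zu (line %zu), owner at %zu (line %zu)\n",
                       count, count / 64, owner, owner / 64);
                assert(count / 64 != owner / 64);
                return 0;
        }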

Because first the rwsem_write_trylock() will do a cmpxchg on the first
cacheline (for the optimistic fast-path), and then in the case of
contention, rwsem_down_write_slowpath() will just access the second
cacheline.

Which is probably just optimal for a load that spends a lot of time
contended - new waiters touch that first cacheline, and then they
queue themselves up on the second cacheline. Waiman, does that sound
believable?

Anyway, I'm certainly ok with the patch that just moves
'write_protect_seq' down, it might be worth commenting about how this
is about some very special cache layout of the mmap_sem part of the
structure.

That said, this means that it all is very subtly dependent on a lot of
kernel config options, and I'm not sure how relevant the exact kernel
options are that the test robot has been using. But even if this is
just a "kernel test robot reports", I think it's an interesting case
and worth a comment for when this happens next time...

                Linus
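
To make the cacheline arithmetic above concrete, here is a minimal,
self-contained C sketch (simplified stand-ins for the kernel
structures, not the real code; the offsets 120 and 128 are the ones
from the discussion):

	#include <stdio.h>
	#include <stddef.h>

	/* Stand-in for the two hot fields at the start of rw_semaphore. */
	struct rwsem_like {
		long count;	/* cmpxchg'd by the trylock fast path */
		long owner;	/* polled by the contended slow path */
	};

	static void report(const char *tag, size_t lock_off)
	{
		size_t count_off = lock_off + offsetof(struct rwsem_like, count);
		size_t owner_off = lock_off + offsetof(struct rwsem_like, owner);

		printf("%s: count in line %zu, owner in line %zu (%s)\n",
		       tag, count_off / 64, owner_off / 64,
		       count_off / 64 == owner_off / 64 ? "same" : "split");
	}

	int main(void)
	{
		report("old layout, mmap_lock at 120", 120);	/* split */
		report("new layout, mmap_lock at 128", 128);	/* same  */
		return 0;
	}

With the lock at offset 120, 'count' falls in the cacheline covering
bytes 64-127 and 'owner' in the one covering 128-191; at offset 128
both land in the 128-191 line, which matches the HITM data quoted
above.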

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression
  2021-06-06 19:20           ` Linus Torvalds
@ 2021-06-06 22:13             ` Waiman Long
  2021-06-07  6:05             ` Feng Tang
  1 sibling, 0 replies; 13+ messages in thread
From: Waiman Long @ 2021-06-06 22:13 UTC (permalink / raw)
  To: Linus Torvalds, Feng Tang
  Cc: Jason Gunthorpe, kernel test robot, John Hubbard, Jan Kara,
	Peter Xu, Andrea Arcangeli, Aneesh Kumar K.V, Christoph Hellwig,
	Hugh Dickins, Jann Horn, Kirill Shutemov, Kirill Tkhai,
	Leon Romanovsky, Michal Hocko, Oleg Nesterov, Andrew Morton,
	LKML, lkp, kernel test robot, Huang, Ying, zhengjun.xing

On 6/6/21 3:20 PM, Linus Torvalds wrote:
> [ Adding Waiman Long to the participants, because this seems to be a
> very specific cacheline alignment behavior of rwsems, maybe Waiman has
> some comments ]
>
> On Sun, Jun 6, 2021 at 3:16 AM Feng Tang <feng.tang@intel.com> wrote:
>> * perf-c2c: The hotspots(HITM) for 2 kernels are different due to the
>>    data structure change
>>
>>    - old kernel
>>
>>      - first cacheline
>>          mmap_lock->count (75%)
>>          mm->mapcount (14%)
>>
>>      - second cacheline
>>          mmap_lock->owner (97%)
>>
>>    - new kernel
>>
>>      mainly in the cacheline of 'mmap_lock'
>>
>>      mmap_lock->count (~2%)
>>      mmap_lock->owner (95%)
> Oooh.
>
> It looks like pretty much all the contention is on mmap_lock, and the
> difference is that the old kernel just _happened_ to split the
> mmap_lock rwsem at *exactly* the right place.
>
> The rw_semaphore structure looks like this:
>
>          struct rw_semaphore {
>                  atomic_long_t count;
>                  atomic_long_t owner;
>                  struct optimistic_spin_queue osq; /* spinner MCS lock */
>                  ...
>
> and before the addition of the 'write_protect_seq' field, the mmap_sem
> was at offset 120 in 'struct mm_struct'.
>
> Which meant that count and owner were in two different cachelines, and
> then when you have contention and spend time in
> rwsem_down_write_slowpath(), this is probably *exactly* the kind of
> layout you want.
>
> Because first the rwsem_write_trylock() will do a cmpxchg on the first
> cacheline (for the optimistic fast-path), and then in the case of
> contention, rwsem_down_write_slowpath() will just access the second
> cacheline.
>
> Which is probably just optimal for a load that spends a lot of time
> contended - new waiters touch that first cacheline, and then they
> queue themselves up on the second cacheline. Waiman, does that sound
> believable?

Yes, I think so.

The count field is accessed when a task tries to acquire the rwsem or
when an owner releases the lock. If the trylock fails, the writer will
go into the slowpath doing optimistic spinning on the owner field. As
a result, a lot of reads of owner are issued relative to the
reads/writes of count. Normally, there should only be one spinner (the
one holding the OSQ lock) spinning on owner, so the 9% performance
degradation seems a bit high to me. In the rare case that the head
waiter in the wait queue sets the handoff flag, the waiter may also
spin on owner, causing a bit more contention on the owner cacheline. I
will investigate this possibility further when I have time.

Cheers,
Longman
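
As a rough sketch of the access pattern described above (schematic
only: the real kernel rwsem packs flag bits into 'count' and 'owner'
and has OSQ queueing and handoff logic, all omitted here):

	#include <stdatomic.h>
	#include <stdbool.h>

	#define WRITER_LOCKED 1L

	struct sem {
		atomic_long count;	/* written once per acquire/release */
		atomic_long owner;	/* re-read continuously by spinners */
	};

	static bool write_trylock(struct sem *s)
	{
		long expected = 0;
		/* one atomic cmpxchg touching the 'count' cacheline */
		return atomic_compare_exchange_strong(&s->count, &expected,
						      WRITER_LOCKED);
	}

	static void write_lock_slowpath(struct sem *s)
	{
		while (!write_trylock(s)) {
			/* optimistic spin: hammers the 'owner' cacheline */
			while (atomic_load_explicit(&s->owner,
						    memory_order_relaxed))
				;	/* cpu_relax() in the kernel */
		}
	}

This is consistent with nearly all the HITM samples moving to 'owner'
once the lock is contended: 'count' sees one atomic per acquisition,
while 'owner' is loaded in a tight loop.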


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression
  2021-06-06 19:20           ` Linus Torvalds
  2021-06-06 22:13             ` Waiman Long
@ 2021-06-07  6:05             ` Feng Tang
  2021-06-08  0:03               ` Linus Torvalds
  1 sibling, 1 reply; 13+ messages in thread
From: Feng Tang @ 2021-06-07  6:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Waiman Long, Jason Gunthorpe, kernel test robot, John Hubbard,
	Jan Kara, Peter Xu, Andrea Arcangeli, Aneesh Kumar K.V,
	Christoph Hellwig, Hugh Dickins, Jann Horn, Kirill Shutemov,
	Kirill Tkhai, Leon Romanovsky, Michal Hocko, Oleg Nesterov,
	Andrew Morton, LKML, lkp, kernel test robot, Huang, Ying,
	zhengjun.xing

On Sun, Jun 06, 2021 at 12:20:46PM -0700, Linus Torvalds wrote:
> [ Adding Waiman Long to the participants, because this seems to be a
> very specific cacheline alignment behavior of rwsems, maybe Waiman has
> some comments ]
> 
> On Sun, Jun 6, 2021 at 3:16 AM Feng Tang <feng.tang@intel.com> wrote:
> >
> > * perf-c2c: The hotspots(HITM) for 2 kernels are different due to the
> >   data structure change
> >
> >   - old kernel
> >
> >     - first cacheline
> >         mmap_lock->count (75%)
> >         mm->mapcount (14%)
> >
> >     - second cacheline
> >         mmap_lock->owner (97%)
> >
> >   - new kernel
> >
> >     mainly in the cacheline of 'mmap_lock'
> >
> >     mmap_lock->count (~2%)
> >     mmap_lock->owner (95%)
> 
> Oooh.
> 
> It looks like pretty much all the contention is on mmap_lock, and the
> difference is that the old kernel just _happened_ to split the
> mmap_lock rwsem at *exactly* the right place.
> 
> The rw_semaphore structure looks like this:
> 
>         struct rw_semaphore {
>                 atomic_long_t count;
>                 atomic_long_t owner;
>                 struct optimistic_spin_queue osq; /* spinner MCS lock */
>                 ...
> 
> and before the addition of the 'write_protect_seq' field, the mmap_sem
> was at offset 120 in 'struct mm_struct'.
> 
> Which meant that count and owner were in two different cachelines, and
> then when you have contention and spend time in
> rwsem_down_write_slowpath(), this is probably *exactly* the kind of
> layout you want.
> 
> Because first the rwsem_write_trylock() will do a cmpxchg on the first
> cacheline (for the optimistic fast-path), and then in the case of
> contention, rwsem_down_write_slowpath() will just access the second
> cacheline.
> 
> Which is probably just optimal for a load that spends a lot of time
> contended - new waiters touch that first cacheline, and then they
> queue themselves up on the second cacheline. Waiman, does that sound
> believable?
> 
> Anyway, I'm certainly ok with the patch that just moves
> 'write_protect_seq' down, it might be worth commenting about how this
> is about some very special cache layout of the mmap_sem part of the
> structure.
> 
> That said, this means that it all is very subtly dependent on a lot of
> kernel config options, and I'm not sure how relevant the exact kernel
> options are that the test robot has been using. But even if this is
> just a "kernel test robot reports", I think it's an interesting case
> and worth a comment for when this happens next time...

There are 3 kernel config options guarding fields before 'mmap_lock' (inside 'mm_struct'):

  CONFIG_MMU
  CONFIG_MEMBARRIER
  CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES

0day's default kernel config is similar to RHEL-8.3's, which has all
three enabled. IIUC, the first 2 options are 'y' in many common
configs, while 'CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES' is only available
on x86 systems.
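
For reference, one way to double-check where 'mmap_lock' actually
lands for a given config is to run pahole on the built vmlinux (the
path below is just an example); it prints each member's offset and
size:

	$ pahole -C mm_struct vmlinux | grep mmap_lock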

Please review the updated patch, thanks

- Feng


From cbdbe70fb9e5bab2988d645c6f0f614d51b2e386 Mon Sep 17 00:00:00 2001
From: Feng Tang <feng.tang@intel.com>
Date: Fri, 4 Jun 2021 15:20:57 +0800
Subject: [PATCH] mm: relocate 'write_protect_seq' in struct mm_struct

0day robot reports a 9.2% regression for will-it-scale mmap1 test
case[1], caused by commit 57efa1fe5957 ("mm/gup: prevent gup_fast
from racing with COW during fork").

Further debugging shows the regression is caused by the commit
changing the offset of the hot field 'mmap_lock' inside 'struct
mm_struct', and thus its cache alignment.

From the perf data, the contention on 'mmap_lock' is very severe,
taking around 95% of cpu cycles, and it is a rw_semaphore:

        struct rw_semaphore {
                atomic_long_t count;	/* 8 bytes */
                atomic_long_t owner;	/* 8 bytes */
                struct optimistic_spin_queue osq; /* spinner MCS lock */
                ...

Before commit 57efa1fe5957 added 'write_protect_seq', the structure
happened to have a very optimal cache alignment layout, as Linus
explained:

 "and before the addition of the 'write_protect_seq' field, the
  mmap_sem was at offset 120 in 'struct mm_struct'.

  Which meant that count and owner were in two different cachelines,
  and then when you have contention and spend time in
  rwsem_down_write_slowpath(), this is probably *exactly* the kind
  of layout you want.

  Because first the rwsem_write_trylock() will do a cmpxchg on the
  first cacheline (for the optimistic fast-path), and then in the
  case of contention, rwsem_down_write_slowpath() will just access
  the second cacheline.

  Which is probably just optimal for a load that spends a lot of
  time contended - new waiters touch that first cacheline, and then
  they queue themselves up on the second cacheline."

After the commit, the rw_semaphore is at offset 128, which means
the 'count' and 'owner' fields are now in the same cacheline,
causing more cache bouncing.

Currently there are "#ifdef CONFIG_XXX" blocks before 'mmap_lock'
which affect its offset:

  CONFIG_MMU
  CONFIG_MEMBARRIER
  CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES

The layout above is for a 64-bit system with 0day's default kernel
config (similar to RHEL-8.3's config), in which all these 3 options
are 'y'. The layout can vary with different kernel configs.

Re-laying out a structure is usually a double-edged sword, as it can
help one case but hurt others. For this case, one solution is: since
the newly added 'write_protect_seq' is a 4-byte seqcount_t (when
CONFIG_DEBUG_LOCK_ALLOC=n), placing it into an existing 4-byte hole
in 'mm_struct' will not change other fields' alignment, while fixing
the regression.

[1]. https://lore.kernel.org/lkml/20210525031636.GB7744@xsang-OptiPlex-9020/
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
---
 include/linux/mm_types.h | 27 ++++++++++++++++++++-------
 1 file changed, 20 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5aacc1c..cba6022 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -445,13 +445,6 @@ struct mm_struct {
 		 */
 		atomic_t has_pinned;
 
-		/**
-		 * @write_protect_seq: Locked when any thread is write
-		 * protecting pages mapped by this mm to enforce a later COW,
-		 * for instance during page table copying for fork().
-		 */
-		seqcount_t write_protect_seq;
-
 #ifdef CONFIG_MMU
 		atomic_long_t pgtables_bytes;	/* PTE page table pages */
 #endif
@@ -460,6 +453,18 @@ struct mm_struct {
 		spinlock_t page_table_lock; /* Protects page tables and some
 					     * counters
 					     */
+		/*
+		 * With some kernel configs, the current mmap_lock's offset
+		 * inside 'mm_struct' is 120, which is very optimal, as
+		 * its two hot fields 'count' and 'owner' sit in 2 different
+		 * cachelines, and when mmap_lock is highly contended, both
+		 * fields will be accessed frequently. The current layout
+		 * helps to reduce cache bouncing.
+		 *
+		 * So please be careful with adding new fields before
+		 * mmap_lock, which can easily push the 2 fields into one
+		 * cacheline.
+		 */
 		struct rw_semaphore mmap_lock;
 
 		struct list_head mmlist; /* List of maybe swapped mm's.	These
@@ -480,7 +485,15 @@ struct mm_struct {
 		unsigned long stack_vm;	   /* VM_STACK */
 		unsigned long def_flags;
 
+		/**
+		 * @write_protect_seq: Locked when any thread is write
+		 * protecting pages mapped by this mm to enforce a later COW,
+		 * for instance during page table copying for fork().
+		 */
+		seqcount_t write_protect_seq;
+
 		spinlock_t arg_lock; /* protect the below fields */
+
 		unsigned long start_code, end_code, start_data, end_data;
 		unsigned long start_brk, brk, start_stack;
 		unsigned long arg_start, arg_end, env_start, env_end;
-- 
2.7.4
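
As a toy illustration of the hole-filling trick the commit message
relies on (illustrative stand-in types, not the real mm_struct; it
assumes an LP64 target where long is 8 bytes and 8-byte aligned):

	#include <stdio.h>
	#include <stddef.h>

	/* 'flags' (4 bytes) is followed by a 4-byte hole before the
	 * 8-byte-aligned 'ptr'. */
	struct before {
		int flags;
		/* 4-byte hole */
		long ptr;
	};

	/* Dropping a 4-byte field into that hole keeps every other
	 * offset, and the total size, unchanged. */
	struct after {
		int flags;
		int seq;		/* new field fills the hole */
		long ptr;
	};

	int main(void)
	{
		printf("size %zu -> %zu, ptr offset %zu -> %zu\n",
		       sizeof(struct before), sizeof(struct after),
		       offsetof(struct before, ptr),
		       offsetof(struct after, ptr));
		return 0;	/* prints: size 16 -> 16, ptr offset 8 -> 8 */
	}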


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression
  2021-06-07  6:05             ` Feng Tang
@ 2021-06-08  0:03               ` Linus Torvalds
  0 siblings, 0 replies; 13+ messages in thread
From: Linus Torvalds @ 2021-06-08  0:03 UTC (permalink / raw)
  To: Feng Tang
  Cc: Waiman Long, Jason Gunthorpe, kernel test robot, John Hubbard,
	Jan Kara, Peter Xu, Andrea Arcangeli, Aneesh Kumar K.V,
	Christoph Hellwig, Hugh Dickins, Jann Horn, Kirill Shutemov,
	Kirill Tkhai, Leon Romanovsky, Michal Hocko, Oleg Nesterov,
	Andrew Morton, LKML, lkp, kernel test robot, Huang, Ying,
	zhengjun.xing

On Sun, Jun 6, 2021 at 11:06 PM Feng Tang <feng.tang@intel.com> wrote:
>
> Please review the updated patch, thanks

Looks good to me. Thanks,

               Linus

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2021-06-08  0:05 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-25  3:16 [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression kernel test robot
2021-05-25  3:11 ` Linus Torvalds
2021-06-04  7:04   ` Feng Tang
2021-06-04  7:52     ` Feng Tang
2021-06-04 17:57       ` Linus Torvalds
2021-06-06 10:16         ` Feng Tang
2021-06-06 19:20           ` Linus Torvalds
2021-06-06 22:13             ` Waiman Long
2021-06-07  6:05             ` Feng Tang
2021-06-08  0:03               ` Linus Torvalds
2021-06-04 17:58       ` John Hubbard
2021-06-06  4:47         ` Feng Tang
2021-06-04  8:37   ` [LKP] " Xing Zhengjun
