On Fri, Jun 04, 2021 at 10:57:44AM -0700, Linus Torvalds wrote:
> On Fri, Jun 4, 2021 at 12:52 AM Feng Tang wrote:
> >
> > On Fri, Jun 04, 2021 at 03:04:11PM +0800, Feng Tang wrote:
> > > >
> > > > The perf data doesn't even mention any of the GUP paths, and on the
> > > > pure fork path the biggest impact would be:
> > > >
> > > >  (a) maybe "struct mm_struct" changed in size or had a different cache layout
> > >
> > > Yes, this seems to be the cause of the regression.
> > >
> > > The test case is many threads doing map/unmap at the same time,
> > > so the process's rw_semaphore 'mmap_lock' is highly contended.
> > >
> > > Before the patch (with 0day's kconfig), the mmap_lock is split
> > > across 2 cachelines: the 'count' is in one line, and the other
> > > members sit in the next line, which luckily avoids some cache
> > > bouncing.  After the patch, the 'mmap_lock' is pushed into one
> > > cacheline, which may cause the regression.
>
> Ok, thanks for following up on this.
>
> > We've tried a patch which can resolve the regression.  As the newly
> > added member 'write_protect_seq' is 4 bytes long, putting it into an
> > existing 4-byte hole resolves the regression while not affecting the
> > alignment of most other members.  Please review the following patch,
> > thanks!
>
> The patch looks fine to me.
>
> At the same time, I do wonder if maybe it would be worth exploring if
> it's a good idea to perhaps move the 'mmap_sem' thing instead.
>
> Or at least add a big comment. It's not clear to me exactly _which_
> other fields are the ones that are so hot that the contention on
> mmap_sem then causes even more cacheline bouncing.
>
> For example, is it either
>
>  (a) we *want* the mmap_sem to be in the first 128-byte region,
> because then when we get the mmap_sem, the other fields in that same
> cacheline are hot
>
> OR
>
>  (b) we do *not* want mmap_sem to be in the *second* 128-byte region,
> because there is something *else* in that region that is touched
> independently of mmap_sem that is very very hot and now you get even
> more bouncing?
>
> but I can't tell which one it is.

Yes, it's better to get more details on which fields are the hottest;
some perf data details follow below.  Let me know if more info is
needed.
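To make the layout point concrete first, here is a toy program (only a
sketch: 'lockish_t', 'prefix[]' and the offsets are simplified stand-ins,
not the real struct rw_semaphore / struct mm_struct layout).  It shows
how placing the new 4-byte member right in front of the lock pulls the
whole rw_semaphore into a single cacheline, while dropping it into a
pre-existing hole leaves every later offset - and hence the cacheline
map - unchanged:

/*
 * Toy model of the mm_struct layout question -- NOT the real kernel
 * structures.  'lockish_t' plays the role of struct rw_semaphore
 * (mmap_lock) and 'prefix[]' stands in for the members ahead of it.
 * Build with any C compiler on a 64-bit target and compare the
 * printed cacheline numbers.
 */
#include <stdio.h>
#include <stddef.h>

#define CACHELINE_SIZE	64
#define LINE(off)	((off) / CACHELINE_SIZE)

typedef struct {
	long	count;			/* touched by every down/up_write() */
	void	*owner;
	void	*wait_list[2];
	int	wait_lock;
} lockish_t;				/* stand-in for struct rw_semaphore */

/* "Old" picture: a 4-byte hole in front of the lock, and the lock
 * split across two cachelines ('count' in one, 'owner' in the next). */
struct mm_before {
	unsigned long	flags;
	int		map_count;
	/* 4-byte hole (12..15) */
	unsigned long	prefix[5];	/* other hot fields */
	lockish_t	mmap_lock;	/* offset 56 */
};

/* New 4-byte member placed right in front of the lock: everything
 * behind it shifts, and the whole lock lands in a single cacheline. */
struct mm_shifted {
	unsigned long	flags;
	int		map_count;
	/* the 4-byte hole (12..15) is still wasted */
	unsigned long	prefix[5];
	int		write_protect_seq;
	lockish_t	mmap_lock;	/* offset 64 */
};

/* Proposed fix: drop the new member into the existing hole, so every
 * later offset -- and the cacheline map -- stays unchanged. */
struct mm_hole_filled {
	unsigned long	flags;
	int		map_count;
	int		write_protect_seq;	/* fills the old hole */
	unsigned long	prefix[5];
	lockish_t	mmap_lock;	/* offset 56 again */
};

#define REPORT(type)							\
	printf("%-16s mmap_lock at %3zu: count in line %zu, owner in line %zu\n", \
	       #type,							\
	       offsetof(struct type, mmap_lock),			\
	       LINE(offsetof(struct type, mmap_lock) + offsetof(lockish_t, count)), \
	       LINE(offsetof(struct type, mmap_lock) + offsetof(lockish_t, owner)))

int main(void)
{
	REPORT(mm_before);	/* count in line 0, owner in line 1 */
	REPORT(mm_shifted);	/* count and owner both in line 1   */
	REPORT(mm_hole_filled);	/* back to the old split            */
	return 0;
}

(The exact offsets of course differ from the real struct mm_struct; the
attached pahole log is authoritative for that.)  Now the perf details: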
* perf-stat: we see more cache misses

    32158577 ± 7%   +9.0%   35060321 ± 6%  perf-stat.ps.cache-misses
    69612918 ± 6%  +11.2%   77382336 ± 5%  perf-stat.ps.cache-references

* perf profile: the 'mmap_lock' contention is the hottest spot, though
  the ratio between map and unmap changes from 72:24 to 52:45, and this
  is the part that I don't understand

  - old kernel (without commit 57efa1fe59)

    96.60%  0.19%  [kernel.kallsyms]  [k] down_write_killable    -    -
       72.46% down_write_killable;vm_mmap_pgoff;ksys_mmap_pgoff;do_syscall_64;entry_SYSCALL_64_after_hwframe;__mmap
       24.14% down_write_killable;__vm_munmap;__x64_sys_munmap;do_syscall_64;entry_SYSCALL_64_after_hwframe;__munmap

  - new kernel

    96.60%  0.16%  [kernel.kallsyms]  [k] down_write_killable    -    -
       51.85% down_write_killable;vm_mmap_pgoff;ksys_mmap_pgoff;do_syscall_64;entry_SYSCALL_64_after_hwframe;__mmap
       44.74% down_write_killable;__vm_munmap;__x64_sys_munmap;do_syscall_64;entry_SYSCALL_64_after_hwframe;__munmap

* perf-c2c: the hotspots (HITM) for the 2 kernels are different due to
  the data structure change

  - old kernel

    - first cacheline
        mmap_lock->count (75%)
        mm->mapcount (14%)
    - second cacheline
        mmap_lock->owner (97%)

  - new kernel

    - mainly in the cacheline of 'mmap_lock'
        mmap_lock->count (~2%)
        mmap_lock->owner (95%)

I also attached the reduced pahole and perf-c2c logs for further
checking.  (The absolute HITM event counts can be ignored, as the
recording time for the new/old kernels may differ.)

> It would be great to have a comment in the code - and in the commit
> message - about exactly which fields are the critical ones. Because I
> doubt it is 'write_protect_seq' itself that matters at all.
>
> If it's "mmap_sem should be close to other commonly used fields",
> maybe we should just move mmap_sem upwards in the structure?

Ok, I will add more comments if the patch is still fine given the
updated info above.

Thanks,
Feng

> Linus
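P.S. In case it helps anyone trying to reproduce this outside of 0day:
the contention pattern described above (many threads of one process
doing anonymous map/unmap back to back, so they all fight over the
mmap_lock write side) can be approximated with a toy like the one
below.  It is only a rough stand-in, not the actual 0day test case, and
NTHREADS/ITERS/MAP_SIZE are arbitrary knobs:

/*
 * Rough stand-in for the reported workload: many threads in one
 * process doing anonymous mmap()/munmap() back to back, so every
 * iteration takes mmap_lock for write.  Not the 0day test itself.
 *
 * Build:  gcc -O2 -pthread mmap_stress.c -o mmap_stress
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#define NTHREADS	32
#define ITERS		200000
#define MAP_SIZE	(128 * 1024)

static void *worker(void *arg)
{
	(void)arg;

	for (long i = 0; i < ITERS; i++) {
		void *p = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			exit(1);
		}
		/* touch one byte so the mapping is actually faulted in */
		*(volatile char *)p = 1;
		if (munmap(p, MAP_SIZE)) {
			perror("munmap");
			exit(1);
		}
	}
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];

	for (int i = 0; i < NTHREADS; i++)
		if (pthread_create(&tid[i], NULL, worker, NULL)) {
			perror("pthread_create");
			exit(1);
		}
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);

	printf("done: %d threads x %d map/unmap iterations\n",
	       NTHREADS, ITERS);
	return 0;
}

Running it under 'perf c2c record' (or plain 'perf record') should show
where the rw_semaphore cachelines bounce, similar to the attached logs.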