* [RFC PATCH 0/8] Drop mmap_sem during unmapping large map
@ 2018-03-20 21:31 ` Yang Shi
  0 siblings, 0 replies; 44+ messages in thread
From: Yang Shi @ 2018-03-20 21:31 UTC (permalink / raw)
  To: akpm; +Cc: yang.shi, linux-mm, linux-kernel


Background:
Recently, when running some VM scalability tests on machines with large
memory, we ran into a couple of mmap_sem scalability issues when unmapping
large memory spaces; please refer to https://lkml.org/lkml/2017/12/14/733
and https://lkml.org/lkml/2018/2/20/576.

akpm then suggested unmapping large mappings section by section, dropping
mmap_sem in between to mitigate the problem (see
https://lkml.org/lkml/2018/3/6/784). This patch series implements that
suggestion.


Approach:
A couple of approaches were explored.
#1. Unmap the large mapping section by section in vm_munmap(). It works, but
    only sys_munmap() benefits from the change.

#2. Do the unmapping deeper in the call chain, i.e. in zap_pmd_range().
    This avoids defining a magic unmap size, but there are two major issues:
      * mmap_sem may be acquired by either down_write() or down_read(),
        depending on the call path. So every call path has to be checked to
        determine which variant, _write or _read, is held. This increases
        the complexity significantly.
      * The below race condition might be introduced:
          CPU A                          CPU B
       ----------                     ----------
       do_munmap
         zap_pmd_range
           up_write                   do_munmap
                                        down_write
                                        ......
                                        remove_vma_list
                                        up_write
           down_write
         access vmas  <-- use-after-free bug

        Also, unmapping by section requires splitting vmas, so the code has
        to deal with partially unmapped vmas, which also increases the
        complexity significantly.

#3. Do it in do_munmap(). The split vma / unmap region / free page tables /
    free vmas sequence stays atomic for every section. And not only
    sys_munmap() benefits, but also mremap and SysV shm. The only problem is
    that some call paths may not want mmap_sem to be dropped. So, an extra
    parameter, called "atomic", is introduced to do_munmap(). The caller
    passes "true" or "false" to tell do_munmap() whether dropping mmap_sem
    is acceptable: "true" means do not drop it, "false" means it may be
    dropped. Since all callers of do_munmap() acquire mmap_sem with the
    _write variant, only one variant needs to be handled. And when
    re-acquiring mmap_sem, plain down_write() is used for now, since dealing
    with the return value of down_write_killable() seems unnecessary.

    Other than that, a magic section size has to be defined explicitly;
    HPAGE_PUD_SIZE is used for now. According to my tests, HPAGE_PUD_SIZE is
    good enough. This is also why down_write() is used for re-acquiring
    mmap_sem instead of down_write_killable(): a smaller section size
    appears to add too much overhead.

Regression and performance data:
Tests were run on a machine with 32 cores of E5-2680 @ 2.70GHz and 384GB of
memory.

A full LTP run was done with no regressions.

Measurement of SyS_munmap() execution time:
  size        pristine        patched        delta
  80GB       5008377 us      4905841 us       -2%
  160GB      9129243 us      9145306 us       +0.18%
  320GB      17915310 us     17990174 us      +0.42%

Throughput of page faults (#/s) with vm-scalability:
                    pristine         patched         delta
mmap-pread-seq       554894           563517         +1.6%
mmap-pread-seq-mt    581232           580772         -0.079%
mmap-xread-seq-mt    99182            105400         +6.3%

Throughput of page faults (#/s) with the below stress-ng test:
stress-ng --mmap 0 --mmap-bytes 80G --mmap-file --metrics --perf
--timeout 600s
        pristine         patched          delta
         100165           108396          +8.2%


There are 8 patches in this series.
1/8:
  Introduce the "atomic" parameter, define do_munmap_range(), and modify
  do_munmap() to call do_munmap_range() to unmap memory section by section
2/8 - 6/8:
  Modify the do_munmap() call sites in mm/mmap.c, mm/mremap.c,
  fs/proc/vmcore.c, ipc/shm.c and mm/nommu.c to adopt the "atomic" parameter
7/8 - 8/8:
  Modify the do_munmap() call sites in arch/x86 to adopt the "atomic" parameter


Yang Shi (8):
      mm: mmap: unmap large mapping by section
      mm: mmap: pass atomic parameter to do_munmap() call sites
      mm: mremap: pass atomic parameter to do_munmap()
      mm: nommu: add atomic parameter to do_munmap()
      ipc: shm: pass atomic parameter to do_munmap()
      fs: proc/vmcore: pass atomic parameter to do_munmap()
      x86: mpx: pass atomic parameter to do_munmap()
      x86: vma: pass atomic parameter to do_munmap()

 arch/x86/entry/vdso/vma.c |  2 +-
 arch/x86/mm/mpx.c         |  2 +-
 fs/proc/vmcore.c          |  4 ++--
 include/linux/mm.h        |  2 +-
 ipc/shm.c                 |  9 ++++++---
 mm/mmap.c                 | 48 ++++++++++++++++++++++++++++++++++++++++++------
 mm/mremap.c               | 10 ++++++----
 mm/nommu.c                |  5 +++--
 8 files changed, 62 insertions(+), 20 deletions(-)

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-20 21:31 ` Yang Shi
  (?)
@ 2018-03-20 21:31 ` Yang Shi
  2018-03-21 13:08   ` Michal Hocko
  2018-03-21 13:14   ` Michal Hocko
  -1 siblings, 2 replies; 44+ messages in thread
From: Yang Shi @ 2018-03-20 21:31 UTC (permalink / raw)
  To: akpm; +Cc: yang.shi, linux-mm, linux-kernel

When running some mmap/munmap scalability tests with large memory (i.e.
> 300GB), the below hung task issue may happen occasionally.

INFO: task ps:14018 blocked for more than 120 seconds.
       Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
 ps              D    0 14018      1 0x00000004
  ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
  ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
  00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
 Call Trace:
  [<ffffffff817154d0>] ? __schedule+0x250/0x730
  [<ffffffff817159e6>] schedule+0x36/0x80
  [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
  [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
  [<ffffffff81717db0>] down_read+0x20/0x40
  [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
  [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
  [<ffffffff81241d87>] __vfs_read+0x37/0x150
  [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
  [<ffffffff81242266>] vfs_read+0x96/0x130
  [<ffffffff812437b5>] SyS_read+0x55/0xc0
  [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5

This is because munmap holds mmap_sem from the very beginning all the way
to the end, and doesn't release it in the middle. Unmapping a large mapping
may take a long time (~18 seconds to unmap a 320GB mapping with every
single page mapped, on an otherwise idle machine).

Since unmapping doesn't require any atomicity, unmap large mappings
(> HPAGE_PUD_SIZE) section by section, and release mmap_sem after every
HPAGE_PUD_SIZE chunk if mmap_sem is contended and the call path can
tolerate being interrupted. This is controlled by "atomic", a newly added
parameter to do_munmap(): "false" means it is fine to unlock/relock
mmap_sem in the middle.

Not only may munmap benefit from this change, but also mremap and SysV shm,
since they all call do_munmap() to do the real work.

The below is some regression and performance data collected on a machine
with 32 cores of E5-2680 @ 2.70GHz and 384GB memory.

Measurement of SyS_munmap() execution time:
  size        pristine        patched        delta
  80GB       5008377 us      4905841 us       -2%
  160GB      9129243 us      9145306 us       +0.18%
  320GB      17915310 us     17990174 us      +0.42%

Throughput of page faults (#/s) with vm-scalability:
                    pristine         patched         delta
mmap-pread-seq       554894           563517         +1.6%
mmap-pread-seq-mt    581232           580772         -0.079%
mmap-xread-seq-mt    99182            105400         +6.3%

Throughput of page faults (#/s) with the below stress-ng test:
stress-ng --mmap 0 --mmap-bytes 80G --mmap-file --metrics --perf
--timeout 600s
        pristine         patched          delta
         100165           108396          +8.2%

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 include/linux/mm.h |  2 +-
 mm/mmap.c          | 40 ++++++++++++++++++++++++++++++++++++++--
 2 files changed, 39 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ad06d42..2e447d4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2212,7 +2212,7 @@ extern unsigned long do_mmap(struct file *file, unsigned long addr,
 	vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
 	struct list_head *uf);
 extern int do_munmap(struct mm_struct *, unsigned long, size_t,
-		     struct list_head *uf);
+		     struct list_head *uf, bool atomic);
 
 static inline unsigned long
 do_mmap_pgoff(struct file *file, unsigned long addr,
diff --git a/mm/mmap.c b/mm/mmap.c
index 9efdc021..ad6ae7a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2632,8 +2632,8 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
  * work.  This now handles partial unmappings.
  * Jeremy Fitzhardinge <jeremy@goop.org>
  */
-int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
-	      struct list_head *uf)
+static int do_munmap_range(struct mm_struct *mm, unsigned long start,
+			   size_t len, struct list_head *uf)
 {
 	unsigned long end;
 	struct vm_area_struct *vma, *prev, *last;
@@ -2733,6 +2733,42 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 	return 0;
 }
 
+int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
+	      struct list_head *uf, bool atomic)
+{
+	int ret = 0;
+	size_t step = HPAGE_PUD_SIZE;
+
+	/*
+	 * unmap large mapping (> huge pud size) section by section
+	 * in order to give mmap_sem waiters a chance to acquire it.
+	 */
+	if (len <= step)
+		ret = do_munmap_range(mm, start, len, uf);
+	else {
+		do {
+			ret = do_munmap_range(mm, start, step, uf);
+			if (ret < 0)
+				break;
+
+			if (rwsem_is_contended(&mm->mmap_sem) && !atomic &&
+			    need_resched()) {
+				VM_BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
+				up_write(&mm->mmap_sem);
+				cond_resched();
+				down_write(&mm->mmap_sem);
+			}
+
+			start += step;
+			len -= step;
+			if (len <= step)
+				step = len;
+		} while (len > 0);
+	}
+
+	return ret;
+}
+
 int vm_munmap(unsigned long start, size_t len)
 {
 	int ret;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [RFC PATCH 2/8] mm: mmap: pass atomic parameter to do_munmap() call sites
  2018-03-20 21:31 ` Yang Shi
  (?)
  (?)
@ 2018-03-20 21:31 ` Yang Shi
  -1 siblings, 0 replies; 44+ messages in thread
From: Yang Shi @ 2018-03-20 21:31 UTC (permalink / raw)
  To: akpm; +Cc: yang.shi, linux-mm, linux-kernel

It looks safe to release mmap_sem in the middle for vm_munmap() and brk,
so pass "false" to those do_munmap() calls.
However, it does not look safe for mmap_region(), which is called by
SyS_mmap(), so pass "true" there.

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 mm/mmap.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index ad6ae7a..374e4ec 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -225,7 +225,7 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
 
 	/* Always allow shrinking brk. */
 	if (brk <= mm->brk) {
-		if (!do_munmap(mm, newbrk, oldbrk-newbrk, &uf))
+		if (!do_munmap(mm, newbrk, oldbrk-newbrk, &uf, false))
 			goto set_brk;
 		goto out;
 	}
@@ -1643,7 +1643,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	/* Clear old maps */
 	while (find_vma_links(mm, addr, addr + len, &prev, &rb_link,
 			      &rb_parent)) {
-		if (do_munmap(mm, addr, len, uf))
+		if (do_munmap(mm, addr, len, uf, true))
 			return -ENOMEM;
 	}
 
@@ -2778,7 +2778,7 @@ int vm_munmap(unsigned long start, size_t len)
 	if (down_write_killable(&mm->mmap_sem))
 		return -EINTR;
 
-	ret = do_munmap(mm, start, len, &uf);
+	ret = do_munmap(mm, start, len, &uf, false);
 	up_write(&mm->mmap_sem);
 	userfaultfd_unmap_complete(mm, &uf);
 	return ret;
@@ -2945,7 +2945,7 @@ static int do_brk_flags(unsigned long addr, unsigned long request, unsigned long
 	 */
 	while (find_vma_links(mm, addr, addr + len, &prev, &rb_link,
 			      &rb_parent)) {
-		if (do_munmap(mm, addr, len, uf))
+		if (do_munmap(mm, addr, len, uf, false))
 			return -ENOMEM;
 	}
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [RFC PATCH 3/8] mm: mremap: pass atomic parameter to do_munmap()
  2018-03-20 21:31 ` Yang Shi
                   ` (2 preceding siblings ...)
  (?)
@ 2018-03-20 21:31 ` Yang Shi
  -1 siblings, 0 replies; 44+ messages in thread
From: Yang Shi @ 2018-03-20 21:31 UTC (permalink / raw)
  To: akpm; +Cc: yang.shi, linux-mm, linux-kernel

It sounds safe to unlock/relock mmap_sem in mremap, so pass "false"
here.

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 mm/mremap.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 049470a..5f8fca4 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -353,7 +353,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 	if (unlikely(vma->vm_flags & VM_PFNMAP))
 		untrack_pfn_moved(vma);
 
-	if (do_munmap(mm, old_addr, old_len, uf_unmap) < 0) {
+	if (do_munmap(mm, old_addr, old_len, uf_unmap, false) < 0) {
 		/* OOM: unable to split vma, just get accounts right */
 		vm_unacct_memory(excess >> PAGE_SHIFT);
 		excess = 0;
@@ -462,12 +462,13 @@ static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
 	if (addr + old_len > new_addr && new_addr + new_len > addr)
 		goto out;
 
-	ret = do_munmap(mm, new_addr, new_len, uf_unmap_early);
+	ret = do_munmap(mm, new_addr, new_len, uf_unmap_early, false);
 	if (ret)
 		goto out;
 
 	if (old_len >= new_len) {
-		ret = do_munmap(mm, addr+new_len, old_len - new_len, uf_unmap);
+		ret = do_munmap(mm, addr+new_len, old_len - new_len,
+				uf_unmap, false);
 		if (ret && old_len != new_len)
 			goto out;
 		old_len = new_len;
@@ -568,7 +569,8 @@ static int vma_expandable(struct vm_area_struct *vma, unsigned long delta)
 	 * do_munmap does all the needed commit accounting
 	 */
 	if (old_len >= new_len) {
-		ret = do_munmap(mm, addr+new_len, old_len - new_len, &uf_unmap);
+		ret = do_munmap(mm, addr+new_len, old_len - new_len,
+				&uf_unmap, false);
 		if (ret && old_len != new_len)
 			goto out;
 		ret = addr;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [RFC PATCH 4/8] mm: nommu: add atomic parameter to do_munmap()
  2018-03-20 21:31 ` Yang Shi
                   ` (3 preceding siblings ...)
  (?)
@ 2018-03-20 21:31 ` Yang Shi
  -1 siblings, 0 replies; 44+ messages in thread
From: Yang Shi @ 2018-03-20 21:31 UTC (permalink / raw)
  To: akpm; +Cc: yang.shi, linux-mm, linux-kernel

Just add the atomic parameter to stay consistent with the API change and
pass "true" at the call site. The nommu code doesn't do the mmap_sem
unlock/relock.

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 mm/nommu.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/nommu.c b/mm/nommu.c
index ebb6e61..5954c08 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1578,7 +1578,8 @@ static int shrink_vma(struct mm_struct *mm,
  * - under NOMMU conditions the chunk to be unmapped must be backed by a single
  *   VMA, though it need not cover the whole VMA
  */
-int do_munmap(struct mm_struct *mm, unsigned long start, size_t len, struct list_head *uf)
+int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
+	      struct list_head *ufi, bool atomic)
 {
 	struct vm_area_struct *vma;
 	unsigned long end;
@@ -1644,7 +1645,7 @@ int vm_munmap(unsigned long addr, size_t len)
 	int ret;
 
 	down_write(&mm->mmap_sem);
-	ret = do_munmap(mm, addr, len, NULL);
+	ret = do_munmap(mm, addr, len, NULL, true);
 	up_write(&mm->mmap_sem);
 	return ret;
 }
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [RFC PATCH 5/8] ipc: shm: pass atomic parameter to do_munmap()
  2018-03-20 21:31 ` Yang Shi
                   ` (4 preceding siblings ...)
  (?)
@ 2018-03-20 21:31 ` Yang Shi
  -1 siblings, 0 replies; 44+ messages in thread
From: Yang Shi @ 2018-03-20 21:31 UTC (permalink / raw)
  To: akpm; +Cc: yang.shi, linux-mm, linux-kernel

It looks safe to unlock/relock mmap_sem in the middle of shmat(), so
pass "false" here.

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 ipc/shm.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/ipc/shm.c b/ipc/shm.c
index 4643865..1617523 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -1537,7 +1537,8 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
 			 */
 			file = vma->vm_file;
 			size = i_size_read(file_inode(vma->vm_file));
-			do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start, NULL);
+			do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start,
+				  NULL, false);
 			/*
 			 * We discovered the size of the shm segment, so
 			 * break out of here and fall through to the next
@@ -1564,7 +1565,8 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
 		if ((vma->vm_ops == &shm_vm_ops) &&
 		    ((vma->vm_start - addr)/PAGE_SIZE == vma->vm_pgoff) &&
 		    (vma->vm_file == file))
-			do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start, NULL);
+			do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start,
+				  NULL, false);
 		vma = next;
 	}
 
@@ -1573,7 +1575,8 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
 	 * given
 	 */
 	if (vma && vma->vm_start == addr && vma->vm_ops == &shm_vm_ops) {
-		do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start, NULL);
+		do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start,
+			  NULL, false);
 		retval = 0;
 	}
 
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [RFC PATCH 6/8] fs: proc/vmcore: pass atomic parameter to do_munmap()
  2018-03-20 21:31 ` Yang Shi
                   ` (5 preceding siblings ...)
  (?)
@ 2018-03-20 21:31 ` Yang Shi
  -1 siblings, 0 replies; 44+ messages in thread
From: Yang Shi @ 2018-03-20 21:31 UTC (permalink / raw)
  To: akpm; +Cc: yang.shi, linux-mm, linux-kernel

Just pass "true" here since vmcore map is not a hot path there is not
too much gain to release mmap_sem in the middle.

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
---
 fs/proc/vmcore.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index a45f0af..02683eb 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -388,7 +388,7 @@ static int remap_oldmem_pfn_checked(struct vm_area_struct *vma,
 	}
 	return 0;
 fail:
-	do_munmap(vma->vm_mm, from, len, NULL);
+	do_munmap(vma->vm_mm, from, len, NULL, true);
 	return -EAGAIN;
 }
 
@@ -481,7 +481,7 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
 
 	return 0;
 fail:
-	do_munmap(vma->vm_mm, vma->vm_start, len, NULL);
+	do_munmap(vma->vm_mm, vma->vm_start, len, NULL, true);
 	return -EAGAIN;
 }
 #else
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [RFC PATCH 7/8] x86: mpx: pass atomic parameter to do_munmap()
  2018-03-20 21:31 ` Yang Shi
                   ` (6 preceding siblings ...)
  (?)
@ 2018-03-20 21:31 ` Yang Shi
  2018-03-20 22:35   ` Thomas Gleixner
  -1 siblings, 1 reply; 44+ messages in thread
From: Yang Shi @ 2018-03-20 21:31 UTC (permalink / raw)
  To: akpm
  Cc: yang.shi, linux-mm, linux-kernel, Ricardo Neri, Thomas Gleixner,
	Ingo Molnar, H. Peter Anvin, Borislav Petkov, x86

Pass "true" to do_munmap() to not do unlock/relock to mmap_sem when
manipulating mpx map.

This is API change only.

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: x86@kernel.org
---
 arch/x86/mm/mpx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index e500949..a180e94 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -780,7 +780,7 @@ static int unmap_entire_bt(struct mm_struct *mm,
 	 * avoid recursion, do_munmap() will check whether it comes
 	 * from one bounds table through VM_MPX flag.
 	 */
-	return do_munmap(mm, bt_addr, mpx_bt_size_bytes(mm), NULL);
+	return do_munmap(mm, bt_addr, mpx_bt_size_bytes(mm), NULL, true);
 }
 
 static int try_unmap_single_bt(struct mm_struct *mm,
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [RFC PATCH 8/8] x86: vma: pass atomic parameter to do_munmap()
  2018-03-20 21:31 ` Yang Shi
                   ` (7 preceding siblings ...)
  (?)
@ 2018-03-20 21:31 ` Yang Shi
  -1 siblings, 0 replies; 44+ messages in thread
From: Yang Shi @ 2018-03-20 21:31 UTC (permalink / raw)
  To: akpm
  Cc: yang.shi, linux-mm, linux-kernel, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86

It sounds unnecessary to release mmap_sem in the middle of unmapping the
vdso map.

This is API change only.

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
---
 arch/x86/entry/vdso/vma.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index 5b8b556..6359ae5 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -192,7 +192,7 @@ static int map_vdso(const struct vdso_image *image, unsigned long addr)
 
 	if (IS_ERR(vma)) {
 		ret = PTR_ERR(vma);
-		do_munmap(mm, text_start, image->size, NULL);
+		do_munmap(mm, text_start, image->size, NULL, true);
 	} else {
 		current->mm->context.vdso = (void __user *)text_start;
 		current->mm->context.vdso_image = image;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 7/8] x86: mpx: pass atomic parameter to do_munmap()
  2018-03-20 21:31 ` [RFC PATCH 7/8] x86: mpx: " Yang Shi
@ 2018-03-20 22:35   ` Thomas Gleixner
  2018-03-21 16:53     ` Yang Shi
  0 siblings, 1 reply; 44+ messages in thread
From: Thomas Gleixner @ 2018-03-20 22:35 UTC (permalink / raw)
  To: Yang Shi
  Cc: akpm, linux-mm, linux-kernel, Ricardo Neri, Ingo Molnar,
	H. Peter Anvin, Borislav Petkov, x86

On Wed, 21 Mar 2018, Yang Shi wrote:

Please CC everyone involved on the full patch set next time. I had to dig
the rest out from my lkml archive to get the context.

> Pass "true" to do_munmap() to not do unlock/relock to mmap_sem when
> manipulating mpx map.

> This is API change only.

This is wrong. You cannot change the function in one patch and then clean
up the users. That breaks bisectability.

Depending on the number of callers this wants to be a single patch changing
both the function and the callers or you need to create a new function
which has the extra argument and switch all users over to it and then
remove the old function.

> @@ -780,7 +780,7 @@ static int unmap_entire_bt(struct mm_struct *mm,
>  	 * avoid recursion, do_munmap() will check whether it comes
>  	 * from one bounds table through VM_MPX flag.
>  	 */
> -	return do_munmap(mm, bt_addr, mpx_bt_size_bytes(mm), NULL);
> +	return do_munmap(mm, bt_addr, mpx_bt_size_bytes(mm), NULL, true);

But looking at the full context this is the wrong approach.

First of all the name of that parameter 'atomic' is completely
misleading. It suggests that this happens in fully atomic context, which is
not the case.

Secondly, conditional locking is frowned upon in general and rightfully so.

So the right thing to do is to leave do_munmap() alone and add a new
function do_munmap_huge() or whatever sensible name you come up with. Then
convert the places which are considered to be safe one by one with a proper
changelog which explains WHY this is safe.

That way you avoid the chasing game of all existing do_munmap() callers and
just use the new 'free in chunks' approach where it is appropriate and
safe. No surprises, no bisectability issues....
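
For illustration, the split could look roughly like this (the name and the
exact prototypes are just placeholders, not a concrete proposal):

        /* Existing behaviour: mmap_sem stays held for write throughout. */
        int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
                      struct list_head *uf);

        /*
         * New variant: may temporarily drop and re-acquire mmap_sem between
         * chunks, so only callers which can tolerate that may use it.
         */
        int do_munmap_huge(struct mm_struct *mm, unsigned long start,
                           size_t len, struct list_head *uf);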

While at it please add proper kernel doc documentation to both do_munmap()
and the new function which explains the intricacies.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-20 21:31 ` [RFC PATCH 1/8] mm: mmap: unmap large mapping by section Yang Shi
@ 2018-03-21 13:08   ` Michal Hocko
  2018-03-21 16:31     ` Yang Shi
  2018-03-21 13:14   ` Michal Hocko
  1 sibling, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2018-03-21 13:08 UTC (permalink / raw)
  To: Yang Shi; +Cc: akpm, linux-mm, linux-kernel

On Wed 21-03-18 05:31:19, Yang Shi wrote:
> When running some mmap/munmap scalability tests with large memory (i.e.
> > 300GB), the below hung task issue may happen occasionally.
> 
> INFO: task ps:14018 blocked for more than 120 seconds.
>        Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> message.
>  ps              D    0 14018      1 0x00000004
>   ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>   ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>   00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>  Call Trace:
>   [<ffffffff817154d0>] ? __schedule+0x250/0x730
>   [<ffffffff817159e6>] schedule+0x36/0x80
>   [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
>   [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
>   [<ffffffff81717db0>] down_read+0x20/0x40
>   [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
>   [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
>   [<ffffffff81241d87>] __vfs_read+0x37/0x150
>   [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
>   [<ffffffff81242266>] vfs_read+0x96/0x130
>   [<ffffffff812437b5>] SyS_read+0x55/0xc0
>   [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
> 
> It is because munmap holds mmap_sem from very beginning to all the way
> down to the end, and doesn't release it in the middle. When unmapping
> large mapping, it may take long time (take ~18 seconds to unmap 320GB
> mapping with every single page mapped on an idle machine).

Yes, this definitely sucks. One way to work that around is to split the
unmap to two phases. One to drop all the pages. That would only need
mmap_sem for read and then tear down the mapping with the mmap_sem for
write. This wouldn't help for parallel mmap_sem writers but those really
need a different approach (e.g. the range locking).
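
Roughly something like the below (pseudo-code only; it hand-waves over how
page faults and other readers are kept away from the range in between):

        down_read(&mm->mmap_sem);
        /* phase 1: drop the pages, the vma layout doesn't change yet */
        zap_page_range(vma, start, len);
        up_read(&mm->mmap_sem);

        down_write(&mm->mmap_sem);
        /* phase 2: detach the vmas, free page tables and the vmas */
        do_munmap(mm, start, len, &uf);
        up_write(&mm->mmap_sem);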

> Since unmapping doesn't require any atomicity, unmap large mappings

How come? Could you be more specific why? Once you drop the lock the
address space might change under your feet and you might be unmapping a
completely different vma. That would require userspace doing nasty
things of course (e.g. MAP_FIXED) but I am worried that userspace really
depends on mmap/munmap atomicity these days.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-20 21:31 ` [RFC PATCH 1/8] mm: mmap: unmap large mapping by section Yang Shi
  2018-03-21 13:08   ` Michal Hocko
@ 2018-03-21 13:14   ` Michal Hocko
  2018-03-21 16:50     ` Yang Shi
  1 sibling, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2018-03-21 13:14 UTC (permalink / raw)
  To: Yang Shi; +Cc: akpm, linux-mm, linux-kernel

On Wed 21-03-18 05:31:19, Yang Shi wrote:
> When running some mmap/munmap scalability tests with large memory (i.e.
> > 300GB), the below hung task issue may happen occasionally.
> 
> INFO: task ps:14018 blocked for more than 120 seconds.
>        Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> message.
>  ps              D    0 14018      1 0x00000004
>   ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>   ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>   00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>  Call Trace:
>   [<ffffffff817154d0>] ? __schedule+0x250/0x730
>   [<ffffffff817159e6>] schedule+0x36/0x80
>   [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
>   [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
>   [<ffffffff81717db0>] down_read+0x20/0x40
>   [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0

Slightly off-topic:
Btw. this sucks as well. Do we really need to take mmap_sem here? Do any
of
	arg_start = mm->arg_start;
	arg_end = mm->arg_end;
	env_start = mm->env_start;
	env_end = mm->env_end;

change after exec or while the pid is already visible in proc? If yes
maybe we can use a dedicated lock.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-21 13:08   ` Michal Hocko
@ 2018-03-21 16:31     ` Yang Shi
  2018-03-21 17:29       ` Matthew Wilcox
  0 siblings, 1 reply; 44+ messages in thread
From: Yang Shi @ 2018-03-21 16:31 UTC (permalink / raw)
  To: Michal Hocko; +Cc: akpm, linux-mm, linux-kernel



On 3/21/18 6:08 AM, Michal Hocko wrote:
> On Wed 21-03-18 05:31:19, Yang Shi wrote:
>> When running some mmap/munmap scalability tests with large memory (i.e.
>>> 300GB), the below hung task issue may happen occasionally.
>> INFO: task ps:14018 blocked for more than 120 seconds.
>>         Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
>>   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
>> message.
>>   ps              D    0 14018      1 0x00000004
>>    ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>>    ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>>    00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>>   Call Trace:
>>    [<ffffffff817154d0>] ? __schedule+0x250/0x730
>>    [<ffffffff817159e6>] schedule+0x36/0x80
>>    [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
>>    [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
>>    [<ffffffff81717db0>] down_read+0x20/0x40
>>    [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
>>    [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
>>    [<ffffffff81241d87>] __vfs_read+0x37/0x150
>>    [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
>>    [<ffffffff81242266>] vfs_read+0x96/0x130
>>    [<ffffffff812437b5>] SyS_read+0x55/0xc0
>>    [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
>>
>> It is because munmap holds mmap_sem from very beginning to all the way
>> down to the end, and doesn't release it in the middle. When unmapping
>> large mapping, it may take long time (take ~18 seconds to unmap 320GB
>> mapping with every single page mapped on an idle machine).
> Yes, this definitely sucks. One way to work that around is to split the
> unmap to two phases. One to drop all the pages. That would only need
> mmap_sem for read and then tear down the mapping with the mmap_sem for
> write. This wouldn't help for parallel mmap_sem writers but those really
> need a different approach (e.g. the range locking).

page fault might sneak in to map a page which has been unmapped before?

range locking should help a lot on manipulating small sections of a 
large mapping in parallel or multiple small mappings. It may not achieve 
too much for single large mapping.

>
>> Since unmapping doesn't require any atomicity, unmap large mappings
> How come? Could you be more specific why? Once you drop the lock the
> address space might change under your feet and you might be unmapping a
> completely different vma. That would require userspace doing nasty
> things of course (e.g. MAP_FIXED) but I am worried that userspace really
> depends on mmap/munmap atomicity these days.

Sorry for the ambiguity. The statement does look misleading. munmap does 
need certain atomicity, particularly for the below sequence:

splitting vma
unmap region
free pagetables
free vmas

Otherwise it may run into the below race condition:

          CPU A                          CPU B
       ----------                     ----------
       do_munmap
         zap_pmd_range
           up_write                   do_munmap
                                        down_write
                                        ......
                                        remove_vma_list
                                        up_write
           down_write
         access vmas  <-- use-after-free bug

This is why I do the range unmap in do_munmap() rather than in a deeper
location such as zap_pmd_range(). I elaborated on this in the cover
letter.

Thanks,
Yang

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-21 13:14   ` Michal Hocko
@ 2018-03-21 16:50     ` Yang Shi
  2018-03-21 17:16       ` Yang Shi
  0 siblings, 1 reply; 44+ messages in thread
From: Yang Shi @ 2018-03-21 16:50 UTC (permalink / raw)
  To: Michal Hocko; +Cc: akpm, linux-mm, linux-kernel



On 3/21/18 6:14 AM, Michal Hocko wrote:
> On Wed 21-03-18 05:31:19, Yang Shi wrote:
>> When running some mmap/munmap scalability tests with large memory (i.e.
>>> 300GB), the below hung task issue may happen occasionally.
>> INFO: task ps:14018 blocked for more than 120 seconds.
>>         Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
>>   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
>> message.
>>   ps              D    0 14018      1 0x00000004
>>    ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>>    ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>>    00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>>   Call Trace:
>>    [<ffffffff817154d0>] ? __schedule+0x250/0x730
>>    [<ffffffff817159e6>] schedule+0x36/0x80
>>    [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
>>    [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
>>    [<ffffffff81717db0>] down_read+0x20/0x40
>>    [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
> Slightly off-topic:
> Btw. this sucks as well. Do we really need to take mmap_sem here? Do any
> of
> 	arg_start = mm->arg_start;
> 	arg_end = mm->arg_end;
> 	env_start = mm->env_start;
> 	env_end = mm->env_end;
>
> change after exec or while the pid is already visible in proc? If yes
> maybe we can use a dedicated lock.

Actually, Alexey Dobriyan had the same comment when he reviewed my very
first patch (which changed down_read() to down_read_killable() at that
place).

Those 4 values might be changed by prctl_set_mm() and prctl_set_mm_map()
concurrently. They used to be updated under down_read(), but that is not
good enough to protect against concurrent writers. So, Mateusz Guzik's
commit ddf1d398e517e660207e2c807f76a90df543a217 ("prctl: take mmap sem
for writing to protect against others") changed it to down_write().

It seems mmap_sem can be replaced with a dedicated lock here. How about
defining an rwlock in mm_struct to protect those fields? I will come up
with an RFC patch for this.
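
Something along these lines, maybe (completely untested sketch; the new
"arg_lock" rwlock in mm_struct is just an assumption for illustration):

        /* in struct mm_struct */
        rwlock_t arg_lock;  /* protects arg_start/arg_end/env_start/env_end */

        /* readers, e.g. proc_pid_cmdline_read() */
        read_lock(&mm->arg_lock);
        arg_start = mm->arg_start;
        arg_end = mm->arg_end;
        env_start = mm->env_start;
        env_end = mm->env_end;
        read_unlock(&mm->arg_lock);

        /* writers, e.g. prctl_set_mm() and prctl_set_mm_map() */
        write_lock(&mm->arg_lock);
        mm->arg_start = start;
        mm->arg_end = end;
        write_unlock(&mm->arg_lock);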

However, this dedicated lock can only work around this specific case. I
believe solving the mmap_sem scalability issue targeted by this patch
series is still our consensus.

Thanks,
Yang

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 7/8] x86: mpx: pass atomic parameter to do_munmap()
  2018-03-20 22:35   ` Thomas Gleixner
@ 2018-03-21 16:53     ` Yang Shi
  0 siblings, 0 replies; 44+ messages in thread
From: Yang Shi @ 2018-03-21 16:53 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: akpm, linux-mm, linux-kernel, Ricardo Neri, Ingo Molnar,
	H. Peter Anvin, Borislav Petkov, x86



On 3/20/18 3:35 PM, Thomas Gleixner wrote:
> On Wed, 21 Mar 2018, Yang Shi wrote:
>
> Please CC everyone involved on the full patch set next time. I had to dig
> the rest out from my lkml archive to get the context.

Sorry for the inconvenience. Will pay attention to it next time.

>
>> Pass "true" to do_munmap() to not do unlock/relock to mmap_sem when
>> manipulating mpx map.
>> This is API change only.
> This is wrong. You cannot change the function in one patch and then clean
> up the users. That breaks bisectability.
>
> Depending on the number of callers this wants to be a single patch changing
> both the function and the callers or you need to create a new function
> which has the extra argument and switch all users over to it and then
> remove the old function.
>
>> @@ -780,7 +780,7 @@ static int unmap_entire_bt(struct mm_struct *mm,
>>   	 * avoid recursion, do_munmap() will check whether it comes
>>   	 * from one bounds table through VM_MPX flag.
>>   	 */
>> -	return do_munmap(mm, bt_addr, mpx_bt_size_bytes(mm), NULL);
>> +	return do_munmap(mm, bt_addr, mpx_bt_size_bytes(mm), NULL, true);
> But looking at the full context this is the wrong approach.
>
> First of all the name of that parameter 'atomic' is completely
> misleading. It suggests that this happens in fully atomic context, which is
> not the case.
>
> Secondly, conditional locking is frowned upon in general and rightfully so.
>
> So the right thing to do is to leave do_munmap() alone and add a new
> function do_munmap_huge() or whatever sensible name you come up with. Then
> convert the places which are considered to be safe one by one with a proper
> changelog which explains WHY this is safe.
>
> That way you avoid the chasing game of all existing do_munmap() callers and
> just use the new 'free in chunks' approach where it is appropriate and
> safe. No suprises, no bisectability issues....
>
> While at it please add proper kernel doc documentation to both do_munmap()
> and the new function which explains the intricacies.

Thanks a lot for the suggestion. Absolutely agree. Will fix the problems 
in newer version.

Yang

>
> Thanks,
>
> 	tglx

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-21 16:50     ` Yang Shi
@ 2018-03-21 17:16       ` Yang Shi
  2018-03-21 21:23         ` Michal Hocko
  0 siblings, 1 reply; 44+ messages in thread
From: Yang Shi @ 2018-03-21 17:16 UTC (permalink / raw)
  To: Michal Hocko; +Cc: akpm, linux-mm, linux-kernel



On 3/21/18 9:50 AM, Yang Shi wrote:
>
>
> On 3/21/18 6:14 AM, Michal Hocko wrote:
>> On Wed 21-03-18 05:31:19, Yang Shi wrote:
>>> When running some mmap/munmap scalability tests with large memory (i.e.
>>>> 300GB), the below hung task issue may happen occasionally.
>>> INFO: task ps:14018 blocked for more than 120 seconds.
>>>         Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
>>>   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
>>> message.
>>>   ps              D    0 14018      1 0x00000004
>>>    ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>>>    ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>>>    00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>>>   Call Trace:
>>>    [<ffffffff817154d0>] ? __schedule+0x250/0x730
>>>    [<ffffffff817159e6>] schedule+0x36/0x80
>>>    [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
>>>    [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
>>>    [<ffffffff81717db0>] down_read+0x20/0x40
>>>    [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
>> Slightly off-topic:
>> Btw. this sucks as well. Do we really need to take mmap_sem here? Do any
>> of
>>     arg_start = mm->arg_start;
>>     arg_end = mm->arg_end;
>>     env_start = mm->env_start;
>>     env_end = mm->env_end;
>>
>> change after exec or while the pid is already visible in proc? If yes
>> maybe we can use a dedicated lock.

BTW, this is not the only place that acquires mmap_sem in
proc_pid_cmdline_read(); it also calls access_remote_vm(), which needs to
acquire mmap_sem, so the mmap_sem scalability issue will be hit sooner or
later.

Yang

>
> Actually, Alexey Dobriyan had the same comment when he reviewed my 
> very first patch (which changes down_read to down_read_killable at 
> that place).
>
> Those 4 values might be changed by prctl_set_mm() and 
> prctl_set_mm_map() concurrently. They used to use down_read() to 
> protect the change, but it looks not good enough to protect concurrent 
> writing. So, Mateusz Guzik's commit 
> ddf1d398e517e660207e2c807f76a90df543a217 ("prctl: take mmap sem for 
> writing to protect against others") change it to down_write().
>
> It seems mmap_sem can be replaced to a dedicated lock. How about 
> defining a rwlock in mm_struct to protect those data? I will come up 
> with a RFC patch for this.
>
> However, this dedicated lock just can work around this specific case. 
> I believe solving mmap_sem scalability issue aimed by the patch series 
> is still our consensus.
>
> Thanks,
> Yang
>
>
>
>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-21 16:31     ` Yang Shi
@ 2018-03-21 17:29       ` Matthew Wilcox
  2018-03-21 21:45         ` Yang Shi
                           ` (2 more replies)
  0 siblings, 3 replies; 44+ messages in thread
From: Matthew Wilcox @ 2018-03-21 17:29 UTC (permalink / raw)
  To: Yang Shi; +Cc: Michal Hocko, akpm, linux-mm, linux-kernel

On Wed, Mar 21, 2018 at 09:31:22AM -0700, Yang Shi wrote:
> On 3/21/18 6:08 AM, Michal Hocko wrote:
> > Yes, this definitely sucks. One way to work that around is to split the
> > unmap to two phases. One to drop all the pages. That would only need
> > mmap_sem for read and then tear down the mapping with the mmap_sem for
> > write. This wouldn't help for parallel mmap_sem writers but those really
> > need a different approach (e.g. the range locking).
> 
> page fault might sneak in to map a page which has been unmapped before?
> 
> range locking should help a lot on manipulating small sections of a large
> mapping in parallel or multiple small mappings. It may not achieve too much
> for single large mapping.

I don't think we need range locking.  What if we do munmap this way:

Take the mmap_sem for write
Find the VMA
  If the VMA is large(*)
    Mark the VMA as deleted
    Drop the mmap_sem
    zap all of the entries
    Take the mmap_sem
  Else
    zap all of the entries
Continue finding VMAs
Drop the mmap_sem

Now we need to change everywhere which looks up a VMA to see if it needs
to care that the VMA is deleted (page faults, e.g., will need to SIGBUS; mmap
does not care; munmap will need to wait for the existing munmap operation
to complete), but it gives us the atomicity, at least on a per-VMA basis.

We could also do:

Take the mmap_sem for write
Mark all VMAs in the range as deleted & modify any partial VMAs
Drop mmap_sem
zap pages from deleted VMAs

That would give us the same atomicity that we have today.

Deleted VMAs would need a pointer to a completion, so operations that
need to wait can queue themselves up.  I'd recommend we use the low bit
of vm_file and treat it as a pointer to a struct completion if set.
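
Very roughly, the lookup side might do something like this (the helper name
is invented, purely for illustration):

        /* low bit of vm_file set => vma is being torn down */
        static inline bool vma_deleted(struct vm_area_struct *vma)
        {
                return (unsigned long)vma->vm_file & 1UL;
        }

        /* e.g. in the page fault path */
        vma = find_vma(mm, address);
        if (vma && vma_deleted(vma))
                return VM_FAULT_SIGBUS;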

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-21 17:16       ` Yang Shi
@ 2018-03-21 21:23         ` Michal Hocko
  2018-03-21 22:36           ` Yang Shi
  0 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2018-03-21 21:23 UTC (permalink / raw)
  To: Yang Shi; +Cc: akpm, linux-mm, linux-kernel

On Wed 21-03-18 10:16:41, Yang Shi wrote:
> 
> 
> On 3/21/18 9:50 AM, Yang Shi wrote:
> > 
> > 
> > On 3/21/18 6:14 AM, Michal Hocko wrote:
> > > On Wed 21-03-18 05:31:19, Yang Shi wrote:
> > > > When running some mmap/munmap scalability tests with large memory (i.e.
> > > > > 300GB), the below hung task issue may happen occasionally.
> > > > INFO: task ps:14018 blocked for more than 120 seconds.
> > > >         Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
> > > >   "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> > > > message.
> > > >   ps              D    0 14018      1 0x00000004
> > > >    ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
> > > >    ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
> > > >    00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
> > > >   Call Trace:
> > > >    [<ffffffff817154d0>] ? __schedule+0x250/0x730
> > > >    [<ffffffff817159e6>] schedule+0x36/0x80
> > > >    [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
> > > >    [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
> > > >    [<ffffffff81717db0>] down_read+0x20/0x40
> > > >    [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
> > > Slightly off-topic:
> > > Btw. this sucks as well. Do we really need to take mmap_sem here? Do any
> > > of
> > >     arg_start = mm->arg_start;
> > >     arg_end = mm->arg_end;
> > >     env_start = mm->env_start;
> > >     env_end = mm->env_end;
> > > 
> > > change after exec or while the pid is already visible in proc? If yes
> > > maybe we can use a dedicated lock.
> 
> BTW, this is not the only place to acquire mmap_sem in
> proc_pid_cmdline_read(), it calls access_remote_vm() which need acquire
> mmap_sem too, so the mmap_sem scalability issue will be hit sooner or later.

Ohh, absolutely. mmap_sem is unfortunately abused and it would be great
to remove that. munmap should perform much better. How to do that safely
is a different question. I am not yet convinced that tearing down a vma
in batches is safe. The vast majority of time is spent on tearing down
pages and that is quite easy to move out of the write lock. That would
be an improvement already and it should be risk safe. If even that is
not sufficient then using range locking should help a lot. There
shouldn't be really any other address space operations within the range
most of the time so this would be basically non-contended access.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-21 17:29       ` Matthew Wilcox
@ 2018-03-21 21:45         ` Yang Shi
  2018-03-21 22:15           ` Matthew Wilcox
  2018-03-21 22:46           ` Matthew Wilcox
  2018-03-22 17:34         ` Yang Shi
  2018-03-24 18:24           ` Jerome Glisse
  2 siblings, 2 replies; 44+ messages in thread
From: Yang Shi @ 2018-03-21 21:45 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Michal Hocko, akpm, linux-mm, linux-kernel



On 3/21/18 10:29 AM, Matthew Wilcox wrote:
> On Wed, Mar 21, 2018 at 09:31:22AM -0700, Yang Shi wrote:
>> On 3/21/18 6:08 AM, Michal Hocko wrote:
>>> Yes, this definitely sucks. One way to work that around is to split the
>>> unmap to two phases. One to drop all the pages. That would only need
>>> mmap_sem for read and then tear down the mapping with the mmap_sem for
>>> write. This wouldn't help for parallel mmap_sem writers but those really
>>> need a different approach (e.g. the range locking).
>> page fault might sneak in to map a page which has been unmapped before?
>>
>> range locking should help a lot on manipulating small sections of a large
>> mapping in parallel or multiple small mappings. It may not achieve too much
>> for single large mapping.
> I don't think we need range locking.  What if we do munmap this way:
>
> Take the mmap_sem for write
> Find the VMA
>    If the VMA is large(*)
>      Mark the VMA as deleted
>      Drop the mmap_sem
>      zap all of the entries
>      Take the mmap_sem
>    Else
>      zap all of the entries
> Continue finding VMAs
> Drop the mmap_sem
>
> Now we need to change everywhere which looks up a VMA to see if it needs
> to care the the VMA is deleted (page faults, eg will need to SIGBUS; mmap

Marking the vma as deleted sounds good. The problem with my current approach 
is that a concurrent page fault may succeed if it accesses a not yet 
unmapped section. Marking the vma deleted could tell the page fault handler 
that the vma is no longer valid, so it can return SIGSEGV.

> does not care; munmap will need to wait for the existing munmap operation

Why doesn't mmap care? How about MAP_FIXED? It may fail unexpectedly, right?

Thanks,
Yang

> to complete), but it gives us the atomicity, at least on a per-VMA basis.
>
> We could also do:
>
> Take the mmap_sem for write
> Mark all VMAs in the range as deleted & modify any partial VMAs
> Drop mmap_sem
> zap pages from deleted VMAs
>
> That would give us the same atomicity that we have today.
>
> Deleted VMAs would need a pointer to a completion, so operations that
> need to wait can queue themselves up.  I'd recommend we use the low bit
> of vm_file and treat it as a pointer to a struct completion if set.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-21 21:45         ` Yang Shi
@ 2018-03-21 22:15           ` Matthew Wilcox
  2018-03-21 22:40             ` Yang Shi
  2018-03-21 22:46           ` Matthew Wilcox
  1 sibling, 1 reply; 44+ messages in thread
From: Matthew Wilcox @ 2018-03-21 22:15 UTC (permalink / raw)
  To: Yang Shi; +Cc: Michal Hocko, akpm, linux-mm, linux-kernel

On Wed, Mar 21, 2018 at 02:45:44PM -0700, Yang Shi wrote:
> On 3/21/18 10:29 AM, Matthew Wilcox wrote:
> > On Wed, Mar 21, 2018 at 09:31:22AM -0700, Yang Shi wrote:
> > > On 3/21/18 6:08 AM, Michal Hocko wrote:
> > > > Yes, this definitely sucks. One way to work that around is to split the
> > > > unmap to two phases. One to drop all the pages. That would only need
> > > > mmap_sem for read and then tear down the mapping with the mmap_sem for
> > > > write. This wouldn't help for parallel mmap_sem writers but those really
> > > > need a different approach (e.g. the range locking).
> > > page fault might sneak in to map a page which has been unmapped before?
> > > 
> > > range locking should help a lot on manipulating small sections of a large
> > > mapping in parallel or multiple small mappings. It may not achieve too much
> > > for single large mapping.
> > I don't think we need range locking.  What if we do munmap this way:
> > 
> > Take the mmap_sem for write
> > Find the VMA
> >    If the VMA is large(*)
> >      Mark the VMA as deleted
> >      Drop the mmap_sem
> >      zap all of the entries
> >      Take the mmap_sem
> >    Else
> >      zap all of the entries
> > Continue finding VMAs
> > Drop the mmap_sem
> > 
> > Now we need to change everywhere which looks up a VMA to see if it needs
> > to care the the VMA is deleted (page faults, eg will need to SIGBUS; mmap
> 
> Marking vma as deleted sounds good. The problem for my current approach is
> the concurrent page fault may succeed if it access the not yet unmapped
> section. Marking deleted vma could tell page fault the vma is not valid
> anymore, then return SIGSEGV.
> 
> > does not care; munmap will need to wait for the existing munmap operation
> 
> Why mmap doesn't care? How about MAP_FIXED? It may fail unexpectedly, right?

Oh, I forgot about MAP_FIXED.  Yes, MAP_FIXED should wait for the munmap
to finish.  But a regular mmap can just pretend that it happened before
the munmap call and avoid the deleted VMAs.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-21 21:23         ` Michal Hocko
@ 2018-03-21 22:36           ` Yang Shi
  2018-03-22  9:10             ` Michal Hocko
  0 siblings, 1 reply; 44+ messages in thread
From: Yang Shi @ 2018-03-21 22:36 UTC (permalink / raw)
  To: Michal Hocko; +Cc: akpm, linux-mm, linux-kernel



On 3/21/18 2:23 PM, Michal Hocko wrote:
> On Wed 21-03-18 10:16:41, Yang Shi wrote:
>>
>> On 3/21/18 9:50 AM, Yang Shi wrote:
>>>
>>> On 3/21/18 6:14 AM, Michal Hocko wrote:
>>>> On Wed 21-03-18 05:31:19, Yang Shi wrote:
>>>>> When running some mmap/munmap scalability tests with large memory (i.e.
>>>>>> 300GB), the below hung task issue may happen occasionally.
>>>>> INFO: task ps:14018 blocked for more than 120 seconds.
>>>>>          Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
>>>>>    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
>>>>> message.
>>>>>    ps              D    0 14018      1 0x00000004
>>>>>     ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>>>>>     ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>>>>>     00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>>>>>    Call Trace:
>>>>>     [<ffffffff817154d0>] ? __schedule+0x250/0x730
>>>>>     [<ffffffff817159e6>] schedule+0x36/0x80
>>>>>     [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
>>>>>     [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
>>>>>     [<ffffffff81717db0>] down_read+0x20/0x40
>>>>>     [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
>>>> Slightly off-topic:
>>>> Btw. this sucks as well. Do we really need to take mmap_sem here? Do any
>>>> of
>>>>      arg_start = mm->arg_start;
>>>>      arg_end = mm->arg_end;
>>>>      env_start = mm->env_start;
>>>>      env_end = mm->env_end;
>>>>
>>>> change after exec or while the pid is already visible in proc? If yes
>>>> maybe we can use a dedicated lock.
>> BTW, this is not the only place to acquire mmap_sem in
>> proc_pid_cmdline_read(), it calls access_remote_vm() which need acquire
>> mmap_sem too, so the mmap_sem scalability issue will be hit sooner or later.
> Ohh, absolutely. mmap_sem is unfortunatelly abused and it would be great
> to remove that. munmap should perform much better. How to do that safely

Yes, agree. We are on the same page.

> is a different question. I am not yet convinced that tearing down a vma
> in batches is safe. The vast majority of time is spent on tearing down

You can try my patches. I ran the full LTP test suite and multiple
kernel builds in parallel. It survives.

> pages and that is quite easy to move out of the write lock. That would
> be an improvement already and it should be risk safe. If even that is
> not sufficient then using range locking should help a lot. There
> shouldn't be really any other address space operations within the range
> most of the time so this would be basically non-contended access.

It might depend on how the range is defined. Too big a range may lead to
surprisingly more contention, while too small a range may bring in too many
lock/unlock operations.

Thanks,
Yang

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-21 22:15           ` Matthew Wilcox
@ 2018-03-21 22:40             ` Yang Shi
  0 siblings, 0 replies; 44+ messages in thread
From: Yang Shi @ 2018-03-21 22:40 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Michal Hocko, akpm, linux-mm, linux-kernel



On 3/21/18 3:15 PM, Matthew Wilcox wrote:
> On Wed, Mar 21, 2018 at 02:45:44PM -0700, Yang Shi wrote:
>> On 3/21/18 10:29 AM, Matthew Wilcox wrote:
>>> On Wed, Mar 21, 2018 at 09:31:22AM -0700, Yang Shi wrote:
>>>> On 3/21/18 6:08 AM, Michal Hocko wrote:
>>>>> Yes, this definitely sucks. One way to work that around is to split the
>>>>> unmap to two phases. One to drop all the pages. That would only need
>>>>> mmap_sem for read and then tear down the mapping with the mmap_sem for
>>>>> write. This wouldn't help for parallel mmap_sem writers but those really
>>>>> need a different approach (e.g. the range locking).
>>>> page fault might sneak in to map a page which has been unmapped before?
>>>>
>>>> range locking should help a lot on manipulating small sections of a large
>>>> mapping in parallel or multiple small mappings. It may not achieve too much
>>>> for single large mapping.
>>> I don't think we need range locking.  What if we do munmap this way:
>>>
>>> Take the mmap_sem for write
>>> Find the VMA
>>>     If the VMA is large(*)
>>>       Mark the VMA as deleted
>>>       Drop the mmap_sem
>>>       zap all of the entries
>>>       Take the mmap_sem
>>>     Else
>>>       zap all of the entries
>>> Continue finding VMAs
>>> Drop the mmap_sem
>>>
>>> Now we need to change everywhere which looks up a VMA to see if it needs
>>> to care the the VMA is deleted (page faults, eg will need to SIGBUS; mmap
>> Marking vma as deleted sounds good. The problem for my current approach is
>> the concurrent page fault may succeed if it access the not yet unmapped
>> section. Marking deleted vma could tell page fault the vma is not valid
>> anymore, then return SIGSEGV.
>>
>>> does not care; munmap will need to wait for the existing munmap operation
>> Why mmap doesn't care? How about MAP_FIXED? It may fail unexpectedly, right?
> Oh, I forgot about MAP_FIXED.  Yes, MAP_FIXED should wait for the munmap
> to finish.  But a regular mmap can just pretend that it happened before
> the munmap call and avoid the deleted VMAs.

But my tests show a race condition for a size-reducing mmap, which calls
do_munmap(). It may need to wait for the munmap to finish too.

So, in my patches, I just make the do_munmap() called from mmap() hold
mmap_sem the whole time.

Thanks,
Yang

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-21 21:45         ` Yang Shi
  2018-03-21 22:15           ` Matthew Wilcox
@ 2018-03-21 22:46           ` Matthew Wilcox
  2018-03-22 15:32             ` Laurent Dufour
  1 sibling, 1 reply; 44+ messages in thread
From: Matthew Wilcox @ 2018-03-21 22:46 UTC (permalink / raw)
  To: Yang Shi; +Cc: Michal Hocko, akpm, linux-mm, linux-kernel

On Wed, Mar 21, 2018 at 02:45:44PM -0700, Yang Shi wrote:
> Marking vma as deleted sounds good. The problem for my current approach is
> the concurrent page fault may succeed if it access the not yet unmapped
> section. Marking deleted vma could tell page fault the vma is not valid
> anymore, then return SIGSEGV.
> 
> > does not care; munmap will need to wait for the existing munmap operation
> 
> Why mmap doesn't care? How about MAP_FIXED? It may fail unexpectedly, right?

The other thing about MAP_FIXED that we'll need to handle is unmapping
conflicts atomically.  Say a program has a 200GB mapping and then
mmap(MAP_FIXED) another 200GB region on top of it.  So I think page faults
are also going to have to wait for deleted vmas (then retry the fault)
rather than immediately raising SIGSEGV.
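
A minimal sketch of that in the fault path could be (illustrative only;
vma_is_deleted()/vma_deleted_completion() are hypothetical helpers, and
real code would have to coordinate with FAULT_FLAG_ALLOW_RETRY):

    /* Block on the in-flight unmap/replace, then redo the whole fault. */
    static int wait_on_deleted_vma(struct vm_area_struct *vma)
    {
        struct completion *done = vma_deleted_completion(vma);
        struct mm_struct *mm = vma->vm_mm;

        up_read(&mm->mmap_sem);        /* never sleep holding mmap_sem */
        wait_for_completion(done);
        return VM_FAULT_RETRY;         /* re-walk the VMA tree on retry */
    }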

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-21 22:36           ` Yang Shi
@ 2018-03-22  9:10             ` Michal Hocko
  2018-03-22 16:06               ` Yang Shi
  0 siblings, 1 reply; 44+ messages in thread
From: Michal Hocko @ 2018-03-22  9:10 UTC (permalink / raw)
  To: Yang Shi; +Cc: akpm, linux-mm, linux-kernel

On Wed 21-03-18 15:36:12, Yang Shi wrote:
> 
> 
> On 3/21/18 2:23 PM, Michal Hocko wrote:
> > On Wed 21-03-18 10:16:41, Yang Shi wrote:
> > > 
> > > On 3/21/18 9:50 AM, Yang Shi wrote:
> > > > 
> > > > On 3/21/18 6:14 AM, Michal Hocko wrote:
> > > > > On Wed 21-03-18 05:31:19, Yang Shi wrote:
> > > > > > When running some mmap/munmap scalability tests with large memory (i.e.
> > > > > > > 300GB), the below hung task issue may happen occasionally.
> > > > > > INFO: task ps:14018 blocked for more than 120 seconds.
> > > > > >          Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
> > > > > >    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> > > > > > message.
> > > > > >    ps              D    0 14018      1 0x00000004
> > > > > >     ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
> > > > > >     ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
> > > > > >     00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
> > > > > >    Call Trace:
> > > > > >     [<ffffffff817154d0>] ? __schedule+0x250/0x730
> > > > > >     [<ffffffff817159e6>] schedule+0x36/0x80
> > > > > >     [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
> > > > > >     [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
> > > > > >     [<ffffffff81717db0>] down_read+0x20/0x40
> > > > > >     [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
> > > > > Slightly off-topic:
> > > > > Btw. this sucks as well. Do we really need to take mmap_sem here? Do any
> > > > > of
> > > > >      arg_start = mm->arg_start;
> > > > >      arg_end = mm->arg_end;
> > > > >      env_start = mm->env_start;
> > > > >      env_end = mm->env_end;
> > > > > 
> > > > > change after exec or while the pid is already visible in proc? If yes
> > > > > maybe we can use a dedicated lock.
> > > BTW, this is not the only place to acquire mmap_sem in
> > > proc_pid_cmdline_read(), it calls access_remote_vm() which need acquire
> > > mmap_sem too, so the mmap_sem scalability issue will be hit sooner or later.
> > Ohh, absolutely. mmap_sem is unfortunatelly abused and it would be great
> > to remove that. munmap should perform much better. How to do that safely
> 
> Yes, agree. We are on the same page.
> 
> > is a different question. I am not yet convinced that tearing down a vma
> > in batches is safe. The vast majority of time is spent on tearing down
> 
> You can try my patches. I did full LTP test and running multiple kernel
> build in parallel. It survives.

Which doesn't really mean anything. Those tests are likely to not hit
corner cases where an application silently depends on the mmap locking
and unmap atomicity.
 
> > pages and that is quite easy to move out of the write lock. That would
> > be an improvement already and it should be risk safe. If even that is
> > not sufficient then using range locking should help a lot. There
> > shouldn't be really any other address space operations within the range
> > most of the time so this would be basically non-contended access.
> 
> It might depend on how the range is defined. Too big range may lead to
> surprisingly more contention, but too small range may bring in too much
> lock/unlock operations.

The full vma will have to be range locked. So there is nothing small or large.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-21 22:46           ` Matthew Wilcox
@ 2018-03-22 15:32             ` Laurent Dufour
  2018-03-22 15:40               ` Matthew Wilcox
  0 siblings, 1 reply; 44+ messages in thread
From: Laurent Dufour @ 2018-03-22 15:32 UTC (permalink / raw)
  To: Matthew Wilcox, Yang Shi; +Cc: Michal Hocko, akpm, linux-mm, linux-kernel



On 21/03/2018 23:46, Matthew Wilcox wrote:
> On Wed, Mar 21, 2018 at 02:45:44PM -0700, Yang Shi wrote:
>> Marking vma as deleted sounds good. The problem for my current approach is
>> the concurrent page fault may succeed if it access the not yet unmapped
>> section. Marking deleted vma could tell page fault the vma is not valid
>> anymore, then return SIGSEGV.
>>
>>> does not care; munmap will need to wait for the existing munmap operation
>>
>> Why mmap doesn't care? How about MAP_FIXED? It may fail unexpectedly, right?
> 
> The other thing about MAP_FIXED that we'll need to handle is unmapping
> conflicts atomically.  Say a program has a 200GB mapping and then
> mmap(MAP_FIXED) another 200GB region on top of it.  So I think page faults
> are also going to have to wait for deleted vmas (then retry the fault)
> rather than immediately raising SIGSEGV.

Regarding the page fault, why not rely on the PTE locking?

When munmap() unsets the PTE it will have to hold the PTE lock, so this
will serialize the access.
If the page fault occurs before the mmap(MAP_FIXED), the mapped page will be
removed when mmap(MAP_FIXED) does the cleanup. Fair enough.
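
For reference, the serialization being relied on here is just the shared
per-page-table lock; a minimal sketch (the helper is made up, TLB flushing
omitted, but the locking calls are the real ones):

    static void clear_one_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr)
    {
        spinlock_t *ptl;
        pte_t *pte = pte_offset_map_lock(mm, pmd, addr, &ptl);

        /* A concurrent fault on addr spins on ptl until we drop it. */
        pte_clear(mm, addr, pte);
        pte_unmap_unlock(pte, ptl);
    }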

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-22 15:32             ` Laurent Dufour
@ 2018-03-22 15:40               ` Matthew Wilcox
  2018-03-22 15:54                 ` Laurent Dufour
  0 siblings, 1 reply; 44+ messages in thread
From: Matthew Wilcox @ 2018-03-22 15:40 UTC (permalink / raw)
  To: Laurent Dufour; +Cc: Yang Shi, Michal Hocko, akpm, linux-mm, linux-kernel

On Thu, Mar 22, 2018 at 04:32:00PM +0100, Laurent Dufour wrote:
> On 21/03/2018 23:46, Matthew Wilcox wrote:
> > On Wed, Mar 21, 2018 at 02:45:44PM -0700, Yang Shi wrote:
> >> Marking vma as deleted sounds good. The problem for my current approach is
> >> the concurrent page fault may succeed if it access the not yet unmapped
> >> section. Marking deleted vma could tell page fault the vma is not valid
> >> anymore, then return SIGSEGV.
> >>
> >>> does not care; munmap will need to wait for the existing munmap operation
> >>
> >> Why mmap doesn't care? How about MAP_FIXED? It may fail unexpectedly, right?
> > 
> > The other thing about MAP_FIXED that we'll need to handle is unmapping
> > conflicts atomically.  Say a program has a 200GB mapping and then
> > mmap(MAP_FIXED) another 200GB region on top of it.  So I think page faults
> > are also going to have to wait for deleted vmas (then retry the fault)
> > rather than immediately raising SIGSEGV.
> 
> Regarding the page fault, why not relying on the PTE locking ?
> 
> When munmap() will unset the PTE it will have to held the PTE lock, so this
> will serialize the access.
> If the page fault occurs before the mmap(MAP_FIXED), the page mapped will be
> removed when mmap(MAP_FIXED) would do the cleanup. Fair enough.

The page fault handler will walk the VMA tree to find the correct
VMA and then find that the VMA is marked as deleted.  If it assumes
that the VMA has been deleted because of munmap(), then it can raise
SIGSEGV immediately.  But if the VMA is marked as deleted because of
mmap(MAP_FIXED), it must wait until the new VMA is in place.

I think I was wrong to describe VMAs as being *deleted*.  I think we
instead need the concept of a *locked* VMA that page faults will block on.
Conceptually, it's a per-VMA rwsem, but I'd use a completion instead of
an rwsem since the only reason to write-lock the VMA is because it is
being deleted.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-22 15:40               ` Matthew Wilcox
@ 2018-03-22 15:54                 ` Laurent Dufour
  2018-03-22 16:05                   ` Matthew Wilcox
  0 siblings, 1 reply; 44+ messages in thread
From: Laurent Dufour @ 2018-03-22 15:54 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Yang Shi, Michal Hocko, akpm, linux-mm, linux-kernel



On 22/03/2018 16:40, Matthew Wilcox wrote:
> On Thu, Mar 22, 2018 at 04:32:00PM +0100, Laurent Dufour wrote:
>> On 21/03/2018 23:46, Matthew Wilcox wrote:
>>> On Wed, Mar 21, 2018 at 02:45:44PM -0700, Yang Shi wrote:
>>>> Marking vma as deleted sounds good. The problem for my current approach is
>>>> the concurrent page fault may succeed if it access the not yet unmapped
>>>> section. Marking deleted vma could tell page fault the vma is not valid
>>>> anymore, then return SIGSEGV.
>>>>
>>>>> does not care; munmap will need to wait for the existing munmap operation
>>>>
>>>> Why mmap doesn't care? How about MAP_FIXED? It may fail unexpectedly, right?
>>>
>>> The other thing about MAP_FIXED that we'll need to handle is unmapping
>>> conflicts atomically.  Say a program has a 200GB mapping and then
>>> mmap(MAP_FIXED) another 200GB region on top of it.  So I think page faults
>>> are also going to have to wait for deleted vmas (then retry the fault)
>>> rather than immediately raising SIGSEGV.
>>
>> Regarding the page fault, why not relying on the PTE locking ?
>>
>> When munmap() will unset the PTE it will have to held the PTE lock, so this
>> will serialize the access.
>> If the page fault occurs before the mmap(MAP_FIXED), the page mapped will be
>> removed when mmap(MAP_FIXED) would do the cleanup. Fair enough.
> 
> The page fault handler will walk the VMA tree to find the correct
> VMA and then find that the VMA is marked as deleted.  If it assumes
> that the VMA has been deleted because of munmap(), then it can raise
> SIGSEGV immediately.  But if the VMA is marked as deleted because of
> mmap(MAP_FIXED), it must wait until the new VMA is in place.

I'm wondering if such complexity is required.
If the user space process tries to access the page being overwritten through
mmap(MAP_FIXED) by another thread, there is no guarantee that it will
see the *old* page or the *new* one.
I'd think it is up to the user process to handle that concurrency.
What needs to be guaranteed is that once mmap(MAP_FIXED) returns, the old pages
are no longer there, which is ensured through the mmap_sem and PTE locking.

> I think I was wrong to describe VMAs as being *deleted*.  I think we
> instead need the concept of a *locked* VMA that page faults will block on.
> Conceptually, it's a per-VMA rwsem, but I'd use a completion instead of
> an rwsem since the only reason to write-lock the VMA is because it is
> being deleted.

Such a lock would only make sense in the case of mmap(MAP_FIXED), since when
the VMA is removed there is no need to wait. Isn't it?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-22 15:54                 ` Laurent Dufour
@ 2018-03-22 16:05                   ` Matthew Wilcox
  2018-03-22 16:18                     ` Laurent Dufour
  2018-03-22 16:49                     ` Yang Shi
  0 siblings, 2 replies; 44+ messages in thread
From: Matthew Wilcox @ 2018-03-22 16:05 UTC (permalink / raw)
  To: Laurent Dufour; +Cc: Yang Shi, Michal Hocko, akpm, linux-mm, linux-kernel

On Thu, Mar 22, 2018 at 04:54:52PM +0100, Laurent Dufour wrote:
> On 22/03/2018 16:40, Matthew Wilcox wrote:
> > On Thu, Mar 22, 2018 at 04:32:00PM +0100, Laurent Dufour wrote:
> >> Regarding the page fault, why not relying on the PTE locking ?
> >>
> >> When munmap() will unset the PTE it will have to held the PTE lock, so this
> >> will serialize the access.
> >> If the page fault occurs before the mmap(MAP_FIXED), the page mapped will be
> >> removed when mmap(MAP_FIXED) would do the cleanup. Fair enough.
> > 
> > The page fault handler will walk the VMA tree to find the correct
> > VMA and then find that the VMA is marked as deleted.  If it assumes
> > that the VMA has been deleted because of munmap(), then it can raise
> > SIGSEGV immediately.  But if the VMA is marked as deleted because of
> > mmap(MAP_FIXED), it must wait until the new VMA is in place.
> 
> I'm wondering if such a complexity is required.
> If the user space process try to access the page being overwritten through
> mmap(MAP_FIXED) by another thread, there is no guarantee that it will
> manipulate the *old* page or *new* one.

Right; but it must return one or the other, it can't segfault.

> I'd think this is up to the user process to handle that concurrency.
> What needs to be guaranteed is that once mmap(MAP_FIXED) returns the old page
> are no more there, which is done through the mmap_sem and PTE locking.

Yes, and allowing the fault handler to return the *old* page risks the
old page being reinserted into the page tables after the unmapping task
has done its work.

It's *really* rare to page-fault on a VMA which is in the middle of
being replaced.  Why are you trying to optimise it?

> > I think I was wrong to describe VMAs as being *deleted*.  I think we
> > instead need the concept of a *locked* VMA that page faults will block on.
> > Conceptually, it's a per-VMA rwsem, but I'd use a completion instead of
> > an rwsem since the only reason to write-lock the VMA is because it is
> > being deleted.
> 
> Such a lock would only makes sense in the case of mmap(MAP_FIXED) since when
> the VMA is removed there is no need to wait. Isn't it ?

I can't think of another reason.  I suppose we could mark the VMA as
locked-for-deletion or locked-for-replacement and have the SIGSEGV happen
early.  But I'm not sure that optimising for SIGSEGVs is a worthwhile
use of our time.  Just always have the pagefault sleep for a deleted VMA.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-22  9:10             ` Michal Hocko
@ 2018-03-22 16:06               ` Yang Shi
  2018-03-22 16:12                 ` Michal Hocko
  2018-03-22 16:13                 ` Matthew Wilcox
  0 siblings, 2 replies; 44+ messages in thread
From: Yang Shi @ 2018-03-22 16:06 UTC (permalink / raw)
  To: Michal Hocko; +Cc: akpm, linux-mm, linux-kernel



On 3/22/18 2:10 AM, Michal Hocko wrote:
> On Wed 21-03-18 15:36:12, Yang Shi wrote:
>>
>> On 3/21/18 2:23 PM, Michal Hocko wrote:
>>> On Wed 21-03-18 10:16:41, Yang Shi wrote:
>>>> On 3/21/18 9:50 AM, Yang Shi wrote:
>>>>> On 3/21/18 6:14 AM, Michal Hocko wrote:
>>>>>> On Wed 21-03-18 05:31:19, Yang Shi wrote:
>>>>>>> When running some mmap/munmap scalability tests with large memory (i.e.
>>>>>>>> 300GB), the below hung task issue may happen occasionally.
>>>>>>> INFO: task ps:14018 blocked for more than 120 seconds.
>>>>>>>           Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
>>>>>>>     "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
>>>>>>> message.
>>>>>>>     ps              D    0 14018      1 0x00000004
>>>>>>>      ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>>>>>>>      ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>>>>>>>      00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>>>>>>>     Call Trace:
>>>>>>>      [<ffffffff817154d0>] ? __schedule+0x250/0x730
>>>>>>>      [<ffffffff817159e6>] schedule+0x36/0x80
>>>>>>>      [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
>>>>>>>      [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
>>>>>>>      [<ffffffff81717db0>] down_read+0x20/0x40
>>>>>>>      [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
>>>>>> Slightly off-topic:
>>>>>> Btw. this sucks as well. Do we really need to take mmap_sem here? Do any
>>>>>> of
>>>>>>       arg_start = mm->arg_start;
>>>>>>       arg_end = mm->arg_end;
>>>>>>       env_start = mm->env_start;
>>>>>>       env_end = mm->env_end;
>>>>>>
>>>>>> change after exec or while the pid is already visible in proc? If yes
>>>>>> maybe we can use a dedicated lock.
>>>> BTW, this is not the only place to acquire mmap_sem in
>>>> proc_pid_cmdline_read(), it calls access_remote_vm() which need acquire
>>>> mmap_sem too, so the mmap_sem scalability issue will be hit sooner or later.
>>> Ohh, absolutely. mmap_sem is unfortunatelly abused and it would be great
>>> to remove that. munmap should perform much better. How to do that safely
>> Yes, agree. We are on the same page.
>>
>>> is a different question. I am not yet convinced that tearing down a vma
>>> in batches is safe. The vast majority of time is spent on tearing down
>> You can try my patches. I did full LTP test and running multiple kernel
>> build in parallel. It survives.
> Which doesn't really mean anything. Those tests are likely to not hit
> corner cases where an application silently depends on the mmap locking
> and unmap atomicity.

They definitely can't cover all corner cases. But they do give us some
confidence that most of it works. The mmap stress tests in LTP did discover
some race conditions when I tried different approaches.

>   
>>> pages and that is quite easy to move out of the write lock. That would
>>> be an improvement already and it should be risk safe. If even that is
>>> not sufficient then using range locking should help a lot. There
>>> shouldn't be really any other address space operations within the range
>>> most of the time so this would be basically non-contended access.
>> It might depend on how the range is defined. Too big range may lead to
>> surprisingly more contention, but too small range may bring in too much
>> lock/unlock operations.
> The full vma will have to be range locked. So there is nothing small or large.

It doesn't sound helpful for the single large vma case: there is just one
range lock for the whole vma, so it sounds equivalent to mmap_sem.

Yang

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-22 16:06               ` Yang Shi
@ 2018-03-22 16:12                 ` Michal Hocko
  2018-03-22 16:13                 ` Matthew Wilcox
  1 sibling, 0 replies; 44+ messages in thread
From: Michal Hocko @ 2018-03-22 16:12 UTC (permalink / raw)
  To: Yang Shi; +Cc: akpm, linux-mm, linux-kernel

On Thu 22-03-18 09:06:14, Yang Shi wrote:
> 
> 
> On 3/22/18 2:10 AM, Michal Hocko wrote:
> > On Wed 21-03-18 15:36:12, Yang Shi wrote:
> > > 
> > > On 3/21/18 2:23 PM, Michal Hocko wrote:
[...]
> > > > pages and that is quite easy to move out of the write lock. That would
> > > > be an improvement already and it should be risk safe. If even that is
> > > > not sufficient then using range locking should help a lot. There
> > > > shouldn't be really any other address space operations within the range
> > > > most of the time so this would be basically non-contended access.
> > > It might depend on how the range is defined. Too big range may lead to
> > > surprisingly more contention, but too small range may bring in too much
> > > lock/unlock operations.
> > The full vma will have to be range locked. So there is nothing small or large.
> 
> It sounds not helpful to a single large vma case since just one range lock
> for the vma, it sounds equal to mmap_sem.

This is not how the range locking works. If we have a range lock per mm
then disjoint ranges do not contend. So if you are unmapping one
vma and want to create a new mapping or fault into a different range
then you are basically lockless.
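
A hypothetical sketch of that usage (no such API exists today; the names
are invented, the point is only that disjoint ranges do not serialize):

    static void munmap_one_range(struct mm_struct *mm,
                                 unsigned long start, unsigned long end)
    {
        struct mm_range_lock rl;                  /* invented type */

        mm_range_write_lock(mm, &rl, start, end); /* invented API */
        /* detach vmas, zap pages, free page tables in [start, end) only */
        mm_range_write_unlock(mm, &rl);
        /* faults and mmaps outside [start, end) never blocked on this */
    }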

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-22 16:06               ` Yang Shi
  2018-03-22 16:12                 ` Michal Hocko
@ 2018-03-22 16:13                 ` Matthew Wilcox
  2018-03-22 16:28                   ` Laurent Dufour
  1 sibling, 1 reply; 44+ messages in thread
From: Matthew Wilcox @ 2018-03-22 16:13 UTC (permalink / raw)
  To: Yang Shi; +Cc: Michal Hocko, akpm, linux-mm, linux-kernel

On Thu, Mar 22, 2018 at 09:06:14AM -0700, Yang Shi wrote:
> On 3/22/18 2:10 AM, Michal Hocko wrote:
> > On Wed 21-03-18 15:36:12, Yang Shi wrote:
> > > On 3/21/18 2:23 PM, Michal Hocko wrote:
> > > > On Wed 21-03-18 10:16:41, Yang Shi wrote:
> > > > > proc_pid_cmdline_read(), it calls access_remote_vm() which need acquire
> > > > > mmap_sem too, so the mmap_sem scalability issue will be hit sooner or later.
> > > > Ohh, absolutely. mmap_sem is unfortunatelly abused and it would be great
> > > > to remove that. munmap should perform much better. How to do that safely
> > The full vma will have to be range locked. So there is nothing small or large.
> 
> It sounds not helpful to a single large vma case since just one range lock
> for the vma, it sounds equal to mmap_sem.

But splitting mmap_sem into pieces is beneficial for this case.  Imagine
we have a spinlock / rwlock to protect the rbtree / arg_start / arg_end
/ ...  and then each VMA has a rwsem (or equivalent).  access_remote_vm()
would walk the tree and grab the VMA's rwsem for read while reading
out the arguments.  The munmap code would have a completely different
VMA write-locked.
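
Roughly, the reader side could look like this sketch (vma_tree_lock and
vm_lock are hypothetical fields, not existing kernel API):

    static struct vm_area_struct *find_and_read_lock_vma(struct mm_struct *mm,
                                                          unsigned long addr)
    {
        struct vm_area_struct *vma;

        read_lock(&mm->vma_tree_lock);            /* hypothetical rwlock */
        vma = find_vma(mm, addr);
        if (vma && !down_read_trylock(&vma->vm_lock))
            vma = NULL;                           /* contended: caller retries */
        read_unlock(&mm->vma_tree_lock);

        return vma;    /* caller does up_read(&vma->vm_lock) when done */
    }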

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-22 16:05                   ` Matthew Wilcox
@ 2018-03-22 16:18                     ` Laurent Dufour
  2018-03-22 16:46                       ` Yang Shi
  2018-03-22 16:51                       ` Matthew Wilcox
  2018-03-22 16:49                     ` Yang Shi
  1 sibling, 2 replies; 44+ messages in thread
From: Laurent Dufour @ 2018-03-22 16:18 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Yang Shi, Michal Hocko, akpm, linux-mm, linux-kernel



On 22/03/2018 17:05, Matthew Wilcox wrote:
> On Thu, Mar 22, 2018 at 04:54:52PM +0100, Laurent Dufour wrote:
>> On 22/03/2018 16:40, Matthew Wilcox wrote:
>>> On Thu, Mar 22, 2018 at 04:32:00PM +0100, Laurent Dufour wrote:
>>>> Regarding the page fault, why not relying on the PTE locking ?
>>>>
>>>> When munmap() will unset the PTE it will have to held the PTE lock, so this
>>>> will serialize the access.
>>>> If the page fault occurs before the mmap(MAP_FIXED), the page mapped will be
>>>> removed when mmap(MAP_FIXED) would do the cleanup. Fair enough.
>>>
>>> The page fault handler will walk the VMA tree to find the correct
>>> VMA and then find that the VMA is marked as deleted.  If it assumes
>>> that the VMA has been deleted because of munmap(), then it can raise
>>> SIGSEGV immediately.  But if the VMA is marked as deleted because of
>>> mmap(MAP_FIXED), it must wait until the new VMA is in place.
>>
>> I'm wondering if such a complexity is required.
>> If the user space process try to access the page being overwritten through
>> mmap(MAP_FIXED) by another thread, there is no guarantee that it will
>> manipulate the *old* page or *new* one.
> 
> Right; but it must return one or the other, it can't segfault.

Good point, I missed that...

> 
>> I'd think this is up to the user process to handle that concurrency.
>> What needs to be guaranteed is that once mmap(MAP_FIXED) returns the old page
>> are no more there, which is done through the mmap_sem and PTE locking.
> 
> Yes, and allowing the fault handler to return the *old* page risks the
> old page being reinserted into the page tables after the unmapping task
> has done its work.

The PTE locking should prevent that.

> It's *really* rare to page-fault on a VMA which is in the middle of
> being replaced.  Why are you trying to optimise it?

I was not trying to optimize it, but to avoid waiting in the page fault handler.
This could become tricky in the case where the VMA is removed once mmap(MAP_FIXED)
is done and before the waiting page fault gets woken up. This means that the
removed VMA structure will have to remain until all the waiters are woken up,
which implies a ref_count or similar.

> 
>>> I think I was wrong to describe VMAs as being *deleted*.  I think we
>>> instead need the concept of a *locked* VMA that page faults will block on.
>>> Conceptually, it's a per-VMA rwsem, but I'd use a completion instead of
>>> an rwsem since the only reason to write-lock the VMA is because it is
>>> being deleted.
>>
>> Such a lock would only makes sense in the case of mmap(MAP_FIXED) since when
>> the VMA is removed there is no need to wait. Isn't it ?
> 
> I can't think of another reason.  I suppose we could mark the VMA as
> locked-for-deletion or locked-for-replacement and have the SIGSEGV happen
> early.  But I'm not sure that optimising for SIGSEGVs is a worthwhile
> use of our time.  Just always have the pagefault sleep for a deleted VMA.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-22 16:13                 ` Matthew Wilcox
@ 2018-03-22 16:28                   ` Laurent Dufour
  2018-03-22 16:36                     ` David Laight
  0 siblings, 1 reply; 44+ messages in thread
From: Laurent Dufour @ 2018-03-22 16:28 UTC (permalink / raw)
  To: Matthew Wilcox, Yang Shi; +Cc: Michal Hocko, akpm, linux-mm, linux-kernel

On 22/03/2018 17:13, Matthew Wilcox wrote:
> On Thu, Mar 22, 2018 at 09:06:14AM -0700, Yang Shi wrote:
>> On 3/22/18 2:10 AM, Michal Hocko wrote:
>>> On Wed 21-03-18 15:36:12, Yang Shi wrote:
>>>> On 3/21/18 2:23 PM, Michal Hocko wrote:
>>>>> On Wed 21-03-18 10:16:41, Yang Shi wrote:
>>>>>> proc_pid_cmdline_read(), it calls access_remote_vm() which need acquire
>>>>>> mmap_sem too, so the mmap_sem scalability issue will be hit sooner or later.
>>>>> Ohh, absolutely. mmap_sem is unfortunatelly abused and it would be great
>>>>> to remove that. munmap should perform much better. How to do that safely
>>> The full vma will have to be range locked. So there is nothing small or large.
>>
>> It sounds not helpful to a single large vma case since just one range lock
>> for the vma, it sounds equal to mmap_sem.
> 
> But splitting mmap_sem into pieces is beneficial for this case.  Imagine
> we have a spinlock / rwlock to protect the rbtree 

Which is more or less what I'm proposing in the speculative page fault series:
https://lkml.org/lkml/2018/3/13/1158

This being said, having a per-VMA lock could lead to tricky deadlock cases
when merging of multiple VMAs happens in parallel, since multiple VMAs will have
to be locked at the same time; grabbing those locks in a well-defined order will
be required.

> ... / arg_start / arg_end
> / ...  and then each VMA has a rwsem (or equivalent).  access_remote_vm()
> would walk the tree and grab the VMA's rwsem for read while reading
> out the arguments.  The munmap code would have a completely different
> VMA write-locked.
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* RE: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-22 16:28                   ` Laurent Dufour
@ 2018-03-22 16:36                     ` David Laight
  0 siblings, 0 replies; 44+ messages in thread
From: David Laight @ 2018-03-22 16:36 UTC (permalink / raw)
  To: 'Laurent Dufour', Matthew Wilcox, Yang Shi
  Cc: Michal Hocko, akpm, linux-mm, linux-kernel

From:  Laurent Dufour
> Sent: 22 March 2018 16:29
...
> This being said, having a per VMA lock could lead to tricky dead lock case,
> when merging multiple VMA happens in parallel since multiple VMA will have to
> be locked at the same time, grabbing those lock in a fine order will be required.

You could have a global lock and per-VMA locks.
Anything that only accesses one VMA could release the global lock after
acquiring the per-VMA lock.
If code needs multiple VMAs 'locked' it can lock and unlock each VMA
in turn, then keep the global lock held.
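
As a sketch of that ordering (all lock names invented):

    /* Single-VMA path: keep only the per-VMA lock. */
    static void lock_one_vma(struct mm_struct *mm, struct vm_area_struct *vma)
    {
        down_read(&mm->global_sem);
        down_read(&vma->vm_lock);
        up_read(&mm->global_sem);
        /* caller operates on vma, then up_read(&vma->vm_lock) */
    }

    /* Multi-VMA path (e.g. merge): drain each VMA, keep the global lock. */
    static void lock_for_merge(struct mm_struct *mm,
                               struct vm_area_struct *a, struct vm_area_struct *b)
    {
        down_write(&mm->global_sem);
        down_write(&a->vm_lock); up_write(&a->vm_lock);
        down_write(&b->vm_lock); up_write(&b->vm_lock);
        /* caller merges under the still-held global lock, then releases it */
    }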

	David

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-22 16:18                     ` Laurent Dufour
@ 2018-03-22 16:46                       ` Yang Shi
  2018-03-23 13:03                           ` Laurent Dufour
  2018-03-22 16:51                       ` Matthew Wilcox
  1 sibling, 1 reply; 44+ messages in thread
From: Yang Shi @ 2018-03-22 16:46 UTC (permalink / raw)
  To: Laurent Dufour, Matthew Wilcox; +Cc: Michal Hocko, akpm, linux-mm, linux-kernel



On 3/22/18 9:18 AM, Laurent Dufour wrote:
>
> On 22/03/2018 17:05, Matthew Wilcox wrote:
>> On Thu, Mar 22, 2018 at 04:54:52PM +0100, Laurent Dufour wrote:
>>> On 22/03/2018 16:40, Matthew Wilcox wrote:
>>>> On Thu, Mar 22, 2018 at 04:32:00PM +0100, Laurent Dufour wrote:
>>>>> Regarding the page fault, why not relying on the PTE locking ?
>>>>>
>>>>> When munmap() will unset the PTE it will have to held the PTE lock, so this
>>>>> will serialize the access.
>>>>> If the page fault occurs before the mmap(MAP_FIXED), the page mapped will be
>>>>> removed when mmap(MAP_FIXED) would do the cleanup. Fair enough.
>>>> The page fault handler will walk the VMA tree to find the correct
>>>> VMA and then find that the VMA is marked as deleted.  If it assumes
>>>> that the VMA has been deleted because of munmap(), then it can raise
>>>> SIGSEGV immediately.  But if the VMA is marked as deleted because of
>>>> mmap(MAP_FIXED), it must wait until the new VMA is in place.
>>> I'm wondering if such a complexity is required.
>>> If the user space process try to access the page being overwritten through
>>> mmap(MAP_FIXED) by another thread, there is no guarantee that it will
>>> manipulate the *old* page or *new* one.
>> Right; but it must return one or the other, it can't segfault.
> Good point, I missed that...
>
>>> I'd think this is up to the user process to handle that concurrency.
>>> What needs to be guaranteed is that once mmap(MAP_FIXED) returns the old page
>>> are no more there, which is done through the mmap_sem and PTE locking.
>> Yes, and allowing the fault handler to return the *old* page risks the
>> old page being reinserted into the page tables after the unmapping task
>> has done its work.
> The PTE locking should prevent that.
>
>> It's *really* rare to page-fault on a VMA which is in the middle of
>> being replaced.  Why are you trying to optimise it?
> I was not trying to optimize it, but to not wait in the page fault handler.
> This could become tricky in the case the VMA is removed once mmap(MAP_FIXED) is
> done and before the waiting page fault got woken up. This means that the
> removed VMA structure will have to remain until all the waiters are woken up
> which implies ref_count or similar.

We may not need a ref_count. After removing the "locked-for-deletion" vmas
when mmap(MAP_FIXED) is done, just wake up the page fault to look up the vma
again; it will then find the new vma installed by mmap(MAP_FIXED), right?

I'm not sure whether a completion can do this or not since I'm not quite
familiar with it :-(

Yang

>
>>>> I think I was wrong to describe VMAs as being *deleted*.  I think we
>>>> instead need the concept of a *locked* VMA that page faults will block on.
>>>> Conceptually, it's a per-VMA rwsem, but I'd use a completion instead of
>>>> an rwsem since the only reason to write-lock the VMA is because it is
>>>> being deleted.
>>> Such a lock would only makes sense in the case of mmap(MAP_FIXED) since when
>>> the VMA is removed there is no need to wait. Isn't it ?
>> I can't think of another reason.  I suppose we could mark the VMA as
>> locked-for-deletion or locked-for-replacement and have the SIGSEGV happen
>> early.  But I'm not sure that optimising for SIGSEGVs is a worthwhile
>> use of our time.  Just always have the pagefault sleep for a deleted VMA.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-22 16:05                   ` Matthew Wilcox
  2018-03-22 16:18                     ` Laurent Dufour
@ 2018-03-22 16:49                     ` Yang Shi
  1 sibling, 0 replies; 44+ messages in thread
From: Yang Shi @ 2018-03-22 16:49 UTC (permalink / raw)
  To: Matthew Wilcox, Laurent Dufour; +Cc: Michal Hocko, akpm, linux-mm, linux-kernel



On 3/22/18 9:05 AM, Matthew Wilcox wrote:
> On Thu, Mar 22, 2018 at 04:54:52PM +0100, Laurent Dufour wrote:
>> On 22/03/2018 16:40, Matthew Wilcox wrote:
>>> On Thu, Mar 22, 2018 at 04:32:00PM +0100, Laurent Dufour wrote:
>>>> Regarding the page fault, why not relying on the PTE locking ?
>>>>
>>>> When munmap() will unset the PTE it will have to held the PTE lock, so this
>>>> will serialize the access.
>>>> If the page fault occurs before the mmap(MAP_FIXED), the page mapped will be
>>>> removed when mmap(MAP_FIXED) would do the cleanup. Fair enough.
>>> The page fault handler will walk the VMA tree to find the correct
>>> VMA and then find that the VMA is marked as deleted.  If it assumes
>>> that the VMA has been deleted because of munmap(), then it can raise
>>> SIGSEGV immediately.  But if the VMA is marked as deleted because of
>>> mmap(MAP_FIXED), it must wait until the new VMA is in place.
>> I'm wondering if such a complexity is required.
>> If the user space process try to access the page being overwritten through
>> mmap(MAP_FIXED) by another thread, there is no guarantee that it will
>> manipulate the *old* page or *new* one.
> Right; but it must return one or the other, it can't segfault.
>
>> I'd think this is up to the user process to handle that concurrency.
>> What needs to be guaranteed is that once mmap(MAP_FIXED) returns the old page
>> are no more there, which is done through the mmap_sem and PTE locking.
> Yes, and allowing the fault handler to return the *old* page risks the
> old page being reinserted into the page tables after the unmapping task
> has done its work.
>
> It's *really* rare to page-fault on a VMA which is in the middle of
> being replaced.  Why are you trying to optimise it?
>
>>> I think I was wrong to describe VMAs as being *deleted*.  I think we
>>> instead need the concept of a *locked* VMA that page faults will block on.
>>> Conceptually, it's a per-VMA rwsem, but I'd use a completion instead of
>>> an rwsem since the only reason to write-lock the VMA is because it is
>>> being deleted.
>> Such a lock would only makes sense in the case of mmap(MAP_FIXED) since when
>> the VMA is removed there is no need to wait. Isn't it ?
> I can't think of another reason.  I suppose we could mark the VMA as
> locked-for-deletion or locked-for-replacement and have the SIGSEGV happen
> early.  But I'm not sure that optimising for SIGSEGVs is a worthwhile
> use of our time.  Just always have the pagefault sleep for a deleted VMA.

It sounds worthwhile to me. If we have every page fault sleep until the vma
deletion is done, that sounds equivalent to waiting for mmap_sem all the time, right?

Yang

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-22 16:18                     ` Laurent Dufour
  2018-03-22 16:46                       ` Yang Shi
@ 2018-03-22 16:51                       ` Matthew Wilcox
  1 sibling, 0 replies; 44+ messages in thread
From: Matthew Wilcox @ 2018-03-22 16:51 UTC (permalink / raw)
  To: Laurent Dufour; +Cc: Yang Shi, Michal Hocko, akpm, linux-mm, linux-kernel

On Thu, Mar 22, 2018 at 05:18:55PM +0100, Laurent Dufour wrote:
> > It's *really* rare to page-fault on a VMA which is in the middle of
> > being replaced.  Why are you trying to optimise it?
> 
> I was not trying to optimize it, but to not wait in the page fault handler.
> This could become tricky in the case the VMA is removed once mmap(MAP_FIXED) is
> done and before the waiting page fault got woken up. This means that the
> removed VMA structure will have to remain until all the waiters are woken up
> which implies ref_count or similar.

Yes, that's why we don't want an actual rwsem.  What I had in mind was
a struct completion on the stack of the caller of munmap(), and a pointer
to it from the vma.  The page fault handler grabs the VMA tree lock, walks
the VMA tree and finds a VMA.  If the VMA is marked as locked, it waits
for the completion.  Upon wakeup *it does not look at the VMA*, instead it
restarts the page fault.
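
On the munmap side that might look roughly like this sketch
(vma_mark_locked() is hypothetical and the unlink step is hand-waved):

    static int munmap_one_large_vma(struct mm_struct *mm,
                                    struct vm_area_struct *vma,
                                    unsigned long start, unsigned long end)
    {
        DECLARE_COMPLETION_ONSTACK(done);

        down_write(&mm->mmap_sem);
        vma_mark_locked(vma, &done);     /* faults on this VMA now sleep */
        up_write(&mm->mmap_sem);

        zap_page_range(vma, start, end - start);    /* no mmap_sem held */

        down_write(&mm->mmap_sem);
        /* unlink the vma from the tree/list and free it */
        up_write(&mm->mmap_sem);

        complete_all(&done);             /* waiters restart their faults */
        return 0;
    }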

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-21 17:29       ` Matthew Wilcox
  2018-03-21 21:45         ` Yang Shi
@ 2018-03-22 17:34         ` Yang Shi
  2018-03-22 18:48           ` Matthew Wilcox
  2018-03-24 18:24           ` Jerome Glisse
  2 siblings, 1 reply; 44+ messages in thread
From: Yang Shi @ 2018-03-22 17:34 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Michal Hocko, akpm, linux-mm, linux-kernel



On 3/21/18 10:29 AM, Matthew Wilcox wrote:
> On Wed, Mar 21, 2018 at 09:31:22AM -0700, Yang Shi wrote:
>> On 3/21/18 6:08 AM, Michal Hocko wrote:
>>> Yes, this definitely sucks. One way to work that around is to split the
>>> unmap to two phases. One to drop all the pages. That would only need
>>> mmap_sem for read and then tear down the mapping with the mmap_sem for
>>> write. This wouldn't help for parallel mmap_sem writers but those really
>>> need a different approach (e.g. the range locking).
>> page fault might sneak in to map a page which has been unmapped before?
>>
>> range locking should help a lot on manipulating small sections of a large
>> mapping in parallel or multiple small mappings. It may not achieve too much
>> for single large mapping.
> I don't think we need range locking.  What if we do munmap this way:
>
> Take the mmap_sem for write
> Find the VMA
>    If the VMA is large(*)
>      Mark the VMA as deleted
>      Drop the mmap_sem
>      zap all of the entries
>      Take the mmap_sem
>    Else
>      zap all of the entries
> Continue finding VMAs
> Drop the mmap_sem
>
> Now we need to change everywhere which looks up a VMA to see if it needs
> to care the the VMA is deleted (page faults, eg will need to SIGBUS; mmap
> does not care; munmap will need to wait for the existing munmap operation

The other question is why does munmap need to wait? If the other parallel
munmap finds the vma has been marked as "deleted", it just needs to return 0,
as if it didn't find a vma.

Currently do_munmap() does the below logic:
     vma = find_vma(mm, start);
     if (!vma)
         return 0;

Yang

> to complete), but it gives us the atomicity, at least on a per-VMA basis.
>
> We could also do:
>
> Take the mmap_sem for write
> Mark all VMAs in the range as deleted & modify any partial VMAs
> Drop mmap_sem
> zap pages from deleted VMAs
>
> That would give us the same atomicity that we have today.
>
> Deleted VMAs would need a pointer to a completion, so operations that
> need to wait can queue themselves up.  I'd recommend we use the low bit
> of vm_file and treat it as a pointer to a struct completion if set.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-22 17:34         ` Yang Shi
@ 2018-03-22 18:48           ` Matthew Wilcox
  0 siblings, 0 replies; 44+ messages in thread
From: Matthew Wilcox @ 2018-03-22 18:48 UTC (permalink / raw)
  To: Yang Shi; +Cc: Michal Hocko, akpm, linux-mm, linux-kernel

On Thu, Mar 22, 2018 at 10:34:08AM -0700, Yang Shi wrote:
> On 3/21/18 10:29 AM, Matthew Wilcox wrote:
> > Take the mmap_sem for write
> > Find the VMA
> >    If the VMA is large(*)
> >      Mark the VMA as deleted
> >      Drop the mmap_sem
> >      zap all of the entries
> >      Take the mmap_sem
> >    Else
> >      zap all of the entries
> > Continue finding VMAs
> > Drop the mmap_sem
> > 
> > Now we need to change everywhere which looks up a VMA to see if it needs
> > to care the the VMA is deleted (page faults, eg will need to SIGBUS; mmap
> > does not care; munmap will need to wait for the existing munmap operation
> 
> The other question is why munmap need wait? If the other parallel munmap
> finds the vma has been marked as "deleted", it just need return 0 as it
> doesn't find vma.
> 
> Currently do_munmap() does the below logic:
>     vma = find_vma(mm, start);
>     if (!vma)
>         return 0;

At the point a munmap() returns, the area should be available for reuse.
If another thread is still unmapping, it won't actually be available yet.
We should wait for the other thread to finish before returning.

There may be some other corner cases; like what to do if there's a partial
unmap of a VMA, or a MAP_FIXED over part of an existing VMA.  It's going
to be safer to just wait for any conflicts to die down.  It's not like
real programs call munmap() on conflicting areas at the same time.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-22 16:46                       ` Yang Shi
@ 2018-03-23 13:03                           ` Laurent Dufour
  0 siblings, 0 replies; 44+ messages in thread
From: Laurent Dufour @ 2018-03-23 13:03 UTC (permalink / raw)
  To: Yang Shi, Matthew Wilcox; +Cc: Michal Hocko, akpm, linux-mm, linux-kernel

On 22/03/2018 17:46, Yang Shi wrote:
> 
> 
> On 3/22/18 9:18 AM, Laurent Dufour wrote:
>>
>> On 22/03/2018 17:05, Matthew Wilcox wrote:
>>> On Thu, Mar 22, 2018 at 04:54:52PM +0100, Laurent Dufour wrote:
>>>> On 22/03/2018 16:40, Matthew Wilcox wrote:
>>>>> On Thu, Mar 22, 2018 at 04:32:00PM +0100, Laurent Dufour wrote:
>>>>>> Regarding the page fault, why not relying on the PTE locking ?
>>>>>>
>>>>>> When munmap() will unset the PTE it will have to held the PTE lock, so this
>>>>>> will serialize the access.
>>>>>> If the page fault occurs before the mmap(MAP_FIXED), the page mapped will be
>>>>>> removed when mmap(MAP_FIXED) would do the cleanup. Fair enough.
>>>>> The page fault handler will walk the VMA tree to find the correct
>>>>> VMA and then find that the VMA is marked as deleted.  If it assumes
>>>>> that the VMA has been deleted because of munmap(), then it can raise
>>>>> SIGSEGV immediately.  But if the VMA is marked as deleted because of
>>>>> mmap(MAP_FIXED), it must wait until the new VMA is in place.
>>>> I'm wondering if such a complexity is required.
>>>> If the user space process try to access the page being overwritten through
>>>> mmap(MAP_FIXED) by another thread, there is no guarantee that it will
>>>> manipulate the *old* page or *new* one.
>>> Right; but it must return one or the other, it can't segfault.
>> Good point, I missed that...
>>
>>>> I'd think this is up to the user process to handle that concurrency.
>>>> What needs to be guaranteed is that once mmap(MAP_FIXED) returns the old page
>>>> are no more there, which is done through the mmap_sem and PTE locking.
>>> Yes, and allowing the fault handler to return the *old* page risks the
>>> old page being reinserted into the page tables after the unmapping task
>>> has done its work.
>> The PTE locking should prevent that.
>>
>>> It's *really* rare to page-fault on a VMA which is in the middle of
>>> being replaced.  Why are you trying to optimise it?
>> I was not trying to optimize it, but to not wait in the page fault handler.
>> This could become tricky in the case the VMA is removed once mmap(MAP_FIXED) is
>> done and before the waiting page fault got woken up. This means that the
>> removed VMA structure will have to remain until all the waiters are woken up
>> which implies ref_count or similar.
> 
> We may not need ref_count. After removing "locked-for-deletion" vmas when
> mmap(MAP_FIXED) is done, just wake up page fault to re-lookup vma, then it will
> find the new vma installed by mmap(MAP_FIXED), right?

I do agree, as long as waking up does not require access to the VMA.

> I'm not sure if completion can do this or not since I'm not quite familiar with
> it :-(

I don't know either :/

Laurent.

> Yang
> 
>>
>>>>> I think I was wrong to describe VMAs as being *deleted*.  I think we
>>>>> instead need the concept of a *locked* VMA that page faults will block on.
>>>>> Conceptually, it's a per-VMA rwsem, but I'd use a completion instead of
>>>>> an rwsem since the only reason to write-lock the VMA is because it is
>>>>> being deleted.
>>>> Such a lock would only makes sense in the case of mmap(MAP_FIXED) since when
>>>> the VMA is removed there is no need to wait. Isn't it ?
>>> I can't think of another reason.  I suppose we could mark the VMA as
>>> locked-for-deletion or locked-for-replacement and have the SIGSEGV happen
>>> early.  But I'm not sure that optimising for SIGSEGVs is a worthwhile
>>> use of our time.  Just always have the pagefault sleep for a deleted VMA.
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 1/8] mm: mmap: unmap large mapping by section
  2018-03-21 17:29       ` Matthew Wilcox
@ 2018-03-24 18:24           ` Jerome Glisse
  2018-03-22 17:34         ` Yang Shi
  2018-03-24 18:24           ` Jerome Glisse
  2 siblings, 0 replies; 44+ messages in thread
From: Jerome Glisse @ 2018-03-24 18:24 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Yang Shi, Michal Hocko, akpm, linux-mm, linux-kernel, Laurent Dufour

On Wed, Mar 21, 2018 at 10:29:32AM -0700, Matthew Wilcox wrote:
> On Wed, Mar 21, 2018 at 09:31:22AM -0700, Yang Shi wrote:
> > On 3/21/18 6:08 AM, Michal Hocko wrote:
> > > Yes, this definitely sucks. One way to work that around is to split the
> > > unmap to two phases. One to drop all the pages. That would only need
> > > mmap_sem for read and then tear down the mapping with the mmap_sem for
> > > write. This wouldn't help for parallel mmap_sem writers but those really
> > > need a different approach (e.g. the range locking).
> > 
> > page fault might sneak in to map a page which has been unmapped before?
> > 
> > range locking should help a lot on manipulating small sections of a large
> > mapping in parallel or multiple small mappings. It may not achieve too much
> > for single large mapping.
> 
> I don't think we need range locking.  What if we do munmap this way:
> 
> Take the mmap_sem for write
> Find the VMA
>   If the VMA is large(*)
>     Mark the VMA as deleted
>     Drop the mmap_sem
>     zap all of the entries
>     Take the mmap_sem
>   Else
>     zap all of the entries
> Continue finding VMAs
> Drop the mmap_sem
> 
> Now we need to change everywhere which looks up a VMA to see if it needs
> to care that the VMA is deleted (page faults, e.g., will need to SIGBUS; mmap
> does not care; munmap will need to wait for the existing munmap operation
> to complete), but it gives us the atomicity, at least on a per-VMA basis.
> 
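
To make the quoted proposal concrete, here is a rough sketch of its munmap
side; vm_locked_for_unmap, unmap_seq, unmap_wait and detach_and_free_vma() are
hypothetical names used only for illustration (the wake-up at the end pairs
with the page-fault-side wait sketched earlier in the thread):

    /* Called with mmap_sem held for write, for a "large" VMA. */
    static void munmap_vma_dropping_sem(struct mm_struct *mm,
                                        struct vm_area_struct *vma,
                                        unsigned long start, unsigned long end)
    {
            /*
             * 1. Mark the VMA so that every later lookup sees it as
             *    deleted/locked and waits instead of using it.
             */
            vma->vm_locked_for_unmap = true;

            /* 2. Drop the lock so other mmap_sem users are not blocked ... */
            up_write(&mm->mmap_sem);

            /* 3. ... while the expensive zap runs without it. */
            zap_page_range(vma, start, end - start);

            /* 4. Retake the lock, unlink the VMA and wake up any waiters. */
            down_write(&mm->mmap_sem);
            detach_and_free_vma(mm, vma);           /* hypothetical helper */
            mm->unmap_seq++;
            wake_up_all(&mm->unmap_wait);
    }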

What about something that should fix all issues:
    struct list_head to_free_puds;
    ...
    down_write(&mm->mmap_sem);
    ...
    /* detach fully covered PUDs onto to_free_puds instead of freeing them */
    unmap_vmas(&tlb, vma, start, end, &to_free_puds);
    arch_unmap(mm, vma, start, end);
    /* Fix up all other VM information */
    remove_vma_list(mm, vma);
    ...
    up_write(&mm->mmap_sem);
    ...
    /* free the pages and page tables of the detached PUDs, outside mmap_sem */
    zap_pud_list(rss_update_info, to_free_puds);
    update_rss(rss_update_info);

We collect the PUDs that need to be freed/zapped: we update the page table PUD
entry to pud_none under the write sem and the CPU page table lock, and add the
PUD to the list that needs zapping. We only collect fully covered PUDs;
partially covered PUDs get the usual treatment.

Everything behaves as today except that we do not free memory. Care
must be taken with the anon vma, and we should probably not free the
vma struct before the zap either, but all other mm state can be
updated. The rss counter would also be updated after zapping the PUDs.

We would need special code to zap the PUD list, but there is no need to take
locks or do special arch TLB flushing, ptep_get_and_clear(), ... when walking
down those PUDs, since nothing can reach them any more. So it should scale a
lot better too.

Did I miss something?

Cheers,
Jérôme
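
As a purely illustrative sketch of the zap_pud_list() step above (struct
to_free_pud and free_detached_pud() are invented names, not an existing kernel
API, and the RSS accounting from the pseudocode is omitted):

    /*
     * One detached, fully covered PUD subtree.  Its PUD entry was already
     * cleared to pud_none and the TLB flushed while mmap_sem was held for
     * write, so nothing can reach these page tables any more.
     */
    struct to_free_pud {
            struct list_head list;
            pud_t            pud;    /* old entry, points to the PMD page */
            unsigned long    addr;   /* start of the PUD-aligned range */
    };

    static void zap_pud_list(struct list_head *puds)
    {
            struct to_free_pud *p, *tmp;

            list_for_each_entry_safe(p, tmp, puds, list) {
                    /*
                     * No PTL, no per-PTE TLB flushing, no ptep_get_and_clear():
                     * just walk the detached subtree and release the mapped
                     * pages and the page-table pages.
                     */
                    free_detached_pud(p->pud, p->addr);   /* hypothetical */
                    list_del(&p->list);
                    kfree(p);
            }
    }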

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2018-03-24 18:24 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-20 21:31 [RFC PATCH 0/8] Drop mmap_sem during unmapping large map Yang Shi
2018-03-20 21:31 ` Yang Shi
2018-03-20 21:31 ` [RFC PATCH 1/8] mm: mmap: unmap large mapping by section Yang Shi
2018-03-21 13:08   ` Michal Hocko
2018-03-21 16:31     ` Yang Shi
2018-03-21 17:29       ` Matthew Wilcox
2018-03-21 21:45         ` Yang Shi
2018-03-21 22:15           ` Matthew Wilcox
2018-03-21 22:40             ` Yang Shi
2018-03-21 22:46           ` Matthew Wilcox
2018-03-22 15:32             ` Laurent Dufour
2018-03-22 15:40               ` Matthew Wilcox
2018-03-22 15:54                 ` Laurent Dufour
2018-03-22 16:05                   ` Matthew Wilcox
2018-03-22 16:18                     ` Laurent Dufour
2018-03-22 16:46                       ` Yang Shi
2018-03-23 13:03                         ` Laurent Dufour
2018-03-23 13:03                           ` Laurent Dufour
2018-03-22 16:51                       ` Matthew Wilcox
2018-03-22 16:49                     ` Yang Shi
2018-03-22 17:34         ` Yang Shi
2018-03-22 18:48           ` Matthew Wilcox
2018-03-24 18:24         ` Jerome Glisse
2018-03-24 18:24           ` Jerome Glisse
2018-03-21 13:14   ` Michal Hocko
2018-03-21 16:50     ` Yang Shi
2018-03-21 17:16       ` Yang Shi
2018-03-21 21:23         ` Michal Hocko
2018-03-21 22:36           ` Yang Shi
2018-03-22  9:10             ` Michal Hocko
2018-03-22 16:06               ` Yang Shi
2018-03-22 16:12                 ` Michal Hocko
2018-03-22 16:13                 ` Matthew Wilcox
2018-03-22 16:28                   ` Laurent Dufour
2018-03-22 16:36                     ` David Laight
2018-03-20 21:31 ` [RFC PATCH 2/8] mm: mmap: pass atomic parameter to do_munmap() call sites Yang Shi
2018-03-20 21:31 ` [RFC PATCH 3/8] mm: mremap: pass atomic parameter to do_munmap() Yang Shi
2018-03-20 21:31 ` [RFC PATCH 4/8] mm: nommu: add " Yang Shi
2018-03-20 21:31 ` [RFC PATCH 5/8] ipc: shm: pass " Yang Shi
2018-03-20 21:31 ` [RFC PATCH 6/8] fs: proc/vmcore: " Yang Shi
2018-03-20 21:31 ` [RFC PATCH 7/8] x86: mpx: " Yang Shi
2018-03-20 22:35   ` Thomas Gleixner
2018-03-21 16:53     ` Yang Shi
2018-03-20 21:31 ` [RFC PATCH 8/8] x86: vma: " Yang Shi
