linux-mm.kvack.org archive mirror
* [RFC PATCH 00/24] Fine grained MM locking
@ 2020-02-24 20:30 Michel Lespinasse
  2020-02-24 20:30 ` [RFC PATCH 01/24] MM locking API: initial implementation as rwsem wrappers Michel Lespinasse
                   ` (24 more replies)
  0 siblings, 25 replies; 28+ messages in thread
From: Michel Lespinasse @ 2020-02-24 20:30 UTC (permalink / raw)
  To: Peter Zijlstra, Andrew Morton, Laurent Dufour, Vlastimil Babka,
	Matthew Wilcox, Liam R . Howlett, Jerome Glisse, Davidlohr Bueso,
	David Rientjes
  Cc: linux-mm, Michel Lespinasse

Hi,

This is the first version of my work towards fine grained MM locking.
This is still early work - I am happy with my page fault changes,
but want to expand on the mmap/munmap side of things before I send the
next version. I have previously shared this with some of the copied folks
(for those who received that, there are no additional changes in this
public resend). Please expect a v2 within a few weeks, with further
changes for fine grained range locking in the mmap and munmap paths.

This work originated in discussions at LSF/MM 2019; it is intended to
address the latency issues that are caused by false conflicts between
threads working on separate parts of their address space.
The priorities are to keep things as simple as possible,
and to allow for progressive conversion of the code base to
finer grained MM locks.

The general approach is to replace the mmap_sem rwsem with a range lock.
Initially, all lock/unlock sites are automatically converted to lock the
entire address space through a new API. Then, the API is extended to
support range locking. Locking sites can then be progressively converted
to use range locking, while unconverted sites keep working with no code
changes.
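
For example, a typical read-side call site goes from the current rwsem
calls to the new coarse API (this is exactly what the coccinelle
conversion in patch 2 produces), and can later be narrowed to a range.
The range calls below are only an illustration - the exact names and
signatures are the ones introduced in patches 4-6:

	/* Today: coarse rwsem locking of the whole address space. */
	down_read(&mm->mmap_sem);
	vma = find_vma(mm, address);
	/* ... fault handling ... */
	up_read(&mm->mmap_sem);

	/* After patch 2: same coarse locking, through the new API. */
	mm_read_lock(mm);
	vma = find_vma(mm, address);
	/* ... fault handling ... */
	mm_read_unlock(mm);

	/* Eventually: lock only the range being operated on
	 * (illustrative names; see patches 4-6 for the real API). */
	mm_read_range_lock(mm, &range);
	/* ... fault handling within the locked range ... */
	mm_read_range_unlock(mm, &range);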

When using a range lock (as opposed to a coarse lock), the following
rules apply:
- Some structures (notably the vma rbtree and associated statistics)
  are per-mm. They need to be locked separately using a new mm_vma_lock.
  The entire point of this patch set is to reduce false sharing latencies,
  so the mm_vma_lock must only be held for short times. We expect to do
  O(log N) operations while holding the lock (for example, walking or
  updating the vma rbtree), but no O(N) operations (such as iterating over
  all vmas within a range or over all mapped pages within a range).
- Code holding the mm_vma_lock should only update vma attributes for the
  range it has a write lock for. However, range locks only protect the
  vma's attributes, not the vmas themselves - vmas can still be split or
  merged with their neighbors if they have compatible attributes.
- Code holding a range lock but not the mm_vma_lock must be prepared for
  the vmas at both ends of the locked range to be merged with their
  neighbors outside of the locked range. The easiest way to handle that is
  to copy the vma of record into a pseudo-vma before releasing the
  mm_vma_lock, as sketched after this list (this is a bit kludgy and I
  would prefer to copy only the necessary VMA attributes, but using a
  pseudo-vma makes it easier to maintain this patchset out of mainline
  for the moment).
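
To make these rules more concrete, a fault path is expected to look
roughly like the sketch below. Apart from mm_vma_lock itself, the helper
names here are made up for illustration and do not necessarily match the
patches:

	/* Lock only the range covering the faulting address. */
	mm_read_range_lock(mm, &range);

	/* Short, O(log N) critical section under the per-mm lock. */
	mm_vma_lock(mm);
	vma = find_vma(mm, address);	/* vma rbtree walk */
	pvma = *vma;			/* snapshot into a pseudo-vma */
	mm_vma_unlock(mm);

	/*
	 * Operate on the snapshot: the vma of record may still get
	 * merged or split, but its attributes within the locked range
	 * cannot change under us.
	 */
	handle_fault(&pvma, address);

	mm_read_range_unlock(mm, &range);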

Call sites that take a range lock usually take the mm_vma_lock
immediately afterwards - it would probably be more efficient to collapse
mm_vma_lock with the mutex that protects the range lock
structures. This isn't done yet, as I wanted to keep the initial
implementation simple.

In the future I would also like to remove the various workarounds we have
been using to limit mmap_sem hold times (i.e. FAULT_FLAG_ALLOW_RETRY,
vm_populate and munmap downgrading to a read lock, ...), which shouldn't
be necessary once the lock only covers the memory ranges affected by each
operation.


The included changes apply on top of upstream kernel v5.5.
Please apply with git am -p0 - I'm not sure why my git format-patch
setup requires that.


Commits 1 to 6 implement a range locking API:
- 1 implements coarse locking as wrappers around rwsem;
- 2 converts most mmap_sem locking sites to use the new coarse locking API
  (using coccinelle to automate the conversion);
- 3 converts remaining mmap_sem locking sites which were missed by coccinelle;
- 4 extends the API to support range locking. The initial implementation
  still uses coarse locking (ignoring the range); but it validates that the
  callers use matching ranges in lock and unlock calls;
- 5 prepares callers to allow for sleeping during unlock;
- 6 actually implements the range locking functions.

Commits 7 to 12 allow the x86 fault handler to specify a range
that may be released while handling the fault:
- 7 adds a range field to struct vm_fault (sketched after this list);
- 8 makes handle_mm_fault() populate that field;
- 9 and 10 honor it when dropping mmap_sem during fault handling;
- 11 is a cleanup to the x86 fault handler to prepare for 12;
- 12 changes the x86 fault handler to use an explicit lock range.
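
Conceptually, patches 7 and 8 just thread the held lock range through the
fault path, so that code which needs to drop the MM lock mid-fault (swap,
userfaultfd, and later filemap) releases exactly what was taken. Something
along these lines, with the exact field type being whatever patch 7
defines:

	struct vm_fault {
		struct vm_area_struct *vma;	/* target vma */
		unsigned long address;		/* faulting address */
		/* ... existing fields ... */
		struct mm_lock_range *range;	/* MM lock range held across
						 * the fault (sketch) */
	};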

Commits 13 to 15 prepare for operating on a pseudo-vma during faults:
- 13 adds a prepare_mm_fault() function which may update the vma of record
  (specifically, allocate an anon_vma) before creating the pseudo-vma;
- 14 disables swap vma readahead as its implementation keeps stats in the vma;
- 15 changes the x86 fault handler to use pseudo-vmas when handling anon vmas.

Commits 16 and 17 implement range locking in x86 anonymous vma faults:
- Commit 16 adds the vma locking API used to manipulate vmas while
  holding a fine grained range lock;
- Commit 17 converts the x86 fault handler to use a pmd sized range lock
  when operating on anon vmas (see the sketch after this list).
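
The "pmd sized" range in commit 17 simply covers the pmd-aligned region
(2MB with 4K pages on x86-64) around the faulting address, so that
concurrent faults landing in different pmds never contend. As a sketch
(the range-defining helper is hypothetical):

	unsigned long start = address & PMD_MASK;

	mm_range_init(&range, start, start + PMD_SIZE);	/* hypothetical */
	mm_read_range_lock(mm, &range);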

Commits 18 to 20 extend the above to also work on filemap based files:
- Commit 18 makes sure we release the correct range when dropping mmap_sem
  during filemap file access;
- Commit 19 tags vm_operations that support range locking (see the sketch
  after this list);
- Commit 20 makes the x86 fault handler use fine grained ranges when
  faulting on supported file mappings.
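
The vm_operations annotation in commit 19 is just a capability flag that a
filesystem opts into once its fault path is known to be safe under a range
lock; for ext4 that would look something like this (the flag name is
illustrative, not necessarily what the patch uses):

	static const struct vm_operations_struct ext4_file_vm_ops = {
		.fault		= ext4_filemap_fault,
		.map_pages	= filemap_map_pages,
		.page_mkwrite	= ext4_page_mkwrite,
		.allow_range_locking = true,	/* illustrative flag name */
	};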

Commits 21 to 24 implement range locking for the most basic mmap() case:
- 21 adds a locked argument to do_mmap();
- 22 makes do_mmap() acquire the mmap_sem itself when locked is false
  (see the sketch after this list);
- 23 converts some easy call sites to pass locked=false;
- 24 changes do_mmap() to acquire a fine grained lock in the easiest case
  (anonymous mapping, known address, no prior existing mapping).
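
The locked argument works roughly as follows (a sketch, not the literal
diff):

	unsigned long do_mmap(struct file *file, unsigned long addr,
			      unsigned long len, /* ..., */ bool locked)
	{
		if (!locked)
			mm_write_lock(current->mm);
		/* ... existing mapping work ... */
		if (!locked)
			mm_write_unlock(current->mm);
		return addr;
	}

Callers that already hold mmap_sem keep passing locked=true and see no
behavior change; converted callers such as vm_mmap_pgoff() pass
locked=false, which is what lets patch 24 substitute a fine grained lock
for the coarse write lock in the easy anonymous mapping case.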


Michel Lespinasse (24):
  MM locking API: initial implementation as rwsem wrappers
  MM locking API: use coccinelle to convert mmap_sem rwsem call sites
  MM locking API: manual conversion of mmap_sem call sites missed by
    coccinelle
  MM locking API: add range arguments
  MM locking API: allow for sleeping during unlock
  MM locking API: implement fine grained range locks
  mm/memory: add range field to struct vm_fault
  mm/memory: allow specifying MM lock range to handle_mm_fault()
  do_swap_page: use the vmf->range field when dropping mmap_sem
  handle_userfault: use the vmf->range field when dropping mmap_sem
  x86 fault handler: merge bad_area() functions
  x86 fault handler: use an explicit MM lock range
  mm/memory: add prepare_mm_fault() function
  mm/swap_state: disable swap vma readahead
  x86 fault handler: use a pseudo-vma when operating on anonymous vmas.
  MM locking API: add vma locking API
  x86 fault handler: implement range locking
  shared file mappings: use the vmf->range field when dropping mmap_sem
  mm: add field to annotate vm_operations that support range locking
  x86 fault handler: extend range locking to supported file vmas
  do_mmap: add locked argument
  do_mmap: implement locked argument
  do_mmap: use locked=false in vm_mmap_pgoff() and aio_setup_ring()
  do_mmap: implement easiest cases of fine grained locking

 arch/alpha/kernel/traps.c                     |   4 +-
 arch/alpha/mm/fault.c                         |  10 +-
 arch/arc/kernel/process.c                     |   4 +-
 arch/arc/kernel/troubleshoot.c                |   4 +-
 arch/arc/mm/fault.c                           |   4 +-
 arch/arm/kernel/process.c                     |   4 +-
 arch/arm/kernel/swp_emulate.c                 |   4 +-
 arch/arm/lib/uaccess_with_memcpy.c            |  16 +-
 arch/arm/mm/fault.c                           |   6 +-
 arch/arm64/kernel/traps.c                     |   4 +-
 arch/arm64/kernel/vdso.c                      |   8 +-
 arch/arm64/mm/fault.c                         |   8 +-
 arch/csky/kernel/vdso.c                       |   4 +-
 arch/csky/mm/fault.c                          |   8 +-
 arch/hexagon/kernel/vdso.c                    |   4 +-
 arch/hexagon/mm/vm_fault.c                    |   8 +-
 arch/ia64/kernel/perfmon.c                    |   8 +-
 arch/ia64/mm/fault.c                          |   8 +-
 arch/ia64/mm/init.c                           |  12 +-
 arch/m68k/kernel/sys_m68k.c                   |  14 +-
 arch/m68k/mm/fault.c                          |   8 +-
 arch/microblaze/mm/fault.c                    |  12 +-
 arch/mips/kernel/traps.c                      |   4 +-
 arch/mips/kernel/vdso.c                       |   4 +-
 arch/mips/mm/fault.c                          |  10 +-
 arch/nds32/kernel/vdso.c                      |   6 +-
 arch/nds32/mm/fault.c                         |  12 +-
 arch/nios2/mm/fault.c                         |  12 +-
 arch/nios2/mm/init.c                          |   4 +-
 arch/openrisc/mm/fault.c                      |  10 +-
 arch/parisc/kernel/traps.c                    |   6 +-
 arch/parisc/mm/fault.c                        |   8 +-
 arch/powerpc/kernel/vdso.c                    |   6 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c           |   4 +-
 arch/powerpc/kvm/book3s_hv.c                  |   6 +-
 arch/powerpc/kvm/book3s_hv_uvmem.c            |  12 +-
 arch/powerpc/kvm/e500_mmu_host.c              |   4 +-
 arch/powerpc/mm/book3s64/iommu_api.c          |   4 +-
 arch/powerpc/mm/book3s64/subpage_prot.c       |  12 +-
 arch/powerpc/mm/copro_fault.c                 |   4 +-
 arch/powerpc/mm/fault.c                       |  12 +-
 arch/powerpc/oprofile/cell/spu_task_sync.c    |   6 +-
 arch/powerpc/platforms/cell/spufs/file.c      |   4 +-
 arch/riscv/kernel/vdso.c                      |   4 +-
 arch/riscv/mm/fault.c                         |  10 +-
 arch/s390/kernel/vdso.c                       |   4 +-
 arch/s390/kvm/gaccess.c                       |   4 +-
 arch/s390/kvm/kvm-s390.c                      |  24 +-
 arch/s390/kvm/priv.c                          |  32 +-
 arch/s390/mm/fault.c                          |   6 +-
 arch/s390/mm/gmap.c                           |  40 +-
 arch/s390/pci/pci_mmio.c                      |   4 +-
 arch/sh/kernel/sys_sh.c                       |   6 +-
 arch/sh/kernel/vsyscall/vsyscall.c            |   4 +-
 arch/sh/mm/fault.c                            |  14 +-
 arch/sparc/mm/fault_32.c                      |  18 +-
 arch/sparc/mm/fault_64.c                      |  12 +-
 arch/sparc/vdso/vma.c                         |   4 +-
 arch/um/include/asm/mmu_context.h             |   6 +-
 arch/um/kernel/tlb.c                          |   2 +-
 arch/um/kernel/trap.c                         |   6 +-
 arch/unicore32/mm/fault.c                     |   6 +-
 arch/x86/entry/vdso/vma.c                     |  10 +-
 arch/x86/kernel/tboot.c                       |   2 +-
 arch/x86/kernel/vm86_32.c                     |   4 +-
 arch/x86/kvm/mmu/paging_tmpl.h                |   8 +-
 arch/x86/mm/debug_pagetables.c                |   8 +-
 arch/x86/mm/fault.c                           | 110 ++-
 arch/x86/mm/mpx.c                             |  15 +-
 arch/x86/um/vdso/vma.c                        |   4 +-
 arch/xtensa/mm/fault.c                        |  10 +-
 drivers/android/binder_alloc.c                |  10 +-
 drivers/firmware/efi/efi.c                    |   2 +-
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   4 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |  10 +-
 drivers/gpu/drm/amd/amdkfd/kfd_events.c       |   4 +-
 drivers/gpu/drm/i915/gem/i915_gem_mman.c      |   4 +-
 drivers/gpu/drm/i915/gem/i915_gem_userptr.c   |   8 +-
 drivers/gpu/drm/nouveau/nouveau_svm.c         |  20 +-
 drivers/gpu/drm/radeon/radeon_cs.c            |   4 +-
 drivers/gpu/drm/radeon/radeon_gem.c           |   6 +-
 drivers/gpu/drm/ttm/ttm_bo_vm.c               |   4 +-
 drivers/infiniband/core/umem.c                |   6 +-
 drivers/infiniband/core/umem_odp.c            |  10 +-
 drivers/infiniband/core/uverbs_main.c         |   4 +-
 drivers/infiniband/hw/mlx4/mr.c               |   4 +-
 drivers/infiniband/hw/qib/qib_user_pages.c    |   6 +-
 drivers/infiniband/hw/usnic/usnic_uiom.c      |   4 +-
 drivers/infiniband/sw/siw/siw_mem.c           |   4 +-
 drivers/iommu/amd_iommu_v2.c                  |   4 +-
 drivers/iommu/intel-svm.c                     |   4 +-
 drivers/media/v4l2-core/videobuf-core.c       |   4 +-
 drivers/media/v4l2-core/videobuf-dma-contig.c |   4 +-
 drivers/media/v4l2-core/videobuf-dma-sg.c     |   4 +-
 drivers/misc/cxl/cxllib.c                     |   4 +-
 drivers/misc/cxl/fault.c                      |   4 +-
 drivers/misc/sgi-gru/grufault.c               |  16 +-
 drivers/misc/sgi-gru/grufile.c                |   4 +-
 drivers/oprofile/buffer_sync.c                |  10 +-
 drivers/staging/kpc2000/kpc_dma/fileops.c     |   4 +-
 drivers/tee/optee/call.c                      |   4 +-
 drivers/vfio/vfio_iommu_type1.c               |  12 +-
 drivers/xen/gntdev.c                          |   4 +-
 drivers/xen/privcmd.c                         |  14 +-
 fs/aio.c                                      |  16 +-
 fs/coredump.c                                 |   4 +-
 fs/exec.c                                     |  16 +-
 fs/ext4/file.c                                |   1 +
 fs/io_uring.c                                 |   4 +-
 fs/proc/base.c                                |  18 +-
 fs/proc/task_mmu.c                            |  28 +-
 fs/proc/task_nommu.c                          |  18 +-
 fs/userfaultfd.c                              |  28 +-
 include/linux/hugetlb.h                       |   5 +-
 include/linux/mm.h                            |  56 +-
 include/linux/mm_lock.h                       | 285 ++++++++
 include/linux/mm_types.h                      |  22 +
 include/linux/mm_types_task.h                 |  21 +
 include/linux/mmu_notifier.h                  |   5 +-
 include/linux/pagemap.h                       |   7 +-
 include/linux/sched.h                         |   2 +
 init/init_task.c                              |   1 +
 ipc/shm.c                                     |  11 +-
 kernel/acct.c                                 |   4 +-
 kernel/bpf/stackmap.c                         |  32 +-
 kernel/events/core.c                          |   4 +-
 kernel/events/uprobes.c                       |  16 +-
 kernel/exit.c                                 |   8 +-
 kernel/fork.c                                 |  17 +-
 kernel/futex.c                                |   4 +-
 kernel/sched/fair.c                           |   4 +-
 kernel/sys.c                                  |  18 +-
 kernel/trace/trace_output.c                   |   4 +-
 mm/Kconfig                                    |  25 +
 mm/Makefile                                   |   2 +
 mm/filemap.c                                  |  10 +-
 mm/frame_vector.c                             |   4 +-
 mm/gup.c                                      |  20 +-
 mm/hugetlb.c                                  |  13 +-
 mm/init-mm.c                                  |   3 +-
 mm/internal.h                                 |   2 +-
 mm/khugepaged.c                               |  37 +-
 mm/ksm.c                                      |  34 +-
 mm/madvise.c                                  |  18 +-
 mm/memcontrol.c                               |   8 +-
 mm/memory.c                                   |  55 +-
 mm/mempolicy.c                                |  22 +-
 mm/migrate.c                                  |   8 +-
 mm/mincore.c                                  |   4 +-
 mm/mlock.c                                    |  16 +-
 mm/mm_lock_range.c                            | 691 ++++++++++++++++++
 mm/mm_lock_rwsem_checked.c                    | 134 ++++
 mm/mmap.c                                     | 170 +++--
 mm/mmu_notifier.c                             |   4 +-
 mm/mprotect.c                                 |  12 +-
 mm/mremap.c                                   |   6 +-
 mm/msync.c                                    |   8 +-
 mm/nommu.c                                    |  36 +-
 mm/oom_kill.c                                 |   4 +-
 mm/process_vm_access.c                        |   4 +-
 mm/shmem.c                                    |   1 +
 mm/swap_state.c                               |   6 +
 mm/swapfile.c                                 |   4 +-
 mm/userfaultfd.c                              |  14 +-
 mm/util.c                                     |  14 +-
 net/ipv4/tcp.c                                |   4 +-
 net/xdp/xdp_umem.c                            |   4 +-
 virt/kvm/arm/mmu.c                            |  14 +-
 virt/kvm/async_pf.c                           |   4 +-
 virt/kvm/kvm_main.c                           |   8 +-
 170 files changed, 2183 insertions(+), 798 deletions(-)
 create mode 100644 include/linux/mm_lock.h
 create mode 100644 mm/mm_lock_range.c
 create mode 100644 mm/mm_lock_rwsem_checked.c

-- 
2.25.0.341.g760bfbb309-goog




* [RFC PATCH 01/24] MM locking API: initial implementation as rwsem wrappers
  2020-02-24 20:30 [RFC PATCH 00/24] Fine grained MM locking Michel Lespinasse
@ 2020-02-24 20:30 ` Michel Lespinasse
  2020-02-24 20:30 ` [RFC PATCH 02/24] MM locking API: use coccinelle to convert mmap_sem rwsem call sites Michel Lespinasse
                   ` (23 subsequent siblings)
  24 siblings, 0 replies; 28+ messages in thread
From: Michel Lespinasse @ 2020-02-24 20:30 UTC (permalink / raw)
  To: Peter Zijlstra, Andrew Morton, Laurent Dufour, Vlastimil Babka,
	Matthew Wilcox, Liam R . Howlett, Jerome Glisse, Davidlohr Bueso,
	David Rientjes
  Cc: linux-mm, Michel Lespinasse

This change wraps the existing mmap_sem related rwsem calls into a new
MM locking API. This is in preparation for extending that API to support
locking fine grained memory ranges.

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 include/linux/mm.h      |  1 +
 include/linux/mm_lock.h | 59 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 60 insertions(+)
 create mode 100644 include/linux/mm_lock.h

diff --git include/linux/mm.h include/linux/mm.h
index cfaa8feecfe8..052f423d7f67 100644
--- include/linux/mm.h
+++ include/linux/mm.h
@@ -15,6 +15,7 @@
 #include <linux/atomic.h>
 #include <linux/debug_locks.h>
 #include <linux/mm_types.h>
+#include <linux/mm_lock.h>
 #include <linux/range.h>
 #include <linux/pfn.h>
 #include <linux/percpu-refcount.h>
diff --git include/linux/mm_lock.h include/linux/mm_lock.h
new file mode 100644
index 000000000000..b5f134285e53
--- /dev/null
+++ include/linux/mm_lock.h
@@ -0,0 +1,59 @@
+#ifndef _LINUX_MM_LOCK_H
+#define _LINUX_MM_LOCK_H
+
+static inline void mm_init_lock(struct mm_struct *mm)
+{
+	init_rwsem(&mm->mmap_sem);
+}
+
+static inline void mm_write_lock(struct mm_struct *mm)
+{
+	down_write(&mm->mmap_sem);
+}
+
+static inline int mm_write_lock_killable(struct mm_struct *mm)
+{
+	return down_write_killable(&mm->mmap_sem);
+}
+
+static inline bool mm_write_trylock(struct mm_struct *mm)
+{
+	return down_write_trylock(&mm->mmap_sem) != 0;
+}
+
+static inline void mm_write_unlock(struct mm_struct *mm)
+{
+	up_write(&mm->mmap_sem);
+}
+
+static inline void mm_downgrade_write_lock(struct mm_struct *mm)
+{
+	downgrade_write(&mm->mmap_sem);
+}
+
+static inline void mm_read_lock(struct mm_struct *mm)
+{
+	down_read(&mm->mmap_sem);
+}
+
+static inline int mm_read_lock_killable(struct mm_struct *mm)
+{
+	return down_read_killable(&mm->mmap_sem);
+}
+
+static inline bool mm_read_trylock(struct mm_struct *mm)
+{
+	return down_read_trylock(&mm->mmap_sem) != 0;
+}
+
+static inline void mm_read_unlock(struct mm_struct *mm)
+{
+	up_read(&mm->mmap_sem);
+}
+
+static inline bool mm_is_locked(struct mm_struct *mm)
+{
+	return rwsem_is_locked(&mm->mmap_sem) != 0;
+}
+
+#endif /* _LINUX_MM_LOCK_H */
-- 
2.25.0.341.g760bfbb309-goog




* [RFC PATCH 02/24] MM locking API: use coccinelle to convert mmap_sem rwsem call sites
  2020-02-24 20:30 [RFC PATCH 00/24] Fine grained MM locking Michel Lespinasse
  2020-02-24 20:30 ` [RFC PATCH 01/24] MM locking API: initial implementation as rwsem wrappers Michel Lespinasse
@ 2020-02-24 20:30 ` Michel Lespinasse
  2020-02-24 20:30 ` [RFC PATCH 03/24] MM locking API: manual conversion of mmap_sem call sites missed by coccinelle Michel Lespinasse
                   ` (22 subsequent siblings)
  24 siblings, 0 replies; 28+ messages in thread
From: Michel Lespinasse @ 2020-02-24 20:30 UTC (permalink / raw)
  To: Peter Zijlstra, Andrew Morton, Laurent Dufour, Vlastimil Babka,
	Matthew Wilcox, Liam R . Howlett, Jerome Glisse, Davidlohr Bueso,
	David Rientjes
  Cc: linux-mm, Michel Lespinasse

This change converts the existing mmap_sem rwsem calls to use the new
MM locking API instead.

The change is generated using coccinelle with the following rules:

// spatch --sp-file foo.cocci --in-place --include-headers --dir .

@@
expression mm;
@@
-init_rwsem(&mm->mmap_sem)
+mm_init_lock(mm)

@@
expression mm;
@@
-down_write(&mm->mmap_sem)
+mm_write_lock(mm)

@@
expression mm;
@@
-down_write_killable(&mm->mmap_sem)
+mm_write_lock_killable(mm)

@@
expression mm;
@@
-down_write_trylock(&mm->mmap_sem)
+mm_write_trylock(mm)

@@
expression mm;
@@
-up_write(&mm->mmap_sem)
+mm_write_unlock(mm)

@@
expression mm;
@@
-downgrade_write(&mm->mmap_sem)
+mm_downgrade_write_lock(mm)

@@
expression mm;
@@
-down_read(&mm->mmap_sem)
+mm_read_lock(mm)

@@
expression mm;
@@
-down_read_killable(&mm->mmap_sem)
+mm_read_lock_killable(mm)

@@
expression mm;
@@
-down_read_trylock(&mm->mmap_sem)
+mm_read_trylock(mm)

@@
expression mm;
@@
-up_read(&mm->mmap_sem)
+mm_read_unlock(mm)

@@
expression mm;
@@
-rwsem_is_locked(&mm->mmap_sem)
+mm_is_locked(mm)

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 arch/alpha/kernel/traps.c                     |  4 +-
 arch/alpha/mm/fault.c                         | 10 ++---
 arch/arc/kernel/process.c                     |  4 +-
 arch/arc/kernel/troubleshoot.c                |  4 +-
 arch/arc/mm/fault.c                           |  4 +-
 arch/arm/kernel/process.c                     |  4 +-
 arch/arm/kernel/swp_emulate.c                 |  4 +-
 arch/arm/lib/uaccess_with_memcpy.c            | 16 ++++----
 arch/arm/mm/fault.c                           |  6 +--
 arch/arm64/kernel/traps.c                     |  4 +-
 arch/arm64/kernel/vdso.c                      |  8 ++--
 arch/arm64/mm/fault.c                         |  8 ++--
 arch/csky/kernel/vdso.c                       |  4 +-
 arch/csky/mm/fault.c                          |  8 ++--
 arch/hexagon/kernel/vdso.c                    |  4 +-
 arch/hexagon/mm/vm_fault.c                    |  8 ++--
 arch/ia64/kernel/perfmon.c                    |  8 ++--
 arch/ia64/mm/fault.c                          |  8 ++--
 arch/ia64/mm/init.c                           | 12 +++---
 arch/m68k/kernel/sys_m68k.c                   | 14 +++----
 arch/m68k/mm/fault.c                          |  8 ++--
 arch/microblaze/mm/fault.c                    | 12 +++---
 arch/mips/kernel/traps.c                      |  4 +-
 arch/mips/kernel/vdso.c                       |  4 +-
 arch/nds32/kernel/vdso.c                      |  6 +--
 arch/nds32/mm/fault.c                         | 12 +++---
 arch/nios2/mm/fault.c                         | 12 +++---
 arch/nios2/mm/init.c                          |  4 +-
 arch/openrisc/mm/fault.c                      | 10 ++---
 arch/parisc/kernel/traps.c                    |  6 +--
 arch/parisc/mm/fault.c                        |  8 ++--
 arch/powerpc/kernel/vdso.c                    |  6 +--
 arch/powerpc/kvm/book3s_64_mmu_hv.c           |  4 +-
 arch/powerpc/kvm/book3s_hv.c                  |  6 +--
 arch/powerpc/kvm/book3s_hv_uvmem.c            | 12 +++---
 arch/powerpc/kvm/e500_mmu_host.c              |  4 +-
 arch/powerpc/mm/book3s64/iommu_api.c          |  4 +-
 arch/powerpc/mm/book3s64/subpage_prot.c       | 12 +++---
 arch/powerpc/mm/copro_fault.c                 |  4 +-
 arch/powerpc/mm/fault.c                       | 12 +++---
 arch/powerpc/oprofile/cell/spu_task_sync.c    |  6 +--
 arch/powerpc/platforms/cell/spufs/file.c      |  4 +-
 arch/riscv/kernel/vdso.c                      |  4 +-
 arch/riscv/mm/fault.c                         | 10 ++---
 arch/s390/kernel/vdso.c                       |  4 +-
 arch/s390/kvm/gaccess.c                       |  4 +-
 arch/s390/kvm/kvm-s390.c                      | 24 +++++------
 arch/s390/kvm/priv.c                          | 32 +++++++--------
 arch/s390/mm/fault.c                          |  6 +--
 arch/s390/mm/gmap.c                           | 40 +++++++++----------
 arch/s390/pci/pci_mmio.c                      |  4 +-
 arch/sh/kernel/sys_sh.c                       |  6 +--
 arch/sh/kernel/vsyscall/vsyscall.c            |  4 +-
 arch/sh/mm/fault.c                            | 14 +++----
 arch/sparc/mm/fault_32.c                      | 18 ++++-----
 arch/sparc/mm/fault_64.c                      | 12 +++---
 arch/sparc/vdso/vma.c                         |  4 +-
 arch/um/include/asm/mmu_context.h             |  2 +-
 arch/um/kernel/tlb.c                          |  2 +-
 arch/um/kernel/trap.c                         |  6 +--
 arch/unicore32/mm/fault.c                     |  6 +--
 arch/x86/entry/vdso/vma.c                     | 10 ++---
 arch/x86/kernel/vm86_32.c                     |  4 +-
 arch/x86/mm/debug_pagetables.c                |  8 ++--
 arch/x86/mm/fault.c                           |  8 ++--
 arch/x86/mm/mpx.c                             | 12 +++---
 arch/x86/um/vdso/vma.c                        |  4 +-
 arch/xtensa/mm/fault.c                        | 10 ++---
 drivers/android/binder_alloc.c                |  6 +--
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |  4 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       | 10 ++---
 drivers/gpu/drm/amd/amdkfd/kfd_events.c       |  4 +-
 drivers/gpu/drm/i915/gem/i915_gem_mman.c      |  4 +-
 drivers/gpu/drm/i915/gem/i915_gem_userptr.c   |  8 ++--
 drivers/gpu/drm/nouveau/nouveau_svm.c         | 20 +++++-----
 drivers/gpu/drm/radeon/radeon_cs.c            |  4 +-
 drivers/gpu/drm/radeon/radeon_gem.c           |  6 +--
 drivers/gpu/drm/ttm/ttm_bo_vm.c               |  4 +-
 drivers/infiniband/core/umem.c                |  6 +--
 drivers/infiniband/core/umem_odp.c            | 10 ++---
 drivers/infiniband/core/uverbs_main.c         |  4 +-
 drivers/infiniband/hw/mlx4/mr.c               |  4 +-
 drivers/infiniband/hw/qib/qib_user_pages.c    |  6 +--
 drivers/infiniband/hw/usnic/usnic_uiom.c      |  4 +-
 drivers/infiniband/sw/siw/siw_mem.c           |  4 +-
 drivers/iommu/amd_iommu_v2.c                  |  4 +-
 drivers/iommu/intel-svm.c                     |  4 +-
 drivers/media/v4l2-core/videobuf-core.c       |  4 +-
 drivers/media/v4l2-core/videobuf-dma-contig.c |  4 +-
 drivers/media/v4l2-core/videobuf-dma-sg.c     |  4 +-
 drivers/misc/cxl/cxllib.c                     |  4 +-
 drivers/misc/cxl/fault.c                      |  4 +-
 drivers/misc/sgi-gru/grufault.c               | 16 ++++----
 drivers/misc/sgi-gru/grufile.c                |  4 +-
 drivers/oprofile/buffer_sync.c                | 10 ++---
 drivers/staging/kpc2000/kpc_dma/fileops.c     |  4 +-
 drivers/tee/optee/call.c                      |  4 +-
 drivers/vfio/vfio_iommu_type1.c               | 12 +++---
 drivers/xen/gntdev.c                          |  4 +-
 drivers/xen/privcmd.c                         | 14 +++----
 fs/aio.c                                      |  4 +-
 fs/coredump.c                                 |  4 +-
 fs/exec.c                                     | 16 ++++----
 fs/io_uring.c                                 |  4 +-
 fs/proc/base.c                                | 12 +++---
 fs/proc/task_mmu.c                            | 28 ++++++-------
 fs/proc/task_nommu.c                          | 18 ++++-----
 fs/userfaultfd.c                              | 28 ++++++-------
 include/linux/mmu_notifier.h                  |  5 ++-
 ipc/shm.c                                     |  8 ++--
 kernel/acct.c                                 |  4 +-
 kernel/bpf/stackmap.c                         |  4 +-
 kernel/events/core.c                          |  4 +-
 kernel/events/uprobes.c                       | 16 ++++----
 kernel/exit.c                                 |  8 ++--
 kernel/fork.c                                 | 12 +++---
 kernel/futex.c                                |  4 +-
 kernel/sched/fair.c                           |  4 +-
 kernel/sys.c                                  | 18 ++++-----
 kernel/trace/trace_output.c                   |  4 +-
 mm/filemap.c                                  |  6 +--
 mm/frame_vector.c                             |  4 +-
 mm/gup.c                                      | 20 +++++-----
 mm/internal.h                                 |  2 +-
 mm/khugepaged.c                               | 36 ++++++++---------
 mm/ksm.c                                      | 34 ++++++++--------
 mm/madvise.c                                  | 18 ++++-----
 mm/memcontrol.c                               |  8 ++--
 mm/memory.c                                   | 12 +++---
 mm/mempolicy.c                                | 22 +++++-----
 mm/migrate.c                                  |  8 ++--
 mm/mincore.c                                  |  4 +-
 mm/mlock.c                                    | 16 ++++----
 mm/mmap.c                                     | 32 +++++++--------
 mm/mmu_notifier.c                             |  4 +-
 mm/mprotect.c                                 | 12 +++---
 mm/mremap.c                                   |  6 +--
 mm/msync.c                                    |  8 ++--
 mm/nommu.c                                    | 16 ++++----
 mm/oom_kill.c                                 |  4 +-
 mm/process_vm_access.c                        |  4 +-
 mm/swapfile.c                                 |  4 +-
 mm/userfaultfd.c                              | 14 +++----
 mm/util.c                                     |  8 ++--
 net/ipv4/tcp.c                                |  4 +-
 net/xdp/xdp_umem.c                            |  4 +-
 virt/kvm/arm/mmu.c                            | 14 +++----
 virt/kvm/async_pf.c                           |  4 +-
 virt/kvm/kvm_main.c                           |  8 ++--
 149 files changed, 653 insertions(+), 652 deletions(-)

diff --git arch/alpha/kernel/traps.c arch/alpha/kernel/traps.c
index f6b9664ac504..5f650ffe8db6 100644
--- arch/alpha/kernel/traps.c
+++ arch/alpha/kernel/traps.c
@@ -957,12 +957,12 @@ do_entUnaUser(void __user * va, unsigned long opcode,
 		si_code = SEGV_ACCERR;
 	else {
 		struct mm_struct *mm = current->mm;
-		down_read(&mm->mmap_sem);
+		mm_read_lock(mm);
 		if (find_vma(mm, (unsigned long)va))
 			si_code = SEGV_ACCERR;
 		else
 			si_code = SEGV_MAPERR;
-		up_read(&mm->mmap_sem);
+		mm_read_unlock(mm);
 	}
 	send_sig_fault(SIGSEGV, si_code, va, 0, current);
 	return;
diff --git arch/alpha/mm/fault.c arch/alpha/mm/fault.c
index 741e61ef9d3f..074a3ed78f4c 100644
--- arch/alpha/mm/fault.c
+++ arch/alpha/mm/fault.c
@@ -117,7 +117,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
 	if (user_mode(regs))
 		flags |= FAULT_FLAG_USER;
 retry:
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 	vma = find_vma(mm, address);
 	if (!vma)
 		goto bad_area;
@@ -180,14 +180,14 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
 		}
 	}
 
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 	return;
 
 	/* Something tried to access memory that isn't in our memory map.
 	   Fix it, but check if it's kernel or user first.  */
  bad_area:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 	if (user_mode(regs))
 		goto do_sigsegv;
@@ -211,14 +211,14 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
 	/* We ran out of memory, or some other thing happened to us that
 	   made us unable to handle the page fault gracefully.  */
  out_of_memory:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	if (!user_mode(regs))
 		goto no_context;
 	pagefault_out_of_memory();
 	return;
 
  do_sigbus:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	/* Send a sigbus, regardless of whether we were in kernel
 	   or user mode.  */
 	force_sig_fault(SIGBUS, BUS_ADRERR, (void __user *) address, 0);
diff --git arch/arc/kernel/process.c arch/arc/kernel/process.c
index e1889ce3faf9..adcacc94500d 100644
--- arch/arc/kernel/process.c
+++ arch/arc/kernel/process.c
@@ -88,10 +88,10 @@ SYSCALL_DEFINE3(arc_usr_cmpxchg, int *, uaddr, int, expected, int, new)
 	if (unlikely(ret != -EFAULT))
 		 goto fail;
 
-	down_read(&current->mm->mmap_sem);
+	mm_read_lock(current->mm);
 	ret = fixup_user_fault(current, current->mm, (unsigned long) uaddr,
 			       FAULT_FLAG_WRITE, NULL);
-	up_read(&current->mm->mmap_sem);
+	mm_read_unlock(current->mm);
 
 	if (likely(!ret))
 		 goto again;
diff --git arch/arc/kernel/troubleshoot.c arch/arc/kernel/troubleshoot.c
index b79886a6cec8..e7d2d5fb0bb2 100644
--- arch/arc/kernel/troubleshoot.c
+++ arch/arc/kernel/troubleshoot.c
@@ -89,7 +89,7 @@ static void show_faulting_vma(unsigned long address)
 	/* can't use print_vma_addr() yet as it doesn't check for
 	 * non-inclusive vma
 	 */
-	down_read(&active_mm->mmap_sem);
+	mm_read_lock(active_mm);
 	vma = find_vma(active_mm, address);
 
 	/* check against the find_vma( ) behaviour which returns the next VMA
@@ -112,7 +112,7 @@ static void show_faulting_vma(unsigned long address)
 	} else
 		pr_info("    @No matching VMA found\n");
 
-	up_read(&active_mm->mmap_sem);
+	mm_read_unlock(active_mm);
 }
 
 static void show_ecr_verbose(struct pt_regs *regs)
diff --git arch/arc/mm/fault.c arch/arc/mm/fault.c
index fb86bc3e9b35..c6a1a39eb92d 100644
--- arch/arc/mm/fault.c
+++ arch/arc/mm/fault.c
@@ -107,7 +107,7 @@ void do_page_fault(unsigned long address, struct pt_regs *regs)
 		flags |= FAULT_FLAG_WRITE;
 
 retry:
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 
 	vma = find_vma(mm, address);
 	if (!vma)
@@ -159,7 +159,7 @@ void do_page_fault(unsigned long address, struct pt_regs *regs)
 	}
 
 bad_area:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 	/*
 	 * Major/minor page fault accounting
diff --git arch/arm/kernel/process.c arch/arm/kernel/process.c
index 46e478fb5ea2..e50ef99fbf26 100644
--- arch/arm/kernel/process.c
+++ arch/arm/kernel/process.c
@@ -431,7 +431,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	npages = 1; /* for sigpage */
 	npages += vdso_total_pages;
 
-	if (down_write_killable(&mm->mmap_sem))
+	if (mm_write_lock_killable(mm))
 		return -EINTR;
 	hint = sigpage_addr(mm, npages);
 	addr = get_unmapped_area(NULL, hint, npages << PAGE_SHIFT, 0, 0);
@@ -458,7 +458,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	arm_install_vdso(mm, addr + PAGE_SIZE);
 
  up_fail:
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 	return ret;
 }
 #endif
diff --git arch/arm/kernel/swp_emulate.c arch/arm/kernel/swp_emulate.c
index e640871328c1..937ea5587add 100644
--- arch/arm/kernel/swp_emulate.c
+++ arch/arm/kernel/swp_emulate.c
@@ -97,12 +97,12 @@ static void set_segfault(struct pt_regs *regs, unsigned long addr)
 {
 	int si_code;
 
-	down_read(&current->mm->mmap_sem);
+	mm_read_lock(current->mm);
 	if (find_vma(current->mm, addr) == NULL)
 		si_code = SEGV_MAPERR;
 	else
 		si_code = SEGV_ACCERR;
-	up_read(&current->mm->mmap_sem);
+	mm_read_unlock(current->mm);
 
 	pr_debug("SWP{B} emulation: access caused memory abort!\n");
 	arm_notify_die("Illegal memory access", regs,
diff --git arch/arm/lib/uaccess_with_memcpy.c arch/arm/lib/uaccess_with_memcpy.c
index c9450982a155..c215fb5fa7ea 100644
--- arch/arm/lib/uaccess_with_memcpy.c
+++ arch/arm/lib/uaccess_with_memcpy.c
@@ -96,7 +96,7 @@ __copy_to_user_memcpy(void __user *to, const void *from, unsigned long n)
 	atomic = faulthandler_disabled();
 
 	if (!atomic)
-		down_read(&current->mm->mmap_sem);
+		mm_read_lock(current->mm);
 	while (n) {
 		pte_t *pte;
 		spinlock_t *ptl;
@@ -104,11 +104,11 @@ __copy_to_user_memcpy(void __user *to, const void *from, unsigned long n)
 
 		while (!pin_page_for_write(to, &pte, &ptl)) {
 			if (!atomic)
-				up_read(&current->mm->mmap_sem);
+				mm_read_unlock(current->mm);
 			if (__put_user(0, (char __user *)to))
 				goto out;
 			if (!atomic)
-				down_read(&current->mm->mmap_sem);
+				mm_read_lock(current->mm);
 		}
 
 		tocopy = (~(unsigned long)to & ~PAGE_MASK) + 1;
@@ -128,7 +128,7 @@ __copy_to_user_memcpy(void __user *to, const void *from, unsigned long n)
 			spin_unlock(ptl);
 	}
 	if (!atomic)
-		up_read(&current->mm->mmap_sem);
+		mm_read_unlock(current->mm);
 
 out:
 	return n;
@@ -165,17 +165,17 @@ __clear_user_memset(void __user *addr, unsigned long n)
 		return 0;
 	}
 
-	down_read(&current->mm->mmap_sem);
+	mm_read_lock(current->mm);
 	while (n) {
 		pte_t *pte;
 		spinlock_t *ptl;
 		int tocopy;
 
 		while (!pin_page_for_write(addr, &pte, &ptl)) {
-			up_read(&current->mm->mmap_sem);
+			mm_read_unlock(current->mm);
 			if (__put_user(0, (char __user *)addr))
 				goto out;
-			down_read(&current->mm->mmap_sem);
+			mm_read_lock(current->mm);
 		}
 
 		tocopy = (~(unsigned long)addr & ~PAGE_MASK) + 1;
@@ -193,7 +193,7 @@ __clear_user_memset(void __user *addr, unsigned long n)
 		else
 			spin_unlock(ptl);
 	}
-	up_read(&current->mm->mmap_sem);
+	mm_read_unlock(current->mm);
 
 out:
 	return n;
diff --git arch/arm/mm/fault.c arch/arm/mm/fault.c
index bd0f4821f7e1..a0c0d80c3180 100644
--- arch/arm/mm/fault.c
+++ arch/arm/mm/fault.c
@@ -270,11 +270,11 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 	 * validly references user space from well defined areas of the code,
 	 * we can bug out early if this is from code which shouldn't.
 	 */
-	if (!down_read_trylock(&mm->mmap_sem)) {
+	if (!mm_read_trylock(mm)) {
 		if (!user_mode(regs) && !search_exception_tables(regs->ARM_pc))
 			goto no_context;
 retry:
-		down_read(&mm->mmap_sem);
+		mm_read_lock(mm);
 	} else {
 		/*
 		 * The above down_read_trylock() might have succeeded in
@@ -327,7 +327,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 		}
 	}
 
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 	/*
 	 * Handle the "normal" case first - VM_FAULT_MAJOR
diff --git arch/arm64/kernel/traps.c arch/arm64/kernel/traps.c
index 73caf35c2262..5aaa302dd6dd 100644
--- arch/arm64/kernel/traps.c
+++ arch/arm64/kernel/traps.c
@@ -384,12 +384,12 @@ void arm64_notify_segfault(unsigned long addr)
 {
 	int code;
 
-	down_read(&current->mm->mmap_sem);
+	mm_read_lock(current->mm);
 	if (find_vma(current->mm, addr) == NULL)
 		code = SEGV_MAPERR;
 	else
 		code = SEGV_ACCERR;
-	up_read(&current->mm->mmap_sem);
+	mm_read_unlock(current->mm);
 
 	force_signal_inject(SIGSEGV, code, addr);
 }
diff --git arch/arm64/kernel/vdso.c arch/arm64/kernel/vdso.c
index 354b11e27c07..2fe5523ef7b8 100644
--- arch/arm64/kernel/vdso.c
+++ arch/arm64/kernel/vdso.c
@@ -357,7 +357,7 @@ int aarch32_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	struct mm_struct *mm = current->mm;
 	int ret;
 
-	if (down_write_killable(&mm->mmap_sem))
+	if (mm_write_lock_killable(mm))
 		return -EINTR;
 
 	ret = aarch32_kuser_helpers_setup(mm);
@@ -374,7 +374,7 @@ int aarch32_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 #endif /* CONFIG_COMPAT_VDSO */
 
 out:
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 	return ret;
 }
 #endif /* CONFIG_COMPAT */
@@ -418,7 +418,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm,
 	struct mm_struct *mm = current->mm;
 	int ret;
 
-	if (down_write_killable(&mm->mmap_sem))
+	if (mm_write_lock_killable(mm))
 		return -EINTR;
 
 	ret = __setup_additional_pages(ARM64_VDSO,
@@ -426,7 +426,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm,
 				       bprm,
 				       uses_interp);
 
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 
 	return ret;
 }
diff --git arch/arm64/mm/fault.c arch/arm64/mm/fault.c
index 85566d32958f..e1afd506340b 100644
--- arch/arm64/mm/fault.c
+++ arch/arm64/mm/fault.c
@@ -491,11 +491,11 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
 	 * validly references user space from well defined areas of the code,
 	 * we can bug out early if this is from code which shouldn't.
 	 */
-	if (!down_read_trylock(&mm->mmap_sem)) {
+	if (!mm_read_trylock(mm)) {
 		if (!user_mode(regs) && !search_exception_tables(regs->pc))
 			goto no_context;
 retry:
-		down_read(&mm->mmap_sem);
+		mm_read_lock(mm);
 	} else {
 		/*
 		 * The above down_read_trylock() might have succeeded in which
@@ -504,7 +504,7 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
 		might_sleep();
 #ifdef CONFIG_DEBUG_VM
 		if (!user_mode(regs) && !search_exception_tables(regs->pc)) {
-			up_read(&mm->mmap_sem);
+			mm_read_unlock(mm);
 			goto no_context;
 		}
 #endif
@@ -536,7 +536,7 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
 			goto retry;
 		}
 	}
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 	/*
 	 * Handle the "normal" (no error) case first.
diff --git arch/csky/kernel/vdso.c arch/csky/kernel/vdso.c
index 60ff7adfad1d..f4f3831a0974 100644
--- arch/csky/kernel/vdso.c
+++ arch/csky/kernel/vdso.c
@@ -50,7 +50,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	unsigned long addr;
 	struct mm_struct *mm = current->mm;
 
-	down_write(&mm->mmap_sem);
+	mm_write_lock(mm);
 
 	addr = get_unmapped_area(NULL, STACK_TOP, PAGE_SIZE, 0, 0);
 	if (IS_ERR_VALUE(addr)) {
@@ -70,7 +70,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	mm->context.vdso = (void *)addr;
 
 up_fail:
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 	return ret;
 }
 
diff --git arch/csky/mm/fault.c arch/csky/mm/fault.c
index f76618b630f9..d86f90b95f70 100644
--- arch/csky/mm/fault.c
+++ arch/csky/mm/fault.c
@@ -116,7 +116,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write,
 	if (in_atomic() || !mm)
 		goto bad_area_nosemaphore;
 
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 	vma = find_vma(mm, address);
 	if (!vma)
 		goto bad_area;
@@ -166,7 +166,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write,
 			      address);
 	}
 
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	return;
 
 	/*
@@ -174,7 +174,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write,
 	 * Fix it, but check if it's kernel or user first..
 	 */
 bad_area:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 bad_area_nosemaphore:
 	/* User mode accesses just cause a SIGSEGV */
@@ -206,7 +206,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write,
 	return;
 
 do_sigbus:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 	/* Kernel mode? Handle exceptions or die */
 	if (!user_mode(regs))
diff --git arch/hexagon/kernel/vdso.c arch/hexagon/kernel/vdso.c
index 25a1d9cfd4cc..55e0d55f8152 100644
--- arch/hexagon/kernel/vdso.c
+++ arch/hexagon/kernel/vdso.c
@@ -52,7 +52,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	unsigned long vdso_base;
 	struct mm_struct *mm = current->mm;
 
-	if (down_write_killable(&mm->mmap_sem))
+	if (mm_write_lock_killable(mm))
 		return -EINTR;
 
 	/* Try to get it loaded right near ld.so/glibc. */
@@ -76,7 +76,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	mm->context.vdso = (void *)vdso_base;
 
 up_fail:
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 	return ret;
 }
 
diff --git arch/hexagon/mm/vm_fault.c arch/hexagon/mm/vm_fault.c
index b3bc71680ae4..14040908faee 100644
--- arch/hexagon/mm/vm_fault.c
+++ arch/hexagon/mm/vm_fault.c
@@ -55,7 +55,7 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs)
 	if (user_mode(regs))
 		flags |= FAULT_FLAG_USER;
 retry:
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 	vma = find_vma(mm, address);
 	if (!vma)
 		goto bad_area;
@@ -108,11 +108,11 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs)
 			}
 		}
 
-		up_read(&mm->mmap_sem);
+		mm_read_unlock(mm);
 		return;
 	}
 
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 	/* Handle copyin/out exception cases */
 	if (!user_mode(regs))
@@ -139,7 +139,7 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs)
 	return;
 
 bad_area:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 	if (user_mode(regs)) {
 		force_sig_fault(SIGSEGV, si_code, (void __user *)address);
diff --git arch/ia64/kernel/perfmon.c arch/ia64/kernel/perfmon.c
index a23c3938a1c4..a1da9e6435d8 100644
--- arch/ia64/kernel/perfmon.c
+++ arch/ia64/kernel/perfmon.c
@@ -2258,13 +2258,13 @@ pfm_smpl_buffer_alloc(struct task_struct *task, struct file *filp, pfm_context_t
 	 * now we atomically find some area in the address space and
 	 * remap the buffer in it.
 	 */
-	down_write(&task->mm->mmap_sem);
+	mm_write_lock(task->mm);
 
 	/* find some free area in address space, must have mmap sem held */
 	vma->vm_start = get_unmapped_area(NULL, 0, size, 0, MAP_PRIVATE|MAP_ANONYMOUS);
 	if (IS_ERR_VALUE(vma->vm_start)) {
 		DPRINT(("Cannot find unmapped area for size %ld\n", size));
-		up_write(&task->mm->mmap_sem);
+		mm_write_unlock(task->mm);
 		goto error;
 	}
 	vma->vm_end = vma->vm_start + size;
@@ -2275,7 +2275,7 @@ pfm_smpl_buffer_alloc(struct task_struct *task, struct file *filp, pfm_context_t
 	/* can only be applied to current task, need to have the mm semaphore held when called */
 	if (pfm_remap_buffer(vma, (unsigned long)smpl_buf, vma->vm_start, size)) {
 		DPRINT(("Can't remap buffer\n"));
-		up_write(&task->mm->mmap_sem);
+		mm_write_unlock(task->mm);
 		goto error;
 	}
 
@@ -2286,7 +2286,7 @@ pfm_smpl_buffer_alloc(struct task_struct *task, struct file *filp, pfm_context_t
 	insert_vm_struct(mm, vma);
 
 	vm_stat_account(vma->vm_mm, vma->vm_flags, vma_pages(vma));
-	up_write(&task->mm->mmap_sem);
+	mm_write_unlock(task->mm);
 
 	/*
 	 * keep track of user level virtual address
diff --git arch/ia64/mm/fault.c arch/ia64/mm/fault.c
index c2f299fe9e04..595487df86a8 100644
--- arch/ia64/mm/fault.c
+++ arch/ia64/mm/fault.c
@@ -102,7 +102,7 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re
 	if (mask & VM_WRITE)
 		flags |= FAULT_FLAG_WRITE;
 retry:
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 
 	vma = find_vma_prev(mm, address, &prev_vma);
 	if (!vma && !prev_vma )
@@ -179,7 +179,7 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re
 		}
 	}
 
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	return;
 
   check_expansion:
@@ -210,7 +210,7 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re
 	goto good_area;
 
   bad_area:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 #ifdef CONFIG_VIRTUAL_MEM_MAP
   bad_area_no_up:
 #endif
@@ -276,7 +276,7 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re
 	return;
 
   out_of_memory:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	if (!user_mode(regs))
 		goto no_context;
 	pagefault_out_of_memory();
diff --git arch/ia64/mm/init.c arch/ia64/mm/init.c
index b01d68a2d5d9..42409924a028 100644
--- arch/ia64/mm/init.c
+++ arch/ia64/mm/init.c
@@ -118,13 +118,13 @@ ia64_init_addr_space (void)
 		vma->vm_end = vma->vm_start + PAGE_SIZE;
 		vma->vm_flags = VM_DATA_DEFAULT_FLAGS|VM_GROWSUP|VM_ACCOUNT;
 		vma->vm_page_prot = vm_get_page_prot(vma->vm_flags);
-		down_write(&current->mm->mmap_sem);
+		mm_write_lock(current->mm);
 		if (insert_vm_struct(current->mm, vma)) {
-			up_write(&current->mm->mmap_sem);
+			mm_write_unlock(current->mm);
 			vm_area_free(vma);
 			return;
 		}
-		up_write(&current->mm->mmap_sem);
+		mm_write_unlock(current->mm);
 	}
 
 	/* map NaT-page at address zero to speed up speculative dereferencing of NULL: */
@@ -136,13 +136,13 @@ ia64_init_addr_space (void)
 			vma->vm_page_prot = __pgprot(pgprot_val(PAGE_READONLY) | _PAGE_MA_NAT);
 			vma->vm_flags = VM_READ | VM_MAYREAD | VM_IO |
 					VM_DONTEXPAND | VM_DONTDUMP;
-			down_write(&current->mm->mmap_sem);
+			mm_write_lock(current->mm);
 			if (insert_vm_struct(current->mm, vma)) {
-				up_write(&current->mm->mmap_sem);
+				mm_write_unlock(current->mm);
 				vm_area_free(vma);
 				return;
 			}
-			up_write(&current->mm->mmap_sem);
+			mm_write_unlock(current->mm);
 		}
 	}
 }
diff --git arch/m68k/kernel/sys_m68k.c arch/m68k/kernel/sys_m68k.c
index 18a4de7d5934..ba7e522d47db 100644
--- arch/m68k/kernel/sys_m68k.c
+++ arch/m68k/kernel/sys_m68k.c
@@ -399,7 +399,7 @@ sys_cacheflush (unsigned long addr, int scope, int cache, unsigned long len)
 		 * Verify that the specified address region actually belongs
 		 * to this process.
 		 */
-		down_read(&current->mm->mmap_sem);
+		mm_read_lock(current->mm);
 		vma = find_vma(current->mm, addr);
 		if (!vma || addr < vma->vm_start || addr + len > vma->vm_end)
 			goto out_unlock;
@@ -450,7 +450,7 @@ sys_cacheflush (unsigned long addr, int scope, int cache, unsigned long len)
 	    }
 	}
 out_unlock:
-	up_read(&current->mm->mmap_sem);
+	mm_read_unlock(current->mm);
 out:
 	return ret;
 }
@@ -472,7 +472,7 @@ sys_atomic_cmpxchg_32(unsigned long newval, int oldval, int d3, int d4, int d5,
 		spinlock_t *ptl;
 		unsigned long mem_value;
 
-		down_read(&mm->mmap_sem);
+		mm_read_lock(mm);
 		pgd = pgd_offset(mm, (unsigned long)mem);
 		if (!pgd_present(*pgd))
 			goto bad_access;
@@ -501,11 +501,11 @@ sys_atomic_cmpxchg_32(unsigned long newval, int oldval, int d3, int d4, int d5,
 			__put_user(newval, mem);
 
 		pte_unmap_unlock(pte, ptl);
-		up_read(&mm->mmap_sem);
+		mm_read_unlock(mm);
 		return mem_value;
 
 	      bad_access:
-		up_read(&mm->mmap_sem);
+		mm_read_unlock(mm);
 		/* This is not necessarily a bad access, we can get here if
 		   a memory we're trying to write to should be copied-on-write.
 		   Make the kernel do the necessary page stuff, then re-iterate.
@@ -545,13 +545,13 @@ sys_atomic_cmpxchg_32(unsigned long newval, int oldval, int d3, int d4, int d5,
 	struct mm_struct *mm = current->mm;
 	unsigned long mem_value;
 
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 
 	mem_value = *mem;
 	if (mem_value == oldval)
 		*mem = newval;
 
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	return mem_value;
 }
 
diff --git arch/m68k/mm/fault.c arch/m68k/mm/fault.c
index e9b1d7585b43..fd65a9103d54 100644
--- arch/m68k/mm/fault.c
+++ arch/m68k/mm/fault.c
@@ -86,7 +86,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
 	if (user_mode(regs))
 		flags |= FAULT_FLAG_USER;
 retry:
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 
 	vma = find_vma(mm, address);
 	if (!vma)
@@ -177,7 +177,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
 		}
 	}
 
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	return 0;
 
 /*
@@ -185,7 +185,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
  * us unable to handle the page fault gracefully.
  */
 out_of_memory:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	if (!user_mode(regs))
 		goto no_context;
 	pagefault_out_of_memory();
@@ -214,6 +214,6 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
 	current->thread.faddr = address;
 
 send_sig:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	return send_fault_sig(regs);
 }
diff --git arch/microblaze/mm/fault.c arch/microblaze/mm/fault.c
index e6a810b0c7ad..83e9147a71d6 100644
--- arch/microblaze/mm/fault.c
+++ arch/microblaze/mm/fault.c
@@ -137,12 +137,12 @@ void do_page_fault(struct pt_regs *regs, unsigned long address,
 	 * source.  If this is invalid we can skip the address space check,
 	 * thus avoiding the deadlock.
 	 */
-	if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
+	if (unlikely(!mm_read_trylock(mm))) {
 		if (kernel_mode(regs) && !search_exception_tables(regs->pc))
 			goto bad_area_nosemaphore;
 
 retry:
-		down_read(&mm->mmap_sem);
+		mm_read_lock(mm);
 	}
 
 	vma = find_vma(mm, address);
@@ -249,7 +249,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long address,
 		}
 	}
 
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 	/*
 	 * keep track of tlb+htab misses that are good addrs but
@@ -260,7 +260,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long address,
 	return;
 
 bad_area:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 bad_area_nosemaphore:
 	pte_errors++;
@@ -279,7 +279,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long address,
  * us unable to handle the page fault gracefully.
  */
 out_of_memory:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	if (!user_mode(regs))
 		bad_page_fault(regs, address, SIGKILL);
 	else
@@ -287,7 +287,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long address,
 	return;
 
 do_sigbus:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	if (user_mode(regs)) {
 		force_sig_fault(SIGBUS, BUS_ADRERR, (void __user *)address);
 		return;
diff --git arch/mips/kernel/traps.c arch/mips/kernel/traps.c
index 83f2a437d9e2..a7a2acf41d5d 100644
--- arch/mips/kernel/traps.c
+++ arch/mips/kernel/traps.c
@@ -754,13 +754,13 @@ int process_fpemu_return(int sig, void __user *fault_addr, unsigned long fcr31)
 		return 1;
 
 	case SIGSEGV:
-		down_read(&current->mm->mmap_sem);
+		mm_read_lock(current->mm);
 		vma = find_vma(current->mm, (unsigned long)fault_addr);
 		if (vma && (vma->vm_start <= (unsigned long)fault_addr))
 			si_code = SEGV_ACCERR;
 		else
 			si_code = SEGV_MAPERR;
-		up_read(&current->mm->mmap_sem);
+		mm_read_unlock(current->mm);
 		force_sig_fault(SIGSEGV, si_code, fault_addr);
 		return 1;
 
diff --git arch/mips/kernel/vdso.c arch/mips/kernel/vdso.c
index bc35f8499111..5b4025fc6b85 100644
--- arch/mips/kernel/vdso.c
+++ arch/mips/kernel/vdso.c
@@ -92,7 +92,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	struct vm_area_struct *vma;
 	int ret;
 
-	if (down_write_killable(&mm->mmap_sem))
+	if (mm_write_lock_killable(mm))
 		return -EINTR;
 
 	/* Map delay slot emulation page */
@@ -183,6 +183,6 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	ret = 0;
 
 out:
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 	return ret;
 }
diff --git arch/nds32/kernel/vdso.c arch/nds32/kernel/vdso.c
index 90bcae6f8554..f74873e51715 100644
--- arch/nds32/kernel/vdso.c
+++ arch/nds32/kernel/vdso.c
@@ -130,7 +130,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	vdso_mapping_len += L1_cache_info[DCACHE].aliasing_num - 1;
 #endif
 
-	if (down_write_killable(&mm->mmap_sem))
+	if (mm_write_lock_killable(mm))
 		return -EINTR;
 
 	addr = vdso_random_addr(vdso_mapping_len);
@@ -185,12 +185,12 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 		goto up_fail;
 	}
 
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 	return 0;
 
 up_fail:
 	mm->context.vdso = NULL;
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 	return ret;
 }
 
diff --git arch/nds32/mm/fault.c arch/nds32/mm/fault.c
index 906dfb25353c..a53b70d062a4 100644
--- arch/nds32/mm/fault.c
+++ arch/nds32/mm/fault.c
@@ -127,12 +127,12 @@ void do_page_fault(unsigned long entry, unsigned long addr,
 	 * validly references user space from well defined areas of the code,
 	 * we can bug out early if this is from code which shouldn't.
 	 */
-	if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
+	if (unlikely(!mm_read_trylock(mm))) {
 		if (!user_mode(regs) &&
 		    !search_exception_tables(instruction_pointer(regs)))
 			goto no_context;
 retry:
-		down_read(&mm->mmap_sem);
+		mm_read_lock(mm);
 	} else {
 		/*
 		 * The above down_read_trylock() might have succeeded in which
@@ -257,7 +257,7 @@ void do_page_fault(unsigned long entry, unsigned long addr,
 		}
 	}
 
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	return;
 
 	/*
@@ -265,7 +265,7 @@ void do_page_fault(unsigned long entry, unsigned long addr,
 	 * Fix it, but check if it's kernel or user first..
 	 */
 bad_area:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 bad_area_nosemaphore:
 
@@ -325,14 +325,14 @@ void do_page_fault(unsigned long entry, unsigned long addr,
 	 */
 
 out_of_memory:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	if (!user_mode(regs))
 		goto no_context;
 	pagefault_out_of_memory();
 	return;
 
 do_sigbus:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 	/* Kernel mode? Handle exceptions or die */
 	if (!user_mode(regs))
diff --git arch/nios2/mm/fault.c arch/nios2/mm/fault.c
index 6a2e716b959f..3e4c7aa33b55 100644
--- arch/nios2/mm/fault.c
+++ arch/nios2/mm/fault.c
@@ -83,11 +83,11 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long cause,
 	if (user_mode(regs))
 		flags |= FAULT_FLAG_USER;
 
-	if (!down_read_trylock(&mm->mmap_sem)) {
+	if (!mm_read_trylock(mm)) {
 		if (!user_mode(regs) && !search_exception_tables(regs->ea))
 			goto bad_area_nosemaphore;
 retry:
-		down_read(&mm->mmap_sem);
+		mm_read_lock(mm);
 	}
 
 	vma = find_vma(mm, address);
@@ -172,7 +172,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long cause,
 		}
 	}
 
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	return;
 
 /*
@@ -180,7 +180,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long cause,
  * Fix it, but check if it's kernel or user first..
  */
 bad_area:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 bad_area_nosemaphore:
 	/* User mode accesses just cause a SIGSEGV */
@@ -218,14 +218,14 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long cause,
  * us unable to handle the page fault gracefully.
  */
 out_of_memory:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	if (!user_mode(regs))
 		goto no_context;
 	pagefault_out_of_memory();
 	return;
 
 do_sigbus:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 	/* Kernel mode? Handle exceptions or die */
 	if (!user_mode(regs))
diff --git arch/nios2/mm/init.c arch/nios2/mm/init.c
index 2c609c2516b2..334b8134b8fd 100644
--- arch/nios2/mm/init.c
+++ arch/nios2/mm/init.c
@@ -112,14 +112,14 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	struct mm_struct *mm = current->mm;
 	int ret;
 
-	down_write(&mm->mmap_sem);
+	mm_write_lock(mm);
 
 	/* Map kuser helpers to user space address */
 	ret = install_special_mapping(mm, KUSER_BASE, KUSER_SIZE,
 				      VM_READ | VM_EXEC | VM_MAYREAD |
 				      VM_MAYEXEC, kuser_page);
 
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 
 	return ret;
 }
diff --git arch/openrisc/mm/fault.c arch/openrisc/mm/fault.c
index 5d4d3a9691d0..57d486743a1a 100644
--- arch/openrisc/mm/fault.c
+++ arch/openrisc/mm/fault.c
@@ -104,7 +104,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address,
 		goto no_context;
 
 retry:
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 	vma = find_vma(mm, address);
 
 	if (!vma)
@@ -193,7 +193,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address,
 		}
 	}
 
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	return;
 
 	/*
@@ -202,7 +202,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address,
 	 */
 
 bad_area:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 bad_area_nosemaphore:
 
@@ -261,14 +261,14 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address,
 	__asm__ __volatile__("l.nop 42");
 	__asm__ __volatile__("l.nop 1");
 
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	if (!user_mode(regs))
 		goto no_context;
 	pagefault_out_of_memory();
 	return;
 
 do_sigbus:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 	/*
 	 * Send a sigbus, regardless of whether we were in kernel
diff --git arch/parisc/kernel/traps.c arch/parisc/kernel/traps.c
index 82fc01189488..5ecffd3420aa 100644
--- arch/parisc/kernel/traps.c
+++ arch/parisc/kernel/traps.c
@@ -717,7 +717,7 @@ void notrace handle_interruption(int code, struct pt_regs *regs)
 		if (user_mode(regs)) {
 			struct vm_area_struct *vma;
 
-			down_read(&current->mm->mmap_sem);
+			mm_read_lock(current->mm);
 			vma = find_vma(current->mm,regs->iaoq[0]);
 			if (vma && (regs->iaoq[0] >= vma->vm_start)
 				&& (vma->vm_flags & VM_EXEC)) {
@@ -725,10 +725,10 @@ void notrace handle_interruption(int code, struct pt_regs *regs)
 				fault_address = regs->iaoq[0];
 				fault_space = regs->iasq[0];
 
-				up_read(&current->mm->mmap_sem);
+				mm_read_unlock(current->mm);
 				break; /* call do_page_fault() */
 			}
-			up_read(&current->mm->mmap_sem);
+			mm_read_unlock(current->mm);
 		}
 		/* Fall Through */
 	case 27: 
diff --git arch/parisc/mm/fault.c arch/parisc/mm/fault.c
index adbd5e2144a3..344de2018465 100644
--- arch/parisc/mm/fault.c
+++ arch/parisc/mm/fault.c
@@ -282,7 +282,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long code,
 	if (acc_type & VM_WRITE)
 		flags |= FAULT_FLAG_WRITE;
 retry:
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 	vma = find_vma_prev(mm, address, &prev_vma);
 	if (!vma || address < vma->vm_start)
 		goto check_expansion;
@@ -339,7 +339,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long code,
 			goto retry;
 		}
 	}
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	return;
 
 check_expansion:
@@ -351,7 +351,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long code,
  * Something tried to access memory that isn't in our memory map..
  */
 bad_area:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 	if (user_mode(regs)) {
 		int signo, si_code;
@@ -423,7 +423,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long code,
 	parisc_terminate("Bad Address (null pointer deref?)", regs, code, address);
 
   out_of_memory:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	if (!user_mode(regs))
 		goto no_context;
 	pagefault_out_of_memory();
diff --git arch/powerpc/kernel/vdso.c arch/powerpc/kernel/vdso.c
index eae9ddaecbcf..9592c988b0e3 100644
--- arch/powerpc/kernel/vdso.c
+++ arch/powerpc/kernel/vdso.c
@@ -171,7 +171,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	 * and end up putting it elsewhere.
 	 * Add enough to the size so that the result can be aligned.
 	 */
-	if (down_write_killable(&mm->mmap_sem))
+	if (mm_write_lock_killable(mm))
 		return -EINTR;
 	vdso_base = get_unmapped_area(NULL, vdso_base,
 				      (vdso_pages << PAGE_SHIFT) +
@@ -211,11 +211,11 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 		goto fail_mmapsem;
 	}
 
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 	return 0;
 
  fail_mmapsem:
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 	return rc;
 }
 
diff --git arch/powerpc/kvm/book3s_64_mmu_hv.c arch/powerpc/kvm/book3s_64_mmu_hv.c
index d381526c5c9b..a04352030f88 100644
--- arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -582,7 +582,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
 	npages = get_user_pages_fast(hva, 1, writing ? FOLL_WRITE : 0, pages);
 	if (npages < 1) {
 		/* Check if it's an I/O mapping */
-		down_read(&mm->mmap_sem);
+		mm_read_lock(mm);
 		vma = find_vma(mm, hva);
 		if (vma && vma->vm_start <= hva && hva + psize <= vma->vm_end &&
 		    (vma->vm_flags & VM_PFNMAP)) {
@@ -592,7 +592,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, struct kvm_vcpu *vcpu,
 			is_ci = pte_ci(__pte((pgprot_val(vma->vm_page_prot))));
 			write_ok = vma->vm_flags & VM_WRITE;
 		}
-		up_read(&mm->mmap_sem);
+		mm_read_unlock(mm);
 		if (!pfn)
 			goto out_put;
 	} else {
diff --git arch/powerpc/kvm/book3s_hv.c arch/powerpc/kvm/book3s_hv.c
index 6ff3f896d908..e566b80185f4 100644
--- arch/powerpc/kvm/book3s_hv.c
+++ arch/powerpc/kvm/book3s_hv.c
@@ -4640,14 +4640,14 @@ static int kvmppc_hv_setup_htab_rma(struct kvm_vcpu *vcpu)
 
 	/* Look up the VMA for the start of this memory slot */
 	hva = memslot->userspace_addr;
-	down_read(&current->mm->mmap_sem);
+	mm_read_lock(current->mm);
 	vma = find_vma(current->mm, hva);
 	if (!vma || vma->vm_start > hva || (vma->vm_flags & VM_IO))
 		goto up_out;
 
 	psize = vma_kernel_pagesize(vma);
 
-	up_read(&current->mm->mmap_sem);
+	mm_read_unlock(current->mm);
 
 	/* We can handle 4k, 64k or 16M pages in the VRMA */
 	if (psize >= 0x1000000)
@@ -4680,7 +4680,7 @@ static int kvmppc_hv_setup_htab_rma(struct kvm_vcpu *vcpu)
 	return err;
 
  up_out:
-	up_read(&current->mm->mmap_sem);
+	mm_read_unlock(current->mm);
 	goto out_srcu;
 }
 
diff --git arch/powerpc/kvm/book3s_hv_uvmem.c arch/powerpc/kvm/book3s_hv_uvmem.c
index 2de264fc3156..df02fc08a4ee 100644
--- arch/powerpc/kvm/book3s_hv_uvmem.c
+++ arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -366,7 +366,7 @@ kvmppc_svm_page_in(struct vm_area_struct *vma, unsigned long start,
 	 */
 	ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
 			  MADV_UNMERGEABLE, &vma->vm_flags);
-	downgrade_write(&kvm->mm->mmap_sem);
+	mm_downgrade_write_lock(kvm->mm);
 	*downgrade = true;
 	if (ret)
 		return ret;
@@ -483,7 +483,7 @@ kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gpa,
 
 	ret = H_PARAMETER;
 	srcu_idx = srcu_read_lock(&kvm->srcu);
-	down_write(&kvm->mm->mmap_sem);
+	mm_write_lock(kvm->mm);
 
 	start = gfn_to_hva(kvm, gfn);
 	if (kvm_is_error_hva(start))
@@ -506,9 +506,9 @@ kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gpa,
 	mutex_unlock(&kvm->arch.uvmem_lock);
 out:
 	if (downgrade)
-		up_read(&kvm->mm->mmap_sem);
+		mm_read_unlock(kvm->mm);
 	else
-		up_write(&kvm->mm->mmap_sem);
+		mm_write_unlock(kvm->mm);
 	srcu_read_unlock(&kvm->srcu, srcu_idx);
 	return ret;
 }
@@ -660,7 +660,7 @@ kvmppc_h_svm_page_out(struct kvm *kvm, unsigned long gpa,
 
 	ret = H_PARAMETER;
 	srcu_idx = srcu_read_lock(&kvm->srcu);
-	down_read(&kvm->mm->mmap_sem);
+	mm_read_lock(kvm->mm);
 	start = gfn_to_hva(kvm, gfn);
 	if (kvm_is_error_hva(start))
 		goto out;
@@ -673,7 +673,7 @@ kvmppc_h_svm_page_out(struct kvm *kvm, unsigned long gpa,
 	if (!kvmppc_svm_page_out(vma, start, end, page_shift, kvm, gpa))
 		ret = H_SUCCESS;
 out:
-	up_read(&kvm->mm->mmap_sem);
+	mm_read_unlock(kvm->mm);
 	srcu_read_unlock(&kvm->srcu, srcu_idx);
 	return ret;
 }
diff --git arch/powerpc/kvm/e500_mmu_host.c arch/powerpc/kvm/e500_mmu_host.c
index 425d13806645..6e0785b3515d 100644
--- arch/powerpc/kvm/e500_mmu_host.c
+++ arch/powerpc/kvm/e500_mmu_host.c
@@ -355,7 +355,7 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
 
 	if (tlbsel == 1) {
 		struct vm_area_struct *vma;
-		down_read(&kvm->mm->mmap_sem);
+		mm_read_lock(kvm->mm);
 
 		vma = find_vma(kvm->mm, hva);
 		if (vma && hva >= vma->vm_start &&
@@ -441,7 +441,7 @@ static inline int kvmppc_e500_shadow_map(struct kvmppc_vcpu_e500 *vcpu_e500,
 			tsize = max(BOOK3E_PAGESZ_4K, tsize & ~1);
 		}
 
-		up_read(&kvm->mm->mmap_sem);
+		mm_read_unlock(kvm->mm);
 	}
 
 	if (likely(!pfnmap)) {
diff --git arch/powerpc/mm/book3s64/iommu_api.c arch/powerpc/mm/book3s64/iommu_api.c
index 56cc84520577..90ef878d7d91 100644
--- arch/powerpc/mm/book3s64/iommu_api.c
+++ arch/powerpc/mm/book3s64/iommu_api.c
@@ -96,7 +96,7 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 		goto unlock_exit;
 	}
 
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 	chunk = (1UL << (PAGE_SHIFT + MAX_ORDER - 1)) /
 			sizeof(struct vm_area_struct *);
 	chunk = min(chunk, entries);
@@ -114,7 +114,7 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 			pinned += ret;
 		break;
 	}
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	if (pinned != entries) {
 		if (!ret)
 			ret = -EFAULT;
diff --git arch/powerpc/mm/book3s64/subpage_prot.c arch/powerpc/mm/book3s64/subpage_prot.c
index 2ef24a53f4c9..e025e50798ff 100644
--- arch/powerpc/mm/book3s64/subpage_prot.c
+++ arch/powerpc/mm/book3s64/subpage_prot.c
@@ -92,7 +92,7 @@ static void subpage_prot_clear(unsigned long addr, unsigned long len)
 	size_t nw;
 	unsigned long next, limit;
 
-	down_write(&mm->mmap_sem);
+	mm_write_lock(mm);
 
 	spt = mm_ctx_subpage_prot(&mm->context);
 	if (!spt)
@@ -127,7 +127,7 @@ static void subpage_prot_clear(unsigned long addr, unsigned long len)
 	}
 
 err_out:
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
@@ -217,7 +217,7 @@ SYSCALL_DEFINE3(subpage_prot, unsigned long, addr,
 	if (!access_ok(map, (len >> PAGE_SHIFT) * sizeof(u32)))
 		return -EFAULT;
 
-	down_write(&mm->mmap_sem);
+	mm_write_lock(mm);
 
 	spt = mm_ctx_subpage_prot(&mm->context);
 	if (!spt) {
@@ -267,11 +267,11 @@ SYSCALL_DEFINE3(subpage_prot, unsigned long, addr,
 		if (addr + (nw << PAGE_SHIFT) > next)
 			nw = (next - addr) >> PAGE_SHIFT;
 
-		up_write(&mm->mmap_sem);
+		mm_write_unlock(mm);
 		if (__copy_from_user(spp, map, nw * sizeof(u32)))
 			return -EFAULT;
 		map += nw;
-		down_write(&mm->mmap_sem);
+		mm_write_lock(mm);
 
 		/* now flush any existing HPTEs for the range */
 		hpte_flush_range(mm, addr, nw);
@@ -280,6 +280,6 @@ SYSCALL_DEFINE3(subpage_prot, unsigned long, addr,
 		spt->maxaddr = limit;
 	err = 0;
  out:
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 	return err;
 }
diff --git arch/powerpc/mm/copro_fault.c arch/powerpc/mm/copro_fault.c
index beb060b96632..b84d5046f052 100644
--- arch/powerpc/mm/copro_fault.c
+++ arch/powerpc/mm/copro_fault.c
@@ -33,7 +33,7 @@ int copro_handle_mm_fault(struct mm_struct *mm, unsigned long ea,
 	if (mm->pgd == NULL)
 		return -EFAULT;
 
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 	ret = -EFAULT;
 	vma = find_vma(mm, ea);
 	if (!vma)
@@ -82,7 +82,7 @@ int copro_handle_mm_fault(struct mm_struct *mm, unsigned long ea,
 		current->min_flt++;
 
 out_unlock:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(copro_handle_mm_fault);
diff --git arch/powerpc/mm/fault.c arch/powerpc/mm/fault.c
index b5047f9b5dec..2288f1d195dd 100644
--- arch/powerpc/mm/fault.c
+++ arch/powerpc/mm/fault.c
@@ -108,7 +108,7 @@ static int __bad_area(struct pt_regs *regs, unsigned long address, int si_code)
 	 * Something tried to access memory that isn't in our memory map..
 	 * Fix it, but check if it's kernel or user first..
 	 */
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 	return __bad_area_nosemaphore(regs, address, si_code);
 }
@@ -515,12 +515,12 @@ static int __do_page_fault(struct pt_regs *regs, unsigned long address,
 	 * source.  If this is invalid we can skip the address space check,
 	 * thus avoiding the deadlock.
 	 */
-	if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
+	if (unlikely(!mm_read_trylock(mm))) {
 		if (!is_user && !search_exception_tables(regs->nip))
 			return bad_area_nosemaphore(regs, address);
 
 retry:
-		down_read(&mm->mmap_sem);
+		mm_read_lock(mm);
 	} else {
 		/*
 		 * The above down_read_trylock() might have succeeded in
@@ -544,7 +544,7 @@ static int __do_page_fault(struct pt_regs *regs, unsigned long address,
 		if (!must_retry)
 			return bad_area(regs, address);
 
-		up_read(&mm->mmap_sem);
+		mm_read_unlock(mm);
 		if (fault_in_pages_readable((const char __user *)regs->nip,
 					    sizeof(unsigned int)))
 			return bad_area_nosemaphore(regs, address);
@@ -576,7 +576,7 @@ static int __do_page_fault(struct pt_regs *regs, unsigned long address,
 
 		int pkey = vma_pkey(vma);
 
-		up_read(&mm->mmap_sem);
+		mm_read_unlock(mm);
 		return bad_key_fault_exception(regs, address, pkey);
 	}
 #endif /* CONFIG_PPC_MEM_KEYS */
@@ -607,7 +607,7 @@ static int __do_page_fault(struct pt_regs *regs, unsigned long address,
 		return is_user ? 0 : SIGBUS;
 	}
 
-	up_read(&current->mm->mmap_sem);
+	mm_read_unlock(current->mm);
 
 	if (unlikely(fault & VM_FAULT_ERROR))
 		return mm_fault_error(regs, address, fault);
diff --git arch/powerpc/oprofile/cell/spu_task_sync.c arch/powerpc/oprofile/cell/spu_task_sync.c
index 0caec3d8d436..7dad5c398bf3 100644
--- arch/powerpc/oprofile/cell/spu_task_sync.c
+++ arch/powerpc/oprofile/cell/spu_task_sync.c
@@ -332,7 +332,7 @@ get_exec_dcookie_and_offset(struct spu *spu, unsigned int *offsetp,
 		fput(exe_file);
 	}
 
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 	for (vma = mm->mmap; vma; vma = vma->vm_next) {
 		if (vma->vm_start > spu_ref || vma->vm_end <= spu_ref)
 			continue;
@@ -349,13 +349,13 @@ get_exec_dcookie_and_offset(struct spu *spu, unsigned int *offsetp,
 	*spu_bin_dcookie = fast_get_dcookie(&vma->vm_file->f_path);
 	pr_debug("got dcookie for %pD\n", vma->vm_file);
 
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 out:
 	return app_cookie;
 
 fail_no_image_cookie:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 	printk(KERN_ERR "SPU_PROF: "
 		"%s, line %d: Cannot find dcookie for SPU binary\n",
diff --git arch/powerpc/platforms/cell/spufs/file.c arch/powerpc/platforms/cell/spufs/file.c
index c0f950a3f4e1..fed452f5db84 100644
--- arch/powerpc/platforms/cell/spufs/file.c
+++ arch/powerpc/platforms/cell/spufs/file.c
@@ -336,11 +336,11 @@ static vm_fault_t spufs_ps_fault(struct vm_fault *vmf,
 		goto refault;
 
 	if (ctx->state == SPU_STATE_SAVED) {
-		up_read(&current->mm->mmap_sem);
+		mm_read_unlock(current->mm);
 		spu_context_nospu_trace(spufs_ps_fault__sleep, ctx);
 		err = spufs_wait(ctx->run_wq, ctx->state == SPU_STATE_RUNNABLE);
 		spu_context_trace(spufs_ps_fault__wake, ctx, ctx->spu);
-		down_read(&current->mm->mmap_sem);
+		mm_read_lock(current->mm);
 	} else {
 		area = ctx->spu->problem_phys + ps_offs;
 		ret = vmf_insert_pfn(vmf->vma, vmf->address,
diff --git arch/riscv/kernel/vdso.c arch/riscv/kernel/vdso.c
index 484d95a70907..5b4fad784795 100644
--- arch/riscv/kernel/vdso.c
+++ arch/riscv/kernel/vdso.c
@@ -61,7 +61,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm,
 
 	vdso_len = (vdso_pages + 1) << PAGE_SHIFT;
 
-	down_write(&mm->mmap_sem);
+	mm_write_lock(mm);
 	vdso_base = get_unmapped_area(NULL, 0, vdso_len, 0, 0);
 	if (IS_ERR_VALUE(vdso_base)) {
 		ret = vdso_base;
@@ -83,7 +83,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm,
 		mm->context.vdso = NULL;
 
 end:
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 	return ret;
 }
 
diff --git arch/riscv/mm/fault.c arch/riscv/mm/fault.c
index cf7248e07f43..eb1a278a52c3 100644
--- arch/riscv/mm/fault.c
+++ arch/riscv/mm/fault.c
@@ -69,7 +69,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
 
 retry:
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 	vma = find_vma(mm, addr);
 	if (unlikely(!vma))
 		goto bad_area;
@@ -160,7 +160,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
 		}
 	}
 
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	return;
 
 	/*
@@ -168,7 +168,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
 	 * Fix it, but check if it's kernel or user first.
 	 */
 bad_area:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	/* User mode accesses just cause a SIGSEGV */
 	if (user_mode(regs)) {
 		do_trap(regs, SIGSEGV, code, addr);
@@ -196,14 +196,14 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
 	 * (which will retry the fault, or kill us if we got oom-killed).
 	 */
 out_of_memory:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	if (!user_mode(regs))
 		goto no_context;
 	pagefault_out_of_memory();
 	return;
 
 do_sigbus:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	/* Kernel mode? Handle exceptions or die */
 	if (!user_mode(regs))
 		goto no_context;
diff --git arch/s390/kernel/vdso.c arch/s390/kernel/vdso.c
index bcc9bdb39ba2..7e27f81eefd0 100644
--- arch/s390/kernel/vdso.c
+++ arch/s390/kernel/vdso.c
@@ -208,7 +208,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	 * it at vdso_base which is the "natural" base for it, but we might
 	 * fail and end up putting it elsewhere.
 	 */
-	if (down_write_killable(&mm->mmap_sem))
+	if (mm_write_lock_killable(mm))
 		return -EINTR;
 	vdso_base = get_unmapped_area(NULL, 0, vdso_pages << PAGE_SHIFT, 0, 0);
 	if (IS_ERR_VALUE(vdso_base)) {
@@ -239,7 +239,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	rc = 0;
 
 out_up:
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 	return rc;
 }
 
diff --git arch/s390/kvm/gaccess.c arch/s390/kvm/gaccess.c
index 07d30ffcfa41..25d0760993eb 100644
--- arch/s390/kvm/gaccess.c
+++ arch/s390/kvm/gaccess.c
@@ -1170,7 +1170,7 @@ int kvm_s390_shadow_fault(struct kvm_vcpu *vcpu, struct gmap *sg,
 	int dat_protection, fake;
 	int rc;
 
-	down_read(&sg->mm->mmap_sem);
+	mm_read_lock(sg->mm);
 	/*
 	 * We don't want any guest-2 tables to change - so the parent
 	 * tables/pointers we read stay valid - unshadowing is however
@@ -1199,6 +1199,6 @@ int kvm_s390_shadow_fault(struct kvm_vcpu *vcpu, struct gmap *sg,
 	if (!rc)
 		rc = gmap_shadow_page(sg, saddr, __pte(pte.val));
 	ipte_unlock(vcpu);
-	up_read(&sg->mm->mmap_sem);
+	mm_read_unlock(sg->mm);
 	return rc;
 }
diff --git arch/s390/kvm/kvm-s390.c arch/s390/kvm/kvm-s390.c
index d9e6bf3d54f0..77026ec47470 100644
--- arch/s390/kvm/kvm-s390.c
+++ arch/s390/kvm/kvm-s390.c
@@ -753,9 +753,9 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
 			r = -EINVAL;
 		else {
 			r = 0;
-			down_write(&kvm->mm->mmap_sem);
+			mm_write_lock(kvm->mm);
 			kvm->mm->context.allow_gmap_hpage_1m = 1;
-			up_write(&kvm->mm->mmap_sem);
+			mm_write_unlock(kvm->mm);
 			/*
 			 * We might have to create fake 4k page
 			 * tables. To avoid that the hardware works on
@@ -1805,7 +1805,7 @@ static long kvm_s390_get_skeys(struct kvm *kvm, struct kvm_s390_skeys *args)
 	if (!keys)
 		return -ENOMEM;
 
-	down_read(&current->mm->mmap_sem);
+	mm_read_lock(current->mm);
 	srcu_idx = srcu_read_lock(&kvm->srcu);
 	for (i = 0; i < args->count; i++) {
 		hva = gfn_to_hva(kvm, args->start_gfn + i);
@@ -1819,7 +1819,7 @@ static long kvm_s390_get_skeys(struct kvm *kvm, struct kvm_s390_skeys *args)
 			break;
 	}
 	srcu_read_unlock(&kvm->srcu, srcu_idx);
-	up_read(&current->mm->mmap_sem);
+	mm_read_unlock(current->mm);
 
 	if (!r) {
 		r = copy_to_user((uint8_t __user *)args->skeydata_addr, keys,
@@ -1863,7 +1863,7 @@ static long kvm_s390_set_skeys(struct kvm *kvm, struct kvm_s390_skeys *args)
 		goto out;
 
 	i = 0;
-	down_read(&current->mm->mmap_sem);
+	mm_read_lock(current->mm);
 	srcu_idx = srcu_read_lock(&kvm->srcu);
         while (i < args->count) {
 		unlocked = false;
@@ -1890,7 +1890,7 @@ static long kvm_s390_set_skeys(struct kvm *kvm, struct kvm_s390_skeys *args)
 			i++;
 	}
 	srcu_read_unlock(&kvm->srcu, srcu_idx);
-	up_read(&current->mm->mmap_sem);
+	mm_read_unlock(current->mm);
 out:
 	kvfree(keys);
 	return r;
@@ -2073,14 +2073,14 @@ static int kvm_s390_get_cmma_bits(struct kvm *kvm,
 	if (!values)
 		return -ENOMEM;
 
-	down_read(&kvm->mm->mmap_sem);
+	mm_read_lock(kvm->mm);
 	srcu_idx = srcu_read_lock(&kvm->srcu);
 	if (peek)
 		ret = kvm_s390_peek_cmma(kvm, args, values, bufsize);
 	else
 		ret = kvm_s390_get_cmma(kvm, args, values, bufsize);
 	srcu_read_unlock(&kvm->srcu, srcu_idx);
-	up_read(&kvm->mm->mmap_sem);
+	mm_read_unlock(kvm->mm);
 
 	if (kvm->arch.migration_mode)
 		args->remaining = atomic64_read(&kvm->arch.cmma_dirty_pages);
@@ -2130,7 +2130,7 @@ static int kvm_s390_set_cmma_bits(struct kvm *kvm,
 		goto out;
 	}
 
-	down_read(&kvm->mm->mmap_sem);
+	mm_read_lock(kvm->mm);
 	srcu_idx = srcu_read_lock(&kvm->srcu);
 	for (i = 0; i < args->count; i++) {
 		hva = gfn_to_hva(kvm, args->start_gfn + i);
@@ -2145,12 +2145,12 @@ static int kvm_s390_set_cmma_bits(struct kvm *kvm,
 		set_pgste_bits(kvm->mm, hva, mask, pgstev);
 	}
 	srcu_read_unlock(&kvm->srcu, srcu_idx);
-	up_read(&kvm->mm->mmap_sem);
+	mm_read_unlock(kvm->mm);
 
 	if (!kvm->mm->context.uses_cmm) {
-		down_write(&kvm->mm->mmap_sem);
+		mm_write_lock(kvm->mm);
 		kvm->mm->context.uses_cmm = 1;
-		up_write(&kvm->mm->mmap_sem);
+		mm_write_unlock(kvm->mm);
 	}
 out:
 	vfree(bits);
diff --git arch/s390/kvm/priv.c arch/s390/kvm/priv.c
index ed52ffa8d5d4..f9b4013e2f4e 100644
--- arch/s390/kvm/priv.c
+++ arch/s390/kvm/priv.c
@@ -270,18 +270,18 @@ static int handle_iske(struct kvm_vcpu *vcpu)
 		return kvm_s390_inject_program_int(vcpu, PGM_ADDRESSING);
 retry:
 	unlocked = false;
-	down_read(&current->mm->mmap_sem);
+	mm_read_lock(current->mm);
 	rc = get_guest_storage_key(current->mm, vmaddr, &key);
 
 	if (rc) {
 		rc = fixup_user_fault(current, current->mm, vmaddr,
 				      FAULT_FLAG_WRITE, &unlocked);
 		if (!rc) {
-			up_read(&current->mm->mmap_sem);
+			mm_read_unlock(current->mm);
 			goto retry;
 		}
 	}
-	up_read(&current->mm->mmap_sem);
+	mm_read_unlock(current->mm);
 	if (rc == -EFAULT)
 		return kvm_s390_inject_program_int(vcpu, PGM_ADDRESSING);
 	if (rc < 0)
@@ -317,17 +317,17 @@ static int handle_rrbe(struct kvm_vcpu *vcpu)
 		return kvm_s390_inject_program_int(vcpu, PGM_ADDRESSING);
 retry:
 	unlocked = false;
-	down_read(&current->mm->mmap_sem);
+	mm_read_lock(current->mm);
 	rc = reset_guest_reference_bit(current->mm, vmaddr);
 	if (rc < 0) {
 		rc = fixup_user_fault(current, current->mm, vmaddr,
 				      FAULT_FLAG_WRITE, &unlocked);
 		if (!rc) {
-			up_read(&current->mm->mmap_sem);
+			mm_read_unlock(current->mm);
 			goto retry;
 		}
 	}
-	up_read(&current->mm->mmap_sem);
+	mm_read_unlock(current->mm);
 	if (rc == -EFAULT)
 		return kvm_s390_inject_program_int(vcpu, PGM_ADDRESSING);
 	if (rc < 0)
@@ -385,7 +385,7 @@ static int handle_sske(struct kvm_vcpu *vcpu)
 		if (kvm_is_error_hva(vmaddr))
 			return kvm_s390_inject_program_int(vcpu, PGM_ADDRESSING);
 
-		down_read(&current->mm->mmap_sem);
+		mm_read_lock(current->mm);
 		rc = cond_set_guest_storage_key(current->mm, vmaddr, key, &oldkey,
 						m3 & SSKE_NQ, m3 & SSKE_MR,
 						m3 & SSKE_MC);
@@ -395,7 +395,7 @@ static int handle_sske(struct kvm_vcpu *vcpu)
 					      FAULT_FLAG_WRITE, &unlocked);
 			rc = !rc ? -EAGAIN : rc;
 		}
-		up_read(&current->mm->mmap_sem);
+		mm_read_unlock(current->mm);
 		if (rc == -EFAULT)
 			return kvm_s390_inject_program_int(vcpu, PGM_ADDRESSING);
 		if (rc < 0)
@@ -1084,7 +1084,7 @@ static int handle_pfmf(struct kvm_vcpu *vcpu)
 
 			if (rc)
 				return rc;
-			down_read(&current->mm->mmap_sem);
+			mm_read_lock(current->mm);
 			rc = cond_set_guest_storage_key(current->mm, vmaddr,
 							key, NULL, nq, mr, mc);
 			if (rc < 0) {
@@ -1092,7 +1092,7 @@ static int handle_pfmf(struct kvm_vcpu *vcpu)
 						      FAULT_FLAG_WRITE, &unlocked);
 				rc = !rc ? -EAGAIN : rc;
 			}
-			up_read(&current->mm->mmap_sem);
+			mm_read_unlock(current->mm);
 			if (rc == -EFAULT)
 				return kvm_s390_inject_program_int(vcpu, PGM_ADDRESSING);
 			if (rc == -EAGAIN)
@@ -1213,9 +1213,9 @@ static int handle_essa(struct kvm_vcpu *vcpu)
 		 * already correct, we do nothing and avoid the lock.
 		 */
 		if (vcpu->kvm->mm->context.uses_cmm == 0) {
-			down_write(&vcpu->kvm->mm->mmap_sem);
+			mm_write_lock(vcpu->kvm->mm);
 			vcpu->kvm->mm->context.uses_cmm = 1;
-			up_write(&vcpu->kvm->mm->mmap_sem);
+			mm_write_unlock(vcpu->kvm->mm);
 		}
 		/*
 		 * If we are here, we are supposed to have CMMA enabled in
@@ -1232,11 +1232,11 @@ static int handle_essa(struct kvm_vcpu *vcpu)
 	} else {
 		int srcu_idx;
 
-		down_read(&vcpu->kvm->mm->mmap_sem);
+		mm_read_lock(vcpu->kvm->mm);
 		srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
 		i = __do_essa(vcpu, orc);
 		srcu_read_unlock(&vcpu->kvm->srcu, srcu_idx);
-		up_read(&vcpu->kvm->mm->mmap_sem);
+		mm_read_unlock(vcpu->kvm->mm);
 		if (i < 0)
 			return i;
 		/* Account for the possible extra cbrl entry */
@@ -1244,10 +1244,10 @@ static int handle_essa(struct kvm_vcpu *vcpu)
 	}
 	vcpu->arch.sie_block->cbrlo &= PAGE_MASK;	/* reset nceo */
 	cbrlo = phys_to_virt(vcpu->arch.sie_block->cbrlo);
-	down_read(&gmap->mm->mmap_sem);
+	mm_read_lock(gmap->mm);
 	for (i = 0; i < entries; ++i)
 		__gmap_zap(gmap, cbrlo[i]);
-	up_read(&gmap->mm->mmap_sem);
+	mm_read_unlock(gmap->mm);
 	return 0;
 }
 
diff --git arch/s390/mm/fault.c arch/s390/mm/fault.c
index 7b0bb475c166..757793fbbb2b 100644
--- arch/s390/mm/fault.c
+++ arch/s390/mm/fault.c
@@ -434,7 +434,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
 		flags |= FAULT_FLAG_USER;
 	if (access == VM_WRITE || (trans_exc_code & store_indication) == 0x400)
 		flags |= FAULT_FLAG_WRITE;
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 
 	gmap = NULL;
 	if (IS_ENABLED(CONFIG_PGSTE) && type == GMAP_FAULT) {
@@ -519,7 +519,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
 			flags &= ~(FAULT_FLAG_ALLOW_RETRY |
 				   FAULT_FLAG_RETRY_NOWAIT);
 			flags |= FAULT_FLAG_TRIED;
-			down_read(&mm->mmap_sem);
+			mm_read_lock(mm);
 			goto retry;
 		}
 	}
@@ -537,7 +537,7 @@ static inline vm_fault_t do_exception(struct pt_regs *regs, int access)
 	}
 	fault = 0;
 out_up:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 out:
 	return fault;
 }
diff --git arch/s390/mm/gmap.c arch/s390/mm/gmap.c
index edcdca97e85e..523f27a7ccaa 100644
--- arch/s390/mm/gmap.c
+++ arch/s390/mm/gmap.c
@@ -405,10 +405,10 @@ int gmap_unmap_segment(struct gmap *gmap, unsigned long to, unsigned long len)
 		return -EINVAL;
 
 	flush = 0;
-	down_write(&gmap->mm->mmap_sem);
+	mm_write_lock(gmap->mm);
 	for (off = 0; off < len; off += PMD_SIZE)
 		flush |= __gmap_unmap_by_gaddr(gmap, to + off);
-	up_write(&gmap->mm->mmap_sem);
+	mm_write_unlock(gmap->mm);
 	if (flush)
 		gmap_flush_tlb(gmap);
 	return 0;
@@ -438,7 +438,7 @@ int gmap_map_segment(struct gmap *gmap, unsigned long from,
 		return -EINVAL;
 
 	flush = 0;
-	down_write(&gmap->mm->mmap_sem);
+	mm_write_lock(gmap->mm);
 	for (off = 0; off < len; off += PMD_SIZE) {
 		/* Remove old translation */
 		flush |= __gmap_unmap_by_gaddr(gmap, to + off);
@@ -448,7 +448,7 @@ int gmap_map_segment(struct gmap *gmap, unsigned long from,
 				      (void *) from + off))
 			break;
 	}
-	up_write(&gmap->mm->mmap_sem);
+	mm_write_unlock(gmap->mm);
 	if (flush)
 		gmap_flush_tlb(gmap);
 	if (off >= len)
@@ -495,9 +495,9 @@ unsigned long gmap_translate(struct gmap *gmap, unsigned long gaddr)
 {
 	unsigned long rc;
 
-	down_read(&gmap->mm->mmap_sem);
+	mm_read_lock(gmap->mm);
 	rc = __gmap_translate(gmap, gaddr);
-	up_read(&gmap->mm->mmap_sem);
+	mm_read_unlock(gmap->mm);
 	return rc;
 }
 EXPORT_SYMBOL_GPL(gmap_translate);
@@ -640,7 +640,7 @@ int gmap_fault(struct gmap *gmap, unsigned long gaddr,
 	int rc;
 	bool unlocked;
 
-	down_read(&gmap->mm->mmap_sem);
+	mm_read_lock(gmap->mm);
 
 retry:
 	unlocked = false;
@@ -663,7 +663,7 @@ int gmap_fault(struct gmap *gmap, unsigned long gaddr,
 
 	rc = __gmap_link(gmap, gaddr, vmaddr);
 out_up:
-	up_read(&gmap->mm->mmap_sem);
+	mm_read_unlock(gmap->mm);
 	return rc;
 }
 EXPORT_SYMBOL_GPL(gmap_fault);
@@ -696,7 +696,7 @@ void gmap_discard(struct gmap *gmap, unsigned long from, unsigned long to)
 	unsigned long gaddr, vmaddr, size;
 	struct vm_area_struct *vma;
 
-	down_read(&gmap->mm->mmap_sem);
+	mm_read_lock(gmap->mm);
 	for (gaddr = from; gaddr < to;
 	     gaddr = (gaddr + PMD_SIZE) & PMD_MASK) {
 		/* Find the vm address for the guest address */
@@ -719,7 +719,7 @@ void gmap_discard(struct gmap *gmap, unsigned long from, unsigned long to)
 		size = min(to - gaddr, PMD_SIZE - (gaddr & ~PMD_MASK));
 		zap_page_range(vma, vmaddr, size);
 	}
-	up_read(&gmap->mm->mmap_sem);
+	mm_read_unlock(gmap->mm);
 }
 EXPORT_SYMBOL_GPL(gmap_discard);
 
@@ -1102,9 +1102,9 @@ int gmap_mprotect_notify(struct gmap *gmap, unsigned long gaddr,
 		return -EINVAL;
 	if (!MACHINE_HAS_ESOP && prot == PROT_READ)
 		return -EINVAL;
-	down_read(&gmap->mm->mmap_sem);
+	mm_read_lock(gmap->mm);
 	rc = gmap_protect_range(gmap, gaddr, len, prot, GMAP_NOTIFY_MPROT);
-	up_read(&gmap->mm->mmap_sem);
+	mm_read_unlock(gmap->mm);
 	return rc;
 }
 EXPORT_SYMBOL_GPL(gmap_mprotect_notify);
@@ -1692,11 +1692,11 @@ struct gmap *gmap_shadow(struct gmap *parent, unsigned long asce,
 	}
 	spin_unlock(&parent->shadow_lock);
 	/* protect after insertion, so it will get properly invalidated */
-	down_read(&parent->mm->mmap_sem);
+	mm_read_lock(parent->mm);
 	rc = gmap_protect_range(parent, asce & _ASCE_ORIGIN,
 				((asce & _ASCE_TABLE_LENGTH) + 1) * PAGE_SIZE,
 				PROT_READ, GMAP_NOTIFY_SHADOW);
-	up_read(&parent->mm->mmap_sem);
+	mm_read_unlock(parent->mm);
 	spin_lock(&parent->shadow_lock);
 	new->initialized = true;
 	if (rc) {
@@ -2538,12 +2538,12 @@ int s390_enable_sie(void)
 	/* Fail if the page tables are 2K */
 	if (!mm_alloc_pgste(mm))
 		return -EINVAL;
-	down_write(&mm->mmap_sem);
+	mm_write_lock(mm);
 	mm->context.has_pgste = 1;
 	/* split thp mappings and disable thp for future mappings */
 	thp_split_mm(mm);
 	walk_page_range(mm, 0, TASK_SIZE, &zap_zero_walk_ops, NULL);
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(s390_enable_sie);
@@ -2596,7 +2596,7 @@ int s390_enable_skey(void)
 	struct vm_area_struct *vma;
 	int rc = 0;
 
-	down_write(&mm->mmap_sem);
+	mm_write_lock(mm);
 	if (mm_uses_skeys(mm))
 		goto out_up;
 
@@ -2614,7 +2614,7 @@ int s390_enable_skey(void)
 	walk_page_range(mm, 0, TASK_SIZE, &enable_skey_walk_ops, NULL);
 
 out_up:
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 	return rc;
 }
 EXPORT_SYMBOL_GPL(s390_enable_skey);
@@ -2635,8 +2635,8 @@ static const struct mm_walk_ops reset_cmma_walk_ops = {
 
 void s390_reset_cmma(struct mm_struct *mm)
 {
-	down_write(&mm->mmap_sem);
+	mm_write_lock(mm);
 	walk_page_range(mm, 0, TASK_SIZE, &reset_cmma_walk_ops, NULL);
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 }
 EXPORT_SYMBOL_GPL(s390_reset_cmma);
diff --git arch/s390/pci/pci_mmio.c arch/s390/pci/pci_mmio.c
index 7d42a8794f10..0124284c0374 100644
--- arch/s390/pci/pci_mmio.c
+++ arch/s390/pci/pci_mmio.c
@@ -18,7 +18,7 @@ static long get_pfn(unsigned long user_addr, unsigned long access,
 	struct vm_area_struct *vma;
 	long ret;
 
-	down_read(&current->mm->mmap_sem);
+	mm_read_lock(current->mm);
 	ret = -EINVAL;
 	vma = find_vma(current->mm, user_addr);
 	if (!vma)
@@ -28,7 +28,7 @@ static long get_pfn(unsigned long user_addr, unsigned long access,
 		goto out;
 	ret = follow_pfn(vma, user_addr, pfn);
 out:
-	up_read(&current->mm->mmap_sem);
+	mm_read_unlock(current->mm);
 	return ret;
 }
 
diff --git arch/sh/kernel/sys_sh.c arch/sh/kernel/sys_sh.c
index f8afc014e084..70d0a8c2e42e 100644
--- arch/sh/kernel/sys_sh.c
+++ arch/sh/kernel/sys_sh.c
@@ -69,10 +69,10 @@ asmlinkage int sys_cacheflush(unsigned long addr, unsigned long len, int op)
 	if (addr + len < addr)
 		return -EFAULT;
 
-	down_read(&current->mm->mmap_sem);
+	mm_read_lock(current->mm);
 	vma = find_vma (current->mm, addr);
 	if (vma == NULL || addr < vma->vm_start || addr + len > vma->vm_end) {
-		up_read(&current->mm->mmap_sem);
+		mm_read_unlock(current->mm);
 		return -EFAULT;
 	}
 
@@ -91,6 +91,6 @@ asmlinkage int sys_cacheflush(unsigned long addr, unsigned long len, int op)
 	if (op & CACHEFLUSH_I)
 		flush_icache_range(addr, addr+len);
 
-	up_read(&current->mm->mmap_sem);
+	mm_read_unlock(current->mm);
 	return 0;
 }
diff --git arch/sh/kernel/vsyscall/vsyscall.c arch/sh/kernel/vsyscall/vsyscall.c
index 98494480f048..ad30993141a6 100644
--- arch/sh/kernel/vsyscall/vsyscall.c
+++ arch/sh/kernel/vsyscall/vsyscall.c
@@ -61,7 +61,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	unsigned long addr;
 	int ret;
 
-	if (down_write_killable(&mm->mmap_sem))
+	if (mm_write_lock_killable(mm))
 		return -EINTR;
 
 	addr = get_unmapped_area(NULL, 0, PAGE_SIZE, 0, 0);
@@ -80,7 +80,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	current->mm->context.vdso = (void *)addr;
 
 up_fail:
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 	return ret;
 }
 
diff --git arch/sh/mm/fault.c arch/sh/mm/fault.c
index 5f51456f4fc7..63f3aec63972 100644
--- arch/sh/mm/fault.c
+++ arch/sh/mm/fault.c
@@ -261,7 +261,7 @@ __bad_area(struct pt_regs *regs, unsigned long error_code,
 	 * Something tried to access memory that isn't in our memory map..
 	 * Fix it, but check if it's kernel or user first..
 	 */
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 	__bad_area_nosemaphore(regs, error_code, address, si_code);
 }
@@ -285,7 +285,7 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address)
 	struct task_struct *tsk = current;
 	struct mm_struct *mm = tsk->mm;
 
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 	/* Kernel mode? Handle exceptions or die: */
 	if (!user_mode(regs))
@@ -304,7 +304,7 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
 	 */
 	if (fatal_signal_pending(current)) {
 		if (!(fault & VM_FAULT_RETRY))
-			up_read(&current->mm->mmap_sem);
+			mm_read_unlock(current->mm);
 		if (!user_mode(regs))
 			no_context(regs, error_code, address);
 		return 1;
@@ -316,11 +316,11 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
 	if (fault & VM_FAULT_OOM) {
 		/* Kernel mode? Handle exceptions or die: */
 		if (!user_mode(regs)) {
-			up_read(&current->mm->mmap_sem);
+			mm_read_unlock(current->mm);
 			no_context(regs, error_code, address);
 			return 1;
 		}
-		up_read(&current->mm->mmap_sem);
+		mm_read_unlock(current->mm);
 
 		/*
 		 * We ran out of memory, call the OOM killer, and return the
@@ -424,7 +424,7 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
 	}
 
 retry:
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 
 	vma = find_vma(mm, address);
 	if (unlikely(!vma)) {
@@ -493,5 +493,5 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
 		}
 	}
 
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 }
diff --git arch/sparc/mm/fault_32.c arch/sparc/mm/fault_32.c
index 89976c9b936c..2435daad854e 100644
--- arch/sparc/mm/fault_32.c
+++ arch/sparc/mm/fault_32.c
@@ -196,7 +196,7 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write,
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
 
 retry:
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 
 	if (!from_user && address >= PAGE_OFFSET)
 		goto bad_area;
@@ -273,7 +273,7 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write,
 		}
 	}
 
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	return;
 
 	/*
@@ -281,7 +281,7 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write,
 	 * Fix it, but check if it's kernel or user first..
 	 */
 bad_area:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 bad_area_nosemaphore:
 	/* User mode accesses just cause a SIGSEGV */
@@ -330,7 +330,7 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write,
  * us unable to handle the page fault gracefully.
  */
 out_of_memory:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	if (from_user) {
 		pagefault_out_of_memory();
 		return;
@@ -338,7 +338,7 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write,
 	goto no_context;
 
 do_sigbus:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	do_fault_siginfo(BUS_ADRERR, SIGBUS, regs, text_fault);
 	if (!from_user)
 		goto no_context;
@@ -392,7 +392,7 @@ static void force_user_fault(unsigned long address, int write)
 
 	code = SEGV_MAPERR;
 
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 	vma = find_vma(mm, address);
 	if (!vma)
 		goto bad_area;
@@ -417,15 +417,15 @@ static void force_user_fault(unsigned long address, int write)
 	case VM_FAULT_OOM:
 		goto do_sigbus;
 	}
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	return;
 bad_area:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	__do_fault_siginfo(code, SIGSEGV, tsk->thread.kregs, address);
 	return;
 
 do_sigbus:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	__do_fault_siginfo(BUS_ADRERR, SIGBUS, tsk->thread.kregs, address);
 }
 
diff --git arch/sparc/mm/fault_64.c arch/sparc/mm/fault_64.c
index 2371fb6b97e4..918b42b6467f 100644
--- arch/sparc/mm/fault_64.c
+++ arch/sparc/mm/fault_64.c
@@ -315,7 +315,7 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
 
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
 
-	if (!down_read_trylock(&mm->mmap_sem)) {
+	if (!mm_read_trylock(mm)) {
 		if ((regs->tstate & TSTATE_PRIV) &&
 		    !search_exception_tables(regs->tpc)) {
 			insn = get_fault_insn(regs, insn);
@@ -323,7 +323,7 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
 		}
 
 retry:
-		down_read(&mm->mmap_sem);
+		mm_read_lock(mm);
 	}
 
 	if (fault_code & FAULT_CODE_BAD_RA)
@@ -456,7 +456,7 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
 			goto retry;
 		}
 	}
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 	mm_rss = get_mm_rss(mm);
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE)
@@ -487,7 +487,7 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
 	 */
 bad_area:
 	insn = get_fault_insn(regs, insn);
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 handle_kernel_fault:
 	do_kernel_fault(regs, si_code, fault_code, insn, address);
@@ -499,7 +499,7 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
  */
 out_of_memory:
 	insn = get_fault_insn(regs, insn);
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	if (!(regs->tstate & TSTATE_PRIV)) {
 		pagefault_out_of_memory();
 		goto exit_exception;
@@ -512,7 +512,7 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
 
 do_sigbus:
 	insn = get_fault_insn(regs, insn);
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 	/*
 	 * Send a sigbus, regardless of whether we were in kernel
diff --git arch/sparc/vdso/vma.c arch/sparc/vdso/vma.c
index 9961b0f81693..a2db050aba7a 100644
--- arch/sparc/vdso/vma.c
+++ arch/sparc/vdso/vma.c
@@ -366,7 +366,7 @@ static int map_vdso(const struct vdso_image *image,
 	unsigned long text_start, addr = 0;
 	int ret = 0;
 
-	down_write(&mm->mmap_sem);
+	mm_write_lock(mm);
 
 	/*
 	 * First, get an unmapped region: then randomize it, and make sure that
@@ -422,7 +422,7 @@ static int map_vdso(const struct vdso_image *image,
 	if (ret)
 		current->mm->context.vdso = NULL;
 
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 	return ret;
 }
 
diff --git arch/um/include/asm/mmu_context.h arch/um/include/asm/mmu_context.h
index 5aee0626e390..7bd591231e2d 100644
--- arch/um/include/asm/mmu_context.h
+++ arch/um/include/asm/mmu_context.h
@@ -54,7 +54,7 @@ static inline void activate_mm(struct mm_struct *old, struct mm_struct *new)
 	__switch_mm(&new->context.id);
 	down_write_nested(&new->mmap_sem, 1);
 	uml_setup_stubs(new);
-	up_write(&new->mmap_sem);
+	mm_write_unlock(new);
 }
 
 static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next, 
diff --git arch/um/kernel/tlb.c arch/um/kernel/tlb.c
index 80a358c6d652..1c9b198a6278 100644
--- arch/um/kernel/tlb.c
+++ arch/um/kernel/tlb.c
@@ -350,7 +350,7 @@ void fix_range_common(struct mm_struct *mm, unsigned long start_addr,
 		printk(KERN_ERR "fix_range_common: failed, killing current "
 		       "process: %d\n", task_tgid_vnr(current));
 		/* We are under mmap_sem, release it such that current can terminate */
-		up_write(&current->mm->mmap_sem);
+		mm_write_unlock(current->mm);
 		force_sig(SIGKILL);
 		do_signal(&current->thread.regs);
 	}
diff --git arch/um/kernel/trap.c arch/um/kernel/trap.c
index 818553064f04..8c0f0882ca8f 100644
--- arch/um/kernel/trap.c
+++ arch/um/kernel/trap.c
@@ -47,7 +47,7 @@ int handle_page_fault(unsigned long address, unsigned long ip,
 	if (is_user)
 		flags |= FAULT_FLAG_USER;
 retry:
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 	vma = find_vma(mm, address);
 	if (!vma)
 		goto out;
@@ -124,7 +124,7 @@ int handle_page_fault(unsigned long address, unsigned long ip,
 #endif
 	flush_tlb_page(vma, address);
 out:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 out_nosemaphore:
 	return err;
 
@@ -133,7 +133,7 @@ int handle_page_fault(unsigned long address, unsigned long ip,
 	 * We ran out of memory, call the OOM killer, and return the userspace
 	 * (which will retry the fault, or kill us if we got oom-killed).
 	 */
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	if (!is_user)
 		goto out_nosemaphore;
 	pagefault_out_of_memory();
diff --git arch/unicore32/mm/fault.c arch/unicore32/mm/fault.c
index 76342de9cf8c..2f20cdac675f 100644
--- arch/unicore32/mm/fault.c
+++ arch/unicore32/mm/fault.c
@@ -224,12 +224,12 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 	 * validly references user space from well defined areas of the code,
 	 * we can bug out early if this is from code which shouldn't.
 	 */
-	if (!down_read_trylock(&mm->mmap_sem)) {
+	if (!mm_read_trylock(mm)) {
 		if (!user_mode(regs)
 		    && !search_exception_tables(regs->UCreg_pc))
 			goto no_context;
 retry:
-		down_read(&mm->mmap_sem);
+		mm_read_lock(mm);
 	} else {
 		/*
 		 * The above down_read_trylock() might have succeeded in
@@ -266,7 +266,7 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
 		}
 	}
 
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 	/*
 	 * Handle the "normal" case first - VM_FAULT_MAJOR
diff --git arch/x86/entry/vdso/vma.c arch/x86/entry/vdso/vma.c
index f5937742b290..87278df00a46 100644
--- arch/x86/entry/vdso/vma.c
+++ arch/x86/entry/vdso/vma.c
@@ -150,7 +150,7 @@ static int map_vdso(const struct vdso_image *image, unsigned long addr)
 	unsigned long text_start;
 	int ret = 0;
 
-	if (down_write_killable(&mm->mmap_sem))
+	if (mm_write_lock_killable(mm))
 		return -EINTR;
 
 	addr = get_unmapped_area(NULL, addr,
@@ -193,7 +193,7 @@ static int map_vdso(const struct vdso_image *image, unsigned long addr)
 	}
 
 up_fail:
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 	return ret;
 }
 
@@ -255,7 +255,7 @@ int map_vdso_once(const struct vdso_image *image, unsigned long addr)
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
 
-	down_write(&mm->mmap_sem);
+	mm_write_lock(mm);
 	/*
 	 * Check if we have already mapped vdso blob - fail to prevent
 	 * abusing from userspace install_speciall_mapping, which may
@@ -266,11 +266,11 @@ int map_vdso_once(const struct vdso_image *image, unsigned long addr)
 	for (vma = mm->mmap; vma; vma = vma->vm_next) {
 		if (vma_is_special_mapping(vma, &vdso_mapping) ||
 				vma_is_special_mapping(vma, &vvar_mapping)) {
-			up_write(&mm->mmap_sem);
+			mm_write_unlock(mm);
 			return -EEXIST;
 		}
 	}
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 
 	return map_vdso(image, addr);
 }
diff --git arch/x86/kernel/vm86_32.c arch/x86/kernel/vm86_32.c
index a76c12b38e92..6a05cf416e78 100644
--- arch/x86/kernel/vm86_32.c
+++ arch/x86/kernel/vm86_32.c
@@ -172,7 +172,7 @@ static void mark_screen_rdonly(struct mm_struct *mm)
 	pte_t *pte;
 	int i;
 
-	down_write(&mm->mmap_sem);
+	mm_write_lock(mm);
 	pgd = pgd_offset(mm, 0xA0000);
 	if (pgd_none_or_clear_bad(pgd))
 		goto out;
@@ -198,7 +198,7 @@ static void mark_screen_rdonly(struct mm_struct *mm)
 	}
 	pte_unmap_unlock(pte, ptl);
 out:
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 	flush_tlb_mm_range(mm, 0xA0000, 0xA0000 + 32*PAGE_SIZE, PAGE_SHIFT, false);
 }
 
diff --git arch/x86/mm/debug_pagetables.c arch/x86/mm/debug_pagetables.c
index 39001a401eff..6e3f19779385 100644
--- arch/x86/mm/debug_pagetables.c
+++ arch/x86/mm/debug_pagetables.c
@@ -16,9 +16,9 @@ DEFINE_SHOW_ATTRIBUTE(ptdump);
 static int ptdump_curknl_show(struct seq_file *m, void *v)
 {
 	if (current->mm->pgd) {
-		down_read(&current->mm->mmap_sem);
+		mm_read_lock(current->mm);
 		ptdump_walk_pgd_level_debugfs(m, current->mm->pgd, false);
-		up_read(&current->mm->mmap_sem);
+		mm_read_unlock(current->mm);
 	}
 	return 0;
 }
@@ -29,9 +29,9 @@ DEFINE_SHOW_ATTRIBUTE(ptdump_curknl);
 static int ptdump_curusr_show(struct seq_file *m, void *v)
 {
 	if (current->mm->pgd) {
-		down_read(&current->mm->mmap_sem);
+		mm_read_lock(current->mm);
 		ptdump_walk_pgd_level_debugfs(m, current->mm->pgd, true);
-		up_read(&current->mm->mmap_sem);
+		mm_read_unlock(current->mm);
 	}
 	return 0;
 }
diff --git arch/x86/mm/fault.c arch/x86/mm/fault.c
index 304d31d8cbbc..a8ce9e160b72 100644
--- arch/x86/mm/fault.c
+++ arch/x86/mm/fault.c
@@ -928,7 +928,7 @@ __bad_area(struct pt_regs *regs, unsigned long error_code,
 	 * Something tried to access memory that isn't in our memory map..
 	 * Fix it, but check if it's kernel or user first..
 	 */
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 	__bad_area_nosemaphore(regs, error_code, address, pkey, si_code);
 }
@@ -1379,7 +1379,7 @@ void do_user_addr_fault(struct pt_regs *regs,
 	 * 1. Failed to acquire mmap_sem, and
 	 * 2. The access did not originate in userspace.
 	 */
-	if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
+	if (unlikely(!mm_read_trylock(mm))) {
 		if (!user_mode(regs) && !search_exception_tables(regs->ip)) {
 			/*
 			 * Fault from code in kernel from
@@ -1389,7 +1389,7 @@ void do_user_addr_fault(struct pt_regs *regs,
 			return;
 		}
 retry:
-		down_read(&mm->mmap_sem);
+		mm_read_lock(mm);
 	} else {
 		/*
 		 * The above down_read_trylock() might have succeeded in
@@ -1464,7 +1464,7 @@ void do_user_addr_fault(struct pt_regs *regs,
 		return;
 	}
 
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		mm_fault_error(regs, hw_error_code, address, fault);
 		return;
diff --git arch/x86/mm/mpx.c arch/x86/mm/mpx.c
index 895fb7a9294d..3835c18020b8 100644
--- arch/x86/mm/mpx.c
+++ arch/x86/mm/mpx.c
@@ -52,10 +52,10 @@ static unsigned long mpx_mmap(unsigned long len)
 	if (len != mpx_bt_size_bytes(mm))
 		return -EINVAL;
 
-	down_write(&mm->mmap_sem);
+	mm_write_lock(mm);
 	addr = do_mmap(NULL, 0, len, PROT_READ | PROT_WRITE,
 		       MAP_ANONYMOUS | MAP_PRIVATE, VM_MPX, 0, &populate, NULL);
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 	if (populate)
 		mm_populate(addr, populate);
 
@@ -227,7 +227,7 @@ int mpx_enable_management(void)
 	 * unmap path; we can just use mm->context.bd_addr instead.
 	 */
 	bd_base = mpx_get_bounds_dir();
-	down_write(&mm->mmap_sem);
+	mm_write_lock(mm);
 
 	/* MPX doesn't support addresses above 47 bits yet. */
 	if (find_vma(mm, DEFAULT_MAP_WINDOW)) {
@@ -241,7 +241,7 @@ int mpx_enable_management(void)
 	if (mm->context.bd_addr == MPX_INVALID_BOUNDS_DIR)
 		ret = -ENXIO;
 out:
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 	return ret;
 }
 
@@ -252,9 +252,9 @@ int mpx_disable_management(void)
 	if (!cpu_feature_enabled(X86_FEATURE_MPX))
 		return -ENXIO;
 
-	down_write(&mm->mmap_sem);
+	mm_write_lock(mm);
 	mm->context.bd_addr = MPX_INVALID_BOUNDS_DIR;
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 	return 0;
 }
 
diff --git arch/x86/um/vdso/vma.c arch/x86/um/vdso/vma.c
index 9e7c4aba6c3a..16f50eca50e3 100644
--- arch/x86/um/vdso/vma.c
+++ arch/x86/um/vdso/vma.c
@@ -58,7 +58,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	if (!vdso_enabled)
 		return 0;
 
-	if (down_write_killable(&mm->mmap_sem))
+	if (mm_write_lock_killable(mm))
 		return -EINTR;
 
 	err = install_special_mapping(mm, um_vdso_addr, PAGE_SIZE,
@@ -66,7 +66,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 		VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
 		vdsop);
 
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 
 	return err;
 }
diff --git arch/xtensa/mm/fault.c arch/xtensa/mm/fault.c
index bee30a77cd70..b6f1a86eea7f 100644
--- arch/xtensa/mm/fault.c
+++ arch/xtensa/mm/fault.c
@@ -74,7 +74,7 @@ void do_page_fault(struct pt_regs *regs)
 	if (user_mode(regs))
 		flags |= FAULT_FLAG_USER;
 retry:
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 	vma = find_vma(mm, address);
 
 	if (!vma)
@@ -140,7 +140,7 @@ void do_page_fault(struct pt_regs *regs)
 		}
 	}
 
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
 	if (flags & VM_FAULT_MAJOR)
 		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, regs, address);
@@ -153,7 +153,7 @@ void do_page_fault(struct pt_regs *regs)
 	 * Fix it, but check if it's kernel or user first..
 	 */
 bad_area:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	if (user_mode(regs)) {
 		current->thread.bad_vaddr = address;
 		current->thread.error_code = is_write;
@@ -168,7 +168,7 @@ void do_page_fault(struct pt_regs *regs)
 	 * us unable to handle the page fault gracefully.
 	 */
 out_of_memory:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	if (!user_mode(regs))
 		bad_page_fault(regs, address, SIGKILL);
 	else
@@ -176,7 +176,7 @@ void do_page_fault(struct pt_regs *regs)
 	return;
 
 do_sigbus:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 	/* Send a sigbus, regardless of whether we were in kernel
 	 * or user mode.
diff --git drivers/android/binder_alloc.c drivers/android/binder_alloc.c
index 2d8b9b91dee0..caddf155fcab 100644
--- drivers/android/binder_alloc.c
+++ drivers/android/binder_alloc.c
@@ -212,7 +212,7 @@ static int binder_update_page_range(struct binder_alloc *alloc, int allocate,
 		mm = alloc->vma_vm_mm;
 
 	if (mm) {
-		down_read(&mm->mmap_sem);
+		mm_read_lock(mm);
 		vma = alloc->vma;
 	}
 
@@ -270,7 +270,7 @@ static int binder_update_page_range(struct binder_alloc *alloc, int allocate,
 		trace_binder_alloc_page_end(alloc, index);
 	}
 	if (mm) {
-		up_read(&mm->mmap_sem);
+		mm_read_unlock(mm);
 		mmput(mm);
 	}
 	return 0;
@@ -303,7 +303,7 @@ static int binder_update_page_range(struct binder_alloc *alloc, int allocate,
 	}
 err_no_vma:
 	if (mm) {
-		up_read(&mm->mmap_sem);
+		mm_read_unlock(mm);
 		mmput(mm);
 	}
 	return vma ? -ENOMEM : -ESRCH;
diff --git drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 888209eb8cec..4ad4a09cf588 100644
--- drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -1343,9 +1343,9 @@ int amdgpu_amdkfd_gpuvm_map_memory_to_gpu(
 	 * concurrently and the queues are actually stopped
 	 */
 	if (amdgpu_ttm_tt_get_usermm(bo->tbo.ttm)) {
-		down_write(&current->mm->mmap_sem);
+		mm_write_lock(current->mm);
 		is_invalid_userptr = atomic_read(&mem->invalid);
-		up_write(&current->mm->mmap_sem);
+		mm_write_unlock(current->mm);
 	}
 
 	mutex_lock(&mem->lock);
diff --git drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index 2616e2eafdeb..d6d57c247ac6 100644
--- drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -838,7 +838,7 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
 		goto out_free_ranges;
 	}
 
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 	vma = find_vma(mm, start);
 	if (unlikely(!vma || start < vma->vm_start)) {
 		r = -EFAULT;
@@ -849,15 +849,15 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
 		r = -EPERM;
 		goto out_unlock;
 	}
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	timeout = jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
 
 retry:
 	range->notifier_seq = mmu_interval_read_begin(&bo->notifier);
 
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 	r = hmm_range_fault(range, 0);
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	if (unlikely(r <= 0)) {
 		/*
 		 * FIXME: This timeout should encompass the retry from
@@ -886,7 +886,7 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
 	return 0;
 
 out_unlock:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 out_free_pfns:
 	kvfree(range->pfns);
 out_free_ranges:
diff --git drivers/gpu/drm/amd/amdkfd/kfd_events.c drivers/gpu/drm/amd/amdkfd/kfd_events.c
index 908081c85de1..96e299d0f2a7 100644
--- drivers/gpu/drm/amd/amdkfd/kfd_events.c
+++ drivers/gpu/drm/amd/amdkfd/kfd_events.c
@@ -902,7 +902,7 @@ void kfd_signal_iommu_event(struct kfd_dev *dev, unsigned int pasid,
 
 	memset(&memory_exception_data, 0, sizeof(memory_exception_data));
 
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 	vma = find_vma(mm, address);
 
 	memory_exception_data.gpu_id = dev->id;
@@ -925,7 +925,7 @@ void kfd_signal_iommu_event(struct kfd_dev *dev, unsigned int pasid,
 			memory_exception_data.failure.NoExecute = 0;
 	}
 
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	mmput(mm);
 
 	pr_debug("notpresent %d, noexecute %d, readonly %d\n",
diff --git drivers/gpu/drm/i915/gem/i915_gem_mman.c drivers/gpu/drm/i915/gem/i915_gem_mman.c
index e3002849844b..f1e3ae6bf3c3 100644
--- drivers/gpu/drm/i915/gem/i915_gem_mman.c
+++ drivers/gpu/drm/i915/gem/i915_gem_mman.c
@@ -89,7 +89,7 @@ i915_gem_mmap_ioctl(struct drm_device *dev, void *data,
 		struct mm_struct *mm = current->mm;
 		struct vm_area_struct *vma;
 
-		if (down_write_killable(&mm->mmap_sem)) {
+		if (mm_write_lock_killable(mm)) {
 			addr = -EINTR;
 			goto err;
 		}
@@ -99,7 +99,7 @@ i915_gem_mmap_ioctl(struct drm_device *dev, void *data,
 				pgprot_writecombine(vm_get_page_prot(vma->vm_flags));
 		else
 			addr = -ENOMEM;
-		up_write(&mm->mmap_sem);
+		mm_write_unlock(mm);
 		if (IS_ERR_VALUE(addr))
 			goto err;
 	}
diff --git drivers/gpu/drm/i915/gem/i915_gem_userptr.c drivers/gpu/drm/i915/gem/i915_gem_userptr.c
index 0dbb44d30885..7646d4e77c63 100644
--- drivers/gpu/drm/i915/gem/i915_gem_userptr.c
+++ drivers/gpu/drm/i915/gem/i915_gem_userptr.c
@@ -201,7 +201,7 @@ i915_mmu_notifier_find(struct i915_mm_struct *mm)
 	if (IS_ERR(mn))
 		err = PTR_ERR(mn);
 
-	down_write(&mm->mm->mmap_sem);
+	mm_write_lock(mm->mm);
 	mutex_lock(&mm->i915->mm_lock);
 	if (mm->mn == NULL && !err) {
 		/* Protected by mmap_sem (write-lock) */
@@ -218,7 +218,7 @@ i915_mmu_notifier_find(struct i915_mm_struct *mm)
 		err = 0;
 	}
 	mutex_unlock(&mm->i915->mm_lock);
-	up_write(&mm->mm->mmap_sem);
+	mm_write_unlock(mm->mm);
 
 	if (mn && !IS_ERR(mn))
 		kfree(mn);
@@ -466,7 +466,7 @@ __i915_gem_userptr_get_pages_worker(struct work_struct *_work)
 
 		ret = -EFAULT;
 		if (mmget_not_zero(mm)) {
-			down_read(&mm->mmap_sem);
+			mm_read_lock(mm);
 			while (pinned < npages) {
 				ret = get_user_pages_remote
 					(work->task, mm,
@@ -479,7 +479,7 @@ __i915_gem_userptr_get_pages_worker(struct work_struct *_work)
 
 				pinned += ret;
 			}
-			up_read(&mm->mmap_sem);
+			mm_read_unlock(mm);
 			mmput(mm);
 		}
 	}
diff --git drivers/gpu/drm/nouveau/nouveau_svm.c drivers/gpu/drm/nouveau/nouveau_svm.c
index df9bf1fd1bc0..2a56b3623e81 100644
--- drivers/gpu/drm/nouveau/nouveau_svm.c
+++ drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -169,7 +169,7 @@ nouveau_svmm_bind(struct drm_device *dev, void *data,
 	 */
 
 	mm = get_task_mm(current);
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 
 	for (addr = args->va_start, end = args->va_start + size; addr < end;) {
 		struct vm_area_struct *vma;
@@ -192,7 +192,7 @@ nouveau_svmm_bind(struct drm_device *dev, void *data,
 	 */
 	args->result = 0;
 
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	mmput(mm);
 
 	return 0;
@@ -342,7 +342,7 @@ nouveau_svmm_init(struct drm_device *dev, void *data,
 	if (ret)
 		goto out_free;
 
-	down_write(&current->mm->mmap_sem);
+	mm_write_lock(current->mm);
 	svmm->notifier.ops = &nouveau_mn_ops;
 	ret = __mmu_notifier_register(&svmm->notifier, current->mm);
 	if (ret)
@@ -351,12 +351,12 @@ nouveau_svmm_init(struct drm_device *dev, void *data,
 
 	cli->svm.svmm = svmm;
 	cli->svm.cli = cli;
-	up_write(&current->mm->mmap_sem);
+	mm_write_unlock(current->mm);
 	mutex_unlock(&cli->mutex);
 	return 0;
 
 out_mm_unlock:
-	up_write(&current->mm->mmap_sem);
+	mm_write_unlock(current->mm);
 out_free:
 	mutex_unlock(&cli->mutex);
 	kfree(svmm);
@@ -540,9 +540,9 @@ static int nouveau_range_fault(struct nouveau_svmm *svmm,
 		range.notifier_seq = mmu_interval_read_begin(range.notifier);
 		range.default_flags = 0;
 		range.pfn_flags_mask = -1UL;
-		down_read(&mm->mmap_sem);
+		mm_read_lock(mm);
 		ret = hmm_range_fault(&range, 0);
-		up_read(&mm->mmap_sem);
+		mm_read_unlock(mm);
 		if (ret <= 0) {
 			if (ret == 0 || ret == -EBUSY)
 				continue;
@@ -671,18 +671,18 @@ nouveau_svm_fault(struct nvif_notify *notify)
 		/* Intersect fault window with the CPU VMA, cancelling
 		 * the fault if the address is invalid.
 		 */
-		down_read(&mm->mmap_sem);
+		mm_read_lock(mm);
 		vma = find_vma_intersection(mm, start, limit);
 		if (!vma) {
 			SVMM_ERR(svmm, "wndw %016llx-%016llx", start, limit);
-			up_read(&mm->mmap_sem);
+			mm_read_unlock(mm);
 			mmput(mm);
 			nouveau_svm_fault_cancel_fault(svm, buffer->fault[fi]);
 			continue;
 		}
 		start = max_t(u64, start, vma->vm_start);
 		limit = min_t(u64, limit, vma->vm_end);
-		up_read(&mm->mmap_sem);
+		mm_read_unlock(mm);
 		SVMM_DBG(svmm, "wndw %016llx-%016llx", start, limit);
 
 		if (buffer->fault[fi]->addr != start) {
diff --git drivers/gpu/drm/radeon/radeon_cs.c drivers/gpu/drm/radeon/radeon_cs.c
index 7b5460678382..2486adcf7d91 100644
--- drivers/gpu/drm/radeon/radeon_cs.c
+++ drivers/gpu/drm/radeon/radeon_cs.c
@@ -196,12 +196,12 @@ static int radeon_cs_parser_relocs(struct radeon_cs_parser *p)
 		p->vm_bos = radeon_vm_get_bos(p->rdev, p->ib.vm,
 					      &p->validated);
 	if (need_mmap_lock)
-		down_read(&current->mm->mmap_sem);
+		mm_read_lock(current->mm);
 
 	r = radeon_bo_list_validate(p->rdev, &p->ticket, &p->validated, p->ring);
 
 	if (need_mmap_lock)
-		up_read(&current->mm->mmap_sem);
+		mm_read_unlock(current->mm);
 
 	return r;
 }
diff --git drivers/gpu/drm/radeon/radeon_gem.c drivers/gpu/drm/radeon/radeon_gem.c
index 67298a0739cb..ffcc2dcb41b6 100644
--- drivers/gpu/drm/radeon/radeon_gem.c
+++ drivers/gpu/drm/radeon/radeon_gem.c
@@ -341,17 +341,17 @@ int radeon_gem_userptr_ioctl(struct drm_device *dev, void *data,
 	}
 
 	if (args->flags & RADEON_GEM_USERPTR_VALIDATE) {
-		down_read(&current->mm->mmap_sem);
+		mm_read_lock(current->mm);
 		r = radeon_bo_reserve(bo, true);
 		if (r) {
-			up_read(&current->mm->mmap_sem);
+			mm_read_unlock(current->mm);
 			goto release_object;
 		}
 
 		radeon_ttm_placement_from_domain(bo, RADEON_GEM_DOMAIN_GTT);
 		r = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);
 		radeon_bo_unreserve(bo);
-		up_read(&current->mm->mmap_sem);
+		mm_read_unlock(current->mm);
 		if (r)
 			goto release_object;
 	}
diff --git drivers/gpu/drm/ttm/ttm_bo_vm.c drivers/gpu/drm/ttm/ttm_bo_vm.c
index 11863fbdd5d6..652f125919d2 100644
--- drivers/gpu/drm/ttm/ttm_bo_vm.c
+++ drivers/gpu/drm/ttm/ttm_bo_vm.c
@@ -67,7 +67,7 @@ static vm_fault_t ttm_bo_vm_fault_idle(struct ttm_buffer_object *bo,
 			goto out_unlock;
 
 		ttm_bo_get(bo);
-		up_read(&vmf->vma->vm_mm->mmap_sem);
+		mm_read_unlock(vmf->vma->vm_mm);
 		(void) dma_fence_wait(bo->moving, true);
 		dma_resv_unlock(bo->base.resv);
 		ttm_bo_put(bo);
@@ -138,7 +138,7 @@ vm_fault_t ttm_bo_vm_reserve(struct ttm_buffer_object *bo,
 		if (vmf->flags & FAULT_FLAG_ALLOW_RETRY) {
 			if (!(vmf->flags & FAULT_FLAG_RETRY_NOWAIT)) {
 				ttm_bo_get(bo);
-				up_read(&vmf->vma->vm_mm->mmap_sem);
+				mm_read_unlock(vmf->vma->vm_mm);
 				(void) ttm_bo_wait_unreserved(bo);
 				ttm_bo_put(bo);
 			}
diff --git drivers/infiniband/core/umem.c drivers/infiniband/core/umem.c
index 7a3b99597ead..d16230ff81db 100644
--- drivers/infiniband/core/umem.c
+++ drivers/infiniband/core/umem.c
@@ -266,14 +266,14 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, unsigned long addr,
 	sg = umem->sg_head.sgl;
 
 	while (npages) {
-		down_read(&mm->mmap_sem);
+		mm_read_lock(mm);
 		ret = get_user_pages(cur_base,
 				     min_t(unsigned long, npages,
 					   PAGE_SIZE / sizeof (struct page *)),
 				     gup_flags | FOLL_LONGTERM,
 				     page_list, NULL);
 		if (ret < 0) {
-			up_read(&mm->mmap_sem);
+			mm_read_unlock(mm);
 			goto umem_release;
 		}
 
@@ -284,7 +284,7 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, unsigned long addr,
 			dma_get_max_seg_size(context->device->dma_device),
 			&umem->sg_nents);
 
-		up_read(&mm->mmap_sem);
+		mm_read_unlock(mm);
 	}
 
 	sg_mark_end(sg);
diff --git drivers/infiniband/core/umem_odp.c drivers/infiniband/core/umem_odp.c
index e42d44e501fd..d1118f56539f 100644
--- drivers/infiniband/core/umem_odp.c
+++ drivers/infiniband/core/umem_odp.c
@@ -246,16 +246,16 @@ struct ib_umem_odp *ib_umem_odp_get(struct ib_udata *udata, unsigned long addr,
 		struct vm_area_struct *vma;
 		struct hstate *h;
 
-		down_read(&mm->mmap_sem);
+		mm_read_lock(mm);
 		vma = find_vma(mm, ib_umem_start(umem_odp));
 		if (!vma || !is_vm_hugetlb_page(vma)) {
-			up_read(&mm->mmap_sem);
+			mm_read_unlock(mm);
 			ret = -EINVAL;
 			goto err_free;
 		}
 		h = hstate_vma(vma);
 		umem_odp->page_shift = huge_page_shift(h);
-		up_read(&mm->mmap_sem);
+		mm_read_unlock(mm);
 	}
 
 	umem_odp->tgid = get_task_pid(current->group_leader, PIDTYPE_PID);
@@ -443,7 +443,7 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp, u64 user_virt,
 				(bcnt + BIT(page_shift) - 1) >> page_shift,
 				PAGE_SIZE / sizeof(struct page *));
 
-		down_read(&owning_mm->mmap_sem);
+		mm_read_lock(owning_mm);
 		/*
 		 * Note: this might result in redundent page getting. We can
 		 * avoid this by checking dma_list to be 0 before calling
@@ -454,7 +454,7 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp, u64 user_virt,
 		npages = get_user_pages_remote(owning_process, owning_mm,
 				user_virt, gup_num_pages,
 				flags, local_page_list, NULL, NULL);
-		up_read(&owning_mm->mmap_sem);
+		mm_read_unlock(owning_mm);
 
 		if (npages < 0) {
 			if (npages != -EAGAIN)
diff --git drivers/infiniband/core/uverbs_main.c drivers/infiniband/core/uverbs_main.c
index 970d8e31dd65..cc5e25314930 100644
--- drivers/infiniband/core/uverbs_main.c
+++ drivers/infiniband/core/uverbs_main.c
@@ -939,7 +939,7 @@ void uverbs_user_mmap_disassociate(struct ib_uverbs_file *ufile)
 		 * at a time to get the lock ordering right. Typically there
 		 * will only be one mm, so no big deal.
 		 */
-		down_read(&mm->mmap_sem);
+		mm_read_lock(mm);
 		if (!mmget_still_valid(mm))
 			goto skip_mm;
 		mutex_lock(&ufile->umap_lock);
@@ -961,7 +961,7 @@ void uverbs_user_mmap_disassociate(struct ib_uverbs_file *ufile)
 		}
 		mutex_unlock(&ufile->umap_lock);
 	skip_mm:
-		up_read(&mm->mmap_sem);
+		mm_read_unlock(mm);
 		mmput(mm);
 	}
 }
diff --git drivers/infiniband/hw/mlx4/mr.c drivers/infiniband/hw/mlx4/mr.c
index dfa17bcdcdbc..3a0a872ab868 100644
--- drivers/infiniband/hw/mlx4/mr.c
+++ drivers/infiniband/hw/mlx4/mr.c
@@ -380,7 +380,7 @@ static struct ib_umem *mlx4_get_umem_mr(struct ib_udata *udata, u64 start,
 		unsigned long untagged_start = untagged_addr(start);
 		struct vm_area_struct *vma;
 
-		down_read(&current->mm->mmap_sem);
+		mm_read_lock(current->mm);
 		/*
 		 * FIXME: Ideally this would iterate over all the vmas that
 		 * cover the memory, but for now it requires a single vma to
@@ -395,7 +395,7 @@ static struct ib_umem *mlx4_get_umem_mr(struct ib_udata *udata, u64 start,
 			access_flags |= IB_ACCESS_LOCAL_WRITE;
 		}
 
-		up_read(&current->mm->mmap_sem);
+		mm_read_unlock(current->mm);
 	}
 
 	return ib_umem_get(udata, start, length, access_flags);
diff --git drivers/infiniband/hw/qib/qib_user_pages.c drivers/infiniband/hw/qib/qib_user_pages.c
index 6bf764e41891..754f4fa3fc7a 100644
--- drivers/infiniband/hw/qib/qib_user_pages.c
+++ drivers/infiniband/hw/qib/qib_user_pages.c
@@ -106,18 +106,18 @@ int qib_get_user_pages(unsigned long start_page, size_t num_pages,
 		goto bail;
 	}
 
-	down_read(&current->mm->mmap_sem);
+	mm_read_lock(current->mm);
 	for (got = 0; got < num_pages; got += ret) {
 		ret = get_user_pages(start_page + got * PAGE_SIZE,
 				     num_pages - got,
 				     FOLL_LONGTERM | FOLL_WRITE | FOLL_FORCE,
 				     p + got, NULL);
 		if (ret < 0) {
-			up_read(&current->mm->mmap_sem);
+			mm_read_unlock(current->mm);
 			goto bail_release;
 		}
 	}
-	up_read(&current->mm->mmap_sem);
+	mm_read_unlock(current->mm);
 
 	return 0;
 bail_release:
diff --git drivers/infiniband/hw/usnic/usnic_uiom.c drivers/infiniband/hw/usnic/usnic_uiom.c
index 62e6ffa9ad78..a739e74efa73 100644
--- drivers/infiniband/hw/usnic/usnic_uiom.c
+++ drivers/infiniband/hw/usnic/usnic_uiom.c
@@ -123,7 +123,7 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
 	npages = PAGE_ALIGN(size + (addr & ~PAGE_MASK)) >> PAGE_SHIFT;
 
 	uiomr->owning_mm = mm = current->mm;
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 
 	locked = atomic64_add_return(npages, &current->mm->pinned_vm);
 	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
@@ -187,7 +187,7 @@ static int usnic_uiom_get_pages(unsigned long addr, size_t size, int writable,
 	} else
 		mmgrab(uiomr->owning_mm);
 
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	free_page((unsigned long) page_list);
 	return ret;
 }
diff --git drivers/infiniband/sw/siw/siw_mem.c drivers/infiniband/sw/siw/siw_mem.c
index e99983f07663..3b4ddd6758c5 100644
--- drivers/infiniband/sw/siw/siw_mem.c
+++ drivers/infiniband/sw/siw/siw_mem.c
@@ -397,7 +397,7 @@ struct siw_umem *siw_umem_get(u64 start, u64 len, bool writable)
 	if (!writable)
 		foll_flags |= FOLL_FORCE;
 
-	down_read(&mm_s->mmap_sem);
+	mm_read_lock(mm_s);
 
 	mlock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
 
@@ -441,7 +441,7 @@ struct siw_umem *siw_umem_get(u64 start, u64 len, bool writable)
 		num_pages -= got;
 	}
 out_sem_up:
-	up_read(&mm_s->mmap_sem);
+	mm_read_unlock(mm_s);
 
 	if (rv > 0)
 		return umem;
diff --git drivers/iommu/amd_iommu_v2.c drivers/iommu/amd_iommu_v2.c
index d6d85debd01b..3cd2e96c83a9 100644
--- drivers/iommu/amd_iommu_v2.c
+++ drivers/iommu/amd_iommu_v2.c
@@ -487,7 +487,7 @@ static void do_fault(struct work_struct *work)
 		flags |= FAULT_FLAG_WRITE;
 	flags |= FAULT_FLAG_REMOTE;
 
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 	vma = find_extend_vma(mm, address);
 	if (!vma || address < vma->vm_start)
 		/* failed to get a vma in the right range */
@@ -499,7 +499,7 @@ static void do_fault(struct work_struct *work)
 
 	ret = handle_mm_fault(vma, address, flags);
 out:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 	if (ret & VM_FAULT_ERROR)
 		/* failed to service fault */
diff --git drivers/iommu/intel-svm.c drivers/iommu/intel-svm.c
index dca88f9fdf29..ad57bb178fee 100644
--- drivers/iommu/intel-svm.c
+++ drivers/iommu/intel-svm.c
@@ -591,7 +591,7 @@ static irqreturn_t prq_event_thread(int irq, void *d)
 		if (!is_canonical_address(address))
 			goto bad_req;
 
-		down_read(&svm->mm->mmap_sem);
+		mm_read_lock(svm->mm);
 		vma = find_extend_vma(svm->mm, address);
 		if (!vma || address < vma->vm_start)
 			goto invalid;
@@ -606,7 +606,7 @@ static irqreturn_t prq_event_thread(int irq, void *d)
 
 		result = QI_RESP_SUCCESS;
 	invalid:
-		up_read(&svm->mm->mmap_sem);
+		mm_read_unlock(svm->mm);
 		mmput(svm->mm);
 	bad_req:
 		/* Accounting for major/minor faults? */
diff --git drivers/media/v4l2-core/videobuf-core.c drivers/media/v4l2-core/videobuf-core.c
index 939fc11cf080..8e15ef51b1c3 100644
--- drivers/media/v4l2-core/videobuf-core.c
+++ drivers/media/v4l2-core/videobuf-core.c
@@ -534,7 +534,7 @@ int videobuf_qbuf(struct videobuf_queue *q, struct v4l2_buffer *b)
 	MAGIC_CHECK(q->int_ops->magic, MAGIC_QTYPE_OPS);
 
 	if (b->memory == V4L2_MEMORY_MMAP)
-		down_read(&current->mm->mmap_sem);
+		mm_read_lock(current->mm);
 
 	videobuf_queue_lock(q);
 	retval = -EBUSY;
@@ -621,7 +621,7 @@ int videobuf_qbuf(struct videobuf_queue *q, struct v4l2_buffer *b)
 	videobuf_queue_unlock(q);
 
 	if (b->memory == V4L2_MEMORY_MMAP)
-		up_read(&current->mm->mmap_sem);
+		mm_read_unlock(current->mm);
 
 	return retval;
 }
diff --git drivers/media/v4l2-core/videobuf-dma-contig.c drivers/media/v4l2-core/videobuf-dma-contig.c
index aeb2f497c683..1421a0b4d909 100644
--- drivers/media/v4l2-core/videobuf-dma-contig.c
+++ drivers/media/v4l2-core/videobuf-dma-contig.c
@@ -169,7 +169,7 @@ static int videobuf_dma_contig_user_get(struct videobuf_dma_contig_memory *mem,
 	mem->size = PAGE_ALIGN(vb->size + offset);
 	ret = -EINVAL;
 
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 
 	vma = find_vma(mm, untagged_baddr);
 	if (!vma)
@@ -201,7 +201,7 @@ static int videobuf_dma_contig_user_get(struct videobuf_dma_contig_memory *mem,
 	}
 
 out_up:
-	up_read(&current->mm->mmap_sem);
+	mm_read_unlock(current->mm);
 
 	return ret;
 }
diff --git drivers/media/v4l2-core/videobuf-dma-sg.c drivers/media/v4l2-core/videobuf-dma-sg.c
index 66a6c6c236a7..57422766ba6f 100644
--- drivers/media/v4l2-core/videobuf-dma-sg.c
+++ drivers/media/v4l2-core/videobuf-dma-sg.c
@@ -200,9 +200,9 @@ static int videobuf_dma_init_user(struct videobuf_dmabuf *dma, int direction,
 {
 	int ret;
 
-	down_read(&current->mm->mmap_sem);
+	mm_read_lock(current->mm);
 	ret = videobuf_dma_init_user_locked(dma, direction, data, size);
-	up_read(&current->mm->mmap_sem);
+	mm_read_unlock(current->mm);
 
 	return ret;
 }
diff --git drivers/misc/cxl/cxllib.c drivers/misc/cxl/cxllib.c
index 258c43a95ac3..68764a8a4b89 100644
--- drivers/misc/cxl/cxllib.c
+++ drivers/misc/cxl/cxllib.c
@@ -207,7 +207,7 @@ static int get_vma_info(struct mm_struct *mm, u64 addr,
 	struct vm_area_struct *vma = NULL;
 	int rc = 0;
 
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 
 	vma = find_vma(mm, addr);
 	if (!vma) {
@@ -218,7 +218,7 @@ static int get_vma_info(struct mm_struct *mm, u64 addr,
 	*vma_start = vma->vm_start;
 	*vma_end = vma->vm_end;
 out:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	return rc;
 }
 
diff --git drivers/misc/cxl/fault.c drivers/misc/cxl/fault.c
index 2297e6fc1544..960fdf881478 100644
--- drivers/misc/cxl/fault.c
+++ drivers/misc/cxl/fault.c
@@ -321,7 +321,7 @@ static void cxl_prefault_vma(struct cxl_context *ctx)
 		return;
 	}
 
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 	for (vma = mm->mmap; vma; vma = vma->vm_next) {
 		for (ea = vma->vm_start; ea < vma->vm_end;
 				ea = next_segment(ea, slb.vsid)) {
@@ -336,7 +336,7 @@ static void cxl_prefault_vma(struct cxl_context *ctx)
 			last_esid = slb.esid;
 		}
 	}
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 	mmput(mm);
 }
diff --git drivers/misc/sgi-gru/grufault.c drivers/misc/sgi-gru/grufault.c
index 4b713a80b572..1f865d980680 100644
--- drivers/misc/sgi-gru/grufault.c
+++ drivers/misc/sgi-gru/grufault.c
@@ -69,14 +69,14 @@ static struct gru_thread_state *gru_find_lock_gts(unsigned long vaddr)
 	struct vm_area_struct *vma;
 	struct gru_thread_state *gts = NULL;
 
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 	vma = gru_find_vma(vaddr);
 	if (vma)
 		gts = gru_find_thread_state(vma, TSID(vaddr, vma));
 	if (gts)
 		mutex_lock(&gts->ts_ctxlock);
 	else
-		up_read(&mm->mmap_sem);
+		mm_read_unlock(mm);
 	return gts;
 }
 
@@ -86,7 +86,7 @@ static struct gru_thread_state *gru_alloc_locked_gts(unsigned long vaddr)
 	struct vm_area_struct *vma;
 	struct gru_thread_state *gts = ERR_PTR(-EINVAL);
 
-	down_write(&mm->mmap_sem);
+	mm_write_lock(mm);
 	vma = gru_find_vma(vaddr);
 	if (!vma)
 		goto err;
@@ -95,11 +95,11 @@ static struct gru_thread_state *gru_alloc_locked_gts(unsigned long vaddr)
 	if (IS_ERR(gts))
 		goto err;
 	mutex_lock(&gts->ts_ctxlock);
-	downgrade_write(&mm->mmap_sem);
+	mm_downgrade_write_lock(mm);
 	return gts;
 
 err:
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 	return gts;
 }
 
@@ -109,7 +109,7 @@ static struct gru_thread_state *gru_alloc_locked_gts(unsigned long vaddr)
 static void gru_unlock_gts(struct gru_thread_state *gts)
 {
 	mutex_unlock(&gts->ts_ctxlock);
-	up_read(&current->mm->mmap_sem);
+	mm_read_unlock(current->mm);
 }
 
 /*
@@ -575,9 +575,9 @@ static irqreturn_t gru_intr(int chiplet, int blade)
 		 */
 		gts->ustats.fmm_tlbmiss++;
 		if (!gts->ts_force_cch_reload &&
-					down_read_trylock(&gts->ts_mm->mmap_sem)) {
+					mm_read_trylock(gts->ts_mm)) {
 			gru_try_dropin(gru, gts, tfh, NULL);
-			up_read(&gts->ts_mm->mmap_sem);
+			mm_read_unlock(gts->ts_mm);
 		} else {
 			tfh_user_polling_mode(tfh);
 			STAT(intr_mm_lock_failed);
diff --git drivers/misc/sgi-gru/grufile.c drivers/misc/sgi-gru/grufile.c
index 9d042310214f..173a020d0acc 100644
--- drivers/misc/sgi-gru/grufile.c
+++ drivers/misc/sgi-gru/grufile.c
@@ -135,7 +135,7 @@ static int gru_create_new_context(unsigned long arg)
 	if (!(req.options & GRU_OPT_MISS_MASK))
 		req.options |= GRU_OPT_MISS_FMM_INTR;
 
-	down_write(&current->mm->mmap_sem);
+	mm_write_lock(current->mm);
 	vma = gru_find_vma(req.gseg);
 	if (vma) {
 		vdata = vma->vm_private_data;
@@ -146,7 +146,7 @@ static int gru_create_new_context(unsigned long arg)
 		vdata->vd_tlb_preload_count = req.tlb_preload_count;
 		ret = 0;
 	}
-	up_write(&current->mm->mmap_sem);
+	mm_write_unlock(current->mm);
 
 	return ret;
 }
diff --git drivers/oprofile/buffer_sync.c drivers/oprofile/buffer_sync.c
index ac27f3d3fbb4..7971649d2ea1 100644
--- drivers/oprofile/buffer_sync.c
+++ drivers/oprofile/buffer_sync.c
@@ -91,11 +91,11 @@ munmap_notify(struct notifier_block *self, unsigned long val, void *data)
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *mpnt;
 
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 
 	mpnt = find_vma(mm, addr);
 	if (mpnt && mpnt->vm_file && (mpnt->vm_flags & VM_EXEC)) {
-		up_read(&mm->mmap_sem);
+		mm_read_unlock(mm);
 		/* To avoid latency problems, we only process the current CPU,
 		 * hoping that most samples for the task are on this CPU
 		 */
@@ -103,7 +103,7 @@ munmap_notify(struct notifier_block *self, unsigned long val, void *data)
 		return 0;
 	}
 
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	return 0;
 }
 
@@ -256,7 +256,7 @@ lookup_dcookie(struct mm_struct *mm, unsigned long addr, off_t *offset)
 	unsigned long cookie = NO_COOKIE;
 	struct vm_area_struct *vma;
 
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 	for (vma = find_vma(mm, addr); vma; vma = vma->vm_next) {
 
 		if (addr < vma->vm_start || addr >= vma->vm_end)
@@ -276,7 +276,7 @@ lookup_dcookie(struct mm_struct *mm, unsigned long addr, off_t *offset)
 
 	if (!vma)
 		cookie = INVALID_COOKIE;
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 	return cookie;
 }
diff --git drivers/staging/kpc2000/kpc_dma/fileops.c drivers/staging/kpc2000/kpc_dma/fileops.c
index cb52bd9a6d2f..38b02b61fba7 100644
--- drivers/staging/kpc2000/kpc_dma/fileops.c
+++ drivers/staging/kpc2000/kpc_dma/fileops.c
@@ -76,9 +76,9 @@ static int kpc_dma_transfer(struct dev_private_data *priv,
 	}
 
 	// Lock the user buffer pages in memory, and hold on to the page pointers (for the sglist)
-	down_read(&current->mm->mmap_sem);      /*  get memory map semaphore */
+	mm_read_lock(current->mm);      /*  get memory map semaphore */
 	rv = get_user_pages(iov_base, acd->page_count, FOLL_TOUCH | FOLL_WRITE | FOLL_GET, acd->user_pages, NULL);
-	up_read(&current->mm->mmap_sem);        /*  release the semaphore */
+	mm_read_unlock(current->mm);        /*  release the semaphore */
 	if (rv != acd->page_count) {
 		dev_err(&priv->ldev->pldev->dev, "Couldn't get_user_pages (%ld)\n", rv);
 		goto err_get_user_pages;
diff --git drivers/tee/optee/call.c drivers/tee/optee/call.c
index cf2367ba08d6..f97c1c392749 100644
--- drivers/tee/optee/call.c
+++ drivers/tee/optee/call.c
@@ -561,10 +561,10 @@ static int check_mem_type(unsigned long start, size_t num_pages)
 	if (virt_addr_valid(start))
 		return 0;
 
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 	rc = __check_mem_type(find_vma(mm, start),
 			      start + num_pages * PAGE_SIZE);
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 	return rc;
 }
diff --git drivers/vfio/vfio_iommu_type1.c drivers/vfio/vfio_iommu_type1.c
index 2ada8e6cdb88..255fb6268c24 100644
--- drivers/vfio/vfio_iommu_type1.c
+++ drivers/vfio/vfio_iommu_type1.c
@@ -277,11 +277,11 @@ static int vfio_lock_acct(struct vfio_dma *dma, long npage, bool async)
 	if (!mm)
 		return -ESRCH; /* process exited */
 
-	ret = down_write_killable(&mm->mmap_sem);
+	ret = mm_write_lock_killable(mm);
 	if (!ret) {
 		ret = __account_locked_vm(mm, abs(npage), npage > 0, dma->task,
 					  dma->lock_cap);
-		up_write(&mm->mmap_sem);
+		mm_write_unlock(mm);
 	}
 
 	if (async)
@@ -329,7 +329,7 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 	if (prot & IOMMU_WRITE)
 		flags |= FOLL_WRITE;
 
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 	if (mm == current->mm) {
 		ret = get_user_pages(vaddr, 1, flags | FOLL_LONGTERM, page,
 				     vmas);
@@ -348,14 +348,14 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 			put_page(page[0]);
 		}
 	}
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 	if (ret == 1) {
 		*pfn = page_to_pfn(page[0]);
 		return 0;
 	}
 
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 
 	vaddr = untagged_addr(vaddr);
 
@@ -367,7 +367,7 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
 			ret = 0;
 	}
 
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	return ret;
 }
 
diff --git drivers/xen/gntdev.c drivers/xen/gntdev.c
index 4fc83e3f5ad3..c698d7f5400a 100644
--- drivers/xen/gntdev.c
+++ drivers/xen/gntdev.c
@@ -625,7 +625,7 @@ static long gntdev_ioctl_get_offset_for_vaddr(struct gntdev_priv *priv,
 		return -EFAULT;
 	pr_debug("priv %p, offset for vaddr %lx\n", priv, (unsigned long)op.vaddr);
 
-	down_read(&current->mm->mmap_sem);
+	mm_read_lock(current->mm);
 	vma = find_vma(current->mm, op.vaddr);
 	if (!vma || vma->vm_ops != &gntdev_vmops)
 		goto out_unlock;
@@ -639,7 +639,7 @@ static long gntdev_ioctl_get_offset_for_vaddr(struct gntdev_priv *priv,
 	rv = 0;
 
  out_unlock:
-	up_read(&current->mm->mmap_sem);
+	mm_read_unlock(current->mm);
 
 	if (rv == 0 && copy_to_user(u, &op, sizeof(op)) != 0)
 		return -EFAULT;
diff --git drivers/xen/privcmd.c drivers/xen/privcmd.c
index c6070e70dd73..56a6ba529407 100644
--- drivers/xen/privcmd.c
+++ drivers/xen/privcmd.c
@@ -278,7 +278,7 @@ static long privcmd_ioctl_mmap(struct file *file, void __user *udata)
 	if (rc || list_empty(&pagelist))
 		goto out;
 
-	down_write(&mm->mmap_sem);
+	mm_write_lock(mm);
 
 	{
 		struct page *page = list_first_entry(&pagelist,
@@ -303,7 +303,7 @@ static long privcmd_ioctl_mmap(struct file *file, void __user *udata)
 
 
 out_up:
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 
 out:
 	free_page_list(&pagelist);
@@ -499,7 +499,7 @@ static long privcmd_ioctl_mmap_batch(
 		}
 	}
 
-	down_write(&mm->mmap_sem);
+	mm_write_lock(mm);
 
 	vma = find_vma(mm, m.addr);
 	if (!vma ||
@@ -555,7 +555,7 @@ static long privcmd_ioctl_mmap_batch(
 	BUG_ON(traverse_pages_block(m.num, sizeof(xen_pfn_t),
 				    &pagelist, mmap_batch_fn, &state));
 
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 
 	if (state.global_error) {
 		/* Write back errors in second pass. */
@@ -576,7 +576,7 @@ static long privcmd_ioctl_mmap_batch(
 	return ret;
 
 out_unlock:
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 	goto out;
 }
 
@@ -741,7 +741,7 @@ static long privcmd_ioctl_mmap_resource(struct file *file, void __user *udata)
 	if (data->domid != DOMID_INVALID && data->domid != kdata.dom)
 		return -EPERM;
 
-	down_write(&mm->mmap_sem);
+	mm_write_lock(mm);
 
 	vma = find_vma(mm, kdata.addr);
 	if (!vma || vma->vm_ops != &privcmd_vm_ops) {
@@ -820,7 +820,7 @@ static long privcmd_ioctl_mmap_resource(struct file *file, void __user *udata)
 	}
 
 out:
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 	kfree(pfns);
 
 	return rc;
diff --git fs/aio.c fs/aio.c
index a9fbad2ce5e6..704766588df4 100644
--- fs/aio.c
+++ fs/aio.c
@@ -519,7 +519,7 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events)
 	ctx->mmap_size = nr_pages * PAGE_SIZE;
 	pr_debug("attempting mmap of %lu bytes\n", ctx->mmap_size);
 
-	if (down_write_killable(&mm->mmap_sem)) {
+	if (mm_write_lock_killable(mm)) {
 		ctx->mmap_size = 0;
 		aio_free_ring(ctx);
 		return -EINTR;
@@ -528,7 +528,7 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events)
 	ctx->mmap_base = do_mmap_pgoff(ctx->aio_ring_file, 0, ctx->mmap_size,
 				       PROT_READ | PROT_WRITE,
 				       MAP_SHARED, 0, &unused, NULL);
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 	if (IS_ERR((void *)ctx->mmap_base)) {
 		ctx->mmap_size = 0;
 		aio_free_ring(ctx);
diff --git fs/coredump.c fs/coredump.c
index b1ea7dfbd149..c88c618da0d2 100644
--- fs/coredump.c
+++ fs/coredump.c
@@ -443,12 +443,12 @@ static int coredump_wait(int exit_code, struct core_state *core_state)
 	core_state->dumper.task = tsk;
 	core_state->dumper.next = NULL;
 
-	if (down_write_killable(&mm->mmap_sem))
+	if (mm_write_lock_killable(mm))
 		return -EINTR;
 
 	if (!mm->core_state)
 		core_waiters = zap_threads(tsk, mm, core_state, exit_code);
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 
 	if (core_waiters > 0) {
 		struct core_thread *ptr;
diff --git fs/exec.c fs/exec.c
index 74d88dab98dd..6d1e2687072d 100644
--- fs/exec.c
+++ fs/exec.c
@@ -250,7 +250,7 @@ static int __bprm_mm_init(struct linux_binprm *bprm)
 		return -ENOMEM;
 	vma_set_anonymous(vma);
 
-	if (down_write_killable(&mm->mmap_sem)) {
+	if (mm_write_lock_killable(mm)) {
 		err = -EINTR;
 		goto err_free;
 	}
@@ -273,11 +273,11 @@ static int __bprm_mm_init(struct linux_binprm *bprm)
 
 	mm->stack_vm = mm->total_vm = 1;
 	arch_bprm_mm_init(mm, vma);
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 	bprm->p = vma->vm_end - sizeof(void *);
 	return 0;
 err:
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 err_free:
 	bprm->vma = NULL;
 	vm_area_free(vma);
@@ -738,7 +738,7 @@ int setup_arg_pages(struct linux_binprm *bprm,
 		bprm->loader -= stack_shift;
 	bprm->exec -= stack_shift;
 
-	if (down_write_killable(&mm->mmap_sem))
+	if (mm_write_lock_killable(mm))
 		return -EINTR;
 
 	vm_flags = VM_STACK_FLAGS;
@@ -795,7 +795,7 @@ int setup_arg_pages(struct linux_binprm *bprm,
 		ret = -EFAULT;
 
 out_unlock:
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 	return ret;
 }
 EXPORT_SYMBOL(setup_arg_pages);
@@ -1024,9 +1024,9 @@ static int exec_mmap(struct mm_struct *mm)
 		 * through with the exec.  We must hold mmap_sem around
 		 * checking core_state and changing tsk->mm.
 		 */
-		down_read(&old_mm->mmap_sem);
+		mm_read_lock(old_mm);
 		if (unlikely(old_mm->core_state)) {
-			up_read(&old_mm->mmap_sem);
+			mm_read_unlock(old_mm);
 			return -EINTR;
 		}
 	}
@@ -1040,7 +1040,7 @@ static int exec_mmap(struct mm_struct *mm)
 	vmacache_flush(tsk);
 	task_unlock(tsk);
 	if (old_mm) {
-		up_read(&old_mm->mmap_sem);
+		mm_read_unlock(old_mm);
 		BUG_ON(active_mm != old_mm);
 		setmax_mm_hiwater_rss(&tsk->signal->maxrss, old_mm);
 		mm_update_next_owner(old_mm);
diff --git fs/io_uring.c fs/io_uring.c
index e54556b0fcc6..8aa9d7263e83 100644
--- fs/io_uring.c
+++ fs/io_uring.c
@@ -4822,7 +4822,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 		}
 
 		ret = 0;
-		down_read(&current->mm->mmap_sem);
+		mm_read_lock(current->mm);
 		pret = get_user_pages(ubuf, nr_pages,
 				      FOLL_WRITE | FOLL_LONGTERM,
 				      pages, vmas);
@@ -4840,7 +4840,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
 		} else {
 			ret = pret < 0 ? pret : -EFAULT;
 		}
-		up_read(&current->mm->mmap_sem);
+		mm_read_unlock(current->mm);
 		if (ret) {
 			/*
 			 * if we did partial map, or found file backed vmas,
diff --git fs/proc/base.c fs/proc/base.c
index ebea9501afb8..31c56a08af0f 100644
--- fs/proc/base.c
+++ fs/proc/base.c
@@ -1979,11 +1979,11 @@ static int map_files_d_revalidate(struct dentry *dentry, unsigned int flags)
 		goto out;
 
 	if (!dname_to_vma_addr(dentry, &vm_start, &vm_end)) {
-		status = down_read_killable(&mm->mmap_sem);
+		status = mm_read_lock_killable(mm);
 		if (!status) {
 			exact_vma_exists = !!find_exact_vma(mm, vm_start,
 							    vm_end);
-			up_read(&mm->mmap_sem);
+			mm_read_unlock(mm);
 		}
 	}
 
@@ -2030,7 +2030,7 @@ static int map_files_get_link(struct dentry *dentry, struct path *path)
 	if (rc)
 		goto out_mmput;
 
-	rc = down_read_killable(&mm->mmap_sem);
+	rc = mm_read_lock_killable(mm);
 	if (rc)
 		goto out_mmput;
 
@@ -2041,7 +2041,7 @@ static int map_files_get_link(struct dentry *dentry, struct path *path)
 		path_get(path);
 		rc = 0;
 	}
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 out_mmput:
 	mmput(mm);
@@ -2131,7 +2131,7 @@ static struct dentry *proc_map_files_lookup(struct inode *dir,
 		goto out_put_task;
 
 	result = ERR_PTR(-EINTR);
-	if (down_read_killable(&mm->mmap_sem))
+	if (mm_read_lock_killable(mm))
 		goto out_put_mm;
 
 	result = ERR_PTR(-ENOENT);
@@ -2144,7 +2144,7 @@ static struct dentry *proc_map_files_lookup(struct inode *dir,
 				(void *)(unsigned long)vma->vm_file->f_mode);
 
 out_no_vma:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 out_put_mm:
 	mmput(mm);
 out_put_task:
diff --git fs/proc/task_mmu.c fs/proc/task_mmu.c
index 9442631fd4af..9de244eb6fbc 100644
--- fs/proc/task_mmu.c
+++ fs/proc/task_mmu.c
@@ -128,7 +128,7 @@ static void vma_stop(struct proc_maps_private *priv)
 	struct mm_struct *mm = priv->mm;
 
 	release_task_mempolicy(priv);
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	mmput(mm);
 }
 
@@ -166,7 +166,7 @@ static void *m_start(struct seq_file *m, loff_t *ppos)
 	if (!mm || !mmget_not_zero(mm))
 		return NULL;
 
-	if (down_read_killable(&mm->mmap_sem)) {
+	if (mm_read_lock_killable(mm)) {
 		mmput(mm);
 		return ERR_PTR(-EINTR);
 	}
@@ -873,7 +873,7 @@ static int show_smaps_rollup(struct seq_file *m, void *v)
 
 	memset(&mss, 0, sizeof(mss));
 
-	ret = down_read_killable(&mm->mmap_sem);
+	ret = mm_read_lock_killable(mm);
 	if (ret)
 		goto out_put_mm;
 
@@ -892,7 +892,7 @@ static int show_smaps_rollup(struct seq_file *m, void *v)
 	__show_smap(m, &mss, true);
 
 	release_task_mempolicy(priv);
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 out_put_mm:
 	mmput(mm);
@@ -1166,7 +1166,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 		};
 
 		if (type == CLEAR_REFS_MM_HIWATER_RSS) {
-			if (down_write_killable(&mm->mmap_sem)) {
+			if (mm_write_lock_killable(mm)) {
 				count = -EINTR;
 				goto out_mm;
 			}
@@ -1176,11 +1176,11 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 			 * resident set size to this mm's current rss value.
 			 */
 			reset_mm_hiwater_rss(mm);
-			up_write(&mm->mmap_sem);
+			mm_write_unlock(mm);
 			goto out_mm;
 		}
 
-		if (down_read_killable(&mm->mmap_sem)) {
+		if (mm_read_lock_killable(mm)) {
 			count = -EINTR;
 			goto out_mm;
 		}
@@ -1189,8 +1189,8 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 			for (vma = mm->mmap; vma; vma = vma->vm_next) {
 				if (!(vma->vm_flags & VM_SOFTDIRTY))
 					continue;
-				up_read(&mm->mmap_sem);
-				if (down_write_killable(&mm->mmap_sem)) {
+				mm_read_unlock(mm);
+				if (mm_write_lock_killable(mm)) {
 					count = -EINTR;
 					goto out_mm;
 				}
@@ -1209,14 +1209,14 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 					 * failed like if
 					 * get_proc_task() fails?
 					 */
-					up_write(&mm->mmap_sem);
+					mm_write_unlock(mm);
 					goto out_mm;
 				}
 				for (vma = mm->mmap; vma; vma = vma->vm_next) {
 					vma->vm_flags &= ~VM_SOFTDIRTY;
 					vma_set_page_prot(vma);
 				}
-				downgrade_write(&mm->mmap_sem);
+				mm_downgrade_write_lock(mm);
 				break;
 			}
 
@@ -1229,7 +1229,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 		if (type == CLEAR_REFS_SOFT_DIRTY)
 			mmu_notifier_invalidate_range_end(&range);
 		tlb_finish_mmu(&tlb, 0, -1);
-		up_read(&mm->mmap_sem);
+		mm_read_unlock(mm);
 out_mm:
 		mmput(mm);
 	}
@@ -1590,11 +1590,11 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
 		/* overflow ? */
 		if (end < start_vaddr || end > end_vaddr)
 			end = end_vaddr;
-		ret = down_read_killable(&mm->mmap_sem);
+		ret = mm_read_lock_killable(mm);
 		if (ret)
 			goto out_free;
 		ret = walk_page_range(mm, start_vaddr, end, &pagemap_ops, &pm);
-		up_read(&mm->mmap_sem);
+		mm_read_unlock(mm);
 		start_vaddr = end;
 
 		len = min(count, PM_ENTRY_BYTES * pm.pos);
diff --git fs/proc/task_nommu.c fs/proc/task_nommu.c
index 7907e6419e57..2a6efd126cf8 100644
--- fs/proc/task_nommu.c
+++ fs/proc/task_nommu.c
@@ -25,7 +25,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 	struct rb_node *p;
 	unsigned long bytes = 0, sbytes = 0, slack = 0, size;
         
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 	for (p = rb_first(&mm->mm_rb); p; p = rb_next(p)) {
 		vma = rb_entry(p, struct vm_area_struct, vm_rb);
 
@@ -77,7 +77,7 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 		"Shared:\t%8lu bytes\n",
 		bytes, slack, sbytes);
 
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 }
 
 unsigned long task_vsize(struct mm_struct *mm)
@@ -86,12 +86,12 @@ unsigned long task_vsize(struct mm_struct *mm)
 	struct rb_node *p;
 	unsigned long vsize = 0;
 
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 	for (p = rb_first(&mm->mm_rb); p; p = rb_next(p)) {
 		vma = rb_entry(p, struct vm_area_struct, vm_rb);
 		vsize += vma->vm_end - vma->vm_start;
 	}
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	return vsize;
 }
 
@@ -104,7 +104,7 @@ unsigned long task_statm(struct mm_struct *mm,
 	struct rb_node *p;
 	unsigned long size = kobjsize(mm);
 
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 	for (p = rb_first(&mm->mm_rb); p; p = rb_next(p)) {
 		vma = rb_entry(p, struct vm_area_struct, vm_rb);
 		size += kobjsize(vma);
@@ -119,7 +119,7 @@ unsigned long task_statm(struct mm_struct *mm,
 		>> PAGE_SHIFT;
 	*data = (PAGE_ALIGN(mm->start_stack) - (mm->start_data & PAGE_MASK))
 		>> PAGE_SHIFT;
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	size >>= PAGE_SHIFT;
 	size += *text + *data;
 	*resident = size;
@@ -211,7 +211,7 @@ static void *m_start(struct seq_file *m, loff_t *pos)
 	if (!mm || !mmget_not_zero(mm))
 		return NULL;
 
-	if (down_read_killable(&mm->mmap_sem)) {
+	if (mm_read_lock_killable(mm)) {
 		mmput(mm);
 		return ERR_PTR(-EINTR);
 	}
@@ -221,7 +221,7 @@ static void *m_start(struct seq_file *m, loff_t *pos)
 		if (n-- == 0)
 			return p;
 
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	mmput(mm);
 	return NULL;
 }
@@ -231,7 +231,7 @@ static void m_stop(struct seq_file *m, void *_vml)
 	struct proc_maps_private *priv = m->private;
 
 	if (!IS_ERR_OR_NULL(_vml)) {
-		up_read(&priv->mm->mmap_sem);
+		mm_read_unlock(priv->mm);
 		mmput(priv->mm);
 	}
 	if (priv->task) {
diff --git fs/userfaultfd.c fs/userfaultfd.c
index 37df7c9eedb1..f38095a7ebcd 100644
--- fs/userfaultfd.c
+++ fs/userfaultfd.c
@@ -234,7 +234,7 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
 	pte_t *ptep, pte;
 	bool ret = true;
 
-	VM_BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
+	VM_BUG_ON(!mm_is_locked(mm));
 
 	ptep = huge_pte_offset(mm, address, vma_mmu_pagesize(vma));
 
@@ -286,7 +286,7 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
 	pte_t *pte;
 	bool ret = true;
 
-	VM_BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
+	VM_BUG_ON(!mm_is_locked(mm));
 
 	pgd = pgd_offset(mm, address);
 	if (!pgd_present(*pgd))
@@ -376,7 +376,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 	 * Coredumping runs without mmap_sem so we can only check that
 	 * the mmap_sem is held, if PF_DUMPCORE was not set.
 	 */
-	WARN_ON_ONCE(!rwsem_is_locked(&mm->mmap_sem));
+	WARN_ON_ONCE(!mm_is_locked(mm));
 
 	ctx = vmf->vma->vm_userfaultfd_ctx.ctx;
 	if (!ctx)
@@ -489,7 +489,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 		must_wait = userfaultfd_huge_must_wait(ctx, vmf->vma,
 						       vmf->address,
 						       vmf->flags, reason);
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 	if (likely(must_wait && !READ_ONCE(ctx->released) &&
 		   (return_to_userland ? !signal_pending(current) :
@@ -543,7 +543,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 			 * and there's no need to retake the mmap_sem
 			 * in such case.
 			 */
-			down_read(&mm->mmap_sem);
+			mm_read_lock(mm);
 			ret = VM_FAULT_NOPAGE;
 		}
 	}
@@ -638,7 +638,7 @@ static void userfaultfd_event_wait_completion(struct userfaultfd_ctx *ctx,
 		struct mm_struct *mm = release_new_ctx->mm;
 
 		/* the various vma->vm_userfaultfd_ctx still points to it */
-		down_write(&mm->mmap_sem);
+		mm_write_lock(mm);
 		/* no task can run (and in turn coredump) yet */
 		VM_WARN_ON(!mmget_still_valid(mm));
 		for (vma = mm->mmap; vma; vma = vma->vm_next)
@@ -646,7 +646,7 @@ static void userfaultfd_event_wait_completion(struct userfaultfd_ctx *ctx,
 				vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
 				vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
 			}
-		up_write(&mm->mmap_sem);
+		mm_write_unlock(mm);
 
 		userfaultfd_ctx_put(release_new_ctx);
 	}
@@ -800,7 +800,7 @@ bool userfaultfd_remove(struct vm_area_struct *vma,
 
 	userfaultfd_ctx_get(ctx);
 	WRITE_ONCE(ctx->mmap_changing, true);
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 	msg_init(&ewq.msg);
 
@@ -895,7 +895,7 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
 	 * it's critical that released is set to true (above), before
 	 * taking the mmap_sem for writing.
 	 */
-	down_write(&mm->mmap_sem);
+	mm_write_lock(mm);
 	still_valid = mmget_still_valid(mm);
 	prev = NULL;
 	for (vma = mm->mmap; vma; vma = vma->vm_next) {
@@ -921,7 +921,7 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
 		vma->vm_flags = new_flags;
 		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
 	}
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 	mmput(mm);
 wakeup:
 	/*
@@ -1350,7 +1350,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 	if (!mmget_not_zero(mm))
 		goto out;
 
-	down_write(&mm->mmap_sem);
+	mm_write_lock(mm);
 	if (!mmget_still_valid(mm))
 		goto out_unlock;
 	vma = find_vma_prev(mm, start, &prev);
@@ -1495,7 +1495,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 		vma = vma->vm_next;
 	} while (vma && vma->vm_start < end);
 out_unlock:
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 	mmput(mm);
 	if (!ret) {
 		/*
@@ -1540,7 +1540,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 	if (!mmget_not_zero(mm))
 		goto out;
 
-	down_write(&mm->mmap_sem);
+	mm_write_lock(mm);
 	if (!mmget_still_valid(mm))
 		goto out_unlock;
 	vma = find_vma_prev(mm, start, &prev);
@@ -1657,7 +1657,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 		vma = vma->vm_next;
 	} while (vma && vma->vm_start < end);
 out_unlock:
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 	mmput(mm);
 out:
 	return ret;
diff --git include/linux/mmu_notifier.h include/linux/mmu_notifier.h
index 9e6caa8ecd19..316927a91f88 100644
--- include/linux/mmu_notifier.h
+++ include/linux/mmu_notifier.h
@@ -5,6 +5,7 @@
 #include <linux/list.h>
 #include <linux/spinlock.h>
 #include <linux/mm_types.h>
+#include <linux/mm_lock.h>
 #include <linux/srcu.h>
 #include <linux/interval_tree.h>
 
@@ -275,9 +276,9 @@ mmu_notifier_get(const struct mmu_notifier_ops *ops, struct mm_struct *mm)
 {
 	struct mmu_notifier *ret;
 
-	down_write(&mm->mmap_sem);
+	mm_write_lock(mm);
 	ret = mmu_notifier_get_locked(ops, mm);
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 	return ret;
 }
 void mmu_notifier_put(struct mmu_notifier *mn);
diff --git ipc/shm.c ipc/shm.c
index ce1ca9f7c6e9..c04fc21cbe46 100644
--- ipc/shm.c
+++ ipc/shm.c
@@ -1544,7 +1544,7 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
 	if (err)
 		goto out_fput;
 
-	if (down_write_killable(&current->mm->mmap_sem)) {
+	if (mm_write_lock_killable(current->mm)) {
 		err = -EINTR;
 		goto out_fput;
 	}
@@ -1564,7 +1564,7 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
 	if (IS_ERR_VALUE(addr))
 		err = (long)addr;
 invalid:
-	up_write(&current->mm->mmap_sem);
+	mm_write_unlock(current->mm);
 	if (populate)
 		mm_populate(addr, populate);
 
@@ -1638,7 +1638,7 @@ long ksys_shmdt(char __user *shmaddr)
 	if (addr & ~PAGE_MASK)
 		return retval;
 
-	if (down_write_killable(&mm->mmap_sem))
+	if (mm_write_lock_killable(mm))
 		return -EINTR;
 
 	/*
@@ -1726,7 +1726,7 @@ long ksys_shmdt(char __user *shmaddr)
 
 #endif
 
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 	return retval;
 }
 
diff --git kernel/acct.c kernel/acct.c
index 81f9831a7859..32257bd2b38f 100644
--- kernel/acct.c
+++ kernel/acct.c
@@ -539,13 +539,13 @@ void acct_collect(long exitcode, int group_dead)
 	if (group_dead && current->mm) {
 		struct vm_area_struct *vma;
 
-		down_read(&current->mm->mmap_sem);
+		mm_read_lock(current->mm);
 		vma = current->mm->mmap;
 		while (vma) {
 			vsize += vma->vm_end - vma->vm_start;
 			vma = vma->vm_next;
 		}
-		up_read(&current->mm->mmap_sem);
+		mm_read_unlock(current->mm);
 	}
 
 	spin_lock_irq(&current->sighand->siglock);
diff --git kernel/bpf/stackmap.c kernel/bpf/stackmap.c
index 3f958b90d914..8087d31b6471 100644
--- kernel/bpf/stackmap.c
+++ kernel/bpf/stackmap.c
@@ -305,7 +305,7 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
 	 * with build_id.
 	 */
 	if (!user || !current || !current->mm || irq_work_busy ||
-	    down_read_trylock(&current->mm->mmap_sem) == 0) {
+	    mm_read_trylock(current->mm) == 0) {
 		/* cannot access current->mm, fall back to ips */
 		for (i = 0; i < trace_nr; i++) {
 			id_offs[i].status = BPF_STACK_BUILD_ID_IP;
@@ -330,7 +330,7 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
 	}
 
 	if (!work) {
-		up_read(&current->mm->mmap_sem);
+		mm_read_unlock(current->mm);
 	} else {
 		work->sem = &current->mm->mmap_sem;
 		irq_work_queue(&work->irq_work);
diff --git kernel/events/core.c kernel/events/core.c
index 2173c23c25b4..4921a5e48931 100644
--- kernel/events/core.c
+++ kernel/events/core.c
@@ -9467,7 +9467,7 @@ static void perf_event_addr_filters_apply(struct perf_event *event)
 		if (!mm)
 			goto restart;
 
-		down_read(&mm->mmap_sem);
+		mm_read_lock(mm);
 	}
 
 	raw_spin_lock_irqsave(&ifh->lock, flags);
@@ -9493,7 +9493,7 @@ static void perf_event_addr_filters_apply(struct perf_event *event)
 	raw_spin_unlock_irqrestore(&ifh->lock, flags);
 
 	if (ifh->nr_file_filters) {
-		up_read(&mm->mmap_sem);
+		mm_read_unlock(mm);
 
 		mmput(mm);
 	}
diff --git kernel/events/uprobes.c kernel/events/uprobes.c
index ece7e13f6e4a..8c1417f4ee48 100644
--- kernel/events/uprobes.c
+++ kernel/events/uprobes.c
@@ -1064,7 +1064,7 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
 		if (err && is_register)
 			goto free;
 
-		down_write(&mm->mmap_sem);
+		mm_write_lock(mm);
 		vma = find_vma(mm, info->vaddr);
 		if (!vma || !valid_vma(vma, is_register) ||
 		    file_inode(vma->vm_file) != uprobe->inode)
@@ -1086,7 +1086,7 @@ register_for_each_vma(struct uprobe *uprobe, struct uprobe_consumer *new)
 		}
 
  unlock:
-		up_write(&mm->mmap_sem);
+		mm_write_unlock(mm);
  free:
 		mmput(mm);
 		info = free_map_info(info);
@@ -1241,7 +1241,7 @@ static int unapply_uprobe(struct uprobe *uprobe, struct mm_struct *mm)
 	struct vm_area_struct *vma;
 	int err = 0;
 
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 	for (vma = mm->mmap; vma; vma = vma->vm_next) {
 		unsigned long vaddr;
 		loff_t offset;
@@ -1258,7 +1258,7 @@ static int unapply_uprobe(struct uprobe *uprobe, struct mm_struct *mm)
 		vaddr = offset_to_vaddr(vma, uprobe->offset);
 		err |= remove_breakpoint(uprobe, mm, vaddr);
 	}
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 	return err;
 }
@@ -1445,7 +1445,7 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
 	struct vm_area_struct *vma;
 	int ret;
 
-	if (down_write_killable(&mm->mmap_sem))
+	if (mm_write_lock_killable(mm))
 		return -EINTR;
 
 	if (mm->uprobes_state.xol_area) {
@@ -1475,7 +1475,7 @@ static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)
 	/* pairs with get_xol_area() */
 	smp_store_release(&mm->uprobes_state.xol_area, area); /* ^^^ */
  fail:
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 
 	return ret;
 }
@@ -2045,7 +2045,7 @@ static struct uprobe *find_active_uprobe(unsigned long bp_vaddr, int *is_swbp)
 	struct uprobe *uprobe = NULL;
 	struct vm_area_struct *vma;
 
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 	vma = find_vma(mm, bp_vaddr);
 	if (vma && vma->vm_start <= bp_vaddr) {
 		if (valid_vma(vma, false)) {
@@ -2063,7 +2063,7 @@ static struct uprobe *find_active_uprobe(unsigned long bp_vaddr, int *is_swbp)
 
 	if (!uprobe && test_and_clear_bit(MMF_RECALC_UPROBES, &mm->flags))
 		mmf_recalc_uprobes(mm);
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 	return uprobe;
 }
diff --git kernel/exit.c kernel/exit.c
index 2833ffb0c211..9a0b72562adb 100644
--- kernel/exit.c
+++ kernel/exit.c
@@ -448,12 +448,12 @@ static void exit_mm(void)
 	 * will increment ->nr_threads for each thread in the
 	 * group with ->mm != NULL.
 	 */
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 	core_state = mm->core_state;
 	if (core_state) {
 		struct core_thread self;
 
-		up_read(&mm->mmap_sem);
+		mm_read_unlock(mm);
 
 		self.task = current;
 		self.next = xchg(&core_state->dumper.next, &self);
@@ -471,14 +471,14 @@ static void exit_mm(void)
 			freezable_schedule();
 		}
 		__set_current_state(TASK_RUNNING);
-		down_read(&mm->mmap_sem);
+		mm_read_lock(mm);
 	}
 	mmgrab(mm);
 	BUG_ON(mm != current->active_mm);
 	/* more a memory barrier than a real lock */
 	task_lock(current);
 	current->mm = NULL;
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	enter_lazy_tlb(mm, current);
 	task_unlock(current);
 	mm_update_next_owner(mm);
diff --git kernel/fork.c kernel/fork.c
index 080809560072..d598f56e4b1e 100644
--- kernel/fork.c
+++ kernel/fork.c
@@ -488,7 +488,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	LIST_HEAD(uf);
 
 	uprobe_start_dup_mmap();
-	if (down_write_killable(&oldmm->mmap_sem)) {
+	if (mm_write_lock_killable(oldmm)) {
 		retval = -EINTR;
 		goto fail_uprobe_end;
 	}
@@ -612,9 +612,9 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	/* a new mm has just been created */
 	retval = arch_dup_mmap(oldmm, mm);
 out:
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 	flush_tlb_mm(oldmm);
-	up_write(&oldmm->mmap_sem);
+	mm_write_unlock(oldmm);
 	dup_userfaultfd_complete(&uf);
 fail_uprobe_end:
 	uprobe_end_dup_mmap();
@@ -644,9 +644,9 @@ static inline void mm_free_pgd(struct mm_struct *mm)
 #else
 static int dup_mmap(struct mm_struct *mm, struct mm_struct *oldmm)
 {
-	down_write(&oldmm->mmap_sem);
+	mm_write_lock(oldmm);
 	RCU_INIT_POINTER(mm->exe_file, get_mm_exe_file(oldmm));
-	up_write(&oldmm->mmap_sem);
+	mm_write_unlock(oldmm);
 	return 0;
 }
 #define mm_alloc_pgd(mm)	(0)
@@ -1011,7 +1011,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	mm->vmacache_seqnum = 0;
 	atomic_set(&mm->mm_users, 1);
 	atomic_set(&mm->mm_count, 1);
-	init_rwsem(&mm->mmap_sem);
+	mm_init_lock(mm);
 	INIT_LIST_HEAD(&mm->mmlist);
 	mm->core_state = NULL;
 	mm_pgtables_bytes_init(mm);
diff --git kernel/futex.c kernel/futex.c
index 0cf84c8664f2..0081f1b8530f 100644
--- kernel/futex.c
+++ kernel/futex.c
@@ -753,10 +753,10 @@ static int fault_in_user_writeable(u32 __user *uaddr)
 	struct mm_struct *mm = current->mm;
 	int ret;
 
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 	ret = fixup_user_fault(current, mm, (unsigned long)uaddr,
 			       FAULT_FLAG_WRITE, NULL);
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 	return ret < 0 ? ret : 0;
 }
diff --git kernel/sched/fair.c kernel/sched/fair.c
index ba749f579714..ed309e43c39d 100644
--- kernel/sched/fair.c
+++ kernel/sched/fair.c
@@ -2545,7 +2545,7 @@ static void task_numa_work(struct callback_head *work)
 		return;
 
 
-	if (!down_read_trylock(&mm->mmap_sem))
+	if (!mm_read_trylock(mm))
 		return;
 	vma = find_vma(mm, start);
 	if (!vma) {
@@ -2613,7 +2613,7 @@ static void task_numa_work(struct callback_head *work)
 		mm->numa_scan_offset = start;
 	else
 		reset_ptenuma_scan(p);
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 	/*
 	 * Make sure tasks use at least 32x as much time to run other code
diff --git kernel/sys.c kernel/sys.c
index a9331f101883..55413c799735 100644
--- kernel/sys.c
+++ kernel/sys.c
@@ -1845,7 +1845,7 @@ static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd)
 	if (exe_file) {
 		struct vm_area_struct *vma;
 
-		down_read(&mm->mmap_sem);
+		mm_read_lock(mm);
 		for (vma = mm->mmap; vma; vma = vma->vm_next) {
 			if (!vma->vm_file)
 				continue;
@@ -1854,7 +1854,7 @@ static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd)
 				goto exit_err;
 		}
 
-		up_read(&mm->mmap_sem);
+		mm_read_unlock(mm);
 		fput(exe_file);
 	}
 
@@ -1868,7 +1868,7 @@ static int prctl_set_mm_exe_file(struct mm_struct *mm, unsigned int fd)
 	fdput(exe);
 	return err;
 exit_err:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	fput(exe_file);
 	goto exit;
 }
@@ -2009,7 +2009,7 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
 	 * arg_lock protects concurent updates but we still need mmap_sem for
 	 * read to exclude races with sys_brk.
 	 */
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 
 	/*
 	 * We don't validate if these members are pointing to
@@ -2048,7 +2048,7 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
 	if (prctl_map.auxv_size)
 		memcpy(mm->saved_auxv, user_auxv, sizeof(user_auxv));
 
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	return 0;
 }
 #endif /* CONFIG_CHECKPOINT_RESTORE */
@@ -2124,7 +2124,7 @@ static int prctl_set_mm(int opt, unsigned long addr,
 	 * mmap_sem for a) concurrent sys_brk, b) finding VMA for addr
 	 * validation.
 	 */
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 	vma = find_vma(mm, addr);
 
 	spin_lock(&mm->arg_lock);
@@ -2216,7 +2216,7 @@ static int prctl_set_mm(int opt, unsigned long addr,
 	error = 0;
 out:
 	spin_unlock(&mm->arg_lock);
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	return error;
 }
 
@@ -2439,13 +2439,13 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 	case PR_SET_THP_DISABLE:
 		if (arg3 || arg4 || arg5)
 			return -EINVAL;
-		if (down_write_killable(&me->mm->mmap_sem))
+		if (mm_write_lock_killable(me->mm))
 			return -EINTR;
 		if (arg2)
 			set_bit(MMF_DISABLE_THP, &me->mm->flags);
 		else
 			clear_bit(MMF_DISABLE_THP, &me->mm->flags);
-		up_write(&me->mm->mmap_sem);
+		mm_write_unlock(me->mm);
 		break;
 	case PR_MPX_ENABLE_MANAGEMENT:
 	case PR_MPX_DISABLE_MANAGEMENT:
diff --git kernel/trace/trace_output.c kernel/trace/trace_output.c
index d9b4b7c22db4..c715cd737476 100644
--- kernel/trace/trace_output.c
+++ kernel/trace/trace_output.c
@@ -393,7 +393,7 @@ static int seq_print_user_ip(struct trace_seq *s, struct mm_struct *mm,
 	if (mm) {
 		const struct vm_area_struct *vma;
 
-		down_read(&mm->mmap_sem);
+		mm_read_lock(mm);
 		vma = find_vma(mm, ip);
 		if (vma) {
 			file = vma->vm_file;
@@ -405,7 +405,7 @@ static int seq_print_user_ip(struct trace_seq *s, struct mm_struct *mm,
 				trace_seq_printf(s, "[+0x%lx]",
 						 ip - vmstart);
 		}
-		up_read(&mm->mmap_sem);
+		mm_read_unlock(mm);
 	}
 	if (ret && ((sym_flags & TRACE_ITER_SYM_ADDR) || !file))
 		trace_seq_printf(s, " <" IP_FMT ">", ip);
diff --git mm/filemap.c mm/filemap.c
index bf6aa30be58d..eb6487065ca0 100644
--- mm/filemap.c
+++ mm/filemap.c
@@ -1416,7 +1416,7 @@ int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
 		if (flags & FAULT_FLAG_RETRY_NOWAIT)
 			return 0;
 
-		up_read(&mm->mmap_sem);
+		mm_read_unlock(mm);
 		if (flags & FAULT_FLAG_KILLABLE)
 			wait_on_page_locked_killable(page);
 		else
@@ -1428,7 +1428,7 @@ int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
 
 			ret = __lock_page_killable(page);
 			if (ret) {
-				up_read(&mm->mmap_sem);
+				mm_read_unlock(mm);
 				return 0;
 			}
 		} else
@@ -2364,7 +2364,7 @@ static int lock_page_maybe_drop_mmap(struct vm_fault *vmf, struct page *page,
 			 * mmap_sem here and return 0 if we don't have a fpin.
 			 */
 			if (*fpin == NULL)
-				up_read(&vmf->vma->vm_mm->mmap_sem);
+				mm_read_unlock(vmf->vma->vm_mm);
 			return 0;
 		}
 	} else
diff --git mm/frame_vector.c mm/frame_vector.c
index c431ca81dad5..d0a0355e3456 100644
--- mm/frame_vector.c
+++ mm/frame_vector.c
@@ -48,7 +48,7 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
 
 	start = untagged_addr(start);
 
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 	locked = 1;
 	vma = find_vma_intersection(mm, start, start + 1);
 	if (!vma) {
@@ -102,7 +102,7 @@ int get_vaddr_frames(unsigned long start, unsigned int nr_frames,
 	} while (vma && vma->vm_flags & (VM_IO | VM_PFNMAP));
 out:
 	if (locked)
-		up_read(&mm->mmap_sem);
+		mm_read_unlock(mm);
 	if (!ret)
 		ret = -EFAULT;
 	if (ret > 0)
diff --git mm/gup.c mm/gup.c
index 7646bf993b25..48b5ec82b1c2 100644
--- mm/gup.c
+++ mm/gup.c
@@ -982,7 +982,7 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
 	}
 
 	if (ret & VM_FAULT_RETRY) {
-		down_read(&mm->mmap_sem);
+		mm_read_lock(mm);
 		if (!(fault_flags & FAULT_FLAG_TRIED)) {
 			*unlocked = true;
 			fault_flags &= ~FAULT_FLAG_ALLOW_RETRY;
@@ -1068,7 +1068,7 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
 		 */
 		*locked = 1;
 		lock_dropped = true;
-		down_read(&mm->mmap_sem);
+		mm_read_lock(mm);
 		ret = __get_user_pages(tsk, mm, start, 1, flags | FOLL_TRIED,
 				       pages, NULL, NULL);
 		if (ret != 1) {
@@ -1090,7 +1090,7 @@ static __always_inline long __get_user_pages_locked(struct task_struct *tsk,
 		 * We must let the caller know we temporarily dropped the lock
 		 * and so the critical section protected by it was lost.
 		 */
-		up_read(&mm->mmap_sem);
+		mm_read_unlock(mm);
 		*locked = 0;
 	}
 	return pages_done;
@@ -1208,7 +1208,7 @@ long populate_vma_page_range(struct vm_area_struct *vma,
 	VM_BUG_ON(end   & ~PAGE_MASK);
 	VM_BUG_ON_VMA(start < vma->vm_start, vma);
 	VM_BUG_ON_VMA(end   > vma->vm_end, vma);
-	VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_sem), mm);
+	VM_BUG_ON_MM(!mm_is_locked(mm), mm);
 
 	gup_flags = FOLL_TOUCH | FOLL_POPULATE | FOLL_MLOCK;
 	if (vma->vm_flags & VM_LOCKONFAULT)
@@ -1260,7 +1260,7 @@ int __mm_populate(unsigned long start, unsigned long len, int ignore_errors)
 		 */
 		if (!locked) {
 			locked = 1;
-			down_read(&mm->mmap_sem);
+			mm_read_lock(mm);
 			vma = find_vma(mm, nstart);
 		} else if (nstart >= vma->vm_end)
 			vma = vma->vm_next;
@@ -1292,7 +1292,7 @@ int __mm_populate(unsigned long start, unsigned long len, int ignore_errors)
 		ret = 0;
 	}
 	if (locked)
-		up_read(&mm->mmap_sem);
+		mm_read_unlock(mm);
 	return ret;	/* 0 or negative error code */
 }
 
@@ -1698,11 +1698,11 @@ long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
 	if (WARN_ON_ONCE(gup_flags & FOLL_LONGTERM))
 		return -EINVAL;
 
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 	ret = __get_user_pages_locked(current, mm, start, nr_pages, pages, NULL,
 				      &locked, gup_flags | FOLL_TOUCH);
 	if (locked)
-		up_read(&mm->mmap_sem);
+		mm_read_unlock(mm);
 	return ret;
 }
 EXPORT_SYMBOL(get_user_pages_unlocked);
@@ -2380,11 +2380,11 @@ static int __gup_longterm_unlocked(unsigned long start, int nr_pages,
 	 * get_user_pages_unlocked() (see comments in that function)
 	 */
 	if (gup_flags & FOLL_LONGTERM) {
-		down_read(&current->mm->mmap_sem);
+		mm_read_lock(current->mm);
 		ret = __gup_longterm_locked(current, current->mm,
 					    start, nr_pages,
 					    pages, NULL, gup_flags);
-		up_read(&current->mm->mmap_sem);
+		mm_read_unlock(current->mm);
 	} else {
 		ret = get_user_pages_unlocked(start, nr_pages,
 					      pages, gup_flags);
diff --git mm/internal.h mm/internal.h
index 3cf20ab3ca01..22f361a1e284 100644
--- mm/internal.h
+++ mm/internal.h
@@ -382,7 +382,7 @@ static inline struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf,
 	if ((flags & (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT)) ==
 	    FAULT_FLAG_ALLOW_RETRY) {
 		fpin = get_file(vmf->vma->vm_file);
-		up_read(&vmf->vma->vm_mm->mmap_sem);
+		mm_read_unlock(vmf->vma->vm_mm);
 	}
 	return fpin;
 }
diff --git mm/khugepaged.c mm/khugepaged.c
index b679908743cb..7ee8ae64824b 100644
--- mm/khugepaged.c
+++ mm/khugepaged.c
@@ -508,8 +508,8 @@ void __khugepaged_exit(struct mm_struct *mm)
 		 * khugepaged has finished working on the pagetables
 		 * under the mmap_sem.
 		 */
-		down_write(&mm->mmap_sem);
-		up_write(&mm->mmap_sem);
+		mm_write_lock(mm);
+		mm_write_unlock(mm);
 	}
 }
 
@@ -918,7 +918,7 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
 
 		/* do_swap_page returns VM_FAULT_RETRY with released mmap_sem */
 		if (ret & VM_FAULT_RETRY) {
-			down_read(&mm->mmap_sem);
+			mm_read_lock(mm);
 			if (hugepage_vma_revalidate(mm, address, &vmf.vma)) {
 				/* vma is no longer available, don't continue to swapin */
 				trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0);
@@ -970,7 +970,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 * sync compaction, and we do not need to hold the mmap_sem during
 	 * that. We will recheck the vma after taking it again in write mode.
 	 */
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	new_page = khugepaged_alloc_page(hpage, gfp, node);
 	if (!new_page) {
 		result = SCAN_ALLOC_HUGE_PAGE_FAIL;
@@ -982,11 +982,11 @@ static void collapse_huge_page(struct mm_struct *mm,
 		goto out_nolock;
 	}
 
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 	result = hugepage_vma_revalidate(mm, address, &vma);
 	if (result) {
 		mem_cgroup_cancel_charge(new_page, memcg, true);
-		up_read(&mm->mmap_sem);
+		mm_read_unlock(mm);
 		goto out_nolock;
 	}
 
@@ -994,7 +994,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	if (!pmd) {
 		result = SCAN_PMD_NULL;
 		mem_cgroup_cancel_charge(new_page, memcg, true);
-		up_read(&mm->mmap_sem);
+		mm_read_unlock(mm);
 		goto out_nolock;
 	}
 
@@ -1005,17 +1005,17 @@ static void collapse_huge_page(struct mm_struct *mm,
 	 */
 	if (!__collapse_huge_page_swapin(mm, vma, address, pmd, referenced)) {
 		mem_cgroup_cancel_charge(new_page, memcg, true);
-		up_read(&mm->mmap_sem);
+		mm_read_unlock(mm);
 		goto out_nolock;
 	}
 
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	/*
 	 * Prevent all access to pagetables with the exception of
 	 * gup_fast later handled by the ptep_clear_flush and the VM
 	 * handled by the anon_vma lock + PG_lock.
 	 */
-	down_write(&mm->mmap_sem);
+	mm_write_lock(mm);
 	result = SCAN_ANY_PROCESS;
 	if (!mmget_still_valid(mm))
 		goto out;
@@ -1103,7 +1103,7 @@ static void collapse_huge_page(struct mm_struct *mm,
 	khugepaged_pages_collapsed++;
 	result = SCAN_SUCCEED;
 out_up_write:
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 out_nolock:
 	trace_mm_collapse_huge_page(mm, isolated, result);
 	return;
@@ -1399,7 +1399,7 @@ static int khugepaged_collapse_pte_mapped_thps(struct mm_slot *mm_slot)
 	if (likely(mm_slot->nr_pte_mapped_thp == 0))
 		return 0;
 
-	if (!down_write_trylock(&mm->mmap_sem))
+	if (!mm_write_trylock(mm))
 		return -EBUSY;
 
 	if (unlikely(khugepaged_test_exit(mm)))
@@ -1410,7 +1410,7 @@ static int khugepaged_collapse_pte_mapped_thps(struct mm_slot *mm_slot)
 
 out:
 	mm_slot->nr_pte_mapped_thp = 0;
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 	return 0;
 }
 
@@ -1455,12 +1455,12 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
 		 * mmap_sem while holding page lock. Fault path does it in
 		 * reverse order. Trylock is a way to avoid deadlock.
 		 */
-		if (down_write_trylock(&vma->vm_mm->mmap_sem)) {
+		if (mm_write_trylock(vma->vm_mm)) {
 			spinlock_t *ptl = pmd_lock(vma->vm_mm, pmd);
 			/* assume page table is clear */
 			_pmd = pmdp_collapse_flush(vma, addr, pmd);
 			spin_unlock(ptl);
-			up_write(&vma->vm_mm->mmap_sem);
+			mm_write_unlock(vma->vm_mm);
 			mm_dec_nr_ptes(vma->vm_mm);
 			pte_free(vma->vm_mm, pmd_pgtable(_pmd));
 		} else {
@@ -1947,7 +1947,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 	 * the next mm on the list.
 	 */
 	vma = NULL;
-	if (unlikely(!down_read_trylock(&mm->mmap_sem)))
+	if (unlikely(!mm_read_trylock(mm)))
 		goto breakouterloop_mmap_sem;
 	if (likely(!khugepaged_test_exit(mm)))
 		vma = find_vma(mm, khugepaged_scan.address);
@@ -1994,7 +1994,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 				    && !shmem_huge_enabled(vma))
 					goto skip;
 				file = get_file(vma->vm_file);
-				up_read(&mm->mmap_sem);
+				mm_read_unlock(mm);
 				ret = 1;
 				khugepaged_scan_file(mm, file, pgoff, hpage);
 				fput(file);
@@ -2014,7 +2014,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 		}
 	}
 breakouterloop:
-	up_read(&mm->mmap_sem); /* exit_mmap will destroy ptes after this */
+	mm_read_unlock(mm); /* exit_mmap will destroy ptes after this */
 breakouterloop_mmap_sem:
 
 	spin_lock(&khugepaged_mm_lock);
diff --git mm/ksm.c mm/ksm.c
index d17c7d57d0d8..65592828f40f 100644
--- mm/ksm.c
+++ mm/ksm.c
@@ -542,11 +542,11 @@ static void break_cow(struct rmap_item *rmap_item)
 	 */
 	put_anon_vma(rmap_item->anon_vma);
 
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 	vma = find_mergeable_vma(mm, addr);
 	if (vma)
 		break_ksm(vma, addr);
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 }
 
 static struct page *get_mergeable_page(struct rmap_item *rmap_item)
@@ -556,7 +556,7 @@ static struct page *get_mergeable_page(struct rmap_item *rmap_item)
 	struct vm_area_struct *vma;
 	struct page *page;
 
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 	vma = find_mergeable_vma(mm, addr);
 	if (!vma)
 		goto out;
@@ -572,7 +572,7 @@ static struct page *get_mergeable_page(struct rmap_item *rmap_item)
 out:
 		page = NULL;
 	}
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	return page;
 }
 
@@ -976,7 +976,7 @@ static int unmerge_and_remove_all_rmap_items(void)
 	for (mm_slot = ksm_scan.mm_slot;
 			mm_slot != &ksm_mm_head; mm_slot = ksm_scan.mm_slot) {
 		mm = mm_slot->mm;
-		down_read(&mm->mmap_sem);
+		mm_read_lock(mm);
 		for (vma = mm->mmap; vma; vma = vma->vm_next) {
 			if (ksm_test_exit(mm))
 				break;
@@ -989,7 +989,7 @@ static int unmerge_and_remove_all_rmap_items(void)
 		}
 
 		remove_trailing_rmap_items(mm_slot, &mm_slot->rmap_list);
-		up_read(&mm->mmap_sem);
+		mm_read_unlock(mm);
 
 		spin_lock(&ksm_mmlist_lock);
 		ksm_scan.mm_slot = list_entry(mm_slot->mm_list.next,
@@ -1012,7 +1012,7 @@ static int unmerge_and_remove_all_rmap_items(void)
 	return 0;
 
 error:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	spin_lock(&ksm_mmlist_lock);
 	ksm_scan.mm_slot = &ksm_mm_head;
 	spin_unlock(&ksm_mmlist_lock);
@@ -1280,7 +1280,7 @@ static int try_to_merge_with_ksm_page(struct rmap_item *rmap_item,
 	struct vm_area_struct *vma;
 	int err = -EFAULT;
 
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 	vma = find_mergeable_vma(mm, rmap_item->address);
 	if (!vma)
 		goto out;
@@ -1296,7 +1296,7 @@ static int try_to_merge_with_ksm_page(struct rmap_item *rmap_item,
 	rmap_item->anon_vma = vma->anon_vma;
 	get_anon_vma(vma->anon_vma);
 out:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	return err;
 }
 
@@ -2110,11 +2110,11 @@ static void cmp_and_merge_page(struct page *page, struct rmap_item *rmap_item)
 	if (ksm_use_zero_pages && (checksum == zero_checksum)) {
 		struct vm_area_struct *vma;
 
-		down_read(&mm->mmap_sem);
+		mm_read_lock(mm);
 		vma = find_mergeable_vma(mm, rmap_item->address);
 		err = try_to_merge_one_page(vma, page,
 					    ZERO_PAGE(rmap_item->address));
-		up_read(&mm->mmap_sem);
+		mm_read_unlock(mm);
 		/*
 		 * In case of failure, the page was not really empty, so we
 		 * need to continue. Otherwise we're done.
@@ -2277,7 +2277,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
 	}
 
 	mm = slot->mm;
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 	if (ksm_test_exit(mm))
 		vma = NULL;
 	else
@@ -2311,7 +2311,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
 					ksm_scan.address += PAGE_SIZE;
 				} else
 					put_page(*page);
-				up_read(&mm->mmap_sem);
+				mm_read_unlock(mm);
 				return rmap_item;
 			}
 			put_page(*page);
@@ -2349,10 +2349,10 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
 
 		free_mm_slot(slot);
 		clear_bit(MMF_VM_MERGEABLE, &mm->flags);
-		up_read(&mm->mmap_sem);
+		mm_read_unlock(mm);
 		mmdrop(mm);
 	} else {
-		up_read(&mm->mmap_sem);
+		mm_read_unlock(mm);
 		/*
 		 * up_read(&mm->mmap_sem) first because after
 		 * spin_unlock(&ksm_mmlist_lock) run, the "mm" may
@@ -2552,8 +2552,8 @@ void __ksm_exit(struct mm_struct *mm)
 		clear_bit(MMF_VM_MERGEABLE, &mm->flags);
 		mmdrop(mm);
 	} else if (mm_slot) {
-		down_write(&mm->mmap_sem);
-		up_write(&mm->mmap_sem);
+		mm_write_lock(mm);
+		mm_write_unlock(mm);
 	}
 }
 
diff --git mm/madvise.c mm/madvise.c
index bcdb6a042787..2c48ea26eb8a 100644
--- mm/madvise.c
+++ mm/madvise.c
@@ -288,12 +288,12 @@ static long madvise_willneed(struct vm_area_struct *vma,
 	 */
 	*prev = NULL;	/* tell sys_madvise we drop mmap_sem */
 	get_file(file);
-	up_read(&current->mm->mmap_sem);
+	mm_read_unlock(current->mm);
 	offset = (loff_t)(start - vma->vm_start)
 			+ ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
 	vfs_fadvise(file, offset, end - start, POSIX_FADV_WILLNEED);
 	fput(file);
-	down_read(&current->mm->mmap_sem);
+	mm_read_lock(current->mm);
 	return 0;
 }
 
@@ -763,7 +763,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
 	if (!userfaultfd_remove(vma, start, end)) {
 		*prev = NULL; /* mmap_sem has been dropped, prev is stale */
 
-		down_read(&current->mm->mmap_sem);
+		mm_read_lock(current->mm);
 		vma = find_vma(current->mm, start);
 		if (!vma)
 			return -ENOMEM;
@@ -845,13 +845,13 @@ static long madvise_remove(struct vm_area_struct *vma,
 	get_file(f);
 	if (userfaultfd_remove(vma, start, end)) {
 		/* mmap_sem was not released by userfaultfd_remove() */
-		up_read(&current->mm->mmap_sem);
+		mm_read_unlock(current->mm);
 	}
 	error = vfs_fallocate(f,
 				FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
 				offset, end - start);
 	fput(f);
-	down_read(&current->mm->mmap_sem);
+	mm_read_lock(current->mm);
 	return error;
 }
 
@@ -1082,10 +1082,10 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 
 	write = madvise_need_mmap_write(behavior);
 	if (write) {
-		if (down_write_killable(&current->mm->mmap_sem))
+		if (mm_write_lock_killable(current->mm))
 			return -EINTR;
 	} else {
-		down_read(&current->mm->mmap_sem);
+		mm_read_lock(current->mm);
 	}
 
 	/*
@@ -1135,9 +1135,9 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 out:
 	blk_finish_plug(&plug);
 	if (write)
-		up_write(&current->mm->mmap_sem);
+		mm_write_unlock(current->mm);
 	else
-		up_read(&current->mm->mmap_sem);
+		mm_read_unlock(current->mm);
 
 	return error;
 }
diff --git mm/memcontrol.c mm/memcontrol.c
index 6c83cf4ed970..98ededf5c764 100644
--- mm/memcontrol.c
+++ mm/memcontrol.c
@@ -5533,9 +5533,9 @@ static unsigned long mem_cgroup_count_precharge(struct mm_struct *mm)
 {
 	unsigned long precharge;
 
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 	walk_page_range(mm, 0, mm->highest_vm_end, &precharge_walk_ops, NULL);
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 	precharge = mc.precharge;
 	mc.precharge = 0;
@@ -5818,7 +5818,7 @@ static void mem_cgroup_move_charge(void)
 	atomic_inc(&mc.from->moving_account);
 	synchronize_rcu();
 retry:
-	if (unlikely(!down_read_trylock(&mc.mm->mmap_sem))) {
+	if (unlikely(!mm_read_trylock(mc.mm))) {
 		/*
 		 * Someone who are holding the mmap_sem might be waiting in
 		 * waitq. So we cancel all extra charges, wake up all waiters,
@@ -5837,7 +5837,7 @@ static void mem_cgroup_move_charge(void)
 	walk_page_range(mc.mm, 0, mc.mm->highest_vm_end, &charge_walk_ops,
 			NULL);
 
-	up_read(&mc.mm->mmap_sem);
+	mm_read_unlock(mc.mm);
 	atomic_dec(&mc.from->moving_account);
 }
 
diff --git mm/memory.c mm/memory.c
index 45442d9a4f52..45b42fa02a2e 100644
--- mm/memory.c
+++ mm/memory.c
@@ -1202,7 +1202,7 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
 		next = pud_addr_end(addr, end);
 		if (pud_trans_huge(*pud) || pud_devmap(*pud)) {
 			if (next - addr != HPAGE_PUD_SIZE) {
-				VM_BUG_ON_VMA(!rwsem_is_locked(&tlb->mm->mmap_sem), vma);
+				VM_BUG_ON_VMA(!mm_is_locked(tlb->mm), vma);
 				split_huge_pud(vma, pud, addr);
 			} else if (zap_huge_pud(tlb, vma, pud, addr))
 				goto next;
@@ -1507,7 +1507,7 @@ int vm_insert_page(struct vm_area_struct *vma, unsigned long addr,
 	if (!page_count(page))
 		return -EINVAL;
 	if (!(vma->vm_flags & VM_MIXEDMAP)) {
-		BUG_ON(down_read_trylock(&vma->vm_mm->mmap_sem));
+		BUG_ON(mm_read_trylock(vma->vm_mm));
 		BUG_ON(vma->vm_flags & VM_PFNMAP);
 		vma->vm_flags |= VM_MIXEDMAP;
 	}
@@ -4447,7 +4447,7 @@ int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
 	void *old_buf = buf;
 	int write = gup_flags & FOLL_WRITE;
 
-	if (down_read_killable(&mm->mmap_sem))
+	if (mm_read_lock_killable(mm))
 		return 0;
 
 	/* ignore errors, just check how much was successfully transferred */
@@ -4498,7 +4498,7 @@ int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
 		buf += bytes;
 		addr += bytes;
 	}
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 	return buf - old_buf;
 }
@@ -4555,7 +4555,7 @@ void print_vma_addr(char *prefix, unsigned long ip)
 	/*
 	 * we might be running from an atomic context so we cannot sleep
 	 */
-	if (!down_read_trylock(&mm->mmap_sem))
+	if (!mm_read_trylock(mm))
 		return;
 
 	vma = find_vma(mm, ip);
@@ -4574,7 +4574,7 @@ void print_vma_addr(char *prefix, unsigned long ip)
 			free_page((unsigned long)buf);
 		}
 	}
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 }
 
 #if defined(CONFIG_PROVE_LOCKING) || defined(CONFIG_DEBUG_ATOMIC_SLEEP)
diff --git mm/mempolicy.c mm/mempolicy.c
index b2920ae87a61..d64031767f8a 100644
--- mm/mempolicy.c
+++ mm/mempolicy.c
@@ -379,10 +379,10 @@ void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new)
 {
 	struct vm_area_struct *vma;
 
-	down_write(&mm->mmap_sem);
+	mm_write_lock(mm);
 	for (vma = mm->mmap; vma; vma = vma->vm_next)
 		mpol_rebind_policy(vma->vm_policy, new);
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 }
 
 static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
@@ -878,7 +878,7 @@ static int lookup_node(struct mm_struct *mm, unsigned long addr)
 		put_page(p);
 	}
 	if (locked)
-		up_read(&mm->mmap_sem);
+		mm_read_unlock(mm);
 	return err;
 }
 
@@ -911,10 +911,10 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
 		 * vma/shared policy at addr is NULL.  We
 		 * want to return MPOL_DEFAULT in this case.
 		 */
-		down_read(&mm->mmap_sem);
+		mm_read_lock(mm);
 		vma = find_vma_intersection(mm, addr, addr+1);
 		if (!vma) {
-			up_read(&mm->mmap_sem);
+			mm_read_unlock(mm);
 			return -EFAULT;
 		}
 		if (vma->vm_ops && vma->vm_ops->get_policy)
@@ -973,7 +973,7 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
  out:
 	mpol_cond_put(pol);
 	if (vma)
-		up_read(&mm->mmap_sem);
+		mm_read_unlock(mm);
 	if (pol_refcount)
 		mpol_put(pol_refcount);
 	return err;
@@ -1082,7 +1082,7 @@ int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from,
 	if (err)
 		return err;
 
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 
 	/*
 	 * Find a 'source' bit set in 'tmp' whose corresponding 'dest'
@@ -1163,7 +1163,7 @@ int do_migrate_pages(struct mm_struct *mm, const nodemask_t *from,
 		if (err < 0)
 			break;
 	}
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	if (err < 0)
 		return err;
 	return busy;
@@ -1286,12 +1286,12 @@ static long do_mbind(unsigned long start, unsigned long len,
 	{
 		NODEMASK_SCRATCH(scratch);
 		if (scratch) {
-			down_write(&mm->mmap_sem);
+			mm_write_lock(mm);
 			task_lock(current);
 			err = mpol_set_nodemask(new, nmask, scratch);
 			task_unlock(current);
 			if (err)
-				up_write(&mm->mmap_sem);
+				mm_write_unlock(mm);
 		} else
 			err = -ENOMEM;
 		NODEMASK_SCRATCH_FREE(scratch);
@@ -1328,7 +1328,7 @@ static long do_mbind(unsigned long start, unsigned long len,
 			putback_movable_pages(&pagelist);
 	}
 
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 mpol_out:
 	mpol_put(new);
 	return err;
diff --git mm/migrate.c mm/migrate.c
index 86873b6f38a7..3f3f22e9551e 100644
--- mm/migrate.c
+++ mm/migrate.c
@@ -1526,7 +1526,7 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
 	unsigned int follflags;
 	int err;
 
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 	err = -EFAULT;
 	vma = find_vma(mm, addr);
 	if (!vma || addr < vma->vm_start || !vma_migratable(vma))
@@ -1579,7 +1579,7 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
 	 */
 	put_page(page);
 out:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	return err;
 }
 
@@ -1690,7 +1690,7 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
 {
 	unsigned long i;
 
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 
 	for (i = 0; i < nr_pages; i++) {
 		unsigned long addr = (unsigned long)(*pages);
@@ -1717,7 +1717,7 @@ static void do_pages_stat_array(struct mm_struct *mm, unsigned long nr_pages,
 		status++;
 	}
 
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 }
 
 /*
diff --git mm/mincore.c mm/mincore.c
index 49b6fa2f6aa1..2bf0b3a0fff9 100644
--- mm/mincore.c
+++ mm/mincore.c
@@ -283,9 +283,9 @@ SYSCALL_DEFINE3(mincore, unsigned long, start, size_t, len,
 		 * Do at most PAGE_SIZE entries per iteration, due to
 		 * the temporary buffer size.
 		 */
-		down_read(&current->mm->mmap_sem);
+		mm_read_lock(current->mm);
 		retval = do_mincore(start, min(pages, PAGE_SIZE), tmp);
-		up_read(&current->mm->mmap_sem);
+		mm_read_unlock(current->mm);
 
 		if (retval <= 0)
 			break;
diff --git mm/mlock.c mm/mlock.c
index a72c1eeded77..2a435eb98a58 100644
--- mm/mlock.c
+++ mm/mlock.c
@@ -686,7 +686,7 @@ static __must_check int do_mlock(unsigned long start, size_t len, vm_flags_t fla
 	lock_limit >>= PAGE_SHIFT;
 	locked = len >> PAGE_SHIFT;
 
-	if (down_write_killable(&current->mm->mmap_sem))
+	if (mm_write_lock_killable(current->mm))
 		return -EINTR;
 
 	locked += current->mm->locked_vm;
@@ -705,7 +705,7 @@ static __must_check int do_mlock(unsigned long start, size_t len, vm_flags_t fla
 	if ((locked <= lock_limit) || capable(CAP_IPC_LOCK))
 		error = apply_vma_lock_flags(start, len, flags);
 
-	up_write(&current->mm->mmap_sem);
+	mm_write_unlock(current->mm);
 	if (error)
 		return error;
 
@@ -742,10 +742,10 @@ SYSCALL_DEFINE2(munlock, unsigned long, start, size_t, len)
 	len = PAGE_ALIGN(len + (offset_in_page(start)));
 	start &= PAGE_MASK;
 
-	if (down_write_killable(&current->mm->mmap_sem))
+	if (mm_write_lock_killable(current->mm))
 		return -EINTR;
 	ret = apply_vma_lock_flags(start, len, 0);
-	up_write(&current->mm->mmap_sem);
+	mm_write_unlock(current->mm);
 
 	return ret;
 }
@@ -811,14 +811,14 @@ SYSCALL_DEFINE1(mlockall, int, flags)
 	lock_limit = rlimit(RLIMIT_MEMLOCK);
 	lock_limit >>= PAGE_SHIFT;
 
-	if (down_write_killable(&current->mm->mmap_sem))
+	if (mm_write_lock_killable(current->mm))
 		return -EINTR;
 
 	ret = -ENOMEM;
 	if (!(flags & MCL_CURRENT) || (current->mm->total_vm <= lock_limit) ||
 	    capable(CAP_IPC_LOCK))
 		ret = apply_mlockall_flags(flags);
-	up_write(&current->mm->mmap_sem);
+	mm_write_unlock(current->mm);
 	if (!ret && (flags & MCL_CURRENT))
 		mm_populate(0, TASK_SIZE);
 
@@ -829,10 +829,10 @@ SYSCALL_DEFINE0(munlockall)
 {
 	int ret;
 
-	if (down_write_killable(&current->mm->mmap_sem))
+	if (mm_write_lock_killable(current->mm))
 		return -EINTR;
 	ret = apply_mlockall_flags(0);
-	up_write(&current->mm->mmap_sem);
+	mm_write_unlock(current->mm);
 	return ret;
 }
 
diff --git mm/mmap.c mm/mmap.c
index 71e4ffc83bcd..0f95300c2788 100644
--- mm/mmap.c
+++ mm/mmap.c
@@ -197,7 +197,7 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
 
 	brk = untagged_addr(brk);
 
-	if (down_write_killable(&mm->mmap_sem))
+	if (mm_write_lock_killable(mm))
 		return -EINTR;
 
 	origbrk = mm->brk;
@@ -271,9 +271,9 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
 success:
 	populate = newbrk > oldbrk && (mm->def_flags & VM_LOCKED) != 0;
 	if (downgraded)
-		up_read(&mm->mmap_sem);
+		mm_read_unlock(mm);
 	else
-		up_write(&mm->mmap_sem);
+		mm_write_unlock(mm);
 	userfaultfd_unmap_complete(mm, &uf);
 	if (populate)
 		mm_populate(oldbrk, newbrk - oldbrk);
@@ -281,7 +281,7 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
 
 out:
 	retval = origbrk;
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 	return retval;
 }
 
@@ -2812,7 +2812,7 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 	detach_vmas_to_be_unmapped(mm, vma, prev, end);
 
 	if (downgrade)
-		downgrade_write(&mm->mmap_sem);
+		mm_downgrade_write_lock(mm);
 
 	unmap_region(mm, vma, prev, start, end);
 
@@ -2834,7 +2834,7 @@ static int __vm_munmap(unsigned long start, size_t len, bool downgrade)
 	struct mm_struct *mm = current->mm;
 	LIST_HEAD(uf);
 
-	if (down_write_killable(&mm->mmap_sem))
+	if (mm_write_lock_killable(mm))
 		return -EINTR;
 
 	ret = __do_munmap(mm, start, len, &uf, downgrade);
@@ -2844,10 +2844,10 @@ static int __vm_munmap(unsigned long start, size_t len, bool downgrade)
 	 * it to 0 before return.
 	 */
 	if (ret == 1) {
-		up_read(&mm->mmap_sem);
+		mm_read_unlock(mm);
 		ret = 0;
 	} else
-		up_write(&mm->mmap_sem);
+		mm_write_unlock(mm);
 
 	userfaultfd_unmap_complete(mm, &uf);
 	return ret;
@@ -2895,7 +2895,7 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
 	if (pgoff + (size >> PAGE_SHIFT) < pgoff)
 		return ret;
 
-	if (down_write_killable(&mm->mmap_sem))
+	if (mm_write_lock_killable(mm))
 		return -EINTR;
 
 	vma = find_vma(mm, start);
@@ -2958,7 +2958,7 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
 			prot, flags, pgoff, &populate, NULL);
 	fput(file);
 out:
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 	if (populate)
 		mm_populate(ret, populate);
 	if (!IS_ERR_VALUE(ret))
@@ -3058,12 +3058,12 @@ int vm_brk_flags(unsigned long addr, unsigned long request, unsigned long flags)
 	if (!len)
 		return 0;
 
-	if (down_write_killable(&mm->mmap_sem))
+	if (mm_write_lock_killable(mm))
 		return -EINTR;
 
 	ret = do_brk_flags(addr, len, flags, &uf);
 	populate = ((mm->def_flags & VM_LOCKED) != 0);
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 	userfaultfd_unmap_complete(mm, &uf);
 	if (populate && !ret)
 		mm_populate(addr, len);
@@ -3107,8 +3107,8 @@ void exit_mmap(struct mm_struct *mm)
 		(void)__oom_reap_task_mm(mm);
 
 		set_bit(MMF_OOM_SKIP, &mm->flags);
-		down_write(&mm->mmap_sem);
-		up_write(&mm->mmap_sem);
+		mm_write_lock(mm);
+		mm_write_unlock(mm);
 	}
 
 	if (mm->locked_vm) {
@@ -3532,7 +3532,7 @@ int mm_take_all_locks(struct mm_struct *mm)
 	struct vm_area_struct *vma;
 	struct anon_vma_chain *avc;
 
-	BUG_ON(down_read_trylock(&mm->mmap_sem));
+	BUG_ON(mm_read_trylock(mm));
 
 	mutex_lock(&mm_all_locks_mutex);
 
@@ -3612,7 +3612,7 @@ void mm_drop_all_locks(struct mm_struct *mm)
 	struct vm_area_struct *vma;
 	struct anon_vma_chain *avc;
 
-	BUG_ON(down_read_trylock(&mm->mmap_sem));
+	BUG_ON(mm_read_trylock(mm));
 	BUG_ON(!mutex_is_locked(&mm_all_locks_mutex));
 
 	for (vma = mm->mmap; vma; vma = vma->vm_next) {
diff --git mm/mmu_notifier.c mm/mmu_notifier.c
index f76ea05b1cb0..fcfaddc2d2f0 100644
--- mm/mmu_notifier.c
+++ mm/mmu_notifier.c
@@ -665,9 +665,9 @@ int mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
 {
 	int ret;
 
-	down_write(&mm->mmap_sem);
+	mm_write_lock(mm);
 	ret = __mmu_notifier_register(mn, mm);
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(mmu_notifier_register);
diff --git mm/mprotect.c mm/mprotect.c
index 7a8e84f86831..fce136e67c6b 100644
--- mm/mprotect.c
+++ mm/mprotect.c
@@ -478,7 +478,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
 
 	reqprot = prot;
 
-	if (down_write_killable(&current->mm->mmap_sem))
+	if (mm_write_lock_killable(current->mm))
 		return -EINTR;
 
 	/*
@@ -568,7 +568,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
 		prot = reqprot;
 	}
 out:
-	up_write(&current->mm->mmap_sem);
+	mm_write_unlock(current->mm);
 	return error;
 }
 
@@ -598,7 +598,7 @@ SYSCALL_DEFINE2(pkey_alloc, unsigned long, flags, unsigned long, init_val)
 	if (init_val & ~PKEY_ACCESS_MASK)
 		return -EINVAL;
 
-	down_write(&current->mm->mmap_sem);
+	mm_write_lock(current->mm);
 	pkey = mm_pkey_alloc(current->mm);
 
 	ret = -ENOSPC;
@@ -612,7 +612,7 @@ SYSCALL_DEFINE2(pkey_alloc, unsigned long, flags, unsigned long, init_val)
 	}
 	ret = pkey;
 out:
-	up_write(&current->mm->mmap_sem);
+	mm_write_unlock(current->mm);
 	return ret;
 }
 
@@ -620,9 +620,9 @@ SYSCALL_DEFINE1(pkey_free, int, pkey)
 {
 	int ret;
 
-	down_write(&current->mm->mmap_sem);
+	mm_write_lock(current->mm);
 	ret = mm_pkey_free(current->mm, pkey);
-	up_write(&current->mm->mmap_sem);
+	mm_write_unlock(current->mm);
 
 	/*
 	 * We could provie warnings or errors if any VMA still
diff --git mm/mremap.c mm/mremap.c
index 122938dcec15..7793d6c51ac2 100644
--- mm/mremap.c
+++ mm/mremap.c
@@ -629,7 +629,7 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
 	if (!new_len)
 		return ret;
 
-	if (down_write_killable(&current->mm->mmap_sem))
+	if (mm_write_lock_killable(current->mm))
 		return -EINTR;
 
 	if (flags & MREMAP_FIXED) {
@@ -720,9 +720,9 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
 		locked = 0;
 	}
 	if (downgraded)
-		up_read(&current->mm->mmap_sem);
+		mm_read_unlock(current->mm);
 	else
-		up_write(&current->mm->mmap_sem);
+		mm_write_unlock(current->mm);
 	if (locked && new_len > old_len)
 		mm_populate(new_addr + old_len, new_len - old_len);
 	userfaultfd_unmap_complete(mm, &uf_unmap_early);
diff --git mm/msync.c mm/msync.c
index c3bd3e75f687..f7c3acd3a69f 100644
--- mm/msync.c
+++ mm/msync.c
@@ -57,7 +57,7 @@ SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags)
 	 * If the interval [start,end) covers some unmapped address ranges,
 	 * just ignore them, but return -ENOMEM at the end.
 	 */
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 	vma = find_vma(mm, start);
 	for (;;) {
 		struct file *file;
@@ -88,12 +88,12 @@ SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags)
 		if ((flags & MS_SYNC) && file &&
 				(vma->vm_flags & VM_SHARED)) {
 			get_file(file);
-			up_read(&mm->mmap_sem);
+			mm_read_unlock(mm);
 			error = vfs_fsync_range(file, fstart, fend, 1);
 			fput(file);
 			if (error || start >= end)
 				goto out;
-			down_read(&mm->mmap_sem);
+			mm_read_lock(mm);
 			vma = find_vma(mm, start);
 		} else {
 			if (start >= end) {
@@ -104,7 +104,7 @@ SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags)
 		}
 	}
 out_unlock:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 out:
 	return error ? : unmapped_error;
 }
diff --git mm/nommu.c mm/nommu.c
index bd2b4e5ef144..c137db1923bd 100644
--- mm/nommu.c
+++ mm/nommu.c
@@ -163,11 +163,11 @@ static void *__vmalloc_user_flags(unsigned long size, gfp_t flags)
 	if (ret) {
 		struct vm_area_struct *vma;
 
-		down_write(&current->mm->mmap_sem);
+		mm_write_lock(current->mm);
 		vma = find_vma(current->mm, (unsigned long)ret);
 		if (vma)
 			vma->vm_flags |= VM_USERMAP;
-		up_write(&current->mm->mmap_sem);
+		mm_write_unlock(current->mm);
 	}
 
 	return ret;
@@ -1548,9 +1548,9 @@ int vm_munmap(unsigned long addr, size_t len)
 	struct mm_struct *mm = current->mm;
 	int ret;
 
-	down_write(&mm->mmap_sem);
+	mm_write_lock(mm);
 	ret = do_munmap(mm, addr, len, NULL);
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 	return ret;
 }
 EXPORT_SYMBOL(vm_munmap);
@@ -1637,9 +1637,9 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
 {
 	unsigned long ret;
 
-	down_write(&current->mm->mmap_sem);
+	mm_write_lock(current->mm);
 	ret = do_mremap(addr, old_len, new_len, flags, new_addr);
-	up_write(&current->mm->mmap_sem);
+	mm_write_unlock(current->mm);
 	return ret;
 }
 
@@ -1711,7 +1711,7 @@ int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
 	struct vm_area_struct *vma;
 	int write = gup_flags & FOLL_WRITE;
 
-	if (down_read_killable(&mm->mmap_sem))
+	if (mm_read_lock_killable(mm))
 		return 0;
 
 	/* the access must start within one of the target process's mappings */
@@ -1734,7 +1734,7 @@ int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
 		len = 0;
 	}
 
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 	return len;
 }
diff --git mm/oom_kill.c mm/oom_kill.c
index d58c481b3df8..2acc196cac84 100644
--- mm/oom_kill.c
+++ mm/oom_kill.c
@@ -568,7 +568,7 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 {
 	bool ret = true;
 
-	if (!down_read_trylock(&mm->mmap_sem)) {
+	if (!mm_read_trylock(mm)) {
 		trace_skip_task_reaping(tsk->pid);
 		return false;
 	}
@@ -599,7 +599,7 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 out_finish:
 	trace_finish_task_reaping(tsk->pid);
 out_unlock:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 	return ret;
 }
diff --git mm/process_vm_access.c mm/process_vm_access.c
index 357aa7bef6c0..12f3a7631682 100644
--- mm/process_vm_access.c
+++ mm/process_vm_access.c
@@ -105,11 +105,11 @@ static int process_vm_rw_single_vec(unsigned long addr,
 		 * access remotely because task/mm might not
 		 * current/current->mm
 		 */
-		down_read(&mm->mmap_sem);
+		mm_read_lock(mm);
 		pages = get_user_pages_remote(task, mm, pa, pages, flags,
 					      process_pages, NULL, &locked);
 		if (locked)
-			up_read(&mm->mmap_sem);
+			mm_read_unlock(mm);
 		if (pages <= 0)
 			return -EFAULT;
 
diff --git mm/swapfile.c mm/swapfile.c
index bb3261d45b6a..899f51e11ec5 100644
--- mm/swapfile.c
+++ mm/swapfile.c
@@ -2070,7 +2070,7 @@ static int unuse_mm(struct mm_struct *mm, unsigned int type,
 	struct vm_area_struct *vma;
 	int ret = 0;
 
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 	for (vma = mm->mmap; vma; vma = vma->vm_next) {
 		if (vma->anon_vma) {
 			ret = unuse_vma(vma, type, frontswap,
@@ -2080,7 +2080,7 @@ static int unuse_mm(struct mm_struct *mm, unsigned int type,
 		}
 		cond_resched();
 	}
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	return ret;
 }
 
diff --git mm/userfaultfd.c mm/userfaultfd.c
index 1b0d7abad1d4..2e27f1276a92 100644
--- mm/userfaultfd.c
+++ mm/userfaultfd.c
@@ -226,7 +226,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 	 * feature is not supported.
 	 */
 	if (zeropage) {
-		up_read(&dst_mm->mmap_sem);
+		mm_read_unlock(dst_mm);
 		return -EINVAL;
 	}
 
@@ -306,7 +306,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		cond_resched();
 
 		if (unlikely(err == -ENOENT)) {
-			up_read(&dst_mm->mmap_sem);
+			mm_read_unlock(dst_mm);
 			BUG_ON(!page);
 
 			err = copy_huge_page_from_user(page,
@@ -317,7 +317,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 				err = -EFAULT;
 				goto out;
 			}
-			down_read(&dst_mm->mmap_sem);
+			mm_read_lock(dst_mm);
 
 			dst_vma = NULL;
 			goto retry;
@@ -337,7 +337,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 	}
 
 out_unlock:
-	up_read(&dst_mm->mmap_sem);
+	mm_read_unlock(dst_mm);
 out:
 	if (page) {
 		/*
@@ -471,7 +471,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 	copied = 0;
 	page = NULL;
 retry:
-	down_read(&dst_mm->mmap_sem);
+	mm_read_lock(dst_mm);
 
 	/*
 	 * If memory mappings are changing because of non-cooperative
@@ -561,7 +561,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 		if (unlikely(err == -ENOENT)) {
 			void *page_kaddr;
 
-			up_read(&dst_mm->mmap_sem);
+			mm_read_unlock(dst_mm);
 			BUG_ON(!page);
 
 			page_kaddr = kmap(page);
@@ -590,7 +590,7 @@ static __always_inline ssize_t __mcopy_atomic(struct mm_struct *dst_mm,
 	}
 
 out_unlock:
-	up_read(&dst_mm->mmap_sem);
+	mm_read_unlock(dst_mm);
 out:
 	if (page)
 		put_page(page);
diff --git mm/util.c mm/util.c
index 988d11e6c17c..511e442e7329 100644
--- mm/util.c
+++ mm/util.c
@@ -481,10 +481,10 @@ int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc)
 	if (pages == 0 || !mm)
 		return 0;
 
-	down_write(&mm->mmap_sem);
+	mm_write_lock(mm);
 	ret = __account_locked_vm(mm, pages, inc, current,
 				  capable(CAP_IPC_LOCK));
-	up_write(&mm->mmap_sem);
+	mm_write_unlock(mm);
 
 	return ret;
 }
@@ -501,11 +501,11 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
 
 	ret = security_mmap_file(file, prot, flag);
 	if (!ret) {
-		if (down_write_killable(&mm->mmap_sem))
+		if (mm_write_lock_killable(mm))
 			return -EINTR;
 		ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff,
 				    &populate, &uf);
-		up_write(&mm->mmap_sem);
+		mm_write_unlock(mm);
 		userfaultfd_unmap_complete(mm, &uf);
 		if (populate)
 			mm_populate(ret, populate);
diff --git net/ipv4/tcp.c net/ipv4/tcp.c
index a7d766e6390e..041d6585f97d 100644
--- net/ipv4/tcp.c
+++ net/ipv4/tcp.c
@@ -1755,7 +1755,7 @@ static int tcp_zerocopy_receive(struct sock *sk,
 
 	sock_rps_record_flow(sk);
 
-	down_read(&current->mm->mmap_sem);
+	mm_read_lock(current->mm);
 
 	ret = -EINVAL;
 	vma = find_vma(current->mm, address);
@@ -1817,7 +1817,7 @@ static int tcp_zerocopy_receive(struct sock *sk,
 		frags++;
 	}
 out:
-	up_read(&current->mm->mmap_sem);
+	mm_read_unlock(current->mm);
 	if (length) {
 		WRITE_ONCE(tp->copied_seq, seq);
 		tcp_rcv_space_adjust(sk);
diff --git net/xdp/xdp_umem.c net/xdp/xdp_umem.c
index 3049af269fbf..93d6c717987b 100644
--- net/xdp/xdp_umem.c
+++ net/xdp/xdp_umem.c
@@ -290,10 +290,10 @@ static int xdp_umem_pin_pages(struct xdp_umem *umem)
 	if (!umem->pgs)
 		return -ENOMEM;
 
-	down_read(&current->mm->mmap_sem);
+	mm_read_lock(current->mm);
 	npgs = get_user_pages(umem->address, umem->npgs,
 			      gup_flags | FOLL_LONGTERM, &umem->pgs[0], NULL);
-	up_read(&current->mm->mmap_sem);
+	mm_read_unlock(current->mm);
 
 	if (npgs != umem->npgs) {
 		if (npgs >= 0) {
diff --git virt/kvm/arm/mmu.c virt/kvm/arm/mmu.c
index 0b32a904a1bb..1b3923a6f199 100644
--- virt/kvm/arm/mmu.c
+++ virt/kvm/arm/mmu.c
@@ -975,7 +975,7 @@ void stage2_unmap_vm(struct kvm *kvm)
 	int idx;
 
 	idx = srcu_read_lock(&kvm->srcu);
-	down_read(&current->mm->mmap_sem);
+	mm_read_lock(current->mm);
 	spin_lock(&kvm->mmu_lock);
 
 	slots = kvm_memslots(kvm);
@@ -983,7 +983,7 @@ void stage2_unmap_vm(struct kvm *kvm)
 		stage2_unmap_memslot(kvm, memslot);
 
 	spin_unlock(&kvm->mmu_lock);
-	up_read(&current->mm->mmap_sem);
+	mm_read_unlock(current->mm);
 	srcu_read_unlock(&kvm->srcu, idx);
 }
 
@@ -1693,11 +1693,11 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	}
 
 	/* Let's check if we will get back a huge page backed by hugetlbfs */
-	down_read(&current->mm->mmap_sem);
+	mm_read_lock(current->mm);
 	vma = find_vma_intersection(current->mm, hva, hva + 1);
 	if (unlikely(!vma)) {
 		kvm_err("Failed to find VMA for hva 0x%lx\n", hva);
-		up_read(&current->mm->mmap_sem);
+		mm_read_unlock(current->mm);
 		return -EFAULT;
 	}
 
@@ -1719,7 +1719,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	if (vma_pagesize == PMD_SIZE ||
 	    (vma_pagesize == PUD_SIZE && kvm_stage2_has_pmd(kvm)))
 		gfn = (fault_ipa & huge_page_mask(hstate_vma(vma))) >> PAGE_SHIFT;
-	up_read(&current->mm->mmap_sem);
+	mm_read_unlock(current->mm);
 
 	/* We need minimum second+third level pages */
 	ret = mmu_topup_memory_cache(memcache, kvm_mmu_cache_min_pages(kvm),
@@ -2294,7 +2294,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 	    (kvm_phys_size(kvm) >> PAGE_SHIFT))
 		return -EFAULT;
 
-	down_read(&current->mm->mmap_sem);
+	mm_read_lock(current->mm);
 	/*
 	 * A memory region could potentially cover multiple VMAs, and any holes
 	 * between them, so iterate over all of them to find out if we can map
@@ -2353,7 +2353,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 		stage2_flush_memslot(kvm, memslot);
 	spin_unlock(&kvm->mmu_lock);
 out:
-	up_read(&current->mm->mmap_sem);
+	mm_read_unlock(current->mm);
 	return ret;
 }
 
diff --git virt/kvm/async_pf.c virt/kvm/async_pf.c
index 35305d6e68cc..ab72a1a5ac0b 100644
--- virt/kvm/async_pf.c
+++ virt/kvm/async_pf.c
@@ -74,11 +74,11 @@ static void async_pf_execute(struct work_struct *work)
 	 * mm and might be done in another context, so we must
 	 * access remotely.
 	 */
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 	get_user_pages_remote(NULL, mm, addr, 1, FOLL_WRITE, NULL, NULL,
 			&locked);
 	if (locked)
-		up_read(&mm->mmap_sem);
+		mm_read_unlock(mm);
 
 	kvm_async_page_present_sync(vcpu, apf);
 
diff --git virt/kvm/kvm_main.c virt/kvm/kvm_main.c
index 887051ded021..b1bb96c72efb 100644
--- virt/kvm/kvm_main.c
+++ virt/kvm/kvm_main.c
@@ -1417,7 +1417,7 @@ unsigned long kvm_host_page_size(struct kvm *kvm, gfn_t gfn)
 	if (kvm_is_error_hva(addr))
 		return PAGE_SIZE;
 
-	down_read(&current->mm->mmap_sem);
+	mm_read_lock(current->mm);
 	vma = find_vma(current->mm, addr);
 	if (!vma)
 		goto out;
@@ -1425,7 +1425,7 @@ unsigned long kvm_host_page_size(struct kvm *kvm, gfn_t gfn)
 	size = vma_kernel_pagesize(vma);
 
 out:
-	up_read(&current->mm->mmap_sem);
+	mm_read_unlock(current->mm);
 
 	return size;
 }
@@ -1680,7 +1680,7 @@ static kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
 	if (npages == 1)
 		return pfn;
 
-	down_read(&current->mm->mmap_sem);
+	mm_read_lock(current->mm);
 	if (npages == -EHWPOISON ||
 	      (!async && check_user_page_hwpoison(addr))) {
 		pfn = KVM_PFN_ERR_HWPOISON;
@@ -1704,7 +1704,7 @@ static kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool *async,
 		pfn = KVM_PFN_ERR_FAULT;
 	}
 exit:
-	up_read(&current->mm->mmap_sem);
+	mm_read_unlock(current->mm);
 	return pfn;
 }
 
-- 
2.25.0.341.g760bfbb309-goog



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 03/24] MM locking API: manual conversion of mmap_sem call sites missed by coccinelle
  2020-02-24 20:30 [RFC PATCH 00/24] Fine grained MM locking Michel Lespinasse
  2020-02-24 20:30 ` [RFC PATCH 01/24] MM locking API: initial implementation as rwsem wrappers Michel Lespinasse
  2020-02-24 20:30 ` [RFC PATCH 02/24] MM locking API: use coccinelle to convert mmap_sem rwsem call sites Michel Lespinasse
@ 2020-02-24 20:30 ` Michel Lespinasse
  2020-02-24 20:30 ` [RFC PATCH 04/24] MM locking API: add range arguments Michel Lespinasse
                   ` (21 subsequent siblings)
  24 siblings, 0 replies; 28+ messages in thread
From: Michel Lespinasse @ 2020-02-24 20:30 UTC (permalink / raw)
  To: Peter Zijlstra, Andrew Morton, Laurent Dufour, Vlastimil Babka,
	Matthew Wilcox, Liam R . Howlett, Jerome Glisse, Davidlohr Bueso,
	David Rientjes
  Cc: linux-mm, Michel Lespinasse

Convert the last few remaining mmap_sem rwsem calls to use the new
MM locking API. These were missed by coccinelle, presumably because
it does not support some of the preprocessor constructs used in
these files.
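
For illustration only (not part of this patch), the manual conversions
follow the same one-to-one mapping as the coccinelle-generated ones;
example_walk_vmas() below is a hypothetical caller showing the pattern:

	static int example_walk_vmas(struct mm_struct *mm)
	{
		struct vm_area_struct *vma;

		/* was: if (down_read_killable(&mm->mmap_sem)) */
		if (mm_read_lock_killable(mm))
			return -EINTR;
		for (vma = mm->mmap; vma; vma = vma->vm_next)
			; /* inspect vma while holding the mm read lock */
		/* was: up_read(&mm->mmap_sem); */
		mm_read_unlock(mm);
		return 0;
	}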

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 arch/mips/mm/fault.c           | 10 +++++-----
 arch/x86/kvm/mmu/paging_tmpl.h |  8 ++++----
 drivers/android/binder_alloc.c |  4 ++--
 fs/proc/base.c                 |  6 +++---
 4 files changed, 14 insertions(+), 14 deletions(-)

diff --git arch/mips/mm/fault.c arch/mips/mm/fault.c
index 1e8d00793784..58cfc3f5f659 100644
--- arch/mips/mm/fault.c
+++ arch/mips/mm/fault.c
@@ -97,7 +97,7 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, unsigned long write,
 	if (user_mode(regs))
 		flags |= FAULT_FLAG_USER;
 retry:
-	down_read(&mm->mmap_sem);
+	mm_read_lock(mm);
 	vma = find_vma(mm, address);
 	if (!vma)
 		goto bad_area;
@@ -191,7 +191,7 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, unsigned long write,
 		}
 	}
 
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	return;
 
 /*
@@ -199,7 +199,7 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, unsigned long write,
  * Fix it, but check if it's kernel or user first..
  */
 bad_area:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 bad_area_nosemaphore:
 	/* User mode accesses just cause a SIGSEGV */
@@ -251,14 +251,14 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, unsigned long write,
 	 * We ran out of memory, call the OOM killer, and return the userspace
 	 * (which will retry the fault, or kill us if we got oom-killed).
 	 */
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	if (!user_mode(regs))
 		goto no_context;
 	pagefault_out_of_memory();
 	return;
 
 do_sigbus:
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 
 	/* Kernel mode? Handle exceptions or die */
 	if (!user_mode(regs))
diff --git arch/x86/kvm/mmu/paging_tmpl.h arch/x86/kvm/mmu/paging_tmpl.h
index 97b21e7fd013..01b633e800b9 100644
--- arch/x86/kvm/mmu/paging_tmpl.h
+++ arch/x86/kvm/mmu/paging_tmpl.h
@@ -150,22 +150,22 @@ static int FNAME(cmpxchg_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 		unsigned long pfn;
 		unsigned long paddr;
 
-		down_read(&current->mm->mmap_sem);
+		mm_read_lock(current->mm);
 		vma = find_vma_intersection(current->mm, vaddr, vaddr + PAGE_SIZE);
 		if (!vma || !(vma->vm_flags & VM_PFNMAP)) {
-			up_read(&current->mm->mmap_sem);
+			mm_read_unlock(current->mm);
 			return -EFAULT;
 		}
 		pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
 		paddr = pfn << PAGE_SHIFT;
 		table = memremap(paddr, PAGE_SIZE, MEMREMAP_WB);
 		if (!table) {
-			up_read(&current->mm->mmap_sem);
+			mm_read_unlock(current->mm);
 			return -EFAULT;
 		}
 		ret = CMPXCHG(&table[index], orig_pte, new_pte);
 		memunmap(table);
-		up_read(&current->mm->mmap_sem);
+		mm_read_unlock(current->mm);
 	}
 
 	return (ret != orig_pte);
diff --git drivers/android/binder_alloc.c drivers/android/binder_alloc.c
index caddf155fcab..f607fa2d00c3 100644
--- drivers/android/binder_alloc.c
+++ drivers/android/binder_alloc.c
@@ -932,7 +932,7 @@ enum lru_status binder_alloc_free_page(struct list_head *item,
 	mm = alloc->vma_vm_mm;
 	if (!mmget_not_zero(mm))
 		goto err_mmget;
-	if (!down_read_trylock(&mm->mmap_sem))
+	if (!mm_read_trylock(mm))
 		goto err_down_read_mmap_sem_failed;
 	vma = binder_alloc_get_vma(alloc);
 
@@ -946,7 +946,7 @@ enum lru_status binder_alloc_free_page(struct list_head *item,
 
 		trace_binder_unmap_user_end(alloc, index);
 	}
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	mmput(mm);
 
 	trace_binder_unmap_kernel_start(alloc, index);
diff --git fs/proc/base.c fs/proc/base.c
index 31c56a08af0f..33ab92802834 100644
--- fs/proc/base.c
+++ fs/proc/base.c
@@ -2189,7 +2189,7 @@ proc_map_files_readdir(struct file *file, struct dir_context *ctx)
 	if (!mm)
 		goto out_put_task;
 
-	ret = down_read_killable(&mm->mmap_sem);
+	ret = mm_read_lock_killable(mm);
 	if (ret) {
 		mmput(mm);
 		goto out_put_task;
@@ -2216,7 +2216,7 @@ proc_map_files_readdir(struct file *file, struct dir_context *ctx)
 		p = genradix_ptr_alloc(&fa, nr_files++, GFP_KERNEL);
 		if (!p) {
 			ret = -ENOMEM;
-			up_read(&mm->mmap_sem);
+			mm_read_unlock(mm);
 			mmput(mm);
 			goto out_put_task;
 		}
@@ -2225,7 +2225,7 @@ proc_map_files_readdir(struct file *file, struct dir_context *ctx)
 		p->end = vma->vm_end;
 		p->mode = vma->vm_file->f_mode;
 	}
-	up_read(&mm->mmap_sem);
+	mm_read_unlock(mm);
 	mmput(mm);
 
 	for (i = 0; i < nr_files; i++) {
-- 
2.25.0.341.g760bfbb309-goog



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 04/24] MM locking API: add range arguments
  2020-02-24 20:30 [RFC PATCH 00/24] Fine grained MM locking Michel Lespinasse
                   ` (2 preceding siblings ...)
  2020-02-24 20:30 ` [RFC PATCH 03/24] MM locking API: manual conversion of mmap_sem call sites missed by coccinelle Michel Lespinasse
@ 2020-02-24 20:30 ` Michel Lespinasse
  2020-02-24 20:30 ` [RFC PATCH 05/24] MM locking API: allow for sleeping during unlock Michel Lespinasse
                   ` (20 subsequent siblings)
  24 siblings, 0 replies; 28+ messages in thread
From: Michel Lespinasse @ 2020-02-24 20:30 UTC (permalink / raw)
  To: Peter Zijlstra, Andrew Morton, Laurent Dufour, Vlastimil Babka,
	Matthew Wilcox, Liam R . Howlett, Jerome Glisse, Davidlohr Bueso,
	David Rientjes
  Cc: linux-mm, Michel Lespinasse

This change extends the MM locking API to pass ranges.
The ranges will be used to implement range locking, but for now
we only check that the passed ranges match between lock and unlock calls.

Add a new CONFIG_MM_LOCK_RWSEM_CHECKED config option to verify that
ranges are correctly paired across lock/unlock function calls.

To ease the transition, the existing coarse MM locking calls use a
default range, represented by a per-task structure. This allows a
task's paired coarse lock/unlock calls to be translated into correctly
paired struct mm_lock_range lock and unlock operations.

Small additional changes are needed in kernel/fork.c (dup_mmap has a
single task locking two MMs at once, so it must explicitly manage the
corresponding struct mm_lock_range) and in kernel/bpf/stackmap.c
(dumping user stacks from interrupt context requires explicit tracking
of the struct mm_lock_range).

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 arch/um/include/asm/mmu_context.h |   4 +-
 include/linux/mm_lock.h           | 148 ++++++++++++++++++++++++++++--
 include/linux/mm_types_task.h     |   6 ++
 include/linux/sched.h             |   2 +
 init/init_task.c                  |   1 +
 kernel/bpf/stackmap.c             |  26 ++++--
 kernel/fork.c                     |   7 +-
 mm/Kconfig                        |  18 ++++
 mm/Makefile                       |   1 +
 mm/mm_lock_rwsem_checked.c        | 131 ++++++++++++++++++++++++++
 10 files changed, 322 insertions(+), 22 deletions(-)
 create mode 100644 mm/mm_lock_rwsem_checked.c

diff --git arch/um/include/asm/mmu_context.h arch/um/include/asm/mmu_context.h
index 7bd591231e2d..2e84e7d98141 100644
--- arch/um/include/asm/mmu_context.h
+++ arch/um/include/asm/mmu_context.h
@@ -47,12 +47,14 @@ extern void force_flush_all(void);
 
 static inline void activate_mm(struct mm_struct *old, struct mm_struct *new)
 {
+	struct mm_lock_range mm_range = MM_COARSE_LOCK_RANGE_INITIALIZER;
+
 	/*
 	 * This is called by fs/exec.c and sys_unshare()
 	 * when the new ->mm is used for the first time.
 	 */
 	__switch_mm(&new->context.id);
-	down_write_nested(&new->mmap_sem, 1);
+	mm_write_range_lock_nested(new, &mm_range, 1);
 	uml_setup_stubs(new);
 	mm_write_unlock(new);
 }
diff --git include/linux/mm_lock.h include/linux/mm_lock.h
index b5f134285e53..8ed92ebe58a1 100644
--- include/linux/mm_lock.h
+++ include/linux/mm_lock.h
@@ -1,56 +1,186 @@
 #ifndef _LINUX_MM_LOCK_H
 #define _LINUX_MM_LOCK_H
 
+#include <linux/sched.h>
+
 static inline void mm_init_lock(struct mm_struct *mm)
 {
 	init_rwsem(&mm->mmap_sem);
 }
 
-static inline void mm_write_lock(struct mm_struct *mm)
+#ifdef CONFIG_MM_LOCK_RWSEM_INLINE
+
+#define MM_COARSE_LOCK_RANGE_INITIALIZER {}
+
+static inline void mm_init_coarse_lock_range(struct mm_lock_range *range) {}
+
+static inline void mm_write_range_lock(struct mm_struct *mm,
+				       struct mm_lock_range *range)
 {
 	down_write(&mm->mmap_sem);
 }
 
-static inline int mm_write_lock_killable(struct mm_struct *mm)
+static inline void mm_write_range_lock_nested(struct mm_struct *mm,
+					      struct mm_lock_range *range,
+					      int subclass)
+{
+	down_write_nested(&mm->mmap_sem, subclass);
+}
+
+static inline int mm_write_range_lock_killable(struct mm_struct *mm,
+					       struct mm_lock_range *range)
 {
 	return down_write_killable(&mm->mmap_sem);
 }
 
-static inline bool mm_write_trylock(struct mm_struct *mm)
+static inline bool mm_write_range_trylock(struct mm_struct *mm,
+					  struct mm_lock_range *range)
 {
 	return down_write_trylock(&mm->mmap_sem) != 0;
 }
 
-static inline void mm_write_unlock(struct mm_struct *mm)
+static inline void mm_write_range_unlock(struct mm_struct *mm,
+					 struct mm_lock_range *range)
 {
 	up_write(&mm->mmap_sem);
 }
 
-static inline void mm_downgrade_write_lock(struct mm_struct *mm)
+static inline void mm_downgrade_write_range_lock(struct mm_struct *mm,
+						 struct mm_lock_range *range)
 {
 	downgrade_write(&mm->mmap_sem);
 }
 
-static inline void mm_read_lock(struct mm_struct *mm)
+static inline void mm_read_range_lock(struct mm_struct *mm,
+				      struct mm_lock_range *range)
 {
 	down_read(&mm->mmap_sem);
 }
 
-static inline int mm_read_lock_killable(struct mm_struct *mm)
+static inline int mm_read_range_lock_killable(struct mm_struct *mm,
+					      struct mm_lock_range *range)
 {
 	return down_read_killable(&mm->mmap_sem);
 }
 
-static inline bool mm_read_trylock(struct mm_struct *mm)
+static inline bool mm_read_range_trylock(struct mm_struct *mm,
+					 struct mm_lock_range *range)
 {
 	return down_read_trylock(&mm->mmap_sem) != 0;
 }
 
-static inline void mm_read_unlock(struct mm_struct *mm)
+static inline void mm_read_range_unlock(struct mm_struct *mm,
+					struct mm_lock_range *range)
 {
 	up_read(&mm->mmap_sem);
 }
 
+static inline void mm_read_range_unlock_non_owner(struct mm_struct *mm,
+						  struct mm_lock_range *range)
+{
+	up_read_non_owner(&mm->mmap_sem);
+}
+
+static inline struct mm_lock_range *mm_coarse_lock_range(void)
+{
+	return NULL;
+}
+
+#else /* CONFIG_MM_LOCK_RWSEM_CHECKED */
+
+#define MM_COARSE_LOCK_RANGE_INITIALIZER { .mm = NULL }
+
+static inline void mm_init_coarse_lock_range(struct mm_lock_range *range)
+{
+	range->mm = NULL;
+}
+
+extern void mm_write_range_lock(struct mm_struct *mm,
+				struct mm_lock_range *range);
+#ifdef CONFIG_LOCKDEP
+extern void mm_write_range_lock_nested(struct mm_struct *mm,
+				       struct mm_lock_range *range,
+				       int subclass);
+#else
+#define mm_write_range_lock_nested(mm, range, subclass) \
+	mm_write_range_lock(mm, range)
+#endif
+extern int mm_write_range_lock_killable(struct mm_struct *mm,
+					struct mm_lock_range *range);
+extern bool mm_write_range_trylock(struct mm_struct *mm,
+				   struct mm_lock_range *range);
+extern void mm_write_range_unlock(struct mm_struct *mm,
+				  struct mm_lock_range *range);
+extern void mm_downgrade_write_range_lock(struct mm_struct *mm,
+					  struct mm_lock_range *range);
+extern void mm_read_range_lock(struct mm_struct *mm,
+			       struct mm_lock_range *range);
+extern int mm_read_range_lock_killable(struct mm_struct *mm,
+				       struct mm_lock_range *range);
+extern bool mm_read_range_trylock(struct mm_struct *mm,
+				  struct mm_lock_range *range);
+extern void mm_read_range_unlock(struct mm_struct *mm,
+				 struct mm_lock_range *range);
+extern void mm_read_range_unlock_non_owner(struct mm_struct *mm,
+					   struct mm_lock_range *range);
+
+static inline struct mm_lock_range *mm_coarse_lock_range(void)
+{
+	return &current->mm_coarse_lock_range;
+}
+
+#endif
+
+static inline void mm_read_release(struct mm_struct *mm, unsigned long ip)
+{
+	rwsem_release(&mm->mmap_sem.dep_map, ip);
+}
+
+static inline void mm_write_lock(struct mm_struct *mm)
+{
+	mm_write_range_lock(mm, mm_coarse_lock_range());
+}
+
+static inline int mm_write_lock_killable(struct mm_struct *mm)
+{
+	return mm_write_range_lock_killable(mm, mm_coarse_lock_range());
+}
+
+static inline bool mm_write_trylock(struct mm_struct *mm)
+{
+	return mm_write_range_trylock(mm, mm_coarse_lock_range());
+}
+
+static inline void mm_write_unlock(struct mm_struct *mm)
+{
+	mm_write_range_unlock(mm, mm_coarse_lock_range());
+}
+
+static inline void mm_downgrade_write_lock(struct mm_struct *mm)
+{
+	mm_downgrade_write_range_lock(mm, mm_coarse_lock_range());
+}
+
+static inline void mm_read_lock(struct mm_struct *mm)
+{
+	mm_read_range_lock(mm, mm_coarse_lock_range());
+}
+
+static inline int mm_read_lock_killable(struct mm_struct *mm)
+{
+	return mm_read_range_lock_killable(mm, mm_coarse_lock_range());
+}
+
+static inline bool mm_read_trylock(struct mm_struct *mm)
+{
+	return mm_read_range_trylock(mm, mm_coarse_lock_range());
+}
+
+static inline void mm_read_unlock(struct mm_struct *mm)
+{
+	mm_read_range_unlock(mm, mm_coarse_lock_range());
+}
+
 static inline bool mm_is_locked(struct mm_struct *mm)
 {
 	return rwsem_is_locked(&mm->mmap_sem) != 0;
diff --git include/linux/mm_types_task.h include/linux/mm_types_task.h
index c1bc6731125c..d98c2a2293c1 100644
--- include/linux/mm_types_task.h
+++ include/linux/mm_types_task.h
@@ -96,4 +96,10 @@ struct tlbflush_unmap_batch {
 #endif
 };
 
+struct mm_lock_range {
+#ifdef CONFIG_MM_LOCK_RWSEM_CHECKED
+	struct mm_struct *mm;
+#endif
+};
+
 #endif /* _LINUX_MM_TYPES_TASK_H */
diff --git include/linux/sched.h include/linux/sched.h
index 716ad1d8d95e..c573590076e1 100644
--- include/linux/sched.h
+++ include/linux/sched.h
@@ -1281,6 +1281,8 @@ struct task_struct {
 	unsigned long			prev_lowest_stack;
 #endif
 
+	struct mm_lock_range		mm_coarse_lock_range;
+
 	/*
 	 * New fields for task_struct should be added above here, so that
 	 * they are included in the randomized portion of task_struct.
diff --git init/init_task.c init/init_task.c
index 9e5cbe5eab7b..ae54f69092a2 100644
--- init/init_task.c
+++ init/init_task.c
@@ -181,6 +181,7 @@ struct task_struct init_task
 #ifdef CONFIG_SECURITY
 	.security	= NULL,
 #endif
+	.mm_coarse_lock_range	= MM_COARSE_LOCK_RANGE_INITIALIZER,
 };
 EXPORT_SYMBOL(init_task);
 
diff --git kernel/bpf/stackmap.c kernel/bpf/stackmap.c
index 8087d31b6471..ba2399ce00e4 100644
--- kernel/bpf/stackmap.c
+++ kernel/bpf/stackmap.c
@@ -33,7 +33,8 @@ struct bpf_stack_map {
 /* irq_work to run up_read() for build_id lookup in nmi context */
 struct stack_map_irq_work {
 	struct irq_work irq_work;
-	struct rw_semaphore *sem;
+	struct mm_struct *mm;
+	struct mm_lock_range mm_range;
 };
 
 static void do_up_read(struct irq_work *entry)
@@ -41,8 +42,7 @@ static void do_up_read(struct irq_work *entry)
 	struct stack_map_irq_work *work;
 
 	work = container_of(entry, struct stack_map_irq_work, irq_work);
-	up_read_non_owner(work->sem);
-	work->sem = NULL;
+	mm_read_range_unlock_non_owner(work->mm, &work->mm_range);
 }
 
 static DEFINE_PER_CPU(struct stack_map_irq_work, up_read_work);
@@ -286,12 +286,17 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
 	struct vm_area_struct *vma;
 	bool irq_work_busy = false;
 	struct stack_map_irq_work *work = NULL;
+	struct mm_lock_range mm_range = MM_COARSE_LOCK_RANGE_INITIALIZER;
+	struct mm_lock_range *mm_range_ptr = &mm_range;
 
 	if (irqs_disabled()) {
 		work = this_cpu_ptr(&up_read_work);
-		if (atomic_read(&work->irq_work.flags) & IRQ_WORK_BUSY)
+		if (atomic_read(&work->irq_work.flags) & IRQ_WORK_BUSY) {
 			/* cannot queue more up_read, fallback */
 			irq_work_busy = true;
+		} else {
+			mm_range_ptr = &work->mm_range;
+		}
 	}
 
 	/*
@@ -305,7 +310,7 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
 	 * with build_id.
 	 */
 	if (!user || !current || !current->mm || irq_work_busy ||
-	    mm_read_trylock(current->mm) == 0) {
+	    !mm_read_range_trylock(current->mm, mm_range_ptr)) {
 		/* cannot access current->mm, fall back to ips */
 		for (i = 0; i < trace_nr; i++) {
 			id_offs[i].status = BPF_STACK_BUILD_ID_IP;
@@ -330,16 +335,16 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
 	}
 
 	if (!work) {
-		mm_read_unlock(current->mm);
+		mm_read_range_unlock(current->mm, mm_range_ptr);
 	} else {
-		work->sem = &current->mm->mmap_sem;
+		work->mm = current->mm;
 		irq_work_queue(&work->irq_work);
 		/*
 		 * The irq_work will release the mmap_sem with
-		 * up_read_non_owner(). The rwsem_release() is called
-		 * here to release the lock from lockdep's perspective.
+		 * mm_read_range_unlock_non_owner(). mm_read_release() is
+		 * called here to release the lock from lockdep's perspective.
 		 */
-		rwsem_release(&current->mm->mmap_sem.dep_map, _RET_IP_);
+		mm_read_release(current->mm, _RET_IP_);
 	}
 }
 
@@ -626,6 +631,7 @@ static int __init stack_map_init(void)
 	for_each_possible_cpu(cpu) {
 		work = per_cpu_ptr(&up_read_work, cpu);
 		init_irq_work(&work->irq_work, do_up_read);
+		mm_init_coarse_lock_range(&work->mm_range);
 	}
 	return 0;
 }
diff --git kernel/fork.c kernel/fork.c
index d598f56e4b1e..3db694381ef5 100644
--- kernel/fork.c
+++ kernel/fork.c
@@ -486,6 +486,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	int retval;
 	unsigned long charge;
 	LIST_HEAD(uf);
+	struct mm_lock_range mm_range = MM_COARSE_LOCK_RANGE_INITIALIZER;
 
 	uprobe_start_dup_mmap();
 	if (mm_write_lock_killable(oldmm)) {
@@ -497,7 +498,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	/*
 	 * Not linked in yet - no deadlock potential:
 	 */
-	down_write_nested(&mm->mmap_sem, SINGLE_DEPTH_NESTING);
+	mm_write_range_lock_nested(mm, &mm_range, SINGLE_DEPTH_NESTING);
 
 	/* No ordering required: file already has been exposed. */
 	RCU_INIT_POINTER(mm->exe_file, get_mm_exe_file(oldmm));
@@ -612,7 +613,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	/* a new mm has just been created */
 	retval = arch_dup_mmap(oldmm, mm);
 out:
-	mm_write_unlock(mm);
+	mm_write_range_unlock(mm, &mm_range);
 	flush_tlb_mm(oldmm);
 	mm_write_unlock(oldmm);
 	dup_userfaultfd_complete(&uf);
@@ -947,6 +948,8 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 #ifdef CONFIG_MEMCG
 	tsk->active_memcg = NULL;
 #endif
+
+	mm_init_coarse_lock_range(&tsk->mm_coarse_lock_range);
 	return tsk;
 
 free_stack:
diff --git mm/Kconfig mm/Kconfig
index ab80933be65f..574fb51789a5 100644
--- mm/Kconfig
+++ mm/Kconfig
@@ -739,4 +739,22 @@ config ARCH_HAS_HUGEPD
 config MAPPING_DIRTY_HELPERS
         bool
 
+choice
+	prompt "MM lock implementation (mmap_sem)"
+	default MM_LOCK_RWSEM_CHECKED
+
+config MM_LOCK_RWSEM_INLINE
+	bool "rwsem, inline"
+	help
+	  This option preserves the traditional MM lock implementation as
+	  inline read-write semaphore operations.
+
+config MM_LOCK_RWSEM_CHECKED
+	bool "rwsem, checked"
+	help
+	  This option implements the MM lock using a read-write semaphore,
+	  ignoring the passed address range but checking its validity.
+
+endchoice
+
 endmenu
diff --git mm/Makefile mm/Makefile
index 1937cc251883..9f46376c6407 100644
--- mm/Makefile
+++ mm/Makefile
@@ -108,3 +108,4 @@ obj-$(CONFIG_ZONE_DEVICE) += memremap.o
 obj-$(CONFIG_HMM_MIRROR) += hmm.o
 obj-$(CONFIG_MEMFD_CREATE) += memfd.o
 obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
+obj-$(CONFIG_MM_LOCK_RWSEM_CHECKED) += mm_lock_rwsem_checked.o
diff --git mm/mm_lock_rwsem_checked.c mm/mm_lock_rwsem_checked.c
new file mode 100644
index 000000000000..3551deb85e3d
--- /dev/null
+++ mm/mm_lock_rwsem_checked.c
@@ -0,0 +1,131 @@
+#include <linux/mm_lock.h>
+#include <linux/printk.h>
+
+static int mm_lock_debug = 1;
+
+static void mm_lock_dump(char *msg) {
+	if (!mm_lock_debug) {
+		return;
+	}
+	mm_lock_debug = 0;
+	pr_err("mm_lock_dump: %s\n", msg);
+	dump_stack();
+	pr_err("mm_lock_dump: done\n");
+}
+
+void mm_write_range_lock(struct mm_struct *mm, struct mm_lock_range *range)
+{
+	if (range->mm != NULL)
+		mm_lock_dump("mm_write_range_lock");
+	down_write(&mm->mmap_sem);
+	range->mm = mm;
+}
+EXPORT_SYMBOL(mm_write_range_lock);
+
+#ifdef CONFIG_LOCKDEP
+void mm_write_range_lock_nested(struct mm_struct *mm,
+				struct mm_lock_range *range, int subclass)
+{
+	if (range->mm != NULL)
+		mm_lock_dump("mm_write_range_lock_nested");
+	down_write_nested(&mm->mmap_sem, subclass);
+	range->mm = mm;
+}
+EXPORT_SYMBOL(mm_write_range_lock_nested);
+#endif
+
+int mm_write_range_lock_killable(struct mm_struct *mm,
+				 struct mm_lock_range *range)
+{
+	int ret;
+	if (range->mm != NULL)
+		mm_lock_dump("mm_write_range_lock_killable");
+	ret = down_write_killable(&mm->mmap_sem);
+	if (!ret)
+		range->mm = mm;
+	return ret;
+}
+EXPORT_SYMBOL(mm_write_range_lock_killable);
+
+bool mm_write_range_trylock(struct mm_struct *mm, struct mm_lock_range *range)
+{
+	bool ret = down_write_trylock(&mm->mmap_sem) != 0;
+	if (ret) {
+		if (range->mm != NULL)
+			mm_lock_dump("mm_write_range_trylock");
+		range->mm = mm;
+	}
+	return ret;
+}
+EXPORT_SYMBOL(mm_write_range_trylock);
+
+void mm_write_range_unlock(struct mm_struct *mm, struct mm_lock_range *range)
+{
+	if (range->mm != mm)
+		mm_lock_dump("mm_write_range_unlock");
+	range->mm = NULL;
+	up_write(&mm->mmap_sem);
+}
+EXPORT_SYMBOL(mm_write_range_unlock);
+
+void mm_downgrade_write_range_lock(struct mm_struct *mm,
+				   struct mm_lock_range *range)
+{
+	if (range->mm != mm)
+		mm_lock_dump("mm_downgrade_write_range_lock");
+	downgrade_write(&mm->mmap_sem);
+}
+EXPORT_SYMBOL(mm_downgrade_write_range_lock);
+
+void mm_read_range_lock(struct mm_struct *mm, struct mm_lock_range *range)
+{
+	if (range->mm != NULL)
+		mm_lock_dump("mm_read_range_lock");
+	down_read(&mm->mmap_sem);
+	range->mm = mm;
+}
+EXPORT_SYMBOL(mm_read_range_lock);
+
+int mm_read_range_lock_killable(struct mm_struct *mm,
+				struct mm_lock_range *range)
+{
+	int ret;
+	if (range->mm != NULL)
+		mm_lock_dump("mm_read_range_lock_killable");
+	ret = down_read_killable(&mm->mmap_sem);
+	if (!ret)
+		range->mm = mm;
+	return ret;
+}
+EXPORT_SYMBOL(mm_read_range_lock_killable);
+
+bool mm_read_range_trylock(struct mm_struct *mm, struct mm_lock_range *range)
+{
+	bool ret;
+	if (range->mm != NULL)
+		mm_lock_dump("mm_read_range_trylock");
+	ret = down_read_trylock(&mm->mmap_sem) != 0;
+	if (ret)
+		range->mm = mm;
+	return ret;
+}
+EXPORT_SYMBOL(mm_read_range_trylock);
+
+void mm_read_range_unlock(struct mm_struct *mm, struct mm_lock_range *range)
+{
+	if (range->mm != mm)
+		mm_lock_dump("mm_read_range_unlock");
+	range->mm = NULL;
+	up_read(&mm->mmap_sem);
+}
+EXPORT_SYMBOL(mm_read_range_unlock);
+
+void mm_read_range_unlock_non_owner(struct mm_struct *mm,
+				    struct mm_lock_range *range)
+{
+	if (range->mm != mm)
+		mm_lock_dump("mm_read_range_unlock_non_owner");
+	range->mm = NULL;
+	up_read_non_owner(&mm->mmap_sem);
+}
+EXPORT_SYMBOL(mm_read_range_unlock_non_owner);
-- 
2.25.0.341.g760bfbb309-goog



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 05/24] MM locking API: allow for sleeping during unlock
  2020-02-24 20:30 [RFC PATCH 00/24] Fine grained MM locking Michel Lespinasse
                   ` (3 preceding siblings ...)
  2020-02-24 20:30 ` [RFC PATCH 04/24] MM locking API: add range arguments Michel Lespinasse
@ 2020-02-24 20:30 ` Michel Lespinasse
  2020-02-24 20:30 ` [RFC PATCH 06/24] MM locking API: implement fine grained range locks Michel Lespinasse
                   ` (19 subsequent siblings)
  24 siblings, 0 replies; 28+ messages in thread
From: Michel Lespinasse @ 2020-02-24 20:30 UTC (permalink / raw)
  To: Peter Zijlstra, Andrew Morton, Laurent Dufour, Vlastimil Babka,
	Matthew Wilcox, Liam R . Howlett, Jerome Glisse, Davidlohr Bueso,
	David Rientjes
  Cc: linux-mm, Michel Lespinasse

The following changes will implement fine-grained range locks for the
MM locking API, using a data structure to represent the existing
locks and a mutex to protect that data structure.

As a result, we need to prepare for the possibility that unlocking
a memory range may need to sleep when acquiring that mutex.
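
One immediate consequence, visible in the kernel/exit.c hunk below, is
that callers can no longer hold a spinlock across the unlock. The
resulting ordering in exit_mm() is (sketch of the code after this patch):

	task_lock(current);		/* spinlock - atomic context */
	current->mm = NULL;
	enter_lazy_tlb(mm, current);
	task_unlock(current);
	mm_read_unlock(mm);		/* may now sleep, so after task_unlock() */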

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 kernel/bpf/stackmap.c      | 6 +++++-
 kernel/exit.c              | 2 +-
 mm/mm_lock_rwsem_checked.c | 3 +++
 3 files changed, 9 insertions(+), 2 deletions(-)

diff --git kernel/bpf/stackmap.c kernel/bpf/stackmap.c
index ba2399ce00e4..0f483abeb94c 100644
--- kernel/bpf/stackmap.c
+++ kernel/bpf/stackmap.c
@@ -308,8 +308,12 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
 	 *
 	 * Same fallback is used for kernel stack (!user) on a stackmap
 	 * with build_id.
+	 *
+	 * FIXME - currently disabling the build_id lookup feature
+	 * as mm_read_range_unlock() may block, which is not always
+	 * possible to do here.
 	 */
-	if (!user || !current || !current->mm || irq_work_busy ||
+	if (true || !user || !current || !current->mm || irq_work_busy ||
 	    !mm_read_range_trylock(current->mm, mm_range_ptr)) {
 		/* cannot access current->mm, fall back to ips */
 		for (i = 0; i < trace_nr; i++) {
diff --git kernel/exit.c kernel/exit.c
index 9a0b72562adb..60ec6efb4e2c 100644
--- kernel/exit.c
+++ kernel/exit.c
@@ -478,9 +478,9 @@ static void exit_mm(void)
 	/* more a memory barrier than a real lock */
 	task_lock(current);
 	current->mm = NULL;
-	mm_read_unlock(mm);
 	enter_lazy_tlb(mm, current);
 	task_unlock(current);
+	mm_read_unlock(mm);
 	mm_update_next_owner(mm);
 	mmput(mm);
 	if (test_thread_flag(TIF_MEMDIE))
diff --git mm/mm_lock_rwsem_checked.c mm/mm_lock_rwsem_checked.c
index 3551deb85e3d..e45d1a598c87 100644
--- mm/mm_lock_rwsem_checked.c
+++ mm/mm_lock_rwsem_checked.c
@@ -61,6 +61,7 @@ EXPORT_SYMBOL(mm_write_range_trylock);
 
 void mm_write_range_unlock(struct mm_struct *mm, struct mm_lock_range *range)
 {
+	might_sleep();
 	if (range->mm != mm)
 		mm_lock_dump("mm_write_range_unlock");
 	range->mm = NULL;
@@ -113,6 +114,7 @@ EXPORT_SYMBOL(mm_read_range_trylock);
 
 void mm_read_range_unlock(struct mm_struct *mm, struct mm_lock_range *range)
 {
+	might_sleep();
 	if (range->mm != mm)
 		mm_lock_dump("mm_read_range_unlock");
 	range->mm = NULL;
@@ -123,6 +125,7 @@ EXPORT_SYMBOL(mm_read_range_unlock);
 void mm_read_range_unlock_non_owner(struct mm_struct *mm,
 				    struct mm_lock_range *range)
 {
+	might_sleep();
 	if (range->mm != mm)
 		mm_lock_dump("mm_read_range_unlock_non_owner");
 	range->mm = NULL;
-- 
2.25.0.341.g760bfbb309-goog



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 06/24] MM locking API: implement fine grained range locks
  2020-02-24 20:30 [RFC PATCH 00/24] Fine grained MM locking Michel Lespinasse
                   ` (4 preceding siblings ...)
  2020-02-24 20:30 ` [RFC PATCH 05/24] MM locking API: allow for sleeping during unlock Michel Lespinasse
@ 2020-02-24 20:30 ` Michel Lespinasse
  2020-02-24 20:30 ` [RFC PATCH 07/24] mm/memory: add range field to struct vm_fault Michel Lespinasse
                   ` (18 subsequent siblings)
  24 siblings, 0 replies; 28+ messages in thread
From: Michel Lespinasse @ 2020-02-24 20:30 UTC (permalink / raw)
  To: Peter Zijlstra, Andrew Morton, Laurent Dufour, Vlastimil Babka,
	Matthew Wilcox, Liam R . Howlett, Jerome Glisse, Davidlohr Bueso,
	David Rientjes
  Cc: linux-mm, Michel Lespinasse

This change implements fine grained reader-writer range locks.

Existing locked ranges are represented as an augmented rbtree
protected by a mutex. Each locked range carries augmented information
for two overlapping interval trees, representing the reader and writer
locks respectively. This data structure allows quick searches for
existing readers, writers, or both, intersecting a given address range.

When locking a range, a count of all existing conflicting ranges
(either already locked, or still queued) is stored in the mm_lock_range
struct. If the count is non-zero, the locking task is put to sleep
until all conflicting lock ranges have been released.

When unlocking a range, the conflict count of every queued conflicting
range is decremented. When a count reaches zero, the corresponding
locker task is woken up - it now holds a lock on its desired address range.
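
As a concrete example of the counting, using the constants defined in
mm_lock_range.c (MM_LOCK_RANGE_WRITE == 1, MM_LOCK_RANGE_COUNT_ONE == 2):
a writer queued over a range currently held by two readers starts with

	flags_count = MM_LOCK_RANGE_WRITE - 2 * MM_LOCK_RANGE_COUNT_ONE;

which is negative. Each reader unlock adds MM_LOCK_RANGE_COUNT_ONE back;
once flags_count stops being negative the writer is woken up and owns
the range.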

The general approach for this range locking implementation was first
proposed by Jan Kara back in 2013, and later worked on by at least
Laurent Dufour and Davidlohr Bueso. I have extended the approach
by using separate indexes for the reader and writer range locks.
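
From a caller's point of view, the resulting API usage looks like the
following sketch (start/end are whatever range the caller needs):

	struct mm_lock_range range;

	mm_init_lock_range(&range, start, end);	/* covers [start; end) */
	mm_read_range_lock(mm, &range);
	/* work on [start; end) of the address space */
	mm_read_range_unlock(mm, &range);

Writers use mm_write_range_lock() / mm_write_range_unlock() the same
way; only ranges that actually intersect, with at least one of them
being a write lock, block each other.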

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 arch/x86/kernel/tboot.c       |   2 +-
 drivers/firmware/efi/efi.c    |   2 +-
 include/linux/mm_lock.h       |  96 ++++-
 include/linux/mm_types.h      |  20 +
 include/linux/mm_types_task.h |  15 +
 mm/Kconfig                    |   9 +-
 mm/Makefile                   |   1 +
 mm/init-mm.c                  |   3 +-
 mm/mm_lock_range.c            | 691 ++++++++++++++++++++++++++++++++++
 9 files changed, 827 insertions(+), 12 deletions(-)
 create mode 100644 mm/mm_lock_range.c

diff --git arch/x86/kernel/tboot.c arch/x86/kernel/tboot.c
index 4c61f0713832..68bb5e9b0324 100644
--- arch/x86/kernel/tboot.c
+++ arch/x86/kernel/tboot.c
@@ -90,7 +90,7 @@ static struct mm_struct tboot_mm = {
 	.pgd            = swapper_pg_dir,
 	.mm_users       = ATOMIC_INIT(2),
 	.mm_count       = ATOMIC_INIT(1),
-	.mmap_sem       = __RWSEM_INITIALIZER(init_mm.mmap_sem),
+	.mmap_sem       = MM_LOCK_INITIALIZER(init_mm.mmap_sem),
 	.page_table_lock =  __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
 	.mmlist         = LIST_HEAD_INIT(init_mm.mmlist),
 };
diff --git drivers/firmware/efi/efi.c drivers/firmware/efi/efi.c
index 2b02cb165f16..fb5c9d53ceb2 100644
--- drivers/firmware/efi/efi.c
+++ drivers/firmware/efi/efi.c
@@ -60,7 +60,7 @@ struct mm_struct efi_mm = {
 	.mm_rb			= RB_ROOT,
 	.mm_users		= ATOMIC_INIT(2),
 	.mm_count		= ATOMIC_INIT(1),
-	.mmap_sem		= __RWSEM_INITIALIZER(efi_mm.mmap_sem),
+	.mmap_sem		= MM_LOCK_INITIALIZER(efi_mm.mmap_sem),
 	.page_table_lock	= __SPIN_LOCK_UNLOCKED(efi_mm.page_table_lock),
 	.mmlist			= LIST_HEAD_INIT(efi_mm.mmlist),
 	.cpu_bitmap		= { [BITS_TO_LONGS(NR_CPUS)] = 0},
diff --git include/linux/mm_lock.h include/linux/mm_lock.h
index 8ed92ebe58a1..a4d60bd56899 100644
--- include/linux/mm_lock.h
+++ include/linux/mm_lock.h
@@ -2,17 +2,26 @@
 #define _LINUX_MM_LOCK_H
 
 #include <linux/sched.h>
-
-static inline void mm_init_lock(struct mm_struct *mm)
-{
-	init_rwsem(&mm->mmap_sem);
-}
+#include <linux/lockdep.h>
 
 #ifdef CONFIG_MM_LOCK_RWSEM_INLINE
 
+#define MM_LOCK_INITIALIZER __RWSEM_INITIALIZER
 #define MM_COARSE_LOCK_RANGE_INITIALIZER {}
 
+static inline void mm_init_lock(struct mm_struct *mm)
+{
+	init_rwsem(&mm->mmap_sem);
+}
+
 static inline void mm_init_coarse_lock_range(struct mm_lock_range *range) {}
+static inline void mm_init_lock_range(struct mm_lock_range *range,
+		unsigned long start, unsigned long end) {}
+
+static inline bool mm_range_is_coarse(struct mm_lock_range *range)
+{
+	return true;
+}
 
 static inline void mm_write_range_lock(struct mm_struct *mm,
 				       struct mm_lock_range *range)
@@ -86,15 +95,80 @@ static inline struct mm_lock_range *mm_coarse_lock_range(void)
 	return NULL;
 }
 
-#else /* CONFIG_MM_LOCK_RWSEM_CHECKED */
+#else	/* !CONFIG_MM_LOCK_RWSEM_INLINE */
+
+#ifdef CONFIG_MM_LOCK_RWSEM_CHECKED
 
+#define MM_LOCK_INITIALIZER __RWSEM_INITIALIZER
 #define MM_COARSE_LOCK_RANGE_INITIALIZER { .mm = NULL }
 
+static inline void mm_init_lock(struct mm_struct *mm)
+{
+	init_rwsem(&mm->mmap_sem);
+}
+
 static inline void mm_init_coarse_lock_range(struct mm_lock_range *range)
 {
 	range->mm = NULL;
 }
 
+static inline void mm_init_lock_range(struct mm_lock_range *range,
+		unsigned long start, unsigned long end) {
+	mm_init_coarse_lock_range(range);
+}
+
+static inline bool mm_range_is_coarse(struct mm_lock_range *range)
+{
+	return true;
+}
+
+#else	/* CONFIG_MM_LOCK_RANGE */
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+#define __DEP_MAP_MM_LOCK_INITIALIZER(lockname)		\
+	.dep_map = { .name = #lockname },
+#else
+#define __DEP_MAP_MM_LOCK_INITIALIZER(lockname)
+#endif
+
+#define MM_LOCK_INITIALIZER(name) {			\
+	.mutex = __MUTEX_INITIALIZER(name.mutex),	\
+	.rb_root = RB_ROOT,				\
+	__DEP_MAP_MM_LOCK_INITIALIZER(name)		\
+}
+
+#define MM_COARSE_LOCK_RANGE_INITIALIZER {		\
+	.start = 0,					\
+	.end = ~0UL,					\
+}
+
+static inline void mm_init_lock(struct mm_struct *mm)
+{
+	static struct lock_class_key __key;
+
+	mutex_init(&mm->mmap_sem.mutex);
+	mm->mmap_sem.rb_root = RB_ROOT;
+	lockdep_init_map(&mm->mmap_sem.dep_map, "&mm->mmap_sem", &__key, 0);
+}
+
+static inline void mm_init_lock_range(struct mm_lock_range *range,
+		unsigned long start, unsigned long end) {
+	range->start = start;
+	range->end = end;
+}
+
+static inline void mm_init_coarse_lock_range(struct mm_lock_range *range)
+{
+	mm_init_lock_range(range, 0, ~0UL);
+}
+
+static inline bool mm_range_is_coarse(struct mm_lock_range *range)
+{
+	return range->start == 0 && range->end == ~0UL;
+}
+
+#endif	/* CONFIG_MM_LOCK_RANGE */
+
 extern void mm_write_range_lock(struct mm_struct *mm,
 				struct mm_lock_range *range);
 #ifdef CONFIG_LOCKDEP
@@ -129,11 +203,11 @@ static inline struct mm_lock_range *mm_coarse_lock_range(void)
 	return &current->mm_coarse_lock_range;
 }
 
-#endif
+#endif	/* !CONFIG_MM_LOCK_RWSEM_INLINE */
 
 static inline void mm_read_release(struct mm_struct *mm, unsigned long ip)
 {
-	rwsem_release(&mm->mmap_sem.dep_map, ip);
+	lock_release(&mm->mmap_sem.dep_map, ip);
 }
 
 static inline void mm_write_lock(struct mm_struct *mm)
@@ -183,7 +257,13 @@ static inline void mm_read_unlock(struct mm_struct *mm)
 
 static inline bool mm_is_locked(struct mm_struct *mm)
 {
+#ifndef CONFIG_MM_LOCK_RANGE
 	return rwsem_is_locked(&mm->mmap_sem) != 0;
+#elif defined(CONFIG_LOCKDEP)
+	return lockdep_is_held(&mm->mmap_sem);	/* Close enough for asserts */
+#else
+	return true;
+#endif
 }
 
 #endif /* _LINUX_MM_LOCK_H */
diff --git include/linux/mm_types.h include/linux/mm_types.h
index 270aa8fd2800..941610c906b3 100644
--- include/linux/mm_types.h
+++ include/linux/mm_types.h
@@ -283,6 +283,21 @@ struct vm_userfaultfd_ctx {
 struct vm_userfaultfd_ctx {};
 #endif /* CONFIG_USERFAULTFD */
 
+/*
+ * struct mm_lock stores locked address ranges for a given mm,
+ * implementing a fine-grained replacement for the mmap_sem rwsem.
+ */
+#ifdef CONFIG_MM_LOCK_RANGE
+struct mm_lock {
+	struct mutex mutex;
+	struct rb_root rb_root;
+	unsigned long seq;
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	struct lockdep_map dep_map;
+#endif
+};
+#endif
+
 /*
  * This struct defines a memory VMM memory area. There is one of these
  * per VM-area/task.  A VM area is any part of the process virtual memory
@@ -426,7 +441,12 @@ struct mm_struct {
 		spinlock_t page_table_lock; /* Protects page tables and some
 					     * counters
 					     */
+
+#ifndef CONFIG_MM_LOCK_RANGE
 		struct rw_semaphore mmap_sem;
+#else
+		struct mm_lock mmap_sem;
+#endif
 
 		struct list_head mmlist; /* List of maybe swapped mm's.	These
 					  * are globally strung together off
diff --git include/linux/mm_types_task.h include/linux/mm_types_task.h
index d98c2a2293c1..e5652fe6a53c 100644
--- include/linux/mm_types_task.h
+++ include/linux/mm_types_task.h
@@ -12,6 +12,7 @@
 #include <linux/threads.h>
 #include <linux/atomic.h>
 #include <linux/cpumask.h>
+#include <linux/rbtree.h>
 
 #include <asm/page.h>
 
@@ -100,6 +101,20 @@ struct mm_lock_range {
 #ifdef CONFIG_MM_LOCK_RWSEM_CHECKED
 	struct mm_struct *mm;
 #endif
+#ifdef CONFIG_MM_LOCK_RANGE
+	/* First cache line - used in insert / remove / iter */
+	struct rb_node rb;
+	long flags_count;
+	unsigned long start;		/* First address of the range. */
+	unsigned long end;		/* First address after the range. */
+	struct {
+		unsigned long read_end;	  /* Largest end in reader nodes. */
+		unsigned long write_end;  /* Largest end in writer nodes. */
+	} __subtree;			/* Subtree augmented information. */
+	/* Second cache line - used in wait and wake. */
+	unsigned long seq;		/* Killable wait sequence number. */
+	struct task_struct *task;	/* Task trying to lock this range. */
+#endif
 };
 
 #endif /* _LINUX_MM_TYPES_TASK_H */
diff --git mm/Kconfig mm/Kconfig
index 574fb51789a5..3273ddb5839f 100644
--- mm/Kconfig
+++ mm/Kconfig
@@ -741,7 +741,7 @@ config MAPPING_DIRTY_HELPERS
 
 choice
 	prompt "MM lock implementation (mmap_sem)"
-	default MM_LOCK_RWSEM_CHECKED
+	default MM_LOCK_RANGE
 
 config MM_LOCK_RWSEM_INLINE
 	bool "rwsem, inline"
@@ -755,6 +755,13 @@ config MM_LOCK_RWSEM_CHECKED
 	  This option implements the MM lock using a read-write semaphore,
 	  ignoring the passed address range but checking its validity.
 
+config MM_LOCK_RANGE
+	bool "range lock"
+	help
+	  This option implements the MM lock as a read-write range lock,
+	  thus avoiding false conflicts between operations that operate
+	  on non-overlapping address ranges.
+
 endchoice
 
 endmenu
diff --git mm/Makefile mm/Makefile
index 9f46376c6407..71197fc20eda 100644
--- mm/Makefile
+++ mm/Makefile
@@ -109,3 +109,4 @@ obj-$(CONFIG_HMM_MIRROR) += hmm.o
 obj-$(CONFIG_MEMFD_CREATE) += memfd.o
 obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
 obj-$(CONFIG_MM_LOCK_RWSEM_CHECKED) += mm_lock_rwsem_checked.o
+obj-$(CONFIG_MM_LOCK_RANGE) += mm_lock_range.o
diff --git mm/init-mm.c mm/init-mm.c
index 19603302a77f..0ba8ba5c07f4 100644
--- mm/init-mm.c
+++ mm/init-mm.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/mm_types.h>
+#include <linux/mm_lock.h>
 #include <linux/rbtree.h>
 #include <linux/rwsem.h>
 #include <linux/spinlock.h>
@@ -31,7 +32,7 @@ struct mm_struct init_mm = {
 	.pgd		= swapper_pg_dir,
 	.mm_users	= ATOMIC_INIT(2),
 	.mm_count	= ATOMIC_INIT(1),
-	.mmap_sem	= __RWSEM_INITIALIZER(init_mm.mmap_sem),
+	.mmap_sem	= MM_LOCK_INITIALIZER(init_mm.mmap_sem),
 	.page_table_lock =  __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
 	.arg_lock	=  __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
 	.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
diff --git mm/mm_lock_range.c mm/mm_lock_range.c
new file mode 100644
index 000000000000..da3c70e0809a
--- /dev/null
+++ mm/mm_lock_range.c
@@ -0,0 +1,691 @@
+#include <linux/mm_lock.h>
+#include <linux/rbtree_augmented.h>
+#include <linux/mutex.h>
+#include <linux/lockdep.h>
+#include <linux/sched.h>
+#include <linux/sched/signal.h>
+#include <linux/sched/wake_q.h>
+
+/* range->flags_count definitions */
+#define MM_LOCK_RANGE_WRITE 1
+#define MM_LOCK_RANGE_COUNT_ONE 2
+
+static inline bool rbcompute(struct mm_lock_range *range, bool exit)
+{
+	struct mm_lock_range *child;
+	unsigned long subtree_read_end = range->end, subtree_write_end = 0;
+	if (range->flags_count & MM_LOCK_RANGE_WRITE) {
+		subtree_read_end = 0;
+		subtree_write_end = range->end;
+	}
+	if (range->rb.rb_left) {
+		child = rb_entry(range->rb.rb_left, struct mm_lock_range, rb);
+		if (child->__subtree.read_end > subtree_read_end)
+			subtree_read_end = child->__subtree.read_end;
+		if (child->__subtree.write_end > subtree_write_end)
+			subtree_write_end = child->__subtree.write_end;
+	}
+	if (range->rb.rb_right) {
+		child = rb_entry(range->rb.rb_right, struct mm_lock_range, rb);
+		if (child->__subtree.read_end > subtree_read_end)
+			subtree_read_end = child->__subtree.read_end;
+		if (child->__subtree.write_end > subtree_write_end)
+			subtree_write_end = child->__subtree.write_end;
+	}
+	if (exit && range->__subtree.read_end == subtree_read_end &&
+		range->__subtree.write_end == subtree_write_end)
+		return true;
+	range->__subtree.read_end = subtree_read_end;
+	range->__subtree.write_end = subtree_write_end;
+	return false;
+}
+
+RB_DECLARE_CALLBACKS(static, augment, struct mm_lock_range, rb,
+		     __subtree, rbcompute);
+
+static void insert_read(struct mm_lock_range *range, struct rb_root *root)
+{
+	struct rb_node **link = &root->rb_node, *rb_parent = NULL;
+	unsigned long start = range->start, end = range->end;
+	struct mm_lock_range *parent;
+
+	while (*link) {
+		rb_parent = *link;
+		parent = rb_entry(rb_parent, struct mm_lock_range, rb);
+		if (parent->__subtree.read_end < end)
+			parent->__subtree.read_end = end;
+		if (start < parent->start)
+			link = &parent->rb.rb_left;
+		else
+			link = &parent->rb.rb_right;
+	}
+
+	range->__subtree.read_end = end;
+	range->__subtree.write_end = 0;
+	rb_link_node(&range->rb, rb_parent, link);
+	rb_insert_augmented(&range->rb, root, &augment);
+}
+
+static void insert_write(struct mm_lock_range *range, struct rb_root *root)
+{
+	struct rb_node **link = &root->rb_node, *rb_parent = NULL;
+	unsigned long start = range->start, end = range->end;
+	struct mm_lock_range *parent;
+
+	while (*link) {
+		rb_parent = *link;
+		parent = rb_entry(rb_parent, struct mm_lock_range, rb);
+		if (parent->__subtree.write_end < end)
+			parent->__subtree.write_end = end;
+		if (start < parent->start)
+			link = &parent->rb.rb_left;
+		else
+			link = &parent->rb.rb_right;
+	}
+
+	range->__subtree.read_end = 0;
+	range->__subtree.write_end = end;
+	rb_link_node(&range->rb, rb_parent, link);
+	rb_insert_augmented(&range->rb, root, &augment);
+}
+
+static void remove(struct mm_lock_range *range, struct rb_root *root)
+{
+	rb_erase_augmented(&range->rb, root, &augment);
+}
+
+/*
+ * Iterate over ranges intersecting [start;end)
+ *
+ * Note that a range intersects [start;end) iff:
+ *   Cond1: range->start < end
+ * and
+ *   Cond2: start < range->end
+ */
+
+static struct mm_lock_range *
+subtree_search(struct mm_lock_range *range,
+	       unsigned long start, unsigned long end)
+{
+	while (true) {
+		/*
+		 * Loop invariant: start < range->__subtree.read_end
+		 *              or start < range->__subtree.write_end
+		 * (Cond2 is satisfied by one of the subtree ranges)
+		 */
+		if (range->rb.rb_left) {
+			struct mm_lock_range *left = rb_entry(
+				range->rb.rb_left, struct mm_lock_range, rb);
+			if (start < left->__subtree.read_end ||
+			    start < left->__subtree.write_end) {
+				/*
+				 * Some ranges in left subtree satisfy Cond2.
+				 * Iterate to find the leftmost such range R.
+				 * If it also satisfies Cond1, that's the
+				 * match we are looking for. Otherwise, there
+				 * is no matching interval as ranges to the
+				 * right of R can't satisfy Cond1 either.
+				 */
+				range = left;
+				continue;
+			}
+		}
+		if (range->start < end) {		/* Cond1 */
+			if (start < range->end)		/* Cond2 */
+				return range;	/* range is leftmost match */
+			if (range->rb.rb_right) {
+				range = rb_entry(range->rb.rb_right,
+						 struct mm_lock_range, rb);
+				if (start < range->__subtree.read_end ||
+				    start < range->__subtree.write_end)
+					continue;
+			}
+		}
+		return NULL;	/* No match */
+	}
+}
+
+static struct mm_lock_range *
+iter_first(struct rb_root *root, unsigned long start, unsigned long end)
+{
+	struct mm_lock_range *range;
+
+	if (!root->rb_node)
+		return NULL;
+	range = rb_entry(root->rb_node, struct mm_lock_range, rb);
+	if (range->__subtree.read_end <= start &&
+	    range->__subtree.write_end <= start)
+		return NULL;
+	return subtree_search(range, start, end);
+}
+
+static struct mm_lock_range *
+iter_next(struct mm_lock_range *range, unsigned long start, unsigned long end)
+{
+	struct rb_node *rb = range->rb.rb_right, *prev;
+
+	while (true) {
+		/*
+		 * Loop invariants:
+		 *   Cond1: range->start < end
+		 *   rb == range->rb.rb_right
+		 *
+		 * First, search right subtree if suitable
+		 */
+		if (rb) {
+			struct mm_lock_range *right = rb_entry(
+				rb, struct mm_lock_range, rb);
+			if (start < right->__subtree.read_end ||
+			    start < right->__subtree.write_end)
+				return subtree_search(right, start, end);
+		}
+
+		/* Move up the tree until we come from a range's left child */
+		do {
+			rb = rb_parent(&range->rb);
+			if (!rb)
+				return NULL;
+			prev = &range->rb;
+			range = rb_entry(rb, struct mm_lock_range, rb);
+			rb = range->rb.rb_right;
+		} while (prev == rb);
+
+		/* Check if the range intersects [start;end) */
+		if (end <= range->start)		/* !Cond1 */
+			return NULL;
+		else if (start < range->end)		/* Cond2 */
+			return range;
+	}
+}
+
+#define FOR_EACH_RANGE(mm, start, end, tmp)				\
+for (tmp = iter_first(&mm->mmap_sem.rb_root, start, end); tmp;		\
+     tmp = iter_next(tmp, start, end))
+
+static struct mm_lock_range *
+subtree_search_read(struct mm_lock_range *range,
+		    unsigned long start, unsigned long end)
+{
+	while (true) {
+		/*
+		 * Loop invariant: start < range->__subtree.read_end
+		 * (Cond2 is satisfied by one of the subtree ranges)
+		 */
+		if (range->rb.rb_left) {
+			struct mm_lock_range *left = rb_entry(
+				range->rb.rb_left, struct mm_lock_range, rb);
+			if (start < left->__subtree.read_end) {
+				/*
+				 * Some ranges in left subtree satisfy Cond2.
+				 * Iterate to find the leftmost such range R.
+				 * If it also satisfies Cond1, that's the
+				 * match we are looking for. Otherwise, there
+				 * is no matching interval as ranges to the
+				 * right of R can't satisfy Cond1 either.
+				 */
+				range = left;
+				continue;
+			}
+		}
+		if (range->start < end) {		/* Cond1 */
+			if (start < range->end &&	/* Cond2 */
+			    !(range->flags_count & MM_LOCK_RANGE_WRITE))
+				return range;	/* range is leftmost match */
+			if (range->rb.rb_right) {
+				range = rb_entry(range->rb.rb_right,
+						 struct mm_lock_range, rb);
+				if (start < range->__subtree.read_end)
+					continue;
+			}
+		}
+		return NULL;	/* No match */
+	}
+}
+
+static struct mm_lock_range *
+iter_first_read(struct rb_root *root, unsigned long start, unsigned long end)
+{
+	struct mm_lock_range *range;
+
+	if (!root->rb_node)
+		return NULL;
+	range = rb_entry(root->rb_node, struct mm_lock_range, rb);
+	if (range->__subtree.read_end <= start)
+		return NULL;
+	return subtree_search_read(range, start, end);
+}
+
+static struct mm_lock_range *
+iter_next_read(struct mm_lock_range *range,
+	       unsigned long start, unsigned long end)
+{
+	struct rb_node *rb = range->rb.rb_right, *prev;
+
+	while (true) {
+		/*
+		 * Loop invariants:
+		 *   Cond1: range->start < end
+		 *   rb == range->rb.rb_right
+		 *
+		 * First, search right subtree if suitable
+		 */
+		if (rb) {
+			struct mm_lock_range *right = rb_entry(
+				rb, struct mm_lock_range, rb);
+			if (start < right->__subtree.read_end)
+				return subtree_search_read(right, start, end);
+		}
+
+		/* Move up the tree until we come from a range's left child */
+		do {
+			rb = rb_parent(&range->rb);
+			if (!rb)
+				return NULL;
+			prev = &range->rb;
+			range = rb_entry(rb, struct mm_lock_range, rb);
+			rb = range->rb.rb_right;
+		} while (prev == rb);
+
+		/* Check if the range intersects [start;end) */
+		if (end <= range->start)		/* !Cond1 */
+			return NULL;
+		else if (start < range->end &&		/* Cond2 */
+			 !(range->flags_count & MM_LOCK_RANGE_WRITE))
+			return range;
+	}
+}
+
+#define FOR_EACH_RANGE_READ(mm, start, end, tmp)			\
+for (tmp = iter_first_read(&mm->mmap_sem.rb_root, start, end); tmp;	\
+     tmp = iter_next_read(tmp, start, end))
+
+static struct mm_lock_range *
+subtree_search_write(struct mm_lock_range *range,
+		     unsigned long start, unsigned long end)
+{
+	while (true) {
+		/*
+		 * Loop invariant: start < range->__subtree.write_end
+		 * (Cond2 is satisfied by one of the subtree ranges)
+		 */
+		if (range->rb.rb_left) {
+			struct mm_lock_range *left = rb_entry(
+				range->rb.rb_left, struct mm_lock_range, rb);
+			if (start < left->__subtree.write_end) {
+				/*
+				 * Some ranges in left subtree satisfy Cond2.
+				 * Iterate to find the leftmost such range R.
+				 * If it also satisfies Cond1, that's the
+				 * match we are looking for. Otherwise, there
+				 * is no matching interval as ranges to the
+				 * right of R can't satisfy Cond1 either.
+				 */
+				range = left;
+				continue;
+			}
+		}
+		if (range->start < end) {		/* Cond1 */
+			if (start < range->end &&	/* Cond2 */
+			    range->flags_count & MM_LOCK_RANGE_WRITE)
+				return range;	/* range is leftmost match */
+			if (range->rb.rb_right) {
+				range = rb_entry(range->rb.rb_right,
+						 struct mm_lock_range, rb);
+				if (start < range->__subtree.write_end)
+					continue;
+			}
+		}
+		return NULL;	/* No match */
+	}
+}
+
+static struct mm_lock_range *
+iter_first_write(struct rb_root *root, unsigned long start, unsigned long end)
+{
+	struct mm_lock_range *range;
+
+	if (!root->rb_node)
+		return NULL;
+	range = rb_entry(root->rb_node, struct mm_lock_range, rb);
+	if (range->__subtree.write_end <= start)
+		return NULL;
+	return subtree_search_write(range, start, end);
+}
+
+static struct mm_lock_range *
+iter_next_write(struct mm_lock_range *range,
+		unsigned long start, unsigned long end)
+{
+	struct rb_node *rb = range->rb.rb_right, *prev;
+
+	while (true) {
+		/*
+		 * Loop invariants:
+		 *   Cond1: range->start < end
+		 *   rb == range->rb.rb_right
+		 *
+		 * First, search right subtree if suitable
+		 */
+		if (rb) {
+			struct mm_lock_range *right = rb_entry(
+				rb, struct mm_lock_range, rb);
+			if (start < right->__subtree.write_end)
+				return subtree_search_write(right, start, end);
+		}
+
+		/* Move up the tree until we come from a range's left child */
+		do {
+			rb = rb_parent(&range->rb);
+			if (!rb)
+				return NULL;
+			prev = &range->rb;
+			range = rb_entry(rb, struct mm_lock_range, rb);
+			rb = range->rb.rb_right;
+		} while (prev == rb);
+
+		/* Check if the range intersects [start;end) */
+		if (end <= range->start)		/* !Cond1 */
+			return NULL;
+		else if (start < range->end &&		/* Cond2 */
+			 range->flags_count & MM_LOCK_RANGE_WRITE)
+			return range;
+	}
+}
+
+#define FOR_EACH_RANGE_WRITE(mm, start, end, tmp)			\
+for (tmp = iter_first_write(&mm->mmap_sem.rb_root, start, end); tmp;	\
+     tmp = iter_next_write(tmp, start, end))
+
+static bool queue_read(struct mm_struct *mm, struct mm_lock_range *range)
+{
+	struct mm_lock_range *conflict;
+	long flags_count = 0;
+
+	FOR_EACH_RANGE_WRITE(mm, range->start, range->end, conflict)
+		flags_count -= MM_LOCK_RANGE_COUNT_ONE;
+	range->flags_count = flags_count;
+	insert_read(range, &mm->mmap_sem.rb_root);
+	return flags_count < 0;
+}
+
+static bool queue_write(struct mm_struct *mm, struct mm_lock_range *range)
+{
+	struct mm_lock_range *conflict;
+	long flags_count = MM_LOCK_RANGE_WRITE;
+
+	FOR_EACH_RANGE(mm, range->start, range->end, conflict)
+		flags_count -= MM_LOCK_RANGE_COUNT_ONE;
+	range->flags_count = flags_count;
+	insert_write(range, &mm->mmap_sem.rb_root);
+	return flags_count < 0;
+}
+
+static inline void prepare_wait(struct mm_lock_range *range, unsigned long seq)
+{
+	range->seq = seq;
+	range->task = current;
+}
+
+static void wait(struct mm_lock_range *range)
+{
+	while (true) {
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		if (range->flags_count >= 0)
+			break;
+		schedule();
+	}
+	__set_current_state(TASK_RUNNING);
+}
+
+static bool wait_killable(struct mm_lock_range *range)
+{
+	while (true) {
+		set_current_state(TASK_INTERRUPTIBLE);
+		if (range->flags_count >= 0) {
+			__set_current_state(TASK_RUNNING);
+			return true;
+		}
+		if (signal_pending(current)) {
+			__set_current_state(TASK_RUNNING);
+			return false;
+		}
+		schedule();
+	}
+}
+
+static inline void unlock_conflict(struct mm_lock_range *range,
+				   struct wake_q_head *wake_q)
+{
+	if ((range->flags_count += MM_LOCK_RANGE_COUNT_ONE) >= 0)
+		wake_q_add(wake_q, range->task);
+}
+
+void mm_write_range_lock(struct mm_struct *mm, struct mm_lock_range *range)
+{
+	bool contended;
+
+	lock_acquire_exclusive(&mm->mmap_sem.dep_map, 0, 0, NULL, _RET_IP_);
+
+	mutex_lock(&mm->mmap_sem.mutex);
+	if ((contended = queue_write(mm, range)))
+		prepare_wait(range, mm->mmap_sem.seq);
+	mutex_unlock(&mm->mmap_sem.mutex);
+
+	if (contended) {
+		lock_contended(&mm->mmap_sem.dep_map, _RET_IP_);
+		wait(range);
+	}
+	lock_acquired(&mm->mmap_sem.dep_map, _RET_IP_);
+}
+EXPORT_SYMBOL(mm_write_range_lock);
+
+#ifdef CONFIG_LOCKDEP
+void mm_write_range_lock_nested(struct mm_struct *mm,
+				struct mm_lock_range *range, int subclass)
+{
+	bool contended;
+
+	lock_acquire_exclusive(&mm->mmap_sem.dep_map, subclass, 0, NULL,
+			       _RET_IP_);
+
+	mutex_lock(&mm->mmap_sem.mutex);
+	if ((contended = queue_write(mm, range)))
+		prepare_wait(range, mm->mmap_sem.seq);
+	mutex_unlock(&mm->mmap_sem.mutex);
+
+	if (contended) {
+		lock_contended(&mm->mmap_sem.dep_map, _RET_IP_);
+		wait(range);
+	}
+	lock_acquired(&mm->mmap_sem.dep_map, _RET_IP_);
+}
+EXPORT_SYMBOL(mm_write_range_lock_nested);
+#endif
+
+int mm_write_range_lock_killable(struct mm_struct *mm,
+				 struct mm_lock_range *range)
+{
+	bool contended;
+
+	lock_acquire_exclusive(&mm->mmap_sem.dep_map, 0, 0, NULL, _RET_IP_);
+
+	mutex_lock(&mm->mmap_sem.mutex);
+	if ((contended = queue_write(mm, range)))
+		prepare_wait(range, ++(mm->mmap_sem.seq));
+	mutex_unlock(&mm->mmap_sem.mutex);
+
+	if (contended) {
+		lock_contended(&mm->mmap_sem.dep_map, _RET_IP_);
+		if (!wait_killable(range)) {
+			struct mm_lock_range *conflict;
+			DEFINE_WAKE_Q(wake_q);
+
+			mutex_lock(&mm->mmap_sem.mutex);
+			remove(range, &mm->mmap_sem.rb_root);
+			FOR_EACH_RANGE(mm, range->start, range->end, conflict)
+				if (conflict->flags_count < 0 &&
+				    conflict->seq - range->seq <= (~0UL >> 1))
+					unlock_conflict(conflict, &wake_q);
+			mutex_unlock(&mm->mmap_sem.mutex);
+
+			wake_up_q(&wake_q);
+			lock_release(&mm->mmap_sem.dep_map, _RET_IP_);
+			return -EINTR;
+		}
+	}
+	lock_acquired(&mm->mmap_sem.dep_map, _RET_IP_);
+	return 0;
+}
+EXPORT_SYMBOL(mm_write_range_lock_killable);
+
+bool mm_write_range_trylock(struct mm_struct *mm, struct mm_lock_range *range)
+{
+	bool locked = false;
+
+	if (!mutex_trylock(&mm->mmap_sem.mutex))
+		goto exit;
+	if (iter_first(&mm->mmap_sem.rb_root, range->start, range->end))
+		goto unlock;
+	lock_acquire_exclusive(&mm->mmap_sem.dep_map, 0, 1, NULL,
+			       _RET_IP_);
+	range->flags_count = MM_LOCK_RANGE_WRITE;
+	insert_write(range, &mm->mmap_sem.rb_root);
+	locked = true;
+unlock:
+	mutex_unlock(&mm->mmap_sem.mutex);
+exit:
+	return locked;
+}
+EXPORT_SYMBOL(mm_write_range_trylock);
+
+void mm_write_range_unlock(struct mm_struct *mm, struct mm_lock_range *range)
+{
+	struct mm_lock_range *conflict;
+	DEFINE_WAKE_Q(wake_q);
+
+	mutex_lock(&mm->mmap_sem.mutex);
+	remove(range, &mm->mmap_sem.rb_root);
+	FOR_EACH_RANGE(mm, range->start, range->end, conflict)
+		unlock_conflict(conflict, &wake_q);
+	mutex_unlock(&mm->mmap_sem.mutex);
+
+	wake_up_q(&wake_q);
+	lock_release(&mm->mmap_sem.dep_map, _RET_IP_);
+}
+EXPORT_SYMBOL(mm_write_range_unlock);
+
+void mm_downgrade_write_range_lock(struct mm_struct *mm,
+				   struct mm_lock_range *range)
+{
+	struct mm_lock_range *conflict;
+	DEFINE_WAKE_Q(wake_q);
+
+	mutex_lock(&mm->mmap_sem.mutex);
+	FOR_EACH_RANGE_READ(mm, range->start, range->end, conflict)
+		unlock_conflict(conflict, &wake_q);
+	range->flags_count -= MM_LOCK_RANGE_WRITE;
+	augment_propagate(&range->rb, NULL);
+	mutex_unlock(&mm->mmap_sem.mutex);
+
+	wake_up_q(&wake_q);
+	lock_downgrade(&mm->mmap_sem.dep_map, _RET_IP_);
+}
+EXPORT_SYMBOL(mm_downgrade_write_range_lock);
+
+void mm_read_range_lock(struct mm_struct *mm, struct mm_lock_range *range)
+{
+	bool contended;
+
+	lock_acquire_shared(&mm->mmap_sem.dep_map, 0, 0, NULL, _RET_IP_);
+
+	mutex_lock(&mm->mmap_sem.mutex);
+	if ((contended = queue_read(mm, range)))
+		prepare_wait(range, mm->mmap_sem.seq);
+	mutex_unlock(&mm->mmap_sem.mutex);
+
+	if (contended) {
+		lock_contended(&mm->mmap_sem.dep_map, _RET_IP_);
+		wait(range);
+	}
+	lock_acquired(&mm->mmap_sem.dep_map, _RET_IP_);
+}
+EXPORT_SYMBOL(mm_read_range_lock);
+
+int mm_read_range_lock_killable(struct mm_struct *mm,
+				struct mm_lock_range *range)
+{
+	bool contended;
+
+	lock_acquire_shared(&mm->mmap_sem.dep_map, 0, 0, NULL, _RET_IP_);
+
+	mutex_lock(&mm->mmap_sem.mutex);
+	if ((contended = queue_read(mm, range)))
+		prepare_wait(range, ++(mm->mmap_sem.seq));
+	mutex_unlock(&mm->mmap_sem.mutex);
+
+	if (contended) {
+		lock_contended(&mm->mmap_sem.dep_map, _RET_IP_);
+		if (!wait_killable(range)) {
+			struct mm_lock_range *conflict;
+			DEFINE_WAKE_Q(wake_q);
+
+			mutex_lock(&mm->mmap_sem.mutex);
+			remove(range, &mm->mmap_sem.rb_root);
+			FOR_EACH_RANGE_WRITE(mm, range->start, range->end,
+					     conflict)
+				if (conflict->flags_count < 0 &&
+				    conflict->seq - range->seq <= (~0UL >> 1))
+					unlock_conflict(conflict, &wake_q);
+			mutex_unlock(&mm->mmap_sem.mutex);
+
+			wake_up_q(&wake_q);
+			lock_release(&mm->mmap_sem.dep_map, _RET_IP_);
+			return -EINTR;
+		}
+	}
+	lock_acquired(&mm->mmap_sem.dep_map, _RET_IP_);
+	return 0;
+}
+EXPORT_SYMBOL(mm_read_range_lock_killable);
+
+bool mm_read_range_trylock(struct mm_struct *mm, struct mm_lock_range *range)
+{
+	bool locked = false;
+
+	if (!mutex_trylock(&mm->mmap_sem.mutex))
+		goto exit;
+	if (iter_first_write(&mm->mmap_sem.rb_root, range->start, range->end))
+		goto unlock;
+	lock_acquire_shared(&mm->mmap_sem.dep_map, 0, 1, NULL, _RET_IP_);
+	range->flags_count = 0;
+	insert_read(range, &mm->mmap_sem.rb_root);
+	locked = true;
+unlock:
+	mutex_unlock(&mm->mmap_sem.mutex);
+exit:
+	return locked;
+}
+EXPORT_SYMBOL(mm_read_range_trylock);
+
+void mm_read_range_unlock_non_owner(struct mm_struct *mm,
+				    struct mm_lock_range *range)
+{
+	struct mm_lock_range *conflict;
+	DEFINE_WAKE_Q(wake_q);
+
+	mutex_lock(&mm->mmap_sem.mutex);
+	remove(range, &mm->mmap_sem.rb_root);
+	FOR_EACH_RANGE_WRITE(mm, range->start, range->end, conflict)
+		unlock_conflict(conflict, &wake_q);
+	mutex_unlock(&mm->mmap_sem.mutex);
+
+	wake_up_q(&wake_q);
+}
+EXPORT_SYMBOL(mm_read_range_unlock_non_owner);
+
+void mm_read_range_unlock(struct mm_struct *mm, struct mm_lock_range *range)
+{
+	mm_read_range_unlock_non_owner(mm, range);
+	lock_release(&mm->mmap_sem.dep_map, _RET_IP_);
+}
+EXPORT_SYMBOL(mm_read_range_unlock);
-- 
2.25.0.341.g760bfbb309-goog



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 07/24] mm/memory: add range field to struct vm_fault
  2020-02-24 20:30 [RFC PATCH 00/24] Fine grained MM locking Michel Lespinasse
                   ` (5 preceding siblings ...)
  2020-02-24 20:30 ` [RFC PATCH 06/24] MM locking API: implement fine grained range locks Michel Lespinasse
@ 2020-02-24 20:30 ` Michel Lespinasse
  2020-02-24 20:30 ` [RFC PATCH 08/24] mm/memory: allow specifying MM lock range to handle_mm_fault() Michel Lespinasse
                   ` (17 subsequent siblings)
  24 siblings, 0 replies; 28+ messages in thread
From: Michel Lespinasse @ 2020-02-24 20:30 UTC (permalink / raw)
  To: Peter Zijlstra, Andrew Morton, Laurent Dufour, Vlastimil Babka,
	Matthew Wilcox, Liam R . Howlett, Jerome Glisse, Davidlohr Bueso,
	David Rientjes
  Cc: linux-mm, Michel Lespinasse

Add a range field to struct vm_fault. This carries the range that was
locked for the given fault.

Fault handlers that drop the mmap_sem should release the range
specified here rather than assuming the whole address space was locked.
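
For now every fault is initialized with the coarse range, roughly as in
the __handle_mm_fault() hunk below:

	struct vm_fault vmf = {
		.vma = vma,
		.address = address & PAGE_MASK,
		.flags = flags,
		.pgoff = linear_page_index(vma, address),
		.gfp_mask = __get_fault_gfp_mask(vma),
		.range = mm_coarse_lock_range(),
	};

Code that drops the mmap_sem while handling the fault (do_swap_page,
handle_userfault, ...) can then release vmf->range instead of assuming
the whole address space was locked.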

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 include/linux/mm.h | 1 +
 mm/hugetlb.c       | 1 +
 mm/khugepaged.c    | 1 +
 mm/memory.c        | 1 +
 4 files changed, 4 insertions(+)

diff --git include/linux/mm.h include/linux/mm.h
index 052f423d7f67..a1c9a0aa898b 100644
--- include/linux/mm.h
+++ include/linux/mm.h
@@ -451,6 +451,7 @@ struct vm_fault {
 					 * page table to avoid allocation from
 					 * atomic context.
 					 */
+	struct mm_lock_range *range;	/* MM read lock range. */
 };
 
 /* page entry size for vm->huge_fault() */
diff --git mm/hugetlb.c mm/hugetlb.c
index dd8737a94bec..662f34b6c869 100644
--- mm/hugetlb.c
+++ mm/hugetlb.c
@@ -3831,6 +3831,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 				.vma = vma,
 				.address = haddr,
 				.flags = flags,
+				.range = mm_coarse_lock_range(),
 				/*
 				 * Hard to debug if it ends up being
 				 * used by a callee that assumes
diff --git mm/khugepaged.c mm/khugepaged.c
index 7ee8ae64824b..a7807bb0d631 100644
--- mm/khugepaged.c
+++ mm/khugepaged.c
@@ -900,6 +900,7 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
 		.flags = FAULT_FLAG_ALLOW_RETRY,
 		.pmd = pmd,
 		.pgoff = linear_page_index(vma, address),
+		.range = mm_coarse_lock_range(),
 	};
 
 	/* we only decide to swapin, if there is enough young ptes */
diff --git mm/memory.c mm/memory.c
index 45b42fa02a2e..6cb3359f0857 100644
--- mm/memory.c
+++ mm/memory.c
@@ -4047,6 +4047,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 		.flags = flags,
 		.pgoff = linear_page_index(vma, address),
 		.gfp_mask = __get_fault_gfp_mask(vma),
+		.range = mm_coarse_lock_range(),
 	};
 	unsigned int dirty = flags & FAULT_FLAG_WRITE;
 	struct mm_struct *mm = vma->vm_mm;
-- 
2.25.0.341.g760bfbb309-goog



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 08/24] mm/memory: allow specifying MM lock range to handle_mm_fault()
  2020-02-24 20:30 [RFC PATCH 00/24] Fine grained MM locking Michel Lespinasse
                   ` (6 preceding siblings ...)
  2020-02-24 20:30 ` [RFC PATCH 07/24] mm/memory: add range field to struct vm_fault Michel Lespinasse
@ 2020-02-24 20:30 ` Michel Lespinasse
  2020-02-24 20:30 ` [RFC PATCH 09/24] do_swap_page: use the vmf->range field when dropping mmap_sem Michel Lespinasse
                   ` (16 subsequent siblings)
  24 siblings, 0 replies; 28+ messages in thread
From: Michel Lespinasse @ 2020-02-24 20:30 UTC (permalink / raw)
  To: Peter Zijlstra, Andrew Morton, Laurent Dufour, Vlastimil Babka,
	Matthew Wilcox, Liam R . Howlett, Jerome Glisse, Davidlohr Bueso,
	David Rientjes
  Cc: linux-mm, Michel Lespinasse

This change adds a new handle_mm_fault_range() function, which behaves
like handle_mm_fault() but takes an explicit MM lock range argument.

handle_mm_fault() remains as an inline wrapper which passes the default
coarse locking range.
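
A purely illustrative sketch of the intended calling convention follows;
the actual range policy for the x86 fault handler is introduced later in
the series ("use an explicit MM lock range"):

	struct mm_lock_range range;

	mm_init_lock_range(&range, start, end);
	mm_read_range_lock(mm, &range);
	vma = ...;	/* look up the vma covering the faulting address */
	fault = handle_mm_fault_range(vma, address, flags, &range);
	mm_read_range_unlock(mm, &range);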

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 include/linux/hugetlb.h |  5 +++--
 include/linux/mm.h      | 11 +++++++++--
 mm/hugetlb.c            | 14 +++++++++-----
 mm/memory.c             | 16 +++++++++-------
 4 files changed, 30 insertions(+), 16 deletions(-)

diff --git include/linux/hugetlb.h include/linux/hugetlb.h
index 31d4920994b9..75992d78289e 100644
--- include/linux/hugetlb.h
+++ include/linux/hugetlb.h
@@ -88,7 +88,8 @@ int hugetlb_report_node_meminfo(int, char *);
 void hugetlb_show_meminfo(void);
 unsigned long hugetlb_total_pages(void);
 vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
-			unsigned long address, unsigned int flags);
+			unsigned long address, unsigned int flags,
+			struct mm_lock_range *range);
 int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, pte_t *dst_pte,
 				struct vm_area_struct *dst_vma,
 				unsigned long dst_addr,
@@ -307,7 +308,7 @@ static inline void __unmap_hugepage_range(struct mmu_gather *tlb,
 
 static inline vm_fault_t hugetlb_fault(struct mm_struct *mm,
 			struct vm_area_struct *vma, unsigned long address,
-			unsigned int flags)
+			unsigned int flags, struct mm_lock_range *range)
 {
 	BUG();
 	return 0;
diff --git include/linux/mm.h include/linux/mm.h
index a1c9a0aa898b..1b6b022064b4 100644
--- include/linux/mm.h
+++ include/linux/mm.h
@@ -1460,8 +1460,15 @@ int generic_error_remove_page(struct address_space *mapping, struct page *page);
 int invalidate_inode_page(struct page *page);
 
 #ifdef CONFIG_MMU
-extern vm_fault_t handle_mm_fault(struct vm_area_struct *vma,
-			unsigned long address, unsigned int flags);
+extern vm_fault_t handle_mm_fault_range(struct vm_area_struct *vma,
+			unsigned long address, unsigned int flags,
+			struct mm_lock_range *range);
+static inline vm_fault_t handle_mm_fault(struct vm_area_struct *vma,
+			unsigned long address, unsigned int flags)
+{
+	return handle_mm_fault_range(vma, address, flags,
+				     mm_coarse_lock_range());
+}
 extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
 			    unsigned long address, unsigned int fault_flags,
 			    bool *unlocked);
diff --git mm/hugetlb.c mm/hugetlb.c
index 662f34b6c869..9d6fe9f291a7 100644
--- mm/hugetlb.c
+++ mm/hugetlb.c
@@ -3788,7 +3788,8 @@ int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
 static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 			struct vm_area_struct *vma,
 			struct address_space *mapping, pgoff_t idx,
-			unsigned long address, pte_t *ptep, unsigned int flags)
+			unsigned long address, pte_t *ptep, unsigned int flags,
+			struct mm_lock_range *range)
 {
 	struct hstate *h = hstate_vma(vma);
 	vm_fault_t ret = VM_FAULT_SIGBUS;
@@ -3831,7 +3832,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 				.vma = vma,
 				.address = haddr,
 				.flags = flags,
-				.range = mm_coarse_lock_range(),
+				.range = range,
 				/*
 				 * Hard to debug if it ends up being
 				 * used by a callee that assumes
@@ -3997,7 +3998,8 @@ u32 hugetlb_fault_mutex_hash(struct address_space *mapping, pgoff_t idx)
 #endif
 
 vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
-			unsigned long address, unsigned int flags)
+			unsigned long address, unsigned int flags,
+			struct mm_lock_range *range)
 {
 	pte_t *ptep, entry;
 	spinlock_t *ptl;
@@ -4039,7 +4041,8 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	entry = huge_ptep_get(ptep);
 	if (huge_pte_none(entry)) {
-		ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep, flags);
+		ret = hugetlb_no_page(mm, vma, mapping, idx, address, ptep,
+				      flags, range);
 		goto out_mutex;
 	}
 
@@ -4348,7 +4351,8 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 						FAULT_FLAG_ALLOW_RETRY);
 				fault_flags |= FAULT_FLAG_TRIED;
 			}
-			ret = hugetlb_fault(mm, vma, vaddr, fault_flags);
+			ret = hugetlb_fault(mm, vma, vaddr, fault_flags,
+					    mm_coarse_lock_range());
 			if (ret & VM_FAULT_ERROR) {
 				err = vm_fault_to_errno(ret, flags);
 				remainder = 0;
diff --git mm/memory.c mm/memory.c
index 6cb3359f0857..bc24a6bdaa06 100644
--- mm/memory.c
+++ mm/memory.c
@@ -4039,7 +4039,8 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
  * return value.  See filemap_fault() and __lock_page_or_retry().
  */
 static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
-		unsigned long address, unsigned int flags)
+		unsigned long address, unsigned int flags,
+		struct mm_lock_range *range)
 {
 	struct vm_fault vmf = {
 		.vma = vma,
@@ -4047,7 +4048,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 		.flags = flags,
 		.pgoff = linear_page_index(vma, address),
 		.gfp_mask = __get_fault_gfp_mask(vma),
-		.range = mm_coarse_lock_range(),
+		.range = range,
 	};
 	unsigned int dirty = flags & FAULT_FLAG_WRITE;
 	struct mm_struct *mm = vma->vm_mm;
@@ -4134,8 +4135,9 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
  * The mmap_sem may have been released depending on flags and our
  * return value.  See filemap_fault() and __lock_page_or_retry().
  */
-vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
-		unsigned int flags)
+vm_fault_t handle_mm_fault_range(struct vm_area_struct *vma,
+		unsigned long address, unsigned int flags,
+		struct mm_lock_range *range)
 {
 	vm_fault_t ret;
 
@@ -4160,9 +4162,9 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 		mem_cgroup_enter_user_fault();
 
 	if (unlikely(is_vm_hugetlb_page(vma)))
-		ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
+		ret = hugetlb_fault(vma->vm_mm, vma, address, flags, range);
 	else
-		ret = __handle_mm_fault(vma, address, flags);
+		ret = __handle_mm_fault(vma, address, flags, range);
 
 	if (flags & FAULT_FLAG_USER) {
 		mem_cgroup_exit_user_fault();
@@ -4178,7 +4180,7 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 
 	return ret;
 }
-EXPORT_SYMBOL_GPL(handle_mm_fault);
+EXPORT_SYMBOL_GPL(handle_mm_fault_range);
 
 #ifndef __PAGETABLE_P4D_FOLDED
 /*
-- 
2.25.0.341.g760bfbb309-goog



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 09/24] do_swap_page: use the vmf->range field when dropping mmap_sem
  2020-02-24 20:30 [RFC PATCH 00/24] Fine grained MM locking Michel Lespinasse
                   ` (7 preceding siblings ...)
  2020-02-24 20:30 ` [RFC PATCH 08/24] mm/memory: allow specifying MM lock range to handle_mm_fault() Michel Lespinasse
@ 2020-02-24 20:30 ` Michel Lespinasse
  2020-02-24 20:30 ` [RFC PATCH 10/24] handle_userfault: " Michel Lespinasse
                   ` (15 subsequent siblings)
  24 siblings, 0 replies; 28+ messages in thread
From: Michel Lespinasse @ 2020-02-24 20:30 UTC (permalink / raw)
  To: Peter Zijlstra, Andrew Morton, Laurent Dufour, Vlastimil Babka,
	Matthew Wilcox, Liam R . Howlett, Jerome Glisse, Davidlohr Bueso,
	David Rientjes
  Cc: linux-mm, Michel Lespinasse

Change do_swap_page() and lock_page_or_retry() so that the proper range
will be released when swapping in.

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 include/linux/pagemap.h | 7 ++++---
 mm/filemap.c            | 6 +++---
 mm/memory.c             | 2 +-
 3 files changed, 8 insertions(+), 7 deletions(-)

diff --git include/linux/pagemap.h include/linux/pagemap.h
index 37a4d9e32cd3..93520477c481 100644
--- include/linux/pagemap.h
+++ include/linux/pagemap.h
@@ -458,7 +458,7 @@ static inline pgoff_t linear_page_index(struct vm_area_struct *vma,
 extern void __lock_page(struct page *page);
 extern int __lock_page_killable(struct page *page);
 extern int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
-				unsigned int flags);
+		unsigned int flags, struct mm_lock_range *range);
 extern void unlock_page(struct page *page);
 
 /*
@@ -501,10 +501,11 @@ static inline int lock_page_killable(struct page *page)
  * __lock_page_or_retry().
  */
 static inline int lock_page_or_retry(struct page *page, struct mm_struct *mm,
-				     unsigned int flags)
+		unsigned int flags, struct mm_lock_range *range)
 {
 	might_sleep();
-	return trylock_page(page) || __lock_page_or_retry(page, mm, flags);
+	return trylock_page(page) || __lock_page_or_retry(page, mm, flags,
+							  range);
 }
 
 /*
diff --git mm/filemap.c mm/filemap.c
index eb6487065ca0..3afb5a3f0b9c 100644
--- mm/filemap.c
+++ mm/filemap.c
@@ -1406,7 +1406,7 @@ EXPORT_SYMBOL_GPL(__lock_page_killable);
  * with the page locked and the mmap_sem unperturbed.
  */
 int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
-			 unsigned int flags)
+			 unsigned int flags, struct mm_lock_range *range)
 {
 	if (flags & FAULT_FLAG_ALLOW_RETRY) {
 		/*
@@ -1416,7 +1416,7 @@ int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
 		if (flags & FAULT_FLAG_RETRY_NOWAIT)
 			return 0;
 
-		mm_read_unlock(mm);
+		mm_read_range_unlock(mm, range);
 		if (flags & FAULT_FLAG_KILLABLE)
 			wait_on_page_locked_killable(page);
 		else
@@ -1428,7 +1428,7 @@ int __lock_page_or_retry(struct page *page, struct mm_struct *mm,
 
 			ret = __lock_page_killable(page);
 			if (ret) {
-				mm_read_unlock(mm);
+				mm_read_range_unlock(mm, range);
 				return 0;
 			}
 		} else
diff --git mm/memory.c mm/memory.c
index bc24a6bdaa06..3da4ae504957 100644
--- mm/memory.c
+++ mm/memory.c
@@ -2964,7 +2964,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		goto out_release;
 	}
 
-	locked = lock_page_or_retry(page, vma->vm_mm, vmf->flags);
+	locked = lock_page_or_retry(page, vma->vm_mm, vmf->flags, vmf->range);
 
 	delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
 	if (!locked) {
-- 
2.25.0.341.g760bfbb309-goog



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 10/24] handle_userfault: use the vmf->range field when dropping mmap_sem
  2020-02-24 20:30 [RFC PATCH 00/24] Fine grained MM locking Michel Lespinasse
                   ` (8 preceding siblings ...)
  2020-02-24 20:30 ` [RFC PATCH 09/24] do_swap_page: use the vmf->range field when dropping mmap_sem Michel Lespinasse
@ 2020-02-24 20:30 ` Michel Lespinasse
  2020-02-24 20:30 ` [RFC PATCH 11/24] x86 fault handler: merge bad_area() functions Michel Lespinasse
                   ` (14 subsequent siblings)
  24 siblings, 0 replies; 28+ messages in thread
From: Michel Lespinasse @ 2020-02-24 20:30 UTC (permalink / raw)
  To: Peter Zijlstra, Andrew Morton, Laurent Dufour, Vlastimil Babka,
	Matthew Wilcox, Liam R . Howlett, Jerome Glisse, Davidlohr Bueso,
	David Rientjes
  Cc: linux-mm, Michel Lespinasse

Change handle_userfault() to drop the proper memory range
as indicated in the vmf.

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 fs/userfaultfd.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git fs/userfaultfd.c fs/userfaultfd.c
index f38095a7ebcd..2b8ee3eaacd7 100644
--- fs/userfaultfd.c
+++ fs/userfaultfd.c
@@ -489,7 +489,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 		must_wait = userfaultfd_huge_must_wait(ctx, vmf->vma,
 						       vmf->address,
 						       vmf->flags, reason);
-	mm_read_unlock(mm);
+	mm_read_range_unlock(mm, vmf->range);
 
 	if (likely(must_wait && !READ_ONCE(ctx->released) &&
 		   (return_to_userland ? !signal_pending(current) :
@@ -543,7 +543,7 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
 			 * and there's no need to retake the mmap_sem
 			 * in such case.
 			 */
-			mm_read_lock(mm);
+			mm_read_range_lock(mm, vmf->range);
 			ret = VM_FAULT_NOPAGE;
 		}
 	}
-- 
2.25.0.341.g760bfbb309-goog



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 11/24] x86 fault handler: merge bad_area() functions
  2020-02-24 20:30 [RFC PATCH 00/24] Fine grained MM locking Michel Lespinasse
                   ` (9 preceding siblings ...)
  2020-02-24 20:30 ` [RFC PATCH 10/24] handle_userfault: " Michel Lespinasse
@ 2020-02-24 20:30 ` Michel Lespinasse
  2020-02-24 20:30 ` [RFC PATCH 12/24] x86 fault handler: use an explicit MM lock range Michel Lespinasse
                   ` (13 subsequent siblings)
  24 siblings, 0 replies; 28+ messages in thread
From: Michel Lespinasse @ 2020-02-24 20:30 UTC (permalink / raw)
  To: Peter Zijlstra, Andrew Morton, Laurent Dufour, Vlastimil Babka,
	Matthew Wilcox, Liam R . Howlett, Jerome Glisse, Davidlohr Bueso,
	David Rientjes
  Cc: linux-mm, Michel Lespinasse

This merges the bad_area(), bad_area_access_error() and the underlying
__bad_area() functions into a single unified function.

Passing a NULL vma triggers the prior bad_area() behavior, while
passing a non-NULL vma triggers the prior bad_area_access_error() behavior.

The control flow is very similar in all cases, and we now release the
mmap_sem read lock in a single place rather than three.

Text size is reduced by 356 bytes here.

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 arch/x86/mm/fault.c | 54 ++++++++++++++++++++-------------------------
 1 file changed, 24 insertions(+), 30 deletions(-)

diff --git arch/x86/mm/fault.c arch/x86/mm/fault.c
index a8ce9e160b72..adbd2b03fcf9 100644
--- arch/x86/mm/fault.c
+++ arch/x86/mm/fault.c
@@ -919,26 +919,6 @@ bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
 	__bad_area_nosemaphore(regs, error_code, address, 0, SEGV_MAPERR);
 }
 
-static void
-__bad_area(struct pt_regs *regs, unsigned long error_code,
-	   unsigned long address, u32 pkey, int si_code)
-{
-	struct mm_struct *mm = current->mm;
-	/*
-	 * Something tried to access memory that isn't in our memory map..
-	 * Fix it, but check if it's kernel or user first..
-	 */
-	mm_read_unlock(mm);
-
-	__bad_area_nosemaphore(regs, error_code, address, pkey, si_code);
-}
-
-static noinline void
-bad_area(struct pt_regs *regs, unsigned long error_code, unsigned long address)
-{
-	__bad_area(regs, error_code, address, 0, SEGV_MAPERR);
-}
-
 static inline bool bad_area_access_from_pkeys(unsigned long error_code,
 		struct vm_area_struct *vma)
 {
@@ -957,9 +937,15 @@ static inline bool bad_area_access_from_pkeys(unsigned long error_code,
 }
 
 static noinline void
-bad_area_access_error(struct pt_regs *regs, unsigned long error_code,
-		      unsigned long address, struct vm_area_struct *vma)
+bad_area(struct pt_regs *regs, unsigned long error_code,
+	 unsigned long address, struct vm_area_struct *vma)
 {
+	u32 pkey = 0;
+	int si_code = SEGV_MAPERR;
+
+	if (!vma)
+		goto unlock;
+
 	/*
 	 * This OSPKE check is not strictly necessary at runtime.
 	 * But, doing it this way allows compiler optimizations
@@ -986,12 +972,20 @@ bad_area_access_error(struct pt_regs *regs, unsigned long error_code,
 		 * 6. T1   : reaches here, sees vma_pkey(vma)=5, when we really
 		 *	     faulted on a pte with its pkey=4.
 		 */
-		u32 pkey = vma_pkey(vma);
-
-		__bad_area(regs, error_code, address, pkey, SEGV_PKUERR);
+		pkey = vma_pkey(vma);
+		si_code = SEGV_PKUERR;
 	} else {
-		__bad_area(regs, error_code, address, 0, SEGV_ACCERR);
+		si_code = SEGV_ACCERR;
 	}
+
+unlock:
+	/*
+	 * Something tried to access memory that isn't in our memory map..
+	 * Fix it, but check if it's kernel or user first..
+	 */
+	mm_read_unlock(current->mm);
+
+	__bad_area_nosemaphore(regs, error_code, address, pkey, si_code);
 }
 
 static void
@@ -1401,17 +1395,17 @@ void do_user_addr_fault(struct pt_regs *regs,
 
 	vma = find_vma(mm, address);
 	if (unlikely(!vma)) {
-		bad_area(regs, hw_error_code, address);
+		bad_area(regs, hw_error_code, address, NULL);
 		return;
 	}
 	if (likely(vma->vm_start <= address))
 		goto good_area;
 	if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) {
-		bad_area(regs, hw_error_code, address);
+		bad_area(regs, hw_error_code, address, NULL);
 		return;
 	}
 	if (unlikely(expand_stack(vma, address))) {
-		bad_area(regs, hw_error_code, address);
+		bad_area(regs, hw_error_code, address, NULL);
 		return;
 	}
 
@@ -1421,7 +1415,7 @@ void do_user_addr_fault(struct pt_regs *regs,
 	 */
 good_area:
 	if (unlikely(access_error(hw_error_code, vma))) {
-		bad_area_access_error(regs, hw_error_code, address, vma);
+		bad_area(regs, hw_error_code, address, vma);
 		return;
 	}
 
-- 
2.25.0.341.g760bfbb309-goog



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 12/24] x86 fault handler: use an explicit MM lock range
  2020-02-24 20:30 [RFC PATCH 00/24] Fine grained MM locking Michel Lespinasse
                   ` (10 preceding siblings ...)
  2020-02-24 20:30 ` [RFC PATCH 11/24] x86 fault handler: merge bad_area() functions Michel Lespinasse
@ 2020-02-24 20:30 ` Michel Lespinasse
  2020-02-24 20:30 ` [RFC PATCH 13/24] mm/memory: add prepare_mm_fault() function Michel Lespinasse
                   ` (12 subsequent siblings)
  24 siblings, 0 replies; 28+ messages in thread
From: Michel Lespinasse @ 2020-02-24 20:30 UTC (permalink / raw)
  To: Peter Zijlstra, Andrew Morton, Laurent Dufour, Vlastimil Babka,
	Matthew Wilcox, Liam R . Howlett, Jerome Glisse, Davidlohr Bueso,
	David Rientjes
  Cc: linux-mm, Michel Lespinasse

Use an explicit MM lock range through the fault handler and any called functions.
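
Illustrative sketch (not part of the patch): the range is obtained once and
threaded through the locking and fault handling calls, roughly as follows:

	struct mm_lock_range *range = mm_coarse_lock_range();

	mm_read_range_lock(mm, range);		/* still covers the whole mm */
	vma = find_vma(mm, address);
	if (vma)
		fault = handle_mm_fault_range(vma, address, flags, range);
	mm_read_range_unlock(mm, range);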

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 arch/x86/mm/fault.c | 24 ++++++++++++++----------
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git arch/x86/mm/fault.c arch/x86/mm/fault.c
index adbd2b03fcf9..700da3cc3db9 100644
--- arch/x86/mm/fault.c
+++ arch/x86/mm/fault.c
@@ -938,7 +938,8 @@ static inline bool bad_area_access_from_pkeys(unsigned long error_code,
 
 static noinline void
 bad_area(struct pt_regs *regs, unsigned long error_code,
-	 unsigned long address, struct vm_area_struct *vma)
+	 unsigned long address, struct vm_area_struct *vma,
+	 struct mm_lock_range *range)
 {
 	u32 pkey = 0;
 	int si_code = SEGV_MAPERR;
@@ -983,7 +984,7 @@ bad_area(struct pt_regs *regs, unsigned long error_code,
 	 * Something tried to access memory that isn't in our memory map..
 	 * Fix it, but check if it's kernel or user first..
 	 */
-	mm_read_unlock(current->mm);
+	mm_read_range_unlock(current->mm, range);
 
 	__bad_area_nosemaphore(regs, error_code, address, pkey, si_code);
 }
@@ -1277,6 +1278,7 @@ void do_user_addr_fault(struct pt_regs *regs,
 			unsigned long hw_error_code,
 			unsigned long address)
 {
+	struct mm_lock_range *range;
 	struct vm_area_struct *vma;
 	struct task_struct *tsk;
 	struct mm_struct *mm;
@@ -1361,6 +1363,8 @@ void do_user_addr_fault(struct pt_regs *regs,
 	}
 #endif
 
+	range = mm_coarse_lock_range();
+
 	/*
 	 * Kernel-mode access to the user address space should only occur
 	 * on well-defined single instructions listed in the exception
@@ -1373,7 +1377,7 @@ void do_user_addr_fault(struct pt_regs *regs,
 	 * 1. Failed to acquire mmap_sem, and
 	 * 2. The access did not originate in userspace.
 	 */
-	if (unlikely(!mm_read_trylock(mm))) {
+	if (unlikely(!mm_read_range_trylock(mm, range))) {
 		if (!user_mode(regs) && !search_exception_tables(regs->ip)) {
 			/*
 			 * Fault from code in kernel from
@@ -1383,7 +1387,7 @@ void do_user_addr_fault(struct pt_regs *regs,
 			return;
 		}
 retry:
-		mm_read_lock(mm);
+		mm_read_range_lock(mm, range);
 	} else {
 		/*
 		 * The above down_read_trylock() might have succeeded in
@@ -1395,17 +1399,17 @@ void do_user_addr_fault(struct pt_regs *regs,
 
 	vma = find_vma(mm, address);
 	if (unlikely(!vma)) {
-		bad_area(regs, hw_error_code, address, NULL);
+		bad_area(regs, hw_error_code, address, NULL, range);
 		return;
 	}
 	if (likely(vma->vm_start <= address))
 		goto good_area;
 	if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) {
-		bad_area(regs, hw_error_code, address, NULL);
+		bad_area(regs, hw_error_code, address, NULL, range);
 		return;
 	}
 	if (unlikely(expand_stack(vma, address))) {
-		bad_area(regs, hw_error_code, address, NULL);
+		bad_area(regs, hw_error_code, address, NULL, range);
 		return;
 	}
 
@@ -1415,7 +1419,7 @@ void do_user_addr_fault(struct pt_regs *regs,
 	 */
 good_area:
 	if (unlikely(access_error(hw_error_code, vma))) {
-		bad_area(regs, hw_error_code, address, vma);
+		bad_area(regs, hw_error_code, address, vma, range);
 		return;
 	}
 
@@ -1432,7 +1436,7 @@ void do_user_addr_fault(struct pt_regs *regs,
 	 * userland). The return to userland is identified whenever
 	 * FAULT_FLAG_USER|FAULT_FLAG_KILLABLE are both set in flags.
 	 */
-	fault = handle_mm_fault(vma, address, flags);
+	fault = handle_mm_fault_range(vma, address, flags, range);
 	major |= fault & VM_FAULT_MAJOR;
 
 	/*
@@ -1458,7 +1462,7 @@ void do_user_addr_fault(struct pt_regs *regs,
 		return;
 	}
 
-	mm_read_unlock(mm);
+	mm_read_range_unlock(mm, range);
 	if (unlikely(fault & VM_FAULT_ERROR)) {
 		mm_fault_error(regs, hw_error_code, address, fault);
 		return;
-- 
2.25.0.341.g760bfbb309-goog



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 13/24] mm/memory: add prepare_mm_fault() function
  2020-02-24 20:30 [RFC PATCH 00/24] Fine grained MM locking Michel Lespinasse
                   ` (11 preceding siblings ...)
  2020-02-24 20:30 ` [RFC PATCH 12/24] x86 fault handler: use an explicit MM lock range Michel Lespinasse
@ 2020-02-24 20:30 ` Michel Lespinasse
  2020-02-24 20:30 ` [RFC PATCH 14/24] mm/swap_state: disable swap vma readahead Michel Lespinasse
                   ` (11 subsequent siblings)
  24 siblings, 0 replies; 28+ messages in thread
From: Michel Lespinasse @ 2020-02-24 20:30 UTC (permalink / raw)
  To: Peter Zijlstra, Andrew Morton, Laurent Dufour, Vlastimil Babka,
	Matthew Wilcox, Liam R . Howlett, Jerome Glisse, Davidlohr Bueso,
	David Rientjes
  Cc: linux-mm, Michel Lespinasse

Add a prepare_mm_fault() function, which may allocate an anon_vma if
required for the incoming fault.

This is because the anon_vma must be allocated in the vma of record,
while in the range locked case, the fault will operate on a pseudo-vma.
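
Sketch of the intended ordering (illustrative only; the actual caller is
added in a later patch of this series):

	fault = prepare_mm_fault(vma, flags);	/* on the vma of record */
	if (fault)
		return fault;			/* typically VM_FAULT_OOM */

	pvma = *vma;				/* copy attributes into a pseudo-vma */
	vma = &pvma;				/* the fault itself uses the copy */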

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 include/linux/mm.h | 14 ++++++++++++++
 mm/memory.c        | 26 ++++++++++++++++++++++++++
 2 files changed, 40 insertions(+)

diff --git include/linux/mm.h include/linux/mm.h
index 1b6b022064b4..43b7121ae005 100644
--- include/linux/mm.h
+++ include/linux/mm.h
@@ -1460,6 +1460,15 @@ int generic_error_remove_page(struct address_space *mapping, struct page *page);
 int invalidate_inode_page(struct page *page);
 
 #ifdef CONFIG_MMU
+extern vm_fault_t __prepare_mm_fault(struct vm_area_struct *vma,
+		unsigned int flags);
+static inline vm_fault_t prepare_mm_fault(struct vm_area_struct *vma,
+		unsigned int flags)
+{
+	if (likely(vma->anon_vma))
+		return 0;
+	return __prepare_mm_fault(vma, flags);
+}
 extern vm_fault_t handle_mm_fault_range(struct vm_area_struct *vma,
 			unsigned long address, unsigned int flags,
 			struct mm_lock_range *range);
@@ -1477,6 +1486,11 @@ void unmap_mapping_pages(struct address_space *mapping,
 void unmap_mapping_range(struct address_space *mapping,
 		loff_t const holebegin, loff_t const holelen, int even_cows);
 #else
+static inline vm_fault_t prepare_mm_fault(struct vm_area_struct *vma,
+		unsigned int flags)
+{
+	return 0;
+}
 static inline vm_fault_t handle_mm_fault(struct vm_area_struct *vma,
 		unsigned long address, unsigned int flags)
 {
diff --git mm/memory.c mm/memory.c
index 3da4ae504957..9d0b761833fe 100644
--- mm/memory.c
+++ mm/memory.c
@@ -4129,6 +4129,32 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 	return handle_pte_fault(&vmf);
 }
 
+vm_fault_t __prepare_mm_fault(struct vm_area_struct *vma, unsigned int flags)
+{
+	vm_fault_t ret = 0;
+
+	if (vma_is_anonymous(vma) ||
+	    ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) ||
+	    (is_vm_hugetlb_page(vma) && !(vma->vm_flags & VM_MAYSHARE))) {
+		if (flags & FAULT_FLAG_USER)
+			mem_cgroup_enter_user_fault();
+		if (unlikely(__anon_vma_prepare(vma)))
+			ret = VM_FAULT_OOM;
+		if (flags & FAULT_FLAG_USER) {
+			mem_cgroup_exit_user_fault();
+			/*
+			 * The task may have entered a memcg OOM situation but
+			 * if the allocation error was handled gracefully (no
+			 * VM_FAULT_OOM), there is no need to kill anything.
+			 * Just clean up the OOM state peacefully.
+			 */
+			if (task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM))
+				mem_cgroup_oom_synchronize(false);
+		}
+	}
+	return ret;
+}
+
 /*
  * By the time we get here, we already hold the mm semaphore
  *
-- 
2.25.0.341.g760bfbb309-goog



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 14/24] mm/swap_state: disable swap vma readahead
  2020-02-24 20:30 [RFC PATCH 00/24] Fine grained MM locking Michel Lespinasse
                   ` (12 preceding siblings ...)
  2020-02-24 20:30 ` [RFC PATCH 13/24] mm/memory: add prepare_mm_fault() function Michel Lespinasse
@ 2020-02-24 20:30 ` Michel Lespinasse
  2020-02-24 20:30 ` [RFC PATCH 15/24] x86 fault handler: use a pseudo-vma when operating on anonymous vmas Michel Lespinasse
                   ` (10 subsequent siblings)
  24 siblings, 0 replies; 28+ messages in thread
From: Michel Lespinasse @ 2020-02-24 20:30 UTC (permalink / raw)
  To: Peter Zijlstra, Andrew Morton, Laurent Dufour, Vlastimil Babka,
	Matthew Wilcox, Liam R . Howlett, Jerome Glisse, Davidlohr Bueso,
	David Rientjes
  Cc: linux-mm, Michel Lespinasse

This change disables swap vma readahead. This is because swap_ra_info()
updates vma->swap_readahead_info, which is not feasible when operating
on pseudo-vmas.

This is a crude temporary solution. It may be possible to use a per-mm
swap_readahead_info instead, or if not, to explicitly fetch the vma of
record when updating the swap readahead statistics.

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 mm/swap_state.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git mm/swap_state.c mm/swap_state.c
index 8e7ce9a9bc5e..c9cdfd9c785e 100644
--- mm/swap_state.c
+++ mm/swap_state.c
@@ -298,6 +298,12 @@ void free_pages_and_swap_cache(struct page **pages, int nr)
 
 static inline bool swap_use_vma_readahead(void)
 {
+	/*
+	 * vma readahead overwrites vma->swap_readahead_info,
+	 * which requires some form of vma locking...
+	 */
+	return false;
+
 	return READ_ONCE(enable_vma_readahead) && !atomic_read(&nr_rotate_swap);
 }
 
-- 
2.25.0.341.g760bfbb309-goog



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 15/24] x86 fault handler: use a pseudo-vma when operating on anonymous vmas.
  2020-02-24 20:30 [RFC PATCH 00/24] Fine grained MM locking Michel Lespinasse
                   ` (13 preceding siblings ...)
  2020-02-24 20:30 ` [RFC PATCH 14/24] mm/swap_state: disable swap vma readahead Michel Lespinasse
@ 2020-02-24 20:30 ` Michel Lespinasse
  2020-02-24 20:30 ` [RFC PATCH 16/24] MM locking API: add vma locking API Michel Lespinasse
                   ` (9 subsequent siblings)
  24 siblings, 0 replies; 28+ messages in thread
From: Michel Lespinasse @ 2020-02-24 20:30 UTC (permalink / raw)
  To: Peter Zijlstra, Andrew Morton, Laurent Dufour, Vlastimil Babka,
	Matthew Wilcox, Liam R . Howlett, Jerome Glisse, Davidlohr Bueso,
	David Rientjes
  Cc: linux-mm, Michel Lespinasse

Update the fault handler to use a pseudo-vma when the original vma is
anonymous. This is in preparation to handling such faults with a fine
grained range lock in a later change.

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 arch/x86/mm/fault.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git arch/x86/mm/fault.c arch/x86/mm/fault.c
index 700da3cc3db9..52333272e14e 100644
--- arch/x86/mm/fault.c
+++ arch/x86/mm/fault.c
@@ -1279,7 +1279,7 @@ void do_user_addr_fault(struct pt_regs *regs,
 			unsigned long address)
 {
 	struct mm_lock_range *range;
-	struct vm_area_struct *vma;
+	struct vm_area_struct pvma, *vma;
 	struct task_struct *tsk;
 	struct mm_struct *mm;
 	vm_fault_t fault, major = 0;
@@ -1423,6 +1423,23 @@ void do_user_addr_fault(struct pt_regs *regs,
 		return;
 	}
 
+	if (vma_is_anonymous(vma)) {
+		/*
+		 * Allocate anon_vma if needed.
+		 * This needs to operate on the vma of record.
+		 */
+		fault = prepare_mm_fault(vma, flags);
+		if (fault)
+			goto got_fault;
+
+		/*
+		 * Copy vma attributes into a pseudo-vma.
+		 * This will be required when using fine grained locks.
+		 */
+		pvma = *vma;
+		vma = &pvma;
+	}
+
 	/*
 	 * If for any reason at all we couldn't handle the fault,
 	 * make sure we exit gracefully rather than endlessly redo
@@ -1437,6 +1454,7 @@ void do_user_addr_fault(struct pt_regs *regs,
 	 * FAULT_FLAG_USER|FAULT_FLAG_KILLABLE are both set in flags.
 	 */
 	fault = handle_mm_fault_range(vma, address, flags, range);
+got_fault:
 	major |= fault & VM_FAULT_MAJOR;
 
 	/*
-- 
2.25.0.341.g760bfbb309-goog



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 16/24] MM locking API: add vma locking API
  2020-02-24 20:30 [RFC PATCH 00/24] Fine grained MM locking Michel Lespinasse
                   ` (14 preceding siblings ...)
  2020-02-24 20:30 ` [RFC PATCH 15/24] x86 fault handler: use a pseudo-vma when operating on anonymous vmas Michel Lespinasse
@ 2020-02-24 20:30 ` Michel Lespinasse
  2020-02-24 20:30 ` [RFC PATCH 17/24] x86 fault handler: implement range locking Michel Lespinasse
                   ` (8 subsequent siblings)
  24 siblings, 0 replies; 28+ messages in thread
From: Michel Lespinasse @ 2020-02-24 20:30 UTC (permalink / raw)
  To: Peter Zijlstra, Andrew Morton, Laurent Dufour, Vlastimil Babka,
	Matthew Wilcox, Liam R . Howlett, Jerome Glisse, Davidlohr Bueso,
	David Rientjes
  Cc: linux-mm, Michel Lespinasse

This change adds the mm_vma_lock() and mm_vma_unlock() functions,
which are to be used to protect per-mm global structures (such as the
vma rbtree) when writers only hold a range lock.

The functions are no-ops when CONFIG_MM_LOCK_RANGE is not enabled,
as mmap_sem already protects such structures in that case.
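
Sketch of the expected usage (illustrative only): hold the lock just long
enough for an O(log N) lookup, and copy the result into a pseudo-vma before
dropping it:

	mm_vma_lock(mm);
	vma = find_vma(mm, address);	/* O(log N) rbtree walk */
	if (vma)
		pvma = *vma;		/* snapshot into a pseudo-vma */
	mm_vma_unlock(mm);
	/* past this point, only the pseudo-vma copy may be used */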

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 include/linux/mm_lock.h  | 24 ++++++++++++++++++++----
 include/linux/mm_types.h |  2 ++
 2 files changed, 22 insertions(+), 4 deletions(-)

diff --git include/linux/mm_lock.h include/linux/mm_lock.h
index a4d60bd56899..ebcc46bba211 100644
--- include/linux/mm_lock.h
+++ include/linux/mm_lock.h
@@ -14,6 +14,9 @@ static inline void mm_init_lock(struct mm_struct *mm)
        init_rwsem(&mm->mmap_sem);
 }
 
+static inline void mm_vma_lock(struct mm_struct *mm) {}
+static inline void mm_vma_unlock(struct mm_struct *mm) {}
+
 static inline void mm_init_coarse_lock_range(struct mm_lock_range *range) {}
 static inline void mm_init_lock_range(struct mm_lock_range *range,
 		unsigned long start, unsigned long end) {}
@@ -107,6 +110,9 @@ static inline void mm_init_lock(struct mm_struct *mm)
        init_rwsem(&mm->mmap_sem);
 }
 
+static inline void mm_vma_lock(struct mm_struct *mm) {}
+static inline void mm_vma_unlock(struct mm_struct *mm) {}
+
 static inline void mm_init_coarse_lock_range(struct mm_lock_range *range)
 {
 	range->mm = NULL;
@@ -131,10 +137,11 @@ static inline bool mm_range_is_coarse(struct mm_lock_range *range)
 #define __DEP_MAP_MM_LOCK_INITIALIZER(lockname)
 #endif
 
-#define MM_LOCK_INITIALIZER(name) {			\
-	.mutex = __MUTEX_INITIALIZER(name.mutex),	\
-	.rb_root = RB_ROOT,				\
-	__DEP_MAP_MM_LOCK_INITIALIZER(name)		\
+#define MM_LOCK_INITIALIZER(name) {				\
+	.mutex = __MUTEX_INITIALIZER(name.mutex),		\
+	.rb_root = RB_ROOT,					\
+	.vma_mutex = __MUTEX_INITIALIZER(name.vma_mutex),	\
+	__DEP_MAP_MM_LOCK_INITIALIZER(name)			\
 }
 
 #define MM_COARSE_LOCK_RANGE_INITIALIZER {		\
@@ -148,9 +155,18 @@ static inline void mm_init_lock(struct mm_struct *mm)
 
 	mutex_init(&mm->mmap_sem.mutex);
 	mm->mmap_sem.rb_root = RB_ROOT;
+	mutex_init(&mm->mmap_sem.vma_mutex);
 	lockdep_init_map(&mm->mmap_sem.dep_map, "&mm->mmap_sem", &__key, 0);
 }
 
+static inline void mm_vma_lock(struct mm_struct *mm) {
+	mutex_lock(&mm->mmap_sem.vma_mutex);
+}
+
+static inline void mm_vma_unlock(struct mm_struct *mm) {
+	mutex_unlock(&mm->mmap_sem.vma_mutex);
+}
+
 static inline void mm_init_lock_range(struct mm_lock_range *range,
 		unsigned long start, unsigned long end) {
 	range->start = start;
diff --git include/linux/mm_types.h include/linux/mm_types.h
index 941610c906b3..c40341d851cb 100644
--- include/linux/mm_types.h
+++ include/linux/mm_types.h
@@ -292,9 +292,11 @@ struct mm_lock {
 	struct mutex mutex;
 	struct rb_root rb_root;
 	unsigned long seq;
+	struct mutex vma_mutex;
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 	struct lockdep_map dep_map;
 #endif
+
 };
 #endif
 
-- 
2.25.0.341.g760bfbb309-goog



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 17/24] x86 fault handler: implement range locking
  2020-02-24 20:30 [RFC PATCH 00/24] Fine grained MM locking Michel Lespinasse
                   ` (15 preceding siblings ...)
  2020-02-24 20:30 ` [RFC PATCH 16/24] MM locking API: add vma locking API Michel Lespinasse
@ 2020-02-24 20:30 ` Michel Lespinasse
  2020-02-24 20:30 ` [RFC PATCH 18/24] shared file mappings: use the vmf->range field when dropping mmap_sem Michel Lespinasse
                   ` (7 subsequent siblings)
  24 siblings, 0 replies; 28+ messages in thread
From: Michel Lespinasse @ 2020-02-24 20:30 UTC (permalink / raw)
  To: Peter Zijlstra, Andrew Morton, Laurent Dufour, Vlastimil Babka,
	Matthew Wilcox, Liam R . Howlett, Jerome Glisse, Davidlohr Bueso,
	David Rientjes
  Cc: linux-mm, Michel Lespinasse

Change the x86 fault handler to implement range locking.

Initially we try to lock a pmd sized range around the faulting address,
which is appropriate for anon vmas. After finding the correct vma for
the faulting address, we verify that it is anonymous and fall back to
a coarse grained lock if necessary. If a fine grained lock is workable,
we copy the vma of record into a pseudo-vma and release the mm_vma_lock
before handling the fault.
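
Simplified sketch of the resulting flow (retries and error paths omitted;
see the diff for the exact code):

	mm_init_lock_range(&pmd_range, address & PMD_MASK,
			   (address & PMD_MASK) + PMD_SIZE);
	mm_read_range_lock(mm, &pmd_range);

	mm_vma_lock(mm);
	vma = find_vma(mm, address);
	fault = prepare_mm_fault(vma, flags);	/* anon_vma on the vma of record */
	pvma = *vma;				/* pseudo-vma copy of the attributes */
	mm_vma_unlock(mm);

	if (!fault && vma_is_anonymous(&pvma))
		fault = handle_mm_fault_range(&pvma, address, flags, &pmd_range);
	/* non-anonymous vmas fall back to mm_coarse_lock_range() and retry */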

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 arch/x86/mm/fault.c | 40 ++++++++++++++++++++++++++++++++--------
 1 file changed, 32 insertions(+), 8 deletions(-)

diff --git arch/x86/mm/fault.c arch/x86/mm/fault.c
index 52333272e14e..1e37284d373c 100644
--- arch/x86/mm/fault.c
+++ arch/x86/mm/fault.c
@@ -941,6 +941,7 @@ bad_area(struct pt_regs *regs, unsigned long error_code,
 	 unsigned long address, struct vm_area_struct *vma,
 	 struct mm_lock_range *range)
 {
+	struct mm_struct *mm;
 	u32 pkey = 0;
 	int si_code = SEGV_MAPERR;
 
@@ -984,7 +985,10 @@ bad_area(struct pt_regs *regs, unsigned long error_code,
 	 * Something tried to access memory that isn't in our memory map..
 	 * Fix it, but check if it's kernel or user first..
 	 */
-	mm_read_range_unlock(current->mm, range);
+	mm = current->mm;
+	if (!mm_range_is_coarse(range))
+		mm_vma_unlock(mm);
+	mm_read_range_unlock(mm, range);
 
 	__bad_area_nosemaphore(regs, error_code, address, pkey, si_code);
 }
@@ -1278,7 +1282,7 @@ void do_user_addr_fault(struct pt_regs *regs,
 			unsigned long hw_error_code,
 			unsigned long address)
 {
-	struct mm_lock_range *range;
+	struct mm_lock_range pmd_range, *range;
 	struct vm_area_struct pvma, *vma;
 	struct task_struct *tsk;
 	struct mm_struct *mm;
@@ -1363,7 +1367,10 @@ void do_user_addr_fault(struct pt_regs *regs,
 	}
 #endif
 
-	range = mm_coarse_lock_range();
+	mm_init_lock_range(&pmd_range,
+			   address & PMD_MASK,
+			   (address & PMD_MASK) + PMD_SIZE);
+	range = &pmd_range;
 
 	/*
 	 * Kernel-mode access to the user address space should only occur
@@ -1397,6 +1404,8 @@ void do_user_addr_fault(struct pt_regs *regs,
 		might_sleep();
 	}
 
+	if (!mm_range_is_coarse(range))
+		mm_vma_lock(mm);
 	vma = find_vma(mm, address);
 	if (unlikely(!vma)) {
 		bad_area(regs, hw_error_code, address, NULL, range);
@@ -1408,6 +1417,10 @@ void do_user_addr_fault(struct pt_regs *regs,
 		bad_area(regs, hw_error_code, address, NULL, range);
 		return;
 	}
+	/*
+	 * Note that if range is fine grained, we can still safely call
+	 * expand_stack() as we are protected by mm_vma_lock().
+	 */
 	if (unlikely(expand_stack(vma, address))) {
 		bad_area(regs, hw_error_code, address, NULL, range);
 		return;
@@ -1423,23 +1436,34 @@ void do_user_addr_fault(struct pt_regs *regs,
 		return;
 	}
 
-	if (vma_is_anonymous(vma)) {
+	if (!mm_range_is_coarse(range)) {
 		/*
 		 * Allocate anon_vma if needed.
 		 * This needs to operate on the vma of record.
 		 */
 		fault = prepare_mm_fault(vma, flags);
-		if (fault)
-			goto got_fault;
 
 		/*
 		 * Copy vma attributes into a pseudo-vma.
-		 * This will be required when using fine grained locks.
+		 * The vma of record is only valid until mm_vma_unlock().
 		 */
 		pvma = *vma;
 		vma = &pvma;
-	}
+		mm_vma_unlock(mm);
 
+		if (fault)
+			goto got_fault;
+
+		/*
+		 * Fall back to locking the entire MM
+		 * when operating on file vma.
+		 */
+		if (!vma_is_anonymous(vma)) {
+			mm_read_range_unlock(mm, range);
+			range = mm_coarse_lock_range();
+			goto retry;
+		}
+	}
 	/*
 	 * If for any reason at all we couldn't handle the fault,
 	 * make sure we exit gracefully rather than endlessly redo
-- 
2.25.0.341.g760bfbb309-goog



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 18/24] shared file mappings: use the vmf->range field when dropping mmap_sem
  2020-02-24 20:30 [RFC PATCH 00/24] Fine grained MM locking Michel Lespinasse
                   ` (16 preceding siblings ...)
  2020-02-24 20:30 ` [RFC PATCH 17/24] x86 fault handler: implement range locking Michel Lespinasse
@ 2020-02-24 20:30 ` Michel Lespinasse
  2020-02-24 20:30 ` [RFC PATCH 19/24] mm: add field to annotate vm_operations that support range locking Michel Lespinasse
                   ` (6 subsequent siblings)
  24 siblings, 0 replies; 28+ messages in thread
From: Michel Lespinasse @ 2020-02-24 20:30 UTC (permalink / raw)
  To: Peter Zijlstra, Andrew Morton, Laurent Dufour, Vlastimil Babka,
	Matthew Wilcox, Liam R . Howlett, Jerome Glisse, Davidlohr Bueso,
	David Rientjes
  Cc: linux-mm, Michel Lespinasse

Modify lock_page_maybe_drop_mmap() and maybe_unlock_mmap_for_io()
to use the vmf->range field when dropping mmap_sem.

This covers dropping mmap_sem during:
- filemap_fault()
- shmem_fault()
- do_fault() write to shared file mapping
  [ through do_shared_fault and fault_dirty_shared_page() ]
- do_wp_page() write to shared file mapping
  [ through wp_page_shared() and fault_dirty_shared_page() ]

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 mm/filemap.c  | 3 ++-
 mm/internal.h | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git mm/filemap.c mm/filemap.c
index 3afb5a3f0b9c..7827de7b356c 100644
--- mm/filemap.c
+++ mm/filemap.c
@@ -2364,7 +2364,8 @@ static int lock_page_maybe_drop_mmap(struct vm_fault *vmf, struct page *page,
 			 * mmap_sem here and return 0 if we don't have a fpin.
 			 */
 			if (*fpin == NULL)
-				mm_read_unlock(vmf->vma->vm_mm);
+				mm_read_range_unlock(vmf->vma->vm_mm,
+						     vmf->range);
 			return 0;
 		}
 	} else
diff --git mm/internal.h mm/internal.h
index 22f361a1e284..9bfff428c5da 100644
--- mm/internal.h
+++ mm/internal.h
@@ -382,7 +382,7 @@ static inline struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf,
 	if ((flags & (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT)) ==
 	    FAULT_FLAG_ALLOW_RETRY) {
 		fpin = get_file(vmf->vma->vm_file);
-		mm_read_unlock(vmf->vma->vm_mm);
+		mm_read_range_unlock(vmf->vma->vm_mm, vmf->range);
 	}
 	return fpin;
 }
-- 
2.25.0.341.g760bfbb309-goog



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 19/24] mm: add field to annotate vm_operations that support range locking
  2020-02-24 20:30 [RFC PATCH 00/24] Fine grained MM locking Michel Lespinasse
                   ` (17 preceding siblings ...)
  2020-02-24 20:30 ` [RFC PATCH 18/24] shared file mappings: use the vmf->range field when dropping mmap_sem Michel Lespinasse
@ 2020-02-24 20:30 ` Michel Lespinasse
  2020-02-24 20:30 ` [RFC PATCH 20/24] x86 fault handler: extend range locking to supported file vmas Michel Lespinasse
                   ` (5 subsequent siblings)
  24 siblings, 0 replies; 28+ messages in thread
From: Michel Lespinasse @ 2020-02-24 20:30 UTC (permalink / raw)
  To: Peter Zijlstra, Andrew Morton, Laurent Dufour, Vlastimil Babka,
	Matthew Wilcox, Liam R . Howlett, Jerome Glisse, Davidlohr Bueso,
	David Rientjes
  Cc: linux-mm, Michel Lespinasse

Add a fine_grained field to struct vm_operations_struct,
and set it in the filesystems we have converted to support range locking.
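
For illustration, a sketch of both sides of the new field (example_vm_ops is
hypothetical; the real filesystem hooks are in the diff below, and the fault
handler check is added in a later patch):

	static const struct vm_operations_struct example_vm_ops = {
		.fault		= filemap_fault,
		.map_pages	= filemap_map_pages,
		.fine_grained	= true,		/* opt in to fine grained MM locking */
	};

	/* consumer side, in the fault handler: */
	if (!vma_is_anonymous(vma) && !vma->vm_ops->fine_grained)
		/* fall back to a coarse MM lock */;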

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 fs/ext4/file.c     |  1 +
 include/linux/mm.h | 16 ++++++++++++++++
 mm/filemap.c       |  1 +
 mm/shmem.c         |  1 +
 4 files changed, 19 insertions(+)

diff --git fs/ext4/file.c fs/ext4/file.c
index 6a7293a5cda2..8167fc7cc6ca 100644
--- fs/ext4/file.c
+++ fs/ext4/file.c
@@ -626,6 +626,7 @@ static const struct vm_operations_struct ext4_file_vm_ops = {
 	.fault		= ext4_filemap_fault,
 	.map_pages	= filemap_map_pages,
 	.page_mkwrite   = ext4_page_mkwrite,
+	.fine_grained	= true,
 };
 
 static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
diff --git include/linux/mm.h include/linux/mm.h
index 43b7121ae005..28b6af200214 100644
--- include/linux/mm.h
+++ include/linux/mm.h
@@ -526,6 +526,22 @@ struct vm_operations_struct {
 	 */
 	struct page *(*find_special_page)(struct vm_area_struct *vma,
 					  unsigned long addr);
+
+	/*
+	 * fine_grained indicates that the vm_operations support
+	 * fine grained mm locking.
+	 * - The methods may be called with a fine grained range lock
+	 *   covering a PMD sized region around the fault address;
+	 * - The range lock does not protect against concurrent access
+	 *   to per-mm structures, so an appropriate lock must be used
+	 *   for such cases
+	 *   (such as mm_vma_lock() for accessing the vma rbtree);
+	 * - if dropping mmap_sem, the vmf->range must be used
+	 *   to release the specific locked range only;
+	 * - vmf->vma only holds a copy of the original vma.
+	 *   Any persistent vma updates must first look up the actual vma.
+	 */
+	bool fine_grained;
 };
 
 static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
diff --git mm/filemap.c mm/filemap.c
index 7827de7b356c..c9f95ca5737c 100644
--- mm/filemap.c
+++ mm/filemap.c
@@ -2699,6 +2699,7 @@ const struct vm_operations_struct generic_file_vm_ops = {
 	.fault		= filemap_fault,
 	.map_pages	= filemap_map_pages,
 	.page_mkwrite	= filemap_page_mkwrite,
+	.fine_grained	= true,
 };
 
 /* This is used for a general mmap of a disk file */
diff --git mm/shmem.c mm/shmem.c
index 8793e8cc1a48..32ec4ad05df5 100644
--- mm/shmem.c
+++ mm/shmem.c
@@ -3865,6 +3865,7 @@ static const struct vm_operations_struct shmem_vm_ops = {
 	.set_policy     = shmem_set_policy,
 	.get_policy     = shmem_get_policy,
 #endif
+	.fine_grained	= true,
 };
 
 int shmem_init_fs_context(struct fs_context *fc)
-- 
2.25.0.341.g760bfbb309-goog



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 20/24] x86 fault handler: extend range locking to supported file vmas
  2020-02-24 20:30 [RFC PATCH 00/24] Fine grained MM locking Michel Lespinasse
                   ` (18 preceding siblings ...)
  2020-02-24 20:30 ` [RFC PATCH 19/24] mm: add field to annotate vm_operations that support range locking Michel Lespinasse
@ 2020-02-24 20:30 ` Michel Lespinasse
  2020-02-24 20:30 ` [RFC PATCH 21/24] do_mmap: add locked argument Michel Lespinasse
                   ` (4 subsequent siblings)
  24 siblings, 0 replies; 28+ messages in thread
From: Michel Lespinasse @ 2020-02-24 20:30 UTC (permalink / raw)
  To: Peter Zijlstra, Andrew Morton, Laurent Dufour, Vlastimil Babka,
	Matthew Wilcox, Liam R . Howlett, Jerome Glisse, Davidlohr Bueso,
	David Rientjes
  Cc: linux-mm, Michel Lespinasse

Change the fault handler to use a fine grained range lock when operating
on any of the explicitly supported file types.

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 arch/x86/mm/fault.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git arch/x86/mm/fault.c arch/x86/mm/fault.c
index 1e37284d373c..ca30952896e1 100644
--- arch/x86/mm/fault.c
+++ arch/x86/mm/fault.c
@@ -1456,9 +1456,9 @@ void do_user_addr_fault(struct pt_regs *regs,
 
 		/*
 		 * Fall back to locking the entire MM
-		 * when operating on file vma.
+		 * when the vm_ops do not support fine grained range locking.
 		 */
-		if (!vma_is_anonymous(vma)) {
+		if (!vma_is_anonymous(vma) && !vma->vm_ops->fine_grained) {
 			mm_read_range_unlock(mm, range);
 			range = mm_coarse_lock_range();
 			goto retry;
-- 
2.25.0.341.g760bfbb309-goog



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 21/24] do_mmap: add locked argument
  2020-02-24 20:30 [RFC PATCH 00/24] Fine grained MM locking Michel Lespinasse
                   ` (19 preceding siblings ...)
  2020-02-24 20:30 ` [RFC PATCH 20/24] x86 fault handler: extend range locking to supported file vmas Michel Lespinasse
@ 2020-02-24 20:30 ` Michel Lespinasse
  2020-02-24 20:30 ` [RFC PATCH 22/24] do_mmap: implement " Michel Lespinasse
                   ` (3 subsequent siblings)
  24 siblings, 0 replies; 28+ messages in thread
From: Michel Lespinasse @ 2020-02-24 20:30 UTC (permalink / raw)
  To: Peter Zijlstra, Andrew Morton, Laurent Dufour, Vlastimil Babka,
	Matthew Wilcox, Liam R . Howlett, Jerome Glisse, Davidlohr Bueso,
	David Rientjes
  Cc: linux-mm, Michel Lespinasse

Change the do_mmap() prototype to add a "locked" boolean argument.
For now all call sites set it to true.

Also remove the do_mmap_pgoff() API, which was just wrapping do_mmap()
with a forced vm_flags == 0 argument.

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 arch/x86/mm/mpx.c  |  3 ++-
 fs/aio.c           |  6 +++---
 include/linux/mm.h | 13 ++-----------
 ipc/shm.c          |  3 ++-
 mm/mmap.c          |  8 ++++----
 mm/nommu.c         |  1 +
 mm/util.c          |  4 ++--
 7 files changed, 16 insertions(+), 22 deletions(-)

diff --git arch/x86/mm/mpx.c arch/x86/mm/mpx.c
index 3835c18020b8..f83cdf80f210 100644
--- arch/x86/mm/mpx.c
+++ arch/x86/mm/mpx.c
@@ -54,7 +54,8 @@ static unsigned long mpx_mmap(unsigned long len)
 
 	mm_write_lock(mm);
 	addr = do_mmap(NULL, 0, len, PROT_READ | PROT_WRITE,
-		       MAP_ANONYMOUS | MAP_PRIVATE, VM_MPX, 0, &populate, NULL);
+			MAP_ANONYMOUS | MAP_PRIVATE, VM_MPX, 0,
+			true, &populate, NULL);
 	mm_write_unlock(mm);
 	if (populate)
 		mm_populate(addr, populate);
diff --git fs/aio.c fs/aio.c
index 704766588df4..018bd24d6204 100644
--- fs/aio.c
+++ fs/aio.c
@@ -525,9 +525,9 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events)
 		return -EINTR;
 	}
 
-	ctx->mmap_base = do_mmap_pgoff(ctx->aio_ring_file, 0, ctx->mmap_size,
-				       PROT_READ | PROT_WRITE,
-				       MAP_SHARED, 0, &unused, NULL);
+	ctx->mmap_base = do_mmap(ctx->aio_ring_file, 0, ctx->mmap_size,
+				PROT_READ | PROT_WRITE,
+				MAP_SHARED, 0, 0, true, &unused, NULL);
 	mm_write_unlock(mm);
 	if (IS_ERR((void *)ctx->mmap_base)) {
 		ctx->mmap_size = 0;
diff --git include/linux/mm.h include/linux/mm.h
index 28b6af200214..8427e1d07b59 100644
--- include/linux/mm.h
+++ include/linux/mm.h
@@ -2361,22 +2361,13 @@ extern unsigned long mmap_region(struct file *file, unsigned long addr,
 	struct list_head *uf);
 extern unsigned long do_mmap(struct file *file, unsigned long addr,
 	unsigned long len, unsigned long prot, unsigned long flags,
-	vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
-	struct list_head *uf);
+	vm_flags_t vm_flags, unsigned long pgoff, bool locked,
+	unsigned long *populate, struct list_head *uf);
 extern int __do_munmap(struct mm_struct *, unsigned long, size_t,
 		       struct list_head *uf, bool downgrade);
 extern int do_munmap(struct mm_struct *, unsigned long, size_t,
 		     struct list_head *uf);
 
-static inline unsigned long
-do_mmap_pgoff(struct file *file, unsigned long addr,
-	unsigned long len, unsigned long prot, unsigned long flags,
-	unsigned long pgoff, unsigned long *populate,
-	struct list_head *uf)
-{
-	return do_mmap(file, addr, len, prot, flags, 0, pgoff, populate, uf);
-}
-
 #ifdef CONFIG_MMU
 extern int __mm_populate(unsigned long addr, unsigned long len,
 			 int ignore_errors);
diff --git ipc/shm.c ipc/shm.c
index c04fc21cbe46..90d24c4960b9 100644
--- ipc/shm.c
+++ ipc/shm.c
@@ -1558,7 +1558,8 @@ long do_shmat(int shmid, char __user *shmaddr, int shmflg,
 			goto invalid;
 	}
 
-	addr = do_mmap_pgoff(file, addr, size, prot, flags, 0, &populate, NULL);
+	addr = do_mmap(file, addr, size, prot, flags, 0, 0,
+			true, &populate, NULL);
 	*raddr = addr;
 	err = 0;
 	if (IS_ERR_VALUE(addr))
diff --git mm/mmap.c mm/mmap.c
index 0f95300c2788..2868e61927a1 100644
--- mm/mmap.c
+++ mm/mmap.c
@@ -1369,8 +1369,8 @@ static inline bool file_mmap_ok(struct file *file, struct inode *inode,
 unsigned long do_mmap(struct file *file, unsigned long addr,
 			unsigned long len, unsigned long prot,
 			unsigned long flags, vm_flags_t vm_flags,
-			unsigned long pgoff, unsigned long *populate,
-			struct list_head *uf)
+			unsigned long pgoff, bool locked,
+			unsigned long *populate, struct list_head *uf)
 {
 	struct mm_struct *mm = current->mm;
 	int pkey = 0;
@@ -2954,8 +2954,8 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
 	}
 
 	file = get_file(vma->vm_file);
-	ret = do_mmap_pgoff(vma->vm_file, start, size,
-			prot, flags, pgoff, &populate, NULL);
+	ret = do_mmap(vma->vm_file, start, size,
+			prot, flags, 0, pgoff, true, &populate, NULL);
 	fput(file);
 out:
 	mm_write_unlock(mm);
diff --git mm/nommu.c mm/nommu.c
index c137db1923bd..a2c2bf8d7676 100644
--- mm/nommu.c
+++ mm/nommu.c
@@ -1102,6 +1102,7 @@ unsigned long do_mmap(struct file *file,
 			unsigned long prot,
 			unsigned long flags,
 			vm_flags_t vm_flags,
 			unsigned long pgoff,
+			bool locked,
 			unsigned long *populate,
 			struct list_head *uf)
diff --git mm/util.c mm/util.c
index 511e442e7329..337b006aef6d 100644
--- mm/util.c
+++ mm/util.c
@@ -503,8 +503,8 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
 	if (!ret) {
 		if (mm_write_lock_killable(mm))
 			return -EINTR;
-		ret = do_mmap_pgoff(file, addr, len, prot, flag, pgoff,
-				    &populate, &uf);
+		ret = do_mmap(file, addr, len, prot, flag, 0, pgoff,
+				true, &populate, &uf);
 		mm_write_unlock(mm);
 		userfaultfd_unmap_complete(mm, &uf);
 		if (populate)
-- 
2.25.0.341.g760bfbb309-goog



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 22/24] do_mmap: implement locked argument
  2020-02-24 20:30 [RFC PATCH 00/24] Fine grained MM locking Michel Lespinasse
                   ` (20 preceding siblings ...)
  2020-02-24 20:30 ` [RFC PATCH 21/24] do_mmap: add locked argument Michel Lespinasse
@ 2020-02-24 20:30 ` Michel Lespinasse
  2020-02-24 20:30 ` [RFC PATCH 23/24] do_mmap: use locked=false in vm_mmap_pgoff() and aio_setup_ring() Michel Lespinasse
                   ` (2 subsequent siblings)
  24 siblings, 0 replies; 28+ messages in thread
From: Michel Lespinasse @ 2020-02-24 20:30 UTC (permalink / raw)
  To: Peter Zijlstra, Andrew Morton, Laurent Dufour, Vlastimil Babka,
	Matthew Wilcox, Liam R . Howlett, Jerome Glisse, Davidlohr Bueso,
	David Rientjes
  Cc: linux-mm, Michel Lespinasse

When locked is true, preserve the current behavior - the do_mmap()
caller is expected to already hold a coarse write lock on current->mm's
mmap_sem.

When locked is false, change do_mmap() to acquire the appropriate
MM locks itself. do_mmap() still acquires a coarse lock in this change,
but can now be locally changed to acquire a fine grained lock in the future.
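
The two calling conventions, sketched for illustration (arguments
abbreviated, not taken from a real call site):

	/* locked == true: caller already holds the mm write lock */
	mm_write_lock(mm);
	addr = do_mmap(file, addr, len, prot, flags, 0, pgoff,
		       true, &populate, uf);
	mm_write_unlock(mm);

	/* locked == false: do_mmap() acquires and releases the lock itself */
	addr = do_mmap(file, addr, len, prot, flags, 0, pgoff,
		       false, &populate, uf);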

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 mm/mmap.c  | 106 ++++++++++++++++++++++++++++++++++++-----------------
 mm/nommu.c |  19 +++++++++-
 2 files changed, 89 insertions(+), 36 deletions(-)

diff --git mm/mmap.c mm/mmap.c
index 2868e61927a1..75755f1cbd0b 100644
--- mm/mmap.c
+++ mm/mmap.c
@@ -1406,22 +1406,29 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 	if ((pgoff + (len >> PAGE_SHIFT)) < pgoff)
 		return -EOVERFLOW;
 
+	if (!locked && mm_write_lock_killable(mm))
+		return -EINTR;
+
 	/* Too many mappings? */
-	if (mm->map_count > sysctl_max_map_count)
-		return -ENOMEM;
+	if (mm->map_count > sysctl_max_map_count) {
+		addr = -ENOMEM;
+		goto unlock;
+	}
 
 	/* Obtain the address to map to. we verify (or select) it and ensure
 	 * that it represents a valid section of the address space.
 	 */
 	addr = get_unmapped_area(file, addr, len, pgoff, flags);
 	if (IS_ERR_VALUE(addr))
-		return addr;
+		goto unlock;
 
 	if (flags & MAP_FIXED_NOREPLACE) {
 		struct vm_area_struct *vma = find_vma(mm, addr);
 
-		if (vma && vma->vm_start < addr + len)
-			return -EEXIST;
+		if (vma && vma->vm_start < addr + len) {
+			addr = -EEXIST;
+			goto unlock;
+		}
 	}
 
 	if (prot == PROT_EXEC) {
@@ -1437,19 +1444,24 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 	vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
 			mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
 
-	if (flags & MAP_LOCKED)
-		if (!can_do_mlock())
-			return -EPERM;
+	if ((flags & MAP_LOCKED) && !can_do_mlock()) {
+		addr = -EPERM;
+		goto unlock;
+	}
 
-	if (mlock_future_check(mm, vm_flags, len))
-		return -EAGAIN;
+	if (mlock_future_check(mm, vm_flags, len)) {
+		addr = -EAGAIN;
+		goto unlock;
+	}
 
 	if (file) {
 		struct inode *inode = file_inode(file);
 		unsigned long flags_mask;
 
-		if (!file_mmap_ok(file, inode, pgoff, len))
-			return -EOVERFLOW;
+		if (!file_mmap_ok(file, inode, pgoff, len)) {
+			addr = -EOVERFLOW;
+			goto unlock;
+		}
 
 		flags_mask = LEGACY_MAP_MASK | file->f_op->mmap_supported_flags;
 
@@ -1465,27 +1477,37 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 			flags &= LEGACY_MAP_MASK;
 			/* fall through */
 		case MAP_SHARED_VALIDATE:
-			if (flags & ~flags_mask)
-				return -EOPNOTSUPP;
+			if (flags & ~flags_mask) {
+				addr = -EOPNOTSUPP;
+				goto unlock;
+			}
 			if (prot & PROT_WRITE) {
-				if (!(file->f_mode & FMODE_WRITE))
-					return -EACCES;
-				if (IS_SWAPFILE(file->f_mapping->host))
-					return -ETXTBSY;
+				if (!(file->f_mode & FMODE_WRITE)) {
+					addr = -EACCES;
+					goto unlock;
+				}
+				if (IS_SWAPFILE(file->f_mapping->host)) {
+					addr = -ETXTBSY;
+					goto unlock;
+				}
 			}
 
 			/*
 			 * Make sure we don't allow writing to an append-only
 			 * file..
 			 */
-			if (IS_APPEND(inode) && (file->f_mode & FMODE_WRITE))
-				return -EACCES;
+			if (IS_APPEND(inode) && (file->f_mode & FMODE_WRITE)) {
+				addr = -EACCES;
+				goto unlock;
+			}
 
 			/*
 			 * Make sure there are no mandatory locks on the file.
 			 */
-			if (locks_verify_locked(file))
-				return -EAGAIN;
+			if (locks_verify_locked(file)) {
+				addr = -EAGAIN;
+				goto unlock;
+			}
 
 			vm_flags |= VM_SHARED | VM_MAYSHARE;
 			if (!(file->f_mode & FMODE_WRITE))
@@ -1493,28 +1515,39 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 
 			/* fall through */
 		case MAP_PRIVATE:
-			if (!(file->f_mode & FMODE_READ))
-				return -EACCES;
+			if (!(file->f_mode & FMODE_READ)) {
+				addr = -EACCES;
+				goto unlock;
+			}
 			if (path_noexec(&file->f_path)) {
-				if (vm_flags & VM_EXEC)
-					return -EPERM;
+				if (vm_flags & VM_EXEC) {
+					addr = -EPERM;
+					goto unlock;
+				}
 				vm_flags &= ~VM_MAYEXEC;
 			}
 
-			if (!file->f_op->mmap)
-				return -ENODEV;
-			if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP))
-				return -EINVAL;
+			if (!file->f_op->mmap) {
+				addr = -ENODEV;
+				goto unlock;
+			}
+			if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP)) {
+				addr = -EINVAL;
+				goto unlock;
+			}
 			break;
 
 		default:
-			return -EINVAL;
+			addr = -EINVAL;
+			goto unlock;
 		}
 	} else {
 		switch (flags & MAP_TYPE) {
 		case MAP_SHARED:
-			if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP))
-				return -EINVAL;
+			if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP)) {
+				addr = -EINVAL;
+				goto unlock;
+			}
 			/*
 			 * Ignore pgoff.
 			 */
@@ -1528,7 +1561,8 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 			pgoff = addr >> PAGE_SHIFT;
 			break;
 		default:
-			return -EINVAL;
+			addr = -EINVAL;
+			goto unlock;
 		}
 	}
 
@@ -1551,6 +1585,10 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 	    ((vm_flags & VM_LOCKED) ||
 	     (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
 		*populate = len;
+
+unlock:
+	if (!locked)
+		mm_write_unlock(mm);
 	return addr;
 }
 
diff --git mm/nommu.c mm/nommu.c
index a2c2bf8d7676..7fb1db89d4f8 100644
--- mm/nommu.c
+++ mm/nommu.c
@@ -1107,6 +1107,7 @@ unsigned long do_mmap(struct file *file,
 			unsigned long *populate,
 			struct list_head *uf)
 {
+	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
 	struct vm_region *region;
 	struct rb_node *rb;
@@ -1115,12 +1116,18 @@ unsigned long do_mmap(struct file *file,
 
 	*populate = 0;
 
+	if (!locked && mm_write_lock_killable(mm))
+		return -EINTR;
+
 	/* decide whether we should attempt the mapping, and if so what sort of
 	 * mapping */
 	ret = validate_mmap_request(file, addr, len, prot, flags, pgoff,
 				    &capabilities);
-	if (ret < 0)
+	if (ret < 0) {
+		if (!locked)
+			mm_write_unlock(mm);
 		return ret;
+	}
 
 	/* we ignore the address hint */
 	addr = 0;
@@ -1135,7 +1142,7 @@ unsigned long do_mmap(struct file *file,
 	if (!region)
 		goto error_getting_region;
 
-	vma = vm_area_alloc(current->mm);
+	vma = vm_area_alloc(mm);
 	if (!vma)
 		goto error_getting_vma;
 
@@ -1289,6 +1296,8 @@ unsigned long do_mmap(struct file *file,
 	}
 
 	up_write(&nommu_region_sem);
+	if (!locked)
+		mm_write_unlock(mm);
 
 	return result;
 
@@ -1301,6 +1310,8 @@ unsigned long do_mmap(struct file *file,
 	if (vma->vm_file)
 		fput(vma->vm_file);
 	vm_area_free(vma);
+	if (!locked)
+		mm_write_unlock(mm);
 	return ret;
 
 sharing_violation:
@@ -1314,12 +1325,16 @@ unsigned long do_mmap(struct file *file,
 	pr_warn("Allocation of vma for %lu byte allocation from process %d failed\n",
 			len, current->pid);
 	show_free_areas(0, NULL);
+	if (!locked)
+		mm_write_unlock(mm);
 	return -ENOMEM;
 
 error_getting_region:
 	pr_warn("Allocation of vm region for %lu byte allocation from process %d failed\n",
 			len, current->pid);
 	show_free_areas(0, NULL);
+	if (!locked)
+		mm_write_unlock(mm);
 	return -ENOMEM;
 }
 
-- 
2.25.0.341.g760bfbb309-goog



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 23/24] do_mmap: use locked=false in vm_mmap_pgoff() and aio_setup_ring()
  2020-02-24 20:30 [RFC PATCH 00/24] Fine grained MM locking Michel Lespinasse
                   ` (21 preceding siblings ...)
  2020-02-24 20:30 ` [RFC PATCH 22/24] do_mmap: implement " Michel Lespinasse
@ 2020-02-24 20:30 ` Michel Lespinasse
  2020-02-24 20:30 ` [RFC PATCH 24/24] do_mmap: implement easiest cases of fine grained locking Michel Lespinasse
  2022-03-20 22:08 ` [RFC PATCH 00/24] Fine grained MM locking Barry Song
  24 siblings, 0 replies; 28+ messages in thread
From: Michel Lespinasse @ 2020-02-24 20:30 UTC (permalink / raw)
  To: Peter Zijlstra, Andrew Morton, Laurent Dufour, Vlastimil Babka,
	Matthew Wilcox, Liam R . Howlett, Jerome Glisse, Davidlohr Bueso,
	David Rientjes
  Cc: linux-mm, Michel Lespinasse

Change vm_mmap_pgoff() and aio_setup_ring() to call do_mmap()
with locked=false.

Moving the mmap_sem acquisition to within do_mmap()
enables it to acquire a fine grained lock in the future.

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 fs/aio.c  | 12 ++----------
 mm/util.c |  8 +++-----
 2 files changed, 5 insertions(+), 15 deletions(-)

diff --git fs/aio.c fs/aio.c
index 018bd24d6204..0092855326eb 100644
--- fs/aio.c
+++ fs/aio.c
@@ -460,7 +460,6 @@ static const struct address_space_operations aio_ctx_aops = {
 static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events)
 {
 	struct aio_ring *ring;
-	struct mm_struct *mm = current->mm;
 	unsigned long size, unused;
 	int nr_pages;
 	int i;
@@ -519,20 +518,13 @@ static int aio_setup_ring(struct kioctx *ctx, unsigned int nr_events)
 	ctx->mmap_size = nr_pages * PAGE_SIZE;
 	pr_debug("attempting mmap of %lu bytes\n", ctx->mmap_size);
 
-	if (mm_write_lock_killable(mm)) {
-		ctx->mmap_size = 0;
-		aio_free_ring(ctx);
-		return -EINTR;
-	}
-
 	ctx->mmap_base = do_mmap(ctx->aio_ring_file, 0, ctx->mmap_size,
 				PROT_READ | PROT_WRITE,
-				MAP_SHARED, 0, 0, true, &unused, NULL);
-	mm_write_unlock(mm);
+				MAP_SHARED, 0, 0, false, &unused, NULL);
 	if (IS_ERR((void *)ctx->mmap_base)) {
 		ctx->mmap_size = 0;
 		aio_free_ring(ctx);
-		return -ENOMEM;
+		return (ctx->mmap_base == -EINTR) ? -EINTR : -ENOMEM;
 	}
 
 	pr_debug("mmap address: 0x%08lx\n", ctx->mmap_base);
diff --git mm/util.c mm/util.c
index 337b006aef6d..916bc7ac9bf2 100644
--- mm/util.c
+++ mm/util.c
@@ -501,12 +501,10 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
 
 	ret = security_mmap_file(file, prot, flag);
 	if (!ret) {
-		if (mm_write_lock_killable(mm))
-			return -EINTR;
 		ret = do_mmap(file, addr, len, prot, flag, 0, pgoff,
-				true, &populate, &uf);
-		mm_write_unlock(mm);
-		userfaultfd_unmap_complete(mm, &uf);
+				false, &populate, &uf);
+		if (ret != -EINTR)
+			userfaultfd_unmap_complete(mm, &uf);
 		if (populate)
 			mm_populate(ret, populate);
 	}
-- 
2.25.0.341.g760bfbb309-goog



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 24/24] do_mmap: implement easiest cases of fine grained locking
  2020-02-24 20:30 [RFC PATCH 00/24] Fine grained MM locking Michel Lespinasse
                   ` (22 preceding siblings ...)
  2020-02-24 20:30 ` [RFC PATCH 23/24] do_mmap: use locked=false in vm_mmap_pgoff() and aio_setup_ring() Michel Lespinasse
@ 2020-02-24 20:30 ` Michel Lespinasse
  2022-03-20 22:08 ` [RFC PATCH 00/24] Fine grained MM locking Barry Song
  24 siblings, 0 replies; 28+ messages in thread
From: Michel Lespinasse @ 2020-02-24 20:30 UTC (permalink / raw)
  To: Peter Zijlstra, Andrew Morton, Laurent Dufour, Vlastimil Babka,
	Matthew Wilcox, Liam R . Howlett, Jerome Glisse, Davidlohr Bueso,
	David Rientjes
  Cc: linux-mm, Michel Lespinasse

Use a range lock in the easiest possible mmap case:
- the mmap address is known;
- there are no existing vmas within the mmap range;
- there is no file being mapped.

When these conditions are met, we can trivially support a fine grained
range lock by just holding the mm_vma_lock across the entire mmap
operation. This is safe because the mmap only registers the new
mapping using O(log N) operations, and does not have to call back into
arbitrary code (such as file mmap handlers) or iterate over existing
vmas and mapped pages.
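
Schematically (a sketch of the locking decision, simplified from the code
below):

	if (addr && !file) {
		/* known address, anonymous mapping: lock only [addr, addr + len) */
		mm_init_lock_range(&mmap_range, addr, addr + len);
		range = &mmap_range;
	} else
		range = mm_coarse_lock_range();

	if (mm_write_range_lock_killable(mm, range))
		return -EINTR;
	if (!mm_range_is_coarse(range))
		mm_vma_lock(mm);	/* held across the whole vma registration */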

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 mm/mmap.c | 36 +++++++++++++++++++++++++++++-------
 1 file changed, 29 insertions(+), 7 deletions(-)

diff --git mm/mmap.c mm/mmap.c
index 75755f1cbd0b..5fa23f300e72 100644
--- mm/mmap.c
+++ mm/mmap.c
@@ -1372,6 +1372,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 			unsigned long pgoff, bool locked,
 			unsigned long *populate, struct list_head *uf)
 {
+	struct mm_lock_range mmap_range, *range = NULL;
 	struct mm_struct *mm = current->mm;
 	int pkey = 0;
 
@@ -1406,8 +1407,18 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 	if ((pgoff + (len >> PAGE_SHIFT)) < pgoff)
 		return -EOVERFLOW;
 
-	if (!locked && mm_write_lock_killable(mm))
-		return -EINTR;
+	if (!locked) {
+		if (addr && !file) {
+			mm_init_lock_range(&mmap_range, addr, addr + len);
+			range = &mmap_range;
+		} else
+			range = mm_coarse_lock_range();
+	retry:
+		if (mm_write_range_lock_killable(mm, range))
+			return -EINTR;
+		if (!mm_range_is_coarse(range))
+			mm_vma_lock(mm);
+	}
 
 	/* Too many mappings? */
 	if (mm->map_count > sysctl_max_map_count) {
@@ -1422,12 +1433,20 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 	if (IS_ERR_VALUE(addr))
 		goto unlock;
 
-	if (flags & MAP_FIXED_NOREPLACE) {
+	if ((flags & MAP_FIXED_NOREPLACE) ||
+	    (!locked && !mm_range_is_coarse(range))) {
 		struct vm_area_struct *vma = find_vma(mm, addr);
 
 		if (vma && vma->vm_start < addr + len) {
-			addr = -EEXIST;
-			goto unlock;
+			if (flags & MAP_FIXED_NOREPLACE) {
+				addr = -EEXIST;
+				goto unlock;
+			} else {
+				mm_vma_unlock(mm);
+				mm_write_range_unlock(mm, range);
+				range = mm_coarse_lock_range();
+				goto retry;
+			}
 		}
 	}
 
@@ -1587,8 +1606,11 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 		*populate = len;
 
 unlock:
-	if (!locked)
-		mm_write_unlock(mm);
+	if (!locked) {
+		if (!mm_range_is_coarse(range))
+			mm_vma_unlock(mm);
+		mm_write_range_unlock(mm, range);
+	}
 	return addr;
 }
 
-- 
2.25.0.341.g760bfbb309-goog



^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 00/24] Fine grained MM locking
  2020-02-24 20:30 [RFC PATCH 00/24] Fine grained MM locking Michel Lespinasse
                   ` (23 preceding siblings ...)
  2020-02-24 20:30 ` [RFC PATCH 24/24] do_mmap: implement easiest cases of fine grained locking Michel Lespinasse
@ 2022-03-20 22:08 ` Barry Song
  2022-03-20 23:14   ` Matthew Wilcox
  24 siblings, 1 reply; 28+ messages in thread
From: Barry Song @ 2022-03-20 22:08 UTC (permalink / raw)
  To: walken
  Cc: Liam.Howlett, akpm, dave, jglisse, ldufour, linux-mm, peterz,
	rientjes, vbabka, willy

> Hi,
> 
> This is the first version of my work towards fine grained MM locking.
> This is still early work - I am happy with my page fault changes,
> but want to expand on the mmap/munmap side of things before I send the
> next version. I have previously shared this with some of the copied folks
> (for those who received that, there are no additional changes in this
> public resend). Please expect a v2 within a few weeks, with further
> changes for fine grained range locking in the mmap and munmap paths.

hello, Michel. I noticed rwsem has been renamed to mmap_lock and
some apis were created for taking the lock.
but is the original fine grained mm locking series still under
development? maybe i missed something but i failed to find v2
for it.

Thanks
Barry


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 00/24] Fine grained MM locking
  2022-03-20 22:08 ` [RFC PATCH 00/24] Fine grained MM locking Barry Song
@ 2022-03-20 23:14   ` Matthew Wilcox
  2022-03-21  0:20     ` Barry Song
  0 siblings, 1 reply; 28+ messages in thread
From: Matthew Wilcox @ 2022-03-20 23:14 UTC (permalink / raw)
  To: Barry Song
  Cc: walken, Liam.Howlett, akpm, dave, jglisse, ldufour, linux-mm,
	peterz, rientjes, vbabka

On Mon, Mar 21, 2022 at 11:08:48AM +1300, Barry Song wrote:
> hello, Michel. I noticed rwsem has been renamed to mmap_lock and
> some apis were created for taking the lock.
> but is the original fine grained mm locking series still under
> development? maybe i missed something but i failed to find v2
> for it.

Most recently posted as 20220128131006.67712-1-michel@lespinasse.org


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 00/24] Fine grained MM locking
  2022-03-20 23:14   ` Matthew Wilcox
@ 2022-03-21  0:20     ` Barry Song
  0 siblings, 0 replies; 28+ messages in thread
From: Barry Song @ 2022-03-21  0:20 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: walken, Liam.Howlett, Andrew Morton, dave, jglisse, ldufour,
	Linux-MM, Peter Zijlstra, rientjes, Vlastimil Babka

On Mon, Mar 21, 2022 at 12:14 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Mar 21, 2022 at 11:08:48AM +1300, Barry Song wrote:
> > Hello, Michel. I noticed the mmap_sem rwsem has been renamed to
> > mmap_lock and some APIs were created for taking the lock.
> > But is the original fine-grained MM locking series still under
> > development? Maybe I missed something, but I failed to find a v2
> > for it.
>
> Most recently posted as 20220128131006.67712-1-michel@lespinasse.org

Thanks!



Thread overview: 28+ messages
2020-02-24 20:30 [RFC PATCH 00/24] Fine grained MM locking Michel Lespinasse
2020-02-24 20:30 ` [RFC PATCH 01/24] MM locking API: initial implementation as rwsem wrappers Michel Lespinasse
2020-02-24 20:30 ` [RFC PATCH 02/24] MM locking API: use coccinelle to convert mmap_sem rwsem call sites Michel Lespinasse
2020-02-24 20:30 ` [RFC PATCH 03/24] MM locking API: manual conversion of mmap_sem call sites missed by coccinelle Michel Lespinasse
2020-02-24 20:30 ` [RFC PATCH 04/24] MM locking API: add range arguments Michel Lespinasse
2020-02-24 20:30 ` [RFC PATCH 05/24] MM locking API: allow for sleeping during unlock Michel Lespinasse
2020-02-24 20:30 ` [RFC PATCH 06/24] MM locking API: implement fine grained range locks Michel Lespinasse
2020-02-24 20:30 ` [RFC PATCH 07/24] mm/memory: add range field to struct vm_fault Michel Lespinasse
2020-02-24 20:30 ` [RFC PATCH 08/24] mm/memory: allow specifying MM lock range to handle_mm_fault() Michel Lespinasse
2020-02-24 20:30 ` [RFC PATCH 09/24] do_swap_page: use the vmf->range field when dropping mmap_sem Michel Lespinasse
2020-02-24 20:30 ` [RFC PATCH 10/24] handle_userfault: " Michel Lespinasse
2020-02-24 20:30 ` [RFC PATCH 11/24] x86 fault handler: merge bad_area() functions Michel Lespinasse
2020-02-24 20:30 ` [RFC PATCH 12/24] x86 fault handler: use an explicit MM lock range Michel Lespinasse
2020-02-24 20:30 ` [RFC PATCH 13/24] mm/memory: add prepare_mm_fault() function Michel Lespinasse
2020-02-24 20:30 ` [RFC PATCH 14/24] mm/swap_state: disable swap vma readahead Michel Lespinasse
2020-02-24 20:30 ` [RFC PATCH 15/24] x86 fault handler: use a pseudo-vma when operating on anonymous vmas Michel Lespinasse
2020-02-24 20:30 ` [RFC PATCH 16/24] MM locking API: add vma locking API Michel Lespinasse
2020-02-24 20:30 ` [RFC PATCH 17/24] x86 fault handler: implement range locking Michel Lespinasse
2020-02-24 20:30 ` [RFC PATCH 18/24] shared file mappings: use the vmf->range field when dropping mmap_sem Michel Lespinasse
2020-02-24 20:30 ` [RFC PATCH 19/24] mm: add field to annotate vm_operations that support range locking Michel Lespinasse
2020-02-24 20:30 ` [RFC PATCH 20/24] x86 fault handler: extend range locking to supported file vmas Michel Lespinasse
2020-02-24 20:30 ` [RFC PATCH 21/24] do_mmap: add locked argument Michel Lespinasse
2020-02-24 20:30 ` [RFC PATCH 22/24] do_mmap: implement " Michel Lespinasse
2020-02-24 20:30 ` [RFC PATCH 23/24] do_mmap: use locked=false in vm_mmap_pgoff() and aio_setup_ring() Michel Lespinasse
2020-02-24 20:30 ` [RFC PATCH 24/24] do_mmap: implement easiest cases of fine grained locking Michel Lespinasse
2022-03-20 22:08 ` [RFC PATCH 00/24] Fine grained MM locking Barry Song
2022-03-20 23:14   ` Matthew Wilcox
2022-03-21  0:20     ` Barry Song
