[00/94] Introducing the Maple Tree
mbox series

Message ID 20210428153542.2814175-1-Liam.Howlett@Oracle.com
Headers show
Series
  • Introducing the Maple Tree
Related show

Message

Liam Howlett April 28, 2021, 3:35 p.m. UTC
The maple tree is an RCU-safe range based B-tree designed to use modern
processor cache efficiently.  There are a number of places in the kernel
that a non-overlapping range-based tree would be beneficial, especially
one with a simple interface.  The first user that is covered in this
patch set is the vm_area_struct, where three data structures are
replaced by the maple tree: the augmented rbtree, the vma cache, and the
linked list of VMAs in the mm_struct.  The long term goal is to reduce
or remove the mmap_sem contention.

The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
nodes.  With the increased branching factor, it is significantly shorter than
the rbtree so it has fewer cache misses.  The removal of the linked list
between subsequent entries also reduces the cache misses and the need to pull
in the previous and next VMA during many tree alterations.

This patch is based on next-20210419

Link:
https://github.com/oracle/linux-uek/releases/tag/howlett%2Fmaple%2F20210419

Performance on a 144 core x86:

While still using the mmap_sem, the performance seems fairly similar on
real-world workloads, while there are variations in micro-benchmarks.

Increase in performance in the following micro-benchmarks in Hmean:
- wis brk1-threads: Disregard, this is useless.
- wis malloc1-threads: Increase of 15% to 500%
- wis page_fault1-threads: Increase of up to 15%

Decrease in performance in the following micro-benchmarks in Hmean:
- wis brk1-processes: Decrease of 45% due to RCU required allocations
- wis signal1-threads: -3% to -20%

Mixed:
- wis malloc1-processes: +9% to -20%
- wis pthread_mutex1-threads: +9% to -16%
- wis page_fault3-threads: +7% to -21%


kernbench:
Amean     user-2        880.66 (   0.00%)      885.49 *  -0.55%*
Amean     syst-2        143.88 (   0.00%)      152.64 *  -6.09%*
Amean     elsp-2        518.11 (   0.00%)      524.97 *  -1.32%*
Amean     user-4        903.74 (   0.00%)      908.41 *  -0.52%*
Amean     syst-4        149.61 (   0.00%)      158.42 *  -5.89%*
Amean     elsp-4        270.00 (   0.00%)      275.83 *  -2.16%*
Amean     user-8        955.25 (   0.00%)      961.88 *  -0.69%*
Amean     syst-8        158.76 (   0.00%)      169.04 *  -6.48%*
Amean     elsp-8        146.95 (   0.00%)      148.79 *  -1.25%*
Amean     user-16      1033.15 (   0.00%)     1037.42 *  -0.41%*
Amean     syst-16       168.01 (   0.00%)      177.40 *  -5.59%*
Amean     elsp-16        84.63 (   0.00%)       84.91 *  -0.33%*
Amean     user-32      1135.48 (   0.00%)     1130.56 *   0.43%*
Amean     syst-32       181.35 (   0.00%)      190.54 *  -5.06%*
Amean     elsp-32        50.53 (   0.00%)       51.88 *  -2.67%*
Amean     user-64      1191.24 (   0.00%)     1205.41 *  -1.19%*
Amean     syst-64       189.13 (   0.00%)      200.28 *  -5.89%*
Amean     elsp-64        31.75 (   0.00%)       31.95 *  -0.63%*
Amean     user-128     1608.94 (   0.00%)     1613.01 *  -0.25%*
Amean     syst-128      239.32 (   0.00%)      253.50 *  -5.93%*
Amean     elsp-128       25.34 (   0.00%)       26.02 *  -2.69%*

gitcheckout:
Amean     User           0.00 (   0.00%)        0.00 *   0.00%*
Amean     System         8.33 (   0.00%)        7.86 *   5.59%*
Amean     Elapsed       21.85 (   0.00%)       21.34 *   2.29%*
Amean     CPU           98.80 (   0.00%)       99.00 *  -0.20%*


Patch organization:
Patches 1-20 introduce a new VMA API: lookup_vma().  This looks up the VMA at a
given address and returns either NULL or the VMA.  In looking at the VMA code,
find_vma() is often used to look up a vma and then check the limits (or
incorrectly assume that find_vma() does not continue searching upwards).
Initially, this is just a wrapper for find_vma_intersection() but becomes more
efficient once the maple tree is used.

Patches 21-25 update the radix tree test suite to support what is necessary for
the maple tree.

Patch 26 is the maple tree and test code: 36,152 lines are the
test_maple_tree.c file alone.

Patches 27-33 are the removal of the rbtree from the mm_struct.

Patches 34-44 are the removal of the vmacache from the kernel.

Patches 45-51 are optimizations for using the maple tree directly.

Patches 52-91 are the removal of the linked list from the mm_struct.

Patches 92-94 are to do with cleaning up the locking around the splitting of
VMAs using the maple tree state.


Liam R. Howlett (94):
  mm: Add vma_lookup()
  drm/i915/selftests: Use vma_lookup() in __igt_mmap()
  arch/arc/kernel/troubleshoot: use vma_lookup() instead of
    find_vma_intersection()
  arch/arm64/kvm: Use vma_lookup() instead of find_vma_intersection()
  arch/powerpc/kvm/book3s_hv_uvmem: Use vma_lookup() instead of
    find_vma_intersection()
  arch/powerpc/kvm/book3s: Use vma_lookup() in
    kvmppc_hv_setup_htab_rma()
  arch/mips/kernel/traps: Use vma_lookup() instead of
    find_vma_intersection()
  x86/sgx: Use vma_lookup() in sgx_encl_find()
  virt/kvm: Use vma_lookup() instead of find_vma_intersection()
  vfio: Use vma_lookup() instead of find_vma_intersection()
  net/ipv5/tcp: Use vma_lookup() in tcp_zerocopy_receive()
  drm/amdgpu: Use vma_lookup() in amdgpu_ttm_tt_get_user_pages()
  media: videobuf2: Use vma_lookup() in get_vaddr_frames()
  misc/sgi-gru/grufault: Use vma_lookup() in gru_find_vma()
  kernel/events/uprobes: Use vma_lookup() in find_active_uprobe()
  lib/test_hmm: Use vma_lookup() in dmirror_migrate()
  mm/ksm: Use vma_lookup() in find_mergeable_vma()
  mm/migrate: Use vma_lookup() in do_pages_stat_array()
  mm/mremap: Use vma_lookup() in vma_to_resize()
  mm/memory.c: Use vma_lookup() instead of find_vma_intersection()
  radix tree test suite: Enhancements for Maple Tree
  radix tree test suite: Add support for fallthrough attribute
  radix tree test suite: Add support for kmem_cache_free_bulk
  radix tree test suite: Add keme_cache_alloc_bulk() support
  radix tree test suite: Add __must_be_array() support
  Maple Tree: Add new data structure
  mm: Start tracking VMAs with maple tree
  mm/mmap: Introduce unlock_range() for code cleanup
  mm/mmap: Change find_vma() to use the maple tree
  mm/mmap: Change find_vma_prev() to use maple tree
  mm/mmap: Change unmapped_area and unmapped_area_topdown to use maple
    tree
  kernel/fork: Convert dup_mmap to use maple tree
  mm: Remove rb tree.
  arch/m68k/kernel/sys_m68k: Use vma_lookup() in sys_cacheflush()
  xen/privcmd: Optimized privcmd_ioctl_mmap() by using vma_lookup()
  mm: Optimize find_exact_vma() to use vma_lookup()
  mm/khugepaged: Optimize collapse_pte_mapped_thp() by using
    vma_lookup()
  mm/gup: Add mm_populate_vma() for use when the vma is known
  mm/mmap: Change do_brk_flags() to expand existing VMA and add
    do_brk_munmap()
  mm/mmap: Change vm_brk_flags() to use mm_populate_vma()
  mm: Change find_vma_intersection() to maple tree and make find_vma()
    to inline.
  mm/mmap: Change mmap_region() to use maple tree state
  mm/mmap: Drop munmap_vma_range()
  mm: Remove vmacache
  mm/mmap: Change __do_munmap() to avoid unnecessary lookups.
  mm/mmap: Move mmap_region() below do_munmap()
  mm/mmap: Add do_mas_munmap() and wraper for __do_munmap()
  mmap: Use find_vma_intersection in do_mmap() for overlap
  mmap: Remove __do_munmap() in favour of do_mas_munmap()
  mm/mmap: Change do_brk_munmap() to use do_mas_align_munmap()
  mmap: make remove_vma_list() inline
  mm: Introduce vma_next() and vma_prev()
  arch/arm64: Remove mmap linked list from vdso.
  arch/parisc: Remove mmap linked list from kernel/cache
  arch/powerpc: Remove mmap linked list from mm/book3s32/tlb
  arch/powerpc: Remove mmap linked list from mm/book3s64/subpage_prot
  arch/s390: Use maple tree iterators instead of linked list.
  arch/x86: Use maple tree iterators for vdso/vma
  arch/xtensa: Use maple tree iterators for unmapped area
  drivers/misc/cxl: Use maple tree iterators for cxl_prefault_vma()
  drivers/tee/optee: Use maple tree iterators for __check_mem_type()
  fs/binfmt_elf: Use maple tree iterators for fill_files_note()
  fs/coredump: Use maple tree iterators in place of linked list
  fs/exec: Use vma_next() instead of linked list
  fs/proc/base: Use maple tree iterators in place of linked list
  fs/proc/task_mmu: Stop using linked list and highest_vm_end
  fs/userfaultfd: Stop using vma linked list.
  ipc/shm: Stop using the vma linked list
  kernel/acct: Use maple tree iterators instead of linked list
  kernel/events/core: Use maple tree iterators instead of linked list
  kernel/events/uprobes: Use maple tree iterators instead of linked list
  kernel/sched/fair: Use maple tree iterators instead of linked list
  kernel/sys: Use maple tree iterators instead of linked list
  mm/gup: Use maple tree navigation instead of linked list
  mm/huge_memory: Use vma_next() instead of vma linked list
  mm/khugepaged: Use maple tree iterators instead of vma linked list
  mm/ksm: Use maple tree iterators instead of vma linked list
  mm/madvise: Use vma_next instead of vma linked list
  mm/memcontrol: Stop using mm->highest_vm_end
  mm/mempolicy: Use maple tree iterators instead of vma linked list
  mm/mlock: Use maple tree iterators instead of vma linked list
  mm/mprotect: Use maple tree navigation instead of vma linked list
  mm/mremap: Use vma_next() instead of vma linked list
  mm/msync: Use vma_next() instead of vma linked list
  mm/oom_kill: Use maple tree iterators instead of vma linked list
  mm/pagewalk: Use vma_next() instead of vma linked list
  mm/swapfile: Use maple tree iterator instead of vma linked list
  mm/util: Remove __vma_link_list() and __vma_unlink_list()
  arch/um/kernel/tlb: Stop using linked list
  bpf: Remove VMA linked list
  mm: Remove vma linked list.
  mm: Return a bool from anon_vma_interval_tree_verify()
  mm/mmap: Add mas_split_vma() and use it for munmap()
  mm: Move mas locking outside of munmap() path.

 Documentation/core-api/index.rst              |     1 +
 Documentation/core-api/maple-tree.rst         |    36 +
 MAINTAINERS                                   |    12 +
 arch/arc/kernel/troubleshoot.c                |     8 +-
 arch/arm64/kernel/vdso.c                      |     5 +-
 arch/arm64/kvm/mmu.c                          |     2 +-
 arch/m68k/kernel/sys_m68k.c                   |     4 +-
 arch/mips/kernel/traps.c                      |     4 +-
 arch/parisc/kernel/cache.c                    |    15 +-
 arch/powerpc/kvm/book3s_hv.c                  |     4 +-
 arch/powerpc/kvm/book3s_hv_uvmem.c            |     2 +-
 arch/powerpc/mm/book3s32/tlb.c                |     5 +-
 arch/powerpc/mm/book3s64/subpage_prot.c       |    15 +-
 arch/s390/configs/debug_defconfig             |     1 -
 arch/s390/mm/gmap.c                           |     8 +-
 arch/um/kernel/tlb.c                          |    16 +-
 arch/x86/entry/vdso/vma.c                     |    12 +-
 arch/x86/kernel/cpu/sgx/encl.h                |     4 +-
 arch/x86/kernel/tboot.c                       |     2 +-
 arch/xtensa/kernel/syscall.c                  |     4 +-
 drivers/firmware/efi/efi.c                    |     2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |     4 +-
 .../drm/i915/gem/selftests/i915_gem_mman.c    |     2 +-
 drivers/media/common/videobuf2/frame_vector.c |     2 +-
 drivers/misc/cxl/fault.c                      |     6 +-
 drivers/misc/sgi-gru/grufault.c               |     4 +-
 drivers/tee/optee/call.c                      |    15 +-
 drivers/vfio/vfio_iommu_type1.c               |     2 +-
 drivers/xen/privcmd.c                         |     2 +-
 fs/binfmt_elf.c                               |     5 +-
 fs/coredump.c                                 |    13 +-
 fs/exec.c                                     |     9 +-
 fs/proc/base.c                                |     7 +-
 fs/proc/task_mmu.c                            |    45 +-
 fs/proc/task_nommu.c                          |    55 +-
 fs/userfaultfd.c                              |    43 +-
 include/linux/maple_tree.h                    |   449 +
 include/linux/mm.h                            |    62 +-
 include/linux/mm_types.h                      |    33 +-
 include/linux/mm_types_task.h                 |     5 -
 include/linux/sched.h                         |     1 -
 include/linux/sched/mm.h                      |     3 +
 include/linux/vm_event_item.h                 |     4 -
 include/linux/vmacache.h                      |    28 -
 include/linux/vmstat.h                        |     6 -
 include/trace/events/maple_tree.h             |   227 +
 include/trace/events/mmap.h                   |    71 +
 init/main.c                                   |     2 +
 ipc/shm.c                                     |    13 +-
 kernel/acct.c                                 |     8 +-
 kernel/bpf/task_iter.c                        |     6 +-
 kernel/debug/debug_core.c                     |    12 -
 kernel/events/core.c                          |     7 +-
 kernel/events/uprobes.c                       |    29 +-
 kernel/fork.c                                 |    50 +-
 kernel/sched/fair.c                           |    14 +-
 kernel/sys.c                                  |     6 +-
 lib/Kconfig.debug                             |    15 +-
 lib/Makefile                                  |     3 +-
 lib/maple_tree.c                              |  6393 +++
 lib/test_hmm.c                                |     5 +-
 lib/test_maple_tree.c                         | 36152 ++++++++++++++++
 mm/Makefile                                   |     2 +-
 mm/debug.c                                    |    12 +-
 mm/gup.c                                      |    27 +-
 mm/huge_memory.c                              |     4 +-
 mm/init-mm.c                                  |     4 +-
 mm/internal.h                                 |    51 +-
 mm/interval_tree.c                            |     6 +-
 mm/khugepaged.c                               |    13 +-
 mm/ksm.c                                      |    32 +-
 mm/madvise.c                                  |     2 +-
 mm/memcontrol.c                               |     6 +-
 mm/memory.c                                   |    45 +-
 mm/mempolicy.c                                |    44 +-
 mm/migrate.c                                  |     4 +-
 mm/mlock.c                                    |    26 +-
 mm/mmap.c                                     |  2271 +-
 mm/mprotect.c                                 |     7 +-
 mm/mremap.c                                   |    17 +-
 mm/msync.c                                    |     2 +-
 mm/nommu.c                                    |   135 +-
 mm/oom_kill.c                                 |     5 +-
 mm/pagewalk.c                                 |     2 +-
 mm/swapfile.c                                 |     9 +-
 mm/util.c                                     |    32 -
 mm/vmacache.c                                 |   117 -
 mm/vmstat.c                                   |     4 -
 net/ipv4/tcp.c                                |     4 +-
 tools/testing/radix-tree/.gitignore           |     2 +
 tools/testing/radix-tree/Makefile             |    13 +-
 tools/testing/radix-tree/generated/autoconf.h |     1 +
 tools/testing/radix-tree/linux.c              |    78 +-
 tools/testing/radix-tree/linux/kernel.h       |    10 +
 tools/testing/radix-tree/linux/maple_tree.h   |     7 +
 tools/testing/radix-tree/linux/slab.h         |     2 +
 tools/testing/radix-tree/maple.c              |    59 +
 tools/testing/radix-tree/test.h               |     1 +
 .../radix-tree/trace/events/maple_tree.h      |     8 +
 virt/kvm/kvm_main.c                           |     2 +-
 100 files changed, 45347 insertions(+), 1699 deletions(-)
 create mode 100644 Documentation/core-api/maple-tree.rst
 create mode 100644 include/linux/maple_tree.h
 delete mode 100644 include/linux/vmacache.h
 create mode 100644 include/trace/events/maple_tree.h
 create mode 100644 lib/maple_tree.c
 create mode 100644 lib/test_maple_tree.c
 delete mode 100644 mm/vmacache.c
 create mode 100644 tools/testing/radix-tree/linux/maple_tree.h
 create mode 100644 tools/testing/radix-tree/maple.c
 create mode 100644 tools/testing/radix-tree/trace/events/maple_tree.h

Comments

Liam Howlett May 4, 2021, 2:05 p.m. UTC | #1
* Hillf Danton <hdanton@sina.com> [210504 00:29]:
> On Wed, 28 Apr 2021 15:36:24  Liam R. Howlett wrote:
> > ---
> >  mm/khugepaged.c | 11 +++++++++--
> >  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> Good work. And a nit below.
> > 
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 33cf91529f0b..4983a25c5a90 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -2063,6 +2063,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned =
> > int pages,
> >  	struct mm_struct *mm;
> >  	struct vm_area_struct *vma;
> >  	int progress = 0;
> > +	MA_STATE(mas, NULL, 0, 0);
> > 
> >  	VM_BUG_ON(!pages);
> >  	lockdep_assert_held(&khugepaged_mm_lock);
> > @@ -2079,18 +2080,22 @@ static unsigned int khugepaged_scan_mm_slot(unsigne=
> > d int pages,
> >  	khugepaged_collapse_pte_mapped_thps(mm_slot);
> > 
> >  	mm = mm_slot->mm;
> > +	mas.tree = &mm->mm_mt;
> >  	/*
> >  	 * Don't wait for semaphore (to avoid long wait times).  Just move to
> >  	 * the next mm on the list.
> >  	 */
> >  	vma = NULL;
> > +	mas_set(&mas, khugepaged_scan.address);
> >  	if (unlikely(!mmap_read_trylock(mm)))
> >  		goto breakouterloop_mmap_lock;
> > +
> > +	rcu_read_lock();
> >  	if (likely(!khugepaged_test_exit(mm)))
> > -		vma = find_vma(mm, khugepaged_scan.address);
> > +		vma = mas_find(&mas, ULONG_MAX);
> > 
> >  	progress++;
> > -	for (; vma; vma = vma->vm_next) {
> > +	mas_for_each(&mas, vma, ULONG_MAX) {
> >  		unsigned long hstart, hend;
> > 
> >  		cond_resched();
> 
> Given the added rcu_read_lock(), take another look at this scheduling
> entry?
> 

Yes, thank you for pointing this out.  It is certainly an issue.

> > @@ -2129,6 +2134,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned =
> > int pages,
> >  				pgoff_t pgoff = linear_page_index(vma,
> >  						khugepaged_scan.address);
> > 
> > +				rcu_read_unlock();
> >  				mmap_read_unlock(mm);
> >  				ret = 1;
> >  				khugepaged_scan_file(mm, file, pgoff, hpage);
> > @@ -2149,6 +2155,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned =
> > int pages,
> >  		}
> >  	}
> >  breakouterloop:
> > +	rcu_read_unlock();
> >  	mmap_read_unlock(mm); /* exit_mmap will destroy ptes after this */
> >  breakouterloop_mmap_lock:
> > 
> > --
> > 2.30.2