* [PATCH 0/1] Prepare for maple tree
@ 2022-05-04  0:26 Liam Howlett
  2022-05-04  0:26 ` [PATCH 1/1] mips: rename mt_init to mips_mt_init Liam Howlett
                   ` (2 more replies)
  0 siblings, 3 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  0:26 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

mips uses the name mt_init() internally, which collides with the maple
tree's mt_init().  Rename the mips function out of the way.

Liam R. Howlett (1):
  mips: rename mt_init to mips_mt_init

 arch/mips/kernel/mips-mt.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

-- 
2.35.1

* [PATCH 1/1] mips: rename mt_init to mips_mt_init
  2022-05-04  0:26 [PATCH 0/1] Prepare for maple tree Liam Howlett
@ 2022-05-04  0:26 ` Liam Howlett
  2022-05-12  9:54   ` David Hildenbrand
  2022-05-04  1:12 ` [PATCH v9 15/69] damon: Convert __damon_va_three_regions to use the VMA iterator Liam Howlett
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
  2 siblings, 1 reply; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  0:26 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Move mt_init out of the way for the maple tree.  Use mips_mt prefix to
match the rest of the functions in the file.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
---
 arch/mips/kernel/mips-mt.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/mips/kernel/mips-mt.c b/arch/mips/kernel/mips-mt.c
index d5f7362e8c24..dc023a979803 100644
--- a/arch/mips/kernel/mips-mt.c
+++ b/arch/mips/kernel/mips-mt.c
@@ -230,7 +230,7 @@ void mips_mt_set_cpuoptions(void)
 
 struct class *mt_class;
 
-static int __init mt_init(void)
+static int __init mips_mt_init(void)
 {
 	struct class *mtc;
 
@@ -243,4 +243,4 @@ static int __init mt_init(void)
 	return 0;
 }
 
-subsys_initcall(mt_init);
+subsys_initcall(mips_mt_init);
-- 
2.35.1

* [PATCH v9 15/69] damon: Convert __damon_va_three_regions to use the VMA iterator
  2022-05-04  0:26 [PATCH 0/1] Prepare for maple tree Liam Howlett
  2022-05-04  0:26 ` [PATCH 1/1] mips: rename mt_init to mips_mt_init Liam Howlett
@ 2022-05-04  1:12 ` Liam Howlett
  2022-05-10 10:44   ` SeongJae Park
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
  2 siblings, 1 reply; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:12 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton, damon, SeongJae Park

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

This rather specialised walk can use the VMA iterator.  If this proves to
be too slow, we can write a custom routine to find the two largest gaps,
but it will be somewhat complicated, so let's see if we need it first.

Update the kunit test case to use the maple tree.  This also fixes an
issue where the kunit test case did not add the last VMA to the list.
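
For reference, the shape of the iterator-based walk is roughly the sketch
below (illustrative only, not part of this patch; it tracks just the single
largest gap, while the real code keeps the two largest):

	/* Caller holds mmap_read_lock(mm). */
	static unsigned long sketch_largest_gap(struct mm_struct *mm)
	{
		VMA_ITERATOR(vmi, mm, 0);
		struct vm_area_struct *vma, *prev = NULL;
		unsigned long largest = 0;

		for_each_vma(vmi, vma) {
			if (prev && vma->vm_start - prev->vm_end > largest)
				largest = vma->vm_start - prev->vm_end;
			prev = vma;
		}
		return largest;
	}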

Fixes: 17ccae8bb5c9 ("mm/damon: add kunit tests")
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
---
 mm/damon/vaddr-test.h | 37 +++++++++++-------------------
 mm/damon/vaddr.c      | 53 ++++++++++++++++++++++---------------------
 2 files changed, 40 insertions(+), 50 deletions(-)

diff --git a/mm/damon/vaddr-test.h b/mm/damon/vaddr-test.h
index 5431da4fe9d4..dbf2b8759607 100644
--- a/mm/damon/vaddr-test.h
+++ b/mm/damon/vaddr-test.h
@@ -13,34 +13,21 @@
 #define _DAMON_VADDR_TEST_H
 
 #include <kunit/test.h>
+#include "../../mm/internal.h"
 
-static void __link_vmas(struct vm_area_struct *vmas, ssize_t nr_vmas)
+static void __link_vmas(struct maple_tree *mt, struct vm_area_struct *vmas,
+			ssize_t nr_vmas)
 {
-	int i, j;
-	unsigned long largest_gap, gap;
+	int i;
+	MA_STATE(mas, mt, 0, 0);
 
 	if (!nr_vmas)
 		return;
 
-	for (i = 0; i < nr_vmas - 1; i++) {
-		vmas[i].vm_next = &vmas[i + 1];
-
-		vmas[i].vm_rb.rb_left = NULL;
-		vmas[i].vm_rb.rb_right = &vmas[i + 1].vm_rb;
-
-		largest_gap = 0;
-		for (j = i; j < nr_vmas; j++) {
-			if (j == 0)
-				continue;
-			gap = vmas[j].vm_start - vmas[j - 1].vm_end;
-			if (gap > largest_gap)
-				largest_gap = gap;
-		}
-		vmas[i].rb_subtree_gap = largest_gap;
-	}
-	vmas[i].vm_next = NULL;
-	vmas[i].vm_rb.rb_right = NULL;
-	vmas[i].rb_subtree_gap = 0;
+	mas_lock(&mas);
+	for (i = 0; i < nr_vmas; i++)
+		vma_mas_store(&vmas[i], &mas);
+	mas_unlock(&mas);
 }
 
 /*
@@ -72,6 +59,7 @@ static void __link_vmas(struct vm_area_struct *vmas, ssize_t nr_vmas)
  */
 static void damon_test_three_regions_in_vmas(struct kunit *test)
 {
+	static struct mm_struct mm;
 	struct damon_addr_range regions[3] = {0,};
 	/* 10-20-25, 200-210-220, 300-305, 307-330 */
 	struct vm_area_struct vmas[] = {
@@ -83,9 +71,10 @@ static void damon_test_three_regions_in_vmas(struct kunit *test)
 		(struct vm_area_struct) {.vm_start = 307, .vm_end = 330},
 	};
 
-	__link_vmas(vmas, 6);
+	mt_init_flags(&mm.mm_mt, MM_MT_FLAGS);
+	__link_vmas(&mm.mm_mt, vmas, ARRAY_SIZE(vmas));
 
-	__damon_va_three_regions(&vmas[0], regions);
+	__damon_va_three_regions(&mm, regions);
 
 	KUNIT_EXPECT_EQ(test, 10ul, regions[0].start);
 	KUNIT_EXPECT_EQ(test, 25ul, regions[0].end);
diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
index b2ec0aa1ff45..9a7c52982c35 100644
--- a/mm/damon/vaddr.c
+++ b/mm/damon/vaddr.c
@@ -113,37 +113,38 @@ static unsigned long sz_range(struct damon_addr_range *r)
  *
  * Returns 0 if success, or negative error code otherwise.
  */
-static int __damon_va_three_regions(struct vm_area_struct *vma,
+static int __damon_va_three_regions(struct mm_struct *mm,
 				       struct damon_addr_range regions[3])
 {
-	struct damon_addr_range gap = {0}, first_gap = {0}, second_gap = {0};
-	struct vm_area_struct *last_vma = NULL;
-	unsigned long start = 0;
-	struct rb_root rbroot;
-
-	/* Find two biggest gaps so that first_gap > second_gap > others */
-	for (; vma; vma = vma->vm_next) {
-		if (!last_vma) {
-			start = vma->vm_start;
-			goto next;
-		}
+	struct damon_addr_range first_gap = {0}, second_gap = {0};
+	VMA_ITERATOR(vmi, mm, 0);
+	struct vm_area_struct *vma, *prev = NULL;
+	unsigned long start;
 
-		if (vma->rb_subtree_gap <= sz_range(&second_gap)) {
-			rbroot.rb_node = &vma->vm_rb;
-			vma = rb_entry(rb_last(&rbroot),
-					struct vm_area_struct, vm_rb);
+	/*
+	 * Find the two biggest gaps so that first_gap > second_gap > others.
+	 * If this is too slow, it can be optimised to examine the maple
+	 * tree gaps.
+	 */
+	for_each_vma(vmi, vma) {
+		unsigned long gap;
+
+		if (!prev) {
+			start = vma->vm_start;
 			goto next;
 		}
-
-		gap.start = last_vma->vm_end;
-		gap.end = vma->vm_start;
-		if (sz_range(&gap) > sz_range(&second_gap)) {
-			swap(gap, second_gap);
-			if (sz_range(&second_gap) > sz_range(&first_gap))
-				swap(second_gap, first_gap);
+		gap = vma->vm_start - prev->vm_end;
+
+		if (gap > sz_range(&first_gap)) {
+			second_gap = first_gap;
+			first_gap.start = prev->vm_end;
+			first_gap.end = vma->vm_start;
+		} else if (gap > sz_range(&second_gap)) {
+			second_gap.start = prev->vm_end;
+			second_gap.end = vma->vm_start;
 		}
 next:
-		last_vma = vma;
+		prev = vma;
 	}
 
 	if (!sz_range(&second_gap) || !sz_range(&first_gap))
@@ -159,7 +160,7 @@ static int __damon_va_three_regions(struct vm_area_struct *vma,
 	regions[1].start = ALIGN(first_gap.end, DAMON_MIN_REGION);
 	regions[1].end = ALIGN(second_gap.start, DAMON_MIN_REGION);
 	regions[2].start = ALIGN(second_gap.end, DAMON_MIN_REGION);
-	regions[2].end = ALIGN(last_vma->vm_end, DAMON_MIN_REGION);
+	regions[2].end = ALIGN(prev->vm_end, DAMON_MIN_REGION);
 
 	return 0;
 }
@@ -180,7 +181,7 @@ static int damon_va_three_regions(struct damon_target *t,
 		return -EINVAL;
 
 	mmap_read_lock(mm);
-	rc = __damon_va_three_regions(mm->mmap, regions);
+	rc = __damon_va_three_regions(mm, regions);
 	mmap_read_unlock(mm);
 
 	mmput(mm);
-- 
2.35.1

* [PATCH v9 16/69] proc: remove VMA rbtree use from nommu
  2022-05-04  0:26 [PATCH 0/1] Prepare for maple tree Liam Howlett
  2022-05-04  0:26 ` [PATCH 1/1] mips: rename mt_init to mips_mt_init Liam Howlett
  2022-05-04  1:12 ` [PATCH v9 15/69] damon: Convert __damon_va_three_regions to use the VMA iterator Liam Howlett
@ 2022-05-04  1:13 ` Liam Howlett
  2022-05-04  1:13   ` [PATCH v9 17/69] mm: remove rb tree Liam Howlett
                     ` (52 more replies)
  2 siblings, 53 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:13 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

These users of the rbtree should probably have been walks of the linked
list, but convert them to use walks of the maple tree.
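
The walks all follow the same pattern; a minimal sketch, mirroring the
task_vsize() conversion below (illustrative only, not part of the patch):

	static unsigned long sketch_task_vsize(struct mm_struct *mm)
	{
		VMA_ITERATOR(vmi, mm, 0);
		struct vm_area_struct *vma;
		unsigned long vsize = 0;

		mmap_read_lock(mm);
		/* was: for (p = rb_first(&mm->mm_rb); p; p = rb_next(p)) */
		for_each_vma(vmi, vma)
			vsize += vma->vm_end - vma->vm_start;
		mmap_read_unlock(mm);

		return vsize;
	}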

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 fs/proc/task_nommu.c | 45 +++++++++++++++++++++-----------------------
 1 file changed, 21 insertions(+), 24 deletions(-)

diff --git a/fs/proc/task_nommu.c b/fs/proc/task_nommu.c
index a6d21fc0033c..2fd06f52b6a4 100644
--- a/fs/proc/task_nommu.c
+++ b/fs/proc/task_nommu.c
@@ -20,15 +20,13 @@
  */
 void task_mem(struct seq_file *m, struct mm_struct *mm)
 {
+	VMA_ITERATOR(vmi, mm, 0);
 	struct vm_area_struct *vma;
 	struct vm_region *region;
-	struct rb_node *p;
 	unsigned long bytes = 0, sbytes = 0, slack = 0, size;
-        
-	mmap_read_lock(mm);
-	for (p = rb_first(&mm->mm_rb); p; p = rb_next(p)) {
-		vma = rb_entry(p, struct vm_area_struct, vm_rb);
 
+	mmap_read_lock(mm);
+	for_each_vma(vmi, vma) {
 		bytes += kobjsize(vma);
 
 		region = vma->vm_region;
@@ -82,15 +80,13 @@ void task_mem(struct seq_file *m, struct mm_struct *mm)
 
 unsigned long task_vsize(struct mm_struct *mm)
 {
+	VMA_ITERATOR(vmi, mm, 0);
 	struct vm_area_struct *vma;
-	struct rb_node *p;
 	unsigned long vsize = 0;
 
 	mmap_read_lock(mm);
-	for (p = rb_first(&mm->mm_rb); p; p = rb_next(p)) {
-		vma = rb_entry(p, struct vm_area_struct, vm_rb);
+	for_each_vma(vmi, vma)
 		vsize += vma->vm_end - vma->vm_start;
-	}
 	mmap_read_unlock(mm);
 	return vsize;
 }
@@ -99,14 +95,13 @@ unsigned long task_statm(struct mm_struct *mm,
 			 unsigned long *shared, unsigned long *text,
 			 unsigned long *data, unsigned long *resident)
 {
+	VMA_ITERATOR(vmi, mm, 0);
 	struct vm_area_struct *vma;
 	struct vm_region *region;
-	struct rb_node *p;
 	unsigned long size = kobjsize(mm);
 
 	mmap_read_lock(mm);
-	for (p = rb_first(&mm->mm_rb); p; p = rb_next(p)) {
-		vma = rb_entry(p, struct vm_area_struct, vm_rb);
+	for_each_vma(vmi, vma) {
 		size += kobjsize(vma);
 		region = vma->vm_region;
 		if (region) {
@@ -190,17 +185,19 @@ static int nommu_vma_show(struct seq_file *m, struct vm_area_struct *vma)
  */
 static int show_map(struct seq_file *m, void *_p)
 {
-	struct rb_node *p = _p;
-
-	return nommu_vma_show(m, rb_entry(p, struct vm_area_struct, vm_rb));
+	return nommu_vma_show(m, _p);
 }
 
 static void *m_start(struct seq_file *m, loff_t *pos)
 {
 	struct proc_maps_private *priv = m->private;
 	struct mm_struct *mm;
-	struct rb_node *p;
-	loff_t n = *pos;
+	struct vm_area_struct *vma;
+	unsigned long addr = *pos;
+
+	/* See m_next(). Zero at the start or after lseek. */
+	if (addr == -1UL)
+		return NULL;
 
 	/* pin the task and mm whilst we play with them */
 	priv->task = get_proc_task(priv->inode);
@@ -216,10 +213,10 @@ static void *m_start(struct seq_file *m, loff_t *pos)
 		return ERR_PTR(-EINTR);
 	}
 
-	/* start from the Nth VMA */
-	for (p = rb_first(&mm->mm_rb); p; p = rb_next(p))
-		if (n-- == 0)
-			return p;
+	/* start the next element from addr */
+	vma = find_vma(mm, addr);
+	if (vma)
+		return vma;
 
 	mmap_read_unlock(mm);
 	mmput(mm);
@@ -242,10 +239,10 @@ static void m_stop(struct seq_file *m, void *_vml)
 
 static void *m_next(struct seq_file *m, void *_p, loff_t *pos)
 {
-	struct rb_node *p = _p;
+	struct vm_area_struct *vma = _p;
 
-	(*pos)++;
-	return p ? rb_next(p) : NULL;
+	*pos = vma->vm_end;
+	return find_vma(vma->vm_mm, vma->vm_end);
 }
 
 static const struct seq_operations proc_pid_maps_ops = {
-- 
2.35.1

* [PATCH v9 17/69] mm: remove rb tree.
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
@ 2022-05-04  1:13   ` Liam Howlett
  2022-05-04  1:13   ` [PATCH v9 19/69] xen: use vma_lookup() in privcmd_ioctl_mmap() Liam Howlett
                     ` (51 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:13 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Remove the RB tree and start using the maple tree for vm_area_struct
tracking.

Drop validate_mm() calls in expand_upwards() and expand_downwards() as the
lock is not held.
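
The rbtree-based find_vma_links() lookups become maple tree state walks; a
sketch of the replacement pattern, mirroring the range_has_overlap() helper
added below (illustrative only):

	static bool sketch_range_has_overlap(struct mm_struct *mm,
					     unsigned long start,
					     unsigned long end,
					     struct vm_area_struct **pprev)
	{
		struct vm_area_struct *existing;
		MA_STATE(mas, &mm->mm_mt, start, start);

		/* Any VMA overlapping [start, end)? */
		existing = mas_find(&mas, end - 1);
		/* The VMA preceding the range, for list linkage. */
		*pprev = mas_prev(&mas, 0);
		return existing != NULL;
	}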

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 arch/x86/kernel/tboot.c    |   1 -
 drivers/firmware/efi/efi.c |   1 -
 include/linux/mm.h         |   2 -
 include/linux/mm_types.h   |  14 -
 kernel/fork.c              |   8 -
 mm/init-mm.c               |   2 -
 mm/mmap.c                  | 509 ++++++++-----------------------------
 mm/nommu.c                 |  87 ++-----
 mm/util.c                  |  10 +-
 9 files changed, 144 insertions(+), 490 deletions(-)

diff --git a/arch/x86/kernel/tboot.c b/arch/x86/kernel/tboot.c
index 859e8d2ea070..a8e3130890ea 100644
--- a/arch/x86/kernel/tboot.c
+++ b/arch/x86/kernel/tboot.c
@@ -97,7 +97,6 @@ void __init tboot_probe(void)
 
 static pgd_t *tboot_pg_dir;
 static struct mm_struct tboot_mm = {
-	.mm_rb          = RB_ROOT,
 	.mm_mt          = MTREE_INIT_EXT(mm_mt, MM_MT_FLAGS, tboot_mm.mmap_lock),
 	.pgd            = swapper_pg_dir,
 	.mm_users       = ATOMIC_INIT(2),
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 92a765d8d3b6..f18c256bbf89 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -54,7 +54,6 @@ static unsigned long __initdata mem_reserve = EFI_INVALID_TABLE_ADDR;
 static unsigned long __initdata rt_prop = EFI_INVALID_TABLE_ADDR;
 
 struct mm_struct efi_mm = {
-	.mm_rb			= RB_ROOT,
 	.mm_mt			= MTREE_INIT_EXT(mm_mt, MM_MT_FLAGS, efi_mm.mmap_lock),
 	.mm_users		= ATOMIC_INIT(2),
 	.mm_count		= ATOMIC_INIT(1),
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c259f15c58ac..d11673080c33 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2660,8 +2660,6 @@ extern int __split_vma(struct mm_struct *, struct vm_area_struct *,
 extern int split_vma(struct mm_struct *, struct vm_area_struct *,
 	unsigned long addr, int new_below);
 extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *);
-extern void __vma_link_rb(struct mm_struct *, struct vm_area_struct *,
-	struct rb_node **, struct rb_node *);
 extern void unlink_file_vma(struct vm_area_struct *);
 extern struct vm_area_struct *copy_vma(struct vm_area_struct **,
 	unsigned long addr, unsigned long len, pgoff_t pgoff,
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index e3c7855fc622..50c53f370bf6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -400,19 +400,6 @@ struct vm_area_struct {
 
 	/* linked list of VM areas per task, sorted by address */
 	struct vm_area_struct *vm_next, *vm_prev;
-
-	struct rb_node vm_rb;
-
-	/*
-	 * Largest free memory gap in bytes to the left of this VMA.
-	 * Either between this VMA and vma->vm_prev, or between one of the
-	 * VMAs below us in the VMA rbtree and its ->vm_prev. This helps
-	 * get_unmapped_area find a free area of the right size.
-	 */
-	unsigned long rb_subtree_gap;
-
-	/* Second cache line starts here. */
-
 	struct mm_struct *vm_mm;	/* The address space we belong to. */
 
 	/*
@@ -478,7 +465,6 @@ struct mm_struct {
 	struct {
 		struct vm_area_struct *mmap;		/* list of VMAs */
 		struct maple_tree mm_mt;
-		struct rb_root mm_rb;
 		u64 vmacache_seqnum;                   /* per-thread vmacache */
 #ifdef CONFIG_MMU
 		unsigned long (*get_unmapped_area) (struct file *filp,
diff --git a/kernel/fork.c b/kernel/fork.c
index 79af6e908539..60783abc21c8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -581,7 +581,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 					struct mm_struct *oldmm)
 {
 	struct vm_area_struct *mpnt, *tmp, *prev, **pprev;
-	struct rb_node **rb_link, *rb_parent;
 	int retval;
 	unsigned long charge = 0;
 	MA_STATE(old_mas, &oldmm->mm_mt, 0, 0);
@@ -608,8 +607,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	mm->exec_vm = oldmm->exec_vm;
 	mm->stack_vm = oldmm->stack_vm;
 
-	rb_link = &mm->mm_rb.rb_node;
-	rb_parent = NULL;
 	pprev = &mm->mmap;
 	retval = ksm_fork(mm, oldmm);
 	if (retval)
@@ -703,10 +700,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		tmp->vm_prev = prev;
 		prev = tmp;
 
-		__vma_link_rb(mm, tmp, rb_link, rb_parent);
-		rb_link = &tmp->vm_rb.rb_right;
-		rb_parent = &tmp->vm_rb;
-
 		/* Link the vma into the MT */
 		mas.index = tmp->vm_start;
 		mas.last = tmp->vm_end - 1;
@@ -1124,7 +1117,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	struct user_namespace *user_ns)
 {
 	mm->mmap = NULL;
-	mm->mm_rb = RB_ROOT;
 	mt_init_flags(&mm->mm_mt, MM_MT_FLAGS);
 	mt_set_external_lock(&mm->mm_mt, &mm->mmap_lock);
 	mm->vmacache_seqnum = 0;
diff --git a/mm/init-mm.c b/mm/init-mm.c
index b912b0f2eced..c9327abb771c 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -1,6 +1,5 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/mm_types.h>
-#include <linux/rbtree.h>
 #include <linux/maple_tree.h>
 #include <linux/rwsem.h>
 #include <linux/spinlock.h>
@@ -29,7 +28,6 @@
  * and size this cpu_bitmask to NR_CPUS.
  */
 struct mm_struct init_mm = {
-	.mm_rb		= RB_ROOT,
 	.mm_mt		= MTREE_INIT_EXT(mm_mt, MM_MT_FLAGS, init_mm.mmap_lock),
 	.pgd		= swapper_pg_dir,
 	.mm_users	= ATOMIC_INIT(2),
diff --git a/mm/mmap.c b/mm/mmap.c
index ecdedf5191c0..44f9f4b5411e 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -39,7 +39,6 @@
 #include <linux/audit.h>
 #include <linux/khugepaged.h>
 #include <linux/uprobes.h>
-#include <linux/rbtree_augmented.h>
 #include <linux/notifier.h>
 #include <linux/memory.h>
 #include <linux/printk.h>
@@ -294,93 +293,6 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
 	return origbrk;
 }
 
-static inline unsigned long vma_compute_gap(struct vm_area_struct *vma)
-{
-	unsigned long gap, prev_end;
-
-	/*
-	 * Note: in the rare case of a VM_GROWSDOWN above a VM_GROWSUP, we
-	 * allow two stack_guard_gaps between them here, and when choosing
-	 * an unmapped area; whereas when expanding we only require one.
-	 * That's a little inconsistent, but keeps the code here simpler.
-	 */
-	gap = vm_start_gap(vma);
-	if (vma->vm_prev) {
-		prev_end = vm_end_gap(vma->vm_prev);
-		if (gap > prev_end)
-			gap -= prev_end;
-		else
-			gap = 0;
-	}
-	return gap;
-}
-
-#ifdef CONFIG_DEBUG_VM_RB
-static unsigned long vma_compute_subtree_gap(struct vm_area_struct *vma)
-{
-	unsigned long max = vma_compute_gap(vma), subtree_gap;
-	if (vma->vm_rb.rb_left) {
-		subtree_gap = rb_entry(vma->vm_rb.rb_left,
-				struct vm_area_struct, vm_rb)->rb_subtree_gap;
-		if (subtree_gap > max)
-			max = subtree_gap;
-	}
-	if (vma->vm_rb.rb_right) {
-		subtree_gap = rb_entry(vma->vm_rb.rb_right,
-				struct vm_area_struct, vm_rb)->rb_subtree_gap;
-		if (subtree_gap > max)
-			max = subtree_gap;
-	}
-	return max;
-}
-
-static int browse_rb(struct mm_struct *mm)
-{
-	struct rb_root *root = &mm->mm_rb;
-	int i = 0, j, bug = 0;
-	struct rb_node *nd, *pn = NULL;
-	unsigned long prev = 0, pend = 0;
-
-	for (nd = rb_first(root); nd; nd = rb_next(nd)) {
-		struct vm_area_struct *vma;
-		vma = rb_entry(nd, struct vm_area_struct, vm_rb);
-		if (vma->vm_start < prev) {
-			pr_emerg("vm_start %lx < prev %lx\n",
-				  vma->vm_start, prev);
-			bug = 1;
-		}
-		if (vma->vm_start < pend) {
-			pr_emerg("vm_start %lx < pend %lx\n",
-				  vma->vm_start, pend);
-			bug = 1;
-		}
-		if (vma->vm_start > vma->vm_end) {
-			pr_emerg("vm_start %lx > vm_end %lx\n",
-				  vma->vm_start, vma->vm_end);
-			bug = 1;
-		}
-		spin_lock(&mm->page_table_lock);
-		if (vma->rb_subtree_gap != vma_compute_subtree_gap(vma)) {
-			pr_emerg("free gap %lx, correct %lx\n",
-			       vma->rb_subtree_gap,
-			       vma_compute_subtree_gap(vma));
-			bug = 1;
-		}
-		spin_unlock(&mm->page_table_lock);
-		i++;
-		pn = nd;
-		prev = vma->vm_start;
-		pend = vma->vm_end;
-	}
-	j = 0;
-	for (nd = pn; nd; nd = rb_prev(nd))
-		j++;
-	if (i != j) {
-		pr_emerg("backwards %d, forwards %d\n", j, i);
-		bug = 1;
-	}
-	return bug ? -1 : i;
-}
 #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
 extern void mt_validate(struct maple_tree *mt);
 extern void mt_dump(const struct maple_tree *mt);
@@ -406,19 +318,25 @@ static void validate_mm_mt(struct mm_struct *mm)
 		    (vma->vm_end - 1 != mas.last)) {
 			pr_emerg("issue in %s\n", current->comm);
 			dump_stack();
-#ifdef CONFIG_DEBUG_VM
 			dump_vma(vma_mt);
-			pr_emerg("and next in rb\n");
+			pr_emerg("and vm_next\n");
 			dump_vma(vma->vm_next);
-#endif
 			pr_emerg("mt piv: %px %lu - %lu\n", vma_mt,
 				 mas.index, mas.last);
 			pr_emerg("mt vma: %px %lu - %lu\n", vma_mt,
 				 vma_mt->vm_start, vma_mt->vm_end);
-			pr_emerg("rb vma: %px %lu - %lu\n", vma,
+			if (vma->vm_prev) {
+				pr_emerg("ll prev: %px %lu - %lu\n",
+					 vma->vm_prev, vma->vm_prev->vm_start,
+					 vma->vm_prev->vm_end);
+			}
+			pr_emerg("ll vma: %px %lu - %lu\n", vma,
 				 vma->vm_start, vma->vm_end);
-			pr_emerg("rb->next = %px %lu - %lu\n", vma->vm_next,
-					vma->vm_next->vm_start, vma->vm_next->vm_end);
+			if (vma->vm_next) {
+				pr_emerg("ll next: %px %lu - %lu\n",
+					 vma->vm_next, vma->vm_next->vm_start,
+					 vma->vm_next->vm_end);
+			}
 
 			mt_dump(mas.tree);
 			if (vma_mt->vm_end != mas.last + 1) {
@@ -442,21 +360,6 @@ static void validate_mm_mt(struct mm_struct *mm)
 	VM_BUG_ON(vma);
 	mt_validate(&mm->mm_mt);
 }
-#else
-#define validate_mm_mt(root) do { } while (0)
-#endif
-static void validate_mm_rb(struct rb_root *root, struct vm_area_struct *ignore)
-{
-	struct rb_node *nd;
-
-	for (nd = rb_first(root); nd; nd = rb_next(nd)) {
-		struct vm_area_struct *vma;
-		vma = rb_entry(nd, struct vm_area_struct, vm_rb);
-		VM_BUG_ON_VMA(vma != ignore &&
-			vma->rb_subtree_gap != vma_compute_subtree_gap(vma),
-			vma);
-	}
-}
 
 static void validate_mm(struct mm_struct *mm)
 {
@@ -465,7 +368,10 @@ static void validate_mm(struct mm_struct *mm)
 	unsigned long highest_address = 0;
 	struct vm_area_struct *vma = mm->mmap;
 
+	validate_mm_mt(mm);
+
 	while (vma) {
+#ifdef CONFIG_DEBUG_VM_RB
 		struct anon_vma *anon_vma = vma->anon_vma;
 		struct anon_vma_chain *avc;
 
@@ -475,6 +381,7 @@ static void validate_mm(struct mm_struct *mm)
 				anon_vma_interval_tree_verify(avc);
 			anon_vma_unlock_read(anon_vma);
 		}
+#endif
 
 		highest_address = vm_end_gap(vma);
 		vma = vma->vm_next;
@@ -489,80 +396,13 @@ static void validate_mm(struct mm_struct *mm)
 			  mm->highest_vm_end, highest_address);
 		bug = 1;
 	}
-	i = browse_rb(mm);
-	if (i != mm->map_count) {
-		if (i != -1)
-			pr_emerg("map_count %d rb %d\n", mm->map_count, i);
-		bug = 1;
-	}
 	VM_BUG_ON_MM(bug, mm);
 }
-#else
-#define validate_mm_rb(root, ignore) do { } while (0)
+
+#else /* !CONFIG_DEBUG_VM_MAPLE_TREE */
 #define validate_mm_mt(root) do { } while (0)
 #define validate_mm(mm) do { } while (0)
-#endif
-
-RB_DECLARE_CALLBACKS_MAX(static, vma_gap_callbacks,
-			 struct vm_area_struct, vm_rb,
-			 unsigned long, rb_subtree_gap, vma_compute_gap)
-
-/*
- * Update augmented rbtree rb_subtree_gap values after vma->vm_start or
- * vma->vm_prev->vm_end values changed, without modifying the vma's position
- * in the rbtree.
- */
-static void vma_gap_update(struct vm_area_struct *vma)
-{
-	/*
-	 * As it turns out, RB_DECLARE_CALLBACKS_MAX() already created
-	 * a callback function that does exactly what we want.
-	 */
-	vma_gap_callbacks_propagate(&vma->vm_rb, NULL);
-}
-
-static inline void vma_rb_insert(struct vm_area_struct *vma,
-				 struct rb_root *root)
-{
-	/* All rb_subtree_gap values must be consistent prior to insertion */
-	validate_mm_rb(root, NULL);
-
-	rb_insert_augmented(&vma->vm_rb, root, &vma_gap_callbacks);
-}
-
-static void __vma_rb_erase(struct vm_area_struct *vma, struct rb_root *root)
-{
-	/*
-	 * Note rb_erase_augmented is a fairly large inline function,
-	 * so make sure we instantiate it only once with our desired
-	 * augmented rbtree callbacks.
-	 */
-	rb_erase_augmented(&vma->vm_rb, root, &vma_gap_callbacks);
-}
-
-static __always_inline void vma_rb_erase_ignore(struct vm_area_struct *vma,
-						struct rb_root *root,
-						struct vm_area_struct *ignore)
-{
-	/*
-	 * All rb_subtree_gap values must be consistent prior to erase,
-	 * with the possible exception of
-	 *
-	 * a. the "next" vma being erased if next->vm_start was reduced in
-	 *    __vma_adjust() -> __vma_unlink()
-	 * b. the vma being erased in detach_vmas_to_be_unmapped() ->
-	 *    vma_rb_erase()
-	 */
-	validate_mm_rb(root, ignore);
-
-	__vma_rb_erase(vma, root);
-}
-
-static __always_inline void vma_rb_erase(struct vm_area_struct *vma,
-					 struct rb_root *root)
-{
-	vma_rb_erase_ignore(vma, root, vma);
-}
+#endif /* CONFIG_DEBUG_VM_MAPLE_TREE */
 
 /*
  * vma has some anon_vma assigned, and is already inserted on that
@@ -596,39 +436,26 @@ anon_vma_interval_tree_post_update_vma(struct vm_area_struct *vma)
 		anon_vma_interval_tree_insert(avc, &avc->anon_vma->rb_root);
 }
 
-static int find_vma_links(struct mm_struct *mm, unsigned long addr,
-		unsigned long end, struct vm_area_struct **pprev,
-		struct rb_node ***rb_link, struct rb_node **rb_parent)
+/*
+ * range_has_overlap() - Check the @start - @end range for overlapping VMAs and
+ * sets up a pointer to the previous VMA
+ * @mm: the mm struct
+ * @start: the start address of the range
+ * @end: the end address of the range
+ * @pprev: the pointer to the pointer of the previous VMA
+ *
+ * Returns: True if there is an overlapping VMA, false otherwise
+ */
+static inline
+bool range_has_overlap(struct mm_struct *mm, unsigned long start,
+		       unsigned long end, struct vm_area_struct **pprev)
 {
-	struct rb_node **__rb_link, *__rb_parent, *rb_prev;
-
-	mmap_assert_locked(mm);
-	__rb_link = &mm->mm_rb.rb_node;
-	rb_prev = __rb_parent = NULL;
-
-	while (*__rb_link) {
-		struct vm_area_struct *vma_tmp;
+	struct vm_area_struct *existing;
 
-		__rb_parent = *__rb_link;
-		vma_tmp = rb_entry(__rb_parent, struct vm_area_struct, vm_rb);
-
-		if (vma_tmp->vm_end > addr) {
-			/* Fail if an existing vma overlaps the area */
-			if (vma_tmp->vm_start < end)
-				return -ENOMEM;
-			__rb_link = &__rb_parent->rb_left;
-		} else {
-			rb_prev = __rb_parent;
-			__rb_link = &__rb_parent->rb_right;
-		}
-	}
-
-	*pprev = NULL;
-	if (rb_prev)
-		*pprev = rb_entry(rb_prev, struct vm_area_struct, vm_rb);
-	*rb_link = __rb_link;
-	*rb_parent = __rb_parent;
-	return 0;
+	MA_STATE(mas, &mm->mm_mt, start, start);
+	existing = mas_find(&mas, end - 1);
+	*pprev = mas_prev(&mas, 0);
+	return existing ? true : false;
 }
 
 /*
@@ -655,8 +482,6 @@ static inline struct vm_area_struct *__vma_next(struct mm_struct *mm,
  * @start: The start of the range.
  * @len: The length of the range.
  * @pprev: pointer to the pointer that will be set to previous vm_area_struct
- * @rb_link: the rb_node
- * @rb_parent: the parent rb_node
  *
  * Find all the vm_area_struct that overlap from @start to
  * @end and munmap them.  Set @pprev to the previous vm_area_struct.
@@ -665,14 +490,11 @@ static inline struct vm_area_struct *__vma_next(struct mm_struct *mm,
  */
 static inline int
 munmap_vma_range(struct mm_struct *mm, unsigned long start, unsigned long len,
-		 struct vm_area_struct **pprev, struct rb_node ***link,
-		 struct rb_node **parent, struct list_head *uf)
+		 struct vm_area_struct **pprev, struct list_head *uf)
 {
-
-	while (find_vma_links(mm, start, start + len, pprev, link, parent))
+	while (range_has_overlap(mm, start, start + len, pprev))
 		if (do_munmap(mm, start, len, uf))
 			return -ENOMEM;
-
 	return 0;
 }
 
@@ -693,30 +515,6 @@ static unsigned long count_vma_pages_range(struct mm_struct *mm,
 	return nr_pages;
 }
 
-void __vma_link_rb(struct mm_struct *mm, struct vm_area_struct *vma,
-		struct rb_node **rb_link, struct rb_node *rb_parent)
-{
-	/* Update tracking information for the gap following the new vma. */
-	if (vma->vm_next)
-		vma_gap_update(vma->vm_next);
-	else
-		mm->highest_vm_end = vm_end_gap(vma);
-
-	/*
-	 * vma->vm_prev wasn't known when we followed the rbtree to find the
-	 * correct insertion point for that vma. As a result, we could not
-	 * update the vma vm_rb parents rb_subtree_gap values on the way down.
-	 * So, we first insert the vma with a zero rb_subtree_gap value
-	 * (to be consistent with what we did on the way down), and then
-	 * immediately update the gap to the correct value. Finally we
-	 * rebalance the rbtree after all augmented values have been set.
-	 */
-	rb_link_node(&vma->vm_rb, rb_parent, rb_link);
-	vma->rb_subtree_gap = 0;
-	vma_gap_update(vma);
-	vma_rb_insert(vma, &mm->mm_rb);
-}
-
 static void __vma_link_file(struct vm_area_struct *vma)
 {
 	struct file *file;
@@ -784,18 +582,8 @@ static inline void vma_mas_szero(struct ma_state *mas, unsigned long start,
 	mas_store_prealloc(mas, NULL);
 }
 
-static void
-__vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
-	struct vm_area_struct *prev, struct rb_node **rb_link,
-	struct rb_node *rb_parent)
-{
-	__vma_link_list(mm, vma, prev);
-	__vma_link_rb(mm, vma, rb_link, rb_parent);
-}
-
 static int vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
-			struct vm_area_struct *prev, struct rb_node **rb_link,
-			struct rb_node *rb_parent)
+			struct vm_area_struct *prev)
 {
 	MA_STATE(mas, &mm->mm_mt, 0, 0);
 	struct address_space *mapping = NULL;
@@ -809,7 +597,7 @@ static int vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 
 	vma_mas_store(vma, &mas);
-	__vma_link(mm, vma, prev, rb_link, rb_parent);
+	__vma_link_list(mm, vma, prev);
 	__vma_link_file(vma);
 
 	if (mapping)
@@ -822,34 +610,20 @@ static int vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
 
 /*
  * Helper for vma_adjust() in the split_vma insert case: insert a vma into the
- * mm's list and rbtree.  It has already been inserted into the interval tree.
+ * mm's list and the mm tree.  It has already been inserted into the interval tree.
  */
 static void __insert_vm_struct(struct mm_struct *mm, struct ma_state *mas,
 			       struct vm_area_struct *vma)
 {
 	struct vm_area_struct *prev;
-	struct rb_node **rb_link, *rb_parent;
-
-	if (find_vma_links(mm, vma->vm_start, vma->vm_end,
-			   &prev, &rb_link, &rb_parent))
-		BUG();
 
+	mas_set(mas, vma->vm_start);
+	prev = mas_prev(mas, 0);
 	vma_mas_store(vma, mas);
 	__vma_link_list(mm, vma, prev);
-	__vma_link_rb(mm, vma, rb_link, rb_parent);
 	mm->map_count++;
 }
 
-static __always_inline void __vma_unlink(struct mm_struct *mm,
-						struct vm_area_struct *vma,
-						struct vm_area_struct *ignore)
-{
-	vma_rb_erase_ignore(vma, &mm->mm_rb, ignore);
-	__vma_unlink_list(mm, vma);
-	/* Kill the cache */
-	vmacache_invalidate(mm);
-}
-
 /*
  * We cannot adjust vm_start, vm_end, vm_pgoff fields of a vma that
  * is already present in an i_mmap tree without adjusting the tree.
@@ -862,20 +636,18 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	struct vm_area_struct *expand)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	struct vm_area_struct *next = vma->vm_next, *orig_vma = vma;
+	struct vm_area_struct *next_next, *next = find_vma(mm, vma->vm_end);
+	struct vm_area_struct *orig_vma = vma;
 	struct address_space *mapping = NULL;
 	struct rb_root_cached *root = NULL;
 	struct anon_vma *anon_vma = NULL;
 	struct file *file = vma->vm_file;
-	bool start_changed = false, end_changed = false;
+	bool vma_changed = false;
 	long adjust_next = 0;
 	int remove_next = 0;
 	MA_STATE(mas, &mm->mm_mt, 0, 0);
 	struct vm_area_struct *exporter = NULL, *importer = NULL;
 
-	validate_mm(mm);
-	validate_mm_mt(mm);
-
 	if (next && !insert) {
 		if (end >= next->vm_end) {
 			/*
@@ -905,8 +677,9 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 				 * remove_next == 1 is case 1 or 7.
 				 */
 				remove_next = 1 + (end > next->vm_end);
+				next_next = find_vma(mm, next->vm_end);
 				VM_WARN_ON(remove_next == 2 &&
-					   end != next->vm_next->vm_end);
+					   end != next_next->vm_end);
 				/* trim end to next, for case 6 first pass */
 				end = next->vm_end;
 			}
@@ -1005,21 +778,21 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	}
 
 	if (start != vma->vm_start) {
-		unsigned long old_start = vma->vm_start;
+		if (vma->vm_start < start)
+			vma_mas_szero(&mas, vma->vm_start, start);
+		vma_changed = true;
 		vma->vm_start = start;
-		if (old_start < start)
-			vma_mas_szero(&mas, old_start, start);
-		start_changed = true;
 	}
 	if (end != vma->vm_end) {
-		unsigned long old_end = vma->vm_end;
+		if (vma->vm_end > end)
+			vma_mas_szero(&mas, end, vma->vm_end);
+		vma_changed = true;
 		vma->vm_end = end;
-		if (old_end > end)
-			vma_mas_szero(&mas, end, old_end);
-		end_changed = true;
+		if (!next)
+			mm->highest_vm_end = vm_end_gap(vma);
 	}
 
-	if (end_changed || start_changed)
+	if (vma_changed)
 		vma_mas_store(vma, &mas);
 
 	vma->vm_pgoff = pgoff;
@@ -1037,25 +810,9 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	}
 
 	if (remove_next) {
-		/*
-		 * vma_merge has merged next into vma, and needs
-		 * us to remove next before dropping the locks.
-		 * Since we have expanded over this vma, the maple tree will
-		 * have overwritten by storing the value
-		 */
-		if (remove_next != 3)
-			__vma_unlink(mm, next, next);
-		else
-			/*
-			 * vma is not before next if they've been
-			 * swapped.
-			 *
-			 * pre-swap() next->vm_start was reduced so
-			 * tell validate_mm_rb to ignore pre-swap()
-			 * "next" (which is stored in post-swap()
-			 * "vma").
-			 */
-			__vma_unlink(mm, next, vma);
+		__vma_unlink_list(mm, next);
+		/* Kill the cache */
+		vmacache_invalidate(mm);
 		if (file)
 			__remove_shared_vm_struct(next, file, mapping);
 	} else if (insert) {
@@ -1065,15 +822,6 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 		 * (it may either follow vma or precede it).
 		 */
 		__insert_vm_struct(mm, &mas, insert);
-	} else {
-		if (start_changed)
-			vma_gap_update(vma);
-		if (end_changed) {
-			if (!next)
-				mm->highest_vm_end = vm_end_gap(vma);
-			else if (!adjust_next)
-				vma_gap_update(next);
-		}
 	}
 
 	if (anon_vma) {
@@ -1100,7 +848,9 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 			anon_vma_merge(vma, next);
 		mm->map_count--;
 		mpol_put(vma_policy(next));
+		BUG_ON(vma->vm_end < next->vm_end);
 		vm_area_free(next);
+
 		/*
 		 * In mprotect's case 6 (see comments on vma_merge),
 		 * we must remove another next too. It would clutter
@@ -1113,7 +863,7 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 			 * "next->vm_prev->vm_end" changed and the
 			 * "vma->vm_next" gap must be updated.
 			 */
-			next = vma->vm_next;
+			next = next_next;
 		} else {
 			/*
 			 * For the scope of the comment "next" and
@@ -1128,13 +878,11 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 			next = vma;
 		}
 		if (remove_next == 2) {
+			mas_reset(&mas);
 			remove_next = 1;
 			end = next->vm_end;
 			goto again;
-		}
-		else if (next)
-			vma_gap_update(next);
-		else {
+		} else if (!next) {
 			/*
 			 * If remove_next == 2 we obviously can't
 			 * reach this path.
@@ -1161,8 +909,6 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 		uprobe_mmap(insert);
 
 	validate_mm(mm);
-	validate_mm_mt(mm);
-
 	return 0;
 }
 
@@ -1315,7 +1061,6 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 	struct vm_area_struct *area, *next;
 	int err;
 
-	validate_mm_mt(mm);
 	/*
 	 * We later require that vma->vm_flags == vm_flags,
 	 * so this tests vma->vm_flags & VM_SPECIAL, too.
@@ -1391,7 +1136,6 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 		khugepaged_enter_vma_merge(area, vm_flags);
 		return area;
 	}
-	validate_mm_mt(mm);
 
 	return NULL;
 }
@@ -1561,6 +1305,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 	vm_flags_t vm_flags;
 	int pkey = 0;
 
+	validate_mm(mm);
 	*populate = 0;
 
 	if (!len)
@@ -1868,10 +1613,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma, *prev, *merge;
 	int error;
-	struct rb_node **rb_link, *rb_parent;
 	unsigned long charged = 0;
 
-	validate_mm_mt(mm);
 	/* Check against address space limit. */
 	if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) {
 		unsigned long nr_pages;
@@ -1887,8 +1630,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 			return -ENOMEM;
 	}
 
-	/* Clear old maps, set up prev, rb_link, rb_parent, and uf */
-	if (munmap_vma_range(mm, addr, len, &prev, &rb_link, &rb_parent, uf))
+	/* Clear old maps, set up prev and uf */
+	if (munmap_vma_range(mm, addr, len, &prev, uf))
 		return -ENOMEM;
 	/*
 	 * Private writable mapping: check memory availability
@@ -1986,7 +1729,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 			goto free_vma;
 	}
 
-	if (vma_link(mm, vma, prev, rb_link, rb_parent)) {
+	if (vma_link(mm, vma, prev)) {
 		error = -ENOMEM;
 		if (file)
 			goto unmap_and_free_vma;
@@ -2026,7 +1769,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 
 	vma_set_page_prot(vma);
 
-	validate_mm_mt(mm);
 	return addr;
 
 unmap_and_free_vma:
@@ -2043,7 +1785,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 unacct_error:
 	if (charged)
 		vm_unacct_memory(charged);
-	validate_mm_mt(mm);
 	return error;
 }
 
@@ -2379,7 +2120,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 	int error = 0;
 	MA_STATE(mas, &mm->mm_mt, 0, 0);
 
-	validate_mm_mt(mm);
 	if (!(vma->vm_flags & VM_GROWSUP))
 		return -EFAULT;
 
@@ -2431,15 +2171,13 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 			error = acct_stack_growth(vma, size, grow);
 			if (!error) {
 				/*
-				 * vma_gap_update() doesn't support concurrent
-				 * updates, but we only hold a shared mmap_lock
-				 * lock here, so we need to protect against
-				 * concurrent vma expansions.
-				 * anon_vma_lock_write() doesn't help here, as
-				 * we don't guarantee that all growable vmas
-				 * in a mm share the same root anon vma.
-				 * So, we reuse mm->page_table_lock to guard
-				 * against concurrent vma expansions.
+				 * We only hold a shared mmap_lock lock here, so
+				 * we need to protect against concurrent vma
+				 * expansions.  anon_vma_lock_write() doesn't
+				 * help here, as we don't guarantee that all
+				 * growable vmas in a mm share the same root
+				 * anon vma.  So, we reuse mm->page_table_lock
+				 * to guard against concurrent vma expansions.
 				 */
 				spin_lock(&mm->page_table_lock);
 				if (vma->vm_flags & VM_LOCKED)
@@ -2450,9 +2188,7 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 				/* Overwrite old entry in mtree. */
 				vma_mas_store(vma, &mas);
 				anon_vma_interval_tree_post_update_vma(vma);
-				if (vma->vm_next)
-					vma_gap_update(vma->vm_next);
-				else
+				if (!vma->vm_next)
 					mm->highest_vm_end = vm_end_gap(vma);
 				spin_unlock(&mm->page_table_lock);
 
@@ -2462,8 +2198,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 	}
 	anon_vma_unlock_write(vma->anon_vma);
 	khugepaged_enter_vma_merge(vma, vma->vm_flags);
-	validate_mm(mm);
-	validate_mm_mt(mm);
 	return error;
 }
 #endif /* CONFIG_STACK_GROWSUP || CONFIG_IA64 */
@@ -2471,15 +2205,13 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 /*
  * vma is the first one with address < vma->vm_start.  Have to extend vma.
  */
-int expand_downwards(struct vm_area_struct *vma,
-				   unsigned long address)
+int expand_downwards(struct vm_area_struct *vma, unsigned long address)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct vm_area_struct *prev;
 	int error = 0;
 	MA_STATE(mas, &mm->mm_mt, 0, 0);
 
-	validate_mm(mm);
 	address &= PAGE_MASK;
 	if (address < mmap_min_addr)
 		return -EPERM;
@@ -2521,15 +2253,13 @@ int expand_downwards(struct vm_area_struct *vma,
 			error = acct_stack_growth(vma, size, grow);
 			if (!error) {
 				/*
-				 * vma_gap_update() doesn't support concurrent
-				 * updates, but we only hold a shared mmap_lock
-				 * lock here, so we need to protect against
-				 * concurrent vma expansions.
-				 * anon_vma_lock_write() doesn't help here, as
-				 * we don't guarantee that all growable vmas
-				 * in a mm share the same root anon vma.
-				 * So, we reuse mm->page_table_lock to guard
-				 * against concurrent vma expansions.
+				 * We only hold a shared mmap_lock lock here, so
+				 * we need to protect against concurrent vma
+				 * expansions.  anon_vma_lock_write() doesn't
+				 * help here, as we don't guarantee that all
+				 * growable vmas in a mm share the same root
+				 * anon vma.  So, we reuse mm->page_table_lock
+				 * to guard against concurrent vma expansions.
 				 */
 				spin_lock(&mm->page_table_lock);
 				if (vma->vm_flags & VM_LOCKED)
@@ -2541,7 +2271,6 @@ int expand_downwards(struct vm_area_struct *vma,
 				/* Overwrite old entry in mtree. */
 				vma_mas_store(vma, &mas);
 				anon_vma_interval_tree_post_update_vma(vma);
-				vma_gap_update(vma);
 				spin_unlock(&mm->page_table_lock);
 
 				perf_event_mmap(vma);
@@ -2550,7 +2279,6 @@ int expand_downwards(struct vm_area_struct *vma,
 	}
 	anon_vma_unlock_write(vma->anon_vma);
 	khugepaged_enter_vma_merge(vma, vma->vm_flags);
-	validate_mm(mm);
 	return error;
 }
 
@@ -2682,10 +2410,8 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct ma_state *mas,
 
 	insertion_point = (prev ? &prev->vm_next : &mm->mmap);
 	vma->vm_prev = NULL;
-	mas_set_range(mas, vma->vm_start, end - 1);
-	mas_store_prealloc(mas, NULL);
+	vma_mas_szero(mas, vma->vm_start, end);
 	do {
-		vma_rb_erase(vma, &mm->mm_rb);
 		if (vma->vm_flags & VM_LOCKED)
 			mm->locked_vm -= vma_pages(vma);
 		mm->map_count--;
@@ -2693,10 +2419,9 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct ma_state *mas,
 		vma = vma->vm_next;
 	} while (vma && vma->vm_start < end);
 	*insertion_point = vma;
-	if (vma) {
+	if (vma)
 		vma->vm_prev = prev;
-		vma_gap_update(vma);
-	} else
+	else
 		mm->highest_vm_end = prev ? vm_end_gap(prev) : 0;
 	tail_vma->vm_next = NULL;
 
@@ -2815,11 +2540,7 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 	if (len == 0)
 		return -EINVAL;
 
-	/*
-	 * arch_unmap() might do unmaps itself.  It must be called
-	 * and finish any rbtree manipulation before this code
-	 * runs and also starts to manipulate the rbtree.
-	 */
+	 /* arch_unmap() might do unmaps itself.  */
 	arch_unmap(mm, start, end);
 
 	/* Find the first overlapping VMA where start < vma->vm_end */
@@ -2830,6 +2551,11 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 	if (mas_preallocate(&mas, vma, GFP_KERNEL))
 		return -ENOMEM;
 	prev = vma->vm_prev;
+	/* we have start < vma->vm_end  */
+
+	/* if it doesn't overlap, we have nothing.. */
+	if (vma->vm_start >= end)
+		return 0;
 
 	/*
 	 * If we need to split any vma, do it now to save pain later.
@@ -2890,6 +2616,8 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 	/* Fix up all other VM information */
 	remove_vma_list(mm, vma);
 
+
+	validate_mm(mm);
 	return downgrade ? 1 : 0;
 
 map_count_exceeded:
@@ -3028,11 +2756,11 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
  *  anonymous maps.  eventually we may be able to do some
  *  brk-specific accounting here.
  */
-static int do_brk_flags(unsigned long addr, unsigned long len, unsigned long flags, struct list_head *uf)
+static int do_brk_flags(unsigned long addr, unsigned long len,
+			unsigned long flags, struct list_head *uf)
 {
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma, *prev;
-	struct rb_node **rb_link, *rb_parent;
 	pgoff_t pgoff = addr >> PAGE_SHIFT;
 	int error;
 	unsigned long mapped_addr;
@@ -3051,8 +2779,8 @@ static int do_brk_flags(unsigned long addr, unsigned long len, unsigned long fla
 	if (error)
 		return error;
 
-	/* Clear old maps, set up prev, rb_link, rb_parent, and uf */
-	if (munmap_vma_range(mm, addr, len, &prev, &rb_link, &rb_parent, uf))
+	/* Clear old maps, set up prev and uf */
+	if (munmap_vma_range(mm, addr, len, &prev, uf))
 		return -ENOMEM;
 
 	/* Check against address space limits *after* clearing old maps... */
@@ -3086,7 +2814,7 @@ static int do_brk_flags(unsigned long addr, unsigned long len, unsigned long fla
 	vma->vm_pgoff = pgoff;
 	vma->vm_flags = flags;
 	vma->vm_page_prot = vm_get_page_prot(flags);
-	if(vma_link(mm, vma, prev, rb_link, rb_parent))
+	if(vma_link(mm, vma, prev))
 		goto no_vma_link;
 
 out:
@@ -3203,26 +2931,10 @@ void exit_mmap(struct mm_struct *mm)
 int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
 {
 	struct vm_area_struct *prev;
-	struct rb_node **rb_link, *rb_parent;
-	unsigned long start = vma->vm_start;
-	struct vm_area_struct *overlap = NULL;
 
-	if (find_vma_links(mm, vma->vm_start, vma->vm_end,
-			   &prev, &rb_link, &rb_parent))
+	if (range_has_overlap(mm, vma->vm_start, vma->vm_end, &prev))
 		return -ENOMEM;
 
-	overlap = mt_find(&mm->mm_mt, &start, vma->vm_end - 1);
-	if (overlap) {
-
-		pr_err("Found vma ending at %lu\n", start - 1);
-		pr_err("vma : %lu => %lu-%lu\n", (unsigned long)overlap,
-				overlap->vm_start, overlap->vm_end - 1);
-#if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
-		mt_dump(&mm->mm_mt);
-#endif
-		BUG();
-	}
-
 	if ((vma->vm_flags & VM_ACCOUNT) &&
 	     security_vm_enough_memory_mm(mm, vma_pages(vma)))
 		return -ENOMEM;
@@ -3244,7 +2956,7 @@ int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
 		vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT;
 	}
 
-	if (vma_link(mm, vma, prev, rb_link, rb_parent))
+	if (vma_link(mm, vma, prev))
 		return -ENOMEM;
 
 	return 0;
@@ -3262,9 +2974,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 	unsigned long vma_start = vma->vm_start;
 	struct mm_struct *mm = vma->vm_mm;
 	struct vm_area_struct *new_vma, *prev;
-	struct rb_node **rb_link, *rb_parent;
 	bool faulted_in_anon_vma = true;
-	unsigned long index = addr;
 
 	validate_mm_mt(mm);
 	/*
@@ -3276,10 +2986,9 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 		faulted_in_anon_vma = false;
 	}
 
-	if (find_vma_links(mm, addr, addr + len, &prev, &rb_link, &rb_parent))
+	if (range_has_overlap(mm, addr, addr + len, &prev))
 		return NULL;	/* should never get here */
-	if (mt_find(&mm->mm_mt, &index, addr+len - 1))
-		BUG();
+
 	new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
 			    vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
 			    vma->vm_userfaultfd_ctx, anon_vma_name(vma));
@@ -3320,12 +3029,16 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 			get_file(new_vma->vm_file);
 		if (new_vma->vm_ops && new_vma->vm_ops->open)
 			new_vma->vm_ops->open(new_vma);
-		vma_link(mm, new_vma, prev, rb_link, rb_parent);
+		if (vma_link(mm, new_vma, prev))
+			goto out_vma_link;
 		*need_rmap_locks = false;
 	}
 	validate_mm_mt(mm);
 	return new_vma;
 
+out_vma_link:
+	if (new_vma->vm_ops && new_vma->vm_ops->close)
+		new_vma->vm_ops->close(new_vma);
 out_free_mempol:
 	mpol_put(vma_policy(new_vma));
 out_free_vma:
diff --git a/mm/nommu.c b/mm/nommu.c
index 9d7afc2d959e..81408d20024f 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -553,9 +553,9 @@ static void put_nommu_region(struct vm_region *region)
  */
 static void add_vma_to_mm(struct mm_struct *mm, struct vm_area_struct *vma)
 {
-	struct vm_area_struct *pvma, *prev;
 	struct address_space *mapping;
-	struct rb_node **p, *parent, *rb_prev;
+	struct vm_area_struct *prev;
+	MA_STATE(mas, &mm->mm_mt, vma->vm_start, vma->vm_end);
 
 	BUG_ON(!vma->vm_region);
 
@@ -573,42 +573,10 @@ static void add_vma_to_mm(struct mm_struct *mm, struct vm_area_struct *vma)
 		i_mmap_unlock_write(mapping);
 	}
 
+	prev = mas_prev(&mas, 0);
+	mas_reset(&mas);
 	/* add the VMA to the tree */
-	parent = rb_prev = NULL;
-	p = &mm->mm_rb.rb_node;
-	while (*p) {
-		parent = *p;
-		pvma = rb_entry(parent, struct vm_area_struct, vm_rb);
-
-		/* sort by: start addr, end addr, VMA struct addr in that order
-		 * (the latter is necessary as we may get identical VMAs) */
-		if (vma->vm_start < pvma->vm_start)
-			p = &(*p)->rb_left;
-		else if (vma->vm_start > pvma->vm_start) {
-			rb_prev = parent;
-			p = &(*p)->rb_right;
-		} else if (vma->vm_end < pvma->vm_end)
-			p = &(*p)->rb_left;
-		else if (vma->vm_end > pvma->vm_end) {
-			rb_prev = parent;
-			p = &(*p)->rb_right;
-		} else if (vma < pvma)
-			p = &(*p)->rb_left;
-		else if (vma > pvma) {
-			rb_prev = parent;
-			p = &(*p)->rb_right;
-		} else
-			BUG();
-	}
-
-	rb_link_node(&vma->vm_rb, parent, p);
-	rb_insert_color(&vma->vm_rb, &mm->mm_rb);
-
-	/* add VMA to the VMA list also */
-	prev = NULL;
-	if (rb_prev)
-		prev = rb_entry(rb_prev, struct vm_area_struct, vm_rb);
-
+	vma_mas_store(vma, &mas);
 	__vma_link_list(mm, vma, prev);
 }
 
@@ -621,6 +589,7 @@ static void delete_vma_from_mm(struct vm_area_struct *vma)
 	struct address_space *mapping;
 	struct mm_struct *mm = vma->vm_mm;
 	struct task_struct *curr = current;
+	MA_STATE(mas, &vma->vm_mm->mm_mt, 0, 0);
 
 	mm->map_count--;
 	for (i = 0; i < VMACACHE_SIZE; i++) {
@@ -643,8 +612,7 @@ static void delete_vma_from_mm(struct vm_area_struct *vma)
 	}
 
 	/* remove from the MM's tree and list */
-	rb_erase(&vma->vm_rb, &mm->mm_rb);
-
+	vma_mas_remove(vma, &mas);
 	__vma_unlink_list(mm, vma);
 }
 
@@ -668,24 +636,19 @@ static void delete_vma(struct mm_struct *mm, struct vm_area_struct *vma)
 struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
 {
 	struct vm_area_struct *vma;
+	MA_STATE(mas, &mm->mm_mt, addr, addr);
 
 	/* check the cache first */
 	vma = vmacache_find(mm, addr);
 	if (likely(vma))
 		return vma;
 
-	/* trawl the list (there may be multiple mappings in which addr
-	 * resides) */
-	for (vma = mm->mmap; vma; vma = vma->vm_next) {
-		if (vma->vm_start > addr)
-			return NULL;
-		if (vma->vm_end > addr) {
-			vmacache_update(addr, vma);
-			return vma;
-		}
-	}
+	vma = mas_walk(&mas);
 
-	return NULL;
+	if (vma)
+		vmacache_update(addr, vma);
+
+	return vma;
 }
 EXPORT_SYMBOL(find_vma);
 
@@ -717,26 +680,23 @@ static struct vm_area_struct *find_vma_exact(struct mm_struct *mm,
 {
 	struct vm_area_struct *vma;
 	unsigned long end = addr + len;
+	MA_STATE(mas, &mm->mm_mt, addr, addr);
 
 	/* check the cache first */
 	vma = vmacache_find_exact(mm, addr, end);
 	if (vma)
 		return vma;
 
-	/* trawl the list (there may be multiple mappings in which addr
-	 * resides) */
-	for (vma = mm->mmap; vma; vma = vma->vm_next) {
-		if (vma->vm_start < addr)
-			continue;
-		if (vma->vm_start > addr)
-			return NULL;
-		if (vma->vm_end == end) {
-			vmacache_update(addr, vma);
-			return vma;
-		}
-	}
+	vma = mas_walk(&mas);
+	if (!vma)
+		return NULL;
+	if (vma->vm_start != addr)
+		return NULL;
+	if (vma->vm_end != end)
+		return NULL;
 
-	return NULL;
+	vmacache_update(addr, vma);
+	return vma;
 }
 
 /*
@@ -1533,6 +1493,7 @@ void exit_mmap(struct mm_struct *mm)
 		delete_vma(mm, vma);
 		cond_resched();
 	}
+	__mt_destroy(&mm->mm_mt);
 }
 
 int vm_brk(unsigned long addr, unsigned long len)
diff --git a/mm/util.c b/mm/util.c
index 3492a9e81aa3..3e97807c353b 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -287,6 +287,8 @@ void __vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma,
 	vma->vm_next = next;
 	if (next)
 		next->vm_prev = vma;
+	else
+		mm->highest_vm_end = vm_end_gap(vma);
 }
 
 void __vma_unlink_list(struct mm_struct *mm, struct vm_area_struct *vma)
@@ -299,8 +301,14 @@ void __vma_unlink_list(struct mm_struct *mm, struct vm_area_struct *vma)
 		prev->vm_next = next;
 	else
 		mm->mmap = next;
-	if (next)
+	if (next) {
 		next->vm_prev = prev;
+	} else {
+		if (prev)
+			mm->highest_vm_end = vm_end_gap(prev);
+		else
+			mm->highest_vm_end = 0;
+	}
 }
 
 /* Check if the vma is being used as a stack by this task */
-- 
2.35.1

* [PATCH v9 18/69] mmap: change zeroing of maple tree in __vma_adjust()
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (2 preceding siblings ...)
  2022-05-04  1:13   ` [PATCH v9 20/69] mm: optimize find_exact_vma() to use vma_lookup() Liam Howlett
@ 2022-05-04  1:13   ` Liam Howlett
  2022-05-04  1:13   ` [PATCH v9 23/69] mm: use maple tree operations for find_vma_intersection() Liam Howlett
                     ` (48 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:13 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Only write to the maple tree if we are not inserting or if the insert
isn't going to overwrite the area to clear.  This avoids spanning writes
and node coalescing when unnecessary.

The change requires a custom search for the linked list addition to find
the correct VMA for the prev link.
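
Illustrative only (not part of the patch): for the start of the VMA, the
rule becomes roughly

	if (start != vma->vm_start) {
		/* Zero the old front only if no insert will overwrite it. */
		if (vma->vm_start < start &&
		    (!insert || insert->vm_end != start))
			vma_mas_szero(&mas, vma->vm_start, start);
		else
			vma_changed = true;	/* one vma_mas_store() later */
		vma->vm_start = start;
	}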

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
---
 mm/mmap.c | 30 ++++++++++++++++++++++--------
 1 file changed, 22 insertions(+), 8 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 44f9f4b5411e..6f1d72172ef6 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -613,11 +613,11 @@ static int vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
  * mm's list and the mm tree.  It has already been inserted into the interval tree.
  */
 static void __insert_vm_struct(struct mm_struct *mm, struct ma_state *mas,
-			       struct vm_area_struct *vma)
+		struct vm_area_struct *vma, unsigned long location)
 {
 	struct vm_area_struct *prev;
 
-	mas_set(mas, vma->vm_start);
+	mas_set(mas, location);
 	prev = mas_prev(mas, 0);
 	vma_mas_store(vma, mas);
 	__vma_link_list(mm, vma, prev);
@@ -647,6 +647,7 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	int remove_next = 0;
 	MA_STATE(mas, &mm->mm_mt, 0, 0);
 	struct vm_area_struct *exporter = NULL, *importer = NULL;
+	unsigned long ll_prev = vma->vm_start; /* linked list prev. */
 
 	if (next && !insert) {
 		if (end >= next->vm_end) {
@@ -778,15 +779,27 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	}
 
 	if (start != vma->vm_start) {
-		if (vma->vm_start < start)
+		if ((vma->vm_start < start) &&
+		    (!insert || (insert->vm_end != start))) {
 			vma_mas_szero(&mas, vma->vm_start, start);
-		vma_changed = true;
+			VM_WARN_ON(insert && insert->vm_start > vma->vm_start);
+		} else {
+			vma_changed = true;
+		}
 		vma->vm_start = start;
 	}
 	if (end != vma->vm_end) {
-		if (vma->vm_end > end)
-			vma_mas_szero(&mas, end, vma->vm_end);
-		vma_changed = true;
+		if (vma->vm_end > end) {
+			if (!insert || (insert->vm_start != end)) {
+				vma_mas_szero(&mas, end, vma->vm_end);
+				VM_WARN_ON(insert &&
+					   insert->vm_end < vma->vm_end);
+			} else if (insert->vm_start == end) {
+				ll_prev = vma->vm_end;
+			}
+		} else {
+			vma_changed = true;
+		}
 		vma->vm_end = end;
 		if (!next)
 			mm->highest_vm_end = vm_end_gap(vma);
@@ -821,7 +834,7 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 		 * us to insert it before dropping the locks
 		 * (it may either follow vma or precede it).
 		 */
-		__insert_vm_struct(mm, &mas, insert);
+		__insert_vm_struct(mm, &mas, insert, ll_prev);
 	}
 
 	if (anon_vma) {
@@ -908,6 +921,7 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	if (insert && file)
 		uprobe_mmap(insert);
 
+	mas_destroy(&mas);
 	validate_mm(mm);
 	return 0;
 }
-- 
2.35.1

* [PATCH v9 19/69] xen: use vma_lookup() in privcmd_ioctl_mmap()
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
  2022-05-04  1:13   ` [PATCH v9 17/69] mm: remove rb tree Liam Howlett
@ 2022-05-04  1:13   ` Liam Howlett
  2022-05-04  1:13   ` [PATCH v9 20/69] mm: optimize find_exact_vma() to use vma_lookup() Liam Howlett
                     ` (50 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:13 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

vma_lookup() only walks the VMA tree to a specific value, whereas find_vma()
keeps searching the tree beyond that value.  It is more efficient to only
walk to the requested value since privcmd_ioctl_mmap() will exit the loop if
vm_start != msg->va.
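
For illustration, a sketch (not taken from the patch) of the semantic
difference described above:

	struct vm_area_struct *vma;

	vma = find_vma(mm, addr);	/* first VMA with vm_end > addr; may start above addr */
	vma = vma_lookup(mm, addr);	/* only the VMA containing addr, or NULL */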

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 drivers/xen/privcmd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
index 3369734108af..ad17166b0ef6 100644
--- a/drivers/xen/privcmd.c
+++ b/drivers/xen/privcmd.c
@@ -282,7 +282,7 @@ static long privcmd_ioctl_mmap(struct file *file, void __user *udata)
 						     struct page, lru);
 		struct privcmd_mmap_entry *msg = page_address(page);
 
-		vma = find_vma(mm, msg->va);
+		vma = vma_lookup(mm, msg->va);
 		rc = -EINVAL;
 
 		if (!vma || (msg->va != vma->vm_start) || vma->vm_private_data)
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 20/69] mm: optimize find_exact_vma() to use vma_lookup()
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
  2022-05-04  1:13   ` [PATCH v9 17/69] mm: remove rb tree Liam Howlett
  2022-05-04  1:13   ` [PATCH v9 19/69] xen: use vma_lookup() in privcmd_ioctl_mmap() Liam Howlett
@ 2022-05-04  1:13   ` Liam Howlett
  2022-05-04  1:13   ` [PATCH v9 18/69] mmap: change zeroing of maple tree in __vma_adjust() Liam Howlett
                     ` (49 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:13 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Use vma_lookup() to walk the tree to the start value requested.  If the
vma at the start does not match, then the answer is NULL and there is no
need to look at the next vma the way that find_vma() would.
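
A hypothetical caller relying on the exact-match semantics (sketch only; the
lock calls and pr_debug() are just for the example and are not part of this
patch):

	struct vm_area_struct *vma;

	mmap_read_lock(mm);
	vma = find_exact_vma(mm, vm_start, vm_end);
	if (vma)	/* vma->vm_start == vm_start && vma->vm_end == vm_end */
		pr_debug("exact mapping found\n");
	mmap_read_unlock(mm);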

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/mm.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d11673080c33..0fdb19d1b48b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2852,7 +2852,7 @@ static inline unsigned long vma_pages(struct vm_area_struct *vma)
 static inline struct vm_area_struct *find_exact_vma(struct mm_struct *mm,
 				unsigned long vm_start, unsigned long vm_end)
 {
-	struct vm_area_struct *vma = find_vma(mm, vm_start);
+	struct vm_area_struct *vma = vma_lookup(mm, vm_start);
 
 	if (vma && (vma->vm_start != vm_start || vma->vm_end != vm_end))
 		vma = NULL;
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 21/69] mm/khugepaged: optimize collapse_pte_mapped_thp() by using vma_lookup()
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (4 preceding siblings ...)
  2022-05-04  1:13   ` [PATCH v9 23/69] mm: use maple tree operations for find_vma_intersection() Liam Howlett
@ 2022-05-04  1:13   ` Liam Howlett
  2022-05-04  1:13   ` [PATCH v9 22/69] mm/mmap: change do_brk_flags() to expand existing VMA and add do_brk_munmap() Liam Howlett
                     ` (46 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:13 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

vma_lookup() will walk the vma tree once and not continue to look for the
next vma.  Since the exact vma is checked below, this is a more efficient
way of searching.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/khugepaged.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index ac53ad2c9bb1..03fda93ade3e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1435,7 +1435,7 @@ static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *v
 void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
 {
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
-	struct vm_area_struct *vma = find_vma(mm, haddr);
+	struct vm_area_struct *vma = vma_lookup(mm, haddr);
 	struct page *hpage;
 	pte_t *start_pte, *pte;
 	pmd_t *pmd;
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 23/69] mm: use maple tree operations for find_vma_intersection()
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (3 preceding siblings ...)
  2022-05-04  1:13   ` [PATCH v9 18/69] mmap: change zeroing of maple tree in __vma_adjust() Liam Howlett
@ 2022-05-04  1:13   ` Liam Howlett
  2022-05-04  1:13   ` [PATCH v9 21/69] mm/khugepaged: optimize collapse_pte_mapped_thp() by using vma_lookup() Liam Howlett
                     ` (47 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:13 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Move find_vma_intersection() to mmap.c and change implementation to maple
tree.

When searching for a vma within a range, it is easier to use the maple
tree interface.

Export find_vma_intersection() for the kvm module.
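
Note that the VMA interval end is exclusive while mt_find() takes an
inclusive maximum, hence the end_addr - 1 in the hunks below.  A hedged
usage sketch (the -EEXIST return is illustrative, not from this series):

	mmap_assert_locked(mm);
	if (find_vma_intersection(mm, addr, addr + len))
		return -EEXIST;	/* something already maps part of [addr, addr + len) */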

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 include/linux/mm.h | 22 ++++------------------
 mm/mmap.c          | 29 +++++++++++++++++++++++++++++
 mm/nommu.c         | 11 +++++++++++
 3 files changed, 44 insertions(+), 18 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0fdb19d1b48b..6db9face6f84 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2780,26 +2780,12 @@ extern struct vm_area_struct * find_vma(struct mm_struct * mm, unsigned long add
 extern struct vm_area_struct * find_vma_prev(struct mm_struct * mm, unsigned long addr,
 					     struct vm_area_struct **pprev);
 
-/**
- * find_vma_intersection() - Look up the first VMA which intersects the interval
- * @mm: The process address space.
- * @start_addr: The inclusive start user address.
- * @end_addr: The exclusive end user address.
- *
- * Returns: The first VMA within the provided range, %NULL otherwise.  Assumes
- * start_addr < end_addr.
+/*
+ * Look up the first VMA which intersects the interval [start_addr, end_addr)
+ * NULL if none.  Assume start_addr < end_addr.
  */
-static inline
 struct vm_area_struct *find_vma_intersection(struct mm_struct *mm,
-					     unsigned long start_addr,
-					     unsigned long end_addr)
-{
-	struct vm_area_struct *vma = find_vma(mm, start_addr);
-
-	if (vma && end_addr <= vma->vm_start)
-		vma = NULL;
-	return vma;
-}
+			unsigned long start_addr, unsigned long end_addr);
 
 /**
  * vma_lookup() - Find a VMA at a specific address
diff --git a/mm/mmap.c b/mm/mmap.c
index ec4ce76f02dc..5f948f353376 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2069,6 +2069,35 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
 
 EXPORT_SYMBOL(get_unmapped_area);
 
+/**
+ * find_vma_intersection() - Look up the first VMA which intersects the interval
+ * @mm: The process address space.
+ * @start_addr: The inclusive start user address.
+ * @end_addr: The exclusive end user address.
+ *
+ * Returns: The first VMA within the provided range, %NULL otherwise.  Assumes
+ * start_addr < end_addr.
+ */
+struct vm_area_struct *find_vma_intersection(struct mm_struct *mm,
+					     unsigned long start_addr,
+					     unsigned long end_addr)
+{
+	struct vm_area_struct *vma;
+	unsigned long index = start_addr;
+
+	mmap_assert_locked(mm);
+	/* Check the cache first. */
+	vma = vmacache_find(mm, start_addr);
+	if (likely(vma))
+		return vma;
+
+	vma = mt_find(&mm->mm_mt, &index, end_addr - 1);
+	if (vma)
+		vmacache_update(start_addr, vma);
+	return vma;
+}
+EXPORT_SYMBOL(find_vma_intersection);
+
 /**
  * find_vma() - Find the VMA for a given address, or the next vma.
  * @mm: The mm_struct to check
diff --git a/mm/nommu.c b/mm/nommu.c
index 81408d20024f..2870edfad8ed 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -629,6 +629,17 @@ static void delete_vma(struct mm_struct *mm, struct vm_area_struct *vma)
 	vm_area_free(vma);
 }
 
+struct vm_area_struct *find_vma_intersection(struct mm_struct *mm,
+					     unsigned long start_addr,
+					     unsigned long end_addr)
+{
+	unsigned long index = start_addr;
+
+	mmap_assert_locked(mm);
+	return mt_find(&mm->mm_mt, &index, end_addr - 1);
+}
+EXPORT_SYMBOL(find_vma_intersection);
+
 /*
  * look up the first VMA in which addr resides, NULL if none
  * - should be called with mm->mmap_lock at least held readlocked
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 22/69] mm/mmap: change do_brk_flags() to expand existing VMA and add do_brk_munmap()
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (5 preceding siblings ...)
  2022-05-04  1:13   ` [PATCH v9 21/69] mm/khugepaged: optimize collapse_pte_mapped_thp() by using vma_lookup() Liam Howlett
@ 2022-05-04  1:13   ` Liam Howlett
  2022-05-04  1:13   ` [PATCH v9 24/69] mm/mmap: use advanced maple tree API for mmap_region() Liam Howlett
                     ` (45 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:13 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Avoid allocating a new VMA when a VMA modification can occur.  When a
brk() can expand or contract a VMA, then the single store operation will
only modify one index of the maple tree instead of causing a node to split
or coalesce.  This avoids unnecessary allocations/frees of maple tree
nodes and VMAs.

Move some limit & flag verifications out of the do_brk_flags() function so
that only the relevant checks are done in the code paths of brk() and
vm_brk_flags().

In vm_brk_flags(), set up the vma that do_brk_flags() may expand when the
extra criteria are met.

Drop userfaultfd from the do_brk_flags() path and only use it in the
vm_brk_flags() path, since that is the only place a munmap will happen.
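
A minimal sketch of the single-range store on the grow path, assuming an
existing anonymous brk VMA whose flags match (names as in the diff below;
the anon_vma locking, VM_SOFTDIRTY and huge page handling are omitted):

	mas->index = vma->vm_start;
	mas->last = addr + len - 1;
	vma->vm_end = addr + len;
	if (mas_store_gfp(mas, vma, GFP_KERNEL))
		return -ENOMEM;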

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 mm/mmap.c | 286 +++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 228 insertions(+), 58 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 6f1d72172ef6..ec4ce76f02dc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -194,17 +194,40 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
 	return next;
 }
 
-static int do_brk_flags(unsigned long addr, unsigned long request, unsigned long flags,
-		struct list_head *uf);
+/*
+ * check_brk_limits() - Use the platform-specific range check and verify the
+ * mlock limits.
+ * @addr: The address to check.
+ * @len: The size of the increase.
+ *
+ * Return: 0 on success.
+ */
+static int check_brk_limits(unsigned long addr, unsigned long len)
+{
+	unsigned long mapped_addr;
+
+	mapped_addr = get_unmapped_area(NULL, addr, len, 0, MAP_FIXED);
+	if (IS_ERR_VALUE(mapped_addr))
+		return mapped_addr;
+
+	return mlock_future_check(current->mm, current->mm->def_flags, len);
+}
+static int do_brk_munmap(struct ma_state *mas, struct vm_area_struct *vma,
+			 unsigned long newbrk, unsigned long oldbrk,
+			 struct list_head *uf);
+static int do_brk_flags(struct ma_state *mas, struct vm_area_struct *brkvma,
+			unsigned long addr, unsigned long request,
+			unsigned long flags);
 SYSCALL_DEFINE1(brk, unsigned long, brk)
 {
 	unsigned long newbrk, oldbrk, origbrk;
 	struct mm_struct *mm = current->mm;
-	struct vm_area_struct *next;
+	struct vm_area_struct *brkvma, *next = NULL;
 	unsigned long min_brk;
 	bool populate;
 	bool downgraded = false;
 	LIST_HEAD(uf);
+	MA_STATE(mas, &mm->mm_mt, 0, 0);
 
 	if (mmap_write_lock_killable(mm))
 		return -EINTR;
@@ -246,35 +269,52 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
 
 	/*
 	 * Always allow shrinking brk.
-	 * __do_munmap() may downgrade mmap_lock to read.
+	 * do_brk_munmap() may downgrade mmap_lock to read.
 	 */
 	if (brk <= mm->brk) {
 		int ret;
 
+		/* Search one past newbrk */
+		mas_set(&mas, newbrk);
+		brkvma = mas_find(&mas, oldbrk);
+		BUG_ON(brkvma == NULL);
+		if (brkvma->vm_start >= oldbrk)
+			goto out; /* mapping intersects with an existing non-brk vma. */
 		/*
-		 * mm->brk must to be protected by write mmap_lock so update it
-		 * before downgrading mmap_lock. When __do_munmap() fails,
-		 * mm->brk will be restored from origbrk.
+		 * mm->brk must be protected by write mmap_lock.
+		 * do_brk_munmap() may downgrade the lock,  so update it
+		 * before calling do_brk_munmap().
 		 */
 		mm->brk = brk;
-		ret = __do_munmap(mm, newbrk, oldbrk-newbrk, &uf, true);
-		if (ret < 0) {
-			mm->brk = origbrk;
-			goto out;
-		} else if (ret == 1) {
+		mas.last = oldbrk - 1;
+		ret = do_brk_munmap(&mas, brkvma, newbrk, oldbrk, &uf);
+		if (ret == 1)  {
 			downgraded = true;
-		}
-		goto success;
+			goto success;
+		} else if (!ret)
+			goto success;
+
+		mm->brk = origbrk;
+		goto out;
 	}
 
-	/* Check against existing mmap mappings. */
-	next = find_vma(mm, oldbrk);
+	if (check_brk_limits(oldbrk, newbrk - oldbrk))
+		goto out;
+
+	/*
+	 * Only check if the next VMA is within the stack_guard_gap of the
+	 * expansion area
+	 */
+	mas_set(&mas, oldbrk);
+	next = mas_find(&mas, newbrk - 1 + PAGE_SIZE + stack_guard_gap);
 	if (next && newbrk + PAGE_SIZE > vm_start_gap(next))
 		goto out;
 
+	brkvma = mas_prev(&mas, mm->start_brk);
 	/* Ok, looks good - let it rip. */
-	if (do_brk_flags(oldbrk, newbrk-oldbrk, 0, &uf) < 0)
+	if (do_brk_flags(&mas, brkvma, oldbrk, newbrk - oldbrk, 0) < 0)
 		goto out;
+
 	mm->brk = brk;
 
 success:
@@ -2766,38 +2806,113 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
 }
 
 /*
- *  this is really a simplified "do_mmap".  it only handles
- *  anonymous maps.  eventually we may be able to do some
- *  brk-specific accounting here.
+ * do_brk_munmap() - Unmap a partial vma.
+ * @mas: The maple tree state.
+ * @vma: The vma to be modified
+ * @newbrk: the start of the address to unmap
+ * @oldbrk: The end of the address to unmap
+ * @uf: The userfaultfd list_head
+ *
+ * Unmaps a partial VMA mapping.  Does not handle alignment; downgrades the
+ * lock if possible.
+ * Returns: 1 on success.
  */
-static int do_brk_flags(unsigned long addr, unsigned long len,
-			unsigned long flags, struct list_head *uf)
+static int do_brk_munmap(struct ma_state *mas, struct vm_area_struct *vma,
+			 unsigned long newbrk, unsigned long oldbrk,
+			 struct list_head *uf)
 {
-	struct mm_struct *mm = current->mm;
-	struct vm_area_struct *vma, *prev;
-	pgoff_t pgoff = addr >> PAGE_SHIFT;
-	int error;
-	unsigned long mapped_addr;
-	validate_mm_mt(mm);
+	struct mm_struct *mm = vma->vm_mm;
+	struct vm_area_struct unmap;
+	unsigned long unmap_pages;
+	int ret = 1;
+
+	arch_unmap(mm, newbrk, oldbrk);
+
+	if (likely((vma->vm_end < oldbrk) ||
+		   ((vma->vm_start == newbrk) && (vma->vm_end == oldbrk)))) {
+		/* remove entire mapping(s) */
+		mas_set(mas, newbrk);
+		if (vma->vm_start != newbrk)
+			mas_reset(mas); /* cause a re-walk for the first overlap. */
+		ret = __do_munmap(mm, newbrk, oldbrk - newbrk, uf, true);
+		goto munmap_full_vma;
+	}
+
+	vma_init(&unmap, mm);
+	unmap.vm_start = newbrk;
+	unmap.vm_end = oldbrk;
+	ret = userfaultfd_unmap_prep(&unmap, newbrk, oldbrk, uf);
+	if (ret)
+		return ret;
+	ret = 1;
 
-	/* Until we need other flags, refuse anything except VM_EXEC. */
-	if ((flags & (~VM_EXEC)) != 0)
-		return -EINVAL;
-	flags |= VM_DATA_DEFAULT_FLAGS | VM_ACCOUNT | mm->def_flags;
+	/* Change the oldbrk of vma to the newbrk of the munmap area */
+	vma_adjust_trans_huge(vma, vma->vm_start, newbrk, 0);
+	if (mas_preallocate(mas, vma, GFP_KERNEL))
+		return -ENOMEM;
 
-	mapped_addr = get_unmapped_area(NULL, addr, len, 0, MAP_FIXED);
-	if (IS_ERR_VALUE(mapped_addr))
-		return mapped_addr;
+	if (vma->anon_vma) {
+		anon_vma_lock_write(vma->anon_vma);
+		anon_vma_interval_tree_pre_update_vma(vma);
+	}
 
-	error = mlock_future_check(mm, mm->def_flags, len);
-	if (error)
-		return error;
+	vma->vm_end = newbrk;
+	vma_init(&unmap, mm);
+	unmap.vm_start = newbrk;
+	unmap.vm_end = oldbrk;
+	if (vma->anon_vma)
+		vma_set_anonymous(&unmap);
 
-	/* Clear old maps, set up prev and uf */
-	if (munmap_vma_range(mm, addr, len, &prev, uf))
-		return -ENOMEM;
+	vma_mas_remove(&unmap, mas);
+
+	vmacache_invalidate(vma->vm_mm);
+	if (vma->anon_vma) {
+		anon_vma_interval_tree_post_update_vma(vma);
+		anon_vma_unlock_write(vma->anon_vma);
+	}
+
+	unmap_pages = vma_pages(&unmap);
+	if (vma->vm_flags & VM_LOCKED)
+		mm->locked_vm -= unmap_pages;
+
+	mmap_write_downgrade(mm);
+	unmap_region(mm, &unmap, vma, newbrk, oldbrk);
+	/* Statistics */
+	vm_stat_account(mm, vma->vm_flags, -unmap_pages);
+	if (vma->vm_flags & VM_ACCOUNT)
+		vm_unacct_memory(unmap_pages);
+
+munmap_full_vma:
+	validate_mm_mt(mm);
+	return ret;
+}
+
+/*
+ * do_brk_flags() - Increase the brk vma if the flags match.
+ * @mas: The maple tree state.
+ * @addr: The start address
+ * @len: The length of the increase
+ * @vma: The vma to expand, or NULL to create a new one.
+ * @flags: The VMA Flags
+ *
+ * Extend the brk VMA from addr to addr + len.  If the VMA is NULL or the flags
+ * do not match then create a new anonymous VMA.  Eventually we may be able to
+ * do some brk-specific accounting here.
+ */
+static int do_brk_flags(struct ma_state *mas, struct vm_area_struct *vma,
+			unsigned long addr, unsigned long len,
+			unsigned long flags)
+{
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *prev = NULL;
+	validate_mm_mt(mm);
 
-	/* Check against address space limits *after* clearing old maps... */
+
+	/*
+	 * Check against address space limits by the changed size
+	 * Note: This happens *after* clearing old mappings in some code paths.
+	 */
+	flags |= VM_DATA_DEFAULT_FLAGS | VM_ACCOUNT | mm->def_flags;
 	if (!may_expand_vm(mm, flags, len >> PAGE_SHIFT))
 		return -ENOMEM;
 
@@ -2807,30 +2922,56 @@ static int do_brk_flags(unsigned long addr, unsigned long len,
 	if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
 		return -ENOMEM;
 
-	/* Can we just expand an old private anonymous mapping? */
-	vma = vma_merge(mm, prev, addr, addr + len, flags,
-			NULL, NULL, pgoff, NULL, NULL_VM_UFFD_CTX, NULL);
-	if (vma)
-		goto out;
-
 	/*
-	 * create a vma struct for an anonymous mapping
+	 * Expand the existing vma if possible; Note that singular lists do not
+	 * occur after forking, so the expand will only happen on new VMAs.
 	 */
-	vma = vm_area_alloc(mm);
-	if (!vma) {
-		vm_unacct_memory(len >> PAGE_SHIFT);
-		return -ENOMEM;
+	if (vma &&
+	    (!vma->anon_vma || list_is_singular(&vma->anon_vma_chain)) &&
+	    ((vma->vm_flags & ~VM_SOFTDIRTY) == flags)) {
+		mas->index = vma->vm_start;
+		mas->last = addr + len - 1;
+		vma_adjust_trans_huge(vma, addr, addr + len, 0);
+		if (vma->anon_vma) {
+			anon_vma_lock_write(vma->anon_vma);
+			anon_vma_interval_tree_pre_update_vma(vma);
+		}
+		vma->vm_end = addr + len;
+		vma->vm_flags |= VM_SOFTDIRTY;
+		if (mas_store_gfp(mas, vma, GFP_KERNEL))
+			return -ENOMEM;
+
+		if (vma->anon_vma) {
+			anon_vma_interval_tree_post_update_vma(vma);
+			anon_vma_unlock_write(vma->anon_vma);
+		}
+		khugepaged_enter_vma_merge(vma, flags);
+		goto out;
 	}
+	prev = vma;
+
+	/* create a vma struct for an anonymous mapping */
+	vma = vm_area_alloc(mm);
+	if (!vma)
+		goto vma_alloc_fail;
 
 	vma_set_anonymous(vma);
 	vma->vm_start = addr;
 	vma->vm_end = addr + len;
-	vma->vm_pgoff = pgoff;
+	vma->vm_pgoff = addr >> PAGE_SHIFT;
 	vma->vm_flags = flags;
 	vma->vm_page_prot = vm_get_page_prot(flags);
-	if(vma_link(mm, vma, prev))
-		goto no_vma_link;
+	mas_set_range(mas, vma->vm_start, addr + len - 1);
+	if ( mas_store_gfp(mas, vma, GFP_KERNEL))
+		goto mas_store_fail;
 
+	mm->map_count++;
+
+	if (!prev)
+		prev = mas_prev(mas, 0);
+
+	__vma_link_list(mm, vma, prev);
+	mm->map_count++;
 out:
 	perf_event_mmap(vma);
 	mm->total_vm += len >> PAGE_SHIFT;
@@ -2841,18 +2982,22 @@ static int do_brk_flags(unsigned long addr, unsigned long len,
 	validate_mm_mt(mm);
 	return 0;
 
-no_vma_link:
+mas_store_fail:
 	vm_area_free(vma);
+vma_alloc_fail:
+	vm_unacct_memory(len >> PAGE_SHIFT);
 	return -ENOMEM;
 }
 
 int vm_brk_flags(unsigned long addr, unsigned long request, unsigned long flags)
 {
 	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma = NULL;
 	unsigned long len;
 	int ret;
 	bool populate;
 	LIST_HEAD(uf);
+	MA_STATE(mas, &mm->mm_mt, addr, addr);
 
 	len = PAGE_ALIGN(request);
 	if (len < request)
@@ -2863,13 +3008,38 @@ int vm_brk_flags(unsigned long addr, unsigned long request, unsigned long flags)
 	if (mmap_write_lock_killable(mm))
 		return -EINTR;
 
-	ret = do_brk_flags(addr, len, flags, &uf);
+	/* Until we need other flags, refuse anything except VM_EXEC. */
+	if ((flags & (~VM_EXEC)) != 0)
+		return -EINVAL;
+
+	ret = check_brk_limits(addr, len);
+	if (ret)
+		goto limits_failed;
+
+	if (find_vma_intersection(mm, addr, addr + len))
+		ret = do_munmap(mm, addr, len, &uf);
+
+	if (ret)
+		goto munmap_failed;
+
+	vma = mas_prev(&mas, 0);
+	if (!vma || vma->vm_end != addr || vma_policy(vma) ||
+	    !can_vma_merge_after(vma, flags, NULL, NULL,
+				 addr >> PAGE_SHIFT,NULL_VM_UFFD_CTX, NULL))
+		vma = NULL;
+
+	ret = do_brk_flags(&mas, vma, addr, len, flags);
 	populate = ((mm->def_flags & VM_LOCKED) != 0);
 	mmap_write_unlock(mm);
 	userfaultfd_unmap_complete(mm, &uf);
 	if (populate && !ret)
 		mm_populate(addr, len);
 	return ret;
+
+munmap_failed:
+limits_failed:
+	mmap_write_unlock(mm);
+	return ret;
 }
 EXPORT_SYMBOL(vm_brk_flags);
 
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 24/69] mm/mmap: use advanced maple tree API for mmap_region()
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (6 preceding siblings ...)
  2022-05-04  1:13   ` [PATCH v9 22/69] mm/mmap: change do_brk_flags() to expand existing VMA and add do_brk_munmap() Liam Howlett
@ 2022-05-04  1:13   ` Liam Howlett
  2022-05-04  1:13   ` [PATCH v9 25/69] mm: remove vmacache Liam Howlett
                     ` (44 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:13 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Changing mmap_region() to use the maple tree state and the advanced maple
tree interface allows for a lot less tree walking.

This change removes the last caller of munmap_vma_range(), so drop this
unused function.

Add vma_expand() to expand a VMA if possible; it does the necessary
hugepage check, uprobe_munmap of files, dcache flush, the modifications,
and then undoes the detaches, etc.
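
A short sketch of the advanced maple tree API that mmap_region() now builds
on (illustration only; the real function also handles merging, VM_SPECIAL
and the file-backed cases):

	struct vm_area_struct *next, *prev;
	MA_STATE(mas, &mm->mm_mt, addr, end - 1);

	next = mas_next(&mas, ULONG_MAX);	/* VMA after the request, if any */
	prev = mas_prev(&mas, 0);		/* VMA before the request, if any */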

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 mm/mmap.c | 245 +++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 198 insertions(+), 47 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 5f948f353376..baf608975f99 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -516,28 +516,6 @@ static inline struct vm_area_struct *__vma_next(struct mm_struct *mm,
 	return vma->vm_next;
 }
 
-/*
- * munmap_vma_range() - munmap VMAs that overlap a range.
- * @mm: The mm struct
- * @start: The start of the range.
- * @len: The length of the range.
- * @pprev: pointer to the pointer that will be set to previous vm_area_struct
- *
- * Find all the vm_area_struct that overlap from @start to
- * @end and munmap them.  Set @pprev to the previous vm_area_struct.
- *
- * Returns: -ENOMEM on munmap failure or 0 on success.
- */
-static inline int
-munmap_vma_range(struct mm_struct *mm, unsigned long start, unsigned long len,
-		 struct vm_area_struct **pprev, struct list_head *uf)
-{
-	while (range_has_overlap(mm, start, start + len, pprev))
-		if (do_munmap(mm, start, len, uf))
-			return -ENOMEM;
-	return 0;
-}
-
 static unsigned long count_vma_pages_range(struct mm_struct *mm,
 		unsigned long addr, unsigned long end)
 {
@@ -664,6 +642,127 @@ static void __insert_vm_struct(struct mm_struct *mm, struct ma_state *mas,
 	mm->map_count++;
 }
 
+/*
+ * vma_expand - Expand an existing VMA
+ *
+ * @mas: The maple state
+ * @vma: The vma to expand
+ * @start: The start of the vma
+ * @end: The exclusive end of the vma
+ * @pgoff: The page offset of vma
+ * @next: The vma currently after @vma, if any.
+ *
+ * Expand @vma to @start and @end.  Can expand off the start and end.  Will
+ * expand over @next if it's different from @vma and @end == @next->vm_end.
+ * Checking if the @vma can expand and merge with @next needs to be handled by
+ * the caller.
+ *
+ * Returns: 0 on success
+ */
+inline int vma_expand(struct ma_state *mas, struct vm_area_struct *vma,
+		      unsigned long start, unsigned long end, pgoff_t pgoff,
+		      struct vm_area_struct *next)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct address_space *mapping = NULL;
+	struct rb_root_cached *root = NULL;
+	struct anon_vma *anon_vma = vma->anon_vma;
+	struct file *file = vma->vm_file;
+	bool remove_next = false;
+	bool anon_cloned = false;
+
+	if (next && (vma != next) && (end == next->vm_end)) {
+		remove_next = true;
+		if (next->anon_vma && !vma->anon_vma) {
+			int error;
+
+			vma->anon_vma = next->anon_vma;
+			error = anon_vma_clone(vma, next);
+			if (error)
+				return error;
+			anon_cloned = true;
+		}
+	}
+
+	/* Not merging but overwriting any part of next is not handled. */
+	VM_BUG_ON(!remove_next && next != vma && end > next->vm_start);
+	/* Only handles expanding */
+	VM_BUG_ON(vma->vm_start < start || vma->vm_end > end);
+
+	if (mas_preallocate(mas, vma, GFP_KERNEL))
+		goto nomem;
+
+	vma_adjust_trans_huge(vma, start, end, 0);
+
+	if (file) {
+		mapping = file->f_mapping;
+		root = &mapping->i_mmap;
+		uprobe_munmap(vma, vma->vm_start, vma->vm_end);
+		i_mmap_lock_write(mapping);
+		flush_dcache_mmap_lock(mapping);
+		vma_interval_tree_remove(vma, root);
+	} else if (anon_vma) {
+		anon_vma_lock_write(anon_vma);
+		anon_vma_interval_tree_pre_update_vma(vma);
+	}
+
+	vma->vm_start = start;
+	vma->vm_end = end;
+	vma->vm_pgoff = pgoff;
+	/* Note: mas must be pointing to the expanding VMA */
+	vma_mas_store(vma, mas);
+
+	if (file) {
+		vma_interval_tree_insert(vma, root);
+		flush_dcache_mmap_unlock(mapping);
+	}
+
+	/* Expanding over the next vma */
+	if (remove_next) {
+		/* Remove from mm linked list - also updates highest_vm_end */
+		__vma_unlink_list(mm, next);
+
+		/* Kill the cache */
+		vmacache_invalidate(mm);
+
+		if (file)
+			__remove_shared_vm_struct(next, file, mapping);
+
+	} else if (!next) {
+		mm->highest_vm_end = vm_end_gap(vma);
+	}
+
+	if (anon_vma) {
+		anon_vma_interval_tree_post_update_vma(vma);
+		anon_vma_unlock_write(anon_vma);
+	}
+
+	if (file) {
+		i_mmap_unlock_write(mapping);
+		uprobe_mmap(vma);
+	}
+
+	if (remove_next) {
+		if (file) {
+			uprobe_munmap(next, next->vm_start, next->vm_end);
+			fput(file);
+		}
+		if (next->anon_vma)
+			anon_vma_merge(vma, next);
+		mm->map_count--;
+		mpol_put(vma_policy(next));
+		vm_area_free(next);
+	}
+
+	validate_mm(mm);
+	return 0;
+
+nomem:
+	if (anon_cloned)
+		unlink_anon_vmas(vma);
+	return -ENOMEM;
+}
+
 /*
  * We cannot adjust vm_start, vm_end, vm_pgoff fields of a vma that
  * is already present in an i_mmap tree without adjusting the tree.
@@ -1665,9 +1764,15 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		struct list_head *uf)
 {
 	struct mm_struct *mm = current->mm;
-	struct vm_area_struct *vma, *prev, *merge;
-	int error;
+	struct vm_area_struct *vma = NULL;
+	struct vm_area_struct *prev, *next;
+	pgoff_t pglen = len >> PAGE_SHIFT;
 	unsigned long charged = 0;
+	unsigned long end = addr + len;
+	unsigned long merge_start = addr, merge_end = end;
+	pgoff_t vm_pgoff;
+	int error;
+	MA_STATE(mas, &mm->mm_mt, addr, end - 1);
 
 	/* Check against address space limit. */
 	if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) {
@@ -1677,16 +1782,17 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		 * MAP_FIXED may remove pages of mappings that intersects with
 		 * requested mapping. Account for the pages it would unmap.
 		 */
-		nr_pages = count_vma_pages_range(mm, addr, addr + len);
+		nr_pages = count_vma_pages_range(mm, addr, end);
 
 		if (!may_expand_vm(mm, vm_flags,
 					(len >> PAGE_SHIFT) - nr_pages))
 			return -ENOMEM;
 	}
 
-	/* Clear old maps, set up prev and uf */
-	if (munmap_vma_range(mm, addr, len, &prev, uf))
+	/* Unmap any existing mapping in the area */
+	if (do_munmap(mm, addr, len, uf))
 		return -ENOMEM;
+
 	/*
 	 * Private writable mapping: check memory availability
 	 */
@@ -1697,14 +1803,43 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		vm_flags |= VM_ACCOUNT;
 	}
 
-	/*
-	 * Can we just expand an old mapping?
-	 */
-	vma = vma_merge(mm, prev, addr, addr + len, vm_flags,
-			NULL, file, pgoff, NULL, NULL_VM_UFFD_CTX, NULL);
-	if (vma)
-		goto out;
+	next = mas_next(&mas, ULONG_MAX);
+	prev = mas_prev(&mas, 0);
+	if (vm_flags & VM_SPECIAL)
+		goto cannot_expand;
+
+	/* Attempt to expand an old mapping */
+	/* Check next */
+	if (next && next->vm_start == end && !vma_policy(next) &&
+	    can_vma_merge_before(next, vm_flags, NULL, file, pgoff+pglen,
+				 NULL_VM_UFFD_CTX, NULL)) {
+		merge_end = next->vm_end;
+		vma = next;
+		vm_pgoff = next->vm_pgoff - pglen;
+	}
+
+	/* Check prev */
+	if (prev && prev->vm_end == addr && !vma_policy(prev) &&
+	    (vma ? can_vma_merge_after(prev, vm_flags, vma->anon_vma, file,
+				       pgoff, vma->vm_userfaultfd_ctx, NULL) :
+		   can_vma_merge_after(prev, vm_flags, NULL, file, pgoff,
+				       NULL_VM_UFFD_CTX , NULL))) {
+		merge_start = prev->vm_start;
+		vma = prev;
+		vm_pgoff = prev->vm_pgoff;
+	}
+
+
+	/* Actually expand, if possible */
+	if (vma &&
+	    !vma_expand(&mas, vma, merge_start, merge_end, vm_pgoff, next)) {
+		khugepaged_enter_vma_merge(vma, vm_flags);
+		goto expanded;
+	}
 
+	mas.index = addr;
+	mas.last = end - 1;
+cannot_expand:
 	/*
 	 * Determine the object being mapped and call the appropriate
 	 * specific mapper. the address has already been validated, but
@@ -1717,7 +1852,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	}
 
 	vma->vm_start = addr;
-	vma->vm_end = addr + len;
+	vma->vm_end = end;
 	vma->vm_flags = vm_flags;
 	vma->vm_page_prot = vm_get_page_prot(vm_flags);
 	vma->vm_pgoff = pgoff;
@@ -1738,28 +1873,30 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		 *
 		 * Answer: Yes, several device drivers can do it in their
 		 *         f_op->mmap method. -DaveM
-		 * Bug: If addr is changed, prev, rb_link, rb_parent should
-		 *      be updated for vma_link()
 		 */
 		WARN_ON_ONCE(addr != vma->vm_start);
 
 		addr = vma->vm_start;
+		mas_reset(&mas);
 
 		/* If vm_flags changed after call_mmap(), we should try merge vma again
 		 * as we may succeed this time.
 		 */
 		if (unlikely(vm_flags != vma->vm_flags && prev)) {
-			merge = vma_merge(mm, prev, vma->vm_start, vma->vm_end, vma->vm_flags,
+			next = vma_merge(mm, prev, vma->vm_start, vma->vm_end, vma->vm_flags,
 				NULL, vma->vm_file, vma->vm_pgoff, NULL, NULL_VM_UFFD_CTX, NULL);
-			if (merge) {
+			if (next) {
 				/* ->mmap() can change vma->vm_file and fput the original file. So
 				 * fput the vma->vm_file here or we would add an extra fput for file
 				 * and cause general protection fault ultimately.
 				 */
 				fput(vma->vm_file);
 				vm_area_free(vma);
-				vma = merge;
-				/* Update vm_flags to pick up the change. */
+				vma = prev;
+				/* Update vm_flags and possible addr to pick up the change. We don't
+				 * warn here if addr changed as the vma is not linked by vma_link().
+				 */
+				addr = vma->vm_start;
 				vm_flags = vma->vm_flags;
 				goto unmap_writable;
 			}
@@ -1783,7 +1920,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 			goto free_vma;
 	}
 
-	if (vma_link(mm, vma, prev)) {
+	if (mas_preallocate(&mas, vma, GFP_KERNEL)) {
 		error = -ENOMEM;
 		if (file)
 			goto unmap_and_free_vma;
@@ -1791,12 +1928,28 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 			goto free_vma;
 	}
 
+	if (vma->vm_file)
+		i_mmap_lock_write(vma->vm_file->f_mapping);
+
+	vma_mas_store(vma, &mas);
+	__vma_link_list(mm, vma, prev);
+	mm->map_count++;
+	if (vma->vm_file) {
+		if (vma->vm_flags & VM_SHARED)
+			mapping_allow_writable(vma->vm_file->f_mapping);
+
+		flush_dcache_mmap_lock(vma->vm_file->f_mapping);
+		vma_interval_tree_insert(vma, &vma->vm_file->f_mapping->i_mmap);
+		flush_dcache_mmap_unlock(vma->vm_file->f_mapping);
+		i_mmap_unlock_write(vma->vm_file->f_mapping);
+	}
+
 	/* Once vma denies write, undo our temporary denial count */
 unmap_writable:
 	if (file && vm_flags & VM_SHARED)
 		mapping_unmap_writable(file->f_mapping);
 	file = vma->vm_file;
-out:
+expanded:
 	perf_event_mmap(vma);
 
 	vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
@@ -1823,6 +1976,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 
 	vma_set_page_prot(vma);
 
+	validate_mm(mm);
 	return addr;
 
 unmap_and_free_vma:
@@ -1839,6 +1993,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 unacct_error:
 	if (charged)
 		vm_unacct_memory(charged);
+	validate_mm(mm);
 	return error;
 }
 
@@ -2636,10 +2791,6 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 	prev = vma->vm_prev;
 	/* we have start < vma->vm_end  */
 
-	/* if it doesn't overlap, we have nothing.. */
-	if (vma->vm_start >= end)
-		return 0;
-
 	/*
 	 * If we need to split any vma, do it now to save pain later.
 	 *
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 25/69] mm: remove vmacache
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (7 preceding siblings ...)
  2022-05-04  1:13   ` [PATCH v9 24/69] mm/mmap: use advanced maple tree API for mmap_region() Liam Howlett
@ 2022-05-04  1:13   ` Liam Howlett
  2022-05-04  1:13   ` [PATCH v9 27/69] mm/mmap: move mmap_region() below do_munmap() Liam Howlett
                     ` (43 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:13 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

By using the maple tree and the maple tree state, the vmacache is no
longer beneficial and is complicating the VMA code.  Remove the vmacache
to reduce the work of keeping it up to date and to reduce code complexity.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 fs/exec.c                     |   3 -
 fs/proc/task_mmu.c            |   1 -
 include/linux/mm_types.h      |   1 -
 include/linux/mm_types_task.h |  12 ----
 include/linux/sched.h         |   1 -
 include/linux/vm_event_item.h |   4 --
 include/linux/vmacache.h      |  28 --------
 include/linux/vmstat.h        |   6 --
 kernel/debug/debug_core.c     |  12 ----
 kernel/fork.c                 |   5 --
 lib/Kconfig.debug             |   8 ---
 mm/Makefile                   |   2 +-
 mm/debug.c                    |   4 +-
 mm/mmap.c                     |  32 +---------
 mm/nommu.c                    |  37 ++---------
 mm/vmacache.c                 | 117 ----------------------------------
 mm/vmstat.c                   |   4 --
 17 files changed, 9 insertions(+), 268 deletions(-)
 delete mode 100644 include/linux/vmacache.h
 delete mode 100644 mm/vmacache.c

diff --git a/fs/exec.c b/fs/exec.c
index e3e55d5e0be1..14e7278a1ab8 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -28,7 +28,6 @@
 #include <linux/file.h>
 #include <linux/fdtable.h>
 #include <linux/mm.h>
-#include <linux/vmacache.h>
 #include <linux/stat.h>
 #include <linux/fcntl.h>
 #include <linux/swap.h>
@@ -1023,8 +1022,6 @@ static int exec_mmap(struct mm_struct *mm)
 	activate_mm(active_mm, mm);
 	if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
 		local_irq_enable();
-	tsk->mm->vmacache_seqnum = 0;
-	vmacache_flush(tsk);
 	task_unlock(tsk);
 	if (old_mm) {
 		mmap_read_unlock(old_mm);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index a843d13e2e1a..b940b969b000 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1,6 +1,5 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/pagewalk.h>
-#include <linux/vmacache.h>
 #include <linux/mm_inline.h>
 #include <linux/hugetlb.h>
 #include <linux/huge_mm.h>
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 50c53f370bf6..b844119387a3 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -465,7 +465,6 @@ struct mm_struct {
 	struct {
 		struct vm_area_struct *mmap;		/* list of VMAs */
 		struct maple_tree mm_mt;
-		u64 vmacache_seqnum;                   /* per-thread vmacache */
 #ifdef CONFIG_MMU
 		unsigned long (*get_unmapped_area) (struct file *filp,
 				unsigned long addr, unsigned long len,
diff --git a/include/linux/mm_types_task.h b/include/linux/mm_types_task.h
index c1bc6731125c..0bb4b6da9993 100644
--- a/include/linux/mm_types_task.h
+++ b/include/linux/mm_types_task.h
@@ -24,18 +24,6 @@
 		IS_ENABLED(CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK))
 #define ALLOC_SPLIT_PTLOCKS	(SPINLOCK_SIZE > BITS_PER_LONG/8)
 
-/*
- * The per task VMA cache array:
- */
-#define VMACACHE_BITS 2
-#define VMACACHE_SIZE (1U << VMACACHE_BITS)
-#define VMACACHE_MASK (VMACACHE_SIZE - 1)
-
-struct vmacache {
-	u64 seqnum;
-	struct vm_area_struct *vmas[VMACACHE_SIZE];
-};
-
 /*
  * When updating this, please also update struct resident_page_types[] in
  * kernel/fork.c
diff --git a/include/linux/sched.h b/include/linux/sched.h
index a8911b1f35aa..c58392abc663 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -861,7 +861,6 @@ struct task_struct {
 	struct mm_struct		*active_mm;
 
 	/* Per-thread vma caching: */
-	struct vmacache			vmacache;
 
 #ifdef SPLIT_RSS_COUNTING
 	struct task_rss_stat		rss_stat;
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index e83967e4c20e..5e80138ce624 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -122,10 +122,6 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		NR_TLB_LOCAL_FLUSH_ALL,
 		NR_TLB_LOCAL_FLUSH_ONE,
 #endif /* CONFIG_DEBUG_TLBFLUSH */
-#ifdef CONFIG_DEBUG_VM_VMACACHE
-		VMACACHE_FIND_CALLS,
-		VMACACHE_FIND_HITS,
-#endif
 #ifdef CONFIG_SWAP
 		SWAP_RA,
 		SWAP_RA_HIT,
diff --git a/include/linux/vmacache.h b/include/linux/vmacache.h
deleted file mode 100644
index 6fce268a4588..000000000000
--- a/include/linux/vmacache.h
+++ /dev/null
@@ -1,28 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef __LINUX_VMACACHE_H
-#define __LINUX_VMACACHE_H
-
-#include <linux/sched.h>
-#include <linux/mm.h>
-
-static inline void vmacache_flush(struct task_struct *tsk)
-{
-	memset(tsk->vmacache.vmas, 0, sizeof(tsk->vmacache.vmas));
-}
-
-extern void vmacache_update(unsigned long addr, struct vm_area_struct *newvma);
-extern struct vm_area_struct *vmacache_find(struct mm_struct *mm,
-						    unsigned long addr);
-
-#ifndef CONFIG_MMU
-extern struct vm_area_struct *vmacache_find_exact(struct mm_struct *mm,
-						  unsigned long start,
-						  unsigned long end);
-#endif
-
-static inline void vmacache_invalidate(struct mm_struct *mm)
-{
-	mm->vmacache_seqnum++;
-}
-
-#endif /* __LINUX_VMACACHE_H */
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index bfe38869498d..19cf5b6892ce 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -125,12 +125,6 @@ static inline void vm_events_fold_cpu(int cpu)
 #define count_vm_tlb_events(x, y) do { (void)(y); } while (0)
 #endif
 
-#ifdef CONFIG_DEBUG_VM_VMACACHE
-#define count_vm_vmacache_event(x) count_vm_event(x)
-#else
-#define count_vm_vmacache_event(x) do {} while (0)
-#endif
-
 #define __count_zid_vm_events(item, zid, delta) \
 	__count_vm_events(item##_NORMAL - ZONE_NORMAL + zid, delta)
 
diff --git a/kernel/debug/debug_core.c b/kernel/debug/debug_core.c
index da06a5553835..c4e6f5159bed 100644
--- a/kernel/debug/debug_core.c
+++ b/kernel/debug/debug_core.c
@@ -50,7 +50,6 @@
 #include <linux/pid.h>
 #include <linux/smp.h>
 #include <linux/mm.h>
-#include <linux/vmacache.h>
 #include <linux/rcupdate.h>
 #include <linux/irq.h>
 
@@ -282,17 +281,6 @@ static void kgdb_flush_swbreak_addr(unsigned long addr)
 	if (!CACHE_FLUSH_IS_SAFE)
 		return;
 
-	if (current->mm) {
-		int i;
-
-		for (i = 0; i < VMACACHE_SIZE; i++) {
-			if (!current->vmacache.vmas[i])
-				continue;
-			flush_cache_range(current->vmacache.vmas[i],
-					  addr, addr + BREAK_INSTR_SIZE);
-		}
-	}
-
 	/* Force flush instruction cache if it was outside the mm */
 	flush_icache_range(addr, addr + BREAK_INSTR_SIZE);
 }
diff --git a/kernel/fork.c b/kernel/fork.c
index 60783abc21c8..4af22dd65fc6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -43,7 +43,6 @@
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/mm_inline.h>
-#include <linux/vmacache.h>
 #include <linux/nsproxy.h>
 #include <linux/capability.h>
 #include <linux/cpu.h>
@@ -1119,7 +1118,6 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	mm->mmap = NULL;
 	mt_init_flags(&mm->mm_mt, MM_MT_FLAGS);
 	mt_set_external_lock(&mm->mm_mt, &mm->mmap_lock);
-	mm->vmacache_seqnum = 0;
 	atomic_set(&mm->mm_users, 1);
 	atomic_set(&mm->mm_count, 1);
 	seqcount_init(&mm->write_protect_seq);
@@ -1575,9 +1573,6 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
 	if (!oldmm)
 		return 0;
 
-	/* initialize the new vmacache entries */
-	vmacache_flush(tsk);
-
 	if (clone_flags & CLONE_VM) {
 		mmget(oldmm);
 		mm = oldmm;
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 3d9366075153..8a0567046e9e 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -837,14 +837,6 @@ config DEBUG_VM
 
 	  If unsure, say N.
 
-config DEBUG_VM_VMACACHE
-	bool "Debug VMA caching"
-	depends on DEBUG_VM
-	help
-	  Enable this to turn on VMA caching debug information. Doing so
-	  can cause significant overhead, so only enable it in non-production
-	  environments.
-
 config DEBUG_VM_MAPLE_TREE
 	bool "Debug VM maple trees"
 	depends on DEBUG_VM
diff --git a/mm/Makefile b/mm/Makefile
index 6f9ffa968a1a..298c9991ab75 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -52,7 +52,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   readahead.o swap.o truncate.o vmscan.o shmem.o \
 			   util.o mmzone.o vmstat.o backing-dev.o \
 			   mm_init.o percpu.o slab_common.o \
-			   compaction.o vmacache.o \
+			   compaction.o \
 			   interval_tree.o list_lru.o workingset.o \
 			   debug.o gup.o mmap_lock.o $(mmu-y)
 
diff --git a/mm/debug.c b/mm/debug.c
index bef329bf28f0..2d625ca0e326 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -155,7 +155,7 @@ EXPORT_SYMBOL(dump_vma);
 
 void dump_mm(const struct mm_struct *mm)
 {
-	pr_emerg("mm %px mmap %px seqnum %llu task_size %lu\n"
+	pr_emerg("mm %px mmap %px task_size %lu\n"
 #ifdef CONFIG_MMU
 		"get_unmapped_area %px\n"
 #endif
@@ -183,7 +183,7 @@ void dump_mm(const struct mm_struct *mm)
 		"tlb_flush_pending %d\n"
 		"def_flags: %#lx(%pGv)\n",
 
-		mm, mm->mmap, (long long) mm->vmacache_seqnum, mm->task_size,
+		mm, mm->mmap, mm->task_size,
 #ifdef CONFIG_MMU
 		mm->get_unmapped_area,
 #endif
diff --git a/mm/mmap.c b/mm/mmap.c
index baf608975f99..9b4192130814 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -14,7 +14,6 @@
 #include <linux/backing-dev.h>
 #include <linux/mm.h>
 #include <linux/mm_inline.h>
-#include <linux/vmacache.h>
 #include <linux/shm.h>
 #include <linux/mman.h>
 #include <linux/pagemap.h>
@@ -722,9 +721,6 @@ inline int vma_expand(struct ma_state *mas, struct vm_area_struct *vma,
 		/* Remove from mm linked list - also updates highest_vm_end */
 		__vma_unlink_list(mm, next);
 
-		/* Kill the cache */
-		vmacache_invalidate(mm);
-
 		if (file)
 			__remove_shared_vm_struct(next, file, mapping);
 
@@ -963,8 +959,6 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 
 	if (remove_next) {
 		__vma_unlink_list(mm, next);
-		/* Kill the cache */
-		vmacache_invalidate(mm);
 		if (file)
 			__remove_shared_vm_struct(next, file, mapping);
 	} else if (insert) {
@@ -2237,19 +2231,10 @@ struct vm_area_struct *find_vma_intersection(struct mm_struct *mm,
 					     unsigned long start_addr,
 					     unsigned long end_addr)
 {
-	struct vm_area_struct *vma;
 	unsigned long index = start_addr;
 
 	mmap_assert_locked(mm);
-	/* Check the cache first. */
-	vma = vmacache_find(mm, start_addr);
-	if (likely(vma))
-		return vma;
-
-	vma = mt_find(&mm->mm_mt, &index, end_addr - 1);
-	if (vma)
-		vmacache_update(start_addr, vma);
-	return vma;
+	return mt_find(&mm->mm_mt, &index, end_addr - 1);
 }
 EXPORT_SYMBOL(find_vma_intersection);
 
@@ -2263,19 +2248,10 @@ EXPORT_SYMBOL(find_vma_intersection);
  */
 struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
 {
-	struct vm_area_struct *vma;
 	unsigned long index = addr;
 
 	mmap_assert_locked(mm);
-	/* Check the cache first. */
-	vma = vmacache_find(mm, addr);
-	if (likely(vma))
-		return vma;
-
-	vma = mt_find(&mm->mm_mt, &index, ULONG_MAX);
-	if (vma)
-		vmacache_update(addr, vma);
-	return vma;
+	return mt_find(&mm->mm_mt, &index, ULONG_MAX);
 }
 EXPORT_SYMBOL(find_vma);
 
@@ -2663,9 +2639,6 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct ma_state *mas,
 		mm->highest_vm_end = prev ? vm_end_gap(prev) : 0;
 	tail_vma->vm_next = NULL;
 
-	/* Kill the cache */
-	vmacache_invalidate(mm);
-
 	/*
 	 * Do not downgrade mmap_lock if we are next to VM_GROWSDOWN or
 	 * VM_GROWSUP VMA. Such VMAs can change their size under
@@ -3045,7 +3018,6 @@ static int do_brk_munmap(struct ma_state *mas, struct vm_area_struct *vma,
 
 	vma_mas_remove(&unmap, mas);
 
-	vmacache_invalidate(vma->vm_mm);
 	if (vma->anon_vma) {
 		anon_vma_interval_tree_post_update_vma(vma);
 		anon_vma_unlock_write(vma->anon_vma);
diff --git a/mm/nommu.c b/mm/nommu.c
index 2870edfad8ed..1c9b4e8c4d5c 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -19,7 +19,6 @@
 #include <linux/export.h>
 #include <linux/mm.h>
 #include <linux/sched/mm.h>
-#include <linux/vmacache.h>
 #include <linux/mman.h>
 #include <linux/swap.h>
 #include <linux/file.h>
@@ -585,23 +584,12 @@ static void add_vma_to_mm(struct mm_struct *mm, struct vm_area_struct *vma)
  */
 static void delete_vma_from_mm(struct vm_area_struct *vma)
 {
-	int i;
-	struct address_space *mapping;
-	struct mm_struct *mm = vma->vm_mm;
-	struct task_struct *curr = current;
 	MA_STATE(mas, &vma->vm_mm->mm_mt, 0, 0);
 
-	mm->map_count--;
-	for (i = 0; i < VMACACHE_SIZE; i++) {
-		/* if the vma is cached, invalidate the entire cache */
-		if (curr->vmacache.vmas[i] == vma) {
-			vmacache_invalidate(mm);
-			break;
-		}
-	}
-
+	vma->vm_mm->map_count--;
 	/* remove the VMA from the mapping */
 	if (vma->vm_file) {
+		struct address_space *mapping;
 		mapping = vma->vm_file->f_mapping;
 
 		i_mmap_lock_write(mapping);
@@ -613,7 +601,7 @@ static void delete_vma_from_mm(struct vm_area_struct *vma)
 
 	/* remove from the MM's tree and list */
 	vma_mas_remove(vma, &mas);
-	__vma_unlink_list(mm, vma);
+	__vma_unlink_list(vma->vm_mm, vma);
 }
 
 /*
@@ -646,20 +634,9 @@ EXPORT_SYMBOL(find_vma_intersection);
  */
 struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
 {
-	struct vm_area_struct *vma;
 	MA_STATE(mas, &mm->mm_mt, addr, addr);
 
-	/* check the cache first */
-	vma = vmacache_find(mm, addr);
-	if (likely(vma))
-		return vma;
-
-	vma = mas_walk(&mas);
-
-	if (vma)
-		vmacache_update(addr, vma);
-
-	return vma;
+	return mas_walk(&mas);
 }
 EXPORT_SYMBOL(find_vma);
 
@@ -693,11 +670,6 @@ static struct vm_area_struct *find_vma_exact(struct mm_struct *mm,
 	unsigned long end = addr + len;
 	MA_STATE(mas, &mm->mm_mt, addr, addr);
 
-	/* check the cache first */
-	vma = vmacache_find_exact(mm, addr, end);
-	if (vma)
-		return vma;
-
 	vma = mas_walk(&mas);
 	if (!vma)
 		return NULL;
@@ -706,7 +678,6 @@ static struct vm_area_struct *find_vma_exact(struct mm_struct *mm,
 	if (vma->vm_end != end)
 		return NULL;
 
-	vmacache_update(addr, vma);
 	return vma;
 }
 
diff --git a/mm/vmacache.c b/mm/vmacache.c
deleted file mode 100644
index 01a6e6688ec1..000000000000
--- a/mm/vmacache.c
+++ /dev/null
@@ -1,117 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-/*
- * Copyright (C) 2014 Davidlohr Bueso.
- */
-#include <linux/sched/signal.h>
-#include <linux/sched/task.h>
-#include <linux/mm.h>
-#include <linux/vmacache.h>
-
-/*
- * Hash based on the pmd of addr if configured with MMU, which provides a good
- * hit rate for workloads with spatial locality.  Otherwise, use pages.
- */
-#ifdef CONFIG_MMU
-#define VMACACHE_SHIFT	PMD_SHIFT
-#else
-#define VMACACHE_SHIFT	PAGE_SHIFT
-#endif
-#define VMACACHE_HASH(addr) ((addr >> VMACACHE_SHIFT) & VMACACHE_MASK)
-
-/*
- * This task may be accessing a foreign mm via (for example)
- * get_user_pages()->find_vma().  The vmacache is task-local and this
- * task's vmacache pertains to a different mm (ie, its own).  There is
- * nothing we can do here.
- *
- * Also handle the case where a kernel thread has adopted this mm via
- * kthread_use_mm(). That kernel thread's vmacache is not applicable to this mm.
- */
-static inline bool vmacache_valid_mm(struct mm_struct *mm)
-{
-	return current->mm == mm && !(current->flags & PF_KTHREAD);
-}
-
-void vmacache_update(unsigned long addr, struct vm_area_struct *newvma)
-{
-	if (vmacache_valid_mm(newvma->vm_mm))
-		current->vmacache.vmas[VMACACHE_HASH(addr)] = newvma;
-}
-
-static bool vmacache_valid(struct mm_struct *mm)
-{
-	struct task_struct *curr;
-
-	if (!vmacache_valid_mm(mm))
-		return false;
-
-	curr = current;
-	if (mm->vmacache_seqnum != curr->vmacache.seqnum) {
-		/*
-		 * First attempt will always be invalid, initialize
-		 * the new cache for this task here.
-		 */
-		curr->vmacache.seqnum = mm->vmacache_seqnum;
-		vmacache_flush(curr);
-		return false;
-	}
-	return true;
-}
-
-struct vm_area_struct *vmacache_find(struct mm_struct *mm, unsigned long addr)
-{
-	int idx = VMACACHE_HASH(addr);
-	int i;
-
-	count_vm_vmacache_event(VMACACHE_FIND_CALLS);
-
-	if (!vmacache_valid(mm))
-		return NULL;
-
-	for (i = 0; i < VMACACHE_SIZE; i++) {
-		struct vm_area_struct *vma = current->vmacache.vmas[idx];
-
-		if (vma) {
-#ifdef CONFIG_DEBUG_VM_VMACACHE
-			if (WARN_ON_ONCE(vma->vm_mm != mm))
-				break;
-#endif
-			if (vma->vm_start <= addr && vma->vm_end > addr) {
-				count_vm_vmacache_event(VMACACHE_FIND_HITS);
-				return vma;
-			}
-		}
-		if (++idx == VMACACHE_SIZE)
-			idx = 0;
-	}
-
-	return NULL;
-}
-
-#ifndef CONFIG_MMU
-struct vm_area_struct *vmacache_find_exact(struct mm_struct *mm,
-					   unsigned long start,
-					   unsigned long end)
-{
-	int idx = VMACACHE_HASH(start);
-	int i;
-
-	count_vm_vmacache_event(VMACACHE_FIND_CALLS);
-
-	if (!vmacache_valid(mm))
-		return NULL;
-
-	for (i = 0; i < VMACACHE_SIZE; i++) {
-		struct vm_area_struct *vma = current->vmacache.vmas[idx];
-
-		if (vma && vma->vm_start == start && vma->vm_end == end) {
-			count_vm_vmacache_event(VMACACHE_FIND_HITS);
-			return vma;
-		}
-		if (++idx == VMACACHE_SIZE)
-			idx = 0;
-	}
-
-	return NULL;
-}
-#endif
diff --git a/mm/vmstat.c b/mm/vmstat.c
index b94a2e4723ff..4e76537aadcf 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1382,10 +1382,6 @@ const char * const vmstat_text[] = {
 	"nr_tlb_local_flush_one",
 #endif /* CONFIG_DEBUG_TLBFLUSH */
 
-#ifdef CONFIG_DEBUG_VM_VMACACHE
-	"vmacache_find_calls",
-	"vmacache_find_hits",
-#endif
 #ifdef CONFIG_SWAP
 	"swap_ra",
 	"swap_ra_hit",
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 26/69] mm: convert vma_lookup() to use mtree_load()
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (9 preceding siblings ...)
  2022-05-04  1:13   ` [PATCH v9 27/69] mm/mmap: move mmap_region() below do_munmap() Liam Howlett
@ 2022-05-04  1:13   ` Liam Howlett
  2022-05-04  1:13   ` [PATCH v9 28/69] mm/mmap: reorganize munmap to use maple states Liam Howlett
                     ` (41 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:13 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Unlike the rbtree, the Maple Tree will return a NULL if there's nothing at
a particular address.

Since the previous commit dropped the vmacache, it is now possible to
consult the tree directly.
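
A hedged usage sketch (the -EFAULT return is illustrative, not from this
series): with mtree_load() behind it, vma_lookup() either returns the VMA
containing addr or NULL, with no extra start check needed.

	struct vm_area_struct *vma;

	vma = vma_lookup(mm, addr);
	if (!vma)
		return -EFAULT;	/* nothing maps addr */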

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/mm.h | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6db9face6f84..f6d633f04a64 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2797,12 +2797,7 @@ struct vm_area_struct *find_vma_intersection(struct mm_struct *mm,
 static inline
 struct vm_area_struct *vma_lookup(struct mm_struct *mm, unsigned long addr)
 {
-	struct vm_area_struct *vma = find_vma(mm, addr);
-
-	if (vma && addr < vma->vm_start)
-		vma = NULL;
-
-	return vma;
+	return mtree_load(&mm->mm_mt, addr);
 }
 
 static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 27/69] mm/mmap: move mmap_region() below do_munmap()
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (8 preceding siblings ...)
  2022-05-04  1:13   ` [PATCH v9 25/69] mm: remove vmacache Liam Howlett
@ 2022-05-04  1:13   ` Liam Howlett
  2022-05-04  1:13   ` [PATCH v9 26/69] mm: convert vma_lookup() to use mtree_load() Liam Howlett
                     ` (42 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:13 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Relocation of code for the next commit.  There should be no changes here.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 mm/mmap.c | 476 +++++++++++++++++++++++++++---------------------------
 1 file changed, 238 insertions(+), 238 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 9b4192130814..d49dca8fecd5 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1753,244 +1753,6 @@ static inline int accountable_mapping(struct file *file, vm_flags_t vm_flags)
 	return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE;
 }
 
-unsigned long mmap_region(struct file *file, unsigned long addr,
-		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
-		struct list_head *uf)
-{
-	struct mm_struct *mm = current->mm;
-	struct vm_area_struct *vma = NULL;
-	struct vm_area_struct *prev, *next;
-	pgoff_t pglen = len >> PAGE_SHIFT;
-	unsigned long charged = 0;
-	unsigned long end = addr + len;
-	unsigned long merge_start = addr, merge_end = end;
-	pgoff_t vm_pgoff;
-	int error;
-	MA_STATE(mas, &mm->mm_mt, addr, end - 1);
-
-	/* Check against address space limit. */
-	if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) {
-		unsigned long nr_pages;
-
-		/*
-		 * MAP_FIXED may remove pages of mappings that intersects with
-		 * requested mapping. Account for the pages it would unmap.
-		 */
-		nr_pages = count_vma_pages_range(mm, addr, end);
-
-		if (!may_expand_vm(mm, vm_flags,
-					(len >> PAGE_SHIFT) - nr_pages))
-			return -ENOMEM;
-	}
-
-	/* Unmap any existing mapping in the area */
-	if (do_munmap(mm, addr, len, uf))
-		return -ENOMEM;
-
-	/*
-	 * Private writable mapping: check memory availability
-	 */
-	if (accountable_mapping(file, vm_flags)) {
-		charged = len >> PAGE_SHIFT;
-		if (security_vm_enough_memory_mm(mm, charged))
-			return -ENOMEM;
-		vm_flags |= VM_ACCOUNT;
-	}
-
-	next = mas_next(&mas, ULONG_MAX);
-	prev = mas_prev(&mas, 0);
-	if (vm_flags & VM_SPECIAL)
-		goto cannot_expand;
-
-	/* Attempt to expand an old mapping */
-	/* Check next */
-	if (next && next->vm_start == end && !vma_policy(next) &&
-	    can_vma_merge_before(next, vm_flags, NULL, file, pgoff+pglen,
-				 NULL_VM_UFFD_CTX, NULL)) {
-		merge_end = next->vm_end;
-		vma = next;
-		vm_pgoff = next->vm_pgoff - pglen;
-	}
-
-	/* Check prev */
-	if (prev && prev->vm_end == addr && !vma_policy(prev) &&
-	    (vma ? can_vma_merge_after(prev, vm_flags, vma->anon_vma, file,
-				       pgoff, vma->vm_userfaultfd_ctx, NULL) :
-		   can_vma_merge_after(prev, vm_flags, NULL, file, pgoff,
-				       NULL_VM_UFFD_CTX , NULL))) {
-		merge_start = prev->vm_start;
-		vma = prev;
-		vm_pgoff = prev->vm_pgoff;
-	}
-
-
-	/* Actually expand, if possible */
-	if (vma &&
-	    !vma_expand(&mas, vma, merge_start, merge_end, vm_pgoff, next)) {
-		khugepaged_enter_vma_merge(vma, vm_flags);
-		goto expanded;
-	}
-
-	mas.index = addr;
-	mas.last = end - 1;
-cannot_expand:
-	/*
-	 * Determine the object being mapped and call the appropriate
-	 * specific mapper. the address has already been validated, but
-	 * not unmapped, but the maps are removed from the list.
-	 */
-	vma = vm_area_alloc(mm);
-	if (!vma) {
-		error = -ENOMEM;
-		goto unacct_error;
-	}
-
-	vma->vm_start = addr;
-	vma->vm_end = end;
-	vma->vm_flags = vm_flags;
-	vma->vm_page_prot = vm_get_page_prot(vm_flags);
-	vma->vm_pgoff = pgoff;
-
-	if (file) {
-		if (vm_flags & VM_SHARED) {
-			error = mapping_map_writable(file->f_mapping);
-			if (error)
-				goto free_vma;
-		}
-
-		vma->vm_file = get_file(file);
-		error = call_mmap(file, vma);
-		if (error)
-			goto unmap_and_free_vma;
-
-		/* Can addr have changed??
-		 *
-		 * Answer: Yes, several device drivers can do it in their
-		 *         f_op->mmap method. -DaveM
-		 */
-		WARN_ON_ONCE(addr != vma->vm_start);
-
-		addr = vma->vm_start;
-		mas_reset(&mas);
-
-		/* If vm_flags changed after call_mmap(), we should try merge vma again
-		 * as we may succeed this time.
-		 */
-		if (unlikely(vm_flags != vma->vm_flags && prev)) {
-			next = vma_merge(mm, prev, vma->vm_start, vma->vm_end, vma->vm_flags,
-				NULL, vma->vm_file, vma->vm_pgoff, NULL, NULL_VM_UFFD_CTX, NULL);
-			if (next) {
-				/* ->mmap() can change vma->vm_file and fput the original file. So
-				 * fput the vma->vm_file here or we would add an extra fput for file
-				 * and cause general protection fault ultimately.
-				 */
-				fput(vma->vm_file);
-				vm_area_free(vma);
-				vma = prev;
-				/* Update vm_flags and possible addr to pick up the change. We don't
-				 * warn here if addr changed as the vma is not linked by vma_link().
-				 */
-				addr = vma->vm_start;
-				vm_flags = vma->vm_flags;
-				goto unmap_writable;
-			}
-		}
-
-		vm_flags = vma->vm_flags;
-	} else if (vm_flags & VM_SHARED) {
-		error = shmem_zero_setup(vma);
-		if (error)
-			goto free_vma;
-	} else {
-		vma_set_anonymous(vma);
-	}
-
-	/* Allow architectures to sanity-check the vm_flags */
-	if (!arch_validate_flags(vma->vm_flags)) {
-		error = -EINVAL;
-		if (file)
-			goto unmap_and_free_vma;
-		else
-			goto free_vma;
-	}
-
-	if (mas_preallocate(&mas, vma, GFP_KERNEL)) {
-		error = -ENOMEM;
-		if (file)
-			goto unmap_and_free_vma;
-		else
-			goto free_vma;
-	}
-
-	if (vma->vm_file)
-		i_mmap_lock_write(vma->vm_file->f_mapping);
-
-	vma_mas_store(vma, &mas);
-	__vma_link_list(mm, vma, prev);
-	mm->map_count++;
-	if (vma->vm_file) {
-		if (vma->vm_flags & VM_SHARED)
-			mapping_allow_writable(vma->vm_file->f_mapping);
-
-		flush_dcache_mmap_lock(vma->vm_file->f_mapping);
-		vma_interval_tree_insert(vma, &vma->vm_file->f_mapping->i_mmap);
-		flush_dcache_mmap_unlock(vma->vm_file->f_mapping);
-		i_mmap_unlock_write(vma->vm_file->f_mapping);
-	}
-
-	/* Once vma denies write, undo our temporary denial count */
-unmap_writable:
-	if (file && vm_flags & VM_SHARED)
-		mapping_unmap_writable(file->f_mapping);
-	file = vma->vm_file;
-expanded:
-	perf_event_mmap(vma);
-
-	vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
-	if (vm_flags & VM_LOCKED) {
-		if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
-					is_vm_hugetlb_page(vma) ||
-					vma == get_gate_vma(current->mm))
-			vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
-		else
-			mm->locked_vm += (len >> PAGE_SHIFT);
-	}
-
-	if (file)
-		uprobe_mmap(vma);
-
-	/*
-	 * New (or expanded) vma always get soft dirty status.
-	 * Otherwise user-space soft-dirty page tracker won't
-	 * be able to distinguish situation when vma area unmapped,
-	 * then new mapped in-place (which must be aimed as
-	 * a completely new data area).
-	 */
-	vma->vm_flags |= VM_SOFTDIRTY;
-
-	vma_set_page_prot(vma);
-
-	validate_mm(mm);
-	return addr;
-
-unmap_and_free_vma:
-	fput(vma->vm_file);
-	vma->vm_file = NULL;
-
-	/* Undo any partial mapping done by a device driver. */
-	unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
-	charged = 0;
-	if (vm_flags & VM_SHARED)
-		mapping_unmap_writable(file->f_mapping);
-free_vma:
-	vm_area_free(vma);
-unacct_error:
-	if (charged)
-		vm_unacct_memory(charged);
-	validate_mm(mm);
-	return error;
-}
-
 /* unmapped_area() Find an area between the low_limit and the high_limit with
  * the correct alignment and offset, all from @info. Note: current->mm is used
  * for the search.
@@ -2840,6 +2602,244 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 	return __do_munmap(mm, start, len, uf, false);
 }
 
+unsigned long mmap_region(struct file *file, unsigned long addr,
+		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
+		struct list_head *uf)
+{
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma = NULL;
+	struct vm_area_struct *prev, *next;
+	pgoff_t pglen = len >> PAGE_SHIFT;
+	unsigned long charged = 0;
+	unsigned long end = addr + len;
+	unsigned long merge_start = addr, merge_end = end;
+	pgoff_t vm_pgoff;
+	int error;
+	MA_STATE(mas, &mm->mm_mt, addr, end - 1);
+
+	/* Check against address space limit. */
+	if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) {
+		unsigned long nr_pages;
+
+		/*
+		 * MAP_FIXED may remove pages of mappings that intersects with
+		 * requested mapping. Account for the pages it would unmap.
+		 */
+		nr_pages = count_vma_pages_range(mm, addr, end);
+
+		if (!may_expand_vm(mm, vm_flags,
+					(len >> PAGE_SHIFT) - nr_pages))
+			return -ENOMEM;
+	}
+
+	/* Unmap any existing mapping in the area */
+	if (do_munmap(mm, addr, len, uf))
+		return -ENOMEM;
+
+	/*
+	 * Private writable mapping: check memory availability
+	 */
+	if (accountable_mapping(file, vm_flags)) {
+		charged = len >> PAGE_SHIFT;
+		if (security_vm_enough_memory_mm(mm, charged))
+			return -ENOMEM;
+		vm_flags |= VM_ACCOUNT;
+	}
+
+	next = mas_next(&mas, ULONG_MAX);
+	prev = mas_prev(&mas, 0);
+	if (vm_flags & VM_SPECIAL)
+		goto cannot_expand;
+
+	/* Attempt to expand an old mapping */
+	/* Check next */
+	if (next && next->vm_start == end && !vma_policy(next) &&
+	    can_vma_merge_before(next, vm_flags, NULL, file, pgoff+pglen,
+				 NULL_VM_UFFD_CTX, NULL)) {
+		merge_end = next->vm_end;
+		vma = next;
+		vm_pgoff = next->vm_pgoff - pglen;
+	}
+
+	/* Check prev */
+	if (prev && prev->vm_end == addr && !vma_policy(prev) &&
+	    (vma ? can_vma_merge_after(prev, vm_flags, vma->anon_vma, file,
+				       pgoff, vma->vm_userfaultfd_ctx, NULL) :
+		   can_vma_merge_after(prev, vm_flags, NULL, file, pgoff,
+				       NULL_VM_UFFD_CTX , NULL))) {
+		merge_start = prev->vm_start;
+		vma = prev;
+		vm_pgoff = prev->vm_pgoff;
+	}
+
+
+	/* Actually expand, if possible */
+	if (vma &&
+	    !vma_expand(&mas, vma, merge_start, merge_end, vm_pgoff, next)) {
+		khugepaged_enter_vma_merge(vma, vm_flags);
+		goto expanded;
+	}
+
+	mas.index = addr;
+	mas.last = end - 1;
+cannot_expand:
+	/*
+	 * Determine the object being mapped and call the appropriate
+	 * specific mapper. the address has already been validated, but
+	 * not unmapped, but the maps are removed from the list.
+	 */
+	vma = vm_area_alloc(mm);
+	if (!vma) {
+		error = -ENOMEM;
+		goto unacct_error;
+	}
+
+	vma->vm_start = addr;
+	vma->vm_end = end;
+	vma->vm_flags = vm_flags;
+	vma->vm_page_prot = vm_get_page_prot(vm_flags);
+	vma->vm_pgoff = pgoff;
+
+	if (file) {
+		if (vm_flags & VM_SHARED) {
+			error = mapping_map_writable(file->f_mapping);
+			if (error)
+				goto free_vma;
+		}
+
+		vma->vm_file = get_file(file);
+		error = call_mmap(file, vma);
+		if (error)
+			goto unmap_and_free_vma;
+
+		/* Can addr have changed??
+		 *
+		 * Answer: Yes, several device drivers can do it in their
+		 *         f_op->mmap method. -DaveM
+		 */
+		WARN_ON_ONCE(addr != vma->vm_start);
+
+		addr = vma->vm_start;
+		mas_reset(&mas);
+
+		/* If vm_flags changed after call_mmap(), we should try merge vma again
+		 * as we may succeed this time.
+		 */
+		if (unlikely(vm_flags != vma->vm_flags && prev)) {
+			next = vma_merge(mm, prev, vma->vm_start, vma->vm_end, vma->vm_flags,
+				NULL, vma->vm_file, vma->vm_pgoff, NULL, NULL_VM_UFFD_CTX, NULL);
+			if (next) {
+				/* ->mmap() can change vma->vm_file and fput the original file. So
+				 * fput the vma->vm_file here or we would add an extra fput for file
+				 * and cause general protection fault ultimately.
+				 */
+				fput(vma->vm_file);
+				vm_area_free(vma);
+				vma = prev;
+				/* Update vm_flags and possible addr to pick up the change. We don't
+				 * warn here if addr changed as the vma is not linked by vma_link().
+				 */
+				addr = vma->vm_start;
+				vm_flags = vma->vm_flags;
+				goto unmap_writable;
+			}
+		}
+
+		vm_flags = vma->vm_flags;
+	} else if (vm_flags & VM_SHARED) {
+		error = shmem_zero_setup(vma);
+		if (error)
+			goto free_vma;
+	} else {
+		vma_set_anonymous(vma);
+	}
+
+	/* Allow architectures to sanity-check the vm_flags */
+	if (!arch_validate_flags(vma->vm_flags)) {
+		error = -EINVAL;
+		if (file)
+			goto unmap_and_free_vma;
+		else
+			goto free_vma;
+	}
+
+	if (mas_preallocate(&mas, vma, GFP_KERNEL)) {
+		error = -ENOMEM;
+		if (file)
+			goto unmap_and_free_vma;
+		else
+			goto free_vma;
+	}
+
+	if (vma->vm_file)
+		i_mmap_lock_write(vma->vm_file->f_mapping);
+
+	vma_mas_store(vma, &mas);
+	__vma_link_list(mm, vma, prev);
+	mm->map_count++;
+	if (vma->vm_file) {
+		if (vma->vm_flags & VM_SHARED)
+			mapping_allow_writable(vma->vm_file->f_mapping);
+
+		flush_dcache_mmap_lock(vma->vm_file->f_mapping);
+		vma_interval_tree_insert(vma, &vma->vm_file->f_mapping->i_mmap);
+		flush_dcache_mmap_unlock(vma->vm_file->f_mapping);
+		i_mmap_unlock_write(vma->vm_file->f_mapping);
+	}
+
+	/* Once vma denies write, undo our temporary denial count */
+unmap_writable:
+	if (file && vm_flags & VM_SHARED)
+		mapping_unmap_writable(file->f_mapping);
+	file = vma->vm_file;
+expanded:
+	perf_event_mmap(vma);
+
+	vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
+	if (vm_flags & VM_LOCKED) {
+		if ((vm_flags & VM_SPECIAL) || vma_is_dax(vma) ||
+					is_vm_hugetlb_page(vma) ||
+					vma == get_gate_vma(current->mm))
+			vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
+		else
+			mm->locked_vm += (len >> PAGE_SHIFT);
+	}
+
+	if (file)
+		uprobe_mmap(vma);
+
+	/*
+	 * New (or expanded) vma always get soft dirty status.
+	 * Otherwise user-space soft-dirty page tracker won't
+	 * be able to distinguish situation when vma area unmapped,
+	 * then new mapped in-place (which must be aimed as
+	 * a completely new data area).
+	 */
+	vma->vm_flags |= VM_SOFTDIRTY;
+
+	vma_set_page_prot(vma);
+
+	validate_mm(mm);
+	return addr;
+
+unmap_and_free_vma:
+	fput(vma->vm_file);
+	vma->vm_file = NULL;
+
+	/* Undo any partial mapping done by a device driver. */
+	unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
+	charged = 0;
+	if (vm_flags & VM_SHARED)
+		mapping_unmap_writable(file->f_mapping);
+free_vma:
+	vm_area_free(vma);
+unacct_error:
+	if (charged)
+		vm_unacct_memory(charged);
+	validate_mm(mm);
+	return error;
+}
+
 static int __vm_munmap(unsigned long start, size_t len, bool downgrade)
 {
 	int ret;
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 28/69] mm/mmap: reorganize munmap to use maple states
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (10 preceding siblings ...)
  2022-05-04  1:13   ` [PATCH v9 26/69] mm: convert vma_lookup() to use mtree_load() Liam Howlett
@ 2022-05-04  1:13   ` Liam Howlett
  2022-06-06 12:09     ` Qian Cai
  2022-05-04  1:13   ` [PATCH v9 29/69] mm/mmap: change do_brk_munmap() to use do_mas_align_munmap() Liam Howlett
                     ` (40 subsequent siblings)
  52 siblings, 1 reply; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:13 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Remove __do_munmap() in favour of do_munmap(), do_mas_munmap(), and
do_mas_align_munmap().

do_munmap() is a wrapper to create a maple state for any callers that have
not been converted to the maple tree.

do_mas_munmap() takes a maple state to munmap a range.  This is just a
small function which checks for error conditions and aligns the end of the
range.

do_mas_align_munmap() munmaps the already-aligned range.  It starts with
the first VMA in the range, then finds the last VMA in the range.  Both
start and end are split if necessary.  The VMAs are then removed from the
linked list and the mm mlock count is updated at the same time, followed
by a single tree operation that overwrites the area with NULL.  Finally,
the detached list is unmapped and freed.

By reorganizing the munmap calls as outlined, it is now possible to avoid
the extra work of re-aligning ranges for pre-aligned callers which are
known to be safe, and to avoid extra VMA lookups or tree walks for
modifications.

detach_vmas_to_be_unmapped() is no longer used, so drop this code.

vm_brk_flags() can just call the do_mas_munmap() as it checks for
intersecting VMAs directly.
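
Converted callers set up the maple state once and pass it down.  A minimal
sketch, mirroring the do_munmap() wrapper below (example_munmap() is a
hypothetical name, not part of this patch):

	/* Unmap [start, start + len) while holding mmap_lock for write. */
	static int example_munmap(struct mm_struct *mm, unsigned long start,
				  size_t len, struct list_head *uf)
	{
		MA_STATE(mas, &mm->mm_mt, start, start);

		/* false: do not attempt to downgrade mmap_lock to read */
		return do_mas_munmap(&mas, mm, start, len, uf, false);
	}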

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 include/linux/mm.h |   5 +-
 mm/mmap.c          | 231 ++++++++++++++++++++++++++++-----------------
 mm/mremap.c        |  17 ++--
 3 files changed, 158 insertions(+), 95 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f6d633f04a64..0cc2cb692a78 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2712,8 +2712,9 @@ extern unsigned long mmap_region(struct file *file, unsigned long addr,
 extern unsigned long do_mmap(struct file *file, unsigned long addr,
 	unsigned long len, unsigned long prot, unsigned long flags,
 	unsigned long pgoff, unsigned long *populate, struct list_head *uf);
-extern int __do_munmap(struct mm_struct *, unsigned long, size_t,
-		       struct list_head *uf, bool downgrade);
+extern int do_mas_munmap(struct ma_state *mas, struct mm_struct *mm,
+			 unsigned long start, size_t len, struct list_head *uf,
+			 bool downgrade);
 extern int do_munmap(struct mm_struct *, unsigned long, size_t,
 		     struct list_head *uf);
 extern int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior);
diff --git a/mm/mmap.c b/mm/mmap.c
index d49dca8fecd5..dd21f0a3f236 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2372,47 +2372,6 @@ static void unmap_region(struct mm_struct *mm,
 	tlb_finish_mmu(&tlb);
 }
 
-/*
- * Create a list of vma's touched by the unmap, removing them from the mm's
- * vma list as we go..
- */
-static bool
-detach_vmas_to_be_unmapped(struct mm_struct *mm, struct ma_state *mas,
-	struct vm_area_struct *vma, struct vm_area_struct *prev,
-	unsigned long end)
-{
-	struct vm_area_struct **insertion_point;
-	struct vm_area_struct *tail_vma = NULL;
-
-	insertion_point = (prev ? &prev->vm_next : &mm->mmap);
-	vma->vm_prev = NULL;
-	vma_mas_szero(mas, vma->vm_start, end);
-	do {
-		if (vma->vm_flags & VM_LOCKED)
-			mm->locked_vm -= vma_pages(vma);
-		mm->map_count--;
-		tail_vma = vma;
-		vma = vma->vm_next;
-	} while (vma && vma->vm_start < end);
-	*insertion_point = vma;
-	if (vma)
-		vma->vm_prev = prev;
-	else
-		mm->highest_vm_end = prev ? vm_end_gap(prev) : 0;
-	tail_vma->vm_next = NULL;
-
-	/*
-	 * Do not downgrade mmap_lock if we are next to VM_GROWSDOWN or
-	 * VM_GROWSUP VMA. Such VMAs can change their size under
-	 * down_read(mmap_lock) and collide with the VMA we are about to unmap.
-	 */
-	if (vma && (vma->vm_flags & VM_GROWSDOWN))
-		return false;
-	if (prev && (prev->vm_flags & VM_GROWSUP))
-		return false;
-	return true;
-}
-
 /*
  * __split_vma() bypasses sysctl_max_map_count checking.  We use this where it
  * has already been checked or doesn't make sense to fail.
@@ -2492,40 +2451,51 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
 	return __split_vma(mm, vma, addr, new_below);
 }
 
-/* Munmap is split into 2 main parts -- this part which finds
- * what needs doing, and the areas themselves, which do the
- * work.  This now handles partial unmappings.
- * Jeremy Fitzhardinge <jeremy@goop.org>
- */
-int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
-		struct list_head *uf, bool downgrade)
+static inline int
+unlock_range(struct vm_area_struct *start, struct vm_area_struct **tail,
+	     unsigned long limit)
 {
-	unsigned long end;
-	struct vm_area_struct *vma, *prev, *last;
-	int error = -ENOMEM;
-	MA_STATE(mas, &mm->mm_mt, 0, 0);
+	struct mm_struct *mm = start->vm_mm;
+	struct vm_area_struct *tmp = start;
+	int count = 0;
 
-	if ((offset_in_page(start)) || start > TASK_SIZE || len > TASK_SIZE-start)
-		return -EINVAL;
+	while (tmp && tmp->vm_start < limit) {
+		*tail = tmp;
+		count++;
+		if (tmp->vm_flags & VM_LOCKED)
+			mm->locked_vm -= vma_pages(tmp);
 
-	len = PAGE_ALIGN(len);
-	end = start + len;
-	if (len == 0)
-		return -EINVAL;
+		tmp = tmp->vm_next;
+	}
 
-	 /* arch_unmap() might do unmaps itself.  */
-	arch_unmap(mm, start, end);
+	return count;
+}
 
-	/* Find the first overlapping VMA where start < vma->vm_end */
-	vma = find_vma_intersection(mm, start, end);
-	if (!vma)
-		return 0;
+/*
+ * do_mas_align_munmap() - munmap the aligned region from @start to @end.
+ * @mas: The maple_state, ideally set up to alter the correct tree location.
+ * @vma: The starting vm_area_struct
+ * @mm: The mm_struct
+ * @start: The aligned start address to munmap.
+ * @end: The aligned end address to munmap.
+ * @uf: The userfaultfd list_head
+ * @downgrade: Set to true to attempt a write downgrade of the mmap_sem
+ *
+ * If @downgrade is true, check return code for potential release of the lock.
+ */
+static int
+do_mas_align_munmap(struct ma_state *mas, struct vm_area_struct *vma,
+		    struct mm_struct *mm, unsigned long start,
+		    unsigned long end, struct list_head *uf, bool downgrade)
+{
+	struct vm_area_struct *prev, *last;
+	int error = -ENOMEM;
+	/* we have start < vma->vm_end  */
 
-	if (mas_preallocate(&mas, vma, GFP_KERNEL))
+	if (mas_preallocate(mas, vma, GFP_KERNEL))
 		return -ENOMEM;
-	prev = vma->vm_prev;
-	/* we have start < vma->vm_end  */
 
+	mas->last = end - 1;
 	/*
 	 * If we need to split any vma, do it now to save pain later.
 	 *
@@ -2546,17 +2516,31 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 		error = __split_vma(mm, vma, start, 0);
 		if (error)
 			goto split_failed;
+
 		prev = vma;
+		vma = __vma_next(mm, prev);
+		mas->index = start;
+		mas_reset(mas);
+	} else {
+		prev = vma->vm_prev;
 	}
 
+	if (vma->vm_end >= end)
+		last = vma;
+	else
+		last = find_vma_intersection(mm, end - 1, end);
+
 	/* Does it split the last one? */
-	last = find_vma(mm, end);
-	if (last && end > last->vm_start) {
+	if (last && end < last->vm_end) {
 		error = __split_vma(mm, last, end, 1);
+
 		if (error)
 			goto split_failed;
+
+		if (vma == last)
+			vma = __vma_next(mm, prev);
+		mas_reset(mas);
 	}
-	vma = __vma_next(mm, prev);
 
 	if (unlikely(uf)) {
 		/*
@@ -2569,16 +2553,46 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 		 * failure that it's not worth optimizing it for.
 		 */
 		error = userfaultfd_unmap_prep(vma, start, end, uf);
+
 		if (error)
 			goto userfaultfd_error;
 	}
 
-	/* Detach vmas from rbtree */
-	if (!detach_vmas_to_be_unmapped(mm, &mas, vma, prev, end))
-		downgrade = false;
+	/*
+	 * unlock any mlock()ed ranges before detaching vmas, count the number
+	 * of VMAs to be dropped, and return the tail entry of the affected
+	 * area.
+	 */
+	mm->map_count -= unlock_range(vma, &last, end);
+	/* Drop removed area from the tree */
+	mas_store_prealloc(mas, NULL);
 
-	if (downgrade)
-		mmap_write_downgrade(mm);
+	/* Detach vmas from the MM linked list */
+	vma->vm_prev = NULL;
+	if (prev)
+		prev->vm_next = last->vm_next;
+	else
+		mm->mmap = last->vm_next;
+
+	if (last->vm_next) {
+		last->vm_next->vm_prev = prev;
+		last->vm_next = NULL;
+	} else
+		mm->highest_vm_end = prev ? vm_end_gap(prev) : 0;
+
+	/*
+	 * Do not downgrade mmap_lock if we are next to VM_GROWSDOWN or
+	 * VM_GROWSUP VMA. Such VMAs can change their size under
+	 * down_read(mmap_lock) and collide with the VMA we are about to unmap.
+	 */
+	if (downgrade) {
+		if (last && (last->vm_flags & VM_GROWSDOWN))
+			downgrade = false;
+		else if (prev && (prev->vm_flags & VM_GROWSUP))
+			downgrade = false;
+		else
+			mmap_write_downgrade(mm);
+	}
 
 	unmap_region(mm, vma, prev, start, end);
 
@@ -2592,14 +2606,63 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 map_count_exceeded:
 split_failed:
 userfaultfd_error:
-	mas_destroy(&mas);
+	mas_destroy(mas);
 	return error;
 }
 
+/*
+ * do_mas_munmap() - munmap a given range.
+ * @mas: The maple state
+ * @mm: The mm_struct
+ * @start: The start address to munmap
+ * @len: The length of the range to munmap
+ * @uf: The userfaultfd list_head
+ * @downgrade: set to true if the user wants to attempt to write_downgrade the
+ * mmap_sem
+ *
+ * This function takes a @mas that is either pointing to the previous VMA or set
+ * to MA_START and sets it up to remove the mapping(s).  The @len will be
+ * aligned and any arch_unmap work will be preformed.
+ *
+ * Returns: -EINVAL on failure, 1 on success and unlock, 0 otherwise.
+ */
+int do_mas_munmap(struct ma_state *mas, struct mm_struct *mm,
+		  unsigned long start, size_t len, struct list_head *uf,
+		  bool downgrade)
+{
+	unsigned long end;
+	struct vm_area_struct *vma;
+
+	if ((offset_in_page(start)) || start > TASK_SIZE || len > TASK_SIZE-start)
+		return -EINVAL;
+
+	end = start + PAGE_ALIGN(len);
+	if (end == start)
+		return -EINVAL;
+
+	 /* arch_unmap() might do unmaps itself.  */
+	arch_unmap(mm, start, end);
+
+	/* Find the first overlapping VMA */
+	vma = mas_find(mas, end - 1);
+	if (!vma)
+		return 0;
+
+	return do_mas_align_munmap(mas, vma, mm, start, end, uf, downgrade);
+}
+
+/* do_munmap() - Wrapper function for non-maple tree aware do_munmap() calls.
+ * @mm: The mm_struct
+ * @start: The start address to munmap
+ * @len: The length to be munmapped.
+ * @uf: The userfaultfd list_head
+ */
 int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 	      struct list_head *uf)
 {
-	return __do_munmap(mm, start, len, uf, false);
+	MA_STATE(mas, &mm->mm_mt, start, start);
+
+	return do_mas_munmap(&mas, mm, start, len, uf, false);
 }
 
 unsigned long mmap_region(struct file *file, unsigned long addr,
@@ -2633,7 +2696,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	}
 
 	/* Unmap any existing mapping in the area */
-	if (do_munmap(mm, addr, len, uf))
+	if (do_mas_munmap(&mas, mm, addr, len, uf, false))
 		return -ENOMEM;
 
 	/*
@@ -2845,11 +2908,12 @@ static int __vm_munmap(unsigned long start, size_t len, bool downgrade)
 	int ret;
 	struct mm_struct *mm = current->mm;
 	LIST_HEAD(uf);
+	MA_STATE(mas, &mm->mm_mt, start, start);
 
 	if (mmap_write_lock_killable(mm))
 		return -EINTR;
 
-	ret = __do_munmap(mm, start, len, &uf, downgrade);
+	ret = do_mas_munmap(&mas, mm, start, len, &uf, downgrade);
 	/*
 	 * Returning 1 indicates mmap_lock is downgraded.
 	 * But 1 is not legal return value of vm_munmap() and munmap(), reset
@@ -2984,10 +3048,7 @@ static int do_brk_munmap(struct ma_state *mas, struct vm_area_struct *vma,
 	if (likely((vma->vm_end < oldbrk) ||
 		   ((vma->vm_start == newbrk) && (vma->vm_end == oldbrk)))) {
 		/* remove entire mapping(s) */
-		mas_set(mas, newbrk);
-		if (vma->vm_start != newbrk)
-			mas_reset(mas); /* cause a re-walk for the first overlap. */
-		ret = __do_munmap(mm, newbrk, oldbrk - newbrk, uf, true);
+		ret = do_mas_munmap(mas, mm, newbrk, oldbrk-newbrk, uf, true);
 		goto munmap_full_vma;
 	}
 
@@ -3168,9 +3229,7 @@ int vm_brk_flags(unsigned long addr, unsigned long request, unsigned long flags)
 	if (ret)
 		goto limits_failed;
 
-	if (find_vma_intersection(mm, addr, addr + len))
-		ret = do_munmap(mm, addr, len, &uf);
-
+	ret = do_mas_munmap(&mas, mm, addr, len, &uf, 0);
 	if (ret)
 		goto munmap_failed;
 
diff --git a/mm/mremap.c b/mm/mremap.c
index 98f50e633009..4495f69eccbe 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -975,20 +975,23 @@ SYSCALL_DEFINE5(mremap, unsigned long, addr, unsigned long, old_len,
 	/*
 	 * Always allow a shrinking remap: that just unmaps
 	 * the unnecessary pages..
-	 * __do_munmap does all the needed commit accounting, and
+	 * do_mas_munmap does all the needed commit accounting, and
 	 * downgrades mmap_lock to read if so directed.
 	 */
 	if (old_len >= new_len) {
 		int retval;
+		MA_STATE(mas, &mm->mm_mt, addr + new_len, addr + new_len);
 
-		retval = __do_munmap(mm, addr+new_len, old_len - new_len,
-				  &uf_unmap, true);
-		if (retval < 0 && old_len != new_len) {
-			ret = retval;
-			goto out;
+		retval = do_mas_munmap(&mas, mm, addr + new_len,
+				       old_len - new_len, &uf_unmap, true);
 		/* Returning 1 indicates mmap_lock is downgraded to read. */
-		} else if (retval == 1)
+		if (retval == 1) {
 			downgraded = true;
+		} else if (retval < 0 && old_len != new_len) {
+			ret = retval;
+			goto out;
+		}
+
 		ret = addr;
 		goto out;
 	}
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 29/69] mm/mmap: change do_brk_munmap() to use do_mas_align_munmap()
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (11 preceding siblings ...)
  2022-05-04  1:13   ` [PATCH v9 28/69] mm/mmap: reorganize munmap to use maple states Liam Howlett
@ 2022-05-04  1:13   ` Liam Howlett
  2022-05-04  1:13   ` [PATCH v9 30/69] arm64: remove mmap linked list from vdso Liam Howlett
                     ` (39 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:13 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

do_brk_munmap() has already aligned the address and has a maple tree state
to be used.  Use the new do_mas_align_munmap() to avoid unnecessary
alignment and error checks.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 mm/mmap.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index dd21f0a3f236..c3609e4e6f12 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3041,14 +3041,15 @@ static int do_brk_munmap(struct ma_state *mas, struct vm_area_struct *vma,
 	struct mm_struct *mm = vma->vm_mm;
 	struct vm_area_struct unmap;
 	unsigned long unmap_pages;
-	int ret = 1;
+	int ret;
 
 	arch_unmap(mm, newbrk, oldbrk);
 
 	if (likely((vma->vm_end < oldbrk) ||
 		   ((vma->vm_start == newbrk) && (vma->vm_end == oldbrk)))) {
 		/* remove entire mapping(s) */
-		ret = do_mas_munmap(mas, mm, newbrk, oldbrk-newbrk, uf, true);
+		ret = do_mas_align_munmap(mas, vma, mm, newbrk, oldbrk, uf,
+					  true);
 		goto munmap_full_vma;
 	}
 
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 31/69] arm64: Change elfcore for_each_mte_vma() to use VMA iterator
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (13 preceding siblings ...)
  2022-05-04  1:13   ` [PATCH v9 30/69] arm64: remove mmap linked list from vdso Liam Howlett
@ 2022-05-04  1:13   ` Liam Howlett
  2022-05-04  1:13   ` [PATCH v9 33/69] powerpc: remove mmap linked list walks Liam Howlett
                     ` (37 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:13 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Rework for_each_mte_vma() to use a VMA iterator instead of an explicit
linked-list.
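
The replacement pattern used throughout the series is a VMA_ITERATOR()
declaration followed by for_each_vma().  A minimal sketch, assuming 'mm'
is a valid mm_struct and the caller holds mmap_lock:

	struct vm_area_struct *vma;
	unsigned long nr_vmas = 0;
	VMA_ITERATOR(vmi, mm, 0);	/* iterate from address 0 */

	for_each_vma(vmi, vma)
		nr_vmas++;		/* visits each VMA in address order */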

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/20220218023650.672072-1-Liam.Howlett@oracle.com
Signed-off-by: Will Deacon <will@kernel.org>
---
 arch/arm64/kernel/elfcore.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/kernel/elfcore.c b/arch/arm64/kernel/elfcore.c
index 2b3f3d0544b9..e66ab0f09256 100644
--- a/arch/arm64/kernel/elfcore.c
+++ b/arch/arm64/kernel/elfcore.c
@@ -8,9 +8,9 @@
 #include <asm/cpufeature.h>
 #include <asm/mte.h>
 
-#define for_each_mte_vma(tsk, vma)					\
+#define for_each_mte_vma(vmi, vma)					\
 	if (system_supports_mte())					\
-		for (vma = tsk->mm->mmap; vma; vma = vma->vm_next)	\
+		for_each_vma(vmi, vma)					\
 			if (vma->vm_flags & VM_MTE)
 
 static unsigned long mte_vma_tag_dump_size(struct vm_area_struct *vma)
@@ -81,8 +81,9 @@ Elf_Half elf_core_extra_phdrs(void)
 {
 	struct vm_area_struct *vma;
 	int vma_count = 0;
+	VMA_ITERATOR(vmi, current->mm, 0);
 
-	for_each_mte_vma(current, vma)
+	for_each_mte_vma(vmi, vma)
 		vma_count++;
 
 	return vma_count;
@@ -91,8 +92,9 @@ Elf_Half elf_core_extra_phdrs(void)
 int elf_core_write_extra_phdrs(struct coredump_params *cprm, loff_t offset)
 {
 	struct vm_area_struct *vma;
+	VMA_ITERATOR(vmi, current->mm, 0);
 
-	for_each_mte_vma(current, vma) {
+	for_each_mte_vma(vmi, vma) {
 		struct elf_phdr phdr;
 
 		phdr.p_type = PT_ARM_MEMTAG_MTE;
@@ -116,8 +118,9 @@ size_t elf_core_extra_data_size(void)
 {
 	struct vm_area_struct *vma;
 	size_t data_size = 0;
+	VMA_ITERATOR(vmi, current->mm, 0);
 
-	for_each_mte_vma(current, vma)
+	for_each_mte_vma(vmi, vma)
 		data_size += mte_vma_tag_dump_size(vma);
 
 	return data_size;
@@ -126,8 +129,9 @@ size_t elf_core_extra_data_size(void)
 int elf_core_write_extra_data(struct coredump_params *cprm)
 {
 	struct vm_area_struct *vma;
+	VMA_ITERATOR(vmi, current->mm, 0);
 
-	for_each_mte_vma(current, vma) {
+	for_each_mte_vma(vmi, vma) {
 		if (vma->vm_flags & VM_DONTDUMP)
 			continue;
 
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 30/69] arm64: remove mmap linked list from vdso
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (12 preceding siblings ...)
  2022-05-04  1:13   ` [PATCH v9 29/69] mm/mmap: change do_brk_munmap() to use do_mas_align_munmap() Liam Howlett
@ 2022-05-04  1:13   ` Liam Howlett
  2022-05-04  1:13   ` [PATCH v9 31/69] arm64: Change elfcore for_each_mte_vma() to use VMA iterator Liam Howlett
                     ` (38 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:13 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Use the VMA iterator instead.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 arch/arm64/kernel/vdso.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
index a61fc4f989b3..a8388af62b99 100644
--- a/arch/arm64/kernel/vdso.c
+++ b/arch/arm64/kernel/vdso.c
@@ -136,10 +136,11 @@ int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
 {
 	struct mm_struct *mm = task->mm;
 	struct vm_area_struct *vma;
+	VMA_ITERATOR(vmi, mm, 0);
 
 	mmap_read_lock(mm);
 
-	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+	for_each_vma(vmi, vma) {
 		unsigned long size = vma->vm_end - vma->vm_start;
 
 		if (vma_is_special_mapping(vma, vdso_info[VDSO_ABI_AA64].dm))
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 32/69] parisc: remove mmap linked list from cache handling
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (15 preceding siblings ...)
  2022-05-04  1:13   ` [PATCH v9 33/69] powerpc: remove mmap linked list walks Liam Howlett
@ 2022-05-04  1:13   ` Liam Howlett
  2022-05-04  1:13   ` [PATCH v9 35/69] x86: remove vma linked list walks Liam Howlett
                     ` (35 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:13 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Use the VMA iterator instead.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 arch/parisc/kernel/cache.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/parisc/kernel/cache.c b/arch/parisc/kernel/cache.c
index 23348199f3f8..ab7c789541bf 100644
--- a/arch/parisc/kernel/cache.c
+++ b/arch/parisc/kernel/cache.c
@@ -536,9 +536,11 @@ static inline unsigned long mm_total_size(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
 	unsigned long usize = 0;
+	VMA_ITERATOR(vmi, mm, 0);
 
-	for (vma = mm->mmap; vma; vma = vma->vm_next)
+	for_each_vma(vmi, vma)
 		usize += vma->vm_end - vma->vm_start;
+
 	return usize;
 }
 
@@ -578,6 +580,7 @@ static void flush_cache_pages(struct vm_area_struct *vma, struct mm_struct *mm,
 void flush_cache_mm(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
+	VMA_ITERATOR(vmi, mm, 0);
 
 	/* Flushing the whole cache on each cpu takes forever on
 	   rp3440, etc.  So, avoid it if the mm isn't too big.  */
@@ -589,7 +592,7 @@ void flush_cache_mm(struct mm_struct *mm)
 		return;
 	}
 
-	for (vma = mm->mmap; vma; vma = vma->vm_next)
+	for_each_vma(vmi, vma)
 		flush_cache_pages(vma, mm, vma->vm_start, vma->vm_end);
 }
 
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 33/69] powerpc: remove mmap linked list walks
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (14 preceding siblings ...)
  2022-05-04  1:13   ` [PATCH v9 31/69] arm64: Change elfcore for_each_mte_vma() to use VMA iterator Liam Howlett
@ 2022-05-04  1:13   ` Liam Howlett
  2022-05-04  1:13   ` [PATCH v9 32/69] parisc: remove mmap linked list from cache handling Liam Howlett
                     ` (36 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:13 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Use the VMA iterator instead.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
 arch/powerpc/kernel/vdso.c              |  6 +++---
 arch/powerpc/mm/book3s32/tlb.c          | 11 ++++++-----
 arch/powerpc/mm/book3s64/subpage_prot.c | 13 ++-----------
 3 files changed, 11 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
index 717f2c9a7573..f70db911e061 100644
--- a/arch/powerpc/kernel/vdso.c
+++ b/arch/powerpc/kernel/vdso.c
@@ -114,18 +114,18 @@ struct vdso_data *arch_get_vdso_data(void *vvar_page)
 int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
 {
 	struct mm_struct *mm = task->mm;
+	VMA_ITERATOR(vmi, mm, 0);
 	struct vm_area_struct *vma;
 
 	mmap_read_lock(mm);
-
-	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+	for_each_vma(vmi, vma) {
 		unsigned long size = vma->vm_end - vma->vm_start;
 
 		if (vma_is_special_mapping(vma, &vvar_spec))
 			zap_page_range(vma, vma->vm_start, size);
 	}
-
 	mmap_read_unlock(mm);
+
 	return 0;
 }
 
diff --git a/arch/powerpc/mm/book3s32/tlb.c b/arch/powerpc/mm/book3s32/tlb.c
index 19f0ef950d77..9ad6b56bfec9 100644
--- a/arch/powerpc/mm/book3s32/tlb.c
+++ b/arch/powerpc/mm/book3s32/tlb.c
@@ -81,14 +81,15 @@ EXPORT_SYMBOL(hash__flush_range);
 void hash__flush_tlb_mm(struct mm_struct *mm)
 {
 	struct vm_area_struct *mp;
+	VMA_ITERATOR(vmi, mm, 0);
 
 	/*
-	 * It is safe to go down the mm's list of vmas when called
-	 * from dup_mmap, holding mmap_lock.  It would also be safe from
-	 * unmap_region or exit_mmap, but not from vmtruncate on SMP -
-	 * but it seems dup_mmap is the only SMP case which gets here.
+	 * It is safe to iterate the vmas when called from dup_mmap,
+	 * holding mmap_lock.  It would also be safe from unmap_region
+	 * or exit_mmap, but not from vmtruncate on SMP - but it seems
+	 * dup_mmap is the only SMP case which gets here.
 	 */
-	for (mp = mm->mmap; mp != NULL; mp = mp->vm_next)
+	for_each_vma(vmi, mp)
 		hash__flush_range(mp->vm_mm, mp->vm_start, mp->vm_end);
 }
 EXPORT_SYMBOL(hash__flush_tlb_mm);
diff --git a/arch/powerpc/mm/book3s64/subpage_prot.c b/arch/powerpc/mm/book3s64/subpage_prot.c
index 60c6ea16a972..d73b3b4176e8 100644
--- a/arch/powerpc/mm/book3s64/subpage_prot.c
+++ b/arch/powerpc/mm/book3s64/subpage_prot.c
@@ -149,24 +149,15 @@ static void subpage_mark_vma_nohuge(struct mm_struct *mm, unsigned long addr,
 				    unsigned long len)
 {
 	struct vm_area_struct *vma;
+	VMA_ITERATOR(vmi, mm, addr);
 
 	/*
 	 * We don't try too hard, we just mark all the vma in that range
 	 * VM_NOHUGEPAGE and split them.
 	 */
-	vma = find_vma(mm, addr);
-	/*
-	 * If the range is in unmapped range, just return
-	 */
-	if (vma && ((addr + len) <= vma->vm_start))
-		return;
-
-	while (vma) {
-		if (vma->vm_start >= (addr + len))
-			break;
+	for_each_vma_range(vmi, vma, addr + len) {
 		vma->vm_flags |= VM_NOHUGEPAGE;
 		walk_page_vma(vma, &subpage_walk_ops, NULL);
-		vma = vma->vm_next;
 	}
 }
 #else
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 34/69] s390: remove vma linked list walks
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (18 preceding siblings ...)
  2022-05-04  1:13   ` [PATCH v9 36/69] xtensa: " Liam Howlett
@ 2022-05-04  1:13   ` Liam Howlett
  2022-05-04  1:13   ` [PATCH v9 38/69] optee: remove vma linked list walk Liam Howlett
                     ` (32 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:13 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Use the VMA iterator instead.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 arch/s390/kernel/vdso.c | 3 ++-
 arch/s390/mm/gmap.c     | 6 ++++--
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/s390/kernel/vdso.c b/arch/s390/kernel/vdso.c
index 99694260cac9..66f7e7c63632 100644
--- a/arch/s390/kernel/vdso.c
+++ b/arch/s390/kernel/vdso.c
@@ -68,10 +68,11 @@ static struct page *find_timens_vvar_page(struct vm_area_struct *vma)
 int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
 {
 	struct mm_struct *mm = task->mm;
+	VMA_ITERATOR(vmi, mm, 0);
 	struct vm_area_struct *vma;
 
 	mmap_read_lock(mm);
-	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+	for_each_vma(vmi, vma) {
 		unsigned long size = vma->vm_end - vma->vm_start;
 
 		if (!vma_is_special_mapping(vma, &vvar_mapping))
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index af03cacf34ec..8a639487f840 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2508,8 +2508,9 @@ static const struct mm_walk_ops thp_split_walk_ops = {
 static inline void thp_split_mm(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
+	VMA_ITERATOR(vmi, mm, 0);
 
-	for (vma = mm->mmap; vma != NULL; vma = vma->vm_next) {
+	for_each_vma(vmi, vma) {
 		vma->vm_flags &= ~VM_HUGEPAGE;
 		vma->vm_flags |= VM_NOHUGEPAGE;
 		walk_page_vma(vma, &thp_split_walk_ops, NULL);
@@ -2577,8 +2578,9 @@ int gmap_mark_unmergeable(void)
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
 	int ret;
+	VMA_ITERATOR(vmi, mm, 0);
 
-	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+	for_each_vma(vmi, vma) {
 		ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
 				  MADV_UNMERGEABLE, &vma->vm_flags);
 		if (ret)
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 36/69] xtensa: remove vma linked list walks
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (17 preceding siblings ...)
  2022-05-04  1:13   ` [PATCH v9 35/69] x86: remove vma linked list walks Liam Howlett
@ 2022-05-04  1:13   ` Liam Howlett
  2022-05-04  1:13   ` [PATCH v9 34/69] s390: " Liam Howlett
                     ` (33 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:13 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Use the VMA iterator instead.  Since the VMA can no longer be NULL inside
the loop, the out-of-memory case is handled outside the loop.  This means
a slightly longer run time in the failure case (-ENOMEM): it will run to
the end of the VMAs before erroring out instead of stopping in the middle
of the loop.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 arch/xtensa/kernel/syscall.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/arch/xtensa/kernel/syscall.c b/arch/xtensa/kernel/syscall.c
index 201356faa7e6..b3c2450d6f23 100644
--- a/arch/xtensa/kernel/syscall.c
+++ b/arch/xtensa/kernel/syscall.c
@@ -58,6 +58,7 @@ unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr,
 		unsigned long len, unsigned long pgoff, unsigned long flags)
 {
 	struct vm_area_struct *vmm;
+	struct vma_iterator vmi;
 
 	if (flags & MAP_FIXED) {
 		/* We do not accept a shared mapping if it would violate
@@ -79,15 +80,20 @@ unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr,
 	else
 		addr = PAGE_ALIGN(addr);
 
-	for (vmm = find_vma(current->mm, addr); ; vmm = vmm->vm_next) {
-		/* At this point:  (!vmm || addr < vmm->vm_end). */
-		if (TASK_SIZE - len < addr)
-			return -ENOMEM;
-		if (!vmm || addr + len <= vm_start_gap(vmm))
-			return addr;
+	vma_iter_init(&vmi, current->mm, addr);
+	for_each_vma(vmi, vmm) {
+		/* At this point:  (addr < vmm->vm_end). */
+		if (addr + len <= vm_start_gap(vmm))
+			break;
+
 		addr = vmm->vm_end;
 		if (flags & MAP_SHARED)
 			addr = COLOUR_ALIGN(addr, pgoff);
 	}
+
+	if (TASK_SIZE - len < addr)
+		return -ENOMEM;
+
+	return addr;
 }
 #endif
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 35/69] x86: remove vma linked list walks
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (16 preceding siblings ...)
  2022-05-04  1:13   ` [PATCH v9 32/69] parisc: remove mmap linked list from cache handling Liam Howlett
@ 2022-05-04  1:13   ` Liam Howlett
  2022-05-04  1:13   ` [PATCH v9 36/69] xtensa: " Liam Howlett
                     ` (34 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:13 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Use the VMA iterator instead.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 arch/x86/entry/vdso/vma.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c
index 235a5794296a..3883da001c62 100644
--- a/arch/x86/entry/vdso/vma.c
+++ b/arch/x86/entry/vdso/vma.c
@@ -127,17 +127,17 @@ int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
 {
 	struct mm_struct *mm = task->mm;
 	struct vm_area_struct *vma;
+	VMA_ITERATOR(vmi, mm, 0);
 
 	mmap_read_lock(mm);
-
-	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+	for_each_vma(vmi, vma) {
 		unsigned long size = vma->vm_end - vma->vm_start;
 
 		if (vma_is_special_mapping(vma, &vvar_mapping))
 			zap_page_range(vma, vma->vm_start, size);
 	}
-
 	mmap_read_unlock(mm);
+
 	return 0;
 }
 #else
@@ -354,6 +354,7 @@ int map_vdso_once(const struct vdso_image *image, unsigned long addr)
 {
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
+	VMA_ITERATOR(vmi, mm, 0);
 
 	mmap_write_lock(mm);
 	/*
@@ -363,7 +364,7 @@ int map_vdso_once(const struct vdso_image *image, unsigned long addr)
 	 * We could search vma near context.vdso, but it's a slowpath,
 	 * so let's explicitly check all VMAs to be completely sure.
 	 */
-	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+	for_each_vma(vmi, vma) {
 		if (vma_is_special_mapping(vma, &vdso_mapping) ||
 				vma_is_special_mapping(vma, &vvar_mapping)) {
 			mmap_write_unlock(mm);
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 39/69] um: remove vma linked list walk
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (20 preceding siblings ...)
  2022-05-04  1:13   ` [PATCH v9 38/69] optee: remove vma linked list walk Liam Howlett
@ 2022-05-04  1:13   ` Liam Howlett
  2022-05-04  1:13   ` [PATCH v9 37/69] cxl: " Liam Howlett
                     ` (30 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:13 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Use the VMA iterator instead.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 arch/um/kernel/tlb.c | 14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/arch/um/kernel/tlb.c b/arch/um/kernel/tlb.c
index bc38f79ca3a3..ad449173a1a1 100644
--- a/arch/um/kernel/tlb.c
+++ b/arch/um/kernel/tlb.c
@@ -584,21 +584,19 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 
 void flush_tlb_mm(struct mm_struct *mm)
 {
-	struct vm_area_struct *vma = mm->mmap;
+	struct vm_area_struct *vma;
+	VMA_ITERATOR(vmi, mm, 0);
 
-	while (vma != NULL) {
+	for_each_vma(vmi, vma)
 		fix_range(mm, vma->vm_start, vma->vm_end, 0);
-		vma = vma->vm_next;
-	}
 }
 
 void force_flush_all(void)
 {
 	struct mm_struct *mm = current->mm;
-	struct vm_area_struct *vma = mm->mmap;
+	struct vm_area_struct *vma;
+	VMA_ITERATOR(vmi, mm, 0);
 
-	while (vma != NULL) {
+	for_each_vma(vmi, vma)
 		fix_range(mm, vma->vm_start, vma->vm_end, 1);
-		vma = vma->vm_next;
-	}
 }
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 37/69] cxl: remove vma linked list walk
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (21 preceding siblings ...)
  2022-05-04  1:13   ` [PATCH v9 39/69] um: " Liam Howlett
@ 2022-05-04  1:13   ` Liam Howlett
  2022-05-04  1:13   ` [PATCH v9 40/69] coredump: " Liam Howlett
                     ` (29 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:13 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Use the VMA iterator instead.  This requires a little restructuring of the
surrounding code to hoist the mm to the caller.  That turns
cxl_prefault_one() into a trivial function, so call cxl_fault_segment()
directly.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 drivers/misc/cxl/fault.c | 45 ++++++++++++++--------------------------
 1 file changed, 15 insertions(+), 30 deletions(-)

diff --git a/drivers/misc/cxl/fault.c b/drivers/misc/cxl/fault.c
index 60c829113299..2c64f55cf01f 100644
--- a/drivers/misc/cxl/fault.c
+++ b/drivers/misc/cxl/fault.c
@@ -280,22 +280,6 @@ void cxl_handle_fault(struct work_struct *fault_work)
 		mmput(mm);
 }
 
-static void cxl_prefault_one(struct cxl_context *ctx, u64 ea)
-{
-	struct mm_struct *mm;
-
-	mm = get_mem_context(ctx);
-	if (mm == NULL) {
-		pr_devel("cxl_prefault_one unable to get mm %i\n",
-			 pid_nr(ctx->pid));
-		return;
-	}
-
-	cxl_fault_segment(ctx, mm, ea);
-
-	mmput(mm);
-}
-
 static u64 next_segment(u64 ea, u64 vsid)
 {
 	if (vsid & SLB_VSID_B_1T)
@@ -306,23 +290,16 @@ static u64 next_segment(u64 ea, u64 vsid)
 	return ea + 1;
 }
 
-static void cxl_prefault_vma(struct cxl_context *ctx)
+static void cxl_prefault_vma(struct cxl_context *ctx, struct mm_struct *mm)
 {
 	u64 ea, last_esid = 0;
 	struct copro_slb slb;
+	VMA_ITERATOR(vmi, mm, 0);
 	struct vm_area_struct *vma;
 	int rc;
-	struct mm_struct *mm;
-
-	mm = get_mem_context(ctx);
-	if (mm == NULL) {
-		pr_devel("cxl_prefault_vm unable to get mm %i\n",
-			 pid_nr(ctx->pid));
-		return;
-	}
 
 	mmap_read_lock(mm);
-	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+	for_each_vma(vmi, vma) {
 		for (ea = vma->vm_start; ea < vma->vm_end;
 				ea = next_segment(ea, slb.vsid)) {
 			rc = copro_calculate_slb(mm, ea, &slb);
@@ -337,20 +314,28 @@ static void cxl_prefault_vma(struct cxl_context *ctx)
 		}
 	}
 	mmap_read_unlock(mm);
-
-	mmput(mm);
 }
 
 void cxl_prefault(struct cxl_context *ctx, u64 wed)
 {
+	struct mm_struct *mm = get_mem_context(ctx);
+
+	if (mm == NULL) {
+		pr_devel("cxl_prefault unable to get mm %i\n",
+			 pid_nr(ctx->pid));
+		return;
+	}
+
 	switch (ctx->afu->prefault_mode) {
 	case CXL_PREFAULT_WED:
-		cxl_prefault_one(ctx, wed);
+		cxl_fault_segment(ctx, mm, wed);
 		break;
 	case CXL_PREFAULT_ALL:
-		cxl_prefault_vma(ctx);
+		cxl_prefault_vma(ctx, mm);
 		break;
 	default:
 		break;
 	}
+
+	mmput(mm);
 }
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 38/69] optee: remove vma linked list walk
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (19 preceding siblings ...)
  2022-05-04  1:13   ` [PATCH v9 34/69] s390: " Liam Howlett
@ 2022-05-04  1:13   ` Liam Howlett
  2022-05-04  1:13   ` [PATCH v9 39/69] um: " Liam Howlett
                     ` (31 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:13 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Use the VMA iterator instead.  Change the calling convention of
__check_mem_type() to pass in the mm instead of the first vma in the
range.
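
The range-bound variant used here, for_each_vma_range(), only visits the
VMAs that overlap [start, end).  A minimal sketch of the pattern
(some_predicate() is a hypothetical check; 'mm', 'start' and 'end' are
assumed to come from the caller):

	struct vm_area_struct *vma;
	VMA_ITERATOR(vmi, mm, start);
	int ret = 0;

	for_each_vma_range(vmi, vma, end) {
		if (!some_predicate(vma)) {
			ret = -EINVAL;
			break;
		}
	}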

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 drivers/tee/optee/call.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/drivers/tee/optee/call.c b/drivers/tee/optee/call.c
index bd49ec934060..d8115dcae6e7 100644
--- a/drivers/tee/optee/call.c
+++ b/drivers/tee/optee/call.c
@@ -342,15 +342,18 @@ static bool is_normal_memory(pgprot_t p)
 #endif
 }
 
-static int __check_mem_type(struct vm_area_struct *vma, unsigned long end)
+static int __check_mem_type(struct mm_struct *mm, unsigned long start,
+				unsigned long end)
 {
-	while (vma && is_normal_memory(vma->vm_page_prot)) {
-		if (vma->vm_end >= end)
-			return 0;
-		vma = vma->vm_next;
+	struct vm_area_struct *vma;
+	VMA_ITERATOR(vmi, mm, start);
+
+	for_each_vma_range(vmi, vma, end) {
+		if (!is_normal_memory(vma->vm_page_prot))
+			return -EINVAL;
 	}
 
-	return -EINVAL;
+	return 0;
 }
 
 int optee_check_mem_type(unsigned long start, size_t num_pages)
@@ -366,8 +369,7 @@ int optee_check_mem_type(unsigned long start, size_t num_pages)
 		return 0;
 
 	mmap_read_lock(mm);
-	rc = __check_mem_type(find_vma(mm, start),
-			      start + num_pages * PAGE_SIZE);
+	rc = __check_mem_type(mm, start, start + num_pages * PAGE_SIZE);
 	mmap_read_unlock(mm);
 
 	return rc;
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 41/69] exec: use VMA iterator instead of linked list
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (23 preceding siblings ...)
  2022-05-04  1:13   ` [PATCH v9 40/69] coredump: " Liam Howlett
@ 2022-05-04  1:13   ` Liam Howlett
  2022-05-04  1:13   ` [PATCH v9 42/69] fs/proc/base: use maple tree iterators in place " Liam Howlett
                     ` (27 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:13 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Remove a use of the vm_next list by doing the initial lookup with the VMA
iterator and then using it to find the next entry.
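
Because the iterator keeps its position, one VMA_ITERATOR serves both the
initial lookup and the follow-up query that used to read vma->vm_next.  A
minimal sketch, with 'mm' and 'addr' assumed to come from the caller:

	struct vm_area_struct *vma, *next;
	VMA_ITERATOR(vmi, mm, addr);

	vma = vma_next(&vmi);	/* what find_vma(mm, addr) used to return */
	next = vma_next(&vmi);	/* the VMA after that, or NULL */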

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 fs/exec.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 14e7278a1ab8..b5e3bfd52b53 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -686,6 +686,8 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
 	unsigned long length = old_end - old_start;
 	unsigned long new_start = old_start - shift;
 	unsigned long new_end = old_end - shift;
+	VMA_ITERATOR(vmi, mm, new_start);
+	struct vm_area_struct *next;
 	struct mmu_gather tlb;
 
 	BUG_ON(new_start > new_end);
@@ -694,7 +696,7 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
 	 * ensure there are no vmas between where we want to go
 	 * and where we are
 	 */
-	if (vma != find_vma(mm, new_start))
+	if (vma != vma_next(&vmi))
 		return -EFAULT;
 
 	/*
@@ -713,12 +715,13 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
 
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm);
+	next = vma_next(&vmi);
 	if (new_end > old_start) {
 		/*
 		 * when the old and new regions overlap clear from new_end.
 		 */
 		free_pgd_range(&tlb, new_end, old_end, new_end,
-			vma->vm_next ? vma->vm_next->vm_start : USER_PGTABLES_CEILING);
+			next ? next->vm_start : USER_PGTABLES_CEILING);
 	} else {
 		/*
 		 * otherwise, clean from old_start; this is done to not touch
@@ -727,7 +730,7 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
 		 * for the others its just a little faster.
 		 */
 		free_pgd_range(&tlb, old_start, old_end, new_end,
-			vma->vm_next ? vma->vm_next->vm_start : USER_PGTABLES_CEILING);
+			next ? next->vm_start : USER_PGTABLES_CEILING);
 	}
 	tlb_finish_mmu(&tlb);
 
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 42/69] fs/proc/base: use maple tree iterators in place of linked list
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (24 preceding siblings ...)
  2022-05-04  1:13   ` [PATCH v9 41/69] exec: use VMA iterator instead of linked list Liam Howlett
@ 2022-05-04  1:13   ` Liam Howlett
  2022-05-04  1:13   ` [PATCH v9 43/69] fs/proc/task_mmu: stop using linked list and highest_vm_end Liam Howlett
                     ` (26 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:13 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 fs/proc/base.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 8dfa36a99c74..617816168748 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2322,6 +2322,7 @@ proc_map_files_readdir(struct file *file, struct dir_context *ctx)
 	GENRADIX(struct map_files_info) fa;
 	struct map_files_info *p;
 	int ret;
+	MA_STATE(mas, NULL, 0, 0);
 
 	genradix_init(&fa);
 
@@ -2349,6 +2350,7 @@ proc_map_files_readdir(struct file *file, struct dir_context *ctx)
 	}
 
 	nr_files = 0;
+	mas.tree = &mm->mm_mt;
 
 	/*
 	 * We need two passes here:
@@ -2360,7 +2362,8 @@ proc_map_files_readdir(struct file *file, struct dir_context *ctx)
 	 * routine might require mmap_lock taken in might_fault().
 	 */
 
-	for (vma = mm->mmap, pos = 2; vma; vma = vma->vm_next) {
+	pos = 2;
+	mas_for_each(&mas, vma, ULONG_MAX) {
 		if (!vma->vm_file)
 			continue;
 		if (++pos <= ctx->pos)
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 40/69] coredump: remove vma linked list walk
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (22 preceding siblings ...)
  2022-05-04  1:13   ` [PATCH v9 37/69] cxl: " Liam Howlett
@ 2022-05-04  1:13   ` Liam Howlett
  2022-05-04  1:13   ` [PATCH v9 41/69] exec: use VMA iterator instead of linked list Liam Howlett
                     ` (28 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:13 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Use the Maple Tree iterator instead.  This is too complicated for the VMA
iterator to handle, so let's open-code it for now.  If this turns out to
be a common pattern, we can migrate it to common code.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 fs/coredump.c | 34 ++++++++++++----------------------
 1 file changed, 12 insertions(+), 22 deletions(-)
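
A minimal sketch (not part of the patch) of the open-coded walk this
helper enables: starting from MA_STATE(..., 0, 0), each mas_next() call
returns the next VMA in address order, and the gate VMA is visited once
after the tree is exhausted.  The counting function is made up for
illustration.

	#include <linux/mm.h>

	static int sketch_count_vmas(struct mm_struct *mm,
				     struct vm_area_struct *gate_vma)
	{
		MA_STATE(mas, &mm->mm_mt, 0, 0);
		struct vm_area_struct *vma;
		int nr = 0;

		while ((vma = mas_next(&mas, ULONG_MAX)) != NULL)
			nr++;			/* every VMA in the tree */
		if (gate_vma)
			nr++;			/* plus the gate VMA, once */
		return nr;
	}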

diff --git a/fs/coredump.c b/fs/coredump.c
index ebc43f960b64..3a0022c1ca36 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -1072,30 +1072,20 @@ static unsigned long vma_dump_size(struct vm_area_struct *vma,
 	return vma->vm_end - vma->vm_start;
 }
 
-static struct vm_area_struct *first_vma(struct task_struct *tsk,
-					struct vm_area_struct *gate_vma)
-{
-	struct vm_area_struct *ret = tsk->mm->mmap;
-
-	if (ret)
-		return ret;
-	return gate_vma;
-}
-
 /*
  * Helper function for iterating across a vma list.  It ensures that the caller
  * will visit `gate_vma' prior to terminating the search.
  */
-static struct vm_area_struct *next_vma(struct vm_area_struct *this_vma,
+static struct vm_area_struct *coredump_next_vma(struct ma_state *mas,
+				       struct vm_area_struct *vma,
 				       struct vm_area_struct *gate_vma)
 {
-	struct vm_area_struct *ret;
-
-	ret = this_vma->vm_next;
-	if (ret)
-		return ret;
-	if (this_vma == gate_vma)
+	if (gate_vma && (vma == gate_vma))
 		return NULL;
+
+	vma = mas_next(mas, ULONG_MAX);
+	if (vma)
+		return vma;
 	return gate_vma;
 }
 
@@ -1119,9 +1109,10 @@ static void free_vma_snapshot(struct coredump_params *cprm)
  */
 static bool dump_vma_snapshot(struct coredump_params *cprm)
 {
-	struct vm_area_struct *vma, *gate_vma;
+	struct vm_area_struct *gate_vma, *vma = NULL;
 	struct mm_struct *mm = current->mm;
-	int i;
+	MA_STATE(mas, &mm->mm_mt, 0, 0);
+	int i = 0;
 
 	/*
 	 * Once the stack expansion code is fixed to not change VMA bounds
@@ -1141,8 +1132,7 @@ static bool dump_vma_snapshot(struct coredump_params *cprm)
 		return false;
 	}
 
-	for (i = 0, vma = first_vma(current, gate_vma); vma != NULL;
-			vma = next_vma(vma, gate_vma), i++) {
+	while ((vma = coredump_next_vma(&mas, vma, gate_vma)) != NULL) {
 		struct core_vma_metadata *m = cprm->vma_meta + i;
 
 		m->start = vma->vm_start;
@@ -1150,10 +1140,10 @@ static bool dump_vma_snapshot(struct coredump_params *cprm)
 		m->flags = vma->vm_flags;
 		m->dump_size = vma_dump_size(vma, cprm->mm_flags);
 		m->pgoff = vma->vm_pgoff;
-
 		m->file = vma->vm_file;
 		if (m->file)
 			get_file(m->file);
+		i++;
 	}
 
 	mmap_write_unlock(mm);
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 43/69] fs/proc/task_mmu: stop using linked list and highest_vm_end
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (25 preceding siblings ...)
  2022-05-04  1:13   ` [PATCH v9 42/69] fs/proc/base: use maple tree iterators in place " Liam Howlett
@ 2022-05-04  1:13   ` Liam Howlett
  2022-05-04  1:13   ` [PATCH v9 45/69] ipc/shm: use VMA iterator instead of linked list Liam Howlett
                     ` (25 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:13 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Remove references to the mm_struct linked list and highest_vm_end in
preparation for their removal.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 fs/proc/internal.h |  2 +-
 fs/proc/task_mmu.c | 73 ++++++++++++++++++++++++++--------------------
 2 files changed, 42 insertions(+), 33 deletions(-)

diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 06a80f78433d..f03000764ce5 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -285,7 +285,7 @@ struct proc_maps_private {
 	struct task_struct *task;
 	struct mm_struct *mm;
 #ifdef CONFIG_MMU
-	struct vm_area_struct *tail_vma;
+	struct vma_iterator iter;
 #endif
 #ifdef CONFIG_NUMA
 	struct mempolicy *task_mempolicy;
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index b940b969b000..b59b053e91d4 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -123,12 +123,26 @@ static void release_task_mempolicy(struct proc_maps_private *priv)
 }
 #endif
 
+static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv,
+						loff_t *ppos)
+{
+	struct vm_area_struct *vma = vma_next(&priv->iter);
+
+	if (vma) {
+		*ppos = vma->vm_start;
+	} else {
+		*ppos = -2UL;
+		vma = get_gate_vma(priv->mm);
+	}
+
+	return vma;
+}
+
 static void *m_start(struct seq_file *m, loff_t *ppos)
 {
 	struct proc_maps_private *priv = m->private;
 	unsigned long last_addr = *ppos;
 	struct mm_struct *mm;
-	struct vm_area_struct *vma;
 
 	/* See m_next(). Zero at the start or after lseek. */
 	if (last_addr == -1UL)
@@ -152,31 +166,21 @@ static void *m_start(struct seq_file *m, loff_t *ppos)
 		return ERR_PTR(-EINTR);
 	}
 
+	vma_iter_init(&priv->iter, mm, last_addr);
 	hold_task_mempolicy(priv);
-	priv->tail_vma = get_gate_vma(mm);
-
-	vma = find_vma(mm, last_addr);
-	if (vma)
-		return vma;
+	if (last_addr == -2UL)
+		return get_gate_vma(mm);
 
-	return priv->tail_vma;
+	return proc_get_vma(priv, ppos);
 }
 
 static void *m_next(struct seq_file *m, void *v, loff_t *ppos)
 {
-	struct proc_maps_private *priv = m->private;
-	struct vm_area_struct *next, *vma = v;
-
-	if (vma == priv->tail_vma)
-		next = NULL;
-	else if (vma->vm_next)
-		next = vma->vm_next;
-	else
-		next = priv->tail_vma;
-
-	*ppos = next ? next->vm_start : -1UL;
-
-	return next;
+	if (*ppos == -2UL) {
+		*ppos = -1UL;
+		return NULL;
+	}
+	return proc_get_vma(m->private, ppos);
 }
 
 static void m_stop(struct seq_file *m, void *v)
@@ -872,16 +876,16 @@ static int show_smaps_rollup(struct seq_file *m, void *v)
 {
 	struct proc_maps_private *priv = m->private;
 	struct mem_size_stats mss;
-	struct mm_struct *mm;
+	struct mm_struct *mm = priv->mm;
 	struct vm_area_struct *vma;
-	unsigned long last_vma_end = 0;
+	unsigned long vma_start = 0, last_vma_end = 0;
 	int ret = 0;
+	MA_STATE(mas, &mm->mm_mt, 0, 0);
 
 	priv->task = get_proc_task(priv->inode);
 	if (!priv->task)
 		return -ESRCH;
 
-	mm = priv->mm;
 	if (!mm || !mmget_not_zero(mm)) {
 		ret = -ESRCH;
 		goto out_put_task;
@@ -894,8 +898,13 @@ static int show_smaps_rollup(struct seq_file *m, void *v)
 		goto out_put_mm;
 
 	hold_task_mempolicy(priv);
+	vma = mas_find(&mas, 0);
+
+	if (unlikely(!vma))
+		goto empty_set;
 
-	for (vma = priv->mm->mmap; vma;) {
+	vma_start = vma->vm_start;
+	do {
 		smap_gather_stats(vma, &mss, 0);
 		last_vma_end = vma->vm_end;
 
@@ -904,6 +913,7 @@ static int show_smaps_rollup(struct seq_file *m, void *v)
 		 * access it for write request.
 		 */
 		if (mmap_lock_is_contended(mm)) {
+			mas_pause(&mas);
 			mmap_read_unlock(mm);
 			ret = mmap_read_lock_killable(mm);
 			if (ret) {
@@ -947,7 +957,7 @@ static int show_smaps_rollup(struct seq_file *m, void *v)
 			 *    contains last_vma_end.
 			 *    Iterate VMA' from last_vma_end.
 			 */
-			vma = find_vma(mm, last_vma_end - 1);
+			vma = mas_find(&mas, ULONG_MAX);
 			/* Case 3 above */
 			if (!vma)
 				break;
@@ -961,11 +971,10 @@ static int show_smaps_rollup(struct seq_file *m, void *v)
 				smap_gather_stats(vma, &mss, last_vma_end);
 		}
 		/* Case 2 above */
-		vma = vma->vm_next;
-	}
+	} while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);
 
-	show_vma_header_prefix(m, priv->mm->mmap->vm_start,
-			       last_vma_end, 0, 0, 0, 0);
+empty_set:
+	show_vma_header_prefix(m, vma_start, last_vma_end, 0, 0, 0, 0);
 	seq_pad(m, ' ');
 	seq_puts(m, "[rollup]\n");
 
@@ -1258,6 +1267,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 		return -ESRCH;
 	mm = get_task_mm(task);
 	if (mm) {
+		MA_STATE(mas, &mm->mm_mt, 0, 0);
 		struct mmu_notifier_range range;
 		struct clear_refs_private cp = {
 			.type = type,
@@ -1277,7 +1287,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 		}
 
 		if (type == CLEAR_REFS_SOFT_DIRTY) {
-			for (vma = mm->mmap; vma; vma = vma->vm_next) {
+			mas_for_each(&mas, vma, ULONG_MAX) {
 				if (!(vma->vm_flags & VM_SOFTDIRTY))
 					continue;
 				vma->vm_flags &= ~VM_SOFTDIRTY;
@@ -1289,8 +1299,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 						0, NULL, mm, 0, -1UL);
 			mmu_notifier_invalidate_range_start(&range);
 		}
-		walk_page_range(mm, 0, mm->highest_vm_end, &clear_refs_walk_ops,
-				&cp);
+		walk_page_range(mm, 0, -1, &clear_refs_walk_ops, &cp);
 		if (type == CLEAR_REFS_SOFT_DIRTY) {
 			mmu_notifier_invalidate_range_end(&range);
 			flush_tlb_mm(mm);
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 45/69] ipc/shm: use VMA iterator instead of linked list
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (26 preceding siblings ...)
  2022-05-04  1:13   ` [PATCH v9 43/69] fs/proc/task_mmu: stop using linked list and highest_vm_end Liam Howlett
@ 2022-05-04  1:13   ` Liam Howlett
  2022-05-04  1:13   ` [PATCH v9 44/69] userfaultfd: use maple tree iterator to iterate VMAs Liam Howlett
                     ` (24 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:13 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

The VMA iterator is faster than the linked list, and it can be walked
even when VMAs are being removed from the address space, so there's no
need to keep track of 'next'.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 ipc/shm.c | 21 ++++++++++-----------
 1 file changed, 10 insertions(+), 11 deletions(-)
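
A minimal sketch (not part of the patch) of the pattern used below:
walking VMAs while unmapping some of them, pausing the iterator after
each tree modification so the next step re-walks from the current
address.  The mm, addr and should_detach() names are assumptions for the
example, and the caller is assumed to hold mmap_write_lock(mm).

	VMA_ITERATOR(vmi, mm, addr);
	struct vm_area_struct *vma;

	for_each_vma(vmi, vma) {
		if (!should_detach(vma))	/* made-up predicate */
			continue;
		do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start, NULL);
		mas_pause(&vmi.mas);		/* the tree changed under the iterator */
	}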

diff --git a/ipc/shm.c b/ipc/shm.c
index b3048ebd5c31..7d86f058fb86 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -1721,7 +1721,7 @@ long ksys_shmdt(char __user *shmaddr)
 #ifdef CONFIG_MMU
 	loff_t size = 0;
 	struct file *file;
-	struct vm_area_struct *next;
+	VMA_ITERATOR(vmi, mm, addr);
 #endif
 
 	if (addr & ~PAGE_MASK)
@@ -1751,12 +1751,9 @@ long ksys_shmdt(char __user *shmaddr)
 	 * match the usual checks anyway. So assume all vma's are
 	 * above the starting address given.
 	 */
-	vma = find_vma(mm, addr);
 
 #ifdef CONFIG_MMU
-	while (vma) {
-		next = vma->vm_next;
-
+	for_each_vma(vmi, vma) {
 		/*
 		 * Check if the starting address would match, i.e. it's
 		 * a fragment created by mprotect() and/or munmap(), or it
@@ -1774,6 +1771,7 @@ long ksys_shmdt(char __user *shmaddr)
 			file = vma->vm_file;
 			size = i_size_read(file_inode(vma->vm_file));
 			do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start, NULL);
+			mas_pause(&vmi.mas);
 			/*
 			 * We discovered the size of the shm segment, so
 			 * break out of here and fall through to the next
@@ -1781,10 +1779,9 @@ long ksys_shmdt(char __user *shmaddr)
 			 * searching for matching vma's.
 			 */
 			retval = 0;
-			vma = next;
+			vma = vma_next(&vmi);
 			break;
 		}
-		vma = next;
 	}
 
 	/*
@@ -1794,17 +1791,19 @@ long ksys_shmdt(char __user *shmaddr)
 	 */
 	size = PAGE_ALIGN(size);
 	while (vma && (loff_t)(vma->vm_end - addr) <= size) {
-		next = vma->vm_next;
-
 		/* finding a matching vma now does not alter retval */
 		if ((vma->vm_ops == &shm_vm_ops) &&
 		    ((vma->vm_start - addr)/PAGE_SIZE == vma->vm_pgoff) &&
-		    (vma->vm_file == file))
+		    (vma->vm_file == file)) {
 			do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start, NULL);
-		vma = next;
+			mas_pause(&vmi.mas);
+		}
+
+		vma = vma_next(&vmi);
 	}
 
 #else	/* CONFIG_MMU */
+	vma = vma_lookup(mm, addr);
 	/* under NOMMU conditions, the exact address to be destroyed must be
 	 * given
 	 */
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 44/69] userfaultfd: use maple tree iterator to iterate VMAs
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (27 preceding siblings ...)
  2022-05-04  1:13   ` [PATCH v9 45/69] ipc/shm: use VMA iterator instead of linked list Liam Howlett
@ 2022-05-04  1:13   ` Liam Howlett
  2022-05-04  1:14   ` [PATCH v9 47/69] perf: use VMA iterator Liam Howlett
                     ` (23 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:13 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Don't use the mm_struct linked list or vma->vm_next, in preparation for
their removal.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 fs/userfaultfd.c              | 55 +++++++++++++++++++++++------------
 include/linux/userfaultfd_k.h |  7 ++---
 mm/mmap.c                     |  7 ++---
 3 files changed, 42 insertions(+), 27 deletions(-)
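
A minimal sketch (not part of the patch) of for_each_vma_range(), which
replaces the old "for (; vma && vma->vm_start < end; vma = vma->vm_next)"
idiom in userfaultfd_unmap_prep(); mm, start and end are assumed to come
from the caller.

	VMA_ITERATOR(vmi, mm, start);
	struct vm_area_struct *vma;

	for_each_vma_range(vmi, vma, end) {
		/* visits every VMA overlapping [start, end), in address order */
	}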

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index aa0c47cb0d16..af29e5885ed2 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -610,14 +610,16 @@ static void userfaultfd_event_wait_completion(struct userfaultfd_ctx *ctx,
 	if (release_new_ctx) {
 		struct vm_area_struct *vma;
 		struct mm_struct *mm = release_new_ctx->mm;
+		VMA_ITERATOR(vmi, mm, 0);
 
 		/* the various vma->vm_userfaultfd_ctx still points to it */
 		mmap_write_lock(mm);
-		for (vma = mm->mmap; vma; vma = vma->vm_next)
+		for_each_vma(vmi, vma) {
 			if (vma->vm_userfaultfd_ctx.ctx == release_new_ctx) {
 				vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
 				vma->vm_flags &= ~__VM_UFFD_FLAGS;
 			}
+		}
 		mmap_write_unlock(mm);
 
 		userfaultfd_ctx_put(release_new_ctx);
@@ -798,11 +800,13 @@ static bool has_unmap_ctx(struct userfaultfd_ctx *ctx, struct list_head *unmaps,
 	return false;
 }
 
-int userfaultfd_unmap_prep(struct vm_area_struct *vma,
-			   unsigned long start, unsigned long end,
-			   struct list_head *unmaps)
+int userfaultfd_unmap_prep(struct mm_struct *mm, unsigned long start,
+			   unsigned long end, struct list_head *unmaps)
 {
-	for ( ; vma && vma->vm_start < end; vma = vma->vm_next) {
+	VMA_ITERATOR(vmi, mm, start);
+	struct vm_area_struct *vma;
+
+	for_each_vma_range(vmi, vma, end) {
 		struct userfaultfd_unmap_ctx *unmap_ctx;
 		struct userfaultfd_ctx *ctx = vma->vm_userfaultfd_ctx.ctx;
 
@@ -852,6 +856,7 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
 	/* len == 0 means wake all */
 	struct userfaultfd_wake_range range = { .len = 0, };
 	unsigned long new_flags;
+	MA_STATE(mas, &mm->mm_mt, 0, 0);
 
 	WRITE_ONCE(ctx->released, true);
 
@@ -868,7 +873,7 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
 	 */
 	mmap_write_lock(mm);
 	prev = NULL;
-	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+	mas_for_each(&mas, vma, ULONG_MAX) {
 		cond_resched();
 		BUG_ON(!!vma->vm_userfaultfd_ctx.ctx ^
 		       !!(vma->vm_flags & __VM_UFFD_FLAGS));
@@ -1285,6 +1290,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 	bool found;
 	bool basic_ioctls;
 	unsigned long start, end, vma_end;
+	MA_STATE(mas, &mm->mm_mt, 0, 0);
 
 	user_uffdio_register = (struct uffdio_register __user *) arg;
 
@@ -1327,7 +1333,8 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 		goto out;
 
 	mmap_write_lock(mm);
-	vma = find_vma_prev(mm, start, &prev);
+	mas_set(&mas, start);
+	vma = mas_find(&mas, ULONG_MAX);
 	if (!vma)
 		goto out_unlock;
 
@@ -1352,7 +1359,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 	 */
 	found = false;
 	basic_ioctls = false;
-	for (cur = vma; cur && cur->vm_start < end; cur = cur->vm_next) {
+	for (cur = vma; cur; cur = mas_next(&mas, end - 1)) {
 		cond_resched();
 
 		BUG_ON(!!cur->vm_userfaultfd_ctx.ctx ^
@@ -1412,8 +1419,10 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 	}
 	BUG_ON(!found);
 
-	if (vma->vm_start < start)
-		prev = vma;
+	mas_set(&mas, start);
+	prev = mas_prev(&mas, 0);
+	if (prev != vma)
+		mas_next(&mas, ULONG_MAX);
 
 	ret = 0;
 	do {
@@ -1443,6 +1452,8 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 				 ((struct vm_userfaultfd_ctx){ ctx }),
 				 anon_vma_name(vma));
 		if (prev) {
+			/* vma_merge() invalidated the mas */
+			mas_pause(&mas);
 			vma = prev;
 			goto next;
 		}
@@ -1450,11 +1461,15 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 			ret = split_vma(mm, vma, start, 1);
 			if (ret)
 				break;
+			/* split_vma() invalidated the mas */
+			mas_pause(&mas);
 		}
 		if (vma->vm_end > end) {
 			ret = split_vma(mm, vma, end, 0);
 			if (ret)
 				break;
+			/* split_vma() invalidated the mas */
+			mas_pause(&mas);
 		}
 	next:
 		/*
@@ -1471,8 +1486,8 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 	skip:
 		prev = vma;
 		start = vma->vm_end;
-		vma = vma->vm_next;
-	} while (vma && vma->vm_start < end);
+		vma = mas_next(&mas, end - 1);
+	} while (vma);
 out_unlock:
 	mmap_write_unlock(mm);
 	mmput(mm);
@@ -1516,6 +1531,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 	bool found;
 	unsigned long start, end, vma_end;
 	const void __user *buf = (void __user *)arg;
+	MA_STATE(mas, &mm->mm_mt, 0, 0);
 
 	ret = -EFAULT;
 	if (copy_from_user(&uffdio_unregister, buf, sizeof(uffdio_unregister)))
@@ -1534,7 +1550,8 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 		goto out;
 
 	mmap_write_lock(mm);
-	vma = find_vma_prev(mm, start, &prev);
+	mas_set(&mas, start);
+	vma = mas_find(&mas, ULONG_MAX);
 	if (!vma)
 		goto out_unlock;
 
@@ -1559,7 +1576,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 	 */
 	found = false;
 	ret = -EINVAL;
-	for (cur = vma; cur && cur->vm_start < end; cur = cur->vm_next) {
+	for (cur = vma; cur; cur = mas_next(&mas, end - 1)) {
 		cond_resched();
 
 		BUG_ON(!!cur->vm_userfaultfd_ctx.ctx ^
@@ -1579,8 +1596,10 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 	}
 	BUG_ON(!found);
 
-	if (vma->vm_start < start)
-		prev = vma;
+	mas_set(&mas, start);
+	prev = mas_prev(&mas, 0);
+	if (prev != vma)
+		mas_next(&mas, ULONG_MAX);
 
 	ret = 0;
 	do {
@@ -1645,8 +1664,8 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 	skip:
 		prev = vma;
 		start = vma->vm_end;
-		vma = vma->vm_next;
-	} while (vma && vma->vm_start < end);
+		vma = mas_next(&mas, end - 1);
+	} while (vma);
 out_unlock:
 	mmap_write_unlock(mm);
 	mmput(mm);
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 33cea484d1ad..e0b2ec2c20f2 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -139,9 +139,8 @@ extern bool userfaultfd_remove(struct vm_area_struct *vma,
 			       unsigned long start,
 			       unsigned long end);
 
-extern int userfaultfd_unmap_prep(struct vm_area_struct *vma,
-				  unsigned long start, unsigned long end,
-				  struct list_head *uf);
+extern int userfaultfd_unmap_prep(struct mm_struct *mm, unsigned long start,
+				  unsigned long end, struct list_head *uf);
 extern void userfaultfd_unmap_complete(struct mm_struct *mm,
 				       struct list_head *uf);
 
@@ -222,7 +221,7 @@ static inline bool userfaultfd_remove(struct vm_area_struct *vma,
 	return true;
 }
 
-static inline int userfaultfd_unmap_prep(struct vm_area_struct *vma,
+static inline int userfaultfd_unmap_prep(struct mm_struct *mm,
 					 unsigned long start, unsigned long end,
 					 struct list_head *uf)
 {
diff --git a/mm/mmap.c b/mm/mmap.c
index c3609e4e6f12..572a2a474b49 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2552,7 +2552,7 @@ do_mas_align_munmap(struct ma_state *mas, struct vm_area_struct *vma,
 		 * split, despite we could. This is unlikely enough
 		 * failure that it's not worth optimizing it for.
 		 */
-		error = userfaultfd_unmap_prep(vma, start, end, uf);
+		error = userfaultfd_unmap_prep(mm, start, end, uf);
 
 		if (error)
 			goto userfaultfd_error;
@@ -3053,10 +3053,7 @@ static int do_brk_munmap(struct ma_state *mas, struct vm_area_struct *vma,
 		goto munmap_full_vma;
 	}
 
-	vma_init(&unmap, mm);
-	unmap.vm_start = newbrk;
-	unmap.vm_end = oldbrk;
-	ret = userfaultfd_unmap_prep(&unmap, newbrk, oldbrk, uf);
+	ret = userfaultfd_unmap_prep(mm, newbrk, oldbrk, uf);
 	if (ret)
 		return ret;
 	ret = 1;
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 46/69] acct: use VMA iterator instead of linked list
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (29 preceding siblings ...)
  2022-05-04  1:14   ` [PATCH v9 47/69] perf: use VMA iterator Liam Howlett
@ 2022-05-04  1:14   ` Liam Howlett
  2022-05-04  1:14   ` [PATCH v9 48/69] sched: use maple tree iterator to walk VMAs Liam Howlett
                     ` (21 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:14 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

The VMA iterator is faster than the linked list.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 kernel/acct.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/kernel/acct.c b/kernel/acct.c
index 3df53cf1dcd5..2e7bf8d41f04 100644
--- a/kernel/acct.c
+++ b/kernel/acct.c
@@ -535,15 +535,14 @@ void acct_collect(long exitcode, int group_dead)
 	unsigned long vsize = 0;
 
 	if (group_dead && current->mm) {
+		struct mm_struct *mm = current->mm;
+		VMA_ITERATOR(vmi, mm, 0);
 		struct vm_area_struct *vma;
 
-		mmap_read_lock(current->mm);
-		vma = current->mm->mmap;
-		while (vma) {
+		mmap_read_lock(mm);
+		for_each_vma(vmi, vma)
 			vsize += vma->vm_end - vma->vm_start;
-			vma = vma->vm_next;
-		}
-		mmap_read_unlock(current->mm);
+		mmap_read_unlock(mm);
 	}
 
 	spin_lock_irq(&current->sighand->siglock);
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 47/69] perf: use VMA iterator
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (28 preceding siblings ...)
  2022-05-04  1:13   ` [PATCH v9 44/69] userfaultfd: use maple tree iterator to iterate VMAs Liam Howlett
@ 2022-05-04  1:14   ` Liam Howlett
  2022-05-04  1:14   ` [PATCH v9 46/69] acct: use VMA iterator instead of linked list Liam Howlett
                     ` (22 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:14 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

The VMA iterator is faster than the linked list and removing the linked
list will shrink the vm_area_struct.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 kernel/events/core.c    | 3 ++-
 kernel/events/uprobes.c | 9 ++++++---
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 7858bafffa9d..e2da3045d274 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -10211,8 +10211,9 @@ static void perf_addr_filter_apply(struct perf_addr_filter *filter,
 				   struct perf_addr_filter_range *fr)
 {
 	struct vm_area_struct *vma;
+	VMA_ITERATOR(vmi, mm, 0);
 
-	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+	for_each_vma(vmi, vma) {
 		if (!vma->vm_file)
 			continue;
 
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 6418083901d4..84b5a7cdfe81 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -349,9 +349,10 @@ static bool valid_ref_ctr_vma(struct uprobe *uprobe,
 static struct vm_area_struct *
 find_ref_ctr_vma(struct uprobe *uprobe, struct mm_struct *mm)
 {
+	VMA_ITERATOR(vmi, mm, 0);
 	struct vm_area_struct *tmp;
 
-	for (tmp = mm->mmap; tmp; tmp = tmp->vm_next)
+	for_each_vma(vmi, tmp)
 		if (valid_ref_ctr_vma(uprobe, tmp))
 			return tmp;
 
@@ -1230,11 +1231,12 @@ int uprobe_apply(struct inode *inode, loff_t offset,
 
 static int unapply_uprobe(struct uprobe *uprobe, struct mm_struct *mm)
 {
+	VMA_ITERATOR(vmi, mm, 0);
 	struct vm_area_struct *vma;
 	int err = 0;
 
 	mmap_read_lock(mm);
-	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+	for_each_vma(vmi, vma) {
 		unsigned long vaddr;
 		loff_t offset;
 
@@ -1982,9 +1984,10 @@ bool uprobe_deny_signal(void)
 
 static void mmf_recalc_uprobes(struct mm_struct *mm)
 {
+	VMA_ITERATOR(vmi, mm, 0);
 	struct vm_area_struct *vma;
 
-	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+	for_each_vma(vmi, vma) {
 		if (!valid_vma(vma, false))
 			continue;
 		/*
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 48/69] sched: use maple tree iterator to walk VMAs
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (30 preceding siblings ...)
  2022-05-04  1:14   ` [PATCH v9 46/69] acct: use VMA iterator instead of linked list Liam Howlett
@ 2022-05-04  1:14   ` Liam Howlett
  2022-05-04  1:14   ` [PATCH v9 50/69] bpf: remove VMA linked list Liam Howlett
                     ` (20 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:14 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

The linked list is slower than walking the VMAs using the maple tree.  We
can't use the VMA iterator here because it doesn't support moving to an
earlier position.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 kernel/sched/fair.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)
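
A minimal sketch (not part of the patch) of the wrap-around this change
relies on: the raw maple state can be repositioned with mas_set(), which
the VMA iterator does not support, so the scan can restart from address 0
when it runs off the end of the address space.

	MA_STATE(mas, &mm->mm_mt, start, start);
	struct vm_area_struct *vma;

	vma = mas_find(&mas, ULONG_MAX);
	if (!vma) {				/* nothing at or above start */
		mas_set(&mas, 0);		/* rewind to the lowest VMA */
		vma = mas_find(&mas, ULONG_MAX);
	}
	for (; vma; vma = mas_find(&mas, ULONG_MAX)) {
		/* ... scan vma ... */
	}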

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a68482d66535..aa18f180ef7d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2722,6 +2722,7 @@ static void task_numa_work(struct callback_head *work)
 	struct task_struct *p = current;
 	struct mm_struct *mm = p->mm;
 	u64 runtime = p->se.sum_exec_runtime;
+	MA_STATE(mas, &mm->mm_mt, 0, 0);
 	struct vm_area_struct *vma;
 	unsigned long start, end;
 	unsigned long nr_pte_updates = 0;
@@ -2778,13 +2779,16 @@ static void task_numa_work(struct callback_head *work)
 
 	if (!mmap_read_trylock(mm))
 		return;
-	vma = find_vma(mm, start);
+	mas_set(&mas, start);
+	vma = mas_find(&mas, ULONG_MAX);
 	if (!vma) {
 		reset_ptenuma_scan(p);
 		start = 0;
-		vma = mm->mmap;
+		mas_set(&mas, start);
+		vma = mas_find(&mas, ULONG_MAX);
 	}
-	for (; vma; vma = vma->vm_next) {
+
+	for (; vma; vma = mas_find(&mas, ULONG_MAX)) {
 		if (!vma_migratable(vma) || !vma_policy_mof(vma) ||
 			is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_MIXEDMAP)) {
 			continue;
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 49/69] fork: use VMA iterator
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (32 preceding siblings ...)
  2022-05-04  1:14   ` [PATCH v9 50/69] bpf: remove VMA linked list Liam Howlett
@ 2022-05-04  1:14   ` Liam Howlett
  2022-05-04  1:14   ` [PATCH v9 52/69] mm/khugepaged: stop using vma linked list Liam Howlett
                     ` (18 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:14 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

The VMA iterator is faster than the linked list and removing the linked
list will shrink the vm_area_struct.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 kernel/fork.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 4af22dd65fc6..9fcbd0b5c0be 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1291,13 +1291,16 @@ int replace_mm_exe_file(struct mm_struct *mm, struct file *new_exe_file)
 	/* Forbid mm->exe_file change if old file still mapped. */
 	old_exe_file = get_mm_exe_file(mm);
 	if (old_exe_file) {
+		VMA_ITERATOR(vmi, mm, 0);
 		mmap_read_lock(mm);
-		for (vma = mm->mmap; vma && !ret; vma = vma->vm_next) {
+		for_each_vma(vmi, vma) {
 			if (!vma->vm_file)
 				continue;
 			if (path_equal(&vma->vm_file->f_path,
-				       &old_exe_file->f_path))
+				       &old_exe_file->f_path)) {
 				ret = -EBUSY;
+				break;
+			}
 		}
 		mmap_read_unlock(mm);
 		fput(old_exe_file);
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 50/69] bpf: remove VMA linked list
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (31 preceding siblings ...)
  2022-05-04  1:14   ` [PATCH v9 48/69] sched: use maple tree iterator to walk VMAs Liam Howlett
@ 2022-05-04  1:14   ` Liam Howlett
  2022-05-04  1:14   ` [PATCH v9 49/69] fork: use VMA iterator Liam Howlett
                     ` (19 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:14 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Use vma_next() and remove the reference to the start of the linked list.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 kernel/bpf/task_iter.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)
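
A minimal sketch (not part of the patch) of the find_vma() replacements
used below: find_vma() returns the first VMA whose vm_end is above the
given address, so address 0 yields the lowest VMA (formerly mm->mmap) and
vma->vm_end yields its successor (formerly vma->vm_next).

	struct vm_area_struct *vma, *next;

	vma = find_vma(mm, 0);				/* was mm->mmap */
	next = vma ? find_vma(mm, vma->vm_end) : NULL;	/* was vma->vm_next */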

diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
index d94696198ef8..9a0bbc808433 100644
--- a/kernel/bpf/task_iter.c
+++ b/kernel/bpf/task_iter.c
@@ -300,8 +300,8 @@ struct bpf_iter_seq_task_vma_info {
 };
 
 enum bpf_task_vma_iter_find_op {
-	task_vma_iter_first_vma,   /* use mm->mmap */
-	task_vma_iter_next_vma,    /* use curr_vma->vm_next */
+	task_vma_iter_first_vma,   /* use find_vma() with addr 0 */
+	task_vma_iter_next_vma,    /* use vma_next() with curr_vma */
 	task_vma_iter_find_vma,    /* use find_vma() to find next vma */
 };
 
@@ -401,10 +401,10 @@ task_vma_seq_get_next(struct bpf_iter_seq_task_vma_info *info)
 
 	switch (op) {
 	case task_vma_iter_first_vma:
-		curr_vma = curr_task->mm->mmap;
+		curr_vma = find_vma(curr_task->mm, 0);
 		break;
 	case task_vma_iter_next_vma:
-		curr_vma = curr_vma->vm_next;
+		curr_vma = find_vma(curr_task->mm, curr_vma->vm_end);
 		break;
 	case task_vma_iter_find_vma:
 		/* We dropped mmap_lock so it is necessary to use find_vma
@@ -418,7 +418,7 @@ task_vma_seq_get_next(struct bpf_iter_seq_task_vma_info *info)
 		if (curr_vma &&
 		    curr_vma->vm_start == info->prev_vm_start &&
 		    curr_vma->vm_end == info->prev_vm_end)
-			curr_vma = curr_vma->vm_next;
+			curr_vma = find_vma(curr_task->mm, curr_vma->vm_end);
 		break;
 	}
 	if (!curr_vma) {
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 52/69] mm/khugepaged: stop using vma linked list
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (33 preceding siblings ...)
  2022-05-04  1:14   ` [PATCH v9 49/69] fork: use VMA iterator Liam Howlett
@ 2022-05-04  1:14   ` Liam Howlett
  2022-05-04  1:14   ` [PATCH v9 51/69] mm/gup: use maple tree navigation instead of " Liam Howlett
                     ` (17 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:14 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Use the VMA iterator and find_vma() instead of the VMA linked list.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
---
 mm/huge_memory.c |  4 ++--
 mm/khugepaged.c  | 11 ++++++++---
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index c468fee595ff..c72827d9cf04 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2221,11 +2221,11 @@ void vma_adjust_trans_huge(struct vm_area_struct *vma,
 	split_huge_pmd_if_needed(vma, end);
 
 	/*
-	 * If we're also updating the vma->vm_next->vm_start,
+	 * If we're also updating the next vma vm_start,
 	 * check if we need to split it.
 	 */
 	if (adjust_next > 0) {
-		struct vm_area_struct *next = vma->vm_next;
+		struct vm_area_struct *next = find_vma(vma->vm_mm, vma->vm_end);
 		unsigned long nstart = next->vm_start;
 		nstart += adjust_next;
 		split_huge_pmd_if_needed(next, nstart);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 03fda93ade3e..208fc0e19eb1 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2089,10 +2089,12 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 	__releases(&khugepaged_mm_lock)
 	__acquires(&khugepaged_mm_lock)
 {
+	struct vma_iterator vmi;
 	struct mm_slot *mm_slot;
 	struct mm_struct *mm;
 	struct vm_area_struct *vma;
 	int progress = 0;
+	unsigned long address;
 
 	VM_BUG_ON(!pages);
 	lockdep_assert_held(&khugepaged_mm_lock);
@@ -2116,11 +2118,14 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 	vma = NULL;
 	if (unlikely(!mmap_read_trylock(mm)))
 		goto breakouterloop_mmap_lock;
-	if (likely(!khugepaged_test_exit(mm)))
-		vma = find_vma(mm, khugepaged_scan.address);
 
 	progress++;
-	for (; vma; vma = vma->vm_next) {
+	if (unlikely(khugepaged_test_exit(mm)))
+		goto breakouterloop;
+
+	address = khugepaged_scan.address;
+	vma_iter_init(&vmi, mm, address);
+	for_each_vma(vmi, vma) {
 		unsigned long hstart, hend;
 
 		cond_resched();
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 53/69] mm/ksm: use vma iterators instead of vma linked list
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (35 preceding siblings ...)
  2022-05-04  1:14   ` [PATCH v9 51/69] mm/gup: use maple tree navigation instead of " Liam Howlett
@ 2022-05-04  1:14   ` Liam Howlett
  2022-05-04  1:14   ` [PATCH v9 54/69] mm/madvise: use vma_find() " Liam Howlett
                     ` (15 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:14 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Remove the use of the VMA linked list in preparation for its eventual
removal.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 mm/ksm.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/mm/ksm.c b/mm/ksm.c
index 94bb0f049806..ea3e66241976 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -980,11 +980,13 @@ static int unmerge_and_remove_all_rmap_items(void)
 						struct mm_slot, mm_list);
 	spin_unlock(&ksm_mmlist_lock);
 
-	for (mm_slot = ksm_scan.mm_slot;
-			mm_slot != &ksm_mm_head; mm_slot = ksm_scan.mm_slot) {
+	for (mm_slot = ksm_scan.mm_slot; mm_slot != &ksm_mm_head;
+	     mm_slot = ksm_scan.mm_slot) {
+		VMA_ITERATOR(vmi, mm_slot->mm, 0);
+
 		mm = mm_slot->mm;
 		mmap_read_lock(mm);
-		for (vma = mm->mmap; vma; vma = vma->vm_next) {
+		for_each_vma(vmi, vma) {
 			if (ksm_test_exit(mm))
 				break;
 			if (!(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma)
@@ -2221,6 +2223,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
 	struct mm_slot *slot;
 	struct vm_area_struct *vma;
 	struct rmap_item *rmap_item;
+	struct vma_iterator vmi;
 	int nid;
 
 	if (list_empty(&ksm_mm_head.mm_list))
@@ -2279,13 +2282,13 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
 	}
 
 	mm = slot->mm;
+	vma_iter_init(&vmi, mm, ksm_scan.address);
+
 	mmap_read_lock(mm);
 	if (ksm_test_exit(mm))
-		vma = NULL;
-	else
-		vma = find_vma(mm, ksm_scan.address);
+		goto no_vmas;
 
-	for (; vma; vma = vma->vm_next) {
+	for_each_vma(vmi, vma) {
 		if (!(vma->vm_flags & VM_MERGEABLE))
 			continue;
 		if (ksm_scan.address < vma->vm_start)
@@ -2323,6 +2326,7 @@ static struct rmap_item *scan_get_next_rmap_item(struct page **page)
 	}
 
 	if (ksm_test_exit(mm)) {
+no_vmas:
 		ksm_scan.address = 0;
 		ksm_scan.rmap_list = &slot->rmap_list;
 	}
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 51/69] mm/gup: use maple tree navigation instead of linked list
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (34 preceding siblings ...)
  2022-05-04  1:14   ` [PATCH v9 52/69] mm/khugepaged: stop using vma linked list Liam Howlett
@ 2022-05-04  1:14   ` Liam Howlett
  2022-05-04  1:14   ` [PATCH v9 53/69] mm/ksm: use vma iterators instead of vma " Liam Howlett
                     ` (16 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:14 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Use find_vma_intersection() to locate the VMAs in __mm_populate() instead
of using find_vma() and the linked list.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 mm/gup.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)
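
A minimal sketch (not part of the patch) of the find_vma_intersection()
semantics relied on here: it returns a VMA only if one overlaps
[start, end), so the separate "vma->vm_start >= end" check disappears.

	struct vm_area_struct *vma;

	vma = find_vma_intersection(mm, start, end);
	if (vma) {
		/* vma->vm_start < end && vma->vm_end > start is guaranteed */
	}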

diff --git a/mm/gup.c b/mm/gup.c
index f598a037eb04..28fd5d5aa557 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1533,10 +1533,11 @@ int __mm_populate(unsigned long start, unsigned long len, int ignore_errors)
 		if (!locked) {
 			locked = 1;
 			mmap_read_lock(mm);
-			vma = find_vma(mm, nstart);
+			vma = find_vma_intersection(mm, nstart, end);
 		} else if (nstart >= vma->vm_end)
-			vma = vma->vm_next;
-		if (!vma || vma->vm_start >= end)
+			vma = find_vma_intersection(mm, vma->vm_end, end);
+
+		if (!vma)
 			break;
 		/*
 		 * Set [nstart; nend) to intersection of desired address
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 56/69] mm/mempolicy: use vma iterator & maple state instead of vma linked list
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (38 preceding siblings ...)
  2022-05-04  1:14   ` [PATCH v9 55/69] mm/memcontrol: stop using mm->highest_vm_end Liam Howlett
@ 2022-05-04  1:14   ` Liam Howlett
  2022-05-04  1:14   ` [PATCH v9 59/69] mm/mremap: use vma_find_intersection() " Liam Howlett
                     ` (12 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:14 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Rework the way mbind_range() finds the first VMA so that the maple state
is reused and the number of tree walks is limited.

Note that this drops the VM_BUG_ON(!vma) call, which would catch a start
address higher than the last VMA.  The code was already written so that
success could be returned even when no VMA updates occur, so there should
be no functional change in this scenario with the new code.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
---
 mm/mempolicy.c | 56 ++++++++++++++++++++++++++++----------------------
 1 file changed, 32 insertions(+), 24 deletions(-)
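
A minimal sketch (not part of the patch) of how the reworked
mbind_range() positions the maple state once and derives both 'prev' and
the first VMA of [start, end) from it, as in the hunk below.

	MA_STATE(mas, &mm->mm_mt, start - 1, start - 1);
	struct vm_area_struct *prev, *vma;

	prev = mas_find_rev(&mas, 0);		/* last VMA starting below start */
	if (prev && start < prev->vm_end)
		vma = prev;			/* [start, end) begins inside prev */
	else
		vma = mas_next(&mas, end - 1);	/* first VMA with vm_start < end */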

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 0288ffaea064..df4487767259 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -380,9 +380,10 @@ void mpol_rebind_task(struct task_struct *tsk, const nodemask_t *new)
 void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new)
 {
 	struct vm_area_struct *vma;
+	VMA_ITERATOR(vmi, mm, 0);
 
 	mmap_write_lock(mm);
-	for (vma = mm->mmap; vma; vma = vma->vm_next)
+	for_each_vma(vmi, vma)
 		mpol_rebind_policy(vma->vm_policy, new);
 	mmap_write_unlock(mm);
 }
@@ -649,7 +650,7 @@ static unsigned long change_prot_numa(struct vm_area_struct *vma,
 static int queue_pages_test_walk(unsigned long start, unsigned long end,
 				struct mm_walk *walk)
 {
-	struct vm_area_struct *vma = walk->vma;
+	struct vm_area_struct *next, *vma = walk->vma;
 	struct queue_pages *qp = walk->private;
 	unsigned long endvma = vma->vm_end;
 	unsigned long flags = qp->flags;
@@ -664,9 +665,10 @@ static int queue_pages_test_walk(unsigned long start, unsigned long end,
 			/* hole at head side of range */
 			return -EFAULT;
 	}
+	next = find_vma(vma->vm_mm, vma->vm_end);
 	if (!(flags & MPOL_MF_DISCONTIG_OK) &&
 		((vma->vm_end < qp->end) &&
-		(!vma->vm_next || vma->vm_end < vma->vm_next->vm_start)))
+		(!next || vma->vm_end < next->vm_start)))
 		/* hole at middle or tail of range */
 		return -EFAULT;
 
@@ -780,26 +782,24 @@ static int vma_replace_policy(struct vm_area_struct *vma,
 static int mbind_range(struct mm_struct *mm, unsigned long start,
 		       unsigned long end, struct mempolicy *new_pol)
 {
+	MA_STATE(mas, &mm->mm_mt, start - 1, start - 1);
 	struct vm_area_struct *prev;
 	struct vm_area_struct *vma;
 	int err = 0;
 	pgoff_t pgoff;
-	unsigned long vmstart;
-	unsigned long vmend;
-
-	vma = find_vma(mm, start);
-	VM_BUG_ON(!vma);
 
-	prev = vma->vm_prev;
-	if (start > vma->vm_start)
-		prev = vma;
+	prev = mas_find_rev(&mas, 0);
+	if (prev && (start < prev->vm_end))
+		vma = prev;
+	else
+		vma = mas_next(&mas, end - 1);
 
-	for (; vma && vma->vm_start < end; prev = vma, vma = vma->vm_next) {
-		vmstart = max(start, vma->vm_start);
-		vmend   = min(end, vma->vm_end);
+	for (; vma; vma = mas_next(&mas, end - 1)) {
+		unsigned long vmstart = max(start, vma->vm_start);
+		unsigned long vmend = min(end, vma->vm_end);
 
 		if (mpol_equal(vma_policy(vma), new_pol))
-			continue;
+			goto next;
 
 		pgoff = vma->vm_pgoff +
 			((vmstart - vma->vm_start) >> PAGE_SHIFT);
@@ -808,6 +808,8 @@ static int mbind_range(struct mm_struct *mm, unsigned long start,
 				 new_pol, vma->vm_userfaultfd_ctx,
 				 anon_vma_name(vma));
 		if (prev) {
+			/* vma_merge() invalidated the mas */
+			mas_pause(&mas);
 			vma = prev;
 			goto replace;
 		}
@@ -815,19 +817,25 @@ static int mbind_range(struct mm_struct *mm, unsigned long start,
 			err = split_vma(vma->vm_mm, vma, vmstart, 1);
 			if (err)
 				goto out;
+			/* split_vma() invalidated the mas */
+			mas_pause(&mas);
 		}
 		if (vma->vm_end != vmend) {
 			err = split_vma(vma->vm_mm, vma, vmend, 0);
 			if (err)
 				goto out;
+			/* split_vma() invalidated the mas */
+			mas_pause(&mas);
 		}
- replace:
+replace:
 		err = vma_replace_policy(vma, new_pol);
 		if (err)
 			goto out;
+next:
+		prev = vma;
 	}
 
- out:
+out:
 	return err;
 }
 
@@ -1042,6 +1050,7 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
 			   int flags)
 {
 	nodemask_t nmask;
+	struct vm_area_struct *vma;
 	LIST_HEAD(pagelist);
 	int err = 0;
 	struct migration_target_control mtc = {
@@ -1057,8 +1066,9 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
 	 * need migration.  Between passing in the full user address
 	 * space range and MPOL_MF_DISCONTIG_OK, this call can not fail.
 	 */
+	vma = find_vma(mm, 0);
 	VM_BUG_ON(!(flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)));
-	queue_pages_range(mm, mm->mmap->vm_start, mm->task_size, &nmask,
+	queue_pages_range(mm, vma->vm_start, mm->task_size, &nmask,
 			flags | MPOL_MF_DISCONTIG_OK, &pagelist);
 
 	if (!list_empty(&pagelist)) {
@@ -1188,14 +1198,13 @@ static struct page *new_page(struct page *page, unsigned long start)
 	struct folio *dst, *src = page_folio(page);
 	struct vm_area_struct *vma;
 	unsigned long address;
+	VMA_ITERATOR(vmi, current->mm, start);
 	gfp_t gfp = GFP_HIGHUSER_MOVABLE | __GFP_RETRY_MAYFAIL;
 
-	vma = find_vma(current->mm, start);
-	while (vma) {
+	for_each_vma(vmi, vma) {
 		address = page_address_in_vma(page, vma);
 		if (address != -EFAULT)
 			break;
-		vma = vma->vm_next;
 	}
 
 	if (folio_test_hugetlb(src))
@@ -1473,6 +1482,7 @@ SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, le
 	unsigned long vmend;
 	unsigned long end;
 	int err = -ENOENT;
+	VMA_ITERATOR(vmi, mm, start);
 
 	start = untagged_addr(start);
 	if (start & ~PAGE_MASK)
@@ -1498,9 +1508,7 @@ SYSCALL_DEFINE4(set_mempolicy_home_node, unsigned long, start, unsigned long, le
 	if (end == start)
 		return 0;
 	mmap_write_lock(mm);
-	vma = find_vma(mm, start);
-	for (; vma && vma->vm_start < end;  vma = vma->vm_next) {
-
+	for_each_vma_range(vmi, vma, end) {
 		vmstart = max(start, vma->vm_start);
 		vmend   = min(end, vma->vm_end);
 		new = mpol_dup(vma_policy(vma));
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 54/69] mm/madvise: use vma_find() instead of vma linked list
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (36 preceding siblings ...)
  2022-05-04  1:14   ` [PATCH v9 53/69] mm/ksm: use vma iterators instead of vma " Liam Howlett
@ 2022-05-04  1:14   ` Liam Howlett
  2022-05-04  1:14   ` [PATCH v9 55/69] mm/memcontrol: stop using mm->highest_vm_end Liam Howlett
                     ` (14 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:14 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

madvise_walk_vmas() no longer uses the VMA linked list.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/madvise.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index c965fac7b13a..3d8413a27757 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1227,7 +1227,7 @@ int madvise_walk_vmas(struct mm_struct *mm, unsigned long start,
 		if (start >= end)
 			break;
 		if (prev)
-			vma = prev->vm_next;
+			vma = find_vma(mm, prev->vm_end);
 		else	/* madvise_remove dropped mmap_lock */
 			vma = find_vma(mm, start);
 	}
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 55/69] mm/memcontrol: stop using mm->highest_vm_end
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (37 preceding siblings ...)
  2022-05-04  1:14   ` [PATCH v9 54/69] mm/madvise: use vma_find() " Liam Howlett
@ 2022-05-04  1:14   ` Liam Howlett
  2022-05-04  1:14   ` [PATCH v9 56/69] mm/mempolicy: use vma iterator & maple state instead of vma linked list Liam Howlett
                     ` (13 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:14 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Pass through ULONG_MAX instead.

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 mm/memcontrol.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d180ef985b17..ef0cc6111512 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5755,7 +5755,7 @@ static unsigned long mem_cgroup_count_precharge(struct mm_struct *mm)
 	unsigned long precharge;
 
 	mmap_read_lock(mm);
-	walk_page_range(mm, 0, mm->highest_vm_end, &precharge_walk_ops, NULL);
+	walk_page_range(mm, 0, ULONG_MAX, &precharge_walk_ops, NULL);
 	mmap_read_unlock(mm);
 
 	precharge = mc.precharge;
@@ -6053,9 +6053,7 @@ static void mem_cgroup_move_charge(void)
 	 * When we have consumed all precharges and failed in doing
 	 * additional charge, the page walk just aborts.
 	 */
-	walk_page_range(mc.mm, 0, mc.mm->highest_vm_end, &charge_walk_ops,
-			NULL);
-
+	walk_page_range(mc.mm, 0, ULONG_MAX, &charge_walk_ops, NULL);
 	mmap_read_unlock(mc.mm);
 	atomic_dec(&mc.from->moving_account);
 }
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 58/69] mm/mprotect: use maple tree navigation instead of vma linked list
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (41 preceding siblings ...)
  2022-05-04  1:14   ` [PATCH v9 57/69] mm/mlock: use vma iterator and maple state " Liam Howlett
@ 2022-05-04  1:14   ` Liam Howlett
  2022-05-04  1:14   ` [PATCH v9 60/69] mm/msync: use vma_find() " Liam Howlett
                     ` (9 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:14 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/mprotect.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index b69ce7a7b2b7..81e5392ab13e 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -539,6 +539,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
 	const int grows = prot & (PROT_GROWSDOWN|PROT_GROWSUP);
 	const bool rier = (current->personality & READ_IMPLIES_EXEC) &&
 				(prot & PROT_READ);
+	MA_STATE(mas, &current->mm->mm_mt, start, start);
 
 	start = untagged_addr(start);
 
@@ -570,7 +571,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
 	if ((pkey != -1) && !mm_pkey_is_allocated(current->mm, pkey))
 		goto out;
 
-	vma = find_vma(current->mm, start);
+	vma = mas_find(&mas, ULONG_MAX);
 	error = -ENOMEM;
 	if (!vma)
 		goto out;
@@ -596,7 +597,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
 	if (start > vma->vm_start)
 		prev = vma;
 	else
-		prev = vma->vm_prev;
+		prev = mas_prev(&mas, 0);
 
 	for (nstart = start ; ; ) {
 		unsigned long mask_off_old_flags;
@@ -658,7 +659,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
 		if (nstart >= end)
 			goto out;
 
-		vma = prev->vm_next;
+		vma = find_vma(current->mm, prev->vm_end);
 		if (!vma || vma->vm_start != nstart) {
 			error = -ENOMEM;
 			goto out;
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 59/69] mm/mremap: use vma_find_intersection() instead of vma linked list
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (39 preceding siblings ...)
  2022-05-04  1:14   ` [PATCH v9 56/69] mm/mempolicy: use vma iterator & maple state instead of vma linked list Liam Howlett
@ 2022-05-04  1:14   ` Liam Howlett
  2022-05-04  1:14   ` [PATCH v9 57/69] mm/mlock: use vma iterator and maple state " Liam Howlett
                     ` (11 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:14 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/mremap.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 4495f69eccbe..c0d32330d435 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -716,7 +716,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 	if (excess) {
 		vma->vm_flags |= VM_ACCOUNT;
 		if (split)
-			vma->vm_next->vm_flags |= VM_ACCOUNT;
+			find_vma(mm, vma->vm_end)->vm_flags |= VM_ACCOUNT;
 	}
 
 	return new_addr;
@@ -866,9 +866,10 @@ static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
 static int vma_expandable(struct vm_area_struct *vma, unsigned long delta)
 {
 	unsigned long end = vma->vm_end + delta;
+
 	if (end < vma->vm_end) /* overflow */
 		return 0;
-	if (vma->vm_next && vma->vm_next->vm_start < end) /* intersection */
+	if (find_vma_intersection(vma->vm_mm, vma->vm_end, end))
 		return 0;
 	if (get_unmapped_area(NULL, vma->vm_start, end - vma->vm_start,
 			      0, MAP_FIXED) & ~PAGE_MASK)
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 57/69] mm/mlock: use vma iterator and maple state instead of vma linked list
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (40 preceding siblings ...)
  2022-05-04  1:14   ` [PATCH v9 59/69] mm/mremap: use vma_find_intersection() " Liam Howlett
@ 2022-05-04  1:14   ` Liam Howlett
  2022-05-04  1:14   ` [PATCH v9 58/69] mm/mprotect: use maple tree navigation " Liam Howlett
                     ` (10 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:14 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Handle the overflow check in count_mm_mlocked_page_nr() differently, by
clamping the end of the walk to ULONG_MAX.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 mm/mlock.c | 35 +++++++++++++++++++----------------
 1 file changed, 19 insertions(+), 16 deletions(-)
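
A minimal sketch (not part of the patch) of the overflow handling
mentioned above: start + len is clamped so the end of the VMA range walk
can never wrap past ULONG_MAX.

	unsigned long end;

	if (unlikely(ULONG_MAX - len < start))	/* start + len would overflow */
		end = ULONG_MAX;
	else
		end = start + len;
	/* then: for_each_vma_range(vmi, vma, end) ... */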

diff --git a/mm/mlock.c b/mm/mlock.c
index 716caf851043..c41604ba5197 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -471,6 +471,7 @@ static int apply_vma_lock_flags(unsigned long start, size_t len,
 	unsigned long nstart, end, tmp;
 	struct vm_area_struct *vma, *prev;
 	int error;
+	MA_STATE(mas, &current->mm->mm_mt, start, start);
 
 	VM_BUG_ON(offset_in_page(start));
 	VM_BUG_ON(len != PAGE_ALIGN(len));
@@ -479,13 +480,14 @@ static int apply_vma_lock_flags(unsigned long start, size_t len,
 		return -EINVAL;
 	if (end == start)
 		return 0;
-	vma = find_vma(current->mm, start);
-	if (!vma || vma->vm_start > start)
+	vma = mas_walk(&mas);
+	if (!vma)
 		return -ENOMEM;
 
-	prev = vma->vm_prev;
 	if (start > vma->vm_start)
 		prev = vma;
+	else
+		prev = mas_prev(&mas, 0);
 
 	for (nstart = start ; ; ) {
 		vm_flags_t newflags = vma->vm_flags & VM_LOCKED_CLEAR_MASK;
@@ -505,7 +507,7 @@ static int apply_vma_lock_flags(unsigned long start, size_t len,
 		if (nstart >= end)
 			break;
 
-		vma = prev->vm_next;
+		vma = find_vma(prev->vm_mm, prev->vm_end);
 		if (!vma || vma->vm_start != nstart) {
 			error = -ENOMEM;
 			break;
@@ -526,24 +528,23 @@ static unsigned long count_mm_mlocked_page_nr(struct mm_struct *mm,
 {
 	struct vm_area_struct *vma;
 	unsigned long count = 0;
+	unsigned long end;
+	VMA_ITERATOR(vmi, mm, start);
 
 	if (mm == NULL)
 		mm = current->mm;
 
-	vma = find_vma(mm, start);
-	if (vma == NULL)
-		return 0;
-
-	for (; vma ; vma = vma->vm_next) {
-		if (start >= vma->vm_end)
-			continue;
-		if (start + len <=  vma->vm_start)
-			break;
+	/* Don't overflow past ULONG_MAX */
+	if (unlikely(ULONG_MAX - len < start))
+		end = ULONG_MAX;
+	else
+		end = start + len;
+	for_each_vma_range(vmi, vma, end) {
 		if (vma->vm_flags & VM_LOCKED) {
 			if (start > vma->vm_start)
 				count -= (start - vma->vm_start);
-			if (start + len < vma->vm_end) {
-				count += start + len - vma->vm_start;
+			if (end < vma->vm_end) {
+				count += end - vma->vm_start;
 				break;
 			}
 			count += vma->vm_end - vma->vm_start;
@@ -659,6 +660,7 @@ SYSCALL_DEFINE2(munlock, unsigned long, start, size_t, len)
  */
 static int apply_mlockall_flags(int flags)
 {
+	MA_STATE(mas, &current->mm->mm_mt, 0, 0);
 	struct vm_area_struct *vma, *prev = NULL;
 	vm_flags_t to_add = 0;
 
@@ -679,7 +681,7 @@ static int apply_mlockall_flags(int flags)
 			to_add |= VM_LOCKONFAULT;
 	}
 
-	for (vma = current->mm->mmap; vma ; vma = prev->vm_next) {
+	mas_for_each(&mas, vma, ULONG_MAX) {
 		vm_flags_t newflags;
 
 		newflags = vma->vm_flags & VM_LOCKED_CLEAR_MASK;
@@ -687,6 +689,7 @@ static int apply_mlockall_flags(int flags)
 
 		/* Ignore errors */
 		mlock_fixup(vma, &prev, vma->vm_start, vma->vm_end, newflags);
+		mas_pause(&mas);
 		cond_resched();
 	}
 out:
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 62/69] mm/pagewalk: use vma_find() instead of vma linked list
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (44 preceding siblings ...)
  2022-05-04  1:14   ` [PATCH v9 61/69] mm/oom_kill: use maple tree iterators " Liam Howlett
@ 2022-05-04  1:14   ` Liam Howlett
  2022-05-04  1:14   ` [PATCH v9 63/69] mm/swapfile: use vma iterator " Liam Howlett
                     ` (6 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:14 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

walk_page_range() no longer uses its one remaining vma linked list
reference; replace vma->vm_next with a find_vma() lookup.
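
The replacement pattern is simply a find_vma() lookup keyed on the current
VMA's end address.  A minimal sketch (kernel context and a held mmap lock
assumed; the function name is invented):

#include <linux/mm.h>

/* Illustrative only: visit every VMA overlapping [start, end) without vm_next. */
static void sketch_range_walk(struct mm_struct *mm, unsigned long start,
			      unsigned long end)
{
	struct vm_area_struct *vma = find_vma(mm, start);

	while (vma && vma->vm_start < end) {
		/* ... per-VMA work, as walk_page_range() does ... */
		vma = find_vma(mm, vma->vm_end);	/* was vma->vm_next */
	}
}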

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/pagewalk.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 9b3db11a4d1d..53e5c145fcce 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -456,7 +456,7 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
 		} else { /* inside vma */
 			walk.vma = vma;
 			next = min(end, vma->vm_end);
-			vma = vma->vm_next;
+			vma = find_vma(mm, vma->vm_end);
 
 			err = walk_page_test(start, next, &walk);
 			if (err > 0) {
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 61/69] mm/oom_kill: use maple tree iterators instead of vma linked list
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (43 preceding siblings ...)
  2022-05-04  1:14   ` [PATCH v9 60/69] mm/msync: use vma_find() " Liam Howlett
@ 2022-05-04  1:14   ` Liam Howlett
  2022-05-04  1:14   ` [PATCH v9 62/69] mm/pagewalk: use vma_find() " Liam Howlett
                     ` (7 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:14 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/oom_kill.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 49d7df39b02d..f3adad57f47f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -513,6 +513,7 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
 	bool ret = true;
+	VMA_ITERATOR(vmi, mm, 0);
 
 	/*
 	 * Tell all users of get_user/copy_from_user etc... that the content
@@ -522,7 +523,7 @@ bool __oom_reap_task_mm(struct mm_struct *mm)
 	 */
 	set_bit(MMF_UNSTABLE, &mm->flags);
 
-	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
+	for_each_vma(vmi, vma) {
 		if (vma->vm_flags & (VM_HUGETLB|VM_PFNMAP))
 			continue;
 
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 60/69] mm/msync: use vma_find() instead of vma linked list
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (42 preceding siblings ...)
  2022-05-04  1:14   ` [PATCH v9 58/69] mm/mprotect: use maple tree navigation " Liam Howlett
@ 2022-05-04  1:14   ` Liam Howlett
  2022-05-04  1:14   ` [PATCH v9 61/69] mm/oom_kill: use maple tree iterators " Liam Howlett
                     ` (8 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:14 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/msync.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/msync.c b/mm/msync.c
index 137d1c104f3e..ac4c9bfea2e7 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -104,7 +104,7 @@ SYSCALL_DEFINE3(msync, unsigned long, start, size_t, len, int, flags)
 				error = 0;
 				goto out_unlock;
 			}
-			vma = vma->vm_next;
+			vma = find_vma(mm, vma->vm_end);
 		}
 	}
 out_unlock:
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 65/69] nommu: remove uses of VMA linked list
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (46 preceding siblings ...)
  2022-05-04  1:14   ` [PATCH v9 63/69] mm/swapfile: use vma iterator " Liam Howlett
@ 2022-05-04  1:14   ` Liam Howlett
  2022-05-04  1:14   ` [PATCH v9 64/69] i915: use the VMA iterator Liam Howlett
                     ` (4 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:14 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Use the maple tree or VMA iterator instead.  This is faster and will allow
us to shrink the vm_area_struct once the linked list pointers are removed.
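
The do_munmap() hunk below uses the raw maple state rather than the
iterator.  A minimal sketch of that pattern (kernel context and a held mmap
lock assumed; the function name is invented):

#include <linux/mm.h>

/* Illustrative only: walk the VMAs overlapping [start, end) via the maple tree. */
static unsigned int sketch_count_overlaps(struct mm_struct *mm,
					  unsigned long start,
					  unsigned long end)
{
	MA_STATE(mas, &mm->mm_mt, start, start);
	struct vm_area_struct *vma;
	unsigned int nr = 0;

	vma = mas_find(&mas, end - 1);		/* first VMA overlapping the range */
	while (vma) {
		nr++;
		vma = mas_next(&mas, end - 1);	/* next one, still below end */
	}

	return nr;
}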

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/nommu.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/mm/nommu.c b/mm/nommu.c
index 1c9b4e8c4d5c..d94f6adf9c31 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1383,6 +1383,7 @@ static int shrink_vma(struct mm_struct *mm,
  */
 int do_munmap(struct mm_struct *mm, unsigned long start, size_t len, struct list_head *uf)
 {
+	MA_STATE(mas, &mm->mm_mt, start, start);
 	struct vm_area_struct *vma;
 	unsigned long end;
 	int ret;
@@ -1394,7 +1395,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len, struct list
 	end = start + len;
 
 	/* find the first potentially overlapping VMA */
-	vma = find_vma(mm, start);
+	vma = mas_find(&mas, end - 1);
 	if (!vma) {
 		static int limit;
 		if (limit < 5) {
@@ -1413,7 +1414,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len, struct list
 				return -EINVAL;
 			if (end == vma->vm_end)
 				goto erase_whole_vma;
-			vma = vma->vm_next;
+			vma = mas_next(&mas, end - 1);
 		} while (vma);
 		return -EINVAL;
 	} else {
@@ -1462,6 +1463,7 @@ SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
  */
 void exit_mmap(struct mm_struct *mm)
 {
+	VMA_ITERATOR(vmi, mm, 0);
 	struct vm_area_struct *vma;
 
 	if (!mm)
@@ -1469,12 +1471,17 @@ void exit_mmap(struct mm_struct *mm)
 
 	mm->total_vm = 0;
 
-	while ((vma = mm->mmap)) {
-		mm->mmap = vma->vm_next;
+	/*
+	 * Lock the mm to avoid assert complaining even though this is the only
+	 * user of the mm
+	 */
+	mmap_write_lock(mm);
+	for_each_vma(vmi, vma) {
 		delete_vma_from_mm(vma);
 		delete_vma(mm, vma);
 		cond_resched();
 	}
+	mmap_write_unlock(mm);
 	__mt_destroy(&mm->mm_mt);
 }
 
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 63/69] mm/swapfile: use vma iterator instead of vma linked list
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (45 preceding siblings ...)
  2022-05-04  1:14   ` [PATCH v9 62/69] mm/pagewalk: use vma_find() " Liam Howlett
@ 2022-05-04  1:14   ` Liam Howlett
  2022-05-04  1:14   ` [PATCH v9 65/69] nommu: remove uses of VMA " Liam Howlett
                     ` (5 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:14 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

unuse_mm() no longer needs to reference the linked list.
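
The converted loop follows the usual "walk every VMA, stop on error" shape.
A minimal sketch (kernel context assumed, the per-VMA work elided, and the
function name invented):

#include <linux/mm.h>
#include <linux/sched.h>

/* Illustrative only: iterate all VMAs, bailing out when a callee fails. */
static int sketch_walk_mm(struct mm_struct *mm)
{
	VMA_ITERATOR(vmi, mm, 0);
	struct vm_area_struct *vma;
	int ret = 0;

	mmap_read_lock(mm);
	for_each_vma(vmi, vma) {
		if (vma->anon_vma) {
			/* ... per-VMA work that may set ret ... */
			if (ret)
				break;
		}
		cond_resched();
	}
	mmap_read_unlock(mm);

	return ret;
}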

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/swapfile.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 63c61f8b2611..392bfffc30c9 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1968,14 +1968,16 @@ static int unuse_mm(struct mm_struct *mm, unsigned int type)
 {
 	struct vm_area_struct *vma;
 	int ret = 0;
+	VMA_ITERATOR(vmi, mm, 0);
 
 	mmap_read_lock(mm);
-	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+	for_each_vma(vmi, vma) {
 		if (vma->anon_vma) {
 			ret = unuse_vma(vma, type);
 			if (ret)
 				break;
 		}
+
 		cond_resched();
 	}
 	mmap_read_unlock(mm);
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 64/69] i915: use the VMA iterator
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (47 preceding siblings ...)
  2022-05-04  1:14   ` [PATCH v9 65/69] nommu: remove uses of VMA " Liam Howlett
@ 2022-05-04  1:14   ` Liam Howlett
  2022-05-04  1:14   ` [PATCH v9 66/69] riscv: use vma iterator for vdso Liam Howlett
                     ` (3 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:14 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>

Replace the linked list in probe_range() with the VMA iterator.
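
The new loop checks for holes by comparing each VMA's start against the
running address and uses the iterator's final value to pick the return
code.  A minimal sketch of that shape (kernel context assumed; the flag
checks are elided and the function name is invented):

#include <linux/mm.h>

/* Illustrative only: fail if [addr, addr + len) has a hole between VMAs. */
static int sketch_probe(struct mm_struct *mm, unsigned long addr,
			unsigned long len)
{
	const unsigned long end = addr + len;
	VMA_ITERATOR(vmi, mm, addr);
	struct vm_area_struct *vma;

	mmap_read_lock(mm);
	for_each_vma_range(vmi, vma, end) {
		if (vma->vm_start > addr)	/* hole before this VMA */
			break;
		addr = vma->vm_end;
	}
	mmap_read_unlock(mm);

	/* vma is non-NULL only if the walk stopped early at a hole. */
	return vma ? -EFAULT : 0;
}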

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 drivers/gpu/drm/i915/gem/i915_gem_userptr.c | 14 +++++---------
 1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
index 6d1a71d6404c..e20ee4b611fd 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_userptr.c
@@ -426,12 +426,11 @@ static const struct drm_i915_gem_object_ops i915_gem_userptr_ops = {
 static int
 probe_range(struct mm_struct *mm, unsigned long addr, unsigned long len)
 {
-	const unsigned long end = addr + len;
+	VMA_ITERATOR(vmi, mm, addr);
 	struct vm_area_struct *vma;
-	int ret = -EFAULT;
 
 	mmap_read_lock(mm);
-	for (vma = find_vma(mm, addr); vma; vma = vma->vm_next) {
+	for_each_vma_range(vmi, vma, addr + len) {
 		/* Check for holes, note that we also update the addr below */
 		if (vma->vm_start > addr)
 			break;
@@ -439,16 +438,13 @@ probe_range(struct mm_struct *mm, unsigned long addr, unsigned long len)
 		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
 			break;
 
-		if (vma->vm_end >= end) {
-			ret = 0;
-			break;
-		}
-
 		addr = vma->vm_end;
 	}
 	mmap_read_unlock(mm);
 
-	return ret;
+	if (vma)
+		return -EFAULT;
+	return 0;
 }
 
 /*
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 67/69] mm: remove the vma linked list
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (50 preceding siblings ...)
  2022-05-04  1:14   ` [PATCH v9 68/69] mm/mmap: drop range_has_overlap() function Liam Howlett
@ 2022-05-04  1:14   ` Liam Howlett
  2022-05-13 13:30     ` Qian Cai
  2022-05-04  1:14   ` [PATCH v9 69/69] mm/mmap.c: pass in mapping to __vma_link_file() Liam Howlett
  52 siblings, 1 reply; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:14 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Replace any vm_next use with vma_find().

Update free_pgtables(), unmap_vmas(), and zap_page_range() to use the
maple tree.

Use the new free_pgtables() and unmap_vmas() in do_mas_align_munmap().  At
the same time, alter the loop to be more compact.

Now that free_pgtables() and unmap_vmas() take a maple tree as an
argument, rearrange do_mas_align_munmap() to use the new tree to hold the
vmas to remove.

Remove __vma_link_list() and __vma_unlink_list() as they are exclusively
used to update the linked list.

Drop linked list update from __insert_vm_struct().

Rework the tree validation, as it depended on the linked list.
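
To make the do_mas_align_munmap() rearrangement easier to follow, the
sketch below shows the side-tree idea in isolation.  It is heavily
simplified: the splitting, preallocation, userfaultfd handling, locking and
error paths of the real function are left out, and the function name is
invented.

#include <linux/mm.h>
#include <linux/maple_tree.h>

/*
 * Illustrative only: gather the VMAs overlapping [start, end) into a
 * detached maple tree so they can be unmapped and freed after the range
 * has been erased from mm->mm_mt.
 */
static void sketch_detach_range(struct mm_struct *mm, unsigned long start,
				unsigned long end)
{
	struct maple_tree mt_detach;
	MA_STATE(mas, &mm->mm_mt, start, start);
	MA_STATE(mas_detach, &mt_detach, start, end - 1);
	struct vm_area_struct *vma;

	mt_init_flags(&mt_detach, MM_MT_FLAGS);
	mt_set_external_lock(&mt_detach, &mm->mmap_lock);

	mas_for_each(&mas, vma, end - 1) {
		mas_set_range(&mas_detach, vma->vm_start, vma->vm_end - 1);
		mas_store(&mas_detach, vma);
	}

	/* ... erase [start, end) from mm->mm_mt, unmap, free, and then ... */
	__mt_destroy(&mt_detach);
}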

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
---
 include/linux/mm.h       |   5 +-
 include/linux/mm_types.h |   4 -
 kernel/fork.c            |  19 +-
 mm/debug.c               |  14 +-
 mm/internal.h            |   8 +-
 mm/memory.c              |  33 ++-
 mm/mmap.c                | 439 +++++++++++++++++----------------------
 mm/nommu.c               |   5 -
 mm/util.c                |  40 ----
 9 files changed, 223 insertions(+), 344 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0cc2cb692a78..f16bf2c017ab 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1867,8 +1867,9 @@ void zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
 		  unsigned long size);
 void zap_page_range(struct vm_area_struct *vma, unsigned long address,
 		    unsigned long size);
-void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
-		unsigned long start, unsigned long end);
+void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
+		struct vm_area_struct *start_vma, unsigned long start,
+		unsigned long end);
 
 struct mmu_notifier_range;
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index b844119387a3..bdc5d0a5e76d 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -398,8 +398,6 @@ struct vm_area_struct {
 	unsigned long vm_end;		/* The first byte after our end address
 					   within vm_mm. */
 
-	/* linked list of VM areas per task, sorted by address */
-	struct vm_area_struct *vm_next, *vm_prev;
 	struct mm_struct *vm_mm;	/* The address space we belong to. */
 
 	/*
@@ -463,7 +461,6 @@ struct vm_area_struct {
 struct kioctx_table;
 struct mm_struct {
 	struct {
-		struct vm_area_struct *mmap;		/* list of VMAs */
 		struct maple_tree mm_mt;
 #ifdef CONFIG_MMU
 		unsigned long (*get_unmapped_area) (struct file *filp,
@@ -478,7 +475,6 @@ struct mm_struct {
 		unsigned long mmap_compat_legacy_base;
 #endif
 		unsigned long task_size;	/* size of task vm space */
-		unsigned long highest_vm_end;	/* highest vma end address */
 		pgd_t * pgd;
 
 #ifdef CONFIG_MEMBARRIER
diff --git a/kernel/fork.c b/kernel/fork.c
index 9fcbd0b5c0be..536dc3289734 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -474,7 +474,6 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
 		 */
 		*new = data_race(*orig);
 		INIT_LIST_HEAD(&new->anon_vma_chain);
-		new->vm_next = new->vm_prev = NULL;
 		dup_anon_vma_name(orig, new);
 	}
 	return new;
@@ -579,7 +578,7 @@ static void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm)
 static __latent_entropy int dup_mmap(struct mm_struct *mm,
 					struct mm_struct *oldmm)
 {
-	struct vm_area_struct *mpnt, *tmp, *prev, **pprev;
+	struct vm_area_struct *mpnt, *tmp;
 	int retval;
 	unsigned long charge = 0;
 	MA_STATE(old_mas, &oldmm->mm_mt, 0, 0);
@@ -606,7 +605,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	mm->exec_vm = oldmm->exec_vm;
 	mm->stack_vm = oldmm->stack_vm;
 
-	pprev = &mm->mmap;
 	retval = ksm_fork(mm, oldmm);
 	if (retval)
 		goto out;
@@ -614,12 +612,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	if (retval)
 		goto out;
 
-	retval = mas_expected_entries(&mas, oldmm->map_count);
-	if (retval)
-		goto out;
-
-	prev = NULL;
-
 	retval = mas_expected_entries(&mas, oldmm->map_count);
 	if (retval)
 		goto out;
@@ -691,14 +683,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		if (is_vm_hugetlb_page(tmp))
 			reset_vma_resv_huge_pages(tmp);
 
-		/*
-		 * Link in the new vma and copy the page table entries.
-		 */
-		*pprev = tmp;
-		pprev = &tmp->vm_next;
-		tmp->vm_prev = prev;
-		prev = tmp;
-
 		/* Link the vma into the MT */
 		mas.index = tmp->vm_start;
 		mas.last = tmp->vm_end - 1;
@@ -1115,7 +1099,6 @@ static void mm_init_uprobes_state(struct mm_struct *mm)
 static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 	struct user_namespace *user_ns)
 {
-	mm->mmap = NULL;
 	mt_init_flags(&mm->mm_mt, MM_MT_FLAGS);
 	mt_set_external_lock(&mm->mm_mt, &mm->mmap_lock);
 	atomic_set(&mm->mm_users, 1);
diff --git a/mm/debug.c b/mm/debug.c
index 2d625ca0e326..0fd15ba70d16 100644
--- a/mm/debug.c
+++ b/mm/debug.c
@@ -139,13 +139,11 @@ EXPORT_SYMBOL(dump_page);
 
 void dump_vma(const struct vm_area_struct *vma)
 {
-	pr_emerg("vma %px start %px end %px\n"
-		"next %px prev %px mm %px\n"
+	pr_emerg("vma %px start %px end %px mm %px\n"
 		"prot %lx anon_vma %px vm_ops %px\n"
 		"pgoff %lx file %px private_data %px\n"
 		"flags: %#lx(%pGv)\n",
-		vma, (void *)vma->vm_start, (void *)vma->vm_end, vma->vm_next,
-		vma->vm_prev, vma->vm_mm,
+		vma, (void *)vma->vm_start, (void *)vma->vm_end, vma->vm_mm,
 		(unsigned long)pgprot_val(vma->vm_page_prot),
 		vma->anon_vma, vma->vm_ops, vma->vm_pgoff,
 		vma->vm_file, vma->vm_private_data,
@@ -155,11 +153,11 @@ EXPORT_SYMBOL(dump_vma);
 
 void dump_mm(const struct mm_struct *mm)
 {
-	pr_emerg("mm %px mmap %px task_size %lu\n"
+	pr_emerg("mm %px task_size %lu\n"
 #ifdef CONFIG_MMU
 		"get_unmapped_area %px\n"
 #endif
-		"mmap_base %lu mmap_legacy_base %lu highest_vm_end %lu\n"
+		"mmap_base %lu mmap_legacy_base %lu\n"
 		"pgd %px mm_users %d mm_count %d pgtables_bytes %lu map_count %d\n"
 		"hiwater_rss %lx hiwater_vm %lx total_vm %lx locked_vm %lx\n"
 		"pinned_vm %llx data_vm %lx exec_vm %lx stack_vm %lx\n"
@@ -183,11 +181,11 @@ void dump_mm(const struct mm_struct *mm)
 		"tlb_flush_pending %d\n"
 		"def_flags: %#lx(%pGv)\n",
 
-		mm, mm->mmap, mm->task_size,
+		mm, mm->task_size,
 #ifdef CONFIG_MMU
 		mm->get_unmapped_area,
 #endif
-		mm->mmap_base, mm->mmap_legacy_base, mm->highest_vm_end,
+		mm->mmap_base, mm->mmap_legacy_base,
 		mm->pgd, atomic_read(&mm->mm_users),
 		atomic_read(&mm->mm_count),
 		mm_pgtables_bytes(mm),
diff --git a/mm/internal.h b/mm/internal.h
index ddd09245a6db..fe33cb47935b 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -69,8 +69,9 @@ void folio_rotate_reclaimable(struct folio *folio);
 bool __folio_end_writeback(struct folio *folio);
 void deactivate_file_folio(struct folio *folio);
 
-void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
-		unsigned long floor, unsigned long ceiling);
+void free_pgtables(struct mmu_gather *tlb, struct maple_tree *mt,
+		   struct vm_area_struct *start_vma, unsigned long floor,
+		   unsigned long ceiling);
 void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte);
 
 struct zap_details;
@@ -458,9 +459,6 @@ static inline bool is_data_mapping(vm_flags_t flags)
 }
 
 /* mm/util.c */
-void __vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma,
-		struct vm_area_struct *prev);
-void __vma_unlink_list(struct mm_struct *mm, struct vm_area_struct *vma);
 struct anon_vma *folio_anon_vma(struct folio *folio);
 
 #ifdef CONFIG_MMU
diff --git a/mm/memory.c b/mm/memory.c
index e873a6143181..9e5b3ab8f7c0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -399,12 +399,21 @@ void free_pgd_range(struct mmu_gather *tlb,
 	} while (pgd++, addr = next, addr != end);
 }
 
-void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
-		unsigned long floor, unsigned long ceiling)
+void free_pgtables(struct mmu_gather *tlb, struct maple_tree *mt,
+		   struct vm_area_struct *vma, unsigned long floor,
+		   unsigned long ceiling)
 {
-	while (vma) {
-		struct vm_area_struct *next = vma->vm_next;
+	MA_STATE(mas, mt, vma->vm_end, vma->vm_end);
+
+	do {
 		unsigned long addr = vma->vm_start;
+		struct vm_area_struct *next;
+
+		/*
+		 * Note: USER_PGTABLES_CEILING may be passed as ceiling and may
+		 * be 0.  This will underflow and is okay.
+		 */
+		next = mas_find(&mas, ceiling - 1);
 
 		/*
 		 * Hide vma from rmap and truncate_pagecache before freeing
@@ -423,7 +432,7 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			while (next && next->vm_start <= vma->vm_end + PMD_SIZE
 			       && !is_vm_hugetlb_page(next)) {
 				vma = next;
-				next = vma->vm_next;
+				next = mas_find(&mas, ceiling - 1);
 				unlink_anon_vmas(vma);
 				unlink_file_vma(vma);
 			}
@@ -431,7 +440,7 @@ void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *vma,
 				floor, next ? next->vm_start : ceiling);
 		}
 		vma = next;
-	}
+	} while (vma);
 }
 
 void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
@@ -1632,17 +1641,19 @@ static void unmap_single_vma(struct mmu_gather *tlb,
  * ensure that any thus-far unmapped pages are flushed before unmap_vmas()
  * drops the lock and schedules.
  */
-void unmap_vmas(struct mmu_gather *tlb,
+void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
 		struct vm_area_struct *vma, unsigned long start_addr,
 		unsigned long end_addr)
 {
 	struct mmu_notifier_range range;
+	MA_STATE(mas, mt, vma->vm_end, vma->vm_end);
 
 	mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma, vma->vm_mm,
 				start_addr, end_addr);
 	mmu_notifier_invalidate_range_start(&range);
-	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
+	do {
 		unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
+	} while ((vma = mas_find(&mas, end_addr - 1)) != NULL);
 	mmu_notifier_invalidate_range_end(&range);
 }
 
@@ -1657,8 +1668,11 @@ void unmap_vmas(struct mmu_gather *tlb,
 void zap_page_range(struct vm_area_struct *vma, unsigned long start,
 		unsigned long size)
 {
+	struct maple_tree *mt = &vma->vm_mm->mm_mt;
+	unsigned long end = start + size;
 	struct mmu_notifier_range range;
 	struct mmu_gather tlb;
+	MA_STATE(mas, mt, vma->vm_end, vma->vm_end);
 
 	lru_add_drain();
 	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
@@ -1666,8 +1680,9 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
 	tlb_gather_mmu(&tlb, vma->vm_mm);
 	update_hiwater_rss(vma->vm_mm);
 	mmu_notifier_invalidate_range_start(&range);
-	for ( ; vma && vma->vm_start < range.end; vma = vma->vm_next)
+	do {
 		unmap_single_vma(&tlb, vma, start, range.end, NULL);
+	} while ((vma = mas_find(&mas, end - 1)) != NULL);
 	mmu_notifier_invalidate_range_end(&range);
 	tlb_finish_mmu(&tlb);
 }
diff --git a/mm/mmap.c b/mm/mmap.c
index 572a2a474b49..7704c879bc6d 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -75,9 +75,10 @@ int mmap_rnd_compat_bits __read_mostly = CONFIG_ARCH_MMAP_RND_COMPAT_BITS;
 static bool ignore_rlimit_data;
 core_param(ignore_rlimit_data, ignore_rlimit_data, bool, 0644);
 
-static void unmap_region(struct mm_struct *mm,
+static void unmap_region(struct mm_struct *mm, struct maple_tree *mt,
 		struct vm_area_struct *vma, struct vm_area_struct *prev,
-		unsigned long start, unsigned long end);
+		struct vm_area_struct *next, unsigned long start,
+		unsigned long end);
 
 /* description of effects of mapping type and prot in current implementation.
  * this is due to the limited x86 page protection hardware.  The expected
@@ -177,12 +178,10 @@ void unlink_file_vma(struct vm_area_struct *vma)
 }
 
 /*
- * Close a vm structure and free it, returning the next.
+ * Close a vm structure and free it.
  */
-static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
+static void remove_vma(struct vm_area_struct *vma)
 {
-	struct vm_area_struct *next = vma->vm_next;
-
 	might_sleep();
 	if (vma->vm_ops && vma->vm_ops->close)
 		vma->vm_ops->close(vma);
@@ -190,7 +189,6 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
 		fput(vma->vm_file);
 	mpol_put(vma_policy(vma));
 	vm_area_free(vma);
-	return next;
 }
 
 /*
@@ -215,8 +213,7 @@ static int do_brk_munmap(struct ma_state *mas, struct vm_area_struct *vma,
 			 unsigned long newbrk, unsigned long oldbrk,
 			 struct list_head *uf);
 static int do_brk_flags(struct ma_state *mas, struct vm_area_struct *brkvma,
-			unsigned long addr, unsigned long request,
-			unsigned long flags);
+		unsigned long addr, unsigned long request, unsigned long flags);
 SYSCALL_DEFINE1(brk, unsigned long, brk)
 {
 	unsigned long newbrk, oldbrk, origbrk;
@@ -285,7 +282,6 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
 		 * before calling do_brk_munmap().
 		 */
 		mm->brk = brk;
-		mas.last = oldbrk - 1;
 		ret = do_brk_munmap(&mas, brkvma, newbrk, oldbrk, &uf);
 		if (ret == 1)  {
 			downgraded = true;
@@ -340,42 +336,20 @@ extern void mt_dump(const struct maple_tree *mt);
 static void validate_mm_mt(struct mm_struct *mm)
 {
 	struct maple_tree *mt = &mm->mm_mt;
-	struct vm_area_struct *vma_mt, *vma = mm->mmap;
+	struct vm_area_struct *vma_mt;
 
 	MA_STATE(mas, mt, 0, 0);
-	mas_for_each(&mas, vma_mt, ULONG_MAX) {
-		if (xa_is_zero(vma_mt))
-			continue;
-
-		if (!vma)
-			break;
 
-		if ((vma != vma_mt) ||
-		    (vma->vm_start != vma_mt->vm_start) ||
-		    (vma->vm_end != vma_mt->vm_end) ||
-		    (vma->vm_start != mas.index) ||
-		    (vma->vm_end - 1 != mas.last)) {
+	mas_for_each(&mas, vma_mt, ULONG_MAX) {
+		if ((vma_mt->vm_start != mas.index) ||
+		    (vma_mt->vm_end - 1 != mas.last)) {
 			pr_emerg("issue in %s\n", current->comm);
 			dump_stack();
 			dump_vma(vma_mt);
-			pr_emerg("and vm_next\n");
-			dump_vma(vma->vm_next);
 			pr_emerg("mt piv: %px %lu - %lu\n", vma_mt,
 				 mas.index, mas.last);
 			pr_emerg("mt vma: %px %lu - %lu\n", vma_mt,
 				 vma_mt->vm_start, vma_mt->vm_end);
-			if (vma->vm_prev) {
-				pr_emerg("ll prev: %px %lu - %lu\n",
-					 vma->vm_prev, vma->vm_prev->vm_start,
-					 vma->vm_prev->vm_end);
-			}
-			pr_emerg("ll vma: %px %lu - %lu\n", vma,
-				 vma->vm_start, vma->vm_end);
-			if (vma->vm_next) {
-				pr_emerg("ll next: %px %lu - %lu\n",
-					 vma->vm_next, vma->vm_next->vm_start,
-					 vma->vm_next->vm_end);
-			}
 
 			mt_dump(mas.tree);
 			if (vma_mt->vm_end != mas.last + 1) {
@@ -392,11 +366,7 @@ static void validate_mm_mt(struct mm_struct *mm)
 			}
 			VM_BUG_ON_MM(vma_mt->vm_start != mas.index, mm);
 		}
-		VM_BUG_ON(vma != vma_mt);
-		vma = vma->vm_next;
-
 	}
-	VM_BUG_ON(vma);
 	mt_validate(&mm->mm_mt);
 }
 
@@ -404,12 +374,12 @@ static void validate_mm(struct mm_struct *mm)
 {
 	int bug = 0;
 	int i = 0;
-	unsigned long highest_address = 0;
-	struct vm_area_struct *vma = mm->mmap;
+	struct vm_area_struct *vma;
+	MA_STATE(mas, &mm->mm_mt, 0, 0);
 
 	validate_mm_mt(mm);
 
-	while (vma) {
+	mas_for_each(&mas, vma, ULONG_MAX) {
 #ifdef CONFIG_DEBUG_VM_RB
 		struct anon_vma *anon_vma = vma->anon_vma;
 		struct anon_vma_chain *avc;
@@ -421,18 +391,10 @@ static void validate_mm(struct mm_struct *mm)
 			anon_vma_unlock_read(anon_vma);
 		}
 #endif
-
-		highest_address = vm_end_gap(vma);
-		vma = vma->vm_next;
 		i++;
 	}
 	if (i != mm->map_count) {
-		pr_emerg("map_count %d vm_next %d\n", mm->map_count, i);
-		bug = 1;
-	}
-	if (highest_address != mm->highest_vm_end) {
-		pr_emerg("mm->highest_vm_end %lx, found %lx\n",
-			  mm->highest_vm_end, highest_address);
+		pr_emerg("map_count %d mas_for_each %d\n", mm->map_count, i);
 		bug = 1;
 	}
 	VM_BUG_ON_MM(bug, mm);
@@ -492,29 +454,13 @@ bool range_has_overlap(struct mm_struct *mm, unsigned long start,
 	struct vm_area_struct *existing;
 
 	MA_STATE(mas, &mm->mm_mt, start, start);
+	rcu_read_lock();
 	existing = mas_find(&mas, end - 1);
 	*pprev = mas_prev(&mas, 0);
+	rcu_read_unlock();
 	return existing ? true : false;
 }
 
-/*
- * __vma_next() - Get the next VMA.
- * @mm: The mm_struct.
- * @vma: The current vma.
- *
- * If @vma is NULL, return the first vma in the mm.
- *
- * Returns: The next VMA after @vma.
- */
-static inline struct vm_area_struct *__vma_next(struct mm_struct *mm,
-					 struct vm_area_struct *vma)
-{
-	if (!vma)
-		return mm->mmap;
-
-	return vma->vm_next;
-}
-
 static unsigned long count_vma_pages_range(struct mm_struct *mm,
 		unsigned long addr, unsigned long end)
 {
@@ -599,8 +545,7 @@ static inline void vma_mas_szero(struct ma_state *mas, unsigned long start,
 	mas_store_prealloc(mas, NULL);
 }
 
-static int vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
-			struct vm_area_struct *prev)
+static int vma_link(struct mm_struct *mm, struct vm_area_struct *vma)
 {
 	MA_STATE(mas, &mm->mm_mt, 0, 0);
 	struct address_space *mapping = NULL;
@@ -614,7 +559,6 @@ static int vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 
 	vma_mas_store(vma, &mas);
-	__vma_link_list(mm, vma, prev);
 	__vma_link_file(vma);
 
 	if (mapping)
@@ -625,22 +569,6 @@ static int vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
 	return 0;
 }
 
-/*
- * Helper for vma_adjust() in the split_vma insert case: insert a vma into the
- * mm's list and the mm tree.  It has already been inserted into the interval tree.
- */
-static void __insert_vm_struct(struct mm_struct *mm, struct ma_state *mas,
-		struct vm_area_struct *vma, unsigned long location)
-{
-	struct vm_area_struct *prev;
-
-	mas_set(mas, location);
-	prev = mas_prev(mas, 0);
-	vma_mas_store(vma, mas);
-	__vma_link_list(mm, vma, prev);
-	mm->map_count++;
-}
-
 /*
  * vma_expand - Expand an existing VMA
  *
@@ -717,15 +645,8 @@ inline int vma_expand(struct ma_state *mas, struct vm_area_struct *vma,
 	}
 
 	/* Expanding over the next vma */
-	if (remove_next) {
-		/* Remove from mm linked list - also updates highest_vm_end */
-		__vma_unlink_list(mm, next);
-
-		if (file)
-			__remove_shared_vm_struct(next, file, mapping);
-
-	} else if (!next) {
-		mm->highest_vm_end = vm_end_gap(vma);
+	if (remove_next && file) {
+		__remove_shared_vm_struct(next, file, mapping);
 	}
 
 	if (anon_vma) {
@@ -782,7 +703,6 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	int remove_next = 0;
 	MA_STATE(mas, &mm->mm_mt, 0, 0);
 	struct vm_area_struct *exporter = NULL, *importer = NULL;
-	unsigned long ll_prev = vma->vm_start; /* linked list prev. */
 
 	if (next && !insert) {
 		if (end >= next->vm_end) {
@@ -828,7 +748,7 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 			 * next, if the vma overlaps with it.
 			 */
 			if (remove_next == 2 && !next->anon_vma)
-				exporter = next->vm_next;
+				exporter = find_vma(mm, next->vm_end);
 
 		} else if (end > next->vm_start) {
 			/*
@@ -927,17 +847,14 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 		if (vma->vm_end > end) {
 			if (!insert || (insert->vm_start != end)) {
 				vma_mas_szero(&mas, end, vma->vm_end);
+				mas_reset(&mas);
 				VM_WARN_ON(insert &&
 					   insert->vm_end < vma->vm_end);
-			} else if (insert->vm_start == end) {
-				ll_prev = vma->vm_end;
 			}
 		} else {
 			vma_changed = true;
 		}
 		vma->vm_end = end;
-		if (!next)
-			mm->highest_vm_end = vm_end_gap(vma);
 	}
 
 	if (vma_changed)
@@ -957,17 +874,17 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 		flush_dcache_mmap_unlock(mapping);
 	}
 
-	if (remove_next) {
-		__vma_unlink_list(mm, next);
-		if (file)
-			__remove_shared_vm_struct(next, file, mapping);
+	if (remove_next && file) {
+		__remove_shared_vm_struct(next, file, mapping);
 	} else if (insert) {
 		/*
 		 * split_vma has split insert from vma, and needs
 		 * us to insert it before dropping the locks
 		 * (it may either follow vma or precede it).
 		 */
-		__insert_vm_struct(mm, &mas, insert, ll_prev);
+		mas_reset(&mas);
+		vma_mas_store(insert, &mas);
+		mm->map_count++;
 	}
 
 	if (anon_vma) {
@@ -1006,8 +923,8 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 			/*
 			 * If "next" was removed and vma->vm_end was
 			 * expanded (up) over it, in turn
-			 * "next->vm_prev->vm_end" changed and the
-			 * "vma->vm_next" gap must be updated.
+			 * "next->prev->vm_end" changed and the
+			 * "vma->next" gap must be updated.
 			 */
 			next = next_next;
 		} else {
@@ -1028,34 +945,15 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 			remove_next = 1;
 			end = next->vm_end;
 			goto again;
-		} else if (!next) {
-			/*
-			 * If remove_next == 2 we obviously can't
-			 * reach this path.
-			 *
-			 * If remove_next == 3 we can't reach this
-			 * path because pre-swap() next is always not
-			 * NULL. pre-swap() "next" is not being
-			 * removed and its next->vm_end is not altered
-			 * (and furthermore "end" already matches
-			 * next->vm_end in remove_next == 3).
-			 *
-			 * We reach this only in the remove_next == 1
-			 * case if the "next" vma that was removed was
-			 * the highest vma of the mm. However in such
-			 * case next->vm_end == "end" and the extended
-			 * "vma" has vma->vm_end == next->vm_end so
-			 * mm->highest_vm_end doesn't need any update
-			 * in remove_next == 1 case.
-			 */
-			VM_WARN_ON(mm->highest_vm_end != vm_end_gap(vma));
 		}
 	}
-	if (insert && file)
+	if (insert && file) {
 		uprobe_mmap(insert);
+	}
 
 	mas_destroy(&mas);
 	validate_mm(mm);
+
 	return 0;
 }
 
@@ -1215,10 +1113,10 @@ struct vm_area_struct *vma_merge(struct mm_struct *mm,
 	if (vm_flags & VM_SPECIAL)
 		return NULL;
 
-	next = __vma_next(mm, prev);
+	next = find_vma(mm, prev ? prev->vm_end : 0);
 	area = next;
 	if (area && area->vm_end == end)		/* cases 6, 7, 8 */
-		next = next->vm_next;
+		next = find_vma(mm, next->vm_end);
 
 	/* verify some invariant that must be enforced by the caller */
 	VM_WARN_ON(prev && addr <= prev->vm_start);
@@ -1352,18 +1250,24 @@ static struct anon_vma *reusable_anon_vma(struct vm_area_struct *old, struct vm_
  */
 struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *vma)
 {
+	MA_STATE(mas, &vma->vm_mm->mm_mt, vma->vm_end, vma->vm_end);
 	struct anon_vma *anon_vma = NULL;
+	struct vm_area_struct *prev, *next;
 
 	/* Try next first. */
-	if (vma->vm_next) {
-		anon_vma = reusable_anon_vma(vma->vm_next, vma, vma->vm_next);
+	next = mas_walk(&mas);
+	if (next) {
+		anon_vma = reusable_anon_vma(next, vma, next);
 		if (anon_vma)
 			return anon_vma;
 	}
 
+	prev = mas_prev(&mas, 0);
+	VM_BUG_ON_VMA(prev != vma, vma);
+	prev = mas_prev(&mas, 0);
 	/* Try prev next. */
-	if (vma->vm_prev)
-		anon_vma = reusable_anon_vma(vma->vm_prev, vma->vm_prev, vma);
+	if (prev)
+		anon_vma = reusable_anon_vma(prev, prev, vma);
 
 	/*
 	 * We might reach here with anon_vma == NULL if we can't find
@@ -2112,8 +2016,8 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 	if (gap_addr < address || gap_addr > TASK_SIZE)
 		gap_addr = TASK_SIZE;
 
-	next = vma->vm_next;
-	if (next && next->vm_start < gap_addr && vma_is_accessible(next)) {
+	next = find_vma_intersection(mm, vma->vm_end, gap_addr);
+	if (next && vma_is_accessible(next)) {
 		if (!(next->vm_flags & VM_GROWSUP))
 			return -ENOMEM;
 		/* Check that both stack segments have the same anon_vma? */
@@ -2164,8 +2068,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 				/* Overwrite old entry in mtree. */
 				vma_mas_store(vma, &mas);
 				anon_vma_interval_tree_post_update_vma(vma);
-				if (!vma->vm_next)
-					mm->highest_vm_end = vm_end_gap(vma);
 				spin_unlock(&mm->page_table_lock);
 
 				perf_event_mmap(vma);
@@ -2184,16 +2086,16 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 int expand_downwards(struct vm_area_struct *vma, unsigned long address)
 {
 	struct mm_struct *mm = vma->vm_mm;
+	MA_STATE(mas, &mm->mm_mt, vma->vm_start, vma->vm_start);
 	struct vm_area_struct *prev;
 	int error = 0;
-	MA_STATE(mas, &mm->mm_mt, 0, 0);
 
 	address &= PAGE_MASK;
 	if (address < mmap_min_addr)
 		return -EPERM;
 
 	/* Enforce stack_guard_gap */
-	prev = vma->vm_prev;
+	prev = mas_prev(&mas, 0);
 	/* Check that both stack segments have the same anon_vma? */
 	if (prev && !(prev->vm_flags & VM_GROWSDOWN) &&
 			vma_is_accessible(prev)) {
@@ -2328,25 +2230,26 @@ find_extend_vma(struct mm_struct *mm, unsigned long addr)
 EXPORT_SYMBOL_GPL(find_extend_vma);
 
 /*
- * Ok - we have the memory areas we should free on the vma list,
- * so release them, and do the vma updates.
+ * Ok - we have the memory areas we should free on a maple tree so release them,
+ * and do the vma updates.
  *
  * Called with the mm semaphore held.
  */
-static void remove_vma_list(struct mm_struct *mm, struct vm_area_struct *vma)
+static inline void remove_mt(struct mm_struct *mm, struct ma_state *mas)
 {
 	unsigned long nr_accounted = 0;
+	struct vm_area_struct *vma;
 
 	/* Update high watermark before we lower total_vm */
 	update_hiwater_vm(mm);
-	do {
+	mas_for_each(mas, vma, ULONG_MAX) {
 		long nrpages = vma_pages(vma);
 
 		if (vma->vm_flags & VM_ACCOUNT)
 			nr_accounted += nrpages;
 		vm_stat_account(mm, vma->vm_flags, -nrpages);
-		vma = remove_vma(vma);
-	} while (vma);
+		remove_vma(vma);
+	}
 	vm_unacct_memory(nr_accounted);
 	validate_mm(mm);
 }
@@ -2356,18 +2259,18 @@ static void remove_vma_list(struct mm_struct *mm, struct vm_area_struct *vma)
  *
  * Called with the mm semaphore held.
  */
-static void unmap_region(struct mm_struct *mm,
+static void unmap_region(struct mm_struct *mm, struct maple_tree *mt,
 		struct vm_area_struct *vma, struct vm_area_struct *prev,
+		struct vm_area_struct *next,
 		unsigned long start, unsigned long end)
 {
-	struct vm_area_struct *next = __vma_next(mm, prev);
 	struct mmu_gather tlb;
 
 	lru_add_drain();
 	tlb_gather_mmu(&tlb, mm);
 	update_hiwater_rss(mm);
-	unmap_vmas(&tlb, vma, start, end);
-	free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
+	unmap_vmas(&tlb, mt, vma, start, end);
+	free_pgtables(&tlb, mt, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
 				 next ? next->vm_start : USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb);
 }
@@ -2451,24 +2354,13 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
 	return __split_vma(mm, vma, addr, new_below);
 }
 
-static inline int
-unlock_range(struct vm_area_struct *start, struct vm_area_struct **tail,
-	     unsigned long limit)
+static inline void munmap_sidetree(struct vm_area_struct *vma,
+				   struct ma_state *mas_detach)
 {
-	struct mm_struct *mm = start->vm_mm;
-	struct vm_area_struct *tmp = start;
-	int count = 0;
-
-	while (tmp && tmp->vm_start < limit) {
-		*tail = tmp;
-		count++;
-		if (tmp->vm_flags & VM_LOCKED)
-			mm->locked_vm -= vma_pages(tmp);
-
-		tmp = tmp->vm_next;
-	}
-
-	return count;
+	mas_set_range(mas_detach, vma->vm_start, vma->vm_end - 1);
+	mas_store(mas_detach, vma);
+	if (vma->vm_flags & VM_LOCKED)
+		vma->vm_mm->locked_vm -= vma_pages(vma);
 }
 
 /*
@@ -2488,13 +2380,20 @@ do_mas_align_munmap(struct ma_state *mas, struct vm_area_struct *vma,
 		    struct mm_struct *mm, unsigned long start,
 		    unsigned long end, struct list_head *uf, bool downgrade)
 {
-	struct vm_area_struct *prev, *last;
+	struct vm_area_struct *prev, *next = NULL;
+	struct maple_tree mt_detach;
+	int count = 0;
 	int error = -ENOMEM;
-	/* we have start < vma->vm_end  */
+	MA_STATE(mas_detach, &mt_detach, start, end - 1);
+	mt_init_flags(&mt_detach, MM_MT_FLAGS);
+	mt_set_external_lock(&mt_detach, &mm->mmap_lock);
 
 	if (mas_preallocate(mas, vma, GFP_KERNEL))
 		return -ENOMEM;
 
+	if (mas_preallocate(&mas_detach, vma, GFP_KERNEL))
+		return -ENOMEM;
+
 	mas->last = end - 1;
 	/*
 	 * If we need to split any vma, do it now to save pain later.
@@ -2503,6 +2402,8 @@ do_mas_align_munmap(struct ma_state *mas, struct vm_area_struct *vma,
 	 * unmapped vm_area_struct will remain in use: so lower split_vma
 	 * places tmp vma above, and higher split_vma places tmp vma below.
 	 */
+
+	/* Does it split the first one? */
 	if (start > vma->vm_start) {
 
 		/*
@@ -2513,35 +2414,56 @@ do_mas_align_munmap(struct ma_state *mas, struct vm_area_struct *vma,
 		if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count)
 			goto map_count_exceeded;
 
+		/*
+		 * mas_pause() is not needed since mas->index needs to be set
+		 * differently than vma->vm_end anyways.
+		 */
 		error = __split_vma(mm, vma, start, 0);
 		if (error)
 			goto split_failed;
 
-		prev = vma;
-		vma = __vma_next(mm, prev);
-		mas->index = start;
-		mas_reset(mas);
-	} else {
-		prev = vma->vm_prev;
+		mas_set(mas, start);
+		vma = mas_walk(mas);
 	}
 
-	if (vma->vm_end >= end)
-		last = vma;
-	else
-		last = find_vma_intersection(mm, end - 1, end);
-
-	/* Does it split the last one? */
-	if (last && end < last->vm_end) {
-		error = __split_vma(mm, last, end, 1);
+	prev = mas_prev(mas, 0);
+	if (unlikely((!prev)))
+		mas_set(mas, start);
 
-		if (error)
-			goto split_failed;
+	/*
+	 * Detach a range of VMAs from the mm. Using next as a temp variable as
+	 * it is always overwritten.
+	 */
+	mas_for_each(mas, next, end - 1) {
+		/* Does it split the end? */
+		if (next->vm_end > end) {
+			struct vm_area_struct *split;
 
-		if (vma == last)
-			vma = __vma_next(mm, prev);
-		mas_reset(mas);
+			error = __split_vma(mm, next, end, 1);
+			if (error)
+				goto split_failed;
+
+			mas_set(mas, end);
+			split = mas_prev(mas, 0);
+			munmap_sidetree(split, &mas_detach);
+			count++;
+			if (vma == next)
+				vma = split;
+			break;
+		}
+		count++;
+		munmap_sidetree(next, &mas_detach);
+#ifdef CONFIG_DEBUG_VM_MAPLE_TREE
+		BUG_ON(next->vm_start < start);
+		BUG_ON(next->vm_start > end);
+#endif
 	}
 
+	mas_destroy(&mas_detach);
+
+	if (!next)
+		next = mas_next(mas, ULONG_MAX);
+
 	if (unlikely(uf)) {
 		/*
 		 * If userfaultfd_unmap_prep returns an error the vmas
@@ -2558,35 +2480,36 @@ do_mas_align_munmap(struct ma_state *mas, struct vm_area_struct *vma,
 			goto userfaultfd_error;
 	}
 
-	/*
-	 * unlock any mlock()ed ranges before detaching vmas, count the number
-	 * of VMAs to be dropped, and return the tail entry of the affected
-	 * area.
-	 */
-	mm->map_count -= unlock_range(vma, &last, end);
-	/* Drop removed area from the tree */
+	/* Point of no return */
+	mas_set_range(mas, start, end - 1);
+#if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
+	/* Make sure no VMAs are about to be lost. */
+	{
+		MA_STATE(test, &mt_detach, start, end - 1);
+		struct vm_area_struct *vma_mas, *vma_test;
+		int test_count = 0;
+
+		rcu_read_lock();
+		vma_test = mas_find(&test, end - 1);
+		mas_for_each(mas, vma_mas, end - 1) {
+			BUG_ON(vma_mas != vma_test);
+			test_count++;
+			vma_test = mas_next(&test, end - 1);
+		}
+		rcu_read_unlock();
+		BUG_ON(count != test_count);
+		mas_set_range(mas, start, end - 1);
+	}
+#endif
 	mas_store_prealloc(mas, NULL);
-
-	/* Detach vmas from the MM linked list */
-	vma->vm_prev = NULL;
-	if (prev)
-		prev->vm_next = last->vm_next;
-	else
-		mm->mmap = last->vm_next;
-
-	if (last->vm_next) {
-		last->vm_next->vm_prev = prev;
-		last->vm_next = NULL;
-	} else
-		mm->highest_vm_end = prev ? vm_end_gap(prev) : 0;
-
+	mm->map_count -= count;
 	/*
 	 * Do not downgrade mmap_lock if we are next to VM_GROWSDOWN or
 	 * VM_GROWSUP VMA. Such VMAs can change their size under
 	 * down_read(mmap_lock) and collide with the VMA we are about to unmap.
 	 */
 	if (downgrade) {
-		if (last && (last->vm_flags & VM_GROWSDOWN))
+		if (next && (next->vm_flags & VM_GROWSDOWN))
 			downgrade = false;
 		else if (prev && (prev->vm_flags & VM_GROWSUP))
 			downgrade = false;
@@ -2594,10 +2517,12 @@ do_mas_align_munmap(struct ma_state *mas, struct vm_area_struct *vma,
 			mmap_write_downgrade(mm);
 	}
 
-	unmap_region(mm, vma, prev, start, end);
-
-	/* Fix up all other VM information */
-	remove_vma_list(mm, vma);
+	unmap_region(mm, &mt_detach, vma, prev, next, start, end);
+	/* Statistics and freeing VMAs */
+	mas_set(&mas_detach, start);
+	remove_mt(mm, &mas_detach);
+	validate_mm(mm);
+	__mt_destroy(&mt_detach);
 
 
 	validate_mm(mm);
@@ -2838,7 +2763,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 		i_mmap_lock_write(vma->vm_file->f_mapping);
 
 	vma_mas_store(vma, &mas);
-	__vma_link_list(mm, vma, prev);
 	mm->map_count++;
 	if (vma->vm_file) {
 		if (vma->vm_flags & VM_SHARED)
@@ -2890,7 +2814,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
 	vma->vm_file = NULL;
 
 	/* Undo any partial mapping done by a device driver. */
-	unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
+	unmap_region(mm, mas.tree, vma, prev, next, vma->vm_start, vma->vm_end);
 	charged = 0;
 	if (vm_flags & VM_SHARED)
 		mapping_unmap_writable(file->f_mapping);
@@ -2979,11 +2903,12 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
 		goto out;
 
 	if (start + size > vma->vm_end) {
-		struct vm_area_struct *next;
+		VMA_ITERATOR(vmi, mm, vma->vm_end);
+		struct vm_area_struct *next, *prev = vma;
 
-		for (next = vma->vm_next; next; next = next->vm_next) {
+		for_each_vma_range(vmi, next, start + size) {
 			/* hole between vmas ? */
-			if (next->vm_start != next->vm_prev->vm_end)
+			if (next->vm_start != prev->vm_end)
 				goto out;
 
 			if (next->vm_file != vma->vm_file)
@@ -2992,8 +2917,7 @@ SYSCALL_DEFINE5(remap_file_pages, unsigned long, start, unsigned long, size,
 			if (next->vm_flags != vma->vm_flags)
 				goto out;
 
-			if (start + size <= next->vm_end)
-				break;
+			prev = next;
 		}
 
 		if (!next)
@@ -3039,7 +2963,7 @@ static int do_brk_munmap(struct ma_state *mas, struct vm_area_struct *vma,
 			 struct list_head *uf)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	struct vm_area_struct unmap;
+	struct vm_area_struct unmap, *next;
 	unsigned long unmap_pages;
 	int ret;
 
@@ -3056,6 +2980,7 @@ static int do_brk_munmap(struct ma_state *mas, struct vm_area_struct *vma,
 	ret = userfaultfd_unmap_prep(mm, newbrk, oldbrk, uf);
 	if (ret)
 		return ret;
+
 	ret = 1;
 
 	/* Change the oldbrk of vma to the newbrk of the munmap area */
@@ -3077,6 +3002,7 @@ static int do_brk_munmap(struct ma_state *mas, struct vm_area_struct *vma,
 
 	vma_mas_remove(&unmap, mas);
 
+	vma->vm_end = newbrk;
 	if (vma->anon_vma) {
 		anon_vma_interval_tree_post_update_vma(vma);
 		anon_vma_unlock_write(vma->anon_vma);
@@ -3086,8 +3012,9 @@ static int do_brk_munmap(struct ma_state *mas, struct vm_area_struct *vma,
 	if (vma->vm_flags & VM_LOCKED)
 		mm->locked_vm -= unmap_pages;
 
+	next = mas_next(mas, ULONG_MAX);
 	mmap_write_downgrade(mm);
-	unmap_region(mm, &unmap, vma, newbrk, oldbrk);
+	unmap_region(mm, mas->tree, &unmap, vma, next, newbrk, oldbrk);
 	/* Statistics */
 	vm_stat_account(mm, vma->vm_flags, -unmap_pages);
 	if (vma->vm_flags & VM_ACCOUNT)
@@ -3111,11 +3038,9 @@ static int do_brk_munmap(struct ma_state *mas, struct vm_area_struct *vma,
  * do some brk-specific accounting here.
  */
 static int do_brk_flags(struct ma_state *mas, struct vm_area_struct *vma,
-			unsigned long addr, unsigned long len,
-			unsigned long flags)
+		unsigned long addr, unsigned long len, unsigned long flags)
 {
 	struct mm_struct *mm = current->mm;
-	struct vm_area_struct *prev = NULL;
 	validate_mm_mt(mm);
 
 
@@ -3159,7 +3084,6 @@ static int do_brk_flags(struct ma_state *mas, struct vm_area_struct *vma,
 		khugepaged_enter_vma_merge(vma, flags);
 		goto out;
 	}
-	prev = vma;
 
 	/* create a vma struct for an anonymous mapping */
 	vma = vm_area_alloc(mm);
@@ -3177,12 +3101,6 @@ static int do_brk_flags(struct ma_state *mas, struct vm_area_struct *vma,
 		goto mas_store_fail;
 
 	mm->map_count++;
-
-	if (!prev)
-		prev = mas_prev(mas, 0);
-
-	__vma_link_list(mm, vma, prev);
-	mm->map_count++;
 out:
 	perf_event_mmap(vma);
 	mm->total_vm += len >> PAGE_SHIFT;
@@ -3190,7 +3108,7 @@ static int do_brk_flags(struct ma_state *mas, struct vm_area_struct *vma,
 	if (flags & VM_LOCKED)
 		mm->locked_vm += (len >> PAGE_SHIFT);
 	vma->vm_flags |= VM_SOFTDIRTY;
-	validate_mm_mt(mm);
+	validate_mm(mm);
 	return 0;
 
 mas_store_fail:
@@ -3264,6 +3182,8 @@ void exit_mmap(struct mm_struct *mm)
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
 	unsigned long nr_accounted = 0;
+	MA_STATE(mas, &mm->mm_mt, 0, 0);
+	int count = 0;
 
 	/* mm's last user has gone, and its about to be pulled down */
 	mmu_notifier_release(mm);
@@ -3287,7 +3207,7 @@ void exit_mmap(struct mm_struct *mm)
 	}
 
 	arch_exit_mmap(mm);
-	vma = mm->mmap;
+	vma = mas_find(&mas, ULONG_MAX);
 	if (!vma) {
 		/* Can happen if dup_mmap() received an OOM */
 		mmap_write_unlock(mm);
@@ -3299,17 +3219,25 @@ void exit_mmap(struct mm_struct *mm)
 	tlb_gather_mmu_fullmm(&tlb, mm);
 	/* update_hiwater_rss(mm) here? but nobody should be looking */
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
-	unmap_vmas(&tlb, vma, 0, -1);
-	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
+	unmap_vmas(&tlb, &mm->mm_mt, vma, 0, ULONG_MAX);
+	free_pgtables(&tlb, &mm->mm_mt, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb);
 
-	/* Walk the list again, actually closing and freeing it. */
-	while (vma) {
+	/*
+	 * Walk the list again, actually closing and freeing it, with preemption
+	 * enabled, without holding any MM locks besides the unreachable
+	 * mmap_write_lock.
+	 */
+	do {
 		if (vma->vm_flags & VM_ACCOUNT)
 			nr_accounted += vma_pages(vma);
-		vma = remove_vma(vma);
+		remove_vma(vma);
+		count++;
 		cond_resched();
-	}
+	} while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);
+
+	BUG_ON(count != mm->map_count);
+
 
 	trace_exit_mmap(mm);
 	__mt_destroy(&mm->mm_mt);
@@ -3349,7 +3277,7 @@ int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
 		vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT;
 	}
 
-	if (vma_link(mm, vma, prev))
+	if (vma_link(mm, vma))
 		return -ENOMEM;
 
 	return 0;
@@ -3379,7 +3307,8 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 		faulted_in_anon_vma = false;
 	}
 
-	if (range_has_overlap(mm, addr, addr + len, &prev))
+	new_vma = find_vma_prev(mm, addr, &prev);
+	if (new_vma->vm_start < addr + len)
 		return NULL;	/* should never get here */
 
 	new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
@@ -3422,7 +3351,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 			get_file(new_vma->vm_file);
 		if (new_vma->vm_ops && new_vma->vm_ops->open)
 			new_vma->vm_ops->open(new_vma);
-		if (vma_link(mm, new_vma, prev))
+		if (vma_link(mm, new_vma))
 			goto out_vma_link;
 		*need_rmap_locks = false;
 	}
@@ -3727,12 +3656,13 @@ int mm_take_all_locks(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
 	struct anon_vma_chain *avc;
+	MA_STATE(mas, &mm->mm_mt, 0, 0);
 
 	mmap_assert_write_locked(mm);
 
 	mutex_lock(&mm_all_locks_mutex);
 
-	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+	mas_for_each(&mas, vma, ULONG_MAX) {
 		if (signal_pending(current))
 			goto out_unlock;
 		if (vma->vm_file && vma->vm_file->f_mapping &&
@@ -3740,7 +3670,8 @@ int mm_take_all_locks(struct mm_struct *mm)
 			vm_lock_mapping(mm, vma->vm_file->f_mapping);
 	}
 
-	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+	mas_set(&mas, 0);
+	mas_for_each(&mas, vma, ULONG_MAX) {
 		if (signal_pending(current))
 			goto out_unlock;
 		if (vma->vm_file && vma->vm_file->f_mapping &&
@@ -3748,7 +3679,8 @@ int mm_take_all_locks(struct mm_struct *mm)
 			vm_lock_mapping(mm, vma->vm_file->f_mapping);
 	}
 
-	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+	mas_set(&mas, 0);
+	mas_for_each(&mas, vma, ULONG_MAX) {
 		if (signal_pending(current))
 			goto out_unlock;
 		if (vma->anon_vma)
@@ -3807,11 +3739,12 @@ void mm_drop_all_locks(struct mm_struct *mm)
 {
 	struct vm_area_struct *vma;
 	struct anon_vma_chain *avc;
+	MA_STATE(mas, &mm->mm_mt, 0, 0);
 
 	mmap_assert_write_locked(mm);
 	BUG_ON(!mutex_is_locked(&mm_all_locks_mutex));
 
-	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+	mas_for_each(&mas, vma, ULONG_MAX) {
 		if (vma->anon_vma)
 			list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
 				vm_unlock_anon_vma(avc->anon_vma);
diff --git a/mm/nommu.c b/mm/nommu.c
index d94f6adf9c31..e32561f9f55f 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -553,7 +553,6 @@ static void put_nommu_region(struct vm_region *region)
 static void add_vma_to_mm(struct mm_struct *mm, struct vm_area_struct *vma)
 {
 	struct address_space *mapping;
-	struct vm_area_struct *prev;
 	MA_STATE(mas, &mm->mm_mt, vma->vm_start, vma->vm_end);
 
 	BUG_ON(!vma->vm_region);
@@ -572,11 +571,8 @@ static void add_vma_to_mm(struct mm_struct *mm, struct vm_area_struct *vma)
 		i_mmap_unlock_write(mapping);
 	}
 
-	prev = mas_prev(&mas, 0);
-	mas_reset(&mas);
 	/* add the VMA to the tree */
 	vma_mas_store(vma, &mas);
-	__vma_link_list(mm, vma, prev);
 }
 
 /*
@@ -601,7 +597,6 @@ static void delete_vma_from_mm(struct vm_area_struct *vma)
 
 	/* remove from the MM's tree and list */
 	vma_mas_remove(vma, &mas);
-	__vma_unlink_list(vma->vm_mm, vma);
 }
 
 /*
diff --git a/mm/util.c b/mm/util.c
index 3e97807c353b..136e1775d54c 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -271,46 +271,6 @@ void *memdup_user_nul(const void __user *src, size_t len)
 }
 EXPORT_SYMBOL(memdup_user_nul);
 
-void __vma_link_list(struct mm_struct *mm, struct vm_area_struct *vma,
-		struct vm_area_struct *prev)
-{
-	struct vm_area_struct *next;
-
-	vma->vm_prev = prev;
-	if (prev) {
-		next = prev->vm_next;
-		prev->vm_next = vma;
-	} else {
-		next = mm->mmap;
-		mm->mmap = vma;
-	}
-	vma->vm_next = next;
-	if (next)
-		next->vm_prev = vma;
-	else
-		mm->highest_vm_end = vm_end_gap(vma);
-}
-
-void __vma_unlink_list(struct mm_struct *mm, struct vm_area_struct *vma)
-{
-	struct vm_area_struct *prev, *next;
-
-	next = vma->vm_next;
-	prev = vma->vm_prev;
-	if (prev)
-		prev->vm_next = next;
-	else
-		mm->mmap = next;
-	if (next) {
-		next->vm_prev = prev;
-	} else {
-		if (prev)
-			mm->highest_vm_end = vm_end_gap(prev);
-		else
-			mm->highest_vm_end = 0;
-	}
-}
-
 /* Check if the vma is being used as a stack by this task */
 int vma_is_stack_for_current(struct vm_area_struct *vma)
 {
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 66/69] riscv: use vma iterator for vdso
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (48 preceding siblings ...)
  2022-05-04  1:14   ` [PATCH v9 64/69] i915: use the VMA iterator Liam Howlett
@ 2022-05-04  1:14   ` Liam Howlett
  2022-05-04  1:14   ` [PATCH v9 68/69] mm/mmap: drop range_has_overlap() function Liam Howlett
                     ` (2 subsequent siblings)
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:14 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Remove the linked list use in favour of the vma iterator.
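
For readers new to the iterator, the pattern is just a declaration plus a
for_each_vma() loop.  A tiny sketch (kernel context assumed; the function
name is invented):

#include <linux/mm.h>

/* Illustrative only: count every VMA in an mm with the VMA iterator. */
static unsigned int sketch_count_vmas(struct mm_struct *mm)
{
	VMA_ITERATOR(vmi, mm, 0);	/* 0: start at the lowest address */
	struct vm_area_struct *vma;
	unsigned int nr = 0;

	mmap_read_lock(mm);
	for_each_vma(vmi, vma)
		nr++;
	mmap_read_unlock(mm);

	return nr;
}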

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
---
 arch/riscv/kernel/vdso.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/riscv/kernel/vdso.c b/arch/riscv/kernel/vdso.c
index a9436a65161a..20e2ae135fb9 100644
--- a/arch/riscv/kernel/vdso.c
+++ b/arch/riscv/kernel/vdso.c
@@ -116,10 +116,11 @@ int vdso_join_timens(struct task_struct *task, struct time_namespace *ns)
 {
 	struct mm_struct *mm = task->mm;
 	struct vm_area_struct *vma;
+	VMA_ITERATOR(vmi, mm, 0);
 
 	mmap_read_lock(mm);
 
-	for (vma = mm->mmap; vma; vma = vma->vm_next) {
+	for_each_vma(vmi, vma) {
 		unsigned long size = vma->vm_end - vma->vm_start;
 
 		if (vma_is_special_mapping(vma, vdso_info.dm))
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 68/69] mm/mmap: drop range_has_overlap() function
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (49 preceding siblings ...)
  2022-05-04  1:14   ` [PATCH v9 66/69] riscv: use vma iterator for vdso Liam Howlett
@ 2022-05-04  1:14   ` Liam Howlett
  2022-05-04  1:14   ` [PATCH v9 67/69] mm: remove the vma linked list Liam Howlett
  2022-05-04  1:14   ` [PATCH v9 69/69] mm/mmap.c: pass in mapping to __vma_link_file() Liam Howlett
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:14 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Since there is no longer a linked list, the range_has_overlap() function
is identical to the find_vma_intersection() function.
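
Callers that only need the overlap test now reduce to a single lookup.  A
minimal sketch (kernel context and a held mmap lock assumed; the function
name is invented):

#include <linux/mm.h>

/* Illustrative only: report whether any VMA overlaps [start, end). */
static bool sketch_range_busy(struct mm_struct *mm, unsigned long start,
			      unsigned long end)
{
	mmap_assert_locked(mm);

	/* find_vma_intersection() returns the first overlapping VMA or NULL. */
	return find_vma_intersection(mm, start, end) != NULL;
}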

Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/mmap.c | 28 +---------------------------
 1 file changed, 1 insertion(+), 27 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 7704c879bc6d..32d3922cec27 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -437,30 +437,6 @@ anon_vma_interval_tree_post_update_vma(struct vm_area_struct *vma)
 		anon_vma_interval_tree_insert(avc, &avc->anon_vma->rb_root);
 }
 
-/*
- * range_has_overlap() - Check the @start - @end range for overlapping VMAs and
- * sets up a pointer to the previous VMA
- * @mm: the mm struct
- * @start: the start address of the range
- * @end: the end address of the range
- * @pprev: the pointer to the pointer of the previous VMA
- *
- * Returns: True if there is an overlapping VMA, false otherwise
- */
-static inline
-bool range_has_overlap(struct mm_struct *mm, unsigned long start,
-		       unsigned long end, struct vm_area_struct **pprev)
-{
-	struct vm_area_struct *existing;
-
-	MA_STATE(mas, &mm->mm_mt, start, start);
-	rcu_read_lock();
-	existing = mas_find(&mas, end - 1);
-	*pprev = mas_prev(&mas, 0);
-	rcu_read_unlock();
-	return existing ? true : false;
-}
-
 static unsigned long count_vma_pages_range(struct mm_struct *mm,
 		unsigned long addr, unsigned long end)
 {
@@ -3251,9 +3227,7 @@ void exit_mmap(struct mm_struct *mm)
  */
 int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
 {
-	struct vm_area_struct *prev;
-
-	if (range_has_overlap(mm, vma->vm_start, vma->vm_end, &prev))
+	if (find_vma_intersection(mm, vma->vm_start, vma->vm_end))
 		return -ENOMEM;
 
 	if ((vma->vm_flags & VM_ACCOUNT) &&
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* [PATCH v9 69/69] mm/mmap.c: pass in mapping to __vma_link_file()
  2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
                     ` (51 preceding siblings ...)
  2022-05-04  1:14   ` [PATCH v9 67/69] mm: remove the vma linked list Liam Howlett
@ 2022-05-04  1:14   ` Liam Howlett
  52 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-04  1:14 UTC (permalink / raw)
  To: maple-tree, linux-mm, linux-kernel, Andrew Morton

From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

__vma_link_file() resolves the mapping from the file, if there is one.
Pass through the mapping and check the vm_file externally since most
places already have the required information and check of vm_file.

Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
---
 mm/mmap.c | 33 ++++++++++++++-------------------
 1 file changed, 14 insertions(+), 19 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 32d3922cec27..c4d79d901085 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -246,6 +246,7 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
 	if (brk < min_brk)
 		goto out;
 
+
 	/*
 	 * Check against rlimit here. If this check is done later after the test
 	 * of oldbrk with newbrk then it can escape the test and let the data
@@ -322,7 +323,6 @@ SYSCALL_DEFINE1(brk, unsigned long, brk)
 	if (populate)
 		mm_populate(oldbrk, newbrk - oldbrk);
 	return brk;
-
 out:
 	mmap_write_unlock(mm);
 	return origbrk;
@@ -454,21 +454,15 @@ static unsigned long count_vma_pages_range(struct mm_struct *mm,
 	return nr_pages;
 }
 
-static void __vma_link_file(struct vm_area_struct *vma)
+static void __vma_link_file(struct vm_area_struct *vma,
+			    struct address_space *mapping)
 {
-	struct file *file;
-
-	file = vma->vm_file;
-	if (file) {
-		struct address_space *mapping = file->f_mapping;
-
-		if (vma->vm_flags & VM_SHARED)
-			mapping_allow_writable(mapping);
+	if (vma->vm_flags & VM_SHARED)
+		mapping_allow_writable(mapping);
 
-		flush_dcache_mmap_lock(mapping);
-		vma_interval_tree_insert(vma, &mapping->i_mmap);
-		flush_dcache_mmap_unlock(mapping);
-	}
+	flush_dcache_mmap_lock(mapping);
+	vma_interval_tree_insert(vma, &mapping->i_mmap);
+	flush_dcache_mmap_unlock(mapping);
 }
 
 /*
@@ -535,10 +529,11 @@ static int vma_link(struct mm_struct *mm, struct vm_area_struct *vma)
 	}
 
 	vma_mas_store(vma, &mas);
-	__vma_link_file(vma);
 
-	if (mapping)
+	if (mapping) {
+		__vma_link_file(vma, mapping);
 		i_mmap_unlock_write(mapping);
+	}
 
 	mm->map_count++;
 	validate_mm(mm);
@@ -779,14 +774,14 @@ int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 			uprobe_munmap(next, next->vm_start, next->vm_end);
 
 		i_mmap_lock_write(mapping);
-		if (insert) {
+		if (insert && insert->vm_file) {
 			/*
 			 * Put into interval tree now, so instantiated pages
 			 * are visible to arm/parisc __flush_dcache_page
 			 * throughout; but we cannot insert into address
 			 * space until vma start or end is updated.
 			 */
-			__vma_link_file(insert);
+			__vma_link_file(insert, insert->vm_file->f_mapping);
 		}
 	}
 
@@ -3019,7 +3014,6 @@ static int do_brk_flags(struct ma_state *mas, struct vm_area_struct *vma,
 	struct mm_struct *mm = current->mm;
 	validate_mm_mt(mm);
 
-
 	/*
 	 * Check against address space limits by the changed size
 	 * Note: This happens *after* clearing old mappings in some code paths.
@@ -3077,6 +3071,7 @@ static int do_brk_flags(struct ma_state *mas, struct vm_area_struct *vma,
 		goto mas_store_fail;
 
 	mm->map_count++;
+
 out:
 	perf_event_mmap(vma);
 	mm->total_vm += len >> PAGE_SHIFT;
-- 
2.35.1

^ permalink raw reply related	[flat|nested] 83+ messages in thread

* Re: [PATCH v9 15/69] damon: Convert __damon_va_three_regions to use the VMA iterator
  2022-05-04  1:12 ` [PATCH v9 15/69] damon: Convert __damon_va_three_regions to use the VMA iterator Liam Howlett
@ 2022-05-10 10:44   ` SeongJae Park
  2022-05-10 16:27     ` Liam Howlett
  2022-05-10 19:13     ` Andrew Morton
  0 siblings, 2 replies; 83+ messages in thread
From: SeongJae Park @ 2022-05-10 10:44 UTC (permalink / raw)
  To: Liam Howlett
  Cc: maple-tree, linux-mm, linux-kernel, Andrew Morton, damon, SeongJae Park

On Wed, 4 May 2022 01:12:26 +0000 Liam Howlett <liam.howlett@oracle.com> wrote:

> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> 
> This rather specialised walk can use the VMA iterator.  If this proves to
> be too slow, we can write a custom routine to find the two largest gaps,
> but it will be somewhat complicated, so let's see if we need it first.
> 
> Update the kunit test case to use the maple tree.  This also fixes an
> issue with the kunit testcase not adding the last VMA to the list.
> 
> Fixes: 17ccae8bb5c9 (mm/damon: add kunit tests)
> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Reviewed-by: SeongJae Park <sj@kernel.org>
> ---
>  mm/damon/vaddr-test.h | 37 +++++++++++-------------------
>  mm/damon/vaddr.c      | 53 ++++++++++++++++++++++---------------------
>  2 files changed, 40 insertions(+), 50 deletions(-)
> 
> diff --git a/mm/damon/vaddr-test.h b/mm/damon/vaddr-test.h
> index 5431da4fe9d4..dbf2b8759607 100644
> --- a/mm/damon/vaddr-test.h
> +++ b/mm/damon/vaddr-test.h
> @@ -13,34 +13,21 @@
>  #define _DAMON_VADDR_TEST_H
>  
>  #include <kunit/test.h>
> +#include "../../mm/internal.h"

V9 maple tree patchset has moved the definition of vma_mas_store() from
internal.h to mmap.c, so the inclusion of internal.h wouldn't be needed here, right?

If we end up moving the definitions back to internal.h, because this file is
under mm/damon/, we can also use the shorter include path, "../internal.h".


Thanks,
SJ

[...]

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v9 15/69] damon: Convert __damon_va_three_regions to use the VMA iterator
  2022-05-10 10:44   ` SeongJae Park
@ 2022-05-10 16:27     ` Liam Howlett
  2022-05-10 19:13     ` Andrew Morton
  1 sibling, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-10 16:27 UTC (permalink / raw)
  To: SeongJae Park; +Cc: maple-tree, linux-mm, linux-kernel, Andrew Morton, damon

* SeongJae Park <sj@kernel.org> [220510 03:44]:
> On Wed, 4 May 2022 01:12:26 +0000 Liam Howlett <liam.howlett@oracle.com> wrote:
> 
> > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> > 
> > This rather specialised walk can use the VMA iterator.  If this proves to
> > be too slow, we can write a custom routine to find the two largest gaps,
> > but it will be somewhat complicated, so let's see if we need it first.
> > 
> > Update the kunit test case to use the maple tree.  This also fixes an
> > issue with the kunit testcase not adding the last VMA to the list.
> > 
> > Fixes: 17ccae8bb5c9 (mm/damon: add kunit tests)
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> > Reviewed-by: SeongJae Park <sj@kernel.org>
> > ---
> >  mm/damon/vaddr-test.h | 37 +++++++++++-------------------
> >  mm/damon/vaddr.c      | 53 ++++++++++++++++++++++---------------------
> >  2 files changed, 40 insertions(+), 50 deletions(-)
> > 
> > diff --git a/mm/damon/vaddr-test.h b/mm/damon/vaddr-test.h
> > index 5431da4fe9d4..dbf2b8759607 100644
> > --- a/mm/damon/vaddr-test.h
> > +++ b/mm/damon/vaddr-test.h
> > @@ -13,34 +13,21 @@
> >  #define _DAMON_VADDR_TEST_H
> >  
> >  #include <kunit/test.h>
> > +#include "../../mm/internal.h"
> 
> V9 maple tree patchset has moved the definition of vma_mas_store() from
> internal.h to mmap.c, so the inclusion of internal.h wouldn't be needed here, right?
> 
> If we end up moving the definitions back to internal.h, because this file is
> under mm/damon/, we can also use the shorter include path, "../internal.h".

Yeah, that seems like a good plan.

I will be moving it back to internal.h to restore functionality to nommu.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v9 15/69] damon: Convert __damon_va_three_regions to use the VMA iterator
  2022-05-10 10:44   ` SeongJae Park
  2022-05-10 16:27     ` Liam Howlett
@ 2022-05-10 19:13     ` Andrew Morton
  1 sibling, 0 replies; 83+ messages in thread
From: Andrew Morton @ 2022-05-10 19:13 UTC (permalink / raw)
  To: SeongJae Park; +Cc: Liam Howlett, maple-tree, linux-mm, linux-kernel, damon

On Tue, 10 May 2022 10:44:28 +0000 SeongJae Park <sj@kernel.org> wrote:

> On Wed, 4 May 2022 01:12:26 +0000 Liam Howlett <liam.howlett@oracle.com> wrote:
> 
> > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> > 
> > This rather specialised walk can use the VMA iterator.  If this proves to
> > be too slow, we can write a custom routine to find the two largest gaps,
> > but it will be somewhat complicated, so let's see if we need it first.
> > 
> > Update the kunit test case to use the maple tree.  This also fixes an
> > issue with the kunit testcase not adding the last VMA to the list.
> > 
> > Fixes: 17ccae8bb5c9 (mm/damon: add kunit tests)
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> > Reviewed-by: SeongJae Park <sj@kernel.org>
> > ---
> >  mm/damon/vaddr-test.h | 37 +++++++++++-------------------
> >  mm/damon/vaddr.c      | 53 ++++++++++++++++++++++---------------------
> >  2 files changed, 40 insertions(+), 50 deletions(-)
> > 
> > diff --git a/mm/damon/vaddr-test.h b/mm/damon/vaddr-test.h
> > index 5431da4fe9d4..dbf2b8759607 100644
> > --- a/mm/damon/vaddr-test.h
> > +++ b/mm/damon/vaddr-test.h
> > @@ -13,34 +13,21 @@
> >  #define _DAMON_VADDR_TEST_H
> >  
> >  #include <kunit/test.h>
> > +#include "../../mm/internal.h"
> 
> V9 maple tree patchset has moved the definition of vma_mas_store() from
> internal.h to mmap.c, so the inclusion of internal.h wouldn't be needed here, right?
> 
> If we end up moving the definitions back to internal.h, because this file is
> under mm/damon/, we can also use the shorter include path, "../internal.h".

I put the vma_mas_store() and vma_mas_remove() declarations into
include/linux/mm.h so yes, internal.h is no longer required.  I queued
a fixlet against
damon-convert-__damon_va_three_regions-to-use-the-vma-iterator.patch


--- a/mm/damon/vaddr-test.h~damon-convert-__damon_va_three_regions-to-use-the-vma-iterator-fix
+++ a/mm/damon/vaddr-test.h
@@ -13,7 +13,6 @@
 #define _DAMON_VADDR_TEST_H
 
 #include <kunit/test.h>
-#include "../../mm/internal.h"
 
 static void __link_vmas(struct maple_tree *mt, struct vm_area_struct *vmas,
 			ssize_t nr_vmas)
_


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH 1/1] mips: rename mt_init to mips_mt_init
  2022-05-04  0:26 ` [PATCH 1/1] mips: rename mt_init to mips_mt_init Liam Howlett
@ 2022-05-12  9:54   ` David Hildenbrand
  0 siblings, 0 replies; 83+ messages in thread
From: David Hildenbrand @ 2022-05-12  9:54 UTC (permalink / raw)
  To: Liam Howlett, maple-tree, linux-mm, linux-kernel, Andrew Morton

On 04.05.22 02:26, Liam Howlett wrote:
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> 
> Move mt_init out of the way for the maple tree.  Use mips_mt prefix to
> match the rest of the functions in the file.
> 
> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>


Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v9 67/69] mm: remove the vma linked list
  2022-05-04  1:14   ` [PATCH v9 67/69] mm: remove the vma linked list Liam Howlett
@ 2022-05-13 13:30     ` Qian Cai
  2022-05-13 14:17       ` Liam Howlett
  0 siblings, 1 reply; 83+ messages in thread
From: Qian Cai @ 2022-05-13 13:30 UTC (permalink / raw)
  To: Liam Howlett; +Cc: maple-tree, linux-mm, linux-kernel, Andrew Morton

On Wed, May 04, 2022 at 01:14:07AM +0000, Liam Howlett wrote:
...
> @@ -2488,13 +2380,20 @@ do_mas_align_munmap(struct ma_state *mas, struct vm_area_struct *vma,
>  		    struct mm_struct *mm, unsigned long start,
>  		    unsigned long end, struct list_head *uf, bool downgrade)
>  {
> -	struct vm_area_struct *prev, *last;
> +	struct vm_area_struct *prev, *next = NULL;
> +	struct maple_tree mt_detach;
> +	int count = 0;
>  	int error = -ENOMEM;
> -	/* we have start < vma->vm_end  */
> +	MA_STATE(mas_detach, &mt_detach, start, end - 1);
> +	mt_init_flags(&mt_detach, MM_MT_FLAGS);
> +	mt_set_external_lock(&mt_detach, &mm->mmap_lock);
>  
>  	if (mas_preallocate(mas, vma, GFP_KERNEL))
>  		return -ENOMEM;
>  
> +	if (mas_preallocate(&mas_detach, vma, GFP_KERNEL))

This one was reported as leaking as well.

unreferenced object 0xffff0802d49b5500 (size 256):
  comm "trinity-c22", pid 107245, jiffies 4295674711 (age 816.980s)
  hex dump (first 32 bytes):
    01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
     kmem_cache_alloc
     mas_alloc_nodes
     mas_preallocate
     do_mas_align_munmap.constprop.0
     do_mas_align_munmap at mm/mmap.c:2384
     do_mas_munmap
     __vm_munmap
     __arm64_sys_munmap
     invoke_syscall
     el0_svc_common.constprop.0
     do_el0_svc
     el0_svc
     el0t_64_sync_handler
     el0t_64_sync

> +		return -ENOMEM;
> +
>  	mas->last = end - 1;
>  	/*
>  	 * If we need to split any vma, do it now to save pain later.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v9 67/69] mm: remove the vma linked list
  2022-05-13 13:30     ` Qian Cai
@ 2022-05-13 14:17       ` Liam Howlett
  0 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-05-13 14:17 UTC (permalink / raw)
  To: Qian Cai; +Cc: maple-tree, linux-mm, linux-kernel, Andrew Morton

* Qian Cai <quic_qiancai@quicinc.com> [220513 09:30]:
> On Wed, May 04, 2022 at 01:14:07AM +0000, Liam Howlett wrote:
> ...
> > @@ -2488,13 +2380,20 @@ do_mas_align_munmap(struct ma_state *mas, struct vm_area_struct *vma,
> >  		    struct mm_struct *mm, unsigned long start,
> >  		    unsigned long end, struct list_head *uf, bool downgrade)
> >  {
> > -	struct vm_area_struct *prev, *last;
> > +	struct vm_area_struct *prev, *next = NULL;
> > +	struct maple_tree mt_detach;
> > +	int count = 0;
> >  	int error = -ENOMEM;
> > -	/* we have start < vma->vm_end  */
> > +	MA_STATE(mas_detach, &mt_detach, start, end - 1);
> > +	mt_init_flags(&mt_detach, MM_MT_FLAGS);
> > +	mt_set_external_lock(&mt_detach, &mm->mmap_lock);
> >  
> >  	if (mas_preallocate(mas, vma, GFP_KERNEL))
> >  		return -ENOMEM;
> >  
> > +	if (mas_preallocate(&mas_detach, vma, GFP_KERNEL))
> 
> This one was reported as leaking as well.
> 
> unreferenced object 0xffff0802d49b5500 (size 256):
>   comm "trinity-c22", pid 107245, jiffies 4295674711 (age 816.980s)
>   hex dump (first 32 bytes):
>     01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>   backtrace:
>      kmem_cache_alloc
>      mas_alloc_nodes
>      mas_preallocate
>      do_mas_align_munmap.constprop.0
>      do_mas_align_munmap at mm/mmap.c:2384
>      do_mas_munmap
>      __vm_munmap
>      __arm64_sys_munmap
>      invoke_syscall
>      el0_svc_common.constprop.0
>      do_el0_svc
>      el0_svc
>      el0t_64_sync_handler
>      el0t_64_sync


Thanks.  I have not seen this myself but there certainly is a potential
for a leak here when the task runs out of memory in the middle of a
munmap operation.  I've sent you and Andrew a fix.
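
To illustrate: if the second preallocation fails, the nodes reserved by the
first call are never returned.  A sketch of the kind of cleanup that closes
that particular window, assuming mas_destroy() on the still-unused state is
enough (illustration only, not the actual patch):

        if (mas_preallocate(mas, vma, GFP_KERNEL))
                return -ENOMEM;

        if (mas_preallocate(&mas_detach, vma, GFP_KERNEL)) {
                /* Drop the nodes already reserved on @mas, or they leak. */
                mas_destroy(mas);
                return -ENOMEM;
        }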

Cheers,
Liam

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v9 28/69] mm/mmap: reorganize munmap to use maple states
  2022-05-04  1:13   ` [PATCH v9 28/69] mm/mmap: reorganize munmap to use maple states Liam Howlett
@ 2022-06-06 12:09     ` Qian Cai
  2022-06-06 16:19       ` Liam Howlett
  0 siblings, 1 reply; 83+ messages in thread
From: Qian Cai @ 2022-06-06 12:09 UTC (permalink / raw)
  To: Liam Howlett; +Cc: maple-tree, linux-mm, linux-kernel, Andrew Morton

On Wed, May 04, 2022 at 01:13:53AM +0000, Liam Howlett wrote:
> From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> 
> Remove __do_munmap() in favour of do_munmap(), do_mas_munmap(), and
> do_mas_align_munmap().
> 
> do_munmap() is a wrapper to create a maple state for any callers that have
> not been converted to the maple tree.
> 
> do_mas_munmap() takes a maple state to munmap a range.  This is just a
> small function which checks for error conditions and aligns the end of the
> range.
> 
> do_mas_align_munmap() uses the aligned range to munmap a range.
> do_mas_align_munmap() starts with the first VMA in the range, then finds
> the last VMA in the range.  Both start and end are split if necessary.
> Then the VMAs are removed from the linked list and the mm mlock count is
> updated at the same time, followed by a single tree operation that
> overwrites the area in the tree with a NULL.  Finally, the detached list is
> unmapped and freed.
> 
> By reorganizing the munmap calls as outlined, it is now possible to avoid
> the extra work of aligning pre-aligned callers which are known to be safe,
> and to avoid extra VMA lookups or tree walks for modifications.
> 
> detach_vmas_to_be_unmapped() is no longer used, so drop this code.
> 
> vm_brk_flags() can just call the do_mas_munmap() as it checks for
> intersecting VMAs directly.
> 
> Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
...
> +/*
> + * do_mas_align_munmap() - munmap the aligned region from @start to @end.
> + * @mas: The maple_state, ideally set up to alter the correct tree location.
> + * @vma: The starting vm_area_struct
> + * @mm: The mm_struct
> + * @start: The aligned start address to munmap.
> + * @end: The aligned end address to munmap.
> + * @uf: The userfaultfd list_head
> + * @downgrade: Set to true to attempt a write downgrade of the mmap_sem
> + *
> + * If @downgrade is true, check return code for potential release of the lock.
> + */
> +static int
> +do_mas_align_munmap(struct ma_state *mas, struct vm_area_struct *vma,
> +		    struct mm_struct *mm, unsigned long start,
> +		    unsigned long end, struct list_head *uf, bool downgrade)
> +{
> +	struct vm_area_struct *prev, *last;
> +	int error = -ENOMEM;
> +	/* we have start < vma->vm_end  */
>  
> -	if (mas_preallocate(&mas, vma, GFP_KERNEL))
> +	if (mas_preallocate(mas, vma, GFP_KERNEL))
>  		return -ENOMEM;
> -	prev = vma->vm_prev;
> -	/* we have start < vma->vm_end  */
>  
> +	mas->last = end - 1;
>  	/*
>  	 * If we need to split any vma, do it now to save pain later.
>  	 *
...
> +/*
> + * do_mas_munmap() - munmap a given range.
> + * @mas: The maple state
> + * @mm: The mm_struct
> + * @start: The start address to munmap
> + * @len: The length of the range to munmap
> + * @uf: The userfaultfd list_head
> + * @downgrade: set to true if the user wants to attempt to write_downgrade the
> + * mmap_sem
> + *
> + * This function takes a @mas that is either pointing to the previous VMA or set
> + * to MA_START and sets it up to remove the mapping(s).  The @len will be
> + * aligned and any arch_unmap work will be performed.
> + *
> + * Returns: -EINVAL on failure, 1 on success and unlock, 0 otherwise.
> + */
> +int do_mas_munmap(struct ma_state *mas, struct mm_struct *mm,
> +		  unsigned long start, size_t len, struct list_head *uf,
> +		  bool downgrade)
> +{
> +	unsigned long end;
> +	struct vm_area_struct *vma;
> +
> +	if ((offset_in_page(start)) || start > TASK_SIZE || len > TASK_SIZE-start)
> +		return -EINVAL;
> +
> +	end = start + PAGE_ALIGN(len);
> +	if (end == start)
> +		return -EINVAL;
> +
> +	 /* arch_unmap() might do unmaps itself.  */
> +	arch_unmap(mm, start, end);
> +
> +	/* Find the first overlapping VMA */
> +	vma = mas_find(mas, end - 1);
> +	if (!vma)
> +		return 0;
> +
> +	return do_mas_align_munmap(mas, vma, mm, start, end, uf, downgrade);
> +}
> +
...
> @@ -2845,11 +2908,12 @@ static int __vm_munmap(unsigned long start, size_t len, bool downgrade)
>  	int ret;
>  	struct mm_struct *mm = current->mm;
>  	LIST_HEAD(uf);
> +	MA_STATE(mas, &mm->mm_mt, start, start);
>  
>  	if (mmap_write_lock_killable(mm))
>  		return -EINTR;
>  
> -	ret = __do_munmap(mm, start, len, &uf, downgrade);
> +	ret = do_mas_munmap(&mas, mm, start, len, &uf, downgrade);
>  	/*
>  	 * Returning 1 indicates mmap_lock is downgraded.
>  	 * But 1 is not legal return value of vm_munmap() and munmap(), reset

Running a syscall fuzzer for a while could trigger those.

 WARNING: CPU: 95 PID: 1329067 at mm/slub.c:3643 kmem_cache_free_bulk
 CPU: 95 PID: 1329067 Comm: trinity-c32 Not tainted 5.18.0-next-20220603 #137
 pstate: 10400009 (nzcV daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
 pc : kmem_cache_free_bulk
 lr : mt_destroy_walk
 sp : ffff80005ed66bf0
 x29: ffff80005ed66bf0 x28: ffff401d6c82f050 x27: 0000000000000000
 x26: dfff800000000000 x25: 0000000000000003 x24: 1ffffa97cc5fb120
 x23: ffffd4be62fd8760 x22: ffff401d6c82f050 x21: 0000000000000003
 x20: 0000000000000000 x19: ffff401d6c82f000 x18: ffffd4be66407d1c
 x17: ffff40297ac21f0c x16: 1fffe8016136146b x15: 1fffe806c7d1ad38
 x14: 1fffe8016136145e x13: 0000000000000004 x12: ffff70000bdacd8d
 x11: 1ffff0000bdacd8c x10: ffff70000bdacd8c x9 : ffffd4be60d633c4
 x8 : ffff80005ed66c63 x7 : 0000000000000001 x6 : 0000000000000003
 x5 : ffff80005ed66c60 x4 : 0000000000000000 x3 : ffff400b09b09a80
 x2 : ffff401d6c82f050 x1 : 0000000000000000 x0 : ffff07ff80014a80
 Call trace:
  kmem_cache_free_bulk
  mt_destroy_walk
  mas_wmb_replace
  mas_spanning_rebalance.isra.0
  mas_wr_spanning_store.isra.0
  mas_wr_store_entry.isra.0
  mas_store_prealloc
  do_mas_align_munmap.constprop.0
  do_mas_munmap
  __vm_munmap
  __arm64_sys_munmap
  invoke_syscall
  el0_svc_common.constprop.0
  do_el0_svc
  el0_svc
  el0t_64_sync_handler
  el0t_64_sync
 irq event stamp: 665580
 hardirqs last  enabled at (665579):  kasan_quarantine_put
 hardirqs last disabled at (665580):  el1_dbg
 softirqs last  enabled at (664048):  __do_softirq
 softirqs last disabled at (663831):  __irq_exit_rcu


 BUG: KASAN: double-free or invalid-free in kmem_cache_free_bulk

 CPU: 95 PID: 1329067 Comm: trinity-c32 Tainted: G        W         5.18.0-next-20220603 #137
 Call trace:
  dump_backtrace
  show_stack
  dump_stack_lvl
  print_address_description.constprop.0
  print_report
  kasan_report_invalid_free
  ____kasan_slab_free
  __kasan_slab_free
  slab_free_freelist_hook
  kmem_cache_free_bulk
  mas_destroy
  mas_store_prealloc
  do_mas_align_munmap.constprop.0
  do_mas_munmap
  __vm_munmap
  __arm64_sys_munmap
  invoke_syscall
  el0_svc_common.constprop.0
  do_el0_svc
  el0_svc
  el0t_64_sync_handler
  el0t_64_sync

 Allocated by task 1329067:
  kasan_save_stack
  __kasan_slab_alloc
  slab_post_alloc_hook
  kmem_cache_alloc_bulk
  mas_alloc_nodes
  mas_preallocate
  __vma_adjust
  vma_merge
  mprotect_fixup
  do_mprotect_pkey.constprop.0
  __arm64_sys_mprotect
  invoke_syscall
  el0_svc_common.constprop.0
  do_el0_svc
  el0_svc
  el0t_64_sync_handler
  el0t_64_sync

 Freed by task 1329067:
  kasan_save_stack
  kasan_set_track
  kasan_set_free_info
  ____kasan_slab_free
  __kasan_slab_free
  slab_free_freelist_hook
  kmem_cache_free
  mt_destroy_walk
  mas_wmb_replace
  mas_spanning_rebalance.isra.0
  mas_wr_spanning_store.isra.0
  mas_wr_store_entry.isra.0
  mas_store_prealloc
  do_mas_align_munmap.constprop.0
  do_mas_munmap
  __vm_munmap
  __arm64_sys_munmap
  invoke_syscall
  el0_svc_common.constprop.0
  do_el0_svc
  el0_svc
  el0t_64_sync_handler
  el0t_64_sync

 The buggy address belongs to the object at ffff401d6c82f000
                which belongs to the cache maple_node of size 256
 The buggy address is located 0 bytes inside of
                256-byte region [ffff401d6c82f000, ffff401d6c82f100)

 The buggy address belongs to the physical page:
 page:fffffd0075b20a00 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x401dec828
 head:fffffd0075b20a00 order:3 compound_mapcount:0 compound_pincount:0
 flags: 0x1bfffc0000010200(slab|head|node=1|zone=2|lastcpupid=0xffff)
 raw: 1bfffc0000010200 fffffd00065b2a08 fffffd0006474408 ffff07ff80014a80
 raw: 0000000000000000 00000000002a002a 00000001ffffffff 0000000000000000
 page dumped because: kasan: bad access detected
 page_owner tracks the page as allocated
 page last allocated via order 3, migratetype Unmovable, gfp_mask 0x1d20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC|__GFP_HARDWALL), pid 185514, tgid 185514 (trinity-c15), ts 9791681605400, free_ts 9785882037080
  post_alloc_hook
  get_page_from_freelist
  __alloc_pages
  alloc_pages
  allocate_slab
  new_slab
  ___slab_alloc
  __slab_alloc.constprop.0
  kmem_cache_alloc
  mas_alloc_nodes
  mas_preallocate
  __vma_adjust
  vma_merge
  mlock_fixup
  apply_mlockall_flags
  __arm64_sys_munlockall
 page last free stack trace:
  free_pcp_prepare
  free_unref_page
  __free_pages
  __free_slab
  discard_slab
  __slab_free
  ___cache_free
  qlist_free_all
  kasan_quarantine_reduce
  __kasan_slab_alloc
  __kmalloc_node
  kvmalloc_node
  __slab_free
  ___cache_free
  qlist_free_all
  kasan_quarantine_reduce
  __kasan_slab_alloc
  __kmalloc_node
  kvmalloc_node
  proc_sys_call_handler
  proc_sys_read
  new_sync_read
  vfs_read

 Memory state around the buggy address:
  ffff401d6c82ef00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
  ffff401d6c82ef80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 >ffff401d6c82f000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                    ^
  ffff401d6c82f080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffff401d6c82f100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v9 28/69] mm/mmap: reorganize munmap to use maple states
  2022-06-06 12:09     ` Qian Cai
@ 2022-06-06 16:19       ` Liam Howlett
  2022-06-06 16:40         ` Qian Cai
  0 siblings, 1 reply; 83+ messages in thread
From: Liam Howlett @ 2022-06-06 16:19 UTC (permalink / raw)
  To: Qian Cai; +Cc: maple-tree, linux-mm, linux-kernel, Andrew Morton

* Qian Cai <quic_qiancai@quicinc.com> [220606 08:09]:
> On Wed, May 04, 2022 at 01:13:53AM +0000, Liam Howlett wrote:
> > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> > 
> > Remove __do_munmap() in favour of do_munmap(), do_mas_munmap(), and
> > do_mas_align_munmap().
> > 
> > do_munmap() is a wrapper to create a maple state for any callers that have
> > not been converted to the maple tree.
> > 
> > do_mas_munmap() takes a maple state to munmap a range.  This is just a
> > small function which checks for error conditions and aligns the end of the
> > range.
> > 
> > do_mas_align_munmap() uses the aligned range to munmap a range.
> > do_mas_align_munmap() starts with the first VMA in the range, then finds
> > the last VMA in the range.  Both start and end are split if necessary.
> > Then the VMAs are removed from the linked list and the mm mlock count is
> > updated at the same time, followed by a single tree operation that
> > overwrites the area in the tree with a NULL.  Finally, the detached list is
> > unmapped and freed.
> > 
> > By reorganizing the munmap calls as outlined, it is now possible to avoid
> > the extra work of aligning pre-aligned callers which are known to be safe,
> > and to avoid extra VMA lookups or tree walks for modifications.
> > 
> > detach_vmas_to_be_unmapped() is no longer used, so drop this code.
> > 
> > vm_brk_flags() can just call the do_mas_munmap() as it checks for
> > intersecting VMAs directly.
> > 
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> ...

..
> Running a syscall fuzzer for a while could trigger those.

Thanks.

> 
>  WARNING: CPU: 95 PID: 1329067 at mm/slub.c:3643 kmem_cache_free_bulk
>  CPU: 95 PID: 1329067 Comm: trinity-c32 Not tainted 5.18.0-next-20220603 #137
>  pstate: 10400009 (nzcV daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
>  pc : kmem_cache_free_bulk
>  lr : mt_destroy_walk
>  sp : ffff80005ed66bf0


Does your syscall fuzzer create a reproducer?  This looks like arm64
and says 5.18.0-next-20220603 again.  Was this bisected to the patch
above?

Regards,
Liam

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v9 28/69] mm/mmap: reorganize munmap to use maple states
  2022-06-06 16:19       ` Liam Howlett
@ 2022-06-06 16:40         ` Qian Cai
  2022-06-11 20:11           ` Yu Zhao
  0 siblings, 1 reply; 83+ messages in thread
From: Qian Cai @ 2022-06-06 16:40 UTC (permalink / raw)
  To: Liam Howlett; +Cc: maple-tree, linux-mm, linux-kernel, Andrew Morton

On Mon, Jun 06, 2022 at 04:19:52PM +0000, Liam Howlett wrote:
> Does your syscall fuzzer create a reproducer?  This looks like arm64
> and says 5.18.0-next-20220603 again.  Was this bisected to the patch
> above?

This was triggered by running the fuzzer over the weekend.

$ trinity -C 160

No bisection was done. It was only brought up here because the trace
pointed to do_mas_munmap() which was introduced here.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v9 28/69] mm/mmap: reorganize munmap to use maple states
  2022-06-06 16:40         ` Qian Cai
@ 2022-06-11 20:11           ` Yu Zhao
  2022-06-11 21:49             ` Yu Zhao
  0 siblings, 1 reply; 83+ messages in thread
From: Yu Zhao @ 2022-06-11 20:11 UTC (permalink / raw)
  To: Liam Howlett; +Cc: Qian Cai, maple-tree, linux-mm, linux-kernel, Andrew Morton

On Mon, Jun 6, 2022 at 10:40 AM Qian Cai <quic_qiancai@quicinc.com> wrote:
>
> On Mon, Jun 06, 2022 at 04:19:52PM +0000, Liam Howlett wrote:
> > Does your syscall fuzzer create a reproducer?  This looks like arm64
> > and says 5.18.0-next-20220603 again.  Was this bisected to the patch
> > above?
>
> This was triggered by running the fuzzer over the weekend.
>
> $ trinity -C 160
>
> No bisection was done. It was only brought up here because the trace
> pointed to do_mas_munmap() which was introduced here.

Liam,

I'm getting a similar crash on arm64 -- the allocator is madvise(),
not mprotect(). Please take a look.

Thanks.

==================================================================
BUG: KASAN: double-free or invalid-free in kmem_cache_free_bulk+0x230/0x3b0
Pointer tag: [0c], memory tag: [fe]

CPU: 2 PID: 8320 Comm: stress-ng Tainted: G    B   W
5.19.0-rc1-lockdep+ #3
Call trace:
 dump_backtrace+0x1a0/0x200
 show_stack+0x24/0x30
 dump_stack_lvl+0x7c/0xa0
 print_report+0x15c/0x524
 kasan_report_invalid_free+0x64/0x84
 ____kasan_slab_free+0x150/0x184
 __kasan_slab_free+0x14/0x24
 slab_free_freelist_hook+0x100/0x1ac
 kmem_cache_free_bulk+0x230/0x3b0
 mas_destroy+0x10d8/0x1270
 mas_store_prealloc+0xb8/0xec
 do_mas_align_munmap+0x398/0x694
 do_mas_munmap+0xf8/0x118
 __vm_munmap+0x154/0x1e0
 __arm64_sys_munmap+0x44/0x54
 el0_svc_common+0xfc/0x1cc
 do_el0_svc_compat+0x38/0x5c
 el0_svc_compat+0x68/0xf4
 el0t_32_sync_handler+0xc0/0xf0
 el0t_32_sync+0x190/0x194

Allocated by task 8437:
 kasan_set_track+0x4c/0x7c
 __kasan_slab_alloc+0x84/0xa8
 kmem_cache_alloc_bulk+0x300/0x408
 mas_alloc_nodes+0x198/0x294
 mas_preallocate+0x8c/0x110
 __vma_adjust+0x174/0xc88
 vma_merge+0x2e4/0x300
 do_madvise+0x504/0xd20
 __arm64_sys_madvise+0x54/0x64
 el0_svc_common+0xfc/0x1cc
 do_el0_svc_compat+0x38/0x5c
 el0_svc_compat+0x68/0xf4
 el0t_32_sync_handler+0xc0/0xf0
 el0t_32_sync+0x190/0x194

Freed by task 8320:
 kasan_set_track+0x4c/0x7c
 kasan_set_free_info+0x2c/0x38
 ____kasan_slab_free+0x13c/0x184
 __kasan_slab_free+0x14/0x24
 slab_free_freelist_hook+0x100/0x1ac
 kmem_cache_free+0x11c/0x264
 mt_destroy_walk+0x6d8/0x714
 mas_wmb_replace+0x9d4/0xa68
 mas_spanning_rebalance+0x1af0/0x1d2c
 mas_wr_spanning_store+0x908/0x964
 mas_wr_store_entry+0x53c/0x5c0
 mas_store_prealloc+0x88/0xec
 do_mas_align_munmap+0x398/0x694
 do_mas_munmap+0xf8/0x118
 __vm_munmap+0x154/0x1e0
 __arm64_sys_munmap+0x44/0x54
 el0_svc_common+0xfc/0x1cc
 do_el0_svc_compat+0x38/0x5c
 el0_svc_compat+0x68/0xf4
 el0t_32_sync_handler+0xc0/0xf0
 el0t_32_sync+0x190/0x194

The buggy address belongs to the object at ffffff808b5f0a00
 which belongs to the cache maple_node of size 256
The buggy address is located 0 bytes inside of
 256-byte region [ffffff808b5f0a00, ffffff808b5f0b00)

The buggy address belongs to the physical page:
page:fffffffe022d7c00 refcount:1 mapcount:0 mapping:0000000000000000
index:0xcffff808b5f0a00 pfn:0x10b5f0
head:fffffffe022d7c00 order:2 compound_mapcount:0 compound_pincount:0
flags: 0x8000000000010200(slab|head|zone=2|kasantag=0x0)
raw: 8000000000010200 fffffffe031a8608 fffffffe021a3608 caffff808002c800
raw: 0cffff808b5f0a00 0000000000150013 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffffff808b5f0800: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
 ffffff808b5f0900: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
>ffffff808b5f0a00: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
                   ^
 ffffff808b5f0b00: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
 ffffff808b5f0c00: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
==================================================================

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v9 28/69] mm/mmap: reorganize munmap to use maple states
  2022-06-11 20:11           ` Yu Zhao
@ 2022-06-11 21:49             ` Yu Zhao
  2022-06-12  1:09               ` Liam Howlett
  2022-06-15 14:25               ` Liam Howlett
  0 siblings, 2 replies; 83+ messages in thread
From: Yu Zhao @ 2022-06-11 21:49 UTC (permalink / raw)
  To: Liam Howlett; +Cc: Qian Cai, maple-tree, linux-mm, linux-kernel, Andrew Morton

On Sat, Jun 11, 2022 at 2:11 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Mon, Jun 6, 2022 at 10:40 AM Qian Cai <quic_qiancai@quicinc.com> wrote:
> >
> > On Mon, Jun 06, 2022 at 04:19:52PM +0000, Liam Howlett wrote:
> > > Does your syscall fuzzer create a reproducer?  This looks like arm64
> > > and says 5.18.0-next-20220603 again.  Was this bisected to the patch
> > > above?
> >
> > This was triggered by running the fuzzer over the weekend.
> >
> > $ trinity -C 160
> >
> > No bisection was done. It was only brought up here because the trace
> > pointed to do_mas_munmap() which was introduced here.
>
> Liam,
>
> I'm getting a similar crash on arm64 -- the allocator is madvise(),
> not mprotect(). Please take a look.

Another crash on x86_64, which seems different:

==================================================================
BUG: KASAN: slab-out-of-bounds in mab_mas_cp+0x2d9/0x6c0
Write of size 136 at addr ffff88c5a2319c80 by task stress-ng/18461

CPU: 66 PID: 18461 Comm: stress-ng Tainted: G S        I       5.19.0-smp-DEV #1
Call Trace:
 <TASK>
 dump_stack_lvl+0xc5/0xf4
 print_address_description+0x7f/0x460
 print_report+0x10b/0x240
 ? mab_mas_cp+0x2d9/0x6c0
 kasan_report+0xe6/0x110
 ? mab_mas_cp+0x2d9/0x6c0
 kasan_check_range+0x2ef/0x310
 ? mab_mas_cp+0x2d9/0x6c0
 memcpy+0x44/0x70
 mab_mas_cp+0x2d9/0x6c0
 mas_spanning_rebalance+0x1a45/0x4d70
 ? stack_trace_save+0xca/0x160
 ? stack_trace_save+0xca/0x160
 mas_wr_spanning_store+0x16a4/0x1ad0
 mas_wr_spanning_store+0x16a4/0x1ad0
 mas_wr_store_entry+0xbf9/0x12e0
 mas_store_prealloc+0x205/0x3c0
 do_mas_align_munmap+0x6cf/0xd10
 do_mas_munmap+0x1bb/0x210
 ? down_write_killable+0xa6/0x110
 __vm_munmap+0x1c4/0x270
 __x64_sys_munmap+0x60/0x70
 do_syscall_64+0x44/0xa0
 entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x589827
Code: 00 00 00 48 c7 c2 98 ff ff ff f7 d8 64 89 02 48 c7 c0 ff ff ff
ff eb 85 66 2e 0f 1f 84 00 00 00 00 00 90 b8 0b 00 00 00 0f 05 <48> 3d
01 f0 ff ff 73 01 c3 48 c7 c1 98 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fff9276c518 EFLAGS: 00000206 ORIG_RAX: 000000000000000b
RAX: ffffffffffffffda RBX: 0000400000000000 RCX: 0000000000589827
RDX: 0000000000000000 RSI: 00007ffffffff000 RDI: 0000000000000000
RBP: 00000000004cf000 R08: 00007fff9276c550 R09: 0000000000923bf0
R10: 0000000000000008 R11: 0000000000000206 R12: 0000000000001000
R13: 00000000004cf040 R14: 0000000000000004 R15: 00007fff9276c668
 </TASK>

Allocated by task 18461:
 __kasan_slab_alloc+0xaf/0xe0
 kmem_cache_alloc_bulk+0x261/0x360
 mas_alloc_nodes+0x2d7/0x4d0
 mas_preallocate+0xe0/0x220
 do_mas_align_munmap+0x1ce/0xd10
 do_mas_munmap+0x1bb/0x210
 __vm_munmap+0x1c4/0x270
 __x64_sys_munmap+0x60/0x70
 do_syscall_64+0x44/0xa0
 entry_SYSCALL_64_after_hwframe+0x46/0xb0

The buggy address belongs to the object at ffff88c5a2319c00
 which belongs to the cache maple_node of size 256
The buggy address is located 128 bytes inside of
 256-byte region [ffff88c5a2319c00, ffff88c5a2319d00)

The buggy address belongs to the physical page:
page:000000000a5cfe8b refcount:1 mapcount:0 mapping:0000000000000000
index:0x0 pfn:0x45a2319
flags: 0x1400000000000200(slab|node=1|zone=1)
raw: 1400000000000200 ffffea01168dea88 ffffea0116951f48 ffff88810004ff00
raw: 0000000000000000 ffff88c5a2319000 0000000100000008 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff88c5a2319c00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ffff88c5a2319c80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff88c5a2319d00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
                   ^
 ffff88c5a2319d80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffff88c5a2319e00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
==================================================================

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v9 28/69] mm/mmap: reorganize munmap to use maple states
  2022-06-11 21:49             ` Yu Zhao
@ 2022-06-12  1:09               ` Liam Howlett
  2022-06-15 14:25               ` Liam Howlett
  1 sibling, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-06-12  1:09 UTC (permalink / raw)
  To: Yu Zhao; +Cc: Qian Cai, maple-tree, linux-mm, linux-kernel, Andrew Morton

* Yu Zhao <yuzhao@google.com> [220611 17:50]:
> On Sat, Jun 11, 2022 at 2:11 PM Yu Zhao <yuzhao@google.com> wrote:
> >
> > On Mon, Jun 6, 2022 at 10:40 AM Qian Cai <quic_qiancai@quicinc.com> wrote:
> > >
> > > On Mon, Jun 06, 2022 at 04:19:52PM +0000, Liam Howlett wrote:
> > > > Does your syscall fuzzer create a reproducer?  This looks like arm64
> > > > and says 5.18.0-next-20220603 again.  Was this bisected to the patch
> > > > above?
> > >
> > > This was triggered by running the fuzzer over the weekend.
> > >
> > > $ trinity -C 160
> > >
> > > No bisection was done. It was only brought up here because the trace
> > > pointed to do_mas_munmap() which was introduced here.
> >
> > Liam,
> >
> > I'm getting a similar crash on arm64 -- the allocator is madvise(),
> > not mprotect(). Please take a look.
> 
> Another crash on x86_64, which seems different:

Thanks, yes.  This one may be different.  The others come from the same source
and I'm working on that.

> 
> ==================================================================
> BUG: KASAN: slab-out-of-bounds in mab_mas_cp+0x2d9/0x6c0
> Write of size 136 at addr ffff88c5a2319c80 by task stress-ng/18461
> 
> CPU: 66 PID: 18461 Comm: stress-ng Tainted: G S        I       5.19.0-smp-DEV #1
> Call Trace:
>  <TASK>
>  dump_stack_lvl+0xc5/0xf4
>  print_address_description+0x7f/0x460
>  print_report+0x10b/0x240
>  ? mab_mas_cp+0x2d9/0x6c0
>  kasan_report+0xe6/0x110
>  ? mab_mas_cp+0x2d9/0x6c0
>  kasan_check_range+0x2ef/0x310
>  ? mab_mas_cp+0x2d9/0x6c0
>  memcpy+0x44/0x70
>  mab_mas_cp+0x2d9/0x6c0
>  mas_spanning_rebalance+0x1a45/0x4d70
>  ? stack_trace_save+0xca/0x160
>  ? stack_trace_save+0xca/0x160
>  mas_wr_spanning_store+0x16a4/0x1ad0
>  mas_wr_spanning_store+0x16a4/0x1ad0
>  mas_wr_store_entry+0xbf9/0x12e0
>  mas_store_prealloc+0x205/0x3c0
>  do_mas_align_munmap+0x6cf/0xd10
>  do_mas_munmap+0x1bb/0x210
>  ? down_write_killable+0xa6/0x110
>  __vm_munmap+0x1c4/0x270
>  __x64_sys_munmap+0x60/0x70
>  do_syscall_64+0x44/0xa0
>  entry_SYSCALL_64_after_hwframe+0x46/0xb0
> RIP: 0033:0x589827
> Code: 00 00 00 48 c7 c2 98 ff ff ff f7 d8 64 89 02 48 c7 c0 ff ff ff
> ff eb 85 66 2e 0f 1f 84 00 00 00 00 00 90 b8 0b 00 00 00 0f 05 <48> 3d
> 01 f0 ff ff 73 01 c3 48 c7 c1 98 ff ff ff f7 d8 64 89 01 48
> RSP: 002b:00007fff9276c518 EFLAGS: 00000206 ORIG_RAX: 000000000000000b
> RAX: ffffffffffffffda RBX: 0000400000000000 RCX: 0000000000589827
> RDX: 0000000000000000 RSI: 00007ffffffff000 RDI: 0000000000000000
> RBP: 00000000004cf000 R08: 00007fff9276c550 R09: 0000000000923bf0
> R10: 0000000000000008 R11: 0000000000000206 R12: 0000000000001000
> R13: 00000000004cf040 R14: 0000000000000004 R15: 00007fff9276c668
>  </TASK>
> 
> Allocated by task 18461:
>  __kasan_slab_alloc+0xaf/0xe0
>  kmem_cache_alloc_bulk+0x261/0x360
>  mas_alloc_nodes+0x2d7/0x4d0
>  mas_preallocate+0xe0/0x220
>  do_mas_align_munmap+0x1ce/0xd10
>  do_mas_munmap+0x1bb/0x210
>  __vm_munmap+0x1c4/0x270
>  __x64_sys_munmap+0x60/0x70
>  do_syscall_64+0x44/0xa0
>  entry_SYSCALL_64_after_hwframe+0x46/0xb0
> 
> The buggy address belongs to the object at ffff88c5a2319c00
>  which belongs to the cache maple_node of size 256
> The buggy address is located 128 bytes inside of
>  256-byte region [ffff88c5a2319c00, ffff88c5a2319d00)
> 
> The buggy address belongs to the physical page:
> page:000000000a5cfe8b refcount:1 mapcount:0 mapping:0000000000000000
> index:0x0 pfn:0x45a2319
> flags: 0x1400000000000200(slab|node=1|zone=1)
> raw: 1400000000000200 ffffea01168dea88 ffffea0116951f48 ffff88810004ff00
> raw: 0000000000000000 ffff88c5a2319000 0000000100000008 0000000000000000
> page dumped because: kasan: bad access detected
> 
> Memory state around the buggy address:
>  ffff88c5a2319c00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>  ffff88c5a2319c80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >ffff88c5a2319d00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>                    ^
>  ffff88c5a2319d80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>  ffff88c5a2319e00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> ==================================================================

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v9 28/69] mm/mmap: reorganize munmap to use maple states
  2022-06-11 21:49             ` Yu Zhao
  2022-06-12  1:09               ` Liam Howlett
@ 2022-06-15 14:25               ` Liam Howlett
  2022-06-15 18:07                 ` Yu Zhao
  1 sibling, 1 reply; 83+ messages in thread
From: Liam Howlett @ 2022-06-15 14:25 UTC (permalink / raw)
  To: Yu Zhao; +Cc: Qian Cai, maple-tree, linux-mm, linux-kernel, Andrew Morton

* Yu Zhao <yuzhao@google.com> [220611 17:50]:
> On Sat, Jun 11, 2022 at 2:11 PM Yu Zhao <yuzhao@google.com> wrote:
> >
> > On Mon, Jun 6, 2022 at 10:40 AM Qian Cai <quic_qiancai@quicinc.com> wrote:
> > >
> > > On Mon, Jun 06, 2022 at 04:19:52PM +0000, Liam Howlett wrote:
> > > > Does your syscall fuzzer create a reproducer?  This looks like arm64
> > > > and says 5.18.0-next-20220603 again.  Was this bisected to the patch
> > > > above?
> > >
> > > This was triggered by running the fuzzer over the weekend.
> > >
> > > $ trinity -C 160
> > >
> > > No bisection was done. It was only brought up here because the trace
> > > pointed to do_mas_munmap() which was introduced here.
> >
> > Liam,
> >
> > I'm getting a similar crash on arm64 -- the allocator is madvise(),
> > not mprotect(). Please take a look.
> 
> Another crash on x86_64, which seems different:

Thanks for this.  I was able to reproduce the other crashes that you and
Qian reported.  I've sent out a patch set to Andrew to apply to the
branch which includes the fix for them and an unrelated issue discovered
when I wrote the testcases to cover what was going on here.


> 
> ==================================================================
> BUG: KASAN: slab-out-of-bounds in mab_mas_cp+0x2d9/0x6c0
> Write of size 136 at addr ffff88c5a2319c80 by task stress-ng/18461
> 
> CPU: 66 PID: 18461 Comm: stress-ng Tainted: G S        I       5.19.0-smp-DEV #1
> Call Trace:
>  <TASK>
>  dump_stack_lvl+0xc5/0xf4
>  print_address_description+0x7f/0x460
>  print_report+0x10b/0x240
>  ? mab_mas_cp+0x2d9/0x6c0
>  kasan_report+0xe6/0x110
>  ? mab_mas_cp+0x2d9/0x6c0
>  kasan_check_range+0x2ef/0x310
>  ? mab_mas_cp+0x2d9/0x6c0
>  memcpy+0x44/0x70
>  mab_mas_cp+0x2d9/0x6c0
>  mas_spanning_rebalance+0x1a45/0x4d70
>  ? stack_trace_save+0xca/0x160
>  ? stack_trace_save+0xca/0x160
>  mas_wr_spanning_store+0x16a4/0x1ad0
>  mas_wr_spanning_store+0x16a4/0x1ad0
>  mas_wr_store_entry+0xbf9/0x12e0
>  mas_store_prealloc+0x205/0x3c0
>  do_mas_align_munmap+0x6cf/0xd10
>  do_mas_munmap+0x1bb/0x210
>  ? down_write_killable+0xa6/0x110
>  __vm_munmap+0x1c4/0x270
>  __x64_sys_munmap+0x60/0x70
>  do_syscall_64+0x44/0xa0
>  entry_SYSCALL_64_after_hwframe+0x46/0xb0
> RIP: 0033:0x589827
> Code: 00 00 00 48 c7 c2 98 ff ff ff f7 d8 64 89 02 48 c7 c0 ff ff ff
> ff eb 85 66 2e 0f 1f 84 00 00 00 00 00 90 b8 0b 00 00 00 0f 05 <48> 3d
> 01 f0 ff ff 73 01 c3 48 c7 c1 98 ff ff ff f7 d8 64 89 01 48
> RSP: 002b:00007fff9276c518 EFLAGS: 00000206 ORIG_RAX: 000000000000000b
> RAX: ffffffffffffffda RBX: 0000400000000000 RCX: 0000000000589827
> RDX: 0000000000000000 RSI: 00007ffffffff000 RDI: 0000000000000000
> RBP: 00000000004cf000 R08: 00007fff9276c550 R09: 0000000000923bf0
> R10: 0000000000000008 R11: 0000000000000206 R12: 0000000000001000
> R13: 00000000004cf040 R14: 0000000000000004 R15: 00007fff9276c668
>  </TASK>

...

As for this crash, I was unable to reproduce and the code I just sent
out changes this code a lot.  Was this running with "trinity -c madvise"
or another use case/fuzzer?


Thanks,
Liam

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v9 28/69] mm/mmap: reorganize munmap to use maple states
  2022-06-15 14:25               ` Liam Howlett
@ 2022-06-15 18:07                 ` Yu Zhao
  2022-06-15 18:55                   ` Liam Howlett
  0 siblings, 1 reply; 83+ messages in thread
From: Yu Zhao @ 2022-06-15 18:07 UTC (permalink / raw)
  To: Liam Howlett; +Cc: Qian Cai, maple-tree, linux-mm, linux-kernel, Andrew Morton

On Wed, Jun 15, 2022 at 8:25 AM Liam Howlett <liam.howlett@oracle.com> wrote:
>
> * Yu Zhao <yuzhao@google.com> [220611 17:50]:
> > On Sat, Jun 11, 2022 at 2:11 PM Yu Zhao <yuzhao@google.com> wrote:
> > >
> > > On Mon, Jun 6, 2022 at 10:40 AM Qian Cai <quic_qiancai@quicinc.com> wrote:
> > > >
> > > > On Mon, Jun 06, 2022 at 04:19:52PM +0000, Liam Howlett wrote:
> > > > > Does your syscall fuzzer create a reproducer?  This looks like arm64
> > > > > and says 5.18.0-next-20220603 again.  Was this bisected to the patch
> > > > > above?
> > > >
> > > > This was triggered by running the fuzzer over the weekend.
> > > >
> > > > $ trinity -C 160
> > > >
> > > > No bisection was done. It was only brought up here because the trace
> > > > pointed to do_mas_munmap() which was introduced here.
> > >
> > > Liam,
> > >
> > > I'm getting a similar crash on arm64 -- the allocator is madvise(),
> > > not mprotect(). Please take a look.
> >
> > Another crash on x86_64, which seems different:
>
> Thanks for this.  I was able to reproduce the other crashes that you and
> Qian reported.  I've sent out a patch set to Andrew to apply to the
> branch which includes the fix for them and an unrelated issue discovered
> when I wrote the testcases to cover what was going on here.

Thanks. I'm restarting the test and will report the results in a few hours.

> > BUG: KASAN: slab-out-of-bounds in mab_mas_cp+0x2d9/0x6c0
> > Write of size 136 at addr ffff88c5a2319c80 by task stress-ng/18461
                                                       ^^^^^^^^^

> As for this crash, I was unable to reproduce and the code I just sent
> out changes this code a lot.  Was this running with "trinity -c madvise"
> or another use case/fuzzer?

This is also stress-ng (same as the one on arm64). The test stopped
before it could try syzkaller (fuzzer).

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v9 28/69] mm/mmap: reorganize munmap to use maple states
  2022-06-15 18:07                 ` Yu Zhao
@ 2022-06-15 18:55                   ` Liam Howlett
  2022-06-15 19:05                     ` Yu Zhao
  0 siblings, 1 reply; 83+ messages in thread
From: Liam Howlett @ 2022-06-15 18:55 UTC (permalink / raw)
  To: Yu Zhao; +Cc: Qian Cai, maple-tree, linux-mm, linux-kernel, Andrew Morton

* Yu Zhao <yuzhao@google.com> [220615 14:08]:
> On Wed, Jun 15, 2022 at 8:25 AM Liam Howlett <liam.howlett@oracle.com> wrote:
> >
> > * Yu Zhao <yuzhao@google.com> [220611 17:50]:
> > > On Sat, Jun 11, 2022 at 2:11 PM Yu Zhao <yuzhao@google.com> wrote:
> > > >
> > > > On Mon, Jun 6, 2022 at 10:40 AM Qian Cai <quic_qiancai@quicinc.com> wrote:
> > > > >
> > > > > On Mon, Jun 06, 2022 at 04:19:52PM +0000, Liam Howlett wrote:
> > > > > > Does your syscall fuzzer create a reproducer?  This looks like arm64
> > > > > > and says 5.18.0-next-20220603 again.  Was this bisected to the patch
> > > > > > above?
> > > > >
> > > > > This was triggered by running the fuzzer over the weekend.
> > > > >
> > > > > $ trinity -C 160
> > > > >
> > > > > No bisection was done. It was only brought up here because the trace
> > > > > pointed to do_mas_munmap() which was introduced here.
> > > >
> > > > Liam,
> > > >
> > > > I'm getting a similar crash on arm64 -- the allocator is madvise(),
> > > > not mprotect(). Please take a look.
> > >
> > > Another crash on x86_64, which seems different:
> >
> > Thanks for this.  I was able to reproduce the other crashes that you and
> > Qian reported.  I've sent out a patch set to Andrew to apply to the
> > branch which includes the fix for them and an unrelated issue discovered
> > when I wrote the testcases to cover what was going on here.
> 
> Thanks. I'm restarting the test and will report the results in a few hours.
> 
> > > BUG: KASAN: slab-out-of-bounds in mab_mas_cp+0x2d9/0x6c0
> > > Write of size 136 at addr ffff88c5a2319c80 by task stress-ng/18461
>                                                        ^^^^^^^^^
> 
> > As for this crash, I was unable to reproduce and the code I just sent
> > out changes this code a lot.  Was this running with "trinity -c madvise"
> > or another use case/fuzzer?
> 
> This is also stress-ng (same as the one on arm64). The test stopped
> before it could try syzkaller (fuzzer).

Thanks.  What are the arguments to stress-ng you use?  I've run
"stress-ng --class vm -a 20 -t 600s --temp-path /tmp" until it OOMs on
my vm, but it only has 8GB of ram.

Regards,
Liam

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v9 28/69] mm/mmap: reorganize munmap to use maple states
  2022-06-15 18:55                   ` Liam Howlett
@ 2022-06-15 19:05                     ` Yu Zhao
  2022-06-15 21:16                       ` Yu Zhao
  0 siblings, 1 reply; 83+ messages in thread
From: Yu Zhao @ 2022-06-15 19:05 UTC (permalink / raw)
  To: Liam Howlett; +Cc: Qian Cai, maple-tree, linux-mm, linux-kernel, Andrew Morton

On Wed, Jun 15, 2022 at 12:55 PM Liam Howlett <liam.howlett@oracle.com> wrote:
>
> * Yu Zhao <yuzhao@google.com> [220615 14:08]:
> > On Wed, Jun 15, 2022 at 8:25 AM Liam Howlett <liam.howlett@oracle.com> wrote:
> > >
> > > * Yu Zhao <yuzhao@google.com> [220611 17:50]:
> > > > On Sat, Jun 11, 2022 at 2:11 PM Yu Zhao <yuzhao@google.com> wrote:
> > > > >
> > > > > On Mon, Jun 6, 2022 at 10:40 AM Qian Cai <quic_qiancai@quicinc.com> wrote:
> > > > > >
> > > > > > On Mon, Jun 06, 2022 at 04:19:52PM +0000, Liam Howlett wrote:
> > > > > > > Does your syscall fuzzer create a reproducer?  This looks like arm64
> > > > > > > and says 5.18.0-next-20220603 again.  Was this bisected to the patch
> > > > > > > above?
> > > > > >
> > > > > > This was triggered by running the fuzzer over the weekend.
> > > > > >
> > > > > > $ trinity -C 160
> > > > > >
> > > > > > No bisection was done. It was only brought up here because the trace
> > > > > > pointed to do_mas_munmap() which was introduced here.
> > > > >
> > > > > Liam,
> > > > >
> > > > > I'm getting a similar crash on arm64 -- the allocator is madvise(),
> > > > > not mprotect(). Please take a look.
> > > >
> > > > Another crash on x86_64, which seems different:
> > >
> > > Thanks for this.  I was able to reproduce the other crashes that you and
> > > Qian reported.  I've sent out a patch set to Andrew to apply to the
> > > branch which includes the fix for them and an unrelated issue discovered
> > > when I wrote the testcases to cover what was going on here.
> >
> > Thanks. I'm restarting the test and will report the results in a few hours.
> >
> > > > BUG: KASAN: slab-out-of-bounds in mab_mas_cp+0x2d9/0x6c0
> > > > Write of size 136 at addr ffff88c5a2319c80 by task stress-ng/18461
> >                                                        ^^^^^^^^^
> >
> > > As for this crash, I was unable to reproduce and the code I just sent
> > > out changes this code a lot.  Was this running with "trinity -c madvise"
> > > or another use case/fuzzer?
> >
> > This is also stress-ng (same as the one on arm64). The test stopped
> > before it could try syzkaller (fuzzer).
>
> Thanks.  What are the arguments to stress-ng you use?  I've run
> "stress-ng --class vm -a 20 -t 600s --temp-path /tmp" until it OOMs on
> my vm, but it only has 8GB of ram.

Yes, I used the same parameters with 512GB of RAM, and the kernel with
KASAN and other debug options.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v9 28/69] mm/mmap: reorganize munmap to use maple states
  2022-06-15 19:05                     ` Yu Zhao
@ 2022-06-15 21:16                       ` Yu Zhao
  2022-06-16  1:50                         ` Liam Howlett
  0 siblings, 1 reply; 83+ messages in thread
From: Yu Zhao @ 2022-06-15 21:16 UTC (permalink / raw)
  To: Liam Howlett; +Cc: Qian Cai, maple-tree, linux-mm, linux-kernel, Andrew Morton

On Wed, Jun 15, 2022 at 1:05 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Wed, Jun 15, 2022 at 12:55 PM Liam Howlett <liam.howlett@oracle.com> wrote:
> >
> > * Yu Zhao <yuzhao@google.com> [220615 14:08]:
> > > On Wed, Jun 15, 2022 at 8:25 AM Liam Howlett <liam.howlett@oracle.com> wrote:
> > > >
> > > > * Yu Zhao <yuzhao@google.com> [220611 17:50]:
> > > > > On Sat, Jun 11, 2022 at 2:11 PM Yu Zhao <yuzhao@google.com> wrote:
> > > > > >
> > > > > > On Mon, Jun 6, 2022 at 10:40 AM Qian Cai <quic_qiancai@quicinc.com> wrote:
> > > > > > >
> > > > > > > On Mon, Jun 06, 2022 at 04:19:52PM +0000, Liam Howlett wrote:
> > > > > > > > Does your syscall fuzzer create a reproducer?  This looks like arm64
> > > > > > > > and says 5.18.0-next-20220603 again.  Was this bisected to the patch
> > > > > > > > above?
> > > > > > >
> > > > > > > This was triggered by running the fuzzer over the weekend.
> > > > > > >
> > > > > > > $ trinity -C 160
> > > > > > >
> > > > > > > No bisection was done. It was only brought up here because the trace
> > > > > > > pointed to do_mas_munmap() which was introduced here.
> > > > > >
> > > > > > Liam,
> > > > > >
> > > > > > I'm getting a similar crash on arm64 -- the allocator is madvise(),
> > > > > > not mprotect(). Please take a look.
> > > > >
> > > > > Another crash on x86_64, which seems different:
> > > >
> > > > Thanks for this.  I was able to reproduce the other crashes that you and
> > > > Qian reported.  I've sent out a patch set to Andrew to apply to the
> > > > branch which includes the fix for them and an unrelated issue discovered
> > > > when I wrote the testcases to cover what was going on here.
> > >
> > > Thanks. I'm restarting the test and will report the results in a few hours.
> > >
> > > > > BUG: KASAN: slab-out-of-bounds in mab_mas_cp+0x2d9/0x6c0
> > > > > Write of size 136 at addr ffff88c5a2319c80 by task stress-ng/18461
> > >                                                        ^^^^^^^^^
> > >
> > > > As for this crash, I was unable to reproduce and the code I just sent
> > > > out changes this code a lot.  Was this running with "trinity -c madvise"
> > > > or another use case/fuzzer?
> > >
> > > This is also stress-ng (same as the one on arm64). The test stopped
> > > before it could try syzkaller (fuzzer).
> >
> > Thanks.  What are the arguments to stress-ng you use?  I've run
> > "stress-ng --class vm -a 20 -t 600s --temp-path /tmp" until it OOMs on
> > my vm, but it only has 8GB of ram.
>
> Yes, I used the same parameters with 512GB of RAM, and the kernel with
> KASAN and other debug options.

Sorry, Liam. I got the same crash :(

9d27f2f1487a (tag: mm-everything-2022-06-14-19-05, akpm/mm-everything)
00d4d7b519d6 fs/userfaultfd: Fix vma iteration in mas_for_each() loop
55140693394d maple_tree: Make mas_prealloc() error checking more generic
2d7e7c2fcf16 maple_tree: Fix mt_destroy_walk() on full non-leaf non-alloc nodes
4d4472148ccd maple_tree: Change spanning store to work on larger trees
ea36bcc14c00 test_maple_tree: Add tests for preallocations and large
spanning writes
0d2aa86ead4f mm/mlock: Drop dead code in count_mm_mlocked_page_nr()

==================================================================
BUG: KASAN: slab-out-of-bounds in mab_mas_cp+0x2d9/0x6c0
Write of size 136 at addr ffff88c35a3b9e80 by task stress-ng/19303

CPU: 66 PID: 19303 Comm: stress-ng Tainted: G S        I       5.19.0-smp-DEV #1
Call Trace:
 <TASK>
 dump_stack_lvl+0xc5/0xf4
 print_address_description+0x7f/0x460
 print_report+0x10b/0x240
 ? mab_mas_cp+0x2d9/0x6c0
 kasan_report+0xe6/0x110
 ? mast_spanning_rebalance+0x2634/0x29b0
 ? mab_mas_cp+0x2d9/0x6c0
 kasan_check_range+0x2ef/0x310
 ? mab_mas_cp+0x2d9/0x6c0
 ? mab_mas_cp+0x2d9/0x6c0
 memcpy+0x44/0x70
 mab_mas_cp+0x2d9/0x6c0
 mas_spanning_rebalance+0x1a3e/0x4f90
 ? stack_trace_save+0xca/0x160
 ? stack_trace_save+0xca/0x160
 mas_wr_spanning_store+0x16c5/0x1b80
 mas_wr_store_entry+0xbf9/0x12e0
 mas_store_prealloc+0x205/0x3c0
 do_mas_align_munmap+0x6cf/0xd10
 do_mas_munmap+0x1bb/0x210
 ? down_write_killable+0xa6/0x110
 __vm_munmap+0x1c4/0x270
 __x64_sys_munmap+0x60/0x70
 do_syscall_64+0x44/0xa0
 entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x589827
Code: 00 00 00 48 c7 c2 98 ff ff ff f7 d8 64 89 02 48 c7 c0 ff ff ff
ff eb 85 66 2e 0f 1f 84 00 00 00 00 00 90 b8 0b 00 00 00 0f 05 <48> 3d
01 f0 ff ff 73 01 c3 48 c7 c1 98 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffee601ec08 EFLAGS: 00000206 ORIG_RAX: 000000000000000b
RAX: ffffffffffffffda RBX: 0000400000000000 RCX: 0000000000589827
RDX: 0000000000000000 RSI: 00007ffffffff000 RDI: 0000000000000000
RBP: 00000000004cf000 R08: 00007ffee601ec40 R09: 0000000000923bf0
R10: 0000000000000008 R11: 0000000000000206 R12: 0000000000001000
R13: 00000000004cf040 R14: 0000000000000002 R15: 00007ffee601ed58
 </TASK>

Allocated by task 19303:
 __kasan_slab_alloc+0xaf/0xe0
 kmem_cache_alloc_bulk+0x261/0x360
 mas_alloc_nodes+0x2d7/0x4d0
 mas_preallocate+0xe2/0x230
 do_mas_align_munmap+0x1ce/0xd10
 do_mas_munmap+0x1bb/0x210
 __vm_munmap+0x1c4/0x270
 __x64_sys_munmap+0x60/0x70
 do_syscall_64+0x44/0xa0
 entry_SYSCALL_64_after_hwframe+0x46/0xb0

The buggy address belongs to the object at ffff88c35a3b9e00
 which belongs to the cache maple_node of size 256
The buggy address is located 128 bytes inside of
 256-byte region [ffff88c35a3b9e00, ffff88c35a3b9f00)

The buggy address belongs to the physical page:
page:00000000325428b6 refcount:1 mapcount:0 mapping:0000000000000000
index:0x0 pfn:0x435a3b9
flags: 0x1400000000000200(slab|node=1|zone=1)
raw: 1400000000000200 ffffea010d71a5c8 ffffea010d71dec8 ffff88810004ff00
raw: 0000000000000000 ffff88c35a3b9000 0000000100000008 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff88c35a3b9e00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ffff88c35a3b9e80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff88c35a3b9f00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
                   ^
 ffff88c35a3b9f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffff88c35a3ba000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
==================================================================
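
For reference, the arithmetic behind the report: maple_node slab objects
are 256 bytes, and a 136-byte memcpy() starting 128 bytes into the
object ends at offset 128 + 136 = 264, i.e. 8 bytes past the end of the
256-byte region, which is why KASAN flags the write as
slab-out-of-bounds.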

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v9 28/69] mm/mmap: reorganize munmap to use maple states
  2022-06-15 21:16                       ` Yu Zhao
@ 2022-06-16  1:50                         ` Liam Howlett
  2022-06-16  1:58                           ` Yu Zhao
  0 siblings, 1 reply; 83+ messages in thread
From: Liam Howlett @ 2022-06-16  1:50 UTC (permalink / raw)
  To: Yu Zhao; +Cc: Qian Cai, maple-tree, linux-mm, linux-kernel, Andrew Morton

* Yu Zhao <yuzhao@google.com> [220615 17:17]:

...

> > Yes, I used the same parameters with 512GB of RAM, and the kernel with
> > KASAN and other debug options.
> 
> Sorry, Liam. I got the same crash :(

Thanks for running this promptly.  I am trying to get my own server
setup now.

> 
> 9d27f2f1487a (tag: mm-everything-2022-06-14-19-05, akpm/mm-everything)
> 00d4d7b519d6 fs/userfaultfd: Fix vma iteration in mas_for_each() loop
> 55140693394d maple_tree: Make mas_prealloc() error checking more generic
> 2d7e7c2fcf16 maple_tree: Fix mt_destroy_walk() on full non-leaf non-alloc nodes
> 4d4472148ccd maple_tree: Change spanning store to work on larger trees
> ea36bcc14c00 test_maple_tree: Add tests for preallocations and large
> spanning writes
> 0d2aa86ead4f mm/mlock: Drop dead code in count_mm_mlocked_page_nr()
> 
> ==================================================================
> BUG: KASAN: slab-out-of-bounds in mab_mas_cp+0x2d9/0x6c0
> Write of size 136 at addr ffff88c35a3b9e80 by task stress-ng/19303
> 
> CPU: 66 PID: 19303 Comm: stress-ng Tainted: G S        I       5.19.0-smp-DEV #1
> Call Trace:
>  <TASK>
>  dump_stack_lvl+0xc5/0xf4
>  print_address_description+0x7f/0x460
>  print_report+0x10b/0x240
>  ? mab_mas_cp+0x2d9/0x6c0
>  kasan_report+0xe6/0x110
>  ? mast_spanning_rebalance+0x2634/0x29b0
>  ? mab_mas_cp+0x2d9/0x6c0
>  kasan_check_range+0x2ef/0x310
>  ? mab_mas_cp+0x2d9/0x6c0
>  ? mab_mas_cp+0x2d9/0x6c0
>  memcpy+0x44/0x70
>  mab_mas_cp+0x2d9/0x6c0
>  mas_spanning_rebalance+0x1a3e/0x4f90

Does this translate to an inline around line 2997?
And then probably around 2808?

>  ? stack_trace_save+0xca/0x160
>  ? stack_trace_save+0xca/0x160
>  mas_wr_spanning_store+0x16c5/0x1b80
>  mas_wr_store_entry+0xbf9/0x12e0
>  mas_store_prealloc+0x205/0x3c0
>  do_mas_align_munmap+0x6cf/0xd10
>  do_mas_munmap+0x1bb/0x210
>  ? down_write_killable+0xa6/0x110
>  __vm_munmap+0x1c4/0x270

Looks like a NULL entry being written.
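
As a rough illustration of that last point, here is a minimal sketch
(names are illustrative and this is not the actual do_mas_align_munmap()
code): the munmap path wipes the unmapped span from the VMA maple tree
by storing a NULL entry over it, which is the write the trace above
lands in.

#include <linux/maple_tree.h>

/*
 * Sketch only: clear the unmapped span [start, end - 1] by storing a
 * NULL entry over it, using nodes preallocated earlier on this path.
 */
static void sketch_wipe_unmapped_range(struct ma_state *mas,
				       unsigned long start, unsigned long end)
{
	mas_set_range(mas, start, end - 1);	/* span being unmapped */
	mas_store_prealloc(mas, NULL);		/* the NULL write referred to above */
}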

>  __x64_sys_munmap+0x60/0x70
>  do_syscall_64+0x44/0xa0
>  entry_SYSCALL_64_after_hwframe+0x46/0xb0
> RIP: 0033:0x589827


Thanks,
Liam

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v9 28/69] mm/mmap: reorganize munmap to use maple states
  2022-06-16  1:50                         ` Liam Howlett
@ 2022-06-16  1:58                           ` Yu Zhao
  2022-06-16  2:56                             ` Liam Howlett
  0 siblings, 1 reply; 83+ messages in thread
From: Yu Zhao @ 2022-06-16  1:58 UTC (permalink / raw)
  To: Liam Howlett; +Cc: Qian Cai, maple-tree, linux-mm, linux-kernel, Andrew Morton

On Wed, Jun 15, 2022 at 7:50 PM Liam Howlett <liam.howlett@oracle.com> wrote:
>
> * Yu Zhao <yuzhao@google.com> [220615 17:17]:
>
> ...
>
> > > Yes, I used the same parameters with 512GB of RAM, and the kernel with
> > > KASAN and other debug options.
> >
> > Sorry, Liam. I got the same crash :(
>
> Thanks for running this promptly.  I am trying to get my own server
> setup now.
>
> >
> > 9d27f2f1487a (tag: mm-everything-2022-06-14-19-05, akpm/mm-everything)
> > 00d4d7b519d6 fs/userfaultfd: Fix vma iteration in mas_for_each() loop
> > 55140693394d maple_tree: Make mas_prealloc() error checking more generic
> > 2d7e7c2fcf16 maple_tree: Fix mt_destroy_walk() on full non-leaf non-alloc nodes
> > 4d4472148ccd maple_tree: Change spanning store to work on larger trees
> > ea36bcc14c00 test_maple_tree: Add tests for preallocations and large
> > spanning writes
> > 0d2aa86ead4f mm/mlock: Drop dead code in count_mm_mlocked_page_nr()
> >
> > ==================================================================
> > BUG: KASAN: slab-out-of-bounds in mab_mas_cp+0x2d9/0x6c0
> > Write of size 136 at addr ffff88c35a3b9e80 by task stress-ng/19303
> >
> > CPU: 66 PID: 19303 Comm: stress-ng Tainted: G S        I       5.19.0-smp-DEV #1
> > Call Trace:
> >  <TASK>
> >  dump_stack_lvl+0xc5/0xf4
> >  print_address_description+0x7f/0x460
> >  print_report+0x10b/0x240
> >  ? mab_mas_cp+0x2d9/0x6c0
> >  kasan_report+0xe6/0x110
> >  ? mast_spanning_rebalance+0x2634/0x29b0
> >  ? mab_mas_cp+0x2d9/0x6c0
> >  kasan_check_range+0x2ef/0x310
> >  ? mab_mas_cp+0x2d9/0x6c0
> >  ? mab_mas_cp+0x2d9/0x6c0
> >  memcpy+0x44/0x70
> >  mab_mas_cp+0x2d9/0x6c0
> >  mas_spanning_rebalance+0x1a3e/0x4f90
>
> Does this translate to an inline around line 2997?
> And then probably around 2808?

$ ./scripts/faddr2line vmlinux mab_mas_cp+0x2d9
mab_mas_cp+0x2d9/0x6c0:
mab_mas_cp at lib/maple_tree.c:1988
$ ./scripts/faddr2line vmlinux mas_spanning_rebalance+0x1a3e
mas_spanning_rebalance+0x1a3e/0x4f90:
mast_cp_to_nodes at lib/maple_tree.c:?
(inlined by) mas_spanning_rebalance at lib/maple_tree.c:2997
$ ./scripts/faddr2line vmlinux mas_wr_spanning_store+0x16c5
mas_wr_spanning_store+0x16c5/0x1b80:
mas_wr_spanning_store at lib/maple_tree.c:?

No idea why faddr2line didn't work for the last two addresses. GDB
seems more reliable.

(gdb) li *(mab_mas_cp+0x2d9)
0xffffffff8226b049 is in mab_mas_cp (lib/maple_tree.c:1988).
(gdb) li *(mas_spanning_rebalance+0x1a3e)
0xffffffff822633ce is in mas_spanning_rebalance (lib/maple_tree.c:2801).
quit)
(gdb) li *(mas_wr_spanning_store+0x16c5)
0xffffffff8225cfb5 is in mas_wr_spanning_store (lib/maple_tree.c:4030).

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v9 28/69] mm/mmap: reorganize munmap to use maple states
  2022-06-16  1:58                           ` Yu Zhao
@ 2022-06-16  2:56                             ` Liam Howlett
  2022-06-16  3:02                               ` Yu Zhao
  0 siblings, 1 reply; 83+ messages in thread
From: Liam Howlett @ 2022-06-16  2:56 UTC (permalink / raw)
  To: Yu Zhao; +Cc: Qian Cai, maple-tree, linux-mm, linux-kernel, Andrew Morton

* Yu Zhao <yuzhao@google.com> [220615 21:59]:
> On Wed, Jun 15, 2022 at 7:50 PM Liam Howlett <liam.howlett@oracle.com> wrote:
> >
> > * Yu Zhao <yuzhao@google.com> [220615 17:17]:
> >
> > ...
> >
> > > > Yes, I used the same parameters with 512GB of RAM, and the kernel with
> > > > KASAN and other debug options.
> > >
> > > Sorry, Liam. I got the same crash :(
> >
> > Thanks for running this promptly.  I am trying to get my own server
> > setup now.
> >
> > >
> > > 9d27f2f1487a (tag: mm-everything-2022-06-14-19-05, akpm/mm-everything)
> > > 00d4d7b519d6 fs/userfaultfd: Fix vma iteration in mas_for_each() loop
> > > 55140693394d maple_tree: Make mas_prealloc() error checking more generic
> > > 2d7e7c2fcf16 maple_tree: Fix mt_destroy_walk() on full non-leaf non-alloc nodes
> > > 4d4472148ccd maple_tree: Change spanning store to work on larger trees
> > > ea36bcc14c00 test_maple_tree: Add tests for preallocations and large
> > > spanning writes
> > > 0d2aa86ead4f mm/mlock: Drop dead code in count_mm_mlocked_page_nr()
> > >
> > > ==================================================================
> > > BUG: KASAN: slab-out-of-bounds in mab_mas_cp+0x2d9/0x6c0
> > > Write of size 136 at addr ffff88c35a3b9e80 by task stress-ng/19303
> > >
> > > CPU: 66 PID: 19303 Comm: stress-ng Tainted: G S        I       5.19.0-smp-DEV #1
> > > Call Trace:
> > >  <TASK>
> > >  dump_stack_lvl+0xc5/0xf4
> > >  print_address_description+0x7f/0x460
> > >  print_report+0x10b/0x240
> > >  ? mab_mas_cp+0x2d9/0x6c0
> > >  kasan_report+0xe6/0x110
> > >  ? mast_spanning_rebalance+0x2634/0x29b0
> > >  ? mab_mas_cp+0x2d9/0x6c0
> > >  kasan_check_range+0x2ef/0x310
> > >  ? mab_mas_cp+0x2d9/0x6c0
> > >  ? mab_mas_cp+0x2d9/0x6c0
> > >  memcpy+0x44/0x70
> > >  mab_mas_cp+0x2d9/0x6c0
> > >  mas_spanning_rebalance+0x1a3e/0x4f90
> >
> > Does this translate to an inline around line 2997?
> > And then probably around 2808?
> 
> $ ./scripts/faddr2line vmlinux mab_mas_cp+0x2d9
> mab_mas_cp+0x2d9/0x6c0:
> mab_mas_cp at lib/maple_tree.c:1988
> $ ./scripts/faddr2line vmlinux mas_spanning_rebalance+0x1a3e
> mas_spanning_rebalance+0x1a3e/0x4f90:
> mast_cp_to_nodes at lib/maple_tree.c:?
> (inlined by) mas_spanning_rebalance at lib/maple_tree.c:2997
> $ ./scripts/faddr2line vmlinux mas_wr_spanning_store+0x16c5
> mas_wr_spanning_store+0x16c5/0x1b80:
> mas_wr_spanning_store at lib/maple_tree.c:?
> 
> No idea why faddr2line didn't work for the last two addresses. GDB
> seems more reliable.
> 
> (gdb) li *(mab_mas_cp+0x2d9)
> 0xffffffff8226b049 is in mab_mas_cp (lib/maple_tree.c:1988).
> (gdb) li *(mas_spanning_rebalance+0x1a3e)
> 0xffffffff822633ce is in mas_spanning_rebalance (lib/maple_tree.c:2801).
> quit)
> (gdb) li *(mas_wr_spanning_store+0x16c5)
> 0xffffffff8225cfb5 is in mas_wr_spanning_store (lib/maple_tree.c:4030).


Thanks.  I am not having luck recreating it.  I am hitting what looks
like an unrelated issue in the unstable mm, "scheduling while atomic".
I will try the git commit you indicate above.


^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v9 28/69] mm/mmap: reorganize munmap to use maple states
  2022-06-16  2:56                             ` Liam Howlett
@ 2022-06-16  3:02                               ` Yu Zhao
  2022-06-16  5:45                                 ` Yu Zhao
  0 siblings, 1 reply; 83+ messages in thread
From: Yu Zhao @ 2022-06-16  3:02 UTC (permalink / raw)
  To: Liam Howlett; +Cc: Qian Cai, maple-tree, linux-mm, linux-kernel, Andrew Morton

On Wed, Jun 15, 2022 at 8:56 PM Liam Howlett <liam.howlett@oracle.com> wrote:
>
> * Yu Zhao <yuzhao@google.com> [220615 21:59]:
> > On Wed, Jun 15, 2022 at 7:50 PM Liam Howlett <liam.howlett@oracle.com> wrote:
> > >
> > > * Yu Zhao <yuzhao@google.com> [220615 17:17]:
> > >
> > > ...
> > >
> > > > > Yes, I used the same parameters with 512GB of RAM, and the kernel with
> > > > > KASAN and other debug options.
> > > >
> > > > Sorry, Liam. I got the same crash :(
> > >
> > > Thanks for running this promptly.  I am trying to get my own server
> > > setup now.
> > >
> > > >
> > > > 9d27f2f1487a (tag: mm-everything-2022-06-14-19-05, akpm/mm-everything)
> > > > 00d4d7b519d6 fs/userfaultfd: Fix vma iteration in mas_for_each() loop
> > > > 55140693394d maple_tree: Make mas_prealloc() error checking more generic
> > > > 2d7e7c2fcf16 maple_tree: Fix mt_destroy_walk() on full non-leaf non-alloc nodes
> > > > 4d4472148ccd maple_tree: Change spanning store to work on larger trees
> > > > ea36bcc14c00 test_maple_tree: Add tests for preallocations and large
> > > > spanning writes
> > > > 0d2aa86ead4f mm/mlock: Drop dead code in count_mm_mlocked_page_nr()
> > > >
> > > > ==================================================================
> > > > BUG: KASAN: slab-out-of-bounds in mab_mas_cp+0x2d9/0x6c0
> > > > Write of size 136 at addr ffff88c35a3b9e80 by task stress-ng/19303
> > > >
> > > > CPU: 66 PID: 19303 Comm: stress-ng Tainted: G S        I       5.19.0-smp-DEV #1
> > > > Call Trace:
> > > >  <TASK>
> > > >  dump_stack_lvl+0xc5/0xf4
> > > >  print_address_description+0x7f/0x460
> > > >  print_report+0x10b/0x240
> > > >  ? mab_mas_cp+0x2d9/0x6c0
> > > >  kasan_report+0xe6/0x110
> > > >  ? mast_spanning_rebalance+0x2634/0x29b0
> > > >  ? mab_mas_cp+0x2d9/0x6c0
> > > >  kasan_check_range+0x2ef/0x310
> > > >  ? mab_mas_cp+0x2d9/0x6c0
> > > >  ? mab_mas_cp+0x2d9/0x6c0
> > > >  memcpy+0x44/0x70
> > > >  mab_mas_cp+0x2d9/0x6c0
> > > >  mas_spanning_rebalance+0x1a3e/0x4f90
> > >
> > > Does this translate to an inline around line 2997?
> > > And then probably around 2808?
> >
> > $ ./scripts/faddr2line vmlinux mab_mas_cp+0x2d9
> > mab_mas_cp+0x2d9/0x6c0:
> > mab_mas_cp at lib/maple_tree.c:1988
> > $ ./scripts/faddr2line vmlinux mas_spanning_rebalance+0x1a3e
> > mas_spanning_rebalance+0x1a3e/0x4f90:
> > mast_cp_to_nodes at lib/maple_tree.c:?
> > (inlined by) mas_spanning_rebalance at lib/maple_tree.c:2997
> > $ ./scripts/faddr2line vmlinux mas_wr_spanning_store+0x16c5
> > mas_wr_spanning_store+0x16c5/0x1b80:
> > mas_wr_spanning_store at lib/maple_tree.c:?
> >
> > No idea why faddr2line didn't work for the last two addresses. GDB
> > seems more reliable.
> >
> > (gdb) li *(mab_mas_cp+0x2d9)
> > 0xffffffff8226b049 is in mab_mas_cp (lib/maple_tree.c:1988).
> > (gdb) li *(mas_spanning_rebalance+0x1a3e)
> > 0xffffffff822633ce is in mas_spanning_rebalance (lib/maple_tree.c:2801).
> > quit)
> > (gdb) li *(mas_wr_spanning_store+0x16c5)
> > 0xffffffff8225cfb5 is in mas_wr_spanning_store (lib/maple_tree.c:4030).
>
>
> Thanks.  I am not having luck recreating it.  I am hitting what looks
> like an unrelated issue in the unstable mm, "scheduling while atomic".
> I will try the git commit you indicate above.

Fix here:
https://lore.kernel.org/linux-mm/20220615160446.be1f75fd256d67e57b27a9fc@linux-foundation.org/

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v9 28/69] mm/mmap: reorganize munmap to use maple states
  2022-06-16  3:02                               ` Yu Zhao
@ 2022-06-16  5:45                                 ` Yu Zhao
  2022-06-16  5:55                                   ` Yu Zhao
  0 siblings, 1 reply; 83+ messages in thread
From: Yu Zhao @ 2022-06-16  5:45 UTC (permalink / raw)
  To: Liam Howlett; +Cc: Qian Cai, maple-tree, linux-mm, linux-kernel, Andrew Morton

On Wed, Jun 15, 2022 at 9:02 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Wed, Jun 15, 2022 at 8:56 PM Liam Howlett <liam.howlett@oracle.com> wrote:
> >
> > * Yu Zhao <yuzhao@google.com> [220615 21:59]:
> > > On Wed, Jun 15, 2022 at 7:50 PM Liam Howlett <liam.howlett@oracle.com> wrote:
> > > >
> > > > * Yu Zhao <yuzhao@google.com> [220615 17:17]:
> > > >
> > > > ...
> > > >
> > > > > > Yes, I used the same parameters with 512GB of RAM, and the kernel with
> > > > > > KASAN and other debug options.
> > > > >
> > > > > Sorry, Liam. I got the same crash :(
> > > >
> > > > Thanks for running this promptly.  I am trying to get my own server
> > > > setup now.
> > > >
> > > > >
> > > > > 9d27f2f1487a (tag: mm-everything-2022-06-14-19-05, akpm/mm-everything)
> > > > > 00d4d7b519d6 fs/userfaultfd: Fix vma iteration in mas_for_each() loop
> > > > > 55140693394d maple_tree: Make mas_prealloc() error checking more generic
> > > > > 2d7e7c2fcf16 maple_tree: Fix mt_destroy_walk() on full non-leaf non-alloc nodes
> > > > > 4d4472148ccd maple_tree: Change spanning store to work on larger trees
> > > > > ea36bcc14c00 test_maple_tree: Add tests for preallocations and large
> > > > > spanning writes
> > > > > 0d2aa86ead4f mm/mlock: Drop dead code in count_mm_mlocked_page_nr()
> > > > >
> > > > > ==================================================================
> > > > > BUG: KASAN: slab-out-of-bounds in mab_mas_cp+0x2d9/0x6c0
> > > > > Write of size 136 at addr ffff88c35a3b9e80 by task stress-ng/19303
> > > > >
> > > > > CPU: 66 PID: 19303 Comm: stress-ng Tainted: G S        I       5.19.0-smp-DEV #1
> > > > > Call Trace:
> > > > >  <TASK>
> > > > >  dump_stack_lvl+0xc5/0xf4
> > > > >  print_address_description+0x7f/0x460
> > > > >  print_report+0x10b/0x240
> > > > >  ? mab_mas_cp+0x2d9/0x6c0
> > > > >  kasan_report+0xe6/0x110
> > > > >  ? mast_spanning_rebalance+0x2634/0x29b0
> > > > >  ? mab_mas_cp+0x2d9/0x6c0
> > > > >  kasan_check_range+0x2ef/0x310
> > > > >  ? mab_mas_cp+0x2d9/0x6c0
> > > > >  ? mab_mas_cp+0x2d9/0x6c0
> > > > >  memcpy+0x44/0x70
> > > > >  mab_mas_cp+0x2d9/0x6c0
> > > > >  mas_spanning_rebalance+0x1a3e/0x4f90
> > > >
> > > > Does this translate to an inline around line 2997?
> > > > And then probably around 2808?
> > >
> > > $ ./scripts/faddr2line vmlinux mab_mas_cp+0x2d9
> > > mab_mas_cp+0x2d9/0x6c0:
> > > mab_mas_cp at lib/maple_tree.c:1988
> > > $ ./scripts/faddr2line vmlinux mas_spanning_rebalance+0x1a3e
> > > mas_spanning_rebalance+0x1a3e/0x4f90:
> > > mast_cp_to_nodes at lib/maple_tree.c:?
> > > (inlined by) mas_spanning_rebalance at lib/maple_tree.c:2997
> > > $ ./scripts/faddr2line vmlinux mas_wr_spanning_store+0x16c5
> > > mas_wr_spanning_store+0x16c5/0x1b80:
> > > mas_wr_spanning_store at lib/maple_tree.c:?
> > >
> > > No idea why faddr2line didn't work for the last two addresses. GDB
> > > seems more reliable.
> > >
> > > (gdb) li *(mab_mas_cp+0x2d9)
> > > 0xffffffff8226b049 is in mab_mas_cp (lib/maple_tree.c:1988).
> > > (gdb) li *(mas_spanning_rebalance+0x1a3e)
> > > 0xffffffff822633ce is in mas_spanning_rebalance (lib/maple_tree.c:2801).
> > > quit)
> > > (gdb) li *(mas_wr_spanning_store+0x16c5)
> > > 0xffffffff8225cfb5 is in mas_wr_spanning_store (lib/maple_tree.c:4030).
> >
> >
> > Thanks.  I am not having luck recreating it.  I am hitting what looks
> > like an unrelated issue in the unstable mm, "scheduling while atomic".
> > I will try the git commit you indicate above.
>
> Fix here:
> https://lore.kernel.org/linux-mm/20220615160446.be1f75fd256d67e57b27a9fc@linux-foundation.org/

A seemingly new crash on arm64:

KASAN: null-ptr-deref in range [0x0000000000000000-0x000000000000000f]
pc : __hwasan_check_x2_67043363+0x4/0x34
lr : mas_wr_walk_descend+0xe0/0x2c0
sp : ffffffc0164378d0
x29: ffffffc0164378f0 x28: 13ffff8028ee7328 x27: ffffffc016437a68
x26: 0dffff807aa63710 x25: ffffffc016437a60 x24: 51ffff8028ee1928
x23: ffffffc016437a78 x22: ffffffc0164379e0 x21: ffffffc016437998
x20: efffffc000000000 x19: ffffffc016437998 x18: 07ffff8077718180
x17: 45ffff800b366010 x16: 0000000000000000 x15: 9cffff8092bfcdf0
x14: ffffffefef411b8c x13: 0000000000000001 x12: 0000000000000002
x11: ffffffffffffff00 x10: 0000000000000000 x9 : efffffc000000000
x8 : ffffffc016437a60 x7 : 0000000000000000 x6 : ffffffefef8246cc
x5 : 0000000000000000 x4 : 0000000000000000 x3 : ffffffeff0bf48ee
x2 : 0000000000000008 x1 : ffffffc0164379b8 x0 : ffffffc016437998
Call trace:
 __hwasan_check_x2_67043363+0x4/0x34
 mas_wr_store_entry+0x178/0x5c0
 mas_store+0x88/0xc8
 dup_mmap+0x4bc/0x6d8
 dup_mm+0x8c/0x17c
 copy_mm+0xb0/0x12c
 copy_process+0xa44/0x17d4
 kernel_clone+0x100/0x2cc
 __arm64_sys_clone+0xf4/0x120
 el0_svc_common+0xfc/0x1cc
 do_el0_svc_compat+0x38/0x5c
 el0_svc_compat+0x68/0xf4
 el0t_32_sync_handler+0xc0/0xf0
 el0t_32_sync+0x190/0x194
Code: aa0203e0 d2800441 141e931d 9344dc50 (38706930)
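
For context on where this trace lands, a hedged sketch of the step it
points at follows (illustrative only, not the code in the series; the
helper name and the simplified loop body are assumptions based on the
backtrace): dup_mmap() inserting each copied VMA into the child's maple
tree.

#include <linux/maple_tree.h>
#include <linux/mm_types.h>

/*
 * Sketch only: store one copied VMA into the child mm's maple tree.
 * The real dup_mmap() loop also handles locking, allocation failures
 * and anon_vma/file state; a bad range or entry here surfaces as
 * mas_store() -> mas_wr_store_entry() faulting under clone().
 */
static void sketch_dup_insert(struct mm_struct *mm, struct vm_area_struct *tmp)
{
	MA_STATE(mas, &mm->mm_mt, tmp->vm_start, tmp->vm_end - 1);

	mas_store(&mas, tmp);
}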

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v9 28/69] mm/mmap: reorganize munmap to use maple states
  2022-06-16  5:45                                 ` Yu Zhao
@ 2022-06-16  5:55                                   ` Yu Zhao
  2022-06-16 18:26                                     ` Liam Howlett
  0 siblings, 1 reply; 83+ messages in thread
From: Yu Zhao @ 2022-06-16  5:55 UTC (permalink / raw)
  To: Liam Howlett; +Cc: Qian Cai, maple-tree, linux-mm, linux-kernel, Andrew Morton

On Wed, Jun 15, 2022 at 11:45 PM Yu Zhao <yuzhao@google.com> wrote:
>
> On Wed, Jun 15, 2022 at 9:02 PM Yu Zhao <yuzhao@google.com> wrote:
> >
> > On Wed, Jun 15, 2022 at 8:56 PM Liam Howlett <liam.howlett@oracle.com> wrote:
> > >
> > > * Yu Zhao <yuzhao@google.com> [220615 21:59]:
> > > > On Wed, Jun 15, 2022 at 7:50 PM Liam Howlett <liam.howlett@oracle.com> wrote:
> > > > >
> > > > > * Yu Zhao <yuzhao@google.com> [220615 17:17]:
> > > > >
> > > > > ...
> > > > >
> > > > > > > Yes, I used the same parameters with 512GB of RAM, and the kernel with
> > > > > > > KASAN and other debug options.
> > > > > >
> > > > > > Sorry, Liam. I got the same crash :(
> > > > >
> > > > > Thanks for running this promptly.  I am trying to get my own server
> > > > > setup now.
> > > > >
> > > > > >
> > > > > > 9d27f2f1487a (tag: mm-everything-2022-06-14-19-05, akpm/mm-everything)
> > > > > > 00d4d7b519d6 fs/userfaultfd: Fix vma iteration in mas_for_each() loop
> > > > > > 55140693394d maple_tree: Make mas_prealloc() error checking more generic
> > > > > > 2d7e7c2fcf16 maple_tree: Fix mt_destroy_walk() on full non-leaf non-alloc nodes
> > > > > > 4d4472148ccd maple_tree: Change spanning store to work on larger trees
> > > > > > ea36bcc14c00 test_maple_tree: Add tests for preallocations and large
> > > > > > spanning writes
> > > > > > 0d2aa86ead4f mm/mlock: Drop dead code in count_mm_mlocked_page_nr()
> > > > > >
> > > > > > ==================================================================
> > > > > > BUG: KASAN: slab-out-of-bounds in mab_mas_cp+0x2d9/0x6c0
> > > > > > Write of size 136 at addr ffff88c35a3b9e80 by task stress-ng/19303
> > > > > >
> > > > > > CPU: 66 PID: 19303 Comm: stress-ng Tainted: G S        I       5.19.0-smp-DEV #1
> > > > > > Call Trace:
> > > > > >  <TASK>
> > > > > >  dump_stack_lvl+0xc5/0xf4
> > > > > >  print_address_description+0x7f/0x460
> > > > > >  print_report+0x10b/0x240
> > > > > >  ? mab_mas_cp+0x2d9/0x6c0
> > > > > >  kasan_report+0xe6/0x110
> > > > > >  ? mast_spanning_rebalance+0x2634/0x29b0
> > > > > >  ? mab_mas_cp+0x2d9/0x6c0
> > > > > >  kasan_check_range+0x2ef/0x310
> > > > > >  ? mab_mas_cp+0x2d9/0x6c0
> > > > > >  ? mab_mas_cp+0x2d9/0x6c0
> > > > > >  memcpy+0x44/0x70
> > > > > >  mab_mas_cp+0x2d9/0x6c0
> > > > > >  mas_spanning_rebalance+0x1a3e/0x4f90
> > > > >
> > > > > Does this translate to an inline around line 2997?
> > > > > And then probably around 2808?
> > > >
> > > > $ ./scripts/faddr2line vmlinux mab_mas_cp+0x2d9
> > > > mab_mas_cp+0x2d9/0x6c0:
> > > > mab_mas_cp at lib/maple_tree.c:1988
> > > > $ ./scripts/faddr2line vmlinux mas_spanning_rebalance+0x1a3e
> > > > mas_spanning_rebalance+0x1a3e/0x4f90:
> > > > mast_cp_to_nodes at lib/maple_tree.c:?
> > > > (inlined by) mas_spanning_rebalance at lib/maple_tree.c:2997
> > > > $ ./scripts/faddr2line vmlinux mas_wr_spanning_store+0x16c5
> > > > mas_wr_spanning_store+0x16c5/0x1b80:
> > > > mas_wr_spanning_store at lib/maple_tree.c:?
> > > >
> > > > No idea why faddr2line didn't work for the last two addresses. GDB
> > > > seems more reliable.
> > > >
> > > > (gdb) li *(mab_mas_cp+0x2d9)
> > > > 0xffffffff8226b049 is in mab_mas_cp (lib/maple_tree.c:1988).
> > > > (gdb) li *(mas_spanning_rebalance+0x1a3e)
> > > > 0xffffffff822633ce is in mas_spanning_rebalance (lib/maple_tree.c:2801).
> > > > quit)
> > > > (gdb) li *(mas_wr_spanning_store+0x16c5)
> > > > 0xffffffff8225cfb5 is in mas_wr_spanning_store (lib/maple_tree.c:4030).
> > >
> > >
> > > Thanks.  I am not having luck recreating it.  I am hitting what looks
> > > like an unrelated issue in the unstable mm, "scheduling while atomic".
> > > I will try the git commit you indicate above.
> >
> > Fix here:
> > https://lore.kernel.org/linux-mm/20220615160446.be1f75fd256d67e57b27a9fc@linux-foundation.org/
>
> A seemingly new crash on arm64:
>
> KASAN: null-ptr-deref in range [0x0000000000000000-0x000000000000000f]
> Call trace:
>  __hwasan_check_x2_67043363+0x4/0x34
>  mas_wr_store_entry+0x178/0x5c0
>  mas_store+0x88/0xc8
>  dup_mmap+0x4bc/0x6d8
>  dup_mm+0x8c/0x17c
>  copy_mm+0xb0/0x12c
>  copy_process+0xa44/0x17d4
>  kernel_clone+0x100/0x2cc
>  __arm64_sys_clone+0xf4/0x120
>  el0_svc_common+0xfc/0x1cc
>  do_el0_svc_compat+0x38/0x5c
>  el0_svc_compat+0x68/0xf4
>  el0t_32_sync_handler+0xc0/0xf0
>  el0t_32_sync+0x190/0x194
> Code: aa0203e0 d2800441 141e931d 9344dc50 (38706930)

And bad rss counters from another arm64 machine:

BUG: Bad rss-counter state mm:a6ffff80895ff840 type:MM_ANONPAGES val:4
Call trace:
 __mmdrop+0x1f0/0x208
 __mmput+0x194/0x198
 mmput+0x5c/0x80
 exit_mm+0x108/0x190
 do_exit+0x244/0xc98
 __arm64_sys_exit_group+0x0/0x30
 __wake_up_parent+0x0/0x48
 el0_svc_common+0xfc/0x1cc
 do_el0_svc_compat+0x38/0x5c
 el0_svc_compat+0x68/0xf4
 el0t_32_sync_handler+0xc0/0xf0
 el0t_32_sync+0x190/0x194
Code: b000b520 91259c00 aa1303e1 94482015 (d4210000)

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v9 28/69] mm/mmap: reorganize munmap to use maple states
  2022-06-16  5:55                                   ` Yu Zhao
@ 2022-06-16 18:26                                     ` Liam Howlett
  2022-06-16 18:34                                       ` Yu Zhao
  0 siblings, 1 reply; 83+ messages in thread
From: Liam Howlett @ 2022-06-16 18:26 UTC (permalink / raw)
  To: Yu Zhao; +Cc: Qian Cai, maple-tree, linux-mm, linux-kernel, Andrew Morton

* Yu Zhao <yuzhao@google.com> [220616 01:56]:
> On Wed, Jun 15, 2022 at 11:45 PM Yu Zhao <yuzhao@google.com> wrote:
> >
> > On Wed, Jun 15, 2022 at 9:02 PM Yu Zhao <yuzhao@google.com> wrote:
> > >
> > > On Wed, Jun 15, 2022 at 8:56 PM Liam Howlett <liam.howlett@oracle.com> wrote:
> > > >
> > > > * Yu Zhao <yuzhao@google.com> [220615 21:59]:
> > > > > On Wed, Jun 15, 2022 at 7:50 PM Liam Howlett <liam.howlett@oracle.com> wrote:
> > > > > >
> > > > > > * Yu Zhao <yuzhao@google.com> [220615 17:17]:
> > > > > >
> > > > > > ...
> > > > > >
> > > > > > > > Yes, I used the same parameters with 512GB of RAM, and the kernel with
> > > > > > > > KASAN and other debug options.
> > > > > > >
> > > > > > > Sorry, Liam. I got the same crash :(
> > > > > >
> > > > > > Thanks for running this promptly.  I am trying to get my own server
> > > > > > setup now.
> > > > > >
> > > > > > >
> > > > > > > 9d27f2f1487a (tag: mm-everything-2022-06-14-19-05, akpm/mm-everything)
> > > > > > > 00d4d7b519d6 fs/userfaultfd: Fix vma iteration in mas_for_each() loop
> > > > > > > 55140693394d maple_tree: Make mas_prealloc() error checking more generic
> > > > > > > 2d7e7c2fcf16 maple_tree: Fix mt_destroy_walk() on full non-leaf non-alloc nodes
> > > > > > > 4d4472148ccd maple_tree: Change spanning store to work on larger trees
> > > > > > > ea36bcc14c00 test_maple_tree: Add tests for preallocations and large
> > > > > > > spanning writes
> > > > > > > 0d2aa86ead4f mm/mlock: Drop dead code in count_mm_mlocked_page_nr()
> > > > > > >
> > > > > > > ==================================================================
> > > > > > > BUG: KASAN: slab-out-of-bounds in mab_mas_cp+0x2d9/0x6c0
> > > > > > > Write of size 136 at addr ffff88c35a3b9e80 by task stress-ng/19303
> > > > > > >
> > > > > > > CPU: 66 PID: 19303 Comm: stress-ng Tainted: G S        I       5.19.0-smp-DEV #1
> > > > > > > Call Trace:
> > > > > > >  <TASK>
> > > > > > >  dump_stack_lvl+0xc5/0xf4
> > > > > > >  print_address_description+0x7f/0x460
> > > > > > >  print_report+0x10b/0x240
> > > > > > >  ? mab_mas_cp+0x2d9/0x6c0
> > > > > > >  kasan_report+0xe6/0x110
> > > > > > >  ? mast_spanning_rebalance+0x2634/0x29b0
> > > > > > >  ? mab_mas_cp+0x2d9/0x6c0
> > > > > > >  kasan_check_range+0x2ef/0x310
> > > > > > >  ? mab_mas_cp+0x2d9/0x6c0
> > > > > > >  ? mab_mas_cp+0x2d9/0x6c0
> > > > > > >  memcpy+0x44/0x70
> > > > > > >  mab_mas_cp+0x2d9/0x6c0
> > > > > > >  mas_spanning_rebalance+0x1a3e/0x4f90
> > > > > >
> > > > > > Does this translate to an inline around line 2997?
> > > > > > And then probably around 2808?
> > > > >
> > > > > $ ./scripts/faddr2line vmlinux mab_mas_cp+0x2d9
> > > > > mab_mas_cp+0x2d9/0x6c0:
> > > > > mab_mas_cp at lib/maple_tree.c:1988
> > > > > $ ./scripts/faddr2line vmlinux mas_spanning_rebalance+0x1a3e
> > > > > mas_spanning_rebalance+0x1a3e/0x4f90:
> > > > > mast_cp_to_nodes at lib/maple_tree.c:?
> > > > > (inlined by) mas_spanning_rebalance at lib/maple_tree.c:2997
> > > > > $ ./scripts/faddr2line vmlinux mas_wr_spanning_store+0x16c5
> > > > > mas_wr_spanning_store+0x16c5/0x1b80:
> > > > > mas_wr_spanning_store at lib/maple_tree.c:?
> > > > >
> > > > > No idea why faddr2line didn't work for the last two addresses. GDB
> > > > > seems more reliable.
> > > > >
> > > > > (gdb) li *(mab_mas_cp+0x2d9)
> > > > > 0xffffffff8226b049 is in mab_mas_cp (lib/maple_tree.c:1988).
> > > > > (gdb) li *(mas_spanning_rebalance+0x1a3e)
> > > > > 0xffffffff822633ce is in mas_spanning_rebalance (lib/maple_tree.c:2801).
> > > > > quit)
> > > > > (gdb) li *(mas_wr_spanning_store+0x16c5)
> > > > > 0xffffffff8225cfb5 is in mas_wr_spanning_store (lib/maple_tree.c:4030).
> > > >
> > > >
> > > > Thanks.  I am not having luck recreating it.  I am hitting what looks
> > > > like an unrelated issue in the unstable mm, "scheduling while atomic".
> > > > I will try the git commit you indicate above.
> > >
> > > Fix here:
> > > https://lore.kernel.org/linux-mm/20220615160446.be1f75fd256d67e57b27a9fc@linux-foundation.org/
> >
> > A seemingly new crash on arm64:
> >
> > KASAN: null-ptr-deref in range [0x0000000000000000-0x000000000000000f]
> > Call trace:
> >  __hwasan_check_x2_67043363+0x4/0x34
> >  mas_wr_store_entry+0x178/0x5c0
> >  mas_store+0x88/0xc8
> >  dup_mmap+0x4bc/0x6d8
> >  dup_mm+0x8c/0x17c
> >  copy_mm+0xb0/0x12c
> >  copy_process+0xa44/0x17d4
> >  kernel_clone+0x100/0x2cc
> >  __arm64_sys_clone+0xf4/0x120
> >  el0_svc_common+0xfc/0x1cc
> >  do_el0_svc_compat+0x38/0x5c
> >  el0_svc_compat+0x68/0xf4
> >  el0t_32_sync_handler+0xc0/0xf0
> >  el0t_32_sync+0x190/0x194
> > Code: aa0203e0 d2800441 141e931d 9344dc50 (38706930)
> 
> And bad rss counters from another arm64 machine:
> 
> BUG: Bad rss-counter state mm:a6ffff80895ff840 type:MM_ANONPAGES val:4
> Call trace:
>  __mmdrop+0x1f0/0x208
>  __mmput+0x194/0x198
>  mmput+0x5c/0x80
>  exit_mm+0x108/0x190
>  do_exit+0x244/0xc98
>  __arm64_sys_exit_group+0x0/0x30
>  __wake_up_parent+0x0/0x48
>  el0_svc_common+0xfc/0x1cc
>  do_el0_svc_compat+0x38/0x5c
>  el0_svc_compat+0x68/0xf4
>  el0t_32_sync_handler+0xc0/0xf0
>  el0t_32_sync+0x190/0x194
> Code: b000b520 91259c00 aa1303e1 94482015 (d4210000)
> 


What was the setup for these two?  I'm running trinity, but I suspect
you are using stress-ng?  If so, what are the arguments?  My arm64 vm
has even less memory than my x86_64 vm, so I will probably have to
adjust accordingly.


Thanks,
Liam

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v9 28/69] mm/mmap: reorganize munmap to use maple states
  2022-06-16 18:26                                     ` Liam Howlett
@ 2022-06-16 18:34                                       ` Yu Zhao
  2022-06-17 13:49                                         ` Liam Howlett
  0 siblings, 1 reply; 83+ messages in thread
From: Yu Zhao @ 2022-06-16 18:34 UTC (permalink / raw)
  To: Liam Howlett; +Cc: Qian Cai, maple-tree, linux-mm, linux-kernel, Andrew Morton

On Thu, Jun 16, 2022 at 12:27 PM Liam Howlett <liam.howlett@oracle.com> wrote:
>
> * Yu Zhao <yuzhao@google.com> [220616 01:56]:
> > On Wed, Jun 15, 2022 at 11:45 PM Yu Zhao <yuzhao@google.com> wrote:
> > >
> > > On Wed, Jun 15, 2022 at 9:02 PM Yu Zhao <yuzhao@google.com> wrote:
> > > >
> > > > On Wed, Jun 15, 2022 at 8:56 PM Liam Howlett <liam.howlett@oracle.com> wrote:
> > > > >
> > > > > * Yu Zhao <yuzhao@google.com> [220615 21:59]:
> > > > > > On Wed, Jun 15, 2022 at 7:50 PM Liam Howlett <liam.howlett@oracle.com> wrote:
> > > > > > >
> > > > > > > * Yu Zhao <yuzhao@google.com> [220615 17:17]:
> > > > > > >
> > > > > > > ...
> > > > > > >
> > > > > > > > > Yes, I used the same parameters with 512GB of RAM, and the kernel with
> > > > > > > > > KASAN and other debug options.
> > > > > > > >
> > > > > > > > Sorry, Liam. I got the same crash :(
> > > > > > >
> > > > > > > Thanks for running this promptly.  I am trying to get my own server
> > > > > > > setup now.
> > > > > > >
> > > > > > > >
> > > > > > > > 9d27f2f1487a (tag: mm-everything-2022-06-14-19-05, akpm/mm-everything)
> > > > > > > > 00d4d7b519d6 fs/userfaultfd: Fix vma iteration in mas_for_each() loop
> > > > > > > > 55140693394d maple_tree: Make mas_prealloc() error checking more generic
> > > > > > > > 2d7e7c2fcf16 maple_tree: Fix mt_destroy_walk() on full non-leaf non-alloc nodes
> > > > > > > > 4d4472148ccd maple_tree: Change spanning store to work on larger trees
> > > > > > > > ea36bcc14c00 test_maple_tree: Add tests for preallocations and large
> > > > > > > > spanning writes
> > > > > > > > 0d2aa86ead4f mm/mlock: Drop dead code in count_mm_mlocked_page_nr()
> > > > > > > >
> > > > > > > > ==================================================================
> > > > > > > > BUG: KASAN: slab-out-of-bounds in mab_mas_cp+0x2d9/0x6c0
> > > > > > > > Write of size 136 at addr ffff88c35a3b9e80 by task stress-ng/19303
> > > > > > > >
> > > > > > > > CPU: 66 PID: 19303 Comm: stress-ng Tainted: G S        I       5.19.0-smp-DEV #1
> > > > > > > > Call Trace:
> > > > > > > >  <TASK>
> > > > > > > >  dump_stack_lvl+0xc5/0xf4
> > > > > > > >  print_address_description+0x7f/0x460
> > > > > > > >  print_report+0x10b/0x240
> > > > > > > >  ? mab_mas_cp+0x2d9/0x6c0
> > > > > > > >  kasan_report+0xe6/0x110
> > > > > > > >  ? mast_spanning_rebalance+0x2634/0x29b0
> > > > > > > >  ? mab_mas_cp+0x2d9/0x6c0
> > > > > > > >  kasan_check_range+0x2ef/0x310
> > > > > > > >  ? mab_mas_cp+0x2d9/0x6c0
> > > > > > > >  ? mab_mas_cp+0x2d9/0x6c0
> > > > > > > >  memcpy+0x44/0x70
> > > > > > > >  mab_mas_cp+0x2d9/0x6c0
> > > > > > > >  mas_spanning_rebalance+0x1a3e/0x4f90
> > > > > > >
> > > > > > > Does this translate to an inline around line 2997?
> > > > > > > And then probably around 2808?
> > > > > >
> > > > > > $ ./scripts/faddr2line vmlinux mab_mas_cp+0x2d9
> > > > > > mab_mas_cp+0x2d9/0x6c0:
> > > > > > mab_mas_cp at lib/maple_tree.c:1988
> > > > > > $ ./scripts/faddr2line vmlinux mas_spanning_rebalance+0x1a3e
> > > > > > mas_spanning_rebalance+0x1a3e/0x4f90:
> > > > > > mast_cp_to_nodes at lib/maple_tree.c:?
> > > > > > (inlined by) mas_spanning_rebalance at lib/maple_tree.c:2997
> > > > > > $ ./scripts/faddr2line vmlinux mas_wr_spanning_store+0x16c5
> > > > > > mas_wr_spanning_store+0x16c5/0x1b80:
> > > > > > mas_wr_spanning_store at lib/maple_tree.c:?
> > > > > >
> > > > > > No idea why faddr2line didn't work for the last two addresses. GDB
> > > > > > seems more reliable.
> > > > > >
> > > > > > (gdb) li *(mab_mas_cp+0x2d9)
> > > > > > 0xffffffff8226b049 is in mab_mas_cp (lib/maple_tree.c:1988).
> > > > > > (gdb) li *(mas_spanning_rebalance+0x1a3e)
> > > > > > 0xffffffff822633ce is in mas_spanning_rebalance (lib/maple_tree.c:2801).
> > > > > > quit)
> > > > > > (gdb) li *(mas_wr_spanning_store+0x16c5)
> > > > > > 0xffffffff8225cfb5 is in mas_wr_spanning_store (lib/maple_tree.c:4030).
> > > > >
> > > > >
> > > > > Thanks.  I am not having luck recreating it.  I am hitting what looks
> > > > > like an unrelated issue in the unstable mm, "scheduling while atomic".
> > > > > I will try the git commit you indicate above.
> > > >
> > > > Fix here:
> > > > https://lore.kernel.org/linux-mm/20220615160446.be1f75fd256d67e57b27a9fc@linux-foundation.org/
> > >
> > > A seemingly new crash on arm64:
> > >
> > > KASAN: null-ptr-deref in range [0x0000000000000000-0x000000000000000f]
> > > Call trace:
> > >  __hwasan_check_x2_67043363+0x4/0x34
> > >  mas_wr_store_entry+0x178/0x5c0
> > >  mas_store+0x88/0xc8
> > >  dup_mmap+0x4bc/0x6d8
> > >  dup_mm+0x8c/0x17c
> > >  copy_mm+0xb0/0x12c
> > >  copy_process+0xa44/0x17d4
> > >  kernel_clone+0x100/0x2cc
> > >  __arm64_sys_clone+0xf4/0x120
> > >  el0_svc_common+0xfc/0x1cc
> > >  do_el0_svc_compat+0x38/0x5c
> > >  el0_svc_compat+0x68/0xf4
> > >  el0t_32_sync_handler+0xc0/0xf0
> > >  el0t_32_sync+0x190/0x194
> > > Code: aa0203e0 d2800441 141e931d 9344dc50 (38706930)
> >
> > And bad rss counters from another arm64 machine:
> >
> > BUG: Bad rss-counter state mm:a6ffff80895ff840 type:MM_ANONPAGES val:4
> > Call trace:
> >  __mmdrop+0x1f0/0x208
> >  __mmput+0x194/0x198
> >  mmput+0x5c/0x80
> >  exit_mm+0x108/0x190
> >  do_exit+0x244/0xc98
> >  __arm64_sys_exit_group+0x0/0x30
> >  __wake_up_parent+0x0/0x48
> >  el0_svc_common+0xfc/0x1cc
> >  do_el0_svc_compat+0x38/0x5c
> >  el0_svc_compat+0x68/0xf4
> >  el0t_32_sync_handler+0xc0/0xf0
> >  el0t_32_sync+0x190/0x194
> > Code: b000b520 91259c00 aa1303e1 94482015 (d4210000)
> >
>
> What was the setup for these two?  I'm running trinity, but I suspect
> you are using stress-ng?

That's correct.

> If so, what are the arguments?  My arm64 vm is
> even lower memory than my x86_64 vm so I will probably have to adjust
> accordingly.

I usually lower the N for `-a N`.

^ permalink raw reply	[flat|nested] 83+ messages in thread

* Re: [PATCH v9 28/69] mm/mmap: reorganize munmap to use maple states
  2022-06-16 18:34                                       ` Yu Zhao
@ 2022-06-17 13:49                                         ` Liam Howlett
  0 siblings, 0 replies; 83+ messages in thread
From: Liam Howlett @ 2022-06-17 13:49 UTC (permalink / raw)
  To: Yu Zhao; +Cc: Qian Cai, maple-tree, linux-mm, linux-kernel, Andrew Morton

* Yu Zhao <yuzhao@google.com> [220616 14:35]:
> On Thu, Jun 16, 2022 at 12:27 PM Liam Howlett <liam.howlett@oracle.com> wrote:
> >
> > * Yu Zhao <yuzhao@google.com> [220616 01:56]:
> > > On Wed, Jun 15, 2022 at 11:45 PM Yu Zhao <yuzhao@google.com> wrote:
> > > >
> > > > On Wed, Jun 15, 2022 at 9:02 PM Yu Zhao <yuzhao@google.com> wrote:
> > > > >
> > > > > On Wed, Jun 15, 2022 at 8:56 PM Liam Howlett <liam.howlett@oracle.com> wrote:
> > > > > >
> > > > > > * Yu Zhao <yuzhao@google.com> [220615 21:59]:
> > > > > > > On Wed, Jun 15, 2022 at 7:50 PM Liam Howlett <liam.howlett@oracle.com> wrote:
> > > > > > > >
> > > > > > > > * Yu Zhao <yuzhao@google.com> [220615 17:17]:
> > > > > > > >
> > > > > > > > ...
> > > > > > > >
> > > > > > > > > > Yes, I used the same parameters with 512GB of RAM, and the kernel with
> > > > > > > > > > KASAN and other debug options.
> > > > > > > > >
> > > > > > > > > Sorry, Liam. I got the same crash :(
> > > > > > > >
> > > > > > > > Thanks for running this promptly.  I am trying to get my own server
> > > > > > > > setup now.
> > > > > > > >
> > > > > > > > >
> > > > > > > > > 9d27f2f1487a (tag: mm-everything-2022-06-14-19-05, akpm/mm-everything)
> > > > > > > > > 00d4d7b519d6 fs/userfaultfd: Fix vma iteration in mas_for_each() loop
> > > > > > > > > 55140693394d maple_tree: Make mas_prealloc() error checking more generic
> > > > > > > > > 2d7e7c2fcf16 maple_tree: Fix mt_destroy_walk() on full non-leaf non-alloc nodes
> > > > > > > > > 4d4472148ccd maple_tree: Change spanning store to work on larger trees
> > > > > > > > > ea36bcc14c00 test_maple_tree: Add tests for preallocations and large
> > > > > > > > > spanning writes
> > > > > > > > > 0d2aa86ead4f mm/mlock: Drop dead code in count_mm_mlocked_page_nr()
> > > > > > > > >
> > > > > > > > > ==================================================================
> > > > > > > > > BUG: KASAN: slab-out-of-bounds in mab_mas_cp+0x2d9/0x6c0
> > > > > > > > > Write of size 136 at addr ffff88c35a3b9e80 by task stress-ng/19303
> > > > > > > > >
> > > > > > > > > CPU: 66 PID: 19303 Comm: stress-ng Tainted: G S        I       5.19.0-smp-DEV #1
> > > > > > > > > Call Trace:
> > > > > > > > >  <TASK>
> > > > > > > > >  dump_stack_lvl+0xc5/0xf4
> > > > > > > > >  print_address_description+0x7f/0x460
> > > > > > > > >  print_report+0x10b/0x240
> > > > > > > > >  ? mab_mas_cp+0x2d9/0x6c0
> > > > > > > > >  kasan_report+0xe6/0x110
> > > > > > > > >  ? mast_spanning_rebalance+0x2634/0x29b0
> > > > > > > > >  ? mab_mas_cp+0x2d9/0x6c0
> > > > > > > > >  kasan_check_range+0x2ef/0x310
> > > > > > > > >  ? mab_mas_cp+0x2d9/0x6c0
> > > > > > > > >  ? mab_mas_cp+0x2d9/0x6c0
> > > > > > > > >  memcpy+0x44/0x70
> > > > > > > > >  mab_mas_cp+0x2d9/0x6c0
> > > > > > > > >  mas_spanning_rebalance+0x1a3e/0x4f90
> > > > > > > >
> > > > > > > > Does this translate to an inline around line 2997?
> > > > > > > > And then probably around 2808?
> > > > > > >
> > > > > > > $ ./scripts/faddr2line vmlinux mab_mas_cp+0x2d9
> > > > > > > mab_mas_cp+0x2d9/0x6c0:
> > > > > > > mab_mas_cp at lib/maple_tree.c:1988
> > > > > > > $ ./scripts/faddr2line vmlinux mas_spanning_rebalance+0x1a3e
> > > > > > > mas_spanning_rebalance+0x1a3e/0x4f90:
> > > > > > > mast_cp_to_nodes at lib/maple_tree.c:?
> > > > > > > (inlined by) mas_spanning_rebalance at lib/maple_tree.c:2997
> > > > > > > $ ./scripts/faddr2line vmlinux mas_wr_spanning_store+0x16c5
> > > > > > > mas_wr_spanning_store+0x16c5/0x1b80:
> > > > > > > mas_wr_spanning_store at lib/maple_tree.c:?
> > > > > > >
> > > > > > > No idea why faddr2line didn't work for the last two addresses. GDB
> > > > > > > seems more reliable.
> > > > > > >
> > > > > > > (gdb) li *(mab_mas_cp+0x2d9)
> > > > > > > 0xffffffff8226b049 is in mab_mas_cp (lib/maple_tree.c:1988).
> > > > > > > (gdb) li *(mas_spanning_rebalance+0x1a3e)
> > > > > > > 0xffffffff822633ce is in mas_spanning_rebalance (lib/maple_tree.c:2801).
> > > > > > > quit)
> > > > > > > (gdb) li *(mas_wr_spanning_store+0x16c5)
> > > > > > > 0xffffffff8225cfb5 is in mas_wr_spanning_store (lib/maple_tree.c:4030).
> > > > > >
> > > > > >
> > > > > > Thanks.  I am not having luck recreating it.  I am hitting what looks
> > > > > > like an unrelated issue in the unstable mm, "scheduling while atomic".
> > > > > > I will try the git commit you indicate above.
> > > > >
> > > > > Fix here:
> > > > > https://lore.kernel.org/linux-mm/20220615160446.be1f75fd256d67e57b27a9fc@linux-foundation.org/
> > > >
> > > > A seemingly new crash on arm64:
> > > >
> > > > KASAN: null-ptr-deref in range [0x0000000000000000-0x000000000000000f]
> > > > Call trace:
> > > >  __hwasan_check_x2_67043363+0x4/0x34
> > > >  mas_wr_store_entry+0x178/0x5c0
> > > >  mas_store+0x88/0xc8
> > > >  dup_mmap+0x4bc/0x6d8
> > > >  dup_mm+0x8c/0x17c
> > > >  copy_mm+0xb0/0x12c
> > > >  copy_process+0xa44/0x17d4
> > > >  kernel_clone+0x100/0x2cc
> > > >  __arm64_sys_clone+0xf4/0x120
> > > >  el0_svc_common+0xfc/0x1cc
> > > >  do_el0_svc_compat+0x38/0x5c
> > > >  el0_svc_compat+0x68/0xf4
> > > >  el0t_32_sync_handler+0xc0/0xf0
> > > >  el0t_32_sync+0x190/0x194
> > > > Code: aa0203e0 d2800441 141e931d 9344dc50 (38706930)
> > >
> > > And bad rss counters from another arm64 machine:
> > >
> > > BUG: Bad rss-counter state mm:a6ffff80895ff840 type:MM_ANONPAGES val:4
> > > Call trace:
> > >  __mmdrop+0x1f0/0x208
> > >  __mmput+0x194/0x198
> > >  mmput+0x5c/0x80
> > >  exit_mm+0x108/0x190
> > >  do_exit+0x244/0xc98
> > >  __arm64_sys_exit_group+0x0/0x30
> > >  __wake_up_parent+0x0/0x48
> > >  el0_svc_common+0xfc/0x1cc
> > >  do_el0_svc_compat+0x38/0x5c
> > >  el0_svc_compat+0x68/0xf4
> > >  el0t_32_sync_handler+0xc0/0xf0
> > >  el0t_32_sync+0x190/0x194
> > > Code: b000b520 91259c00 aa1303e1 94482015 (d4210000)
> > >
> >
> > What was the setup for these two?  I'm running trinity, but I suspect
> > you are using stress-ng?
> 
> That's correct.
> 
> > If so, what are the arguments?  My arm64 vm is
> > even lower memory than my x86_64 vm so I will probably have to adjust
> > accordingly.
> 
> I usually lower the N for `-a N`.

I'm still trying to reproduce any of these bugs you are seeing.  I sent
out two fixes that I cc'ed you on which may help at least the last one
here.  My thinking is that there isn't enough pre-allocation happening,
so I am missing some of the munmap events.  I fixed this by not
pre-allocating the side tree and returning -ENOMEM instead.  This is
safe since munmap can allocate anyway for splits.
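
To sketch the approach described above (a hedged illustration, not the
posted fixes; the mas_preallocate() arguments shown are assumptions
based on the allocation trace earlier in the thread): fail the munmap
early when node preallocation for the store cannot be satisfied, rather
than also preallocating for the side tree.

	/*
	 * Sketch only, on the do_mas_align_munmap() setup path: if
	 * preallocation fails, back out with -ENOMEM.  Splits later in
	 * munmap can still allocate on their own.
	 */
	if (mas_preallocate(mas, vma, GFP_KERNEL))	/* signature assumed */
		return -ENOMEM;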

^ permalink raw reply	[flat|nested] 83+ messages in thread

end of thread, other threads:[~2022-06-17 13:50 UTC | newest]

Thread overview: 83+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-05-04  0:26 [PATCH 0/1] Prepare for maple tree Liam Howlett
2022-05-04  0:26 ` [PATCH 1/1] mips: rename mt_init to mips_mt_init Liam Howlett
2022-05-12  9:54   ` David Hildenbrand
2022-05-04  1:12 ` [PATCH v9 15/69] damon: Convert __damon_va_three_regions to use the VMA iterator Liam Howlett
2022-05-10 10:44   ` SeongJae Park
2022-05-10 16:27     ` Liam Howlett
2022-05-10 19:13     ` Andrew Morton
2022-05-04  1:13 ` [PATCH v9 16/69] proc: remove VMA rbtree use from nommu Liam Howlett
2022-05-04  1:13   ` [PATCH v9 17/69] mm: remove rb tree Liam Howlett
2022-05-04  1:13   ` [PATCH v9 19/69] xen: use vma_lookup() in privcmd_ioctl_mmap() Liam Howlett
2022-05-04  1:13   ` [PATCH v9 20/69] mm: optimize find_exact_vma() to use vma_lookup() Liam Howlett
2022-05-04  1:13   ` [PATCH v9 18/69] mmap: change zeroing of maple tree in __vma_adjust() Liam Howlett
2022-05-04  1:13   ` [PATCH v9 23/69] mm: use maple tree operations for find_vma_intersection() Liam Howlett
2022-05-04  1:13   ` [PATCH v9 21/69] mm/khugepaged: optimize collapse_pte_mapped_thp() by using vma_lookup() Liam Howlett
2022-05-04  1:13   ` [PATCH v9 22/69] mm/mmap: change do_brk_flags() to expand existing VMA and add do_brk_munmap() Liam Howlett
2022-05-04  1:13   ` [PATCH v9 24/69] mm/mmap: use advanced maple tree API for mmap_region() Liam Howlett
2022-05-04  1:13   ` [PATCH v9 25/69] mm: remove vmacache Liam Howlett
2022-05-04  1:13   ` [PATCH v9 27/69] mm/mmap: move mmap_region() below do_munmap() Liam Howlett
2022-05-04  1:13   ` [PATCH v9 26/69] mm: convert vma_lookup() to use mtree_load() Liam Howlett
2022-05-04  1:13   ` [PATCH v9 28/69] mm/mmap: reorganize munmap to use maple states Liam Howlett
2022-06-06 12:09     ` Qian Cai
2022-06-06 16:19       ` Liam Howlett
2022-06-06 16:40         ` Qian Cai
2022-06-11 20:11           ` Yu Zhao
2022-06-11 21:49             ` Yu Zhao
2022-06-12  1:09               ` Liam Howlett
2022-06-15 14:25               ` Liam Howlett
2022-06-15 18:07                 ` Yu Zhao
2022-06-15 18:55                   ` Liam Howlett
2022-06-15 19:05                     ` Yu Zhao
2022-06-15 21:16                       ` Yu Zhao
2022-06-16  1:50                         ` Liam Howlett
2022-06-16  1:58                           ` Yu Zhao
2022-06-16  2:56                             ` Liam Howlett
2022-06-16  3:02                               ` Yu Zhao
2022-06-16  5:45                                 ` Yu Zhao
2022-06-16  5:55                                   ` Yu Zhao
2022-06-16 18:26                                     ` Liam Howlett
2022-06-16 18:34                                       ` Yu Zhao
2022-06-17 13:49                                         ` Liam Howlett
2022-05-04  1:13   ` [PATCH v9 29/69] mm/mmap: change do_brk_munmap() to use do_mas_align_munmap() Liam Howlett
2022-05-04  1:13   ` [PATCH v9 30/69] arm64: remove mmap linked list from vdso Liam Howlett
2022-05-04  1:13   ` [PATCH v9 31/69] arm64: Change elfcore for_each_mte_vma() to use VMA iterator Liam Howlett
2022-05-04  1:13   ` [PATCH v9 33/69] powerpc: remove mmap linked list walks Liam Howlett
2022-05-04  1:13   ` [PATCH v9 32/69] parisc: remove mmap linked list from cache handling Liam Howlett
2022-05-04  1:13   ` [PATCH v9 35/69] x86: remove vma linked list walks Liam Howlett
2022-05-04  1:13   ` [PATCH v9 36/69] xtensa: " Liam Howlett
2022-05-04  1:13   ` [PATCH v9 34/69] s390: " Liam Howlett
2022-05-04  1:13   ` [PATCH v9 38/69] optee: remove vma linked list walk Liam Howlett
2022-05-04  1:13   ` [PATCH v9 39/69] um: " Liam Howlett
2022-05-04  1:13   ` [PATCH v9 37/69] cxl: " Liam Howlett
2022-05-04  1:13   ` [PATCH v9 40/69] coredump: " Liam Howlett
2022-05-04  1:13   ` [PATCH v9 41/69] exec: use VMA iterator instead of linked list Liam Howlett
2022-05-04  1:13   ` [PATCH v9 42/69] fs/proc/base: use maple tree iterators in place " Liam Howlett
2022-05-04  1:13   ` [PATCH v9 43/69] fs/proc/task_mmu: stop using linked list and highest_vm_end Liam Howlett
2022-05-04  1:13   ` [PATCH v9 45/69] ipc/shm: use VMA iterator instead of linked list Liam Howlett
2022-05-04  1:13   ` [PATCH v9 44/69] userfaultfd: use maple tree iterator to iterate VMAs Liam Howlett
2022-05-04  1:14   ` [PATCH v9 47/69] perf: use VMA iterator Liam Howlett
2022-05-04  1:14   ` [PATCH v9 46/69] acct: use VMA iterator instead of linked list Liam Howlett
2022-05-04  1:14   ` [PATCH v9 48/69] sched: use maple tree iterator to walk VMAs Liam Howlett
2022-05-04  1:14   ` [PATCH v9 50/69] bpf: remove VMA linked list Liam Howlett
2022-05-04  1:14   ` [PATCH v9 49/69] fork: use VMA iterator Liam Howlett
2022-05-04  1:14   ` [PATCH v9 52/69] mm/khugepaged: stop using vma linked list Liam Howlett
2022-05-04  1:14   ` [PATCH v9 51/69] mm/gup: use maple tree navigation instead of " Liam Howlett
2022-05-04  1:14   ` [PATCH v9 53/69] mm/ksm: use vma iterators instead of vma " Liam Howlett
2022-05-04  1:14   ` [PATCH v9 54/69] mm/madvise: use vma_find() " Liam Howlett
2022-05-04  1:14   ` [PATCH v9 55/69] mm/memcontrol: stop using mm->highest_vm_end Liam Howlett
2022-05-04  1:14   ` [PATCH v9 56/69] mm/mempolicy: use vma iterator & maple state instead of vma linked list Liam Howlett
2022-05-04  1:14   ` [PATCH v9 59/69] mm/mremap: use vma_find_intersection() " Liam Howlett
2022-05-04  1:14   ` [PATCH v9 57/69] mm/mlock: use vma iterator and maple state " Liam Howlett
2022-05-04  1:14   ` [PATCH v9 58/69] mm/mprotect: use maple tree navigation " Liam Howlett
2022-05-04  1:14   ` [PATCH v9 60/69] mm/msync: use vma_find() " Liam Howlett
2022-05-04  1:14   ` [PATCH v9 61/69] mm/oom_kill: use maple tree iterators " Liam Howlett
2022-05-04  1:14   ` [PATCH v9 62/69] mm/pagewalk: use vma_find() " Liam Howlett
2022-05-04  1:14   ` [PATCH v9 63/69] mm/swapfile: use vma iterator " Liam Howlett
2022-05-04  1:14   ` [PATCH v9 65/69] nommu: remove uses of VMA " Liam Howlett
2022-05-04  1:14   ` [PATCH v9 64/69] i915: use the VMA iterator Liam Howlett
2022-05-04  1:14   ` [PATCH v9 66/69] riscv: use vma iterator for vdso Liam Howlett
2022-05-04  1:14   ` [PATCH v9 68/69] mm/mmap: drop range_has_overlap() function Liam Howlett
2022-05-04  1:14   ` [PATCH v9 67/69] mm: remove the vma linked list Liam Howlett
2022-05-13 13:30     ` Qian Cai
2022-05-13 14:17       ` Liam Howlett
2022-05-04  1:14   ` [PATCH v9 69/69] mm/mmap.c: pass in mapping to __vma_link_file() Liam Howlett

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).