* [PATCH v4 0/7] Optimize mremap during mutual alignment within PMD
@ 2023-05-31 22:08 Joel Fernandes (Google)
  2023-05-31 22:08 ` [PATCH v4 1/7] mm/mremap: Optimize the start addresses in move_page_tables() Joel Fernandes (Google)
                   ` (7 more replies)
  0 siblings, 8 replies; 18+ messages in thread
From: Joel Fernandes (Google) @ 2023-05-31 22:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google),
	linux-kselftest, linux-mm, Shuah Khan, Vlastimil Babka,
	Michal Hocko, Linus Torvalds, Lorenzo Stoakes, Kirill A Shutemov,
	Liam R. Howlett, Paul E. McKenney, Suren Baghdasaryan,
	Kalesh Singh, Lokesh Gidra, Vineeth Pillai

Hello!

Here is v4 of the mremap start address optimization / fix for the exec warning.
It took me a while to write a test that catches the issue Linus and I discussed
in the last version, and I verified that the kernel crashes without the check.
See below.

The main change in this series is:
Care is now taken when the move is purely within a VMA, in other words this check
   in can_align_down():
    if (vma->vm_start != addr_to_align)
            return false;

    As an example of why this is needed:
    Consider the following range, which is 2MB aligned and is
    part of a larger 10MB range that is not shown. Each
    character below represents 256KB, making the source and
    destination 2MB each. The lower case letters are moved
    (s to d) and the upper case letters are not moved.

    |DDDDddddSSSSssss|

    If we align down 'ssss' to start at 'SSSS', we will end up destroying
    SSSS. The above if statement prevents that, and I verified it.

    I also added a test for this in the last patch.
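
    To make the arithmetic concrete, here is a tiny standalone sketch (the
    addresses are made up and a 2MB PMD size is assumed; illustrative only):

    #include <stdio.h>

    int main(void)
    {
            unsigned long pmd_mask = ~((2UL << 20) - 1); /* 2MB PMDs */
            unsigned long ssss = 0x2b00000; /* source of the move, mid-VMA */
            unsigned long SSSS = 0x2a00000; /* not moved, must survive */

            /*
             * Aligning the source down to its PMD boundary lands exactly on
             * 'SSSS', so a PMD-level copy starting there would clobber it.
             */
            printf("ssss aligned down: %#lx, start of SSSS: %#lx\n",
                   ssss & pmd_mask, SSSS);
            return 0;
    }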

History of patches
==================
v3->v4:
1. Make sure to check that the address to align is the beginning of its VMA.
2. Add a test to check this (the test fails with a kernel crash if we don't do this).

v2->v3:
1. The masked address was stored in an int; fixed it to unsigned long to avoid truncation.
2. We now handle moves happening purely within a VMA; a new test is added to cover this.
3. More code comments.

v1->v2:
1. Trigger the optimization for mremaps smaller than a PMD. I verified via
tracing that it works correctly.

2. Fix an issue with a bogus return value, found by Linus, when we broke out of
the copy loop in move_page_tables() for the first PMD itself.

v1: Initial RFC.

Description of patches
======================
These patches optimize the start addresses in move_page_tables() and test the
changes. They address a warning [1] that occurs due to a downward, overlapping
move on a mutually-aligned offset within a PMD during exec. By initiating the
copy process at the PMD level when such alignment is present, we can prevent
this warning and speed up the copying process at the same time. Linus Torvalds
suggested this idea.

Please check the individual patches for more details.

thanks,

 - Joel

[1] https://lore.kernel.org/all/ZB2GTBD%2FLWTrkOiO@dhcp22.suse.cz/

Joel Fernandes (Google) (7):
mm/mremap: Optimize the start addresses in move_page_tables()
mm/mremap: Allow moves within the same VMA for stack
selftests: mm: Fix failure case when new remap region was not found
selftests: mm: Add a test for mutually aligned moves > PMD size
selftests: mm: Add a test for remapping to area immediately after
existing mapping
selftests: mm: Add a test for remapping within a range
selftests: mm: Add a test for moving from an offset from start of
mapping

fs/exec.c                                |   2 +-
include/linux/mm.h                       |   2 +-
mm/mremap.c                              |  63 ++++-
tools/testing/selftests/mm/mremap_test.c | 301 +++++++++++++++++++----
4 files changed, 319 insertions(+), 49 deletions(-)

--
2.41.0.rc2.161.g9c6817b8e7-goog


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v4 1/7] mm/mremap: Optimize the start addresses in move_page_tables()
  2023-05-31 22:08 [PATCH v4 0/7] Optimize mremap during mutual alignment within PMD Joel Fernandes (Google)
@ 2023-05-31 22:08 ` Joel Fernandes (Google)
  2023-06-17 22:49   ` Lorenzo Stoakes
  2023-05-31 22:08 ` [PATCH v4 2/7] mm/mremap: Allow moves within the same VMA for stack Joel Fernandes (Google)
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 18+ messages in thread
From: Joel Fernandes (Google) @ 2023-05-31 22:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google),
	Linus Torvalds, linux-kselftest, linux-mm, Shuah Khan,
	Vlastimil Babka, Michal Hocko, Lorenzo Stoakes,
	Kirill A Shutemov, Liam R. Howlett, Paul E. McKenney,
	Suren Baghdasaryan, Kalesh Singh, Lokesh Gidra, Vineeth Pillai

Recently, we have seen reports [1] of a warning that triggers due to
move_page_tables() doing a downward and overlapping move on a
mutually-aligned offset within a PMD. By mutual alignment, I
mean that the source and destination addresses of the mremap are at
the same offset within a PMD.

This mutual alignment, along with the fact that the move is downward,
is sufficient to cause a warning about having an allocated PMD that
does not have PTEs in it.

This warning will only trigger when there is mutual alignment in the
move operation. A solution, as suggested by Linus Torvalds [2], is to
initiate the copy process at the PMD level whenever such alignment is
present. Implementing this approach will not only prevent the warning
from being triggered, but it will also optimize the operation as this
method should enhance the speed of the copy process whenever there's a
possibility to start copying at the PMD level.

Some more points:
a. The optimization can be done only when both the source and
destination of the mremap do not have anything mapped below them down to
a PMD boundary. I add support to detect that.

b. #a is not a problem for the call to move_page_tables() from exec.c as
nothing is expected to be mapped below the source. However, for
non-overlapping mutually aligned moves as triggered by mremap(2), I
added support for checking such cases.

c. I currently only optimize for PMD moves; in the future we can build
on this work and do PUD moves as well if there is a need for it. But I
want to take it one step at a time.

d. We need to be careful about mremap of ranges within the VMA itself.
For this purpose, I added checks to ensure that the address to align
is the beginning of the VMA that contains it.
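
As an illustration only (a minimal userspace sketch, not part of this patch;
the addresses, sizes and 2MB PMD size are assumptions), a non-overlapping,
mutually-aligned downward move as in point b. above can be set up like this:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define PMD_SZ	(2UL << 20)
#define OFFSET	(256UL << 10)	/* same offset within a PMD for src and dest */
#define LEN	(3 * PMD_SZ)

int main(void)
{
	size_t span = 16 * PMD_SZ;

	/* Reserve an address range, then free it so mappings can be placed
	 * precisely inside it with MAP_FIXED_NOREPLACE. */
	char *base = mmap(NULL, span, PROT_NONE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (base == MAP_FAILED)
		return 1;
	char *pmd0 = (char *)(((unsigned long)base + PMD_SZ) & ~(PMD_SZ - 1));
	munmap(base, span);

	/* src and dest start at the same 256KB offset into their PMDs, and
	 * nothing is mapped below either of them down to the PMD boundary,
	 * so the kernel may realign both down and copy at the PMD level. */
	char *src = mmap(pmd0 + 8 * PMD_SZ + OFFSET, LEN, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0);
	char *dest = pmd0 + 2 * PMD_SZ + OFFSET;
	if (src == MAP_FAILED)
		return 1;

	memset(src, 0xaa, LEN);

	if (mremap(src, LEN, LEN, MREMAP_MAYMOVE | MREMAP_FIXED, dest) == MAP_FAILED) {
		perror("mremap");
		return 1;
	}
	printf("moved %p -> %p (offset %#lx within a PMD)\n",
	       (void *)src, (void *)dest, OFFSET);
	return 0;
}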

[1] https://lore.kernel.org/all/ZB2GTBD%2FLWTrkOiO@dhcp22.suse.cz/
[2] https://lore.kernel.org/all/CAHk-=whd7msp8reJPfeGNyt0LiySMT0egExx3TVZSX3Ok6X=9g@mail.gmail.com/

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 mm/mremap.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 61 insertions(+)

diff --git a/mm/mremap.c b/mm/mremap.c
index 411a85682b58..bf355e4d6bd4 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -478,6 +478,51 @@ static bool move_pgt_entry(enum pgt_entry entry, struct vm_area_struct *vma,
 	return moved;
 }
 
+/*
+ * A helper to check if a previous mapping exists. Required for
+ * move_page_tables() and realign_addr() to determine if a previous mapping
+ * exists before we can do realignment optimizations.
+ */
+static bool can_align_down(struct vm_area_struct *vma, unsigned long addr_to_align,
+			       unsigned long mask)
+{
+	unsigned long addr_masked = addr_to_align & mask;
+	struct vm_area_struct *prev = NULL, *cur = NULL;
+
+	/*
+	 * If @addr_to_align of either source or destination is not the beginning
+	 * of the corresponding VMA, we can't align down or we will destroy part
+	 * of the current mapping.
+	 */
+	if (vma->vm_start != addr_to_align)
+		return false;
+
+	/*
+	 * Find the VMA before @vma to see if it subsumes the masked address.
+	 * The mmap write lock is held here so the lookup is safe.
+	 */
+	cur = find_vma_prev(vma->vm_mm, vma->vm_start, &prev);
+	if (WARN_ON_ONCE(cur != vma))
+		return false;
+
+	return !prev || prev->vm_end <= addr_masked;
+}
+
+/* Opportunistically realign to specified boundary for faster copy. */
+static void realign_addr(unsigned long *old_addr, struct vm_area_struct *old_vma,
+			 unsigned long *new_addr, struct vm_area_struct *new_vma,
+			 unsigned long mask)
+{
+	bool mutually_aligned = (*old_addr & ~mask) == (*new_addr & ~mask);
+
+	if ((*old_addr & ~mask) && mutually_aligned
+	    && can_align_down(old_vma, *old_addr, mask)
+	    && can_align_down(new_vma, *new_addr, mask)) {
+		*old_addr = *old_addr & mask;
+		*new_addr = *new_addr & mask;
+	}
+}
+
 unsigned long move_page_tables(struct vm_area_struct *vma,
 		unsigned long old_addr, struct vm_area_struct *new_vma,
 		unsigned long new_addr, unsigned long len,
@@ -493,6 +538,15 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 
 	old_end = old_addr + len;
 
+	/*
+	 * If possible, realign addresses to PMD boundary for faster copy.
+	 * Don't align for intra-VMA moves as we may destroy existing mappings.
+	 */
+	if ((vma != new_vma)
+		&& (len >= PMD_SIZE - (old_addr & ~PMD_MASK))) {
+		realign_addr(&old_addr, vma, &new_addr, new_vma, PMD_MASK);
+	}
+
 	if (is_vm_hugetlb_page(vma))
 		return move_hugetlb_page_tables(vma, new_vma, old_addr,
 						new_addr, len);
@@ -565,6 +619,13 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 
 	mmu_notifier_invalidate_range_end(&range);
 
+	/*
+	 * Prevent negative return values when {old,new}_addr was realigned
+	 * but we broke out of the above loop for the first PMD itself.
+	 */
+	if (len + old_addr < old_end)
+		return 0;
+
 	return len + old_addr - old_end;	/* how much done */
 }
 
-- 
2.41.0.rc2.161.g9c6817b8e7-goog


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v4 2/7] mm/mremap: Allow moves within the same VMA for stack
  2023-05-31 22:08 [PATCH v4 0/7] Optimize mremap during mutual alignment within PMD Joel Fernandes (Google)
  2023-05-31 22:08 ` [PATCH v4 1/7] mm/mremap: Optimize the start addresses in move_page_tables() Joel Fernandes (Google)
@ 2023-05-31 22:08 ` Joel Fernandes (Google)
  2023-05-31 22:08 ` [PATCH v4 3/7] selftests: mm: Fix failure case when new remap region was not found Joel Fernandes (Google)
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Joel Fernandes (Google) @ 2023-05-31 22:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google),
	linux-kselftest, linux-mm, Shuah Khan, Vlastimil Babka,
	Michal Hocko, Linus Torvalds, Lorenzo Stoakes, Kirill A Shutemov,
	Liam R. Howlett, Paul E. McKenney, Suren Baghdasaryan,
	Kalesh Singh, Lokesh Gidra, Vineeth Pillai

For the stack move happening in shift_arg_pages(), the move is happening
within the same VMA which spans the old and new ranges.

If the aligned address happens to fall within that VMA, allow such
moves and don't abort the optimization.

In the mremap case, we cannot allow any such moves, as they would end up
destroying some part of the mapping (either the source of the move, or
part of the existing mapping). So just avoid it for mremap.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 fs/exec.c          |  2 +-
 include/linux/mm.h |  2 +-
 mm/mremap.c        | 40 ++++++++++++++++++----------------------
 3 files changed, 20 insertions(+), 24 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 7c44d0c65b1b..7a7217353115 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -707,7 +707,7 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
 	 * process cleanup to remove whatever mess we made.
 	 */
 	if (length != move_page_tables(vma, old_start,
-				       vma, new_start, length, false))
+				       vma, new_start, length, false, true))
 		return -ENOMEM;
 
 	lru_add_drain();
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1f79667824eb..dd415cd2493d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2265,7 +2265,7 @@ int get_cmdline(struct task_struct *task, char *buffer, int buflen);
 extern unsigned long move_page_tables(struct vm_area_struct *vma,
 		unsigned long old_addr, struct vm_area_struct *new_vma,
 		unsigned long new_addr, unsigned long len,
-		bool need_rmap_locks);
+		bool need_rmap_locks, bool for_stack);
 
 /*
  * Flags used by change_protection().  For now we make it a bitmap so
diff --git a/mm/mremap.c b/mm/mremap.c
index bf355e4d6bd4..0283f9f43d92 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -479,22 +479,23 @@ static bool move_pgt_entry(enum pgt_entry entry, struct vm_area_struct *vma,
 }
 
 /*
- * A helper to check if a previous mapping exists. Required for
- * move_page_tables() and realign_addr() to determine if a previous mapping
- * exists before we can do realignment optimizations.
+ * A helper to check if aligning down is OK. The newly aligned address should
+ * not fall on any existing mapping otherwise we don't align. For the stack
+ * moving down, that's a special move within the VMA that is created to span
+ * the source and destination of the move, so we make an exception for it.
  */
 static bool can_align_down(struct vm_area_struct *vma, unsigned long addr_to_align,
-			       unsigned long mask)
+			    unsigned long mask, bool for_stack)
 {
 	unsigned long addr_masked = addr_to_align & mask;
 	struct vm_area_struct *prev = NULL, *cur = NULL;
 
 	/*
-	 * If @addr_to_align of either source or destination is not the beginning
-	 * of the corresponding VMA, we can't align down or we will destroy part
-	 * of the current mapping.
+	 * Other than for stack moves, if @addr_to_align of either source or
+	 * destination is not the beginning of the corresponding VMA, we can't
+	 * align down or we will destroy part of the current mapping.
 	 */
-	if (vma->vm_start != addr_to_align)
+	if (!for_stack && vma->vm_start != addr_to_align)
 		return false;
 
 	/*
@@ -511,13 +512,13 @@ static bool can_align_down(struct vm_area_struct *vma, unsigned long addr_to_ali
 /* Opportunistically realign to specified boundary for faster copy. */
 static void realign_addr(unsigned long *old_addr, struct vm_area_struct *old_vma,
 			 unsigned long *new_addr, struct vm_area_struct *new_vma,
-			 unsigned long mask)
+			 unsigned long mask, bool for_stack)
 {
 	bool mutually_aligned = (*old_addr & ~mask) == (*new_addr & ~mask);
 
 	if ((*old_addr & ~mask) && mutually_aligned
-	    && can_align_down(old_vma, *old_addr, mask)
-	    && can_align_down(new_vma, *new_addr, mask)) {
+	    && can_align_down(old_vma, *old_addr, mask, for_stack)
+	    && can_align_down(new_vma, *new_addr, mask, for_stack)) {
 		*old_addr = *old_addr & mask;
 		*new_addr = *new_addr & mask;
 	}
@@ -526,7 +527,7 @@ static void realign_addr(unsigned long *old_addr, struct vm_area_struct *old_vma
 unsigned long move_page_tables(struct vm_area_struct *vma,
 		unsigned long old_addr, struct vm_area_struct *new_vma,
 		unsigned long new_addr, unsigned long len,
-		bool need_rmap_locks)
+		bool need_rmap_locks, bool for_stack)
 {
 	unsigned long extent, old_end;
 	struct mmu_notifier_range range;
@@ -538,14 +539,9 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 
 	old_end = old_addr + len;
 
-	/*
-	 * If possible, realign addresses to PMD boundary for faster copy.
-	 * Don't align for intra-VMA moves as we may destroy existing mappings.
-	 */
-	if ((vma != new_vma)
-		&& (len >= PMD_SIZE - (old_addr & ~PMD_MASK))) {
-		realign_addr(&old_addr, vma, &new_addr, new_vma, PMD_MASK);
-	}
+	/* If possible, realign addresses to PMD boundary for faster copy. */
+	if (len >= PMD_SIZE - (old_addr & ~PMD_MASK))
+		realign_addr(&old_addr, vma, &new_addr, new_vma, PMD_MASK, for_stack);
 
 	if (is_vm_hugetlb_page(vma))
 		return move_hugetlb_page_tables(vma, new_vma, old_addr,
@@ -694,7 +690,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 	}
 
 	moved_len = move_page_tables(vma, old_addr, new_vma, new_addr, old_len,
-				     need_rmap_locks);
+				     need_rmap_locks, false);
 	if (moved_len < old_len) {
 		err = -ENOMEM;
 	} else if (vma->vm_ops && vma->vm_ops->mremap) {
@@ -708,7 +704,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 		 * and then proceed to unmap new area instead of old.
 		 */
 		move_page_tables(new_vma, new_addr, vma, old_addr, moved_len,
-				 true);
+				 true, false);
 		vma = new_vma;
 		old_len = new_len;
 		old_addr = new_addr;
-- 
2.41.0.rc2.161.g9c6817b8e7-goog


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v4 3/7] selftests: mm: Fix failure case when new remap region was not found
  2023-05-31 22:08 [PATCH v4 0/7] Optimize mremap during mutual alignment within PMD Joel Fernandes (Google)
  2023-05-31 22:08 ` [PATCH v4 1/7] mm/mremap: Optimize the start addresses in move_page_tables() Joel Fernandes (Google)
  2023-05-31 22:08 ` [PATCH v4 2/7] mm/mremap: Allow moves within the same VMA for stack Joel Fernandes (Google)
@ 2023-05-31 22:08 ` Joel Fernandes (Google)
  2023-05-31 22:08 ` [PATCH v4 4/7] selftests: mm: Add a test for mutually aligned moves > PMD size Joel Fernandes (Google)
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Joel Fernandes (Google) @ 2023-05-31 22:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google),
	linux-kselftest, linux-mm, Shuah Khan, Vlastimil Babka,
	Michal Hocko, Linus Torvalds, Lorenzo Stoakes, Kirill A Shutemov,
	Liam R. Howlett, Paul E. McKenney, Suren Baghdasaryan,
	Kalesh Singh, Lokesh Gidra, Vineeth Pillai

When a valid remap region could not be found, the source mapping is not
cleaned up. Fix the goto statement such that the clean up happens.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 tools/testing/selftests/mm/mremap_test.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/mm/mremap_test.c b/tools/testing/selftests/mm/mremap_test.c
index 5c3773de9f0f..6822d657f589 100644
--- a/tools/testing/selftests/mm/mremap_test.c
+++ b/tools/testing/selftests/mm/mremap_test.c
@@ -316,7 +316,7 @@ static long long remap_region(struct config c, unsigned int threshold_mb,
 		if (addr + c.dest_alignment < addr) {
 			ksft_print_msg("Couldn't find a valid region to remap to\n");
 			ret = -1;
-			goto out;
+			goto clean_up_src;
 		}
 		addr += c.dest_alignment;
 	}
-- 
2.41.0.rc2.161.g9c6817b8e7-goog


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v4 4/7] selftests: mm: Add a test for mutually aligned moves > PMD size
  2023-05-31 22:08 [PATCH v4 0/7] Optimize mremap during mutual alignment within PMD Joel Fernandes (Google)
                   ` (2 preceding siblings ...)
  2023-05-31 22:08 ` [PATCH v4 3/7] selftests: mm: Fix failure case when new remap region was not found Joel Fernandes (Google)
@ 2023-05-31 22:08 ` Joel Fernandes (Google)
  2023-05-31 22:08 ` [PATCH v4 5/7] selftests: mm: Add a test for remapping to area immediately after existing mapping Joel Fernandes (Google)
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Joel Fernandes (Google) @ 2023-05-31 22:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google),
	linux-kselftest, linux-mm, Shuah Khan, Vlastimil Babka,
	Michal Hocko, Linus Torvalds, Lorenzo Stoakes, Kirill A Shutemov,
	Liam R. Howlett, Paul E. McKenney, Suren Baghdasaryan,
	Kalesh Singh, Lokesh Gidra, Vineeth Pillai

This patch adds a test case to check whether the PMD-alignment optimization
successfully happens.

I add support to make sure there is some room before the source mapping;
otherwise the PMD-aligned move optimization will be disabled, as the kernel
will detect that a mapping exists before the source and the optimization
becomes impossible.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 tools/testing/selftests/mm/mremap_test.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/mm/mremap_test.c b/tools/testing/selftests/mm/mremap_test.c
index 6822d657f589..6304eb0947a3 100644
--- a/tools/testing/selftests/mm/mremap_test.c
+++ b/tools/testing/selftests/mm/mremap_test.c
@@ -44,6 +44,7 @@ enum {
 	_1MB = 1ULL << 20,
 	_2MB = 2ULL << 20,
 	_4MB = 4ULL << 20,
+	_5MB = 5ULL << 20,
 	_1GB = 1ULL << 30,
 	_2GB = 2ULL << 30,
 	PMD = _2MB,
@@ -235,6 +236,11 @@ static void *get_source_mapping(struct config c)
 	unsigned long long mmap_min_addr;
 
 	mmap_min_addr = get_mmap_min_addr();
+	/*
+	 * For some tests, we need to not have any mappings below the
+	 * source mapping. Add some headroom to mmap_min_addr for this.
+	 */
+	mmap_min_addr += 10 * _4MB;
 
 retry:
 	addr += c.src_alignment;
@@ -434,7 +440,7 @@ static int parse_args(int argc, char **argv, unsigned int *threshold_mb,
 	return 0;
 }
 
-#define MAX_TEST 13
+#define MAX_TEST 14
 #define MAX_PERF_TEST 3
 int main(int argc, char **argv)
 {
@@ -500,6 +506,10 @@ int main(int argc, char **argv)
 	test_cases[12] = MAKE_TEST(PUD, PUD, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS,
 				   "2GB mremap - Source PUD-aligned, Destination PUD-aligned");
 
+	/* Src and Dest addr 1MB aligned. 5MB mremap. */
+	test_cases[13] = MAKE_TEST(_1MB, _1MB, _5MB, NON_OVERLAPPING, EXPECT_SUCCESS,
+				  "5MB mremap - Source 1MB-aligned, Destination 1MB-aligned");
+
 	perf_test_cases[0] =  MAKE_TEST(page_size, page_size, _1GB, NON_OVERLAPPING, EXPECT_SUCCESS,
 					"1GB mremap - Source PTE-aligned, Destination PTE-aligned");
 	/*
-- 
2.41.0.rc2.161.g9c6817b8e7-goog


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v4 5/7] selftests: mm: Add a test for remapping to area immediately after existing mapping
  2023-05-31 22:08 [PATCH v4 0/7] Optimize mremap during mutual alignment within PMD Joel Fernandes (Google)
                   ` (3 preceding siblings ...)
  2023-05-31 22:08 ` [PATCH v4 4/7] selftests: mm: Add a test for mutually aligned moves > PMD size Joel Fernandes (Google)
@ 2023-05-31 22:08 ` Joel Fernandes (Google)
  2023-05-31 22:08 ` [PATCH v4 6/7] selftests: mm: Add a test for remapping within a range Joel Fernandes (Google)
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Joel Fernandes (Google) @ 2023-05-31 22:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google),
	linux-kselftest, linux-mm, Shuah Khan, Vlastimil Babka,
	Michal Hocko, Linus Torvalds, Lorenzo Stoakes, Kirill A Shutemov,
	Liam R. Howlett, Paul E. McKenney, Suren Baghdasaryan,
	Kalesh Singh, Lokesh Gidra, Vineeth Pillai

This patch adds support for verifying that we correctly handle the
situation where something is already mapped before the destination of the remap.

Any realignment of the destination address followed by a PMD copy would
destroy that existing mapping. In such cases, we need to avoid doing the
optimization.

To test this, we map an area, called the preamble, before the remap
region. After the mremap operation, we then verify that this region did
not get corrupted.

With some prints in the kernel, I verified that we optimize
correctly in different situations:

Optimize when there is alignment and no previous mapping (this is tested
by the previous patch).
<prints>
can_align_down(old_vma->vm_start=2900000, old_addr=2900000, mask=-2097152): 0
can_align_down(new_vma->vm_start=2f00000, new_addr=2f00000, mask=-2097152): 0
=== Starting move_page_tables ===
Doing PUD move for 2800000 -> 2e00000 of extent=200000 <-- Optimization
Doing PUD move for 2a00000 -> 3000000 of extent=200000
Doing PUD move for 2c00000 -> 3200000 of extent=200000
</prints>

Don't optimize when there is alignment but there is a previous mapping
(this is tested by this patch).
Notice that can_align_down() returns 1 for the destination mapping,
as we detected there is something there.
<prints>
can_align_down(old_vma->vm_start=2900000, old_addr=2900000, mask=-2097152): 0
can_align_down(new_vma->vm_start=5700000, new_addr=5700000, mask=-2097152): 1
=== Starting move_page_tables ===
Doing move_ptes for 2900000 -> 5700000 of extent=100000 <-- Unoptimized
Doing PUD move for 2a00000 -> 5800000 of extent=200000
Doing PUD move for 2c00000 -> 5a00000 of extent=200000
</prints>

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 tools/testing/selftests/mm/mremap_test.c | 57 +++++++++++++++++++++---
 1 file changed, 52 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/mm/mremap_test.c b/tools/testing/selftests/mm/mremap_test.c
index 6304eb0947a3..d7366074e2a8 100644
--- a/tools/testing/selftests/mm/mremap_test.c
+++ b/tools/testing/selftests/mm/mremap_test.c
@@ -29,6 +29,7 @@ struct config {
 	unsigned long long dest_alignment;
 	unsigned long long region_size;
 	int overlapping;
+	int dest_preamble_size;
 };
 
 struct test {
@@ -283,7 +284,7 @@ static void *get_source_mapping(struct config c)
 static long long remap_region(struct config c, unsigned int threshold_mb,
 			      char pattern_seed)
 {
-	void *addr, *src_addr, *dest_addr;
+	void *addr, *src_addr, *dest_addr, *dest_preamble_addr;
 	unsigned long long i;
 	struct timespec t_start = {0, 0}, t_end = {0, 0};
 	long long  start_ns, end_ns, align_mask, ret, offset;
@@ -300,7 +301,7 @@ static long long remap_region(struct config c, unsigned int threshold_mb,
 		goto out;
 	}
 
-	/* Set byte pattern */
+	/* Set byte pattern for source block. */
 	srand(pattern_seed);
 	for (i = 0; i < threshold; i++)
 		memset((char *) src_addr + i, (char) rand(), 1);
@@ -312,6 +313,9 @@ static long long remap_region(struct config c, unsigned int threshold_mb,
 	addr = (void *) (((unsigned long long) src_addr + c.region_size
 			  + offset) & align_mask);
 
+	/* Remap after the destination block preamble. */
+	addr += c.dest_preamble_size;
+
 	/* See comment in get_source_mapping() */
 	if (!((unsigned long long) addr & c.dest_alignment))
 		addr = (void *) ((unsigned long long) addr | c.dest_alignment);
@@ -327,6 +331,24 @@ static long long remap_region(struct config c, unsigned int threshold_mb,
 		addr += c.dest_alignment;
 	}
 
+	if (c.dest_preamble_size) {
+		dest_preamble_addr = mmap((void *) addr - c.dest_preamble_size, c.dest_preamble_size,
+					  PROT_READ | PROT_WRITE,
+					  MAP_FIXED_NOREPLACE | MAP_ANONYMOUS | MAP_SHARED,
+							-1, 0);
+		if (dest_preamble_addr == MAP_FAILED) {
+			ksft_print_msg("Failed to map dest preamble region: %s\n",
+					strerror(errno));
+			ret = -1;
+			goto clean_up_src;
+		}
+
+		/* Set byte pattern for the dest preamble block. */
+		srand(pattern_seed);
+		for (i = 0; i < c.dest_preamble_size; i++)
+			memset((char *) dest_preamble_addr + i, (char) rand(), 1);
+	}
+
 	clock_gettime(CLOCK_MONOTONIC, &t_start);
 	dest_addr = mremap(src_addr, c.region_size, c.region_size,
 					  MREMAP_MAYMOVE|MREMAP_FIXED, (char *) addr);
@@ -335,7 +357,7 @@ static long long remap_region(struct config c, unsigned int threshold_mb,
 	if (dest_addr == MAP_FAILED) {
 		ksft_print_msg("mremap failed: %s\n", strerror(errno));
 		ret = -1;
-		goto clean_up_src;
+		goto clean_up_dest_preamble;
 	}
 
 	/* Verify byte pattern after remapping */
@@ -353,6 +375,23 @@ static long long remap_region(struct config c, unsigned int threshold_mb,
 		}
 	}
 
+	/* Verify the dest preamble byte pattern after remapping */
+	if (c.dest_preamble_size) {
+		srand(pattern_seed);
+		for (i = 0; i < c.dest_preamble_size; i++) {
+			char c = (char) rand();
+
+			if (((char *) dest_preamble_addr)[i] != c) {
+				ksft_print_msg("Preamble data after remap doesn't match at offset %d\n",
+					       i);
+				ksft_print_msg("Expected: %#x\t Got: %#x\n", c & 0xff,
+					       ((char *) dest_preamble_addr)[i] & 0xff);
+				ret = -1;
+				goto clean_up_dest;
+			}
+		}
+	}
+
 	start_ns = t_start.tv_sec * NS_PER_SEC + t_start.tv_nsec;
 	end_ns = t_end.tv_sec * NS_PER_SEC + t_end.tv_nsec;
 	ret = end_ns - start_ns;
@@ -365,6 +404,9 @@ static long long remap_region(struct config c, unsigned int threshold_mb,
  */
 clean_up_dest:
 	munmap(dest_addr, c.region_size);
+clean_up_dest_preamble:
+	if (c.dest_preamble_size && dest_preamble_addr)
+		munmap(dest_preamble_addr, c.dest_preamble_size);
 clean_up_src:
 	munmap(src_addr, c.region_size);
 out:
@@ -440,7 +482,7 @@ static int parse_args(int argc, char **argv, unsigned int *threshold_mb,
 	return 0;
 }
 
-#define MAX_TEST 14
+#define MAX_TEST 15
 #define MAX_PERF_TEST 3
 int main(int argc, char **argv)
 {
@@ -449,7 +491,7 @@ int main(int argc, char **argv)
 	unsigned int threshold_mb = VALIDATION_DEFAULT_THRESHOLD;
 	unsigned int pattern_seed;
 	int num_expand_tests = 2;
-	struct test test_cases[MAX_TEST];
+	struct test test_cases[MAX_TEST] = {};
 	struct test perf_test_cases[MAX_PERF_TEST];
 	int page_size;
 	time_t t;
@@ -510,6 +552,11 @@ int main(int argc, char **argv)
 	test_cases[13] = MAKE_TEST(_1MB, _1MB, _5MB, NON_OVERLAPPING, EXPECT_SUCCESS,
 				  "5MB mremap - Source 1MB-aligned, Destination 1MB-aligned");
 
+	/* Src and Dest addr 1MB aligned. 5MB mremap. */
+	test_cases[14] = MAKE_TEST(_1MB, _1MB, _5MB, NON_OVERLAPPING, EXPECT_SUCCESS,
+				  "5MB mremap - Source 1MB-aligned, Dest 1MB-aligned with 40MB Preamble");
+	test_cases[14].config.dest_preamble_size = 10 * _4MB;
+
 	perf_test_cases[0] =  MAKE_TEST(page_size, page_size, _1GB, NON_OVERLAPPING, EXPECT_SUCCESS,
 					"1GB mremap - Source PTE-aligned, Destination PTE-aligned");
 	/*
-- 
2.41.0.rc2.161.g9c6817b8e7-goog


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v4 6/7] selftests: mm: Add a test for remapping within a range
  2023-05-31 22:08 [PATCH v4 0/7] Optimize mremap during mutual alignment within PMD Joel Fernandes (Google)
                   ` (4 preceding siblings ...)
  2023-05-31 22:08 ` [PATCH v4 5/7] selftests: mm: Add a test for remapping to area immediately after existing mapping Joel Fernandes (Google)
@ 2023-05-31 22:08 ` Joel Fernandes (Google)
  2023-05-31 22:08 ` [PATCH v4 7/7] selftests: mm: Add a test for moving from an offset from start of mapping Joel Fernandes (Google)
  2023-05-31 23:33 ` [PATCH v4 0/7] Optimize mremap during mutual alignment within PMD Linus Torvalds
  7 siblings, 0 replies; 18+ messages in thread
From: Joel Fernandes (Google) @ 2023-05-31 22:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google),
	linux-kselftest, linux-mm, Shuah Khan, Vlastimil Babka,
	Michal Hocko, Linus Torvalds, Lorenzo Stoakes, Kirill A Shutemov,
	Liam R. Howlett, Paul E. McKenney, Suren Baghdasaryan,
	Kalesh Singh, Lokesh Gidra, Vineeth Pillai

Move a block of memory within a larger memory range. Any alignment
optimization on the source address could corrupt the unrelated part of the
range. Verify with a kselftest that it works. I have also verified with
tracing that such an optimization does not happen, due to this check in
can_align_down():

if (!for_stack && vma->vm_start != addr_to_align)
	return false;

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 tools/testing/selftests/mm/mremap_test.c | 79 +++++++++++++++++++++++-
 1 file changed, 78 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/mm/mremap_test.c b/tools/testing/selftests/mm/mremap_test.c
index d7366074e2a8..f45d1abedc9c 100644
--- a/tools/testing/selftests/mm/mremap_test.c
+++ b/tools/testing/selftests/mm/mremap_test.c
@@ -23,6 +23,7 @@
 #define VALIDATION_NO_THRESHOLD 0	/* Verify the entire region */
 
 #define MIN(X, Y) ((X) < (Y) ? (X) : (Y))
+#define SIZE_MB(m) ((size_t)m * (1024 * 1024))
 
 struct config {
 	unsigned long long src_alignment;
@@ -226,6 +227,79 @@ static void mremap_expand_merge_offset(FILE *maps_fp, unsigned long page_size)
 		ksft_test_result_fail("%s\n", test_name);
 }
 
+/*
+ * Verify that an mremap within a range does not cause corruption
+ * of unrelated part of range.
+ *
+ * Consider the following range which is 2MB aligned and is
+ * a part of a larger 10MB range which is not shown. Each
+ * character is 256KB below making the source and destination
+ * 2MB each. The lower case letters are moved (s to d) and the
+ * upper case letters are not moved. The below test verifies
+ * that the upper case S letters are not corrupted by the
+ * adjacent mremap.
+ *
+ * |DDDDddddSSSSssss|
+ */
+static void mremap_move_within_range(char pattern_seed)
+{
+	char *test_name = "mremap mremap move within range";
+	void *src, *dest;
+	int i, success = 1;
+
+	size_t size = SIZE_MB(20);
+	void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
+			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	if (ptr == MAP_FAILED) {
+		perror("mmap");
+		success = 0;
+		goto out;
+	}
+	memset(ptr, 0, size);
+
+	src = ptr + SIZE_MB(6);
+	src = (void *)((unsigned long)src & ~(SIZE_MB(2) - 1));
+
+	/* Set byte pattern for source block. */
+	srand(pattern_seed);
+	for (i = 0; i < SIZE_MB(2); i++) {
+		((char *)src)[i] = (char) rand();
+	}
+
+	dest = src - SIZE_MB(2);
+
+	void *new_ptr = mremap(src + SIZE_MB(1), SIZE_MB(1), SIZE_MB(1),
+						   MREMAP_MAYMOVE | MREMAP_FIXED, dest + SIZE_MB(1));
+	if (new_ptr == MAP_FAILED) {
+		perror("mremap");
+		success = 0;
+		goto out;
+	}
+
+	/* Verify byte pattern after remapping */
+	srand(pattern_seed);
+	for (i = 0; i < SIZE_MB(1); i++) {
+		char c = (char) rand();
+
+		if (((char *)src)[i] != c) {
+			ksft_print_msg("Data at src at %d got corrupted due to unrelated mremap\n",
+				       i);
+			ksft_print_msg("Expected: %#x\t Got: %#x\n", c & 0xff,
+					((char *) src)[i] & 0xff);
+			success = 0;
+		}
+	}
+
+out:
+	if (munmap(ptr, size) == -1)
+		perror("munmap");
+
+	if (success)
+		ksft_test_result_pass("%s\n", test_name);
+	else
+		ksft_test_result_fail("%s\n", test_name);
+}
+
 /*
  * Returns the start address of the mapping on success, else returns
  * NULL on failure.
@@ -491,6 +565,7 @@ int main(int argc, char **argv)
 	unsigned int threshold_mb = VALIDATION_DEFAULT_THRESHOLD;
 	unsigned int pattern_seed;
 	int num_expand_tests = 2;
+	int num_misc_tests = 1;
 	struct test test_cases[MAX_TEST] = {};
 	struct test perf_test_cases[MAX_PERF_TEST];
 	int page_size;
@@ -572,7 +647,7 @@ int main(int argc, char **argv)
 				(threshold_mb * _1MB >= _1GB);
 
 	ksft_set_plan(ARRAY_SIZE(test_cases) + (run_perf_tests ?
-		      ARRAY_SIZE(perf_test_cases) : 0) + num_expand_tests);
+		      ARRAY_SIZE(perf_test_cases) : 0) + num_expand_tests + num_misc_tests);
 
 	for (i = 0; i < ARRAY_SIZE(test_cases); i++)
 		run_mremap_test_case(test_cases[i], &failures, threshold_mb,
@@ -590,6 +665,8 @@ int main(int argc, char **argv)
 
 	fclose(maps_fp);
 
+	mremap_move_within_range(pattern_seed);
+
 	if (run_perf_tests) {
 		ksft_print_msg("\n%s\n",
 		 "mremap HAVE_MOVE_PMD/PUD optimization time comparison for 1GB region:");
-- 
2.41.0.rc2.161.g9c6817b8e7-goog


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v4 7/7] selftests: mm: Add a test for moving from an offset from start of mapping
  2023-05-31 22:08 [PATCH v4 0/7] Optimize mremap during mutual alignment within PMD Joel Fernandes (Google)
                   ` (5 preceding siblings ...)
  2023-05-31 22:08 ` [PATCH v4 6/7] selftests: mm: Add a test for remapping within a range Joel Fernandes (Google)
@ 2023-05-31 22:08 ` Joel Fernandes (Google)
  2023-05-31 23:33 ` [PATCH v4 0/7] Optimize mremap during mutual alignment within PMD Linus Torvalds
  7 siblings, 0 replies; 18+ messages in thread
From: Joel Fernandes (Google) @ 2023-05-31 22:08 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google),
	linux-kselftest, linux-mm, Shuah Khan, Vlastimil Babka,
	Michal Hocko, Linus Torvalds, Lorenzo Stoakes, Kirill A Shutemov,
	Liam R. Howlett, Paul E. McKenney, Suren Baghdasaryan,
	Kalesh Singh, Lokesh Gidra, Vineeth Pillai

It is possible that the aligned-down address falls on no existing mapping;
however, that does not mean we can just align down to it.
This test verifies that the "vma->vm_start != addr_to_align" check in
can_align_down() prevents disastrous results when the source and dest are
mutually aligned within a PMD but the requested source/dest addresses are
not at the beginning of the respective mappings containing them.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 tools/testing/selftests/mm/mremap_test.c | 189 ++++++++++++++++-------
 1 file changed, 134 insertions(+), 55 deletions(-)

diff --git a/tools/testing/selftests/mm/mremap_test.c b/tools/testing/selftests/mm/mremap_test.c
index f45d1abedc9c..c71ac8306c89 100644
--- a/tools/testing/selftests/mm/mremap_test.c
+++ b/tools/testing/selftests/mm/mremap_test.c
@@ -24,6 +24,7 @@
 
 #define MIN(X, Y) ((X) < (Y) ? (X) : (Y))
 #define SIZE_MB(m) ((size_t)m * (1024 * 1024))
+#define SIZE_KB(k) ((size_t)k * 1024)
 
 struct config {
 	unsigned long long src_alignment;
@@ -148,6 +149,60 @@ static bool is_range_mapped(FILE *maps_fp, void *start, void *end)
 	return success;
 }
 
+/*
+ * Returns the start address of the mapping on success, else returns
+ * NULL on failure.
+ */
+static void *get_source_mapping(struct config c)
+{
+	unsigned long long addr = 0ULL;
+	void *src_addr = NULL;
+	unsigned long long mmap_min_addr;
+
+	mmap_min_addr = get_mmap_min_addr();
+	/*
+	 * For some tests, we need to not have any mappings below the
+	 * source mapping. Add some headroom to mmap_min_addr for this.
+	 */
+	mmap_min_addr += 10 * _4MB;
+
+retry:
+	addr += c.src_alignment;
+	if (addr < mmap_min_addr)
+		goto retry;
+
+	src_addr = mmap((void *) addr, c.region_size, PROT_READ | PROT_WRITE,
+					MAP_FIXED_NOREPLACE | MAP_ANONYMOUS | MAP_SHARED,
+					-1, 0);
+	if (src_addr == MAP_FAILED) {
+		if (errno == EPERM || errno == EEXIST)
+			goto retry;
+		goto error;
+	}
+	/*
+	 * Check that the address is aligned to the specified alignment.
+	 * Addresses which have alignments that are multiples of that
+	 * specified are not considered valid. For instance, 1GB address is
+	 * 2MB-aligned, however it will not be considered valid for a
+	 * requested alignment of 2MB. This is done to reduce coincidental
+	 * alignment in the tests.
+	 */
+	if (((unsigned long long) src_addr & (c.src_alignment - 1)) ||
+			!((unsigned long long) src_addr & c.src_alignment)) {
+		munmap(src_addr, c.region_size);
+		goto retry;
+	}
+
+	if (!src_addr)
+		goto error;
+
+	return src_addr;
+error:
+	ksft_print_msg("Failed to map source region: %s\n",
+			strerror(errno));
+	return NULL;
+}
+
 /*
  * This test validates that merge is called when expanding a mapping.
  * Mapping containing three pages is created, middle page is unmapped
@@ -300,60 +355,6 @@ static void mremap_move_within_range(char pattern_seed)
 		ksft_test_result_fail("%s\n", test_name);
 }
 
-/*
- * Returns the start address of the mapping on success, else returns
- * NULL on failure.
- */
-static void *get_source_mapping(struct config c)
-{
-	unsigned long long addr = 0ULL;
-	void *src_addr = NULL;
-	unsigned long long mmap_min_addr;
-
-	mmap_min_addr = get_mmap_min_addr();
-	/*
-	 * For some tests, we need to not have any mappings below the
-	 * source mapping. Add some headroom to mmap_min_addr for this.
-	 */
-	mmap_min_addr += 10 * _4MB;
-
-retry:
-	addr += c.src_alignment;
-	if (addr < mmap_min_addr)
-		goto retry;
-
-	src_addr = mmap((void *) addr, c.region_size, PROT_READ | PROT_WRITE,
-					MAP_FIXED_NOREPLACE | MAP_ANONYMOUS | MAP_SHARED,
-					-1, 0);
-	if (src_addr == MAP_FAILED) {
-		if (errno == EPERM || errno == EEXIST)
-			goto retry;
-		goto error;
-	}
-	/*
-	 * Check that the address is aligned to the specified alignment.
-	 * Addresses which have alignments that are multiples of that
-	 * specified are not considered valid. For instance, 1GB address is
-	 * 2MB-aligned, however it will not be considered valid for a
-	 * requested alignment of 2MB. This is done to reduce coincidental
-	 * alignment in the tests.
-	 */
-	if (((unsigned long long) src_addr & (c.src_alignment - 1)) ||
-			!((unsigned long long) src_addr & c.src_alignment)) {
-		munmap(src_addr, c.region_size);
-		goto retry;
-	}
-
-	if (!src_addr)
-		goto error;
-
-	return src_addr;
-error:
-	ksft_print_msg("Failed to map source region: %s\n",
-			strerror(errno));
-	return NULL;
-}
-
 /* Returns the time taken for the remap on success else returns -1. */
 static long long remap_region(struct config c, unsigned int threshold_mb,
 			      char pattern_seed)
@@ -487,6 +488,83 @@ static long long remap_region(struct config c, unsigned int threshold_mb,
 	return ret;
 }
 
+/*
+ * Verify that an mremap aligning down does not destroy
+ * the beginning of the mapping just because the aligned
+ * down address landed on a mapping that maybe does not exist.
+ */
+static void mremap_move_1mb_from_start(char pattern_seed)
+{
+	char *test_name = "mremap move 1mb from start at 2MB+256KB aligned src";
+	void *src = NULL, *dest = NULL;
+	int i, success = 1;
+
+	/* Config to reuse get_source_mapping() to do an aligned mmap. */
+	struct config c = {
+		.src_alignment = SIZE_MB(1) + SIZE_KB(256),
+		.region_size = SIZE_MB(6)
+	};
+
+	src = get_source_mapping(c);
+	if (!src) {
+		success = 0;
+		goto out;
+	}
+
+	c.src_alignment = SIZE_MB(1) + SIZE_KB(256);
+	dest = get_source_mapping(c);
+	if (!dest) {
+		success = 0;
+		goto out;
+	}
+
+	/* Set byte pattern for source block. */
+	srand(pattern_seed);
+	for (i = 0; i < SIZE_MB(2); i++) {
+		((char *)src)[i] = (char) rand();
+	}
+
+	/*
+	 * Unmap the beginning of dest so that the aligned address
+	 * falls on no mapping.
+	 */
+	munmap(dest, SIZE_MB(1));
+
+	void *new_ptr = mremap(src + SIZE_MB(1), SIZE_MB(1), SIZE_MB(1),
+						   MREMAP_MAYMOVE | MREMAP_FIXED, dest + SIZE_MB(1));
+	if (new_ptr == MAP_FAILED) {
+		perror("mremap");
+		success = 0;
+		goto out;
+	}
+
+	/* Verify byte pattern after remapping */
+	srand(pattern_seed);
+	for (i = 0; i < SIZE_MB(1); i++) {
+		char c = (char) rand();
+
+		if (((char *)src)[i] != c) {
+			ksft_print_msg("Data at src at %d got corrupted due to unrelated mremap\n",
+				       i);
+			ksft_print_msg("Expected: %#x\t Got: %#x\n", c & 0xff,
+					((char *) src)[i] & 0xff);
+			success = 0;
+		}
+	}
+
+out:
+	if (src && munmap(src, c.region_size) == -1)
+		perror("munmap src");
+
+	if (dest && munmap(dest, c.region_size) == -1)
+		perror("munmap dest");
+
+	if (success)
+		ksft_test_result_pass("%s\n", test_name);
+	else
+		ksft_test_result_fail("%s\n", test_name);
+}
+
 static void run_mremap_test_case(struct test test_case, int *failures,
 				 unsigned int threshold_mb,
 				 unsigned int pattern_seed)
@@ -565,7 +643,7 @@ int main(int argc, char **argv)
 	unsigned int threshold_mb = VALIDATION_DEFAULT_THRESHOLD;
 	unsigned int pattern_seed;
 	int num_expand_tests = 2;
-	int num_misc_tests = 1;
+	int num_misc_tests = 2;
 	struct test test_cases[MAX_TEST] = {};
 	struct test perf_test_cases[MAX_PERF_TEST];
 	int page_size;
@@ -666,6 +744,7 @@ int main(int argc, char **argv)
 	fclose(maps_fp);
 
 	mremap_move_within_range(pattern_seed);
+	mremap_move_1mb_from_start(pattern_seed);
 
 	if (run_perf_tests) {
 		ksft_print_msg("\n%s\n",
-- 
2.41.0.rc2.161.g9c6817b8e7-goog


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 0/7] Optimize mremap during mutual alignment within PMD
  2023-05-31 22:08 [PATCH v4 0/7] Optimize mremap during mutual alignment within PMD Joel Fernandes (Google)
                   ` (6 preceding siblings ...)
  2023-05-31 22:08 ` [PATCH v4 7/7] selftests: mm: Add a test for moving from an offset from start of mapping Joel Fernandes (Google)
@ 2023-05-31 23:33 ` Linus Torvalds
  7 siblings, 0 replies; 18+ messages in thread
From: Linus Torvalds @ 2023-05-31 23:33 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: linux-kernel, linux-kselftest, linux-mm, Shuah Khan,
	Vlastimil Babka, Michal Hocko, Lorenzo Stoakes,
	Kirill A Shutemov, Liam R. Howlett, Paul E. McKenney,
	Suren Baghdasaryan, Kalesh Singh, Lokesh Gidra, Vineeth Pillai

On Wed, May 31, 2023 at 6:08 PM Joel Fernandes (Google)
<joel@joelfernandes.org> wrote:
>
> Here is v4 of the mremap start address optimization / fix for exec warning.

I don't see anything suspicious here.

Not that that probably means much, but the test coverage looks reasonable too.

            Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 1/7] mm/mremap: Optimize the start addresses in move_page_tables()
  2023-05-31 22:08 ` [PATCH v4 1/7] mm/mremap: Optimize the start addresses in move_page_tables() Joel Fernandes (Google)
@ 2023-06-17 22:49   ` Lorenzo Stoakes
  2023-06-19 15:55     ` Joel Fernandes
  0 siblings, 1 reply; 18+ messages in thread
From: Lorenzo Stoakes @ 2023-06-17 22:49 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: linux-kernel, Linus Torvalds, linux-kselftest, linux-mm,
	Shuah Khan, Vlastimil Babka, Michal Hocko, Kirill A Shutemov,
	Liam R. Howlett, Paul E. McKenney, Suren Baghdasaryan,
	Kalesh Singh, Lokesh Gidra, Vineeth Pillai

On Wed, May 31, 2023 at 10:08:01PM +0000, Joel Fernandes (Google) wrote:
> Recently, we see reports [1] of a warning that triggers due to
> move_page_tables() doing a downward and overlapping move on a
> mutually-aligned offset within a PMD. By mutual alignment, I
> mean the source and destination addresses of the mremap are at
> the same offset within a PMD.
>
> This mutual alignment along with the fact that the move is downward is
> sufficient to cause a warning related to having an allocated PMD that
> does not have PTEs in it.
>
> This warning will only trigger when there is mutual alignment in the
> move operation. A solution, as suggested by Linus Torvalds [2], is to
> initiate the copy process at the PMD level whenever such alignment is
> present. Implementing this approach will not only prevent the warning
> from being triggered, but it will also optimize the operation as this
> method should enhance the speed of the copy process whenever there's a
> possibility to start copying at the PMD level.
>
> Some more points:
> a. The optimization can be done only when both the source and
> destination of the mremap do not have anything mapped below it up to a
> PMD boundary. I add support to detect that.
>
> b. #a is not a problem for the call to move_page_tables() from exec.c as
> nothing is expected to be mapped below the source. However, for
> non-overlapping mutually aligned moves as triggered by mremap(2), I
> added support for checking such cases.
>
> c. I currently only optimize for PMD moves, in the future I/we can build
> on this work and do PUD moves as well if there is a need for this. But I
> want to take it one step at a time.
>
> d. We need to be careful about mremap of ranges within the VMA itself.
> For this purpose, I added checks to determine if the address to align
> is not the beginning of the VMA which that address corresponds to.
>
> [1] https://lore.kernel.org/all/ZB2GTBD%2FLWTrkOiO@dhcp22.suse.cz/
> [2] https://lore.kernel.org/all/CAHk-=whd7msp8reJPfeGNyt0LiySMT0egExx3TVZSX3Ok6X=9g@mail.gmail.com/
>
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  mm/mremap.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 61 insertions(+)
>
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 411a85682b58..bf355e4d6bd4 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -478,6 +478,51 @@ static bool move_pgt_entry(enum pgt_entry entry, struct vm_area_struct *vma,
>  	return moved;
>  }
>
> +/*
> + * A helper to check if a previous mapping exists. Required for
> + * move_page_tables() and realign_addr() to determine if a previous mapping
> + * exists before we can do realignment optimizations.
> + */
> +static bool can_align_down(struct vm_area_struct *vma, unsigned long addr_to_align,
> +			       unsigned long mask)
> +{
> +	unsigned long addr_masked = addr_to_align & mask;
> +	struct vm_area_struct *prev = NULL, *cur = NULL;
> +
> +	/*
> +	 * If @addr_to_align of either source or destination is not the beginning
> +	 * of the corresponding VMA, we can't align down or we will destroy part
> +	 * of the current mapping.
> +	 */
> +	if (vma->vm_start != addr_to_align)
> +		return false;

See below, I think we can eliminate this check.

> +
> +	/*
> +	 * Find the VMA before @vma to see if it subsumes the masked address.
> +	 * The mmap write lock is held here so the lookup is safe.
> +	 */
> +	cur = find_vma_prev(vma->vm_mm, vma->vm_start, &prev);
> +	if (WARN_ON_ONCE(cur != vma))
> +		return false;
> +
> +	return !prev || prev->vm_end <= addr_masked;

This is a bit clunky, and I don't think we need the WARN_ON_ONCE() check if
we're under the mmap_lock.

How about something like:-

return find_vma_intersection(vma->vm_mm, addr_masked, vma->vm_start) == NULL;

Which explicitly asserts that the range in [addr_masked, vma->vm_start) is
empty.

But actually, we should be able to go further and replace the previous
check with:-

return find_vma_intersection(vma->vm_mm, addr_masked, addr_to_align) == NULL;

Which will fail if addr_to_align is offset within the VMA.
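
i.e., roughly, a sketch of how the helper might then look (untested):

static bool can_align_down(struct vm_area_struct *vma,
			   unsigned long addr_to_align, unsigned long mask)
{
	unsigned long addr_masked = addr_to_align & mask;

	/*
	 * Rejects both a prior mapping in [addr_masked, addr_to_align) and
	 * an addr_to_align that is offset into @vma itself.
	 */
	return find_vma_intersection(vma->vm_mm, addr_masked,
				     addr_to_align) == NULL;
}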

> +}
> +
> +/* Opportunistically realign to specified boundary for faster copy. */
> +static void realign_addr(unsigned long *old_addr, struct vm_area_struct *old_vma,

Something of a nit, but this isn't _always_ realigning the address, so perhaps
something like maybe_realign_addr() or try_realign_addr() is better?

This is probably debatable, as the comment already explains it is opportunistic
:)

> +			 unsigned long *new_addr, struct vm_area_struct *new_vma,
> +			 unsigned long mask)
> +{
> +	bool mutually_aligned = (*old_addr & ~mask) == (*new_addr & ~mask);
> +
> +	if ((*old_addr & ~mask) && mutually_aligned

I may be misunderstanding something here, but doesn't the first condition
here disallow for offset into PMD == 0? Why?

> +	    && can_align_down(old_vma, *old_addr, mask)
> +	    && can_align_down(new_vma, *new_addr, mask)) {
> +		*old_addr = *old_addr & mask;
> +		*new_addr = *new_addr & mask;
> +	}
> +}
> +
>  unsigned long move_page_tables(struct vm_area_struct *vma,
>  		unsigned long old_addr, struct vm_area_struct *new_vma,
>  		unsigned long new_addr, unsigned long len,
> @@ -493,6 +538,15 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>
>  	old_end = old_addr + len;
>
> +	/*
> +	 * If possible, realign addresses to PMD boundary for faster copy.
> +	 * Don't align for intra-VMA moves as we may destroy existing mappings.
> +	 */
> +	if ((vma != new_vma)

Nit but these parens aren't needed. Also, if we're deferring the decision as
to whether we realign to this function, why are we doing this check here
and not there?

It feels like it'd be neater to keep all the conditions (including the
length one) together in one place.


> +		&& (len >= PMD_SIZE - (old_addr & ~PMD_MASK))) {

You don't mention this condition in the above comment (if we have this
altogether as part of the realign function could comment separately there)
- so we only go ahead and do this optimisation if the length of the remap
is such that the entire of old_addr -> end of its PMD (and thus the same
for new_addr) is copied?

I may be missing something/being naive here, but can't we just do a similar
check to the one done for space _below_ the VMA to see if [end, (end of
PMD)) is equally empty?

> +		realign_addr(&old_addr, vma, &new_addr, new_vma, PMD_MASK);
> +	}
> +
>  	if (is_vm_hugetlb_page(vma))
>  		return move_hugetlb_page_tables(vma, new_vma, old_addr,
>  						new_addr, len);
> @@ -565,6 +619,13 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>
>  	mmu_notifier_invalidate_range_end(&range);
>
> +	/*
> +	 * Prevent negative return values when {old,new}_addr was realigned
> +	 * but we broke out of the above loop for the first PMD itself.
> +	 */
> +	if (len + old_addr < old_end)
> +		return 0;
> +

I find this a little iffy. I mean, I see that if you align [old,new]_addr to
PMD, then from then on you're relying on the fact that the loop is just
going from old_addr (now aligned) -> old_end and thus has the correct
length.

Can't we just fix this issue by correcting len? If you take my review above
which checks len in [maybe_]realign_addr(), you could take that as a
pointer and equally update that.

Then you can drop this check.

Also, I am concerned about the hugetlb case: len is passed to
move_hugetlb_page_tables(), which is now strictly incorrect; I wonder if
this could cause an issue?

Correcting len seems the neat way of addressing this.
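
i.e., roughly, with len passed in by pointer to the realign helper (an
untested sketch, using the try_ name from the nit above):

static void try_realign_addr(unsigned long *old_addr, struct vm_area_struct *old_vma,
			     unsigned long *new_addr, struct vm_area_struct *new_vma,
			     unsigned long *len, unsigned long mask)
{
	/* Source and destination must share the same offset within the boundary. */
	if ((*old_addr & ~mask) != (*new_addr & ~mask))
		return;

	if (!can_align_down(old_vma, *old_addr, mask) ||
	    !can_align_down(new_vma, *new_addr, mask))
		return;

	/* Grow len to cover the extra distance from the aligned-down start. */
	*len += *old_addr & ~mask;
	*old_addr &= mask;
	*new_addr &= mask;
}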

>  	return len + old_addr - old_end;	/* how much done */
>  }
>
> --
> 2.41.0.rc2.161.g9c6817b8e7-goog
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 1/7] mm/mremap: Optimize the start addresses in move_page_tables()
  2023-06-17 22:49   ` Lorenzo Stoakes
@ 2023-06-19 15:55     ` Joel Fernandes
  2023-06-20 11:02       ` Lorenzo Stoakes
  2023-06-27 17:56       ` Liam R. Howlett
  0 siblings, 2 replies; 18+ messages in thread
From: Joel Fernandes @ 2023-06-19 15:55 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, Linus Torvalds, linux-kselftest, linux-mm,
	Shuah Khan, Vlastimil Babka, Michal Hocko, Kirill A Shutemov,
	Liam R. Howlett, Paul E. McKenney, Suren Baghdasaryan,
	Kalesh Singh, Lokesh Gidra, Vineeth Pillai

Hi Lorenzo,
Thanks for the review! I replied below:

On 6/17/23 18:49, Lorenzo Stoakes wrote:
 > On Wed, May 31, 2023 at 10:08:01PM +0000, Joel Fernandes (Google) wrote:
 >> Recently, we see reports [1] of a warning that triggers due to
 >> move_page_tables() doing a downward and overlapping move on a
 >> mutually-aligned offset within a PMD. By mutual alignment, I
 >> mean the source and destination addresses of the mremap are at
 >> the same offset within a PMD.
 >>
 >> This mutual alignment along with the fact that the move is downward is
 >> sufficient to cause a warning related to having an allocated PMD that
 >> does not have PTEs in it.
 >>
 >> This warning will only trigger when there is mutual alignment in the
 >> move operation. A solution, as suggested by Linus Torvalds [2], is to
 >> initiate the copy process at the PMD level whenever such alignment is
 >> present. Implementing this approach will not only prevent the warning
 >> from being triggered, but it will also optimize the operation as this
 >> method should enhance the speed of the copy process whenever there's a

[...]

 >> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
 >> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
 >> ---
 >>   mm/mremap.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 >>   1 file changed, 61 insertions(+)
 >>
 >> diff --git a/mm/mremap.c b/mm/mremap.c
 >> index 411a85682b58..bf355e4d6bd4 100644
 >> --- a/mm/mremap.c
 >> +++ b/mm/mremap.c
 >> @@ -478,6 +478,51 @@ static bool move_pgt_entry(enum pgt_entry entry, struct
 >>   	return moved;
 >>   }
 >>
 >> +/*
 >> + * A helper to check if a previous mapping exists. Required for
 >> + * move_page_tables() and realign_addr() to determine if a previous mapping
 >> + * exists before we can do realignment optimizations.
 >> + */
 >> +static bool can_align_down(struct vm_area_struct *vma, unsigned long addr_to_align,
 >> +			       unsigned long mask)
 >> +{
 >> +	unsigned long addr_masked = addr_to_align & mask;
 >> +	struct vm_area_struct *prev = NULL, *cur = NULL;
 >> +
 >> +	/*
 >> +	 * If @addr_to_align of either source or destination is not the beginning
 >> +	 * of the corresponding VMA, we can't align down or we will destroy part
 >> +	 * of the current mapping.
 >> +	 */
 >> +	if (vma->vm_start != addr_to_align)
 >> +		return false;
 >
 > See below, I think we can eliminate this check.
 >
 >> +
 >> +	/*
 >> +	 * Find the VMA before @vma to see if it subsumes the masked address.
 >> +	 * The mmap write lock is held here so the lookup is safe.
 >> +	 */
 >> +	cur = find_vma_prev(vma->vm_mm, vma->vm_start, &prev);
 >> +	if (WARN_ON_ONCE(cur != vma))
 >> +		return false;
 >> +
 >> +	return !prev || prev->vm_end <= addr_masked;
 >
 > This is a bit clunky, and I don't think we need the WARN_ON_ONCE() check if
 > we're under the mmap_lock.
 >
 > How about something like:-
 >
 > return find_vma_intersection(vma->mm, addr_masked, vma->vm_start) == NULL;
 >
 > Which explicitly asserts that the range in [addr_masked, vma->vm_start) is
 > empty.
 >
 > But actually, we should be able to go further and replace the previous
 > check with:-
 >
 > return find_vma_intersection(vma->mm, addr_masked, addr_to_align) == NULL;
 >
 > Which will fail if addr_to_align is offset within the VMA.

Your suggestion would mean that we do a full VMA search starting from the root. That would
not be a nice thing if, say, we have 1000s of VMAs.

Actually Liam told me to use find_vma_prev() because given a VMA, the maple tree would not 
have to work that hard for the common case to find the previous VMA. Per conversing with 
him, there is a chance we may have to go one step above in the tree if we hit the edge of 
a node, but that's not supposed to be the common case. In previous code, the previous VMA 
could just be obtained using the "previous VMA" pointer, however that pointer has been 
removed since the maple tree changes and given a VMA, going to the 
maple tree is just as fast (as I'm told).

Considering this, I would keep the code as-is and perhaps you/we could consider the 
replacement with another API in a subsequent patch as it does the job for this patch.
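
That said, if we do end up switching to your suggestion in a follow-up, I think the helper 
would collapse to roughly the following (untested sketch; it assumes vma->vm_mm as in the 
current helper, and that the mmap write lock is held by the caller as it is today):

    /* Sketch only: rejects both an offset @addr_to_align and a prior mapping. */
    static bool can_align_down(struct vm_area_struct *vma,
                               unsigned long addr_to_align, unsigned long mask)
    {
            unsigned long addr_masked = addr_to_align & mask;

            /*
             * A non-NULL result means some VMA overlaps [addr_masked, addr_to_align),
             * i.e. either addr_to_align is offset within @vma or a previous mapping
             * is in the way, so we must not align down.
             */
            return find_vma_intersection(vma->vm_mm, addr_masked, addr_to_align) == NULL;
    }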

 >> +			 unsigned long *new_addr, struct vm_area_struct *new_vma,
 >> +			 unsigned long mask)
 >> +{
 >> +	bool mutually_aligned = (*old_addr & ~mask) == (*new_addr & ~mask);
 >> +
 >> +	if ((*old_addr & ~mask) && mutually_aligned
 >
 > I may be misunderstanding something here, but doesn't the first condition
 > here disallow for offset into PMD == 0? Why?

Because in such a situation, the alignment is already done and there's nothing to align. 
The patch wants to align down to the PMD and we would not want to waste CPU cycles if 
there's nothing to do.
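(For example, with 2MB PMDs an old_addr of 0x2c00000 is already PMD-aligned, its offset 
within the PMD is 0, so masking would be a no-op and we can skip the can_align_down() 
checks entirely.)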

 >> +	    && can_align_down(old_vma, *old_addr, mask)
 >> +	    && can_align_down(new_vma, *new_addr, mask)) {
 >> +		*old_addr = *old_addr & mask;
 >> +		*new_addr = *new_addr & mask;
 >> +	}
 >> +}
 >> +
 >>   unsigned long move_page_tables(struct vm_area_struct *vma,
 >>   		unsigned long old_addr, struct vm_area_struct *new_vma,
 >>   		unsigned long new_addr, unsigned long len,
 >> @@ -493,6 +538,15 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 >>
 >>   	old_end = old_addr + len;
 >>
 >> +	/*
 >> +	 * If possible, realign addresses to PMD boundary for faster copy.
 >> +	 * Don't align for intra-VMA moves as we may destroy existing mappings.
 >> +	 */
 >> +	if ((vma != new_vma)
 >
 > Nit but these parens aren't needed.

Sure, I can drop the parens.

 > Also if we're deferring the decision as
 > to whether we realign to this function, why are we doing this check here
 > and not here?

Hmm, well the function name is realign_addr() so I kept some of the initial checks outside 
of it where we should "obviously" not realign. I could do what you're suggesting and 
change it to try_realign_addr() or something. And move those checks in there. That would 
be a bit better.
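
Roughly, the shape I have in mind is something like this (untested sketch, naming not 
final; the vma != new_vma and length checks from move_page_tables() could either stay at 
the call site or move in here too if we pass the extra parameters):

    static void try_realign_addr(unsigned long *old_addr, struct vm_area_struct *old_vma,
                                 unsigned long *new_addr, struct vm_area_struct *new_vma,
                                 unsigned long mask)
    {
            /* Nothing to do if the addresses are already aligned to the boundary. */
            if (!(*old_addr & ~mask))
                    return;

            /* Only realign when source and destination share the same offset. */
            if ((*old_addr & ~mask) != (*new_addr & ~mask))
                    return;

            /* Never align down into an existing mapping on either side. */
            if (!can_align_down(old_vma, *old_addr, mask) ||
                !can_align_down(new_vma, *new_addr, mask))
                    return;

            *old_addr &= mask;
            *new_addr &= mask;
    }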

 > It feels like it'd be neater to keep all the conditions (including the
 > length one) together in one place.
 >
 >
 >> +		&& (len >= PMD_SIZE - (old_addr & ~PMD_MASK))) {

Well, yeah maybe. I'll look into it, thanks.

 > You don't mention this condition in the above comment (if we have this
 > altogether as part of the realign function could comment separately there)

Ok, sounds good -- I will add a comment with some of the explanation above.

 > - so we only go ahead and do this optimisation if the length of the remap
 > is such that the entire of old_addr -> end of its PMD (and thus the same
 > for new_addr) is copied?

Yes, correct. And in the future that could also be optimized (if say there is no 
subsequent mapping, so we can copy the tail PMD as well, however one step at a time and 
all that.)

 > I may be missing something/being naive here, but can't we just do a similar
 > check to the one done for space _below_ the VMA to see if [end, (end of
 > PMD)) is equally empty?

We can, but the warning that was triggering does not really need that to be silenced. I am 
happy to do that in a later patch if needed, or you can. ;-) But I'd like to keep the risk 
low since this was itself hard enough to get right.

 >> +		realign_addr(&old_addr, vma, &new_addr, new_vma, PMD_MASK);
 >> +	}
 >> +
 >>   	if (is_vm_hugetlb_page(vma))
 >>   		return move_hugetlb_page_tables(vma, new_vma, old_addr,
 >>   						new_addr, len);
 >> @@ -565,6 +619,13 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 >>
 >>   	mmu_notifier_invalidate_range_end(&range);
 >>
 >> +	/*
 >> +	 * Prevent negative return values when {old,new}_addr was realigned
 >> +	 * but we broke out of the above loop for the first PMD itself.
 >> +	 */
 >> +	if (len + old_addr < old_end)
 >> +		return 0;
 >> +
 >
 > I find this a little iffy, I mean I see that if you align [old,new]_addr to
 > PMD, then from then on in you're relying on the fact that the loop is just
 > going from old_addr (now aligned) -> old_end and thus has the correct
 > length.
 >
 > Can't we just fix this issue by correcting len? If you take my review above
 > which checks len in [maybe_]realign_addr(), you could take that as a
 > pointer and equally update that.
 >
 > Then you can drop this check.

The drawback of adjusting len is it changes what move_page_tables() users were previously 
expecting.

I think we should look at the return value of move_page_tables() as well, not just len 
independently.

len is what the user requested.

"len + old_addr - old_end" is how much was actually copied and is the return value.

If everything was copied, old_addr == old_end and len is unchanged.

The users of move_page_tables(), like move_vma() should not care whether we copied a full 
PMD or not. In fact telling them anything like that may cause problems with the interpretation 
of the return value I think.

They asked us to copy len, did we copy it? hell yeah.

Note that after the first loop iteration's PMD copy, old_addr is now at the PMD boundary 
and the functionality of this function is not changed with this patch. We end up doing a 
PMD-copy just like we used to without this patch. So this patch does not really change 
anything from before.

The following are the cases:

1. If we realign and copy, yes we copied a PMD, but really it was to satisfy the requested 
length. In this situation, "len + old_addr - old_end"  is accurate and just like before. 
We copied whatever the user requested. Yes we copied a little more, but who cares? We 
copied into a mapping that does not exist anyway. It may be absurd for us to return a len 
that is greater than the requested len IMO.

2. If there are no errors (example first PMD copy did not fail), "len + old_addr - 
old_end" is identical to what it was without this patch -- as it should be. That's true 
whether we realigned or not.

3. If we realigned and the first PMD copy failed (unlikely error) -- that's where there's 
a problem. We would end up returning a negative value. That's what Linus found and 
suggested to correct. Because (old_end - old_addr) will be greater than len in such a 
situation, however unlikely.
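
To put some numbers on case 3 (using x86_64's 2MB PMD_SIZE): say the user asks to move 
len = 4MB and old_addr sits 1MB into its PMD. old_end = old_addr + len is computed before 
the realignment, then the realignment pulls old_addr back by 1MB, so the loop actually 
spans 5MB. If that very first PMD copy fails and we break out, "len + old_addr - old_end" 
works out to 4MB - 5MB = -1MB, and the "if (len + old_addr < old_end) return 0;" check is 
exactly what turns that into a 0 return instead.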

 >>   	return len + old_addr - old_end;	/* how much done */
 >>   }
 > Also I am concerned in the hugetlb case -> len is passed to
 > move_hugetlb_page_tables() which is now strictly incorrect, I wonder if
 > this could cause an issue?
 >
 > Correcting len seems the neat way of addressing this.

That's a good point. I am wondering if we can just change that from:

	if (is_vm_hugetlb_page(vma))
		return move_hugetlb_page_tables(vma, new_vma, old_addr,
				new_addr, len);

to:
	if (is_vm_hugetlb_page(vma))
		return move_hugetlb_page_tables(vma, new_vma, old_addr,
				new_addr, old_addr - new_addr);

Or, another option is to turn it off for hugetlb by just moving:

	if (len >= PMD_SIZE - (old_addr & ~PMD_MASK))
		realign_addr(...);

to after:

	if (is_vm_hugetlb_page(vma))
		return move_hugetlb_page_tables(...);

thanks,

  - Joel


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 1/7] mm/mremap: Optimize the start addresses in move_page_tables()
  2023-06-19 15:55     ` Joel Fernandes
@ 2023-06-20 11:02       ` Lorenzo Stoakes
  2023-06-20 21:16         ` Joel Fernandes
  2023-06-27 17:56       ` Liam R. Howlett
  1 sibling, 1 reply; 18+ messages in thread
From: Lorenzo Stoakes @ 2023-06-20 11:02 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, Linus Torvalds, linux-kselftest, linux-mm,
	Shuah Khan, Vlastimil Babka, Michal Hocko, Kirill A Shutemov,
	Liam R. Howlett, Paul E. McKenney, Suren Baghdasaryan,
	Kalesh Singh, Lokesh Gidra, Vineeth Pillai

On Mon, Jun 19, 2023 at 11:55:08AM -0400, Joel Fernandes wrote:
> Hi Lorenzo,
> Thanks for the review! I replied below:
>
> On 6/17/23 18:49, Lorenzo Stoakes wrote:
> > On Wed, May 31, 2023 at 10:08:01PM +0000, Joel Fernandes (Google) wrote:
> >> Recently, we see reports [1] of a warning that triggers due to
> >> move_page_tables() doing a downward and overlapping move on a
> >> mutually-aligned offset within a PMD. By mutual alignment, I
> >> mean the source and destination addresses of the mremap are at
> >> the same offset within a PMD.
> >>
> >> This mutual alignment along with the fact that the move is downward is
> >> sufficient to cause a warning related to having an allocated PMD that
> >> does not have PTEs in it.
> >>
> >> This warning will only trigger when there is mutual alignment in the
> >> move operation. A solution, as suggested by Linus Torvalds [2], is to
> >> initiate the copy process at the PMD level whenever such alignment is
> >> present. Implementing this approach will not only prevent the warning
> >> from being triggered, but it will also optimize the operation as this
> >> method should enhance the speed of the copy process whenever there's a
>
> [...]
>
> >> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> >> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> >> ---
> >>   mm/mremap.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>   1 file changed, 61 insertions(+)
> >>
> >> diff --git a/mm/mremap.c b/mm/mremap.c
> >> index 411a85682b58..bf355e4d6bd4 100644
> >> --- a/mm/mremap.c
> >> +++ b/mm/mremap.c
> >> @@ -478,6 +478,51 @@ static bool move_pgt_entry(enum pgt_entry entry, struct
> >>   	return moved;
> >>   }
> >>
> >> +/*
> >> + * A helper to check if a previous mapping exists. Required for
> >> + * move_page_tables() and realign_addr() to determine if a previous mapping
> >> + * exists before we can do realignment optimizations.
> >> + */
> >> +static bool can_align_down(struct vm_area_struct *vma, unsigned long addr_to_align,
> >> +			       unsigned long mask)
> >> +{
> >> +	unsigned long addr_masked = addr_to_align & mask;
> >> +	struct vm_area_struct *prev = NULL, *cur = NULL;
> >> +
> >> +	/*
> >> +	 * If @addr_to_align of either source or destination is not the beginning
> >> +	 * of the corresponding VMA, we can't align down or we will destroy part
> >> +	 * of the current mapping.
> >> +	 */
> >> +	if (vma->vm_start != addr_to_align)
> >> +		return false;
> >
> > See below, I think we can eliminate this check.
> >
> >> +
> >> +	/*
> >> +	 * Find the VMA before @vma to see if it subsumes the masked address.
> >> +	 * The mmap write lock is held here so the lookup is safe.
> >> +	 */
> >> +	cur = find_vma_prev(vma->vm_mm, vma->vm_start, &prev);
> >> +	if (WARN_ON_ONCE(cur != vma))
> >> +		return false;
> >> +
> >> +	return !prev || prev->vm_end <= addr_masked;
> >
> > This is a bit clunky, and I don't think we need the WARN_ON_ONCE() check if
> > we're under the mmap_lock.
> >
> > How about something like:-
> >
> > return find_vma_intersection(vma->mm, addr_masked, vma->vm_start) == NULL;
> >
> > Which explicitly asserts that the range in [addr_masked, vma->vm_start) is
> > empty.
> >
> > But actually, we should be able to go further and replace the previous
> > check with:-
> >
> > return find_vma_intersection(vma->mm, addr_masked, addr_to_align) == NULL;
> >
> > Which will fail if addr_to_align is offset within the VMA.
>
> Your suggestion would mean that we do a full VMA search starting from the
> root. That would not be a nice thing if say we've 1000s of VMAs?
>
> Actually Liam told me to use find_vma_prev() because given a VMA, the maple
> tree would not have to work that hard for the common case to find the
> previous VMA. Per conversing with him, there is a chance we may have to go
> one step above in the tree if we hit the edge of a node, but that's not
> supposed to be the common case. In previous code, the previous VMA could
> just be obtained using the "previous VMA" pointer, however that pointer has
> been removed since the maple tree changes and given a VMA, going to the
> previous one using the maple tree is just as fast (as I'm told).

As far as I can tell, find_vma_prev() already does a walk? I mean this is
equivalent to find_vma() only retrieving the previous VMA right? I defer to
Liam, but I'm not sure this would be that much more involved? Perhaps he
can comment.

An alternative is to create an iterator and use vma_prev(). I find it
extremely clunky that we search for a VMA we already possess (and its
previous one) while not needing the former.

I'm not hugely familiar with the maple tree (perhaps Liam can comment) but
I suspect that'd be more performant if that's the concern. Either way I
would be surprised if this is the correct approach.

>
> Considering this, I would keep the code as-is and perhaps you/we could
> consider the replacement with another API in a subsequent patch as it does
> the job for this patch.

See above. I don't think this kind of comment is helpful in code
review. Your disagreement above suffices, I've responded to it and of
course if there is no other way this is fine.

But I'd be surprised, and re-looking up a VMA we already have is just
horrid. It's not really a nitpick, it's a code quality issue in my view.

In any case, let's please try to avoid 'if you are bothered, write a follow
up patch' style responses. If you disagree with something just say so, it's
fine! :)

>
> >> +			 unsigned long *new_addr, struct vm_area_struct *new_vma,
> >> +			 unsigned long mask)
> >> +{
> >> +	bool mutually_aligned = (*old_addr & ~mask) == (*new_addr & ~mask);
> >> +
> >> +	if ((*old_addr & ~mask) && mutually_aligned
> >
> > I may be misunderstanding something here, but doesn't the first condition
> > here disallow for offset into PMD == 0? Why?
>
> Because in such a situation, the alignment is already done and there's
> nothing to align. The patch wants to align down to the PMD and we would not
> want to waste CPU cycles if there's nothing to do.

OK, makes sense. It'd be useful to have a comment to this effect.

>
> >> +	    && can_align_down(old_vma, *old_addr, mask)
> >> +	    && can_align_down(new_vma, *new_addr, mask)) {
> >> +		*old_addr = *old_addr & mask;
> >> +		*new_addr = *new_addr & mask;
> >> +	}
> >> +}
> >> +
> >>   unsigned long move_page_tables(struct vm_area_struct *vma,
> >>   		unsigned long old_addr, struct vm_area_struct *new_vma,
> >>   		unsigned long new_addr, unsigned long len,
> >> @@ -493,6 +538,15 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> >>
> >>   	old_end = old_addr + len;
> >>
> >> +	/*
> >> +	 * If possible, realign addresses to PMD boundary for faster copy.
> >> +	 * Don't align for intra-VMA moves as we may destroy existing mappings.
> >> +	 */
> >> +	if ((vma != new_vma)
> >
> > Nit but these parens aren't needed.
>
> Sure, I can drop the parens.

Thanks.

>
> > Also if we're deferring the decision as
> > to whether we realign to this function, why are we doing this check here
> > and not here?
>
> Hmm, well the function name is realign_addr() so I kept some of the initial
> checks outside of it where we should "obviously" not realign. I could do
> what you're suggesting and change it to try_realign_addr() or something. And
> move those checks in there. That would be a bit better.

Thanks.

>
> > It feels like it'd be neater to keep all the conditions (including the
> > length one) together in one place.
> >
> >
> >> +		&& (len >= PMD_SIZE - (old_addr & ~PMD_MASK))) {
>
> Well, yeah maybe. I'll look into it, thanks.

I mean it's not a huge big deal, but reading your code having a bunch of
conditions in two different places is a little hard to parse and jarring.

>
> > You don't mention this condition in the above comment (if we have this
> > altogether as part of the realign function could comment separately there)
>
> Ok, sounds good -- I will add a comment with some of the explanation above.
>
> > - so we only go ahead and do this optimisation if the length of the remap
> > is such that the entire of old_addr -> end of its PMD (and thus the same
> > for new_addr) is copied?
>
> Yes, correct. And in the future that could also be optimized (if say there
> is no subsequent mapping, so we can copy the tail PMD as well, however one
> step at a time and all that.)
>

OK cool makes sense.

> > I may be missing something/being naive here, but can't we just do a similar
> > check to the one done for space _below_ the VMA to see if [end, (end of
> > PMD)) is equally empty?
>
> We can, but the warning that was triggering does not really need that to be
> silenced. I am happy to do that in a later patch if needed, or you can. ;-)
> But I'd like to keep the risk low since this was itself hard enough to get
> right.

(see above about 'later patch' comments...)

Sure, this is not a big deal.

>
> >> +		realign_addr(&old_addr, vma, &new_addr, new_vma, PMD_MASK);
> >> +	}
> >> +
> >>   	if (is_vm_hugetlb_page(vma))
> >>   		return move_hugetlb_page_tables(vma, new_vma, old_addr,
> >>   						new_addr, len);
> >> @@ -565,6 +619,13 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> >>
> >>   	mmu_notifier_invalidate_range_end(&range);
> >>
> >> +	/*
> >> +	 * Prevent negative return values when {old,new}_addr was realigned
> >> +	 * but we broke out of the above loop for the first PMD itself.
> >> +	 */
> >> +	if (len + old_addr < old_end)
> >> +		return 0;
> >> +
> >
> > I find this a little iffy, I mean I see that if you align [old,new]_addr to
> > PMD, then from then on in you're relying on the fact that the loop is just
> > going from old_addr (now aligned) -> old_end and thus has the correct
> > length.
> >
> > Can't we just fix this issue by correcting len? If you take my review above
> > which checks len in [maybe_]realign_addr(), you could take that as a
> > pointer and equally update that.
> >
> > Then you can drop this check.
>
> The drawback of adjusting len is it changes what move_page_tables() users
> were previously expecting.
>
> I think we should look at the return value of move_page_tables() as well,
> not just len independently.
>
> len is what the user requested.
>
> "len + old_addr - old_end" is how much was actually copied and is the return value.
>
> If everything was copied, old_addr == old_end and len is unchanged.

Ah yeah I see, sorry I missed the fact we're returning a value, that does
complicate things...

If we retain the hugetlb logic, then we could work around the issue with
that instance of len by storing the 'actual length' of the range in
a new var actual_len and passing that.

If we choose to instead just not do this for hugetlb (I wonder if the
hugetlb handling code actually does the equivalent of this since surely
these pages have to be handled a PMD at a time?) then we can drop the whole
actual_len idea [see below on response to hugetlb thing].

>
> The users of move_page_tables(), like move_vma() should not care whether we
> copied a full PMD or not. In fact telling them anything like that may cause
> problems with the interpretation of the return value I think.
>
> They asked us to copy len, did we copy it? hell yeah.
>
> Note that after the first loop iteration's PMD copy, old_addr is now at the
> PMD boundary and the functionality of this function is not changed with this
> patch. We end up doing a PMD-copy just like we used to without this patch.
> So this patch does not really change anything from before.
>
> The following are the cases:
>
> 1. If we realign and copy, yes we copied a PMD, but really it was to satisfy
> the requested length. In this situation, "len + old_addr - old_end"  is
> accurate and just like before. We copied whatever the user requested. Yes we
> copied a little more, but who cares? We copied into a mapping that does not
> exist anyway. It may be absurd for us to return a len that is greater than
> the requested len IMO.
>
> 2. If there are no errors (example first PMD copy did not fail), "len +
> old_addr - old_end" is identical to what it was without this patch -- as it
> should be. That's true whether we realigned or not.
>
> 3. If we realigned and the first PMD copy failed (unlikely error) -- that's
> where there's a problem. We would end up returning a negative value. That's
> what Linus found and suggested to correct. Because (old_end - old_addr) will
> be greater than len in such a situation, however unlikely.
>

Right. Yeah that is thorny, sorry I did miss the degree of the complexity
with that... ugh ye gods. Probably then this has to be retained.

I was thinking we could use min(actual_len + old_addr - old_end, len), but
then we'd over-report what was 'copied' (actually not copied) because
that'd include the address range spanned by the empty PTE entries up to the
start of the PMD entry.

> >>   	return len + old_addr - old_end;	/* how much done */
> >>   }
> > Also I am concerned in the hugetlb case -> len is passed to
> > move_hugetlb_page_tables() which is now strictly incorrect, I wonder if
> > this could cause an issue?
> >
> > Correcting len seems the neat way of addressing this.
>
> That's a good point. I am wondering if we can just change that from:
>
> 	if (is_vm_hugetlb_page(vma))
> 		return move_hugetlb_page_tables(vma, new_vma, old_addr,
> 				new_addr, len);
>
> to:
> 	if (is_vm_hugetlb_page(vma))
> 		return move_hugetlb_page_tables(vma, new_vma, old_addr,
> 				new_addr, old_addr - new_addr);
>
> Or, another option is to turn it off for hugetlb by just moving:
>
> 	if (len >= PMD_SIZE - (old_addr & ~PMD_MASK))
> 		realign_addr(...);
>
> to after:
>
> 	if (is_vm_hugetlb_page(vma))
> 		return move_hugetlb_page_tables(...);
>
> thanks,

I think the actual_len solution should sort this right? If not maybe better
to be conservative and disable for the hugetlb case (I'm not sure if this
would help given you'd need to be PMD aligned anyway right?), so not to
hold up the series.

If we do decide not to include hugetlb (the endless 'special case' for so
much code...) in this then we can drop the actual_len idea altogether.

(Yes I realise it's ironic I'm suggesting deferring to a later patch here
but there you go ;)

>
>  - Joel
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 1/7] mm/mremap: Optimize the start addresses in move_page_tables()
  2023-06-20 11:02       ` Lorenzo Stoakes
@ 2023-06-20 21:16         ` Joel Fernandes
  2023-06-20 21:21           ` Joel Fernandes
  2023-06-20 22:00           ` Lorenzo Stoakes
  0 siblings, 2 replies; 18+ messages in thread
From: Joel Fernandes @ 2023-06-20 21:16 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, Linus Torvalds, linux-kselftest, linux-mm,
	Shuah Khan, Vlastimil Babka, Michal Hocko, Kirill A Shutemov,
	Liam R. Howlett, Paul E. McKenney, Suren Baghdasaryan,
	Kalesh Singh, Lokesh Gidra, Vineeth Pillai

Hi Lorenzo,

On 6/20/23 07:02, Lorenzo Stoakes wrote:
> On Mon, Jun 19, 2023 at 11:55:08AM -0400, Joel Fernandes wrote:
>> Hi Lorenzo,
>> Thanks for the review! I replied below:
>>
>> On 6/17/23 18:49, Lorenzo Stoakes wrote:
>>> On Wed, May 31, 2023 at 10:08:01PM +0000, Joel Fernandes (Google) wrote:
>>>> Recently, we see reports [1] of a warning that triggers due to
>>>> move_page_tables() doing a downward and overlapping move on a
>>>> mutually-aligned offset within a PMD. By mutual alignment, I
>>>> mean the source and destination addresses of the mremap are at
>>>> the same offset within a PMD.
>>>>
>>>> This mutual alignment along with the fact that the move is downward is
>>>> sufficient to cause a warning related to having an allocated PMD that
>>>> does not have PTEs in it.
>>>>
>>>> This warning will only trigger when there is mutual alignment in the
>>>> move operation. A solution, as suggested by Linus Torvalds [2], is to
>>>> initiate the copy process at the PMD level whenever such alignment is
>>>> present. Implementing this approach will not only prevent the warning
>>>> from being triggered, but it will also optimize the operation as this
>>>> method should enhance the speed of the copy process whenever there's a
>>
>> [...]
>>
>>>> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
>>>> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
>>>> ---
>>>>    mm/mremap.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>    1 file changed, 61 insertions(+)
>>>>
>>>> diff --git a/mm/mremap.c b/mm/mremap.c
>>>> index 411a85682b58..bf355e4d6bd4 100644
>>>> --- a/mm/mremap.c
>>>> +++ b/mm/mremap.c
>>>> @@ -478,6 +478,51 @@ static bool move_pgt_entry(enum pgt_entry entry, struct
>>>>    	return moved;
>>>>    }
>>>>
>>>> +/*
>>>> + * A helper to check if a previous mapping exists. Required for
>>>> + * move_page_tables() and realign_addr() to determine if a previous mapping
>>>> + * exists before we can do realignment optimizations.
>>>> + */
>>>> +static bool can_align_down(struct vm_area_struct *vma, unsigned long addr_to_align,
>>>> +			       unsigned long mask)
>>>> +{
>>>> +	unsigned long addr_masked = addr_to_align & mask;
>>>> +	struct vm_area_struct *prev = NULL, *cur = NULL;
>>>> +
>>>> +	/*
>>>> +	 * If @addr_to_align of either source or destination is not the beginning
>>>> +	 * of the corresponding VMA, we can't align down or we will destroy part
>>>> +	 * of the current mapping.
>>>> +	 */
>>>> +	if (vma->vm_start != addr_to_align)
>>>> +		return false;
>>>
>>> See below, I think we can eliminate this check.
>>>
>>>> +
>>>> +	/*
>>>> +	 * Find the VMA before @vma to see if it subsumes the masked address.
>>>> +	 * The mmap write lock is held here so the lookup is safe.
>>>> +	 */
>>>> +	cur = find_vma_prev(vma->vm_mm, vma->vm_start, &prev);
>>>> +	if (WARN_ON_ONCE(cur != vma))
>>>> +		return false;
>>>> +
>>>> +	return !prev || prev->vm_end <= addr_masked;
>>>
>>> This is a bit clunky, and I don't think we need the WARN_ON_ONCE() check if
>>> we're under the mmap_lock.
>>>
>>> How about something like:-
>>>
>>> return find_vma_intersection(vma->mm, addr_masked, vma->vm_start) == NULL;
>>>
>>> Which explicitly asserts that the range in [addr_masked, vma->vm_start) is
>>> empty.
>>>
>>> But actually, we should be able to go further and replace the previous
>>> check with:-
>>>
>>> return find_vma_intersection(vma->mm, addr_masked, addr_to_align) == NULL;
>>>
>>> Which will fail if addr_to_align is offset within the VMA.
>>
>> Your suggestion would mean that we do a full VMA search starting from the
>> root. That would not be a nice thing if say we've 1000s of VMAs?
>>
>> Actually Liam told me to use find_vma_prev() because given a VMA, the maple
>> tree would not have to work that hard for the common case to find the
>> previous VMA. Per conversing with him, there is a chance we may have to go
>> one step above in the tree if we hit the edge of a node, but that's not
>> supposed to be the common case. In previous code, the previous VMA could
>> just be obtained using the "previous VMA" pointer, however that pointer has
>> been removed since the maple tree changes and given a VMA, going to the
>> previous one using the maple tree is just as fast (as I'm told).
> 
> As far as I can tell, find_vma_prev() already does a walk? I mean this is
> equivalent to find_vma() only retrieving the previous VMA right? I defer to
> Liam, but I'm not sure this would be that much more involved? Perhaps he
> can comment.
> 
> An alternative is to create an iterator and use vma_prev(). I find it
> extremely clunky that we search for a VMA we already possess (and its
> previous one) while not needing the former.
> 
> I'm not hugely familiar with the maple tree (perhaps Liam can comment) but
> I suspect that'd be more performant if that's the concern. Either way I
> would be surprised if this is the correct approach.

I see your point. I am not sure myself, the maple tree functions for both APIs 
are indeed similar. We already have looked up the VMA being aligned down. If 
there is a way to get the previous VMA quickly, given an existing VMA, I can 
incorporate that change.

Ideally, if I had access to the ma_state used for lookup of the VMA being 
aligned down, I could perhaps reuse that somehow. But when I checked, that 
seemed a lot more invasive to pass that state down to these align functions.

But there is merit to your suggestion itself in the sense that it cuts down a few 
more lines of code.

>> Considering this, I would keep the code as-is and perhaps you/we could
>> consider the replacement with another API in a subsequent patch as it does
>> the job for this patch.
> 
> See above. I don't think this kind of comment is helpful in code
> review. Your disagreement above suffices, I've responded to it and of
> course if there is no other way this is fine.
> 
> But I'd be surprised, and re-looking up a VMA we already have is just
> horrid. It's not really a nitpick, it's a code quality issue in my view.
> 
> In any case, let's please try to avoid 'if you are bothered, write a follow
> up patch' style responses. If you disagree with something just say so, it's
> fine! :)

I wasn't disagreeing :) Just saying that the find_vma_prev() suggested in a 
previous conversation with Liam fixes the issue (and has been tested a lot in 
this series, on my side) so I was hoping to stick to that and we could iterate 
more on that in the future.

However, after taking a deeper look at the maple tree, I'd like to give the 
find_vma_intersection() option at least a try (with appropriate attribution to you).

Apologies if the response style in my previous email came across badly. That 
wasn't my intent and I will try to improve myself.

[..]

>>>> +		realign_addr(&old_addr, vma, &new_addr, new_vma, PMD_MASK);
>>>> +	}
>>>> +
>>>>    	if (is_vm_hugetlb_page(vma))
>>>>    		return move_hugetlb_page_tables(vma, new_vma, old_addr,
>>>>    						new_addr, len);
>>>> @@ -565,6 +619,13 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>>>>
>>>>    	mmu_notifier_invalidate_range_end(&range);
>>>>
>>>> +	/*
>>>> +	 * Prevent negative return values when {old,new}_addr was realigned
>>>> +	 * but we broke out of the above loop for the first PMD itself.
>>>> +	 */
>>>> +	if (len + old_addr < old_end)
>>>> +		return 0;
>>>> +
>>>
>>> I find this a little iffy, I mean I see that if you align [old,new]_addr to
>>> PMD, then from then on in you're relying on the fact that the loop is just
>>> going from old_addr (now aligned) -> old_end and thus has the correct
>>> length.
>>>
>>> Can't we just fix this issue by correcting len? If you take my review above
>>> which checks len in [maybe_]realign_addr(), you could take that as a
>>> pointer and equally update that.
>>>
>>> Then you can drop this check.
>>
>> The drawback of adjusting len is it changes what move_page_tables() users
>> were previously expecting.
>>
>> I think we should look at the return value of move_page_tables() as well,
>> not just len independently.
>>
>> len is what the user requested.
>>
>> "len + old_addr - old_end" is how much was actually copied and is the return value.
>>
>> If everything was copied, old_addr == old_end and len is unchanged.
> 
> Ah yeah I see, sorry I missed the fact we're returning a value, that does
> complicate things...
> 
> If we retain the hugetlb logic, then we could work around the issue with
> that instance of len by storing the 'actual length' of the range in
> a new var actual_len and passing that.
> 
> If we choose to instead just not do this for hugetlb (I wonder if the
> hugetlb handling code actually does the equivalent of this since surely
> these pages have to be handled a PMD at a time?) then we can drop the whole
> actual_len idea [see below on response to hugetlb thing].

Thanks. Yes, you are right. We should already be good with hugetlb handling as 
it does appear that move_hugetlb_page_tables() does copy by huge_page_size(h), 
so the old_addr should already be PMD-aligned for it to be able to do that.
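
(As I understand it, hugetlb mappings must be huge_page_size() aligned to begin with and 
mremap() rejects unaligned addresses for hugetlb VMAs, so old_addr & ~PMD_MASK should 
always be zero in that path and the realignment would be a no-op there anyway.)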

[..]

>>>>    	return len + old_addr - old_end;	/* how much done */
>>>>    }
>>> Also I am concerned in the hugetlb case -> len is passed to
>>> move_hugetlb_page_tables() which is now strictly incorrect, I wonder if
>>> this could cause an issue?
>>>
>>> Correcting len seems the neat way of addressing this.
>>
>> That's a good point. I am wondering if we can just change that from:
>>
>> 	if (is_vm_hugetlb_page(vma))
>> 		return move_hugetlb_page_tables(vma, new_vma, old_addr,
>> 				new_addr, len);
>>
>> to:
>> 	if (is_vm_hugetlb_page(vma))
>> 		return move_hugetlb_page_tables(vma, new_vma, old_addr,
>> 				new_addr, old_addr - new_addr);
>>
>> Or, another option is to turn it off for hugetlb by just moving:
>>
>> 	if (len >= PMD_SIZE - (old_addr & ~PMD_MASK))
>> 		realign_addr(...);
>>
>> to after:
>>
>> 	if (is_vm_hugetlb_page(vma))
>> 		return move_hugetlb_page_tables(...);
>>
>> thanks,
> 
> I think the actual_len solution should sort this right? If not maybe better
> to be conservative and disable for the hugetlb case (I'm not sure if this
> would help given you'd need to be PMD aligned anyway right?), so not to
> hold up the series.
> 
> If we do decide not to include hugetlb (the endless 'special case' for so
> much code...) in this then we can drop the actual_len idea altogether.
> 
> (Yes I realise it's ironic I'm suggesting deferring to a later patch here
> but there you go ;)

;-). Considering our discussion above that hugetlb mremap addresses should 
always start at a PMD boundary, maybe I can just add a warning to the if() like 
so to detect any potential issue?

	if (is_vm_hugetlb_page(vma)) {
		WARN_ON_ONCE(old_addr - old_end != len);
		return move_hugetlb_page_tables(vma, new_vma, old_addr,
						new_addr, len);
         }


Thank you so much and I learnt a lot from you and others in -mm community.

- Joel


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 1/7] mm/mremap: Optimize the start addresses in move_page_tables()
  2023-06-20 21:16         ` Joel Fernandes
@ 2023-06-20 21:21           ` Joel Fernandes
  2023-06-20 22:00           ` Lorenzo Stoakes
  1 sibling, 0 replies; 18+ messages in thread
From: Joel Fernandes @ 2023-06-20 21:21 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, Linus Torvalds, linux-kselftest, linux-mm,
	Shuah Khan, Vlastimil Babka, Michal Hocko, Kirill A Shutemov,
	Liam R. Howlett, Paul E. McKenney, Suren Baghdasaryan,
	Kalesh Singh, Lokesh Gidra, Vineeth Pillai

On 6/20/23 17:16, Joel Fernandes wrote:
> 
> Considering our discussion above that hugetlb mremap addresses should always 
> start at a PMD boundary, maybe I can just add a warning to the if() like so to 
> detect any potential issue?
> 
>      if (is_vm_hugetlb_page(vma)) {
>          WARN_ON_ONCE(old_addr - old_end != len);

Oops, I meant WARN_ON_ONCE(old_end - old_addr != len);

to make sure we did not mess up hugetlb mremaps.

thanks,

  - Joel


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 1/7] mm/mremap: Optimize the start addresses in move_page_tables()
  2023-06-20 21:16         ` Joel Fernandes
  2023-06-20 21:21           ` Joel Fernandes
@ 2023-06-20 22:00           ` Lorenzo Stoakes
  1 sibling, 0 replies; 18+ messages in thread
From: Lorenzo Stoakes @ 2023-06-20 22:00 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, Linus Torvalds, linux-kselftest, linux-mm,
	Shuah Khan, Vlastimil Babka, Michal Hocko, Kirill A Shutemov,
	Liam R. Howlett, Paul E. McKenney, Suren Baghdasaryan,
	Kalesh Singh, Lokesh Gidra, Vineeth Pillai

On Tue, Jun 20, 2023 at 05:16:57PM -0400, Joel Fernandes wrote:
> Hi Lorenzo,
>
> > As far as I can tell, find_vma_prev() already does a walk? I mean this is
> > equivalent to find_vma() only retrieving the previous VMA right? I defer to
> > Liam, but I'm not sure this would be that much more involved? Perhaps he
> > can comment.
> >
> > An alternative is to create an iterator and use vma_prev(). I find it
> > extremely clunky that we search for a VMA we already possess (and its
> > previous one) while not needing the former.
> >
> > I'm not hugely familiar with the maple tree (perhaps Liam can comment) but
> > I suspect that'd be more performant if that's the concern. Either way I
> > would be surprised if this is the correct approach.
>
> I see your point. I am not sure myself, the maple tree functions for both
> APIs are indeed similar. We already have looked up the VMA being aligned
> down. If there is a way to get the previous VMA quickly, given an existing
> VMA, I can incorporate that change.
>
> Ideally, if I had access to the ma_state used for lookup of the VMA being
> aligned down, I could perhaps reuse that somehow. But when I checked, that
> seemed a lot more invasive to pass that state down to these align functions.
>
> But there is merit to your suggestion itself in the sense that it cuts down a
> few more lines of code.

Yeah it's thorny, the maple tree seems to add a separation between the vmi
and vma that didn't exist previously, and you'd have to thread through the
mas or vmi that was used in the first instance or end up having to rewalk
the tree anyway.

I have spent too much time staring at v6 code where it was trivial :)

I think given we have to walk the tree either way, we may as well do the
find_vma_intersection() [happy to stand corrected by Liam if this turns out
to be less efficient, but I don't think it is].

>
> > > Considering this, I would keep the code as-is and perhaps you/we could
> > > consider the replacement with another API in a subsequent patch as it does
> > > the job for this patch.
> >
> > See above. I don't think this kind of comment is helpful in code
> > review. Your disagreement above suffices, I've responded to it and of
> > course if there is no other way this is fine.
> >
> > But I'd be surprised, and re-looking up a VMA we already have is just
> > horrid. It's not really a nitpick, it's a code quality issue in my view.
> >
> > In any case, let's please try to avoid 'if you are bothered, write a follow
> > up patch' style responses. If you disagree with something just say so, it's
> > fine! :)
>
> I wasn't disagreeing :) Just saying that the find_vma_prev() suggested in a
> previous conversation with Liam fixes the issue (and has been tested a lot
> in this series, on my side) so I was hoping to stick to that and we could
> iterate more on that in the future.
>
> However, after taking a deeper look at the maple tree, I'd like to give the
> find_vma_intersection() option at least a try (with appropriate attribution
> to you).

Cheers appreciate it, sorry to be a pain and nitpicky but I think if that
works as well it could be quite a nice solution.

>
> Apologies if the response style in my previous email came across badly. That
> wasn't my intent and I will try to improve myself.

No worries, text is a sucky medium and likely I misinterpreted you in any
case. We're having a productive discussion which is what matters! :)

>
> [..]
>
> > > > > +		realign_addr(&old_addr, vma, &new_addr, new_vma, PMD_MASK);
> > > > > +	}
> > > > > +
> > > > >    	if (is_vm_hugetlb_page(vma))
> > > > >    		return move_hugetlb_page_tables(vma, new_vma, old_addr,
> > > > >    						new_addr, len);
> > > > > @@ -565,6 +619,13 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> > > > >
> > > > >    	mmu_notifier_invalidate_range_end(&range);
> > > > >
> > > > > +	/*
> > > > > +	 * Prevent negative return values when {old,new}_addr was realigned
> > > > > +	 * but we broke out of the above loop for the first PMD itself.
> > > > > +	 */
> > > > > +	if (len + old_addr < old_end)
> > > > > +		return 0;
> > > > > +
> > > >
> > > > I find this a little iffy, I mean I see that if you align [old,new]_addr to
> > > > PMD, then from then on in you're relying on the fact that the loop is just
> > > > going from old_addr (now aligned) -> old_end and thus has the correct
> > > > length.
> > > >
> > > > Can't we just fix this issue by correcting len? If you take my review above
> > > > which checks len in [maybe_]realign_addr(), you could take that as a
> > > > pointer and equally update that.
> > > >
> > > > Then you can drop this check.
> > >
> > > The drawback of adjusting len is it changes what move_page_tables() users
> > > were previously expecting.
> > >
> > > I think we should look at the return value of move_page_tables() as well,
> > > not just len independently.
> > >
> > > len is what the user requested.
> > >
> > > "len + old_addr - old_end" is how much was actually copied and is the return value.
> > >
> > > If everything was copied, old_addr == old_end and len is unchanged.
> >
> > Ah yeah I see, sorry I missed the fact we're returning a value, that does
> > complicate things...
> >
> > If we retain the hugetlb logic, then we could work around the issue with
> > that instance of len by storing the 'actual length' of the range in
> > a new var actual_len and passing that.
> >
> > If we choose to instead just not do this for hugetlb (I wonder if the
> > hugetlb handling code actually does the equivalent of this since surely
> > these pages have to be handled a PMD at a time?) then we can drop the whole
> > actual_len idea [see below on response to hugetlb thing].
>
> Thanks. Yes, you are right. We should already be good with hugetlb handling
> as it does appear that move_hugetlb_page_tables() does copy by
> huge_page_size(h), so the old_addr should already be PMD-aligned for it to
> be able to do that.

Cool, in which case we can drop the actual_len idea as this is really the
only place we'd need it.

>
> [..]
>
> > > > >    	return len + old_addr - old_end;	/* how much done */
> > > > >    }
> > > > Also I am concerned in the hugetlb case -> len is passed to
> > > > move_hugetlb_page_tables() which is now strictly incorrect, I wonder if
> > > > this could cause an issue?
> > > >
> > > > Correcting len seems the neat way of addressing this.
> > >
> > > That's a good point. I am wondering if we can just change that from:
> > >
> > > 	if (is_vm_hugetlb_page(vma))
> > > 		return move_hugetlb_page_tables(vma, new_vma, old_addr,
> > > 				new_addr, len);
> > >
> > > to:
> > > 	if (is_vm_hugetlb_page(vma))
> > > 		return move_hugetlb_page_tables(vma, new_vma, old_addr,
> > > 				new_addr, old_addr - new_addr);
> > >
> > > Or, another option is to turn it off for hugetlb by just moving:
> > >
> > > 	if (len >= PMD_SIZE - (old_addr & ~PMD_MASK))
> > > 		realign_addr(...);
> > >
> > > to after:
> > >
> > > 	if (is_vm_hugetlb_page(vma))
> > > 		return move_hugetlb_page_tables(...);
> > >
> > > thanks,
> >
> > I think the actual_len solution should sort this right? If not maybe better
> > to be conservative and disable for the hugetlb case (I'm not sure if this
> > would help given you'd need to be PMD aligned anyway right?), so not to
> > hold up the series.
> >
> > If we do decide not to include hugetlb (the endless 'special case' for so
> > much code...) in this then we can drop the actual_len idea altogether.
> >
> > (Yes I realise it's ironic I'm suggesting deferring to a later patch here
> > but there you go ;)
>
> ;-). Considering our discussion above that hugetlb mremap addresses should
> always start at a PMD boundary, maybe I can just add a warning to the if()
> like so to detect any potential issue?
>
> 	if (is_vm_hugetlb_page(vma)) {
> 		WARN_ON_ONCE(old_addr - old_end != len);
> 		return move_hugetlb_page_tables(vma, new_vma, old_addr,
> 						new_addr, len);
>         }
>

Yeah looks good [ack your follow up that it should be old_end - old_addr],
maybe adding a comment explaining why this is a problem here too.

>
> Thank you so much and I learnt a lot from you and others in -mm community.

No worries, thanks very much for this patch series, this is a nice fixup
for a quite stupid failure we were experiencing before with a neat
solution. Your hard work is appreciated!

>
> - Joel
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 1/7] mm/mremap: Optimize the start addresses in move_page_tables()
  2023-06-19 15:55     ` Joel Fernandes
  2023-06-20 11:02       ` Lorenzo Stoakes
@ 2023-06-27 17:56       ` Liam R. Howlett
  2023-06-27 18:02         ` Lorenzo Stoakes
  1 sibling, 1 reply; 18+ messages in thread
From: Liam R. Howlett @ 2023-06-27 17:56 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Lorenzo Stoakes, linux-kernel, Linus Torvalds, linux-kselftest,
	linux-mm, Shuah Khan, Vlastimil Babka, Michal Hocko,
	Kirill A Shutemov, Paul E. McKenney, Suren Baghdasaryan,
	Kalesh Singh, Lokesh Gidra, Vineeth Pillai

* Joel Fernandes <joel@joelfernandes.org> [230619 11:55]:
> Hi Lorenzo,
> Thanks for the review! I replied below:
> 
> On 6/17/23 18:49, Lorenzo Stoakes wrote:
> > On Wed, May 31, 2023 at 10:08:01PM +0000, Joel Fernandes (Google) wrote:
> >> Recently, we see reports [1] of a warning that triggers due to
> >> move_page_tables() doing a downward and overlapping move on a
> >> mutually-aligned offset within a PMD. By mutual alignment, I
> >> mean the source and destination addresses of the mremap are at
> >> the same offset within a PMD.
> >>
> >> This mutual alignment along with the fact that the move is downward is
> >> sufficient to cause a warning related to having an allocated PMD that
> >> does not have PTEs in it.
> >>
> >> This warning will only trigger when there is mutual alignment in the
> >> move operation. A solution, as suggested by Linus Torvalds [2], is to
> >> initiate the copy process at the PMD level whenever such alignment is
> >> present. Implementing this approach will not only prevent the warning
> >> from being triggered, but it will also optimize the operation as this
> >> method should enhance the speed of the copy process whenever there's a
> 
> [...]
> 
> >> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> >> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> >> ---
> >>   mm/mremap.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++++++++
> >>   1 file changed, 61 insertions(+)
> >>
> >> diff --git a/mm/mremap.c b/mm/mremap.c
> >> index 411a85682b58..bf355e4d6bd4 100644
> >> --- a/mm/mremap.c
> >> +++ b/mm/mremap.c
> >> @@ -478,6 +478,51 @@ static bool move_pgt_entry(enum pgt_entry entry, struct
> >>   	return moved;
> >>   }
> >>
> >> +/*
> >> + * A helper to check if a previous mapping exists. Required for
> >> + * move_page_tables() and realign_addr() to determine if a previous mapping
> >> + * exists before we can do realignment optimizations.
> >> + */
> >> +static bool can_align_down(struct vm_area_struct *vma, unsigned long addr_to_align,
> >> +			       unsigned long mask)
> >> +{
> >> +	unsigned long addr_masked = addr_to_align & mask;
> >> +	struct vm_area_struct *prev = NULL, *cur = NULL;
> >> +
> >> +	/*
> >> +	 * If @addr_to_align of either source or destination is not the beginning
> >> +	 * of the corresponding VMA, we can't align down or we will destroy part
> >> +	 * of the current mapping.
> >> +	 */
> >> +	if (vma->vm_start != addr_to_align)
> >> +		return false;
> >
> > See below, I think we can eliminate this check.
> >
> >> +
> >> +	/*
> >> +	 * Find the VMA before @vma to see if it subsumes the masked address.
> >> +	 * The mmap write lock is held here so the lookup is safe.
> >> +	 */
> >> +	cur = find_vma_prev(vma->vm_mm, vma->vm_start, &prev);
> >> +	if (WARN_ON_ONCE(cur != vma))
> >> +		return false;
> >> +
> >> +	return !prev || prev->vm_end <= addr_masked;
> >
> > This is a bit clunky, and I don't think we need the WARN_ON_ONCE() check if
> > we're under the mmap_lock.
> >
> > How about something like:-
> >
> > return find_vma_intersection(vma->mm, addr_masked, vma->vm_start) == NULL;
> >
> > Which explicitly asserts that the range in [addr_masked, vma->vm_start) is
> > empty.
> >
> > But actually, we should be able to go further and replace the previous
> > check with:-
> >
> > return find_vma_intersection(vma->mm, addr_masked, addr_to_align) == NULL;
> >
> > Which will fail if addr_to_align is offset within the VMA.
> 
> Your suggestion would mean that we do a full VMA search starting from the
> root. That would not be a nice thing if say we've 1000s of VMAs?
> 
> Actually Liam told me to use find_vma_prev() because given a VMA, the maple
> tree would not have to work that hard for the common case to find the
> previous VMA. Per conversing with him, there is a chance we may have to go
> one step above in the tree if we hit the edge of a node, but that's not
> supposed to be the common case. In previous code, the previous VMA could
> just be obtained using the "previous VMA" pointer, however that pointer has
> been remove since the maple tree changes and given a VMA, going to the
> previous one using the maple tree is just as fast (as I'm told).

I think there's been a bit of a miscommunication on that..

If you have already found the VMA and are using the maple state, then
it's very little effort to get the next/prev.  Leaf nodes can hold 16
entries/NULL ranges, so the next/prev entry is usually in
the cpu cache already.. if you go up a level in the tree, then you will
have 10 nodes each with 16 entries each, etc, etc..  So the chances of
being on an edge node and having to walk up multiple levels to get to
the prev/next becomes rather rare.. and if you've just walked down, the
nodes on the way up will still be cached.

Here, you're not using the maple state but searching for an address
using find_vma_prev(), but internally, that function does use a maple
state to get you the previous.  So you are looking up the VMA from the
root, but the prev will very likely be in the CPU cache.

Assuming the worst case tree (each VMA has a gap next to it, not really
going to happen as they tend to be grouped together), then we are
looking at a 4 level tree to get to 8,000 VMAs.  5 levels gets you a
minimum 80,000.  I've never seen a tree of height 6 in the wild, but you
can fit 1.6M to 800K in one.
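
(Rough arithmetic behind those numbers, using ~16 slots per leaf and ~10
branches per internal node: a height-4 tree has on the order of
16 * 10^3 = 16,000 leaf slots, and in the worst case each VMA burns two of
them (one for the mapping, one for the gap), hence the ~8,000 figure; every
additional level is roughly another factor of 10.)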

I think the code is fine, but I wanted to clarify what we discussed.

> 
> Considering this, I would keep the code as-is and perhaps you/we could
> consider the replacement with another API in a subsequent patch as it does
> the job for this patch.
> 
> >> +			 unsigned long *new_addr, struct vm_area_struct *new_vma,
> >> +			 unsigned long mask)
> >> +{
> >> +	bool mutually_aligned = (*old_addr & ~mask) == (*new_addr & ~mask);
> >> +
> >> +	if ((*old_addr & ~mask) && mutually_aligned
> >
> > I may be misunderstanding something here, but doesn't the first condition
> > here disallow for offset into PMD == 0? Why?
> 
> Because in such a situation, the alignment is already done and there's
> nothing to align. The patch wants to align down to the PMD and we would not
> want to waste CPU cycles if there's nothing to do.
> 
> >> +	    && can_align_down(old_vma, *old_addr, mask)
> >> +	    && can_align_down(new_vma, *new_addr, mask)) {
> >> +		*old_addr = *old_addr & mask;
> >> +		*new_addr = *new_addr & mask;
> >> +	}
> >> +}
> >> +
> >>   unsigned long move_page_tables(struct vm_area_struct *vma,
> >>   		unsigned long old_addr, struct vm_area_struct *new_vma,
> >>   		unsigned long new_addr, unsigned long len,
> >> @@ -493,6 +538,15 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> >>
> >>   	old_end = old_addr + len;
> >>
> >> +	/*
> >> +	 * If possible, realign addresses to PMD boundary for faster copy.
> >> +	 * Don't align for intra-VMA moves as we may destroy existing mappings.
> >> +	 */
> >> +	if ((vma != new_vma)
> >
> > Nit but these parens aren't needed.
> 
> Sure, I can drop the parens.
> 
> > Also if we're deferring the decision as
> > to whether we realign to this function, why are we doing this check here
> > and not here?
> 
> Hmm, well the function name is realign_addr() so I kept some of the initial
> checks outside of it where we should "obviously" not realign. I could do
> what you're suggesting and change it to try_realign_addr() or something. And
> move those checks in there. That would be a bit better.
> 
> > It feels like it'd be neater to keep all the conditions (including the
> > length one) together in one place.
> >
> >
> >> +		&& (len >= PMD_SIZE - (old_addr & ~PMD_MASK))) {
> 
> Well, yeah maybe. I'll look into it, thanks.
> 
> > You don't mention this condition in the above comment (if we have this
> > altogether as part of the realign function could comment separately there)
> 
> Ok, sounds good -- I will add a comment with some of the explanation above.
> 
> > - so we only go ahead and do this optimisation if the length of the remap
> > is such that the entire of old_addr -> end of its PMD (and thus the same
> > for new_addr) is copied?
> 
> Yes, correct. And in the future that could also be optimized (if say there
> is no subsequent mapping, so we can copy the tail PMD as well, however one
> step at a time and all that.)
> 
> > I may be missing something/being naive here, but can't we just do a similar
> > check to the one done for space _below_ the VMA to see if [end, (end of
> > PMD)) is equally empty?
> 
> We can, but the warning that was triggering does not really need that to be
> silenced. I am happy to do that in a later patch if needed, or you can. ;-)
> But I'd like to keep the risk low since this was itself hard enough to get
> right.
> 
> >> +		realign_addr(&old_addr, vma, &new_addr, new_vma, PMD_MASK);
> >> +	}
> >> +
> >>   	if (is_vm_hugetlb_page(vma))
> >>   		return move_hugetlb_page_tables(vma, new_vma, old_addr,
> >>   						new_addr, len);
> >> @@ -565,6 +619,13 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
> >>
> >>   	mmu_notifier_invalidate_range_end(&range);
> >>
> >> +	/*
> >> +	 * Prevent negative return values when {old,new}_addr was realigned
> >> +	 * but we broke out of the above loop for the first PMD itself.
> >> +	 */
> >> +	if (len + old_addr < old_end)
> >> +		return 0;
> >> +
> >
> > I find this a little iffy, I mean I see that if you align [old,new]_addr to
> > PMD, then from then on in you're relying on the fact that the loop is just
> > going from old_addr (now aligned) -> old_end and thus has the correct
> > length.
> >
> > Can't we just fix this issue by correcting len? If you take my review above
> > which checks len in [maybe_]realign_addr(), you could take that as a
> > pointer and equally update that.
> >
> > Then you can drop this check.
> 
> The drawback of adjusting len is it changes what move_page_tables() users
> were previously expecting.
> 
> I think we should look at the return value of move_page_tables() as well,
> not just len independently.
> 
> len is what the user requested.
> 
> "len + old_addr - old_end" is how much was actually copied and is the return value.
> 
> If everything was copied, old_addr == old_end and len is unchanged.
> 
> The users of move_page_tables(), like move_vma() should not care whether we
> copied a full PMD or not. In fact telling them anything like that may cause
> problems with the interpretation of the return value I think.
> 
> They asked us to copy len, did we copy it? hell yeah.
> 
> Note that after the first loop iteration's PMD copy, old_addr is now at the
> PMD boundary and the functionality of this function is not changed with this
> patch. We end up doing a PMD-copy just like we used to without this patch.
> So this patch does not really change anything from before.
> 
> The following are the cases:
> 
> 1. If we realign and copy, yes we copied a PMD, but really it was to satisfy
> the requested length. In this situation, "len + old_addr - old_end"  is
> accurate and just like before. We copied whatever the user requested. Yes we
> copied a little more, but who cares? We copied into a mapping that does not
> exist anyway. It may be absurd for us to return a len that is greater than
> the requested len IMO.
> 
> 2. If there are no errors (example first PMD copy did not fail), "len +
> old_addr - old_end" is identical to what it was without this patch -- as it
> should be. That's true whether we realigned or not.
> 
> 3. If we realigned and the first PMD copy failed (unlikely error) -- that's
> where there's a problem. We would end up returning a negative value. That's
> what Linus found and suggested to correct. Because (old_end - old_addr) will
> be greater than len in such a situation, however unlikely.
> 
> >>   	return len + old_addr - old_end;	/* how much done */
> >>   }
> > Also I am concerned in the hugetlb case -> len is passed to
> > move_hugetlb_page_tables() which is now strictly incorrect, I wonder if
> > this could cause an issue?
> >
> > Correcting len seems the neat way of addressing this.
> 
> That's a good point. I am wondering if we can just change that from:
> 
> 	if (is_vm_hugetlb_page(vma))
> 		return move_hugetlb_page_tables(vma, new_vma, old_addr,
> 				new_addr, len);
> 
> to:
> 	if (is_vm_hugetlb_page(vma))
> 		return move_hugetlb_page_tables(vma, new_vma, old_addr,
> 				new_addr, old_addr - new_addr);
> 
> Or, another option is to turn it off for hugetlb by just moving:
> 
> 	if (len >= PMD_SIZE - (old_addr & ~PMD_MASK))
> 		realign_addr(...);
> 
> to after:
> 
> 	if (is_vm_hugetlb_page(vma))
> 		return move_hugetlb_page_tables(...);
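> 
> In context, option two would make the top of move_page_tables() look roughly
> like this (just an ordering sketch, reusing the condition and calls quoted
> above):
> 
> 	if (is_vm_hugetlb_page(vma))
> 		return move_hugetlb_page_tables(vma, new_vma, old_addr,
> 						new_addr, len);
> 
> 	/* hugetlb never gets here, so its len is left untouched */
> 	if (len >= PMD_SIZE - (old_addr & ~PMD_MASK))
> 		realign_addr(&old_addr, vma, &new_addr, new_vma, PMD_MASK);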
> 
> thanks,
> 
>  - Joel
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 1/7] mm/mremap: Optimize the start addresses in move_page_tables()
  2023-06-27 17:56       ` Liam R. Howlett
@ 2023-06-27 18:02         ` Lorenzo Stoakes
  2023-06-27 20:28           ` Liam R. Howlett
  0 siblings, 1 reply; 18+ messages in thread
From: Lorenzo Stoakes @ 2023-06-27 18:02 UTC (permalink / raw)
  To: Liam R. Howlett, Joel Fernandes, linux-kernel, Linus Torvalds,
	linux-kselftest, linux-mm, Shuah Khan, Vlastimil Babka,
	Michal Hocko, Kirill A Shutemov, Paul E. McKenney,
	Suren Baghdasaryan, Kalesh Singh, Lokesh Gidra, Vineeth Pillai

On Tue, Jun 27, 2023 at 01:56:09PM -0400, Liam R. Howlett wrote:
[snip]
> > > How about something like:-
> > >
> > > return find_vma_intersection(vma->mm, addr_masked, vma->vm_start) == NULL;
> > >
> > > Which explicitly asserts that the range in [addr_masked, vma->vm_start) is
> > > empty.
> > >
> > > But actually, we should be able to go further and replace the previous
> > > check with:-
> > >
> > > return find_vma_intersection(vma->mm, addr_masked, addr_to_align) == NULL;
> > >
> > > Which will fail if addr_to_align is offset within the VMA.
> >
> > Your suggestion would mean that we do a full VMA search starting from the
> > root. That would not be a nice thing if, say, we have 1000s of VMAs?
> >
> > Actually Liam told me to use find_vma_prev() because, given a VMA, the maple
> > tree would not have to work that hard in the common case to find the
> > previous VMA. From conversing with him, there is a chance we may have to go
> > one step up in the tree if we hit the edge of a node, but that's not
> > supposed to be the common case. In the previous code, the previous VMA could
> > just be obtained using the "previous VMA" pointer; however, that pointer has
> > been removed by the maple tree changes, and given a VMA, going to the
> > previous one using the maple tree is just as fast (as I'm told).
>
> I think there's been a bit of a miscommunication on that..
>
> If you have already found the VMA and are using the maple state, then
> it's very little effort to get the next/prev.  Leaf nodes can hold 16
> entries/NULL ranges, so the next/prev entry is usually in the cpu cache
> already.. if you go up a level in the tree, then you will have 10 nodes,
> each with 16 entries, etc, etc..  So the chances of being on an edge node
> and having to walk up multiple levels to get to the prev/next become
> rather rare.. and if you've just walked down, the nodes on the way up
> will still be cached.
>
> Here, you're not using the maple state but searching for an address
> using find_vma_prev(), but internally, that function does use a maple
> state to get you the previous.  So you are looking up the VMA from the
> root, but the prev will very likely be in the CPU cache.
>
> Assuming the worst case tree (each VMA has a gap next to it, which is not
> really going to happen as they tend to be grouped together), we are
> looking at a 4 level tree to hold 8,000 VMAs.  5 levels gets you a
> minimum of 80,000.  I've never seen a tree of height 6 in the wild, but
> you can fit 800K to 1.6M in one.
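> 
> As a back-of-the-envelope check of those numbers (assuming 16 slots per
> leaf, 10 per internal node, and the worst case of every other slot being a
> NULL gap range), a throwaway userspace calculation:
> 
> 	#include <stdio.h>
> 
> 	int main(void)
> 	{
> 		unsigned long capacity = 16;	/* height 1: a single leaf */
> 
> 		for (int height = 1; height <= 6; height++) {
> 			/* worst case: half the slots hold gaps, not VMAs */
> 			printf("height %d: ~%lu VMAs worst case\n",
> 			       height, capacity / 2);
> 			capacity *= 10;	/* one more internal level */
> 		}
> 		return 0;
> 	}
> 
> which gives ~8,000 at height 4, ~80,000 at height 5 and ~800,000 at height 6
> (twice that without the gaps).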
>
> I think the code is fine, but I wanted to clarify what we discussed.

Would the same apply to find_vma_intersection(), as it equally searches
from the root and allows the code to be made fairly succinct?

I really am not a huge fan of find_vma_prev() searching for a VMA you
already have just to get the previous one... I would at least like to use
vma_prev() on a newly defined vmi, but if find_vma_intersection() is fine
then we can reduce the code to this.
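
Roughly, something like this (a sketch only; the helper name and shape are
just for illustration, and I am spelling the mm pointer as vma->vm_mm):

	/* true if nothing at all is mapped in [addr_masked, addr_to_align) */
	static bool can_realign_down(struct vm_area_struct *vma,
				     unsigned long addr_to_align,
				     unsigned long addr_masked)
	{
		/*
		 * This also rejects addr_to_align being offset within the
		 * VMA, since the VMA itself would then intersect the range.
		 */
		return find_vma_intersection(vma->vm_mm, addr_masked,
					     addr_to_align) == NULL;
	}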

[snip]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 1/7] mm/mremap: Optimize the start addresses in move_page_tables()
  2023-06-27 18:02         ` Lorenzo Stoakes
@ 2023-06-27 20:28           ` Liam R. Howlett
  0 siblings, 0 replies; 18+ messages in thread
From: Liam R. Howlett @ 2023-06-27 20:28 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Joel Fernandes, linux-kernel, Linus Torvalds, linux-kselftest,
	linux-mm, Shuah Khan, Vlastimil Babka, Michal Hocko,
	Kirill A Shutemov, Paul E. McKenney, Suren Baghdasaryan,
	Kalesh Singh, Lokesh Gidra, Vineeth Pillai

* Lorenzo Stoakes <lstoakes@gmail.com> [230627 14:02]:
> On Tue, Jun 27, 2023 at 01:56:09PM -0400, Liam R. Howlett wrote:
> [snip]
> > > > How about something like:-
> > > >
> > > > return find_vma_intersection(vma->mm, addr_masked, vma->vm_start) == NULL;
> > > >
> > > > Which explicitly asserts that the range in [addr_masked, vma->vm_start) is
> > > > empty.
> > > >
> > > > But actually, we should be able to go further and replace the previous
> > > > check with:-
> > > >
> > > > return find_vma_intersection(vma->mm, addr_masked, addr_to_align) == NULL;
> > > >
> > > > Which will fail if addr_to_align is offset within the VMA.
> > >
> > > Your suggestion would mean that we do a full VMA search starting from the
> > > root. That would not be a nice thing if, say, we have 1000s of VMAs?
> > >
> > > Actually Liam told me to use find_vma_prev() because, given a VMA, the maple
> > > tree would not have to work that hard in the common case to find the
> > > previous VMA. From conversing with him, there is a chance we may have to go
> > > one step up in the tree if we hit the edge of a node, but that's not
> > > supposed to be the common case. In the previous code, the previous VMA could
> > > just be obtained using the "previous VMA" pointer; however, that pointer has
> > > been removed by the maple tree changes, and given a VMA, going to the
> > > previous one using the maple tree is just as fast (as I'm told).
> >
> > I think there's been a bit of a miscommunication on that..
> >
> > If you have already found the VMA and are using the maple state, then
> > it's very little effort to get the next/prev.  Leaf nodes can hold 16
> > entries/NULL ranges, so the next/prev entry is usually in the cpu cache
> > already.. if you go up a level in the tree, then you will have 10 nodes,
> > each with 16 entries, etc, etc..  So the chances of being on an edge node
> > and having to walk up multiple levels to get to the prev/next become
> > rather rare.. and if you've just walked down, the nodes on the way up
> > will still be cached.
> >
> > Here, you're not using the maple state but searching for an address
> > using find_vma_prev(), but internally, that function does use a maple
> > state to get you the previous.  So you are looking up the VMA from the
> > root, but the prev will very likely be in the CPU cache.
> >
> > Assuming the worst case tree (each VMA has a gap next to it, which is not
> > really going to happen as they tend to be grouped together), we are
> > looking at a 4 level tree to hold 8,000 VMAs.  5 levels gets you a
> > minimum of 80,000.  I've never seen a tree of height 6 in the wild, but
> > you can fit 800K to 1.6M in one.
> >
> > I think the code is fine, but I wanted to clarify what we discussed.
> 
> Would the same apply to find_vma_intersection(), as it equally searches
> from the root and allows the code to be made fairly succinct?

I think so.

> 
> I really am not a huge fan of find_vma_prev() searching for a VMA you
> already have just to get the previous one... I would at least like to use
> vma_prev() on a newly defined vmi, but if find_vma_intersection() is fine
> then we can reduce the code to this.

find_vma_intersection() will work as well.

> [snip]

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2023-06-27 20:29 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
2023-05-31 22:08 [PATCH v4 0/7] Optimize mremap during mutual alignment within PMD Joel Fernandes (Google)
2023-05-31 22:08 ` [PATCH v4 1/7] mm/mremap: Optimize the start addresses in move_page_tables() Joel Fernandes (Google)
2023-06-17 22:49   ` Lorenzo Stoakes
2023-06-19 15:55     ` Joel Fernandes
2023-06-20 11:02       ` Lorenzo Stoakes
2023-06-20 21:16         ` Joel Fernandes
2023-06-20 21:21           ` Joel Fernandes
2023-06-20 22:00           ` Lorenzo Stoakes
2023-06-27 17:56       ` Liam R. Howlett
2023-06-27 18:02         ` Lorenzo Stoakes
2023-06-27 20:28           ` Liam R. Howlett
2023-05-31 22:08 ` [PATCH v4 2/7] mm/mremap: Allow moves within the same VMA for stack Joel Fernandes (Google)
2023-05-31 22:08 ` [PATCH v4 3/7] selftests: mm: Fix failure case when new remap region was not found Joel Fernandes (Google)
2023-05-31 22:08 ` [PATCH v4 4/7] selftests: mm: Add a test for mutually aligned moves > PMD size Joel Fernandes (Google)
2023-05-31 22:08 ` [PATCH v4 5/7] selftests: mm: Add a test for remapping to area immediately after existing mapping Joel Fernandes (Google)
2023-05-31 22:08 ` [PATCH v4 6/7] selftests: mm: Add a test for remapping within a range Joel Fernandes (Google)
2023-05-31 22:08 ` [PATCH v4 7/7] selftests: mm: Add a test for moving from an offset from start of mapping Joel Fernandes (Google)
2023-05-31 23:33 ` [PATCH v4 0/7] Optimize mremap during mutual alignment within PMD Linus Torvalds
