* [PATCH v6 0/7] Optimize mremap during mutual alignment within PMD
@ 2023-09-03 15:13 Joel Fernandes (Google)
  2023-09-03 15:13 ` [PATCH v6 1/7] mm/mremap: Optimize the start addresses in move_page_tables() Joel Fernandes (Google)
                   ` (7 more replies)
  0 siblings, 8 replies; 16+ messages in thread
From: Joel Fernandes (Google) @ 2023-09-03 15:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google),
	linux-kselftest, linux-mm, Shuah Khan, Vlastimil Babka,
	Michal Hocko, Linus Torvalds, Lorenzo Stoakes, Kirill A Shutemov,
	Liam R. Howlett, Paul E. McKenney, Suren Baghdasaryan,
	Kalesh Singh, Lokesh Gidra

Hello!

Here is v6 of the mremap start address optimization / fix for the exec warning.
This should hopefully be the final version; only 2/7 and 6/7 still need a tag.
Thanks a lot to Lorenzo and Linus for the detailed reviews.

Description of patches
======================
These patches optimize the start addresses in move_page_tables() and test the
changes. They address a warning [1] that occurs due to a downward, overlapping
move on a mutually-aligned offset within a PMD during exec. By initiating the
copy process at the PMD level when such alignment is present, we can prevent
this warning and speed up the copying process at the same time. Linus Torvalds
suggested this idea. Check the individual patches for more details.
[1] https://lore.kernel.org/all/ZB2GTBD%2FLWTrkOiO@dhcp22.suse.cz/

History of patches:
v5->v6:
1. Reworked the stack case a bit more and tested it (should be final now).
2. Other small nits.

v4->v5:
1. Rebased on mainline.
2. Several improvement suggestions from Lorenzo.

v3->v4:
1. Take care with moves purely within a VMA; in other words, this check
   in can_align_down():
    if (vma->vm_start != addr_masked)
            return false;

    As an example of why this is needed:
    Consider the following range which is 2MB aligned and is
    a part of a larger 10MB range which is not shown. Each
    character below is 256KB, making the source and destination
    2MB each. The lower case letters are moved (s to d) and the
    upper case letters are not moved.

    |DDDDddddSSSSssss|

    If we align down 'ssss' to start from the 'SSSS', we will end up destroying
    SSSS. The above if statement prevents that and I verified it.

    I also added a test for this in the last patch.

2. Handle the stack case separately. We do not care about #1 for stack movement
   because the 'SSSS' does not matter during this move. Further we need to do this
   to prevent the stack move warning.

    if (!for_stack && vma->vm_start <= addr_masked)
            return false;

v2->v3:
1. Masked address was stored in int, fixed it to unsigned long to avoid truncation.
2. We now handle moves happening purely within a VMA; a new test was added to cover this.
3. More code comments.

v1->v2:
1. Trigger the optimization for mremaps smaller than a PMD. I tested by tracing
that it works correctly.

2. Fix an issue, found by Linus, with a bogus return value if we broke out of
the move loop at the first PMD itself.

v1: Initial RFC.
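
The bogus return value mentioned in the v1->v2 notes comes down to unsigned
arithmetic. The sketch below is a userspace model, not kernel code: moved_len()
is a hypothetical helper that mirrors the final return expression of
move_page_tables() in patch 1/7, where old_addr may have been lowered by the
realignment while old_end is still derived from the original start.

```c
#include <assert.h>

/*
 * Hypothetical userspace model of the value returned at the end of
 * move_page_tables(). old_end is the original (unaligned) start plus len;
 * old_addr is where the copy loop stopped.
 */
static unsigned long moved_len(unsigned long len, unsigned long old_addr,
			       unsigned long old_end)
{
	assert(old_addr <= old_end);

	/* Realigned start and the loop stopped at once: nothing was moved. */
	if (len + old_addr < old_end)
		return 0;

	return len + old_addr - old_end;	/* how much done */
}
```

With an original start of 0x2900000 realigned down to 0x2800000 and a length of
0x500000, stopping before any extent is moved yields 0 here; without the guard
the subtraction would wrap around to a huge unsigned value.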

Joel Fernandes (1):
selftests: mm: Add a test for moving from an offset from start of
mapping

Joel Fernandes (Google) (6):
mm/mremap: Optimize the start addresses in move_page_tables()
mm/mremap: Allow moves within the same VMA for stack moves
selftests: mm: Fix failure case when new remap region was not found
selftests: mm: Add a test for mutually aligned moves > PMD size
selftests: mm: Add a test for remapping to area immediately after
existing mapping
selftests: mm: Add a test for remapping within a range

fs/exec.c                                |   2 +-
include/linux/mm.h                       |   2 +-
mm/mremap.c                              |  73 +++++-
tools/testing/selftests/mm/mremap_test.c | 301 +++++++++++++++++++----
4 files changed, 329 insertions(+), 49 deletions(-)

--
2.42.0.283.g2d96d420d3-goog



* [PATCH v6 1/7] mm/mremap: Optimize the start addresses in move_page_tables()
  2023-09-03 15:13 [PATCH v6 0/7] Optimize mremap during mutual alignment within PMD Joel Fernandes (Google)
@ 2023-09-03 15:13 ` Joel Fernandes (Google)
  2023-09-03 16:07   ` [lkp] [+134 bytes kernel size regression] [i386-tinyconfig] [8d22a4573c] " kernel test robot
  2023-09-08 13:07   ` [PATCH v6 1/7] " Michal Hocko
  2023-09-03 15:13 ` [PATCH v6 2/7] mm/mremap: Allow moves within the same VMA for stack moves Joel Fernandes (Google)
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 16+ messages in thread
From: Joel Fernandes (Google) @ 2023-09-03 15:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google),
	Lorenzo Stoakes, Linus Torvalds, linux-kselftest, linux-mm,
	Shuah Khan, Vlastimil Babka, Michal Hocko, Kirill A Shutemov,
	Liam R. Howlett, Paul E. McKenney, Suren Baghdasaryan,
	Kalesh Singh, Lokesh Gidra

Recently, we have seen reports [1] of a warning that triggers due to
move_page_tables() doing a downward and overlapping move on a
mutually-aligned offset within a PMD. By mutual alignment, I
mean the source and destination addresses of the mremap are at
the same offset within a PMD.

This mutual alignment along with the fact that the move is downward is
sufficient to cause a warning related to having an allocated PMD that
does not have PTEs in it.

This warning will only trigger when there is mutual alignment in the
move operation. A solution, as suggested by Linus Torvalds [2], is to
initiate the copy process at the PMD level whenever such alignment is
present. Implementing this approach will not only prevent the warning
from being triggered, but it will also optimize the operation as this
method should enhance the speed of the copy process whenever there's a
possibility to start copying at the PMD level.
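
As a rough illustration of "mutual alignment", here is a minimal userspace
sketch. The helper names (pmd_offset_in_block(), realign_candidate(),
pmd_align_down()) are made up for this example and are not part of the patch;
the PMD size is passed in as a parameter and is assumed to be a power of two.

```c
#include <assert.h>
#include <stdbool.h>

/* Offset of addr within its PMD-sized block; pmd_size must be a power of two. */
static unsigned long pmd_offset_in_block(unsigned long addr,
					 unsigned long pmd_size)
{
	assert(pmd_size && (pmd_size & (pmd_size - 1)) == 0);
	return addr & (pmd_size - 1);
}

/*
 * True when both addresses share the same non-zero offset within a PMD,
 * i.e. they are mutually aligned and not already PMD aligned.
 */
static bool realign_candidate(unsigned long old_addr, unsigned long new_addr,
			      unsigned long pmd_size)
{
	unsigned long off = pmd_offset_in_block(old_addr, pmd_size);

	return off != 0 && off == pmd_offset_in_block(new_addr, pmd_size);
}

/* Round addr down to the start of its PMD-sized block. */
static unsigned long pmd_align_down(unsigned long addr, unsigned long pmd_size)
{
	return addr - pmd_offset_in_block(addr, pmd_size);
}
```

For 2MB PMDs, 0x2900000 and 0x2f00000 both sit 1MB into their PMD, so they
would be candidates and would be rounded down to 0x2800000 and 0x2e00000
respectively.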

Some more points:
a. The optimization can be done only when neither the source nor the
destination of the mremap has anything mapped below it up to a PMD
boundary. I add support to detect that.

b. Point (a) is not a problem for the call to move_page_tables() from exec.c
as nothing is expected to be mapped below the source. However, for
non-overlapping, mutually aligned moves as triggered by mremap(2), I
added support for checking such cases.

c. I currently optimize only for PMD moves. In the future I/we can build
on this work and do PUD moves as well if there is a need for this, but I
want to take it one step at a time.

d. We need to be careful about mremap of ranges within the VMA itself.
For this purpose, I added checks to determine if the address after
alignment falls within the VMA itself.
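
The checks described in points a and d can be modelled in userspace with a
plain array of address ranges. This is only a sketch under stated assumptions:
struct range, range_is_mapped() and can_align_down_model() are hypothetical
stand-ins for a VMA, find_vma_intersection() and can_align_down(), and the
ranges are assumed not to overlap each other.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a VMA: a half-open range [start, end). */
struct range {
	unsigned long start;
	unsigned long end;
};

/* True if any mapping in maps[] intersects [lo, hi). */
static bool range_is_mapped(const struct range *maps, size_t n,
			    unsigned long lo, unsigned long hi)
{
	for (size_t i = 0; i < n; i++)
		if (maps[i].start < hi && lo < maps[i].end)
			return true;
	return false;
}

/*
 * Model of the two checks: addr must be the start of its own mapping, and
 * nothing may be mapped between the PMD boundary below addr and addr.
 */
static bool can_align_down_model(const struct range *maps, size_t n,
				 const struct range *own, unsigned long addr,
				 unsigned long pmd_size)
{
	unsigned long addr_masked;

	assert(pmd_size && (pmd_size & (pmd_size - 1)) == 0);
	addr_masked = addr & ~(pmd_size - 1);

	/* Aligning down from the middle of a mapping would clobber it. */
	if (own->start != addr)
		return false;

	return !range_is_mapped(maps, n, addr_masked, own->start);
}
```

A mapping that starts 1MB into an otherwise empty PMD passes both checks, while
one that has a neighbour directly below it in the same PMD does not.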

[1] https://lore.kernel.org/all/ZB2GTBD%2FLWTrkOiO@dhcp22.suse.cz/
[2] https://lore.kernel.org/all/CAHk-=whd7msp8reJPfeGNyt0LiySMT0egExx3TVZSX3Ok6X=9g@mail.gmail.com/

Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 mm/mremap.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 62 insertions(+)

diff --git a/mm/mremap.c b/mm/mremap.c
index 11e06e4ab33b..1011326b7b80 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -489,6 +489,53 @@ static bool move_pgt_entry(enum pgt_entry entry, struct vm_area_struct *vma,
 	return moved;
 }
 
+/*
+ * A helper to check if a previous mapping exists. Required for
+ * move_page_tables() and realign_addr() to determine if a previous mapping
+ * exists before we can do realignment optimizations.
+ */
+static bool can_align_down(struct vm_area_struct *vma, unsigned long addr_to_align,
+			       unsigned long mask)
+{
+	unsigned long addr_masked = addr_to_align & mask;
+
+	/*
+	 * If @addr_to_align of either source or destination is not the beginning
+	 * of the corresponding VMA, we can't align down or we will destroy part
+	 * of the current mapping.
+	 */
+	if (vma->vm_start != addr_to_align)
+		return false;
+
+	/*
+	 * Make sure the realignment doesn't cause the address to fall on an
+	 * existing mapping.
+	 */
+	return find_vma_intersection(vma->vm_mm, addr_masked, vma->vm_start) == NULL;
+}
+
+/* Opportunistically realign to specified boundary for faster copy. */
+static void try_realign_addr(unsigned long *old_addr, struct vm_area_struct *old_vma,
+			     unsigned long *new_addr, struct vm_area_struct *new_vma,
+			     unsigned long mask)
+{
+	/* Skip if the addresses are already aligned. */
+	if ((*old_addr & ~mask) == 0)
+		return;
+
+	/* Only realign if the new and old addresses are mutually aligned. */
+	if ((*old_addr & ~mask) != (*new_addr & ~mask))
+		return;
+
+	/* Ensure realignment doesn't cause overlap with existing mappings. */
+	if (!can_align_down(old_vma, *old_addr, mask) ||
+	    !can_align_down(new_vma, *new_addr, mask))
+		return;
+
+	*old_addr = *old_addr & mask;
+	*new_addr = *new_addr & mask;
+}
+
 unsigned long move_page_tables(struct vm_area_struct *vma,
 		unsigned long old_addr, struct vm_area_struct *new_vma,
 		unsigned long new_addr, unsigned long len,
@@ -508,6 +555,14 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 		return move_hugetlb_page_tables(vma, new_vma, old_addr,
 						new_addr, len);
 
+	/*
+	 * If possible, realign addresses to PMD boundary for faster copy.
+	 * Only realign if the mremap copying hits a PMD boundary.
+	 */
+	if ((vma != new_vma)
+		&& (len >= PMD_SIZE - (old_addr & ~PMD_MASK)))
+		try_realign_addr(&old_addr, vma, &new_addr, new_vma, PMD_MASK);
+
 	flush_cache_range(vma, old_addr, old_end);
 	mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma->vm_mm,
 				old_addr, old_end);
@@ -577,6 +632,13 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 
 	mmu_notifier_invalidate_range_end(&range);
 
+	/*
+	 * Prevent negative return values when {old,new}_addr was realigned
+	 * but we broke out of the above loop for the first PMD itself.
+	 */
+	if (len + old_addr < old_end)
+		return 0;
+
 	return len + old_addr - old_end;	/* how much done */
 }
 
-- 
2.42.0.283.g2d96d420d3-goog



* [PATCH v6 2/7] mm/mremap: Allow moves within the same VMA for stack moves
  2023-09-03 15:13 [PATCH v6 0/7] Optimize mremap during mutual alignment within PMD Joel Fernandes (Google)
  2023-09-03 15:13 ` [PATCH v6 1/7] mm/mremap: Optimize the start addresses in move_page_tables() Joel Fernandes (Google)
@ 2023-09-03 15:13 ` Joel Fernandes (Google)
  2023-09-05  6:47   ` Lorenzo Stoakes
  2023-09-08 13:11   ` Michal Hocko
  2023-09-03 15:13 ` [PATCH v6 3/7] selftests: mm: Fix failure case when new remap region was not found Joel Fernandes (Google)
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 16+ messages in thread
From: Joel Fernandes (Google) @ 2023-09-03 15:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google),
	linux-kselftest, linux-mm, Shuah Khan, Vlastimil Babka,
	Michal Hocko, Linus Torvalds, Lorenzo Stoakes, Kirill A Shutemov,
	Liam R. Howlett, Paul E. McKenney, Suren Baghdasaryan,
	Kalesh Singh, Lokesh Gidra

For the stack move happening in shift_arg_pages(), the move happens
within the same VMA, which spans the old and new ranges.

In case the aligned address happens to fall within that VMA, allow such
moves and don't abort the mremap alignment optimization.

In the regular non-stack mremap case, we cannot allow any such moves as
they will end up destroying some part of the mapping (either the source of
the move, or part of the existing mapping). So allow this only for stack
moves.
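
A tiny userspace sketch of the stack-only exception follows. The helper
stack_align_within_vma() is hypothetical and models only the early "return
true" path; when it returns false, the real code still falls through to the
check for neighbouring mappings rather than refusing outright.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of the stack-only shortcut: if the PMD-aligned address
 * still lies inside the VMA that spans both source and destination,
 * realignment is permitted without looking at neighbouring mappings.
 */
static bool stack_align_within_vma(unsigned long vm_start, unsigned long addr,
				   unsigned long pmd_size)
{
	assert(pmd_size && (pmd_size & (pmd_size - 1)) == 0);
	return (addr & ~(pmd_size - 1)) >= vm_start;
}
```

For example, with 2MB PMDs an address of 0x2900000 aligns down to 0x2800000,
which takes the shortcut if the VMA starts at or below 0x2800000.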

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 fs/exec.c          |  2 +-
 include/linux/mm.h |  2 +-
 mm/mremap.c        | 33 +++++++++++++++++++--------------
 3 files changed, 21 insertions(+), 16 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 1a827d55ba94..244925307958 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -712,7 +712,7 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
 	 * process cleanup to remove whatever mess we made.
 	 */
 	if (length != move_page_tables(vma, old_start,
-				       vma, new_start, length, false))
+				       vma, new_start, length, false, true))
 		return -ENOMEM;
 
 	lru_add_drain();
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 406ab9ea818f..e635d1fc73b6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2458,7 +2458,7 @@ int get_cmdline(struct task_struct *task, char *buffer, int buflen);
 extern unsigned long move_page_tables(struct vm_area_struct *vma,
 		unsigned long old_addr, struct vm_area_struct *new_vma,
 		unsigned long new_addr, unsigned long len,
-		bool need_rmap_locks);
+		bool need_rmap_locks, bool for_stack);
 
 /*
  * Flags used by change_protection().  For now we make it a bitmap so
diff --git a/mm/mremap.c b/mm/mremap.c
index 1011326b7b80..2b51f8b7cad8 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -490,12 +490,13 @@ static bool move_pgt_entry(enum pgt_entry entry, struct vm_area_struct *vma,
 }
 
 /*
- * A helper to check if a previous mapping exists. Required for
- * move_page_tables() and realign_addr() to determine if a previous mapping
- * exists before we can do realignment optimizations.
+ * A helper to check if aligning down is OK. The aligned address should fall
+ * on *no mapping*. For the stack moving down, that's a special move within
+ * the VMA that is created to span the source and destination of the move,
+ * so we make an exception for it.
  */
 static bool can_align_down(struct vm_area_struct *vma, unsigned long addr_to_align,
-			       unsigned long mask)
+			    unsigned long mask, bool for_stack)
 {
 	unsigned long addr_masked = addr_to_align & mask;
 
@@ -504,9 +505,13 @@ static bool can_align_down(struct vm_area_struct *vma, unsigned long addr_to_ali
 	 * of the corresponding VMA, we can't align down or we will destroy part
 	 * of the current mapping.
 	 */
-	if (vma->vm_start != addr_to_align)
+	if (!for_stack && vma->vm_start != addr_to_align)
 		return false;
 
+	/* In the stack case we explicitly permit in-VMA alignment. */
+	if (for_stack && addr_masked >= vma->vm_start)
+		return true;
+
 	/*
 	 * Make sure the realignment doesn't cause the address to fall on an
 	 * existing mapping.
@@ -517,7 +522,7 @@ static bool can_align_down(struct vm_area_struct *vma, unsigned long addr_to_ali
 /* Opportunistically realign to specified boundary for faster copy. */
 static void try_realign_addr(unsigned long *old_addr, struct vm_area_struct *old_vma,
 			     unsigned long *new_addr, struct vm_area_struct *new_vma,
-			     unsigned long mask)
+			     unsigned long mask, bool for_stack)
 {
 	/* Skip if the addresses are already aligned. */
 	if ((*old_addr & ~mask) == 0)
@@ -528,8 +533,8 @@ static void try_realign_addr(unsigned long *old_addr, struct vm_area_struct *old
 		return;
 
 	/* Ensure realignment doesn't cause overlap with existing mappings. */
-	if (!can_align_down(old_vma, *old_addr, mask) ||
-	    !can_align_down(new_vma, *new_addr, mask))
+	if (!can_align_down(old_vma, *old_addr, mask, for_stack) ||
+	    !can_align_down(new_vma, *new_addr, mask, for_stack))
 		return;
 
 	*old_addr = *old_addr & mask;
@@ -539,7 +544,7 @@ static void try_realign_addr(unsigned long *old_addr, struct vm_area_struct *old
 unsigned long move_page_tables(struct vm_area_struct *vma,
 		unsigned long old_addr, struct vm_area_struct *new_vma,
 		unsigned long new_addr, unsigned long len,
-		bool need_rmap_locks)
+		bool need_rmap_locks, bool for_stack)
 {
 	unsigned long extent, old_end;
 	struct mmu_notifier_range range;
@@ -559,9 +564,9 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 	 * If possible, realign addresses to PMD boundary for faster copy.
 	 * Only realign if the mremap copying hits a PMD boundary.
 	 */
-	if ((vma != new_vma)
-		&& (len >= PMD_SIZE - (old_addr & ~PMD_MASK)))
-		try_realign_addr(&old_addr, vma, &new_addr, new_vma, PMD_MASK);
+	if (len >= PMD_SIZE - (old_addr & ~PMD_MASK))
+		try_realign_addr(&old_addr, vma, &new_addr, new_vma, PMD_MASK,
+				 for_stack);
 
 	flush_cache_range(vma, old_addr, old_end);
 	mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma->vm_mm,
@@ -708,7 +713,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 	}
 
 	moved_len = move_page_tables(vma, old_addr, new_vma, new_addr, old_len,
-				     need_rmap_locks);
+				     need_rmap_locks, false);
 	if (moved_len < old_len) {
 		err = -ENOMEM;
 	} else if (vma->vm_ops && vma->vm_ops->mremap) {
@@ -722,7 +727,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
 		 * and then proceed to unmap new area instead of old.
 		 */
 		move_page_tables(new_vma, new_addr, vma, old_addr, moved_len,
-				 true);
+				 true, false);
 		vma = new_vma;
 		old_len = new_len;
 		old_addr = new_addr;
-- 
2.42.0.283.g2d96d420d3-goog



* [PATCH v6 3/7] selftests: mm: Fix failure case when new remap region was not found
  2023-09-03 15:13 [PATCH v6 0/7] Optimize mremap during mutual alignment within PMD Joel Fernandes (Google)
  2023-09-03 15:13 ` [PATCH v6 1/7] mm/mremap: Optimize the start addresses in move_page_tables() Joel Fernandes (Google)
  2023-09-03 15:13 ` [PATCH v6 2/7] mm/mremap: Allow moves within the same VMA for stack moves Joel Fernandes (Google)
@ 2023-09-03 15:13 ` Joel Fernandes (Google)
  2023-09-03 15:13 ` [PATCH v6 4/7] selftests: mm: Add a test for mutually aligned moves > PMD size Joel Fernandes (Google)
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Joel Fernandes (Google) @ 2023-09-03 15:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google),
	Lorenzo Stoakes, linux-kselftest, linux-mm, Shuah Khan,
	Vlastimil Babka, Michal Hocko, Linus Torvalds, Kirill A Shutemov,
	Liam R. Howlett, Paul E. McKenney, Suren Baghdasaryan,
	Kalesh Singh, Lokesh Gidra

When a valid remap region could not be found, the source mapping is not
cleaned up. Fix the goto statement so that the cleanup happens.
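
The underlying pattern is the usual goto cleanup ladder. The sketch below is a
generic userspace illustration, not the selftest code: remap_like(), buf_get()
and buf_put() are hypothetical, and a counter stands in for the leaked source
mapping.

```c
#include <assert.h>
#include <stdlib.h>

/* Counts buffers that are currently allocated, to make leaks visible. */
static int live_buffers;

static void *buf_get(size_t n)
{
	void *p = malloc(n);

	if (p)
		live_buffers++;
	return p;
}

static void buf_put(void *p)
{
	assert(p && live_buffers > 0);
	free(p);
	live_buffers--;
}

/*
 * Hypothetical two-resource routine. Each label releases what was acquired
 * before the failure point, so a failure after src is allocated must jump
 * to clean_up_src; jumping straight to out would leak src.
 */
static int remap_like(int fail_to_find_dest)
{
	int ret = 0;
	void *src, *dest;

	src = buf_get(64);
	if (!src) {
		ret = -1;
		goto out;
	}

	if (fail_to_find_dest) {
		ret = -1;
		goto clean_up_src;	/* not "goto out": src is live here */
	}

	dest = buf_get(64);
	if (!dest) {
		ret = -1;
		goto clean_up_src;
	}

	buf_put(dest);
clean_up_src:
	buf_put(src);
out:
	return ret;
}
```

After either the failing or the succeeding path, live_buffers is back to zero,
which is the property the one-line fix restores for the source mapping.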

Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 tools/testing/selftests/mm/mremap_test.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/mm/mremap_test.c b/tools/testing/selftests/mm/mremap_test.c
index 5c3773de9f0f..6822d657f589 100644
--- a/tools/testing/selftests/mm/mremap_test.c
+++ b/tools/testing/selftests/mm/mremap_test.c
@@ -316,7 +316,7 @@ static long long remap_region(struct config c, unsigned int threshold_mb,
 		if (addr + c.dest_alignment < addr) {
 			ksft_print_msg("Couldn't find a valid region to remap to\n");
 			ret = -1;
-			goto out;
+			goto clean_up_src;
 		}
 		addr += c.dest_alignment;
 	}
-- 
2.42.0.283.g2d96d420d3-goog



* [PATCH v6 4/7] selftests: mm: Add a test for mutually aligned moves > PMD size
  2023-09-03 15:13 [PATCH v6 0/7] Optimize mremap during mutual alignment within PMD Joel Fernandes (Google)
                   ` (2 preceding siblings ...)
  2023-09-03 15:13 ` [PATCH v6 3/7] selftests: mm: Fix failure case when new remap region was not found Joel Fernandes (Google)
@ 2023-09-03 15:13 ` Joel Fernandes (Google)
  2023-09-03 15:13 ` [PATCH v6 5/7] selftests: mm: Add a test for remapping to area immediately after existing mapping Joel Fernandes (Google)
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Joel Fernandes (Google) @ 2023-09-03 15:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google),
	Lorenzo Stoakes, linux-kselftest, linux-mm, Shuah Khan,
	Vlastimil Babka, Michal Hocko, Linus Torvalds, Kirill A Shutemov,
	Liam R. Howlett, Paul E. McKenney, Suren Baghdasaryan,
	Kalesh Singh, Lokesh Gidra

This patch adds a test case to check that the PMD-alignment optimization
takes effect.

I add support to make sure there is some room before the source mapping.
Otherwise, the optimization to trigger a PMD-aligned move will be disabled,
as the kernel will detect that a mapping before the source exists and
such optimization becomes impossible.
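
Besides free space below the source, patch 1/7 only attempts the realignment
when the move reaches the next PMD boundary. The sketch below models that
condition in userspace; move_hits_pmd_boundary() and MODEL_PMD_SIZE are
hypothetical names, and a 2MB PMD is assumed, as in the selftest.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PMD_SIZE (2UL << 20)	/* assumption: 2MB PMDs */

/* True if a move of len bytes starting at addr reaches the next PMD boundary. */
static bool move_hits_pmd_boundary(unsigned long addr, unsigned long len)
{
	unsigned long off = addr & (MODEL_PMD_SIZE - 1);

	assert(len > 0);
	return len >= MODEL_PMD_SIZE - off;
}
```

A 5MB move from a source that sits 1MB into its PMD satisfies the condition
(5MB >= 1MB), which is why the new test case uses that size and alignment.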

Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 tools/testing/selftests/mm/mremap_test.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/mm/mremap_test.c b/tools/testing/selftests/mm/mremap_test.c
index 6822d657f589..6304eb0947a3 100644
--- a/tools/testing/selftests/mm/mremap_test.c
+++ b/tools/testing/selftests/mm/mremap_test.c
@@ -44,6 +44,7 @@ enum {
 	_1MB = 1ULL << 20,
 	_2MB = 2ULL << 20,
 	_4MB = 4ULL << 20,
+	_5MB = 5ULL << 20,
 	_1GB = 1ULL << 30,
 	_2GB = 2ULL << 30,
 	PMD = _2MB,
@@ -235,6 +236,11 @@ static void *get_source_mapping(struct config c)
 	unsigned long long mmap_min_addr;
 
 	mmap_min_addr = get_mmap_min_addr();
+	/*
+	 * For some tests, we need to not have any mappings below the
+	 * source mapping. Add some headroom to mmap_min_addr for this.
+	 */
+	mmap_min_addr += 10 * _4MB;
 
 retry:
 	addr += c.src_alignment;
@@ -434,7 +440,7 @@ static int parse_args(int argc, char **argv, unsigned int *threshold_mb,
 	return 0;
 }
 
-#define MAX_TEST 13
+#define MAX_TEST 14
 #define MAX_PERF_TEST 3
 int main(int argc, char **argv)
 {
@@ -500,6 +506,10 @@ int main(int argc, char **argv)
 	test_cases[12] = MAKE_TEST(PUD, PUD, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS,
 				   "2GB mremap - Source PUD-aligned, Destination PUD-aligned");
 
+	/* Src and Dest addr 1MB aligned. 5MB mremap. */
+	test_cases[13] = MAKE_TEST(_1MB, _1MB, _5MB, NON_OVERLAPPING, EXPECT_SUCCESS,
+				  "5MB mremap - Source 1MB-aligned, Destination 1MB-aligned");
+
 	perf_test_cases[0] =  MAKE_TEST(page_size, page_size, _1GB, NON_OVERLAPPING, EXPECT_SUCCESS,
 					"1GB mremap - Source PTE-aligned, Destination PTE-aligned");
 	/*
-- 
2.42.0.283.g2d96d420d3-goog



* [PATCH v6 5/7] selftests: mm: Add a test for remapping to area immediately after existing mapping
  2023-09-03 15:13 [PATCH v6 0/7] Optimize mremap during mutual alignment within PMD Joel Fernandes (Google)
                   ` (3 preceding siblings ...)
  2023-09-03 15:13 ` [PATCH v6 4/7] selftests: mm: Add a test for mutually aligned moves > PMD size Joel Fernandes (Google)
@ 2023-09-03 15:13 ` Joel Fernandes (Google)
  2023-09-03 15:13 ` [PATCH v6 6/7] selftests: mm: Add a test for remapping within a range Joel Fernandes (Google)
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 16+ messages in thread
From: Joel Fernandes (Google) @ 2023-09-03 15:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google),
	Lorenzo Stoakes, linux-kselftest, linux-mm, Shuah Khan,
	Vlastimil Babka, Michal Hocko, Linus Torvalds, Kirill A Shutemov,
	Liam R. Howlett, Paul E. McKenney, Suren Baghdasaryan,
	Kalesh Singh, Lokesh Gidra

This patch adds support for verifying that we correctly handle the
situation where something is already mapped before the destination of the remap.

Any realignment of the destination address followed by a PMD copy would
destroy that existing mapping. In such cases, we need to avoid doing the
optimization.

To test this, we map an area called the preamble before the remap
region. Then we verify after the mremap operation that this region did not get
corrupted.
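
The corruption check relies on replaying the same pseudo-random byte sequence
from a fixed seed. A minimal standalone sketch of that idea is shown here;
fill_pattern() and first_mismatch() are hypothetical helpers, not functions
from mremap_test.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Fill buf with the pseudo-random byte sequence derived from seed. */
static void fill_pattern(char *buf, size_t n, unsigned int seed)
{
	assert(buf);
	srand(seed);
	for (size_t i = 0; i < n; i++)
		buf[i] = (char) rand();
}

/*
 * Replay the same sequence and return the offset of the first byte that
 * differs, or -1 if the whole buffer still matches.
 */
static long first_mismatch(const char *buf, size_t n, unsigned int seed)
{
	assert(buf);
	srand(seed);
	for (size_t i = 0; i < n; i++)
		if (buf[i] != (char) rand())
			return (long) i;
	return -1;
}
```

Filling the preamble before the mremap and calling first_mismatch() with the
same seed afterwards reports the first corrupted offset, if any.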

Putting some prints in the kernel, I verified that we optimize
correctly in different situations:

Optimize when there is alignment and no previous mapping (this is tested
by the previous patch).
<prints>
can_align_down(old_vma->vm_start=2900000, old_addr=2900000, mask=-2097152): 0
can_align_down(new_vma->vm_start=2f00000, new_addr=2f00000, mask=-2097152): 0
=== Starting move_page_tables ===
Doing PUD move for 2800000 -> 2e00000 of extent=200000 <-- Optimization
Doing PUD move for 2a00000 -> 3000000 of extent=200000
Doing PUD move for 2c00000 -> 3200000 of extent=200000
</prints>

Don't optimize when there is alignment but there is a previous mapping
(this is tested by this patch).
Notice that can_align_down() returns 1 for the destination mapping
as we detected there is something there.
<prints>
can_align_down(old_vma->vm_start=2900000, old_addr=2900000, mask=-2097152): 0
can_align_down(new_vma->vm_start=5700000, new_addr=5700000, mask=-2097152): 1
=== Starting move_page_tables ===
Doing move_ptes for 2900000 -> 5700000 of extent=100000 <-- Unoptimized
Doing PUD move for 2a00000 -> 5800000 of extent=200000
Doing PUD move for 2c00000 -> 5a00000 of extent=200000
</prints>

Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 tools/testing/selftests/mm/mremap_test.c | 57 +++++++++++++++++++++---
 1 file changed, 52 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/mm/mremap_test.c b/tools/testing/selftests/mm/mremap_test.c
index 6304eb0947a3..d7366074e2a8 100644
--- a/tools/testing/selftests/mm/mremap_test.c
+++ b/tools/testing/selftests/mm/mremap_test.c
@@ -29,6 +29,7 @@ struct config {
 	unsigned long long dest_alignment;
 	unsigned long long region_size;
 	int overlapping;
+	int dest_preamble_size;
 };
 
 struct test {
@@ -283,7 +284,7 @@ static void *get_source_mapping(struct config c)
 static long long remap_region(struct config c, unsigned int threshold_mb,
 			      char pattern_seed)
 {
-	void *addr, *src_addr, *dest_addr;
+	void *addr, *src_addr, *dest_addr, *dest_preamble_addr;
 	unsigned long long i;
 	struct timespec t_start = {0, 0}, t_end = {0, 0};
 	long long  start_ns, end_ns, align_mask, ret, offset;
@@ -300,7 +301,7 @@ static long long remap_region(struct config c, unsigned int threshold_mb,
 		goto out;
 	}
 
-	/* Set byte pattern */
+	/* Set byte pattern for source block. */
 	srand(pattern_seed);
 	for (i = 0; i < threshold; i++)
 		memset((char *) src_addr + i, (char) rand(), 1);
@@ -312,6 +313,9 @@ static long long remap_region(struct config c, unsigned int threshold_mb,
 	addr = (void *) (((unsigned long long) src_addr + c.region_size
 			  + offset) & align_mask);
 
+	/* Remap after the destination block preamble. */
+	addr += c.dest_preamble_size;
+
 	/* See comment in get_source_mapping() */
 	if (!((unsigned long long) addr & c.dest_alignment))
 		addr = (void *) ((unsigned long long) addr | c.dest_alignment);
@@ -327,6 +331,24 @@ static long long remap_region(struct config c, unsigned int threshold_mb,
 		addr += c.dest_alignment;
 	}
 
+	if (c.dest_preamble_size) {
+		dest_preamble_addr = mmap((void *) addr - c.dest_preamble_size, c.dest_preamble_size,
+					  PROT_READ | PROT_WRITE,
+					  MAP_FIXED_NOREPLACE | MAP_ANONYMOUS | MAP_SHARED,
+							-1, 0);
+		if (dest_preamble_addr == MAP_FAILED) {
+			ksft_print_msg("Failed to map dest preamble region: %s\n",
+					strerror(errno));
+			ret = -1;
+			goto clean_up_src;
+		}
+
+		/* Set byte pattern for the dest preamble block. */
+		srand(pattern_seed);
+		for (i = 0; i < c.dest_preamble_size; i++)
+			memset((char *) dest_preamble_addr + i, (char) rand(), 1);
+	}
+
 	clock_gettime(CLOCK_MONOTONIC, &t_start);
 	dest_addr = mremap(src_addr, c.region_size, c.region_size,
 					  MREMAP_MAYMOVE|MREMAP_FIXED, (char *) addr);
@@ -335,7 +357,7 @@ static long long remap_region(struct config c, unsigned int threshold_mb,
 	if (dest_addr == MAP_FAILED) {
 		ksft_print_msg("mremap failed: %s\n", strerror(errno));
 		ret = -1;
-		goto clean_up_src;
+		goto clean_up_dest_preamble;
 	}
 
 	/* Verify byte pattern after remapping */
@@ -353,6 +375,23 @@ static long long remap_region(struct config c, unsigned int threshold_mb,
 		}
 	}
 
+	/* Verify the dest preamble byte pattern after remapping */
+	if (c.dest_preamble_size) {
+		srand(pattern_seed);
+		for (i = 0; i < c.dest_preamble_size; i++) {
+			char c = (char) rand();
+
+			if (((char *) dest_preamble_addr)[i] != c) {
+				ksft_print_msg("Preamble data after remap doesn't match at offset %d\n",
+					       i);
+				ksft_print_msg("Expected: %#x\t Got: %#x\n", c & 0xff,
+					       ((char *) dest_preamble_addr)[i] & 0xff);
+				ret = -1;
+				goto clean_up_dest;
+			}
+		}
+	}
+
 	start_ns = t_start.tv_sec * NS_PER_SEC + t_start.tv_nsec;
 	end_ns = t_end.tv_sec * NS_PER_SEC + t_end.tv_nsec;
 	ret = end_ns - start_ns;
@@ -365,6 +404,9 @@ static long long remap_region(struct config c, unsigned int threshold_mb,
  */
 clean_up_dest:
 	munmap(dest_addr, c.region_size);
+clean_up_dest_preamble:
+	if (c.dest_preamble_size && dest_preamble_addr)
+		munmap(dest_preamble_addr, c.dest_preamble_size);
 clean_up_src:
 	munmap(src_addr, c.region_size);
 out:
@@ -440,7 +482,7 @@ static int parse_args(int argc, char **argv, unsigned int *threshold_mb,
 	return 0;
 }
 
-#define MAX_TEST 14
+#define MAX_TEST 15
 #define MAX_PERF_TEST 3
 int main(int argc, char **argv)
 {
@@ -449,7 +491,7 @@ int main(int argc, char **argv)
 	unsigned int threshold_mb = VALIDATION_DEFAULT_THRESHOLD;
 	unsigned int pattern_seed;
 	int num_expand_tests = 2;
-	struct test test_cases[MAX_TEST];
+	struct test test_cases[MAX_TEST] = {};
 	struct test perf_test_cases[MAX_PERF_TEST];
 	int page_size;
 	time_t t;
@@ -510,6 +552,11 @@ int main(int argc, char **argv)
 	test_cases[13] = MAKE_TEST(_1MB, _1MB, _5MB, NON_OVERLAPPING, EXPECT_SUCCESS,
 				  "5MB mremap - Source 1MB-aligned, Destination 1MB-aligned");
 
+	/* Src and Dest addr 1MB aligned. 5MB mremap. */
+	test_cases[14] = MAKE_TEST(_1MB, _1MB, _5MB, NON_OVERLAPPING, EXPECT_SUCCESS,
+				  "5MB mremap - Source 1MB-aligned, Dest 1MB-aligned with 40MB Preamble");
+	test_cases[14].config.dest_preamble_size = 10 * _4MB;
+
 	perf_test_cases[0] =  MAKE_TEST(page_size, page_size, _1GB, NON_OVERLAPPING, EXPECT_SUCCESS,
 					"1GB mremap - Source PTE-aligned, Destination PTE-aligned");
 	/*
-- 
2.42.0.283.g2d96d420d3-goog



* [PATCH v6 6/7] selftests: mm: Add a test for remapping within a range
  2023-09-03 15:13 [PATCH v6 0/7] Optimize mremap during mutual alignment within PMD Joel Fernandes (Google)
                   ` (4 preceding siblings ...)
  2023-09-03 15:13 ` [PATCH v6 5/7] selftests: mm: Add a test for remapping to area immediately after existing mapping Joel Fernandes (Google)
@ 2023-09-03 15:13 ` Joel Fernandes (Google)
  2023-09-05  6:48   ` Lorenzo Stoakes
  2023-09-03 15:13 ` [PATCH v6 7/7] selftests: mm: Add a test for moving from an offset from start of mapping Joel Fernandes (Google)
  2023-09-18 15:35 ` [PATCH v6 0/7] Optimize mremap during mutual alignment within PMD zhenyu zhang
  7 siblings, 1 reply; 16+ messages in thread
From: Joel Fernandes (Google) @ 2023-09-03 15:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google),
	linux-kselftest, linux-mm, Shuah Khan, Vlastimil Babka,
	Michal Hocko, Linus Torvalds, Lorenzo Stoakes, Kirill A Shutemov,
	Liam R. Howlett, Paul E. McKenney, Suren Baghdasaryan,
	Kalesh Singh, Lokesh Gidra

Move a block of memory within a memory range. Any alignment optimization
on the source address may cause corruption. Verify using kselftest that
it works. I have also verified with tracing that such optimization does
not happen due to this check in can_align_down():

if (!for_stack && vma->vm_start != addr_to_align)
	return false;
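
To see why realigning would be harmful here, the address arithmetic of the new
test can be modelled on its own. This is a sketch, assuming 2MB PMDs; struct
move_layout, within_range_layout() and align_down_hits_unmoved_source() are
hypothetical names that mirror the offsets used by mremap_move_within_range()
in the diff below.

```c
#include <assert.h>

#define MODEL_MB(m) ((unsigned long)(m) << 20)

/* Hypothetical description of the addresses used by the within-range test. */
struct move_layout {
	unsigned long src;		/* start of SSSSssss, 2MB aligned */
	unsigned long dest;		/* start of DDDDdddd, 2MB below src */
	unsigned long move_from;	/* start of ssss */
	unsigned long move_to;		/* start of dddd */
};

static struct move_layout within_range_layout(unsigned long base)
{
	struct move_layout l;

	l.src = (base + MODEL_MB(6)) & ~(MODEL_MB(2) - 1);
	l.dest = l.src - MODEL_MB(2);
	l.move_from = l.src + MODEL_MB(1);
	l.move_to = l.dest + MODEL_MB(1);

	/* Source and destination share the same 1MB offset within a PMD. */
	assert((l.move_from & (MODEL_MB(2) - 1)) ==
	       (l.move_to & (MODEL_MB(2) - 1)));
	return l;
}

/* Aligning move_from down to 2MB would land exactly on the unmoved SSSS. */
static int align_down_hits_unmoved_source(const struct move_layout *l)
{
	return (l->move_from & ~(MODEL_MB(2) - 1)) == l->src;
}
```

Because the aligned-down source address equals the start of the part that must
stay in place, a PMD-level move from there would take that data along with it.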

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 tools/testing/selftests/mm/mremap_test.c | 79 +++++++++++++++++++++++-
 1 file changed, 78 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/mm/mremap_test.c b/tools/testing/selftests/mm/mremap_test.c
index d7366074e2a8..12a095457f4c 100644
--- a/tools/testing/selftests/mm/mremap_test.c
+++ b/tools/testing/selftests/mm/mremap_test.c
@@ -23,6 +23,7 @@
 #define VALIDATION_NO_THRESHOLD 0	/* Verify the entire region */
 
 #define MIN(X, Y) ((X) < (Y) ? (X) : (Y))
+#define SIZE_MB(m) ((size_t)m * (1024 * 1024))
 
 struct config {
 	unsigned long long src_alignment;
@@ -226,6 +227,79 @@ static void mremap_expand_merge_offset(FILE *maps_fp, unsigned long page_size)
 		ksft_test_result_fail("%s\n", test_name);
 }
 
+/*
+ * Verify that an mremap within a range does not cause corruption
+ * of unrelated part of range.
+ *
+ * Consider the following range which is 2MB aligned and is
+ * a part of a larger 20MB range which is not shown. Each
+ * character is 256KB below making the source and destination
+ * 2MB each. The lower case letters are moved (s to d) and the
+ * upper case letters are not moved. The below test verifies
+ * that the upper case S letters are not corrupted by the
+ * adjacent mremap.
+ *
+ * |DDDDddddSSSSssss|
+ */
+static void mremap_move_within_range(char pattern_seed)
+{
+	char *test_name = "mremap mremap move within range";
+	void *src, *dest;
+	int i, success = 1;
+
+	size_t size = SIZE_MB(20);
+	void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
+			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	if (ptr == MAP_FAILED) {
+		perror("mmap");
+		success = 0;
+		goto out;
+	}
+	memset(ptr, 0, size);
+
+	src = ptr + SIZE_MB(6);
+	src = (void *)((unsigned long)src & ~(SIZE_MB(2) - 1));
+
+	/* Set byte pattern for source block. */
+	srand(pattern_seed);
+	for (i = 0; i < SIZE_MB(2); i++) {
+		((char *)src)[i] = (char) rand();
+	}
+
+	dest = src - SIZE_MB(2);
+
+	void *new_ptr = mremap(src + SIZE_MB(1), SIZE_MB(1), SIZE_MB(1),
+						   MREMAP_MAYMOVE | MREMAP_FIXED, dest + SIZE_MB(1));
+	if (new_ptr == MAP_FAILED) {
+		perror("mremap");
+		success = 0;
+		goto out;
+	}
+
+	/* Verify byte pattern after remapping */
+	srand(pattern_seed);
+	for (i = 0; i < SIZE_MB(1); i++) {
+		char c = (char) rand();
+
+		if (((char *)src)[i] != c) {
+			ksft_print_msg("Data at src at %d got corrupted due to unrelated mremap\n",
+				       i);
+			ksft_print_msg("Expected: %#x\t Got: %#x\n", c & 0xff,
+					((char *) src)[i] & 0xff);
+			success = 0;
+		}
+	}
+
+out:
+	if (munmap(ptr, size) == -1)
+		perror("munmap");
+
+	if (success)
+		ksft_test_result_pass("%s\n", test_name);
+	else
+		ksft_test_result_fail("%s\n", test_name);
+}
+
 /*
  * Returns the start address of the mapping on success, else returns
  * NULL on failure.
@@ -491,6 +565,7 @@ int main(int argc, char **argv)
 	unsigned int threshold_mb = VALIDATION_DEFAULT_THRESHOLD;
 	unsigned int pattern_seed;
 	int num_expand_tests = 2;
+	int num_misc_tests = 1;
 	struct test test_cases[MAX_TEST] = {};
 	struct test perf_test_cases[MAX_PERF_TEST];
 	int page_size;
@@ -572,7 +647,7 @@ int main(int argc, char **argv)
 				(threshold_mb * _1MB >= _1GB);
 
 	ksft_set_plan(ARRAY_SIZE(test_cases) + (run_perf_tests ?
-		      ARRAY_SIZE(perf_test_cases) : 0) + num_expand_tests);
+		      ARRAY_SIZE(perf_test_cases) : 0) + num_expand_tests + num_misc_tests);
 
 	for (i = 0; i < ARRAY_SIZE(test_cases); i++)
 		run_mremap_test_case(test_cases[i], &failures, threshold_mb,
@@ -590,6 +665,8 @@ int main(int argc, char **argv)
 
 	fclose(maps_fp);
 
+	mremap_move_within_range(pattern_seed);
+
 	if (run_perf_tests) {
 		ksft_print_msg("\n%s\n",
 		 "mremap HAVE_MOVE_PMD/PUD optimization time comparison for 1GB region:");
-- 
2.42.0.283.g2d96d420d3-goog



* [PATCH v6 7/7] selftests: mm: Add a test for moving from an offset from start of mapping
  2023-09-03 15:13 [PATCH v6 0/7] Optimize mremap during mutual alignment within PMD Joel Fernandes (Google)
                   ` (5 preceding siblings ...)
  2023-09-03 15:13 ` [PATCH v6 6/7] selftests: mm: Add a test for remapping within a range Joel Fernandes (Google)
@ 2023-09-03 15:13 ` Joel Fernandes (Google)
  2023-09-18 15:35 ` [PATCH v6 0/7] Optimize mremap during mutual alignment within PMD zhenyu zhang
  7 siblings, 0 replies; 16+ messages in thread
From: Joel Fernandes (Google) @ 2023-09-03 15:13 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes, Lorenzo Stoakes, linux-kselftest, linux-mm,
	Shuah Khan, Vlastimil Babka, Michal Hocko, Linus Torvalds,
	Kirill A Shutemov, Liam R. Howlett, Paul E. McKenney,
	Suren Baghdasaryan, Kalesh Singh, Lokesh Gidra

From: Joel Fernandes <joel@joelfernandes.org>

The aligned-down address may fall on no existing mapping, but that alone
does not make it safe to align down to it. This test verifies that the
"vma->vm_start != addr_to_align" check in can_align_down() prevents
corruption when the source and destination are mutually aligned within a
PMD but the requested source/destination addresses are not at the start
of the mappings that contain them.
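
For readers unfamiliar with the term, "mutually aligned within a PMD"
can be sketched in userspace as below. The helper names and the 2MB PMD
size are assumptions for illustration; the conditions mirror the ones
used by the realignment code in this series, but this is not the kernel
implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: 2MB PMDs, as on x86-64 with 4KB pages. */
#define SKETCH_PMD_SIZE (2UL * 1024 * 1024)
#define SKETCH_PMD_MASK (~(SKETCH_PMD_SIZE - 1))

/* Offset of an address inside its PMD-sized window. */
static unsigned long pmd_offset_of(unsigned long addr)
{
	/* The mask arithmetic only works for a power-of-two size. */
	assert((SKETCH_PMD_SIZE & (SKETCH_PMD_SIZE - 1)) == 0);

	return addr & ~SKETCH_PMD_MASK;
}

/*
 * True when both addresses sit at the same non-zero offset within a
 * PMD, i.e. realigning both down by that offset keeps them in step.
 */
static bool realign_candidate(unsigned long old_addr, unsigned long new_addr)
{
	unsigned long off = pmd_offset_of(old_addr);

	return off != 0 && off == pmd_offset_of(new_addr);
}

/* True when a move of len bytes from old_addr reaches the next PMD boundary. */
static bool reaches_pmd_boundary(unsigned long old_addr, unsigned long len)
{
	return len >= SKETCH_PMD_SIZE - pmd_offset_of(old_addr);
}
```

With a 2MB PMD, the made-up addresses 0x40140000 and 0x80140000 share
the offset 0x140000, so they are candidates for realignment. Whether
aligning down is actually safe still depends on the can_align_down()
checks, which is what this test exercises.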

Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 tools/testing/selftests/mm/mremap_test.c | 189 ++++++++++++++++-------
 1 file changed, 134 insertions(+), 55 deletions(-)

diff --git a/tools/testing/selftests/mm/mremap_test.c b/tools/testing/selftests/mm/mremap_test.c
index 12a095457f4c..1f836e670a37 100644
--- a/tools/testing/selftests/mm/mremap_test.c
+++ b/tools/testing/selftests/mm/mremap_test.c
@@ -24,6 +24,7 @@
 
 #define MIN(X, Y) ((X) < (Y) ? (X) : (Y))
 #define SIZE_MB(m) ((size_t)m * (1024 * 1024))
+#define SIZE_KB(k) ((size_t)k * 1024)
 
 struct config {
 	unsigned long long src_alignment;
@@ -148,6 +149,60 @@ static bool is_range_mapped(FILE *maps_fp, void *start, void *end)
 	return success;
 }
 
+/*
+ * Returns the start address of the mapping on success, else returns
+ * NULL on failure.
+ */
+static void *get_source_mapping(struct config c)
+{
+	unsigned long long addr = 0ULL;
+	void *src_addr = NULL;
+	unsigned long long mmap_min_addr;
+
+	mmap_min_addr = get_mmap_min_addr();
+	/*
+	 * For some tests, we need to not have any mappings below the
+	 * source mapping. Add some headroom to mmap_min_addr for this.
+	 */
+	mmap_min_addr += 10 * _4MB;
+
+retry:
+	addr += c.src_alignment;
+	if (addr < mmap_min_addr)
+		goto retry;
+
+	src_addr = mmap((void *) addr, c.region_size, PROT_READ | PROT_WRITE,
+					MAP_FIXED_NOREPLACE | MAP_ANONYMOUS | MAP_SHARED,
+					-1, 0);
+	if (src_addr == MAP_FAILED) {
+		if (errno == EPERM || errno == EEXIST)
+			goto retry;
+		goto error;
+	}
+	/*
+	 * Check that the address is aligned to the specified alignment.
+	 * Addresses which have alignments that are multiples of that
+	 * specified are not considered valid. For instance, 1GB address is
+	 * 2MB-aligned, however it will not be considered valid for a
+	 * requested alignment of 2MB. This is done to reduce coincidental
+	 * alignment in the tests.
+	 */
+	if (((unsigned long long) src_addr & (c.src_alignment - 1)) ||
+			!((unsigned long long) src_addr & c.src_alignment)) {
+		munmap(src_addr, c.region_size);
+		goto retry;
+	}
+
+	if (!src_addr)
+		goto error;
+
+	return src_addr;
+error:
+	ksft_print_msg("Failed to map source region: %s\n",
+			strerror(errno));
+	return NULL;
+}
+
 /*
  * This test validates that merge is called when expanding a mapping.
  * Mapping containing three pages is created, middle page is unmapped
@@ -300,60 +355,6 @@ static void mremap_move_within_range(char pattern_seed)
 		ksft_test_result_fail("%s\n", test_name);
 }
 
-/*
- * Returns the start address of the mapping on success, else returns
- * NULL on failure.
- */
-static void *get_source_mapping(struct config c)
-{
-	unsigned long long addr = 0ULL;
-	void *src_addr = NULL;
-	unsigned long long mmap_min_addr;
-
-	mmap_min_addr = get_mmap_min_addr();
-	/*
-	 * For some tests, we need to not have any mappings below the
-	 * source mapping. Add some headroom to mmap_min_addr for this.
-	 */
-	mmap_min_addr += 10 * _4MB;
-
-retry:
-	addr += c.src_alignment;
-	if (addr < mmap_min_addr)
-		goto retry;
-
-	src_addr = mmap((void *) addr, c.region_size, PROT_READ | PROT_WRITE,
-					MAP_FIXED_NOREPLACE | MAP_ANONYMOUS | MAP_SHARED,
-					-1, 0);
-	if (src_addr == MAP_FAILED) {
-		if (errno == EPERM || errno == EEXIST)
-			goto retry;
-		goto error;
-	}
-	/*
-	 * Check that the address is aligned to the specified alignment.
-	 * Addresses which have alignments that are multiples of that
-	 * specified are not considered valid. For instance, 1GB address is
-	 * 2MB-aligned, however it will not be considered valid for a
-	 * requested alignment of 2MB. This is done to reduce coincidental
-	 * alignment in the tests.
-	 */
-	if (((unsigned long long) src_addr & (c.src_alignment - 1)) ||
-			!((unsigned long long) src_addr & c.src_alignment)) {
-		munmap(src_addr, c.region_size);
-		goto retry;
-	}
-
-	if (!src_addr)
-		goto error;
-
-	return src_addr;
-error:
-	ksft_print_msg("Failed to map source region: %s\n",
-			strerror(errno));
-	return NULL;
-}
-
 /* Returns the time taken for the remap on success else returns -1. */
 static long long remap_region(struct config c, unsigned int threshold_mb,
 			      char pattern_seed)
@@ -487,6 +488,83 @@ static long long remap_region(struct config c, unsigned int threshold_mb,
 	return ret;
 }
 
+/*
+ * Verify that mremap does not align down and destroy the
+ * beginning of the mapping just because the aligned-down
+ * address happens to fall on a range that is not mapped.
+ */
+static void mremap_move_1mb_from_start(char pattern_seed)
+{
+	char *test_name = "mremap move 1mb from start at 1MB+256KB aligned src";
+	void *src = NULL, *dest = NULL;
+	int i, success = 1;
+
+	/* Config to reuse get_source_mapping() to do an aligned mmap. */
+	struct config c = {
+		.src_alignment = SIZE_MB(1) + SIZE_KB(256),
+		.region_size = SIZE_MB(6)
+	};
+
+	src = get_source_mapping(c);
+	if (!src) {
+		success = 0;
+		goto out;
+	}
+
+	c.src_alignment = SIZE_MB(1) + SIZE_KB(256);
+	dest = get_source_mapping(c);
+	if (!dest) {
+		success = 0;
+		goto out;
+	}
+
+	/* Set byte pattern for source block. */
+	srand(pattern_seed);
+	for (i = 0; i < SIZE_MB(2); i++) {
+		((char *)src)[i] = (char) rand();
+	}
+
+	/*
+	 * Unmap the beginning of dest so that the aligned address
+	 * falls on no mapping.
+	 */
+	munmap(dest, SIZE_MB(1));
+
+	void *new_ptr = mremap(src + SIZE_MB(1), SIZE_MB(1), SIZE_MB(1),
+						   MREMAP_MAYMOVE | MREMAP_FIXED, dest + SIZE_MB(1));
+	if (new_ptr == MAP_FAILED) {
+		perror("mremap");
+		success = 0;
+		goto out;
+	}
+
+	/* Verify byte pattern after remapping */
+	srand(pattern_seed);
+	for (i = 0; i < SIZE_MB(1); i++) {
+		char c = (char) rand();
+
+		if (((char *)src)[i] != c) {
+			ksft_print_msg("Data at src at %d got corrupted due to unrelated mremap\n",
+				       i);
+			ksft_print_msg("Expected: %#x\t Got: %#x\n", c & 0xff,
+					((char *) src)[i] & 0xff);
+			success = 0;
+		}
+	}
+
+out:
+	if (src && munmap(src, c.region_size) == -1)
+		perror("munmap src");
+
+	if (dest && munmap(dest, c.region_size) == -1)
+		perror("munmap dest");
+
+	if (success)
+		ksft_test_result_pass("%s\n", test_name);
+	else
+		ksft_test_result_fail("%s\n", test_name);
+}
+
 static void run_mremap_test_case(struct test test_case, int *failures,
 				 unsigned int threshold_mb,
 				 unsigned int pattern_seed)
@@ -565,7 +643,7 @@ int main(int argc, char **argv)
 	unsigned int threshold_mb = VALIDATION_DEFAULT_THRESHOLD;
 	unsigned int pattern_seed;
 	int num_expand_tests = 2;
-	int num_misc_tests = 1;
+	int num_misc_tests = 2;
 	struct test test_cases[MAX_TEST] = {};
 	struct test perf_test_cases[MAX_PERF_TEST];
 	int page_size;
@@ -666,6 +744,7 @@ int main(int argc, char **argv)
 	fclose(maps_fp);
 
 	mremap_move_within_range(pattern_seed);
+	mremap_move_1mb_from_start(pattern_seed);
 
 	if (run_perf_tests) {
 		ksft_print_msg("\n%s\n",
-- 
2.42.0.283.g2d96d420d3-goog



* [lkp] [+134 bytes kernel size regression] [i386-tinyconfig] [8d22a4573c] mm/mremap: Optimize the start addresses in move_page_tables()
  2023-09-03 15:13 ` [PATCH v6 1/7] mm/mremap: Optimize the start addresses in move_page_tables() Joel Fernandes (Google)
@ 2023-09-03 16:07   ` kernel test robot
  2023-09-08 13:07   ` [PATCH v6 1/7] " Michal Hocko
  1 sibling, 0 replies; 16+ messages in thread
From: kernel test robot @ 2023-09-03 16:07 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: oe-kbuild-all, lkp, Josh Triplett


FYI, we noticed a +134 bytes kernel size regression due to commit:

commit: 8d22a4573c37fc699bee5452368ce1c497ec8a0c (mm/mremap: Optimize the start addresses in move_page_tables())
url: https://github.com/intel-lab-lkp/linux/commits/Joel-Fernandes-Google/mm-mremap-Optimize-the-start-addresses-in-move_page_tables/20230903-231541
base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/all/20230903151328.2981432-2-joel@joelfernandes.org/
patch subject: [PATCH v6 1/7] mm/mremap: Optimize the start addresses in move_page_tables()


Details as below (size data is obtained by `nm --size-sort vmlinux`):

69d5f96a: Merge branch 'mm-nonmm-unstable' into mm-everything
8d22a457: mm/mremap: Optimize the start addresses in move_page_tables()

+-----------------------+----------+----------+-------+
|        symbol         | 69d5f96a | 8d22a457 | delta |
+-----------------------+----------+----------+-------+
| nm.T.move_page_tables | 27       | 775      | 748   |
| bzImage               | 516400   | 516464   | 64    |
| nm.t.can_align_down   | 0        | 29       | 29    |
| nm.t.move_page_tables | 643      | 0        | -643  |
+-----------------------+----------+----------+-------+
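
The +134 bytes in the subject appears to match the sum of the nm symbol
deltas in the table, while bzImage itself grows by 64 bytes. A small
sketch of that arithmetic follows; the struct and helper are made up for
illustration and the rows are copied from the table above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One row of the size table: symbol name and its size in both builds. */
struct size_row {
	const char *symbol;
	long before;
	long after;
};

/* Rows copied from the table above. */
static const struct size_row robot_rows[] = {
	{ "nm.T.move_page_tables", 27, 775 },
	{ "bzImage", 516400, 516464 },
	{ "nm.t.can_align_down", 0, 29 },
	{ "nm.t.move_page_tables", 643, 0 },
};

/* Sum (after - before) over the rows whose symbol starts with prefix. */
static long sum_delta(const struct size_row *rows, size_t n,
		      const char *prefix)
{
	size_t len = strlen(prefix);
	long total = 0;

	assert(rows);

	for (size_t i = 0; i < n; i++)
		if (strncmp(rows[i].symbol, prefix, len) == 0)
			total += rows[i].after - rows[i].before;
	return total;
}
```

Summing the "nm." rows gives 748 + 29 - 643 = 134, and the "bzImage" row
gives 64.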



Thanks




* Re: [PATCH v6 2/7] mm/mremap: Allow moves within the same VMA for stack moves
  2023-09-03 15:13 ` [PATCH v6 2/7] mm/mremap: Allow moves within the same VMA for stack moves Joel Fernandes (Google)
@ 2023-09-05  6:47   ` Lorenzo Stoakes
  2023-09-08 13:11   ` Michal Hocko
  1 sibling, 0 replies; 16+ messages in thread
From: Lorenzo Stoakes @ 2023-09-05  6:47 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: linux-kernel, linux-kselftest, linux-mm, Shuah Khan,
	Vlastimil Babka, Michal Hocko, Linus Torvalds, Kirill A Shutemov,
	Liam R. Howlett, Paul E. McKenney, Suren Baghdasaryan,
	Kalesh Singh, Lokesh Gidra

On Sun, Sep 03, 2023 at 03:13:23PM +0000, Joel Fernandes (Google) wrote:
> For the stack move happening in shift_arg_pages(), the move is happening
> within the same VMA which spans the old and new ranges.
>
> In case the aligned address happens to fall within that VMA, allow such
> moves and don't abort the mremap alignment optimization.
>
> In the regular non-stack mremap case, we cannot allow any such moves as
> will end up destroying some part of the mapping (either the source of
> the move, or part of the existing mapping). So just avoid it for stack
> moves.
>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  fs/exec.c          |  2 +-
>  include/linux/mm.h |  2 +-
>  mm/mremap.c        | 33 +++++++++++++++++++--------------
>  3 files changed, 21 insertions(+), 16 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 1a827d55ba94..244925307958 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -712,7 +712,7 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
>  	 * process cleanup to remove whatever mess we made.
>  	 */
>  	if (length != move_page_tables(vma, old_start,
> -				       vma, new_start, length, false))
> +				       vma, new_start, length, false, true))
>  		return -ENOMEM;
>
>  	lru_add_drain();
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 406ab9ea818f..e635d1fc73b6 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2458,7 +2458,7 @@ int get_cmdline(struct task_struct *task, char *buffer, int buflen);
>  extern unsigned long move_page_tables(struct vm_area_struct *vma,
>  		unsigned long old_addr, struct vm_area_struct *new_vma,
>  		unsigned long new_addr, unsigned long len,
> -		bool need_rmap_locks);
> +		bool need_rmap_locks, bool for_stack);
>
>  /*
>   * Flags used by change_protection().  For now we make it a bitmap so
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 1011326b7b80..2b51f8b7cad8 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -490,12 +490,13 @@ static bool move_pgt_entry(enum pgt_entry entry, struct vm_area_struct *vma,
>  }
>
>  /*
> - * A helper to check if a previous mapping exists. Required for
> - * move_page_tables() and realign_addr() to determine if a previous mapping
> - * exists before we can do realignment optimizations.
> + * A helper to check if aligning down is OK. The aligned address should fall
> + * on *no mapping*. For the stack moving down, that's a special move within
> + * the VMA that is created to span the source and destination of the move,
> + * so we make an exception for it.
>   */
>  static bool can_align_down(struct vm_area_struct *vma, unsigned long addr_to_align,
> -			       unsigned long mask)
> +			    unsigned long mask, bool for_stack)
>  {
>  	unsigned long addr_masked = addr_to_align & mask;
>
> @@ -504,9 +505,13 @@ static bool can_align_down(struct vm_area_struct *vma, unsigned long addr_to_ali
>  	 * of the corresponding VMA, we can't align down or we will destroy part
>  	 * of the current mapping.
>  	 */
> -	if (vma->vm_start != addr_to_align)
> +	if (!for_stack && vma->vm_start != addr_to_align)
>  		return false;
>
> +	/* In the stack case we explicitly permit in-VMA alignment. */
> +	if (for_stack && addr_masked >= vma->vm_start)
> +		return true;
> +
>  	/*
>  	 * Make sure the realignment doesn't cause the address to fall on an
>  	 * existing mapping.
> @@ -517,7 +522,7 @@ static bool can_align_down(struct vm_area_struct *vma, unsigned long addr_to_ali
>  /* Opportunistically realign to specified boundary for faster copy. */
>  static void try_realign_addr(unsigned long *old_addr, struct vm_area_struct *old_vma,
>  			     unsigned long *new_addr, struct vm_area_struct *new_vma,
> -			     unsigned long mask)
> +			     unsigned long mask, bool for_stack)
>  {
>  	/* Skip if the addresses are already aligned. */
>  	if ((*old_addr & ~mask) == 0)
> @@ -528,8 +533,8 @@ static void try_realign_addr(unsigned long *old_addr, struct vm_area_struct *old
>  		return;
>
>  	/* Ensure realignment doesn't cause overlap with existing mappings. */
> -	if (!can_align_down(old_vma, *old_addr, mask) ||
> -	    !can_align_down(new_vma, *new_addr, mask))
> +	if (!can_align_down(old_vma, *old_addr, mask, for_stack) ||
> +	    !can_align_down(new_vma, *new_addr, mask, for_stack))
>  		return;
>
>  	*old_addr = *old_addr & mask;
> @@ -539,7 +544,7 @@ static void try_realign_addr(unsigned long *old_addr, struct vm_area_struct *old
>  unsigned long move_page_tables(struct vm_area_struct *vma,
>  		unsigned long old_addr, struct vm_area_struct *new_vma,
>  		unsigned long new_addr, unsigned long len,
> -		bool need_rmap_locks)
> +		bool need_rmap_locks, bool for_stack)
>  {
>  	unsigned long extent, old_end;
>  	struct mmu_notifier_range range;
> @@ -559,9 +564,9 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>  	 * If possible, realign addresses to PMD boundary for faster copy.
>  	 * Only realign if the mremap copying hits a PMD boundary.
>  	 */
> -	if ((vma != new_vma)
> -		&& (len >= PMD_SIZE - (old_addr & ~PMD_MASK)))
> -		try_realign_addr(&old_addr, vma, &new_addr, new_vma, PMD_MASK);
> +	if (len >= PMD_SIZE - (old_addr & ~PMD_MASK))
> +		try_realign_addr(&old_addr, vma, &new_addr, new_vma, PMD_MASK,
> +				 for_stack);
>
>  	flush_cache_range(vma, old_addr, old_end);
>  	mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma->vm_mm,
> @@ -708,7 +713,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
>  	}
>
>  	moved_len = move_page_tables(vma, old_addr, new_vma, new_addr, old_len,
> -				     need_rmap_locks);
> +				     need_rmap_locks, false);
>  	if (moved_len < old_len) {
>  		err = -ENOMEM;
>  	} else if (vma->vm_ops && vma->vm_ops->mremap) {
> @@ -722,7 +727,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
>  		 * and then proceed to unmap new area instead of old.
>  		 */
>  		move_page_tables(new_vma, new_addr, vma, old_addr, moved_len,
> -				 true);
> +				 true, false);
>  		vma = new_vma;
>  		old_len = new_len;
>  		old_addr = new_addr;
> --
> 2.42.0.283.g2d96d420d3-goog
>

Looks good to me, thanks

Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>


* Re: [PATCH v6 6/7] selftests: mm: Add a test for remapping within a range
  2023-09-03 15:13 ` [PATCH v6 6/7] selftests: mm: Add a test for remapping within a range Joel Fernandes (Google)
@ 2023-09-05  6:48   ` Lorenzo Stoakes
  0 siblings, 0 replies; 16+ messages in thread
From: Lorenzo Stoakes @ 2023-09-05  6:48 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: linux-kernel, linux-kselftest, linux-mm, Shuah Khan,
	Vlastimil Babka, Michal Hocko, Linus Torvalds, Kirill A Shutemov,
	Liam R. Howlett, Paul E. McKenney, Suren Baghdasaryan,
	Kalesh Singh, Lokesh Gidra

On Sun, Sep 03, 2023 at 03:13:27PM +0000, Joel Fernandes (Google) wrote:
> Move a block of memory within a memory range. Any alignment optimization
> on the source address may cause corruption. Verify using kselftest that
> it works. I have also verified with tracing that such optimization does
> not happen due to this check in can_align_down():
>
> if (!for_stack && vma->vm_start != addr_to_align)
> 	return false;
>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  tools/testing/selftests/mm/mremap_test.c | 79 +++++++++++++++++++++++-
>  1 file changed, 78 insertions(+), 1 deletion(-)
>
> diff --git a/tools/testing/selftests/mm/mremap_test.c b/tools/testing/selftests/mm/mremap_test.c
> index d7366074e2a8..12a095457f4c 100644
> --- a/tools/testing/selftests/mm/mremap_test.c
> +++ b/tools/testing/selftests/mm/mremap_test.c
> @@ -23,6 +23,7 @@
>  #define VALIDATION_NO_THRESHOLD 0	/* Verify the entire region */
>
>  #define MIN(X, Y) ((X) < (Y) ? (X) : (Y))
> +#define SIZE_MB(m) ((size_t)m * (1024 * 1024))
>
>  struct config {
>  	unsigned long long src_alignment;
> @@ -226,6 +227,79 @@ static void mremap_expand_merge_offset(FILE *maps_fp, unsigned long page_size)
>  		ksft_test_result_fail("%s\n", test_name);
>  }
>
> +/*
> + * Verify that an mremap within a range does not cause corruption
> + * of unrelated part of range.
> + *
> + * Consider the following range which is 2MB aligned and is
> + * a part of a larger 20MB range which is not shown. Each
> + * character is 256KB below making the source and destination
> + * 2MB each. The lower case letters are moved (s to d) and the
> + * upper case letters are not moved. The below test verifies
> + * that the upper case S letters are not corrupted by the
> + * adjacent mremap.
> + *
> + * |DDDDddddSSSSssss|
> + */
> +static void mremap_move_within_range(char pattern_seed)
> +{
> +	char *test_name = "mremap mremap move within range";
> +	void *src, *dest;
> +	int i, success = 1;
> +
> +	size_t size = SIZE_MB(20);
> +	void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
> +			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> +	if (ptr == MAP_FAILED) {
> +		perror("mmap");
> +		success = 0;
> +		goto out;
> +	}
> +	memset(ptr, 0, size);
> +
> +	src = ptr + SIZE_MB(6);
> +	src = (void *)((unsigned long)src & ~(SIZE_MB(2) - 1));
> +
> +	/* Set byte pattern for source block. */
> +	srand(pattern_seed);
> +	for (i = 0; i < SIZE_MB(2); i++) {
> +		((char *)src)[i] = (char) rand();
> +	}
> +
> +	dest = src - SIZE_MB(2);
> +
> +	void *new_ptr = mremap(src + SIZE_MB(1), SIZE_MB(1), SIZE_MB(1),
> +						   MREMAP_MAYMOVE | MREMAP_FIXED, dest + SIZE_MB(1));
> +	if (new_ptr == MAP_FAILED) {
> +		perror("mremap");
> +		success = 0;
> +		goto out;
> +	}
> +
> +	/* Verify byte pattern after remapping */
> +	srand(pattern_seed);
> +	for (i = 0; i < SIZE_MB(1); i++) {
> +		char c = (char) rand();
> +
> +		if (((char *)src)[i] != c) {
> +			ksft_print_msg("Data at src at %d got corrupted due to unrelated mremap\n",
> +				       i);
> +			ksft_print_msg("Expected: %#x\t Got: %#x\n", c & 0xff,
> +					((char *) src)[i] & 0xff);
> +			success = 0;
> +		}
> +	}
> +
> +out:
> +	if (munmap(ptr, size) == -1)
> +		perror("munmap");
> +
> +	if (success)
> +		ksft_test_result_pass("%s\n", test_name);
> +	else
> +		ksft_test_result_fail("%s\n", test_name);
> +}
> +
>  /*
>   * Returns the start address of the mapping on success, else returns
>   * NULL on failure.
> @@ -491,6 +565,7 @@ int main(int argc, char **argv)
>  	unsigned int threshold_mb = VALIDATION_DEFAULT_THRESHOLD;
>  	unsigned int pattern_seed;
>  	int num_expand_tests = 2;
> +	int num_misc_tests = 1;
>  	struct test test_cases[MAX_TEST] = {};
>  	struct test perf_test_cases[MAX_PERF_TEST];
>  	int page_size;
> @@ -572,7 +647,7 @@ int main(int argc, char **argv)
>  				(threshold_mb * _1MB >= _1GB);
>
>  	ksft_set_plan(ARRAY_SIZE(test_cases) + (run_perf_tests ?
> -		      ARRAY_SIZE(perf_test_cases) : 0) + num_expand_tests);
> +		      ARRAY_SIZE(perf_test_cases) : 0) + num_expand_tests + num_misc_tests);
>
>  	for (i = 0; i < ARRAY_SIZE(test_cases); i++)
>  		run_mremap_test_case(test_cases[i], &failures, threshold_mb,
> @@ -590,6 +665,8 @@ int main(int argc, char **argv)
>
>  	fclose(maps_fp);
>
> +	mremap_move_within_range(pattern_seed);
> +
>  	if (run_perf_tests) {
>  		ksft_print_msg("\n%s\n",
>  		 "mremap HAVE_MOVE_PMD/PUD optimization time comparison for 1GB region:");
> --
> 2.42.0.283.g2d96d420d3-goog
>

Looks good to me, nice series in general!

Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>


* Re: [PATCH v6 1/7] mm/mremap: Optimize the start addresses in move_page_tables()
  2023-09-03 15:13 ` [PATCH v6 1/7] mm/mremap: Optimize the start addresses in move_page_tables() Joel Fernandes (Google)
  2023-09-03 16:07   ` [lkp] [+134 bytes kernel size regression] [i386-tinyconfig] [8d22a4573c] " kernel test robot
@ 2023-09-08 13:07   ` Michal Hocko
  2023-09-08 13:26     ` Joel Fernandes
  1 sibling, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2023-09-08 13:07 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: linux-kernel, Lorenzo Stoakes, Linus Torvalds, linux-kselftest,
	linux-mm, Shuah Khan, Vlastimil Babka, Kirill A Shutemov,
	Liam R. Howlett, Paul E. McKenney, Suren Baghdasaryan,
	Kalesh Singh, Lokesh Gidra

Sorry for being silent most of the time in this patch series and thanks
for pushing it forward.

On Sun 03-09-23 15:13:22, Joel Fernandes wrote:
> Recently, we see reports [1] of a warning that triggers due to
> move_page_tables() doing a downward and overlapping move on a
> mutually-aligned offset within a PMD. By mutual alignment, I
> mean the source and destination addresses of the mremap are at
> the same offset within a PMD.
> 
> This mutual alignment along with the fact that the move is downward is
> sufficient to cause a warning related to having an allocated PMD that
> does not have PTEs in it.
> 
> This warning will only trigger when there is mutual alignment in the
> move operation. A solution, as suggested by Linus Torvalds [2], is to
> initiate the copy process at the PMD level whenever such alignment is
> present. Implementing this approach will not only prevent the warning
> from being triggered, but it will also optimize the operation as this
> method should enhance the speed of the copy process whenever there's a
> possibility to start copying at the PMD level.
> 
> Some more points:
> a. The optimization can be done only when both the source and
> destination of the mremap do not have anything mapped below it up to a
> PMD boundary. I add support to detect that.
> 
> b. #1 is not a problem for the call to move_page_tables() from exec.c as
> nothing is expected to be mapped below the source. However, for
> non-overlapping mutually aligned moves as triggered by mremap(2), I
> added support for checking such cases.
> 
> c. I currently only optimize for PMD moves, in the future I/we can build
> on this work and do PUD moves as well if there is a need for this. But I
> want to take it one step at a time.
> 
> d. We need to be careful about mremap of ranges within the VMA itself.
> For this purpose, I added checks to determine if the address after
> alignment falls within its VMA itself.
> 
> [1] https://lore.kernel.org/all/ZB2GTBD%2FLWTrkOiO@dhcp22.suse.cz/
> [2] https://lore.kernel.org/all/CAHk-=whd7msp8reJPfeGNyt0LiySMT0egExx3TVZSX3Ok6X=9g@mail.gmail.com/
> 
> Reviewed-by: Lorenzo Stoakes <lstoakes@gmail.com>
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

The patch looks good to me.
Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/mremap.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 62 insertions(+)
> 
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 11e06e4ab33b..1011326b7b80 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -489,6 +489,53 @@ static bool move_pgt_entry(enum pgt_entry entry, struct vm_area_struct *vma,
>  	return moved;
>  }
>  
> +/*
> + * A helper to check if a previous mapping exists. Required for
> + * move_page_tables() and realign_addr() to determine if a previous mapping
> + * exists before we can do realignment optimizations.
> + */
> +static bool can_align_down(struct vm_area_struct *vma, unsigned long addr_to_align,
> +			       unsigned long mask)
> +{
> +	unsigned long addr_masked = addr_to_align & mask;
> +
> +	/*
> +	 * If @addr_to_align of either source or destination is not the beginning
> +	 * of the corresponding VMA, we can't align down or we will destroy part
> +	 * of the current mapping.
> +	 */
> +	if (vma->vm_start != addr_to_align)
> +		return false;
> +
> +	/*
> +	 * Make sure the realignment doesn't cause the address to fall on an
> +	 * existing mapping.
> +	 */
> +	return find_vma_intersection(vma->vm_mm, addr_masked, vma->vm_start) == NULL;
> +}
> +
> +/* Opportunistically realign to specified boundary for faster copy. */
> +static void try_realign_addr(unsigned long *old_addr, struct vm_area_struct *old_vma,
> +			     unsigned long *new_addr, struct vm_area_struct *new_vma,
> +			     unsigned long mask)
> +{
> +	/* Skip if the addresses are already aligned. */
> +	if ((*old_addr & ~mask) == 0)
> +		return;
> +
> +	/* Only realign if the new and old addresses are mutually aligned. */
> +	if ((*old_addr & ~mask) != (*new_addr & ~mask))
> +		return;
> +
> +	/* Ensure realignment doesn't cause overlap with existing mappings. */
> +	if (!can_align_down(old_vma, *old_addr, mask) ||
> +	    !can_align_down(new_vma, *new_addr, mask))
> +		return;
> +
> +	*old_addr = *old_addr & mask;
> +	*new_addr = *new_addr & mask;
> +}
> +
>  unsigned long move_page_tables(struct vm_area_struct *vma,
>  		unsigned long old_addr, struct vm_area_struct *new_vma,
>  		unsigned long new_addr, unsigned long len,
> @@ -508,6 +555,14 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>  		return move_hugetlb_page_tables(vma, new_vma, old_addr,
>  						new_addr, len);
>  
> +	/*
> +	 * If possible, realign addresses to PMD boundary for faster copy.
> +	 * Only realign if the mremap copying hits a PMD boundary.
> +	 */
> +	if ((vma != new_vma)
> +		&& (len >= PMD_SIZE - (old_addr & ~PMD_MASK)))
> +		try_realign_addr(&old_addr, vma, &new_addr, new_vma, PMD_MASK);
> +
>  	flush_cache_range(vma, old_addr, old_end);
>  	mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma->vm_mm,
>  				old_addr, old_end);
> @@ -577,6 +632,13 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>  
>  	mmu_notifier_invalidate_range_end(&range);
>  
> +	/*
> +	 * Prevent negative return values when {old,new}_addr was realigned
> +	 * but we broke out of the above loop for the first PMD itself.
> +	 */
> +	if (len + old_addr < old_end)
> +		return 0;
> +
>  	return len + old_addr - old_end;	/* how much done */
>  }
>  
> -- 
> 2.42.0.283.g2d96d420d3-goog

-- 
Michal Hocko
SUSE Labs
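
For readers following the address arithmetic in try_realign_addr() and the
final return-value check quoted above, here is a minimal userspace sketch.
It is not the kernel code: it assumes a 2 MiB PMD, the SKETCH_* macros and
the function names are made up for illustration, and the can_align_down()
VMA checks are left out on purpose.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 2 MiB PMD, as with 4 KiB pages on x86-64 or arm64. */
#define SKETCH_PMD_SIZE	((uint64_t)1 << 21)
#define SKETCH_PMD_MASK	(~(SKETCH_PMD_SIZE - 1))

/* True when a copy of len bytes starting at old_addr reaches the next PMD boundary. */
static bool reaches_pmd_boundary(uint64_t old_addr, uint64_t len)
{
	return len >= SKETCH_PMD_SIZE - (old_addr & ~SKETCH_PMD_MASK);
}

/*
 * Align both addresses down when they sit at the same non-zero offset
 * inside their PMD. Returns true if the addresses were changed.
 */
static bool sketch_realign(uint64_t *old_addr, uint64_t *new_addr, uint64_t mask)
{
	/* Already aligned: nothing to gain. */
	if ((*old_addr & ~mask) == 0)
		return false;

	/* Different offsets within the PMD: not mutually aligned. */
	if ((*old_addr & ~mask) != (*new_addr & ~mask))
		return false;

	*old_addr &= mask;
	*new_addr &= mask;
	return true;
}

/*
 * Bytes reported as moved. old_end is computed from the original,
 * unaligned old_addr, so a loop that stops at the realigned start
 * would otherwise produce a wrapped (negative) result.
 */
static uint64_t sketch_done(uint64_t len, uint64_t cur_old_addr, uint64_t old_end)
{
	if (len + cur_old_addr < old_end)
		return 0;
	return len + cur_old_addr - old_end;
}
```

With old_addr = 0x40100000 and new_addr = 0x80100000, both addresses are
1 MiB into their PMD, so both are aligned down by 1 MiB. If the copy loop
then stops before making any progress, cur_old_addr is still 0x40000000
and sketch_done() reports 0 rather than a wrapped value.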

* Re: [PATCH v6 2/7] mm/mremap: Allow moves within the same VMA for stack moves
  2023-09-03 15:13 ` [PATCH v6 2/7] mm/mremap: Allow moves within the same VMA for stack moves Joel Fernandes (Google)
  2023-09-05  6:47   ` Lorenzo Stoakes
@ 2023-09-08 13:11   ` Michal Hocko
  1 sibling, 0 replies; 16+ messages in thread
From: Michal Hocko @ 2023-09-08 13:11 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: linux-kernel, linux-kselftest, linux-mm, Shuah Khan,
	Vlastimil Babka, Linus Torvalds, Lorenzo Stoakes,
	Kirill A Shutemov, Liam R. Howlett, Paul E. McKenney,
	Suren Baghdasaryan, Kalesh Singh, Lokesh Gidra

On Sun 03-09-23 15:13:23, Joel Fernandes wrote:
> For the stack move happening in shift_arg_pages(), the move is happening
> within the same VMA which spans the old and new ranges.
> 
> In case the aligned address happens to fall within that VMA, allow such
> moves and don't abort the mremap alignment optimization.
> 
> In the regular non-stack mremap case, we cannot allow any such moves as
> will end up destroying some part of the mapping (either the source of
> the move, or part of the existing mapping). So just avoid it for stack
> moves.
> 
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

LGTM
Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> ---
>  fs/exec.c          |  2 +-
>  include/linux/mm.h |  2 +-
>  mm/mremap.c        | 33 +++++++++++++++++++--------------
>  3 files changed, 21 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index 1a827d55ba94..244925307958 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -712,7 +712,7 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
>  	 * process cleanup to remove whatever mess we made.
>  	 */
>  	if (length != move_page_tables(vma, old_start,
> -				       vma, new_start, length, false))
> +				       vma, new_start, length, false, true))
>  		return -ENOMEM;
>  
>  	lru_add_drain();
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 406ab9ea818f..e635d1fc73b6 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2458,7 +2458,7 @@ int get_cmdline(struct task_struct *task, char *buffer, int buflen);
>  extern unsigned long move_page_tables(struct vm_area_struct *vma,
>  		unsigned long old_addr, struct vm_area_struct *new_vma,
>  		unsigned long new_addr, unsigned long len,
> -		bool need_rmap_locks);
> +		bool need_rmap_locks, bool for_stack);
>  
>  /*
>   * Flags used by change_protection().  For now we make it a bitmap so
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 1011326b7b80..2b51f8b7cad8 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -490,12 +490,13 @@ static bool move_pgt_entry(enum pgt_entry entry, struct vm_area_struct *vma,
>  }
>  
>  /*
> - * A helper to check if a previous mapping exists. Required for
> - * move_page_tables() and realign_addr() to determine if a previous mapping
> - * exists before we can do realignment optimizations.
> + * A helper to check if aligning down is OK. The aligned address should fall
> + * on *no mapping*. For the stack moving down, that's a special move within
> + * the VMA that is created to span the source and destination of the move,
> + * so we make an exception for it.
>   */
>  static bool can_align_down(struct vm_area_struct *vma, unsigned long addr_to_align,
> -			       unsigned long mask)
> +			    unsigned long mask, bool for_stack)
>  {
>  	unsigned long addr_masked = addr_to_align & mask;
>  
> @@ -504,9 +505,13 @@ static bool can_align_down(struct vm_area_struct *vma, unsigned long addr_to_ali
>  	 * of the corresponding VMA, we can't align down or we will destroy part
>  	 * of the current mapping.
>  	 */
> -	if (vma->vm_start != addr_to_align)
> +	if (!for_stack && vma->vm_start != addr_to_align)
>  		return false;
>  
> +	/* In the stack case we explicitly permit in-VMA alignment. */
> +	if (for_stack && addr_masked >= vma->vm_start)
> +		return true;
> +
>  	/*
>  	 * Make sure the realignment doesn't cause the address to fall on an
>  	 * existing mapping.
> @@ -517,7 +522,7 @@ static bool can_align_down(struct vm_area_struct *vma, unsigned long addr_to_ali
>  /* Opportunistically realign to specified boundary for faster copy. */
>  static void try_realign_addr(unsigned long *old_addr, struct vm_area_struct *old_vma,
>  			     unsigned long *new_addr, struct vm_area_struct *new_vma,
> -			     unsigned long mask)
> +			     unsigned long mask, bool for_stack)
>  {
>  	/* Skip if the addresses are already aligned. */
>  	if ((*old_addr & ~mask) == 0)
> @@ -528,8 +533,8 @@ static void try_realign_addr(unsigned long *old_addr, struct vm_area_struct *old
>  		return;
>  
>  	/* Ensure realignment doesn't cause overlap with existing mappings. */
> -	if (!can_align_down(old_vma, *old_addr, mask) ||
> -	    !can_align_down(new_vma, *new_addr, mask))
> +	if (!can_align_down(old_vma, *old_addr, mask, for_stack) ||
> +	    !can_align_down(new_vma, *new_addr, mask, for_stack))
>  		return;
>  
>  	*old_addr = *old_addr & mask;
> @@ -539,7 +544,7 @@ static void try_realign_addr(unsigned long *old_addr, struct vm_area_struct *old
>  unsigned long move_page_tables(struct vm_area_struct *vma,
>  		unsigned long old_addr, struct vm_area_struct *new_vma,
>  		unsigned long new_addr, unsigned long len,
> -		bool need_rmap_locks)
> +		bool need_rmap_locks, bool for_stack)
>  {
>  	unsigned long extent, old_end;
>  	struct mmu_notifier_range range;
> @@ -559,9 +564,9 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>  	 * If possible, realign addresses to PMD boundary for faster copy.
>  	 * Only realign if the mremap copying hits a PMD boundary.
>  	 */
> -	if ((vma != new_vma)
> -		&& (len >= PMD_SIZE - (old_addr & ~PMD_MASK)))
> -		try_realign_addr(&old_addr, vma, &new_addr, new_vma, PMD_MASK);
> +	if (len >= PMD_SIZE - (old_addr & ~PMD_MASK))
> +		try_realign_addr(&old_addr, vma, &new_addr, new_vma, PMD_MASK,
> +				 for_stack);
>  
>  	flush_cache_range(vma, old_addr, old_end);
>  	mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma->vm_mm,
> @@ -708,7 +713,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
>  	}
>  
>  	moved_len = move_page_tables(vma, old_addr, new_vma, new_addr, old_len,
> -				     need_rmap_locks);
> +				     need_rmap_locks, false);
>  	if (moved_len < old_len) {
>  		err = -ENOMEM;
>  	} else if (vma->vm_ops && vma->vm_ops->mremap) {
> @@ -722,7 +727,7 @@ static unsigned long move_vma(struct vm_area_struct *vma,
>  		 * and then proceed to unmap new area instead of old.
>  		 */
>  		move_page_tables(new_vma, new_addr, vma, old_addr, moved_len,
> -				 true);
> +				 true, false);
>  		vma = new_vma;
>  		old_len = new_len;
>  		old_addr = new_addr;
> -- 
> 2.42.0.283.g2d96d420d3-goog

-- 
Michal Hocko
SUSE Labs
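
To make the for_stack decision in can_align_down() easier to follow, here
is a small userspace model of the logic in the quoted diff. It is only a
sketch: struct sketch_vma, sketch_range_mapped() and
sketch_can_align_down() are hypothetical names, and a plain array of
ranges stands in for the mm's VMA tree and find_vma_intersection().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a VMA: the half-open range [start, end). */
struct sketch_vma {
	uint64_t start;
	uint64_t end;
};

/* Stand-in for find_vma_intersection(): does any mapping overlap [lo, hi)? */
static bool sketch_range_mapped(const struct sketch_vma *maps, size_t n,
				uint64_t lo, uint64_t hi)
{
	if (lo >= hi)
		return false;
	for (size_t i = 0; i < n; i++)
		if (maps[i].start < hi && lo < maps[i].end)
			return true;
	return false;
}

static bool sketch_can_align_down(const struct sketch_vma *maps, size_t n,
				  const struct sketch_vma *vma, uint64_t addr,
				  uint64_t mask, bool for_stack)
{
	uint64_t addr_masked = addr & mask;

	/* Regular mremap: only a move from the very start of the VMA may align down. */
	if (!for_stack && vma->start != addr)
		return false;

	/* Stack move: an aligned address that stays inside the VMA is fine. */
	if (for_stack && addr_masked >= vma->start)
		return true;

	/* Otherwise the gap [addr_masked, vma->start) must be unmapped. */
	return !sketch_range_mapped(maps, n, addr_masked, vma->start);
}
```

For a stack VMA spanning 0x7f0000100000 to 0x7f0000500000, aligning
0x7f0000300000 down to 0x7f0000200000 stays inside the VMA, so the model
allows it with for_stack set and refuses it otherwise.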

* Re: [PATCH v6 1/7] mm/mremap: Optimize the start addresses in move_page_tables()
  2023-09-08 13:07   ` [PATCH v6 1/7] " Michal Hocko
@ 2023-09-08 13:26     ` Joel Fernandes
  0 siblings, 0 replies; 16+ messages in thread
From: Joel Fernandes @ 2023-09-08 13:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, Lorenzo Stoakes, Linus Torvalds, linux-kselftest,
	linux-mm, Shuah Khan, Vlastimil Babka, Kirill A Shutemov,
	Liam R. Howlett, Paul E. McKenney, Suren Baghdasaryan,
	Kalesh Singh, Lokesh Gidra

On Fri, Sep 8, 2023 at 9:07 AM Michal Hocko <mhocko@suse.com> wrote:
>
> Sorry for being silent most of the time in this patch series and thanks
> for pushing it forward.
>
[...]
>
> The patch looks good to me.
> Acked-by: Michal Hocko <mhocko@suse.com>

Thank you very much!!

 - Joel

* Re: [PATCH v6 0/7] Optimize mremap during mutual alignment within PMD
  2023-09-03 15:13 [PATCH v6 0/7] Optimize mremap during mutual alignment within PMD Joel Fernandes (Google)
                   ` (6 preceding siblings ...)
  2023-09-03 15:13 ` [PATCH v6 7/7] selftests: mm: Add a test for moving from an offset from start of mapping Joel Fernandes (Google)
@ 2023-09-18 15:35 ` zhenyu zhang
  2023-09-19 12:31   ` zhenyu zhang
  7 siblings, 1 reply; 16+ messages in thread
From: zhenyu zhang @ 2023-09-18 15:35 UTC (permalink / raw)
  To: joel
  Cc: linux-kernel, linux-kselftest, linux-mm, Shuah Khan,
	Vlastimil Babka, Michal Hocko, Linus Torvalds, Lorenzo Stoakes,
	Kirill A Shutemov, Liam R. Howlett, Paul E. McKenney,
	Suren Baghdasaryan, Kalesh Singh, Lokesh Gidra, gshan, david

With a 4k guest on a 64k host, on aarch64 (Ampere Altra Max CPU), I hit a call trace:
    Steps:
    1) System setup hugepages on host.
       # echo 50 > /proc/sys/vm/nr_hugepages
    2) Mount this hugepage to /mnt/kvm_hugepage.
       # mount -t hugetlbfs -o pagesize=524288K none /mnt/kvm_hugepage
    3) HugePages didn't leak when using non-existent mem-path.
       # cd /home/kar/workspace/avocado-vt/virttest; mkdir -p /mnt/tmp
    4) Run memory heavy stress inside guest.
       # /usr/libexec/qemu-kvm \
         ...
         -m 25600 \
         -object '{"size": 26843545600, "mem-path": "/mnt/tmp", "id": "mem-machine_mem", "qom-type": "memory-backend-file"}'  \
         -smp 60,maxcpus=60,cores=30,threads=1,clusters=1,sockets=2  \
       Log in to the guest:
       # nohup stress --vm 50 --vm-bytes 256M --timeout 30s > /dev/null &
       ------> hit call trace

On guest kernel:
2023-09-18 07:54:03: [   76.592706] CPU: 23 PID: 254 Comm:
kworker/23:1 Kdump: loaded Not tainted 6.6.0-rc2-zhenyzha_4k+ #3
2023-09-18 07:54:03: [   76.593782] Hardware name: QEMU KVM Virtual
Machine, BIOS edk2-20230524-3.el9 05/24/2023
2023-09-18 07:54:03: [   76.594641] Workqueue: rcu_gp wait_rcu_exp_gp
2023-09-18 07:54:03: [   76.595248] pstate: 80400005 (Nzcv daif +PAN
-UAO -TCO -DIT -SSBS BTYPE=--)
2023-09-18 07:54:03: [   76.596025] pc : smp_call_function_single+0xe4/0x1e8
2023-09-18 07:54:03: [   76.596833] lr :
__sync_rcu_exp_select_node_cpus+0x27c/0x428
2023-09-18 07:54:03: [   76.597534] sp : ffff800084a0bc60
2023-09-18 07:54:03: [   76.598078] x29: ffff800084a0bc60 x28:
ffff0003fdad9440 x27: 0000000000000001
2023-09-18 07:54:03: [   76.598874] x26: ffff800081a541b0 x25:
ffff800081e0af40 x24: ffff0000c425ed80
2023-09-18 07:54:03: [   76.599817] x23: 0000000000000004 x22:
ffff800081532fa0 x21: 0000000000000ffe
2023-09-18 07:54:03: [   76.600621] x20: ffff800081537440 x19:
ffff800084a0bca0 x18: 0000000000000001
2023-09-18 07:54:03: [   76.601420] x17: 0000000000000000 x16:
ffff800080f352e8 x15: 0000ffff97d02fff
2023-09-18 07:54:03: [   76.602212] x14: 0000000000000000 x13:
0000000000000030 x12: 0101010101010101
2023-09-18 07:54:03: [   76.603158] x11: ffff800081532fa0 x10:
0000000000000001 x9 : ffff80008014c714
2023-09-18 07:54:03: [   76.603963] x8 : ffff800081e03130 x7 :
ffff800081521008 x6 : ffff80008014e070
2023-09-18 07:54:03: [   76.604759] x5 : 0000000000000000 x4 :
ffff0003fda34c88 x3 : 0000000000000001
2023-09-18 07:54:03: [   76.605703] x2 : 0000000000000000 x1 :
ffff0003fda34c80 x0 : 000000000000001c
2023-09-18 07:54:03: [   76.606507] Call trace:
2023-09-18 07:54:03: [   76.606990]  smp_call_function_single+0xe4/0x1e8
2023-09-18 07:54:03: [   76.607617]  __sync_rcu_exp_select_node_cpus+0x27c/0x428
2023-09-18 07:54:03: [   76.608290]  sync_rcu_exp_select_cpus+0x164/0x2e0
2023-09-18 07:54:03: [   76.608963]  wait_rcu_exp_gp+0x1c/0x38
2023-09-18 07:54:03: [   76.609563]  process_one_work+0x174/0x3c8
2023-09-18 07:54:03: [   76.610181]  worker_thread+0x2c8/0x3e0
2023-09-18 07:54:03: [   76.610776]  kthread+0x100/0x110
2023-09-18 07:54:03: [   76.611330]  ret_from_fork+0x10/0x20
2023-09-18 07:54:15: [   88.396191] rcu: INFO: rcu_preempt detected
stalls on CPUs/tasks:
2023-09-18 07:54:15: [   88.397195] rcu: 11-...0: (18 ticks this GP)
idle=79ec/1/0x4000000000000000 softirq=577/579 fqs=1215
2023-09-18 07:54:15: [   88.398244] rcu: 25-...0: (1 GPs behind)
idle=599c/1/0x4000000000000000 softirq=300/301 fqs=1215
2023-09-18 07:54:15: [   88.399254] rcu: 33-...0: (36 ticks this GP)
idle=e454/1/0x4000000000000000 softirq=717/719 fqs=1216
2023-09-18 07:54:15: [   88.400275] rcu: (detected by 19, t=6006
jiffies, g=1173, q=61327 ncpus=38)
2023-09-18 07:54:15: [   88.401135] Task dump for CPU 11:
2023-09-18 07:54:15: [   88.401711] task:stress          state:R
running task     stack:0     pid:3182  ppid:3178   flags:0x00000202
2023-09-18 07:54:15: [   88.402794] Call trace:
2023-09-18 07:54:15: [   88.403312]  __switch_to+0xc8/0x110
2023-09-18 07:54:15: [   88.403915]  do_page_fault+0x198/0x4e0
2023-09-18 07:54:15: [   88.404533]  do_translation_fault+0x38/0x68
2023-09-18 07:54:15: [   88.405169]  do_mem_abort+0x48/0xa0
2023-09-18 07:54:15: [   88.405771]  el0_da+0x4c/0x180
2023-09-18 07:54:15: [   88.406337]  el0t_64_sync_handler+0xdc/0x150
2023-09-18 07:54:15: [   88.406991]  el0t_64_sync+0x17c/0x180
2023-09-18 07:54:15: [   88.407601] Task dump for CPU 25:
2023-09-18 07:54:15: [   88.408182] task:stress          state:R
running task     stack:0     pid:3200  ppid:3178   flags:0x00000203
2023-09-18 07:54:15: [   88.409258] Call trace:
2023-09-18 07:54:15: [   88.409769]  __switch_to+0xc8/0x110
2023-09-18 07:54:15: [   88.410339]  0x440dc0
2023-09-18 07:54:15: [   88.410816] Task dump for CPU 33:
2023-09-18 07:54:15: [   88.411362] task:stress          state:R
running task     stack:0     pid:3191  ppid:3178   flags:0x00000203
2023-09-18 07:54:15: [   88.412403] Call trace:
2023-09-18 07:54:15: [   88.412866]  __switch_to+0xc8/0x110
2023-09-18 07:54:15: [   88.413405]  __memcg_kmem_charge_page+0x270/0x2c0
2023-09-18 07:54:15: [   88.414033]  __alloc_pages+0x100/0x278
2023-09-18 07:54:15: [   88.414585]  memcg_stock+0x0/0x58

On host kernel:
173242 Sep 18 08:57:51 virt-mtsnow-02 kernel: ------------[ cut here
]------------
173243 Sep 18 08:57:51 virt-mtsnow-02 systemd-journald[15184]: Missed
52 kernel messages
173244 Sep 18 08:57:51 virt-mtsnow-02 kernel: do_cow_fault+0xf0/0x300
173245 Sep 18 08:57:51 virt-mtsnow-02 systemd-journald[15184]: Missed
162 kernel messages
173246 Sep 18 08:57:51 virt-mtsnow-02 kernel: CPU: 14 PID: 11294 Comm:
qemu-kvm Tainted: G        W          6.6.0-rc2-zhenyzha-64k+ #1
173247 Sep 18 08:57:51 virt-mtsnow-02 systemd-journald[15184]: Missed
226 kernel messages
173248 Sep 18 08:57:51 virt-mtsnow-02 kernel: x21: 0000000000000000
173249 Sep 18 08:57:51 virt-mtsnow-02 systemd-journald[15184]: Missed
120 kernel messages
173250 Sep 18 08:57:51 virt-mtsnow-02 kernel: __do_fault+0x40/0x210
173251 Sep 18 08:57:51 virt-mtsnow-02 systemd-journald[15184]: Missed
39 kernel messages
173252 Sep 18 08:57:51 virt-mtsnow-02 kernel: do_el0_svc+0xb4/0xd0
173253 Sep 18 08:57:51 virt-mtsnow-02 systemd-journald[15184]: Missed
325 kernel messages
173254 Sep 18 08:57:51 virt-mtsnow-02 kernel: get_user_pages_unlocked+0xc4/0x3b8
173255 Sep 18 08:57:51 virt-mtsnow-02 systemd-journald[15184]: Missed
255 kernel messages
173256 Sep 18 08:57:51 virt-mtsnow-02 kernel: pci_hyperv_intf
i2c_designware_core
173257 Sep 18 08:57:51 virt-mtsnow-02 systemd-journald[15184]: Missed
87 kernel messages
173258 Sep 18 08:57:51 virt-mtsnow-02 kernel: xfs_filemap_fault+0x54/0x68 [xfs]
173259 Sep 18 08:57:51 virt-mtsnow-02 systemd-journald[15184]: Missed
248 kernel messages
173260 Sep 18 08:57:51 virt-mtsnow-02 kernel: pci_hyperv_intf
173261 Sep 18 08:57:51 virt-mtsnow-02 systemd-journald[15184]: Missed
69 kernel messages
173262 Sep 18 08:57:51 virt-mtsnow-02 kernel: Hardware name: GIGABYTE
R152-P31-00/MP32-AR1-00, BIOS F18v (SCP: 1.08.20211002) 12/01/2021
173263 Sep 18 08:57:51 virt-mtsnow-02 systemd-journald[15184]: Missed
297 kernel messages
173264 Sep 18 08:57:51 virt-mtsnow-02 kernel: __filemap_add_folio+0x33c/0x4e0
173265 Sep 18 08:57:51 virt-mtsnow-02 systemd-journald[15184]: Missed
12 kernel messages
173266 Sep 18 08:57:51 virt-mtsnow-02 kernel: x26: 0000000000000001
173267 Sep 18 08:57:51 virt-mtsnow-02 systemd-journald[15184]: Missed
74 kernel messages

[ 5456.588346] ------------[ cut here ]------------
[ 5456.588358]  x10: 000000000000000a
[ 5456.588365]  dm_mod
[ 5456.588372]  nft_compat
[ 5456.588374] Hardware name: GIGABYTE R152-P31-00/MP32-AR1-00, BIOS
F18v (SCP: 1.08.20211002) 12/01/2021
[ 5456.588417]  fat
[ 5456.588421]  x16: 000000009872d4d0
[ 5456.588430]  ipmi_msghandler arm_cmn
[ 5456.588439]  x10: 000000000000000a
[ 5456.588414]  __xfs_filemap_fault+0x60/0x3c0 [xfs]
[ 5456.588454] x5 : 0000000000000028
[ 5456.588460]  nvme_core
[ 5456.588474]  pci_hyperv_intf
[ 5456.588482] ------------[ cut here ]------------
[ 5456.588488]  page_cache_async_ra+0x64/0xa8
[ 5456.588491]  filemap_fault+0x238/0xaa8
[ 5456.588506]  nf_defrag_ipv4 nf_tables
[ 5456.588514]  nfs_acl
[ 5456.588518]  x22: ffffffc202880000
[ 5456.588525]  netfs
[ 5456.588527]  stp

[ 5456.588539]  acpi_ipmi
[ 5456.588546]  x10: 000000000000000a
[ 5456.588554]  x7 : ffff07ffa0a67210
[ 5456.588562]  get_user_pages_unlocked+0xc4/0x3b8
[ 5456.588567]  __gfn_to_pfn_memslot+0xa4/0xf8
[ 5456.588575]  xas_split_alloc+0xf8/0x128
[ 5456.588581]  sha1_ce
[ 5456.588588]  i2c_algo_bit
[ 5456.588592]  page_cache_async_ra+0x64/0xa8


With gshan@redhat.com's patch "KVM: arm64: Fix soft-lockup on relaxing
PTE permission" applied, I still hit a call trace:
2023-09-18 10:56:20: [   57.494201] watchdog: BUG: soft lockup -
CPU#58 stuck for 22s! [gsd-power:4858]
2023-09-18 10:56:20: [   57.495674] Modules linked in: nft_fib_inet
nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4
nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack
nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set nf_tables nfnetlink qrtr
vfat fat fuse xfs libcrc32c virtio_gpu virtio_dma_buf drm_shmem_helper
nvme_tcp drm_kms_helper nvme_fabrics nvme_core nvme_common sg drm
crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce virtio_net
net_failover virtio_scsi failover virtio_mmio dm_multipath dm_mirror
dm_region_hash dm_log dm_mod be2iscsi cxgb4i cxgb4 tls libcxgbi
libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi
scsi_transport_iscsi
2023-09-18 10:56:20: [   57.501871] CPU: 58 PID: 4858 Comm: gsd-power
Kdump: loaded Not tainted 6.6.0-rc2-zhenyzha_4k+ #3
2023-09-18 10:56:20: [   57.502719] Hardware name: QEMU KVM Virtual
Machine, BIOS edk2-20230524-3.el9 05/24/2023
2023-09-18 10:56:20: [   57.503540] pstate: 20400005 (nzCv daif +PAN
-UAO -TCO -DIT -SSBS BTYPE=--)
2023-09-18 10:56:20: [   57.504612] pc : smp_call_function_many_cond+0x16c/0x618
2023-09-18 10:56:20: [   57.505425] lr : smp_call_function_many_cond+0x188/0x618
2023-09-18 10:56:20: [   57.505974] sp : ffff8000870f38f0
2023-09-18 10:56:20: [   57.506370] x29: ffff8000870f38f0 x28:
000000000000003c x27: ffff00063c5dcaa0
2023-09-18 10:56:20: [   57.507041] x26: 000000000000003c x25:
000000000000003b x24: ffff00063c5b6848
2023-09-18 10:56:20: [   57.507812] x23: 0000000000000000 x22:
ffff00063c5b6848 x21: ffff800081a541b0
2023-09-18 10:56:20: [   57.508513] x20: ffff00063c5b6840 x19:
ffff800081a4f840 x18: 0000000000000014
2023-09-18 10:56:20: [   57.509247] x17: 00000000fd875552 x16:
0000000044ca0210 x15: 000000005df1120b
2023-09-18 10:56:20: [   57.509947] x14: 00000000ac15cb21 x13:
00000000b7ff1817 x12: 0000000006d3918c
2023-09-18 10:56:20: [   57.510645] x11: 00000000ba65fdab x10:
00000000f60c2b88 x9 : ffff80008061a9dc
2023-09-18 10:56:20: [   57.511264] x8 : ffff00063c5b6a50 x7 :
0000000000000000 x6 : 0000000001000000
2023-09-18 10:56:20: [   57.511817] x5 : 000000000000003c x4 :
0000000000000007 x3 : ffff00063bf28aa8
2023-09-18 10:56:20: [   57.512415] x2 : 0000000000000000 x1 :
0000000000000011 x0 : 0000000000000007
2023-09-18 10:56:20: [   57.513092] Call trace:
2023-09-18 10:56:20: [   57.515105]  smp_call_function_many_cond+0x16c/0x618
2023-09-18 10:56:20: [   57.515684]  kick_all_cpus_sync+0x48/0x80
2023-09-18 10:56:20: [   57.516039]  flush_icache_range+0x40/0x60
2023-09-18 10:56:20: [   57.516413]  bpf_int_jit_compile+0x1ac/0x5f8
2023-09-18 10:56:20: [   57.516821]  bpf_prog_select_runtime+0xd4/0x110
2023-09-18 10:56:20: [   57.517279]  bpf_prepare_filter+0x1e8/0x220
2023-09-18 10:56:20: [   57.517727]  __get_filter+0xdc/0x180
2023-09-18 10:56:20: [   57.518231]  sk_attach_filter+0x1c/0xb0
2023-09-18 10:56:20: [   57.518605]  sk_setsockopt+0x9dc/0x1230
2023-09-18 10:56:20: [   57.518909]  sock_setsockopt+0x18/0x28
2023-09-18 10:56:20: [   57.519177]  __sys_setsockopt+0x164/0x190
2023-09-18 10:56:20: [   57.519501]  __arm64_sys_setsockopt+0x2c/0x40
2023-09-18 10:56:20: [   57.519911]  invoke_syscall.constprop.0+0x7c/0xd0
2023-09-18 10:56:20: [   57.520345]  do_el0_svc+0xb4/0xd0
2023-09-18 10:56:20: [   57.520670]  el0_svc+0x50/0x228
2023-09-18 10:56:20: [   57.521331]  el0t_64_sync_handler+0x134/0x150
2023-09-18 10:56:20: [   57.521758]  el0t_64_sync+0x17c/0x180
2023-09-18 10:56:23: [   60.724199] watchdog: BUG: soft lockup -
CPU#28 stuck for 26s! [(fwupd):5108]

[ 6253.928601] CPU: 64 PID: 18885 Comm: qemu-kvm Kdump: loaded
Tainted: G        W          6.6.0-rc1-zhenyzha_64k+ #2
[ 6253.939021] Hardware name: GIGABYTE R152-P31-00/MP32-AR1-00, BIOS
F31n (SCP: 2.10.20220810) 09/30/2022
[ 6253.948312] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 6253.955262] pc : xas_split_alloc+0xf8/0x128
[ 6253.959432] lr : __filemap_add_folio+0x33c/0x4e0
[ 6253.964037] sp : ffff80008b10f210
[ 6253.967338] x29: ffff80008b10f210 x28: ffffba8c43708c00 x27: 0000000000000001
[ 6253.974461] x26: 0000000000000001 x25: ffffffffffffc005 x24: 0000000000000000
[ 6253.981583] x23: ffff80008b10f2c0 x22: 00000a36da000101 x21: 0000000000000000
[ 6253.988706] x20: ffffffc203be2a00 x19: 000000000000000d x18: 0000000000000014
[ 6253.995828] x17: 00000000be237f61 x16: 000000001baa68cc x15: ffffba8c429a5944
[ 6254.002950] x14: ffffba8c429b57bc x13: ffffba8c429a5944 x12: ffffba8c429b57bc
[ 6254.010073] x11: ffffba8c4297160c x10: ffffba8c4365d414 x9 : ffffba8c4365857c
[ 6254.017195] x8 : ffff80008b10f210 x7 : ffff07ffa1304900 x6 : ffff80008b10f210
[ 6254.024317] x5 : 000000000000000e x4 : 0000000000000000 x3 : 0000000000012c40
[ 6254.031439] x2 : 000000000000000d x1 : 000000000000000c x0 : 0000000000000000
[ 6254.038562] Call trace:
[ 6254.040995]  xas_split_alloc+0xf8/0x128
[ 6254.044818]  __filemap_add_folio+0x33c/0x4e0
[ 6254.049076]  filemap_add_folio+0x48/0xd0
[ 6254.052986]  page_cache_ra_unbounded+0xf0/0x1f0
[ 6254.057504]  page_cache_ra_order+0x8c/0x310
[ 6254.061675]  filemap_fault+0x67c/0xaa8
[ 6254.065412]  __xfs_filemap_fault+0x60/0x3c0 [xfs]
[ 6254.070163]  xfs_filemap_fault+0x54/0x68 [xfs]
[ 6254.074651]  __do_fault+0x40/0x210
[ 6254.078040]  do_cow_fault+0xf0/0x300
[ 6254.081602]  do_pte_missing+0x140/0x238
[ 6254.085426]  handle_pte_fault+0x100/0x160
[ 6254.089423]  __handle_mm_fault+0x100/0x310
[ 6254.093506]  handle_mm_fault+0x6c/0x270
[ 6254.097330]  faultin_page+0x70/0x128
[ 6254.100893]  __get_user_pages+0xc8/0x2d8
[ 6254.104802]  get_user_pages_unlocked+0xc4/0x3b8
[ 6254.109320]  hva_to_pfn+0xf8/0x468
[ 6254.112709]  __gfn_to_pfn_memslot+0xa4/0xf8
[ 6254.116879]  user_mem_abort+0x174/0x7e8
[ 6254.120702]  kvm_handle_guest_abort+0x2dc/0x450
[ 6254.125220]  handle_exit+0x70/0x1c8
[ 6254.128696]  kvm_arch_vcpu_ioctl_run+0x224/0x5b8
[ 6254.133300]  kvm_vcpu_ioctl+0x28c/0x9d0
[ 6254.137123]  __arm64_sys_ioctl+0xa8/0xf0
[ 6254.141033]  invoke_syscall.constprop.0+0x7c/0xd0
[ 6254.145725]  do_el0_svc+0xb4/0xd0
[ 6254.149028]  el0_svc+0x50/0x228
[ 6254.152157]  el0t_64_sync_handler+0x134/0x150
[ 6254.156501]  el0t_64_sync+0x17c/0x180
[ 6254.160151] ---[ end trace 0000000000000000 ]---
[ 6254.164766] ------------[ cut here ]------------
[ 6254.169370] WARNING: CPU: 64 PID: 18885 at lib/xarray.c:1010
xas_split_alloc+0xf8/0x128
[ 6254.177361] Modules linked in: loop isofs cdrom vhost_net vhost
vhost_iotlb tap tun bluetooth tls nfsv3 rpcsec_gss_krb5 nfsv4
dns_resolver nfs fscache netfs rpcrdma rdma_cm iw_cm ib_cm ib_core
xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4
nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6
nf_defrag_ipv4 nf_tables nfnetlink bridge stp llc rfkill vfat fat ast
drm_shmem_helper drm_kms_helper acpi_ipmi ipmi_ssif arm_spe_pmu
ipmi_devintf ipmi_msghandler arm_cmn arm_dmc620_pmu arm_dsu_pmu
cppc_cpufreq drm fuse nfsd auth_rpcgss nfs_acl lockd grace sunrpc xfs
libcrc32c crct10dif_ce ghash_ce igb sha2_ce sha256_arm64 sha1_ce
sbsa_gwdt i2c_designware_platform i2c_algo_bit i2c_designware_core
xgene_hwmon sg dm_mirror dm_region_hash dm_log dm_mod


Tested-by: Zhenyu Zhang <zhenyzha@redhat.com>

On Mon, Sep 4, 2023 at 1:20 AM Joel Fernandes (Google)
<joel@joelfernandes.org> wrote:
>
> Hello!
>
> Here is v6 of the mremap start address optimization / fix for exec warning.
> Should be hopefully final now and only 2/7 and 6/7 need a tag. Thanks a lot to
> Lorenzo and Linus for the detailed reviews.
>
> Description of patches
> ======================
> These patches optimizes the start addresses in move_page_tables() and tests the
> changes. It addresses a warning [1] that occurs due to a downward, overlapping
> move on a mutually-aligned offset within a PMD during exec. By initiating the
> copy process at the PMD level when such alignment is present, we can prevent
> this warning and speed up the copying process at the same time. Linus Torvalds
> suggested this idea. Check the individual patches for more details.
> [1] https://lore.kernel.org/all/ZB2GTBD%2FLWTrkOiO@dhcp22.suse.cz/
>
> History of patches:
> v5->v6:
> 1. Reworking the stack case a bit more and tested it (should be final now).
> 2. Other small nits.
>
> v4->v5:
> 1. Rebased on mainline.
> 2. Several improvement suggestions from Lorenzo.
>
> v3->v4:
> 1. Care to be taken to move purely within a VMA, in other words this check
>    in call_align_down():
>     if (vma->vm_start != addr_masked)
>             return false;
>
>     As an example of why this is needed:
>     Consider the following range which is 2MB aligned and is
>     a part of a larger 10MB range which is not shown. Each
>     character is 256KB below making the source and destination
>     2MB each. The lower case letters are moved (s to d) and the
>     upper case letters are not moved.
>
>     |DDDDddddSSSSssss|
>
>     If we align down 'ssss' to start from the 'SSSS', we will end up destroying
>     SSSS. The above if statement prevents that and I verified it.
>
>     I also added a test for this in the last patch.
>
> 2. Handle the stack case separately. We do not care about #1 for stack movement
>    because the 'SSSS' does not matter during this move. Further we need to do this
>    to prevent the stack move warning.
>
>     if (!for_stack && vma->vm_start <= addr_masked)
>             return false;
>
> v2->v3:
> 1. Masked address was stored in int, fixed it to unsigned long to avoid truncation.
> 2. We now handle moves happening purely within a VMA, a new test is added to handle this.
> 3. More code comments.
>
> v1->v2:
> 1. Trigger the optimization for mremaps smaller than a PMD. I tested by tracing
> that it works correctly.
>
> 2. Fix issue with bogus return value found by Linus if we broke out of the
> above loop for the first PMD itself.
>
> v1: Initial RFC.
>
> Joel Fernandes (1):
> selftests: mm: Add a test for moving from an offset from start of
> mapping
>
> Joel Fernandes (Google) (6):
> mm/mremap: Optimize the start addresses in move_page_tables()
> mm/mremap: Allow moves within the same VMA for stack moves
> selftests: mm: Fix failure case when new remap region was not found
> selftests: mm: Add a test for mutually aligned moves > PMD size
> selftests: mm: Add a test for remapping to area immediately after
> existing mapping
> selftests: mm: Add a test for remapping within a range
>
> fs/exec.c                                |   2 +-
> include/linux/mm.h                       |   2 +-
> mm/mremap.c                              |  73 +++++-
> tools/testing/selftests/mm/mremap_test.c | 301 +++++++++++++++++++----
> 4 files changed, 329 insertions(+), 49 deletions(-)
>
> --
> 2.42.0.283.g2d96d420d3-goog
>

* Re: [PATCH v6 0/7] Optimize mremap during mutual alignment within PMD
  2023-09-18 15:35 ` [PATCH v6 0/7] Optimize mremap during mutual alignment within PMD zhenyu zhang
@ 2023-09-19 12:31   ` zhenyu zhang
  0 siblings, 0 replies; 16+ messages in thread
From: zhenyu zhang @ 2023-09-19 12:31 UTC (permalink / raw)
  To: joel
  Cc: linux-kernel, linux-kselftest, linux-mm, Shuah Khan,
	Vlastimil Babka, Michal Hocko, Linus Torvalds, Lorenzo Stoakes,
	Kirill A Shutemov, Liam R. Howlett, Paul E. McKenney,
	Suren Baghdasaryan, Kalesh Singh, Lokesh Gidra, gshan, david

Sorry for the late update. After removing this patch, the issue still exists.
In addition, a kernel thread is spawned that keeps triggering call traces on
the host and cannot be killed, which damages the host kernel.
So this is an upstream issue and has nothing to do with this patch.
It will be discussed elsewhere later.

Behavior is as expected for this patch.

On the 64k host:
[root@virt-mtsnow-02 mm]# ./mremap_test
# Test configs:
threshold_mb=4
pattern_seed=1695026632

1..19
# mremap failed: Invalid argument
ok 1 # XFAIL mremap - Source and Destination Regions Overlapping
Expected mremap failure
# mremap failed: Invalid argument
ok 2 # XFAIL mremap - Destination Address Misaligned (1KB-aligned)
Expected mremap failure
# Failed to map source region: Invalid argument
ok 3 # XFAIL mremap - Source Address Misaligned (1KB-aligned)
Expected mremap failure
ok 4 8KB mremap - Source PTE-aligned, Destination PTE-aligned
mremap time:         5480ns
ok 5 2MB mremap - Source 1MB-aligned, Destination PTE-aligned
mremap time:         8560ns
ok 6 2MB mremap - Source 1MB-aligned, Destination 1MB-aligned
mremap time:         8721ns
ok 7 4MB mremap - Source PMD-aligned, Destination PTE-aligned
mremap time:        13240ns
ok 8 4MB mremap - Source PMD-aligned, Destination 1MB-aligned
mremap time:        13120ns
ok 9 4MB mremap - Source PMD-aligned, Destination PMD-aligned
mremap time:        13120ns
ok 10 2GB mremap - Source PUD-aligned, Destination PTE-aligned
ok 11 2GB mremap - Source PUD-aligned, Destination 1MB-aligned
ok 12 2GB mremap - Source PUD-aligned, Destination PMD-aligned
ok 13 2GB mremap - Source PUD-aligned, Destination PUD-aligned
ok 14 5MB mremap - Source 1MB-aligned, Destination 1MB-aligned
ok 15 5MB mremap - Source 1MB-aligned, Dest 1MB-aligned with 40MB Preamble
ok 16 mremap expand merge
ok 17 mremap expand merge offset
ok 18 mremap mremap move within range
ok 19 mremap move 1mb from start at 1MB+256KB aligned src
# Totals: pass:16 fail:0 xfail:3 xpass:0 skip:0 error:0



On the 4k guest / 64k host:
[root@localhost mm]# ./mremap_test
# Test configs:
threshold_mb=4
pattern_seed=1695026539

1..19
# mremap failed: Invalid argument
ok 1 # XFAIL mremap - Source and Destination Regions Overlapping
Expected mremap failure
# mremap failed: Invalid argument
ok 2 # XFAIL mremap - Destination Address Misaligned (1KB-aligned)
Expected mremap failure
# Failed to map source region: Invalid argument
ok 3 # XFAIL mremap - Source Address Misaligned (1KB-aligned)
Expected mremap failure
ok 4 8KB mremap - Source PTE-aligned, Destination PTE-aligned
mremap time:         6080ns
ok 5 2MB mremap - Source 1MB-aligned, Destination PTE-aligned
mremap time:        98800ns
ok 6 2MB mremap - Source 1MB-aligned, Destination 1MB-aligned
mremap time:        54680ns
ok 7 4MB mremap - Source PMD-aligned, Destination PTE-aligned
mremap time:       193360ns
ok 8 4MB mremap - Source PMD-aligned, Destination 1MB-aligned
mremap time:       192440ns
ok 9 4MB mremap - Source PMD-aligned, Destination PMD-aligned
mremap time:         6400ns
ok 10 2GB mremap - Source PUD-aligned, Destination PTE-aligned
ok 11 2GB mremap - Source PUD-aligned, Destination 1MB-aligned
ok 12 2GB mremap - Source PUD-aligned, Destination PMD-aligned
ok 13 2GB mremap - Source PUD-aligned, Destination PUD-aligned
ok 14 5MB mremap - Source 1MB-aligned, Destination 1MB-aligned
ok 15 5MB mremap - Source 1MB-aligned, Dest 1MB-aligned with 40MB Preamble
ok 16 mremap expand merge
ok 17 mremap expand merge offset
ok 18 mremap mremap move within range
ok 19 mremap move 1mb from start at 1MB+256KB aligned src
# Totals: pass:16 fail:0 xfail:3 xpass:0 skip:0 error:0


Tested-by: Zhenyu Zhang <zhenyzha@redhat.com>

On Mon, Sep 18, 2023 at 11:35 PM zhenyu zhang <zhenyzha12@gmail.com> wrote:
>
> With a 4k guest and a 64k host on aarch64 (Ampere Altra Max CPU), I hit a call trace:
>     Steps:
>     1) Set up hugepages on the host.
>        # echo 50 > /proc/sys/vm/nr_hugepages
>     2) Mount hugetlbfs at /mnt/kvm_hugepage.
>        # mount -t hugetlbfs -o pagesize=524288K none /mnt/kvm_hugepage
>     3) HugePages didn't leak when using non-existent mem-path.
>        # cd /home/kar/workspace/avocado-vt/virttest; mkdir -p /mnt/tmp
>     4) Run a memory-heavy stress test inside the guest.
>        # /usr/libexec/qemu-kvm \
>          ...
>          -m 25600 \
>          -object '{"size": 26843545600, "mem-path": "/mnt/tmp", "id": "mem-machine_mem", "qom-type": "memory-backend-file"}'  \
>          -smp 60,maxcpus=60,cores=30,threads=1,clusters=1,sockets=2  \
>        Log in to the guest:
>        # nohup stress --vm 50 --vm-bytes 256M --timeout 30s > /dev/null &   ------> hit call trace
>
> On guest kernel:
> 2023-09-18 07:54:03: [   76.592706] CPU: 23 PID: 254 Comm:
> kworker/23:1 Kdump: loaded Not tainted 6.6.0-rc2-zhenyzha_4k+ #3
> 2023-09-18 07:54:03: [   76.593782] Hardware name: QEMU KVM Virtual
> Machine, BIOS edk2-20230524-3.el9 05/24/2023
> 2023-09-18 07:54:03: [   76.594641] Workqueue: rcu_gp wait_rcu_exp_gp
> 2023-09-18 07:54:03: [   76.595248] pstate: 80400005 (Nzcv daif +PAN
> -UAO -TCO -DIT -SSBS BTYPE=--)
> 2023-09-18 07:54:03: [   76.596025] pc : smp_call_function_single+0xe4/0x1e8
> 2023-09-18 07:54:03: [   76.596833] lr :
> __sync_rcu_exp_select_node_cpus+0x27c/0x428
> 2023-09-18 07:54:03: [   76.597534] sp : ffff800084a0bc60
> 2023-09-18 07:54:03: [   76.598078] x29: ffff800084a0bc60 x28:
> ffff0003fdad9440 x27: 0000000000000001
> 2023-09-18 07:54:03: [   76.598874] x26: ffff800081a541b0 x25:
> ffff800081e0af40 x24: ffff0000c425ed80
> 2023-09-18 07:54:03: [   76.599817] x23: 0000000000000004 x22:
> ffff800081532fa0 x21: 0000000000000ffe
> 2023-09-18 07:54:03: [   76.600621] x20: ffff800081537440 x19:
> ffff800084a0bca0 x18: 0000000000000001
> 2023-09-18 07:54:03: [   76.601420] x17: 0000000000000000 x16:
> ffff800080f352e8 x15: 0000ffff97d02fff
> 2023-09-18 07:54:03: [   76.602212] x14: 0000000000000000 x13:
> 0000000000000030 x12: 0101010101010101
> 2023-09-18 07:54:03: [   76.603158] x11: ffff800081532fa0 x10:
> 0000000000000001 x9 : ffff80008014c714
> 2023-09-18 07:54:03: [   76.603963] x8 : ffff800081e03130 x7 :
> ffff800081521008 x6 : ffff80008014e070
> 2023-09-18 07:54:03: [   76.604759] x5 : 0000000000000000 x4 :
> ffff0003fda34c88 x3 : 0000000000000001
> 2023-09-18 07:54:03: [   76.605703] x2 : 0000000000000000 x1 :
> ffff0003fda34c80 x0 : 000000000000001c
> 2023-09-18 07:54:03: [   76.606507] Call trace:
> 2023-09-18 07:54:03: [   76.606990]  smp_call_function_single+0xe4/0x1e8
> 2023-09-18 07:54:03: [   76.607617]  __sync_rcu_exp_select_node_cpus+0x27c/0x428
> 2023-09-18 07:54:03: [   76.608290]  sync_rcu_exp_select_cpus+0x164/0x2e0
> 2023-09-18 07:54:03: [   76.608963]  wait_rcu_exp_gp+0x1c/0x38
> 2023-09-18 07:54:03: [   76.609563]  process_one_work+0x174/0x3c8
> 2023-09-18 07:54:03: [   76.610181]  worker_thread+0x2c8/0x3e0
> 2023-09-18 07:54:03: [   76.610776]  kthread+0x100/0x110
> 2023-09-18 07:54:03: [   76.611330]  ret_from_fork+0x10/0x20
> 2023-09-18 07:54:15: [   88.396191] rcu: INFO: rcu_preempt detected
> stalls on CPUs/tasks:
> 2023-09-18 07:54:15: [   88.397195] rcu: 11-...0: (18 ticks this GP)
> idle=79ec/1/0x4000000000000000 softirq=577/579 fqs=1215
> 2023-09-18 07:54:15: [   88.398244] rcu: 25-...0: (1 GPs behind)
> idle=599c/1/0x4000000000000000 softirq=300/301 fqs=1215
> 2023-09-18 07:54:15: [   88.399254] rcu: 33-...0: (36 ticks this GP)
> idle=e454/1/0x4000000000000000 softirq=717/719 fqs=1216
> 2023-09-18 07:54:15: [   88.400275] rcu: (detected by 19, t=6006
> jiffies, g=1173, q=61327 ncpus=38)
> 2023-09-18 07:54:15: [   88.401135] Task dump for CPU 11:
> 2023-09-18 07:54:15: [   88.401711] task:stress          state:R
> running task     stack:0     pid:3182  ppid:3178   flags:0x00000202
> 2023-09-18 07:54:15: [   88.402794] Call trace:
> 2023-09-18 07:54:15: [   88.403312]  __switch_to+0xc8/0x110
> 2023-09-18 07:54:15: [   88.403915]  do_page_fault+0x198/0x4e0
> 2023-09-18 07:54:15: [   88.404533]  do_translation_fault+0x38/0x68
> 2023-09-18 07:54:15: [   88.405169]  do_mem_abort+0x48/0xa0
> 2023-09-18 07:54:15: [   88.405771]  el0_da+0x4c/0x180
> 2023-09-18 07:54:15: [   88.406337]  el0t_64_sync_handler+0xdc/0x150
> 2023-09-18 07:54:15: [   88.406991]  el0t_64_sync+0x17c/0x180
> 2023-09-18 07:54:15: [   88.407601] Task dump for CPU 25:
> 2023-09-18 07:54:15: [   88.408182] task:stress          state:R
> running task     stack:0     pid:3200  ppid:3178   flags:0x00000203
> 2023-09-18 07:54:15: [   88.409258] Call trace:
> 2023-09-18 07:54:15: [   88.409769]  __switch_to+0xc8/0x110
> 2023-09-18 07:54:15: [   88.410339]  0x440dc0
> 2023-09-18 07:54:15: [   88.410816] Task dump for CPU 33:
> 2023-09-18 07:54:15: [   88.411362] task:stress          state:R
> running task     stack:0     pid:3191  ppid:3178   flags:0x00000203
> 2023-09-18 07:54:15: [   88.412403] Call trace:
> 2023-09-18 07:54:15: [   88.412866]  __switch_to+0xc8/0x110
> 2023-09-18 07:54:15: [   88.413405]  __memcg_kmem_charge_page+0x270/0x2c0
> 2023-09-18 07:54:15: [   88.414033]  __alloc_pages+0x100/0x278
> 2023-09-18 07:54:15: [   88.414585]  memcg_stock+0x0/0x58
>
> On host kernel:
> 173242 Sep 18 08:57:51 virt-mtsnow-02 kernel: ------------[ cut here
> ]------------
> 173243 Sep 18 08:57:51 virt-mtsnow-02 systemd-journald[15184]: Missed
> 52 kernel messages
> 173244 Sep 18 08:57:51 virt-mtsnow-02 kernel: do_cow_fault+0xf0/0x300
> 173245 Sep 18 08:57:51 virt-mtsnow-02 systemd-journald[15184]: Missed
> 162 kernel messages
> 173246 Sep 18 08:57:51 virt-mtsnow-02 kernel: CPU: 14 PID: 11294 Comm:
> qemu-kvm Tainted: G        W          6.6.0-rc2-zhenyzha-64k+ #1
> 173247 Sep 18 08:57:51 virt-mtsnow-02 systemd-journald[15184]: Missed
> 226 kernel messages
> 173248 Sep 18 08:57:51 virt-mtsnow-02 kernel: x21: 0000000000000000
> 173249 Sep 18 08:57:51 virt-mtsnow-02 systemd-journald[15184]: Missed
> 120 kernel messages
> 173250 Sep 18 08:57:51 virt-mtsnow-02 kernel: __do_fault+0x40/0x210
> 173251 Sep 18 08:57:51 virt-mtsnow-02 systemd-journald[15184]: Missed
> 39 kernel messages
> 173252 Sep 18 08:57:51 virt-mtsnow-02 kernel: do_el0_svc+0xb4/0xd0
> 173253 Sep 18 08:57:51 virt-mtsnow-02 systemd-journald[15184]: Missed
> 325 kernel messages
> 173254 Sep 18 08:57:51 virt-mtsnow-02 kernel: get_user_pages_unlocked+0xc4/0x3b8
> 173255 Sep 18 08:57:51 virt-mtsnow-02 systemd-journald[15184]: Missed
> 255 kernel messages
> 173256 Sep 18 08:57:51 virt-mtsnow-02 kernel: pci_hyperv_intf
> i2c_designware_core
> 173257 Sep 18 08:57:51 virt-mtsnow-02 systemd-journald[15184]: Missed
> 87 kernel messages
> 173258 Sep 18 08:57:51 virt-mtsnow-02 kernel: xfs_filemap_fault+0x54/0x68 [xfs]
> 173259 Sep 18 08:57:51 virt-mtsnow-02 systemd-journald[15184]: Missed
> 248 kernel messages
> 173260 Sep 18 08:57:51 virt-mtsnow-02 kernel: pci_hyperv_intf
> 173261 Sep 18 08:57:51 virt-mtsnow-02 systemd-journald[15184]: Missed
> 69 kernel messages
> 173262 Sep 18 08:57:51 virt-mtsnow-02 kernel: Hardware name: GIGABYTE
> R152-P31-00/MP32-AR1-00, BIOS F18v (SCP: 1.08.20211002) 12/01/2021
> 173263 Sep 18 08:57:51 virt-mtsnow-02 systemd-journald[15184]: Missed
> 297 kernel messages
> 173264 Sep 18 08:57:51 virt-mtsnow-02 kernel: __filemap_add_folio+0x33c/0x4e0
> 173265 Sep 18 08:57:51 virt-mtsnow-02 systemd-journald[15184]: Missed
> 12 kernel messages
> 173266 Sep 18 08:57:51 virt-mtsnow-02 kernel: x26: 0000000000000001
> 173267 Sep 18 08:57:51 virt-mtsnow-02 systemd-journald[15184]: Missed
> 74 kernel messages
>
> [ 5456.588346] ------------[ cut here ]------------
> [ 5456.588358]  x10: 000000000000000a
> [ 5456.588365]  dm_mod
> [ 5456.588372]  nft_compat
> [ 5456.588374] Hardware name: GIGABYTE R152-P31-00/MP32-AR1-00, BIOS
> F18v (SCP: 1.08.20211002) 12/01/2021
> [ 5456.588417]  fat
> [ 5456.588421]  x16: 000000009872d4d0
> [ 5456.588430]  ipmi_msghandler arm_cmn
> [ 5456.588439]  x10: 000000000000000a
> [ 5456.588414]  __xfs_filemap_fault+0x60/0x3c0 [xfs]
> [ 5456.588454] x5 : 0000000000000028
> [ 5456.588460]  nvme_core
> [ 5456.588474]  pci_hyperv_intf
> [ 5456.588482] ------------[ cut here ]------------
> [ 5456.588488]  page_cache_async_ra+0x64/0xa8
> [ 5456.588491]  filemap_fault+0x238/0xaa8
> [ 5456.588506]  nf_defrag_ipv4 nf_tables
> [ 5456.588514]  nfs_acl
> [ 5456.588518]  x22: ffffffc202880000
> [ 5456.588525]  netfs
> [ 5456.588527]  stp
>
> [ 5456.588539]  acpi_ipmi
> [ 5456.588546]  x10: 000000000000000a
> [ 5456.588554]  x7 : ffff07ffa0a67210
> [ 5456.588562]  get_user_pages_unlocked+0xc4/0x3b8
> [ 5456.588567]  __gfn_to_pfn_memslot+0xa4/0xf8
> [ 5456.588575]  xas_split_alloc+0xf8/0x128
> [ 5456.588581]  sha1_ce
> [ 5456.588588]  i2c_algo_bit
> [ 5456.588592]  page_cache_async_ra+0x64/0xa8
>
>
> With gshan@redhat.com's patch "KVM: arm64: Fix soft-lockup on relaxing PTE
> permission" applied, I still hit a call trace:
> 2023-09-18 10:56:20: [   57.494201] watchdog: BUG: soft lockup -
> CPU#58 stuck for 22s! [gsd-power:4858]
> 2023-09-18 10:56:20: [   57.495674] Modules linked in: nft_fib_inet
> nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4
> nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack
> nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set nf_tables nfnetlink qrtr
> vfat fat fuse xfs libcrc32c virtio_gpu virtio_dma_buf drm_shmem_helper
> nvme_tcp drm_kms_helper nvme_fabrics nvme_core nvme_common sg drm
> crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce virtio_net
> net_failover virtio_scsi failover virtio_mmio dm_multipath dm_mirror
> dm_region_hash dm_log dm_mod be2iscsi cxgb4i cxgb4 tls libcxgbi
> libcxgb qla4xxx iscsi_boot_sysfs iscsi_tcp libiscsi_tcp libiscsi
> scsi_transport_iscsi
> 2023-09-18 10:56:20: [   57.501871] CPU: 58 PID: 4858 Comm: gsd-power
> Kdump: loaded Not tainted 6.6.0-rc2-zhenyzha_4k+ #3
> 2023-09-18 10:56:20: [   57.502719] Hardware name: QEMU KVM Virtual
> Machine, BIOS edk2-20230524-3.el9 05/24/2023
> 2023-09-18 10:56:20: [   57.503540] pstate: 20400005 (nzCv daif +PAN
> -UAO -TCO -DIT -SSBS BTYPE=--)
> 2023-09-18 10:56:20: [   57.504612] pc : smp_call_function_many_cond+0x16c/0x618
> 2023-09-18 10:56:20: [   57.505425] lr : smp_call_function_many_cond+0x188/0x618
> 2023-09-18 10:56:20: [   57.505974] sp : ffff8000870f38f0
> 2023-09-18 10:56:20: [   57.506370] x29: ffff8000870f38f0 x28:
> 000000000000003c x27: ffff00063c5dcaa0
> 2023-09-18 10:56:20: [   57.507041] x26: 000000000000003c x25:
> 000000000000003b x24: ffff00063c5b6848
> 2023-09-18 10:56:20: [   57.507812] x23: 0000000000000000 x22:
> ffff00063c5b6848 x21: ffff800081a541b0
> 2023-09-18 10:56:20: [   57.508513] x20: ffff00063c5b6840 x19:
> ffff800081a4f840 x18: 0000000000000014
> 2023-09-18 10:56:20: [   57.509247] x17: 00000000fd875552 x16:
> 0000000044ca0210 x15: 000000005df1120b
> 2023-09-18 10:56:20: [   57.509947] x14: 00000000ac15cb21 x13:
> 00000000b7ff1817 x12: 0000000006d3918c
> 2023-09-18 10:56:20: [   57.510645] x11: 00000000ba65fdab x10:
> 00000000f60c2b88 x9 : ffff80008061a9dc
> 2023-09-18 10:56:20: [   57.511264] x8 : ffff00063c5b6a50 x7 :
> 0000000000000000 x6 : 0000000001000000
> 2023-09-18 10:56:20: [   57.511817] x5 : 000000000000003c x4 :
> 0000000000000007 x3 : ffff00063bf28aa8
> 2023-09-18 10:56:20: [   57.512415] x2 : 0000000000000000 x1 :
> 0000000000000011 x0 : 0000000000000007
> 2023-09-18 10:56:20: [   57.513092] Call trace:
> 2023-09-18 10:56:20: [   57.515105]  smp_call_function_many_cond+0x16c/0x618
> 2023-09-18 10:56:20: [   57.515684]  kick_all_cpus_sync+0x48/0x80
> 2023-09-18 10:56:20: [   57.516039]  flush_icache_range+0x40/0x60
> 2023-09-18 10:56:20: [   57.516413]  bpf_int_jit_compile+0x1ac/0x5f8
> 2023-09-18 10:56:20: [   57.516821]  bpf_prog_select_runtime+0xd4/0x110
> 2023-09-18 10:56:20: [   57.517279]  bpf_prepare_filter+0x1e8/0x220
> 2023-09-18 10:56:20: [   57.517727]  __get_filter+0xdc/0x180
> 2023-09-18 10:56:20: [   57.518231]  sk_attach_filter+0x1c/0xb0
> 2023-09-18 10:56:20: [   57.518605]  sk_setsockopt+0x9dc/0x1230
> 2023-09-18 10:56:20: [   57.518909]  sock_setsockopt+0x18/0x28
> 2023-09-18 10:56:20: [   57.519177]  __sys_setsockopt+0x164/0x190
> 2023-09-18 10:56:20: [   57.519501]  __arm64_sys_setsockopt+0x2c/0x40
> 2023-09-18 10:56:20: [   57.519911]  invoke_syscall.constprop.0+0x7c/0xd0
> 2023-09-18 10:56:20: [   57.520345]  do_el0_svc+0xb4/0xd0
> 2023-09-18 10:56:20: [   57.520670]  el0_svc+0x50/0x228
> 2023-09-18 10:56:20: [   57.521331]  el0t_64_sync_handler+0x134/0x150
> 2023-09-18 10:56:20: [   57.521758]  el0t_64_sync+0x17c/0x180
> 2023-09-18 10:56:23: [   60.724199] watchdog: BUG: soft lockup -
> CPU#28 stuck for 26s! [(fwupd):5108]
>
> [ 6253.928601] CPU: 64 PID: 18885 Comm: qemu-kvm Kdump: loaded
> Tainted: G        W          6.6.0-rc1-zhenyzha_64k+ #2
> [ 6253.939021] Hardware name: GIGABYTE R152-P31-00/MP32-AR1-00, BIOS
> F31n (SCP: 2.10.20220810) 09/30/2022
> [ 6253.948312] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [ 6253.955262] pc : xas_split_alloc+0xf8/0x128
> [ 6253.959432] lr : __filemap_add_folio+0x33c/0x4e0
> [ 6253.964037] sp : ffff80008b10f210
> [ 6253.967338] x29: ffff80008b10f210 x28: ffffba8c43708c00 x27: 0000000000000001
> [ 6253.974461] x26: 0000000000000001 x25: ffffffffffffc005 x24: 0000000000000000
> [ 6253.981583] x23: ffff80008b10f2c0 x22: 00000a36da000101 x21: 0000000000000000
> [ 6253.988706] x20: ffffffc203be2a00 x19: 000000000000000d x18: 0000000000000014
> [ 6253.995828] x17: 00000000be237f61 x16: 000000001baa68cc x15: ffffba8c429a5944
> [ 6254.002950] x14: ffffba8c429b57bc x13: ffffba8c429a5944 x12: ffffba8c429b57bc
> [ 6254.010073] x11: ffffba8c4297160c x10: ffffba8c4365d414 x9 : ffffba8c4365857c
> [ 6254.017195] x8 : ffff80008b10f210 x7 : ffff07ffa1304900 x6 : ffff80008b10f210
> [ 6254.024317] x5 : 000000000000000e x4 : 0000000000000000 x3 : 0000000000012c40
> [ 6254.031439] x2 : 000000000000000d x1 : 000000000000000c x0 : 0000000000000000
> [ 6254.038562] Call trace:
> [ 6254.040995]  xas_split_alloc+0xf8/0x128
> [ 6254.044818]  __filemap_add_folio+0x33c/0x4e0
> [ 6254.049076]  filemap_add_folio+0x48/0xd0
> [ 6254.052986]  page_cache_ra_unbounded+0xf0/0x1f0
> [ 6254.057504]  page_cache_ra_order+0x8c/0x310
> [ 6254.061675]  filemap_fault+0x67c/0xaa8
> [ 6254.065412]  __xfs_filemap_fault+0x60/0x3c0 [xfs]
> [ 6254.070163]  xfs_filemap_fault+0x54/0x68 [xfs]
> [ 6254.074651]  __do_fault+0x40/0x210
> [ 6254.078040]  do_cow_fault+0xf0/0x300
> [ 6254.081602]  do_pte_missing+0x140/0x238
> [ 6254.085426]  handle_pte_fault+0x100/0x160
> [ 6254.089423]  __handle_mm_fault+0x100/0x310
> [ 6254.093506]  handle_mm_fault+0x6c/0x270
> [ 6254.097330]  faultin_page+0x70/0x128
> [ 6254.100893]  __get_user_pages+0xc8/0x2d8
> [ 6254.104802]  get_user_pages_unlocked+0xc4/0x3b8
> [ 6254.109320]  hva_to_pfn+0xf8/0x468
> [ 6254.112709]  __gfn_to_pfn_memslot+0xa4/0xf8
> [ 6254.116879]  user_mem_abort+0x174/0x7e8
> [ 6254.120702]  kvm_handle_guest_abort+0x2dc/0x450
> [ 6254.125220]  handle_exit+0x70/0x1c8
> [ 6254.128696]  kvm_arch_vcpu_ioctl_run+0x224/0x5b8
> [ 6254.133300]  kvm_vcpu_ioctl+0x28c/0x9d0
> [ 6254.137123]  __arm64_sys_ioctl+0xa8/0xf0
> [ 6254.141033]  invoke_syscall.constprop.0+0x7c/0xd0
> [ 6254.145725]  do_el0_svc+0xb4/0xd0
> [ 6254.149028]  el0_svc+0x50/0x228
> [ 6254.152157]  el0t_64_sync_handler+0x134/0x150
> [ 6254.156501]  el0t_64_sync+0x17c/0x180
> [ 6254.160151] ---[ end trace 0000000000000000 ]---
> [ 6254.164766] ------------[ cut here ]------------
> [ 6254.169370] WARNING: CPU: 64 PID: 18885 at lib/xarray.c:1010
> xas_split_alloc+0xf8/0x128
> [ 6254.177361] Modules linked in: loop isofs cdrom vhost_net vhost
> vhost_iotlb tap tun bluetooth tls nfsv3 rpcsec_gss_krb5 nfsv4
> dns_resolver nfs fscache netfs rpcrdma rdma_cm iw_cm ib_cm ib_core
> xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4
> nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6
> nf_defrag_ipv4 nf_tables nfnetlink bridge stp llc rfkill vfat fat ast
> drm_shmem_helper drm_kms_helper acpi_ipmi ipmi_ssif arm_spe_pmu
> ipmi_devintf ipmi_msghandler arm_cmn arm_dmc620_pmu arm_dsu_pmu
> cppc_cpufreq drm fuse nfsd auth_rpcgss nfs_acl lockd grace sunrpc xfs
> libcrc32c crct10dif_ce ghash_ce igb sha2_ce sha256_arm64 sha1_ce
> sbsa_gwdt i2c_designware_platform i2c_algo_bit i2c_designware_core
> xgene_hwmon sg dm_mirror dm_region_hash dm_log dm_mod
>
>
> Tested-by: Zhenyu Zhang <zhenyzha@redhat.com>
>
> On Mon, Sep 4, 2023 at 1:20 AM Joel Fernandes (Google)
> <joel@joelfernandes.org> wrote:
> >
> > Hello!
> >
> > Here is v6 of the mremap start address optimization / fix for the exec warning.
> > It should hopefully be final now; only 2/7 and 6/7 still need a tag. Thanks a lot
> > to Lorenzo and Linus for the detailed reviews.
> >
> > Description of patches
> > ======================
> > These patches optimize the start addresses in move_page_tables() and test the
> > changes. They address a warning [1] that occurs due to a downward, overlapping
> > move on a mutually-aligned offset within a PMD during exec. By initiating the
> > copy process at the PMD level when such alignment is present, we can prevent
> > this warning and speed up the copying process at the same time. Linus Torvalds
> > suggested this idea. Check the individual patches for more details.
> > [1] https://lore.kernel.org/all/ZB2GTBD%2FLWTrkOiO@dhcp22.suse.cz/
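
To make "mutually aligned within a PMD" concrete, here is a minimal user-space sketch, not the kernel code. It assumes a 2MB PMD span (as with 4K pages), and all SKETCH_/sketch_ names are hypothetical and do not appear in the patches. Two addresses are treated as mutually aligned when they sit at the same offset inside their respective PMD spans, so both can be rounded down to a PMD boundary by the same amount.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: a PMD entry covers 2MB, as with 4K pages. */
#define SKETCH_PMD_SIZE (2UL << 20)
#define SKETCH_PMD_MASK (~(SKETCH_PMD_SIZE - 1))

static_assert((SKETCH_PMD_SIZE & (SKETCH_PMD_SIZE - 1)) == 0,
	      "PMD size must be a power of two");

/* True when both addresses have the same offset within their PMD span. */
static bool sketch_mutually_aligned(unsigned long old_addr,
				    unsigned long new_addr)
{
	return (old_addr & ~SKETCH_PMD_MASK) == (new_addr & ~SKETCH_PMD_MASK);
}

/* Round an address down to the start of its PMD span. */
static unsigned long sketch_align_down(unsigned long addr)
{
	return addr & SKETCH_PMD_MASK;
}
```

With these helpers, a source at 0x40100000 and a destination at 0x7f300000 are mutually aligned (both are 1MB into their PMD span), while a destination at 0x7f340000 is not.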
> >
> > History of patches:
> > v5->v6:
> > 1. Reworked the stack case a bit more and tested it (should be final now).
> > 2. Other small nits.
> >
> > v4->v5:
> > 1. Rebased on mainline.
> > 2. Several improvement suggestions from Lorenzo.
> >
> > v3->v4:
> > 1. Care is now taken for moves purely within a VMA; in other words, this check
> >    in call_align_down():
> >     if (vma->vm_start != addr_masked)
> >             return false;
> >
> >     As an example of why this is needed:
> >     Consider the following range which is 2MB aligned and is
> >     a part of a larger 10MB range which is not shown. Each
> >     character is 256KB below making the source and destination
> >     2MB each. The lower case letters are moved (s to d) and the
> >     upper case letters are not moved.
> >
> >     |DDDDddddSSSSssss|
> >
> >     If we align down 'ssss' to start from the 'SSSS', we will end up destroying
> >     SSSS. The above if statement prevents that and I verified it.
> >
> >     I also added a test for this in the last patch.
> >
> > 2. Handle the stack case separately. We do not care about #1 for stack movement
> >    because the 'SSSS' does not matter during this move. Further, we need to do
> >    this to prevent the stack move warning.
> >
> >     if (!for_stack && vma->vm_start <= addr_masked)
> >             return false;
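
The effect of the check quoted above on the |DDDDddddSSSSssss| example can be modelled in user space. This is a minimal sketch under stated assumptions, not the kernel implementation: it uses the second quoted form of the condition (`!for_stack && vma->vm_start <= addr_masked`), assumes a 2MB PMD span, and omits any further checks the real function performs. `struct sketch_vma` and `sketch_can_align_down()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: a PMD entry covers 2MB, as with 4K pages. */
#define SKETCH_PMD_SIZE (2UL << 20)
#define SKETCH_PMD_MASK (~(SKETCH_PMD_SIZE - 1))

/* Hypothetical stand-in for the kernel VMA; only the start address is modelled. */
struct sketch_vma {
	unsigned long vm_start;
};

/*
 * Models only the quoted condition: for a non-stack move, refuse to align
 * down when the masked address still lies inside the VMA, because the
 * bytes between addr_masked and addr ('SSSS') hold live data.
 */
static bool sketch_can_align_down(const struct sketch_vma *vma,
				  unsigned long addr, bool for_stack)
{
	unsigned long addr_masked = addr & SKETCH_PMD_MASK;

	/* The address being aligned must belong to the VMA. */
	assert(vma->vm_start <= addr);

	if (!for_stack && vma->vm_start <= addr_masked)
		return false;

	return true;
}
```

For a VMA starting at 0x40000000 and a source 'ssss' starting 3MB in (0x40300000), the masked address 0x40200000 is still inside the VMA, so the sketch refuses to align down unless `for_stack` is set. If the VMA itself starts at 0x40300000, the masked address falls before the VMA and the quoted condition does not fire.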
> >
> > v2->v3:
> > 1. Masked address was stored in int, fixed it to unsigned long to avoid truncation.
> > 2. We now handle moves happening purely within a VMA, a new test is added to handle this.
> > 3. More code comments.
> >
> > v1->v2:
> > 1. Trigger the optimization for mremaps smaller than a PMD. I tested by tracing
> > that it works correctly.
> >
> > 2. Fix issue with bogus return value found by Linus if we broke out of the
> > above loop for the first PMD itself.
> >
> > v1: Initial RFC.
> >
> > Joel Fernandes (1):
> > selftests: mm: Add a test for moving from an offset from start of
> > mapping
> >
> > Joel Fernandes (Google) (6):
> > mm/mremap: Optimize the start addresses in move_page_tables()
> > mm/mremap: Allow moves within the same VMA for stack moves
> > selftests: mm: Fix failure case when new remap region was not found
> > selftests: mm: Add a test for mutually aligned moves > PMD size
> > selftests: mm: Add a test for remapping to area immediately after
> > existing mapping
> > selftests: mm: Add a test for remapping within a range
> >
> > fs/exec.c                                |   2 +-
> > include/linux/mm.h                       |   2 +-
> > mm/mremap.c                              |  73 +++++-
> > tools/testing/selftests/mm/mremap_test.c | 301 +++++++++++++++++++----
> > 4 files changed, 329 insertions(+), 49 deletions(-)
> >
> > --
> > 2.42.0.283.g2d96d420d3-goog
> >


Thread overview: 16+ messages
2023-09-03 15:13 [PATCH v6 0/7] Optimize mremap during mutual alignment within PMD Joel Fernandes (Google)
2023-09-03 15:13 ` [PATCH v6 1/7] mm/mremap: Optimize the start addresses in move_page_tables() Joel Fernandes (Google)
2023-09-03 16:07   ` [lkp] [+134 bytes kernel size regression] [i386-tinyconfig] [8d22a4573c] " kernel test robot
2023-09-08 13:07   ` [PATCH v6 1/7] " Michal Hocko
2023-09-08 13:26     ` Joel Fernandes
2023-09-03 15:13 ` [PATCH v6 2/7] mm/mremap: Allow moves within the same VMA for stack moves Joel Fernandes (Google)
2023-09-05  6:47   ` Lorenzo Stoakes
2023-09-08 13:11   ` Michal Hocko
2023-09-03 15:13 ` [PATCH v6 3/7] selftests: mm: Fix failure case when new remap region was not found Joel Fernandes (Google)
2023-09-03 15:13 ` [PATCH v6 4/7] selftests: mm: Add a test for mutually aligned moves > PMD size Joel Fernandes (Google)
2023-09-03 15:13 ` [PATCH v6 5/7] selftests: mm: Add a test for remapping to area immediately after existing mapping Joel Fernandes (Google)
2023-09-03 15:13 ` [PATCH v6 6/7] selftests: mm: Add a test for remapping within a range Joel Fernandes (Google)
2023-09-05  6:48   ` Lorenzo Stoakes
2023-09-03 15:13 ` [PATCH v6 7/7] selftests: mm: Add a test for moving from an offset from start of mapping Joel Fernandes (Google)
2023-09-18 15:35 ` [PATCH v6 0/7] Optimize mremap during mutual alignment within PMD zhenyu zhang
2023-09-19 12:31   ` zhenyu zhang
