* [PATCH 0/6] mremap fixes
@ 2021-06-10  8:35 ` Aneesh Kumar K.V
  0 siblings, 0 replies; 27+ messages in thread
From: Aneesh Kumar K.V @ 2021-06-10  8:35 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: mpe, linuxppc-dev, kaleshsingh, npiggin, joel, Christophe Leroy,
	Linus Torvalds, Kirill A . Shutemov, Aneesh Kumar K.V

This patch series is split out from [PATCH v7 00/11] "Speedup mremap on ppc64"
(https://lore.kernel.org/linux-mm/20210607055131.156184-1-aneesh.kumar@linux.ibm.com),
dropping the ppc64-specific changes.

I will send the ppc64-specific changes separately once we agree on how to handle the
TLB flush changes.


Aneesh Kumar K.V (6):
  selftest/mremap_test: Update the test to handle pagesize other than 4K
  selftest/mremap_test: Avoid crash with static build
  mm/mremap: Convert huge PUD move to separate helper
  mm/mremap: Don't enable optimized PUD move if page table levels is 2
  mm/mremap: Use pmd/pud_populate to update page table entries
  mm/mremap: hold the rmap lock in write mode when moving page table
    entries.

 mm/mremap.c                              |  93 +++++++++++++++---
 tools/testing/selftests/vm/mremap_test.c | 118 ++++++++++++-----------
 2 files changed, 143 insertions(+), 68 deletions(-)

-- 
2.31.1



^ permalink raw reply	[flat|nested] 27+ messages in thread


* [PATCH 1/6] selftest/mremap_test: Update the test to handle pagesize other than 4K
  2021-06-10  8:35 ` Aneesh Kumar K.V
@ 2021-06-10  8:35   ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 27+ messages in thread
From: Aneesh Kumar K.V @ 2021-06-10  8:35 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: mpe, linuxppc-dev, kaleshsingh, npiggin, joel, Christophe Leroy,
	Linus Torvalds, Kirill A . Shutemov, Aneesh Kumar K.V

Instead of hardcoding the 4K page size, fetch it using sysconf(). The performance
measurement tests still assume 2M and 1G hugepage sizes.
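
A minimal userspace sketch of the idea, using only libc; struct test_cfg and its
fields are illustrative stand-ins for the selftest's struct test:

#include <stdio.h>
#include <unistd.h>

struct test_cfg {			/* stand-in for struct test */
	long src_alignment;
	long region_size;
};

int main(void)
{
	/*
	 * The page size is only known at run time, so the test table can no
	 * longer be built with a static initializer; assign compound literals
	 * instead, which is why MAKE_TEST() now expands to (struct test){ ... }.
	 */
	long page_size = sysconf(_SC_PAGESIZE);
	struct test_cfg cases[2];

	cases[0] = (struct test_cfg){ .src_alignment = page_size,
				      .region_size = page_size };
	cases[1] = (struct test_cfg){ .src_alignment = page_size,
				      .region_size = 2 * page_size };

	printf("page size: %ld, first region: %ld bytes\n",
	       page_size, cases[0].region_size);
	return 0;
}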

Reviewed-by: Kalesh Singh <kaleshsingh@google.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 tools/testing/selftests/vm/mremap_test.c | 113 ++++++++++++-----------
 1 file changed, 61 insertions(+), 52 deletions(-)

diff --git a/tools/testing/selftests/vm/mremap_test.c b/tools/testing/selftests/vm/mremap_test.c
index 9c391d016922..c9a5461eb786 100644
--- a/tools/testing/selftests/vm/mremap_test.c
+++ b/tools/testing/selftests/vm/mremap_test.c
@@ -45,14 +45,15 @@ enum {
 	_4MB = 4ULL << 20,
 	_1GB = 1ULL << 30,
 	_2GB = 2ULL << 30,
-	PTE = _4KB,
 	PMD = _2MB,
 	PUD = _1GB,
 };
 
+#define PTE page_size
+
 #define MAKE_TEST(source_align, destination_align, size,	\
 		  overlaps, should_fail, test_name)		\
-{								\
+(struct test){							\
 	.name = test_name,					\
 	.config = {						\
 		.src_alignment = source_align,			\
@@ -252,12 +253,17 @@ static int parse_args(int argc, char **argv, unsigned int *threshold_mb,
 	return 0;
 }
 
+#define MAX_TEST 13
+#define MAX_PERF_TEST 3
 int main(int argc, char **argv)
 {
 	int failures = 0;
 	int i, run_perf_tests;
 	unsigned int threshold_mb = VALIDATION_DEFAULT_THRESHOLD;
 	unsigned int pattern_seed;
+	struct test test_cases[MAX_TEST];
+	struct test perf_test_cases[MAX_PERF_TEST];
+	int page_size;
 	time_t t;
 
 	pattern_seed = (unsigned int) time(&t);
@@ -268,56 +274,59 @@ int main(int argc, char **argv)
 	ksft_print_msg("Test configs:\n\tthreshold_mb=%u\n\tpattern_seed=%u\n\n",
 		       threshold_mb, pattern_seed);
 
-	struct test test_cases[] = {
-		/* Expected mremap failures */
-		MAKE_TEST(_4KB, _4KB, _4KB, OVERLAPPING, EXPECT_FAILURE,
-		  "mremap - Source and Destination Regions Overlapping"),
-		MAKE_TEST(_4KB, _1KB, _4KB, NON_OVERLAPPING, EXPECT_FAILURE,
-		  "mremap - Destination Address Misaligned (1KB-aligned)"),
-		MAKE_TEST(_1KB, _4KB, _4KB, NON_OVERLAPPING, EXPECT_FAILURE,
-		  "mremap - Source Address Misaligned (1KB-aligned)"),
-
-		/* Src addr PTE aligned */
-		MAKE_TEST(PTE, PTE, _8KB, NON_OVERLAPPING, EXPECT_SUCCESS,
-		  "8KB mremap - Source PTE-aligned, Destination PTE-aligned"),
-
-		/* Src addr 1MB aligned */
-		MAKE_TEST(_1MB, PTE, _2MB, NON_OVERLAPPING, EXPECT_SUCCESS,
-		  "2MB mremap - Source 1MB-aligned, Destination PTE-aligned"),
-		MAKE_TEST(_1MB, _1MB, _2MB, NON_OVERLAPPING, EXPECT_SUCCESS,
-		  "2MB mremap - Source 1MB-aligned, Destination 1MB-aligned"),
-
-		/* Src addr PMD aligned */
-		MAKE_TEST(PMD, PTE, _4MB, NON_OVERLAPPING, EXPECT_SUCCESS,
-		  "4MB mremap - Source PMD-aligned, Destination PTE-aligned"),
-		MAKE_TEST(PMD, _1MB, _4MB, NON_OVERLAPPING, EXPECT_SUCCESS,
-		  "4MB mremap - Source PMD-aligned, Destination 1MB-aligned"),
-		MAKE_TEST(PMD, PMD, _4MB, NON_OVERLAPPING, EXPECT_SUCCESS,
-		  "4MB mremap - Source PMD-aligned, Destination PMD-aligned"),
-
-		/* Src addr PUD aligned */
-		MAKE_TEST(PUD, PTE, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS,
-		  "2GB mremap - Source PUD-aligned, Destination PTE-aligned"),
-		MAKE_TEST(PUD, _1MB, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS,
-		  "2GB mremap - Source PUD-aligned, Destination 1MB-aligned"),
-		MAKE_TEST(PUD, PMD, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS,
-		  "2GB mremap - Source PUD-aligned, Destination PMD-aligned"),
-		MAKE_TEST(PUD, PUD, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS,
-		  "2GB mremap - Source PUD-aligned, Destination PUD-aligned"),
-	};
-
-	struct test perf_test_cases[] = {
-		/*
-		 * mremap 1GB region - Page table level aligned time
-		 * comparison.
-		 */
-		MAKE_TEST(PTE, PTE, _1GB, NON_OVERLAPPING, EXPECT_SUCCESS,
-		  "1GB mremap - Source PTE-aligned, Destination PTE-aligned"),
-		MAKE_TEST(PMD, PMD, _1GB, NON_OVERLAPPING, EXPECT_SUCCESS,
-		  "1GB mremap - Source PMD-aligned, Destination PMD-aligned"),
-		MAKE_TEST(PUD, PUD, _1GB, NON_OVERLAPPING, EXPECT_SUCCESS,
-		  "1GB mremap - Source PUD-aligned, Destination PUD-aligned"),
-	};
+	page_size = sysconf(_SC_PAGESIZE);
+
+	/* Expected mremap failures */
+	test_cases[0] =	MAKE_TEST(page_size, page_size, page_size,
+				  OVERLAPPING, EXPECT_FAILURE,
+				  "mremap - Source and Destination Regions Overlapping");
+
+	test_cases[1] = MAKE_TEST(page_size, page_size/4, page_size,
+				  NON_OVERLAPPING, EXPECT_FAILURE,
+				  "mremap - Destination Address Misaligned (1KB-aligned)");
+	test_cases[2] = MAKE_TEST(page_size/4, page_size, page_size,
+				  NON_OVERLAPPING, EXPECT_FAILURE,
+				  "mremap - Source Address Misaligned (1KB-aligned)");
+
+	/* Src addr PTE aligned */
+	test_cases[3] = MAKE_TEST(PTE, PTE, PTE * 2,
+				  NON_OVERLAPPING, EXPECT_SUCCESS,
+				  "8KB mremap - Source PTE-aligned, Destination PTE-aligned");
+
+	/* Src addr 1MB aligned */
+	test_cases[4] = MAKE_TEST(_1MB, PTE, _2MB, NON_OVERLAPPING, EXPECT_SUCCESS,
+				  "2MB mremap - Source 1MB-aligned, Destination PTE-aligned");
+	test_cases[5] = MAKE_TEST(_1MB, _1MB, _2MB, NON_OVERLAPPING, EXPECT_SUCCESS,
+				  "2MB mremap - Source 1MB-aligned, Destination 1MB-aligned");
+
+	/* Src addr PMD aligned */
+	test_cases[6] = MAKE_TEST(PMD, PTE, _4MB, NON_OVERLAPPING, EXPECT_SUCCESS,
+				  "4MB mremap - Source PMD-aligned, Destination PTE-aligned");
+	test_cases[7] =	MAKE_TEST(PMD, _1MB, _4MB, NON_OVERLAPPING, EXPECT_SUCCESS,
+				  "4MB mremap - Source PMD-aligned, Destination 1MB-aligned");
+	test_cases[8] = MAKE_TEST(PMD, PMD, _4MB, NON_OVERLAPPING, EXPECT_SUCCESS,
+				  "4MB mremap - Source PMD-aligned, Destination PMD-aligned");
+
+	/* Src addr PUD aligned */
+	test_cases[9] = MAKE_TEST(PUD, PTE, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS,
+				  "2GB mremap - Source PUD-aligned, Destination PTE-aligned");
+	test_cases[10] = MAKE_TEST(PUD, _1MB, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS,
+				   "2GB mremap - Source PUD-aligned, Destination 1MB-aligned");
+	test_cases[11] = MAKE_TEST(PUD, PMD, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS,
+				   "2GB mremap - Source PUD-aligned, Destination PMD-aligned");
+	test_cases[12] = MAKE_TEST(PUD, PUD, _2GB, NON_OVERLAPPING, EXPECT_SUCCESS,
+				   "2GB mremap - Source PUD-aligned, Destination PUD-aligned");
+
+	perf_test_cases[0] =  MAKE_TEST(page_size, page_size, _1GB, NON_OVERLAPPING, EXPECT_SUCCESS,
+					"1GB mremap - Source PTE-aligned, Destination PTE-aligned");
+	/*
+	 * mremap 1GB region - Page table level aligned time
+	 * comparison.
+	 */
+	perf_test_cases[1] = MAKE_TEST(PMD, PMD, _1GB, NON_OVERLAPPING, EXPECT_SUCCESS,
+				       "1GB mremap - Source PMD-aligned, Destination PMD-aligned");
+	perf_test_cases[2] = MAKE_TEST(PUD, PUD, _1GB, NON_OVERLAPPING, EXPECT_SUCCESS,
+				       "1GB mremap - Source PUD-aligned, Destination PUD-aligned");
 
 	run_perf_tests =  (threshold_mb == VALIDATION_NO_THRESHOLD) ||
 				(threshold_mb * _1MB >= _1GB);
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 27+ messages in thread


* [PATCH 2/6] selftest/mremap_test: Avoid crash with static build
  2021-06-10  8:35 ` Aneesh Kumar K.V
@ 2021-06-10  8:35   ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 27+ messages in thread
From: Aneesh Kumar K.V @ 2021-06-10  8:35 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: mpe, linuxppc-dev, kaleshsingh, npiggin, joel, Christophe Leroy,
	Linus Torvalds, Kirill A . Shutemov, Aneesh Kumar K.V

With a large mmap size, the mapping can overlap the text area, and using
MAP_FIXED then results in unmapping that area. Switch to MAP_FIXED_NOREPLACE
and handle the EEXIST error.
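
A userspace sketch of the retry pattern; the probing loop and step size are
illustrative rather than copied from the selftest:

#define _GNU_SOURCE
#include <errno.h>
#include <sys/mman.h>

#ifndef MAP_FIXED_NOREPLACE
#define MAP_FIXED_NOREPLACE 0x100000
#endif

/* Probe upwards for a fixed address without clobbering existing mappings. */
static void *map_fixed_noreplace(unsigned long addr, size_t len, size_t step)
{
	void *p;

	for (;;) {
		addr += step;
		p = mmap((void *)addr, len, PROT_READ | PROT_WRITE,
			 MAP_FIXED_NOREPLACE | MAP_ANONYMOUS | MAP_PRIVATE,
			 -1, 0);
		if (p != MAP_FAILED)
			return p;
		/* EEXIST: range already mapped; EPERM: below mmap_min_addr. */
		if (errno != EEXIST && errno != EPERM)
			return MAP_FAILED;
	}
}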

Reviewed-by: Kalesh Singh <kaleshsingh@google.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 tools/testing/selftests/vm/mremap_test.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/vm/mremap_test.c b/tools/testing/selftests/vm/mremap_test.c
index c9a5461eb786..0624d1bd71b5 100644
--- a/tools/testing/selftests/vm/mremap_test.c
+++ b/tools/testing/selftests/vm/mremap_test.c
@@ -75,9 +75,10 @@ static void *get_source_mapping(struct config c)
 retry:
 	addr += c.src_alignment;
 	src_addr = mmap((void *) addr, c.region_size, PROT_READ | PROT_WRITE,
-			MAP_FIXED | MAP_ANONYMOUS | MAP_SHARED, -1, 0);
+			MAP_FIXED_NOREPLACE | MAP_ANONYMOUS | MAP_SHARED,
+			-1, 0);
 	if (src_addr == MAP_FAILED) {
-		if (errno == EPERM)
+		if (errno == EPERM || errno == EEXIST)
 			goto retry;
 		goto error;
 	}
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 27+ messages in thread


* [PATCH 3/6] mm/mremap: Convert huge PUD move to separate helper
  2021-06-10  8:35 ` Aneesh Kumar K.V
@ 2021-06-10  8:35   ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 27+ messages in thread
From: Aneesh Kumar K.V @ 2021-06-10  8:35 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: mpe, linuxppc-dev, kaleshsingh, npiggin, joel, Christophe Leroy,
	Linus Torvalds, Kirill A . Shutemov, Aneesh Kumar K.V

With TRANSPARENT_HUGEPAGE_PUD enabled, the kernel can find huge PUD entries.
Add a helper to move huge PUD entries on mremap().

This will be used by a later patch to optimize mremap of PUD_SIZE-aligned,
level 4 PTE-mapped addresses.

This also makes sure we support mremap on huge PUD entries even with
CONFIG_HAVE_MOVE_PUD disabled.
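
For context, a userspace sketch of the kind of operation that exercises this
helper, assuming a PUD_SIZE (1GB on x86-64 and ppc64 radix) aligned mapping
that the kernel backs with a huge PUD entry, e.g. 1GB DAX:

#define _GNU_SOURCE
#include <sys/mman.h>

/*
 * Relocate a 1GB, PUD-aligned region wholesale; with move_huge_pud() the
 * kernel can move the huge PUD entry directly at the PUD level.
 */
static void *move_1g_region(void *old_addr, void *new_addr)
{
	const size_t len = 1UL << 30;

	return mremap(old_addr, len, len, MREMAP_MAYMOVE | MREMAP_FIXED,
		      new_addr);
}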

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 mm/mremap.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 73 insertions(+), 7 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 47c255b60150..92ab7d24a587 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -324,10 +324,62 @@ static inline bool move_normal_pud(struct vm_area_struct *vma,
 }
 #endif
 
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PUD
+static bool move_huge_pud(struct vm_area_struct *vma, unsigned long old_addr,
+			  unsigned long new_addr, pud_t *old_pud, pud_t *new_pud)
+{
+	spinlock_t *old_ptl, *new_ptl;
+	struct mm_struct *mm = vma->vm_mm;
+	pud_t pud;
+
+	/*
+	 * The destination pud shouldn't be established, free_pgtables()
+	 * should have released it.
+	 */
+	if (WARN_ON_ONCE(!pud_none(*new_pud)))
+		return false;
+
+	/*
+	 * We don't have to worry about the ordering of src and dst
+	 * ptlocks because exclusive mmap_lock prevents deadlock.
+	 */
+	old_ptl = pud_lock(vma->vm_mm, old_pud);
+	new_ptl = pud_lockptr(mm, new_pud);
+	if (new_ptl != old_ptl)
+		spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
+
+	/* Clear the pud */
+	pud = *old_pud;
+	pud_clear(old_pud);
+
+	VM_BUG_ON(!pud_none(*new_pud));
+
+	/* Set the new pud */
+	/* mark soft_dirty when we add pud level soft dirty support */
+	set_pud_at(mm, new_addr, new_pud, pud);
+	flush_pud_tlb_range(vma, old_addr, old_addr + HPAGE_PUD_SIZE);
+	if (new_ptl != old_ptl)
+		spin_unlock(new_ptl);
+	spin_unlock(old_ptl);
+
+	return true;
+}
+#else
+static bool move_huge_pud(struct vm_area_struct *vma, unsigned long old_addr,
+			  unsigned long new_addr, pud_t *old_pud, pud_t *new_pud)
+{
+	WARN_ON_ONCE(1);
+	return false;
+
+}
+#endif
+
 enum pgt_entry {
 	NORMAL_PMD,
 	HPAGE_PMD,
 	NORMAL_PUD,
+	HPAGE_PUD,
 };
 
 /*
@@ -347,6 +399,7 @@ static __always_inline unsigned long get_extent(enum pgt_entry entry,
 		mask = PMD_MASK;
 		size = PMD_SIZE;
 		break;
+	case HPAGE_PUD:
 	case NORMAL_PUD:
 		mask = PUD_MASK;
 		size = PUD_SIZE;
@@ -395,6 +448,11 @@ static bool move_pgt_entry(enum pgt_entry entry, struct vm_area_struct *vma,
 			move_huge_pmd(vma, old_addr, new_addr, old_entry,
 				      new_entry);
 		break;
+	case HPAGE_PUD:
+		moved = move_huge_pud(vma, old_addr, new_addr, old_entry,
+				      new_entry);
+		break;
+
 	default:
 		WARN_ON_ONCE(1);
 		break;
@@ -414,6 +472,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 	unsigned long extent, old_end;
 	struct mmu_notifier_range range;
 	pmd_t *old_pmd, *new_pmd;
+	pud_t *old_pud, *new_pud;
 
 	old_end = old_addr + len;
 	flush_cache_range(vma, old_addr, old_end);
@@ -429,15 +488,22 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 		 * PUD level if possible.
 		 */
 		extent = get_extent(NORMAL_PUD, old_addr, old_end, new_addr);
-		if (IS_ENABLED(CONFIG_HAVE_MOVE_PUD) && extent == PUD_SIZE) {
-			pud_t *old_pud, *new_pud;
 
-			old_pud = get_old_pud(vma->vm_mm, old_addr);
-			if (!old_pud)
+		old_pud = get_old_pud(vma->vm_mm, old_addr);
+		if (!old_pud)
+			continue;
+		new_pud = alloc_new_pud(vma->vm_mm, vma, new_addr);
+		if (!new_pud)
+			break;
+		if (pud_trans_huge(*old_pud) || pud_devmap(*old_pud)) {
+			if (extent == HPAGE_PUD_SIZE) {
+				move_pgt_entry(HPAGE_PUD, vma, old_addr, new_addr,
+					       old_pud, new_pud, need_rmap_locks);
+				/* We ignore and continue on error? */
 				continue;
-			new_pud = alloc_new_pud(vma->vm_mm, vma, new_addr);
-			if (!new_pud)
-				break;
+			}
+		} else if (IS_ENABLED(CONFIG_HAVE_MOVE_PUD) && extent == PUD_SIZE) {
+
 			if (move_pgt_entry(NORMAL_PUD, vma, old_addr, new_addr,
 					   old_pud, new_pud, need_rmap_locks))
 				continue;
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 27+ messages in thread


* [PATCH 4/6] mm/mremap: Don't enable optimized PUD move if page table levels is 2
  2021-06-10  8:35 ` Aneesh Kumar K.V
@ 2021-06-10  8:35   ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 27+ messages in thread
From: Aneesh Kumar K.V @ 2021-06-10  8:35 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: mpe, linuxppc-dev, kaleshsingh, npiggin, joel, Christophe Leroy,
	Linus Torvalds, Kirill A . Shutemov, Aneesh Kumar K.V

With two-level page tables, don't enable move_normal_pud().

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 mm/mremap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 92ab7d24a587..795a7d628b53 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -276,7 +276,7 @@ static inline bool move_normal_pmd(struct vm_area_struct *vma,
 }
 #endif
 
-#ifdef CONFIG_HAVE_MOVE_PUD
+#if CONFIG_PGTABLE_LEVELS > 2 && defined(CONFIG_HAVE_MOVE_PUD)
 static bool move_normal_pud(struct vm_area_struct *vma, unsigned long old_addr,
 		  unsigned long new_addr, pud_t *old_pud, pud_t *new_pud)
 {
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 27+ messages in thread


* [PATCH 5/6] mm/mremap: Use pmd/pud_populate to update page table entries
  2021-06-10  8:35 ` Aneesh Kumar K.V
@ 2021-06-10  8:35   ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 27+ messages in thread
From: Aneesh Kumar K.V @ 2021-06-10  8:35 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: mpe, linuxppc-dev, kaleshsingh, npiggin, joel, Christophe Leroy,
	Linus Torvalds, Kirill A . Shutemov, Aneesh Kumar K.V

pmd/pud_populate is the right interface for setting the respective page table
entries. Some architectures, such as ppc64, assume that set_pmd/pud_at can only
be used to set a hugepage PTE. Since we are not setting up a hugepage PTE here,
use the pmd/pud_populate interface.
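
Side by side, the distinction looks like this (a sketch in kernel context,
reusing the names from mm/mremap.c, not additional patch content):

	/* Installing a huge (leaf) PMD entry, e.g. what move_huge_pmd() does: */
	set_pmd_at(mm, new_addr, new_pmd, pmd);

	/*
	 * Linking a page-table page, which is what move_normal_pmd() actually
	 * needs; pmd_pgtable() returns the page of PTEs the old entry pointed to:
	 */
	pmd_populate(mm, new_pmd, pmd_pgtable(pmd));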

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 mm/mremap.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 795a7d628b53..dacfa9111ab1 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -26,6 +26,7 @@
 
 #include <asm/cacheflush.h>
 #include <asm/tlbflush.h>
+#include <asm/pgalloc.h>
 
 #include "internal.h"
 
@@ -258,8 +259,7 @@ static bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 
 	VM_BUG_ON(!pmd_none(*new_pmd));
 
-	/* Set the new pmd */
-	set_pmd_at(mm, new_addr, new_pmd, pmd);
+	pmd_populate(mm, new_pmd, pmd_pgtable(pmd));
 	flush_tlb_range(vma, old_addr, old_addr + PMD_SIZE);
 	if (new_ptl != old_ptl)
 		spin_unlock(new_ptl);
@@ -306,8 +306,7 @@ static bool move_normal_pud(struct vm_area_struct *vma, unsigned long old_addr,
 
 	VM_BUG_ON(!pud_none(*new_pud));
 
-	/* Set the new pud */
-	set_pud_at(mm, new_addr, new_pud, pud);
+	pud_populate(mm, new_pud, (pmd_t *)pud_page_vaddr(pud));
 	flush_tlb_range(vma, old_addr, old_addr + PUD_SIZE);
 	if (new_ptl != old_ptl)
 		spin_unlock(new_ptl);
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 27+ messages in thread


* [PATCH 6/6] mm/mremap: hold the rmap lock in write mode when moving page table entries.
  2021-06-10  8:35 ` Aneesh Kumar K.V
@ 2021-06-10  8:35   ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 27+ messages in thread
From: Aneesh Kumar K.V @ 2021-06-10  8:35 UTC (permalink / raw)
  To: linux-mm, akpm
  Cc: mpe, linuxppc-dev, kaleshsingh, npiggin, joel, Christophe Leroy,
	Linus Torvalds, Kirill A . Shutemov, Aneesh Kumar K.V,
	Hugh Dickins, Kirill A . Shutemov

To avoid a race between the rmap walk and mremap, mremap calls take_rmap_locks().
The lock is taken to ensure that the rmap walk doesn't miss a page table entry due
to PTE moves via move_page_tables(). The kernel further optimizes this locking: if
the newly added vma will be found after the old vma, the rmap lock is not taken.
This is because the rmap walk would find the vmas in the same order, and if we don't
find the page table attached to the older vma we will find it with the new vma,
which is iterated later.

As explained in commit eb66ae030829 ("mremap: properly flush TLB before releasing the page"),
mremap is special in that it doesn't take ownership of the page. The optimized
version for PUD/PMD-aligned mremap also doesn't hold the ptl lock. This can result
in stale TLB entries, as shown below.

This patch updates the rmap locking requirement in mremap to handle the race condition
explained below with optimized mremap::

Optimized PMD move

    CPU 1                           CPU 2                                   CPU 3

    mremap(old_addr, new_addr)      page_shrinker/try_to_unmap_one

    mmap_write_lock_killable()

                                    addr = old_addr
                                    lock(pte_ptl)
    lock(pmd_ptl)
    pmd = *old_pmd
    pmd_clear(old_pmd)
    flush_tlb_range(old_addr)

    *new_pmd = pmd
                                                                            *new_addr = 10; and fills
                                                                            TLB with new addr
                                                                            and old pfn

    unlock(pmd_ptl)
                                    ptep_clear_flush()
                                    old pfn is free.
                                                                            Stale TLB entry

The optimized PUD move also suffers from a similar race.
Both of the above race conditions can be fixed if we force the mremap path to take the rmap lock.
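
The resulting locking pattern, condensed from move_pgt_entry() in mm/mremap.c
(sketch only, not the full function):

	/*
	 * Always take the rmap locks for PMD/PUD-level moves, independent of
	 * the old/new vma ordering optimization that PTE-level moves keep.
	 */
	take_rmap_locks(vma);
	moved = move_normal_pmd(vma, old_addr, new_addr, old_pmd, new_pmd);
	drop_rmap_locks(vma);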

Fixes: 2c91bd4a4e2e ("mm: speed up mremap by 20x on large regions")
Fixes: c49dd3401802 ("mm: speedup mremap on 1GB or larger regions")
Link: https://lore.kernel.org/linux-mm/CAHk-=wgXVR04eBNtxQfevontWnP6FDm+oj5vauQXP3S-huwbPw@mail.gmail.com
Acked-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 mm/mremap.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index dacfa9111ab1..b8eed7645cea 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -504,7 +504,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 		} else if (IS_ENABLED(CONFIG_HAVE_MOVE_PUD) && extent == PUD_SIZE) {
 
 			if (move_pgt_entry(NORMAL_PUD, vma, old_addr, new_addr,
-					   old_pud, new_pud, need_rmap_locks))
+					   old_pud, new_pud, true))
 				continue;
 		}
 
@@ -531,7 +531,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 			 * moving at the PMD level if possible.
 			 */
 			if (move_pgt_entry(NORMAL_PMD, vma, old_addr, new_addr,
-					   old_pmd, new_pmd, need_rmap_locks))
+					   old_pmd, new_pmd, true))
 				continue;
 		}
 
-- 
2.31.1



^ permalink raw reply related	[flat|nested] 27+ messages in thread


* Re: [PATCH 5/6] mm/mremap: Use pmd/pud_populate to update page table entries
  2021-06-10  8:35   ` Aneesh Kumar K.V
  (?)
@ 2021-06-10 18:16   ` Linus Torvalds
  2021-06-13  9:06     ` Aneesh Kumar K.V
  -1 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2021-06-10 18:16 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Nick Piggin, Linux-MM, Kalesh Singh, Joel Fernandes,
	Kirill A . Shutemov, Andrew Morton, linuxppc-dev

On Thu, Jun 10, 2021 at 1:36 AM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> @@ -306,8 +306,7 @@ static bool move_normal_pud(struct vm_area_struct *vma, unsigned long old_addr,
>
>         VM_BUG_ON(!pud_none(*new_pud));
>
> -       /* Set the new pud */
> -       set_pud_at(mm, new_addr, new_pud, pud);
> +       pud_populate(mm, new_pud, (pmd_t *)pud_page_vaddr(pud));
>         flush_tlb_range(vma, old_addr, old_addr + PUD_SIZE);
>         if (new_ptl != old_ptl)
>                 spin_unlock(new_ptl);

That "(pmd_t *)pud_page_vaddr(pud)" looks a bit odd and doesn't match
the pattern.

Should we perhaps have a wrapper for it? Maybe called "pud_pgtable()",
the same way we have pmd_pgtable()?

And that wrapper would be good to have a comment or two about what it
actually is. The same way that pmd_pgtable() should but doesn't ;(

            Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/6] mm/mremap: Convert huge PUD move to separate helper
  2021-06-10  8:35   ` Aneesh Kumar K.V
@ 2021-06-10 22:03     ` Hugh Dickins
  -1 siblings, 0 replies; 27+ messages in thread
From: Hugh Dickins @ 2021-06-10 22:03 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: linux-mm, akpm, mpe, linuxppc-dev, kaleshsingh, npiggin, joel,
	Christophe Leroy, Linus Torvalds, Kirill A . Shutemov

On Thu, 10 Jun 2021, Aneesh Kumar K.V wrote:

> With TRANSPARENT_HUGEPAGE_PUD enabled the kernel can find huge PUD entries.
> Add a helper to move huge PUD entries on mremap().
> 
> This will be used by a later patch to optimize mremap of PUD_SIZE aligned
> level 4 PTE mapped address
> 
> This also make sure we support mremap on huge PUD entries even with
> CONFIG_HAVE_MOVE_PUD disabled.
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  mm/mremap.c | 80 ++++++++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 73 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 47c255b60150..92ab7d24a587 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -324,10 +324,62 @@ static inline bool move_normal_pud(struct vm_area_struct *vma,
>  }
>  #endif
>  
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE_PUD

Should that say
#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
?

(I'm a PUD-THP-sceptic, but if it's just for DAX then probably okay.)

> +static bool move_huge_pud(struct vm_area_struct *vma, unsigned long old_addr,
> +			  unsigned long new_addr, pud_t *old_pud, pud_t *new_pud)
> +{
> +	spinlock_t *old_ptl, *new_ptl;
> +	struct mm_struct *mm = vma->vm_mm;
> +	pud_t pud;
> +
> +	/*
> +	 * The destination pud shouldn't be established, free_pgtables()
> +	 * should have released it.
> +	 */
> +	if (WARN_ON_ONCE(!pud_none(*new_pud)))
> +		return false;
> +
> +	/*
> +	 * We don't have to worry about the ordering of src and dst
> +	 * ptlocks because exclusive mmap_lock prevents deadlock.
> +	 */
> +	old_ptl = pud_lock(vma->vm_mm, old_pud);
> +	new_ptl = pud_lockptr(mm, new_pud);
> +	if (new_ptl != old_ptl)
> +		spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
> +
> +	/* Clear the pud */
> +	pud = *old_pud;
> +	pud_clear(old_pud);
> +
> +	VM_BUG_ON(!pud_none(*new_pud));
> +
> +	/* Set the new pud */
> +	/* mark soft_ditry when we add pud level soft dirty support */
> +	set_pud_at(mm, new_addr, new_pud, pud);
> +	flush_pud_tlb_range(vma, old_addr, old_addr + HPAGE_PUD_SIZE);
> +	if (new_ptl != old_ptl)
> +		spin_unlock(new_ptl);
> +	spin_unlock(old_ptl);
> +
> +	return true;
> +}
> +#else
> +static bool move_huge_pud(struct vm_area_struct *vma, unsigned long old_addr,
> +			  unsigned long new_addr, pud_t *old_pud, pud_t *new_pud)
> +{
> +	WARN_ON_ONCE(1);
> +	return false;
> +
> +}
> +#endif
> +
>  enum pgt_entry {
>  	NORMAL_PMD,
>  	HPAGE_PMD,
>  	NORMAL_PUD,
> +	HPAGE_PUD,
>  };
>  
>  /*
> @@ -347,6 +399,7 @@ static __always_inline unsigned long get_extent(enum pgt_entry entry,
>  		mask = PMD_MASK;
>  		size = PMD_SIZE;
>  		break;
> +	case HPAGE_PUD:
>  	case NORMAL_PUD:
>  		mask = PUD_MASK;
>  		size = PUD_SIZE;
> @@ -395,6 +448,11 @@ static bool move_pgt_entry(enum pgt_entry entry, struct vm_area_struct *vma,
>  			move_huge_pmd(vma, old_addr, new_addr, old_entry,
>  				      new_entry);
>  		break;
> +	case HPAGE_PUD:
> +		moved = move_huge_pud(vma, old_addr, new_addr, old_entry,
> +				      new_entry);
> +		break;
> +
>  	default:
>  		WARN_ON_ONCE(1);
>  		break;
> @@ -414,6 +472,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>  	unsigned long extent, old_end;
>  	struct mmu_notifier_range range;
>  	pmd_t *old_pmd, *new_pmd;
> +	pud_t *old_pud, *new_pud;
>  
>  	old_end = old_addr + len;
>  	flush_cache_range(vma, old_addr, old_end);
> @@ -429,15 +488,22 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
>  		 * PUD level if possible.
>  		 */
>  		extent = get_extent(NORMAL_PUD, old_addr, old_end, new_addr);
> -		if (IS_ENABLED(CONFIG_HAVE_MOVE_PUD) && extent == PUD_SIZE) {
> -			pud_t *old_pud, *new_pud;
>  
> -			old_pud = get_old_pud(vma->vm_mm, old_addr);
> -			if (!old_pud)
> +		old_pud = get_old_pud(vma->vm_mm, old_addr);
> +		if (!old_pud)
> +			continue;
> +		new_pud = alloc_new_pud(vma->vm_mm, vma, new_addr);
> +		if (!new_pud)
> +			break;
> +		if (pud_trans_huge(*old_pud) || pud_devmap(*old_pud)) {
> +			if (extent == HPAGE_PUD_SIZE) {
> +				move_pgt_entry(HPAGE_PUD, vma, old_addr, new_addr,
> +					       old_pud, new_pud, need_rmap_locks);
> +				/* We ignore and continue on error? */
>  				continue;
> -			new_pud = alloc_new_pud(vma->vm_mm, vma, new_addr);
> -			if (!new_pud)
> -				break;
> +			}
> +		} else if (IS_ENABLED(CONFIG_HAVE_MOVE_PUD) && extent == PUD_SIZE) {
> +
>  			if (move_pgt_entry(NORMAL_PUD, vma, old_addr, new_addr,
>  					   old_pud, new_pud, need_rmap_locks))
>  				continue;
> -- 
> 2.31.1
> 
> 
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread


* Re: [PATCH 6/6] mm/mremap: hold the rmap lock in write mode when moving page table entries.
  2021-06-10  8:35   ` Aneesh Kumar K.V
  (?)
@ 2021-06-11  8:11   ` Jann Horn
  -1 siblings, 0 replies; 27+ messages in thread
From: Jann Horn @ 2021-06-11  8:11 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Linus Torvalds, Hugh Dickins, Nicholas Piggin, Linux-MM,
	Kalesh Singh, Joel Fernandes (Google),
	Kirill A . Shutemov, Andrew Morton, linuxppc-dev,
	Kirill A . Shutemov

On Thu, Jun 10, 2021 at 10:35 AM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
> To avoid a race between rmap walk and mremap, mremap does take_rmap_locks().
> The lock was taken to ensure that rmap walk don't miss a page table entry due to
> PTE moves via move_pagetables(). The kernel does further optimization of
> this lock such that if we are going to find the newly added vma after the
> old vma, the rmap lock is not taken. This is because rmap walk would find the
> vmas in the same order and if we don't find the page table attached to
> older vma we would find it with the new vma which we would iterate later.
[...]
> Fixes: 2c91bd4a4e2e ("mm: speed up mremap by 20x on large regions")
> Fixes: c49dd3401802 ("mm: speedup mremap on 1GB or larger regions")

probably also "Cc: stable@vger.kernel.org"?

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/6] mm/mremap: Use pmd/pud_populate to update page table entries
  2021-06-10 18:16   ` Linus Torvalds
@ 2021-06-13  9:06     ` Aneesh Kumar K.V
  2021-06-13 10:50         ` Matthew Wilcox
  2021-06-13 18:53         ` Linus Torvalds
  0 siblings, 2 replies; 27+ messages in thread
From: Aneesh Kumar K.V @ 2021-06-13  9:06 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Linux-MM, Kalesh Singh, Joel Fernandes,
	Kirill A . Shutemov, Andrew Morton, linuxppc-dev

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Thu, Jun 10, 2021 at 1:36 AM Aneesh Kumar K.V
> <aneesh.kumar@linux.ibm.com> wrote:
>>
>> @@ -306,8 +306,7 @@ static bool move_normal_pud(struct vm_area_struct *vma, unsigned long old_addr,
>>
>>         VM_BUG_ON(!pud_none(*new_pud));
>>
>> -       /* Set the new pud */
>> -       set_pud_at(mm, new_addr, new_pud, pud);
>> +       pud_populate(mm, new_pud, (pmd_t *)pud_page_vaddr(pud));
>>         flush_tlb_range(vma, old_addr, old_addr + PUD_SIZE);
>>         if (new_ptl != old_ptl)
>>                 spin_unlock(new_ptl);
>
> That "(pmd_t *)pud_page_vaddr(pud)" looks a bit odd and doesn't match
> the pattern.
>
> Should we perhaps have a wrapper for it? Maybe called "pud_pgtable()",
> the same way we have pmd_pgtable()?

IIUC the reason we have pmd_pgtable() is that the pgtable_t type
is arch dependent. On some architectures it is pte_t * and on others
struct page *. The reason being highmem: the level 4 page table can
be located in highmem.

That is not true for the level 3 table, and hence we didn't add an
extra type to point to the level 3 table.

We could add pud_pgtable(), but then it would essentially be the
typecast I did above?

Even if we want to do that, it should be done as a separate patch
from this series?
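
If we did add it, a minimal sketch (hypothetical here, just wrapping the
existing pud_page_vaddr()) could look like:

	/*
	 * Hypothetical wrapper, mirroring pmd_pgtable() one level up:
	 * return the pmd-level table that this pud entry points to.
	 */
	static inline pmd_t *pud_pgtable(pud_t pud)
	{
		return (pmd_t *)pud_page_vaddr(pud);
	}

	/* ... which would turn the hunk above into: */
	pud_populate(mm, new_pud, pud_pgtable(pud));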

>
> And that wrapper would be good to have a comment or two about what it
> actually is. The same way that pmd_pgtable() should but doesn't ;(
>
>             Linus


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/6] mm/mremap: Use pmd/pud_poplulate to update page table entries
  2021-06-13  9:06     ` Aneesh Kumar K.V
@ 2021-06-13 10:50         ` Matthew Wilcox
  2021-06-13 18:53         ` Linus Torvalds
  1 sibling, 0 replies; 27+ messages in thread
From: Matthew Wilcox @ 2021-06-13 10:50 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Linus Torvalds, Nick Piggin, Linux-MM, Kalesh Singh,
	Joel Fernandes, Kirill A . Shutemov, Andrew Morton, linuxppc-dev

On Sun, Jun 13, 2021 at 02:36:13PM +0530, Aneesh Kumar K.V wrote:
> IIUC the reason why we do have pmd_pgtable() is that pgtable_t type
> is arch dependent. On some architecture it is pte_t * and on the other
> struct page *. The reason being highmem and level 4 page table can
> be located in highmem. 

That is ahistorical.  See 2f569afd9ced9ebec9a6eb3dbf6f83429be0a7b4 --
we have pgtable_t for the benefit of s390's crazy sub-page page table
sizes.

Also, please stop numbering page tables upside down.  PTEs are first
level, not fourth.



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/6] mm/mremap: Use pmd/pud_poplulate to update page table entries
  2021-06-13 10:50         ` Matthew Wilcox
@ 2021-06-13 11:13           ` Aneesh Kumar K.V
  -1 siblings, 0 replies; 27+ messages in thread
From: Aneesh Kumar K.V @ 2021-06-13 11:13 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Linus Torvalds, Nick Piggin, Linux-MM, Kalesh Singh,
	Joel Fernandes, Kirill A . Shutemov, Andrew Morton, linuxppc-dev

On 6/13/21 4:20 PM, Matthew Wilcox wrote:
> On Sun, Jun 13, 2021 at 02:36:13PM +0530, Aneesh Kumar K.V wrote:
>> IIUC the reason why we do have pmd_pgtable() is that pgtable_t type
>> is arch dependent. On some architecture it is pte_t * and on the other
>> struct page *. The reason being highmem and level 4 page table can
>> be located in highmem.
> 
> That is ahistorical.  See 2f569afd9ced9ebec9a6eb3dbf6f83429be0a7b4 --
> we have pgtable_t for the benefit of s390's crazy sub-page page table
> sizes.

That is also true of ppc64: we do use sub-page page table sizes. I was
trying to explain why it can't be pte_t * everywhere and why we have
it as struct page *.

> 
> Also, please stop numbering page tables upside down.  PTEs are first
> level, not fourth.
> 

The POWER ISA does name it the other way. I also see some pages
explaining the levels the other way:

https://www.bottomupcs.com/virtual_memory_linux.xhtml

whereas
https://en.wikipedia.org/wiki/Intel_5-level_paging#/media/File:Page_Tables_(5_levels).svg

I am pretty sure I have commits that explain the page table levels the
way I did in this thread. I will switch to your suggestion in further
discussions. Maybe the best solution is to annotate it with more
detail, like level 1 (pte_t *)?


-aneesh



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/6] mm/mremap: Use pmd/pud_poplulate to update page table entries
  2021-06-13  9:06     ` Aneesh Kumar K.V
@ 2021-06-13 18:53         ` Linus Torvalds
  2021-06-13 18:53         ` Linus Torvalds
  1 sibling, 0 replies; 27+ messages in thread
From: Linus Torvalds @ 2021-06-13 18:53 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: Nick Piggin, Linux-MM, Kalesh Singh, Joel Fernandes,
	Kirill A . Shutemov, Andrew Morton, linuxppc-dev

On Sun, Jun 13, 2021 at 2:06 AM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> IIUC the reason why we do have pmd_pgtable() is that pgtable_t type
> is arch dependent. On some architecture it is pte_t * and on the other
> struct page *. The reason being highmem and level 4 page table can
> be located in highmem.

Honestly, the same confusion is real - in a different way - about
pud_page_vaddr().

I really hate that function.

Just grep for the uses, and the definitions, to see what I mean. It's crazy.

I'm perfectly happy not having a "pud_pagetable()" function, but that
cast on pud_page_vaddr() is indicative of real problems.

One solution might be to just say "pud_page_vaddr()" must return a "pmd_t *".

I think it's what all the users actually want anyway.
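
Roughly, a sketch of what that would mean for the generic helpers (the
current pmd_offset() in include/linux/pgtable.h already does the cast; with a
typed pud_page_vaddr() it would simply go away):

	/* today, roughly: */
	static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
	{
		return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
	}

	/* with pud_page_vaddr() returning pmd_t *: */
	static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
	{
		return pud_page_vaddr(*pud) + pmd_index(address);
	}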

              Linus


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 5/6] mm/mremap: Use pmd/pud_poplulate to update page table entries
  2021-06-13 11:13           ` Aneesh Kumar K.V
@ 2021-06-14  5:27             ` Christophe Leroy
  -1 siblings, 0 replies; 27+ messages in thread
From: Christophe Leroy @ 2021-06-14  5:27 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Matthew Wilcox
  Cc: linuxppc-dev, Nick Piggin, Linux-MM, Kalesh Singh,
	Joel Fernandes, Kirill A . Shutemov, Andrew Morton,
	Linus Torvalds



Le 13/06/2021 à 13:13, Aneesh Kumar K.V a écrit :
> On 6/13/21 4:20 PM, Matthew Wilcox wrote:
>> On Sun, Jun 13, 2021 at 02:36:13PM +0530, Aneesh Kumar K.V wrote:
>>> IIUC the reason why we do have pmd_pgtable() is that pgtable_t type
>>> is arch dependent. On some architecture it is pte_t * and on the other
>>> struct page *. The reason being highmem and level 4 page table can
>>> be located in highmem.
>>
>> That is ahistorical.  See 2f569afd9ced9ebec9a6eb3dbf6f83429be0a7b4 --
>> we have pgtable_t for the benefit of s390's crazy sub-page page table
>> sizes.
> 
> That is also true with ppc64. We do sub-page page table size. I was trying to explain why it can't 
> be pte_t * everywhere and why we have
> it as struct page *.

ppc32 as well. On the 8xx, with 16k pages, the HW still uses 4k page tables, so we do use sub-pages.

In order to keep the code simple, we have converted all of powerpc to sub-pages for that, although
some powerpc platforms have only one sub-page per page.

Christophe


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2021-06-14  5:27 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-10  8:35 [PATCH 0/6] mremap fixes Aneesh Kumar K.V
2021-06-10  8:35 ` Aneesh Kumar K.V
2021-06-10  8:35 ` [PATCH 1/6] selftest/mremap_test: Update the test to handle pagesize other than 4K Aneesh Kumar K.V
2021-06-10  8:35   ` Aneesh Kumar K.V
2021-06-10  8:35 ` [PATCH 2/6] selftest/mremap_test: Avoid crash with static build Aneesh Kumar K.V
2021-06-10  8:35   ` Aneesh Kumar K.V
2021-06-10  8:35 ` [PATCH 3/6] mm/mremap: Convert huge PUD move to separate helper Aneesh Kumar K.V
2021-06-10  8:35   ` Aneesh Kumar K.V
2021-06-10 22:03   ` Hugh Dickins
2021-06-10 22:03     ` Hugh Dickins
2021-06-10  8:35 ` [PATCH 4/6] mm/mremap: Don't enable optimized PUD move if page table levels is 2 Aneesh Kumar K.V
2021-06-10  8:35   ` Aneesh Kumar K.V
2021-06-10  8:35 ` [PATCH 5/6] mm/mremap: Use pmd/pud_poplulate to update page table entries Aneesh Kumar K.V
2021-06-10  8:35   ` Aneesh Kumar K.V
2021-06-10 18:16   ` Linus Torvalds
2021-06-13  9:06     ` Aneesh Kumar K.V
2021-06-13 10:50       ` Matthew Wilcox
2021-06-13 10:50         ` Matthew Wilcox
2021-06-13 11:13         ` Aneesh Kumar K.V
2021-06-13 11:13           ` Aneesh Kumar K.V
2021-06-14  5:27           ` Christophe Leroy
2021-06-14  5:27             ` Christophe Leroy
2021-06-13 18:53       ` Linus Torvalds
2021-06-13 18:53         ` Linus Torvalds
2021-06-10  8:35 ` [PATCH 6/6] mm/mremap: hold the rmap lock in write mode when moving " Aneesh Kumar K.V
2021-06-10  8:35   ` Aneesh Kumar K.V
2021-06-11  8:11   ` Jann Horn
