* [PATCH v2 0/3] Add hugetlb MADV_DONTNEED support
@ 2022-02-02  1:40 Mike Kravetz
  2022-02-02  1:40 ` [PATCH v2 1/3] mm: enable MADV_DONTNEED for hugetlb mappings Mike Kravetz
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Mike Kravetz @ 2022-02-02  1:40 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Naoya Horiguchi, David Hildenbrand, Axel Rasmussen, Mina Almasry,
	Michal Hocko, Peter Xu, Andrea Arcangeli, Shuah Khan,
	Andrew Morton, Mike Kravetz

Userfaultfd selftests for hugetlb do not perform UFFD_EVENT_REMAP
testing.  However, mremap support was recently added in commit
550a7d60bd5e ("mm, hugepages: add mremap() support for hugepage backed
vma").  While attempting to enable mremap support in the test, it was
discovered that the mremap test indirectly depends on MADV_DONTNEED.

madvise does not allow MADV_DONTNEED for hugetlb mappings.  However,
that is primarily due to the check in can_madv_lru_vma().  By simply
removing the check and adding huge page alignment, MADV_DONTNEED can
be made to work for hugetlb mappings.
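
For reference, a minimal sketch of the call this series enables is below.
It is illustrative only and assumes 2MB default huge pages are available:

#include <stdlib.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 2 * 2097152UL;	/* two 2MB huge pages (assumed size) */
	char *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	if (addr == MAP_FAILED)
		exit(1);
	addr[0] = 1;			/* fault in the first huge page */

	/* Returned EINVAL for hugetlb before this series */
	if (madvise(addr, len, MADV_DONTNEED))
		exit(1);
	return 0;
}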

Do note that there is no compelling use case for adding this support.
This was discussed in the RFC [1].  However, adding support makes sense
as it is fairly trivial and brings hugetlb functionality more in line
with 'normal' memory.

After enabling support, add selftests for MADV_DONTNEED as well as
MADV_REMOVE.  Then update the userfaultfd selftest.

v1 -> v2
- Use is_vm_hugetlb_page() instead of open coding vma hugetlb check.
- Add new test to .gitignore and use meaningful symbolic names (#define)
  for constants used in test.  Shuah
- Updated help text in userfaultfd test and modified run_vmtests to not
  pass in a file for userfaultfd hugetlb test.  Axel
- Added Reviewed-by for selftest patches.

RFC -> v1
- Fixed alignment issues when calling zap_page_range.  Naoya
- Added checks for invalid arguments and misalignment to selftest.

[1] https://lore.kernel.org/linux-mm/20220113180308.15610-1-mike.kravetz@oracle.com/

Mike Kravetz (3):
  mm: enable MADV_DONTNEED for hugetlb mappings
  selftests/vm: add hugetlb madvise MADV_DONTNEED MADV_REMOVE test
  userfaultfd/selftests: enable huegtlb remap and remove event testing

 mm/madvise.c                                 |  24 +-
 tools/testing/selftests/vm/.gitignore        |   1 +
 tools/testing/selftests/vm/Makefile          |   1 +
 tools/testing/selftests/vm/hugetlb-madvise.c | 413 +++++++++++++++++++
 tools/testing/selftests/vm/run_vmtests.sh    |  15 +-
 tools/testing/selftests/vm/userfaultfd.c     |  69 ++--
 6 files changed, 485 insertions(+), 38 deletions(-)
 create mode 100644 tools/testing/selftests/vm/hugetlb-madvise.c

-- 
2.34.1



* [PATCH v2 1/3] mm: enable MADV_DONTNEED for hugetlb mappings
  2022-02-02  1:40 [PATCH v2 0/3] Add hugetlb MADV_DONTNEED support Mike Kravetz
@ 2022-02-02  1:40 ` Mike Kravetz
  2022-02-02  8:14   ` David Hildenbrand
  2022-02-10  3:21   ` Peter Xu
  2022-02-02  1:40 ` [PATCH v2 2/3] selftests/vm: add hugetlb madvise MADV_DONTNEED MADV_REMOVE test Mike Kravetz
  2022-02-02  1:40 ` [PATCH v2 3/3] userfaultfd/selftests: enable huegtlb remap and remove event testing Mike Kravetz
  2 siblings, 2 replies; 17+ messages in thread
From: Mike Kravetz @ 2022-02-02  1:40 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Naoya Horiguchi, David Hildenbrand, Axel Rasmussen, Mina Almasry,
	Michal Hocko, Peter Xu, Andrea Arcangeli, Shuah Khan,
	Andrew Morton, Mike Kravetz

MADV_DONTNEED is currently disabled for hugetlb mappings.  This
certainly makes sense in shared file mappings as the pagecache maintains
a reference to the page and it will never be freed.  However, it could
be useful to unmap and free pages in private mappings.

The only thing preventing MADV_DONTNEED from working on hugetlb mappings
is a check in can_madv_lru_vma().  To allow support for hugetlb mappings
create and use a new routine madvise_dontneed_free_valid_vma() that will
allow hugetlb mappings.  Also, before calling zap_page_range in the
DONTNEED case align start and size to huge page size for hugetlb vmas.
madvise only requires PAGE_SIZE alignment, but the hugetlb unmap routine
requires huge page size alignment.
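
To make the effect concrete, here is a small sketch of the arithmetic with
assumed addresses (not taken from the patch), for a vma backed by 2MB huge
pages:

	unsigned long hpage = 2UL << 20;	/* assumed huge page size */
	unsigned long start = 0x60201000;	/* not huge page aligned */
	unsigned long end   = 0x60403000;

	start = ALIGN_DOWN(start, hpage);	/* -> 0x60200000 */
	end = ALIGN(end, hpage);		/* -> 0x60600000 */

A request covering part of a huge page is thus widened to full huge page
boundaries before the unmap.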

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/madvise.c | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 5604064df464..7ae891e030a4 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -796,10 +796,30 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
 static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
 					unsigned long start, unsigned long end)
 {
+	/*
+	 * start and size (end - start) must be huge page size aligned
+	 * for hugetlb vmas.
+	 */
+	if (is_vm_hugetlb_page(vma)) {
+		struct hstate *h = hstate_vma(vma);
+
+		start = ALIGN_DOWN(start, huge_page_size(h));
+		end = ALIGN(end, huge_page_size(h));
+	}
+
 	zap_page_range(vma, start, end - start);
 	return 0;
 }
 
+static bool madvise_dontneed_free_valid_vma(struct vm_area_struct *vma,
+						int behavior)
+{
+	if (is_vm_hugetlb_page(vma))
+		return behavior == MADV_DONTNEED;
+	else
+		return can_madv_lru_vma(vma);
+}
+
 static long madvise_dontneed_free(struct vm_area_struct *vma,
 				  struct vm_area_struct **prev,
 				  unsigned long start, unsigned long end,
@@ -808,7 +828,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
 	struct mm_struct *mm = vma->vm_mm;
 
 	*prev = vma;
-	if (!can_madv_lru_vma(vma))
+	if (!madvise_dontneed_free_valid_vma(vma, behavior))
 		return -EINVAL;
 
 	if (!userfaultfd_remove(vma, start, end)) {
@@ -830,7 +850,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
 			 */
 			return -ENOMEM;
 		}
-		if (!can_madv_lru_vma(vma))
+		if (!madvise_dontneed_free_valid_vma(vma, behavior))
 			return -EINVAL;
 		if (end > vma->vm_end) {
 			/*
-- 
2.34.1



* [PATCH v2 2/3] selftests/vm: add hugetlb madvise MADV_DONTNEED MADV_REMOVE test
  2022-02-02  1:40 [PATCH v2 0/3] Add hugetlb MADV_DONTNEED support Mike Kravetz
  2022-02-02  1:40 ` [PATCH v2 1/3] mm: enable MADV_DONTNEED for hugetlb mappings Mike Kravetz
@ 2022-02-02  1:40 ` Mike Kravetz
  2022-02-02  1:40 ` [PATCH v2 3/3] userfaultfd/selftests: enable huegtlb remap and remove event testing Mike Kravetz
  2 siblings, 0 replies; 17+ messages in thread
From: Mike Kravetz @ 2022-02-02  1:40 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Naoya Horiguchi, David Hildenbrand, Axel Rasmussen, Mina Almasry,
	Michal Hocko, Peter Xu, Andrea Arcangeli, Shuah Khan,
	Andrew Morton, Mike Kravetz, Shuah Khan

Now that MADV_DONTNEED support for hugetlb is enabled, add corresponding
tests.  MADV_REMOVE has been enabled for some time, but no tests exist
so add them as well.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Shuah Khan <skhan@linuxfoundation.org>
---
 tools/testing/selftests/vm/.gitignore        |   1 +
 tools/testing/selftests/vm/Makefile          |   1 +
 tools/testing/selftests/vm/hugetlb-madvise.c | 413 +++++++++++++++++++
 tools/testing/selftests/vm/run_vmtests.sh    |  12 +
 4 files changed, 427 insertions(+)
 create mode 100644 tools/testing/selftests/vm/hugetlb-madvise.c

diff --git a/tools/testing/selftests/vm/.gitignore b/tools/testing/selftests/vm/.gitignore
index 3b5faec3c04f..d7507f3c7c76 100644
--- a/tools/testing/selftests/vm/.gitignore
+++ b/tools/testing/selftests/vm/.gitignore
@@ -3,6 +3,7 @@ hugepage-mmap
 hugepage-mremap
 hugepage-shm
 hugepage-vmemmap
+hugetlb-madvise
 khugepaged
 map_hugetlb
 map_populate
diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index 96714d2d49dc..5e43f072f5b7 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -28,6 +28,7 @@ LDLIBS = -lrt -lpthread
 TEST_GEN_FILES = compaction_test
 TEST_GEN_FILES += gup_test
 TEST_GEN_FILES += hmm-tests
+TEST_GEN_FILES += hugetlb-madvise
 TEST_GEN_FILES += hugepage-mmap
 TEST_GEN_FILES += hugepage-mremap
 TEST_GEN_FILES += hugepage-shm
diff --git a/tools/testing/selftests/vm/hugetlb-madvise.c b/tools/testing/selftests/vm/hugetlb-madvise.c
new file mode 100644
index 000000000000..77af0ffff01d
--- /dev/null
+++ b/tools/testing/selftests/vm/hugetlb-madvise.c
@@ -0,0 +1,413 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * hugetlb-madvise:
+ *
+ * Basic functional testing of madvise MADV_DONTNEED and MADV_REMOVE
+ * on hugetlb mappings.
+ *
+ * Before running this test, make sure the administrator has pre-allocated
+ * at least MIN_FREE_PAGES hugetlb pages and they are free.  In addition,
+ * the test takes an argument that is the path to a file in a hugetlbfs
+ * filesystem.  Therefore, a hugetlbfs filesystem must be mounted on some
+ * directory.
+ */
+
+#include <stdlib.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <sys/mman.h>
+#define __USE_GNU
+#include <fcntl.h>
+
+#define USAGE	"USAGE: %s <hugepagefile_name>\n"
+#define MIN_FREE_PAGES	20
+#define NR_HUGE_PAGES	10	/* common map/allocation value */
+
+#define validate_free_pages(exp_free)					\
+	do {								\
+		int fhp = get_free_hugepages();				\
+		if (fhp != (exp_free)) {				\
+			printf("Unexpected number of free huge "	\
+				"pages line %d\n", __LINE__);		\
+			exit(1);					\
+		}							\
+	} while (0)
+
+unsigned long huge_page_size;
+unsigned long base_page_size;
+
+/*
+ * default_huge_page_size copied from mlock2-tests.c
+ */
+unsigned long default_huge_page_size(void)
+{
+	unsigned long hps = 0;
+	char *line = NULL;
+	size_t linelen = 0;
+	FILE *f = fopen("/proc/meminfo", "r");
+
+	if (!f)
+		return 0;
+	while (getline(&line, &linelen, f) > 0) {
+		if (sscanf(line, "Hugepagesize:       %lu kB", &hps) == 1) {
+			hps <<= 10;
+			break;
+		}
+	}
+
+	free(line);
+	fclose(f);
+	return hps;
+}
+
+unsigned long get_free_hugepages(void)
+{
+	unsigned long fhp = 0;
+	char *line = NULL;
+	size_t linelen = 0;
+	FILE *f = fopen("/proc/meminfo", "r");
+
+	if (!f)
+		return fhp;
+	while (getline(&line, &linelen, f) > 0) {
+		if (sscanf(line, "HugePages_Free:      %lu", &fhp) == 1)
+			break;
+	}
+
+	free(line);
+	fclose(f);
+	return fhp;
+}
+
+void write_fault_pages(void *addr, unsigned long nr_pages)
+{
+	unsigned long i;
+
+	for (i = 0; i < nr_pages; i++)
+		*((unsigned long *)(addr + (i * huge_page_size))) = i;
+}
+
+void read_fault_pages(void *addr, unsigned long nr_pages)
+{
+	unsigned long i, tmp = 0;
+
+	for (i = 0; i < nr_pages; i++)
+		tmp += *((unsigned long *)(addr + (i * huge_page_size)));
+}
+
+int main(int argc, char **argv)
+{
+	unsigned long free_hugepages;
+	void *addr, *addr2;
+	int fd;
+	int ret;
+
+	if (argc != 2) {
+		printf(USAGE, argv[0]);
+		exit(1);
+	}
+
+	huge_page_size = default_huge_page_size();
+	if (!huge_page_size) {
+		printf("Unable to determine huge page size, exiting!\n");
+		exit(1);
+	}
+	base_page_size = sysconf(_SC_PAGE_SIZE);
+	if (!base_page_size) {
+		printf("Unable to determine base page size, exiting!\n");
+		exit(1);
+	}
+
+	free_hugepages = get_free_hugepages();
+	if (free_hugepages < MIN_FREE_PAGES) {
+		printf("Not enough free huge pages to test, exiting!\n");
+		exit(1);
+	}
+
+	fd = open(argv[1], O_CREAT | O_RDWR, 0755);
+	if (fd < 0) {
+		perror("Open failed");
+		exit(1);
+	}
+
+	/*
+	 * Test validity of MADV_DONTNEED addr and length arguments.  mmap
+	 * size is NR_HUGE_PAGES + 2.  One page at the beginning and end of
+	 * the mapping will be unmapped so we KNOW there is nothing mapped
+	 * there.
+	 */
+	addr = mmap(NULL, (NR_HUGE_PAGES + 2) * huge_page_size,
+			PROT_READ | PROT_WRITE,
+			MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
+			-1, 0);
+	if (addr == MAP_FAILED) {
+		perror("mmap");
+		exit(1);
+	}
+	if (munmap(addr, huge_page_size) ||
+			munmap(addr + (NR_HUGE_PAGES + 1) * huge_page_size,
+				huge_page_size)) {
+		perror("munmap");
+		exit(1);
+	}
+	addr = addr + huge_page_size;
+
+	write_fault_pages(addr, NR_HUGE_PAGES);
+	validate_free_pages(free_hugepages - NR_HUGE_PAGES);
+
+	/* addr before mapping should fail */
+	ret = madvise(addr - base_page_size, NR_HUGE_PAGES * huge_page_size,
+		MADV_DONTNEED);
+	if (!ret) {
+		printf("Unexpected success of madvise call with invalid addr line %d\n",
+				__LINE__);
+		exit(1);
+	}
+
+	/* addr + length after mapping should fail */
+	ret = madvise(addr, (NR_HUGE_PAGES * huge_page_size) + base_page_size,
+		MADV_DONTNEED);
+	if (!ret) {
+		printf("Unexpected success of madvise call with invalid length line %d\n",
+				__LINE__);
+		exit(1);
+	}
+
+	(void)munmap(addr, NR_HUGE_PAGES * huge_page_size);
+
+	/*
+	 * Test alignment of MADV_DONTNEED addr and length arguments
+	 */
+	addr = mmap(NULL, NR_HUGE_PAGES * huge_page_size,
+			PROT_READ | PROT_WRITE,
+			MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
+			-1, 0);
+	if (addr == MAP_FAILED) {
+		perror("mmap");
+		exit(1);
+	}
+	write_fault_pages(addr, NR_HUGE_PAGES);
+	validate_free_pages(free_hugepages - NR_HUGE_PAGES);
+
+	/* addr should be aligned down to huge page size */
+	if (madvise(addr + base_page_size,
+			NR_HUGE_PAGES * huge_page_size - base_page_size,
+			MADV_DONTNEED)) {
+		perror("madvise");
+		exit(1);
+	}
+
+	/* should free all pages in mapping */
+	validate_free_pages(free_hugepages);
+
+	write_fault_pages(addr, NR_HUGE_PAGES);
+	validate_free_pages(free_hugepages - NR_HUGE_PAGES);
+
+	/* addr + length should be aligned up to huge page size */
+	if (madvise(addr, (NR_HUGE_PAGES * huge_page_size) - base_page_size,
+			MADV_DONTNEED)) {
+		perror("madvise");
+		exit(1);
+	}
+
+	/* should free all pages in mapping */
+	validate_free_pages(free_hugepages);
+
+	(void)munmap(addr, NR_HUGE_PAGES * huge_page_size);
+
+	/*
+	 * Test MADV_DONTNEED on anonymous private mapping
+	 */
+	addr = mmap(NULL, NR_HUGE_PAGES * huge_page_size,
+			PROT_READ | PROT_WRITE,
+			MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
+			-1, 0);
+	if (addr == MAP_FAILED) {
+		perror("mmap");
+		exit(1);
+	}
+	write_fault_pages(addr, NR_HUGE_PAGES);
+	validate_free_pages(free_hugepages - NR_HUGE_PAGES);
+
+	if (madvise(addr, NR_HUGE_PAGES * huge_page_size, MADV_DONTNEED)) {
+		perror("madvise");
+		exit(1);
+	}
+
+	/* should free all pages in mapping */
+	validate_free_pages(free_hugepages);
+
+	(void)munmap(addr, NR_HUGE_PAGES * huge_page_size);
+
+	/*
+	 * Test MADV_DONTNEED on private mapping of hugetlb file
+	 */
+	if (fallocate(fd, 0, 0, NR_HUGE_PAGES * huge_page_size)) {
+		perror("fallocate");
+		exit(1);
+	}
+	validate_free_pages(free_hugepages - NR_HUGE_PAGES);
+
+	addr = mmap(NULL, NR_HUGE_PAGES * huge_page_size,
+			PROT_READ | PROT_WRITE,
+			MAP_PRIVATE, fd, 0);
+	if (addr == MAP_FAILED) {
+		perror("mmap");
+		exit(1);
+	}
+
+	/* read should not consume any pages */
+	read_fault_pages(addr, NR_HUGE_PAGES);
+	validate_free_pages(free_hugepages - NR_HUGE_PAGES);
+
+	/* madvise should not free any pages */
+	if (madvise(addr, NR_HUGE_PAGES * huge_page_size, MADV_DONTNEED)) {
+		perror("madvise");
+		exit(1);
+	}
+	validate_free_pages(free_hugepages - NR_HUGE_PAGES);
+
+	/* writes should allocate private pages */
+	write_fault_pages(addr, NR_HUGE_PAGES);
+	validate_free_pages(free_hugepages - (2 * NR_HUGE_PAGES));
+
+	/* madvise should free private pages */
+	if (madvise(addr, NR_HUGE_PAGES * huge_page_size, MADV_DONTNEED)) {
+		perror("madvise");
+		exit(1);
+	}
+	validate_free_pages(free_hugepages - NR_HUGE_PAGES);
+
+	/* writes should allocate private pages */
+	write_fault_pages(addr, NR_HUGE_PAGES);
+	validate_free_pages(free_hugepages - (2 * NR_HUGE_PAGES));
+
+	/*
+	 * The fallocate below certainly should free the pages associated
+	 * with the file.  However, pages in the private mapping are also
+	 * freed.  This is not the 'correct' behavior, but is expected
+	 * because this is how it has worked since the initial hugetlb
+	 * implementation.
+	 */
+	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+					0, NR_HUGE_PAGES * huge_page_size)) {
+		perror("fallocate");
+		exit(1);
+	}
+	validate_free_pages(free_hugepages);
+
+	(void)munmap(addr, NR_HUGE_PAGES * huge_page_size);
+
+	/*
+	 * Test MADV_DONTNEED on shared mapping of hugetlb file
+	 */
+	if (fallocate(fd, 0, 0, NR_HUGE_PAGES * huge_page_size)) {
+		perror("fallocate");
+		exit(1);
+	}
+	validate_free_pages(free_hugepages - NR_HUGE_PAGES);
+
+	addr = mmap(NULL, NR_HUGE_PAGES * huge_page_size,
+			PROT_READ | PROT_WRITE,
+			MAP_SHARED, fd, 0);
+	if (addr == MAP_FAILED) {
+		perror("mmap");
+		exit(1);
+	}
+
+	/* write should not consume any pages */
+	write_fault_pages(addr, NR_HUGE_PAGES);
+	validate_free_pages(free_hugepages - NR_HUGE_PAGES);
+
+	/* madvise should not free any pages */
+	if (madvise(addr, NR_HUGE_PAGES * huge_page_size, MADV_DONTNEED)) {
+		perror("madvise");
+		exit(1);
+	}
+	validate_free_pages(free_hugepages - NR_HUGE_PAGES);
+
+	/*
+	 * Test MADV_REMOVE on shared mapping of hugetlb file
+	 *
+	 * madvise is same as hole punch and should free all pages.
+	 */
+	if (madvise(addr, NR_HUGE_PAGES * huge_page_size, MADV_REMOVE)) {
+		perror("madvise");
+		exit(1);
+	}
+	validate_free_pages(free_hugepages);
+	(void)munmap(addr, NR_HUGE_PAGES * huge_page_size);
+
+	/*
+	 * Test MADV_REMOVE on shared and private mapping of hugetlb file
+	 */
+	if (fallocate(fd, 0, 0, NR_HUGE_PAGES * huge_page_size)) {
+		perror("fallocate");
+		exit(1);
+	}
+	validate_free_pages(free_hugepages - NR_HUGE_PAGES);
+
+	addr = mmap(NULL, NR_HUGE_PAGES * huge_page_size,
+			PROT_READ | PROT_WRITE,
+			MAP_SHARED, fd, 0);
+	if (addr == MAP_FAILED) {
+		perror("mmap");
+		exit(1);
+	}
+
+	/* shared write should not consume any additional pages */
+	write_fault_pages(addr, NR_HUGE_PAGES);
+	validate_free_pages(free_hugepages - NR_HUGE_PAGES);
+
+	addr2 = mmap(NULL, NR_HUGE_PAGES * huge_page_size,
+			PROT_READ | PROT_WRITE,
+			MAP_PRIVATE, fd, 0);
+	if (addr2 == MAP_FAILED) {
+		perror("mmap");
+		exit(1);
+	}
+
+	/* private read should not consume any pages */
+	read_fault_pages(addr2, NR_HUGE_PAGES);
+	validate_free_pages(free_hugepages - NR_HUGE_PAGES);
+
+	/* private write should consume additional pages */
+	write_fault_pages(addr2, NR_HUGE_PAGES);
+	validate_free_pages(free_hugepages - (2 * NR_HUGE_PAGES));
+
+	/* madvise of shared mapping should not free any pages */
+	if (madvise(addr, NR_HUGE_PAGES * huge_page_size, MADV_DONTNEED)) {
+		perror("madvise");
+		exit(1);
+	}
+	validate_free_pages(free_hugepages - (2 * NR_HUGE_PAGES));
+
+	/* madvise of private mapping should free private pages */
+	if (madvise(addr2, NR_HUGE_PAGES * huge_page_size, MADV_DONTNEED)) {
+		perror("madvise");
+		exit(1);
+	}
+	validate_free_pages(free_hugepages - NR_HUGE_PAGES);
+
+	/* private write should consume additional pages again */
+	write_fault_pages(addr2, NR_HUGE_PAGES);
+	validate_free_pages(free_hugepages - (2 * NR_HUGE_PAGES));
+
+	/*
+	 * madvise should free both file and private pages although this is
+	 * not correct.  private pages should not be freed, but this is
+	 * expected.  See comment associated with FALLOC_FL_PUNCH_HOLE call.
+	 */
+	if (madvise(addr, NR_HUGE_PAGES * huge_page_size, MADV_REMOVE)) {
+		perror("madvise");
+		exit(1);
+	}
+	validate_free_pages(free_hugepages);
+
+	(void)munmap(addr, NR_HUGE_PAGES * huge_page_size);
+	(void)munmap(addr2, NR_HUGE_PAGES * huge_page_size);
+
+	close(fd);
+	unlink(argv[1]);
+	return 0;
+}
diff --git a/tools/testing/selftests/vm/run_vmtests.sh b/tools/testing/selftests/vm/run_vmtests.sh
index e09040a3dc08..ff901ee66838 100755
--- a/tools/testing/selftests/vm/run_vmtests.sh
+++ b/tools/testing/selftests/vm/run_vmtests.sh
@@ -130,6 +130,18 @@ else
 	echo "[PASS]"
 fi
 
+echo "-----------------------"
+echo "running hugetlb-madvise"
+echo "-----------------------"
+./hugetlb-madvise $mnt/madvise-test
+if [ $? -ne 0 ]; then
+	echo "[FAIL]"
+	exitcode=1
+else
+	echo "[PASS]"
+fi
+rm -f $mnt/madvise-test
+
 echo "NOTE: The above hugetlb tests provide minimal coverage.  Use"
 echo "      https://github.com/libhugetlbfs/libhugetlbfs.git for"
 echo "      hugetlb regression testing."
-- 
2.34.1



* [PATCH v2 3/3] userfaultfd/selftests: enable huegtlb remap and remove event testing
  2022-02-02  1:40 [PATCH v2 0/3] Add hugetlb MADV_DONTNEED support Mike Kravetz
  2022-02-02  1:40 ` [PATCH v2 1/3] mm: enable MADV_DONTNEED for hugetlb mappings Mike Kravetz
  2022-02-02  1:40 ` [PATCH v2 2/3] selftests/vm: add hugetlb madvise MADV_DONTNEED MADV_REMOVE test Mike Kravetz
@ 2022-02-02  1:40 ` Mike Kravetz
  2022-02-02  6:11   ` Mike Rapoport
  2 siblings, 1 reply; 17+ messages in thread
From: Mike Kravetz @ 2022-02-02  1:40 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Naoya Horiguchi, David Hildenbrand, Axel Rasmussen, Mina Almasry,
	Michal Hocko, Peter Xu, Andrea Arcangeli, Shuah Khan,
	Andrew Morton, Mike Kravetz

With MADV_DONTNEED support added to hugetlb mappings, mremap testing
can also be enabled for hugetlb.

Modify the tests to use madvise MADV_DONTNEED and MADV_REMOVE instead of
fallocate hole punch for releasing hugetlb pages.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
---
 tools/testing/selftests/vm/run_vmtests.sh |  3 +-
 tools/testing/selftests/vm/userfaultfd.c  | 69 ++++++++++++-----------
 2 files changed, 36 insertions(+), 36 deletions(-)

diff --git a/tools/testing/selftests/vm/run_vmtests.sh b/tools/testing/selftests/vm/run_vmtests.sh
index ff901ee66838..7178d6fa8c92 100755
--- a/tools/testing/selftests/vm/run_vmtests.sh
+++ b/tools/testing/selftests/vm/run_vmtests.sh
@@ -207,14 +207,13 @@ echo "running userfaultfd_hugetlb"
 echo "---------------------------"
 # Test requires source and destination huge pages.  Size of source
 # (half_ufd_size_MB) is passed as argument to test.
-./userfaultfd hugetlb $half_ufd_size_MB 32 $mnt/ufd_test_file
+./userfaultfd hugetlb $half_ufd_size_MB 32
 if [ $? -ne 0 ]; then
 	echo "[FAIL]"
 	exitcode=1
 else
 	echo "[PASS]"
 fi
-rm -f $mnt/ufd_test_file
 
 echo "-------------------------"
 echo "running userfaultfd_shmem"
diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index d2480ab93037..c18beebc0dd6 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -88,7 +88,6 @@ static bool test_uffdio_minor = false;
 static bool map_shared;
 static int shm_fd;
 static int huge_fd;
-static char *huge_fd_off0;
 static unsigned long long *count_verify;
 static int uffd = -1;
 static int uffd_flags, finished, *pipefd;
@@ -127,9 +126,9 @@ const char *examples =
     "./userfaultfd anon 100 99999\n\n"
     "# Run share memory test on 1GiB region with 99 bounces:\n"
     "./userfaultfd shmem 1000 99\n\n"
-    "# Run hugetlb memory test on 256MiB region with 50 bounces (using /dev/hugepages/hugefile):\n"
-    "./userfaultfd hugetlb 256 50 /dev/hugepages/hugefile\n\n"
-    "# Run the same hugetlb test but using shmem:\n"
+    "# Run hugetlb memory test on 256MiB region with 50 bounces:\n"
+    "./userfaultfd hugetlb 256 50\n\n"
+    "# Run the same hugetlb test but using shared file:\n"
     "./userfaultfd hugetlb_shared 256 50 /dev/hugepages/hugefile\n\n"
     "# 10MiB-~6GiB 999 bounces anonymous test, "
     "continue forever unless an error triggers\n"
@@ -226,10 +225,13 @@ static void noop_alias_mapping(__u64 *start, size_t len, unsigned long offset)
 
 static void hugetlb_release_pages(char *rel_area)
 {
-	if (fallocate(huge_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
-		      rel_area == huge_fd_off0 ? 0 : nr_pages * page_size,
-		      nr_pages * page_size))
-		err("fallocate() failed");
+	if (!map_shared) {
+		if (madvise(rel_area, nr_pages * page_size, MADV_DONTNEED))
+			err("madvise(MADV_DONTNEED) failed");
+	} else {
+		if (madvise(rel_area, nr_pages * page_size, MADV_REMOVE))
+			err("madvise(MADV_REMOVE) failed");
+	}
 }
 
 static void hugetlb_allocate_area(void **alloc_area)
@@ -237,26 +239,37 @@ static void hugetlb_allocate_area(void **alloc_area)
 	void *area_alias = NULL;
 	char **alloc_area_alias;
 
-	*alloc_area = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE,
-			   (map_shared ? MAP_SHARED : MAP_PRIVATE) |
-			   MAP_HUGETLB |
-			   (*alloc_area == area_src ? 0 : MAP_NORESERVE),
-			   huge_fd, *alloc_area == area_src ? 0 :
-			   nr_pages * page_size);
+	if (!map_shared)
+		*alloc_area = mmap(NULL,
+			nr_pages * page_size,
+			PROT_READ | PROT_WRITE,
+			MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
+				(*alloc_area == area_src ? 0 : MAP_NORESERVE),
+			-1,
+			0);
+	else
+		*alloc_area = mmap(NULL,
+			nr_pages * page_size,
+			PROT_READ | PROT_WRITE,
+			MAP_SHARED |
+				(*alloc_area == area_src ? 0 : MAP_NORESERVE),
+			huge_fd,
+			*alloc_area == area_src ? 0 : nr_pages * page_size);
 	if (*alloc_area == MAP_FAILED)
 		err("mmap of hugetlbfs file failed");
 
 	if (map_shared) {
-		area_alias = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE,
-				  MAP_SHARED | MAP_HUGETLB,
-				  huge_fd, *alloc_area == area_src ? 0 :
-				  nr_pages * page_size);
+		area_alias = mmap(NULL,
+			nr_pages * page_size,
+			PROT_READ | PROT_WRITE,
+			MAP_SHARED,
+			huge_fd,
+			*alloc_area == area_src ? 0 : nr_pages * page_size);
 		if (area_alias == MAP_FAILED)
 			err("mmap of hugetlb file alias failed");
 	}
 
 	if (*alloc_area == area_src) {
-		huge_fd_off0 = *alloc_area;
 		alloc_area_alias = &area_src_alias;
 	} else {
 		alloc_area_alias = &area_dst_alias;
@@ -269,12 +282,7 @@ static void hugetlb_alias_mapping(__u64 *start, size_t len, unsigned long offset
 {
 	if (!map_shared)
 		return;
-	/*
-	 * We can't zap just the pagetable with hugetlbfs because
-	 * MADV_DONTEED won't work. So exercise -EEXIST on a alias
-	 * mapping where the pagetables are not established initially,
-	 * this way we'll exercise the -EEXEC at the fs level.
-	 */
+
 	*start = (unsigned long) area_dst_alias + offset;
 }
 
@@ -427,7 +435,6 @@ static void uffd_test_ctx_clear(void)
 		uffd = -1;
 	}
 
-	huge_fd_off0 = NULL;
 	munmap_area((void **)&area_src);
 	munmap_area((void **)&area_src_alias);
 	munmap_area((void **)&area_dst);
@@ -925,10 +932,7 @@ static int faulting_process(int signal_test)
 	struct sigaction act;
 	unsigned long signalled = 0;
 
-	if (test_type != TEST_HUGETLB)
-		split_nr_pages = (nr_pages + 1) / 2;
-	else
-		split_nr_pages = nr_pages;
+	split_nr_pages = (nr_pages + 1) / 2;
 
 	if (signal_test) {
 		sigbuf = &jbuf;
@@ -985,9 +989,6 @@ static int faulting_process(int signal_test)
 	if (signal_test)
 		return signalled != split_nr_pages;
 
-	if (test_type == TEST_HUGETLB)
-		return 0;
-
 	area_dst = mremap(area_dst, nr_pages * page_size,  nr_pages * page_size,
 			  MREMAP_MAYMOVE | MREMAP_FIXED, area_src);
 	if (area_dst == MAP_FAILED)
@@ -1670,7 +1671,7 @@ int main(int argc, char **argv)
 	}
 	nr_pages = nr_pages_per_cpu * nr_cpus;
 
-	if (test_type == TEST_HUGETLB) {
+	if (test_type == TEST_HUGETLB && map_shared) {
 		if (argc < 5)
 			usage();
 		huge_fd = open(argv[4], O_CREAT | O_RDWR, 0755);
-- 
2.34.1



* Re: [PATCH v2 3/3] userfaultfd/selftests: enable huegtlb remap and remove event testing
  2022-02-02  1:40 ` [PATCH v2 3/3] userfaultfd/selftests: enable huegtlb remap and remove event testing Mike Kravetz
@ 2022-02-02  6:11   ` Mike Rapoport
  0 siblings, 0 replies; 17+ messages in thread
From: Mike Rapoport @ 2022-02-02  6:11 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Naoya Horiguchi, David Hildenbrand,
	Axel Rasmussen, Mina Almasry, Michal Hocko, Peter Xu,
	Andrea Arcangeli, Shuah Khan, Andrew Morton

On Tue, Feb 01, 2022 at 05:40:34PM -0800, Mike Kravetz wrote:
> Subject: [PATCH v2 3/3] userfaultfd/selftests: enable huegtlb remap and remove event testing

Nit:                                                  ^ hugetlb

> [...]

-- 
Sincerely yours,
Mike.


* Re: [PATCH v2 1/3] mm: enable MADV_DONTNEED for hugetlb mappings
  2022-02-02  1:40 ` [PATCH v2 1/3] mm: enable MADV_DONTNEED for hugetlb mappings Mike Kravetz
@ 2022-02-02  8:14   ` David Hildenbrand
  2022-02-02 19:32     ` Mike Kravetz
  2022-02-10  3:21   ` Peter Xu
  1 sibling, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2022-02-02  8:14 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: Naoya Horiguchi, Axel Rasmussen, Mina Almasry, Michal Hocko,
	Peter Xu, Andrea Arcangeli, Shuah Khan, Andrew Morton

On 02.02.22 02:40, Mike Kravetz wrote:
> MADV_DONTNEED is currently disabled for hugetlb mappings.  This
> certainly makes sense in shared file mappings as the pagecache maintains
> a reference to the page and it will never be freed.  However, it could
> be useful to unmap and free pages in private mappings.
> 
> The only thing preventing MADV_DONTNEED from working on hugetlb mappings
> is a check in can_madv_lru_vma().  To allow support for hugetlb mappings
> create and use a new routine madvise_dontneed_free_valid_vma() that will
> allow hugetlb mappings.  Also, before calling zap_page_range in the
> DONTNEED case align start and size to huge page size for hugetlb vmas.
> madvise only requires PAGE_SIZE alignment, but the hugetlb unmap routine
> requires huge page size alignment.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  mm/madvise.c | 24 ++++++++++++++++++++++--
>  1 file changed, 22 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 5604064df464..7ae891e030a4 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -796,10 +796,30 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
>  static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
>  					unsigned long start, unsigned long end)
>  {
> +	/*
> +	 * start and size (end - start) must be huge page size aligned
> +	 * for hugetlb vmas.
> +	 */
> +	if (is_vm_hugetlb_page(vma)) {
> +		struct hstate *h = hstate_vma(vma);
> +
> +		start = ALIGN_DOWN(start, huge_page_size(h));
> +		end = ALIGN(end, huge_page_size(h));

So you effectively extend the range silently. IIUC, if someone would zap
a 4k range you would implicitly zap a whole 2M page and effectively zero
out more data than requested.


Looking at do_madvise(), we:
(1) reject start addresses that are not page-aligned
(2) shrink lengths that are not page-aligned and refuse if it turns 0

The man page documents (1) but doesn't really document (2).

Naturally I'd have assume that we apply the same logic to huge page
sizes and documenting it in the man page accordingly.


Why did you decide to extend the range? I'd assume MADV_REMOVE behaves
like FALLOC_FL_PUNCH_HOLE:
  "Within the specified range, partial filesystem blocks are zeroed, and
   whole filesystem blocks are removed from the file.  After a
   successful call, subsequent reads from  this  range will return
   zeros."
So we don't "discard more than requested".


I see the following possible alternatives:
(a) Fail if the range is not aligned
-> Clear semantics
(b) Fail if the start is not aligned, shrink the end if required
-> Same rules as for PAGE_SIZE
(c) Zero out the requested part
-> Same semantics as FALLOC_FL_PUNCH_HOLE.

My preference would be a), properly documenting it in the man page.
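
A rough sketch of what a) could look like -- the helper name and placement
are hypothetical, just to illustrate the semantics:

static bool madvise_hugetlb_range_valid(struct vm_area_struct *vma,
					unsigned long start, unsigned long end)
{
	struct hstate *h = hstate_vma(vma);

	/* a): refuse anything not huge page size aligned -> -EINVAL */
	return IS_ALIGNED(start, huge_page_size(h)) &&
	       IS_ALIGNED(end, huge_page_size(h));
}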

-- 
Thanks,

David / dhildenb



* Re: [PATCH v2 1/3] mm: enable MADV_DONTNEED for hugetlb mappings
  2022-02-02  8:14   ` David Hildenbrand
@ 2022-02-02 19:32     ` Mike Kravetz
  2022-02-04  8:35       ` David Hildenbrand
  0 siblings, 1 reply; 17+ messages in thread
From: Mike Kravetz @ 2022-02-02 19:32 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm, linux-kernel
  Cc: Naoya Horiguchi, Axel Rasmussen, Mina Almasry, Michal Hocko,
	Peter Xu, Andrea Arcangeli, Shuah Khan, Andrew Morton

On 2/2/22 00:14, David Hildenbrand wrote:
> On 02.02.22 02:40, Mike Kravetz wrote:
>> MADV_DONTNEED is currently disabled for hugetlb mappings.  This
>> certainly makes sense in shared file mappings as the pagecache maintains
>> a reference to the page and it will never be freed.  However, it could
>> be useful to unmap and free pages in private mappings.
>>
>> The only thing preventing MADV_DONTNEED from working on hugetlb mappings
>> is a check in can_madv_lru_vma().  To allow support for hugetlb mappings
>> create and use a new routine madvise_dontneed_free_valid_vma() that will
>> allow hugetlb mappings.  Also, before calling zap_page_range in the
>> DONTNEED case align start and size to huge page size for hugetlb vmas.
>> madvise only requires PAGE_SIZE alignment, but the hugetlb unmap routine
>> requires huge page size alignment.
>>
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>> ---
>>  mm/madvise.c | 24 ++++++++++++++++++++++--
>>  1 file changed, 22 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 5604064df464..7ae891e030a4 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -796,10 +796,30 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
>>  static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
>>  					unsigned long start, unsigned long end)
>>  {
>> +	/*
>> +	 * start and size (end - start) must be huge page size aligned
>> +	 * for hugetlb vmas.
>> +	 */
>> +	if (is_vm_hugetlb_page(vma)) {
>> +		struct hstate *h = hstate_vma(vma);
>> +
>> +		start = ALIGN_DOWN(start, huge_page_size(h));
>> +		end = ALIGN(end, huge_page_size(h));
> 
> So you effectively extend the range silently. IIUC, if someone would zap
> a 4k range you would implicitly zap a whole 2M page and effectively zero
> out more data than requested.
> 
> 
> Looking at do_madvise(), we:
> (1) reject start addresses that are not page-aligned
> (2) shrink lengths that are not page-aligned and refuse if it turns 0

I believe length is extended (rounded up) by this line:
	len = PAGE_ALIGN(len_in);

but, I see your point.

> The man page documents (1) but doesn't really document (2).
> 
> Naturally I'd have assume that we apply the same logic to huge page
> sizes and documenting it in the man page accordingly.
> 
> 
> Why did you decide to extend the range? I'd assume MADV_REMOVE behaves
> like FALLOC_FL_PUNCH_HOLE:
>   "Within the specified range, partial filesystem blocks are zeroed, and
>    whole filesystem blocks are removed from the file.  After a
>    successful call, subsequent reads from  this  range will return
>    zeros."
> So we don't "discard more than requested".

Well.  hugetlbfs does not follow the man page. :(  It does not zero
partial blocks.  I assume a filesystem block would be a huge page.
Instead it does,

        /*
         * For hole punch round up the beginning offset of the hole and
         * round down the end.
         */
        hole_start = round_up(offset, hpage_size);
        hole_end = round_down(offset + len, hpage_size);

So, not only is this patch not following the man page.  It is not even
following the existing MADV_REMOVE hugetlb code.  Thanks for pointing
that out.  Part of my reason for adding this functionality was to make
hugetlb be more like 'normal' memory.  I clearly failed.

Related comment about madvise man page for PAGE_SIZE MADV_REMOVE.  The man
page says.

       MADV_REMOVE (since Linux 2.6.16)
              Free up a given range of pages and its associated backing store.
              This is equivalent to punching a hole in the corresponding  byte
              range  of  the backing store (see fallocate(2)).  Subsequent ac‐
              cesses in the specified address range will see bytes  containing
              zero.

This may need some clarification.  It says it will free pages.  We know
madvise only operates on pages (PAGE_ALIGN(len)).  Yet, the statement about
equivalent to a fallocate byte range may lead one to believe that length is
treated the same in madvise and fallocate.

> I see the following possible alternatives:
> (a) Fail if the range is not aligned
> -> Clear semantics
> (b) Fail if the start is not aligned, shrink the end if required
> -> Same rules as for PAGE_SIZE
> (c) Zero out the requested part
> -> Same semantics as FALLOC_FL_PUNCH_HOLE.
> 
> My preference would be a), properly documenting it in the man page.

However, a) would make hugetlb behave differently than other memory as
len does not need to be aligned.

I would prefer b) as it is more in line with PAGE_SIZE.  But, that does
make it different than MADV_REMOVE hugetlb alignment.

I thought this was simple. :)
-- 
Mike Kravetz


* Re: [PATCH v2 1/3] mm: enable MADV_DONTNEED for hugetlb mappings
  2022-02-02 19:32     ` Mike Kravetz
@ 2022-02-04  8:35       ` David Hildenbrand
  2022-02-07 23:47         ` Mike Kravetz
  0 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2022-02-04  8:35 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: Naoya Horiguchi, Axel Rasmussen, Mina Almasry, Michal Hocko,
	Peter Xu, Andrea Arcangeli, Shuah Khan, Andrew Morton

>>> +	/*
>>> +	 * start and size (end - start) must be huge page size aligned
>>> +	 * for hugetlb vmas.
>>> +	 */
>>> +	if (is_vm_hugetlb_page(vma)) {
>>> +		struct hstate *h = hstate_vma(vma);
>>> +
>>> +		start = ALIGN_DOWN(start, huge_page_size(h));
>>> +		end = ALIGN(end, huge_page_size(h));
>>
>> So you effectively extend the range silently. IIUC, if someone would zap
>> a 4k range you would implicitly zap a whole 2M page and effectively zero
>> out more data than requested.
>>
>>
>> Looking at do_madvise(), we:
>> (1) reject start addresses that are not page-aligned
>> (2) shrink lengths that are not page-aligned and refuse if it turns 0
> 
> I believe length is extended (rounded up) by this line:
> 	len = PAGE_ALIGN(len_in);

Ah, right. I was confused by the "!len" check below that, but the
comment explains how this applies to negative values only.

> 
> but, I see your point.
> 
>> The man page documents (1) but doesn't really document (2).
>>
>> Naturally I'd have assume that we apply the same logic to huge page
>> sizes and documenting it in the man page accordingly.
>>
>>
>> Why did you decide to extend the range? I'd assume MADV_REMOVE behaves
>> like FALLOC_FL_PUNCH_HOLE:
>>   "Within the specified range, partial filesystem blocks are zeroed, and
>>    whole filesystem blocks are removed from the file.  After a
>>    successful call, subsequent reads from  this  range will return
>>    zeros."
>> So we don't "discard more than requested".
> 
> Well.  hugetlbfs does not follow the man page. :(  It does not zero
> partial blocks.  I assume a filesystem block would be a huge page.
> Instead it does,
> 
>         /*
>          * For hole punch round up the beginning offset of the hole and
>          * round down the end.
>          */
>         hole_start = round_up(offset, hpage_size);
>         hole_end = round_down(offset + len, hpage_size);

Okay, so we skip any zeroing and only free completely covered blocks. We
might want to document that behavior. See below.

> 
> So, not only is this patch not following the man page.  It is not even
> following the existing MADV_REMOVE hugetlb code.  Thanks for pointing
> that out.  Part of my reason for adding this functionality was to make
> hugetlb be more like 'normal' memory.  I clearly failed.

:)

> 
> Related comment about madvise man page for PAGE_SIZE MADV_REMOVE.  The man
> page says.
> 
>        MADV_REMOVE (since Linux 2.6.16)
>               Free up a given range of pages and its associated backing store.
>               This is equivalent to punching a hole in the corresponding  byte
>               range  of  the backing store (see fallocate(2)).  Subsequent ac‐
>               cesses in the specified address range will see bytes  containing
>               zero.
> 
> This may need some clarification.  It says it will free pages.  We know
> madvise only operates on pages (PAGE_ALIGN(len)).  Yet, the statement about
> equivalent to a fallocate byte range may lead one to believe that length is
> treated the same in madvise and fallocate.

Yes

> 
>> I see the following possible alternatives:
>> (a) Fail if the range is not aligned
>> -> Clear semantics
>> (b) Fail if the start is not aligned, shrink the end if required
>> -> Same rules as for PAGE_SIZE
>> (c) Zero out the requested part
>> -> Same semantics as FALLOC_FL_PUNCH_HOLE.
>>
>> My preference would be a), properly documenting it in the man page.
> 
> However, a) would make hugetlb behave differently than other memory as
> len does not need to be aligned.
> 
> I would prefer b) as it is more in line with PAGE_SIZE.  But, that does
> make it different than MADV_REMOVE hugetlb alignment.
> 
> I thought this was simple. :)

It really bugs me that it's under-specified what's supposed to happen
when the length is not aligned.

BUT: in the posix world, "calling posix_madvise() shall not affect the
semantics of access to memory in the specified range". So we don't care
too much about if we align up/down, because it wouldn't affect the
semantics. Especially for MADV_DONTNEED/MADV_REMOVE as implemented by
Linux this is certainly different and the alignment handling matters.

So I guess especially for MADV_DONTNEED/MADV_REMOVE we need a clear
specification what's supposed to happen if the length falls into the
middle of a huge page. We should document alignment handling for
madvise() in general I assume.

IMHO we should have bailed out right from the start whenever something
is not properly aligned, but that ship has sailed. So I agree, maybe we
can make at least hugetlb MADV_DONTNEED obey the same (weird) rules as
ordinary pages.

So b) would mean, requiring start to be hugepage aligned and aligning-up
the end. Still feels wrong but at least matches existing semantics.

Hugetlb MADV_REMOVE semantics are unfortunate and we should document the
exception.

-- 
Thanks,

David / dhildenb



* Re: [PATCH v2 1/3] mm: enable MADV_DONTNEED for hugetlb mappings
  2022-02-04  8:35       ` David Hildenbrand
@ 2022-02-07 23:47         ` Mike Kravetz
  2022-02-10 13:09           ` David Hildenbrand
  0 siblings, 1 reply; 17+ messages in thread
From: Mike Kravetz @ 2022-02-07 23:47 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm, linux-kernel
  Cc: Naoya Horiguchi, Axel Rasmussen, Mina Almasry, Michal Hocko,
	Peter Xu, Andrea Arcangeli, Shuah Khan, Andrew Morton

On 2/4/22 00:35, David Hildenbrand wrote:
>> I thought this was simple. :)
> 
> It really bugs me that it's under-specified what's supposed to happen
> when the length is not aligned.
> 
> BUT: in the posix world, "calling posix_madvise() shall not affect the
> semantics of access to memory in the specified range". So we don't care
> too much about if we align up/down, because it wouldn't affect the
> semantics. Especially for MADV_DONTNEED/MADV_REMOVE as implemented by
> Linux this is certainly different and the alignment handling matters.
> 
> So I guess especially for MADV_DONTNEED/MADV_REMOVE we need a clear
> specification what's supposed to happen if the length falls into the
> middle of a huge page. We should document alignment handling for
> madvise() in general I assume.
> 
> IMHO we should have bailed out right from the start whenever something
> is not properly aligned, but that ship has sailed. So I agree, maybe we
> can make at least hugetlb MADV_DONTNEED obey the same (weird) rules as
> ordinary pages.
> 
> So b) would mean, requiring start to be hugepage aligned and aligning-up
> the end. Still feels wrong but at least matches existing semantics.
> 
> Hugetlb MADV_REMOVE semantics are unfortunate and we should document the
> exception.

Thank you for all your comments David!

So, my plan was to make MADV_DONTNEED behave as described above:
- EINVAL if start address not huge page size aligned
- Align end/length up to huge page size.

The code I had for this was very specific to MADV_DONTNEED.  I then thought,
why not do the same for MADV_REMOVE as well?  Or even more general, add this
check and alignment to the vma parsing code in madvise.

It was then that I realized there are several madvise behaviors that take
non-huge page size aligned addresses for hugetlb mappings today.  Making
huge page size alignment a requirement for all madvise behaviors could break
existing code.  So, it seems like it could only be added to MADV_DONTNEED as
this functionality does not exist today.  We then end up with MADV_DONTNEED
as the only behavior requiring huge page size alignment for hugetlb mappings.
Sigh!!!

I am now rethinking the decision to proceed with b) as described above.

With the exception of MADV_REMOVE (which we may be able to change for
hugetlb),  madvise operations operate on huge page size pages for hugetlb
mappings.  If start address is in the middle of a hugetlb page, we essentially
align down to the beginning of the hugetlb page.  If length lands in the
middle of a hugetlb page, we essentially round up.

When adding MADV_REMOVE perhaps we should go with this align down start and
align up end strategy that is used everywhere else?  I really wish we could
go back and change things, but as you know it is too late for that.
-- 
Mike Kravetz


* Re: [PATCH v2 1/3] mm: enable MADV_DONTNEED for hugetlb mappings
  2022-02-02  1:40 ` [PATCH v2 1/3] mm: enable MADV_DONTNEED for hugetlb mappings Mike Kravetz
  2022-02-02  8:14   ` David Hildenbrand
@ 2022-02-10  3:21   ` Peter Xu
  2022-02-10 21:36     ` Mike Kravetz
  1 sibling, 1 reply; 17+ messages in thread
From: Peter Xu @ 2022-02-10  3:21 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Naoya Horiguchi, David Hildenbrand,
	Axel Rasmussen, Mina Almasry, Michal Hocko, Andrea Arcangeli,
	Shuah Khan, Andrew Morton

(Sorry for the late comment)

On Tue, Feb 01, 2022 at 05:40:32PM -0800, Mike Kravetz wrote:
> MADV_DONTNEED is currently disabled for hugetlb mappings.  This
> certainly makes sense in shared file mappings as the pagecache maintains
> a reference to the page and it will never be freed.  However, it could
> be useful to unmap and free pages in private mappings.
> 
> The only thing preventing MADV_DONTNEED from working on hugetlb mappings
> is a check in can_madv_lru_vma().  To allow support for hugetlb mappings
> create and use a new routine madvise_dontneed_free_valid_vma() that will
> allow hugetlb mappings.  Also, before calling zap_page_range in the
> DONTNEED case align start and size to huge page size for hugetlb vmas.
> madvise only requires PAGE_SIZE alignment, but the hugetlb unmap routine
> requires huge page size alignment.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  mm/madvise.c | 24 ++++++++++++++++++++++--
>  1 file changed, 22 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 5604064df464..7ae891e030a4 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -796,10 +796,30 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
>  static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
>  					unsigned long start, unsigned long end)
>  {
> +	/*
> +	 * start and size (end - start) must be huge page size aligned
> +	 * for hugetlb vmas.
> +	 */
> +	if (is_vm_hugetlb_page(vma)) {
> +		struct hstate *h = hstate_vma(vma);
> +
> +		start = ALIGN_DOWN(start, huge_page_size(h));
> +		end = ALIGN(end, huge_page_size(h));
> +	}
> +

Maybe check the alignment before userfaultfd_remove()?  Otherwise there'll be a
fake message generated to the tracer.

>  	zap_page_range(vma, start, end - start);
>  	return 0;
>  }
>  
> +static bool madvise_dontneed_free_valid_vma(struct vm_area_struct *vma,
> +						int behavior)
> +{
> +	if (is_vm_hugetlb_page(vma))
> +		return behavior == MADV_DONTNEED;
> +	else
> +		return can_madv_lru_vma(vma);
> +}

can_madv_lru_vma() will check hugetlb again, which looks a bit weird.  Would it
look better to write it as:

madvise_dontneed_free_valid_vma()
{
    return !(vma->vm_flags & (VM_LOCKED|VM_PFNMAP));
}

can_madv_lru_vma()
{
    return madvise_dontneed_free_valid_vma() && !is_vm_hugetlb_page(vma);
}

?

Another use case of DONTNEED upon hugetlbfs could be uffd-minor, because afaiu
this is the only api that can force strip the hugetlb mapped pgtable without
losing pagecache data.
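
For example (sketch only; assumes addr/len is a MAP_SHARED hugetlbfs
mapping already registered with UFFDIO_REGISTER_MODE_MINOR):

	/* Zap the page tables; the page cache keeps the data. */
	if (madvise(addr, len, MADV_DONTNEED))
		err("madvise(MADV_DONTNEED)");

	/*
	 * The next touch raises a minor fault that can be resolved with
	 * UFFDIO_CONTINUE, without losing the file contents.
	 */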

Thanks,

-- 
Peter Xu



* Re: [PATCH v2 1/3] mm: enable MADV_DONTNEED for hugetlb mappings
  2022-02-07 23:47         ` Mike Kravetz
@ 2022-02-10 13:09           ` David Hildenbrand
  2022-02-10 22:11             ` Mike Kravetz
  0 siblings, 1 reply; 17+ messages in thread
From: David Hildenbrand @ 2022-02-10 13:09 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: Naoya Horiguchi, Axel Rasmussen, Mina Almasry, Michal Hocko,
	Peter Xu, Andrea Arcangeli, Shuah Khan, Andrew Morton

On 08.02.22 00:47, Mike Kravetz wrote:
> On 2/4/22 00:35, David Hildenbrand wrote:
>>> I thought this was simple. :)
>>
>> It really bugs me that it's under-specified what's supposed to happen
>> when the length is not aligned.
>>
>> BUT: in the posix world, "calling posix_madvise() shall not affect the
>> semantics of access to memory in the specified range". So we don't care
>> too much about if we align up/down, because it wouldn't affect the
>> semantics. Especially for MADV_DONTNEED/MADV_REMOVE as implemented by
>> Linux this is certainly different and the alignment handling matters.
>>
>> So I guess especially for MADV_DONTNEED/MADV_REMOVE we need a clear
>> specification what's supposed to happen if the length falls into the
>> middle of a huge page. We should document alignment handling for
>> madvise() in general I assume.
>>
>> IMHO we should have bailed out right from the start whenever something
>> is not properly aligned, but that ship has sailed. So I agree, maybe we
>> can make at least hugetlb MADV_DONTNEED obey the same (weird) rules as
>> ordinary pages.
>>
>> So b) would mean requiring start to be hugepage aligned and aligning up
>> the end. Still feels wrong, but at least matches existing semantics.
>>
>> Hugetlb MADV_REMOVE semantics are unfortunate and we should document the
>> exception.
> 
> Thank you for all your comments David!
> 
> So, my plan was to make MADV_DONTNEED behave as described above:
> - EINVAL if start address not huge page size aligned
> - Align end/length up to huge page size.
> 
> The code I had for this was very specific to MADV_DONTNEED.  I then thought,
> why not do the same for MADV_REMOVE as well?  Or even more general, add this
> check and alignment to the vma parsing code in madvise.
> 
> It was then that I realized there are several madvise behaviors that take
> non-huge page size aligned addresses for hugetlb mappings today.  Making
> huge page size alignment a requirement for all madvise behaviors could break
> existing code.  So, it seems like it could only be added to MADV_DONTNEED as
> this functionality does not exist today.  We then end up with MADV_DONTNEED
> as the only behavior requiring huge page size alignment for hugetlb mappings.
> Sigh!!!

:/

> 
> I am now rethinking the decision to proceed with b) as described above.
> 
> With the exception of MADV_REMOVE (which we may be able to change for
> hugetlb),  madvise operations operate on huge page size pages for hugetlb
> mappings.  If start address is in the middle of a hugetlb page, we essentially
> align down to the beginning of the hugetlb page.  If length lands in the
> middle of a hugetlb page, we essentially round up.

Which MADV calls would be affected?

The "bad" thing about MADV_DONTNEED and MADV_REMOVE are that they
destroy data, which is why they heavily diverge from the original posix
madvise odea.

> 
> When adding MADV_REMOVE perhaps we should go with this align down start and
> align up end strategy that is used everywhere else?  I really wish we could
> go back and change things, but as you know it is too late for that.

I assume whatever we do, we should document it at least cleanly in the
man page. Unfortunately, hugetlb is a gift that keeps on giving. Making
it at least somehow consistent, even if it's "hugetlb being consistent in
its own mess", would be preferable I guess.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/3] mm: enable MADV_DONTNEED for hugetlb mappings
  2022-02-10  3:21   ` Peter Xu
@ 2022-02-10 21:36     ` Mike Kravetz
  2022-02-11  2:28       ` Peter Xu
  0 siblings, 1 reply; 17+ messages in thread
From: Mike Kravetz @ 2022-02-10 21:36 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-mm, linux-kernel, Naoya Horiguchi, David Hildenbrand,
	Axel Rasmussen, Mina Almasry, Michal Hocko, Andrea Arcangeli,
	Shuah Khan, Andrew Morton

On 2/9/22 19:21, Peter Xu wrote:
> (Sorry for the late comment)

Thanks for taking a look.

> 
> On Tue, Feb 01, 2022 at 05:40:32PM -0800, Mike Kravetz wrote:
>> MADV_DONTNEED is currently disabled for hugetlb mappings.  This
>> certainly makes sense in shared file mappings as the pagecache maintains
>> a reference to the page and it will never be freed.  However, it could
>> be useful to unmap and free pages in private mappings.
>>
>> The only thing preventing MADV_DONTNEED from working on hugetlb mappings
>> is a check in can_madv_lru_vma().  To allow support for hugetlb mappings
>> create and use a new routine madvise_dontneed_free_valid_vma() that will
>> allow hugetlb mappings.  Also, before calling zap_page_range in the
>> DONTNEED case align start and size to huge page size for hugetlb vmas.
>> madvise only requires PAGE_SIZE alignment, but the hugetlb unmap routine
>> requires huge page size alignment.
>>
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>> ---
>>  mm/madvise.c | 24 ++++++++++++++++++++++--
>>  1 file changed, 22 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 5604064df464..7ae891e030a4 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -796,10 +796,30 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
>>  static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
>>  					unsigned long start, unsigned long end)
>>  {
>> +	/*
>> +	 * start and size (end - start) must be huge page size aligned
>> +	 * for hugetlb vmas.
>> +	 */
>> +	if (is_vm_hugetlb_page(vma)) {
>> +		struct hstate *h = hstate_vma(vma);
>> +
>> +		start = ALIGN_DOWN(start, huge_page_size(h));
>> +		end = ALIGN(end, huge_page_size(h));
>> +	}
>> +
> 
> Maybe check the alignment before userfaultfd_remove()?  Otherwise a
> spurious message will be generated for the tracer.

Yes, we should pass the aligned addresses to userfaultfd_remove.  We will
also need to potentially align again after the call.

> 
>>  	zap_page_range(vma, start, end - start);
>>  	return 0;
>>  }
>>  
>> +static bool madvise_dontneed_free_valid_vma(struct vm_area_struct *vma,
>> +						int behavior)
>> +{
>> +	if (is_vm_hugetlb_page(vma))
>> +		return behavior == MADV_DONTNEED;
>> +	else
>> +		return can_madv_lru_vma(vma);
>> +}
> 
> can_madv_lru_vma() will check hugetlb again which looks a bit weird.  Would it
> look better to write it as:
> 
> madvise_dontneed_free_valid_vma(vma)
> {
>     return !(vma->vm_flags & (VM_LOCKED|VM_PFNMAP));
> }
> 
> can_madv_lru_vma(vma)
> {
>     return madvise_dontneed_free_valid_vma(vma) && !is_vm_hugetlb_page(vma);
> }
> 
> ?

Yes, that would look better.

> 
> Another use case of DONTNEED upon hugetlbfs could be uffd-minor, because afaiu
> this is the only api that can force strip the hugetlb mapped pgtable without
> losing pagecache data.
> 

Correct.  However, I do not know if uffd-minor users would ever want to
do this.  Perhaps?
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/3] mm: enable MADV_DONTNEED for hugetlb mappings
  2022-02-10 13:09           ` David Hildenbrand
@ 2022-02-10 22:11             ` Mike Kravetz
  2022-02-11  8:43               ` David Hildenbrand
  0 siblings, 1 reply; 17+ messages in thread
From: Mike Kravetz @ 2022-02-10 22:11 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm, linux-kernel
  Cc: Naoya Horiguchi, Axel Rasmussen, Mina Almasry, Michal Hocko,
	Peter Xu, Andrea Arcangeli, Shuah Khan, Andrew Morton

On 2/10/22 05:09, David Hildenbrand wrote:
> On 08.02.22 00:47, Mike Kravetz wrote:
>> On 2/4/22 00:35, David Hildenbrand wrote:
>>>> I thought this was simple. :)
>>>
>>> It really bugs me that it's under-specified what's supposed to happen
>>> when the length is not aligned.
>>>
>>> BUT: in the POSIX world, "calling posix_madvise() shall not affect the
>>> semantics of access to memory in the specified range". So we don't care
>>> too much about whether we align up/down, because it wouldn't affect the
>>> semantics. Especially for MADV_DONTNEED/MADV_REMOVE as implemented by
>>> Linux, this is certainly different and the alignment handling matters.
>>>
>>> So I guess especially for MADV_DONTNEED/MADV_REMOVE we need a clear
>>> specification of what's supposed to happen if the length falls into the
>>> middle of a huge page. We should document alignment handling for
>>> madvise() in general I assume.
>>>
>>> IMHO we should have bailed out right from the start whenever something
>>> is not properly aligned, but that ship has sailed. So I agree, maybe we
>>> can make at least hugetlb MADV_DONTNEED obey the same (weird) rules as
>>> ordinary pages.
>>>
>>> So b) would mean requiring start to be hugepage aligned and aligning up
>>> the end. Still feels wrong, but at least matches existing semantics.
>>>
>>> Hugetlb MADV_REMOVE semantics are unfortunate and we should document the
>>> exception.
>>
>> Thank you for all your comments David!
>>
>> So, my plan was to make MADV_DONTNEED behave as described above:
>> - EINVAL if start address not huge page size aligned
>> - Align end/length up to huge page size.
>>
>> The code I had for this was very specific to MADV_DONTNEED.  I then thought,
>> why not do the same for MADV_REMOVE as well?  Or even more general, add this
>> check and alignment to the vma parsing code in madvise.
>>
>> It was then that I realized there are several madvise behaviors that take
>> non-huge page size aligned addresses for hugetlb mappings today.  Making
>> huge page size alignment a requirement for all madvise behaviors could break
>> existing code.  So, it seems like it could only be added to MADV_DONTNEED as
>> this functionality does not exist today.  We then end up with MADV_DONTNEED
>> as the only behavior requiring huge page size alignment for hugetlb mappings.
>> Sigh!!!
> 
> :/
> 
>>
>> I am now rethinking the decision to proceed with b) as described above.
>>
>> With the exception of MADV_REMOVE (which we may be able to change for
>> hugetlb),  madvise operations operate on huge page size pages for hugetlb
>> mappings.  If start address is in the middle of a hugetlb page, we essentially
>> align down to the beginning of the hugetlb page.  If length lands in the
>> middle of a hugetlb page, we essentially round up.
> 
> Which MADV calls would be affected?

Not sure I understand the question.  I was saying that madvise calls which
operate on hugetlb mappings today only operate on huge pages.  So, this is
essentially: align down the starting address and align up the end address.
For example consider the MADV_POPULATE calls you recently added.  They will
only fault in huge pages in a hugetlb vma.
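
Concretely (a userspace sketch, assuming a 2MB huge page size and that
addr is the start of a hugetlb mapping -- addr is a hypothetical name):

    /* names addr+4096 .. addr+12287, the middle of the first page */
    madvise(addr + 4096, 8192, MADV_POPULATE_READ);
    /*
     * Effect: the entire first 2MB page is faulted in, i.e. the start
     * was aligned down and the end aligned up to huge page boundaries.
     */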

> The "bad" thing about MADV_DONTNEED and MADV_REMOVE are that they
> destroy data, which is why they heavily diverge from the original posix
> madvise odea.

Hmmm.  That may be a good argument for strict alignment.  We do not want
to remove (or unmap) more than the user intended.  The munmap() system call
has such alignment requirements.

Darn!  I'm thinking that I should go back to requiring alignment for
MADV_DONTNEED.

There really is no 'right' answer.

>>
>> When adding MADV_REMOVE perhaps we should go with this align down start and
>> align up end strategy that is used everywhere else?  I really wish we could
>> go back and change things, but as you know it is too late for that.
> 
> I assume whatever we do, we should document it at least cleanly in the
> man page. Unfortunately, hugetlb is a gift that keeps on giving. Making
> it at least somehow consistent, even if it's "hugetlb being consistent in
> its own mess", would be preferable I guess.

Yes, more than anything we need to document behavior.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/3] mm: enable MADV_DONTNEED for hugetlb mappings
  2022-02-10 21:36     ` Mike Kravetz
@ 2022-02-11  2:28       ` Peter Xu
  2022-02-11 19:08         ` Axel Rasmussen
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Xu @ 2022-02-11  2:28 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Naoya Horiguchi, David Hildenbrand,
	Axel Rasmussen, Mina Almasry, Michal Hocko, Andrea Arcangeli,
	Shuah Khan, Andrew Morton

On Thu, Feb 10, 2022 at 01:36:57PM -0800, Mike Kravetz wrote:
> > Another use case of DONTNEED upon hugetlbfs could be uffd-minor, because afaiu
> > this is the only api that can force strip the hugetlb mapped pgtable without
> > losing pagecache data.
> 
> Correct.  However, I do not know if uffd-minor users would ever want to
> do this.  Perhaps?

My understanding is that before this patch, uffd-minor upon hugetlbfs
requires the huge file to be mapped twice: once to populate the content,
then we'll be able to trap MINOR faults via the other mapping.  Or we could
munmap() the range and remap it again at the same file offset to drop the
pgtables, I think. But that sounds tricky.  MINOR faults only work with the
pgtables dropped.

With DONTNEED upon hugetlbfs we can rely on one single mapping of the file,
because we can explicitly drop the pgtables of hugetlbfs files without any
other tricks.
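
In userspace that could look like this (sketch, error handling elided;
assumes addr/len describe a MAP_SHARED hugetlbfs mapping that is already
registered with UFFDIO_REGISTER_MODE_MINOR):

    memset(addr, 1, len);               /* populate the page cache   */
    madvise(addr, len, MADV_DONTNEED);  /* drop only the pgtables    */
    /*
     * The next touch of addr raises a MINOR fault, which the handler
     * resolves with UFFDIO_CONTINUE -- no second mapping needed.
     */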

However, I have no real use case for it.  Initially I thought it could be
useful for QEMU, because the QEMU migration routine runs in the same mm
context as the hypervisor, so by default it doesn't have two mappings of
the same guest memory.  If QEMU wants to leverage minor faults, DONTNEED
could help.

However, when I was measuring bitmap transfer (assuming that's what minor
faults could help with in QEMU's postcopy) some months ago, I found it's not
as slow as I thought at all.  Either I could have missed something, or we're
facing different problems from when uffd minor was first proposed by Axel.

This is probably getting too far off topic, though.  Let me go back.

That said, one thing I'm not sure about with DONTNEED on hugetlb is
whether this could further abuse DONTNEED, as the original POSIX
definition is as simple as:

  The application expects that it will not access the specified address
  range in the near future.

Linux implements it by tearing down pgtables, which looks okay so far.
It is a bit more awkward to apply to hugetlbfs because, by that
definition, DONTNEED is a hint to page reclaim; however, hugetlbfs is
not a target of page reclaim, nor is it LRU-aware.  It drifts further
toward some MADV_ZAP styled syscall.

I think it could still be fine, as POSIX doesn't define that behavior
specifically for hugetlb so it can be defined by Linux, but I'm not
sure whether there can be other implications.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/3] mm: enable MADV_DONTNEED for hugetlb mappings
  2022-02-10 22:11             ` Mike Kravetz
@ 2022-02-11  8:43               ` David Hildenbrand
  0 siblings, 0 replies; 17+ messages in thread
From: David Hildenbrand @ 2022-02-11  8:43 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: Naoya Horiguchi, Axel Rasmussen, Mina Almasry, Michal Hocko,
	Peter Xu, Andrea Arcangeli, Shuah Khan, Andrew Morton

>>>
>>> I am now rethinking the decision to proceed with b) as described above.
>>>
>>> With the exception of MADV_REMOVE (which we may be able to change for
>>> hugetlb),  madvise operations operate on huge page size pages for hugetlb
>>> mappings.  If start address is in the middle of a hugetlb page, we essentially
>>> align down to the beginning of the hugetlb page.  If length lands in the
>>> middle of a hugetlb page, we essentially round up.
>>
>> Which MADV calls would be affected?
> 
> Not sure I understand the question.  I was saying that madvise calls which
> operate on hugetlb mappings today only operate on huge pages.  So, this is
> essentially: align down the starting address and align up the end address.

Let me clarify:

If you accidentally
MADV_NORMAL/MADV_RANDOM/MADV_SEQUENTIAL/MADV_WILLNEED a range that's
slightly bigger/smaller than the requested one, you don't actually care,
because it will only slightly affect the performance of an application,
if at all.  MADV_COLD/MADV_PAGEOUT should be similar. I assume these
don't apply to hugetlb at all.

The effects of
MADV_MERGEABLE/MADV_UNMERGEABLE/MADV_HUGEPAGE/MADV_NOHUGEPAGE should in
theory be similar; however, there can be some user-space visible effects
when you get it wrong. I assume these don't apply to hugetlb at all.

However, for
MADV_DONTNEED/MADV_REMOVE/MADV_DONTFORK/MADV_DOFORK/MADV_FREE/MADV_WIPEONFORK/MADV_KEEPONFORK/MADV_DONTDUMP/MADV_DODUMP/....
the application can easily detect differences in the actual range
handling.
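
For instance (sketch; assumes 2MB huge pages, that the calls succeed,
and the round-up handling of the end that is being discussed here):

    char *p = mmap(NULL, 2UL << 20, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    p[0] = 42;
    madvise(p, 4096, MADV_DONTNEED);  /* ask to drop only the first 4k */
    /*
     * If the end is rounded up, p[0] now reads back as 0: the
     * application observably lost a full 2MB it never named.
     */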


> For example consider the MADV_POPULATE calls you recently added.  They will
> only fault in huge pages in a hugetlb vma.

On a related note: I don't see my man page updates upstream yet. And the
last update upstream seems to have happened 5 months ago ... not sure
why the man-pages project seems to have stalled.

https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/3] mm: enable MADV_DONTNEED for hugetlb mappings
  2022-02-11  2:28       ` Peter Xu
@ 2022-02-11 19:08         ` Axel Rasmussen
  2022-02-11 19:18           ` Mike Kravetz
  0 siblings, 1 reply; 17+ messages in thread
From: Axel Rasmussen @ 2022-02-11 19:08 UTC (permalink / raw)
  To: Peter Xu
  Cc: Mike Kravetz, Linux MM, LKML, Naoya Horiguchi, David Hildenbrand,
	Mina Almasry, Michal Hocko, Andrea Arcangeli, Shuah Khan,
	Andrew Morton

On Thu, Feb 10, 2022 at 6:29 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Thu, Feb 10, 2022 at 01:36:57PM -0800, Mike Kravetz wrote:
> > > Another use case of DONTNEED upon hugetlbfs could be uffd-minor, because afaiu
> > > this is the only api that can force strip the hugetlb mapped pgtable without
> > > losing pagecache data.
> >
> > Correct.  However, I do not know if uffd-minor users would ever want to
> > do this.  Perhaps?

I talked with some colleagues, and I didn't come up with any
production *requirement* for it, but it may be a convenience in some
cases (making certain code cleaner, e.g. not having to unmap-and-remap
to tear down page tables as Peter mentioned). I think Peter's
assessment below is right.

>
> My understanding is that before this patch, uffd-minor upon hugetlbfs
> requires the huge file to be mapped twice: once to populate the content,
> then we'll be able to trap MINOR faults via the other mapping.  Or we could
> munmap() the range and remap it again at the same file offset to drop the
> pgtables, I think. But that sounds tricky.  MINOR faults only work with the
> pgtables dropped.
>
> With DONTNEED upon hugetlbfs we can rely on one single mapping of the file,
> because we can explicitly drop the pgtables of hugetlbfs files without any
> other tricks.
>
> However, I have no real use case for it.  Initially I thought it could be
> useful for QEMU, because the QEMU migration routine runs in the same mm
> context as the hypervisor, so by default it doesn't have two mappings of
> the same guest memory.  If QEMU wants to leverage minor faults, DONTNEED
> could help.
>
> However, when I was measuring bitmap transfer (assuming that's what minor
> faults could help with in QEMU's postcopy) some months ago, I found it's not
> as slow as I thought at all.  Either I could have missed something, or we're
> facing different problems from when uffd minor was first proposed by Axel.

Re: the bitmap, that matters most on machines with lots of RAM. For
example, GCE offers some VMs with up to 12 *TB* of RAM
(https://cloud.google.com/compute/docs/memory-optimized-machines). I
think with this size of machine we see a significant benefit, as it
may take some significant time for the bitmap to arrive over the
network.

But I think that's a bit of an edge case; most machines are not that
big. :) I think the benefit is more often seen just in avoiding
copies. E.g. if we find a page is already up-to-date after precopy, we
just install PTEs, no copying or page allocation needed. And even when
we have to go fetch a page over the network, one can imagine an RDMA
setup where we can avoid any copies/allocations at all even in that
case. I suppose this also has a bigger effect on larger machines, e.g.
ones that are backed by 1G pages instead of 4k.
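
The no-copy resolution is basically just (sketch; uffd, fault_addr and
huge_sz are assumed to come from the surrounding fault-handling loop):

    /*
     * The cached page is already up to date: only install the PTEs,
     * no allocation and no copy.
     */
    struct uffdio_continue cont = {
        .range = {
            .start = fault_addr & ~(huge_sz - 1),
            .len   = huge_sz,
        },
    };
    if (ioctl(uffd, UFFDIO_CONTINUE, &cont))
        err(1, "UFFDIO_CONTINUE");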

>
> This is probably getting too far off topic, though.  Let me go back.
>
> That said, one thing I'm not sure about with DONTNEED on hugetlb is
> whether this could further abuse DONTNEED, as the original POSIX
> definition is as simple as:
>
>   The application expects that it will not access the specified address
>   range in the near future.
>
> Linux implements it by tearing down pgtables, which looks okay so far.
> It is a bit more awkward to apply to hugetlbfs because, by that
> definition, DONTNEED is a hint to page reclaim; however, hugetlbfs is
> not a target of page reclaim, nor is it LRU-aware.  It drifts further
> toward some MADV_ZAP styled syscall.
>
> I think it could still be fine, as POSIX doesn't define that behavior
> specifically for hugetlb so it can be defined by Linux, but I'm not
> sure whether there can be other implications.
>
> Thanks,
>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v2 1/3] mm: enable MADV_DONTNEED for hugetlb mappings
  2022-02-11 19:08         ` Axel Rasmussen
@ 2022-02-11 19:18           ` Mike Kravetz
  0 siblings, 0 replies; 17+ messages in thread
From: Mike Kravetz @ 2022-02-11 19:18 UTC (permalink / raw)
  To: Axel Rasmussen, Peter Xu
  Cc: Linux MM, LKML, Naoya Horiguchi, David Hildenbrand, Mina Almasry,
	Michal Hocko, Andrea Arcangeli, Shuah Khan, Andrew Morton

On 2/11/22 11:08, Axel Rasmussen wrote:
> On Thu, Feb 10, 2022 at 6:29 PM Peter Xu <peterx@redhat.com> wrote:
>>
>> On Thu, Feb 10, 2022 at 01:36:57PM -0800, Mike Kravetz wrote:
>>>> Another use case of DONTNEED upon hugetlbfs could be uffd-minor, because afaiu
>>>> this is the only api that can force strip the hugetlb mapped pgtable without
>>>> losing pagecache data.
>>>
>>> Correct.  However, I do not know if uffd-minor users would ever want to
>>> do this.  Perhaps?
> 
> I talked with some colleagues, and I didn't come up with any
> production *requirement* for it, but it may be a convenience in some
> cases (making certain code cleaner, e.g. not having to unmap-and-remap
> to tear down page tables as Peter mentioned). I think Peter's
> assessment below is right.
> 
>>
>> My understanding is that before this patch, uffd-minor upon hugetlbfs
>> requires the huge file to be mapped twice: once to populate the content,
>> then we'll be able to trap MINOR faults via the other mapping.  Or we could
>> munmap() the range and remap it again at the same file offset to drop the
>> pgtables, I think. But that sounds tricky.  MINOR faults only work with the
>> pgtables dropped.
>>
>> With DONTNEED upon hugetlbfs we can rely on one single mapping of the file,
>> because we can explicitly drop the pgtables of hugetlbfs files without any
>> other tricks.
>>
>> However, I have no real use case for it.  Initially I thought it could be
>> useful for QEMU, because the QEMU migration routine runs in the same mm
>> context as the hypervisor, so by default it doesn't have two mappings of
>> the same guest memory.  If QEMU wants to leverage minor faults, DONTNEED
>> could help.
>>
>> However, when I was measuring bitmap transfer (assuming that's what minor
>> faults could help with in QEMU's postcopy) some months ago, I found it's not
>> as slow as I thought at all.  Either I could have missed something, or we're
>> facing different problems from when uffd minor was first proposed by Axel.
> 
> Re: the bitmap, that matters most on machines with lots of RAM. For
> example, GCE offers some VMs with up to 12 *TB* of RAM
> (https://cloud.google.com/compute/docs/memory-optimized-machines). I
> think with this size of machine we see a significant benefit, as it
> may take some significant time for the bitmap to arrive over the
> network.
> 
> But I think that's a bit of an edge case; most machines are not that
> big. :) I think the benefit is more often seen just in avoiding
> copies. E.g. if we find a page is already up-to-date after precopy, we
> just install PTEs, no copying or page allocation needed. And even when
> we have to go fetch a page over the network, one can imagine an RDMA
> setup where we can avoid any copies/allocations at all even in that
> case. I suppose this also has a bigger effect on larger machines, e.g.
> ones that are backed by 1G pages instead of 4k.
> 

Thanks Peter and Axel!

As mentioned, my primary motivation was to simply clean up the userfaultfd
selftest.  Glad to see there might be more use cases.  If we can simplify
other code, as in the case of the userfaultfd selftest, that would be a win.

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2022-02-11 19:18 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-02  1:40 [PATCH v2 0/3] Add hugetlb MADV_DONTNEED support Mike Kravetz
2022-02-02  1:40 ` [PATCH v2 1/3] mm: enable MADV_DONTNEED for hugetlb mappings Mike Kravetz
2022-02-02  8:14   ` David Hildenbrand
2022-02-02 19:32     ` Mike Kravetz
2022-02-04  8:35       ` David Hildenbrand
2022-02-07 23:47         ` Mike Kravetz
2022-02-10 13:09           ` David Hildenbrand
2022-02-10 22:11             ` Mike Kravetz
2022-02-11  8:43               ` David Hildenbrand
2022-02-10  3:21   ` Peter Xu
2022-02-10 21:36     ` Mike Kravetz
2022-02-11  2:28       ` Peter Xu
2022-02-11 19:08         ` Axel Rasmussen
2022-02-11 19:18           ` Mike Kravetz
2022-02-02  1:40 ` [PATCH v2 2/3] selftests/vm: add hugetlb madvise MADV_DONTNEED MADV_REMOVE test Mike Kravetz
2022-02-02  1:40 ` [PATCH v2 3/3] userfaultfd/selftests: enable huegtlb remap and remove event testing Mike Kravetz
2022-02-02  6:11   ` Mike Rapoport
