All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH 0/3] Add hugetlb MADV_DONTNEED support
@ 2022-01-13 18:03 Mike Kravetz
  2022-01-13 18:03 ` [RFC PATCH 1/3] mm: enable MADV_DONTNEED for hugetlb mappings Mike Kravetz
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Mike Kravetz @ 2022-01-13 18:03 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Michal Hocko, Naoya Horiguchi, Axel Rasmussen, Peter Xu,
	Andrea Arcangeli, Mina Almasry, Shuah Khan, Andrew Morton,
	Mike Kravetz

Userfaultfd selftests for hugetlb does not perform UFFD_EVENT_REMAP
testing.  However, mremap support was recently added in commit
550a7d60bd5e ("mm, hugepages: add mremap() support for hugepage backed
vma").  While attempting to enable mremap support in the test, it was
discovered that the mremap test indirectly depends on MADV_DONTNEED.

hugetlb does not support MADV_DONTNEED.  However, the only thing
preventing support is a check in can_madv_lru_vma().  Simply removing
the check will enable support.

This is sent as a RFC because there is no existing use case calling
for hugetlb MADV_DONTNEED support except possibly the userfaultfd test.
However, adding support makes sense as it is fairly trivial and brings
hugetlb functionality more in line with 'normal' memory.

After enabling support, add selftest for MADV_DONTNEED as well as
MADV_REMOVE.  Then update userfaultfd selftest.

Once support is added, the madvise man page will need to be updated.

Mike Kravetz (3):
  mm: enable MADV_DONTNEED for hugetlb mappings
  selftests/vm: add hugetlb madvise MADV_DONTNEED MADV_REMOVE test
  userfaultfd/selftests: enable huegtlb remap and remove event testing

 mm/madvise.c                                 |   9 +-
 tools/testing/selftests/vm/Makefile          |   1 +
 tools/testing/selftests/vm/hugetlb-madvise.c | 315 +++++++++++++++++++
 tools/testing/selftests/vm/run_vmtests.sh    |  12 +
 tools/testing/selftests/vm/userfaultfd.c     |  67 ++--
 5 files changed, 369 insertions(+), 35 deletions(-)
 create mode 100644 tools/testing/selftests/vm/hugetlb-madvise.c

-- 
2.34.1


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [RFC PATCH 1/3] mm: enable MADV_DONTNEED for hugetlb mappings
  2022-01-13 18:03 [RFC PATCH 0/3] Add hugetlb MADV_DONTNEED support Mike Kravetz
@ 2022-01-13 18:03 ` Mike Kravetz
  2022-01-27  2:58   ` Naoya Horiguchi
  2022-01-13 18:03 ` [RFC PATCH 2/3] selftests/vm: add hugetlb madvise MADV_DONTNEED MADV_REMOVE test Mike Kravetz
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 11+ messages in thread
From: Mike Kravetz @ 2022-01-13 18:03 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Michal Hocko, Naoya Horiguchi, Axel Rasmussen, Peter Xu,
	Andrea Arcangeli, Mina Almasry, Shuah Khan, Andrew Morton,
	Mike Kravetz

MADV_DONTNEED is currently disabled for hugetlb mappings.  This
certainly makes sense in shared file mappings as the pagecache maintains
a reference to the page and it will never be freed.  However, it could
be useful to unmap and free pages in private mappings.

The only thing preventing MADV_DONTNEED (and MADV_FREE) from working on
hugetlb mappings is a check in can_madv_lru_vma().  To allow support for
hugetlb mappings create and use a new routine madvise_dontneed_valid_vma()
that will allow hugetlb mappings.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/madvise.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 8c927202bbe6..fc8992f4ae40 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -766,6 +766,11 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
 	return 0;
 }
 
+static bool madvise_dontneed_valid_vma(struct vm_area_struct *vma)
+{
+	return (vma->vm_flags & VM_HUGETLB) || can_madv_lru_vma(vma);
+}
+
 static long madvise_dontneed_free(struct vm_area_struct *vma,
 				  struct vm_area_struct **prev,
 				  unsigned long start, unsigned long end,
@@ -774,7 +779,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
 	struct mm_struct *mm = vma->vm_mm;
 
 	*prev = vma;
-	if (!can_madv_lru_vma(vma))
+	if (!madvise_dontneed_valid_vma(vma))
 		return -EINVAL;
 
 	if (!userfaultfd_remove(vma, start, end)) {
@@ -796,7 +801,7 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
 			 */
 			return -ENOMEM;
 		}
-		if (!can_madv_lru_vma(vma))
+		if (!madvise_dontneed_valid_vma(vma))
 			return -EINVAL;
 		if (end > vma->vm_end) {
 			/*
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [RFC PATCH 2/3] selftests/vm: add hugetlb madvise MADV_DONTNEED MADV_REMOVE test
  2022-01-13 18:03 [RFC PATCH 0/3] Add hugetlb MADV_DONTNEED support Mike Kravetz
  2022-01-13 18:03 ` [RFC PATCH 1/3] mm: enable MADV_DONTNEED for hugetlb mappings Mike Kravetz
@ 2022-01-13 18:03 ` Mike Kravetz
  2022-01-13 18:03 ` [RFC PATCH 3/3] userfaultfd/selftests: enable huegtlb remap and remove event testing Mike Kravetz
  2022-01-27 11:57 ` [RFC PATCH 0/3] Add hugetlb MADV_DONTNEED support David Hildenbrand
  3 siblings, 0 replies; 11+ messages in thread
From: Mike Kravetz @ 2022-01-13 18:03 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Michal Hocko, Naoya Horiguchi, Axel Rasmussen, Peter Xu,
	Andrea Arcangeli, Mina Almasry, Shuah Khan, Andrew Morton,
	Mike Kravetz

Now that MADV_DONTNEED support for hugetlb is enabled, add corresponding
tests.  MADV_REMOVE has been enabled for some time, but no tests exist
so add them as well.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 tools/testing/selftests/vm/Makefile          |   1 +
 tools/testing/selftests/vm/hugetlb-madvise.c | 315 +++++++++++++++++++
 tools/testing/selftests/vm/run_vmtests.sh    |  12 +
 3 files changed, 328 insertions(+)
 create mode 100644 tools/testing/selftests/vm/hugetlb-madvise.c

diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index 1607322a112c..f60cf43bbf61 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -28,6 +28,7 @@ LDLIBS = -lrt -lpthread
 TEST_GEN_FILES = compaction_test
 TEST_GEN_FILES += gup_test
 TEST_GEN_FILES += hmm-tests
+TEST_GEN_FILES += hugetlb-madvise
 TEST_GEN_FILES += hugepage-mmap
 TEST_GEN_FILES += hugepage-mremap
 TEST_GEN_FILES += hugepage-shm
diff --git a/tools/testing/selftests/vm/hugetlb-madvise.c b/tools/testing/selftests/vm/hugetlb-madvise.c
new file mode 100644
index 000000000000..de6f2088ad40
--- /dev/null
+++ b/tools/testing/selftests/vm/hugetlb-madvise.c
@@ -0,0 +1,315 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * hugepage-madvise:
+ *
+ * Basic functional testing of madvise MADV_DONTNEED and MADV_REMOVE
+ * on hugetlb mappings.
+ *
+ * Before running this test, make sure the administrator has pre-allocated
+ * at least MIN_FREE_PAGES hugetlb pages and they are free.  In addition,
+ * the test takes an argument that is the path to a file in a hugetlbfs
+ * filesystem.  Therefore, a hugetlbfs filesystem must be mounted on some
+ * directory.
+ */
+
+#include <stdlib.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <sys/mman.h>
+#define __USE_GNU
+#include <fcntl.h>
+
+#define USAGE	"USAGE: %s <hugepagefile_name>\n"
+#define MIN_FREE_PAGES	20
+
+#define validate_free_pages(exp_free)					\
+	do {								\
+		int fhp = get_free_hugepages();				\
+		if (fhp != (exp_free)) {				\
+			printf("Unexpected number of free huge "	\
+				"pages line %d\n", __LINE__);		\
+			exit(1);					\
+		}							\
+	} while (0)
+
+unsigned long huge_page_size;
+
+/*
+ * default_huge_page_size copied from mlock2-tests.c
+ */
+unsigned long default_huge_page_size(void)
+{
+	unsigned long hps = 0;
+	char *line = NULL;
+	size_t linelen = 0;
+	FILE *f = fopen("/proc/meminfo", "r");
+
+	if (!f)
+		return 0;
+	while (getline(&line, &linelen, f) > 0) {
+		if (sscanf(line, "Hugepagesize:       %lu kB", &hps) == 1) {
+			hps <<= 10;
+			break;
+		}
+	}
+
+	free(line);
+	fclose(f);
+	return hps;
+}
+
+unsigned long get_free_hugepages(void)
+{
+	unsigned long fhp = 0;
+	char *line = NULL;
+	size_t linelen = 0;
+	FILE *f = fopen("/proc/meminfo", "r");
+
+	if (!f)
+		return fhp;
+	while (getline(&line, &linelen, f) > 0) {
+		if (sscanf(line, "HugePages_Free:      %lu", &fhp) == 1)
+			break;
+	}
+
+	free(line);
+	fclose(f);
+	return fhp;
+}
+
+void write_fault_pages(void *addr, unsigned long nr_pages)
+{
+	unsigned long i;
+
+	for (i = 0; i < nr_pages; i++)
+		*((unsigned long *)(addr + (i * huge_page_size))) = i;
+}
+
+void read_fault_pages(void *addr, unsigned long nr_pages)
+{
+	unsigned long i, tmp;
+
+	for (i = 0; i < nr_pages; i++)
+		tmp += *((unsigned long *)(addr + (i * huge_page_size)));
+}
+
+int main(int argc, char **argv)
+{
+	unsigned long free_hugepages;
+	void *addr, *addr2;
+	int fd;
+
+	if (argc != 2) {
+		printf(USAGE, argv[0]);
+		exit(1);
+	}
+
+	huge_page_size = default_huge_page_size();
+	if (!huge_page_size) {
+		printf("Unable to determine huge page size, exiting!\n");
+		exit(1);
+	}
+
+	free_hugepages = get_free_hugepages();
+	if (free_hugepages < MIN_FREE_PAGES) {
+		printf("Not enough free huge pages to test, exiting!\n");
+		exit(1);
+	}
+
+	fd = open(argv[1], O_CREAT | O_RDWR, 0755);
+	if (fd < 0) {
+		perror("Open failed");
+		exit(1);
+	}
+
+	/*
+	 * Test MADV_DONTNEED on anonymous private mapping
+	 */
+	addr = mmap(NULL, 10 * huge_page_size, PROT_READ | PROT_WRITE,
+			MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
+			-1, 0);
+	if (addr == MAP_FAILED) {
+		perror("mmap");
+		exit(1);
+	}
+	write_fault_pages(addr, 10);
+	validate_free_pages(free_hugepages - 10);
+
+	if (madvise(addr, 10 * huge_page_size, MADV_DONTNEED)) {
+		perror("madvise");
+		exit(1);
+	}
+
+	/* should free all pages in mapping */
+	validate_free_pages(free_hugepages);
+
+	(void)munmap(addr, 10 * huge_page_size);
+
+	/*
+	 * Test MADV_DONTNEED on private mapping of hugetlb file
+	 */
+	if (fallocate(fd, 0, 0, 10 * huge_page_size)) {
+		perror("fallocate");
+		exit(1);
+	}
+	validate_free_pages(free_hugepages - 10);
+
+	addr = mmap(NULL, 10 * huge_page_size, PROT_READ | PROT_WRITE,
+			MAP_PRIVATE, fd, 0);
+	if (addr == MAP_FAILED) {
+		perror("mmap");
+		exit(1);
+	}
+
+	/* read should not consume any pages */
+	read_fault_pages(addr, 10);
+	validate_free_pages(free_hugepages - 10);
+
+	/* madvise should not free any pages */
+	if (madvise(addr, 10 * huge_page_size, MADV_DONTNEED)) {
+		perror("madvise");
+		exit(1);
+	}
+	validate_free_pages(free_hugepages - 10);
+
+	/* writes should allocate private pages */
+	write_fault_pages(addr, 10);
+	validate_free_pages(free_hugepages - 20);
+
+	/* madvise should free private pages */
+	if (madvise(addr, 10 * huge_page_size, MADV_DONTNEED)) {
+		perror("madvise");
+		exit(1);
+	}
+	validate_free_pages(free_hugepages - 10);
+
+	/* writes should allocate private pages */
+	write_fault_pages(addr, 10);
+	validate_free_pages(free_hugepages - 20);
+
+	/*
+	 * The fallocate below certainly should free the pages associated
+	 * with the file.  However, pages in the private mapping are also
+	 * freed.  This is not the 'correct' behavior, but is expected
+	 * because this is how it has worked since the initial hugetlb
+	 * implementation.
+	 */
+	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+						0, 10 * huge_page_size)) {
+		perror("fallocate");
+		exit(1);
+	}
+	validate_free_pages(free_hugepages);
+
+	(void)munmap(addr, 10 * huge_page_size);
+
+	/*
+	 * Test MADV_DONTNEED on shared mapping of hugetlb file
+	 */
+	if (fallocate(fd, 0, 0, 10 * huge_page_size)) {
+		perror("fallocate");
+		exit(1);
+	}
+	validate_free_pages(free_hugepages - 10);
+
+	addr = mmap(NULL, 10 * huge_page_size, PROT_READ | PROT_WRITE,
+			MAP_SHARED, fd, 0);
+	if (addr == MAP_FAILED) {
+		perror("mmap");
+		exit(1);
+	}
+
+	/* write should not consume any pages */
+	write_fault_pages(addr, 10);
+	validate_free_pages(free_hugepages - 10);
+
+	/* madvise should not free any pages */
+	if (madvise(addr, 10 * huge_page_size, MADV_DONTNEED)) {
+		perror("madvise");
+		exit(1);
+	}
+	validate_free_pages(free_hugepages - 10);
+
+	/*
+	 * Test MADV_REMOVE on shared mapping of hugetlb file
+	 *
+	 * madvise is same as hole punch and should free all pages.
+	 */
+	if (madvise(addr, 10 * huge_page_size, MADV_REMOVE)) {
+		perror("madvise");
+		exit(1);
+	}
+	validate_free_pages(free_hugepages);
+	(void)munmap(addr, 10 * huge_page_size);
+
+	/*
+	 * Test MADV_REMOVE on shared and private mapping of hugetlb file
+	 */
+	if (fallocate(fd, 0, 0, 10 * huge_page_size)) {
+		perror("fallocate");
+		exit(1);
+	}
+	validate_free_pages(free_hugepages - 10);
+
+	addr = mmap(NULL, 10 * huge_page_size, PROT_READ | PROT_WRITE,
+			MAP_SHARED, fd, 0);
+	if (addr == MAP_FAILED) {
+		perror("mmap");
+		exit(1);
+	}
+
+	/* shared write should not consume any additional pages */
+	write_fault_pages(addr, 10);
+	validate_free_pages(free_hugepages - 10);
+
+	addr2 = mmap(NULL, 10 * huge_page_size, PROT_READ | PROT_WRITE,
+			MAP_PRIVATE, fd, 0);
+	if (addr2 == MAP_FAILED) {
+		perror("mmap");
+		exit(1);
+	}
+
+	/* private read should not consume any pages */
+	read_fault_pages(addr2, 10);
+	validate_free_pages(free_hugepages - 10);
+
+	/* private write should consume additional pages */
+	write_fault_pages(addr2, 10);
+	validate_free_pages(free_hugepages - 20);
+
+	/* madvise of shared mapping should not free any pages */
+	if (madvise(addr, 10 * huge_page_size, MADV_DONTNEED)) {
+		perror("madvise");
+		exit(1);
+	}
+	validate_free_pages(free_hugepages - 20);
+
+	/* madvise of private mapping should free private pages */
+	if (madvise(addr2, 10 * huge_page_size, MADV_DONTNEED)) {
+		perror("madvise");
+		exit(1);
+	}
+	validate_free_pages(free_hugepages - 10);
+
+	/* private write should consume additional pages again */
+	write_fault_pages(addr2, 10);
+	validate_free_pages(free_hugepages - 20);
+
+	/*
+	 * madvise should free both file and private pages although this is
+	 * not correct.  private pages should not be freed, but this is
+	 * expected.  See comment associated with FALLOC_FL_PUNCH_HOLE call.
+	 */
+	if (madvise(addr, 10 * huge_page_size, MADV_REMOVE)) {
+		perror("madvise");
+		exit(1);
+	}
+	validate_free_pages(free_hugepages);
+
+	(void)munmap(addr, 10 * huge_page_size);
+	(void)munmap(addr2, 10 * huge_page_size);
+
+	close(fd);
+	unlink(argv[1]);
+	return 0;
+}
diff --git a/tools/testing/selftests/vm/run_vmtests.sh b/tools/testing/selftests/vm/run_vmtests.sh
index a24d30af3094..f46988c5db35 100755
--- a/tools/testing/selftests/vm/run_vmtests.sh
+++ b/tools/testing/selftests/vm/run_vmtests.sh
@@ -119,6 +119,18 @@ else
 	echo "[PASS]"
 fi
 
+echo "-----------------------"
+echo "running hugetlb-madvise"
+echo "-----------------------"
+./hugetlb-madvise $mnt/madvise-test
+if [ $? -ne 0 ]; then
+	echo "[FAIL]"
+	exitcode=1
+else
+	echo "[PASS]"
+fi
+rm -f $mnt/madvise-test
+
 echo "NOTE: The above hugetlb tests provide minimal coverage.  Use"
 echo "      https://github.com/libhugetlbfs/libhugetlbfs.git for"
 echo "      hugetlb regression testing."
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [RFC PATCH 3/3] userfaultfd/selftests: enable huegtlb remap and remove event testing
  2022-01-13 18:03 [RFC PATCH 0/3] Add hugetlb MADV_DONTNEED support Mike Kravetz
  2022-01-13 18:03 ` [RFC PATCH 1/3] mm: enable MADV_DONTNEED for hugetlb mappings Mike Kravetz
  2022-01-13 18:03 ` [RFC PATCH 2/3] selftests/vm: add hugetlb madvise MADV_DONTNEED MADV_REMOVE test Mike Kravetz
@ 2022-01-13 18:03 ` Mike Kravetz
  2022-01-27 11:57 ` [RFC PATCH 0/3] Add hugetlb MADV_DONTNEED support David Hildenbrand
  3 siblings, 0 replies; 11+ messages in thread
From: Mike Kravetz @ 2022-01-13 18:03 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Michal Hocko, Naoya Horiguchi, Axel Rasmussen, Peter Xu,
	Andrea Arcangeli, Mina Almasry, Shuah Khan, Andrew Morton,
	Mike Kravetz

With MADV_DONTNEED support added to hugetlb mappings, mremap testing
can also be enabled for hugetlb.

Modify the tests to use madvise MADV_DONTNEED and MADV_REMOVE instead of
fallocate hole puch for releasing hugetlb pages.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 tools/testing/selftests/vm/userfaultfd.c | 67 ++++++++++++------------
 1 file changed, 34 insertions(+), 33 deletions(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 0b543a9a42b2..226f385158df 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -88,7 +88,6 @@ static bool test_uffdio_minor = false;
 static bool map_shared;
 static int shm_fd;
 static int huge_fd;
-static char *huge_fd_off0;
 static unsigned long long *count_verify;
 static int uffd = -1;
 static int uffd_flags, finished, *pipefd;
@@ -124,9 +123,9 @@ const char *examples =
     "./userfaultfd anon 100 99999\n\n"
     "# Run share memory test on 1GiB region with 99 bounces:\n"
     "./userfaultfd shmem 1000 99\n\n"
-    "# Run hugetlb memory test on 256MiB region with 50 bounces (using /dev/hugepages/hugefile):\n"
+    "# Run hugetlb memory test on 256MiB region with 50 bounces:\n"
     "./userfaultfd hugetlb 256 50 /dev/hugepages/hugefile\n\n"
-    "# Run the same hugetlb test but using shmem:\n"
+    "# Run the same hugetlb test but using shared file:\n"
     "./userfaultfd hugetlb_shared 256 50 /dev/hugepages/hugefile\n\n"
     "# 10MiB-~6GiB 999 bounces anonymous test, "
     "continue forever unless an error triggers\n"
@@ -223,10 +222,13 @@ static void noop_alias_mapping(__u64 *start, size_t len, unsigned long offset)
 
 static void hugetlb_release_pages(char *rel_area)
 {
-	if (fallocate(huge_fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
-		      rel_area == huge_fd_off0 ? 0 : nr_pages * page_size,
-		      nr_pages * page_size))
-		err("fallocate() failed");
+	if (!map_shared) {
+		if (madvise(rel_area, nr_pages * page_size, MADV_DONTNEED))
+			err("madvise(MADV_DONTNEED) failed");
+	} else {
+		if (madvise(rel_area, nr_pages * page_size, MADV_REMOVE))
+			err("madvise(MADV_REMOVE) failed");
+	}
 }
 
 static void hugetlb_allocate_area(void **alloc_area)
@@ -234,26 +236,37 @@ static void hugetlb_allocate_area(void **alloc_area)
 	void *area_alias = NULL;
 	char **alloc_area_alias;
 
-	*alloc_area = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE,
-			   (map_shared ? MAP_SHARED : MAP_PRIVATE) |
-			   MAP_HUGETLB |
-			   (*alloc_area == area_src ? 0 : MAP_NORESERVE),
-			   huge_fd, *alloc_area == area_src ? 0 :
-			   nr_pages * page_size);
+	if (!map_shared)
+		*alloc_area = mmap(NULL,
+			nr_pages * page_size,
+			PROT_READ | PROT_WRITE,
+			MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
+				(*alloc_area == area_src ? 0 : MAP_NORESERVE),
+			-1,
+			0);
+	else
+		*alloc_area = mmap(NULL,
+			nr_pages * page_size,
+			PROT_READ | PROT_WRITE,
+			MAP_SHARED |
+				(*alloc_area == area_src ? 0 : MAP_NORESERVE),
+			huge_fd,
+			*alloc_area == area_src ? 0 : nr_pages * page_size);
 	if (*alloc_area == MAP_FAILED)
 		err("mmap of hugetlbfs file failed");
 
 	if (map_shared) {
-		area_alias = mmap(NULL, nr_pages * page_size, PROT_READ | PROT_WRITE,
-				  MAP_SHARED | MAP_HUGETLB,
-				  huge_fd, *alloc_area == area_src ? 0 :
-				  nr_pages * page_size);
+		area_alias = mmap(NULL,
+			nr_pages * page_size,
+			PROT_READ | PROT_WRITE,
+			MAP_SHARED,
+			huge_fd,
+			*alloc_area == area_src ? 0 : nr_pages * page_size);
 		if (area_alias == MAP_FAILED)
 			err("mmap of hugetlb file alias failed");
 	}
 
 	if (*alloc_area == area_src) {
-		huge_fd_off0 = *alloc_area;
 		alloc_area_alias = &area_src_alias;
 	} else {
 		alloc_area_alias = &area_dst_alias;
@@ -266,12 +279,7 @@ static void hugetlb_alias_mapping(__u64 *start, size_t len, unsigned long offset
 {
 	if (!map_shared)
 		return;
-	/*
-	 * We can't zap just the pagetable with hugetlbfs because
-	 * MADV_DONTEED won't work. So exercise -EEXIST on a alias
-	 * mapping where the pagetables are not established initially,
-	 * this way we'll exercise the -EEXEC at the fs level.
-	 */
+
 	*start = (unsigned long) area_dst_alias + offset;
 }
 
@@ -424,7 +432,6 @@ static void uffd_test_ctx_clear(void)
 		uffd = -1;
 	}
 
-	huge_fd_off0 = NULL;
 	munmap_area((void **)&area_src);
 	munmap_area((void **)&area_src_alias);
 	munmap_area((void **)&area_dst);
@@ -919,10 +926,7 @@ static int faulting_process(int signal_test)
 	struct sigaction act;
 	unsigned long signalled = 0;
 
-	if (test_type != TEST_HUGETLB)
-		split_nr_pages = (nr_pages + 1) / 2;
-	else
-		split_nr_pages = nr_pages;
+	split_nr_pages = (nr_pages + 1) / 2;
 
 	if (signal_test) {
 		sigbuf = &jbuf;
@@ -979,9 +983,6 @@ static int faulting_process(int signal_test)
 	if (signal_test)
 		return signalled != split_nr_pages;
 
-	if (test_type == TEST_HUGETLB)
-		return 0;
-
 	area_dst = mremap(area_dst, nr_pages * page_size,  nr_pages * page_size,
 			  MREMAP_MAYMOVE | MREMAP_FIXED, area_src);
 	if (area_dst == MAP_FAILED)
@@ -1669,7 +1670,7 @@ int main(int argc, char **argv)
 	}
 	nr_pages = nr_pages_per_cpu * nr_cpus;
 
-	if (test_type == TEST_HUGETLB) {
+	if (test_type == TEST_HUGETLB && map_shared) {
 		if (argc < 5)
 			usage();
 		huge_fd = open(argv[4], O_CREAT | O_RDWR, 0755);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 1/3] mm: enable MADV_DONTNEED for hugetlb mappings
  2022-01-13 18:03 ` [RFC PATCH 1/3] mm: enable MADV_DONTNEED for hugetlb mappings Mike Kravetz
@ 2022-01-27  2:58   ` Naoya Horiguchi
  2022-01-27  3:19     ` Mike Kravetz
  0 siblings, 1 reply; 11+ messages in thread
From: Naoya Horiguchi @ 2022-01-27  2:58 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-kernel, linux-mm, Michal Hocko, Axel Rasmussen, Peter Xu,
	Andrea Arcangeli, Mina Almasry, Shuah Khan, Andrew Morton

On Thu, Jan 13, 2022 at 10:03:06AM -0800, Mike Kravetz wrote:
> MADV_DONTNEED is currently disabled for hugetlb mappings.  This
> certainly makes sense in shared file mappings as the pagecache maintains
> a reference to the page and it will never be freed.  However, it could
> be useful to unmap and free pages in private mappings.
> 
> The only thing preventing MADV_DONTNEED (and MADV_FREE) from working on
> hugetlb mappings is a check in can_madv_lru_vma().  To allow support for
> hugetlb mappings create and use a new routine madvise_dontneed_valid_vma()
> that will allow hugetlb mappings.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

I briefly tested the patch and it seems that when calling madvise(MADV_DONTNEED)
with the range unaligned to hugepage size (like 4kB) triggered the following crash.
Could you double check around the address range issue?

[  220.915316] ------------[ cut here ]------------
[  220.916792] kernel BUG at mm/hugetlb.c:4946!
[  220.918519] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[  220.920344] CPU: 2 PID: 1665 Comm: a.out Tainted: G            E     5.17.0-rc1-220126-1543+ #31
[  220.930536] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1.fc35 04/01/2014
[  220.934484] RIP: 0010:__unmap_hugepage_range+0x4d6/0x580
[  220.936443] Code: ff ff 49 c7 44 24 10 ff ff ff ff 41 80 64 24 20 83 e9 34 fd ff ff 0f 0b 49 8b 47 30 48 f7 d0 4c 85 c8 0f 84 93 fb ff ff 0f 0b <0f> 0b 48 8d 7c 24 48 83 4c 24 68 01 e8 69 b9 00 00 4c 8b 4c 24 08
[  220.943818] RSP: 0018:ffffabe9019dfc98 EFLAGS: 00010206
[  220.945454] RAX: 000000000000a000 RBX: ffff8a0e02253ed8 RCX: 0000000000000009
[  220.947661] RDX: 00007f1c08000000 RSI: ffff8a0e02253ed8 RDI: ffffabe9019dfdb8
[  220.949677] RBP: 0000000000200000 R08: 0000000000000000 R09: 00007f1c08000000
[  220.951902] R10: 0000000000000000 R11: 0000000000000000 R12: ffffabe9019dfdb8
[  220.954040] R13: ffffabe9019dfdb8 R14: 00007f1c08000000 R15: ffffffff9420f960
[  220.955990] FS:  00007f1c08480540(0000) GS:ffff8a0efbc00000(0000) knlGS:0000000000000000
[  220.958034] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  220.959517] CR2: 00007f1c0849aff8 CR3: 0000000146a52000 CR4: 00000000001506e0
[  220.961321] Call Trace:
[  220.961945]  <TASK>
[  220.962535]  ? __slab_free+0xba/0x370
[  220.963468]  ? update_load_avg+0x7e/0x5f0
[  220.964427]  ? __cgroup_account_cputime+0x4c/0x70
[  220.965596]  ? page_counter_uncharge+0x1d/0x30
[  220.966733]  __unmap_hugepage_range_final+0xe/0x20
[  220.967861]  unmap_single_vma+0xc7/0xf0
[  220.968795]  zap_page_range+0xcc/0x130
[  220.969619]  ? find_vma+0x73/0x80
[  220.970378]  do_madvise.part.0+0xb65/0xea0
[  220.971279]  ? do_sigaction+0x111/0x240
[  220.972081]  __x64_sys_madvise+0x56/0x70
[  220.972969]  do_syscall_64+0x3b/0x90
[  220.973779]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  220.974883] RIP: 0033:0x7f1c083b24eb
[  220.975632] Code: c3 48 8b 15 9f 59 0c 00 f7 d8 64 89 02 b8 ff ff ff ff eb c2 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 1c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 6d 59 0c 00 f7 d8 64 89 01 48
[  220.979331] RSP: 002b:00007ffe7f9bc038 EFLAGS: 00000202 ORIG_RAX: 000000000000001c
[  220.980791] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1c083b24eb
[  220.982162] RDX: 0000000000000004 RSI: 000000000000a000 RDI: 00007f1c08000000
[  220.983550] RBP: 00007ffe7f9bc070 R08: 0000000000000000 R09: 0000000000000000
[  220.984997] R10: 000000000040049c R11: 0000000000000202 R12: 00000000004010d0
[  220.986399] R13: 00007ffe7f9bc150 R14: 0000000000000000 R15: 0000000000000000
[  220.987765]  </TASK>


Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 1/3] mm: enable MADV_DONTNEED for hugetlb mappings
  2022-01-27  2:58   ` Naoya Horiguchi
@ 2022-01-27  3:19     ` Mike Kravetz
  0 siblings, 0 replies; 11+ messages in thread
From: Mike Kravetz @ 2022-01-27  3:19 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-kernel, linux-mm, Michal Hocko, Axel Rasmussen, Peter Xu,
	Andrea Arcangeli, Mina Almasry, Shuah Khan, Andrew Morton

On 1/26/22 18:58, Naoya Horiguchi wrote:
> On Thu, Jan 13, 2022 at 10:03:06AM -0800, Mike Kravetz wrote:
>> MADV_DONTNEED is currently disabled for hugetlb mappings.  This
>> certainly makes sense in shared file mappings as the pagecache maintains
>> a reference to the page and it will never be freed.  However, it could
>> be useful to unmap and free pages in private mappings.
>>
>> The only thing preventing MADV_DONTNEED (and MADV_FREE) from working on
>> hugetlb mappings is a check in can_madv_lru_vma().  To allow support for
>> hugetlb mappings create and use a new routine madvise_dontneed_valid_vma()
>> that will allow hugetlb mappings.
>>
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> 
> I briefly tested the patch and it seems that when calling madvise(MADV_DONTNEED)
> with the range unaligned to hugepage size (like 4kB) triggered the following crash.
> Could you double check around the address range issue?

Thanks Naoya!  My bad for not considering this and doing more testing.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 0/3] Add hugetlb MADV_DONTNEED support
  2022-01-13 18:03 [RFC PATCH 0/3] Add hugetlb MADV_DONTNEED support Mike Kravetz
                   ` (2 preceding siblings ...)
  2022-01-13 18:03 ` [RFC PATCH 3/3] userfaultfd/selftests: enable huegtlb remap and remove event testing Mike Kravetz
@ 2022-01-27 11:57 ` David Hildenbrand
  2022-01-27 17:52   ` Axel Rasmussen
  2022-01-27 17:55   ` Mike Kravetz
  3 siblings, 2 replies; 11+ messages in thread
From: David Hildenbrand @ 2022-01-27 11:57 UTC (permalink / raw)
  To: Mike Kravetz, linux-kernel, linux-mm
  Cc: Michal Hocko, Naoya Horiguchi, Axel Rasmussen, Peter Xu,
	Andrea Arcangeli, Mina Almasry, Shuah Khan, Andrew Morton

On 13.01.22 19:03, Mike Kravetz wrote:
> Userfaultfd selftests for hugetlb does not perform UFFD_EVENT_REMAP
> testing.  However, mremap support was recently added in commit
> 550a7d60bd5e ("mm, hugepages: add mremap() support for hugepage backed
> vma").  While attempting to enable mremap support in the test, it was
> discovered that the mremap test indirectly depends on MADV_DONTNEED.
> 
> hugetlb does not support MADV_DONTNEED.  However, the only thing
> preventing support is a check in can_madv_lru_vma().  Simply removing
> the check will enable support.
> 
> This is sent as a RFC because there is no existing use case calling
> for hugetlb MADV_DONTNEED support except possibly the userfaultfd test.
> However, adding support makes sense as it is fairly trivial and brings
> hugetlb functionality more in line with 'normal' memory.
> 

Just a note:

QEMU doesn't use huge anonymous memory directly (MAP_ANON | MAP_HUGE...)
but instead always goes either via hugetlbfs or via memfd. 

For MAP_PRIVATE hugetlb mappings, fallocate(FALLOC_FL_PUNCH_HOLE) seems
to get the job done (IOW: also discards private anon pages). See the
comments in the QEMU code below. I remember that that is somewhat
inconsistent. For ordinary MAP_PRIVATE mapped files I remember that we
always need fallocate(FALLOC_FL_PUNCH_HOLE) + madvise(QEMU_MADV_DONTNEED)
to make sure

a) All file pages are removed
b) All private anon pages are removed

IIRC hugetlbfs really is different in that regard, but maybe other fs
behave similarly.

That's why QEMU was able to live for now without MADV_DONTNEED support
for hugetlbfs and most probably won't ever need it.


...
       /* The logic here is messy;
         *    madvise DONTNEED fails for hugepages
         *    fallocate works on hugepages and shmem
         *    shared anonymous memory requires madvise REMOVE
         */
        need_madvise = (rb->page_size == qemu_host_page_size);
        need_fallocate = rb->fd != -1;
        if (need_fallocate) {
            /* For a file, this causes the area of the file to be zero'd
             * if read, and for hugetlbfs also causes it to be unmapped
             * so a userfault will trigger.
             */
#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
            ret = fallocate(rb->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                            start, length);
            if (ret) {
                ret = -errno;
                error_report("ram_block_discard_range: Failed to fallocate "
                             "%s:%" PRIx64 " +%zx (%d)",
                             rb->idstr, start, length, ret);
                goto err;
            }
#else
            ret = -ENOSYS;
            error_report("ram_block_discard_range: fallocate not available/file"
                         "%s:%" PRIx64 " +%zx (%d)",
                         rb->idstr, start, length, ret);
            goto err;
#endif
        }
        if (need_madvise) {
            /* For normal RAM this causes it to be unmapped,
             * for shared memory it causes the local mapping to disappear
             * and to fall back on the file contents (which we just
             * fallocate'd away).
             */
#if defined(CONFIG_MADVISE)
            if (qemu_ram_is_shared(rb) && rb->fd < 0) {
                ret = madvise(host_startaddr, length, QEMU_MADV_REMOVE);
            } else {
                ret = madvise(host_startaddr, length, QEMU_MADV_DONTNEED);
            }
            if (ret) {
                ret = -errno;
                error_report("ram_block_discard_range: Failed to discard range "
                             "%s:%" PRIx64 " +%zx (%d)",
                             rb->idstr, start, length, ret);
                goto err;
            }
#else
...

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 0/3] Add hugetlb MADV_DONTNEED support
  2022-01-27 11:57 ` [RFC PATCH 0/3] Add hugetlb MADV_DONTNEED support David Hildenbrand
@ 2022-01-27 17:52   ` Axel Rasmussen
  2022-01-28  9:44     ` David Hildenbrand
  2022-01-27 17:55   ` Mike Kravetz
  1 sibling, 1 reply; 11+ messages in thread
From: Axel Rasmussen @ 2022-01-27 17:52 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Mike Kravetz, LKML, Linux MM, Michal Hocko, Naoya Horiguchi,
	Peter Xu, Andrea Arcangeli, Mina Almasry, Shuah Khan,
	Andrew Morton

On Thu, Jan 27, 2022 at 3:57 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 13.01.22 19:03, Mike Kravetz wrote:
> > Userfaultfd selftests for hugetlb does not perform UFFD_EVENT_REMAP
> > testing.  However, mremap support was recently added in commit
> > 550a7d60bd5e ("mm, hugepages: add mremap() support for hugepage backed
> > vma").  While attempting to enable mremap support in the test, it was
> > discovered that the mremap test indirectly depends on MADV_DONTNEED.
> >
> > hugetlb does not support MADV_DONTNEED.  However, the only thing
> > preventing support is a check in can_madv_lru_vma().  Simply removing
> > the check will enable support.
> >
> > This is sent as a RFC because there is no existing use case calling
> > for hugetlb MADV_DONTNEED support except possibly the userfaultfd test.
> > However, adding support makes sense as it is fairly trivial and brings
> > hugetlb functionality more in line with 'normal' memory.
> >
>
> Just a note:
>
> QEMU doesn't use huge anonymous memory directly (MAP_ANON | MAP_HUGE...)
> but instead always goes either via hugetlbfs or via memfd.
>
> For MAP_PRIVATE hugetlb mappings, fallocate(FALLOC_FL_PUNCH_HOLE) seems
> to get the job done (IOW: also discards private anon pages). See the
> comments in the QEMU code below. I remember that that is somewhat
> inconsistent. For ordinary MAP_PRIVATE mapped files I remember that we
> always need fallocate(FALLOC_FL_PUNCH_HOLE) + madvise(QEMU_MADV_DONTNEED)
> to make sure
>
> a) All file pages are removed
> b) All private anon pages are removed
>
> IIRC hugetlbfs really is different in that regard, but maybe other fs
> behave similarly.
>
> That's why QEMU was able to live for now without MADV_DONTNEED support
> for hugetlbfs and most probably won't ever need it.

Agreed, all of the production use cases I'm aware of use hugetlbfs,
not MAP_HUGE...

But, I would say this is convenient for testing purposes. It's
slightly more convenient to not have to mount hugetlbfs / perform the
associated setup for tests.

Perhaps that's only a small motivation for enabling this, but then
again Mike's patch to do so is likewise very small. :)


>
>
> ...
>        /* The logic here is messy;
>          *    madvise DONTNEED fails for hugepages
>          *    fallocate works on hugepages and shmem
>          *    shared anonymous memory requires madvise REMOVE
>          */
>         need_madvise = (rb->page_size == qemu_host_page_size);
>         need_fallocate = rb->fd != -1;
>         if (need_fallocate) {
>             /* For a file, this causes the area of the file to be zero'd
>              * if read, and for hugetlbfs also causes it to be unmapped
>              * so a userfault will trigger.
>              */
> #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
>             ret = fallocate(rb->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
>                             start, length);
>             if (ret) {
>                 ret = -errno;
>                 error_report("ram_block_discard_range: Failed to fallocate "
>                              "%s:%" PRIx64 " +%zx (%d)",
>                              rb->idstr, start, length, ret);
>                 goto err;
>             }
> #else
>             ret = -ENOSYS;
>             error_report("ram_block_discard_range: fallocate not available/file"
>                          "%s:%" PRIx64 " +%zx (%d)",
>                          rb->idstr, start, length, ret);
>             goto err;
> #endif
>         }
>         if (need_madvise) {
>             /* For normal RAM this causes it to be unmapped,
>              * for shared memory it causes the local mapping to disappear
>              * and to fall back on the file contents (which we just
>              * fallocate'd away).
>              */
> #if defined(CONFIG_MADVISE)
>             if (qemu_ram_is_shared(rb) && rb->fd < 0) {
>                 ret = madvise(host_startaddr, length, QEMU_MADV_REMOVE);
>             } else {
>                 ret = madvise(host_startaddr, length, QEMU_MADV_DONTNEED);
>             }
>             if (ret) {
>                 ret = -errno;
>                 error_report("ram_block_discard_range: Failed to discard range "
>                              "%s:%" PRIx64 " +%zx (%d)",
>                              rb->idstr, start, length, ret);
>                 goto err;
>             }
> #else
> ...
>
> --
> Thanks,
>
> David / dhildenb
>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 0/3] Add hugetlb MADV_DONTNEED support
  2022-01-27 11:57 ` [RFC PATCH 0/3] Add hugetlb MADV_DONTNEED support David Hildenbrand
  2022-01-27 17:52   ` Axel Rasmussen
@ 2022-01-27 17:55   ` Mike Kravetz
  2022-01-28  9:55     ` David Hildenbrand
  1 sibling, 1 reply; 11+ messages in thread
From: Mike Kravetz @ 2022-01-27 17:55 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel, linux-mm
  Cc: Michal Hocko, Naoya Horiguchi, Axel Rasmussen, Peter Xu,
	Andrea Arcangeli, Mina Almasry, Shuah Khan, Andrew Morton

On 1/27/22 03:57, David Hildenbrand wrote:
> On 13.01.22 19:03, Mike Kravetz wrote:
>> Userfaultfd selftests for hugetlb does not perform UFFD_EVENT_REMAP
>> testing.  However, mremap support was recently added in commit
>> 550a7d60bd5e ("mm, hugepages: add mremap() support for hugepage backed
>> vma").  While attempting to enable mremap support in the test, it was
>> discovered that the mremap test indirectly depends on MADV_DONTNEED.
>>
>> hugetlb does not support MADV_DONTNEED.  However, the only thing
>> preventing support is a check in can_madv_lru_vma().  Simply removing
>> the check will enable support.
>>
>> This is sent as a RFC because there is no existing use case calling
>> for hugetlb MADV_DONTNEED support except possibly the userfaultfd test.
>> However, adding support makes sense as it is fairly trivial and brings
>> hugetlb functionality more in line with 'normal' memory.
>>
> 
> Just a note:
> 
> QEMU doesn't use huge anonymous memory directly (MAP_ANON | MAP_HUGE...)
> but instead always goes either via hugetlbfs or via memfd. 
> 
> For MAP_PRIVATE hugetlb mappings, fallocate(FALLOC_FL_PUNCH_HOLE) seems
> to get the job done (IOW: also discards private anon pages). See the
> comments in the QEMU code below. I remember that that is somewhat
> inconsistent. For ordinary MAP_PRIVATE mapped files I remember that we
> always need fallocate(FALLOC_FL_PUNCH_HOLE) + madvise(QEMU_MADV_DONTNEED)
> to make sure
> 
> a) All file pages are removed
> b) All private anon pages are removed
> 
> IIRC hugetlbfs really is different in that regard, but maybe other fs
> behave similarly.

Yes it is really different.  And, some might even consider that a bug?
Imagine if those private anon (COW) pages contain important data.  They
could be unmapped/freed by some other process that has write access to
the hugetlb file on which the private mapping is based.

I believe this same issue exists for hugetlbfs ftruncate.  When fallocate
hole punch support was added, it was based on the ftruncate functionality.

I am hesitant to change the behavior of hugetlb hole punch or truncate
as people may be relying on that behavior today.  Your QEMU example is
one such case.

Thanks,
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 0/3] Add hugetlb MADV_DONTNEED support
  2022-01-27 17:52   ` Axel Rasmussen
@ 2022-01-28  9:44     ` David Hildenbrand
  0 siblings, 0 replies; 11+ messages in thread
From: David Hildenbrand @ 2022-01-28  9:44 UTC (permalink / raw)
  To: Axel Rasmussen
  Cc: Mike Kravetz, LKML, Linux MM, Michal Hocko, Naoya Horiguchi,
	Peter Xu, Andrea Arcangeli, Mina Almasry, Shuah Khan,
	Andrew Morton

On 27.01.22 18:52, Axel Rasmussen wrote:
> On Thu, Jan 27, 2022 at 3:57 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 13.01.22 19:03, Mike Kravetz wrote:
>>> Userfaultfd selftests for hugetlb does not perform UFFD_EVENT_REMAP
>>> testing.  However, mremap support was recently added in commit
>>> 550a7d60bd5e ("mm, hugepages: add mremap() support for hugepage backed
>>> vma").  While attempting to enable mremap support in the test, it was
>>> discovered that the mremap test indirectly depends on MADV_DONTNEED.
>>>
>>> hugetlb does not support MADV_DONTNEED.  However, the only thing
>>> preventing support is a check in can_madv_lru_vma().  Simply removing
>>> the check will enable support.
>>>
>>> This is sent as a RFC because there is no existing use case calling
>>> for hugetlb MADV_DONTNEED support except possibly the userfaultfd test.
>>> However, adding support makes sense as it is fairly trivial and brings
>>> hugetlb functionality more in line with 'normal' memory.
>>>
>>
>> Just a note:
>>
>> QEMU doesn't use huge anonymous memory directly (MAP_ANON | MAP_HUGE...)
>> but instead always goes either via hugetlbfs or via memfd.
>>
>> For MAP_PRIVATE hugetlb mappings, fallocate(FALLOC_FL_PUNCH_HOLE) seems
>> to get the job done (IOW: also discards private anon pages). See the
>> comments in the QEMU code below. I remember that that is somewhat
>> inconsistent. For ordinary MAP_PRIVATE mapped files I remember that we
>> always need fallocate(FALLOC_FL_PUNCH_HOLE) + madvise(QEMU_MADV_DONTNEED)
>> to make sure
>>
>> a) All file pages are removed
>> b) All private anon pages are removed
>>
>> IIRC hugetlbfs really is different in that regard, but maybe other fs
>> behave similarly.
>>
>> That's why QEMU was able to live for now without MADV_DONTNEED support
>> for hugetlbfs and most probably won't ever need it.
> 
> Agreed, all of the production use cases I'm aware of use hugetlbfs,
> not MAP_HUGE...
> 
> But, I would say this is convenient for testing purposes. It's
> slightly more convenient to not have to mount hugetlbfs / perform the
> associated setup for tests.

Creating a memfd is not too hard, but yes, not a single-liner. Maybe the
uffd test should go via memfds for hugetlb instead. But maybe that
limits the mremap functionality? No expert.

> 
> Perhaps that's only a small motivation for enabling this, but then
> again Mike's patch to do so is likewise very small. :)

... and apparently buggy :P

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [RFC PATCH 0/3] Add hugetlb MADV_DONTNEED support
  2022-01-27 17:55   ` Mike Kravetz
@ 2022-01-28  9:55     ` David Hildenbrand
  0 siblings, 0 replies; 11+ messages in thread
From: David Hildenbrand @ 2022-01-28  9:55 UTC (permalink / raw)
  To: Mike Kravetz, linux-kernel, linux-mm
  Cc: Michal Hocko, Naoya Horiguchi, Axel Rasmussen, Peter Xu,
	Andrea Arcangeli, Mina Almasry, Shuah Khan, Andrew Morton

On 27.01.22 18:55, Mike Kravetz wrote:
> On 1/27/22 03:57, David Hildenbrand wrote:
>> On 13.01.22 19:03, Mike Kravetz wrote:
>>> Userfaultfd selftests for hugetlb does not perform UFFD_EVENT_REMAP
>>> testing.  However, mremap support was recently added in commit
>>> 550a7d60bd5e ("mm, hugepages: add mremap() support for hugepage backed
>>> vma").  While attempting to enable mremap support in the test, it was
>>> discovered that the mremap test indirectly depends on MADV_DONTNEED.
>>>
>>> hugetlb does not support MADV_DONTNEED.  However, the only thing
>>> preventing support is a check in can_madv_lru_vma().  Simply removing
>>> the check will enable support.
>>>
>>> This is sent as a RFC because there is no existing use case calling
>>> for hugetlb MADV_DONTNEED support except possibly the userfaultfd test.
>>> However, adding support makes sense as it is fairly trivial and brings
>>> hugetlb functionality more in line with 'normal' memory.
>>>
>>
>> Just a note:
>>
>> QEMU doesn't use huge anonymous memory directly (MAP_ANON | MAP_HUGE...)
>> but instead always goes either via hugetlbfs or via memfd. 
>>
>> For MAP_PRIVATE hugetlb mappings, fallocate(FALLOC_FL_PUNCH_HOLE) seems
>> to get the job done (IOW: also discards private anon pages). See the
>> comments in the QEMU code below. I remember that that is somewhat
>> inconsistent. For ordinary MAP_PRIVATE mapped files I remember that we
>> always need fallocate(FALLOC_FL_PUNCH_HOLE) + madvise(QEMU_MADV_DONTNEED)
>> to make sure
>>
>> a) All file pages are removed
>> b) All private anon pages are removed
>>
>> IIRC hugetlbfs really is different in that regard, but maybe other fs
>> behave similarly.
> 
> Yes it is really different.  And, some might even consider that a bug?
> Imagine if those private anon (COW) pages contain important data.  They
> could be unmapped/freed by some other process that has write access to
> the hugetlb file on which the private mapping is based.

Right, that's also what I once worried about in QEMU code. But then I
realized that any kind of fallocate(FALLOC_FL_PUNCH_HOLE) on a file
shared by multiple parties mapped MAP_PRIVATE might be bogus already.

Assume you have a VM running with MAP_SHARED on a file. The file
contains the VM memory state. Assume you pause the VM and want to
convert it into 2 instances that will continue running independently
based on the captured file state.

You'd have to MAP_PRIVATE the file such that both VMs start with the
original state and only see their modifications.

... but if one process decides to fallocate(FALLOC_FL_PUNCH_HOLE), for
example, due to memory ballooning, even a page that's still shared by
both processes (!COW), you'd corrupt the other VM.

So my assumption is that MAP_PRIVATE in combination with
fallocate(FALLOC_FL_PUNCH_HOLE) on a file mapped by more than one
process is just bogus already.

> 
> I believe this same issue exists for hugetlbfs ftruncate.  When fallocate
> hole punch support was added, it was based on the ftruncate functionality.
> 
> I am hesitant to change the behavior of hugetlb hole punch or truncate
> as people may be relying on that behavior today.  Your QEMU example is
> one such case.

Yes, I assume we're stuck with that.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2022-01-28  9:55 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-13 18:03 [RFC PATCH 0/3] Add hugetlb MADV_DONTNEED support Mike Kravetz
2022-01-13 18:03 ` [RFC PATCH 1/3] mm: enable MADV_DONTNEED for hugetlb mappings Mike Kravetz
2022-01-27  2:58   ` Naoya Horiguchi
2022-01-27  3:19     ` Mike Kravetz
2022-01-13 18:03 ` [RFC PATCH 2/3] selftests/vm: add hugetlb madvise MADV_DONTNEED MADV_REMOVE test Mike Kravetz
2022-01-13 18:03 ` [RFC PATCH 3/3] userfaultfd/selftests: enable huegtlb remap and remove event testing Mike Kravetz
2022-01-27 11:57 ` [RFC PATCH 0/3] Add hugetlb MADV_DONTNEED support David Hildenbrand
2022-01-27 17:52   ` Axel Rasmussen
2022-01-28  9:44     ` David Hildenbrand
2022-01-27 17:55   ` Mike Kravetz
2022-01-28  9:55     ` David Hildenbrand

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.