[PATCH v2 0/3] split vm_normal_pages for LRU and non-LRU handling

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 0/3] split vm_normal_pages for LRU and non-LRU handling
@ 2022-03-30 21:25 Alex Sierra
  2022-03-30 21:25 ` [PATCH v2 1/3] mm: add vm_normal_lru_pages for LRU handled pages only Alex Sierra
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Alex Sierra @ 2022-03-30 21:25 UTC (permalink / raw)
  To: jgg
  Cc: david, Felix.Kuehling, linux-mm, rcampbell, linux-ext4,
	linux-xfs, amd-gfx, dri-devel, hch, jglisse, apopple, willy,
	akpm

With DEVICE_COHERENT, we'll soon have vm_normal_pages() return
device-managed anonymous pages that are not LRU pages. Although they
behave like normal pages for purposes of mapping in CPU page, and for
COW. They do not support LRU lists, NUMA migration or THP. The
difference between new vm_normal_lru_pages vs vm_normal_pages() is,
the former makes sure to return pages that are LRU handled only.

We also introduced a FOLL_LRU flag that adds the same behaviour to
follow_page and related APIs, to allow callers to specify that they
expect to put pages on an LRU list.

HMM tests were added to selftest to excercise these changes with
device coherent pages. New test called hmm_cow_in_device, will test
pages marked as COW, allocated in device zone. Also, more
configurations were added into hmm_gup_test to test basic get
user pages and get user pages fast paths in device zone pages.

v2:
- Changed the general description for this cover letter.
- Changed commit message for patch 1/3.
- Keep vm_normal_pages and add vm_normal_lru_pages, instead of rename
both.
- Add proper kernel-doc format to new function and minimize code
churn.

TODO: vm_normal_pages with pte_devmap entries still return NULL,
instead of return the actual device page. The reason is
page->_mapcount is never incremented for device pages that are mmap
through DAX mechanism using pmem driver mounted into ext4 filesystem.
When these pages are unmap, zap_pte_range is called and
vm_normal_page return a valid page with page_mapcount() = 0, before
page_remove_rmap is called. 

Alex Sierra (3):
  mm: add vm_normal_lru_pages for LRU handled pages only
  tools: add more gup configs to hmm_gup selftests
  tools: add selftests to hmm for COW in device memory

 fs/proc/task_mmu.c                     |   2 +-
 include/linux/mm.h                     |   9 +-
 mm/gup.c                               |   8 +-
 mm/huge_memory.c                       |   2 +-
 mm/khugepaged.c                        |   8 +-
 mm/ksm.c                               |   4 +-
 mm/madvise.c                           |   4 +-
 mm/memory.c                            |  40 ++++++-
 mm/mempolicy.c                         |   4 +-
 mm/migrate.c                           |   2 +-
 mm/mlock.c                             |   6 +-
 mm/mprotect.c                          |   2 +-
 tools/testing/selftests/vm/hmm-tests.c | 139 +++++++++++++++++++++----
 13 files changed, 187 insertions(+), 43 deletions(-)

-- 
2.32.0

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH v2 1/3] mm: add vm_normal_lru_pages for LRU handled pages only
  2022-03-30 21:25 [PATCH v2 0/3] split vm_normal_pages for LRU and non-LRU handling Alex Sierra
@ 2022-03-30 21:25 ` Alex Sierra
  2022-03-31  8:53   ` Christoph Hellwig
  2022-03-30 21:25 ` [PATCH v2 2/3] tools: add more gup configs to hmm_gup selftests Alex Sierra
  2022-03-30 21:25 ` [PATCH v2 3/3] tools: add selftests to hmm for COW in device memory Alex Sierra
  2 siblings, 1 reply; 10+ messages in thread
From: Alex Sierra @ 2022-03-30 21:25 UTC (permalink / raw)
  To: jgg
  Cc: david, Felix.Kuehling, linux-mm, rcampbell, linux-ext4,
	linux-xfs, amd-gfx, dri-devel, hch, jglisse, apopple, willy,
	akpm

With DEVICE_COHERENT, we'll soon have vm_normal_pages() return
device-managed anonymous pages that are not LRU pages. Although they
behave like normal pages for purposes of mapping in CPU page, and for
COW. They do not support LRU lists, NUMA migration or THP. The
difference between new vm_normal_lru_pages vs vm_normal_pages() is,
the former makes sure to return pages that are LRU handled only.

We also introduced a FOLL_LRU flag that adds the same behaviour to
follow_page and related APIs, to allow callers to specify that they
expect to put pages on an LRU list.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 fs/proc/task_mmu.c |  2 +-
 include/linux/mm.h |  9 ++++++---
 mm/gup.c           |  8 +++++---
 mm/huge_memory.c   |  2 +-
 mm/khugepaged.c    |  8 ++++----
 mm/ksm.c           |  4 ++--
 mm/madvise.c       |  4 ++--
 mm/memory.c        | 40 ++++++++++++++++++++++++++++++++++++++--
 mm/mempolicy.c     |  4 ++--
 mm/migrate.c       |  2 +-
 mm/mlock.c         |  6 +++---
 mm/mprotect.c      |  2 +-
 12 files changed, 66 insertions(+), 25 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index f46060eb91b5..1791a86d9ecb 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1784,7 +1784,7 @@ static struct page *can_gather_numa_stats(pte_t pte, struct vm_area_struct *vma,
 	if (!pte_present(pte))
 		return NULL;
 
-	page = vm_normal_page(vma, addr, pte);
+	page = vm_normal_lru_page(vma, addr, pte);
 	if (!page)
 		return NULL;
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d507c32724c0..0299dbb35335 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -593,8 +593,8 @@ struct vm_operations_struct {
 					unsigned long addr);
 #endif
 	/*
-	 * Called by vm_normal_page() for special PTEs to find the
-	 * page for @addr.  This is useful if the default behavior
+	 * Called by vm_normal_*_page() for special PTEs to find the
+	 * page for @addr. This is useful if the default behavior
 	 * (using pte_page()) would not find the correct page.
 	 */
 	struct page *(*find_special_page)(struct vm_area_struct *vma,
@@ -1783,6 +1783,8 @@ extern void user_shm_unlock(size_t, struct ucounts *);
 
 struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 			     pte_t pte);
+struct page *vm_normal_lru_page(struct vm_area_struct *vma, unsigned long addr,
+			     pte_t pte);
 struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 				pmd_t pmd);
 
@@ -2901,6 +2903,7 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
 #define FOLL_PIN	0x40000	/* pages must be released via unpin_user_page */
 #define FOLL_FAST_ONLY	0x80000	/* gup_fast: prevent fall-back to slow gup */
+#define FOLL_LRU	0x100000 /* return only LRU (anon or page cache) */
 
 /*
  * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
@@ -3227,7 +3230,7 @@ extern long copy_huge_page_from_user(struct page *dst_page,
  * @vma: Pointer to the struct vm_area_struct to consider
  *
  * Whether transhuge page-table entries are considered "special" following
- * the definition in vm_normal_page().
+ * the definition in vm_normal_*_page().
  *
  * Return: true if transhuge page-table entries should be considered special,
  * false otherwise.
diff --git a/mm/gup.c b/mm/gup.c
index 41349b685eaf..d0494428a0c3 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -539,8 +539,10 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
 		pte_unmap_unlock(ptep, ptl);
 		return NULL;
 	}
-
-	page = vm_normal_page(vma, address, pte);
+	if (flags & (FOLL_MLOCK | FOLL_LRU))
+		page = vm_normal_lru_page(vma, address, pte);
+	else
+		page = vm_normal_page(vma, address, pte);
 	if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) {
 		/*
 		 * Only return device mapping pages in the FOLL_GET or FOLL_PIN
@@ -824,7 +826,7 @@ static struct page *follow_p4d_mask(struct vm_area_struct *vma,
  *
  * Return: the mapped (struct page *), %NULL if no mapping exists, or
  * an error pointer if there is a mapping to something not represented
- * by a page descriptor (see also vm_normal_page()).
+ * by a page descriptor (see also vm_normal_*_page()).
  */
 static struct page *follow_page_mask(struct vm_area_struct *vma,
 			      unsigned long address, unsigned int flags,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 406a3c28c026..ea1efc825774 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2966,7 +2966,7 @@ static int split_huge_pages_pid(int pid, unsigned long vaddr_start,
 		}
 
 		/* FOLL_DUMP to ignore special (like zero) pages */
-		follflags = FOLL_GET | FOLL_DUMP;
+		follflags = FOLL_GET | FOLL_DUMP | FOLL_LRU;
 		page = follow_page(vma, addr, follflags);
 
 		if (IS_ERR(page))
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 131492fd1148..a7153db09afa 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -627,7 +627,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 			result = SCAN_PTE_NON_PRESENT;
 			goto out;
 		}
-		page = vm_normal_page(vma, address, pteval);
+		page = vm_normal_lru_page(vma, address, pteval);
 		if (unlikely(!page)) {
 			result = SCAN_PAGE_NULL;
 			goto out;
@@ -1286,7 +1286,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
 		if (pte_write(pteval))
 			writable = true;
 
-		page = vm_normal_page(vma, _address, pteval);
+		page = vm_normal_lru_page(vma, _address, pteval);
 		if (unlikely(!page)) {
 			result = SCAN_PAGE_NULL;
 			goto out_unmap;
@@ -1494,7 +1494,7 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
 		if (!pte_present(*pte))
 			goto abort;
 
-		page = vm_normal_page(vma, addr, *pte);
+		page = vm_normal_lru_page(vma, addr, *pte);
 
 		/*
 		 * Note that uprobe, debugger, or MAP_PRIVATE may change the
@@ -1512,7 +1512,7 @@ void collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr)
 
 		if (pte_none(*pte))
 			continue;
-		page = vm_normal_page(vma, addr, *pte);
+		page = vm_normal_lru_page(vma, addr, *pte);
 		page_remove_rmap(page, false);
 	}
 
diff --git a/mm/ksm.c b/mm/ksm.c
index c20bd4d9a0d9..352d37e44694 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -474,7 +474,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
 	do {
 		cond_resched();
 		page = follow_page(vma, addr,
-				FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE);
+				FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE | FOLL_LRU);
 		if (IS_ERR_OR_NULL(page))
 			break;
 		if (PageKsm(page))
@@ -559,7 +559,7 @@ static struct page *get_mergeable_page(struct rmap_item *rmap_item)
 	if (!vma)
 		goto out;
 
-	page = follow_page(vma, addr, FOLL_GET);
+	page = follow_page(vma, addr, FOLL_GET | FOLL_LRU);
 	if (IS_ERR_OR_NULL(page))
 		goto out;
 	if (PageAnon(page)) {
diff --git a/mm/madvise.c b/mm/madvise.c
index 38d0f515d548..8dcfb93fe0d5 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -411,7 +411,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 		if (!pte_present(ptent))
 			continue;
 
-		page = vm_normal_page(vma, addr, ptent);
+		page = vm_normal_lru_page(vma, addr, ptent);
 		if (!page)
 			continue;
 
@@ -621,7 +621,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 			continue;
 		}
 
-		page = vm_normal_page(vma, addr, ptent);
+		page = vm_normal_lru_page(vma, addr, ptent);
 		if (!page)
 			continue;
 
diff --git a/mm/memory.c b/mm/memory.c
index c125c4969913..5e83b65a74aa 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -621,6 +621,13 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 		if (is_zero_pfn(pfn))
 			return NULL;
 		if (pte_devmap(pte))
+/*
+ * NOTE: Technically this should goto check_pfn label. However, page->_mapcount
+ * is never incremented for device pages that are mmap through DAX mechanism
+ * using pmem driver mounted into ext4 filesystem. When these pages are unmap,
+ * zap_pte_range is called and vm_normal_page return a valid page with
+ * page_mapcount() = 0, before page_remove_rmap is called.
+ */
 			return NULL;
 
 		print_bad_pte(vma, addr, pte, NULL);
@@ -661,6 +668,35 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 	return pfn_to_page(pfn);
 }
 
+/**
+ * vm_normal_lru_page - gets LRU handled page associated with a pte.
+ *
+ * @vma: user vma the page belongs to.
+ * @addr: user address the page belongs to.
+ * @pte: page table entry associated to the page.
+ *
+ * This function gets the "struct page" associated with a pte, only for pages
+ * that can be put on an LRU list and that support NUMA migration, KSM and
+ * THP.
+ *
+ * With DEVICE_COHERENT introduction, vm_normal_pages() could return
+ * device-managed anonymous pages that are not LRU pages. This
+ * vm_normal_lru_page function, makes sure to return LRU handled pages only.
+ *
+ * Return: "struct page" reference associated with the pte.
+ */
+struct page *vm_normal_lru_page(struct vm_area_struct *vma, unsigned long addr,
+			    pte_t pte)
+{
+	struct page *page;
+
+	page = vm_normal_page(vma, addr, pte);
+	if (page && is_zone_device_page(page))
+		return NULL;
+
+	return page;
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 				pmd_t pmd)
@@ -2168,7 +2204,7 @@ EXPORT_SYMBOL(vmf_insert_pfn);
 
 static bool vm_mixed_ok(struct vm_area_struct *vma, pfn_t pfn)
 {
-	/* these checks mirror the abort conditions in vm_normal_page */
+	/* these checks mirror the abort conditions in vm_normal_lru_page */
 	if (vma->vm_flags & VM_MIXEDMAP)
 		return true;
 	if (pfn_t_devmap(pfn))
@@ -4364,7 +4400,7 @@ static vm_fault_t do_numa_page(struct vm_fault *vmf)
 	old_pte = ptep_get(vmf->pte);
 	pte = pte_modify(old_pte, vma->vm_page_prot);
 
-	page = vm_normal_page(vma, vmf->address, pte);
+	page = vm_normal_lru_page(vma, vmf->address, pte);
 	if (!page)
 		goto out_map;
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 69284d3b5e53..651408f14b3e 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -527,11 +527,11 @@ static int queue_pages_pte_range(pmd_t *pmd, unsigned long addr,
 	for (; addr != end; pte++, addr += PAGE_SIZE) {
 		if (!pte_present(*pte))
 			continue;
-		page = vm_normal_page(vma, addr, *pte);
+		page = vm_normal_lru_page(vma, addr, *pte);
 		if (!page)
 			continue;
 		/*
-		 * vm_normal_page() filters out zero pages, but there might
+		 * vm_normal_lru_page() filters out zero pages, but there might
 		 * still be PageReserved pages to skip, perhaps in a VDSO.
 		 */
 		if (PageReserved(page))
diff --git a/mm/migrate.c b/mm/migrate.c
index c31d04b46a5e..17d049311b78 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1614,7 +1614,7 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
 		goto out;
 
 	/* FOLL_DUMP to ignore special (like zero) pages */
-	follflags = FOLL_GET | FOLL_DUMP;
+	follflags = FOLL_GET | FOLL_DUMP | FOLL_LRU;
 	page = follow_page(vma, addr, follflags);
 
 	err = PTR_ERR(page);
diff --git a/mm/mlock.c b/mm/mlock.c
index 25934e7db3e1..bb09926aeee7 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -342,7 +342,7 @@ static void __munlock_pagevec(struct pagevec *pvec, struct zone *zone)
  * a non-TPH page already pinned and in the @pvec, and that it belongs to @zone.
  *
  * The rest of @pvec is filled by subsequent pages within the same pmd and same
- * zone, as long as the pte's are present and vm_normal_page() succeeds. These
+ * zone, as long as the pte's are present and vm_normal_lru_page() succeeds. These
  * pages also get pinned.
  *
  * Returns the address of the next page that should be scanned. This equals
@@ -373,7 +373,7 @@ static unsigned long __munlock_pagevec_fill(struct pagevec *pvec,
 		struct page *page = NULL;
 		pte++;
 		if (pte_present(*pte))
-			page = vm_normal_page(vma, start, *pte);
+			page = vm_normal_lru_page(vma, start, *pte);
 		/*
 		 * Break if page could not be obtained or the page's node+zone does not
 		 * match
@@ -439,7 +439,7 @@ void munlock_vma_pages_range(struct vm_area_struct *vma,
 		 * suits munlock very well (and if somehow an abnormal page
 		 * has sneaked into the range, we won't oops here: great).
 		 */
-		page = follow_page(vma, start, FOLL_GET | FOLL_DUMP);
+		page = follow_page(vma, start, FOLL_GET | FOLL_DUMP | FOLL_LRU);
 
 		if (page && !IS_ERR(page)) {
 			if (PageTransTail(page)) {
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 2887644fd150..bc3e75334aeb 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -88,7 +88,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				if (pte_protnone(oldpte))
 					continue;
 
-				page = vm_normal_page(vma, addr, oldpte);
+				page = vm_normal_lru_page(vma, addr, oldpte);
 				if (!page || PageKsm(page))
 					continue;
 
-- 
2.32.0



^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v2 2/3] tools: add more gup configs to hmm_gup selftests
  2022-03-30 21:25 [PATCH v2 0/3] split vm_normal_pages for LRU and non-LRU handling Alex Sierra
  2022-03-30 21:25 ` [PATCH v2 1/3] mm: add vm_normal_lru_pages for LRU handled pages only Alex Sierra
@ 2022-03-30 21:25 ` Alex Sierra
  2022-03-30 21:25 ` [PATCH v2 3/3] tools: add selftests to hmm for COW in device memory Alex Sierra
  2 siblings, 0 replies; 10+ messages in thread
From: Alex Sierra @ 2022-03-30 21:25 UTC (permalink / raw)
  To: jgg
  Cc: david, Felix.Kuehling, linux-mm, rcampbell, linux-ext4,
	linux-xfs, amd-gfx, dri-devel, hch, jglisse, apopple, willy,
	akpm

Test device pages with get_user_pages and get_user_pages_fast.
The motivation is to test device coherent type pages in the gup and
gup fast paths, after vm_normal_pages was split into LRU and non-LRU
handled.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 tools/testing/selftests/vm/hmm-tests.c | 65 +++++++++++++++++---------
 1 file changed, 44 insertions(+), 21 deletions(-)

diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c
index 11b83a8084fe..65e30ab6494c 100644
--- a/tools/testing/selftests/vm/hmm-tests.c
+++ b/tools/testing/selftests/vm/hmm-tests.c
@@ -1769,6 +1769,24 @@ TEST_F(hmm, exclusive_cow)
 	hmm_buffer_free(buffer);
 }
 
+static int gup_test_exec(int gup_fd, unsigned long addr,
+			 int cmd, int npages, int size)
+{
+	struct gup_test gup = {
+		.nr_pages_per_call	= npages,
+		.addr			= addr,
+		.gup_flags		= FOLL_WRITE,
+		.size			= size,
+	};
+
+	if (ioctl(gup_fd, cmd, &gup)) {
+		perror("ioctl on error\n");
+		return errno;
+	}
+
+	return 0;
+}
+
 /*
  * Test get user device pages through gup_test. Setting PIN_LONGTERM flag.
  * This should trigger a migration back to system memory for both, private
@@ -1779,7 +1797,6 @@ TEST_F(hmm, exclusive_cow)
 TEST_F(hmm, hmm_gup_test)
 {
 	struct hmm_buffer *buffer;
-	struct gup_test gup;
 	int gup_fd;
 	unsigned long npages;
 	unsigned long size;
@@ -1792,8 +1809,7 @@ TEST_F(hmm, hmm_gup_test)
 	if (gup_fd == -1)
 		SKIP(return, "Skipping test, could not find gup_test driver");
 
-	npages = 4;
-	ASSERT_NE(npages, 0);
+	npages = 3;
 	size = npages << self->page_shift;
 
 	buffer = malloc(sizeof(*buffer));
@@ -1822,28 +1838,35 @@ TEST_F(hmm, hmm_gup_test)
 	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
 		ASSERT_EQ(ptr[i], i);
 
-	gup.nr_pages_per_call = npages;
-	gup.addr = (unsigned long)buffer->ptr;
-	gup.gup_flags = FOLL_WRITE;
-	gup.size = size;
-	/*
-	 * Calling gup_test ioctl. It will try to PIN_LONGTERM these device pages
-	 * causing a migration back to system memory for both, private and coherent
-	 * type pages.
-	 */
-	if (ioctl(gup_fd, PIN_LONGTERM_BENCHMARK, &gup)) {
-		perror("ioctl on PIN_LONGTERM_BENCHMARK\n");
-		goto out_test;
-	}
-
-	/* Take snapshot to make sure pages have been migrated to sys memory */
+	ASSERT_EQ(gup_test_exec(gup_fd,
+				(unsigned long)buffer->ptr,
+				GUP_BASIC_TEST, 1, self->page_size), 0);
+	ASSERT_EQ(gup_test_exec(gup_fd,
+				(unsigned long)buffer->ptr + 1 * self->page_size,
+				GUP_FAST_BENCHMARK, 1, self->page_size), 0);
+	ASSERT_EQ(gup_test_exec(gup_fd,
+				(unsigned long)buffer->ptr + 2 * self->page_size,
+				PIN_LONGTERM_BENCHMARK, 1, self->page_size), 0);
+
+	/* Take snapshot to CPU pagetables */
 	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_SNAPSHOT, buffer, npages);
 	ASSERT_EQ(ret, 0);
 	ASSERT_EQ(buffer->cpages, npages);
 	m = buffer->mirror;
-	for (i = 0; i < npages; i++)
-		ASSERT_EQ(m[i], HMM_DMIRROR_PROT_WRITE);
-out_test:
+	if (hmm_is_coherent_type(variant->device_number)) {
+		ASSERT_EQ(HMM_DMIRROR_PROT_DEV_COHERENT_LOCAL | HMM_DMIRROR_PROT_WRITE, m[0]);
+		ASSERT_EQ(HMM_DMIRROR_PROT_DEV_COHERENT_LOCAL | HMM_DMIRROR_PROT_WRITE, m[1]);
+	} else {
+		ASSERT_EQ(HMM_DMIRROR_PROT_WRITE, m[0]);
+		ASSERT_EQ(HMM_DMIRROR_PROT_WRITE, m[1]);
+	}
+	ASSERT_EQ(HMM_DMIRROR_PROT_WRITE, m[2]);
+	/* Check again the content on the pages. Make sure there's no
+	 * corrupted data.
+	 */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
 	close(gup_fd);
 	hmm_buffer_free(buffer);
 }
-- 
2.32.0



^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v2 3/3] tools: add selftests to hmm for COW in device memory
  2022-03-30 21:25 [PATCH v2 0/3] split vm_normal_pages for LRU and non-LRU handling Alex Sierra
  2022-03-30 21:25 ` [PATCH v2 1/3] mm: add vm_normal_lru_pages for LRU handled pages only Alex Sierra
  2022-03-30 21:25 ` [PATCH v2 2/3] tools: add more gup configs to hmm_gup selftests Alex Sierra
@ 2022-03-30 21:25 ` Alex Sierra
  2 siblings, 0 replies; 10+ messages in thread
From: Alex Sierra @ 2022-03-30 21:25 UTC (permalink / raw)
  To: jgg
  Cc: david, Felix.Kuehling, linux-mm, rcampbell, linux-ext4,
	linux-xfs, amd-gfx, dri-devel, hch, jglisse, apopple, willy,
	akpm

The objective is to test device migration mechanism in pages marked
as COW, for private and coherent device type. In case of writing to
COW private page(s), a page fault will migrate pages back to system
memory first. Then, these pages will be duplicated. In case of COW
device coherent type, pages are duplicated directly from device
memory.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
---
 tools/testing/selftests/vm/hmm-tests.c | 80 ++++++++++++++++++++++++++
 1 file changed, 80 insertions(+)

diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c
index 65e30ab6494c..d70b780df877 100644
--- a/tools/testing/selftests/vm/hmm-tests.c
+++ b/tools/testing/selftests/vm/hmm-tests.c
@@ -1870,4 +1870,84 @@ TEST_F(hmm, hmm_gup_test)
 	close(gup_fd);
 	hmm_buffer_free(buffer);
 }
+
+/*
+ * Test copy-on-write in device pages.
+ * In case of writing to COW private page(s), a page fault will migrate pages
+ * back to system memory first. Then, these pages will be duplicated. In case
+ * of COW device coherent type, pages are duplicated directly from device
+ * memory.
+ */
+TEST_F(hmm, hmm_cow_in_device)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	int ret;
+	unsigned char *m;
+	pid_t pid;
+	int status;
+
+	npages = 4;
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/* Initialize buffer in system memory. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Migrate memory to device. */
+
+	ret = hmm_migrate_sys_to_dev(self->fd, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	pid = fork();
+	if (pid == -1)
+		ASSERT_EQ(pid, 0);
+	if (!pid) {
+		/* Child process waitd for SIGTERM from the parent. */
+		while (1) {
+		}
+		perror("Should not reach this\n");
+		exit(0);
+	}
+	/* Parent process writes to COW pages(s) and gets a
+	 * new copy in system. In case of device private pages,
+	 * this write causes a migration to system mem first.
+	 */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Terminate child and wait */
+	EXPECT_EQ(0, kill(pid, SIGTERM));
+	EXPECT_EQ(pid, waitpid(pid, &status, 0));
+	EXPECT_NE(0, WIFSIGNALED(status));
+	EXPECT_EQ(SIGTERM, WTERMSIG(status));
+
+	/* Take snapshot to CPU pagetables */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_SNAPSHOT, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+	m = buffer->mirror;
+	for (i = 0; i < npages; i++)
+		ASSERT_EQ(HMM_DMIRROR_PROT_WRITE, m[i]);
+
+	hmm_buffer_free(buffer);
+}
 TEST_HARNESS_MAIN
-- 
2.32.0



^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 1/3] mm: add vm_normal_lru_pages for LRU handled pages only
  2022-03-30 21:25 ` [PATCH v2 1/3] mm: add vm_normal_lru_pages for LRU handled pages only Alex Sierra
@ 2022-03-31  8:53   ` Christoph Hellwig
  2022-03-31  8:55     ` David Hildenbrand
  2022-04-01 20:08     ` Felix Kuehling
  0 siblings, 2 replies; 10+ messages in thread
From: Christoph Hellwig @ 2022-03-31  8:53 UTC (permalink / raw)
  To: Alex Sierra
  Cc: jgg, david, Felix.Kuehling, linux-mm, rcampbell, linux-ext4,
	linux-xfs, amd-gfx, dri-devel, hch, jglisse, apopple, willy,
	akpm

> -	page = vm_normal_page(vma, addr, pte);
> +	page = vm_normal_lru_page(vma, addr, pte);

Why can't this deal with ZONE_DEVICE pages?  It certainly has
nothing do with a LRU I think.  In fact being able to have
stats that count say the number of device pages here would
probably be useful at some point.

In general I find the vm_normal_lru_page vs vm_normal_page
API highly confusing.  An explicit check for zone device pages
in the dozen or so spots that care has a much better documentation
value, especially if accompanied by comments where it isn't entirely
obvious.

>  		page = follow_page(vma, addr,
> -				FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE);
> +				FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE | FOLL_LRU);

Overly long line here.

> +/*
> + * NOTE: Technically this should goto check_pfn label. However, page->_mapcount
> + * is never incremented for device pages that are mmap through DAX mechanism
> + * using pmem driver mounted into ext4 filesystem. When these pages are unmap,
> + * zap_pte_range is called and vm_normal_page return a valid page with
> + * page_mapcount() = 0, before page_remove_rmap is called.
> + */

Please properly indent comments.

> + * zone, as long as the pte's are present and vm_normal_lru_page() succeeds. These
>   * pages also get pinned.

Another overly long line here.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 1/3] mm: add vm_normal_lru_pages for LRU handled pages only
  2022-03-31  8:53   ` Christoph Hellwig
@ 2022-03-31  8:55     ` David Hildenbrand
  2022-03-31  8:57       ` Christoph Hellwig
  2022-04-01 20:08     ` Felix Kuehling
  1 sibling, 1 reply; 10+ messages in thread
From: David Hildenbrand @ 2022-03-31  8:55 UTC (permalink / raw)
  To: Christoph Hellwig, Alex Sierra
  Cc: jgg, Felix.Kuehling, linux-mm, rcampbell, linux-ext4, linux-xfs,
	amd-gfx, dri-devel, jglisse, apopple, willy, akpm

On 31.03.22 10:53, Christoph Hellwig wrote:
>> -	page = vm_normal_page(vma, addr, pte);
>> +	page = vm_normal_lru_page(vma, addr, pte);
> 
> Why can't this deal with ZONE_DEVICE pages?  It certainly has
> nothing do with a LRU I think.  In fact being able to have
> stats that count say the number of device pages here would
> probably be useful at some point.
> 
> In general I find the vm_normal_lru_page vs vm_normal_page
> API highly confusing.  An explicit check for zone device pages
> in the dozen or so spots that care has a much better documentation
> value, especially if accompanied by comments where it isn't entirely
> obvious.

What's your thought on FOLL_LRU?

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 1/3] mm: add vm_normal_lru_pages for LRU handled pages only
  2022-03-31  8:55     ` David Hildenbrand
@ 2022-03-31  8:57       ` Christoph Hellwig
  0 siblings, 0 replies; 10+ messages in thread
From: Christoph Hellwig @ 2022-03-31  8:57 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Christoph Hellwig, Alex Sierra, jgg, Felix.Kuehling, linux-mm,
	rcampbell, linux-ext4, linux-xfs, amd-gfx, dri-devel, jglisse,
	apopple, willy, akpm

On Thu, Mar 31, 2022 at 10:55:13AM +0200, David Hildenbrand wrote:
> > Why can't this deal with ZONE_DEVICE pages?  It certainly has
> > nothing do with a LRU I think.  In fact being able to have
> > stats that count say the number of device pages here would
> > probably be useful at some point.
> > 
> > In general I find the vm_normal_lru_page vs vm_normal_page
> > API highly confusing.  An explicit check for zone device pages
> > in the dozen or so spots that care has a much better documentation
> > value, especially if accompanied by comments where it isn't entirely
> > obvious.
> 
> What's your thought on FOLL_LRU?

Also a bit confusing, but inbetween all these FOLL_ flags it doesn't
really matter any more.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 1/3] mm: add vm_normal_lru_pages for LRU handled pages only
  2022-03-31  8:53   ` Christoph Hellwig
  2022-03-31  8:55     ` David Hildenbrand
@ 2022-04-01 20:08     ` Felix Kuehling
  2022-04-04 17:38       ` Jason Gunthorpe
  1 sibling, 1 reply; 10+ messages in thread
From: Felix Kuehling @ 2022-04-01 20:08 UTC (permalink / raw)
  To: Christoph Hellwig, Alex Sierra
  Cc: jgg, david, linux-mm, rcampbell, linux-ext4, linux-xfs, amd-gfx,
	dri-devel, jglisse, apopple, willy, akpm


On 2022-03-31 04:53, Christoph Hellwig wrote:
>> -	page = vm_normal_page(vma, addr, pte);
>> +	page = vm_normal_lru_page(vma, addr, pte);
> Why can't this deal with ZONE_DEVICE pages?  It certainly has
> nothing do with a LRU I think.  In fact being able to have
> stats that count say the number of device pages here would
> probably be useful at some point.

Maybe at some point. However, this is in a function called 
"can_gather_numa_stats". There are no meaningful NUMA stats for device 
pages. I agree that the name "vm_normal_lru_page" is not optimal in this 
case.


>
> In general I find the vm_normal_lru_page vs vm_normal_page
> API highly confusing.  An explicit check for zone device pages
> in the dozen or so spots that care has a much better documentation
> value, especially if accompanied by comments where it isn't entirely
> obvious.

OK. We can do that. It would solve the function naming problem, and we'd 
have more visibility of device page handling in more places in the 
kernel, which has educational value.

Regards,
   Felix


>
>>   		page = follow_page(vma, addr,
>> -				FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE);
>> +				FOLL_GET | FOLL_MIGRATION | FOLL_REMOTE | FOLL_LRU);
> Overly long line here.
>
>> +/*
>> + * NOTE: Technically this should goto check_pfn label. However, page->_mapcount
>> + * is never incremented for device pages that are mmap through DAX mechanism
>> + * using pmem driver mounted into ext4 filesystem. When these pages are unmap,
>> + * zap_pte_range is called and vm_normal_page return a valid page with
>> + * page_mapcount() = 0, before page_remove_rmap is called.
>> + */
> Please properly indent comments.
>
>> + * zone, as long as the pte's are present and vm_normal_lru_page() succeeds. These
>>    * pages also get pinned.
> Another overly long line here.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 1/3] mm: add vm_normal_lru_pages for LRU handled pages only
  2022-04-01 20:08     ` Felix Kuehling
@ 2022-04-04 17:38       ` Jason Gunthorpe
  2022-04-04 19:22         ` Sierra Guiza, Alejandro (Alex)
  0 siblings, 1 reply; 10+ messages in thread
From: Jason Gunthorpe @ 2022-04-04 17:38 UTC (permalink / raw)
  To: Felix Kuehling
  Cc: Christoph Hellwig, Alex Sierra, david, linux-mm, rcampbell,
	linux-ext4, linux-xfs, amd-gfx, dri-devel, jglisse, apopple,
	willy, akpm

On Fri, Apr 01, 2022 at 04:08:35PM -0400, Felix Kuehling wrote:

> > In general I find the vm_normal_lru_page vs vm_normal_page
> > API highly confusing.  An explicit check for zone device pages
> > in the dozen or so spots that care has a much better documentation
> > value, especially if accompanied by comments where it isn't entirely
> > obvious.
> 
> OK. We can do that. It would solve the function naming problem, and we'd
> have more visibility of device page handling in more places in the kernel,
> which has educational value.

Personally I find the 'is page XYZ' pretty confusing, like I don't
know half of what the PageKsm annotations are for..

Testing against a specific property the code goes on to use right away
seems more descriptive to me.

Jason


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH v2 1/3] mm: add vm_normal_lru_pages for LRU handled pages only
  2022-04-04 17:38       ` Jason Gunthorpe
@ 2022-04-04 19:22         ` Sierra Guiza, Alejandro (Alex)
  0 siblings, 0 replies; 10+ messages in thread
From: Sierra Guiza, Alejandro (Alex) @ 2022-04-04 19:22 UTC (permalink / raw)
  To: Jason Gunthorpe, Felix Kuehling
  Cc: Christoph Hellwig, david, linux-mm, rcampbell, linux-ext4,
	linux-xfs, amd-gfx, dri-devel, jglisse, apopple, willy, akpm


On 4/4/2022 12:38 PM, Jason Gunthorpe wrote:
> On Fri, Apr 01, 2022 at 04:08:35PM -0400, Felix Kuehling wrote:
>
>>> In general I find the vm_normal_lru_page vs vm_normal_page
>>> API highly confusing.  An explicit check for zone device pages
>>> in the dozen or so spots that care has a much better documentation
>>> value, especially if accompanied by comments where it isn't entirely
>>> obvious.
>> OK. We can do that. It would solve the function naming problem, and we'd
>> have more visibility of device page handling in more places in the kernel,
>> which has educational value.
> Personally I find the 'is page XYZ' pretty confusing, like I don't
> know half of what the PageKsm annotations are for..
>
> Testing against a specific property the code goes on to use right away
> seems more descriptive to me.

Hi Jason,

Are you referring to test for properties such as is_lru_page, 
is_numa_page, is_lockable_page, etc?
Otherwise, could you provide an example?

Regards,
Alex Sierra

>
> Jason


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2022-04-04 19:23 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-30 21:25 [PATCH v2 0/3] split vm_normal_pages for LRU and non-LRU handling Alex Sierra
2022-03-30 21:25 ` [PATCH v2 1/3] mm: add vm_normal_lru_pages for LRU handled pages only Alex Sierra
2022-03-31  8:53   ` Christoph Hellwig
2022-03-31  8:55     ` David Hildenbrand
2022-03-31  8:57       ` Christoph Hellwig
2022-04-01 20:08     ` Felix Kuehling
2022-04-04 17:38       ` Jason Gunthorpe
2022-04-04 19:22         ` Sierra Guiza, Alejandro (Alex)
2022-03-30 21:25 ` [PATCH v2 2/3] tools: add more gup configs to hmm_gup selftests Alex Sierra
2022-03-30 21:25 ` [PATCH v2 3/3] tools: add selftests to hmm for COW in device memory Alex Sierra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).