* [PATCH v3 0/6] mm/hmm/nouveau: add THP migration to migrate_vma_*
@ 2020-11-06  0:51 Ralph Campbell
  2020-11-06  0:51 ` [PATCH v3 1/6] mm/thp: add prep_transhuge_device_private_page() Ralph Campbell
                   ` (5 more replies)
  0 siblings, 6 replies; 25+ messages in thread
From: Ralph Campbell @ 2020-11-06  0:51 UTC (permalink / raw)
  To: linux-mm, nouveau, linux-kselftest, linux-kernel
  Cc: Jerome Glisse, John Hubbard, Alistair Popple, Christoph Hellwig,
	Jason Gunthorpe, Bharata B Rao, Zi Yan, Kirill A . Shutemov,
	Yang Shi, Ben Skeggs, Shuah Khan, Andrew Morton, Ralph Campbell

This series adds support for transparent huge page migration to
migrate_vma_*() and adds nouveau SVM and HMM selftests as consumers.
Earlier versions were posted [1] and [2].
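
For reference, a consumer drives the new support through the existing
migrate_vma_setup()/migrate_vma_pages()/migrate_vma_finalize() sequence and
simply adds the new selection flag. The sketch below is illustrative only;
my_alloc_and_copy() stands in for a driver's own allocation/copy step and is
not part of this series:

#include <linux/migrate.h>
#include <linux/mm.h>

/* Placeholder for the driver's allocation/copy step (not in this series). */
int my_alloc_and_copy(struct migrate_vma *args);

/* Migrate one range of a VMA, requesting whole THPs where possible. */
static int my_migrate_range(struct vm_area_struct *vma, void *pgmap_owner,
			    unsigned long start, unsigned long end,
			    unsigned long *src, unsigned long *dst)
{
	struct migrate_vma args = {
		.vma		= vma,
		.src		= src,
		.dst		= dst,
		.start		= start,
		.end		= end,
		.pgmap_owner	= pgmap_owner,
		/* New in this series: ask for whole THPs where possible. */
		.flags		= MIGRATE_VMA_SELECT_SYSTEM |
				  MIGRATE_VMA_SELECT_COMPOUND,
	};
	int ret;

	ret = migrate_vma_setup(&args);
	if (ret)
		return ret;

	/* Fill args.dst[] for the entries marked MIGRATE_PFN_MIGRATE. */
	ret = my_alloc_and_copy(&args);
	if (!ret)
		migrate_vma_pages(&args);
	migrate_vma_finalize(&args);
	return ret;
}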

The patches apply cleanly to the linux-mm 5.10.0-rc2 tree. There are a lot
of other THP patches in flight; I don't think there are any semantic
conflicts, but there may be merge conflicts depending on the order in which
Andrew applies them.

Changes in v3:
Sent the patch ("mm/thp: fix __split_huge_pmd_locked() for migration PMD")
as a separate patch from this series.

Rebased to linux-mm 5.10.0-rc2.

Changes in v2:
Added support for splitting a THP midway through the migration process,
i.e., in migrate_vma_pages().

[1] https://lore.kernel.org/linux-mm/20200619215649.32297-1-rcampbell@nvidia.com
[2] https://lore.kernel.org/linux-mm/20200902165830.5367-1-rcampbell@nvidia.com

Ralph Campbell (6):
  mm/thp: add prep_transhuge_device_private_page()
  mm/migrate: move migrate_vma_collect_skip()
  mm: support THP migration to device private memory
  mm/thp: add THP allocation helper
  mm/hmm/test: add self tests for THP migration
  nouveau: support THP migration to private memory

 drivers/gpu/drm/nouveau/nouveau_dmem.c | 289 +++++++++++-----
 drivers/gpu/drm/nouveau/nouveau_svm.c  |  11 +-
 drivers/gpu/drm/nouveau/nouveau_svm.h  |   3 +-
 include/linux/gfp.h                    |  10 +
 include/linux/huge_mm.h                |  12 +
 include/linux/memremap.h               |   9 +
 include/linux/migrate.h                |   2 +
 lib/test_hmm.c                         | 437 +++++++++++++++++++++----
 lib/test_hmm_uapi.h                    |   3 +
 mm/huge_memory.c                       | 147 +++++++--
 mm/memcontrol.c                        |  25 +-
 mm/memory.c                            |  10 +-
 mm/memremap.c                          |   4 +-
 mm/migrate.c                           | 429 +++++++++++++++++++-----
 mm/rmap.c                              |   2 +-
 tools/testing/selftests/vm/hmm-tests.c | 404 +++++++++++++++++++++++
 16 files changed, 1522 insertions(+), 275 deletions(-)

-- 
2.20.1




* [PATCH v3 1/6] mm/thp: add prep_transhuge_device_private_page()
  2020-11-06  0:51 [PATCH v3 0/6] mm/hmm/nouveau: add THP migration to migrate_vma_* Ralph Campbell
@ 2020-11-06  0:51 ` Ralph Campbell
  2020-11-06  7:55   ` Christoph Hellwig
  2020-11-06 12:14   ` Matthew Wilcox
  2020-11-06  0:51 ` [PATCH v3 2/6] mm/migrate: move migrate_vma_collect_skip() Ralph Campbell
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 25+ messages in thread
From: Ralph Campbell @ 2020-11-06  0:51 UTC (permalink / raw)
  To: linux-mm, nouveau, linux-kselftest, linux-kernel
  Cc: Jerome Glisse, John Hubbard, Alistair Popple, Christoph Hellwig,
	Jason Gunthorpe, Bharata B Rao, Zi Yan, Kirill A . Shutemov,
	Yang Shi, Ben Skeggs, Shuah Khan, Andrew Morton, Ralph Campbell

Add a helper function that device drivers can use to set up device private
transparent huge pages. This is intended to support migrating THPs to
device private memory.
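
As an illustration, a driver carving a newly added chunk of device private
memory into PMD-sized blocks could call it roughly as below. This is a
sketch modelled on the lib/test_hmm.c change later in this series; struct
my_device and its free lists are hypothetical driver state:

#include <linux/huge_mm.h>
#include <linux/mm.h>

/* Hypothetical driver state: free lists of small and huge device pages. */
struct my_device {
	struct page *free_pages;
	struct page *free_huge_pages;
};

/* Turn PMD-aligned runs of device private PFNs into compound THPs. */
static void my_init_chunk(struct my_device *mdev,
			  unsigned long pfn_first, unsigned long pfn_last)
{
	unsigned long pfn = pfn_first;

	while (pfn < pfn_last) {
		struct page *page = pfn_to_page(pfn);

		if ((pfn & (HPAGE_PMD_NR - 1)) == 0 &&
		    pfn + HPAGE_PMD_NR <= pfn_last) {
			prep_transhuge_device_private_page(page);
			page->zone_device_data = mdev->free_huge_pages;
			mdev->free_huge_pages = page;
			pfn += HPAGE_PMD_NR;
		} else {
			page->zone_device_data = mdev->free_pages;
			mdev->free_pages = page;
			pfn++;
		}
	}
}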

Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
---
 include/linux/huge_mm.h | 5 +++++
 mm/huge_memory.c        | 9 +++++++++
 2 files changed, 14 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 0365aa97f8e7..3ec26ef27a93 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -184,6 +184,7 @@ extern unsigned long thp_get_unmapped_area(struct file *filp,
 		unsigned long flags);
 
 extern void prep_transhuge_page(struct page *page);
+extern void prep_transhuge_device_private_page(struct page *page);
 extern void free_transhuge_page(struct page *page);
 bool is_transparent_hugepage(struct page *page);
 
@@ -377,6 +378,10 @@ static inline bool transhuge_vma_suitable(struct vm_area_struct *vma,
 
 static inline void prep_transhuge_page(struct page *page) {}
 
+static inline void prep_transhuge_device_private_page(struct page *page)
+{
+}
+
 static inline bool is_transparent_hugepage(struct page *page)
 {
 	return false;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 08a183f6c3ab..b4141f12ff31 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -498,6 +498,15 @@ void prep_transhuge_page(struct page *page)
 	set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR);
 }
 
+void prep_transhuge_device_private_page(struct page *page)
+{
+	prep_compound_page(page, HPAGE_PMD_ORDER);
+	prep_transhuge_page(page);
+	/* Only the head page has a reference to the pgmap. */
+	percpu_ref_put_many(page->pgmap->ref, HPAGE_PMD_NR - 1);
+}
+EXPORT_SYMBOL_GPL(prep_transhuge_device_private_page);
+
 bool is_transparent_hugepage(struct page *page)
 {
 	if (!PageCompound(page))
-- 
2.20.1




* [PATCH v3 2/6] mm/migrate: move migrate_vma_collect_skip()
  2020-11-06  0:51 [PATCH v3 0/6] mm/hmm/nouveau: add THP migration to migrate_vma_* Ralph Campbell
  2020-11-06  0:51 ` [PATCH v3 1/6] mm/thp: add prep_transhuge_device_private_page() Ralph Campbell
@ 2020-11-06  0:51 ` Ralph Campbell
  2020-11-06  7:56   ` Christoph Hellwig
  2020-11-06  7:57   ` Christoph Hellwig
  2020-11-06  0:51 ` [PATCH v3 3/6] mm: support THP migration to device private memory Ralph Campbell
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 25+ messages in thread
From: Ralph Campbell @ 2020-11-06  0:51 UTC (permalink / raw)
  To: linux-mm, nouveau, linux-kselftest, linux-kernel
  Cc: Jerome Glisse, John Hubbard, Alistair Popple, Christoph Hellwig,
	Jason Gunthorpe, Bharata B Rao, Zi Yan, Kirill A . Shutemov,
	Yang Shi, Ben Skeggs, Shuah Khan, Andrew Morton, Ralph Campbell

Move the definition of migrate_vma_collect_skip() above
migrate_vma_collect_hole() so the latter can call it. This makes the next
patch easier to read.

Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
---
 mm/migrate.c | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index c1585ec29827..665516319b66 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2253,6 +2253,21 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 #endif /* CONFIG_NUMA */
 
 #ifdef CONFIG_DEVICE_PRIVATE
+static int migrate_vma_collect_skip(unsigned long start,
+				    unsigned long end,
+				    struct mm_walk *walk)
+{
+	struct migrate_vma *migrate = walk->private;
+	unsigned long addr;
+
+	for (addr = start; addr < end; addr += PAGE_SIZE) {
+		migrate->dst[migrate->npages] = 0;
+		migrate->src[migrate->npages++] = 0;
+	}
+
+	return 0;
+}
+
 static int migrate_vma_collect_hole(unsigned long start,
 				    unsigned long end,
 				    __always_unused int depth,
@@ -2281,21 +2296,6 @@ static int migrate_vma_collect_hole(unsigned long start,
 	return 0;
 }
 
-static int migrate_vma_collect_skip(unsigned long start,
-				    unsigned long end,
-				    struct mm_walk *walk)
-{
-	struct migrate_vma *migrate = walk->private;
-	unsigned long addr;
-
-	for (addr = start; addr < end; addr += PAGE_SIZE) {
-		migrate->dst[migrate->npages] = 0;
-		migrate->src[migrate->npages++] = 0;
-	}
-
-	return 0;
-}
-
 static int migrate_vma_collect_pmd(pmd_t *pmdp,
 				   unsigned long start,
 				   unsigned long end,
-- 
2.20.1




* [PATCH v3 3/6] mm: support THP migration to device private memory
  2020-11-06  0:51 [PATCH v3 0/6] mm/hmm/nouveau: add THP migration to migrate_vma_* Ralph Campbell
  2020-11-06  0:51 ` [PATCH v3 1/6] mm/thp: add prep_transhuge_device_private_page() Ralph Campbell
  2020-11-06  0:51 ` [PATCH v3 2/6] mm/migrate: move migrate_vma_collect_skip() Ralph Campbell
@ 2020-11-06  0:51 ` Ralph Campbell
  2020-11-06  8:03   ` Christoph Hellwig
  2020-11-06  0:51 ` [PATCH v3 4/6] mm/thp: add THP allocation helper Ralph Campbell
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 25+ messages in thread
From: Ralph Campbell @ 2020-11-06  0:51 UTC (permalink / raw)
  To: linux-mm, nouveau, linux-kselftest, linux-kernel
  Cc: Jerome Glisse, John Hubbard, Alistair Popple, Christoph Hellwig,
	Jason Gunthorpe, Bharata B Rao, Zi Yan, Kirill A . Shutemov,
	Yang Shi, Ben Skeggs, Shuah Khan, Andrew Morton, Ralph Campbell

Support transparent huge page migration to ZONE_DEVICE private memory.
A new selection flag (MIGRATE_VMA_SELECT_COMPOUND) is added to request
THP migration. Otherwise, THPs are split when filling in the source PFN
array. A new flag (MIGRATE_PFN_COMPOUND) is added to the source PFN array
to indicate a huge page can be migrated. If the device driver can allocate
a huge page, it sets the MIGRATE_PFN_COMPOUND flag in the destination PFN
array. migrate_vma_pages() falls back to PAGE_SIZE pages when
MIGRATE_PFN_COMPOUND is set in the source PFN entry but not in the
corresponding destination entry.
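
From the driver's side, the flag is a negotiation between the two PFN
arrays. A rough sketch of the alloc-and-copy step follows;
my_alloc_device_huge_page() and my_alloc_device_page() are hypothetical
driver allocators that return a locked device private page, in the style
of lib/test_hmm.c:

#include <linux/migrate.h>
#include <linux/huge_mm.h>
#include <linux/mm.h>

/* Hypothetical driver allocators returning a locked device private page. */
struct page *my_alloc_device_huge_page(void);
struct page *my_alloc_device_page(void);

static void my_fill_dst(struct migrate_vma *args)
{
	unsigned long i;

	for (i = 0; i < args->npages; i++) {
		struct page *dpage;

		if (!(args->src[i] & MIGRATE_PFN_MIGRATE))
			continue;

		if (args->src[i] & MIGRATE_PFN_COMPOUND) {
			dpage = my_alloc_device_huge_page();
			if (dpage) {
				args->dst[i] = migrate_pfn(page_to_pfn(dpage)) |
					       MIGRATE_PFN_LOCKED |
					       MIGRATE_PFN_COMPOUND;
				continue;
			}
			/*
			 * No huge page available: leave MIGRATE_PFN_COMPOUND
			 * clear. migrate_vma_pages() then splits the THP and
			 * fixes up the src[] entries; a real driver would fill
			 * the remaining HPAGE_PMD_NR - 1 dst[] slots with
			 * PAGE_SIZE pages as lib/test_hmm.c does.
			 */
		}
		dpage = my_alloc_device_page();
		if (dpage)
			args->dst[i] = migrate_pfn(page_to_pfn(dpage)) |
				       MIGRATE_PFN_LOCKED;
	}
}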

Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
---
 include/linux/huge_mm.h  |   7 +
 include/linux/memremap.h |   9 +
 include/linux/migrate.h  |   2 +
 mm/huge_memory.c         | 124 +++++++++---
 mm/memcontrol.c          |  25 ++-
 mm/memory.c              |  10 +-
 mm/memremap.c            |   4 +-
 mm/migrate.c             | 413 ++++++++++++++++++++++++++++++++-------
 mm/rmap.c                |   2 +-
 9 files changed, 486 insertions(+), 110 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 3ec26ef27a93..1e8625cc233c 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -190,6 +190,8 @@ bool is_transparent_hugepage(struct page *page);
 
 bool can_split_huge_page(struct page *page, int *pextra_pins);
 int split_huge_page_to_list(struct page *page, struct list_head *list);
+int split_migrating_huge_page(struct vm_area_struct *vma, pmd_t *pmd,
+			      unsigned long address, struct page *page);
 static inline int split_huge_page(struct page *page)
 {
 	return split_huge_page_to_list(page, NULL);
@@ -456,6 +458,11 @@ static inline bool is_huge_zero_page(struct page *page)
 	return false;
 }
 
+static inline bool is_huge_zero_pmd(pmd_t pmd)
+{
+	return false;
+}
+
 static inline bool is_huge_zero_pud(pud_t pud)
 {
 	return false;
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 86c6c368ce9b..9b39a896af37 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -87,6 +87,15 @@ struct dev_pagemap_ops {
 	 * the page back to a CPU accessible page.
 	 */
 	vm_fault_t (*migrate_to_ram)(struct vm_fault *vmf);
+
+	/*
+	 * Used for private (un-addressable) device memory only.
+	 * This is called when a compound device private page is split.
+	 * The driver uses this callback to set tail_page->pgmap and
+	 * tail_page->zone_device_data appropriately based on the head
+	 * page.
+	 */
+	void (*page_split)(struct page *head, struct page *tail_page);
 };
 
 #define PGMAP_ALTMAP_VALID	(1 << 0)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 0f8d1583fa8e..92179bf360d1 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -144,6 +144,7 @@ static inline int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 #define MIGRATE_PFN_MIGRATE	(1UL << 1)
 #define MIGRATE_PFN_LOCKED	(1UL << 2)
 #define MIGRATE_PFN_WRITE	(1UL << 3)
+#define MIGRATE_PFN_COMPOUND	(1UL << 4)
 #define MIGRATE_PFN_SHIFT	6
 
 static inline struct page *migrate_pfn_to_page(unsigned long mpfn)
@@ -161,6 +162,7 @@ static inline unsigned long migrate_pfn(unsigned long pfn)
 enum migrate_vma_direction {
 	MIGRATE_VMA_SELECT_SYSTEM = 1 << 0,
 	MIGRATE_VMA_SELECT_DEVICE_PRIVATE = 1 << 1,
+	MIGRATE_VMA_SELECT_COMPOUND = 1 << 2,
 };
 
 struct migrate_vma {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b4141f12ff31..a073e66d0ee2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1682,23 +1682,35 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	} else {
 		struct page *page = NULL;
 		int flush_needed = 1;
+		bool is_anon = false;
 
 		if (pmd_present(orig_pmd)) {
 			page = pmd_page(orig_pmd);
+			is_anon = PageAnon(page);
 			page_remove_rmap(page, true);
 			VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
 			VM_BUG_ON_PAGE(!PageHead(page), page);
 		} else if (thp_migration_supported()) {
 			swp_entry_t entry;
 
-			VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
 			entry = pmd_to_swp_entry(orig_pmd);
-			page = pfn_to_page(swp_offset(entry));
+			if (is_device_private_entry(entry)) {
+				page = device_private_entry_to_page(entry);
+				is_anon = PageAnon(page);
+				page_remove_rmap(page, true);
+				VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
+				VM_BUG_ON_PAGE(!PageHead(page), page);
+				put_page(page);
+			} else {
+				VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
+				page = pfn_to_page(swp_offset(entry));
+				is_anon = PageAnon(page);
+			}
 			flush_needed = 0;
 		} else
 			WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
 
-		if (PageAnon(page)) {
+		if (is_anon) {
 			zap_deposited_table(tlb->mm, pmd);
 			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
 		} else {
@@ -2358,9 +2370,10 @@ static void remap_page(struct page *page, unsigned int nr)
 }
 
 static void __split_huge_page_tail(struct page *head, int tail,
-		struct lruvec *lruvec, struct list_head *list)
+		struct lruvec *lruvec, struct list_head *list, bool remap)
 {
 	struct page *page_tail = head + tail;
+	int pin_count;
 
 	VM_BUG_ON_PAGE(atomic_read(&page_tail->_mapcount) != -1, page_tail);
 
@@ -2396,15 +2409,24 @@ static void __split_huge_page_tail(struct page *head, int tail,
 	smp_wmb();
 
 	/*
-	 * Clear PageTail before unfreezing page refcount.
+	 * A successful get_page_unless_zero() might follow page_ref_unfreeze()
+	 * so PageTail needs to be cleared before unfreezing the page refcount
+	 * in order for compound_head() to work correctly.
 	 *
-	 * After successful get_page_unless_zero() might follow put_page()
-	 * which needs correct compound_head().
+	 * Also, ZONE_DEVICE struct pages share the compound_head field and
+	 * need to restore the pgmap pointer before unfreezing page refcount
+	 * in order for is_zone_device_page() to work correctly.
 	 */
-	clear_compound_head(page_tail);
+	if (is_device_private_page(head)) {
+		head->pgmap->ops->page_split(head, page_tail);
+		pin_count = 2;
+	} else {
+		clear_compound_head(page_tail);
+		pin_count = 1;
+	}
 
 	/* Finally unfreeze refcount. Additional reference from page cache. */
-	page_ref_unfreeze(page_tail, 1 + (!PageAnon(head) ||
+	page_ref_unfreeze(page_tail, pin_count + (!PageAnon(head) ||
 					  PageSwapCache(head)));
 
 	if (page_is_young(head))
@@ -2419,11 +2441,12 @@ static void __split_huge_page_tail(struct page *head, int tail,
 	 * pages to show after the currently processed elements - e.g.
 	 * migrate_pages
 	 */
-	lru_add_page_tail(head, page_tail, lruvec, list);
+	if (remap)
+		lru_add_page_tail(head, page_tail, lruvec, list);
 }
 
 static void __split_huge_page(struct page *page, struct list_head *list,
-		pgoff_t end, unsigned long flags)
+		pgoff_t end, unsigned long flags, bool remap)
 {
 	struct page *head = compound_head(page);
 	pg_data_t *pgdat = page_pgdat(head);
@@ -2447,7 +2470,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	}
 
 	for (i = nr - 1; i >= 1; i--) {
-		__split_huge_page_tail(head, i, lruvec, list);
+		__split_huge_page_tail(head, i, lruvec, list, remap);
 		/* Some pages can be beyond i_size: drop them from page cache */
 		if (head[i].index >= end) {
 			ClearPageDirty(head + i);
@@ -2474,6 +2497,9 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 		if (PageSwapCache(head)) {
 			page_ref_add(head, 2);
 			xa_unlock(&swap_cache->i_pages);
+		} else if (is_device_private_page(head)) {
+			percpu_ref_get_many(page->pgmap->ref, nr - 1);
+			page_ref_add(head, 2);
 		} else {
 			page_ref_inc(head);
 		}
@@ -2485,6 +2511,9 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 
 	spin_unlock_irqrestore(&pgdat->lru_lock, flags);
 
+	if (!remap)
+		return;
+
 	remap_page(head, nr);
 
 	if (PageSwapCache(head)) {
@@ -2602,6 +2631,7 @@ bool can_split_huge_page(struct page *page, int *pextra_pins)
 		extra_pins = PageSwapCache(page) ? thp_nr_pages(page) : 0;
 	else
 		extra_pins = thp_nr_pages(page);
+	extra_pins += is_device_private_page(page);
 	if (pextra_pins)
 		*pextra_pins = extra_pins;
 	return total_mapcount(page) == page_count(page) - extra_pins - 1;
@@ -2626,7 +2656,8 @@ bool can_split_huge_page(struct page *page, int *pextra_pins)
  * Returns -EBUSY if the page is pinned or if anon_vma disappeared from under
  * us.
  */
-int split_huge_page_to_list(struct page *page, struct list_head *list)
+static int __split_huge_page_to_list(struct page *page, struct list_head *list,
+				     bool remap)
 {
 	struct page *head = compound_head(page);
 	struct pglist_data *pgdata = NODE_DATA(page_to_nid(head));
@@ -2653,14 +2684,16 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 		 * is taken to serialise against parallel split or collapse
 		 * operations.
 		 */
-		anon_vma = page_get_anon_vma(head);
-		if (!anon_vma) {
-			ret = -EBUSY;
-			goto out;
+		if (remap) {
+			anon_vma = page_get_anon_vma(head);
+			if (!anon_vma) {
+				ret = -EBUSY;
+				goto out;
+			}
+			anon_vma_lock_write(anon_vma);
 		}
 		end = -1;
 		mapping = NULL;
-		anon_vma_lock_write(anon_vma);
 	} else {
 		mapping = head->mapping;
 
@@ -2686,13 +2719,19 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	/*
 	 * Racy check if we can split the page, before unmap_page() will
 	 * split PMDs
+	 * If we are splitting a migrating THP, there is no check needed
+	 * because the page is already unmapped and isolated from the LRU.
 	 */
-	if (!can_split_huge_page(head, &extra_pins)) {
+	if (!remap)
+		extra_pins = thp_nr_pages(page) - 1 +
+			is_device_private_page(head);
+	else if (!can_split_huge_page(head, &extra_pins)) {
 		ret = -EBUSY;
 		goto out_unlock;
 	}
 
-	unmap_page(head);
+	if (remap)
+		unmap_page(head);
 	VM_BUG_ON_PAGE(compound_mapcount(head), head);
 
 	/* prevent PageLRU to go away from under us, and freeze lru stats */
@@ -2717,7 +2756,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 	if (!mapcount && page_ref_freeze(head, 1 + extra_pins)) {
 		if (!list_empty(page_deferred_list(head))) {
 			ds_queue->split_queue_len--;
-			list_del(page_deferred_list(head));
+			list_del_init(page_deferred_list(head));
 		}
 		spin_unlock(&ds_queue->split_queue_lock);
 		if (mapping) {
@@ -2727,7 +2766,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 				__dec_lruvec_page_state(head, NR_FILE_THPS);
 		}
 
-		__split_huge_page(page, list, end, flags);
+		__split_huge_page(page, list, end, flags, remap);
 		ret = 0;
 	} else {
 		if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {
@@ -2742,7 +2781,8 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 fail:		if (mapping)
 			xa_unlock(&mapping->i_pages);
 		spin_unlock_irqrestore(&pgdata->lru_lock, flags);
-		remap_page(head, thp_nr_pages(head));
+		if (remap)
+			remap_page(head, thp_nr_pages(head));
 		ret = -EBUSY;
 	}
 
@@ -2758,6 +2798,36 @@ fail:		if (mapping)
 	return ret;
 }
 
+int split_huge_page_to_list(struct page *page, struct list_head *list)
+{
+	return __split_huge_page_to_list(page, list, true);
+}
+
+/*
+ * Split a migrating huge page.
+ * The caller should have mmap_lock_read() held, the huge page unmapped and
+ * isolated, and the PMD page table entry set to a migration entry for the
+ * given head page.
+ */
+int split_migrating_huge_page(struct vm_area_struct *vma, pmd_t *pmd,
+			      unsigned long address, struct page *head)
+{
+	spinlock_t *ptl;
+
+	VM_BUG_ON_PAGE(is_huge_zero_page(head), head);
+	VM_BUG_ON_PAGE(!PageLocked(head), head);
+	VM_BUG_ON_PAGE(!PageHead(head), head);
+	VM_BUG_ON_PAGE(PageWriteback(head), head);
+	VM_BUG_ON_PAGE(PageLRU(head), head);
+	VM_BUG_ON_PAGE(compound_mapcount(head), head);
+
+	ptl = pmd_lock(vma->vm_mm, pmd);
+	__split_huge_pmd_locked(vma, pmd, address, false);
+	spin_unlock(ptl);
+
+	return __split_huge_page_to_list(head, NULL, false);
+}
+
 void free_transhuge_page(struct page *page)
 {
 	struct deferred_split *ds_queue = get_deferred_split_queue(page);
@@ -2766,9 +2836,11 @@ void free_transhuge_page(struct page *page)
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
 	if (!list_empty(page_deferred_list(page))) {
 		ds_queue->split_queue_len--;
-		list_del(page_deferred_list(page));
+		list_del_init(page_deferred_list(page));
 	}
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
+	if (is_device_private_page(page))
+		return;
 	free_compound_page(page);
 }
 
@@ -2986,6 +3058,10 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
 		pmde = pmd_mksoft_dirty(pmde);
 	if (is_write_migration_entry(entry))
 		pmde = maybe_pmd_mkwrite(pmde, vma);
+	if (unlikely(is_device_private_page(new))) {
+		entry = make_device_private_entry(new, pmd_write(pmde));
+		pmde = swp_entry_to_pmd(entry);
+	}
 
 	flush_cache_range(vma, mmun_start, mmun_start + HPAGE_PMD_SIZE);
 	if (PageAnon(new))
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3a12df292712..12d3d79c4e32 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5792,12 +5792,22 @@ static enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma,
 	struct page *page = NULL;
 	enum mc_target_type ret = MC_TARGET_NONE;
 
-	if (unlikely(is_swap_pmd(pmd))) {
-		VM_BUG_ON(thp_migration_supported() &&
-				  !is_pmd_migration_entry(pmd));
+	if (!(mc.flags & MOVE_ANON))
 		return ret;
+	if (unlikely(is_swap_pmd(pmd))) {
+		swp_entry_t entry = pmd_to_swp_entry(pmd);
+
+		if (!is_device_private_entry(entry)) {
+			VM_BUG_ON(thp_migration_supported() &&
+					  !is_pmd_migration_entry(pmd));
+			return ret;
+		}
+		page = device_private_entry_to_page(entry);
+		ret = MC_TARGET_DEVICE;
+	} else {
+		page = pmd_page(pmd);
+		ret = MC_TARGET_PAGE;
 	}
-	page = pmd_page(pmd);
 	VM_BUG_ON_PAGE(!page || !PageHead(page), page);
 	if (!(mc.flags & MOVE_ANON))
 		return ret;
@@ -5828,12 +5838,7 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
 
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
-		/*
-		 * Note their can not be MC_TARGET_DEVICE for now as we do not
-		 * support transparent huge page with MEMORY_DEVICE_PRIVATE but
-		 * this might change.
-		 */
-		if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE)
+		if (get_mctgt_type_thp(vma, addr, *pmd, NULL))
 			mc.precharge += HPAGE_PMD_NR;
 		spin_unlock(ptl);
 		return 0;
diff --git a/mm/memory.c b/mm/memory.c
index f8d66f0e8da7..963c168a93dc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4485,9 +4485,15 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 
 		barrier();
 		if (unlikely(is_swap_pmd(orig_pmd))) {
+			swp_entry_t entry = pmd_to_swp_entry(orig_pmd);
+
+			if (is_device_private_entry(entry)) {
+				vmf.page = device_private_entry_to_page(entry);
+				return vmf.page->pgmap->ops->migrate_to_ram(&vmf);
+			}
 			VM_BUG_ON(thp_migration_supported() &&
-					  !is_pmd_migration_entry(orig_pmd));
-			if (is_pmd_migration_entry(orig_pmd))
+					  !is_migration_entry(entry));
+			if (is_migration_entry(entry))
 				pmd_migration_entry_wait(mm, vmf.pmd);
 			return 0;
 		}
diff --git a/mm/memremap.c b/mm/memremap.c
index d72ce30da94e..8b4e6f12e58f 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -92,7 +92,7 @@ static unsigned long pfn_next(unsigned long pfn)
 {
 	if (pfn % 1024 == 0)
 		cond_resched();
-	return pfn + 1;
+	return pfn + thp_nr_pages(pfn_to_page(pfn));
 }
 
 /*
@@ -509,6 +509,8 @@ void free_devmap_managed_page(struct page *page)
 	__ClearPageWaiters(page);
 
 	mem_cgroup_uncharge(page);
+	if (PageHead(page))
+		free_transhuge_page(page);
 
 	/*
 	 * When a device_private page is freed, the page->mapping field
diff --git a/mm/migrate.c b/mm/migrate.c
index 665516319b66..7b69a5f91d0a 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -51,6 +51,7 @@
 #include <linux/oom.h>
 
 #include <asm/tlbflush.h>
+#include <asm/pgalloc.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/migrate.h>
@@ -2275,19 +2276,28 @@ static int migrate_vma_collect_hole(unsigned long start,
 {
 	struct migrate_vma *migrate = walk->private;
 	unsigned long addr;
+	unsigned long mpfn;
 
 	/* Only allow populating anonymous memory. */
-	if (!vma_is_anonymous(walk->vma)) {
-		for (addr = start; addr < end; addr += PAGE_SIZE) {
-			migrate->src[migrate->npages] = 0;
-			migrate->dst[migrate->npages] = 0;
-			migrate->npages++;
-		}
-		return 0;
+	if (!vma_is_anonymous(walk->vma) ||
+	    !((migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)))
+		return migrate_vma_collect_skip(start, end, walk);
+
+	if (thp_migration_supported() &&
+	    (migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
+	    (start & ~PMD_MASK) == 0 && (end & ~PMD_MASK) == 0) {
+		migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE |
+						MIGRATE_PFN_COMPOUND;
+		migrate->dst[migrate->npages] = 0;
+		migrate->npages++;
+		migrate->cpages++;
+		return migrate_vma_collect_skip(start + PAGE_SIZE, end, walk);
 	}
 
+	mpfn = (migrate->vma->vm_flags & VM_WRITE) ?
+		(MIGRATE_PFN_MIGRATE | MIGRATE_PFN_WRITE) : MIGRATE_PFN_MIGRATE;
 	for (addr = start; addr < end; addr += PAGE_SIZE) {
-		migrate->src[migrate->npages] = MIGRATE_PFN_MIGRATE;
+		migrate->src[migrate->npages] = mpfn;
 		migrate->dst[migrate->npages] = 0;
 		migrate->npages++;
 		migrate->cpages++;
@@ -2296,59 +2306,133 @@ static int migrate_vma_collect_hole(unsigned long start,
 	return 0;
 }
 
-static int migrate_vma_collect_pmd(pmd_t *pmdp,
-				   unsigned long start,
-				   unsigned long end,
-				   struct mm_walk *walk)
+static int migrate_vma_handle_pmd(pmd_t *pmdp, unsigned long start,
+				  unsigned long end, struct mm_walk *walk)
 {
 	struct migrate_vma *migrate = walk->private;
 	struct vm_area_struct *vma = walk->vma;
 	struct mm_struct *mm = vma->vm_mm;
-	unsigned long addr = start, unmapped = 0;
 	spinlock_t *ptl;
-	pte_t *ptep;
+	struct page *page;
+	unsigned long write = 0;
+	int ret;
 
-again:
-	if (pmd_none(*pmdp))
+	ptl = pmd_lock(mm, pmdp);
+	if (pmd_none(*pmdp)) {
+		spin_unlock(ptl);
 		return migrate_vma_collect_hole(start, end, -1, walk);
-
+	}
 	if (pmd_trans_huge(*pmdp)) {
-		struct page *page;
-
-		ptl = pmd_lock(mm, pmdp);
-		if (unlikely(!pmd_trans_huge(*pmdp))) {
+		if (!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)) {
 			spin_unlock(ptl);
-			goto again;
+			return migrate_vma_collect_skip(start, end, walk);
 		}
-
 		page = pmd_page(*pmdp);
 		if (is_huge_zero_page(page)) {
 			spin_unlock(ptl);
-			split_huge_pmd(vma, pmdp, addr);
-			if (pmd_trans_unstable(pmdp))
-				return migrate_vma_collect_skip(start, end,
-								walk);
-		} else {
-			int ret;
+			return migrate_vma_collect_hole(start, end, -1, walk);
+		}
+		if (pmd_write(*pmdp))
+			write = MIGRATE_PFN_WRITE;
+	} else if (!pmd_present(*pmdp)) {
+		swp_entry_t entry = pmd_to_swp_entry(*pmdp);
+
+		if (is_migration_entry(entry)) {
+			bool wait;
 
-			get_page(page);
+			page = migration_entry_to_page(entry);
+			wait = get_page_unless_zero(page);
 			spin_unlock(ptl);
-			if (unlikely(!trylock_page(page)))
-				return migrate_vma_collect_skip(start, end,
-								walk);
-			ret = split_huge_page(page);
-			unlock_page(page);
-			put_page(page);
-			if (ret)
-				return migrate_vma_collect_skip(start, end,
-								walk);
-			if (pmd_none(*pmdp))
-				return migrate_vma_collect_hole(start, end, -1,
-								walk);
+			if (wait)
+				put_and_wait_on_page_locked(page);
+			return -EAGAIN;
+		}
+		if (!is_device_private_entry(entry)) {
+			spin_unlock(ptl);
+			return migrate_vma_collect_skip(start, end, walk);
+		}
+		page = device_private_entry_to_page(entry);
+		if (!(migrate->flags & MIGRATE_VMA_SELECT_DEVICE_PRIVATE) ||
+		    page->pgmap->owner != migrate->pgmap_owner) {
+			spin_unlock(ptl);
+			return migrate_vma_collect_skip(start, end, walk);
 		}
+		if (is_write_device_private_entry(entry))
+			write = MIGRATE_PFN_WRITE;
+	} else {
+		spin_unlock(ptl);
+		return -EAGAIN;
+	}
+
+	get_page(page);
+	if (unlikely(!trylock_page(page))) {
+		spin_unlock(ptl);
+		put_page(page);
+		return migrate_vma_collect_skip(start, end, walk);
+	}
+	if (thp_migration_supported() &&
+	    (migrate->flags & MIGRATE_VMA_SELECT_COMPOUND) &&
+	    (start & ~PMD_MASK) == 0 && (start + PMD_SIZE) == end) {
+		struct page_vma_mapped_walk vmw = {
+			.vma = vma,
+			.address = start,
+			.pmd = pmdp,
+			.ptl = ptl,
+		};
+
+		migrate->src[migrate->npages] = write |
+			migrate_pfn(page_to_pfn(page)) |
+			MIGRATE_PFN_MIGRATE | MIGRATE_PFN_LOCKED |
+			MIGRATE_PFN_COMPOUND;
+		migrate->dst[migrate->npages] = 0;
+		migrate->npages++;
+		migrate->cpages++;
+		migrate_vma_collect_skip(start + PAGE_SIZE, end, walk);
+
+		/* Note this also removes the page from the rmap. */
+		set_pmd_migration_entry(&vmw, page);
+		spin_unlock(ptl);
+
+		return 0;
+	}
+	spin_unlock(ptl);
+
+	ret = split_huge_page(page);
+	unlock_page(page);
+	put_page(page);
+
+	if (ret)
+		return migrate_vma_collect_skip(start, end, walk);
+	if (pmd_none(*pmdp))
+		return migrate_vma_collect_hole(start, end, -1, walk);
+
+	/* This just causes migrate_vma_collect_pmd() to handle PTEs. */
+	return -ENOENT;
+}
+
+static int migrate_vma_collect_pmd(pmd_t *pmdp,
+				   unsigned long start,
+				   unsigned long end,
+				   struct mm_walk *walk)
+{
+	struct migrate_vma *migrate = walk->private;
+	struct vm_area_struct *vma = walk->vma;
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long addr = start, unmapped = 0;
+	spinlock_t *ptl;
+	pte_t *ptep;
+
+again:
+	if (pmd_trans_huge(*pmdp) || !pmd_present(*pmdp)) {
+		int ret = migrate_vma_handle_pmd(pmdp, start, end, walk);
+
+		if (!ret)
+			return 0;
+		if (ret == -EAGAIN)
+			goto again;
 	}
 
-	if (unlikely(pmd_bad(*pmdp)))
+	if (unlikely(pmd_bad(*pmdp) || pmd_devmap(*pmdp)))
 		return migrate_vma_collect_skip(start, end, walk);
 
 	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
@@ -2404,8 +2488,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
 		}
 
-		/* FIXME support THP */
-		if (!page || !page->mapping || PageTransCompound(page)) {
+		if (!page || !page->mapping) {
 			mpfn = 0;
 			goto next;
 		}
@@ -2527,14 +2610,6 @@ static bool migrate_vma_check_page(struct page *page)
 	 */
 	int extra = 1;
 
-	/*
-	 * FIXME support THP (transparent huge page), it is bit more complex to
-	 * check them than regular pages, because they can be mapped with a pmd
-	 * or with a pte (split pte mapping).
-	 */
-	if (PageCompound(page))
-		return false;
-
 	/* Page from ZONE_DEVICE have one extra reference */
 	if (is_zone_device_page(page)) {
 		/*
@@ -2833,13 +2908,191 @@ int migrate_vma_setup(struct migrate_vma *args)
 }
 EXPORT_SYMBOL(migrate_vma_setup);
 
+static pmd_t *find_pmd(struct mm_struct *mm, unsigned long addr)
+{
+	pgd_t *pgdp;
+	p4d_t *p4dp;
+	pud_t *pudp;
+
+	pgdp = pgd_offset(mm, addr);
+	p4dp = p4d_alloc(mm, pgdp, addr);
+	if (!p4dp)
+		return NULL;
+	pudp = pud_alloc(mm, p4dp, addr);
+	if (!pudp)
+		return NULL;
+	return pmd_alloc(mm, pudp, addr);
+}
+
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+/*
+ * This code closely follows:
+ * do_huge_pmd_anonymous_page()
+ *   __do_huge_pmd_anonymous_page()
+ * except that the page being inserted is likely to be a device private page
+ * instead of an allocated or zero page.
+ */
+static int insert_huge_pmd_anonymous_page(struct vm_area_struct *vma,
+					  unsigned long haddr,
+					  struct page *page,
+					  unsigned long *src,
+					  pmd_t *pmdp)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned int i;
+	spinlock_t *ptl;
+	bool flush = false;
+	pgtable_t pgtable;
+	gfp_t gfp;
+	pmd_t entry;
+
+	if (WARN_ON_ONCE(compound_order(page) != HPAGE_PMD_ORDER))
+		goto abort;
+
+	if (unlikely(anon_vma_prepare(vma)))
+		goto abort;
+
+	prep_transhuge_page(page);
+
+	gfp = GFP_TRANSHUGE_LIGHT;
+	if (mem_cgroup_charge(page, mm, gfp))
+		goto abort;
+
+	pgtable = pte_alloc_one(mm);
+	if (unlikely(!pgtable))
+		goto abort;
+
+	__SetPageUptodate(page);
+
+	if (is_zone_device_page(page)) {
+		if (!is_device_private_page(page))
+			goto pgtable_abort;
+		entry = swp_entry_to_pmd(make_device_private_entry(page,
+						vma->vm_flags & VM_WRITE));
+	} else {
+		entry = mk_huge_pmd(page, vma->vm_page_prot);
+		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
+	}
+
+	ptl = pmd_lock(mm, pmdp);
+
+	if (check_stable_address_space(mm))
+		goto unlock_abort;
+
+	/*
+	 * Check for userfaultfd but do not deliver the fault. Instead,
+	 * just back off.
+	 */
+	if (userfaultfd_missing(vma))
+		goto unlock_abort;
+
+	if (pmd_present(*pmdp)) {
+		if (!is_huge_zero_pmd(*pmdp))
+			goto unlock_abort;
+		flush = true;
+	} else if (!pmd_none(*pmdp))
+		goto unlock_abort;
+
+	get_page(page);
+	page_add_new_anon_rmap(page, vma, haddr, true);
+	if (!is_zone_device_page(page))
+		lru_cache_add_inactive_or_unevictable(page, vma);
+	if (flush) {
+		pte_free(mm, pgtable);
+		flush_cache_range(vma, haddr, haddr + HPAGE_PMD_SIZE);
+		pmdp_invalidate(vma, haddr, pmdp);
+	} else {
+		pgtable_trans_huge_deposit(mm, pmdp, pgtable);
+		mm_inc_nr_ptes(mm);
+	}
+	set_pmd_at(mm, haddr, pmdp, entry);
+	update_mmu_cache_pmd(vma, haddr, pmdp);
+	add_mm_counter(mm, MM_ANONPAGES, HPAGE_PMD_NR);
+	spin_unlock(ptl);
+	count_vm_event(THP_FAULT_ALLOC);
+	count_memcg_event_mm(mm, THP_FAULT_ALLOC);
+
+	return 0;
+
+unlock_abort:
+	spin_unlock(ptl);
+pgtable_abort:
+	pte_free(mm, pgtable);
+abort:
+	for (i = 0; i < HPAGE_PMD_NR; i++)
+		src[i] &= ~MIGRATE_PFN_MIGRATE;
+	return -EINVAL;
+}
+
+static void migrate_vma_split(struct migrate_vma *migrate, unsigned long i,
+			      unsigned long addr)
+{
+	const unsigned long npages = i + HPAGE_PMD_NR;
+	unsigned long mpfn;
+	unsigned long j;
+	bool migrating = false;
+	struct page *page;
+
+	migrate->src[i] &= ~MIGRATE_PFN_COMPOUND;
+
+	/* If no part of the THP is migrating, we can skip splitting. */
+	for (j = i; j < npages; j++) {
+		if (migrate->dst[j] & MIGRATE_PFN_VALID) {
+			migrating = true;
+			break;
+		}
+	}
+	if (!migrating)
+		return;
+
+	mpfn = migrate->src[i];
+	page = migrate_pfn_to_page(mpfn);
+	if (page) {
+		pmd_t *pmdp;
+		int ret;
+
+		pmdp = find_pmd(migrate->vma->vm_mm, addr);
+		if (!pmdp) {
+			migrate->src[i] = mpfn & ~MIGRATE_PFN_MIGRATE;
+			return;
+		}
+		ret = split_migrating_huge_page(migrate->vma, pmdp, addr, page);
+		if (ret) {
+			migrate->src[i] = mpfn & ~MIGRATE_PFN_MIGRATE;
+			return;
+		}
+		while (++i < npages) {
+			mpfn += 1UL << MIGRATE_PFN_SHIFT;
+			migrate->src[i] = mpfn;
+		}
+	} else {
+		while (++i < npages)
+			migrate->src[i] = mpfn;
+	}
+}
+#else
+static int insert_huge_pmd_anonymous_page(struct vm_area_struct *vma,
+					  unsigned long haddr,
+					  struct page *page,
+					  unsigned long *src,
+					  pmd_t *pmdp)
+{
+	return 0;
+}
+
+static void migrate_vma_split(struct migrate_vma *migrate, unsigned long i,
+			      unsigned long addr)
+{
+}
+#endif
+
 /*
  * This code closely matches the code in:
  *   __handle_mm_fault()
  *     handle_pte_fault()
  *       do_anonymous_page()
- * to map in an anonymous zero page but the struct page will be a ZONE_DEVICE
- * private page.
+ * to map in an anonymous zero page except the struct page is already allocated
+ * and will likely be a ZONE_DEVICE private page.
  */
 static void migrate_vma_insert_page(struct migrate_vma *migrate,
 				    unsigned long addr,
@@ -2852,9 +3105,6 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 	bool flush = false;
 	spinlock_t *ptl;
 	pte_t entry;
-	pgd_t *pgdp;
-	p4d_t *p4dp;
-	pud_t *pudp;
 	pmd_t *pmdp;
 	pte_t *ptep;
 
@@ -2862,19 +3112,25 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 	if (!vma_is_anonymous(vma))
 		goto abort;
 
-	pgdp = pgd_offset(mm, addr);
-	p4dp = p4d_alloc(mm, pgdp, addr);
-	if (!p4dp)
-		goto abort;
-	pudp = pud_alloc(mm, p4dp, addr);
-	if (!pudp)
-		goto abort;
-	pmdp = pmd_alloc(mm, pudp, addr);
+	pmdp = find_pmd(mm, addr);
 	if (!pmdp)
 		goto abort;
 
-	if (pmd_trans_huge(*pmdp) || pmd_devmap(*pmdp))
-		goto abort;
+	if (thp_migration_supported() && *dst & MIGRATE_PFN_COMPOUND) {
+		int ret = insert_huge_pmd_anonymous_page(vma, addr, page, src,
+							 pmdp);
+		if (ret)
+			goto abort;
+		return;
+	}
+	if (!pmd_none(*pmdp)) {
+		if (pmd_trans_huge(*pmdp)) {
+			if (!is_huge_zero_pmd(*pmdp))
+				goto abort;
+			__split_huge_pmd(vma, pmdp, addr, false, NULL);
+		} else if (pmd_leaf(*pmdp))
+			goto abort;
+	}
 
 	/*
 	 * Use pte_alloc() instead of pte_alloc_map().  We can't run
@@ -2909,9 +3165,11 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 		if (is_device_private_page(page)) {
 			swp_entry_t swp_entry;
 
-			swp_entry = make_device_private_entry(page, vma->vm_flags & VM_WRITE);
+			swp_entry = make_device_private_entry(page,
+						vma->vm_flags & VM_WRITE);
 			entry = swp_entry_to_pte(swp_entry);
-		}
+		} else
+			goto abort;
 	} else {
 		entry = mk_pte(page, vma->vm_page_prot);
 		if (vma->vm_flags & VM_WRITE)
@@ -2940,10 +3198,10 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 		goto unlock_abort;
 
 	inc_mm_counter(mm, MM_ANONPAGES);
+	get_page(page);
 	page_add_new_anon_rmap(page, vma, addr, false);
 	if (!is_zone_device_page(page))
 		lru_cache_add_inactive_or_unevictable(page, vma);
-	get_page(page);
 
 	if (flush) {
 		flush_cache_page(vma, addr, pte_pfn(*ptep));
@@ -2957,7 +3215,6 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
 	}
 
 	pte_unmap_unlock(ptep, ptl);
-	*src = MIGRATE_PFN_MIGRATE;
 	return;
 
 unlock_abort:
@@ -2988,11 +3245,23 @@ void migrate_vma_pages(struct migrate_vma *migrate)
 		struct address_space *mapping;
 		int r;
 
+		/*
+		 * If the caller didn't allocate a THP, split the PMD and
+		 * fix up the src array.
+		 */
+		if (thp_migration_supported() &&
+		    (migrate->src[i] & MIGRATE_PFN_MIGRATE) &&
+		    (migrate->src[i] & MIGRATE_PFN_COMPOUND) &&
+		    !(migrate->dst[i] & MIGRATE_PFN_COMPOUND))
+			migrate_vma_split(migrate, i, addr);
+
+		newpage = migrate_pfn_to_page(migrate->dst[i]);
 		if (!newpage) {
 			migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
 			continue;
 		}
 
+		page = migrate_pfn_to_page(migrate->src[i]);
 		if (!page) {
 			if (!(migrate->src[i] & MIGRATE_PFN_MIGRATE))
 				continue;
diff --git a/mm/rmap.c b/mm/rmap.c
index 1b84945d655c..13eb0247d8b7 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1497,7 +1497,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		}
 
 		if (IS_ENABLED(CONFIG_MIGRATION) &&
-		    (flags & TTU_MIGRATION) &&
+		    (flags & (TTU_MIGRATION | TTU_SPLIT_FREEZE)) &&
 		    is_zone_device_page(page)) {
 			swp_entry_t entry;
 			pte_t swp_pte;
-- 
2.20.1




* [PATCH v3 4/6] mm/thp: add THP allocation helper
  2020-11-06  0:51 [PATCH v3 0/6] mm/hmm/nouveau: add THP migration to migrate_vma_* Ralph Campbell
                   ` (2 preceding siblings ...)
  2020-11-06  0:51 ` [PATCH v3 3/6] mm: support THP migration to device private memory Ralph Campbell
@ 2020-11-06  0:51 ` Ralph Campbell
  2020-11-06  8:01   ` Christoph Hellwig
  2020-11-06  0:51 ` [PATCH v3 5/6] mm/hmm/test: add self tests for THP migration Ralph Campbell
  2020-11-06  0:51 ` [PATCH v3 6/6] nouveau: support THP migration to private memory Ralph Campbell
  5 siblings, 1 reply; 25+ messages in thread
From: Ralph Campbell @ 2020-11-06  0:51 UTC (permalink / raw)
  To: linux-mm, nouveau, linux-kselftest, linux-kernel
  Cc: Jerome Glisse, John Hubbard, Alistair Popple, Christoph Hellwig,
	Jason Gunthorpe, Bharata B Rao, Zi Yan, Kirill A . Shutemov,
	Yang Shi, Ben Skeggs, Shuah Khan, Andrew Morton, Ralph Campbell

Transparent huge page allocation policy is controlled by several sysfs
variables. Rather than expose these to each device driver that needs to
allocate THPs, provide a helper function.
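
For example, a device-to-system fault path can try the helper first and
fall back to a small page. This is a sketch along the lines of the
lib/test_hmm.c use in the next patch; my_alloc_dst_page() is illustrative:

#include <linux/gfp.h>
#include <linux/mm.h>

/*
 * Allocate the destination page for a device-to-system migration.
 * Try a THP when the source is a compound device private page, then
 * fall back to a PAGE_SIZE page. 'haddr' must be PMD aligned for the
 * THP case.
 */
static struct page *my_alloc_dst_page(struct vm_area_struct *vma,
				      unsigned long haddr, bool want_thp)
{
	struct page *dpage = NULL;

	if (want_thp)
		dpage = alloc_transhugepage(vma, haddr);
	if (!dpage)
		dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, haddr);
	return dpage;
}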

Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
---
 include/linux/gfp.h | 10 ++++++++++
 mm/huge_memory.c    | 14 ++++++++++++++
 2 files changed, 24 insertions(+)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index c603237e006c..242398c4b556 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -564,6 +564,16 @@ static inline struct page *alloc_pages(gfp_t gfp_mask, unsigned int order)
 #define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
 #define alloc_page_vma(gfp_mask, vma, addr)			\
 	alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id(), false)
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+extern struct page *alloc_transhugepage(struct vm_area_struct *vma,
+					unsigned long addr);
+#else
+static inline struct page *alloc_transhugepage(struct vm_area_struct *vma,
+						unsigned long addr)
+{
+	return NULL;
+}
+#endif
 
 extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
 extern unsigned long get_zeroed_page(gfp_t gfp_mask);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a073e66d0ee2..c2c1d3e7c35f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -765,6 +765,20 @@ vm_fault_t do_huge_pmd_anonymous_page(struct vm_fault *vmf)
 	return __do_huge_pmd_anonymous_page(vmf, page, gfp);
 }
 
+struct page *alloc_transhugepage(struct vm_area_struct *vma,
+				 unsigned long haddr)
+{
+	gfp_t gfp;
+	struct page *page;
+
+	gfp = alloc_hugepage_direct_gfpmask(vma);
+	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
+	if (page)
+		prep_transhuge_page(page);
+	return page;
+}
+EXPORT_SYMBOL_GPL(alloc_transhugepage);
+
 static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
 		pmd_t *pmd, pfn_t pfn, pgprot_t prot, bool write,
 		pgtable_t pgtable)
-- 
2.20.1




* [PATCH v3 5/6] mm/hmm/test: add self tests for THP migration
  2020-11-06  0:51 [PATCH v3 0/6] mm/hmm/nouveau: add THP migration to migrate_vma_* Ralph Campbell
                   ` (3 preceding siblings ...)
  2020-11-06  0:51 ` [PATCH v3 4/6] mm/thp: add THP allocation helper Ralph Campbell
@ 2020-11-06  0:51 ` Ralph Campbell
  2020-11-06  0:51 ` [PATCH v3 6/6] nouveau: support THP migration to private memory Ralph Campbell
  5 siblings, 0 replies; 25+ messages in thread
From: Ralph Campbell @ 2020-11-06  0:51 UTC (permalink / raw)
  To: linux-mm, nouveau, linux-kselftest, linux-kernel
  Cc: Jerome Glisse, John Hubbard, Alistair Popple, Christoph Hellwig,
	Jason Gunthorpe, Bharata B Rao, Zi Yan, Kirill A . Shutemov,
	Yang Shi, Ben Skeggs, Shuah Khan, Andrew Morton, Ralph Campbell

Add some basic stand-alone selftests for migrating system memory to device
private memory and back.
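
The flow the new tests exercise looks roughly like the user space sketch
below. It is not a verbatim excerpt from hmm-tests.c; it assumes the
existing /dev/hmm_dmirror0 node and the HMM_DMIRROR_MIGRATE ioctl and
struct hmm_dmirror_cmd from lib/test_hmm_uapi.h:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

#include "test_hmm_uapi.h"	/* struct hmm_dmirror_cmd, HMM_DMIRROR_MIGRATE */

#define THP_SIZE	(2UL << 20)

int main(void)
{
	struct hmm_dmirror_cmd cmd = { 0 };
	unsigned char *buf, *mirror;
	unsigned long i;
	void *mem;
	int fd;

	fd = open("/dev/hmm_dmirror0", O_RDWR);
	if (fd < 0)
		return 1;

	/* Map extra space so a PMD-aligned, THP-backed 2MB region fits. */
	mem = mmap(NULL, 2 * THP_SIZE, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (mem == MAP_FAILED)
		return 1;
	buf = (unsigned char *)(((uintptr_t)mem + THP_SIZE - 1) &
				~(THP_SIZE - 1));
	madvise(buf, THP_SIZE, MADV_HUGEPAGE);
	memset(buf, 0xaa, THP_SIZE);

	/* The driver copies the migrated data back here for verification. */
	mirror = malloc(THP_SIZE);
	if (!mirror)
		return 1;

	/* Migrate the whole 2MB range to the test device's private memory. */
	cmd.addr = (uintptr_t)buf;
	cmd.ptr = (uintptr_t)mirror;
	cmd.npages = THP_SIZE / getpagesize();
	if (ioctl(fd, HMM_DMIRROR_MIGRATE, &cmd) < 0)
		return 1;
	printf("migrated %llu pages\n", (unsigned long long)cmd.cpages);

	/* Touching the buffer faults the data back to system memory. */
	for (i = 0; i < THP_SIZE; i++)
		if (buf[i] != 0xaa)
			return 1;
	return 0;
}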

Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
---
 lib/test_hmm.c                         | 437 +++++++++++++++++++++----
 lib/test_hmm_uapi.h                    |   3 +
 tools/testing/selftests/vm/hmm-tests.c | 404 +++++++++++++++++++++++
 3 files changed, 775 insertions(+), 69 deletions(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 80a78877bd93..456f1a90bcc3 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -66,6 +66,7 @@ struct dmirror {
 	struct xarray			pt;
 	struct mmu_interval_notifier	notifier;
 	struct mutex			mutex;
+	__u64				flags;
 };
 
 /*
@@ -91,6 +92,7 @@ struct dmirror_device {
 	unsigned long		calloc;
 	unsigned long		cfree;
 	struct page		*free_pages;
+	struct page		*free_huge_pages;
 	spinlock_t		lock;		/* protects the above */
 };
 
@@ -450,6 +452,7 @@ static int dmirror_write(struct dmirror *dmirror, struct hmm_dmirror_cmd *cmd)
 }
 
 static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,
+				   bool is_huge,
 				   struct page **ppage)
 {
 	struct dmirror_chunk *devmem;
@@ -503,28 +506,51 @@ static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,
 
 	mutex_unlock(&mdevice->devmem_lock);
 
-	pr_info("added new %u MB chunk (total %u chunks, %u MB) PFNs [0x%lx 0x%lx)\n",
+	pr_info("dev %u added %u MB (total %u chunks, %u MB) PFNs [0x%lx 0x%lx)\n",
+		MINOR(mdevice->cdevice.dev),
 		DEVMEM_CHUNK_SIZE / (1024 * 1024),
 		mdevice->devmem_count,
 		mdevice->devmem_count * (DEVMEM_CHUNK_SIZE / (1024 * 1024)),
 		pfn_first, pfn_last);
 
 	spin_lock(&mdevice->lock);
-	for (pfn = pfn_first; pfn < pfn_last; pfn++) {
+	for (pfn = pfn_first; pfn < pfn_last; ) {
 		struct page *page = pfn_to_page(pfn);
 
+		if (is_huge && (pfn & (HPAGE_PMD_NR - 1)) == 0 &&
+		    pfn + HPAGE_PMD_NR <= pfn_last) {
+			prep_transhuge_device_private_page(page);
+			page->zone_device_data = mdevice->free_huge_pages;
+			mdevice->free_huge_pages = page;
+			pfn += HPAGE_PMD_NR;
+			continue;
+		}
 		page->zone_device_data = mdevice->free_pages;
 		mdevice->free_pages = page;
+		pfn++;
 	}
 	if (ppage) {
-		*ppage = mdevice->free_pages;
-		mdevice->free_pages = (*ppage)->zone_device_data;
-		mdevice->calloc++;
+		if (is_huge) {
+			if (!mdevice->free_huge_pages)
+				goto err_unlock;
+			*ppage = mdevice->free_huge_pages;
+			mdevice->free_huge_pages = (*ppage)->zone_device_data;
+			mdevice->calloc += thp_nr_pages(*ppage);
+		} else if (mdevice->free_pages) {
+			*ppage = mdevice->free_pages;
+			mdevice->free_pages = (*ppage)->zone_device_data;
+			mdevice->calloc++;
+		} else
+			goto err_unlock;
 	}
 	spin_unlock(&mdevice->lock);
 
 	return true;
 
+err_unlock:
+	spin_unlock(&mdevice->lock);
+	return false;
+
 err_release:
 	mutex_unlock(&mdevice->devmem_lock);
 	release_mem_region(devmem->pagemap.range.start, range_len(&devmem->pagemap.range));
@@ -534,7 +560,8 @@ static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,
 	return false;
 }
 
-static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
+static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice,
+					      bool is_huge)
 {
 	struct page *dpage = NULL;
 	struct page *rpage;
@@ -549,17 +576,40 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
 
 	spin_lock(&mdevice->lock);
 
-	if (mdevice->free_pages) {
+	if (is_huge && mdevice->free_huge_pages) {
+		dpage = mdevice->free_huge_pages;
+		mdevice->free_huge_pages = dpage->zone_device_data;
+		mdevice->calloc += thp_nr_pages(dpage);
+		spin_unlock(&mdevice->lock);
+	} else if (!is_huge && mdevice->free_pages) {
 		dpage = mdevice->free_pages;
 		mdevice->free_pages = dpage->zone_device_data;
 		mdevice->calloc++;
 		spin_unlock(&mdevice->lock);
 	} else {
 		spin_unlock(&mdevice->lock);
-		if (!dmirror_allocate_chunk(mdevice, &dpage))
+		if (!dmirror_allocate_chunk(mdevice, is_huge, &dpage))
 			goto error;
 	}
 
+	if (is_huge) {
+		unsigned int nr_pages = thp_nr_pages(dpage);
+		unsigned int i;
+		struct page **tpage;
+
+		tpage = kmap(rpage);
+		for (i = 0; i < nr_pages; i++, tpage++) {
+			*tpage = alloc_page(GFP_HIGHUSER);
+			if (!*tpage) {
+				while (i--)
+					__free_page(*--tpage);
+				kunmap(rpage);
+				goto error;
+			}
+		}
+		kunmap(rpage);
+	}
+
 	dpage->zone_device_data = rpage;
 	get_page(dpage);
 	lock_page(dpage);
@@ -570,22 +620,26 @@ static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
 	return NULL;
 }
 
-static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
-					   struct dmirror *dmirror)
+static int dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
+					  struct dmirror *dmirror)
 {
 	struct dmirror_device *mdevice = dmirror->mdevice;
 	const unsigned long *src = args->src;
 	unsigned long *dst = args->dst;
-	unsigned long addr;
+	unsigned long end_pfn = args->end >> PAGE_SHIFT;
+	unsigned long pfn;
 
-	for (addr = args->start; addr < args->end; addr += PAGE_SIZE,
-						   src++, dst++) {
+	for (pfn = args->start >> PAGE_SHIFT; pfn < end_pfn; ) {
 		struct page *spage;
 		struct page *dpage;
 		struct page *rpage;
+		bool is_huge;
+		unsigned long write;
+		struct page **tpage;
+		unsigned long endp;
 
 		if (!(*src & MIGRATE_PFN_MIGRATE))
-			continue;
+			goto next;
 
 		/*
 		 * Note that spage might be NULL which is OK since it is an
@@ -593,15 +647,39 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
 		 */
 		spage = migrate_pfn_to_page(*src);
 
-		dpage = dmirror_devmem_alloc_page(mdevice);
-		if (!dpage)
+		/* This flag is only set if a whole huge page is migrated. */
+		is_huge = *src & MIGRATE_PFN_COMPOUND;
+		write = (*src & MIGRATE_PFN_WRITE) ? MIGRATE_PFN_WRITE : 0;
+
+		if (dmirror->flags & HMM_DMIRROR_FLAG_FAIL_ALLOC) {
+			dmirror->flags &= ~HMM_DMIRROR_FLAG_FAIL_ALLOC;
+			dpage = NULL;
+		} else
+			dpage = dmirror_devmem_alloc_page(mdevice, is_huge);
+		if (!dpage) {
+			if (!is_huge)
+				return -ENOMEM;
+			/* Try falling back to PAGE_SIZE pages. */
+			endp = pfn + HPAGE_PMD_NR;
+			while (pfn < endp) {
+				dpage = dmirror_devmem_alloc_page(mdevice,
+								  false);
+				if (!dpage)
+					return -ENOMEM;
+				rpage = dpage->zone_device_data;
+				rpage->zone_device_data = dmirror;
+				*dst = migrate_pfn(page_to_pfn(dpage)) |
+					MIGRATE_PFN_LOCKED | write;
+				if (spage)
+					copy_highpage(rpage, spage++);
+				else
+					clear_highpage(rpage);
+				pfn++;
+				src++;
+				dst++;
+			}
 			continue;
-
-		rpage = dpage->zone_device_data;
-		if (spage)
-			copy_highpage(rpage, spage);
-		else
-			clear_highpage(rpage);
+		}
 
 		/*
 		 * Normally, a device would use the page->zone_device_data to
@@ -609,14 +687,40 @@ static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
 		 * the simulated device memory and that page holds the pointer
 		 * to the mirror.
 		 */
+		rpage = dpage->zone_device_data;
 		rpage->zone_device_data = dmirror;
 
-		*dst = migrate_pfn(page_to_pfn(dpage)) |
-			    MIGRATE_PFN_LOCKED;
-		if ((*src & MIGRATE_PFN_WRITE) ||
-		    (!spage && args->vma->vm_flags & VM_WRITE))
-			*dst |= MIGRATE_PFN_WRITE;
+		*dst = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED |
+			write;
+
+		if (is_huge) {
+			endp = pfn + thp_nr_pages(dpage);
+			*dst |= MIGRATE_PFN_COMPOUND;
+			tpage = kmap(rpage);
+			while (pfn < endp) {
+				if (spage)
+					copy_highpage(*tpage, spage++);
+				else
+					clear_highpage(*tpage);
+				tpage++;
+				pfn++;
+				src++;
+				dst++;
+			}
+			kunmap(rpage);
+			continue;
+		}
+
+		if (spage)
+			copy_highpage(rpage, spage);
+		else
+			clear_highpage(rpage);
+next:
+		pfn++;
+		src++;
+		dst++;
 	}
+	return 0;
 }
 
 static int dmirror_migrate_finalize_and_map(struct migrate_vma *args,
@@ -627,38 +731,75 @@ static int dmirror_migrate_finalize_and_map(struct migrate_vma *args,
 	const unsigned long *src = args->src;
 	const unsigned long *dst = args->dst;
 	unsigned long pfn;
+	int ret = 0;
 
 	/* Map the migrated pages into the device's page tables. */
 	mutex_lock(&dmirror->mutex);
 
-	for (pfn = start >> PAGE_SHIFT; pfn < (end >> PAGE_SHIFT); pfn++,
-								src++, dst++) {
+	for (pfn = start >> PAGE_SHIFT; pfn < (end >> PAGE_SHIFT); ) {
+		unsigned long mpfn;
 		struct page *dpage;
+		struct page *rpage;
 		void *entry;
 
 		if (!(*src & MIGRATE_PFN_MIGRATE))
-			continue;
+			goto next;
 
-		dpage = migrate_pfn_to_page(*dst);
+		mpfn = *dst;
+		dpage = migrate_pfn_to_page(mpfn);
 		if (!dpage)
-			continue;
+			goto next;
 
 		/*
 		 * Store the page that holds the data so the page table
 		 * doesn't have to deal with ZONE_DEVICE private pages.
 		 */
-		entry = dpage->zone_device_data;
-		if (*dst & MIGRATE_PFN_WRITE)
+		rpage = dpage->zone_device_data;
+		if (mpfn & MIGRATE_PFN_COMPOUND) {
+			struct page **tpage;
+			unsigned long end_pfn = pfn + thp_nr_pages(dpage);
+
+			ret = 0;
+			tpage = kmap(rpage);
+			while (pfn < end_pfn) {
+				entry = *tpage;
+				if (mpfn & MIGRATE_PFN_WRITE)
+					entry = xa_tag_pointer(entry,
+							DPT_XA_TAG_WRITE);
+				entry = xa_store(&dmirror->pt, pfn, entry,
+						 GFP_KERNEL);
+				if (xa_is_err(entry)) {
+					ret = xa_err(entry);
+					break;
+				}
+				tpage++;
+				pfn++;
+				src++;
+				dst++;
+			}
+			kunmap(rpage);
+			if (ret)
+				goto err;
+			continue;
+		}
+
+		entry = rpage;
+		if (mpfn & MIGRATE_PFN_WRITE)
 			entry = xa_tag_pointer(entry, DPT_XA_TAG_WRITE);
 		entry = xa_store(&dmirror->pt, pfn, entry, GFP_ATOMIC);
 		if (xa_is_err(entry)) {
-			mutex_unlock(&dmirror->mutex);
-			return xa_err(entry);
+			ret = xa_err(entry);
+			goto err;
 		}
+next:
+		pfn++;
+		src++;
+		dst++;
 	}
 
+err:
 	mutex_unlock(&dmirror->mutex);
-	return 0;
+	return ret;
 }
 
 static int dmirror_migrate(struct dmirror *dmirror,
@@ -668,8 +809,8 @@ static int dmirror_migrate(struct dmirror *dmirror,
 	unsigned long size = cmd->npages << PAGE_SHIFT;
 	struct mm_struct *mm = dmirror->notifier.mm;
 	struct vm_area_struct *vma;
-	unsigned long src_pfns[64];
-	unsigned long dst_pfns[64];
+	unsigned long *src_pfns;
+	unsigned long *dst_pfns;
 	struct dmirror_bounce bounce;
 	struct migrate_vma args;
 	unsigned long next;
@@ -684,6 +825,17 @@ static int dmirror_migrate(struct dmirror *dmirror,
 	if (!mmget_not_zero(mm))
 		return -EINVAL;
 
+	src_pfns = kmalloc_array(PTRS_PER_PTE, sizeof(*src_pfns), GFP_KERNEL);
+	if (!src_pfns) {
+		ret = -ENOMEM;
+		goto out_put;
+	}
+	dst_pfns = kmalloc_array(PTRS_PER_PTE, sizeof(*dst_pfns), GFP_KERNEL);
+	if (!dst_pfns) {
+		ret = -ENOMEM;
+		goto out_free_src;
+	}
+
 	mmap_read_lock(mm);
 	for (addr = start; addr < end; addr = next) {
 		vma = find_vma(mm, addr);
@@ -692,7 +844,7 @@ static int dmirror_migrate(struct dmirror *dmirror,
 			ret = -EINVAL;
 			goto out;
 		}
-		next = min(end, addr + (ARRAY_SIZE(src_pfns) << PAGE_SHIFT));
+		next = pmd_addr_end(addr, end);
 		if (next > vma->vm_end)
 			next = vma->vm_end;
 
@@ -702,17 +854,24 @@ static int dmirror_migrate(struct dmirror *dmirror,
 		args.start = addr;
 		args.end = next;
 		args.pgmap_owner = dmirror->mdevice;
-		args.flags = MIGRATE_VMA_SELECT_SYSTEM;
+		args.flags = MIGRATE_VMA_SELECT_SYSTEM |
+			     MIGRATE_VMA_SELECT_COMPOUND;
 		ret = migrate_vma_setup(&args);
 		if (ret)
 			goto out;
 
-		dmirror_migrate_alloc_and_copy(&args, dmirror);
-		migrate_vma_pages(&args);
-		dmirror_migrate_finalize_and_map(&args, dmirror);
+		ret = dmirror_migrate_alloc_and_copy(&args, dmirror);
+		if (!ret) {
+			migrate_vma_pages(&args);
+			dmirror_migrate_finalize_and_map(&args, dmirror);
+		}
 		migrate_vma_finalize(&args);
+		if (ret)
+			goto out;
 	}
 	mmap_read_unlock(mm);
+	kfree(dst_pfns);
+	kfree(src_pfns);
 	mmput(mm);
 
 	/* Return the migrated data for verification. */
@@ -733,6 +892,10 @@ static int dmirror_migrate(struct dmirror *dmirror,
 
 out:
 	mmap_read_unlock(mm);
+	kfree(dst_pfns);
+out_free_src:
+	kfree(src_pfns);
+out_put:
 	mmput(mm);
 	return ret;
 }
@@ -953,6 +1116,11 @@ static long dmirror_fops_unlocked_ioctl(struct file *filp,
 		ret = dmirror_snapshot(dmirror, &cmd);
 		break;
 
+	case HMM_DMIRROR_FLAGS:
+		dmirror->flags = cmd.npages;
+		ret = 0;
+		break;
+
 	default:
 		return -EINVAL;
 	}
@@ -976,22 +1144,70 @@ static const struct file_operations dmirror_fops = {
 static void dmirror_devmem_free(struct page *page)
 {
 	struct page *rpage = page->zone_device_data;
+	unsigned int order = thp_order(page);
+	unsigned int nr_pages = 1U << order;
 	struct dmirror_device *mdevice;
 
-	if (rpage)
+	VM_BUG_ON_PAGE(PageTail(page), page);
+
+	if (rpage) {
+		if (order) {
+			unsigned int i;
+			struct page **tpage;
+			void *kaddr;
+
+			kaddr = kmap_atomic(rpage);
+			tpage = kaddr;
+			for (i = 0; i < nr_pages; i++, tpage++)
+				__free_page(*tpage);
+			kunmap_atomic(kaddr);
+		}
 		__free_page(rpage);
+	}
 
 	mdevice = dmirror_page_to_device(page);
 
 	spin_lock(&mdevice->lock);
-	mdevice->cfree++;
-	page->zone_device_data = mdevice->free_pages;
-	mdevice->free_pages = page;
+	if (order) {
+		page->zone_device_data = mdevice->free_huge_pages;
+		mdevice->free_huge_pages = page;
+	} else {
+		page->zone_device_data = mdevice->free_pages;
+		mdevice->free_pages = page;
+	}
+	mdevice->cfree += nr_pages;
 	spin_unlock(&mdevice->lock);
 }
 
+static void dmirror_devmem_split(struct page *head, struct page *page)
+{
+	struct page *rpage = head->zone_device_data;
+	unsigned long i;
+	struct page **tpage;
+	void *kaddr;
+
+	page->pgmap = head->pgmap;
+
+	if (!rpage) {
+		page->zone_device_data = NULL;
+		return;
+	}
+
+	kaddr = kmap_atomic(rpage);
+	tpage = kaddr;
+	i = page - head;
+	page->zone_device_data = tpage[i];
+	if (i == 1) {
+		head->zone_device_data = tpage[0];
+		kunmap_atomic(kaddr);
+		__free_page(rpage);
+	} else
+		kunmap_atomic(kaddr);
+}
+
 static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma *args,
-						      struct dmirror *dmirror)
+						      struct dmirror *dmirror,
+						      unsigned long fault_addr)
 {
 	const unsigned long *src = args->src;
 	unsigned long *dst = args->dst;
@@ -999,25 +1215,71 @@ static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma *args,
 	unsigned long end = args->end;
 	unsigned long addr;
 
-	for (addr = start; addr < end; addr += PAGE_SIZE,
-				       src++, dst++) {
-		struct page *dpage, *spage;
+	for (addr = start; addr < end; ) {
+		struct page *spage, *dpage;
+		unsigned int order = 0;
+		unsigned int nr_pages = 1;
+		struct page **tpage;
+		unsigned int i;
 
 		spage = migrate_pfn_to_page(*src);
 		if (!spage || !(*src & MIGRATE_PFN_MIGRATE))
-			continue;
+			goto next;
+		order = thp_order(spage);
+		nr_pages = 1U << order;
+		/* The source page is the ZONE_DEVICE private page. */
 		spage = spage->zone_device_data;
 
-		dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, args->vma, addr);
-		if (!dpage)
-			continue;
+		if (dmirror->flags & HMM_DMIRROR_FLAG_FAIL_ALLOC) {
+			dmirror->flags &= ~HMM_DMIRROR_FLAG_FAIL_ALLOC;
+			dpage = NULL;
+		} else if (order)
+			dpage = alloc_transhugepage(args->vma, addr);
+		else
+			dpage = alloc_pages_vma(GFP_HIGHUSER_MOVABLE, 0,
+						args->vma, addr,
+						numa_node_id(), false);
+		if (!dpage) {
+			if (!order)
+				return VM_FAULT_OOM;
+			/* Try falling back to PAGE_SIZE pages. */
+			dpage = alloc_pages_vma(GFP_HIGHUSER_MOVABLE, 0,
+						args->vma, addr,
+						numa_node_id(), false);
+			if (!dpage)
+				return VM_FAULT_OOM;
+			lock_page(dpage);
+			xa_erase(&dmirror->pt, fault_addr >> PAGE_SHIFT);
+			i = (fault_addr - start) >> PAGE_SHIFT;
+			dst[i] = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
+			if (*src & MIGRATE_PFN_WRITE)
+				dst[i] |= MIGRATE_PFN_WRITE;
+			tpage = kmap(spage);
+			copy_highpage(dpage, tpage[i]);
+			kunmap(spage);
+			goto next;
+		}
 
 		lock_page(dpage);
 		xa_erase(&dmirror->pt, addr >> PAGE_SHIFT);
-		copy_highpage(dpage, spage);
 		*dst = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
 		if (*src & MIGRATE_PFN_WRITE)
 			*dst |= MIGRATE_PFN_WRITE;
+		if (order) {
+			*dst |= MIGRATE_PFN_COMPOUND;
+			tpage = kmap(spage);
+			for (i = 0; i < nr_pages; i++) {
+				copy_highpage(dpage, *tpage);
+				tpage++;
+				dpage++;
+			}
+			kunmap(spage);
+		} else
+			copy_highpage(dpage, spage);
+next:
+		addr += PAGE_SIZE << order;
+		src += nr_pages;
+		dst += nr_pages;
 	}
 	return 0;
 }
@@ -1027,33 +1289,55 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
 	struct migrate_vma args;
 	unsigned long src_pfns;
 	unsigned long dst_pfns;
+	struct page *page;
 	struct page *rpage;
+	unsigned int order;
 	struct dmirror *dmirror;
 	vm_fault_t ret;
 
+	page = thp_head(vmf->page);
+	order = thp_order(page);
+
 	/*
 	 * Normally, a device would use the page->zone_device_data to point to
 	 * the mirror but here we use it to hold the page for the simulated
 	 * device memory and that page holds the pointer to the mirror.
 	 */
-	rpage = vmf->page->zone_device_data;
+	rpage = page->zone_device_data;
 	dmirror = rpage->zone_device_data;
 
-	/* FIXME demonstrate how we can adjust migrate range */
+	if (order) {
+		args.start = vmf->address & (PAGE_MASK << order);
+		args.end = args.start + (PAGE_SIZE << order);
+		args.src = kcalloc(PTRS_PER_PTE, sizeof(*args.src),
+				   GFP_KERNEL);
+		if (!args.src)
+			return VM_FAULT_OOM;
+		args.dst = kcalloc(PTRS_PER_PTE, sizeof(*args.dst),
+				   GFP_KERNEL);
+		if (!args.dst) {
+			ret = VM_FAULT_OOM;
+			goto error_src;
+		}
+	} else {
+		args.start = vmf->address;
+		args.end = args.start + PAGE_SIZE;
+		args.src = &src_pfns;
+		args.dst = &dst_pfns;
+	}
 	args.vma = vmf->vma;
-	args.start = vmf->address;
-	args.end = args.start + PAGE_SIZE;
-	args.src = &src_pfns;
-	args.dst = &dst_pfns;
 	args.pgmap_owner = dmirror->mdevice;
-	args.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE;
+	args.flags = MIGRATE_VMA_SELECT_DEVICE_PRIVATE |
+		     MIGRATE_VMA_SELECT_COMPOUND;
 
-	if (migrate_vma_setup(&args))
-		return VM_FAULT_SIGBUS;
+	if (migrate_vma_setup(&args)) {
+		ret = VM_FAULT_SIGBUS;
+		goto error_dst;
+	}
 
-	ret = dmirror_devmem_fault_alloc_and_copy(&args, dmirror);
+	ret = dmirror_devmem_fault_alloc_and_copy(&args, dmirror, vmf->address);
 	if (ret)
-		return ret;
+		goto error_fin;
 	migrate_vma_pages(&args);
 	/*
 	 * No device finalize step is needed since
@@ -1061,12 +1345,27 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
 	 * invalidated the device page table.
 	 */
 	migrate_vma_finalize(&args);
+	if (order) {
+		kfree(args.dst);
+		kfree(args.src);
+	}
 	return 0;
+
+error_fin:
+	migrate_vma_finalize(&args);
+error_dst:
+	if (args.dst != &dst_pfns)
+		kfree(args.dst);
+error_src:
+	if (args.src != &src_pfns)
+		kfree(args.src);
+	return ret;
 }
 
 static const struct dev_pagemap_ops dmirror_devmem_ops = {
 	.page_free	= dmirror_devmem_free,
 	.migrate_to_ram	= dmirror_devmem_fault,
+	.page_split	= dmirror_devmem_split,
 };
 
 static int dmirror_device_init(struct dmirror_device *mdevice, int id)
@@ -1085,7 +1384,7 @@ static int dmirror_device_init(struct dmirror_device *mdevice, int id)
 		return ret;
 
 	/* Build a list of free ZONE_DEVICE private struct pages */
-	dmirror_allocate_chunk(mdevice, NULL);
+	dmirror_allocate_chunk(mdevice, false, NULL);
 
 	return 0;
 }
diff --git a/lib/test_hmm_uapi.h b/lib/test_hmm_uapi.h
index 670b4ef2a5b6..39e6ef3b67b9 100644
--- a/lib/test_hmm_uapi.h
+++ b/lib/test_hmm_uapi.h
@@ -33,6 +33,9 @@ struct hmm_dmirror_cmd {
 #define HMM_DMIRROR_WRITE		_IOWR('H', 0x01, struct hmm_dmirror_cmd)
 #define HMM_DMIRROR_MIGRATE		_IOWR('H', 0x02, struct hmm_dmirror_cmd)
 #define HMM_DMIRROR_SNAPSHOT		_IOWR('H', 0x03, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_FLAGS		_IOWR('H', 0x04, struct hmm_dmirror_cmd)
+
+#define HMM_DMIRROR_FLAG_FAIL_ALLOC	(1ULL << 0)
 
 /*
  * Values returned in hmm_dmirror_cmd.ptr for HMM_DMIRROR_SNAPSHOT.
diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c
index 5d1ac691b9f4..069c3cc3c89b 100644
--- a/tools/testing/selftests/vm/hmm-tests.c
+++ b/tools/testing/selftests/vm/hmm-tests.c
@@ -1485,4 +1485,408 @@ TEST_F(hmm2, double_map)
 	hmm_buffer_free(buffer);
 }
 
+/*
+ * Migrate private anonymous huge empty page.
+ */
+TEST_F(hmm, migrate_anon_huge_empty)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	void *old_ptr;
+	void *map;
+	int *ptr;
+	int ret;
+
+	size = TWOMEG;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = 2 * size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+	memset(buffer->mirror, 0xFF, size);
+
+	buffer->ptr = mmap(NULL, 2 * size,
+			   PROT_READ,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	npages = size >> self->page_shift;
+	map = (void *)ALIGN((uintptr_t)buffer->ptr, size);
+	ret = madvise(map, size, MADV_HUGEPAGE);
+	ASSERT_EQ(ret, 0);
+	old_ptr = buffer->ptr;
+	buffer->ptr = map;
+
+	/* Migrate memory to device. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], 0);
+
+	buffer->ptr = old_ptr;
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Migrate private anonymous huge zero page.
+ */
+TEST_F(hmm, migrate_anon_huge_zero)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	void *old_ptr;
+	void *map;
+	int *ptr;
+	int ret;
+	int val;
+
+	size = TWOMEG;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = 2 * size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+	memset(buffer->mirror, 0xFF, size);
+
+	buffer->ptr = mmap(NULL, 2 * size,
+			   PROT_READ,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	npages = size >> self->page_shift;
+	map = (void *)ALIGN((uintptr_t)buffer->ptr, size);
+	ret = madvise(map, size, MADV_HUGEPAGE);
+	ASSERT_EQ(ret, 0);
+	old_ptr = buffer->ptr;
+	buffer->ptr = map;
+
+	/* Initialize a read-only zero huge page. */
+	val = *(int *)buffer->ptr;
+	ASSERT_EQ(val, 0);
+
+	/* Migrate memory to device. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], 0);
+
+	/* Fault pages back to system memory and check them. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i) {
+		ASSERT_EQ(ptr[i], 0);
+		/* If it asserts once, it probably will 500,000 times */
+		if (ptr[i] != 0)
+			break;
+	}
+
+	buffer->ptr = old_ptr;
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Migrate private anonymous huge page and free.
+ */
+TEST_F(hmm, migrate_anon_huge_free)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	void *old_ptr;
+	void *map;
+	int *ptr;
+	int ret;
+
+	size = TWOMEG;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = 2 * size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+	memset(buffer->mirror, 0xFF, size);
+
+	buffer->ptr = mmap(NULL, 2 * size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	npages = size >> self->page_shift;
+	map = (void *)ALIGN((uintptr_t)buffer->ptr, size);
+	ret = madvise(map, size, MADV_HUGEPAGE);
+	ASSERT_EQ(ret, 0);
+	old_ptr = buffer->ptr;
+	buffer->ptr = map;
+
+	/* Initialize buffer in system memory. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Migrate memory to device. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	/* Try freeing it. */
+	ret = madvise(map, size, MADV_FREE);
+	ASSERT_EQ(ret, 0);
+
+	buffer->ptr = old_ptr;
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Migrate private anonymous huge page and fault back to sysmem.
+ */
+TEST_F(hmm, migrate_anon_huge_fault)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	void *old_ptr;
+	void *map;
+	int *ptr;
+	int ret;
+
+	size = TWOMEG;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = 2 * size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+	memset(buffer->mirror, 0xFF, size);
+
+	buffer->ptr = mmap(NULL, 2 * size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	npages = size >> self->page_shift;
+	map = (void *)ALIGN((uintptr_t)buffer->ptr, size);
+	ret = madvise(map, size, MADV_HUGEPAGE);
+	ASSERT_EQ(ret, 0);
+	old_ptr = buffer->ptr;
+	buffer->ptr = map;
+
+	/* Initialize buffer in system memory. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Migrate memory to device. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	/* Fault pages back to system memory and check them. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	buffer->ptr = old_ptr;
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Migrate private anonymous huge page with allocation errors.
+ */
+TEST_F(hmm, migrate_anon_huge_err)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	void *old_ptr;
+	void *map;
+	int *ptr;
+	int ret;
+
+	size = TWOMEG;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = 2 * size;
+	buffer->mirror = malloc(2 * size);
+	ASSERT_NE(buffer->mirror, NULL);
+	memset(buffer->mirror, 0xFF, 2 * size);
+
+	old_ptr = mmap(NULL, 2 * size, PROT_READ | PROT_WRITE,
+			MAP_PRIVATE | MAP_ANONYMOUS, buffer->fd, 0);
+	ASSERT_NE(old_ptr, MAP_FAILED);
+
+	npages = size >> self->page_shift;
+	map = (void *)ALIGN((uintptr_t)old_ptr, size);
+	ret = madvise(map, size, MADV_HUGEPAGE);
+	ASSERT_EQ(ret, 0);
+	buffer->ptr = map;
+
+	/* Initialize buffer in system memory. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Migrate memory to device but force a THP allocation error. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_FLAGS, buffer,
+			      HMM_DMIRROR_FLAG_FAIL_ALLOC);
+	ASSERT_EQ(ret, 0);
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	/* Try faulting back a single (PAGE_SIZE) page. */
+	ptr = buffer->ptr;
+	ASSERT_EQ(ptr[2048], 2048);
+
+	/* unmap and remap the region to reset things. */
+	ret = munmap(old_ptr, 2 * size);
+	ASSERT_EQ(ret, 0);
+	old_ptr = mmap(NULL, 2 * size, PROT_READ | PROT_WRITE,
+			MAP_PRIVATE | MAP_ANONYMOUS, buffer->fd, 0);
+	ASSERT_NE(old_ptr, MAP_FAILED);
+	map = (void *)ALIGN((uintptr_t)old_ptr, size);
+	ret = madvise(map, size, MADV_HUGEPAGE);
+	ASSERT_EQ(ret, 0);
+	buffer->ptr = map;
+
+	/* Initialize buffer in system memory. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Migrate THP to device. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/*
+	 * Force an allocation error when faulting back a THP resident in the
+	 * device.
+	 */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_FLAGS, buffer,
+			      HMM_DMIRROR_FLAG_FAIL_ALLOC);
+	ASSERT_EQ(ret, 0);
+	ptr = buffer->ptr;
+	ASSERT_EQ(ptr[2048], 2048);
+
+	buffer->ptr = old_ptr;
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Migrate private anonymous huge zero page with allocation errors.
+ */
+TEST_F(hmm, migrate_anon_huge_zero_err)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	void *old_ptr;
+	void *map;
+	int *ptr;
+	int ret;
+
+	size = TWOMEG;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = 2 * size;
+	buffer->mirror = malloc(2 * size);
+	ASSERT_NE(buffer->mirror, NULL);
+	memset(buffer->mirror, 0xFF, 2 * size);
+
+	old_ptr = mmap(NULL, 2 * size, PROT_READ,
+			MAP_PRIVATE | MAP_ANONYMOUS, buffer->fd, 0);
+	ASSERT_NE(old_ptr, MAP_FAILED);
+
+	npages = size >> self->page_shift;
+	map = (void *)ALIGN((uintptr_t)old_ptr, size);
+	ret = madvise(map, size, MADV_HUGEPAGE);
+	ASSERT_EQ(ret, 0);
+	buffer->ptr = map;
+
+	/* Migrate memory to device but force a THP allocation error. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_FLAGS, buffer,
+			      HMM_DMIRROR_FLAG_FAIL_ALLOC);
+	ASSERT_EQ(ret, 0);
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], 0);
+
+	/* Try faulting back a single (PAGE_SIZE) page. */
+	ptr = buffer->ptr;
+	ASSERT_EQ(ptr[2048], 0);
+
+	/* unmap and remap the region to reset things. */
+	ret = munmap(old_ptr, 2 * size);
+	ASSERT_EQ(ret, 0);
+	old_ptr = mmap(NULL, 2 * size, PROT_READ,
+			MAP_PRIVATE | MAP_ANONYMOUS, buffer->fd, 0);
+	ASSERT_NE(old_ptr, MAP_FAILED);
+	map = (void *)ALIGN((uintptr_t)old_ptr, size);
+	ret = madvise(map, size, MADV_HUGEPAGE);
+	ASSERT_EQ(ret, 0);
+	buffer->ptr = map;
+
+	/* Initialize buffer in system memory (zero THP page). */
+	ret = *(int *)buffer->ptr;
+	ASSERT_EQ(ret, 0);
+
+	/* Migrate memory to device but force a THP allocation error. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_FLAGS, buffer,
+			      HMM_DMIRROR_FLAG_FAIL_ALLOC);
+	ASSERT_EQ(ret, 0);
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/* Fault the device memory back and check it. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], 0);
+
+	buffer->ptr = old_ptr;
+	hmm_buffer_free(buffer);
+}
+
 TEST_HARNESS_MAIN
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v3 6/6] nouveau: support THP migration to private memory
  2020-11-06  0:51 [PATCH v3 0/6] mm/hmm/nouveau: add THP migration to migrate_vma_* Ralph Campbell
                   ` (4 preceding siblings ...)
  2020-11-06  0:51 ` [PATCH v3 5/6] mm/hmm/test: add self tests for THP migration Ralph Campbell
@ 2020-11-06  0:51 ` Ralph Campbell
  5 siblings, 0 replies; 25+ messages in thread
From: Ralph Campbell @ 2020-11-06  0:51 UTC (permalink / raw)
  To: linux-mm, nouveau, linux-kselftest, linux-kernel
  Cc: Jerome Glisse, John Hubbard, Alistair Popple, Christoph Hellwig,
	Jason Gunthorpe, Bharata B Rao, Zi Yan, Kirill A . Shutemov,
	Yang Shi, Ben Skeggs, Shuah Khan, Andrew Morton, Ralph Campbell

Add support for migrating transparent huge pages to and from device
private memory.

Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
---
 drivers/gpu/drm/nouveau/nouveau_dmem.c | 289 ++++++++++++++++++-------
 drivers/gpu/drm/nouveau/nouveau_svm.c  |  11 +-
 drivers/gpu/drm/nouveau/nouveau_svm.h  |   3 +-
 3 files changed, 215 insertions(+), 88 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 92987daa5e17..93eea8e9d987 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -82,6 +82,7 @@ struct nouveau_dmem {
 	struct list_head chunks;
 	struct mutex mutex;
 	struct page *free_pages;
+	struct page *free_huge_pages;
 	spinlock_t lock;
 };
 
@@ -112,8 +113,13 @@ static void nouveau_dmem_page_free(struct page *page)
 	struct nouveau_dmem *dmem = chunk->drm->dmem;
 
 	spin_lock(&dmem->lock);
-	page->zone_device_data = dmem->free_pages;
-	dmem->free_pages = page;
+	if (PageHead(page)) {
+		page->zone_device_data = dmem->free_huge_pages;
+		dmem->free_huge_pages = page;
+	} else {
+		page->zone_device_data = dmem->free_pages;
+		dmem->free_pages = page;
+	}
 
 	WARN_ON(!chunk->callocated);
 	chunk->callocated--;
@@ -139,51 +145,100 @@ static void nouveau_dmem_fence_done(struct nouveau_fence **fence)
 
 static vm_fault_t nouveau_dmem_fault_copy_one(struct nouveau_drm *drm,
 		struct vm_fault *vmf, struct migrate_vma *args,
-		dma_addr_t *dma_addr)
+		struct page *spage, bool is_huge, dma_addr_t *dma_addr)
 {
+	struct nouveau_svmm *svmm = spage->zone_device_data;
 	struct device *dev = drm->dev->dev;
-	struct page *dpage, *spage;
-	struct nouveau_svmm *svmm;
-
-	spage = migrate_pfn_to_page(args->src[0]);
-	if (!spage || !(args->src[0] & MIGRATE_PFN_MIGRATE))
-		return 0;
+	struct page *dpage;
+	unsigned int i;
 
-	dpage = alloc_page_vma(GFP_HIGHUSER, vmf->vma, vmf->address);
+	if (is_huge)
+		dpage = alloc_transhugepage(vmf->vma, args->start);
+	else
+		dpage = alloc_page_vma(GFP_HIGHUSER, vmf->vma, vmf->address);
 	if (!dpage)
-		return VM_FAULT_SIGBUS;
-	lock_page(dpage);
+		return VM_FAULT_OOM;
+	WARN_ON_ONCE(thp_order(spage) != thp_order(dpage));
 
-	*dma_addr = dma_map_page(dev, dpage, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
+	*dma_addr = dma_map_page(dev, dpage, 0, page_size(dpage),
+				 DMA_BIDIRECTIONAL);
 	if (dma_mapping_error(dev, *dma_addr))
 		goto error_free_page;
 
-	svmm = spage->zone_device_data;
+	lock_page(dpage);
+	i = (vmf->address - args->start) >> PAGE_SHIFT;
+	spage += i;
 	mutex_lock(&svmm->mutex);
 	nouveau_svmm_invalidate(svmm, args->start, args->end);
-	if (drm->dmem->migrate.copy_func(drm, 1, NOUVEAU_APER_HOST, *dma_addr,
-			NOUVEAU_APER_VRAM, nouveau_dmem_page_addr(spage)))
+	if (drm->dmem->migrate.copy_func(drm, thp_nr_pages(dpage),
+			NOUVEAU_APER_HOST, *dma_addr, NOUVEAU_APER_VRAM,
+			nouveau_dmem_page_addr(spage)))
 		goto error_dma_unmap;
 	mutex_unlock(&svmm->mutex);
 
-	args->dst[0] = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
+	args->dst[i] = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
+	if (is_huge)
+		args->dst[i] |= MIGRATE_PFN_COMPOUND;
 	return 0;
 
 error_dma_unmap:
 	mutex_unlock(&svmm->mutex);
-	dma_unmap_page(dev, *dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
+	unlock_page(dpage);
+	dma_unmap_page(dev, *dma_addr, page_size(dpage), DMA_BIDIRECTIONAL);
 error_free_page:
 	__free_page(dpage);
 	return VM_FAULT_SIGBUS;
 }
 
+static vm_fault_t nouveau_dmem_fault_chunk(struct nouveau_drm *drm,
+		struct vm_fault *vmf, struct migrate_vma *args)
+{
+	struct device *dev = drm->dev->dev;
+	struct nouveau_fence *fence;
+	struct page *spage;
+	unsigned long src = args->src[0];
+	bool is_huge = (src & (MIGRATE_PFN_MIGRATE | MIGRATE_PFN_COMPOUND)) ==
+		(MIGRATE_PFN_MIGRATE | MIGRATE_PFN_COMPOUND);
+	unsigned long dma_page_size;
+	dma_addr_t dma_addr;
+	vm_fault_t ret = 0;
+
+	spage = migrate_pfn_to_page(src);
+	if (!spage) {
+		ret = VM_FAULT_SIGBUS;
+		goto out;
+	}
+	if (is_huge) {
+		dma_page_size = PMD_SIZE;
+		ret = nouveau_dmem_fault_copy_one(drm, vmf, args, spage, true,
+						  &dma_addr);
+		if (!ret)
+			goto fence;
+		/*
+		 * If we couldn't allocate a huge page, fallback to migrating
+		 * a single page.
+		 */
+	}
+	dma_page_size = PAGE_SIZE;
+	ret = nouveau_dmem_fault_copy_one(drm, vmf, args, spage, false,
+					  &dma_addr);
+	if (ret)
+		goto out;
+fence:
+	nouveau_fence_new(drm->dmem->migrate.chan, false, &fence);
+	migrate_vma_pages(args);
+	nouveau_dmem_fence_done(&fence);
+	dma_unmap_page(dev, dma_addr, dma_page_size, DMA_BIDIRECTIONAL);
+out:
+	migrate_vma_finalize(args);
+	return ret;
+}
+
 static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf)
 {
 	struct nouveau_drm *drm = page_to_drm(vmf->page);
-	struct nouveau_dmem *dmem = drm->dmem;
-	struct nouveau_fence *fence;
 	unsigned long src = 0, dst = 0;
-	dma_addr_t dma_addr = 0;
+	struct page *page;
 	vm_fault_t ret;
 	struct migrate_vma args = {
 		.vma		= vmf->vma,
@@ -192,39 +247,64 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf)
 		.src		= &src,
 		.dst		= &dst,
 		.pgmap_owner	= drm->dev,
-		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE,
+		.flags		= MIGRATE_VMA_SELECT_DEVICE_PRIVATE |
+				  MIGRATE_VMA_SELECT_COMPOUND,
 	};
 
+	/*
+	 * If the page was migrated to the GPU as a huge page, try to
+	 * migrate it back the same way.
+	 */
+	page = thp_head(vmf->page);
+	if (PageHead(page)) {
+		unsigned int order = thp_order(page);
+		unsigned int nr_pages = 1U << order;
+
+		args.start &= PAGE_MASK << order;
+		args.end = args.start + (PAGE_SIZE << order);
+		args.src = kmalloc_array(nr_pages, sizeof(*args.src),
+					 GFP_KERNEL);
+		if (!args.src)
+			return VM_FAULT_OOM;
+		args.dst = kmalloc_array(nr_pages, sizeof(*args.dst),
+					 GFP_KERNEL);
+		if (!args.dst) {
+			ret = VM_FAULT_OOM;
+			goto error_src;
+		}
+	}
+
 	/*
 	 * FIXME what we really want is to find some heuristic to migrate more
 	 * than just one page on CPU fault. When such fault happens it is very
 	 * likely that more surrounding page will CPU fault too.
 	 */
-	if (migrate_vma_setup(&args) < 0)
-		return VM_FAULT_SIGBUS;
-	if (!args.cpages)
-		return 0;
-
-	ret = nouveau_dmem_fault_copy_one(drm, vmf, &args, &dma_addr);
-	if (ret || dst == 0)
-		goto done;
-
-	nouveau_fence_new(dmem->migrate.chan, false, &fence);
-	migrate_vma_pages(&args);
-	nouveau_dmem_fence_done(&fence);
-	dma_unmap_page(drm->dev->dev, dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
-done:
-	migrate_vma_finalize(&args);
+	if (migrate_vma_setup(&args))
+		ret = VM_FAULT_SIGBUS;
+	else
+		ret = nouveau_dmem_fault_chunk(drm, vmf, &args);
+	if (args.dst != &dst)
+		kfree(args.dst);
+error_src:
+	if (args.src != &src)
+		kfree(args.src);
 	return ret;
 }
 
+static void nouveau_page_split(struct page *head, struct page *page)
+{
+	page->pgmap = head->pgmap;
+	page->zone_device_data = head->zone_device_data;
+}
+
 static const struct dev_pagemap_ops nouveau_dmem_pagemap_ops = {
 	.page_free		= nouveau_dmem_page_free,
 	.migrate_to_ram		= nouveau_dmem_migrate_to_ram,
+	.page_split		= nouveau_page_split,
 };
 
-static int
-nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage)
+static int nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, bool is_huge,
+				    struct page **ppage)
 {
 	struct nouveau_dmem_chunk *chunk;
 	struct resource *res;
@@ -278,16 +358,20 @@ nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage)
 	pfn_first = chunk->pagemap.range.start >> PAGE_SHIFT;
 	page = pfn_to_page(pfn_first);
 	spin_lock(&drm->dmem->lock);
-	for (i = 0; i < DMEM_CHUNK_NPAGES - 1; ++i, ++page) {
-		page->zone_device_data = drm->dmem->free_pages;
-		drm->dmem->free_pages = page;
-	}
+	if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && is_huge)
+		prep_transhuge_device_private_page(page);
+	else
+		for (i = 0; i < DMEM_CHUNK_NPAGES - 1; ++i, ++page) {
+			page->zone_device_data = drm->dmem->free_pages;
+			drm->dmem->free_pages = page;
+		}
 	*ppage = page;
 	chunk->callocated++;
 	spin_unlock(&drm->dmem->lock);
 
-	NV_INFO(drm, "DMEM: registered %ldMB of device memory\n",
-		DMEM_CHUNK_SIZE >> 20);
+	NV_INFO(drm, "DMEM: registered %ldMB of %sdevice memory %lx %lx\n",
+		DMEM_CHUNK_SIZE >> 20, is_huge ? "huge " : "", pfn_first,
+		nouveau_dmem_page_addr(page));
 
 	return 0;
 
@@ -304,14 +388,20 @@ nouveau_dmem_chunk_alloc(struct nouveau_drm *drm, struct page **ppage)
 }
 
 static struct page *
-nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
+nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm, bool is_huge)
 {
 	struct nouveau_dmem_chunk *chunk;
 	struct page *page = NULL;
 	int ret;
 
 	spin_lock(&drm->dmem->lock);
-	if (drm->dmem->free_pages) {
+	if (is_huge && drm->dmem->free_huge_pages) {
+		page = drm->dmem->free_huge_pages;
+		drm->dmem->free_huge_pages = page->zone_device_data;
+		chunk = nouveau_page_to_chunk(page);
+		chunk->callocated++;
+		spin_unlock(&drm->dmem->lock);
+	} else if (!is_huge && drm->dmem->free_pages) {
 		page = drm->dmem->free_pages;
 		drm->dmem->free_pages = page->zone_device_data;
 		chunk = nouveau_page_to_chunk(page);
@@ -319,7 +409,7 @@ nouveau_dmem_page_alloc_locked(struct nouveau_drm *drm)
 		spin_unlock(&drm->dmem->lock);
 	} else {
 		spin_unlock(&drm->dmem->lock);
-		ret = nouveau_dmem_chunk_alloc(drm, &page);
+		ret = nouveau_dmem_chunk_alloc(drm, is_huge, &page);
 		if (ret)
 			return NULL;
 	}
@@ -567,31 +657,22 @@ nouveau_dmem_init(struct nouveau_drm *drm)
 
 static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
 		struct nouveau_svmm *svmm, unsigned long src,
-		dma_addr_t *dma_addr, u64 *pfn)
+		struct page *spage, bool is_huge, dma_addr_t dma_addr, u64 *pfn)
 {
-	struct device *dev = drm->dev->dev;
-	struct page *dpage, *spage;
+	struct page *dpage;
 	unsigned long paddr;
+	unsigned long dst;
 
-	spage = migrate_pfn_to_page(src);
-	if (!(src & MIGRATE_PFN_MIGRATE))
-		goto out;
-
-	dpage = nouveau_dmem_page_alloc_locked(drm);
+	dpage = nouveau_dmem_page_alloc_locked(drm, is_huge);
 	if (!dpage)
 		goto out;
 
 	paddr = nouveau_dmem_page_addr(dpage);
 	if (spage) {
-		*dma_addr = dma_map_page(dev, spage, 0, page_size(spage),
-					 DMA_BIDIRECTIONAL);
-		if (dma_mapping_error(dev, *dma_addr))
+		if (drm->dmem->migrate.copy_func(drm, thp_nr_pages(dpage),
+			NOUVEAU_APER_VRAM, paddr, NOUVEAU_APER_HOST, dma_addr))
 			goto out_free_page;
-		if (drm->dmem->migrate.copy_func(drm, 1,
-			NOUVEAU_APER_VRAM, paddr, NOUVEAU_APER_HOST, *dma_addr))
-			goto out_dma_unmap;
 	} else {
-		*dma_addr = DMA_MAPPING_ERROR;
 		if (drm->dmem->migrate.clear_func(drm, page_size(dpage),
 			NOUVEAU_APER_VRAM, paddr))
 			goto out_free_page;
@@ -602,10 +683,11 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
 		((paddr >> PAGE_SHIFT) << NVIF_VMM_PFNMAP_V0_ADDR_SHIFT);
 	if (src & MIGRATE_PFN_WRITE)
 		*pfn |= NVIF_VMM_PFNMAP_V0_W;
-	return migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
+	dst = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
+	if (PageHead(dpage))
+		dst |= MIGRATE_PFN_COMPOUND;
+	return dst;
 
-out_dma_unmap:
-	dma_unmap_page(dev, *dma_addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
 out_free_page:
 	nouveau_dmem_page_free_locked(drm, dpage);
 out:
@@ -617,26 +699,64 @@ static void nouveau_dmem_migrate_chunk(struct nouveau_drm *drm,
 		struct nouveau_svmm *svmm, struct migrate_vma *args,
 		dma_addr_t *dma_addrs, u64 *pfns)
 {
+	struct device *dev = drm->dev->dev;
 	struct nouveau_fence *fence;
 	unsigned long addr = args->start, nr_dma = 0, i;
+	unsigned int page_shift = PAGE_SHIFT;
+	struct page *spage;
+	unsigned long src = args->src[0];
+	bool is_huge = (src & (MIGRATE_PFN_MIGRATE | MIGRATE_PFN_COMPOUND)) ==
+		(MIGRATE_PFN_MIGRATE | MIGRATE_PFN_COMPOUND);
+	unsigned long dma_page_size = is_huge ? PMD_SIZE : PAGE_SIZE;
+
+	if (is_huge) {
+		spage = migrate_pfn_to_page(src);
+		if (spage) {
+			dma_addrs[nr_dma] = dma_map_page(dev, spage, 0,
+							 page_size(spage),
+							 DMA_BIDIRECTIONAL);
+			if (dma_mapping_error(dev, dma_addrs[nr_dma]))
+				goto out;
+			nr_dma++;
+		}
+		args->dst[0] = nouveau_dmem_migrate_copy_one(drm, svmm, src,
+				spage, true, *dma_addrs, pfns);
+		if (args->dst[0] & MIGRATE_PFN_COMPOUND) {
+			page_shift = PMD_SHIFT;
+			i = 1;
+			goto fence;
+		}
+	}
 
-	for (i = 0; addr < args->end; i++) {
-		args->dst[i] = nouveau_dmem_migrate_copy_one(drm, svmm,
-				args->src[i], dma_addrs + nr_dma, pfns + i);
-		if (!dma_mapping_error(drm->dev->dev, dma_addrs[nr_dma]))
+	for (i = 0; addr < args->end; i++, addr += PAGE_SIZE) {
+		src = args->src[i];
+		if (!(src & MIGRATE_PFN_MIGRATE))
+			continue;
+		spage = migrate_pfn_to_page(src);
+		if (spage && !is_huge) {
+			dma_addrs[i] = dma_map_page(dev, spage, 0,
+						    page_size(spage),
+						    DMA_BIDIRECTIONAL);
+			if (dma_mapping_error(dev, dma_addrs[i]))
+				break;
 			nr_dma++;
-		addr += PAGE_SIZE;
+		} else if (spage && is_huge && i != 0)
+			dma_addrs[i] = dma_addrs[i - 1] + PAGE_SIZE;
+		args->dst[i] = nouveau_dmem_migrate_copy_one(drm, svmm, src,
+				spage, false, dma_addrs[i], pfns + i);
 	}
 
+fence:
 	nouveau_fence_new(drm->dmem->migrate.chan, false, &fence);
 	migrate_vma_pages(args);
 	nouveau_dmem_fence_done(&fence);
-	nouveau_pfns_map(svmm, args->vma->vm_mm, args->start, pfns, i);
+	nouveau_pfns_map(svmm, args->vma->vm_mm, args->start, pfns, i,
+			 page_shift);
 
-	while (nr_dma--) {
-		dma_unmap_page(drm->dev->dev, dma_addrs[nr_dma], PAGE_SIZE,
-				DMA_BIDIRECTIONAL);
-	}
+	while (nr_dma)
+		dma_unmap_page(drm->dev->dev, dma_addrs[--nr_dma],
+				dma_page_size, DMA_BIDIRECTIONAL);
+out:
 	migrate_vma_finalize(args);
 }
 
@@ -648,25 +768,25 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm,
 			 unsigned long end)
 {
 	unsigned long npages = (end - start) >> PAGE_SHIFT;
-	unsigned long max = min(SG_MAX_SINGLE_ALLOC, npages);
+	unsigned long max = min(1UL << (PMD_SHIFT - PAGE_SHIFT), npages);
 	dma_addr_t *dma_addrs;
 	struct migrate_vma args = {
 		.vma		= vma,
 		.start		= start,
 		.pgmap_owner	= drm->dev,
-		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
+		.flags		= MIGRATE_VMA_SELECT_SYSTEM |
+				  MIGRATE_VMA_SELECT_COMPOUND,
 	};
-	unsigned long i;
 	u64 *pfns;
 	int ret = -ENOMEM;
 
 	if (drm->dmem == NULL)
 		return -ENODEV;
 
-	args.src = kcalloc(max, sizeof(*args.src), GFP_KERNEL);
+	args.src = kmalloc_array(max, sizeof(*args.src), GFP_KERNEL);
 	if (!args.src)
 		goto out;
-	args.dst = kcalloc(max, sizeof(*args.dst), GFP_KERNEL);
+	args.dst = kmalloc_array(max, sizeof(*args.dst), GFP_KERNEL);
 	if (!args.dst)
 		goto out_free_src;
 
@@ -678,8 +798,10 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm,
 	if (!pfns)
 		goto out_free_dma;
 
-	for (i = 0; i < npages; i += max) {
-		args.end = start + (max << PAGE_SHIFT);
+	for (; args.start < end; args.start = args.end) {
+		args.end = min(end, ALIGN(args.start, PMD_SIZE));
+		if (args.start == args.end)
+			args.end = min(end, args.start + PMD_SIZE);
 		ret = migrate_vma_setup(&args);
 		if (ret)
 			goto out_free_pfns;
@@ -687,7 +809,6 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm,
 		if (args.cpages)
 			nouveau_dmem_migrate_chunk(drm, svmm, &args, dma_addrs,
 						   pfns);
-		args.start = args.end;
 	}
 
 	ret = 0;
diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
index 4f69e4c3dafd..3db0997f21b5 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -681,7 +681,6 @@ nouveau_svm_fault(struct nvif_notify *notify)
 			nouveau_svm_fault_cancel_fault(svm, buffer->fault[fi]);
 			continue;
 		}
-		SVMM_DBG(svmm, "addr %016llx", buffer->fault[fi]->addr);
 
 		/* We try and group handling of faults within a small
 		 * window into a single update.
@@ -733,6 +732,10 @@ nouveau_svm_fault(struct nvif_notify *notify)
 		}
 		mmput(mm);
 
+		SVMM_DBG(svmm, "addr %llx %s %c", buffer->fault[fi]->addr,
+			args.phys[0] & NVIF_VMM_PFNMAP_V0_VRAM ?
+			"vram" : "sysmem",
+			args.i.p.size > PAGE_SIZE ? 'H' : 'N');
 		limit = args.i.p.addr + args.i.p.size;
 		for (fn = fi; ++fn < buffer->fault_nr; ) {
 			/* It's okay to skip over duplicate addresses from the
@@ -804,13 +807,15 @@ nouveau_pfns_free(u64 *pfns)
 
 void
 nouveau_pfns_map(struct nouveau_svmm *svmm, struct mm_struct *mm,
-		 unsigned long addr, u64 *pfns, unsigned long npages)
+		 unsigned long addr, u64 *pfns, unsigned long npages,
+		 unsigned int page_shift)
 {
 	struct nouveau_pfnmap_args *args = nouveau_pfns_to_args(pfns);
 	int ret;
 
 	args->p.addr = addr;
-	args->p.size = npages << PAGE_SHIFT;
+	args->p.page = page_shift;
+	args->p.size = npages << args->p.page;
 
 	mutex_lock(&svmm->mutex);
 
diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.h b/drivers/gpu/drm/nouveau/nouveau_svm.h
index e7d63d7f0c2d..3fd78662f17e 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.h
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.h
@@ -33,7 +33,8 @@ void nouveau_svmm_invalidate(struct nouveau_svmm *svmm, u64 start, u64 limit);
 u64 *nouveau_pfns_alloc(unsigned long npages);
 void nouveau_pfns_free(u64 *pfns);
 void nouveau_pfns_map(struct nouveau_svmm *svmm, struct mm_struct *mm,
-		      unsigned long addr, u64 *pfns, unsigned long npages);
+		      unsigned long addr, u64 *pfns, unsigned long npages,
+		      unsigned int page_shift);
 #else /* IS_ENABLED(CONFIG_DRM_NOUVEAU_SVM) */
 static inline void nouveau_svm_init(struct nouveau_drm *drm) {}
 static inline void nouveau_svm_fini(struct nouveau_drm *drm) {}
-- 
2.20.1



^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 1/6] mm/thp: add prep_transhuge_device_private_page()
  2020-11-06  0:51 ` [PATCH v3 1/6] mm/thp: add prep_transhuge_device_private_page() Ralph Campbell
@ 2020-11-06  7:55   ` Christoph Hellwig
  2020-11-06 20:56     ` Ralph Campbell
  2020-11-06 12:14   ` Matthew Wilcox
  1 sibling, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2020-11-06  7:55 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: linux-mm, nouveau, linux-kselftest, linux-kernel, Jerome Glisse,
	John Hubbard, Alistair Popple, Christoph Hellwig,
	Jason Gunthorpe, Bharata B Rao, Zi Yan, Kirill A . Shutemov,
	Yang Shi, Ben Skeggs, Shuah Khan, Andrew Morton

On Thu, Nov 05, 2020 at 04:51:42PM -0800, Ralph Campbell wrote:
> +extern void prep_transhuge_device_private_page(struct page *page);

No need for the extern.

> +static inline void prep_transhuge_device_private_page(struct page *page)
> +{
> +}

Is the code to call this even reachable if THP support is configured
out?  If not just declaring it unconditionally and letting dead code
elimination do its job might be a tad cleaner.

> +void prep_transhuge_device_private_page(struct page *page)

I think a kerneldoc comment explaining what this function is useful for
would be helpful.
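
For example, something along these lines (wording is only a suggestion, not
taken from the series):

/**
 * prep_transhuge_device_private_page - prepare a device private THP
 * @page: head of a HPAGE_PMD_ORDER range of device private struct pages
 *
 * A device driver calls this on struct pages obtained from memremap_pages()
 * before migrating a transparent huge page into them so that the compound
 * page metadata and the deferred split list head are initialized.  Only the
 * head page keeps its reference on the pgmap.
 */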


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 2/6] mm/migrate: move migrate_vma_collect_skip()
  2020-11-06  0:51 ` [PATCH v3 2/6] mm/migrate: move migrate_vma_collect_skip() Ralph Campbell
@ 2020-11-06  7:56   ` Christoph Hellwig
  2020-11-06  7:57   ` Christoph Hellwig
  1 sibling, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2020-11-06  7:56 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: linux-mm, nouveau, linux-kselftest, linux-kernel, Jerome Glisse,
	John Hubbard, Alistair Popple, Christoph Hellwig,
	Jason Gunthorpe, Bharata B Rao, Zi Yan, Kirill A . Shutemov,
	Yang Shi, Ben Skeggs, Shuah Khan, Andrew Morton

On Thu, Nov 05, 2020 at 04:51:43PM -0800, Ralph Campbell wrote:
> Move the definition of migrate_vma_collect_skip() to make it callable
> by migrate_vma_collect_hole(). This helps make the next patch easier
> to read.
> 
> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 2/6] mm/migrate: move migrate_vma_collect_skip()
  2020-11-06  0:51 ` [PATCH v3 2/6] mm/migrate: move migrate_vma_collect_skip() Ralph Campbell
  2020-11-06  7:56   ` Christoph Hellwig
@ 2020-11-06  7:57   ` Christoph Hellwig
  1 sibling, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2020-11-06  7:57 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: linux-mm, nouveau, linux-kselftest, linux-kernel, Jerome Glisse,
	John Hubbard, Alistair Popple, Christoph Hellwig,
	Jason Gunthorpe, Bharata B Rao, Zi Yan, Kirill A . Shutemov,
	Yang Shi, Ben Skeggs, Shuah Khan, Andrew Morton

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 4/6] mm/thp: add THP allocation helper
  2020-11-06  0:51 ` [PATCH v3 4/6] mm/thp: add THP allocation helper Ralph Campbell
@ 2020-11-06  8:01   ` Christoph Hellwig
  2020-11-06 21:09     ` Ralph Campbell
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2020-11-06  8:01 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: linux-mm, nouveau, linux-kselftest, linux-kernel, Jerome Glisse,
	John Hubbard, Alistair Popple, Christoph Hellwig,
	Jason Gunthorpe, Bharata B Rao, Zi Yan, Kirill A . Shutemov,
	Yang Shi, Ben Skeggs, Shuah Khan, Andrew Morton

> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +extern struct page *alloc_transhugepage(struct vm_area_struct *vma,
> +					unsigned long addr);

No need for the extern.  And also here: do we actually need the stub,
or can the caller make sure (using IS_ENABLED and similar) that the
compiler knows the code is dead?

> +struct page *alloc_transhugepage(struct vm_area_struct *vma,
> +				 unsigned long haddr)
> +{
> +	gfp_t gfp;
> +	struct page *page;
> +
> +	gfp = alloc_hugepage_direct_gfpmask(vma);
> +	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
> +	if (page)
> +		prep_transhuge_page(page);
> +	return page;

I think do_huge_pmd_anonymous_page should be switched to use this
helper as well.
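
i.e. something like this in the anonymous fault path (sketch only, modulo
plumbing the gfp mask that the rest of that path still needs):

	page = alloc_transhugepage(vma, haddr);
	if (unlikely(!page)) {
		count_vm_event(THP_FAULT_FALLBACK);
		return VM_FAULT_FALLBACK;
	}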


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 3/6] mm: support THP migration to device private memory
  2020-11-06  0:51 ` [PATCH v3 3/6] mm: support THP migration to device private memory Ralph Campbell
@ 2020-11-06  8:03   ` Christoph Hellwig
  2020-11-06 21:26     ` Ralph Campbell
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2020-11-06  8:03 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: linux-mm, nouveau, linux-kselftest, linux-kernel, Jerome Glisse,
	John Hubbard, Alistair Popple, Christoph Hellwig,
	Jason Gunthorpe, Bharata B Rao, Zi Yan, Kirill A . Shutemov,
	Yang Shi, Ben Skeggs, Shuah Khan, Andrew Morton

I hate the extra pin count magic here.  IMHO we really need to finish
off the series to get rid of the extra references on the ZONE_DEVICE
pages first.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 1/6] mm/thp: add prep_transhuge_device_private_page()
  2020-11-06  0:51 ` [PATCH v3 1/6] mm/thp: add prep_transhuge_device_private_page() Ralph Campbell
  2020-11-06  7:55   ` Christoph Hellwig
@ 2020-11-06 12:14   ` Matthew Wilcox
  2020-11-06 20:34     ` Ralph Campbell
  1 sibling, 1 reply; 25+ messages in thread
From: Matthew Wilcox @ 2020-11-06 12:14 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: linux-mm, nouveau, linux-kselftest, linux-kernel, Jerome Glisse,
	John Hubbard, Alistair Popple, Christoph Hellwig,
	Jason Gunthorpe, Bharata B Rao, Zi Yan, Kirill A . Shutemov,
	Yang Shi, Ben Skeggs, Shuah Khan, Andrew Morton

On Thu, Nov 05, 2020 at 04:51:42PM -0800, Ralph Campbell wrote:
> Add a helper function to allow device drivers to create device private
> transparent huge pages. This is intended to help support device private
> THP migrations.

I think you'd be better off with these calling conventions:

-void prep_transhuge_page(struct page *page)
+struct page *thp_prep(struct page *page)
 {
+       if (!page || compound_order(page) == 0)
+               return page;
        /*
-        * we use page->mapping and page->indexlru in second tail page
+        * we use page->mapping and page->index in second tail page
         * as list_head: assuming THP order >= 2
         */
+       BUG_ON(compound_order(page) == 1);
 
        INIT_LIST_HEAD(page_deferred_list(page));
        set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR);
+
+       return page;
 }

It simplifies the users.
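
For instance, a caller could then collapse allocation and preparation into a
single expression (hypothetical usage of the proposed helper, not code from
either series):

	page = thp_prep(alloc_pages_vma(GFP_TRANSHUGE, HPAGE_PMD_ORDER,
					vma, haddr, numa_node_id(), true));
	if (!page)
		return VM_FAULT_FALLBACK;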

> +void prep_transhuge_device_private_page(struct page *page)
> +{
> +	prep_compound_page(page, HPAGE_PMD_ORDER);
> +	prep_transhuge_page(page);
> +	/* Only the head page has a reference to the pgmap. */
> +	percpu_ref_put_many(page->pgmap->ref, HPAGE_PMD_NR - 1);
> +}
> +EXPORT_SYMBOL_GPL(prep_transhuge_device_private_page);

Something else that may interest you from my patch series is support
for page sizes other than PMD_SIZE.  I don't know what page sizes your
hardware supports.  There's no support for page sizes other than PMD
for anonymous memory, so this might not be too useful for you yet.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 1/6] mm/thp: add prep_transhuge_device_private_page()
  2020-11-06 12:14   ` Matthew Wilcox
@ 2020-11-06 20:34     ` Ralph Campbell
  0 siblings, 0 replies; 25+ messages in thread
From: Ralph Campbell @ 2020-11-06 20:34 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, nouveau, linux-kselftest, linux-kernel, Jerome Glisse,
	John Hubbard, Alistair Popple, Christoph Hellwig,
	Jason Gunthorpe, Bharata B Rao, Zi Yan, Kirill A . Shutemov,
	Yang Shi, Ben Skeggs, Shuah Khan, Andrew Morton


On 11/6/20 4:14 AM, Matthew Wilcox wrote:
> On Thu, Nov 05, 2020 at 04:51:42PM -0800, Ralph Campbell wrote:
>> Add a helper function to allow device drivers to create device private
>> transparent huge pages. This is intended to help support device private
>> THP migrations.
> 
> I think you'd be better off with these calling conventions:
> 
> -void prep_transhuge_page(struct page *page)
> +struct page *thp_prep(struct page *page)
>   {
> +       if (!page || compound_order(page) == 0)
> +               return page;
>          /*
> -        * we use page->mapping and page->indexlru in second tail page
> +        * we use page->mapping and page->index in second tail page
>           * as list_head: assuming THP order >= 2
>           */
> +       BUG_ON(compound_order(page) == 1);
>   
>          INIT_LIST_HEAD(page_deferred_list(page));
>          set_compound_page_dtor(page, TRANSHUGE_PAGE_DTOR);
> +
> +       return page;
>   }
> 
> It simplifies the users.

I'm not sure what the simplification is.
If you mean the name change from prep_transhuge_page() to thp_prep(),
that seems fine to me. The following could also be renamed to
thp_prep_device_private_page() or similar.

>> +void prep_transhuge_device_private_page(struct page *page)
>> +{
>> +	prep_compound_page(page, HPAGE_PMD_ORDER);
>> +	prep_transhuge_page(page);
>> +	/* Only the head page has a reference to the pgmap. */
>> +	percpu_ref_put_many(page->pgmap->ref, HPAGE_PMD_NR - 1);
>> +}
>> +EXPORT_SYMBOL_GPL(prep_transhuge_device_private_page);
> 
> Something else that may interest you from my patch series is support
> for page sizes other than PMD_SIZE.  I don't know what page sizes your
> hardware supports.  There's no support for page sizes other than PMD
> for anonymous memory, so this might not be too useful for you yet.

I did see those changes. It might help some device drivers do DMA in
blocks larger than PAGE_SIZE but smaller than PMD_SIZE, and it might help
reduce page table sizes, since 2MB, 64K, and 4K are commonly supported
GPU page sizes. The MIGRATE_PFN_COMPOUND flag is intended to indicate
that the page size is determined by page_size(), so I was already thinking
ahead to sizes other than PMD. However, when migrating a pte_none() or
pmd_none() page, there is no source page to determine the size from.
Maybe I need to encode the page order in the migrate PFN entry the way
hmm_range_fault() does.
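
Something like this, roughly (all of these names are made up to illustrate
the idea; none of them exist today):

#define MIGRATE_PFN_ORDER_SHIFT	58

static inline unsigned int migrate_pfn_order(unsigned long mpfn)
{
	return (mpfn >> MIGRATE_PFN_ORDER_SHIFT) & 0x1f;
}

static inline unsigned long migrate_pfn_set_order(unsigned long mpfn,
						  unsigned int order)
{
	return mpfn | ((unsigned long)order << MIGRATE_PFN_ORDER_SHIFT);
}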

Anyway, I agree that thinking about page sizes other than PMD is good.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 1/6] mm/thp: add prep_transhuge_device_private_page()
  2020-11-06  7:55   ` Christoph Hellwig
@ 2020-11-06 20:56     ` Ralph Campbell
  0 siblings, 0 replies; 25+ messages in thread
From: Ralph Campbell @ 2020-11-06 20:56 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-mm, nouveau, linux-kselftest, linux-kernel, Jerome Glisse,
	John Hubbard, Alistair Popple, Jason Gunthorpe, Bharata B Rao,
	Zi Yan, Kirill A . Shutemov, Yang Shi, Ben Skeggs, Shuah Khan,
	Andrew Morton


On 11/5/20 11:55 PM, Christoph Hellwig wrote:
> On Thu, Nov 05, 2020 at 04:51:42PM -0800, Ralph Campbell wrote:
>> +extern void prep_transhuge_device_private_page(struct page *page);
> 
> No need for the extern.

Right, I was just copying the style.
Would you like to see a preparatory patch that removes extern for the other
declarations in huge_mm.h?

>> +static inline void prep_transhuge_device_private_page(struct page *page)
>> +{
>> +}
> 
> Is the code to call this even reachable if THP support is configured
> out?  If not just declaring it unconditionally and letting dead code
> elimination do its job might be a tad cleaner.

The HMM test driver depends on TRANSPARENT_HUGEPAGE, but the nouveau SVM
option doesn't, and SVM is still useful when TRANSPARENT_HUGEPAGE is not
configured.

The problem with defining prep_transhuge_device_private_page() in huge_mm.h
as a static inline function is that prep_compound_page() and
prep_transhuge_page() would then have to be exported with EXPORT_SYMBOL_GPL,
and they are currently mm internal only.
The intent is to make this helper callable by separate device driver modules
using struct pages created with memremap_pages().
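
So the header side would end up looking something like this (with the extern
dropped as you suggest), while the body stays out of line in mm/huge_memory.c
and is EXPORT_SYMBOL_GPL:

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
void prep_transhuge_device_private_page(struct page *page);
#else
static inline void prep_transhuge_device_private_page(struct page *page)
{
}
#endif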

>> +void prep_transhuge_device_private_page(struct page *page)
> 
> I think a kerneldoc comment explaining what this function is useful for
> would be helpful.

That is a good idea. I'll add it to v4.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 4/6] mm/thp: add THP allocation helper
  2020-11-06  8:01   ` Christoph Hellwig
@ 2020-11-06 21:09     ` Ralph Campbell
  0 siblings, 0 replies; 25+ messages in thread
From: Ralph Campbell @ 2020-11-06 21:09 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-mm, linux-kselftest, linux-kernel, Jerome Glisse,
	John Hubbard, Alistair Popple, Jason Gunthorpe, Bharata B Rao,
	Zi Yan, Kirill A . Shutemov, Yang Shi, Ben Skeggs, Shuah Khan,
	Andrew Morton


On 11/6/20 12:01 AM, Christoph Hellwig wrote:
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> +extern struct page *alloc_transhugepage(struct vm_area_struct *vma,
>> +					unsigned long addr);
> 
> No need for the extern.  And also here: do we actually need the stub,
> or can the caller make sure (using IS_ENABLED and similar) that the
> compiler knows the code is dead?

Same problem as with prep_transhuge_device_private_page():
alloc_hugepage_direct_gfpmask() and alloc_hugepage_vma() are not
exported with EXPORT_SYMBOL_GPL.

>> +struct page *alloc_transhugepage(struct vm_area_struct *vma,
>> +				 unsigned long haddr)
>> +{
>> +	gfp_t gfp;
>> +	struct page *page;
>> +
>> +	gfp = alloc_hugepage_direct_gfpmask(vma);
>> +	page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
>> +	if (page)
>> +		prep_transhuge_page(page);
>> +	return page;
> 
> I think do_huge_pmd_anonymous_page should be switched to use this
> helper as well.

Sure, I'll do that for v4.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 3/6] mm: support THP migration to device private memory
  2020-11-06  8:03   ` Christoph Hellwig
@ 2020-11-06 21:26     ` Ralph Campbell
  2020-11-09  9:14       ` Christoph Hellwig
  0 siblings, 1 reply; 25+ messages in thread
From: Ralph Campbell @ 2020-11-06 21:26 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-mm, nouveau, linux-kselftest, linux-kernel, Jerome Glisse,
	John Hubbard, Alistair Popple, Jason Gunthorpe, Bharata B Rao,
	Zi Yan, Kirill A . Shutemov, Yang Shi, Ben Skeggs, Shuah Khan,
	Andrew Morton


On 11/6/20 12:03 AM, Christoph Hellwig wrote:
> I hate the extra pin count magic here.  IMHO we really need to finish
> off the series to get rid of the extra references on the ZONE_DEVICE
> pages first.

First, thanks for the review comments.

I don't like the extra refcount either, that is why I tried to fix that up
before resending this series. However, you didn't like me just fixing the
refcount only for device private pages and I don't know the dax/pmem code
and peer-to-peer PCIe uses of ZONE_DEVICE pages well enough to say how
long it will take me to fix all the use cases.
So I wanted to make progress on the THP migration code in the mean time.



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 3/6] mm: support THP migration to device private memory
  2020-11-06 21:26     ` Ralph Campbell
@ 2020-11-09  9:14       ` Christoph Hellwig
  2020-11-09 21:34         ` Ralph Campbell
  2020-11-11 23:38         ` Ralph Campbell
  0 siblings, 2 replies; 25+ messages in thread
From: Christoph Hellwig @ 2020-11-09  9:14 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: Christoph Hellwig, linux-mm, nouveau, linux-kselftest,
	linux-kernel, Jerome Glisse, John Hubbard, Alistair Popple,
	Jason Gunthorpe, Bharata B Rao, Zi Yan, Kirill A . Shutemov,
	Yang Shi, Ben Skeggs, Shuah Khan, Andrew Morton

On Fri, Nov 06, 2020 at 01:26:50PM -0800, Ralph Campbell wrote:
>
> On 11/6/20 12:03 AM, Christoph Hellwig wrote:
>> I hate the extra pin count magic here.  IMHO we really need to finish
>> off the series to get rid of the extra references on the ZONE_DEVICE
>> pages first.
>
> First, thanks for the review comments.
>
> I don't like the extra refcount either, that is why I tried to fix that up
> before resending this series. However, you didn't like me just fixing the
> refcount only for device private pages and I don't know the dax/pmem code
> and peer-to-peer PCIe uses of ZONE_DEVICE pages well enough to say how
> long it will take me to fix all the use cases.
> So I wanted to make progress on the THP migration code in the mean time.

I think P2P is pretty trivial, given that ZONE_DEVICE pages are used like
a normal memory allocator.  DAX is the interesting case, any specific
help that you need with that?


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 3/6] mm: support THP migration to device private memory
  2020-11-09  9:14       ` Christoph Hellwig
@ 2020-11-09 21:34         ` Ralph Campbell
  2020-11-11 23:38         ` Ralph Campbell
  1 sibling, 0 replies; 25+ messages in thread
From: Ralph Campbell @ 2020-11-09 21:34 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-mm, nouveau, linux-kselftest, linux-kernel, Jerome Glisse,
	John Hubbard, Alistair Popple, Jason Gunthorpe, Bharata B Rao,
	Zi Yan, Kirill A . Shutemov, Yang Shi, Ben Skeggs, Shuah Khan,
	Andrew Morton


On 11/9/20 1:14 AM, Christoph Hellwig wrote:
> On Fri, Nov 06, 2020 at 01:26:50PM -0800, Ralph Campbell wrote:
>>
>> On 11/6/20 12:03 AM, Christoph Hellwig wrote:
>>> I hate the extra pin count magic here.  IMHO we really need to finish
>>> off the series to get rid of the extra references on the ZONE_DEVICE
>>> pages first.
>>
>> First, thanks for the review comments.
>>
>> I don't like the extra refcount either, that is why I tried to fix that up
>> before resending this series. However, you didn't like me just fixing the
>> refcount only for device private pages and I don't know the dax/pmem code
>> and peer-to-peer PCIe uses of ZONE_DEVICE pages well enough to say how
>> long it will take me to fix all the use cases.
>> So I wanted to make progress on the THP migration code in the mean time.
> 
> I think P2P is pretty trivial, given that ZONE_DEVICE pages are used like
> a normal memory allocator.  DAX is the interesting case, any specific
> help that you need with that?

Thanks for the offer. I'm putting a list together... :-)


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 3/6] mm: support THP migration to device private memory
  2020-11-09  9:14       ` Christoph Hellwig
  2020-11-09 21:34         ` Ralph Campbell
@ 2020-11-11 23:38         ` Ralph Campbell
  2020-11-20 20:01           ` Jason Gunthorpe
  2020-12-02 10:14           ` Christoph Hellwig
  1 sibling, 2 replies; 25+ messages in thread
From: Ralph Campbell @ 2020-11-11 23:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-mm, nouveau, linux-kselftest, linux-kernel, Jerome Glisse,
	John Hubbard, Alistair Popple, Jason Gunthorpe, Bharata B Rao,
	Zi Yan, Kirill A . Shutemov, Yang Shi, Ben Skeggs, Shuah Khan,
	Andrew Morton


On 11/9/20 1:14 AM, Christoph Hellwig wrote:
> On Fri, Nov 06, 2020 at 01:26:50PM -0800, Ralph Campbell wrote:
>>
>> On 11/6/20 12:03 AM, Christoph Hellwig wrote:
>>> I hate the extra pin count magic here.  IMHO we really need to finish
>>> off the series to get rid of the extra references on the ZONE_DEVICE
>>> pages first.
>>
>> First, thanks for the review comments.
>>
>> I don't like the extra refcount either, that is why I tried to fix that up
>> before resending this series. However, you didn't like me just fixing the
>> refcount only for device private pages and I don't know the dax/pmem code
>> and peer-to-peer PCIe uses of ZONE_DEVICE pages well enough to say how
>> long it will take me to fix all the use cases.
>> So I wanted to make progress on the THP migration code in the mean time.
> 
> I think P2P is pretty trivial, given that ZONE_DEVICE pages are used like
> a normal memory allocator.  DAX is the interesting case, any specific
> help that you need with that?

There are 4 types of ZONE_DEVICE struct pages:
MEMORY_DEVICE_PRIVATE, MEMORY_DEVICE_FS_DAX, MEMORY_DEVICE_GENERIC, and
MEMORY_DEVICE_PCI_P2PDMA.

Currently, memremap_pages() allocates struct pages for a physical address range
with a page_ref_count(page) of one and increments the pgmap->ref per-CPU
reference count by the number of pages created, since each ZONE_DEVICE struct
page holds a pointer to the pgmap.

The struct pages are not freed until memunmap_pages() is called; it calls
put_page(), which calls put_dev_pagemap(), which releases a reference on
pgmap->ref. memunmap_pages() then blocks waiting for the pgmap->ref reference
count to reach zero. As far as I can tell, the put_page() in memunmap_pages()
has to be the *last* put_page() (see MEMORY_DEVICE_PCI_P2PDMA).
My RFC [1] breaks this put_page() -> put_dev_pagemap() connection so that
the struct page reference count can go to zero and back to non-zero without
changing the pgmap->ref reference count.
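
To put that in code form, the last put on a ZONE_DEVICE page today effectively
does the following (paraphrased, not the literal source), and it is exactly
this coupling that [1] removes:

static void zone_device_final_put(struct page *page)	/* illustration only */
{
	/*
	 * The page is not returned to the page allocator; dropping the last
	 * page reference just releases the pgmap->ref reference that
	 * memunmap_pages() is waiting on.
	 */
	if (put_page_testzero(page))
		put_dev_pagemap(page->pgmap);
}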

Q1: Is that safe? Is there some code that depends on put_page() dropping
the pgmap->ref reference count as part of memunmap_pages()?
My testing of [1] seems OK but I'm sure there are lots of cases I didn't test.

MEMORY_DEVICE_PCI_P2PDMA:
Struct pages are created in pci_p2pdma_add_resource() and represent device
memory accessible by PCIe bar address space. Memory is allocated with
pci_alloc_p2pmem() based on a byte length but the gen_pool_alloc_owner()
call will allocate memory in a minimum of PAGE_SIZE units.
Reference counting is +1 per *allocation* on the pgmap->ref reference count.
Note that this is not +1 per page which is what put_page() expects. So
currently, a get_page()/put_page() works OK because the page reference count
only goes 1->2 and 2->1. If it went to zero, the pgmap->ref reference count
would be incorrect if the allocation size was greater than one page.
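
To make the mismatch concrete (hand-waving sketch: the helper is made up,
only pci_alloc_p2pmem()/pci_free_p2pmem() and the page APIs are real, and
I'm assuming virt_to_page() is valid on the p2pmem kernel address):

static void p2pdma_refcount_mismatch(struct pci_dev *pdev)
{
        /* one allocation spanning four pages takes a single pgmap->ref */
        void *buf = pci_alloc_p2pmem(pdev, 4 * PAGE_SIZE);
        struct page *page;

        if (!buf)
                return;
        /* assuming the kva is covered by virt_to_page(); otherwise the
         * pfn_to_page() route mentioned below would be needed */
        page = virt_to_page(buf);

        /*
         * Harmless today because the page count only goes 1->2->1.  If
         * put_page() could drop it to zero, the per-page counts would no
         * longer line up with the single per-allocation pgmap reference.
         */
        get_page(page);
        put_page(page);

        pci_free_p2pmem(pdev, buf, 4 * PAGE_SIZE);
}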

I see pci_alloc_p2pmem() is called by nvme_alloc_sq_cmds() and
pci_p2pmem_alloc_sgl() to create a command queue and a struct scatterlist *.
Looks like sg_page(sg) returns the ZONE_DEVICE struct page of the scatterlist.
There are a huge number of places sg_page() is called so it is hard to tell
whether or not get_page()/put_page() is ever called on MEMORY_DEVICE_PCI_P2PDMA
pages. pci_p2pmem_virt_to_bus() will return the physical address and I guess
pfn_to_page(physaddr >> PAGE_SHIFT) could return the struct page.

Since there is a clear allocation/free, pci_alloc_p2pmem() can probably be
modified to increment/decrement the MEMORY_DEVICE_PCI_P2PDMA struct page
reference count. Or maybe just leave it at one like it is now.

MEMORY_DEVICE_GENERIC:
Struct pages are created in dev_dax_probe() and represent non-volatile memory.
The device can be mmap()'ed which calls dax_mmap() which sets
vma->vm_flags | VM_HUGEPAGE.
A CPU page fault will result in a PTE, PMD, or PUD sized page
(but not compound) to be inserted by vmf_insert_mixed() which will call either
insert_pfn() or insert_page().
Neither insert_pfn() nor insert_page() increments the page reference count.
Invalidations don't call back into the device driver, so I don't see how page
reference counts can be tracked without adding an mmu_interval_notifier.
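
For concreteness, the PTE-sized piece of that fault path is roughly this
(simplified sketch, not the actual dev_dax code; only vmf_insert_mixed()
and the vmf fields are real):

static vm_fault_t dax_pte_fault_sketch(struct vm_fault *vmf, pfn_t pfn)
{
        /*
         * The pfn has a ZONE_DEVICE struct page behind it, but neither
         * this call nor insert_pfn()/insert_page() underneath it takes
         * a page reference, and nothing on the unmap side drops one.
         */
        return vmf_insert_mixed(vmf->vma, vmf->address, pfn);
}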

I think just leaving the page reference count at one is better than trying
to use the mmu_interval_notifier or changing vmf_insert_mixed() and
invalidations of pfn_t_devmap(pfn) to adjust the page reference count.

MEMORY_DEVICE_PRIVATE:
This case has the most core mm code having to specially check for
is_device_private_page() and adjusting the expected reference count when the
page isn't mapped by any process. There is a clear allocation and free so it
can be changed to use a reference count of zero while free (see [2]).
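
For comparison, the allocation side looks roughly like this today (sketch
patterned on lib/test_hmm.c; the driver's free-list handling is elided):

static void device_private_alloc_sketch(struct page *dpage)
{
        /*
         * The free page still carries the reference it was created with
         * in memremap_pages(); the driver takes an extra one for the
         * time the page is in use.
         */
        get_page(dpage);
        lock_page(dpage);
}

With [2], a free page would instead sit at a reference count of zero and
this would become the 0->1 transition.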

MEMORY_DEVICE_FS_DAX:
Struct pages are created in pmem_attach_disk() and virtio_fs_setup_dax() with
an initial reference count of one.
The problem I see is that there are 3 states that are important:
a) memory is free and not allocated to any file (page_ref_count() == 0).
b) memory is allocated to a file and in the page cache (page_ref_count() == 1).
c) some gup() or I/O has a reference even after calling unmap_mapping_pages()
    (page_ref_count() > 1). ext4_break_layouts() basically waits until the
    page_ref_count() == 1 with put_page() calling wake_up_var(&page->_refcount)
    to wake up ext4_break_layouts().
The current code doesn't seem to distinguish (a) and (b). If we want to use
the 0->1 reference count to signal (c), then the page cache would have to hold
entries with a page_ref_count() == 0 which doesn't match the general page cache
assumptions.
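
For reference, state (c) is what a filesystem has to wait out; roughly
(heavily simplified from ext4_break_layouts(), so a sketch rather than the
real retry loop):

static int break_layouts_sketch(struct address_space *mapping)
{
        struct page *page = dax_layout_busy_page(mapping);

        if (!page)
                return 0;
        /*
         * put_page() does the wake_up_var(&page->_refcount) mentioned
         * above, so this returns once only the "in the page cache"
         * reference is left.
         */
        return wait_var_event_killable(&page->_refcount,
                                       page_ref_count(page) == 1);
}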

Q2: So how should I resolve that?

[1] https://lore.kernel.org/linux-mm/20201001181715.17416-1-rcampbell@nvidia.com
[2] https://lore.kernel.org/linux-mm/20201012174540.17328-1-rcampbell@nvidia.com


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 3/6] mm: support THP migration to device private memory
  2020-11-11 23:38         ` Ralph Campbell
@ 2020-11-20 20:01           ` Jason Gunthorpe
  2020-12-02 10:08             ` Christoph Hellwig
  2020-12-02 10:14           ` Christoph Hellwig
  1 sibling, 1 reply; 25+ messages in thread
From: Jason Gunthorpe @ 2020-11-20 20:01 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: Christoph Hellwig, linux-mm, nouveau, linux-kselftest,
	linux-kernel, Jerome Glisse, John Hubbard, Alistair Popple,
	Bharata B Rao, Zi Yan, Kirill A . Shutemov, Yang Shi, Ben Skeggs,
	Shuah Khan, Andrew Morton

On Wed, Nov 11, 2020 at 03:38:42PM -0800, Ralph Campbell wrote:

> MEMORY_DEVICE_GENERIC:
> Struct pages are created in dev_dax_probe() and represent non-volatile memory.
> The device can be mmap()'ed which calls dax_mmap() which sets
> vma->vm_flags | VM_HUGEPAGE.
> A CPU page fault will result in a PTE, PMD, or PUD sized page
> (but not compound) to be inserted by vmf_insert_mixed() which will call either
> insert_pfn() or insert_page().
> Neither insert_pfn() nor insert_page() increments the page reference
> count.

But why was this done? It seems very strange to put a pfn with a
struct page into a VMA and then deliberately not take the refcount for
the duration of that pfn being in the VMA?

What prevents memunmap_pages() from progressing while VMAs still point
at the memory?

> I think just leaving the page reference count at one is better than trying
> to use the mmu_interval_notifier or changing vmf_insert_mixed() and
> invalidations of pfn_t_devmap(pfn) to adjust the page reference count.

Why so? The entire point of getting struct page's for this stuff was
to be able to follow the struct page flow. I never did learn a reason
why there is devmap stuff all over the place in the page table code...

> MEMORY_DEVICE_FS_DAX:
> Struct pages are created in pmem_attach_disk() and virtio_fs_setup_dax() with
> an initial reference count of one.
> The problem I see is that there are 3 states that are important:
> a) memory is free and not allocated to any file (page_ref_count() == 0).
> b) memory is allocated to a file and in the page cache (page_ref_count() == 1).
> c) some gup() or I/O has a reference even after calling unmap_mapping_pages()
>    (page_ref_count() > 1). ext4_break_layouts() basically waits until the
>    page_ref_count() == 1 with put_page() calling wake_up_var(&page->_refcount)
>    to wake up ext4_break_layouts().
> The current code doesn't seem to distinguish (a) and (b). If we want to use
> the 0->1 reference count to signal (c), then the page cache would have to hold
> entries with a page_ref_count() == 0 which doesn't match the general page cache
> assumptions.

This explanation feels confusing. If *anything* has a reference on the
page it cannot be recycled. I would have guessed the logic is to remove
it from the page cache then wait for a 0 reference??

Jason


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 3/6] mm: support THP migration to device private memory
  2020-11-20 20:01           ` Jason Gunthorpe
@ 2020-12-02 10:08             ` Christoph Hellwig
  2020-12-05  8:22               ` Roger Pau Monné
  0 siblings, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2020-12-02 10:08 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Ralph Campbell, Christoph Hellwig, linux-mm, nouveau,
	linux-kselftest, linux-kernel, Jerome Glisse, John Hubbard,
	Alistair Popple, Bharata B Rao, Zi Yan, Kirill A . Shutemov,
	Yang Shi, Ben Skeggs, Shuah Khan, Andrew Morton, Roger Pau Monne

On Fri, Nov 20, 2020 at 04:01:33PM -0400, Jason Gunthorpe wrote:
> On Wed, Nov 11, 2020 at 03:38:42PM -0800, Ralph Campbell wrote:
> 
> > MEMORY_DEVICE_GENERIC:
> > Struct pages are created in dev_dax_probe() and represent non-volatile memory.
> > The device can be mmap()'ed which calls dax_mmap() which sets
> > vma->vm_flags | VM_HUGEPAGE.
> > A CPU page fault will result in a PTE, PMD, or PUD sized page
> > (but not compound) to be inserted by vmf_insert_mixed() which will call either
> > insert_pfn() or insert_page().
> > Neither insert_pfn() nor insert_page() increments the page reference
> > count.
> 
> But why was this done? It seems very strange to put a pfn with a
> struct page into a VMA and then deliberately not take the refcount for
> the duration of that pfn being in the VMA?
> 
> What prevents memunmap_pages() from progressing while VMAs still point
> at the memory?

Agreed.  Adding Roger who added MEMORY_DEVICE_GENERIC and the only
user.

> > I think just leaving the page reference count at one is better than trying
> > to use the mmu_interval_notifier or changing vmf_insert_mixed() and
> > invalidations of pfn_t_devmap(pfn) to adjust the page reference count.
> 
> Why so? The entire point of getting struct page's for this stuff was
> to be able to follow the struct page flow. I never did learn a reason
> why there is devmap stuff all over the place in the page table code...

Exactly.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 3/6] mm: support THP migration to device private memory
  2020-11-11 23:38         ` Ralph Campbell
  2020-11-20 20:01           ` Jason Gunthorpe
@ 2020-12-02 10:14           ` Christoph Hellwig
  2020-12-02 18:01             ` Logan Gunthorpe
  1 sibling, 1 reply; 25+ messages in thread
From: Christoph Hellwig @ 2020-12-02 10:14 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: Christoph Hellwig, linux-mm, nouveau, linux-kselftest,
	linux-kernel, Jerome Glisse, John Hubbard, Alistair Popple,
	Jason Gunthorpe, Bharata B Rao, Zi Yan, Kirill A . Shutemov,
	Yang Shi, Ben Skeggs, Shuah Khan, Andrew Morton, Logan Gunthorpe,
	Dan Williams, linux-nvdimm, linux-fsdevel

[adding a few of the usual suspects]

On Wed, Nov 11, 2020 at 03:38:42PM -0800, Ralph Campbell wrote:
> There are 4 types of ZONE_DEVICE struct pages:
> MEMORY_DEVICE_PRIVATE, MEMORY_DEVICE_FS_DAX, MEMORY_DEVICE_GENERIC, and
> MEMORY_DEVICE_PCI_P2PDMA.
>
> Currently, memremap_pages() allocates struct pages for a physical address range
> with a page_ref_count(page) of one and increments the pgmap->ref per CPU
> reference count by the number of pages created since each ZONE_DEVICE struct
> page has a pointer to the pgmap.
>
> The struct pages are not freed until memunmap_pages() is called which
> calls put_page() which calls put_dev_pagemap() which releases a reference to
> pgmap->ref. memunmap_pages() blocks waiting for pgmap->ref reference count
> to be zero. As far as I can tell, the put_page() in memunmap_pages() has to
> be the *last* put_page() (see MEMORY_DEVICE_PCI_P2PDMA).
> My RFC [1] breaks this put_page() -> put_dev_pagemap() connection so that
> the struct page reference count can go to zero and back to non-zero without
> changing the pgmap->ref reference count.
>
> Q1: Is that safe? Is there some code that depends on put_page() dropping
> the pgmap->ref reference count as part of memunmap_pages()?
> My testing of [1] seems OK but I'm sure there are lots of cases I didn't test.

It should be safe, but the audit you've done is important to make sure
we do not miss anything important.

> MEMORY_DEVICE_PCI_P2PDMA:
> Struct pages are created in pci_p2pdma_add_resource() and represent device
> memory accessible by PCIe bar address space. Memory is allocated with
> pci_alloc_p2pmem() based on a byte length but the gen_pool_alloc_owner()
> call will allocate memory in a minimum of PAGE_SIZE units.
> Reference counting is +1 per *allocation* on the pgmap->ref reference count.
> Note that this is not +1 per page which is what put_page() expects. So
> currently, a get_page()/put_page() works OK because the page reference count
> only goes 1->2 and 2->1. If it went to zero, the pgmap->ref reference count
> would be incorrect if the allocation size was greater than one page.
>
> I see pci_alloc_p2pmem() is called by nvme_alloc_sq_cmds() and
> pci_p2pmem_alloc_sgl() to create a command queue and a struct scatterlist *.
> Looks like sg_page(sg) returns the ZONE_DEVICE struct page of the scatterlist.
> There are a huge number of places sg_page() is called so it is hard to tell
> whether or not get_page()/put_page() is ever called on MEMORY_DEVICE_PCI_P2PDMA
> pages.

Nothing should call get_page/put_page on them, as they are not treated
as refcountable memory.  More importantly nothing is allowed to keep
a reference longer than the time of the I/O.

> pci_p2pmem_virt_to_bus() will return the physical address and I guess
> pfn_to_page(physaddr >> PAGE_SHIFT) could return the struct page.
>
> Since there is a clear allocation/free, pci_alloc_p2pmem() can probably be
> modified to increment/decrement the MEMORY_DEVICE_PCI_P2PDMA struct page
> reference count. Or maybe just leave it at one like it is now.

And yes, doing that is probably a sensible safeguard.

> MEMORY_DEVICE_FS_DAX:
> Struct pages are created in pmem_attach_disk() and virtio_fs_setup_dax() with
> an initial reference count of one.
> The problem I see is that there are 3 states that are important:
> a) memory is free and not allocated to any file (page_ref_count() == 0).
> b) memory is allocated to a file and in the page cache (page_ref_count() == 1).
> c) some gup() or I/O has a reference even after calling unmap_mapping_pages()
>    (page_ref_count() > 1). ext4_break_layouts() basically waits until the
>    page_ref_count() == 1 with put_page() calling wake_up_var(&page->_refcount)
>    to wake up ext4_break_layouts().
> The current code doesn't seem to distinguish (a) and (b). If we want to use
> the 0->1 reference count to signal (c), then the page cache would have to hold
> entries with a page_ref_count() == 0 which doesn't match the general page cache

I think the sensible model here is to grab a reference when it is
added to the page cache.  That is exactly how normal system memory pages
work.
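
For fsdax that would look something like this (hand-waving sketch; the
real helpers are dax_associate_entry()/dax_disassociate_entry() in
fs/dax.c, the signatures here are simplified):

static void dax_associate_page_sketch(struct page *page,
                                      struct address_space *mapping,
                                      pgoff_t index)
{
        page->mapping = mapping;
        page->index = index;
        /*
         * 0 -> 1 for "allocated to a file".  get_page() refuses a zero
         * refcount, so it would be an explicit set, just like pages
         * coming out of the normal allocator start at one.
         */
        set_page_count(page, 1);
}

static void dax_disassociate_page_sketch(struct page *page)
{
        page->mapping = NULL;
        put_page(page);         /* back to zero: free */
}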


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 3/6] mm: support THP migration to device private memory
  2020-12-02 10:14           ` Christoph Hellwig
@ 2020-12-02 18:01             ` Logan Gunthorpe
  0 siblings, 0 replies; 25+ messages in thread
From: Logan Gunthorpe @ 2020-12-02 18:01 UTC (permalink / raw)
  To: Christoph Hellwig, Ralph Campbell
  Cc: linux-mm, nouveau, linux-kselftest, linux-kernel, Jerome Glisse,
	John Hubbard, Alistair Popple, Jason Gunthorpe, Bharata B Rao,
	Zi Yan, Kirill A . Shutemov, Yang Shi, Ben Skeggs, Shuah Khan,
	Andrew Morton, Dan Williams, linux-nvdimm, linux-fsdevel



On 2020-12-02 3:14 a.m., Christoph Hellwig wrote:
>> MEMORY_DEVICE_PCI_P2PDMA:
>> Struct pages are created in pci_p2pdma_add_resource() and represent device
>> memory accessible by PCIe bar address space. Memory is allocated with
>> pci_alloc_p2pmem() based on a byte length but the gen_pool_alloc_owner()
>> call will allocate memory in a minimum of PAGE_SIZE units.
>> Reference counting is +1 per *allocation* on the pgmap->ref reference count.
>> Note that this is not +1 per page which is what put_page() expects. So
>> currently, a get_page()/put_page() works OK because the page reference count
>> only goes 1->2 and 2->1. If it went to zero, the pgmap->ref reference count
>> would be incorrect if the allocation size was greater than one page.
>>
>> I see pci_alloc_p2pmem() is called by nvme_alloc_sq_cmds() and
>> pci_p2pmem_alloc_sgl() to create a command queue and a struct scatterlist *.
>> Looks like sg_page(sg) returns the ZONE_DEVICE struct page of the scatterlist.
>> There are a huge number of places sg_page() is called so it is hard to tell
>> whether or not get_page()/put_page() is ever called on MEMORY_DEVICE_PCI_P2PDMA
>> pages.
> 
> Nothing should call get_page/put_page on them, as they are not treated
> as refcountable memory.  More importantly nothing is allowed to keep
> a reference longer than the time of the I/O.

Yes, right now this is safe, as Christoph notes there are no places
where these should be got/put.

But eventually we'll need to change how pci_alloc_p2pmem() works to take
references on the actual pages and allow freeing individual pages,
similar to what you suggest. This is one of the issues Jason pointed out
with my last RFC that tried to pass these pages through GUP.
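
Roughly the direction I mean (not a real patch; assuming the p2pmem
kernel address is valid for virt_to_page(), otherwise a pfn lookup would
be needed):

void *pci_alloc_p2pmem_sketch(struct pci_dev *pdev, size_t size)
{
        void *kaddr = pci_alloc_p2pmem(pdev, size);
        unsigned long i;

        if (!kaddr)
                return NULL;
        /* take a reference on each backing page; pci_free_p2pmem()
         * would drop them again before returning the range to the pool */
        for (i = 0; i < PAGE_ALIGN(size) >> PAGE_SHIFT; i++)
                get_page(virt_to_page(kaddr + i * PAGE_SIZE));
        return kaddr;
}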

Logan


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v3 3/6] mm: support THP migration to device private memory
  2020-12-02 10:08             ` Christoph Hellwig
@ 2020-12-05  8:22               ` Roger Pau Monné
  0 siblings, 0 replies; 25+ messages in thread
From: Roger Pau Monné @ 2020-12-05  8:22 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jason Gunthorpe, Ralph Campbell, linux-mm, nouveau,
	linux-kselftest, linux-kernel, Jerome Glisse, John Hubbard,
	Alistair Popple, Bharata B Rao, Zi Yan, Kirill A . Shutemov,
	Yang Shi, Ben Skeggs, Shuah Khan, Andrew Morton

On Wed, Dec 02, 2020 at 11:08:54AM +0100, Christoph Hellwig wrote:
> On Fri, Nov 20, 2020 at 04:01:33PM -0400, Jason Gunthorpe wrote:
> > On Wed, Nov 11, 2020 at 03:38:42PM -0800, Ralph Campbell wrote:
> > 
> > > MEMORY_DEVICE_GENERIC:
> > > Struct pages are created in dev_dax_probe() and represent non-volatile memory.
> > > The device can be mmap()'ed which calls dax_mmap() which sets
> > > vma->vm_flags | VM_HUGEPAGE.
> > > A CPU page fault will result in a PTE, PMD, or PUD sized page
> > > (but not compound) to be inserted by vmf_insert_mixed() which will call either
> > > insert_pfn() or insert_page().
> > > Neither insert_pfn() nor insert_page() increments the page reference
> > > count.
> > 
> > But why was this done? It seems very strange to put a pfn with a
> > struct page into a VMA and then deliberately not take the refcount for
> > the duration of that pfn being in the VMA?
> > 
> > What prevents memunmap_pages() from progressing while VMAs still point
> > at the memory?
> 
> Agreed.  Adding Roger who added MEMORY_DEVICE_GENERIC and the only
> user.

MEMORY_DEVICE_GENERIC is just a rename of the previous
MEMORY_DEVICE_DEVDAX, and seems to be used by the DAX device apart
from Xen?

Its main purpose is to be able to allocate unused physical memory
ranges and have a backing struct page for them, so they can be used to
map foreign memory when running on Xen.

I'm currently on leave and won't be back until the end of the month,
could you please Cc the Xen maintainers if you modify the logic here
in order to make sure it will work for Xen?

Thanks, Roger.


^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2020-12-05  8:23 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-06  0:51 [PATCH v3 0/6] mm/hmm/nouveau: add THP migration to migrate_vma_* Ralph Campbell
2020-11-06  0:51 ` [PATCH v3 1/6] mm/thp: add prep_transhuge_device_private_page() Ralph Campbell
2020-11-06  7:55   ` Christoph Hellwig
2020-11-06 20:56     ` Ralph Campbell
2020-11-06 12:14   ` Matthew Wilcox
2020-11-06 20:34     ` Ralph Campbell
2020-11-06  0:51 ` [PATCH v3 2/6] mm/migrate: move migrate_vma_collect_skip() Ralph Campbell
2020-11-06  7:56   ` Christoph Hellwig
2020-11-06  7:57   ` Christoph Hellwig
2020-11-06  0:51 ` [PATCH v3 3/6] mm: support THP migration to device private memory Ralph Campbell
2020-11-06  8:03   ` Christoph Hellwig
2020-11-06 21:26     ` Ralph Campbell
2020-11-09  9:14       ` Christoph Hellwig
2020-11-09 21:34         ` Ralph Campbell
2020-11-11 23:38         ` Ralph Campbell
2020-11-20 20:01           ` Jason Gunthorpe
2020-12-02 10:08             ` Christoph Hellwig
2020-12-05  8:22               ` Roger Pau Monné
2020-12-02 10:14           ` Christoph Hellwig
2020-12-02 18:01             ` Logan Gunthorpe
2020-11-06  0:51 ` [PATCH v3 4/6] mm/thp: add THP allocation helper Ralph Campbell
2020-11-06  8:01   ` Christoph Hellwig
2020-11-06 21:09     ` Ralph Campbell
2020-11-06  0:51 ` [PATCH v3 5/6] mm/hmm/test: add self tests for THP migration Ralph Campbell
2020-11-06  0:51 ` [PATCH v3 6/6] nouveau: support THP migration to private memory Ralph Campbell

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).