* [PATCH v3 0/3] HMM tests and minor fixes
@ 2019-10-23 19:55 Ralph Campbell
  2019-10-23 19:55 ` [PATCH v3 1/3] mm/hmm: make full use of walk_page_range() Ralph Campbell
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Ralph Campbell @ 2019-10-23 19:55 UTC (permalink / raw)
  To: Jerome Glisse, John Hubbard, Christoph Hellwig, Jason Gunthorpe
  Cc: linux-rdma, linux-mm, linux-kernel, Ralph Campbell

These changes are based on Jason's rdma/hmm branch (5.4.0-rc1).
Patch 1 was previously posted here [1] but was dropped from that original
series. Hopefully, the tests will reduce concerns about edge conditions.
I'm sure more tests could be usefully added but I thought this was a good
starting point.

Changes since v2:
patch 1:
Removed hmm_range_needs_fault() in favor of hmm_range_need_fault().
Updated the change log to note that it fixes a bug where hmm_range_fault()
incorrectly returned an error when no fault was requested.
patch 2:
Removed the confusing change log wording about DMA.
Changed hmm_range_fault() to return the PFN of the zero page like any other
page.
patch 3:
Adjusted the test code to match the new zero page behavior.

Changes since v1:
Rebased to Jason's rdma/hmm branch (5.4.0-rc1).
Cleaned up locking for the test driver's page tables.
Incorporated Christoph Hellwig's comments.

[1] https://lore.kernel.org/linux-mm/20190726005650.2566-6-rcampbell@nvidia.com/

Ralph Campbell (3):
  mm/hmm: make full use of walk_page_range()
  mm/hmm: allow snapshot of the special zero page
  mm/hmm/test: add self tests for HMM

 MAINTAINERS                            |    3 +
 drivers/char/Kconfig                   |   11 +
 drivers/char/Makefile                  |    1 +
 drivers/char/hmm_dmirror.c             | 1566 ++++++++++++++++++++++++
 include/Kbuild                         |    1 +
 include/uapi/linux/hmm_dmirror.h       |   74 ++
 mm/hmm.c                               |  136 +-
 tools/testing/selftests/vm/.gitignore  |    1 +
 tools/testing/selftests/vm/Makefile    |    3 +
 tools/testing/selftests/vm/config      |    3 +
 tools/testing/selftests/vm/hmm-tests.c | 1311 ++++++++++++++++++++
 tools/testing/selftests/vm/run_vmtests |   16 +
 tools/testing/selftests/vm/test_hmm.sh |   97 ++
 13 files changed, 3158 insertions(+), 65 deletions(-)
 create mode 100644 drivers/char/hmm_dmirror.c
 create mode 100644 include/uapi/linux/hmm_dmirror.h
 create mode 100644 tools/testing/selftests/vm/hmm-tests.c
 create mode 100755 tools/testing/selftests/vm/test_hmm.sh

-- 
2.20.1



* [PATCH v3 1/3] mm/hmm: make full use of walk_page_range()
  2019-10-23 19:55 [PATCH v3 0/3] HMM tests and minor fixes Ralph Campbell
@ 2019-10-23 19:55 ` Ralph Campbell
  2019-10-29 17:40   ` Jason Gunthorpe
  2019-10-23 19:55 ` [PATCH v3 2/3] mm/hmm: allow snapshot of the special zero page Ralph Campbell
  2019-10-23 19:55 ` [PATCH v3 3/3] mm/hmm/test: add self tests for HMM Ralph Campbell
  2 siblings, 1 reply; 19+ messages in thread
From: Ralph Campbell @ 2019-10-23 19:55 UTC (permalink / raw)
  To: Jerome Glisse, John Hubbard, Christoph Hellwig, Jason Gunthorpe
  Cc: linux-rdma, linux-mm, linux-kernel, Ralph Campbell

hmm_range_fault() calls find_vma() and walk_page_range() in a loop.
This is unnecessary duplication since walk_page_range() calls find_vma()
in a loop already.
Simplify hmm_range_fault() by defining a test_walk callback,
hmm_vma_walk_test(), to filter unhandled vmas.
This also fixes a bug where hmm_range_fault() was not checking
start >= vma->vm_start before checking vma->vm_flags, so hmm_range_fault()
could return an error based on the wrong vma for the requested range.
It also fixes a bug where hmm_range_fault() incorrectly returned an error
when the vma has no read access and the caller did not request a fault.
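
As an illustrative aside (not part of this patch), the sketch below shows
how a test_walk callback plugs into walk_page_range(): returning a negative
errno aborts the walk, returning 1 skips the current vma and continues with
the next one, and returning 0 walks the vma with the other callbacks. The
function and ops names here are made up for the example.

static int example_test_walk(unsigned long start, unsigned long end,
			     struct mm_walk *walk)
{
	/* Abort the walk for vmas without struct page backing. */
	if (walk->vma->vm_flags & (VM_IO | VM_PFNMAP | VM_MIXEDMAP))
		return -EFAULT;
	/* Skip (but do not fail) vmas that cannot be read. */
	if (!(walk->vma->vm_flags & VM_READ))
		return 1;
	return 0;	/* Walk this vma normally. */
}

static const struct mm_walk_ops example_walk_ops = {
	/* pte/pmd entry callbacks omitted for brevity */
	.test_walk	= example_test_walk,
};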

Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Christoph Hellwig <hch@lst.de>
---
 mm/hmm.c | 126 +++++++++++++++++++++++++++----------------------------
 1 file changed, 63 insertions(+), 63 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index 902f5fa6bf93..acf7a664b38c 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -252,18 +252,15 @@ static int hmm_vma_do_fault(struct mm_walk *walk, unsigned long addr,
 	return -EFAULT;
 }
 
-static int hmm_pfns_bad(unsigned long addr,
-			unsigned long end,
-			struct mm_walk *walk)
+static int hmm_pfns_fill(unsigned long addr, unsigned long end,
+		struct hmm_range *range, enum hmm_pfn_value_e value)
 {
-	struct hmm_vma_walk *hmm_vma_walk = walk->private;
-	struct hmm_range *range = hmm_vma_walk->range;
 	uint64_t *pfns = range->pfns;
 	unsigned long i;
 
 	i = (addr - range->start) >> PAGE_SHIFT;
 	for (; addr < end; addr += PAGE_SIZE, i++)
-		pfns[i] = range->values[HMM_PFN_ERROR];
+		pfns[i] = range->values[value];
 
 	return 0;
 }
@@ -584,7 +581,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 		}
 		return 0;
 	} else if (!pmd_present(pmd))
-		return hmm_pfns_bad(start, end, walk);
+		return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
 
 	if (pmd_devmap(pmd) || pmd_trans_huge(pmd)) {
 		/*
@@ -612,7 +609,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
 	 * recover.
 	 */
 	if (pmd_bad(pmd))
-		return hmm_pfns_bad(start, end, walk);
+		return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
 
 	ptep = pte_offset_map(pmdp, addr);
 	i = (addr - range->start) >> PAGE_SHIFT;
@@ -770,13 +767,51 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
 #define hmm_vma_walk_hugetlb_entry NULL
 #endif /* CONFIG_HUGETLB_PAGE */
 
-static void hmm_pfns_clear(struct hmm_range *range,
-			   uint64_t *pfns,
-			   unsigned long addr,
-			   unsigned long end)
+static int hmm_vma_walk_test(unsigned long start, unsigned long end,
+			     struct mm_walk *walk)
 {
-	for (; addr < end; addr += PAGE_SIZE, pfns++)
-		*pfns = range->values[HMM_PFN_NONE];
+	struct hmm_vma_walk *hmm_vma_walk = walk->private;
+	struct hmm_range *range = hmm_vma_walk->range;
+	struct vm_area_struct *vma = walk->vma;
+
+	/* If range is no longer valid, force retry. */
+	if (!range->valid)
+		return -EBUSY;
+
+	/*
+	 * Skip vma ranges that don't have struct page backing them or
+	 * map I/O devices directly.
+	 */
+	if (vma->vm_flags & (VM_IO | VM_PFNMAP | VM_MIXEDMAP))
+		return -EFAULT;
+
+	/*
+	 * If the vma does not allow read access, then assume that it does not
+	 * allow write access either. HMM does not support architectures
+	 * that allow write without read.
+	 */
+	if (!(vma->vm_flags & VM_READ)) {
+		bool fault, write_fault;
+
+		/*
+		 * Check to see if a fault is requested for any page in the
+		 * range.
+		 */
+		hmm_range_need_fault(hmm_vma_walk, range->pfns +
+					((start - range->start) >> PAGE_SHIFT),
+					(end - start) >> PAGE_SHIFT,
+					0, &fault, &write_fault);
+		if (fault || write_fault)
+			return -EFAULT;
+
+		hmm_pfns_fill(start, end, range, HMM_PFN_NONE);
+		hmm_vma_walk->last = end;
+
+		/* Skip this vma and continue processing the next vma. */
+		return 1;
+	}
+
+	return 0;
 }
 
 /*
@@ -857,6 +892,7 @@ static const struct mm_walk_ops hmm_walk_ops = {
 	.pmd_entry	= hmm_vma_walk_pmd,
 	.pte_hole	= hmm_vma_walk_hole,
 	.hugetlb_entry	= hmm_vma_walk_hugetlb_entry,
+	.test_walk	= hmm_vma_walk_test,
 };
 
 /**
@@ -889,63 +925,27 @@ static const struct mm_walk_ops hmm_walk_ops = {
  */
 long hmm_range_fault(struct hmm_range *range, unsigned int flags)
 {
-	const unsigned long device_vma = VM_IO | VM_PFNMAP | VM_MIXEDMAP;
-	unsigned long start = range->start, end;
-	struct hmm_vma_walk hmm_vma_walk;
+	unsigned long start = range->start;
+	struct hmm_vma_walk hmm_vma_walk = {
+		.range = range,
+		.last = start,
+		.flags = flags,
+	};
 	struct hmm *hmm = range->hmm;
-	struct vm_area_struct *vma;
 	int ret;
 
 	lockdep_assert_held(&hmm->mmu_notifier.mm->mmap_sem);
 
 	do {
-		/* If range is no longer valid force retry. */
-		if (!range->valid)
-			return -EBUSY;
-
-		vma = find_vma(hmm->mmu_notifier.mm, start);
-		if (vma == NULL || (vma->vm_flags & device_vma))
-			return -EFAULT;
-
-		if (!(vma->vm_flags & VM_READ)) {
-			/*
-			 * If vma do not allow read access, then assume that it
-			 * does not allow write access, either. HMM does not
-			 * support architecture that allow write without read.
-			 */
-			hmm_pfns_clear(range, range->pfns,
-				range->start, range->end);
-			return -EPERM;
-		}
-
-		hmm_vma_walk.pgmap = NULL;
-		hmm_vma_walk.last = start;
-		hmm_vma_walk.flags = flags;
-		hmm_vma_walk.range = range;
-		end = min(range->end, vma->vm_end);
-
-		walk_page_range(vma->vm_mm, start, end, &hmm_walk_ops,
-				&hmm_vma_walk);
-
-		do {
-			ret = walk_page_range(vma->vm_mm, start, end,
-					&hmm_walk_ops, &hmm_vma_walk);
-			start = hmm_vma_walk.last;
+		ret = walk_page_range(hmm->mmu_notifier.mm, start, range->end,
+				      &hmm_walk_ops, &hmm_vma_walk);
+		start = hmm_vma_walk.last;
 
-			/* Keep trying while the range is valid. */
-		} while (ret == -EBUSY && range->valid);
+		/* Keep trying while the range is valid. */
+	} while (ret == -EBUSY && range->valid);
 
-		if (ret) {
-			unsigned long i;
-
-			i = (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
-			hmm_pfns_clear(range, &range->pfns[i],
-				hmm_vma_walk.last, range->end);
-			return ret;
-		}
-		start = end;
-
-	} while (start < range->end);
+	if (ret)
+		return ret;
 
 	return (hmm_vma_walk.last - range->start) >> PAGE_SHIFT;
 }
-- 
2.20.1



* [PATCH v3 2/3] mm/hmm: allow snapshot of the special zero page
  2019-10-23 19:55 [PATCH v3 0/3] HMM tests and minor fixes Ralph Campbell
  2019-10-23 19:55 ` [PATCH v3 1/3] mm/hmm: make full use of walk_page_range() Ralph Campbell
@ 2019-10-23 19:55 ` Ralph Campbell
  2019-10-23 20:27   ` Jerome Glisse
                     ` (2 more replies)
  2019-10-23 19:55 ` [PATCH v3 3/3] mm/hmm/test: add self tests for HMM Ralph Campbell
  2 siblings, 3 replies; 19+ messages in thread
From: Ralph Campbell @ 2019-10-23 19:55 UTC (permalink / raw)
  To: Jerome Glisse, John Hubbard, Christoph Hellwig, Jason Gunthorpe
  Cc: linux-rdma, linux-mm, linux-kernel, Ralph Campbell

If a device driver like nouveau tries to use hmm_range_fault() to access
the special shared zero page in system memory, hmm_range_fault() will
return -EFAULT and kill the process.
Allow hmm_range_fault() to return success (0) when the CPU page table
entry points to the special shared zero page.
page_to_pfn() and pfn_to_page() are defined for the zero page, so just
handle it like any other page.
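
For illustration only (not part of this patch), a driver consuming the
hmm_range_fault() results no longer needs a special case for the zero page;
a rough sketch, where example_device_map() is a hypothetical helper standing
in for whatever the driver does with the page:

static int example_map_entry(struct dmirror *dmirror, struct hmm_range *range,
			     unsigned long addr, uint64_t entry)
{
	struct page *page;

	if (entry == range->values[HMM_PFN_ERROR])
		return -EFAULT;
	if (!(entry & range->flags[HMM_PFN_VALID]))
		return 0;
	/* With this change the zero page shows up here like any other page. */
	page = hmm_device_entry_to_page(range, entry);
	/* example_device_map() is a hypothetical helper for this sketch. */
	return example_device_map(dmirror, addr, page,
				  entry & range->flags[HMM_PFN_WRITE]);
}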

Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
---
 mm/hmm.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index acf7a664b38c..8c96c9ddcae5 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -529,8 +529,14 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 		if (unlikely(!hmm_vma_walk->pgmap))
 			return -EBUSY;
 	} else if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pte_special(pte)) {
-		*pfn = range->values[HMM_PFN_SPECIAL];
-		return -EFAULT;
+		if (!is_zero_pfn(pte_pfn(pte))) {
+			*pfn = range->values[HMM_PFN_SPECIAL];
+			return -EFAULT;
+		}
+		/*
+		 * Since each architecture defines a struct page for the zero
+		 * page, just fall through and treat it like a normal page.
+		 */
 	}
 
 	*pfn = hmm_device_entry_from_pfn(range, pte_pfn(pte)) | cpu_flags;
-- 
2.20.1



* [PATCH v3 3/3] mm/hmm/test: add self tests for HMM
  2019-10-23 19:55 [PATCH v3 0/3] HMM tests and minor fixes Ralph Campbell
  2019-10-23 19:55 ` [PATCH v3 1/3] mm/hmm: make full use of walk_page_range() Ralph Campbell
  2019-10-23 19:55 ` [PATCH v3 2/3] mm/hmm: allow snapshot of the special zero page Ralph Campbell
@ 2019-10-23 19:55 ` Ralph Campbell
  2019-10-23 20:28   ` Jerome Glisse
  2019-10-29 17:58   ` Jason Gunthorpe
  2 siblings, 2 replies; 19+ messages in thread
From: Ralph Campbell @ 2019-10-23 19:55 UTC (permalink / raw)
  To: Jerome Glisse, John Hubbard, Christoph Hellwig, Jason Gunthorpe
  Cc: linux-rdma, linux-mm, linux-kernel, Ralph Campbell

Add self tests for HMM.
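
The tests exercise the new /dev/hmm_dmirror<N> character devices through the
HMM_DMIRROR_* ioctls added in include/uapi/linux/hmm_dmirror.h. As a rough
illustration of what the selftests do (error handling trimmed, 4KB pages
assumed), a minimal userspace program might look like:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/hmm_dmirror.h>

int main(void)
{
	unsigned long npages = 4;
	size_t size = npages * 4096;
	struct hmm_dmirror_cmd cmd = { 0 };
	int fd = open("/dev/hmm_dmirror0", O_RDWR);
	char *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	char *mirror = malloc(size);

	if (fd < 0 || buf == MAP_FAILED || !mirror)
		return 1;

	memset(buf, 1, size);
	cmd.addr = (uintptr_t)buf;	/* pages the "device" will read */
	cmd.ptr = (uintptr_t)mirror;	/* where the driver copies the data */
	cmd.npages = npages;

	/* Fault the pages into the mirror's page table and copy them out. */
	if (ioctl(fd, HMM_DMIRROR_READ, &cmd))
		return 1;

	printf("copied %llu pages with %llu device faults\n",
	       (unsigned long long)cmd.cpages,
	       (unsigned long long)cmd.faults);
	return 0;
}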

Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
---
 MAINTAINERS                            |    3 +
 drivers/char/Kconfig                   |   11 +
 drivers/char/Makefile                  |    1 +
 drivers/char/hmm_dmirror.c             | 1566 ++++++++++++++++++++++++
 include/Kbuild                         |    1 +
 include/uapi/linux/hmm_dmirror.h       |   74 ++
 tools/testing/selftests/vm/.gitignore  |    1 +
 tools/testing/selftests/vm/Makefile    |    3 +
 tools/testing/selftests/vm/config      |    3 +
 tools/testing/selftests/vm/hmm-tests.c | 1311 ++++++++++++++++++++
 tools/testing/selftests/vm/run_vmtests |   16 +
 tools/testing/selftests/vm/test_hmm.sh |   97 ++
 12 files changed, 3087 insertions(+)
 create mode 100644 drivers/char/hmm_dmirror.c
 create mode 100644 include/uapi/linux/hmm_dmirror.h
 create mode 100644 tools/testing/selftests/vm/hmm-tests.c
 create mode 100755 tools/testing/selftests/vm/test_hmm.sh

diff --git a/MAINTAINERS b/MAINTAINERS
index 296de2b51c83..9890b6b8eea0 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7427,8 +7427,11 @@ M:	Jérôme Glisse <jglisse@redhat.com>
 L:	linux-mm@kvack.org
 S:	Maintained
 F:	mm/hmm*
+F:	drivers/char/hmm*
 F:	include/linux/hmm*
+F:	include/uapi/linux/hmm*
 F:	Documentation/vm/hmm.rst
+F:	tools/testing/selftests/vm/*hmm*
 
 HOST AP DRIVER
 M:	Jouni Malinen <j@w1.fi>
diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig
index df0fc997dc3e..cc8ddb99550d 100644
--- a/drivers/char/Kconfig
+++ b/drivers/char/Kconfig
@@ -535,6 +535,17 @@ config ADI
 	  and SSM (Silicon Secured Memory).  Intended consumers of this
 	  driver include crash and makedumpfile.
 
+config HMM_DMIRROR
+	tristate "HMM driver for testing Heterogeneous Memory Management"
+	depends on HMM_MIRROR
+	depends on DEVICE_PRIVATE
+	help
+	  This is a pseudo device driver solely for testing HMM.
+	  Say Y here if you want to build the HMM test driver.
+	  Doing so will allow you to run tools/testing/selftest/vm/hmm-tests.
+
+	  If in doubt, say "N".
+
 endmenu
 
 config RANDOM_TRUST_CPU
diff --git a/drivers/char/Makefile b/drivers/char/Makefile
index 7c5ea6f9df14..d4a168c0c138 100644
--- a/drivers/char/Makefile
+++ b/drivers/char/Makefile
@@ -52,3 +52,4 @@ js-rtc-y = rtc.o
 obj-$(CONFIG_XILLYBUS)		+= xillybus/
 obj-$(CONFIG_POWERNV_OP_PANEL)	+= powernv-op-panel.o
 obj-$(CONFIG_ADI)		+= adi.o
+obj-$(CONFIG_HMM_DMIRROR)	+= hmm_dmirror.o
diff --git a/drivers/char/hmm_dmirror.c b/drivers/char/hmm_dmirror.c
new file mode 100644
index 000000000000..5a1ed34e72e1
--- /dev/null
+++ b/drivers/char/hmm_dmirror.c
@@ -0,0 +1,1566 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of
+ * the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/*
+ * This is a driver to exercise the HMM (heterogeneous memory management)
+ * mirror and zone device private memory migration APIs of the kernel.
+ * Userspace programs can register with the driver to mirror their own address
+ * space and can use the device to read/write any valid virtual address.
+ *
+ * In some ways it can also serve as an example driver for people wanting to use
+ * HMM inside their own device driver.
+ */
+#include <linux/init.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/cdev.h>
+#include <linux/device.h>
+#include <linux/mutex.h>
+#include <linux/rwsem.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/highmem.h>
+#include <linux/delay.h>
+#include <linux/pagemap.h>
+#include <linux/hmm.h>
+#include <linux/vmalloc.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/sched/mm.h>
+#include <linux/platform_device.h>
+
+#include <uapi/linux/hmm_dmirror.h>
+
+#define DMIRROR_NDEVICES		2
+#define DMIRROR_RANGE_FAULT_TIMEOUT	1000
+#define DEVMEM_CHUNK_SIZE		(256 * 1024 * 1024U)
+#define DEVMEM_CHUNKS_RESERVE		16
+
+static const struct dev_pagemap_ops dmirror_devmem_ops;
+static dev_t dmirror_dev;
+static struct platform_device *dmirror_platform_devices[DMIRROR_NDEVICES];
+static struct page *dmirror_zero_page;
+
+struct dmirror_device;
+
+struct dmirror_bounce {
+	void			*ptr;
+	unsigned long		size;
+	unsigned long		addr;
+	unsigned long		cpages;
+};
+
+#define DPT_SHIFT PAGE_SHIFT
+#define DPT_VALID (1UL << 0)
+#define DPT_WRITE (1UL << 1)
+#define DPT_DPAGE (1UL << 2)
+#define DPT_ZPAGE 0x20UL
+
+const uint64_t dmirror_hmm_flags[HMM_PFN_FLAG_MAX] = {
+	[HMM_PFN_VALID] = DPT_VALID,
+	[HMM_PFN_WRITE] = DPT_WRITE,
+	[HMM_PFN_DEVICE_PRIVATE] = DPT_DPAGE,
+};
+
+static const uint64_t dmirror_hmm_values[HMM_PFN_VALUE_MAX] = {
+	[HMM_PFN_NONE]    = 0,
+	[HMM_PFN_ERROR]   = 0x10,
+	[HMM_PFN_SPECIAL] = 0x10,
+};
+
+struct dmirror_pt {
+	u64			pgd[PTRS_PER_PGD];
+	struct rw_semaphore	lock;
+};
+
+/*
+ * Data attached to the open device file.
+ * Note that it might be shared after a fork().
+ */
+struct dmirror {
+	struct hmm_mirror	mirror;
+	struct dmirror_device	*mdevice;
+	struct dmirror_pt	pt;
+	struct mutex		mutex;
+};
+
+/*
+ * ZONE_DEVICE pages for migration and simulating device memory.
+ */
+struct dmirror_chunk {
+	struct dev_pagemap	pagemap;
+	struct dmirror_device	*mdevice;
+};
+
+/*
+ * Per device data.
+ */
+struct dmirror_device {
+	struct cdev		cdevice;
+	struct hmm_devmem	*devmem;
+	struct platform_device	*pdevice;
+
+	unsigned int		devmem_capacity;
+	unsigned int		devmem_count;
+	struct dmirror_chunk	**devmem_chunks;
+	struct mutex		devmem_lock;	/* protects the above */
+
+	unsigned long		calloc;
+	unsigned long		cfree;
+	struct page		*free_pages;
+	spinlock_t		lock;		/* protects the above */
+};
+
+static inline unsigned long dmirror_pt_pgd(unsigned long addr)
+{
+	return (addr >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1);
+}
+
+static inline unsigned long dmirror_pt_pud(unsigned long addr)
+{
+	return (addr >> PUD_SHIFT) & (PTRS_PER_PUD - 1);
+}
+
+static inline unsigned long dmirror_pt_pmd(unsigned long addr)
+{
+	return (addr >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
+}
+
+static inline unsigned long dmirror_pt_pte(unsigned long addr)
+{
+	return (addr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
+}
+
+static inline struct page *dmirror_pt_page(u64 *dptep)
+{
+	u64 dpte = *dptep;
+
+	if (dpte == DPT_ZPAGE)
+		return dmirror_zero_page;
+	if (!(dpte & DPT_VALID))
+		return NULL;
+	return pfn_to_page((u64)dpte >> DPT_SHIFT);
+}
+
+static inline struct page *dmirror_pt_page_write(u64 *dptep)
+{
+	u64 dpte = *dptep;
+
+	if (!(dpte & DPT_VALID) || !(dpte & DPT_WRITE))
+		return NULL;
+	return pfn_to_page((u64)dpte >> DPT_SHIFT);
+}
+
+static inline u64 dmirror_pt_from_page(struct page *page)
+{
+	if (!page)
+		return 0;
+	return (page_to_pfn(page) << DPT_SHIFT) | DPT_VALID;
+}
+
+static struct page *populate_pt(struct dmirror *dmirror, u64 *dptep)
+{
+	struct page *page;
+
+	/*
+	 * Since we don't free page tables until the process exits,
+	 * we can unlock and relock without the page table being freed
+	 * from under us.
+	 */
+	mutex_unlock(&dmirror->mutex);
+	page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
+	mutex_lock(&dmirror->mutex);
+	if (page) {
+		if (unlikely(*dptep)) {
+			__free_page(page);
+			page = dmirror_pt_page(dptep);
+		} else
+			*dptep = dmirror_pt_from_page(page);
+	} else if (*dptep)
+		page = dmirror_pt_page(dptep);
+	return page;
+}
+
+static inline unsigned long dmirror_pt_pud_end(unsigned long addr)
+{
+	return (addr & PGDIR_MASK) + ((unsigned long)PTRS_PER_PUD << PUD_SHIFT);
+}
+
+static inline unsigned long dmirror_pt_pmd_end(unsigned long addr)
+{
+	return (addr & PUD_MASK) + ((unsigned long)PTRS_PER_PMD << PMD_SHIFT);
+}
+
+static inline unsigned long dmirror_pt_pte_end(unsigned long addr)
+{
+	return (addr & PMD_MASK) + ((unsigned long)PTRS_PER_PTE << PAGE_SHIFT);
+}
+
+typedef int (*dmirror_walk_cb_t)(struct dmirror *dmirror,
+				 unsigned long start,
+				 unsigned long end,
+				 u64 *dptep,
+				 void *private);
+
+static int dmirror_pt_walk(struct dmirror *dmirror,
+			   dmirror_walk_cb_t cb,
+			   unsigned long start,
+			   unsigned long end,
+			   void *private,
+			   bool populate)
+{
+	u64 *dpgdp = &dmirror->pt.pgd[dmirror_pt_pgd(start)];
+	unsigned long addr;
+	int ret = -ENOENT;
+
+	for (addr = start; addr < end; dpgdp++) {
+		u64 *dpudp;
+		unsigned long pud_end;
+		struct page *pud_page;
+
+		pud_end = min(end, dmirror_pt_pud_end(addr));
+		pud_page = dmirror_pt_page(dpgdp);
+		if (!pud_page) {
+			if (!populate) {
+				addr = pud_end;
+				continue;
+			}
+			pud_page = populate_pt(dmirror, dpgdp);
+			if (!pud_page)
+				return -ENOMEM;
+		}
+		dpudp = kmap(pud_page);
+		dpudp += dmirror_pt_pud(addr);
+		for (; addr != pud_end; dpudp++) {
+			u64 *dpmdp;
+			unsigned long pmd_end;
+			struct page *pmd_page;
+
+			pmd_end = min(end, dmirror_pt_pmd_end(addr));
+			pmd_page = dmirror_pt_page(dpudp);
+			if (!pmd_page) {
+				if (!populate) {
+					addr = pmd_end;
+					continue;
+				}
+				pmd_page = populate_pt(dmirror, dpudp);
+				if (!pmd_page) {
+					kunmap(pud_page);
+					return -ENOMEM;
+				}
+			}
+			dpmdp = kmap(pmd_page);
+			dpmdp += dmirror_pt_pmd(addr);
+			for (; addr != pmd_end; dpmdp++) {
+				u64 *dptep;
+				unsigned long pte_end;
+				struct page *pte_page;
+
+				pte_end = min(end, dmirror_pt_pte_end(addr));
+				pte_page = dmirror_pt_page(dpmdp);
+				if (!pte_page) {
+					if (!populate) {
+						addr = pte_end;
+						continue;
+					}
+					pte_page = populate_pt(dmirror, dpmdp);
+					if (!pte_page) {
+						kunmap(pmd_page);
+						kunmap(pud_page);
+						return -ENOMEM;
+					}
+				}
+				if (!cb) {
+					addr = pte_end;
+					continue;
+				}
+				dptep = kmap(pte_page);
+				dptep += dmirror_pt_pte(addr);
+				ret = cb(dmirror, addr, pte_end, dptep,
+					 private);
+				kunmap(pte_page);
+				if (ret) {
+					kunmap(pmd_page);
+					kunmap(pud_page);
+					return ret;
+				}
+				addr = pte_end;
+			}
+			kunmap(pmd_page);
+			addr = pmd_end;
+		}
+		kunmap(pud_page);
+		addr = pud_end;
+	}
+
+	return ret;
+}
+
+static void dmirror_pt_free(struct dmirror *dmirror)
+{
+	u64 *dpgdp = dmirror->pt.pgd;
+
+	for (; dpgdp != dmirror->pt.pgd + PTRS_PER_PGD; dpgdp++) {
+		u64 *dpudp, *dpudp_orig;
+		u64 *dpudp_end;
+		struct page *pud_page;
+
+		pud_page = dmirror_pt_page(dpgdp);
+		if (!pud_page)
+			continue;
+
+		dpudp_orig = kmap_atomic(pud_page);
+		dpudp = dpudp_orig;
+		dpudp_end = dpudp + PTRS_PER_PUD;
+		for (; dpudp != dpudp_end; dpudp++) {
+			u64 *dpmdp, *dpmdp_orig;
+			u64 *dpmdp_end;
+			struct page *pmd_page;
+
+			pmd_page = dmirror_pt_page(dpudp);
+			if (!pmd_page)
+				continue;
+
+			dpmdp_orig = kmap_atomic(pmd_page);
+			dpmdp = dpmdp_orig;
+			dpmdp_end = dpmdp + PTRS_PER_PMD;
+			for (; dpmdp != dpmdp_end; dpmdp++) {
+				struct page *pte_page;
+
+				pte_page = dmirror_pt_page(dpmdp);
+				if (!pte_page)
+					continue;
+
+				*dpmdp = 0;
+				__free_page(pte_page);
+			}
+			kunmap_atomic(dpmdp_orig);
+			*dpudp = 0;
+			__free_page(pmd_page);
+		}
+		kunmap_atomic(dpudp_orig);
+		*dpgdp = 0;
+		__free_page(pud_page);
+	}
+}
+
+static int dmirror_bounce_init(struct dmirror_bounce *bounce,
+			       unsigned long addr,
+			       unsigned long size)
+{
+	bounce->addr = addr;
+	bounce->size = size;
+	bounce->cpages = 0;
+	bounce->ptr = vmalloc(size);
+	if (!bounce->ptr)
+		return -ENOMEM;
+	return 0;
+}
+
+static int dmirror_bounce_copy_from(struct dmirror_bounce *bounce,
+				    unsigned long addr)
+{
+	unsigned long end = addr + bounce->size;
+	char __user *uptr = (void __user *)addr;
+	void *ptr = bounce->ptr;
+
+	for (; addr < end; addr += PAGE_SIZE, ptr += PAGE_SIZE,
+					      uptr += PAGE_SIZE) {
+		int ret;
+
+		ret = copy_from_user(ptr, uptr, PAGE_SIZE);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+static int dmirror_bounce_copy_to(struct dmirror_bounce *bounce,
+				  unsigned long addr)
+{
+	unsigned long end = addr + bounce->size;
+	char __user *uptr = (void __user *)addr;
+	void *ptr = bounce->ptr;
+
+	for (; addr < end; addr += PAGE_SIZE, ptr += PAGE_SIZE,
+					      uptr += PAGE_SIZE) {
+		int ret;
+
+		ret = copy_to_user(uptr, ptr, PAGE_SIZE);
+		if (ret)
+			return ret;
+	}
+
+	return 0;
+}
+
+static void dmirror_bounce_fini(struct dmirror_bounce *bounce)
+{
+	vfree(bounce->ptr);
+}
+
+static int dmirror_do_update(struct dmirror *dmirror,
+			     unsigned long addr,
+			     unsigned long end,
+			     u64 *dptep,
+			     void *private)
+{
+	/*
+	 * The page table doesn't hold references to pages since it relies on
+	 * the mmu notifier to clear pointers when they become stale.
+	 * Therefore, it is OK to just clear the pte.
+	 */
+	for (; addr < end; addr += PAGE_SIZE, ++dptep)
+		*dptep = 0;
+
+	return 0;
+}
+
+static int dmirror_update(struct hmm_mirror *mirror,
+			  const struct mmu_notifier_range *update)
+{
+	struct dmirror *dmirror = container_of(mirror, struct dmirror, mirror);
+
+	if (mmu_notifier_range_blockable(update))
+		mutex_lock(&dmirror->mutex);
+	else if (!mutex_trylock(&dmirror->mutex))
+		return -EAGAIN;
+
+	dmirror_pt_walk(dmirror, dmirror_do_update, update->start,
+			update->end, NULL, false);
+	mutex_unlock(&dmirror->mutex);
+	return 0;
+}
+
+static const struct hmm_mirror_ops dmirror_ops = {
+	.sync_cpu_device_pagetables	= &dmirror_update,
+};
+
+/*
+ * dmirror_new() - allocate and initialize dmirror struct.
+ *
+ * @mdevice: The device this mirror is associated with.
+ */
+static struct dmirror *dmirror_new(struct dmirror_device *mdevice)
+{
+	struct mm_struct *mm = get_task_mm(current);
+	struct dmirror *dmirror;
+	int ret;
+
+	if (!mm)
+		goto err;
+
+	/* Mirror this process address space */
+	dmirror = kzalloc(sizeof(*dmirror), GFP_KERNEL);
+	if (dmirror == NULL)
+		goto err_mmput;
+
+	dmirror->mdevice = mdevice;
+	dmirror->mirror.ops = &dmirror_ops;
+	mutex_init(&dmirror->mutex);
+
+	down_write(&mm->mmap_sem);
+	ret = hmm_mirror_register(&dmirror->mirror, mm);
+	up_write(&mm->mmap_sem);
+	if (ret)
+		goto err_free;
+
+	mmput(mm);
+	return dmirror;
+
+err_free:
+	kfree(dmirror);
+err_mmput:
+	mmput(mm);
+err:
+	return NULL;
+}
+
+static void dmirror_del(struct dmirror *dmirror)
+{
+	hmm_mirror_unregister(&dmirror->mirror);
+	dmirror_pt_free(dmirror);
+	kfree(dmirror);
+}
+
+/*
+ * Below are the file operations for the dmirror device file. Only ioctl matters.
+ *
+ * Note this is highly specific to the dmirror device driver and should not be
+ * construed as an example on how to design the API a real device driver would
+ * expose to userspace.
+ */
+static ssize_t dmirror_fops_read(struct file *filp,
+			       char __user *buf,
+			       size_t count,
+			       loff_t *ppos)
+{
+	return -EINVAL;
+}
+
+static ssize_t dmirror_fops_write(struct file *filp,
+				const char __user *buf,
+				size_t count,
+				loff_t *ppos)
+{
+	return -EINVAL;
+}
+
+static int dmirror_fops_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+	/* Forbid mmap of the dmirror device file. */
+	return -EINVAL;
+}
+
+static int dmirror_fops_open(struct inode *inode, struct file *filp)
+{
+	struct cdev *cdev = inode->i_cdev;
+	struct dmirror_device *mdevice;
+	struct dmirror *dmirror;
+
+	/* No exclusive opens. */
+	if (filp->f_flags & O_EXCL)
+		return -EINVAL;
+
+	mdevice = container_of(cdev, struct dmirror_device, cdevice);
+	dmirror = dmirror_new(mdevice);
+	if (!dmirror)
+		return -ENOMEM;
+
+	/* Only the first open registers the address space. */
+	mutex_lock(&mdevice->devmem_lock);
+	if (filp->private_data)
+		goto err_busy;
+	filp->private_data = dmirror;
+	mutex_unlock(&mdevice->devmem_lock);
+
+	return 0;
+
+err_busy:
+	mutex_unlock(&mdevice->devmem_lock);
+	dmirror_del(dmirror);
+	return -EBUSY;
+}
+
+static int dmirror_fops_release(struct inode *inode, struct file *filp)
+{
+	struct dmirror *dmirror = filp->private_data;
+
+	if (!dmirror)
+		return 0;
+
+	dmirror_del(dmirror);
+	filp->private_data = NULL;
+
+	return 0;
+}
+
+static inline struct dmirror_device *dmirror_page_to_device(struct page *page)
+
+{
+	struct dmirror_chunk *devmem;
+
+	devmem = container_of(page->pgmap, struct dmirror_chunk, pagemap);
+	return devmem->mdevice;
+}
+
+static bool dmirror_device_is_mine(struct dmirror_device *mdevice,
+				   struct page *page)
+{
+	if (!is_zone_device_page(page))
+		return false;
+	return page->pgmap->ops == &dmirror_devmem_ops &&
+		dmirror_page_to_device(page) == mdevice;
+}
+
+static int dmirror_do_fault(struct dmirror *dmirror,
+			    unsigned long addr,
+			    unsigned long end,
+			    u64 *dptep,
+			    void *private)
+{
+	struct hmm_range *range = private;
+	unsigned long idx = (addr - range->start) >> PAGE_SHIFT;
+	uint64_t *pfns = range->pfns;
+
+	for (; addr < end; addr += PAGE_SIZE, ++dptep, ++idx) {
+		struct page *page;
+		u64 dpte;
+
+		/*
+		 * HMM_PFN_ERROR is returned if it is accessing invalid memory
+		 * either because of memory error (hardware detected memory
+		 * corruption) or more likely because of truncate on mmap
+		 * file.
+		 */
+		if (pfns[idx] == range->values[HMM_PFN_ERROR])
+			return -EFAULT;
+		/*
+		 * The only special PFN HMM returns is the read-only zero page
+		 * which doesn't have a matching struct page.
+		 */
+		if (pfns[idx] == range->values[HMM_PFN_SPECIAL]) {
+			*dptep = DPT_ZPAGE;
+			continue;
+		}
+		if (!(pfns[idx] & range->flags[HMM_PFN_VALID]))
+			return -EFAULT;
+		page = hmm_device_entry_to_page(range, pfns[idx]);
+		/* We asked for pages to be populated but check anyway. */
+		if (!page)
+			return -EFAULT;
+		dpte = dmirror_pt_from_page(page);
+		if (is_zone_device_page(page)) {
+			if (!dmirror_device_is_mine(dmirror->mdevice, page))
+				continue;
+			dpte |= DPT_DPAGE;
+		}
+		if (pfns[idx] & range->flags[HMM_PFN_WRITE])
+			dpte |= DPT_WRITE;
+		else if (range->default_flags & range->flags[HMM_PFN_WRITE])
+			return -EFAULT;
+		*dptep = dpte;
+	}
+
+	return 0;
+}
+
+static int dmirror_fault(struct dmirror *dmirror,
+			 unsigned long start,
+			 unsigned long end,
+			 bool write)
+{
+	struct mm_struct *mm = dmirror->mirror.hmm->mmu_notifier.mm;
+	unsigned long addr;
+	unsigned long next;
+	uint64_t pfns[64];
+	struct hmm_range range = {
+		.pfns = pfns,
+		.flags = dmirror_hmm_flags,
+		.values = dmirror_hmm_values,
+		.pfn_shift = DPT_SHIFT,
+		.pfn_flags_mask = ~(dmirror_hmm_flags[HMM_PFN_VALID] |
+				    dmirror_hmm_flags[HMM_PFN_WRITE]),
+		.default_flags = dmirror_hmm_flags[HMM_PFN_VALID] |
+				(write ? dmirror_hmm_flags[HMM_PFN_WRITE] : 0),
+	};
+	int ret = 0;
+
+	for (addr = start; addr < end; ) {
+		long count;
+
+		next = min(addr + (ARRAY_SIZE(pfns) << PAGE_SHIFT), end);
+		range.start = addr;
+		range.end = next;
+
+		down_read(&mm->mmap_sem);
+
+		ret = hmm_range_register(&range, &dmirror->mirror);
+		if (ret) {
+			up_read(&mm->mmap_sem);
+			break;
+		}
+
+		if (!hmm_range_wait_until_valid(&range,
+						DMIRROR_RANGE_FAULT_TIMEOUT)) {
+			hmm_range_unregister(&range);
+			up_read(&mm->mmap_sem);
+			continue;
+		}
+
+		count = hmm_range_fault(&range, 0);
+		if (count < 0) {
+			ret = count;
+			hmm_range_unregister(&range);
+			up_read(&mm->mmap_sem);
+			break;
+		}
+
+		if (!hmm_range_valid(&range)) {
+			hmm_range_unregister(&range);
+			up_read(&mm->mmap_sem);
+			continue;
+		}
+		mutex_lock(&dmirror->mutex);
+		ret = dmirror_pt_walk(dmirror, dmirror_do_fault,
+				      addr, next, &range, true);
+		mutex_unlock(&dmirror->mutex);
+		hmm_range_unregister(&range);
+		up_read(&mm->mmap_sem);
+		if (ret)
+			break;
+
+		addr = next;
+	}
+
+	return ret;
+}
+
+static int dmirror_do_read(struct dmirror *dmirror,
+			   unsigned long addr,
+			   unsigned long end,
+			   u64 *dptep,
+			   void *private)
+{
+	struct dmirror_bounce *bounce = private;
+	void *ptr;
+
+	ptr = bounce->ptr + ((addr - bounce->addr) & PAGE_MASK);
+
+	for (; addr < end; addr += PAGE_SIZE, ++dptep) {
+		struct page *page;
+		void *tmp;
+
+		page = dmirror_pt_page(dptep);
+		if (!page)
+			return -ENOENT;
+
+		tmp = kmap(page);
+		memcpy(ptr, tmp, PAGE_SIZE);
+		kunmap(page);
+
+		ptr += PAGE_SIZE;
+		bounce->cpages++;
+	}
+
+	return 0;
+}
+
+static int dmirror_read(struct dmirror *dmirror,
+			struct hmm_dmirror_cmd *cmd)
+{
+	struct dmirror_bounce bounce;
+	unsigned long start, end;
+	unsigned long size = cmd->npages << PAGE_SHIFT;
+	int ret;
+
+	start = cmd->addr;
+	end = start + size;
+	if (end < start)
+		return -EINVAL;
+
+	ret = dmirror_bounce_init(&bounce, start, size);
+	if (ret)
+		return ret;
+
+again:
+	mutex_lock(&dmirror->mutex);
+	ret = dmirror_pt_walk(dmirror, dmirror_do_read, start, end, &bounce,
+				false);
+	mutex_unlock(&dmirror->mutex);
+	if (ret == 0)
+		ret = dmirror_bounce_copy_to(&bounce, cmd->ptr);
+	else if (ret == -ENOENT) {
+		start = cmd->addr + (bounce.cpages << PAGE_SHIFT);
+		ret = dmirror_fault(dmirror, start, end, false);
+		if (ret == 0) {
+			cmd->faults++;
+			goto again;
+		}
+	}
+
+	cmd->cpages = bounce.cpages;
+	dmirror_bounce_fini(&bounce);
+	return ret;
+}
+
+static int dmirror_do_write(struct dmirror *dmirror,
+			    unsigned long addr,
+			    unsigned long end,
+			    u64 *dptep,
+			    void *private)
+{
+	struct dmirror_bounce *bounce = private;
+	void *ptr;
+
+	ptr = bounce->ptr + ((addr - bounce->addr) & PAGE_MASK);
+
+	for (; addr < end; addr += PAGE_SIZE, ++dptep) {
+		struct page *page;
+		void *tmp;
+
+		page = dmirror_pt_page_write(dptep);
+		if (!page)
+			return -ENOENT;
+
+		tmp = kmap(page);
+		memcpy(tmp, ptr, PAGE_SIZE);
+		kunmap(page);
+
+		ptr += PAGE_SIZE;
+		bounce->cpages++;
+	}
+
+	return 0;
+}
+
+static int dmirror_write(struct dmirror *dmirror,
+			 struct hmm_dmirror_cmd *cmd)
+{
+	struct dmirror_bounce bounce;
+	unsigned long start, end;
+	unsigned long size = cmd->npages << PAGE_SHIFT;
+	int ret;
+
+	start = cmd->addr;
+	end = start + size;
+	if (end < start)
+		return -EINVAL;
+
+	ret = dmirror_bounce_init(&bounce, start, size);
+	if (ret)
+		return ret;
+	ret = dmirror_bounce_copy_from(&bounce, cmd->ptr);
+	if (ret)
+		return ret;
+
+again:
+	mutex_lock(&dmirror->mutex);
+	ret = dmirror_pt_walk(dmirror, dmirror_do_write,
+			      start, end, &bounce, false);
+	mutex_unlock(&dmirror->mutex);
+	if (ret == -ENOENT) {
+		start = cmd->addr + (bounce.cpages << PAGE_SHIFT);
+		ret = dmirror_fault(dmirror, start, end, true);
+		if (ret == 0) {
+			cmd->faults++;
+			goto again;
+		}
+	}
+
+	cmd->cpages = bounce.cpages;
+	dmirror_bounce_fini(&bounce);
+	return ret;
+}
+
+static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,
+				   struct page **ppage)
+{
+	struct dmirror_chunk *devmem;
+	struct resource *res;
+	unsigned long pfn;
+	unsigned long pfn_first;
+	unsigned long pfn_last;
+	void *ptr;
+
+	mutex_lock(&mdevice->devmem_lock);
+
+	if (mdevice->devmem_count == mdevice->devmem_capacity) {
+		struct dmirror_chunk **new_chunks;
+		unsigned int new_capacity;
+
+		new_capacity = mdevice->devmem_capacity +
+				DEVMEM_CHUNKS_RESERVE;
+		new_chunks = krealloc(mdevice->devmem_chunks,
+				sizeof(new_chunks[0]) * new_capacity,
+				GFP_KERNEL);
+		if (!new_chunks)
+			goto err;
+		mdevice->devmem_capacity = new_capacity;
+		mdevice->devmem_chunks = new_chunks;
+	}
+
+	res = devm_request_free_mem_region(&mdevice->pdevice->dev,
+					&iomem_resource, DEVMEM_CHUNK_SIZE);
+	if (IS_ERR(res))
+		goto err;
+
+	devmem = kzalloc(sizeof(*devmem), GFP_KERNEL);
+	if (!devmem)
+		goto err;
+
+	devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
+	devmem->pagemap.res = *res;
+	devmem->pagemap.ops = &dmirror_devmem_ops;
+	ptr = devm_memremap_pages(&mdevice->pdevice->dev, &devmem->pagemap);
+	if (IS_ERR(ptr))
+		goto err_free;
+
+	devmem->mdevice = mdevice;
+	pfn_first = devmem->pagemap.res.start >> PAGE_SHIFT;
+	pfn_last = pfn_first +
+		(resource_size(&devmem->pagemap.res) >> PAGE_SHIFT);
+	mdevice->devmem_chunks[mdevice->devmem_count++] = devmem;
+
+	mutex_unlock(&mdevice->devmem_lock);
+
+	pr_info("added new %u MB chunk (total %u chunks, %u MB) PFNs [0x%lx 0x%lx)\n",
+		DEVMEM_CHUNK_SIZE / (1024 * 1024),
+		mdevice->devmem_count,
+		mdevice->devmem_count * (DEVMEM_CHUNK_SIZE / (1024 * 1024)),
+		pfn_first, pfn_last);
+
+	spin_lock(&mdevice->lock);
+	for (pfn = pfn_first; pfn < pfn_last; pfn++) {
+		struct page *page = pfn_to_page(pfn);
+
+		page->zone_device_data = mdevice->free_pages;
+		mdevice->free_pages = page;
+	}
+	if (ppage) {
+		*ppage = mdevice->free_pages;
+		mdevice->free_pages = (*ppage)->zone_device_data;
+		mdevice->calloc++;
+	}
+	spin_unlock(&mdevice->lock);
+
+	return true;
+
+err_free:
+	kfree(devmem);
+err:
+	mutex_unlock(&mdevice->devmem_lock);
+	return false;
+}
+
+static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
+{
+	struct page *dpage = NULL;
+	struct page *rpage;
+
+	/*
+	 * This is a fake device so we alloc real system memory to store
+	 * our device memory.
+	 */
+	rpage = alloc_page(GFP_HIGHUSER);
+	if (!rpage)
+		return NULL;
+
+	spin_lock(&mdevice->lock);
+
+	if (mdevice->free_pages) {
+		dpage = mdevice->free_pages;
+		mdevice->free_pages = dpage->zone_device_data;
+		mdevice->calloc++;
+		spin_unlock(&mdevice->lock);
+	} else {
+		spin_unlock(&mdevice->lock);
+		if (!dmirror_allocate_chunk(mdevice, &dpage))
+			goto error;
+	}
+
+	dpage->zone_device_data = rpage;
+	get_page(dpage);
+	lock_page(dpage);
+	return dpage;
+
+error:
+	__free_page(rpage);
+	return NULL;
+}
+
+static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
+					   struct dmirror *dmirror)
+{
+	struct dmirror_device *mdevice = dmirror->mdevice;
+	const unsigned long *src = args->src;
+	unsigned long *dst = args->dst;
+	unsigned long addr;
+
+	for (addr = args->start; addr < args->end; addr += PAGE_SIZE,
+						   src++, dst++) {
+		struct page *spage;
+		struct page *dpage;
+		struct page *rpage;
+
+		if (!(*src & MIGRATE_PFN_MIGRATE))
+			continue;
+
+		/*
+		 * Note that spage might be NULL which is OK since it is an
+		 * unallocated pte_none() or read-only zero page.
+		 */
+		spage = migrate_pfn_to_page(*src);
+
+		/*
+		 * Don't migrate device private pages from our own driver or
+		 * others. For our own we would do a device private memory copy
+		 * not a migration and for others, we would need to fault the
+		 * other device's page into system memory first.
+		 */
+		if (spage && is_zone_device_page(spage))
+			continue;
+
+		dpage = dmirror_devmem_alloc_page(mdevice);
+		if (!dpage)
+			continue;
+
+		rpage = dpage->zone_device_data;
+		if (spage)
+			copy_highpage(rpage, spage);
+		else
+			clear_highpage(rpage);
+
+		/*
+		 * Normally, a device would use the page->zone_device_data to
+		 * point to the mirror but here we use it to hold the page for
+		 * the simulated device memory and that page holds the pointer
+		 * to the mirror.
+		 */
+		rpage->zone_device_data = dmirror;
+
+		*dst = migrate_pfn(page_to_pfn(dpage)) |
+			    MIGRATE_PFN_LOCKED;
+		if ((*src & MIGRATE_PFN_WRITE) ||
+		    (!spage && args->vma->vm_flags & VM_WRITE))
+			*dst |= MIGRATE_PFN_WRITE;
+	}
+	/* Try to pre-allocate page tables. */
+	mutex_lock(&dmirror->mutex);
+	dmirror_pt_walk(dmirror, NULL, args->start, args->end, NULL, true);
+	mutex_unlock(&dmirror->mutex);
+}
+
+struct dmirror_migrate {
+	struct hmm_dmirror_cmd		*cmd;
+	const unsigned long		*src;
+	const unsigned long		*dst;
+	unsigned long			start;
+};
+
+static int dmirror_do_migrate(struct dmirror *dmirror,
+			      unsigned long addr,
+			      unsigned long end,
+			      u64 *dptep,
+			      void *private)
+{
+	struct dmirror_migrate *migrate = private;
+	const unsigned long *src = migrate->src;
+	const unsigned long *dst = migrate->dst;
+	unsigned long idx = (addr - migrate->start) >> PAGE_SHIFT;
+
+	for (; addr < end; addr += PAGE_SIZE, ++dptep, ++idx) {
+		struct page *page;
+		u64 dpte;
+
+		if (!(src[idx] & MIGRATE_PFN_MIGRATE))
+			continue;
+
+		page = migrate_pfn_to_page(dst[idx]);
+		if (!page)
+			continue;
+
+		/*
+		 * Map the page that holds the data so dmirror_pt_walk()
+		 * doesn't have to deal with ZONE_DEVICE private pages.
+		 */
+		page = page->zone_device_data;
+		dpte = dmirror_pt_from_page(page) | DPT_DPAGE;
+		if (dst[idx] & MIGRATE_PFN_WRITE)
+			dpte |= DPT_WRITE;
+		*dptep = dpte;
+	}
+
+	return 0;
+}
+
+static void dmirror_migrate_finalize_and_map(struct migrate_vma *args,
+					     struct dmirror *dmirror,
+					     struct hmm_dmirror_cmd *cmd)
+{
+	struct dmirror_migrate migrate;
+
+	migrate.cmd = cmd;
+	migrate.src = args->src;
+	migrate.dst = args->dst;
+	migrate.start = args->start;
+
+	/* Map the migrated pages into the device's page tables. */
+	mutex_lock(&dmirror->mutex);
+	dmirror_pt_walk(dmirror, dmirror_do_migrate, args->start, args->end,
+			&migrate, true);
+	mutex_unlock(&dmirror->mutex);
+}
+
+static int dmirror_migrate(struct dmirror *dmirror,
+			   struct hmm_dmirror_cmd *cmd)
+{
+	unsigned long start, end, addr;
+	unsigned long size = cmd->npages << PAGE_SHIFT;
+	struct mm_struct *mm = dmirror->mirror.hmm->mmu_notifier.mm;
+	struct vm_area_struct *vma;
+	unsigned long src_pfns[64];
+	unsigned long dst_pfns[64];
+	struct dmirror_bounce bounce;
+	struct migrate_vma args;
+	unsigned long next;
+	int ret;
+
+	start = cmd->addr;
+	end = start + size;
+	if (end < start)
+		return -EINVAL;
+
+	down_read(&mm->mmap_sem);
+	for (addr = start; addr < end; addr = next) {
+		next = min(end, addr + (ARRAY_SIZE(src_pfns) << PAGE_SHIFT));
+
+		vma = find_vma(mm, addr);
+		if (!vma || addr < vma->vm_start) {
+			ret = -EINVAL;
+			goto out;
+		}
+		if (next > vma->vm_end)
+			next = vma->vm_end;
+
+		args.vma = vma;
+		args.src = src_pfns;
+		args.dst = dst_pfns;
+		args.start = addr;
+		args.end = next;
+		ret = migrate_vma_setup(&args);
+		if (ret)
+			goto out;
+
+		dmirror_migrate_alloc_and_copy(&args, dmirror);
+		migrate_vma_pages(&args);
+		dmirror_migrate_finalize_and_map(&args, dmirror, cmd);
+		migrate_vma_finalize(&args);
+	}
+	up_read(&mm->mmap_sem);
+
+	/* Return the migrated data for verification. */
+	ret = dmirror_bounce_init(&bounce, start, size);
+	if (ret)
+		return ret;
+	mutex_lock(&dmirror->mutex);
+	ret = dmirror_pt_walk(dmirror, dmirror_do_read, start, end, &bounce,
+				false);
+	mutex_unlock(&dmirror->mutex);
+	if (ret == 0)
+		ret = dmirror_bounce_copy_to(&bounce, cmd->ptr);
+	cmd->cpages = bounce.cpages;
+	dmirror_bounce_fini(&bounce);
+	return ret;
+
+out:
+	up_read(&mm->mmap_sem);
+	return ret;
+}
+
+static void dmirror_mkentry(struct dmirror *dmirror,
+			    struct hmm_range *range,
+			    unsigned char *perm,
+			    uint64_t entry)
+{
+	struct page *page;
+
+	if (entry == range->values[HMM_PFN_ERROR]) {
+		*perm = HMM_DMIRROR_PROT_ERROR;
+		return;
+	}
+	page = hmm_device_entry_to_page(range, entry);
+	if (!page) {
+		*perm = HMM_DMIRROR_PROT_NONE;
+		return;
+	}
+	if (entry & range->flags[HMM_PFN_DEVICE_PRIVATE]) {
+		/* Is the page migrated to this device or some other? */
+		if (dmirror->mdevice == dmirror_page_to_device(page))
+			*perm = HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL;
+		else
+			*perm = HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE;
+	} else if (is_zero_pfn(page_to_pfn(page)))
+		*perm = HMM_DMIRROR_PROT_ZERO;
+	else
+		*perm = HMM_DMIRROR_PROT_NONE;
+	if (entry & range->flags[HMM_PFN_WRITE])
+		*perm |= HMM_DMIRROR_PROT_WRITE;
+	else
+		*perm |= HMM_DMIRROR_PROT_READ;
+}
+
+static int dmirror_snapshot(struct dmirror *dmirror,
+			    struct hmm_dmirror_cmd *cmd)
+{
+	struct mm_struct *mm = dmirror->mirror.hmm->mmu_notifier.mm;
+	unsigned long start, end;
+	unsigned long size = cmd->npages << PAGE_SHIFT;
+	unsigned long addr;
+	unsigned long next;
+	uint64_t pfns[64];
+	unsigned char perm[64];
+	char __user *uptr;
+	struct hmm_range range = {
+		.pfns = pfns,
+		.flags = dmirror_hmm_flags,
+		.values = dmirror_hmm_values,
+		.pfn_shift = DPT_SHIFT,
+		.pfn_flags_mask = ~0ULL,
+	};
+	int ret = 0;
+
+	start = cmd->addr;
+	end = start + size;
+	uptr = (void __user *)cmd->ptr;
+
+	for (addr = start; addr < end; ) {
+		long count;
+		unsigned long i;
+		unsigned long n;
+
+		next = min(addr + (ARRAY_SIZE(pfns) << PAGE_SHIFT), end);
+		range.start = addr;
+		range.end = next;
+
+		down_read(&mm->mmap_sem);
+
+		ret = hmm_range_register(&range, &dmirror->mirror);
+		if (ret) {
+			up_read(&mm->mmap_sem);
+			break;
+		}
+
+		if (!hmm_range_wait_until_valid(&range,
+						DMIRROR_RANGE_FAULT_TIMEOUT)) {
+			hmm_range_unregister(&range);
+			up_read(&mm->mmap_sem);
+			continue;
+		}
+
+		count = hmm_range_fault(&range, HMM_FAULT_SNAPSHOT);
+		if (count < 0) {
+			ret = count;
+			hmm_range_unregister(&range);
+			up_read(&mm->mmap_sem);
+			if (ret == -EBUSY)
+				continue;
+			break;
+		}
+
+		if (!hmm_range_valid(&range)) {
+			hmm_range_unregister(&range);
+			up_read(&mm->mmap_sem);
+			continue;
+		}
+
+		n = (next - addr) >> PAGE_SHIFT;
+		for (i = 0; i < n; i++)
+			dmirror_mkentry(dmirror, &range, perm + i, pfns[i]);
+		hmm_range_unregister(&range);
+		up_read(&mm->mmap_sem);
+
+		ret = copy_to_user(uptr, perm, n);
+		if (ret)
+			break;
+
+		cmd->cpages += n;
+		uptr += n;
+		addr = next;
+	}
+
+	return ret;
+}
+
+static long dmirror_fops_unlocked_ioctl(struct file *filp,
+					unsigned int command,
+					unsigned long arg)
+{
+	void __user *uarg = (void __user *)arg;
+	struct hmm_dmirror_cmd cmd;
+	struct dmirror *dmirror;
+	int ret;
+
+	dmirror = filp->private_data;
+	if (!dmirror)
+		return -EINVAL;
+
+	ret = copy_from_user(&cmd, uarg, sizeof(cmd));
+	if (ret)
+		return ret;
+
+	if (cmd.addr & ~PAGE_MASK)
+		return -EINVAL;
+	if (cmd.addr >= (cmd.addr + (cmd.npages << PAGE_SHIFT)))
+		return -EINVAL;
+
+	cmd.cpages = 0;
+	cmd.faults = 0;
+
+	switch (command) {
+	case HMM_DMIRROR_READ:
+		ret = dmirror_read(dmirror, &cmd);
+		break;
+
+	case HMM_DMIRROR_WRITE:
+		ret = dmirror_write(dmirror, &cmd);
+		break;
+
+	case HMM_DMIRROR_MIGRATE:
+		ret = dmirror_migrate(dmirror, &cmd);
+		break;
+
+	case HMM_DMIRROR_SNAPSHOT:
+		ret = dmirror_snapshot(dmirror, &cmd);
+		break;
+
+	default:
+		return -EINVAL;
+	}
+	if (ret)
+		return ret;
+
+	return copy_to_user(uarg, &cmd, sizeof(cmd));
+}
+
+static const struct file_operations dmirror_fops = {
+	.read		= dmirror_fops_read,
+	.write		= dmirror_fops_write,
+	.mmap		= dmirror_fops_mmap,
+	.open		= dmirror_fops_open,
+	.release	= dmirror_fops_release,
+	.unlocked_ioctl = dmirror_fops_unlocked_ioctl,
+	.llseek		= default_llseek,
+	.owner		= THIS_MODULE,
+};
+
+static void dmirror_devmem_free(struct page *page)
+{
+	struct page *rpage = page->zone_device_data;
+	struct dmirror_device *mdevice;
+
+	if (rpage)
+		__free_page(rpage);
+
+	mdevice = dmirror_page_to_device(page);
+
+	spin_lock(&mdevice->lock);
+	mdevice->cfree++;
+	page->zone_device_data = mdevice->free_pages;
+	mdevice->free_pages = page;
+	spin_unlock(&mdevice->lock);
+}
+
+static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma *args,
+						struct dmirror_device *mdevice)
+{
+	struct vm_area_struct *vma = args->vma;
+	const unsigned long *src = args->src;
+	unsigned long *dst = args->dst;
+	unsigned long start = args->start;
+	unsigned long end = args->end;
+	unsigned long addr;
+
+	for (addr = start; addr < end; addr += PAGE_SIZE,
+				       src++, dst++) {
+		struct page *dpage, *spage;
+
+		spage = migrate_pfn_to_page(*src);
+		if (!spage || !(*src & MIGRATE_PFN_MIGRATE))
+			continue;
+		if (!dmirror_device_is_mine(mdevice, spage))
+			continue;
+		spage = spage->zone_device_data;
+
+		dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, addr);
+		if (!dpage)
+			continue;
+
+		lock_page(dpage);
+		copy_highpage(dpage, spage);
+		*dst = migrate_pfn(page_to_pfn(dpage)) |
+			    MIGRATE_PFN_LOCKED;
+		if (*src & MIGRATE_PFN_WRITE)
+			*dst |= MIGRATE_PFN_WRITE;
+	}
+	return 0;
+}
+
+static void dmirror_devmem_fault_finalize_and_map(struct migrate_vma *args,
+						  struct dmirror *dmirror)
+{
+	/* Invalidate the device's page table mapping. */
+	mutex_lock(&dmirror->mutex);
+	dmirror_pt_walk(dmirror, dmirror_do_update, args->start, args->end,
+			NULL, false);
+	mutex_unlock(&dmirror->mutex);
+}
+
+static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
+{
+	struct migrate_vma args;
+	unsigned long src_pfns;
+	unsigned long dst_pfns;
+	struct page *rpage;
+	struct dmirror *dmirror;
+	vm_fault_t ret;
+
+	/* FIXME demonstrate how we can adjust migrate range */
+	args.vma = vmf->vma;
+	args.start = vmf->address;
+	args.end = args.start + PAGE_SIZE;
+	args.src = &src_pfns;
+	args.dst = &dst_pfns;
+
+	if (migrate_vma_setup(&args))
+		return VM_FAULT_SIGBUS;
+
+	/*
+	 * Normally, a device would use the page->zone_device_data to point to
+	 * the mirror but here we use it to hold the page for the simulated
+	 * device memory and that page holds the pointer to the mirror.
+	 */
+	rpage = vmf->page->zone_device_data;
+	dmirror = rpage->zone_device_data;
+
+	ret = dmirror_devmem_fault_alloc_and_copy(&args, dmirror->mdevice);
+	if (ret)
+		return ret;
+	migrate_vma_pages(&args);
+	dmirror_devmem_fault_finalize_and_map(&args, dmirror);
+	migrate_vma_finalize(&args);
+	return 0;
+}
+
+static const struct dev_pagemap_ops dmirror_devmem_ops = {
+	.page_free	= dmirror_devmem_free,
+	.migrate_to_ram	= dmirror_devmem_fault,
+};
+
+static void dmirror_pdev_del(void *arg)
+{
+	struct dmirror_device *mdevice = arg;
+	unsigned int i;
+
+	if (mdevice->devmem_chunks) {
+		for (i = 0; i < mdevice->devmem_count; i++)
+			kfree(mdevice->devmem_chunks[i]);
+		kfree(mdevice->devmem_chunks);
+	}
+
+	cdev_del(&mdevice->cdevice);
+	kfree(mdevice);
+}
+
+static int dmirror_probe(struct platform_device *pdev)
+{
+	struct dmirror_device *mdevice;
+	int ret;
+
+	mdevice = kzalloc(sizeof(*mdevice), GFP_KERNEL);
+	if (!mdevice)
+		return -ENOMEM;
+
+	mdevice->pdevice = pdev;
+	mutex_init(&mdevice->devmem_lock);
+	spin_lock_init(&mdevice->lock);
+
+	cdev_init(&mdevice->cdevice, &dmirror_fops);
+	ret = cdev_add(&mdevice->cdevice, pdev->dev.devt, 1);
+	if (ret) {
+		kfree(mdevice);
+		return ret;
+	}
+
+	platform_set_drvdata(pdev, mdevice);
+	ret = devm_add_action_or_reset(&pdev->dev, dmirror_pdev_del, mdevice);
+	if (ret)
+		return ret;
+
+	/* Build list of free struct page */
+	dmirror_allocate_chunk(mdevice, NULL);
+
+	return 0;
+}
+
+static int dmirror_remove(struct platform_device *pdev)
+{
+	/* all probe actions are unwound by devm */
+	return 0;
+}
+
+static struct platform_driver dmirror_device_driver = {
+	.probe		= dmirror_probe,
+	.remove		= dmirror_remove,
+	.driver		= {
+		.name	= "HMM_DMIRROR",
+	},
+};
+
+static int __init hmm_dmirror_init(void)
+{
+	int ret;
+	int id;
+
+	ret = platform_driver_register(&dmirror_device_driver);
+	if (ret)
+		return ret;
+
+	ret = alloc_chrdev_region(&dmirror_dev, 0, DMIRROR_NDEVICES,
+				  "HMM_DMIRROR");
+	if (ret)
+		goto err_unreg;
+
+	for (id = 0; id < DMIRROR_NDEVICES; id++) {
+		struct platform_device *pd;
+
+		pd = platform_device_alloc("HMM_DMIRROR", id);
+		if (!pd) {
+			ret = -ENOMEM;
+			goto err_chrdev;
+		}
+		pd->dev.devt = MKDEV(MAJOR(dmirror_dev), id);
+		ret = platform_device_add(pd);
+		if (ret) {
+			platform_device_put(pd);
+			goto err_chrdev;
+		}
+		dmirror_platform_devices[id] = pd;
+	}
+
+	/*
+	 * Allocate a zero page to simulate a reserved page of device private
+	 * memory which is always zero. The zero_pfn page isn't used just to
+	 * make the code here simpler (i.e., we need a struct page for it).
+	 */
+	dmirror_zero_page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
+	if (!dmirror_zero_page)
+		goto err_chrdev;
+
+	pr_info("hmm_dmirror loaded. This is only for testing HMM.\n");
+	return 0;
+
+err_chrdev:
+	while (--id >= 0) {
+		platform_device_unregister(dmirror_platform_devices[id]);
+		dmirror_platform_devices[id] = NULL;
+	}
+	unregister_chrdev_region(dmirror_dev, 1);
+err_unreg:
+	platform_driver_unregister(&dmirror_device_driver);
+	return ret;
+}
+
+static void __exit hmm_dmirror_exit(void)
+{
+	int id;
+
+	if (dmirror_zero_page)
+		__free_page(dmirror_zero_page);
+	for (id = 0; id < DMIRROR_NDEVICES; id++)
+		platform_device_unregister(dmirror_platform_devices[id]);
+	unregister_chrdev_region(dmirror_dev, DMIRROR_NDEVICES);
+	platform_driver_unregister(&dmirror_device_driver);
+	mmu_notifier_synchronize();
+}
+
+module_init(hmm_dmirror_init);
+module_exit(hmm_dmirror_exit);
+MODULE_LICENSE("GPL");
diff --git a/include/Kbuild b/include/Kbuild
index ffba79483cc5..6ffb44a45957 100644
--- a/include/Kbuild
+++ b/include/Kbuild
@@ -1063,6 +1063,7 @@ header-test-			+= uapi/linux/coda_psdev.h
 header-test-			+= uapi/linux/errqueue.h
 header-test-			+= uapi/linux/eventpoll.h
 header-test-			+= uapi/linux/hdlc/ioctl.h
+header-test-			+= uapi/linux/hmm_dmirror.h
 header-test-			+= uapi/linux/input.h
 header-test-			+= uapi/linux/kvm.h
 header-test-			+= uapi/linux/kvm_para.h
diff --git a/include/uapi/linux/hmm_dmirror.h b/include/uapi/linux/hmm_dmirror.h
new file mode 100644
index 000000000000..61d3643aff95
--- /dev/null
+++ b/include/uapi/linux/hmm_dmirror.h
@@ -0,0 +1,74 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of
+ * the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+/*
+ * This is a dummy driver to exercise the HMM (heterogeneous memory management)
+ * API of the kernel. It allows a userspace program to expose its entire address
+ * space through the HMM dummy driver file.
+ */
+#ifndef _UAPI_LINUX_HMM_DMIRROR_H
+#define _UAPI_LINUX_HMM_DMIRROR_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+/*
+ * Structure to pass to the HMM test driver to mimic a device accessing
+ * system memory and ZONE_DEVICE private memory through device page tables.
+ *
+ * @addr: (in) user address the device will read/write
+ * @ptr: (in) user address where device data is copied to/from
+ * @npages: (in) number of pages to read/write
+ * @cpages: (out) number of pages copied
+ * @faults: (out) number of device page faults seen
+ */
+struct hmm_dmirror_cmd {
+	__u64		addr;
+	__u64		ptr;
+	__u64		npages;
+	__u64		cpages;
+	__u64		faults;
+};
+
+/* Expose the address space of the calling process through hmm dummy dev file */
+#define HMM_DMIRROR_READ		_IOWR('H', 0x00, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_WRITE		_IOWR('H', 0x01, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_MIGRATE		_IOWR('H', 0x02, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_SNAPSHOT		_IOWR('H', 0x03, struct hmm_dmirror_cmd)
+
+/*
+ * Values returned in hmm_dmirror_cmd.ptr for HMM_DMIRROR_SNAPSHOT.
+ * HMM_DMIRROR_PROT_ERROR: no valid mirror PTE for this page
+ * HMM_DMIRROR_PROT_NONE: unpopulated PTE or PTE with no access
+ * HMM_DMIRROR_PROT_READ: read-only PTE
+ * HMM_DMIRROR_PROT_WRITE: read/write PTE
+ * HMM_DMIRROR_PROT_ZERO: special read-only zero page
+ * HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL: Migrated device private page on the
+ *					device the ioctl() is made
+ * HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE: Migrated device private page on some
+ *					other device
+ */
+enum {
+	HMM_DMIRROR_PROT_ERROR			= 0xFF,
+	HMM_DMIRROR_PROT_NONE			= 0x00,
+	HMM_DMIRROR_PROT_READ			= 0x01,
+	HMM_DMIRROR_PROT_WRITE			= 0x02,
+	HMM_DMIRROR_PROT_ZERO			= 0x10,
+	HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL	= 0x20,
+	HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE	= 0x30,
+};
+
+#endif /* _UAPI_LINUX_HMM_DMIRROR_H */
diff --git a/tools/testing/selftests/vm/.gitignore b/tools/testing/selftests/vm/.gitignore
index 31b3c98b6d34..3054565b3f07 100644
--- a/tools/testing/selftests/vm/.gitignore
+++ b/tools/testing/selftests/vm/.gitignore
@@ -14,3 +14,4 @@ virtual_address_range
 gup_benchmark
 va_128TBswitch
 map_fixed_noreplace
+hmm-tests
diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index 9534dc2bc929..5643cfb5e3d6 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -5,6 +5,7 @@ CFLAGS = -Wall -I ../../../../usr/include $(EXTRA_CFLAGS)
 LDLIBS = -lrt
 TEST_GEN_FILES = compaction_test
 TEST_GEN_FILES += gup_benchmark
+TEST_GEN_FILES += hmm-tests
 TEST_GEN_FILES += hugepage-mmap
 TEST_GEN_FILES += hugepage-shm
 TEST_GEN_FILES += map_hugetlb
@@ -26,6 +27,8 @@ TEST_FILES := test_vmalloc.sh
 KSFT_KHDR_INSTALL := 1
 include ../lib.mk
 
+$(OUTPUT)/hmm-tests: LDLIBS += -lhugetlbfs -lpthread
+
 $(OUTPUT)/userfaultfd: LDLIBS += -lpthread
 
 $(OUTPUT)/mlock-random-test: LDLIBS += -lcap
diff --git a/tools/testing/selftests/vm/config b/tools/testing/selftests/vm/config
index 1c0d76cb5adf..34cfab18e737 100644
--- a/tools/testing/selftests/vm/config
+++ b/tools/testing/selftests/vm/config
@@ -1,2 +1,5 @@
 CONFIG_SYSVIPC=y
 CONFIG_USERFAULTFD=y
+CONFIG_HMM_MIRROR=y
+CONFIG_DEVICE_PRIVATE=y
+CONFIG_HMM_DMIRROR=m
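
Assuming a kernel built with the config fragment above (CONFIG_HMM_DMIRROR=m)
and the libhugetlbfs development files installed (hmm-tests links against
-lhugetlbfs), the usual kselftest flow should be enough to build and run the
new test, for example:

	# build the vm selftests, including hmm-tests
	make -C tools/testing/selftests TARGETS=vm

	# load hmm_dmirror, create the /dev/hmm_dmirror* nodes, run the tests
	cd tools/testing/selftests/vm
	sudo ./test_hmm.sh smoke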
diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c
new file mode 100644
index 000000000000..f4ae6188fd0e
--- /dev/null
+++ b/tools/testing/selftests/vm/hmm-tests.c
@@ -0,0 +1,1311 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright 2013 Red Hat Inc.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 of
+ * the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * Authors: Jérôme Glisse <jglisse@redhat.com>
+ */
+
+/*
+ * HMM stands for Heterogeneous Memory Management, it is a helper layer inside
+ * the linux kernel to help device drivers mirror a process address space in
+ * the device. This allows the device to use the same address space which
+ * makes communication and data exchange a lot easier.
+ *
+ * This framework's sole purpose is to exercise various code paths inside
+ * the kernel to make sure that HMM performs as expected and to flush out any
+ * bugs.
+ */
+
+#include "../kselftest_harness.h"
+
+#include <errno.h>
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <unistd.h>
+#include <strings.h>
+#include <time.h>
+#include <pthread.h>
+#include <hugetlbfs.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/mman.h>
+#include <sys/ioctl.h>
+#include <linux/hmm_dmirror.h>
+
+struct hmm_buffer {
+	void		*ptr;
+	void		*mirror;
+	unsigned long	size;
+	int		fd;
+	uint64_t	cpages;
+	uint64_t	faults;
+};
+
+#define TWOMEG		(1 << 21)
+#define HMM_BUFFER_SIZE (1024 << 12)
+#define HMM_PATH_MAX    64
+#define NTIMES		256
+
+#define ALIGN(x, a) (((x) + ((a) - 1)) & (~((a) - 1)))
+
+FIXTURE(hmm)
+{
+	int		fd;
+	unsigned int	page_size;
+	unsigned int	page_shift;
+};
+
+FIXTURE(hmm2)
+{
+	int		fd0;
+	int		fd1;
+	unsigned int	page_size;
+	unsigned int	page_shift;
+};
+
+static int hmm_open(int unit)
+{
+	char pathname[HMM_PATH_MAX];
+	int fd;
+
+	snprintf(pathname, sizeof(pathname), "/dev/hmm_dmirror%d", unit);
+	fd = open(pathname, O_RDWR, 0);
+	if (fd < 0)
+		fprintf(stderr, "could not open hmm dmirror driver (%s)\n",
+			pathname);
+	return fd;
+}
+
+FIXTURE_SETUP(hmm)
+{
+	self->page_size = sysconf(_SC_PAGE_SIZE);
+	self->page_shift = ffs(self->page_size) - 1;
+
+	self->fd = hmm_open(0);
+	ASSERT_GE(self->fd, 0);
+}
+
+FIXTURE_SETUP(hmm2)
+{
+	self->page_size = sysconf(_SC_PAGE_SIZE);
+	self->page_shift = ffs(self->page_size) - 1;
+
+	self->fd0 = hmm_open(0);
+	ASSERT_GE(self->fd0, 0);
+	self->fd1 = hmm_open(1);
+	ASSERT_GE(self->fd1, 0);
+}
+
+FIXTURE_TEARDOWN(hmm)
+{
+	int ret = close(self->fd);
+
+	ASSERT_EQ(ret, 0);
+	self->fd = -1;
+}
+
+FIXTURE_TEARDOWN(hmm2)
+{
+	int ret = close(self->fd0);
+
+	ASSERT_EQ(ret, 0);
+	self->fd0 = -1;
+
+	ret = close(self->fd1);
+	ASSERT_EQ(ret, 0);
+	self->fd1 = -1;
+}
+
+static int hmm_dmirror_cmd(int fd,
+			   unsigned long request,
+			   struct hmm_buffer *buffer,
+			   unsigned long npages)
+{
+	struct hmm_dmirror_cmd cmd;
+	int ret;
+
+	/* Fill in the command structure to pass to the driver. */
+	cmd.addr = (__u64)buffer->ptr;
+	cmd.ptr = (__u64)buffer->mirror;
+	cmd.npages = npages;
+
+	for (;;) {
+		ret = ioctl(fd, request, &cmd);
+		if (ret == 0)
+			break;
+		if (errno == EINTR)
+			continue;
+		return -errno;
+	}
+	buffer->cpages = cmd.cpages;
+	buffer->faults = cmd.faults;
+
+	return 0;
+}
+
+static void hmm_buffer_free(struct hmm_buffer *buffer)
+{
+	if (buffer == NULL)
+		return;
+
+	if (buffer->ptr)
+		munmap(buffer->ptr, buffer->size);
+	free(buffer->mirror);
+	free(buffer);
+}
+
+/*
+ * Create a temporary file that will be deleted on close.
+ */
+static int hmm_create_file(unsigned long size)
+{
+	char path[HMM_PATH_MAX];
+	int fd;
+
+	strcpy(path, "/tmp");
+	fd = open(path, O_TMPFILE | O_EXCL | O_RDWR, 0600);
+	if (fd >= 0) {
+		int r;
+
+		do {
+			r = ftruncate(fd, size);
+		} while (r == -1 && errno == EINTR);
+		if (!r)
+			return fd;
+		close(fd);
+	}
+	return -1;
+}
+
+/*
+ * Return a random unsigned number.
+ */
+static unsigned int hmm_random(void)
+{
+	static int fd = -1;
+	unsigned int r;
+
+	if (fd < 0) {
+		fd = open("/dev/urandom", O_RDONLY);
+		if (fd < 0) {
+			fprintf(stderr, "%s:%d failed to open /dev/urandom\n",
+					__FILE__, __LINE__);
+			return ~0U;
+		}
+	}
+	read(fd, &r, sizeof(r));
+	return r;
+}
+
+static void hmm_nanosleep(unsigned int n)
+{
+	struct timespec t;
+
+	t.tv_sec = 0;
+	t.tv_nsec = n;
+	nanosleep(&t, NULL);
+}
+
+/*
+ * Simple NULL test of device open/close.
+ */
+TEST_F(hmm, open_close)
+{
+}
+
+/*
+ * Read private anonymous memory.
+ */
+TEST_F(hmm, anon_read)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	int ret;
+	int val;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/*
+	 * Initialize buffer in system memory but leave the first two pages
+	 * zero (pte_none and pfn_zero).
+	 */
+	i = 2 * self->page_size / sizeof(*ptr);
+	for (ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Set buffer permission to read-only. */
+	ret = mprotect(buffer->ptr, size, PROT_READ);
+	ASSERT_EQ(ret, 0);
+
+	/* Populate the CPU page table with a special zero page. */
+	val = *(int *)(buffer->ptr + self->page_size);
+	ASSERT_EQ(val, 0);
+
+	/* Simulate a device reading system memory. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_READ, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+	ASSERT_EQ(buffer->faults, 1);
+
+	/* Check what the device read. */
+	ptr = buffer->mirror;
+	for (i = 0; i < 2 * self->page_size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], 0);
+	for (; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Read private anonymous memory which has been protected with
+ * mprotect() PROT_NONE.
+ */
+TEST_F(hmm, anon_read_prot)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	int ret;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/* Initialize buffer in system memory. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Initialize mirror buffer so we can verify it isn't written. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ptr[i] = -i;
+
+	/* Protect buffer from reading. */
+	ret = mprotect(buffer->ptr, size, PROT_NONE);
+	ASSERT_EQ(ret, 0);
+
+	/* Simulate a device reading system memory. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_READ, buffer, npages);
+	ASSERT_EQ(ret, -EFAULT);
+
+	/* Allow CPU to read the buffer so we can check it. */
+	ret = mprotect(buffer->ptr, size, PROT_READ);
+	ASSERT_EQ(ret, 0);
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], -i);
+
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Write private anonymous memory.
+ */
+TEST_F(hmm, anon_write)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	int ret;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/* Initialize data that the device will write to buffer->ptr. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Simulate a device writing system memory. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_WRITE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+	ASSERT_EQ(buffer->faults, 1);
+
+	/* Check what the device wrote. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Write private anonymous memory which has been protected with
+ * mprotect() PROT_READ.
+ */
+TEST_F(hmm, anon_write_prot)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	int ret;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/* Simulate a device reading a zero page of memory. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_READ, buffer, 1);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, 1);
+	ASSERT_EQ(buffer->faults, 1);
+
+	/* Initialize data that the device will write to buffer->ptr. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Simulate a device writing system memory. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_WRITE, buffer, npages);
+	ASSERT_EQ(ret, -EPERM);
+
+	/* Check what the device wrote. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], 0);
+
+	/* Now allow writing and see that the zero page is replaced. */
+	ret = mprotect(buffer->ptr, size, PROT_WRITE | PROT_READ);
+	ASSERT_EQ(ret, 0);
+
+	/* Simulate a device writing system memory. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_WRITE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+	ASSERT_EQ(buffer->faults, 1);
+
+	/* Check what the device wrote. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Check that a device writing an anonymous private mapping
+ * will copy-on-write if a child process inherits the mapping.
+ */
+TEST_F(hmm, anon_write_child)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	pid_t pid;
+	int child_fd;
+	int ret;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/* Initialize buffer->ptr so we can tell if it is written. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Initialize data that the device will write to buffer->ptr. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ptr[i] = -i;
+
+	pid = fork();
+	if (pid == -1)
+		ASSERT_EQ(pid, 0);
+	if (pid != 0) {
+		waitpid(pid, &ret, 0);
+		ASSERT_EQ(WIFEXITED(ret), 1);
+
+		/* Check that the parent's buffer did not change. */
+		for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+			ASSERT_EQ(ptr[i], i);
+		return;
+	}
+
+	/* Check that we see the parent's values. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], -i);
+
+	/* The child process needs its own mirror to its own mm. */
+	child_fd = hmm_open(0);
+	ASSERT_GE(child_fd, 0);
+
+	/* Simulate a device writing system memory. */
+	ret = hmm_dmirror_cmd(child_fd, HMM_DMIRROR_WRITE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+	ASSERT_EQ(buffer->faults, 1);
+
+	/* Check what the device wrote. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], -i);
+
+	close(child_fd);
+	exit(0);
+}
+
+/*
+ * Check that a device writing an anonymous shared mapping
+ * will not copy-on-write if a child process inherits the mapping.
+ */
+TEST_F(hmm, anon_write_child_shared)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	pid_t pid;
+	int child_fd;
+	int ret;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_SHARED | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/* Initialize buffer->ptr so we can tell if it is written. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Initialize data that the device will write to buffer->ptr. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ptr[i] = -i;
+
+	pid = fork();
+	if (pid == -1)
+		ASSERT_EQ(pid, 0);
+	if (pid != 0) {
+		waitpid(pid, &ret, 0);
+		ASSERT_EQ(WIFEXITED(ret), 1);
+
+		/* Check that the parent's buffer did change. */
+		for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+			ASSERT_EQ(ptr[i], -i);
+		return;
+	}
+
+	/* Check that we see the parent's values. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], -i);
+
+	/* The child process needs its own mirror to its own mm. */
+	child_fd = hmm_open(0);
+	ASSERT_GE(child_fd, 0);
+
+	/* Simulate a device writing system memory. */
+	ret = hmm_dmirror_cmd(child_fd, HMM_DMIRROR_WRITE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+	ASSERT_EQ(buffer->faults, 1);
+
+	/* Check what the device wrote. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], -i);
+
+	close(child_fd);
+	exit(0);
+}
+
+/*
+ * Write private anonymous huge page.
+ */
+TEST_F(hmm, anon_write_huge)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	void *old_ptr;
+	void *map;
+	int *ptr;
+	int ret;
+
+	size = 2 * TWOMEG;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	size = TWOMEG;
+	npages = size >> self->page_shift;
+	map = (void *)ALIGN((uintptr_t)buffer->ptr, size);
+	ret = madvise(map, size, MADV_HUGEPAGE);
+	ASSERT_EQ(ret, 0);
+	old_ptr = buffer->ptr;
+	buffer->ptr = map;
+
+	/* Initialize data that the device will write to buffer->ptr. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Simulate a device writing system memory. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_WRITE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+	ASSERT_EQ(buffer->faults, 1);
+
+	/* Check what the device wrote. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	buffer->ptr = old_ptr;
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Write huge TLBFS page.
+ */
+TEST_F(hmm, anon_write_hugetlbfs)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	int ret;
+	long pagesizes[4];
+	int n, idx;
+
+	/* Skip test if we can't allocate a hugetlbfs page. */
+
+	n = gethugepagesizes(pagesizes, 4);
+	if (n <= 0)
+		return;
+	for (idx = 0; --n > 0; ) {
+		if (pagesizes[n] < pagesizes[idx])
+			idx = n;
+	}
+	size = ALIGN(TWOMEG, pagesizes[idx]);
+	npages = size >> self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->ptr = get_hugepage_region(size, GHR_STRICT);
+	if (buffer->ptr == NULL) {
+		free(buffer);
+		return;
+	}
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	/* Initialize data that the device will write to buffer->ptr. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Simulate a device writing system memory. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_WRITE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+	ASSERT_EQ(buffer->faults, 1);
+
+	/* Check what the device wrote. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	free_hugepage_region(buffer->ptr);
+	buffer->ptr = NULL;
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Read mmap'ed file memory.
+ */
+TEST_F(hmm, file_read)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	int ret;
+	int fd;
+	off_t off;
+	ssize_t len;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	fd = hmm_create_file(size);
+	ASSERT_GE(fd, 0);
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = fd;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	/* Write initial contents of the file. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+	off = lseek(fd, 0, SEEK_SET);
+	ASSERT_EQ(off, 0);
+	len = write(fd, buffer->mirror, size);
+	ASSERT_EQ(len, size);
+	memset(buffer->mirror, 0, size);
+
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ,
+			   MAP_SHARED,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/* Simulate a device reading system memory. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_READ, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+	ASSERT_EQ(buffer->faults, 1);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Write mmap'ed file memory.
+ */
+TEST_F(hmm, file_write)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	int ret;
+	int fd;
+	off_t off;
+	ssize_t len;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	fd = hmm_create_file(size);
+	ASSERT_GE(fd, 0);
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = fd;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_SHARED,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/* Initialize data that the device will write to buffer->ptr. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Simulate a device writing system memory. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_WRITE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+	ASSERT_EQ(buffer->faults, 1);
+
+	/* Check what the device wrote. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	/* Check that the device also wrote the file. */
+	off = lseek(fd, 0, SEEK_SET);
+	ASSERT_EQ(off, 0);
+	len = read(fd, buffer->mirror, size);
+	ASSERT_EQ(len, size);
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Migrate anonymous memory to device private memory.
+ */
+TEST_F(hmm, migrate)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	int ret;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/* Initialize buffer in system memory. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Migrate memory to device. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Migrate anonymous memory to device private memory and fault it back to system
+ * memory.
+ */
+TEST_F(hmm, migrate_fault)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	int ret;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/* Initialize buffer in system memory. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Migrate memory to device. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	/* Fault pages back to system memory and check them. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Try to migrate various memory types to device private memory.
+ */
+TEST_F(hmm2, migrate_mixed)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	int *ptr;
+	unsigned char *p;
+	int ret;
+	int val;
+
+	npages = 6;
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	/* Reserve a range of addresses. */
+	buffer->ptr = mmap(NULL, size,
+			   PROT_NONE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+	p = buffer->ptr;
+
+	/* Now try to migrate everything to device 1. */
+	ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, 6);
+
+	/* Punch a hole after the first page address. */
+	ret = munmap(buffer->ptr + self->page_size, self->page_size);
+	ASSERT_EQ(ret, 0);
+
+	/* We expect an error if the vma doesn't cover the range. */
+	ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, 3);
+	ASSERT_EQ(ret, -EINVAL);
+
+	/* Page 2 will be a read-only zero page. */
+	ret = mprotect(buffer->ptr + 2 * self->page_size, self->page_size,
+				PROT_READ);
+	ASSERT_EQ(ret, 0);
+	ptr = (int *)(buffer->ptr + 2 * self->page_size);
+	val = *ptr + 3;
+	ASSERT_EQ(val, 3);
+
+	/* Page 3 will be read-only. */
+	ret = mprotect(buffer->ptr + 3 * self->page_size, self->page_size,
+				PROT_READ | PROT_WRITE);
+	ASSERT_EQ(ret, 0);
+	ptr = (int *)(buffer->ptr + 3 * self->page_size);
+	*ptr = val;
+	ret = mprotect(buffer->ptr + 3 * self->page_size, self->page_size,
+				PROT_READ);
+	ASSERT_EQ(ret, 0);
+
+	/* Page 4 will be read-write. */
+	ret = mprotect(buffer->ptr + 4 * self->page_size, self->page_size,
+				PROT_READ | PROT_WRITE);
+	ASSERT_EQ(ret, 0);
+	ptr = (int *)(buffer->ptr + 4 * self->page_size);
+	*ptr = val;
+
+	/* Page 5 won't be migrated to device 0 because it's on device 1. */
+	buffer->ptr = p + 5 * self->page_size;
+	ret = hmm_dmirror_cmd(self->fd0, HMM_DMIRROR_MIGRATE, buffer, 1);
+	ASSERT_EQ(ret, -ENOENT);
+	buffer->ptr = p;
+
+	/* Now try to migrate pages 2-3 to device 1. */
+	buffer->ptr = p + 2 * self->page_size;
+	ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, 2);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, 2);
+	buffer->ptr = p;
+
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Migrate anonymous memory to device private memory and fault it back to system
+ * memory multiple times.
+ */
+TEST_F(hmm, migrate_multiple)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	unsigned long c;
+	int *ptr;
+	int ret;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	for (c = 0; c < NTIMES; c++) {
+		buffer = malloc(sizeof(*buffer));
+		ASSERT_NE(buffer, NULL);
+
+		buffer->fd = -1;
+		buffer->size = size;
+		buffer->mirror = malloc(size);
+		ASSERT_NE(buffer->mirror, NULL);
+
+		buffer->ptr = mmap(NULL, size,
+				   PROT_READ | PROT_WRITE,
+				   MAP_PRIVATE | MAP_ANONYMOUS,
+				   buffer->fd, 0);
+		ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+		/* Initialize buffer in system memory. */
+		for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+			ptr[i] = i;
+
+		/* Migrate memory to device. */
+		ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer,
+				      npages);
+		ASSERT_EQ(ret, 0);
+		ASSERT_EQ(buffer->cpages, npages);
+
+		/* Check what the device read. */
+		for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+			ASSERT_EQ(ptr[i], i);
+
+		/* Fault pages back to system memory and check them. */
+		for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+			ASSERT_EQ(ptr[i], i);
+
+		hmm_buffer_free(buffer);
+	}
+}
+
+/*
+ * Read anonymous memory multiple times.
+ */
+TEST_F(hmm, anon_read_multiple)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	unsigned long c;
+	int *ptr;
+	int ret;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	for (c = 0; c < NTIMES; c++) {
+		buffer = malloc(sizeof(*buffer));
+		ASSERT_NE(buffer, NULL);
+
+		buffer->fd = -1;
+		buffer->size = size;
+		buffer->mirror = malloc(size);
+		ASSERT_NE(buffer->mirror, NULL);
+
+		buffer->ptr = mmap(NULL, size,
+				   PROT_READ | PROT_WRITE,
+				   MAP_PRIVATE | MAP_ANONYMOUS,
+				   buffer->fd, 0);
+		ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+		/* Initialize buffer in system memory. */
+		for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+			ptr[i] = i + c;
+
+		/* Simulate a device reading system memory. */
+		ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_READ, buffer,
+				      npages);
+		ASSERT_EQ(ret, 0);
+		ASSERT_EQ(buffer->cpages, npages);
+		ASSERT_EQ(buffer->faults, 1);
+
+		/* Check what the device read. */
+		for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+			ASSERT_EQ(ptr[i], i + c);
+
+		hmm_buffer_free(buffer);
+	}
+}
+
+void *unmap_buffer(void *p)
+{
+	struct hmm_buffer *buffer = p;
+
+	/* Delay for a bit and then unmap buffer while it is being read. */
+	hmm_nanosleep(hmm_random() % 32000);
+	munmap(buffer->ptr + buffer->size / 2, buffer->size / 2);
+	buffer->ptr = NULL;
+
+	return NULL;
+}
+
+/*
+ * Try reading anonymous memory while it is being unmapped.
+ */
+TEST_F(hmm, anon_teardown)
+{
+	unsigned long npages;
+	unsigned long size;
+	unsigned long c;
+	void *ret;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	for (c = 0; c < NTIMES; ++c) {
+		pthread_t thread;
+		struct hmm_buffer *buffer;
+		unsigned long i;
+		int *ptr;
+		int rc;
+
+		buffer = malloc(sizeof(*buffer));
+		ASSERT_NE(buffer, NULL);
+
+		buffer->fd = -1;
+		buffer->size = size;
+		buffer->mirror = malloc(size);
+		ASSERT_NE(buffer->mirror, NULL);
+
+		buffer->ptr = mmap(NULL, size,
+				   PROT_READ | PROT_WRITE,
+				   MAP_PRIVATE | MAP_ANONYMOUS,
+				   buffer->fd, 0);
+		ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+		/* Initialize buffer in system memory. */
+		for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+			ptr[i] = i + c;
+
+		rc = pthread_create(&thread, NULL, unmap_buffer, buffer);
+		ASSERT_EQ(rc, 0);
+
+		/* Simulate a device reading system memory. */
+		rc = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_READ, buffer,
+				     npages);
+		if (rc == 0) {
+			ASSERT_EQ(buffer->cpages, npages);
+			ASSERT_EQ(buffer->faults, 1);
+
+			/* Check what the device read. */
+			for (i = 0, ptr = buffer->mirror;
+			     i < size / sizeof(*ptr);
+			     ++i)
+				ASSERT_EQ(ptr[i], i + c);
+		}
+
+		pthread_join(thread, &ret);
+		hmm_buffer_free(buffer);
+	}
+}
+
+/*
+ * Test memory snapshot without faulting in pages accessed by the device.
+ */
+TEST_F(hmm2, snapshot)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	int *ptr;
+	unsigned char *p;
+	unsigned char *m;
+	int ret;
+	int val;
+
+	npages = 7;
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(npages);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	/* Reserve a range of addresses. */
+	buffer->ptr = mmap(NULL, size,
+			   PROT_NONE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+	p = buffer->ptr;
+
+	/* Punch a hole after the first page address. */
+	ret = munmap(buffer->ptr + self->page_size, self->page_size);
+	ASSERT_EQ(ret, 0);
+
+	/* Page 2 will be read-only zero page. */
+	ret = mprotect(buffer->ptr + 2 * self->page_size, self->page_size,
+				PROT_READ);
+	ASSERT_EQ(ret, 0);
+	ptr = (int *)(buffer->ptr + 2 * self->page_size);
+	val = *ptr + 3;
+	ASSERT_EQ(val, 3);
+
+	/* Page 3 will be read-only. */
+	ret = mprotect(buffer->ptr + 3 * self->page_size, self->page_size,
+				PROT_READ | PROT_WRITE);
+	ASSERT_EQ(ret, 0);
+	ptr = (int *)(buffer->ptr + 3 * self->page_size);
+	*ptr = val;
+	ret = mprotect(buffer->ptr + 3 * self->page_size, self->page_size,
+				PROT_READ);
+	ASSERT_EQ(ret, 0);
+
+	/* Page 4-6 will be read-write. */
+	ret = mprotect(buffer->ptr + 4 * self->page_size, 3 * self->page_size,
+				PROT_READ | PROT_WRITE);
+	ASSERT_EQ(ret, 0);
+	ptr = (int *)(buffer->ptr + 4 * self->page_size);
+	*ptr = val;
+
+	/* Page 5 will be migrated to device 0. */
+	buffer->ptr = p + 5 * self->page_size;
+	ret = hmm_dmirror_cmd(self->fd0, HMM_DMIRROR_MIGRATE, buffer, 1);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, 1);
+
+	/* Page 6 will be migrated to device 1. */
+	buffer->ptr = p + 6 * self->page_size;
+	ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, 1);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, 1);
+
+	/* Simulate a device snapshotting CPU pagetables. */
+	buffer->ptr = p;
+	ret = hmm_dmirror_cmd(self->fd0, HMM_DMIRROR_SNAPSHOT, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/* Check what the device saw. */
+	m = buffer->mirror;
+	ASSERT_EQ(m[0], HMM_DMIRROR_PROT_NONE);
+	ASSERT_EQ(m[1], HMM_DMIRROR_PROT_NONE);
+	ASSERT_EQ(m[2], HMM_DMIRROR_PROT_ZERO | HMM_DMIRROR_PROT_READ);
+	ASSERT_EQ(m[3], HMM_DMIRROR_PROT_READ);
+	ASSERT_EQ(m[4], HMM_DMIRROR_PROT_WRITE);
+	ASSERT_EQ(m[5], HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL |
+			HMM_DMIRROR_PROT_WRITE);
+	ASSERT_EQ(m[6], HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE |
+			HMM_DMIRROR_PROT_WRITE);
+
+	hmm_buffer_free(buffer);
+}
+
+TEST_HARNESS_MAIN
diff --git a/tools/testing/selftests/vm/run_vmtests b/tools/testing/selftests/vm/run_vmtests
index 951c507a27f7..634cfefdaffd 100755
--- a/tools/testing/selftests/vm/run_vmtests
+++ b/tools/testing/selftests/vm/run_vmtests
@@ -227,4 +227,20 @@ else
 	exitcode=1
 fi
 
+echo "------------------------------------"
+echo "running HMM smoke test"
+echo "------------------------------------"
+./test_hmm.sh smoke
+ret_val=$?
+
+if [ $ret_val -eq 0 ]; then
+	echo "[PASS]"
+elif [ $ret_val -eq $ksft_skip ]; then
+	echo "[SKIP]"
+	exitcode=$ksft_skip
+else
+	echo "[FAIL]"
+	exitcode=1
+fi
+
 exit $exitcode
diff --git a/tools/testing/selftests/vm/test_hmm.sh b/tools/testing/selftests/vm/test_hmm.sh
new file mode 100755
index 000000000000..268d32752045
--- /dev/null
+++ b/tools/testing/selftests/vm/test_hmm.sh
@@ -0,0 +1,97 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Copyright (C) 2018 Uladzislau Rezki (Sony) <urezki@gmail.com>
+#
+# This is a test script for the HMM test driver (hmm_dmirror). It is just
+# a kernel module loader: it loads the driver, creates the device nodes
+# that hmm-tests expects, runs that selftest program, and then unloads the
+# driver again. Currently only a basic smoke test is provided:
+#     ./test_hmm.sh smoke
+
+TEST_NAME="test_hmm"
+DRIVER="hmm_dmirror"
+
+# 1 if fails
+exitcode=1
+
+# Kselftest framework requirement - SKIP code is 4.
+ksft_skip=4
+
+check_test_requirements()
+{
+	uid=$(id -u)
+	if [ $uid -ne 0 ]; then
+		echo "$0: Must be run as root"
+		exit $ksft_skip
+	fi
+
+	if ! which modprobe > /dev/null 2>&1; then
+		echo "$0: You need modprobe installed"
+		exit $ksft_skip
+	fi
+
+	if ! modinfo $DRIVER > /dev/null 2>&1; then
+		echo "$0: You must have the following enabled in your kernel:"
+		echo "CONFIG_HMM_DMIRROR=m"
+		exit $ksft_skip
+	fi
+}
+
+load_driver()
+{
+	modprobe $DRIVER > /dev/null 2>&1
+	if [ $? == 0 ]; then
+		major=$(awk "\$2==\"HMM_DMIRROR\" {print \$1}" /proc/devices)
+		mknod /dev/hmm_dmirror0 c $major 0
+		mknod /dev/hmm_dmirror1 c $major 1
+	fi
+}
+
+unload_driver()
+{
+	modprobe -r $DRIVER > /dev/null 2>&1
+	rm -f /dev/hmm_dmirror?
+}
+
+run_smoke()
+{
+	echo "Running smoke test. Note, this test provides basic coverage."
+
+	load_driver
+	./hmm-tests
+	unload_driver
+}
+
+usage()
+{
+	echo -n "Usage: $0"
+	echo
+	echo "Example usage:"
+	echo
+	echo "# Shows help message"
+	echo "./${TEST_NAME}.sh"
+	echo
+	echo "# Smoke testing"
+	echo "./${TEST_NAME}.sh smoke"
+	echo
+	exit 0
+}
+
+function run_test()
+{
+	if [ $# -eq 0 ]; then
+		usage
+	else
+		if [ "$1" = "smoke" ]; then
+			run_smoke
+		else
+			usage
+		fi
+	fi
+}
+
+check_test_requirements
+run_test "$@"
+
+exit 0
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 2/3] mm/hmm: allow snapshot of the special zero page
  2019-10-23 19:55 ` [PATCH v3 2/3] mm/hmm: allow snapshot of the special zero page Ralph Campbell
@ 2019-10-23 20:27   ` Jerome Glisse
  2019-10-24  9:27   ` David Hildenbrand
  2019-10-29 17:27   ` Jason Gunthorpe
  2 siblings, 0 replies; 19+ messages in thread
From: Jerome Glisse @ 2019-10-23 20:27 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: John Hubbard, Christoph Hellwig, Jason Gunthorpe, linux-rdma,
	linux-mm, linux-kernel

On Wed, Oct 23, 2019 at 12:55:14PM -0700, Ralph Campbell wrote:
> If a device driver like nouveau tries to use hmm_range_fault() to access
> the special shared zero page in system memory, hmm_range_fault() will
> return -EFAULT and kill the process.
> Allow hmm_range_fault() to return success (0) when the CPU pagetable
> entry points to the special shared zero page.
> page_to_pfn() and pfn_to_page() are defined on the zero page so just
> handle it like any other page.
> 
> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
> Cc: Christoph Hellwig <hch@lst.de>

Reviewed-by: "Jérôme Glisse" <jglisse@redhat.com>

> Cc: Jason Gunthorpe <jgg@mellanox.com>
> ---
>  mm/hmm.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index acf7a664b38c..8c96c9ddcae5 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -529,8 +529,14 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
>  		if (unlikely(!hmm_vma_walk->pgmap))
>  			return -EBUSY;
>  	} else if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pte_special(pte)) {
> -		*pfn = range->values[HMM_PFN_SPECIAL];
> -		return -EFAULT;
> +		if (!is_zero_pfn(pte_pfn(pte))) {
> +			*pfn = range->values[HMM_PFN_SPECIAL];
> +			return -EFAULT;
> +		}
> +		/*
> +		 * Since each architecture defines a struct page for the zero
> +		 * page, just fall through and treat it like a normal page.
> +		 */
>  	}
>  
>  	*pfn = hmm_device_entry_from_pfn(range, pte_pfn(pte)) | cpu_flags;
> -- 
> 2.20.1
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 3/3] mm/hmm/test: add self tests for HMM
  2019-10-23 19:55 ` [PATCH v3 3/3] mm/hmm/test: add self tests for HMM Ralph Campbell
@ 2019-10-23 20:28   ` Jerome Glisse
  2019-10-23 21:55     ` Ralph Campbell
  2019-10-29 17:58   ` Jason Gunthorpe
  1 sibling, 1 reply; 19+ messages in thread
From: Jerome Glisse @ 2019-10-23 20:28 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: John Hubbard, Christoph Hellwig, Jason Gunthorpe, linux-rdma,
	linux-mm, linux-kernel

On Wed, Oct 23, 2019 at 12:55:15PM -0700, Ralph Campbell wrote:
> Add self tests for HMM.
> 
> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>

You can add my signoff

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>


> ---
>  MAINTAINERS                            |    3 +
>  drivers/char/Kconfig                   |   11 +
>  drivers/char/Makefile                  |    1 +
>  drivers/char/hmm_dmirror.c             | 1566 ++++++++++++++++++++++++
>  include/Kbuild                         |    1 +
>  include/uapi/linux/hmm_dmirror.h       |   74 ++
>  tools/testing/selftests/vm/.gitignore  |    1 +
>  tools/testing/selftests/vm/Makefile    |    3 +
>  tools/testing/selftests/vm/config      |    3 +
>  tools/testing/selftests/vm/hmm-tests.c | 1311 ++++++++++++++++++++
>  tools/testing/selftests/vm/run_vmtests |   16 +
>  tools/testing/selftests/vm/test_hmm.sh |   97 ++
>  12 files changed, 3087 insertions(+)
>  create mode 100644 drivers/char/hmm_dmirror.c
>  create mode 100644 include/uapi/linux/hmm_dmirror.h
>  create mode 100644 tools/testing/selftests/vm/hmm-tests.c
>  create mode 100755 tools/testing/selftests/vm/test_hmm.sh
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 296de2b51c83..9890b6b8eea0 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -7427,8 +7427,11 @@ M:	Jérôme Glisse <jglisse@redhat.com>
>  L:	linux-mm@kvack.org
>  S:	Maintained
>  F:	mm/hmm*
> +F:	drivers/char/hmm*
>  F:	include/linux/hmm*
> +F:	include/uapi/linux/hmm*
>  F:	Documentation/vm/hmm.rst
> +F:	tools/testing/selftests/vm/*hmm*
>  
>  HOST AP DRIVER
>  M:	Jouni Malinen <j@w1.fi>
> diff --git a/drivers/char/Kconfig b/drivers/char/Kconfig
> index df0fc997dc3e..cc8ddb99550d 100644
> --- a/drivers/char/Kconfig
> +++ b/drivers/char/Kconfig
> @@ -535,6 +535,17 @@ config ADI
>  	  and SSM (Silicon Secured Memory).  Intended consumers of this
>  	  driver include crash and makedumpfile.
>  
> +config HMM_DMIRROR
> +	tristate "HMM driver for testing Heterogeneous Memory Management"
> +	depends on HMM_MIRROR
> +	depends on DEVICE_PRIVATE
> +	help
> +	  This is a pseudo device driver solely for testing HMM.
> +	  Say Y here if you want to build the HMM test driver.
> +	  Doing so will allow you to run tools/testing/selftests/vm/hmm-tests.
> +
> +	  If in doubt, say "N".
> +
>  endmenu
>  
>  config RANDOM_TRUST_CPU
> diff --git a/drivers/char/Makefile b/drivers/char/Makefile
> index 7c5ea6f9df14..d4a168c0c138 100644
> --- a/drivers/char/Makefile
> +++ b/drivers/char/Makefile
> @@ -52,3 +52,4 @@ js-rtc-y = rtc.o
>  obj-$(CONFIG_XILLYBUS)		+= xillybus/
>  obj-$(CONFIG_POWERNV_OP_PANEL)	+= powernv-op-panel.o
>  obj-$(CONFIG_ADI)		+= adi.o
> +obj-$(CONFIG_HMM_DMIRROR)	+= hmm_dmirror.o
> diff --git a/drivers/char/hmm_dmirror.c b/drivers/char/hmm_dmirror.c
> new file mode 100644
> index 000000000000..5a1ed34e72e1
> --- /dev/null
> +++ b/drivers/char/hmm_dmirror.c
> @@ -0,0 +1,1566 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright 2013 Red Hat Inc.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License as
> + * published by the Free Software Foundation; either version 2 of
> + * the License, or (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * Authors: Jérôme Glisse <jglisse@redhat.com>
> + */
> +/*
> + * This is a driver to exercise the HMM (heterogeneous memory management)
> + * mirror and zone device private memory migration APIs of the kernel.
> + * Userspace programs can register with the driver to mirror their own address
> + * space and can use the device to read/write any valid virtual address.
> + *
> + * In some ways it can also serve as an example driver for people wanting to use
> + * HMM inside their own device driver.
> + */
> +#include <linux/init.h>
> +#include <linux/fs.h>
> +#include <linux/mm.h>
> +#include <linux/module.h>
> +#include <linux/kernel.h>
> +#include <linux/cdev.h>
> +#include <linux/device.h>
> +#include <linux/mutex.h>
> +#include <linux/rwsem.h>
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/highmem.h>
> +#include <linux/delay.h>
> +#include <linux/pagemap.h>
> +#include <linux/hmm.h>
> +#include <linux/vmalloc.h>
> +#include <linux/swap.h>
> +#include <linux/swapops.h>
> +#include <linux/sched/mm.h>
> +#include <linux/platform_device.h>
> +
> +#include <uapi/linux/hmm_dmirror.h>
> +
> +#define DMIRROR_NDEVICES		2
> +#define DMIRROR_RANGE_FAULT_TIMEOUT	1000
> +#define DEVMEM_CHUNK_SIZE		(256 * 1024 * 1024U)
> +#define DEVMEM_CHUNKS_RESERVE		16
> +
> +static const struct dev_pagemap_ops dmirror_devmem_ops;
> +static dev_t dmirror_dev;
> +static struct platform_device *dmirror_platform_devices[DMIRROR_NDEVICES];
> +static struct page *dmirror_zero_page;
> +
> +struct dmirror_device;
> +
> +struct dmirror_bounce {
> +	void			*ptr;
> +	unsigned long		size;
> +	unsigned long		addr;
> +	unsigned long		cpages;
> +};
> +
> +#define DPT_SHIFT PAGE_SHIFT
> +#define DPT_VALID (1UL << 0)
> +#define DPT_WRITE (1UL << 1)
> +#define DPT_DPAGE (1UL << 2)
> +#define DPT_ZPAGE 0x20UL
> +
> +const uint64_t dmirror_hmm_flags[HMM_PFN_FLAG_MAX] = {
> +	[HMM_PFN_VALID] = DPT_VALID,
> +	[HMM_PFN_WRITE] = DPT_WRITE,
> +	[HMM_PFN_DEVICE_PRIVATE] = DPT_DPAGE,
> +};
> +
> +static const uint64_t dmirror_hmm_values[HMM_PFN_VALUE_MAX] = {
> +	[HMM_PFN_NONE]    = 0,
> +	[HMM_PFN_ERROR]   = 0x10,
> +	[HMM_PFN_SPECIAL] = 0x10,
> +};
> +
> +struct dmirror_pt {
> +	u64			pgd[PTRS_PER_PGD];
> +	struct rw_semaphore	lock;
> +};
> +
> +/*
> + * Data attached to the open device file.
> + * Note that it might be shared after a fork().
> + */
> +struct dmirror {
> +	struct hmm_mirror	mirror;
> +	struct dmirror_device	*mdevice;
> +	struct dmirror_pt	pt;
> +	struct mutex		mutex;
> +};
> +
> +/*
> + * ZONE_DEVICE pages for migration and simulating device memory.
> + */
> +struct dmirror_chunk {
> +	struct dev_pagemap	pagemap;
> +	struct dmirror_device	*mdevice;
> +};
> +
> +/*
> + * Per device data.
> + */
> +struct dmirror_device {
> +	struct cdev		cdevice;
> +	struct hmm_devmem	*devmem;
> +	struct platform_device	*pdevice;
> +
> +	unsigned int		devmem_capacity;
> +	unsigned int		devmem_count;
> +	struct dmirror_chunk	**devmem_chunks;
> +	struct mutex		devmem_lock;	/* protects the above */
> +
> +	unsigned long		calloc;
> +	unsigned long		cfree;
> +	struct page		*free_pages;
> +	spinlock_t		lock;		/* protects the above */
> +};
> +
> +static inline unsigned long dmirror_pt_pgd(unsigned long addr)
> +{
> +	return (addr >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1);
> +}
> +
> +static inline unsigned long dmirror_pt_pud(unsigned long addr)
> +{
> +	return (addr >> PUD_SHIFT) & (PTRS_PER_PUD - 1);
> +}
> +
> +static inline unsigned long dmirror_pt_pmd(unsigned long addr)
> +{
> +	return (addr >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
> +}
> +
> +static inline unsigned long dmirror_pt_pte(unsigned long addr)
> +{
> +	return (addr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1);
> +}
> +
> +static inline struct page *dmirror_pt_page(u64 *dptep)
> +{
> +	u64 dpte = *dptep;
> +
> +	if (dpte == DPT_ZPAGE)
> +		return dmirror_zero_page;
> +	if (!(dpte & DPT_VALID))
> +		return NULL;
> +	return pfn_to_page((u64)dpte >> DPT_SHIFT);
> +}
> +
> +static inline struct page *dmirror_pt_page_write(u64 *dptep)
> +{
> +	u64 dpte = *dptep;
> +
> +	if (!(dpte & DPT_VALID) || !(dpte & DPT_WRITE))
> +		return NULL;
> +	return pfn_to_page((u64)dpte >> DPT_SHIFT);
> +}
> +
> +static inline u64 dmirror_pt_from_page(struct page *page)
> +{
> +	if (!page)
> +		return 0;
> +	return (page_to_pfn(page) << DPT_SHIFT) | DPT_VALID;
> +}
> +
> +static struct page *populate_pt(struct dmirror *dmirror, u64 *dptep)
> +{
> +	struct page *page;
> +
> +	/*
> +	 * Since we don't free page tables until the process exits,
> +	 * we can unlock and relock without the page table being freed
> +	 * from under us.
> +	 */
> +	mutex_unlock(&dmirror->mutex);
> +	page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
> +	mutex_lock(&dmirror->mutex);
> +	if (page) {
> +		if (unlikely(*dptep)) {
> +			__free_page(page);
> +			page = dmirror_pt_page(dptep);
> +		} else
> +			*dptep = dmirror_pt_from_page(page);
> +	} else if (*dptep)
> +		page = dmirror_pt_page(dptep);
> +	return page;
> +}
> +
> +static inline unsigned long dmirror_pt_pud_end(unsigned long addr)
> +{
> +	return (addr & PGDIR_MASK) + ((unsigned long)PTRS_PER_PUD << PUD_SHIFT);
> +}
> +
> +static inline unsigned long dmirror_pt_pmd_end(unsigned long addr)
> +{
> +	return (addr & PUD_MASK) + ((unsigned long)PTRS_PER_PMD << PMD_SHIFT);
> +}
> +
> +static inline unsigned long dmirror_pt_pte_end(unsigned long addr)
> +{
> +	return (addr & PMD_MASK) + ((unsigned long)PTRS_PER_PTE << PAGE_SHIFT);
> +}
> +
> +typedef int (*dmirror_walk_cb_t)(struct dmirror *dmirror,
> +				 unsigned long start,
> +				 unsigned long end,
> +				 u64 *dptep,
> +				 void *private);
> +
> +static int dmirror_pt_walk(struct dmirror *dmirror,
> +			   dmirror_walk_cb_t cb,
> +			   unsigned long start,
> +			   unsigned long end,
> +			   void *private,
> +			   bool populate)
> +{
> +	u64 *dpgdp = &dmirror->pt.pgd[dmirror_pt_pgd(start)];
> +	unsigned long addr;
> +	int ret = -ENOENT;
> +
> +	for (addr = start; addr < end; dpgdp++) {
> +		u64 *dpudp;
> +		unsigned long pud_end;
> +		struct page *pud_page;
> +
> +		pud_end = min(end, dmirror_pt_pud_end(addr));
> +		pud_page = dmirror_pt_page(dpgdp);
> +		if (!pud_page) {
> +			if (!populate) {
> +				addr = pud_end;
> +				continue;
> +			}
> +			pud_page = populate_pt(dmirror, dpgdp);
> +			if (!pud_page)
> +				return -ENOMEM;
> +		}
> +		dpudp = kmap(pud_page);
> +		dpudp += dmirror_pt_pud(addr);
> +		for (; addr != pud_end; dpudp++) {
> +			u64 *dpmdp;
> +			unsigned long pmd_end;
> +			struct page *pmd_page;
> +
> +			pmd_end = min(end, dmirror_pt_pmd_end(addr));
> +			pmd_page = dmirror_pt_page(dpudp);
> +			if (!pmd_page) {
> +				if (!populate) {
> +					addr = pmd_end;
> +					continue;
> +				}
> +				pmd_page = populate_pt(dmirror, dpudp);
> +				if (!pmd_page) {
> +					kunmap(pud_page);
> +					return -ENOMEM;
> +				}
> +			}
> +			dpmdp = kmap(pmd_page);
> +			dpmdp += dmirror_pt_pmd(addr);
> +			for (; addr != pmd_end; dpmdp++) {
> +				u64 *dptep;
> +				unsigned long pte_end;
> +				struct page *pte_page;
> +
> +				pte_end = min(end, dmirror_pt_pte_end(addr));
> +				pte_page = dmirror_pt_page(dpmdp);
> +				if (!pte_page) {
> +					if (!populate) {
> +						addr = pte_end;
> +						continue;
> +					}
> +					pte_page = populate_pt(dmirror, dpmdp);
> +					if (!pte_page) {
> +						kunmap(pmd_page);
> +						kunmap(pud_page);
> +						return -ENOMEM;
> +					}
> +				}
> +				if (!cb) {
> +					addr = pte_end;
> +					continue;
> +				}
> +				dptep = kmap(pte_page);
> +				dptep += dmirror_pt_pte(addr);
> +				ret = cb(dmirror, addr, pte_end, dptep,
> +					 private);
> +				kunmap(pte_page);
> +				if (ret) {
> +					kunmap(pmd_page);
> +					kunmap(pud_page);
> +					return ret;
> +				}
> +				addr = pte_end;
> +			}
> +			kunmap(pmd_page);
> +			addr = pmd_end;
> +		}
> +		kunmap(pud_page);
> +		addr = pud_end;
> +	}
> +
> +	return ret;
> +}
> +
> +static void dmirror_pt_free(struct dmirror *dmirror)
> +{
> +	u64 *dpgdp = dmirror->pt.pgd;
> +
> +	for (; dpgdp != dmirror->pt.pgd + PTRS_PER_PGD; dpgdp++) {
> +		u64 *dpudp, *dpudp_orig;
> +		u64 *dpudp_end;
> +		struct page *pud_page;
> +
> +		pud_page = dmirror_pt_page(dpgdp);
> +		if (!pud_page)
> +			continue;
> +
> +		dpudp_orig = kmap_atomic(pud_page);
> +		dpudp = dpudp_orig;
> +		dpudp_end = dpudp + PTRS_PER_PUD;
> +		for (; dpudp != dpudp_end; dpudp++) {
> +			u64 *dpmdp, *dpmdp_orig;
> +			u64 *dpmdp_end;
> +			struct page *pmd_page;
> +
> +			pmd_page = dmirror_pt_page(dpudp);
> +			if (!pmd_page)
> +				continue;
> +
> +			dpmdp_orig = kmap_atomic(pmd_page);
> +			dpmdp = dpmdp_orig;
> +			dpmdp_end = dpmdp + PTRS_PER_PMD;
> +			for (; dpmdp != dpmdp_end; dpmdp++) {
> +				struct page *pte_page;
> +
> +				pte_page = dmirror_pt_page(dpmdp);
> +				if (!pte_page)
> +					continue;
> +
> +				*dpmdp = 0;
> +				__free_page(pte_page);
> +			}
> +			kunmap_atomic(dpmdp_orig);
> +			*dpudp = 0;
> +			__free_page(pmd_page);
> +		}
> +		kunmap_atomic(dpudp_orig);
> +		*dpgdp = 0;
> +		__free_page(pud_page);
> +	}
> +}
> +
> +static int dmirror_bounce_init(struct dmirror_bounce *bounce,
> +			       unsigned long addr,
> +			       unsigned long size)
> +{
> +	bounce->addr = addr;
> +	bounce->size = size;
> +	bounce->cpages = 0;
> +	bounce->ptr = vmalloc(size);
> +	if (!bounce->ptr)
> +		return -ENOMEM;
> +	return 0;
> +}
> +
> +static int dmirror_bounce_copy_from(struct dmirror_bounce *bounce,
> +				    unsigned long addr)
> +{
> +	unsigned long end = addr + bounce->size;
> +	char __user *uptr = (void __user *)addr;
> +	void *ptr = bounce->ptr;
> +
> +	for (; addr < end; addr += PAGE_SIZE, ptr += PAGE_SIZE,
> +					      uptr += PAGE_SIZE) {
> +		int ret;
> +
> +		ret = copy_from_user(ptr, uptr, PAGE_SIZE);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	return 0;
> +}
> +
> +static int dmirror_bounce_copy_to(struct dmirror_bounce *bounce,
> +				  unsigned long addr)
> +{
> +	unsigned long end = addr + bounce->size;
> +	char __user *uptr = (void __user *)addr;
> +	void *ptr = bounce->ptr;
> +
> +	for (; addr < end; addr += PAGE_SIZE, ptr += PAGE_SIZE,
> +					      uptr += PAGE_SIZE) {
> +		int ret;
> +
> +		ret = copy_to_user(uptr, ptr, PAGE_SIZE);
> +		if (ret)
> +			return ret;
> +	}
> +
> +	return 0;
> +}
> +
> +static void dmirror_bounce_fini(struct dmirror_bounce *bounce)
> +{
> +	vfree(bounce->ptr);
> +}
> +
> +static int dmirror_do_update(struct dmirror *dmirror,
> +			     unsigned long addr,
> +			     unsigned long end,
> +			     u64 *dptep,
> +			     void *private)
> +{
> +	/*
> +	 * The page table doesn't hold references to pages since it relies on
> +	 * the mmu notifier to clear pointers when they become stale.
> +	 * Therefore, it is OK to just clear the pte.
> +	 */
> +	for (; addr < end; addr += PAGE_SIZE, ++dptep)
> +		*dptep = 0;
> +
> +	return 0;
> +}
> +
> +static int dmirror_update(struct hmm_mirror *mirror,
> +			  const struct mmu_notifier_range *update)
> +{
> +	struct dmirror *dmirror = container_of(mirror, struct dmirror, mirror);
> +
> +	if (mmu_notifier_range_blockable(update))
> +		mutex_lock(&dmirror->mutex);
> +	else if (!mutex_trylock(&dmirror->mutex))
> +		return -EAGAIN;
> +
> +	dmirror_pt_walk(dmirror, dmirror_do_update, update->start,
> +			update->end, NULL, false);
> +	mutex_unlock(&dmirror->mutex);
> +	return 0;
> +}
> +
> +static const struct hmm_mirror_ops dmirror_ops = {
> +	.sync_cpu_device_pagetables	= &dmirror_update,
> +};
> +
> +/*
> + * dmirror_new() - allocate and initialize dmirror struct.
> + *
> + * @mdevice: The device this mirror is associated with.
> + * @filp: The active device file descriptor this mirror is associated with.
> + */
> +static struct dmirror *dmirror_new(struct dmirror_device *mdevice)
> +{
> +	struct mm_struct *mm = get_task_mm(current);
> +	struct dmirror *dmirror;
> +	int ret;
> +
> +	if (!mm)
> +		goto err;
> +
> +	/* Mirror this process address space */
> +	dmirror = kzalloc(sizeof(*dmirror), GFP_KERNEL);
> +	if (dmirror == NULL)
> +		goto err_mmput;
> +
> +	dmirror->mdevice = mdevice;
> +	dmirror->mirror.ops = &dmirror_ops;
> +	mutex_init(&dmirror->mutex);
> +
> +	down_write(&mm->mmap_sem);
> +	ret = hmm_mirror_register(&dmirror->mirror, mm);
> +	up_write(&mm->mmap_sem);
> +	if (ret)
> +		goto err_free;
> +
> +	mmput(mm);
> +	return dmirror;
> +
> +err_free:
> +	kfree(dmirror);
> +err_mmput:
> +	mmput(mm);
> +err:
> +	return NULL;
> +}
> +
> +static void dmirror_del(struct dmirror *dmirror)
> +{
> +	hmm_mirror_unregister(&dmirror->mirror);
> +	dmirror_pt_free(dmirror);
> +	kfree(dmirror);
> +}
> +
> +/*
> + * Below are the file operation for the dmirror device file. Only ioctl matters.
> + *
> + * Note this is highly specific to the dmirror device driver and should not be
> + * construed as an example on how to design the API a real device driver would
> + * expose to userspace.
> + */
> +static ssize_t dmirror_fops_read(struct file *filp,
> +			       char __user *buf,
> +			       size_t count,
> +			       loff_t *ppos)
> +{
> +	return -EINVAL;
> +}
> +
> +static ssize_t dmirror_fops_write(struct file *filp,
> +				const char __user *buf,
> +				size_t count,
> +				loff_t *ppos)
> +{
> +	return -EINVAL;
> +}
> +
> +static int dmirror_fops_mmap(struct file *filp, struct vm_area_struct *vma)
> +{
> +	/* Forbid mmap of the dmirror device file. */
> +	return -EINVAL;
> +}
> +
> +static int dmirror_fops_open(struct inode *inode, struct file *filp)
> +{
> +	struct cdev *cdev = inode->i_cdev;
> +	struct dmirror_device *mdevice;
> +	struct dmirror *dmirror;
> +
> +	/* No exclusive opens. */
> +	if (filp->f_flags & O_EXCL)
> +		return -EINVAL;
> +
> +	mdevice = container_of(cdev, struct dmirror_device, cdevice);
> +	dmirror = dmirror_new(mdevice);
> +	if (!dmirror)
> +		return -ENOMEM;
> +
> +	/* Only the first open registers the address space. */
> +	mutex_lock(&mdevice->devmem_lock);
> +	if (filp->private_data)
> +		goto err_busy;
> +	filp->private_data = dmirror;
> +	mutex_unlock(&mdevice->devmem_lock);
> +
> +	return 0;
> +
> +err_busy:
> +	mutex_unlock(&mdevice->devmem_lock);
> +	dmirror_del(dmirror);
> +	return -EBUSY;
> +}
> +
> +static int dmirror_fops_release(struct inode *inode, struct file *filp)
> +{
> +	struct dmirror *dmirror = filp->private_data;
> +
> +	if (!dmirror)
> +		return 0;
> +
> +	dmirror_del(dmirror);
> +	filp->private_data = NULL;
> +
> +	return 0;
> +}
> +
> +static inline struct dmirror_device *dmirror_page_to_device(struct page *page)
> +{
> +	struct dmirror_chunk *devmem;
> +
> +	devmem = container_of(page->pgmap, struct dmirror_chunk, pagemap);
> +	return devmem->mdevice;
> +}
> +
> +static bool dmirror_device_is_mine(struct dmirror_device *mdevice,
> +				   struct page *page)
> +{
> +	if (!is_zone_device_page(page))
> +		return false;
> +	return page->pgmap->ops == &dmirror_devmem_ops &&
> +		dmirror_page_to_device(page) == mdevice;
> +}
> +
> +static int dmirror_do_fault(struct dmirror *dmirror,
> +			    unsigned long addr,
> +			    unsigned long end,
> +			    u64 *dptep,
> +			    void *private)
> +{
> +	struct hmm_range *range = private;
> +	unsigned long idx = (addr - range->start) >> PAGE_SHIFT;
> +	uint64_t *pfns = range->pfns;
> +
> +	for (; addr < end; addr += PAGE_SIZE, ++dptep, ++idx) {
> +		struct page *page;
> +		u64 dpte;
> +
> +		/*
> +		 * HMM_PFN_ERROR is returned when the device is accessing invalid
> +		 * memory, either because of a memory error (hardware detected
> +		 * memory corruption) or, more likely, because a mmap'ed file was
> +		 * truncated.
> +		 */
> +		if (pfns[idx] == range->values[HMM_PFN_ERROR])
> +			return -EFAULT;
> +		/*
> +		 * The only special PFN HMM returns is the read-only zero page,
> +		 * which doesn't have a matching struct page.
> +		 */
> +		if (pfns[idx] == range->values[HMM_PFN_SPECIAL]) {
> +			*dptep = DPT_ZPAGE;
> +			continue;
> +		}
> +		if (!(pfns[idx] & range->flags[HMM_PFN_VALID]))
> +			return -EFAULT;
> +		page = hmm_device_entry_to_page(range, pfns[idx]);
> +		/* We asked for pages to be populated but check anyway. */
> +		if (!page)
> +			return -EFAULT;
> +		dpte = dmirror_pt_from_page(page);
> +		if (is_zone_device_page(page)) {
> +			if (!dmirror_device_is_mine(dmirror->mdevice, page))
> +				continue;
> +			dpte |= DPT_DPAGE;
> +		}
> +		if (pfns[idx] & range->flags[HMM_PFN_WRITE])
> +			dpte |= DPT_WRITE;
> +		else if (range->default_flags & range->flags[HMM_PFN_WRITE])
> +			return -EFAULT;
> +		*dptep = dpte;
> +	}
> +
> +	return 0;
> +}
> +
> +static int dmirror_fault(struct dmirror *dmirror,
> +			 unsigned long start,
> +			 unsigned long end,
> +			 bool write)
> +{
> +	struct mm_struct *mm = dmirror->mirror.hmm->mmu_notifier.mm;
> +	unsigned long addr;
> +	unsigned long next;
> +	uint64_t pfns[64];
> +	struct hmm_range range = {
> +		.pfns = pfns,
> +		.flags = dmirror_hmm_flags,
> +		.values = dmirror_hmm_values,
> +		.pfn_shift = DPT_SHIFT,
> +		.pfn_flags_mask = ~(dmirror_hmm_flags[HMM_PFN_VALID] |
> +				    dmirror_hmm_flags[HMM_PFN_WRITE]),
> +		.default_flags = dmirror_hmm_flags[HMM_PFN_VALID] |
> +				(write ? dmirror_hmm_flags[HMM_PFN_WRITE] : 0),
> +	};
> +	int ret = 0;
> +
> +	for (addr = start; addr < end; ) {
> +		long count;
> +
> +		next = min(addr + (ARRAY_SIZE(pfns) << PAGE_SHIFT), end);
> +		range.start = addr;
> +		range.end = next;
> +
> +		down_read(&mm->mmap_sem);
> +
> +		ret = hmm_range_register(&range, &dmirror->mirror);
> +		if (ret) {
> +			up_read(&mm->mmap_sem);
> +			break;
> +		}
> +
> +		if (!hmm_range_wait_until_valid(&range,
> +						DMIRROR_RANGE_FAULT_TIMEOUT)) {
> +			hmm_range_unregister(&range);
> +			up_read(&mm->mmap_sem);
> +			continue;
> +		}
> +
> +		count = hmm_range_fault(&range, 0);
> +		if (count < 0) {
> +			ret = count;
> +			hmm_range_unregister(&range);
> +			up_read(&mm->mmap_sem);
> +			break;
> +		}
> +
> +		if (!hmm_range_valid(&range)) {
> +			hmm_range_unregister(&range);
> +			up_read(&mm->mmap_sem);
> +			continue;
> +		}
> +		mutex_lock(&dmirror->mutex);
> +		ret = dmirror_pt_walk(dmirror, dmirror_do_fault,
> +				      addr, next, &range, true);
> +		mutex_unlock(&dmirror->mutex);
> +		hmm_range_unregister(&range);
> +		up_read(&mm->mmap_sem);
> +		if (ret)
> +			break;
> +
> +		addr = next;
> +	}
> +
> +	return ret;
> +}
> +
> +static int dmirror_do_read(struct dmirror *dmirror,
> +			   unsigned long addr,
> +			   unsigned long end,
> +			   u64 *dptep,
> +			   void *private)
> +{
> +	struct dmirror_bounce *bounce = private;
> +	void *ptr;
> +
> +	ptr = bounce->ptr + ((addr - bounce->addr) & PAGE_MASK);
> +
> +	for (; addr < end; addr += PAGE_SIZE, ++dptep) {
> +		struct page *page;
> +		void *tmp;
> +
> +		page = dmirror_pt_page(dptep);
> +		if (!page)
> +			return -ENOENT;
> +
> +		tmp = kmap(page);
> +		memcpy(ptr, tmp, PAGE_SIZE);
> +		kunmap(page);
> +
> +		ptr += PAGE_SIZE;
> +		bounce->cpages++;
> +	}
> +
> +	return 0;
> +}
> +
> +static int dmirror_read(struct dmirror *dmirror,
> +			struct hmm_dmirror_cmd *cmd)
> +{
> +	struct dmirror_bounce bounce;
> +	unsigned long start, end;
> +	unsigned long size = cmd->npages << PAGE_SHIFT;
> +	int ret;
> +
> +	start = cmd->addr;
> +	end = start + size;
> +	if (end < start)
> +		return -EINVAL;
> +
> +	ret = dmirror_bounce_init(&bounce, start, size);
> +	if (ret)
> +		return ret;
> +
> +again:
> +	mutex_lock(&dmirror->mutex);
> +	ret = dmirror_pt_walk(dmirror, dmirror_do_read, start, end, &bounce,
> +				false);
> +	mutex_unlock(&dmirror->mutex);
> +	if (ret == 0)
> +		ret = dmirror_bounce_copy_to(&bounce, cmd->ptr);
> +	else if (ret == -ENOENT) {
> +		start = cmd->addr + (bounce.cpages << PAGE_SHIFT);
> +		ret = dmirror_fault(dmirror, start, end, false);
> +		if (ret == 0) {
> +			cmd->faults++;
> +			goto again;
> +		}
> +	}
> +
> +	cmd->cpages = bounce.cpages;
> +	dmirror_bounce_fini(&bounce);
> +	return ret;
> +}
> +
> +static int dmirror_do_write(struct dmirror *dmirror,
> +			    unsigned long addr,
> +			    unsigned long end,
> +			    u64 *dptep,
> +			    void *private)
> +{
> +	struct dmirror_bounce *bounce = private;
> +	void *ptr;
> +
> +	ptr = bounce->ptr + ((addr - bounce->addr) & PAGE_MASK);
> +
> +	for (; addr < end; addr += PAGE_SIZE, ++dptep) {
> +		struct page *page;
> +		void *tmp;
> +
> +		page = dmirror_pt_page_write(dptep);
> +		if (!page)
> +			return -ENOENT;
> +
> +		tmp = kmap(page);
> +		memcpy(tmp, ptr, PAGE_SIZE);
> +		kunmap(page);
> +
> +		ptr += PAGE_SIZE;
> +		bounce->cpages++;
> +	}
> +
> +	return 0;
> +}
> +
> +static int dmirror_write(struct dmirror *dmirror,
> +			 struct hmm_dmirror_cmd *cmd)
> +{
> +	struct dmirror_bounce bounce;
> +	unsigned long start, end;
> +	unsigned long size = cmd->npages << PAGE_SHIFT;
> +	int ret;
> +
> +	start = cmd->addr;
> +	end = start + size;
> +	if (end < start)
> +		return -EINVAL;
> +
> +	ret = dmirror_bounce_init(&bounce, start, size);
> +	if (ret)
> +		return ret;
> +	ret = dmirror_bounce_copy_from(&bounce, cmd->ptr);
> +	if (ret)
> +		return ret;
> +
> +again:
> +	mutex_lock(&dmirror->mutex);
> +	ret = dmirror_pt_walk(dmirror, dmirror_do_write,
> +			      start, end, &bounce, false);
> +	mutex_unlock(&dmirror->mutex);
> +	if (ret == -ENOENT) {
> +		start = cmd->addr + (bounce.cpages << PAGE_SHIFT);
> +		ret = dmirror_fault(dmirror, start, end, true);
> +		if (ret == 0) {
> +			cmd->faults++;
> +			goto again;
> +		}
> +	}
> +
> +	cmd->cpages = bounce.cpages;
> +	dmirror_bounce_fini(&bounce);
> +	return ret;
> +}
> +
> +static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,
> +				   struct page **ppage)
> +{
> +	struct dmirror_chunk *devmem;
> +	struct resource *res;
> +	unsigned long pfn;
> +	unsigned long pfn_first;
> +	unsigned long pfn_last;
> +	void *ptr;
> +
> +	mutex_lock(&mdevice->devmem_lock);
> +
> +	if (mdevice->devmem_count == mdevice->devmem_capacity) {
> +		struct dmirror_chunk **new_chunks;
> +		unsigned int new_capacity;
> +
> +		new_capacity = mdevice->devmem_capacity +
> +				DEVMEM_CHUNKS_RESERVE;
> +		new_chunks = krealloc(mdevice->devmem_chunks,
> +				sizeof(new_chunks[0]) * new_capacity,
> +				GFP_KERNEL);
> +		if (!new_chunks)
> +			goto err;
> +		mdevice->devmem_capacity = new_capacity;
> +		mdevice->devmem_chunks = new_chunks;
> +	}
> +
> +	res = devm_request_free_mem_region(&mdevice->pdevice->dev,
> +					&iomem_resource, DEVMEM_CHUNK_SIZE);
> +	if (IS_ERR(res))
> +		goto err;
> +
> +	devmem = kzalloc(sizeof(*devmem), GFP_KERNEL);
> +	if (!devmem)
> +		goto err;
> +
> +	devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
> +	devmem->pagemap.res = *res;
> +	devmem->pagemap.ops = &dmirror_devmem_ops;
> +	ptr = devm_memremap_pages(&mdevice->pdevice->dev, &devmem->pagemap);
> +	if (IS_ERR(ptr))
> +		goto err_free;
> +
> +	devmem->mdevice = mdevice;
> +	pfn_first = devmem->pagemap.res.start >> PAGE_SHIFT;
> +	pfn_last = pfn_first +
> +		(resource_size(&devmem->pagemap.res) >> PAGE_SHIFT);
> +	mdevice->devmem_chunks[mdevice->devmem_count++] = devmem;
> +
> +	mutex_unlock(&mdevice->devmem_lock);
> +
> +	pr_info("added new %u MB chunk (total %u chunks, %u MB) PFNs [0x%lx 0x%lx)\n",
> +		DEVMEM_CHUNK_SIZE / (1024 * 1024),
> +		mdevice->devmem_count,
> +		mdevice->devmem_count * (DEVMEM_CHUNK_SIZE / (1024 * 1024)),
> +		pfn_first, pfn_last);
> +
> +	spin_lock(&mdevice->lock);
> +	for (pfn = pfn_first; pfn < pfn_last; pfn++) {
> +		struct page *page = pfn_to_page(pfn);
> +
> +		page->zone_device_data = mdevice->free_pages;
> +		mdevice->free_pages = page;
> +	}
> +	if (ppage) {
> +		*ppage = mdevice->free_pages;
> +		mdevice->free_pages = (*ppage)->zone_device_data;
> +		mdevice->calloc++;
> +	}
> +	spin_unlock(&mdevice->lock);
> +
> +	return true;
> +
> +err_free:
> +	kfree(devmem);
> +err:
> +	mutex_unlock(&mdevice->devmem_lock);
> +	return false;
> +}
> +
> +static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
> +{
> +	struct page *dpage = NULL;
> +	struct page *rpage;
> +
> +	/*
> +	 * This is a fake device, so we allocate real system memory to store
> +	 * our device memory.
> +	 */
> +	rpage = alloc_page(GFP_HIGHUSER);
> +	if (!rpage)
> +		return NULL;
> +
> +	spin_lock(&mdevice->lock);
> +
> +	if (mdevice->free_pages) {
> +		dpage = mdevice->free_pages;
> +		mdevice->free_pages = dpage->zone_device_data;
> +		mdevice->calloc++;
> +		spin_unlock(&mdevice->lock);
> +	} else {
> +		spin_unlock(&mdevice->lock);
> +		if (!dmirror_allocate_chunk(mdevice, &dpage))
> +			goto error;
> +	}
> +
> +	dpage->zone_device_data = rpage;
> +	get_page(dpage);
> +	lock_page(dpage);
> +	return dpage;
> +
> +error:
> +	__free_page(rpage);
> +	return NULL;
> +}
> +
> +static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
> +					   struct dmirror *dmirror)
> +{
> +	struct dmirror_device *mdevice = dmirror->mdevice;
> +	const unsigned long *src = args->src;
> +	unsigned long *dst = args->dst;
> +	unsigned long addr;
> +
> +	for (addr = args->start; addr < args->end; addr += PAGE_SIZE,
> +						   src++, dst++) {
> +		struct page *spage;
> +		struct page *dpage;
> +		struct page *rpage;
> +
> +		if (!(*src & MIGRATE_PFN_MIGRATE))
> +			continue;
> +
> +		/*
> +		 * Note that spage might be NULL, which is OK since it is either
> +		 * an unallocated pte_none() entry or the read-only zero page.
> +		 */
> +		spage = migrate_pfn_to_page(*src);
> +
> +		/*
> +		 * Don't migrate device private pages from our own driver or from
> +		 * other drivers. For our own pages we would do a device-to-device
> +		 * copy, not a migration, and for another driver's pages we would
> +		 * first need to fault them back to system memory.
> +		 */
> +		if (spage && is_zone_device_page(spage))
> +			continue;
> +
> +		dpage = dmirror_devmem_alloc_page(mdevice);
> +		if (!dpage)
> +			continue;
> +
> +		rpage = dpage->zone_device_data;
> +		if (spage)
> +			copy_highpage(rpage, spage);
> +		else
> +			clear_highpage(rpage);
> +
> +		/*
> +		 * Normally, a device would use page->zone_device_data to point
> +		 * to the mirror, but here we use it to hold the page that backs
> +		 * the simulated device memory, and that page holds the pointer
> +		 * to the mirror.
> +		 */
> +		rpage->zone_device_data = dmirror;
> +
> +		*dst = migrate_pfn(page_to_pfn(dpage)) |
> +			    MIGRATE_PFN_LOCKED;
> +		if ((*src & MIGRATE_PFN_WRITE) ||
> +		    (!spage && args->vma->vm_flags & VM_WRITE))
> +			*dst |= MIGRATE_PFN_WRITE;
> +	}
> +	/* Try to pre-allocate page tables. */
> +	mutex_lock(&dmirror->mutex);
> +	dmirror_pt_walk(dmirror, NULL, args->start, args->end, NULL, true);
> +	mutex_unlock(&dmirror->mutex);
> +}
> +
> +struct dmirror_migrate {
> +	struct hmm_dmirror_cmd		*cmd;
> +	const unsigned long		*src;
> +	const unsigned long		*dst;
> +	unsigned long			start;
> +};
> +
> +static int dmirror_do_migrate(struct dmirror *dmirror,
> +			      unsigned long addr,
> +			      unsigned long end,
> +			      u64 *dptep,
> +			      void *private)
> +{
> +	struct dmirror_migrate *migrate = private;
> +	const unsigned long *src = migrate->src;
> +	const unsigned long *dst = migrate->dst;
> +	unsigned long idx = (addr - migrate->start) >> PAGE_SHIFT;
> +
> +	for (; addr < end; addr += PAGE_SIZE, ++dptep, ++idx) {
> +		struct page *page;
> +		u64 dpte;
> +
> +		if (!(src[idx] & MIGRATE_PFN_MIGRATE))
> +			continue;
> +
> +		page = migrate_pfn_to_page(dst[idx]);
> +		if (!page)
> +			continue;
> +
> +		/*
> +		 * Map the page that holds the data so dmirror_pt_walk()
> +		 * doesn't have to deal with ZONE_DEVICE private pages.
> +		 */
> +		page = page->zone_device_data;
> +		dpte = dmirror_pt_from_page(page) | DPT_DPAGE;
> +		if (dst[idx] & MIGRATE_PFN_WRITE)
> +			dpte |= DPT_WRITE;
> +		*dptep = dpte;
> +	}
> +
> +	return 0;
> +}
> +
> +static void dmirror_migrate_finalize_and_map(struct migrate_vma *args,
> +					     struct dmirror *dmirror,
> +					     struct hmm_dmirror_cmd *cmd)
> +{
> +	struct dmirror_migrate migrate;
> +
> +	migrate.cmd = cmd;
> +	migrate.src = args->src;
> +	migrate.dst = args->dst;
> +	migrate.start = args->start;
> +
> +	/* Map the migrated pages into the device's page tables. */
> +	mutex_lock(&dmirror->mutex);
> +	dmirror_pt_walk(dmirror, dmirror_do_migrate, args->start, args->end,
> +			&migrate, true);
> +	mutex_unlock(&dmirror->mutex);
> +}
> +
> +static int dmirror_migrate(struct dmirror *dmirror,
> +			   struct hmm_dmirror_cmd *cmd)
> +{
> +	unsigned long start, end, addr;
> +	unsigned long size = cmd->npages << PAGE_SHIFT;
> +	struct mm_struct *mm = dmirror->mirror.hmm->mmu_notifier.mm;
> +	struct vm_area_struct *vma;
> +	unsigned long src_pfns[64];
> +	unsigned long dst_pfns[64];
> +	struct dmirror_bounce bounce;
> +	struct migrate_vma args;
> +	unsigned long next;
> +	int ret;
> +
> +	start = cmd->addr;
> +	end = start + size;
> +	if (end < start)
> +		return -EINVAL;
> +
> +	down_read(&mm->mmap_sem);
> +	for (addr = start; addr < end; addr = next) {
> +		next = min(end, addr + (ARRAY_SIZE(src_pfns) << PAGE_SHIFT));
> +
> +		vma = find_vma(mm, addr);
> +		if (!vma || addr < vma->vm_start) {
> +			ret = -EINVAL;
> +			goto out;
> +		}
> +		if (next > vma->vm_end)
> +			next = vma->vm_end;
> +
> +		args.vma = vma;
> +		args.src = src_pfns;
> +		args.dst = dst_pfns;
> +		args.start = addr;
> +		args.end = next;
> +		ret = migrate_vma_setup(&args);
> +		if (ret)
> +			goto out;
> +
> +		dmirror_migrate_alloc_and_copy(&args, dmirror);
> +		migrate_vma_pages(&args);
> +		dmirror_migrate_finalize_and_map(&args, dmirror, cmd);
> +		migrate_vma_finalize(&args);
> +	}
> +	up_read(&mm->mmap_sem);
> +
> +	/* Return the migrated data for verification. */
> +	ret = dmirror_bounce_init(&bounce, start, size);
> +	if (ret)
> +		return ret;
> +	mutex_lock(&dmirror->mutex);
> +	ret = dmirror_pt_walk(dmirror, dmirror_do_read, start, end, &bounce,
> +				false);
> +	mutex_unlock(&dmirror->mutex);
> +	if (ret == 0)
> +		ret = dmirror_bounce_copy_to(&bounce, cmd->ptr);
> +	cmd->cpages = bounce.cpages;
> +	dmirror_bounce_fini(&bounce);
> +	return ret;
> +
> +out:
> +	up_read(&mm->mmap_sem);
> +	return ret;
> +}
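
For review, the migrate_vma phases that dmirror_migrate() above steps
through can be summarized as the helper below. This is only a sketch to
explain the ordering, not part of the patch: the helper name is
invented, error handling is trimmed, and it uses only functions already
defined above.

	static int dmirror_migrate_one(struct dmirror *dmirror,
				       struct vm_area_struct *vma,
				       unsigned long addr, unsigned long next,
				       unsigned long *src_pfns,
				       unsigned long *dst_pfns,
				       struct hmm_dmirror_cmd *cmd)
	{
		struct migrate_vma args = {
			.vma	= vma,
			.start	= addr,
			.end	= next,
			.src	= src_pfns,
			.dst	= dst_pfns,
		};
		int ret;

		/* Isolate and unmap the CPU pages, filling args.src. */
		ret = migrate_vma_setup(&args);
		if (ret)
			return ret;

		/* Allocate device pages, copy the data, fill args.dst. */
		dmirror_migrate_alloc_and_copy(&args, dmirror);

		/* Install the destination pages for entries that can migrate. */
		migrate_vma_pages(&args);

		/* Update the device page table while the pages are still held. */
		dmirror_migrate_finalize_and_map(&args, dmirror, cmd);

		/* Restore CPU mappings for anything that did not migrate. */
		migrate_vma_finalize(&args);
		return 0;
	}
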
> +
> +static void dmirror_mkentry(struct dmirror *dmirror,
> +			    struct hmm_range *range,
> +			    unsigned char *perm,
> +			    uint64_t entry)
> +{
> +	struct page *page;
> +
> +	if (entry == range->values[HMM_PFN_ERROR]) {
> +		*perm = HMM_DMIRROR_PROT_ERROR;
> +		return;
> +	}
> +	page = hmm_device_entry_to_page(range, entry);
> +	if (!page) {
> +		*perm = HMM_DMIRROR_PROT_NONE;
> +		return;
> +	}
> +	if (entry & range->flags[HMM_PFN_DEVICE_PRIVATE]) {
> +		/* Is the page migrated to this device or some other? */
> +		if (dmirror->mdevice == dmirror_page_to_device(page))
> +			*perm = HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL;
> +		else
> +			*perm = HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE;
> +	} else if (is_zero_pfn(page_to_pfn(page)))
> +		*perm = HMM_DMIRROR_PROT_ZERO;
> +	else
> +		*perm = HMM_DMIRROR_PROT_NONE;
> +	if (entry & range->flags[HMM_PFN_WRITE])
> +		*perm |= HMM_DMIRROR_PROT_WRITE;
> +	else
> +		*perm |= HMM_DMIRROR_PROT_READ;
> +}
> +
> +static int dmirror_snapshot(struct dmirror *dmirror,
> +			    struct hmm_dmirror_cmd *cmd)
> +{
> +	struct mm_struct *mm = dmirror->mirror.hmm->mmu_notifier.mm;
> +	unsigned long start, end;
> +	unsigned long size = cmd->npages << PAGE_SHIFT;
> +	unsigned long addr;
> +	unsigned long next;
> +	uint64_t pfns[64];
> +	unsigned char perm[64];
> +	char __user *uptr;
> +	struct hmm_range range = {
> +		.pfns = pfns,
> +		.flags = dmirror_hmm_flags,
> +		.values = dmirror_hmm_values,
> +		.pfn_shift = DPT_SHIFT,
> +		.pfn_flags_mask = ~0ULL,
> +	};
> +	int ret = 0;
> +
> +	start = cmd->addr;
> +	end = start + size;
> +	uptr = (void __user *)cmd->ptr;
> +
> +	for (addr = start; addr < end; ) {
> +		long count;
> +		unsigned long i;
> +		unsigned long n;
> +
> +		next = min(addr + (ARRAY_SIZE(pfns) << PAGE_SHIFT), end);
> +		range.start = addr;
> +		range.end = next;
> +
> +		down_read(&mm->mmap_sem);
> +
> +		ret = hmm_range_register(&range, &dmirror->mirror);
> +		if (ret) {
> +			up_read(&mm->mmap_sem);
> +			break;
> +		}
> +
> +		if (!hmm_range_wait_until_valid(&range,
> +						DMIRROR_RANGE_FAULT_TIMEOUT)) {
> +			hmm_range_unregister(&range);
> +			up_read(&mm->mmap_sem);
> +			continue;
> +		}
> +
> +		count = hmm_range_fault(&range, HMM_FAULT_SNAPSHOT);
> +		if (count < 0) {
> +			ret = count;
> +			hmm_range_unregister(&range);
> +			up_read(&mm->mmap_sem);
> +			if (ret == -EBUSY)
> +				continue;
> +			break;
> +		}
> +
> +		if (!hmm_range_valid(&range)) {
> +			hmm_range_unregister(&range);
> +			up_read(&mm->mmap_sem);
> +			continue;
> +		}
> +
> +		n = (next - addr) >> PAGE_SHIFT;
> +		for (i = 0; i < n; i++)
> +			dmirror_mkentry(dmirror, &range, perm + i, pfns[i]);
> +		hmm_range_unregister(&range);
> +		up_read(&mm->mmap_sem);
> +
> +		if (copy_to_user(uptr, perm, n)) {
> +			ret = -EFAULT;
> +			break;
> +		}
> +
> +		cmd->cpages += n;
> +		uptr += n;
> +		addr = next;
> +	}
> +
> +	return ret;
> +}
> +
> +static long dmirror_fops_unlocked_ioctl(struct file *filp,
> +					unsigned int command,
> +					unsigned long arg)
> +{
> +	void __user *uarg = (void __user *)arg;
> +	struct hmm_dmirror_cmd cmd;
> +	struct dmirror *dmirror;
> +	int ret;
> +
> +	dmirror = filp->private_data;
> +	if (!dmirror)
> +		return -EINVAL;
> +
> +	if (copy_from_user(&cmd, uarg, sizeof(cmd)))
> +		return -EFAULT;
> +
> +	if (cmd.addr & ~PAGE_MASK)
> +		return -EINVAL;
> +	if (cmd.addr >= (cmd.addr + (cmd.npages << PAGE_SHIFT)))
> +		return -EINVAL;
> +
> +	cmd.cpages = 0;
> +	cmd.faults = 0;
> +
> +	switch (command) {
> +	case HMM_DMIRROR_READ:
> +		ret = dmirror_read(dmirror, &cmd);
> +		break;
> +
> +	case HMM_DMIRROR_WRITE:
> +		ret = dmirror_write(dmirror, &cmd);
> +		break;
> +
> +	case HMM_DMIRROR_MIGRATE:
> +		ret = dmirror_migrate(dmirror, &cmd);
> +		break;
> +
> +	case HMM_DMIRROR_SNAPSHOT:
> +		ret = dmirror_snapshot(dmirror, &cmd);
> +		break;
> +
> +	default:
> +		return -EINVAL;
> +	}
> +	if (ret)
> +		return ret;
> +
> +	if (copy_to_user(uarg, &cmd, sizeof(cmd)))
> +		return -EFAULT;
> +
> +	return 0;
> +}
> +
> +static const struct file_operations dmirror_fops = {
> +	.read		= dmirror_fops_read,
> +	.write		= dmirror_fops_write,
> +	.mmap		= dmirror_fops_mmap,
> +	.open		= dmirror_fops_open,
> +	.release	= dmirror_fops_release,
> +	.unlocked_ioctl = dmirror_fops_unlocked_ioctl,
> +	.llseek		= default_llseek,
> +	.owner		= THIS_MODULE,
> +};
> +
> +static void dmirror_devmem_free(struct page *page)
> +{
> +	struct page *rpage = page->zone_device_data;
> +	struct dmirror_device *mdevice;
> +
> +	if (rpage)
> +		__free_page(rpage);
> +
> +	mdevice = dmirror_page_to_device(page);
> +
> +	spin_lock(&mdevice->lock);
> +	mdevice->cfree++;
> +	page->zone_device_data = mdevice->free_pages;
> +	mdevice->free_pages = page;
> +	spin_unlock(&mdevice->lock);
> +}
> +
> +static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma *args,
> +						struct dmirror_device *mdevice)
> +{
> +	struct vm_area_struct *vma = args->vma;
> +	const unsigned long *src = args->src;
> +	unsigned long *dst = args->dst;
> +	unsigned long start = args->start;
> +	unsigned long end = args->end;
> +	unsigned long addr;
> +
> +	for (addr = start; addr < end; addr += PAGE_SIZE,
> +				       src++, dst++) {
> +		struct page *dpage, *spage;
> +
> +		spage = migrate_pfn_to_page(*src);
> +		if (!spage || !(*src & MIGRATE_PFN_MIGRATE))
> +			continue;
> +		if (!dmirror_device_is_mine(mdevice, spage))
> +			continue;
> +		spage = spage->zone_device_data;
> +
> +		dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, addr);
> +		if (!dpage)
> +			continue;
> +
> +		lock_page(dpage);
> +		copy_highpage(dpage, spage);
> +		*dst = migrate_pfn(page_to_pfn(dpage)) |
> +			    MIGRATE_PFN_LOCKED;
> +		if (*src & MIGRATE_PFN_WRITE)
> +			*dst |= MIGRATE_PFN_WRITE;
> +	}
> +	return 0;
> +}
> +
> +static void dmirror_devmem_fault_finalize_and_map(struct migrate_vma *args,
> +						  struct dmirror *dmirror)
> +{
> +	/* Invalidate the device's page table mapping. */
> +	mutex_lock(&dmirror->mutex);
> +	dmirror_pt_walk(dmirror, dmirror_do_update, args->start, args->end,
> +			NULL, false);
> +	mutex_unlock(&dmirror->mutex);
> +}
> +
> +static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
> +{
> +	struct migrate_vma args;
> +	unsigned long src_pfns;
> +	unsigned long dst_pfns;
> +	struct page *rpage;
> +	struct dmirror *dmirror;
> +	vm_fault_t ret;
> +
> +	/* FIXME demonstrate how we can adjust migrate range */
> +	args.vma = vmf->vma;
> +	args.start = vmf->address;
> +	args.end = args.start + PAGE_SIZE;
> +	args.src = &src_pfns;
> +	args.dst = &dst_pfns;
> +
> +	if (migrate_vma_setup(&args))
> +		return VM_FAULT_SIGBUS;
> +
> +	/*
> +	 * Normally, a device would use page->zone_device_data to point to the
> +	 * mirror, but here we use it to hold the page that backs the simulated
> +	 * device memory, and that page holds the pointer to the mirror.
> +	 */
> +	rpage = vmf->page->zone_device_data;
> +	dmirror = rpage->zone_device_data;
> +
> +	ret = dmirror_devmem_fault_alloc_and_copy(&args, dmirror->mdevice);
> +	if (ret)
> +		return ret;
> +	migrate_vma_pages(&args);
> +	dmirror_devmem_fault_finalize_and_map(&args, dmirror);
> +	migrate_vma_finalize(&args);
> +	return 0;
> +}
> +
> +static const struct dev_pagemap_ops dmirror_devmem_ops = {
> +	.page_free	= dmirror_devmem_free,
> +	.migrate_to_ram	= dmirror_devmem_fault,
> +};
> +
> +static void dmirror_pdev_del(void *arg)
> +{
> +	struct dmirror_device *mdevice = arg;
> +	unsigned int i;
> +
> +	if (mdevice->devmem_chunks) {
> +		for (i = 0; i < mdevice->devmem_count; i++)
> +			kfree(mdevice->devmem_chunks[i]);
> +		kfree(mdevice->devmem_chunks);
> +	}
> +
> +	cdev_del(&mdevice->cdevice);
> +	kfree(mdevice);
> +}
> +
> +static int dmirror_probe(struct platform_device *pdev)
> +{
> +	struct dmirror_device *mdevice;
> +	int ret;
> +
> +	mdevice = kzalloc(sizeof(*mdevice), GFP_KERNEL);
> +	if (!mdevice)
> +		return -ENOMEM;
> +
> +	mdevice->pdevice = pdev;
> +	mutex_init(&mdevice->devmem_lock);
> +	spin_lock_init(&mdevice->lock);
> +
> +	cdev_init(&mdevice->cdevice, &dmirror_fops);
> +	ret = cdev_add(&mdevice->cdevice, pdev->dev.devt, 1);
> +	if (ret) {
> +		kfree(mdevice);
> +		return ret;
> +	}
> +
> +	platform_set_drvdata(pdev, mdevice);
> +	ret = devm_add_action_or_reset(&pdev->dev, dmirror_pdev_del, mdevice);
> +	if (ret)
> +		return ret;
> +
> +	/* Build list of free struct page */
> +	dmirror_allocate_chunk(mdevice, NULL);
> +
> +	return 0;
> +}
> +
> +static int dmirror_remove(struct platform_device *pdev)
> +{
> +	/* all probe actions are unwound by devm */
> +	return 0;
> +}
> +
> +static struct platform_driver dmirror_device_driver = {
> +	.probe		= dmirror_probe,
> +	.remove		= dmirror_remove,
> +	.driver		= {
> +		.name	= "HMM_DMIRROR",
> +	},
> +};
> +
> +static int __init hmm_dmirror_init(void)
> +{
> +	int ret;
> +	int id;
> +
> +	ret = platform_driver_register(&dmirror_device_driver);
> +	if (ret)
> +		return ret;
> +
> +	ret = alloc_chrdev_region(&dmirror_dev, 0, DMIRROR_NDEVICES,
> +				  "HMM_DMIRROR");
> +	if (ret)
> +		goto err_unreg;
> +
> +	for (id = 0; id < DMIRROR_NDEVICES; id++) {
> +		struct platform_device *pd;
> +
> +		pd = platform_device_alloc("HMM_DMIRROR", id);
> +		if (!pd) {
> +			ret = -ENOMEM;
> +			goto err_chrdev;
> +		}
> +		pd->dev.devt = MKDEV(MAJOR(dmirror_dev), id);
> +		ret = platform_device_add(pd);
> +		if (ret) {
> +			platform_device_put(pd);
> +			goto err_chrdev;
> +		}
> +		dmirror_platform_devices[id] = pd;
> +	}
> +
> +	/*
> +	 * Allocate a zero page to simulate a reserved page of device private
> +	 * memory which is always zero. The kernel's zero_pfn page isn't used,
> +	 * simply to keep the code here simpler (i.e., we need a struct page
> +	 * for it).
> +	 */
> +	dmirror_zero_page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
> +	if (!dmirror_zero_page) {
> +		ret = -ENOMEM;
> +		goto err_chrdev;
> +	}
> +
> +	pr_info("hmm_dmirror loaded. This is only for testing HMM.\n");
> +	return 0;
> +
> +err_chrdev:
> +	while (--id >= 0) {
> +		platform_device_unregister(dmirror_platform_devices[id]);
> +		dmirror_platform_devices[id] = NULL;
> +	}
> +	unregister_chrdev_region(dmirror_dev, DMIRROR_NDEVICES);
> +err_unreg:
> +	platform_driver_unregister(&dmirror_device_driver);
> +	return ret;
> +}
> +
> +static void __exit hmm_dmirror_exit(void)
> +{
> +	int id;
> +
> +	if (dmirror_zero_page)
> +		__free_page(dmirror_zero_page);
> +	for (id = 0; id < DMIRROR_NDEVICES; id++)
> +		platform_device_unregister(dmirror_platform_devices[id]);
> +	unregister_chrdev_region(dmirror_dev, DMIRROR_NDEVICES);
> +	platform_driver_unregister(&dmirror_device_driver);
> +	mmu_notifier_synchronize();
> +}
> +
> +module_init(hmm_dmirror_init);
> +module_exit(hmm_dmirror_exit);
> +MODULE_LICENSE("GPL");
> diff --git a/include/Kbuild b/include/Kbuild
> index ffba79483cc5..6ffb44a45957 100644
> --- a/include/Kbuild
> +++ b/include/Kbuild
> @@ -1063,6 +1063,7 @@ header-test-			+= uapi/linux/coda_psdev.h
>  header-test-			+= uapi/linux/errqueue.h
>  header-test-			+= uapi/linux/eventpoll.h
>  header-test-			+= uapi/linux/hdlc/ioctl.h
> +header-test-			+= uapi/linux/hmm_dmirror.h
>  header-test-			+= uapi/linux/input.h
>  header-test-			+= uapi/linux/kvm.h
>  header-test-			+= uapi/linux/kvm_para.h
> diff --git a/include/uapi/linux/hmm_dmirror.h b/include/uapi/linux/hmm_dmirror.h
> new file mode 100644
> index 000000000000..61d3643aff95
> --- /dev/null
> +++ b/include/uapi/linux/hmm_dmirror.h
> @@ -0,0 +1,74 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/*
> + * Copyright 2013 Red Hat Inc.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License as
> + * published by the Free Software Foundation; either version 2 of
> + * the License, or (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * Authors: Jérôme Glisse <jglisse@redhat.com>
> + */
> +/*
> + * This is a dummy driver to exercise the HMM (heterogeneous memory management)
> + * API of the kernel. It allows a userspace program to expose its entire address
> + * space through the HMM dummy driver file.
> + */
> +#ifndef _UAPI_LINUX_HMM_DMIRROR_H
> +#define _UAPI_LINUX_HMM_DMIRROR_H
> +
> +#include <linux/types.h>
> +#include <linux/ioctl.h>
> +
> +/*
> + * Structure to pass to the HMM test driver to mimic a device accessing
> + * system memory and ZONE_DEVICE private memory through device page tables.
> + *
> + * @addr: (in) user address the device will read/write
> + * @ptr: (in) user address where device data is copied to/from
> + * @npages: (in) number of pages to read/write
> + * @cpages: (out) number of pages copied
> + * @faults: (out) number of device page faults seen
> + */
> +struct hmm_dmirror_cmd {
> +	__u64		addr;
> +	__u64		ptr;
> +	__u64		npages;
> +	__u64		cpages;
> +	__u64		faults;
> +};
> +
> +/* Expose the address space of the calling process through hmm dummy dev file */
> +#define HMM_DMIRROR_READ		_IOWR('H', 0x00, struct hmm_dmirror_cmd)
> +#define HMM_DMIRROR_WRITE		_IOWR('H', 0x01, struct hmm_dmirror_cmd)
> +#define HMM_DMIRROR_MIGRATE		_IOWR('H', 0x02, struct hmm_dmirror_cmd)
> +#define HMM_DMIRROR_SNAPSHOT		_IOWR('H', 0x03, struct hmm_dmirror_cmd)
> +
> +/*
> + * Values written by HMM_DMIRROR_SNAPSHOT, one byte per page, to the buffer
> + * pointed to by hmm_dmirror_cmd.ptr.
> + * HMM_DMIRROR_PROT_ERROR: no valid mirror PTE for this page
> + * HMM_DMIRROR_PROT_NONE: unpopulated PTE or PTE with no access
> + * HMM_DMIRROR_PROT_READ: read-only PTE
> + * HMM_DMIRROR_PROT_WRITE: read/write PTE
> + * HMM_DMIRROR_PROT_ZERO: special read-only zero page
> + * HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL: Migrated device private page on the
> + *					device the ioctl() is made on
> + * HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE: Migrated device private page on some
> + *					other device
> + */
> +enum {
> +	HMM_DMIRROR_PROT_ERROR			= 0xFF,
> +	HMM_DMIRROR_PROT_NONE			= 0x00,
> +	HMM_DMIRROR_PROT_READ			= 0x01,
> +	HMM_DMIRROR_PROT_WRITE			= 0x02,
> +	HMM_DMIRROR_PROT_ZERO			= 0x10,
> +	HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL	= 0x20,
> +	HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE	= 0x30,
> +};
> +
> +#endif /* _UAPI_LINUX_HMM_DMIRROR_H */
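
To make the ABI above concrete, here is a minimal userspace sketch that
snapshots a few anonymous pages and prints the HMM_DMIRROR_PROT_* byte
reported for each. The /dev/hmm_dmirror0 node name matches what the
selftest opens; the page count is arbitrary and most error handling is
omitted, so treat this as an illustration rather than a reference
program.

	#include <stdio.h>
	#include <stdlib.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <sys/mman.h>
	#include <linux/hmm_dmirror.h>

	int main(void)
	{
		unsigned long npages = 4;
		size_t size = npages * sysconf(_SC_PAGE_SIZE);
		int fd = open("/dev/hmm_dmirror0", O_RDWR);
		char *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		unsigned char *perm = calloc(npages, 1);
		struct hmm_dmirror_cmd cmd = {
			.addr = (__u64)mem,	/* pages to snapshot */
			.ptr = (__u64)perm,	/* one PROT byte per page */
			.npages = npages,
		};
		unsigned long i;

		if (fd < 0 || mem == MAP_FAILED || !perm)
			return 1;

		mem[0] = 1;			/* fault in the first page */

		if (ioctl(fd, HMM_DMIRROR_SNAPSHOT, &cmd))
			return 1;

		for (i = 0; i < cmd.cpages; i++)
			printf("page %lu: prot 0x%02x\n", i, perm[i]);

		close(fd);
		return 0;
	}
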
> diff --git a/tools/testing/selftests/vm/.gitignore b/tools/testing/selftests/vm/.gitignore
> index 31b3c98b6d34..3054565b3f07 100644
> --- a/tools/testing/selftests/vm/.gitignore
> +++ b/tools/testing/selftests/vm/.gitignore
> @@ -14,3 +14,4 @@ virtual_address_range
>  gup_benchmark
>  va_128TBswitch
>  map_fixed_noreplace
> +hmm-tests
> diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
> index 9534dc2bc929..5643cfb5e3d6 100644
> --- a/tools/testing/selftests/vm/Makefile
> +++ b/tools/testing/selftests/vm/Makefile
> @@ -5,6 +5,7 @@ CFLAGS = -Wall -I ../../../../usr/include $(EXTRA_CFLAGS)
>  LDLIBS = -lrt
>  TEST_GEN_FILES = compaction_test
>  TEST_GEN_FILES += gup_benchmark
> +TEST_GEN_FILES += hmm-tests
>  TEST_GEN_FILES += hugepage-mmap
>  TEST_GEN_FILES += hugepage-shm
>  TEST_GEN_FILES += map_hugetlb
> @@ -26,6 +27,8 @@ TEST_FILES := test_vmalloc.sh
>  KSFT_KHDR_INSTALL := 1
>  include ../lib.mk
>  
> +$(OUTPUT)/hmm-tests: LDLIBS += -lhugetlbfs -lpthread
> +
>  $(OUTPUT)/userfaultfd: LDLIBS += -lpthread
>  
>  $(OUTPUT)/mlock-random-test: LDLIBS += -lcap
> diff --git a/tools/testing/selftests/vm/config b/tools/testing/selftests/vm/config
> index 1c0d76cb5adf..34cfab18e737 100644
> --- a/tools/testing/selftests/vm/config
> +++ b/tools/testing/selftests/vm/config
> @@ -1,2 +1,5 @@
>  CONFIG_SYSVIPC=y
>  CONFIG_USERFAULTFD=y
> +CONFIG_HMM_MIRROR=y
> +CONFIG_DEVICE_PRIVATE=y
> +CONFIG_HMM_DMIRROR=m
> diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c
> new file mode 100644
> index 000000000000..f4ae6188fd0e
> --- /dev/null
> +++ b/tools/testing/selftests/vm/hmm-tests.c
> @@ -0,0 +1,1311 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright 2013 Red Hat Inc.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License as
> + * published by the Free Software Foundation; either version 2 of
> + * the License, or (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * Authors: Jérôme Glisse <jglisse@redhat.com>
> + */
> +
> +/*
> + * HMM stands for Heterogeneous Memory Management; it is a helper layer inside
> + * the Linux kernel that helps device drivers mirror a process address space on
> + * the device. This allows the device to use the same address space, which
> + * makes communication and data exchange a lot easier.
> + *
> + * This framework's sole purpose is to exercise various code paths inside
> + * the kernel to make sure that HMM performs as expected and to flush out any
> + * bugs.
> + */
> +
> +#include "../kselftest_harness.h"
> +
> +#include <errno.h>
> +#include <fcntl.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <stdint.h>
> +#include <unistd.h>
> +#include <strings.h>
> +#include <time.h>
> +#include <pthread.h>
> +#include <hugetlbfs.h>
> +#include <sys/types.h>
> +#include <sys/stat.h>
> +#include <sys/mman.h>
> +#include <sys/ioctl.h>
> +#include <linux/hmm_dmirror.h>
> +
> +struct hmm_buffer {
> +	void		*ptr;
> +	void		*mirror;
> +	unsigned long	size;
> +	int		fd;
> +	uint64_t	cpages;
> +	uint64_t	faults;
> +};
> +
> +#define TWOMEG		(1 << 21)
> +#define HMM_BUFFER_SIZE (1024 << 12)
> +#define HMM_PATH_MAX    64
> +#define NTIMES		256
> +
> +#define ALIGN(x, a) (((x) + ((a) - 1)) & (~((a) - 1)))
> +
> +FIXTURE(hmm)
> +{
> +	int		fd;
> +	unsigned int	page_size;
> +	unsigned int	page_shift;
> +};
> +
> +FIXTURE(hmm2)
> +{
> +	int		fd0;
> +	int		fd1;
> +	unsigned int	page_size;
> +	unsigned int	page_shift;
> +};
> +
> +static int hmm_open(int unit)
> +{
> +	char pathname[HMM_PATH_MAX];
> +	int fd;
> +
> +	snprintf(pathname, sizeof(pathname), "/dev/hmm_dmirror%d", unit);
> +	fd = open(pathname, O_RDWR, 0);
> +	if (fd < 0)
> +		fprintf(stderr, "could not open hmm dmirror driver (%s)\n",
> +			pathname);
> +	return fd;
> +}
> +
> +FIXTURE_SETUP(hmm)
> +{
> +	self->page_size = sysconf(_SC_PAGE_SIZE);
> +	self->page_shift = ffs(self->page_size) - 1;
> +
> +	self->fd = hmm_open(0);
> +	ASSERT_GE(self->fd, 0);
> +}
> +
> +FIXTURE_SETUP(hmm2)
> +{
> +	self->page_size = sysconf(_SC_PAGE_SIZE);
> +	self->page_shift = ffs(self->page_size) - 1;
> +
> +	self->fd0 = hmm_open(0);
> +	ASSERT_GE(self->fd0, 0);
> +	self->fd1 = hmm_open(1);
> +	ASSERT_GE(self->fd1, 0);
> +}
> +
> +FIXTURE_TEARDOWN(hmm)
> +{
> +	int ret = close(self->fd);
> +
> +	ASSERT_EQ(ret, 0);
> +	self->fd = -1;
> +}
> +
> +FIXTURE_TEARDOWN(hmm2)
> +{
> +	int ret = close(self->fd0);
> +
> +	ASSERT_EQ(ret, 0);
> +	self->fd0 = -1;
> +
> +	ret = close(self->fd1);
> +	ASSERT_EQ(ret, 0);
> +	self->fd1 = -1;
> +}
> +
> +static int hmm_dmirror_cmd(int fd,
> +			   unsigned long request,
> +			   struct hmm_buffer *buffer,
> +			   unsigned long npages)
> +{
> +	struct hmm_dmirror_cmd cmd;
> +	int ret;
> +
> +	/* Simulate a device reading system memory. */
> +	cmd.addr = (__u64)buffer->ptr;
> +	cmd.ptr = (__u64)buffer->mirror;
> +	cmd.npages = npages;
> +
> +	for (;;) {
> +		ret = ioctl(fd, request, &cmd);
> +		if (ret == 0)
> +			break;
> +		if (errno == EINTR)
> +			continue;
> +		return -errno;
> +	}
> +	buffer->cpages = cmd.cpages;
> +	buffer->faults = cmd.faults;
> +
> +	return 0;
> +}
> +
> +static void hmm_buffer_free(struct hmm_buffer *buffer)
> +{
> +	if (buffer == NULL)
> +		return;
> +
> +	if (buffer->ptr)
> +		munmap(buffer->ptr, buffer->size);
> +	free(buffer->mirror);
> +	free(buffer);
> +}
> +
> +/*
> + * Create a temporary file that will be deleted on close.
> + */
> +static int hmm_create_file(unsigned long size)
> +{
> +	char path[HMM_PATH_MAX];
> +	int fd;
> +
> +	strcpy(path, "/tmp");
> +	fd = open(path, O_TMPFILE | O_EXCL | O_RDWR, 0600);
> +	if (fd >= 0) {
> +		int r;
> +
> +		do {
> +			r = ftruncate(fd, size);
> +		} while (r == -1 && errno == EINTR);
> +		if (!r)
> +			return fd;
> +		close(fd);
> +	}
> +	return -1;
> +}
> +
> +/*
> + * Return a random unsigned number.
> + */
> +static unsigned int hmm_random(void)
> +{
> +	static int fd = -1;
> +	unsigned int r;
> +
> +	if (fd < 0) {
> +		fd = open("/dev/urandom", O_RDONLY);
> +		if (fd < 0) {
> +			fprintf(stderr, "%s:%d failed to open /dev/urandom\n",
> +					__FILE__, __LINE__);
> +			return ~0U;
> +		}
> +	}
> +	read(fd, &r, sizeof(r));
> +	return r;
> +}
> +
> +static void hmm_nanosleep(unsigned int n)
> +{
> +	struct timespec t;
> +
> +	t.tv_sec = 0;
> +	t.tv_nsec = n;
> +	nanosleep(&t, NULL);
> +}
> +
> +/*
> + * Simple NULL test of device open/close.
> + */
> +TEST_F(hmm, open_close)
> +{
> +}
> +
> +/*
> + * Read private anonymous memory.
> + */
> +TEST_F(hmm, anon_read)
> +{
> +	struct hmm_buffer *buffer;
> +	unsigned long npages;
> +	unsigned long size;
> +	unsigned long i;
> +	int *ptr;
> +	int ret;
> +	int val;
> +
> +	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
> +	ASSERT_NE(npages, 0);
> +	size = npages << self->page_shift;
> +
> +	buffer = malloc(sizeof(*buffer));
> +	ASSERT_NE(buffer, NULL);
> +
> +	buffer->fd = -1;
> +	buffer->size = size;
> +	buffer->mirror = malloc(size);
> +	ASSERT_NE(buffer->mirror, NULL);
> +
> +	buffer->ptr = mmap(NULL, size,
> +			   PROT_READ | PROT_WRITE,
> +			   MAP_PRIVATE | MAP_ANONYMOUS,
> +			   buffer->fd, 0);
> +	ASSERT_NE(buffer->ptr, MAP_FAILED);
> +
> +	/*
> +	 * Initialize buffer in system memory but leave the first two pages
> +	 * zero (pte_none and pfn_zero).
> +	 */
> +	i = 2 * self->page_size / sizeof(*ptr);
> +	for (ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
> +		ptr[i] = i;
> +
> +	/* Set buffer permission to read-only. */
> +	ret = mprotect(buffer->ptr, size, PROT_READ);
> +	ASSERT_EQ(ret, 0);
> +
> +	/* Populate the CPU page table with a special zero page. */
> +	val = *(int *)(buffer->ptr + self->page_size);
> +	ASSERT_EQ(val, 0);
> +
> +	/* Simulate a device reading system memory. */
> +	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_READ, buffer, npages);
> +	ASSERT_EQ(ret, 0);
> +	ASSERT_EQ(buffer->cpages, npages);
> +	ASSERT_EQ(buffer->faults, 1);
> +
> +	/* Check what the device read. */
> +	ptr = buffer->mirror;
> +	for (i = 0; i < 2 * self->page_size / sizeof(*ptr); ++i)
> +		ASSERT_EQ(ptr[i], 0);
> +	for (; i < size / sizeof(*ptr); ++i)
> +		ASSERT_EQ(ptr[i], i);
> +
> +	hmm_buffer_free(buffer);
> +}
> +
> +/*
> + * Read private anonymous memory which has been protected with
> + * mprotect() PROT_NONE.
> + */
> +TEST_F(hmm, anon_read_prot)
> +{
> +	struct hmm_buffer *buffer;
> +	unsigned long npages;
> +	unsigned long size;
> +	unsigned long i;
> +	int *ptr;
> +	int ret;
> +
> +	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
> +	ASSERT_NE(npages, 0);
> +	size = npages << self->page_shift;
> +
> +	buffer = malloc(sizeof(*buffer));
> +	ASSERT_NE(buffer, NULL);
> +
> +	buffer->fd = -1;
> +	buffer->size = size;
> +	buffer->mirror = malloc(size);
> +	ASSERT_NE(buffer->mirror, NULL);
> +
> +	buffer->ptr = mmap(NULL, size,
> +			   PROT_READ | PROT_WRITE,
> +			   MAP_PRIVATE | MAP_ANONYMOUS,
> +			   buffer->fd, 0);
> +	ASSERT_NE(buffer->ptr, MAP_FAILED);
> +
> +	/* Initialize buffer in system memory. */
> +	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
> +		ptr[i] = i;
> +
> +	/* Initialize mirror buffer so we can verify it isn't written. */
> +	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
> +		ptr[i] = -i;
> +
> +	/* Protect buffer from reading. */
> +	ret = mprotect(buffer->ptr, size, PROT_NONE);
> +	ASSERT_EQ(ret, 0);
> +
> +	/* Simulate a device reading system memory. */
> +	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_READ, buffer, npages);
> +	ASSERT_EQ(ret, -EFAULT);
> +
> +	/* Allow CPU to read the buffer so we can check it. */
> +	ret = mprotect(buffer->ptr, size, PROT_READ);
> +	ASSERT_EQ(ret, 0);
> +	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
> +		ASSERT_EQ(ptr[i], i);
> +
> +	/* Check what the device read. */
> +	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
> +		ASSERT_EQ(ptr[i], -i);
> +
> +	hmm_buffer_free(buffer);
> +}
> +
> +/*
> + * Write private anonymous memory.
> + */
> +TEST_F(hmm, anon_write)
> +{
> +	struct hmm_buffer *buffer;
> +	unsigned long npages;
> +	unsigned long size;
> +	unsigned long i;
> +	int *ptr;
> +	int ret;
> +
> +	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
> +	ASSERT_NE(npages, 0);
> +	size = npages << self->page_shift;
> +
> +	buffer = malloc(sizeof(*buffer));
> +	ASSERT_NE(buffer, NULL);
> +
> +	buffer->fd = -1;
> +	buffer->size = size;
> +	buffer->mirror = malloc(size);
> +	ASSERT_NE(buffer->mirror, NULL);
> +
> +	buffer->ptr = mmap(NULL, size,
> +			   PROT_READ | PROT_WRITE,
> +			   MAP_PRIVATE | MAP_ANONYMOUS,
> +			   buffer->fd, 0);
> +	ASSERT_NE(buffer->ptr, MAP_FAILED);
> +
> +	/* Initialize data that the device will write to buffer->ptr. */
> +	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
> +		ptr[i] = i;
> +
> +	/* Simulate a device writing system memory. */
> +	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_WRITE, buffer, npages);
> +	ASSERT_EQ(ret, 0);
> +	ASSERT_EQ(buffer->cpages, npages);
> +	ASSERT_EQ(buffer->faults, 1);
> +
> +	/* Check what the device wrote. */
> +	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
> +		ASSERT_EQ(ptr[i], i);
> +
> +	hmm_buffer_free(buffer);
> +}
> +
> +/*
> + * Write private anonymous memory which has been protected with
> + * mprotect() PROT_READ.
> + */
> +TEST_F(hmm, anon_write_prot)
> +{
> +	struct hmm_buffer *buffer;
> +	unsigned long npages;
> +	unsigned long size;
> +	unsigned long i;
> +	int *ptr;
> +	int ret;
> +
> +	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
> +	ASSERT_NE(npages, 0);
> +	size = npages << self->page_shift;
> +
> +	buffer = malloc(sizeof(*buffer));
> +	ASSERT_NE(buffer, NULL);
> +
> +	buffer->fd = -1;
> +	buffer->size = size;
> +	buffer->mirror = malloc(size);
> +	ASSERT_NE(buffer->mirror, NULL);
> +
> +	buffer->ptr = mmap(NULL, size,
> +			   PROT_READ,
> +			   MAP_PRIVATE | MAP_ANONYMOUS,
> +			   buffer->fd, 0);
> +	ASSERT_NE(buffer->ptr, MAP_FAILED);
> +
> +	/* Simulate a device reading a zero page of memory. */
> +	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_READ, buffer, 1);
> +	ASSERT_EQ(ret, 0);
> +	ASSERT_EQ(buffer->cpages, 1);
> +	ASSERT_EQ(buffer->faults, 1);
> +
> +	/* Initialize data that the device will write to buffer->ptr. */
> +	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
> +		ptr[i] = i;
> +
> +	/* Simulate a device writing system memory. */
> +	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_WRITE, buffer, npages);
> +	ASSERT_EQ(ret, -EPERM);
> +
> +	/* Check what the device wrote. */
> +	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
> +		ASSERT_EQ(ptr[i], 0);
> +
> +	/* Now allow writing and see that the zero page is replaced. */
> +	ret = mprotect(buffer->ptr, size, PROT_WRITE | PROT_READ);
> +	ASSERT_EQ(ret, 0);
> +
> +	/* Simulate a device writing system memory. */
> +	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_WRITE, buffer, npages);
> +	ASSERT_EQ(ret, 0);
> +	ASSERT_EQ(buffer->cpages, npages);
> +	ASSERT_EQ(buffer->faults, 1);
> +
> +	/* Check what the device wrote. */
> +	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
> +		ASSERT_EQ(ptr[i], i);
> +
> +	hmm_buffer_free(buffer);
> +}
> +
> +/*
> + * Check that a device writing an anonymous private mapping
> + * will copy-on-write if a child process inherits the mapping.
> + */
> +TEST_F(hmm, anon_write_child)
> +{
> +	struct hmm_buffer *buffer;
> +	unsigned long npages;
> +	unsigned long size;
> +	unsigned long i;
> +	int *ptr;
> +	pid_t pid;
> +	int child_fd;
> +	int ret;
> +
> +	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
> +	ASSERT_NE(npages, 0);
> +	size = npages << self->page_shift;
> +
> +	buffer = malloc(sizeof(*buffer));
> +	ASSERT_NE(buffer, NULL);
> +
> +	buffer->fd = -1;
> +	buffer->size = size;
> +	buffer->mirror = malloc(size);
> +	ASSERT_NE(buffer->mirror, NULL);
> +
> +	buffer->ptr = mmap(NULL, size,
> +			   PROT_READ | PROT_WRITE,
> +			   MAP_PRIVATE | MAP_ANONYMOUS,
> +			   buffer->fd, 0);
> +	ASSERT_NE(buffer->ptr, MAP_FAILED);
> +
> +	/* Initialize buffer->ptr so we can tell if it is written. */
> +	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
> +		ptr[i] = i;
> +
> +	/* Initialize data that the device will write to buffer->ptr. */
> +	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
> +		ptr[i] = -i;
> +
> +	pid = fork();
> +	if (pid == -1)
> +		ASSERT_EQ(pid, 0);
> +	if (pid != 0) {
> +		waitpid(pid, &ret, 0);
> +		ASSERT_EQ(WIFEXITED(ret), 1);
> +
> +		/* Check that the parent's buffer did not change. */
> +		for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
> +			ASSERT_EQ(ptr[i], i);
> +		return;
> +	}
> +
> +	/* Check that we see the parent's values. */
> +	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
> +		ASSERT_EQ(ptr[i], i);
> +	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
> +		ASSERT_EQ(ptr[i], -i);
> +
> +	/* The child process needs its own mirror to its own mm. */
> +	child_fd = hmm_open(0);
> +	ASSERT_GE(child_fd, 0);
> +
> +	/* Simulate a device writing system memory. */
> +	ret = hmm_dmirror_cmd(child_fd, HMM_DMIRROR_WRITE, buffer, npages);
> +	ASSERT_EQ(ret, 0);
> +	ASSERT_EQ(buffer->cpages, npages);
> +	ASSERT_EQ(buffer->faults, 1);
> +
> +	/* Check what the device wrote. */
> +	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
> +		ASSERT_EQ(ptr[i], -i);
> +
> +	close(child_fd);
> +	exit(0);
> +}
> +
> +/*
> + * Check that a device writing an anonymous shared mapping
> + * will not copy-on-write if a child process inherits the mapping.
> + */
> +TEST_F(hmm, anon_write_child_shared)
> +{
> +	struct hmm_buffer *buffer;
> +	unsigned long npages;
> +	unsigned long size;
> +	unsigned long i;
> +	int *ptr;
> +	pid_t pid;
> +	int child_fd;
> +	int ret;
> +
> +	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
> +	ASSERT_NE(npages, 0);
> +	size = npages << self->page_shift;
> +
> +	buffer = malloc(sizeof(*buffer));
> +	ASSERT_NE(buffer, NULL);
> +
> +	buffer->fd = -1;
> +	buffer->size = size;
> +	buffer->mirror = malloc(size);
> +	ASSERT_NE(buffer->mirror, NULL);
> +
> +	buffer->ptr = mmap(NULL, size,
> +			   PROT_READ | PROT_WRITE,
> +			   MAP_SHARED | MAP_ANONYMOUS,
> +			   buffer->fd, 0);
> +	ASSERT_NE(buffer->ptr, MAP_FAILED);
> +
> +	/* Initialize buffer->ptr so we can tell if it is written. */
> +	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
> +		ptr[i] = i;
> +
> +	/* Initialize data that the device will write to buffer->ptr. */
> +	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
> +		ptr[i] = -i;
> +
> +	pid = fork();
> +	if (pid == -1)
> +		ASSERT_EQ(pid, 0);
> +	if (pid != 0) {
> +		waitpid(pid, &ret, 0);
> +		ASSERT_EQ(WIFEXITED(ret), 1);
> +
> +		/* Check that the parent's buffer did change. */
> +		for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
> +			ASSERT_EQ(ptr[i], -i);
> +		return;
> +	}
> +
> +	/* Check that we see the parent's values. */
> +	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
> +		ASSERT_EQ(ptr[i], i);
> +	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
> +		ASSERT_EQ(ptr[i], -i);
> +
> +	/* The child process needs its own mirror to its own mm. */
> +	child_fd = hmm_open(0);
> +	ASSERT_GE(child_fd, 0);
> +
> +	/* Simulate a device writing system memory. */
> +	ret = hmm_dmirror_cmd(child_fd, HMM_DMIRROR_WRITE, buffer, npages);
> +	ASSERT_EQ(ret, 0);
> +	ASSERT_EQ(buffer->cpages, npages);
> +	ASSERT_EQ(buffer->faults, 1);
> +
> +	/* Check what the device wrote. */
> +	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
> +		ASSERT_EQ(ptr[i], -i);
> +
> +	close(child_fd);
> +	exit(0);
> +}
> +
> +/*
> + * Write private anonymous huge page.
> + */
> +TEST_F(hmm, anon_write_huge)
> +{
> +	struct hmm_buffer *buffer;
> +	unsigned long npages;
> +	unsigned long size;
> +	unsigned long i;
> +	void *old_ptr;
> +	void *map;
> +	int *ptr;
> +	int ret;
> +
> +	size = 2 * TWOMEG;
> +
> +	buffer = malloc(sizeof(*buffer));
> +	ASSERT_NE(buffer, NULL);
> +
> +	buffer->fd = -1;
> +	buffer->size = size;
> +	buffer->mirror = malloc(size);
> +	ASSERT_NE(buffer->mirror, NULL);
> +
> +	buffer->ptr = mmap(NULL, size,
> +			   PROT_READ | PROT_WRITE,
> +			   MAP_PRIVATE | MAP_ANONYMOUS,
> +			   buffer->fd, 0);
> +	ASSERT_NE(buffer->ptr, MAP_FAILED);
> +
> +	size = TWOMEG;
> +	npages = size >> self->page_shift;
> +	map = (void *)ALIGN((uintptr_t)buffer->ptr, size);
> +	ret = madvise(map, size, MADV_HUGEPAGE);
> +	ASSERT_EQ(ret, 0);
> +	old_ptr = buffer->ptr;
> +	buffer->ptr = map;
> +
> +	/* Initialize data that the device will write to buffer->ptr. */
> +	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
> +		ptr[i] = i;
> +
> +	/* Simulate a device writing system memory. */
> +	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_WRITE, buffer, npages);
> +	ASSERT_EQ(ret, 0);
> +	ASSERT_EQ(buffer->cpages, npages);
> +	ASSERT_EQ(buffer->faults, 1);
> +
> +	/* Check what the device wrote. */
> +	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
> +		ASSERT_EQ(ptr[i], i);
> +
> +	buffer->ptr = old_ptr;
> +	hmm_buffer_free(buffer);
> +}
> +
> +/*
> + * Write a hugetlbfs page.
> + */
> +TEST_F(hmm, anon_write_hugetlbfs)
> +{
> +	struct hmm_buffer *buffer;
> +	unsigned long npages;
> +	unsigned long size;
> +	unsigned long i;
> +	int *ptr;
> +	int ret;
> +	long pagesizes[4];
> +	int n, idx;
> +
> +	/* Skip test if we can't allocate a hugetlbfs page. */
> +
> +	n = gethugepagesizes(pagesizes, 4);
> +	if (n <= 0)
> +		return;
> +	for (idx = 0; --n > 0; ) {
> +		if (pagesizes[n] < pagesizes[idx])
> +			idx = n;
> +	}
> +	size = ALIGN(TWOMEG, pagesizes[idx]);
> +	npages = size >> self->page_shift;
> +
> +	buffer = malloc(sizeof(*buffer));
> +	ASSERT_NE(buffer, NULL);
> +
> +	buffer->ptr = get_hugepage_region(size, GHR_STRICT);
> +	if (buffer->ptr == NULL) {
> +		free(buffer);
> +		return;
> +	}
> +
> +	buffer->fd = -1;
> +	buffer->size = size;
> +	buffer->mirror = malloc(size);
> +	ASSERT_NE(buffer->mirror, NULL);
> +
> +	/* Initialize data that the device will write to buffer->ptr. */
> +	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
> +		ptr[i] = i;
> +
> +	/* Simulate a device writing system memory. */
> +	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_WRITE, buffer, npages);
> +	ASSERT_EQ(ret, 0);
> +	ASSERT_EQ(buffer->cpages, npages);
> +	ASSERT_EQ(buffer->faults, 1);
> +
> +	/* Check what the device wrote. */
> +	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
> +		ASSERT_EQ(ptr[i], i);
> +
> +	free_hugepage_region(buffer->ptr);
> +	buffer->ptr = NULL;
> +	hmm_buffer_free(buffer);
> +}
> +
> +/*
> + * Read mmap'ed file memory.
> + */
> +TEST_F(hmm, file_read)
> +{
> +	struct hmm_buffer *buffer;
> +	unsigned long npages;
> +	unsigned long size;
> +	unsigned long i;
> +	int *ptr;
> +	int ret;
> +	int fd;
> +	off_t off;
> +	ssize_t len;
> +
> +	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
> +	ASSERT_NE(npages, 0);
> +	size = npages << self->page_shift;
> +
> +	fd = hmm_create_file(size);
> +	ASSERT_GE(fd, 0);
> +
> +	buffer = malloc(sizeof(*buffer));
> +	ASSERT_NE(buffer, NULL);
> +
> +	buffer->fd = fd;
> +	buffer->size = size;
> +	buffer->mirror = malloc(size);
> +	ASSERT_NE(buffer->mirror, NULL);
> +
> +	/* Write initial contents of the file. */
> +	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
> +		ptr[i] = i;
> +	off = lseek(fd, 0, SEEK_SET);
> +	ASSERT_EQ(off, 0);
> +	len = write(fd, buffer->mirror, size);
> +	ASSERT_EQ(len, size);
> +	memset(buffer->mirror, 0, size);
> +
> +	buffer->ptr = mmap(NULL, size,
> +			   PROT_READ,
> +			   MAP_SHARED,
> +			   buffer->fd, 0);
> +	ASSERT_NE(buffer->ptr, MAP_FAILED);
> +
> +	/* Simulate a device reading system memory. */
> +	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_READ, buffer, npages);
> +	ASSERT_EQ(ret, 0);
> +	ASSERT_EQ(buffer->cpages, npages);
> +	ASSERT_EQ(buffer->faults, 1);
> +
> +	/* Check what the device read. */
> +	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
> +		ASSERT_EQ(ptr[i], i);
> +
> +	hmm_buffer_free(buffer);
> +}
> +
> +/*
> + * Write mmap'ed file memory.
> + */
> +TEST_F(hmm, file_write)
> +{
> +	struct hmm_buffer *buffer;
> +	unsigned long npages;
> +	unsigned long size;
> +	unsigned long i;
> +	int *ptr;
> +	int ret;
> +	int fd;
> +	off_t off;
> +	ssize_t len;
> +
> +	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
> +	ASSERT_NE(npages, 0);
> +	size = npages << self->page_shift;
> +
> +	fd = hmm_create_file(size);
> +	ASSERT_GE(fd, 0);
> +
> +	buffer = malloc(sizeof(*buffer));
> +	ASSERT_NE(buffer, NULL);
> +
> +	buffer->fd = fd;
> +	buffer->size = size;
> +	buffer->mirror = malloc(size);
> +	ASSERT_NE(buffer->mirror, NULL);
> +
> +	buffer->ptr = mmap(NULL, size,
> +			   PROT_READ | PROT_WRITE,
> +			   MAP_SHARED,
> +			   buffer->fd, 0);
> +	ASSERT_NE(buffer->ptr, MAP_FAILED);
> +
> +	/* Initialize data that the device will write to buffer->ptr. */
> +	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
> +		ptr[i] = i;
> +
> +	/* Simulate a device writing system memory. */
> +	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_WRITE, buffer, npages);
> +	ASSERT_EQ(ret, 0);
> +	ASSERT_EQ(buffer->cpages, npages);
> +	ASSERT_EQ(buffer->faults, 1);
> +
> +	/* Check what the device wrote. */
> +	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
> +		ASSERT_EQ(ptr[i], i);
> +
> +	/* Check that the device also wrote the file. */
> +	off = lseek(fd, 0, SEEK_SET);
> +	ASSERT_EQ(off, 0);
> +	len = read(fd, buffer->mirror, size);
> +	ASSERT_EQ(len, size);
> +	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
> +		ASSERT_EQ(ptr[i], i);
> +
> +	hmm_buffer_free(buffer);
> +}
> +
> +/*
> + * Migrate anonymous memory to device private memory.
> + */
> +TEST_F(hmm, migrate)
> +{
> +	struct hmm_buffer *buffer;
> +	unsigned long npages;
> +	unsigned long size;
> +	unsigned long i;
> +	int *ptr;
> +	int ret;
> +
> +	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
> +	ASSERT_NE(npages, 0);
> +	size = npages << self->page_shift;
> +
> +	buffer = malloc(sizeof(*buffer));
> +	ASSERT_NE(buffer, NULL);
> +
> +	buffer->fd = -1;
> +	buffer->size = size;
> +	buffer->mirror = malloc(size);
> +	ASSERT_NE(buffer->mirror, NULL);
> +
> +	buffer->ptr = mmap(NULL, size,
> +			   PROT_READ | PROT_WRITE,
> +			   MAP_PRIVATE | MAP_ANONYMOUS,
> +			   buffer->fd, 0);
> +	ASSERT_NE(buffer->ptr, MAP_FAILED);
> +
> +	/* Initialize buffer in system memory. */
> +	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
> +		ptr[i] = i;
> +
> +	/* Migrate memory to device. */
> +	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
> +	ASSERT_EQ(ret, 0);
> +	ASSERT_EQ(buffer->cpages, npages);
> +
> +	/* Check what the device read. */
> +	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
> +		ASSERT_EQ(ptr[i], i);
> +
> +	hmm_buffer_free(buffer);
> +}
> +
> +/*
> + * Migrate anonymous memory to device private memory and fault it back to system
> + * memory.
> + */
> +TEST_F(hmm, migrate_fault)
> +{
> +	struct hmm_buffer *buffer;
> +	unsigned long npages;
> +	unsigned long size;
> +	unsigned long i;
> +	int *ptr;
> +	int ret;
> +
> +	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
> +	ASSERT_NE(npages, 0);
> +	size = npages << self->page_shift;
> +
> +	buffer = malloc(sizeof(*buffer));
> +	ASSERT_NE(buffer, NULL);
> +
> +	buffer->fd = -1;
> +	buffer->size = size;
> +	buffer->mirror = malloc(size);
> +	ASSERT_NE(buffer->mirror, NULL);
> +
> +	buffer->ptr = mmap(NULL, size,
> +			   PROT_READ | PROT_WRITE,
> +			   MAP_PRIVATE | MAP_ANONYMOUS,
> +			   buffer->fd, 0);
> +	ASSERT_NE(buffer->ptr, MAP_FAILED);
> +
> +	/* Initialize buffer in system memory. */
> +	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
> +		ptr[i] = i;
> +
> +	/* Migrate memory to device. */
> +	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
> +	ASSERT_EQ(ret, 0);
> +	ASSERT_EQ(buffer->cpages, npages);
> +
> +	/* Check what the device read. */
> +	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
> +		ASSERT_EQ(ptr[i], i);
> +
> +	/* Fault pages back to system memory and check them. */
> +	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
> +		ASSERT_EQ(ptr[i], i);
> +
> +	hmm_buffer_free(buffer);
> +}
> +
> +/*
> + * Try to migrate various memory types to device private memory.
> + */
> +TEST_F(hmm2, migrate_mixed)
> +{
> +	struct hmm_buffer *buffer;
> +	unsigned long npages;
> +	unsigned long size;
> +	int *ptr;
> +	unsigned char *p;
> +	int ret;
> +	int val;
> +
> +	npages = 6;
> +	size = npages << self->page_shift;
> +
> +	buffer = malloc(sizeof(*buffer));
> +	ASSERT_NE(buffer, NULL);
> +
> +	buffer->fd = -1;
> +	buffer->size = size;
> +	buffer->mirror = malloc(size);
> +	ASSERT_NE(buffer->mirror, NULL);
> +
> +	/* Reserve a range of addresses. */
> +	buffer->ptr = mmap(NULL, size,
> +			   PROT_NONE,
> +			   MAP_PRIVATE | MAP_ANONYMOUS,
> +			   buffer->fd, 0);
> +	ASSERT_NE(buffer->ptr, MAP_FAILED);
> +	p = buffer->ptr;
> +
> +	/* Now try to migrate everything to device 1. */
> +	ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, npages);
> +	ASSERT_EQ(ret, 0);
> +	ASSERT_EQ(buffer->cpages, 6);
> +
> +	/* Punch a hole after the first page address. */
> +	ret = munmap(buffer->ptr + self->page_size, self->page_size);
> +	ASSERT_EQ(ret, 0);
> +
> +	/* We expect an error if the vma doesn't cover the range. */
> +	ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, 3);
> +	ASSERT_EQ(ret, -EINVAL);
> +
> +	/* Page 2 will be a read-only zero page. */
> +	ret = mprotect(buffer->ptr + 2 * self->page_size, self->page_size,
> +				PROT_READ);
> +	ASSERT_EQ(ret, 0);
> +	ptr = (int *)(buffer->ptr + 2 * self->page_size);
> +	val = *ptr + 3;
> +	ASSERT_EQ(val, 3);
> +
> +	/* Page 3 will be read-only. */
> +	ret = mprotect(buffer->ptr + 3 * self->page_size, self->page_size,
> +				PROT_READ | PROT_WRITE);
> +	ASSERT_EQ(ret, 0);
> +	ptr = (int *)(buffer->ptr + 3 * self->page_size);
> +	*ptr = val;
> +	ret = mprotect(buffer->ptr + 3 * self->page_size, self->page_size,
> +				PROT_READ);
> +	ASSERT_EQ(ret, 0);
> +
> +	/* Page 4 will be read-write. */
> +	ret = mprotect(buffer->ptr + 4 * self->page_size, self->page_size,
> +				PROT_READ | PROT_WRITE);
> +	ASSERT_EQ(ret, 0);
> +	ptr = (int *)(buffer->ptr + 4 * self->page_size);
> +	*ptr = val;
> +
> +	/* Page 5 won't be migrated to device 0 because it's on device 1. */
> +	buffer->ptr = p + 5 * self->page_size;
> +	ret = hmm_dmirror_cmd(self->fd0, HMM_DMIRROR_MIGRATE, buffer, 1);
> +	ASSERT_EQ(ret, -ENOENT);
> +	buffer->ptr = p;
> +
> +	/* Now try to migrate pages 2-3 to device 1. */
> +	buffer->ptr = p + 2 * self->page_size;
> +	ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, 2);
> +	ASSERT_EQ(ret, 0);
> +	ASSERT_EQ(buffer->cpages, 2);
> +	buffer->ptr = p;
> +
> +	hmm_buffer_free(buffer);
> +}
> +
> +/*
> + * Migrate anonymous memory to device private memory and fault it back to system
> + * memory multiple times.
> + */
> +TEST_F(hmm, migrate_multiple)
> +{
> +	struct hmm_buffer *buffer;
> +	unsigned long npages;
> +	unsigned long size;
> +	unsigned long i;
> +	unsigned long c;
> +	int *ptr;
> +	int ret;
> +
> +	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
> +	ASSERT_NE(npages, 0);
> +	size = npages << self->page_shift;
> +
> +	for (c = 0; c < NTIMES; c++) {
> +		buffer = malloc(sizeof(*buffer));
> +		ASSERT_NE(buffer, NULL);
> +
> +		buffer->fd = -1;
> +		buffer->size = size;
> +		buffer->mirror = malloc(size);
> +		ASSERT_NE(buffer->mirror, NULL);
> +
> +		buffer->ptr = mmap(NULL, size,
> +				   PROT_READ | PROT_WRITE,
> +				   MAP_PRIVATE | MAP_ANONYMOUS,
> +				   buffer->fd, 0);
> +		ASSERT_NE(buffer->ptr, MAP_FAILED);
> +
> +		/* Initialize buffer in system memory. */
> +		for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
> +			ptr[i] = i;
> +
> +		/* Migrate memory to device. */
> +		ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer,
> +				      npages);
> +		ASSERT_EQ(ret, 0);
> +		ASSERT_EQ(buffer->cpages, npages);
> +
> +		/* Check what the device read. */
> +		for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
> +			ASSERT_EQ(ptr[i], i);
> +
> +		/* Fault pages back to system memory and check them. */
> +		for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
> +			ASSERT_EQ(ptr[i], i);
> +
> +		hmm_buffer_free(buffer);
> +	}
> +}
> +
> +/*
> + * Read anonymous memory multiple times.
> + */
> +TEST_F(hmm, anon_read_multiple)
> +{
> +	struct hmm_buffer *buffer;
> +	unsigned long npages;
> +	unsigned long size;
> +	unsigned long i;
> +	unsigned long c;
> +	int *ptr;
> +	int ret;
> +
> +	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
> +	ASSERT_NE(npages, 0);
> +	size = npages << self->page_shift;
> +
> +	for (c = 0; c < NTIMES; c++) {
> +		buffer = malloc(sizeof(*buffer));
> +		ASSERT_NE(buffer, NULL);
> +
> +		buffer->fd = -1;
> +		buffer->size = size;
> +		buffer->mirror = malloc(size);
> +		ASSERT_NE(buffer->mirror, NULL);
> +
> +		buffer->ptr = mmap(NULL, size,
> +				   PROT_READ | PROT_WRITE,
> +				   MAP_PRIVATE | MAP_ANONYMOUS,
> +				   buffer->fd, 0);
> +		ASSERT_NE(buffer->ptr, MAP_FAILED);
> +
> +		/* Initialize buffer in system memory. */
> +		for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
> +			ptr[i] = i + c;
> +
> +		/* Simulate a device reading system memory. */
> +		ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_READ, buffer,
> +				      npages);
> +		ASSERT_EQ(ret, 0);
> +		ASSERT_EQ(buffer->cpages, npages);
> +		ASSERT_EQ(buffer->faults, 1);
> +
> +		/* Check what the device read. */
> +		for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
> +			ASSERT_EQ(ptr[i], i + c);
> +
> +		hmm_buffer_free(buffer);
> +	}
> +}
> +
> +void *unmap_buffer(void *p)
> +{
> +	struct hmm_buffer *buffer = p;
> +
> +	/* Delay for a bit and then unmap buffer while it is being read. */
> +	hmm_nanosleep(hmm_random() % 32000);
> +	munmap(buffer->ptr + buffer->size / 2, buffer->size / 2);
> +	buffer->ptr = NULL;
> +
> +	return NULL;
> +}
> +
> +/*
> + * Try reading anonymous memory while it is being unmapped.
> + */
> +TEST_F(hmm, anon_teardown)
> +{
> +	unsigned long npages;
> +	unsigned long size;
> +	unsigned long c;
> +	void *ret;
> +
> +	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
> +	ASSERT_NE(npages, 0);
> +	size = npages << self->page_shift;
> +
> +	for (c = 0; c < NTIMES; ++c) {
> +		pthread_t thread;
> +		struct hmm_buffer *buffer;
> +		unsigned long i;
> +		int *ptr;
> +		int rc;
> +
> +		buffer = malloc(sizeof(*buffer));
> +		ASSERT_NE(buffer, NULL);
> +
> +		buffer->fd = -1;
> +		buffer->size = size;
> +		buffer->mirror = malloc(size);
> +		ASSERT_NE(buffer->mirror, NULL);
> +
> +		buffer->ptr = mmap(NULL, size,
> +				   PROT_READ | PROT_WRITE,
> +				   MAP_PRIVATE | MAP_ANONYMOUS,
> +				   buffer->fd, 0);
> +		ASSERT_NE(buffer->ptr, MAP_FAILED);
> +
> +		/* Initialize buffer in system memory. */
> +		for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
> +			ptr[i] = i + c;
> +
> +		rc = pthread_create(&thread, NULL, unmap_buffer, buffer);
> +		ASSERT_EQ(rc, 0);
> +
> +		/* Simulate a device reading system memory. */
> +		rc = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_READ, buffer,
> +				     npages);
> +		if (rc == 0) {
> +			ASSERT_EQ(buffer->cpages, npages);
> +			ASSERT_EQ(buffer->faults, 1);
> +
> +			/* Check what the device read. */
> +			for (i = 0, ptr = buffer->mirror;
> +			     i < size / sizeof(*ptr);
> +			     ++i)
> +				ASSERT_EQ(ptr[i], i + c);
> +		}
> +
> +		pthread_join(thread, &ret);
> +		hmm_buffer_free(buffer);
> +	}
> +}
> +
> +/*
> + * Test memory snapshot without faulting in pages accessed by the device.
> + */
> +TEST_F(hmm2, snapshot)
> +{
> +	struct hmm_buffer *buffer;
> +	unsigned long npages;
> +	unsigned long size;
> +	int *ptr;
> +	unsigned char *p;
> +	unsigned char *m;
> +	int ret;
> +	int val;
> +
> +	npages = 7;
> +	size = npages << self->page_shift;
> +
> +	buffer = malloc(sizeof(*buffer));
> +	ASSERT_NE(buffer, NULL);
> +
> +	buffer->fd = -1;
> +	buffer->size = size;
> +	buffer->mirror = malloc(npages);
> +	ASSERT_NE(buffer->mirror, NULL);
> +
> +	/* Reserve a range of addresses. */
> +	buffer->ptr = mmap(NULL, size,
> +			   PROT_NONE,
> +			   MAP_PRIVATE | MAP_ANONYMOUS,
> +			   buffer->fd, 0);
> +	ASSERT_NE(buffer->ptr, MAP_FAILED);
> +	p = buffer->ptr;
> +
> +	/* Punch a hole after the first page address. */
> +	ret = munmap(buffer->ptr + self->page_size, self->page_size);
> +	ASSERT_EQ(ret, 0);
> +
> +	/* Page 2 will be a read-only zero page. */
> +	ret = mprotect(buffer->ptr + 2 * self->page_size, self->page_size,
> +				PROT_READ);
> +	ASSERT_EQ(ret, 0);
> +	ptr = (int *)(buffer->ptr + 2 * self->page_size);
> +	val = *ptr + 3;
> +	ASSERT_EQ(val, 3);
> +
> +	/* Page 3 will be read-only. */
> +	ret = mprotect(buffer->ptr + 3 * self->page_size, self->page_size,
> +				PROT_READ | PROT_WRITE);
> +	ASSERT_EQ(ret, 0);
> +	ptr = (int *)(buffer->ptr + 3 * self->page_size);
> +	*ptr = val;
> +	ret = mprotect(buffer->ptr + 3 * self->page_size, self->page_size,
> +				PROT_READ);
> +	ASSERT_EQ(ret, 0);
> +
> +	/* Page 4-6 will be read-write. */
> +	ret = mprotect(buffer->ptr + 4 * self->page_size, 3 * self->page_size,
> +				PROT_READ | PROT_WRITE);
> +	ASSERT_EQ(ret, 0);
> +	ptr = (int *)(buffer->ptr + 4 * self->page_size);
> +	*ptr = val;
> +
> +	/* Page 5 will be migrated to device 0. */
> +	buffer->ptr = p + 5 * self->page_size;
> +	ret = hmm_dmirror_cmd(self->fd0, HMM_DMIRROR_MIGRATE, buffer, 1);
> +	ASSERT_EQ(ret, 0);
> +	ASSERT_EQ(buffer->cpages, 1);
> +
> +	/* Page 6 will be migrated to device 1. */
> +	buffer->ptr = p + 6 * self->page_size;
> +	ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, 1);
> +	ASSERT_EQ(ret, 0);
> +	ASSERT_EQ(buffer->cpages, 1);
> +
> +	/* Simulate a device snapshotting CPU pagetables. */
> +	buffer->ptr = p;
> +	ret = hmm_dmirror_cmd(self->fd0, HMM_DMIRROR_SNAPSHOT, buffer, npages);
> +	ASSERT_EQ(ret, 0);
> +	ASSERT_EQ(buffer->cpages, npages);
> +
> +	/* Check what the device saw. */
> +	m = buffer->mirror;
> +	ASSERT_EQ(m[0], HMM_DMIRROR_PROT_NONE);
> +	ASSERT_EQ(m[1], HMM_DMIRROR_PROT_NONE);
> +	ASSERT_EQ(m[2], HMM_DMIRROR_PROT_ZERO | HMM_DMIRROR_PROT_READ);
> +	ASSERT_EQ(m[3], HMM_DMIRROR_PROT_READ);
> +	ASSERT_EQ(m[4], HMM_DMIRROR_PROT_WRITE);
> +	ASSERT_EQ(m[5], HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL |
> +			HMM_DMIRROR_PROT_WRITE);
> +	ASSERT_EQ(m[6], HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE |
> +			HMM_DMIRROR_PROT_WRITE);
> +
> +	hmm_buffer_free(buffer);
> +}
> +
> +TEST_HARNESS_MAIN
> diff --git a/tools/testing/selftests/vm/run_vmtests b/tools/testing/selftests/vm/run_vmtests
> index 951c507a27f7..634cfefdaffd 100755
> --- a/tools/testing/selftests/vm/run_vmtests
> +++ b/tools/testing/selftests/vm/run_vmtests
> @@ -227,4 +227,20 @@ else
>  	exitcode=1
>  fi
>  
> +echo "------------------------------------"
> +echo "running HMM smoke test"
> +echo "------------------------------------"
> +./test_hmm.sh smoke
> +ret_val=$?
> +
> +if [ $ret_val -eq 0 ]; then
> +	echo "[PASS]"
> +elif [ $ret_val -eq $ksft_skip ]; then
> +	echo "[SKIP]"
> +	exitcode=$ksft_skip
> +else
> +	echo "[FAIL]"
> +	exitcode=1
> +fi
> +
>  exit $exitcode
> diff --git a/tools/testing/selftests/vm/test_hmm.sh b/tools/testing/selftests/vm/test_hmm.sh
> new file mode 100755
> index 000000000000..268d32752045
> --- /dev/null
> +++ b/tools/testing/selftests/vm/test_hmm.sh
> @@ -0,0 +1,97 @@
> +#!/bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +#
> +# Copyright (C) 2018 Uladzislau Rezki (Sony) <urezki@gmail.com>
> +#
> +# This is a test script for the kernel HMM test driver (hmm_dmirror).
> +# Therefore it is just a kernel module loader. It is used in order to:
> +#     a) load the hmm_dmirror module and create the test device nodes;
> +#     b) run the hmm-tests selftest program against those nodes.
> +# The module is unloaded again once the tests have finished.
> +
> +TEST_NAME="test_hmm"
> +DRIVER="hmm_dmirror"
> +
> +# 1 if fails
> +exitcode=1
> +
> +# Kselftest framework requirement - SKIP code is 4.
> +ksft_skip=4
> +
> +check_test_requirements()
> +{
> +	uid=$(id -u)
> +	if [ $uid -ne 0 ]; then
> +		echo "$0: Must be run as root"
> +		exit $ksft_skip
> +	fi
> +
> +	if ! which modprobe > /dev/null 2>&1; then
> +		echo "$0: You need modprobe installed"
> +		exit $ksft_skip
> +	fi
> +
> +	if ! modinfo $DRIVER > /dev/null 2>&1; then
> +		echo "$0: You must have the following enabled in your kernel:"
> +		echo "CONFIG_HMM_DMIRROR=m"
> +		exit $ksft_skip
> +	fi
> +}
> +
> +load_driver()
> +{
> +	modprobe $DRIVER > /dev/null 2>&1
> +	if [ $? == 0 ]; then
> +		major=$(awk "\$2==\"HMM_DMIRROR\" {print \$1}" /proc/devices)
> +		mknod /dev/hmm_dmirror0 c $major 0
> +		mknod /dev/hmm_dmirror1 c $major 1
> +	fi
> +}
> +
> +unload_driver()
> +{
> +	modprobe -r $DRIVER > /dev/null 2>&1
> +	rm -f /dev/hmm_dmirror?
> +}
> +
> +run_smoke()
> +{
> +	echo "Running smoke test. Note, this test provides basic coverage."
> +
> +	load_driver
> +	./hmm-tests
> +	unload_driver
> +}
> +
> +usage()
> +{
> +	echo -n "Usage: $0"
> +	echo
> +	echo "Example usage:"
> +	echo
> +	echo "# Shows help message"
> +	echo "./${TEST_NAME}.sh"
> +	echo
> +	echo "# Smoke testing"
> +	echo "./${TEST_NAME}.sh smoke"
> +	echo
> +	exit 0
> +}
> +
> +function run_test()
> +{
> +	if [ $# -eq 0 ]; then
> +		usage
> +	else
> +		if [ "$1" = "smoke" ]; then
> +			run_smoke
> +		else
> +			usage
> +		fi
> +	fi
> +}
> +
> +check_test_requirements
> +run_test "$@"
> +
> +exit 0
> -- 
> 2.20.1
> 


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 3/3] mm/hmm/test: add self tests for HMM
  2019-10-23 20:28   ` Jerome Glisse
@ 2019-10-23 21:55     ` Ralph Campbell
  0 siblings, 0 replies; 19+ messages in thread
From: Ralph Campbell @ 2019-10-23 21:55 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: John Hubbard, Christoph Hellwig, Jason Gunthorpe, linux-rdma,
	linux-mm, linux-kernel


On 10/23/19 1:28 PM, Jerome Glisse wrote:
> On Wed, Oct 23, 2019 at 12:55:15PM -0700, Ralph Campbell wrote:
>> Add self tests for HMM.
>>
>> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
> 
> You can add my signoff
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>

Thanks!

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 2/3] mm/hmm: allow snapshot of the special zero page
  2019-10-23 19:55 ` [PATCH v3 2/3] mm/hmm: allow snapshot of the special zero page Ralph Campbell
  2019-10-23 20:27   ` Jerome Glisse
@ 2019-10-24  9:27   ` David Hildenbrand
  2019-10-29 17:27   ` Jason Gunthorpe
  2 siblings, 0 replies; 19+ messages in thread
From: David Hildenbrand @ 2019-10-24  9:27 UTC (permalink / raw)
  To: Ralph Campbell, Jerome Glisse, John Hubbard, Christoph Hellwig,
	Jason Gunthorpe
  Cc: linux-rdma, linux-mm, linux-kernel

On 23.10.19 21:55, Ralph Campbell wrote:
> If a device driver like nouveau tries to use hmm_range_fault() to access
> the special shared zero page in system memory, hmm_range_fault() will
> return -EFAULT and kill the process.
> Allow hmm_range_fault() to return success (0) when the CPU pagetable
> entry points to the special shared zero page.
> page_to_pfn() and pfn_to_page() are defined on the zero page so just
> handle it like any other page.
> 
> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Jason Gunthorpe <jgg@mellanox.com>
> ---
>   mm/hmm.c | 10 ++++++++--
>   1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index acf7a664b38c..8c96c9ddcae5 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -529,8 +529,14 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
>   		if (unlikely(!hmm_vma_walk->pgmap))
>   			return -EBUSY;
>   	} else if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pte_special(pte)) {
> -		*pfn = range->values[HMM_PFN_SPECIAL];
> -		return -EFAULT;
> +		if (!is_zero_pfn(pte_pfn(pte))) {
> +			*pfn = range->values[HMM_PFN_SPECIAL];
> +			return -EFAULT;
> +		}
> +		/*
> +		 * Since each architecture defines a struct page for the zero
> +		 * page, just fall through and treat it like a normal page.
> +		 */
>   	}
>   
>   	*pfn = hmm_device_entry_from_pfn(range, pte_pfn(pte)) | cpu_flags;
> 

Acked-by: David Hildenbrand <david@redhat.com>

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 2/3] mm/hmm: allow snapshot of the special zero page
  2019-10-23 19:55 ` [PATCH v3 2/3] mm/hmm: allow snapshot of the special zero page Ralph Campbell
  2019-10-23 20:27   ` Jerome Glisse
  2019-10-24  9:27   ` David Hildenbrand
@ 2019-10-29 17:27   ` Jason Gunthorpe
  2 siblings, 0 replies; 19+ messages in thread
From: Jason Gunthorpe @ 2019-10-29 17:27 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: Jerome Glisse, John Hubbard, Christoph Hellwig, linux-rdma,
	linux-mm, linux-kernel

On Wed, Oct 23, 2019 at 12:55:14PM -0700, Ralph Campbell wrote:
> If a device driver like nouveau tries to use hmm_range_fault() to access
> the special shared zero page in system memory, hmm_range_fault() will
> return -EFAULT and kill the process.
> Allow hmm_range_fault() to return success (0) when the CPU pagetable
> entry points to the special shared zero page.
> page_to_pfn() and pfn_to_page() are defined on the zero page so just
> handle it like any other page.
> 
> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Jason Gunthorpe <jgg@mellanox.com>
> ---
>  mm/hmm.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)

Applied to hmm.git

Thanks,
Jason

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 1/3] mm/hmm: make full use of walk_page_range()
  2019-10-23 19:55 ` [PATCH v3 1/3] mm/hmm: make full use of walk_page_range() Ralph Campbell
@ 2019-10-29 17:40   ` Jason Gunthorpe
  0 siblings, 0 replies; 19+ messages in thread
From: Jason Gunthorpe @ 2019-10-29 17:40 UTC (permalink / raw)
  To: Ralph Campbell, Jerome Glisse, Christoph Hellwig
  Cc: John Hubbard, linux-rdma, linux-mm, linux-kernel

On Wed, Oct 23, 2019 at 12:55:13PM -0700, Ralph Campbell wrote:
> hmm_range_fault() calls find_vma() and walk_page_range() in a loop.
> This is unnecessary duplication since walk_page_range() calls find_vma()
> in a loop already.
> Simplify hmm_range_fault() by defining a walk_test() callback function
> to filter unhandled vmas.
> This also fixes a bug where hmm_range_fault() was not checking
> start >= vma->vm_start before checking vma->vm_flags so hmm_range_fault()
> could return an error based on the wrong vma for the requested range.
> It also fixes a bug when the vma has no read access and the caller did
> not request a fault, there shouldn't be any error return code.
> 
> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Jason Gunthorpe <jgg@mellanox.com>
> Cc: Christoph Hellwig <hch@lst.de>
>  mm/hmm.c | 126 +++++++++++++++++++++++++++----------------------------
>  1 file changed, 63 insertions(+), 63 deletions(-)

This is looking OK, can we get an ack from Jerome? Christoph?

I recall my first worry was that walk->vma could now be null, as
ops->pte_hole is set. But it looks like that is all handled now?

Thanks,
Jason

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 3/3] mm/hmm/test: add self tests for HMM
  2019-10-23 19:55 ` [PATCH v3 3/3] mm/hmm/test: add self tests for HMM Ralph Campbell
  2019-10-23 20:28   ` Jerome Glisse
@ 2019-10-29 17:58   ` Jason Gunthorpe
  2019-10-29 21:16     ` Ralph Campbell
  2019-10-30 18:34     ` Qian Cai
  1 sibling, 2 replies; 19+ messages in thread
From: Jason Gunthorpe @ 2019-10-29 17:58 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: Jerome Glisse, John Hubbard, Christoph Hellwig, linux-rdma,
	linux-mm, linux-kernel

On Wed, Oct 23, 2019 at 12:55:15PM -0700, Ralph Campbell wrote:
> Add self tests for HMM.
>
> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>

> ---
>  MAINTAINERS                            |    3 +
>  drivers/char/Kconfig                   |   11 +
>  drivers/char/Makefile                  |    1 +
>  drivers/char/hmm_dmirror.c             | 1566 ++++++++++++++++++++++++
>  include/Kbuild                         |    1 +
>  include/uapi/linux/hmm_dmirror.h       |   74 ++
>  tools/testing/selftests/vm/.gitignore  |    1 +
>  tools/testing/selftests/vm/Makefile    |    3 +
>  tools/testing/selftests/vm/config      |    3 +
>  tools/testing/selftests/vm/hmm-tests.c | 1311 ++++++++++++++++++++
>  tools/testing/selftests/vm/run_vmtests |   16 +
>  tools/testing/selftests/vm/test_hmm.sh |   97 ++
>  12 files changed, 3087 insertions(+)
>  create mode 100644 drivers/char/hmm_dmirror.c
>  create mode 100644 include/uapi/linux/hmm_dmirror.h
>  create mode 100644 tools/testing/selftests/vm/hmm-tests.c
>  create mode 100755 tools/testing/selftests/vm/test_hmm.sh

This is really big, it would be nice to get a comment from the various
kernel testing folks if this approach makes sense with the test
frameworks. Do we have other drivers that are only intended to be used
by selftests?

Frankly, I'm not super excited about the idea of a 'test driver', it
seems more logical for testing to have some way for a test harness to
call hmm_range_fault() under various conditions and check the results?

It seems especially over-complicated to use a full page table layout
for this, wouldn't something simple like an xarray be good enough for
test purposes?
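
Something along these lines is what I have in mind - completely
untested, and the dmirror_pt_* / DMIRROR_PT_* names below are made up
for illustration, not taken from the patch:

#include <linux/mm.h>
#include <linux/xarray.h>

/* permission bits packed into the low bits of the xarray value entry */
#define DMIRROR_PT_VALID	(1UL << 0)
#define DMIRROR_PT_WRITE	(1UL << 1)

static int dmirror_pt_store(struct xarray *pt, unsigned long addr,
			    unsigned long pfn, unsigned long flags)
{
	/* index is the virtual page number, entry packs pfn + flags */
	void *entry = xa_mk_value((pfn << 2) | flags);

	return xa_err(xa_store(pt, addr >> PAGE_SHIFT, entry, GFP_KERNEL));
}

static unsigned long dmirror_pt_lookup(struct xarray *pt, unsigned long addr)
{
	void *entry = xa_load(pt, addr >> PAGE_SHIFT);

	return entry ? xa_to_value(entry) : 0;
}

That avoids having to get the pgd/pmd/pte levels and their locking
right just for a test vehicle.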

> +/*
> + * Below are the file operation for the dmirror device file. Only ioctl matters.
> + *
> + * Note this is highly specific to the dmirror device driver and should not be
> + * construed as an example on how to design the API a real device driver would
> + * expose to userspace.
> + */
> +static ssize_t dmirror_fops_read(struct file *filp,
> +			       char __user *buf,
> +			       size_t count,
> +			       loff_t *ppos)
> +{
> +	return -EINVAL;
> +}
> +
> +static ssize_t dmirror_fops_write(struct file *filp,
> +				const char __user *buf,
> +				size_t count,
> +				loff_t *ppos)
> +{
> +	return -EINVAL;
> +}
> +
> +static int dmirror_fops_mmap(struct file *filp, struct vm_area_struct *vma)
> +{
> +	/* Forbid mmap of the dmirror device file. */
> +	return -EINVAL;
> +}

I'm pretty sure these can just be left as NULL in the fops?

> +static int dmirror_fault(struct dmirror *dmirror,
> +			 unsigned long start,
> +			 unsigned long end,
> +			 bool write)
> +{
> +	struct mm_struct *mm = dmirror->mirror.hmm->mmu_notifier.mm;
> +	unsigned long addr;
> +	unsigned long next;
> +	uint64_t pfns[64];
> +	struct hmm_range range = {
> +		.pfns = pfns,
> +		.flags = dmirror_hmm_flags,
> +		.values = dmirror_hmm_values,
> +		.pfn_shift = DPT_SHIFT,
> +		.pfn_flags_mask = ~(dmirror_hmm_flags[HMM_PFN_VALID] |
> +				    dmirror_hmm_flags[HMM_PFN_WRITE]),
> +		.default_flags = dmirror_hmm_flags[HMM_PFN_VALID] |
> +				(write ? dmirror_hmm_flags[HMM_PFN_WRITE] : 0),
> +	};
> +	int ret = 0;
> +
> +	for (addr = start; addr < end; ) {
> +		long count;
> +
> +		next = min(addr + (ARRAY_SIZE(pfns) << PAGE_SHIFT), end);
> +		range.start = addr;
> +		range.end = next;
> +
> +		down_read(&mm->mmap_sem);
> +
> +		ret = hmm_range_register(&range, &dmirror->mirror);
> +		if (ret) {
> +			up_read(&mm->mmap_sem);
> +			break;
> +		}
> +
> +		if (!hmm_range_wait_until_valid(&range,
> +						DMIRROR_RANGE_FAULT_TIMEOUT)) {
> +			hmm_range_unregister(&range);
> +			up_read(&mm->mmap_sem);
> +			continue;
> +		}
> +
> +		count = hmm_range_fault(&range, 0);
> +		if (count < 0) {
> +			ret = count;
> +			hmm_range_unregister(&range);
> +			up_read(&mm->mmap_sem);
> +			break;
> +		}
> +
> +		if (!hmm_range_valid(&range)) {

There is no 'driver lock' being held here, how does this work?
Shouldn't it hold dmirror->mutex for this sequence?

> +			hmm_range_unregister(&range);
> +			up_read(&mm->mmap_sem);
> +			continue;
> +		}
> +		mutex_lock(&dmirror->mutex);
> +		ret = dmirror_pt_walk(dmirror, dmirror_do_fault,
> +				      addr, next, &range, true);
> +		mutex_unlock(&dmirror->mutex);

Ie move it down into this block
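
Roughly like this, ie just rearranging the calls that are already in
the patch so the validity check and the mirror update happen under the
same mutex the invalidate callback takes (untested):

		mutex_lock(&dmirror->mutex);
		if (!hmm_range_valid(&range)) {
			mutex_unlock(&dmirror->mutex);
			hmm_range_unregister(&range);
			up_read(&mm->mmap_sem);
			continue;
		}
		ret = dmirror_pt_walk(dmirror, dmirror_do_fault,
				      addr, next, &range, true);
		mutex_unlock(&dmirror->mutex);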

> +		hmm_range_unregister(&range);
> +		up_read(&mm->mmap_sem);
> +		if (ret)
> +			break;
> +
> +		addr = next;
> +	}
> +
> +	return ret;
> +}

> +static int dmirror_read(struct dmirror *dmirror,
> +			struct hmm_dmirror_cmd *cmd)
> +{

Why not just use pread()/pwrite() for this instead of an ioctl?

> +	struct dmirror_bounce bounce;
> +	unsigned long start, end;
> +	unsigned long size = cmd->npages << PAGE_SHIFT;
> +	int ret;
> +
> +	start = cmd->addr;
> +	end = start + size;
> +	if (end < start)
> +		return -EINVAL;
> +
> +	ret = dmirror_bounce_init(&bounce, start, size);
> +	if (ret)
> +		return ret;
> +
> +static int dmirror_snapshot(struct dmirror *dmirror,
> +			    struct hmm_dmirror_cmd *cmd)
> +{
> +	struct mm_struct *mm = dmirror->mirror.hmm->mmu_notifier.mm;
> +	unsigned long start, end;
> +	unsigned long size = cmd->npages << PAGE_SHIFT;
> +	unsigned long addr;
> +	unsigned long next;
> +	uint64_t pfns[64];
> +	unsigned char perm[64];
> +	char __user *uptr;
> +	struct hmm_range range = {
> +		.pfns = pfns,
> +		.flags = dmirror_hmm_flags,
> +		.values = dmirror_hmm_values,
> +		.pfn_shift = DPT_SHIFT,
> +		.pfn_flags_mask = ~0ULL,
> +	};
> +	int ret = 0;
> +
> +	start = cmd->addr;
> +	end = start + size;
> +	uptr = (void __user *)cmd->ptr;
> +
> +	for (addr = start; addr < end; ) {
> +		long count;
> +		unsigned long i;
> +		unsigned long n;
> +
> +		next = min(addr + (ARRAY_SIZE(pfns) << PAGE_SHIFT), end);
> +		range.start = addr;
> +		range.end = next;
> +
> +		down_read(&mm->mmap_sem);
> +
> +		ret = hmm_range_register(&range, &dmirror->mirror);
> +		if (ret) {
> +			up_read(&mm->mmap_sem);
> +			break;
> +		}
> +
> +		if (!hmm_range_wait_until_valid(&range,
> +						DMIRROR_RANGE_FAULT_TIMEOUT)) {
> +			hmm_range_unregister(&range);
> +			up_read(&mm->mmap_sem);
> +			continue;
> +		}
> +
> +		count = hmm_range_fault(&range, HMM_FAULT_SNAPSHOT);
> +		if (count < 0) {
> +			ret = count;
> +			hmm_range_unregister(&range);
> +			up_read(&mm->mmap_sem);
> +			if (ret == -EBUSY)
> +				continue;
> +			break;
> +		}
> +
> +		if (!hmm_range_valid(&range)) {

Same as for dmirror_fault

> +			hmm_range_unregister(&range);
> +			up_read(&mm->mmap_sem);
> +			continue;
> +		}
> +
> +		n = (next - addr) >> PAGE_SHIFT;
> +		for (i = 0; i < n; i++)
> +			dmirror_mkentry(dmirror, &range, perm + i, pfns[i]);

Is this missing locking too?

> +static int dmirror_remove(struct platform_device *pdev)
> +{
> +	/* all probe actions are unwound by devm */
> +	return 0;
> +}
> +
> +static struct platform_driver dmirror_device_driver = {
> +	.probe		= dmirror_probe,
> +	.remove		= dmirror_remove,
> +	.driver		= {
> +		.name	= "HMM_DMIRROR",
> +	},
> +};

This presence of a platform_driver and device is very confusing. I'm
sure Greg KH would object to this as a misuse of platform drivers.

A platform device isn't needed to create a char dev, so what is this for?
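
For a test-only char dev something like the below is usually all that
is needed - an untested sketch, and the open/ioctl handlers here are
empty placeholders rather than the ones in this patch:

#include <linux/fs.h>
#include <linux/miscdevice.h>
#include <linux/module.h>

static int dmirror_fops_open(struct inode *inode, struct file *filp)
{
	return 0;
}

static long dmirror_fops_unlocked_ioctl(struct file *filp,
					unsigned int command,
					unsigned long arg)
{
	return -EINVAL;
}

static const struct file_operations dmirror_fops = {
	.open		= dmirror_fops_open,
	.unlocked_ioctl	= dmirror_fops_unlocked_ioctl,
	.owner		= THIS_MODULE,
};

static struct miscdevice dmirror_miscdev = {
	.minor	= MISC_DYNAMIC_MINOR,
	.name	= "hmm_dmirror",
	.fops	= &dmirror_fops,
};

static int __init hmm_dmirror_init(void)
{
	/* Creates /dev/hmm_dmirror, no platform device involved */
	return misc_register(&dmirror_miscdev);
}

static void __exit hmm_dmirror_exit(void)
{
	misc_deregister(&dmirror_miscdev);
}

module_init(hmm_dmirror_init);
module_exit(hmm_dmirror_exit);
MODULE_LICENSE("GPL");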

> diff --git a/include/Kbuild b/include/Kbuild
> index ffba79483cc5..6ffb44a45957 100644
> --- a/include/Kbuild
> +++ b/include/Kbuild
> @@ -1063,6 +1063,7 @@ header-test-			+= uapi/linux/coda_psdev.h
>  header-test-			+= uapi/linux/errqueue.h
>  header-test-			+= uapi/linux/eventpoll.h
>  header-test-			+= uapi/linux/hdlc/ioctl.h
> +header-test-			+= uapi/linux/hmm_dmirror.h

Why? This list should only be updated if the header is broken in some
way.


> diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c
> new file mode 100644
> index 000000000000..f4ae6188fd0e
> --- /dev/null
> +++ b/tools/testing/selftests/vm/hmm-tests.c
> @@ -0,0 +1,1311 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright 2013 Red Hat Inc.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License as
> + * published by the Free Software Foundation; either version 2 of
> + * the License, or (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.

btw, I think if a SPDX is present I don't think the license text is
required, just the copyright.

I think these tests should also cover the various cases that invoke
pte_hole, ie faulting/snapshotting before/after a vma, or across a
vma range with a hole, etc.

Jason

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 3/3] mm/hmm/test: add self tests for HMM
  2019-10-29 17:58   ` Jason Gunthorpe
@ 2019-10-29 21:16     ` Ralph Campbell
  2019-10-29 23:12       ` Jason Gunthorpe
  2019-10-30 18:34     ` Qian Cai
  1 sibling, 1 reply; 19+ messages in thread
From: Ralph Campbell @ 2019-10-29 21:16 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jerome Glisse, John Hubbard, Christoph Hellwig, linux-rdma,
	linux-mm, linux-kernel


On 10/29/19 10:58 AM, Jason Gunthorpe wrote:
> On Wed, Oct 23, 2019 at 12:55:15PM -0700, Ralph Campbell wrote:
>> Add self tests for HMM.
>>
>> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
> 
>> ---
>>   MAINTAINERS                            |    3 +
>>   drivers/char/Kconfig                   |   11 +
>>   drivers/char/Makefile                  |    1 +
>>   drivers/char/hmm_dmirror.c             | 1566 ++++++++++++++++++++++++
>>   include/Kbuild                         |    1 +
>>   include/uapi/linux/hmm_dmirror.h       |   74 ++
>>   tools/testing/selftests/vm/.gitignore  |    1 +
>>   tools/testing/selftests/vm/Makefile    |    3 +
>>   tools/testing/selftests/vm/config      |    3 +
>>   tools/testing/selftests/vm/hmm-tests.c | 1311 ++++++++++++++++++++
>>   tools/testing/selftests/vm/run_vmtests |   16 +
>>   tools/testing/selftests/vm/test_hmm.sh |   97 ++
>>   12 files changed, 3087 insertions(+)
>>   create mode 100644 drivers/char/hmm_dmirror.c
>>   create mode 100644 include/uapi/linux/hmm_dmirror.h
>>   create mode 100644 tools/testing/selftests/vm/hmm-tests.c
>>   create mode 100755 tools/testing/selftests/vm/test_hmm.sh
> 
> This is really big, it would be nice to get a comment from the various
> kernel testing folks if this approach makes sense with the test
> frameworks. Do we have other drivers that are only intended to be used
> by selftests?
> 
> Frankly, I'm not super excited about the idea of a 'test driver', it
> seems more logical for testing to have some way for a test harness to
> call hmm_range_fault() under various conditions and check the results?

test_vmalloc.sh at least uses a test module(s).

> It seems especially over-complicated to use a full page table layout
> for this, wouldn't something simple like an xarray be good enough for
> test purposes?

Possibly. A page table is really just a lookup table from virtual address
to pfn/page. Part of the rationale was to mimic what a real device might do.


>> +/*
>> + * Below are the file operation for the dmirror device file. Only ioctl matters.
>> + *
>> + * Note this is highly specific to the dmirror device driver and should not be
>> + * construed as an example on how to design the API a real device driver would
>> + * expose to userspace.
>> + */
>> +static ssize_t dmirror_fops_read(struct file *filp,
>> +			       char __user *buf,
>> +			       size_t count,
>> +			       loff_t *ppos)
>> +{
>> +	return -EINVAL;
>> +}
>> +
>> +static ssize_t dmirror_fops_write(struct file *filp,
>> +				const char __user *buf,
>> +				size_t count,
>> +				loff_t *ppos)
>> +{
>> +	return -EINVAL;
>> +}
>> +
>> +static int dmirror_fops_mmap(struct file *filp, struct vm_area_struct *vma)
>> +{
>> +	/* Forbid mmap of the dmirror device file. */
>> +	return -EINVAL;
>> +}
> 
> I'm pretty sure these can just be left as NULL in the fops?

I think so.

>> +static int dmirror_fault(struct dmirror *dmirror,
>> +			 unsigned long start,
>> +			 unsigned long end,
>> +			 bool write)
>> +{
>> +	struct mm_struct *mm = dmirror->mirror.hmm->mmu_notifier.mm;
>> +	unsigned long addr;
>> +	unsigned long next;
>> +	uint64_t pfns[64];
>> +	struct hmm_range range = {
>> +		.pfns = pfns,
>> +		.flags = dmirror_hmm_flags,
>> +		.values = dmirror_hmm_values,
>> +		.pfn_shift = DPT_SHIFT,
>> +		.pfn_flags_mask = ~(dmirror_hmm_flags[HMM_PFN_VALID] |
>> +				    dmirror_hmm_flags[HMM_PFN_WRITE]),
>> +		.default_flags = dmirror_hmm_flags[HMM_PFN_VALID] |
>> +				(write ? dmirror_hmm_flags[HMM_PFN_WRITE] : 0),
>> +	};
>> +	int ret = 0;
>> +
>> +	for (addr = start; addr < end; ) {
>> +		long count;
>> +
>> +		next = min(addr + (ARRAY_SIZE(pfns) << PAGE_SHIFT), end);
>> +		range.start = addr;
>> +		range.end = next;
>> +
>> +		down_read(&mm->mmap_sem);
>> +
>> +		ret = hmm_range_register(&range, &dmirror->mirror);
>> +		if (ret) {
>> +			up_read(&mm->mmap_sem);
>> +			break;
>> +		}
>> +
>> +		if (!hmm_range_wait_until_valid(&range,
>> +						DMIRROR_RANGE_FAULT_TIMEOUT)) {
>> +			hmm_range_unregister(&range);
>> +			up_read(&mm->mmap_sem);
>> +			continue;
>> +		}
>> +
>> +		count = hmm_range_fault(&range, 0);
>> +		if (count < 0) {
>> +			ret = count;
>> +			hmm_range_unregister(&range);
>> +			up_read(&mm->mmap_sem);
>> +			break;
>> +		}
>> +
>> +		if (!hmm_range_valid(&range)) {
> 
> There is no 'driver lock' being held here, how does this work?
> Shouldn't it hold dmirror->mutex for this sequence?

I have a modified version of this driver that's based on your series
removing hmm_mirror_register() which uses a mutex.
Otherwise, it looks similar to the changes in nouveau.

>> +			hmm_range_unregister(&range);
>> +			up_read(&mm->mmap_sem);
>> +			continue;
>> +		}
>> +		mutex_lock(&dmirror->mutex);
>> +		ret = dmirror_pt_walk(dmirror, dmirror_do_fault,
>> +				      addr, next, &range, true);
>> +		mutex_unlock(&dmirror->mutex);
> 
> Ie move it down into this block
> 
>> +		hmm_range_unregister(&range);
>> +		up_read(&mm->mmap_sem);
>> +		if (ret)
>> +			break;
>> +
>> +		addr = next;
>> +	}
>> +
>> +	return ret;
>> +}
> 
>> +static int dmirror_read(struct dmirror *dmirror,
>> +			struct hmm_dmirror_cmd *cmd)
>> +{
> 
> Why not just use pread()/pwrite() for this instead of an ioctl?

pread()/pwrite() could certainly be implemented.
I think the idea was that the read/write is actually the "device"
doing read/write and making that clearly different from a program
reading/writing the device. Also, the ioctl() allows information
about what faults or events happened during the operation. I only
have number of pages and number of page faults returned at the moment,
but one of Jerome's versions of this driver had other counters being
returned.
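
For reference, the command structure in the uapi header boils down to
something like this (paraphrased here, and the comments are mine rather
than copied from the header):

struct hmm_dmirror_cmd {
	__u64	addr;		/* start of the range to operate on */
	__u64	ptr;		/* user buffer the "device" reads/writes */
	__u64	npages;		/* length of the range in pages */
	__u64	cpages;		/* out: pages successfully handled */
	__u64	faults;		/* out: device faults taken */
};

so the counters come back with the same call that triggered the
simulated device access.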

>> +	struct dmirror_bounce bounce;
>> +	unsigned long start, end;
>> +	unsigned long size = cmd->npages << PAGE_SHIFT;
>> +	int ret;
>> +
>> +	start = cmd->addr;
>> +	end = start + size;
>> +	if (end < start)
>> +		return -EINVAL;
>> +
>> +	ret = dmirror_bounce_init(&bounce, start, size);
>> +	if (ret)
>> +		return ret;
>> +
>> +static int dmirror_snapshot(struct dmirror *dmirror,
>> +			    struct hmm_dmirror_cmd *cmd)
>> +{
>> +	struct mm_struct *mm = dmirror->mirror.hmm->mmu_notifier.mm;
>> +	unsigned long start, end;
>> +	unsigned long size = cmd->npages << PAGE_SHIFT;
>> +	unsigned long addr;
>> +	unsigned long next;
>> +	uint64_t pfns[64];
>> +	unsigned char perm[64];
>> +	char __user *uptr;
>> +	struct hmm_range range = {
>> +		.pfns = pfns,
>> +		.flags = dmirror_hmm_flags,
>> +		.values = dmirror_hmm_values,
>> +		.pfn_shift = DPT_SHIFT,
>> +		.pfn_flags_mask = ~0ULL,
>> +	};
>> +	int ret = 0;
>> +
>> +	start = cmd->addr;
>> +	end = start + size;
>> +	uptr = (void __user *)cmd->ptr;
>> +
>> +	for (addr = start; addr < end; ) {
>> +		long count;
>> +		unsigned long i;
>> +		unsigned long n;
>> +
>> +		next = min(addr + (ARRAY_SIZE(pfns) << PAGE_SHIFT), end);
>> +		range.start = addr;
>> +		range.end = next;
>> +
>> +		down_read(&mm->mmap_sem);
>> +
>> +		ret = hmm_range_register(&range, &dmirror->mirror);
>> +		if (ret) {
>> +			up_read(&mm->mmap_sem);
>> +			break;
>> +		}
>> +
>> +		if (!hmm_range_wait_until_valid(&range,
>> +						DMIRROR_RANGE_FAULT_TIMEOUT)) {
>> +			hmm_range_unregister(&range);
>> +			up_read(&mm->mmap_sem);
>> +			continue;
>> +		}
>> +
>> +		count = hmm_range_fault(&range, HMM_FAULT_SNAPSHOT);
>> +		if (count < 0) {
>> +			ret = count;
>> +			hmm_range_unregister(&range);
>> +			up_read(&mm->mmap_sem);
>> +			if (ret == -EBUSY)
>> +				continue;
>> +			break;
>> +		}
>> +
>> +		if (!hmm_range_valid(&range)) {
> 
> Same as for dmirror_fault
> 
>> +			hmm_range_unregister(&range);
>> +			up_read(&mm->mmap_sem);
>> +			continue;
>> +		}
>> +
>> +		n = (next - addr) >> PAGE_SHIFT;
>> +		for (i = 0; i < n; i++)
>> +			dmirror_mkentry(dmirror, &range, perm + i, pfns[i]);
> 
> Is this missing locking too?

Yes. It's in the updated version as mentioned above.

>> +static int dmirror_remove(struct platform_device *pdev)
>> +{
>> +	/* all probe actions are unwound by devm */
>> +	return 0;
>> +}
>> +
>> +static struct platform_driver dmirror_device_driver = {
>> +	.probe		= dmirror_probe,
>> +	.remove		= dmirror_remove,
>> +	.driver		= {
>> +		.name	= "HMM_DMIRROR",
>> +	},
>> +};
> 
> This presence of a platform_driver and device is very confusing. I'm
> sure Greg KH would object to this as a misuse of platform drivers.
> 
> A platform device isn't needed to create a char dev, so what is this for?

The devm_request_free_mem_region() and devm_memremap_pages() calls for
creating the ZONE_DEVICE private pages tie into the devm* clean up framework.
I thought a platform_driver was the simplest way to also be able to call
devm_add_action_or_reset() to clean up on module unload and be compatible
with the private page clean up.
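
In other words the probe boils down to something like this (simplified
and untested; DEVMEM_CHUNK_SIZE, struct dmirror_device, dmirror_devmem_ops
and dmirror_device_remove are placeholders, not necessarily the driver's
actual names), with everything hanging off &pdev->dev for cleanup:

static int dmirror_probe(struct platform_device *pdev)
{
	struct dmirror_device *mdevice;
	struct resource *res;
	void *ptr;

	mdevice = devm_kzalloc(&pdev->dev, sizeof(*mdevice), GFP_KERNEL);
	if (!mdevice)
		return -ENOMEM;

	/* Carve out physical address space for the fake device memory. */
	res = devm_request_free_mem_region(&pdev->dev, &iomem_resource,
					   DEVMEM_CHUNK_SIZE);
	if (IS_ERR(res))
		return PTR_ERR(res);

	mdevice->pagemap.type = MEMORY_DEVICE_PRIVATE;
	mdevice->pagemap.res = *res;
	mdevice->pagemap.ops = &dmirror_devmem_ops;
	ptr = devm_memremap_pages(&pdev->dev, &mdevice->pagemap);
	if (IS_ERR(ptr))
		return PTR_ERR(ptr);

	/* Tear down the char devices etc. on unbind/module unload. */
	return devm_add_action_or_reset(&pdev->dev, dmirror_device_remove,
					mdevice);
}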

>> diff --git a/include/Kbuild b/include/Kbuild
>> index ffba79483cc5..6ffb44a45957 100644
>> --- a/include/Kbuild
>> +++ b/include/Kbuild
>> @@ -1063,6 +1063,7 @@ header-test-			+= uapi/linux/coda_psdev.h
>>   header-test-			+= uapi/linux/errqueue.h
>>   header-test-			+= uapi/linux/eventpoll.h
>>   header-test-			+= uapi/linux/hdlc/ioctl.h
>> +header-test-			+= uapi/linux/hmm_dmirror.h
> 
> Why? This list should only be updated if the header is broken in some
> way.

Should this be in include/linux/ instead?
I wasn't sure where the "right" place was to put the header.

> 
>> diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c
>> new file mode 100644
>> index 000000000000..f4ae6188fd0e
>> --- /dev/null
>> +++ b/tools/testing/selftests/vm/hmm-tests.c
>> @@ -0,0 +1,1311 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Copyright 2013 Red Hat Inc.
>> + *
>> + * This program is free software; you can redistribute it and/or
>> + * modify it under the terms of the GNU General Public License as
>> + * published by the Free Software Foundation; either version 2 of
>> + * the License, or (at your option) any later version.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>> + * GNU General Public License for more details.
> 
> btw, I think if a SPDX is present I don't think the license text is
> required, just the copyright.

Since I was starting from Jerome's HMM test driver, I didn't want to
delete any of the original copyright text.
If Jerome is OK with just the SPDX header, that's OK with me.

> I think these tests should also cover the various cases that invoke
> pte_hole, ie faulting/snapshotting before/after a vma, or across a
> vma range with a hole, etc.
> 
> Jason
> 

There are tests for vma hole, pte_none(), zero page, and normal page.
Nothing stress testing races, just set up the mmap() and test once.
I can add more test cases if you have something specific in mind.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 3/3] mm/hmm/test: add self tests for HMM
  2019-10-29 21:16     ` Ralph Campbell
@ 2019-10-29 23:12       ` Jason Gunthorpe
  2019-10-31  0:14         ` Ralph Campbell
  0 siblings, 1 reply; 19+ messages in thread
From: Jason Gunthorpe @ 2019-10-29 23:12 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: Jerome Glisse, John Hubbard, Christoph Hellwig, linux-rdma,
	linux-mm, linux-kernel

On Tue, Oct 29, 2019 at 02:16:05PM -0700, Ralph Campbell wrote:

> > Frankly, I'm not super excited about the idea of a 'test driver', it
> > seems more logical for testing to have some way for a test harness to
> > call hmm_range_fault() under various conditions and check the results?
> 
> test_vmalloc.sh at least uses a test module(s).

Well, that is good, is it also under drivers/char? It kind feels like
it should not be there...
 
> > It seems especially over-complicated to use a full page table layout
> > for this, wouldn't something simple like an xarray be good enough for
> > test purposes?
> 
> Possibly. A page table is really just a lookup table from virtual address
> to pfn/page. Part of the rationale was to mimic what a real device
> might do.

Well, but the details of the page table layout don't seem really
important to this testing, IMHO.

> > > +	for (addr = start; addr < end; ) {
> > > +		long count;
> > > +
> > > +		next = min(addr + (ARRAY_SIZE(pfns) << PAGE_SHIFT), end);
> > > +		range.start = addr;
> > > +		range.end = next;
> > > +
> > > +		down_read(&mm->mmap_sem);

Also, did we get a mmget() before doing this down_read?
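
ie the usual pattern would look something like this (sketch):

	if (!mmget_not_zero(mm))
		return -EINVAL;
	down_read(&mm->mmap_sem);
	/* ... hmm_range_register()/hmm_range_fault() ... */
	up_read(&mm->mmap_sem);
	mmput(mm);

since mirror.hmm->mmu_notifier.mm is not necessarily current->mm here.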

> > > +
> > > +		ret = hmm_range_register(&range, &dmirror->mirror);
> > > +		if (ret) {
> > > +			up_read(&mm->mmap_sem);
> > > +			break;
> > > +		}
> > > +
> > > +		if (!hmm_range_wait_until_valid(&range,
> > > +						DMIRROR_RANGE_FAULT_TIMEOUT)) {
> > > +			hmm_range_unregister(&range);
> > > +			up_read(&mm->mmap_sem);
> > > +			continue;
> > > +		}
> > > +
> > > +		count = hmm_range_fault(&range, 0);
> > > +		if (count < 0) {
> > > +			ret = count;
> > > +			hmm_range_unregister(&range);
> > > +			up_read(&mm->mmap_sem);
> > > +			break;
> > > +		}
> > > +
> > > +		if (!hmm_range_valid(&range)) {
> > 
> > There is no 'driver lock' being held here, how does this work?
> > Shouldn't it hold dmirror->mutex for this sequence?
> 
> I have a modified version of this driver that's based on your series
> removing hmm_mirror_register() which uses a mutex.
> Otherwise, it looks similar to the changes in nouveau.

Well, that locking pattern is required even for original hmm calls..


> > > +static int dmirror_read(struct dmirror *dmirror,
> > > +			struct hmm_dmirror_cmd *cmd)
> > > +{
> > 
> > Why not just use pread()/pwrite() for this instead of an ioctl?
> 
> pread()/pwrite() could certainly be implemented.
> I think the idea was that the read/write is actually the "device"
> doing read/write and making that clearly different from a program
> reading/writing the device. Also, the ioctl() allows information
> about what faults or events happened during the operation. I only
> have number of pages and number of page faults returned at the moment,
> but one of Jerome's version of this driver had other counters being
> returned.

Makes sense I guess

> > > +static struct platform_driver dmirror_device_driver = {
> > > +	.probe		= dmirror_probe,
> > > +	.remove		= dmirror_remove,
> > > +	.driver		= {
> > > +		.name	= "HMM_DMIRROR",
> > > +	},
> > > +};
> > 
> > This presence of a platform_driver and device is very confusing. I'm
> > sure Greg KH would object to this as a misuse of platform drivers.
> > 
> > A platform device isn't needed to create a char dev, so what is this for?
> 
> The devm_request_free_mem_region() and devm_memremap_pages() calls for
> creating the ZONE_DEVICE private pages tie into the devm* clean up framework.
> I thought a platform_driver was the simplest way to also be able to call
> devm_add_action_or_reset() to clean up on module unload and be compatible
> with the private page clean up.

IIRC Christoph recently fixed things so there was a non devm version
of these functions. Certainly we should not be making fake
platform_devices just to call devm.

There is also a struct device inside the cdev, maybe that could be
arranged to be devm compatible if it was *really* needed.
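
Roughly (untested; mdevice, DEVMEM_CHUNK_SIZE and dmirror_devmem_ops
are placeholders, not necessarily the driver's actual names):

	struct resource *res;
	void *ptr;

	res = request_free_mem_region(&iomem_resource, DEVMEM_CHUNK_SIZE,
				      "hmm_dmirror");
	if (IS_ERR(res))
		return PTR_ERR(res);

	mdevice->pagemap.type = MEMORY_DEVICE_PRIVATE;
	mdevice->pagemap.res = *res;
	mdevice->pagemap.ops = &dmirror_devmem_ops;
	ptr = memremap_pages(&mdevice->pagemap, numa_node_id());
	if (IS_ERR(ptr))
		return PTR_ERR(ptr);

with a matching memunmap_pages()/release of the resource on the module
exit path, so no struct device is needed at all for the pagemap part.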

> > > diff --git a/include/Kbuild b/include/Kbuild
> > > index ffba79483cc5..6ffb44a45957 100644
> > > +++ b/include/Kbuild
> > > @@ -1063,6 +1063,7 @@ header-test-			+= uapi/linux/coda_psdev.h
> > >   header-test-			+= uapi/linux/errqueue.h
> > >   header-test-			+= uapi/linux/eventpoll.h
> > >   header-test-			+= uapi/linux/hdlc/ioctl.h
> > > +header-test-			+= uapi/linux/hmm_dmirror.h
> > 
> > Why? This list should only be updated if the header is broken in some
> > way.
> 
> Should this be in include/linux/ instead?
> I wasn't sure where the "right" place was to put the header.

No, it is right, it just shouldn't be in this makefile.

Jason

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 3/3] mm/hmm/test: add self tests for HMM
  2019-10-29 17:58   ` Jason Gunthorpe
  2019-10-29 21:16     ` Ralph Campbell
@ 2019-10-30 18:34     ` Qian Cai
  1 sibling, 0 replies; 19+ messages in thread
From: Qian Cai @ 2019-10-30 18:34 UTC (permalink / raw)
  To: Jason Gunthorpe, Ralph Campbell
  Cc: Jerome Glisse, John Hubbard, Christoph Hellwig, linux-rdma,
	linux-mm, linux-kernel

On Tue, 2019-10-29 at 17:58 +0000, Jason Gunthorpe wrote:
> On Wed, Oct 23, 2019 at 12:55:15PM -0700, Ralph Campbell wrote:
> > Add self tests for HMM.
> > 
> > Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
> > ---
> >  MAINTAINERS                            |    3 +
> >  drivers/char/Kconfig                   |   11 +
> >  drivers/char/Makefile                  |    1 +
> >  drivers/char/hmm_dmirror.c             | 1566 ++++++++++++++++++++++++
> >  include/Kbuild                         |    1 +
> >  include/uapi/linux/hmm_dmirror.h       |   74 ++
> >  tools/testing/selftests/vm/.gitignore  |    1 +
> >  tools/testing/selftests/vm/Makefile    |    3 +
> >  tools/testing/selftests/vm/config      |    3 +
> >  tools/testing/selftests/vm/hmm-tests.c | 1311 ++++++++++++++++++++
> >  tools/testing/selftests/vm/run_vmtests |   16 +
> >  tools/testing/selftests/vm/test_hmm.sh |   97 ++
> >  12 files changed, 3087 insertions(+)
> >  create mode 100644 drivers/char/hmm_dmirror.c
> >  create mode 100644 include/uapi/linux/hmm_dmirror.h
> >  create mode 100644 tools/testing/selftests/vm/hmm-tests.c
> >  create mode 100755 tools/testing/selftests/vm/test_hmm.sh
> 
> This is really big, it would be nice to get a comment from the various
> kernel testing folks if this approach makes sense with the test
> frameworks. Do we have other drivers that are only intended to be used
> by selftests?
> 
> Frankly, I'm not super excited about the idea of a 'test driver', it
> seems more logical for testing to have some way for a test harness to
> call hmm_range_fault() under various conditions and check the results?

Not a big fan of those selftests either. Could it be saner to use the new KUnit
framework for those instead?
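
Eg the shape of it would be something like this (very rough sketch just
to show the KUnit boilerplate; the hard part, setting up an mm with the
right vmas to feed hmm_range_fault(), is not shown):

#include <kunit/test.h>

static void hmm_range_fault_basic(struct kunit *test)
{
	/*
	 * Build an mm/vma fixture here, call hmm_range_fault() and
	 * check the returned pfns and flags.
	 */
	KUNIT_EXPECT_EQ(test, 0, 0);
}

static struct kunit_case hmm_test_cases[] = {
	KUNIT_CASE(hmm_range_fault_basic),
	{}
};

static struct kunit_suite hmm_test_suite = {
	.name = "hmm",
	.test_cases = hmm_test_cases,
};
kunit_test_suite(hmm_test_suite);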

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 3/3] mm/hmm/test: add self tests for HMM
  2019-10-29 23:12       ` Jason Gunthorpe
@ 2019-10-31  0:14         ` Ralph Campbell
  2019-10-31 12:42           ` Jason Gunthorpe
  0 siblings, 1 reply; 19+ messages in thread
From: Ralph Campbell @ 2019-10-31  0:14 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jerome Glisse, John Hubbard, Christoph Hellwig, linux-rdma,
	linux-mm, linux-kernel


On 10/29/19 4:12 PM, Jason Gunthorpe wrote:
> On Tue, Oct 29, 2019 at 02:16:05PM -0700, Ralph Campbell wrote:
> 
>>> Frankly, I'm not super excited about the idea of a 'test driver', it
>>> seems more logical for testing to have some way for a test harness to
>>> call hmm_range_fault() under various conditions and check the results?
>>
>> test_vmalloc.sh at least uses a test module(s).
> 
> Well, that is good, is it also under drivers/char? It kind feels like
> it should not be there...

I think most of the test modules live in lib/ but I wasn't sure that
was the right place for the HMM test driver.
If you think that is better, I can easily move it.

>>> It seems especially over-complicated to use a full page table layout
>>> for this, wouldn't something simple like an xarray be good enough for
>>> test purposes?
>>
>> Possibly. A page table is really just a lookup table from virtual address
>> to pfn/page. Part of the rationale was to mimic what a real device
>> might do.
> 
> Well, but the details of the page table layout don't seem really
> important to this testing, IMHO.

One problem with XArray is that on 32-bit machines the value would
need to be u64 to hold a pfn which won't fit in a ULONG_MAX.
I guess we could make the driver 64-bit only.

>>>> +	for (addr = start; addr < end; ) {
>>>> +		long count;
>>>> +
>>>> +		next = min(addr + (ARRAY_SIZE(pfns) << PAGE_SHIFT), end);
>>>> +		range.start = addr;
>>>> +		range.end = next;
>>>> +
>>>> +		down_read(&mm->mmap_sem);
> 
> Also, did we get a mmget() before doing this down_read?
> 
>>>> +
>>>> +		ret = hmm_range_register(&range, &dmirror->mirror);
>>>> +		if (ret) {
>>>> +			up_read(&mm->mmap_sem);
>>>> +			break;
>>>> +		}
>>>> +
>>>> +		if (!hmm_range_wait_until_valid(&range,
>>>> +						DMIRROR_RANGE_FAULT_TIMEOUT)) {
>>>> +			hmm_range_unregister(&range);
>>>> +			up_read(&mm->mmap_sem);
>>>> +			continue;
>>>> +		}
>>>> +
>>>> +		count = hmm_range_fault(&range, 0);
>>>> +		if (count < 0) {
>>>> +			ret = count;
>>>> +			hmm_range_unregister(&range);
>>>> +			up_read(&mm->mmap_sem);
>>>> +			break;
>>>> +		}
>>>> +
>>>> +		if (!hmm_range_valid(&range)) {
>>>
>>> There is no 'driver lock' being held here, how does this work?
>>> Shouldn't it hold dmirror->mutex for this sequence?
>>
>> I have a modified version of this driver that's based on your series
>> removing hmm_mirror_register() which uses a mutex.
>> Otherwise, it looks similar to the changes in nouveau.
> 
> Well, that locking pattern is required even for original hmm calls..

Will be fixed in v4.

> 
>>>> +static int dmirror_read(struct dmirror *dmirror,
>>>> +			struct hmm_dmirror_cmd *cmd)
>>>> +{
>>>
>>> Why not just use pread()/pwrite() for this instead of an ioctl?
>>
>> pread()/pwrite() could certainly be implemented.
>> I think the idea was that the read/write is actually the "device"
>> doing read/write and making that clearly different from a program
>> reading/writing the device. Also, the ioctl() allows information
>> about what faults or events happened during the operation. I only
>> have number of pages and number of page faults returned at the moment,
>> but one of Jerome's versions of this driver had other counters being
>> returned.
> 
> Makes sense I guess
> 
>>>> +static struct platform_driver dmirror_device_driver = {
>>>> +	.probe		= dmirror_probe,
>>>> +	.remove		= dmirror_remove,
>>>> +	.driver		= {
>>>> +		.name	= "HMM_DMIRROR",
>>>> +	},
>>>> +};
>>>
>>> This presence of a platform_driver and device is very confusing. I'm
>>> sure Greg KH would object to this as a misuse of platform drivers.
>>>
>>> A platform device isn't needed to create a char dev, so what is this for?
>>
>> The devm_request_free_mem_region() and devm_memremap_pages() calls for
>> creating the ZONE_DEVICE private pages tie into the devm* clean up framework.
>> I thought a platform_driver was the simplest way to also be able to call
>> devm_add_action_or_reset() to clean up on module unload and be compatible
>> with the private page clean up.
> 
> IIRC Christoph recently fixed things so there was a non devm version
> of these functions. Certainly we should not be making fake
> platform_devices just to call devm.
> 
> There is also a struct device inside the cdev, maybe that could be
> arranged to be devm compatible if it was *really* needed.

Will be fixed in v4.

>>>> diff --git a/include/Kbuild b/include/Kbuild
>>>> index ffba79483cc5..6ffb44a45957 100644
>>>> +++ b/include/Kbuild
>>>> @@ -1063,6 +1063,7 @@ header-test-			+= uapi/linux/coda_psdev.h
>>>>    header-test-			+= uapi/linux/errqueue.h
>>>>    header-test-			+= uapi/linux/eventpoll.h
>>>>    header-test-			+= uapi/linux/hdlc/ioctl.h
>>>> +header-test-			+= uapi/linux/hmm_dmirror.h
>>>
>>> Why? This list should only be updated if the header is broken in some
>>> way.
>>
>> Should this be in include/linux/ instead?
>> I wasn't sure where the "right" place was to put the header.
> 
> No, it is right, it just shouldn't be in this makefile.
> 
> Jason

Will be fixed in v4.

Thanks for the review, the code is much simpler now.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 3/3] mm/hmm/test: add self tests for HMM
  2019-10-31  0:14         ` Ralph Campbell
@ 2019-10-31 12:42           ` Jason Gunthorpe
  2019-10-31 17:28             ` Ralph Campbell
  0 siblings, 1 reply; 19+ messages in thread
From: Jason Gunthorpe @ 2019-10-31 12:42 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: Jerome Glisse, John Hubbard, Christoph Hellwig, linux-rdma,
	linux-mm, linux-kernel

On Wed, Oct 30, 2019 at 05:14:30PM -0700, Ralph Campbell wrote:

> > Well, that is good, is it also under drivers/char? It kind feels like
> > it should not be there...
> 
> I think most of the test modules live in lib/ but I wasn't sure that
> was the right place for the HMM test driver.
> If you think that is better, I can easily move it.

It would be good to get the various test people involved in this, I
really don't know.
 
> > > > It seems especially over-complicated to use a full page table layout
> > > > for this, wouldn't something simple like an xarray be good enough for
> > > > test purposes?
> > > 
> > > Possibly. A page table is really just a lookup table from virtual address
> > > to pfn/page. Part of the rationale was to mimic what a real device
> > > might do.
> > 
> > Well, but the details of the page table layout don't seem really
> > important to this testing, IMHO.
> 
> One problem with XArray is that on 32-bit machines the value would
> need to be u64 to hold a pfn which won't fit in a ULONG_MAX.
> I guess we could make the driver 64-bit only.

Why would a 32 bit machine need a 64 bit pfn?

Jason

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 3/3] mm/hmm/test: add self tests for HMM
  2019-10-31 12:42           ` Jason Gunthorpe
@ 2019-10-31 17:28             ` Ralph Campbell
  2019-10-31 17:34               ` Jason Gunthorpe
  0 siblings, 1 reply; 19+ messages in thread
From: Ralph Campbell @ 2019-10-31 17:28 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jerome Glisse, John Hubbard, Christoph Hellwig, linux-rdma,
	linux-mm, linux-kernel



On 10/31/19 5:42 AM, Jason Gunthorpe wrote:
> On Wed, Oct 30, 2019 at 05:14:30PM -0700, Ralph Campbell wrote:
> 
>>> Well, that is good, is it also under drivers/char? It kind feels like
>>> it should not be there...
>>
>> I think most of the test modules live in lib/ but I wasn't sure that
>> was the right place for the HMM test driver.
>> If you think that is better, I can easily move it.
> 
> It would be good to get the various test people involved in this, I
> really don't know.

OK.
  
>>>>> It seems especially over-complicated to use a full page table layout
>>>>> for this, wouldn't something simple like an xarray be good enough for
>>>>> test purposes?
>>>>
>>>> Possibly. A page table is really just a lookup table from virtual address
>>>> to pfn/page. Part of the rationale was to mimic what a real device
>>>> might do.
>>>
>>> Well, but the details of the page table layout don't seem really
>>> important to this testing, IMHO.
>>
>> One problem with XArray is that on 32-bit machines the value would
>> need to be u64 to hold a pfn which won't fit in a ULONG_MAX.
>> I guess we could make the driver 64-bit only.
> 
> Why would a 32 bit machine need a 64 bit pfn?
> 
> Jason
> 

On x86, Physical Address Extension (PAE) uses a 64 bit PTE.
See arch/x86/include/asm/pgtable_32_types.h which includes
arch/x86/include/asm/pgtable-3level_types.h.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 3/3] mm/hmm/test: add self tests for HMM
  2019-10-31 17:28             ` Ralph Campbell
@ 2019-10-31 17:34               ` Jason Gunthorpe
  2019-10-31 17:48                 ` Ralph Campbell
  0 siblings, 1 reply; 19+ messages in thread
From: Jason Gunthorpe @ 2019-10-31 17:34 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: Jerome Glisse, John Hubbard, Christoph Hellwig, linux-rdma,
	linux-mm, linux-kernel

On Thu, Oct 31, 2019 at 10:28:12AM -0700, Ralph Campbell wrote:
> > > > > > It seems especially over-complicated to use a full page table layout
> > > > > > for this, wouldn't something simple like an xarray be good enough for
> > > > > > test purposes?
> > > > > 
> > > > > Possibly. A page table is really just a lookup table from virtual address
> > > > > to pfn/page. Part of the rationale was to mimic what a real device
> > > > > might do.
> > > > 
> > > > Well, but the details of the page table layout don't seem really
> > > > important to this testing, IMHO.
> > > 
> > > One problem with XArray is that on 32-bit machines the value would
> > > need to be u64 to hold a pfn, which won't fit in an unsigned long.
> > > I guess we could make the driver 64-bit only.
> > 
> > Why would a 32 bit machine need a 64 bit pfn?
> > 
> 
> On x86, Physical Address Extension (PAE) uses a 64 bit PTE.
> See arch/x86/include/asm/pgtable_32_types.h which includes
> arch/x86/include/asm/pgtable-3level_types.h.

That is the content of the PTE, not the address of the PTE. In this
case the xarray index is the 'virtual' address of the fictional device,
and it can easily be 32 bits with no problem.
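
Something like this rough sketch is what I had in mind (the dmirror_*
name is made up here, not taken from your patch; only xa_store() and
friends are the real API):

#include <linux/mm.h>
#include <linux/xarray.h>

/*
 * Index the mirror by the fictional device's virtual page frame number
 * and store the struct page pointer as the entry.  Both the index and
 * the entry are native word sized, so a 32-bit build works fine.
 */
static int dmirror_mirror_page(struct xarray *mirror, unsigned long addr,
			       struct page *page)
{
	void *old;

	old = xa_store(mirror, addr >> PAGE_SHIFT, page, GFP_KERNEL);
	return xa_is_err(old) ? xa_err(old) : 0;
}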

Jason

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 3/3] mm/hmm/test: add self tests for HMM
  2019-10-31 17:34               ` Jason Gunthorpe
@ 2019-10-31 17:48                 ` Ralph Campbell
  0 siblings, 0 replies; 19+ messages in thread
From: Ralph Campbell @ 2019-10-31 17:48 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jerome Glisse, John Hubbard, Christoph Hellwig, linux-rdma,
	linux-mm, linux-kernel


On 10/31/19 10:34 AM, Jason Gunthorpe wrote:
> On Thu, Oct 31, 2019 at 10:28:12AM -0700, Ralph Campbell wrote:
>>>>>>> It seems especially over-complicated to use a full page table layout
>>>>>>> for this, wouldn't something simple like an xarray be good enough for
>>>>>>> test purposes?
>>>>>>
>>>>>> Possibly. A page table is really just a lookup table from virtual address
>>>>>> to pfn/page. Part of the rationale was to mimic what a real device
>>>>>> might do.
>>>>>
>>>>> Well, but the details of the page table layout don't seem really
>>>>> important to this testing, IMHO.
>>>>
>>>> One problem with XArray is that on 32-bit machines the value would
>>>> need to be u64 to hold a pfn, which won't fit in an unsigned long.
>>>> I guess we could make the driver 64-bit only.
>>>
>>> Why would a 32 bit machine need a 64 bit pfn?
>>>
>>
>> On x86, Physical Address Extension (PAE) uses a 64 bit PTE.
>> See arch/x86/include/asm/pgtable_32_types.h which includes
>> arch/x86/include/asm/pgtable-3level_types.h.
> 
> That is the content of the PTE, not the address of the PTE. In this
> case the xarray index is the 'virtual' address of the fictional device,
> and it can easily be 32 bits with no problem.
> 
> Jason
> 

Oh, I see. You mean use the 32-bit user virtual address as the index
and store a pointer to the 64-bit PTE, which of course is itself only
32 bits. That should work.
I was stuck on thinking the PTE value itself needed to be stored.
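
Just to make sure I follow, something like this minimal sketch is what
you mean (the function name is invented for illustration, not from the
posted driver):

#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/xarray.h>

/*
 * The XArray index is the (32-bit) virtual page number and the entry
 * is only a pointer to a 64-bit device PTE, so the array itself never
 * has to hold a u64 value.  Any previously stored entry is freed.
 */
static int dmirror_pt_set(struct xarray *pt, unsigned long addr, u64 pte)
{
	u64 *entry;
	void *old;

	entry = kmalloc(sizeof(*entry), GFP_KERNEL);
	if (!entry)
		return -ENOMEM;
	*entry = pte;

	old = xa_store(pt, addr >> PAGE_SHIFT, entry, GFP_KERNEL);
	if (xa_is_err(old)) {
		kfree(entry);
		return xa_err(old);
	}
	kfree(old);	/* replaced entry, if any; kfree(NULL) is a no-op */
	return 0;
}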

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2019-10-31 17:48 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-23 19:55 [PATCH v3 0/3] HMM tests and minor fixes Ralph Campbell
2019-10-23 19:55 ` [PATCH v3 1/3] mm/hmm: make full use of walk_page_range() Ralph Campbell
2019-10-29 17:40   ` Jason Gunthorpe
2019-10-23 19:55 ` [PATCH v3 2/3] mm/hmm: allow snapshot of the special zero page Ralph Campbell
2019-10-23 20:27   ` Jerome Glisse
2019-10-24  9:27   ` David Hildenbrand
2019-10-29 17:27   ` Jason Gunthorpe
2019-10-23 19:55 ` [PATCH v3 3/3] mm/hmm/test: add self tests for HMM Ralph Campbell
2019-10-23 20:28   ` Jerome Glisse
2019-10-23 21:55     ` Ralph Campbell
2019-10-29 17:58   ` Jason Gunthorpe
2019-10-29 21:16     ` Ralph Campbell
2019-10-29 23:12       ` Jason Gunthorpe
2019-10-31  0:14         ` Ralph Campbell
2019-10-31 12:42           ` Jason Gunthorpe
2019-10-31 17:28             ` Ralph Campbell
2019-10-31 17:34               ` Jason Gunthorpe
2019-10-31 17:48                 ` Ralph Campbell
2019-10-30 18:34     ` Qian Cai
