amd-gfx.lists.freedesktop.org archive mirror
* ensure device private pages have an owner v2
@ 2020-03-16 19:32 Christoph Hellwig
  2020-03-16 19:32 ` [PATCH 1/4] memremap: add an owner field to struct dev_pagemap Christoph Hellwig
                   ` (5 more replies)
  0 siblings, 6 replies; 38+ messages in thread
From: Christoph Hellwig @ 2020-03-16 19:32 UTC (permalink / raw)
  To: Jason Gunthorpe, Dan Williams, Bharata B Rao,
	Christian König, Ben Skeggs
  Cc: kvm-ppc, nouveau, dri-devel, linux-mm, Jerome Glisse, amd-gfx

When acting on device private mappings a driver needs to know if the
device (or other entity in case of kvmppc) actually owns this private
mapping.  This series adds an owner field and converts the migrate_vma
code over to check it.  I looked into doing the same for
hmm_range_fault, but as far as I can tell that code has never been
wired up to actually work for device private memory, so instead of
trying to fix some unused code, the second patch just removes it.
We can add it back once we have working and fully tested code, and
then it should pass the expected owner in the hmm_range structure.

Changes since v1:
 - split out the pgmap->owner addition into a separate patch
 - check pgmap->owner is set for device private mappings
 - rename the dev_private_owner field in struct migrate_vma to src_owner
 - refuse to migrate private pages if src_owner is not set
 - keep the non-fault device private handling in hmm_range_fault
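
For driver writers the intended usage boils down to picking one stable
per-driver cookie, storing it in dev_pagemap.owner before the pages are
registered, and passing the same pointer whenever migrating out of that
memory again.  A rough sketch (hypothetical "mydev" driver; the fields
are the ones added in patches 1 and 2):

	/* at init time, before registering the device private memory: */
	mydev->pagemap.type = MEMORY_DEVICE_PRIVATE;
	mydev->pagemap.res = *res;
	mydev->pagemap.ops = &mydev_pagemap_ops;
	mydev->pagemap.owner = mydev;	/* any unique, stable pointer */
	addr = memremap_pages(&mydev->pagemap, NUMA_NO_NODE);

	/* when migrating our device private pages back to system memory: */
	struct migrate_vma mig = {
		.vma		= vma,
		.start		= start,
		.end		= end,
		.src		= &src_pfn,
		.dst		= &dst_pfn,
		.src_owner	= mydev,	/* must match pagemap.owner */
	};
	ret = migrate_vma_setup(&mig);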

* [PATCH 1/4] memremap: add an owner field to struct dev_pagemap
  2020-03-16 19:32 ensure device private pages have an owner v2 Christoph Hellwig
@ 2020-03-16 19:32 ` Christoph Hellwig
  2020-03-16 20:55   ` Ralph Campbell
  2020-03-16 19:32 ` [PATCH 2/4] mm: handle multiple owners of device private pages in migrate_vma Christoph Hellwig
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2020-03-16 19:32 UTC (permalink / raw)
  To: Jason Gunthorpe, Dan Williams, Bharata B Rao,
	Christian König, Ben Skeggs
  Cc: kvm-ppc, nouveau, dri-devel, linux-mm, Jerome Glisse, amd-gfx

Add a new opaque owner field to struct dev_pagemap, which will allow
the hmm and migrate_vma code to identify who owns ZONE_DEVICE memory,
and refuse to work on mappings not owned by the calling entity.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 arch/powerpc/kvm/book3s_hv_uvmem.c     | 2 ++
 drivers/gpu/drm/nouveau/nouveau_dmem.c | 1 +
 include/linux/memremap.h               | 4 ++++
 mm/memremap.c                          | 4 ++++
 4 files changed, 11 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 79b1202b1c62..67fefb03b9b7 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -779,6 +779,8 @@ int kvmppc_uvmem_init(void)
 	kvmppc_uvmem_pgmap.type = MEMORY_DEVICE_PRIVATE;
 	kvmppc_uvmem_pgmap.res = *res;
 	kvmppc_uvmem_pgmap.ops = &kvmppc_uvmem_ops;
+	/* just one global instance: */
+	kvmppc_uvmem_pgmap.owner = &kvmppc_uvmem_pgmap;
 	addr = memremap_pages(&kvmppc_uvmem_pgmap, NUMA_NO_NODE);
 	if (IS_ERR(addr)) {
 		ret = PTR_ERR(addr);
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 0ad5d87b5a8e..a4682272586e 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -526,6 +526,7 @@ nouveau_dmem_init(struct nouveau_drm *drm)
 	drm->dmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
 	drm->dmem->pagemap.res = *res;
 	drm->dmem->pagemap.ops = &nouveau_dmem_pagemap_ops;
+	drm->dmem->pagemap.owner = drm->dev;
 	if (IS_ERR(devm_memremap_pages(device, &drm->dmem->pagemap)))
 		goto out_free;
 
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 6fefb09af7c3..60d97e8fd3c0 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -103,6 +103,9 @@ struct dev_pagemap_ops {
  * @type: memory type: see MEMORY_* in memory_hotplug.h
  * @flags: PGMAP_* flags to specify defailed behavior
  * @ops: method table
+ * @owner: an opaque pointer identifying the entity that manages this
+ *	instance.  Used by various helpers to make sure that no
+ *	foreign ZONE_DEVICE memory is accessed.
  */
 struct dev_pagemap {
 	struct vmem_altmap altmap;
@@ -113,6 +116,7 @@ struct dev_pagemap {
 	enum memory_type type;
 	unsigned int flags;
 	const struct dev_pagemap_ops *ops;
+	void *owner;
 };
 
 static inline struct vmem_altmap *pgmap_altmap(struct dev_pagemap *pgmap)
diff --git a/mm/memremap.c b/mm/memremap.c
index 09b5b7adc773..9b2c97ceb775 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -181,6 +181,10 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 			WARN(1, "Missing migrate_to_ram method\n");
 			return ERR_PTR(-EINVAL);
 		}
+		if (!pgmap->owner) {
+			WARN(1, "Missing owner\n");
+			return ERR_PTR(-EINVAL);
+		}
 		break;
 	case MEMORY_DEVICE_FS_DAX:
 		if (!IS_ENABLED(CONFIG_ZONE_DEVICE) ||
-- 
2.24.1


* [PATCH 2/4] mm: handle multiple owners of device private pages in migrate_vma
  2020-03-16 19:32 ensure device private pages have an owner v2 Christoph Hellwig
  2020-03-16 19:32 ` [PATCH 1/4] memremap: add an owner field to struct dev_pagemap Christoph Hellwig
@ 2020-03-16 19:32 ` Christoph Hellwig
  2020-03-16 21:43   ` Ralph Campbell
  2020-03-16 19:32 ` [PATCH 3/4] mm: simplify device private page handling in hmm_range_fault Christoph Hellwig
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2020-03-16 19:32 UTC (permalink / raw)
  To: Jason Gunthorpe, Dan Williams, Bharata B Rao,
	Christian König, Ben Skeggs
  Cc: kvm-ppc, nouveau, dri-devel, linux-mm, Jerome Glisse, amd-gfx

Add a new src_owner field to struct migrate_vma.  If the field is set,
only device private pages with page->pgmap->owner equal to that field
are migrated.  If the field is not set only "normal" pages are migrated.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Fixes: df6ad69838fc ("mm/device-public-memory: device memory cache coherent with CPU")
---
 arch/powerpc/kvm/book3s_hv_uvmem.c     | 1 +
 drivers/gpu/drm/nouveau/nouveau_dmem.c | 1 +
 include/linux/migrate.h                | 8 ++++++++
 mm/migrate.c                           | 9 ++++++---
 4 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 67fefb03b9b7..f44f6b27950f 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -563,6 +563,7 @@ kvmppc_svm_page_out(struct vm_area_struct *vma, unsigned long start,
 	mig.end = end;
 	mig.src = &src_pfn;
 	mig.dst = &dst_pfn;
+	mig.src_owner = &kvmppc_uvmem_pgmap;
 
 	mutex_lock(&kvm->arch.uvmem_lock);
 	/* The requested page is already paged-out, nothing to do */
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index a4682272586e..0e36345d395c 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -176,6 +176,7 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf)
 		.end		= vmf->address + PAGE_SIZE,
 		.src		= &src,
 		.dst		= &dst,
+		.src_owner	= drm->dev,
 	};
 
 	/*
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 72120061b7d4..3e546cbf03dd 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -196,6 +196,14 @@ struct migrate_vma {
 	unsigned long		npages;
 	unsigned long		start;
 	unsigned long		end;
+
+	/*
+	 * Set to the owner value also stored in page->pgmap->owner for
+	 * migrating out of device private memory.  If set only device
+	 * private pages with this owner are migrated.  If not set
+	 * device private pages are not migrated at all.
+	 */
+	void			*src_owner;
 };
 
 int migrate_vma_setup(struct migrate_vma *args);
diff --git a/mm/migrate.c b/mm/migrate.c
index b1092876e537..7605d2c23433 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2241,7 +2241,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 	arch_enter_lazy_mmu_mode();
 
 	for (; addr < end; addr += PAGE_SIZE, ptep++) {
-		unsigned long mpfn, pfn;
+		unsigned long mpfn = 0, pfn;
 		struct page *page;
 		swp_entry_t entry;
 		pte_t pte;
@@ -2255,8 +2255,6 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 		}
 
 		if (!pte_present(pte)) {
-			mpfn = 0;
-
 			/*
 			 * Only care about unaddressable device page special
 			 * page table entry. Other special swap entries are not
@@ -2267,11 +2265,16 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 				goto next;
 
 			page = device_private_entry_to_page(entry);
+			if (page->pgmap->owner != migrate->src_owner)
+				goto next;
+
 			mpfn = migrate_pfn(page_to_pfn(page)) |
 					MIGRATE_PFN_MIGRATE;
 			if (is_write_device_private_entry(entry))
 				mpfn |= MIGRATE_PFN_WRITE;
 		} else {
+			if (migrate->src_owner)
+				goto next;
 			pfn = pte_pfn(pte);
 			if (is_zero_pfn(pfn)) {
 				mpfn = MIGRATE_PFN_MIGRATE;
-- 
2.24.1


* [PATCH 3/4] mm: simplify device private page handling in hmm_range_fault
  2020-03-16 19:32 ensure device private pages have an owner v2 Christoph Hellwig
  2020-03-16 19:32 ` [PATCH 1/4] memremap: add an owner field to struct dev_pagemap Christoph Hellwig
  2020-03-16 19:32 ` [PATCH 2/4] mm: handle multiple owners of device private pages in migrate_vma Christoph Hellwig
@ 2020-03-16 19:32 ` Christoph Hellwig
  2020-03-16 19:59   ` Jason Gunthorpe
  2020-03-16 22:49   ` Ralph Campbell
  2020-03-16 19:32 ` [PATCH 4/4] mm: check the device private page owner " Christoph Hellwig
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 38+ messages in thread
From: Christoph Hellwig @ 2020-03-16 19:32 UTC (permalink / raw)
  To: Jason Gunthorpe, Dan Williams, Bharata B Rao,
	Christian König, Ben Skeggs
  Cc: kvm-ppc, nouveau, dri-devel, linux-mm, Jerome Glisse, amd-gfx

Remove the code to fault device private pages back into system memory
that has never been used by any driver.  Also replace the usage of the
HMM_PFN_DEVICE_PRIVATE flag in the pfns array with a simple
is_device_private_page check in nouveau.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c |  1 -
 drivers/gpu/drm/nouveau/nouveau_dmem.c  |  5 +++--
 drivers/gpu/drm/nouveau/nouveau_svm.c   |  1 -
 include/linux/hmm.h                     |  2 --
 mm/hmm.c                                | 25 +++++--------------------
 5 files changed, 8 insertions(+), 26 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index dee446278417..90821ce5e6ca 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -776,7 +776,6 @@ struct amdgpu_ttm_tt {
 static const uint64_t hmm_range_flags[HMM_PFN_FLAG_MAX] = {
 	(1 << 0), /* HMM_PFN_VALID */
 	(1 << 1), /* HMM_PFN_WRITE */
-	0 /* HMM_PFN_DEVICE_PRIVATE */
 };
 
 static const uint64_t hmm_range_values[HMM_PFN_VALUE_MAX] = {
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 0e36345d395c..edfd0805fba4 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -28,6 +28,7 @@
 
 #include <nvif/class.h>
 #include <nvif/object.h>
+#include <nvif/if000c.h>
 #include <nvif/if500b.h>
 #include <nvif/if900b.h>
 
@@ -692,9 +693,8 @@ nouveau_dmem_convert_pfn(struct nouveau_drm *drm,
 		if (page == NULL)
 			continue;
 
-		if (!(range->pfns[i] & range->flags[HMM_PFN_DEVICE_PRIVATE])) {
+		if (!is_device_private_page(page))
 			continue;
-		}
 
 		if (!nouveau_dmem_page(drm, page)) {
 			WARN(1, "Some unknown device memory !\n");
@@ -705,5 +705,6 @@ nouveau_dmem_convert_pfn(struct nouveau_drm *drm,
 		addr = nouveau_dmem_page_addr(page);
 		range->pfns[i] &= ((1UL << range->pfn_shift) - 1);
 		range->pfns[i] |= (addr >> PAGE_SHIFT) << range->pfn_shift;
+		range->pfns[i] |= NVIF_VMM_PFNMAP_V0_VRAM;
 	}
 }
diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
index df9bf1fd1bc0..39c731a99937 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -367,7 +367,6 @@ static const u64
 nouveau_svm_pfn_flags[HMM_PFN_FLAG_MAX] = {
 	[HMM_PFN_VALID         ] = NVIF_VMM_PFNMAP_V0_V,
 	[HMM_PFN_WRITE         ] = NVIF_VMM_PFNMAP_V0_W,
-	[HMM_PFN_DEVICE_PRIVATE] = NVIF_VMM_PFNMAP_V0_VRAM,
 };
 
 static const u64
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 4bf8d6997b12..5e6034f105c3 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -74,7 +74,6 @@
  * Flags:
  * HMM_PFN_VALID: pfn is valid. It has, at least, read permission.
  * HMM_PFN_WRITE: CPU page table has write permission set
- * HMM_PFN_DEVICE_PRIVATE: private device memory (ZONE_DEVICE)
  *
  * The driver provides a flags array for mapping page protections to device
  * PTE bits. If the driver valid bit for an entry is bit 3,
@@ -86,7 +85,6 @@
 enum hmm_pfn_flag_e {
 	HMM_PFN_VALID = 0,
 	HMM_PFN_WRITE,
-	HMM_PFN_DEVICE_PRIVATE,
 	HMM_PFN_FLAG_MAX
 };
 
diff --git a/mm/hmm.c b/mm/hmm.c
index 180e398170b0..cfad65f6a67b 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -118,15 +118,6 @@ static inline void hmm_pte_need_fault(const struct hmm_vma_walk *hmm_vma_walk,
 	/* We aren't ask to do anything ... */
 	if (!(pfns & range->flags[HMM_PFN_VALID]))
 		return;
-	/* If this is device memory then only fault if explicitly requested */
-	if ((cpu_flags & range->flags[HMM_PFN_DEVICE_PRIVATE])) {
-		/* Do we fault on device memory ? */
-		if (pfns & range->flags[HMM_PFN_DEVICE_PRIVATE]) {
-			*write_fault = pfns & range->flags[HMM_PFN_WRITE];
-			*fault = true;
-		}
-		return;
-	}
 
 	/* If CPU page table is not valid then we need to fault */
 	*fault = !(cpu_flags & range->flags[HMM_PFN_VALID]);
@@ -260,21 +251,15 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 		swp_entry_t entry = pte_to_swp_entry(pte);
 
 		/*
-		 * This is a special swap entry, ignore migration, use
-		 * device and report anything else as error.
+		 * Never fault in device private pages pages, but just report
+		 * the PFN even if not present.
 		 */
 		if (is_device_private_entry(entry)) {
-			cpu_flags = range->flags[HMM_PFN_VALID] |
-				range->flags[HMM_PFN_DEVICE_PRIVATE];
-			cpu_flags |= is_write_device_private_entry(entry) ?
-				range->flags[HMM_PFN_WRITE] : 0;
-			hmm_pte_need_fault(hmm_vma_walk, orig_pfn, cpu_flags,
-					   &fault, &write_fault);
-			if (fault || write_fault)
-				goto fault;
 			*pfn = hmm_device_entry_from_pfn(range,
 					    swp_offset(entry));
-			*pfn |= cpu_flags;
+			*pfn |= range->flags[HMM_PFN_VALID];
+			if (is_write_device_private_entry(entry))
+				*pfn |= range->flags[HMM_PFN_WRITE];
 			return 0;
 		}
 
-- 
2.24.1


* [PATCH 4/4] mm: check the device private page owner in hmm_range_fault
  2020-03-16 19:32 ensure device private pages have an owner v2 Christoph Hellwig
                   ` (2 preceding siblings ...)
  2020-03-16 19:32 ` [PATCH 3/4] mm: simplify device private page handling in hmm_range_fault Christoph Hellwig
@ 2020-03-16 19:32 ` Christoph Hellwig
  2020-03-16 19:49   ` Jason Gunthorpe
                     ` (2 more replies)
  2020-03-17  5:31 ` ensure device private pages have an owner v2 Bharata B Rao
  2020-03-19  0:28 ` Jason Gunthorpe
  5 siblings, 3 replies; 38+ messages in thread
From: Christoph Hellwig @ 2020-03-16 19:32 UTC (permalink / raw)
  To: Jason Gunthorpe, Dan Williams, Bharata B Rao,
	Christian König, Ben Skeggs
  Cc: kvm-ppc, nouveau, dri-devel, linux-mm, Jerome Glisse, amd-gfx

Hmm range fault will succeed for any kind of device private memory,
even if it doesn't belong to the calling entity.  While nouveau
has some crude checks for that, they are broken because they assume
nouveau is the only user of device private memory.  Fix this by
passing in an expected pgmap owner in the hmm_range_fault structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Fixes: 4ef589dc9b10 ("mm/hmm/devmem: device memory hotplug using ZONE_DEVICE")
---
 drivers/gpu/drm/nouveau/nouveau_dmem.c | 12 ------------
 include/linux/hmm.h                    |  2 ++
 mm/hmm.c                               | 10 +++++++++-
 3 files changed, 11 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index edfd0805fba4..ad89e09a0be3 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -672,12 +672,6 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm,
 	return ret;
 }
 
-static inline bool
-nouveau_dmem_page(struct nouveau_drm *drm, struct page *page)
-{
-	return is_device_private_page(page) && drm->dmem == page_to_dmem(page);
-}
-
 void
 nouveau_dmem_convert_pfn(struct nouveau_drm *drm,
 			 struct hmm_range *range)
@@ -696,12 +690,6 @@ nouveau_dmem_convert_pfn(struct nouveau_drm *drm,
 		if (!is_device_private_page(page))
 			continue;
 
-		if (!nouveau_dmem_page(drm, page)) {
-			WARN(1, "Some unknown device memory !\n");
-			range->pfns[i] = 0;
-			continue;
-		}
-
 		addr = nouveau_dmem_page_addr(page);
 		range->pfns[i] &= ((1UL << range->pfn_shift) - 1);
 		range->pfns[i] |= (addr >> PAGE_SHIFT) << range->pfn_shift;
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 5e6034f105c3..bb6be4428633 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -132,6 +132,7 @@ enum hmm_pfn_value_e {
  * @pfn_flags_mask: allows to mask pfn flags so that only default_flags matter
  * @pfn_shifts: pfn shift value (should be <= PAGE_SHIFT)
  * @valid: pfns array did not change since it has been fill by an HMM function
+ * @dev_private_owner: owner of device private pages
  */
 struct hmm_range {
 	struct mmu_interval_notifier *notifier;
@@ -144,6 +145,7 @@ struct hmm_range {
 	uint64_t		default_flags;
 	uint64_t		pfn_flags_mask;
 	uint8_t			pfn_shift;
+	void			*dev_private_owner;
 };
 
 /*
diff --git a/mm/hmm.c b/mm/hmm.c
index cfad65f6a67b..b75b3750e03d 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -216,6 +216,14 @@ int hmm_vma_handle_pmd(struct mm_walk *walk, unsigned long addr,
 		unsigned long end, uint64_t *pfns, pmd_t pmd);
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+static inline bool hmm_is_device_private_entry(struct hmm_range *range,
+		swp_entry_t entry)
+{
+	return is_device_private_entry(entry) &&
+		device_private_entry_to_page(entry)->pgmap->owner ==
+		range->dev_private_owner;
+}
+
 static inline uint64_t pte_to_hmm_pfn_flags(struct hmm_range *range, pte_t pte)
 {
 	if (pte_none(pte) || !pte_present(pte) || pte_protnone(pte))
@@ -254,7 +262,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 		 * Never fault in device private pages pages, but just report
 		 * the PFN even if not present.
 		 */
-		if (is_device_private_entry(entry)) {
+		if (hmm_is_device_private_entry(range, entry)) {
 			*pfn = hmm_device_entry_from_pfn(range,
 					    swp_offset(entry));
 			*pfn |= range->flags[HMM_PFN_VALID];
-- 
2.24.1


* Re: [PATCH 4/4] mm: check the device private page owner in hmm_range_fault
  2020-03-16 19:32 ` [PATCH 4/4] mm: check the device private page owner " Christoph Hellwig
@ 2020-03-16 19:49   ` Jason Gunthorpe
  2020-03-16 23:11   ` Ralph Campbell
  2020-03-20 13:41   ` Jason Gunthorpe
  2 siblings, 0 replies; 38+ messages in thread
From: Jason Gunthorpe @ 2020-03-16 19:49 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: amd-gfx, nouveau, dri-devel, kvm-ppc, Bharata B Rao, linux-mm,
	Jerome Glisse, Ben Skeggs, Dan Williams, Christian König

On Mon, Mar 16, 2020 at 08:32:16PM +0100, Christoph Hellwig wrote:
> Hmm range fault will succeed for any kind of device private memory,
> even if it doesn't belong to the calling entity.  While nouveau
> has some crude checks for that, they are broken because they assume
> nouveau is the only user of device private memory.  Fix this by
> passing in an expected pgmap owner in the hmm_range_fault structure.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Fixes: 4ef589dc9b10 ("mm/hmm/devmem: device memory hotplug using ZONE_DEVICE")
> ---
>  drivers/gpu/drm/nouveau/nouveau_dmem.c | 12 ------------
>  include/linux/hmm.h                    |  2 ++
>  mm/hmm.c                               | 10 +++++++++-
>  3 files changed, 11 insertions(+), 13 deletions(-)

Nice

Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>

Jason

* Re: [PATCH 3/4] mm: simplify device private page handling in hmm_range_fault
  2020-03-16 19:32 ` [PATCH 3/4] mm: simplify device private page handling in hmm_range_fault Christoph Hellwig
@ 2020-03-16 19:59   ` Jason Gunthorpe
  2020-03-16 21:33     ` Christoph Hellwig
  2020-03-16 22:49   ` Ralph Campbell
  1 sibling, 1 reply; 38+ messages in thread
From: Jason Gunthorpe @ 2020-03-16 19:59 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: amd-gfx, nouveau, dri-devel, kvm-ppc, Bharata B Rao, linux-mm,
	Jerome Glisse, Ben Skeggs, Dan Williams, Christian König

On Mon, Mar 16, 2020 at 08:32:15PM +0100, Christoph Hellwig wrote:
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 180e398170b0..cfad65f6a67b 100644
> +++ b/mm/hmm.c
> @@ -118,15 +118,6 @@ static inline void hmm_pte_need_fault(const struct hmm_vma_walk *hmm_vma_walk,
>  	/* We aren't ask to do anything ... */
>  	if (!(pfns & range->flags[HMM_PFN_VALID]))
>  		return;
> -	/* If this is device memory then only fault if explicitly requested */
> -	if ((cpu_flags & range->flags[HMM_PFN_DEVICE_PRIVATE])) {
> -		/* Do we fault on device memory ? */
> -		if (pfns & range->flags[HMM_PFN_DEVICE_PRIVATE]) {
> -			*write_fault = pfns & range->flags[HMM_PFN_WRITE];
> -			*fault = true;
> -		}
> -		return;
> -	}

Yes, this is an elegant solution to the input flags.

However, between patch 3 and 4 doesn't this break amd gpu as it will
return device_private pages now if not requested? Squash the two?

Jason

* Re: [PATCH 1/4] memremap: add an owner field to struct dev_pagemap
  2020-03-16 19:32 ` [PATCH 1/4] memremap: add an owner field to struct dev_pagemap Christoph Hellwig
@ 2020-03-16 20:55   ` Ralph Campbell
  0 siblings, 0 replies; 38+ messages in thread
From: Ralph Campbell @ 2020-03-16 20:55 UTC (permalink / raw)
  To: Christoph Hellwig, Jason Gunthorpe, Dan Williams, Bharata B Rao,
	Christian König, Ben Skeggs
  Cc: kvm-ppc, nouveau, dri-devel, linux-mm, Jerome Glisse, amd-gfx


On 3/16/20 12:32 PM, Christoph Hellwig wrote:
> Add a new opaque owner field to struct dev_pagemap, which will allow
> the hmm and migrate_vma code to identify who owns ZONE_DEVICE memory,
> and refuse to work on mappings not owned by the calling entity.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

This looks like a reasonable approach to take.
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>

> ---
>   arch/powerpc/kvm/book3s_hv_uvmem.c     | 2 ++
>   drivers/gpu/drm/nouveau/nouveau_dmem.c | 1 +
>   include/linux/memremap.h               | 4 ++++
>   mm/memremap.c                          | 4 ++++
>   4 files changed, 11 insertions(+)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
> index 79b1202b1c62..67fefb03b9b7 100644
> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
> @@ -779,6 +779,8 @@ int kvmppc_uvmem_init(void)
>   	kvmppc_uvmem_pgmap.type = MEMORY_DEVICE_PRIVATE;
>   	kvmppc_uvmem_pgmap.res = *res;
>   	kvmppc_uvmem_pgmap.ops = &kvmppc_uvmem_ops;
> +	/* just one global instance: */
> +	kvmppc_uvmem_pgmap.owner = &kvmppc_uvmem_pgmap;
>   	addr = memremap_pages(&kvmppc_uvmem_pgmap, NUMA_NO_NODE);
>   	if (IS_ERR(addr)) {
>   		ret = PTR_ERR(addr);
> diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> index 0ad5d87b5a8e..a4682272586e 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> @@ -526,6 +526,7 @@ nouveau_dmem_init(struct nouveau_drm *drm)
>   	drm->dmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
>   	drm->dmem->pagemap.res = *res;
>   	drm->dmem->pagemap.ops = &nouveau_dmem_pagemap_ops;
> +	drm->dmem->pagemap.owner = drm->dev;
>   	if (IS_ERR(devm_memremap_pages(device, &drm->dmem->pagemap)))
>   		goto out_free;
>   
> diff --git a/include/linux/memremap.h b/include/linux/memremap.h
> index 6fefb09af7c3..60d97e8fd3c0 100644
> --- a/include/linux/memremap.h
> +++ b/include/linux/memremap.h
> @@ -103,6 +103,9 @@ struct dev_pagemap_ops {
>    * @type: memory type: see MEMORY_* in memory_hotplug.h
>    * @flags: PGMAP_* flags to specify defailed behavior
>    * @ops: method table
> + * @owner: an opaque pointer identifying the entity that manages this
> + *	instance.  Used by various helpers to make sure that no
> + *	foreign ZONE_DEVICE memory is accessed.
>    */
>   struct dev_pagemap {
>   	struct vmem_altmap altmap;
> @@ -113,6 +116,7 @@ struct dev_pagemap {
>   	enum memory_type type;
>   	unsigned int flags;
>   	const struct dev_pagemap_ops *ops;
> +	void *owner;
>   };
>   
>   static inline struct vmem_altmap *pgmap_altmap(struct dev_pagemap *pgmap)
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 09b5b7adc773..9b2c97ceb775 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -181,6 +181,10 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
>   			WARN(1, "Missing migrate_to_ram method\n");
>   			return ERR_PTR(-EINVAL);
>   		}
> +		if (!pgmap->owner) {
> +			WARN(1, "Missing owner\n");
> +			return ERR_PTR(-EINVAL);
> +		}
>   		break;
>   	case MEMORY_DEVICE_FS_DAX:
>   		if (!IS_ENABLED(CONFIG_ZONE_DEVICE) ||
> 

* Re: [PATCH 3/4] mm: simplify device private page handling in hmm_range_fault
  2020-03-16 19:59   ` Jason Gunthorpe
@ 2020-03-16 21:33     ` Christoph Hellwig
  0 siblings, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2020-03-16 21:33 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: amd-gfx, nouveau, dri-devel, kvm-ppc, Christoph Hellwig,
	linux-mm, Jerome Glisse, Ben Skeggs, Dan Williams, Bharata B Rao,
	Christian König

On Mon, Mar 16, 2020 at 04:59:23PM -0300, Jason Gunthorpe wrote:
> However, between patch 3 and 4 doesn't this break amd gpu as it will
> return device_private pages now if not requested? Squash the two?

No change in behavior in this patch as long as HMM_PFN_DEVICE_PRIVATE
isn't set in ->pfns or ->default_flags, which is the case for both
nouveau and amdgpu.  The existing behavior is broken for private
pages not known to the driver, but that is fixed in the next patch.

* Re: [PATCH 2/4] mm: handle multiple owners of device private pages in migrate_vma
  2020-03-16 19:32 ` [PATCH 2/4] mm: handle multiple owners of device private pages in migrate_vma Christoph Hellwig
@ 2020-03-16 21:43   ` Ralph Campbell
  0 siblings, 0 replies; 38+ messages in thread
From: Ralph Campbell @ 2020-03-16 21:43 UTC (permalink / raw)
  To: Christoph Hellwig, Jason Gunthorpe, Dan Williams, Bharata B Rao,
	Christian König, Ben Skeggs
  Cc: kvm-ppc, nouveau, dri-devel, linux-mm, Jerome Glisse, amd-gfx


On 3/16/20 12:32 PM, Christoph Hellwig wrote:
> Add a new src_owner field to struct migrate_vma.  If the field is set,
> only device private pages with page->pgmap->owner equal to that field
> are migrated.  If the field is not set only "normal" pages are migrated.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Fixes: df6ad69838fc ("mm/device-public-memory: device memory cache coherent with CPU")

When migrating to device private memory, setting src_owner lets the caller
know about source pages that are already migrated, and it skips pages
migrated to a different device, similar to pages swapped out to an actual
swap device.  But it also prevents normal pages from being migrated to
device private memory.  It can still be useful for the driver to know that
a page is owned by a different device if it has a device-to-device way of
migrating data.
nouveau_dmem_migrate_vma() doesn't set args.src_owner, so only normal pages
will be migrated, which I guess is OK for now since nouveau doesn't handle
direct GPU to GPU migration currently.

When migrating device private memory to system memory due to a CPU fault,
the source page should be the device's device private struct page, so if
it isn't, it does make sense not to migrate whatever normal page is there.
nouveau_dmem_migrate_to_ram() sets src_owner, so this case looks OK.
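
To spell the two directions out (hypothetical driver; only src_owner is
new in this patch, the rest mirrors the nouveau code quoted below):

	/* migrating normal pages into device private memory: */
	struct migrate_vma to_vram = {
		.vma	= vma,
		.start	= start,
		.end	= end,
		.src	= src_pfns,
		.dst	= dst_pfns,
		/* .src_owner left NULL, so only normal pages are collected */
	};

	/* CPU fault on one of our pages, migrating it back to system memory: */
	struct migrate_vma to_ram = {
		.vma		= vmf->vma,
		.start		= vmf->address,
		.end		= vmf->address + PAGE_SIZE,
		.src		= &src_pfn,
		.dst		= &dst_pfn,
		.src_owner	= mydev,	/* matches our pagemap.owner */
	};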

Just had to think this through.
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>

> ---
>   arch/powerpc/kvm/book3s_hv_uvmem.c     | 1 +
>   drivers/gpu/drm/nouveau/nouveau_dmem.c | 1 +
>   include/linux/migrate.h                | 8 ++++++++
>   mm/migrate.c                           | 9 ++++++---
>   4 files changed, 16 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
> index 67fefb03b9b7..f44f6b27950f 100644
> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
> @@ -563,6 +563,7 @@ kvmppc_svm_page_out(struct vm_area_struct *vma, unsigned long start,
>   	mig.end = end;
>   	mig.src = &src_pfn;
>   	mig.dst = &dst_pfn;
> +	mig.src_owner = &kvmppc_uvmem_pgmap;
>   
>   	mutex_lock(&kvm->arch.uvmem_lock);
>   	/* The requested page is already paged-out, nothing to do */
> diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> index a4682272586e..0e36345d395c 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> @@ -176,6 +176,7 @@ static vm_fault_t nouveau_dmem_migrate_to_ram(struct vm_fault *vmf)
>   		.end		= vmf->address + PAGE_SIZE,
>   		.src		= &src,
>   		.dst		= &dst,
> +		.src_owner	= drm->dev,
>   	};
>   
>   	/*
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 72120061b7d4..3e546cbf03dd 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -196,6 +196,14 @@ struct migrate_vma {
>   	unsigned long		npages;
>   	unsigned long		start;
>   	unsigned long		end;
> +
> +	/*
> +	 * Set to the owner value also stored in page->pgmap->owner for
> +	 * migrating out of device private memory.  If set only device
> +	 * private pages with this owner are migrated.  If not set
> +	 * device private pages are not migrated at all.
> +	 */
> +	void			*src_owner;
>   };
>   
>   int migrate_vma_setup(struct migrate_vma *args);
> diff --git a/mm/migrate.c b/mm/migrate.c
> index b1092876e537..7605d2c23433 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2241,7 +2241,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>   	arch_enter_lazy_mmu_mode();
>   
>   	for (; addr < end; addr += PAGE_SIZE, ptep++) {
> -		unsigned long mpfn, pfn;
> +		unsigned long mpfn = 0, pfn;
>   		struct page *page;
>   		swp_entry_t entry;
>   		pte_t pte;
> @@ -2255,8 +2255,6 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>   		}
>   
>   		if (!pte_present(pte)) {
> -			mpfn = 0;
> -
>   			/*
>   			 * Only care about unaddressable device page special
>   			 * page table entry. Other special swap entries are not
> @@ -2267,11 +2265,16 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>   				goto next;
>   
>   			page = device_private_entry_to_page(entry);
> +			if (page->pgmap->owner != migrate->src_owner)
> +				goto next;
> +
>   			mpfn = migrate_pfn(page_to_pfn(page)) |
>   					MIGRATE_PFN_MIGRATE;
>   			if (is_write_device_private_entry(entry))
>   				mpfn |= MIGRATE_PFN_WRITE;
>   		} else {
> +			if (migrate->src_owner)
> +				goto next;
>   			pfn = pte_pfn(pte);
>   			if (is_zero_pfn(pfn)) {
>   				mpfn = MIGRATE_PFN_MIGRATE;
> 

* Re: [PATCH 3/4] mm: simplify device private page handling in hmm_range_fault
  2020-03-16 19:32 ` [PATCH 3/4] mm: simplify device private page handling in hmm_range_fault Christoph Hellwig
  2020-03-16 19:59   ` Jason Gunthorpe
@ 2020-03-16 22:49   ` Ralph Campbell
  2020-03-17  7:34     ` Christoph Hellwig
  2020-03-17 12:15     ` Jason Gunthorpe
  1 sibling, 2 replies; 38+ messages in thread
From: Ralph Campbell @ 2020-03-16 22:49 UTC (permalink / raw)
  To: Christoph Hellwig, Jason Gunthorpe, Dan Williams, Bharata B Rao,
	Christian König, Ben Skeggs
  Cc: kvm-ppc, nouveau, dri-devel, linux-mm, Jerome Glisse, amd-gfx


On 3/16/20 12:32 PM, Christoph Hellwig wrote:
> Remove the code to fault device private pages back into system memory
> that has never been used by any driver.  Also replace the usage of the
> HMM_PFN_DEVICE_PRIVATE flag in the pfns array with a simple
> is_device_private_page check in nouveau.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Getting rid of HMM_PFN_DEVICE_PRIVATE seems reasonable to me since a driver can
look at the struct page but what if a driver needs to fault in a page from
another device's private memory? Should it call handle_mm_fault()?


> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c |  1 -
>   drivers/gpu/drm/nouveau/nouveau_dmem.c  |  5 +++--
>   drivers/gpu/drm/nouveau/nouveau_svm.c   |  1 -
>   include/linux/hmm.h                     |  2 --
>   mm/hmm.c                                | 25 +++++--------------------
>   5 files changed, 8 insertions(+), 26 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> index dee446278417..90821ce5e6ca 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> @@ -776,7 +776,6 @@ struct amdgpu_ttm_tt {
>   static const uint64_t hmm_range_flags[HMM_PFN_FLAG_MAX] = {
>   	(1 << 0), /* HMM_PFN_VALID */
>   	(1 << 1), /* HMM_PFN_WRITE */
> -	0 /* HMM_PFN_DEVICE_PRIVATE */
>   };
>   
>   static const uint64_t hmm_range_values[HMM_PFN_VALUE_MAX] = {
> diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> index 0e36345d395c..edfd0805fba4 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> @@ -28,6 +28,7 @@
>   
>   #include <nvif/class.h>
>   #include <nvif/object.h>
> +#include <nvif/if000c.h>
>   #include <nvif/if500b.h>
>   #include <nvif/if900b.h>
>   
> @@ -692,9 +693,8 @@ nouveau_dmem_convert_pfn(struct nouveau_drm *drm,
>   		if (page == NULL)
>   			continue;
>   
> -		if (!(range->pfns[i] & range->flags[HMM_PFN_DEVICE_PRIVATE])) {
> +		if (!is_device_private_page(page))
>   			continue;
> -		}
>   
>   		if (!nouveau_dmem_page(drm, page)) {
>   			WARN(1, "Some unknown device memory !\n");
> @@ -705,5 +705,6 @@ nouveau_dmem_convert_pfn(struct nouveau_drm *drm,
>   		addr = nouveau_dmem_page_addr(page);
>   		range->pfns[i] &= ((1UL << range->pfn_shift) - 1);
>   		range->pfns[i] |= (addr >> PAGE_SHIFT) << range->pfn_shift;
> +		range->pfns[i] |= NVIF_VMM_PFNMAP_V0_VRAM;
>   	}
>   }
> diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
> index df9bf1fd1bc0..39c731a99937 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_svm.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
> @@ -367,7 +367,6 @@ static const u64
>   nouveau_svm_pfn_flags[HMM_PFN_FLAG_MAX] = {
>   	[HMM_PFN_VALID         ] = NVIF_VMM_PFNMAP_V0_V,
>   	[HMM_PFN_WRITE         ] = NVIF_VMM_PFNMAP_V0_W,
> -	[HMM_PFN_DEVICE_PRIVATE] = NVIF_VMM_PFNMAP_V0_VRAM,
>   };
>   
>   static const u64
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 4bf8d6997b12..5e6034f105c3 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -74,7 +74,6 @@
>    * Flags:
>    * HMM_PFN_VALID: pfn is valid. It has, at least, read permission.
>    * HMM_PFN_WRITE: CPU page table has write permission set
> - * HMM_PFN_DEVICE_PRIVATE: private device memory (ZONE_DEVICE)
>    *
>    * The driver provides a flags array for mapping page protections to device
>    * PTE bits. If the driver valid bit for an entry is bit 3,
> @@ -86,7 +85,6 @@
>   enum hmm_pfn_flag_e {
>   	HMM_PFN_VALID = 0,
>   	HMM_PFN_WRITE,
> -	HMM_PFN_DEVICE_PRIVATE,
>   	HMM_PFN_FLAG_MAX
>   };
>   
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 180e398170b0..cfad65f6a67b 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -118,15 +118,6 @@ static inline void hmm_pte_need_fault(const struct hmm_vma_walk *hmm_vma_walk,
>   	/* We aren't ask to do anything ... */
>   	if (!(pfns & range->flags[HMM_PFN_VALID]))
>   		return;
> -	/* If this is device memory then only fault if explicitly requested */
> -	if ((cpu_flags & range->flags[HMM_PFN_DEVICE_PRIVATE])) {
> -		/* Do we fault on device memory ? */
> -		if (pfns & range->flags[HMM_PFN_DEVICE_PRIVATE]) {
> -			*write_fault = pfns & range->flags[HMM_PFN_WRITE];
> -			*fault = true;
> -		}
> -		return;
> -	}
>   
>   	/* If CPU page table is not valid then we need to fault */
>   	*fault = !(cpu_flags & range->flags[HMM_PFN_VALID]);
> @@ -260,21 +251,15 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
>   		swp_entry_t entry = pte_to_swp_entry(pte);
>   
>   		/*
> -		 * This is a special swap entry, ignore migration, use
> -		 * device and report anything else as error.
> +		 * Never fault in device private pages pages, but just report
> +		 * the PFN even if not present.
>   		 */
>   		if (is_device_private_entry(entry)) {
> -			cpu_flags = range->flags[HMM_PFN_VALID] |
> -				range->flags[HMM_PFN_DEVICE_PRIVATE];
> -			cpu_flags |= is_write_device_private_entry(entry) ?
> -				range->flags[HMM_PFN_WRITE] : 0;
> -			hmm_pte_need_fault(hmm_vma_walk, orig_pfn, cpu_flags,
> -					   &fault, &write_fault);
> -			if (fault || write_fault)
> -				goto fault;
>   			*pfn = hmm_device_entry_from_pfn(range,
>   					    swp_offset(entry));
> -			*pfn |= cpu_flags;
> +			*pfn |= range->flags[HMM_PFN_VALID];
> +			if (is_write_device_private_entry(entry))
> +				*pfn |= range->flags[HMM_PFN_WRITE];
>   			return 0;
>   		}
>   
> 

* Re: [PATCH 4/4] mm: check the device private page owner in hmm_range_fault
  2020-03-16 19:32 ` [PATCH 4/4] mm: check the device private page owner " Christoph Hellwig
  2020-03-16 19:49   ` Jason Gunthorpe
@ 2020-03-16 23:11   ` Ralph Campbell
  2020-03-20 13:41   ` Jason Gunthorpe
  2 siblings, 0 replies; 38+ messages in thread
From: Ralph Campbell @ 2020-03-16 23:11 UTC (permalink / raw)
  To: Christoph Hellwig, Jason Gunthorpe, Dan Williams, Bharata B Rao,
	Christian König, Ben Skeggs
  Cc: kvm-ppc, nouveau, dri-devel, linux-mm, Jerome Glisse, amd-gfx


On 3/16/20 12:32 PM, Christoph Hellwig wrote:
> Hmm range fault will succeed for any kind of device private memory,
> even if it doesn't belong to the calling entity.  While nouveau
> has some crude checks for that, they are broken because they assume
> nouveau is the only user of device private memory.  Fix this by
> passing in an expected pgmap owner in the hmm_range_fault structure.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Fixes: 4ef589dc9b10 ("mm/hmm/devmem: device memory hotplug using ZONE_DEVICE")

Looks good.
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>

> ---
>   drivers/gpu/drm/nouveau/nouveau_dmem.c | 12 ------------
>   include/linux/hmm.h                    |  2 ++
>   mm/hmm.c                               | 10 +++++++++-
>   3 files changed, 11 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> index edfd0805fba4..ad89e09a0be3 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
> @@ -672,12 +672,6 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm,
>   	return ret;
>   }
>   
> -static inline bool
> -nouveau_dmem_page(struct nouveau_drm *drm, struct page *page)
> -{
> -	return is_device_private_page(page) && drm->dmem == page_to_dmem(page);
> -}
> -
>   void
>   nouveau_dmem_convert_pfn(struct nouveau_drm *drm,
>   			 struct hmm_range *range)
> @@ -696,12 +690,6 @@ nouveau_dmem_convert_pfn(struct nouveau_drm *drm,
>   		if (!is_device_private_page(page))
>   			continue;
>   
> -		if (!nouveau_dmem_page(drm, page)) {
> -			WARN(1, "Some unknown device memory !\n");
> -			range->pfns[i] = 0;
> -			continue;
> -		}
> -
>   		addr = nouveau_dmem_page_addr(page);
>   		range->pfns[i] &= ((1UL << range->pfn_shift) - 1);
>   		range->pfns[i] |= (addr >> PAGE_SHIFT) << range->pfn_shift;
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 5e6034f105c3..bb6be4428633 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -132,6 +132,7 @@ enum hmm_pfn_value_e {
>    * @pfn_flags_mask: allows to mask pfn flags so that only default_flags matter
>    * @pfn_shifts: pfn shift value (should be <= PAGE_SHIFT)
>    * @valid: pfns array did not change since it has been fill by an HMM function
> + * @dev_private_owner: owner of device private pages
>    */
>   struct hmm_range {
>   	struct mmu_interval_notifier *notifier;
> @@ -144,6 +145,7 @@ struct hmm_range {
>   	uint64_t		default_flags;
>   	uint64_t		pfn_flags_mask;
>   	uint8_t			pfn_shift;
> +	void			*dev_private_owner;
>   };
>   
>   /*
> diff --git a/mm/hmm.c b/mm/hmm.c
> index cfad65f6a67b..b75b3750e03d 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -216,6 +216,14 @@ int hmm_vma_handle_pmd(struct mm_walk *walk, unsigned long addr,
>   		unsigned long end, uint64_t *pfns, pmd_t pmd);
>   #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>   
> +static inline bool hmm_is_device_private_entry(struct hmm_range *range,
> +		swp_entry_t entry)
> +{
> +	return is_device_private_entry(entry) &&
> +		device_private_entry_to_page(entry)->pgmap->owner ==
> +		range->dev_private_owner;
> +}
> +
>   static inline uint64_t pte_to_hmm_pfn_flags(struct hmm_range *range, pte_t pte)
>   {
>   	if (pte_none(pte) || !pte_present(pte) || pte_protnone(pte))
> @@ -254,7 +262,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
>   		 * Never fault in device private pages pages, but just report
>   		 * the PFN even if not present.
>   		 */
> -		if (is_device_private_entry(entry)) {
> +		if (hmm_is_device_private_entry(range, entry)) {
>   			*pfn = hmm_device_entry_from_pfn(range,
>   					    swp_offset(entry));
>   			*pfn |= range->flags[HMM_PFN_VALID];
> 

* Re: ensure device private pages have an owner v2
  2020-03-16 19:32 ensure device private pages have an owner v2 Christoph Hellwig
                   ` (3 preceding siblings ...)
  2020-03-16 19:32 ` [PATCH 4/4] mm: check the device private page owner " Christoph Hellwig
@ 2020-03-17  5:31 ` Bharata B Rao
  2020-03-19  0:28 ` Jason Gunthorpe
  5 siblings, 0 replies; 38+ messages in thread
From: Bharata B Rao @ 2020-03-17  5:31 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: amd-gfx, linux-mm, nouveau, dri-devel, kvm-ppc, Jason Gunthorpe,
	Jerome Glisse, Ben Skeggs, Dan Williams, Christian König

On Mon, Mar 16, 2020 at 08:32:12PM +0100, Christoph Hellwig wrote:
> When acting on device private mappings a driver needs to know if the
> device (or other entity in case of kvmppc) actually owns this private
> mapping.  This series adds an owner field and converts the migrate_vma
> code over to check it.  I looked into doing the same for
> hmm_range_fault, but as far as I can tell that code has never been
> wired up to actually work for device private memory, so instead of
> trying to fix some unused code, the second patch just removes it.
> We can add it back once we have working and fully tested code, and
> then it should pass the expected owner in the hmm_range structure.

Boot-tested a pseries secure guest with this change (1/4 and 2/4 only)

So Tested-by: Bharata B Rao <bharata@linux.ibm.com> for the above
two patches.

Regards,
Bharata.


* Re: [PATCH 3/4] mm: simplify device private page handling in hmm_range_fault
  2020-03-16 22:49   ` Ralph Campbell
@ 2020-03-17  7:34     ` Christoph Hellwig
  2020-03-17 22:43       ` Ralph Campbell
  2020-03-17 12:15     ` Jason Gunthorpe
  1 sibling, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2020-03-17  7:34 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: amd-gfx, linux-mm, nouveau, dri-devel, kvm-ppc,
	Christoph Hellwig, Jason Gunthorpe, Jerome Glisse, Ben Skeggs,
	Dan Williams, Bharata B Rao, Christian König

On Mon, Mar 16, 2020 at 03:49:51PM -0700, Ralph Campbell wrote:
> On 3/16/20 12:32 PM, Christoph Hellwig wrote:
>> Remove the code to fault device private pages back into system memory
>> that has never been used by any driver.  Also replace the usage of the
>> HMM_PFN_DEVICE_PRIVATE flag in the pfns array with a simple
>> is_device_private_page check in nouveau.
>>
>> Signed-off-by: Christoph Hellwig <hch@lst.de>
>
> Getting rid of HMM_PFN_DEVICE_PRIVATE seems reasonable to me since a driver can
> look at the struct page but what if a driver needs to fault in a page from
> another device's private memory? Should it call handle_mm_fault()?

Obviously no driver cared for that so far.  Once we have test cases
for that and thus testable code we can add code to fault it in from
hmm_vma_handle_pte.

* Re: [PATCH 3/4] mm: simplify device private page handling in hmm_range_fault
  2020-03-16 22:49   ` Ralph Campbell
  2020-03-17  7:34     ` Christoph Hellwig
@ 2020-03-17 12:15     ` Jason Gunthorpe
  2020-03-17 12:24       ` Christoph Hellwig
  1 sibling, 1 reply; 38+ messages in thread
From: Jason Gunthorpe @ 2020-03-17 12:15 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: amd-gfx, nouveau, dri-devel, kvm-ppc, Christoph Hellwig,
	linux-mm, Jerome Glisse, Ben Skeggs, Dan Williams, Bharata B Rao,
	Christian König

On Mon, Mar 16, 2020 at 03:49:51PM -0700, Ralph Campbell wrote:
> 
> On 3/16/20 12:32 PM, Christoph Hellwig wrote:
> > Remove the code to fault device private pages back into system memory
> > that has never been used by any driver.  Also replace the usage of the
> > HMM_PFN_DEVICE_PRIVATE flag in the pfns array with a simple
> > is_device_private_page check in nouveau.
> > 
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> 
> Getting rid of HMM_PFN_DEVICE_PRIVATE seems reasonable to me since a driver can
> look at the struct page but what if a driver needs to fault in a page from
> another device's private memory? Should it call handle_mm_fault()?

Isn't that what this series basically does?

The dev_private_owner is set to the type of pgmap the device knows how
to handle, and everything else is automatically faulted for the
device.

If the device does not know how to handle device_private then it sets
dev_private_owner to NULL and it never gets device_private pfns.

Since the device_private pfn cannot be dma mapped, drivers must have
explicit support for them.
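
i.e. something like this on the hmm_range_fault() side (sketch only; the
field comes from patch 4, everything else is hypothetical driver state):

	struct hmm_range range = {
		.notifier		= &bo->notifier,
		.start			= start,
		.end			= end,
		.pfns			= pfns,
		.flags			= mydev_hmm_flags,
		.values			= mydev_hmm_values,
		.pfn_shift		= mydev_pfn_shift,
		/*
		 * Only device private pages whose pgmap->owner matches
		 * this cookie are reported as device private entries.
		 */
		.dev_private_owner	= mydev,
	};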

Jason

* Re: [PATCH 3/4] mm: simplify device private page handling in hmm_range_fault
  2020-03-17 12:15     ` Jason Gunthorpe
@ 2020-03-17 12:24       ` Christoph Hellwig
  2020-03-17 12:28         ` Christoph Hellwig
  0 siblings, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2020-03-17 12:24 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Ralph Campbell, amd-gfx, nouveau, dri-devel, kvm-ppc,
	Christoph Hellwig, linux-mm, Jerome Glisse, Ben Skeggs,
	Dan Williams, Bharata B Rao, Christian König

On Tue, Mar 17, 2020 at 09:15:36AM -0300, Jason Gunthorpe wrote:
> > Getting rid of HMM_PFN_DEVICE_PRIVATE seems reasonable to me since a driver can
> > look at the struct page but what if a driver needs to fault in a page from
> > another device's private memory? Should it call handle_mm_fault()?
> 
> Isn't that what this series basically does?
>
> The dev_private_owner is set to the type of pgmap the device knows how
> to handle, and everything else is automatically faulted for the
> device.
> 
> If the device does not know how to handle device_private then it sets
> dev_private_owner to NULL and it never gets device_private pfns.
> 
> Since the device_private pfn cannot be dma mapped, drivers must have
> explicit support for them.

No, with this series (and all actual callers before this series)
we never fault in device private pages.

* Re: [PATCH 3/4] mm: simplify device private page handling in hmm_range_fault
  2020-03-17 12:24       ` Christoph Hellwig
@ 2020-03-17 12:28         ` Christoph Hellwig
  2020-03-17 12:47           ` Jason Gunthorpe
  0 siblings, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2020-03-17 12:28 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Ralph Campbell, amd-gfx, nouveau, dri-devel, kvm-ppc,
	Christoph Hellwig, linux-mm, Jerome Glisse, Ben Skeggs,
	Dan Williams, Bharata B Rao, Christian König

On Tue, Mar 17, 2020 at 01:24:45PM +0100, Christoph Hellwig wrote:
> On Tue, Mar 17, 2020 at 09:15:36AM -0300, Jason Gunthorpe wrote:
> > > Getting rid of HMM_PFN_DEVICE_PRIVATE seems reasonable to me since a driver can
> > > look at the struct page but what if a driver needs to fault in a page from
> > > another device's private memory? Should it call handle_mm_fault()?
> > 
> > Isn't that what this series basically does?
> >
> > The dev_private_owner is set to the type of pgmap the device knows how
> > to handle, and everything else is automatically faulted for the
> > device.
> > 
> > If the device does not know how to handle device_private then it sets
> > dev_private_owner to NULL and it never gets device_private pfns.
> > 
> > Since the device_private pfn cannot be dma mapped, drivers must have
> > explicit support for them.
> 
> No, with this series (and all actual callers before this series)
> we never fault in device private pages.

IFF we want to fault it in we'd need something like this.  But I'd
really prefer to see test cases for that first.

diff --git a/mm/hmm.c b/mm/hmm.c
index b75b3750e03d..2884a3d11a1f 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -276,7 +276,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 		if (!fault && !write_fault)
 			return 0;
 
-		if (!non_swap_entry(entry))
+		if (!non_swap_entry(entry) || is_device_private_entry(entry))
 			goto fault;
 
 		if (is_migration_entry(entry)) {

* Re: [PATCH 3/4] mm: simplify device private page handling in hmm_range_fault
  2020-03-17 12:28         ` Christoph Hellwig
@ 2020-03-17 12:47           ` Jason Gunthorpe
  2020-03-17 12:59             ` Christoph Hellwig
  0 siblings, 1 reply; 38+ messages in thread
From: Jason Gunthorpe @ 2020-03-17 12:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ralph Campbell, amd-gfx, nouveau, dri-devel, kvm-ppc,
	Bharata B Rao, linux-mm, Jerome Glisse, Ben Skeggs, Dan Williams,
	Christian König

On Tue, Mar 17, 2020 at 01:28:13PM +0100, Christoph Hellwig wrote:
> On Tue, Mar 17, 2020 at 01:24:45PM +0100, Christoph Hellwig wrote:
> > On Tue, Mar 17, 2020 at 09:15:36AM -0300, Jason Gunthorpe wrote:
> > > > Getting rid of HMM_PFN_DEVICE_PRIVATE seems reasonable to me since a driver can
> > > > look at the struct page but what if a driver needs to fault in a page from
> > > > another device's private memory? Should it call handle_mm_fault()?
> > > 
> > > Isn't that what this series basically does?
> > >
> > > The dev_private_owner is set to the type of pgmap the device knows how
> > > to handle, and everything else is automatically faulted for the
> > > device.
> > > 
> > > If the device does not know how to handle device_private then it sets
> > > dev_private_owner to NULL and it never gets device_private pfns.
> > > 
> > > Since the device_private pfn cannot be dma mapped, drivers must have
> > > explicit support for them.
> > 
> > No, with this series (and all actual callers before this series)
> > we never fault in device private pages.
> 
> IFF we want to fault it in we'd need something like this.  But I'd
> really prefer to see test cases for that first.

In general I think hmm_range_fault should have a mode that is the same
as get_user_pages in terms of when it returns a hard failure, and
generates faults. AFAIK, GUP will fault in this case?

I need this for making ODP use this API. ODP is the one that is highly
likely to see other drivers' device_private pages and must have them
always fault to CPU.

> diff --git a/mm/hmm.c b/mm/hmm.c
> index b75b3750e03d..2884a3d11a1f 100644
> +++ b/mm/hmm.c
> @@ -276,7 +276,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
>  		if (!fault && !write_fault)
>  			return 0;
>  
> -		if (!non_swap_entry(entry))
> +		if (!non_swap_entry(entry) || is_device_private_entry(entry))
>  			goto fault;

Yes, OK,  makes sense.

I've been using v7 of Ralph's tester and it is working well - it has
DEVICE_PRIVATE support so I think it can test this flow too. Ralph are
you able?

This hunk seems trivial enough to me, can we include it now?

Thanks,
Jason

* Re: [PATCH 3/4] mm: simplify device private page handling in hmm_range_fault
  2020-03-17 12:47           ` Jason Gunthorpe
@ 2020-03-17 12:59             ` Christoph Hellwig
  2020-03-17 17:32               ` Jason Gunthorpe
  2020-03-17 23:14               ` Ralph Campbell
  0 siblings, 2 replies; 38+ messages in thread
From: Christoph Hellwig @ 2020-03-17 12:59 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Ralph Campbell, amd-gfx, nouveau, dri-devel, kvm-ppc,
	Christoph Hellwig, linux-mm, Jerome Glisse, Ben Skeggs,
	Dan Williams, Bharata B Rao, Christian König

On Tue, Mar 17, 2020 at 09:47:55AM -0300, Jason Gunthorpe wrote:
> I've been using v7 of Ralph's tester and it is working well - it has
> DEVICE_PRIVATE support so I think it can test this flow too. Ralph are
> you able?
> 
> This hunk seems trivial enough to me, can we include it now?

I can send a separate patch for it once the tester covers it.  I don't
want to add it to the original patch as it is a significant behavior
change compared to the existing code.
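
Purely as a sketch of what such a test case might look like (it reuses the
hmm2 fixture, the hmm_dmirror_cmd() helper and the ioctls from the test
module Ralph attaches further down; the test name and the expected outcome
once the extra fault-in path exists are assumptions, not something this
series guarantees):

	/*
	 * Sketch of a possible self test: migrate a range to device 0, then
	 * read the same range through device 1.  Device 1's dev_private_owner
	 * does not match, so with the extra fault-in path the pages should be
	 * migrated back to system memory and the read should still return the
	 * original data.
	 */
	TEST_F(hmm2, migrate_foreign_read)
	{
		struct hmm_buffer *buffer;
		unsigned long npages = 4;
		unsigned long size = npages << self->page_shift;
		unsigned long i;
		int *ptr;
		int ret;

		buffer = malloc(sizeof(*buffer));
		ASSERT_NE(buffer, NULL);
		buffer->fd = -1;
		buffer->size = size;
		buffer->mirror = malloc(size);
		ASSERT_NE(buffer->mirror, NULL);
		buffer->ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
				   MAP_PRIVATE | MAP_ANONYMOUS, buffer->fd, 0);
		ASSERT_NE(buffer->ptr, MAP_FAILED);
		for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
			ptr[i] = i;

		/* Migrate the range into device 0's private memory. */
		ret = hmm_dmirror_cmd(self->fd0, HMM_DMIRROR_MIGRATE, buffer, npages);
		ASSERT_EQ(ret, 0);
		ASSERT_EQ(buffer->cpages, npages);

		/* Read it back through device 1's mirror. */
		ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_READ, buffer, npages);
		ASSERT_EQ(ret, 0);
		for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
			ASSERT_EQ(ptr[i], i);

		hmm_buffer_free(buffer);
	}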

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/4] mm: simplify device private page handling in hmm_range_fault
  2020-03-17 12:59             ` Christoph Hellwig
@ 2020-03-17 17:32               ` Jason Gunthorpe
  2020-03-17 23:14               ` Ralph Campbell
  1 sibling, 0 replies; 38+ messages in thread
From: Jason Gunthorpe @ 2020-03-17 17:32 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Ralph Campbell, amd-gfx, nouveau, dri-devel, kvm-ppc,
	Bharata B Rao, linux-mm, Jerome Glisse, Ben Skeggs, Dan Williams,
	Christian König

On Tue, Mar 17, 2020 at 01:59:55PM +0100, Christoph Hellwig wrote:
> On Tue, Mar 17, 2020 at 09:47:55AM -0300, Jason Gunthorpe wrote:
> > I've been using v7 of Ralph's tester and it is working well - it has
> > DEVICE_PRIVATE support so I think it can test this flow too. Ralph are
> > you able?
> > 
> > This hunk seems trivial enough to me, can we include it now?
> 
> I can send a separate patch for it once the tester covers it.  I don't
> want to add it to the original patch as it is a significant behavior
> change compared to the existing code.

Okay. I'm happy enough for now that amdgpu will get ERROR on
device_private pages. That is a bug fix in and of itself.

Jason

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/4] mm: simplify device private page handling in hmm_range_fault
  2020-03-17  7:34     ` Christoph Hellwig
@ 2020-03-17 22:43       ` Ralph Campbell
  2020-03-18  9:34         ` Christoph Hellwig
  0 siblings, 1 reply; 38+ messages in thread
From: Ralph Campbell @ 2020-03-17 22:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: amd-gfx, linux-mm, nouveau, dri-devel, kvm-ppc, Bharata B Rao,
	Jason Gunthorpe, Jerome Glisse, Ben Skeggs, Dan Williams,
	Christian König


On 3/17/20 12:34 AM, Christoph Hellwig wrote:
> On Mon, Mar 16, 2020 at 03:49:51PM -0700, Ralph Campbell wrote:
>> On 3/16/20 12:32 PM, Christoph Hellwig wrote:
>>> Remove the code to fault device private pages back into system memory
>>> that has never been used by any driver.  Also replace the usage of the
>>> HMM_PFN_DEVICE_PRIVATE flag in the pfns array with a simple
>>> is_device_private_page check in nouveau.
>>>
>>> Signed-off-by: Christoph Hellwig <hch@lst.de>
>>
>> Getting rid of HMM_PFN_DEVICE_PRIVATE seems reasonable to me since a driver can
>> look at the struct page but what if a driver needs to fault in a page from
>> another device's private memory? Should it call handle_mm_fault()?
> 
> Obviously no driver cared for that so far.  Once we have test cases
> for that and thus testable code we can add code to fault it in from
> hmm_vma_handle_pte.
> 

I'm OK with the series. I think I would have been less confused if I had
looked at patch 4 before patch 3.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/4] mm: simplify device private page handling in hmm_range_fault
  2020-03-17 12:59             ` Christoph Hellwig
  2020-03-17 17:32               ` Jason Gunthorpe
@ 2020-03-17 23:14               ` Ralph Campbell
  2020-03-19 18:17                 ` Jason Gunthorpe
  2020-03-20  0:14                 ` Jason Gunthorpe
  1 sibling, 2 replies; 38+ messages in thread
From: Ralph Campbell @ 2020-03-17 23:14 UTC (permalink / raw)
  To: Christoph Hellwig, Jason Gunthorpe
  Cc: amd-gfx, nouveau, dri-devel, kvm-ppc, Bharata B Rao, linux-mm,
	Jerome Glisse, Ben Skeggs, Dan Williams, Christian König

[-- Attachment #1: Type: text/plain, Size: 859 bytes --]


On 3/17/20 5:59 AM, Christoph Hellwig wrote:
> On Tue, Mar 17, 2020 at 09:47:55AM -0300, Jason Gunthorpe wrote:
>> I've been using v7 of Ralph's tester and it is working well - it has
>> DEVICE_PRIVATE support so I think it can test this flow too. Ralph are
>> you able?
>>
>> This hunk seems trivial enough to me, can we include it now?
> 
> I can send a separate patch for it once the tester covers it.  I don't
> want to add it to the original patch as it is a significant behavior
> change compared to the existing code.
> 

Attached is an updated version of my HMM tests based on linux-5.6.0-rc6.
I ran this OK with Jason's 8+1 HMM patches, Christoph's 1-5 misc HMM cleanups,
and Christoph's 1-4 device private page changes applied.

I'm working on getting my nouveau tests running again on a different test
machine and will report on that when ready.
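
For anyone skimming the attachment, a minimal sketch of the user space side
of the ABI it adds (assumes 4K pages and the /dev/hmm_dmirror0 node the
module creates; everything outside the test_hmm.h structures and ioctls is
just example scaffolding):

	#include <stdio.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/mman.h>
	#include <sys/ioctl.h>
	#include <linux/test_hmm.h>

	int main(void)
	{
		struct hmm_dmirror_cmd cmd = { 0 };
		unsigned char perm[16];
		size_t size = 16 * 4096;	/* assumes 4K pages */
		void *buf;
		int fd;

		fd = open("/dev/hmm_dmirror0", O_RDWR);
		if (fd < 0)
			return 1;
		buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (buf == MAP_FAILED)
			return 1;

		cmd.addr = (unsigned long)buf;	/* range to inspect */
		cmd.ptr = (unsigned long)perm;	/* one HMM_DMIRROR_PROT_* byte per page */
		cmd.npages = 16;
		if (ioctl(fd, HMM_DMIRROR_SNAPSHOT, &cmd))
			return 1;
		printf("first page: %#x\n", perm[0]);

		close(fd);
		return 0;
	}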

[-- Attachment #2: 0001-mm-hmm-test-add-self-tests-for-HMM.patch --]
[-- Type: text/x-patch, Size: 73675 bytes --]

From d499fb343bfa9764695ecdcd759fb16bc1ca3c93 Mon Sep 17 00:00:00 2001
From: Ralph Campbell <rcampbell@nvidia.com>
Date: Tue, 17 Mar 2020 11:10:38 -0700
Subject: [PATCH] mm/hmm/test: add self tests for HMM
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add some basic stand-alone self tests for HMM.

Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 MAINTAINERS                            |    3 +
 include/uapi/linux/test_hmm.h          |   59 ++
 lib/Kconfig.debug                      |   12 +
 lib/Makefile                           |    1 +
 lib/test_hmm.c                         | 1210 +++++++++++++++++++++
 tools/testing/selftests/vm/.gitignore  |    1 +
 tools/testing/selftests/vm/Makefile    |    3 +
 tools/testing/selftests/vm/config      |    2 +
 tools/testing/selftests/vm/hmm-tests.c | 1353 ++++++++++++++++++++++++
 tools/testing/selftests/vm/run_vmtests |   16 +
 tools/testing/selftests/vm/test_hmm.sh |   97 ++
 11 files changed, 2757 insertions(+)
 create mode 100644 include/uapi/linux/test_hmm.h
 create mode 100644 lib/test_hmm.c
 create mode 100644 tools/testing/selftests/vm/hmm-tests.c
 create mode 100755 tools/testing/selftests/vm/test_hmm.sh

diff --git a/MAINTAINERS b/MAINTAINERS
index cc1d18cb5d18..98c80a589a52 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7608,7 +7608,10 @@ L:	linux-mm@kvack.org
 S:	Maintained
 F:	mm/hmm*
 F:	include/linux/hmm*
+F:	include/uapi/linux/test_hmm*
 F:	Documentation/vm/hmm.rst
+F:	lib/test_hmm*
+F:	tools/testing/selftests/vm/*hmm*
 
 HOST AP DRIVER
 M:	Jouni Malinen <j@w1.fi>
diff --git a/include/uapi/linux/test_hmm.h b/include/uapi/linux/test_hmm.h
new file mode 100644
index 000000000000..8c5f70c160bf
--- /dev/null
+++ b/include/uapi/linux/test_hmm.h
@@ -0,0 +1,59 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * This is a module to test the HMM (Heterogeneous Memory Management) API
+ * of the kernel. It allows a userspace program to expose its entire address
+ * space through the HMM test module device file.
+ */
+#ifndef _UAPI_LINUX_HMM_DMIRROR_H
+#define _UAPI_LINUX_HMM_DMIRROR_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+/*
+ * Structure to pass to the HMM test driver to mimic a device accessing
+ * system memory and ZONE_DEVICE private memory through device page tables.
+ *
+ * @addr: (in) user address the device will read/write
+ * @ptr: (in) user address where device data is copied to/from
+ * @npages: (in) number of pages to read/write
+ * @cpages: (out) number of pages copied
+ * @faults: (out) number of device page faults seen
+ */
+struct hmm_dmirror_cmd {
+	__u64		addr;
+	__u64		ptr;
+	__u64		npages;
+	__u64		cpages;
+	__u64		faults;
+};
+
+/* Expose the address space of the calling process through hmm device file */
+#define HMM_DMIRROR_READ		_IOWR('H', 0x00, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_WRITE		_IOWR('H', 0x01, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_MIGRATE		_IOWR('H', 0x02, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_SNAPSHOT		_IOWR('H', 0x03, struct hmm_dmirror_cmd)
+
+/*
+ * Values returned in hmm_dmirror_cmd.ptr for HMM_DMIRROR_SNAPSHOT.
+ * HMM_DMIRROR_PROT_ERROR: no valid mirror PTE for this page
+ * HMM_DMIRROR_PROT_NONE: unpopulated PTE or PTE with no access
+ * HMM_DMIRROR_PROT_READ: read-only PTE
+ * HMM_DMIRROR_PROT_WRITE: read/write PTE
+ * HMM_DMIRROR_PROT_ZERO: special read-only zero page
+ * HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL: Migrated device private page on the
+ *					device the ioctl() is made
+ * HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE: Migrated device private page on some
+ *					other device
+ */
+enum {
+	HMM_DMIRROR_PROT_ERROR			= 0xFF,
+	HMM_DMIRROR_PROT_NONE			= 0x00,
+	HMM_DMIRROR_PROT_READ			= 0x01,
+	HMM_DMIRROR_PROT_WRITE			= 0x02,
+	HMM_DMIRROR_PROT_ZERO			= 0x10,
+	HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL	= 0x20,
+	HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE	= 0x30,
+};
+
+#endif /* _UAPI_LINUX_HMM_DMIRROR_H */
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 69def4a9df00..4d22ce7879a7 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2162,6 +2162,18 @@ config TEST_MEMINIT
 
 	  If unsure, say N.
 
+config TEST_HMM
+	tristate "Test HMM (Heterogeneous Memory Management)"
+	depends on DEVICE_PRIVATE
+	select HMM_MIRROR
+	select MMU_NOTIFIER
+	help
+	  This is a pseudo device driver solely for testing HMM.
+	  Say M here if you want to build the HMM test module.
+	  Doing so will allow you to run tools/testing/selftests/vm/hmm-tests.
+
+	  If unsure, say N.
+
 endif # RUNTIME_TESTING_MENU
 
 config MEMTEST
diff --git a/lib/Makefile b/lib/Makefile
index 611872c06926..c168bdc803dc 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -89,6 +89,7 @@ obj-$(CONFIG_TEST_OBJAGG) += test_objagg.o
 obj-$(CONFIG_TEST_STACKINIT) += test_stackinit.o
 obj-$(CONFIG_TEST_BLACKHOLE_DEV) += test_blackhole_dev.o
 obj-$(CONFIG_TEST_MEMINIT) += test_meminit.o
+obj-$(CONFIG_TEST_HMM) += test_hmm.o
 
 obj-$(CONFIG_TEST_LIVEPATCH) += livepatch/
 
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
new file mode 100644
index 000000000000..6ca953926dc1
--- /dev/null
+++ b/lib/test_hmm.c
@@ -0,0 +1,1210 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * This is a module to test the HMM (Heterogeneous Memory Management)
+ * mirror and zone device private memory migration APIs of the kernel.
+ * Userspace programs can register with the driver to mirror their own address
+ * space and can use the device to read/write any valid virtual address.
+ */
+#include <linux/init.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/cdev.h>
+#include <linux/device.h>
+#include <linux/mutex.h>
+#include <linux/rwsem.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/highmem.h>
+#include <linux/delay.h>
+#include <linux/pagemap.h>
+#include <linux/hmm.h>
+#include <linux/vmalloc.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/sched/mm.h>
+#include <linux/platform_device.h>
+
+#include <uapi/linux/test_hmm.h>
+
+#define DMIRROR_NDEVICES		2
+#define DMIRROR_RANGE_FAULT_TIMEOUT	1000
+#define DEVMEM_CHUNK_SIZE		(256 * 1024 * 1024U)
+#define DEVMEM_CHUNKS_RESERVE		16
+
+static const struct dev_pagemap_ops dmirror_devmem_ops;
+static const struct mmu_interval_notifier_ops dmirror_min_ops;
+static dev_t dmirror_dev;
+static struct page *dmirror_zero_page;
+
+struct dmirror_device;
+
+struct dmirror_bounce {
+	void			*ptr;
+	unsigned long		size;
+	unsigned long		addr;
+	unsigned long		cpages;
+};
+
+#define DPT_SHIFT PAGE_SHIFT
+#define DPT_VALID (1UL << 0)
+#define DPT_WRITE (1UL << 1)
+
+#define DPT_XA_TAG_WRITE 3UL
+
+static const uint64_t dmirror_hmm_flags[HMM_PFN_FLAG_MAX] = {
+	[HMM_PFN_VALID] = DPT_VALID,
+	[HMM_PFN_WRITE] = DPT_WRITE,
+};
+
+static const uint64_t dmirror_hmm_values[HMM_PFN_VALUE_MAX] = {
+	[HMM_PFN_NONE]    = 0,
+	[HMM_PFN_ERROR]   = 0x10,
+	[HMM_PFN_SPECIAL] = 0x10,
+};
+
+/*
+ * Data structure to track address ranges and register for mmu interval
+ * notifier updates.
+ */
+struct dmirror_interval {
+	struct mmu_interval_notifier	notifier;
+	struct dmirror			*dmirror;
+};
+
+/*
+ * Data attached to the open device file.
+ * Note that it might be shared after a fork().
+ */
+struct dmirror {
+	struct mm_struct		*mm;
+	struct dmirror_device		*mdevice;
+	struct xarray			pt;
+	struct mmu_interval_notifier	notifier;
+	struct mutex			mutex;
+};
+
+/*
+ * ZONE_DEVICE pages for migration and simulating device memory.
+ */
+struct dmirror_chunk {
+	struct dev_pagemap	pagemap;
+	struct dmirror_device	*mdevice;
+};
+
+/*
+ * Per device data.
+ */
+struct dmirror_device {
+	struct cdev		cdevice;
+	struct hmm_devmem	*devmem;
+
+	unsigned int		devmem_capacity;
+	unsigned int		devmem_count;
+	struct dmirror_chunk	**devmem_chunks;
+	struct mutex		devmem_lock;	/* protects the above */
+
+	unsigned long		calloc;
+	unsigned long		cfree;
+	struct page		*free_pages;
+	spinlock_t		lock;		/* protects the above */
+};
+
+static struct dmirror_device dmirror_devices[DMIRROR_NDEVICES];
+
+static int dmirror_bounce_init(struct dmirror_bounce *bounce,
+			       unsigned long addr,
+			       unsigned long size)
+{
+	bounce->addr = addr;
+	bounce->size = size;
+	bounce->cpages = 0;
+	bounce->ptr = vmalloc(size);
+	if (!bounce->ptr)
+		return -ENOMEM;
+	return 0;
+}
+
+static void dmirror_bounce_fini(struct dmirror_bounce *bounce)
+{
+	vfree(bounce->ptr);
+}
+
+static int dmirror_fops_open(struct inode *inode, struct file *filp)
+{
+	struct cdev *cdev = inode->i_cdev;
+	struct dmirror *dmirror;
+	int ret;
+
+	/* Mirror this process address space */
+	dmirror = kzalloc(sizeof(*dmirror), GFP_KERNEL);
+	if (dmirror == NULL)
+		return -ENOMEM;
+
+	dmirror->mdevice = container_of(cdev, struct dmirror_device, cdevice);
+	mutex_init(&dmirror->mutex);
+	xa_init(&dmirror->pt);
+
+	ret = mmu_interval_notifier_insert(&dmirror->notifier, current->mm,
+				0, ULONG_MAX & PAGE_MASK, &dmirror_min_ops);
+	if (ret) {
+		kfree(dmirror);
+		return ret;
+	}
+
+	/* Pairs with the mmdrop() in dmirror_fops_release(). */
+	mmgrab(current->mm);
+	dmirror->mm = current->mm;
+
+	/* Only the first open registers the address space. */
+	filp->private_data = dmirror;
+	return 0;
+}
+
+static int dmirror_fops_release(struct inode *inode, struct file *filp)
+{
+	struct dmirror *dmirror = filp->private_data;
+
+	mmu_interval_notifier_remove(&dmirror->notifier);
+	mmdrop(dmirror->mm);
+	xa_destroy(&dmirror->pt);
+	kfree(dmirror);
+	return 0;
+}
+
+static inline struct dmirror_device *dmirror_page_to_device(struct page *page)
+
+{
+	struct dmirror_chunk *devmem;
+
+	devmem = container_of(page->pgmap, struct dmirror_chunk, pagemap);
+	return devmem->mdevice;
+}
+
+static bool dmirror_device_is_mine(struct dmirror_device *mdevice,
+				   struct page *page)
+{
+	if (!is_zone_device_page(page))
+		return false;
+	return page->pgmap->ops == &dmirror_devmem_ops &&
+		dmirror_page_to_device(page) == mdevice;
+}
+
+static int dmirror_do_fault(struct dmirror *dmirror, struct hmm_range *range)
+{
+	uint64_t *pfns = range->pfns;
+	unsigned long pfn;
+
+	for (pfn = (range->start >> PAGE_SHIFT);
+	     pfn < (range->end >> PAGE_SHIFT);
+	     pfn++, pfns++) {
+		struct page *page;
+		void *entry;
+
+		/*
+		 * HMM_PFN_ERROR is returned if it is accessing invalid memory
+		 * either because of memory error (hardware detected memory
+		 * corruption) or more likely because of truncate on mmap
+		 * file.
+		 */
+		if (*pfns == range->values[HMM_PFN_ERROR])
+			return -EFAULT;
+		if (!(*pfns & range->flags[HMM_PFN_VALID]))
+			return -EFAULT;
+		page = hmm_device_entry_to_page(range, *pfns);
+		/* We asked for pages to be populated but check anyway. */
+		if (!page)
+			return -EFAULT;
+		if (is_zone_device_page(page)) {
+			/*
+			 * TODO: need a way to ask HMM to fault foreign zone
+			 * device private pages.
+			 */
+			if (!dmirror_device_is_mine(dmirror->mdevice, page))
+				continue;
+		}
+		entry = page;
+		if (*pfns & range->flags[HMM_PFN_WRITE])
+			entry = xa_tag_pointer(entry, DPT_XA_TAG_WRITE);
+		else if (range->default_flags & range->flags[HMM_PFN_WRITE])
+			return -EFAULT;
+		entry = xa_store(&dmirror->pt, pfn, entry, GFP_ATOMIC);
+		if (xa_is_err(entry))
+			return xa_err(entry);
+	}
+
+	return 0;
+}
+
+static void dmirror_do_update(struct dmirror *dmirror, unsigned long start,
+			      unsigned long end)
+{
+	unsigned long pfn;
+
+	/*
+	 * The XArray doesn't hold references to pages since it relies on
+	 * the mmu notifier to clear page pointers when they become stale.
+	 * Therefore, it is OK to just clear the entry.
+	 */
+	for (pfn = start >> PAGE_SHIFT; pfn < (end >> PAGE_SHIFT); pfn++)
+		xa_erase(&dmirror->pt, pfn);
+}
+
+static bool dmirror_interval_invalidate(struct mmu_interval_notifier *mni,
+				const struct mmu_notifier_range *range,
+				unsigned long cur_seq)
+{
+	struct dmirror *dmirror = container_of(mni, struct dmirror, notifier);
+	struct mm_struct *mm = dmirror->mm;
+
+	/*
+	 * If the process doesn't exist, we don't need to invalidate the
+	 * device page table since the address space will be torn down.
+	 */
+	if (!mmget_not_zero(mm))
+		return true;
+
+	if (mmu_notifier_range_blockable(range))
+		mutex_lock(&dmirror->mutex);
+	else if (!mutex_trylock(&dmirror->mutex))
+		return false;
+
+	mmu_interval_set_seq(mni, cur_seq);
+	dmirror_do_update(dmirror, range->start, range->end);
+
+	mutex_unlock(&dmirror->mutex);
+	mmput(mm);
+	return true;
+}
+
+static const struct mmu_interval_notifier_ops dmirror_min_ops = {
+	.invalidate = dmirror_interval_invalidate,
+};
+
+static int dmirror_range_fault(struct dmirror *dmirror,
+				struct hmm_range *range)
+{
+	struct mm_struct *mm = dmirror->mm;
+	unsigned long timeout =
+		jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
+	int ret;
+
+	while (true) {
+		long count;
+
+		if (time_after(jiffies, timeout)) {
+			ret = -EBUSY;
+			goto out;
+		}
+
+		range->notifier_seq = mmu_interval_read_begin(range->notifier);
+		down_read(&mm->mmap_sem);
+		count = hmm_range_fault(range, 0);
+		up_read(&mm->mmap_sem);
+		if (count <= 0) {
+			if (count == 0 || count == -EBUSY)
+				continue;
+			ret = count;
+			goto out;
+		}
+
+		mutex_lock(&dmirror->mutex);
+		if (mmu_interval_read_retry(range->notifier,
+					    range->notifier_seq)) {
+			mutex_unlock(&dmirror->mutex);
+			continue;
+		}
+		break;
+	}
+
+	ret = dmirror_do_fault(dmirror, range);
+
+	mutex_unlock(&dmirror->mutex);
+out:
+	return ret;
+}
+
+static int dmirror_fault(struct dmirror *dmirror, unsigned long start,
+			 unsigned long end, bool write)
+{
+	struct mm_struct *mm = dmirror->mm;
+	unsigned long addr;
+	uint64_t pfns[64];
+	struct hmm_range range = {
+		.notifier = &dmirror->notifier,
+		.pfns = pfns,
+		.flags = dmirror_hmm_flags,
+		.values = dmirror_hmm_values,
+		.pfn_shift = DPT_SHIFT,
+		.pfn_flags_mask = ~(dmirror_hmm_flags[HMM_PFN_VALID] |
+				    dmirror_hmm_flags[HMM_PFN_WRITE]),
+		.default_flags = dmirror_hmm_flags[HMM_PFN_VALID] |
+				(write ? dmirror_hmm_flags[HMM_PFN_WRITE] : 0),
+		.dev_private_owner = dmirror->mdevice,
+	};
+	int ret = 0;
+
+	/* Since the mm is for the mirrored process, get a reference first. */
+	if (!mmget_not_zero(mm))
+		return 0;
+
+	for (addr = start; addr < end; addr = range.end) {
+		range.start = addr;
+		range.end = min(addr + (ARRAY_SIZE(pfns) << PAGE_SHIFT), end);
+
+		ret = dmirror_range_fault(dmirror, &range);
+		if (ret)
+			break;
+	}
+
+	mmput(mm);
+	return ret;
+}
+
+static int dmirror_do_read(struct dmirror *dmirror, unsigned long start,
+			   unsigned long end, struct dmirror_bounce *bounce)
+{
+	unsigned long pfn;
+	void *ptr;
+
+	ptr = bounce->ptr + ((start - bounce->addr) & PAGE_MASK);
+
+	for (pfn = start >> PAGE_SHIFT; pfn < (end >> PAGE_SHIFT); pfn++) {
+		void *entry;
+		struct page *page;
+		void *tmp;
+
+		entry = xa_load(&dmirror->pt, pfn);
+		page = xa_untag_pointer(entry);
+		if (!page)
+			return -ENOENT;
+
+		tmp = kmap(page);
+		memcpy(ptr, tmp, PAGE_SIZE);
+		kunmap(page);
+
+		ptr += PAGE_SIZE;
+		bounce->cpages++;
+	}
+
+	return 0;
+}
+
+static int dmirror_read(struct dmirror *dmirror, struct hmm_dmirror_cmd *cmd)
+{
+	struct dmirror_bounce bounce;
+	unsigned long start, end;
+	unsigned long size = cmd->npages << PAGE_SHIFT;
+	int ret;
+
+	start = cmd->addr;
+	end = start + size;
+	if (end < start)
+		return -EINVAL;
+
+	ret = dmirror_bounce_init(&bounce, start, size);
+	if (ret)
+		return ret;
+
+again:
+	mutex_lock(&dmirror->mutex);
+	ret = dmirror_do_read(dmirror, start, end, &bounce);
+	mutex_unlock(&dmirror->mutex);
+	if (ret == 0)
+		ret = copy_to_user((void __user *)cmd->ptr, bounce.ptr,
+					bounce.size);
+	else if (ret == -ENOENT) {
+		start = cmd->addr + (bounce.cpages << PAGE_SHIFT);
+		ret = dmirror_fault(dmirror, start, end, false);
+		if (ret == 0) {
+			cmd->faults++;
+			goto again;
+		}
+	}
+
+	cmd->cpages = bounce.cpages;
+	dmirror_bounce_fini(&bounce);
+	return ret;
+}
+
+static int dmirror_do_write(struct dmirror *dmirror, unsigned long start,
+			    unsigned long end, struct dmirror_bounce *bounce)
+{
+	unsigned long pfn;
+	void *ptr;
+
+	ptr = bounce->ptr + ((start - bounce->addr) & PAGE_MASK);
+
+	for (pfn = start >> PAGE_SHIFT; pfn < (end >> PAGE_SHIFT); pfn++) {
+		void *entry;
+		struct page *page;
+		void *tmp;
+
+		entry = xa_load(&dmirror->pt, pfn);
+		page = xa_untag_pointer(entry);
+		if (!page || xa_pointer_tag(entry) != DPT_XA_TAG_WRITE)
+			return -ENOENT;
+
+		tmp = kmap(page);
+		memcpy(tmp, ptr, PAGE_SIZE);
+		kunmap(page);
+
+		ptr += PAGE_SIZE;
+		bounce->cpages++;
+	}
+
+	return 0;
+}
+
+static int dmirror_write(struct dmirror *dmirror, struct hmm_dmirror_cmd *cmd)
+{
+	struct dmirror_bounce bounce;
+	unsigned long start, end;
+	unsigned long size = cmd->npages << PAGE_SHIFT;
+	int ret;
+
+	start = cmd->addr;
+	end = start + size;
+	if (end < start)
+		return -EINVAL;
+
+	ret = dmirror_bounce_init(&bounce, start, size);
+	if (ret)
+		return ret;
+	ret = copy_from_user(bounce.ptr, (void __user *)cmd->ptr,
+				bounce.size);
+	if (ret)
+		return ret;
+
+again:
+	mutex_lock(&dmirror->mutex);
+	ret = dmirror_do_write(dmirror, start, end, &bounce);
+	mutex_unlock(&dmirror->mutex);
+	if (ret == -ENOENT) {
+		start = cmd->addr + (bounce.cpages << PAGE_SHIFT);
+		ret = dmirror_fault(dmirror, start, end, true);
+		if (ret == 0) {
+			cmd->faults++;
+			goto again;
+		}
+	}
+
+	cmd->cpages = bounce.cpages;
+	dmirror_bounce_fini(&bounce);
+	return ret;
+}
+
+static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,
+				   struct page **ppage)
+{
+	struct dmirror_chunk *devmem;
+	struct resource *res;
+	unsigned long pfn;
+	unsigned long pfn_first;
+	unsigned long pfn_last;
+	void *ptr;
+
+	mutex_lock(&mdevice->devmem_lock);
+
+	if (mdevice->devmem_count == mdevice->devmem_capacity) {
+		struct dmirror_chunk **new_chunks;
+		unsigned int new_capacity;
+
+		new_capacity = mdevice->devmem_capacity +
+				DEVMEM_CHUNKS_RESERVE;
+		new_chunks = krealloc(mdevice->devmem_chunks,
+				sizeof(new_chunks[0]) * new_capacity,
+				GFP_KERNEL);
+		if (!new_chunks)
+			goto err;
+		mdevice->devmem_capacity = new_capacity;
+		mdevice->devmem_chunks = new_chunks;
+	}
+
+	res = request_free_mem_region(&iomem_resource, DEVMEM_CHUNK_SIZE,
+					"hmm_dmirror");
+	if (IS_ERR(res))
+		goto err;
+
+	devmem = kzalloc(sizeof(*devmem), GFP_KERNEL);
+	if (!devmem)
+		goto err;
+
+	devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
+	devmem->pagemap.res = *res;
+	devmem->pagemap.ops = &dmirror_devmem_ops;
+	devmem->pagemap.owner = mdevice;
+
+	ptr = memremap_pages(&devmem->pagemap, numa_node_id());
+	if (IS_ERR(ptr))
+		goto err_free;
+
+	devmem->mdevice = mdevice;
+	pfn_first = devmem->pagemap.res.start >> PAGE_SHIFT;
+	pfn_last = pfn_first +
+		(resource_size(&devmem->pagemap.res) >> PAGE_SHIFT);
+	mdevice->devmem_chunks[mdevice->devmem_count++] = devmem;
+
+	mutex_unlock(&mdevice->devmem_lock);
+
+	pr_info("added new %u MB chunk (total %u chunks, %u MB) PFNs [0x%lx 0x%lx)\n",
+		DEVMEM_CHUNK_SIZE / (1024 * 1024),
+		mdevice->devmem_count,
+		mdevice->devmem_count * (DEVMEM_CHUNK_SIZE / (1024 * 1024)),
+		pfn_first, pfn_last);
+
+	spin_lock(&mdevice->lock);
+	for (pfn = pfn_first; pfn < pfn_last; pfn++) {
+		struct page *page = pfn_to_page(pfn);
+
+		page->zone_device_data = mdevice->free_pages;
+		mdevice->free_pages = page;
+	}
+	if (ppage) {
+		*ppage = mdevice->free_pages;
+		mdevice->free_pages = (*ppage)->zone_device_data;
+		mdevice->calloc++;
+	}
+	spin_unlock(&mdevice->lock);
+
+	return true;
+
+err_free:
+	kfree(devmem);
+err:
+	mutex_unlock(&mdevice->devmem_lock);
+	return false;
+}
+
+static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
+{
+	struct page *dpage = NULL;
+	struct page *rpage;
+
+	/*
+	 * This is a fake device so we alloc real system memory to store
+	 * our device memory.
+	 */
+	rpage = alloc_page(GFP_HIGHUSER);
+	if (!rpage)
+		return NULL;
+
+	spin_lock(&mdevice->lock);
+
+	if (mdevice->free_pages) {
+		dpage = mdevice->free_pages;
+		mdevice->free_pages = dpage->zone_device_data;
+		mdevice->calloc++;
+		spin_unlock(&mdevice->lock);
+	} else {
+		spin_unlock(&mdevice->lock);
+		if (!dmirror_allocate_chunk(mdevice, &dpage))
+			goto error;
+	}
+
+	dpage->zone_device_data = rpage;
+	get_page(dpage);
+	lock_page(dpage);
+	return dpage;
+
+error:
+	__free_page(rpage);
+	return NULL;
+}
+
+static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
+					   struct dmirror *dmirror)
+{
+	struct dmirror_device *mdevice = dmirror->mdevice;
+	const unsigned long *src = args->src;
+	unsigned long *dst = args->dst;
+	unsigned long addr;
+
+	for (addr = args->start; addr < args->end; addr += PAGE_SIZE,
+						   src++, dst++) {
+		struct page *spage;
+		struct page *dpage;
+		struct page *rpage;
+
+		if (!(*src & MIGRATE_PFN_MIGRATE))
+			continue;
+
+		/*
+		 * Note that spage might be NULL which is OK since it is an
+		 * unallocated pte_none() or read-only zero page.
+		 */
+		spage = migrate_pfn_to_page(*src);
+
+		/*
+		 * Don't migrate device private pages from our own driver or
+		 * others. For our own we would do a device private memory copy
+		 * not a migration and for others, we would need to fault the
+		 * other device's page into system memory first.
+		 */
+		if (spage && is_zone_device_page(spage))
+			continue;
+
+		dpage = dmirror_devmem_alloc_page(mdevice);
+		if (!dpage)
+			continue;
+
+		rpage = dpage->zone_device_data;
+		if (spage)
+			copy_highpage(rpage, spage);
+		else
+			clear_highpage(rpage);
+
+		/*
+		 * Normally, a device would use the page->zone_device_data to
+		 * point to the mirror but here we use it to hold the page for
+		 * the simulated device memory and that page holds the pointer
+		 * to the mirror.
+		 */
+		rpage->zone_device_data = dmirror;
+
+		*dst = migrate_pfn(page_to_pfn(dpage)) |
+			    MIGRATE_PFN_LOCKED;
+		if ((*src & MIGRATE_PFN_WRITE) ||
+		    (!spage && args->vma->vm_flags & VM_WRITE))
+			*dst |= MIGRATE_PFN_WRITE;
+	}
+}
+
+static int dmirror_migrate_finalize_and_map(struct migrate_vma *args,
+					    struct dmirror *dmirror)
+{
+	unsigned long start = args->start;
+	unsigned long end = args->end;
+	const unsigned long *src = args->src;
+	const unsigned long *dst = args->dst;
+	unsigned long pfn;
+
+	/* Map the migrated pages into the device's page tables. */
+	mutex_lock(&dmirror->mutex);
+
+	for (pfn = start >> PAGE_SHIFT; pfn < (end >> PAGE_SHIFT); pfn++,
+								src++, dst++) {
+		struct page *dpage;
+		void *entry;
+
+		if (!(*src & MIGRATE_PFN_MIGRATE))
+			continue;
+
+		dpage = migrate_pfn_to_page(*dst);
+		if (!dpage)
+			continue;
+
+		/*
+		 * Store the page that holds the data so the page table
+		 * doesn't have to deal with ZONE_DEVICE private pages.
+		 */
+		entry = dpage->zone_device_data;
+		if (*dst & MIGRATE_PFN_WRITE)
+			entry = xa_tag_pointer(entry, DPT_XA_TAG_WRITE);
+		entry = xa_store(&dmirror->pt, pfn, entry, GFP_ATOMIC);
+		if (xa_is_err(entry))
+			return xa_err(entry);
+	}
+
+	mutex_unlock(&dmirror->mutex);
+	return 0;
+}
+
+static int dmirror_migrate(struct dmirror *dmirror,
+			   struct hmm_dmirror_cmd *cmd)
+{
+	unsigned long start, end, addr;
+	unsigned long size = cmd->npages << PAGE_SHIFT;
+	struct mm_struct *mm = dmirror->mm;
+	struct vm_area_struct *vma;
+	unsigned long src_pfns[64];
+	unsigned long dst_pfns[64];
+	struct dmirror_bounce bounce;
+	struct migrate_vma args;
+	unsigned long next;
+	int ret;
+
+	start = cmd->addr;
+	end = start + size;
+	if (end < start)
+		return -EINVAL;
+
+	/* Since the mm is for the mirrored process, get a reference first. */
+	if (!mmget_not_zero(mm))
+		return -EINVAL;
+
+	down_read(&mm->mmap_sem);
+	for (addr = start; addr < end; addr = next) {
+		vma = find_vma(mm, addr);
+		if (!vma || addr < vma->vm_start) {
+			ret = -EINVAL;
+			goto out;
+		}
+		next = min(end, addr + (ARRAY_SIZE(src_pfns) << PAGE_SHIFT));
+		if (next > vma->vm_end)
+			next = vma->vm_end;
+
+		args.vma = vma;
+		args.src = src_pfns;
+		args.dst = dst_pfns;
+		args.start = addr;
+		args.end = next;
+		args.src_owner = NULL;
+		ret = migrate_vma_setup(&args);
+		if (ret)
+			goto out;
+
+		dmirror_migrate_alloc_and_copy(&args, dmirror);
+		migrate_vma_pages(&args);
+		dmirror_migrate_finalize_and_map(&args, dmirror);
+		migrate_vma_finalize(&args);
+	}
+	up_read(&mm->mmap_sem);
+	mmput(mm);
+
+	/* Return the migrated data for verification. */
+	ret = dmirror_bounce_init(&bounce, start, size);
+	if (ret)
+		return ret;
+	mutex_lock(&dmirror->mutex);
+	ret = dmirror_do_read(dmirror, start, end, &bounce);
+	mutex_unlock(&dmirror->mutex);
+	if (ret == 0)
+		ret = copy_to_user((void __user *)cmd->ptr, bounce.ptr,
+					bounce.size);
+	cmd->cpages = bounce.cpages;
+	dmirror_bounce_fini(&bounce);
+	return ret;
+
+out:
+	up_read(&mm->mmap_sem);
+	mmput(mm);
+	return ret;
+}
+
+static void dmirror_mkentry(struct dmirror *dmirror, struct hmm_range *range,
+			    unsigned char *perm, uint64_t entry)
+{
+	struct page *page;
+
+	if (entry == range->values[HMM_PFN_ERROR]) {
+		*perm = HMM_DMIRROR_PROT_ERROR;
+		return;
+	}
+	page = hmm_device_entry_to_page(range, entry);
+	if (!page) {
+		*perm = HMM_DMIRROR_PROT_NONE;
+		return;
+	}
+	if (is_device_private_page(page)) {
+		/* Is the page migrated to this device or some other? */
+		if (dmirror->mdevice == dmirror_page_to_device(page))
+			*perm = HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL;
+		else
+			*perm = HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE;
+	} else if (is_zero_pfn(page_to_pfn(page)))
+		*perm = HMM_DMIRROR_PROT_ZERO;
+	else
+		*perm = HMM_DMIRROR_PROT_NONE;
+	if (entry & range->flags[HMM_PFN_WRITE])
+		*perm |= HMM_DMIRROR_PROT_WRITE;
+	else
+		*perm |= HMM_DMIRROR_PROT_READ;
+}
+
+static bool dmirror_snapshot_invalidate(struct mmu_interval_notifier *mni,
+				const struct mmu_notifier_range *range,
+				unsigned long cur_seq)
+{
+	struct dmirror_interval *dmi =
+		container_of(mni, struct dmirror_interval, notifier);
+	struct dmirror *dmirror = dmi->dmirror;
+
+	if (mmu_notifier_range_blockable(range))
+		mutex_lock(&dmirror->mutex);
+	else if (!mutex_trylock(&dmirror->mutex))
+		return false;
+
+	/*
+	 * Snapshots only need to set the sequence number since any
+	 * invalidation in the interval invalidates the whole snapshot.
+	 */
+	mmu_interval_set_seq(mni, cur_seq);
+
+	mutex_unlock(&dmirror->mutex);
+	return true;
+}
+
+static const struct mmu_interval_notifier_ops dmirror_mrn_ops = {
+	.invalidate = dmirror_snapshot_invalidate,
+};
+
+static int dmirror_range_snapshot(struct dmirror *dmirror,
+				  struct hmm_range *range,
+				  unsigned char *perm)
+{
+	struct mm_struct *mm = dmirror->mm;
+	struct dmirror_interval notifier;
+	unsigned long timeout =
+		jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
+	unsigned long i;
+	unsigned long n;
+	int ret = 0;
+
+	notifier.dmirror = dmirror;
+	range->notifier = &notifier.notifier;
+
+	ret = mmu_interval_notifier_insert(range->notifier, mm,
+			range->start, range->end - range->start,
+			&dmirror_mrn_ops);
+	if (ret)
+		return ret;
+
+	while (true) {
+		long count;
+
+		if (time_after(jiffies, timeout)) {
+			ret = -EBUSY;
+			goto out;
+		}
+
+		range->notifier_seq = mmu_interval_read_begin(range->notifier);
+
+		down_read(&mm->mmap_sem);
+		count = hmm_range_fault(range, HMM_FAULT_SNAPSHOT);
+		up_read(&mm->mmap_sem);
+		if (count <= 0) {
+			if (count == 0 || count == -EBUSY)
+				continue;
+			ret = count;
+			goto out;
+		}
+
+		mutex_lock(&dmirror->mutex);
+		if (mmu_interval_read_retry(range->notifier,
+					    range->notifier_seq)) {
+			mutex_unlock(&dmirror->mutex);
+			continue;
+		}
+		break;
+	}
+
+	n = (range->end - range->start) >> PAGE_SHIFT;
+	for (i = 0; i < n; i++)
+		dmirror_mkentry(dmirror, range, perm + i, range->pfns[i]);
+
+	mutex_unlock(&dmirror->mutex);
+out:
+	mmu_interval_notifier_remove(range->notifier);
+	return ret;
+}
+
+static int dmirror_snapshot(struct dmirror *dmirror,
+			    struct hmm_dmirror_cmd *cmd)
+{
+	struct mm_struct *mm = dmirror->mm;
+	unsigned long start, end;
+	unsigned long size = cmd->npages << PAGE_SHIFT;
+	unsigned long addr;
+	unsigned long next;
+	uint64_t pfns[64];
+	unsigned char perm[64];
+	char __user *uptr;
+	struct hmm_range range = {
+		.pfns = pfns,
+		.flags = dmirror_hmm_flags,
+		.values = dmirror_hmm_values,
+		.pfn_shift = DPT_SHIFT,
+		.pfn_flags_mask = ~0ULL,
+		.dev_private_owner = dmirror->mdevice,
+	};
+	int ret = 0;
+
+	start = cmd->addr;
+	end = start + size;
+	if (end < start)
+		return -EINVAL;
+
+	/* Since the mm is for the mirrored process, get a reference first. */
+	if (!mmget_not_zero(mm))
+		return -EINVAL;
+
+	/*
+	 * Register a temporary notifier to detect invalidations even if it
+	 * overlaps with other mmu_interval_notifiers.
+	 */
+	uptr = (void __user *)cmd->ptr;
+	for (addr = start; addr < end; addr = next) {
+		unsigned long n;
+
+		next = min(addr + (ARRAY_SIZE(pfns) << PAGE_SHIFT), end);
+		range.start = addr;
+		range.end = next;
+
+		ret = dmirror_range_snapshot(dmirror, &range, perm);
+		if (ret)
+			break;
+
+		n = (range.end - range.start) >> PAGE_SHIFT;
+		ret = copy_to_user(uptr, perm, n);
+		if (ret)
+			break;
+
+		cmd->cpages += n;
+		uptr += n;
+	}
+	mmput(mm);
+
+	return ret;
+}
+
+static long dmirror_fops_unlocked_ioctl(struct file *filp,
+					unsigned int command,
+					unsigned long arg)
+{
+	void __user *uarg = (void __user *)arg;
+	struct hmm_dmirror_cmd cmd;
+	struct dmirror *dmirror;
+	int ret;
+
+	dmirror = filp->private_data;
+	if (!dmirror)
+		return -EINVAL;
+
+	ret = copy_from_user(&cmd, uarg, sizeof(cmd));
+	if (ret)
+		return ret;
+
+	if (cmd.addr & ~PAGE_MASK)
+		return -EINVAL;
+	if (cmd.addr >= (cmd.addr + (cmd.npages << PAGE_SHIFT)))
+		return -EINVAL;
+
+	cmd.cpages = 0;
+	cmd.faults = 0;
+
+	switch (command) {
+	case HMM_DMIRROR_READ:
+		ret = dmirror_read(dmirror, &cmd);
+		break;
+
+	case HMM_DMIRROR_WRITE:
+		ret = dmirror_write(dmirror, &cmd);
+		break;
+
+	case HMM_DMIRROR_MIGRATE:
+		ret = dmirror_migrate(dmirror, &cmd);
+		break;
+
+	case HMM_DMIRROR_SNAPSHOT:
+		ret = dmirror_snapshot(dmirror, &cmd);
+		break;
+
+	default:
+		return -EINVAL;
+	}
+	if (ret)
+		return ret;
+
+	return copy_to_user(uarg, &cmd, sizeof(cmd));
+}
+
+static const struct file_operations dmirror_fops = {
+	.open		= dmirror_fops_open,
+	.release	= dmirror_fops_release,
+	.unlocked_ioctl = dmirror_fops_unlocked_ioctl,
+	.llseek		= default_llseek,
+	.owner		= THIS_MODULE,
+};
+
+static void dmirror_devmem_free(struct page *page)
+{
+	struct page *rpage = page->zone_device_data;
+	struct dmirror_device *mdevice;
+
+	if (rpage)
+		__free_page(rpage);
+
+	mdevice = dmirror_page_to_device(page);
+
+	spin_lock(&mdevice->lock);
+	mdevice->cfree++;
+	page->zone_device_data = mdevice->free_pages;
+	mdevice->free_pages = page;
+	spin_unlock(&mdevice->lock);
+}
+
+static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma *args,
+						struct dmirror_device *mdevice)
+{
+	struct vm_area_struct *vma = args->vma;
+	const unsigned long *src = args->src;
+	unsigned long *dst = args->dst;
+	unsigned long start = args->start;
+	unsigned long end = args->end;
+	unsigned long addr;
+
+	for (addr = start; addr < end; addr += PAGE_SIZE,
+				       src++, dst++) {
+		struct page *dpage, *spage;
+
+		spage = migrate_pfn_to_page(*src);
+		if (!spage || !(*src & MIGRATE_PFN_MIGRATE))
+			continue;
+		if (!dmirror_device_is_mine(mdevice, spage))
+			continue;
+		spage = spage->zone_device_data;
+
+		dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, addr);
+		if (!dpage)
+			continue;
+
+		lock_page(dpage);
+		copy_highpage(dpage, spage);
+		*dst = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
+		if (*src & MIGRATE_PFN_WRITE)
+			*dst |= MIGRATE_PFN_WRITE;
+	}
+	return 0;
+}
+
+static void dmirror_devmem_fault_finalize_and_map(struct migrate_vma *args,
+						  struct dmirror *dmirror)
+{
+	/* Invalidate the device's page table mapping. */
+	mutex_lock(&dmirror->mutex);
+	dmirror_do_update(dmirror, args->start, args->end);
+	mutex_unlock(&dmirror->mutex);
+}
+
+static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
+{
+	struct migrate_vma args;
+	unsigned long src_pfns;
+	unsigned long dst_pfns;
+	struct page *rpage;
+	struct dmirror *dmirror;
+	vm_fault_t ret;
+
+	/*
+	 * Normally, a device would use the page->zone_device_data to point to
+	 * the mirror but here we use it to hold the page for the simulated
+	 * device memory and that page holds the pointer to the mirror.
+	 */
+	rpage = vmf->page->zone_device_data;
+	dmirror = rpage->zone_device_data;
+
+	/* FIXME demonstrate how we can adjust migrate range */
+	args.vma = vmf->vma;
+	args.start = vmf->address;
+	args.end = args.start + PAGE_SIZE;
+	args.src = &src_pfns;
+	args.dst = &dst_pfns;
+	args.src_owner = dmirror->mdevice;
+
+	if (migrate_vma_setup(&args))
+		return VM_FAULT_SIGBUS;
+
+	ret = dmirror_devmem_fault_alloc_and_copy(&args, dmirror->mdevice);
+	if (ret)
+		return ret;
+	migrate_vma_pages(&args);
+	dmirror_devmem_fault_finalize_and_map(&args, dmirror);
+	migrate_vma_finalize(&args);
+	return 0;
+}
+
+static const struct dev_pagemap_ops dmirror_devmem_ops = {
+	.page_free	= dmirror_devmem_free,
+	.migrate_to_ram	= dmirror_devmem_fault,
+};
+
+static int dmirror_device_init(struct dmirror_device *mdevice, int id)
+{
+	dev_t dev;
+	int ret;
+
+	dev = MKDEV(MAJOR(dmirror_dev), id);
+	mutex_init(&mdevice->devmem_lock);
+	spin_lock_init(&mdevice->lock);
+
+	cdev_init(&mdevice->cdevice, &dmirror_fops);
+	ret = cdev_add(&mdevice->cdevice, dev, 1);
+	if (ret)
+		return ret;
+
+	/* Build a list of free ZONE_DEVICE private struct pages */
+	dmirror_allocate_chunk(mdevice, NULL);
+
+	return 0;
+}
+
+static void dmirror_device_remove(struct dmirror_device *mdevice)
+{
+	unsigned int i;
+
+	if (mdevice->devmem_chunks) {
+		for (i = 0; i < mdevice->devmem_count; i++) {
+			struct dmirror_chunk *devmem =
+				mdevice->devmem_chunks[i];
+
+			memunmap_pages(&devmem->pagemap);
+			kfree(devmem);
+		}
+		kfree(mdevice->devmem_chunks);
+	}
+
+	cdev_del(&mdevice->cdevice);
+}
+
+static int __init hmm_dmirror_init(void)
+{
+	int ret;
+	int id;
+
+	ret = alloc_chrdev_region(&dmirror_dev, 0, DMIRROR_NDEVICES,
+				  "HMM_DMIRROR");
+	if (ret)
+		goto err_unreg;
+
+	for (id = 0; id < DMIRROR_NDEVICES; id++) {
+		ret = dmirror_device_init(dmirror_devices + id, id);
+		if (ret)
+			goto err_chrdev;
+	}
+
+	/*
+	 * Allocate a zero page to simulate a reserved page of device private
+	 * memory which is always zero. The kernel's zero_pfn page isn't reused,
+	 * simply to keep the code here simpler (i.e., we need our own struct page).
+	 */
+	dmirror_zero_page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
+	if (!dmirror_zero_page)
+		goto err_chrdev;
+
+	pr_info("HMM test module loaded. This is only for testing HMM.\n");
+	return 0;
+
+err_chrdev:
+	while (--id >= 0)
+		dmirror_device_remove(dmirror_devices + id);
+	unregister_chrdev_region(dmirror_dev, DMIRROR_NDEVICES);
+err_unreg:
+	return ret;
+}
+
+static void __exit hmm_dmirror_exit(void)
+{
+	int id;
+
+	if (dmirror_zero_page)
+		__free_page(dmirror_zero_page);
+	for (id = 0; id < DMIRROR_NDEVICES; id++)
+		dmirror_device_remove(dmirror_devices + id);
+	unregister_chrdev_region(dmirror_dev, DMIRROR_NDEVICES);
+}
+
+module_init(hmm_dmirror_init);
+module_exit(hmm_dmirror_exit);
+MODULE_LICENSE("GPL");
diff --git a/tools/testing/selftests/vm/.gitignore b/tools/testing/selftests/vm/.gitignore
index 31b3c98b6d34..3054565b3f07 100644
--- a/tools/testing/selftests/vm/.gitignore
+++ b/tools/testing/selftests/vm/.gitignore
@@ -14,3 +14,4 @@ virtual_address_range
 gup_benchmark
 va_128TBswitch
 map_fixed_noreplace
+hmm-tests
diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index 7f9a8a8c31da..3fadab99d991 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -7,6 +7,7 @@ CFLAGS = -Wall -I ../../../../usr/include $(EXTRA_CFLAGS)
 LDLIBS = -lrt
 TEST_GEN_FILES = compaction_test
 TEST_GEN_FILES += gup_benchmark
+TEST_GEN_FILES += hmm-tests
 TEST_GEN_FILES += hugepage-mmap
 TEST_GEN_FILES += hugepage-shm
 TEST_GEN_FILES += map_hugetlb
@@ -31,6 +32,8 @@ TEST_FILES := test_vmalloc.sh
 KSFT_KHDR_INSTALL := 1
 include ../lib.mk
 
+$(OUTPUT)/hmm-tests: LDLIBS += -lhugetlbfs -lpthread
+
 $(OUTPUT)/userfaultfd: LDLIBS += -lpthread
 
 $(OUTPUT)/mlock-random-test: LDLIBS += -lcap
diff --git a/tools/testing/selftests/vm/config b/tools/testing/selftests/vm/config
index 93b90a9b1eeb..f6d0adad739f 100644
--- a/tools/testing/selftests/vm/config
+++ b/tools/testing/selftests/vm/config
@@ -1,3 +1,5 @@
 CONFIG_SYSVIPC=y
 CONFIG_USERFAULTFD=y
 CONFIG_TEST_VMALLOC=m
+CONFIG_HMM_MIRROR=y
+CONFIG_DEVICE_PRIVATE=y
diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c
new file mode 100644
index 000000000000..033a12c7ab5b
--- /dev/null
+++ b/tools/testing/selftests/vm/hmm-tests.c
@@ -0,0 +1,1353 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * HMM stands for Heterogeneous Memory Management; it is a helper layer inside
+ * the Linux kernel to help device drivers mirror a process address space in
+ * the device. This allows the device to use the same address space, which
+ * makes communication and data exchange a lot easier.
+ *
+ * This framework's sole purpose is to exercise various code paths inside
+ * the kernel to make sure that HMM performs as expected and to flush out any
+ * bugs.
+ */
+
+#include "../kselftest_harness.h"
+
+#include <errno.h>
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <unistd.h>
+#include <strings.h>
+#include <time.h>
+#include <pthread.h>
+#include <hugetlbfs.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/mman.h>
+#include <sys/ioctl.h>
+#include <linux/test_hmm.h>
+
+struct hmm_buffer {
+	void		*ptr;
+	void		*mirror;
+	unsigned long	size;
+	int		fd;
+	uint64_t	cpages;
+	uint64_t	faults;
+};
+
+#define TWOMEG		(1 << 21)
+#define HMM_BUFFER_SIZE (1024 << 12)
+#define HMM_PATH_MAX    64
+#define NTIMES		256
+
+#define ALIGN(x, a) (((x) + (a - 1)) & (~((a) - 1)))
+
+FIXTURE(hmm)
+{
+	int		fd;
+	unsigned int	page_size;
+	unsigned int	page_shift;
+};
+
+FIXTURE(hmm2)
+{
+	int		fd0;
+	int		fd1;
+	unsigned int	page_size;
+	unsigned int	page_shift;
+};
+
+static int hmm_open(int unit)
+{
+	char pathname[HMM_PATH_MAX];
+	int fd;
+
+	snprintf(pathname, sizeof(pathname), "/dev/hmm_dmirror%d", unit);
+	fd = open(pathname, O_RDWR, 0);
+	if (fd < 0)
+		fprintf(stderr, "could not open hmm dmirror driver (%s)\n",
+			pathname);
+	return fd;
+}
+
+FIXTURE_SETUP(hmm)
+{
+	self->page_size = sysconf(_SC_PAGE_SIZE);
+	self->page_shift = ffs(self->page_size) - 1;
+
+	self->fd = hmm_open(0);
+	ASSERT_GE(self->fd, 0);
+}
+
+FIXTURE_SETUP(hmm2)
+{
+	self->page_size = sysconf(_SC_PAGE_SIZE);
+	self->page_shift = ffs(self->page_size) - 1;
+
+	self->fd0 = hmm_open(0);
+	ASSERT_GE(self->fd0, 0);
+	self->fd1 = hmm_open(1);
+	ASSERT_GE(self->fd1, 0);
+}
+
+FIXTURE_TEARDOWN(hmm)
+{
+	int ret = close(self->fd);
+
+	ASSERT_EQ(ret, 0);
+	self->fd = -1;
+}
+
+FIXTURE_TEARDOWN(hmm2)
+{
+	int ret = close(self->fd0);
+
+	ASSERT_EQ(ret, 0);
+	self->fd0 = -1;
+
+	ret = close(self->fd1);
+	ASSERT_EQ(ret, 0);
+	self->fd1 = -1;
+}
+
+static int hmm_dmirror_cmd(int fd,
+			   unsigned long request,
+			   struct hmm_buffer *buffer,
+			   unsigned long npages)
+{
+	struct hmm_dmirror_cmd cmd;
+	int ret;
+
+	/* Simulate a device reading system memory. */
+	cmd.addr = (__u64)buffer->ptr;
+	cmd.ptr = (__u64)buffer->mirror;
+	cmd.npages = npages;
+
+	for (;;) {
+		ret = ioctl(fd, request, &cmd);
+		if (ret == 0)
+			break;
+		if (errno == EINTR)
+			continue;
+		return -errno;
+	}
+	buffer->cpages = cmd.cpages;
+	buffer->faults = cmd.faults;
+
+	return 0;
+}
+
+static void hmm_buffer_free(struct hmm_buffer *buffer)
+{
+	if (buffer == NULL)
+		return;
+
+	if (buffer->ptr)
+		munmap(buffer->ptr, buffer->size);
+	free(buffer->mirror);
+	free(buffer);
+}
+
+/*
+ * Create a temporary file that will be deleted on close.
+ */
+static int hmm_create_file(unsigned long size)
+{
+	char path[HMM_PATH_MAX];
+	int fd;
+
+	strcpy(path, "/tmp");
+	fd = open(path, O_TMPFILE | O_EXCL | O_RDWR, 0600);
+	if (fd >= 0) {
+		int r;
+
+		do {
+			r = ftruncate(fd, size);
+		} while (r == -1 && errno == EINTR);
+		if (!r)
+			return fd;
+		close(fd);
+	}
+	return -1;
+}
+
+/*
+ * Return a random unsigned number.
+ */
+static unsigned int hmm_random(void)
+{
+	static int fd = -1;
+	unsigned int r;
+
+	if (fd < 0) {
+		fd = open("/dev/urandom", O_RDONLY);
+		if (fd < 0) {
+			fprintf(stderr, "%s:%d failed to open /dev/urandom\n",
+					__FILE__, __LINE__);
+			return ~0U;
+		}
+	}
+	read(fd, &r, sizeof(r));
+	return r;
+}
+
+static void hmm_nanosleep(unsigned int n)
+{
+	struct timespec t;
+
+	t.tv_sec = 0;
+	t.tv_nsec = n;
+	nanosleep(&t, NULL);
+}
+
+/*
+ * Simple NULL test of device open/close.
+ */
+TEST_F(hmm, open_close)
+{
+}
+
+/*
+ * Read private anonymous memory.
+ */
+TEST_F(hmm, anon_read)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	int ret;
+	int val;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/*
+	 * Initialize buffer in system memory but leave the first two pages
+	 * zero (pte_none and pfn_zero).
+	 */
+	i = 2 * self->page_size / sizeof(*ptr);
+	for (ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Set buffer permission to read-only. */
+	ret = mprotect(buffer->ptr, size, PROT_READ);
+	ASSERT_EQ(ret, 0);
+
+	/* Populate the CPU page table with a special zero page. */
+	val = *(int *)(buffer->ptr + self->page_size);
+	ASSERT_EQ(val, 0);
+
+	/* Simulate a device reading system memory. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_READ, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+	ASSERT_EQ(buffer->faults, 1);
+
+	/* Check what the device read. */
+	ptr = buffer->mirror;
+	for (i = 0; i < 2 * self->page_size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], 0);
+	for (; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Read private anonymous memory which has been protected with
+ * mprotect() PROT_NONE.
+ */
+TEST_F(hmm, anon_read_prot)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	int ret;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/* Initialize buffer in system memory. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Initialize mirror buffer so we can verify it isn't written. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ptr[i] = -i;
+
+	/* Protect buffer from reading. */
+	ret = mprotect(buffer->ptr, size, PROT_NONE);
+	ASSERT_EQ(ret, 0);
+
+	/* Simulate a device reading system memory. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_READ, buffer, npages);
+	ASSERT_EQ(ret, -EFAULT);
+
+	/* Allow CPU to read the buffer so we can check it. */
+	ret = mprotect(buffer->ptr, size, PROT_READ);
+	ASSERT_EQ(ret, 0);
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], -i);
+
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Write private anonymous memory.
+ */
+TEST_F(hmm, anon_write)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	int ret;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/* Initialize data that the device will write to buffer->ptr. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Simulate a device writing system memory. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_WRITE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+	ASSERT_EQ(buffer->faults, 1);
+
+	/* Check what the device wrote. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Write private anonymous memory which has been protected with
+ * mprotect() PROT_READ.
+ */
+TEST_F(hmm, anon_write_prot)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	int ret;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/* Simulate a device reading a zero page of memory. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_READ, buffer, 1);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, 1);
+	ASSERT_EQ(buffer->faults, 1);
+
+	/* Initialize data that the device will write to buffer->ptr. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Simulate a device writing system memory. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_WRITE, buffer, npages);
+	ASSERT_EQ(ret, -EPERM);
+
+	/* Check what the device wrote. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], 0);
+
+	/* Now allow writing and see that the zero page is replaced. */
+	ret = mprotect(buffer->ptr, size, PROT_WRITE | PROT_READ);
+	ASSERT_EQ(ret, 0);
+
+	/* Simulate a device writing system memory. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_WRITE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+	ASSERT_EQ(buffer->faults, 1);
+
+	/* Check what the device wrote. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Check that a device writing an anonymous private mapping
+ * will copy-on-write if a child process inherits the mapping.
+ */
+TEST_F(hmm, anon_write_child)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	pid_t pid;
+	int child_fd;
+	int ret;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/* Initialize buffer->ptr so we can tell if it is written. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Initialize data that the device will write to buffer->ptr. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ptr[i] = -i;
+
+	pid = fork();
+	if (pid == -1)
+		ASSERT_EQ(pid, 0);
+	if (pid != 0) {
+		waitpid(pid, &ret, 0);
+		ASSERT_EQ(WIFEXITED(ret), 1);
+
+		/* Check that the parent's buffer did not change. */
+		for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+			ASSERT_EQ(ptr[i], i);
+		return;
+	}
+
+	/* Check that we see the parent's values. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], -i);
+
+	/* The child process needs its own mirror to its own mm. */
+	child_fd = hmm_open(0);
+	ASSERT_GE(child_fd, 0);
+
+	/* Simulate a device writing system memory. */
+	ret = hmm_dmirror_cmd(child_fd, HMM_DMIRROR_WRITE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+	ASSERT_EQ(buffer->faults, 1);
+
+	/* Check what the device wrote. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], -i);
+
+	close(child_fd);
+	exit(0);
+}
+
+/*
+ * Check that a device writing an anonymous shared mapping
+ * will not copy-on-write if a child process inherits the mapping.
+ */
+TEST_F(hmm, anon_write_child_shared)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	pid_t pid;
+	int child_fd;
+	int ret;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_SHARED | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/* Initialize buffer->ptr so we can tell if it is written. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Initialize data that the device will write to buffer->ptr. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ptr[i] = -i;
+
+	pid = fork();
+	if (pid == -1)
+		ASSERT_EQ(pid, 0);
+	if (pid != 0) {
+		waitpid(pid, &ret, 0);
+		ASSERT_EQ(WIFEXITED(ret), 1);
+
+		/* Check that the parent's buffer did change. */
+		for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+			ASSERT_EQ(ptr[i], -i);
+		return;
+	}
+
+	/* Check that we see the parent's values. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], -i);
+
+	/* The child process needs its own mirror to its own mm. */
+	child_fd = hmm_open(0);
+	ASSERT_GE(child_fd, 0);
+
+	/* Simulate a device writing system memory. */
+	ret = hmm_dmirror_cmd(child_fd, HMM_DMIRROR_WRITE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+	ASSERT_EQ(buffer->faults, 1);
+
+	/* Check what the device wrote. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], -i);
+
+	close(child_fd);
+	exit(0);
+}
+
+/*
+ * Write private anonymous huge page.
+ */
+TEST_F(hmm, anon_write_huge)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	void *old_ptr;
+	void *map;
+	int *ptr;
+	int ret;
+
+	size = 2 * TWOMEG;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	size = TWOMEG;
+	npages = size >> self->page_shift;
+	map = (void *)ALIGN((uintptr_t)buffer->ptr, size);
+	ret = madvise(map, size, MADV_HUGEPAGE);
+	ASSERT_EQ(ret, 0);
+	old_ptr = buffer->ptr;
+	buffer->ptr = map;
+
+	/* Initialize data that the device will write to buffer->ptr. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Simulate a device writing system memory. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_WRITE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+	ASSERT_EQ(buffer->faults, 1);
+
+	/* Check what the device wrote. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	buffer->ptr = old_ptr;
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Write huge TLBFS page.
+ */
+TEST_F(hmm, anon_write_hugetlbfs)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	int ret;
+	long pagesizes[4];
+	int n, idx;
+
+	/* Skip test if we can't allocate a hugetlbfs page. */
+
+	n = gethugepagesizes(pagesizes, 4);
+	if (n <= 0)
+		return;
+	for (idx = 0; --n > 0; ) {
+		if (pagesizes[n] < pagesizes[idx])
+			idx = n;
+	}
+	size = ALIGN(TWOMEG, pagesizes[idx]);
+	npages = size >> self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->ptr = get_hugepage_region(size, GHR_STRICT);
+	if (buffer->ptr == NULL) {
+		free(buffer);
+		return;
+	}
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	/* Initialize data that the device will write to buffer->ptr. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Simulate a device writing system memory. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_WRITE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+	ASSERT_EQ(buffer->faults, 1);
+
+	/* Check what the device wrote. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	free_hugepage_region(buffer->ptr);
+	buffer->ptr = NULL;
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Read mmap'ed file memory.
+ */
+TEST_F(hmm, file_read)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	int ret;
+	int fd;
+	ssize_t len;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	fd = hmm_create_file(size);
+	ASSERT_GE(fd, 0);
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = fd;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	/* Write initial contents of the file. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+	len = pwrite(fd, buffer->mirror, size, 0);
+	ASSERT_EQ(len, size);
+	memset(buffer->mirror, 0, size);
+
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ,
+			   MAP_SHARED,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/* Simulate a device reading system memory. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_READ, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+	ASSERT_EQ(buffer->faults, 1);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Write mmap'ed file memory.
+ */
+TEST_F(hmm, file_write)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	int ret;
+	int fd;
+	ssize_t len;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	fd = hmm_create_file(size);
+	ASSERT_GE(fd, 0);
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = fd;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_SHARED,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/* Initialize data that the device will write to buffer->ptr. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Simulate a device writing system memory. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_WRITE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+	ASSERT_EQ(buffer->faults, 1);
+
+	/* Check what the device wrote. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	/* Check that the device also wrote the file. */
+	len = pread(fd, buffer->mirror, size, 0);
+	ASSERT_EQ(len, size);
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Migrate anonymous memory to device private memory.
+ */
+TEST_F(hmm, migrate)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	int ret;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/* Initialize buffer in system memory. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Migrate memory to device. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Migrate anonymous memory to device private memory and fault it back to system
+ * memory.
+ */
+TEST_F(hmm, migrate_fault)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	int ret;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/* Initialize buffer in system memory. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Migrate memory to device. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	/* Fault pages back to system memory and check them. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Try to migrate various memory types to device private memory.
+ */
+TEST_F(hmm2, migrate_mixed)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	int *ptr;
+	unsigned char *p;
+	int ret;
+	int val;
+
+	npages = 6;
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	/* Reserve a range of addresses. */
+	buffer->ptr = mmap(NULL, size,
+			   PROT_NONE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+	p = buffer->ptr;
+
+	/* Now try to migrate everything to device 1. */
+	ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, 6);
+
+	/* Punch a hole after the first page address. */
+	ret = munmap(buffer->ptr + self->page_size, self->page_size);
+	ASSERT_EQ(ret, 0);
+
+	/* We expect an error if the vma doesn't cover the range. */
+	ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, 3);
+	ASSERT_EQ(ret, -EINVAL);
+
+	/* Page 2 will be a read-only zero page. */
+	ret = mprotect(buffer->ptr + 2 * self->page_size, self->page_size,
+				PROT_READ);
+	ASSERT_EQ(ret, 0);
+	ptr = (int *)(buffer->ptr + 2 * self->page_size);
+	val = *ptr + 3;
+	ASSERT_EQ(val, 3);
+
+	/* Page 3 will be read-only. */
+	ret = mprotect(buffer->ptr + 3 * self->page_size, self->page_size,
+				PROT_READ | PROT_WRITE);
+	ASSERT_EQ(ret, 0);
+	ptr = (int *)(buffer->ptr + 3 * self->page_size);
+	*ptr = val;
+	ret = mprotect(buffer->ptr + 3 * self->page_size, self->page_size,
+				PROT_READ);
+	ASSERT_EQ(ret, 0);
+
+	/* Page 4 will be read-write. */
+	ret = mprotect(buffer->ptr + 4 * self->page_size, self->page_size,
+				PROT_READ | PROT_WRITE);
+	ASSERT_EQ(ret, 0);
+	ptr = (int *)(buffer->ptr + 4 * self->page_size);
+	*ptr = val;
+
+	/* Page 5 won't be migrated to device 0 because it's on device 1. */
+	buffer->ptr = p + 5 * self->page_size;
+	ret = hmm_dmirror_cmd(self->fd0, HMM_DMIRROR_MIGRATE, buffer, 1);
+	ASSERT_EQ(ret, -ENOENT);
+	buffer->ptr = p;
+
+	/* Now try to migrate pages 2-3 to device 1. */
+	buffer->ptr = p + 2 * self->page_size;
+	ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, 2);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, 2);
+	buffer->ptr = p;
+
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Migrate anonymous memory to device private memory and fault it back to system
+ * memory multiple times.
+ */
+TEST_F(hmm, migrate_multiple)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	unsigned long c;
+	int *ptr;
+	int ret;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	for (c = 0; c < NTIMES; c++) {
+		buffer = malloc(sizeof(*buffer));
+		ASSERT_NE(buffer, NULL);
+
+		buffer->fd = -1;
+		buffer->size = size;
+		buffer->mirror = malloc(size);
+		ASSERT_NE(buffer->mirror, NULL);
+
+		buffer->ptr = mmap(NULL, size,
+				   PROT_READ | PROT_WRITE,
+				   MAP_PRIVATE | MAP_ANONYMOUS,
+				   buffer->fd, 0);
+		ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+		/* Initialize buffer in system memory. */
+		for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+			ptr[i] = i;
+
+		/* Migrate memory to device. */
+		ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer,
+				      npages);
+		ASSERT_EQ(ret, 0);
+		ASSERT_EQ(buffer->cpages, npages);
+
+		/* Check what the device read. */
+		for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+			ASSERT_EQ(ptr[i], i);
+
+		/* Fault pages back to system memory and check them. */
+		for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+			ASSERT_EQ(ptr[i], i);
+
+		hmm_buffer_free(buffer);
+	}
+}
+
+/*
+ * Read anonymous memory multiple times.
+ */
+TEST_F(hmm, anon_read_multiple)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	unsigned long c;
+	int *ptr;
+	int ret;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	for (c = 0; c < NTIMES; c++) {
+		buffer = malloc(sizeof(*buffer));
+		ASSERT_NE(buffer, NULL);
+
+		buffer->fd = -1;
+		buffer->size = size;
+		buffer->mirror = malloc(size);
+		ASSERT_NE(buffer->mirror, NULL);
+
+		buffer->ptr = mmap(NULL, size,
+				   PROT_READ | PROT_WRITE,
+				   MAP_PRIVATE | MAP_ANONYMOUS,
+				   buffer->fd, 0);
+		ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+		/* Initialize buffer in system memory. */
+		for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+			ptr[i] = i + c;
+
+		/* Simulate a device reading system memory. */
+		ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_READ, buffer,
+				      npages);
+		ASSERT_EQ(ret, 0);
+		ASSERT_EQ(buffer->cpages, npages);
+		ASSERT_EQ(buffer->faults, 1);
+
+		/* Check what the device read. */
+		for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+			ASSERT_EQ(ptr[i], i + c);
+
+		hmm_buffer_free(buffer);
+	}
+}
+
+void *unmap_buffer(void *p)
+{
+	struct hmm_buffer *buffer = p;
+
+	/* Delay for a bit and then unmap buffer while it is being read. */
+	hmm_nanosleep(hmm_random() % 32000);
+	munmap(buffer->ptr + buffer->size / 2, buffer->size / 2);
+	buffer->ptr = NULL;
+
+	return NULL;
+}
+
+/*
+ * Try reading anonymous memory while it is being unmapped.
+ */
+TEST_F(hmm, anon_teardown)
+{
+	unsigned long npages;
+	unsigned long size;
+	unsigned long c;
+	void *ret;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	for (c = 0; c < NTIMES; ++c) {
+		pthread_t thread;
+		struct hmm_buffer *buffer;
+		unsigned long i;
+		int *ptr;
+		int rc;
+
+		buffer = malloc(sizeof(*buffer));
+		ASSERT_NE(buffer, NULL);
+
+		buffer->fd = -1;
+		buffer->size = size;
+		buffer->mirror = malloc(size);
+		ASSERT_NE(buffer->mirror, NULL);
+
+		buffer->ptr = mmap(NULL, size,
+				   PROT_READ | PROT_WRITE,
+				   MAP_PRIVATE | MAP_ANONYMOUS,
+				   buffer->fd, 0);
+		ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+		/* Initialize buffer in system memory. */
+		for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+			ptr[i] = i + c;
+
+		rc = pthread_create(&thread, NULL, unmap_buffer, buffer);
+		ASSERT_EQ(rc, 0);
+
+		/* Simulate a device reading system memory. */
+		rc = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_READ, buffer,
+				     npages);
+		if (rc == 0) {
+			ASSERT_EQ(buffer->cpages, npages);
+			ASSERT_EQ(buffer->faults, 1);
+
+			/* Check what the device read. */
+			for (i = 0, ptr = buffer->mirror;
+			     i < size / sizeof(*ptr);
+			     ++i)
+				ASSERT_EQ(ptr[i], i + c);
+		}
+
+		pthread_join(thread, &ret);
+		hmm_buffer_free(buffer);
+	}
+}
+
+/*
+ * Test memory snapshot without faulting in pages accessed by the device.
+ */
+TEST_F(hmm2, snapshot)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	int *ptr;
+	unsigned char *p;
+	unsigned char *m;
+	int ret;
+	int val;
+
+	npages = 7;
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(npages);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	/* Reserve a range of addresses. */
+	buffer->ptr = mmap(NULL, size,
+			   PROT_NONE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+	p = buffer->ptr;
+
+	/* Punch a hole after the first page address. */
+	ret = munmap(buffer->ptr + self->page_size, self->page_size);
+	ASSERT_EQ(ret, 0);
+
+	/* Page 2 will be a read-only zero page. */
+	ret = mprotect(buffer->ptr + 2 * self->page_size, self->page_size,
+				PROT_READ);
+	ASSERT_EQ(ret, 0);
+	ptr = (int *)(buffer->ptr + 2 * self->page_size);
+	val = *ptr + 3;
+	ASSERT_EQ(val, 3);
+
+	/* Page 3 will be read-only. */
+	ret = mprotect(buffer->ptr + 3 * self->page_size, self->page_size,
+				PROT_READ | PROT_WRITE);
+	ASSERT_EQ(ret, 0);
+	ptr = (int *)(buffer->ptr + 3 * self->page_size);
+	*ptr = val;
+	ret = mprotect(buffer->ptr + 3 * self->page_size, self->page_size,
+				PROT_READ);
+	ASSERT_EQ(ret, 0);
+
+	/* Pages 4-6 will be read-write. */
+	ret = mprotect(buffer->ptr + 4 * self->page_size, 3 * self->page_size,
+				PROT_READ | PROT_WRITE);
+	ASSERT_EQ(ret, 0);
+	ptr = (int *)(buffer->ptr + 4 * self->page_size);
+	*ptr = val;
+
+	/* Page 5 will be migrated to device 0. */
+	buffer->ptr = p + 5 * self->page_size;
+	ret = hmm_dmirror_cmd(self->fd0, HMM_DMIRROR_MIGRATE, buffer, 1);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, 1);
+
+	/* Page 6 will be migrated to device 1. */
+	buffer->ptr = p + 6 * self->page_size;
+	ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, 1);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, 1);
+
+	/* Simulate a device snapshotting CPU pagetables. */
+	buffer->ptr = p;
+	ret = hmm_dmirror_cmd(self->fd0, HMM_DMIRROR_SNAPSHOT, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/* Check what the device saw. */
+	m = buffer->mirror;
+	ASSERT_EQ(m[0], HMM_DMIRROR_PROT_ERROR);
+	ASSERT_EQ(m[1], HMM_DMIRROR_PROT_NONE);
+	ASSERT_EQ(m[2], HMM_DMIRROR_PROT_ZERO | HMM_DMIRROR_PROT_READ);
+	ASSERT_EQ(m[3], HMM_DMIRROR_PROT_READ);
+	ASSERT_EQ(m[4], HMM_DMIRROR_PROT_WRITE);
+	ASSERT_EQ(m[5], HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL |
+			HMM_DMIRROR_PROT_WRITE);
+	ASSERT_EQ(m[6], HMM_DMIRROR_PROT_NONE);
+
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Test two devices reading the same memory (double mapped).
+ */
+TEST_F(hmm2, double_map)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	int ret;
+
+	npages = 6;
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	/* Reserve a range of addresses. */
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/* Initialize buffer in system memory. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Make region read-only. */
+	ret = mprotect(buffer->ptr, size, PROT_READ);
+	ASSERT_EQ(ret, 0);
+
+	/* Simulate device 0 reading system memory. */
+	ret = hmm_dmirror_cmd(self->fd0, HMM_DMIRROR_READ, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+	ASSERT_EQ(buffer->faults, 1);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	/* Simulate device 1 reading system memory. */
+	ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_READ, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+	ASSERT_EQ(buffer->faults, 1);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	/* Punch a hole after the first page address. */
+	ret = munmap(buffer->ptr + self->page_size, self->page_size);
+	ASSERT_EQ(ret, 0);
+
+	hmm_buffer_free(buffer);
+}
+
+TEST_HARNESS_MAIN
diff --git a/tools/testing/selftests/vm/run_vmtests b/tools/testing/selftests/vm/run_vmtests
index f33714843198..5e82b2574d54 100755
--- a/tools/testing/selftests/vm/run_vmtests
+++ b/tools/testing/selftests/vm/run_vmtests
@@ -270,4 +270,20 @@ else
 	exitcode=1
 fi
 
+echo "------------------------------------"
+echo "running HMM smoke test"
+echo "------------------------------------"
+./test_hmm.sh smoke
+ret_val=$?
+
+if [ $ret_val -eq 0 ]; then
+	echo "[PASS]"
+elif [ $ret_val -eq $ksft_skip ]; then
+	echo "[SKIP]"
+	exitcode=$ksft_skip
+else
+	echo "[FAIL]"
+	exitcode=1
+fi
+
 exit $exitcode
diff --git a/tools/testing/selftests/vm/test_hmm.sh b/tools/testing/selftests/vm/test_hmm.sh
new file mode 100755
index 000000000000..461e4a99a362
--- /dev/null
+++ b/tools/testing/selftests/vm/test_hmm.sh
@@ -0,0 +1,97 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Copyright (C) 2018 Uladzislau Rezki (Sony) <urezki@gmail.com>
+#
+# This is a test script for the HMM (Heterogeneous Memory Management) test
+# kernel driver (CONFIG_TEST_HMM). It is just a kernel module loader plus a
+# small wrapper: it loads test_hmm.ko, creates the /dev/hmm_dmirror*
+# character devices, runs the hmm-tests selftest program against them, and
+# unloads the driver again.
+
+TEST_NAME="test_hmm"
+DRIVER="test_hmm"
+
+# Exit code 1 means a test failure.
+exitcode=1
+
+# Kselftest framework requirement - SKIP code is 4.
+ksft_skip=4
+
+check_test_requirements()
+{
+	uid=$(id -u)
+	if [ $uid -ne 0 ]; then
+		echo "$0: Must be run as root"
+		exit $ksft_skip
+	fi
+
+	if ! which modprobe > /dev/null 2>&1; then
+		echo "$0: You need modprobe installed"
+		exit $ksft_skip
+	fi
+
+	if ! modinfo $DRIVER > /dev/null 2>&1; then
+		echo "$0: You must have the following enabled in your kernel:"
+		echo "CONFIG_TEST_HMM=m"
+		exit $ksft_skip
+	fi
+}
+
+load_driver()
+{
+	modprobe $DRIVER > /dev/null 2>&1
+	if [ $? == 0 ]; then
+		major=$(awk "\$2==\"HMM_DMIRROR\" {print \$1}" /proc/devices)
+		mknod /dev/hmm_dmirror0 c $major 0
+		mknod /dev/hmm_dmirror1 c $major 1
+	fi
+}
+
+unload_driver()
+{
+	modprobe -r $DRIVER > /dev/null 2>&1
+	rm -f /dev/hmm_dmirror?
+}
+
+run_smoke()
+{
+	echo "Running smoke test. Note, this test provides basic coverage."
+
+	load_driver
+	./hmm-tests
+	unload_driver
+}
+
+usage()
+{
+	echo -n "Usage: $0"
+	echo
+	echo "Example usage:"
+	echo
+	echo "# Shows help message"
+	echo "./${TEST_NAME}.sh"
+	echo
+	echo "# Smoke testing"
+	echo "./${TEST_NAME}.sh smoke"
+	echo
+	exit 0
+}
+
+function run_test()
+{
+	if [ $# -eq 0 ]; then
+		usage
+	else
+		if [ "$1" = "smoke" ]; then
+			run_smoke
+		else
+			usage
+		fi
+	fi
+}
+
+check_test_requirements
+run_test $@
+
+exit 0
-- 
2.23.0



_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/4] mm: simplify device private page handling in hmm_range_fault
  2020-03-17 22:43       ` Ralph Campbell
@ 2020-03-18  9:34         ` Christoph Hellwig
  0 siblings, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2020-03-18  9:34 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: amd-gfx, linux-mm, nouveau, dri-devel, kvm-ppc,
	Christoph Hellwig, Jason Gunthorpe, Jerome Glisse, Ben Skeggs,
	Dan Williams, Bharata B Rao, Christian König

On Tue, Mar 17, 2020 at 03:43:47PM -0700, Ralph Campbell wrote:
>> Obviously no driver cared for that so far.  Once we have test cases
>> for that and thus testable code we can add code to fault it in from
>> hmm_vma_handle_pte.
>>
>
> I'm OK with the series. I think I would have been less confused if I looked at
> patch 4 then 3.

I guess I could just merge 3 and 4 if it is too confusing otherwise.
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: ensure device private pages have an owner v2
  2020-03-16 19:32 ensure device private pages have an owner v2 Christoph Hellwig
                   ` (4 preceding siblings ...)
  2020-03-17  5:31 ` ensure device private pages have an owner v2 Bharata B Rao
@ 2020-03-19  0:28 ` Jason Gunthorpe
  2020-03-19  7:16   ` Christoph Hellwig
  5 siblings, 1 reply; 38+ messages in thread
From: Jason Gunthorpe @ 2020-03-19  0:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: amd-gfx, nouveau, dri-devel, kvm-ppc, Bharata B Rao, linux-mm,
	Jerome Glisse, Ben Skeggs, Dan Williams, Christian König

On Mon, Mar 16, 2020 at 08:32:12PM +0100, Christoph Hellwig wrote:
> When acting on device private mappings a driver needs to know if the
> device (or other entity in case of kvmppc) actually owns this private
> mapping.  This series adds an owner field and converts the migrate_vma
> code over to check it.  I looked into doing the same for
> hmm_range_fault, but as far as I can tell that code has never been
> wired up to actually work for device private memory, so instead of
> trying to fix some unused code the second patch just remove the code.
> We can add it back once we have a working and fully tested code, and
> then should pass the expected owner in the hmm_range structure.
> 
> Changes since v1:
>  - split out the pgmap->owner addition into a separate patch
>  - check pgmap->owner is set for device private mappings
>  - rename the dev_private_owner field in struct migrate_vma to src_owner
>  - refuse to migrate private pages if src_owner is not set
>  - keep the non-fault device private handling in hmm_range_fault

I'm happy enough to take this, did you have plans for a v3?

Thanks,
Jason
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: ensure device private pages have an owner v2
  2020-03-19  0:28 ` Jason Gunthorpe
@ 2020-03-19  7:16   ` Christoph Hellwig
  2020-03-19 11:50     ` Jason Gunthorpe
  2020-03-19 18:50     ` Jason Gunthorpe
  0 siblings, 2 replies; 38+ messages in thread
From: Christoph Hellwig @ 2020-03-19  7:16 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: amd-gfx, nouveau, dri-devel, kvm-ppc, Christoph Hellwig,
	linux-mm, Jerome Glisse, Ben Skeggs, Dan Williams, Bharata B Rao,
	Christian König

On Wed, Mar 18, 2020 at 09:28:49PM -0300, Jason Gunthorpe wrote:
> > Changes since v1:
> >  - split out the pgmap->owner addition into a separate patch
> >  - check pgmap->owner is set for device private mappings
> >  - rename the dev_private_owner field in struct migrate_vma to src_owner
> >  - refuse to migrate private pages if src_owner is not set
> >  - keep the non-fault device private handling in hmm_range_fault
> 
> I'm happy enough to take this, did you have plans for a v3?

I think the only open question is if merging 3 and 4 might make sense.
It's up to you if you want it resent that way or not.
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: ensure device private pages have an owner v2
  2020-03-19  7:16   ` Christoph Hellwig
@ 2020-03-19 11:50     ` Jason Gunthorpe
  2020-03-19 18:50     ` Jason Gunthorpe
  1 sibling, 0 replies; 38+ messages in thread
From: Jason Gunthorpe @ 2020-03-19 11:50 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: amd-gfx, nouveau, dri-devel, kvm-ppc, Bharata B Rao, linux-mm,
	Jerome Glisse, Ben Skeggs, Dan Williams, Christian König

On Thu, Mar 19, 2020 at 08:16:33AM +0100, Christoph Hellwig wrote:
> On Wed, Mar 18, 2020 at 09:28:49PM -0300, Jason Gunthorpe wrote:
> > > Changes since v1:
> > >  - split out the pgmap->owner addition into a separate patch
> > >  - check pgmap->owner is set for device private mappings
> > >  - rename the dev_private_owner field in struct migrate_vma to src_owner
> > >  - refuse to migrate private pages if src_owner is not set
> > >  - keep the non-fault device private handling in hmm_range_fault
> > 
> > I'm happy enough to take this, did you have plans for a v3?
> 
> I think the only open question is if merging 3 and 4 might make sense.
> It's up to you if you want it resent that way or not.

Now that I understand that amdgpu doesn't set the 'do not return
device_private pages' flag, I think the split is fine; I'll grab it
as-is today.

Thanks,
Jason
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/4] mm: simplify device private page handling in hmm_range_fault
  2020-03-17 23:14               ` Ralph Campbell
@ 2020-03-19 18:17                 ` Jason Gunthorpe
  2020-03-19 22:56                   ` Ralph Campbell
  2020-03-20  0:14                 ` Jason Gunthorpe
  1 sibling, 1 reply; 38+ messages in thread
From: Jason Gunthorpe @ 2020-03-19 18:17 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: amd-gfx, nouveau, dri-devel, kvm-ppc, Christoph Hellwig,
	linux-mm, Jerome Glisse, Ben Skeggs, Dan Williams, Bharata B Rao,
	Christian König

On Tue, Mar 17, 2020 at 04:14:31PM -0700, Ralph Campbell wrote:
> 
> On 3/17/20 5:59 AM, Christoph Hellwig wrote:
> > On Tue, Mar 17, 2020 at 09:47:55AM -0300, Jason Gunthorpe wrote:
> > > I've been using v7 of Ralph's tester and it is working well - it has
> > > DEVICE_PRIVATE support so I think it can test this flow too. Ralph are
> > > you able?
> > > 
> > > This hunk seems trivial enough to me, can we include it now?
> > 
> > I can send a separate patch for it once the tester covers it.  I don't
> > want to add it to the original patch as it is a significant behavior
> > change compared to the existing code.
> > 
> 
> Attached is an updated version of my HMM tests based on linux-5.6.0-rc6.
> I ran this OK with Jason's 8+1 HMM patches, Christoph's 1-5 misc HMM clean ups,
> and Christoph's 1-4 device private page changes applied.

I'd like to get this to mergeable, it looks pretty good now, but I have
no idea about selftests - and I'm struggling to even compile the tools
dir

> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 69def4a9df00..4d22ce7879a7 100644
> +++ b/lib/Kconfig.debug
> @@ -2162,6 +2162,18 @@ config TEST_MEMINIT
>  
>  	  If unsure, say N.
>  
> +config TEST_HMM
> +	tristate "Test HMM (Heterogeneous Memory Management)"
> +	depends on DEVICE_PRIVATE
> +	select HMM_MIRROR
> +        select MMU_NOTIFIER

extra spaces

In general I wonder if it even makes sense that DEVICE_PRIVATE is user
selectable?

> +static int dmirror_fops_open(struct inode *inode, struct file *filp)
> +{
> +	struct cdev *cdev = inode->i_cdev;
> +	struct dmirror *dmirror;
> +	int ret;
> +
> +	/* Mirror this process address space */
> +	dmirror = kzalloc(sizeof(*dmirror), GFP_KERNEL);
> +	if (dmirror == NULL)
> +		return -ENOMEM;
> +
> +	dmirror->mdevice = container_of(cdev, struct dmirror_device, cdevice);
> +	mutex_init(&dmirror->mutex);
> +	xa_init(&dmirror->pt);
> +
> +	ret = mmu_interval_notifier_insert(&dmirror->notifier, current->mm,
> +				0, ULONG_MAX & PAGE_MASK, &dmirror_min_ops);
> +	if (ret) {
> +		kfree(dmirror);
> +		return ret;
> +	}
> +
> +	/* Pairs with the mmdrop() in dmirror_fops_release(). */
> +	mmgrab(current->mm);
> +	dmirror->mm = current->mm;

The notifier holds a mmgrab, no need for another one

> +	/* Only the first open registers the address space. */
> +	filp->private_data = dmirror;

Not sure what this comment means

> +static inline struct dmirror_device *dmirror_page_to_device(struct page *page)
> +
> +{
> +	struct dmirror_chunk *devmem;
> +
> +	devmem = container_of(page->pgmap, struct dmirror_chunk, pagemap);
> +	return devmem->mdevice;
> +}

extra devmem var is not really needed

> +
> +static bool dmirror_device_is_mine(struct dmirror_device *mdevice,
> +				   struct page *page)
> +{
> +	if (!is_zone_device_page(page))
> +		return false;
> +	return page->pgmap->ops == &dmirror_devmem_ops &&
> +		dmirror_page_to_device(page) == mdevice;
> +}

Use the new owner stuff, right? Actually this is redundant now, the check
should be just WARN_ON pageowner != self owner

> +static int dmirror_do_fault(struct dmirror *dmirror, struct hmm_range *range)
> +{
> +	uint64_t *pfns = range->pfns;
> +	unsigned long pfn;
> +
> +	for (pfn = (range->start >> PAGE_SHIFT);
> +	     pfn < (range->end >> PAGE_SHIFT);
> +	     pfn++, pfns++) {
> +		struct page *page;
> +		void *entry;
> +
> +		/*
> +		 * HMM_PFN_ERROR is returned if it is accessing invalid memory
> +		 * either because of memory error (hardware detected memory
> +		 * corruption) or more likely because of truncate on mmap
> +		 * file.
> +		 */
> +		if (*pfns == range->values[HMM_PFN_ERROR])
> +			return -EFAULT;

Unless the snapshot path is in use, hmm_range_fault() never returns success
with PFN_ERROR set, so this should be a WARN_ON

> +		if (!(*pfns & range->flags[HMM_PFN_VALID]))
> +			return -EFAULT;

Same with valid.

> +		page = hmm_device_entry_to_page(range, *pfns);
> +		/* We asked for pages to be populated but check anyway. */
> +		if (!page)
> +			return -EFAULT;

WARN_ON
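
For illustration, the three checks above could collapse to something like
this (a sketch only, reusing the names from the posted dmirror_do_fault();
not an actual follow-up patch):

	/*
	 * The range was faulted, not snapshotted, so a successful return
	 * from hmm_range_fault() means every entry is valid and backed by
	 * a page; anything else is a driver bug.
	 */
	if (WARN_ON(*pfns == range->values[HMM_PFN_ERROR]))
		return -EFAULT;
	if (WARN_ON(!(*pfns & range->flags[HMM_PFN_VALID])))
		return -EFAULT;
	page = hmm_device_entry_to_page(range, *pfns);
	if (WARN_ON(!page))
		return -EFAULT;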

> +		if (is_zone_device_page(page)) {
> +			/*
> +			 * TODO: need a way to ask HMM to fault foreign zone
> +			 * device private pages.
> +			 */
> +			if (!dmirror_device_is_mine(dmirror->mdevice, page))
> +				continue;

Actually re

> +static bool dmirror_interval_invalidate(struct mmu_interval_notifier *mni,
> +				const struct mmu_notifier_range *range,
> +				unsigned long cur_seq)
> +{
> +	struct dmirror *dmirror = container_of(mni, struct dmirror, notifier);
> +	struct mm_struct *mm = dmirror->mm;
> +
> +	/*
> +	 * If the process doesn't exist, we don't need to invalidate the
> +	 * device page table since the address space will be torn down.
> +	 */
> +	if (!mmget_not_zero(mm))
> +		return true;

Why? Don't the notifiers provide for this already?

mmget_not_zero() is required before calling hmm_range_fault() though

> +static int dmirror_fault(struct dmirror *dmirror, unsigned long start,
> +			 unsigned long end, bool write)
> +{
> +	struct mm_struct *mm = dmirror->mm;
> +	unsigned long addr;
> +	uint64_t pfns[64];
> +	struct hmm_range range = {
> +		.notifier = &dmirror->notifier,
> +		.pfns = pfns,
> +		.flags = dmirror_hmm_flags,
> +		.values = dmirror_hmm_values,
> +		.pfn_shift = DPT_SHIFT,
> +		.pfn_flags_mask = ~(dmirror_hmm_flags[HMM_PFN_VALID] |
> +				    dmirror_hmm_flags[HMM_PFN_WRITE]),
> +		.default_flags = dmirror_hmm_flags[HMM_PFN_VALID] |
> +				(write ? dmirror_hmm_flags[HMM_PFN_WRITE] : 0),
> +		.dev_private_owner = dmirror->mdevice,
> +	};
> +	int ret = 0;
> +
> +	/* Since the mm is for the mirrored process, get a reference first. */
> +	if (!mmget_not_zero(mm))
> +		return 0;

Right

> +	for (addr = start; addr < end; addr = range.end) {
> +		range.start = addr;
> +		range.end = min(addr + (ARRAY_SIZE(pfns) << PAGE_SHIFT), end);
> +
> +		ret = dmirror_range_fault(dmirror, &range);
> +		if (ret)
> +			break;
> +	}
> +
> +	mmput(mm);
> +	return ret;
> +}
> +
> +static int dmirror_do_read(struct dmirror *dmirror, unsigned long start,
> +			   unsigned long end, struct dmirror_bounce *bounce)
> +{
> +	unsigned long pfn;
> +	void *ptr;
> +
> +	ptr = bounce->ptr + ((start - bounce->addr) & PAGE_MASK);
> +
> +	for (pfn = start >> PAGE_SHIFT; pfn < (end >> PAGE_SHIFT); pfn++) {
> +		void *entry;
> +		struct page *page;
> +		void *tmp;
> +
> +		entry = xa_load(&dmirror->pt, pfn);
> +		page = xa_untag_pointer(entry);
> +		if (!page)
> +			return -ENOENT;
> +
> +		tmp = kmap(page);
> +		memcpy(ptr, tmp, PAGE_SIZE);
> +		kunmap(page);
> +
> +		ptr += PAGE_SIZE;
> +		bounce->cpages++;
> +	}
> +
> +	return 0;
> +}
> +
> +static int dmirror_read(struct dmirror *dmirror, struct hmm_dmirror_cmd *cmd)
> +{
> +	struct dmirror_bounce bounce;
> +	unsigned long start, end;
> +	unsigned long size = cmd->npages << PAGE_SHIFT;
> +	int ret;
> +
> +	start = cmd->addr;
> +	end = start + size;
> +	if (end < start)
> +		return -EINVAL;
> +
> +	ret = dmirror_bounce_init(&bounce, start, size);
> +	if (ret)
> +		return ret;
> +
> +again:
> +	mutex_lock(&dmirror->mutex);
> +	ret = dmirror_do_read(dmirror, start, end, &bounce);
> +	mutex_unlock(&dmirror->mutex);
> +	if (ret == 0)
> +		ret = copy_to_user((void __user *)cmd->ptr, bounce.ptr,
> +					bounce.size);

Use u64_to_user_ptr() instead of the cast

> +static int dmirror_write(struct dmirror *dmirror, struct hmm_dmirror_cmd *cmd)
> +{
> +	struct dmirror_bounce bounce;
> +	unsigned long start, end;
> +	unsigned long size = cmd->npages << PAGE_SHIFT;
> +	int ret;
> +
> +	start = cmd->addr;
> +	end = start + size;
> +	if (end < start)
> +		return -EINVAL;
> +
> +	ret = dmirror_bounce_init(&bounce, start, size);
> +	if (ret)
> +		return ret;
> +	ret = copy_from_user(bounce.ptr, (void __user *)cmd->ptr,
> +				bounce.size);

ditto

> +	if (ret)
> +		return ret;
> +
> +again:
> +	mutex_lock(&dmirror->mutex);
> +	ret = dmirror_do_write(dmirror, start, end, &bounce);
> +	mutex_unlock(&dmirror->mutex);
> +	if (ret == -ENOENT) {
> +		start = cmd->addr + (bounce.cpages << PAGE_SHIFT);
> +		ret = dmirror_fault(dmirror, start, end, true);
> +		if (ret == 0) {
> +			cmd->faults++;
> +			goto again;

Use a loop instead of goto?

Also I get this:

lib/test_hmm.c: In function ‘dmirror_devmem_fault_alloc_and_copy’:
lib/test_hmm.c:1041:25: warning: unused variable ‘vma’ [-Wunused-variable]
 1041 |  struct vm_area_struct *vma = args->vma;

But this is a kernel bug, due to alloc_page_vma being a #define not a
static inline and me having CONFIG_NUMA off in this .config
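
For reference, the !CONFIG_NUMA fallback is roughly as follows (paraphrased,
not a verbatim copy of include/linux/gfp.h):

/*
 * Approximate shape of the !CONFIG_NUMA case: the vma argument is dropped
 * entirely, so a variable used only for that argument ends up unused.
 */
#define alloc_page_vma(gfp_mask, vma, addr)	alloc_pages(gfp_mask, 0)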

Jason
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: ensure device private pages have an owner v2
  2020-03-19  7:16   ` Christoph Hellwig
  2020-03-19 11:50     ` Jason Gunthorpe
@ 2020-03-19 18:50     ` Jason Gunthorpe
  1 sibling, 0 replies; 38+ messages in thread
From: Jason Gunthorpe @ 2020-03-19 18:50 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: amd-gfx, nouveau, dri-devel, kvm-ppc, Bharata B Rao, linux-mm,
	Jerome Glisse, Ben Skeggs, Dan Williams, Christian König

On Thu, Mar 19, 2020 at 08:16:33AM +0100, Christoph Hellwig wrote:
> On Wed, Mar 18, 2020 at 09:28:49PM -0300, Jason Gunthorpe wrote:
> > > Changes since v1:
> > >  - split out the pgmap->owner addition into a separate patch
> > >  - check pgmap->owner is set for device private mappings
> > >  - rename the dev_private_owner field in struct migrate_vma to src_owner
> > >  - refuse to migrate private pages if src_owner is not set
> > >  - keep the non-fault device private handling in hmm_range_fault
> > 
> > I'm happy enough to take this, did you have plans for a v3?
> 
> I think the only open question is if merging 3 and 4 might make sense.
> It's up to you if you want it resent that way or not.

Okay, I kept it as-is and elaborated the commit messages a bit based
on the discussion.

It doesn't seem like the changes outside hmm are significant enough to
need more acks.

Thanks,
Jason
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 3/4] mm: simplify device private page handling in hmm_range_fault
  2020-03-19 18:17                 ` Jason Gunthorpe
@ 2020-03-19 22:56                   ` Ralph Campbell
  2020-03-20  0:03                     ` Jason Gunthorpe
  0 siblings, 1 reply; 38+ messages in thread
From: Ralph Campbell @ 2020-03-19 22:56 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: amd-gfx, nouveau, dri-devel, kvm-ppc, Christoph Hellwig,
	linux-mm, Jerome Glisse, Ben Skeggs, linux-kselftest,
	Dan Williams, Bharata B Rao, Christian König

Adding linux-kselftest@vger.kernel.org for the test config question.

On 3/19/20 11:17 AM, Jason Gunthorpe wrote:
> On Tue, Mar 17, 2020 at 04:14:31PM -0700, Ralph Campbell wrote:
>>
>> On 3/17/20 5:59 AM, Christoph Hellwig wrote:
>>> On Tue, Mar 17, 2020 at 09:47:55AM -0300, Jason Gunthorpe wrote:
>>>> I've been using v7 of Ralph's tester and it is working well - it has
>>>> DEVICE_PRIVATE support so I think it can test this flow too. Ralph are
>>>> you able?
>>>>
>>>> This hunk seems trivial enough to me, can we include it now?
>>>
>>> I can send a separate patch for it once the tester covers it.  I don't
>>> want to add it to the original patch as it is a significant behavior
>>> change compared to the existing code.
>>>
>>
>> Attached is an updated version of my HMM tests based on linux-5.6.0-rc6.
>> I ran this OK with Jason's 8+1 HMM patches, Christoph's 1-5 misc HMM clean ups,
>> and Christoph's 1-4 device private page changes applied.
> 
> I'd like to get this to mergeable, it looks pretty good now, but I have
> no idea about selftests - and I'm struggling to even compile the tools
> dir
> 
>> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
>> index 69def4a9df00..4d22ce7879a7 100644
>> +++ b/lib/Kconfig.debug
>> @@ -2162,6 +2162,18 @@ config TEST_MEMINIT
>>   
>>   	  If unsure, say N.
>>   
>> +config TEST_HMM
>> +	tristate "Test HMM (Heterogeneous Memory Management)"
>> +	depends on DEVICE_PRIVATE
>> +	select HMM_MIRROR
>> +        select MMU_NOTIFIER
> 
> extra spaces

Will fix in v8.

> In general I wonder if it even makes sense that DEVICE_PRIVATE is user
> selectable?

Should tests enable the feature or the feature enable the test?
IMHO, if the feature is being compiled into the kernel, that should
enable the menu item for the test. If the feature isn't selected,
no need to test it :-)

>> +static int dmirror_fops_open(struct inode *inode, struct file *filp)
>> +{
>> +	struct cdev *cdev = inode->i_cdev;
>> +	struct dmirror *dmirror;
>> +	int ret;
>> +
>> +	/* Mirror this process address space */
>> +	dmirror = kzalloc(sizeof(*dmirror), GFP_KERNEL);
>> +	if (dmirror == NULL)
>> +		return -ENOMEM;
>> +
>> +	dmirror->mdevice = container_of(cdev, struct dmirror_device, cdevice);
>> +	mutex_init(&dmirror->mutex);
>> +	xa_init(&dmirror->pt);
>> +
>> +	ret = mmu_interval_notifier_insert(&dmirror->notifier, current->mm,
>> +				0, ULONG_MAX & PAGE_MASK, &dmirror_min_ops);
>> +	if (ret) {
>> +		kfree(dmirror);
>> +		return ret;
>> +	}
>> +
>> +	/* Pairs with the mmdrop() in dmirror_fops_release(). */
>> +	mmgrab(current->mm);
>> +	dmirror->mm = current->mm;
> 
> The notifier holds a mmgrab, no need for another one

OK. I'll replace dmirror->mm with dmirror->notifier.mm.
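
Roughly like this minimal sketch of the open path (reusing the names from
the posted driver; illustrative only, since mmu_interval_notifier_insert()
already holds the mmgrab as Jason points out):

	ret = mmu_interval_notifier_insert(&dmirror->notifier, current->mm,
				0, ULONG_MAX & PAGE_MASK, &dmirror_min_ops);
	if (ret) {
		kfree(dmirror);
		return ret;
	}

	/*
	 * No separate mmgrab()/dmirror->mm here: the interval notifier pins
	 * the mm_struct, and later code can use dmirror->notifier.mm.
	 */
	filp->private_data = dmirror;
	return 0;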

>> +	/* Only the first open registers the address space. */
>> +	filp->private_data = dmirror;
> 
> Not sure what this comment means

I'll change the comment to:
	/*
	 * The first open of the device character file registers the address
	 * space of the process doing the open() system call with the device.
	 * Subsequent file opens by other processes will have access to the
	 * first process' address space.
	 */

>> +static inline struct dmirror_device *dmirror_page_to_device(struct page *page)
>> +
>> +{
>> +	struct dmirror_chunk *devmem;
>> +
>> +	devmem = container_of(page->pgmap, struct dmirror_chunk, pagemap);
>> +	return devmem->mdevice;
>> +}
> 
> extra devmem var is not really needed

I'll change this to:
	return container_of(page->pgmap, struct dmirror_chunk,
			    pagemap)->mdevice;

>> +
>> +static bool dmirror_device_is_mine(struct dmirror_device *mdevice,
>> +				   struct page *page)
>> +{
>> +	if (!is_zone_device_page(page))
>> +		return false;
>> +	return page->pgmap->ops == &dmirror_devmem_ops &&
>> +		dmirror_page_to_device(page) == mdevice;
>> +}
> 
> Use the new owner stuff, right? Actually this is redundant now, the check
> should be just WARN_ON pageowner != self owner

I'll clean this up. dmirror_device_is_mine() isn't needed now.
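
With the new pgmap->owner field from this series, the check could shrink to
a sanity test along these lines (sketch only, names from the posted driver):

	/*
	 * hmm_range_fault() only returns device private pages whose
	 * pgmap->owner matches range->dev_private_owner, so anything else
	 * arriving here would be a bug.
	 */
	if (is_zone_device_page(page))
		WARN_ON_ONCE(page->pgmap->owner != dmirror->mdevice);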

>> +static int dmirror_do_fault(struct dmirror *dmirror, struct hmm_range *range)
>> +{
>> +	uint64_t *pfns = range->pfns;
>> +	unsigned long pfn;
>> +
>> +	for (pfn = (range->start >> PAGE_SHIFT);
>> +	     pfn < (range->end >> PAGE_SHIFT);
>> +	     pfn++, pfns++) {
>> +		struct page *page;
>> +		void *entry;
>> +
>> +		/*
>> +		 * HMM_PFN_ERROR is returned if it is accessing invalid memory
>> +		 * either because of memory error (hardware detected memory
>> +		 * corruption) or more likely because of truncate on mmap
>> +		 * file.
>> +		 */
>> +		if (*pfns == range->values[HMM_PFN_ERROR])
>> +			return -EFAULT;
> 
> Unless the snapshot path is in use, hmm_range_fault() never returns success
> with PFN_ERROR set, so this should be a WARN_ON
> 
>> +		if (!(*pfns & range->flags[HMM_PFN_VALID]))
>> +			return -EFAULT;
> 
> Same with valid.
> 
>> +		page = hmm_device_entry_to_page(range, *pfns);
>> +		/* We asked for pages to be populated but check anyway. */
>> +		if (!page)
>> +			return -EFAULT;
> 
> WARN_ON
> 
>> +		if (is_zone_device_page(page)) {
>> +			/*
>> +			 * TODO: need a way to ask HMM to fault foreign zone
>> +			 * device private pages.
>> +			 */
>> +			if (!dmirror_device_is_mine(dmirror->mdevice, page))
>> +				continue;
> 
> Actually re
> 
>> +static bool dmirror_interval_invalidate(struct mmu_interval_notifier *mni,
>> +				const struct mmu_notifier_range *range,
>> +				unsigned long cur_seq)
>> +{
>> +	struct dmirror *dmirror = container_of(mni, struct dmirror, notifier);
>> +	struct mm_struct *mm = dmirror->mm;
>> +
>> +	/*
>> +	 * If the process doesn't exist, we don't need to invalidate the
>> +	 * device page table since the address space will be torn down.
>> +	 */
>> +	if (!mmget_not_zero(mm))
>> +		return true;
> 
> Why? Don't the notifiers provide for this already?
> 
> mmget_not_zero() is required before calling hmm_range_fault() though

This is a workaround for a problem I don't quite understand.
If you change tools/testing/selftests/vm/hmm-tests.c line 868 to
	ASSERT_EQ(ret, -1);
then the test will abort, core dump, and cause two problems:
1) The migrated page will be faulted back to system memory in order to write
    it to the core dump. This triggers lockdep_assert_held(&walk.mm->mmap_sem)
    in walk_page_range().
2) After a delay, I get:
[  137.852986] rcu: INFO: rcu_sched self-detected stall on CPU
[  137.858594] rcu: 	0-....: (26000 ticks this GP) idle=69e/1/0x4000000000000002 softirq=34555/34555 fqs=6497
[  137.868439] 	(t=26007 jiffies g=14653 q=271)
[  137.872711] NMI backtrace for cpu 0
[  137.876205] CPU: 0 PID: 6228 Comm: hmm-tests Not tainted 5.6.0-rc6+ #2
[  137.882730] Hardware name: System manufacturer System Product Name/SABERTOOTH X79, BIOS 4302 08/29/2013
[  137.892115] Call Trace:
[  137.894570]  <IRQ>
[  137.896593]  dump_stack+0x97/0xe0
[  137.899920]  nmi_cpu_backtrace.cold+0x14/0x68
[  137.904287]  ? lapic_can_unplug_cpu.cold+0x39/0x39
[  137.909091]  nmi_trigger_cpumask_backtrace+0xf1/0x10e
[  137.914152]  rcu_dump_cpu_stacks+0xe2/0x125
[  137.918348]  rcu_sched_clock_irq.cold+0x393/0x610
[  137.923069]  update_process_times+0x24/0x50
[  137.927263]  tick_sched_handle+0x68/0x90
[  137.931196]  tick_sched_timer+0x38/0xa0
[  137.935037]  __hrtimer_run_queues+0x1f9/0x6c0
[  137.939403]  ? tick_sched_do_timer+0x90/0x90
[  137.943683]  ? enqueue_hrtimer+0x1a0/0x1a0
[  137.947790]  ? recalibrate_cpu_khz+0x10/0x10
[  137.952064]  ? ktime_get_update_offsets_now+0xed/0x1c0
[  137.957209]  hrtimer_interrupt+0x1a5/0x340
[  137.961315]  ? rcu_read_lock_sched_held+0xa1/0xd0
[  137.966038]  smp_apic_timer_interrupt+0xbb/0x320
[  137.970665]  apic_timer_interrupt+0xf/0x20
[  137.974771]  </IRQ>
[  137.976876] RIP: 0010:xas_load+0x7/0x80
[  137.980718] Code: 80 2f 1a 83 c6 05 e9 8d 7b 01 01 e8 3e b1 b1 fe e9 05 ff ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 41 56 41 55 41 54 55 <48> 89 fd 53 4c 8d 6d 10 e8 3c fc ff ff 49 89 c4 4c 89 e0 83 e0 03
[  137.999461] RSP: 0018:ffffc900015e77c8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[  138.007028] RAX: ffff8886e508c408 RBX: 0000000000000000 RCX: ffffffff82626c89
[  138.014159] RDX: dffffc0000000000 RSI: 0000000000000000 RDI: ffffc900015e78a0
[  138.021293] RBP: ffffc900015e78a0 R08: ffffffff811461c4 R09: fffff520002bcf17
[  138.028426] R10: fffff520002bcf16 R11: 0000000000000003 R12: 0000000002606d10
[  138.035557] R13: ffff8886e508c448 R14: 0000000000000031 R15: ffffffffa06546a0
[  138.042701]  ? do_raw_spin_lock+0x104/0x1d0
[  138.046888]  ? xas_store+0x19/0xa60
[  138.050390]  xas_store+0x5b3/0xa60
[  138.053806]  ? register_lock_class+0x860/0x860
[  138.058267]  __xa_erase+0x96/0x110
[  138.061673]  ? xas_store+0xa60/0xa60
[  138.065267]  xa_erase+0x19/0x30
[  138.068418]  dmirror_interval_invalidate+0x7d/0xc0 [test_hmm]
[  138.074174]  __mmu_notifier_release+0x1a6/0x370
[  138.078714]  ? mmu_notifier_unregister+0x1e0/0x1e0
[  138.083520]  ? lock_downgrade+0x380/0x380
[  138.087535]  ? uprobe_clear_state+0x2e/0x150
[  138.091823]  exit_mmap+0x24d/0x2a0
[  138.095229]  ? do_munmap+0x10/0x10
[  138.098635]  ? __x64_sys_io_setup+0x200/0x200
[  138.102995]  ? __mutex_unlock_slowpath+0xb4/0x3f0
[  138.107704]  ? wait_for_completion+0x250/0x250
[  138.112158]  ? lock_downgrade+0x380/0x380
[  138.116176]  ? check_flags.part.0+0x82/0x210
[  138.120463]  mmput+0xb5/0x210
[  138.123444]  do_exit+0x602/0x14c0
[  138.126776]  ? mm_update_next_owner+0x400/0x400
[  138.131329]  do_group_exit+0x8a/0x140
[  138.135006]  get_signal+0x25b/0x1080
[  138.138606]  do_signal+0x8c/0xa90
[  138.141928]  ? _raw_spin_unlock_irq+0x24/0x30
[  138.146292]  ? mark_held_locks+0x24/0x90
[  138.150219]  ? _raw_spin_unlock_irq+0x24/0x30
[  138.154580]  ? lockdep_hardirqs_on+0x190/0x280
[  138.159026]  ? setup_sigcontext+0x260/0x260
[  138.163210]  ? sigprocmask+0x10b/0x150
[  138.166965]  ? __x64_sys_rt_sigsuspend+0xe0/0xe0
[  138.171594]  ? __x64_sys_rt_sigprocmask+0xfb/0x180
[  138.176394]  ? __ia32_compat_sys_rt_sigprocmask+0x190/0x190
[  138.181965]  ? entry_SYSCALL_64_after_hwframe+0x3e/0xbe
[  138.187192]  ? exit_to_usermode_loop+0x60/0x100
[  138.191723]  ? mark_held_locks+0x24/0x90
[  138.195656]  exit_to_usermode_loop+0x85/0x100
[  138.200023]  do_syscall_64+0x20b/0x290
[  138.203782]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  138.208839] RIP: 0033:0x7f2d349f1625
[  138.212420] Code: c2 b8 ea 00 00 00 0f 05 48 3d 00 f0 ff ff 77 3d 41 89 c0 41 ba 08 00 00 00 31 d2 4c 89 ce bf 02 00 00 00 b8 0e 00 00 00 0f 05 <48> 8b 84 24 08 01 00 00 64 48 33 04 25 28 00 00 00 75 24 44 89 c0
[  138.231163] RSP: 002b:00007ffe84228c60 EFLAGS: 00000246 ORIG_RAX: 000000000000000e
[  138.238731] RAX: 0000000000000000 RBX: 00007f2d349ad040 RCX: 00007f2d349f1625
[  138.245862] RDX: 0000000000000000 RSI: 00007ffe84228c60 RDI: 0000000000000002
[  138.252997] RBP: 00007ffe84228ec0 R08: 0000000000000000 R09: 00007ffe84228c60
[  138.260127] R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000401240
[  138.267261] R13: 00007ffe84229190 R14: 0000000000000000 R15: 0000000000000000
[  138.274554] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 0-... } 26428 jiffies s: 277 root: 0x1/.
[  138.285167] rcu: blocking rcu_node structures:
[  138.289641] Task dump for CPU 0:
[  138.292890] hmm-tests       R  running task    27560  6228   6211 0x8000400a
[  138.300002] Call Trace:
[  138.302486]  ? check_chain_key+0x1d1/0x2c0
[  138.306627]  ? __lock_acquire+0x61c/0x2820
[  138.310760]  ? match_held_lock+0x1b/0x230
[  138.314800]  ? check_chain_key+0x1d1/0x2c0
[  138.318937]  ? lock_downgrade+0x380/0x380
[  138.323008]  ? lock_acquire+0xff/0x220
[  138.326795]  ? stack_depot_save+0x137/0x450
[  138.331026]  ? _raw_spin_unlock_irqrestore+0x3e/0x50
[  138.336026]  ? mark_held_locks+0x24/0x90
[  138.340012]  ? _raw_spin_unlock_irqrestore+0x3e/0x50
[  138.345022]  ? lockdep_hardirqs_on+0x190/0x280
[  138.349504]  ? stack_depot_save+0x253/0x450
[  138.353731]  ? check_chain_key+0x1d1/0x2c0
[  138.357871]  ? __lock_acquire+0x61c/0x2820
[  138.362033]  ? match_held_lock+0x1b/0x230
[  138.366070]  ? check_chain_key+0x1d1/0x2c0
[  138.370209]  ? mark_lock+0xac/0x9e0
[  138.373734]  ? mark_lock+0xac/0x9e0
[  138.377261]  ? mark_lock+0xac/0x9e0
[  138.380772]  ? mark_held_locks+0x65/0x90
[  138.384730]  ? mark_lock+0xac/0x9e0
[  138.388249]  ? trace_hardirqs_on_thunk+0x1a/0x1c
[  138.392895]  ? trace_hardirqs_on_thunk+0x1a/0x1c
[  138.397542]  ? lockdep_hardirqs_on+0x190/0x280
[  138.402029]  ? trace_hardirqs_on_thunk+0x1a/0x1c
[  138.406676]  ? mark_lock+0xac/0x9e0
[  138.410209]  ? lock_is_held_type+0x110/0x140
[  138.414509]  ? mark_held_locks+0x65/0x90
[  138.418461]  ? match_held_lock+0x1b/0x230
[  138.422500]  ? match_held_lock+0x1b/0x230
[  138.426551]  ? __lock_acquire+0x11c3/0x2820
[  138.430773]  ? xas_load+0x64/0x80
[  138.434126]  ? xas_store+0xac/0xa60
[  138.437652]  ? register_lock_class+0x860/0x860
[  138.442140]  ? lock_downgrade+0x380/0x380
[  138.446193]  ? _raw_spin_lock+0x2c/0x40
[  138.450059]  ? _raw_spin_unlock+0x17/0x30
[  138.454100]  ? xa_erase+0xe/0x30
[  138.457370]  ? dmirror_interval_invalidate+0x7d/0xc0 [test_hmm]
[  138.463325]  ? __mmu_notifier_release+0x1a6/0x370
[  138.468065]  ? mmu_notifier_unregister+0x1e0/0x1e0
[  138.472896]  ? lock_downgrade+0x380/0x380
[  138.476937]  ? uprobe_clear_state+0x2e/0x150
[  138.481252]  ? exit_mmap+0x24d/0x2a0
[  138.484859]  ? do_munmap+0x10/0x10
[  138.488298]  ? __x64_sys_io_setup+0x200/0x200
[  138.492683]  ? __mutex_unlock_slowpath+0xb4/0x3f0
[  138.497418]  ? wait_for_completion+0x250/0x250
[  138.501897]  ? lock_downgrade+0x380/0x380
[  138.505944]  ? check_flags.part.0+0x82/0x210
[  138.510254]  ? mmput+0xb5/0x210
[  138.513435]  ? do_exit+0x602/0x14c0
[  138.516967]  ? mm_update_next_owner+0x400/0x400
[  138.521545]  ? do_group_exit+0x8a/0x140
[  138.525423]  ? get_signal+0x25b/0x1080
[  138.529225]  ? do_signal+0x8c/0xa90
[  138.532755]  ? _raw_spin_unlock_irq+0x24/0x30
[  138.537143]  ? mark_held_locks+0x24/0x90
[  138.541096]  ? _raw_spin_unlock_irq+0x24/0x30
[  138.545492]  ? lockdep_hardirqs_on+0x190/0x280
[  138.549971]  ? setup_sigcontext+0x260/0x260
[  138.554190]  ? sigprocmask+0x10b/0x150
[  138.557969]  ? __x64_sys_rt_sigsuspend+0xe0/0xe0
[  138.562627]  ? __x64_sys_rt_sigprocmask+0xfb/0x180
[  138.567452]  ? __ia32_compat_sys_rt_sigprocmask+0x190/0x190
[  138.573057]  ? entry_SYSCALL_64_after_hwframe+0x3e/0xbe
[  138.578311]  ? exit_to_usermode_loop+0x60/0x100
[  138.582875]  ? mark_held_locks+0x24/0x90
[  138.586837]  ? exit_to_usermode_loop+0x85/0x100
[  138.591400]  ? do_syscall_64+0x20b/0x290
[  138.595360]  ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
./test_hmm.sh: line 58:  6211 Alarm clock             ./hmm-tests


>> +static int dmirror_fault(struct dmirror *dmirror, unsigned long start,
>> +			 unsigned long end, bool write)
>> +{
>> +	struct mm_struct *mm = dmirror->mm;
>> +	unsigned long addr;
>> +	uint64_t pfns[64];
>> +	struct hmm_range range = {
>> +		.notifier = &dmirror->notifier,
>> +		.pfns = pfns,
>> +		.flags = dmirror_hmm_flags,
>> +		.values = dmirror_hmm_values,
>> +		.pfn_shift = DPT_SHIFT,
>> +		.pfn_flags_mask = ~(dmirror_hmm_flags[HMM_PFN_VALID] |
>> +				    dmirror_hmm_flags[HMM_PFN_WRITE]),
>> +		.default_flags = dmirror_hmm_flags[HMM_PFN_VALID] |
>> +				(write ? dmirror_hmm_flags[HMM_PFN_WRITE] : 0),
>> +		.dev_private_owner = dmirror->mdevice,
>> +	};
>> +	int ret = 0;
>> +
>> +	/* Since the mm is for the mirrored process, get a reference first. */
>> +	if (!mmget_not_zero(mm))
>> +		return 0;
> 
> Right
> 
>> +	for (addr = start; addr < end; addr = range.end) {
>> +		range.start = addr;
>> +		range.end = min(addr + (ARRAY_SIZE(pfns) << PAGE_SHIFT), end);
>> +
>> +		ret = dmirror_range_fault(dmirror, &range);
>> +		if (ret)
>> +			break;
>> +	}
>> +
>> +	mmput(mm);
>> +	return ret;
>> +}
>> +
>> +static int dmirror_do_read(struct dmirror *dmirror, unsigned long start,
>> +			   unsigned long end, struct dmirror_bounce *bounce)
>> +{
>> +	unsigned long pfn;
>> +	void *ptr;
>> +
>> +	ptr = bounce->ptr + ((start - bounce->addr) & PAGE_MASK);
>> +
>> +	for (pfn = start >> PAGE_SHIFT; pfn < (end >> PAGE_SHIFT); pfn++) {
>> +		void *entry;
>> +		struct page *page;
>> +		void *tmp;
>> +
>> +		entry = xa_load(&dmirror->pt, pfn);
>> +		page = xa_untag_pointer(entry);
>> +		if (!page)
>> +			return -ENOENT;
>> +
>> +		tmp = kmap(page);
>> +		memcpy(ptr, tmp, PAGE_SIZE);
>> +		kunmap(page);
>> +
>> +		ptr += PAGE_SIZE;
>> +		bounce->cpages++;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static int dmirror_read(struct dmirror *dmirror, struct hmm_dmirror_cmd *cmd)
>> +{
>> +	struct dmirror_bounce bounce;
>> +	unsigned long start, end;
>> +	unsigned long size = cmd->npages << PAGE_SHIFT;
>> +	int ret;
>> +
>> +	start = cmd->addr;
>> +	end = start + size;
>> +	if (end < start)
>> +		return -EINVAL;
>> +
>> +	ret = dmirror_bounce_init(&bounce, start, size);
>> +	if (ret)
>> +		return ret;
>> +
>> +again:
>> +	mutex_lock(&dmirror->mutex);
>> +	ret = dmirror_do_read(dmirror, start, end, &bounce);
>> +	mutex_unlock(&dmirror->mutex);
>> +	if (ret == 0)
>> +		ret = copy_to_user((void __user *)cmd->ptr, bounce.ptr,
>> +					bounce.size);
> 
> Use u64_to_user_ptr() instead of the cast

Will do.
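
Presumably something like this (sketch of just that change, otherwise mirroring
the quoted code):

	if (ret == 0)
		ret = copy_to_user(u64_to_user_ptr(cmd->ptr), bounce.ptr,
				   bounce.size);

u64_to_user_ptr() hides the u64 to void __user * conversion instead of the
open-coded cast.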

>> +static int dmirror_write(struct dmirror *dmirror, struct hmm_dmirror_cmd *cmd)
>> +{
>> +	struct dmirror_bounce bounce;
>> +	unsigned long start, end;
>> +	unsigned long size = cmd->npages << PAGE_SHIFT;
>> +	int ret;
>> +
>> +	start = cmd->addr;
>> +	end = start + size;
>> +	if (end < start)
>> +		return -EINVAL;
>> +
>> +	ret = dmirror_bounce_init(&bounce, start, size);
>> +	if (ret)
>> +		return ret;
>> +	ret = copy_from_user(bounce.ptr, (void __user *)cmd->ptr,
>> +				bounce.size);
> 
> ditto
> 
>> +	if (ret)
>> +		return ret;
>> +
>> +again:
>> +	mutex_lock(&dmirror->mutex);
>> +	ret = dmirror_do_write(dmirror, start, end, &bounce);
>> +	mutex_unlock(&dmirror->mutex);
>> +	if (ret == -ENOENT) {
>> +		start = cmd->addr + (bounce.cpages << PAGE_SHIFT);
>> +		ret = dmirror_fault(dmirror, start, end, true);
>> +		if (ret == 0) {
>> +			cmd->faults++;
>> +			goto again;
> 
> Use a loop instead of goto?

OK.
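
For what it's worth, dropping the again: label could look something like this
(untested sketch, same logic as the quoted code):

	while (1) {
		mutex_lock(&dmirror->mutex);
		ret = dmirror_do_write(dmirror, start, end, &bounce);
		mutex_unlock(&dmirror->mutex);
		if (ret != -ENOENT)
			break;

		/* fault in the missing pages and try the copy again */
		start = cmd->addr + (bounce.cpages << PAGE_SHIFT);
		ret = dmirror_fault(dmirror, start, end, true);
		if (ret)
			break;
		cmd->faults++;
	}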

> Also I get this:
> 
> lib/test_hmm.c: In function ‘dmirror_devmem_fault_alloc_and_copy’:
> lib/test_hmm.c:1041:25: warning: unused variable ‘vma’ [-Wunused-variable]
>   1041 |  struct vm_area_struct *vma = args->vma;
> 
> But this is a kernel bug, due to alloc_page_vma being a #define not a
> static inline and me having CONFIG_NUMA off in this .config

Fixed.
I'll repost as a proper series shortly.

> Jason
> 

* Re: [PATCH 3/4] mm: simplify device private page handling in hmm_range_fault
  2020-03-19 22:56                   ` Ralph Campbell
@ 2020-03-20  0:03                     ` Jason Gunthorpe
  2020-03-21  8:20                       ` Christoph Hellwig
  0 siblings, 1 reply; 38+ messages in thread
From: Jason Gunthorpe @ 2020-03-20  0:03 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: amd-gfx, nouveau, dri-devel, kvm-ppc, Christoph Hellwig,
	linux-mm, Jerome Glisse, Ben Skeggs, linux-kselftest,
	Dan Williams, Bharata B Rao, Christian König

On Thu, Mar 19, 2020 at 03:56:50PM -0700, Ralph Campbell wrote:
> Adding linux-kselftest@vger.kernel.org for the test config question.
> 
> On 3/19/20 11:17 AM, Jason Gunthorpe wrote:
> > On Tue, Mar 17, 2020 at 04:14:31PM -0700, Ralph Campbell wrote:
> > > 
> > > On 3/17/20 5:59 AM, Christoph Hellwig wrote:
> > > > On Tue, Mar 17, 2020 at 09:47:55AM -0300, Jason Gunthorpe wrote:
> > > > > I've been using v7 of Ralph's tester and it is working well - it has
> > > > > DEVICE_PRIVATE support so I think it can test this flow too. Ralph are
> > > > > you able?
> > > > > 
> > > > > This hunk seems trivial enough to me, can we include it now?
> > > > 
> > > > I can send a separate patch for it once the tester covers it.  I don't
> > > > want to add it to the original patch as it is a significant behavior
> > > > change compared to the existing code.
> > > > 
> > > 
> > > Attached is an updated version of my HMM tests based on linux-5.6.0-rc6.
> > > I ran this OK with Jason's 8+1 HMM patches, Christoph's 1-5 misc HMM clean ups,
> > > and Christoph's 1-4 device private page changes applied.
> > 
> > I'd like to get this to mergable, it looks pretty good now, but I have
> > no idea about selftests - and I'm struggling to even compile the tools
> > dir
> > 
> > > diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> > > index 69def4a9df00..4d22ce7879a7 100644
> > > +++ b/lib/Kconfig.debug
> > > @@ -2162,6 +2162,18 @@ config TEST_MEMINIT
> > >   	  If unsure, say N.
> > > +config TEST_HMM
> > > +	tristate "Test HMM (Heterogeneous Memory Management)"
> > > +	depends on DEVICE_PRIVATE
> > > +	select HMM_MIRROR
> > > +        select MMU_NOTIFIER
> > 
> > extra spaces
> 
> Will fix in v8.
> 
> > In general I wonder if it even makes sense that DEVICE_PRIVATE is user
> > selectable?
> 
> Should tests enable the feature or the feature enable the test?
> IMHO, if the feature is being compiled into the kernel, that should
> enable the menu item for the test. If the feature isn't selected,
> no need to test it :-)

I meant: should DEVICE_PRIVATE be a user-selectable option at all, or
should it be turned on when a driver like nouveau is selected.

Is there some downside to enabling DEVICE_PRIVATE?

> > The notifier holds a mmgrab, no need for another one
> 
> OK. I'll replace dmirror->mm with dmirror->notifier.mm.

Right that is good too

> > > +	filp->private_data = dmirror;
> > 
> > Not sure what this comment means
> 
> I'll change the comment to:
> 	  /*
>          * The first open of the device character file registers the address
>          * space of the process doing the open() system call with the device.
>          * Subsequent file opens by other processes will have access to the
>          * first process' address space.
>          */

How does this happen? The function looks like it always does the same thing

> > > +static bool dmirror_interval_invalidate(struct mmu_interval_notifier *mni,
> > > +				const struct mmu_notifier_range *range,
> > > +				unsigned long cur_seq)
> > > +{
> > > +	struct dmirror *dmirror = container_of(mni, struct dmirror, notifier);
> > > +	struct mm_struct *mm = dmirror->mm;
> > > +
> > > +	/*
> > > +	 * If the process doesn't exist, we don't need to invalidate the
> > > +	 * device page table since the address space will be torn down.
> > > +	 */
> > > +	if (!mmget_not_zero(mm))
> > > +		return true;
> > 
> > Why? Don't the notifiers provide for this already.
> > 
> > mmget_not_zero() is required before calling hmm_range_fault() though

Oh... This is the invalidate_all path during invalidation

IMHO you should test the invalidation reason in the range to exclude
this.

But xa_erase looks totally safe so there should be no reason to do
that.

> This is a workaround for a problem I don't quite understand.
> If you change tools/testing/selftests/vm/hmm-tests.c line 868 to
> 	ASSERT_EQ(ret, -1);
> Then the test will abort, core dump, and cause two problems,
> 1) the migrated page will be faulted back to system memory in order to write
>    it to the core dump. This triggers lockdep_assert_held(&walk.mm->mmap_sem)
>    in walk_page_range().

Has the migration stuff become entangled with the xarray?

> [  137.980718] Code: 80 2f 1a 83 c6 05 e9 8d 7b 01 01 e8 3e b1 b1 fe e9 05 ff ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 41 56 41 55 41 54 55 <48> 89 fd 53 4c 8d 6d 10 e8 3c fc ff ff 49 89 c4 4c 89 e0 83 e0 03
> [  137.999461] RSP: 0018:ffffc900015e77c8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
> [  138.007028] RAX: ffff8886e508c408 RBX: 0000000000000000 RCX: ffffffff82626c89
> [  138.014159] RDX: dffffc0000000000 RSI: 0000000000000000 RDI: ffffc900015e78a0
> [  138.021293] RBP: ffffc900015e78a0 R08: ffffffff811461c4 R09: fffff520002bcf17
> [  138.028426] R10: fffff520002bcf16 R11: 0000000000000003 R12: 0000000002606d10
> [  138.035557] R13: ffff8886e508c448 R14: 0000000000000031 R15: ffffffffa06546a0
> [  138.042701]  ? do_raw_spin_lock+0x104/0x1d0
> [  138.046888]  ? xas_store+0x19/0xa60
> [  138.050390]  xas_store+0x5b3/0xa60
> [  138.053806]  ? register_lock_class+0x860/0x860
> [  138.058267]  __xa_erase+0x96/0x110
> [  138.061673]  ? xas_store+0xa60/0xa60
> [  138.065267]  xa_erase+0x19/0x30

oh, it is doing this:

static void mn_itree_release(struct mmu_notifier_subscriptions *subscriptions,
                             struct mm_struct *mm)
{
        struct mmu_notifier_range range = {
                .flags = MMU_NOTIFIER_RANGE_BLOCKABLE,
                .event = MMU_NOTIFY_RELEASE,
                .mm = mm,
                .start = 0,
                .end = ULONG_MAX,
        };

i.e. it is sitting there doing a huge number of xa_erase calls, I suppose. Probably
in a normal exit the notifier is removed before the mm is destroyed.

The xa_erase loop needs to be a bit smarter and jump over gaps in the tree;
perhaps some

xa_for_each()
   xa_erase()

pattern?
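
Untested sketch of that pattern (dmirror->pt and dmirror->mutex as in the
posted driver, the function name is made up):

	static void dmirror_erase_all(struct dmirror *dmirror)
	{
		unsigned long pfn;
		void *entry;

		mutex_lock(&dmirror->mutex);
		/* only visit populated entries instead of every pfn in the range */
		xa_for_each(&dmirror->pt, pfn, entry)
			xa_erase(&dmirror->pt, pfn);
		mutex_unlock(&dmirror->mutex);
	}

(Or simply xa_destroy() if the whole tree is going away anyway.)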

> > Also I get this:
> > 
> > lib/test_hmm.c: In function ‘dmirror_devmem_fault_alloc_and_copy’:
> > lib/test_hmm.c:1041:25: warning: unused variable ‘vma’ [-Wunused-variable]
> >   1041 |  struct vm_area_struct *vma = args->vma;
> > 
> > But this is a kernel bug, due to alloc_page_vma being a #define not a
> > static inline and me having CONFIG_NUMA off in this .config
> 
> Fixed.

in gfp.h?

Jason

* Re: [PATCH 3/4] mm: simplify device private page handling in hmm_range_fault
  2020-03-17 23:14               ` Ralph Campbell
  2020-03-19 18:17                 ` Jason Gunthorpe
@ 2020-03-20  0:14                 ` Jason Gunthorpe
  2020-03-20  1:33                   ` Ralph Campbell
  1 sibling, 1 reply; 38+ messages in thread
From: Jason Gunthorpe @ 2020-03-20  0:14 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: amd-gfx, nouveau, dri-devel, kvm-ppc, Christoph Hellwig,
	linux-mm, Jerome Glisse, Ben Skeggs, Dan Williams, Bharata B Rao,
	Christian König

On Tue, Mar 17, 2020 at 04:14:31PM -0700, Ralph Campbell wrote:

> +static int dmirror_fault(struct dmirror *dmirror, unsigned long start,
> +			 unsigned long end, bool write)
> +{
> +	struct mm_struct *mm = dmirror->mm;
> +	unsigned long addr;
> +	uint64_t pfns[64];
> +	struct hmm_range range = {
> +		.notifier = &dmirror->notifier,
> +		.pfns = pfns,
> +		.flags = dmirror_hmm_flags,
> +		.values = dmirror_hmm_values,
> +		.pfn_shift = DPT_SHIFT,
> +		.pfn_flags_mask = ~(dmirror_hmm_flags[HMM_PFN_VALID] |
> +				    dmirror_hmm_flags[HMM_PFN_WRITE]),

Since pfns is not initialized pfn_flags_mask should be 0.

> +		.default_flags = dmirror_hmm_flags[HMM_PFN_VALID] |
> +				(write ? dmirror_hmm_flags[HMM_PFN_WRITE] : 0),
> +		.dev_private_owner = dmirror->mdevice,
> +	};
> +	int ret = 0;

> +static int dmirror_snapshot(struct dmirror *dmirror,
> +			    struct hmm_dmirror_cmd *cmd)
> +{
> +	struct mm_struct *mm = dmirror->mm;
> +	unsigned long start, end;
> +	unsigned long size = cmd->npages << PAGE_SHIFT;
> +	unsigned long addr;
> +	unsigned long next;
> +	uint64_t pfns[64];
> +	unsigned char perm[64];
> +	char __user *uptr;
> +	struct hmm_range range = {
> +		.pfns = pfns,
> +		.flags = dmirror_hmm_flags,
> +		.values = dmirror_hmm_values,
> +		.pfn_shift = DPT_SHIFT,
> +		.pfn_flags_mask = ~0ULL,

Same here, especially since this is snapshot

Jason

* Re: [PATCH 3/4] mm: simplify device private page handling in hmm_range_fault
  2020-03-20  0:14                 ` Jason Gunthorpe
@ 2020-03-20  1:33                   ` Ralph Campbell
  2020-03-20 12:58                     ` Jason Gunthorpe
  0 siblings, 1 reply; 38+ messages in thread
From: Ralph Campbell @ 2020-03-20  1:33 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: amd-gfx, nouveau, dri-devel, kvm-ppc, Christoph Hellwig,
	linux-mm, Jerome Glisse, Ben Skeggs, Dan Williams, Bharata B Rao,
	Christian König


On 3/19/20 5:14 PM, Jason Gunthorpe wrote:
> On Tue, Mar 17, 2020 at 04:14:31PM -0700, Ralph Campbell wrote:
> 
>> +static int dmirror_fault(struct dmirror *dmirror, unsigned long start,
>> +			 unsigned long end, bool write)
>> +{
>> +	struct mm_struct *mm = dmirror->mm;
>> +	unsigned long addr;
>> +	uint64_t pfns[64];
>> +	struct hmm_range range = {
>> +		.notifier = &dmirror->notifier,
>> +		.pfns = pfns,
>> +		.flags = dmirror_hmm_flags,
>> +		.values = dmirror_hmm_values,
>> +		.pfn_shift = DPT_SHIFT,
>> +		.pfn_flags_mask = ~(dmirror_hmm_flags[HMM_PFN_VALID] |
>> +				    dmirror_hmm_flags[HMM_PFN_WRITE]),
> 
> Since pfns is not initialized pfn_flags_mask should be 0.

Good point.

>> +		.default_flags = dmirror_hmm_flags[HMM_PFN_VALID] |
>> +				(write ? dmirror_hmm_flags[HMM_PFN_WRITE] : 0),
>> +		.dev_private_owner = dmirror->mdevice,
>> +	};
>> +	int ret = 0;
> 
>> +static int dmirror_snapshot(struct dmirror *dmirror,
>> +			    struct hmm_dmirror_cmd *cmd)
>> +{
>> +	struct mm_struct *mm = dmirror->mm;
>> +	unsigned long start, end;
>> +	unsigned long size = cmd->npages << PAGE_SHIFT;
>> +	unsigned long addr;
>> +	unsigned long next;
>> +	uint64_t pfns[64];
>> +	unsigned char perm[64];
>> +	char __user *uptr;
>> +	struct hmm_range range = {
>> +		.pfns = pfns,
>> +		.flags = dmirror_hmm_flags,
>> +		.values = dmirror_hmm_values,
>> +		.pfn_shift = DPT_SHIFT,
>> +		.pfn_flags_mask = ~0ULL,
> 
> Same here, especially since this is snapshot
> 
> Jason

Actually, snapshot ignores pfn_flags_mask and default_flags.
hmm_pte_need_fault() checks HMM_FAULT_SNAPSHOT and returns early, before it
looks at pfn_flags_mask and default_flags, since no faults are being requested.
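
Roughly (paraphrasing the current mm/hmm.c from memory, simplified):

	static void hmm_pte_need_fault(const struct hmm_vma_walk *hmm_vma_walk,
				       uint64_t pfns, uint64_t cpu_flags,
				       bool *fault, bool *write_fault)
	{
		struct hmm_range *range = hmm_vma_walk->range;

		if (hmm_vma_walk->flags & HMM_FAULT_SNAPSHOT)
			return;		/* snapshot: never fault */

		/* only now do default_flags and pfn_flags_mask matter */
		pfns = (pfns & range->pfn_flags_mask) | range->default_flags;
		if (!(pfns & range->flags[HMM_PFN_VALID]))
			return;
		*fault = !(cpu_flags & range->flags[HMM_PFN_VALID]);
		if ((pfns & range->flags[HMM_PFN_WRITE]) &&
		    !(cpu_flags & range->flags[HMM_PFN_WRITE])) {
			*write_fault = true;
			*fault = true;
		}
	}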


* Re: [PATCH 3/4] mm: simplify device private page handling in hmm_range_fault
  2020-03-20  1:33                   ` Ralph Campbell
@ 2020-03-20 12:58                     ` Jason Gunthorpe
  0 siblings, 0 replies; 38+ messages in thread
From: Jason Gunthorpe @ 2020-03-20 12:58 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: amd-gfx, nouveau, dri-devel, kvm-ppc, Christoph Hellwig,
	linux-mm, Jerome Glisse, Ben Skeggs, Dan Williams, Bharata B Rao,
	Christian König

On Thu, Mar 19, 2020 at 06:33:04PM -0700, Ralph Campbell wrote:

> > > +		.default_flags = dmirror_hmm_flags[HMM_PFN_VALID] |
> > > +				(write ? dmirror_hmm_flags[HMM_PFN_WRITE] : 0),
> > > +		.dev_private_owner = dmirror->mdevice,
> > > +	};
> > > +	int ret = 0;
> > 
> > > +static int dmirror_snapshot(struct dmirror *dmirror,
> > > +			    struct hmm_dmirror_cmd *cmd)
> > > +{
> > > +	struct mm_struct *mm = dmirror->mm;
> > > +	unsigned long start, end;
> > > +	unsigned long size = cmd->npages << PAGE_SHIFT;
> > > +	unsigned long addr;
> > > +	unsigned long next;
> > > +	uint64_t pfns[64];
> > > +	unsigned char perm[64];
> > > +	char __user *uptr;
> > > +	struct hmm_range range = {
> > > +		.pfns = pfns,
> > > +		.flags = dmirror_hmm_flags,
> > > +		.values = dmirror_hmm_values,
> > > +		.pfn_shift = DPT_SHIFT,
> > > +		.pfn_flags_mask = ~0ULL,
> > 
> > Same here, especially since this is snapshot
> > 
> > Jason
> 
> Actually, snapshot ignores pfn_flags_mask and default_flags.

Yes, so there is no reason to set them to anything other than 0.

Jason

* Re: [PATCH 4/4] mm: check the device private page owner in hmm_range_fault
  2020-03-16 19:32 ` [PATCH 4/4] mm: check the device private page owner " Christoph Hellwig
  2020-03-16 19:49   ` Jason Gunthorpe
  2020-03-16 23:11   ` Ralph Campbell
@ 2020-03-20 13:41   ` Jason Gunthorpe
  2020-03-21  8:22     ` Christoph Hellwig
  2 siblings, 1 reply; 38+ messages in thread
From: Jason Gunthorpe @ 2020-03-20 13:41 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: amd-gfx, nouveau, dri-devel, kvm-ppc, Bharata B Rao, linux-mm,
	Jerome Glisse, Ben Skeggs, Dan Williams, Christian König

On Mon, Mar 16, 2020 at 08:32:16PM +0100, Christoph Hellwig wrote:
> diff --git a/mm/hmm.c b/mm/hmm.c
> index cfad65f6a67b..b75b3750e03d 100644
> +++ b/mm/hmm.c
> @@ -216,6 +216,14 @@ int hmm_vma_handle_pmd(struct mm_walk *walk, unsigned long addr,
>  		unsigned long end, uint64_t *pfns, pmd_t pmd);
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
> +static inline bool hmm_is_device_private_entry(struct hmm_range *range,
> +		swp_entry_t entry)
> +{
> +	return is_device_private_entry(entry) &&
> +		device_private_entry_to_page(entry)->pgmap->owner ==
> +		range->dev_private_owner;
> +}

Thinking about this some more, does the locking work out here?

hmm_range_fault() runs with mmap_sem in read, and does not lock any of
the page table levels.

So it relies on accessing stale pte data being safe, and here we
introduce for the first time a page pointer dereference and a pgmap
dereference without any locking/refcounting.

The get_dev_pagemap() worked on the PFN and obtained a refcount, so it
created safety.

Is there some tricky reason this is safe, eg a DEVICE_PRIVATE page
cannot be removed from the vma without holding mmap_sem in write or
something?

Jason

* Re: [PATCH 3/4] mm: simplify device private page handling in hmm_range_fault
  2020-03-20  0:03                     ` Jason Gunthorpe
@ 2020-03-21  8:20                       ` Christoph Hellwig
  0 siblings, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2020-03-21  8:20 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Ralph Campbell, amd-gfx, nouveau, dri-devel, kvm-ppc,
	Christoph Hellwig, linux-mm, Jerome Glisse, Ben Skeggs,
	linux-kselftest, Dan Williams, Bharata B Rao,
	Christian König

On Thu, Mar 19, 2020 at 09:03:45PM -0300, Jason Gunthorpe wrote:
> > Should tests enable the feature or the feature enable the test?
> > IMHO, if the feature is being compiled into the kernel, that should
> > enable the menu item for the test. If the feature isn't selected,
> > no need to test it :-)
> 
> I ment if DEVICE_PRIVATE should be a user selectable option at all, or
> should it be turned on when a driver like nouveau is selected.

I don't think it should be user selectable.  This is an implementation
detail users can't be expected to know about.

> Is there some downside to enabling DEVICE_PRIVATE?

The option itself adds a little more code to the core kernel, and
introduces a few additional branches in core mm code.

But more importantly it pulls in the whole pgmap infrastructure.

* Re: [PATCH 4/4] mm: check the device private page owner in hmm_range_fault
  2020-03-20 13:41   ` Jason Gunthorpe
@ 2020-03-21  8:22     ` Christoph Hellwig
  2020-03-21 12:38       ` Jason Gunthorpe
  0 siblings, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2020-03-21  8:22 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: amd-gfx, nouveau, dri-devel, kvm-ppc, Christoph Hellwig,
	linux-mm, Jerome Glisse, Ben Skeggs, Dan Williams, Bharata B Rao,
	Christian König

On Fri, Mar 20, 2020 at 10:41:09AM -0300, Jason Gunthorpe wrote:
> Thinking about this some more, does the locking work out here?
> 
> hmm_range_fault() runs with mmap_sem in read, and does not lock any of
> the page table levels.
> 
> So it relies on accessing stale pte data being safe, and here we
> introduce for the first time a page pointer dereference and a pgmap
> dereference without any locking/refcounting.
> 
> The get_dev_pagemap() worked on the PFN and obtained a refcount, so it
> created safety.
> 
> Is there some tricky reason this is safe, eg a DEVICE_PRIVATE page
> cannot be removed from the vma without holding mmap_sem in write or
> something?

I don't think there is any specific protection.  Let me see if we
can throw in a get_dev_pagemap here - note that current mainline doesn't
even use it for this path..
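
Untested sketch of that idea (not a posted patch) - rework the
hmm_is_device_private_entry() helper from patch 4 so the pagemap reference
is taken via the pfn before the owner is checked:

	static bool hmm_is_device_private_entry(struct hmm_range *range,
			swp_entry_t entry)
	{
		struct dev_pagemap *pgmap;
		bool owned;

		if (!is_device_private_entry(entry))
			return false;
		/* for device private entries the swap offset is the pfn */
		pgmap = get_dev_pagemap(swp_offset(entry), NULL);
		if (!pgmap)
			return false;
		owned = pgmap->owner == range->dev_private_owner;
		put_dev_pagemap(pgmap);
		return owned;
	}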

* Re: [PATCH 4/4] mm: check the device private page owner in hmm_range_fault
  2020-03-21  8:22     ` Christoph Hellwig
@ 2020-03-21 12:38       ` Jason Gunthorpe
  2020-03-21 15:18         ` Christoph Hellwig
  0 siblings, 1 reply; 38+ messages in thread
From: Jason Gunthorpe @ 2020-03-21 12:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: amd-gfx, nouveau, dri-devel, kvm-ppc, Bharata B Rao, linux-mm,
	Jerome Glisse, Ben Skeggs, Dan Williams, Christian König

On Sat, Mar 21, 2020 at 09:22:36AM +0100, Christoph Hellwig wrote:
> On Fri, Mar 20, 2020 at 10:41:09AM -0300, Jason Gunthorpe wrote:
> > Thinking about this some more, does the locking work out here?
> > 
> > hmm_range_fault() runs with mmap_sem in read, and does not lock any of
> > the page table levels.
> > 
> > So it relies on accessing stale pte data being safe, and here we
> > introduce for the first time a page pointer dereference and a pgmap
> > dereference without any locking/refcounting.
> > 
> > The get_dev_pagemap() worked on the PFN and obtained a refcount, so it
> > created safety.
> > 
> > Is there some tricky reason this is safe, eg a DEVICE_PRIVATE page
> > cannot be removed from the vma without holding mmap_sem in write or
> > something?
> 
> I don't think there is any specific protection.  Let me see if we
> can throw in a get_dev_pagemap here

The page tables are RCU protected right? could we do something like

 if (is_device_private_entry()) {
       rcu_read_lock()
       if (READ_ONCE(*ptep) != pte)
           return -EBUSY;
       hmm_is_device_private_entry()
       rcu_read_unlock()
 }

?

Then pgmap needs a synchronize_rcu before the struct pages are
destroyed (possibly gup_fast already requires this?)
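
To make that a bit more concrete (completely untested, and assuming both the
page-table RCU protection and a synchronize_rcu in the pgmap teardown hold):

	/*
	 * Returns 1 if the entry belongs to dev_private_owner, 0 if not,
	 * -EBUSY if the pte changed under us and the caller should retry.
	 */
	static int hmm_device_private_owned(struct hmm_range *range,
					    pte_t *ptep, pte_t pte,
					    swp_entry_t entry)
	{
		int ret;

		rcu_read_lock();
		if (!pte_same(READ_ONCE(*ptep), pte))
			ret = -EBUSY;
		else
			ret = device_private_entry_to_page(entry)->pgmap->owner ==
				range->dev_private_owner;
		rcu_read_unlock();
		return ret;
	}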

I've got some other patches trying to close some of these styles of
bugs, but 

> note that current mainline doesn't even use it for this path..

Don't follow?

Jason

* Re: [PATCH 4/4] mm: check the device private page owner in hmm_range_fault
  2020-03-21 12:38       ` Jason Gunthorpe
@ 2020-03-21 15:18         ` Christoph Hellwig
  0 siblings, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2020-03-21 15:18 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: amd-gfx, nouveau, dri-devel, kvm-ppc, Christoph Hellwig,
	linux-mm, Jerome Glisse, Ben Skeggs, Dan Williams, Bharata B Rao,
	Christian König

On Sat, Mar 21, 2020 at 09:38:04AM -0300, Jason Gunthorpe wrote:
> > I don't think there is any specific protection.  Let me see if we
> > can throw in a get_dev_pagemap here
> 
> The page tables are RCU protected right? could we do something like
> 
>  if (is_device_private_entry()) {
>        rcu_read_lock()
>        if (READ_ONCE(*ptep) != pte)
>            return -EBUSY;
>        hmm_is_device_private_entry()
>        rcu_read_unlock()
>  }
> 
> ?

Are they RCU protected everywhere?  I'd really love to hear from people who really
know this area.

> 
> Then pgmap needs a synchronize_rcu before the struct page's are
> destroyed (possibly gup_fast already requires this?)
> 
> I've got some other patches trying to close some of these styles of
> bugs, but 
> 
> > note that current mainline doesn't even use it for this path..
> 
> Don't follow?

If you look at mainline (or any other tree), we only do a
get_dev_pagemap for devmap ptes.  But device private pages are encoded
as non-present swap ptes.

end of thread, other threads:[~2020-03-23  5:03 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-16 19:32 ensure device private pages have an owner v2 Christoph Hellwig
2020-03-16 19:32 ` [PATCH 1/4] memremap: add an owner field to struct dev_pagemap Christoph Hellwig
2020-03-16 20:55   ` Ralph Campbell
2020-03-16 19:32 ` [PATCH 2/4] mm: handle multiple owners of device private pages in migrate_vma Christoph Hellwig
2020-03-16 21:43   ` Ralph Campbell
2020-03-16 19:32 ` [PATCH 3/4] mm: simplify device private page handling in hmm_range_fault Christoph Hellwig
2020-03-16 19:59   ` Jason Gunthorpe
2020-03-16 21:33     ` Christoph Hellwig
2020-03-16 22:49   ` Ralph Campbell
2020-03-17  7:34     ` Christoph Hellwig
2020-03-17 22:43       ` Ralph Campbell
2020-03-18  9:34         ` Christoph Hellwig
2020-03-17 12:15     ` Jason Gunthorpe
2020-03-17 12:24       ` Christoph Hellwig
2020-03-17 12:28         ` Christoph Hellwig
2020-03-17 12:47           ` Jason Gunthorpe
2020-03-17 12:59             ` Christoph Hellwig
2020-03-17 17:32               ` Jason Gunthorpe
2020-03-17 23:14               ` Ralph Campbell
2020-03-19 18:17                 ` Jason Gunthorpe
2020-03-19 22:56                   ` Ralph Campbell
2020-03-20  0:03                     ` Jason Gunthorpe
2020-03-21  8:20                       ` Christoph Hellwig
2020-03-20  0:14                 ` Jason Gunthorpe
2020-03-20  1:33                   ` Ralph Campbell
2020-03-20 12:58                     ` Jason Gunthorpe
2020-03-16 19:32 ` [PATCH 4/4] mm: check the device private page owner " Christoph Hellwig
2020-03-16 19:49   ` Jason Gunthorpe
2020-03-16 23:11   ` Ralph Campbell
2020-03-20 13:41   ` Jason Gunthorpe
2020-03-21  8:22     ` Christoph Hellwig
2020-03-21 12:38       ` Jason Gunthorpe
2020-03-21 15:18         ` Christoph Hellwig
2020-03-17  5:31 ` ensure device private pages have an owner v2 Bharata B Rao
2020-03-19  0:28 ` Jason Gunthorpe
2020-03-19  7:16   ` Christoph Hellwig
2020-03-19 11:50     ` Jason Gunthorpe
2020-03-19 18:50     ` Jason Gunthorpe
