* [PATCH v5 0/2] mm/hmm/test: add self tests for HMM
@ 2019-12-16 19:57 Ralph Campbell
  2019-12-16 19:57 ` [PATCH v5 1/2] mm/mmu_notifier: make interval notifier updates safe Ralph Campbell
  2019-12-16 19:57 ` [PATCH v5 2/2] mm/hmm/test: add self tests for HMM Ralph Campbell
  0 siblings, 2 replies; 12+ messages in thread
From: Ralph Campbell @ 2019-12-16 19:57 UTC (permalink / raw)
  To: linux-rdma, linux-mm, linux-kernel, linux-kselftest
  Cc: Jerome Glisse, John Hubbard, Christoph Hellwig, Jason Gunthorpe,
	Andrew Morton, Shuah Khan, Ralph Campbell

This series adds basic self tests for HMM and mmu interval notifiers so
that changes can be validated. It is based on linux-5.5.0-rc1 and is
intended for Jason's rdma/hmm tree since I believe he is planning some
interval notifier changes. Patch 2 was last posted as part of [1], but
the other patches in that series have been merged, and this version of
the HMM tests is modified to address Jason's concern about using
process-wide MMU notifiers in combination with MMU interval notifiers.
Therefore, patch 1 adds some core functionality to allow intervals to be
updated from within the invalidate() callback so that MMU_NOTIFY_UNMAP
events can update the range being tracked.

[1] https://lore.kernel.org/linux-mm/20191104222141.5173-1-rcampbell@nvidia.com

Ralph Campbell (2):
  mm/mmu_notifier: make interval notifier updates safe
  mm/hmm/test: add self tests for HMM

 MAINTAINERS                            |    3 +
 include/linux/mmu_notifier.h           |   15 +
 lib/Kconfig.debug                      |   11 +
 lib/Makefile                           |    1 +
 lib/test_hmm.c                         | 1367 ++++++++++++++++++++++++
 mm/mmu_notifier.c                      |  196 +++-
 tools/testing/selftests/vm/.gitignore  |    1 +
 tools/testing/selftests/vm/Makefile    |    3 +
 tools/testing/selftests/vm/config      |    2 +
 tools/testing/selftests/vm/hmm-tests.c | 1360 +++++++++++++++++++++++
 tools/testing/selftests/vm/run_vmtests |   16 +
 tools/testing/selftests/vm/test_hmm.sh |   97 ++
 12 files changed, 3047 insertions(+), 25 deletions(-)
 create mode 100644 lib/test_hmm.c
 create mode 100644 tools/testing/selftests/vm/hmm-tests.c
 create mode 100755 tools/testing/selftests/vm/test_hmm.sh

-- 
2.20.1



* [PATCH v5 1/2] mm/mmu_notifier: make interval notifier updates safe
  2019-12-16 19:57 [PATCH v5 0/2] mm/hmm/test: add self tests for HMM Ralph Campbell
@ 2019-12-16 19:57 ` Ralph Campbell
  2019-12-17 20:51   ` Jason Gunthorpe
  2020-01-09 19:48   ` Jason Gunthorpe
  2019-12-16 19:57 ` [PATCH v5 2/2] mm/hmm/test: add self tests for HMM Ralph Campbell
  1 sibling, 2 replies; 12+ messages in thread
From: Ralph Campbell @ 2019-12-16 19:57 UTC (permalink / raw)
  To: linux-rdma, linux-mm, linux-kernel, linux-kselftest
  Cc: Jerome Glisse, John Hubbard, Christoph Hellwig, Jason Gunthorpe,
	Andrew Morton, Shuah Khan, Ralph Campbell

mmu_interval_notifier_insert() and mmu_interval_notifier_remove() can't
be called safely from inside the invalidate() callback. This is fine for
devices with explicit memory region register and unregister calls, but
from a programming model standpoint it is desirable not to require
explicit memory region registration. Regions can be registered based on
device address faults, but without a mechanism for updating or removing
the mmu interval notifiers in response to munmap(), the invalidation
callbacks will be for regions that are stale or that apply to different
mmap()ed regions.

The invalidate() callback provides the necessary information (i.e., the
event type MMU_NOTIFY_UNMAP), so add insert, remove, and update functions
that are safe to call from the invalidate() callback. This is done by
extending the work in mn_itree_inv_end() to update the interval tree when
it is not being traversed.
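
For illustration, here is a rough sketch of the invalidate() pattern this
enables, modeled on the dmirror test driver in patch 2 (struct my_range,
the mirror lock, and my_clear_device_ptes() are illustrative placeholders,
not part of this patch):

	static bool my_invalidate(struct mmu_interval_notifier *mni,
				  const struct mmu_notifier_range *range,
				  unsigned long cur_seq)
	{
		struct my_range *r = container_of(mni, struct my_range,
						  notifier);

		/* Serialize against the device page table walkers. */
		if (mmu_notifier_range_blockable(range))
			mutex_lock(&r->mirror->lock);
		else if (!mutex_trylock(&r->mirror->lock))
			return false;

		mmu_interval_set_seq(mni, cur_seq);
		my_clear_device_ptes(r->mirror, range->start, range->end);

		if (range->event == MMU_NOTIFY_UNMAP) {
			if (range->start <= r->start && range->end > r->last) {
				/* Whole range unmapped; release() frees r. */
				mmu_interval_notifier_remove(mni);
			} else if (range->start <= r->start) {
				/* Keep only the right-hand part. */
				mmu_interval_notifier_update(mni, range->end,
						r->last - range->end + 1);
				r->start = range->end;
			}
			/* Trimming from the right / splitting elided. */
		}

		mutex_unlock(&r->mirror->lock);
		return true;
	}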

Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
---
 include/linux/mmu_notifier.h |  15 +++
 mm/mmu_notifier.c            | 196 ++++++++++++++++++++++++++++++-----
 2 files changed, 186 insertions(+), 25 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 9e6caa8ecd19..55fbefcdc564 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -233,11 +233,18 @@ struct mmu_notifier {
  * @invalidate: Upon return the caller must stop using any SPTEs within this
  *              range. This function can sleep. Return false only if sleeping
  *              was required but mmu_notifier_range_blockable(range) is false.
+ * @release:	This function will be called when the mmu_interval_notifier
+ *		is removed from the interval tree. Defining this function also
+ *		allows mmu_interval_notifier_remove() and
+ *		mmu_interval_notifier_update() to be called from the
+ *		invalidate() callback function (i.e., they won't block waiting
+ *		for invalidations to finish).
  */
 struct mmu_interval_notifier_ops {
 	bool (*invalidate)(struct mmu_interval_notifier *mni,
 			   const struct mmu_notifier_range *range,
 			   unsigned long cur_seq);
+	void (*release)(struct mmu_interval_notifier *mni);
 };
 
 struct mmu_interval_notifier {
@@ -246,6 +253,8 @@ struct mmu_interval_notifier {
 	struct mm_struct *mm;
 	struct hlist_node deferred_item;
 	unsigned long invalidate_seq;
+	unsigned long deferred_start;
+	unsigned long deferred_last;
 };
 
 #ifdef CONFIG_MMU_NOTIFIER
@@ -299,7 +308,13 @@ int mmu_interval_notifier_insert_locked(
 	struct mmu_interval_notifier *mni, struct mm_struct *mm,
 	unsigned long start, unsigned long length,
 	const struct mmu_interval_notifier_ops *ops);
+int mmu_interval_notifier_insert_safe(
+	struct mmu_interval_notifier *mni, struct mm_struct *mm,
+	unsigned long start, unsigned long length,
+	const struct mmu_interval_notifier_ops *ops);
 void mmu_interval_notifier_remove(struct mmu_interval_notifier *mni);
+int mmu_interval_notifier_update(struct mmu_interval_notifier *mni,
+				 unsigned long start, unsigned long length);
 
 /**
  * mmu_interval_set_seq - Save the invalidation sequence
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index f76ea05b1cb0..c303285750b2 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -129,6 +129,7 @@ static void mn_itree_inv_end(struct mmu_notifier_mm *mmn_mm)
 {
 	struct mmu_interval_notifier *mni;
 	struct hlist_node *next;
+	struct hlist_head removed_list;
 
 	spin_lock(&mmn_mm->lock);
 	if (--mmn_mm->active_invalidate_ranges ||
@@ -144,21 +145,47 @@ static void mn_itree_inv_end(struct mmu_notifier_mm *mmn_mm)
 	 * The inv_end incorporates a deferred mechanism like rtnl_unlock().
 	 * Adds and removes are queued until the final inv_end happens then
 	 * they are progressed. This arrangement for tree updates is used to
-	 * avoid using a blocking lock during invalidate_range_start.
+	 * avoid using a blocking lock while walking the interval tree.
 	 */
+	INIT_HLIST_HEAD(&removed_list);
 	hlist_for_each_entry_safe(mni, next, &mmn_mm->deferred_list,
 				  deferred_item) {
+		hlist_del(&mni->deferred_item);
 		if (RB_EMPTY_NODE(&mni->interval_tree.rb))
 			interval_tree_insert(&mni->interval_tree,
 					     &mmn_mm->itree);
-		else
+		else {
 			interval_tree_remove(&mni->interval_tree,
 					     &mmn_mm->itree);
-		hlist_del(&mni->deferred_item);
+			if (mni->deferred_last) {
+				mni->interval_tree.start = mni->deferred_start;
+				mni->interval_tree.last = mni->deferred_last;
+				mni->deferred_last = 0;
+				interval_tree_insert(&mni->interval_tree,
+						     &mmn_mm->itree);
+			} else if (mni->ops->release)
+				hlist_add_head(&mni->deferred_item,
+					       &removed_list);
+		}
 	}
 	spin_unlock(&mmn_mm->lock);
 
 	wake_up_all(&mmn_mm->wq);
+
+	/*
+	 * Interval notifiers with a release() operation get their release()
+	 * callback here instead of mmu_interval_notifier_remove() blocking
+	 * on the wake up above.
+	 */
+	hlist_for_each_entry_safe(mni, next, &removed_list, deferred_item) {
+		struct mm_struct *mm = mni->mm;
+
+		hlist_del(&mni->deferred_item);
+		mni->ops->release(mni);
+
+		/* pairs with mmgrab() in mmu_interval_notifier_insert() */
+		mmdrop(mm);
+	}
 }
 
 /**
@@ -856,6 +883,7 @@ static int __mmu_interval_notifier_insert(
 	mni->ops = ops;
 	RB_CLEAR_NODE(&mni->interval_tree.rb);
 	mni->interval_tree.start = start;
+	mni->deferred_last = 0;
 	/*
 	 * Note that the representation of the intervals in the interval tree
 	 * considers the ending point as contained in the interval.
@@ -913,16 +941,17 @@ static int __mmu_interval_notifier_insert(
 /**
  * mmu_interval_notifier_insert - Insert an interval notifier
  * @mni: Interval notifier to register
+ * @mm : mm_struct to attach to
  * @start: Starting virtual address to monitor
  * @length: Length of the range to monitor
- * @mm : mm_struct to attach to
+ * @ops: Interval notifier callback operations
  *
  * This function subscribes the interval notifier for notifications from the
- * mm.  Upon return the ops related to mmu_interval_notifier will be called
+ * mm.  Upon return, the ops related to mmu_interval_notifier will be called
  * whenever an event that intersects with the given range occurs.
  *
- * Upon return the range_notifier may not be present in the interval tree yet.
- * The caller must use the normal interval notifier read flow via
+ * Upon return, the mmu_interval_notifier may not be present in the interval
+ * tree yet.  The caller must use the normal interval notifier read flow via
  * mmu_interval_read_begin() to establish SPTEs for this range.
  */
 int mmu_interval_notifier_insert(struct mmu_interval_notifier *mni,
@@ -970,14 +999,52 @@ int mmu_interval_notifier_insert_locked(
 EXPORT_SYMBOL_GPL(mmu_interval_notifier_insert_locked);
 
 /**
- * mmu_interval_notifier_remove - Remove a interval notifier
- * @mni: Interval notifier to unregister
+ * mmu_interval_notifier_insert_safe - Insert an interval notifier
+ * @mni: Interval notifier to register
+ * @mm : mm_struct to attach to
+ * @start: Starting virtual address to monitor
+ * @length: Length of the range to monitor
+ * @ops: Interval notifier callback operations
  *
- * This function must be paired with mmu_interval_notifier_insert(). It cannot
- * be called from any ops callback.
+ * Return: -EINVAL if @mm hasn't been initialized for interval notifiers
+ *	by calling mmu_notifier_register(NULL, mm) or
+ *	__mmu_notifier_register(NULL, mm).
  *
- * Once this returns ops callbacks are no longer running on other CPUs and
- * will not be called in future.
+ * This function subscribes the interval notifier for notifications from the
+ * mm.  Upon return, the ops related to mmu_interval_notifier will be called
+ * whenever an event that intersects with the given range occurs.
+ *
+ * This function is safe to call from the ops->invalidate() function.
+ * Upon return, the mmu_interval_notifier may not be present in the interval
+ * tree yet.  The caller must use the normal interval notifier read flow via
+ * mmu_interval_read_begin() to establish SPTEs for this range.
+ */
+int mmu_interval_notifier_insert_safe(
+	struct mmu_interval_notifier *mni, struct mm_struct *mm,
+	unsigned long start, unsigned long length,
+	const struct mmu_interval_notifier_ops *ops)
+{
+	struct mmu_notifier_mm *mmn_mm;
+
+	mmn_mm = mm->mmu_notifier_mm;
+	if (!mmn_mm || !mmn_mm->has_itree)
+		return -EINVAL;
+	return __mmu_interval_notifier_insert(mni, mm, mmn_mm, start, length,
+					      ops);
+}
+EXPORT_SYMBOL_GPL(mmu_interval_notifier_insert_safe);
+
+/**
+ * mmu_interval_notifier_remove - Remove an interval notifier
+ * @mni: Interval notifier to unregister
+ *
+ * This function must be paired with mmu_interval_notifier_insert().
+ * If a mmu_interval_notifier_ops.release() function is defined,
+ * mmu_interval_notifier_remove() is safe to call from the invalidate()
+ * callback and the release() function will be called once all CPUs
+ * are finished using the notifier. Otherwise, mmu_interval_notifier_remove()
+ * cannot be called from any ops callback and will block waiting for
+ * invalidation callbacks to finish before returning.
  */
 void mmu_interval_notifier_remove(struct mmu_interval_notifier *mni)
 {
@@ -996,8 +1063,11 @@ void mmu_interval_notifier_remove(struct mmu_interval_notifier *mni)
 		if (RB_EMPTY_NODE(&mni->interval_tree.rb)) {
 			hlist_del(&mni->deferred_item);
 		} else {
-			hlist_add_head(&mni->deferred_item,
-				       &mmn_mm->deferred_list);
+			if (mni->deferred_last)
+				mni->deferred_last = 0;
+			else
+				hlist_add_head(&mni->deferred_item,
+					       &mmn_mm->deferred_list);
 			seq = mmn_mm->invalidate_seq;
 		}
 	} else {
@@ -1006,20 +1076,96 @@ void mmu_interval_notifier_remove(struct mmu_interval_notifier *mni)
 	}
 	spin_unlock(&mmn_mm->lock);
 
-	/*
-	 * The possible sleep on progress in the invalidation requires the
-	 * caller not hold any locks held by invalidation callbacks.
-	 */
-	lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
-	lock_map_release(&__mmu_notifier_invalidate_range_start_map);
-	if (seq)
+	if (mni->ops->release) {
+		if (!seq) {
+			mni->ops->release(mni);
+
+			/*
+			 * Pairs with mmgrab() in
+			 * mmu_interval_notifier_insert().
+			 */
+			mmdrop(mm);
+		}
+	} else {
+		/*
+		 * The possible sleep on progress in the invalidation requires
+		 * the caller not hold any locks held by invalidation
+		 * callbacks.
+		 */
+		lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
+		lock_map_release(&__mmu_notifier_invalidate_range_start_map);
+		if (seq)
+			wait_event(mmn_mm->wq,
+				   READ_ONCE(mmn_mm->invalidate_seq) != seq);
+
+		/* pairs with mmgrab in mmu_interval_notifier_insert() */
+		mmdrop(mm);
+	}
+}
+EXPORT_SYMBOL_GPL(mmu_interval_notifier_remove);
+
+/**
+ * mmu_interval_notifier_update - Update interval notifier end
+ * @mni: Interval notifier to update
+ * @start: New starting virtual address to monitor
+ * @length: New length of the range to monitor
+ *
+ * This function updates the range being monitored.
+ * If there is no release() function defined, the call will wait for the
+ * update to finish before returning.
+ */
+int mmu_interval_notifier_update(struct mmu_interval_notifier *mni,
+				 unsigned long start, unsigned long length)
+{
+	struct mm_struct *mm = mni->mm;
+	struct mmu_notifier_mm *mmn_mm = mm->mmu_notifier_mm;
+	unsigned long seq = 0;
+	unsigned long last;
+
+	if (length == 0 || check_add_overflow(start, length - 1, &last))
+		return -EOVERFLOW;
+
+	spin_lock(&mmn_mm->lock);
+	if (mn_itree_is_invalidating(mmn_mm)) {
+		/*
+		 * Update is being called after insert put this on the
+		 * deferred list, but before the deferred list was processed.
+		 */
+		if (RB_EMPTY_NODE(&mni->interval_tree.rb)) {
+			mni->interval_tree.start = start;
+			mni->interval_tree.last = last;
+		} else {
+			if (!mni->deferred_last)
+				hlist_add_head(&mni->deferred_item,
+					       &mmn_mm->deferred_list);
+			mni->deferred_start = start;
+			mni->deferred_last = last;
+		}
+		seq = mmn_mm->invalidate_seq;
+	} else {
+		WARN_ON(RB_EMPTY_NODE(&mni->interval_tree.rb));
+		interval_tree_remove(&mni->interval_tree, &mmn_mm->itree);
+		mni->interval_tree.start = start;
+		mni->interval_tree.last = last;
+		interval_tree_insert(&mni->interval_tree, &mmn_mm->itree);
+	}
+	spin_unlock(&mmn_mm->lock);
+
+	if (!mni->ops->release && seq) {
+		/*
+		 * The possible sleep on progress in the invalidation requires
+		 * the caller not hold any locks held by invalidation
+		 * callbacks.
+		 */
+		lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
+		lock_map_release(&__mmu_notifier_invalidate_range_start_map);
 		wait_event(mmn_mm->wq,
 			   READ_ONCE(mmn_mm->invalidate_seq) != seq);
+	}
 
-	/* pairs with mmgrab in mmu_interval_notifier_insert() */
-	mmdrop(mm);
+	return 0;
 }
-EXPORT_SYMBOL_GPL(mmu_interval_notifier_remove);
+EXPORT_SYMBOL_GPL(mmu_interval_notifier_update);
 
 /**
  * mmu_notifier_synchronize - Ensure all mmu_notifiers are freed
-- 
2.20.1



* [PATCH v5 2/2] mm/hmm/test: add self tests for HMM
  2019-12-16 19:57 [PATCH v5 0/2] mm/hmm/test: add self tests for HMM Ralph Campbell
  2019-12-16 19:57 ` [PATCH v5 1/2] mm/mmu_notifier: make interval notifier updates safe Ralph Campbell
@ 2019-12-16 19:57 ` Ralph Campbell
  1 sibling, 0 replies; 12+ messages in thread
From: Ralph Campbell @ 2019-12-16 19:57 UTC (permalink / raw)
  To: linux-rdma, linux-mm, linux-kernel, linux-kselftest
  Cc: Jerome Glisse, John Hubbard, Christoph Hellwig, Jason Gunthorpe,
	Andrew Morton, Shuah Khan, Ralph Campbell

Add self tests for HMM.
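
For context, a rough sketch of how a test drives the dmirror device,
condensed from hmm-tests.c below (the helper name is illustrative; needs
<fcntl.h>, <sys/ioctl.h>, <unistd.h> and <linux/test_hmm.h>):

	/* Ask the "device" to read npages starting at ptr into mirror. */
	static int dmirror_read_pages(void *ptr, unsigned long npages,
				      void *mirror)
	{
		struct hmm_dmirror_cmd cmd = { 0 };
		int fd = open("/dev/hmm_dmirror0", O_RDWR);

		if (fd < 0)
			return -1;
		cmd.addr = (__u64)ptr;		/* range in this address space */
		cmd.ptr = (__u64)mirror;	/* where the data is copied */
		cmd.npages = npages;
		if (ioctl(fd, HMM_DMIRROR_READ, &cmd) < 0) {
			close(fd);
			return -1;
		}
		/* cmd.cpages/cmd.faults: pages copied, faults taken. */
		close(fd);
		return 0;
	}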

Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 MAINTAINERS                            |    3 +
 lib/Kconfig.debug                      |   11 +
 lib/Makefile                           |    1 +
 lib/test_hmm.c                         | 1367 ++++++++++++++++++++++++
 tools/testing/selftests/vm/.gitignore  |    1 +
 tools/testing/selftests/vm/Makefile    |    3 +
 tools/testing/selftests/vm/config      |    2 +
 tools/testing/selftests/vm/hmm-tests.c | 1360 +++++++++++++++++++++++
 tools/testing/selftests/vm/run_vmtests |   16 +
 tools/testing/selftests/vm/test_hmm.sh |   97 ++
 10 files changed, 2861 insertions(+)
 create mode 100644 lib/test_hmm.c
 create mode 100644 tools/testing/selftests/vm/hmm-tests.c
 create mode 100755 tools/testing/selftests/vm/test_hmm.sh

diff --git a/MAINTAINERS b/MAINTAINERS
index 02d5278a4c9a..38cf016f8108 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7508,7 +7508,10 @@ L:	linux-mm@kvack.org
 S:	Maintained
 F:	mm/hmm*
 F:	include/linux/hmm*
+F:	include/uapi/linux/test_hmm*
 F:	Documentation/vm/hmm.rst
+F:	lib/test_hmm*
+F:	tools/testing/selftests/vm/*hmm*
 
 HOST AP DRIVER
 M:	Jouni Malinen <j@w1.fi>
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index d1842fe756d5..d3dbca0352c9 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2117,6 +2117,17 @@ config TEST_MEMINIT
 
 	  If unsure, say N.
 
+config TEST_HMM
+	tristate "Test HMM (Heterogeneous Memory Management)"
+	depends on HMM_MIRROR
+	depends on DEVICE_PRIVATE
+	help
+	  This is a pseudo device driver solely for testing HMM.
+	  Say M here if you want to build the HMM test module.
+	  Doing so will allow you to run tools/testing/selftest/vm/hmm-tests.
+
+	  If unsure, say N.
+
 endif # RUNTIME_TESTING_MENU
 
 config MEMTEST
diff --git a/lib/Makefile b/lib/Makefile
index 93217d44237f..348ce83cb21f 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -88,6 +88,7 @@ obj-$(CONFIG_TEST_OBJAGG) += test_objagg.o
 obj-$(CONFIG_TEST_STACKINIT) += test_stackinit.o
 obj-$(CONFIG_TEST_BLACKHOLE_DEV) += test_blackhole_dev.o
 obj-$(CONFIG_TEST_MEMINIT) += test_meminit.o
+obj-$(CONFIG_TEST_HMM) += test_hmm.o
 
 obj-$(CONFIG_TEST_LIVEPATCH) += livepatch/
 
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
new file mode 100644
index 000000000000..5b65f397df83
--- /dev/null
+++ b/lib/test_hmm.c
@@ -0,0 +1,1367 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * This is a module to test the HMM (Heterogeneous Memory Management)
+ * mirror and zone device private memory migration APIs of the kernel.
+ * Userspace programs can register with the driver to mirror their own address
+ * space and can use the device to read/write any valid virtual address.
+ */
+#include <linux/init.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/cdev.h>
+#include <linux/device.h>
+#include <linux/mutex.h>
+#include <linux/rwsem.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/highmem.h>
+#include <linux/delay.h>
+#include <linux/pagemap.h>
+#include <linux/hmm.h>
+#include <linux/vmalloc.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/sched/mm.h>
+#include <linux/platform_device.h>
+
+#include <uapi/linux/test_hmm.h>
+
+#define DMIRROR_NDEVICES		2
+#define DMIRROR_RANGE_FAULT_TIMEOUT	1000
+#define DEVMEM_CHUNK_SIZE		(256 * 1024 * 1024U)
+#define DEVMEM_CHUNKS_RESERVE		16
+
+static const struct dev_pagemap_ops dmirror_devmem_ops;
+static const struct mmu_interval_notifier_ops dmirror_min_ops;
+static dev_t dmirror_dev;
+static struct page *dmirror_zero_page;
+
+struct dmirror_device;
+
+struct dmirror_bounce {
+	void			*ptr;
+	unsigned long		size;
+	unsigned long		addr;
+	unsigned long		cpages;
+};
+
+#define DPT_SHIFT PAGE_SHIFT
+#define DPT_VALID (1UL << 0)
+#define DPT_WRITE (1UL << 1)
+#define DPT_DPAGE (1UL << 2)
+
+#define DPT_XA_TAG_WRITE 3UL
+
+static const uint64_t dmirror_hmm_flags[HMM_PFN_FLAG_MAX] = {
+	[HMM_PFN_VALID] = DPT_VALID,
+	[HMM_PFN_WRITE] = DPT_WRITE,
+	[HMM_PFN_DEVICE_PRIVATE] = DPT_DPAGE,
+};
+
+static const uint64_t dmirror_hmm_values[HMM_PFN_VALUE_MAX] = {
+	[HMM_PFN_NONE]    = 0,
+	[HMM_PFN_ERROR]   = 0x10,
+	[HMM_PFN_SPECIAL] = 0x10,
+};
+
+/*
+ * Data structure to track address ranges and register for mmu interval
+ * notifier updates.
+ */
+struct dmirror_interval {
+	struct interval_tree_node	tree_node;
+	struct mmu_interval_notifier	notifier;
+	struct dmirror			*dmirror;
+};
+
+/*
+ * Data attached to the open device file.
+ * Note that it might be shared after a fork().
+ */
+struct dmirror {
+	struct mm_struct	*mm;
+	struct dmirror_device	*mdevice;
+	struct xarray		pt;
+	struct rb_root_cached	ranges;
+	struct mutex		mutex;
+};
+
+/*
+ * ZONE_DEVICE pages for migration and simulating device memory.
+ */
+struct dmirror_chunk {
+	struct dev_pagemap	pagemap;
+	struct dmirror_device	*mdevice;
+};
+
+/*
+ * Per device data.
+ */
+struct dmirror_device {
+	struct cdev		cdevice;
+	struct hmm_devmem	*devmem;
+
+	unsigned int		devmem_capacity;
+	unsigned int		devmem_count;
+	struct dmirror_chunk	**devmem_chunks;
+	struct mutex		devmem_lock;	/* protects the above */
+
+	unsigned long		calloc;
+	unsigned long		cfree;
+	struct page		*free_pages;
+	spinlock_t		lock;		/* protects the above */
+};
+
+static struct dmirror_device dmirror_devices[DMIRROR_NDEVICES];
+
+static int dmirror_bounce_init(struct dmirror_bounce *bounce,
+			       unsigned long addr,
+			       unsigned long size)
+{
+	bounce->addr = addr;
+	bounce->size = size;
+	bounce->cpages = 0;
+	bounce->ptr = vmalloc(size);
+	if (!bounce->ptr)
+		return -ENOMEM;
+	return 0;
+}
+
+static void dmirror_bounce_fini(struct dmirror_bounce *bounce)
+{
+	vfree(bounce->ptr);
+}
+
+static int dmirror_fops_open(struct inode *inode, struct file *filp)
+{
+	struct cdev *cdev = inode->i_cdev;
+	struct dmirror *dmirror;
+	int ret;
+
+	/* Mirror this process address space */
+	dmirror = kzalloc(sizeof(*dmirror), GFP_KERNEL);
+	if (dmirror == NULL)
+		return -ENOMEM;
+
+	dmirror->mdevice = container_of(cdev, struct dmirror_device, cdevice);
+	mutex_init(&dmirror->mutex);
+	xa_init(&dmirror->pt);
+	dmirror->ranges = RB_ROOT_CACHED;
+
+	/*
+	 * Pre-register for mmu interval notifiers so
+	 * mmu_interval_notifier_insert_safe() can be called without holding
+	 * mmap_sem for write.
+	 */
+	ret = mmu_notifier_register(NULL, current->mm);
+	if (ret) {
+		kfree(dmirror);
+		return ret;
+	}
+
+	/* Pairs with the mmdrop() in dmirror_fops_release(). */
+	mmgrab(current->mm);
+	dmirror->mm = current->mm;
+
+	/* Only the first open registers the address space. */
+	filp->private_data = dmirror;
+	return ret;
+}
+
+static int dmirror_fops_release(struct inode *inode, struct file *filp)
+{
+	struct dmirror *dmirror = filp->private_data;
+	struct interval_tree_node *node;
+	struct interval_tree_node *next;
+	struct dmirror_interval *dmi;
+
+	mutex_lock(&dmirror->mutex);
+	for (node = interval_tree_iter_first(&dmirror->ranges, 0UL, ~0UL);
+	     node;
+	     node = next) {
+		next = interval_tree_iter_next(node, 0UL, ~0UL);
+		dmi = container_of(node, struct dmirror_interval, tree_node);
+		interval_tree_remove(&dmi->tree_node, &dmirror->ranges);
+		mmu_interval_notifier_remove(&dmi->notifier);
+	}
+	mutex_unlock(&dmirror->mutex);
+	mmdrop(dmirror->mm);
+	mmu_notifier_synchronize();
+	xa_destroy(&dmirror->pt);
+	kfree(dmirror);
+	return 0;
+}
+
+static inline struct dmirror_device *dmirror_page_to_device(struct page *page)
+
+{
+	struct dmirror_chunk *devmem;
+
+	devmem = container_of(page->pgmap, struct dmirror_chunk, pagemap);
+	return devmem->mdevice;
+}
+
+static bool dmirror_device_is_mine(struct dmirror_device *mdevice,
+				   struct page *page)
+{
+	if (!is_zone_device_page(page))
+		return false;
+	return page->pgmap->ops == &dmirror_devmem_ops &&
+		dmirror_page_to_device(page) == mdevice;
+}
+
+static int dmirror_do_fault(struct dmirror *dmirror, struct hmm_range *range)
+{
+	uint64_t *pfns = range->pfns;
+	unsigned long pfn;
+
+	for (pfn = (range->start >> PAGE_SHIFT);
+	     pfn < (range->end >> PAGE_SHIFT);
+	     pfn++, pfns++) {
+		struct page *page;
+		void *entry;
+
+		/*
+		 * HMM_PFN_ERROR is returned when accessing invalid memory,
+		 * either because of a memory error (hardware detected memory
+		 * corruption) or, more likely, because the mapping's backing
+		 * file was truncated.
+		 */
+		if (*pfns == range->values[HMM_PFN_ERROR])
+			return -EFAULT;
+		if (!(*pfns & range->flags[HMM_PFN_VALID]))
+			return -EFAULT;
+		page = hmm_device_entry_to_page(range, *pfns);
+		/* We asked for pages to be populated but check anyway. */
+		if (!page)
+			return -EFAULT;
+		if (is_zone_device_page(page)) {
+			/*
+			 * TODO: need a way to ask HMM to fault foreign zone
+			 * device private pages.
+			 */
+			if (!dmirror_device_is_mine(dmirror->mdevice, page))
+				continue;
+		}
+		entry = page;
+		if (*pfns & range->flags[HMM_PFN_WRITE])
+			entry = xa_tag_pointer(entry, DPT_XA_TAG_WRITE);
+		else if (range->default_flags & range->flags[HMM_PFN_WRITE])
+			return -EFAULT;
+		entry = xa_store(&dmirror->pt, pfn, entry, GFP_KERNEL);
+		if (xa_is_err(entry))
+			return xa_err(entry);
+	}
+
+	return 0;
+}
+
+static void dmirror_do_update(struct dmirror *dmirror, unsigned long start,
+			      unsigned long end)
+{
+	unsigned long pfn;
+
+	/*
+	 * The XArray doesn't hold references to pages since it relies on
+	 * the mmu notifier to clear pointers when they become stale.
+	 * Therefore, it is OK to just clear the entry.
+	 */
+	for (pfn = start >> PAGE_SHIFT; pfn < (end >> PAGE_SHIFT); pfn++)
+		xa_store(&dmirror->pt, pfn, NULL, GFP_KERNEL);
+}
+
+static struct dmirror_interval *dmirror_new_interval(struct dmirror *dmirror,
+						     unsigned long start,
+						     unsigned long last)
+{
+	struct dmirror_interval *dmi;
+	int ret;
+
+	dmi = kmalloc(sizeof(*dmi), GFP_ATOMIC);
+	if (!dmi)
+		return NULL;
+
+	dmi->dmirror = dmirror;
+
+	ret = mmu_interval_notifier_insert_safe(&dmi->notifier, dmirror->mm,
+				start, last - start + 1, &dmirror_min_ops);
+	if (ret) {
+		kfree(dmi);
+		return NULL;
+	}
+
+	dmi->tree_node.start = start;
+	dmi->tree_node.last = last;
+	interval_tree_insert(&dmi->tree_node, &dmirror->ranges);
+
+	return dmi;
+}
+
+static void dmirror_do_unmap(struct mmu_interval_notifier *mni,
+			     const struct mmu_notifier_range *range)
+{
+	struct dmirror_interval *dmi =
+		container_of(mni, struct dmirror_interval, notifier);
+	struct dmirror *dmirror = dmi->dmirror;
+	unsigned long last;
+
+	interval_tree_remove(&dmi->tree_node, &dmirror->ranges);
+
+	if (dmi->tree_node.start >= range->start) {
+		if (dmi->tree_node.last < range->end) {
+			/* Remove the whole interval. */
+			mmu_interval_notifier_remove(mni);
+		} else {
+			/* Keep the right-hand part of the interval. */
+			mmu_interval_notifier_update(mni, range->end,
+				dmi->tree_node.last - range->end + 1);
+			dmi->tree_node.start = range->end;
+			interval_tree_insert(&dmi->tree_node,
+					     &dmirror->ranges);
+		}
+		return;
+	}
+
+	/* Keep the left-hand part of the interval. */
+	mmu_interval_notifier_update(mni, dmi->tree_node.start,
+				     range->start - dmi->tree_node.start);
+	last = dmi->tree_node.last;
+	dmi->tree_node.last = range->start - 1;
+	interval_tree_insert(&dmi->tree_node, &dmirror->ranges);
+
+	/* If a hole is created, create an interval for the right-hand part. */
+	if (last >= range->end) {
+		dmi = dmirror_new_interval(dmirror, range->end, last);
+		/*
+		 * If we can't allocate an interval, we won't get invalidation
+		 * callbacks so clear the mapping and rely on faults to reload
+		 * the mappings if needed.
+		 */
+		if (!dmi)
+			dmirror_do_update(dmirror, range->end, last + 1);
+	}
+}
+
+static bool dmirror_range_invalidate(struct mmu_interval_notifier *mni,
+				const struct mmu_notifier_range *range,
+				unsigned long cur_seq)
+{
+	struct dmirror_interval *dmi =
+		container_of(mni, struct dmirror_interval, notifier);
+	struct dmirror *dmirror = dmi->dmirror;
+
+	if (mmu_notifier_range_blockable(range))
+		mutex_lock(&dmirror->mutex);
+	else if (!mutex_trylock(&dmirror->mutex))
+		return false;
+
+	mmu_interval_set_seq(mni, cur_seq);
+	dmirror_do_update(dmirror, range->start, range->end);
+
+	/* Stop tracking the range if it is an unmap. */
+	if (range->event == MMU_NOTIFY_UNMAP)
+		dmirror_do_unmap(mni, range);
+
+	mutex_unlock(&dmirror->mutex);
+	return true;
+}
+
+static void dmirror_interval_release(struct mmu_interval_notifier *mni)
+{
+	struct dmirror_interval *dmi =
+		container_of(mni, struct dmirror_interval, notifier);
+
+	kfree(dmi);
+}
+
+static const struct mmu_interval_notifier_ops dmirror_min_ops = {
+	.invalidate = dmirror_range_invalidate,
+	.release = dmirror_interval_release,
+};
+
+/*
+ * Find or create a mmu_interval_notifier for the given range.
+ * Although mmu_interval_notifier_insert_safe() can handle overlapping
+ * intervals, we only create non-overlapping intervals, shrinking the hmm_range
+ * if it spans more than one dmirror_interval.
+ */
+static int dmirror_interval_find(struct dmirror *dmirror,
+				 struct hmm_range *range)
+{
+	struct interval_tree_node *node;
+	struct dmirror_interval *dmi;
+	struct vm_area_struct *vma;
+	unsigned long end;
+	int ret;
+
+	mutex_lock(&dmirror->mutex);
+	node = interval_tree_iter_first(&dmirror->ranges, range->start,
+					range->end - 1);
+	if (node) {
+		dmi = container_of(node, struct dmirror_interval, tree_node);
+		if (range->start >= node->start) {
+			if (range->end - 1 > node->last)
+				range->end = node->last + 1;
+			goto found;
+		}
+		if (range->end - 1 > node->last)
+			range->end = node->last + 1;
+	}
+	/*
+	 * Might as well create an interval covering the underlying VMA to
+	 * avoid having to create a bunch of small intervals.
+	 */
+	vma = find_vma(dmirror->mm, range->start);
+	if (!vma || range->start < vma->vm_start) {
+		ret = -ENOENT;
+		goto err;
+	}
+	end = vma->vm_end;
+	if (end < range->end)
+		range->end = end;
+	dmi = dmirror_new_interval(dmirror, vma->vm_start, end - 1);
+	if (!dmi) {
+		ret = -ENOMEM;
+		goto err;
+	}
+
+found:
+	range->notifier = &dmi->notifier;
+	mutex_unlock(&dmirror->mutex);
+	return 0;
+
+err:
+	mutex_unlock(&dmirror->mutex);
+	return ret;
+}
+
+static int dmirror_range_fault(struct dmirror *dmirror,
+				struct hmm_range *range)
+{
+	struct mm_struct *mm = dmirror->mm;
+	unsigned long timeout =
+		jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
+	int ret;
+
+	while (true) {
+		long count;
+
+		if (time_after(jiffies, timeout)) {
+			ret = -EBUSY;
+			goto out;
+		}
+
+		down_read(&mm->mmap_sem);
+		ret = dmirror_interval_find(dmirror, range);
+		if (ret) {
+			up_read(&mm->mmap_sem);
+			goto out;
+		}
+		range->notifier_seq = mmu_interval_read_begin(range->notifier);
+		count = hmm_range_fault(range, 0);
+		up_read(&mm->mmap_sem);
+		if (count <= 0) {
+			if (count == 0 || count == -EBUSY)
+				continue;
+			ret = count;
+			goto out;
+		}
+
+		mutex_lock(&dmirror->mutex);
+		if (mmu_interval_read_retry(range->notifier,
+					    range->notifier_seq)) {
+			mutex_unlock(&dmirror->mutex);
+			continue;
+		}
+		break;
+	}
+
+	ret = dmirror_do_fault(dmirror, range);
+
+	mutex_unlock(&dmirror->mutex);
+out:
+	return ret;
+}
+
+static int dmirror_fault(struct dmirror *dmirror, unsigned long start,
+			 unsigned long end, bool write)
+{
+	struct mm_struct *mm = dmirror->mm;
+	unsigned long addr;
+	unsigned long next;
+	uint64_t pfns[64];
+	struct hmm_range range = {
+		.pfns = pfns,
+		.flags = dmirror_hmm_flags,
+		.values = dmirror_hmm_values,
+		.pfn_shift = DPT_SHIFT,
+		.pfn_flags_mask = ~(dmirror_hmm_flags[HMM_PFN_VALID] |
+				    dmirror_hmm_flags[HMM_PFN_WRITE]),
+		.default_flags = dmirror_hmm_flags[HMM_PFN_VALID] |
+				(write ? dmirror_hmm_flags[HMM_PFN_WRITE] : 0),
+	};
+	int ret = 0;
+
+	/* Since the mm is for the mirrored process, get a reference first. */
+	if (!mmget_not_zero(mm))
+		return 0;
+
+	for (addr = start; addr < end; addr = next) {
+		next = min(addr + (ARRAY_SIZE(pfns) << PAGE_SHIFT), end);
+		range.start = addr;
+		range.end = next;
+
+		ret = dmirror_range_fault(dmirror, &range);
+		if (ret)
+			break;
+	}
+
+	mmput(mm);
+	return ret;
+}
+
+static int dmirror_do_read(struct dmirror *dmirror, unsigned long start,
+			   unsigned long end, struct dmirror_bounce *bounce)
+{
+	unsigned long pfn;
+	void *ptr;
+
+	ptr = bounce->ptr + ((start - bounce->addr) & PAGE_MASK);
+
+	for (pfn = start >> PAGE_SHIFT; pfn < (end >> PAGE_SHIFT); pfn++) {
+		void *entry;
+		struct page *page;
+		void *tmp;
+
+		entry = xa_load(&dmirror->pt, pfn);
+		page = xa_untag_pointer(entry);
+		if (!page)
+			return -ENOENT;
+
+		tmp = kmap(page);
+		memcpy(ptr, tmp, PAGE_SIZE);
+		kunmap(page);
+
+		ptr += PAGE_SIZE;
+		bounce->cpages++;
+	}
+
+	return 0;
+}
+
+static int dmirror_read(struct dmirror *dmirror, struct hmm_dmirror_cmd *cmd)
+{
+	struct dmirror_bounce bounce;
+	unsigned long start, end;
+	unsigned long size = cmd->npages << PAGE_SHIFT;
+	int ret;
+
+	start = cmd->addr;
+	end = start + size;
+	if (end < start)
+		return -EINVAL;
+
+	ret = dmirror_bounce_init(&bounce, start, size);
+	if (ret)
+		return ret;
+
+again:
+	mutex_lock(&dmirror->mutex);
+	ret = dmirror_do_read(dmirror, start, end, &bounce);
+	mutex_unlock(&dmirror->mutex);
+	if (ret == 0)
+		ret = copy_to_user((void __user *)cmd->ptr, bounce.ptr,
+					bounce.size);
+	else if (ret == -ENOENT) {
+		start = cmd->addr + (bounce.cpages << PAGE_SHIFT);
+		ret = dmirror_fault(dmirror, start, end, false);
+		if (ret == 0) {
+			cmd->faults++;
+			goto again;
+		}
+	}
+
+	cmd->cpages = bounce.cpages;
+	dmirror_bounce_fini(&bounce);
+	return ret;
+}
+
+static int dmirror_do_write(struct dmirror *dmirror, unsigned long start,
+			    unsigned long end, struct dmirror_bounce *bounce)
+{
+	unsigned long pfn;
+	void *ptr;
+
+	ptr = bounce->ptr + ((start - bounce->addr) & PAGE_MASK);
+
+	for (pfn = start >> PAGE_SHIFT; pfn < (end >> PAGE_SHIFT); pfn++) {
+		void *entry;
+		struct page *page;
+		void *tmp;
+
+		entry = xa_load(&dmirror->pt, pfn);
+		page = xa_untag_pointer(entry);
+		if (!page || xa_pointer_tag(entry) != DPT_XA_TAG_WRITE)
+			return -ENOENT;
+
+		tmp = kmap(page);
+		memcpy(tmp, ptr, PAGE_SIZE);
+		kunmap(page);
+
+		ptr += PAGE_SIZE;
+		bounce->cpages++;
+	}
+
+	return 0;
+}
+
+static int dmirror_write(struct dmirror *dmirror, struct hmm_dmirror_cmd *cmd)
+{
+	struct dmirror_bounce bounce;
+	unsigned long start, end;
+	unsigned long size = cmd->npages << PAGE_SHIFT;
+	int ret;
+
+	start = cmd->addr;
+	end = start + size;
+	if (end < start)
+		return -EINVAL;
+
+	ret = dmirror_bounce_init(&bounce, start, size);
+	if (ret)
+		return ret;
+	ret = copy_from_user(bounce.ptr, (void __user *)cmd->ptr,
+				bounce.size);
+	if (ret)
+		return ret;
+
+again:
+	mutex_lock(&dmirror->mutex);
+	ret = dmirror_do_write(dmirror, start, end, &bounce);
+	mutex_unlock(&dmirror->mutex);
+	if (ret == -ENOENT) {
+		start = cmd->addr + (bounce.cpages << PAGE_SHIFT);
+		ret = dmirror_fault(dmirror, start, end, true);
+		if (ret == 0) {
+			cmd->faults++;
+			goto again;
+		}
+	}
+
+	cmd->cpages = bounce.cpages;
+	dmirror_bounce_fini(&bounce);
+	return ret;
+}
+
+static bool dmirror_allocate_chunk(struct dmirror_device *mdevice,
+				   struct page **ppage)
+{
+	struct dmirror_chunk *devmem;
+	struct resource *res;
+	unsigned long pfn;
+	unsigned long pfn_first;
+	unsigned long pfn_last;
+	void *ptr;
+
+	mutex_lock(&mdevice->devmem_lock);
+
+	if (mdevice->devmem_count == mdevice->devmem_capacity) {
+		struct dmirror_chunk **new_chunks;
+		unsigned int new_capacity;
+
+		new_capacity = mdevice->devmem_capacity +
+				DEVMEM_CHUNKS_RESERVE;
+		new_chunks = krealloc(mdevice->devmem_chunks,
+				sizeof(new_chunks[0]) * new_capacity,
+				GFP_KERNEL);
+		if (!new_chunks)
+			goto err;
+		mdevice->devmem_capacity = new_capacity;
+		mdevice->devmem_chunks = new_chunks;
+	}
+
+	res = request_free_mem_region(&iomem_resource, DEVMEM_CHUNK_SIZE,
+					"hmm_dmirror");
+	if (IS_ERR(res))
+		goto err;
+
+	devmem = kzalloc(sizeof(*devmem), GFP_KERNEL);
+	if (!devmem)
+		goto err;
+
+	devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
+	devmem->pagemap.res = *res;
+	devmem->pagemap.ops = &dmirror_devmem_ops;
+
+	/* NUMA node ID doesn't matter for ZONE_DEVICE private pages. */
+	ptr = memremap_pages(&devmem->pagemap, 0);
+	if (IS_ERR(ptr))
+		goto err_free;
+
+	devmem->mdevice = mdevice;
+	pfn_first = devmem->pagemap.res.start >> PAGE_SHIFT;
+	pfn_last = pfn_first +
+		(resource_size(&devmem->pagemap.res) >> PAGE_SHIFT);
+	mdevice->devmem_chunks[mdevice->devmem_count++] = devmem;
+
+	mutex_unlock(&mdevice->devmem_lock);
+
+	pr_info("added new %u MB chunk (total %u chunks, %u MB) PFNs [0x%lx 0x%lx)\n",
+		DEVMEM_CHUNK_SIZE / (1024 * 1024),
+		mdevice->devmem_count,
+		mdevice->devmem_count * (DEVMEM_CHUNK_SIZE / (1024 * 1024)),
+		pfn_first, pfn_last);
+
+	spin_lock(&mdevice->lock);
+	for (pfn = pfn_first; pfn < pfn_last; pfn++) {
+		struct page *page = pfn_to_page(pfn);
+
+		page->zone_device_data = mdevice->free_pages;
+		mdevice->free_pages = page;
+	}
+	if (ppage) {
+		*ppage = mdevice->free_pages;
+		mdevice->free_pages = (*ppage)->zone_device_data;
+		mdevice->calloc++;
+	}
+	spin_unlock(&mdevice->lock);
+
+	return true;
+
+err_free:
+	kfree(devmem);
+err:
+	mutex_unlock(&mdevice->devmem_lock);
+	return false;
+}
+
+static struct page *dmirror_devmem_alloc_page(struct dmirror_device *mdevice)
+{
+	struct page *dpage = NULL;
+	struct page *rpage;
+
+	/*
+	 * This is a fake device, so we allocate real system memory to store
+	 * our device memory.
+	 */
+	rpage = alloc_page(GFP_HIGHUSER);
+	if (!rpage)
+		return NULL;
+
+	spin_lock(&mdevice->lock);
+
+	if (mdevice->free_pages) {
+		dpage = mdevice->free_pages;
+		mdevice->free_pages = dpage->zone_device_data;
+		mdevice->calloc++;
+		spin_unlock(&mdevice->lock);
+	} else {
+		spin_unlock(&mdevice->lock);
+		if (!dmirror_allocate_chunk(mdevice, &dpage))
+			goto error;
+	}
+
+	dpage->zone_device_data = rpage;
+	get_page(dpage);
+	lock_page(dpage);
+	return dpage;
+
+error:
+	__free_page(rpage);
+	return NULL;
+}
+
+static void dmirror_migrate_alloc_and_copy(struct migrate_vma *args,
+					   struct dmirror *dmirror)
+{
+	struct dmirror_device *mdevice = dmirror->mdevice;
+	const unsigned long *src = args->src;
+	unsigned long *dst = args->dst;
+	unsigned long addr;
+
+	for (addr = args->start; addr < args->end; addr += PAGE_SIZE,
+						   src++, dst++) {
+		struct page *spage;
+		struct page *dpage;
+		struct page *rpage;
+
+		if (!(*src & MIGRATE_PFN_MIGRATE))
+			continue;
+
+		/*
+		 * Note that spage might be NULL which is OK since it is an
+		 * unallocated pte_none() or read-only zero page.
+		 */
+		spage = migrate_pfn_to_page(*src);
+
+		/*
+		 * Don't migrate device private pages from our own driver or
+		 * others. For our own we would do a device private memory copy
+		 * not a migration and for others, we would need to fault the
+		 * other device's page into system memory first.
+		 */
+		if (spage && is_zone_device_page(spage))
+			continue;
+
+		dpage = dmirror_devmem_alloc_page(mdevice);
+		if (!dpage)
+			continue;
+
+		rpage = dpage->zone_device_data;
+		if (spage)
+			copy_highpage(rpage, spage);
+		else
+			clear_highpage(rpage);
+
+		/*
+		 * Normally, a device would use the page->zone_device_data to
+		 * point to the mirror but here we use it to hold the page for
+		 * the simulated device memory and that page holds the pointer
+		 * to the mirror.
+		 */
+		rpage->zone_device_data = dmirror;
+
+		*dst = migrate_pfn(page_to_pfn(dpage)) |
+			    MIGRATE_PFN_LOCKED;
+		if ((*src & MIGRATE_PFN_WRITE) ||
+		    (!spage && args->vma->vm_flags & VM_WRITE))
+			*dst |= MIGRATE_PFN_WRITE;
+	}
+}
+
+static int dmirror_migrate_finalize_and_map(struct migrate_vma *args,
+					    struct dmirror *dmirror)
+{
+	unsigned long start = args->start;
+	unsigned long end = args->end;
+	const unsigned long *src = args->src;
+	const unsigned long *dst = args->dst;
+	unsigned long pfn;
+
+	/* Map the migrated pages into the device's page tables. */
+	mutex_lock(&dmirror->mutex);
+
+	for (pfn = start >> PAGE_SHIFT; pfn < (end >> PAGE_SHIFT); pfn++,
+								src++, dst++) {
+		struct page *dpage;
+		void *entry;
+
+		if (!(*src & MIGRATE_PFN_MIGRATE))
+			continue;
+
+		dpage = migrate_pfn_to_page(*dst);
+		if (!dpage)
+			continue;
+
+		/*
+		 * Store the page that holds the data so the page table
+		 * doesn't have to deal with ZONE_DEVICE private pages.
+		 */
+		entry = dpage->zone_device_data;
+		if (*dst & MIGRATE_PFN_WRITE)
+			entry = xa_tag_pointer(entry, DPT_XA_TAG_WRITE);
+		entry = xa_store(&dmirror->pt, pfn, entry, GFP_KERNEL);
+		if (xa_is_err(entry))
+			return xa_err(entry);
+	}
+
+	mutex_unlock(&dmirror->mutex);
+	return 0;
+}
+
+static int dmirror_migrate(struct dmirror *dmirror,
+			   struct hmm_dmirror_cmd *cmd)
+{
+	unsigned long start, end, addr;
+	unsigned long size = cmd->npages << PAGE_SHIFT;
+	struct mm_struct *mm = dmirror->mm;
+	struct vm_area_struct *vma;
+	unsigned long src_pfns[64];
+	unsigned long dst_pfns[64];
+	struct dmirror_bounce bounce;
+	struct migrate_vma args;
+	unsigned long next;
+	int ret;
+
+	start = cmd->addr;
+	end = start + size;
+	if (end < start)
+		return -EINVAL;
+
+	/* Since the mm is for the mirrored process, get a reference first. */
+	if (!mmget_not_zero(mm))
+		return -EINVAL;
+
+	down_read(&mm->mmap_sem);
+	for (addr = start; addr < end; addr = next) {
+		vma = find_vma(mm, addr);
+		if (!vma || addr < vma->vm_start) {
+			ret = -EINVAL;
+			goto out;
+		}
+		next = min(end, addr + (ARRAY_SIZE(src_pfns) << PAGE_SHIFT));
+		if (next > vma->vm_end)
+			next = vma->vm_end;
+
+		args.vma = vma;
+		args.src = src_pfns;
+		args.dst = dst_pfns;
+		args.start = addr;
+		args.end = next;
+		ret = migrate_vma_setup(&args);
+		if (ret)
+			goto out;
+
+		dmirror_migrate_alloc_and_copy(&args, dmirror);
+		migrate_vma_pages(&args);
+		dmirror_migrate_finalize_and_map(&args, dmirror);
+		migrate_vma_finalize(&args);
+	}
+	up_read(&mm->mmap_sem);
+	mmput(mm);
+
+	/* Return the migrated data for verification. */
+	ret = dmirror_bounce_init(&bounce, start, size);
+	if (ret)
+		return ret;
+	mutex_lock(&dmirror->mutex);
+	ret = dmirror_do_read(dmirror, start, end, &bounce);
+	mutex_unlock(&dmirror->mutex);
+	if (ret == 0)
+		ret = copy_to_user((void __user *)cmd->ptr, bounce.ptr,
+					bounce.size);
+	cmd->cpages = bounce.cpages;
+	dmirror_bounce_fini(&bounce);
+	return ret;
+
+out:
+	up_read(&mm->mmap_sem);
+	mmput(mm);
+	return ret;
+}
+
+static void dmirror_mkentry(struct dmirror *dmirror, struct hmm_range *range,
+			    unsigned char *perm, uint64_t entry)
+{
+	struct page *page;
+
+	if (entry == range->values[HMM_PFN_ERROR]) {
+		*perm = HMM_DMIRROR_PROT_ERROR;
+		return;
+	}
+	page = hmm_device_entry_to_page(range, entry);
+	if (!page) {
+		*perm = HMM_DMIRROR_PROT_NONE;
+		return;
+	}
+	if (entry & range->flags[HMM_PFN_DEVICE_PRIVATE]) {
+		/* Is the page migrated to this device or some other? */
+		if (dmirror->mdevice == dmirror_page_to_device(page))
+			*perm = HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL;
+		else
+			*perm = HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE;
+	} else if (is_zero_pfn(page_to_pfn(page)))
+		*perm = HMM_DMIRROR_PROT_ZERO;
+	else
+		*perm = HMM_DMIRROR_PROT_NONE;
+	if (entry & range->flags[HMM_PFN_WRITE])
+		*perm |= HMM_DMIRROR_PROT_WRITE;
+	else
+		*perm |= HMM_DMIRROR_PROT_READ;
+}
+
+static bool dmirror_snapshot_invalidate(struct mmu_interval_notifier *mni,
+				const struct mmu_notifier_range *range,
+				unsigned long cur_seq)
+{
+	struct dmirror_interval *dmi =
+		container_of(mni, struct dmirror_interval, notifier);
+	struct dmirror *dmirror = dmi->dmirror;
+
+	if (mmu_notifier_range_blockable(range))
+		mutex_lock(&dmirror->mutex);
+	else if (!mutex_trylock(&dmirror->mutex))
+		return false;
+
+	/*
+	 * Snapshots only need to set the sequence number since the
+	 * invalidations are handled by the dmirror_interval ranges.
+	 */
+	mmu_interval_set_seq(mni, cur_seq);
+
+	mutex_unlock(&dmirror->mutex);
+	return true;
+}
+
+static const struct mmu_interval_notifier_ops dmirror_mrn_ops = {
+	.invalidate = dmirror_snapshot_invalidate,
+};
+
+static int dmirror_range_snapshot(struct dmirror *dmirror,
+				  struct hmm_range *range,
+				  unsigned char *perm)
+{
+	struct mm_struct *mm = dmirror->mm;
+	struct dmirror_interval notifier;
+	unsigned long timeout =
+		jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
+	unsigned long i;
+	unsigned long n;
+	int ret = 0;
+
+	notifier.dmirror = dmirror;
+	range->notifier = &notifier.notifier;
+
+	ret = mmu_interval_notifier_insert_safe(range->notifier, mm,
+			range->start, range->end - range->start,
+			&dmirror_mrn_ops);
+	if (ret)
+		return ret;
+
+	while (true) {
+		long count;
+
+		if (time_after(jiffies, timeout)) {
+			ret = -EBUSY;
+			goto out;
+		}
+
+		range->notifier_seq = mmu_interval_read_begin(range->notifier);
+
+		down_read(&mm->mmap_sem);
+		count = hmm_range_fault(range, HMM_FAULT_SNAPSHOT);
+		up_read(&mm->mmap_sem);
+		if (count <= 0) {
+			if (count == 0 || count == -EBUSY)
+				continue;
+			ret = count;
+			goto out;
+		}
+
+		mutex_lock(&dmirror->mutex);
+		if (mmu_interval_read_retry(range->notifier,
+					    range->notifier_seq)) {
+			mutex_unlock(&dmirror->mutex);
+			continue;
+		}
+		break;
+	}
+
+	n = (range->end - range->start) >> PAGE_SHIFT;
+	for (i = 0; i < n; i++)
+		dmirror_mkentry(dmirror, range, perm + i, range->pfns[i]);
+
+	mutex_unlock(&dmirror->mutex);
+out:
+	mmu_interval_notifier_remove(range->notifier);
+	return ret;
+}
+
+static int dmirror_snapshot(struct dmirror *dmirror,
+			    struct hmm_dmirror_cmd *cmd)
+{
+	struct mm_struct *mm = dmirror->mm;
+	unsigned long start, end;
+	unsigned long size = cmd->npages << PAGE_SHIFT;
+	unsigned long addr;
+	unsigned long next;
+	uint64_t pfns[64];
+	unsigned char perm[64];
+	char __user *uptr;
+	struct hmm_range range = {
+		.pfns = pfns,
+		.flags = dmirror_hmm_flags,
+		.values = dmirror_hmm_values,
+		.pfn_shift = DPT_SHIFT,
+		.pfn_flags_mask = ~0ULL,
+	};
+	int ret = 0;
+
+	start = cmd->addr;
+	end = start + size;
+	if (end < start)
+		return -EINVAL;
+
+	/* Since the mm is for the mirrored process, get a reference first. */
+	if (!mmget_not_zero(mm))
+		return -EINVAL;
+
+	/*
+	 * Register a temporary notifier to detect invalidations even if it
+	 * overlaps with other mmu_interval_notifiers.
+	 */
+	uptr = (void __user *)cmd->ptr;
+	for (addr = start; addr < end; addr = next) {
+		unsigned long n;
+
+		next = min(addr + (ARRAY_SIZE(pfns) << PAGE_SHIFT), end);
+		range.start = addr;
+		range.end = next;
+
+		ret = dmirror_range_snapshot(dmirror, &range, perm);
+		if (ret)
+			break;
+
+		n = (range.end - range.start) >> PAGE_SHIFT;
+		ret = copy_to_user(uptr, perm, n);
+		if (ret)
+			break;
+
+		cmd->cpages += n;
+		uptr += n;
+	}
+	mmput(mm);
+
+	return ret;
+}
+
+static long dmirror_fops_unlocked_ioctl(struct file *filp,
+					unsigned int command,
+					unsigned long arg)
+{
+	void __user *uarg = (void __user *)arg;
+	struct hmm_dmirror_cmd cmd;
+	struct dmirror *dmirror;
+	int ret;
+
+	dmirror = filp->private_data;
+	if (!dmirror)
+		return -EINVAL;
+
+	ret = copy_from_user(&cmd, uarg, sizeof(cmd));
+	if (ret)
+		return ret;
+
+	if (cmd.addr & ~PAGE_MASK)
+		return -EINVAL;
+	if (cmd.addr >= (cmd.addr + (cmd.npages << PAGE_SHIFT)))
+		return -EINVAL;
+
+	cmd.cpages = 0;
+	cmd.faults = 0;
+
+	switch (command) {
+	case HMM_DMIRROR_READ:
+		ret = dmirror_read(dmirror, &cmd);
+		break;
+
+	case HMM_DMIRROR_WRITE:
+		ret = dmirror_write(dmirror, &cmd);
+		break;
+
+	case HMM_DMIRROR_MIGRATE:
+		ret = dmirror_migrate(dmirror, &cmd);
+		break;
+
+	case HMM_DMIRROR_SNAPSHOT:
+		ret = dmirror_snapshot(dmirror, &cmd);
+		break;
+
+	default:
+		return -EINVAL;
+	}
+	if (ret)
+		return ret;
+
+	return copy_to_user(uarg, &cmd, sizeof(cmd));
+}
+
+static const struct file_operations dmirror_fops = {
+	.open		= dmirror_fops_open,
+	.release	= dmirror_fops_release,
+	.unlocked_ioctl = dmirror_fops_unlocked_ioctl,
+	.llseek		= default_llseek,
+	.owner		= THIS_MODULE,
+};
+
+static void dmirror_devmem_free(struct page *page)
+{
+	struct page *rpage = page->zone_device_data;
+	struct dmirror_device *mdevice;
+
+	if (rpage)
+		__free_page(rpage);
+
+	mdevice = dmirror_page_to_device(page);
+
+	spin_lock(&mdevice->lock);
+	mdevice->cfree++;
+	page->zone_device_data = mdevice->free_pages;
+	mdevice->free_pages = page;
+	spin_unlock(&mdevice->lock);
+}
+
+static vm_fault_t dmirror_devmem_fault_alloc_and_copy(struct migrate_vma *args,
+						struct dmirror_device *mdevice)
+{
+	struct vm_area_struct *vma = args->vma;
+	const unsigned long *src = args->src;
+	unsigned long *dst = args->dst;
+	unsigned long start = args->start;
+	unsigned long end = args->end;
+	unsigned long addr;
+
+	for (addr = start; addr < end; addr += PAGE_SIZE,
+				       src++, dst++) {
+		struct page *dpage, *spage;
+
+		spage = migrate_pfn_to_page(*src);
+		if (!spage || !(*src & MIGRATE_PFN_MIGRATE))
+			continue;
+		if (!dmirror_device_is_mine(mdevice, spage))
+			continue;
+		spage = spage->zone_device_data;
+
+		dpage = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, addr);
+		if (!dpage)
+			continue;
+
+		lock_page(dpage);
+		copy_highpage(dpage, spage);
+		*dst = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
+		if (*src & MIGRATE_PFN_WRITE)
+			*dst |= MIGRATE_PFN_WRITE;
+	}
+	return 0;
+}
+
+static void dmirror_devmem_fault_finalize_and_map(struct migrate_vma *args,
+						  struct dmirror *dmirror)
+{
+	/* Invalidate the device's page table mapping. */
+	mutex_lock(&dmirror->mutex);
+	dmirror_do_update(dmirror, args->start, args->end);
+	mutex_unlock(&dmirror->mutex);
+}
+
+static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
+{
+	struct migrate_vma args;
+	unsigned long src_pfns;
+	unsigned long dst_pfns;
+	struct page *rpage;
+	struct dmirror *dmirror;
+	vm_fault_t ret;
+
+	/* FIXME demonstrate how we can adjust migrate range */
+	args.vma = vmf->vma;
+	args.start = vmf->address;
+	args.end = args.start + PAGE_SIZE;
+	args.src = &src_pfns;
+	args.dst = &dst_pfns;
+
+	if (migrate_vma_setup(&args))
+		return VM_FAULT_SIGBUS;
+
+	/*
+	 * Normally, a device would use the page->zone_device_data to point to
+	 * the mirror but here we use it to hold the page for the simulated
+	 * device memory and that page holds the pointer to the mirror.
+	 */
+	rpage = vmf->page->zone_device_data;
+	dmirror = rpage->zone_device_data;
+
+	ret = dmirror_devmem_fault_alloc_and_copy(&args, dmirror->mdevice);
+	if (ret)
+		return ret;
+	migrate_vma_pages(&args);
+	dmirror_devmem_fault_finalize_and_map(&args, dmirror);
+	migrate_vma_finalize(&args);
+	return 0;
+}
+
+static const struct dev_pagemap_ops dmirror_devmem_ops = {
+	.page_free	= dmirror_devmem_free,
+	.migrate_to_ram	= dmirror_devmem_fault,
+};
+
+static int dmirror_device_init(struct dmirror_device *mdevice, int id)
+{
+	dev_t dev;
+	int ret;
+
+	dev = MKDEV(MAJOR(dmirror_dev), id);
+	mutex_init(&mdevice->devmem_lock);
+	spin_lock_init(&mdevice->lock);
+
+	cdev_init(&mdevice->cdevice, &dmirror_fops);
+	ret = cdev_add(&mdevice->cdevice, dev, 1);
+	if (ret)
+		return ret;
+
+	/* Build a list of free ZONE_DEVICE private struct pages */
+	dmirror_allocate_chunk(mdevice, NULL);
+
+	return 0;
+}
+
+static void dmirror_device_remove(struct dmirror_device *mdevice)
+{
+	unsigned int i;
+
+	if (mdevice->devmem_chunks) {
+		for (i = 0; i < mdevice->devmem_count; i++) {
+			struct dmirror_chunk *devmem =
+				mdevice->devmem_chunks[i];
+
+			memunmap_pages(&devmem->pagemap);
+			kfree(devmem);
+		}
+		kfree(mdevice->devmem_chunks);
+	}
+
+	cdev_del(&mdevice->cdevice);
+}
+
+static int __init hmm_dmirror_init(void)
+{
+	int ret;
+	int id;
+
+	ret = alloc_chrdev_region(&dmirror_dev, 0, DMIRROR_NDEVICES,
+				  "HMM_DMIRROR");
+	if (ret)
+		goto err_unreg;
+
+	for (id = 0; id < DMIRROR_NDEVICES; id++) {
+		ret = dmirror_device_init(dmirror_devices + id, id);
+		if (ret)
+			goto err_chrdev;
+	}
+
+	/*
+	 * Allocate a zero page to simulate a reserved page of device private
+	 * memory which is always zero. We don't reuse the kernel's zero_pfn
+	 * page, just to keep the code simpler (we need our own struct page).
+	 */
+	dmirror_zero_page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
+	if (!dmirror_zero_page)
+		goto err_chrdev;
+
+	pr_info("HMM test module loaded. This is only for testing HMM.\n");
+	return 0;
+
+err_chrdev:
+	while (--id >= 0)
+		dmirror_device_remove(dmirror_devices + id);
+	unregister_chrdev_region(dmirror_dev, DMIRROR_NDEVICES);
+err_unreg:
+	return ret;
+}
+
+static void __exit hmm_dmirror_exit(void)
+{
+	int id;
+
+	if (dmirror_zero_page)
+		__free_page(dmirror_zero_page);
+	for (id = 0; id < DMIRROR_NDEVICES; id++)
+		dmirror_device_remove(dmirror_devices + id);
+	unregister_chrdev_region(dmirror_dev, DMIRROR_NDEVICES);
+}
+
+module_init(hmm_dmirror_init);
+module_exit(hmm_dmirror_exit);
+MODULE_LICENSE("GPL");
diff --git a/tools/testing/selftests/vm/.gitignore b/tools/testing/selftests/vm/.gitignore
index 31b3c98b6d34..3054565b3f07 100644
--- a/tools/testing/selftests/vm/.gitignore
+++ b/tools/testing/selftests/vm/.gitignore
@@ -14,3 +14,4 @@ virtual_address_range
 gup_benchmark
 va_128TBswitch
 map_fixed_noreplace
+hmm-tests
diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index 7f9a8a8c31da..3fadab99d991 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -7,6 +7,7 @@ CFLAGS = -Wall -I ../../../../usr/include $(EXTRA_CFLAGS)
 LDLIBS = -lrt
 TEST_GEN_FILES = compaction_test
 TEST_GEN_FILES += gup_benchmark
+TEST_GEN_FILES += hmm-tests
 TEST_GEN_FILES += hugepage-mmap
 TEST_GEN_FILES += hugepage-shm
 TEST_GEN_FILES += map_hugetlb
@@ -31,6 +32,8 @@ TEST_FILES := test_vmalloc.sh
 KSFT_KHDR_INSTALL := 1
 include ../lib.mk
 
+$(OUTPUT)/hmm-tests: LDLIBS += -lhugetlbfs -lpthread
+
 $(OUTPUT)/userfaultfd: LDLIBS += -lpthread
 
 $(OUTPUT)/mlock-random-test: LDLIBS += -lcap
diff --git a/tools/testing/selftests/vm/config b/tools/testing/selftests/vm/config
index 93b90a9b1eeb..f6d0adad739f 100644
--- a/tools/testing/selftests/vm/config
+++ b/tools/testing/selftests/vm/config
@@ -1,3 +1,5 @@
 CONFIG_SYSVIPC=y
 CONFIG_USERFAULTFD=y
 CONFIG_TEST_VMALLOC=m
+CONFIG_HMM_MIRROR=y
+CONFIG_DEVICE_PRIVATE=y
diff --git a/tools/testing/selftests/vm/hmm-tests.c b/tools/testing/selftests/vm/hmm-tests.c
new file mode 100644
index 000000000000..bbf5cdcdc7da
--- /dev/null
+++ b/tools/testing/selftests/vm/hmm-tests.c
@@ -0,0 +1,1360 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * HMM stands for Heterogeneous Memory Management; it is a helper layer inside
+ * the Linux kernel to help device drivers mirror a process address space in
+ * the device. This allows the device to use the same address space, which
+ * makes communication and data exchange a lot easier.
+ *
+ * This framework's sole purpose is to exercise various code paths inside
+ * the kernel to make sure that HMM performs as expected and to flush out any
+ * bugs.
+ */
+
+#include "../kselftest_harness.h"
+
+#include <errno.h>
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <unistd.h>
+#include <strings.h>
+#include <time.h>
+#include <pthread.h>
+#include <hugetlbfs.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/mman.h>
+#include <sys/ioctl.h>
+#include <linux/test_hmm.h>
+
+struct hmm_buffer {
+	void		*ptr;
+	void		*mirror;
+	unsigned long	size;
+	int		fd;
+	uint64_t	cpages;
+	uint64_t	faults;
+};
+
+#define TWOMEG		(1 << 21)
+#define HMM_BUFFER_SIZE (1024 << 12)
+#define HMM_PATH_MAX    64
+#define NTIMES		256
+
+#define ALIGN(x, a) (((x) + (a - 1)) & (~((a) - 1)))
+
+FIXTURE(hmm)
+{
+	int		fd;
+	unsigned int	page_size;
+	unsigned int	page_shift;
+};
+
+FIXTURE(hmm2)
+{
+	int		fd0;
+	int		fd1;
+	unsigned int	page_size;
+	unsigned int	page_shift;
+};
+
+static int hmm_open(int unit)
+{
+	char pathname[HMM_PATH_MAX];
+	int fd;
+
+	snprintf(pathname, sizeof(pathname), "/dev/hmm_dmirror%d", unit);
+	fd = open(pathname, O_RDWR, 0);
+	if (fd < 0)
+		fprintf(stderr, "could not open hmm dmirror driver (%s)\n",
+			pathname);
+	return fd;
+}
+
+FIXTURE_SETUP(hmm)
+{
+	self->page_size = sysconf(_SC_PAGE_SIZE);
+	self->page_shift = ffs(self->page_size) - 1;
+
+	self->fd = hmm_open(0);
+	ASSERT_GE(self->fd, 0);
+}
+
+FIXTURE_SETUP(hmm2)
+{
+	self->page_size = sysconf(_SC_PAGE_SIZE);
+	self->page_shift = ffs(self->page_size) - 1;
+
+	self->fd0 = hmm_open(0);
+	ASSERT_GE(self->fd0, 0);
+	self->fd1 = hmm_open(1);
+	ASSERT_GE(self->fd1, 0);
+}
+
+FIXTURE_TEARDOWN(hmm)
+{
+	int ret = close(self->fd);
+
+	ASSERT_EQ(ret, 0);
+	self->fd = -1;
+}
+
+FIXTURE_TEARDOWN(hmm2)
+{
+	int ret = close(self->fd0);
+
+	ASSERT_EQ(ret, 0);
+	self->fd0 = -1;
+
+	ret = close(self->fd1);
+	ASSERT_EQ(ret, 0);
+	self->fd1 = -1;
+}
+
+static int hmm_dmirror_cmd(int fd,
+			   unsigned long request,
+			   struct hmm_buffer *buffer,
+			   unsigned long npages)
+{
+	struct hmm_dmirror_cmd cmd;
+	int ret;
+
+	/* Simulate a device reading system memory. */
+	cmd.addr = (__u64)buffer->ptr;
+	cmd.ptr = (__u64)buffer->mirror;
+	cmd.npages = npages;
+
+	for (;;) {
+		ret = ioctl(fd, request, &cmd);
+		if (ret == 0)
+			break;
+		if (errno == EINTR)
+			continue;
+		return -errno;
+	}
+	buffer->cpages = cmd.cpages;
+	buffer->faults = cmd.faults;
+
+	return 0;
+}
+
+static void hmm_buffer_free(struct hmm_buffer *buffer)
+{
+	if (buffer == NULL)
+		return;
+
+	if (buffer->ptr)
+		munmap(buffer->ptr, buffer->size);
+	free(buffer->mirror);
+	free(buffer);
+}
+
+/*
+ * Create a temporary file that will be deleted on close.
+ */
+static int hmm_create_file(unsigned long size)
+{
+	char path[HMM_PATH_MAX];
+	int fd;
+
+	strcpy(path, "/tmp");
+	fd = open(path, O_TMPFILE | O_EXCL | O_RDWR, 0600);
+	if (fd >= 0) {
+		int r;
+
+		do {
+			r = ftruncate(fd, size);
+		} while (r == -1 && errno == EINTR);
+		if (!r)
+			return fd;
+		close(fd);
+	}
+	return -1;
+}
+
+/*
+ * Return a random unsigned number.
+ */
+static unsigned int hmm_random(void)
+{
+	static int fd = -1;
+	unsigned int r;
+
+	if (fd < 0) {
+		fd = open("/dev/urandom", O_RDONLY);
+		if (fd < 0) {
+			fprintf(stderr, "%s:%d failed to open /dev/urandom\n",
+					__FILE__, __LINE__);
+			return ~0U;
+		}
+	}
+	read(fd, &r, sizeof(r));
+	return r;
+}
+
+static void hmm_nanosleep(unsigned int n)
+{
+	struct timespec t;
+
+	t.tv_sec = 0;
+	t.tv_nsec = n;
+	nanosleep(&t, NULL);
+}
+
+/*
+ * Simple NULL test of device open/close.
+ */
+TEST_F(hmm, open_close)
+{
+}
+
+/*
+ * Read private anonymous memory.
+ */
+TEST_F(hmm, anon_read)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	int ret;
+	int val;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/*
+	 * Initialize buffer in system memory but leave the first two pages
+	 * zero (pte_none and pfn_zero).
+	 */
+	i = 2 * self->page_size / sizeof(*ptr);
+	for (ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Set buffer permission to read-only. */
+	ret = mprotect(buffer->ptr, size, PROT_READ);
+	ASSERT_EQ(ret, 0);
+
+	/* Populate the CPU page table with a special zero page. */
+	val = *(int *)(buffer->ptr + self->page_size);
+	ASSERT_EQ(val, 0);
+
+	/* Simulate a device reading system memory. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_READ, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+	ASSERT_EQ(buffer->faults, 1);
+
+	/* Check what the device read. */
+	ptr = buffer->mirror;
+	for (i = 0; i < 2 * self->page_size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], 0);
+	for (; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Read private anonymous memory which has been protected with
+ * mprotect() PROT_NONE.
+ */
+TEST_F(hmm, anon_read_prot)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	int ret;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/* Initialize buffer in system memory. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Initialize mirror buffer so we can verify it isn't written. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ptr[i] = -i;
+
+	/* Protect buffer from reading. */
+	ret = mprotect(buffer->ptr, size, PROT_NONE);
+	ASSERT_EQ(ret, 0);
+
+	/* Simulate a device reading system memory. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_READ, buffer, npages);
+	ASSERT_EQ(ret, -EFAULT);
+
+	/* Allow CPU to read the buffer so we can check it. */
+	ret = mprotect(buffer->ptr, size, PROT_READ);
+	ASSERT_EQ(ret, 0);
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], -i);
+
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Write private anonymous memory.
+ */
+TEST_F(hmm, anon_write)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	int ret;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/* Initialize data that the device will write to buffer->ptr. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Simulate a device writing system memory. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_WRITE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+	ASSERT_EQ(buffer->faults, 1);
+
+	/* Check what the device wrote. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Write private anonymous memory which has been protected with
+ * mprotect() PROT_READ.
+ */
+TEST_F(hmm, anon_write_prot)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	int ret;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/* Simulate a device reading a zero page of memory. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_READ, buffer, 1);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, 1);
+	ASSERT_EQ(buffer->faults, 1);
+
+	/* Initialize data that the device will write to buffer->ptr. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Simulate a device writing system memory. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_WRITE, buffer, npages);
+	ASSERT_EQ(ret, -EPERM);
+
+	/* Check what the device wrote. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], 0);
+
+	/* Now allow writing and see that the zero page is replaced. */
+	ret = mprotect(buffer->ptr, size, PROT_WRITE | PROT_READ);
+	ASSERT_EQ(ret, 0);
+
+	/* Simulate a device writing system memory. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_WRITE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+	ASSERT_EQ(buffer->faults, 1);
+
+	/* Check what the device wrote. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Check that a device writing an anonymous private mapping
+ * will copy-on-write if a child process inherits the mapping.
+ */
+TEST_F(hmm, anon_write_child)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	pid_t pid;
+	int child_fd;
+	int ret;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/* Initialize buffer->ptr so we can tell if it is written. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Initialize data that the device will write to buffer->ptr. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ptr[i] = -i;
+
+	pid = fork();
+	if (pid == -1)
+		ASSERT_EQ(pid, 0);
+	if (pid != 0) {
+		waitpid(pid, &ret, 0);
+		ASSERT_EQ(WIFEXITED(ret), 1);
+
+		/* Check that the parent's buffer did not change. */
+		for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+			ASSERT_EQ(ptr[i], i);
+		return;
+	}
+
+	/* Check that we see the parent's values. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], -i);
+
+	/* The child process needs its own mirror to its own mm. */
+	child_fd = hmm_open(0);
+	ASSERT_GE(child_fd, 0);
+
+	/* Simulate a device writing system memory. */
+	ret = hmm_dmirror_cmd(child_fd, HMM_DMIRROR_WRITE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+	ASSERT_EQ(buffer->faults, 1);
+
+	/* Check what the device wrote. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], -i);
+
+	close(child_fd);
+	exit(0);
+}
+
+/*
+ * Check that a device writing an anonymous shared mapping
+ * will not copy-on-write if a child process inherits the mapping.
+ */
+TEST_F(hmm, anon_write_child_shared)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	pid_t pid;
+	int child_fd;
+	int ret;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_SHARED | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/* Initialize buffer->ptr so we can tell if it is written. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Initialize data that the device will write to buffer->ptr. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ptr[i] = -i;
+
+	pid = fork();
+	if (pid == -1)
+		ASSERT_EQ(pid, 0);
+	if (pid != 0) {
+		waitpid(pid, &ret, 0);
+		ASSERT_EQ(WIFEXITED(ret), 1);
+
+		/* Check that the parent's buffer did change. */
+		for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+			ASSERT_EQ(ptr[i], -i);
+		return;
+	}
+
+	/* Check that we see the parent's values. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], -i);
+
+	/* The child process needs its own mirror to its own mm. */
+	child_fd = hmm_open(0);
+	ASSERT_GE(child_fd, 0);
+
+	/* Simulate a device writing system memory. */
+	ret = hmm_dmirror_cmd(child_fd, HMM_DMIRROR_WRITE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+	ASSERT_EQ(buffer->faults, 1);
+
+	/* Check what the device wrote. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], -i);
+
+	close(child_fd);
+	exit(0);
+}
+
+/*
+ * Write private anonymous huge page.
+ */
+TEST_F(hmm, anon_write_huge)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	void *old_ptr;
+	void *map;
+	int *ptr;
+	int ret;
+
+	size = 2 * TWOMEG;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	size = TWOMEG;
+	npages = size >> self->page_shift;
+	map = (void *)ALIGN((uintptr_t)buffer->ptr, size);
+	ret = madvise(map, size, MADV_HUGEPAGE);
+	ASSERT_EQ(ret, 0);
+	old_ptr = buffer->ptr;
+	buffer->ptr = map;
+
+	/* Initialize data that the device will write to buffer->ptr. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Simulate a device writing system memory. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_WRITE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+	ASSERT_EQ(buffer->faults, 1);
+
+	/* Check what the device wrote. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	buffer->ptr = old_ptr;
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Write huge TLBFS page.
+ */
+TEST_F(hmm, anon_write_hugetlbfs)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	int ret;
+	long pagesizes[4];
+	int n, idx;
+
+	/* Skip test if we can't allocate a hugetlbfs page. */
+
+	n = gethugepagesizes(pagesizes, 4);
+	if (n <= 0)
+		return;
+	for (idx = 0; --n > 0; ) {
+		if (pagesizes[n] < pagesizes[idx])
+			idx = n;
+	}
+	size = ALIGN(TWOMEG, pagesizes[idx]);
+	npages = size >> self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->ptr = get_hugepage_region(size, GHR_STRICT);
+	if (buffer->ptr == NULL) {
+		free(buffer);
+		return;
+	}
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	/* Initialize data that the device will write to buffer->ptr. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Simulate a device writing system memory. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_WRITE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+	ASSERT_EQ(buffer->faults, 1);
+
+	/* Check what the device wrote. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	free_hugepage_region(buffer->ptr);
+	buffer->ptr = NULL;
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Read mmap'ed file memory.
+ */
+TEST_F(hmm, file_read)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	int ret;
+	int fd;
+	off_t off;
+	ssize_t len;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	fd = hmm_create_file(size);
+	ASSERT_GE(fd, 0);
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = fd;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	/* Write initial contents of the file. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+	off = lseek(fd, 0, SEEK_SET);
+	ASSERT_EQ(off, 0);
+	len = write(fd, buffer->mirror, size);
+	ASSERT_EQ(len, size);
+	memset(buffer->mirror, 0, size);
+
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ,
+			   MAP_SHARED,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/* Simulate a device reading system memory. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_READ, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+	ASSERT_EQ(buffer->faults, 1);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Write mmap'ed file memory.
+ */
+TEST_F(hmm, file_write)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	int ret;
+	int fd;
+	off_t off;
+	ssize_t len;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	fd = hmm_create_file(size);
+	ASSERT_GE(fd, 0);
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = fd;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_SHARED,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/* Initialize data that the device will write to buffer->ptr. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Simulate a device writing system memory. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_WRITE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+	ASSERT_EQ(buffer->faults, 1);
+
+	/* Check what the device wrote. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	/* Check that the device also wrote the file. */
+	off = lseek(fd, 0, SEEK_SET);
+	ASSERT_EQ(off, 0);
+	len = read(fd, buffer->mirror, size);
+	ASSERT_EQ(len, size);
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Migrate anonymous memory to device private memory.
+ */
+TEST_F(hmm, migrate)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	int ret;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/* Initialize buffer in system memory. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Migrate memory to device. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Migrate anonymous memory to device private memory and fault it back to system
+ * memory.
+ */
+TEST_F(hmm, migrate_fault)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	int ret;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/* Initialize buffer in system memory. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Migrate memory to device. */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	/* Fault pages back to system memory and check them. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Try to migrate various memory types to device private memory.
+ */
+TEST_F(hmm2, migrate_mixed)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	int *ptr;
+	unsigned char *p;
+	int ret;
+	int val;
+
+	npages = 6;
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	/* Reserve a range of addresses. */
+	buffer->ptr = mmap(NULL, size,
+			   PROT_NONE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+	p = buffer->ptr;
+
+	/* Now try to migrate everything to device 1. */
+	ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, 6);
+
+	/* Punch a hole after the first page address. */
+	ret = munmap(buffer->ptr + self->page_size, self->page_size);
+	ASSERT_EQ(ret, 0);
+
+	/* We expect an error if the vma doesn't cover the range. */
+	ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, 3);
+	ASSERT_EQ(ret, -EINVAL);
+
+	/* Page 2 will be a read-only zero page. */
+	ret = mprotect(buffer->ptr + 2 * self->page_size, self->page_size,
+				PROT_READ);
+	ASSERT_EQ(ret, 0);
+	ptr = (int *)(buffer->ptr + 2 * self->page_size);
+	val = *ptr + 3;
+	ASSERT_EQ(val, 3);
+
+	/* Page 3 will be read-only. */
+	ret = mprotect(buffer->ptr + 3 * self->page_size, self->page_size,
+				PROT_READ | PROT_WRITE);
+	ASSERT_EQ(ret, 0);
+	ptr = (int *)(buffer->ptr + 3 * self->page_size);
+	*ptr = val;
+	ret = mprotect(buffer->ptr + 3 * self->page_size, self->page_size,
+				PROT_READ);
+	ASSERT_EQ(ret, 0);
+
+	/* Page 4 will be read-write. */
+	ret = mprotect(buffer->ptr + 4 * self->page_size, self->page_size,
+				PROT_READ | PROT_WRITE);
+	ASSERT_EQ(ret, 0);
+	ptr = (int *)(buffer->ptr + 4 * self->page_size);
+	*ptr = val;
+
+	/* Page 5 won't be migrated to device 0 because it's on device 1. */
+	buffer->ptr = p + 5 * self->page_size;
+	ret = hmm_dmirror_cmd(self->fd0, HMM_DMIRROR_MIGRATE, buffer, 1);
+	ASSERT_EQ(ret, -ENOENT);
+	buffer->ptr = p;
+
+	/* Now try to migrate pages 2-3 to device 1. */
+	buffer->ptr = p + 2 * self->page_size;
+	ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, 2);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, 2);
+	buffer->ptr = p;
+
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Migrate anonymous memory to device private memory and fault it back to system
+ * memory multiple times.
+ */
+TEST_F(hmm, migrate_multiple)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	unsigned long c;
+	int *ptr;
+	int ret;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	for (c = 0; c < NTIMES; c++) {
+		buffer = malloc(sizeof(*buffer));
+		ASSERT_NE(buffer, NULL);
+
+		buffer->fd = -1;
+		buffer->size = size;
+		buffer->mirror = malloc(size);
+		ASSERT_NE(buffer->mirror, NULL);
+
+		buffer->ptr = mmap(NULL, size,
+				   PROT_READ | PROT_WRITE,
+				   MAP_PRIVATE | MAP_ANONYMOUS,
+				   buffer->fd, 0);
+		ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+		/* Initialize buffer in system memory. */
+		for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+			ptr[i] = i;
+
+		/* Migrate memory to device. */
+		ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_MIGRATE, buffer,
+				      npages);
+		ASSERT_EQ(ret, 0);
+		ASSERT_EQ(buffer->cpages, npages);
+
+		/* Check what the device read. */
+		for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+			ASSERT_EQ(ptr[i], i);
+
+		/* Fault pages back to system memory and check them. */
+		for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+			ASSERT_EQ(ptr[i], i);
+
+		hmm_buffer_free(buffer);
+	}
+}
+
+/*
+ * Read anonymous memory multiple times.
+ */
+TEST_F(hmm, anon_read_multiple)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	unsigned long c;
+	int *ptr;
+	int ret;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	for (c = 0; c < NTIMES; c++) {
+		buffer = malloc(sizeof(*buffer));
+		ASSERT_NE(buffer, NULL);
+
+		buffer->fd = -1;
+		buffer->size = size;
+		buffer->mirror = malloc(size);
+		ASSERT_NE(buffer->mirror, NULL);
+
+		buffer->ptr = mmap(NULL, size,
+				   PROT_READ | PROT_WRITE,
+				   MAP_PRIVATE | MAP_ANONYMOUS,
+				   buffer->fd, 0);
+		ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+		/* Initialize buffer in system memory. */
+		for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+			ptr[i] = i + c;
+
+		/* Simulate a device reading system memory. */
+		ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_READ, buffer,
+				      npages);
+		ASSERT_EQ(ret, 0);
+		ASSERT_EQ(buffer->cpages, npages);
+		ASSERT_EQ(buffer->faults, 1);
+
+		/* Check what the device read. */
+		for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+			ASSERT_EQ(ptr[i], i + c);
+
+		hmm_buffer_free(buffer);
+	}
+}
+
+void *unmap_buffer(void *p)
+{
+	struct hmm_buffer *buffer = p;
+
+	/* Delay for a bit and then unmap buffer while it is being read. */
+	hmm_nanosleep(hmm_random() % 32000);
+	munmap(buffer->ptr + buffer->size / 2, buffer->size / 2);
+	buffer->ptr = NULL;
+
+	return NULL;
+}
+
+/*
+ * Try reading anonymous memory while it is being unmapped.
+ */
+TEST_F(hmm, anon_teardown)
+{
+	unsigned long npages;
+	unsigned long size;
+	unsigned long c;
+	void *ret;
+
+	npages = ALIGN(HMM_BUFFER_SIZE, self->page_size) >> self->page_shift;
+	ASSERT_NE(npages, 0);
+	size = npages << self->page_shift;
+
+	for (c = 0; c < NTIMES; ++c) {
+		pthread_t thread;
+		struct hmm_buffer *buffer;
+		unsigned long i;
+		int *ptr;
+		int rc;
+
+		buffer = malloc(sizeof(*buffer));
+		ASSERT_NE(buffer, NULL);
+
+		buffer->fd = -1;
+		buffer->size = size;
+		buffer->mirror = malloc(size);
+		ASSERT_NE(buffer->mirror, NULL);
+
+		buffer->ptr = mmap(NULL, size,
+				   PROT_READ | PROT_WRITE,
+				   MAP_PRIVATE | MAP_ANONYMOUS,
+				   buffer->fd, 0);
+		ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+		/* Initialize buffer in system memory. */
+		for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+			ptr[i] = i + c;
+
+		rc = pthread_create(&thread, NULL, unmap_buffer, buffer);
+		ASSERT_EQ(rc, 0);
+
+		/* Simulate a device reading system memory. */
+		rc = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_READ, buffer,
+				     npages);
+		if (rc == 0) {
+			ASSERT_EQ(buffer->cpages, npages);
+			ASSERT_EQ(buffer->faults, 1);
+
+			/* Check what the device read. */
+			for (i = 0, ptr = buffer->mirror;
+			     i < size / sizeof(*ptr);
+			     ++i)
+				ASSERT_EQ(ptr[i], i + c);
+		}
+
+		pthread_join(thread, &ret);
+		hmm_buffer_free(buffer);
+	}
+}
+
+/*
+ * Test memory snapshot without faulting in pages accessed by the device.
+ */
+TEST_F(hmm2, snapshot)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	int *ptr;
+	unsigned char *p;
+	unsigned char *m;
+	int ret;
+	int val;
+
+	npages = 7;
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(npages);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	/* Reserve a range of addresses. */
+	buffer->ptr = mmap(NULL, size,
+			   PROT_NONE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+	p = buffer->ptr;
+
+	/* Punch a hole after the first page address. */
+	ret = munmap(buffer->ptr + self->page_size, self->page_size);
+	ASSERT_EQ(ret, 0);
+
+	/* Page 2 will be a read-only zero page. */
+	ret = mprotect(buffer->ptr + 2 * self->page_size, self->page_size,
+				PROT_READ);
+	ASSERT_EQ(ret, 0);
+	ptr = (int *)(buffer->ptr + 2 * self->page_size);
+	val = *ptr + 3;
+	ASSERT_EQ(val, 3);
+
+	/* Page 3 will be read-only. */
+	ret = mprotect(buffer->ptr + 3 * self->page_size, self->page_size,
+				PROT_READ | PROT_WRITE);
+	ASSERT_EQ(ret, 0);
+	ptr = (int *)(buffer->ptr + 3 * self->page_size);
+	*ptr = val;
+	ret = mprotect(buffer->ptr + 3 * self->page_size, self->page_size,
+				PROT_READ);
+	ASSERT_EQ(ret, 0);
+
+	/* Page 4-6 will be read-write. */
+	ret = mprotect(buffer->ptr + 4 * self->page_size, 3 * self->page_size,
+				PROT_READ | PROT_WRITE);
+	ASSERT_EQ(ret, 0);
+	ptr = (int *)(buffer->ptr + 4 * self->page_size);
+	*ptr = val;
+
+	/* Page 5 will be migrated to device 0. */
+	buffer->ptr = p + 5 * self->page_size;
+	ret = hmm_dmirror_cmd(self->fd0, HMM_DMIRROR_MIGRATE, buffer, 1);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, 1);
+
+	/* Page 6 will be migrated to device 1. */
+	buffer->ptr = p + 6 * self->page_size;
+	ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_MIGRATE, buffer, 1);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, 1);
+
+	/* Simulate a device snapshotting CPU pagetables. */
+	buffer->ptr = p;
+	ret = hmm_dmirror_cmd(self->fd0, HMM_DMIRROR_SNAPSHOT, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/* Check what the device saw. */
+	m = buffer->mirror;
+	ASSERT_EQ(m[0], HMM_DMIRROR_PROT_NONE);
+	ASSERT_EQ(m[1], HMM_DMIRROR_PROT_NONE);
+	ASSERT_EQ(m[2], HMM_DMIRROR_PROT_ZERO | HMM_DMIRROR_PROT_READ);
+	ASSERT_EQ(m[3], HMM_DMIRROR_PROT_READ);
+	ASSERT_EQ(m[4], HMM_DMIRROR_PROT_WRITE);
+	ASSERT_EQ(m[5], HMM_DMIRROR_PROT_DEV_PRIVATE_LOCAL |
+			HMM_DMIRROR_PROT_WRITE);
+	ASSERT_EQ(m[6], HMM_DMIRROR_PROT_DEV_PRIVATE_REMOTE |
+			HMM_DMIRROR_PROT_WRITE);
+
+	hmm_buffer_free(buffer);
+}
+
+/*
+ * Test two devices reading the same memory (double mapped).
+ */
+TEST_F(hmm2, double_map)
+{
+	struct hmm_buffer *buffer;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	int *ptr;
+	int ret;
+
+	npages = 6;
+	size = npages << self->page_shift;
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(npages);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	/* Reserve a range of addresses. */
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   buffer->fd, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/* Initialize buffer in system memory. */
+	for (i = 0, ptr = buffer->ptr; i < size / sizeof(*ptr); ++i)
+		ptr[i] = i;
+
+	/* Make region read-only. */
+	ret = mprotect(buffer->ptr, size, PROT_READ);
+	ASSERT_EQ(ret, 0);
+
+	/* Simulate device 0 reading system memory. */
+	ret = hmm_dmirror_cmd(self->fd0, HMM_DMIRROR_READ, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+	ASSERT_EQ(buffer->faults, 1);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	/* Simulate device 1 reading system memory. */
+	ret = hmm_dmirror_cmd(self->fd1, HMM_DMIRROR_READ, buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+	ASSERT_EQ(buffer->faults, 1);
+
+	/* Check what the device read. */
+	for (i = 0, ptr = buffer->mirror; i < size / sizeof(*ptr); ++i)
+		ASSERT_EQ(ptr[i], i);
+
+	/* Punch a hole after the first page address. */
+	ret = munmap(buffer->ptr + self->page_size, self->page_size);
+	ASSERT_EQ(ret, 0);
+
+	hmm_buffer_free(buffer);
+}
+
+TEST_HARNESS_MAIN
diff --git a/tools/testing/selftests/vm/run_vmtests b/tools/testing/selftests/vm/run_vmtests
index a692ea828317..ea3728570585 100755
--- a/tools/testing/selftests/vm/run_vmtests
+++ b/tools/testing/selftests/vm/run_vmtests
@@ -237,4 +237,20 @@ else
 	exitcode=1
 fi
 
+echo "------------------------------------"
+echo "running HMM smoke test"
+echo "------------------------------------"
+./test_hmm.sh smoke
+ret_val=$?
+
+if [ $ret_val -eq 0 ]; then
+	echo "[PASS]"
+elif [ $ret_val -eq $ksft_skip ]; then
+	echo "[SKIP]"
+	exitcode=$ksft_skip
+else
+	echo "[FAIL]"
+	exitcode=1
+fi
+
 exit $exitcode
diff --git a/tools/testing/selftests/vm/test_hmm.sh b/tools/testing/selftests/vm/test_hmm.sh
new file mode 100755
index 000000000000..461e4a99a362
--- /dev/null
+++ b/tools/testing/selftests/vm/test_hmm.sh
@@ -0,0 +1,97 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Copyright (C) 2018 Uladzislau Rezki (Sony) <urezki@gmail.com>
+#
+# This is a test script for the HMM (Heterogeneous Memory Management) test
+# driver (test_hmm). It is just a kernel module loader: it loads the driver,
+# creates the /dev/hmm_dmirror* device nodes, runs the hmm-tests selftest
+# program against them, and then unloads the driver again. Currently only a
+# basic "smoke" test mode is provided.
+
+TEST_NAME="test_hmm"
+DRIVER="test_hmm"
+
+# 1 if fails
+exitcode=1
+
+# Kselftest framework requirement - SKIP code is 4.
+ksft_skip=4
+
+check_test_requirements()
+{
+	uid=$(id -u)
+	if [ $uid -ne 0 ]; then
+		echo "$0: Must be run as root"
+		exit $ksft_skip
+	fi
+
+	if ! which modprobe > /dev/null 2>&1; then
+		echo "$0: You need modprobe installed"
+		exit $ksft_skip
+	fi
+
+	if ! modinfo $DRIVER > /dev/null 2>&1; then
+		echo "$0: You must have the following enabled in your kernel:"
+		echo "CONFIG_TEST_HMM=m"
+		exit $ksft_skip
+	fi
+}
+
+load_driver()
+{
+	modprobe $DRIVER > /dev/null 2>&1
+	if [ $? == 0 ]; then
+		major=$(awk "\$2==\"HMM_DMIRROR\" {print \$1}" /proc/devices)
+		mknod /dev/hmm_dmirror0 c $major 0
+		mknod /dev/hmm_dmirror1 c $major 1
+	fi
+}
+
+unload_driver()
+{
+	modprobe -r $DRIVER > /dev/null 2>&1
+	rm -f /dev/hmm_dmirror?
+}
+
+run_smoke()
+{
+	echo "Running smoke test. Note, this test provides basic coverage."
+
+	load_driver
+	./hmm-tests
+	unload_driver
+}
+
+usage()
+{
+	echo -n "Usage: $0"
+	echo
+	echo "Example usage:"
+	echo
+	echo "# Shows help message"
+	echo "./${TEST_NAME}.sh"
+	echo
+	echo "# Smoke testing"
+	echo "./${TEST_NAME}.sh smoke"
+	echo
+	exit 0
+}
+
+function run_test()
+{
+	if [ $# -eq 0 ]; then
+		usage
+	else
+		if [ "$1" = "smoke" ]; then
+			run_smoke
+		else
+			usage
+		fi
+	fi
+}
+
+check_test_requirements
+run_test $@
+
+exit 0
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v5 1/2] mm/mmu_notifier: make interval notifier updates safe
  2019-12-16 19:57 ` [PATCH v5 1/2] mm/mmu_notifier: make interval notifier updates safe Ralph Campbell
@ 2019-12-17 20:51   ` Jason Gunthorpe
  2019-12-17 21:50     ` Ralph Campbell
  2020-01-09 19:48   ` Jason Gunthorpe
  1 sibling, 1 reply; 12+ messages in thread
From: Jason Gunthorpe @ 2019-12-17 20:51 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: linux-rdma, linux-mm, linux-kernel, linux-kselftest,
	Jerome Glisse, John Hubbard, Christoph Hellwig, Andrew Morton,
	Shuah Khan

On Mon, Dec 16, 2019 at 11:57:32AM -0800, Ralph Campbell wrote:
> mmu_interval_notifier_insert() and mmu_interval_notifier_remove() can't
> be called safely from inside the invalidate() callback. This is fine for
> devices with explicit memory region register and unregister calls but it
> is desirable from a programming model standpoint to not require explicit
> memory region registration. Regions can be registered based on device
> address faults but without a mechanism for updating or removing the mmu
> interval notifiers in response to munmap(), the invalidation callbacks
> will be for regions that are stale or apply to different mmaped regions.

What we do in RDMA is drive the removal from a work queue, as we need
a synchronize_srcu anyhow to serialize everything to do with
destroying a part of the address space mirror.

Is it really necessary to have all this stuff just to save doing
something like a work queue?
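
For illustration, a minimal sketch of that kind of flow (the struct and
function names here are made up for the example, not taken from an
existing driver):

	struct driver_interval {
		struct mmu_interval_notifier notifier;
		struct work_struct remove_work;
	};

	static void driver_interval_remove_work(struct work_struct *work)
	{
		struct driver_interval *di =
			container_of(work, struct driver_interval, remove_work);

		/* Process context, so blocking in remove() is fine here. */
		mmu_interval_notifier_remove(&di->notifier);
		kfree(di);
	}

	/* Called from invalidate() when an MMU_NOTIFY_UNMAP event is seen. */
	static void driver_interval_defer_remove(struct driver_interval *di)
	{
		INIT_WORK(&di->remove_work, driver_interval_remove_work);
		schedule_work(&di->remove_work);
	}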

Also, I think we are not taking core kernel APIs like this without an
in-kernel user?

> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 9e6caa8ecd19..55fbefcdc564 100644
> +++ b/include/linux/mmu_notifier.h
> @@ -233,11 +233,18 @@ struct mmu_notifier {
>   * @invalidate: Upon return the caller must stop using any SPTEs within this
>   *              range. This function can sleep. Return false only if sleeping
>   *              was required but mmu_notifier_range_blockable(range) is false.
> + * @release:	This function will be called when the mmu_interval_notifier
> + *		is removed from the interval tree. Defining this function also
> + *		allows mmu_interval_notifier_remove() and
> + *		mmu_interval_notifier_update() to be called from the
> + *		invalidate() callback function (i.e., they won't block waiting
> + *		for invalidations to finish.

Having a function called remove that doesn't block seems like a very
poor choice of language; we've tended to use put to describe that
operation.

The difference is meaningful as people often create use-after-free
bugs in drivers when presented with interfaces named 'remove' or
'destroy' that don't actually guarantee there are not going to be
continued accesses to the memory.
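
For example, it is tempting for a driver to write something like this
(di being a driver-private structure embedding the notifier):

	mmu_interval_notifier_remove(&di->notifier);
	kfree(di);	/* only safe if remove() waited for callbacks */

and that is only correct if the function guarantees no callbacks are
still running against the memory being freed.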

>   */
>  struct mmu_interval_notifier_ops {
>  	bool (*invalidate)(struct mmu_interval_notifier *mni,
>  			   const struct mmu_notifier_range *range,
>  			   unsigned long cur_seq);
> +	void (*release)(struct mmu_interval_notifier *mni);
>  };
>  
>  struct mmu_interval_notifier {
> @@ -246,6 +253,8 @@ struct mmu_interval_notifier {
>  	struct mm_struct *mm;
>  	struct hlist_node deferred_item;
>  	unsigned long invalidate_seq;
> +	unsigned long deferred_start;
> +	unsigned long deferred_last;

I couldn't quite understand how something like this can work, what is
preventing parallel updates?

> +/**
> + * mmu_interval_notifier_update - Update interval notifier end
> + * @mni: Interval notifier to update
> + * @start: New starting virtual address to monitor
> + * @length: New length of the range to monitor
> + *
> + * This function updates the range being monitored.
> + * If there is no release() function defined, the call will wait for the
> + * update to finish before returning.
> + */
> +int mmu_interval_notifier_update(struct mmu_interval_notifier *mni,
> +				 unsigned long start, unsigned long length)
> +{

Update should probably be its own patch

Jason

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v5 1/2] mm/mmu_notifier: make interval notifier updates safe
  2019-12-17 20:51   ` Jason Gunthorpe
@ 2019-12-17 21:50     ` Ralph Campbell
  0 siblings, 0 replies; 12+ messages in thread
From: Ralph Campbell @ 2019-12-17 21:50 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma, linux-mm, linux-kernel, linux-kselftest,
	Jerome Glisse, John Hubbard, Christoph Hellwig, Andrew Morton,
	Shuah Khan


On 12/17/19 12:51 PM, Jason Gunthorpe wrote:
> On Mon, Dec 16, 2019 at 11:57:32AM -0800, Ralph Campbell wrote:
>> mmu_interval_notifier_insert() and mmu_interval_notifier_remove() can't
>> be called safely from inside the invalidate() callback. This is fine for
>> devices with explicit memory region register and unregister calls but it
>> is desirable from a programming model standpoint to not require explicit
>> memory region registration. Regions can be registered based on device
>> address faults but without a mechanism for updating or removing the mmu
>> interval notifiers in response to munmap(), the invalidation callbacks
>> will be for regions that are stale or apply to different mmaped regions.
> 
> What we do in RDMA is drive the removal from a work queue, as we need
> a synchronize_srcu anyhow to serialize everything to do with
> destroying a part of the address space mirror.
> 
> Is it really necessary to have all this stuff just to save doing
> something like a work queue?

Well, the invalidates already have to use the driver lock to synchronize,
so handling the range tracking updates semi-synchronously seems more
straightforward to me.
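
For context, the general shape of such an invalidate() callback with a
driver lock is something like this (driver_mirror and
driver_update_range are placeholder names):

	static bool driver_interval_invalidate(struct mmu_interval_notifier *mni,
				const struct mmu_notifier_range *range,
				unsigned long cur_seq)
	{
		struct driver_mirror *dm =
			container_of(mni, struct driver_mirror, notifier);

		if (mmu_notifier_range_blockable(range))
			mutex_lock(&dm->mutex);
		else if (!mutex_trylock(&dm->mutex))
			return false;

		mmu_interval_set_seq(mni, cur_seq);
		/* Invalidate the device page tables for this range. */
		driver_update_range(dm, range->start, range->end);

		mutex_unlock(&dm->mutex);
		return true;
	}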

Do you feel strongly that adding a work queue is the right way to handle
this?

> Also, I think we are not taking core kernel APIs like this without an
> in-kernel user?

Right. I was looking for feedback before updating nouveau to use it.

>> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
>> index 9e6caa8ecd19..55fbefcdc564 100644
>> +++ b/include/linux/mmu_notifier.h
>> @@ -233,11 +233,18 @@ struct mmu_notifier {
>>    * @invalidate: Upon return the caller must stop using any SPTEs within this
>>    *              range. This function can sleep. Return false only if sleeping
>>    *              was required but mmu_notifier_range_blockable(range) is false.
>> + * @release:	This function will be called when the mmu_interval_notifier
>> + *		is removed from the interval tree. Defining this function also
>> + *		allows mmu_interval_notifier_remove() and
>> + *		mmu_interval_notifier_update() to be called from the
>> + *		invalidate() callback function (i.e., they won't block waiting
>> + *		for invalidations to finish.
> 
> Having a function called remove that doesn't block seems like a very
> poor choice of language; we've tended to use put to describe that
> operation.
> 
> The difference is meaningful as people often create use-after-free
> bugs in drivers when presented with interfaces named 'remove' or
> 'destroy' that don't actually guarantee there are not going to be
> continued accesses to the memory.

OK. I can rename it put().

>>    */
>>   struct mmu_interval_notifier_ops {
>>   	bool (*invalidate)(struct mmu_interval_notifier *mni,
>>   			   const struct mmu_notifier_range *range,
>>   			   unsigned long cur_seq);
>> +	void (*release)(struct mmu_interval_notifier *mni);
>>   };
>>   
>>   struct mmu_interval_notifier {
>> @@ -246,6 +253,8 @@ struct mmu_interval_notifier {
>>   	struct mm_struct *mm;
>>   	struct hlist_node deferred_item;
>>   	unsigned long invalidate_seq;
>> +	unsigned long deferred_start;
>> +	unsigned long deferred_last;
> 
> I couldn't quite understand how something like this can work, what is
> preventing parallel updates?

It is serialized by the struct mmu_notifier_mm lock.
If there are no tasks walking the interval tree, the update
happens synchronously under the lock. If there are walkers,
the start/last values are stored under the lock and the last caller's
values are used to update the interval tree when the last walker
finishes (under the lock again).
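
Condensed from mmu_interval_notifier_update() in the patch (leaving out
the case where the notifier was just inserted and is not in the tree
yet), the update path is essentially:

	spin_lock(&mmn_mm->lock);
	if (mn_itree_is_invalidating(mmn_mm)) {
		/* Walkers active: record the new bounds; they are applied
		 * in mn_itree_inv_end() when the last walker finishes.
		 */
		if (!mni->deferred_last)
			hlist_add_head(&mni->deferred_item,
				       &mmn_mm->deferred_list);
		mni->deferred_start = start;
		mni->deferred_last = last;
	} else {
		/* No walkers: update the interval tree directly. */
		interval_tree_remove(&mni->interval_tree, &mmn_mm->itree);
		mni->interval_tree.start = start;
		mni->interval_tree.last = last;
		interval_tree_insert(&mni->interval_tree, &mmn_mm->itree);
	}
	spin_unlock(&mmn_mm->lock);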

>> +/**
>> + * mmu_interval_notifier_update - Update interval notifier end
>> + * @mni: Interval notifier to update
>> + * @start: New starting virtual address to monitor
>> + * @length: New length of the range to monitor
>> + *
>> + * This function updates the range being monitored.
>> + * If there is no release() function defined, the call will wait for the
>> + * update to finish before returning.
>> + */
>> +int mmu_interval_notifier_update(struct mmu_interval_notifier *mni,
>> +				 unsigned long start, unsigned long length)
>> +{
> 
> Update should probably be its own patch
> 
> Jason

OK.
Thanks for the review.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v5 1/2] mm/mmu_notifier: make interval notifier updates safe
  2019-12-16 19:57 ` [PATCH v5 1/2] mm/mmu_notifier: make interval notifier updates safe Ralph Campbell
  2019-12-17 20:51   ` Jason Gunthorpe
@ 2020-01-09 19:48   ` Jason Gunthorpe
  2020-01-09 22:01     ` Ralph Campbell
  1 sibling, 1 reply; 12+ messages in thread
From: Jason Gunthorpe @ 2020-01-09 19:48 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: linux-rdma, linux-mm, linux-kernel, linux-kselftest,
	Jerome Glisse, John Hubbard, Christoph Hellwig, Andrew Morton,
	Shuah Khan

On Mon, Dec 16, 2019 at 11:57:32AM -0800, Ralph Campbell wrote:
> mmu_interval_notifier_insert() and mmu_interval_notifier_remove() can't
> be called safely from inside the invalidate() callback. This is fine for
> devices with explicit memory region register and unregister calls but it
> is desirable from a programming model standpoint to not require explicit
> memory region registration. Regions can be registered based on device
> address faults but without a mechanism for updating or removing the mmu
> interval notifiers in response to munmap(), the invalidation callbacks
> will be for regions that are stale or apply to different mmaped regions.
> 
> The invalidate() callback provides the necessary information (i.e.,
> the event type MMU_NOTIFY_UNMAP) so add insert, remove, and update
> functions that are safe to call from the invalidate() callback by
> extending the work done in mn_itree_inv_end() to update the interval tree
> when it is not being traversed.
> 
> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
>  include/linux/mmu_notifier.h |  15 +++
>  mm/mmu_notifier.c            | 196 ++++++++++++++++++++++++++++++-----
>  2 files changed, 186 insertions(+), 25 deletions(-)
> 
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 9e6caa8ecd19..55fbefcdc564 100644
> +++ b/include/linux/mmu_notifier.h
> @@ -233,11 +233,18 @@ struct mmu_notifier {
>   * @invalidate: Upon return the caller must stop using any SPTEs within this
>   *              range. This function can sleep. Return false only if sleeping
>   *              was required but mmu_notifier_range_blockable(range) is false.
> + * @release:	This function will be called when the mmu_interval_notifier
> + *		is removed from the interval tree. Defining this function also
> + *		allows mmu_interval_notifier_remove() and
> + *		mmu_interval_notifier_update() to be called from the
> + *		invalidate() callback function (i.e., they won't block waiting
> + *		for invalidations to finish.
>   */
>  struct mmu_interval_notifier_ops {
>  	bool (*invalidate)(struct mmu_interval_notifier *mni,
>  			   const struct mmu_notifier_range *range,
>  			   unsigned long cur_seq);
> +	void (*release)(struct mmu_interval_notifier *mni);
>  };
>  
>  struct mmu_interval_notifier {
> @@ -246,6 +253,8 @@ struct mmu_interval_notifier {
>  	struct mm_struct *mm;
>  	struct hlist_node deferred_item;
>  	unsigned long invalidate_seq;
> +	unsigned long deferred_start;
> +	unsigned long deferred_last;
>  };
>  
>  #ifdef CONFIG_MMU_NOTIFIER
> @@ -299,7 +308,13 @@ int mmu_interval_notifier_insert_locked(
>  	struct mmu_interval_notifier *mni, struct mm_struct *mm,
>  	unsigned long start, unsigned long length,
>  	const struct mmu_interval_notifier_ops *ops);
> +int mmu_interval_notifier_insert_safe(
> +	struct mmu_interval_notifier *mni, struct mm_struct *mm,
> +	unsigned long start, unsigned long length,
> +	const struct mmu_interval_notifier_ops *ops);
>  void mmu_interval_notifier_remove(struct mmu_interval_notifier *mni);
> +int mmu_interval_notifier_update(struct mmu_interval_notifier *mni,
> +				 unsigned long start, unsigned long length);
>  
>  /**
>   * mmu_interval_set_seq - Save the invalidation sequence
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index f76ea05b1cb0..c303285750b2 100644
> +++ b/mm/mmu_notifier.c
> @@ -129,6 +129,7 @@ static void mn_itree_inv_end(struct mmu_notifier_mm *mmn_mm)
>  {
>  	struct mmu_interval_notifier *mni;
>  	struct hlist_node *next;
> +	struct hlist_head removed_list;
>  
>  	spin_lock(&mmn_mm->lock);
>  	if (--mmn_mm->active_invalidate_ranges ||
> @@ -144,21 +145,47 @@ static void mn_itree_inv_end(struct mmu_notifier_mm *mmn_mm)
>  	 * The inv_end incorporates a deferred mechanism like rtnl_unlock().
>  	 * Adds and removes are queued until the final inv_end happens then
>  	 * they are progressed. This arrangement for tree updates is used to
> -	 * avoid using a blocking lock during invalidate_range_start.
> +	 * avoid using a blocking lock while walking the interval tree.
>  	 */
> +	INIT_HLIST_HEAD(&removed_list);
>  	hlist_for_each_entry_safe(mni, next, &mmn_mm->deferred_list,
>  				  deferred_item) {
> +		hlist_del(&mni->deferred_item);
>  		if (RB_EMPTY_NODE(&mni->interval_tree.rb))
>  			interval_tree_insert(&mni->interval_tree,
>  					     &mmn_mm->itree);
> -		else
> +		else {
>  			interval_tree_remove(&mni->interval_tree,
>  					     &mmn_mm->itree);
> -		hlist_del(&mni->deferred_item);
> +			if (mni->deferred_last) {
> +				mni->interval_tree.start = mni->deferred_start;
> +				mni->interval_tree.last = mni->deferred_last;
> +				mni->deferred_last = 0;

Technically we can have an interval starting at zero.

I'd write it more like

if (mni->updated_start == mni->updated_end)
    insert
else
    remove

ie an empty interval can't get a notification so it should be removed
from the tree.

I also like the name 'updated' better than deferred; it is a bit
clearer.

Adding release should be its own patch.

> @@ -970,14 +999,52 @@ int mmu_interval_notifier_insert_locked(
>  EXPORT_SYMBOL_GPL(mmu_interval_notifier_insert_locked);
>  
>  /**
> - * mmu_interval_notifier_remove - Remove a interval notifier
> - * @mni: Interval notifier to unregister
> + * mmu_interval_notifier_insert_safe - Insert an interval notifier
> + * @mni: Interval notifier to register
> + * @mm : mm_struct to attach to
> + * @start: Starting virtual address to monitor
> + * @length: Length of the range to monitor
> + * @ops: Interval notifier callback operations
>   *
> - * This function must be paired with mmu_interval_notifier_insert(). It cannot
> - * be called from any ops callback.
> + * Return: -EINVAL if @mm hasn't been initialized for interval notifiers
> + *	by calling mmu_notifier_register(NULL, mm) or
> + *	__mmu_notifier_register(NULL, mm).
>   *
> - * Once this returns ops callbacks are no longer running on other CPUs and
> - * will not be called in future.
> + * This function subscribes the interval notifier for notifications from the
> + * mm.  Upon return, the ops related to mmu_interval_notifier will be called
> + * whenever an event that intersects with the given range occurs.
> + *
> + * This function is safe to call from the ops->invalidate() function.
> + * Upon return, the mmu_interval_notifier may not be present in the interval
> + * tree yet.  The caller must use the normal interval notifier read flow via
> + * mmu_interval_read_begin() to establish SPTEs for this range.

So why do we need this? You can't call hmm_range_fault from a
notifier. You just can't.

So there should be no reason to create an interval from the notifier;
do it from where you call hmm_range_fault, and it must be safe to
obtain the mmap_sem from that thread.
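
That is, the usual pattern is along these lines (a sketch only: the
hmm_range setup and most error handling are trimmed, and the
driver_mirror lock/mm fields are placeholders):

	static int driver_populate_range(struct driver_mirror *dm,
					 struct hmm_range *range)
	{
		long ret;

	again:
		range->notifier_seq = mmu_interval_read_begin(range->notifier);
		down_read(&dm->mm->mmap_sem);
		ret = hmm_range_fault(range, 0);
		up_read(&dm->mm->mmap_sem);
		if (ret < 0) {
			if (ret == -EBUSY)
				goto again;
			return ret;
		}

		mutex_lock(&dm->mutex);
		if (mmu_interval_read_retry(range->notifier,
					    range->notifier_seq)) {
			mutex_unlock(&dm->mutex);
			goto again;
		}
		/* Establish the device page tables from range->pfns here. */
		mutex_unlock(&dm->mutex);
		return 0;
	}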

> +/**
> + * mmu_interval_notifier_update - Update interval notifier end
> + * @mni: Interval notifier to update
> + * @start: New starting virtual address to monitor
> + * @length: New length of the range to monitor
> + *
> + * This function updates the range being monitored.
> + * If there is no release() function defined, the call will wait for the
> + * update to finish before returning.
> + */
> +int mmu_interval_notifier_update(struct mmu_interval_notifier *mni,
> +				 unsigned long start, unsigned long length)
> +{
> +	struct mm_struct *mm = mni->mm;
> +	struct mmu_notifier_mm *mmn_mm = mm->mmu_notifier_mm;
> +	unsigned long seq = 0;
> +	unsigned long last;
> +
> +	if (length == 0 || check_add_overflow(start, length - 1, &last))
> +		return -EOVERFLOW;
> +
> +	spin_lock(&mmn_mm->lock);
> +	if (mn_itree_is_invalidating(mmn_mm)) {
> +		/*
> +		 * Update is being called after insert put this on the
> +		 * deferred list, but before the deferred list was processed.
> +		 */
> +		if (RB_EMPTY_NODE(&mni->interval_tree.rb)) {
> +			mni->interval_tree.start = start;
> +			mni->interval_tree.last = last;
> +		} else {
> +			if (!mni->deferred_last)
> +				hlist_add_head(&mni->deferred_item,
> +					       &mmn_mm->deferred_list);
> +			mni->deferred_start = start;
> +			mni->deferred_last = last;
> +		}
> +		seq = mmn_mm->invalidate_seq;
> +	} else {
> +		WARN_ON(RB_EMPTY_NODE(&mni->interval_tree.rb));
> +		interval_tree_remove(&mni->interval_tree, &mmn_mm->itree);
> +		mni->interval_tree.start = start;
> +		mni->interval_tree.last = last;
> +		interval_tree_insert(&mni->interval_tree, &mmn_mm->itree);
> +	}
> +	spin_unlock(&mmn_mm->lock);
> +
> +	if (!mni->ops->release && seq) {
> +		/*
> +		 * The possible sleep on progress in the invalidation requires
> +		 * the caller not hold any locks held by invalidation
> +		 * callbacks.
> +		 */
> +		lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
> +		lock_map_release(&__mmu_notifier_invalidate_range_start_map);
>  		wait_event(mmn_mm->wq,
>  			   READ_ONCE(mmn_mm->invalidate_seq) != seq);
> +	}
>  
> -	/* pairs with mmgrab in mmu_interval_notifier_insert() */
> -	mmdrop(mm);
> +	return 0;
>  }
> -EXPORT_SYMBOL_GPL(mmu_interval_notifier_remove);
> +EXPORT_SYMBOL_GPL(mmu_interval_notifier_update);

An 'update' should probably be the same as insert; it doesn't
necessarily take effect until mmu_interval_read_begin(), so nothing
should be contingent on release.

As before, I'm not sure what to do with this. We need an in-kernel
user for new APIs, and I don't see a reason to make this more
complicated for a test program.

The test program should match one of the existing driver flows, so use
the page-table-like scheme from ODP or the fixed-lifetime scheme from
AMDGPU/ODP.

Jason

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v5 1/2] mm/mmu_notifier: make interval notifier updates safe
  2020-01-09 19:48   ` Jason Gunthorpe
@ 2020-01-09 22:01     ` Ralph Campbell
  2020-01-09 23:25       ` Jason Gunthorpe
  0 siblings, 1 reply; 12+ messages in thread
From: Ralph Campbell @ 2020-01-09 22:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma, linux-mm, linux-kernel, linux-kselftest,
	Jerome Glisse, John Hubbard, Christoph Hellwig, Andrew Morton,
	Shuah Khan


On 1/9/20 11:48 AM, Jason Gunthorpe wrote:
> On Mon, Dec 16, 2019 at 11:57:32AM -0800, Ralph Campbell wrote:
>> mmu_interval_notifier_insert() and mmu_interval_notifier_remove() can't
>> be called safely from inside the invalidate() callback. This is fine for
>> devices with explicit memory region register and unregister calls but it
>> is desirable from a programming model standpoint to not require explicit
>> memory region registration. Regions can be registered based on device
>> address faults but without a mechanism for updating or removing the mmu
>> interval notifiers in response to munmap(), the invalidation callbacks
>> will be for regions that are stale or apply to different mmaped regions.
>>
>> The invalidate() callback provides the necessary information (i.e.,
>> the event type MMU_NOTIFY_UNMAP) so add insert, remove, and update
>> functions that are safe to call from the invalidate() callback by
>> extending the work done in mn_itree_inv_end() to update the interval tree
>> when it is not being traversed.
>>
>> Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
>>   include/linux/mmu_notifier.h |  15 +++
>>   mm/mmu_notifier.c            | 196 ++++++++++++++++++++++++++++++-----
>>   2 files changed, 186 insertions(+), 25 deletions(-)
>>
>> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
>> index 9e6caa8ecd19..55fbefcdc564 100644
>> +++ b/include/linux/mmu_notifier.h
>> @@ -233,11 +233,18 @@ struct mmu_notifier {
>>    * @invalidate: Upon return the caller must stop using any SPTEs within this
>>    *              range. This function can sleep. Return false only if sleeping
>>    *              was required but mmu_notifier_range_blockable(range) is false.
>> + * @release:	This function will be called when the mmu_interval_notifier
>> + *		is removed from the interval tree. Defining this function also
>> + *		allows mmu_interval_notifier_remove() and
>> + *		mmu_interval_notifier_update() to be called from the
>> + *		invalidate() callback function (i.e., they won't block waiting
>> + *		for invalidations to finish.
>>    */
>>   struct mmu_interval_notifier_ops {
>>   	bool (*invalidate)(struct mmu_interval_notifier *mni,
>>   			   const struct mmu_notifier_range *range,
>>   			   unsigned long cur_seq);
>> +	void (*release)(struct mmu_interval_notifier *mni);
>>   };
>>   
>>   struct mmu_interval_notifier {
>> @@ -246,6 +253,8 @@ struct mmu_interval_notifier {
>>   	struct mm_struct *mm;
>>   	struct hlist_node deferred_item;
>>   	unsigned long invalidate_seq;
>> +	unsigned long deferred_start;
>> +	unsigned long deferred_last;
>>   };
>>   
>>   #ifdef CONFIG_MMU_NOTIFIER
>> @@ -299,7 +308,13 @@ int mmu_interval_notifier_insert_locked(
>>   	struct mmu_interval_notifier *mni, struct mm_struct *mm,
>>   	unsigned long start, unsigned long length,
>>   	const struct mmu_interval_notifier_ops *ops);
>> +int mmu_interval_notifier_insert_safe(
>> +	struct mmu_interval_notifier *mni, struct mm_struct *mm,
>> +	unsigned long start, unsigned long length,
>> +	const struct mmu_interval_notifier_ops *ops);
>>   void mmu_interval_notifier_remove(struct mmu_interval_notifier *mni);
>> +int mmu_interval_notifier_update(struct mmu_interval_notifier *mni,
>> +				 unsigned long start, unsigned long length);
>>   
>>   /**
>>    * mmu_interval_set_seq - Save the invalidation sequence
>> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
>> index f76ea05b1cb0..c303285750b2 100644
>> +++ b/mm/mmu_notifier.c
>> @@ -129,6 +129,7 @@ static void mn_itree_inv_end(struct mmu_notifier_mm *mmn_mm)
>>   {
>>   	struct mmu_interval_notifier *mni;
>>   	struct hlist_node *next;
>> +	struct hlist_head removed_list;
>>   
>>   	spin_lock(&mmn_mm->lock);
>>   	if (--mmn_mm->active_invalidate_ranges ||
>> @@ -144,21 +145,47 @@ static void mn_itree_inv_end(struct mmu_notifier_mm *mmn_mm)
>>   	 * The inv_end incorporates a deferred mechanism like rtnl_unlock().
>>   	 * Adds and removes are queued until the final inv_end happens then
>>   	 * they are progressed. This arrangement for tree updates is used to
>> -	 * avoid using a blocking lock during invalidate_range_start.
>> +	 * avoid using a blocking lock while walking the interval tree.
>>   	 */
>> +	INIT_HLIST_HEAD(&removed_list);
>>   	hlist_for_each_entry_safe(mni, next, &mmn_mm->deferred_list,
>>   				  deferred_item) {
>> +		hlist_del(&mni->deferred_item);
>>   		if (RB_EMPTY_NODE(&mni->interval_tree.rb))
>>   			interval_tree_insert(&mni->interval_tree,
>>   					     &mmn_mm->itree);
>> -		else
>> +		else {
>>   			interval_tree_remove(&mni->interval_tree,
>>   					     &mmn_mm->itree);
>> -		hlist_del(&mni->deferred_item);
>> +			if (mni->deferred_last) {
>> +				mni->interval_tree.start = mni->deferred_start;
>> +				mni->interval_tree.last = mni->deferred_last;
>> +				mni->deferred_last = 0;
> 
> Technicaly we can have an interval starting at zero.
> 
> I'd write it more like
> 
> if (mni->updated_start == mni->updated_end)
>      insert
> else
>      remove

OK, but I'm using updated_end == 0, not updated_start, and the end can't be zero.

> ie an empty interval can't get a notification so it should be removed
> from the tree.
> 
> I also like the name 'updated' better than deferred, it is a bit
> clearer..

OK.

> Adding release should be its own patch.

The release callback is associated with mmu_interval_notifier_put()
(i.e., async remove). Otherwise, there is no way to know when the
interval can be freed.
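
A minimal sketch of what that pairing could look like (dmirror_* names are
placeholders, and mmu_interval_notifier_put() is the asynchronous remove
being proposed here, not an existing function):

struct dmirror_interval {
	struct mmu_interval_notifier notifier;
	/* driver private state for the shadow PTEs covering this range */
};

static void dmirror_interval_release(struct mmu_interval_notifier *mni)
{
	struct dmirror_interval *di =
		container_of(mni, struct dmirror_interval, notifier);

	/*
	 * release() runs once the deferred removal has actually been applied
	 * to the interval tree, so no invalidate() callback can still reach
	 * this notifier and the containing structure is safe to free.
	 */
	kfree(di);
}

static const struct mmu_interval_notifier_ops dmirror_interval_ops = {
	.invalidate	= dmirror_interval_invalidate,	/* defined elsewhere */
	.release	= dmirror_interval_release,
};

The driver would then call mmu_interval_notifier_put(&di->notifier) instead
of the blocking remove, and only treat the interval as gone once release()
has fired.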

>> @@ -970,14 +999,52 @@ int mmu_interval_notifier_insert_locked(
>>   EXPORT_SYMBOL_GPL(mmu_interval_notifier_insert_locked);
>>   
>>   /**
>> - * mmu_interval_notifier_remove - Remove a interval notifier
>> - * @mni: Interval notifier to unregister
>> + * mmu_interval_notifier_insert_safe - Insert an interval notifier
>> + * @mni: Interval notifier to register
>> + * @mm : mm_struct to attach to
>> + * @start: Starting virtual address to monitor
>> + * @length: Length of the range to monitor
>> + * @ops: Interval notifier callback operations
>>    *
>> - * This function must be paired with mmu_interval_notifier_insert(). It cannot
>> - * be called from any ops callback.
>> + * Return: -EINVAL if @mm hasn't been initialized for interval notifiers
>> + *	by calling mmu_notifier_register(NULL, mm) or
>> + *	__mmu_notifier_register(NULL, mm).
>>    *
>> - * Once this returns ops callbacks are no longer running on other CPUs and
>> - * will not be called in future.
>> + * This function subscribes the interval notifier for notifications from the
>> + * mm.  Upon return, the ops related to mmu_interval_notifier will be called
>> + * whenever an event that intersects with the given range occurs.
>> + *
>> + * This function is safe to call from the ops->invalidate() function.
>> + * Upon return, the mmu_interval_notifier may not be present in the interval
>> + * tree yet.  The caller must use the normal interval notifier read flow via
>> + * mmu_interval_read_begin() to establish SPTEs for this range.
> 
> So why do we need this? You can't call hmm_range_fault from a
> notifier. You just can't.
> 
> So there should be no reason to create an interval from the notifier,
> do it from where you call hmm_range_fault, and it must be safe to
> obtain the mmap_sem from that thread.

I was thinking of the case where munmap() creates a hole in the interval.
The invalidate callback would need to update the interval to cover the
left side of the remaining interval and an insert to cover the right
side. Otherwise, the HW invalidation has to be extended to cover the
right side and rely on a fault to re-establish the right side interval.
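
Roughly, with the insert_safe/update calls from this patch (a sketch only;
dmirror_* names are placeholders and error handling is omitted):

static bool dmirror_interval_invalidate(struct mmu_interval_notifier *mni,
					const struct mmu_notifier_range *range,
					unsigned long cur_seq)
{
	unsigned long start = mni->interval_tree.start;
	unsigned long last = mni->interval_tree.last;

	/* ... invalidate the device's shadow PTEs for the range here ... */

	if (range->event == MMU_NOTIFY_UNMAP &&
	    range->start > start && range->end <= last) {
		/* Shrink this notifier to cover only the left side... */
		mmu_interval_notifier_update(mni, start, range->start - start);
		/*
		 * ...and register a new notifier for the right side, e.g. a
		 * helper that allocates a dmirror_interval and calls
		 * mmu_interval_notifier_insert_safe() for [range->end, last].
		 * Both changes are deferred until mn_itree_inv_end().
		 */
		dmirror_new_interval(range->end, last + 1 - range->end);
	}
	mmu_interval_set_seq(mni, cur_seq);
	return true;
}

An unmap overlapping either end of the range only needs the single update()
call; the insert is only for the interior-hole case above.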

>> +/**
>> + * mmu_interval_notifier_update - Update interval notifier end
>> + * @mni: Interval notifier to update
>> + * @start: New starting virtual address to monitor
>> + * @length: New length of the range to monitor
>> + *
>> + * This function updates the range being monitored.
>> + * If there is no release() function defined, the call will wait for the
>> + * update to finish before returning.
>> + */
>> +int mmu_interval_notifier_update(struct mmu_interval_notifier *mni,
>> +				 unsigned long start, unsigned long length)
>> +{
>> +	struct mm_struct *mm = mni->mm;
>> +	struct mmu_notifier_mm *mmn_mm = mm->mmu_notifier_mm;
>> +	unsigned long seq = 0;
>> +	unsigned long last;
>> +
>> +	if (length == 0 || check_add_overflow(start, length - 1, &last))
>> +		return -EOVERFLOW;
>> +
>> +	spin_lock(&mmn_mm->lock);
>> +	if (mn_itree_is_invalidating(mmn_mm)) {
>> +		/*
>> +		 * Update is being called after insert put this on the
>> +		 * deferred list, but before the deferred list was processed.
>> +		 */
>> +		if (RB_EMPTY_NODE(&mni->interval_tree.rb)) {
>> +			mni->interval_tree.start = start;
>> +			mni->interval_tree.last = last;
>> +		} else {
>> +			if (!mni->deferred_last)
>> +				hlist_add_head(&mni->deferred_item,
>> +					       &mmn_mm->deferred_list);
>> +			mni->deferred_start = start;
>> +			mni->deferred_last = last;
>> +		}
>> +		seq = mmn_mm->invalidate_seq;
>> +	} else {
>> +		WARN_ON(RB_EMPTY_NODE(&mni->interval_tree.rb));
>> +		interval_tree_remove(&mni->interval_tree, &mmn_mm->itree);
>> +		mni->interval_tree.start = start;
>> +		mni->interval_tree.last = last;
>> +		interval_tree_insert(&mni->interval_tree, &mmn_mm->itree);
>> +	}
>> +	spin_unlock(&mmn_mm->lock);
>> +
>> +	if (!mni->ops->release && seq) {
>> +		/*
>> +		 * The possible sleep on progress in the invalidation requires
>> +		 * the caller not hold any locks held by invalidation
>> +		 * callbacks.
>> +		 */
>> +		lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
>> +		lock_map_release(&__mmu_notifier_invalidate_range_start_map);
>>   		wait_event(mmn_mm->wq,
>>   			   READ_ONCE(mmn_mm->invalidate_seq) != seq);
>> +	}
>>   
>> -	/* pairs with mmgrab in mmu_interval_notifier_insert() */
>> -	mmdrop(mm);
>> +	return 0;
>>   }
>> -EXPORT_SYMBOL_GPL(mmu_interval_notifier_remove);
>> +EXPORT_SYMBOL_GPL(mmu_interval_notifier_update);
> 
> An 'update' should probably be the same as insert: it doesn't
> necessarily take effect until mmu_interval_read_begin(), so nothing
> should be contingent on release.

Agreed. I was trying to keep mmu_interval_notifier_remove() unchanged
and use whether ops->release() was defined to indicate if it should be
synchronous or asynchronous.
The test for release in the update case was intended to be similar.

Now the plan for v6 is to leave mmu_interval_notifier_remove() unchanged,
add mmu_interval_notifier_put() for async/safe removal and make 'update'
be asynchronous only and, as you say, rely on mmu_interval_read_begin()
to be sure all delayed add/remove/updates are complete.
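
The fault side then follows the usual read_begin/read_retry pattern, which
is what flushes the deferred changes (sketch; struct dmirror with its mutex
and dmirror_populate() are placeholders):

static int dmirror_fault(struct dmirror *dmirror,
			 struct mmu_interval_notifier *mni,
			 unsigned long addr)
{
	unsigned long seq;
	int ret;

again:
	/*
	 * Waits out any active invalidation, by which point the deferred
	 * insert/remove/update list has been applied to the interval tree.
	 */
	seq = mmu_interval_read_begin(mni);

	/* fault in the CPU pages, e.g. hmm_range_fault() under mmap_sem */
	ret = dmirror_populate(dmirror, addr);
	if (ret)
		return ret;

	mutex_lock(&dmirror->mutex);		/* the driver lock */
	if (mmu_interval_read_retry(mni, seq)) {
		mutex_unlock(&dmirror->mutex);
		goto again;
	}
	/* program the device page tables while holding the driver lock */
	mutex_unlock(&dmirror->mutex);
	return 0;
}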

I'm also planning to add a mmu_interval_notifier_find() so that nouveau
and the self tests don't need to create a duplicate interval range tree
to track the intervals that they have registered. There isn't an existing
structure that the struct mmu_interval_notifier can just be added to so
it ends up being a separately allocated structure and would need to be
stored in some sort of table so I thought why not just use the itree.
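
For example, the unmap path could then be just (hypothetical: neither
mmu_interval_notifier_find() nor its signature exist yet, and the dmirror
fields are placeholders):

static void dmirror_do_unmap(struct dmirror *dmirror, unsigned long start,
			     unsigned long last)
{
	struct mmu_interval_notifier *mni;

	mutex_lock(&dmirror->mutex);
	mni = mmu_interval_notifier_find(dmirror->mm, start, last);
	if (mni)
		mmu_interval_notifier_put(mni);	/* async remove + release() */
	mutex_unlock(&dmirror->mutex);
}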

So I guess the patch series will be insert_safe, put+release, update,
find, nouveau, and tests.

> As before, I'm not sure what to do with this. We need an in-kernel
> user for new apis, and I don't see a reason to make this more
> complicated for a test program.

I agree.

> The test program should match one of the existing driver flows, so use
> the page table like scheme from ODP or the fixed lifetime scheme from
> AMDGPU/ODP
> 
> Jason

This is all useful feedback. I am working on v6 which addresses your concerns
and updates nouveau to use the new API. I'm somewhat sidetracked by the lockdep
issue I posted about nouveau calling kmalloc(GFP_KERNEL) from the invalidation
callback so it may take me awhile to sort that out.
Since we are at -rc5, I'm guessing this won't have enough soak time to make 5.6.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v5 1/2] mm/mmu_notifier: make interval notifier updates safe
  2020-01-09 22:01     ` Ralph Campbell
@ 2020-01-09 23:25       ` Jason Gunthorpe
  2020-01-13 22:44         ` Ralph Campbell
  0 siblings, 1 reply; 12+ messages in thread
From: Jason Gunthorpe @ 2020-01-09 23:25 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: linux-rdma, linux-mm, linux-kernel, linux-kselftest,
	Jerome Glisse, John Hubbard, Christoph Hellwig, Andrew Morton,
	Shuah Khan

On Thu, Jan 09, 2020 at 02:01:21PM -0800, Ralph Campbell wrote:

> > I'd write it more like
> > 
> > if (mni->updated_start == mni->updated_end)
> >      insert
> > else
> >      remove
> 
> OK, but I'm using updated_end == 0, not updated_start, and the end can't be zero.

Tricky..
 
> > ie an empty interval can't get a notification so it should be removed
> > from the tree.
> > 
> > I also like the name 'updated' better than deferred, it is a bit
> > clearer..
> 
> OK.
> 
> > Adding release should be its own patch.
> 
> The release callback is associated with mmu_interval_notifier_put()
> (i.e., async remove). Otherwise, there is no way to know when the
> interval can be freed.

Okay, but this patch is just trying to add update?

> > So why do we need this? You can't call hmm_range_fault from a
> > notifier. You just can't.
> > 
> > So there should be no reason to create an interval from the notifier,
> > do it from where you call hmm_range_fault, and it must be safe to
> > obtain the mmap_sem from that thread.
> 
> I was thinking of the case where munmap() creates a hole in the interval.
> The invalidate callback would need to update the interval to cover the
> left side of the remaining interval and an insert to cover the right
> side. Otherwise, the HW invalidation has to be extended to cover the
> right side and rely on a fault to re-establish the right side interval.

This is very tricky because this algorithm can only work correctly if
done atomically as a batch entirely under the spinlock. Forcing it
> into the deferred list while holding the lock is the only way to do
something like that sensibly..

So 'update' is not some generic API you can call, it can only be done
while the interval tree is locked for reading. Thus 'safe' is probably
the wrong name, it is actually 'interval tree locked for read'

At the minimum this needs to be comprehensively documented and we need
a lockdep style assertion that we are locked when doing it..

And if we are defining things like that then it might as well be
expressed as a remove/insert/insert batch operation rather than
a somewhat confusing update.
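
Spelled out, something along these lines (purely an interface sketch,
nothing like it exists):

/*
 * Shrink @mni so it ends just before @hole_start and register @new_mni for
 * the remainder of the old range starting at @hole_end.  Both changes are
 * queued and applied together in mn_itree_inv_end().  May only be called
 * while an invalidation is in progress, i.e. while the interval tree is
 * "locked for read"; the implementation would assert that, lockdep-style.
 */
int mmu_interval_notifier_split(struct mmu_interval_notifier *mni,
				struct mmu_interval_notifier *new_mni,
				unsigned long hole_start,
				unsigned long hole_end);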

> Now the plan for v6 is to leave mmu_interval_notifier_remove() unchanged,
> add mmu_interval_notifier_put() for async/safe removal and make 'update'
> be asynchronous only and, as you say, rely on mmu_interval_read_begin()
> to be sure all delayed add/remove/updates are complete.

Hm, we can see what injecting reference counts would look like.

> I'm also planning to add a mmu_interval_notifier_find() so that nouveau
> and the self tests don't need to create a duplicate interval range tree
> to track the intervals that they have registered. There isn't an existing
> structure that the struct mmu_interval_notifier can just be added to so
> it ends up being a separately allocated structure and would need to be
> stored in some sort of table so I thought why not just use the itree.

Okay, but for locking reasons find is also a little tricky. I suppose
find can obtain the read side lock on the interval tree and then the
caller would have to find_unlock once it has refcounted or finished
accessing the object. Much like how the invalidate callback is locked.

> This is all useful feedback. I am working on v6 which addresses your concerns
> and updates nouveau to use the new API. I'm somewhat sidetracked by the lockdep
> issue I posted about nouveau calling kmalloc(GFP_KERNEL) from the invalidation
> callback so it may take me awhile to sort that out.
> Since we are at -rc5, I'm guessing this won't have enough soak time to make 5.6.

Yes, sorry for the delay, lots of travel and a mountain of emails. I
am almost caught up now. But you can post it at least.

Jason

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v5 1/2] mm/mmu_notifier: make interval notifier updates safe
  2020-01-09 23:25       ` Jason Gunthorpe
@ 2020-01-13 22:44         ` Ralph Campbell
  2020-01-14 12:45           ` Jason Gunthorpe
  0 siblings, 1 reply; 12+ messages in thread
From: Ralph Campbell @ 2020-01-13 22:44 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma, linux-mm, linux-kernel, linux-kselftest,
	Jerome Glisse, John Hubbard, Christoph Hellwig, Andrew Morton,
	Shuah Khan


On 1/9/20 3:25 PM, Jason Gunthorpe wrote:
> On Thu, Jan 09, 2020 at 02:01:21PM -0800, Ralph Campbell wrote:
> 
>>> I'd write it more like
>>>
>>> if (mni->updated_start == mni->updated_end)
>>>       insert
>>> else
>>>       remove
>>
>> OK, but I'm using updated_end == 0, not updated_start, and the end can't be zero.
> 
> Tricky..
>   
>>> ie an empty interval can't get a notification so it should be removed
>>> from the tree.
>>>
>>> I also like the name 'updated' better than deferred, it is a bit
>>> clearer..
>>
>> OK.
>>
>>> Adding release should be its own patch.
>>
>> The release callback is associated with mmu_interval_notifier_put()
>> (i.e., async remove). Otherwise, there is no way to know when the
>> interval can be freed.
> 
> Okay, but this patch is just trying to add update?
> 
>>> So why do we need this? You can't call hmm_range_fault from a
>>> notifier. You just can't.
>>>
>>> So there should be no reason to create an interval from the notifier,
>>> do it from where you call hmm_range_fault, and it must be safe to
>>> obtain the mmap_sem from that thread.
>>
>> I was thinking of the case where munmap() creates a hole in the interval.
>> The invalidate callback would need to update the interval to cover the
>> left side of the remaining interval and an insert to cover the right
>> side. Otherwise, the HW invalidation has to be extended to cover the
>> right side and rely on a fault to re-establish the right side interval.
> 
> This is very tricky because this algorithm can only work correctly if
> done atomically as a batch entirely under the spinlock. Forcing it
> into the deferred list while holding the lock is the only way to do
> something like that sensibly..
> 
> So 'update' is not some generic API you can call, it can only be done
> while the interval tree is locked for reading. Thus 'safe' is probably
> the wrong name, it is actually 'interval tree locked for read'
> 
> At the minimum this needs to be comprehensively documented and we need
> a lockdep style assertion that we are locked when doing it..
> 
> And if we are defining things like that then it might as well be
> expressed as a remove/insert/insert batch operation rather than
> a somewhat confusing update.
> 
>> Now the plan for v6 is to leave mmu_interval_notifier_remove() unchanged,
>> add mmu_interval_notifier_put() for async/safe removal and make 'update'
>> be asynchronous only and, as you say, rely on mmu_interval_read_begin()
>> to be sure all delayed add/remove/updates are complete.
> 
> Hm, we can see what injecting reference counts would look like.
> 
>> I'm also planning to add a mmu_interval_notifier_find() so that nouveau
>> and the self tests don't need to create a duplicate interval range tree
>> to track the intervals that they have registered. There isn't an existing
>> structure that the struct mmu_interval_notifier can just be added to so
>> it ends up being a separately allocated structure and would need to be
>> stored in some sort of table so I thought why not just use the itree.
> 
> Okay, but for locking reasons find is also a little tricky. I suppose
> find can obtain the read side lock on the interval tree and then the
> caller would have to find_unlock once it has refcounted or finished
> accessing the object. Much like how the invalidate callback is locked.
> 
>> This is all useful feedback. I am working on v6 which addresses your concerns
>> and updates nouveau to use the new API. I'm somewhat sidetracked by the lockdep
>> issue I posted about nouveau calling kmalloc(GFP_KERNEL) from the invalidation
>> callback so it may take me awhile to sort that out.
>> Since we are at -rc5, I'm guessing this won't have enough soak time to make 5.6.
> 
> Yes, sorry for the delay, lots of travel and a mountain of emails. I
> am almost caught up now. But you can post it at least.
> 
> Jason
> 
I'm using the device driver lock to serialize find/insert/update/remove
changes to the interval tree. The important thing is to have a registered
interval covering any shadow PTEs in the hardware, and the driver lock that
protects the HW updates and sequence number also protects updates to the
registered intervals.
Hopefully, this will be easier to understand with v6 which I'm posting
for review.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v5 1/2] mm/mmu_notifier: make interval notifier updates safe
  2020-01-13 22:44         ` Ralph Campbell
@ 2020-01-14 12:45           ` Jason Gunthorpe
  2020-01-15 22:04             ` Ralph Campbell
  0 siblings, 1 reply; 12+ messages in thread
From: Jason Gunthorpe @ 2020-01-14 12:45 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: linux-rdma, linux-mm, linux-kernel, linux-kselftest,
	Jerome Glisse, John Hubbard, Christoph Hellwig, Andrew Morton,
	Shuah Khan

On Mon, Jan 13, 2020 at 02:44:52PM -0800, Ralph Campbell wrote:

> I'm using the device driver lock to serialize find/insert/update/remove
> changes to the interval tree.

The device driver lock can't really fully serialize this as it doesn't
affect how the intersection lookup runs.

Jason

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v5 1/2] mm/mmu_notifier: make interval notifier updates safe
  2020-01-14 12:45           ` Jason Gunthorpe
@ 2020-01-15 22:04             ` Ralph Campbell
  2020-01-16 14:13               ` Jason Gunthorpe
  0 siblings, 1 reply; 12+ messages in thread
From: Ralph Campbell @ 2020-01-15 22:04 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma, linux-mm, linux-kernel, linux-kselftest,
	Jerome Glisse, John Hubbard, Christoph Hellwig, Andrew Morton,
	Shuah Khan


On 1/14/20 4:45 AM, Jason Gunthorpe wrote:
> On Mon, Jan 13, 2020 at 02:44:52PM -0800, Ralph Campbell wrote:
> 
>> I'm using the device driver lock to serialize find/insert/update/remove
>> changes to the interval tree.
> 
> The device driver lock can't really fully serialize this as it doesn't
> affect how the intersection lookup runs.
> 
> Jason
> 

Single updates to the interval notifier are atomic due to the
struct mmu_notifier_mm spinlock so the issue is with "punching a hole"
in the interval. In that case, the existing interval is updated to cover
one side of the hole and a new interval is inserted for the other side
while holding the driver lock.

Since this sequence is done from the invalidate() callback, those two
operations will be deferred until the callback returns and any other
parallel invalidates complete (which would be serialized on
the driver lock). So none of these changes will be visible to the
interval tree walks until the last invalidate task calls
mn_itree_inv_end() and all the deferred changes are applied atomically
while holding the spinlock. I'll make sure to add comments explaining
this.

But I see your point if this sequence is done outside of the invalidate
callback. In that case, if the driver shrank the interval, an invalidation
hitting the right hand side could be missed in the window before the new
interval covering that side is inserted.
I'll explain this in the comments for nouveau_svmm_do_unmap() and
dmirror_do_unmap().

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v5 1/2] mm/mmu_notifier: make interval notifier updates safe
  2020-01-15 22:04             ` Ralph Campbell
@ 2020-01-16 14:13               ` Jason Gunthorpe
  0 siblings, 0 replies; 12+ messages in thread
From: Jason Gunthorpe @ 2020-01-16 14:13 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: linux-rdma, linux-mm, linux-kernel, linux-kselftest,
	Jerome Glisse, John Hubbard, Christoph Hellwig, Andrew Morton,
	Shuah Khan

On Wed, Jan 15, 2020 at 02:04:47PM -0800, Ralph Campbell wrote:

> But I see your point if this sequence is done outside of the invalidate
> callback. In that case, if the driver shrank the interval, an invalidate
> callback for the right hand side could be missed before the insertion of
> the new interval for the right hand side.
> I'll explain this in the comments for nouveau_svmm_do_unmap() and
> dmirror_do_unmap().

Yes, this is why I'm not sure this is a good API for the core to
expose.

Batch manipulation is a reasonable thing, but it should be forced to
work under safe conditions, i.e. while holding the required 'inv_begin'
on the interval tree, and the batching APIs should assert this
requirement.

Jason

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2020-01-16 14:13 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-12-16 19:57 [PATCH v5 0/2] mm/hmm/test: add self tests for HMM Ralph Campbell
2019-12-16 19:57 ` [PATCH v5 1/2] mm/mmu_notifier: make interval notifier updates safe Ralph Campbell
2019-12-17 20:51   ` Jason Gunthorpe
2019-12-17 21:50     ` Ralph Campbell
2020-01-09 19:48   ` Jason Gunthorpe
2020-01-09 22:01     ` Ralph Campbell
2020-01-09 23:25       ` Jason Gunthorpe
2020-01-13 22:44         ` Ralph Campbell
2020-01-14 12:45           ` Jason Gunthorpe
2020-01-15 22:04             ` Ralph Campbell
2020-01-16 14:13               ` Jason Gunthorpe
2019-12-16 19:57 ` [PATCH v5 2/2] mm/hmm/test: add self tests for HMM Ralph Campbell

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).