Linux-RDMA Archive on lore.kernel.org
 help / color / Atom feed
* [PATCH v2 00/15] Consolidate the mmu notifier interval_tree and locking
@ 2019-10-28 20:10 Jason Gunthorpe
  2019-10-28 20:10 ` [PATCH v2 01/15] mm/mmu_notifier: define the header pre-processor parts even if disabled Jason Gunthorpe
                   ` (16 more replies)
  0 siblings, 17 replies; 71+ messages in thread
From: Jason Gunthorpe @ 2019-10-28 20:10 UTC (permalink / raw)
  To: linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard, Felix.Kuehling
  Cc: linux-rdma, dri-devel, amd-gfx, Alex Deucher, Ben Skeggs,
	Boris Ostrovsky, Christian König, David Zhou,
	Dennis Dalessandro, Juergen Gross, Mike Marciniszyn,
	Oleksandr Andrushchenko, Petr Cvek, Stefano Stabellini, nouveau,
	xen-devel, Christoph Hellwig, Jason Gunthorpe

From: Jason Gunthorpe <jgg@mellanox.com>

8 of the mmu_notifier using drivers (i915_gem, radeon_mn, umem_odp, hfi1,
scif_dma, vhost, gntdev, hmm) drivers are using a common pattern where
they only use invalidate_range_start/end and immediately check the
invalidating range against some driver data structure to tell if the
driver is interested. Half of them use an interval_tree, the others are
simple linear search lists.

Of the ones I checked they largely seem to have various kinds of races,
bugs and poor implementation. This is a result of the complexity in how
the notifier interacts with get_user_pages(). It is extremely difficult to
use it correctly.

Consolidate all of this code together into the core mmu_notifier and
provide a locking scheme similar to hmm_mirror that allows the user to
safely use get_user_pages() and reliably know if the page list still
matches the mm.

This new arrangment plays nicely with the !blockable mode for
OOM. Scanning the interval tree is done such that the intersection test
will always succeed, and since there is no invalidate_range_end exposed to
drivers the scheme safely allows multiple drivers to be subscribed.

Four places are converted as an example of how the new API is used.
Four are left for future patches:
 - i915_gem has complex locking around destruction of a registration,
   needs more study
 - hfi1 (2nd user) needs access to the rbtree
 - scif_dma has a complicated logic flow
 - vhost's mmu notifiers are already being rewritten

This series, and the other code it depends on is available on my github:

https://github.com/jgunthorpe/linux/commits/mmu_notifier

v2 changes:
- Add mmu_range_set_seq() to set the mrn sequence number under the driver
  lock and make the locking more understandable
- Add some additional comments around locking/READ_ONCe
- Make the WARN_ON flow in mn_itree_invalidate a bit easier to follow
- Fix wrong WARN_ON

Jason Gunthorpe (15):
  mm/mmu_notifier: define the header pre-processor parts even if
    disabled
  mm/mmu_notifier: add an interval tree notifier
  mm/hmm: allow hmm_range to be used with a mmu_range_notifier or
    hmm_mirror
  mm/hmm: define the pre-processor related parts of hmm.h even if
    disabled
  RDMA/odp: Use mmu_range_notifier_insert()
  RDMA/hfi1: Use mmu_range_notifier_inset for user_exp_rcv
  drm/radeon: use mmu_range_notifier_insert
  xen/gntdev: Use select for DMA_SHARED_BUFFER
  xen/gntdev: use mmu_range_notifier_insert
  nouveau: use mmu_notifier directly for invalidate_range_start
  nouveau: use mmu_range_notifier instead of hmm_mirror
  drm/amdgpu: Call find_vma under mmap_sem
  drm/amdgpu: Use mmu_range_insert instead of hmm_mirror
  drm/amdgpu: Use mmu_range_notifier instead of hmm_mirror
  mm/hmm: remove hmm_mirror and related

 Documentation/vm/hmm.rst                      | 105 +---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h           |   2 +
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   9 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        |  14 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    |   1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c        | 457 +++------------
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h        |  53 --
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h    |  13 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       | 111 ++--
 drivers/gpu/drm/nouveau/nouveau_svm.c         | 231 +++++---
 drivers/gpu/drm/radeon/radeon.h               |   9 +-
 drivers/gpu/drm/radeon/radeon_mn.c            | 219 ++-----
 drivers/infiniband/core/device.c              |   1 -
 drivers/infiniband/core/umem_odp.c            | 288 +--------
 drivers/infiniband/hw/hfi1/file_ops.c         |   2 +-
 drivers/infiniband/hw/hfi1/hfi.h              |   2 +-
 drivers/infiniband/hw/hfi1/user_exp_rcv.c     | 146 ++---
 drivers/infiniband/hw/hfi1/user_exp_rcv.h     |   3 +-
 drivers/infiniband/hw/mlx5/mlx5_ib.h          |   7 +-
 drivers/infiniband/hw/mlx5/mr.c               |   3 +-
 drivers/infiniband/hw/mlx5/odp.c              |  50 +-
 drivers/xen/Kconfig                           |   3 +-
 drivers/xen/gntdev-common.h                   |   8 +-
 drivers/xen/gntdev.c                          | 180 ++----
 include/linux/hmm.h                           | 195 +------
 include/linux/mmu_notifier.h                  | 144 ++++-
 include/rdma/ib_umem_odp.h                    |  65 +--
 include/rdma/ib_verbs.h                       |   2 -
 kernel/fork.c                                 |   1 -
 mm/Kconfig                                    |   2 +-
 mm/hmm.c                                      | 275 +--------
 mm/mmu_notifier.c                             | 546 +++++++++++++++++-
 32 files changed, 1225 insertions(+), 1922 deletions(-)

-- 
2.23.0


^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH v2 01/15] mm/mmu_notifier: define the header pre-processor parts even if disabled
  2019-10-28 20:10 [PATCH v2 00/15] Consolidate the mmu notifier interval_tree and locking Jason Gunthorpe
@ 2019-10-28 20:10 ` Jason Gunthorpe
  2019-11-05 21:23   ` John Hubbard
  2019-10-28 20:10 ` [PATCH v2 02/15] mm/mmu_notifier: add an interval tree notifier Jason Gunthorpe
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 71+ messages in thread
From: Jason Gunthorpe @ 2019-10-28 20:10 UTC (permalink / raw)
  To: linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard, Felix.Kuehling
  Cc: linux-rdma, dri-devel, amd-gfx, Alex Deucher, Ben Skeggs,
	Boris Ostrovsky, Christian König, David Zhou,
	Dennis Dalessandro, Juergen Gross, Mike Marciniszyn,
	Oleksandr Andrushchenko, Petr Cvek, Stefano Stabellini, nouveau,
	xen-devel, Christoph Hellwig, Jason Gunthorpe

From: Jason Gunthorpe <jgg@mellanox.com>

Now that we have KERNEL_HEADER_TEST all headers are generally compile
tested, so relying on makefile tricks to avoid compiling code that depends
on CONFIG_MMU_NOTIFIER is more annoying.

Instead follow the usual pattern and provide most of the header with only
the functions stubbed out when CONFIG_MMU_NOTIFIER is disabled. This
ensures code compiles no matter what the config setting is.

While here, struct mmu_notifier_mm is private to mmu_notifier.c, move it.

Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 include/linux/mmu_notifier.h | 46 +++++++++++++-----------------------
 mm/mmu_notifier.c            | 13 ++++++++++
 2 files changed, 30 insertions(+), 29 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 1bd8e6a09a3c27..12bd603d318ce7 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -7,8 +7,9 @@
 #include <linux/mm_types.h>
 #include <linux/srcu.h>
 
+struct mmu_notifier_mm;
 struct mmu_notifier;
-struct mmu_notifier_ops;
+struct mmu_notifier_range;
 
 /**
  * enum mmu_notifier_event - reason for the mmu notifier callback
@@ -40,36 +41,8 @@ enum mmu_notifier_event {
 	MMU_NOTIFY_SOFT_DIRTY,
 };
 
-#ifdef CONFIG_MMU_NOTIFIER
-
-#ifdef CONFIG_LOCKDEP
-extern struct lockdep_map __mmu_notifier_invalidate_range_start_map;
-#endif
-
-/*
- * The mmu notifier_mm structure is allocated and installed in
- * mm->mmu_notifier_mm inside the mm_take_all_locks() protected
- * critical section and it's released only when mm_count reaches zero
- * in mmdrop().
- */
-struct mmu_notifier_mm {
-	/* all mmu notifiers registerd in this mm are queued in this list */
-	struct hlist_head list;
-	/* to serialize the list modifications and hlist_unhashed */
-	spinlock_t lock;
-};
-
 #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
 
-struct mmu_notifier_range {
-	struct vm_area_struct *vma;
-	struct mm_struct *mm;
-	unsigned long start;
-	unsigned long end;
-	unsigned flags;
-	enum mmu_notifier_event event;
-};
-
 struct mmu_notifier_ops {
 	/*
 	 * Called either by mmu_notifier_unregister or when the mm is
@@ -249,6 +222,21 @@ struct mmu_notifier {
 	unsigned int users;
 };
 
+#ifdef CONFIG_MMU_NOTIFIER
+
+#ifdef CONFIG_LOCKDEP
+extern struct lockdep_map __mmu_notifier_invalidate_range_start_map;
+#endif
+
+struct mmu_notifier_range {
+	struct vm_area_struct *vma;
+	struct mm_struct *mm;
+	unsigned long start;
+	unsigned long end;
+	unsigned flags;
+	enum mmu_notifier_event event;
+};
+
 static inline int mm_has_notifiers(struct mm_struct *mm)
 {
 	return unlikely(mm->mmu_notifier_mm);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 7fde88695f35d6..367670cfd02b7b 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -27,6 +27,19 @@ struct lockdep_map __mmu_notifier_invalidate_range_start_map = {
 };
 #endif
 
+/*
+ * The mmu notifier_mm structure is allocated and installed in
+ * mm->mmu_notifier_mm inside the mm_take_all_locks() protected
+ * critical section and it's released only when mm_count reaches zero
+ * in mmdrop().
+ */
+struct mmu_notifier_mm {
+	/* all mmu notifiers registered in this mm are queued in this list */
+	struct hlist_head list;
+	/* to serialize the list modifications and hlist_unhashed */
+	spinlock_t lock;
+};
+
 /*
  * This function can't run concurrently against mmu_notifier_register
  * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap
-- 
2.23.0


^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH v2 02/15] mm/mmu_notifier: add an interval tree notifier
  2019-10-28 20:10 [PATCH v2 00/15] Consolidate the mmu notifier interval_tree and locking Jason Gunthorpe
  2019-10-28 20:10 ` [PATCH v2 01/15] mm/mmu_notifier: define the header pre-processor parts even if disabled Jason Gunthorpe
@ 2019-10-28 20:10 ` Jason Gunthorpe
  2019-10-29 22:04   ` Kuehling, Felix
  2019-11-07  0:23   ` John Hubbard
  2019-10-28 20:10 ` [PATCH v2 03/15] mm/hmm: allow hmm_range to be used with a mmu_range_notifier or hmm_mirror Jason Gunthorpe
                   ` (14 subsequent siblings)
  16 siblings, 2 replies; 71+ messages in thread
From: Jason Gunthorpe @ 2019-10-28 20:10 UTC (permalink / raw)
  To: linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard, Felix.Kuehling
  Cc: linux-rdma, dri-devel, amd-gfx, Alex Deucher, Ben Skeggs,
	Boris Ostrovsky, Christian König, David Zhou,
	Dennis Dalessandro, Juergen Gross, Mike Marciniszyn,
	Oleksandr Andrushchenko, Petr Cvek, Stefano Stabellini, nouveau,
	xen-devel, Christoph Hellwig, Jason Gunthorpe, Andrea Arcangeli,
	Michal Hocko

From: Jason Gunthorpe <jgg@mellanox.com>

Of the 13 users of mmu_notifiers, 8 of them use only
invalidate_range_start/end() and immediately intersect the
mmu_notifier_range with some kind of internal list of VAs.  4 use an
interval tree (i915_gem, radeon_mn, umem_odp, hfi1). 4 use a linked list
of some kind (scif_dma, vhost, gntdev, hmm)

And the remaining 5 either don't use invalidate_range_start() or do some
special thing with it.

It turns out that building a correct scheme with an interval tree is
pretty complicated, particularly if the use case is synchronizing against
another thread doing get_user_pages().  Many of these implementations have
various subtle and difficult to fix races.

This approach puts the interval tree as common code at the top of the mmu
notifier call tree and implements a shareable locking scheme.

It includes:
 - An interval tree tracking VA ranges, with per-range callbacks
 - A read/write locking scheme for the interval tree that avoids
   sleeping in the notifier path (for OOM killer)
 - A sequence counter based collision-retry locking scheme to tell
   device page fault that a VA range is being concurrently invalidated.

This is based on various ideas:
- hmm accumulates invalidated VA ranges and releases them when all
  invalidates are done, via active_invalidate_ranges count.
  This approach avoids having to intersect the interval tree twice (as
  umem_odp does) at the potential cost of a longer device page fault.

- kvm/umem_odp use a sequence counter to drive the collision retry,
  via invalidate_seq

- a deferred work todo list on unlock scheme like RTNL, via deferred_list.
  This makes adding/removing interval tree members more deterministic

- seqlock, except this version makes the seqlock idea multi-holder on the
  write side by protecting it with active_invalidate_ranges and a spinlock

To minimize MM overhead when only the interval tree is being used, the
entire SRCU and hlist overheads are dropped using some simple
branches. Similarly the interval tree overhead is dropped when in hlist
mode.

The overhead from the mandatory spinlock is broadly the same as most of
existing users which already had a lock (or two) of some sort on the
invalidation path.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 include/linux/mmu_notifier.h |  98 +++++++
 mm/Kconfig                   |   1 +
 mm/mmu_notifier.c            | 533 +++++++++++++++++++++++++++++++++--
 3 files changed, 607 insertions(+), 25 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 12bd603d318ce7..51b92ba013ddce 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -6,10 +6,12 @@
 #include <linux/spinlock.h>
 #include <linux/mm_types.h>
 #include <linux/srcu.h>
+#include <linux/interval_tree.h>
 
 struct mmu_notifier_mm;
 struct mmu_notifier;
 struct mmu_notifier_range;
+struct mmu_range_notifier;
 
 /**
  * enum mmu_notifier_event - reason for the mmu notifier callback
@@ -32,6 +34,9 @@ struct mmu_notifier_range;
  * access flags). User should soft dirty the page in the end callback to make
  * sure that anyone relying on soft dirtyness catch pages that might be written
  * through non CPU mappings.
+ *
+ * @MMU_NOTIFY_RELEASE: used during mmu_range_notifier invalidate to signal that
+ * the mm refcount is zero and the range is no longer accessible.
  */
 enum mmu_notifier_event {
 	MMU_NOTIFY_UNMAP = 0,
@@ -39,6 +44,7 @@ enum mmu_notifier_event {
 	MMU_NOTIFY_PROTECTION_VMA,
 	MMU_NOTIFY_PROTECTION_PAGE,
 	MMU_NOTIFY_SOFT_DIRTY,
+	MMU_NOTIFY_RELEASE,
 };
 
 #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
@@ -222,6 +228,26 @@ struct mmu_notifier {
 	unsigned int users;
 };
 
+/**
+ * struct mmu_range_notifier_ops
+ * @invalidate: Upon return the caller must stop using any SPTEs within this
+ *              range, this function can sleep. Return false if blocking was
+ *              required but range is non-blocking
+ */
+struct mmu_range_notifier_ops {
+	bool (*invalidate)(struct mmu_range_notifier *mrn,
+			   const struct mmu_notifier_range *range,
+			   unsigned long cur_seq);
+};
+
+struct mmu_range_notifier {
+	struct interval_tree_node interval_tree;
+	const struct mmu_range_notifier_ops *ops;
+	struct hlist_node deferred_item;
+	unsigned long invalidate_seq;
+	struct mm_struct *mm;
+};
+
 #ifdef CONFIG_MMU_NOTIFIER
 
 #ifdef CONFIG_LOCKDEP
@@ -263,6 +289,78 @@ extern int __mmu_notifier_register(struct mmu_notifier *mn,
 				   struct mm_struct *mm);
 extern void mmu_notifier_unregister(struct mmu_notifier *mn,
 				    struct mm_struct *mm);
+
+unsigned long mmu_range_read_begin(struct mmu_range_notifier *mrn);
+int mmu_range_notifier_insert(struct mmu_range_notifier *mrn,
+			      unsigned long start, unsigned long length,
+			      struct mm_struct *mm);
+int mmu_range_notifier_insert_locked(struct mmu_range_notifier *mrn,
+				     unsigned long start, unsigned long length,
+				     struct mm_struct *mm);
+void mmu_range_notifier_remove(struct mmu_range_notifier *mrn);
+
+/**
+ * mmu_range_set_seq - Save the invalidation sequence
+ * @mrn - The mrn passed to invalidate
+ * @cur_seq - The cur_seq passed to invalidate
+ *
+ * This must be called unconditionally from the invalidate callback of a
+ * struct mmu_range_notifier_ops under the same lock that is used to call
+ * mmu_range_read_retry(). It updates the sequence number for later use by
+ * mmu_range_read_retry().
+ *
+ * If the user does not call mmu_range_read_begin() or mmu_range_read_retry()
+ * then this call is not required.
+ */
+static inline void mmu_range_set_seq(struct mmu_range_notifier *mrn,
+				     unsigned long cur_seq)
+{
+	WRITE_ONCE(mrn->invalidate_seq, cur_seq);
+}
+
+/**
+ * mmu_range_read_retry - End a read side critical section against a VA range
+ * mrn: The range under lock
+ * seq: The return of the paired mmu_range_read_begin()
+ *
+ * This MUST be called under a user provided lock that is also held
+ * unconditionally by op->invalidate() when it calls mmu_range_set_seq().
+ *
+ * Each call should be paired with a single mmu_range_read_begin() and
+ * should be used to conclude the read side.
+ *
+ * Returns true if an invalidation collided with this critical section, and
+ * the caller should retry.
+ */
+static inline bool mmu_range_read_retry(struct mmu_range_notifier *mrn,
+					unsigned long seq)
+{
+	return mrn->invalidate_seq != seq;
+}
+
+/**
+ * mmu_range_check_retry - Test if a collision has occurred
+ * mrn: The range under lock
+ * seq: The return of the matching mmu_range_read_begin()
+ *
+ * This can be used in the critical section between mmu_range_read_begin() and
+ * mmu_range_read_retry().  A return of true indicates an invalidation has
+ * collided with this lock and a future mmu_range_read_retry() will return
+ * true.
+ *
+ * False is not reliable and only suggests a collision has not happened. It
+ * can be called many times and does not have to hold the user provided lock.
+ *
+ * This call can be used as part of loops and other expensive operations to
+ * expedite a retry.
+ */
+static inline bool mmu_range_check_retry(struct mmu_range_notifier *mrn,
+					 unsigned long seq)
+{
+	/* Pairs with the WRITE_ONCE in mmu_range_set_seq() */
+	return READ_ONCE(mrn->invalidate_seq) != seq;
+}
+
 extern void __mmu_notifier_mm_destroy(struct mm_struct *mm);
 extern void __mmu_notifier_release(struct mm_struct *mm);
 extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
diff --git a/mm/Kconfig b/mm/Kconfig
index a5dae9a7eb510a..d0b5046d9aeffd 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -284,6 +284,7 @@ config VIRT_TO_BUS
 config MMU_NOTIFIER
 	bool
 	select SRCU
+	select INTERVAL_TREE
 
 config KSM
 	bool "Enable KSM for page merging"
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 367670cfd02b7b..d02d3c8c223eb7 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -12,6 +12,7 @@
 #include <linux/export.h>
 #include <linux/mm.h>
 #include <linux/err.h>
+#include <linux/interval_tree.h>
 #include <linux/srcu.h>
 #include <linux/rcupdate.h>
 #include <linux/sched.h>
@@ -36,10 +37,243 @@ struct lockdep_map __mmu_notifier_invalidate_range_start_map = {
 struct mmu_notifier_mm {
 	/* all mmu notifiers registered in this mm are queued in this list */
 	struct hlist_head list;
+	bool has_interval;
 	/* to serialize the list modifications and hlist_unhashed */
 	spinlock_t lock;
+	unsigned long invalidate_seq;
+	unsigned long active_invalidate_ranges;
+	struct rb_root_cached itree;
+	wait_queue_head_t wq;
+	struct hlist_head deferred_list;
 };
 
+/*
+ * This is a collision-retry read-side/write-side 'lock', a lot like a
+ * seqcount, however this allows multiple write-sides to hold it at
+ * once. Conceptually the write side is protecting the values of the PTEs in
+ * this mm, such that PTES cannot be read into SPTEs while any writer exists.
+ *
+ * Note that the core mm creates nested invalidate_range_start()/end() regions
+ * within the same thread, and runs invalidate_range_start()/end() in parallel
+ * on multiple CPUs. This is designed to not reduce concurrency or block
+ * progress on the mm side.
+ *
+ * As a secondary function, holding the full write side also serves to prevent
+ * writers for the itree, this is an optimization to avoid extra locking
+ * during invalidate_range_start/end notifiers.
+ *
+ * The write side has two states, fully excluded:
+ *  - mm->active_invalidate_ranges != 0
+ *  - mnn->invalidate_seq & 1 == True
+ *  - some range on the mm_struct is being invalidated
+ *  - the itree is not allowed to change
+ *
+ * And partially excluded:
+ *  - mm->active_invalidate_ranges != 0
+ *  - some range on the mm_struct is being invalidated
+ *  - the itree is allowed to change
+ *
+ * The later state avoids some expensive work on inv_end in the common case of
+ * no mrn monitoring the VA.
+ */
+static bool mn_itree_is_invalidating(struct mmu_notifier_mm *mmn_mm)
+{
+	lockdep_assert_held(&mmn_mm->lock);
+	return mmn_mm->invalidate_seq & 1;
+}
+
+static struct mmu_range_notifier *
+mn_itree_inv_start_range(struct mmu_notifier_mm *mmn_mm,
+			 const struct mmu_notifier_range *range,
+			 unsigned long *seq)
+{
+	struct interval_tree_node *node;
+	struct mmu_range_notifier *res = NULL;
+
+	spin_lock(&mmn_mm->lock);
+	mmn_mm->active_invalidate_ranges++;
+	node = interval_tree_iter_first(&mmn_mm->itree, range->start,
+					range->end - 1);
+	if (node) {
+		mmn_mm->invalidate_seq |= 1;
+		res = container_of(node, struct mmu_range_notifier,
+				   interval_tree);
+	}
+
+	*seq = mmn_mm->invalidate_seq;
+	spin_unlock(&mmn_mm->lock);
+	return res;
+}
+
+static struct mmu_range_notifier *
+mn_itree_inv_next(struct mmu_range_notifier *mrn,
+		  const struct mmu_notifier_range *range)
+{
+	struct interval_tree_node *node;
+
+	node = interval_tree_iter_next(&mrn->interval_tree, range->start,
+				       range->end - 1);
+	if (!node)
+		return NULL;
+	return container_of(node, struct mmu_range_notifier, interval_tree);
+}
+
+static void mn_itree_inv_end(struct mmu_notifier_mm *mmn_mm)
+{
+	struct mmu_range_notifier *mrn;
+	struct hlist_node *next;
+	bool need_wake = false;
+
+	spin_lock(&mmn_mm->lock);
+	if (--mmn_mm->active_invalidate_ranges ||
+	    !mn_itree_is_invalidating(mmn_mm)) {
+		spin_unlock(&mmn_mm->lock);
+		return;
+	}
+
+	mmn_mm->invalidate_seq++;
+	need_wake = true;
+
+	/*
+	 * The inv_end incorporates a deferred mechanism like
+	 * rtnl_lock(). Adds and removes are queued until the final inv_end
+	 * happens then they are progressed. This arrangement for tree updates
+	 * is used to avoid using a blocking lock during
+	 * invalidate_range_start.
+	 */
+	hlist_for_each_entry_safe(mrn, next, &mmn_mm->deferred_list,
+				  deferred_item) {
+		if (RB_EMPTY_NODE(&mrn->interval_tree.rb))
+			interval_tree_insert(&mrn->interval_tree,
+					     &mmn_mm->itree);
+		else
+			interval_tree_remove(&mrn->interval_tree,
+					     &mmn_mm->itree);
+		hlist_del(&mrn->deferred_item);
+	}
+	spin_unlock(&mmn_mm->lock);
+
+	/*
+	 * TODO: Since we already have a spinlock above, this would be faster
+	 * as wake_up_q
+	 */
+	if (need_wake)
+		wake_up_all(&mmn_mm->wq);
+}
+
+/**
+ * mmu_range_read_begin - Begin a read side critical section against a VA range
+ * mrn: The range to lock
+ *
+ * mmu_range_read_begin()/mmu_range_read_retry() implement a collision-retry
+ * locking scheme similar to seqcount for the VA range under mrn. If the mm
+ * invokes invalidation during the critical section then
+ * mmu_range_read_retry() will return true.
+ *
+ * This is useful to obtain shadow PTEs where teardown or setup of the SPTEs
+ * require a blocking context.  The critical region formed by this lock can
+ * sleep, and the required 'user_lock' can also be a sleeping lock.
+ *
+ * The caller is required to provide a 'user_lock' to serialize both teardown
+ * and setup.
+ *
+ * The return value should be passed to mmu_range_read_retry().
+ */
+unsigned long mmu_range_read_begin(struct mmu_range_notifier *mrn)
+{
+	struct mmu_notifier_mm *mmn_mm = mrn->mm->mmu_notifier_mm;
+	unsigned long seq;
+	bool is_invalidating;
+
+	/*
+	 * If the mrn has a different seq value under the user_lock than we
+	 * started with then it has collided.
+	 *
+	 * If the mrn currently has the same seq value as the mmn_mm seq, then
+	 * it is currently between invalidate_start/end and is colliding.
+	 *
+	 * The locking looks broadly like this:
+	 *   mn_tree_invalidate_start():          mmu_range_read_begin():
+	 *                                         spin_lock
+	 *                                          seq = READ_ONCE(mrn->invalidate_seq);
+	 *                                          seq == mmn_mm->invalidate_seq
+	 *                                         spin_unlock
+	 *    spin_lock
+	 *     seq = ++mmn_mm->invalidate_seq
+	 *    spin_unlock
+	 *     op->invalidate_range():
+	 *       user_lock
+	 *        mmu_range_set_seq()
+	 *         mrn->invalidate_seq = seq
+	 *       user_unlock
+	 *
+	 *                          [Required: mmu_range_read_retry() == true]
+	 *
+	 *   mn_itree_inv_end():
+	 *    spin_lock
+	 *     seq = ++mmn_mm->invalidate_seq
+	 *    spin_unlock
+	 *
+	 *                                        user_lock
+	 *                                         mmu_range_read_retry():
+	 *                                          mrn->invalidate_seq != seq
+	 *                                        user_unlock
+	 *
+	 * Barriers are not needed here as any races here are closed by an
+	 * eventual mmu_range_read_retry(), which provides a barrier via the
+	 * user_lock.
+	 */
+	spin_lock(&mmn_mm->lock);
+	/* Pairs with the WRITE_ONCE in mmu_range_set_seq() */
+	seq = READ_ONCE(mrn->invalidate_seq);
+	is_invalidating = seq == mmn_mm->invalidate_seq;
+	spin_unlock(&mmn_mm->lock);
+
+	/*
+	 * mrn->invalidate_seq is always set to an odd value. This ensures
+	 * that if seq does wrap we will always clear the below sleep in some
+	 * reasonable time as mmn_mm->invalidate_seq is even in the idle
+	 * state.
+	 */
+	lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
+	lock_map_release(&__mmu_notifier_invalidate_range_start_map);
+	if (is_invalidating)
+		wait_event(mmn_mm->wq,
+			   READ_ONCE(mmn_mm->invalidate_seq) != seq);
+
+	/*
+	 * Notice that mmu_range_read_retry() can already be true at this
+	 * point, avoiding loops here allows the user of this lock to provide
+	 * a global time bound.
+	 */
+
+	return seq;
+}
+EXPORT_SYMBOL_GPL(mmu_range_read_begin);
+
+static void mn_itree_release(struct mmu_notifier_mm *mmn_mm,
+			     struct mm_struct *mm)
+{
+	struct mmu_notifier_range range = {
+		.flags = MMU_NOTIFIER_RANGE_BLOCKABLE,
+		.event = MMU_NOTIFY_RELEASE,
+		.mm = mm,
+		.start = 0,
+		.end = ULONG_MAX,
+	};
+	struct mmu_range_notifier *mrn;
+	unsigned long cur_seq;
+	bool ret;
+
+	for (mrn = mn_itree_inv_start_range(mmn_mm, &range, &cur_seq); mrn;
+	     mrn = mn_itree_inv_next(mrn, &range)) {
+		ret = mrn->ops->invalidate(mrn, &range, cur_seq);
+		WARN_ON(!ret);
+	}
+
+	mn_itree_inv_end(mmn_mm);
+}
+
 /*
  * This function can't run concurrently against mmu_notifier_register
  * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap
@@ -52,17 +286,24 @@ struct mmu_notifier_mm {
  * can't go away from under us as exit_mmap holds an mm_count pin
  * itself.
  */
-void __mmu_notifier_release(struct mm_struct *mm)
+static void mn_hlist_release(struct mmu_notifier_mm *mmn_mm,
+			     struct mm_struct *mm)
 {
 	struct mmu_notifier *mn;
 	int id;
 
+	if (mmn_mm->has_interval)
+		mn_itree_release(mmn_mm, mm);
+
+	if (hlist_empty(&mmn_mm->list))
+		return;
+
 	/*
 	 * SRCU here will block mmu_notifier_unregister until
 	 * ->release returns.
 	 */
 	id = srcu_read_lock(&srcu);
-	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist)
+	hlist_for_each_entry_rcu(mn, &mmn_mm->list, hlist)
 		/*
 		 * If ->release runs before mmu_notifier_unregister it must be
 		 * handled, as it's the only way for the driver to flush all
@@ -72,9 +313,9 @@ void __mmu_notifier_release(struct mm_struct *mm)
 		if (mn->ops->release)
 			mn->ops->release(mn, mm);
 
-	spin_lock(&mm->mmu_notifier_mm->lock);
-	while (unlikely(!hlist_empty(&mm->mmu_notifier_mm->list))) {
-		mn = hlist_entry(mm->mmu_notifier_mm->list.first,
+	spin_lock(&mmn_mm->lock);
+	while (unlikely(!hlist_empty(&mmn_mm->list))) {
+		mn = hlist_entry(mmn_mm->list.first,
 				 struct mmu_notifier,
 				 hlist);
 		/*
@@ -85,7 +326,7 @@ void __mmu_notifier_release(struct mm_struct *mm)
 		 */
 		hlist_del_init_rcu(&mn->hlist);
 	}
-	spin_unlock(&mm->mmu_notifier_mm->lock);
+	spin_unlock(&mmn_mm->lock);
 	srcu_read_unlock(&srcu, id);
 
 	/*
@@ -100,6 +341,17 @@ void __mmu_notifier_release(struct mm_struct *mm)
 	synchronize_srcu(&srcu);
 }
 
+void __mmu_notifier_release(struct mm_struct *mm)
+{
+	struct mmu_notifier_mm *mmn_mm = mm->mmu_notifier_mm;
+
+	if (mmn_mm->has_interval)
+		mn_itree_release(mmn_mm, mm);
+
+	if (!hlist_empty(&mmn_mm->list))
+		mn_hlist_release(mmn_mm, mm);
+}
+
 /*
  * If no young bitflag is supported by the hardware, ->clear_flush_young can
  * unmap the address and return 1 or 0 depending if the mapping previously
@@ -172,14 +424,43 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
 	srcu_read_unlock(&srcu, id);
 }
 
-int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
+static int mn_itree_invalidate(struct mmu_notifier_mm *mmn_mm,
+				     const struct mmu_notifier_range *range)
+{
+	struct mmu_range_notifier *mrn;
+	unsigned long cur_seq;
+
+	for (mrn = mn_itree_inv_start_range(mmn_mm, range, &cur_seq); mrn;
+	     mrn = mn_itree_inv_next(mrn, range)) {
+		bool ret;
+
+		ret = mrn->ops->invalidate(mrn, range, cur_seq);
+		if (!ret) {
+			if (WARN_ON(mmu_notifier_range_blockable(range)))
+				continue;
+			goto out_would_block;
+		}
+	}
+	return 0;
+
+out_would_block:
+	/*
+	 * On -EAGAIN the non-blocking caller is not allowed to call
+	 * invalidate_range_end()
+	 */
+	mn_itree_inv_end(mmn_mm);
+	return -EAGAIN;
+}
+
+static int mn_hlist_invalidate_range_start(struct mmu_notifier_mm *mmn_mm,
+					   struct mmu_notifier_range *range)
 {
 	struct mmu_notifier *mn;
 	int ret = 0;
 	int id;
 
 	id = srcu_read_lock(&srcu);
-	hlist_for_each_entry_rcu(mn, &range->mm->mmu_notifier_mm->list, hlist) {
+	hlist_for_each_entry_rcu(mn, &mmn_mm->list, hlist) {
 		if (mn->ops->invalidate_range_start) {
 			int _ret;
 
@@ -203,15 +484,30 @@ int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
 	return ret;
 }
 
-void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *range,
-					 bool only_end)
+int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
+{
+	struct mmu_notifier_mm *mmn_mm = range->mm->mmu_notifier_mm;
+	int ret = 0;
+
+	if (mmn_mm->has_interval) {
+		ret = mn_itree_invalidate(mmn_mm, range);
+		if (ret)
+			return ret;
+	}
+	if (!hlist_empty(&mmn_mm->list))
+		return mn_hlist_invalidate_range_start(mmn_mm, range);
+	return 0;
+}
+
+static void mn_hlist_invalidate_end(struct mmu_notifier_mm *mmn_mm,
+				    struct mmu_notifier_range *range,
+				    bool only_end)
 {
 	struct mmu_notifier *mn;
 	int id;
 
-	lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
 	id = srcu_read_lock(&srcu);
-	hlist_for_each_entry_rcu(mn, &range->mm->mmu_notifier_mm->list, hlist) {
+	hlist_for_each_entry_rcu(mn, &mmn_mm->list, hlist) {
 		/*
 		 * Call invalidate_range here too to avoid the need for the
 		 * subsystem of having to register an invalidate_range_end
@@ -238,6 +534,19 @@ void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *range,
 		}
 	}
 	srcu_read_unlock(&srcu, id);
+}
+
+void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *range,
+					 bool only_end)
+{
+	struct mmu_notifier_mm *mmn_mm = range->mm->mmu_notifier_mm;
+
+	lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
+	if (mmn_mm->has_interval)
+		mn_itree_inv_end(mmn_mm);
+
+	if (!hlist_empty(&mmn_mm->list))
+		mn_hlist_invalidate_end(mmn_mm, range, only_end);
 	lock_map_release(&__mmu_notifier_invalidate_range_start_map);
 }
 
@@ -256,8 +565,9 @@ void __mmu_notifier_invalidate_range(struct mm_struct *mm,
 }
 
 /*
- * Same as mmu_notifier_register but here the caller must hold the
- * mmap_sem in write mode.
+ * Same as mmu_notifier_register but here the caller must hold the mmap_sem in
+ * write mode. A NULL mn signals the notifier is being registered for itree
+ * mode.
  */
 int __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
 {
@@ -274,9 +584,6 @@ int __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
 		fs_reclaim_release(GFP_KERNEL);
 	}
 
-	mn->mm = mm;
-	mn->users = 1;
-
 	if (!mm->mmu_notifier_mm) {
 		/*
 		 * kmalloc cannot be called under mm_take_all_locks(), but we
@@ -284,21 +591,22 @@ int __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
 		 * the write side of the mmap_sem.
 		 */
 		mmu_notifier_mm =
-			kmalloc(sizeof(struct mmu_notifier_mm), GFP_KERNEL);
+			kzalloc(sizeof(struct mmu_notifier_mm), GFP_KERNEL);
 		if (!mmu_notifier_mm)
 			return -ENOMEM;
 
 		INIT_HLIST_HEAD(&mmu_notifier_mm->list);
 		spin_lock_init(&mmu_notifier_mm->lock);
+		mmu_notifier_mm->invalidate_seq = 2;
+		mmu_notifier_mm->itree = RB_ROOT_CACHED;
+		init_waitqueue_head(&mmu_notifier_mm->wq);
+		INIT_HLIST_HEAD(&mmu_notifier_mm->deferred_list);
 	}
 
 	ret = mm_take_all_locks(mm);
 	if (unlikely(ret))
 		goto out_clean;
 
-	/* Pairs with the mmdrop in mmu_notifier_unregister_* */
-	mmgrab(mm);
-
 	/*
 	 * Serialize the update against mmu_notifier_unregister. A
 	 * side note: mmu_notifier_release can't run concurrently with
@@ -306,13 +614,28 @@ int __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
 	 * current->mm or explicitly with get_task_mm() or similar).
 	 * We can't race against any other mmu notifier method either
 	 * thanks to mm_take_all_locks().
+	 *
+	 * release semantics on the initialization of the mmu_notifier_mm's
+         * contents are provided for unlocked readers.  acquire can only be
+         * used while holding the mmgrab or mmget, and is safe because once
+         * created the mmu_notififer_mm is not freed until the mm is
+         * destroyed.  As above, users holding the mmap_sem or one of the
+         * mm_take_all_locks() do not need to use acquire semantics.
 	 */
 	if (mmu_notifier_mm)
-		mm->mmu_notifier_mm = mmu_notifier_mm;
+		smp_store_release(&mm->mmu_notifier_mm, mmu_notifier_mm);
 
-	spin_lock(&mm->mmu_notifier_mm->lock);
-	hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier_mm->list);
-	spin_unlock(&mm->mmu_notifier_mm->lock);
+	if (mn) {
+		/* Pairs with the mmdrop in mmu_notifier_unregister_* */
+		mmgrab(mm);
+		mn->mm = mm;
+		mn->users = 1;
+
+		spin_lock(&mm->mmu_notifier_mm->lock);
+		hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier_mm->list);
+		spin_unlock(&mm->mmu_notifier_mm->lock);
+	} else
+		mm->mmu_notifier_mm->has_interval = true;
 
 	mm_drop_all_locks(mm);
 	BUG_ON(atomic_read(&mm->mm_users) <= 0);
@@ -529,6 +852,166 @@ void mmu_notifier_put(struct mmu_notifier *mn)
 }
 EXPORT_SYMBOL_GPL(mmu_notifier_put);
 
+static int __mmu_range_notifier_insert(struct mmu_range_notifier *mrn,
+				       unsigned long start,
+				       unsigned long length,
+				       struct mmu_notifier_mm *mmn_mm,
+				       struct mm_struct *mm)
+{
+	mrn->mm = mm;
+	RB_CLEAR_NODE(&mrn->interval_tree.rb);
+	mrn->interval_tree.start = start;
+	/*
+	 * Note that the representation of the intervals in the interval tree
+	 * considers the ending point as contained in the interval.
+	 */
+	if (length == 0 ||
+	    check_add_overflow(start, length - 1, &mrn->interval_tree.last))
+		return -EOVERFLOW;
+
+	/* pairs with mmdrop in mmu_range_notifier_remove() */
+	mmgrab(mm);
+
+	/*
+	 * If some invalidate_range_start/end region is going on in parallel
+	 * we don't know what VA ranges are affected, so we must assume this
+	 * new range is included.
+	 *
+	 * If the itree is invalidating then we are not allowed to change
+	 * it. Retrying until invalidation is done is tricky due to the
+	 * possibility for live lock, instead defer the add to the unlock so
+	 * this algorithm is deterministic.
+	 *
+	 * In all cases the value for the mrn->mr_invalidate_seq should be
+	 * odd, see mmu_range_read_begin()
+	 */
+	spin_lock(&mmn_mm->lock);
+	if (mmn_mm->active_invalidate_ranges) {
+		if (mn_itree_is_invalidating(mmn_mm))
+			hlist_add_head(&mrn->deferred_item,
+				       &mmn_mm->deferred_list);
+		else {
+			mmn_mm->invalidate_seq |= 1;
+			interval_tree_insert(&mrn->interval_tree,
+					     &mmn_mm->itree);
+		}
+		mrn->invalidate_seq = mmn_mm->invalidate_seq;
+	} else {
+		WARN_ON(mn_itree_is_invalidating(mmn_mm));
+		mrn->invalidate_seq = mmn_mm->invalidate_seq - 1;
+		interval_tree_insert(&mrn->interval_tree, &mmn_mm->itree);
+	}
+	spin_unlock(&mmn_mm->lock);
+	return 0;
+}
+
+/**
+ * mmu_range_notifier_insert - Insert a range notifier
+ * @mrn: Range notifier to register
+ * @start: Starting virtual address to monitor
+ * @length: Length of the range to monitor
+ * @mm : mm_struct to attach to
+ *
+ * This function subscribes the range notifier for notifications from the mm.
+ * Upon return the ops related to mmu_range_notifier will be called whenever
+ * an event that intersects with the given range occurs.
+ *
+ * Upon return the range_notifier may not be present in the interval tree yet.
+ * The caller must use the normal range notifier locking flow via
+ * mmu_range_read_begin() to establish SPTEs for this range.
+ */
+int mmu_range_notifier_insert(struct mmu_range_notifier *mrn,
+			      unsigned long start, unsigned long length,
+			      struct mm_struct *mm)
+{
+	struct mmu_notifier_mm *mmn_mm;
+	int ret;
+
+	might_lock(&mm->mmap_sem);
+
+	mmn_mm = smp_load_acquire(&mm->mmu_notifier_mm);
+	if (!mmn_mm || !mmn_mm->has_interval) {
+		ret = mmu_notifier_register(NULL, mm);
+		if (ret)
+			return ret;
+		mmn_mm = mm->mmu_notifier_mm;
+	}
+	return __mmu_range_notifier_insert(mrn, start, length, mmn_mm, mm);
+}
+EXPORT_SYMBOL_GPL(mmu_range_notifier_insert);
+
+int mmu_range_notifier_insert_locked(struct mmu_range_notifier *mrn,
+				     unsigned long start, unsigned long length,
+				     struct mm_struct *mm)
+{
+	struct mmu_notifier_mm *mmn_mm;
+	int ret;
+
+	lockdep_assert_held_write(&mm->mmap_sem);
+
+	mmn_mm = mm->mmu_notifier_mm;
+	if (!mmn_mm || !mmn_mm->has_interval) {
+		ret = __mmu_notifier_register(NULL, mm);
+		if (ret)
+			return ret;
+		mmn_mm = mm->mmu_notifier_mm;
+	}
+	return __mmu_range_notifier_insert(mrn, start, length, mmn_mm, mm);
+}
+EXPORT_SYMBOL_GPL(mmu_range_notifier_insert_locked);
+
+/**
+ * mmu_range_notifier_remove - Remove a range notifier
+ * @mrn: Range notifier to unregister
+ *
+ * This function must be paired with mmu_range_notifier_insert(). It cannot be
+ * called from any ops callback.
+ *
+ * Once this returns ops callbacks are no longer running on other CPUs and
+ * will not be called in future.
+ */
+void mmu_range_notifier_remove(struct mmu_range_notifier *mrn)
+{
+	struct mm_struct *mm = mrn->mm;
+	struct mmu_notifier_mm *mmn_mm = mm->mmu_notifier_mm;
+	unsigned long seq = 0;
+
+	might_sleep();
+
+	spin_lock(&mmn_mm->lock);
+	if (mn_itree_is_invalidating(mmn_mm)) {
+		/*
+		 * remove is being called after insert put this on the
+		 * deferred list, but before the deferred list was processed.
+		 */
+		if (RB_EMPTY_NODE(&mrn->interval_tree.rb)) {
+			hlist_del(&mrn->deferred_item);
+		} else {
+			hlist_add_head(&mrn->deferred_item,
+				       &mmn_mm->deferred_list);
+			seq = mmn_mm->invalidate_seq;
+		}
+	} else {
+		WARN_ON(RB_EMPTY_NODE(&mrn->interval_tree.rb));
+		interval_tree_remove(&mrn->interval_tree, &mmn_mm->itree);
+	}
+	spin_unlock(&mmn_mm->lock);
+
+	/*
+	 * The possible sleep on progress in the invalidation requires the
+	 * caller not hold any locks held by invalidation callbacks.
+	 */
+	lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
+	lock_map_release(&__mmu_notifier_invalidate_range_start_map);
+	if (seq)
+		wait_event(mmn_mm->wq,
+			   READ_ONCE(mmn_mm->invalidate_seq) != seq);
+
+	/* pairs with mmgrab in mmu_range_notifier_insert() */
+	mmdrop(mm);
+}
+EXPORT_SYMBOL_GPL(mmu_range_notifier_remove);
+
 /**
  * mmu_notifier_synchronize - Ensure all mmu_notifiers are freed
  *
-- 
2.23.0


^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH v2 03/15] mm/hmm: allow hmm_range to be used with a mmu_range_notifier or hmm_mirror
  2019-10-28 20:10 [PATCH v2 00/15] Consolidate the mmu notifier interval_tree and locking Jason Gunthorpe
  2019-10-28 20:10 ` [PATCH v2 01/15] mm/mmu_notifier: define the header pre-processor parts even if disabled Jason Gunthorpe
  2019-10-28 20:10 ` [PATCH v2 02/15] mm/mmu_notifier: add an interval tree notifier Jason Gunthorpe
@ 2019-10-28 20:10 ` Jason Gunthorpe
  2019-10-28 20:10 ` [PATCH v2 04/15] mm/hmm: define the pre-processor related parts of hmm.h even if disabled Jason Gunthorpe
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 71+ messages in thread
From: Jason Gunthorpe @ 2019-10-28 20:10 UTC (permalink / raw)
  To: linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard, Felix.Kuehling
  Cc: linux-rdma, dri-devel, amd-gfx, Alex Deucher, Ben Skeggs,
	Boris Ostrovsky, Christian König, David Zhou,
	Dennis Dalessandro, Juergen Gross, Mike Marciniszyn,
	Oleksandr Andrushchenko, Petr Cvek, Stefano Stabellini, nouveau,
	xen-devel, Christoph Hellwig, Jason Gunthorpe

From: Jason Gunthorpe <jgg@mellanox.com>

hmm_mirror's handling of ranges does not use a sequence count which
results in this bug:

         CPU0                                   CPU1
                                     hmm_range_wait_until_valid(range)
                                         valid == true
                                     hmm_range_fault(range)
hmm_invalidate_range_start()
   range->valid = false
hmm_invalidate_range_end()
   range->valid = true
                                     hmm_range_valid(range)
                                          valid == true

Where the hmm_range_valid should not have succeeded.

Adding the required sequence count would make it nearly identical to the
new mmu_range_notifier. Instead replace the hmm_mirror stuff with
mmu_range_notifier.

Co-existence of the two APIs is the first step.

Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 include/linux/hmm.h |  5 +++++
 mm/hmm.c            | 25 +++++++++++++++++++------
 2 files changed, 24 insertions(+), 6 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 3fec513b9c00f1..8ac1fd6a81af8f 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -145,6 +145,9 @@ enum hmm_pfn_value_e {
 /*
  * struct hmm_range - track invalidation lock on virtual address range
  *
+ * @notifier: an optional mmu_range_notifier
+ * @notifier_seq: when notifier is used this is the result of
+ *                mmu_range_read_begin()
  * @hmm: the core HMM structure this range is active against
  * @vma: the vm area struct for the range
  * @list: all range lock are on a list
@@ -159,6 +162,8 @@ enum hmm_pfn_value_e {
  * @valid: pfns array did not change since it has been fill by an HMM function
  */
 struct hmm_range {
+	struct mmu_range_notifier *notifier;
+	unsigned long		notifier_seq;
 	struct hmm		*hmm;
 	struct list_head	list;
 	unsigned long		start;
diff --git a/mm/hmm.c b/mm/hmm.c
index 902f5fa6bf93ad..22ac3595771feb 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -852,6 +852,14 @@ void hmm_range_unregister(struct hmm_range *range)
 }
 EXPORT_SYMBOL(hmm_range_unregister);
 
+static bool needs_retry(struct hmm_range *range)
+{
+	if (range->notifier)
+		return mmu_range_check_retry(range->notifier,
+					     range->notifier_seq);
+	return !range->valid;
+}
+
 static const struct mm_walk_ops hmm_walk_ops = {
 	.pud_entry	= hmm_vma_walk_pud,
 	.pmd_entry	= hmm_vma_walk_pmd,
@@ -892,18 +900,23 @@ long hmm_range_fault(struct hmm_range *range, unsigned int flags)
 	const unsigned long device_vma = VM_IO | VM_PFNMAP | VM_MIXEDMAP;
 	unsigned long start = range->start, end;
 	struct hmm_vma_walk hmm_vma_walk;
-	struct hmm *hmm = range->hmm;
+	struct mm_struct *mm;
 	struct vm_area_struct *vma;
 	int ret;
 
-	lockdep_assert_held(&hmm->mmu_notifier.mm->mmap_sem);
+	if (range->notifier)
+		mm = range->notifier->mm;
+	else
+		mm = range->hmm->mmu_notifier.mm;
+
+	lockdep_assert_held(&mm->mmap_sem);
 
 	do {
 		/* If range is no longer valid force retry. */
-		if (!range->valid)
+		if (needs_retry(range))
 			return -EBUSY;
 
-		vma = find_vma(hmm->mmu_notifier.mm, start);
+		vma = find_vma(mm, start);
 		if (vma == NULL || (vma->vm_flags & device_vma))
 			return -EFAULT;
 
@@ -933,7 +946,7 @@ long hmm_range_fault(struct hmm_range *range, unsigned int flags)
 			start = hmm_vma_walk.last;
 
 			/* Keep trying while the range is valid. */
-		} while (ret == -EBUSY && range->valid);
+		} while (ret == -EBUSY && !needs_retry(range));
 
 		if (ret) {
 			unsigned long i;
@@ -991,7 +1004,7 @@ long hmm_range_dma_map(struct hmm_range *range, struct device *device,
 			continue;
 
 		/* Check if range is being invalidated */
-		if (!range->valid) {
+		if (needs_retry(range)) {
 			ret = -EBUSY;
 			goto unmap;
 		}
-- 
2.23.0


^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH v2 04/15] mm/hmm: define the pre-processor related parts of hmm.h even if disabled
  2019-10-28 20:10 [PATCH v2 00/15] Consolidate the mmu notifier interval_tree and locking Jason Gunthorpe
                   ` (2 preceding siblings ...)
  2019-10-28 20:10 ` [PATCH v2 03/15] mm/hmm: allow hmm_range to be used with a mmu_range_notifier or hmm_mirror Jason Gunthorpe
@ 2019-10-28 20:10 ` Jason Gunthorpe
  2019-10-28 20:10 ` [PATCH v2 05/15] RDMA/odp: Use mmu_range_notifier_insert() Jason Gunthorpe
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 71+ messages in thread
From: Jason Gunthorpe @ 2019-10-28 20:10 UTC (permalink / raw)
  To: linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard, Felix.Kuehling
  Cc: linux-rdma, dri-devel, amd-gfx, Alex Deucher, Ben Skeggs,
	Boris Ostrovsky, Christian König, David Zhou,
	Dennis Dalessandro, Juergen Gross, Mike Marciniszyn,
	Oleksandr Andrushchenko, Petr Cvek, Stefano Stabellini, nouveau,
	xen-devel, Christoph Hellwig, Jason Gunthorpe

From: Jason Gunthorpe <jgg@mellanox.com>

Only the function calls are stubbed out with static inlines that always
fail. This is the standard way to write a header for an optional component
and makes it easier for drivers that only optionally need HMM_MIRROR.

Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 include/linux/hmm.h | 59 ++++++++++++++++++++++++++++++++++++---------
 kernel/fork.c       |  1 -
 2 files changed, 47 insertions(+), 13 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 8ac1fd6a81af8f..2666eb08a40615 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -62,8 +62,6 @@
 #include <linux/kconfig.h>
 #include <asm/pgtable.h>
 
-#ifdef CONFIG_HMM_MIRROR
-
 #include <linux/device.h>
 #include <linux/migrate.h>
 #include <linux/memremap.h>
@@ -374,6 +372,15 @@ struct hmm_mirror {
 	struct list_head		list;
 };
 
+/*
+ * Retry fault if non-blocking, drop mmap_sem and return -EAGAIN in that case.
+ */
+#define HMM_FAULT_ALLOW_RETRY		(1 << 0)
+
+/* Don't fault in missing PTEs, just snapshot the current state. */
+#define HMM_FAULT_SNAPSHOT		(1 << 1)
+
+#ifdef CONFIG_HMM_MIRROR
 int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
 void hmm_mirror_unregister(struct hmm_mirror *mirror);
 
@@ -383,14 +390,6 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror);
 int hmm_range_register(struct hmm_range *range, struct hmm_mirror *mirror);
 void hmm_range_unregister(struct hmm_range *range);
 
-/*
- * Retry fault if non-blocking, drop mmap_sem and return -EAGAIN in that case.
- */
-#define HMM_FAULT_ALLOW_RETRY		(1 << 0)
-
-/* Don't fault in missing PTEs, just snapshot the current state. */
-#define HMM_FAULT_SNAPSHOT		(1 << 1)
-
 long hmm_range_fault(struct hmm_range *range, unsigned int flags);
 
 long hmm_range_dma_map(struct hmm_range *range,
@@ -401,6 +400,44 @@ long hmm_range_dma_unmap(struct hmm_range *range,
 			 struct device *device,
 			 dma_addr_t *daddrs,
 			 bool dirty);
+#else
+int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm)
+{
+	return -EOPNOTSUPP;
+}
+
+void hmm_mirror_unregister(struct hmm_mirror *mirror)
+{
+}
+
+int hmm_range_register(struct hmm_range *range, struct hmm_mirror *mirror)
+{
+	return -EOPNOTSUPP;
+}
+
+void hmm_range_unregister(struct hmm_range *range)
+{
+}
+
+static inline long hmm_range_fault(struct hmm_range *range, unsigned int flags)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline long hmm_range_dma_map(struct hmm_range *range,
+				     struct device *device, dma_addr_t *daddrs,
+				     unsigned int flags)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline long hmm_range_dma_unmap(struct hmm_range *range,
+				       struct device *device,
+				       dma_addr_t *daddrs, bool dirty)
+{
+	return -EOPNOTSUPP;
+}
+#endif
 
 /*
  * HMM_RANGE_DEFAULT_TIMEOUT - default timeout (ms) when waiting for a range
@@ -411,6 +448,4 @@ long hmm_range_dma_unmap(struct hmm_range *range,
  */
 #define HMM_RANGE_DEFAULT_TIMEOUT 1000
 
-#endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
-
 #endif /* LINUX_HMM_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index f9572f41612628..4561a65d19db88 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -40,7 +40,6 @@
 #include <linux/binfmts.h>
 #include <linux/mman.h>
 #include <linux/mmu_notifier.h>
-#include <linux/hmm.h>
 #include <linux/fs.h>
 #include <linux/mm.h>
 #include <linux/vmacache.h>
-- 
2.23.0


^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH v2 05/15] RDMA/odp: Use mmu_range_notifier_insert()
  2019-10-28 20:10 [PATCH v2 00/15] Consolidate the mmu notifier interval_tree and locking Jason Gunthorpe
                   ` (3 preceding siblings ...)
  2019-10-28 20:10 ` [PATCH v2 04/15] mm/hmm: define the pre-processor related parts of hmm.h even if disabled Jason Gunthorpe
@ 2019-10-28 20:10 ` Jason Gunthorpe
  2019-10-28 20:10 ` [PATCH v2 06/15] RDMA/hfi1: Use mmu_range_notifier_inset for user_exp_rcv Jason Gunthorpe
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 71+ messages in thread
From: Jason Gunthorpe @ 2019-10-28 20:10 UTC (permalink / raw)
  To: linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard, Felix.Kuehling
  Cc: linux-rdma, dri-devel, amd-gfx, Alex Deucher, Ben Skeggs,
	Boris Ostrovsky, Christian König, David Zhou,
	Dennis Dalessandro, Juergen Gross, Mike Marciniszyn,
	Oleksandr Andrushchenko, Petr Cvek, Stefano Stabellini, nouveau,
	xen-devel, Christoph Hellwig, Jason Gunthorpe

From: Jason Gunthorpe <jgg@mellanox.com>

Replace the internal interval tree based mmu notifier with the new common
mmu_range_notifier_insert() API. This removes a lot of code and fixes a
deadlock that can be triggered in ODP:

 zap_page_range()
  mmu_notifier_invalidate_range_start()
   [..]
    ib_umem_notifier_invalidate_range_start()
       down_read(&per_mm->umem_rwsem)
  unmap_single_vma()
    [..]
      __split_huge_page_pmd()
        mmu_notifier_invalidate_range_start()
        [..]
           ib_umem_notifier_invalidate_range_start()
              down_read(&per_mm->umem_rwsem)   // DEADLOCK

        mmu_notifier_invalidate_range_end()
           up_read(&per_mm->umem_rwsem)
  mmu_notifier_invalidate_range_end()
     up_read(&per_mm->umem_rwsem)

The umem_rwsem is held across the range_start/end as the ODP algorithm for
invalidate_range_end cannot tolerate changes to the interval
tree. However, due to the nested invalidation regions the second
down_read() can deadlock if there are competing writers. The new core code
provides an alternative scheme to solve this problem.

Fixes: ca748c39ea3f ("RDMA/umem: Get rid of per_mm->notifier_count")
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 drivers/infiniband/core/device.c     |   1 -
 drivers/infiniband/core/umem_odp.c   | 288 +++------------------------
 drivers/infiniband/hw/mlx5/mlx5_ib.h |   7 +-
 drivers/infiniband/hw/mlx5/mr.c      |   3 +-
 drivers/infiniband/hw/mlx5/odp.c     |  50 +++--
 include/rdma/ib_umem_odp.h           |  65 ++----
 include/rdma/ib_verbs.h              |   2 -
 7 files changed, 69 insertions(+), 347 deletions(-)

diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index 2dd2cfe9b56136..ac7924b3c73abe 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -2617,7 +2617,6 @@ void ib_set_device_ops(struct ib_device *dev, const struct ib_device_ops *ops)
 	SET_DEVICE_OP(dev_ops, get_vf_config);
 	SET_DEVICE_OP(dev_ops, get_vf_stats);
 	SET_DEVICE_OP(dev_ops, init_port);
-	SET_DEVICE_OP(dev_ops, invalidate_range);
 	SET_DEVICE_OP(dev_ops, iw_accept);
 	SET_DEVICE_OP(dev_ops, iw_add_ref);
 	SET_DEVICE_OP(dev_ops, iw_connect);
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index d7d5fadf0899ad..6132b8127e8435 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -48,197 +48,32 @@
 
 #include "uverbs.h"
 
-static void ib_umem_notifier_start_account(struct ib_umem_odp *umem_odp)
-{
-	mutex_lock(&umem_odp->umem_mutex);
-	if (umem_odp->notifiers_count++ == 0)
-		/*
-		 * Initialize the completion object for waiting on
-		 * notifiers. Since notifier_count is zero, no one should be
-		 * waiting right now.
-		 */
-		reinit_completion(&umem_odp->notifier_completion);
-	mutex_unlock(&umem_odp->umem_mutex);
-}
-
-static void ib_umem_notifier_end_account(struct ib_umem_odp *umem_odp)
-{
-	mutex_lock(&umem_odp->umem_mutex);
-	/*
-	 * This sequence increase will notify the QP page fault that the page
-	 * that is going to be mapped in the spte could have been freed.
-	 */
-	++umem_odp->notifiers_seq;
-	if (--umem_odp->notifiers_count == 0)
-		complete_all(&umem_odp->notifier_completion);
-	mutex_unlock(&umem_odp->umem_mutex);
-}
-
-static void ib_umem_notifier_release(struct mmu_notifier *mn,
-				     struct mm_struct *mm)
-{
-	struct ib_ucontext_per_mm *per_mm =
-		container_of(mn, struct ib_ucontext_per_mm, mn);
-	struct rb_node *node;
-
-	down_read(&per_mm->umem_rwsem);
-	if (!per_mm->mn.users)
-		goto out;
-
-	for (node = rb_first_cached(&per_mm->umem_tree); node;
-	     node = rb_next(node)) {
-		struct ib_umem_odp *umem_odp =
-			rb_entry(node, struct ib_umem_odp, interval_tree.rb);
-
-		/*
-		 * Increase the number of notifiers running, to prevent any
-		 * further fault handling on this MR.
-		 */
-		ib_umem_notifier_start_account(umem_odp);
-		complete_all(&umem_odp->notifier_completion);
-		umem_odp->umem.ibdev->ops.invalidate_range(
-			umem_odp, ib_umem_start(umem_odp),
-			ib_umem_end(umem_odp));
-	}
-
-out:
-	up_read(&per_mm->umem_rwsem);
-}
-
-static int invalidate_range_start_trampoline(struct ib_umem_odp *item,
-					     u64 start, u64 end, void *cookie)
-{
-	ib_umem_notifier_start_account(item);
-	item->umem.ibdev->ops.invalidate_range(item, start, end);
-	return 0;
-}
-
-static int ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
-				const struct mmu_notifier_range *range)
-{
-	struct ib_ucontext_per_mm *per_mm =
-		container_of(mn, struct ib_ucontext_per_mm, mn);
-	int rc;
-
-	if (mmu_notifier_range_blockable(range))
-		down_read(&per_mm->umem_rwsem);
-	else if (!down_read_trylock(&per_mm->umem_rwsem))
-		return -EAGAIN;
-
-	if (!per_mm->mn.users) {
-		up_read(&per_mm->umem_rwsem);
-		/*
-		 * At this point users is permanently zero and visible to this
-		 * CPU without a lock, that fact is relied on to skip the unlock
-		 * in range_end.
-		 */
-		return 0;
-	}
-
-	rc = rbt_ib_umem_for_each_in_range(&per_mm->umem_tree, range->start,
-					   range->end,
-					   invalidate_range_start_trampoline,
-					   mmu_notifier_range_blockable(range),
-					   NULL);
-	if (rc)
-		up_read(&per_mm->umem_rwsem);
-	return rc;
-}
-
-static int invalidate_range_end_trampoline(struct ib_umem_odp *item, u64 start,
-					   u64 end, void *cookie)
-{
-	ib_umem_notifier_end_account(item);
-	return 0;
-}
-
-static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn,
-				const struct mmu_notifier_range *range)
-{
-	struct ib_ucontext_per_mm *per_mm =
-		container_of(mn, struct ib_ucontext_per_mm, mn);
-
-	if (unlikely(!per_mm->mn.users))
-		return;
-
-	rbt_ib_umem_for_each_in_range(&per_mm->umem_tree, range->start,
-				      range->end,
-				      invalidate_range_end_trampoline, true, NULL);
-	up_read(&per_mm->umem_rwsem);
-}
-
-static struct mmu_notifier *ib_umem_alloc_notifier(struct mm_struct *mm)
-{
-	struct ib_ucontext_per_mm *per_mm;
-
-	per_mm = kzalloc(sizeof(*per_mm), GFP_KERNEL);
-	if (!per_mm)
-		return ERR_PTR(-ENOMEM);
-
-	per_mm->umem_tree = RB_ROOT_CACHED;
-	init_rwsem(&per_mm->umem_rwsem);
-
-	WARN_ON(mm != current->mm);
-	rcu_read_lock();
-	per_mm->tgid = get_task_pid(current->group_leader, PIDTYPE_PID);
-	rcu_read_unlock();
-	return &per_mm->mn;
-}
-
-static void ib_umem_free_notifier(struct mmu_notifier *mn)
-{
-	struct ib_ucontext_per_mm *per_mm =
-		container_of(mn, struct ib_ucontext_per_mm, mn);
-
-	WARN_ON(!RB_EMPTY_ROOT(&per_mm->umem_tree.rb_root));
-
-	put_pid(per_mm->tgid);
-	kfree(per_mm);
-}
-
-static const struct mmu_notifier_ops ib_umem_notifiers = {
-	.release                    = ib_umem_notifier_release,
-	.invalidate_range_start     = ib_umem_notifier_invalidate_range_start,
-	.invalidate_range_end       = ib_umem_notifier_invalidate_range_end,
-	.alloc_notifier		    = ib_umem_alloc_notifier,
-	.free_notifier		    = ib_umem_free_notifier,
-};
-
 static inline int ib_init_umem_odp(struct ib_umem_odp *umem_odp)
 {
-	struct ib_ucontext_per_mm *per_mm;
-	struct mmu_notifier *mn;
 	int ret;
 
 	umem_odp->umem.is_odp = 1;
+	mutex_init(&umem_odp->umem_mutex);
+
 	if (!umem_odp->is_implicit_odp) {
 		size_t page_size = 1UL << umem_odp->page_shift;
+		unsigned long start;
+		unsigned long end;
 		size_t pages;
 
-		umem_odp->interval_tree.start =
-			ALIGN_DOWN(umem_odp->umem.address, page_size);
+		start = ALIGN_DOWN(umem_odp->umem.address, page_size);
 		if (check_add_overflow(umem_odp->umem.address,
 				       (unsigned long)umem_odp->umem.length,
-				       &umem_odp->interval_tree.last))
+				       &end))
 			return -EOVERFLOW;
-		umem_odp->interval_tree.last =
-			ALIGN(umem_odp->interval_tree.last, page_size);
-		if (unlikely(umem_odp->interval_tree.last < page_size))
+		end = ALIGN(end, page_size);
+		if (unlikely(end < page_size))
 			return -EOVERFLOW;
 
-		pages = (umem_odp->interval_tree.last -
-			 umem_odp->interval_tree.start) >>
-			umem_odp->page_shift;
+		pages = (end - start) >> umem_odp->page_shift;
 		if (!pages)
 			return -EINVAL;
 
-		/*
-		 * Note that the representation of the intervals in the
-		 * interval tree considers the ending point as contained in
-		 * the interval.
-		 */
-		umem_odp->interval_tree.last--;
-
 		umem_odp->page_list = kvcalloc(
 			pages, sizeof(*umem_odp->page_list), GFP_KERNEL);
 		if (!umem_odp->page_list)
@@ -250,26 +85,15 @@ static inline int ib_init_umem_odp(struct ib_umem_odp *umem_odp)
 			ret = -ENOMEM;
 			goto out_page_list;
 		}
-	}
 
-	mn = mmu_notifier_get(&ib_umem_notifiers, umem_odp->umem.owning_mm);
-	if (IS_ERR(mn)) {
-		ret = PTR_ERR(mn);
-		goto out_dma_list;
-	}
-	umem_odp->per_mm = per_mm =
-		container_of(mn, struct ib_ucontext_per_mm, mn);
-
-	mutex_init(&umem_odp->umem_mutex);
-	init_completion(&umem_odp->notifier_completion);
+		ret = mmu_range_notifier_insert(&umem_odp->notifier, start,
+						end - start, current->mm);
+		if (ret)
+			goto out_dma_list;
 
-	if (!umem_odp->is_implicit_odp) {
-		down_write(&per_mm->umem_rwsem);
-		interval_tree_insert(&umem_odp->interval_tree,
-				     &per_mm->umem_tree);
-		up_write(&per_mm->umem_rwsem);
+		umem_odp->tgid =
+			get_task_pid(current->group_leader, PIDTYPE_PID);
 	}
-	mmgrab(umem_odp->umem.owning_mm);
 
 	return 0;
 
@@ -290,8 +114,8 @@ static inline int ib_init_umem_odp(struct ib_umem_odp *umem_odp)
  * @udata: udata from the syscall being used to create the umem
  * @access: ib_reg_mr access flags
  */
-struct ib_umem_odp *ib_umem_odp_alloc_implicit(struct ib_udata *udata,
-					       int access)
+struct ib_umem_odp *
+ib_umem_odp_alloc_implicit(struct ib_udata *udata, int access)
 {
 	struct ib_ucontext *context =
 		container_of(udata, struct uverbs_attr_bundle, driver_udata)
@@ -305,8 +129,6 @@ struct ib_umem_odp *ib_umem_odp_alloc_implicit(struct ib_udata *udata,
 
 	if (!context)
 		return ERR_PTR(-EIO);
-	if (WARN_ON_ONCE(!context->device->ops.invalidate_range))
-		return ERR_PTR(-EINVAL);
 
 	umem_odp = kzalloc(sizeof(*umem_odp), GFP_KERNEL);
 	if (!umem_odp)
@@ -336,8 +158,9 @@ EXPORT_SYMBOL(ib_umem_odp_alloc_implicit);
  * @addr: The starting userspace VA
  * @size: The length of the userspace VA
  */
-struct ib_umem_odp *ib_umem_odp_alloc_child(struct ib_umem_odp *root,
-					    unsigned long addr, size_t size)
+struct ib_umem_odp *
+ib_umem_odp_alloc_child(struct ib_umem_odp *root, unsigned long addr,
+			size_t size, const struct mmu_range_notifier_ops *ops)
 {
 	/*
 	 * Caller must ensure that root cannot be freed during the call to
@@ -360,6 +183,7 @@ struct ib_umem_odp *ib_umem_odp_alloc_child(struct ib_umem_odp *root,
 	umem->writable   = root->umem.writable;
 	umem->owning_mm  = root->umem.owning_mm;
 	odp_data->page_shift = PAGE_SHIFT;
+	odp_data->notifier.ops = ops;
 
 	ret = ib_init_umem_odp(odp_data);
 	if (ret) {
@@ -383,7 +207,8 @@ EXPORT_SYMBOL(ib_umem_odp_alloc_child);
  * conjunction with MMU notifiers.
  */
 struct ib_umem_odp *ib_umem_odp_get(struct ib_udata *udata, unsigned long addr,
-				    size_t size, int access)
+				    size_t size, int access,
+				    const struct mmu_range_notifier_ops *ops)
 {
 	struct ib_umem_odp *umem_odp;
 	struct ib_ucontext *context;
@@ -398,8 +223,7 @@ struct ib_umem_odp *ib_umem_odp_get(struct ib_udata *udata, unsigned long addr,
 	if (!context)
 		return ERR_PTR(-EIO);
 
-	if (WARN_ON_ONCE(!(access & IB_ACCESS_ON_DEMAND)) ||
-	    WARN_ON_ONCE(!context->device->ops.invalidate_range))
+	if (WARN_ON_ONCE(!(access & IB_ACCESS_ON_DEMAND)))
 		return ERR_PTR(-EINVAL);
 
 	umem_odp = kzalloc(sizeof(struct ib_umem_odp), GFP_KERNEL);
@@ -411,6 +235,7 @@ struct ib_umem_odp *ib_umem_odp_get(struct ib_udata *udata, unsigned long addr,
 	umem_odp->umem.address = addr;
 	umem_odp->umem.writable = ib_access_writable(access);
 	umem_odp->umem.owning_mm = mm = current->mm;
+	umem_odp->notifier.ops = ops;
 
 	umem_odp->page_shift = PAGE_SHIFT;
 	if (access & IB_ACCESS_HUGETLB) {
@@ -442,8 +267,6 @@ EXPORT_SYMBOL(ib_umem_odp_get);
 
 void ib_umem_odp_release(struct ib_umem_odp *umem_odp)
 {
-	struct ib_ucontext_per_mm *per_mm = umem_odp->per_mm;
-
 	/*
 	 * Ensure that no more pages are mapped in the umem.
 	 *
@@ -455,28 +278,11 @@ void ib_umem_odp_release(struct ib_umem_odp *umem_odp)
 		ib_umem_odp_unmap_dma_pages(umem_odp, ib_umem_start(umem_odp),
 					    ib_umem_end(umem_odp));
 		mutex_unlock(&umem_odp->umem_mutex);
+		mmu_range_notifier_remove(&umem_odp->notifier);
 		kvfree(umem_odp->dma_list);
 		kvfree(umem_odp->page_list);
+		put_pid(umem_odp->tgid);
 	}
-
-	down_write(&per_mm->umem_rwsem);
-	if (!umem_odp->is_implicit_odp) {
-		interval_tree_remove(&umem_odp->interval_tree,
-				     &per_mm->umem_tree);
-		complete_all(&umem_odp->notifier_completion);
-	}
-	/*
-	 * NOTE! mmu_notifier_unregister() can happen between a start/end
-	 * callback, resulting in a missing end, and thus an unbalanced
-	 * lock. This doesn't really matter to us since we are about to kfree
-	 * the memory that holds the lock, however LOCKDEP doesn't like this.
-	 * Thus we call the mmu_notifier_put under the rwsem and test the
-	 * internal users count to reliably see if we are past this point.
-	 */
-	mmu_notifier_put(&per_mm->mn);
-	up_write(&per_mm->umem_rwsem);
-
-	mmdrop(umem_odp->umem.owning_mm);
 	kfree(umem_odp);
 }
 EXPORT_SYMBOL(ib_umem_odp_release);
@@ -501,7 +307,7 @@ EXPORT_SYMBOL(ib_umem_odp_release);
  */
 static int ib_umem_odp_map_dma_single_page(
 		struct ib_umem_odp *umem_odp,
-		int page_index,
+		unsigned int page_index,
 		struct page *page,
 		u64 access_mask,
 		unsigned long current_seq)
@@ -510,12 +316,7 @@ static int ib_umem_odp_map_dma_single_page(
 	dma_addr_t dma_addr;
 	int ret = 0;
 
-	/*
-	 * Note: we avoid writing if seq is different from the initial seq, to
-	 * handle case of a racing notifier. This check also allows us to bail
-	 * early if we have a notifier running in parallel with us.
-	 */
-	if (ib_umem_mmu_notifier_retry(umem_odp, current_seq)) {
+	if (mmu_range_check_retry(&umem_odp->notifier, current_seq)) {
 		ret = -EAGAIN;
 		goto out;
 	}
@@ -618,7 +419,7 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp, u64 user_virt,
 	 * existing beyond the lifetime of the originating process.. Presumably
 	 * mmget_not_zero will fail in this case.
 	 */
-	owning_process = get_pid_task(umem_odp->per_mm->tgid, PIDTYPE_PID);
+	owning_process = get_pid_task(umem_odp->tgid, PIDTYPE_PID);
 	if (!owning_process || !mmget_not_zero(owning_mm)) {
 		ret = -EINVAL;
 		goto out_put_task;
@@ -762,32 +563,3 @@ void ib_umem_odp_unmap_dma_pages(struct ib_umem_odp *umem_odp, u64 virt,
 	}
 }
 EXPORT_SYMBOL(ib_umem_odp_unmap_dma_pages);
-
-/* @last is not a part of the interval. See comment for function
- * node_last.
- */
-int rbt_ib_umem_for_each_in_range(struct rb_root_cached *root,
-				  u64 start, u64 last,
-				  umem_call_back cb,
-				  bool blockable,
-				  void *cookie)
-{
-	int ret_val = 0;
-	struct interval_tree_node *node, *next;
-	struct ib_umem_odp *umem;
-
-	if (unlikely(start == last))
-		return ret_val;
-
-	for (node = interval_tree_iter_first(root, start, last - 1);
-			node; node = next) {
-		/* TODO move the blockable decision up to the callback */
-		if (!blockable)
-			return -EAGAIN;
-		next = interval_tree_iter_next(node, start, last - 1);
-		umem = container_of(node, struct ib_umem_odp, interval_tree);
-		ret_val = cb(umem, start, last, cookie) || ret_val;
-	}
-
-	return ret_val;
-}
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index f61d4005c6c379..c719f08b351670 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -1263,8 +1263,6 @@ int mlx5_ib_odp_init_one(struct mlx5_ib_dev *ibdev);
 void mlx5_ib_odp_cleanup_one(struct mlx5_ib_dev *ibdev);
 int __init mlx5_ib_odp_init(void);
 void mlx5_ib_odp_cleanup(void);
-void mlx5_ib_invalidate_range(struct ib_umem_odp *umem_odp, unsigned long start,
-			      unsigned long end);
 void mlx5_odp_init_mr_cache_entry(struct mlx5_cache_ent *ent);
 void mlx5_odp_populate_klm(struct mlx5_klm *pklm, size_t offset,
 			   size_t nentries, struct mlx5_ib_mr *mr, int flags);
@@ -1294,11 +1292,10 @@ mlx5_ib_advise_mr_prefetch(struct ib_pd *pd,
 {
 	return -EOPNOTSUPP;
 }
-static inline void mlx5_ib_invalidate_range(struct ib_umem_odp *umem_odp,
-					    unsigned long start,
-					    unsigned long end){};
 #endif /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 
+extern const struct mmu_range_notifier_ops mlx5_mn_ops;
+
 /* Needed for rep profile */
 void __mlx5_ib_remove(struct mlx5_ib_dev *dev,
 		      const struct mlx5_ib_profile *profile,
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 199f7959aaa510..fbe31830b22807 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -743,7 +743,8 @@ static int mr_umem_get(struct mlx5_ib_dev *dev, struct ib_udata *udata,
 	if (access_flags & IB_ACCESS_ON_DEMAND) {
 		struct ib_umem_odp *odp;
 
-		odp = ib_umem_odp_get(udata, start, length, access_flags);
+		odp = ib_umem_odp_get(udata, start, length, access_flags,
+				      &mlx5_mn_ops);
 		if (IS_ERR(odp)) {
 			mlx5_ib_dbg(dev, "umem get failed (%ld)\n",
 				    PTR_ERR(odp));
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index bcfc098466977e..f713eb82eeead4 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -241,17 +241,26 @@ static void destroy_unused_implicit_child_mr(struct mlx5_ib_mr *mr)
 	xa_unlock(&imr->implicit_children);
 }
 
-void mlx5_ib_invalidate_range(struct ib_umem_odp *umem_odp, unsigned long start,
-			      unsigned long end)
+static bool mlx5_ib_invalidate_range(struct mmu_range_notifier *mrn,
+				     const struct mmu_notifier_range *range,
+				     unsigned long cur_seq)
 {
+	struct ib_umem_odp *umem_odp =
+		container_of(mrn, struct ib_umem_odp, notifier);
 	struct mlx5_ib_mr *mr;
 	const u64 umr_block_mask = (MLX5_UMR_MTT_ALIGNMENT /
 				    sizeof(struct mlx5_mtt)) - 1;
 	u64 idx = 0, blk_start_idx = 0;
+	unsigned long start;
+	unsigned long end;
 	int in_block = 0;
 	u64 addr;
 
+	if (!mmu_notifier_range_blockable(range))
+		return false;
+
 	mutex_lock(&umem_odp->umem_mutex);
+	mmu_range_set_seq(mrn, cur_seq);
 	/*
 	 * If npages is zero then umem_odp->private may not be setup yet. This
 	 * does not complete until after the first page is mapped for DMA.
@@ -260,8 +269,8 @@ void mlx5_ib_invalidate_range(struct ib_umem_odp *umem_odp, unsigned long start,
 		goto out;
 	mr = umem_odp->private;
 
-	start = max_t(u64, ib_umem_start(umem_odp), start);
-	end = min_t(u64, ib_umem_end(umem_odp), end);
+	start = max_t(u64, ib_umem_start(umem_odp), range->start);
+	end = min_t(u64, ib_umem_end(umem_odp), range->end);
 
 	/*
 	 * Iteration one - zap the HW's MTTs. The notifiers_count ensures that
@@ -312,8 +321,13 @@ void mlx5_ib_invalidate_range(struct ib_umem_odp *umem_odp, unsigned long start,
 		destroy_unused_implicit_child_mr(mr);
 out:
 	mutex_unlock(&umem_odp->umem_mutex);
+	return true;
 }
 
+const struct mmu_range_notifier_ops mlx5_mn_ops = {
+	.invalidate = mlx5_ib_invalidate_range,
+};
+
 void mlx5_ib_internal_fill_odp_caps(struct mlx5_ib_dev *dev)
 {
 	struct ib_odp_caps *caps = &dev->odp_caps;
@@ -414,7 +428,7 @@ static struct mlx5_ib_mr *implicit_get_child_mr(struct mlx5_ib_mr *imr,
 
 	odp = ib_umem_odp_alloc_child(to_ib_umem_odp(imr->umem),
 				      idx * MLX5_IMR_MTT_SIZE,
-				      MLX5_IMR_MTT_SIZE);
+				      MLX5_IMR_MTT_SIZE, &mlx5_mn_ops);
 	if (IS_ERR(odp))
 		return ERR_CAST(odp);
 
@@ -600,8 +614,9 @@ static int pagefault_real_mr(struct mlx5_ib_mr *mr, struct ib_umem_odp *odp,
 			     u64 user_va, size_t bcnt, u32 *bytes_mapped,
 			     u32 flags)
 {
-	int current_seq, page_shift, ret, np;
+	int page_shift, ret, np;
 	bool downgrade = flags & MLX5_PF_FLAGS_DOWNGRADE;
+	unsigned long current_seq;
 	u64 access_mask;
 	u64 start_idx, page_mask;
 
@@ -613,12 +628,7 @@ static int pagefault_real_mr(struct mlx5_ib_mr *mr, struct ib_umem_odp *odp,
 	if (odp->umem.writable && !downgrade)
 		access_mask |= ODP_WRITE_ALLOWED_BIT;
 
-	current_seq = READ_ONCE(odp->notifiers_seq);
-	/*
-	 * Ensure the sequence number is valid for some time before we call
-	 * gup.
-	 */
-	smp_rmb();
+	current_seq = mmu_range_read_begin(&odp->notifier);
 
 	np = ib_umem_odp_map_dma_pages(odp, user_va, bcnt, access_mask,
 				       current_seq);
@@ -626,7 +636,7 @@ static int pagefault_real_mr(struct mlx5_ib_mr *mr, struct ib_umem_odp *odp,
 		return np;
 
 	mutex_lock(&odp->umem_mutex);
-	if (!ib_umem_mmu_notifier_retry(odp, current_seq)) {
+	if (!mmu_range_read_retry(&odp->notifier, current_seq)) {
 		/*
 		 * No need to check whether the MTTs really belong to
 		 * this MR, since ib_umem_odp_map_dma_pages already
@@ -656,19 +666,6 @@ static int pagefault_real_mr(struct mlx5_ib_mr *mr, struct ib_umem_odp *odp,
 	return np << (page_shift - PAGE_SHIFT);
 
 out:
-	if (ret == -EAGAIN) {
-		unsigned long timeout = msecs_to_jiffies(MMU_NOTIFIER_TIMEOUT);
-
-		if (!wait_for_completion_timeout(&odp->notifier_completion,
-						 timeout)) {
-			mlx5_ib_warn(
-				mr->dev,
-				"timeout waiting for mmu notifier. seq %d against %d. notifiers_count=%d\n",
-				current_seq, odp->notifiers_seq,
-				odp->notifiers_count);
-		}
-	}
-
 	return ret;
 }
 
@@ -1609,7 +1606,6 @@ void mlx5_odp_init_mr_cache_entry(struct mlx5_cache_ent *ent)
 
 static const struct ib_device_ops mlx5_ib_dev_odp_ops = {
 	.advise_mr = mlx5_ib_advise_mr,
-	.invalidate_range = mlx5_ib_invalidate_range,
 };
 
 int mlx5_ib_odp_init_one(struct mlx5_ib_dev *dev)
diff --git a/include/rdma/ib_umem_odp.h b/include/rdma/ib_umem_odp.h
index 09b0e4494986a9..98ed5435afccd9 100644
--- a/include/rdma/ib_umem_odp.h
+++ b/include/rdma/ib_umem_odp.h
@@ -35,11 +35,11 @@
 
 #include <rdma/ib_umem.h>
 #include <rdma/ib_verbs.h>
-#include <linux/interval_tree.h>
 
 struct ib_umem_odp {
 	struct ib_umem umem;
-	struct ib_ucontext_per_mm *per_mm;
+	struct mmu_range_notifier notifier;
+	struct pid *tgid;
 
 	/*
 	 * An array of the pages included in the on-demand paging umem.
@@ -62,13 +62,8 @@ struct ib_umem_odp {
 	struct mutex		umem_mutex;
 	void			*private; /* for the HW driver to use. */
 
-	int notifiers_seq;
-	int notifiers_count;
 	int npages;
 
-	/* Tree tracking */
-	struct interval_tree_node interval_tree;
-
 	/*
 	 * An implicit odp umem cannot be DMA mapped, has 0 length, and serves
 	 * only as an anchor for the driver to hold onto the per_mm. FIXME:
@@ -77,7 +72,6 @@ struct ib_umem_odp {
 	 */
 	bool is_implicit_odp;
 
-	struct completion	notifier_completion;
 	unsigned int		page_shift;
 };
 
@@ -89,13 +83,13 @@ static inline struct ib_umem_odp *to_ib_umem_odp(struct ib_umem *umem)
 /* Returns the first page of an ODP umem. */
 static inline unsigned long ib_umem_start(struct ib_umem_odp *umem_odp)
 {
-	return umem_odp->interval_tree.start;
+	return umem_odp->notifier.interval_tree.start;
 }
 
 /* Returns the address of the page after the last one of an ODP umem. */
 static inline unsigned long ib_umem_end(struct ib_umem_odp *umem_odp)
 {
-	return umem_odp->interval_tree.last + 1;
+	return umem_odp->notifier.interval_tree.last + 1;
 }
 
 static inline size_t ib_umem_odp_num_pages(struct ib_umem_odp *umem_odp)
@@ -119,21 +113,14 @@ static inline size_t ib_umem_odp_num_pages(struct ib_umem_odp *umem_odp)
 
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
 
-struct ib_ucontext_per_mm {
-	struct mmu_notifier mn;
-	struct pid *tgid;
-
-	struct rb_root_cached umem_tree;
-	/* Protects umem_tree */
-	struct rw_semaphore umem_rwsem;
-};
-
 struct ib_umem_odp *ib_umem_odp_get(struct ib_udata *udata, unsigned long addr,
-				    size_t size, int access);
+				    size_t size, int access,
+				    const struct mmu_range_notifier_ops *ops);
 struct ib_umem_odp *ib_umem_odp_alloc_implicit(struct ib_udata *udata,
 					       int access);
-struct ib_umem_odp *ib_umem_odp_alloc_child(struct ib_umem_odp *root_umem,
-					    unsigned long addr, size_t size);
+struct ib_umem_odp *
+ib_umem_odp_alloc_child(struct ib_umem_odp *root_umem, unsigned long addr,
+			size_t size, const struct mmu_range_notifier_ops *ops);
 void ib_umem_odp_release(struct ib_umem_odp *umem_odp);
 
 int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp, u64 start_offset,
@@ -143,39 +130,11 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp, u64 start_offset,
 void ib_umem_odp_unmap_dma_pages(struct ib_umem_odp *umem_odp, u64 start_offset,
 				 u64 bound);
 
-typedef int (*umem_call_back)(struct ib_umem_odp *item, u64 start, u64 end,
-			      void *cookie);
-/*
- * Call the callback on each ib_umem in the range. Returns the logical or of
- * the return values of the functions called.
- */
-int rbt_ib_umem_for_each_in_range(struct rb_root_cached *root,
-				  u64 start, u64 end,
-				  umem_call_back cb,
-				  bool blockable, void *cookie);
-
-static inline int ib_umem_mmu_notifier_retry(struct ib_umem_odp *umem_odp,
-					     unsigned long mmu_seq)
-{
-	/*
-	 * This code is strongly based on the KVM code from
-	 * mmu_notifier_retry. Should be called with
-	 * the relevant locks taken (umem_odp->umem_mutex
-	 * and the ucontext umem_mutex semaphore locked for read).
-	 */
-
-	if (unlikely(umem_odp->notifiers_count))
-		return 1;
-	if (umem_odp->notifiers_seq != mmu_seq)
-		return 1;
-	return 0;
-}
-
 #else /* CONFIG_INFINIBAND_ON_DEMAND_PAGING */
 
-static inline struct ib_umem_odp *ib_umem_odp_get(struct ib_udata *udata,
-						  unsigned long addr,
-						  size_t size, int access)
+static inline struct ib_umem_odp *
+ib_umem_odp_get(struct ib_udata *udata, unsigned long addr, size_t size,
+		int access, const struct mmu_range_notifier_ops *ops)
 {
 	return ERR_PTR(-EINVAL);
 }
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 6a47ba85c54c11..2c30c859ae0d13 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -2422,8 +2422,6 @@ struct ib_device_ops {
 			    u64 iova);
 	int (*unmap_fmr)(struct list_head *fmr_list);
 	int (*dealloc_fmr)(struct ib_fmr *fmr);
-	void (*invalidate_range)(struct ib_umem_odp *umem_odp,
-				 unsigned long start, unsigned long end);
 	int (*attach_mcast)(struct ib_qp *qp, union ib_gid *gid, u16 lid);
 	int (*detach_mcast)(struct ib_qp *qp, union ib_gid *gid, u16 lid);
 	struct ib_xrcd *(*alloc_xrcd)(struct ib_device *device,
-- 
2.23.0


^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH v2 06/15] RDMA/hfi1: Use mmu_range_notifier_inset for user_exp_rcv
  2019-10-28 20:10 [PATCH v2 00/15] Consolidate the mmu notifier interval_tree and locking Jason Gunthorpe
                   ` (4 preceding siblings ...)
  2019-10-28 20:10 ` [PATCH v2 05/15] RDMA/odp: Use mmu_range_notifier_insert() Jason Gunthorpe
@ 2019-10-28 20:10 ` Jason Gunthorpe
  2019-10-29 12:19   ` Dennis Dalessandro
  2019-10-28 20:10 ` [PATCH v2 07/15] drm/radeon: use mmu_range_notifier_insert Jason Gunthorpe
                   ` (10 subsequent siblings)
  16 siblings, 1 reply; 71+ messages in thread
From: Jason Gunthorpe @ 2019-10-28 20:10 UTC (permalink / raw)
  To: linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard, Felix.Kuehling
  Cc: linux-rdma, dri-devel, amd-gfx, Alex Deucher, Ben Skeggs,
	Boris Ostrovsky, Christian König, David Zhou,
	Dennis Dalessandro, Juergen Gross, Mike Marciniszyn,
	Oleksandr Andrushchenko, Petr Cvek, Stefano Stabellini, nouveau,
	xen-devel, Christoph Hellwig, Jason Gunthorpe

From: Jason Gunthorpe <jgg@mellanox.com>

This converts one of the two users of mmu_notifiers to use the new API.
The conversion is fairly straightforward, however the existing use of
notifiers here seems to be racey.

Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 drivers/infiniband/hw/hfi1/file_ops.c     |   2 +-
 drivers/infiniband/hw/hfi1/hfi.h          |   2 +-
 drivers/infiniband/hw/hfi1/user_exp_rcv.c | 146 +++++++++-------------
 drivers/infiniband/hw/hfi1/user_exp_rcv.h |   3 +-
 4 files changed, 60 insertions(+), 93 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/file_ops.c b/drivers/infiniband/hw/hfi1/file_ops.c
index f9a7e9d29c8ba2..7c5e3fb224139a 100644
--- a/drivers/infiniband/hw/hfi1/file_ops.c
+++ b/drivers/infiniband/hw/hfi1/file_ops.c
@@ -1138,7 +1138,7 @@ static int get_ctxt_info(struct hfi1_filedata *fd, unsigned long arg, u32 len)
 			HFI1_CAP_UGET_MASK(uctxt->flags, MASK) |
 			HFI1_CAP_KGET_MASK(uctxt->flags, K2U);
 	/* adjust flag if this fd is not able to cache */
-	if (!fd->handler)
+	if (!fd->use_mn)
 		cinfo.runtime_flags |= HFI1_CAP_TID_UNMAP; /* no caching */
 
 	cinfo.num_active = hfi1_count_active_units();
diff --git a/drivers/infiniband/hw/hfi1/hfi.h b/drivers/infiniband/hw/hfi1/hfi.h
index fa45350a9a1d32..fc10d65fc3e13c 100644
--- a/drivers/infiniband/hw/hfi1/hfi.h
+++ b/drivers/infiniband/hw/hfi1/hfi.h
@@ -1444,7 +1444,7 @@ struct hfi1_filedata {
 	/* for cpu affinity; -1 if none */
 	int rec_cpu_num;
 	u32 tid_n_pinned;
-	struct mmu_rb_handler *handler;
+	bool use_mn;
 	struct tid_rb_node **entry_to_rb;
 	spinlock_t tid_lock; /* protect tid_[limit,used] counters */
 	u32 tid_limit;
diff --git a/drivers/infiniband/hw/hfi1/user_exp_rcv.c b/drivers/infiniband/hw/hfi1/user_exp_rcv.c
index 3592a9ec155e85..a1ab3bd334f89e 100644
--- a/drivers/infiniband/hw/hfi1/user_exp_rcv.c
+++ b/drivers/infiniband/hw/hfi1/user_exp_rcv.c
@@ -59,11 +59,11 @@ static int set_rcvarray_entry(struct hfi1_filedata *fd,
 			      struct tid_user_buf *tbuf,
 			      u32 rcventry, struct tid_group *grp,
 			      u16 pageidx, unsigned int npages);
-static int tid_rb_insert(void *arg, struct mmu_rb_node *node);
 static void cacheless_tid_rb_remove(struct hfi1_filedata *fdata,
 				    struct tid_rb_node *tnode);
-static void tid_rb_remove(void *arg, struct mmu_rb_node *node);
-static int tid_rb_invalidate(void *arg, struct mmu_rb_node *mnode);
+static bool tid_rb_invalidate(struct mmu_range_notifier *mrn,
+			      const struct mmu_notifier_range *range,
+			      unsigned long cur_seq);
 static int program_rcvarray(struct hfi1_filedata *fd, struct tid_user_buf *,
 			    struct tid_group *grp,
 			    unsigned int start, u16 count,
@@ -73,10 +73,8 @@ static int unprogram_rcvarray(struct hfi1_filedata *fd, u32 tidinfo,
 			      struct tid_group **grp);
 static void clear_tid_node(struct hfi1_filedata *fd, struct tid_rb_node *node);
 
-static struct mmu_rb_ops tid_rb_ops = {
-	.insert = tid_rb_insert,
-	.remove = tid_rb_remove,
-	.invalidate = tid_rb_invalidate
+static const struct mmu_range_notifier_ops tid_mn_ops = {
+	.invalidate = tid_rb_invalidate,
 };
 
 /*
@@ -87,7 +85,6 @@ static struct mmu_rb_ops tid_rb_ops = {
 int hfi1_user_exp_rcv_init(struct hfi1_filedata *fd,
 			   struct hfi1_ctxtdata *uctxt)
 {
-	struct hfi1_devdata *dd = uctxt->dd;
 	int ret = 0;
 
 	spin_lock_init(&fd->tid_lock);
@@ -109,20 +106,7 @@ int hfi1_user_exp_rcv_init(struct hfi1_filedata *fd,
 			fd->entry_to_rb = NULL;
 			return -ENOMEM;
 		}
-
-		/*
-		 * Register MMU notifier callbacks. If the registration
-		 * fails, continue without TID caching for this context.
-		 */
-		ret = hfi1_mmu_rb_register(fd, fd->mm, &tid_rb_ops,
-					   dd->pport->hfi1_wq,
-					   &fd->handler);
-		if (ret) {
-			dd_dev_info(dd,
-				    "Failed MMU notifier registration %d\n",
-				    ret);
-			ret = 0;
-		}
+		fd->use_mn = true;
 	}
 
 	/*
@@ -139,7 +123,7 @@ int hfi1_user_exp_rcv_init(struct hfi1_filedata *fd,
 	 * init.
 	 */
 	spin_lock(&fd->tid_lock);
-	if (uctxt->subctxt_cnt && fd->handler) {
+	if (uctxt->subctxt_cnt && fd->use_mn) {
 		u16 remainder;
 
 		fd->tid_limit = uctxt->expected_count / uctxt->subctxt_cnt;
@@ -158,18 +142,10 @@ void hfi1_user_exp_rcv_free(struct hfi1_filedata *fd)
 {
 	struct hfi1_ctxtdata *uctxt = fd->uctxt;
 
-	/*
-	 * The notifier would have been removed when the process'es mm
-	 * was freed.
-	 */
-	if (fd->handler) {
-		hfi1_mmu_rb_unregister(fd->handler);
-	} else {
-		if (!EXP_TID_SET_EMPTY(uctxt->tid_full_list))
-			unlock_exp_tids(uctxt, &uctxt->tid_full_list, fd);
-		if (!EXP_TID_SET_EMPTY(uctxt->tid_used_list))
-			unlock_exp_tids(uctxt, &uctxt->tid_used_list, fd);
-	}
+	if (!EXP_TID_SET_EMPTY(uctxt->tid_full_list))
+		unlock_exp_tids(uctxt, &uctxt->tid_full_list, fd);
+	if (!EXP_TID_SET_EMPTY(uctxt->tid_used_list))
+		unlock_exp_tids(uctxt, &uctxt->tid_used_list, fd);
 
 	kfree(fd->invalid_tids);
 	fd->invalid_tids = NULL;
@@ -201,7 +177,7 @@ static void unpin_rcv_pages(struct hfi1_filedata *fd,
 
 	if (mapped) {
 		pci_unmap_single(dd->pcidev, node->dma_addr,
-				 node->mmu.len, PCI_DMA_FROMDEVICE);
+				 node->npages * PAGE_SIZE, PCI_DMA_FROMDEVICE);
 		pages = &node->pages[idx];
 	} else {
 		pages = &tidbuf->pages[idx];
@@ -777,8 +753,8 @@ static int set_rcvarray_entry(struct hfi1_filedata *fd,
 		return -EFAULT;
 	}
 
-	node->mmu.addr = tbuf->vaddr + (pageidx * PAGE_SIZE);
-	node->mmu.len = npages * PAGE_SIZE;
+	node->notifier.ops = &tid_mn_ops;
+	node->fdata = fd;
 	node->phys = page_to_phys(pages[0]);
 	node->npages = npages;
 	node->rcventry = rcventry;
@@ -787,23 +763,34 @@ static int set_rcvarray_entry(struct hfi1_filedata *fd,
 	node->freed = false;
 	memcpy(node->pages, pages, sizeof(struct page *) * npages);
 
-	if (!fd->handler)
-		ret = tid_rb_insert(fd, &node->mmu);
-	else
-		ret = hfi1_mmu_rb_insert(fd->handler, &node->mmu);
-
-	if (ret) {
-		hfi1_cdbg(TID, "Failed to insert RB node %u 0x%lx, 0x%lx %d",
-			  node->rcventry, node->mmu.addr, node->phys, ret);
-		pci_unmap_single(dd->pcidev, phys, npages * PAGE_SIZE,
-				 PCI_DMA_FROMDEVICE);
-		kfree(node);
-		return -EFAULT;
+	if (fd->use_mn) {
+		ret = mmu_range_notifier_insert(
+			&node->notifier, tbuf->vaddr + (pageidx * PAGE_SIZE),
+			npages * PAGE_SIZE, fd->mm);
+		if (ret)
+			goto out_unmap;
+		/*
+		 * FIXME: This is in the wrong order, the notifier should be
+		 * established before the pages are pinned by pin_rcv_pages.
+		 */
+		mmu_range_read_begin(&node->notifier);
 	}
+	fd->entry_to_rb[node->rcventry - uctxt->expected_base] = node;
+
 	hfi1_put_tid(dd, rcventry, PT_EXPECTED, phys, ilog2(npages) + 1);
 	trace_hfi1_exp_tid_reg(uctxt->ctxt, fd->subctxt, rcventry, npages,
-			       node->mmu.addr, node->phys, phys);
+			       node->notifier.interval_tree.start, node->phys,
+			       phys);
 	return 0;
+
+out_unmap:
+	hfi1_cdbg(TID, "Failed to insert RB node %u 0x%lx, 0x%lx %d",
+		  node->rcventry, node->notifier.interval_tree.start,
+		  node->phys, ret);
+	pci_unmap_single(dd->pcidev, phys, npages * PAGE_SIZE,
+			 PCI_DMA_FROMDEVICE);
+	kfree(node);
+	return -EFAULT;
 }
 
 static int unprogram_rcvarray(struct hfi1_filedata *fd, u32 tidinfo,
@@ -833,10 +820,9 @@ static int unprogram_rcvarray(struct hfi1_filedata *fd, u32 tidinfo,
 	if (grp)
 		*grp = node->grp;
 
-	if (!fd->handler)
-		cacheless_tid_rb_remove(fd, node);
-	else
-		hfi1_mmu_rb_remove(fd->handler, &node->mmu);
+	if (fd->use_mn)
+		mmu_range_notifier_remove(&node->notifier);
+	cacheless_tid_rb_remove(fd, node);
 
 	return 0;
 }
@@ -847,7 +833,8 @@ static void clear_tid_node(struct hfi1_filedata *fd, struct tid_rb_node *node)
 	struct hfi1_devdata *dd = uctxt->dd;
 
 	trace_hfi1_exp_tid_unreg(uctxt->ctxt, fd->subctxt, node->rcventry,
-				 node->npages, node->mmu.addr, node->phys,
+				 node->npages,
+				 node->notifier.interval_tree.start, node->phys,
 				 node->dma_addr);
 
 	/*
@@ -894,30 +881,29 @@ static void unlock_exp_tids(struct hfi1_ctxtdata *uctxt,
 				if (!node || node->rcventry != rcventry)
 					continue;
 
+				if (fd->use_mn)
+					mmu_range_notifier_remove(
+						&node->notifier);
 				cacheless_tid_rb_remove(fd, node);
 			}
 		}
 	}
 }
 
-/*
- * Always return 0 from this function.  A non-zero return indicates that the
- * remove operation will be called and that memory should be unpinned.
- * However, the driver cannot unpin out from under PSM.  Instead, retain the
- * memory (by returning 0) and inform PSM that the memory is going away.  PSM
- * will call back later when it has removed the memory from its list.
- */
-static int tid_rb_invalidate(void *arg, struct mmu_rb_node *mnode)
+static bool tid_rb_invalidate(struct mmu_range_notifier *mrn,
+			      const struct mmu_notifier_range *range,
+			      unsigned long cur_seq)
 {
-	struct hfi1_filedata *fdata = arg;
-	struct hfi1_ctxtdata *uctxt = fdata->uctxt;
 	struct tid_rb_node *node =
-		container_of(mnode, struct tid_rb_node, mmu);
+		container_of(mrn, struct tid_rb_node, notifier);
+	struct hfi1_filedata *fdata = node->fdata;
+	struct hfi1_ctxtdata *uctxt = fdata->uctxt;
 
 	if (node->freed)
-		return 0;
+		return true;
 
-	trace_hfi1_exp_tid_inval(uctxt->ctxt, fdata->subctxt, node->mmu.addr,
+	trace_hfi1_exp_tid_inval(uctxt->ctxt, fdata->subctxt,
+				 node->notifier.interval_tree.start,
 				 node->rcventry, node->npages, node->dma_addr);
 	node->freed = true;
 
@@ -946,18 +932,7 @@ static int tid_rb_invalidate(void *arg, struct mmu_rb_node *mnode)
 		fdata->invalid_tid_idx++;
 	}
 	spin_unlock(&fdata->invalid_lock);
-	return 0;
-}
-
-static int tid_rb_insert(void *arg, struct mmu_rb_node *node)
-{
-	struct hfi1_filedata *fdata = arg;
-	struct tid_rb_node *tnode =
-		container_of(node, struct tid_rb_node, mmu);
-	u32 base = fdata->uctxt->expected_base;
-
-	fdata->entry_to_rb[tnode->rcventry - base] = tnode;
-	return 0;
+	return true;
 }
 
 static void cacheless_tid_rb_remove(struct hfi1_filedata *fdata,
@@ -968,12 +943,3 @@ static void cacheless_tid_rb_remove(struct hfi1_filedata *fdata,
 	fdata->entry_to_rb[tnode->rcventry - base] = NULL;
 	clear_tid_node(fdata, tnode);
 }
-
-static void tid_rb_remove(void *arg, struct mmu_rb_node *node)
-{
-	struct hfi1_filedata *fdata = arg;
-	struct tid_rb_node *tnode =
-		container_of(node, struct tid_rb_node, mmu);
-
-	cacheless_tid_rb_remove(fdata, tnode);
-}
diff --git a/drivers/infiniband/hw/hfi1/user_exp_rcv.h b/drivers/infiniband/hw/hfi1/user_exp_rcv.h
index 43b105de1d5427..b5314db083b125 100644
--- a/drivers/infiniband/hw/hfi1/user_exp_rcv.h
+++ b/drivers/infiniband/hw/hfi1/user_exp_rcv.h
@@ -65,7 +65,8 @@ struct tid_user_buf {
 };
 
 struct tid_rb_node {
-	struct mmu_rb_node mmu;
+	struct mmu_range_notifier notifier;
+	struct hfi1_filedata *fdata;
 	unsigned long phys;
 	struct tid_group *grp;
 	u32 rcventry;
-- 
2.23.0


^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH v2 07/15] drm/radeon: use mmu_range_notifier_insert
  2019-10-28 20:10 [PATCH v2 00/15] Consolidate the mmu notifier interval_tree and locking Jason Gunthorpe
                   ` (5 preceding siblings ...)
  2019-10-28 20:10 ` [PATCH v2 06/15] RDMA/hfi1: Use mmu_range_notifier_inset for user_exp_rcv Jason Gunthorpe
@ 2019-10-28 20:10 ` Jason Gunthorpe
  2019-10-29  7:48   ` Koenig, Christian
  2019-10-28 20:10 ` [PATCH v2 08/15] xen/gntdev: Use select for DMA_SHARED_BUFFER Jason Gunthorpe
                   ` (9 subsequent siblings)
  16 siblings, 1 reply; 71+ messages in thread
From: Jason Gunthorpe @ 2019-10-28 20:10 UTC (permalink / raw)
  To: linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard, Felix.Kuehling
  Cc: linux-rdma, dri-devel, amd-gfx, Alex Deucher, Ben Skeggs,
	Boris Ostrovsky, Christian König, David Zhou,
	Dennis Dalessandro, Juergen Gross, Mike Marciniszyn,
	Oleksandr Andrushchenko, Petr Cvek, Stefano Stabellini, nouveau,
	xen-devel, Christoph Hellwig, Jason Gunthorpe

From: Jason Gunthorpe <jgg@mellanox.com>

The new API is an exact match for the needs of radeon.

For some reason radeon tries to remove overlapping ranges from the
interval tree, but interval trees (and mmu_range_notifier_insert)
support overlapping ranges directly. Simply delete all this code.

Since this driver is missing a invalidate_range_end callback, but
still calls get_user_pages(), it cannot be correct against all races.

Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: David (ChunMing) Zhou <David1.Zhou@amd.com>
Cc: amd-gfx@lists.freedesktop.org
Cc: Petr Cvek <petrcvekcz@gmail.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 drivers/gpu/drm/radeon/radeon.h    |   9 +-
 drivers/gpu/drm/radeon/radeon_mn.c | 219 ++++++-----------------------
 2 files changed, 52 insertions(+), 176 deletions(-)

diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h
index d59b004f669583..27959f3ace1152 100644
--- a/drivers/gpu/drm/radeon/radeon.h
+++ b/drivers/gpu/drm/radeon/radeon.h
@@ -68,6 +68,10 @@
 #include <linux/hashtable.h>
 #include <linux/dma-fence.h>
 
+#ifdef CONFIG_MMU_NOTIFIER
+#include <linux/mmu_notifier.h>
+#endif
+
 #include <drm/ttm/ttm_bo_api.h>
 #include <drm/ttm/ttm_bo_driver.h>
 #include <drm/ttm/ttm_placement.h>
@@ -509,8 +513,9 @@ struct radeon_bo {
 	struct ttm_bo_kmap_obj		dma_buf_vmap;
 	pid_t				pid;
 
-	struct radeon_mn		*mn;
-	struct list_head		mn_list;
+#ifdef CONFIG_MMU_NOTIFIER
+	struct mmu_range_notifier	notifier;
+#endif
 };
 #define gem_to_radeon_bo(gobj) container_of((gobj), struct radeon_bo, tbo.base)
 
diff --git a/drivers/gpu/drm/radeon/radeon_mn.c b/drivers/gpu/drm/radeon/radeon_mn.c
index dbab9a3a969b9e..d3d41e20a64922 100644
--- a/drivers/gpu/drm/radeon/radeon_mn.c
+++ b/drivers/gpu/drm/radeon/radeon_mn.c
@@ -36,131 +36,51 @@
 
 #include "radeon.h"
 
-struct radeon_mn {
-	struct mmu_notifier	mn;
-
-	/* objects protected by lock */
-	struct mutex		lock;
-	struct rb_root_cached	objects;
-};
-
-struct radeon_mn_node {
-	struct interval_tree_node	it;
-	struct list_head		bos;
-};
-
 /**
- * radeon_mn_invalidate_range_start - callback to notify about mm change
+ * radeon_mn_invalidate - callback to notify about mm change
  *
  * @mn: our notifier
- * @mn: the mm this callback is about
- * @start: start of updated range
- * @end: end of updated range
+ * @range: the VMA under invalidation
  *
  * We block for all BOs between start and end to be idle and
  * unmap them by move them into system domain again.
  */
-static int radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
-				const struct mmu_notifier_range *range)
+static bool radeon_mn_invalidate(struct mmu_range_notifier *mn,
+				 const struct mmu_notifier_range *range,
+				 unsigned long cur_seq)
 {
-	struct radeon_mn *rmn = container_of(mn, struct radeon_mn, mn);
+	struct radeon_bo *bo = container_of(mn, struct radeon_bo, notifier);
 	struct ttm_operation_ctx ctx = { false, false };
-	struct interval_tree_node *it;
-	unsigned long end;
-	int ret = 0;
-
-	/* notification is exclusive, but interval is inclusive */
-	end = range->end - 1;
-
-	/* TODO we should be able to split locking for interval tree and
-	 * the tear down.
-	 */
-	if (mmu_notifier_range_blockable(range))
-		mutex_lock(&rmn->lock);
-	else if (!mutex_trylock(&rmn->lock))
-		return -EAGAIN;
-
-	it = interval_tree_iter_first(&rmn->objects, range->start, end);
-	while (it) {
-		struct radeon_mn_node *node;
-		struct radeon_bo *bo;
-		long r;
-
-		if (!mmu_notifier_range_blockable(range)) {
-			ret = -EAGAIN;
-			goto out_unlock;
-		}
-
-		node = container_of(it, struct radeon_mn_node, it);
-		it = interval_tree_iter_next(it, range->start, end);
+	long r;
 
-		list_for_each_entry(bo, &node->bos, mn_list) {
+	if (!bo->tbo.ttm || bo->tbo.ttm->state != tt_bound)
+		return true;
 
-			if (!bo->tbo.ttm || bo->tbo.ttm->state != tt_bound)
-				continue;
+	if (!mmu_notifier_range_blockable(range))
+		return false;
 
-			r = radeon_bo_reserve(bo, true);
-			if (r) {
-				DRM_ERROR("(%ld) failed to reserve user bo\n", r);
-				continue;
-			}
-
-			r = dma_resv_wait_timeout_rcu(bo->tbo.base.resv,
-				true, false, MAX_SCHEDULE_TIMEOUT);
-			if (r <= 0)
-				DRM_ERROR("(%ld) failed to wait for user bo\n", r);
-
-			radeon_ttm_placement_from_domain(bo, RADEON_GEM_DOMAIN_CPU);
-			r = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);
-			if (r)
-				DRM_ERROR("(%ld) failed to validate user bo\n", r);
-
-			radeon_bo_unreserve(bo);
-		}
+	r = radeon_bo_reserve(bo, true);
+	if (r) {
+		DRM_ERROR("(%ld) failed to reserve user bo\n", r);
+		return true;
 	}
-	
-out_unlock:
-	mutex_unlock(&rmn->lock);
-
-	return ret;
-}
-
-static void radeon_mn_release(struct mmu_notifier *mn, struct mm_struct *mm)
-{
-	struct mmu_notifier_range range = {
-		.mm = mm,
-		.start = 0,
-		.end = ULONG_MAX,
-		.flags = 0,
-		.event = MMU_NOTIFY_UNMAP,
-	};
-
-	radeon_mn_invalidate_range_start(mn, &range);
-}
-
-static struct mmu_notifier *radeon_mn_alloc_notifier(struct mm_struct *mm)
-{
-	struct radeon_mn *rmn;
 
-	rmn = kzalloc(sizeof(*rmn), GFP_KERNEL);
-	if (!rmn)
-		return ERR_PTR(-ENOMEM);
+	r = dma_resv_wait_timeout_rcu(bo->tbo.base.resv, true, false,
+				      MAX_SCHEDULE_TIMEOUT);
+	if (r <= 0)
+		DRM_ERROR("(%ld) failed to wait for user bo\n", r);
 
-	mutex_init(&rmn->lock);
-	rmn->objects = RB_ROOT_CACHED;
-	return &rmn->mn;
-}
+	radeon_ttm_placement_from_domain(bo, RADEON_GEM_DOMAIN_CPU);
+	r = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);
+	if (r)
+		DRM_ERROR("(%ld) failed to validate user bo\n", r);
 
-static void radeon_mn_free_notifier(struct mmu_notifier *mn)
-{
-	kfree(container_of(mn, struct radeon_mn, mn));
+	radeon_bo_unreserve(bo);
+	return true;
 }
 
-static const struct mmu_notifier_ops radeon_mn_ops = {
-	.release = radeon_mn_release,
-	.invalidate_range_start = radeon_mn_invalidate_range_start,
-	.alloc_notifier = radeon_mn_alloc_notifier,
-	.free_notifier = radeon_mn_free_notifier,
+static const struct mmu_range_notifier_ops radeon_mn_ops = {
+	.invalidate = radeon_mn_invalidate,
 };
 
 /**
@@ -174,51 +94,21 @@ static const struct mmu_notifier_ops radeon_mn_ops = {
  */
 int radeon_mn_register(struct radeon_bo *bo, unsigned long addr)
 {
-	unsigned long end = addr + radeon_bo_size(bo) - 1;
-	struct mmu_notifier *mn;
-	struct radeon_mn *rmn;
-	struct radeon_mn_node *node = NULL;
-	struct list_head bos;
-	struct interval_tree_node *it;
-
-	mn = mmu_notifier_get(&radeon_mn_ops, current->mm);
-	if (IS_ERR(mn))
-		return PTR_ERR(mn);
-	rmn = container_of(mn, struct radeon_mn, mn);
-
-	INIT_LIST_HEAD(&bos);
-
-	mutex_lock(&rmn->lock);
-
-	while ((it = interval_tree_iter_first(&rmn->objects, addr, end))) {
-		kfree(node);
-		node = container_of(it, struct radeon_mn_node, it);
-		interval_tree_remove(&node->it, &rmn->objects);
-		addr = min(it->start, addr);
-		end = max(it->last, end);
-		list_splice(&node->bos, &bos);
-	}
-
-	if (!node) {
-		node = kmalloc(sizeof(struct radeon_mn_node), GFP_KERNEL);
-		if (!node) {
-			mutex_unlock(&rmn->lock);
-			return -ENOMEM;
-		}
-	}
-
-	bo->mn = rmn;
-
-	node->it.start = addr;
-	node->it.last = end;
-	INIT_LIST_HEAD(&node->bos);
-	list_splice(&bos, &node->bos);
-	list_add(&bo->mn_list, &node->bos);
-
-	interval_tree_insert(&node->it, &rmn->objects);
-
-	mutex_unlock(&rmn->lock);
-
+	int ret;
+
+	bo->notifier.ops = &radeon_mn_ops;
+	ret = mmu_range_notifier_insert(&bo->notifier, addr, radeon_bo_size(bo),
+					current->mm);
+	if (ret)
+		return ret;
+
+	/*
+	 * FIXME: radeon appears to allow get_user_pages to run during
+	 * invalidate_range_start/end, which is not a safe way to read the
+	 * PTEs. It should use the mmu_range_read_begin() scheme around the
+	 * get_user_pages to ensure that the PTEs are read properly
+	 */
+	mmu_range_read_begin(&bo->notifier);
 	return 0;
 }
 
@@ -231,27 +121,8 @@ int radeon_mn_register(struct radeon_bo *bo, unsigned long addr)
  */
 void radeon_mn_unregister(struct radeon_bo *bo)
 {
-	struct radeon_mn *rmn = bo->mn;
-	struct list_head *head;
-
-	if (!rmn)
+	if (!bo->notifier.mm)
 		return;
-
-	mutex_lock(&rmn->lock);
-	/* save the next list entry for later */
-	head = bo->mn_list.next;
-
-	list_del(&bo->mn_list);
-
-	if (list_empty(head)) {
-		struct radeon_mn_node *node;
-		node = container_of(head, struct radeon_mn_node, bos);
-		interval_tree_remove(&node->it, &rmn->objects);
-		kfree(node);
-	}
-
-	mutex_unlock(&rmn->lock);
-
-	mmu_notifier_put(&rmn->mn);
-	bo->mn = NULL;
+	mmu_range_notifier_remove(&bo->notifier);
+	bo->notifier.mm = NULL;
 }
-- 
2.23.0


^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH v2 08/15] xen/gntdev: Use select for DMA_SHARED_BUFFER
  2019-10-28 20:10 [PATCH v2 00/15] Consolidate the mmu notifier interval_tree and locking Jason Gunthorpe
                   ` (6 preceding siblings ...)
  2019-10-28 20:10 ` [PATCH v2 07/15] drm/radeon: use mmu_range_notifier_insert Jason Gunthorpe
@ 2019-10-28 20:10 ` Jason Gunthorpe
  2019-11-01 18:26   ` Jason Gunthorpe
  2019-11-07  9:39   ` Jürgen Groß
  2019-10-28 20:10 ` [PATCH v2 09/15] xen/gntdev: use mmu_range_notifier_insert Jason Gunthorpe
                   ` (8 subsequent siblings)
  16 siblings, 2 replies; 71+ messages in thread
From: Jason Gunthorpe @ 2019-10-28 20:10 UTC (permalink / raw)
  To: linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard, Felix.Kuehling
  Cc: linux-rdma, dri-devel, amd-gfx, Alex Deucher, Ben Skeggs,
	Boris Ostrovsky, Christian König, David Zhou,
	Dennis Dalessandro, Juergen Gross, Mike Marciniszyn,
	Oleksandr Andrushchenko, Petr Cvek, Stefano Stabellini, nouveau,
	xen-devel, Christoph Hellwig, Jason Gunthorpe

From: Jason Gunthorpe <jgg@mellanox.com>

DMA_SHARED_BUFFER can not be enabled by the user (it represents a library
set in the kernel). The kconfig convention is to use select for such
symbols so they are turned on implicitly when the user enables a kconfig
that needs them.

Otherwise the XEN_GNTDEV_DMABUF kconfig is overly difficult to enable.

Fixes: 932d6562179e ("xen/gntdev: Add initial support for dma-buf UAPI")
Cc: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: xen-devel@lists.xenproject.org
Cc: Juergen Gross <jgross@suse.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Reviewed-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 drivers/xen/Kconfig | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/xen/Kconfig b/drivers/xen/Kconfig
index 79cc75096f4232..a50dadd0109336 100644
--- a/drivers/xen/Kconfig
+++ b/drivers/xen/Kconfig
@@ -141,7 +141,8 @@ config XEN_GNTDEV
 
 config XEN_GNTDEV_DMABUF
 	bool "Add support for dma-buf grant access device driver extension"
-	depends on XEN_GNTDEV && XEN_GRANT_DMA_ALLOC && DMA_SHARED_BUFFER
+	depends on XEN_GNTDEV && XEN_GRANT_DMA_ALLOC
+	select DMA_SHARED_BUFFER
 	help
 	  Allows userspace processes and kernel modules to use Xen backed
 	  dma-buf implementation. With this extension grant references to
-- 
2.23.0


^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH v2 09/15] xen/gntdev: use mmu_range_notifier_insert
  2019-10-28 20:10 [PATCH v2 00/15] Consolidate the mmu notifier interval_tree and locking Jason Gunthorpe
                   ` (7 preceding siblings ...)
  2019-10-28 20:10 ` [PATCH v2 08/15] xen/gntdev: Use select for DMA_SHARED_BUFFER Jason Gunthorpe
@ 2019-10-28 20:10 ` Jason Gunthorpe
  2019-10-30 16:55   ` Boris Ostrovsky
  2019-11-04 22:03   ` Boris Ostrovsky
  2019-10-28 20:10 ` [PATCH v2 10/15] nouveau: use mmu_notifier directly for invalidate_range_start Jason Gunthorpe
                   ` (7 subsequent siblings)
  16 siblings, 2 replies; 71+ messages in thread
From: Jason Gunthorpe @ 2019-10-28 20:10 UTC (permalink / raw)
  To: linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard, Felix.Kuehling
  Cc: linux-rdma, dri-devel, amd-gfx, Alex Deucher, Ben Skeggs,
	Boris Ostrovsky, Christian König, David Zhou,
	Dennis Dalessandro, Juergen Gross, Mike Marciniszyn,
	Oleksandr Andrushchenko, Petr Cvek, Stefano Stabellini, nouveau,
	xen-devel, Christoph Hellwig, Jason Gunthorpe

From: Jason Gunthorpe <jgg@mellanox.com>

gntdev simply wants to monitor a specific VMA for any notifier events,
this can be done straightforwardly using mmu_range_notifier_insert() over
the VMA's VA range.

The notifier should be attached until the original VMA is destroyed.

It is unclear if any of this is even sane, but at least a lot of duplicate
code is removed.

Cc: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: xen-devel@lists.xenproject.org
Cc: Juergen Gross <jgross@suse.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 drivers/xen/gntdev-common.h |   8 +-
 drivers/xen/gntdev.c        | 180 ++++++++++--------------------------
 2 files changed, 49 insertions(+), 139 deletions(-)

diff --git a/drivers/xen/gntdev-common.h b/drivers/xen/gntdev-common.h
index 2f8b949c3eeb14..b201fdd20b667b 100644
--- a/drivers/xen/gntdev-common.h
+++ b/drivers/xen/gntdev-common.h
@@ -21,15 +21,8 @@ struct gntdev_dmabuf_priv;
 struct gntdev_priv {
 	/* Maps with visible offsets in the file descriptor. */
 	struct list_head maps;
-	/*
-	 * Maps that are not visible; will be freed on munmap.
-	 * Only populated if populate_freeable_maps == 1
-	 */
-	struct list_head freeable_maps;
 	/* lock protects maps and freeable_maps. */
 	struct mutex lock;
-	struct mm_struct *mm;
-	struct mmu_notifier mn;
 
 #ifdef CONFIG_XEN_GRANT_DMA_ALLOC
 	/* Device for which DMA memory is allocated. */
@@ -49,6 +42,7 @@ struct gntdev_unmap_notify {
 };
 
 struct gntdev_grant_map {
+	struct mmu_range_notifier notifier;
 	struct list_head next;
 	struct vm_area_struct *vma;
 	int index;
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index a446a7221e13e9..12d626670bebbc 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -65,7 +65,6 @@ MODULE_PARM_DESC(limit, "Maximum number of grants that may be mapped by "
 static atomic_t pages_mapped = ATOMIC_INIT(0);
 
 static int use_ptemod;
-#define populate_freeable_maps use_ptemod
 
 static int unmap_grant_pages(struct gntdev_grant_map *map,
 			     int offset, int pages);
@@ -251,12 +250,6 @@ void gntdev_put_map(struct gntdev_priv *priv, struct gntdev_grant_map *map)
 		evtchn_put(map->notify.event);
 	}
 
-	if (populate_freeable_maps && priv) {
-		mutex_lock(&priv->lock);
-		list_del(&map->next);
-		mutex_unlock(&priv->lock);
-	}
-
 	if (map->pages && !use_ptemod)
 		unmap_grant_pages(map, 0, map->count);
 	gntdev_free_map(map);
@@ -445,17 +438,9 @@ static void gntdev_vma_close(struct vm_area_struct *vma)
 	struct gntdev_priv *priv = file->private_data;
 
 	pr_debug("gntdev_vma_close %p\n", vma);
-	if (use_ptemod) {
-		/* It is possible that an mmu notifier could be running
-		 * concurrently, so take priv->lock to ensure that the vma won't
-		 * vanishing during the unmap_grant_pages call, since we will
-		 * spin here until that completes. Such a concurrent call will
-		 * not do any unmapping, since that has been done prior to
-		 * closing the vma, but it may still iterate the unmap_ops list.
-		 */
-		mutex_lock(&priv->lock);
+	if (use_ptemod && map->vma == vma) {
+		mmu_range_notifier_remove(&map->notifier);
 		map->vma = NULL;
-		mutex_unlock(&priv->lock);
 	}
 	vma->vm_private_data = NULL;
 	gntdev_put_map(priv, map);
@@ -477,109 +462,44 @@ static const struct vm_operations_struct gntdev_vmops = {
 
 /* ------------------------------------------------------------------ */
 
-static bool in_range(struct gntdev_grant_map *map,
-			      unsigned long start, unsigned long end)
-{
-	if (!map->vma)
-		return false;
-	if (map->vma->vm_start >= end)
-		return false;
-	if (map->vma->vm_end <= start)
-		return false;
-
-	return true;
-}
-
-static int unmap_if_in_range(struct gntdev_grant_map *map,
-			      unsigned long start, unsigned long end,
-			      bool blockable)
+static bool gntdev_invalidate(struct mmu_range_notifier *mn,
+			      const struct mmu_notifier_range *range,
+			      unsigned long cur_seq)
 {
+	struct gntdev_grant_map *map =
+		container_of(mn, struct gntdev_grant_map, notifier);
 	unsigned long mstart, mend;
 	int err;
 
-	if (!in_range(map, start, end))
-		return 0;
+	if (!mmu_notifier_range_blockable(range))
+		return false;
 
-	if (!blockable)
-		return -EAGAIN;
+	/*
+	 * If the VMA is split or otherwise changed the notifier is not
+	 * updated, but we don't want to process VA's outside the modified
+	 * VMA. FIXME: It would be much more understandable to just prevent
+	 * modifying the VMA in the first place.
+	 */
+	if (map->vma->vm_start >= range->end ||
+	    map->vma->vm_end <= range->start)
+		return true;
 
-	mstart = max(start, map->vma->vm_start);
-	mend   = min(end,   map->vma->vm_end);
+	mstart = max(range->start, map->vma->vm_start);
+	mend = min(range->end, map->vma->vm_end);
 	pr_debug("map %d+%d (%lx %lx), range %lx %lx, mrange %lx %lx\n",
 			map->index, map->count,
 			map->vma->vm_start, map->vma->vm_end,
-			start, end, mstart, mend);
+			range->start, range->end, mstart, mend);
 	err = unmap_grant_pages(map,
 				(mstart - map->vma->vm_start) >> PAGE_SHIFT,
 				(mend - mstart) >> PAGE_SHIFT);
 	WARN_ON(err);
 
-	return 0;
-}
-
-static int mn_invl_range_start(struct mmu_notifier *mn,
-			       const struct mmu_notifier_range *range)
-{
-	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
-	struct gntdev_grant_map *map;
-	int ret = 0;
-
-	if (mmu_notifier_range_blockable(range))
-		mutex_lock(&priv->lock);
-	else if (!mutex_trylock(&priv->lock))
-		return -EAGAIN;
-
-	list_for_each_entry(map, &priv->maps, next) {
-		ret = unmap_if_in_range(map, range->start, range->end,
-					mmu_notifier_range_blockable(range));
-		if (ret)
-			goto out_unlock;
-	}
-	list_for_each_entry(map, &priv->freeable_maps, next) {
-		ret = unmap_if_in_range(map, range->start, range->end,
-					mmu_notifier_range_blockable(range));
-		if (ret)
-			goto out_unlock;
-	}
-
-out_unlock:
-	mutex_unlock(&priv->lock);
-
-	return ret;
-}
-
-static void mn_release(struct mmu_notifier *mn,
-		       struct mm_struct *mm)
-{
-	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
-	struct gntdev_grant_map *map;
-	int err;
-
-	mutex_lock(&priv->lock);
-	list_for_each_entry(map, &priv->maps, next) {
-		if (!map->vma)
-			continue;
-		pr_debug("map %d+%d (%lx %lx)\n",
-				map->index, map->count,
-				map->vma->vm_start, map->vma->vm_end);
-		err = unmap_grant_pages(map, /* offset */ 0, map->count);
-		WARN_ON(err);
-	}
-	list_for_each_entry(map, &priv->freeable_maps, next) {
-		if (!map->vma)
-			continue;
-		pr_debug("map %d+%d (%lx %lx)\n",
-				map->index, map->count,
-				map->vma->vm_start, map->vma->vm_end);
-		err = unmap_grant_pages(map, /* offset */ 0, map->count);
-		WARN_ON(err);
-	}
-	mutex_unlock(&priv->lock);
+	return true;
 }
 
-static const struct mmu_notifier_ops gntdev_mmu_ops = {
-	.release                = mn_release,
-	.invalidate_range_start = mn_invl_range_start,
+static const struct mmu_range_notifier_ops gntdev_mmu_ops = {
+	.invalidate = gntdev_invalidate,
 };
 
 /* ------------------------------------------------------------------ */
@@ -594,7 +514,6 @@ static int gntdev_open(struct inode *inode, struct file *flip)
 		return -ENOMEM;
 
 	INIT_LIST_HEAD(&priv->maps);
-	INIT_LIST_HEAD(&priv->freeable_maps);
 	mutex_init(&priv->lock);
 
 #ifdef CONFIG_XEN_GNTDEV_DMABUF
@@ -606,17 +525,6 @@ static int gntdev_open(struct inode *inode, struct file *flip)
 	}
 #endif
 
-	if (use_ptemod) {
-		priv->mm = get_task_mm(current);
-		if (!priv->mm) {
-			kfree(priv);
-			return -ENOMEM;
-		}
-		priv->mn.ops = &gntdev_mmu_ops;
-		ret = mmu_notifier_register(&priv->mn, priv->mm);
-		mmput(priv->mm);
-	}
-
 	if (ret) {
 		kfree(priv);
 		return ret;
@@ -653,16 +561,12 @@ static int gntdev_release(struct inode *inode, struct file *flip)
 		list_del(&map->next);
 		gntdev_put_map(NULL /* already removed */, map);
 	}
-	WARN_ON(!list_empty(&priv->freeable_maps));
 	mutex_unlock(&priv->lock);
 
 #ifdef CONFIG_XEN_GNTDEV_DMABUF
 	gntdev_dmabuf_fini(priv->dmabuf_priv);
 #endif
 
-	if (use_ptemod)
-		mmu_notifier_unregister(&priv->mn, priv->mm);
-
 	kfree(priv);
 	return 0;
 }
@@ -723,8 +627,6 @@ static long gntdev_ioctl_unmap_grant_ref(struct gntdev_priv *priv,
 	map = gntdev_find_map_index(priv, op.index >> PAGE_SHIFT, op.count);
 	if (map) {
 		list_del(&map->next);
-		if (populate_freeable_maps)
-			list_add_tail(&map->next, &priv->freeable_maps);
 		err = 0;
 	}
 	mutex_unlock(&priv->lock);
@@ -1096,11 +998,6 @@ static int gntdev_mmap(struct file *flip, struct vm_area_struct *vma)
 		goto unlock_out;
 	if (use_ptemod && map->vma)
 		goto unlock_out;
-	if (use_ptemod && priv->mm != vma->vm_mm) {
-		pr_warn("Huh? Other mm?\n");
-		goto unlock_out;
-	}
-
 	refcount_inc(&map->users);
 
 	vma->vm_ops = &gntdev_vmops;
@@ -1111,10 +1008,6 @@ static int gntdev_mmap(struct file *flip, struct vm_area_struct *vma)
 		vma->vm_flags |= VM_DONTCOPY;
 
 	vma->vm_private_data = map;
-
-	if (use_ptemod)
-		map->vma = vma;
-
 	if (map->flags) {
 		if ((vma->vm_flags & VM_WRITE) &&
 				(map->flags & GNTMAP_readonly))
@@ -1125,8 +1018,28 @@ static int gntdev_mmap(struct file *flip, struct vm_area_struct *vma)
 			map->flags |= GNTMAP_readonly;
 	}
 
+	if (use_ptemod) {
+		map->vma = vma;
+		err = mmu_range_notifier_insert_locked(
+			&map->notifier, vma->vm_start,
+			vma->vm_end - vma->vm_start, vma->vm_mm);
+		if (err)
+			goto out_unlock_put;
+	}
 	mutex_unlock(&priv->lock);
 
+	/*
+	 * gntdev takes the address of the PTE in find_grant_ptes() and passes
+	 * it to the hypervisor in gntdev_map_grant_pages(). The purpose of
+	 * the notifier is to prevent the hypervisor pointer to the PTE from
+	 * going stale.
+	 *
+	 * Since this vma's mappings can't be touched without the mmap_sem,
+	 * and we are holding it now, there is no need for the notifier_range
+	 * locking pattern.
+	 */
+	mmu_range_read_begin(&map->notifier);
+
 	if (use_ptemod) {
 		map->pages_vm_start = vma->vm_start;
 		err = apply_to_page_range(vma->vm_mm, vma->vm_start,
@@ -1175,8 +1088,11 @@ static int gntdev_mmap(struct file *flip, struct vm_area_struct *vma)
 	mutex_unlock(&priv->lock);
 out_put_map:
 	if (use_ptemod) {
-		map->vma = NULL;
 		unmap_grant_pages(map, 0, map->count);
+		if (map->vma) {
+			mmu_range_notifier_remove(&map->notifier);
+			map->vma = NULL;
+		}
 	}
 	gntdev_put_map(priv, map);
 	return err;
-- 
2.23.0


^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH v2 10/15] nouveau: use mmu_notifier directly for invalidate_range_start
  2019-10-28 20:10 [PATCH v2 00/15] Consolidate the mmu notifier interval_tree and locking Jason Gunthorpe
                   ` (8 preceding siblings ...)
  2019-10-28 20:10 ` [PATCH v2 09/15] xen/gntdev: use mmu_range_notifier_insert Jason Gunthorpe
@ 2019-10-28 20:10 ` Jason Gunthorpe
  2019-10-28 20:10 ` [PATCH v2 11/15] nouveau: use mmu_range_notifier instead of hmm_mirror Jason Gunthorpe
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 71+ messages in thread
From: Jason Gunthorpe @ 2019-10-28 20:10 UTC (permalink / raw)
  To: linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard, Felix.Kuehling
  Cc: linux-rdma, dri-devel, amd-gfx, Alex Deucher, Ben Skeggs,
	Boris Ostrovsky, Christian König, David Zhou,
	Dennis Dalessandro, Juergen Gross, Mike Marciniszyn,
	Oleksandr Andrushchenko, Petr Cvek, Stefano Stabellini, nouveau,
	xen-devel, Christoph Hellwig, Jason Gunthorpe

From: Jason Gunthorpe <jgg@mellanox.com>

There is no reason to get the invalidate_range_start() callback via an
indirection through hmm_mirror, just register a normal notifier directly.

Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: dri-devel@lists.freedesktop.org
Cc: nouveau@lists.freedesktop.org
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 drivers/gpu/drm/nouveau/nouveau_svm.c | 95 ++++++++++++++++++---------
 1 file changed, 63 insertions(+), 32 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
index 668d4bd0c118f1..577f8811925a59 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -88,6 +88,7 @@ nouveau_ivmm_find(struct nouveau_svm *svm, u64 inst)
 }
 
 struct nouveau_svmm {
+	struct mmu_notifier notifier;
 	struct nouveau_vmm *vmm;
 	struct {
 		unsigned long start;
@@ -96,7 +97,6 @@ struct nouveau_svmm {
 
 	struct mutex mutex;
 
-	struct mm_struct *mm;
 	struct hmm_mirror mirror;
 };
 
@@ -251,10 +251,11 @@ nouveau_svmm_invalidate(struct nouveau_svmm *svmm, u64 start, u64 limit)
 }
 
 static int
-nouveau_svmm_sync_cpu_device_pagetables(struct hmm_mirror *mirror,
-					const struct mmu_notifier_range *update)
+nouveau_svmm_invalidate_range_start(struct mmu_notifier *mn,
+				    const struct mmu_notifier_range *update)
 {
-	struct nouveau_svmm *svmm = container_of(mirror, typeof(*svmm), mirror);
+	struct nouveau_svmm *svmm =
+		container_of(mn, struct nouveau_svmm, notifier);
 	unsigned long start = update->start;
 	unsigned long limit = update->end;
 
@@ -264,6 +265,9 @@ nouveau_svmm_sync_cpu_device_pagetables(struct hmm_mirror *mirror,
 	SVMM_DBG(svmm, "invalidate %016lx-%016lx", start, limit);
 
 	mutex_lock(&svmm->mutex);
+	if (unlikely(!svmm->vmm))
+		goto out;
+
 	if (limit > svmm->unmanaged.start && start < svmm->unmanaged.limit) {
 		if (start < svmm->unmanaged.start) {
 			nouveau_svmm_invalidate(svmm, start,
@@ -273,19 +277,31 @@ nouveau_svmm_sync_cpu_device_pagetables(struct hmm_mirror *mirror,
 	}
 
 	nouveau_svmm_invalidate(svmm, start, limit);
+
+out:
 	mutex_unlock(&svmm->mutex);
 	return 0;
 }
 
-static void
-nouveau_svmm_release(struct hmm_mirror *mirror)
+static void nouveau_svmm_free_notifier(struct mmu_notifier *mn)
+{
+	kfree(container_of(mn, struct nouveau_svmm, notifier));
+}
+
+static const struct mmu_notifier_ops nouveau_mn_ops = {
+	.invalidate_range_start = nouveau_svmm_invalidate_range_start,
+	.free_notifier = nouveau_svmm_free_notifier,
+};
+
+static int
+nouveau_svmm_sync_cpu_device_pagetables(struct hmm_mirror *mirror,
+					const struct mmu_notifier_range *update)
 {
+	return 0;
 }
 
-static const struct hmm_mirror_ops
-nouveau_svmm = {
+static const struct hmm_mirror_ops nouveau_svmm = {
 	.sync_cpu_device_pagetables = nouveau_svmm_sync_cpu_device_pagetables,
-	.release = nouveau_svmm_release,
 };
 
 void
@@ -294,7 +310,10 @@ nouveau_svmm_fini(struct nouveau_svmm **psvmm)
 	struct nouveau_svmm *svmm = *psvmm;
 	if (svmm) {
 		hmm_mirror_unregister(&svmm->mirror);
-		kfree(*psvmm);
+		mutex_lock(&svmm->mutex);
+		svmm->vmm = NULL;
+		mutex_unlock(&svmm->mutex);
+		mmu_notifier_put(&svmm->notifier);
 		*psvmm = NULL;
 	}
 }
@@ -320,7 +339,7 @@ nouveau_svmm_init(struct drm_device *dev, void *data,
 	mutex_lock(&cli->mutex);
 	if (cli->svm.cli) {
 		ret = -EBUSY;
-		goto done;
+		goto out_free;
 	}
 
 	/* Allocate a new GPU VMM that can support SVM (managed by the
@@ -335,24 +354,33 @@ nouveau_svmm_init(struct drm_device *dev, void *data,
 				.fault_replay = true,
 			    }, sizeof(struct gp100_vmm_v0), &cli->svm.vmm);
 	if (ret)
-		goto done;
+		goto out_free;
 
-	/* Enable HMM mirroring of CPU address-space to VMM. */
-	svmm->mm = get_task_mm(current);
-	down_write(&svmm->mm->mmap_sem);
+	down_write(&current->mm->mmap_sem);
 	svmm->mirror.ops = &nouveau_svmm;
-	ret = hmm_mirror_register(&svmm->mirror, svmm->mm);
-	if (ret == 0) {
-		cli->svm.svmm = svmm;
-		cli->svm.cli = cli;
-	}
-	up_write(&svmm->mm->mmap_sem);
-	mmput(svmm->mm);
+	ret = hmm_mirror_register(&svmm->mirror, current->mm);
+	if (ret)
+		goto out_mm_unlock;
 
-done:
+	svmm->notifier.ops = &nouveau_mn_ops;
+	ret = __mmu_notifier_register(&svmm->notifier, current->mm);
 	if (ret)
-		nouveau_svmm_fini(&svmm);
+		goto out_hmm_unregister;
+	/* Note, ownership of svmm transfers to mmu_notifier */
+
+	cli->svm.svmm = svmm;
+	cli->svm.cli = cli;
+	up_write(&current->mm->mmap_sem);
 	mutex_unlock(&cli->mutex);
+	return 0;
+
+out_hmm_unregister:
+	hmm_mirror_unregister(&svmm->mirror);
+out_mm_unlock:
+	up_write(&current->mm->mmap_sem);
+out_free:
+	mutex_unlock(&cli->mutex);
+	kfree(svmm);
 	return ret;
 }
 
@@ -494,12 +522,12 @@ nouveau_range_fault(struct nouveau_svmm *svmm, struct hmm_range *range)
 
 	ret = hmm_range_register(range, &svmm->mirror);
 	if (ret) {
-		up_read(&svmm->mm->mmap_sem);
+		up_read(&svmm->notifier.mm->mmap_sem);
 		return (int)ret;
 	}
 
 	if (!hmm_range_wait_until_valid(range, HMM_RANGE_DEFAULT_TIMEOUT)) {
-		up_read(&svmm->mm->mmap_sem);
+		up_read(&svmm->notifier.mm->mmap_sem);
 		return -EBUSY;
 	}
 
@@ -507,7 +535,7 @@ nouveau_range_fault(struct nouveau_svmm *svmm, struct hmm_range *range)
 	if (ret <= 0) {
 		if (ret == 0)
 			ret = -EBUSY;
-		up_read(&svmm->mm->mmap_sem);
+		up_read(&svmm->notifier.mm->mmap_sem);
 		hmm_range_unregister(range);
 		return ret;
 	}
@@ -587,12 +615,15 @@ nouveau_svm_fault(struct nvif_notify *notify)
 	args.i.p.version = 0;
 
 	for (fi = 0; fn = fi + 1, fi < buffer->fault_nr; fi = fn) {
+		struct mm_struct *mm;
+
 		/* Cancel any faults from non-SVM channels. */
 		if (!(svmm = buffer->fault[fi]->svmm)) {
 			nouveau_svm_fault_cancel_fault(svm, buffer->fault[fi]);
 			continue;
 		}
 		SVMM_DBG(svmm, "addr %016llx", buffer->fault[fi]->addr);
+		mm = svmm->notifier.mm;
 
 		/* We try and group handling of faults within a small
 		 * window into a single update.
@@ -609,11 +640,11 @@ nouveau_svm_fault(struct nvif_notify *notify)
 		/* Intersect fault window with the CPU VMA, cancelling
 		 * the fault if the address is invalid.
 		 */
-		down_read(&svmm->mm->mmap_sem);
-		vma = find_vma_intersection(svmm->mm, start, limit);
+		down_read(&mm->mmap_sem);
+		vma = find_vma_intersection(mm, start, limit);
 		if (!vma) {
 			SVMM_ERR(svmm, "wndw %016llx-%016llx", start, limit);
-			up_read(&svmm->mm->mmap_sem);
+			up_read(&mm->mmap_sem);
 			nouveau_svm_fault_cancel_fault(svm, buffer->fault[fi]);
 			continue;
 		}
@@ -623,7 +654,7 @@ nouveau_svm_fault(struct nvif_notify *notify)
 
 		if (buffer->fault[fi]->addr != start) {
 			SVMM_ERR(svmm, "addr %016llx", buffer->fault[fi]->addr);
-			up_read(&svmm->mm->mmap_sem);
+			up_read(&mm->mmap_sem);
 			nouveau_svm_fault_cancel_fault(svm, buffer->fault[fi]);
 			continue;
 		}
@@ -704,7 +735,7 @@ nouveau_svm_fault(struct nvif_notify *notify)
 						NULL);
 			svmm->vmm->vmm.object.client->super = false;
 			mutex_unlock(&svmm->mutex);
-			up_read(&svmm->mm->mmap_sem);
+			up_read(&mm->mmap_sem);
 		}
 
 		/* Cancel any faults in the window whose pages didn't manage
-- 
2.23.0


^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH v2 11/15] nouveau: use mmu_range_notifier instead of hmm_mirror
  2019-10-28 20:10 [PATCH v2 00/15] Consolidate the mmu notifier interval_tree and locking Jason Gunthorpe
                   ` (9 preceding siblings ...)
  2019-10-28 20:10 ` [PATCH v2 10/15] nouveau: use mmu_notifier directly for invalidate_range_start Jason Gunthorpe
@ 2019-10-28 20:10 ` Jason Gunthorpe
  2019-10-28 20:10 ` [PATCH v2 12/15] drm/amdgpu: Call find_vma under mmap_sem Jason Gunthorpe
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 71+ messages in thread
From: Jason Gunthorpe @ 2019-10-28 20:10 UTC (permalink / raw)
  To: linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard, Felix.Kuehling
  Cc: linux-rdma, dri-devel, amd-gfx, Alex Deucher, Ben Skeggs,
	Boris Ostrovsky, Christian König, David Zhou,
	Dennis Dalessandro, Juergen Gross, Mike Marciniszyn,
	Oleksandr Andrushchenko, Petr Cvek, Stefano Stabellini, nouveau,
	xen-devel, Christoph Hellwig, Jason Gunthorpe

From: Jason Gunthorpe <jgg@mellanox.com>

Remove the hmm_mirror object and use the mmu_range_notifier API instead
for the range, and use the normal mmu_notifier API for the general
invalidation callback.

While here re-organize the pagefault path so the locking pattern is clear.

nouveau is the only driver that uses a temporary range object and instead
forwards nearly every invalidation range directly to the HW. While this is
not how the mmu_range_notifier was intended to be used, the overheads on
the pagefaulting path are similar to the existing hmm_mirror version.
Particularly since the interval tree will be small.

Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: dri-devel@lists.freedesktop.org
Cc: nouveau@lists.freedesktop.org
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 drivers/gpu/drm/nouveau/nouveau_svm.c | 180 ++++++++++++++------------
 1 file changed, 100 insertions(+), 80 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
index 577f8811925a59..f27317fbe36f45 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -96,8 +96,6 @@ struct nouveau_svmm {
 	} unmanaged;
 
 	struct mutex mutex;
-
-	struct hmm_mirror mirror;
 };
 
 #define SVMM_DBG(s,f,a...)                                                     \
@@ -293,23 +291,11 @@ static const struct mmu_notifier_ops nouveau_mn_ops = {
 	.free_notifier = nouveau_svmm_free_notifier,
 };
 
-static int
-nouveau_svmm_sync_cpu_device_pagetables(struct hmm_mirror *mirror,
-					const struct mmu_notifier_range *update)
-{
-	return 0;
-}
-
-static const struct hmm_mirror_ops nouveau_svmm = {
-	.sync_cpu_device_pagetables = nouveau_svmm_sync_cpu_device_pagetables,
-};
-
 void
 nouveau_svmm_fini(struct nouveau_svmm **psvmm)
 {
 	struct nouveau_svmm *svmm = *psvmm;
 	if (svmm) {
-		hmm_mirror_unregister(&svmm->mirror);
 		mutex_lock(&svmm->mutex);
 		svmm->vmm = NULL;
 		mutex_unlock(&svmm->mutex);
@@ -357,15 +343,10 @@ nouveau_svmm_init(struct drm_device *dev, void *data,
 		goto out_free;
 
 	down_write(&current->mm->mmap_sem);
-	svmm->mirror.ops = &nouveau_svmm;
-	ret = hmm_mirror_register(&svmm->mirror, current->mm);
-	if (ret)
-		goto out_mm_unlock;
-
 	svmm->notifier.ops = &nouveau_mn_ops;
 	ret = __mmu_notifier_register(&svmm->notifier, current->mm);
 	if (ret)
-		goto out_hmm_unregister;
+		goto out_mm_unlock;
 	/* Note, ownership of svmm transfers to mmu_notifier */
 
 	cli->svm.svmm = svmm;
@@ -374,8 +355,6 @@ nouveau_svmm_init(struct drm_device *dev, void *data,
 	mutex_unlock(&cli->mutex);
 	return 0;
 
-out_hmm_unregister:
-	hmm_mirror_unregister(&svmm->mirror);
 out_mm_unlock:
 	up_write(&current->mm->mmap_sem);
 out_free:
@@ -503,43 +482,91 @@ nouveau_svm_fault_cache(struct nouveau_svm *svm,
 		fault->inst, fault->addr, fault->access);
 }
 
-static inline bool
-nouveau_range_done(struct hmm_range *range)
+struct svm_notifier {
+	struct mmu_range_notifier notifier;
+	struct nouveau_svmm *svmm;
+};
+
+static bool nouveau_svm_range_invalidate(struct mmu_range_notifier *mrn,
+					 const struct mmu_notifier_range *range,
+					 unsigned long cur_seq)
 {
-	bool ret = hmm_range_valid(range);
+	struct svm_notifier *sn =
+		container_of(mrn, struct svm_notifier, notifier);
 
-	hmm_range_unregister(range);
-	return ret;
+	/*
+	 * serializes the update to mrn->invalidate_seq done by caller and
+	 * prevents invalidation of the PTE from progressing while HW is being
+	 * programmed. This is very hacky and only works because the normal
+	 * notifier that does invalidation is always called after the range
+	 * notifier.
+	 */
+	if (mmu_notifier_range_blockable(range))
+		mutex_lock(&sn->svmm->mutex);
+	else if (!mutex_trylock(&sn->svmm->mutex))
+		return false;
+	mmu_range_set_seq(mrn, cur_seq);
+	mutex_unlock(&sn->svmm->mutex);
+	return true;
 }
 
-static int
-nouveau_range_fault(struct nouveau_svmm *svmm, struct hmm_range *range)
+static const struct mmu_range_notifier_ops nouveau_svm_mrn_ops = {
+	.invalidate = nouveau_svm_range_invalidate,
+};
+
+static int nouveau_range_fault(struct nouveau_svmm *svmm,
+			       struct nouveau_drm *drm, void *data, u32 size,
+			       u64 *pfns,
+			       struct svm_notifier *notifier)
 {
+	unsigned long timeout =
+		jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
+	/* Have HMM fault pages within the fault window to the GPU. */
+	struct hmm_range range = {
+		.notifier = &notifier->notifier,
+		.start = notifier->notifier.interval_tree.start,
+		.end = notifier->notifier.interval_tree.last + 1,
+		.pfns = pfns,
+		.flags = nouveau_svm_pfn_flags,
+		.values = nouveau_svm_pfn_values,
+		.pfn_shift = NVIF_VMM_PFNMAP_V0_ADDR_SHIFT,
+	};
+	struct mm_struct *mm = notifier->notifier.mm;
 	long ret;
 
-	range->default_flags = 0;
-	range->pfn_flags_mask = -1UL;
+	while (true) {
+		if (time_after(jiffies, timeout))
+			return -EBUSY;
 
-	ret = hmm_range_register(range, &svmm->mirror);
-	if (ret) {
-		up_read(&svmm->notifier.mm->mmap_sem);
-		return (int)ret;
-	}
+		range.notifier_seq = mmu_range_read_begin(range.notifier);
+		range.default_flags = 0;
+		range.pfn_flags_mask = -1UL;
+		down_read(&mm->mmap_sem);
+		ret = hmm_range_fault(&range, 0);
+		up_read(&mm->mmap_sem);
+		if (ret <= 0) {
+			if (ret == 0 || ret == -EBUSY)
+				continue;
+			return ret;
+		}
 
-	if (!hmm_range_wait_until_valid(range, HMM_RANGE_DEFAULT_TIMEOUT)) {
-		up_read(&svmm->notifier.mm->mmap_sem);
-		return -EBUSY;
+		mutex_lock(&svmm->mutex);
+		if (mmu_range_read_retry(range.notifier,
+					 range.notifier_seq)) {
+			mutex_unlock(&svmm->mutex);
+			continue;
+		}
+		break;
 	}
 
-	ret = hmm_range_fault(range, 0);
-	if (ret <= 0) {
-		if (ret == 0)
-			ret = -EBUSY;
-		up_read(&svmm->notifier.mm->mmap_sem);
-		hmm_range_unregister(range);
-		return ret;
-	}
-	return 0;
+	nouveau_dmem_convert_pfn(drm, &range);
+
+	svmm->vmm->vmm.object.client->super = true;
+	ret = nvif_object_ioctl(&svmm->vmm->vmm.object, data, size, NULL);
+	svmm->vmm->vmm.object.client->super = false;
+	mutex_unlock(&svmm->mutex);
+
+	return ret;
 }
 
 static int
@@ -559,7 +586,6 @@ nouveau_svm_fault(struct nvif_notify *notify)
 		} i;
 		u64 phys[16];
 	} args;
-	struct hmm_range range;
 	struct vm_area_struct *vma;
 	u64 inst, start, limit;
 	int fi, fn, pi, fill;
@@ -615,6 +641,7 @@ nouveau_svm_fault(struct nvif_notify *notify)
 	args.i.p.version = 0;
 
 	for (fi = 0; fn = fi + 1, fi < buffer->fault_nr; fi = fn) {
+		struct svm_notifier notifier;
 		struct mm_struct *mm;
 
 		/* Cancel any faults from non-SVM channels. */
@@ -623,7 +650,6 @@ nouveau_svm_fault(struct nvif_notify *notify)
 			continue;
 		}
 		SVMM_DBG(svmm, "addr %016llx", buffer->fault[fi]->addr);
-		mm = svmm->notifier.mm;
 
 		/* We try and group handling of faults within a small
 		 * window into a single update.
@@ -637,6 +663,12 @@ nouveau_svm_fault(struct nvif_notify *notify)
 			start = max_t(u64, start, svmm->unmanaged.limit);
 		SVMM_DBG(svmm, "wndw %016llx-%016llx", start, limit);
 
+		mm = svmm->notifier.mm;
+		if (!mmget_not_zero(mm)) {
+			nouveau_svm_fault_cancel_fault(svm, buffer->fault[fi]);
+			continue;
+		}
+
 		/* Intersect fault window with the CPU VMA, cancelling
 		 * the fault if the address is invalid.
 		 */
@@ -645,16 +677,18 @@ nouveau_svm_fault(struct nvif_notify *notify)
 		if (!vma) {
 			SVMM_ERR(svmm, "wndw %016llx-%016llx", start, limit);
 			up_read(&mm->mmap_sem);
+			mmput(mm);
 			nouveau_svm_fault_cancel_fault(svm, buffer->fault[fi]);
 			continue;
 		}
 		start = max_t(u64, start, vma->vm_start);
 		limit = min_t(u64, limit, vma->vm_end);
+		up_read(&mm->mmap_sem);
 		SVMM_DBG(svmm, "wndw %016llx-%016llx", start, limit);
 
 		if (buffer->fault[fi]->addr != start) {
 			SVMM_ERR(svmm, "addr %016llx", buffer->fault[fi]->addr);
-			up_read(&mm->mmap_sem);
+			mmput(mm);
 			nouveau_svm_fault_cancel_fault(svm, buffer->fault[fi]);
 			continue;
 		}
@@ -710,33 +744,19 @@ nouveau_svm_fault(struct nvif_notify *notify)
 			 args.i.p.addr,
 			 args.i.p.addr + args.i.p.size, fn - fi);
 
-		/* Have HMM fault pages within the fault window to the GPU. */
-		range.start = args.i.p.addr;
-		range.end = args.i.p.addr + args.i.p.size;
-		range.pfns = args.phys;
-		range.flags = nouveau_svm_pfn_flags;
-		range.values = nouveau_svm_pfn_values;
-		range.pfn_shift = NVIF_VMM_PFNMAP_V0_ADDR_SHIFT;
-again:
-		ret = nouveau_range_fault(svmm, &range);
-		if (ret == 0) {
-			mutex_lock(&svmm->mutex);
-			if (!nouveau_range_done(&range)) {
-				mutex_unlock(&svmm->mutex);
-				goto again;
-			}
-
-			nouveau_dmem_convert_pfn(svm->drm, &range);
-
-			svmm->vmm->vmm.object.client->super = true;
-			ret = nvif_object_ioctl(&svmm->vmm->vmm.object,
-						&args, sizeof(args.i) +
-						pi * sizeof(args.phys[0]),
-						NULL);
-			svmm->vmm->vmm.object.client->super = false;
-			mutex_unlock(&svmm->mutex);
-			up_read(&mm->mmap_sem);
+		notifier.svmm = svmm;
+		notifier.notifier.ops = &nouveau_svm_mrn_ops;
+		ret = mmu_range_notifier_insert(&notifier.notifier,
+						args.i.p.addr, args.i.p.size,
+						svmm->notifier.mm);
+		if (!ret) {
+			ret = nouveau_range_fault(
+				svmm, svm->drm, &args,
+				sizeof(args.i) + pi * sizeof(args.phys[0]),
+				args.phys, &notifier);
+			mmu_range_notifier_remove(&notifier.notifier);
 		}
+		mmput(mm);
 
 		/* Cancel any faults in the window whose pages didn't manage
 		 * to keep their valid bit, or stay writeable when required.
@@ -745,10 +765,10 @@ nouveau_svm_fault(struct nvif_notify *notify)
 		 */
 		while (fi < fn) {
 			struct nouveau_svm_fault *fault = buffer->fault[fi++];
-			pi = (fault->addr - range.start) >> PAGE_SHIFT;
+			pi = (fault->addr - args.i.p.addr) >> PAGE_SHIFT;
 			if (ret ||
-			     !(range.pfns[pi] & NVIF_VMM_PFNMAP_V0_V) ||
-			    (!(range.pfns[pi] & NVIF_VMM_PFNMAP_V0_W) &&
+			     !(args.phys[pi] & NVIF_VMM_PFNMAP_V0_V) ||
+			    (!(args.phys[pi] & NVIF_VMM_PFNMAP_V0_W) &&
 			     fault->access != 0 && fault->access != 3)) {
 				nouveau_svm_fault_cancel_fault(svm, fault);
 				continue;
-- 
2.23.0


^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH v2 12/15] drm/amdgpu: Call find_vma under mmap_sem
  2019-10-28 20:10 [PATCH v2 00/15] Consolidate the mmu notifier interval_tree and locking Jason Gunthorpe
                   ` (10 preceding siblings ...)
  2019-10-28 20:10 ` [PATCH v2 11/15] nouveau: use mmu_range_notifier instead of hmm_mirror Jason Gunthorpe
@ 2019-10-28 20:10 ` Jason Gunthorpe
  2019-10-29  7:49   ` Koenig, Christian
  2019-10-29 16:28   ` Kuehling, Felix
  2019-10-28 20:10 ` [PATCH v2 13/15] drm/amdgpu: Use mmu_range_insert instead of hmm_mirror Jason Gunthorpe
                   ` (4 subsequent siblings)
  16 siblings, 2 replies; 71+ messages in thread
From: Jason Gunthorpe @ 2019-10-28 20:10 UTC (permalink / raw)
  To: linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard, Felix.Kuehling
  Cc: linux-rdma, dri-devel, amd-gfx, Alex Deucher, Ben Skeggs,
	Boris Ostrovsky, Christian König, David Zhou,
	Dennis Dalessandro, Juergen Gross, Mike Marciniszyn,
	Oleksandr Andrushchenko, Petr Cvek, Stefano Stabellini, nouveau,
	xen-devel, Christoph Hellwig, Jason Gunthorpe

From: Jason Gunthorpe <jgg@mellanox.com>

find_vma() must be called under the mmap_sem, reorganize this code to
do the vma check after entering the lock.

Further, fix the unlocked use of struct task_struct's mm, instead use
the mm from hmm_mirror which has an active mm_grab. Also the mm_grab
must be converted to a mm_get before acquiring mmap_sem or calling
find_vma().

Fixes: 66c45500bfdc ("drm/amdgpu: use new HMM APIs and helpers")
Fixes: 0919195f2b0d ("drm/amdgpu: Enable amdgpu_ttm_tt_get_user_pages in worker threads")
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: David (ChunMing) Zhou <David1.Zhou@amd.com>
Cc: amd-gfx@lists.freedesktop.org
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 37 ++++++++++++++-----------
 1 file changed, 21 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index dff41d0a85fe96..c0e41f1f0c2365 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -35,6 +35,7 @@
 #include <linux/hmm.h>
 #include <linux/pagemap.h>
 #include <linux/sched/task.h>
+#include <linux/sched/mm.h>
 #include <linux/seq_file.h>
 #include <linux/slab.h>
 #include <linux/swap.h>
@@ -788,7 +789,7 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
 	struct hmm_mirror *mirror = bo->mn ? &bo->mn->mirror : NULL;
 	struct ttm_tt *ttm = bo->tbo.ttm;
 	struct amdgpu_ttm_tt *gtt = (void *)ttm;
-	struct mm_struct *mm = gtt->usertask->mm;
+	struct mm_struct *mm;
 	unsigned long start = gtt->userptr;
 	struct vm_area_struct *vma;
 	struct hmm_range *range;
@@ -796,25 +797,14 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
 	uint64_t *pfns;
 	int r = 0;
 
-	if (!mm) /* Happens during process shutdown */
-		return -ESRCH;
-
 	if (unlikely(!mirror)) {
 		DRM_DEBUG_DRIVER("Failed to get hmm_mirror\n");
-		r = -EFAULT;
-		goto out;
+		return -EFAULT;
 	}
 
-	vma = find_vma(mm, start);
-	if (unlikely(!vma || start < vma->vm_start)) {
-		r = -EFAULT;
-		goto out;
-	}
-	if (unlikely((gtt->userflags & AMDGPU_GEM_USERPTR_ANONONLY) &&
-		vma->vm_file)) {
-		r = -EPERM;
-		goto out;
-	}
+	mm = mirror->hmm->mmu_notifier.mm;
+	if (!mmget_not_zero(mm)) /* Happens during process shutdown */
+		return -ESRCH;
 
 	range = kzalloc(sizeof(*range), GFP_KERNEL);
 	if (unlikely(!range)) {
@@ -847,6 +837,17 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
 	hmm_range_wait_until_valid(range, HMM_RANGE_DEFAULT_TIMEOUT);
 
 	down_read(&mm->mmap_sem);
+	vma = find_vma(mm, start);
+	if (unlikely(!vma || start < vma->vm_start)) {
+		r = -EFAULT;
+		goto out_unlock;
+	}
+	if (unlikely((gtt->userflags & AMDGPU_GEM_USERPTR_ANONONLY) &&
+		vma->vm_file)) {
+		r = -EPERM;
+		goto out_unlock;
+	}
+
 	r = hmm_range_fault(range, 0);
 	up_read(&mm->mmap_sem);
 
@@ -865,15 +866,19 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
 	}
 
 	gtt->range = range;
+	mmput(mm);
 
 	return 0;
 
+out_unlock:
+	up_read(&mm->mmap_sem);
 out_free_pfns:
 	hmm_range_unregister(range);
 	kvfree(pfns);
 out_free_ranges:
 	kfree(range);
 out:
+	mmput(mm);
 	return r;
 }
 
-- 
2.23.0


^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH v2 13/15] drm/amdgpu: Use mmu_range_insert instead of hmm_mirror
  2019-10-28 20:10 [PATCH v2 00/15] Consolidate the mmu notifier interval_tree and locking Jason Gunthorpe
                   ` (11 preceding siblings ...)
  2019-10-28 20:10 ` [PATCH v2 12/15] drm/amdgpu: Call find_vma under mmap_sem Jason Gunthorpe
@ 2019-10-28 20:10 ` Jason Gunthorpe
  2019-10-29  7:51   ` Koenig, Christian
  2019-10-29 22:14   ` Kuehling, Felix
  2019-10-28 20:10 ` [PATCH v2 14/15] drm/amdgpu: Use mmu_range_notifier " Jason Gunthorpe
                   ` (3 subsequent siblings)
  16 siblings, 2 replies; 71+ messages in thread
From: Jason Gunthorpe @ 2019-10-28 20:10 UTC (permalink / raw)
  To: linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard, Felix.Kuehling
  Cc: linux-rdma, dri-devel, amd-gfx, Alex Deucher, Ben Skeggs,
	Boris Ostrovsky, Christian König, David Zhou,
	Dennis Dalessandro, Juergen Gross, Mike Marciniszyn,
	Oleksandr Andrushchenko, Petr Cvek, Stefano Stabellini, nouveau,
	xen-devel, Christoph Hellwig, Jason Gunthorpe

From: Jason Gunthorpe <jgg@mellanox.com>

Remove the interval tree in the driver and rely on the tree maintained by
the mmu_notifier for delivering mmu_notifier invalidation callbacks.

For some reason amdgpu has a very complicated arrangement where it tries
to prevent duplicate entries in the interval_tree, this is not necessary,
each amdgpu_bo can be its own stand alone entry. interval_tree already
allows duplicates and overlaps in the tree.

Also, there is no need to remove entries upon a release callback, the
mmu_range API safely allows objects to remain registered beyond the
lifetime of the mm. The driver only has to stop touching the pages during
release.

Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: David (ChunMing) Zhou <David1.Zhou@amd.com>
Cc: amd-gfx@lists.freedesktop.org
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h           |   2 +
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   5 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    |   1 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c        | 341 ++++--------------
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h        |   4 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.h    |  13 +-
 6 files changed, 84 insertions(+), 282 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index bd37df5dd6d048..60591a5d420021 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1006,6 +1006,8 @@ struct amdgpu_device {
 	struct mutex  lock_reset;
 	struct amdgpu_doorbell_index doorbell_index;
 
+	struct mutex			notifier_lock;
+
 	int asic_reset_res;
 	struct work_struct		xgmi_reset_work;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 6d021ecc8d598f..47700302a08b7f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -481,8 +481,7 @@ static void remove_kgd_mem_from_kfd_bo_list(struct kgd_mem *mem,
  *
  * Returns 0 for success, negative errno for errors.
  */
-static int init_user_pages(struct kgd_mem *mem, struct mm_struct *mm,
-			   uint64_t user_addr)
+static int init_user_pages(struct kgd_mem *mem, uint64_t user_addr)
 {
 	struct amdkfd_process_info *process_info = mem->process_info;
 	struct amdgpu_bo *bo = mem->bo;
@@ -1195,7 +1194,7 @@ int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu(
 	add_kgd_mem_to_kfd_bo_list(*mem, avm->process_info, user_addr);
 
 	if (user_addr) {
-		ret = init_user_pages(*mem, current->mm, user_addr);
+		ret = init_user_pages(*mem, user_addr);
 		if (ret)
 			goto allocate_init_user_pages_failed;
 	}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 5a1939dbd4e3e6..38f97998aaddb2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2633,6 +2633,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 	mutex_init(&adev->virt.vf_errors.lock);
 	hash_init(adev->mn_hash);
 	mutex_init(&adev->lock_reset);
+	mutex_init(&adev->notifier_lock);
 	mutex_init(&adev->virt.dpm_mutex);
 	mutex_init(&adev->psp.mutex);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index 31d4deb5d29484..4ffd7b90f4d907 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -50,66 +50,6 @@
 #include "amdgpu.h"
 #include "amdgpu_amdkfd.h"
 
-/**
- * struct amdgpu_mn_node
- *
- * @it: interval node defining start-last of the affected address range
- * @bos: list of all BOs in the affected address range
- *
- * Manages all BOs which are affected of a certain range of address space.
- */
-struct amdgpu_mn_node {
-	struct interval_tree_node	it;
-	struct list_head		bos;
-};
-
-/**
- * amdgpu_mn_destroy - destroy the HMM mirror
- *
- * @work: previously sheduled work item
- *
- * Lazy destroys the notifier from a work item
- */
-static void amdgpu_mn_destroy(struct work_struct *work)
-{
-	struct amdgpu_mn *amn = container_of(work, struct amdgpu_mn, work);
-	struct amdgpu_device *adev = amn->adev;
-	struct amdgpu_mn_node *node, *next_node;
-	struct amdgpu_bo *bo, *next_bo;
-
-	mutex_lock(&adev->mn_lock);
-	down_write(&amn->lock);
-	hash_del(&amn->node);
-	rbtree_postorder_for_each_entry_safe(node, next_node,
-					     &amn->objects.rb_root, it.rb) {
-		list_for_each_entry_safe(bo, next_bo, &node->bos, mn_list) {
-			bo->mn = NULL;
-			list_del_init(&bo->mn_list);
-		}
-		kfree(node);
-	}
-	up_write(&amn->lock);
-	mutex_unlock(&adev->mn_lock);
-
-	hmm_mirror_unregister(&amn->mirror);
-	kfree(amn);
-}
-
-/**
- * amdgpu_hmm_mirror_release - callback to notify about mm destruction
- *
- * @mirror: the HMM mirror (mm) this callback is about
- *
- * Shedule a work item to lazy destroy HMM mirror.
- */
-static void amdgpu_hmm_mirror_release(struct hmm_mirror *mirror)
-{
-	struct amdgpu_mn *amn = container_of(mirror, struct amdgpu_mn, mirror);
-
-	INIT_WORK(&amn->work, amdgpu_mn_destroy);
-	schedule_work(&amn->work);
-}
-
 /**
  * amdgpu_mn_lock - take the write side lock for this notifier
  *
@@ -133,157 +73,86 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
 }
 
 /**
- * amdgpu_mn_read_lock - take the read side lock for this notifier
- *
- * @amn: our notifier
- */
-static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
-{
-	if (blockable)
-		down_read(&amn->lock);
-	else if (!down_read_trylock(&amn->lock))
-		return -EAGAIN;
-
-	return 0;
-}
-
-/**
- * amdgpu_mn_read_unlock - drop the read side lock for this notifier
- *
- * @amn: our notifier
- */
-static void amdgpu_mn_read_unlock(struct amdgpu_mn *amn)
-{
-	up_read(&amn->lock);
-}
-
-/**
- * amdgpu_mn_invalidate_node - unmap all BOs of a node
+ * amdgpu_mn_invalidate_gfx - callback to notify about mm change
  *
- * @node: the node with the BOs to unmap
- * @start: start of address range affected
- * @end: end of address range affected
+ * @mrn: the range (mm) is about to update
+ * @range: details on the invalidation
  *
  * Block for operations on BOs to finish and mark pages as accessed and
  * potentially dirty.
  */
-static void amdgpu_mn_invalidate_node(struct amdgpu_mn_node *node,
-				      unsigned long start,
-				      unsigned long end)
+static bool amdgpu_mn_invalidate_gfx(struct mmu_range_notifier *mrn,
+				     const struct mmu_notifier_range *range)
 {
-	struct amdgpu_bo *bo;
+	struct amdgpu_bo *bo = container_of(mrn, struct amdgpu_bo, notifier);
+	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
 	long r;
 
-	list_for_each_entry(bo, &node->bos, mn_list) {
-
-		if (!amdgpu_ttm_tt_affect_userptr(bo->tbo.ttm, start, end))
-			continue;
-
-		r = dma_resv_wait_timeout_rcu(bo->tbo.base.resv,
-			true, false, MAX_SCHEDULE_TIMEOUT);
-		if (r <= 0)
-			DRM_ERROR("(%ld) failed to wait for user bo\n", r);
-	}
+	/* FIXME: Is this necessary? */
+	if (!amdgpu_ttm_tt_affect_userptr(bo->tbo.ttm, range->start,
+					  range->end))
+		return true;
+
+	if (!mmu_notifier_range_blockable(range))
+		return false;
+
+	mutex_lock(&adev->notifier_lock);
+	r = dma_resv_wait_timeout_rcu(bo->tbo.base.resv, true, false,
+				      MAX_SCHEDULE_TIMEOUT);
+	mutex_unlock(&adev->notifier_lock);
+	if (r <= 0)
+		DRM_ERROR("(%ld) failed to wait for user bo\n", r);
+	return true;
 }
 
+static const struct mmu_range_notifier_ops amdgpu_mn_gfx_ops = {
+	.invalidate = amdgpu_mn_invalidate_gfx,
+};
+
 /**
- * amdgpu_mn_sync_pagetables_gfx - callback to notify about mm change
+ * amdgpu_mn_invalidate_hsa - callback to notify about mm change
  *
- * @mirror: the hmm_mirror (mm) is about to update
- * @update: the update start, end address
+ * @mrn: the range (mm) is about to update
+ * @range: details on the invalidation
  *
- * Block for operations on BOs to finish and mark pages as accessed and
- * potentially dirty.
+ * We temporarily evict the BO attached to this range. This necessitates
+ * evicting all user-mode queues of the process.
  */
-static int
-amdgpu_mn_sync_pagetables_gfx(struct hmm_mirror *mirror,
-			      const struct mmu_notifier_range *update)
+static bool amdgpu_mn_invalidate_hsa(struct mmu_range_notifier *mrn,
+				     const struct mmu_notifier_range *range)
 {
-	struct amdgpu_mn *amn = container_of(mirror, struct amdgpu_mn, mirror);
-	unsigned long start = update->start;
-	unsigned long end = update->end;
-	bool blockable = mmu_notifier_range_blockable(update);
-	struct interval_tree_node *it;
-
-	/* notification is exclusive, but interval is inclusive */
-	end -= 1;
-
-	/* TODO we should be able to split locking for interval tree and
-	 * amdgpu_mn_invalidate_node
-	 */
-	if (amdgpu_mn_read_lock(amn, blockable))
-		return -EAGAIN;
-
-	it = interval_tree_iter_first(&amn->objects, start, end);
-	while (it) {
-		struct amdgpu_mn_node *node;
-
-		if (!blockable) {
-			amdgpu_mn_read_unlock(amn);
-			return -EAGAIN;
-		}
+	struct amdgpu_bo *bo = container_of(mrn, struct amdgpu_bo, notifier);
+	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
 
-		node = container_of(it, struct amdgpu_mn_node, it);
-		it = interval_tree_iter_next(it, start, end);
+	/* FIXME: Is this necessary? */
+	if (!amdgpu_ttm_tt_affect_userptr(bo->tbo.ttm, range->start,
+					  range->end))
+		return true;
 
-		amdgpu_mn_invalidate_node(node, start, end);
-	}
+	if (!mmu_notifier_range_blockable(range))
+		return false;
 
-	amdgpu_mn_read_unlock(amn);
+	mutex_lock(&adev->notifier_lock);
+	amdgpu_amdkfd_evict_userptr(bo->kfd_bo, bo->notifier.mm);
+	mutex_unlock(&adev->notifier_lock);
 
-	return 0;
+	return true;
 }
 
-/**
- * amdgpu_mn_sync_pagetables_hsa - callback to notify about mm change
- *
- * @mirror: the hmm_mirror (mm) is about to update
- * @update: the update start, end address
- *
- * We temporarily evict all BOs between start and end. This
- * necessitates evicting all user-mode queues of the process. The BOs
- * are restorted in amdgpu_mn_invalidate_range_end_hsa.
- */
-static int
-amdgpu_mn_sync_pagetables_hsa(struct hmm_mirror *mirror,
-			      const struct mmu_notifier_range *update)
+static const struct mmu_range_notifier_ops amdgpu_mn_hsa_ops = {
+	.invalidate = amdgpu_mn_invalidate_hsa,
+};
+
+static int amdgpu_mn_sync_pagetables(struct hmm_mirror *mirror,
+				     const struct mmu_notifier_range *update)
 {
 	struct amdgpu_mn *amn = container_of(mirror, struct amdgpu_mn, mirror);
-	unsigned long start = update->start;
-	unsigned long end = update->end;
-	bool blockable = mmu_notifier_range_blockable(update);
-	struct interval_tree_node *it;
 
-	/* notification is exclusive, but interval is inclusive */
-	end -= 1;
-
-	if (amdgpu_mn_read_lock(amn, blockable))
-		return -EAGAIN;
-
-	it = interval_tree_iter_first(&amn->objects, start, end);
-	while (it) {
-		struct amdgpu_mn_node *node;
-		struct amdgpu_bo *bo;
-
-		if (!blockable) {
-			amdgpu_mn_read_unlock(amn);
-			return -EAGAIN;
-		}
-
-		node = container_of(it, struct amdgpu_mn_node, it);
-		it = interval_tree_iter_next(it, start, end);
-
-		list_for_each_entry(bo, &node->bos, mn_list) {
-			struct kgd_mem *mem = bo->kfd_bo;
-
-			if (amdgpu_ttm_tt_affect_userptr(bo->tbo.ttm,
-							 start, end))
-				amdgpu_amdkfd_evict_userptr(mem, amn->mm);
-		}
-	}
-
-	amdgpu_mn_read_unlock(amn);
+	if (!mmu_notifier_range_blockable(update))
+		return false;
 
+	down_read(&amn->lock);
+	up_read(&amn->lock);
 	return 0;
 }
 
@@ -295,12 +164,10 @@ amdgpu_mn_sync_pagetables_hsa(struct hmm_mirror *mirror,
 
 static struct hmm_mirror_ops amdgpu_hmm_mirror_ops[] = {
 	[AMDGPU_MN_TYPE_GFX] = {
-		.sync_cpu_device_pagetables = amdgpu_mn_sync_pagetables_gfx,
-		.release = amdgpu_hmm_mirror_release
+		.sync_cpu_device_pagetables = amdgpu_mn_sync_pagetables,
 	},
 	[AMDGPU_MN_TYPE_HSA] = {
-		.sync_cpu_device_pagetables = amdgpu_mn_sync_pagetables_hsa,
-		.release = amdgpu_hmm_mirror_release
+		.sync_cpu_device_pagetables = amdgpu_mn_sync_pagetables,
 	},
 };
 
@@ -327,7 +194,8 @@ struct amdgpu_mn *amdgpu_mn_get(struct amdgpu_device *adev,
 	}
 
 	hash_for_each_possible(adev->mn_hash, amn, node, key)
-		if (AMDGPU_MN_KEY(amn->mm, amn->type) == key)
+		if (AMDGPU_MN_KEY(amn->mirror.hmm->mmu_notifier.mm,
+				  amn->type) == key)
 			goto release_locks;
 
 	amn = kzalloc(sizeof(*amn), GFP_KERNEL);
@@ -337,10 +205,8 @@ struct amdgpu_mn *amdgpu_mn_get(struct amdgpu_device *adev,
 	}
 
 	amn->adev = adev;
-	amn->mm = mm;
 	init_rwsem(&amn->lock);
 	amn->type = type;
-	amn->objects = RB_ROOT_CACHED;
 
 	amn->mirror.ops = &amdgpu_hmm_mirror_ops[type];
 	r = hmm_mirror_register(&amn->mirror, mm);
@@ -369,100 +235,33 @@ struct amdgpu_mn *amdgpu_mn_get(struct amdgpu_device *adev,
  * @bo: amdgpu buffer object
  * @addr: userptr addr we should monitor
  *
- * Registers an HMM mirror for the given BO at the specified address.
+ * Registers a mmu_notifier for the given BO at the specified address.
  * Returns 0 on success, -ERRNO if anything goes wrong.
  */
 int amdgpu_mn_register(struct amdgpu_bo *bo, unsigned long addr)
 {
-	unsigned long end = addr + amdgpu_bo_size(bo) - 1;
-	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
-	enum amdgpu_mn_type type =
-		bo->kfd_bo ? AMDGPU_MN_TYPE_HSA : AMDGPU_MN_TYPE_GFX;
-	struct amdgpu_mn *amn;
-	struct amdgpu_mn_node *node = NULL, *new_node;
-	struct list_head bos;
-	struct interval_tree_node *it;
-
-	amn = amdgpu_mn_get(adev, type);
-	if (IS_ERR(amn))
-		return PTR_ERR(amn);
-
-	new_node = kmalloc(sizeof(*new_node), GFP_KERNEL);
-	if (!new_node)
-		return -ENOMEM;
-
-	INIT_LIST_HEAD(&bos);
-
-	down_write(&amn->lock);
-
-	while ((it = interval_tree_iter_first(&amn->objects, addr, end))) {
-		kfree(node);
-		node = container_of(it, struct amdgpu_mn_node, it);
-		interval_tree_remove(&node->it, &amn->objects);
-		addr = min(it->start, addr);
-		end = max(it->last, end);
-		list_splice(&node->bos, &bos);
-	}
-
-	if (!node)
-		node = new_node;
+	if (bo->kfd_bo)
+		bo->notifier.ops = &amdgpu_mn_hsa_ops;
 	else
-		kfree(new_node);
-
-	bo->mn = amn;
-
-	node->it.start = addr;
-	node->it.last = end;
-	INIT_LIST_HEAD(&node->bos);
-	list_splice(&bos, &node->bos);
-	list_add(&bo->mn_list, &node->bos);
+		bo->notifier.ops = &amdgpu_mn_gfx_ops;
 
-	interval_tree_insert(&node->it, &amn->objects);
-
-	up_write(&amn->lock);
-
-	return 0;
+	return mmu_range_notifier_insert(&bo->notifier, addr,
+					 amdgpu_bo_size(bo), current->mm);
 }
 
 /**
- * amdgpu_mn_unregister - unregister a BO for HMM mirror updates
+ * amdgpu_mn_unregister - unregister a BO for notifier updates
  *
  * @bo: amdgpu buffer object
  *
- * Remove any registration of HMM mirror updates from the buffer object.
+ * Remove any registration of mmu notifier updates from the buffer object.
  */
 void amdgpu_mn_unregister(struct amdgpu_bo *bo)
 {
-	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
-	struct amdgpu_mn *amn;
-	struct list_head *head;
-
-	mutex_lock(&adev->mn_lock);
-
-	amn = bo->mn;
-	if (amn == NULL) {
-		mutex_unlock(&adev->mn_lock);
+	if (!bo->notifier.mm)
 		return;
-	}
-
-	down_write(&amn->lock);
-
-	/* save the next list entry for later */
-	head = bo->mn_list.next;
-
-	bo->mn = NULL;
-	list_del_init(&bo->mn_list);
-
-	if (list_empty(head)) {
-		struct amdgpu_mn_node *node;
-
-		node = container_of(head, struct amdgpu_mn_node, bos);
-		interval_tree_remove(&node->it, &amn->objects);
-		kfree(node);
-	}
-
-	up_write(&amn->lock);
-	mutex_unlock(&adev->mn_lock);
+	mmu_range_notifier_remove(&bo->notifier);
+	bo->notifier.mm = NULL;
 }
 
 /* flags used by HMM internal, not related to CPU/GPU PTE flags */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h
index b8ed68943625c2..d73ab2947b22b2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h
@@ -39,12 +39,10 @@ enum amdgpu_mn_type {
  * struct amdgpu_mn
  *
  * @adev: amdgpu device pointer
- * @mm: process address space
  * @type: type of MMU notifier
  * @work: destruction work item
  * @node: hash table node to find structure by adev and mn
  * @lock: rw semaphore protecting the notifier nodes
- * @objects: interval tree containing amdgpu_mn_nodes
  * @mirror: HMM mirror function support
  *
  * Data for each amdgpu device and process address space.
@@ -52,7 +50,6 @@ enum amdgpu_mn_type {
 struct amdgpu_mn {
 	/* constant after initialisation */
 	struct amdgpu_device	*adev;
-	struct mm_struct	*mm;
 	enum amdgpu_mn_type	type;
 
 	/* only used on destruction */
@@ -63,7 +60,6 @@ struct amdgpu_mn {
 
 	/* objects protected by lock */
 	struct rw_semaphore	lock;
-	struct rb_root_cached	objects;
 
 #ifdef CONFIG_HMM_MIRROR
 	/* HMM mirror */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
index 658f4c9779b704..4b44ab850f94c2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
@@ -30,6 +30,9 @@
 
 #include <drm/amdgpu_drm.h>
 #include "amdgpu.h"
+#ifdef CONFIG_MMU_NOTIFIER
+#include <linux/mmu_notifier.h>
+#endif
 
 #define AMDGPU_BO_INVALID_OFFSET	LONG_MAX
 #define AMDGPU_BO_MAX_PLACEMENTS	3
@@ -100,10 +103,12 @@ struct amdgpu_bo {
 	struct ttm_bo_kmap_obj		dma_buf_vmap;
 	struct amdgpu_mn		*mn;
 
-	union {
-		struct list_head	mn_list;
-		struct list_head	shadow_list;
-	};
+
+#ifdef CONFIG_MMU_NOTIFIER
+	struct mmu_range_notifier	notifier;
+#endif
+
+	struct list_head		shadow_list;
 
 	struct kgd_mem                  *kfd_bo;
 };
-- 
2.23.0


^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH v2 14/15] drm/amdgpu: Use mmu_range_notifier instead of hmm_mirror
  2019-10-28 20:10 [PATCH v2 00/15] Consolidate the mmu notifier interval_tree and locking Jason Gunthorpe
                   ` (12 preceding siblings ...)
  2019-10-28 20:10 ` [PATCH v2 13/15] drm/amdgpu: Use mmu_range_insert instead of hmm_mirror Jason Gunthorpe
@ 2019-10-28 20:10 ` " Jason Gunthorpe
  2019-10-29 19:22   ` Yang, Philip
  2019-10-28 20:10 ` [PATCH v2 15/15] mm/hmm: remove hmm_mirror and related Jason Gunthorpe
                   ` (2 subsequent siblings)
  16 siblings, 1 reply; 71+ messages in thread
From: Jason Gunthorpe @ 2019-10-28 20:10 UTC (permalink / raw)
  To: linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard, Felix.Kuehling
  Cc: linux-rdma, dri-devel, amd-gfx, Alex Deucher, Ben Skeggs,
	Boris Ostrovsky, Christian König, David Zhou,
	Dennis Dalessandro, Juergen Gross, Mike Marciniszyn,
	Oleksandr Andrushchenko, Petr Cvek, Stefano Stabellini, nouveau,
	xen-devel, Christoph Hellwig, Jason Gunthorpe

From: Jason Gunthorpe <jgg@mellanox.com>

Convert the collision-retry lock around hmm_range_fault to use the one now
provided by the mmu_range notifier.

Although this driver does not seem to use the collision retry lock that
hmm provides correctly, it can still be converted over to use the
mmu_range_notifier api instead of hmm_mirror without too much trouble.

This also deletes another place where a driver is associating additional
data (struct amdgpu_mn) with a mmu_struct.

Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: David (ChunMing) Zhou <David1.Zhou@amd.com>
Cc: amd-gfx@lists.freedesktop.org
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   4 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        |  14 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c        | 148 ++----------------
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h        |  49 ------
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |  76 ++++-----
 5 files changed, 66 insertions(+), 225 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 47700302a08b7f..1bcedb9b477dce 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -1738,6 +1738,10 @@ static int update_invalid_user_pages(struct amdkfd_process_info *process_info,
 			return ret;
 		}
 
+		/*
+		 * FIXME: Cannot ignore the return code, must hold
+		 * notifier_lock
+		 */
 		amdgpu_ttm_tt_get_user_pages_done(bo->tbo.ttm);
 
 		/* Mark the BO as valid unless it was invalidated
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 2e53feed40e230..76771f5f0b60ab 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -607,8 +607,6 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
 		e->tv.num_shared = 2;
 
 	amdgpu_bo_list_get_list(p->bo_list, &p->validated);
-	if (p->bo_list->first_userptr != p->bo_list->num_entries)
-		p->mn = amdgpu_mn_get(p->adev, AMDGPU_MN_TYPE_GFX);
 
 	INIT_LIST_HEAD(&duplicates);
 	amdgpu_vm_get_pd_bo(&fpriv->vm, &p->validated, &p->vm_pd);
@@ -1291,11 +1289,11 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
 	if (r)
 		goto error_unlock;
 
-	/* No memory allocation is allowed while holding the mn lock.
-	 * p->mn is hold until amdgpu_cs_submit is finished and fence is added
-	 * to BOs.
+	/* No memory allocation is allowed while holding the notifier lock.
+	 * The lock is held until amdgpu_cs_submit is finished and fence is
+	 * added to BOs.
 	 */
-	amdgpu_mn_lock(p->mn);
+	mutex_lock(&p->adev->notifier_lock);
 
 	/* If userptr are invalidated after amdgpu_cs_parser_bos(), return
 	 * -EAGAIN, drmIoctl in libdrm will restart the amdgpu_cs_ioctl.
@@ -1338,13 +1336,13 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
 	amdgpu_vm_move_to_lru_tail(p->adev, &fpriv->vm);
 
 	ttm_eu_fence_buffer_objects(&p->ticket, &p->validated, p->fence);
-	amdgpu_mn_unlock(p->mn);
+	mutex_unlock(&p->adev->notifier_lock);
 
 	return 0;
 
 error_abort:
 	drm_sched_job_cleanup(&job->base);
-	amdgpu_mn_unlock(p->mn);
+	mutex_unlock(&p->adev->notifier_lock);
 
 error_unlock:
 	amdgpu_job_free(job);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index 4ffd7b90f4d907..cb718a064eb491 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -50,28 +50,6 @@
 #include "amdgpu.h"
 #include "amdgpu_amdkfd.h"
 
-/**
- * amdgpu_mn_lock - take the write side lock for this notifier
- *
- * @mn: our notifier
- */
-void amdgpu_mn_lock(struct amdgpu_mn *mn)
-{
-	if (mn)
-		down_write(&mn->lock);
-}
-
-/**
- * amdgpu_mn_unlock - drop the write side lock for this notifier
- *
- * @mn: our notifier
- */
-void amdgpu_mn_unlock(struct amdgpu_mn *mn)
-{
-	if (mn)
-		up_write(&mn->lock);
-}
-
 /**
  * amdgpu_mn_invalidate_gfx - callback to notify about mm change
  *
@@ -82,12 +60,19 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
  * potentially dirty.
  */
 static bool amdgpu_mn_invalidate_gfx(struct mmu_range_notifier *mrn,
-				     const struct mmu_notifier_range *range)
+				     const struct mmu_notifier_range *range,
+				     unsigned long cur_seq)
 {
 	struct amdgpu_bo *bo = container_of(mrn, struct amdgpu_bo, notifier);
 	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
 	long r;
 
+	/*
+	 * FIXME: Must hold some lock shared with
+	 * amdgpu_ttm_tt_get_user_pages_done()
+	 */
+	mmu_range_set_seq(mrn, cur_seq);
+
 	/* FIXME: Is this necessary? */
 	if (!amdgpu_ttm_tt_affect_userptr(bo->tbo.ttm, range->start,
 					  range->end))
@@ -119,11 +104,18 @@ static const struct mmu_range_notifier_ops amdgpu_mn_gfx_ops = {
  * evicting all user-mode queues of the process.
  */
 static bool amdgpu_mn_invalidate_hsa(struct mmu_range_notifier *mrn,
-				     const struct mmu_notifier_range *range)
+				     const struct mmu_notifier_range *range,
+				     unsigned long cur_seq)
 {
 	struct amdgpu_bo *bo = container_of(mrn, struct amdgpu_bo, notifier);
 	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
 
+	/*
+	 * FIXME: Must hold some lock shared with
+	 * amdgpu_ttm_tt_get_user_pages_done()
+	 */
+	mmu_range_set_seq(mrn, cur_seq);
+
 	/* FIXME: Is this necessary? */
 	if (!amdgpu_ttm_tt_affect_userptr(bo->tbo.ttm, range->start,
 					  range->end))
@@ -143,92 +135,6 @@ static const struct mmu_range_notifier_ops amdgpu_mn_hsa_ops = {
 	.invalidate = amdgpu_mn_invalidate_hsa,
 };
 
-static int amdgpu_mn_sync_pagetables(struct hmm_mirror *mirror,
-				     const struct mmu_notifier_range *update)
-{
-	struct amdgpu_mn *amn = container_of(mirror, struct amdgpu_mn, mirror);
-
-	if (!mmu_notifier_range_blockable(update))
-		return false;
-
-	down_read(&amn->lock);
-	up_read(&amn->lock);
-	return 0;
-}
-
-/* Low bits of any reasonable mm pointer will be unused due to struct
- * alignment. Use these bits to make a unique key from the mm pointer
- * and notifier type.
- */
-#define AMDGPU_MN_KEY(mm, type) ((unsigned long)(mm) + (type))
-
-static struct hmm_mirror_ops amdgpu_hmm_mirror_ops[] = {
-	[AMDGPU_MN_TYPE_GFX] = {
-		.sync_cpu_device_pagetables = amdgpu_mn_sync_pagetables,
-	},
-	[AMDGPU_MN_TYPE_HSA] = {
-		.sync_cpu_device_pagetables = amdgpu_mn_sync_pagetables,
-	},
-};
-
-/**
- * amdgpu_mn_get - create HMM mirror context
- *
- * @adev: amdgpu device pointer
- * @type: type of MMU notifier context
- *
- * Creates a HMM mirror context for current->mm.
- */
-struct amdgpu_mn *amdgpu_mn_get(struct amdgpu_device *adev,
-				enum amdgpu_mn_type type)
-{
-	struct mm_struct *mm = current->mm;
-	struct amdgpu_mn *amn;
-	unsigned long key = AMDGPU_MN_KEY(mm, type);
-	int r;
-
-	mutex_lock(&adev->mn_lock);
-	if (down_write_killable(&mm->mmap_sem)) {
-		mutex_unlock(&adev->mn_lock);
-		return ERR_PTR(-EINTR);
-	}
-
-	hash_for_each_possible(adev->mn_hash, amn, node, key)
-		if (AMDGPU_MN_KEY(amn->mirror.hmm->mmu_notifier.mm,
-				  amn->type) == key)
-			goto release_locks;
-
-	amn = kzalloc(sizeof(*amn), GFP_KERNEL);
-	if (!amn) {
-		amn = ERR_PTR(-ENOMEM);
-		goto release_locks;
-	}
-
-	amn->adev = adev;
-	init_rwsem(&amn->lock);
-	amn->type = type;
-
-	amn->mirror.ops = &amdgpu_hmm_mirror_ops[type];
-	r = hmm_mirror_register(&amn->mirror, mm);
-	if (r)
-		goto free_amn;
-
-	hash_add(adev->mn_hash, &amn->node, AMDGPU_MN_KEY(mm, type));
-
-release_locks:
-	up_write(&mm->mmap_sem);
-	mutex_unlock(&adev->mn_lock);
-
-	return amn;
-
-free_amn:
-	up_write(&mm->mmap_sem);
-	mutex_unlock(&adev->mn_lock);
-	kfree(amn);
-
-	return ERR_PTR(r);
-}
-
 /**
  * amdgpu_mn_register - register a BO for notifier updates
  *
@@ -263,25 +169,3 @@ void amdgpu_mn_unregister(struct amdgpu_bo *bo)
 	mmu_range_notifier_remove(&bo->notifier);
 	bo->notifier.mm = NULL;
 }
-
-/* flags used by HMM internal, not related to CPU/GPU PTE flags */
-static const uint64_t hmm_range_flags[HMM_PFN_FLAG_MAX] = {
-		(1 << 0), /* HMM_PFN_VALID */
-		(1 << 1), /* HMM_PFN_WRITE */
-		0 /* HMM_PFN_DEVICE_PRIVATE */
-};
-
-static const uint64_t hmm_range_values[HMM_PFN_VALUE_MAX] = {
-		0xfffffffffffffffeUL, /* HMM_PFN_ERROR */
-		0, /* HMM_PFN_NONE */
-		0xfffffffffffffffcUL /* HMM_PFN_SPECIAL */
-};
-
-void amdgpu_hmm_init_range(struct hmm_range *range)
-{
-	if (range) {
-		range->flags = hmm_range_flags;
-		range->values = hmm_range_values;
-		range->pfn_shift = PAGE_SHIFT;
-	}
-}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h
index d73ab2947b22b2..a292238f75ebae 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h
@@ -30,59 +30,10 @@
 #include <linux/workqueue.h>
 #include <linux/interval_tree.h>
 
-enum amdgpu_mn_type {
-	AMDGPU_MN_TYPE_GFX,
-	AMDGPU_MN_TYPE_HSA,
-};
-
-/**
- * struct amdgpu_mn
- *
- * @adev: amdgpu device pointer
- * @type: type of MMU notifier
- * @work: destruction work item
- * @node: hash table node to find structure by adev and mn
- * @lock: rw semaphore protecting the notifier nodes
- * @mirror: HMM mirror function support
- *
- * Data for each amdgpu device and process address space.
- */
-struct amdgpu_mn {
-	/* constant after initialisation */
-	struct amdgpu_device	*adev;
-	enum amdgpu_mn_type	type;
-
-	/* only used on destruction */
-	struct work_struct	work;
-
-	/* protected by adev->mn_lock */
-	struct hlist_node	node;
-
-	/* objects protected by lock */
-	struct rw_semaphore	lock;
-
-#ifdef CONFIG_HMM_MIRROR
-	/* HMM mirror */
-	struct hmm_mirror	mirror;
-#endif
-};
-
 #if defined(CONFIG_HMM_MIRROR)
-void amdgpu_mn_lock(struct amdgpu_mn *mn);
-void amdgpu_mn_unlock(struct amdgpu_mn *mn);
-struct amdgpu_mn *amdgpu_mn_get(struct amdgpu_device *adev,
-				enum amdgpu_mn_type type);
 int amdgpu_mn_register(struct amdgpu_bo *bo, unsigned long addr);
 void amdgpu_mn_unregister(struct amdgpu_bo *bo);
-void amdgpu_hmm_init_range(struct hmm_range *range);
 #else
-static inline void amdgpu_mn_lock(struct amdgpu_mn *mn) {}
-static inline void amdgpu_mn_unlock(struct amdgpu_mn *mn) {}
-static inline struct amdgpu_mn *amdgpu_mn_get(struct amdgpu_device *adev,
-					      enum amdgpu_mn_type type)
-{
-	return NULL;
-}
 static inline int amdgpu_mn_register(struct amdgpu_bo *bo, unsigned long addr)
 {
 	DRM_WARN_ONCE("HMM_MIRROR kernel config option is not enabled, "
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index c0e41f1f0c2365..65d9824b54f2a9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -773,6 +773,20 @@ struct amdgpu_ttm_tt {
 #endif
 };
 
+#ifdef CONFIG_DRM_AMDGPU_USERPTR
+/* flags used by HMM internal, not related to CPU/GPU PTE flags */
+static const uint64_t hmm_range_flags[HMM_PFN_FLAG_MAX] = {
+	(1 << 0), /* HMM_PFN_VALID */
+	(1 << 1), /* HMM_PFN_WRITE */
+	0 /* HMM_PFN_DEVICE_PRIVATE */
+};
+
+static const uint64_t hmm_range_values[HMM_PFN_VALUE_MAX] = {
+	0xfffffffffffffffeUL, /* HMM_PFN_ERROR */
+	0, /* HMM_PFN_NONE */
+	0xfffffffffffffffcUL /* HMM_PFN_SPECIAL */
+};
+
 /**
  * amdgpu_ttm_tt_get_user_pages - get device accessible pages that back user
  * memory and start HMM tracking CPU page table update
@@ -780,29 +794,27 @@ struct amdgpu_ttm_tt {
  * Calling function must call amdgpu_ttm_tt_userptr_range_done() once and only
  * once afterwards to stop HMM tracking
  */
-#if IS_ENABLED(CONFIG_DRM_AMDGPU_USERPTR)
-
-#define MAX_RETRY_HMM_RANGE_FAULT	16
-
 int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
 {
-	struct hmm_mirror *mirror = bo->mn ? &bo->mn->mirror : NULL;
 	struct ttm_tt *ttm = bo->tbo.ttm;
 	struct amdgpu_ttm_tt *gtt = (void *)ttm;
 	struct mm_struct *mm;
+	struct hmm_range *range;
 	unsigned long start = gtt->userptr;
 	struct vm_area_struct *vma;
-	struct hmm_range *range;
 	unsigned long i;
-	uint64_t *pfns;
 	int r = 0;
 
-	if (unlikely(!mirror)) {
-		DRM_DEBUG_DRIVER("Failed to get hmm_mirror\n");
+	mm = bo->notifier.mm;
+	if (unlikely(!mm)) {
+		DRM_DEBUG_DRIVER("BO is not registered?\n");
 		return -EFAULT;
 	}
 
-	mm = mirror->hmm->mmu_notifier.mm;
+	/* Another get_user_pages is running at the same time?? */
+	if (WARN_ON(gtt->range))
+		return -EFAULT;
+
 	if (!mmget_not_zero(mm)) /* Happens during process shutdown */
 		return -ESRCH;
 
@@ -811,30 +823,24 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
 		r = -ENOMEM;
 		goto out;
 	}
+	range->notifier = &bo->notifier;
+	range->flags = hmm_range_flags;
+	range->values = hmm_range_values;
+	range->pfn_shift = PAGE_SHIFT;
+	range->start = bo->notifier.interval_tree.start;
+	range->end = bo->notifier.interval_tree.last + 1;
+	range->default_flags = hmm_range_flags[HMM_PFN_VALID];
+	if (!amdgpu_ttm_tt_is_readonly(ttm))
+		range->default_flags |= range->flags[HMM_PFN_WRITE];
 
-	pfns = kvmalloc_array(ttm->num_pages, sizeof(*pfns), GFP_KERNEL);
-	if (unlikely(!pfns)) {
+	range->pfns = kvmalloc_array(ttm->num_pages, sizeof(*range->pfns),
+				     GFP_KERNEL);
+	if (unlikely(!range->pfns)) {
 		r = -ENOMEM;
 		goto out_free_ranges;
 	}
 
-	amdgpu_hmm_init_range(range);
-	range->default_flags = range->flags[HMM_PFN_VALID];
-	range->default_flags |= amdgpu_ttm_tt_is_readonly(ttm) ?
-				0 : range->flags[HMM_PFN_WRITE];
-	range->pfn_flags_mask = 0;
-	range->pfns = pfns;
-	range->start = start;
-	range->end = start + ttm->num_pages * PAGE_SIZE;
-
-	hmm_range_register(range, mirror);
-
-	/*
-	 * Just wait for range to be valid, safe to ignore return value as we
-	 * will use the return value of hmm_range_fault() below under the
-	 * mmap_sem to ascertain the validity of the range.
-	 */
-	hmm_range_wait_until_valid(range, HMM_RANGE_DEFAULT_TIMEOUT);
+	range->notifier_seq = mmu_range_read_begin(&bo->notifier);
 
 	down_read(&mm->mmap_sem);
 	vma = find_vma(mm, start);
@@ -855,10 +861,10 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
 		goto out_free_pfns;
 
 	for (i = 0; i < ttm->num_pages; i++) {
-		pages[i] = hmm_device_entry_to_page(range, pfns[i]);
+		pages[i] = hmm_device_entry_to_page(range, range->pfns[i]);
 		if (unlikely(!pages[i])) {
 			pr_err("Page fault failed for pfn[%lu] = 0x%llx\n",
-			       i, pfns[i]);
+			       i, range->pfns[i]);
 			r = -ENOMEM;
 
 			goto out_free_pfns;
@@ -873,8 +879,7 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
 out_unlock:
 	up_read(&mm->mmap_sem);
 out_free_pfns:
-	hmm_range_unregister(range);
-	kvfree(pfns);
+	kvfree(range->pfns);
 out_free_ranges:
 	kfree(range);
 out:
@@ -903,9 +908,8 @@ bool amdgpu_ttm_tt_get_user_pages_done(struct ttm_tt *ttm)
 		"No user pages to check\n");
 
 	if (gtt->range) {
-		r = hmm_range_valid(gtt->range);
-		hmm_range_unregister(gtt->range);
-
+		r = mmu_range_read_retry(gtt->range->notifier,
+					 gtt->range->notifier_seq);
 		kvfree(gtt->range->pfns);
 		kfree(gtt->range);
 		gtt->range = NULL;
-- 
2.23.0


^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH v2 15/15] mm/hmm: remove hmm_mirror and related
  2019-10-28 20:10 [PATCH v2 00/15] Consolidate the mmu notifier interval_tree and locking Jason Gunthorpe
                   ` (13 preceding siblings ...)
  2019-10-28 20:10 ` [PATCH v2 14/15] drm/amdgpu: Use mmu_range_notifier " Jason Gunthorpe
@ 2019-10-28 20:10 ` Jason Gunthorpe
  2019-11-01 19:54 ` [PATCH v2 00/15] Consolidate the mmu notifier interval_tree and locking Jason Gunthorpe
  2019-11-01 20:54 ` Ralph Campbell
  16 siblings, 0 replies; 71+ messages in thread
From: Jason Gunthorpe @ 2019-10-28 20:10 UTC (permalink / raw)
  To: linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard, Felix.Kuehling
  Cc: linux-rdma, dri-devel, amd-gfx, Alex Deucher, Ben Skeggs,
	Boris Ostrovsky, Christian König, David Zhou,
	Dennis Dalessandro, Juergen Gross, Mike Marciniszyn,
	Oleksandr Andrushchenko, Petr Cvek, Stefano Stabellini, nouveau,
	xen-devel, Christoph Hellwig, Jason Gunthorpe

From: Jason Gunthorpe <jgg@mellanox.com>

The only two users of this are now converted to use mmu_range_notifier,
delete all the code and update hmm.rst.

Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 Documentation/vm/hmm.rst | 105 ++++-----------
 include/linux/hmm.h      | 183 +------------------------
 mm/Kconfig               |   1 -
 mm/hmm.c                 | 284 +--------------------------------------
 4 files changed, 33 insertions(+), 540 deletions(-)

diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst
index 0a5960beccf76d..a247643035c4e2 100644
--- a/Documentation/vm/hmm.rst
+++ b/Documentation/vm/hmm.rst
@@ -147,49 +147,16 @@ Address space mirroring implementation and API
 Address space mirroring's main objective is to allow duplication of a range of
 CPU page table into a device page table; HMM helps keep both synchronized. A
 device driver that wants to mirror a process address space must start with the
-registration of an hmm_mirror struct::
-
- int hmm_mirror_register(struct hmm_mirror *mirror,
-                         struct mm_struct *mm);
-
-The mirror struct has a set of callbacks that are used
-to propagate CPU page tables::
-
- struct hmm_mirror_ops {
-     /* release() - release hmm_mirror
-      *
-      * @mirror: pointer to struct hmm_mirror
-      *
-      * This is called when the mm_struct is being released.  The callback
-      * must ensure that all access to any pages obtained from this mirror
-      * is halted before the callback returns. All future access should
-      * fault.
-      */
-     void (*release)(struct hmm_mirror *mirror);
-
-     /* sync_cpu_device_pagetables() - synchronize page tables
-      *
-      * @mirror: pointer to struct hmm_mirror
-      * @update: update information (see struct mmu_notifier_range)
-      * Return: -EAGAIN if update.blockable false and callback need to
-      *         block, 0 otherwise.
-      *
-      * This callback ultimately originates from mmu_notifiers when the CPU
-      * page table is updated. The device driver must update its page table
-      * in response to this callback. The update argument tells what action
-      * to perform.
-      *
-      * The device driver must not return from this callback until the device
-      * page tables are completely updated (TLBs flushed, etc); this is a
-      * synchronous call.
-      */
-     int (*sync_cpu_device_pagetables)(struct hmm_mirror *mirror,
-                                       const struct hmm_update *update);
- };
-
-The device driver must perform the update action to the range (mark range
-read only, or fully unmap, etc.). The device must complete the update before
-the driver callback returns.
+registration of a mmu_range_notifier::
+
+ mrn->ops = &driver_ops;
+ int mmu_range_notifier_insert(struct mmu_range_notifier *mrn,
+			      unsigned long start, unsigned long length,
+			      struct mm_struct *mm);
+
+During the driver_ops->invalidate() callback the device driver must perform
+the update action to the range (mark range read only, or fully unmap,
+etc.). The device must complete the update before the driver callback returns.
 
 When the device driver wants to populate a range of virtual addresses, it can
 use::
@@ -216,70 +183,46 @@ The usage pattern is::
       struct hmm_range range;
       ...
 
+      range.notifier = &mrn;
       range.start = ...;
       range.end = ...;
       range.pfns = ...;
       range.flags = ...;
       range.values = ...;
       range.pfn_shift = ...;
-      hmm_range_register(&range, mirror);
 
-      /*
-       * Just wait for range to be valid, safe to ignore return value as we
-       * will use the return value of hmm_range_fault() below under the
-       * mmap_sem to ascertain the validity of the range.
-       */
-      hmm_range_wait_until_valid(&range, TIMEOUT_IN_MSEC);
+      if (!mmget_not_zero(mrn->notifier.mm))
+          return -EFAULT;
 
  again:
+      range.notifier_seq = mmu_range_read_begin(&mrn);
       down_read(&mm->mmap_sem);
       ret = hmm_range_fault(&range, HMM_RANGE_SNAPSHOT);
       if (ret) {
           up_read(&mm->mmap_sem);
-          if (ret == -EBUSY) {
-            /*
-             * No need to check hmm_range_wait_until_valid() return value
-             * on retry we will get proper error with hmm_range_fault()
-             */
-            hmm_range_wait_until_valid(&range, TIMEOUT_IN_MSEC);
-            goto again;
-          }
-          hmm_range_unregister(&range);
+          if (ret == -EBUSY)
+                 goto again;
           return ret;
       }
+      up_read(&mm->mmap_sem);
+
       take_lock(driver->update);
-      if (!hmm_range_valid(&range)) {
+      if (mmu_range_read_retry(&mrn, range.notifier_seq) {
           release_lock(driver->update);
-          up_read(&mm->mmap_sem);
           goto again;
       }
 
-      // Use pfns array content to update device page table
+      /* Use pfns array content to update device page table,
+       * under the update lock */
 
-      hmm_range_unregister(&range);
       release_lock(driver->update);
-      up_read(&mm->mmap_sem);
       return 0;
  }
 
 The driver->update lock is the same lock that the driver takes inside its
-sync_cpu_device_pagetables() callback. That lock must be held before calling
-hmm_range_valid() to avoid any race with a concurrent CPU page table update.
-
-HMM implements all this on top of the mmu_notifier API because we wanted a
-simpler API and also to be able to perform optimizations latter on like doing
-concurrent device updates in multi-devices scenario.
-
-HMM also serves as an impedance mismatch between how CPU page table updates
-are done (by CPU write to the page table and TLB flushes) and how devices
-update their own page table. Device updates are a multi-step process. First,
-appropriate commands are written to a buffer, then this buffer is scheduled for
-execution on the device. It is only once the device has executed commands in
-the buffer that the update is done. Creating and scheduling the update command
-buffer can happen concurrently for multiple devices. Waiting for each device to
-report commands as executed is serialized (there is no point in doing this
-concurrently).
-
+invalidate() callback. That lock must be held before calling
+mmu_range_read_retry() to avoid any race with a concurrent CPU page table
+update.
 
 Leverage default_flags and pfn_flags_mask
 =========================================
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 2666eb08a40615..b4af5173523232 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -68,29 +68,6 @@
 #include <linux/completion.h>
 #include <linux/mmu_notifier.h>
 
-
-/*
- * struct hmm - HMM per mm struct
- *
- * @mm: mm struct this HMM struct is bound to
- * @lock: lock protecting ranges list
- * @ranges: list of range being snapshotted
- * @mirrors: list of mirrors for this mm
- * @mmu_notifier: mmu notifier to track updates to CPU page table
- * @mirrors_sem: read/write semaphore protecting the mirrors list
- * @wq: wait queue for user waiting on a range invalidation
- * @notifiers: count of active mmu notifiers
- */
-struct hmm {
-	struct mmu_notifier	mmu_notifier;
-	spinlock_t		ranges_lock;
-	struct list_head	ranges;
-	struct list_head	mirrors;
-	struct rw_semaphore	mirrors_sem;
-	wait_queue_head_t	wq;
-	long			notifiers;
-};
-
 /*
  * hmm_pfn_flag_e - HMM flag enums
  *
@@ -143,9 +120,8 @@ enum hmm_pfn_value_e {
 /*
  * struct hmm_range - track invalidation lock on virtual address range
  *
- * @notifier: an optional mmu_range_notifier
- * @notifier_seq: when notifier is used this is the result of
- *                mmu_range_read_begin()
+ * @notifier: a mmu_range_notifier that includes the start/end
+ * @notifier_seq: result of mmu_range_read_begin()
  * @hmm: the core HMM structure this range is active against
  * @vma: the vm area struct for the range
  * @list: all range lock are on a list
@@ -162,8 +138,6 @@ enum hmm_pfn_value_e {
 struct hmm_range {
 	struct mmu_range_notifier *notifier;
 	unsigned long		notifier_seq;
-	struct hmm		*hmm;
-	struct list_head	list;
 	unsigned long		start;
 	unsigned long		end;
 	uint64_t		*pfns;
@@ -172,32 +146,8 @@ struct hmm_range {
 	uint64_t		default_flags;
 	uint64_t		pfn_flags_mask;
 	uint8_t			pfn_shift;
-	bool			valid;
 };
 
-/*
- * hmm_range_wait_until_valid() - wait for range to be valid
- * @range: range affected by invalidation to wait on
- * @timeout: time out for wait in ms (ie abort wait after that period of time)
- * Return: true if the range is valid, false otherwise.
- */
-static inline bool hmm_range_wait_until_valid(struct hmm_range *range,
-					      unsigned long timeout)
-{
-	return wait_event_timeout(range->hmm->wq, range->valid,
-				  msecs_to_jiffies(timeout)) != 0;
-}
-
-/*
- * hmm_range_valid() - test if a range is valid or not
- * @range: range
- * Return: true if the range is valid, false otherwise.
- */
-static inline bool hmm_range_valid(struct hmm_range *range)
-{
-	return range->valid;
-}
-
 /*
  * hmm_device_entry_to_page() - return struct page pointed to by a device entry
  * @range: range use to decode device entry value
@@ -267,111 +217,6 @@ static inline uint64_t hmm_device_entry_from_pfn(const struct hmm_range *range,
 		range->flags[HMM_PFN_VALID];
 }
 
-/*
- * Mirroring: how to synchronize device page table with CPU page table.
- *
- * A device driver that is participating in HMM mirroring must always
- * synchronize with CPU page table updates. For this, device drivers can either
- * directly use mmu_notifier APIs or they can use the hmm_mirror API. Device
- * drivers can decide to register one mirror per device per process, or just
- * one mirror per process for a group of devices. The pattern is:
- *
- *      int device_bind_address_space(..., struct mm_struct *mm, ...)
- *      {
- *          struct device_address_space *das;
- *
- *          // Device driver specific initialization, and allocation of das
- *          // which contains an hmm_mirror struct as one of its fields.
- *          ...
- *
- *          ret = hmm_mirror_register(&das->mirror, mm, &device_mirror_ops);
- *          if (ret) {
- *              // Cleanup on error
- *              return ret;
- *          }
- *
- *          // Other device driver specific initialization
- *          ...
- *      }
- *
- * Once an hmm_mirror is registered for an address space, the device driver
- * will get callbacks through sync_cpu_device_pagetables() operation (see
- * hmm_mirror_ops struct).
- *
- * Device driver must not free the struct containing the hmm_mirror struct
- * before calling hmm_mirror_unregister(). The expected usage is to do that when
- * the device driver is unbinding from an address space.
- *
- *
- *      void device_unbind_address_space(struct device_address_space *das)
- *      {
- *          // Device driver specific cleanup
- *          ...
- *
- *          hmm_mirror_unregister(&das->mirror);
- *
- *          // Other device driver specific cleanup, and now das can be freed
- *          ...
- *      }
- */
-
-struct hmm_mirror;
-
-/*
- * struct hmm_mirror_ops - HMM mirror device operations callback
- *
- * @update: callback to update range on a device
- */
-struct hmm_mirror_ops {
-	/* release() - release hmm_mirror
-	 *
-	 * @mirror: pointer to struct hmm_mirror
-	 *
-	 * This is called when the mm_struct is being released.  The callback
-	 * must ensure that all access to any pages obtained from this mirror
-	 * is halted before the callback returns. All future access should
-	 * fault.
-	 */
-	void (*release)(struct hmm_mirror *mirror);
-
-	/* sync_cpu_device_pagetables() - synchronize page tables
-	 *
-	 * @mirror: pointer to struct hmm_mirror
-	 * @update: update information (see struct mmu_notifier_range)
-	 * Return: -EAGAIN if mmu_notifier_range_blockable(update) is false
-	 * and callback needs to block, 0 otherwise.
-	 *
-	 * This callback ultimately originates from mmu_notifiers when the CPU
-	 * page table is updated. The device driver must update its page table
-	 * in response to this callback. The update argument tells what action
-	 * to perform.
-	 *
-	 * The device driver must not return from this callback until the device
-	 * page tables are completely updated (TLBs flushed, etc); this is a
-	 * synchronous call.
-	 */
-	int (*sync_cpu_device_pagetables)(
-		struct hmm_mirror *mirror,
-		const struct mmu_notifier_range *update);
-};
-
-/*
- * struct hmm_mirror - mirror struct for a device driver
- *
- * @hmm: pointer to struct hmm (which is unique per mm_struct)
- * @ops: device driver callback for HMM mirror operations
- * @list: for list of mirrors of a given mm
- *
- * Each address space (mm_struct) being mirrored by a device must register one
- * instance of an hmm_mirror struct with HMM. HMM will track the list of all
- * mirrors for each mm_struct.
- */
-struct hmm_mirror {
-	struct hmm			*hmm;
-	const struct hmm_mirror_ops	*ops;
-	struct list_head		list;
-};
-
 /*
  * Retry fault if non-blocking, drop mmap_sem and return -EAGAIN in that case.
  */
@@ -381,15 +226,9 @@ struct hmm_mirror {
 #define HMM_FAULT_SNAPSHOT		(1 << 1)
 
 #ifdef CONFIG_HMM_MIRROR
-int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
-void hmm_mirror_unregister(struct hmm_mirror *mirror);
-
 /*
  * Please see Documentation/vm/hmm.rst for how to use the range API.
  */
-int hmm_range_register(struct hmm_range *range, struct hmm_mirror *mirror);
-void hmm_range_unregister(struct hmm_range *range);
-
 long hmm_range_fault(struct hmm_range *range, unsigned int flags);
 
 long hmm_range_dma_map(struct hmm_range *range,
@@ -401,24 +240,6 @@ long hmm_range_dma_unmap(struct hmm_range *range,
 			 dma_addr_t *daddrs,
 			 bool dirty);
 #else
-int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm)
-{
-	return -EOPNOTSUPP;
-}
-
-void hmm_mirror_unregister(struct hmm_mirror *mirror)
-{
-}
-
-int hmm_range_register(struct hmm_range *range, struct hmm_mirror *mirror)
-{
-	return -EOPNOTSUPP;
-}
-
-void hmm_range_unregister(struct hmm_range *range)
-{
-}
-
 static inline long hmm_range_fault(struct hmm_range *range, unsigned int flags)
 {
 	return -EOPNOTSUPP;
diff --git a/mm/Kconfig b/mm/Kconfig
index d0b5046d9aeffd..e38ff1d5968dbf 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -675,7 +675,6 @@ config DEV_PAGEMAP_OPS
 config HMM_MIRROR
 	bool
 	depends on MMU
-	depends on MMU_NOTIFIER
 
 config DEVICE_PRIVATE
 	bool "Unaddressable device memory (GPU memory, ...)"
diff --git a/mm/hmm.c b/mm/hmm.c
index 22ac3595771feb..75d15a820e182e 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -26,193 +26,6 @@
 #include <linux/mmu_notifier.h>
 #include <linux/memory_hotplug.h>
 
-static struct mmu_notifier *hmm_alloc_notifier(struct mm_struct *mm)
-{
-	struct hmm *hmm;
-
-	hmm = kzalloc(sizeof(*hmm), GFP_KERNEL);
-	if (!hmm)
-		return ERR_PTR(-ENOMEM);
-
-	init_waitqueue_head(&hmm->wq);
-	INIT_LIST_HEAD(&hmm->mirrors);
-	init_rwsem(&hmm->mirrors_sem);
-	INIT_LIST_HEAD(&hmm->ranges);
-	spin_lock_init(&hmm->ranges_lock);
-	hmm->notifiers = 0;
-	return &hmm->mmu_notifier;
-}
-
-static void hmm_free_notifier(struct mmu_notifier *mn)
-{
-	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
-
-	WARN_ON(!list_empty(&hmm->ranges));
-	WARN_ON(!list_empty(&hmm->mirrors));
-	kfree(hmm);
-}
-
-static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
-{
-	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
-	struct hmm_mirror *mirror;
-
-	/*
-	 * Since hmm_range_register() holds the mmget() lock hmm_release() is
-	 * prevented as long as a range exists.
-	 */
-	WARN_ON(!list_empty_careful(&hmm->ranges));
-
-	down_read(&hmm->mirrors_sem);
-	list_for_each_entry(mirror, &hmm->mirrors, list) {
-		/*
-		 * Note: The driver is not allowed to trigger
-		 * hmm_mirror_unregister() from this thread.
-		 */
-		if (mirror->ops->release)
-			mirror->ops->release(mirror);
-	}
-	up_read(&hmm->mirrors_sem);
-}
-
-static void notifiers_decrement(struct hmm *hmm)
-{
-	unsigned long flags;
-
-	spin_lock_irqsave(&hmm->ranges_lock, flags);
-	hmm->notifiers--;
-	if (!hmm->notifiers) {
-		struct hmm_range *range;
-
-		list_for_each_entry(range, &hmm->ranges, list) {
-			if (range->valid)
-				continue;
-			range->valid = true;
-		}
-		wake_up_all(&hmm->wq);
-	}
-	spin_unlock_irqrestore(&hmm->ranges_lock, flags);
-}
-
-static int hmm_invalidate_range_start(struct mmu_notifier *mn,
-			const struct mmu_notifier_range *nrange)
-{
-	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
-	struct hmm_mirror *mirror;
-	struct hmm_range *range;
-	unsigned long flags;
-	int ret = 0;
-
-	spin_lock_irqsave(&hmm->ranges_lock, flags);
-	hmm->notifiers++;
-	list_for_each_entry(range, &hmm->ranges, list) {
-		if (nrange->end < range->start || nrange->start >= range->end)
-			continue;
-
-		range->valid = false;
-	}
-	spin_unlock_irqrestore(&hmm->ranges_lock, flags);
-
-	if (mmu_notifier_range_blockable(nrange))
-		down_read(&hmm->mirrors_sem);
-	else if (!down_read_trylock(&hmm->mirrors_sem)) {
-		ret = -EAGAIN;
-		goto out;
-	}
-
-	list_for_each_entry(mirror, &hmm->mirrors, list) {
-		int rc;
-
-		rc = mirror->ops->sync_cpu_device_pagetables(mirror, nrange);
-		if (rc) {
-			if (WARN_ON(mmu_notifier_range_blockable(nrange) ||
-			    rc != -EAGAIN))
-				continue;
-			ret = -EAGAIN;
-			break;
-		}
-	}
-	up_read(&hmm->mirrors_sem);
-
-out:
-	if (ret)
-		notifiers_decrement(hmm);
-	return ret;
-}
-
-static void hmm_invalidate_range_end(struct mmu_notifier *mn,
-			const struct mmu_notifier_range *nrange)
-{
-	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
-
-	notifiers_decrement(hmm);
-}
-
-static const struct mmu_notifier_ops hmm_mmu_notifier_ops = {
-	.release		= hmm_release,
-	.invalidate_range_start	= hmm_invalidate_range_start,
-	.invalidate_range_end	= hmm_invalidate_range_end,
-	.alloc_notifier		= hmm_alloc_notifier,
-	.free_notifier		= hmm_free_notifier,
-};
-
-/*
- * hmm_mirror_register() - register a mirror against an mm
- *
- * @mirror: new mirror struct to register
- * @mm: mm to register against
- * Return: 0 on success, -ENOMEM if no memory, -EINVAL if invalid arguments
- *
- * To start mirroring a process address space, the device driver must register
- * an HMM mirror struct.
- *
- * The caller cannot unregister the hmm_mirror while any ranges are
- * registered.
- *
- * Callers using this function must put a call to mmu_notifier_synchronize()
- * in their module exit functions.
- */
-int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm)
-{
-	struct mmu_notifier *mn;
-
-	lockdep_assert_held_write(&mm->mmap_sem);
-
-	/* Sanity check */
-	if (!mm || !mirror || !mirror->ops)
-		return -EINVAL;
-
-	mn = mmu_notifier_get_locked(&hmm_mmu_notifier_ops, mm);
-	if (IS_ERR(mn))
-		return PTR_ERR(mn);
-	mirror->hmm = container_of(mn, struct hmm, mmu_notifier);
-
-	down_write(&mirror->hmm->mirrors_sem);
-	list_add(&mirror->list, &mirror->hmm->mirrors);
-	up_write(&mirror->hmm->mirrors_sem);
-
-	return 0;
-}
-EXPORT_SYMBOL(hmm_mirror_register);
-
-/*
- * hmm_mirror_unregister() - unregister a mirror
- *
- * @mirror: mirror struct to unregister
- *
- * Stop mirroring a process address space, and cleanup.
- */
-void hmm_mirror_unregister(struct hmm_mirror *mirror)
-{
-	struct hmm *hmm = mirror->hmm;
-
-	down_write(&hmm->mirrors_sem);
-	list_del(&mirror->list);
-	up_write(&hmm->mirrors_sem);
-	mmu_notifier_put(&hmm->mmu_notifier);
-}
-EXPORT_SYMBOL(hmm_mirror_unregister);
-
 struct hmm_vma_walk {
 	struct hmm_range	*range;
 	struct dev_pagemap	*pgmap;
@@ -779,87 +592,6 @@ static void hmm_pfns_clear(struct hmm_range *range,
 		*pfns = range->values[HMM_PFN_NONE];
 }
 
-/*
- * hmm_range_register() - start tracking change to CPU page table over a range
- * @range: range
- * @mm: the mm struct for the range of virtual address
- *
- * Return: 0 on success, -EFAULT if the address space is no longer valid
- *
- * Track updates to the CPU page table see include/linux/hmm.h
- */
-int hmm_range_register(struct hmm_range *range, struct hmm_mirror *mirror)
-{
-	struct hmm *hmm = mirror->hmm;
-	unsigned long flags;
-
-	range->valid = false;
-	range->hmm = NULL;
-
-	if ((range->start & (PAGE_SIZE - 1)) || (range->end & (PAGE_SIZE - 1)))
-		return -EINVAL;
-	if (range->start >= range->end)
-		return -EINVAL;
-
-	/* Prevent hmm_release() from running while the range is valid */
-	if (!mmget_not_zero(hmm->mmu_notifier.mm))
-		return -EFAULT;
-
-	/* Initialize range to track CPU page table updates. */
-	spin_lock_irqsave(&hmm->ranges_lock, flags);
-
-	range->hmm = hmm;
-	list_add(&range->list, &hmm->ranges);
-
-	/*
-	 * If there are any concurrent notifiers we have to wait for them for
-	 * the range to be valid (see hmm_range_wait_until_valid()).
-	 */
-	if (!hmm->notifiers)
-		range->valid = true;
-	spin_unlock_irqrestore(&hmm->ranges_lock, flags);
-
-	return 0;
-}
-EXPORT_SYMBOL(hmm_range_register);
-
-/*
- * hmm_range_unregister() - stop tracking change to CPU page table over a range
- * @range: range
- *
- * Range struct is used to track updates to the CPU page table after a call to
- * hmm_range_register(). See include/linux/hmm.h for how to use it.
- */
-void hmm_range_unregister(struct hmm_range *range)
-{
-	struct hmm *hmm = range->hmm;
-	unsigned long flags;
-
-	spin_lock_irqsave(&hmm->ranges_lock, flags);
-	list_del_init(&range->list);
-	spin_unlock_irqrestore(&hmm->ranges_lock, flags);
-
-	/* Drop reference taken by hmm_range_register() */
-	mmput(hmm->mmu_notifier.mm);
-
-	/*
-	 * The range is now invalid and the ref on the hmm is dropped, so
-	 * poison the pointer.  Leave other fields in place, for the caller's
-	 * use.
-	 */
-	range->valid = false;
-	memset(&range->hmm, POISON_INUSE, sizeof(range->hmm));
-}
-EXPORT_SYMBOL(hmm_range_unregister);
-
-static bool needs_retry(struct hmm_range *range)
-{
-	if (range->notifier)
-		return mmu_range_check_retry(range->notifier,
-					     range->notifier_seq);
-	return !range->valid;
-}
-
 static const struct mm_walk_ops hmm_walk_ops = {
 	.pud_entry	= hmm_vma_walk_pud,
 	.pmd_entry	= hmm_vma_walk_pmd,
@@ -900,20 +632,15 @@ long hmm_range_fault(struct hmm_range *range, unsigned int flags)
 	const unsigned long device_vma = VM_IO | VM_PFNMAP | VM_MIXEDMAP;
 	unsigned long start = range->start, end;
 	struct hmm_vma_walk hmm_vma_walk;
-	struct mm_struct *mm;
+	struct mm_struct *mm = range->notifier->mm;
 	struct vm_area_struct *vma;
 	int ret;
 
-	if (range->notifier)
-		mm = range->notifier->mm;
-	else
-		mm = range->hmm->mmu_notifier.mm;
-
 	lockdep_assert_held(&mm->mmap_sem);
 
 	do {
 		/* If range is no longer valid force retry. */
-		if (needs_retry(range))
+		if (mmu_range_check_retry(range->notifier, range->notifier_seq))
 			return -EBUSY;
 
 		vma = find_vma(mm, start);
@@ -946,7 +673,9 @@ long hmm_range_fault(struct hmm_range *range, unsigned int flags)
 			start = hmm_vma_walk.last;
 
 			/* Keep trying while the range is valid. */
-		} while (ret == -EBUSY && !needs_retry(range));
+		} while (ret == -EBUSY &&
+			 !mmu_range_check_retry(range->notifier,
+						range->notifier_seq));
 
 		if (ret) {
 			unsigned long i;
@@ -1004,7 +733,8 @@ long hmm_range_dma_map(struct hmm_range *range, struct device *device,
 			continue;
 
 		/* Check if range is being invalidated */
-		if (needs_retry(range)) {
+		if (mmu_range_check_retry(range->notifier,
+					  range->notifier_seq)) {
 			ret = -EBUSY;
 			goto unmap;
 		}
-- 
2.23.0


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 07/15] drm/radeon: use mmu_range_notifier_insert
  2019-10-28 20:10 ` [PATCH v2 07/15] drm/radeon: use mmu_range_notifier_insert Jason Gunthorpe
@ 2019-10-29  7:48   ` Koenig, Christian
  0 siblings, 0 replies; 71+ messages in thread
From: Koenig, Christian @ 2019-10-29  7:48 UTC (permalink / raw)
  To: Jason Gunthorpe, linux-mm, Jerome Glisse, Ralph Campbell,
	John Hubbard, Kuehling, Felix
  Cc: linux-rdma, dri-devel, amd-gfx, Deucher, Alexander, Ben Skeggs,
	Boris Ostrovsky, Zhou, David(ChunMing),
	Dennis Dalessandro, Juergen Gross, Mike Marciniszyn,
	Oleksandr Andrushchenko, Petr Cvek, Stefano Stabellini, nouveau,
	xen-devel, Christoph Hellwig, Jason Gunthorpe

Am 28.10.19 um 21:10 schrieb Jason Gunthorpe:
> From: Jason Gunthorpe <jgg@mellanox.com>
>
> The new API is an exact match for the needs of radeon.
>
> For some reason radeon tries to remove overlapping ranges from the
> interval tree, but interval trees (and mmu_range_notifier_insert)
> support overlapping ranges directly. Simply delete all this code.
>
> Since this driver is missing a invalidate_range_end callback, but
> still calls get_user_pages(), it cannot be correct against all races.
>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: David (ChunMing) Zhou <David1.Zhou@amd.com>
> Cc: amd-gfx@lists.freedesktop.org
> Cc: Petr Cvek <petrcvekcz@gmail.com>
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

Reviewed-by: Christian König <christian.koenig@amd.com>

> ---
>   drivers/gpu/drm/radeon/radeon.h    |   9 +-
>   drivers/gpu/drm/radeon/radeon_mn.c | 219 ++++++-----------------------
>   2 files changed, 52 insertions(+), 176 deletions(-)
>
> diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h
> index d59b004f669583..27959f3ace1152 100644
> --- a/drivers/gpu/drm/radeon/radeon.h
> +++ b/drivers/gpu/drm/radeon/radeon.h
> @@ -68,6 +68,10 @@
>   #include <linux/hashtable.h>
>   #include <linux/dma-fence.h>
>   
> +#ifdef CONFIG_MMU_NOTIFIER
> +#include <linux/mmu_notifier.h>
> +#endif
> +
>   #include <drm/ttm/ttm_bo_api.h>
>   #include <drm/ttm/ttm_bo_driver.h>
>   #include <drm/ttm/ttm_placement.h>
> @@ -509,8 +513,9 @@ struct radeon_bo {
>   	struct ttm_bo_kmap_obj		dma_buf_vmap;
>   	pid_t				pid;
>   
> -	struct radeon_mn		*mn;
> -	struct list_head		mn_list;
> +#ifdef CONFIG_MMU_NOTIFIER
> +	struct mmu_range_notifier	notifier;
> +#endif
>   };
>   #define gem_to_radeon_bo(gobj) container_of((gobj), struct radeon_bo, tbo.base)
>   
> diff --git a/drivers/gpu/drm/radeon/radeon_mn.c b/drivers/gpu/drm/radeon/radeon_mn.c
> index dbab9a3a969b9e..d3d41e20a64922 100644
> --- a/drivers/gpu/drm/radeon/radeon_mn.c
> +++ b/drivers/gpu/drm/radeon/radeon_mn.c
> @@ -36,131 +36,51 @@
>   
>   #include "radeon.h"
>   
> -struct radeon_mn {
> -	struct mmu_notifier	mn;
> -
> -	/* objects protected by lock */
> -	struct mutex		lock;
> -	struct rb_root_cached	objects;
> -};
> -
> -struct radeon_mn_node {
> -	struct interval_tree_node	it;
> -	struct list_head		bos;
> -};
> -
>   /**
> - * radeon_mn_invalidate_range_start - callback to notify about mm change
> + * radeon_mn_invalidate - callback to notify about mm change
>    *
>    * @mn: our notifier
> - * @mn: the mm this callback is about
> - * @start: start of updated range
> - * @end: end of updated range
> + * @range: the VMA under invalidation
>    *
>    * We block for all BOs between start and end to be idle and
>    * unmap them by move them into system domain again.
>    */
> -static int radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
> -				const struct mmu_notifier_range *range)
> +static bool radeon_mn_invalidate(struct mmu_range_notifier *mn,
> +				 const struct mmu_notifier_range *range,
> +				 unsigned long cur_seq)
>   {
> -	struct radeon_mn *rmn = container_of(mn, struct radeon_mn, mn);
> +	struct radeon_bo *bo = container_of(mn, struct radeon_bo, notifier);
>   	struct ttm_operation_ctx ctx = { false, false };
> -	struct interval_tree_node *it;
> -	unsigned long end;
> -	int ret = 0;
> -
> -	/* notification is exclusive, but interval is inclusive */
> -	end = range->end - 1;
> -
> -	/* TODO we should be able to split locking for interval tree and
> -	 * the tear down.
> -	 */
> -	if (mmu_notifier_range_blockable(range))
> -		mutex_lock(&rmn->lock);
> -	else if (!mutex_trylock(&rmn->lock))
> -		return -EAGAIN;
> -
> -	it = interval_tree_iter_first(&rmn->objects, range->start, end);
> -	while (it) {
> -		struct radeon_mn_node *node;
> -		struct radeon_bo *bo;
> -		long r;
> -
> -		if (!mmu_notifier_range_blockable(range)) {
> -			ret = -EAGAIN;
> -			goto out_unlock;
> -		}
> -
> -		node = container_of(it, struct radeon_mn_node, it);
> -		it = interval_tree_iter_next(it, range->start, end);
> +	long r;
>   
> -		list_for_each_entry(bo, &node->bos, mn_list) {
> +	if (!bo->tbo.ttm || bo->tbo.ttm->state != tt_bound)
> +		return true;
>   
> -			if (!bo->tbo.ttm || bo->tbo.ttm->state != tt_bound)
> -				continue;
> +	if (!mmu_notifier_range_blockable(range))
> +		return false;
>   
> -			r = radeon_bo_reserve(bo, true);
> -			if (r) {
> -				DRM_ERROR("(%ld) failed to reserve user bo\n", r);
> -				continue;
> -			}
> -
> -			r = dma_resv_wait_timeout_rcu(bo->tbo.base.resv,
> -				true, false, MAX_SCHEDULE_TIMEOUT);
> -			if (r <= 0)
> -				DRM_ERROR("(%ld) failed to wait for user bo\n", r);
> -
> -			radeon_ttm_placement_from_domain(bo, RADEON_GEM_DOMAIN_CPU);
> -			r = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);
> -			if (r)
> -				DRM_ERROR("(%ld) failed to validate user bo\n", r);
> -
> -			radeon_bo_unreserve(bo);
> -		}
> +	r = radeon_bo_reserve(bo, true);
> +	if (r) {
> +		DRM_ERROR("(%ld) failed to reserve user bo\n", r);
> +		return true;
>   	}
> -	
> -out_unlock:
> -	mutex_unlock(&rmn->lock);
> -
> -	return ret;
> -}
> -
> -static void radeon_mn_release(struct mmu_notifier *mn, struct mm_struct *mm)
> -{
> -	struct mmu_notifier_range range = {
> -		.mm = mm,
> -		.start = 0,
> -		.end = ULONG_MAX,
> -		.flags = 0,
> -		.event = MMU_NOTIFY_UNMAP,
> -	};
> -
> -	radeon_mn_invalidate_range_start(mn, &range);
> -}
> -
> -static struct mmu_notifier *radeon_mn_alloc_notifier(struct mm_struct *mm)
> -{
> -	struct radeon_mn *rmn;
>   
> -	rmn = kzalloc(sizeof(*rmn), GFP_KERNEL);
> -	if (!rmn)
> -		return ERR_PTR(-ENOMEM);
> +	r = dma_resv_wait_timeout_rcu(bo->tbo.base.resv, true, false,
> +				      MAX_SCHEDULE_TIMEOUT);
> +	if (r <= 0)
> +		DRM_ERROR("(%ld) failed to wait for user bo\n", r);
>   
> -	mutex_init(&rmn->lock);
> -	rmn->objects = RB_ROOT_CACHED;
> -	return &rmn->mn;
> -}
> +	radeon_ttm_placement_from_domain(bo, RADEON_GEM_DOMAIN_CPU);
> +	r = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);
> +	if (r)
> +		DRM_ERROR("(%ld) failed to validate user bo\n", r);
>   
> -static void radeon_mn_free_notifier(struct mmu_notifier *mn)
> -{
> -	kfree(container_of(mn, struct radeon_mn, mn));
> +	radeon_bo_unreserve(bo);
> +	return true;
>   }
>   
> -static const struct mmu_notifier_ops radeon_mn_ops = {
> -	.release = radeon_mn_release,
> -	.invalidate_range_start = radeon_mn_invalidate_range_start,
> -	.alloc_notifier = radeon_mn_alloc_notifier,
> -	.free_notifier = radeon_mn_free_notifier,
> +static const struct mmu_range_notifier_ops radeon_mn_ops = {
> +	.invalidate = radeon_mn_invalidate,
>   };
>   
>   /**
> @@ -174,51 +94,21 @@ static const struct mmu_notifier_ops radeon_mn_ops = {
>    */
>   int radeon_mn_register(struct radeon_bo *bo, unsigned long addr)
>   {
> -	unsigned long end = addr + radeon_bo_size(bo) - 1;
> -	struct mmu_notifier *mn;
> -	struct radeon_mn *rmn;
> -	struct radeon_mn_node *node = NULL;
> -	struct list_head bos;
> -	struct interval_tree_node *it;
> -
> -	mn = mmu_notifier_get(&radeon_mn_ops, current->mm);
> -	if (IS_ERR(mn))
> -		return PTR_ERR(mn);
> -	rmn = container_of(mn, struct radeon_mn, mn);
> -
> -	INIT_LIST_HEAD(&bos);
> -
> -	mutex_lock(&rmn->lock);
> -
> -	while ((it = interval_tree_iter_first(&rmn->objects, addr, end))) {
> -		kfree(node);
> -		node = container_of(it, struct radeon_mn_node, it);
> -		interval_tree_remove(&node->it, &rmn->objects);
> -		addr = min(it->start, addr);
> -		end = max(it->last, end);
> -		list_splice(&node->bos, &bos);
> -	}
> -
> -	if (!node) {
> -		node = kmalloc(sizeof(struct radeon_mn_node), GFP_KERNEL);
> -		if (!node) {
> -			mutex_unlock(&rmn->lock);
> -			return -ENOMEM;
> -		}
> -	}
> -
> -	bo->mn = rmn;
> -
> -	node->it.start = addr;
> -	node->it.last = end;
> -	INIT_LIST_HEAD(&node->bos);
> -	list_splice(&bos, &node->bos);
> -	list_add(&bo->mn_list, &node->bos);
> -
> -	interval_tree_insert(&node->it, &rmn->objects);
> -
> -	mutex_unlock(&rmn->lock);
> -
> +	int ret;
> +
> +	bo->notifier.ops = &radeon_mn_ops;
> +	ret = mmu_range_notifier_insert(&bo->notifier, addr, radeon_bo_size(bo),
> +					current->mm);
> +	if (ret)
> +		return ret;
> +
> +	/*
> +	 * FIXME: radeon appears to allow get_user_pages to run during
> +	 * invalidate_range_start/end, which is not a safe way to read the
> +	 * PTEs. It should use the mmu_range_read_begin() scheme around the
> +	 * get_user_pages to ensure that the PTEs are read properly
> +	 */
> +	mmu_range_read_begin(&bo->notifier);
>   	return 0;
>   }
>   
> @@ -231,27 +121,8 @@ int radeon_mn_register(struct radeon_bo *bo, unsigned long addr)
>    */
>   void radeon_mn_unregister(struct radeon_bo *bo)
>   {
> -	struct radeon_mn *rmn = bo->mn;
> -	struct list_head *head;
> -
> -	if (!rmn)
> +	if (!bo->notifier.mm)
>   		return;
> -
> -	mutex_lock(&rmn->lock);
> -	/* save the next list entry for later */
> -	head = bo->mn_list.next;
> -
> -	list_del(&bo->mn_list);
> -
> -	if (list_empty(head)) {
> -		struct radeon_mn_node *node;
> -		node = container_of(head, struct radeon_mn_node, bos);
> -		interval_tree_remove(&node->it, &rmn->objects);
> -		kfree(node);
> -	}
> -
> -	mutex_unlock(&rmn->lock);
> -
> -	mmu_notifier_put(&rmn->mn);
> -	bo->mn = NULL;
> +	mmu_range_notifier_remove(&bo->notifier);
> +	bo->notifier.mm = NULL;
>   }


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 12/15] drm/amdgpu: Call find_vma under mmap_sem
  2019-10-28 20:10 ` [PATCH v2 12/15] drm/amdgpu: Call find_vma under mmap_sem Jason Gunthorpe
@ 2019-10-29  7:49   ` Koenig, Christian
  2019-10-29 16:28   ` Kuehling, Felix
  1 sibling, 0 replies; 71+ messages in thread
From: Koenig, Christian @ 2019-10-29  7:49 UTC (permalink / raw)
  To: Jason Gunthorpe, linux-mm, Jerome Glisse, Ralph Campbell,
	John Hubbard, Kuehling, Felix
  Cc: linux-rdma, dri-devel, amd-gfx, Deucher, Alexander, Ben Skeggs,
	Boris Ostrovsky, Zhou, David(ChunMing),
	Dennis Dalessandro, Juergen Gross, Mike Marciniszyn,
	Oleksandr Andrushchenko, Petr Cvek, Stefano Stabellini, nouveau,
	xen-devel, Christoph Hellwig, Jason Gunthorpe

Am 28.10.19 um 21:10 schrieb Jason Gunthorpe:
> From: Jason Gunthorpe <jgg@mellanox.com>
>
> find_vma() must be called under the mmap_sem, reorganize this code to
> do the vma check after entering the lock.
>
> Further, fix the unlocked use of struct task_struct's mm, instead use
> the mm from hmm_mirror which has an active mm_grab. Also the mm_grab
> must be converted to a mm_get before acquiring mmap_sem or calling
> find_vma().
>
> Fixes: 66c45500bfdc ("drm/amdgpu: use new HMM APIs and helpers")
> Fixes: 0919195f2b0d ("drm/amdgpu: Enable amdgpu_ttm_tt_get_user_pages in worker threads")
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: David (ChunMing) Zhou <David1.Zhou@amd.com>
> Cc: amd-gfx@lists.freedesktop.org
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

Acked-by: Christian König <christian.koenig@amd.com>

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 37 ++++++++++++++-----------
>   1 file changed, 21 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> index dff41d0a85fe96..c0e41f1f0c2365 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> @@ -35,6 +35,7 @@
>   #include <linux/hmm.h>
>   #include <linux/pagemap.h>
>   #include <linux/sched/task.h>
> +#include <linux/sched/mm.h>
>   #include <linux/seq_file.h>
>   #include <linux/slab.h>
>   #include <linux/swap.h>
> @@ -788,7 +789,7 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
>   	struct hmm_mirror *mirror = bo->mn ? &bo->mn->mirror : NULL;
>   	struct ttm_tt *ttm = bo->tbo.ttm;
>   	struct amdgpu_ttm_tt *gtt = (void *)ttm;
> -	struct mm_struct *mm = gtt->usertask->mm;
> +	struct mm_struct *mm;
>   	unsigned long start = gtt->userptr;
>   	struct vm_area_struct *vma;
>   	struct hmm_range *range;
> @@ -796,25 +797,14 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
>   	uint64_t *pfns;
>   	int r = 0;
>   
> -	if (!mm) /* Happens during process shutdown */
> -		return -ESRCH;
> -
>   	if (unlikely(!mirror)) {
>   		DRM_DEBUG_DRIVER("Failed to get hmm_mirror\n");
> -		r = -EFAULT;
> -		goto out;
> +		return -EFAULT;
>   	}
>   
> -	vma = find_vma(mm, start);
> -	if (unlikely(!vma || start < vma->vm_start)) {
> -		r = -EFAULT;
> -		goto out;
> -	}
> -	if (unlikely((gtt->userflags & AMDGPU_GEM_USERPTR_ANONONLY) &&
> -		vma->vm_file)) {
> -		r = -EPERM;
> -		goto out;
> -	}
> +	mm = mirror->hmm->mmu_notifier.mm;
> +	if (!mmget_not_zero(mm)) /* Happens during process shutdown */
> +		return -ESRCH;
>   
>   	range = kzalloc(sizeof(*range), GFP_KERNEL);
>   	if (unlikely(!range)) {
> @@ -847,6 +837,17 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
>   	hmm_range_wait_until_valid(range, HMM_RANGE_DEFAULT_TIMEOUT);
>   
>   	down_read(&mm->mmap_sem);
> +	vma = find_vma(mm, start);
> +	if (unlikely(!vma || start < vma->vm_start)) {
> +		r = -EFAULT;
> +		goto out_unlock;
> +	}
> +	if (unlikely((gtt->userflags & AMDGPU_GEM_USERPTR_ANONONLY) &&
> +		vma->vm_file)) {
> +		r = -EPERM;
> +		goto out_unlock;
> +	}
> +
>   	r = hmm_range_fault(range, 0);
>   	up_read(&mm->mmap_sem);
>   
> @@ -865,15 +866,19 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
>   	}
>   
>   	gtt->range = range;
> +	mmput(mm);
>   
>   	return 0;
>   
> +out_unlock:
> +	up_read(&mm->mmap_sem);
>   out_free_pfns:
>   	hmm_range_unregister(range);
>   	kvfree(pfns);
>   out_free_ranges:
>   	kfree(range);
>   out:
> +	mmput(mm);
>   	return r;
>   }
>   


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 13/15] drm/amdgpu: Use mmu_range_insert instead of hmm_mirror
  2019-10-28 20:10 ` [PATCH v2 13/15] drm/amdgpu: Use mmu_range_insert instead of hmm_mirror Jason Gunthorpe
@ 2019-10-29  7:51   ` Koenig, Christian
  2019-10-29 13:59     ` Jason Gunthorpe
  2019-10-29 22:14   ` Kuehling, Felix
  1 sibling, 1 reply; 71+ messages in thread
From: Koenig, Christian @ 2019-10-29  7:51 UTC (permalink / raw)
  To: Jason Gunthorpe, linux-mm, Jerome Glisse, Ralph Campbell,
	John Hubbard, Kuehling, Felix
  Cc: linux-rdma, dri-devel, amd-gfx, Deucher, Alexander, Ben Skeggs,
	Boris Ostrovsky, Zhou, David(ChunMing),
	Dennis Dalessandro, Juergen Gross, Mike Marciniszyn,
	Oleksandr Andrushchenko, Petr Cvek, Stefano Stabellini, nouveau,
	xen-devel, Christoph Hellwig, Jason Gunthorpe

Am 28.10.19 um 21:10 schrieb Jason Gunthorpe:
> From: Jason Gunthorpe <jgg@mellanox.com>
>
> Remove the interval tree in the driver and rely on the tree maintained by
> the mmu_notifier for delivering mmu_notifier invalidation callbacks.
>
> For some reason amdgpu has a very complicated arrangement where it tries
> to prevent duplicate entries in the interval_tree, this is not necessary,
> each amdgpu_bo can be its own stand alone entry. interval_tree already
> allows duplicates and overlaps in the tree.
>
> Also, there is no need to remove entries upon a release callback, the
> mmu_range API safely allows objects to remain registered beyond the
> lifetime of the mm. The driver only has to stop touching the pages during
> release.
>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: David (ChunMing) Zhou <David1.Zhou@amd.com>
> Cc: amd-gfx@lists.freedesktop.org
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h           |   2 +
>   .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   5 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    |   1 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c        | 341 ++++--------------
>   drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h        |   4 -
>   drivers/gpu/drm/amd/amdgpu/amdgpu_object.h    |  13 +-
>   6 files changed, 84 insertions(+), 282 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index bd37df5dd6d048..60591a5d420021 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -1006,6 +1006,8 @@ struct amdgpu_device {
>   	struct mutex  lock_reset;
>   	struct amdgpu_doorbell_index doorbell_index;
>   
> +	struct mutex			notifier_lock;
> +
>   	int asic_reset_res;
>   	struct work_struct		xgmi_reset_work;
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> index 6d021ecc8d598f..47700302a08b7f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> @@ -481,8 +481,7 @@ static void remove_kgd_mem_from_kfd_bo_list(struct kgd_mem *mem,
>    *
>    * Returns 0 for success, negative errno for errors.
>    */
> -static int init_user_pages(struct kgd_mem *mem, struct mm_struct *mm,
> -			   uint64_t user_addr)
> +static int init_user_pages(struct kgd_mem *mem, uint64_t user_addr)
>   {
>   	struct amdkfd_process_info *process_info = mem->process_info;
>   	struct amdgpu_bo *bo = mem->bo;
> @@ -1195,7 +1194,7 @@ int amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu(
>   	add_kgd_mem_to_kfd_bo_list(*mem, avm->process_info, user_addr);
>   
>   	if (user_addr) {
> -		ret = init_user_pages(*mem, current->mm, user_addr);
> +		ret = init_user_pages(*mem, user_addr);
>   		if (ret)
>   			goto allocate_init_user_pages_failed;
>   	}
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 5a1939dbd4e3e6..38f97998aaddb2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2633,6 +2633,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>   	mutex_init(&adev->virt.vf_errors.lock);
>   	hash_init(adev->mn_hash);
>   	mutex_init(&adev->lock_reset);
> +	mutex_init(&adev->notifier_lock);
>   	mutex_init(&adev->virt.dpm_mutex);
>   	mutex_init(&adev->psp.mutex);
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> index 31d4deb5d29484..4ffd7b90f4d907 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> @@ -50,66 +50,6 @@
>   #include "amdgpu.h"
>   #include "amdgpu_amdkfd.h"
>   
> -/**
> - * struct amdgpu_mn_node
> - *
> - * @it: interval node defining start-last of the affected address range
> - * @bos: list of all BOs in the affected address range
> - *
> - * Manages all BOs which are affected of a certain range of address space.
> - */
> -struct amdgpu_mn_node {
> -	struct interval_tree_node	it;
> -	struct list_head		bos;
> -};
> -
> -/**
> - * amdgpu_mn_destroy - destroy the HMM mirror
> - *
> - * @work: previously sheduled work item
> - *
> - * Lazy destroys the notifier from a work item
> - */
> -static void amdgpu_mn_destroy(struct work_struct *work)
> -{
> -	struct amdgpu_mn *amn = container_of(work, struct amdgpu_mn, work);
> -	struct amdgpu_device *adev = amn->adev;
> -	struct amdgpu_mn_node *node, *next_node;
> -	struct amdgpu_bo *bo, *next_bo;
> -
> -	mutex_lock(&adev->mn_lock);
> -	down_write(&amn->lock);
> -	hash_del(&amn->node);
> -	rbtree_postorder_for_each_entry_safe(node, next_node,
> -					     &amn->objects.rb_root, it.rb) {
> -		list_for_each_entry_safe(bo, next_bo, &node->bos, mn_list) {
> -			bo->mn = NULL;
> -			list_del_init(&bo->mn_list);
> -		}
> -		kfree(node);
> -	}
> -	up_write(&amn->lock);
> -	mutex_unlock(&adev->mn_lock);
> -
> -	hmm_mirror_unregister(&amn->mirror);
> -	kfree(amn);
> -}
> -
> -/**
> - * amdgpu_hmm_mirror_release - callback to notify about mm destruction
> - *
> - * @mirror: the HMM mirror (mm) this callback is about
> - *
> - * Shedule a work item to lazy destroy HMM mirror.
> - */
> -static void amdgpu_hmm_mirror_release(struct hmm_mirror *mirror)
> -{
> -	struct amdgpu_mn *amn = container_of(mirror, struct amdgpu_mn, mirror);
> -
> -	INIT_WORK(&amn->work, amdgpu_mn_destroy);
> -	schedule_work(&amn->work);
> -}
> -
>   /**
>    * amdgpu_mn_lock - take the write side lock for this notifier
>    *
> @@ -133,157 +73,86 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
>   }
>   
>   /**
> - * amdgpu_mn_read_lock - take the read side lock for this notifier
> - *
> - * @amn: our notifier
> - */
> -static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
> -{
> -	if (blockable)
> -		down_read(&amn->lock);
> -	else if (!down_read_trylock(&amn->lock))
> -		return -EAGAIN;
> -
> -	return 0;
> -}
> -
> -/**
> - * amdgpu_mn_read_unlock - drop the read side lock for this notifier
> - *
> - * @amn: our notifier
> - */
> -static void amdgpu_mn_read_unlock(struct amdgpu_mn *amn)
> -{
> -	up_read(&amn->lock);
> -}
> -
> -/**
> - * amdgpu_mn_invalidate_node - unmap all BOs of a node
> + * amdgpu_mn_invalidate_gfx - callback to notify about mm change
>    *
> - * @node: the node with the BOs to unmap
> - * @start: start of address range affected
> - * @end: end of address range affected
> + * @mrn: the range (mm) is about to update
> + * @range: details on the invalidation
>    *
>    * Block for operations on BOs to finish and mark pages as accessed and
>    * potentially dirty.
>    */
> -static void amdgpu_mn_invalidate_node(struct amdgpu_mn_node *node,
> -				      unsigned long start,
> -				      unsigned long end)
> +static bool amdgpu_mn_invalidate_gfx(struct mmu_range_notifier *mrn,
> +				     const struct mmu_notifier_range *range)
>   {
> -	struct amdgpu_bo *bo;
> +	struct amdgpu_bo *bo = container_of(mrn, struct amdgpu_bo, notifier);
> +	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
>   	long r;
>   
> -	list_for_each_entry(bo, &node->bos, mn_list) {
> -
> -		if (!amdgpu_ttm_tt_affect_userptr(bo->tbo.ttm, start, end))
> -			continue;
> -
> -		r = dma_resv_wait_timeout_rcu(bo->tbo.base.resv,
> -			true, false, MAX_SCHEDULE_TIMEOUT);
> -		if (r <= 0)
> -			DRM_ERROR("(%ld) failed to wait for user bo\n", r);
> -	}
> +	/* FIXME: Is this necessary? */

Most likely not.

Christian.

> +	if (!amdgpu_ttm_tt_affect_userptr(bo->tbo.ttm, range->start,
> +					  range->end))
> +		return true;
> +
> +	if (!mmu_notifier_range_blockable(range))
> +		return false;
> +
> +	mutex_lock(&adev->notifier_lock);
> +	r = dma_resv_wait_timeout_rcu(bo->tbo.base.resv, true, false,
> +				      MAX_SCHEDULE_TIMEOUT);
> +	mutex_unlock(&adev->notifier_lock);
> +	if (r <= 0)
> +		DRM_ERROR("(%ld) failed to wait for user bo\n", r);
> +	return true;
>   }
>   
> +static const struct mmu_range_notifier_ops amdgpu_mn_gfx_ops = {
> +	.invalidate = amdgpu_mn_invalidate_gfx,
> +};
> +
>   /**
> - * amdgpu_mn_sync_pagetables_gfx - callback to notify about mm change
> + * amdgpu_mn_invalidate_hsa - callback to notify about mm change
>    *
> - * @mirror: the hmm_mirror (mm) is about to update
> - * @update: the update start, end address
> + * @mrn: the range (mm) is about to update
> + * @range: details on the invalidation
>    *
> - * Block for operations on BOs to finish and mark pages as accessed and
> - * potentially dirty.
> + * We temporarily evict the BO attached to this range. This necessitates
> + * evicting all user-mode queues of the process.
>    */
> -static int
> -amdgpu_mn_sync_pagetables_gfx(struct hmm_mirror *mirror,
> -			      const struct mmu_notifier_range *update)
> +static bool amdgpu_mn_invalidate_hsa(struct mmu_range_notifier *mrn,
> +				     const struct mmu_notifier_range *range)
>   {
> -	struct amdgpu_mn *amn = container_of(mirror, struct amdgpu_mn, mirror);
> -	unsigned long start = update->start;
> -	unsigned long end = update->end;
> -	bool blockable = mmu_notifier_range_blockable(update);
> -	struct interval_tree_node *it;
> -
> -	/* notification is exclusive, but interval is inclusive */
> -	end -= 1;
> -
> -	/* TODO we should be able to split locking for interval tree and
> -	 * amdgpu_mn_invalidate_node
> -	 */
> -	if (amdgpu_mn_read_lock(amn, blockable))
> -		return -EAGAIN;
> -
> -	it = interval_tree_iter_first(&amn->objects, start, end);
> -	while (it) {
> -		struct amdgpu_mn_node *node;
> -
> -		if (!blockable) {
> -			amdgpu_mn_read_unlock(amn);
> -			return -EAGAIN;
> -		}
> +	struct amdgpu_bo *bo = container_of(mrn, struct amdgpu_bo, notifier);
> +	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
>   
> -		node = container_of(it, struct amdgpu_mn_node, it);
> -		it = interval_tree_iter_next(it, start, end);
> +	/* FIXME: Is this necessary? */
> +	if (!amdgpu_ttm_tt_affect_userptr(bo->tbo.ttm, range->start,
> +					  range->end))
> +		return true;
>   
> -		amdgpu_mn_invalidate_node(node, start, end);
> -	}
> +	if (!mmu_notifier_range_blockable(range))
> +		return false;
>   
> -	amdgpu_mn_read_unlock(amn);
> +	mutex_lock(&adev->notifier_lock);
> +	amdgpu_amdkfd_evict_userptr(bo->kfd_bo, bo->notifier.mm);
> +	mutex_unlock(&adev->notifier_lock);
>   
> -	return 0;
> +	return true;
>   }
>   
> -/**
> - * amdgpu_mn_sync_pagetables_hsa - callback to notify about mm change
> - *
> - * @mirror: the hmm_mirror (mm) is about to update
> - * @update: the update start, end address
> - *
> - * We temporarily evict all BOs between start and end. This
> - * necessitates evicting all user-mode queues of the process. The BOs
> - * are restorted in amdgpu_mn_invalidate_range_end_hsa.
> - */
> -static int
> -amdgpu_mn_sync_pagetables_hsa(struct hmm_mirror *mirror,
> -			      const struct mmu_notifier_range *update)
> +static const struct mmu_range_notifier_ops amdgpu_mn_hsa_ops = {
> +	.invalidate = amdgpu_mn_invalidate_hsa,
> +};
> +
> +static int amdgpu_mn_sync_pagetables(struct hmm_mirror *mirror,
> +				     const struct mmu_notifier_range *update)
>   {
>   	struct amdgpu_mn *amn = container_of(mirror, struct amdgpu_mn, mirror);
> -	unsigned long start = update->start;
> -	unsigned long end = update->end;
> -	bool blockable = mmu_notifier_range_blockable(update);
> -	struct interval_tree_node *it;
>   
> -	/* notification is exclusive, but interval is inclusive */
> -	end -= 1;
> -
> -	if (amdgpu_mn_read_lock(amn, blockable))
> -		return -EAGAIN;
> -
> -	it = interval_tree_iter_first(&amn->objects, start, end);
> -	while (it) {
> -		struct amdgpu_mn_node *node;
> -		struct amdgpu_bo *bo;
> -
> -		if (!blockable) {
> -			amdgpu_mn_read_unlock(amn);
> -			return -EAGAIN;
> -		}
> -
> -		node = container_of(it, struct amdgpu_mn_node, it);
> -		it = interval_tree_iter_next(it, start, end);
> -
> -		list_for_each_entry(bo, &node->bos, mn_list) {
> -			struct kgd_mem *mem = bo->kfd_bo;
> -
> -			if (amdgpu_ttm_tt_affect_userptr(bo->tbo.ttm,
> -							 start, end))
> -				amdgpu_amdkfd_evict_userptr(mem, amn->mm);
> -		}
> -	}
> -
> -	amdgpu_mn_read_unlock(amn);
> +	if (!mmu_notifier_range_blockable(update))
> +		return false;
>   
> +	down_read(&amn->lock);
> +	up_read(&amn->lock);
>   	return 0;
>   }
>   
> @@ -295,12 +164,10 @@ amdgpu_mn_sync_pagetables_hsa(struct hmm_mirror *mirror,
>   
>   static struct hmm_mirror_ops amdgpu_hmm_mirror_ops[] = {
>   	[AMDGPU_MN_TYPE_GFX] = {
> -		.sync_cpu_device_pagetables = amdgpu_mn_sync_pagetables_gfx,
> -		.release = amdgpu_hmm_mirror_release
> +		.sync_cpu_device_pagetables = amdgpu_mn_sync_pagetables,
>   	},
>   	[AMDGPU_MN_TYPE_HSA] = {
> -		.sync_cpu_device_pagetables = amdgpu_mn_sync_pagetables_hsa,
> -		.release = amdgpu_hmm_mirror_release
> +		.sync_cpu_device_pagetables = amdgpu_mn_sync_pagetables,
>   	},
>   };
>   
> @@ -327,7 +194,8 @@ struct amdgpu_mn *amdgpu_mn_get(struct amdgpu_device *adev,
>   	}
>   
>   	hash_for_each_possible(adev->mn_hash, amn, node, key)
> -		if (AMDGPU_MN_KEY(amn->mm, amn->type) == key)
> +		if (AMDGPU_MN_KEY(amn->mirror.hmm->mmu_notifier.mm,
> +				  amn->type) == key)
>   			goto release_locks;
>   
>   	amn = kzalloc(sizeof(*amn), GFP_KERNEL);
> @@ -337,10 +205,8 @@ struct amdgpu_mn *amdgpu_mn_get(struct amdgpu_device *adev,
>   	}
>   
>   	amn->adev = adev;
> -	amn->mm = mm;
>   	init_rwsem(&amn->lock);
>   	amn->type = type;
> -	amn->objects = RB_ROOT_CACHED;
>   
>   	amn->mirror.ops = &amdgpu_hmm_mirror_ops[type];
>   	r = hmm_mirror_register(&amn->mirror, mm);
> @@ -369,100 +235,33 @@ struct amdgpu_mn *amdgpu_mn_get(struct amdgpu_device *adev,
>    * @bo: amdgpu buffer object
>    * @addr: userptr addr we should monitor
>    *
> - * Registers an HMM mirror for the given BO at the specified address.
> + * Registers a mmu_notifier for the given BO at the specified address.
>    * Returns 0 on success, -ERRNO if anything goes wrong.
>    */
>   int amdgpu_mn_register(struct amdgpu_bo *bo, unsigned long addr)
>   {
> -	unsigned long end = addr + amdgpu_bo_size(bo) - 1;
> -	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
> -	enum amdgpu_mn_type type =
> -		bo->kfd_bo ? AMDGPU_MN_TYPE_HSA : AMDGPU_MN_TYPE_GFX;
> -	struct amdgpu_mn *amn;
> -	struct amdgpu_mn_node *node = NULL, *new_node;
> -	struct list_head bos;
> -	struct interval_tree_node *it;
> -
> -	amn = amdgpu_mn_get(adev, type);
> -	if (IS_ERR(amn))
> -		return PTR_ERR(amn);
> -
> -	new_node = kmalloc(sizeof(*new_node), GFP_KERNEL);
> -	if (!new_node)
> -		return -ENOMEM;
> -
> -	INIT_LIST_HEAD(&bos);
> -
> -	down_write(&amn->lock);
> -
> -	while ((it = interval_tree_iter_first(&amn->objects, addr, end))) {
> -		kfree(node);
> -		node = container_of(it, struct amdgpu_mn_node, it);
> -		interval_tree_remove(&node->it, &amn->objects);
> -		addr = min(it->start, addr);
> -		end = max(it->last, end);
> -		list_splice(&node->bos, &bos);
> -	}
> -
> -	if (!node)
> -		node = new_node;
> +	if (bo->kfd_bo)
> +		bo->notifier.ops = &amdgpu_mn_hsa_ops;
>   	else
> -		kfree(new_node);
> -
> -	bo->mn = amn;
> -
> -	node->it.start = addr;
> -	node->it.last = end;
> -	INIT_LIST_HEAD(&node->bos);
> -	list_splice(&bos, &node->bos);
> -	list_add(&bo->mn_list, &node->bos);
> +		bo->notifier.ops = &amdgpu_mn_gfx_ops;
>   
> -	interval_tree_insert(&node->it, &amn->objects);
> -
> -	up_write(&amn->lock);
> -
> -	return 0;
> +	return mmu_range_notifier_insert(&bo->notifier, addr,
> +					 amdgpu_bo_size(bo), current->mm);
>   }
>   
>   /**
> - * amdgpu_mn_unregister - unregister a BO for HMM mirror updates
> + * amdgpu_mn_unregister - unregister a BO for notifier updates
>    *
>    * @bo: amdgpu buffer object
>    *
> - * Remove any registration of HMM mirror updates from the buffer object.
> + * Remove any registration of mmu notifier updates from the buffer object.
>    */
>   void amdgpu_mn_unregister(struct amdgpu_bo *bo)
>   {
> -	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
> -	struct amdgpu_mn *amn;
> -	struct list_head *head;
> -
> -	mutex_lock(&adev->mn_lock);
> -
> -	amn = bo->mn;
> -	if (amn == NULL) {
> -		mutex_unlock(&adev->mn_lock);
> +	if (!bo->notifier.mm)
>   		return;
> -	}
> -
> -	down_write(&amn->lock);
> -
> -	/* save the next list entry for later */
> -	head = bo->mn_list.next;
> -
> -	bo->mn = NULL;
> -	list_del_init(&bo->mn_list);
> -
> -	if (list_empty(head)) {
> -		struct amdgpu_mn_node *node;
> -
> -		node = container_of(head, struct amdgpu_mn_node, bos);
> -		interval_tree_remove(&node->it, &amn->objects);
> -		kfree(node);
> -	}
> -
> -	up_write(&amn->lock);
> -	mutex_unlock(&adev->mn_lock);
> +	mmu_range_notifier_remove(&bo->notifier);
> +	bo->notifier.mm = NULL;
>   }
>   
>   /* flags used by HMM internal, not related to CPU/GPU PTE flags */
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h
> index b8ed68943625c2..d73ab2947b22b2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h
> @@ -39,12 +39,10 @@ enum amdgpu_mn_type {
>    * struct amdgpu_mn
>    *
>    * @adev: amdgpu device pointer
> - * @mm: process address space
>    * @type: type of MMU notifier
>    * @work: destruction work item
>    * @node: hash table node to find structure by adev and mn
>    * @lock: rw semaphore protecting the notifier nodes
> - * @objects: interval tree containing amdgpu_mn_nodes
>    * @mirror: HMM mirror function support
>    *
>    * Data for each amdgpu device and process address space.
> @@ -52,7 +50,6 @@ enum amdgpu_mn_type {
>   struct amdgpu_mn {
>   	/* constant after initialisation */
>   	struct amdgpu_device	*adev;
> -	struct mm_struct	*mm;
>   	enum amdgpu_mn_type	type;
>   
>   	/* only used on destruction */
> @@ -63,7 +60,6 @@ struct amdgpu_mn {
>   
>   	/* objects protected by lock */
>   	struct rw_semaphore	lock;
> -	struct rb_root_cached	objects;
>   
>   #ifdef CONFIG_HMM_MIRROR
>   	/* HMM mirror */
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
> index 658f4c9779b704..4b44ab850f94c2 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.h
> @@ -30,6 +30,9 @@
>   
>   #include <drm/amdgpu_drm.h>
>   #include "amdgpu.h"
> +#ifdef CONFIG_MMU_NOTIFIER
> +#include <linux/mmu_notifier.h>
> +#endif
>   
>   #define AMDGPU_BO_INVALID_OFFSET	LONG_MAX
>   #define AMDGPU_BO_MAX_PLACEMENTS	3
> @@ -100,10 +103,12 @@ struct amdgpu_bo {
>   	struct ttm_bo_kmap_obj		dma_buf_vmap;
>   	struct amdgpu_mn		*mn;
>   
> -	union {
> -		struct list_head	mn_list;
> -		struct list_head	shadow_list;
> -	};
> +
> +#ifdef CONFIG_MMU_NOTIFIER
> +	struct mmu_range_notifier	notifier;
> +#endif
> +
> +	struct list_head		shadow_list;
>   
>   	struct kgd_mem                  *kfd_bo;
>   };


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 06/15] RDMA/hfi1: Use mmu_range_notifier_inset for user_exp_rcv
  2019-10-28 20:10 ` [PATCH v2 06/15] RDMA/hfi1: Use mmu_range_notifier_inset for user_exp_rcv Jason Gunthorpe
@ 2019-10-29 12:19   ` Dennis Dalessandro
  2019-10-29 12:51     ` Jason Gunthorpe
  0 siblings, 1 reply; 71+ messages in thread
From: Dennis Dalessandro @ 2019-10-29 12:19 UTC (permalink / raw)
  To: Jason Gunthorpe, linux-mm, Jerome Glisse, Ralph Campbell,
	John Hubbard, Felix.Kuehling
  Cc: linux-rdma, dri-devel, amd-gfx, Alex Deucher, Ben Skeggs,
	Boris Ostrovsky, Christian König, David Zhou, Juergen Gross,
	Mike Marciniszyn, Oleksandr Andrushchenko, Petr Cvek,
	Stefano Stabellini, nouveau, xen-devel, Christoph Hellwig,
	Jason Gunthorpe

On 10/28/2019 4:10 PM, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> This converts one of the two users of mmu_notifiers to use the new API.
> The conversion is fairly straightforward, however the existing use of
> notifiers here seems to be racey.
> 
> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
> Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

I tested v1, and replied to it [1]. I can re-test with this version if 
you like as well.

[1] https://marc.info/?l=linux-rdma&m=157235130606412&w=2

-Denny

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 06/15] RDMA/hfi1: Use mmu_range_notifier_inset for user_exp_rcv
  2019-10-29 12:19   ` Dennis Dalessandro
@ 2019-10-29 12:51     ` Jason Gunthorpe
  0 siblings, 0 replies; 71+ messages in thread
From: Jason Gunthorpe @ 2019-10-29 12:51 UTC (permalink / raw)
  To: Dennis Dalessandro
  Cc: linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard,
	Felix.Kuehling, linux-rdma, dri-devel, amd-gfx, Alex Deucher,
	Ben Skeggs, Boris Ostrovsky, Christian König, David Zhou,
	Juergen Gross, Mike Marciniszyn, Oleksandr Andrushchenko,
	Petr Cvek, Stefano Stabellini, nouveau, xen-devel,
	Christoph Hellwig

On Tue, Oct 29, 2019 at 08:19:20AM -0400, Dennis Dalessandro wrote:
> On 10/28/2019 4:10 PM, Jason Gunthorpe wrote:
> > From: Jason Gunthorpe <jgg@mellanox.com>
> > 
> > This converts one of the two users of mmu_notifiers to use the new API.
> > The conversion is fairly straightforward, however the existing use of
> > notifiers here seems to be racey.
> > 
> > Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
> > Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
> > Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> 
> I tested v1, and replied to it [1]. I can re-test with this version if you
> like as well.
> 
> [1] https://marc.info/?l=linux-rdma&m=157235130606412&w=2

I think it is fine, nothing really changed in v2, thanks

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 12/15] drm/amdgpu: Call find_vma under mmap_sem
  2019-10-29 16:28   ` Kuehling, Felix
@ 2019-10-29 13:07     ` Christian König
  2019-10-29 17:19     ` Jason Gunthorpe
  1 sibling, 0 replies; 71+ messages in thread
From: Christian König @ 2019-10-29 13:07 UTC (permalink / raw)
  To: Kuehling, Felix, Jason Gunthorpe, linux-mm, Jerome Glisse,
	Ralph Campbell, John Hubbard
  Cc: Juergen Gross, Zhou, David(ChunMing),
	Mike Marciniszyn, Stefano Stabellini, Oleksandr Andrushchenko,
	linux-rdma, nouveau, Dennis Dalessandro, amd-gfx,
	Christoph Hellwig, Jason Gunthorpe, dri-devel, Deucher,
	Alexander, xen-devel, Boris Ostrovsky, Petr Cvek, Koenig,
	Christian, Ben Skeggs

Am 29.10.19 um 17:28 schrieb Kuehling, Felix:
> On 2019-10-28 4:10 p.m., Jason Gunthorpe wrote:
>> From: Jason Gunthorpe <jgg@mellanox.com>
>>
>> find_vma() must be called under the mmap_sem, reorganize this code to
>> do the vma check after entering the lock.
>>
>> Further, fix the unlocked use of struct task_struct's mm, instead use
>> the mm from hmm_mirror which has an active mm_grab. Also the mm_grab
>> must be converted to a mm_get before acquiring mmap_sem or calling
>> find_vma().
>>
>> Fixes: 66c45500bfdc ("drm/amdgpu: use new HMM APIs and helpers")
>> Fixes: 0919195f2b0d ("drm/amdgpu: Enable amdgpu_ttm_tt_get_user_pages in worker threads")
>> Cc: Alex Deucher <alexander.deucher@amd.com>
>> Cc: Christian König <christian.koenig@amd.com>
>> Cc: David (ChunMing) Zhou <David1.Zhou@amd.com>
>> Cc: amd-gfx@lists.freedesktop.org
>> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> One question inline to confirm my understanding. Otherwise this patch is
>
> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
>
>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 37 ++++++++++++++-----------
>>    1 file changed, 21 insertions(+), 16 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> index dff41d0a85fe96..c0e41f1f0c2365 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> @@ -35,6 +35,7 @@
>>    #include <linux/hmm.h>
>>    #include <linux/pagemap.h>
>>    #include <linux/sched/task.h>
>> +#include <linux/sched/mm.h>
>>    #include <linux/seq_file.h>
>>    #include <linux/slab.h>
>>    #include <linux/swap.h>
>> @@ -788,7 +789,7 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
>>    	struct hmm_mirror *mirror = bo->mn ? &bo->mn->mirror : NULL;
>>    	struct ttm_tt *ttm = bo->tbo.ttm;
>>    	struct amdgpu_ttm_tt *gtt = (void *)ttm;
>> -	struct mm_struct *mm = gtt->usertask->mm;
>> +	struct mm_struct *mm;
>>    	unsigned long start = gtt->userptr;
>>    	struct vm_area_struct *vma;
>>    	struct hmm_range *range;
>> @@ -796,25 +797,14 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
>>    	uint64_t *pfns;
>>    	int r = 0;
>>    
>> -	if (!mm) /* Happens during process shutdown */
>> -		return -ESRCH;
>> -
>>    	if (unlikely(!mirror)) {
>>    		DRM_DEBUG_DRIVER("Failed to get hmm_mirror\n");
>> -		r = -EFAULT;
>> -		goto out;
>> +		return -EFAULT;
>>    	}
>>    
>> -	vma = find_vma(mm, start);
>> -	if (unlikely(!vma || start < vma->vm_start)) {
>> -		r = -EFAULT;
>> -		goto out;
>> -	}
>> -	if (unlikely((gtt->userflags & AMDGPU_GEM_USERPTR_ANONONLY) &&
>> -		vma->vm_file)) {
>> -		r = -EPERM;
>> -		goto out;
>> -	}
>> +	mm = mirror->hmm->mmu_notifier.mm;
>> +	if (!mmget_not_zero(mm)) /* Happens during process shutdown */
> This works because mirror->hmm->mmu_notifier holds an mmgrab reference
> to the mm? So the MM will not just go away, but if the mmget refcount is
> 0, it means the mm is marked for destruction and shouldn't be used any more.

Yes, exactly. That is a rather common pattern, one reference count for 
the functionality and one for the structure.

When the functionality is gone the structure might still be alive for 
some reason. TTM and a couple of other structures use the same approach.

Christian.

>
>
>> +		return -ESRCH;
>>    
>>    	range = kzalloc(sizeof(*range), GFP_KERNEL);
>>    	if (unlikely(!range)) {
>> @@ -847,6 +837,17 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
>>    	hmm_range_wait_until_valid(range, HMM_RANGE_DEFAULT_TIMEOUT);
>>    
>>    	down_read(&mm->mmap_sem);
>> +	vma = find_vma(mm, start);
>> +	if (unlikely(!vma || start < vma->vm_start)) {
>> +		r = -EFAULT;
>> +		goto out_unlock;
>> +	}
>> +	if (unlikely((gtt->userflags & AMDGPU_GEM_USERPTR_ANONONLY) &&
>> +		vma->vm_file)) {
>> +		r = -EPERM;
>> +		goto out_unlock;
>> +	}
>> +
>>    	r = hmm_range_fault(range, 0);
>>    	up_read(&mm->mmap_sem);
>>    
>> @@ -865,15 +866,19 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
>>    	}
>>    
>>    	gtt->range = range;
>> +	mmput(mm);
>>    
>>    	return 0;
>>    
>> +out_unlock:
>> +	up_read(&mm->mmap_sem);
>>    out_free_pfns:
>>    	hmm_range_unregister(range);
>>    	kvfree(pfns);
>>    out_free_ranges:
>>    	kfree(range);
>>    out:
>> +	mmput(mm);
>>    	return r;
>>    }
>>    
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 13/15] drm/amdgpu: Use mmu_range_insert instead of hmm_mirror
  2019-10-29  7:51   ` Koenig, Christian
@ 2019-10-29 13:59     ` Jason Gunthorpe
  0 siblings, 0 replies; 71+ messages in thread
From: Jason Gunthorpe @ 2019-10-29 13:59 UTC (permalink / raw)
  To: Koenig, Christian
  Cc: linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard, Kuehling,
	Felix, linux-rdma, dri-devel, amd-gfx, Deucher, Alexander,
	Ben Skeggs, Boris Ostrovsky, Zhou, David(ChunMing),
	Dennis Dalessandro, Juergen Gross, Mike Marciniszyn,
	Oleksandr Andrushchenko, Petr Cvek, Stefano Stabellini, nouveau,
	xen-devel, Christoph Hellwig

On Tue, Oct 29, 2019 at 07:51:30AM +0000, Koenig, Christian wrote:
> > +static bool amdgpu_mn_invalidate_gfx(struct mmu_range_notifier *mrn,
> > +				     const struct mmu_notifier_range *range)
> >   {
> > -	struct amdgpu_bo *bo;
> > +	struct amdgpu_bo *bo = container_of(mrn, struct amdgpu_bo, notifier);
> > +	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
> >   	long r;
> >   
> > -	list_for_each_entry(bo, &node->bos, mn_list) {
> > -
> > -		if (!amdgpu_ttm_tt_affect_userptr(bo->tbo.ttm, start, end))
> > -			continue;
> > -
> > -		r = dma_resv_wait_timeout_rcu(bo->tbo.base.resv,
> > -			true, false, MAX_SCHEDULE_TIMEOUT);
> > -		if (r <= 0)
> > -			DRM_ERROR("(%ld) failed to wait for user bo\n", r);
> > -	}
> > +	/* FIXME: Is this necessary? */
> 
> Most likely not.
> 
> Christian.
> 
> > +	if (!amdgpu_ttm_tt_affect_userptr(bo->tbo.ttm, range->start,
> > +					  range->end))
> > +		return true;

So is the bo->tbo.mem.num_pages == bo->tbo.ttm.num_pages always?

And userptr can't be zero here, or at least it doesn't matter if it is?

> > +static bool amdgpu_mn_invalidate_hsa(struct mmu_range_notifier *mrn,
> > +				     const struct mmu_notifier_range *range)
> >   {
> > -	struct amdgpu_mn *amn = container_of(mirror, struct amdgpu_mn, mirror);
> > -	unsigned long start = update->start;
> > -	unsigned long end = update->end;
> > -	bool blockable = mmu_notifier_range_blockable(update);
> > -	struct interval_tree_node *it;
> > -
> > -	/* notification is exclusive, but interval is inclusive */
> > -	end -= 1;
> > -
> > -	/* TODO we should be able to split locking for interval tree and
> > -	 * amdgpu_mn_invalidate_node
> > -	 */
> > -	if (amdgpu_mn_read_lock(amn, blockable))
> > -		return -EAGAIN;
> > -
> > -	it = interval_tree_iter_first(&amn->objects, start, end);
> > -	while (it) {
> > -		struct amdgpu_mn_node *node;
> > -
> > -		if (!blockable) {
> > -			amdgpu_mn_read_unlock(amn);
> > -			return -EAGAIN;
> > -		}
> > +	struct amdgpu_bo *bo = container_of(mrn, struct amdgpu_bo, notifier);
> > +	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
> >   
> > -		node = container_of(it, struct amdgpu_mn_node, it);
> > -		it = interval_tree_iter_next(it, start, end);
> > +	/* FIXME: Is this necessary? */
> > +	if (!amdgpu_ttm_tt_affect_userptr(bo->tbo.ttm, range->start,
> > +					  range->end))
> > +		return true;
> >   
> > -		amdgpu_mn_invalidate_node(node, start, end);
> > -	}

This one too right?

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 12/15] drm/amdgpu: Call find_vma under mmap_sem
  2019-10-28 20:10 ` [PATCH v2 12/15] drm/amdgpu: Call find_vma under mmap_sem Jason Gunthorpe
  2019-10-29  7:49   ` Koenig, Christian
@ 2019-10-29 16:28   ` Kuehling, Felix
  2019-10-29 13:07     ` Christian König
  2019-10-29 17:19     ` Jason Gunthorpe
  1 sibling, 2 replies; 71+ messages in thread
From: Kuehling, Felix @ 2019-10-29 16:28 UTC (permalink / raw)
  To: Jason Gunthorpe, linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard
  Cc: linux-rdma, dri-devel, amd-gfx, Deucher, Alexander, Ben Skeggs,
	Boris Ostrovsky, Koenig, Christian, Zhou, David(ChunMing),
	Dennis Dalessandro, Juergen Gross, Mike Marciniszyn,
	Oleksandr Andrushchenko, Petr Cvek, Stefano Stabellini, nouveau,
	xen-devel, Christoph Hellwig, Jason Gunthorpe

On 2019-10-28 4:10 p.m., Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
>
> find_vma() must be called under the mmap_sem, reorganize this code to
> do the vma check after entering the lock.
>
> Further, fix the unlocked use of struct task_struct's mm, instead use
> the mm from hmm_mirror which has an active mm_grab. Also the mm_grab
> must be converted to a mm_get before acquiring mmap_sem or calling
> find_vma().
>
> Fixes: 66c45500bfdc ("drm/amdgpu: use new HMM APIs and helpers")
> Fixes: 0919195f2b0d ("drm/amdgpu: Enable amdgpu_ttm_tt_get_user_pages in worker threads")
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: David (ChunMing) Zhou <David1.Zhou@amd.com>
> Cc: amd-gfx@lists.freedesktop.org
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

One question inline to confirm my understanding. Otherwise this patch is

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>


> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 37 ++++++++++++++-----------
>   1 file changed, 21 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> index dff41d0a85fe96..c0e41f1f0c2365 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> @@ -35,6 +35,7 @@
>   #include <linux/hmm.h>
>   #include <linux/pagemap.h>
>   #include <linux/sched/task.h>
> +#include <linux/sched/mm.h>
>   #include <linux/seq_file.h>
>   #include <linux/slab.h>
>   #include <linux/swap.h>
> @@ -788,7 +789,7 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
>   	struct hmm_mirror *mirror = bo->mn ? &bo->mn->mirror : NULL;
>   	struct ttm_tt *ttm = bo->tbo.ttm;
>   	struct amdgpu_ttm_tt *gtt = (void *)ttm;
> -	struct mm_struct *mm = gtt->usertask->mm;
> +	struct mm_struct *mm;
>   	unsigned long start = gtt->userptr;
>   	struct vm_area_struct *vma;
>   	struct hmm_range *range;
> @@ -796,25 +797,14 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
>   	uint64_t *pfns;
>   	int r = 0;
>   
> -	if (!mm) /* Happens during process shutdown */
> -		return -ESRCH;
> -
>   	if (unlikely(!mirror)) {
>   		DRM_DEBUG_DRIVER("Failed to get hmm_mirror\n");
> -		r = -EFAULT;
> -		goto out;
> +		return -EFAULT;
>   	}
>   
> -	vma = find_vma(mm, start);
> -	if (unlikely(!vma || start < vma->vm_start)) {
> -		r = -EFAULT;
> -		goto out;
> -	}
> -	if (unlikely((gtt->userflags & AMDGPU_GEM_USERPTR_ANONONLY) &&
> -		vma->vm_file)) {
> -		r = -EPERM;
> -		goto out;
> -	}
> +	mm = mirror->hmm->mmu_notifier.mm;
> +	if (!mmget_not_zero(mm)) /* Happens during process shutdown */

This works because mirror->hmm->mmu_notifier holds an mmgrab reference 
to the mm? So the MM will not just go away, but if the mmget refcount is 
0, it means the mm is marked for destruction and shouldn't be used any more.


> +		return -ESRCH;
>   
>   	range = kzalloc(sizeof(*range), GFP_KERNEL);
>   	if (unlikely(!range)) {
> @@ -847,6 +837,17 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
>   	hmm_range_wait_until_valid(range, HMM_RANGE_DEFAULT_TIMEOUT);
>   
>   	down_read(&mm->mmap_sem);
> +	vma = find_vma(mm, start);
> +	if (unlikely(!vma || start < vma->vm_start)) {
> +		r = -EFAULT;
> +		goto out_unlock;
> +	}
> +	if (unlikely((gtt->userflags & AMDGPU_GEM_USERPTR_ANONONLY) &&
> +		vma->vm_file)) {
> +		r = -EPERM;
> +		goto out_unlock;
> +	}
> +
>   	r = hmm_range_fault(range, 0);
>   	up_read(&mm->mmap_sem);
>   
> @@ -865,15 +866,19 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
>   	}
>   
>   	gtt->range = range;
> +	mmput(mm);
>   
>   	return 0;
>   
> +out_unlock:
> +	up_read(&mm->mmap_sem);
>   out_free_pfns:
>   	hmm_range_unregister(range);
>   	kvfree(pfns);
>   out_free_ranges:
>   	kfree(range);
>   out:
> +	mmput(mm);
>   	return r;
>   }
>   

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 12/15] drm/amdgpu: Call find_vma under mmap_sem
  2019-10-29 16:28   ` Kuehling, Felix
  2019-10-29 13:07     ` Christian König
@ 2019-10-29 17:19     ` Jason Gunthorpe
  1 sibling, 0 replies; 71+ messages in thread
From: Jason Gunthorpe @ 2019-10-29 17:19 UTC (permalink / raw)
  To: Kuehling, Felix
  Cc: linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard,
	linux-rdma, dri-devel, amd-gfx, Deucher, Alexander, Ben Skeggs,
	Boris Ostrovsky, Koenig, Christian, Zhou, David(ChunMing),
	Dennis Dalessandro, Juergen Gross, Mike Marciniszyn,
	Oleksandr Andrushchenko, Petr Cvek, Stefano Stabellini, nouveau,
	xen-devel, Christoph Hellwig

On Tue, Oct 29, 2019 at 04:28:43PM +0000, Kuehling, Felix wrote:
> On 2019-10-28 4:10 p.m., Jason Gunthorpe wrote:
> > From: Jason Gunthorpe <jgg@mellanox.com>
> >
> > find_vma() must be called under the mmap_sem, reorganize this code to
> > do the vma check after entering the lock.
> >
> > Further, fix the unlocked use of struct task_struct's mm, instead use
> > the mm from hmm_mirror which has an active mm_grab. Also the mm_grab
> > must be converted to a mm_get before acquiring mmap_sem or calling
> > find_vma().
> >
> > Fixes: 66c45500bfdc ("drm/amdgpu: use new HMM APIs and helpers")
> > Fixes: 0919195f2b0d ("drm/amdgpu: Enable amdgpu_ttm_tt_get_user_pages in worker threads")
> > Cc: Alex Deucher <alexander.deucher@amd.com>
> > Cc: Christian König <christian.koenig@amd.com>
> > Cc: David (ChunMing) Zhou <David1.Zhou@amd.com>
> > Cc: amd-gfx@lists.freedesktop.org
> > Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> 
> One question inline to confirm my understanding. Otherwise this patch is
> 
> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>

Thanks

> > -	if (unlikely((gtt->userflags & AMDGPU_GEM_USERPTR_ANONONLY) &&
> > -		vma->vm_file)) {
> > -		r = -EPERM;
> > -		goto out;
> > -	}
> > +	mm = mirror->hmm->mmu_notifier.mm;
> > +	if (!mmget_not_zero(mm)) /* Happens during process shutdown */
> 
> This works because mirror->hmm->mmu_notifier holds an mmgrab reference 
> to the mm?

Yes, this makes sure the mm pointer remains valid

> So the MM will not just go away, but if the mmget refcount is 0, it
> means the mm is marked for destruction and shouldn't be used any
> more.

Not just marked for destruction, but that another thread is
progressing or finished release().

The other detail here is that in general you can't get the mmap_sem
without also having a mmget as exit_mmap() does not lock the mmap_sem
in some places where it alters the datastructures. ie racing
find_vma() with exit_mmap() is not allowed.

This means we have to hold the mmget across the hmm_range_fault(), but
we can drop the mmget and then test mmu_range_read_retry() under the
driver lock. It will return true if the mmget refcount has gone to
zero in the mean time.

But I think this is probably a poor driver design, a driver should
just hold the mmget() until it has completed establishing the shadow
PTEs, as it is hard to see a reason not to..

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 14/15] drm/amdgpu: Use mmu_range_notifier instead of hmm_mirror
  2019-10-28 20:10 ` [PATCH v2 14/15] drm/amdgpu: Use mmu_range_notifier " Jason Gunthorpe
@ 2019-10-29 19:22   ` Yang, Philip
  2019-10-29 19:25     ` Jason Gunthorpe
  0 siblings, 1 reply; 71+ messages in thread
From: Yang, Philip @ 2019-10-29 19:22 UTC (permalink / raw)
  To: Jason Gunthorpe, linux-mm, Jerome Glisse, Ralph Campbell,
	John Hubbard, Kuehling, Felix
  Cc: Juergen Gross, Zhou, David(ChunMing),
	Mike Marciniszyn, Stefano Stabellini, Oleksandr Andrushchenko,
	linux-rdma, nouveau, Dennis Dalessandro, amd-gfx,
	Christoph Hellwig, Jason Gunthorpe, dri-devel, Deucher,
	Alexander, xen-devel, Boris Ostrovsky, Petr Cvek, Koenig,
	Christian, Ben Skeggs

Hi Jason,

I did quick test after merging amd-staging-drm-next with the 
mmu_notifier branch, which includes this set changes. The test result 
has different failures, app stuck intermittently, GUI no display etc. I 
am understanding the changes and will try to figure out the cause.

Regards,
Philip

On 2019-10-28 4:10 p.m., Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> Convert the collision-retry lock around hmm_range_fault to use the one now
> provided by the mmu_range notifier.
> 
> Although this driver does not seem to use the collision retry lock that
> hmm provides correctly, it can still be converted over to use the
> mmu_range_notifier api instead of hmm_mirror without too much trouble.
> 
> This also deletes another place where a driver is associating additional
> data (struct amdgpu_mn) with a mmu_struct.
> 
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: David (ChunMing) Zhou <David1.Zhou@amd.com>
> Cc: amd-gfx@lists.freedesktop.org
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> ---
>   .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   4 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        |  14 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c        | 148 ++----------------
>   drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h        |  49 ------
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       |  76 ++++-----
>   5 files changed, 66 insertions(+), 225 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> index 47700302a08b7f..1bcedb9b477dce 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
> @@ -1738,6 +1738,10 @@ static int update_invalid_user_pages(struct amdkfd_process_info *process_info,
>   			return ret;
>   		}
>   
> +		/*
> +		 * FIXME: Cannot ignore the return code, must hold
> +		 * notifier_lock
> +		 */
>   		amdgpu_ttm_tt_get_user_pages_done(bo->tbo.ttm);
>   
>   		/* Mark the BO as valid unless it was invalidated
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index 2e53feed40e230..76771f5f0b60ab 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -607,8 +607,6 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
>   		e->tv.num_shared = 2;
>   
>   	amdgpu_bo_list_get_list(p->bo_list, &p->validated);
> -	if (p->bo_list->first_userptr != p->bo_list->num_entries)
> -		p->mn = amdgpu_mn_get(p->adev, AMDGPU_MN_TYPE_GFX);
>   
>   	INIT_LIST_HEAD(&duplicates);
>   	amdgpu_vm_get_pd_bo(&fpriv->vm, &p->validated, &p->vm_pd);
> @@ -1291,11 +1289,11 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
>   	if (r)
>   		goto error_unlock;
>   
> -	/* No memory allocation is allowed while holding the mn lock.
> -	 * p->mn is hold until amdgpu_cs_submit is finished and fence is added
> -	 * to BOs.
> +	/* No memory allocation is allowed while holding the notifier lock.
> +	 * The lock is held until amdgpu_cs_submit is finished and fence is
> +	 * added to BOs.
>   	 */
> -	amdgpu_mn_lock(p->mn);
> +	mutex_lock(&p->adev->notifier_lock);
>   
>   	/* If userptr are invalidated after amdgpu_cs_parser_bos(), return
>   	 * -EAGAIN, drmIoctl in libdrm will restart the amdgpu_cs_ioctl.
> @@ -1338,13 +1336,13 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
>   	amdgpu_vm_move_to_lru_tail(p->adev, &fpriv->vm);
>   
>   	ttm_eu_fence_buffer_objects(&p->ticket, &p->validated, p->fence);
> -	amdgpu_mn_unlock(p->mn);
> +	mutex_unlock(&p->adev->notifier_lock);
>   
>   	return 0;
>   
>   error_abort:
>   	drm_sched_job_cleanup(&job->base);
> -	amdgpu_mn_unlock(p->mn);
> +	mutex_unlock(&p->adev->notifier_lock);
>   
>   error_unlock:
>   	amdgpu_job_free(job);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> index 4ffd7b90f4d907..cb718a064eb491 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> @@ -50,28 +50,6 @@
>   #include "amdgpu.h"
>   #include "amdgpu_amdkfd.h"
>   
> -/**
> - * amdgpu_mn_lock - take the write side lock for this notifier
> - *
> - * @mn: our notifier
> - */
> -void amdgpu_mn_lock(struct amdgpu_mn *mn)
> -{
> -	if (mn)
> -		down_write(&mn->lock);
> -}
> -
> -/**
> - * amdgpu_mn_unlock - drop the write side lock for this notifier
> - *
> - * @mn: our notifier
> - */
> -void amdgpu_mn_unlock(struct amdgpu_mn *mn)
> -{
> -	if (mn)
> -		up_write(&mn->lock);
> -}
> -
>   /**
>    * amdgpu_mn_invalidate_gfx - callback to notify about mm change
>    *
> @@ -82,12 +60,19 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
>    * potentially dirty.
>    */
>   static bool amdgpu_mn_invalidate_gfx(struct mmu_range_notifier *mrn,
> -				     const struct mmu_notifier_range *range)
> +				     const struct mmu_notifier_range *range,
> +				     unsigned long cur_seq)
>   {
>   	struct amdgpu_bo *bo = container_of(mrn, struct amdgpu_bo, notifier);
>   	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
>   	long r;
>   
> +	/*
> +	 * FIXME: Must hold some lock shared with
> +	 * amdgpu_ttm_tt_get_user_pages_done()
> +	 */
> +	mmu_range_set_seq(mrn, cur_seq);
> +
>   	/* FIXME: Is this necessary? */
>   	if (!amdgpu_ttm_tt_affect_userptr(bo->tbo.ttm, range->start,
>   					  range->end))
> @@ -119,11 +104,18 @@ static const struct mmu_range_notifier_ops amdgpu_mn_gfx_ops = {
>    * evicting all user-mode queues of the process.
>    */
>   static bool amdgpu_mn_invalidate_hsa(struct mmu_range_notifier *mrn,
> -				     const struct mmu_notifier_range *range)
> +				     const struct mmu_notifier_range *range,
> +				     unsigned long cur_seq)
>   {
>   	struct amdgpu_bo *bo = container_of(mrn, struct amdgpu_bo, notifier);
>   	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
>   
> +	/*
> +	 * FIXME: Must hold some lock shared with
> +	 * amdgpu_ttm_tt_get_user_pages_done()
> +	 */
> +	mmu_range_set_seq(mrn, cur_seq);
> +
>   	/* FIXME: Is this necessary? */
>   	if (!amdgpu_ttm_tt_affect_userptr(bo->tbo.ttm, range->start,
>   					  range->end))
> @@ -143,92 +135,6 @@ static const struct mmu_range_notifier_ops amdgpu_mn_hsa_ops = {
>   	.invalidate = amdgpu_mn_invalidate_hsa,
>   };
>   
> -static int amdgpu_mn_sync_pagetables(struct hmm_mirror *mirror,
> -				     const struct mmu_notifier_range *update)
> -{
> -	struct amdgpu_mn *amn = container_of(mirror, struct amdgpu_mn, mirror);
> -
> -	if (!mmu_notifier_range_blockable(update))
> -		return false;
> -
> -	down_read(&amn->lock);
> -	up_read(&amn->lock);
> -	return 0;
> -}
> -
> -/* Low bits of any reasonable mm pointer will be unused due to struct
> - * alignment. Use these bits to make a unique key from the mm pointer
> - * and notifier type.
> - */
> -#define AMDGPU_MN_KEY(mm, type) ((unsigned long)(mm) + (type))
> -
> -static struct hmm_mirror_ops amdgpu_hmm_mirror_ops[] = {
> -	[AMDGPU_MN_TYPE_GFX] = {
> -		.sync_cpu_device_pagetables = amdgpu_mn_sync_pagetables,
> -	},
> -	[AMDGPU_MN_TYPE_HSA] = {
> -		.sync_cpu_device_pagetables = amdgpu_mn_sync_pagetables,
> -	},
> -};
> -
> -/**
> - * amdgpu_mn_get - create HMM mirror context
> - *
> - * @adev: amdgpu device pointer
> - * @type: type of MMU notifier context
> - *
> - * Creates a HMM mirror context for current->mm.
> - */
> -struct amdgpu_mn *amdgpu_mn_get(struct amdgpu_device *adev,
> -				enum amdgpu_mn_type type)
> -{
> -	struct mm_struct *mm = current->mm;
> -	struct amdgpu_mn *amn;
> -	unsigned long key = AMDGPU_MN_KEY(mm, type);
> -	int r;
> -
> -	mutex_lock(&adev->mn_lock);
> -	if (down_write_killable(&mm->mmap_sem)) {
> -		mutex_unlock(&adev->mn_lock);
> -		return ERR_PTR(-EINTR);
> -	}
> -
> -	hash_for_each_possible(adev->mn_hash, amn, node, key)
> -		if (AMDGPU_MN_KEY(amn->mirror.hmm->mmu_notifier.mm,
> -				  amn->type) == key)
> -			goto release_locks;
> -
> -	amn = kzalloc(sizeof(*amn), GFP_KERNEL);
> -	if (!amn) {
> -		amn = ERR_PTR(-ENOMEM);
> -		goto release_locks;
> -	}
> -
> -	amn->adev = adev;
> -	init_rwsem(&amn->lock);
> -	amn->type = type;
> -
> -	amn->mirror.ops = &amdgpu_hmm_mirror_ops[type];
> -	r = hmm_mirror_register(&amn->mirror, mm);
> -	if (r)
> -		goto free_amn;
> -
> -	hash_add(adev->mn_hash, &amn->node, AMDGPU_MN_KEY(mm, type));
> -
> -release_locks:
> -	up_write(&mm->mmap_sem);
> -	mutex_unlock(&adev->mn_lock);
> -
> -	return amn;
> -
> -free_amn:
> -	up_write(&mm->mmap_sem);
> -	mutex_unlock(&adev->mn_lock);
> -	kfree(amn);
> -
> -	return ERR_PTR(r);
> -}
> -
>   /**
>    * amdgpu_mn_register - register a BO for notifier updates
>    *
> @@ -263,25 +169,3 @@ void amdgpu_mn_unregister(struct amdgpu_bo *bo)
>   	mmu_range_notifier_remove(&bo->notifier);
>   	bo->notifier.mm = NULL;
>   }
> -
> -/* flags used by HMM internal, not related to CPU/GPU PTE flags */
> -static const uint64_t hmm_range_flags[HMM_PFN_FLAG_MAX] = {
> -		(1 << 0), /* HMM_PFN_VALID */
> -		(1 << 1), /* HMM_PFN_WRITE */
> -		0 /* HMM_PFN_DEVICE_PRIVATE */
> -};
> -
> -static const uint64_t hmm_range_values[HMM_PFN_VALUE_MAX] = {
> -		0xfffffffffffffffeUL, /* HMM_PFN_ERROR */
> -		0, /* HMM_PFN_NONE */
> -		0xfffffffffffffffcUL /* HMM_PFN_SPECIAL */
> -};
> -
> -void amdgpu_hmm_init_range(struct hmm_range *range)
> -{
> -	if (range) {
> -		range->flags = hmm_range_flags;
> -		range->values = hmm_range_values;
> -		range->pfn_shift = PAGE_SHIFT;
> -	}
> -}
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h
> index d73ab2947b22b2..a292238f75ebae 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h
> @@ -30,59 +30,10 @@
>   #include <linux/workqueue.h>
>   #include <linux/interval_tree.h>
>   
> -enum amdgpu_mn_type {
> -	AMDGPU_MN_TYPE_GFX,
> -	AMDGPU_MN_TYPE_HSA,
> -};
> -
> -/**
> - * struct amdgpu_mn
> - *
> - * @adev: amdgpu device pointer
> - * @type: type of MMU notifier
> - * @work: destruction work item
> - * @node: hash table node to find structure by adev and mn
> - * @lock: rw semaphore protecting the notifier nodes
> - * @mirror: HMM mirror function support
> - *
> - * Data for each amdgpu device and process address space.
> - */
> -struct amdgpu_mn {
> -	/* constant after initialisation */
> -	struct amdgpu_device	*adev;
> -	enum amdgpu_mn_type	type;
> -
> -	/* only used on destruction */
> -	struct work_struct	work;
> -
> -	/* protected by adev->mn_lock */
> -	struct hlist_node	node;
> -
> -	/* objects protected by lock */
> -	struct rw_semaphore	lock;
> -
> -#ifdef CONFIG_HMM_MIRROR
> -	/* HMM mirror */
> -	struct hmm_mirror	mirror;
> -#endif
> -};
> -
>   #if defined(CONFIG_HMM_MIRROR)
> -void amdgpu_mn_lock(struct amdgpu_mn *mn);
> -void amdgpu_mn_unlock(struct amdgpu_mn *mn);
> -struct amdgpu_mn *amdgpu_mn_get(struct amdgpu_device *adev,
> -				enum amdgpu_mn_type type);
>   int amdgpu_mn_register(struct amdgpu_bo *bo, unsigned long addr);
>   void amdgpu_mn_unregister(struct amdgpu_bo *bo);
> -void amdgpu_hmm_init_range(struct hmm_range *range);
>   #else
> -static inline void amdgpu_mn_lock(struct amdgpu_mn *mn) {}
> -static inline void amdgpu_mn_unlock(struct amdgpu_mn *mn) {}
> -static inline struct amdgpu_mn *amdgpu_mn_get(struct amdgpu_device *adev,
> -					      enum amdgpu_mn_type type)
> -{
> -	return NULL;
> -}
>   static inline int amdgpu_mn_register(struct amdgpu_bo *bo, unsigned long addr)
>   {
>   	DRM_WARN_ONCE("HMM_MIRROR kernel config option is not enabled, "
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> index c0e41f1f0c2365..65d9824b54f2a9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
> @@ -773,6 +773,20 @@ struct amdgpu_ttm_tt {
>   #endif
>   };
>   
> +#ifdef CONFIG_DRM_AMDGPU_USERPTR
> +/* flags used by HMM internal, not related to CPU/GPU PTE flags */
> +static const uint64_t hmm_range_flags[HMM_PFN_FLAG_MAX] = {
> +	(1 << 0), /* HMM_PFN_VALID */
> +	(1 << 1), /* HMM_PFN_WRITE */
> +	0 /* HMM_PFN_DEVICE_PRIVATE */
> +};
> +
> +static const uint64_t hmm_range_values[HMM_PFN_VALUE_MAX] = {
> +	0xfffffffffffffffeUL, /* HMM_PFN_ERROR */
> +	0, /* HMM_PFN_NONE */
> +	0xfffffffffffffffcUL /* HMM_PFN_SPECIAL */
> +};
> +
>   /**
>    * amdgpu_ttm_tt_get_user_pages - get device accessible pages that back user
>    * memory and start HMM tracking CPU page table update
> @@ -780,29 +794,27 @@ struct amdgpu_ttm_tt {
>    * Calling function must call amdgpu_ttm_tt_userptr_range_done() once and only
>    * once afterwards to stop HMM tracking
>    */
> -#if IS_ENABLED(CONFIG_DRM_AMDGPU_USERPTR)
> -
> -#define MAX_RETRY_HMM_RANGE_FAULT	16
> -
>   int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
>   {
> -	struct hmm_mirror *mirror = bo->mn ? &bo->mn->mirror : NULL;
>   	struct ttm_tt *ttm = bo->tbo.ttm;
>   	struct amdgpu_ttm_tt *gtt = (void *)ttm;
>   	struct mm_struct *mm;
> +	struct hmm_range *range;
>   	unsigned long start = gtt->userptr;
>   	struct vm_area_struct *vma;
> -	struct hmm_range *range;
>   	unsigned long i;
> -	uint64_t *pfns;
>   	int r = 0;
>   
> -	if (unlikely(!mirror)) {
> -		DRM_DEBUG_DRIVER("Failed to get hmm_mirror\n");
> +	mm = bo->notifier.mm;
> +	if (unlikely(!mm)) {
> +		DRM_DEBUG_DRIVER("BO is not registered?\n");
>   		return -EFAULT;
>   	}
>   
> -	mm = mirror->hmm->mmu_notifier.mm;
> +	/* Another get_user_pages is running at the same time?? */
> +	if (WARN_ON(gtt->range))
> +		return -EFAULT;
> +
>   	if (!mmget_not_zero(mm)) /* Happens during process shutdown */
>   		return -ESRCH;
>   
> @@ -811,30 +823,24 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
>   		r = -ENOMEM;
>   		goto out;
>   	}
> +	range->notifier = &bo->notifier;
> +	range->flags = hmm_range_flags;
> +	range->values = hmm_range_values;
> +	range->pfn_shift = PAGE_SHIFT;
> +	range->start = bo->notifier.interval_tree.start;
> +	range->end = bo->notifier.interval_tree.last + 1;
> +	range->default_flags = hmm_range_flags[HMM_PFN_VALID];
> +	if (!amdgpu_ttm_tt_is_readonly(ttm))
> +		range->default_flags |= range->flags[HMM_PFN_WRITE];
>   
> -	pfns = kvmalloc_array(ttm->num_pages, sizeof(*pfns), GFP_KERNEL);
> -	if (unlikely(!pfns)) {
> +	range->pfns = kvmalloc_array(ttm->num_pages, sizeof(*range->pfns),
> +				     GFP_KERNEL);
> +	if (unlikely(!range->pfns)) {
>   		r = -ENOMEM;
>   		goto out_free_ranges;
>   	}
>   
> -	amdgpu_hmm_init_range(range);
> -	range->default_flags = range->flags[HMM_PFN_VALID];
> -	range->default_flags |= amdgpu_ttm_tt_is_readonly(ttm) ?
> -				0 : range->flags[HMM_PFN_WRITE];
> -	range->pfn_flags_mask = 0;
> -	range->pfns = pfns;
> -	range->start = start;
> -	range->end = start + ttm->num_pages * PAGE_SIZE;
> -
> -	hmm_range_register(range, mirror);
> -
> -	/*
> -	 * Just wait for range to be valid, safe to ignore return value as we
> -	 * will use the return value of hmm_range_fault() below under the
> -	 * mmap_sem to ascertain the validity of the range.
> -	 */
> -	hmm_range_wait_until_valid(range, HMM_RANGE_DEFAULT_TIMEOUT);
> +	range->notifier_seq = mmu_range_read_begin(&bo->notifier);
>   
>   	down_read(&mm->mmap_sem);
>   	vma = find_vma(mm, start);
> @@ -855,10 +861,10 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
>   		goto out_free_pfns;
>   
>   	for (i = 0; i < ttm->num_pages; i++) {
> -		pages[i] = hmm_device_entry_to_page(range, pfns[i]);
> +		pages[i] = hmm_device_entry_to_page(range, range->pfns[i]);
>   		if (unlikely(!pages[i])) {
>   			pr_err("Page fault failed for pfn[%lu] = 0x%llx\n",
> -			       i, pfns[i]);
> +			       i, range->pfns[i]);
>   			r = -ENOMEM;
>   
>   			goto out_free_pfns;
> @@ -873,8 +879,7 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
>   out_unlock:
>   	up_read(&mm->mmap_sem);
>   out_free_pfns:
> -	hmm_range_unregister(range);
> -	kvfree(pfns);
> +	kvfree(range->pfns);
>   out_free_ranges:
>   	kfree(range);
>   out:
> @@ -903,9 +908,8 @@ bool amdgpu_ttm_tt_get_user_pages_done(struct ttm_tt *ttm)
>   		"No user pages to check\n");
>   
>   	if (gtt->range) {
> -		r = hmm_range_valid(gtt->range);
> -		hmm_range_unregister(gtt->range);
> -
> +		r = mmu_range_read_retry(gtt->range->notifier,
> +					 gtt->range->notifier_seq);
>   		kvfree(gtt->range->pfns);
>   		kfree(gtt->range);
>   		gtt->range = NULL;
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 14/15] drm/amdgpu: Use mmu_range_notifier instead of hmm_mirror
  2019-10-29 19:22   ` Yang, Philip
@ 2019-10-29 19:25     ` Jason Gunthorpe
  2019-11-01 14:44       ` Yang, Philip
  0 siblings, 1 reply; 71+ messages in thread
From: Jason Gunthorpe @ 2019-10-29 19:25 UTC (permalink / raw)
  To: Yang, Philip
  Cc: linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard, Kuehling,
	Felix, Juergen Gross, Zhou, David(ChunMing),
	Mike Marciniszyn, Stefano Stabellini, Oleksandr Andrushchenko,
	linux-rdma, nouveau, Dennis Dalessandro, amd-gfx,
	Christoph Hellwig, dri-devel, Deucher, Alexander, xen-devel,
	Boris Ostrovsky, Petr Cvek, Koenig, Christian, Ben Skeggs

On Tue, Oct 29, 2019 at 07:22:37PM +0000, Yang, Philip wrote:
> Hi Jason,
> 
> I did quick test after merging amd-staging-drm-next with the 
> mmu_notifier branch, which includes this set changes. The test result 
> has different failures, app stuck intermittently, GUI no display etc. I 
> am understanding the changes and will try to figure out the cause.

Thanks! I'm not surprised by this given how difficult this patch was
to make. Let me know if I can assist in any way

Please ensure to run with lockdep enabled.. Your symptops sounds sort
of like deadlocking?

Regards,
Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 02/15] mm/mmu_notifier: add an interval tree notifier
  2019-10-28 20:10 ` [PATCH v2 02/15] mm/mmu_notifier: add an interval tree notifier Jason Gunthorpe
@ 2019-10-29 22:04   ` Kuehling, Felix
  2019-10-29 22:56     ` Jason Gunthorpe
  2019-11-07  0:23   ` John Hubbard
  1 sibling, 1 reply; 71+ messages in thread
From: Kuehling, Felix @ 2019-10-29 22:04 UTC (permalink / raw)
  To: Jason Gunthorpe, linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard
  Cc: linux-rdma, dri-devel, amd-gfx, Deucher, Alexander, Ben Skeggs,
	Boris Ostrovsky, Koenig, Christian, Zhou, David(ChunMing),
	Dennis Dalessandro, Juergen Gross, Mike Marciniszyn,
	Oleksandr Andrushchenko, Petr Cvek, Stefano Stabellini, nouveau,
	xen-devel, Christoph Hellwig, Jason Gunthorpe, Andrea Arcangeli,
	Michal Hocko

I haven't had enough time to fully understand the deferred logic in this 
change. I spotted one problem, see comments inline.

On 2019-10-28 4:10 p.m., Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
>
> Of the 13 users of mmu_notifiers, 8 of them use only
> invalidate_range_start/end() and immediately intersect the
> mmu_notifier_range with some kind of internal list of VAs.  4 use an
> interval tree (i915_gem, radeon_mn, umem_odp, hfi1). 4 use a linked list
> of some kind (scif_dma, vhost, gntdev, hmm)
>
> And the remaining 5 either don't use invalidate_range_start() or do some
> special thing with it.
>
> It turns out that building a correct scheme with an interval tree is
> pretty complicated, particularly if the use case is synchronizing against
> another thread doing get_user_pages().  Many of these implementations have
> various subtle and difficult to fix races.
>
> This approach puts the interval tree as common code at the top of the mmu
> notifier call tree and implements a shareable locking scheme.
>
> It includes:
>   - An interval tree tracking VA ranges, with per-range callbacks
>   - A read/write locking scheme for the interval tree that avoids
>     sleeping in the notifier path (for OOM killer)
>   - A sequence counter based collision-retry locking scheme to tell
>     device page fault that a VA range is being concurrently invalidated.
>
> This is based on various ideas:
> - hmm accumulates invalidated VA ranges and releases them when all
>    invalidates are done, via active_invalidate_ranges count.
>    This approach avoids having to intersect the interval tree twice (as
>    umem_odp does) at the potential cost of a longer device page fault.
>
> - kvm/umem_odp use a sequence counter to drive the collision retry,
>    via invalidate_seq
>
> - a deferred work todo list on unlock scheme like RTNL, via deferred_list.
>    This makes adding/removing interval tree members more deterministic
>
> - seqlock, except this version makes the seqlock idea multi-holder on the
>    write side by protecting it with active_invalidate_ranges and a spinlock
>
> To minimize MM overhead when only the interval tree is being used, the
> entire SRCU and hlist overheads are dropped using some simple
> branches. Similarly the interval tree overhead is dropped when in hlist
> mode.
>
> The overhead from the mandatory spinlock is broadly the same as most of
> existing users which already had a lock (or two) of some sort on the
> invalidation path.
>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Acked-by: Christian König <christian.koenig@amd.com>
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> ---
>   include/linux/mmu_notifier.h |  98 +++++++
>   mm/Kconfig                   |   1 +
>   mm/mmu_notifier.c            | 533 +++++++++++++++++++++++++++++++++--
>   3 files changed, 607 insertions(+), 25 deletions(-)
>
[snip]
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index 367670cfd02b7b..d02d3c8c223eb7 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
[snip]
>    * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap
> @@ -52,17 +286,24 @@ struct mmu_notifier_mm {
>    * can't go away from under us as exit_mmap holds an mm_count pin
>    * itself.
>    */
> -void __mmu_notifier_release(struct mm_struct *mm)
> +static void mn_hlist_release(struct mmu_notifier_mm *mmn_mm,
> +			     struct mm_struct *mm)
>   {
>   	struct mmu_notifier *mn;
>   	int id;
>   
> +	if (mmn_mm->has_interval)
> +		mn_itree_release(mmn_mm, mm);
> +
> +	if (hlist_empty(&mmn_mm->list))
> +		return;

This seems to duplicate the conditions in __mmu_notifier_release. See my 
comments below, I think one of them is wrong. I suspect this one, 
because __mmu_notifier_release follows the same pattern as the other 
notifiers.


> +
>   	/*
>   	 * SRCU here will block mmu_notifier_unregister until
>   	 * ->release returns.
>   	 */
>   	id = srcu_read_lock(&srcu);
> -	hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist)
> +	hlist_for_each_entry_rcu(mn, &mmn_mm->list, hlist)
>   		/*
>   		 * If ->release runs before mmu_notifier_unregister it must be
>   		 * handled, as it's the only way for the driver to flush all
> @@ -72,9 +313,9 @@ void __mmu_notifier_release(struct mm_struct *mm)
>   		if (mn->ops->release)
>   			mn->ops->release(mn, mm);
>   
> -	spin_lock(&mm->mmu_notifier_mm->lock);
> -	while (unlikely(!hlist_empty(&mm->mmu_notifier_mm->list))) {
> -		mn = hlist_entry(mm->mmu_notifier_mm->list.first,
> +	spin_lock(&mmn_mm->lock);
> +	while (unlikely(!hlist_empty(&mmn_mm->list))) {
> +		mn = hlist_entry(mmn_mm->list.first,
>   				 struct mmu_notifier,
>   				 hlist);
>   		/*
> @@ -85,7 +326,7 @@ void __mmu_notifier_release(struct mm_struct *mm)
>   		 */
>   		hlist_del_init_rcu(&mn->hlist);
>   	}
> -	spin_unlock(&mm->mmu_notifier_mm->lock);
> +	spin_unlock(&mmn_mm->lock);
>   	srcu_read_unlock(&srcu, id);
>   
>   	/*
> @@ -100,6 +341,17 @@ void __mmu_notifier_release(struct mm_struct *mm)
>   	synchronize_srcu(&srcu);
>   }
>   
> +void __mmu_notifier_release(struct mm_struct *mm)
> +{
> +	struct mmu_notifier_mm *mmn_mm = mm->mmu_notifier_mm;
> +
> +	if (mmn_mm->has_interval)
> +		mn_itree_release(mmn_mm, mm);

If mmn_mm->list is not empty, this will be done twice because 
mn_hlist_release duplicates this.


> +
> +	if (!hlist_empty(&mmn_mm->list))
> +		mn_hlist_release(mmn_mm, mm);

mn_hlist_release checks the same condition itself.


> +}
> +
>   /*
>    * If no young bitflag is supported by the hardware, ->clear_flush_young can
>    * unmap the address and return 1 or 0 depending if the mapping previously
[snip]

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 13/15] drm/amdgpu: Use mmu_range_insert instead of hmm_mirror
  2019-10-28 20:10 ` [PATCH v2 13/15] drm/amdgpu: Use mmu_range_insert instead of hmm_mirror Jason Gunthorpe
  2019-10-29  7:51   ` Koenig, Christian
@ 2019-10-29 22:14   ` Kuehling, Felix
  2019-10-29 23:09     ` Jason Gunthorpe
  1 sibling, 1 reply; 71+ messages in thread
From: Kuehling, Felix @ 2019-10-29 22:14 UTC (permalink / raw)
  To: Jason Gunthorpe, linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard
  Cc: linux-rdma, dri-devel, amd-gfx, Deucher, Alexander, Ben Skeggs,
	Boris Ostrovsky, Koenig, Christian, Zhou, David(ChunMing),
	Dennis Dalessandro, Juergen Gross, Mike Marciniszyn,
	Oleksandr Andrushchenko, Petr Cvek, Stefano Stabellini, nouveau,
	xen-devel, Christoph Hellwig, Jason Gunthorpe

On 2019-10-28 4:10 p.m., Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
>
> Remove the interval tree in the driver and rely on the tree maintained by
> the mmu_notifier for delivering mmu_notifier invalidation callbacks.
>
> For some reason amdgpu has a very complicated arrangement where it tries
> to prevent duplicate entries in the interval_tree, this is not necessary,
> each amdgpu_bo can be its own stand alone entry. interval_tree already
> allows duplicates and overlaps in the tree.
>
> Also, there is no need to remove entries upon a release callback, the
> mmu_range API safely allows objects to remain registered beyond the
> lifetime of the mm. The driver only has to stop touching the pages during
> release.
>
> Cc: Alex Deucher <alexander.deucher@amd.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: David (ChunMing) Zhou <David1.Zhou@amd.com>
> Cc: amd-gfx@lists.freedesktop.org
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h           |   2 +
>   .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   5 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    |   1 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c        | 341 ++++--------------
>   drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h        |   4 -
>   drivers/gpu/drm/amd/amdgpu/amdgpu_object.h    |  13 +-
>   6 files changed, 84 insertions(+), 282 deletions(-)
[snip]
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> index 31d4deb5d29484..4ffd7b90f4d907 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
[snip]
> @@ -50,66 +50,6 @@
>   #include "amdgpu.h"
>   #include "amdgpu_amdkfd.h"
>   
> -/**
> - * struct amdgpu_mn_node
> - *
> - * @it: interval node defining start-last of the affected address range
> - * @bos: list of all BOs in the affected address range
> - *
> - * Manages all BOs which are affected of a certain range of address space.
> - */
> -struct amdgpu_mn_node {
> -	struct interval_tree_node	it;
> -	struct list_head		bos;
> -};
> -
> -/**
> - * amdgpu_mn_destroy - destroy the HMM mirror
> - *
> - * @work: previously sheduled work item
> - *
> - * Lazy destroys the notifier from a work item
> - */
> -static void amdgpu_mn_destroy(struct work_struct *work)
> -{
> -	struct amdgpu_mn *amn = container_of(work, struct amdgpu_mn, work);
> -	struct amdgpu_device *adev = amn->adev;
> -	struct amdgpu_mn_node *node, *next_node;
> -	struct amdgpu_bo *bo, *next_bo;
> -
> -	mutex_lock(&adev->mn_lock);
> -	down_write(&amn->lock);
> -	hash_del(&amn->node);
> -	rbtree_postorder_for_each_entry_safe(node, next_node,
> -					     &amn->objects.rb_root, it.rb) {
> -		list_for_each_entry_safe(bo, next_bo, &node->bos, mn_list) {
> -			bo->mn = NULL;
> -			list_del_init(&bo->mn_list);
> -		}
> -		kfree(node);
> -	}
> -	up_write(&amn->lock);
> -	mutex_unlock(&adev->mn_lock);
> -
> -	hmm_mirror_unregister(&amn->mirror);
> -	kfree(amn);
> -}
> -
> -/**
> - * amdgpu_hmm_mirror_release - callback to notify about mm destruction
> - *
> - * @mirror: the HMM mirror (mm) this callback is about
> - *
> - * Shedule a work item to lazy destroy HMM mirror.
> - */
> -static void amdgpu_hmm_mirror_release(struct hmm_mirror *mirror)
> -{
> -	struct amdgpu_mn *amn = container_of(mirror, struct amdgpu_mn, mirror);
> -
> -	INIT_WORK(&amn->work, amdgpu_mn_destroy);
> -	schedule_work(&amn->work);
> -}
> -
>   /**
>    * amdgpu_mn_lock - take the write side lock for this notifier
>    *
> @@ -133,157 +73,86 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
>   }
>   
>   /**
> - * amdgpu_mn_read_lock - take the read side lock for this notifier
> - *
> - * @amn: our notifier
> - */
> -static int amdgpu_mn_read_lock(struct amdgpu_mn *amn, bool blockable)
> -{
> -	if (blockable)
> -		down_read(&amn->lock);
> -	else if (!down_read_trylock(&amn->lock))
> -		return -EAGAIN;
> -
> -	return 0;
> -}
> -
> -/**
> - * amdgpu_mn_read_unlock - drop the read side lock for this notifier
> - *
> - * @amn: our notifier
> - */
> -static void amdgpu_mn_read_unlock(struct amdgpu_mn *amn)
> -{
> -	up_read(&amn->lock);
> -}
> -
> -/**
> - * amdgpu_mn_invalidate_node - unmap all BOs of a node
> + * amdgpu_mn_invalidate_gfx - callback to notify about mm change
>    *
> - * @node: the node with the BOs to unmap
> - * @start: start of address range affected
> - * @end: end of address range affected
> + * @mrn: the range (mm) is about to update
> + * @range: details on the invalidation
>    *
>    * Block for operations on BOs to finish and mark pages as accessed and
>    * potentially dirty.
>    */
> -static void amdgpu_mn_invalidate_node(struct amdgpu_mn_node *node,
> -				      unsigned long start,
> -				      unsigned long end)
> +static bool amdgpu_mn_invalidate_gfx(struct mmu_range_notifier *mrn,
> +				     const struct mmu_notifier_range *range)
>   {
> -	struct amdgpu_bo *bo;
> +	struct amdgpu_bo *bo = container_of(mrn, struct amdgpu_bo, notifier);
> +	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
>   	long r;
>   
> -	list_for_each_entry(bo, &node->bos, mn_list) {
> -
> -		if (!amdgpu_ttm_tt_affect_userptr(bo->tbo.ttm, start, end))
> -			continue;
> -
> -		r = dma_resv_wait_timeout_rcu(bo->tbo.base.resv,
> -			true, false, MAX_SCHEDULE_TIMEOUT);
> -		if (r <= 0)
> -			DRM_ERROR("(%ld) failed to wait for user bo\n", r);
> -	}
> +	/* FIXME: Is this necessary? */
> +	if (!amdgpu_ttm_tt_affect_userptr(bo->tbo.ttm, range->start,
> +					  range->end))
> +		return true;
> +
> +	if (!mmu_notifier_range_blockable(range))
> +		return false;
> +
> +	mutex_lock(&adev->notifier_lock);
> +	r = dma_resv_wait_timeout_rcu(bo->tbo.base.resv, true, false,
> +				      MAX_SCHEDULE_TIMEOUT);
> +	mutex_unlock(&adev->notifier_lock);
> +	if (r <= 0)
> +		DRM_ERROR("(%ld) failed to wait for user bo\n", r);
> +	return true;
>   }
>   
> +static const struct mmu_range_notifier_ops amdgpu_mn_gfx_ops = {
> +	.invalidate = amdgpu_mn_invalidate_gfx,
> +};
> +
>   /**
> - * amdgpu_mn_sync_pagetables_gfx - callback to notify about mm change
> + * amdgpu_mn_invalidate_hsa - callback to notify about mm change
>    *
> - * @mirror: the hmm_mirror (mm) is about to update
> - * @update: the update start, end address
> + * @mrn: the range (mm) is about to update
> + * @range: details on the invalidation
>    *
> - * Block for operations on BOs to finish and mark pages as accessed and
> - * potentially dirty.
> + * We temporarily evict the BO attached to this range. This necessitates
> + * evicting all user-mode queues of the process.
>    */
> -static int
> -amdgpu_mn_sync_pagetables_gfx(struct hmm_mirror *mirror,
> -			      const struct mmu_notifier_range *update)
> +static bool amdgpu_mn_invalidate_hsa(struct mmu_range_notifier *mrn,
> +				     const struct mmu_notifier_range *range)
>   {
> -	struct amdgpu_mn *amn = container_of(mirror, struct amdgpu_mn, mirror);
> -	unsigned long start = update->start;
> -	unsigned long end = update->end;
> -	bool blockable = mmu_notifier_range_blockable(update);
> -	struct interval_tree_node *it;
> -
> -	/* notification is exclusive, but interval is inclusive */
> -	end -= 1;
> -
> -	/* TODO we should be able to split locking for interval tree and
> -	 * amdgpu_mn_invalidate_node
> -	 */
> -	if (amdgpu_mn_read_lock(amn, blockable))
> -		return -EAGAIN;
> -
> -	it = interval_tree_iter_first(&amn->objects, start, end);
> -	while (it) {
> -		struct amdgpu_mn_node *node;
> -
> -		if (!blockable) {
> -			amdgpu_mn_read_unlock(amn);
> -			return -EAGAIN;
> -		}
> +	struct amdgpu_bo *bo = container_of(mrn, struct amdgpu_bo, notifier);
> +	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
>   
> -		node = container_of(it, struct amdgpu_mn_node, it);
> -		it = interval_tree_iter_next(it, start, end);
> +	/* FIXME: Is this necessary? */
> +	if (!amdgpu_ttm_tt_affect_userptr(bo->tbo.ttm, range->start,
> +					  range->end))
> +		return true;
>   
> -		amdgpu_mn_invalidate_node(node, start, end);
> -	}
> +	if (!mmu_notifier_range_blockable(range))
> +		return false;
>   
> -	amdgpu_mn_read_unlock(amn);
> +	mutex_lock(&adev->notifier_lock);
> +	amdgpu_amdkfd_evict_userptr(bo->kfd_bo, bo->notifier.mm);
> +	mutex_unlock(&adev->notifier_lock);
>   
> -	return 0;
> +	return true;
>   }
>   
> -/**
> - * amdgpu_mn_sync_pagetables_hsa - callback to notify about mm change
> - *
> - * @mirror: the hmm_mirror (mm) is about to update
> - * @update: the update start, end address
> - *
> - * We temporarily evict all BOs between start and end. This
> - * necessitates evicting all user-mode queues of the process. The BOs
> - * are restorted in amdgpu_mn_invalidate_range_end_hsa.
> - */
> -static int
> -amdgpu_mn_sync_pagetables_hsa(struct hmm_mirror *mirror,
> -			      const struct mmu_notifier_range *update)
> +static const struct mmu_range_notifier_ops amdgpu_mn_hsa_ops = {
> +	.invalidate = amdgpu_mn_invalidate_hsa,
> +};
> +
> +static int amdgpu_mn_sync_pagetables(struct hmm_mirror *mirror,
> +				     const struct mmu_notifier_range *update)
>   {
>   	struct amdgpu_mn *amn = container_of(mirror, struct amdgpu_mn, mirror);
> -	unsigned long start = update->start;
> -	unsigned long end = update->end;
> -	bool blockable = mmu_notifier_range_blockable(update);
> -	struct interval_tree_node *it;
>   
> -	/* notification is exclusive, but interval is inclusive */
> -	end -= 1;
> -
> -	if (amdgpu_mn_read_lock(amn, blockable))
> -		return -EAGAIN;
> -
> -	it = interval_tree_iter_first(&amn->objects, start, end);
> -	while (it) {
> -		struct amdgpu_mn_node *node;
> -		struct amdgpu_bo *bo;
> -
> -		if (!blockable) {
> -			amdgpu_mn_read_unlock(amn);
> -			return -EAGAIN;
> -		}
> -
> -		node = container_of(it, struct amdgpu_mn_node, it);
> -		it = interval_tree_iter_next(it, start, end);
> -
> -		list_for_each_entry(bo, &node->bos, mn_list) {
> -			struct kgd_mem *mem = bo->kfd_bo;
> -
> -			if (amdgpu_ttm_tt_affect_userptr(bo->tbo.ttm,
> -							 start, end))
> -				amdgpu_amdkfd_evict_userptr(mem, amn->mm);
> -		}
> -	}
> -
> -	amdgpu_mn_read_unlock(amn);
> +	if (!mmu_notifier_range_blockable(update))
> +		return false;

This should return -EAGAIN. Not sure it matters much, because this whole 
function disappears in the next commit in the series. It seems to be 
only vestigial at this point.

Regards,
   Felix

>   
> +	down_read(&amn->lock);
> +	up_read(&amn->lock);
>   	return 0;
>   }
>   
> @@ -295,12 +164,10 @@ amdgpu_mn_sync_pagetables_hsa(struct hmm_mirror *mirror,
>   
>   static struct hmm_mirror_ops amdgpu_hmm_mirror_ops[] = {
>   	[AMDGPU_MN_TYPE_GFX] = {
> -		.sync_cpu_device_pagetables = amdgpu_mn_sync_pagetables_gfx,
> -		.release = amdgpu_hmm_mirror_release
> +		.sync_cpu_device_pagetables = amdgpu_mn_sync_pagetables,
>   	},
>   	[AMDGPU_MN_TYPE_HSA] = {
> -		.sync_cpu_device_pagetables = amdgpu_mn_sync_pagetables_hsa,
> -		.release = amdgpu_hmm_mirror_release
> +		.sync_cpu_device_pagetables = amdgpu_mn_sync_pagetables,
>   	},
>   };
>   
> @@ -327,7 +194,8 @@ struct amdgpu_mn *amdgpu_mn_get(struct amdgpu_device *adev,
>   	}
>   
>   	hash_for_each_possible(adev->mn_hash, amn, node, key)
> -		if (AMDGPU_MN_KEY(amn->mm, amn->type) == key)
> +		if (AMDGPU_MN_KEY(amn->mirror.hmm->mmu_notifier.mm,
> +				  amn->type) == key)
>   			goto release_locks;
>   
>   	amn = kzalloc(sizeof(*amn), GFP_KERNEL);
> @@ -337,10 +205,8 @@ struct amdgpu_mn *amdgpu_mn_get(struct amdgpu_device *adev,
>   	}
>   
>   	amn->adev = adev;
> -	amn->mm = mm;
>   	init_rwsem(&amn->lock);
>   	amn->type = type;
> -	amn->objects = RB_ROOT_CACHED;
>   
>   	amn->mirror.ops = &amdgpu_hmm_mirror_ops[type];
>   	r = hmm_mirror_register(&amn->mirror, mm);
> @@ -369,100 +235,33 @@ struct amdgpu_mn *amdgpu_mn_get(struct amdgpu_device *adev,
>    * @bo: amdgpu buffer object
>    * @addr: userptr addr we should monitor
>    *
> - * Registers an HMM mirror for the given BO at the specified address.
> + * Registers a mmu_notifier for the given BO at the specified address.
>    * Returns 0 on success, -ERRNO if anything goes wrong.
>    */
>   int amdgpu_mn_register(struct amdgpu_bo *bo, unsigned long addr)
>   {
> -	unsigned long end = addr + amdgpu_bo_size(bo) - 1;
> -	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
> -	enum amdgpu_mn_type type =
> -		bo->kfd_bo ? AMDGPU_MN_TYPE_HSA : AMDGPU_MN_TYPE_GFX;
> -	struct amdgpu_mn *amn;
> -	struct amdgpu_mn_node *node = NULL, *new_node;
> -	struct list_head bos;
> -	struct interval_tree_node *it;
> -
> -	amn = amdgpu_mn_get(adev, type);
> -	if (IS_ERR(amn))
> -		return PTR_ERR(amn);
> -
> -	new_node = kmalloc(sizeof(*new_node), GFP_KERNEL);
> -	if (!new_node)
> -		return -ENOMEM;
> -
> -	INIT_LIST_HEAD(&bos);
> -
> -	down_write(&amn->lock);
> -
> -	while ((it = interval_tree_iter_first(&amn->objects, addr, end))) {
> -		kfree(node);
> -		node = container_of(it, struct amdgpu_mn_node, it);
> -		interval_tree_remove(&node->it, &amn->objects);
> -		addr = min(it->start, addr);
> -		end = max(it->last, end);
> -		list_splice(&node->bos, &bos);
> -	}
> -
> -	if (!node)
> -		node = new_node;
> +	if (bo->kfd_bo)
> +		bo->notifier.ops = &amdgpu_mn_hsa_ops;
>   	else
> -		kfree(new_node);
> -
> -	bo->mn = amn;
> -
> -	node->it.start = addr;
> -	node->it.last = end;
> -	INIT_LIST_HEAD(&node->bos);
> -	list_splice(&bos, &node->bos);
> -	list_add(&bo->mn_list, &node->bos);
> +		bo->notifier.ops = &amdgpu_mn_gfx_ops;
>   
> -	interval_tree_insert(&node->it, &amn->objects);
> -
> -	up_write(&amn->lock);
> -
> -	return 0;
> +	return mmu_range_notifier_insert(&bo->notifier, addr,
> +					 amdgpu_bo_size(bo), current->mm);
>   }
>   
>   /**
> - * amdgpu_mn_unregister - unregister a BO for HMM mirror updates
> + * amdgpu_mn_unregister - unregister a BO for notifier updates
>    *
>    * @bo: amdgpu buffer object
>    *
> - * Remove any registration of HMM mirror updates from the buffer object.
> + * Remove any registration of mmu notifier updates from the buffer object.
>    */
>   void amdgpu_mn_unregister(struct amdgpu_bo *bo)
>   {
> -	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
> -	struct amdgpu_mn *amn;
> -	struct list_head *head;
> -
> -	mutex_lock(&adev->mn_lock);
> -
> -	amn = bo->mn;
> -	if (amn == NULL) {
> -		mutex_unlock(&adev->mn_lock);
> +	if (!bo->notifier.mm)
>   		return;
> -	}
> -
> -	down_write(&amn->lock);
> -
> -	/* save the next list entry for later */
> -	head = bo->mn_list.next;
> -
> -	bo->mn = NULL;
> -	list_del_init(&bo->mn_list);
> -
> -	if (list_empty(head)) {
> -		struct amdgpu_mn_node *node;
> -
> -		node = container_of(head, struct amdgpu_mn_node, bos);
> -		interval_tree_remove(&node->it, &amn->objects);
> -		kfree(node);
> -	}
> -
> -	up_write(&amn->lock);
> -	mutex_unlock(&adev->mn_lock);
> +	mmu_range_notifier_remove(&bo->notifier);
> +	bo->notifier.mm = NULL;
>   }
>   
>   /* flags used by HMM internal, not related to CPU/GPU PTE flags */
[snip]


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 02/15] mm/mmu_notifier: add an interval tree notifier
  2019-10-29 22:04   ` Kuehling, Felix
@ 2019-10-29 22:56     ` Jason Gunthorpe
  0 siblings, 0 replies; 71+ messages in thread
From: Jason Gunthorpe @ 2019-10-29 22:56 UTC (permalink / raw)
  To: Kuehling, Felix
  Cc: linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard,
	linux-rdma, dri-devel, amd-gfx, Deucher, Alexander, Ben Skeggs,
	Boris Ostrovsky, Koenig, Christian, Zhou, David(ChunMing),
	Dennis Dalessandro, Juergen Gross, Mike Marciniszyn,
	Oleksandr Andrushchenko, Petr Cvek, Stefano Stabellini, nouveau,
	xen-devel, Christoph Hellwig, Andrea Arcangeli, Michal Hocko

On Tue, Oct 29, 2019 at 10:04:45PM +0000, Kuehling, Felix wrote:

> >    * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap
> > @@ -52,17 +286,24 @@ struct mmu_notifier_mm {
> >    * can't go away from under us as exit_mmap holds an mm_count pin
> >    * itself.
> >    */
> > -void __mmu_notifier_release(struct mm_struct *mm)
> > +static void mn_hlist_release(struct mmu_notifier_mm *mmn_mm,
> > +			     struct mm_struct *mm)
> >   {
> >   	struct mmu_notifier *mn;
> >   	int id;
> >   
> > +	if (mmn_mm->has_interval)
> > +		mn_itree_release(mmn_mm, mm);
> > +
> > +	if (hlist_empty(&mmn_mm->list))
> > +		return;
> 
> This seems to duplicate the conditions in __mmu_notifier_release. See my 
> comments below, I think one of them is wrong. I suspect this one, 
> because __mmu_notifier_release follows the same pattern as the other 
> notifiers.

Yep, this is a rebasing error from a earlier version, the above two
lines should be deleted.

I think it is harmless so it should not impact any testing.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 13/15] drm/amdgpu: Use mmu_range_insert instead of hmm_mirror
  2019-10-29 22:14   ` Kuehling, Felix
@ 2019-10-29 23:09     ` Jason Gunthorpe
  0 siblings, 0 replies; 71+ messages in thread
From: Jason Gunthorpe @ 2019-10-29 23:09 UTC (permalink / raw)
  To: Kuehling, Felix
  Cc: linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard,
	linux-rdma, dri-devel, amd-gfx, Deucher, Alexander, Ben Skeggs,
	Boris Ostrovsky, Koenig, Christian, Zhou, David(ChunMing),
	Dennis Dalessandro, Juergen Gross, Mike Marciniszyn,
	Oleksandr Andrushchenko, Petr Cvek, Stefano Stabellini, nouveau,
	xen-devel, Christoph Hellwig

On Tue, Oct 29, 2019 at 10:14:29PM +0000, Kuehling, Felix wrote:

> > +static const struct mmu_range_notifier_ops amdgpu_mn_hsa_ops = {
> > +	.invalidate = amdgpu_mn_invalidate_hsa,
> > +};
> > +
> > +static int amdgpu_mn_sync_pagetables(struct hmm_mirror *mirror,
> > +				     const struct mmu_notifier_range *update)
> >   {
> >   	struct amdgpu_mn *amn = container_of(mirror, struct amdgpu_mn, mirror);
> > -	unsigned long start = update->start;
> > -	unsigned long end = update->end;
> > -	bool blockable = mmu_notifier_range_blockable(update);
> > -	struct interval_tree_node *it;
> >   
> > -	/* notification is exclusive, but interval is inclusive */
> > -	end -= 1;
> > -
> > -	if (amdgpu_mn_read_lock(amn, blockable))
> > -		return -EAGAIN;
> > -
> > -	it = interval_tree_iter_first(&amn->objects, start, end);
> > -	while (it) {
> > -		struct amdgpu_mn_node *node;
> > -		struct amdgpu_bo *bo;
> > -
> > -		if (!blockable) {
> > -			amdgpu_mn_read_unlock(amn);
> > -			return -EAGAIN;
> > -		}
> > -
> > -		node = container_of(it, struct amdgpu_mn_node, it);
> > -		it = interval_tree_iter_next(it, start, end);
> > -
> > -		list_for_each_entry(bo, &node->bos, mn_list) {
> > -			struct kgd_mem *mem = bo->kfd_bo;
> > -
> > -			if (amdgpu_ttm_tt_affect_userptr(bo->tbo.ttm,
> > -							 start, end))
> > -				amdgpu_amdkfd_evict_userptr(mem, amn->mm);
> > -		}
> > -	}
> > -
> > -	amdgpu_mn_read_unlock(amn);
> > +	if (!mmu_notifier_range_blockable(update))
> > +		return false;
> 
> This should return -EAGAIN. Not sure it matters much, because this whole 
> function disappears in the next commit in the series. It seems to be 
> only vestigial at this point.

Right, the only reason it is still here is that I couldn't really tell
if this:

> > +	down_read(&amn->lock);
> > +	up_read(&amn->lock);
> >   	return 0;
> >   }

Was serving as the 'driver lock' in the hmm scheme... If not then the
whole thing should just be deleted at this point.

I fixed the EAGAIN though

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 09/15] xen/gntdev: use mmu_range_notifier_insert
  2019-10-28 20:10 ` [PATCH v2 09/15] xen/gntdev: use mmu_range_notifier_insert Jason Gunthorpe
@ 2019-10-30 16:55   ` Boris Ostrovsky
  2019-11-01 17:48     ` Jason Gunthorpe
  2019-11-04 22:03   ` Boris Ostrovsky
  1 sibling, 1 reply; 71+ messages in thread
From: Boris Ostrovsky @ 2019-10-30 16:55 UTC (permalink / raw)
  To: Jason Gunthorpe, linux-mm, Jerome Glisse, Ralph Campbell,
	John Hubbard, Felix.Kuehling
  Cc: linux-rdma, dri-devel, amd-gfx, Alex Deucher, Ben Skeggs,
	Christian König, David Zhou, Dennis Dalessandro,
	Juergen Gross, Mike Marciniszyn, Oleksandr Andrushchenko,
	Petr Cvek, Stefano Stabellini, nouveau, xen-devel,
	Christoph Hellwig, Jason Gunthorpe

On 10/28/19 4:10 PM, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
>
> gntdev simply wants to monitor a specific VMA for any notifier events,
> this can be done straightforwardly using mmu_range_notifier_insert() over
> the VMA's VA range.
>
> The notifier should be attached until the original VMA is destroyed.
>
> It is unclear if any of this is even sane, but at least a lot of duplicate
> code is removed.

I didn't have a chance to look at the patch itself yet but as a heads-up
--- it crashes dom0.

-boris


>
> Cc: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Cc: xen-devel@lists.xenproject.org
> Cc: Juergen Gross <jgross@suse.com>
> Cc: Stefano Stabellini <sstabellini@kernel.org>
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> ---
>  drivers/xen/gntdev-common.h |   8 +-
>  drivers/xen/gntdev.c        | 180 ++++++++++--------------------------
>  2 files changed, 49 insertions(+), 139 deletions(-)
>
> diff --git a/drivers/xen/gntdev-common.h b/drivers/xen/gntdev-common.h
> index 2f8b949c3eeb14..b201fdd20b667b 100644
> --- a/drivers/xen/gntdev-common.h
> +++ b/drivers/xen/gntdev-common.h
> @@ -21,15 +21,8 @@ struct gntdev_dmabuf_priv;
>  struct gntdev_priv {
>  	/* Maps with visible offsets in the file descriptor. */
>  	struct list_head maps;
> -	/*
> -	 * Maps that are not visible; will be freed on munmap.
> -	 * Only populated if populate_freeable_maps == 1
> -	 */
> -	struct list_head freeable_maps;
>  	/* lock protects maps and freeable_maps. */
>  	struct mutex lock;
> -	struct mm_struct *mm;
> -	struct mmu_notifier mn;
>  
>  #ifdef CONFIG_XEN_GRANT_DMA_ALLOC
>  	/* Device for which DMA memory is allocated. */
> @@ -49,6 +42,7 @@ struct gntdev_unmap_notify {
>  };
>  
>  struct gntdev_grant_map {
> +	struct mmu_range_notifier notifier;
>  	struct list_head next;
>  	struct vm_area_struct *vma;
>  	int index;
> diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
> index a446a7221e13e9..12d626670bebbc 100644
> --- a/drivers/xen/gntdev.c
> +++ b/drivers/xen/gntdev.c
> @@ -65,7 +65,6 @@ MODULE_PARM_DESC(limit, "Maximum number of grants that may be mapped by "
>  static atomic_t pages_mapped = ATOMIC_INIT(0);
>  
>  static int use_ptemod;
> -#define populate_freeable_maps use_ptemod
>  
>  static int unmap_grant_pages(struct gntdev_grant_map *map,
>  			     int offset, int pages);
> @@ -251,12 +250,6 @@ void gntdev_put_map(struct gntdev_priv *priv, struct gntdev_grant_map *map)
>  		evtchn_put(map->notify.event);
>  	}
>  
> -	if (populate_freeable_maps && priv) {
> -		mutex_lock(&priv->lock);
> -		list_del(&map->next);
> -		mutex_unlock(&priv->lock);
> -	}
> -
>  	if (map->pages && !use_ptemod)
>  		unmap_grant_pages(map, 0, map->count);
>  	gntdev_free_map(map);
> @@ -445,17 +438,9 @@ static void gntdev_vma_close(struct vm_area_struct *vma)
>  	struct gntdev_priv *priv = file->private_data;
>  
>  	pr_debug("gntdev_vma_close %p\n", vma);
> -	if (use_ptemod) {
> -		/* It is possible that an mmu notifier could be running
> -		 * concurrently, so take priv->lock to ensure that the vma won't
> -		 * vanishing during the unmap_grant_pages call, since we will
> -		 * spin here until that completes. Such a concurrent call will
> -		 * not do any unmapping, since that has been done prior to
> -		 * closing the vma, but it may still iterate the unmap_ops list.
> -		 */
> -		mutex_lock(&priv->lock);
> +	if (use_ptemod && map->vma == vma) {
> +		mmu_range_notifier_remove(&map->notifier);
>  		map->vma = NULL;
> -		mutex_unlock(&priv->lock);
>  	}
>  	vma->vm_private_data = NULL;
>  	gntdev_put_map(priv, map);
> @@ -477,109 +462,44 @@ static const struct vm_operations_struct gntdev_vmops = {
>  
>  /* ------------------------------------------------------------------ */
>  
> -static bool in_range(struct gntdev_grant_map *map,
> -			      unsigned long start, unsigned long end)
> -{
> -	if (!map->vma)
> -		return false;
> -	if (map->vma->vm_start >= end)
> -		return false;
> -	if (map->vma->vm_end <= start)
> -		return false;
> -
> -	return true;
> -}
> -
> -static int unmap_if_in_range(struct gntdev_grant_map *map,
> -			      unsigned long start, unsigned long end,
> -			      bool blockable)
> +static bool gntdev_invalidate(struct mmu_range_notifier *mn,
> +			      const struct mmu_notifier_range *range,
> +			      unsigned long cur_seq)
>  {
> +	struct gntdev_grant_map *map =
> +		container_of(mn, struct gntdev_grant_map, notifier);
>  	unsigned long mstart, mend;
>  	int err;
>  
> -	if (!in_range(map, start, end))
> -		return 0;
> +	if (!mmu_notifier_range_blockable(range))
> +		return false;
>  
> -	if (!blockable)
> -		return -EAGAIN;
> +	/*
> +	 * If the VMA is split or otherwise changed the notifier is not
> +	 * updated, but we don't want to process VA's outside the modified
> +	 * VMA. FIXME: It would be much more understandable to just prevent
> +	 * modifying the VMA in the first place.
> +	 */
> +	if (map->vma->vm_start >= range->end ||
> +	    map->vma->vm_end <= range->start)
> +		return true;
>  
> -	mstart = max(start, map->vma->vm_start);
> -	mend   = min(end,   map->vma->vm_end);
> +	mstart = max(range->start, map->vma->vm_start);
> +	mend = min(range->end, map->vma->vm_end);
>  	pr_debug("map %d+%d (%lx %lx), range %lx %lx, mrange %lx %lx\n",
>  			map->index, map->count,
>  			map->vma->vm_start, map->vma->vm_end,
> -			start, end, mstart, mend);
> +			range->start, range->end, mstart, mend);
>  	err = unmap_grant_pages(map,
>  				(mstart - map->vma->vm_start) >> PAGE_SHIFT,
>  				(mend - mstart) >> PAGE_SHIFT);
>  	WARN_ON(err);
>  
> -	return 0;
> -}
> -
> -static int mn_invl_range_start(struct mmu_notifier *mn,
> -			       const struct mmu_notifier_range *range)
> -{
> -	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
> -	struct gntdev_grant_map *map;
> -	int ret = 0;
> -
> -	if (mmu_notifier_range_blockable(range))
> -		mutex_lock(&priv->lock);
> -	else if (!mutex_trylock(&priv->lock))
> -		return -EAGAIN;
> -
> -	list_for_each_entry(map, &priv->maps, next) {
> -		ret = unmap_if_in_range(map, range->start, range->end,
> -					mmu_notifier_range_blockable(range));
> -		if (ret)
> -			goto out_unlock;
> -	}
> -	list_for_each_entry(map, &priv->freeable_maps, next) {
> -		ret = unmap_if_in_range(map, range->start, range->end,
> -					mmu_notifier_range_blockable(range));
> -		if (ret)
> -			goto out_unlock;
> -	}
> -
> -out_unlock:
> -	mutex_unlock(&priv->lock);
> -
> -	return ret;
> -}
> -
> -static void mn_release(struct mmu_notifier *mn,
> -		       struct mm_struct *mm)
> -{
> -	struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
> -	struct gntdev_grant_map *map;
> -	int err;
> -
> -	mutex_lock(&priv->lock);
> -	list_for_each_entry(map, &priv->maps, next) {
> -		if (!map->vma)
> -			continue;
> -		pr_debug("map %d+%d (%lx %lx)\n",
> -				map->index, map->count,
> -				map->vma->vm_start, map->vma->vm_end);
> -		err = unmap_grant_pages(map, /* offset */ 0, map->count);
> -		WARN_ON(err);
> -	}
> -	list_for_each_entry(map, &priv->freeable_maps, next) {
> -		if (!map->vma)
> -			continue;
> -		pr_debug("map %d+%d (%lx %lx)\n",
> -				map->index, map->count,
> -				map->vma->vm_start, map->vma->vm_end);
> -		err = unmap_grant_pages(map, /* offset */ 0, map->count);
> -		WARN_ON(err);
> -	}
> -	mutex_unlock(&priv->lock);
> +	return true;
>  }
>  
> -static const struct mmu_notifier_ops gntdev_mmu_ops = {
> -	.release                = mn_release,
> -	.invalidate_range_start = mn_invl_range_start,
> +static const struct mmu_range_notifier_ops gntdev_mmu_ops = {
> +	.invalidate = gntdev_invalidate,
>  };
>  
>  /* ------------------------------------------------------------------ */
> @@ -594,7 +514,6 @@ static int gntdev_open(struct inode *inode, struct file *flip)
>  		return -ENOMEM;
>  
>  	INIT_LIST_HEAD(&priv->maps);
> -	INIT_LIST_HEAD(&priv->freeable_maps);
>  	mutex_init(&priv->lock);
>  
>  #ifdef CONFIG_XEN_GNTDEV_DMABUF
> @@ -606,17 +525,6 @@ static int gntdev_open(struct inode *inode, struct file *flip)
>  	}
>  #endif
>  
> -	if (use_ptemod) {
> -		priv->mm = get_task_mm(current);
> -		if (!priv->mm) {
> -			kfree(priv);
> -			return -ENOMEM;
> -		}
> -		priv->mn.ops = &gntdev_mmu_ops;
> -		ret = mmu_notifier_register(&priv->mn, priv->mm);
> -		mmput(priv->mm);
> -	}
> -
>  	if (ret) {
>  		kfree(priv);
>  		return ret;
> @@ -653,16 +561,12 @@ static int gntdev_release(struct inode *inode, struct file *flip)
>  		list_del(&map->next);
>  		gntdev_put_map(NULL /* already removed */, map);
>  	}
> -	WARN_ON(!list_empty(&priv->freeable_maps));
>  	mutex_unlock(&priv->lock);
>  
>  #ifdef CONFIG_XEN_GNTDEV_DMABUF
>  	gntdev_dmabuf_fini(priv->dmabuf_priv);
>  #endif
>  
> -	if (use_ptemod)
> -		mmu_notifier_unregister(&priv->mn, priv->mm);
> -
>  	kfree(priv);
>  	return 0;
>  }
> @@ -723,8 +627,6 @@ static long gntdev_ioctl_unmap_grant_ref(struct gntdev_priv *priv,
>  	map = gntdev_find_map_index(priv, op.index >> PAGE_SHIFT, op.count);
>  	if (map) {
>  		list_del(&map->next);
> -		if (populate_freeable_maps)
> -			list_add_tail(&map->next, &priv->freeable_maps);
>  		err = 0;
>  	}
>  	mutex_unlock(&priv->lock);
> @@ -1096,11 +998,6 @@ static int gntdev_mmap(struct file *flip, struct vm_area_struct *vma)
>  		goto unlock_out;
>  	if (use_ptemod && map->vma)
>  		goto unlock_out;
> -	if (use_ptemod && priv->mm != vma->vm_mm) {
> -		pr_warn("Huh? Other mm?\n");
> -		goto unlock_out;
> -	}
> -
>  	refcount_inc(&map->users);
>  
>  	vma->vm_ops = &gntdev_vmops;
> @@ -1111,10 +1008,6 @@ static int gntdev_mmap(struct file *flip, struct vm_area_struct *vma)
>  		vma->vm_flags |= VM_DONTCOPY;
>  
>  	vma->vm_private_data = map;
> -
> -	if (use_ptemod)
> -		map->vma = vma;
> -
>  	if (map->flags) {
>  		if ((vma->vm_flags & VM_WRITE) &&
>  				(map->flags & GNTMAP_readonly))
> @@ -1125,8 +1018,28 @@ static int gntdev_mmap(struct file *flip, struct vm_area_struct *vma)
>  			map->flags |= GNTMAP_readonly;
>  	}
>  
> +	if (use_ptemod) {
> +		map->vma = vma;
> +		err = mmu_range_notifier_insert_locked(
> +			&map->notifier, vma->vm_start,
> +			vma->vm_end - vma->vm_start, vma->vm_mm);
> +		if (err)
> +			goto out_unlock_put;
> +	}
>  	mutex_unlock(&priv->lock);
>  
> +	/*
> +	 * gntdev takes the address of the PTE in find_grant_ptes() and passes
> +	 * it to the hypervisor in gntdev_map_grant_pages(). The purpose of
> +	 * the notifier is to prevent the hypervisor pointer to the PTE from
> +	 * going stale.
> +	 *
> +	 * Since this vma's mappings can't be touched without the mmap_sem,
> +	 * and we are holding it now, there is no need for the notifier_range
> +	 * locking pattern.
> +	 */
> +	mmu_range_read_begin(&map->notifier);
> +
>  	if (use_ptemod) {
>  		map->pages_vm_start = vma->vm_start;
>  		err = apply_to_page_range(vma->vm_mm, vma->vm_start,
> @@ -1175,8 +1088,11 @@ static int gntdev_mmap(struct file *flip, struct vm_area_struct *vma)
>  	mutex_unlock(&priv->lock);
>  out_put_map:
>  	if (use_ptemod) {
> -		map->vma = NULL;
>  		unmap_grant_pages(map, 0, map->count);
> +		if (map->vma) {
> +			mmu_range_notifier_remove(&map->notifier);
> +			map->vma = NULL;
> +		}
>  	}
>  	gntdev_put_map(priv, map);
>  	return err;


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 14/15] drm/amdgpu: Use mmu_range_notifier instead of hmm_mirror
  2019-10-29 19:25     ` Jason Gunthorpe
@ 2019-11-01 14:44       ` Yang, Philip
  2019-11-01 15:12         ` Jason Gunthorpe
                           ` (2 more replies)
  0 siblings, 3 replies; 71+ messages in thread
From: Yang, Philip @ 2019-11-01 14:44 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard, Kuehling,
	Felix, Juergen Gross, Zhou, David(ChunMing),
	Mike Marciniszyn, Stefano Stabellini, Oleksandr Andrushchenko,
	linux-rdma, nouveau, Dennis Dalessandro, amd-gfx,
	Christoph Hellwig, dri-devel, Deucher, Alexander, xen-devel,
	Boris Ostrovsky, Petr Cvek, Koenig, Christian, Ben Skeggs

[-- Attachment #1: Type: text/plain, Size: 911 bytes --]



On 2019-10-29 3:25 p.m., Jason Gunthorpe wrote:
> On Tue, Oct 29, 2019 at 07:22:37PM +0000, Yang, Philip wrote:
>> Hi Jason,
>>
>> I did quick test after merging amd-staging-drm-next with the
>> mmu_notifier branch, which includes this set changes. The test result
>> has different failures, app stuck intermittently, GUI no display etc. I
>> am understanding the changes and will try to figure out the cause.
> 
> Thanks! I'm not surprised by this given how difficult this patch was
> to make. Let me know if I can assist in any way
> 
> Please ensure to run with lockdep enabled.. Your symptops sounds sort
> of like deadlocking?
> 
Hi Jason,

Attached patch fix several issues in amdgpu driver, maybe you can squash 
this into patch 14. With this is done, patch 12, 13, 14 is Reviewed-by 
and Tested-by Philip Yang <philip.yang@amd.com>

Regards,
Philip

> Regards,
> Jason
> 

[-- Warning: decoded text below may be mangled --]
[-- Attachment #2: 0001-drm-amdgpu-issues-with-new-mmu_range_notifier-api.patch --]
[-- Type: text/x-patch; name="0001-drm-amdgpu-issues-with-new-mmu_range_notifier-api.patch", Size: 5274 bytes --]

From 5a0bd4d8cef8472fe2904550142d288feed8cd81 Mon Sep 17 00:00:00 2001
From: Philip Yang <Philip.Yang@amd.com>
Date: Thu, 31 Oct 2019 09:10:30 -0400
Subject: [PATCH] drm/amdgpu: issues with new mmu_range_notifier api

put mmu_range_set_seq under the same lock which is used to call
mmu_range_read_retry.

fix amdgpu_ttm_tt_get_user_pages_done return value, because
mmu_range_read_retry means !hmm_range_valid

retry if hmm_range_fault return -EBUSY

fix false WARN for missing get_user_page_done, we should check all
pages not just the first page, don't understand why this issue is
triggered by this change.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c  | 32 +++++++--------------
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 37 +++++++++++++++++--------
 2 files changed, 36 insertions(+), 33 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index cb718a064eb4..c8bbd06f1009 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -67,21 +67,15 @@ static bool amdgpu_mn_invalidate_gfx(struct mmu_range_notifier *mrn,
 	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
 	long r;
 
-	/*
-	 * FIXME: Must hold some lock shared with
-	 * amdgpu_ttm_tt_get_user_pages_done()
-	 */
-	mmu_range_set_seq(mrn, cur_seq);
+	mutex_lock(&adev->notifier_lock);
 
-	/* FIXME: Is this necessary? */
-	if (!amdgpu_ttm_tt_affect_userptr(bo->tbo.ttm, range->start,
-					  range->end))
-		return true;
+	mmu_range_set_seq(mrn, cur_seq);
 
-	if (!mmu_notifier_range_blockable(range))
+	if (!mmu_notifier_range_blockable(range)) {
+		mutex_unlock(&adev->notifier_lock);
 		return false;
+	}
 
-	mutex_lock(&adev->notifier_lock);
 	r = dma_resv_wait_timeout_rcu(bo->tbo.base.resv, true, false,
 				      MAX_SCHEDULE_TIMEOUT);
 	mutex_unlock(&adev->notifier_lock);
@@ -110,21 +104,15 @@ static bool amdgpu_mn_invalidate_hsa(struct mmu_range_notifier *mrn,
 	struct amdgpu_bo *bo = container_of(mrn, struct amdgpu_bo, notifier);
 	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
 
-	/*
-	 * FIXME: Must hold some lock shared with
-	 * amdgpu_ttm_tt_get_user_pages_done()
-	 */
-	mmu_range_set_seq(mrn, cur_seq);
+	mutex_lock(&adev->notifier_lock);
 
-	/* FIXME: Is this necessary? */
-	if (!amdgpu_ttm_tt_affect_userptr(bo->tbo.ttm, range->start,
-					  range->end))
-		return true;
+	mmu_range_set_seq(mrn, cur_seq);
 
-	if (!mmu_notifier_range_blockable(range))
+	if (!mmu_notifier_range_blockable(range)) {
+		mutex_unlock(&adev->notifier_lock);
 		return false;
+	}
 
-	mutex_lock(&adev->notifier_lock);
 	amdgpu_amdkfd_evict_userptr(bo->kfd_bo, bo->notifier.mm);
 	mutex_unlock(&adev->notifier_lock);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index a38437fd290a..56fde43d5efa 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -799,10 +799,11 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
 {
 	struct ttm_tt *ttm = bo->tbo.ttm;
 	struct amdgpu_ttm_tt *gtt = (void *)ttm;
-	struct mm_struct *mm;
-	struct hmm_range *range;
 	unsigned long start = gtt->userptr;
 	struct vm_area_struct *vma;
+	struct hmm_range *range;
+	unsigned long timeout;
+	struct mm_struct *mm;
 	unsigned long i;
 	int r = 0;
 
@@ -841,8 +842,6 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
 		goto out_free_ranges;
 	}
 
-	range->notifier_seq = mmu_range_read_begin(&bo->notifier);
-
 	down_read(&mm->mmap_sem);
 	vma = find_vma(mm, start);
 	if (unlikely(!vma || start < vma->vm_start)) {
@@ -854,12 +853,20 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
 		r = -EPERM;
 		goto out_unlock;
 	}
+	up_read(&mm->mmap_sem);
+	timeout = jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
+
+retry:
+	range->notifier_seq = mmu_range_read_begin(&bo->notifier);
 
+	down_read(&mm->mmap_sem);
 	r = hmm_range_fault(range, 0);
 	up_read(&mm->mmap_sem);
-
-	if (unlikely(r < 0))
+	if (unlikely(r <= 0)) {
+		if ((r == 0 || r == -EBUSY) && !time_after(jiffies, timeout))
+			goto retry;
 		goto out_free_pfns;
+	}
 
 	for (i = 0; i < ttm->num_pages; i++) {
 		pages[i] = hmm_device_entry_to_page(range, range->pfns[i]);
@@ -916,7 +923,7 @@ bool amdgpu_ttm_tt_get_user_pages_done(struct ttm_tt *ttm)
 		gtt->range = NULL;
 	}
 
-	return r;
+	return !r;
 }
 #endif
 
@@ -997,10 +1004,18 @@ static void amdgpu_ttm_tt_unpin_userptr(struct ttm_tt *ttm)
 	sg_free_table(ttm->sg);
 
 #if IS_ENABLED(CONFIG_DRM_AMDGPU_USERPTR)
-	if (gtt->range &&
-	    ttm->pages[0] == hmm_device_entry_to_page(gtt->range,
-						      gtt->range->pfns[0]))
-		WARN_ONCE(1, "Missing get_user_page_done\n");
+	if (gtt->range) {
+		unsigned long i;
+
+		for (i = 0; i < ttm->num_pages; i++) {
+			if (ttm->pages[i] !=
+				hmm_device_entry_to_page(gtt->range,
+					      gtt->range->pfns[i]))
+				break;
+		}
+
+		WARN((i == ttm->num_pages), "Missing get_user_page_done\n");
+	}
 #endif
 }
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 14/15] drm/amdgpu: Use mmu_range_notifier instead of hmm_mirror
  2019-11-01 14:44       ` Yang, Philip
@ 2019-11-01 15:12         ` Jason Gunthorpe
  2019-11-01 15:59           ` Yang, Philip
  2019-11-01 18:21         ` Jason Gunthorpe
  2019-11-01 18:34         ` [PATCH v2a " Jason Gunthorpe
  2 siblings, 1 reply; 71+ messages in thread
From: Jason Gunthorpe @ 2019-11-01 15:12 UTC (permalink / raw)
  To: Yang, Philip
  Cc: linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard, Kuehling,
	Felix, Juergen Gross, Zhou, David(ChunMing),
	Mike Marciniszyn, Stefano Stabellini, Oleksandr Andrushchenko,
	linux-rdma, nouveau, Dennis Dalessandro, amd-gfx,
	Christoph Hellwig, dri-devel, Deucher, Alexander, xen-devel,
	Boris Ostrovsky, Petr Cvek, Koenig, Christian, Ben Skeggs

On Fri, Nov 01, 2019 at 02:44:51PM +0000, Yang, Philip wrote:
> 
> 
> On 2019-10-29 3:25 p.m., Jason Gunthorpe wrote:
> > On Tue, Oct 29, 2019 at 07:22:37PM +0000, Yang, Philip wrote:
> >> Hi Jason,
> >>
> >> I did quick test after merging amd-staging-drm-next with the
> >> mmu_notifier branch, which includes this set changes. The test result
> >> has different failures, app stuck intermittently, GUI no display etc. I
> >> am understanding the changes and will try to figure out the cause.
> > 
> > Thanks! I'm not surprised by this given how difficult this patch was
> > to make. Let me know if I can assist in any way
> > 
> > Please ensure to run with lockdep enabled.. Your symptops sounds sort
> > of like deadlocking?
> > 
> Hi Jason,
> 
> Attached patch fix several issues in amdgpu driver, maybe you can squash 
> this into patch 14. With this is done, patch 12, 13, 14 is Reviewed-by 
> and Tested-by Philip Yang <philip.yang@amd.com>

Wow, this is great thanks! Can you clarify what the problems you found
were? Was the bug the 'return !r' below?

I'll also add your signed off by

Here are some remarks:

> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> index cb718a064eb4..c8bbd06f1009 100644
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
> @@ -67,21 +67,15 @@ static bool amdgpu_mn_invalidate_gfx(struct mmu_range_notifier *mrn,
>  	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
>  	long r;
>  
> -	/*
> -	 * FIXME: Must hold some lock shared with
> -	 * amdgpu_ttm_tt_get_user_pages_done()
> -	 */
> -	mmu_range_set_seq(mrn, cur_seq);
> +	mutex_lock(&adev->notifier_lock);
>  
> -	/* FIXME: Is this necessary? */
> -	if (!amdgpu_ttm_tt_affect_userptr(bo->tbo.ttm, range->start,
> -					  range->end))
> -		return true;
> +	mmu_range_set_seq(mrn, cur_seq);
>  
> -	if (!mmu_notifier_range_blockable(range))
> +	if (!mmu_notifier_range_blockable(range)) {
> +		mutex_unlock(&adev->notifier_lock);
>  		return false;

This test for range_blockable should be before mutex_lock, I can move
it up

Also, do you know if notifier_lock is held while calling
amdgpu_ttm_tt_get_user_pages_done()? Can we add a 'lock assert held'
to amdgpu_ttm_tt_get_user_pages_done()?

> @@ -854,12 +853,20 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
>  		r = -EPERM;
>  		goto out_unlock;
>  	}
> +	up_read(&mm->mmap_sem);
> +	timeout = jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> +
> +retry:
> +	range->notifier_seq = mmu_range_read_begin(&bo->notifier);
>  
> +	down_read(&mm->mmap_sem);
>  	r = hmm_range_fault(range, 0);
>  	up_read(&mm->mmap_sem);
> -
> -	if (unlikely(r < 0))
> +	if (unlikely(r <= 0)) {
> +		if ((r == 0 || r == -EBUSY) && !time_after(jiffies, timeout))
> +			goto retry;
>  		goto out_free_pfns;
> +	}

This isn't really right, a retry loop like this needs to go all the
way to mmu_range_read_retry() and done under the notifier_lock. ie
mmu_range_read_retry() can fail just as likely as hmm_range_fault()
can, and drivers are supposed to retry in both cases, with a single
timeout.

AFAICT it is a major bug that many places ignore the return code of
amdgpu_ttm_tt_get_user_pages_done() ???

However, this is all pre-existing bugs, so I'm OK go ahead with this
patch as modified. I advise AMD to make a followup patch ..

I'll add a FIXME note to this effect.

>  	for (i = 0; i < ttm->num_pages; i++) {
>  		pages[i] = hmm_device_entry_to_page(range, range->pfns[i]);
> @@ -916,7 +923,7 @@ bool amdgpu_ttm_tt_get_user_pages_done(struct ttm_tt *ttm)
>  		gtt->range = NULL;
>  	}
>  
> -	return r;
> +	return !r;

Ah is this the major error? hmm_range_valid() is inverted vs
mmu_range_read_retry()?

>  }
>  #endif
>  
> @@ -997,10 +1004,18 @@ static void amdgpu_ttm_tt_unpin_userptr(struct ttm_tt *ttm)
>  	sg_free_table(ttm->sg);
>  
>  #if IS_ENABLED(CONFIG_DRM_AMDGPU_USERPTR)
> -	if (gtt->range &&
> -	    ttm->pages[0] == hmm_device_entry_to_page(gtt->range,
> -						      gtt->range->pfns[0]))
> -		WARN_ONCE(1, "Missing get_user_page_done\n");
> +	if (gtt->range) {
> +		unsigned long i;
> +
> +		for (i = 0; i < ttm->num_pages; i++) {
> +			if (ttm->pages[i] !=
> +				hmm_device_entry_to_page(gtt->range,
> +					      gtt->range->pfns[i]))
> +				break;
> +		}
> +
> +		WARN((i == ttm->num_pages), "Missing get_user_page_done\n");
> +	}

Is this related/necessary? I can put it in another patch if it is just
debugging improvement? Please advise

Thanks a lot,
Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 14/15] drm/amdgpu: Use mmu_range_notifier instead of hmm_mirror
  2019-11-01 15:12         ` Jason Gunthorpe
@ 2019-11-01 15:59           ` Yang, Philip
  2019-11-01 17:42             ` Jason Gunthorpe
  0 siblings, 1 reply; 71+ messages in thread
From: Yang, Philip @ 2019-11-01 15:59 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard, Kuehling,
	Felix, Juergen Gross, Zhou, David(ChunMing),
	Mike Marciniszyn, Stefano Stabellini, Oleksandr Andrushchenko,
	linux-rdma, nouveau, Dennis Dalessandro, amd-gfx,
	Christoph Hellwig, dri-devel, Deucher, Alexander, xen-devel,
	Boris Ostrovsky, Petr Cvek, Koenig, Christian, Ben Skeggs



On 2019-11-01 11:12 a.m., Jason Gunthorpe wrote:
> On Fri, Nov 01, 2019 at 02:44:51PM +0000, Yang, Philip wrote:
>>
>>
>> On 2019-10-29 3:25 p.m., Jason Gunthorpe wrote:
>>> On Tue, Oct 29, 2019 at 07:22:37PM +0000, Yang, Philip wrote:
>>>> Hi Jason,
>>>>
>>>> I did quick test after merging amd-staging-drm-next with the
>>>> mmu_notifier branch, which includes this set changes. The test result
>>>> has different failures, app stuck intermittently, GUI no display etc. I
>>>> am understanding the changes and will try to figure out the cause.
>>>
>>> Thanks! I'm not surprised by this given how difficult this patch was
>>> to make. Let me know if I can assist in any way
>>>
>>> Please ensure to run with lockdep enabled.. Your symptops sounds sort
>>> of like deadlocking?
>>>
>> Hi Jason,
>>
>> Attached patch fix several issues in amdgpu driver, maybe you can squash
>> this into patch 14. With this is done, patch 12, 13, 14 is Reviewed-by
>> and Tested-by Philip Yang <philip.yang@amd.com>
> 
> Wow, this is great thanks! Can you clarify what the problems you found
> were? Was the bug the 'return !r' below?
> 
Yes. return !r is critical one, and retry if hmm_range_fault return 
-EBUSY is needed too.

> I'll also add your signed off by
> 
> Here are some remarks:
> 
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
>> index cb718a064eb4..c8bbd06f1009 100644
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
>> @@ -67,21 +67,15 @@ static bool amdgpu_mn_invalidate_gfx(struct mmu_range_notifier *mrn,
>>   	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
>>   	long r;
>>   
>> -	/*
>> -	 * FIXME: Must hold some lock shared with
>> -	 * amdgpu_ttm_tt_get_user_pages_done()
>> -	 */
>> -	mmu_range_set_seq(mrn, cur_seq);
>> +	mutex_lock(&adev->notifier_lock);
>>   
>> -	/* FIXME: Is this necessary? */
>> -	if (!amdgpu_ttm_tt_affect_userptr(bo->tbo.ttm, range->start,
>> -					  range->end))
>> -		return true;
>> +	mmu_range_set_seq(mrn, cur_seq);
>>   
>> -	if (!mmu_notifier_range_blockable(range))
>> +	if (!mmu_notifier_range_blockable(range)) {
>> +		mutex_unlock(&adev->notifier_lock);
>>   		return false;
> 
> This test for range_blockable should be before mutex_lock, I can move
> it up
> 
yes, thanks.
> Also, do you know if notifier_lock is held while calling
> amdgpu_ttm_tt_get_user_pages_done()? Can we add a 'lock assert held'
> to amdgpu_ttm_tt_get_user_pages_done()?
> 
gpu side hold notifier_lock but kfd side doesn't. kfd side doesn't check 
amdgpu_ttm_tt_get_user_pages_done/mmu_range_read_retry return value but 
check mem->invalid flag which is updated from invalidate callback. It 
takes more time to change, I will come to another patch to fix it later.

>> @@ -854,12 +853,20 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
>>   		r = -EPERM;
>>   		goto out_unlock;
>>   	}
>> +	up_read(&mm->mmap_sem);
>> +	timeout = jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
>> +
>> +retry:
>> +	range->notifier_seq = mmu_range_read_begin(&bo->notifier);
>>   
>> +	down_read(&mm->mmap_sem);
>>   	r = hmm_range_fault(range, 0);
>>   	up_read(&mm->mmap_sem);
>> -
>> -	if (unlikely(r < 0))
>> +	if (unlikely(r <= 0)) {
>> +		if ((r == 0 || r == -EBUSY) && !time_after(jiffies, timeout))
>> +			goto retry;
>>   		goto out_free_pfns;
>> +	}
> 
> This isn't really right, a retry loop like this needs to go all the
> way to mmu_range_read_retry() and done under the notifier_lock. ie
> mmu_range_read_retry() can fail just as likely as hmm_range_fault()
> can, and drivers are supposed to retry in both cases, with a single
> timeout.
> 
For gpu, check mmu_range_read_retry return value under the notifier_lock 
to do retry is in seperate location, not in same retry loop.

> AFAICT it is a major bug that many places ignore the return code of
> amdgpu_ttm_tt_get_user_pages_done() ???
>
For kfd, explained above.

> However, this is all pre-existing bugs, so I'm OK go ahead with this
> patch as modified. I advise AMD to make a followup patch ..
> 
yes, I will.
> I'll add a FIXME note to this effect.
> 
>>   	for (i = 0; i < ttm->num_pages; i++) {
>>   		pages[i] = hmm_device_entry_to_page(range, range->pfns[i]);
>> @@ -916,7 +923,7 @@ bool amdgpu_ttm_tt_get_user_pages_done(struct ttm_tt *ttm)
>>   		gtt->range = NULL;
>>   	}
>>   
>> -	return r;
>> +	return !r;
> 
> Ah is this the major error? hmm_range_valid() is inverted vs
> mmu_range_read_retry()?
> 
yes.
>>   }
>>   #endif
>>   
>> @@ -997,10 +1004,18 @@ static void amdgpu_ttm_tt_unpin_userptr(struct ttm_tt *ttm)
>>   	sg_free_table(ttm->sg);
>>   
>>   #if IS_ENABLED(CONFIG_DRM_AMDGPU_USERPTR)
>> -	if (gtt->range &&
>> -	    ttm->pages[0] == hmm_device_entry_to_page(gtt->range,
>> -						      gtt->range->pfns[0]))
>> -		WARN_ONCE(1, "Missing get_user_page_done\n");
>> +	if (gtt->range) {
>> +		unsigned long i;
>> +
>> +		for (i = 0; i < ttm->num_pages; i++) {
>> +			if (ttm->pages[i] !=
>> +				hmm_device_entry_to_page(gtt->range,
>> +					      gtt->range->pfns[i]))
>> +				break;
>> +		}
>> +
>> +		WARN((i == ttm->num_pages), "Missing get_user_page_done\n");
>> +	}
> 
> Is this related/necessary? I can put it in another patch if it is just
> debugging improvement? Please advise
> 
I see this WARN backtrace now, but I didn't see it before. This is 
somehow related.

> Thanks a lot,
> Jason
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 14/15] drm/amdgpu: Use mmu_range_notifier instead of hmm_mirror
  2019-11-01 15:59           ` Yang, Philip
@ 2019-11-01 17:42             ` Jason Gunthorpe
  2019-11-01 19:19               ` Jason Gunthorpe
  2019-11-01 19:45               ` Yang, Philip
  0 siblings, 2 replies; 71+ messages in thread
From: Jason Gunthorpe @ 2019-11-01 17:42 UTC (permalink / raw)
  To: Yang, Philip
  Cc: linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard, Kuehling,
	Felix, Juergen Gross, Zhou, David(ChunMing),
	Mike Marciniszyn, Stefano Stabellini, Oleksandr Andrushchenko,
	linux-rdma, nouveau, Dennis Dalessandro, amd-gfx,
	Christoph Hellwig, dri-devel, Deucher, Alexander, xen-devel,
	Boris Ostrovsky, Petr Cvek, Koenig, Christian, Ben Skeggs

On Fri, Nov 01, 2019 at 03:59:26PM +0000, Yang, Philip wrote:
> > This test for range_blockable should be before mutex_lock, I can move
> > it up
> > 
> yes, thanks.

Okay, I wrote it like this:

	if (mmu_notifier_range_blockable(range))
		mutex_lock(&adev->notifier_lock);
	else if (!mutex_trylock(&adev->notifier_lock))
		return false;

> > Also, do you know if notifier_lock is held while calling
> > amdgpu_ttm_tt_get_user_pages_done()? Can we add a 'lock assert held'
> > to amdgpu_ttm_tt_get_user_pages_done()?
> 
> gpu side hold notifier_lock but kfd side doesn't. kfd side doesn't check 
> amdgpu_ttm_tt_get_user_pages_done/mmu_range_read_retry return value but 
> check mem->invalid flag which is updated from invalidate callback. It 
> takes more time to change, I will come to another patch to fix it later.

Ah.. confusing, OK, I'll let you sort that

> > However, this is all pre-existing bugs, so I'm OK go ahead with this
> > patch as modified. I advise AMD to make a followup patch ..
> > 
> yes, I will.

While you are here, this is also wrong:

int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
{
	down_read(&mm->mmap_sem);
	r = hmm_range_fault(range, 0);
	up_read(&mm->mmap_sem);
	if (unlikely(r <= 0)) {
		if ((r == 0 || r == -EBUSY) && !time_after(jiffies, timeout))
			goto retry;
		goto out_free_pfns;
	}

	for (i = 0; i < ttm->num_pages; i++) {
		pages[i] = hmm_device_entry_to_page(range, range->pfns[i]);

It is not allowed to read the results of hmm_range_fault() outside
locking, and in particular, we can't convert to a struct page.

This must be done inside the notifier_lock, after checking
mmu_range_read_retry(), all handling of the struct page must be
structured like that.

> >> @@ -997,10 +1004,18 @@ static void amdgpu_ttm_tt_unpin_userptr(struct ttm_tt *ttm)
> >>   	sg_free_table(ttm->sg);
> >>   
> >>   #if IS_ENABLED(CONFIG_DRM_AMDGPU_USERPTR)
> >> -	if (gtt->range &&
> >> -	    ttm->pages[0] == hmm_device_entry_to_page(gtt->range,
> >> -						      gtt->range->pfns[0]))
> >> -		WARN_ONCE(1, "Missing get_user_page_done\n");
> >> +	if (gtt->range) {
> >> +		unsigned long i;
> >> +
> >> +		for (i = 0; i < ttm->num_pages; i++) {
> >> +			if (ttm->pages[i] !=
> >> +				hmm_device_entry_to_page(gtt->range,
> >> +					      gtt->range->pfns[i]))
> >> +				break;
> >> +		}
> >> +
> >> +		WARN((i == ttm->num_pages), "Missing get_user_page_done\n");
> >> +	}
> > 
> > Is this related/necessary? I can put it in another patch if it is just
> > debugging improvement? Please advise
> > 
> I see this WARN backtrace now, but I didn't see it before. This is 
> somehow related.

Hm, might be instructive to learn what is going on..

Thanks,
Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 09/15] xen/gntdev: use mmu_range_notifier_insert
  2019-10-30 16:55   ` Boris Ostrovsky
@ 2019-11-01 17:48     ` Jason Gunthorpe
  2019-11-01 18:51       ` Boris Ostrovsky
  0 siblings, 1 reply; 71+ messages in thread
From: Jason Gunthorpe @ 2019-11-01 17:48 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard,
	Felix.Kuehling, linux-rdma, dri-devel, amd-gfx, Alex Deucher,
	Ben Skeggs, Christian König, David Zhou, Dennis Dalessandro,
	Juergen Gross, Mike Marciniszyn, Oleksandr Andrushchenko,
	Petr Cvek, Stefano Stabellini, nouveau, xen-devel,
	Christoph Hellwig

On Wed, Oct 30, 2019 at 12:55:37PM -0400, Boris Ostrovsky wrote:
> On 10/28/19 4:10 PM, Jason Gunthorpe wrote:
> > From: Jason Gunthorpe <jgg@mellanox.com>
> >
> > gntdev simply wants to monitor a specific VMA for any notifier events,
> > this can be done straightforwardly using mmu_range_notifier_insert() over
> > the VMA's VA range.
> >
> > The notifier should be attached until the original VMA is destroyed.
> >
> > It is unclear if any of this is even sane, but at least a lot of duplicate
> > code is removed.
> 
> I didn't have a chance to look at the patch itself yet but as a heads-up
> --- it crashes dom0.

Thanks Boris. I spent a bit of time and got a VM running with a xen
4.9 hypervisor and a kernel with this patch series. It a ubuntu bionic
VM with the distro's xen stuff.

Can you give some guidance how you made it crash? I see the VM
autoloaded gntdev:

Module                  Size  Used by
xen_gntdev             24576  2
xen_evtchn             16384  1
xenfs                  16384  1
xen_privcmd            24576  16 xenfs

And lsof says several xen processes have the chardev open:

xenstored  819                 root   13u      CHR              10,53      0t0      19595 /dev/xen/gntdev
xenconsol  857                 root    8u      CHR              10,53      0t0      19595 /dev/xen/gntdev
xenconsol  857 860             root    8u      CHR              10,53      0t0      19595 /dev/xen/gntdev

But no crashing..

However, I wasn't able to get my usual debug kernel .config to boot
with the xen hypervisor, it crashes on early boot with:

(XEN) Dom0 has maximum 8 VCPUs
(XEN) Scrubbing Free RAM on 1 nodes using 8 CPUs
(XEN) .done.
(XEN) Initial low memory virq threshold set at 0x1000 pages.
(XEN) Std. Loglevel: All
(XEN) Guest Loglevel: All
(XEN) *** Serial input -> DOM0 (type 'CTRL-a' three times to switch input to Xen)
(XEN) Freed 468kB init memory
(XEN) d0v0 Unhandled page fault fault/trap [#14, ec=0002]
(XEN) Pagetable walk from fffffbfff0480fbe:
(XEN)  L4[0x1f7] = 0000000000000000 ffffffffffffffff
(XEN) domain_crash_sync called from entry.S: fault at ffff82d080348a06 entry.o#create_bounce_frame+0x135/0x15f
(XEN) Domain 0 (vcpu#0) crashed on cpu#0:
(XEN) ----[ Xen-4.9.2  x86_64  debug=n   Not tainted ]----
(XEN) CPU:    0
(XEN) RIP:    e033:[<ffffffff82b9f731>]
(XEN) RFLAGS: 0000000000000296   EM: 1   CONTEXT: pv guest (d0v0)
(XEN) rax: fffffbfff0480fbe   rbx: 0000000000000000   rcx: 00000000c0000101
(XEN) rdx: 00000000ffffffff   rsi: ffffffff84026000   rdi: ffffffff82cb4a20
(XEN) rbp: ffffffff82407ff8   rsp: ffffffff82407da0   r8:  0000000000000000
(XEN) r9:  0000000000000000   r10: 0000000000000000   r11: 0000000000000000
(XEN) r12: 0000000000000000   r13: 1ffffffff0480fbe   r14: 0000000000000000
(XEN) r15: 0000000000000000   cr0: 000000008005003b   cr4: 00000000003506e0
(XEN) cr3: 0000000034027000   cr2: fffffbfff0480fbe
(XEN) fsb: 0000000000000000   gsb: ffffffff82b61000   gss: 0000000000000000
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e02b   cs: e033

Which is surely some .config issue, but I didn't figure out what.

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 14/15] drm/amdgpu: Use mmu_range_notifier instead of hmm_mirror
  2019-11-01 14:44       ` Yang, Philip
  2019-11-01 15:12         ` Jason Gunthorpe
@ 2019-11-01 18:21         ` Jason Gunthorpe
  2019-11-01 18:34         ` [PATCH v2a " Jason Gunthorpe
  2 siblings, 0 replies; 71+ messages in thread
From: Jason Gunthorpe @ 2019-11-01 18:21 UTC (permalink / raw)
  To: Yang, Philip, Jerome Glisse
  Cc: linux-mm, Ralph Campbell, John Hubbard, Kuehling, Felix,
	Juergen Gross, Zhou, David(ChunMing),
	Mike Marciniszyn, Stefano Stabellini, Oleksandr Andrushchenko,
	linux-rdma, nouveau, Dennis Dalessandro, amd-gfx,
	Christoph Hellwig, dri-devel, Deucher, Alexander, xen-devel,
	Boris Ostrovsky, Petr Cvek, Koenig, Christian, Ben Skeggs

On Fri, Nov 01, 2019 at 02:44:51PM +0000, Yang, Philip wrote:
> @@ -854,12 +853,20 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
>  		r = -EPERM;
>  		goto out_unlock;
>  	}
> +	up_read(&mm->mmap_sem);
> +	timeout = jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
> +
> +retry:
> +	range->notifier_seq = mmu_range_read_begin(&bo->notifier);
>  
> +	down_read(&mm->mmap_sem);
>  	r = hmm_range_fault(range, 0);
>  	up_read(&mm->mmap_sem);
> -
> -	if (unlikely(r < 0))
> +	if (unlikely(r <= 0)) {
> +		if ((r == 0 || r == -EBUSY) && !time_after(jiffies, timeout))
> +			goto retry;
>  		goto out_free_pfns;
> +	}

I was reflecting on why this suddently became necessary, and I think
what might be happening is that hmm_range_fault() is trigging
invalidations as it runs (ie it is faulting in pages or something) and
that in turn causes the mrn to need retry.

The hmm version of this had a bug where a full
invalidate_range_start/end pair would not trigger retry, so this this
didn't happen.

This is unfortunate as the retry is unnecessary, but at this time I
can't think of a good way to separate an ignorable synchronous
invalidation caused by hmm_range_fault from an async one that cannot
be ignored..

A basic fix would be to not update the mrq seq in the notifier if
the invalidate is triggered by hmm_range_fault, but that seems
difficult to determine..

Any thoughts Jerome?

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 08/15] xen/gntdev: Use select for DMA_SHARED_BUFFER
  2019-10-28 20:10 ` [PATCH v2 08/15] xen/gntdev: Use select for DMA_SHARED_BUFFER Jason Gunthorpe
@ 2019-11-01 18:26   ` Jason Gunthorpe
  2019-11-05 14:44     ` Jürgen Groß
  2019-11-07  9:39   ` Jürgen Groß
  1 sibling, 1 reply; 71+ messages in thread
From: Jason Gunthorpe @ 2019-11-01 18:26 UTC (permalink / raw)
  To: linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard, Felix.Kuehling
  Cc: linux-rdma, dri-devel, amd-gfx, Alex Deucher, Ben Skeggs,
	Boris Ostrovsky, Christian König, David Zhou,
	Dennis Dalessandro, Juergen Gross, Mike Marciniszyn,
	Oleksandr Andrushchenko, Petr Cvek, Stefano Stabellini, nouveau,
	xen-devel, Christoph Hellwig

On Mon, Oct 28, 2019 at 05:10:25PM -0300, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> DMA_SHARED_BUFFER can not be enabled by the user (it represents a library
> set in the kernel). The kconfig convention is to use select for such
> symbols so they are turned on implicitly when the user enables a kconfig
> that needs them.
> 
> Otherwise the XEN_GNTDEV_DMABUF kconfig is overly difficult to enable.
> 
> Fixes: 932d6562179e ("xen/gntdev: Add initial support for dma-buf UAPI")
> Cc: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Cc: xen-devel@lists.xenproject.org
> Cc: Juergen Gross <jgross@suse.com>
> Cc: Stefano Stabellini <sstabellini@kernel.org>
> Reviewed-by: Juergen Gross <jgross@suse.com>
> Reviewed-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> ---
>  drivers/xen/Kconfig | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)

Juergen/Oleksandr/Xen Maintainers:

Would you take this patch through a xen related tree? The only reason
I had in this series is to make it easier to compile-test the gntdev
changes.

Since it is looking like the gntdev rework might not make it this
cycle it is probably best for you to take it.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* [PATCH v2a 14/15] drm/amdgpu: Use mmu_range_notifier instead of hmm_mirror
  2019-11-01 14:44       ` Yang, Philip
  2019-11-01 15:12         ` Jason Gunthorpe
  2019-11-01 18:21         ` Jason Gunthorpe
@ 2019-11-01 18:34         ` " Jason Gunthorpe
  2 siblings, 0 replies; 71+ messages in thread
From: Jason Gunthorpe @ 2019-11-01 18:34 UTC (permalink / raw)
  To: Yang, Philip
  Cc: linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard, Kuehling,
	Felix, Juergen Gross, Zhou, David(ChunMing),
	Mike Marciniszyn, Stefano Stabellini, Oleksandr Andrushchenko,
	linux-rdma, nouveau, Dennis Dalessandro, amd-gfx,
	Christoph Hellwig, dri-devel, Deucher, Alexander, xen-devel,
	Boris Ostrovsky, Petr Cvek, Koenig, Christian, Ben Skeggs

Convert the collision-retry lock around hmm_range_fault to use the one now
provided by the mmu_range notifier.

Although this driver does not seem to use the collision retry lock that
hmm provides correctly, it can still be converted over to use the
mmu_range_notifier api instead of hmm_mirror without too much trouble.

This also deletes another place where a driver is associating additional
data (struct amdgpu_mn) with a mmu_struct.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Tested-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   4 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        |  14 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c        | 150 ++----------------
 drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h        |  49 ------
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       | 116 +++++++++-----
 5 files changed, 96 insertions(+), 237 deletions(-)

Philip, here is what it loos like after combining the two patches, thanks

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 47700302a08b7f..1bcedb9b477dce 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -1738,6 +1738,10 @@ static int update_invalid_user_pages(struct amdkfd_process_info *process_info,
 			return ret;
 		}
 
+		/*
+		 * FIXME: Cannot ignore the return code, must hold
+		 * notifier_lock
+		 */
 		amdgpu_ttm_tt_get_user_pages_done(bo->tbo.ttm);
 
 		/* Mark the BO as valid unless it was invalidated
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index 82823d9a8ba887..22c989bca7514c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -603,8 +603,6 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
 		e->tv.num_shared = 2;
 
 	amdgpu_bo_list_get_list(p->bo_list, &p->validated);
-	if (p->bo_list->first_userptr != p->bo_list->num_entries)
-		p->mn = amdgpu_mn_get(p->adev, AMDGPU_MN_TYPE_GFX);
 
 	INIT_LIST_HEAD(&duplicates);
 	amdgpu_vm_get_pd_bo(&fpriv->vm, &p->validated, &p->vm_pd);
@@ -1287,11 +1285,11 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
 	if (r)
 		goto error_unlock;
 
-	/* No memory allocation is allowed while holding the mn lock.
-	 * p->mn is hold until amdgpu_cs_submit is finished and fence is added
-	 * to BOs.
+	/* No memory allocation is allowed while holding the notifier lock.
+	 * The lock is held until amdgpu_cs_submit is finished and fence is
+	 * added to BOs.
 	 */
-	amdgpu_mn_lock(p->mn);
+	mutex_lock(&p->adev->notifier_lock);
 
 	/* If userptr are invalidated after amdgpu_cs_parser_bos(), return
 	 * -EAGAIN, drmIoctl in libdrm will restart the amdgpu_cs_ioctl.
@@ -1334,13 +1332,13 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
 	amdgpu_vm_move_to_lru_tail(p->adev, &fpriv->vm);
 
 	ttm_eu_fence_buffer_objects(&p->ticket, &p->validated, p->fence);
-	amdgpu_mn_unlock(p->mn);
+	mutex_unlock(&p->adev->notifier_lock);
 
 	return 0;
 
 error_abort:
 	drm_sched_job_cleanup(&job->base);
-	amdgpu_mn_unlock(p->mn);
+	mutex_unlock(&p->adev->notifier_lock);
 
 error_unlock:
 	amdgpu_job_free(job);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index ac74320b71e4e7..f7be34907e54f5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -50,28 +50,6 @@
 #include "amdgpu.h"
 #include "amdgpu_amdkfd.h"
 
-/**
- * amdgpu_mn_lock - take the write side lock for this notifier
- *
- * @mn: our notifier
- */
-void amdgpu_mn_lock(struct amdgpu_mn *mn)
-{
-	if (mn)
-		down_write(&mn->lock);
-}
-
-/**
- * amdgpu_mn_unlock - drop the write side lock for this notifier
- *
- * @mn: our notifier
- */
-void amdgpu_mn_unlock(struct amdgpu_mn *mn)
-{
-	if (mn)
-		up_write(&mn->lock);
-}
-
 /**
  * amdgpu_mn_invalidate_gfx - callback to notify about mm change
  *
@@ -82,16 +60,20 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
  * potentially dirty.
  */
 static bool amdgpu_mn_invalidate_gfx(struct mmu_range_notifier *mrn,
-				     const struct mmu_notifier_range *range)
+				     const struct mmu_notifier_range *range,
+				     unsigned long cur_seq)
 {
 	struct amdgpu_bo *bo = container_of(mrn, struct amdgpu_bo, notifier);
 	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
 	long r;
 
-	if (!mmu_notifier_range_blockable(range))
+	if (mmu_notifier_range_blockable(range))
+		mutex_lock(&adev->notifier_lock);
+	else if (!mutex_trylock(&adev->notifier_lock))
 		return false;
 
-	mutex_lock(&adev->notifier_lock);
+	mmu_range_set_seq(mrn, cur_seq);
+
 	r = dma_resv_wait_timeout_rcu(bo->tbo.base.resv, true, false,
 				      MAX_SCHEDULE_TIMEOUT);
 	mutex_unlock(&adev->notifier_lock);
@@ -114,15 +96,19 @@ static const struct mmu_range_notifier_ops amdgpu_mn_gfx_ops = {
  * evicting all user-mode queues of the process.
  */
 static bool amdgpu_mn_invalidate_hsa(struct mmu_range_notifier *mrn,
-				     const struct mmu_notifier_range *range)
+				     const struct mmu_notifier_range *range,
+				     unsigned long cur_seq)
 {
 	struct amdgpu_bo *bo = container_of(mrn, struct amdgpu_bo, notifier);
 	struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
 
-	if (!mmu_notifier_range_blockable(range))
+	if (mmu_notifier_range_blockable(range))
+		mutex_lock(&adev->notifier_lock);
+	else if (!mutex_trylock(&adev->notifier_lock))
 		return false;
 
-	mutex_lock(&adev->notifier_lock);
+	mmu_range_set_seq(mrn, cur_seq);
+
 	amdgpu_amdkfd_evict_userptr(bo->kfd_bo, bo->notifier.mm);
 	mutex_unlock(&adev->notifier_lock);
 
@@ -133,92 +119,6 @@ static const struct mmu_range_notifier_ops amdgpu_mn_hsa_ops = {
 	.invalidate = amdgpu_mn_invalidate_hsa,
 };
 
-static int amdgpu_mn_sync_pagetables(struct hmm_mirror *mirror,
-				     const struct mmu_notifier_range *update)
-{
-	struct amdgpu_mn *amn = container_of(mirror, struct amdgpu_mn, mirror);
-
-	if (!mmu_notifier_range_blockable(update))
-		return -EAGAIN;
-
-	down_read(&amn->lock);
-	up_read(&amn->lock);
-	return 0;
-}
-
-/* Low bits of any reasonable mm pointer will be unused due to struct
- * alignment. Use these bits to make a unique key from the mm pointer
- * and notifier type.
- */
-#define AMDGPU_MN_KEY(mm, type) ((unsigned long)(mm) + (type))
-
-static struct hmm_mirror_ops amdgpu_hmm_mirror_ops[] = {
-	[AMDGPU_MN_TYPE_GFX] = {
-		.sync_cpu_device_pagetables = amdgpu_mn_sync_pagetables,
-	},
-	[AMDGPU_MN_TYPE_HSA] = {
-		.sync_cpu_device_pagetables = amdgpu_mn_sync_pagetables,
-	},
-};
-
-/**
- * amdgpu_mn_get - create HMM mirror context
- *
- * @adev: amdgpu device pointer
- * @type: type of MMU notifier context
- *
- * Creates a HMM mirror context for current->mm.
- */
-struct amdgpu_mn *amdgpu_mn_get(struct amdgpu_device *adev,
-				enum amdgpu_mn_type type)
-{
-	struct mm_struct *mm = current->mm;
-	struct amdgpu_mn *amn;
-	unsigned long key = AMDGPU_MN_KEY(mm, type);
-	int r;
-
-	mutex_lock(&adev->mn_lock);
-	if (down_write_killable(&mm->mmap_sem)) {
-		mutex_unlock(&adev->mn_lock);
-		return ERR_PTR(-EINTR);
-	}
-
-	hash_for_each_possible(adev->mn_hash, amn, node, key)
-		if (AMDGPU_MN_KEY(amn->mirror.hmm->mmu_notifier.mm,
-				  amn->type) == key)
-			goto release_locks;
-
-	amn = kzalloc(sizeof(*amn), GFP_KERNEL);
-	if (!amn) {
-		amn = ERR_PTR(-ENOMEM);
-		goto release_locks;
-	}
-
-	amn->adev = adev;
-	init_rwsem(&amn->lock);
-	amn->type = type;
-
-	amn->mirror.ops = &amdgpu_hmm_mirror_ops[type];
-	r = hmm_mirror_register(&amn->mirror, mm);
-	if (r)
-		goto free_amn;
-
-	hash_add(adev->mn_hash, &amn->node, AMDGPU_MN_KEY(mm, type));
-
-release_locks:
-	up_write(&mm->mmap_sem);
-	mutex_unlock(&adev->mn_lock);
-
-	return amn;
-
-free_amn:
-	up_write(&mm->mmap_sem);
-	mutex_unlock(&adev->mn_lock);
-	kfree(amn);
-
-	return ERR_PTR(r);
-}
-
 /**
  * amdgpu_mn_register - register a BO for notifier updates
  *
@@ -253,25 +153,3 @@ void amdgpu_mn_unregister(struct amdgpu_bo *bo)
 	mmu_range_notifier_remove(&bo->notifier);
 	bo->notifier.mm = NULL;
 }
-
-/* flags used by HMM internal, not related to CPU/GPU PTE flags */
-static const uint64_t hmm_range_flags[HMM_PFN_FLAG_MAX] = {
-		(1 << 0), /* HMM_PFN_VALID */
-		(1 << 1), /* HMM_PFN_WRITE */
-		0 /* HMM_PFN_DEVICE_PRIVATE */
-};
-
-static const uint64_t hmm_range_values[HMM_PFN_VALUE_MAX] = {
-		0xfffffffffffffffeUL, /* HMM_PFN_ERROR */
-		0, /* HMM_PFN_NONE */
-		0xfffffffffffffffcUL /* HMM_PFN_SPECIAL */
-};
-
-void amdgpu_hmm_init_range(struct hmm_range *range)
-{
-	if (range) {
-		range->flags = hmm_range_flags;
-		range->values = hmm_range_values;
-		range->pfn_shift = PAGE_SHIFT;
-	}
-}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h
index d73ab2947b22b2..a292238f75ebae 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h
@@ -30,59 +30,10 @@
 #include <linux/workqueue.h>
 #include <linux/interval_tree.h>
 
-enum amdgpu_mn_type {
-	AMDGPU_MN_TYPE_GFX,
-	AMDGPU_MN_TYPE_HSA,
-};
-
-/**
- * struct amdgpu_mn
- *
- * @adev: amdgpu device pointer
- * @type: type of MMU notifier
- * @work: destruction work item
- * @node: hash table node to find structure by adev and mn
- * @lock: rw semaphore protecting the notifier nodes
- * @mirror: HMM mirror function support
- *
- * Data for each amdgpu device and process address space.
- */
-struct amdgpu_mn {
-	/* constant after initialisation */
-	struct amdgpu_device	*adev;
-	enum amdgpu_mn_type	type;
-
-	/* only used on destruction */
-	struct work_struct	work;
-
-	/* protected by adev->mn_lock */
-	struct hlist_node	node;
-
-	/* objects protected by lock */
-	struct rw_semaphore	lock;
-
-#ifdef CONFIG_HMM_MIRROR
-	/* HMM mirror */
-	struct hmm_mirror	mirror;
-#endif
-};
-
 #if defined(CONFIG_HMM_MIRROR)
-void amdgpu_mn_lock(struct amdgpu_mn *mn);
-void amdgpu_mn_unlock(struct amdgpu_mn *mn);
-struct amdgpu_mn *amdgpu_mn_get(struct amdgpu_device *adev,
-				enum amdgpu_mn_type type);
 int amdgpu_mn_register(struct amdgpu_bo *bo, unsigned long addr);
 void amdgpu_mn_unregister(struct amdgpu_bo *bo);
-void amdgpu_hmm_init_range(struct hmm_range *range);
 #else
-static inline void amdgpu_mn_lock(struct amdgpu_mn *mn) {}
-static inline void amdgpu_mn_unlock(struct amdgpu_mn *mn) {}
-static inline struct amdgpu_mn *amdgpu_mn_get(struct amdgpu_device *adev,
-					      enum amdgpu_mn_type type)
-{
-	return NULL;
-}
 static inline int amdgpu_mn_register(struct amdgpu_bo *bo, unsigned long addr)
 {
 	DRM_WARN_ONCE("HMM_MIRROR kernel config option is not enabled, "
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index c0e41f1f0c2365..5f4d8ab76f1da0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -773,6 +773,20 @@ struct amdgpu_ttm_tt {
 #endif
 };
 
+#ifdef CONFIG_DRM_AMDGPU_USERPTR
+/* flags used by HMM internal, not related to CPU/GPU PTE flags */
+static const uint64_t hmm_range_flags[HMM_PFN_FLAG_MAX] = {
+	(1 << 0), /* HMM_PFN_VALID */
+	(1 << 1), /* HMM_PFN_WRITE */
+	0 /* HMM_PFN_DEVICE_PRIVATE */
+};
+
+static const uint64_t hmm_range_values[HMM_PFN_VALUE_MAX] = {
+	0xfffffffffffffffeUL, /* HMM_PFN_ERROR */
+	0, /* HMM_PFN_NONE */
+	0xfffffffffffffffcUL /* HMM_PFN_SPECIAL */
+};
+
 /**
  * amdgpu_ttm_tt_get_user_pages - get device accessible pages that back user
  * memory and start HMM tracking CPU page table update
@@ -780,29 +794,28 @@ struct amdgpu_ttm_tt {
  * Calling function must call amdgpu_ttm_tt_userptr_range_done() once and only
  * once afterwards to stop HMM tracking
  */
-#if IS_ENABLED(CONFIG_DRM_AMDGPU_USERPTR)
-
-#define MAX_RETRY_HMM_RANGE_FAULT	16
-
 int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
 {
-	struct hmm_mirror *mirror = bo->mn ? &bo->mn->mirror : NULL;
 	struct ttm_tt *ttm = bo->tbo.ttm;
 	struct amdgpu_ttm_tt *gtt = (void *)ttm;
-	struct mm_struct *mm;
 	unsigned long start = gtt->userptr;
 	struct vm_area_struct *vma;
 	struct hmm_range *range;
+	unsigned long timeout;
+	struct mm_struct *mm;
 	unsigned long i;
-	uint64_t *pfns;
 	int r = 0;
 
-	if (unlikely(!mirror)) {
-		DRM_DEBUG_DRIVER("Failed to get hmm_mirror\n");
+	mm = bo->notifier.mm;
+	if (unlikely(!mm)) {
+		DRM_DEBUG_DRIVER("BO is not registered?\n");
 		return -EFAULT;
 	}
 
-	mm = mirror->hmm->mmu_notifier.mm;
+	/* Another get_user_pages is running at the same time?? */
+	if (WARN_ON(gtt->range))
+		return -EFAULT;
+
 	if (!mmget_not_zero(mm)) /* Happens during process shutdown */
 		return -ESRCH;
 
@@ -811,31 +824,23 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
 		r = -ENOMEM;
 		goto out;
 	}
+	range->notifier = &bo->notifier;
+	range->flags = hmm_range_flags;
+	range->values = hmm_range_values;
+	range->pfn_shift = PAGE_SHIFT;
+	range->start = bo->notifier.interval_tree.start;
+	range->end = bo->notifier.interval_tree.last + 1;
+	range->default_flags = hmm_range_flags[HMM_PFN_VALID];
+	if (!amdgpu_ttm_tt_is_readonly(ttm))
+		range->default_flags |= range->flags[HMM_PFN_WRITE];
 
-	pfns = kvmalloc_array(ttm->num_pages, sizeof(*pfns), GFP_KERNEL);
-	if (unlikely(!pfns)) {
+	range->pfns = kvmalloc_array(ttm->num_pages, sizeof(*range->pfns),
+				     GFP_KERNEL);
+	if (unlikely(!range->pfns)) {
 		r = -ENOMEM;
 		goto out_free_ranges;
 	}
 
-	amdgpu_hmm_init_range(range);
-	range->default_flags = range->flags[HMM_PFN_VALID];
-	range->default_flags |= amdgpu_ttm_tt_is_readonly(ttm) ?
-				0 : range->flags[HMM_PFN_WRITE];
-	range->pfn_flags_mask = 0;
-	range->pfns = pfns;
-	range->start = start;
-	range->end = start + ttm->num_pages * PAGE_SIZE;
-
-	hmm_range_register(range, mirror);
-
-	/*
-	 * Just wait for range to be valid, safe to ignore return value as we
-	 * will use the return value of hmm_range_fault() below under the
-	 * mmap_sem to ascertain the validity of the range.
-	 */
-	hmm_range_wait_until_valid(range, HMM_RANGE_DEFAULT_TIMEOUT);
-
 	down_read(&mm->mmap_sem);
 	vma = find_vma(mm, start);
 	if (unlikely(!vma || start < vma->vm_start)) {
@@ -847,18 +852,31 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
 		r = -EPERM;
 		goto out_unlock;
 	}
+	up_read(&mm->mmap_sem);
+	timeout = jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
 
+retry:
+	range->notifier_seq = mmu_range_read_begin(&bo->notifier);
+
+	down_read(&mm->mmap_sem);
 	r = hmm_range_fault(range, 0);
 	up_read(&mm->mmap_sem);
-
-	if (unlikely(r < 0))
+	if (unlikely(r <= 0)) {
+		/*
+		 * FIXME: This timeout should encompass the retry from
+		 * mmu_range_read_retry() as well.
+		 */
+		if ((r == 0 || r == -EBUSY) && !time_after(jiffies, timeout))
+			goto retry;
 		goto out_free_pfns;
+	}
 
 	for (i = 0; i < ttm->num_pages; i++) {
-		pages[i] = hmm_device_entry_to_page(range, pfns[i]);
+		/* FIXME: The pages cannot be touched outside the notifier_lock */
+		pages[i] = hmm_device_entry_to_page(range, range->pfns[i]);
 		if (unlikely(!pages[i])) {
 			pr_err("Page fault failed for pfn[%lu] = 0x%llx\n",
-			       i, pfns[i]);
+			       i, range->pfns[i]);
 			r = -ENOMEM;
 
 			goto out_free_pfns;
@@ -873,8 +891,7 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
 out_unlock:
 	up_read(&mm->mmap_sem);
 out_free_pfns:
-	hmm_range_unregister(range);
-	kvfree(pfns);
+	kvfree(range->pfns);
 out_free_ranges:
 	kfree(range);
 out:
@@ -903,15 +920,18 @@ bool amdgpu_ttm_tt_get_user_pages_done(struct ttm_tt *ttm)
 		"No user pages to check\n");
 
 	if (gtt->range) {
-		r = hmm_range_valid(gtt->range);
-		hmm_range_unregister(gtt->range);
-
+		/*
+		 * FIXME: Must always hold notifier_lock for this, and must
+		 * not ignore the return code.
+		 */
+		r = mmu_range_read_retry(gtt->range->notifier,
+					 gtt->range->notifier_seq);
 		kvfree(gtt->range->pfns);
 		kfree(gtt->range);
 		gtt->range = NULL;
 	}
 
-	return r;
+	return !r;
 }
 #endif
 
@@ -992,10 +1012,18 @@ static void amdgpu_ttm_tt_unpin_userptr(struct ttm_tt *ttm)
 	sg_free_table(ttm->sg);
 
 #if IS_ENABLED(CONFIG_DRM_AMDGPU_USERPTR)
-	if (gtt->range &&
-	    ttm->pages[0] == hmm_device_entry_to_page(gtt->range,
-						      gtt->range->pfns[0]))
-		WARN_ONCE(1, "Missing get_user_page_done\n");
+	if (gtt->range) {
+		unsigned long i;
+
+		for (i = 0; i < ttm->num_pages; i++) {
+			if (ttm->pages[i] !=
+				hmm_device_entry_to_page(gtt->range,
+					      gtt->range->pfns[i]))
+				break;
+		}
+
+		WARN((i == ttm->num_pages), "Missing get_user_page_done\n");
+	}
 #endif
 }
 
-- 
2.23.0


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 09/15] xen/gntdev: use mmu_range_notifier_insert
  2019-11-01 17:48     ` Jason Gunthorpe
@ 2019-11-01 18:51       ` Boris Ostrovsky
  2019-11-01 19:17         ` Jason Gunthorpe
  0 siblings, 1 reply; 71+ messages in thread
From: Boris Ostrovsky @ 2019-11-01 18:51 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard,
	Felix.Kuehling, linux-rdma, dri-devel, amd-gfx, Alex Deucher,
	Ben Skeggs, Christian König, David Zhou, Dennis Dalessandro,
	Juergen Gross, Mike Marciniszyn, Oleksandr Andrushchenko,
	Petr Cvek, Stefano Stabellini, nouveau, xen-devel,
	Christoph Hellwig

On 11/1/19 1:48 PM, Jason Gunthorpe wrote:
> On Wed, Oct 30, 2019 at 12:55:37PM -0400, Boris Ostrovsky wrote:
>> On 10/28/19 4:10 PM, Jason Gunthorpe wrote:
>>> From: Jason Gunthorpe <jgg@mellanox.com>
>>>
>>> gntdev simply wants to monitor a specific VMA for any notifier events,
>>> this can be done straightforwardly using mmu_range_notifier_insert() over
>>> the VMA's VA range.
>>>
>>> The notifier should be attached until the original VMA is destroyed.
>>>
>>> It is unclear if any of this is even sane, but at least a lot of duplicate
>>> code is removed.
>> I didn't have a chance to look at the patch itself yet but as a heads-up
>> --- it crashes dom0.
> Thanks Boris. I spent a bit of time and got a VM running with a xen
> 4.9 hypervisor and a kernel with this patch series. It a ubuntu bionic
> VM with the distro's xen stuff.
>
> Can you give some guidance how you made it crash? 

It crashes trying to dereference mrn->ops->invalidate in
mn_itree_invalidate() when a guest exits.

I don't think you've initialized notifier ops. I don't see you using
gntdev_mmu_ops anywhere.

-boris


> I see the VM
> autoloaded gntdev:
>
> Module                  Size  Used by
> xen_gntdev             24576  2
> xen_evtchn             16384  1
> xenfs                  16384  1
> xen_privcmd            24576  16 xenfs
>
> And lsof says several xen processes have the chardev open:
>
> xenstored  819                 root   13u      CHR              10,53      0t0      19595 /dev/xen/gntdev
> xenconsol  857                 root    8u      CHR              10,53      0t0      19595 /dev/xen/gntdev
> xenconsol  857 860             root    8u      CHR              10,53      0t0      19595 /dev/xen/gntdev
>
> But no crashing..
>
> However, I wasn't able to get my usual debug kernel .config to boot
> with the xen hypervisor, it crashes on early boot with:
>
> (XEN) Dom0 has maximum 8 VCPUs
> (XEN) Scrubbing Free RAM on 1 nodes using 8 CPUs
> (XEN) .done.
> (XEN) Initial low memory virq threshold set at 0x1000 pages.
> (XEN) Std. Loglevel: All
> (XEN) Guest Loglevel: All
> (XEN) *** Serial input -> DOM0 (type 'CTRL-a' three times to switch input to Xen)
> (XEN) Freed 468kB init memory
> (XEN) d0v0 Unhandled page fault fault/trap [#14, ec=0002]
> (XEN) Pagetable walk from fffffbfff0480fbe:
> (XEN)  L4[0x1f7] = 0000000000000000 ffffffffffffffff
> (XEN) domain_crash_sync called from entry.S: fault at ffff82d080348a06 entry.o#create_bounce_frame+0x135/0x15f
> (XEN) Domain 0 (vcpu#0) crashed on cpu#0:
> (XEN) ----[ Xen-4.9.2  x86_64  debug=n   Not tainted ]----
> (XEN) CPU:    0
> (XEN) RIP:    e033:[<ffffffff82b9f731>]
> (XEN) RFLAGS: 0000000000000296   EM: 1   CONTEXT: pv guest (d0v0)
> (XEN) rax: fffffbfff0480fbe   rbx: 0000000000000000   rcx: 00000000c0000101
> (XEN) rdx: 00000000ffffffff   rsi: ffffffff84026000   rdi: ffffffff82cb4a20
> (XEN) rbp: ffffffff82407ff8   rsp: ffffffff82407da0   r8:  0000000000000000
> (XEN) r9:  0000000000000000   r10: 0000000000000000   r11: 0000000000000000
> (XEN) r12: 0000000000000000   r13: 1ffffffff0480fbe   r14: 0000000000000000
> (XEN) r15: 0000000000000000   cr0: 000000008005003b   cr4: 00000000003506e0
> (XEN) cr3: 0000000034027000   cr2: fffffbfff0480fbe
> (XEN) fsb: 0000000000000000   gsb: ffffffff82b61000   gss: 0000000000000000
> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e02b   cs: e033
>
> Which is surely some .config issue, but I didn't figure out what.
>
> Jason


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 09/15] xen/gntdev: use mmu_range_notifier_insert
  2019-11-01 18:51       ` Boris Ostrovsky
@ 2019-11-01 19:17         ` Jason Gunthorpe
  0 siblings, 0 replies; 71+ messages in thread
From: Jason Gunthorpe @ 2019-11-01 19:17 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard,
	Felix.Kuehling, linux-rdma, dri-devel, amd-gfx, Alex Deucher,
	Ben Skeggs, Christian König, David Zhou, Dennis Dalessandro,
	Juergen Gross, Mike Marciniszyn, Oleksandr Andrushchenko,
	Petr Cvek, Stefano Stabellini, nouveau, xen-devel,
	Christoph Hellwig

On Fri, Nov 01, 2019 at 02:51:46PM -0400, Boris Ostrovsky wrote:
> On 11/1/19 1:48 PM, Jason Gunthorpe wrote:
> > On Wed, Oct 30, 2019 at 12:55:37PM -0400, Boris Ostrovsky wrote:
> >> On 10/28/19 4:10 PM, Jason Gunthorpe wrote:
> >>> From: Jason Gunthorpe <jgg@mellanox.com>
> >>>
> >>> gntdev simply wants to monitor a specific VMA for any notifier events,
> >>> this can be done straightforwardly using mmu_range_notifier_insert() over
> >>> the VMA's VA range.
> >>>
> >>> The notifier should be attached until the original VMA is destroyed.
> >>>
> >>> It is unclear if any of this is even sane, but at least a lot of duplicate
> >>> code is removed.
> >> I didn't have a chance to look at the patch itself yet but as a heads-up
> > Thanks Boris. I spent a bit of time and got a VM running with a xen
> > 4.9 hypervisor and a kernel with this patch series. It a ubuntu bionic
> > VM with the distro's xen stuff.
> >
> > Can you give some guidance how you made it crash? 
> 
> It crashes trying to dereference mrn->ops->invalidate in
> mn_itree_invalidate() when a guest exits.
> 
> I don't think you've initialized notifier ops. I don't see you using
> gntdev_mmu_ops anywhere.

So weird the compiler didn't complain about an unused static...

But yes, this is a mistake, it should be:

diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index 37b278857ad807..0ca35485fd3865 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -1011,6 +1011,7 @@ static int gntdev_mmap(struct file *flip, struct vm_area_struct *vma)
 
 	if (use_ptemod) {
 		map->vma = vma;
+		map->notifier.ops = &gntdev_mmu_ops;
 		err = mmu_range_notifier_insert_locked(
 			&map->notifier, vma->vm_start,
 			vma->vm_end - vma->vm_start, vma->vm_mm);

Thanks,
Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 14/15] drm/amdgpu: Use mmu_range_notifier instead of hmm_mirror
  2019-11-01 17:42             ` Jason Gunthorpe
@ 2019-11-01 19:19               ` Jason Gunthorpe
  2019-11-01 19:45               ` Yang, Philip
  1 sibling, 0 replies; 71+ messages in thread
From: Jason Gunthorpe @ 2019-11-01 19:19 UTC (permalink / raw)
  To: Yang, Philip
  Cc: linux-mm, linux-rdma, amd-gfx, dri-devel, Deucher, Alexander,
	Koenig, Christian

On Fri, Nov 01, 2019 at 02:42:21PM -0300, Jason Gunthorpe wrote:
> On Fri, Nov 01, 2019 at 03:59:26PM +0000, Yang, Philip wrote:
> > > This test for range_blockable should be before mutex_lock, I can move
> > > it up
> > > 
> > yes, thanks.
> 
> Okay, I wrote it like this:
> 
> 	if (mmu_notifier_range_blockable(range))
> 		mutex_lock(&adev->notifier_lock);
> 	else if (!mutex_trylock(&adev->notifier_lock))
> 		return false;

Never mind, this routine sleeps for other reasons it should just be as
it was:

	if (!mmu_notifier_range_blockable(range))
		return false;

	mutex_lock(&adev->notifier_lock);

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 14/15] drm/amdgpu: Use mmu_range_notifier instead of hmm_mirror
  2019-11-01 17:42             ` Jason Gunthorpe
  2019-11-01 19:19               ` Jason Gunthorpe
@ 2019-11-01 19:45               ` Yang, Philip
  2019-11-01 19:50                 ` Yang, Philip
  2019-11-01 19:51                 ` Jason Gunthorpe
  1 sibling, 2 replies; 71+ messages in thread
From: Yang, Philip @ 2019-11-01 19:45 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard, Kuehling,
	Felix, Juergen Gross, Zhou, David(ChunMing),
	Mike Marciniszyn, Stefano Stabellini, Oleksandr Andrushchenko,
	linux-rdma, nouveau, Dennis Dalessandro, amd-gfx,
	Christoph Hellwig, dri-devel, Deucher, Alexander, xen-devel,
	Boris Ostrovsky, Petr Cvek, Koenig, Christian, Ben Skeggs



On 2019-11-01 1:42 p.m., Jason Gunthorpe wrote:
> On Fri, Nov 01, 2019 at 03:59:26PM +0000, Yang, Philip wrote:
>>> This test for range_blockable should be before mutex_lock, I can move
>>> it up
>>>
>> yes, thanks.
> 
> Okay, I wrote it like this:
> 
> 	if (mmu_notifier_range_blockable(range))
> 		mutex_lock(&adev->notifier_lock);
> 	else if (!mutex_trylock(&adev->notifier_lock))
> 		return false;
> 
>>> Also, do you know if notifier_lock is held while calling
>>> amdgpu_ttm_tt_get_user_pages_done()? Can we add a 'lock assert held'
>>> to amdgpu_ttm_tt_get_user_pages_done()?
>>
>> gpu side hold notifier_lock but kfd side doesn't. kfd side doesn't check
>> amdgpu_ttm_tt_get_user_pages_done/mmu_range_read_retry return value but
>> check mem->invalid flag which is updated from invalidate callback. It
>> takes more time to change, I will come to another patch to fix it later.
> 
> Ah.. confusing, OK, I'll let you sort that
> 
>>> However, this is all pre-existing bugs, so I'm OK go ahead with this
>>> patch as modified. I advise AMD to make a followup patch ..
>>>
>> yes, I will.
> 
> While you are here, this is also wrong:
> 
> int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
> {
> 	down_read(&mm->mmap_sem);
> 	r = hmm_range_fault(range, 0);
> 	up_read(&mm->mmap_sem);
> 	if (unlikely(r <= 0)) {
> 		if ((r == 0 || r == -EBUSY) && !time_after(jiffies, timeout))
> 			goto retry;
> 		goto out_free_pfns;
> 	}
> 
> 	for (i = 0; i < ttm->num_pages; i++) {
> 		pages[i] = hmm_device_entry_to_page(range, range->pfns[i]);
> 
> It is not allowed to read the results of hmm_range_fault() outside
> locking, and in particular, we can't convert to a struct page.
> 
> This must be done inside the notifier_lock, after checking
> mmu_range_read_retry(), all handling of the struct page must be
> structured like that.
> 
Below change will fix this, then driver will call mmu_range_read_retry 
second time using same range->notifier_seq to check if range is 
invalidated inside amdgpu_cs_submit, this looks ok for me.

@@ -868,6 +869,13 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo 
*bo, struct page **pages)
                 goto out_free_pfns;
         }

+       mutex_lock(&adev->notifier_lock);
+
+       if (mmu_range_read_retry(&bo->notifier, range->notifier_seq)) {
+               mutex_unlock(&adev->notifier_lock);
+               goto retry;
+       }
+
         for (i = 0; i < ttm->num_pages; i++) {
                 pages[i] = hmm_device_entry_to_page(range, range->pfns[i]);
                 if (unlikely(!pages[i])) {
@@ -875,10 +883,12 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo 
*bo, struct page **pages)
                                i, range->pfns[i]);
                         r = -ENOMEM;

+                       mutex_unlock(&adev->notifier_lock);
                         goto out_free_pfns;
                 }
         }

+       mutex_unlock(&adev->notifier_lock);
         gtt->range = range;
         mmput(mm);

Philip

>>>> @@ -997,10 +1004,18 @@ static void amdgpu_ttm_tt_unpin_userptr(struct ttm_tt *ttm)
>>>>    	sg_free_table(ttm->sg);
>>>>    
>>>>    #if IS_ENABLED(CONFIG_DRM_AMDGPU_USERPTR)
>>>> -	if (gtt->range &&
>>>> -	    ttm->pages[0] == hmm_device_entry_to_page(gtt->range,
>>>> -						      gtt->range->pfns[0]))
>>>> -		WARN_ONCE(1, "Missing get_user_page_done\n");
>>>> +	if (gtt->range) {
>>>> +		unsigned long i;
>>>> +
>>>> +		for (i = 0; i < ttm->num_pages; i++) {
>>>> +			if (ttm->pages[i] !=
>>>> +				hmm_device_entry_to_page(gtt->range,
>>>> +					      gtt->range->pfns[i]))
>>>> +				break;
>>>> +		}
>>>> +
>>>> +		WARN((i == ttm->num_pages), "Missing get_user_page_done\n");
>>>> +	}
>>>
>>> Is this related/necessary? I can put it in another patch if it is just
>>> debugging improvement? Please advise
>>>
>> I see this WARN backtrace now, but I didn't see it before. This is
>> somehow related.
> 
> Hm, might be instructive to learn what is going on..
> 
> Thanks,
> Jason
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 14/15] drm/amdgpu: Use mmu_range_notifier instead of hmm_mirror
  2019-11-01 19:45               ` Yang, Philip
@ 2019-11-01 19:50                 ` Yang, Philip
  2019-11-01 19:51                 ` Jason Gunthorpe
  1 sibling, 0 replies; 71+ messages in thread
From: Yang, Philip @ 2019-11-01 19:50 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: nouveau, dri-devel, linux-mm, Zhou, David(ChunMing),
	Stefano Stabellini, Oleksandr Andrushchenko, linux-rdma, amd-gfx,
	Christoph Hellwig, Ben Skeggs, xen-devel, Ralph Campbell,
	John Hubbard, Jerome Glisse, Dennis Dalessandro, Boris Ostrovsky,
	Petr Cvek, Juergen Gross, Mike Marciniszyn, Kuehling, Felix,
	Deucher, Alexander, Koenig, Christian

Sorry, resend patch, the one in previous email missed couple of lines 
duo to copy/paste.

On 2019-11-01 3:45 p.m., Yang, Philip wrote:
> 
> 
> On 2019-11-01 1:42 p.m., Jason Gunthorpe wrote:
>> On Fri, Nov 01, 2019 at 03:59:26PM +0000, Yang, Philip wrote:
>>>> This test for range_blockable should be before mutex_lock, I can move
>>>> it up
>>>>
>>> yes, thanks.
>>
>> Okay, I wrote it like this:
>>
>> 	if (mmu_notifier_range_blockable(range))
>> 		mutex_lock(&adev->notifier_lock);
>> 	else if (!mutex_trylock(&adev->notifier_lock))
>> 		return false;
>>
>>>> Also, do you know if notifier_lock is held while calling
>>>> amdgpu_ttm_tt_get_user_pages_done()? Can we add a 'lock assert held'
>>>> to amdgpu_ttm_tt_get_user_pages_done()?
>>>
>>> gpu side hold notifier_lock but kfd side doesn't. kfd side doesn't check
>>> amdgpu_ttm_tt_get_user_pages_done/mmu_range_read_retry return value but
>>> check mem->invalid flag which is updated from invalidate callback. It
>>> takes more time to change, I will come to another patch to fix it later.
>>
>> Ah.. confusing, OK, I'll let you sort that
>>
>>>> However, this is all pre-existing bugs, so I'm OK go ahead with this
>>>> patch as modified. I advise AMD to make a followup patch ..
>>>>
>>> yes, I will.
>>
>> While you are here, this is also wrong:
>>
>> int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page **pages)
>> {
>> 	down_read(&mm->mmap_sem);
>> 	r = hmm_range_fault(range, 0);
>> 	up_read(&mm->mmap_sem);
>> 	if (unlikely(r <= 0)) {
>> 		if ((r == 0 || r == -EBUSY) && !time_after(jiffies, timeout))
>> 			goto retry;
>> 		goto out_free_pfns;
>> 	}
>>
>> 	for (i = 0; i < ttm->num_pages; i++) {
>> 		pages[i] = hmm_device_entry_to_page(range, range->pfns[i]);
>>
>> It is not allowed to read the results of hmm_range_fault() outside
>> locking, and in particular, we can't convert to a struct page.
>>
>> This must be done inside the notifier_lock, after checking
>> mmu_range_read_retry(), all handling of the struct page must be
>> structured like that.
>>
> Below change will fix this, then driver will call mmu_range_read_retry
> second time using same range->notifier_seq to check if range is
> invalidated inside amdgpu_cs_submit, this looks ok for me.
> 
@@ -797,6 +797,7 @@ static const uint64_t 
hmm_range_values[HMM_PFN_VALUE_MAX] = {
   */
  int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo *bo, struct page 
**pages)
  {
+       struct amdgpu_device *adev = amdgpu_ttm_adev(bo->tbo.bdev);
         struct ttm_tt *ttm = bo->tbo.ttm;
         struct amdgpu_ttm_tt *gtt = (void *)ttm;
         unsigned long start = gtt->userptr;
@@ -868,6 +869,13 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo 
*bo, struct page **pages)
                 goto out_free_pfns;
         }

+       mutex_lock(&adev->notifier_lock);
+
+       if (mmu_range_read_retry(&bo->notifier, range->notifier_seq)) {
+               mutex_unlock(&adev->notifier_lock);
+               goto retry;
+       }
+
         for (i = 0; i < ttm->num_pages; i++) {
                 pages[i] = hmm_device_entry_to_page(range, range->pfns[i]);
                 if (unlikely(!pages[i])) {
@@ -875,10 +883,12 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo 
*bo, struct page **pages)
                                i, range->pfns[i]);
                         r = -ENOMEM;

+                       mutex_unlock(&adev->notifier_lock);
                         goto out_free_pfns;
                 }
         }

+       mutex_unlock(&adev->notifier_lock);
         gtt->range = range;
         mmput(mm);

> 
> Philip
> 
>>>>> @@ -997,10 +1004,18 @@ static void amdgpu_ttm_tt_unpin_userptr(struct ttm_tt *ttm)
>>>>>     	sg_free_table(ttm->sg);
>>>>>     
>>>>>     #if IS_ENABLED(CONFIG_DRM_AMDGPU_USERPTR)
>>>>> -	if (gtt->range &&
>>>>> -	    ttm->pages[0] == hmm_device_entry_to_page(gtt->range,
>>>>> -						      gtt->range->pfns[0]))
>>>>> -		WARN_ONCE(1, "Missing get_user_page_done\n");
>>>>> +	if (gtt->range) {
>>>>> +		unsigned long i;
>>>>> +
>>>>> +		for (i = 0; i < ttm->num_pages; i++) {
>>>>> +			if (ttm->pages[i] !=
>>>>> +				hmm_device_entry_to_page(gtt->range,
>>>>> +					      gtt->range->pfns[i]))
>>>>> +				break;
>>>>> +		}
>>>>> +
>>>>> +		WARN((i == ttm->num_pages), "Missing get_user_page_done\n");
>>>>> +	}
>>>>
>>>> Is this related/necessary? I can put it in another patch if it is just
>>>> debugging improvement? Please advise
>>>>
>>> I see this WARN backtrace now, but I didn't see it before. This is
>>> somehow related.
>>
>> Hm, might be instructive to learn what is going on..
>>
>> Thanks,
>> Jason
>>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 14/15] drm/amdgpu: Use mmu_range_notifier instead of hmm_mirror
  2019-11-01 19:45               ` Yang, Philip
  2019-11-01 19:50                 ` Yang, Philip
@ 2019-11-01 19:51                 ` Jason Gunthorpe
  1 sibling, 0 replies; 71+ messages in thread
From: Jason Gunthorpe @ 2019-11-01 19:51 UTC (permalink / raw)
  To: Yang, Philip
  Cc: linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard, Kuehling,
	Felix, Juergen Gross, Zhou, David(ChunMing),
	Mike Marciniszyn, Stefano Stabellini, Oleksandr Andrushchenko,
	linux-rdma, nouveau, Dennis Dalessandro, amd-gfx,
	Christoph Hellwig, dri-devel, Deucher, Alexander, xen-devel,
	Boris Ostrovsky, Petr Cvek, Koenig, Christian, Ben Skeggs

On Fri, Nov 01, 2019 at 07:45:22PM +0000, Yang, Philip wrote:

> > This must be done inside the notifier_lock, after checking
> > mmu_range_read_retry(), all handling of the struct page must be
> > structured like that.
> > 
> Below change will fix this, then driver will call mmu_range_read_retry 
> second time using same range->notifier_seq to check if range is 
> invalidated inside amdgpu_cs_submit, this looks ok for me.

Lets defer this to some patch trying to fix it, I find it hard to
follow..

> @@ -868,6 +869,13 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo 
> *bo, struct page **pages)
>                  goto out_free_pfns;
>          }
> 
> +       mutex_lock(&adev->notifier_lock);
> +
> +       if (mmu_range_read_retry(&bo->notifier, range->notifier_seq)) {
> +               mutex_unlock(&adev->notifier_lock);
> +               goto retry;
> +       }
> +
>          for (i = 0; i < ttm->num_pages; i++) {
>                  pages[i] = hmm_device_entry_to_page(range, range->pfns[i]);
>                  if (unlikely(!pages[i])) {
> @@ -875,10 +883,12 @@ int amdgpu_ttm_tt_get_user_pages(struct amdgpu_bo 
> *bo, struct page **pages)
>                                 i, range->pfns[i]);
>                          r = -ENOMEM;
> 
> +                       mutex_unlock(&adev->notifier_lock);
>                          goto out_free_pfns;
>                  }
>          }

Well, maybe? 

The question now is what happens to 'pages' ? With this arrangment the
driver cannot touch 'pages' without also again going under the lock
and checking retry. 

If it doesn't touch it, then lets just move this device_entry_to_page
to a more appropriate place?

I'd prefer it if the driver could be structured in the normal way,
with a clear locked region where the page list is handled..

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 00/15] Consolidate the mmu notifier interval_tree and locking
  2019-10-28 20:10 [PATCH v2 00/15] Consolidate the mmu notifier interval_tree and locking Jason Gunthorpe
                   ` (14 preceding siblings ...)
  2019-10-28 20:10 ` [PATCH v2 15/15] mm/hmm: remove hmm_mirror and related Jason Gunthorpe
@ 2019-11-01 19:54 ` Jason Gunthorpe
  2019-11-01 20:54 ` Ralph Campbell
  16 siblings, 0 replies; 71+ messages in thread
From: Jason Gunthorpe @ 2019-11-01 19:54 UTC (permalink / raw)
  To: linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard, Felix.Kuehling
  Cc: linux-rdma, dri-devel, amd-gfx, Alex Deucher, Ben Skeggs,
	Boris Ostrovsky, Christian König, David Zhou,
	Dennis Dalessandro, Juergen Gross, Mike Marciniszyn,
	Oleksandr Andrushchenko, Petr Cvek, Stefano Stabellini, nouveau,
	xen-devel, Christoph Hellwig

On Mon, Oct 28, 2019 at 05:10:17PM -0300, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> 8 of the mmu_notifier using drivers (i915_gem, radeon_mn, umem_odp, hfi1,
> scif_dma, vhost, gntdev, hmm) drivers are using a common pattern where
> they only use invalidate_range_start/end and immediately check the
> invalidating range against some driver data structure to tell if the
> driver is interested. Half of them use an interval_tree, the others are
> simple linear search lists.

Now that we have the most of the driver changes tested and reviewed
I'm going to move this series into linux-next via the hmm tree, minus
the xen gntdev patches as they are not working yet.

I will keep collecting acks and any additional changes.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 00/15] Consolidate the mmu notifier interval_tree and locking
  2019-10-28 20:10 [PATCH v2 00/15] Consolidate the mmu notifier interval_tree and locking Jason Gunthorpe
                   ` (15 preceding siblings ...)
  2019-11-01 19:54 ` [PATCH v2 00/15] Consolidate the mmu notifier interval_tree and locking Jason Gunthorpe
@ 2019-11-01 20:54 ` Ralph Campbell
  2019-11-04 20:40   ` Jason Gunthorpe
  16 siblings, 1 reply; 71+ messages in thread
From: Ralph Campbell @ 2019-11-01 20:54 UTC (permalink / raw)
  To: Jason Gunthorpe, linux-mm, Jerome Glisse, John Hubbard, Felix.Kuehling
  Cc: linux-rdma, dri-devel, amd-gfx, Alex Deucher, Ben Skeggs,
	Boris Ostrovsky, Christian König, David Zhou,
	Dennis Dalessandro, Juergen Gross, Mike Marciniszyn,
	Oleksandr Andrushchenko, Petr Cvek, Stefano Stabellini, nouveau,
	xen-devel, Christoph Hellwig, Jason Gunthorpe


On 10/28/19 1:10 PM, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> 8 of the mmu_notifier using drivers (i915_gem, radeon_mn, umem_odp, hfi1,
> scif_dma, vhost, gntdev, hmm) drivers are using a common pattern where
> they only use invalidate_range_start/end and immediately check the
> invalidating range against some driver data structure to tell if the
> driver is interested. Half of them use an interval_tree, the others are
> simple linear search lists.
> 
> Of the ones I checked they largely seem to have various kinds of races,
> bugs and poor implementation. This is a result of the complexity in how
> the notifier interacts with get_user_pages(). It is extremely difficult to
> use it correctly.
> 
> Consolidate all of this code together into the core mmu_notifier and
> provide a locking scheme similar to hmm_mirror that allows the user to
> safely use get_user_pages() and reliably know if the page list still
> matches the mm.
> 
> This new arrangment plays nicely with the !blockable mode for
> OOM. Scanning the interval tree is done such that the intersection test
> will always succeed, and since there is no invalidate_range_end exposed to
> drivers the scheme safely allows multiple drivers to be subscribed.
> 
> Four places are converted as an example of how the new API is used.
> Four are left for future patches:
>   - i915_gem has complex locking around destruction of a registration,
>     needs more study
>   - hfi1 (2nd user) needs access to the rbtree
>   - scif_dma has a complicated logic flow
>   - vhost's mmu notifiers are already being rewritten
> 
> This series, and the other code it depends on is available on my github:
> 
> https://github.com/jgunthorpe/linux/commits/mmu_notifier
> 
> v2 changes:
> - Add mmu_range_set_seq() to set the mrn sequence number under the driver
>    lock and make the locking more understandable
> - Add some additional comments around locking/READ_ONCe
> - Make the WARN_ON flow in mn_itree_invalidate a bit easier to follow
> - Fix wrong WARN_ON
> 
> Jason Gunthorpe (15):
>    mm/mmu_notifier: define the header pre-processor parts even if
>      disabled
>    mm/mmu_notifier: add an interval tree notifier
>    mm/hmm: allow hmm_range to be used with a mmu_range_notifier or
>      hmm_mirror
>    mm/hmm: define the pre-processor related parts of hmm.h even if
>      disabled
>    RDMA/odp: Use mmu_range_notifier_insert()
>    RDMA/hfi1: Use mmu_range_notifier_inset for user_exp_rcv
>    drm/radeon: use mmu_range_notifier_insert
>    xen/gntdev: Use select for DMA_SHARED_BUFFER
>    xen/gntdev: use mmu_range_notifier_insert
>    nouveau: use mmu_notifier directly for invalidate_range_start
>    nouveau: use mmu_range_notifier instead of hmm_mirror
>    drm/amdgpu: Call find_vma under mmap_sem
>    drm/amdgpu: Use mmu_range_insert instead of hmm_mirror
>    drm/amdgpu: Use mmu_range_notifier instead of hmm_mirror
>    mm/hmm: remove hmm_mirror and related
> 
>   Documentation/vm/hmm.rst                      | 105 +---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h           |   2 +
>   .../gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c  |   9 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        |  14 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    |   1 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c        | 457 +++------------
>   drivers/gpu/drm/amd/amdgpu/amdgpu_mn.h        |  53 --
>   drivers/gpu/drm/amd/amdgpu/amdgpu_object.h    |  13 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c       | 111 ++--
>   drivers/gpu/drm/nouveau/nouveau_svm.c         | 231 +++++---
>   drivers/gpu/drm/radeon/radeon.h               |   9 +-
>   drivers/gpu/drm/radeon/radeon_mn.c            | 219 ++-----
>   drivers/infiniband/core/device.c              |   1 -
>   drivers/infiniband/core/umem_odp.c            | 288 +--------
>   drivers/infiniband/hw/hfi1/file_ops.c         |   2 +-
>   drivers/infiniband/hw/hfi1/hfi.h              |   2 +-
>   drivers/infiniband/hw/hfi1/user_exp_rcv.c     | 146 ++---
>   drivers/infiniband/hw/hfi1/user_exp_rcv.h     |   3 +-
>   drivers/infiniband/hw/mlx5/mlx5_ib.h          |   7 +-
>   drivers/infiniband/hw/mlx5/mr.c               |   3 +-
>   drivers/infiniband/hw/mlx5/odp.c              |  50 +-
>   drivers/xen/Kconfig                           |   3 +-
>   drivers/xen/gntdev-common.h                   |   8 +-
>   drivers/xen/gntdev.c                          | 180 ++----
>   include/linux/hmm.h                           | 195 +------
>   include/linux/mmu_notifier.h                  | 144 ++++-
>   include/rdma/ib_umem_odp.h                    |  65 +--
>   include/rdma/ib_verbs.h                       |   2 -
>   kernel/fork.c                                 |   1 -
>   mm/Kconfig                                    |   2 +-
>   mm/hmm.c                                      | 275 +--------
>   mm/mmu_notifier.c                             | 546 +++++++++++++++++-
>   32 files changed, 1225 insertions(+), 1922 deletions(-)
> 

You can add my Tested-by for the mm and nouveau changes.
IOW, patches 1-4, 10-11, and 15.

Tested-by: Ralph Campbell <rcampbell@nvidia.com>

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 00/15] Consolidate the mmu notifier interval_tree and locking
  2019-11-01 20:54 ` Ralph Campbell
@ 2019-11-04 20:40   ` Jason Gunthorpe
  0 siblings, 0 replies; 71+ messages in thread
From: Jason Gunthorpe @ 2019-11-04 20:40 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: linux-mm, Jerome Glisse, John Hubbard, Felix.Kuehling,
	linux-rdma, dri-devel, amd-gfx, Alex Deucher, Ben Skeggs,
	Boris Ostrovsky, Christian König, David Zhou,
	Dennis Dalessandro, Juergen Gross, Mike Marciniszyn,
	Oleksandr Andrushchenko, Petr Cvek, Stefano Stabellini, nouveau,
	xen-devel, Christoph Hellwig

On Fri, Nov 01, 2019 at 01:54:45PM -0700, Ralph Campbell wrote:
> You can add my Tested-by for the mm and nouveau changes.
> IOW, patches 1-4, 10-11, and 15.
> 
> Tested-by: Ralph Campbell <rcampbell@nvidia.com>

Got it, thanks

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 09/15] xen/gntdev: use mmu_range_notifier_insert
  2019-10-28 20:10 ` [PATCH v2 09/15] xen/gntdev: use mmu_range_notifier_insert Jason Gunthorpe
  2019-10-30 16:55   ` Boris Ostrovsky
@ 2019-11-04 22:03   ` Boris Ostrovsky
  2019-11-05  2:31     ` Jason Gunthorpe
  1 sibling, 1 reply; 71+ messages in thread
From: Boris Ostrovsky @ 2019-11-04 22:03 UTC (permalink / raw)
  To: Jason Gunthorpe, linux-mm, Jerome Glisse, Ralph Campbell,
	John Hubbard, Felix.Kuehling
  Cc: linux-rdma, dri-devel, amd-gfx, Alex Deucher, Ben Skeggs,
	Christian König, David Zhou, Dennis Dalessandro,
	Juergen Gross, Mike Marciniszyn, Oleksandr Andrushchenko,
	Petr Cvek, Stefano Stabellini, nouveau, xen-devel,
	Christoph Hellwig, Jason Gunthorpe

On 10/28/19 4:10 PM, Jason Gunthorpe wrote:
> @@ -445,17 +438,9 @@ static void gntdev_vma_close(struct vm_area_struct *vma)
>  	struct gntdev_priv *priv = file->private_data;
>  
>  	pr_debug("gntdev_vma_close %p\n", vma);
> -	if (use_ptemod) {
> -		/* It is possible that an mmu notifier could be running
> -		 * concurrently, so take priv->lock to ensure that the vma won't
> -		 * vanishing during the unmap_grant_pages call, since we will
> -		 * spin here until that completes. Such a concurrent call will
> -		 * not do any unmapping, since that has been done prior to
> -		 * closing the vma, but it may still iterate the unmap_ops list.
> -		 */
> -		mutex_lock(&priv->lock);
> +	if (use_ptemod && map->vma == vma) {


Is it possible for map->vma not to be equal to vma?

-boris


> +		mmu_range_notifier_remove(&map->notifier);
>  		map->vma = NULL;
> -		mutex_unlock(&priv->lock);
>  	}
>  	vma->vm_private_data = NULL;
>  	gntdev_put_map(priv, map);
>


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 09/15] xen/gntdev: use mmu_range_notifier_insert
  2019-11-04 22:03   ` Boris Ostrovsky
@ 2019-11-05  2:31     ` Jason Gunthorpe
  2019-11-05 15:16       ` Boris Ostrovsky
  0 siblings, 1 reply; 71+ messages in thread
From: Jason Gunthorpe @ 2019-11-05  2:31 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard,
	Felix.Kuehling, linux-rdma, dri-devel, amd-gfx, Alex Deucher,
	Ben Skeggs, Christian König, David Zhou, Dennis Dalessandro,
	Juergen Gross, Mike Marciniszyn, Oleksandr Andrushchenko,
	Petr Cvek, Stefano Stabellini, nouveau, xen-devel,
	Christoph Hellwig

On Mon, Nov 04, 2019 at 05:03:31PM -0500, Boris Ostrovsky wrote:
> On 10/28/19 4:10 PM, Jason Gunthorpe wrote:
> > @@ -445,17 +438,9 @@ static void gntdev_vma_close(struct vm_area_struct *vma)
> >  	struct gntdev_priv *priv = file->private_data;
> >  
> >  	pr_debug("gntdev_vma_close %p\n", vma);
> > -	if (use_ptemod) {
> > -		/* It is possible that an mmu notifier could be running
> > -		 * concurrently, so take priv->lock to ensure that the vma won't
> > -		 * vanishing during the unmap_grant_pages call, since we will
> > -		 * spin here until that completes. Such a concurrent call will
> > -		 * not do any unmapping, since that has been done prior to
> > -		 * closing the vma, but it may still iterate the unmap_ops list.
> > -		 */
> > -		mutex_lock(&priv->lock);
> > +	if (use_ptemod && map->vma == vma) {
> 
> 
> Is it possible for map->vma not to be equal to vma?

It could be NULL at least if use_ptemod is not set.

Otherwise, I'm not sure, the confusing bit is that the map comes from
here:

        map = gntdev_find_map_index(priv, index, count);

It looks like the intent is that the map->vma is always set to the
only vma that has the map as private_data.

So, I suppose it can be relaxed to a null test and a WARN_ON that it
hasn't changed?

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 08/15] xen/gntdev: Use select for DMA_SHARED_BUFFER
  2019-11-01 18:26   ` Jason Gunthorpe
@ 2019-11-05 14:44     ` Jürgen Groß
  0 siblings, 0 replies; 71+ messages in thread
From: Jürgen Groß @ 2019-11-05 14:44 UTC (permalink / raw)
  To: Jason Gunthorpe, linux-mm, Jerome Glisse, Ralph Campbell,
	John Hubbard, Felix.Kuehling
  Cc: linux-rdma, dri-devel, amd-gfx, Alex Deucher, Ben Skeggs,
	Boris Ostrovsky, Christian König, David Zhou,
	Dennis Dalessandro, Mike Marciniszyn, Oleksandr Andrushchenko,
	Petr Cvek, Stefano Stabellini, nouveau, xen-devel,
	Christoph Hellwig

On 01.11.19 19:26, Jason Gunthorpe wrote:
> On Mon, Oct 28, 2019 at 05:10:25PM -0300, Jason Gunthorpe wrote:
>> From: Jason Gunthorpe <jgg@mellanox.com>
>>
>> DMA_SHARED_BUFFER can not be enabled by the user (it represents a library
>> set in the kernel). The kconfig convention is to use select for such
>> symbols so they are turned on implicitly when the user enables a kconfig
>> that needs them.
>>
>> Otherwise the XEN_GNTDEV_DMABUF kconfig is overly difficult to enable.
>>
>> Fixes: 932d6562179e ("xen/gntdev: Add initial support for dma-buf UAPI")
>> Cc: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
>> Cc: xen-devel@lists.xenproject.org
>> Cc: Juergen Gross <jgross@suse.com>
>> Cc: Stefano Stabellini <sstabellini@kernel.org>
>> Reviewed-by: Juergen Gross <jgross@suse.com>
>> Reviewed-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
>> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
>> ---
>>   drivers/xen/Kconfig | 3 ++-
>>   1 file changed, 2 insertions(+), 1 deletion(-)
> 
> Juergen/Oleksandr/Xen Maintainers:
> 
> Would you take this patch through a xen related tree? The only reason
> I had in this series is to make it easier to compile-test the gntdev
> changes.

Yes, I can take it for 5.5.


Juergen

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 09/15] xen/gntdev: use mmu_range_notifier_insert
  2019-11-05  2:31     ` Jason Gunthorpe
@ 2019-11-05 15:16       ` Boris Ostrovsky
  2019-11-07 20:36         ` Jason Gunthorpe
  0 siblings, 1 reply; 71+ messages in thread
From: Boris Ostrovsky @ 2019-11-05 15:16 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard,
	Felix.Kuehling, linux-rdma, dri-devel, amd-gfx, Alex Deucher,
	Ben Skeggs, Christian König, David Zhou, Dennis Dalessandro,
	Juergen Gross, Mike Marciniszyn, Oleksandr Andrushchenko,
	Petr Cvek, Stefano Stabellini, nouveau, xen-devel,
	Christoph Hellwig

On 11/4/19 9:31 PM, Jason Gunthorpe wrote:
> On Mon, Nov 04, 2019 at 05:03:31PM -0500, Boris Ostrovsky wrote:
>> On 10/28/19 4:10 PM, Jason Gunthorpe wrote:
>>> @@ -445,17 +438,9 @@ static void gntdev_vma_close(struct vm_area_struct *vma)
>>>  	struct gntdev_priv *priv = file->private_data;
>>>  
>>>  	pr_debug("gntdev_vma_close %p\n", vma);
>>> -	if (use_ptemod) {
>>> -		/* It is possible that an mmu notifier could be running
>>> -		 * concurrently, so take priv->lock to ensure that the vma won't
>>> -		 * vanishing during the unmap_grant_pages call, since we will
>>> -		 * spin here until that completes. Such a concurrent call will
>>> -		 * not do any unmapping, since that has been done prior to
>>> -		 * closing the vma, but it may still iterate the unmap_ops list.
>>> -		 */
>>> -		mutex_lock(&priv->lock);
>>> +	if (use_ptemod && map->vma == vma) {
>>
>> Is it possible for map->vma not to be equal to vma?
> It could be NULL at least if use_ptemod is not set.
>
> Otherwise, I'm not sure, the confusing bit is that the map comes from
> here:
>
>         map = gntdev_find_map_index(priv, index, count);
>
> It looks like the intent is that the map->vma is always set to the
> only vma that has the map as private_data.

I am not sure how this can work otherwise. We stash map pointer in vm's
vm_private_data and vice versa (for use_ptemod) gntdev_mmap() so if they
have to match.

That's why I was asking you to see if you had something particular in
mind when you added this test.

> So, I suppose it can be relaxed to a null test and a WARN_ON that it
> hasn't changed?

You mean

if (use_ptemod) {
        WARN_ON(map->vma != vma);
        ...


Yes, that sounds good.


-boris

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 01/15] mm/mmu_notifier: define the header pre-processor parts even if disabled
  2019-10-28 20:10 ` [PATCH v2 01/15] mm/mmu_notifier: define the header pre-processor parts even if disabled Jason Gunthorpe
@ 2019-11-05 21:23   ` John Hubbard
  2019-11-06 13:36     ` Jason Gunthorpe
  0 siblings, 1 reply; 71+ messages in thread
From: John Hubbard @ 2019-11-05 21:23 UTC (permalink / raw)
  To: Jason Gunthorpe, linux-mm, Jerome Glisse, Ralph Campbell, Felix.Kuehling
  Cc: linux-rdma, dri-devel, amd-gfx, Alex Deucher, Ben Skeggs,
	Boris Ostrovsky, Christian König, David Zhou,
	Dennis Dalessandro, Juergen Gross, Mike Marciniszyn,
	Oleksandr Andrushchenko, Petr Cvek, Stefano Stabellini, nouveau,
	xen-devel, Christoph Hellwig, Jason Gunthorpe

On 10/28/19 1:10 PM, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> Now that we have KERNEL_HEADER_TEST all headers are generally compile
> tested, so relying on makefile tricks to avoid compiling code that depends
> on CONFIG_MMU_NOTIFIER is more annoying.
> 
> Instead follow the usual pattern and provide most of the header with only
> the functions stubbed out when CONFIG_MMU_NOTIFIER is disabled. This
> ensures code compiles no matter what the config setting is.
> 
> While here, struct mmu_notifier_mm is private to mmu_notifier.c, move it.

and correct a minor spelling error in a comment. Good. :)

> 
> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> ---
>  include/linux/mmu_notifier.h | 46 +++++++++++++-----------------------
>  mm/mmu_notifier.c            | 13 ++++++++++
>  2 files changed, 30 insertions(+), 29 deletions(-)
> 

Because this is correct as-is, you can add:

Reviewed-by: John Hubbard <jhubbard@nvidia.com>


...whether or not you take the following recommendation, which is:
you've only done part of the job of making struct mmu_notifier_mm 
private to mmu_notifier.c. There's more:

* struct mmu_notifier_mm is referred to in two places now: mm_types.h
  and (still) mmu_notifier.h. Therefore:

    a) Move the last two traces of it out of mmu_notifier.h, and

    b) Put a forward declaration in mm_types.h, which is where it
       belongs because that's where it's referred to.

So if you apply this incremental patch on top, I think it's where
you want to be:

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 2222fa795284..df93a3cc0da9 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -25,6 +25,7 @@
 
 struct address_space;
 struct mem_cgroup;
+struct mmu_notifier_mm;
 
 /*
  * Each physical page in the system has a struct page associated with
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 51b92ba013dd..84efd2c51f5c 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -8,7 +8,6 @@
 #include <linux/srcu.h>
 #include <linux/interval_tree.h>
 
-struct mmu_notifier_mm;
 struct mmu_notifier;
 struct mmu_notifier_range;
 struct mmu_range_notifier;
@@ -263,10 +262,7 @@ struct mmu_notifier_range {
        enum mmu_notifier_event event;
 };
 
-static inline int mm_has_notifiers(struct mm_struct *mm)
-{
-       return unlikely(mm->mmu_notifier_mm);
-}
+int mm_has_notifiers(struct mm_struct *mm);
 
 struct mmu_notifier *mmu_notifier_get_locked(const struct mmu_notifier_ops *ops,
                                             struct mm_struct *mm);
@@ -477,10 +473,7 @@ static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
                __mmu_notifier_invalidate_range(mm, start, end);
 }
 
-static inline void mmu_notifier_mm_init(struct mm_struct *mm)
-{
-       mm->mmu_notifier_mm = NULL;
-}
+void mmu_notifier_mm_init(struct mm_struct *mm);
 
 static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
 {
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 2b7485919ecf..107f9406a92d 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -47,6 +47,16 @@ struct mmu_notifier_mm {
        struct hlist_head deferred_list;
 };
 
+int mm_has_notifiers(struct mm_struct *mm)
+{
+       return unlikely(mm->mmu_notifier_mm);
+}
+
+void mmu_notifier_mm_init(struct mm_struct *mm)
+{
+       mm->mmu_notifier_mm = NULL;
+}
+
 /*
  * This is a collision-retry read-side/write-side 'lock', a lot like a
  * seqcount, however this allows multiple write-sides to hold it at


thanks,

John Hubbard
NVIDIA

> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 1bd8e6a09a3c27..12bd603d318ce7 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -7,8 +7,9 @@
>  #include <linux/mm_types.h>
>  #include <linux/srcu.h>
>  
> +struct mmu_notifier_mm;
>  struct mmu_notifier;
> -struct mmu_notifier_ops;
> +struct mmu_notifier_range;
>  
>  /**
>   * enum mmu_notifier_event - reason for the mmu notifier callback
> @@ -40,36 +41,8 @@ enum mmu_notifier_event {
>  	MMU_NOTIFY_SOFT_DIRTY,
>  };
>  
> -#ifdef CONFIG_MMU_NOTIFIER
> -
> -#ifdef CONFIG_LOCKDEP
> -extern struct lockdep_map __mmu_notifier_invalidate_range_start_map;
> -#endif
> -
> -/*
> - * The mmu notifier_mm structure is allocated and installed in
> - * mm->mmu_notifier_mm inside the mm_take_all_locks() protected
> - * critical section and it's released only when mm_count reaches zero
> - * in mmdrop().
> - */
> -struct mmu_notifier_mm {
> -	/* all mmu notifiers registerd in this mm are queued in this list */
> -	struct hlist_head list;
> -	/* to serialize the list modifications and hlist_unhashed */
> -	spinlock_t lock;
> -};
> -
>  #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
>  
> -struct mmu_notifier_range {
> -	struct vm_area_struct *vma;
> -	struct mm_struct *mm;
> -	unsigned long start;
> -	unsigned long end;
> -	unsigned flags;
> -	enum mmu_notifier_event event;
> -};
> -
>  struct mmu_notifier_ops {
>  	/*
>  	 * Called either by mmu_notifier_unregister or when the mm is
> @@ -249,6 +222,21 @@ struct mmu_notifier {
>  	unsigned int users;
>  };
>  
> +#ifdef CONFIG_MMU_NOTIFIER
> +
> +#ifdef CONFIG_LOCKDEP
> +extern struct lockdep_map __mmu_notifier_invalidate_range_start_map;
> +#endif
> +
> +struct mmu_notifier_range {
> +	struct vm_area_struct *vma;
> +	struct mm_struct *mm;
> +	unsigned long start;
> +	unsigned long end;
> +	unsigned flags;
> +	enum mmu_notifier_event event;
> +};
> +
>  static inline int mm_has_notifiers(struct mm_struct *mm)
>  {
>  	return unlikely(mm->mmu_notifier_mm);
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index 7fde88695f35d6..367670cfd02b7b 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -27,6 +27,19 @@ struct lockdep_map __mmu_notifier_invalidate_range_start_map = {
>  };
>  #endif
>  
> +/*
> + * The mmu notifier_mm structure is allocated and installed in
> + * mm->mmu_notifier_mm inside the mm_take_all_locks() protected
> + * critical section and it's released only when mm_count reaches zero
> + * in mmdrop().
> + */
> +struct mmu_notifier_mm {
> +	/* all mmu notifiers registered in this mm are queued in this list */
> +	struct hlist_head list;
> +	/* to serialize the list modifications and hlist_unhashed */
> +	spinlock_t lock;
> +};
> +
>  /*
>   * This function can't run concurrently against mmu_notifier_register
>   * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 01/15] mm/mmu_notifier: define the header pre-processor parts even if disabled
  2019-11-05 21:23   ` John Hubbard
@ 2019-11-06 13:36     ` Jason Gunthorpe
  0 siblings, 0 replies; 71+ messages in thread
From: Jason Gunthorpe @ 2019-11-06 13:36 UTC (permalink / raw)
  To: John Hubbard
  Cc: linux-mm, Jerome Glisse, Ralph Campbell, Felix.Kuehling,
	linux-rdma, dri-devel, amd-gfx, Alex Deucher, Ben Skeggs,
	Boris Ostrovsky, Christian König, David Zhou,
	Dennis Dalessandro, Juergen Gross, Mike Marciniszyn,
	Oleksandr Andrushchenko, Petr Cvek, Stefano Stabellini, nouveau,
	xen-devel, Christoph Hellwig

On Tue, Nov 05, 2019 at 01:23:46PM -0800, John Hubbard wrote:
> On 10/28/19 1:10 PM, Jason Gunthorpe wrote:
> > From: Jason Gunthorpe <jgg@mellanox.com>
> > 
> > Now that we have KERNEL_HEADER_TEST all headers are generally compile
> > tested, so relying on makefile tricks to avoid compiling code that depends
> > on CONFIG_MMU_NOTIFIER is more annoying.
> > 
> > Instead follow the usual pattern and provide most of the header with only
> > the functions stubbed out when CONFIG_MMU_NOTIFIER is disabled. This
> > ensures code compiles no matter what the config setting is.
> > 
> > While here, struct mmu_notifier_mm is private to mmu_notifier.c, move it.
> 
> and correct a minor spelling error in a comment. Good. :)
> 
> > 
> > Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> > Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> >  include/linux/mmu_notifier.h | 46 +++++++++++++-----------------------
> >  mm/mmu_notifier.c            | 13 ++++++++++
> >  2 files changed, 30 insertions(+), 29 deletions(-)
> > 
> 
> Because this is correct as-is, you can add:
> 
> Reviewed-by: John Hubbard <jhubbard@nvidia.com>
> 

Thanks

> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index 2b7485919ecf..107f9406a92d 100644
> +++ b/mm/mmu_notifier.c
> @@ -47,6 +47,16 @@ struct mmu_notifier_mm {
>         struct hlist_head deferred_list;
>  };
>  
> +int mm_has_notifiers(struct mm_struct *mm)
> +{
> +       return unlikely(mm->mmu_notifier_mm);
> +}

This inline is performance sensitive, it needs to stay inlined..

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 02/15] mm/mmu_notifier: add an interval tree notifier
  2019-10-28 20:10 ` [PATCH v2 02/15] mm/mmu_notifier: add an interval tree notifier Jason Gunthorpe
  2019-10-29 22:04   ` Kuehling, Felix
@ 2019-11-07  0:23   ` John Hubbard
  2019-11-07  2:08     ` Jerome Glisse
  2019-11-07 20:06     ` Jason Gunthorpe
  1 sibling, 2 replies; 71+ messages in thread
From: John Hubbard @ 2019-11-07  0:23 UTC (permalink / raw)
  To: Jason Gunthorpe, linux-mm, Jerome Glisse, Ralph Campbell, Felix.Kuehling
  Cc: linux-rdma, dri-devel, amd-gfx, Alex Deucher, Ben Skeggs,
	Boris Ostrovsky, Christian König, David Zhou,
	Dennis Dalessandro, Juergen Gross, Mike Marciniszyn,
	Oleksandr Andrushchenko, Petr Cvek, Stefano Stabellini, nouveau,
	xen-devel, Christoph Hellwig, Jason Gunthorpe, Andrea Arcangeli,
	Michal Hocko

On 10/28/19 1:10 PM, Jason Gunthorpe wrote:
...
>  include/linux/mmu_notifier.h |  98 +++++++
>  mm/Kconfig                   |   1 +
>  mm/mmu_notifier.c            | 533 +++++++++++++++++++++++++++++++++--
>  3 files changed, 607 insertions(+), 25 deletions(-)
> 
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 12bd603d318ce7..51b92ba013ddce 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -6,10 +6,12 @@
>  #include <linux/spinlock.h>
>  #include <linux/mm_types.h>
>  #include <linux/srcu.h>
> +#include <linux/interval_tree.h>
>  
>  struct mmu_notifier_mm;
>  struct mmu_notifier;
>  struct mmu_notifier_range;
> +struct mmu_range_notifier;

Hi Jason,

Nice design, I love the seq foundation! So far, I'm not able to spot anything
actually wrong with the implementation, sorry about that. 

Generally my reaction is: given that the design is complex, try to mitigate 
that with documentation and naming. So the comments are in these areas:

1. There is a rather severe naming overlap (not technically a naming conflict,
but still) with existing mmn work, which already has, for example:

    struct mmu_notifier_range

...and you're adding:

    struct mmu_range_notifier

...so I'll try to help sort that out.

2. I'm also seeing a couple of things that are really hard for the reader
verify are correct (abuse and battery of the low bit in .invalidate_seq, 
for example, haha), so I have some recommendations there.

3. Documentation improvements, which easy to apply, with perhaps one exception.
(Here, because this a complicated area, documentation does make a difference,
so it's worth a little extra fuss.)

4. Other nits that don't matter too much, but just help polish things up
as usual.

>  
>  /**
>   * enum mmu_notifier_event - reason for the mmu notifier callback
> @@ -32,6 +34,9 @@ struct mmu_notifier_range;
>   * access flags). User should soft dirty the page in the end callback to make
>   * sure that anyone relying on soft dirtyness catch pages that might be written
>   * through non CPU mappings.
> + *
> + * @MMU_NOTIFY_RELEASE: used during mmu_range_notifier invalidate to signal that
> + * the mm refcount is zero and the range is no longer accessible.
>   */
>  enum mmu_notifier_event {
>  	MMU_NOTIFY_UNMAP = 0,
> @@ -39,6 +44,7 @@ enum mmu_notifier_event {
>  	MMU_NOTIFY_PROTECTION_VMA,
>  	MMU_NOTIFY_PROTECTION_PAGE,
>  	MMU_NOTIFY_SOFT_DIRTY,
> +	MMU_NOTIFY_RELEASE,
>  };


OK, let the naming debates begin! ha. Anyway, after careful study of the overall
patch, and some browsing of the larger patchset, it's clear that:

* The new "MMU range notifier" that you've created is, approximately, a new
object. It uses classic mmu notifiers inside, as an implementation detail, and
it does *similar* things (notifications) as mmn's. But it's certainly not the same
as mmn's, as shown later when you say the need to an entirely new ops struct, and 
data struct too.

Therefore, you need a separate events enum as well. This is important. MMN's
won't be issuing MMN_NOTIFY_RELEASE events, nor will MNR's be issuing the first
four prexisting MMU_NOTIFY_* items. So it would be a design mistake to glom them
together, unless you ultimately decided to merge these MMN and MNR objects (which
I don't really see any intention of, and that's fine).

So this should read:

enum mmu_range_notifier_event {
	MMU_NOTIFY_RELEASE,
};

...assuming that we stay with "mmu_range_notifier" as a core name for this 
whole thing.

Also, it is best moved down to be next to the new MNR structs, so that all the
MNR stuff is in one group.

Extra credit: IMHO, this clearly deserves to all be in a new mmu_range_notifier.h
header file, but I know that's extra work. Maybe later as a follow-up patch,
if anyone has the time.

>  
>  #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
> @@ -222,6 +228,26 @@ struct mmu_notifier {
>  	unsigned int users;
>  };
>  

That should also be moved down, next to the new structs.



A little bit above these next items, just above "struct mmu_notifier" (not shown here, 
it's outside the diff area), there is some documentation about classic MMNs. It would 
be nice if it were clearer that that documentation is not relevant to MNRs. Actually, 
this is another reason that a separate header file would be nice.

> +/**
> + * struct mmu_range_notifier_ops
> + * @invalidate: Upon return the caller must stop using any SPTEs within this
> + *              range, this function can sleep. Return false if blocking was
> + *              required but range is non-blocking
> + */

How about this (I'm not sure I fully understand the return value, though):

/**
 * struct mmu_range_notifier_ops
 * @invalidate: Upon return the caller must stop using any SPTEs within this
 * 		range.
 *
 * 		This function is permitted to sleep.
 *
 *      	@Return: false if blocking was required, but @range is
 *			non-blocking.
 *
 */


> +struct mmu_range_notifier_ops {
> +	bool (*invalidate)(struct mmu_range_notifier *mrn,
> +			   const struct mmu_notifier_range *range,
> +			   unsigned long cur_seq);
> +};
> +
> +struct mmu_range_notifier {
> +	struct interval_tree_node interval_tree;
> +	const struct mmu_range_notifier_ops *ops;
> +	struct hlist_node deferred_item;
> +	unsigned long invalidate_seq;
> +	struct mm_struct *mm;
> +};
> +

Again, now we have the new struct mmu_range_notifier, and the old 
struct mmu_notifier_range, and it's not good.

Ideas:

a) Live with it.

b) (Discarded, too many callers): rename old one. Nope.

c) Rename new one. Ideas:

    struct mmu_interval_notifier
    struct mmu_range_intersection
    ...other ideas?


>  #ifdef CONFIG_MMU_NOTIFIER
>  
>  #ifdef CONFIG_LOCKDEP
> @@ -263,6 +289,78 @@ extern int __mmu_notifier_register(struct mmu_notifier *mn,
>  				   struct mm_struct *mm);
>  extern void mmu_notifier_unregister(struct mmu_notifier *mn,
>  				    struct mm_struct *mm);
> +
> +unsigned long mmu_range_read_begin(struct mmu_range_notifier *mrn);
> +int mmu_range_notifier_insert(struct mmu_range_notifier *mrn,
> +			      unsigned long start, unsigned long length,
> +			      struct mm_struct *mm);
> +int mmu_range_notifier_insert_locked(struct mmu_range_notifier *mrn,
> +				     unsigned long start, unsigned long length,
> +				     struct mm_struct *mm);
> +void mmu_range_notifier_remove(struct mmu_range_notifier *mrn);
> +
> +/**
> + * mmu_range_set_seq - Save the invalidation sequence

How about:

 * mmu_range_set_seq - Set the .invalidate_seq to a new value.


> + * @mrn - The mrn passed to invalidate
> + * @cur_seq - The cur_seq passed to invalidate
> + *
> + * This must be called unconditionally from the invalidate callback of a
> + * struct mmu_range_notifier_ops under the same lock that is used to call
> + * mmu_range_read_retry(). It updates the sequence number for later use by
> + * mmu_range_read_retry().
> + *
> + * If the user does not call mmu_range_read_begin() or mmu_range_read_retry()

nit: "caller" is better than "user", when referring to...well, callers. "user" 
most often refers to user space, whereas a call stack and function calling is 
clearly what you're referring to here (and in other places, especially "user lock").

> + * then this call is not required.
> + */
> +static inline void mmu_range_set_seq(struct mmu_range_notifier *mrn,
> +				     unsigned long cur_seq)
> +{
> +	WRITE_ONCE(mrn->invalidate_seq, cur_seq);
> +}
> +
> +/**
> + * mmu_range_read_retry - End a read side critical section against a VA range
> + * mrn: The range under lock
> + * seq: The return of the paired mmu_range_read_begin()
> + *
> + * This MUST be called under a user provided lock that is also held
> + * unconditionally by op->invalidate() when it calls mmu_range_set_seq().
> + *
> + * Each call should be paired with a single mmu_range_read_begin() and
> + * should be used to conclude the read side.
> + *
> + * Returns true if an invalidation collided with this critical section, and
> + * the caller should retry.
> + */
> +static inline bool mmu_range_read_retry(struct mmu_range_notifier *mrn,
> +					unsigned long seq)
> +{
> +	return mrn->invalidate_seq != seq;
> +}
> +
> +/**
> + * mmu_range_check_retry - Test if a collision has occurred
> + * mrn: The range under lock
> + * seq: The return of the matching mmu_range_read_begin()
> + *
> + * This can be used in the critical section between mmu_range_read_begin() and
> + * mmu_range_read_retry().  A return of true indicates an invalidation has
> + * collided with this lock and a future mmu_range_read_retry() will return
> + * true.
> + *
> + * False is not reliable and only suggests a collision has not happened. It

let's say "suggests that a collision *may* not have occurred."  

...
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index 367670cfd02b7b..d02d3c8c223eb7 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -12,6 +12,7 @@
>  #include <linux/export.h>
>  #include <linux/mm.h>
>  #include <linux/err.h>
> +#include <linux/interval_tree.h>
>  #include <linux/srcu.h>
>  #include <linux/rcupdate.h>
>  #include <linux/sched.h>
> @@ -36,10 +37,243 @@ struct lockdep_map __mmu_notifier_invalidate_range_start_map = {
>  struct mmu_notifier_mm {
>  	/* all mmu notifiers registered in this mm are queued in this list */
>  	struct hlist_head list;
> +	bool has_interval;
>  	/* to serialize the list modifications and hlist_unhashed */
>  	spinlock_t lock;
> +	unsigned long invalidate_seq;
> +	unsigned long active_invalidate_ranges;
> +	struct rb_root_cached itree;
> +	wait_queue_head_t wq;
> +	struct hlist_head deferred_list;
>  };
>  
> +/*
> + * This is a collision-retry read-side/write-side 'lock', a lot like a
> + * seqcount, however this allows multiple write-sides to hold it at
> + * once. Conceptually the write side is protecting the values of the PTEs in
> + * this mm, such that PTES cannot be read into SPTEs while any writer exists.

Just to be kind, can we say "SPTEs (shadow PTEs)", just this once? :)

> + *
> + * Note that the core mm creates nested invalidate_range_start()/end() regions
> + * within the same thread, and runs invalidate_range_start()/end() in parallel
> + * on multiple CPUs. This is designed to not reduce concurrency or block
> + * progress on the mm side.
> + *
> + * As a secondary function, holding the full write side also serves to prevent
> + * writers for the itree, this is an optimization to avoid extra locking
> + * during invalidate_range_start/end notifiers.
> + *
> + * The write side has two states, fully excluded:
> + *  - mm->active_invalidate_ranges != 0
> + *  - mnn->invalidate_seq & 1 == True
> + *  - some range on the mm_struct is being invalidated
> + *  - the itree is not allowed to change
> + *
> + * And partially excluded:
> + *  - mm->active_invalidate_ranges != 0

I assume this implies mnn->invalidate_seq & 1 == False in this case? If so,
let's say so. I'm probably getting that wrong, too.

> + *  - some range on the mm_struct is being invalidated
> + *  - the itree is allowed to change
> + *
> + * The later state avoids some expensive work on inv_end in the common case of
> + * no mrn monitoring the VA.
> + */
> +static bool mn_itree_is_invalidating(struct mmu_notifier_mm *mmn_mm)
> +{
> +	lockdep_assert_held(&mmn_mm->lock);
> +	return mmn_mm->invalidate_seq & 1;
> +}
> +
> +static struct mmu_range_notifier *
> +mn_itree_inv_start_range(struct mmu_notifier_mm *mmn_mm,
> +			 const struct mmu_notifier_range *range,
> +			 unsigned long *seq)
> +{
> +	struct interval_tree_node *node;
> +	struct mmu_range_notifier *res = NULL;
> +
> +	spin_lock(&mmn_mm->lock);
> +	mmn_mm->active_invalidate_ranges++;
> +	node = interval_tree_iter_first(&mmn_mm->itree, range->start,
> +					range->end - 1);
> +	if (node) {
> +		mmn_mm->invalidate_seq |= 1;


OK, this either needs more documentation and assertions, or a different
approach. Because I see addition, subtraction, AND, OR and booleans
all being applied to this field, and it's darn near hopeless to figure
out whether or not it really is even or odd at the right times.

Different approach: why not just add a mmn_mm->is_invalidating 
member variable? It's not like you're short of space in that struct.


> +		res = container_of(node, struct mmu_range_notifier,
> +				   interval_tree);
> +	}
> +
> +	*seq = mmn_mm->invalidate_seq;
> +	spin_unlock(&mmn_mm->lock);
> +	return res;
> +}
> +
> +static struct mmu_range_notifier *
> +mn_itree_inv_next(struct mmu_range_notifier *mrn,
> +		  const struct mmu_notifier_range *range)
> +{
> +	struct interval_tree_node *node;
> +
> +	node = interval_tree_iter_next(&mrn->interval_tree, range->start,
> +				       range->end - 1);
> +	if (!node)
> +		return NULL;
> +	return container_of(node, struct mmu_range_notifier, interval_tree);
> +}
> +
> +static void mn_itree_inv_end(struct mmu_notifier_mm *mmn_mm)
> +{
> +	struct mmu_range_notifier *mrn;
> +	struct hlist_node *next;
> +	bool need_wake = false;
> +
> +	spin_lock(&mmn_mm->lock);
> +	if (--mmn_mm->active_invalidate_ranges ||
> +	    !mn_itree_is_invalidating(mmn_mm)) {
> +		spin_unlock(&mmn_mm->lock);
> +		return;
> +	}
> +
> +	mmn_mm->invalidate_seq++;

Is this the right place for an assertion that this is now an even value?

> +	need_wake = true;
> +
> +	/*
> +	 * The inv_end incorporates a deferred mechanism like
> +	 * rtnl_lock(). Adds and removes are queued until the final inv_end

Let me point out that rtnl_lock() itself is a one-liner that calls mutex_lock().
But I suppose if one studies that file closely there is more. :)

...

> +unsigned long mmu_range_read_begin(struct mmu_range_notifier *mrn)
> +{
> +	struct mmu_notifier_mm *mmn_mm = mrn->mm->mmu_notifier_mm;
> +	unsigned long seq;
> +	bool is_invalidating;
> +
> +	/*
> +	 * If the mrn has a different seq value under the user_lock than we
> +	 * started with then it has collided.
> +	 *
> +	 * If the mrn currently has the same seq value as the mmn_mm seq, then
> +	 * it is currently between invalidate_start/end and is colliding.
> +	 *
> +	 * The locking looks broadly like this:
> +	 *   mn_tree_invalidate_start():          mmu_range_read_begin():
> +	 *                                         spin_lock
> +	 *                                          seq = READ_ONCE(mrn->invalidate_seq);
> +	 *                                          seq == mmn_mm->invalidate_seq
> +	 *                                         spin_unlock
> +	 *    spin_lock
> +	 *     seq = ++mmn_mm->invalidate_seq
> +	 *    spin_unlock
> +	 *     op->invalidate_range():
> +	 *       user_lock
> +	 *        mmu_range_set_seq()
> +	 *         mrn->invalidate_seq = seq
> +	 *       user_unlock
> +	 *
> +	 *                          [Required: mmu_range_read_retry() == true]
> +	 *
> +	 *   mn_itree_inv_end():
> +	 *    spin_lock
> +	 *     seq = ++mmn_mm->invalidate_seq
> +	 *    spin_unlock
> +	 *
> +	 *                                        user_lock
> +	 *                                         mmu_range_read_retry():
> +	 *                                          mrn->invalidate_seq != seq
> +	 *                                        user_unlock
> +	 *
> +	 * Barriers are not needed here as any races here are closed by an
> +	 * eventual mmu_range_read_retry(), which provides a barrier via the
> +	 * user_lock.
> +	 */
> +	spin_lock(&mmn_mm->lock);
> +	/* Pairs with the WRITE_ONCE in mmu_range_set_seq() */
> +	seq = READ_ONCE(mrn->invalidate_seq);
> +	is_invalidating = seq == mmn_mm->invalidate_seq;
> +	spin_unlock(&mmn_mm->lock);
> +
> +	/*
> +	 * mrn->invalidate_seq is always set to an odd value. This ensures

This claim just looks wrong the first N times one reads the code, given that
there is mmu_range_set_seq() to set it to an arbitrary value!  Maybe you mean

"is always set to an odd value when invalidating"??

> +	 * that if seq does wrap we will always clear the below sleep in some
> +	 * reasonable time as mmn_mm->invalidate_seq is even in the idle
> +	 * state.
> +	 */

Let's move that comment higher up. The code that follows it has nothing to
do with it, so it's confusing here.

...
> @@ -529,6 +852,166 @@ void mmu_notifier_put(struct mmu_notifier *mn)
>  }
>  EXPORT_SYMBOL_GPL(mmu_notifier_put);
>  
> +static int __mmu_range_notifier_insert(struct mmu_range_notifier *mrn,
> +				       unsigned long start,
> +				       unsigned long length,
> +				       struct mmu_notifier_mm *mmn_mm,
> +				       struct mm_struct *mm)
> +{
> +	mrn->mm = mm;
> +	RB_CLEAR_NODE(&mrn->interval_tree.rb);
> +	mrn->interval_tree.start = start;
> +	/*
> +	 * Note that the representation of the intervals in the interval tree
> +	 * considers the ending point as contained in the interval.

Thanks for that comment!

> +	 */
> +	if (length == 0 ||
> +	    check_add_overflow(start, length - 1, &mrn->interval_tree.last))
> +		return -EOVERFLOW;
> +
> +	/* pairs with mmdrop in mmu_range_notifier_remove() */
> +	mmgrab(mm);
> +
> +	/*
> +	 * If some invalidate_range_start/end region is going on in parallel
> +	 * we don't know what VA ranges are affected, so we must assume this
> +	 * new range is included.
> +	 *
> +	 * If the itree is invalidating then we are not allowed to change
> +	 * it. Retrying until invalidation is done is tricky due to the
> +	 * possibility for live lock, instead defer the add to the unlock so
> +	 * this algorithm is deterministic.
> +	 *
> +	 * In all cases the value for the mrn->mr_invalidate_seq should be
> +	 * odd, see mmu_range_read_begin()
> +	 */
> +	spin_lock(&mmn_mm->lock);
> +	if (mmn_mm->active_invalidate_ranges) {
> +		if (mn_itree_is_invalidating(mmn_mm))
> +			hlist_add_head(&mrn->deferred_item,
> +				       &mmn_mm->deferred_list);
> +		else {
> +			mmn_mm->invalidate_seq |= 1;
> +			interval_tree_insert(&mrn->interval_tree,
> +					     &mmn_mm->itree);
> +		}
> +		mrn->invalidate_seq = mmn_mm->invalidate_seq;
> +	} else {
> +		WARN_ON(mn_itree_is_invalidating(mmn_mm));
> +		mrn->invalidate_seq = mmn_mm->invalidate_seq - 1;

Ohhh, checkmate. I lose. Why is *subtracting* the right thing to do
for seq numbers here?  I'm acutely unhappy trying to figure this out.
I suspect it's another unfortunate side effect of trying to use the
lower bit of the seq number (even/odd) for something else.

> +		interval_tree_insert(&mrn->interval_tree, &mmn_mm->itree);
> +	}
> +	spin_unlock(&mmn_mm->lock);
> +	return 0;
> +}
> +
> +/**
> + * mmu_range_notifier_insert - Insert a range notifier
> + * @mrn: Range notifier to register
> + * @start: Starting virtual address to monitor
> + * @length: Length of the range to monitor
> + * @mm : mm_struct to attach to
> + *
> + * This function subscribes the range notifier for notifications from the mm.
> + * Upon return the ops related to mmu_range_notifier will be called whenever
> + * an event that intersects with the given range occurs.
> + *
> + * Upon return the range_notifier may not be present in the interval tree yet.
> + * The caller must use the normal range notifier locking flow via
> + * mmu_range_read_begin() to establish SPTEs for this range.
> + */
> +int mmu_range_notifier_insert(struct mmu_range_notifier *mrn,
> +			      unsigned long start, unsigned long length,
> +			      struct mm_struct *mm)
> +{
> +	struct mmu_notifier_mm *mmn_mm;
> +	int ret;

Hmmm, I think a later patch improperly changes the above to "int ret = 0;".
I'll check on that. It's correct here, though.

> +
> +	might_lock(&mm->mmap_sem);
> +
> +	mmn_mm = smp_load_acquire(&mm->mmu_notifier_mm);

What does the above pair with? Should have a comment that specifies that.

 
thanks,

John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 02/15] mm/mmu_notifier: add an interval tree notifier
  2019-11-07  0:23   ` John Hubbard
@ 2019-11-07  2:08     ` Jerome Glisse
  2019-11-07 20:11       ` Jason Gunthorpe
  2019-11-07 20:06     ` Jason Gunthorpe
  1 sibling, 1 reply; 71+ messages in thread
From: Jerome Glisse @ 2019-11-07  2:08 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jason Gunthorpe, linux-mm, Ralph Campbell, Felix.Kuehling,
	linux-rdma, dri-devel, amd-gfx, Alex Deucher, Ben Skeggs,
	Boris Ostrovsky, Christian König, David Zhou,
	Dennis Dalessandro, Juergen Gross, Mike Marciniszyn,
	Oleksandr Andrushchenko, Petr Cvek, Stefano Stabellini, nouveau,
	xen-devel, Christoph Hellwig, Jason Gunthorpe, Andrea Arcangeli,
	Michal Hocko

On Wed, Nov 06, 2019 at 04:23:21PM -0800, John Hubbard wrote:
> On 10/28/19 1:10 PM, Jason Gunthorpe wrote:

[...]

> >  /**
> >   * enum mmu_notifier_event - reason for the mmu notifier callback
> > @@ -32,6 +34,9 @@ struct mmu_notifier_range;
> >   * access flags). User should soft dirty the page in the end callback to make
> >   * sure that anyone relying on soft dirtyness catch pages that might be written
> >   * through non CPU mappings.
> > + *
> > + * @MMU_NOTIFY_RELEASE: used during mmu_range_notifier invalidate to signal that
> > + * the mm refcount is zero and the range is no longer accessible.
> >   */
> >  enum mmu_notifier_event {
> >  	MMU_NOTIFY_UNMAP = 0,
> > @@ -39,6 +44,7 @@ enum mmu_notifier_event {
> >  	MMU_NOTIFY_PROTECTION_VMA,
> >  	MMU_NOTIFY_PROTECTION_PAGE,
> >  	MMU_NOTIFY_SOFT_DIRTY,
> > +	MMU_NOTIFY_RELEASE,
> >  };
> 
> 
> OK, let the naming debates begin! ha. Anyway, after careful study of the overall
> patch, and some browsing of the larger patchset, it's clear that:
> 
> * The new "MMU range notifier" that you've created is, approximately, a new
> object. It uses classic mmu notifiers inside, as an implementation detail, and
> it does *similar* things (notifications) as mmn's. But it's certainly not the same
> as mmn's, as shown later when you say the need to an entirely new ops struct, and 
> data struct too.
> 
> Therefore, you need a separate events enum as well. This is important. MMN's
> won't be issuing MMN_NOTIFY_RELEASE events, nor will MNR's be issuing the first
> four prexisting MMU_NOTIFY_* items. So it would be a design mistake to glom them
> together, unless you ultimately decided to merge these MMN and MNR objects (which
> I don't really see any intention of, and that's fine).
> 
> So this should read:
> 
> enum mmu_range_notifier_event {
> 	MMU_NOTIFY_RELEASE,
> };
> 
> ...assuming that we stay with "mmu_range_notifier" as a core name for this 
> whole thing.
> 
> Also, it is best moved down to be next to the new MNR structs, so that all the
> MNR stuff is in one group.
> 
> Extra credit: IMHO, this clearly deserves to all be in a new mmu_range_notifier.h
> header file, but I know that's extra work. Maybe later as a follow-up patch,
> if anyone has the time.

The range notifier should get the event too, it would be a waste, i think it is
an oversight here. The release event is fine so NAK to you separate event. Event
is really an helper for notifier i had a set of patch for nouveau to leverage
this i need to resucite them. So no need to split thing, i would just forward
the event ie add event to mmu_range_notifier_ops.invalidate() i failed to catch
that in v1 sorry.


[...]

> > +struct mmu_range_notifier_ops {
> > +	bool (*invalidate)(struct mmu_range_notifier *mrn,
> > +			   const struct mmu_notifier_range *range,
> > +			   unsigned long cur_seq);
> > +};
> > +
> > +struct mmu_range_notifier {
> > +	struct interval_tree_node interval_tree;
> > +	const struct mmu_range_notifier_ops *ops;
> > +	struct hlist_node deferred_item;
> > +	unsigned long invalidate_seq;
> > +	struct mm_struct *mm;
> > +};
> > +
> 
> Again, now we have the new struct mmu_range_notifier, and the old 
> struct mmu_notifier_range, and it's not good.
> 
> Ideas:
> 
> a) Live with it.
> 
> b) (Discarded, too many callers): rename old one. Nope.
> 
> c) Rename new one. Ideas:
> 
>     struct mmu_interval_notifier
>     struct mmu_range_intersection
>     ...other ideas?

I vote for interval_notifier we do want notifier in name but i am also
fine with current name.

[...]

> > + *
> > + * Note that the core mm creates nested invalidate_range_start()/end() regions
> > + * within the same thread, and runs invalidate_range_start()/end() in parallel
> > + * on multiple CPUs. This is designed to not reduce concurrency or block
> > + * progress on the mm side.
> > + *
> > + * As a secondary function, holding the full write side also serves to prevent
> > + * writers for the itree, this is an optimization to avoid extra locking
> > + * during invalidate_range_start/end notifiers.
> > + *
> > + * The write side has two states, fully excluded:
> > + *  - mm->active_invalidate_ranges != 0
> > + *  - mnn->invalidate_seq & 1 == True
> > + *  - some range on the mm_struct is being invalidated
> > + *  - the itree is not allowed to change
> > + *
> > + * And partially excluded:
> > + *  - mm->active_invalidate_ranges != 0
> 
> I assume this implies mnn->invalidate_seq & 1 == False in this case? If so,
> let's say so. I'm probably getting that wrong, too.

Yes (mnn->invalidate_seq & 1) == 0

> 
> > + *  - some range on the mm_struct is being invalidated
> > + *  - the itree is allowed to change
> > + *
> > + * The later state avoids some expensive work on inv_end in the common case of
> > + * no mrn monitoring the VA.
> > + */
> > +static bool mn_itree_is_invalidating(struct mmu_notifier_mm *mmn_mm)
> > +{
> > +	lockdep_assert_held(&mmn_mm->lock);
> > +	return mmn_mm->invalidate_seq & 1;
> > +}
> > +
> > +static struct mmu_range_notifier *
> > +mn_itree_inv_start_range(struct mmu_notifier_mm *mmn_mm,
> > +			 const struct mmu_notifier_range *range,
> > +			 unsigned long *seq)
> > +{
> > +	struct interval_tree_node *node;
> > +	struct mmu_range_notifier *res = NULL;
> > +
> > +	spin_lock(&mmn_mm->lock);
> > +	mmn_mm->active_invalidate_ranges++;
> > +	node = interval_tree_iter_first(&mmn_mm->itree, range->start,
> > +					range->end - 1);
> > +	if (node) {
> > +		mmn_mm->invalidate_seq |= 1;
> 
> 
> OK, this either needs more documentation and assertions, or a different
> approach. Because I see addition, subtraction, AND, OR and booleans
> all being applied to this field, and it's darn near hopeless to figure
> out whether or not it really is even or odd at the right times.
> 
> Different approach: why not just add a mmn_mm->is_invalidating 
> member variable? It's not like you're short of space in that struct.

The invalidate_seq scheme looks fine to me, maybe it can use more comments.


> 
> 
> > +		res = container_of(node, struct mmu_range_notifier,
> > +				   interval_tree);
> > +	}
> > +
> > +	*seq = mmn_mm->invalidate_seq;
> > +	spin_unlock(&mmn_mm->lock);
> > +	return res;
> > +}
> > +
> > +static struct mmu_range_notifier *
> > +mn_itree_inv_next(struct mmu_range_notifier *mrn,
> > +		  const struct mmu_notifier_range *range)
> > +{
> > +	struct interval_tree_node *node;
> > +
> > +	node = interval_tree_iter_next(&mrn->interval_tree, range->start,
> > +				       range->end - 1);
> > +	if (!node)
> > +		return NULL;
> > +	return container_of(node, struct mmu_range_notifier, interval_tree);
> > +}
> > +
> > +static void mn_itree_inv_end(struct mmu_notifier_mm *mmn_mm)
> > +{
> > +	struct mmu_range_notifier *mrn;
> > +	struct hlist_node *next;
> > +	bool need_wake = false;
> > +
> > +	spin_lock(&mmn_mm->lock);
> > +	if (--mmn_mm->active_invalidate_ranges ||
> > +	    !mn_itree_is_invalidating(mmn_mm)) {
> > +		spin_unlock(&mmn_mm->lock);
> > +		return;
> > +	}
> > +
> > +	mmn_mm->invalidate_seq++;
> 
> Is this the right place for an assertion that this is now an even value?

Yes at that point it should be even ie mmn_mm->active_invalidate_ranges == 0
and we are holding the lock thus nothing can set the lower bit of invalidate_seq
and ++ should lead to even number.

> 
> > +	need_wake = true;
> > +
> > +	/*
> > +	 * The inv_end incorporates a deferred mechanism like
> > +	 * rtnl_lock(). Adds and removes are queued until the final inv_end
> 
> Let me point out that rtnl_lock() itself is a one-liner that calls mutex_lock().
> But I suppose if one studies that file closely there is more. :)

I think i commented in v1 about rtnl_lock() being something network people only
might be familiar, i think i saw it documented somewhere, maybe a lwn article.
But if you are familiar with network it is a think well understood ... for any
reasonable network scholar ;)

> ...
> 
> > +unsigned long mmu_range_read_begin(struct mmu_range_notifier *mrn)
> > +{
> > +	struct mmu_notifier_mm *mmn_mm = mrn->mm->mmu_notifier_mm;
> > +	unsigned long seq;
> > +	bool is_invalidating;
> > +
> > +	/*
> > +	 * If the mrn has a different seq value under the user_lock than we
> > +	 * started with then it has collided.
> > +	 *
> > +	 * If the mrn currently has the same seq value as the mmn_mm seq, then
> > +	 * it is currently between invalidate_start/end and is colliding.
> > +	 *
> > +	 * The locking looks broadly like this:
> > +	 *   mn_tree_invalidate_start():          mmu_range_read_begin():
> > +	 *                                         spin_lock
> > +	 *                                          seq = READ_ONCE(mrn->invalidate_seq);
> > +	 *                                          seq == mmn_mm->invalidate_seq
> > +	 *                                         spin_unlock
> > +	 *    spin_lock
> > +	 *     seq = ++mmn_mm->invalidate_seq
> > +	 *    spin_unlock
> > +	 *     op->invalidate_range():
> > +	 *       user_lock
> > +	 *        mmu_range_set_seq()
> > +	 *         mrn->invalidate_seq = seq
> > +	 *       user_unlock
> > +	 *
> > +	 *                          [Required: mmu_range_read_retry() == true]
> > +	 *
> > +	 *   mn_itree_inv_end():
> > +	 *    spin_lock
> > +	 *     seq = ++mmn_mm->invalidate_seq
> > +	 *    spin_unlock
> > +	 *
> > +	 *                                        user_lock
> > +	 *                                         mmu_range_read_retry():
> > +	 *                                          mrn->invalidate_seq != seq
> > +	 *                                        user_unlock
> > +	 *
> > +	 * Barriers are not needed here as any races here are closed by an
> > +	 * eventual mmu_range_read_retry(), which provides a barrier via the
> > +	 * user_lock.
> > +	 */
> > +	spin_lock(&mmn_mm->lock);
> > +	/* Pairs with the WRITE_ONCE in mmu_range_set_seq() */
> > +	seq = READ_ONCE(mrn->invalidate_seq);
> > +	is_invalidating = seq == mmn_mm->invalidate_seq;
> > +	spin_unlock(&mmn_mm->lock);
> > +
> > +	/*
> > +	 * mrn->invalidate_seq is always set to an odd value. This ensures
> 
> This claim just looks wrong the first N times one reads the code, given that
> there is mmu_range_set_seq() to set it to an arbitrary value!  Maybe you mean
> 
> "is always set to an odd value when invalidating"??

No it is always odd, you must call mmu_range_set_seq() only from the
op->invalidate_range() callback at which point the seq is odd. As well
when mrn is added and its seq first set it is set to an odd value
always. Maybe the comment, should read:

 * mrn->invalidate_seq is always, yes always, set to an odd value. This ensures

To stress that it is not an error.

> 
> > +	 * that if seq does wrap we will always clear the below sleep in some
> > +	 * reasonable time as mmn_mm->invalidate_seq is even in the idle
> > +	 * state.
> > +	 */
> 
> Let's move that comment higher up. The code that follows it has nothing to
> do with it, so it's confusing here.

No the comment is in the right place, the fact that it is odd and that
idle state is even explains why the wait() will never last forever.
Already had a discussion on this in v1.

[...]

> > +	/*
> > +	 * If some invalidate_range_start/end region is going on in parallel
> > +	 * we don't know what VA ranges are affected, so we must assume this
> > +	 * new range is included.
> > +	 *
> > +	 * If the itree is invalidating then we are not allowed to change
> > +	 * it. Retrying until invalidation is done is tricky due to the
> > +	 * possibility for live lock, instead defer the add to the unlock so
> > +	 * this algorithm is deterministic.
> > +	 *
> > +	 * In all cases the value for the mrn->mr_invalidate_seq should be
> > +	 * odd, see mmu_range_read_begin()
> > +	 */
> > +	spin_lock(&mmn_mm->lock);
> > +	if (mmn_mm->active_invalidate_ranges) {
> > +		if (mn_itree_is_invalidating(mmn_mm))
> > +			hlist_add_head(&mrn->deferred_item,
> > +				       &mmn_mm->deferred_list);
> > +		else {
> > +			mmn_mm->invalidate_seq |= 1;
> > +			interval_tree_insert(&mrn->interval_tree,
> > +					     &mmn_mm->itree);
> > +		}
> > +		mrn->invalidate_seq = mmn_mm->invalidate_seq;
> > +	} else {
> > +		WARN_ON(mn_itree_is_invalidating(mmn_mm));
> > +		mrn->invalidate_seq = mmn_mm->invalidate_seq - 1;
> 
> Ohhh, checkmate. I lose. Why is *subtracting* the right thing to do
> for seq numbers here?  I'm acutely unhappy trying to figure this out.
> I suspect it's another unfortunate side effect of trying to use the
> lower bit of the seq number (even/odd) for something else.

If there is no mmn_mm->active_invalidate_ranges then it means that
mmn_mm->invalidate_seq is even and thus mmn_mm->invalidate_seq - 1
is an odd number which means that mrn->invalidate_seq is initialized
to odd value and if you follow the rule for calling mmu_range_set_seq()
then it will _always_ be an odd number and this close the loop with
the above comments :)

> 
> > +		interval_tree_insert(&mrn->interval_tree, &mmn_mm->itree);
> > +	}
> > +	spin_unlock(&mmn_mm->lock);
> > +	return 0;
> > +}
> > +
> > +/**
> > + * mmu_range_notifier_insert - Insert a range notifier
> > + * @mrn: Range notifier to register
> > + * @start: Starting virtual address to monitor
> > + * @length: Length of the range to monitor
> > + * @mm : mm_struct to attach to
> > + *
> > + * This function subscribes the range notifier for notifications from the mm.
> > + * Upon return the ops related to mmu_range_notifier will be called whenever
> > + * an event that intersects with the given range occurs.
> > + *
> > + * Upon return the range_notifier may not be present in the interval tree yet.
> > + * The caller must use the normal range notifier locking flow via
> > + * mmu_range_read_begin() to establish SPTEs for this range.
> > + */
> > +int mmu_range_notifier_insert(struct mmu_range_notifier *mrn,
> > +			      unsigned long start, unsigned long length,
> > +			      struct mm_struct *mm)
> > +{
> > +	struct mmu_notifier_mm *mmn_mm;
> > +	int ret;
> 
> Hmmm, I think a later patch improperly changes the above to "int ret = 0;".
> I'll check on that. It's correct here, though.
> 
> > +
> > +	might_lock(&mm->mmap_sem);
> > +
> > +	mmn_mm = smp_load_acquire(&mm->mmu_notifier_mm);
> 
> What does the above pair with? Should have a comment that specifies that.

It was discussed in v1 but maybe a comment of what was said back then would
be helpful. Something like:

/*
 * We need to insure that all writes to mm->mmu_notifier_mm are visible before
 * any checks we do on mmn_mm below as otherwise CPU might re-order write done
 * by another CPU core to mm->mmu_notifier_mm structure fields after the read
 * belows.
 */

Cheers,
Jérôme


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 08/15] xen/gntdev: Use select for DMA_SHARED_BUFFER
  2019-10-28 20:10 ` [PATCH v2 08/15] xen/gntdev: Use select for DMA_SHARED_BUFFER Jason Gunthorpe
  2019-11-01 18:26   ` Jason Gunthorpe
@ 2019-11-07  9:39   ` Jürgen Groß
  1 sibling, 0 replies; 71+ messages in thread
From: Jürgen Groß @ 2019-11-07  9:39 UTC (permalink / raw)
  To: Jason Gunthorpe, linux-mm, Jerome Glisse, Ralph Campbell,
	John Hubbard, Felix.Kuehling
  Cc: linux-rdma, dri-devel, amd-gfx, Alex Deucher, Ben Skeggs,
	Boris Ostrovsky, Christian König, David Zhou,
	Dennis Dalessandro, Mike Marciniszyn, Oleksandr Andrushchenko,
	Petr Cvek, Stefano Stabellini, nouveau, xen-devel,
	Christoph Hellwig, Jason Gunthorpe

On 28.10.19 21:10, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> DMA_SHARED_BUFFER can not be enabled by the user (it represents a library
> set in the kernel). The kconfig convention is to use select for such
> symbols so they are turned on implicitly when the user enables a kconfig
> that needs them.
> 
> Otherwise the XEN_GNTDEV_DMABUF kconfig is overly difficult to enable.
> 
> Fixes: 932d6562179e ("xen/gntdev: Add initial support for dma-buf UAPI")
> Cc: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Cc: xen-devel@lists.xenproject.org
> Cc: Juergen Gross <jgross@suse.com>
> Cc: Stefano Stabellini <sstabellini@kernel.org>
> Reviewed-by: Juergen Gross <jgross@suse.com>
> Reviewed-by: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

Applied to xen/tip.git for-linus-5.5a


Juergen

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 02/15] mm/mmu_notifier: add an interval tree notifier
  2019-11-07  0:23   ` John Hubbard
  2019-11-07  2:08     ` Jerome Glisse
@ 2019-11-07 20:06     ` Jason Gunthorpe
  2019-11-07 20:53       ` John Hubbard
  2019-11-08  6:33       ` Christoph Hellwig
  1 sibling, 2 replies; 71+ messages in thread
From: Jason Gunthorpe @ 2019-11-07 20:06 UTC (permalink / raw)
  To: John Hubbard
  Cc: linux-mm, Jerome Glisse, Ralph Campbell, Felix.Kuehling,
	linux-rdma, dri-devel, amd-gfx, Alex Deucher, Ben Skeggs,
	Boris Ostrovsky, Christian König, David Zhou,
	Dennis Dalessandro, Juergen Gross, Mike Marciniszyn,
	Oleksandr Andrushchenko, Petr Cvek, Stefano Stabellini, nouveau,
	xen-devel, Christoph Hellwig, Andrea Arcangeli, Michal Hocko

On Wed, Nov 06, 2019 at 04:23:21PM -0800, John Hubbard wrote:
 
> Nice design, I love the seq foundation! So far, I'm not able to spot anything
> actually wrong with the implementation, sorry about that. 

Alas :( I feel there must be a bug in here still, but onwards!

One of the main sad points was it didn't make sense to use the
existing seqlock/seqcount primitives as they have both the wrong write
concurrancy model and extra barriers that are not needed when it is
always manipulated under a spinlock
 
> 1. There is a rather severe naming overlap (not technically a naming conflict,
> but still) with existing mmn work, which already has, for example:
> 
>     struct mmu_notifier_range
> 
> ...and you're adding:
> 
>     struct mmu_range_notifier
> 
> ...so I'll try to help sort that out.

Yes, I've been sad about this too.

> So this should read:
> 
> enum mmu_range_notifier_event {
> 	MMU_NOTIFY_RELEASE,
> };
> 
> ...assuming that we stay with "mmu_range_notifier" as a core name for this 
> whole thing.
> 
> Also, it is best moved down to be next to the new MNR structs, so that all the
> MNR stuff is in one group.

I agree with Jerome, this enum is part of the 'struct
mmu_notifier_range' (ie the description of the invalidation) and it
doesn't really matter that only these new notifiers can be called with
this type, it is still part of the mmu_notifier_range.

The comment already says it only applies to the mmu_range_notifier
scheme..

> >  #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
> > @@ -222,6 +228,26 @@ struct mmu_notifier {
> >  	unsigned int users;
> >  };
> >  
> 
> That should also be moved down, next to the new structs.

Which this?

> > +/**
> > + * struct mmu_range_notifier_ops
> > + * @invalidate: Upon return the caller must stop using any SPTEs within this
> > + *              range, this function can sleep. Return false if blocking was
> > + *              required but range is non-blocking
> > + */
> 
> How about this (I'm not sure I fully understand the return value, though):
> 
> /**
>  * struct mmu_range_notifier_ops
>  * @invalidate: Upon return the caller must stop using any SPTEs within this
>  * 		range.
>  *
>  * 		This function is permitted to sleep.
>  *
>  *      	@Return: false if blocking was required, but @range is
>  *			non-blocking.
>  *
>  */

Is this kdoc format for function pointers?
 
> 
> > +struct mmu_range_notifier_ops {
> > +	bool (*invalidate)(struct mmu_range_notifier *mrn,
> > +			   const struct mmu_notifier_range *range,
> > +			   unsigned long cur_seq);
> > +};
> > +
> > +struct mmu_range_notifier {
> > +	struct interval_tree_node interval_tree;
> > +	const struct mmu_range_notifier_ops *ops;
> > +	struct hlist_node deferred_item;
> > +	unsigned long invalidate_seq;
> > +	struct mm_struct *mm;
> > +};
> > +
> 
> Again, now we have the new struct mmu_range_notifier, and the old 
> struct mmu_notifier_range, and it's not good.
> 
> Ideas:
> 
> a) Live with it.
> 
> b) (Discarded, too many callers): rename old one. Nope.
> 
> c) Rename new one. Ideas:
> 
>     struct mmu_interval_notifier
>     struct mmu_range_intersection
>     ...other ideas?

This odd duality has already cause some confusion, but names here are
hard.  mmu_interval_notifier is the best alternative I've heard.

Changing this name is a lot of work - are we happy
'mmu_interval_notifier' is the right choice?

> > +/**
> > + * mmu_range_set_seq - Save the invalidation sequence
> 
> How about:
> 
>  * mmu_range_set_seq - Set the .invalidate_seq to a new value.

It is not a 'new value', it is a value that is provided to the
invalidate callback

> 
> > + * @mrn - The mrn passed to invalidate
> > + * @cur_seq - The cur_seq passed to invalidate
> > + *
> > + * This must be called unconditionally from the invalidate callback of a
> > + * struct mmu_range_notifier_ops under the same lock that is used to call
> > + * mmu_range_read_retry(). It updates the sequence number for later use by
> > + * mmu_range_read_retry().
> > + *
> > + * If the user does not call mmu_range_read_begin() or mmu_range_read_retry()
> 
> nit: "caller" is better than "user", when referring to...well, callers. "user" 
> most often refers to user space, whereas a call stack and function calling is 
> clearly what you're referring to here (and in other places, especially "user lock").

Done

> > +/**
> > + * mmu_range_check_retry - Test if a collision has occurred
> > + * mrn: The range under lock
> > + * seq: The return of the matching mmu_range_read_begin()
> > + *
> > + * This can be used in the critical section between mmu_range_read_begin() and
> > + * mmu_range_read_retry().  A return of true indicates an invalidation has
> > + * collided with this lock and a future mmu_range_read_retry() will return
> > + * true.
> > + *
> > + * False is not reliable and only suggests a collision has not happened. It
> 
> let's say "suggests that a collision *may* not have occurred."  

Sure

> > +/*
> > + * This is a collision-retry read-side/write-side 'lock', a lot like a
> > + * seqcount, however this allows multiple write-sides to hold it at
> > + * once. Conceptually the write side is protecting the values of the PTEs in
> > + * this mm, such that PTES cannot be read into SPTEs while any writer exists.
> 
> Just to be kind, can we say "SPTEs (shadow PTEs)", just this once? :)

Haha, sure, why not

> > + * The write side has two states, fully excluded:
> > + *  - mm->active_invalidate_ranges != 0
> > + *  - mnn->invalidate_seq & 1 == True
> > + *  - some range on the mm_struct is being invalidated
> > + *  - the itree is not allowed to change
> > + *
> > + * And partially excluded:
> > + *  - mm->active_invalidate_ranges != 0
> 
> I assume this implies mnn->invalidate_seq & 1 == False in this case? If so,
> let's say so. I'm probably getting that wrong, too.

Yes that is right, done

> 
> > + *  - some range on the mm_struct is being invalidated
> > + *  - the itree is allowed to change
> > + *
> > + * The later state avoids some expensive work on inv_end in the common case of
> > + * no mrn monitoring the VA.
> > + */
> > +static bool mn_itree_is_invalidating(struct mmu_notifier_mm *mmn_mm)
> > +{
> > +	lockdep_assert_held(&mmn_mm->lock);
> > +	return mmn_mm->invalidate_seq & 1;
> > +}
> > +
> > +static struct mmu_range_notifier *
> > +mn_itree_inv_start_range(struct mmu_notifier_mm *mmn_mm,
> > +			 const struct mmu_notifier_range *range,
> > +			 unsigned long *seq)
> > +{
> > +	struct interval_tree_node *node;
> > +	struct mmu_range_notifier *res = NULL;
> > +
> > +	spin_lock(&mmn_mm->lock);
> > +	mmn_mm->active_invalidate_ranges++;
> > +	node = interval_tree_iter_first(&mmn_mm->itree, range->start,
> > +					range->end - 1);
> > +	if (node) {
> > +		mmn_mm->invalidate_seq |= 1;
> 
> 
> OK, this either needs more documentation and assertions, or a different
> approach. Because I see addition, subtraction, AND, OR and booleans
> all being applied to this field, and it's darn near hopeless to figure
> out whether or not it really is even or odd at the right times.

This is a standard design for a seqlock scheme and follows the
existing design of the linux seq lock.

The lower bit indicates the lock'd state and the upper bits indicate
the generation of the lock

The operations on the lock itself are then:
   seq |= 1  # Take the lock
   seq++     # Release an acquired lock
   seq & 1   # True if locked

Which is how this is written

> Different approach: why not just add a mmn_mm->is_invalidating 
> member variable? It's not like you're short of space in that struct.

Splitting it makes alot of stuff more complex and unnatural.

The ops above could be put in inline wrappers, but they only occur
only in functions already called mn_itree_inv_start_range() and
mn_itree_inv_end() and mn_itree_is_invalidating().

There is the one 'take the lock' outlier in
__mmu_range_notifier_insert() though

> > +static void mn_itree_inv_end(struct mmu_notifier_mm *mmn_mm)
> > +{
> > +	struct mmu_range_notifier *mrn;
> > +	struct hlist_node *next;
> > +	bool need_wake = false;
> > +
> > +	spin_lock(&mmn_mm->lock);
> > +	if (--mmn_mm->active_invalidate_ranges ||
> > +	    !mn_itree_is_invalidating(mmn_mm)) {
> > +		spin_unlock(&mmn_mm->lock);
> > +		return;
> > +	}
> > +
> > +	mmn_mm->invalidate_seq++;
> 
> Is this the right place for an assertion that this is now an even value?

Yes, but I'm reluctant to add such a runtime check on this fastish path..
How about a comment?

> > +	need_wake = true;
> > +
> > +	/*
> > +	 * The inv_end incorporates a deferred mechanism like
> > +	 * rtnl_lock(). Adds and removes are queued until the final inv_end
> 
> Let me point out that rtnl_lock() itself is a one-liner that calls mutex_lock().
> But I suppose if one studies that file closely there is more. :)

Lets change that to rtnl_unlock() then

> > +	spin_lock(&mmn_mm->lock);
> > +	/* Pairs with the WRITE_ONCE in mmu_range_set_seq() */
> > +	seq = READ_ONCE(mrn->invalidate_seq);
> > +	is_invalidating = seq == mmn_mm->invalidate_seq;
> > +	spin_unlock(&mmn_mm->lock);
> > +
> > +	/*
> > +	 * mrn->invalidate_seq is always set to an odd value. This ensures
> 
> This claim just looks wrong the first N times one reads the code, given that
> there is mmu_range_set_seq() to set it to an arbitrary value!  Maybe
> you mean

mmu_range_set_seq() is NOT to be used to set to an arbitary value, it
must only be used to set to the value provided in the invalidate()
callback and that value is always odd. Lets make this super clear:

	/*
	 * mrn->invalidate_seq must always be set to an odd value via
	 * mmu_range_set_seq() using the provided cur_seq from
	 * mn_itree_inv_start_range(). This ensures that if seq does wrap we
	 * will always clear the below sleep in some reasonable time as
	 * mmn_mm->invalidate_seq is even in the idle state.
	 */

The invarient is that the 'struct mmu_range_notifier' always has an
odd 'seq'

> > +	 * that if seq does wrap we will always clear the below sleep in some
> > +	 * reasonable time as mmn_mm->invalidate_seq is even in the idle
> > +	 * state.
> > +	 */
> 
> Let's move that comment higher up. The code that follows it has nothing to
> do with it, so it's confusing here.

The comment is explaining why the wait_event is safe, even if we wrap
the sequence number, which is a significant and very subtle corner
case. This is really why we have the even/odd thing at all.

> > +	spin_lock(&mmn_mm->lock);
> > +	if (mmn_mm->active_invalidate_ranges) {
> > +		if (mn_itree_is_invalidating(mmn_mm))
> > +			hlist_add_head(&mrn->deferred_item,
> > +				       &mmn_mm->deferred_list);
> > +		else {
> > +			mmn_mm->invalidate_seq |= 1;
> > +			interval_tree_insert(&mrn->interval_tree,
> > +					     &mmn_mm->itree);
> > +		}
> > +		mrn->invalidate_seq = mmn_mm->invalidate_seq;
> > +	} else {
> > +		WARN_ON(mn_itree_is_invalidating(mmn_mm));
> > +		mrn->invalidate_seq = mmn_mm->invalidate_seq - 1;
> 
> Ohhh, checkmate. I lose. Why is *subtracting* the right thing to do
> for seq numbers here?  I'm acutely unhappy trying to figure this out.
> I suspect it's another unfortunate side effect of trying to use the
> lower bit of the seq number (even/odd) for something else.

No, this is actually done for the seq number itself. We need to
generate a seq number that is != the current invalidate_seq as this
new mrn is not invalidating.

The best seq to use is one that the invalidate_seq will not reach for
a long time, ie 'invalidate_seq + MAX' which is expressed as -1

The even/odd thing just takes care of itself naturally here as
invalidate_seq is guarenteed even and -1 creates both an odd mrn value
and a good seq number.

The algorithm would actually work correctly if this was
'mrn->invalidate_seq = 1', but occasionally things would block when
they don't need to block.

Lets add a comment:

		/*
		 * The starting seq for a mrn not under invalidation should be
		 * odd, not equal to the current invalidate_seq and
		 * invalidate_seq should not 'wrap' to the new seq any time
		 * soon.
		 */

> > +int mmu_range_notifier_insert(struct mmu_range_notifier *mrn,
> > +			      unsigned long start, unsigned long length,
> > +			      struct mm_struct *mm)
> > +{
> > +	struct mmu_notifier_mm *mmn_mm;
> > +	int ret;
> 
> Hmmm, I think a later patch improperly changes the above to "int ret = 0;".
> I'll check on that. It's correct here, though.

Looks OK in my tree?

> > +	might_lock(&mm->mmap_sem);
> > +
> > +	mmn_mm = smp_load_acquire(&mm->mmu_notifier_mm);
> 
> What does the above pair with? Should have a comment that specifies that.

smp_load_acquire() always pairs with smp_store_release() to the same
memory, there is only one store, is a comment really needed?

Below are the comment updates I made, thanks!

Jason

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 51b92ba013ddce..065c95002e9602 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -302,15 +302,15 @@ void mmu_range_notifier_remove(struct mmu_range_notifier *mrn);
 /**
  * mmu_range_set_seq - Save the invalidation sequence
  * @mrn - The mrn passed to invalidate
- * @cur_seq - The cur_seq passed to invalidate
+ * @cur_seq - The cur_seq passed to the invalidate() callback
  *
  * This must be called unconditionally from the invalidate callback of a
  * struct mmu_range_notifier_ops under the same lock that is used to call
  * mmu_range_read_retry(). It updates the sequence number for later use by
- * mmu_range_read_retry().
+ * mmu_range_read_retry(). The provided cur_seq will always be odd.
  *
- * If the user does not call mmu_range_read_begin() or mmu_range_read_retry()
- * then this call is not required.
+ * If the caller does not call mmu_range_read_begin() or
+ * mmu_range_read_retry() then this call is not required.
  */
 static inline void mmu_range_set_seq(struct mmu_range_notifier *mrn,
 				     unsigned long cur_seq)
@@ -348,8 +348,9 @@ static inline bool mmu_range_read_retry(struct mmu_range_notifier *mrn,
  * collided with this lock and a future mmu_range_read_retry() will return
  * true.
  *
- * False is not reliable and only suggests a collision has not happened. It
- * can be called many times and does not have to hold the user provided lock.
+ * False is not reliable and only suggests a collision may not have
+ * occured. It can be called many times and does not have to hold the user
+ * provided lock.
  *
  * This call can be used as part of loops and other expensive operations to
  * expedite a retry.
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 2b7485919ecfeb..afe1e2d94183f8 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -51,7 +51,8 @@ struct mmu_notifier_mm {
  * This is a collision-retry read-side/write-side 'lock', a lot like a
  * seqcount, however this allows multiple write-sides to hold it at
  * once. Conceptually the write side is protecting the values of the PTEs in
- * this mm, such that PTES cannot be read into SPTEs while any writer exists.
+ * this mm, such that PTES cannot be read into SPTEs (shadow PTEs) while any
+ * writer exists.
  *
  * Note that the core mm creates nested invalidate_range_start()/end() regions
  * within the same thread, and runs invalidate_range_start()/end() in parallel
@@ -64,12 +65,13 @@ struct mmu_notifier_mm {
  *
  * The write side has two states, fully excluded:
  *  - mm->active_invalidate_ranges != 0
- *  - mnn->invalidate_seq & 1 == True
+ *  - mnn->invalidate_seq & 1 == True (odd)
  *  - some range on the mm_struct is being invalidated
  *  - the itree is not allowed to change
  *
  * And partially excluded:
  *  - mm->active_invalidate_ranges != 0
+ *  - mnn->invalidate_seq & 1 == False (even)
  *  - some range on the mm_struct is being invalidated
  *  - the itree is allowed to change
  *
@@ -131,12 +133,13 @@ static void mn_itree_inv_end(struct mmu_notifier_mm *mmn_mm)
 		return;
 	}
 
+	/* Make invalidate_seq even */
 	mmn_mm->invalidate_seq++;
 	need_wake = true;
 
 	/*
 	 * The inv_end incorporates a deferred mechanism like
-	 * rtnl_lock(). Adds and removes are queued until the final inv_end
+	 * rtnl_unlock(). Adds and removes are queued until the final inv_end
 	 * happens then they are progressed. This arrangement for tree updates
 	 * is used to avoid using a blocking lock during
 	 * invalidate_range_start.
@@ -230,10 +233,11 @@ unsigned long mmu_range_read_begin(struct mmu_range_notifier *mrn)
 	spin_unlock(&mmn_mm->lock);
 
 	/*
-	 * mrn->invalidate_seq is always set to an odd value. This ensures
-	 * that if seq does wrap we will always clear the below sleep in some
-	 * reasonable time as mmn_mm->invalidate_seq is even in the idle
-	 * state.
+	 * mrn->invalidate_seq must always be set to an odd value via
+	 * mmu_range_set_seq() using the provided cur_seq from
+	 * mn_itree_inv_start_range(). This ensures that if seq does wrap we
+	 * will always clear the below sleep in some reasonable time as
+	 * mmn_mm->invalidate_seq is even in the idle state.
 	 */
 	lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
 	lock_map_release(&__mmu_notifier_invalidate_range_start_map);
@@ -892,6 +896,12 @@ static int __mmu_range_notifier_insert(struct mmu_range_notifier *mrn,
 		mrn->invalidate_seq = mmn_mm->invalidate_seq;
 	} else {
 		WARN_ON(mn_itree_is_invalidating(mmn_mm));
+		/*
+		 * The starting seq for a mrn not under invalidation should be
+		 * odd, not equal to the current invalidate_seq and
+		 * invalidate_seq should not 'wrap' to the new seq any time
+		 * soon.
+		 */
 		mrn->invalidate_seq = mmn_mm->invalidate_seq - 1;
 		interval_tree_insert(&mrn->interval_tree, &mmn_mm->itree);
 	}

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 02/15] mm/mmu_notifier: add an interval tree notifier
  2019-11-07  2:08     ` Jerome Glisse
@ 2019-11-07 20:11       ` Jason Gunthorpe
  2019-11-07 21:04         ` Jerome Glisse
  0 siblings, 1 reply; 71+ messages in thread
From: Jason Gunthorpe @ 2019-11-07 20:11 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: John Hubbard, linux-mm, Ralph Campbell, Felix.Kuehling,
	linux-rdma, dri-devel, amd-gfx, Alex Deucher, Ben Skeggs,
	Boris Ostrovsky, Christian König, David Zhou,
	Dennis Dalessandro, Juergen Gross, Mike Marciniszyn,
	Oleksandr Andrushchenko, Petr Cvek, Stefano Stabellini, nouveau,
	xen-devel, Christoph Hellwig, Andrea Arcangeli, Michal Hocko

On Wed, Nov 06, 2019 at 09:08:07PM -0500, Jerome Glisse wrote:

> > 
> > Extra credit: IMHO, this clearly deserves to all be in a new mmu_range_notifier.h
> > header file, but I know that's extra work. Maybe later as a follow-up patch,
> > if anyone has the time.
> 
> The range notifier should get the event too, it would be a waste, i think it is
> an oversight here. The release event is fine so NAK to you separate event. Event
> is really an helper for notifier i had a set of patch for nouveau to leverage
> this i need to resucite them. So no need to split thing, i would just forward
> the event ie add event to mmu_range_notifier_ops.invalidate() i failed to catch
> that in v1 sorry.

I think what you mean is already done?

struct mmu_range_notifier_ops {
	bool (*invalidate)(struct mmu_range_notifier *mrn,
			   const struct mmu_notifier_range *range,
			   unsigned long cur_seq);

> No it is always odd, you must call mmu_range_set_seq() only from the
> op->invalidate_range() callback at which point the seq is odd. As well
> when mrn is added and its seq first set it is set to an odd value
> always. Maybe the comment, should read:
> 
>  * mrn->invalidate_seq is always, yes always, set to an odd value. This ensures
> 
> To stress that it is not an error.

I went with this:

	/*
	 * mrn->invalidate_seq must always be set to an odd value via
	 * mmu_range_set_seq() using the provided cur_seq from
	 * mn_itree_inv_start_range(). This ensures that if seq does wrap we
	 * will always clear the below sleep in some reasonable time as
	 * mmn_mm->invalidate_seq is even in the idle state.
	 */

> > > +	spin_lock(&mmn_mm->lock);
> > > +	if (mmn_mm->active_invalidate_ranges) {
> > > +		if (mn_itree_is_invalidating(mmn_mm))
> > > +			hlist_add_head(&mrn->deferred_item,
> > > +				       &mmn_mm->deferred_list);
> > > +		else {
> > > +			mmn_mm->invalidate_seq |= 1;
> > > +			interval_tree_insert(&mrn->interval_tree,
> > > +					     &mmn_mm->itree);
> > > +		}
> > > +		mrn->invalidate_seq = mmn_mm->invalidate_seq;
> > > +	} else {
> > > +		WARN_ON(mn_itree_is_invalidating(mmn_mm));
> > > +		mrn->invalidate_seq = mmn_mm->invalidate_seq - 1;
> > 
> > Ohhh, checkmate. I lose. Why is *subtracting* the right thing to do
> > for seq numbers here?  I'm acutely unhappy trying to figure this out.
> > I suspect it's another unfortunate side effect of trying to use the
> > lower bit of the seq number (even/odd) for something else.
> 
> If there is no mmn_mm->active_invalidate_ranges then it means that
> mmn_mm->invalidate_seq is even and thus mmn_mm->invalidate_seq - 1
> is an odd number which means that mrn->invalidate_seq is initialized
> to odd value and if you follow the rule for calling mmu_range_set_seq()
> then it will _always_ be an odd number and this close the loop with
> the above comments :)

The key thing is that it is an odd value that will take a long time
before mmn_mm->invalidate seq reaches it

> > > +	might_lock(&mm->mmap_sem);
> > > +
> > > +	mmn_mm = smp_load_acquire(&mm->mmu_notifier_mm);
> > 
> > What does the above pair with? Should have a comment that specifies that.
> 
> It was discussed in v1 but maybe a comment of what was said back then would
> be helpful. Something like:
> 
> /*
>  * We need to insure that all writes to mm->mmu_notifier_mm are visible before
>  * any checks we do on mmn_mm below as otherwise CPU might re-order write done
>  * by another CPU core to mm->mmu_notifier_mm structure fields after the read
>  * belows.
>  */

This comment made it, just at the store side:

	/*
	 * Serialize the update against mmu_notifier_unregister. A
	 * side note: mmu_notifier_release can't run concurrently with
	 * us because we hold the mm_users pin (either implicitly as
	 * current->mm or explicitly with get_task_mm() or similar).
	 * We can't race against any other mmu notifier method either
	 * thanks to mm_take_all_locks().
	 *
	 * release semantics on the initialization of the mmu_notifier_mm's
         * contents are provided for unlocked readers.  acquire can only be
         * used while holding the mmgrab or mmget, and is safe because once
         * created the mmu_notififer_mm is not freed until the mm is
         * destroyed.  As above, users holding the mmap_sem or one of the
         * mm_take_all_locks() do not need to use acquire semantics.
	 */
	if (mmu_notifier_mm)
		smp_store_release(&mm->mmu_notifier_mm, mmu_notifier_mm);

Which I think is really overly belaboring the typical smp
store/release pattern, but people do seem unfamiliar with them...

Thanks,
Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 09/15] xen/gntdev: use mmu_range_notifier_insert
  2019-11-05 15:16       ` Boris Ostrovsky
@ 2019-11-07 20:36         ` Jason Gunthorpe
  2019-11-07 22:54           ` Boris Ostrovsky
  0 siblings, 1 reply; 71+ messages in thread
From: Jason Gunthorpe @ 2019-11-07 20:36 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard,
	Felix.Kuehling, linux-rdma, dri-devel, amd-gfx, Alex Deucher,
	Ben Skeggs, Christian König, David Zhou, Dennis Dalessandro,
	Juergen Gross, Mike Marciniszyn, Oleksandr Andrushchenko,
	Petr Cvek, Stefano Stabellini, nouveau, xen-devel,
	Christoph Hellwig

On Tue, Nov 05, 2019 at 10:16:46AM -0500, Boris Ostrovsky wrote:

> > So, I suppose it can be relaxed to a null test and a WARN_ON that it
> > hasn't changed?
> 
> You mean
> 
> if (use_ptemod) {
>         WARN_ON(map->vma != vma);
>         ...
> 
> 
> Yes, that sounds good.

I amended my copy of the patch with the above, has this rework shown
signs of working?

@@ -436,7 +436,8 @@ static void gntdev_vma_close(struct vm_area_struct *vma)
        struct gntdev_priv *priv = file->private_data;
 
        pr_debug("gntdev_vma_close %p\n", vma);
-       if (use_ptemod && map->vma == vma) {
+       if (use_ptemod) {
+               WARN_ON(map->vma != vma);
                mmu_range_notifier_remove(&map->notifier);
                map->vma = NULL;
        }

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 02/15] mm/mmu_notifier: add an interval tree notifier
  2019-11-07 20:06     ` Jason Gunthorpe
@ 2019-11-07 20:53       ` John Hubbard
  2019-11-08 15:26         ` Jason Gunthorpe
  2019-11-08  6:33       ` Christoph Hellwig
  1 sibling, 1 reply; 71+ messages in thread
From: John Hubbard @ 2019-11-07 20:53 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-mm, Jerome Glisse, Ralph Campbell, Felix.Kuehling,
	linux-rdma, dri-devel, amd-gfx, Alex Deucher, Ben Skeggs,
	Boris Ostrovsky, Christian König, David Zhou,
	Dennis Dalessandro, Juergen Gross, Mike Marciniszyn,
	Oleksandr Andrushchenko, Petr Cvek, Stefano Stabellini, nouveau,
	xen-devel, Christoph Hellwig, Andrea Arcangeli, Michal Hocko

On 11/7/19 12:06 PM, Jason Gunthorpe wrote:
...
>>
>> Also, it is best moved down to be next to the new MNR structs, so that all the
>> MNR stuff is in one group.
> 
> I agree with Jerome, this enum is part of the 'struct
> mmu_notifier_range' (ie the description of the invalidation) and it
> doesn't really matter that only these new notifiers can be called with
> this type, it is still part of the mmu_notifier_range.
> 

OK.

> The comment already says it only applies to the mmu_range_notifier
> scheme..
> 
>>>   #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
>>> @@ -222,6 +228,26 @@ struct mmu_notifier {
>>>   	unsigned int users;
>>>   };
>>>   
>>
>> That should also be moved down, next to the new structs.
> 
> Which this?

I was referring to MMU_NOTIFIER_RANGE_BLOCKABLE, above. Trying
to put all the new range notifier stuff in one place. But maybe not,
if these are really not as separate as I thought.

> 
>>> +/**
>>> + * struct mmu_range_notifier_ops
>>> + * @invalidate: Upon return the caller must stop using any SPTEs within this
>>> + *              range, this function can sleep. Return false if blocking was
>>> + *              required but range is non-blocking
>>> + */
>>
>> How about this (I'm not sure I fully understand the return value, though):
>>
>> /**
>>   * struct mmu_range_notifier_ops
>>   * @invalidate: Upon return the caller must stop using any SPTEs within this
>>   * 		range.
>>   *
>>   * 		This function is permitted to sleep.
>>   *
>>   *      	@Return: false if blocking was required, but @range is
>>   *			non-blocking.
>>   *
>>   */
> 
> Is this kdoc format for function pointers?

heh, I'm sort of winging it, I'm not sure how function pointers are supposed
to be documented in kdoc. Actually the only key take-away here is to write

"This function can sleep"

as a separate sentence..

...
>> c) Rename new one. Ideas:
>>
>>      struct mmu_interval_notifier
>>      struct mmu_range_intersection
>>      ...other ideas?
> 
> This odd duality has already cause some confusion, but names here are
> hard.  mmu_interval_notifier is the best alternative I've heard.
> 
> Changing this name is a lot of work - are we happy
> 'mmu_interval_notifier' is the right choice?


Yes, it's my favorite too. I'd vote for going with that.

...
>>
>>
>> OK, this either needs more documentation and assertions, or a different
>> approach. Because I see addition, subtraction, AND, OR and booleans
>> all being applied to this field, and it's darn near hopeless to figure
>> out whether or not it really is even or odd at the right times.
> 
> This is a standard design for a seqlock scheme and follows the
> existing design of the linux seq lock.
> 
> The lower bit indicates the lock'd state and the upper bits indicate
> the generation of the lock
> 
> The operations on the lock itself are then:
>     seq |= 1  # Take the lock
>     seq++     # Release an acquired lock
>     seq & 1   # True if locked
> 
> Which is how this is written

Very nice, would you be open to putting that into (any) one of the comment
headers? That's an unusually clear and concise description:

/*
  * This is a standard design for a seqlock scheme and follows the
  * existing design of the linux seq lock.
  *
  * The lower bit indicates the lock'd state and the upper bits indicate
  * the generation of the lock
  *
  * The operations on the lock itself are then:
  *    seq |= 1  # Take the lock
  *    seq++     # Release an acquired lock
  *    seq & 1   # True if locked
  */


> 
>> Different approach: why not just add a mmn_mm->is_invalidating
>> member variable? It's not like you're short of space in that struct.
> 
> Splitting it makes alot of stuff more complex and unnatural.
> 

OK, agreed.

> The ops above could be put in inline wrappers, but they only occur
> only in functions already called mn_itree_inv_start_range() and
> mn_itree_inv_end() and mn_itree_is_invalidating().
> 
> There is the one 'take the lock' outlier in
> __mmu_range_notifier_insert() though
> 
>>> +static void mn_itree_inv_end(struct mmu_notifier_mm *mmn_mm)
>>> +{
>>> +	struct mmu_range_notifier *mrn;
>>> +	struct hlist_node *next;
>>> +	bool need_wake = false;
>>> +
>>> +	spin_lock(&mmn_mm->lock);
>>> +	if (--mmn_mm->active_invalidate_ranges ||
>>> +	    !mn_itree_is_invalidating(mmn_mm)) {
>>> +		spin_unlock(&mmn_mm->lock);
>>> +		return;
>>> +	}
>>> +
>>> +	mmn_mm->invalidate_seq++;
>>
>> Is this the right place for an assertion that this is now an even value?
> 
> Yes, but I'm reluctant to add such a runtime check on this fastish path..
> How about a comment?

Sure.

> 
>>> +	need_wake = true;
>>> +
>>> +	/*
>>> +	 * The inv_end incorporates a deferred mechanism like
>>> +	 * rtnl_lock(). Adds and removes are queued until the final inv_end
>>
>> Let me point out that rtnl_lock() itself is a one-liner that calls mutex_lock().
>> But I suppose if one studies that file closely there is more. :)
> 
> Lets change that to rtnl_unlock() then


Thanks :)


...
>>> +	 * mrn->invalidate_seq is always set to an odd value. This ensures
>>
>> This claim just looks wrong the first N times one reads the code, given that
>> there is mmu_range_set_seq() to set it to an arbitrary value!  Maybe
>> you mean
> 
> mmu_range_set_seq() is NOT to be used to set to an arbitary value, it
> must only be used to set to the value provided in the invalidate()
> callback and that value is always odd. Lets make this super clear:
> 
> 	/*
> 	 * mrn->invalidate_seq must always be set to an odd value via
> 	 * mmu_range_set_seq() using the provided cur_seq from
> 	 * mn_itree_inv_start_range(). This ensures that if seq does wrap we
> 	 * will always clear the below sleep in some reasonable time as
> 	 * mmn_mm->invalidate_seq is even in the idle state.
> 	 */
> 

OK, that helps a lot.

...
>>> +		mrn->invalidate_seq = mmn_mm->invalidate_seq - 1;
>>
>> Ohhh, checkmate. I lose. Why is *subtracting* the right thing to do
>> for seq numbers here?  I'm acutely unhappy trying to figure this out.
>> I suspect it's another unfortunate side effect of trying to use the
>> lower bit of the seq number (even/odd) for something else.
> 
> No, this is actually done for the seq number itself. We need to
> generate a seq number that is != the current invalidate_seq as this
> new mrn is not invalidating.
> 
> The best seq to use is one that the invalidate_seq will not reach for
> a long time, ie 'invalidate_seq + MAX' which is expressed as -1
> 
> The even/odd thing just takes care of itself naturally here as
> invalidate_seq is guarenteed even and -1 creates both an odd mrn value
> and a good seq number.
> 
> The algorithm would actually work correctly if this was
> 'mrn->invalidate_seq = 1', but occasionally things would block when
> they don't need to block.
> 
> Lets add a comment:
> 
> 		/*
> 		 * The starting seq for a mrn not under invalidation should be
> 		 * odd, not equal to the current invalidate_seq and
> 		 * invalidate_seq should not 'wrap' to the new seq any time
> 		 * soon.
> 		 */

Very helpful. How about this additional tweak:

/*
  * The starting seq for a mrn not under invalidation should be
  * odd, not equal to the current invalidate_seq and
  * invalidate_seq should not 'wrap' to the new seq any time
  * soon. Subtracting 1 from the current (even) value achieves that.
  */


> 
>>> +int mmu_range_notifier_insert(struct mmu_range_notifier *mrn,
>>> +			      unsigned long start, unsigned long length,
>>> +			      struct mm_struct *mm)
>>> +{
>>> +	struct mmu_notifier_mm *mmn_mm;
>>> +	int ret;
>>
>> Hmmm, I think a later patch improperly changes the above to "int ret = 0;".
>> I'll check on that. It's correct here, though.
> 
> Looks OK in my tree?

Nope, that's how I found it. The top of your mmu_notifier branch has this:

int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
{
         struct mmu_notifier_mm *mmn_mm = range->mm->mmu_notifier_mm;
         int ret = 0;

         if (mmn_mm->has_interval) {
                 ret = mn_itree_invalidate(mmn_mm, range);
                 if (ret)
                         return ret;
         }
         if (!hlist_empty(&mmn_mm->list))
                 return mn_hlist_invalidate_range_start(mmn_mm, range);
         return 0;
}


> 
>>> +	might_lock(&mm->mmap_sem);
>>> +
>>> +	mmn_mm = smp_load_acquire(&mm->mmu_notifier_mm);
>>
>> What does the above pair with? Should have a comment that specifies that.
> 
> smp_load_acquire() always pairs with smp_store_release() to the same
> memory, there is only one store, is a comment really needed?
> 
> Below are the comment updates I made, thanks!
> 
> Jason
> 
> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
> index 51b92ba013ddce..065c95002e9602 100644
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -302,15 +302,15 @@ void mmu_range_notifier_remove(struct mmu_range_notifier *mrn);
>   /**
>    * mmu_range_set_seq - Save the invalidation sequence
>    * @mrn - The mrn passed to invalidate
> - * @cur_seq - The cur_seq passed to invalidate
> + * @cur_seq - The cur_seq passed to the invalidate() callback
>    *
>    * This must be called unconditionally from the invalidate callback of a
>    * struct mmu_range_notifier_ops under the same lock that is used to call
>    * mmu_range_read_retry(). It updates the sequence number for later use by
> - * mmu_range_read_retry().
> + * mmu_range_read_retry(). The provided cur_seq will always be odd.
>    *
> - * If the user does not call mmu_range_read_begin() or mmu_range_read_retry()
> - * then this call is not required.
> + * If the caller does not call mmu_range_read_begin() or
> + * mmu_range_read_retry() then this call is not required.
>    */
>   static inline void mmu_range_set_seq(struct mmu_range_notifier *mrn,
>   				     unsigned long cur_seq)
> @@ -348,8 +348,9 @@ static inline bool mmu_range_read_retry(struct mmu_range_notifier *mrn,
>    * collided with this lock and a future mmu_range_read_retry() will return
>    * true.
>    *
> - * False is not reliable and only suggests a collision has not happened. It
> - * can be called many times and does not have to hold the user provided lock.
> + * False is not reliable and only suggests a collision may not have
> + * occured. It can be called many times and does not have to hold the user
> + * provided lock.
>    *
>    * This call can be used as part of loops and other expensive operations to
>    * expedite a retry.
> diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
> index 2b7485919ecfeb..afe1e2d94183f8 100644
> --- a/mm/mmu_notifier.c
> +++ b/mm/mmu_notifier.c
> @@ -51,7 +51,8 @@ struct mmu_notifier_mm {
>    * This is a collision-retry read-side/write-side 'lock', a lot like a
>    * seqcount, however this allows multiple write-sides to hold it at
>    * once. Conceptually the write side is protecting the values of the PTEs in
> - * this mm, such that PTES cannot be read into SPTEs while any writer exists.
> + * this mm, such that PTES cannot be read into SPTEs (shadow PTEs) while any
> + * writer exists.
>    *
>    * Note that the core mm creates nested invalidate_range_start()/end() regions
>    * within the same thread, and runs invalidate_range_start()/end() in parallel
> @@ -64,12 +65,13 @@ struct mmu_notifier_mm {
>    *
>    * The write side has two states, fully excluded:
>    *  - mm->active_invalidate_ranges != 0
> - *  - mnn->invalidate_seq & 1 == True
> + *  - mnn->invalidate_seq & 1 == True (odd)
>    *  - some range on the mm_struct is being invalidated
>    *  - the itree is not allowed to change
>    *
>    * And partially excluded:
>    *  - mm->active_invalidate_ranges != 0
> + *  - mnn->invalidate_seq & 1 == False (even)
>    *  - some range on the mm_struct is being invalidated
>    *  - the itree is allowed to change
>    *
> @@ -131,12 +133,13 @@ static void mn_itree_inv_end(struct mmu_notifier_mm *mmn_mm)
>   		return;
>   	}
>   
> +	/* Make invalidate_seq even */
>   	mmn_mm->invalidate_seq++;
>   	need_wake = true;
>   
>   	/*
>   	 * The inv_end incorporates a deferred mechanism like
> -	 * rtnl_lock(). Adds and removes are queued until the final inv_end
> +	 * rtnl_unlock(). Adds and removes are queued until the final inv_end
>   	 * happens then they are progressed. This arrangement for tree updates
>   	 * is used to avoid using a blocking lock during
>   	 * invalidate_range_start.
> @@ -230,10 +233,11 @@ unsigned long mmu_range_read_begin(struct mmu_range_notifier *mrn)
>   	spin_unlock(&mmn_mm->lock);
>   
>   	/*
> -	 * mrn->invalidate_seq is always set to an odd value. This ensures
> -	 * that if seq does wrap we will always clear the below sleep in some
> -	 * reasonable time as mmn_mm->invalidate_seq is even in the idle
> -	 * state.
> +	 * mrn->invalidate_seq must always be set to an odd value via
> +	 * mmu_range_set_seq() using the provided cur_seq from
> +	 * mn_itree_inv_start_range(). This ensures that if seq does wrap we
> +	 * will always clear the below sleep in some reasonable time as
> +	 * mmn_mm->invalidate_seq is even in the idle state.
>   	 */
>   	lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
>   	lock_map_release(&__mmu_notifier_invalidate_range_start_map);
> @@ -892,6 +896,12 @@ static int __mmu_range_notifier_insert(struct mmu_range_notifier *mrn,
>   		mrn->invalidate_seq = mmn_mm->invalidate_seq;
>   	} else {
>   		WARN_ON(mn_itree_is_invalidating(mmn_mm));
> +		/*
> +		 * The starting seq for a mrn not under invalidation should be
> +		 * odd, not equal to the current invalidate_seq and
> +		 * invalidate_seq should not 'wrap' to the new seq any time
> +		 * soon.
> +		 */
>   		mrn->invalidate_seq = mmn_mm->invalidate_seq - 1;
>   		interval_tree_insert(&mrn->interval_tree, &mmn_mm->itree);
>   	}
> 

Looks good. We're just polishing up minor points now, so you can add:

Reviewed-by: John Hubbard <jhubbard@nvidia.com>



thanks,
-- 
John Hubbard
NVIDIA

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 02/15] mm/mmu_notifier: add an interval tree notifier
  2019-11-07 20:11       ` Jason Gunthorpe
@ 2019-11-07 21:04         ` Jerome Glisse
  2019-11-08  0:32           ` Jason Gunthorpe
  0 siblings, 1 reply; 71+ messages in thread
From: Jerome Glisse @ 2019-11-07 21:04 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: John Hubbard, linux-mm, Ralph Campbell, Felix.Kuehling,
	linux-rdma, dri-devel, amd-gfx, Alex Deucher, Ben Skeggs,
	Boris Ostrovsky, Christian König, David Zhou,
	Dennis Dalessandro, Juergen Gross, Mike Marciniszyn,
	Oleksandr Andrushchenko, Petr Cvek, Stefano Stabellini, nouveau,
	xen-devel, Christoph Hellwig, Andrea Arcangeli, Michal Hocko

On Thu, Nov 07, 2019 at 08:11:06PM +0000, Jason Gunthorpe wrote:
> On Wed, Nov 06, 2019 at 09:08:07PM -0500, Jerome Glisse wrote:
> 
> > > 
> > > Extra credit: IMHO, this clearly deserves to all be in a new mmu_range_notifier.h
> > > header file, but I know that's extra work. Maybe later as a follow-up patch,
> > > if anyone has the time.
> > 
> > The range notifier should get the event too, it would be a waste, i think it is
> > an oversight here. The release event is fine so NAK to you separate event. Event
> > is really an helper for notifier i had a set of patch for nouveau to leverage
> > this i need to resucite them. So no need to split thing, i would just forward
> > the event ie add event to mmu_range_notifier_ops.invalidate() i failed to catch
> > that in v1 sorry.
> 
> I think what you mean is already done?
> 
> struct mmu_range_notifier_ops {
> 	bool (*invalidate)(struct mmu_range_notifier *mrn,
> 			   const struct mmu_notifier_range *range,
> 			   unsigned long cur_seq);

Yes it is sorry, i got confuse with mmu_range_notifier and mmu_notifier_range :)
It is almost a palyndrome structure ;)

> 
> > No it is always odd, you must call mmu_range_set_seq() only from the
> > op->invalidate_range() callback at which point the seq is odd. As well
> > when mrn is added and its seq first set it is set to an odd value
> > always. Maybe the comment, should read:
> > 
> >  * mrn->invalidate_seq is always, yes always, set to an odd value. This ensures
> > 
> > To stress that it is not an error.
> 
> I went with this:
> 
> 	/*
> 	 * mrn->invalidate_seq must always be set to an odd value via
> 	 * mmu_range_set_seq() using the provided cur_seq from
> 	 * mn_itree_inv_start_range(). This ensures that if seq does wrap we
> 	 * will always clear the below sleep in some reasonable time as
> 	 * mmn_mm->invalidate_seq is even in the idle state.
> 	 */

Yes fine with me.

[...]

> > > > +	might_lock(&mm->mmap_sem);
> > > > +
> > > > +	mmn_mm = smp_load_acquire(&mm->mmu_notifier_mm);
> > > 
> > > What does the above pair with? Should have a comment that specifies that.
> > 
> > It was discussed in v1 but maybe a comment of what was said back then would
> > be helpful. Something like:
> > 
> > /*
> >  * We need to insure that all writes to mm->mmu_notifier_mm are visible before
> >  * any checks we do on mmn_mm below as otherwise CPU might re-order write done
> >  * by another CPU core to mm->mmu_notifier_mm structure fields after the read
> >  * belows.
> >  */
> 
> This comment made it, just at the store side:
> 
> 	/*
> 	 * Serialize the update against mmu_notifier_unregister. A
> 	 * side note: mmu_notifier_release can't run concurrently with
> 	 * us because we hold the mm_users pin (either implicitly as
> 	 * current->mm or explicitly with get_task_mm() or similar).
> 	 * We can't race against any other mmu notifier method either
> 	 * thanks to mm_take_all_locks().
> 	 *
> 	 * release semantics on the initialization of the mmu_notifier_mm's
>          * contents are provided for unlocked readers.  acquire can only be
>          * used while holding the mmgrab or mmget, and is safe because once
>          * created the mmu_notififer_mm is not freed until the mm is
>          * destroyed.  As above, users holding the mmap_sem or one of the
>          * mm_take_all_locks() do not need to use acquire semantics.
> 	 */
> 	if (mmu_notifier_mm)
> 		smp_store_release(&mm->mmu_notifier_mm, mmu_notifier_mm);
> 
> Which I think is really overly belaboring the typical smp
> store/release pattern, but people do seem unfamiliar with them...

Perfect with me. I think also sometimes you forgot what memory model is
and thus store/release pattern do, i know i do and i need to refresh my
mind.

Cheers,
Jérôme


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 09/15] xen/gntdev: use mmu_range_notifier_insert
  2019-11-07 20:36         ` Jason Gunthorpe
@ 2019-11-07 22:54           ` Boris Ostrovsky
  2019-11-08 14:53             ` Jason Gunthorpe
  0 siblings, 1 reply; 71+ messages in thread
From: Boris Ostrovsky @ 2019-11-07 22:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard,
	Felix.Kuehling, linux-rdma, dri-devel, amd-gfx, Alex Deucher,
	Ben Skeggs, Christian König, David Zhou, Dennis Dalessandro,
	Juergen Gross, Mike Marciniszyn, Oleksandr Andrushchenko,
	Petr Cvek, Stefano Stabellini, nouveau, xen-devel,
	Christoph Hellwig

On 11/7/19 3:36 PM, Jason Gunthorpe wrote:
> On Tue, Nov 05, 2019 at 10:16:46AM -0500, Boris Ostrovsky wrote:
>
>>> So, I suppose it can be relaxed to a null test and a WARN_ON that it
>>> hasn't changed?
>> You mean
>>
>> if (use_ptemod) {
>>         WARN_ON(map->vma != vma);
>>         ...
>>
>>
>> Yes, that sounds good.
> I amended my copy of the patch with the above, has this rework shown
> signs of working?

Yes, it works fine.

But please don't forget notifier ops initialization.

With those two changes,

Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

>
> @@ -436,7 +436,8 @@ static void gntdev_vma_close(struct vm_area_struct *vma)
>         struct gntdev_priv *priv = file->private_data;
>  
>         pr_debug("gntdev_vma_close %p\n", vma);
> -       if (use_ptemod && map->vma == vma) {
> +       if (use_ptemod) {
> +               WARN_ON(map->vma != vma);
>                 mmu_range_notifier_remove(&map->notifier);
>                 map->vma = NULL;
>         }
>
> Jason


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 02/15] mm/mmu_notifier: add an interval tree notifier
  2019-11-07 21:04         ` Jerome Glisse
@ 2019-11-08  0:32           ` Jason Gunthorpe
  2019-11-08  2:00             ` Jerome Glisse
  0 siblings, 1 reply; 71+ messages in thread
From: Jason Gunthorpe @ 2019-11-08  0:32 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: John Hubbard, linux-mm, Ralph Campbell, Felix.Kuehling,
	linux-rdma, dri-devel, amd-gfx, Alex Deucher, Ben Skeggs,
	Boris Ostrovsky, Christian König, David Zhou,
	Dennis Dalessandro, Juergen Gross, Mike Marciniszyn,
	Oleksandr Andrushchenko, Petr Cvek, Stefano Stabellini, nouveau,
	xen-devel, Christoph Hellwig, Andrea Arcangeli, Michal Hocko

On Thu, Nov 07, 2019 at 04:04:08PM -0500, Jerome Glisse wrote:
> On Thu, Nov 07, 2019 at 08:11:06PM +0000, Jason Gunthorpe wrote:
> > On Wed, Nov 06, 2019 at 09:08:07PM -0500, Jerome Glisse wrote:
> > 
> > > > 
> > > > Extra credit: IMHO, this clearly deserves to all be in a new mmu_range_notifier.h
> > > > header file, but I know that's extra work. Maybe later as a follow-up patch,
> > > > if anyone has the time.
> > > 
> > > The range notifier should get the event too, it would be a waste, i think it is
> > > an oversight here. The release event is fine so NAK to you separate event. Event
> > > is really an helper for notifier i had a set of patch for nouveau to leverage
> > > this i need to resucite them. So no need to split thing, i would just forward
> > > the event ie add event to mmu_range_notifier_ops.invalidate() i failed to catch
> > > that in v1 sorry.
> > 
> > I think what you mean is already done?
> > 
> > struct mmu_range_notifier_ops {
> > 	bool (*invalidate)(struct mmu_range_notifier *mrn,
> > 			   const struct mmu_notifier_range *range,
> > 			   unsigned long cur_seq);
> 
> Yes it is sorry, i got confuse with mmu_range_notifier and mmu_notifier_range :)
> It is almost a palyndrome structure ;)

Lets change the name then, this is clearly not working. I'll reflow
everything tomorrow

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 02/15] mm/mmu_notifier: add an interval tree notifier
  2019-11-08  0:32           ` Jason Gunthorpe
@ 2019-11-08  2:00             ` Jerome Glisse
  2019-11-08 20:19               ` Jason Gunthorpe
  0 siblings, 1 reply; 71+ messages in thread
From: Jerome Glisse @ 2019-11-08  2:00 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: John Hubbard, linux-mm, Ralph Campbell, Felix.Kuehling,
	linux-rdma, dri-devel, amd-gfx, Alex Deucher, Ben Skeggs,
	Boris Ostrovsky, Christian König, David Zhou,
	Dennis Dalessandro, Juergen Gross, Mike Marciniszyn,
	Oleksandr Andrushchenko, Petr Cvek, Stefano Stabellini, nouveau,
	xen-devel, Christoph Hellwig, Andrea Arcangeli, Michal Hocko

On Fri, Nov 08, 2019 at 12:32:25AM +0000, Jason Gunthorpe wrote:
> On Thu, Nov 07, 2019 at 04:04:08PM -0500, Jerome Glisse wrote:
> > On Thu, Nov 07, 2019 at 08:11:06PM +0000, Jason Gunthorpe wrote:
> > > On Wed, Nov 06, 2019 at 09:08:07PM -0500, Jerome Glisse wrote:
> > > 
> > > > > 
> > > > > Extra credit: IMHO, this clearly deserves to all be in a new mmu_range_notifier.h
> > > > > header file, but I know that's extra work. Maybe later as a follow-up patch,
> > > > > if anyone has the time.
> > > > 
> > > > The range notifier should get the event too, it would be a waste, i think it is
> > > > an oversight here. The release event is fine so NAK to you separate event. Event
> > > > is really an helper for notifier i had a set of patch for nouveau to leverage
> > > > this i need to resucite them. So no need to split thing, i would just forward
> > > > the event ie add event to mmu_range_notifier_ops.invalidate() i failed to catch
> > > > that in v1 sorry.
> > > 
> > > I think what you mean is already done?
> > > 
> > > struct mmu_range_notifier_ops {
> > > 	bool (*invalidate)(struct mmu_range_notifier *mrn,
> > > 			   const struct mmu_notifier_range *range,
> > > 			   unsigned long cur_seq);
> > 
> > Yes it is sorry, i got confuse with mmu_range_notifier and mmu_notifier_range :)
> > It is almost a palyndrome structure ;)
> 
> Lets change the name then, this is clearly not working. I'll reflow
> everything tomorrow

Semantic patch to do that run from your linux kernel directory with your patch
applied (you can run it one patch after the other and the git commit -a --fixup HEAD)

spatch --sp-file name-of-the-file-below --dir . --all-includes --in-place

%< ------------------------------------------------------------------
@@
@@
struct
-mmu_range_notifier
+mmu_interval_notifier

@@
@@
struct
-mmu_range_notifier
+mmu_interval_notifier
{...};

// Change mrn name to mmu_in
@@
struct mmu_interval_notifier *mrn;
@@
-mrn
+mmu_in

@@
identifier fn;
@@
fn(..., 
-struct mmu_interval_notifier *mrn,
+struct mmu_interval_notifier *mmu_in,
...) {...}
------------------------------------------------------------------ >%

You need coccinelle (which provides spatch). It is untested but it should work
also i could not come up with a nice name to update mrn as min is way too
confusing. If you have better name feel free to use it.

Oh and coccinelle is pretty clever about code formating so it should do a good
jobs at keeping things nicely formated and align.

Cheers,
Jérôme


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 02/15] mm/mmu_notifier: add an interval tree notifier
  2019-11-07 20:06     ` Jason Gunthorpe
  2019-11-07 20:53       ` John Hubbard
@ 2019-11-08  6:33       ` Christoph Hellwig
  2019-11-08 13:43         ` Jerome Glisse
  1 sibling, 1 reply; 71+ messages in thread
From: Christoph Hellwig @ 2019-11-08  6:33 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: John Hubbard, linux-mm, Jerome Glisse, Ralph Campbell,
	Felix.Kuehling, linux-rdma, dri-devel, amd-gfx, Alex Deucher,
	Ben Skeggs, Boris Ostrovsky, Christian König, David Zhou,
	Dennis Dalessandro, Juergen Gross, Mike Marciniszyn,
	Oleksandr Andrushchenko, Petr Cvek, Stefano Stabellini, nouveau,
	xen-devel, Christoph Hellwig, Andrea Arcangeli, Michal Hocko

On Thu, Nov 07, 2019 at 08:06:08PM +0000, Jason Gunthorpe wrote:
> > 
> > enum mmu_range_notifier_event {
> > 	MMU_NOTIFY_RELEASE,
> > };
> > 
> > ...assuming that we stay with "mmu_range_notifier" as a core name for this 
> > whole thing.
> > 
> > Also, it is best moved down to be next to the new MNR structs, so that all the
> > MNR stuff is in one group.
> 
> I agree with Jerome, this enum is part of the 'struct
> mmu_notifier_range' (ie the description of the invalidation) and it
> doesn't really matter that only these new notifiers can be called with
> this type, it is still part of the mmu_notifier_range.
> 
> The comment already says it only applies to the mmu_range_notifier
> scheme..

In fact the enum is entirely unused.  We might as well just kill it off
entirely.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 02/15] mm/mmu_notifier: add an interval tree notifier
  2019-11-08  6:33       ` Christoph Hellwig
@ 2019-11-08 13:43         ` Jerome Glisse
  0 siblings, 0 replies; 71+ messages in thread
From: Jerome Glisse @ 2019-11-08 13:43 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jason Gunthorpe, John Hubbard, linux-mm, Ralph Campbell,
	Felix.Kuehling, linux-rdma, dri-devel, amd-gfx, Alex Deucher,
	Ben Skeggs, Boris Ostrovsky, Christian König, David Zhou,
	Dennis Dalessandro, Juergen Gross, Mike Marciniszyn,
	Oleksandr Andrushchenko, Petr Cvek, Stefano Stabellini, nouveau,
	xen-devel, Andrea Arcangeli, Michal Hocko

On Thu, Nov 07, 2019 at 10:33:02PM -0800, Christoph Hellwig wrote:
> On Thu, Nov 07, 2019 at 08:06:08PM +0000, Jason Gunthorpe wrote:
> > > 
> > > enum mmu_range_notifier_event {
> > > 	MMU_NOTIFY_RELEASE,
> > > };
> > > 
> > > ...assuming that we stay with "mmu_range_notifier" as a core name for this 
> > > whole thing.
> > > 
> > > Also, it is best moved down to be next to the new MNR structs, so that all the
> > > MNR stuff is in one group.
> > 
> > I agree with Jerome, this enum is part of the 'struct
> > mmu_notifier_range' (ie the description of the invalidation) and it
> > doesn't really matter that only these new notifiers can be called with
> > this type, it is still part of the mmu_notifier_range.
> > 
> > The comment already says it only applies to the mmu_range_notifier
> > scheme..
> 
> In fact the enum is entirely unused.  We might as well just kill it off
> entirely.

I had patches to use it, i need to re-post them. I posted them long ago
and i droped the ball. I will re-spin after this.

Cheers,
Jérôme


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 09/15] xen/gntdev: use mmu_range_notifier_insert
  2019-11-07 22:54           ` Boris Ostrovsky
@ 2019-11-08 14:53             ` Jason Gunthorpe
  0 siblings, 0 replies; 71+ messages in thread
From: Jason Gunthorpe @ 2019-11-08 14:53 UTC (permalink / raw)
  To: Boris Ostrovsky
  Cc: linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard,
	Felix.Kuehling, linux-rdma, dri-devel, amd-gfx, Alex Deucher,
	Ben Skeggs, Christian König, David Zhou, Dennis Dalessandro,
	Juergen Gross, Mike Marciniszyn, Oleksandr Andrushchenko,
	Petr Cvek, Stefano Stabellini, nouveau, xen-devel,
	Christoph Hellwig

On Thu, Nov 07, 2019 at 05:54:52PM -0500, Boris Ostrovsky wrote:
> On 11/7/19 3:36 PM, Jason Gunthorpe wrote:
> > On Tue, Nov 05, 2019 at 10:16:46AM -0500, Boris Ostrovsky wrote:
> >
> >>> So, I suppose it can be relaxed to a null test and a WARN_ON that it
> >>> hasn't changed?
> >> You mean
> >>
> >> if (use_ptemod) {
> >>         WARN_ON(map->vma != vma);
> >>         ...
> >>
> >>
> >> Yes, that sounds good.
> > I amended my copy of the patch with the above, has this rework shown
> > signs of working?
> 
> Yes, it works fine.
> 
> But please don't forget notifier ops initialization.
> 
> With those two changes,
> 
> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

Thanks, I got both things. I'll forward this toward linux-next and
repost a v3 

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 02/15] mm/mmu_notifier: add an interval tree notifier
  2019-11-07 20:53       ` John Hubbard
@ 2019-11-08 15:26         ` Jason Gunthorpe
  0 siblings, 0 replies; 71+ messages in thread
From: Jason Gunthorpe @ 2019-11-08 15:26 UTC (permalink / raw)
  To: John Hubbard
  Cc: linux-mm, Jerome Glisse, Ralph Campbell, Felix.Kuehling,
	linux-rdma, dri-devel, amd-gfx, Alex Deucher, Ben Skeggs,
	Boris Ostrovsky, Christian König, David Zhou,
	Dennis Dalessandro, Juergen Gross, Mike Marciniszyn,
	Oleksandr Andrushchenko, Petr Cvek, Stefano Stabellini, nouveau,
	xen-devel, Christoph Hellwig, Andrea Arcangeli, Michal Hocko

On Thu, Nov 07, 2019 at 12:53:56PM -0800, John Hubbard wrote:
> > > > +/**
> > > > + * struct mmu_range_notifier_ops
> > > > + * @invalidate: Upon return the caller must stop using any SPTEs within this
> > > > + *              range, this function can sleep. Return false if blocking was
> > > > + *              required but range is non-blocking
> > > > + */
> > > 
> > > How about this (I'm not sure I fully understand the return value, though):
> > > 
> > > /**
> > >   * struct mmu_range_notifier_ops
> > >   * @invalidate: Upon return the caller must stop using any SPTEs within this
> > >   * 		range.
> > >   *
> > >   * 		This function is permitted to sleep.
> > >   *
> > >   *      	@Return: false if blocking was required, but @range is
> > >   *			non-blocking.
> > >   *
> > >   */
> > 
> > Is this kdoc format for function pointers?
> 
> heh, I'm sort of winging it, I'm not sure how function pointers are supposed
> to be documented in kdoc. Actually the only key take-away here is to write
> 
> "This function can sleep"
> 
> as a separate sentence..

Sure

> > This odd duality has already cause some confusion, but names here are
> > hard.  mmu_interval_notifier is the best alternative I've heard.
> > 
> > Changing this name is a lot of work - are we happy
> > 'mmu_interval_notifier' is the right choice? 
> 
> Yes, it's my favorite too. I'd vote for going with that.

Okay, lets give it a go

> Very nice, would you be open to putting that into (any) one of the comment
> headers? That's an unusually clear and concise description:

Yep, done

> > > > +int mmu_range_notifier_insert(struct mmu_range_notifier *mrn,
> > > > +			      unsigned long start, unsigned long length,
> > > > +			      struct mm_struct *mm)
> > > > +{
> > > > +	struct mmu_notifier_mm *mmn_mm;
> > > > +	int ret;
> > > 
> > > Hmmm, I think a later patch improperly changes the above to "int ret = 0;".
> > > I'll check on that. It's correct here, though.
> > 
> > Looks OK in my tree?
> 
> Nope, that's how I found it. The top of your mmu_notifier branch has this:
> 
> int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
> {
>         struct mmu_notifier_mm *mmn_mm = range->mm->mmu_notifier_mm;
>         int ret = 0;
> 
>         if (mmn_mm->has_interval) {
>                 ret = mn_itree_invalidate(mmn_mm, range);
>                 if (ret)
>                         return ret;
>         }
>         if (!hlist_empty(&mmn_mm->list))
>                 return mn_hlist_invalidate_range_start(mmn_mm, range);
>         return 0;
> }

Ah, that is a different function :) Fixed

> Looks good. We're just polishing up minor points now, so you can add:
> 
> Reviewed-by: John Hubbard <jhubbard@nvidia.com>

Great, thanks, I'll post a v3 with the rename

Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [PATCH v2 02/15] mm/mmu_notifier: add an interval tree notifier
  2019-11-08  2:00             ` Jerome Glisse
@ 2019-11-08 20:19               ` Jason Gunthorpe
  0 siblings, 0 replies; 71+ messages in thread
From: Jason Gunthorpe @ 2019-11-08 20:19 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: John Hubbard, linux-mm, Ralph Campbell, Felix.Kuehling,
	linux-rdma, dri-devel, amd-gfx, Alex Deucher, Ben Skeggs,
	Boris Ostrovsky, Christian König, David Zhou,
	Dennis Dalessandro, Juergen Gross, Mike Marciniszyn,
	Oleksandr Andrushchenko, Petr Cvek, Stefano Stabellini, nouveau,
	xen-devel, Christoph Hellwig, Andrea Arcangeli, Michal Hocko

On Thu, Nov 07, 2019 at 09:00:34PM -0500, Jerome Glisse wrote:
> On Fri, Nov 08, 2019 at 12:32:25AM +0000, Jason Gunthorpe wrote:
> > On Thu, Nov 07, 2019 at 04:04:08PM -0500, Jerome Glisse wrote:
> > > On Thu, Nov 07, 2019 at 08:11:06PM +0000, Jason Gunthorpe wrote:
> > > > On Wed, Nov 06, 2019 at 09:08:07PM -0500, Jerome Glisse wrote:
> > > > 
> > > > > > 
> > > > > > Extra credit: IMHO, this clearly deserves to all be in a new mmu_range_notifier.h
> > > > > > header file, but I know that's extra work. Maybe later as a follow-up patch,
> > > > > > if anyone has the time.
> > > > > 
> > > > > The range notifier should get the event too, it would be a waste, i think it is
> > > > > an oversight here. The release event is fine so NAK to you separate event. Event
> > > > > is really an helper for notifier i had a set of patch for nouveau to leverage
> > > > > this i need to resucite them. So no need to split thing, i would just forward
> > > > > the event ie add event to mmu_range_notifier_ops.invalidate() i failed to catch
> > > > > that in v1 sorry.
> > > > 
> > > > I think what you mean is already done?
> > > > 
> > > > struct mmu_range_notifier_ops {
> > > > 	bool (*invalidate)(struct mmu_range_notifier *mrn,
> > > > 			   const struct mmu_notifier_range *range,
> > > > 			   unsigned long cur_seq);
> > > 
> > > Yes it is sorry, i got confuse with mmu_range_notifier and mmu_notifier_range :)
> > > It is almost a palyndrome structure ;)
> > 
> > Lets change the name then, this is clearly not working. I'll reflow
> > everything tomorrow
> 
> Semantic patch to do that run from your linux kernel directory with your patch
> applied (you can run it one patch after the other and the git commit -a --fixup HEAD)
> 
> spatch --sp-file name-of-the-file-below --dir . --all-includes --in-place
> 
> %< ------------------------------------------------------------------
> @@
> @@
> struct
> -mmu_range_notifier
> +mmu_interval_notifier
> 
> @@
> @@
> struct
> -mmu_range_notifier
> +mmu_interval_notifier
> {...};
> 
> // Change mrn name to mmu_in
> @@
> struct mmu_interval_notifier *mrn;
> @@
> -mrn
> +mmu_in
> 
> @@
> identifier fn;
> @@
> fn(..., 
> -struct mmu_interval_notifier *mrn,
> +struct mmu_interval_notifier *mmu_in,
> ...) {...}
> 
> You need coccinelle (which provides spatch). It is untested but it should work
> also i could not come up with a nice name to update mrn as min is way too
> confusing. If you have better name feel free to use it.

I used 'mni' as we already use 'mn' to refer to the notifier, and
'mmu_in' looks like some input parameter or something

It mostly worked, lots of comments to fix manually though:

https://github.com/jgunthorpe/linux/commits/mmu_notifier

Thanks,
Jason

^ permalink raw reply	[flat|nested] 71+ messages in thread

end of thread, back to index

Thread overview: 71+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-28 20:10 [PATCH v2 00/15] Consolidate the mmu notifier interval_tree and locking Jason Gunthorpe
2019-10-28 20:10 ` [PATCH v2 01/15] mm/mmu_notifier: define the header pre-processor parts even if disabled Jason Gunthorpe
2019-11-05 21:23   ` John Hubbard
2019-11-06 13:36     ` Jason Gunthorpe
2019-10-28 20:10 ` [PATCH v2 02/15] mm/mmu_notifier: add an interval tree notifier Jason Gunthorpe
2019-10-29 22:04   ` Kuehling, Felix
2019-10-29 22:56     ` Jason Gunthorpe
2019-11-07  0:23   ` John Hubbard
2019-11-07  2:08     ` Jerome Glisse
2019-11-07 20:11       ` Jason Gunthorpe
2019-11-07 21:04         ` Jerome Glisse
2019-11-08  0:32           ` Jason Gunthorpe
2019-11-08  2:00             ` Jerome Glisse
2019-11-08 20:19               ` Jason Gunthorpe
2019-11-07 20:06     ` Jason Gunthorpe
2019-11-07 20:53       ` John Hubbard
2019-11-08 15:26         ` Jason Gunthorpe
2019-11-08  6:33       ` Christoph Hellwig
2019-11-08 13:43         ` Jerome Glisse
2019-10-28 20:10 ` [PATCH v2 03/15] mm/hmm: allow hmm_range to be used with a mmu_range_notifier or hmm_mirror Jason Gunthorpe
2019-10-28 20:10 ` [PATCH v2 04/15] mm/hmm: define the pre-processor related parts of hmm.h even if disabled Jason Gunthorpe
2019-10-28 20:10 ` [PATCH v2 05/15] RDMA/odp: Use mmu_range_notifier_insert() Jason Gunthorpe
2019-10-28 20:10 ` [PATCH v2 06/15] RDMA/hfi1: Use mmu_range_notifier_inset for user_exp_rcv Jason Gunthorpe
2019-10-29 12:19   ` Dennis Dalessandro
2019-10-29 12:51     ` Jason Gunthorpe
2019-10-28 20:10 ` [PATCH v2 07/15] drm/radeon: use mmu_range_notifier_insert Jason Gunthorpe
2019-10-29  7:48   ` Koenig, Christian
2019-10-28 20:10 ` [PATCH v2 08/15] xen/gntdev: Use select for DMA_SHARED_BUFFER Jason Gunthorpe
2019-11-01 18:26   ` Jason Gunthorpe
2019-11-05 14:44     ` Jürgen Groß
2019-11-07  9:39   ` Jürgen Groß
2019-10-28 20:10 ` [PATCH v2 09/15] xen/gntdev: use mmu_range_notifier_insert Jason Gunthorpe
2019-10-30 16:55   ` Boris Ostrovsky
2019-11-01 17:48     ` Jason Gunthorpe
2019-11-01 18:51       ` Boris Ostrovsky
2019-11-01 19:17         ` Jason Gunthorpe
2019-11-04 22:03   ` Boris Ostrovsky
2019-11-05  2:31     ` Jason Gunthorpe
2019-11-05 15:16       ` Boris Ostrovsky
2019-11-07 20:36         ` Jason Gunthorpe
2019-11-07 22:54           ` Boris Ostrovsky
2019-11-08 14:53             ` Jason Gunthorpe
2019-10-28 20:10 ` [PATCH v2 10/15] nouveau: use mmu_notifier directly for invalidate_range_start Jason Gunthorpe
2019-10-28 20:10 ` [PATCH v2 11/15] nouveau: use mmu_range_notifier instead of hmm_mirror Jason Gunthorpe
2019-10-28 20:10 ` [PATCH v2 12/15] drm/amdgpu: Call find_vma under mmap_sem Jason Gunthorpe
2019-10-29  7:49   ` Koenig, Christian
2019-10-29 16:28   ` Kuehling, Felix
2019-10-29 13:07     ` Christian König
2019-10-29 17:19     ` Jason Gunthorpe
2019-10-28 20:10 ` [PATCH v2 13/15] drm/amdgpu: Use mmu_range_insert instead of hmm_mirror Jason Gunthorpe
2019-10-29  7:51   ` Koenig, Christian
2019-10-29 13:59     ` Jason Gunthorpe
2019-10-29 22:14   ` Kuehling, Felix
2019-10-29 23:09     ` Jason Gunthorpe
2019-10-28 20:10 ` [PATCH v2 14/15] drm/amdgpu: Use mmu_range_notifier " Jason Gunthorpe
2019-10-29 19:22   ` Yang, Philip
2019-10-29 19:25     ` Jason Gunthorpe
2019-11-01 14:44       ` Yang, Philip
2019-11-01 15:12         ` Jason Gunthorpe
2019-11-01 15:59           ` Yang, Philip
2019-11-01 17:42             ` Jason Gunthorpe
2019-11-01 19:19               ` Jason Gunthorpe
2019-11-01 19:45               ` Yang, Philip
2019-11-01 19:50                 ` Yang, Philip
2019-11-01 19:51                 ` Jason Gunthorpe
2019-11-01 18:21         ` Jason Gunthorpe
2019-11-01 18:34         ` [PATCH v2a " Jason Gunthorpe
2019-10-28 20:10 ` [PATCH v2 15/15] mm/hmm: remove hmm_mirror and related Jason Gunthorpe
2019-11-01 19:54 ` [PATCH v2 00/15] Consolidate the mmu notifier interval_tree and locking Jason Gunthorpe
2019-11-01 20:54 ` Ralph Campbell
2019-11-04 20:40   ` Jason Gunthorpe

Linux-RDMA Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-rdma/0 linux-rdma/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-rdma linux-rdma/ https://lore.kernel.org/linux-rdma \
		linux-rdma@vger.kernel.org
	public-inbox-index linux-rdma

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-rdma


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git