linux-mm.kvack.org archive mirror
* [PATCH v2 hmm 00/11] Various revisions from a locking/code review
@ 2019-06-06 18:44 Jason Gunthorpe
  2019-06-06 18:44 ` [PATCH v2 hmm 01/11] mm/hmm: fix use after free with struct hmm in the mmu notifiers Jason Gunthorpe
                   ` (12 more replies)
  0 siblings, 13 replies; 79+ messages in thread
From: Jason Gunthorpe @ 2019-06-06 18:44 UTC (permalink / raw)
  To: Jerome Glisse, Ralph Campbell, John Hubbard, Felix.Kuehling
  Cc: linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx,
	Jason Gunthorpe

From: Jason Gunthorpe <jgg@mellanox.com>

For hmm.git:

This patch series arose out of discussions with Jerome when looking at the
ODP changes, particularly informed by the use-after-free races we have
already found and fixed in the ODP code (thanks to syzkaller) working with
mmu notifiers, and by the discussion with Ralph on how to resolve the
lifetime model.

Overall this brings in a simplified locking scheme and an easy-to-explain
lifetime model:

 If a hmm_range is valid, then the hmm is valid; if a hmm is valid, then the
 mm is still allocated memory.

 If the mm still needs to be alive (ie to lock the mmap_sem, find a vma, etc)
 then the mmget must be obtained via mmget_not_zero().
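
To illustrate that last rule, a rough driver-side sketch (everything here
except the mmget_not_zero()/mmput() pairing and the mmap_sem usage is made
up):

/* Hypothetical driver helper: anything that needs the address space itself,
 * not just the struct mm_struct memory, must first pin it. */
static int my_driver_lookup(struct mm_struct *mm, unsigned long addr)
{
	struct vm_area_struct *vma;
	int ret = 0;

	if (!mmget_not_zero(mm))	/* mm may already be exiting */
		return -EFAULT;

	down_read(&mm->mmap_sem);
	vma = find_vma(mm, addr);
	if (!vma || addr < vma->vm_start)
		ret = -EFAULT;
	up_read(&mm->mmap_sem);

	mmput(mm);
	return ret;
}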

Locking of mm->hmm is shifted to use the mmap_sem consistently for all
reads and writes, and unlocked accesses are removed.

The unlocked reads of 'hmm->dead' are also eliminated in favour of using
standard mmget() locking to prevent the mm from being released. Many of the
debugging checks of !range->hmm and !hmm->mm are dropped in favour of
poison, which is much clearer as to the lifetime intent.

The trailing patches are just some random cleanups I noticed when reviewing
this code.

This v2 incorporates a lot of the good off-list changes & feedback Jerome
had, and all the on-list comments too. However, now that we have the shared
git tree I have kept the one-line change to nouveau_svm.c rather than the
compat functions.

I believe we can resolve this merge in the DRM tree now and keep the core
mm/hmm.c clean. DRM maintainers, please correct me if I'm wrong.

It is on top of hmm.git, and I have a git tree of this series to ease testing
here:

https://github.com/jgunthorpe/linux/tree/hmm

There are still some open locking issues, as I think this remains unaddressed:

https://lore.kernel.org/linux-mm/20190527195829.GB18019@mellanox.com/

I'm looking for some more acks, reviews and tests so this can move ahead to
hmm.git.

Detailed notes on the v2 changes are in each patch. The big changes:
 - mmget is held so long as the range is registered
 - the last patch 'Remove confusing comment and logic from hmm_release' is new

Thanks everyone,
Jason

Jason Gunthorpe (11):
  mm/hmm: fix use after free with struct hmm in the mmu notifiers
  mm/hmm: Use hmm_mirror not mm as an argument for hmm_range_register
  mm/hmm: Hold a mmgrab from hmm to mm
  mm/hmm: Simplify hmm_get_or_create and make it reliable
  mm/hmm: Remove duplicate condition test before wait_event_timeout
  mm/hmm: Hold on to the mmget for the lifetime of the range
  mm/hmm: Use lockdep instead of comments
  mm/hmm: Remove racy protection against double-unregistration
  mm/hmm: Poison hmm_range during unregister
  mm/hmm: Do not use list*_rcu() for hmm->ranges
  mm/hmm: Remove confusing comment and logic from hmm_release

 drivers/gpu/drm/nouveau/nouveau_svm.c |   2 +-
 include/linux/hmm.h                   |  49 +------
 kernel/fork.c                         |   1 -
 mm/hmm.c                              | 204 ++++++++++----------------
 4 files changed, 87 insertions(+), 169 deletions(-)

-- 
2.21.0



* [PATCH v2 hmm 01/11] mm/hmm: fix use after free with struct hmm in the mmu notifiers
  2019-06-06 18:44 [PATCH v2 hmm 00/11] Various revisions from a locking/code review Jason Gunthorpe
@ 2019-06-06 18:44 ` Jason Gunthorpe
  2019-06-07  2:29   ` John Hubbard
                     ` (2 more replies)
  2019-06-06 18:44 ` [PATCH v2 hmm 02/11] mm/hmm: Use hmm_mirror not mm as an argument for hmm_range_register Jason Gunthorpe
                   ` (11 subsequent siblings)
  12 siblings, 3 replies; 79+ messages in thread
From: Jason Gunthorpe @ 2019-06-06 18:44 UTC (permalink / raw)
  To: Jerome Glisse, Ralph Campbell, John Hubbard, Felix.Kuehling
  Cc: linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx,
	Jason Gunthorpe

From: Jason Gunthorpe <jgg@mellanox.com>

mmu_notifier_unregister_no_release() is not a fence and the mmu_notifier
system will continue to reference hmm->mn until the srcu grace period
expires.

Resulting in use after free races like this:

         CPU0                                     CPU1
                                               __mmu_notifier_invalidate_range_start()
                                                 srcu_read_lock
                                                 hlist_for_each ()
                                                   // mn == hmm->mn
hmm_mirror_unregister()
  hmm_put()
    hmm_free()
      mmu_notifier_unregister_no_release()
         hlist_del_init_rcu(hmm-mn->list)
			                           mn->ops->invalidate_range_start(mn, range);
					             mm_get_hmm()
      mm->hmm = NULL;
      kfree(hmm)
                                                     mutex_lock(&hmm->lock);

Use SRCU to kfree the hmm memory so that the notifiers can rely on hmm
existing. Get the now-safe hmm struct through container_of and directly
check kref_get_unless_zero to lock it against free.
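
Condensed, the pattern used in the hunks below looks like this (sketch only,
'example_callback' stands in for the three notifier entry points):

static void hmm_free_rcu(struct rcu_head *rcu)
{
	kfree(container_of(rcu, struct hmm, rcu));
}

static void example_callback(struct mmu_notifier *mn, struct mm_struct *mm)
{
	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);

	/* Bail out if hmm is already on its way to being freed */
	if (!kref_get_unless_zero(&hmm->kref))
		return;
	/* ... notifier work ... */
	hmm_put(hmm);
}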

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
v2:
- Spell 'free' properly (Jerome/Ralph)
---
 include/linux/hmm.h |  1 +
 mm/hmm.c            | 25 +++++++++++++++++++------
 2 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 092f0234bfe917..688c5ca7068795 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -102,6 +102,7 @@ struct hmm {
 	struct mmu_notifier	mmu_notifier;
 	struct rw_semaphore	mirrors_sem;
 	wait_queue_head_t	wq;
+	struct rcu_head		rcu;
 	long			notifiers;
 	bool			dead;
 };
diff --git a/mm/hmm.c b/mm/hmm.c
index 8e7403f081f44a..547002f56a163d 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -113,6 +113,11 @@ static struct hmm *hmm_get_or_create(struct mm_struct *mm)
 	return NULL;
 }
 
+static void hmm_free_rcu(struct rcu_head *rcu)
+{
+	kfree(container_of(rcu, struct hmm, rcu));
+}
+
 static void hmm_free(struct kref *kref)
 {
 	struct hmm *hmm = container_of(kref, struct hmm, kref);
@@ -125,7 +130,7 @@ static void hmm_free(struct kref *kref)
 		mm->hmm = NULL;
 	spin_unlock(&mm->page_table_lock);
 
-	kfree(hmm);
+	mmu_notifier_call_srcu(&hmm->rcu, hmm_free_rcu);
 }
 
 static inline void hmm_put(struct hmm *hmm)
@@ -153,10 +158,14 @@ void hmm_mm_destroy(struct mm_struct *mm)
 
 static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 {
-	struct hmm *hmm = mm_get_hmm(mm);
+	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
 	struct hmm_mirror *mirror;
 	struct hmm_range *range;
 
+	/* hmm is in progress to free */
+	if (!kref_get_unless_zero(&hmm->kref))
+		return;
+
 	/* Report this HMM as dying. */
 	hmm->dead = true;
 
@@ -194,13 +203,15 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 static int hmm_invalidate_range_start(struct mmu_notifier *mn,
 			const struct mmu_notifier_range *nrange)
 {
-	struct hmm *hmm = mm_get_hmm(nrange->mm);
+	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
 	struct hmm_mirror *mirror;
 	struct hmm_update update;
 	struct hmm_range *range;
 	int ret = 0;
 
-	VM_BUG_ON(!hmm);
+	/* hmm is in progress to free */
+	if (!kref_get_unless_zero(&hmm->kref))
+		return 0;
 
 	update.start = nrange->start;
 	update.end = nrange->end;
@@ -245,9 +256,11 @@ static int hmm_invalidate_range_start(struct mmu_notifier *mn,
 static void hmm_invalidate_range_end(struct mmu_notifier *mn,
 			const struct mmu_notifier_range *nrange)
 {
-	struct hmm *hmm = mm_get_hmm(nrange->mm);
+	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
 
-	VM_BUG_ON(!hmm);
+	/* hmm is in progress to free */
+	if (!kref_get_unless_zero(&hmm->kref))
+		return;
 
 	mutex_lock(&hmm->lock);
 	hmm->notifiers--;
-- 
2.21.0



* [PATCH v2 hmm 02/11] mm/hmm: Use hmm_mirror not mm as an argument for hmm_range_register
  2019-06-06 18:44 [PATCH v2 hmm 00/11] Various revisions from a locking/code review Jason Gunthorpe
  2019-06-06 18:44 ` [PATCH v2 hmm 01/11] mm/hmm: fix use after free with struct hmm in the mmu notifiers Jason Gunthorpe
@ 2019-06-06 18:44 ` Jason Gunthorpe
  2019-06-07  2:36   ` John Hubbard
                     ` (3 more replies)
  2019-06-06 18:44 ` [PATCH v2 hmm 03/11] mm/hmm: Hold a mmgrab from hmm to mm Jason Gunthorpe
                   ` (10 subsequent siblings)
  12 siblings, 4 replies; 79+ messages in thread
From: Jason Gunthorpe @ 2019-06-06 18:44 UTC (permalink / raw)
  To: Jerome Glisse, Ralph Campbell, John Hubbard, Felix.Kuehling
  Cc: linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx,
	Jason Gunthorpe

From: Jason Gunthorpe <jgg@mellanox.com>

Ralph observes that hmm_range_register() can only be called by a driver
while a mirror is registered. Make this clear in the API by passing in the
mirror structure as a parameter.

This also simplifies understanding the lifetime model for struct hmm, as
the hmm pointer must be valid as part of a registered mirror, so all we
need in hmm_range_register() is a simple kref_get.
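
The driver-visible difference is only at the call site; a sketch, where
'drvdata' is a hypothetical driver structure holding the registered mirror:

	/* Before: the driver had to supply an mm_struct itself */
	ret = hmm_range_register(&range, mm, start, end, PAGE_SHIFT);

	/* After: pass the registered mirror; its hmm is known to be valid */
	ret = hmm_range_register(&range, &drvdata->mirror, start, end,
				 PAGE_SHIFT);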

Suggested-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
v2
- Include the oneline patch to nouveau_svm.c
---
 drivers/gpu/drm/nouveau/nouveau_svm.c |  2 +-
 include/linux/hmm.h                   |  7 ++++---
 mm/hmm.c                              | 15 ++++++---------
 3 files changed, 11 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
index 93ed43c413f0bb..8c92374afcf227 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -649,7 +649,7 @@ nouveau_svm_fault(struct nvif_notify *notify)
 		range.values = nouveau_svm_pfn_values;
 		range.pfn_shift = NVIF_VMM_PFNMAP_V0_ADDR_SHIFT;
 again:
-		ret = hmm_vma_fault(&range, true);
+		ret = hmm_vma_fault(&svmm->mirror, &range, true);
 		if (ret == 0) {
 			mutex_lock(&svmm->mutex);
 			if (!hmm_vma_range_done(&range)) {
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 688c5ca7068795..2d519797cb134a 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -505,7 +505,7 @@ static inline bool hmm_mirror_mm_is_alive(struct hmm_mirror *mirror)
  * Please see Documentation/vm/hmm.rst for how to use the range API.
  */
 int hmm_range_register(struct hmm_range *range,
-		       struct mm_struct *mm,
+		       struct hmm_mirror *mirror,
 		       unsigned long start,
 		       unsigned long end,
 		       unsigned page_shift);
@@ -541,7 +541,8 @@ static inline bool hmm_vma_range_done(struct hmm_range *range)
 }
 
 /* This is a temporary helper to avoid merge conflict between trees. */
-static inline int hmm_vma_fault(struct hmm_range *range, bool block)
+static inline int hmm_vma_fault(struct hmm_mirror *mirror,
+				struct hmm_range *range, bool block)
 {
 	long ret;
 
@@ -554,7 +555,7 @@ static inline int hmm_vma_fault(struct hmm_range *range, bool block)
 	range->default_flags = 0;
 	range->pfn_flags_mask = -1UL;
 
-	ret = hmm_range_register(range, range->vma->vm_mm,
+	ret = hmm_range_register(range, mirror,
 				 range->start, range->end,
 				 PAGE_SHIFT);
 	if (ret)
diff --git a/mm/hmm.c b/mm/hmm.c
index 547002f56a163d..8796447299023c 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -925,13 +925,13 @@ static void hmm_pfns_clear(struct hmm_range *range,
  * Track updates to the CPU page table see include/linux/hmm.h
  */
 int hmm_range_register(struct hmm_range *range,
-		       struct mm_struct *mm,
+		       struct hmm_mirror *mirror,
 		       unsigned long start,
 		       unsigned long end,
 		       unsigned page_shift)
 {
 	unsigned long mask = ((1UL << page_shift) - 1UL);
-	struct hmm *hmm;
+	struct hmm *hmm = mirror->hmm;
 
 	range->valid = false;
 	range->hmm = NULL;
@@ -945,15 +945,12 @@ int hmm_range_register(struct hmm_range *range,
 	range->start = start;
 	range->end = end;
 
-	hmm = hmm_get_or_create(mm);
-	if (!hmm)
-		return -EFAULT;
-
 	/* Check if hmm_mm_destroy() was call. */
-	if (hmm->mm == NULL || hmm->dead) {
-		hmm_put(hmm);
+	if (hmm->mm == NULL || hmm->dead)
 		return -EFAULT;
-	}
+
+	range->hmm = hmm;
+	kref_get(&hmm->kref);
 
 	/* Initialize range to track CPU page table updates. */
 	mutex_lock(&hmm->lock);
-- 
2.21.0



* [PATCH v2 hmm 03/11] mm/hmm: Hold a mmgrab from hmm to mm
  2019-06-06 18:44 [PATCH v2 hmm 00/11] Various revisions from a locking/code review Jason Gunthorpe
  2019-06-06 18:44 ` [PATCH v2 hmm 01/11] mm/hmm: fix use after free with struct hmm in the mmu notifiers Jason Gunthorpe
  2019-06-06 18:44 ` [PATCH v2 hmm 02/11] mm/hmm: Use hmm_mirror not mm as an argument for hmm_range_register Jason Gunthorpe
@ 2019-06-06 18:44 ` Jason Gunthorpe
  2019-06-07  2:44   ` John Hubbard
                     ` (2 more replies)
  2019-06-06 18:44 ` [PATCH v2 hmm 04/11] mm/hmm: Simplify hmm_get_or_create and make it reliable Jason Gunthorpe
                   ` (9 subsequent siblings)
  12 siblings, 3 replies; 79+ messages in thread
From: Jason Gunthorpe @ 2019-06-06 18:44 UTC (permalink / raw)
  To: Jerome Glisse, Ralph Campbell, John Hubbard, Felix.Kuehling
  Cc: linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx,
	Jason Gunthorpe

From: Jason Gunthorpe <jgg@mellanox.com>

So long as a struct hmm pointer exists, so should the struct mm it is
linked to. Hold the mmgrab() as soon as a hmm is created, and mmdrop() it
once the hmm refcount goes to zero.

Since mmdrop() (ie a 0 kref on struct mm) is now impossible with a !NULL
mm->hmm, delete hmm_mm_destroy().
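
For reference, the pairing being introduced, as a minimal sketch (the
sketch_* helpers are illustrative only): mmgrab()/mmdrop() pin just the
struct mm_struct allocation via mm_count, while mmget()/mmput() pin the
whole address space via mm_users.

static struct hmm *sketch_create(struct mm_struct *mm)
{
	struct hmm *hmm = kzalloc(sizeof(*hmm), GFP_KERNEL);

	if (!hmm)
		return NULL;
	hmm->mm = mm;
	mmgrab(hmm->mm);	/* keeps the struct mm_struct allocated */
	return hmm;
}

static void sketch_destroy(struct hmm *hmm)
{
	mmdrop(hmm->mm);	/* balances the mmgrab() above */
	kfree(hmm);
}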

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
---
v2:
 - Fix error unwind paths in hmm_get_or_create (Jerome/Jason)
---
 include/linux/hmm.h |  3 ---
 kernel/fork.c       |  1 -
 mm/hmm.c            | 22 ++++------------------
 3 files changed, 4 insertions(+), 22 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 2d519797cb134a..4ee3acabe5ed22 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -586,14 +586,11 @@ static inline int hmm_vma_fault(struct hmm_mirror *mirror,
 }
 
 /* Below are for HMM internal use only! Not to be used by device driver! */
-void hmm_mm_destroy(struct mm_struct *mm);
-
 static inline void hmm_mm_init(struct mm_struct *mm)
 {
 	mm->hmm = NULL;
 }
 #else /* IS_ENABLED(CONFIG_HMM_MIRROR) */
-static inline void hmm_mm_destroy(struct mm_struct *mm) {}
 static inline void hmm_mm_init(struct mm_struct *mm) {}
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
diff --git a/kernel/fork.c b/kernel/fork.c
index b2b87d450b80b5..588c768ae72451 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -673,7 +673,6 @@ void __mmdrop(struct mm_struct *mm)
 	WARN_ON_ONCE(mm == current->active_mm);
 	mm_free_pgd(mm);
 	destroy_context(mm);
-	hmm_mm_destroy(mm);
 	mmu_notifier_mm_destroy(mm);
 	check_mm(mm);
 	put_user_ns(mm->user_ns);
diff --git a/mm/hmm.c b/mm/hmm.c
index 8796447299023c..cc7c26fda3300e 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -29,6 +29,7 @@
 #include <linux/swapops.h>
 #include <linux/hugetlb.h>
 #include <linux/memremap.h>
+#include <linux/sched/mm.h>
 #include <linux/jump_label.h>
 #include <linux/dma-mapping.h>
 #include <linux/mmu_notifier.h>
@@ -82,6 +83,7 @@ static struct hmm *hmm_get_or_create(struct mm_struct *mm)
 	hmm->notifiers = 0;
 	hmm->dead = false;
 	hmm->mm = mm;
+	mmgrab(hmm->mm);
 
 	spin_lock(&mm->page_table_lock);
 	if (!mm->hmm)
@@ -109,6 +111,7 @@ static struct hmm *hmm_get_or_create(struct mm_struct *mm)
 		mm->hmm = NULL;
 	spin_unlock(&mm->page_table_lock);
 error:
+	mmdrop(hmm->mm);
 	kfree(hmm);
 	return NULL;
 }
@@ -130,6 +133,7 @@ static void hmm_free(struct kref *kref)
 		mm->hmm = NULL;
 	spin_unlock(&mm->page_table_lock);
 
+	mmdrop(hmm->mm);
 	mmu_notifier_call_srcu(&hmm->rcu, hmm_free_rcu);
 }
 
@@ -138,24 +142,6 @@ static inline void hmm_put(struct hmm *hmm)
 	kref_put(&hmm->kref, hmm_free);
 }
 
-void hmm_mm_destroy(struct mm_struct *mm)
-{
-	struct hmm *hmm;
-
-	spin_lock(&mm->page_table_lock);
-	hmm = mm_get_hmm(mm);
-	mm->hmm = NULL;
-	if (hmm) {
-		hmm->mm = NULL;
-		hmm->dead = true;
-		spin_unlock(&mm->page_table_lock);
-		hmm_put(hmm);
-		return;
-	}
-
-	spin_unlock(&mm->page_table_lock);
-}
-
 static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 {
 	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
-- 
2.21.0



* [PATCH v2 hmm 04/11] mm/hmm: Simplify hmm_get_or_create and make it reliable
  2019-06-06 18:44 [PATCH v2 hmm 00/11] Various revisions from a locking/code review Jason Gunthorpe
                   ` (2 preceding siblings ...)
  2019-06-06 18:44 ` [PATCH v2 hmm 03/11] mm/hmm: Hold a mmgrab from hmm to mm Jason Gunthorpe
@ 2019-06-06 18:44 ` Jason Gunthorpe
  2019-06-07  2:54   ` John Hubbard
                     ` (2 more replies)
  2019-06-06 18:44 ` [PATCH v2 hmm 05/11] mm/hmm: Remove duplicate condition test before wait_event_timeout Jason Gunthorpe
                   ` (8 subsequent siblings)
  12 siblings, 3 replies; 79+ messages in thread
From: Jason Gunthorpe @ 2019-06-06 18:44 UTC (permalink / raw)
  To: Jerome Glisse, Ralph Campbell, John Hubbard, Felix.Kuehling
  Cc: linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx,
	Jason Gunthorpe

From: Jason Gunthorpe <jgg@mellanox.com>

As coded this function can false-fail in various racy situations. Make it
reliable by running only under the write side of the mmap_sem and avoiding
the false-failing compare/exchange pattern.

Also make the locking very easy to understand by only ever reading or
writing mm->hmm while holding the write side of the mmap_sem.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
v2:
- Fix error unwind of mmgrab (Jerome)
- Use hmm local instead of 2nd container_of (Jerome)
---
 mm/hmm.c | 80 ++++++++++++++++++++------------------------------------
 1 file changed, 29 insertions(+), 51 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index cc7c26fda3300e..dc30edad9a8a02 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -40,16 +40,6 @@
 #if IS_ENABLED(CONFIG_HMM_MIRROR)
 static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
 
-static inline struct hmm *mm_get_hmm(struct mm_struct *mm)
-{
-	struct hmm *hmm = READ_ONCE(mm->hmm);
-
-	if (hmm && kref_get_unless_zero(&hmm->kref))
-		return hmm;
-
-	return NULL;
-}
-
 /**
  * hmm_get_or_create - register HMM against an mm (HMM internal)
  *
@@ -64,11 +54,20 @@ static inline struct hmm *mm_get_hmm(struct mm_struct *mm)
  */
 static struct hmm *hmm_get_or_create(struct mm_struct *mm)
 {
-	struct hmm *hmm = mm_get_hmm(mm);
-	bool cleanup = false;
+	struct hmm *hmm;
 
-	if (hmm)
-		return hmm;
+	lockdep_assert_held_exclusive(&mm->mmap_sem);
+
+	if (mm->hmm) {
+		if (kref_get_unless_zero(&mm->hmm->kref))
+			return mm->hmm;
+		/*
+		 * The hmm is being freed by some other CPU and is pending a
+		 * RCU grace period, but this CPU can NULL now it since we
+		 * have the mmap_sem.
+		 */
+		mm->hmm = NULL;
+	}
 
 	hmm = kmalloc(sizeof(*hmm), GFP_KERNEL);
 	if (!hmm)
@@ -83,57 +82,36 @@ static struct hmm *hmm_get_or_create(struct mm_struct *mm)
 	hmm->notifiers = 0;
 	hmm->dead = false;
 	hmm->mm = mm;
-	mmgrab(hmm->mm);
-
-	spin_lock(&mm->page_table_lock);
-	if (!mm->hmm)
-		mm->hmm = hmm;
-	else
-		cleanup = true;
-	spin_unlock(&mm->page_table_lock);
 
-	if (cleanup)
-		goto error;
-
-	/*
-	 * We should only get here if hold the mmap_sem in write mode ie on
-	 * registration of first mirror through hmm_mirror_register()
-	 */
 	hmm->mmu_notifier.ops = &hmm_mmu_notifier_ops;
-	if (__mmu_notifier_register(&hmm->mmu_notifier, mm))
-		goto error_mm;
+	if (__mmu_notifier_register(&hmm->mmu_notifier, mm)) {
+		kfree(hmm);
+		return NULL;
+	}
 
+	mmgrab(hmm->mm);
+	mm->hmm = hmm;
 	return hmm;
-
-error_mm:
-	spin_lock(&mm->page_table_lock);
-	if (mm->hmm == hmm)
-		mm->hmm = NULL;
-	spin_unlock(&mm->page_table_lock);
-error:
-	mmdrop(hmm->mm);
-	kfree(hmm);
-	return NULL;
 }
 
 static void hmm_free_rcu(struct rcu_head *rcu)
 {
-	kfree(container_of(rcu, struct hmm, rcu));
+	struct hmm *hmm = container_of(rcu, struct hmm, rcu);
+
+	down_write(&hmm->mm->mmap_sem);
+	if (hmm->mm->hmm == hmm)
+		hmm->mm->hmm = NULL;
+	up_write(&hmm->mm->mmap_sem);
+	mmdrop(hmm->mm);
+
+	kfree(hmm);
 }
 
 static void hmm_free(struct kref *kref)
 {
 	struct hmm *hmm = container_of(kref, struct hmm, kref);
-	struct mm_struct *mm = hmm->mm;
-
-	mmu_notifier_unregister_no_release(&hmm->mmu_notifier, mm);
 
-	spin_lock(&mm->page_table_lock);
-	if (mm->hmm == hmm)
-		mm->hmm = NULL;
-	spin_unlock(&mm->page_table_lock);
-
-	mmdrop(hmm->mm);
+	mmu_notifier_unregister_no_release(&hmm->mmu_notifier, hmm->mm);
 	mmu_notifier_call_srcu(&hmm->rcu, hmm_free_rcu);
 }
 
-- 
2.21.0



* [PATCH v2 hmm 05/11] mm/hmm: Remove duplicate condition test before wait_event_timeout
  2019-06-06 18:44 [PATCH v2 hmm 00/11] Various revisions from a locking/code review Jason Gunthorpe
                   ` (3 preceding siblings ...)
  2019-06-06 18:44 ` [PATCH v2 hmm 04/11] mm/hmm: Simplify hmm_get_or_create and make it reliable Jason Gunthorpe
@ 2019-06-06 18:44 ` Jason Gunthorpe
  2019-06-07  3:06   ` John Hubbard
  2019-06-07 19:01   ` [PATCH v2 " Ralph Campbell
  2019-06-06 18:44 ` [PATCH v2 hmm 06/11] mm/hmm: Hold on to the mmget for the lifetime of the range Jason Gunthorpe
                   ` (7 subsequent siblings)
  12 siblings, 2 replies; 79+ messages in thread
From: Jason Gunthorpe @ 2019-06-06 18:44 UTC (permalink / raw)
  To: Jerome Glisse, Ralph Campbell, John Hubbard, Felix.Kuehling
  Cc: linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx,
	Jason Gunthorpe

From: Jason Gunthorpe <jgg@mellanox.com>

The wait_event_timeout macro already tests the condition as its first
action, so there is no reason to open code another version of this; all
that does is skip the might_sleep() debugging in common cases, which is
not helpful.

Further, based on prior patches, we can no simplify the required condition
test:
 - If range is valid memory then so is range->hmm
 - If hmm_release() has run then range->valid is set to false
   at the same time as dead, so no reason to check both.
 - A valid hmm has a valid hmm->mm.

Also, add the READ_ONCE for range->valid as there is no lock held here.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
---
 include/linux/hmm.h | 12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 4ee3acabe5ed22..2ab35b40992b24 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -218,17 +218,9 @@ static inline unsigned long hmm_range_page_size(const struct hmm_range *range)
 static inline bool hmm_range_wait_until_valid(struct hmm_range *range,
 					      unsigned long timeout)
 {
-	/* Check if mm is dead ? */
-	if (range->hmm == NULL || range->hmm->dead || range->hmm->mm == NULL) {
-		range->valid = false;
-		return false;
-	}
-	if (range->valid)
-		return true;
-	wait_event_timeout(range->hmm->wq, range->valid || range->hmm->dead,
+	wait_event_timeout(range->hmm->wq, range->valid,
 			   msecs_to_jiffies(timeout));
-	/* Return current valid status just in case we get lucky */
-	return range->valid;
+	return READ_ONCE(range->valid);
 }
 
 /*
-- 
2.21.0



* [PATCH v2 hmm 06/11] mm/hmm: Hold on to the mmget for the lifetime of the range
  2019-06-06 18:44 [PATCH v2 hmm 00/11] Various revisions from a locking/code review Jason Gunthorpe
                   ` (4 preceding siblings ...)
  2019-06-06 18:44 ` [PATCH v2 hmm 05/11] mm/hmm: Remove duplicate condition test before wait_event_timeout Jason Gunthorpe
@ 2019-06-06 18:44 ` Jason Gunthorpe
  2019-06-07  3:15   ` John Hubbard
  2019-06-07 20:29   ` Ralph Campbell
  2019-06-06 18:44 ` [PATCH v2 hmm 07/11] mm/hmm: Use lockdep instead of comments Jason Gunthorpe
                   ` (6 subsequent siblings)
  12 siblings, 2 replies; 79+ messages in thread
From: Jason Gunthorpe @ 2019-06-06 18:44 UTC (permalink / raw)
  To: Jerome Glisse, Ralph Campbell, John Hubbard, Felix.Kuehling
  Cc: linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx,
	Jason Gunthorpe

From: Jason Gunthorpe <jgg@mellanox.com>

Range functions like hmm_range_snapshot() and hmm_range_fault() call
find_vma, which requires holding the mmget() and the mmap_sem for the mm.

Make this simpler for the callers by holding the mmget() inside the range
for the lifetime of the range. Other functions that accept a range should
only be called if the range is registered.

This has the side effect of directly preventing hmm_release() from
happening while a range is registered. That means hmm->dead cannot become
true during the lifetime of the range, so remove dead and
hmm_mirror_mm_is_alive() entirely.
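
A hedged sketch of what a caller looks like under this rule (names other
than the hmm_range_*() calls are hypothetical, and the validity/retry
handling is omitted for brevity):

static long my_snapshot(struct hmm_mirror *mirror, struct hmm_range *range,
			struct mm_struct *mm, unsigned long start,
			unsigned long end)
{
	long ret;

	/* Registering takes the mmget(), so hmm_release() cannot run */
	ret = hmm_range_register(range, mirror, start, end, PAGE_SHIFT);
	if (ret)
		return ret;

	down_read(&mm->mmap_sem);
	ret = hmm_range_snapshot(range);
	up_read(&mm->mmap_sem);

	hmm_range_unregister(range);	/* drops the mmget() again */
	return ret;
}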

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
v2:
 - Use Jerome's idea of just holding the mmget() for the range lifetime,
   rework the patch to use that as as simplification to remove dead in
   one step
---
 include/linux/hmm.h | 26 --------------------------
 mm/hmm.c            | 28 ++++++++++------------------
 2 files changed, 10 insertions(+), 44 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 2ab35b40992b24..0e20566802967a 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -91,7 +91,6 @@
  * @mirrors_sem: read/write semaphore protecting the mirrors list
  * @wq: wait queue for user waiting on a range invalidation
  * @notifiers: count of active mmu notifiers
- * @dead: is the mm dead ?
  */
 struct hmm {
 	struct mm_struct	*mm;
@@ -104,7 +103,6 @@ struct hmm {
 	wait_queue_head_t	wq;
 	struct rcu_head		rcu;
 	long			notifiers;
-	bool			dead;
 };
 
 /*
@@ -469,30 +467,6 @@ struct hmm_mirror {
 int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
 void hmm_mirror_unregister(struct hmm_mirror *mirror);
 
-/*
- * hmm_mirror_mm_is_alive() - test if mm is still alive
- * @mirror: the HMM mm mirror for which we want to lock the mmap_sem
- * Return: false if the mm is dead, true otherwise
- *
- * This is an optimization, it will not always accurately return false if the
- * mm is dead; i.e., there can be false negatives (process is being killed but
- * HMM is not yet informed of that). It is only intended to be used to optimize
- * out cases where the driver is about to do something time consuming and it
- * would be better to skip it if the mm is dead.
- */
-static inline bool hmm_mirror_mm_is_alive(struct hmm_mirror *mirror)
-{
-	struct mm_struct *mm;
-
-	if (!mirror || !mirror->hmm)
-		return false;
-	mm = READ_ONCE(mirror->hmm->mm);
-	if (mirror->hmm->dead || !mm)
-		return false;
-
-	return true;
-}
-
 /*
  * Please see Documentation/vm/hmm.rst for how to use the range API.
  */
diff --git a/mm/hmm.c b/mm/hmm.c
index dc30edad9a8a02..f67ba32983d9f1 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -80,7 +80,6 @@ static struct hmm *hmm_get_or_create(struct mm_struct *mm)
 	mutex_init(&hmm->lock);
 	kref_init(&hmm->kref);
 	hmm->notifiers = 0;
-	hmm->dead = false;
 	hmm->mm = mm;
 
 	hmm->mmu_notifier.ops = &hmm_mmu_notifier_ops;
@@ -124,20 +123,17 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 {
 	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
 	struct hmm_mirror *mirror;
-	struct hmm_range *range;
 
 	/* hmm is in progress to free */
 	if (!kref_get_unless_zero(&hmm->kref))
 		return;
 
-	/* Report this HMM as dying. */
-	hmm->dead = true;
-
-	/* Wake-up everyone waiting on any range. */
 	mutex_lock(&hmm->lock);
-	list_for_each_entry(range, &hmm->ranges, list)
-		range->valid = false;
-	wake_up_all(&hmm->wq);
+	/*
+	 * Since hmm_range_register() holds the mmget() lock hmm_release() is
+	 * prevented as long as a range exists.
+	 */
+	WARN_ON(!list_empty(&hmm->ranges));
 	mutex_unlock(&hmm->lock);
 
 	down_write(&hmm->mirrors_sem);
@@ -909,8 +905,8 @@ int hmm_range_register(struct hmm_range *range,
 	range->start = start;
 	range->end = end;
 
-	/* Check if hmm_mm_destroy() was call. */
-	if (hmm->mm == NULL || hmm->dead)
+	/* Prevent hmm_release() from running while the range is valid */
+	if (!mmget_not_zero(hmm->mm))
 		return -EFAULT;
 
 	range->hmm = hmm;
@@ -955,6 +951,7 @@ void hmm_range_unregister(struct hmm_range *range)
 
 	/* Drop reference taken by hmm_range_register() */
 	range->valid = false;
+	mmput(hmm->mm);
 	hmm_put(hmm);
 	range->hmm = NULL;
 }
@@ -982,10 +979,7 @@ long hmm_range_snapshot(struct hmm_range *range)
 	struct vm_area_struct *vma;
 	struct mm_walk mm_walk;
 
-	/* Check if hmm_mm_destroy() was call. */
-	if (hmm->mm == NULL || hmm->dead)
-		return -EFAULT;
-
+	lockdep_assert_held(&hmm->mm->mmap_sem);
 	do {
 		/* If range is no longer valid force retry. */
 		if (!range->valid)
@@ -1080,9 +1074,7 @@ long hmm_range_fault(struct hmm_range *range, bool block)
 	struct mm_walk mm_walk;
 	int ret;
 
-	/* Check if hmm_mm_destroy() was call. */
-	if (hmm->mm == NULL || hmm->dead)
-		return -EFAULT;
+	lockdep_assert_held(&hmm->mm->mmap_sem);
 
 	do {
 		/* If range is no longer valid force retry. */
-- 
2.21.0



* [PATCH v2 hmm 07/11] mm/hmm: Use lockdep instead of comments
  2019-06-06 18:44 [PATCH v2 hmm 00/11] Various revisions from a locking/code review Jason Gunthorpe
                   ` (5 preceding siblings ...)
  2019-06-06 18:44 ` [PATCH v2 hmm 06/11] mm/hmm: Hold on to the mmget for the lifetime of the range Jason Gunthorpe
@ 2019-06-06 18:44 ` Jason Gunthorpe
  2019-06-07  3:19   ` John Hubbard
                     ` (2 more replies)
  2019-06-06 18:44 ` [PATCH v2 hmm 08/11] mm/hmm: Remove racy protection against double-unregistration Jason Gunthorpe
                   ` (5 subsequent siblings)
  12 siblings, 3 replies; 79+ messages in thread
From: Jason Gunthorpe @ 2019-06-06 18:44 UTC (permalink / raw)
  To: Jerome Glisse, Ralph Campbell, John Hubbard, Felix.Kuehling
  Cc: linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx,
	Jason Gunthorpe

From: Jason Gunthorpe <jgg@mellanox.com>

So we can check locking at runtime.
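
That is, as a before/after sketch of the hunk below:

	/* Before: a comment the kernel cannot enforce */
	/* THE mm->mmap_sem MUST BE HELD IN WRITE MODE ! */

	/* After: lockdep complains at runtime if a caller gets it wrong */
	lockdep_assert_held_exclusive(&mm->mmap_sem);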

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
---
v2
- Fix missing & in lockdeps (Jason)
---
 mm/hmm.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index f67ba32983d9f1..c702cd72651b53 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -254,11 +254,11 @@ static const struct mmu_notifier_ops hmm_mmu_notifier_ops = {
  *
  * To start mirroring a process address space, the device driver must register
  * an HMM mirror struct.
- *
- * THE mm->mmap_sem MUST BE HELD IN WRITE MODE !
  */
 int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm)
 {
+	lockdep_assert_held_exclusive(&mm->mmap_sem);
+
 	/* Sanity check */
 	if (!mm || !mirror || !mirror->ops)
 		return -EINVAL;
-- 
2.21.0



* [PATCH v2 hmm 08/11] mm/hmm: Remove racy protection against double-unregistration
  2019-06-06 18:44 [PATCH v2 hmm 00/11] Various revisions from a locking/code review Jason Gunthorpe
                   ` (6 preceding siblings ...)
  2019-06-06 18:44 ` [PATCH v2 hmm 07/11] mm/hmm: Use lockdep instead of comments Jason Gunthorpe
@ 2019-06-06 18:44 ` Jason Gunthorpe
  2019-06-07  3:29   ` John Hubbard
  2019-06-07 20:33   ` Ralph Campbell
  2019-06-06 18:44 ` [PATCH v2 hmm 09/11] mm/hmm: Poison hmm_range during unregister Jason Gunthorpe
                   ` (4 subsequent siblings)
  12 siblings, 2 replies; 79+ messages in thread
From: Jason Gunthorpe @ 2019-06-06 18:44 UTC (permalink / raw)
  To: Jerome Glisse, Ralph Campbell, John Hubbard, Felix.Kuehling
  Cc: linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx,
	Jason Gunthorpe

From: Jason Gunthorpe <jgg@mellanox.com>

No other register/unregister kernel API attempts to provide this kind of
protection as it is inherently racy, so just drop it.

Callers should provide their own protection; it appears nouveau already
does, but just in case leave a debugging POISON behind.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
---
 mm/hmm.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index c702cd72651b53..6802de7080d172 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -284,18 +284,13 @@ EXPORT_SYMBOL(hmm_mirror_register);
  */
 void hmm_mirror_unregister(struct hmm_mirror *mirror)
 {
-	struct hmm *hmm = READ_ONCE(mirror->hmm);
-
-	if (hmm == NULL)
-		return;
+	struct hmm *hmm = mirror->hmm;
 
 	down_write(&hmm->mirrors_sem);
 	list_del_init(&mirror->list);
-	/* To protect us against double unregister ... */
-	mirror->hmm = NULL;
 	up_write(&hmm->mirrors_sem);
-
 	hmm_put(hmm);
+	memset(&mirror->hmm, POISON_INUSE, sizeof(mirror->hmm));
 }
 EXPORT_SYMBOL(hmm_mirror_unregister);
 
-- 
2.21.0



* [PATCH v2 hmm 09/11] mm/hmm: Poison hmm_range during unregister
  2019-06-06 18:44 [PATCH v2 hmm 00/11] Various revisions from a locking/code review Jason Gunthorpe
                   ` (7 preceding siblings ...)
  2019-06-06 18:44 ` [PATCH v2 hmm 08/11] mm/hmm: Remove racy protection against double-unregistration Jason Gunthorpe
@ 2019-06-06 18:44 ` Jason Gunthorpe
  2019-06-07  3:37   ` John Hubbard
                     ` (2 more replies)
  2019-06-06 18:44 ` [PATCH v2 hmm 10/11] mm/hmm: Do not use list*_rcu() for hmm->ranges Jason Gunthorpe
                   ` (3 subsequent siblings)
  12 siblings, 3 replies; 79+ messages in thread
From: Jason Gunthorpe @ 2019-06-06 18:44 UTC (permalink / raw)
  To: Jerome Glisse, Ralph Campbell, John Hubbard, Felix.Kuehling
  Cc: linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx,
	Jason Gunthorpe

From: Jason Gunthorpe <jgg@mellanox.com>

Trying to misuse a range outside its lifetime is a kernel bug. Use WARN_ON
and poison bytes to detect this condition.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
---
v2
- Keep range start/end valid after unregistration (Jerome)
---
 mm/hmm.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index 6802de7080d172..c2fecb3ecb11e1 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -937,7 +937,7 @@ void hmm_range_unregister(struct hmm_range *range)
 	struct hmm *hmm = range->hmm;
 
 	/* Sanity check this really should not happen. */
-	if (hmm == NULL || range->end <= range->start)
+	if (WARN_ON(range->end <= range->start))
 		return;
 
 	mutex_lock(&hmm->lock);
@@ -948,7 +948,10 @@ void hmm_range_unregister(struct hmm_range *range)
 	range->valid = false;
 	mmput(hmm->mm);
 	hmm_put(hmm);
-	range->hmm = NULL;
+
+	/* The range is now invalid, leave it poisoned. */
+	range->valid = false;
+	memset(&range->hmm, POISON_INUSE, sizeof(range->hmm));
 }
 EXPORT_SYMBOL(hmm_range_unregister);
 
-- 
2.21.0



* [PATCH v2 hmm 10/11] mm/hmm: Do not use list*_rcu() for hmm->ranges
  2019-06-06 18:44 [PATCH v2 hmm 00/11] Various revisions from a locking/code review Jason Gunthorpe
                   ` (8 preceding siblings ...)
  2019-06-06 18:44 ` [PATCH v2 hmm 09/11] mm/hmm: Poison hmm_range during unregister Jason Gunthorpe
@ 2019-06-06 18:44 ` Jason Gunthorpe
  2019-06-07  3:40   ` John Hubbard
                     ` (3 more replies)
  2019-06-06 18:44 ` [PATCH v2 hmm 11/11] mm/hmm: Remove confusing comment and logic from hmm_release Jason Gunthorpe
                   ` (2 subsequent siblings)
  12 siblings, 4 replies; 79+ messages in thread
From: Jason Gunthorpe @ 2019-06-06 18:44 UTC (permalink / raw)
  To: Jerome Glisse, Ralph Campbell, John Hubbard, Felix.Kuehling
  Cc: linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx,
	Jason Gunthorpe

From: Jason Gunthorpe <jgg@mellanox.com>

This list is always read and written while holding hmm->lock so there is
no need for the confusing _rcu annotations.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
---
 mm/hmm.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index c2fecb3ecb11e1..709d138dd49027 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -911,7 +911,7 @@ int hmm_range_register(struct hmm_range *range,
 	mutex_lock(&hmm->lock);
 
 	range->hmm = hmm;
-	list_add_rcu(&range->list, &hmm->ranges);
+	list_add(&range->list, &hmm->ranges);
 
 	/*
 	 * If there are any concurrent notifiers we have to wait for them for
@@ -941,7 +941,7 @@ void hmm_range_unregister(struct hmm_range *range)
 		return;
 
 	mutex_lock(&hmm->lock);
-	list_del_rcu(&range->list);
+	list_del(&range->list);
 	mutex_unlock(&hmm->lock);
 
 	/* Drop reference taken by hmm_range_register() */
-- 
2.21.0



* [PATCH v2 hmm 11/11] mm/hmm: Remove confusing comment and logic from hmm_release
  2019-06-06 18:44 [PATCH v2 hmm 00/11] Various revisions from a locking/code review Jason Gunthorpe
                   ` (9 preceding siblings ...)
  2019-06-06 18:44 ` [PATCH v2 hmm 10/11] mm/hmm: Do not use list*_rcu() for hmm->ranges Jason Gunthorpe
@ 2019-06-06 18:44 ` Jason Gunthorpe
  2019-06-07  3:47   ` John Hubbard
  2019-06-07 21:37   ` Ralph Campbell
  2019-06-07 16:05 ` [PATCH v2 12/11] mm/hmm: Fix error flows in hmm_invalidate_range_start Jason Gunthorpe
  2019-06-11 19:48 ` [PATCH v2 hmm 00/11] Various revisions from a locking/code review Jason Gunthorpe
  12 siblings, 2 replies; 79+ messages in thread
From: Jason Gunthorpe @ 2019-06-06 18:44 UTC (permalink / raw)
  To: Jerome Glisse, Ralph Campbell, John Hubbard, Felix.Kuehling
  Cc: linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx,
	Jason Gunthorpe

From: Jason Gunthorpe <jgg@mellanox.com>

hmm_release() is called exactly once per hmm. ops->release() cannot
accidentally trigger any action that would recurse back onto
hmm->mirrors_sem.

This fixes a use-after-free race of the form:

       CPU0                                   CPU1
                                           hmm_release()
                                             up_write(&hmm->mirrors_sem);
 hmm_mirror_unregister(mirror)
  down_write(&hmm->mirrors_sem);
  up_write(&hmm->mirrors_sem);
  kfree(mirror)
                                             mirror->ops->release(mirror)

The only user we have today for ops->release is an empty function, so this
is unambiguously safe.

As a consequence of plugging this race drivers are not allowed to
register/unregister mirrors from within a release op.
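
A hedged sketch of what that constraint means for a driver's ->release()
(all of the my_* names are hypothetical):

struct my_svmm {
	struct hmm_mirror mirror;
	struct work_struct teardown_work;
	bool mm_is_dead;
};

static void my_mirror_release(struct hmm_mirror *mirror)
{
	struct my_svmm *svmm = container_of(mirror, struct my_svmm, mirror);

	/*
	 * hmm_mirror_unregister() (or registering a new mirror) must not be
	 * called from this context; mark things dead and defer the real
	 * teardown to another thread instead.
	 */
	WRITE_ONCE(svmm->mm_is_dead, true);
	schedule_work(&svmm->teardown_work);
}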

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 mm/hmm.c | 28 +++++++++-------------------
 1 file changed, 9 insertions(+), 19 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index 709d138dd49027..3a45dd3d778248 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -136,26 +136,16 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 	WARN_ON(!list_empty(&hmm->ranges));
 	mutex_unlock(&hmm->lock);
 
-	down_write(&hmm->mirrors_sem);
-	mirror = list_first_entry_or_null(&hmm->mirrors, struct hmm_mirror,
-					  list);
-	while (mirror) {
-		list_del_init(&mirror->list);
-		if (mirror->ops->release) {
-			/*
-			 * Drop mirrors_sem so the release callback can wait
-			 * on any pending work that might itself trigger a
-			 * mmu_notifier callback and thus would deadlock with
-			 * us.
-			 */
-			up_write(&hmm->mirrors_sem);
+	down_read(&hmm->mirrors_sem);
+	list_for_each_entry(mirror, &hmm->mirrors, list) {
+		/*
+		 * Note: The driver is not allowed to trigger
+		 * hmm_mirror_unregister() from this thread.
+		 */
+		if (mirror->ops->release)
 			mirror->ops->release(mirror);
-			down_write(&hmm->mirrors_sem);
-		}
-		mirror = list_first_entry_or_null(&hmm->mirrors,
-						  struct hmm_mirror, list);
 	}
-	up_write(&hmm->mirrors_sem);
+	up_read(&hmm->mirrors_sem);
 
 	hmm_put(hmm);
 }
@@ -287,7 +277,7 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror)
 	struct hmm *hmm = mirror->hmm;
 
 	down_write(&hmm->mirrors_sem);
-	list_del_init(&mirror->list);
+	list_del(&mirror->list);
 	up_write(&hmm->mirrors_sem);
 	hmm_put(hmm);
 	memset(&mirror->hmm, POISON_INUSE, sizeof(mirror->hmm));
-- 
2.21.0



* Re: [PATCH v2 hmm 01/11] mm/hmm: fix use after free with struct hmm in the mmu notifiers
  2019-06-06 18:44 ` [PATCH v2 hmm 01/11] mm/hmm: fix use after free with struct hmm in the mmu notifiers Jason Gunthorpe
@ 2019-06-07  2:29   ` John Hubbard
  2019-06-07 12:34     ` Jason Gunthorpe
  2019-06-07 18:12   ` Ralph Campbell
  2019-06-08  8:49   ` Christoph Hellwig
  2 siblings, 1 reply; 79+ messages in thread
From: John Hubbard @ 2019-06-07  2:29 UTC (permalink / raw)
  To: Jason Gunthorpe, Jerome Glisse, Ralph Campbell, Felix.Kuehling
  Cc: linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx,
	Jason Gunthorpe

On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
...
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 8e7403f081f44a..547002f56a163d 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
...
> @@ -125,7 +130,7 @@ static void hmm_free(struct kref *kref)
>  		mm->hmm = NULL;
>  	spin_unlock(&mm->page_table_lock);
>  
> -	kfree(hmm);
> +	mmu_notifier_call_srcu(&hmm->rcu, hmm_free_rcu);


It occurred to me to wonder if it is best to use the MMU notifier's
instance of srcu, instead of creating a separate instance for HMM.
But this really does seem appropriate, since we are after all using
this to synchronize with MMU notifier callbacks. So, fine.


>  }
>  
>  static inline void hmm_put(struct hmm *hmm)
> @@ -153,10 +158,14 @@ void hmm_mm_destroy(struct mm_struct *mm)
>  
>  static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>  {
> -	struct hmm *hmm = mm_get_hmm(mm);
> +	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
>  	struct hmm_mirror *mirror;
>  	struct hmm_range *range;
>  
> +	/* hmm is in progress to free */

Well, sometimes, yes. :)

Maybe this wording is clearer (if we need any comment at all):

	/* Bail out if hmm is in the process of being freed */

> +	if (!kref_get_unless_zero(&hmm->kref))
> +		return;
> +
>  	/* Report this HMM as dying. */
>  	hmm->dead = true;
>  
> @@ -194,13 +203,15 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>  static int hmm_invalidate_range_start(struct mmu_notifier *mn,
>  			const struct mmu_notifier_range *nrange)
>  {
> -	struct hmm *hmm = mm_get_hmm(nrange->mm);
> +	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
>  	struct hmm_mirror *mirror;
>  	struct hmm_update update;
>  	struct hmm_range *range;
>  	int ret = 0;
>  
> -	VM_BUG_ON(!hmm);
> +	/* hmm is in progress to free */

Same here.

> +	if (!kref_get_unless_zero(&hmm->kref))
> +		return 0;
>  
>  	update.start = nrange->start;
>  	update.end = nrange->end;
> @@ -245,9 +256,11 @@ static int hmm_invalidate_range_start(struct mmu_notifier *mn,
>  static void hmm_invalidate_range_end(struct mmu_notifier *mn,
>  			const struct mmu_notifier_range *nrange)
>  {
> -	struct hmm *hmm = mm_get_hmm(nrange->mm);
> +	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
>  
> -	VM_BUG_ON(!hmm);
> +	/* hmm is in progress to free */

And here.

> +	if (!kref_get_unless_zero(&hmm->kref))
> +		return;
>  
>  	mutex_lock(&hmm->lock);
>  	hmm->notifiers--;
> 

Elegant fix. Regardless of the above chatter I added, you can add:

    Reviewed-by: John Hubbard <jhubbard@nvidia.com>


thanks,
-- 
John Hubbard
NVIDIA



* Re: [PATCH v2 hmm 02/11] mm/hmm: Use hmm_mirror not mm as an argument for hmm_range_register
  2019-06-06 18:44 ` [PATCH v2 hmm 02/11] mm/hmm: Use hmm_mirror not mm as an argument for hmm_range_register Jason Gunthorpe
@ 2019-06-07  2:36   ` John Hubbard
  2019-06-07 18:24   ` Ralph Campbell
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 79+ messages in thread
From: John Hubbard @ 2019-06-07  2:36 UTC (permalink / raw)
  To: Jason Gunthorpe, Jerome Glisse, Ralph Campbell, Felix.Kuehling
  Cc: linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx,
	Jason Gunthorpe

On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> Ralph observes that hmm_range_register() can only be called by a driver
> while a mirror is registered. Make this clear in the API by passing in the
> mirror structure as a parameter.
> 
> This also simplifies understanding the lifetime model for struct hmm, as
> the hmm pointer must be valid as part of a registered mirror so all we
> need in hmm_register_range() is a simple kref_get.
> 
> Suggested-by: Ralph Campbell <rcampbell@nvidia.com>
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> ---
> v2
> - Include the oneline patch to nouveau_svm.c
> ---
>  drivers/gpu/drm/nouveau/nouveau_svm.c |  2 +-
>  include/linux/hmm.h                   |  7 ++++---
>  mm/hmm.c                              | 15 ++++++---------
>  3 files changed, 11 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
> index 93ed43c413f0bb..8c92374afcf227 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_svm.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
> @@ -649,7 +649,7 @@ nouveau_svm_fault(struct nvif_notify *notify)
>  		range.values = nouveau_svm_pfn_values;
>  		range.pfn_shift = NVIF_VMM_PFNMAP_V0_ADDR_SHIFT;
>  again:
> -		ret = hmm_vma_fault(&range, true);
> +		ret = hmm_vma_fault(&svmm->mirror, &range, true);
>  		if (ret == 0) {
>  			mutex_lock(&svmm->mutex);
>  			if (!hmm_vma_range_done(&range)) {
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 688c5ca7068795..2d519797cb134a 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -505,7 +505,7 @@ static inline bool hmm_mirror_mm_is_alive(struct hmm_mirror *mirror)
>   * Please see Documentation/vm/hmm.rst for how to use the range API.
>   */
>  int hmm_range_register(struct hmm_range *range,
> -		       struct mm_struct *mm,
> +		       struct hmm_mirror *mirror,
>  		       unsigned long start,
>  		       unsigned long end,
>  		       unsigned page_shift);
> @@ -541,7 +541,8 @@ static inline bool hmm_vma_range_done(struct hmm_range *range)
>  }
>  
>  /* This is a temporary helper to avoid merge conflict between trees. */
> -static inline int hmm_vma_fault(struct hmm_range *range, bool block)
> +static inline int hmm_vma_fault(struct hmm_mirror *mirror,
> +				struct hmm_range *range, bool block)
>  {
>  	long ret;
>  
> @@ -554,7 +555,7 @@ static inline int hmm_vma_fault(struct hmm_range *range, bool block)
>  	range->default_flags = 0;
>  	range->pfn_flags_mask = -1UL;
>  
> -	ret = hmm_range_register(range, range->vma->vm_mm,
> +	ret = hmm_range_register(range, mirror,
>  				 range->start, range->end,
>  				 PAGE_SHIFT);
>  	if (ret)
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 547002f56a163d..8796447299023c 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -925,13 +925,13 @@ static void hmm_pfns_clear(struct hmm_range *range,
>   * Track updates to the CPU page table see include/linux/hmm.h
>   */
>  int hmm_range_register(struct hmm_range *range,
> -		       struct mm_struct *mm,
> +		       struct hmm_mirror *mirror,
>  		       unsigned long start,
>  		       unsigned long end,
>  		       unsigned page_shift)
>  {
>  	unsigned long mask = ((1UL << page_shift) - 1UL);
> -	struct hmm *hmm;
> +	struct hmm *hmm = mirror->hmm;
>  
>  	range->valid = false;
>  	range->hmm = NULL;
> @@ -945,15 +945,12 @@ int hmm_range_register(struct hmm_range *range,
>  	range->start = start;
>  	range->end = end;
>  
> -	hmm = hmm_get_or_create(mm);
> -	if (!hmm)
> -		return -EFAULT;
> -
>  	/* Check if hmm_mm_destroy() was call. */
> -	if (hmm->mm == NULL || hmm->dead) {
> -		hmm_put(hmm);
> +	if (hmm->mm == NULL || hmm->dead)
>  		return -EFAULT;
> -	}
> +
> +	range->hmm = hmm;
> +	kref_get(&hmm->kref);
>  
>  	/* Initialize range to track CPU page table updates. */
>  	mutex_lock(&hmm->lock);
> 

I'm not a qualified Nouveau reviewer, but this looks obviously correct,
so:

    Reviewed-by: John Hubbard <jhubbard@nvidia.com>


thanks,
-- 
John Hubbard
NVIDIA



* Re: [PATCH v2 hmm 03/11] mm/hmm: Hold a mmgrab from hmm to mm
  2019-06-06 18:44 ` [PATCH v2 hmm 03/11] mm/hmm: Hold a mmgrab from hmm to mm Jason Gunthorpe
@ 2019-06-07  2:44   ` John Hubbard
  2019-06-07 12:36     ` Jason Gunthorpe
  2019-06-07 18:41   ` Ralph Campbell
  2019-06-07 22:38   ` Ira Weiny
  2 siblings, 1 reply; 79+ messages in thread
From: John Hubbard @ 2019-06-07  2:44 UTC (permalink / raw)
  To: Jason Gunthorpe, Jerome Glisse, Ralph Campbell, Felix.Kuehling
  Cc: linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx,
	Jason Gunthorpe

On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> So long a a struct hmm pointer exists, so should the struct mm it is
> linked too. Hold the mmgrab() as soon as a hmm is created, and mmdrop() it
> once the hmm refcount goes to zero.
> 
> Since mmdrop() (ie a 0 kref on struct mm) is now impossible with a !NULL
> mm->hmm delete the hmm_hmm_destroy().
> 
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> ---
> v2:
>  - Fix error unwind paths in hmm_get_or_create (Jerome/Jason)
> ---
>  include/linux/hmm.h |  3 ---
>  kernel/fork.c       |  1 -
>  mm/hmm.c            | 22 ++++------------------
>  3 files changed, 4 insertions(+), 22 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 2d519797cb134a..4ee3acabe5ed22 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -586,14 +586,11 @@ static inline int hmm_vma_fault(struct hmm_mirror *mirror,
>  }
>  
>  /* Below are for HMM internal use only! Not to be used by device driver! */
> -void hmm_mm_destroy(struct mm_struct *mm);
> -
>  static inline void hmm_mm_init(struct mm_struct *mm)
>  {
>  	mm->hmm = NULL;
>  }
>  #else /* IS_ENABLED(CONFIG_HMM_MIRROR) */
> -static inline void hmm_mm_destroy(struct mm_struct *mm) {}
>  static inline void hmm_mm_init(struct mm_struct *mm) {}
>  #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
>  
> diff --git a/kernel/fork.c b/kernel/fork.c
> index b2b87d450b80b5..588c768ae72451 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -673,7 +673,6 @@ void __mmdrop(struct mm_struct *mm)
>  	WARN_ON_ONCE(mm == current->active_mm);
>  	mm_free_pgd(mm);
>  	destroy_context(mm);
> -	hmm_mm_destroy(mm);


This is particularly welcome, not to have an "HMM is special" case
in such a core part of process/mm code. 


>  	mmu_notifier_mm_destroy(mm);
>  	check_mm(mm);
>  	put_user_ns(mm->user_ns);
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 8796447299023c..cc7c26fda3300e 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -29,6 +29,7 @@
>  #include <linux/swapops.h>
>  #include <linux/hugetlb.h>
>  #include <linux/memremap.h>
> +#include <linux/sched/mm.h>
>  #include <linux/jump_label.h>
>  #include <linux/dma-mapping.h>
>  #include <linux/mmu_notifier.h>
> @@ -82,6 +83,7 @@ static struct hmm *hmm_get_or_create(struct mm_struct *mm)
>  	hmm->notifiers = 0;
>  	hmm->dead = false;
>  	hmm->mm = mm;
> +	mmgrab(hmm->mm);
>  
>  	spin_lock(&mm->page_table_lock);
>  	if (!mm->hmm)
> @@ -109,6 +111,7 @@ static struct hmm *hmm_get_or_create(struct mm_struct *mm)
>  		mm->hmm = NULL;
>  	spin_unlock(&mm->page_table_lock);
>  error:
> +	mmdrop(hmm->mm);
>  	kfree(hmm);
>  	return NULL;
>  }
> @@ -130,6 +133,7 @@ static void hmm_free(struct kref *kref)
>  		mm->hmm = NULL;
>  	spin_unlock(&mm->page_table_lock);
>  
> +	mmdrop(hmm->mm);
>  	mmu_notifier_call_srcu(&hmm->rcu, hmm_free_rcu);
>  }
>  
> @@ -138,24 +142,6 @@ static inline void hmm_put(struct hmm *hmm)
>  	kref_put(&hmm->kref, hmm_free);
>  }
>  
> -void hmm_mm_destroy(struct mm_struct *mm)
> -{
> -	struct hmm *hmm;
> -
> -	spin_lock(&mm->page_table_lock);
> -	hmm = mm_get_hmm(mm);
> -	mm->hmm = NULL;
> -	if (hmm) {
> -		hmm->mm = NULL;
> -		hmm->dead = true;
> -		spin_unlock(&mm->page_table_lock);
> -		hmm_put(hmm);
> -		return;
> -	}
> -
> -	spin_unlock(&mm->page_table_lock);
> -}
> -
>  static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>  {
>  	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
> 

Failed to find any problems with this. :)

    Reviewed-by: John Hubbard <jhubbard@nvidia.com>

thanks,
-- 
John Hubbard
NVIDIA



* Re: [PATCH v2 hmm 04/11] mm/hmm: Simplify hmm_get_or_create and make it reliable
  2019-06-06 18:44 ` [PATCH v2 hmm 04/11] mm/hmm: Simplify hmm_get_or_create and make it reliable Jason Gunthorpe
@ 2019-06-07  2:54   ` John Hubbard
  2019-06-07 18:52   ` Ralph Campbell
  2019-06-07 22:44   ` Ira Weiny
  2 siblings, 0 replies; 79+ messages in thread
From: John Hubbard @ 2019-06-07  2:54 UTC (permalink / raw)
  To: Jason Gunthorpe, Jerome Glisse, Ralph Campbell, Felix.Kuehling
  Cc: linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx,
	Jason Gunthorpe

On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> As coded this function can false-fail in various racy situations. Make it
> reliable by running only under the write side of the mmap_sem and avoiding
> the false-failing compare/exchange pattern.
> 
> Also make the locking very easy to understand by only ever reading or
> writing mm->hmm while holding the write side of the mmap_sem.
> 
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> ---
> v2:
> - Fix error unwind of mmgrab (Jerome)
> - Use hmm local instead of 2nd container_of (Jerome)
> ---
>  mm/hmm.c | 80 ++++++++++++++++++++------------------------------------
>  1 file changed, 29 insertions(+), 51 deletions(-)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index cc7c26fda3300e..dc30edad9a8a02 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -40,16 +40,6 @@
>  #if IS_ENABLED(CONFIG_HMM_MIRROR)
>  static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
>  
> -static inline struct hmm *mm_get_hmm(struct mm_struct *mm)
> -{
> -	struct hmm *hmm = READ_ONCE(mm->hmm);
> -
> -	if (hmm && kref_get_unless_zero(&hmm->kref))
> -		return hmm;
> -
> -	return NULL;
> -}
> -
>  /**
>   * hmm_get_or_create - register HMM against an mm (HMM internal)
>   *
> @@ -64,11 +54,20 @@ static inline struct hmm *mm_get_hmm(struct mm_struct *mm)
>   */
>  static struct hmm *hmm_get_or_create(struct mm_struct *mm)
>  {
> -	struct hmm *hmm = mm_get_hmm(mm);
> -	bool cleanup = false;
> +	struct hmm *hmm;
>  
> -	if (hmm)
> -		return hmm;
> +	lockdep_assert_held_exclusive(&mm->mmap_sem);
> +
> +	if (mm->hmm) {
> +		if (kref_get_unless_zero(&mm->hmm->kref))
> +			return mm->hmm;
> +		/*
> +		 * The hmm is being freed by some other CPU and is pending a
> +		 * RCU grace period, but this CPU can NULL now it since we
> +		 * have the mmap_sem.
> +		 */
> +		mm->hmm = NULL;
> +	}
>  
>  	hmm = kmalloc(sizeof(*hmm), GFP_KERNEL);
>  	if (!hmm)
> @@ -83,57 +82,36 @@ static struct hmm *hmm_get_or_create(struct mm_struct *mm)
>  	hmm->notifiers = 0;
>  	hmm->dead = false;
>  	hmm->mm = mm;
> -	mmgrab(hmm->mm);
> -
> -	spin_lock(&mm->page_table_lock);
> -	if (!mm->hmm)
> -		mm->hmm = hmm;
> -	else
> -		cleanup = true;
> -	spin_unlock(&mm->page_table_lock);
>  
> -	if (cleanup)
> -		goto error;
> -
> -	/*
> -	 * We should only get here if hold the mmap_sem in write mode ie on
> -	 * registration of first mirror through hmm_mirror_register()
> -	 */
>  	hmm->mmu_notifier.ops = &hmm_mmu_notifier_ops;
> -	if (__mmu_notifier_register(&hmm->mmu_notifier, mm))
> -		goto error_mm;
> +	if (__mmu_notifier_register(&hmm->mmu_notifier, mm)) {
> +		kfree(hmm);
> +		return NULL;
> +	}
>  
> +	mmgrab(hmm->mm);
> +	mm->hmm = hmm;
>  	return hmm;
> -
> -error_mm:
> -	spin_lock(&mm->page_table_lock);
> -	if (mm->hmm == hmm)
> -		mm->hmm = NULL;
> -	spin_unlock(&mm->page_table_lock);
> -error:
> -	mmdrop(hmm->mm);
> -	kfree(hmm);
> -	return NULL;
>  }
>  
>  static void hmm_free_rcu(struct rcu_head *rcu)
>  {
> -	kfree(container_of(rcu, struct hmm, rcu));
> +	struct hmm *hmm = container_of(rcu, struct hmm, rcu);
> +
> +	down_write(&hmm->mm->mmap_sem);
> +	if (hmm->mm->hmm == hmm)
> +		hmm->mm->hmm = NULL;
> +	up_write(&hmm->mm->mmap_sem);
> +	mmdrop(hmm->mm);
> +
> +	kfree(hmm);
>  }
>  
>  static void hmm_free(struct kref *kref)
>  {
>  	struct hmm *hmm = container_of(kref, struct hmm, kref);
> -	struct mm_struct *mm = hmm->mm;
> -
> -	mmu_notifier_unregister_no_release(&hmm->mmu_notifier, mm);
>  
> -	spin_lock(&mm->page_table_lock);
> -	if (mm->hmm == hmm)
> -		mm->hmm = NULL;
> -	spin_unlock(&mm->page_table_lock);
> -
> -	mmdrop(hmm->mm);
> +	mmu_notifier_unregister_no_release(&hmm->mmu_notifier, hmm->mm);
>  	mmu_notifier_call_srcu(&hmm->rcu, hmm_free_rcu);
>  }
>  
> 

Yes.

    Reviewed-by: John Hubbard <jhubbard@nvidia.com>


thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 05/11] mm/hmm: Remove duplicate condition test before wait_event_timeout
  2019-06-06 18:44 ` [PATCH v2 hmm 05/11] mm/hmm: Remove duplicate condition test before wait_event_timeout Jason Gunthorpe
@ 2019-06-07  3:06   ` John Hubbard
  2019-06-07 12:47     ` Jason Gunthorpe
  2019-06-07 13:31     ` [PATCH v3 " Jason Gunthorpe
  2019-06-07 19:01   ` [PATCH v2 " Ralph Campbell
  1 sibling, 2 replies; 79+ messages in thread
From: John Hubbard @ 2019-06-07  3:06 UTC (permalink / raw)
  To: Jason Gunthorpe, Jerome Glisse, Ralph Campbell, Felix.Kuehling
  Cc: linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx,
	Jason Gunthorpe

On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> The wait_event_timeout macro already tests the condition as its first
> action, so there is no reason to open code another version of this, all
> that does is skip the might_sleep() debugging in common cases, which is
> not helpful.
> 
> Further, based on prior patches, we can no simplify the required condition

                                          "now simplify"

> test:
>  - If range is valid memory then so is range->hmm
>  - If hmm_release() has run then range->valid is set to false
>    at the same time as dead, so no reason to check both.
>  - A valid hmm has a valid hmm->mm.
> 
> Also, add the READ_ONCE for range->valid as there is no lock held here.
> 
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> ---
>  include/linux/hmm.h | 12 ++----------
>  1 file changed, 2 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 4ee3acabe5ed22..2ab35b40992b24 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -218,17 +218,9 @@ static inline unsigned long hmm_range_page_size(const struct hmm_range *range)
>  static inline bool hmm_range_wait_until_valid(struct hmm_range *range,
>  					      unsigned long timeout)
>  {
> -	/* Check if mm is dead ? */
> -	if (range->hmm == NULL || range->hmm->dead || range->hmm->mm == NULL) {
> -		range->valid = false;
> -		return false;
> -	}
> -	if (range->valid)
> -		return true;
> -	wait_event_timeout(range->hmm->wq, range->valid || range->hmm->dead,
> +	wait_event_timeout(range->hmm->wq, range->valid,
>  			   msecs_to_jiffies(timeout));
> -	/* Return current valid status just in case we get lucky */
> -	return range->valid;
> +	return READ_ONCE(range->valid);

Just to ensure that I actually understand the model: I'm assuming that the 
READ_ONCE is there solely to ensure that range->valid is read *after* the
wait_event_timeout() returns. Is that correct?


>  }
>  
>  /*
> 

In any case, it looks good, so:

    Reviewed-by: John Hubbard <jhubbard@nvidia.com>

thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 06/11] mm/hmm: Hold on to the mmget for the lifetime of the range
  2019-06-06 18:44 ` [PATCH v2 hmm 06/11] mm/hmm: Hold on to the mmget for the lifetime of the range Jason Gunthorpe
@ 2019-06-07  3:15   ` John Hubbard
  2019-06-07 20:29   ` Ralph Campbell
  1 sibling, 0 replies; 79+ messages in thread
From: John Hubbard @ 2019-06-07  3:15 UTC (permalink / raw)
  To: Jason Gunthorpe, Jerome Glisse, Ralph Campbell, Felix.Kuehling
  Cc: linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx,
	Jason Gunthorpe

On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> Range functions like hmm_range_snapshot() and hmm_range_fault() call
> find_vma, which requires holding the mmget() and the mmap_sem for the mm.
> 
> Make this simpler for the callers by holding the mmget() inside the range
> for the lifetime of the range. Other functions that accept a range should
> only be called if the range is registered.
> 
> This has the side effect of directly preventing hmm_release() from
> happening while a range is registered. That means range->dead cannot be
> false during the lifetime of the range, so remove dead and
> hmm_mirror_mm_is_alive() entirely.
> 
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> ---
> v2:
>  - Use Jerome's idea of just holding the mmget() for the range lifetime,
>    rework the patch to use that as a simplification to remove dead in
>    one step
> ---
>  include/linux/hmm.h | 26 --------------------------
>  mm/hmm.c            | 28 ++++++++++------------------
>  2 files changed, 10 insertions(+), 44 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 2ab35b40992b24..0e20566802967a 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -91,7 +91,6 @@
>   * @mirrors_sem: read/write semaphore protecting the mirrors list
>   * @wq: wait queue for user waiting on a range invalidation
>   * @notifiers: count of active mmu notifiers
> - * @dead: is the mm dead ?
>   */
>  struct hmm {
>  	struct mm_struct	*mm;
> @@ -104,7 +103,6 @@ struct hmm {
>  	wait_queue_head_t	wq;
>  	struct rcu_head		rcu;
>  	long			notifiers;
> -	bool			dead;
>  };
>  
>  /*
> @@ -469,30 +467,6 @@ struct hmm_mirror {
>  int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
>  void hmm_mirror_unregister(struct hmm_mirror *mirror);
>  
> -/*
> - * hmm_mirror_mm_is_alive() - test if mm is still alive
> - * @mirror: the HMM mm mirror for which we want to lock the mmap_sem
> - * Return: false if the mm is dead, true otherwise
> - *
> - * This is an optimization, it will not always accurately return false if the
> - * mm is dead; i.e., there can be false negatives (process is being killed but
> - * HMM is not yet informed of that). It is only intended to be used to optimize
> - * out cases where the driver is about to do something time consuming and it
> - * would be better to skip it if the mm is dead.
> - */
> -static inline bool hmm_mirror_mm_is_alive(struct hmm_mirror *mirror)
> -{
> -	struct mm_struct *mm;
> -
> -	if (!mirror || !mirror->hmm)
> -		return false;
> -	mm = READ_ONCE(mirror->hmm->mm);
> -	if (mirror->hmm->dead || !mm)
> -		return false;
> -
> -	return true;
> -}
> -
>  /*
>   * Please see Documentation/vm/hmm.rst for how to use the range API.
>   */
> diff --git a/mm/hmm.c b/mm/hmm.c
> index dc30edad9a8a02..f67ba32983d9f1 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -80,7 +80,6 @@ static struct hmm *hmm_get_or_create(struct mm_struct *mm)
>  	mutex_init(&hmm->lock);
>  	kref_init(&hmm->kref);
>  	hmm->notifiers = 0;
> -	hmm->dead = false;
>  	hmm->mm = mm;
>  
>  	hmm->mmu_notifier.ops = &hmm_mmu_notifier_ops;
> @@ -124,20 +123,17 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>  {
>  	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
>  	struct hmm_mirror *mirror;
> -	struct hmm_range *range;
>  
>  	/* hmm is in progress to free */
>  	if (!kref_get_unless_zero(&hmm->kref))
>  		return;
>  
> -	/* Report this HMM as dying. */
> -	hmm->dead = true;
> -
> -	/* Wake-up everyone waiting on any range. */
>  	mutex_lock(&hmm->lock);
> -	list_for_each_entry(range, &hmm->ranges, list)
> -		range->valid = false;
> -	wake_up_all(&hmm->wq);
> +	/*
> +	 * Since hmm_range_register() holds the mmget() lock hmm_release() is
> +	 * prevented as long as a range exists.
> +	 */
> +	WARN_ON(!list_empty(&hmm->ranges));
>  	mutex_unlock(&hmm->lock);
>  
>  	down_write(&hmm->mirrors_sem);
> @@ -909,8 +905,8 @@ int hmm_range_register(struct hmm_range *range,
>  	range->start = start;
>  	range->end = end;
>  
> -	/* Check if hmm_mm_destroy() was call. */
> -	if (hmm->mm == NULL || hmm->dead)
> +	/* Prevent hmm_release() from running while the range is valid */
> +	if (!mmget_not_zero(hmm->mm))
>  		return -EFAULT;
>  
>  	range->hmm = hmm;
> @@ -955,6 +951,7 @@ void hmm_range_unregister(struct hmm_range *range)
>  
>  	/* Drop reference taken by hmm_range_register() */
>  	range->valid = false;
> +	mmput(hmm->mm);
>  	hmm_put(hmm);
>  	range->hmm = NULL;
>  }
> @@ -982,10 +979,7 @@ long hmm_range_snapshot(struct hmm_range *range)
>  	struct vm_area_struct *vma;
>  	struct mm_walk mm_walk;
>  
> -	/* Check if hmm_mm_destroy() was call. */
> -	if (hmm->mm == NULL || hmm->dead)
> -		return -EFAULT;
> -
> +	lockdep_assert_held(&hmm->mm->mmap_sem);
>  	do {
>  		/* If range is no longer valid force retry. */
>  		if (!range->valid)
> @@ -1080,9 +1074,7 @@ long hmm_range_fault(struct hmm_range *range, bool block)
>  	struct mm_walk mm_walk;
>  	int ret;
>  
> -	/* Check if hmm_mm_destroy() was call. */
> -	if (hmm->mm == NULL || hmm->dead)
> -		return -EFAULT;
> +	lockdep_assert_held(&hmm->mm->mmap_sem);
>  
>  	do {
>  		/* If range is no longer valid force retry. */
> 

Nice cleanup.

    Reviewed-by: John Hubbard <jhubbard@nvidia.com>


thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 07/11] mm/hmm: Use lockdep instead of comments
  2019-06-06 18:44 ` [PATCH v2 hmm 07/11] mm/hmm: Use lockdep instead of comments Jason Gunthorpe
@ 2019-06-07  3:19   ` John Hubbard
  2019-06-07 20:31   ` Ralph Campbell
  2019-06-07 22:16   ` Souptick Joarder
  2 siblings, 0 replies; 79+ messages in thread
From: John Hubbard @ 2019-06-07  3:19 UTC (permalink / raw)
  To: Jason Gunthorpe, Jerome Glisse, Ralph Campbell, Felix.Kuehling
  Cc: linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx,
	Jason Gunthorpe

On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> So we can check locking at runtime.
> 
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> ---
> v2
> - Fix missing & in lockdeps (Jason)
> ---
>  mm/hmm.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index f67ba32983d9f1..c702cd72651b53 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -254,11 +254,11 @@ static const struct mmu_notifier_ops hmm_mmu_notifier_ops = {
>   *
>   * To start mirroring a process address space, the device driver must register
>   * an HMM mirror struct.
> - *
> - * THE mm->mmap_sem MUST BE HELD IN WRITE MODE !
>   */
>  int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm)
>  {
> +	lockdep_assert_held_exclusive(&mm->mmap_sem);
> +
>  	/* Sanity check */
>  	if (!mm || !mirror || !mirror->ops)
>  		return -EINVAL;
> 

    Reviewed-by: John Hubbard <jhubbard@nvidia.com>


thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 08/11] mm/hmm: Remove racy protection against double-unregistration
  2019-06-06 18:44 ` [PATCH v2 hmm 08/11] mm/hmm: Remove racy protection against double-unregistration Jason Gunthorpe
@ 2019-06-07  3:29   ` John Hubbard
  2019-06-07 13:57     ` Jason Gunthorpe
  2019-06-07 20:33   ` Ralph Campbell
  1 sibling, 1 reply; 79+ messages in thread
From: John Hubbard @ 2019-06-07  3:29 UTC (permalink / raw)
  To: Jason Gunthorpe, Jerome Glisse, Ralph Campbell, Felix.Kuehling
  Cc: linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx,
	Jason Gunthorpe

On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> No other register/unregister kernel API attempts to provide this kind of
> protection as it is inherently racy, so just drop it.
> 
> Callers should provide their own protection, it appears nouveau already
> does, but just in case drop a debugging POISON.
> 
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> ---
>  mm/hmm.c | 9 ++-------
>  1 file changed, 2 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index c702cd72651b53..6802de7080d172 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -284,18 +284,13 @@ EXPORT_SYMBOL(hmm_mirror_register);
>   */
>  void hmm_mirror_unregister(struct hmm_mirror *mirror)
>  {
> -	struct hmm *hmm = READ_ONCE(mirror->hmm);
> -
> -	if (hmm == NULL)
> -		return;
> +	struct hmm *hmm = mirror->hmm;
>  
>  	down_write(&hmm->mirrors_sem);
>  	list_del_init(&mirror->list);
> -	/* To protect us against double unregister ... */
> -	mirror->hmm = NULL;
>  	up_write(&hmm->mirrors_sem);
> -
>  	hmm_put(hmm);
> +	memset(&mirror->hmm, POISON_INUSE, sizeof(mirror->hmm));

I hadn't thought of POISON_* for these types of cases, it's a 
good technique to remember.

I noticed that this is now done outside of the lock, but that
follows directly from your commit description, so that all looks 
correct.

>  }
>  EXPORT_SYMBOL(hmm_mirror_unregister);
>  
> 


    Reviewed-by: John Hubbard <jhubbard@nvidia.com>

thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 09/11] mm/hmm: Poison hmm_range during unregister
  2019-06-06 18:44 ` [PATCH v2 hmm 09/11] mm/hmm: Poison hmm_range during unregister Jason Gunthorpe
@ 2019-06-07  3:37   ` John Hubbard
  2019-06-07 14:03     ` Jason Gunthorpe
  2019-06-07 20:46   ` Ralph Campbell
  2019-06-07 23:01   ` Ira Weiny
  2 siblings, 1 reply; 79+ messages in thread
From: John Hubbard @ 2019-06-07  3:37 UTC (permalink / raw)
  To: Jason Gunthorpe, Jerome Glisse, Ralph Campbell, Felix.Kuehling
  Cc: linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx,
	Jason Gunthorpe

On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> Trying to misuse a range outside its lifetime is a kernel bug. Use WARN_ON
> and poison bytes to detect this condition.
> 
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> ---
> v2
> - Keep range start/end valid after unregistration (Jerome)
> ---
>  mm/hmm.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 6802de7080d172..c2fecb3ecb11e1 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -937,7 +937,7 @@ void hmm_range_unregister(struct hmm_range *range)
>  	struct hmm *hmm = range->hmm;
>  
>  	/* Sanity check this really should not happen. */

That comment can also be deleted, as it has the same meaning as
the WARN_ON() that you just added.

> -	if (hmm == NULL || range->end <= range->start)
> +	if (WARN_ON(range->end <= range->start))
>  		return;
>  
>  	mutex_lock(&hmm->lock);
> @@ -948,7 +948,10 @@ void hmm_range_unregister(struct hmm_range *range)
>  	range->valid = false;
>  	mmput(hmm->mm);
>  	hmm_put(hmm);
> -	range->hmm = NULL;
> +
> +	/* The range is now invalid, leave it poisoned. */

To be precise, we are poisoning the range's back pointer to its
owning hmm instance.  Maybe this is clearer:

	/*
	 * The range is now invalid, so poison its hmm pointer.
	 * Leave other range-> fields in place, for the caller's use.
	 */

...or something like that?

> +	range->valid = false;
> +	memset(&range->hmm, POISON_INUSE, sizeof(range->hmm));
>  }
>  EXPORT_SYMBOL(hmm_range_unregister);
>  
> 

The above are very minor documentation points, so:

    Reviewed-by: John Hubbard <jhubbard@nvidia.com>


thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 10/11] mm/hmm: Do not use list*_rcu() for hmm->ranges
  2019-06-06 18:44 ` [PATCH v2 hmm 10/11] mm/hmm: Do not use list*_rcu() for hmm->ranges Jason Gunthorpe
@ 2019-06-07  3:40   ` John Hubbard
  2019-06-07 20:49   ` Ralph Campbell
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 79+ messages in thread
From: John Hubbard @ 2019-06-07  3:40 UTC (permalink / raw)
  To: Jason Gunthorpe, Jerome Glisse, Ralph Campbell, Felix.Kuehling
  Cc: linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx,
	Jason Gunthorpe

On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> This list is always read and written while holding hmm->lock so there is
> no need for the confusing _rcu annotations.
> 
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> ---
>  mm/hmm.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index c2fecb3ecb11e1..709d138dd49027 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -911,7 +911,7 @@ int hmm_range_register(struct hmm_range *range,
>  	mutex_lock(&hmm->lock);
>  
>  	range->hmm = hmm;
> -	list_add_rcu(&range->list, &hmm->ranges);
> +	list_add(&range->list, &hmm->ranges);
>  
>  	/*
>  	 * If there are any concurrent notifiers we have to wait for them for
> @@ -941,7 +941,7 @@ void hmm_range_unregister(struct hmm_range *range)
>  		return;
>  
>  	mutex_lock(&hmm->lock);
> -	list_del_rcu(&range->list);
> +	list_del(&range->list);
>  	mutex_unlock(&hmm->lock);
>  
>  	/* Drop reference taken by hmm_range_register() */
> 

Good point. I'd overlooked this many times.

    Reviewed-by: John Hubbard <jhubbard@nvidia.com>


thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 11/11] mm/hmm: Remove confusing comment and logic from hmm_release
  2019-06-06 18:44 ` [PATCH v2 hmm 11/11] mm/hmm: Remove confusing comment and logic from hmm_release Jason Gunthorpe
@ 2019-06-07  3:47   ` John Hubbard
  2019-06-07 12:58     ` Jason Gunthorpe
  2019-06-07 21:37   ` Ralph Campbell
  1 sibling, 1 reply; 79+ messages in thread
From: John Hubbard @ 2019-06-07  3:47 UTC (permalink / raw)
  To: Jason Gunthorpe, Jerome Glisse, Ralph Campbell, Felix.Kuehling
  Cc: linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx,
	Jason Gunthorpe

On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> hmm_release() is called exactly once per hmm. ops->release() cannot
> accidentally trigger any action that would recurse back onto
> hmm->mirrors_sem.
> 
> This fixes a use after-free race of the form:
> 
>        CPU0                                   CPU1
>                                            hmm_release()
>                                              up_write(&hmm->mirrors_sem);
>  hmm_mirror_unregister(mirror)
>   down_write(&hmm->mirrors_sem);
>   up_write(&hmm->mirrors_sem);
>   kfree(mirror)
>                                              mirror->ops->release(mirror)
> 
> The only user we have today for ops->release is an empty function, so this
> is unambiguously safe.
> 
> As a consequence of plugging this race drivers are not allowed to
> register/unregister mirrors from within a release op.
> 
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> ---
>  mm/hmm.c | 28 +++++++++-------------------
>  1 file changed, 9 insertions(+), 19 deletions(-)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 709d138dd49027..3a45dd3d778248 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -136,26 +136,16 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>  	WARN_ON(!list_empty(&hmm->ranges));
>  	mutex_unlock(&hmm->lock);
>  
> -	down_write(&hmm->mirrors_sem);
> -	mirror = list_first_entry_or_null(&hmm->mirrors, struct hmm_mirror,
> -					  list);
> -	while (mirror) {
> -		list_del_init(&mirror->list);
> -		if (mirror->ops->release) {
> -			/*
> -			 * Drop mirrors_sem so the release callback can wait
> -			 * on any pending work that might itself trigger a
> -			 * mmu_notifier callback and thus would deadlock with
> -			 * us.
> -			 */
> -			up_write(&hmm->mirrors_sem);
> +	down_read(&hmm->mirrors_sem);

This is cleaner and simpler, but I suspect it is leading to the deadlock
that Ralph Campbell is seeing in his driver testing. (And in general, holding
a lock during a driver callback usually leads to deadlocks.)

Ralph, is this the one? It's the only place in this patchset where I can
see a lock around a callback to driver code, that wasn't there before. So
I'm pretty sure it is the one...


thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 01/11] mm/hmm: fix use after free with struct hmm in the mmu notifiers
  2019-06-07  2:29   ` John Hubbard
@ 2019-06-07 12:34     ` Jason Gunthorpe
  2019-06-07 13:42       ` Jason Gunthorpe
                         ` (2 more replies)
  0 siblings, 3 replies; 79+ messages in thread
From: Jason Gunthorpe @ 2019-06-07 12:34 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jerome Glisse, Ralph Campbell, Felix.Kuehling, linux-rdma,
	linux-mm, Andrea Arcangeli, dri-devel, amd-gfx

On Thu, Jun 06, 2019 at 07:29:08PM -0700, John Hubbard wrote:
> On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> > From: Jason Gunthorpe <jgg@mellanox.com>
> ...
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > index 8e7403f081f44a..547002f56a163d 100644
> > +++ b/mm/hmm.c
> ...
> > @@ -125,7 +130,7 @@ static void hmm_free(struct kref *kref)
> >  		mm->hmm = NULL;
> >  	spin_unlock(&mm->page_table_lock);
> >  
> > -	kfree(hmm);
> > +	mmu_notifier_call_srcu(&hmm->rcu, hmm_free_rcu);
> 
> 
> It occurred to me to wonder if it is best to use the MMU notifier's
> instance of srcu, instead of creating a separate instance for HMM.

It *has* to be the MMU notifier SRCU because we are synchronizing
against the read side of that SRCU inside the mmu notifier code, ie:

int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
        id = srcu_read_lock(&srcu);
        hlist_for_each_entry_rcu(mn, &range->mm->mmu_notifier_mm->list, hlist) {
                if (mn->ops->invalidate_range_start) {
                   ^^^^^

Here 'mn' is really the hmm (hmm = container_of(mn, struct hmm,
mmu_notifier)), so we must protect the memory from being freed while the
mmu notifier core is still using it.

Thus we have no choice but to use its SRCU.

CH also pointed out a more elegant solution, which is to get the write
side of the mmap_sem during hmm_mirror_unregister - no notifier
callback can be running in this case. Then we delete the kref, srcu
and so forth.

This is much clearer/saner/better, but.. it requires the callers of
hmm_mirror_unregister to be able to safely take the mmap_sem write side.

I think this is true, so maybe this patch should be switched, what do
you think?

> > @@ -153,10 +158,14 @@ void hmm_mm_destroy(struct mm_struct *mm)
> >  
> >  static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
> >  {
> > -	struct hmm *hmm = mm_get_hmm(mm);
> > +	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
> >  	struct hmm_mirror *mirror;
> >  	struct hmm_range *range;
> >  
> > +	/* hmm is in progress to free */
> 
> Well, sometimes, yes. :)

I think it is in all cases actually.. The only way we see a 0 kref
and still reach this code path is if another thread has already set up
the hmm_free in the call_srcu..

> Maybe this wording is clearer (if we need any comment at all):

I always find this hard.. This is a very standard pattern when working
with RCU - however in my experience few people actually know the RCU
patterns, and missing the _unless_zero is a common bug I find when
looking at code.
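
ie the difference is roughly (generic sketch, 'obj' is a made up object,
not the hmm code):

	/* buggy under RCU/SRCU: the final kref_put() may already be in
	 * flight, so this can "revive" a refcount that has hit zero
	 */
	kref_get(&obj->kref);

	/* correct: back off if the object is already being freed */
	if (!kref_get_unless_zero(&obj->kref))
		return;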

This is mm/ so I can drop it, what do you think?

Thanks,
Jason


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 03/11] mm/hmm: Hold a mmgrab from hmm to mm
  2019-06-07  2:44   ` John Hubbard
@ 2019-06-07 12:36     ` Jason Gunthorpe
  0 siblings, 0 replies; 79+ messages in thread
From: Jason Gunthorpe @ 2019-06-07 12:36 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jerome Glisse, Ralph Campbell, Felix.Kuehling, linux-rdma,
	linux-mm, Andrea Arcangeli, dri-devel, amd-gfx

On Thu, Jun 06, 2019 at 07:44:58PM -0700, John Hubbard wrote:
> On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> > From: Jason Gunthorpe <jgg@mellanox.com>
> > 
> > So long a a struct hmm pointer exists, so should the struct mm it is
> > linked too. Hold the mmgrab() as soon as a hmm is created, and mmdrop() it
> > once the hmm refcount goes to zero.
> > 
> > Since mmdrop() (ie a 0 kref on struct mm) is now impossible with a !NULL
> > mm->hmm delete the hmm_hmm_destroy().
> > 
> > Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> > Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> > v2:
> >  - Fix error unwind paths in hmm_get_or_create (Jerome/Jason)
> >  include/linux/hmm.h |  3 ---
> >  kernel/fork.c       |  1 -
> >  mm/hmm.c            | 22 ++++------------------
> >  3 files changed, 4 insertions(+), 22 deletions(-)
> > 
> > diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> > index 2d519797cb134a..4ee3acabe5ed22 100644
> > +++ b/include/linux/hmm.h
> > @@ -586,14 +586,11 @@ static inline int hmm_vma_fault(struct hmm_mirror *mirror,
> >  }
> >  
> >  /* Below are for HMM internal use only! Not to be used by device driver! */
> > -void hmm_mm_destroy(struct mm_struct *mm);
> > -
> >  static inline void hmm_mm_init(struct mm_struct *mm)
> >  {
> >  	mm->hmm = NULL;
> >  }
> >  #else /* IS_ENABLED(CONFIG_HMM_MIRROR) */
> > -static inline void hmm_mm_destroy(struct mm_struct *mm) {}
> >  static inline void hmm_mm_init(struct mm_struct *mm) {}
> >  #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
> >  
> > diff --git a/kernel/fork.c b/kernel/fork.c
> > index b2b87d450b80b5..588c768ae72451 100644
> > +++ b/kernel/fork.c
> > @@ -673,7 +673,6 @@ void __mmdrop(struct mm_struct *mm)
> >  	WARN_ON_ONCE(mm == current->active_mm);
> >  	mm_free_pgd(mm);
> >  	destroy_context(mm);
> > -	hmm_mm_destroy(mm);
> 
> 
> This is particularly welcome, not to have an "HMM is special" case
> in such a core part of process/mm code. 

I would very much like to propose something like 'per-net' for struct
mm, as rdma also needs to add some data to each mm to make its use of
mmu notifiers work (for basically this same reason as HMM)

Thanks,
Jason


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 05/11] mm/hmm: Remove duplicate condition test before wait_event_timeout
  2019-06-07  3:06   ` John Hubbard
@ 2019-06-07 12:47     ` Jason Gunthorpe
  2019-06-07 13:31     ` [PATCH v3 " Jason Gunthorpe
  1 sibling, 0 replies; 79+ messages in thread
From: Jason Gunthorpe @ 2019-06-07 12:47 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jerome Glisse, Ralph Campbell, Felix.Kuehling, linux-rdma,
	linux-mm, Andrea Arcangeli, dri-devel, amd-gfx

On Thu, Jun 06, 2019 at 08:06:52PM -0700, John Hubbard wrote:
> On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> > From: Jason Gunthorpe <jgg@mellanox.com>
> > 
> > The wait_event_timeout macro already tests the condition as its first
> > action, so there is no reason to open code another version of this, all
> > that does is skip the might_sleep() debugging in common cases, which is
> > not helpful.
> > 
> > Further, based on prior patches, we can no simplify the required condition
> 
>                                           "now simplify"
> 
> > test:
> >  - If range is valid memory then so is range->hmm
> >  - If hmm_release() has run then range->valid is set to false
> >    at the same time as dead, so no reason to check both.
> >  - A valid hmm has a valid hmm->mm.
> > 
> > Also, add the READ_ONCE for range->valid as there is no lock held here.
> > 
> > Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> > Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> >  include/linux/hmm.h | 12 ++----------
> >  1 file changed, 2 insertions(+), 10 deletions(-)
> > 
> > diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> > index 4ee3acabe5ed22..2ab35b40992b24 100644
> > +++ b/include/linux/hmm.h
> > @@ -218,17 +218,9 @@ static inline unsigned long hmm_range_page_size(const struct hmm_range *range)
> >  static inline bool hmm_range_wait_until_valid(struct hmm_range *range,
> >  					      unsigned long timeout)
> >  {
> > -	/* Check if mm is dead ? */
> > -	if (range->hmm == NULL || range->hmm->dead || range->hmm->mm == NULL) {
> > -		range->valid = false;
> > -		return false;
> > -	}
> > -	if (range->valid)
> > -		return true;
> > -	wait_event_timeout(range->hmm->wq, range->valid || range->hmm->dead,
> > +	wait_event_timeout(range->hmm->wq, range->valid,
> >  			   msecs_to_jiffies(timeout));
> > -	/* Return current valid status just in case we get lucky */
> > -	return range->valid;
> > +	return READ_ONCE(range->valid);
> 
> Just to ensure that I actually understand the model: I'm assuming that the 
> READ_ONCE is there solely to ensure that range->valid is read *after* the
> wait_event_timeout() returns. Is that correct?

No, wait_event_timeout already has internal barriers that make sure
things don't leak across it.

The READ_ONCE is required any time a thread is reading a value that
another thread can be concurrently changing - ie in this case there is
no lock protecting range->valid so the write side could be running.

Without the READ_ONCE the compiler is allowed to read the value twice
and assume it gets the same result, which may not be true with a
parallel writer, and thus may compromise the control flow in some
unknown way. 

It is also good documentation for the locking scheme in use as it
marks shared data that is not being locked.

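A contrived sketch of the hazard (not the hmm code, use() is just a
stand-in):

	/* plain read: the compiler may legally re-load range->valid */
	if (range->valid)		/* load #1 */
		use(range->valid);	/* may become load #2, racing the writer */

	/* READ_ONCE: exactly one load, so the two uses cannot disagree */
	bool valid = READ_ONCE(range->valid);

	if (valid)
		use(valid);
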
However, now that dead is gone we can just write the above more simply
as:

static inline bool hmm_range_wait_until_valid(struct hmm_range *range,
					      unsigned long timeout)
{
	return wait_event_timeout(range->hmm->wq, range->valid,
				  msecs_to_jiffies(timeout)) != 0;
}

This relies on the internal barriers of wait_event_timeout. I'll fix
it up..

Thanks,
Jason


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 11/11] mm/hmm: Remove confusing comment and logic from hmm_release
  2019-06-07  3:47   ` John Hubbard
@ 2019-06-07 12:58     ` Jason Gunthorpe
  0 siblings, 0 replies; 79+ messages in thread
From: Jason Gunthorpe @ 2019-06-07 12:58 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jerome Glisse, Ralph Campbell, Felix.Kuehling, linux-rdma,
	linux-mm, Andrea Arcangeli, dri-devel, amd-gfx

On Thu, Jun 06, 2019 at 08:47:28PM -0700, John Hubbard wrote:
> On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> > From: Jason Gunthorpe <jgg@mellanox.com>
> > 
> > hmm_release() is called exactly once per hmm. ops->release() cannot
> > accidentally trigger any action that would recurse back onto
> > hmm->mirrors_sem.
> > 
> > This fixes a use after-free race of the form:
> > 
> >        CPU0                                   CPU1
> >                                            hmm_release()
> >                                              up_write(&hmm->mirrors_sem);
> >  hmm_mirror_unregister(mirror)
> >   down_write(&hmm->mirrors_sem);
> >   up_write(&hmm->mirrors_sem);
> >   kfree(mirror)
> >                                              mirror->ops->release(mirror)
> > 
> > The only user we have today for ops->release is an empty function, so this
> > is unambiguously safe.
> > 
> > As a consequence of plugging this race drivers are not allowed to
> > register/unregister mirrors from within a release op.
> > 
> > Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> >  mm/hmm.c | 28 +++++++++-------------------
> >  1 file changed, 9 insertions(+), 19 deletions(-)
> > 
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > index 709d138dd49027..3a45dd3d778248 100644
> > +++ b/mm/hmm.c
> > @@ -136,26 +136,16 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
> >  	WARN_ON(!list_empty(&hmm->ranges));
> >  	mutex_unlock(&hmm->lock);
> >  
> > -	down_write(&hmm->mirrors_sem);
> > -	mirror = list_first_entry_or_null(&hmm->mirrors, struct hmm_mirror,
> > -					  list);
> > -	while (mirror) {
> > -		list_del_init(&mirror->list);
> > -		if (mirror->ops->release) {
> > -			/*
> > -			 * Drop mirrors_sem so the release callback can wait
> > -			 * on any pending work that might itself trigger a
> > -			 * mmu_notifier callback and thus would deadlock with
> > -			 * us.
> > -			 */
> > -			up_write(&hmm->mirrors_sem);
> > +	down_read(&hmm->mirrors_sem);
> 
> This is cleaner and simpler, but I suspect it is leading to the deadlock
> that Ralph Campbell is seeing in his driver testing. (And in general, holding
> a lock during a driver callback usually leads to deadlocks.)

I think Ralph has never seen this patch (it is new), so it must be one
of the earlier patches..

> Ralph, is this the one? It's the only place in this patchset where I can
> see a lock around a callback to driver code, that wasn't there before. So
> I'm pretty sure it is the one...

Can you share the lockdep report please?

Thanks,
Jason


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH v3 hmm 05/11] mm/hmm: Remove duplicate condition test before wait_event_timeout
  2019-06-07  3:06   ` John Hubbard
  2019-06-07 12:47     ` Jason Gunthorpe
@ 2019-06-07 13:31     ` Jason Gunthorpe
  2019-06-07 22:55       ` Ira Weiny
  2019-06-08  1:32       ` John Hubbard
  1 sibling, 2 replies; 79+ messages in thread
From: Jason Gunthorpe @ 2019-06-07 13:31 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jerome Glisse, Ralph Campbell, Felix.Kuehling, linux-rdma,
	linux-mm, Andrea Arcangeli, dri-devel, amd-gfx

The wait_event_timeout macro already tests the condition as its first
action, so there is no reason to open code another version of this, all
that does is skip the might_sleep() debugging in common cases, which is
not helpful.

Further, based on prior patches, we can now simplify the required condition
test:
 - If range is valid memory then so is range->hmm
 - If hmm_release() has run then range->valid is set to false
   at the same time as dead, so no reason to check both.
 - A valid hmm has a valid hmm->mm.

Allow the return value of wait_event_timeout() (along with its internal
barriers) to determine the result of the function.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
v3
- Simplify the wait_event_timeout to not check valid
---
 include/linux/hmm.h | 13 ++-----------
 1 file changed, 2 insertions(+), 11 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 1d97b6d62c5bcf..26e7c477490c4e 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -209,17 +209,8 @@ static inline unsigned long hmm_range_page_size(const struct hmm_range *range)
 static inline bool hmm_range_wait_until_valid(struct hmm_range *range,
 					      unsigned long timeout)
 {
-	/* Check if mm is dead ? */
-	if (range->hmm == NULL || range->hmm->dead || range->hmm->mm == NULL) {
-		range->valid = false;
-		return false;
-	}
-	if (range->valid)
-		return true;
-	wait_event_timeout(range->hmm->wq, range->valid || range->hmm->dead,
-			   msecs_to_jiffies(timeout));
-	/* Return current valid status just in case we get lucky */
-	return range->valid;
+	return wait_event_timeout(range->hmm->wq, range->valid,
+				  msecs_to_jiffies(timeout)) != 0;
 }
 
 /*
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 01/11] mm/hmm: fix use after free with struct hmm in the mmu notifiers
  2019-06-07 12:34     ` Jason Gunthorpe
@ 2019-06-07 13:42       ` Jason Gunthorpe
  2019-06-08  1:13       ` John Hubbard
  2019-06-08  1:37       ` John Hubbard
  2 siblings, 0 replies; 79+ messages in thread
From: Jason Gunthorpe @ 2019-06-07 13:42 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jerome Glisse, Ralph Campbell, Felix.Kuehling, linux-rdma,
	linux-mm, Andrea Arcangeli, dri-devel, amd-gfx

On Fri, Jun 07, 2019 at 09:34:32AM -0300, Jason Gunthorpe wrote:

> CH also pointed out a more elegant solution, which is to get the write
> side of the mmap_sem during hmm_mirror_unregister - no notifier
> callback can be running in this case. Then we delete the kref, srcu
> and so forth.

Oops, it turns out this is only the case for invalidate_start/end, not
release, so this doesn't help with the SRCU unless we also change
exit_mmap to call release with the mmap sem held.

So I think we have to stick with this for now.

Jason


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 08/11] mm/hmm: Remove racy protection against double-unregistration
  2019-06-07  3:29   ` John Hubbard
@ 2019-06-07 13:57     ` Jason Gunthorpe
  0 siblings, 0 replies; 79+ messages in thread
From: Jason Gunthorpe @ 2019-06-07 13:57 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jerome Glisse, Ralph Campbell, Felix.Kuehling, linux-rdma,
	linux-mm, Andrea Arcangeli, dri-devel, amd-gfx

On Thu, Jun 06, 2019 at 08:29:10PM -0700, John Hubbard wrote:
> On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> > From: Jason Gunthorpe <jgg@mellanox.com>
> > 
> > No other register/unregister kernel API attempts to provide this kind of
> > protection as it is inherently racy, so just drop it.
> > 
> > Callers should provide their own protection, it appears nouveau already
> > does, but just in case drop a debugging POISON.
> > 
> > Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> > Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> >  mm/hmm.c | 9 ++-------
> >  1 file changed, 2 insertions(+), 7 deletions(-)
> > 
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > index c702cd72651b53..6802de7080d172 100644
> > +++ b/mm/hmm.c
> > @@ -284,18 +284,13 @@ EXPORT_SYMBOL(hmm_mirror_register);
> >   */
> >  void hmm_mirror_unregister(struct hmm_mirror *mirror)
> >  {
> > -	struct hmm *hmm = READ_ONCE(mirror->hmm);
> > -
> > -	if (hmm == NULL)
> > -		return;
> > +	struct hmm *hmm = mirror->hmm;
> >  
> >  	down_write(&hmm->mirrors_sem);
> >  	list_del_init(&mirror->list);
> > -	/* To protect us against double unregister ... */
> > -	mirror->hmm = NULL;
> >  	up_write(&hmm->mirrors_sem);
> > -
> >  	hmm_put(hmm);
> > +	memset(&mirror->hmm, POISON_INUSE, sizeof(mirror->hmm));
> 
> I hadn't thought of POISON_* for these types of cases, it's a 
> good technique to remember.
> 
> I noticed that this is now done outside of the lock, but that
> follows directly from your commit description, so that all looks 
> correct.

Yes, the thing about POISON is that if you ever read it then you have
found a use after free bug - thus we should never need to write it
under a lock (just after a serializing lock)

Normally I wouldn't bother as kfree does poison as well, but since we
can't easily audit the patches yet to be submitted this seems safer
and will reliably cause those patches to explode with an oops in
testing.
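
For illustration only (a made up caller, not any real driver), such a
patch would now do something like:

	hmm_mirror_unregister(mirror);
	...
	/* mirror->hmm is now the 0x5a.. POISON_INUSE pattern, so this
	 * oopses immediately instead of silently touching freed memory
	 */
	down_read(&mirror->hmm->mirrors_sem);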

Thanks,
Jason


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 09/11] mm/hmm: Poison hmm_range during unregister
  2019-06-07  3:37   ` John Hubbard
@ 2019-06-07 14:03     ` Jason Gunthorpe
  0 siblings, 0 replies; 79+ messages in thread
From: Jason Gunthorpe @ 2019-06-07 14:03 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jerome Glisse, Ralph Campbell, Felix.Kuehling, linux-rdma,
	linux-mm, Andrea Arcangeli, dri-devel, amd-gfx

On Thu, Jun 06, 2019 at 08:37:42PM -0700, John Hubbard wrote:
> On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> > From: Jason Gunthorpe <jgg@mellanox.com>
> > 
> > Trying to misuse a range outside its lifetime is a kernel bug. Use WARN_ON
> > and poison bytes to detect this condition.
> > 
> > Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> > Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> > v2
> > - Keep range start/end valid after unregistration (Jerome)
> >  mm/hmm.c | 7 +++++--
> >  1 file changed, 5 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > index 6802de7080d172..c2fecb3ecb11e1 100644
> > +++ b/mm/hmm.c
> > @@ -937,7 +937,7 @@ void hmm_range_unregister(struct hmm_range *range)
> >  	struct hmm *hmm = range->hmm;
> >  
> >  	/* Sanity check this really should not happen. */
> 
> That comment can also be deleted, as it has the same meaning as
> the WARN_ON() that you just added.
> 
> > -	if (hmm == NULL || range->end <= range->start)
> > +	if (WARN_ON(range->end <= range->start))
> >  		return;
> >  
> >  	mutex_lock(&hmm->lock);
> > @@ -948,7 +948,10 @@ void hmm_range_unregister(struct hmm_range *range)
> >  	range->valid = false;
> >  	mmput(hmm->mm);
> >  	hmm_put(hmm);
> > -	range->hmm = NULL;
> > +
> > +	/* The range is now invalid, leave it poisoned. */
> 
> To be precise, we are poisoning the range's back pointer to it's
> owning hmm instance.  Maybe this is clearer:
> 
> 	/*
> 	 * The range is now invalid, so poison it's hmm pointer. 
> 	 * Leave other range-> fields in place, for the caller's use.
> 	 */
> 
> ...or something like that?
> 
> > +	range->valid = false;
> > +	memset(&range->hmm, POISON_INUSE, sizeof(range->hmm));
> >  }
> >  EXPORT_SYMBOL(hmm_range_unregister);
> >  
> > 
> 
> The above are very minor documentation points, so:
> 
>     Reviewed-by: John Hubbard <jhubbard@nvidia.com>

done thanks

Jason


^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH v2 12/11] mm/hmm: Fix error flows in hmm_invalidate_range_start
  2019-06-06 18:44 [PATCH v2 hmm 00/11] Various revisions from a locking/code review Jason Gunthorpe
                   ` (10 preceding siblings ...)
  2019-06-06 18:44 ` [PATCH v2 hmm 11/11] mm/hmm: Remove confusing comment and logic from hmm_release Jason Gunthorpe
@ 2019-06-07 16:05 ` Jason Gunthorpe
  2019-06-07 23:52   ` Ralph Campbell
  2019-06-11 19:48 ` [PATCH v2 hmm 00/11] Various revisions from a locking/code review Jason Gunthorpe
  12 siblings, 1 reply; 79+ messages in thread
From: Jason Gunthorpe @ 2019-06-07 16:05 UTC (permalink / raw)
  To: Jerome Glisse, Ralph Campbell, John Hubbard, Felix.Kuehling
  Cc: linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx

If the trylock on the hmm->mirrors_sem fails the function will return
without decrementing the notifiers that were previously incremented. Since
the caller will not call invalidate_range_end() on EAGAIN this will result
in the notifiers count being left permanently incremented, leading to deadlock.

If the sync_cpu_device_pagetables() required blocking, the function will
not return EAGAIN even though the device continues to touch the
pages. This is a violation of the mmu notifier contract.

Switch, and rename, the ranges_lock to a spin lock so we can reliably
obtain it without blocking during error unwind.

The error unwind is necessary since the notifiers count must be held
incremented across the call to sync_cpu_device_pagetables() as we cannot
allow the range to become marked valid by a parallel
invalidate_start/end() pair while doing sync_cpu_device_pagetables().

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 include/linux/hmm.h |  2 +-
 mm/hmm.c            | 77 +++++++++++++++++++++++++++------------------
 2 files changed, 48 insertions(+), 31 deletions(-)

I almost lost this patch - it is part of the series, hasn't been
posted before, and wasn't sent with the rest, sorry.

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index bf013e96525771..0fa8ea34ccef6d 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -86,7 +86,7 @@
 struct hmm {
 	struct mm_struct	*mm;
 	struct kref		kref;
-	struct mutex		lock;
+	spinlock_t		ranges_lock;
 	struct list_head	ranges;
 	struct list_head	mirrors;
 	struct mmu_notifier	mmu_notifier;
diff --git a/mm/hmm.c b/mm/hmm.c
index 4215edf737ef5b..10103a24e9b7b3 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -68,7 +68,7 @@ static struct hmm *hmm_get_or_create(struct mm_struct *mm)
 	init_rwsem(&hmm->mirrors_sem);
 	hmm->mmu_notifier.ops = NULL;
 	INIT_LIST_HEAD(&hmm->ranges);
-	mutex_init(&hmm->lock);
+	spin_lock_init(&hmm->ranges_lock);
 	kref_init(&hmm->kref);
 	hmm->notifiers = 0;
 	hmm->mm = mm;
@@ -114,18 +114,19 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 {
 	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
 	struct hmm_mirror *mirror;
+	unsigned long flags;
 
 	/* Bail out if hmm is in the process of being freed */
 	if (!kref_get_unless_zero(&hmm->kref))
 		return;
 
-	mutex_lock(&hmm->lock);
+	spin_lock_irqsave(&hmm->ranges_lock, flags);
 	/*
 	 * Since hmm_range_register() holds the mmget() lock hmm_release() is
 	 * prevented as long as a range exists.
 	 */
 	WARN_ON(!list_empty(&hmm->ranges));
-	mutex_unlock(&hmm->lock);
+	spin_unlock_irqrestore(&hmm->ranges_lock, flags);
 
 	down_read(&hmm->mirrors_sem);
 	list_for_each_entry(mirror, &hmm->mirrors, list) {
@@ -141,6 +142,23 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 	hmm_put(hmm);
 }
 
+static void notifiers_decrement(struct hmm *hmm)
+{
+	lockdep_assert_held(&hmm->ranges_lock);
+
+	hmm->notifiers--;
+	if (!hmm->notifiers) {
+		struct hmm_range *range;
+
+		list_for_each_entry(range, &hmm->ranges, list) {
+			if (range->valid)
+				continue;
+			range->valid = true;
+		}
+		wake_up_all(&hmm->wq);
+	}
+}
+
 static int hmm_invalidate_range_start(struct mmu_notifier *mn,
 			const struct mmu_notifier_range *nrange)
 {
@@ -148,6 +166,7 @@ static int hmm_invalidate_range_start(struct mmu_notifier *mn,
 	struct hmm_mirror *mirror;
 	struct hmm_update update;
 	struct hmm_range *range;
+	unsigned long flags;
 	int ret = 0;
 
 	if (!kref_get_unless_zero(&hmm->kref))
@@ -158,12 +177,7 @@ static int hmm_invalidate_range_start(struct mmu_notifier *mn,
 	update.event = HMM_UPDATE_INVALIDATE;
 	update.blockable = mmu_notifier_range_blockable(nrange);
 
-	if (mmu_notifier_range_blockable(nrange))
-		mutex_lock(&hmm->lock);
-	else if (!mutex_trylock(&hmm->lock)) {
-		ret = -EAGAIN;
-		goto out;
-	}
+	spin_lock_irqsave(&hmm->ranges_lock, flags);
 	hmm->notifiers++;
 	list_for_each_entry(range, &hmm->ranges, list) {
 		if (update.end < range->start || update.start >= range->end)
@@ -171,7 +185,7 @@ static int hmm_invalidate_range_start(struct mmu_notifier *mn,
 
 		range->valid = false;
 	}
-	mutex_unlock(&hmm->lock);
+	spin_unlock_irqrestore(&hmm->ranges_lock, flags);
 
 	if (mmu_notifier_range_blockable(nrange))
 		down_read(&hmm->mirrors_sem);
@@ -179,16 +193,26 @@ static int hmm_invalidate_range_start(struct mmu_notifier *mn,
 		ret = -EAGAIN;
 		goto out;
 	}
+
 	list_for_each_entry(mirror, &hmm->mirrors, list) {
-		int ret;
+		int rc;
 
-		ret = mirror->ops->sync_cpu_device_pagetables(mirror, &update);
-		if (!update.blockable && ret == -EAGAIN)
+		rc = mirror->ops->sync_cpu_device_pagetables(mirror, &update);
+		if (rc) {
+			if (WARN_ON(update.blockable || rc != -EAGAIN))
+				continue;
+			ret = -EAGAIN;
 			break;
+		}
 	}
 	up_read(&hmm->mirrors_sem);
 
 out:
+	if (ret) {
+		spin_lock_irqsave(&hmm->ranges_lock, flags);
+		notifiers_decrement(hmm);
+		spin_unlock_irqrestore(&hmm->ranges_lock, flags);
+	}
 	hmm_put(hmm);
 	return ret;
 }
@@ -197,23 +221,14 @@ static void hmm_invalidate_range_end(struct mmu_notifier *mn,
 			const struct mmu_notifier_range *nrange)
 {
 	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
+	unsigned long flags;
 
 	if (!kref_get_unless_zero(&hmm->kref))
 		return;
 
-	mutex_lock(&hmm->lock);
-	hmm->notifiers--;
-	if (!hmm->notifiers) {
-		struct hmm_range *range;
-
-		list_for_each_entry(range, &hmm->ranges, list) {
-			if (range->valid)
-				continue;
-			range->valid = true;
-		}
-		wake_up_all(&hmm->wq);
-	}
-	mutex_unlock(&hmm->lock);
+	spin_lock_irqsave(&hmm->ranges_lock, flags);
+	notifiers_decrement(hmm);
+	spin_unlock_irqrestore(&hmm->ranges_lock, flags);
 
 	hmm_put(hmm);
 }
@@ -866,6 +881,7 @@ int hmm_range_register(struct hmm_range *range,
 {
 	unsigned long mask = ((1UL << page_shift) - 1UL);
 	struct hmm *hmm = mirror->hmm;
+	unsigned long flags;
 
 	range->valid = false;
 	range->hmm = NULL;
@@ -887,7 +903,7 @@ int hmm_range_register(struct hmm_range *range,
 	kref_get(&hmm->kref);
 
 	/* Initialize range to track CPU page table updates. */
-	mutex_lock(&hmm->lock);
+	spin_lock_irqsave(&hmm->ranges_lock, flags);
 
 	range->hmm = hmm;
 	list_add(&range->list, &hmm->ranges);
@@ -898,7 +914,7 @@ int hmm_range_register(struct hmm_range *range,
 	 */
 	if (!hmm->notifiers)
 		range->valid = true;
-	mutex_unlock(&hmm->lock);
+	spin_unlock_irqrestore(&hmm->ranges_lock, flags);
 
 	return 0;
 }
@@ -914,13 +930,14 @@ EXPORT_SYMBOL(hmm_range_register);
 void hmm_range_unregister(struct hmm_range *range)
 {
 	struct hmm *hmm = range->hmm;
+	unsigned long flags;
 
 	if (WARN_ON(range->end <= range->start))
 		return;
 
-	mutex_lock(&hmm->lock);
+	spin_lock_irqsave(&hmm->ranges_lock, flags);
 	list_del(&range->list);
-	mutex_unlock(&hmm->lock);
+	spin_unlock_irqrestore(&hmm->ranges_lock, flags);
 
 	/* Drop reference taken by hmm_range_register() */
 	range->valid = false;
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 01/11] mm/hmm: fix use after free with struct hmm in the mmu notifiers
  2019-06-06 18:44 ` [PATCH v2 hmm 01/11] mm/hmm: fix use after free with struct hmm in the mmu notifiers Jason Gunthorpe
  2019-06-07  2:29   ` John Hubbard
@ 2019-06-07 18:12   ` Ralph Campbell
  2019-06-08  8:49   ` Christoph Hellwig
  2 siblings, 0 replies; 79+ messages in thread
From: Ralph Campbell @ 2019-06-07 18:12 UTC (permalink / raw)
  To: Jason Gunthorpe, Jerome Glisse, John Hubbard, Felix.Kuehling
  Cc: linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx,
	Jason Gunthorpe



On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> mmu_notifier_unregister_no_release() is not a fence and the mmu_notifier
> system will continue to reference hmm->mn until the srcu grace period
> expires.
> 
> Resulting in use after free races like this:
> 
>           CPU0                                     CPU1
>                                                 __mmu_notifier_invalidate_range_start()
>                                                   srcu_read_lock
>                                                   hlist_for_each ()
>                                                     // mn == hmm->mn
> hmm_mirror_unregister()
>    hmm_put()
>      hmm_free()
>        mmu_notifier_unregister_no_release()
>           hlist_del_init_rcu(hmm-mn->list)
> 			                           mn->ops->invalidate_range_start(mn, range);
> 					             mm_get_hmm()
>        mm->hmm = NULL;
>        kfree(hmm)
>                                                       mutex_lock(&hmm->lock);
> 
> Use SRCU to kfree the hmm memory so that the notifiers can rely on hmm
> existing. Get the now-safe hmm struct through container_of and directly
> check kref_get_unless_zero to lock it against free.
> 
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

You can add
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>

> ---
> v2:
> - Spell 'free' properly (Jerome/Ralph)
> ---
>   include/linux/hmm.h |  1 +
>   mm/hmm.c            | 25 +++++++++++++++++++------
>   2 files changed, 20 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 092f0234bfe917..688c5ca7068795 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -102,6 +102,7 @@ struct hmm {
>   	struct mmu_notifier	mmu_notifier;
>   	struct rw_semaphore	mirrors_sem;
>   	wait_queue_head_t	wq;
> +	struct rcu_head		rcu;
>   	long			notifiers;
>   	bool			dead;
>   };
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 8e7403f081f44a..547002f56a163d 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -113,6 +113,11 @@ static struct hmm *hmm_get_or_create(struct mm_struct *mm)
>   	return NULL;
>   }
>   
> +static void hmm_free_rcu(struct rcu_head *rcu)
> +{
> +	kfree(container_of(rcu, struct hmm, rcu));
> +}
> +
>   static void hmm_free(struct kref *kref)
>   {
>   	struct hmm *hmm = container_of(kref, struct hmm, kref);
> @@ -125,7 +130,7 @@ static void hmm_free(struct kref *kref)
>   		mm->hmm = NULL;
>   	spin_unlock(&mm->page_table_lock);
>   
> -	kfree(hmm);
> +	mmu_notifier_call_srcu(&hmm->rcu, hmm_free_rcu);
>   }
>   
>   static inline void hmm_put(struct hmm *hmm)
> @@ -153,10 +158,14 @@ void hmm_mm_destroy(struct mm_struct *mm)
>   
>   static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>   {
> -	struct hmm *hmm = mm_get_hmm(mm);
> +	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
>   	struct hmm_mirror *mirror;
>   	struct hmm_range *range;
>   
> +	/* hmm is in progress to free */
> +	if (!kref_get_unless_zero(&hmm->kref))
> +		return;
> +
>   	/* Report this HMM as dying. */
>   	hmm->dead = true;
>   
> @@ -194,13 +203,15 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>   static int hmm_invalidate_range_start(struct mmu_notifier *mn,
>   			const struct mmu_notifier_range *nrange)
>   {
> -	struct hmm *hmm = mm_get_hmm(nrange->mm);
> +	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
>   	struct hmm_mirror *mirror;
>   	struct hmm_update update;
>   	struct hmm_range *range;
>   	int ret = 0;
>   
> -	VM_BUG_ON(!hmm);
> +	/* hmm is in progress to free */
> +	if (!kref_get_unless_zero(&hmm->kref))
> +		return 0;
>   
>   	update.start = nrange->start;
>   	update.end = nrange->end;
> @@ -245,9 +256,11 @@ static int hmm_invalidate_range_start(struct mmu_notifier *mn,
>   static void hmm_invalidate_range_end(struct mmu_notifier *mn,
>   			const struct mmu_notifier_range *nrange)
>   {
> -	struct hmm *hmm = mm_get_hmm(nrange->mm);
> +	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
>   
> -	VM_BUG_ON(!hmm);
> +	/* hmm is in progress to free */
> +	if (!kref_get_unless_zero(&hmm->kref))
> +		return;
>   
>   	mutex_lock(&hmm->lock);
>   	hmm->notifiers--;
> 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 02/11] mm/hmm: Use hmm_mirror not mm as an argument for hmm_range_register
  2019-06-06 18:44 ` [PATCH v2 hmm 02/11] mm/hmm: Use hmm_mirror not mm as an argument for hmm_range_register Jason Gunthorpe
  2019-06-07  2:36   ` John Hubbard
@ 2019-06-07 18:24   ` Ralph Campbell
  2019-06-07 22:39     ` Ralph Campbell
  2019-06-07 22:33   ` Ira Weiny
  2019-06-08  8:54   ` Christoph Hellwig
  3 siblings, 1 reply; 79+ messages in thread
From: Ralph Campbell @ 2019-06-07 18:24 UTC (permalink / raw)
  To: Jason Gunthorpe, Jerome Glisse, John Hubbard, Felix.Kuehling
  Cc: linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx,
	Jason Gunthorpe


On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> Ralph observes that hmm_range_register() can only be called by a driver
> while a mirror is registered. Make this clear in the API by passing in the
> mirror structure as a parameter.
> 
> This also simplifies understanding the lifetime model for struct hmm, as
> the hmm pointer must be valid as part of a registered mirror so all we
> need in hmm_range_register() is a simple kref_get.
> 
> Suggested-by: Ralph Campbell <rcampbell@nvidia.com>
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

You might CC Ben for the nouveau part.
CC: Ben Skeggs <bskeggs@redhat.com>

Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>


> ---
> v2
> - Include the oneline patch to nouveau_svm.c
> ---
>   drivers/gpu/drm/nouveau/nouveau_svm.c |  2 +-
>   include/linux/hmm.h                   |  7 ++++---
>   mm/hmm.c                              | 15 ++++++---------
>   3 files changed, 11 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
> index 93ed43c413f0bb..8c92374afcf227 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_svm.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
> @@ -649,7 +649,7 @@ nouveau_svm_fault(struct nvif_notify *notify)
>   		range.values = nouveau_svm_pfn_values;
>   		range.pfn_shift = NVIF_VMM_PFNMAP_V0_ADDR_SHIFT;
>   again:
> -		ret = hmm_vma_fault(&range, true);
> +		ret = hmm_vma_fault(&svmm->mirror, &range, true);
>   		if (ret == 0) {
>   			mutex_lock(&svmm->mutex);
>   			if (!hmm_vma_range_done(&range)) {
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 688c5ca7068795..2d519797cb134a 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -505,7 +505,7 @@ static inline bool hmm_mirror_mm_is_alive(struct hmm_mirror *mirror)
>    * Please see Documentation/vm/hmm.rst for how to use the range API.
>    */
>   int hmm_range_register(struct hmm_range *range,
> -		       struct mm_struct *mm,
> +		       struct hmm_mirror *mirror,
>   		       unsigned long start,
>   		       unsigned long end,
>   		       unsigned page_shift);
> @@ -541,7 +541,8 @@ static inline bool hmm_vma_range_done(struct hmm_range *range)
>   }
>   
>   /* This is a temporary helper to avoid merge conflict between trees. */
> -static inline int hmm_vma_fault(struct hmm_range *range, bool block)
> +static inline int hmm_vma_fault(struct hmm_mirror *mirror,
> +				struct hmm_range *range, bool block)
>   {
>   	long ret;
>   
> @@ -554,7 +555,7 @@ static inline int hmm_vma_fault(struct hmm_range *range, bool block)
>   	range->default_flags = 0;
>   	range->pfn_flags_mask = -1UL;
>   
> -	ret = hmm_range_register(range, range->vma->vm_mm,
> +	ret = hmm_range_register(range, mirror,
>   				 range->start, range->end,
>   				 PAGE_SHIFT);
>   	if (ret)
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 547002f56a163d..8796447299023c 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -925,13 +925,13 @@ static void hmm_pfns_clear(struct hmm_range *range,
>    * Track updates to the CPU page table see include/linux/hmm.h
>    */
>   int hmm_range_register(struct hmm_range *range,
> -		       struct mm_struct *mm,
> +		       struct hmm_mirror *mirror,
>   		       unsigned long start,
>   		       unsigned long end,
>   		       unsigned page_shift)
>   {
>   	unsigned long mask = ((1UL << page_shift) - 1UL);
> -	struct hmm *hmm;
> +	struct hmm *hmm = mirror->hmm;
>   
>   	range->valid = false;
>   	range->hmm = NULL;
> @@ -945,15 +945,12 @@ int hmm_range_register(struct hmm_range *range,
>   	range->start = start;
>   	range->end = end;
>   
> -	hmm = hmm_get_or_create(mm);
> -	if (!hmm)
> -		return -EFAULT;
> -
>   	/* Check if hmm_mm_destroy() was call. */
> -	if (hmm->mm == NULL || hmm->dead) {
> -		hmm_put(hmm);
> +	if (hmm->mm == NULL || hmm->dead)
>   		return -EFAULT;
> -	}
> +
> +	range->hmm = hmm;
> +	kref_get(&hmm->kref);
>   
>   	/* Initialize range to track CPU page table updates. */
>   	mutex_lock(&hmm->lock);
> 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 03/11] mm/hmm: Hold a mmgrab from hmm to mm
  2019-06-06 18:44 ` [PATCH v2 hmm 03/11] mm/hmm: Hold a mmgrab from hmm to mm Jason Gunthorpe
  2019-06-07  2:44   ` John Hubbard
@ 2019-06-07 18:41   ` Ralph Campbell
  2019-06-07 18:51     ` Jason Gunthorpe
  2019-06-07 22:38   ` Ira Weiny
  2 siblings, 1 reply; 79+ messages in thread
From: Ralph Campbell @ 2019-06-07 18:41 UTC (permalink / raw)
  To: Jason Gunthorpe, Jerome Glisse, John Hubbard, Felix.Kuehling
  Cc: linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx,
	Jason Gunthorpe


On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> So long a a struct hmm pointer exists, so should the struct mm it is

s/a a/as a/

> linked too. Hold the mmgrab() as soon as a hmm is created, and mmdrop() it
> once the hmm refcount goes to zero.
> 
> Since mmdrop() (ie a 0 kref on struct mm) is now impossible with a !NULL
> mm->hmm delete the hmm_hmm_destroy().
> 
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>

> ---
> v2:
>   - Fix error unwind paths in hmm_get_or_create (Jerome/Jason)
> ---
>   include/linux/hmm.h |  3 ---
>   kernel/fork.c       |  1 -
>   mm/hmm.c            | 22 ++++------------------
>   3 files changed, 4 insertions(+), 22 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 2d519797cb134a..4ee3acabe5ed22 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -586,14 +586,11 @@ static inline int hmm_vma_fault(struct hmm_mirror *mirror,
>   }
>   
>   /* Below are for HMM internal use only! Not to be used by device driver! */
> -void hmm_mm_destroy(struct mm_struct *mm);
> -
>   static inline void hmm_mm_init(struct mm_struct *mm)
>   {
>   	mm->hmm = NULL;
>   }
>   #else /* IS_ENABLED(CONFIG_HMM_MIRROR) */
> -static inline void hmm_mm_destroy(struct mm_struct *mm) {}
>   static inline void hmm_mm_init(struct mm_struct *mm) {}
>   #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
>   
> diff --git a/kernel/fork.c b/kernel/fork.c
> index b2b87d450b80b5..588c768ae72451 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -673,7 +673,6 @@ void __mmdrop(struct mm_struct *mm)
>   	WARN_ON_ONCE(mm == current->active_mm);
>   	mm_free_pgd(mm);
>   	destroy_context(mm);
> -	hmm_mm_destroy(mm);
>   	mmu_notifier_mm_destroy(mm);
>   	check_mm(mm);
>   	put_user_ns(mm->user_ns);
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 8796447299023c..cc7c26fda3300e 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -29,6 +29,7 @@
>   #include <linux/swapops.h>
>   #include <linux/hugetlb.h>
>   #include <linux/memremap.h>
> +#include <linux/sched/mm.h>
>   #include <linux/jump_label.h>
>   #include <linux/dma-mapping.h>
>   #include <linux/mmu_notifier.h>
> @@ -82,6 +83,7 @@ static struct hmm *hmm_get_or_create(struct mm_struct *mm)
>   	hmm->notifiers = 0;
>   	hmm->dead = false;
>   	hmm->mm = mm;
> +	mmgrab(hmm->mm);
>   
>   	spin_lock(&mm->page_table_lock);
>   	if (!mm->hmm)
> @@ -109,6 +111,7 @@ static struct hmm *hmm_get_or_create(struct mm_struct *mm)
>   		mm->hmm = NULL;
>   	spin_unlock(&mm->page_table_lock);
>   error:
> +	mmdrop(hmm->mm);
>   	kfree(hmm);
>   	return NULL;
>   }
> @@ -130,6 +133,7 @@ static void hmm_free(struct kref *kref)
>   		mm->hmm = NULL;
>   	spin_unlock(&mm->page_table_lock);
>   
> +	mmdrop(hmm->mm);
>   	mmu_notifier_call_srcu(&hmm->rcu, hmm_free_rcu);
>   }
>   
> @@ -138,24 +142,6 @@ static inline void hmm_put(struct hmm *hmm)
>   	kref_put(&hmm->kref, hmm_free);
>   }
>   
> -void hmm_mm_destroy(struct mm_struct *mm)
> -{
> -	struct hmm *hmm;
> -
> -	spin_lock(&mm->page_table_lock);
> -	hmm = mm_get_hmm(mm);
> -	mm->hmm = NULL;
> -	if (hmm) {
> -		hmm->mm = NULL;
> -		hmm->dead = true;
> -		spin_unlock(&mm->page_table_lock);
> -		hmm_put(hmm);
> -		return;
> -	}
> -
> -	spin_unlock(&mm->page_table_lock);
> -}
> -
>   static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>   {
>   	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
> 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 03/11] mm/hmm: Hold a mmgrab from hmm to mm
  2019-06-07 18:41   ` Ralph Campbell
@ 2019-06-07 18:51     ` Jason Gunthorpe
  0 siblings, 0 replies; 79+ messages in thread
From: Jason Gunthorpe @ 2019-06-07 18:51 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: Jerome Glisse, John Hubbard, Felix.Kuehling, linux-rdma,
	linux-mm, Andrea Arcangeli, dri-devel, amd-gfx

On Fri, Jun 07, 2019 at 11:41:20AM -0700, Ralph Campbell wrote:
> 
> On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> > From: Jason Gunthorpe <jgg@mellanox.com>
> > 
> > So long a a struct hmm pointer exists, so should the struct mm it is
> 
> s/a a/as a/

Got it, thanks

Jason


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 04/11] mm/hmm: Simplify hmm_get_or_create and make it reliable
  2019-06-06 18:44 ` [PATCH v2 hmm 04/11] mm/hmm: Simplify hmm_get_or_create and make it reliable Jason Gunthorpe
  2019-06-07  2:54   ` John Hubbard
@ 2019-06-07 18:52   ` Ralph Campbell
  2019-06-07 22:44   ` Ira Weiny
  2 siblings, 0 replies; 79+ messages in thread
From: Ralph Campbell @ 2019-06-07 18:52 UTC (permalink / raw)
  To: Jason Gunthorpe, Jerome Glisse, John Hubbard, Felix.Kuehling
  Cc: linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx,
	Jason Gunthorpe


On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> As coded this function can false-fail in various racy situations. Make it
> reliable by running only under the write side of the mmap_sem and avoiding
> the false-failing compare/exchange pattern.
> 
> Also make the locking very easy to understand by only ever reading or
> writing mm->hmm while holding the write side of the mmap_sem.
> 
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>

> ---
> v2:
> - Fix error unwind of mmgrab (Jerome)
> - Use hmm local instead of 2nd container_of (Jerome)
> ---
>   mm/hmm.c | 80 ++++++++++++++++++++------------------------------------
>   1 file changed, 29 insertions(+), 51 deletions(-)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index cc7c26fda3300e..dc30edad9a8a02 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -40,16 +40,6 @@
>   #if IS_ENABLED(CONFIG_HMM_MIRROR)
>   static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
>   
> -static inline struct hmm *mm_get_hmm(struct mm_struct *mm)
> -{
> -	struct hmm *hmm = READ_ONCE(mm->hmm);
> -
> -	if (hmm && kref_get_unless_zero(&hmm->kref))
> -		return hmm;
> -
> -	return NULL;
> -}
> -
>   /**
>    * hmm_get_or_create - register HMM against an mm (HMM internal)
>    *
> @@ -64,11 +54,20 @@ static inline struct hmm *mm_get_hmm(struct mm_struct *mm)
>    */
>   static struct hmm *hmm_get_or_create(struct mm_struct *mm)
>   {
> -	struct hmm *hmm = mm_get_hmm(mm);
> -	bool cleanup = false;
> +	struct hmm *hmm;
>   
> -	if (hmm)
> -		return hmm;
> +	lockdep_assert_held_exclusive(&mm->mmap_sem);
> +
> +	if (mm->hmm) {
> +		if (kref_get_unless_zero(&mm->hmm->kref))
> +			return mm->hmm;
> +		/*
> +		 * The hmm is being freed by some other CPU and is pending a
> +		 * RCU grace period, but this CPU can NULL now it since we
> +		 * have the mmap_sem.
> +		 */
> +		mm->hmm = NULL;
> +	}
>   
>   	hmm = kmalloc(sizeof(*hmm), GFP_KERNEL);
>   	if (!hmm)
> @@ -83,57 +82,36 @@ static struct hmm *hmm_get_or_create(struct mm_struct *mm)
>   	hmm->notifiers = 0;
>   	hmm->dead = false;
>   	hmm->mm = mm;
> -	mmgrab(hmm->mm);
> -
> -	spin_lock(&mm->page_table_lock);
> -	if (!mm->hmm)
> -		mm->hmm = hmm;
> -	else
> -		cleanup = true;
> -	spin_unlock(&mm->page_table_lock);
>   
> -	if (cleanup)
> -		goto error;
> -
> -	/*
> -	 * We should only get here if hold the mmap_sem in write mode ie on
> -	 * registration of first mirror through hmm_mirror_register()
> -	 */
>   	hmm->mmu_notifier.ops = &hmm_mmu_notifier_ops;
> -	if (__mmu_notifier_register(&hmm->mmu_notifier, mm))
> -		goto error_mm;
> +	if (__mmu_notifier_register(&hmm->mmu_notifier, mm)) {
> +		kfree(hmm);
> +		return NULL;
> +	}
>   
> +	mmgrab(hmm->mm);
> +	mm->hmm = hmm;
>   	return hmm;
> -
> -error_mm:
> -	spin_lock(&mm->page_table_lock);
> -	if (mm->hmm == hmm)
> -		mm->hmm = NULL;
> -	spin_unlock(&mm->page_table_lock);
> -error:
> -	mmdrop(hmm->mm);
> -	kfree(hmm);
> -	return NULL;
>   }
>   
>   static void hmm_free_rcu(struct rcu_head *rcu)
>   {
> -	kfree(container_of(rcu, struct hmm, rcu));
> +	struct hmm *hmm = container_of(rcu, struct hmm, rcu);
> +
> +	down_write(&hmm->mm->mmap_sem);
> +	if (hmm->mm->hmm == hmm)
> +		hmm->mm->hmm = NULL;
> +	up_write(&hmm->mm->mmap_sem);
> +	mmdrop(hmm->mm);
> +
> +	kfree(hmm);
>   }
>   
>   static void hmm_free(struct kref *kref)
>   {
>   	struct hmm *hmm = container_of(kref, struct hmm, kref);
> -	struct mm_struct *mm = hmm->mm;
> -
> -	mmu_notifier_unregister_no_release(&hmm->mmu_notifier, mm);
>   
> -	spin_lock(&mm->page_table_lock);
> -	if (mm->hmm == hmm)
> -		mm->hmm = NULL;
> -	spin_unlock(&mm->page_table_lock);
> -
> -	mmdrop(hmm->mm);
> +	mmu_notifier_unregister_no_release(&hmm->mmu_notifier, hmm->mm);
>   	mmu_notifier_call_srcu(&hmm->rcu, hmm_free_rcu);
>   }
>   
> 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 05/11] mm/hmm: Remove duplicate condition test before wait_event_timeout
  2019-06-06 18:44 ` [PATCH v2 hmm 05/11] mm/hmm: Remove duplicate condition test before wait_event_timeout Jason Gunthorpe
  2019-06-07  3:06   ` John Hubbard
@ 2019-06-07 19:01   ` Ralph Campbell
  2019-06-07 19:13     ` Jason Gunthorpe
  1 sibling, 1 reply; 79+ messages in thread
From: Ralph Campbell @ 2019-06-07 19:01 UTC (permalink / raw)
  To: Jason Gunthorpe, Jerome Glisse, John Hubbard, Felix.Kuehling
  Cc: linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx,
	Jason Gunthorpe


On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> The wait_event_timeout macro already tests the condition as its first
> action, so there is no reason to open code another version of this, all
> that does is skip the might_sleep() debugging in common cases, which is
> not helpful.
> 
> Further, based on prior patches, we can now simplify the required condition
> test:
>   - If range is valid memory then so is range->hmm
>   - If hmm_release() has run then range->valid is set to false
>     at the same time as dead, so no reason to check both.
>   - A valid hmm has a valid hmm->mm.
> 
> Also, add the READ_ONCE for range->valid as there is no lock held here.
> 
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> ---
>   include/linux/hmm.h | 12 ++----------
>   1 file changed, 2 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 4ee3acabe5ed22..2ab35b40992b24 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -218,17 +218,9 @@ static inline unsigned long hmm_range_page_size(const struct hmm_range *range)
>   static inline bool hmm_range_wait_until_valid(struct hmm_range *range,
>   					      unsigned long timeout)
>   {
> -	/* Check if mm is dead ? */
> -	if (range->hmm == NULL || range->hmm->dead || range->hmm->mm == NULL) {
> -		range->valid = false;
> -		return false;
> -	}
> -	if (range->valid)
> -		return true;
> -	wait_event_timeout(range->hmm->wq, range->valid || range->hmm->dead,
> +	wait_event_timeout(range->hmm->wq, range->valid,
>   			   msecs_to_jiffies(timeout));
> -	/* Return current valid status just in case we get lucky */
> -	return range->valid;
> +	return READ_ONCE(range->valid);
>   }
>   
>   /*
> 

Since we are simplifying things, perhaps we should consider merging
hmm_range_wait_until_valid() into hmm_range_register() and
removing hmm_range_wait_until_valid() since the pattern
is to always call the two together.
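
For illustration, a minimal sketch of what the merged call might look
like (the helper name and the -EBUSY return are made up here, nothing
like this exists yet):

int hmm_range_register_and_wait(struct hmm_range *range,
				struct hmm_mirror *mirror,
				unsigned long start, unsigned long end,
				unsigned page_shift, unsigned long timeout)
{
	int ret;

	ret = hmm_range_register(range, mirror, start, end, page_shift);
	if (ret)
		return ret;
	/* Wait for any colliding invalidation to clear before proceeding */
	if (!hmm_range_wait_until_valid(range, timeout)) {
		hmm_range_unregister(range);
		return -EBUSY;
	}
	return 0;
}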

In any case, this looks OK to me so you can add
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 05/11] mm/hmm: Remove duplicate condition test before wait_event_timeout
  2019-06-07 19:01   ` [PATCH v2 " Ralph Campbell
@ 2019-06-07 19:13     ` Jason Gunthorpe
  2019-06-07 20:21       ` Ralph Campbell
  0 siblings, 1 reply; 79+ messages in thread
From: Jason Gunthorpe @ 2019-06-07 19:13 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: Jerome Glisse, John Hubbard, Felix.Kuehling, linux-rdma,
	linux-mm, Andrea Arcangeli, dri-devel, amd-gfx

On Fri, Jun 07, 2019 at 12:01:45PM -0700, Ralph Campbell wrote:
> 
> On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> > From: Jason Gunthorpe <jgg@mellanox.com>
> > 
> > The wait_event_timeout macro already tests the condition as its first
> > action, so there is no reason to open code another version of this, all
> > that does is skip the might_sleep() debugging in common cases, which is
> > not helpful.
> > 
> > Further, based on prior patches, we can now simplify the required condition
> > test:
> >   - If range is valid memory then so is range->hmm
> >   - If hmm_release() has run then range->valid is set to false
> >     at the same time as dead, so no reason to check both.
> >   - A valid hmm has a valid hmm->mm.
> > 
> > Also, add the READ_ONCE for range->valid as there is no lock held here.
> > 
> > Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> > Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> >   include/linux/hmm.h | 12 ++----------
> >   1 file changed, 2 insertions(+), 10 deletions(-)
> > 
> > diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> > index 4ee3acabe5ed22..2ab35b40992b24 100644
> > +++ b/include/linux/hmm.h
> > @@ -218,17 +218,9 @@ static inline unsigned long hmm_range_page_size(const struct hmm_range *range)
> >   static inline bool hmm_range_wait_until_valid(struct hmm_range *range,
> >   					      unsigned long timeout)
> >   {
> > -	/* Check if mm is dead ? */
> > -	if (range->hmm == NULL || range->hmm->dead || range->hmm->mm == NULL) {
> > -		range->valid = false;
> > -		return false;
> > -	}
> > -	if (range->valid)
> > -		return true;
> > -	wait_event_timeout(range->hmm->wq, range->valid || range->hmm->dead,
> > +	wait_event_timeout(range->hmm->wq, range->valid,
> >   			   msecs_to_jiffies(timeout));
> > -	/* Return current valid status just in case we get lucky */
> > -	return range->valid;
> > +	return READ_ONCE(range->valid);
> >   }
> >   /*
> > 
> 
> Since we are simplifying things, perhaps we should consider merging
> hmm_range_wait_until_valid() into hmm_range_register() and
> removing hmm_range_wait_until_valid() since the pattern
> is to always call the two together.

? the hmm.rst shows the hmm_range_wait_until_valid being called in the
(ret == -EAGAIN) path. It is confusing because it should really just
have the again label moved up above hmm_range_wait_until_valid() as
even if we get the driver lock it could still be a long wait for the
colliding invalidation to clear.

What I want to get to is a pattern like this:

pagefault():

   hmm_range_register(&range);
again:
   /* On the slow path, if we appear to be live locked then we get
      the write side of mmap_sem which will break the live lock,
      otherwise this gets the read lock */
   if (hmm_range_start_and_lock(&range))
         goto err;

   lockdep_assert_held(range->mm->mmap_sem);

   // Optional: Avoid useless expensive work
   if (hmm_range_needs_retry(&range))
      goto again;
   hmm_range_(touch vmas)

   take_lock(driver->update);
   if (hmm_range_end(&range) {
       release_lock(driver->update);
       goto again;
   }
   // Finish driver updates
   release_lock(driver->update);

   // Releases mmap_sem
   hmm_range_unregister_and_unlock(&range);

What do you think? 

Is it clear?

Jason


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 05/11] mm/hmm: Remove duplicate condition test before wait_event_timeout
  2019-06-07 19:13     ` Jason Gunthorpe
@ 2019-06-07 20:21       ` Ralph Campbell
  2019-06-07 20:44         ` Jason Gunthorpe
  0 siblings, 1 reply; 79+ messages in thread
From: Ralph Campbell @ 2019-06-07 20:21 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jerome Glisse, John Hubbard, Felix.Kuehling, linux-rdma,
	linux-mm, Andrea Arcangeli, dri-devel, amd-gfx


On 6/7/19 12:13 PM, Jason Gunthorpe wrote:
> On Fri, Jun 07, 2019 at 12:01:45PM -0700, Ralph Campbell wrote:
>>
>> On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
>>> From: Jason Gunthorpe <jgg@mellanox.com>
>>>
>>> The wait_event_timeout macro already tests the condition as its first
>>> action, so there is no reason to open code another version of this, all
>>> that does is skip the might_sleep() debugging in common cases, which is
>>> not helpful.
>>>
>>> Further, based on prior patches, we can now simplify the required condition
>>> test:
>>>    - If range is valid memory then so is range->hmm
>>>    - If hmm_release() has run then range->valid is set to false
>>>      at the same time as dead, so no reason to check both.
>>>    - A valid hmm has a valid hmm->mm.
>>>
>>> Also, add the READ_ONCE for range->valid as there is no lock held here.
>>>
>>> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
>>> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
>>>    include/linux/hmm.h | 12 ++----------
>>>    1 file changed, 2 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
>>> index 4ee3acabe5ed22..2ab35b40992b24 100644
>>> +++ b/include/linux/hmm.h
>>> @@ -218,17 +218,9 @@ static inline unsigned long hmm_range_page_size(const struct hmm_range *range)
>>>    static inline bool hmm_range_wait_until_valid(struct hmm_range *range,
>>>    					      unsigned long timeout)
>>>    {
>>> -	/* Check if mm is dead ? */
>>> -	if (range->hmm == NULL || range->hmm->dead || range->hmm->mm == NULL) {
>>> -		range->valid = false;
>>> -		return false;
>>> -	}
>>> -	if (range->valid)
>>> -		return true;
>>> -	wait_event_timeout(range->hmm->wq, range->valid || range->hmm->dead,
>>> +	wait_event_timeout(range->hmm->wq, range->valid,
>>>    			   msecs_to_jiffies(timeout));
>>> -	/* Return current valid status just in case we get lucky */
>>> -	return range->valid;
>>> +	return READ_ONCE(range->valid);
>>>    }
>>>    /*
>>>
>>
>> Since we are simplifying things, perhaps we should consider merging
>> hmm_range_wait_until_valid() into hmm_range_register() and
>> removing hmm_range_wait_until_valid() since the pattern
>> is to always call the two together.
> 
> ? the hmm.rst shows the hmm_range_wait_until_valid being called in the
> (ret == -EAGAIN) path. It is confusing because it should really just
> have the again label moved up above hmm_range_wait_until_valid() as
> even if we get the driver lock it could still be a long wait for the
> colliding invalidation to clear.
> 
> What I want to get to is a pattern like this:
> 
> pagefault():
> 
>     hmm_range_register(&range);
> again:
>     /* On the slow path, if we appear to be live locked then we get
>        the write side of mmap_sem which will break the live lock,
>        otherwise this gets the read lock */
>     if (hmm_range_start_and_lock(&range))
>           goto err;
> 
>     lockdep_assert_held(range->mm->mmap_sem);
> 
>     // Optional: Avoid useless expensive work
>     if (hmm_range_needs_retry(&range))
>        goto again;
>     hmm_range_(touch vmas)
> 
>     take_lock(driver->update);
>     if (hmm_range_end(&range) {
>         release_lock(driver->update);
>         goto again;
>     }
>     // Finish driver updates
>     release_lock(driver->update);
> 
>     // Releases mmap_sem
>     hmm_range_unregister_and_unlock(&range);
> 
> What do you think?
> 
> Is it clear?
> 
> Jason
> 

Are you talking about acquiring mmap_sem in hmm_range_start_and_lock()?
Usually, the fault code has to lock mmap_sem for read in order to
call find_vma() so it can set range.vma.
If HMM drops mmap_sem - which I don't think it should, just return an
error to tell the caller to drop mmap_sem and retry - the find_vma()
will need to be repeated as well.
I'm also not sure about acquiring the mmap_sem for write as way to
mitigate thrashing. It seems to me that if a device and a CPU are
both faulting on the same page, some sort of backoff delay is needed
to let one side or the other make some progress.

Thrashing mitigation and how migrate_vma() plays in this is a
deep topic for thought.


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 06/11] mm/hmm: Hold on to the mmget for the lifetime of the range
  2019-06-06 18:44 ` [PATCH v2 hmm 06/11] mm/hmm: Hold on to the mmget for the lifetime of the range Jason Gunthorpe
  2019-06-07  3:15   ` John Hubbard
@ 2019-06-07 20:29   ` Ralph Campbell
  1 sibling, 0 replies; 79+ messages in thread
From: Ralph Campbell @ 2019-06-07 20:29 UTC (permalink / raw)
  To: Jason Gunthorpe, Jerome Glisse, John Hubbard, Felix.Kuehling
  Cc: linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx,
	Jason Gunthorpe


On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> Range functions like hmm_range_snapshot() and hmm_range_fault() call
> find_vma, which requires holding the mmget() and the mmap_sem for the mm.
> 
> Make this simpler for the callers by holding the mmget() inside the range
> for the lifetime of the range. Other functions that accept a range should
> only be called if the range is registered.
> 
> This has the side effect of directly preventing hmm_release() from
> happening while a range is registered. That means range->dead cannot be
> false during the lifetime of the range, so remove dead and
> hmm_mirror_mm_is_alive() entirely.
> 
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

Looks good to me.
Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>

> ---
> v2:
>   - Use Jerome's idea of just holding the mmget() for the range lifetime,
>     rework the patch to use that as a simplification to remove dead in
>     one step
> ---
>   include/linux/hmm.h | 26 --------------------------
>   mm/hmm.c            | 28 ++++++++++------------------
>   2 files changed, 10 insertions(+), 44 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 2ab35b40992b24..0e20566802967a 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -91,7 +91,6 @@
>    * @mirrors_sem: read/write semaphore protecting the mirrors list
>    * @wq: wait queue for user waiting on a range invalidation
>    * @notifiers: count of active mmu notifiers
> - * @dead: is the mm dead ?
>    */
>   struct hmm {
>   	struct mm_struct	*mm;
> @@ -104,7 +103,6 @@ struct hmm {
>   	wait_queue_head_t	wq;
>   	struct rcu_head		rcu;
>   	long			notifiers;
> -	bool			dead;
>   };
>   
>   /*
> @@ -469,30 +467,6 @@ struct hmm_mirror {
>   int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
>   void hmm_mirror_unregister(struct hmm_mirror *mirror);
>   
> -/*
> - * hmm_mirror_mm_is_alive() - test if mm is still alive
> - * @mirror: the HMM mm mirror for which we want to lock the mmap_sem
> - * Return: false if the mm is dead, true otherwise
> - *
> - * This is an optimization, it will not always accurately return false if the
> - * mm is dead; i.e., there can be false negatives (process is being killed but
> - * HMM is not yet informed of that). It is only intended to be used to optimize
> - * out cases where the driver is about to do something time consuming and it
> - * would be better to skip it if the mm is dead.
> - */
> -static inline bool hmm_mirror_mm_is_alive(struct hmm_mirror *mirror)
> -{
> -	struct mm_struct *mm;
> -
> -	if (!mirror || !mirror->hmm)
> -		return false;
> -	mm = READ_ONCE(mirror->hmm->mm);
> -	if (mirror->hmm->dead || !mm)
> -		return false;
> -
> -	return true;
> -}
> -
>   /*
>    * Please see Documentation/vm/hmm.rst for how to use the range API.
>    */
> diff --git a/mm/hmm.c b/mm/hmm.c
> index dc30edad9a8a02..f67ba32983d9f1 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -80,7 +80,6 @@ static struct hmm *hmm_get_or_create(struct mm_struct *mm)
>   	mutex_init(&hmm->lock);
>   	kref_init(&hmm->kref);
>   	hmm->notifiers = 0;
> -	hmm->dead = false;
>   	hmm->mm = mm;
>   
>   	hmm->mmu_notifier.ops = &hmm_mmu_notifier_ops;
> @@ -124,20 +123,17 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>   {
>   	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
>   	struct hmm_mirror *mirror;
> -	struct hmm_range *range;
>   
>   	/* hmm is in progress to free */
>   	if (!kref_get_unless_zero(&hmm->kref))
>   		return;
>   
> -	/* Report this HMM as dying. */
> -	hmm->dead = true;
> -
> -	/* Wake-up everyone waiting on any range. */
>   	mutex_lock(&hmm->lock);
> -	list_for_each_entry(range, &hmm->ranges, list)
> -		range->valid = false;
> -	wake_up_all(&hmm->wq);
> +	/*
> +	 * Since hmm_range_register() holds the mmget() lock hmm_release() is
> +	 * prevented as long as a range exists.
> +	 */
> +	WARN_ON(!list_empty(&hmm->ranges));
>   	mutex_unlock(&hmm->lock);
>   
>   	down_write(&hmm->mirrors_sem);
> @@ -909,8 +905,8 @@ int hmm_range_register(struct hmm_range *range,
>   	range->start = start;
>   	range->end = end;
>   
> -	/* Check if hmm_mm_destroy() was call. */
> -	if (hmm->mm == NULL || hmm->dead)
> +	/* Prevent hmm_release() from running while the range is valid */
> +	if (!mmget_not_zero(hmm->mm))
>   		return -EFAULT;
>   
>   	range->hmm = hmm;
> @@ -955,6 +951,7 @@ void hmm_range_unregister(struct hmm_range *range)
>   
>   	/* Drop reference taken by hmm_range_register() */
>   	range->valid = false;
> +	mmput(hmm->mm);
>   	hmm_put(hmm);
>   	range->hmm = NULL;
>   }
> @@ -982,10 +979,7 @@ long hmm_range_snapshot(struct hmm_range *range)
>   	struct vm_area_struct *vma;
>   	struct mm_walk mm_walk;
>   
> -	/* Check if hmm_mm_destroy() was call. */
> -	if (hmm->mm == NULL || hmm->dead)
> -		return -EFAULT;
> -
> +	lockdep_assert_held(&hmm->mm->mmap_sem);
>   	do {
>   		/* If range is no longer valid force retry. */
>   		if (!range->valid)
> @@ -1080,9 +1074,7 @@ long hmm_range_fault(struct hmm_range *range, bool block)
>   	struct mm_walk mm_walk;
>   	int ret;
>   
> -	/* Check if hmm_mm_destroy() was call. */
> -	if (hmm->mm == NULL || hmm->dead)
> -		return -EFAULT;
> +	lockdep_assert_held(&hmm->mm->mmap_sem);
>   
>   	do {
>   		/* If range is no longer valid force retry. */
> 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 07/11] mm/hmm: Use lockdep instead of comments
  2019-06-06 18:44 ` [PATCH v2 hmm 07/11] mm/hmm: Use lockdep instead of comments Jason Gunthorpe
  2019-06-07  3:19   ` John Hubbard
@ 2019-06-07 20:31   ` Ralph Campbell
  2019-06-07 22:16   ` Souptick Joarder
  2 siblings, 0 replies; 79+ messages in thread
From: Ralph Campbell @ 2019-06-07 20:31 UTC (permalink / raw)
  To: Jason Gunthorpe, Jerome Glisse, John Hubbard, Felix.Kuehling
  Cc: linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx,
	Jason Gunthorpe


On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> So we can check locking at runtime.
> 
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>

> ---
> v2
> - Fix missing & in lockdeps (Jason)
> ---
>   mm/hmm.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index f67ba32983d9f1..c702cd72651b53 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -254,11 +254,11 @@ static const struct mmu_notifier_ops hmm_mmu_notifier_ops = {
>    *
>    * To start mirroring a process address space, the device driver must register
>    * an HMM mirror struct.
> - *
> - * THE mm->mmap_sem MUST BE HELD IN WRITE MODE !
>    */
>   int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm)
>   {
> +	lockdep_assert_held_exclusive(&mm->mmap_sem);
> +
>   	/* Sanity check */
>   	if (!mm || !mirror || !mirror->ops)
>   		return -EINVAL;
> 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 08/11] mm/hmm: Remove racy protection against double-unregistration
  2019-06-06 18:44 ` [PATCH v2 hmm 08/11] mm/hmm: Remove racy protection against double-unregistration Jason Gunthorpe
  2019-06-07  3:29   ` John Hubbard
@ 2019-06-07 20:33   ` Ralph Campbell
  1 sibling, 0 replies; 79+ messages in thread
From: Ralph Campbell @ 2019-06-07 20:33 UTC (permalink / raw)
  To: Jason Gunthorpe, Jerome Glisse, John Hubbard, Felix.Kuehling
  Cc: linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx,
	Jason Gunthorpe


On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> No other register/unregister kernel API attempts to provide this kind of
> protection as it is inherently racy, so just drop it.
> 
> Callers should provide their own protection, it appears nouveau already
> does, but just in case drop a debugging POISON.
> 
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>

> ---
>   mm/hmm.c | 9 ++-------
>   1 file changed, 2 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index c702cd72651b53..6802de7080d172 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -284,18 +284,13 @@ EXPORT_SYMBOL(hmm_mirror_register);
>    */
>   void hmm_mirror_unregister(struct hmm_mirror *mirror)
>   {
> -	struct hmm *hmm = READ_ONCE(mirror->hmm);
> -
> -	if (hmm == NULL)
> -		return;
> +	struct hmm *hmm = mirror->hmm;
>   
>   	down_write(&hmm->mirrors_sem);
>   	list_del_init(&mirror->list);
> -	/* To protect us against double unregister ... */
> -	mirror->hmm = NULL;
>   	up_write(&hmm->mirrors_sem);
> -
>   	hmm_put(hmm);
> +	memset(&mirror->hmm, POISON_INUSE, sizeof(mirror->hmm));
>   }
>   EXPORT_SYMBOL(hmm_mirror_unregister);
>   
> 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 05/11] mm/hmm: Remove duplicate condition test before wait_event_timeout
  2019-06-07 20:21       ` Ralph Campbell
@ 2019-06-07 20:44         ` Jason Gunthorpe
  2019-06-07 22:13           ` Ralph Campbell
  0 siblings, 1 reply; 79+ messages in thread
From: Jason Gunthorpe @ 2019-06-07 20:44 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: Jerome Glisse, John Hubbard, Felix.Kuehling, linux-rdma,
	linux-mm, Andrea Arcangeli, dri-devel, amd-gfx

On Fri, Jun 07, 2019 at 01:21:12PM -0700, Ralph Campbell wrote:

> > What I want to get to is a pattern like this:
> > 
> > pagefault():
> > 
> >     hmm_range_register(&range);
> > again:
> >     /* On the slow path, if we appear to be live locked then we get
> >        the write side of mmap_sem which will break the live lock,
> >        otherwise this gets the read lock */
> >     if (hmm_range_start_and_lock(&range))
> >           goto err;
> > 
> >     lockdep_assert_held(range->mm->mmap_sem);
> > 
> >     // Optional: Avoid useless expensive work
> >     if (hmm_range_needs_retry(&range))
> >        goto again;
> >     hmm_range_(touch vmas)
> > 
> >     take_lock(driver->update);
> >     if (hmm_range_end(&range) {
> >         release_lock(driver->update);
> >         goto again;
> >     }
> >     // Finish driver updates
> >     release_lock(driver->update);
> > 
> >     // Releases mmap_sem
> >     hmm_range_unregister_and_unlock(&range);
> > 
> > What do you think?
> > 
> > Is it clear?
> > 
> > Jason
> > 
> 
> Are you talking about acquiring mmap_sem in hmm_range_start_and_lock()?
> Usually, the fault code has to lock mmap_sem for read in order to
> call find_vma() so it can set range.vma.

> If HMM drops mmap_sem - which I don't think it should, just return an
> error to tell the caller to drop mmap_sem and retry - the find_vma()
> will need to be repeated as well.

Overall I don't think it makes a lot of sense to sleep for retry in
hmm_range_start_and_lock() while holding mmap_sem. It would be better
to drop that lock, sleep, then re-acquire it as part of the hmm logic.

The find_vma should be done inside the critical section created by
hmm_range_start_and_lock(), not before it. If we are retrying then we
already slept and the additional CPU cost to repeat the find_vma is
immaterial, IMHO?

Do you see a reason why the find_vma() ever needs to be before the
'again' in my above example? range.vma does not need to be set for
range_register.
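
To make the ordering concrete, a fragment in the same style as the
pattern above (hmm_range_start_and_lock() and friends are still the
hypothetical helpers from that sketch, none of this is in the tree):

   hmm_range_register(&range, mirror, start, end, PAGE_SHIFT);
again:
   if (hmm_range_start_and_lock(&range))   /* takes mmap_sem */
         goto err;

   /* The vma lookup sits inside the critical section, so a retry
      naturally repeats it */
   vma = find_vma(range.hmm->mm, range.start);
   if (!vma || range.start < vma->vm_start || range.end > vma->vm_end)
         goto err_unlock;

   /* ... touch the vma, publish to the device, hmm_range_end(), etc ... */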

> I'm also not sure about acquiring the mmap_sem for write as way to
> mitigate thrashing. It seems to me that if a device and a CPU are
> both faulting on the same page,

One of the reasons to prefer this approach is that it means we don't
need to keep track of which ranges we are faulting, and if there is a
lot of *unrelated* fault activity (unlikely?) we can resolve it using
mmap sem instead of this elaborate ranges scheme and related
locking. 

This would reduce the overall work in the page fault and
invalidate_start/end paths for the common uncontended cases.

> some sort of backoff delay is needed to let one side or the other
> make some progress.

What the write side of the mmap_sem would do is force the CPU and
device to cleanly take turns. Once the device pages are registered
under the write side the CPU will have to wait in invalidate_start for
the driver to complete a shootdown, then the whole thing starts all
over again. 

It is certainly imaginable something could have a 'min life' timer for
a device mapping and hold mm invalidate_start, and device pagefault
for that min time to promote better sharing.

But, if we don't use the mmap_sem then we can livelock and the device
will see an unrecoverable error from the timeout which means we have
risk that under load the system will simply obscurely fail. This seems
unacceptable to me..

Particularly since for the ODP use case the issue is not thrashing
migration as a GPU might have, but simple system stability under swap
load. We do not want the ODP pagefault to permanently fail due to
timeout if the VMA is still valid..

Jason


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 09/11] mm/hmm: Poison hmm_range during unregister
  2019-06-06 18:44 ` [PATCH v2 hmm 09/11] mm/hmm: Poison hmm_range during unregister Jason Gunthorpe
  2019-06-07  3:37   ` John Hubbard
@ 2019-06-07 20:46   ` Ralph Campbell
  2019-06-07 20:49     ` Jason Gunthorpe
  2019-06-07 23:01   ` Ira Weiny
  2 siblings, 1 reply; 79+ messages in thread
From: Ralph Campbell @ 2019-06-07 20:46 UTC (permalink / raw)
  To: Jason Gunthorpe, Jerome Glisse, John Hubbard, Felix.Kuehling
  Cc: linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx,
	Jason Gunthorpe


On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> Trying to misuse a range outside its lifetime is a kernel bug. Use WARN_ON
> and poison bytes to detect this condition.
> 
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>

> ---
> v2
> - Keep range start/end valid after unregistration (Jerome)
> ---
>   mm/hmm.c | 7 +++++--
>   1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 6802de7080d172..c2fecb3ecb11e1 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -937,7 +937,7 @@ void hmm_range_unregister(struct hmm_range *range)
>   	struct hmm *hmm = range->hmm;
>   
>   	/* Sanity check this really should not happen. */
> -	if (hmm == NULL || range->end <= range->start)
> +	if (WARN_ON(range->end <= range->start))
>   		return;

WARN_ON() is definitely better than silent return but I wonder how
useful it is since the caller shouldn't be modifying the hmm_range
once it is registered. Other fields could be changed too...

>   	mutex_lock(&hmm->lock);
> @@ -948,7 +948,10 @@ void hmm_range_unregister(struct hmm_range *range)
>   	range->valid = false;
>   	mmput(hmm->mm);
>   	hmm_put(hmm);
> -	range->hmm = NULL;
> +
> +	/* The range is now invalid, leave it poisoned. */
> +	range->valid = false;
> +	memset(&range->hmm, POISON_INUSE, sizeof(range->hmm));
>   }
>   EXPORT_SYMBOL(hmm_range_unregister);
>   
> 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 10/11] mm/hmm: Do not use list*_rcu() for hmm->ranges
  2019-06-06 18:44 ` [PATCH v2 hmm 10/11] mm/hmm: Do not use list*_rcu() for hmm->ranges Jason Gunthorpe
  2019-06-07  3:40   ` John Hubbard
@ 2019-06-07 20:49   ` Ralph Campbell
  2019-06-07 22:11   ` Souptick Joarder
  2019-06-07 23:02   ` Ira Weiny
  3 siblings, 0 replies; 79+ messages in thread
From: Ralph Campbell @ 2019-06-07 20:49 UTC (permalink / raw)
  To: Jason Gunthorpe, Jerome Glisse, John Hubbard, Felix.Kuehling
  Cc: linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx,
	Jason Gunthorpe


On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> This list is always read and written while holding hmm->lock so there is
> no need for the confusing _rcu annotations.
> 
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>

> ---
>   mm/hmm.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index c2fecb3ecb11e1..709d138dd49027 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -911,7 +911,7 @@ int hmm_range_register(struct hmm_range *range,
>   	mutex_lock(&hmm->lock);
>   
>   	range->hmm = hmm;
> -	list_add_rcu(&range->list, &hmm->ranges);
> +	list_add(&range->list, &hmm->ranges);
>   
>   	/*
>   	 * If there are any concurrent notifiers we have to wait for them for
> @@ -941,7 +941,7 @@ void hmm_range_unregister(struct hmm_range *range)
>   		return;
>   
>   	mutex_lock(&hmm->lock);
> -	list_del_rcu(&range->list);
> +	list_del(&range->list);
>   	mutex_unlock(&hmm->lock);
>   
>   	/* Drop reference taken by hmm_range_register() */
> 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 09/11] mm/hmm: Poison hmm_range during unregister
  2019-06-07 20:46   ` Ralph Campbell
@ 2019-06-07 20:49     ` Jason Gunthorpe
  0 siblings, 0 replies; 79+ messages in thread
From: Jason Gunthorpe @ 2019-06-07 20:49 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: Jerome Glisse, John Hubbard, Felix.Kuehling, linux-rdma,
	linux-mm, Andrea Arcangeli, dri-devel, amd-gfx

On Fri, Jun 07, 2019 at 01:46:30PM -0700, Ralph Campbell wrote:
> 
> On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> > From: Jason Gunthorpe <jgg@mellanox.com>
> > 
> > Trying to misuse a range outside its lifetime is a kernel bug. Use WARN_ON
> > and poison bytes to detect this condition.
> > 
> > Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> > Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> 
> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
> 
> > v2
> > - Keep range start/end valid after unregistration (Jerome)
> >   mm/hmm.c | 7 +++++--
> >   1 file changed, 5 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > index 6802de7080d172..c2fecb3ecb11e1 100644
> > +++ b/mm/hmm.c
> > @@ -937,7 +937,7 @@ void hmm_range_unregister(struct hmm_range *range)
> >   	struct hmm *hmm = range->hmm;
> >   	/* Sanity check this really should not happen. */
> > -	if (hmm == NULL || range->end <= range->start)
> > +	if (WARN_ON(range->end <= range->start))
> >   		return;
> 
> WARN_ON() is definitely better than silent return but I wonder how
> useful it is since the caller shouldn't be modifying the hmm_range
> once it is registered. Other fields could be changed too...

I deleted the warn on (see the other thread), but I'm confused by your 
"shouldn't be modified" statement.

The only thing that needs to be set and remain unchanged for register
is the virtual start/end address. Everything else should be done once
it is clear to proceed based on the collision-retry locking scheme
this uses.

Basically the range register only sets up a 'detector' for colliding
invalidations. The other stuff in the struct is just random temporary
storage for the API.

AFAICS at least..

Jason


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 11/11] mm/hmm: Remove confusing comment and logic from hmm_release
  2019-06-06 18:44 ` [PATCH v2 hmm 11/11] mm/hmm: Remove confusing comment and logic from hmm_release Jason Gunthorpe
  2019-06-07  3:47   ` John Hubbard
@ 2019-06-07 21:37   ` Ralph Campbell
  2019-06-08  2:12     ` Jason Gunthorpe
  2019-06-10 16:02     ` Jason Gunthorpe
  1 sibling, 2 replies; 79+ messages in thread
From: Ralph Campbell @ 2019-06-07 21:37 UTC (permalink / raw)
  To: Jason Gunthorpe, Jerome Glisse, John Hubbard, Felix.Kuehling
  Cc: linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx,
	Jason Gunthorpe


On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> hmm_release() is called exactly once per hmm. ops->release() cannot
> accidentally trigger any action that would recurse back onto
> hmm->mirrors_sem.
> 
> This fixes a use after-free race of the form:
> 
>         CPU0                                   CPU1
>                                             hmm_release()
>                                               up_write(&hmm->mirrors_sem);
>   hmm_mirror_unregister(mirror)
>    down_write(&hmm->mirrors_sem);
>    up_write(&hmm->mirrors_sem);
>    kfree(mirror)
>                                               mirror->ops->release(mirror)
> 
> The only user we have today for ops->release is an empty function, so this
> is unambiguously safe.
> 
> As a consequence of plugging this race drivers are not allowed to
> register/unregister mirrors from within a release op.
> 
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

I agree with the analysis above but I'm not sure that release() will
always be an empty function. It might be more efficient to write back
all data migrated to a device "in one pass" instead of relying
on unmap_vmas() calling hmm_start_range_invalidate() per VMA.

I think the bigger issue is potential deadlocks between the paths calling
sync_cpu_device_pagetables() and tasks calling hmm_mirror_unregister():

Say you have three threads:
- Thread A is in try_to_unmap(), either without holding mmap_sem or with
mmap_sem held for read.
- Thread B has some unrelated driver calling hmm_mirror_unregister().
This doesn't require mmap_sem.
- Thread C is about to call migrate_vma().

Thread A                Thread B                 Thread C
try_to_unmap            hmm_mirror_unregister    migrate_vma
----------------------  -----------------------  ----------------------
hmm_invalidate_range_start
down_read(mirrors_sem)
                         down_write(mirrors_sem)
                         // Blocked on A
                                                   device_lock
device_lock
// Blocked on C
                                                   migrate_vma()
                                                   hmm_invalidate_range_start
                                                   down_read(mirrors_sem)
                                                   // Blocked on B
                                                   // Deadlock

Perhaps we should consider using SRCU for walking the mirror->list?
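
As a rough sketch of that direction (the srcu_struct is an assumed
addition, hmm_mirror_register() would also need to switch to
list_add_rcu(), and none of this is tested):

DEFINE_STATIC_SRCU(hmm_mirror_srcu);

/* Readers, e.g. the invalidate path, stop taking mirrors_sem: */
	idx = srcu_read_lock(&hmm_mirror_srcu);
	list_for_each_entry_rcu(mirror, &hmm->mirrors, list) {
		/* ... call mirror->ops->sync_cpu_device_pagetables() as today ... */
	}
	srcu_read_unlock(&hmm_mirror_srcu, idx);

/* And unregister waits for them instead of excluding them: */
void hmm_mirror_unregister(struct hmm_mirror *mirror)
{
	struct hmm *hmm = mirror->hmm;

	down_write(&hmm->mirrors_sem);		/* still serializes list writers */
	list_del_rcu(&mirror->list);
	up_write(&hmm->mirrors_sem);

	/* Wait out any walker still using this mirror before it is freed */
	synchronize_srcu(&hmm_mirror_srcu);
	hmm_put(hmm);
}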

> ---
>   mm/hmm.c | 28 +++++++++-------------------
>   1 file changed, 9 insertions(+), 19 deletions(-)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 709d138dd49027..3a45dd3d778248 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -136,26 +136,16 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>   	WARN_ON(!list_empty(&hmm->ranges));
>   	mutex_unlock(&hmm->lock);
>   
> -	down_write(&hmm->mirrors_sem);
> -	mirror = list_first_entry_or_null(&hmm->mirrors, struct hmm_mirror,
> -					  list);
> -	while (mirror) {
> -		list_del_init(&mirror->list);
> -		if (mirror->ops->release) {
> -			/*
> -			 * Drop mirrors_sem so the release callback can wait
> -			 * on any pending work that might itself trigger a
> -			 * mmu_notifier callback and thus would deadlock with
> -			 * us.
> -			 */
> -			up_write(&hmm->mirrors_sem);
> +	down_read(&hmm->mirrors_sem);
> +	list_for_each_entry(mirror, &hmm->mirrors, list) {
> +		/*
> +		 * Note: The driver is not allowed to trigger
> +		 * hmm_mirror_unregister() from this thread.
> +		 */
> +		if (mirror->ops->release)
>   			mirror->ops->release(mirror);
> -			down_write(&hmm->mirrors_sem);
> -		}
> -		mirror = list_first_entry_or_null(&hmm->mirrors,
> -						  struct hmm_mirror, list);
>   	}
> -	up_write(&hmm->mirrors_sem);
> +	up_read(&hmm->mirrors_sem);
>   
>   	hmm_put(hmm);
>   }
> @@ -287,7 +277,7 @@ void hmm_mirror_unregister(struct hmm_mirror *mirror)
>   	struct hmm *hmm = mirror->hmm;
>   
>   	down_write(&hmm->mirrors_sem);
> -	list_del_init(&mirror->list);
> +	list_del(&mirror->list);
>   	up_write(&hmm->mirrors_sem);
>   	hmm_put(hmm);
>   	memset(&mirror->hmm, POISON_INUSE, sizeof(mirror->hmm));
> 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 10/11] mm/hmm: Do not use list*_rcu() for hmm->ranges
  2019-06-06 18:44 ` [PATCH v2 hmm 10/11] mm/hmm: Do not use list*_rcu() for hmm->ranges Jason Gunthorpe
  2019-06-07  3:40   ` John Hubbard
  2019-06-07 20:49   ` Ralph Campbell
@ 2019-06-07 22:11   ` Souptick Joarder
  2019-06-07 23:02   ` Ira Weiny
  3 siblings, 0 replies; 79+ messages in thread
From: Souptick Joarder @ 2019-06-07 22:11 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jerome Glisse, Ralph Campbell, John Hubbard, Felix.Kuehling,
	linux-rdma, Linux-MM, Andrea Arcangeli, dri-devel, amd-gfx,
	Jason Gunthorpe

On Fri, Jun 7, 2019 at 12:15 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> From: Jason Gunthorpe <jgg@mellanox.com>
>
> This list is always read and written while holding hmm->lock so there is
> no need for the confusing _rcu annotations.
>
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

Acked-by: Souptick Joarder <jrdr.linux@gmail.com>

> ---
>  mm/hmm.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/hmm.c b/mm/hmm.c
> index c2fecb3ecb11e1..709d138dd49027 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -911,7 +911,7 @@ int hmm_range_register(struct hmm_range *range,
>         mutex_lock(&hmm->lock);
>
>         range->hmm = hmm;
> -       list_add_rcu(&range->list, &hmm->ranges);
> +       list_add(&range->list, &hmm->ranges);
>
>         /*
>          * If there are any concurrent notifiers we have to wait for them for
> @@ -941,7 +941,7 @@ void hmm_range_unregister(struct hmm_range *range)
>                 return;
>
>         mutex_lock(&hmm->lock);
> -       list_del_rcu(&range->list);
> +       list_del(&range->list);
>         mutex_unlock(&hmm->lock);
>
>         /* Drop reference taken by hmm_range_register() */
> --
> 2.21.0
>


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 05/11] mm/hmm: Remove duplicate condition test before wait_event_timeout
  2019-06-07 20:44         ` Jason Gunthorpe
@ 2019-06-07 22:13           ` Ralph Campbell
  2019-06-08  1:47             ` Jason Gunthorpe
  0 siblings, 1 reply; 79+ messages in thread
From: Ralph Campbell @ 2019-06-07 22:13 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jerome Glisse, John Hubbard, Felix.Kuehling, linux-rdma,
	linux-mm, Andrea Arcangeli, dri-devel, amd-gfx


On 6/7/19 1:44 PM, Jason Gunthorpe wrote:
> On Fri, Jun 07, 2019 at 01:21:12PM -0700, Ralph Campbell wrote:
> 
>>> What I want to get to is a pattern like this:
>>>
>>> pagefault():
>>>
>>>      hmm_range_register(&range);
>>> again:
>>>      /* On the slow path, if we appear to be live locked then we get
>>>         the write side of mmap_sem which will break the live lock,
>>>         otherwise this gets the read lock */
>>>      if (hmm_range_start_and_lock(&range))
>>>            goto err;
>>>
>>>      lockdep_assert_held(range->mm->mmap_sem);
>>>
>>>      // Optional: Avoid useless expensive work
>>>      if (hmm_range_needs_retry(&range))
>>>         goto again;
>>>      hmm_range_(touch vmas)
>>>
>>>      take_lock(driver->update);
>>>      if (hmm_range_end(&range) {
>>>          release_lock(driver->update);
>>>          goto again;
>>>      }
>>>      // Finish driver updates
>>>      release_lock(driver->update);
>>>
>>>      // Releases mmap_sem
>>>      hmm_range_unregister_and_unlock(&range);
>>>
>>> What do you think?
>>>
>>> Is it clear?
>>>
>>> Jason
>>>
>>
>> Are you talking about acquiring mmap_sem in hmm_range_start_and_lock()?
>> Usually, the fault code has to lock mmap_sem for read in order to
>> call find_vma() so it can set range.vma.
> 
>> If HMM drops mmap_sem - which I don't think it should, just return an
>> error to tell the caller to drop mmap_sem and retry - the find_vma()
>> will need to be repeated as well.
> 
> Overall I don't think it makes a lot of sense to sleep for retry in
> hmm_range_start_and_lock() while holding mmap_sem. It would be better
> to drop that lock, sleep, then re-acquire it as part of the hmm logic.
> 
> The find_vma should be done inside the critical section created by
> hmm_range_start_and_lock(), not before it. If we are retrying then we
> already slept and the additional CPU cost to repeat the find_vma is
> immaterial, IMHO?
> 
> Do you see a reason why the find_vma() ever needs to be before the
> 'again' in my above example? range.vma does not need to be set for
> range_register.

Yes, for the GPU case, there can be many faults in an event queue
and the goal is to try to handle more than one page at a time.
The vma is needed to limit the amount of coalescing and checking
for pages that could be speculatively migrated or mapped.

>> I'm also not sure about acquiring the mmap_sem for write as way to
>> mitigate thrashing. It seems to me that if a device and a CPU are
>> both faulting on the same page,
> 
> One of the reasons to prefer this approach is that it means we don't
> need to keep track of which ranges we are faulting, and if there is a
> lot of *unrelated* fault activity (unlikely?) we can resolve it using
> mmap sem instead of this elaborate ranges scheme and related
> locking.
> 
> This would reduce the overall work in the page fault and
> invalidate_start/end paths for the common uncontended cases.
> 
>> some sort of backoff delay is needed to let one side or the other
>> make some progress.
> 
> What the write side of the mmap_sem would do is force the CPU and
> device to cleanly take turns. Once the device pages are registered
> under the write side the CPU will have to wait in invalidate_start for
> the driver to complete a shootdown, then the whole thing starts all
> over again.
> 
> It is certainly imaginable something could have a 'min life' timer for
> a device mapping and hold mm invalidate_start, and device pagefault
> for that min time to promote better sharing.
> 
> But, if we don't use the mmap_sem then we can livelock and the device
> will see an unrecoverable error from the timeout which means we have
> risk that under load the system will simply obscurely fail. This seems
> unacceptable to me..
> 
> Particularly since for the ODP use case the issue is not thrashing
> migration as a GPU might have, but simple system stability under swap
> load. We do not want the ODP pagefault to permanently fail due to
> timeout if the VMA is still valid..
> 
> Jason
> 

OK, I understand.
If you come up with a set of changes, I can try testing them.


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 07/11] mm/hmm: Use lockdep instead of comments
  2019-06-06 18:44 ` [PATCH v2 hmm 07/11] mm/hmm: Use lockdep instead of comments Jason Gunthorpe
  2019-06-07  3:19   ` John Hubbard
  2019-06-07 20:31   ` Ralph Campbell
@ 2019-06-07 22:16   ` Souptick Joarder
  2 siblings, 0 replies; 79+ messages in thread
From: Souptick Joarder @ 2019-06-07 22:16 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jerome Glisse, Ralph Campbell, John Hubbard, Felix.Kuehling,
	linux-rdma, Linux-MM, Andrea Arcangeli, dri-devel, amd-gfx,
	Jason Gunthorpe

On Fri, Jun 7, 2019 at 12:15 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> From: Jason Gunthorpe <jgg@mellanox.com>
>
> So we can check locking at runtime.

A little more descriptive changelog would be helpful.
Acked-by: Souptick Joarder <jrdr.linux@gmail.com>

>
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> ---
> v2
> - Fix missing & in lockdeps (Jason)
> ---
>  mm/hmm.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/hmm.c b/mm/hmm.c
> index f67ba32983d9f1..c702cd72651b53 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -254,11 +254,11 @@ static const struct mmu_notifier_ops hmm_mmu_notifier_ops = {
>   *
>   * To start mirroring a process address space, the device driver must register
>   * an HMM mirror struct.
> - *
> - * THE mm->mmap_sem MUST BE HELD IN WRITE MODE !
>   */
>  int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm)
>  {
> +       lockdep_assert_held_exclusive(&mm->mmap_sem);
> +
>         /* Sanity check */
>         if (!mm || !mirror || !mirror->ops)
>                 return -EINVAL;
> --
> 2.21.0
>


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 02/11] mm/hmm: Use hmm_mirror not mm as an argument for hmm_range_register
  2019-06-06 18:44 ` [PATCH v2 hmm 02/11] mm/hmm: Use hmm_mirror not mm as an argument for hmm_range_register Jason Gunthorpe
  2019-06-07  2:36   ` John Hubbard
  2019-06-07 18:24   ` Ralph Campbell
@ 2019-06-07 22:33   ` Ira Weiny
  2019-06-08  8:54   ` Christoph Hellwig
  3 siblings, 0 replies; 79+ messages in thread
From: Ira Weiny @ 2019-06-07 22:33 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jerome Glisse, Ralph Campbell, John Hubbard, Felix.Kuehling,
	linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx,
	Jason Gunthorpe

On Thu, Jun 06, 2019 at 03:44:29PM -0300, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> Ralph observes that hmm_range_register() can only be called by a driver
> while a mirror is registered. Make this clear in the API by passing in the
> mirror structure as a parameter.
> 
> This also simplifies understanding the lifetime model for struct hmm, as
> the hmm pointer must be valid as part of a registered mirror so all we
> need in hmm_range_register() is a simple kref_get.
> 
> Suggested-by: Ralph Campbell <rcampbell@nvidia.com>
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> ---
> v2
> - Include the oneline patch to nouveau_svm.c
> ---
>  drivers/gpu/drm/nouveau/nouveau_svm.c |  2 +-
>  include/linux/hmm.h                   |  7 ++++---
>  mm/hmm.c                              | 15 ++++++---------
>  3 files changed, 11 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
> index 93ed43c413f0bb..8c92374afcf227 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_svm.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
> @@ -649,7 +649,7 @@ nouveau_svm_fault(struct nvif_notify *notify)
>  		range.values = nouveau_svm_pfn_values;
>  		range.pfn_shift = NVIF_VMM_PFNMAP_V0_ADDR_SHIFT;
>  again:
> -		ret = hmm_vma_fault(&range, true);
> +		ret = hmm_vma_fault(&svmm->mirror, &range, true);
>  		if (ret == 0) {
>  			mutex_lock(&svmm->mutex);
>  			if (!hmm_vma_range_done(&range)) {
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 688c5ca7068795..2d519797cb134a 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -505,7 +505,7 @@ static inline bool hmm_mirror_mm_is_alive(struct hmm_mirror *mirror)
>   * Please see Documentation/vm/hmm.rst for how to use the range API.
>   */
>  int hmm_range_register(struct hmm_range *range,
> -		       struct mm_struct *mm,
> +		       struct hmm_mirror *mirror,
>  		       unsigned long start,
>  		       unsigned long end,
>  		       unsigned page_shift);
> @@ -541,7 +541,8 @@ static inline bool hmm_vma_range_done(struct hmm_range *range)
>  }
>  
>  /* This is a temporary helper to avoid merge conflict between trees. */
> -static inline int hmm_vma_fault(struct hmm_range *range, bool block)
> +static inline int hmm_vma_fault(struct hmm_mirror *mirror,
> +				struct hmm_range *range, bool block)
>  {
>  	long ret;
>  
> @@ -554,7 +555,7 @@ static inline int hmm_vma_fault(struct hmm_range *range, bool block)
>  	range->default_flags = 0;
>  	range->pfn_flags_mask = -1UL;
>  
> -	ret = hmm_range_register(range, range->vma->vm_mm,
> +	ret = hmm_range_register(range, mirror,
>  				 range->start, range->end,
>  				 PAGE_SHIFT);
>  	if (ret)
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 547002f56a163d..8796447299023c 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -925,13 +925,13 @@ static void hmm_pfns_clear(struct hmm_range *range,
>   * Track updates to the CPU page table see include/linux/hmm.h
>   */
>  int hmm_range_register(struct hmm_range *range,
> -		       struct mm_struct *mm,
> +		       struct hmm_mirror *mirror,
>  		       unsigned long start,
>  		       unsigned long end,
>  		       unsigned page_shift)
>  {
>  	unsigned long mask = ((1UL << page_shift) - 1UL);
> -	struct hmm *hmm;
> +	struct hmm *hmm = mirror->hmm;
>  
>  	range->valid = false;
>  	range->hmm = NULL;
> @@ -945,15 +945,12 @@ int hmm_range_register(struct hmm_range *range,
>  	range->start = start;
>  	range->end = end;
>  
> -	hmm = hmm_get_or_create(mm);
> -	if (!hmm)
> -		return -EFAULT;
> -
>  	/* Check if hmm_mm_destroy() was call. */
> -	if (hmm->mm == NULL || hmm->dead) {
> -		hmm_put(hmm);
> +	if (hmm->mm == NULL || hmm->dead)
>  		return -EFAULT;
> -	}
> +
> +	range->hmm = hmm;

I don't think you need this assignment here.  In the code below (right after
the mutex_lock()) it is set already.  And it looks like it remains that way after
the end of the series.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>

> +	kref_get(&hmm->kref);
>  
>  	/* Initialize range to track CPU page table updates. */
>  	mutex_lock(&hmm->lock);
> -- 
> 2.21.0
> 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 03/11] mm/hmm: Hold a mmgrab from hmm to mm
  2019-06-06 18:44 ` [PATCH v2 hmm 03/11] mm/hmm: Hold a mmgrab from hmm to mm Jason Gunthorpe
  2019-06-07  2:44   ` John Hubbard
  2019-06-07 18:41   ` Ralph Campbell
@ 2019-06-07 22:38   ` Ira Weiny
  2 siblings, 0 replies; 79+ messages in thread
From: Ira Weiny @ 2019-06-07 22:38 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jerome Glisse, Ralph Campbell, John Hubbard, Felix.Kuehling,
	linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx,
	Jason Gunthorpe

On Thu, Jun 06, 2019 at 03:44:30PM -0300, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> So long as a struct hmm pointer exists, so should the struct mm it is
> linked to. Hold the mmgrab() as soon as a hmm is created, and mmdrop() it
> once the hmm refcount goes to zero.
> 
> Since mmdrop() (ie a 0 kref on struct mm) is now impossible with a !NULL
> mm->hmm, delete hmm_mm_destroy().
> 
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

Reviewed-by: Ira Weiny <ira.weiny@intel.com>

> ---
> v2:
>  - Fix error unwind paths in hmm_get_or_create (Jerome/Jason)
> ---
>  include/linux/hmm.h |  3 ---
>  kernel/fork.c       |  1 -
>  mm/hmm.c            | 22 ++++------------------
>  3 files changed, 4 insertions(+), 22 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 2d519797cb134a..4ee3acabe5ed22 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -586,14 +586,11 @@ static inline int hmm_vma_fault(struct hmm_mirror *mirror,
>  }
>  
>  /* Below are for HMM internal use only! Not to be used by device driver! */
> -void hmm_mm_destroy(struct mm_struct *mm);
> -
>  static inline void hmm_mm_init(struct mm_struct *mm)
>  {
>  	mm->hmm = NULL;
>  }
>  #else /* IS_ENABLED(CONFIG_HMM_MIRROR) */
> -static inline void hmm_mm_destroy(struct mm_struct *mm) {}
>  static inline void hmm_mm_init(struct mm_struct *mm) {}
>  #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
>  
> diff --git a/kernel/fork.c b/kernel/fork.c
> index b2b87d450b80b5..588c768ae72451 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -673,7 +673,6 @@ void __mmdrop(struct mm_struct *mm)
>  	WARN_ON_ONCE(mm == current->active_mm);
>  	mm_free_pgd(mm);
>  	destroy_context(mm);
> -	hmm_mm_destroy(mm);
>  	mmu_notifier_mm_destroy(mm);
>  	check_mm(mm);
>  	put_user_ns(mm->user_ns);
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 8796447299023c..cc7c26fda3300e 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -29,6 +29,7 @@
>  #include <linux/swapops.h>
>  #include <linux/hugetlb.h>
>  #include <linux/memremap.h>
> +#include <linux/sched/mm.h>
>  #include <linux/jump_label.h>
>  #include <linux/dma-mapping.h>
>  #include <linux/mmu_notifier.h>
> @@ -82,6 +83,7 @@ static struct hmm *hmm_get_or_create(struct mm_struct *mm)
>  	hmm->notifiers = 0;
>  	hmm->dead = false;
>  	hmm->mm = mm;
> +	mmgrab(hmm->mm);
>  
>  	spin_lock(&mm->page_table_lock);
>  	if (!mm->hmm)
> @@ -109,6 +111,7 @@ static struct hmm *hmm_get_or_create(struct mm_struct *mm)
>  		mm->hmm = NULL;
>  	spin_unlock(&mm->page_table_lock);
>  error:
> +	mmdrop(hmm->mm);
>  	kfree(hmm);
>  	return NULL;
>  }
> @@ -130,6 +133,7 @@ static void hmm_free(struct kref *kref)
>  		mm->hmm = NULL;
>  	spin_unlock(&mm->page_table_lock);
>  
> +	mmdrop(hmm->mm);
>  	mmu_notifier_call_srcu(&hmm->rcu, hmm_free_rcu);
>  }
>  
> @@ -138,24 +142,6 @@ static inline void hmm_put(struct hmm *hmm)
>  	kref_put(&hmm->kref, hmm_free);
>  }
>  
> -void hmm_mm_destroy(struct mm_struct *mm)
> -{
> -	struct hmm *hmm;
> -
> -	spin_lock(&mm->page_table_lock);
> -	hmm = mm_get_hmm(mm);
> -	mm->hmm = NULL;
> -	if (hmm) {
> -		hmm->mm = NULL;
> -		hmm->dead = true;
> -		spin_unlock(&mm->page_table_lock);
> -		hmm_put(hmm);
> -		return;
> -	}
> -
> -	spin_unlock(&mm->page_table_lock);
> -}
> -
>  static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>  {
>  	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
> -- 
> 2.21.0
> 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 02/11] mm/hmm: Use hmm_mirror not mm as an argument for hmm_range_register
  2019-06-07 18:24   ` Ralph Campbell
@ 2019-06-07 22:39     ` Ralph Campbell
  2019-06-10 13:09       ` Jason Gunthorpe
  0 siblings, 1 reply; 79+ messages in thread
From: Ralph Campbell @ 2019-06-07 22:39 UTC (permalink / raw)
  To: Jason Gunthorpe, Jerome Glisse, John Hubbard, Felix.Kuehling
  Cc: linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx,
	Jason Gunthorpe


On 6/7/19 11:24 AM, Ralph Campbell wrote:
> 
> On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
>> From: Jason Gunthorpe <jgg@mellanox.com>
>>
>> Ralph observes that hmm_range_register() can only be called by a driver
>> while a mirror is registered. Make this clear in the API by passing in 
>> the
>> mirror structure as a parameter.
>>
>> This also simplifies understanding the lifetime model for struct hmm, as
>> the hmm pointer must be valid as part of a registered mirror so all we
>> need in hmm_range_register() is a simple kref_get.
>>
>> Suggested-by: Ralph Campbell <rcampbell@nvidia.com>
>> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> 
> You might CC Ben for the nouveau part.
> CC: Ben Skeggs <bskeggs@redhat.com>
> 
> Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>
> 
> 
>> ---
>> v2
>> - Include the oneline patch to nouveau_svm.c
>> ---
>>   drivers/gpu/drm/nouveau/nouveau_svm.c |  2 +-
>>   include/linux/hmm.h                   |  7 ++++---
>>   mm/hmm.c                              | 15 ++++++---------
>>   3 files changed, 11 insertions(+), 13 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c 
>> b/drivers/gpu/drm/nouveau/nouveau_svm.c
>> index 93ed43c413f0bb..8c92374afcf227 100644
>> --- a/drivers/gpu/drm/nouveau/nouveau_svm.c
>> +++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
>> @@ -649,7 +649,7 @@ nouveau_svm_fault(struct nvif_notify *notify)
>>           range.values = nouveau_svm_pfn_values;
>>           range.pfn_shift = NVIF_VMM_PFNMAP_V0_ADDR_SHIFT;
>>   again:
>> -        ret = hmm_vma_fault(&range, true);
>> +        ret = hmm_vma_fault(&svmm->mirror, &range, true);
>>           if (ret == 0) {
>>               mutex_lock(&svmm->mutex);
>>               if (!hmm_vma_range_done(&range)) {
>> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
>> index 688c5ca7068795..2d519797cb134a 100644
>> --- a/include/linux/hmm.h
>> +++ b/include/linux/hmm.h
>> @@ -505,7 +505,7 @@ static inline bool hmm_mirror_mm_is_alive(struct 
>> hmm_mirror *mirror)
>>    * Please see Documentation/vm/hmm.rst for how to use the range API.
>>    */
>>   int hmm_range_register(struct hmm_range *range,
>> -               struct mm_struct *mm,
>> +               struct hmm_mirror *mirror,
>>                  unsigned long start,
>>                  unsigned long end,
>>                  unsigned page_shift);
>> @@ -541,7 +541,8 @@ static inline bool hmm_vma_range_done(struct 
>> hmm_range *range)
>>   }
>>   /* This is a temporary helper to avoid merge conflict between trees. */
>> -static inline int hmm_vma_fault(struct hmm_range *range, bool block)
>> +static inline int hmm_vma_fault(struct hmm_mirror *mirror,
>> +                struct hmm_range *range, bool block)
>>   {
>>       long ret;
>> @@ -554,7 +555,7 @@ static inline int hmm_vma_fault(struct hmm_range 
>> *range, bool block)
>>       range->default_flags = 0;
>>       range->pfn_flags_mask = -1UL;
>> -    ret = hmm_range_register(range, range->vma->vm_mm,
>> +    ret = hmm_range_register(range, mirror,
>>                    range->start, range->end,
>>                    PAGE_SHIFT);
>>       if (ret)
>> diff --git a/mm/hmm.c b/mm/hmm.c
>> index 547002f56a163d..8796447299023c 100644
>> --- a/mm/hmm.c
>> +++ b/mm/hmm.c
>> @@ -925,13 +925,13 @@ static void hmm_pfns_clear(struct hmm_range *range,
>>    * Track updates to the CPU page table see include/linux/hmm.h
>>    */
>>   int hmm_range_register(struct hmm_range *range,
>> -               struct mm_struct *mm,
>> +               struct hmm_mirror *mirror,
>>                  unsigned long start,
>>                  unsigned long end,
>>                  unsigned page_shift)
>>   {
>>       unsigned long mask = ((1UL << page_shift) - 1UL);
>> -    struct hmm *hmm;
>> +    struct hmm *hmm = mirror->hmm;
>>       range->valid = false;
>>       range->hmm = NULL;
>> @@ -945,15 +945,12 @@ int hmm_range_register(struct hmm_range *range,
>>       range->start = start;
>>       range->end = end;
>> -    hmm = hmm_get_or_create(mm);
>> -    if (!hmm)
>> -        return -EFAULT;
>> -
>>       /* Check if hmm_mm_destroy() was call. */
>> -    if (hmm->mm == NULL || hmm->dead) {
>> -        hmm_put(hmm);
>> +    if (hmm->mm == NULL || hmm->dead)
>>           return -EFAULT;
>> -    }
>> +
>> +    range->hmm = hmm;
>> +    kref_get(&hmm->kref);
>>       /* Initialize range to track CPU page table updates. */
>>       mutex_lock(&hmm->lock);
>>

I forgot to add that I think you can delete the duplicate
     "range->hmm = hmm;"
here between the mutex_lock/unlock.


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 04/11] mm/hmm: Simplify hmm_get_or_create and make it reliable
  2019-06-06 18:44 ` [PATCH v2 hmm 04/11] mm/hmm: Simplify hmm_get_or_create and make it reliable Jason Gunthorpe
  2019-06-07  2:54   ` John Hubbard
  2019-06-07 18:52   ` Ralph Campbell
@ 2019-06-07 22:44   ` Ira Weiny
  2 siblings, 0 replies; 79+ messages in thread
From: Ira Weiny @ 2019-06-07 22:44 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jerome Glisse, Ralph Campbell, John Hubbard, Felix.Kuehling,
	linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx,
	Jason Gunthorpe

On Thu, Jun 06, 2019 at 03:44:31PM -0300, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> As coded this function can false-fail in various racy situations. Make it
> reliable by running only under the write side of the mmap_sem and avoiding
> the false-failing compare/exchange pattern.
> 
> Also make the locking very easy to understand by only ever reading or
> writing mm->hmm while holding the write side of the mmap_sem.
> 
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

Reviewed-by: Ira Weiny <ira.weiny@intel.com>

> ---
> v2:
> - Fix error unwind of mmgrab (Jerome)
> - Use hmm local instead of 2nd container_of (Jerome)
> ---
>  mm/hmm.c | 80 ++++++++++++++++++++------------------------------------
>  1 file changed, 29 insertions(+), 51 deletions(-)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index cc7c26fda3300e..dc30edad9a8a02 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -40,16 +40,6 @@
>  #if IS_ENABLED(CONFIG_HMM_MIRROR)
>  static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
>  
> -static inline struct hmm *mm_get_hmm(struct mm_struct *mm)
> -{
> -	struct hmm *hmm = READ_ONCE(mm->hmm);
> -
> -	if (hmm && kref_get_unless_zero(&hmm->kref))
> -		return hmm;
> -
> -	return NULL;
> -}
> -
>  /**
>   * hmm_get_or_create - register HMM against an mm (HMM internal)
>   *
> @@ -64,11 +54,20 @@ static inline struct hmm *mm_get_hmm(struct mm_struct *mm)
>   */
>  static struct hmm *hmm_get_or_create(struct mm_struct *mm)
>  {
> -	struct hmm *hmm = mm_get_hmm(mm);
> -	bool cleanup = false;
> +	struct hmm *hmm;
>  
> -	if (hmm)
> -		return hmm;
> +	lockdep_assert_held_exclusive(&mm->mmap_sem);
> +
> +	if (mm->hmm) {
> +		if (kref_get_unless_zero(&mm->hmm->kref))
> +			return mm->hmm;
> +		/*
> +		 * The hmm is being freed by some other CPU and is pending a
> +		 * RCU grace period, but this CPU can NULL now it since we
> +		 * have the mmap_sem.
> +		 */
> +		mm->hmm = NULL;
> +	}
>  
>  	hmm = kmalloc(sizeof(*hmm), GFP_KERNEL);
>  	if (!hmm)
> @@ -83,57 +82,36 @@ static struct hmm *hmm_get_or_create(struct mm_struct *mm)
>  	hmm->notifiers = 0;
>  	hmm->dead = false;
>  	hmm->mm = mm;
> -	mmgrab(hmm->mm);
> -
> -	spin_lock(&mm->page_table_lock);
> -	if (!mm->hmm)
> -		mm->hmm = hmm;
> -	else
> -		cleanup = true;
> -	spin_unlock(&mm->page_table_lock);
>  
> -	if (cleanup)
> -		goto error;
> -
> -	/*
> -	 * We should only get here if hold the mmap_sem in write mode ie on
> -	 * registration of first mirror through hmm_mirror_register()
> -	 */
>  	hmm->mmu_notifier.ops = &hmm_mmu_notifier_ops;
> -	if (__mmu_notifier_register(&hmm->mmu_notifier, mm))
> -		goto error_mm;
> +	if (__mmu_notifier_register(&hmm->mmu_notifier, mm)) {
> +		kfree(hmm);
> +		return NULL;
> +	}
>  
> +	mmgrab(hmm->mm);
> +	mm->hmm = hmm;
>  	return hmm;
> -
> -error_mm:
> -	spin_lock(&mm->page_table_lock);
> -	if (mm->hmm == hmm)
> -		mm->hmm = NULL;
> -	spin_unlock(&mm->page_table_lock);
> -error:
> -	mmdrop(hmm->mm);
> -	kfree(hmm);
> -	return NULL;
>  }
>  
>  static void hmm_free_rcu(struct rcu_head *rcu)
>  {
> -	kfree(container_of(rcu, struct hmm, rcu));
> +	struct hmm *hmm = container_of(rcu, struct hmm, rcu);
> +
> +	down_write(&hmm->mm->mmap_sem);
> +	if (hmm->mm->hmm == hmm)
> +		hmm->mm->hmm = NULL;
> +	up_write(&hmm->mm->mmap_sem);
> +	mmdrop(hmm->mm);
> +
> +	kfree(hmm);
>  }
>  
>  static void hmm_free(struct kref *kref)
>  {
>  	struct hmm *hmm = container_of(kref, struct hmm, kref);
> -	struct mm_struct *mm = hmm->mm;
> -
> -	mmu_notifier_unregister_no_release(&hmm->mmu_notifier, mm);
>  
> -	spin_lock(&mm->page_table_lock);
> -	if (mm->hmm == hmm)
> -		mm->hmm = NULL;
> -	spin_unlock(&mm->page_table_lock);
> -
> -	mmdrop(hmm->mm);
> +	mmu_notifier_unregister_no_release(&hmm->mmu_notifier, hmm->mm);
>  	mmu_notifier_call_srcu(&hmm->rcu, hmm_free_rcu);
>  }
>  
> -- 
> 2.21.0
> 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 hmm 05/11] mm/hmm: Remove duplicate condition test before wait_event_timeout
  2019-06-07 13:31     ` [PATCH v3 " Jason Gunthorpe
@ 2019-06-07 22:55       ` Ira Weiny
  2019-06-08  1:32       ` John Hubbard
  1 sibling, 0 replies; 79+ messages in thread
From: Ira Weiny @ 2019-06-07 22:55 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: John Hubbard, Jerome Glisse, Ralph Campbell, Felix.Kuehling,
	linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx

On Fri, Jun 07, 2019 at 10:31:07AM -0300, Jason Gunthorpe wrote:
> The wait_event_timeout macro already tests the condition as its first
> action, so there is no reason to open code another version of this; all
> that does is skip the might_sleep() debugging in common cases, which is
> not helpful.
> 
> Further, based on prior patches, we can now simplify the required condition
> test:
>  - If range is valid memory then so is range->hmm
>  - If hmm_release() has run then range->valid is set to false
>    at the same time as dead, so no reason to check both.
>  - A valid hmm has a valid hmm->mm.
> 
> Allowing the return value of wait_event_timeout() (along with its internal
> barriers) to compute the result of the function.
> 
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

Reviewed-by: Ira Weiny <ira.weiny@intel.com>

> ---
> v3
> - Simplify the wait_event_timeout to not check valid
> ---
>  include/linux/hmm.h | 13 ++-----------
>  1 file changed, 2 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 1d97b6d62c5bcf..26e7c477490c4e 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -209,17 +209,8 @@ static inline unsigned long hmm_range_page_size(const struct hmm_range *range)
>  static inline bool hmm_range_wait_until_valid(struct hmm_range *range,
>  					      unsigned long timeout)
>  {
> -	/* Check if mm is dead ? */
> -	if (range->hmm == NULL || range->hmm->dead || range->hmm->mm == NULL) {
> -		range->valid = false;
> -		return false;
> -	}
> -	if (range->valid)
> -		return true;
> -	wait_event_timeout(range->hmm->wq, range->valid || range->hmm->dead,
> -			   msecs_to_jiffies(timeout));
> -	/* Return current valid status just in case we get lucky */
> -	return range->valid;
> +	return wait_event_timeout(range->hmm->wq, range->valid,
> +				  msecs_to_jiffies(timeout)) != 0;
>  }
>  
>  /*
> -- 
> 2.21.0
> 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 09/11] mm/hmm: Poison hmm_range during unregister
  2019-06-06 18:44 ` [PATCH v2 hmm 09/11] mm/hmm: Poison hmm_range during unregister Jason Gunthorpe
  2019-06-07  3:37   ` John Hubbard
  2019-06-07 20:46   ` Ralph Campbell
@ 2019-06-07 23:01   ` Ira Weiny
  2 siblings, 0 replies; 79+ messages in thread
From: Ira Weiny @ 2019-06-07 23:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jerome Glisse, Ralph Campbell, John Hubbard, Felix.Kuehling,
	linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx,
	Jason Gunthorpe

On Thu, Jun 06, 2019 at 03:44:36PM -0300, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> Trying to misuse a range outside its lifetime is a kernel bug. Use WARN_ON
> and poison bytes to detect this condition.
> 
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
> ---
> v2
> - Keep range start/end valid after unregistration (Jerome)
> ---
>  mm/hmm.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 6802de7080d172..c2fecb3ecb11e1 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -937,7 +937,7 @@ void hmm_range_unregister(struct hmm_range *range)
>  	struct hmm *hmm = range->hmm;
>  
>  	/* Sanity check this really should not happen. */
> -	if (hmm == NULL || range->end <= range->start)
> +	if (WARN_ON(range->end <= range->start))
>  		return;
>  
>  	mutex_lock(&hmm->lock);
> @@ -948,7 +948,10 @@ void hmm_range_unregister(struct hmm_range *range)
>  	range->valid = false;
>  	mmput(hmm->mm);
>  	hmm_put(hmm);
> -	range->hmm = NULL;
> +
> +	/* The range is now invalid, leave it poisoned. */
> +	range->valid = false;

No need to set valid to false again, as you just did this 5 lines above.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>

> +	memset(&range->hmm, POISON_INUSE, sizeof(range->hmm));
>  }
>  EXPORT_SYMBOL(hmm_range_unregister);
>  
> -- 
> 2.21.0
> 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 10/11] mm/hmm: Do not use list*_rcu() for hmm->ranges
  2019-06-06 18:44 ` [PATCH v2 hmm 10/11] mm/hmm: Do not use list*_rcu() for hmm->ranges Jason Gunthorpe
                     ` (2 preceding siblings ...)
  2019-06-07 22:11   ` Souptick Joarder
@ 2019-06-07 23:02   ` Ira Weiny
  3 siblings, 0 replies; 79+ messages in thread
From: Ira Weiny @ 2019-06-07 23:02 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jerome Glisse, Ralph Campbell, John Hubbard, Felix.Kuehling,
	linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx,
	Jason Gunthorpe

On Thu, Jun 06, 2019 at 03:44:37PM -0300, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> This list is always read and written while holding hmm->lock so there is
> no need for the confusing _rcu annotations.
> 
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> Reviewed-by: Jérôme Glisse <jglisse@redhat.com>

Reviewed-by: Ira Weiny <iweiny@intel.com>

> ---
>  mm/hmm.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index c2fecb3ecb11e1..709d138dd49027 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -911,7 +911,7 @@ int hmm_range_register(struct hmm_range *range,
>  	mutex_lock(&hmm->lock);
>  
>  	range->hmm = hmm;
> -	list_add_rcu(&range->list, &hmm->ranges);
> +	list_add(&range->list, &hmm->ranges);
>  
>  	/*
>  	 * If there are any concurrent notifiers we have to wait for them for
> @@ -941,7 +941,7 @@ void hmm_range_unregister(struct hmm_range *range)
>  		return;
>  
>  	mutex_lock(&hmm->lock);
> -	list_del_rcu(&range->list);
> +	list_del(&range->list);
>  	mutex_unlock(&hmm->lock);
>  
>  	/* Drop reference taken by hmm_range_register() */
> -- 
> 2.21.0
> 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 12/11] mm/hmm: Fix error flows in hmm_invalidate_range_start
  2019-06-07 16:05 ` [PATCH v2 12/11] mm/hmm: Fix error flows in hmm_invalidate_range_start Jason Gunthorpe
@ 2019-06-07 23:52   ` Ralph Campbell
  2019-06-08  1:35     ` Jason Gunthorpe
  0 siblings, 1 reply; 79+ messages in thread
From: Ralph Campbell @ 2019-06-07 23:52 UTC (permalink / raw)
  To: Jason Gunthorpe, Jerome Glisse, John Hubbard, Felix.Kuehling
  Cc: linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx


On 6/7/19 9:05 AM, Jason Gunthorpe wrote:
> If the trylock on the hmm->mirrors_sem fails the function will return
> without decrementing the notifiers that were previously incremented. Since
> the caller will not call invalidate_range_end() on EAGAIN this will result
> in notifiers becoming permanently incremented and deadlock.
> 
> If the sync_cpu_device_pagetables() required blocking, the function will
> not return EAGAIN even though the device continues to touch the
> pages. This is a violation of the mmu notifier contract.
> 
> Switch, and rename, the ranges_lock to a spin lock so we can reliably
> obtain it without blocking during error unwind.
> 
> The error unwind is necessary since the notifiers count must be held
> incremented across the call to sync_cpu_device_pagetables() as we cannot
> allow the range to become marked valid by a parallel
> invalidate_start/end() pair while doing sync_cpu_device_pagetables().
> 
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

Reviewed-by: Ralph Campbell <rcampbell@nvidia.com>

> ---
>   include/linux/hmm.h |  2 +-
>   mm/hmm.c            | 77 +++++++++++++++++++++++++++------------------
>   2 files changed, 48 insertions(+), 31 deletions(-)
> 
> I almost lost this patch - it is part of the series, hasn't been
> posted before, and wasn't sent with the rest, sorry.
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index bf013e96525771..0fa8ea34ccef6d 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -86,7 +86,7 @@
>   struct hmm {
>   	struct mm_struct	*mm;
>   	struct kref		kref;
> -	struct mutex		lock;
> +	spinlock_t		ranges_lock;
>   	struct list_head	ranges;
>   	struct list_head	mirrors;
>   	struct mmu_notifier	mmu_notifier;
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 4215edf737ef5b..10103a24e9b7b3 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -68,7 +68,7 @@ static struct hmm *hmm_get_or_create(struct mm_struct *mm)
>   	init_rwsem(&hmm->mirrors_sem);
>   	hmm->mmu_notifier.ops = NULL;
>   	INIT_LIST_HEAD(&hmm->ranges);
> -	mutex_init(&hmm->lock);
> +	spin_lock_init(&hmm->ranges_lock);
>   	kref_init(&hmm->kref);
>   	hmm->notifiers = 0;
>   	hmm->mm = mm;
> @@ -114,18 +114,19 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>   {
>   	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
>   	struct hmm_mirror *mirror;
> +	unsigned long flags;
>   
>   	/* Bail out if hmm is in the process of being freed */
>   	if (!kref_get_unless_zero(&hmm->kref))
>   		return;
>   
> -	mutex_lock(&hmm->lock);
> +	spin_lock_irqsave(&hmm->ranges_lock, flags);
>   	/*
>   	 * Since hmm_range_register() holds the mmget() lock hmm_release() is
>   	 * prevented as long as a range exists.
>   	 */
>   	WARN_ON(!list_empty(&hmm->ranges));
> -	mutex_unlock(&hmm->lock);
> +	spin_unlock_irqrestore(&hmm->ranges_lock, flags);
>   
>   	down_read(&hmm->mirrors_sem);
>   	list_for_each_entry(mirror, &hmm->mirrors, list) {
> @@ -141,6 +142,23 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>   	hmm_put(hmm);
>   }
>   
> +static void notifiers_decrement(struct hmm *hmm)
> +{
> +	lockdep_assert_held(&hmm->ranges_lock);
> +
> +	hmm->notifiers--;
> +	if (!hmm->notifiers) {
> +		struct hmm_range *range;
> +
> +		list_for_each_entry(range, &hmm->ranges, list) {
> +			if (range->valid)
> +				continue;
> +			range->valid = true;
> +		}

This just effectively sets all ranges to valid.
I'm not sure that is best.
Shouldn't hmm_range_register() start with range.valid = true and
then hmm_invalidate_range_start() set affected ranges to false?
Then this becomes just wake_up_all() if --notifiers == 0 and
hmm_range_wait_until_valid() should wait for notifiers == 0.
Otherwise, range.valid doesn't really mean it's valid.

> +		wake_up_all(&hmm->wq);
> +	}
> +}
> +
>   static int hmm_invalidate_range_start(struct mmu_notifier *mn,
>   			const struct mmu_notifier_range *nrange)
>   {
> @@ -148,6 +166,7 @@ static int hmm_invalidate_range_start(struct mmu_notifier *mn,
>   	struct hmm_mirror *mirror;
>   	struct hmm_update update;
>   	struct hmm_range *range;
> +	unsigned long flags;
>   	int ret = 0;
>   
>   	if (!kref_get_unless_zero(&hmm->kref))
> @@ -158,12 +177,7 @@ static int hmm_invalidate_range_start(struct mmu_notifier *mn,
>   	update.event = HMM_UPDATE_INVALIDATE;
>   	update.blockable = mmu_notifier_range_blockable(nrange);
>   
> -	if (mmu_notifier_range_blockable(nrange))
> -		mutex_lock(&hmm->lock);
> -	else if (!mutex_trylock(&hmm->lock)) {
> -		ret = -EAGAIN;
> -		goto out;
> -	}
> +	spin_lock_irqsave(&hmm->ranges_lock, flags);
>   	hmm->notifiers++;
>   	list_for_each_entry(range, &hmm->ranges, list) {
>   		if (update.end < range->start || update.start >= range->end)
> @@ -171,7 +185,7 @@ static int hmm_invalidate_range_start(struct mmu_notifier *mn,
>   
>   		range->valid = false;
>   	}
> -	mutex_unlock(&hmm->lock);
> +	spin_unlock_irqrestore(&hmm->ranges_lock, flags);
>   
>   	if (mmu_notifier_range_blockable(nrange))
>   		down_read(&hmm->mirrors_sem);
> @@ -179,16 +193,26 @@ static int hmm_invalidate_range_start(struct mmu_notifier *mn,
>   		ret = -EAGAIN;
>   		goto out;
>   	}
> +
>   	list_for_each_entry(mirror, &hmm->mirrors, list) {
> -		int ret;
> +		int rc;
>   
> -		ret = mirror->ops->sync_cpu_device_pagetables(mirror, &update);
> -		if (!update.blockable && ret == -EAGAIN)
> +		rc = mirror->ops->sync_cpu_device_pagetables(mirror, &update);
> +		if (rc) {
> +			if (WARN_ON(update.blockable || rc != -EAGAIN))
> +				continue;
> +			ret = -EAGAIN;
>   			break;
> +		}
>   	}
>   	up_read(&hmm->mirrors_sem);
>   
>   out:
> +	if (ret) {
> +		spin_lock_irqsave(&hmm->ranges_lock, flags);
> +		notifiers_decrement(hmm);
> +		spin_unlock_irqrestore(&hmm->ranges_lock, flags);
> +	}
>   	hmm_put(hmm);
>   	return ret;
>   }
> @@ -197,23 +221,14 @@ static void hmm_invalidate_range_end(struct mmu_notifier *mn,
>   			const struct mmu_notifier_range *nrange)
>   {
>   	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
> +	unsigned long flags;
>   
>   	if (!kref_get_unless_zero(&hmm->kref))
>   		return;
>   
> -	mutex_lock(&hmm->lock);
> -	hmm->notifiers--;
> -	if (!hmm->notifiers) {
> -		struct hmm_range *range;
> -
> -		list_for_each_entry(range, &hmm->ranges, list) {
> -			if (range->valid)
> -				continue;
> -			range->valid = true;
> -		}
> -		wake_up_all(&hmm->wq);
> -	}
> -	mutex_unlock(&hmm->lock);
> +	spin_lock_irqsave(&hmm->ranges_lock, flags);
> +	notifiers_decrement(hmm);
> +	spin_unlock_irqrestore(&hmm->ranges_lock, flags);
>   
>   	hmm_put(hmm);
>   }
> @@ -866,6 +881,7 @@ int hmm_range_register(struct hmm_range *range,
>   {
>   	unsigned long mask = ((1UL << page_shift) - 1UL);
>   	struct hmm *hmm = mirror->hmm;
> +	unsigned long flags;
>   
>   	range->valid = false;
>   	range->hmm = NULL;
> @@ -887,7 +903,7 @@ int hmm_range_register(struct hmm_range *range,
>   	kref_get(&hmm->kref);
>   
>   	/* Initialize range to track CPU page table updates. */
> -	mutex_lock(&hmm->lock);
> +	spin_lock_irqsave(&hmm->ranges_lock, flags);
>   
>   	range->hmm = hmm;
>   	list_add(&range->list, &hmm->ranges);
> @@ -898,7 +914,7 @@ int hmm_range_register(struct hmm_range *range,
>   	 */
>   	if (!hmm->notifiers)
>   		range->valid = true;
> -	mutex_unlock(&hmm->lock);
> +	spin_unlock_irqrestore(&hmm->ranges_lock, flags);
>   
>   	return 0;
>   }
> @@ -914,13 +930,14 @@ EXPORT_SYMBOL(hmm_range_register);
>   void hmm_range_unregister(struct hmm_range *range)
>   {
>   	struct hmm *hmm = range->hmm;
> +	unsigned long flags;
>   
>   	if (WARN_ON(range->end <= range->start))
>   		return;
>   
> -	mutex_lock(&hmm->lock);
> +	spin_lock_irqsave(&hmm->ranges_lock, flags);
>   	list_del(&range->list);
> -	mutex_unlock(&hmm->lock);
> +	spin_unlock_irqrestore(&hmm->ranges_lock, flags);
>   
>   	/* Drop reference taken by hmm_range_register() */
>   	range->valid = false;
> 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 01/11] mm/hmm: fix use after free with struct hmm in the mmu notifiers
  2019-06-07 12:34     ` Jason Gunthorpe
  2019-06-07 13:42       ` Jason Gunthorpe
@ 2019-06-08  1:13       ` John Hubbard
  2019-06-08  1:37       ` John Hubbard
  2 siblings, 0 replies; 79+ messages in thread
From: John Hubbard @ 2019-06-08  1:13 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jerome Glisse, Ralph Campbell, Felix.Kuehling, linux-rdma,
	linux-mm, Andrea Arcangeli, dri-devel, amd-gfx

On 6/7/19 5:34 AM, Jason Gunthorpe wrote:
> On Thu, Jun 06, 2019 at 07:29:08PM -0700, John Hubbard wrote:
>> On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
>>> From: Jason Gunthorpe <jgg@mellanox.com>
>> ...
>>> diff --git a/mm/hmm.c b/mm/hmm.c
>>> index 8e7403f081f44a..547002f56a163d 100644
>>> +++ b/mm/hmm.c
>> ...
>>> @@ -125,7 +130,7 @@ static void hmm_free(struct kref *kref)
>>>  		mm->hmm = NULL;
>>>  	spin_unlock(&mm->page_table_lock);
>>>  
>>> -	kfree(hmm);
>>> +	mmu_notifier_call_srcu(&hmm->rcu, hmm_free_rcu);
>>
>>
>> It occurred to me to wonder if it is best to use the MMU notifier's
>> instance of srcu, instead of creating a separate instance for HMM.
> 
> It *has* to be the MMU notifier SRCU because we are synchornizing
> against the read side of that SRU inside the mmu notifier code, ie:
> 
> int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
>         id = srcu_read_lock(&srcu);
>         hlist_for_each_entry_rcu(mn, &range->mm->mmu_notifier_mm->list, hlist) {
>                 if (mn->ops->invalidate_range_start) {
>                    ^^^^^
> 
> Here 'mn' is really hmm (hmm = container_of(mn, struct hmm,
> mmu_notifier)), so we must protect the memory against free for the mmu
> notifier core.
> 
> Thus we have no choice but to use its SRCU.

Ah right. It's embarrassingly obvious when you say it out loud. :)
Thanks for explaining.

> 
> CH also pointed out a more elegant solution, which is to get the write
> side of the mmap_sem during hmm_mirror_unregister - no notifier
> callback can be running in this case. Then we delete the kref, srcu
> and so forth.
> 
> This is much clearer/saner/better, but.. requries the callers of
> hmm_mirror_unregister to be safe to get the mmap_sem write side.
> 
> I think this is true, so maybe this patch should be switched, what do
> you think?

OK, your follow-up notes that we'll leave it as is, got it.


thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v3 hmm 05/11] mm/hmm: Remove duplicate condition test before wait_event_timeout
  2019-06-07 13:31     ` [PATCH v3 " Jason Gunthorpe
  2019-06-07 22:55       ` Ira Weiny
@ 2019-06-08  1:32       ` John Hubbard
  1 sibling, 0 replies; 79+ messages in thread
From: John Hubbard @ 2019-06-08  1:32 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jerome Glisse, Ralph Campbell, Felix.Kuehling, linux-rdma,
	linux-mm, Andrea Arcangeli, dri-devel, amd-gfx

On 6/7/19 6:31 AM, Jason Gunthorpe wrote:
> The wait_event_timeout macro already tests the condition as its first
> action, so there is no reason to open code another version of this; all
> that does is skip the might_sleep() debugging in common cases, which is
> not helpful.
> 
> Further, based on prior patches, we can now simplify the required condition
> test:
>  - If range is valid memory then so is range->hmm
>  - If hmm_release() has run then range->valid is set to false
>    at the same time as dead, so no reason to check both.
>  - A valid hmm has a valid hmm->mm.
> 
> Allowing the return value of wait_event_timeout() (along with its internal
> barriers) to compute the result of the function.
> 
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> ---


    Reviewed-by: John Hubbard <jhubbard@nvidia.com>


thanks,
-- 
John Hubbard
NVIDIA



> v3
> - Simplify the wait_event_timeout to not check valid
> ---
>  include/linux/hmm.h | 13 ++-----------
>  1 file changed, 2 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 1d97b6d62c5bcf..26e7c477490c4e 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -209,17 +209,8 @@ static inline unsigned long hmm_range_page_size(const struct hmm_range *range)
>  static inline bool hmm_range_wait_until_valid(struct hmm_range *range,
>  					      unsigned long timeout)
>  {
> -	/* Check if mm is dead ? */
> -	if (range->hmm == NULL || range->hmm->dead || range->hmm->mm == NULL) {
> -		range->valid = false;
> -		return false;
> -	}
> -	if (range->valid)
> -		return true;
> -	wait_event_timeout(range->hmm->wq, range->valid || range->hmm->dead,
> -			   msecs_to_jiffies(timeout));
> -	/* Return current valid status just in case we get lucky */
> -	return range->valid;
> +	return wait_event_timeout(range->hmm->wq, range->valid,
> +				  msecs_to_jiffies(timeout)) != 0;
>  }
>  
>  /*
> 


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 12/11] mm/hmm: Fix error flows in hmm_invalidate_range_start
  2019-06-07 23:52   ` Ralph Campbell
@ 2019-06-08  1:35     ` Jason Gunthorpe
  0 siblings, 0 replies; 79+ messages in thread
From: Jason Gunthorpe @ 2019-06-08  1:35 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: Jerome Glisse, John Hubbard, Felix.Kuehling, linux-rdma,
	linux-mm, Andrea Arcangeli, dri-devel, amd-gfx

On Fri, Jun 07, 2019 at 04:52:58PM -0700, Ralph Campbell wrote:
> > @@ -141,6 +142,23 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
> >   	hmm_put(hmm);
> >   }
> > +static void notifiers_decrement(struct hmm *hmm)
> > +{
> > +	lockdep_assert_held(&hmm->ranges_lock);
> > +
> > +	hmm->notifiers--;
> > +	if (!hmm->notifiers) {
> > +		struct hmm_range *range;
> > +
> > +		list_for_each_entry(range, &hmm->ranges, list) {
> > +			if (range->valid)
> > +				continue;
> > +			range->valid = true;
> > +		}
> 
> This just effectively sets all ranges to valid.
> I'm not sure that is best.

This is a trade-off: it would be much more expensive to have a precise
'valid = true' - instead this algorithm is precise about 'valid =
false' and lazy about 'valid = true', which is much less costly to
calculate.

> Shouldn't hmm_range_register() start with range.valid = true and
> then hmm_invalidate_range_start() set affected ranges to false?

It kind of does, except when it doesn't, right? :)

> Then this becomes just wake_up_all() if --notifiers == 0 and
> hmm_range_wait_until_valid() should wait for notifiers == 0.

Almost.. but it is more tricky than that.

This scheme is a collision-retry algorithm. The pagefault side runs to
completion if no parallel invalidate start/end happens.

If a parallel invalidation happens then the pagefault retries.

Seeing notifiers == 0 means there is absolutely no current parallel
invalidation.

Seeing range->valid == true (under the device lock)
means this range doesn't intersect with a parallel invalidate.

So.. hmm_range_wait_until_valid() checks the per-range valid because
it doesn't want to sleep if *this range* is not involved in a parallel
invalidation - but once it becomes involved, then yes, valid == true
implies notifiers == 0.

It is easier/safer to use unlocked variable reads if there is only one
variable, thus the weird construction.
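
To make the collision-retry shape concrete, here is a minimal sketch of
the driver pagefault side under the scheme described above. Everything
prefixed dev_/DM_ (struct dev_mirror, dev_prog_ptes(), update_lock,
DM_FAULT_TIMEOUT_MS) is a made-up placeholder; only the hmm_* calls and
the mmap_sem are real, and the range is assumed to have already been set
up with hmm_range_register():

  /*
   * Hypothetical pagefault path. range->valid, re-checked under the
   * device lock, tells us whether a parallel invalidate_start/end
   * collided with this fault.
   */
  static int dev_fault_range(struct dev_mirror *dm, struct hmm_range *range)
  {
          long ret;

  again:
          if (!hmm_range_wait_until_valid(range, DM_FAULT_TIMEOUT_MS))
                  return -EBUSY;          /* mm is going away or stuck */

          down_read(&dm->mm->mmap_sem);
          ret = hmm_range_fault(range, true);
          up_read(&dm->mm->mmap_sem);
          if (ret < 0)
                  return ret;

          mutex_lock(&dm->update_lock);   /* the "device lock" */
          if (!range->valid) {
                  /* Collided with an invalidation, throw away and retry. */
                  mutex_unlock(&dm->update_lock);
                  goto again;
          }
          dev_prog_ptes(dm, range);       /* program device page tables */
          mutex_unlock(&dm->update_lock);
          return 0;
  }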

It is unclear to me if this micro optimization is really
worthwhile. It is very expensive to manage all this tracking, and no
other mmu notifier implementation really does something like
this. Eliminating the per-range tracking and using the notifier count
as a global lock would be much simpler...

> Otherwise, range.valid doesn't really mean it's valid.

Right, it doesn't really mean 'valid'

It is tracking possible colliding invalidates such that valid == true
(under the device lock) means that there was no colliding invalidate.

I still think this implementation doesn't quite work, as I described
here:

https://lore.kernel.org/linux-mm/20190527195829.GB18019@mellanox.com/

But the idea is basically sound and matches what other mmu notifier
users do, just using a seqcount like scheme, not a boolean.

Jason


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 01/11] mm/hmm: fix use after free with struct hmm in the mmu notifiers
  2019-06-07 12:34     ` Jason Gunthorpe
  2019-06-07 13:42       ` Jason Gunthorpe
  2019-06-08  1:13       ` John Hubbard
@ 2019-06-08  1:37       ` John Hubbard
  2 siblings, 0 replies; 79+ messages in thread
From: John Hubbard @ 2019-06-08  1:37 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jerome Glisse, Ralph Campbell, Felix.Kuehling, linux-rdma,
	linux-mm, Andrea Arcangeli, dri-devel, amd-gfx

On 6/7/19 5:34 AM, Jason Gunthorpe wrote:
> On Thu, Jun 06, 2019 at 07:29:08PM -0700, John Hubbard wrote:
>> On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
>>> From: Jason Gunthorpe <jgg@mellanox.com>
>> ...
>>> @@ -153,10 +158,14 @@ void hmm_mm_destroy(struct mm_struct *mm)
>>>  
>>>  static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>>>  {
>>> -	struct hmm *hmm = mm_get_hmm(mm);
>>> +	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
>>>  	struct hmm_mirror *mirror;
>>>  	struct hmm_range *range;
>>>  
>>> +	/* hmm is in progress to free */
>>
>> Well, sometimes, yes. :)
> 
> I think it is in all cases actually.. The only way we see a 0 kref
> and still reach this code path is if another thread has already set up
> the hmm_free in the call_srcu..
> 
>> Maybe this wording is clearer (if we need any comment at all):
> 
> I always find this hard.. This is a very standard pattern when working
> with RCU - however in my experience few people actually know the RCU
> patterns, and missing the _unless_zero is a common bug I find when
> looking at code.
> 
> This is mm/ so I can drop it, what do you think?
> 

I forgot to respond to this section, so catching up now:

I think we're talking about slightly different things. I was just
noting that the comment above the "if" statement was only accurate
if the branch is taken, which is why I recommended this combination
of comment and code:

	/* Bail out if hmm is in the process of being freed */
	if (!kref_get_unless_zero(&hmm->kref))
		return;

As for the actual _unless_zero part, I think that's good to have.
And it's a good reminder if nothing else, even in mm/ code.

thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 05/11] mm/hmm: Remove duplicate condition test before wait_event_timeout
  2019-06-07 22:13           ` Ralph Campbell
@ 2019-06-08  1:47             ` Jason Gunthorpe
  0 siblings, 0 replies; 79+ messages in thread
From: Jason Gunthorpe @ 2019-06-08  1:47 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: Jerome Glisse, John Hubbard, Felix.Kuehling, linux-rdma,
	linux-mm, Andrea Arcangeli, dri-devel, amd-gfx

On Fri, Jun 07, 2019 at 03:13:00PM -0700, Ralph Campbell wrote:
> > Do you see a reason why the find_vma() ever needs to be before the
> > 'again' in my above example? range.vma does not need to be set for
> > range_register.
> 
> Yes, for the GPU case, there can be many faults in an event queue
> and the goal is to try to handle more than one page at a time.
> The vma is needed to limit the amount of coalescing and checking
> for pages that could be speculatively migrated or mapped.

I'd need to see an algorithm sketch to see what you are thinking..

But, I guess a driver would have to figure out a list of what virtual
pages it wants to fault under the mmap_sem (maybe use find_vma, etc),
then drop mmap_sem, and start processing those pages for mirroring
under the hmm side.

ie they are two separate, unrelated tasks.

I looked at the hmm.rst again, and that reference algorithm is already
showing that that upon retry the mmap_sem is released:

      take_lock(driver->update);
      if (!range.valid) {
          release_lock(driver->update);
          up_read(&mm->mmap_sem);
          goto again;

So a driver cannot assume that any VMA-related work done under the
mmap_sem before the hmm 'critical section' can carry into the hmm critical
section, as the lock can be released upon retry, invalidating all that
data..

Forcing the hmm_range_start_and_lock() to acquire the mmap_sem is a
rough way to force the driver author to realize there are two locking
domains and that lock-protected information cannot cross between them.
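
For what it is worth, a rough sketch of those two separate tasks (all
dev_* names and struct dev_fault_batch are invented for illustration;
only the mmap_sem, find_vma() and the hmm calls named in the comments
are real):

  /*
   * Hypothetical GPU fault handler split across the two locking
   * domains: phase 1 only consults VMAs under the mmap_sem, phase 2
   * only mirrors pages on the hmm side and must not assume anything
   * learned in phase 1 survives a retry.
   */
  static void dev_handle_fault_queue(struct dev_mirror *dm)
  {
          struct dev_fault_batch batch;

          /* Phase 1: decide which virtual pages to fault. */
          down_read(&dm->mm->mmap_sem);
          dev_coalesce_faults(dm, &batch);        /* find_vma(), coalescing limits */
          up_read(&dm->mm->mmap_sem);

          /* Phase 2: mirror those pages, hmm_range_register() etc. */
          dev_mirror_batch(dm, &batch);
  }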

> OK, I understand.
> If you come up with a set of changes, I can try testing them.

Thanks, I make a sketch of the patch, I have to get back to it after
this series is done.

Jason


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 11/11] mm/hmm: Remove confusing comment and logic from hmm_release
  2019-06-07 21:37   ` Ralph Campbell
@ 2019-06-08  2:12     ` Jason Gunthorpe
  2019-06-10 16:02     ` Jason Gunthorpe
  1 sibling, 0 replies; 79+ messages in thread
From: Jason Gunthorpe @ 2019-06-08  2:12 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: Jerome Glisse, John Hubbard, Felix.Kuehling, linux-rdma,
	linux-mm, Andrea Arcangeli, dri-devel, amd-gfx

On Fri, Jun 07, 2019 at 02:37:07PM -0700, Ralph Campbell wrote:
> 
> On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> > From: Jason Gunthorpe <jgg@mellanox.com>
> > 
> > hmm_release() is called exactly once per hmm. ops->release() cannot
> > accidentally trigger any action that would recurse back onto
> > hmm->mirrors_sem.
> > 
> > This fixes a use after-free race of the form:
> > 
> >         CPU0                                   CPU1
> >                                             hmm_release()
> >                                               up_write(&hmm->mirrors_sem);
> >   hmm_mirror_unregister(mirror)
> >    down_write(&hmm->mirrors_sem);
> >    up_write(&hmm->mirrors_sem);
> >    kfree(mirror)
> >                                               mirror->ops->release(mirror)
> > 
> > The only user we have today for ops->release is an empty function, so this
> > is unambiguously safe.
> > 
> > As a consequence of plugging this race drivers are not allowed to
> > register/unregister mirrors from within a release op.
> > 
> > Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> 
> I agree with the analysis above but I'm not sure that release() will
> always be an empty function. It might be more efficient to write back
> all data migrated to a device "in one pass" instead of relying
> on unmap_vmas() calling hmm_start_range_invalidate() per VMA.

Sure, but it should not be allowed to recurse back to
hmm_mirror_unregister.

> I think the bigger issue is potential deadlocks while calling
> sync_cpu_device_pagetables() and tasks calling hmm_mirror_unregister():
>
> Say you have three threads:
> - Thread A is in try_to_unmap(), either without holding mmap_sem or with
> mmap_sem held for read.
> - Thread B has some unrelated driver calling hmm_mirror_unregister().
> This doesn't require mmap_sem.
> - Thread C is about to call migrate_vma().
>
> Thread A                Thread B                 Thread C
> try_to_unmap            hmm_mirror_unregister    migrate_vma
> hmm_invalidate_range_start
> down_read(mirrors_sem)
>                         down_write(mirrors_sem)
>                         // Blocked on A
>                                                   device_lock
> device_lock
> // Blocked on C
>                                                   migrate_vma()
>                                                   hmm_invalidate_range_s
>                                                   down_read(mirrors_sem)
>                                                   // Blocked on B
>                                                   // Deadlock

Oh... you know, I didn't know this about rwsems in Linux: they have
a fairness policy where pending writes block future reads..

Still, at least as things are designed, the driver cannot hold a lock
it obtains under sync_cpu_device_pagetables() and nest other things in
that lock. It certainly can't recurse back into any mmu notifiers
while holding that lock. (as you point out)

The lock in sync_cpu_device_pagetables() needs to be very narrowly
focused on updating device state only.
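
As a hedged sketch, a callback shaped that way might look like the
following (struct dev_mirror, devmem_lock and dev_unmap_range() are
invented; the mirror ops signature and the struct hmm_update fields are
the ones used by this series):

  /*
   * Hypothetical sync_cpu_device_pagetables(): the device lock only
   * covers the device page table update and never anything that can
   * call back into the mm or the notifiers.
   */
  static int dev_sync_cpu_device_pagetables(struct hmm_mirror *mirror,
                                            const struct hmm_update *update)
  {
          struct dev_mirror *dm =
                  container_of(mirror, struct dev_mirror, mirror);

          if (!update->blockable)
                  return -EAGAIN;         /* cannot take a sleeping lock */

          mutex_lock(&dm->devmem_lock);   /* narrow, device-state-only lock */
          dev_unmap_range(dm, update->start, update->end);
          mutex_unlock(&dm->devmem_lock);
          return 0;
  }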

So, my first reaction is that the driver in thread C is wrong, and
needs a different locking scheme. I think you'd have to make a really
good case that there is no alternative for a driver..

> Perhaps we should consider using SRCU for walking the mirror->list?

It means the driver has to deal with races like in this patch
description. At that point there is almost no reason to insert hmm
here, just use mmu notifiers directly.

Drivers won't get this right, it is too hard.

Jason


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 01/11] mm/hmm: fix use after free with struct hmm in the mmu notifiers
  2019-06-06 18:44 ` [PATCH v2 hmm 01/11] mm/hmm: fix use after free with struct hmm in the mmu notifiers Jason Gunthorpe
  2019-06-07  2:29   ` John Hubbard
  2019-06-07 18:12   ` Ralph Campbell
@ 2019-06-08  8:49   ` Christoph Hellwig
  2019-06-08 11:33     ` Jason Gunthorpe
  2 siblings, 1 reply; 79+ messages in thread
From: Christoph Hellwig @ 2019-06-08  8:49 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jerome Glisse, Ralph Campbell, John Hubbard, Felix.Kuehling,
	linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx,
	Jason Gunthorpe

I still think struct hmm should die.  We already have a structure used
for additional information for drivers having crazily tight integration
into the VM, and it is called struct mmu_notifier_mm.  We really need
to reuse that instead of duplicating it badly.


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 02/11] mm/hmm: Use hmm_mirror not mm as an argument for hmm_range_register
  2019-06-06 18:44 ` [PATCH v2 hmm 02/11] mm/hmm: Use hmm_mirror not mm as an argument for hmm_range_register Jason Gunthorpe
                     ` (2 preceding siblings ...)
  2019-06-07 22:33   ` Ira Weiny
@ 2019-06-08  8:54   ` Christoph Hellwig
  2019-06-11 19:44     ` Jason Gunthorpe
  3 siblings, 1 reply; 79+ messages in thread
From: Christoph Hellwig @ 2019-06-08  8:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jerome Glisse, Ralph Campbell, John Hubbard, Felix.Kuehling,
	linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx,
	Jason Gunthorpe

FYI, I very much disagree with the direction this is moving.

struct hmm_mirror literally is a trivial duplication of the
mmu_notifiers.  All these drivers should just use the mmu_notifiers
directly for the mirroring part instead of building a thin wrapper
that adds nothing but helping to manage the lifetime of struct hmm,
which shouldn't exist to start with.


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 01/11] mm/hmm: fix use after free with struct hmm in the mmu notifiers
  2019-06-08  8:49   ` Christoph Hellwig
@ 2019-06-08 11:33     ` Jason Gunthorpe
  0 siblings, 0 replies; 79+ messages in thread
From: Jason Gunthorpe @ 2019-06-08 11:33 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jerome Glisse, Ralph Campbell, John Hubbard, Felix.Kuehling,
	linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx

On Sat, Jun 08, 2019 at 01:49:48AM -0700, Christoph Hellwig wrote:
> I still think sruct hmm should die.  We already have a structure used
> for additional information for drivers having crazly tight integration
> into the VM, and it is called struct mmu_notifier_mm.  We really need
> to reuse that intead of duplicating it badly.

Probably. But at least in ODP we needed something very similar to
'struct hmm' to make our mmu notifier implementation work.

The mmu notifier api really lends itself to having a per-mm structure
in the driver to hold the 'struct mmu_notifier'..

I think I see other drivers are doing things like assuming that there
is only one mm in their world (despite being FD based, so this is not
really guaranteed)

So, my first attempt would be an api something like:

   priv = mmu_notifier_attach_mm(ops, current->mm, sizeof(my_priv))
   mmu_notifier_detach_mm(priv);

 ops->invalidate_start(struct mmu_notifier *mn):
   struct p *priv = mmu_notifier_priv(mn);

Such that
 - There is only one priv per mm
 - All the srcu stuff is handled inside mmu notifier
 - It is reference counted, so ops can be attached multiple times to
   the same mm

Then odp's per_mm, and struct hmm (if we keep it at all) is simply a
'priv' in the above.
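
To make the shape a bit more concrete, a hedged sketch of a driver
consuming such an API (mmu_notifier_attach_mm()/mmu_notifier_priv() are
the hypothetical helpers from the sketch above, not an existing
interface; the dev_* types are also invented, and the sketch assumes
the attach helper returns an ERR_PTR on failure):

  /* Per-mm driver state, allocated and refcounted by the notifier core. */
  struct dev_per_mm {
          spinlock_t lock;
          struct rb_root_cached ranges;   /* device mappings in this mm */
  };

  static int dev_invalidate_range_start(struct mmu_notifier *mn,
                          const struct mmu_notifier_range *range)
  {
          struct dev_per_mm *priv = mmu_notifier_priv(mn);  /* hypothetical */

          spin_lock(&priv->lock);
          /* ... mark device mappings overlapping [start, end) stale ... */
          spin_unlock(&priv->lock);
          return 0;
  }

  static const struct mmu_notifier_ops dev_mn_ops = {
          .invalidate_range_start = dev_invalidate_range_start,
  };

  static int dev_open_mm(struct dev_ctx *ctx)
  {
          struct dev_per_mm *priv;

          /* One priv per mm, created or refcounted by the core. */
          priv = mmu_notifier_attach_mm(&dev_mn_ops, current->mm, sizeof(*priv));
          if (IS_ERR(priv))
                  return PTR_ERR(priv);
          ctx->per_mm = priv;
          return 0;
  }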

I was thinking of looking at this stuff next, once this series is
done.

Jason


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 02/11] mm/hmm: Use hmm_mirror not mm as an argument for hmm_range_register
  2019-06-07 22:39     ` Ralph Campbell
@ 2019-06-10 13:09       ` Jason Gunthorpe
  0 siblings, 0 replies; 79+ messages in thread
From: Jason Gunthorpe @ 2019-06-10 13:09 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: Jerome Glisse, John Hubbard, Felix.Kuehling, linux-rdma,
	linux-mm, Andrea Arcangeli, dri-devel, amd-gfx

On Fri, Jun 07, 2019 at 03:39:06PM -0700, Ralph Campbell wrote:
> > > +    range->hmm = hmm;
> > > +    kref_get(&hmm->kref);
> > >       /* Initialize range to track CPU page table updates. */
> > >       mutex_lock(&hmm->lock);
> > > 
> 
> I forgot to add that I think you can delete the duplicate
>     "range->hmm = hmm;"
> here between the mutex_lock/unlock.

Done, thanks

Jason


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 11/11] mm/hmm: Remove confusing comment and logic from hmm_release
  2019-06-07 21:37   ` Ralph Campbell
  2019-06-08  2:12     ` Jason Gunthorpe
@ 2019-06-10 16:02     ` Jason Gunthorpe
  2019-06-10 22:03       ` Ralph Campbell
  1 sibling, 1 reply; 79+ messages in thread
From: Jason Gunthorpe @ 2019-06-10 16:02 UTC (permalink / raw)
  To: Ralph Campbell
  Cc: Jerome Glisse, John Hubbard, Felix.Kuehling, linux-rdma,
	linux-mm, Andrea Arcangeli, dri-devel, amd-gfx

On Fri, Jun 07, 2019 at 02:37:07PM -0700, Ralph Campbell wrote:
> 
> On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
> > From: Jason Gunthorpe <jgg@mellanox.com>
> > 
> > hmm_release() is called exactly once per hmm. ops->release() cannot
> > accidentally trigger any action that would recurse back onto
> > hmm->mirrors_sem.
> > 
> > This fixes a use after-free race of the form:
> > 
> >         CPU0                                   CPU1
> >                                             hmm_release()
> >                                               up_write(&hmm->mirrors_sem);
> >   hmm_mirror_unregister(mirror)
> >    down_write(&hmm->mirrors_sem);
> >    up_write(&hmm->mirrors_sem);
> >    kfree(mirror)
> >                                               mirror->ops->release(mirror)
> > 
> > The only user we have today for ops->release is an empty function, so this
> > is unambiguously safe.
> > 
> > As a consequence of plugging this race drivers are not allowed to
> > register/unregister mirrors from within a release op.
> > 
> > Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> 
> I agree with the analysis above but I'm not sure that release() will
> always be an empty function. It might be more efficient to write back
> all data migrated to a device "in one pass" instead of relying
> on unmap_vmas() calling hmm_start_range_invalidate() per VMA.

I think we have to focus on the *current* kernel - and we have two
users of release: nouveau_svm.c's is empty and amdgpu_mn.c's does
schedule_work() - so I believe we should go ahead with this simple
solution to the actual race today that both of those will suffer from.

If we find a need for a more complex version then it can be debated
and justified with proper context...

Ok?

Jason


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 11/11] mm/hmm: Remove confusing comment and logic from hmm_release
  2019-06-10 16:02     ` Jason Gunthorpe
@ 2019-06-10 22:03       ` Ralph Campbell
  0 siblings, 0 replies; 79+ messages in thread
From: Ralph Campbell @ 2019-06-10 22:03 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jerome Glisse, John Hubbard, Felix.Kuehling, linux-rdma,
	linux-mm, Andrea Arcangeli, dri-devel, amd-gfx


On 6/10/19 9:02 AM, Jason Gunthorpe wrote:
> On Fri, Jun 07, 2019 at 02:37:07PM -0700, Ralph Campbell wrote:
>>
>> On 6/6/19 11:44 AM, Jason Gunthorpe wrote:
>>> From: Jason Gunthorpe <jgg@mellanox.com>
>>>
>>> hmm_release() is called exactly once per hmm. ops->release() cannot
>>> accidentally trigger any action that would recurse back onto
>>> hmm->mirrors_sem.
>>>
>>> This fixes a use after-free race of the form:
>>>
>>>          CPU0                                   CPU1
>>>                                              hmm_release()
>>>                                                up_write(&hmm->mirrors_sem);
>>>    hmm_mirror_unregister(mirror)
>>>     down_write(&hmm->mirrors_sem);
>>>     up_write(&hmm->mirrors_sem);
>>>     kfree(mirror)
>>>                                                mirror->ops->release(mirror)
>>>
>>> The only user we have today for ops->release is an empty function, so this
>>> is unambiguously safe.
>>>
>>> As a consequence of plugging this race drivers are not allowed to
>>> register/unregister mirrors from within a release op.
>>>
>>> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
>>
>> I agree with the analysis above but I'm not sure that release() will
>> always be an empty function. It might be more efficient to write back
>> all data migrated to a device "in one pass" instead of relying
>> on unmap_vmas() calling hmm_start_range_invalidate() per VMA.
> 
> I think we have to focus on the *current* kernel - and we have two
> users of release, nouveau_svm.c is empty and amdgpu_mn.c does
> schedule_work() - so I believe we should go ahead with this simple
> solution to the actual race today that both of those will suffer from.
> 
> If we find a need for a more complex version then it can be debated
> and justified with proper context...
> 
> Ok?
> 
> Jason

OK.
I guess we have enough on the plate already :-)


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 02/11] mm/hmm: Use hmm_mirror not mm as an argument for hmm_range_register
  2019-06-08  8:54   ` Christoph Hellwig
@ 2019-06-11 19:44     ` Jason Gunthorpe
  2019-06-12  7:12       ` Christoph Hellwig
  0 siblings, 1 reply; 79+ messages in thread
From: Jason Gunthorpe @ 2019-06-11 19:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jerome Glisse, Ralph Campbell, John Hubbard, Felix.Kuehling,
	linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx

On Sat, Jun 08, 2019 at 01:54:25AM -0700, Christoph Hellwig wrote:
> FYI, I very much disagree with the direction this is moving.
> 
> struct hmm_mirror literally is a trivial duplication of the
> mmu_notifiers.  All these drivers should just use the mmu_notifiers
> directly for the mirroring part instead of building a thin wrapper
> that adds nothing but helping to manage the lifetime of struct hmm,
> which shouldn't exist to start with.

Christoph: What do you think about this sketch below?

It would replace the hmm_range/mirror/etc with a different way to
build the same locking scheme using some optional helpers linked to
the mmu notifier?

(just a sketch, still needs a lot more thinking)

Jason

From 5a91d17bc3b8fcaa685abddaaae5c5aea6f82dca Mon Sep 17 00:00:00 2001
From: Jason Gunthorpe <jgg@mellanox.com>
Date: Tue, 11 Jun 2019 16:33:33 -0300
Subject: [PATCH] RFC mm: Provide helpers to implement the common mmu_notifier
 locking

Many users of mmu_notifiers require a read/write lock that is write locked
during the invalidate_range_start/end period to protect against a parallel
thread reading the page tables while another thread is invalidating them.

kvm uses a collision-retry lock built with something like a sequence
count, and many mmu_notifier users have copied this approach with various
levels of success.

Provide a common set of helpers that build a sleepable read side lock
using a collision retry scheme. The general usage pattern is:

driver pagefault():
  struct mmu_invlock_state st = MMU_INVLOCK_STATE_INIT;

again:
  mm_invlock_start_write_and_lock(&driver->mn, &st);

  /* read vmas and page data under mmap_sem */
  /* maybe sleep */

  take_lock(&driver->lock);
  if (mn_invlock_end_write_and_unlock(&driver->mn, &st)) {
      unlock(&driver->lock);
      goto again;
  }
  /* make data visible to the device */
  /* does not sleep */
  unlock(&driver->lock);

The driver is responsible for providing the 'driver->lock', which is the same
lock it must hold during invalidate_range_start. By holding this lock the
sequence count is fully locked, and invalidations are prevented, so it is
safe to make the work visible to the device.
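
The notifier side of that pairing, called under the same driver->lock, would
look roughly like this (purely illustrative - the my_driver structure and the
my_range_is_mirrored() helper are placeholders):

static int my_invalidate_range_start(struct mmu_notifier *mn,
				     const struct mmu_notifier_range *range)
{
	struct my_driver *driver = container_of(mn, struct my_driver, mn);

	take_lock(&driver->lock);
	/* Only write-lock the invlock for ranges this driver mirrors */
	mmu_invlock_inv_start(mn, my_range_is_mirrored(driver, range));
	/* ... shoot down the device mappings covering the range ... */
	unlock(&driver->lock);
	return 0;
}

static void my_invalidate_range_end(struct mmu_notifier *mn,
				    const struct mmu_notifier_range *range)
{
	struct my_driver *driver = container_of(mn, struct my_driver, mn);

	take_lock(&driver->lock);
	mmu_invlock_inv_end(mn);
	unlock(&driver->lock);
}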

Since it is possible for this to livelock, it uses the write side of the
mmap_sem to create a slow path if there are repeated collisions.

This is based off the design of the hmm_range and the RDMA ODP locking
scheme, with some additional refinements.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 include/linux/mmu_notifier.h | 83 ++++++++++++++++++++++++++++++++++++
 mm/mmu_notifier.c            | 71 ++++++++++++++++++++++++++++++
 2 files changed, 154 insertions(+)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index b6c004bd9f6ad9..0417f9452f2a09 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -6,6 +6,7 @@
 #include <linux/spinlock.h>
 #include <linux/mm_types.h>
 #include <linux/srcu.h>
+#include <linux/sched.h>
 
 struct mmu_notifier;
 struct mmu_notifier_ops;
@@ -227,8 +228,90 @@ struct mmu_notifier_ops {
 struct mmu_notifier {
 	struct hlist_node hlist;
 	const struct mmu_notifier_ops *ops;
+
+	/*
+	 * mmu_invlock is a set of helpers to allow the caller to provide a
+	 * read/write lock scheme where the write side of the lock is held
+	 * between invalidate_start -> end, and the read side can be obtained
+	 * on some other thread. This is a common usage pattern for mmu
+	 * notifier users that want to lock against changes to the mmu.
+	 */
+	struct mm_struct *mm;
+	unsigned int active_invalidates;
+	seqcount_t invalidate_seq;
+	wait_queue_head_t wq;
 };
 
+struct mmu_invlock_state
+{
+	unsigned long timeout;
+	unsigned int update_seq;
+	bool write_locked;
+};
+
+#define MMU_INVLOCK_STATE_INIT {.timeout = msecs_to_jiffies(1000)}
+
+// FIXME: needs a seqcount helper
+static inline bool is_locked_seqcount(const seqcount_t *s)
+{
+	return s->sequence & 1;
+}
+
+void mm_invlock_start_write_and_lock(struct mmu_notifier *mn,
+				     struct mmu_invlock_state *st);
+bool mn_invlock_end_write_and_unlock(struct mmu_notifier *mn, struct mmu_invlock_state *st);
+
+/**
+ * mmu_invlock_inv_start - Call during invalidate_range_start
+ * @mn - mmu_notifier
+ * @lock - True if the supplied range is interesting and should cause the
+ *         write side of the lock to be held.
+ *
+ * Updates the locking state as part of the invalidate_range_start callback.
+ * This must be called under a user supplied lock, and it must be called for
+ * every invalidate_range_start.
+ */
+static inline void mmu_invlock_inv_start(struct mmu_notifier *mn, bool lock)
+{
+	if (lock && !mn->active_invalidates)
+		write_seqcount_begin(&mn->invalidate_seq);
+	mn->active_invalidates++;
+}
+
+/**
+ * mmu_invlock_inv_end - Call during invalidate_range_end
+ * @mn - mmu_notifier
+ *
+ * Updates the locking state as part of the invalidate_range_end callback.
+ * This must be called under a user supplied lock, and it must be called for
+ * every invalidate_range_end.
+ */
+static inline void mmu_invlock_inv_end(struct mmu_notifier *mn)
+{
+	mn->active_invalidates--;
+	if (!mn->active_invalidates &&
+	    is_locked_seqcount(&mn->invalidate_seq)) {
+		write_seqcount_end(&mn->invalidate_seq);
+		wake_up_all(&mn->wq);
+	}
+}
+
+/**
+ * mmu_invlock_write_needs_retry - Check if the write lock has collided
+ * @mn - mmu_notifier
+ * @st - lock state set by mm_invlock_start_write_and_lock()
+ *
+ * The invlock uses a collision retry scheme for the fast path. If a parallel
+ * invalidate has collided with the lock then it should be restarted again
+ * from mm_invlock_start_write_and_lock()
+ */
+static inline bool mmu_invlock_write_needs_retry(struct mmu_notifier *mn,
+						 struct mmu_invlock_state *st)
+{
+	return !st->write_locked &&
+	       read_seqcount_retry(&mn->invalidate_seq, st->update_seq);
+}
+
 static inline int mm_has_notifiers(struct mm_struct *mm)
 {
 	return unlikely(mm->mmu_notifier_mm);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index ee36068077b6e5..3db8cdd7211285 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -247,6 +247,11 @@ static int do_mmu_notifier_register(struct mmu_notifier *mn,
 
 	BUG_ON(atomic_read(&mm->mm_users) <= 0);
 
+	mn->mm = mm;
+	mn->active_invalidates = 0;
+	seqcount_init(&mn->invalidate_seq);
+	init_waitqueue_head(&mn->wq);
+
 	ret = -ENOMEM;
 	mmu_notifier_mm = kmalloc(sizeof(struct mmu_notifier_mm), GFP_KERNEL);
 	if (unlikely(!mmu_notifier_mm))
@@ -405,3 +410,69 @@ mmu_notifier_range_update_to_read_only(const struct mmu_notifier_range *range)
 	return range->vma->vm_flags & VM_READ;
 }
 EXPORT_SYMBOL_GPL(mmu_notifier_range_update_to_read_only);
+
+/**
+ * mm_invlock_start_write_and_lock - Start a read critical section
+ * @mn - mmu_notifier
+ * @st - lock state initialized by MMU_INVLOCK_STATE_INIT
+ *
+ * This should be called with the mmap_sem unlocked. It will wait for any
+ * parallel invalidations to complete and return with the mmap_sem locked. The
+ * mmap_sem may be locked for read or write.
+ *
+ * The critical section must always be ended by
+ * mn_invlock_end_write_and_unlock().
+ */
+void mm_invlock_start_write_and_lock(struct mmu_notifier *mn, struct mmu_invlock_state *st)
+{
+	long ret;
+
+	if (st->timeout == 0)
+		goto write_out;
+
+	ret = wait_event_timeout(
+		mn->wq, !is_locked_seqcount(&mn->invalidate_seq), st->timeout);
+	if (ret == 0)
+		goto write_out;
+
+	if (ret == 1)
+		st->timeout = 0;
+	else
+		st->timeout = ret;
+	/* Sample the sequence so the collision check in the end helper works */
+	st->update_seq = read_seqcount_begin(&mn->invalidate_seq);
+	down_read(&mn->mm->mmap_sem);
+	return;
+
+write_out:
+	/*
+	 * If we ran out of time then fall back to using the mmap_sem write
+	 * side to block concurrent invalidations. The seqcount is an
+	 * optimization to try and avoid this expensive lock.
+	 */
+	down_write(&mn->mm->mmap_sem);
+	st->write_locked = true;
+}
+EXPORT_SYMBOL_GPL(mm_invlock_start_write_and_lock);
+
+/**
+ * mn_invlock_end_write_and_unlock - End a read critical section
+ * @mn - mmu_notifier
+ * @st - lock state set by mm_invlock_start_write_and_lock()
+ *
+ * This completes the read side critical section. If it returns true the
+ * caller must restart from mm_invlock_start_write_and_lock().  In either
+ * case the mmap_sem is unlocked on return.
+ *
+ * The caller must hold the same lock that is held while calling
+ * mmu_invlock_inv_start()
+ */
+bool mn_invlock_end_write_and_unlock(struct mmu_notifier *mn,
+				     struct mmu_invlock_state *st)
+{
+	if (st->write_locked) {
+		up_write(&mn->mm->mmap_sem);
+		return false;
+	}
+	up_read(&mn->mm->mmap_sem);
+	return mmu_invlock_write_needs_retry(mn, st);
+}
+EXPORT_SYMBOL_GPL(mn_invlock_end_write_and_unlock);
-- 
2.21.0


^ permalink raw reply related	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 00/11] Various revisions from a locking/code review
  2019-06-06 18:44 [PATCH v2 hmm 00/11] Various revisions from a locking/code review Jason Gunthorpe
                   ` (11 preceding siblings ...)
  2019-06-07 16:05 ` [PATCH v2 12/11] mm/hmm: Fix error flows in hmm_invalidate_range_start Jason Gunthorpe
@ 2019-06-11 19:48 ` Jason Gunthorpe
  2019-06-12 17:54   ` Kuehling, Felix
  12 siblings, 1 reply; 79+ messages in thread
From: Jason Gunthorpe @ 2019-06-11 19:48 UTC (permalink / raw)
  To: Felix.Kuehling, Deucher, Alexander
  Cc: linux-rdma, linux-mm, dri-devel, amd-gfx

On Thu, Jun 06, 2019 at 03:44:27PM -0300, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> For hmm.git:
> 
> This patch series arised out of discussions with Jerome when looking at the
> ODP changes, particularly informed by use after free races we have already
> found and fixed in the ODP code (thanks to syzkaller) working with mmu
> notifiers, and the discussion with Ralph on how to resolve the lifetime model.
> 
> Overall this brings in a simplified locking scheme and easy to explain
> lifetime model:
> 
>  If a hmm_range is valid, then the hmm is valid, if a hmm is valid then the mm
>  is allocated memory.
> 
>  If the mm needs to still be alive (ie to lock the mmap_sem, find a vma, etc)
>  then the mmget must be obtained via mmget_not_zero().
> 
> Locking of mm->hmm is shifted to use the mmap_sem consistently for all
> read/write and unlocked accesses are removed.
> 
> The use unlocked reads on 'hmm->dead' are also eliminated in favour of using
> standard mmget() locking to prevent the mm from being released. Many of the
> debugging checks of !range->hmm and !hmm->mm are dropped in favour of poison -
> which is much clearer as to the lifetime intent.
> 
> The trailing patches are just some random cleanups I noticed when reviewing
> this code.
> 
> This v2 incorporates alot of the good off list changes & feedback Jerome had,
> and all the on-list comments too. However, now that we have the shared git I
> have kept the one line change to nouveau_svm.c rather than the compat
> funtions.
> 
> I believe we can resolve this merge in the DRM tree now and keep the core
> mm/hmm.c clean. DRM maintainers, please correct me if I'm wrong.
> 
> It is on top of hmm.git, and I have a git tree of this series to ease testing
> here:
> 
> https://github.com/jgunthorpe/linux/tree/hmm
> 
> There are still some open locking issues, as I think this remains unaddressed:
> 
> https://lore.kernel.org/linux-mm/20190527195829.GB18019@mellanox.com/
> 
> I'm looking for some more acks, reviews and tests so this can move ahead to
> hmm.git.

AMD Folks, this is looking pretty good now, can you please give at
least a Tested-by for the new driver code using this that I see in
linux-next?

Thanks,
Jason


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 02/11] mm/hmm: Use hmm_mirror not mm as an argument for hmm_range_register
  2019-06-11 19:44     ` Jason Gunthorpe
@ 2019-06-12  7:12       ` Christoph Hellwig
  2019-06-12 11:41         ` Jason Gunthorpe
  0 siblings, 1 reply; 79+ messages in thread
From: Christoph Hellwig @ 2019-06-12  7:12 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Jerome Glisse, Ralph Campbell, John Hubbard,
	Felix.Kuehling, linux-rdma, linux-mm, Andrea Arcangeli,
	dri-devel, amd-gfx

On Tue, Jun 11, 2019 at 04:44:31PM -0300, Jason Gunthorpe wrote:
> On Sat, Jun 08, 2019 at 01:54:25AM -0700, Christoph Hellwig wrote:
> > FYI, I very much disagree with the direction this is moving.
> > 
> > struct hmm_mirror literally is a trivial duplication of the
> > mmu_notifiers.  All these drivers should just use the mmu_notifiers
> > directly for the mirroring part instead of building a thin wrapper
> > that adds nothing but helping to manage the lifetime of struct hmm,
> > which shouldn't exist to start with.
> 
> Christoph: What do you think about this sketch below?
> 
> It would replace the hmm_range/mirror/etc with a different way to
> build the same locking scheme using some optional helpers linked to
> the mmu notifier?
> 
> (just a sketch, still needs a lot more thinking)

I like the idea.  A few nitpicks:  Can we avoid having to store
the mm in struct mmu_notifier?  I think we could just easily pass
it as a parameter to the helpers.  The write lock case of
mm_invlock_start_write_and_lock is probably worth factoring into
a separate helper? I can see cases where drivers want to just use
it directly if they need to force getting the lock without the chance
of a long wait.
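
Something like this trivially factored slow path is what I have in mind,
purely as an illustration reusing the names from the RFC sketch:

void mm_invlock_start_write_slow(struct mmu_notifier *mn,
				 struct mmu_invlock_state *st)
{
	/* Skip the wait/retry fast path and block invalidations outright */
	down_write(&mn->mm->mmap_sem);
	st->write_locked = true;
}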


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 02/11] mm/hmm: Use hmm_mirror not mm as an argument for hmm_range_register
  2019-06-12  7:12       ` Christoph Hellwig
@ 2019-06-12 11:41         ` Jason Gunthorpe
  2019-06-12 12:11           ` Christoph Hellwig
  0 siblings, 1 reply; 79+ messages in thread
From: Jason Gunthorpe @ 2019-06-12 11:41 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jerome Glisse, Ralph Campbell, John Hubbard, Felix.Kuehling,
	linux-rdma, linux-mm, Andrea Arcangeli, dri-devel, amd-gfx

On Wed, Jun 12, 2019 at 12:12:34AM -0700, Christoph Hellwig wrote:
> On Tue, Jun 11, 2019 at 04:44:31PM -0300, Jason Gunthorpe wrote:
> > On Sat, Jun 08, 2019 at 01:54:25AM -0700, Christoph Hellwig wrote:
> > > FYI, I very much disagree with the direction this is moving.
> > > 
> > > struct hmm_mirror literally is a trivial duplication of the
> > > mmu_notifiers.  All these drivers should just use the mmu_notifiers
> > > directly for the mirroring part instead of building a thin wrapper
> > > that adds nothing but helping to manage the lifetime of struct hmm,
> > > which shouldn't exist to start with.
> > 
> > Christoph: What do you think about this sketch below?
> > 
> > It would replace the hmm_range/mirror/etc with a different way to
> > build the same locking scheme using some optional helpers linked to
> > the mmu notifier?
> > 
> > (just a sketch, still needs a lot more thinking)
> 
> I like the idea.  A few nitpicks: Can we avoid having to store the
> mm in struct mmu_notifier? I think we could just easily pass it as a
> parameter to the helpers.

Yes, but I think any driver that needs to use this API will have to
hold the 'struct mm_struct' and the 'struct mmu_notifier' together (ie
ODP does this in ib_ucontext_per_mm), so if we put it in the notifier
then it is trivially available everywhere it is needed, and the
mmu_notifier code takes care of the lifetime for the driver.
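
ie most of these drivers already end up carrying a per-mm container along
these lines (hypothetical names, loosely modelled on ODP's ib_ucontext_per_mm):

struct my_per_mm {
	struct mmu_notifier mn;		/* mn.mm is filled in at registration */
	/* struct mm_struct *mm;	   redundant once the core stores it */
	spinlock_t lock;		/* the 'driver->lock' from the sketch */
	struct rb_root_cached ranges;	/* interval tree of mirrored ranges */
};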

> The write lock case of mm_invlock_start_write_and_lock is probably
> worth factoring into a separate helper? I can see cases where drivers
> want to just use it directly if they need to force getting the lock
> without the chance of a long wait.

The entire purpose of the invlock is to avoid getting the write lock
on mmap_sem as a fast path - if the driver wishes to use mmap_sem
locking only then it should just do so directly and forget about the
invlock.
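
ie something like this, just as a sketch - the my_driver naming is a
placeholder:

static void my_fault_mmap_sem_only(struct my_driver *driver)
{
	struct mm_struct *mm = driver->mn.mm;

	/*
	 * Slow but simple: holding the mmap_sem for write excludes, for the
	 * duration, any invalidation that itself runs under the mmap_sem.
	 */
	down_write(&mm->mmap_sem);
	/* read vmas and page data, push the result to the device */
	up_write(&mm->mmap_sem);
}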

Note that this patch is just an API sketch, I haven't fully checked
that the range_start/end are actually always called under mmap_sem,
and I already found that release is not. So there will need to be some
preparatory adjustments before we can use down_write(mmap_sem) as a
locking strategy here.

Thanks,
Jason


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 02/11] mm/hmm: Use hmm_mirror not mm as an argument for hmm_range_register
  2019-06-12 11:41         ` Jason Gunthorpe
@ 2019-06-12 12:11           ` Christoph Hellwig
  0 siblings, 0 replies; 79+ messages in thread
From: Christoph Hellwig @ 2019-06-12 12:11 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christoph Hellwig, Jerome Glisse, Ralph Campbell, John Hubbard,
	Felix.Kuehling, linux-rdma, linux-mm, Andrea Arcangeli,
	dri-devel, amd-gfx

On Wed, Jun 12, 2019 at 08:41:25AM -0300, Jason Gunthorpe wrote:
> > I like the idea.  A few nitpicks: Can we avoid having to store the
> > mm in struct mmu_notifier? I think we could just easily pass it as a
> > parameter to the helpers.
> 
> Yes, but I think any driver that needs to use this API will have to
> hold the 'struct mm_struct' and the 'struct mmu_notifier' together (ie
> ODP does this in ib_ucontext_per_mm), so if we put it in the notifier
> then it is trivially available everywhere it is needed, and the
> mmu_notifier code takes care of the lifetime for the driver.

True.  Well, maybe keep it for now at least.

> The entire purpose of the invlock is to avoid getting the write lock
> on mmap_sem as a fast path - if the driver wishes to use mmap_sem
> locking only then it should just do so directly and forget about the
> invlock.

My worry here is that there might be cases where the driver needs
to expedite the action, in which case it would jump straight to the write
lock.  But again we can probably skip this for now and see if it really
ends up being needed.


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 00/11] Various revisions from a locking/code review
  2019-06-11 19:48 ` [PATCH v2 hmm 00/11] Various revisions from a locking/code review Jason Gunthorpe
@ 2019-06-12 17:54   ` Kuehling, Felix
  2019-06-12 21:49     ` Yang, Philip
  0 siblings, 1 reply; 79+ messages in thread
From: Kuehling, Felix @ 2019-06-12 17:54 UTC (permalink / raw)
  To: Jason Gunthorpe, Deucher, Alexander, Yang, Philip
  Cc: linux-rdma, linux-mm, dri-devel, amd-gfx

[+Philip]

Hi Jason,

I'm out of the office this week.

Hi Philip, can you give this a go? Not sure how much you've been 
following this patch series review. Message or call me on Skype to 
discuss any questions.

Thanks,
   Felix

On 2019-06-11 12:48, Jason Gunthorpe wrote:
> On Thu, Jun 06, 2019 at 03:44:27PM -0300, Jason Gunthorpe wrote:
>> From: Jason Gunthorpe <jgg@mellanox.com>
>>
>> For hmm.git:
>>
>> This patch series arised out of discussions with Jerome when looking at the
>> ODP changes, particularly informed by use after free races we have already
>> found and fixed in the ODP code (thanks to syzkaller) working with mmu
>> notifiers, and the discussion with Ralph on how to resolve the lifetime model.
>>
>> Overall this brings in a simplified locking scheme and easy to explain
>> lifetime model:
>>
>>   If a hmm_range is valid, then the hmm is valid, if a hmm is valid then the mm
>>   is allocated memory.
>>
>>   If the mm needs to still be alive (ie to lock the mmap_sem, find a vma, etc)
>>   then the mmget must be obtained via mmget_not_zero().
>>
>> Locking of mm->hmm is shifted to use the mmap_sem consistently for all
>> read/write and unlocked accesses are removed.
>>
>> The use unlocked reads on 'hmm->dead' are also eliminated in favour of using
>> standard mmget() locking to prevent the mm from being released. Many of the
>> debugging checks of !range->hmm and !hmm->mm are dropped in favour of poison -
>> which is much clearer as to the lifetime intent.
>>
>> The trailing patches are just some random cleanups I noticed when reviewing
>> this code.
>>
>> This v2 incorporates alot of the good off list changes & feedback Jerome had,
>> and all the on-list comments too. However, now that we have the shared git I
>> have kept the one line change to nouveau_svm.c rather than the compat
>> funtions.
>>
>> I believe we can resolve this merge in the DRM tree now and keep the core
>> mm/hmm.c clean. DRM maintainers, please correct me if I'm wrong.
>>
>> It is on top of hmm.git, and I have a git tree of this series to ease testing
>> here:
>>
>> https://github.com/jgunthorpe/linux/tree/hmm
>>
>> There are still some open locking issues, as I think this remains unaddressed:
>>
>> https://lore.kernel.org/linux-mm/20190527195829.GB18019@mellanox.com/
>>
>> I'm looking for some more acks, reviews and tests so this can move ahead to
>> hmm.git.
> AMD Folks, this is looking pretty good now, can you please give at
> least a Tested-by for the new driver code using this that I see in
> linux-next?
>
> Thanks,
> Jason

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 00/11] Various revisions from a locking/code review
  2019-06-12 17:54   ` Kuehling, Felix
@ 2019-06-12 21:49     ` Yang, Philip
  2019-06-13 17:50       ` Jason Gunthorpe
  0 siblings, 1 reply; 79+ messages in thread
From: Yang, Philip @ 2019-06-12 21:49 UTC (permalink / raw)
  To: Kuehling, Felix, Jason Gunthorpe, Deucher, Alexander
  Cc: linux-rdma, linux-mm, dri-devel, amd-gfx

Rebased to the https://github.com/jgunthorpe/linux.git hmm branch; this needed
some changes because of the hmm_range_register interface change. Then I ran a
quick amdgpu_test. The test finished and the result is ok. But there is the
kernel BUG message below; it seems hmm_free_rcu() calls down_write()...

[ 1171.919921] BUG: sleeping function called from invalid context at 
/home/yangp/git/compute_staging/kernel/kernel/locking/rwsem.c:65
[ 1171.919933] in_atomic(): 1, irqs_disabled(): 0, pid: 53, name: 
kworker/1:1
[ 1171.919938] 2 locks held by kworker/1:1/53:
[ 1171.919940]  #0: 000000001c7c19d4 ((wq_completion)rcu_gp){+.+.}, at: 
process_one_work+0x20e/0x630
[ 1171.919951]  #1: 00000000923f2cfa 
((work_completion)(&sdp->work)){+.+.}, at: process_one_work+0x20e/0x630
[ 1171.919959] CPU: 1 PID: 53 Comm: kworker/1:1 Tainted: G        W 
    5.2.0-rc1-kfd-yangp #196
[ 1171.919961] Hardware name: ASUS All Series/Z97-PRO(Wi-Fi ac)/USB 3.1, 
BIOS 9001 03/07/2016
[ 1171.919965] Workqueue: rcu_gp srcu_invoke_callbacks
[ 1171.919968] Call Trace:
[ 1171.919974]  dump_stack+0x67/0x9b
[ 1171.919980]  ___might_sleep+0x149/0x230
[ 1171.919985]  down_write+0x1c/0x70
[ 1171.919989]  hmm_free_rcu+0x24/0x80
[ 1171.919993]  srcu_invoke_callbacks+0xc9/0x150
[ 1171.920000]  process_one_work+0x28e/0x630
[ 1171.920008]  worker_thread+0x39/0x3f0
[ 1171.920014]  ? process_one_work+0x630/0x630
[ 1171.920017]  kthread+0x11c/0x140
[ 1171.920021]  ? kthread_park+0x90/0x90
[ 1171.920026]  ret_from_fork+0x24/0x30

Philip

On 2019-06-12 1:54 p.m., Kuehling, Felix wrote:
> [+Philip]
> 
> Hi Jason,
> 
> I'm out of the office this week.
> 
> Hi Philip, can you give this a go? Not sure how much you've been
> following this patch series review. Message or call me on Skype to
> discuss any questions.
> 
> Thanks,
>     Felix
> 
> On 2019-06-11 12:48, Jason Gunthorpe wrote:
>> On Thu, Jun 06, 2019 at 03:44:27PM -0300, Jason Gunthorpe wrote:
>>> From: Jason Gunthorpe <jgg@mellanox.com>
>>>
>>> For hmm.git:
>>>
>>> This patch series arised out of discussions with Jerome when looking at the
>>> ODP changes, particularly informed by use after free races we have already
>>> found and fixed in the ODP code (thanks to syzkaller) working with mmu
>>> notifiers, and the discussion with Ralph on how to resolve the lifetime model.
>>>
>>> Overall this brings in a simplified locking scheme and easy to explain
>>> lifetime model:
>>>
>>>    If a hmm_range is valid, then the hmm is valid, if a hmm is valid then the mm
>>>    is allocated memory.
>>>
>>>    If the mm needs to still be alive (ie to lock the mmap_sem, find a vma, etc)
>>>    then the mmget must be obtained via mmget_not_zero().
>>>
>>> Locking of mm->hmm is shifted to use the mmap_sem consistently for all
>>> read/write and unlocked accesses are removed.
>>>
>>> The use unlocked reads on 'hmm->dead' are also eliminated in favour of using
>>> standard mmget() locking to prevent the mm from being released. Many of the
>>> debugging checks of !range->hmm and !hmm->mm are dropped in favour of poison -
>>> which is much clearer as to the lifetime intent.
>>>
>>> The trailing patches are just some random cleanups I noticed when reviewing
>>> this code.
>>>
>>> This v2 incorporates alot of the good off list changes & feedback Jerome had,
>>> and all the on-list comments too. However, now that we have the shared git I
>>> have kept the one line change to nouveau_svm.c rather than the compat
>>> funtions.
>>>
>>> I believe we can resolve this merge in the DRM tree now and keep the core
>>> mm/hmm.c clean. DRM maintainers, please correct me if I'm wrong.
>>>
>>> It is on top of hmm.git, and I have a git tree of this series to ease testing
>>> here:
>>>
>>> https://github.com/jgunthorpe/linux/tree/hmm
>>>
>>> There are still some open locking issues, as I think this remains unaddressed:
>>>
>>> https://lore.kernel.org/linux-mm/20190527195829.GB18019@mellanox.com/
>>>
>>> I'm looking for some more acks, reviews and tests so this can move ahead to
>>> hmm.git.
>> AMD Folks, this is looking pretty good now, can you please give at
>> least a Tested-by for the new driver code using this that I see in
>> linux-next?
>>
>> Thanks,
>> Jason

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v2 hmm 00/11] Various revisions from a locking/code review
  2019-06-12 21:49     ` Yang, Philip
@ 2019-06-13 17:50       ` Jason Gunthorpe
  0 siblings, 0 replies; 79+ messages in thread
From: Jason Gunthorpe @ 2019-06-13 17:50 UTC (permalink / raw)
  To: Yang, Philip
  Cc: Kuehling, Felix, Deucher, Alexander, linux-rdma, linux-mm,
	dri-devel, amd-gfx

On Wed, Jun 12, 2019 at 09:49:12PM +0000, Yang, Philip wrote:
> Rebased to the https://github.com/jgunthorpe/linux.git hmm branch; this
> needed some changes because of the hmm_range_register interface change. Then
> I ran a quick amdgpu_test. The test finished and the result is ok.

Great! Thanks

I'll add your Tested-by to the series

>  But there is the kernel BUG message below; it seems hmm_free_rcu() calls
> down_write()...
> 
> [ 1171.919921] BUG: sleeping function called from invalid context at 
> /home/yangp/git/compute_staging/kernel/kernel/locking/rwsem.c:65
> [ 1171.919933] in_atomic(): 1, irqs_disabled(): 0, pid: 53, name: 
> kworker/1:1
> [ 1171.919938] 2 locks held by kworker/1:1/53:
> [ 1171.919940]  #0: 000000001c7c19d4 ((wq_completion)rcu_gp){+.+.}, at: 
> process_one_work+0x20e/0x630
> [ 1171.919951]  #1: 00000000923f2cfa 
> ((work_completion)(&sdp->work)){+.+.}, at: process_one_work+0x20e/0x630
> [ 1171.919959] CPU: 1 PID: 53 Comm: kworker/1:1 Tainted: G        W 
>     5.2.0-rc1-kfd-yangp #196
> [ 1171.919961] Hardware name: ASUS All Series/Z97-PRO(Wi-Fi ac)/USB 3.1, 
> BIOS 9001 03/07/2016
> [ 1171.919965] Workqueue: rcu_gp srcu_invoke_callbacks
> [ 1171.919968] Call Trace:
> [ 1171.919974]  dump_stack+0x67/0x9b
> [ 1171.919980]  ___might_sleep+0x149/0x230
> [ 1171.919985]  down_write+0x1c/0x70
> [ 1171.919989]  hmm_free_rcu+0x24/0x80
> [ 1171.919993]  srcu_invoke_callbacks+0xc9/0x150
> [ 1171.920000]  process_one_work+0x28e/0x630
> [ 1171.920008]  worker_thread+0x39/0x3f0
> [ 1171.920014]  ? process_one_work+0x630/0x630
> [ 1171.920017]  kthread+0x11c/0x140
> [ 1171.920021]  ? kthread_park+0x90/0x90
> [ 1171.920026]  ret_from_fork+0x24/0x30

Thank you Philip, it seems the prior tests were not done with
lockdep...

Sigh, I will keep the gross page_table_lock approach then. I updated
the patches in the git tree with this modification. I think we have covered
all the bases, so I will send another version of the series to the list, and
if there are no more comments it will move ahead to hmm.git. Thanks to all.

diff --git a/mm/hmm.c b/mm/hmm.c
index 136c812faa2790..4c64d4c32f4825 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -49,16 +49,15 @@ static struct hmm *hmm_get_or_create(struct mm_struct *mm)
 
 	lockdep_assert_held_exclusive(&mm->mmap_sem);
 
+	/* Abuse the page_table_lock to also protect mm->hmm. */
+	spin_lock(&mm->page_table_lock);
 	if (mm->hmm) {
-		if (kref_get_unless_zero(&mm->hmm->kref))
+		if (kref_get_unless_zero(&mm->hmm->kref)) {
+			spin_unlock(&mm->page_table_lock);
 			return mm->hmm;
-		/*
-		 * The hmm is being freed by some other CPU and is pending a
-		 * RCU grace period, but this CPU can NULL now it since we
-		 * have the mmap_sem.
-		 */
-		mm->hmm = NULL;
+		}
 	}
+	spin_unlock(&mm->page_table_lock);
 
 	hmm = kmalloc(sizeof(*hmm), GFP_KERNEL);
 	if (!hmm)
@@ -81,7 +80,14 @@ static struct hmm *hmm_get_or_create(struct mm_struct *mm)
 	}
 
 	mmgrab(hmm->mm);
+
+	/*
+	 * We hold the exclusive mmap_sem here so we know that mm->hmm is
+	 * still NULL or 0 kref, and is safe to update.
+	 */
+	spin_lock(&mm->page_table_lock);
 	mm->hmm = hmm;
+	spin_unlock(&mm->page_table_lock);
 	return hmm;
 }
 
@@ -89,10 +95,14 @@ static void hmm_free_rcu(struct rcu_head *rcu)
 {
 	struct hmm *hmm = container_of(rcu, struct hmm, rcu);
 
-	down_write(&hmm->mm->mmap_sem);
+	/*
+	 * The mm->hmm pointer is kept valid while notifier ops can be running
+	 * so they don't have to deal with a NULL mm->hmm value
+	 */
+	spin_lock(&hmm->mm->page_table_lock);
 	if (hmm->mm->hmm == hmm)
 		hmm->mm->hmm = NULL;
-	up_write(&hmm->mm->mmap_sem);
+	spin_unlock(&hmm->mm->page_table_lock);
 	mmdrop(hmm->mm);
 
 	kfree(hmm);


^ permalink raw reply related	[flat|nested] 79+ messages in thread

end of thread, other threads:[~2019-06-13 17:50 UTC | newest]

Thread overview: 79+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-06-06 18:44 [PATCH v2 hmm 00/11] Various revisions from a locking/code review Jason Gunthorpe
2019-06-06 18:44 ` [PATCH v2 hmm 01/11] mm/hmm: fix use after free with struct hmm in the mmu notifiers Jason Gunthorpe
2019-06-07  2:29   ` John Hubbard
2019-06-07 12:34     ` Jason Gunthorpe
2019-06-07 13:42       ` Jason Gunthorpe
2019-06-08  1:13       ` John Hubbard
2019-06-08  1:37       ` John Hubbard
2019-06-07 18:12   ` Ralph Campbell
2019-06-08  8:49   ` Christoph Hellwig
2019-06-08 11:33     ` Jason Gunthorpe
2019-06-06 18:44 ` [PATCH v2 hmm 02/11] mm/hmm: Use hmm_mirror not mm as an argument for hmm_range_register Jason Gunthorpe
2019-06-07  2:36   ` John Hubbard
2019-06-07 18:24   ` Ralph Campbell
2019-06-07 22:39     ` Ralph Campbell
2019-06-10 13:09       ` Jason Gunthorpe
2019-06-07 22:33   ` Ira Weiny
2019-06-08  8:54   ` Christoph Hellwig
2019-06-11 19:44     ` Jason Gunthorpe
2019-06-12  7:12       ` Christoph Hellwig
2019-06-12 11:41         ` Jason Gunthorpe
2019-06-12 12:11           ` Christoph Hellwig
2019-06-06 18:44 ` [PATCH v2 hmm 03/11] mm/hmm: Hold a mmgrab from hmm to mm Jason Gunthorpe
2019-06-07  2:44   ` John Hubbard
2019-06-07 12:36     ` Jason Gunthorpe
2019-06-07 18:41   ` Ralph Campbell
2019-06-07 18:51     ` Jason Gunthorpe
2019-06-07 22:38   ` Ira Weiny
2019-06-06 18:44 ` [PATCH v2 hmm 04/11] mm/hmm: Simplify hmm_get_or_create and make it reliable Jason Gunthorpe
2019-06-07  2:54   ` John Hubbard
2019-06-07 18:52   ` Ralph Campbell
2019-06-07 22:44   ` Ira Weiny
2019-06-06 18:44 ` [PATCH v2 hmm 05/11] mm/hmm: Remove duplicate condition test before wait_event_timeout Jason Gunthorpe
2019-06-07  3:06   ` John Hubbard
2019-06-07 12:47     ` Jason Gunthorpe
2019-06-07 13:31     ` [PATCH v3 " Jason Gunthorpe
2019-06-07 22:55       ` Ira Weiny
2019-06-08  1:32       ` John Hubbard
2019-06-07 19:01   ` [PATCH v2 " Ralph Campbell
2019-06-07 19:13     ` Jason Gunthorpe
2019-06-07 20:21       ` Ralph Campbell
2019-06-07 20:44         ` Jason Gunthorpe
2019-06-07 22:13           ` Ralph Campbell
2019-06-08  1:47             ` Jason Gunthorpe
2019-06-06 18:44 ` [PATCH v2 hmm 06/11] mm/hmm: Hold on to the mmget for the lifetime of the range Jason Gunthorpe
2019-06-07  3:15   ` John Hubbard
2019-06-07 20:29   ` Ralph Campbell
2019-06-06 18:44 ` [PATCH v2 hmm 07/11] mm/hmm: Use lockdep instead of comments Jason Gunthorpe
2019-06-07  3:19   ` John Hubbard
2019-06-07 20:31   ` Ralph Campbell
2019-06-07 22:16   ` Souptick Joarder
2019-06-06 18:44 ` [PATCH v2 hmm 08/11] mm/hmm: Remove racy protection against double-unregistration Jason Gunthorpe
2019-06-07  3:29   ` John Hubbard
2019-06-07 13:57     ` Jason Gunthorpe
2019-06-07 20:33   ` Ralph Campbell
2019-06-06 18:44 ` [PATCH v2 hmm 09/11] mm/hmm: Poison hmm_range during unregister Jason Gunthorpe
2019-06-07  3:37   ` John Hubbard
2019-06-07 14:03     ` Jason Gunthorpe
2019-06-07 20:46   ` Ralph Campbell
2019-06-07 20:49     ` Jason Gunthorpe
2019-06-07 23:01   ` Ira Weiny
2019-06-06 18:44 ` [PATCH v2 hmm 10/11] mm/hmm: Do not use list*_rcu() for hmm->ranges Jason Gunthorpe
2019-06-07  3:40   ` John Hubbard
2019-06-07 20:49   ` Ralph Campbell
2019-06-07 22:11   ` Souptick Joarder
2019-06-07 23:02   ` Ira Weiny
2019-06-06 18:44 ` [PATCH v2 hmm 11/11] mm/hmm: Remove confusing comment and logic from hmm_release Jason Gunthorpe
2019-06-07  3:47   ` John Hubbard
2019-06-07 12:58     ` Jason Gunthorpe
2019-06-07 21:37   ` Ralph Campbell
2019-06-08  2:12     ` Jason Gunthorpe
2019-06-10 16:02     ` Jason Gunthorpe
2019-06-10 22:03       ` Ralph Campbell
2019-06-07 16:05 ` [PATCH v2 12/11] mm/hmm: Fix error flows in hmm_invalidate_range_start Jason Gunthorpe
2019-06-07 23:52   ` Ralph Campbell
2019-06-08  1:35     ` Jason Gunthorpe
2019-06-11 19:48 ` [PATCH v2 hmm 00/11] Various revisions from a locking/code review Jason Gunthorpe
2019-06-12 17:54   ` Kuehling, Felix
2019-06-12 21:49     ` Yang, Philip
2019-06-13 17:50       ` Jason Gunthorpe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).