linux-mm.kvack.org archive mirror
* [RFC PATCH 00/11] mm/hmm: Various revisions from a locking/code review
@ 2019-05-23 15:34 Jason Gunthorpe
  2019-05-23 15:34 ` [RFC PATCH 01/11] mm/hmm: Fix use after free with struct hmm in the mmu notifiers Jason Gunthorpe
                   ` (13 more replies)
  0 siblings, 14 replies; 45+ messages in thread
From: Jason Gunthorpe @ 2019-05-23 15:34 UTC (permalink / raw)
  To: linux-rdma, linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard
  Cc: Jason Gunthorpe

From: Jason Gunthorpe <jgg@mellanox.com>

This patch series arose out of discussions with Jerome when looking at the
ODP changes, particularly informed by the use-after-free races we have already
found and fixed in the ODP code (thanks to syzkaller) when working with mmu
notifiers, and the discussion with Ralph on how to resolve the lifetime model.

Overall this brings in a simplified locking scheme and an easy-to-explain
lifetime model:

 If a hmm_range is valid, then the hmm is valid; if a hmm is valid, then the
 mm's memory is still allocated.

 If the mm still needs to be alive (ie to lock the mmap_sem, find a vma, etc),
 then the mmget must be obtained via mmget_not_zero().
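
To make that last rule concrete, here is a minimal driver-side sketch of the
expected calling pattern (illustration only; the function name and error codes
are made up for this example and are not part of the series):

static int example_mirror_walk(struct hmm_mirror *mirror, unsigned long addr)
{
	struct mm_struct *mm = mirror->hmm->mm;
	struct vm_area_struct *vma;
	int ret = 0;

	/* The mirror only guarantees the struct mm stays allocated; take a
	 * full reference before using its contents.
	 */
	if (!mmget_not_zero(mm))
		return -ESRCH;

	down_read(&mm->mmap_sem);
	vma = find_vma(mm, addr);
	if (!vma)
		ret = -EFAULT;
	/* ... operate on the vma while mmap_sem is held ... */
	up_read(&mm->mmap_sem);
	mmput(mm);
	return ret;
}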

Locking of mm->hmm is shifted to use the mmap_sem consistently for all
reads and writes, and unlocked accesses are removed.

The unlocked reads of 'hmm->dead' are also eliminated in favour of using
standard mmget() locking to prevent the mm from being released. Many of the
debugging checks of !range->hmm and !hmm->mm are dropped in favour of poison
values, which make the lifetime intent much clearer.

The trailing patches are just some random cleanups I noticed when reviewing
this code.

I expect Jerome & Ralph will have some design notes so this is just RFC, and
it still needs a matching edit to nouveau. It is only compile tested.

Regards,
Jason

Jason Gunthorpe (11):
  mm/hmm: Fix use after free with struct hmm in the mmu notifiers
  mm/hmm: Use hmm_mirror not mm as an argument for hmm_register_range
  mm/hmm: Hold a mmgrab from hmm to mm
  mm/hmm: Simplify hmm_get_or_create and make it reliable
  mm/hmm: Improve locking around hmm->dead
  mm/hmm: Remove duplicate condition test before wait_event_timeout
  mm/hmm: Delete hmm_mirror_mm_is_alive()
  mm/hmm: Use lockdep instead of comments
  mm/hmm: Remove racy protection against double-unregistration
  mm/hmm: Poison hmm_range during unregister
  mm/hmm: Do not use list*_rcu() for hmm->ranges

 include/linux/hmm.h |  50 ++----------
 kernel/fork.c       |   1 -
 mm/hmm.c            | 184 +++++++++++++++++++-------------------------
 3 files changed, 88 insertions(+), 147 deletions(-)

-- 
2.21.0



* [RFC PATCH 01/11] mm/hmm: Fix use after free with struct hmm in the mmu notifiers
  2019-05-23 15:34 [RFC PATCH 00/11] mm/hmm: Various revisions from a locking/code review Jason Gunthorpe
@ 2019-05-23 15:34 ` Jason Gunthorpe
  2019-06-06 23:54   ` Ira Weiny
  2019-05-23 15:34 ` [RFC PATCH 02/11] mm/hmm: Use hmm_mirror not mm as an argument for hmm_register_range Jason Gunthorpe
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 45+ messages in thread
From: Jason Gunthorpe @ 2019-05-23 15:34 UTC (permalink / raw)
  To: linux-rdma, linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard
  Cc: Jason Gunthorpe

From: Jason Gunthorpe <jgg@mellanox.com>

mmu_notifier_unregister_no_release() is not a fence and the mmu_notifier
system will continue to reference hmm->mn until the srcu grace period
expires.

This results in use-after-free races like the following:

         CPU0                                    CPU1
                                                __mmu_notifier_invalidate_range_start()
                                                  srcu_read_lock()
                                                  hlist_for_each()
                                                    // mn == hmm->mn
hmm_mirror_unregister()
  hmm_put()
    hmm_free()
      mmu_notifier_unregister_no_release()
        hlist_del_init_rcu(hmm->mn->list)
                                                  mn->ops->invalidate_range_start(mn, range);
                                                    mm_get_hmm()
      mm->hmm = NULL;
      kfree(hmm)
                                                      mutex_lock(&hmm->lock);

Use SRCU to kfree the hmm memory so that the notifiers can rely on the hmm
still existing. Get the now-safe hmm struct through container_of() and use
kref_get_unless_zero() directly to lock it against being freed.
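
For reference, the general shape of this pattern is (a generic sketch only;
'struct obj' and these function names are invented for illustration and are
not part of the patch):

struct obj {
	struct mmu_notifier mn;
	struct kref kref;
	struct rcu_head rcu;
};

static void obj_free_rcu(struct rcu_head *rcu)
{
	kfree(container_of(rcu, struct obj, rcu));
}

static void obj_release(struct kref *kref)
{
	struct obj *obj = container_of(kref, struct obj, kref);

	/* Defer the kfree past the mmu_notifier SRCU grace period */
	mmu_notifier_call_srcu(&obj->rcu, obj_free_rcu);
}

/* Runs under the mmu_notifier SRCU read lock */
static void obj_notifier_cb(struct mmu_notifier *mn, struct mm_struct *mm)
{
	struct obj *obj = container_of(mn, struct obj, mn);

	/* The memory is still valid thanks to SRCU, but obj may be dying */
	if (!kref_get_unless_zero(&obj->kref))
		return;
	/* ... obj is now locked against free and safe to use ... */
	kref_put(&obj->kref, obj_release);
}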

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 include/linux/hmm.h |  1 +
 mm/hmm.c            | 25 +++++++++++++++++++------
 2 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 51ec27a8466816..8b91c90d3b88cb 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -102,6 +102,7 @@ struct hmm {
 	struct mmu_notifier	mmu_notifier;
 	struct rw_semaphore	mirrors_sem;
 	wait_queue_head_t	wq;
+	struct rcu_head		rcu;
 	long			notifiers;
 	bool			dead;
 };
diff --git a/mm/hmm.c b/mm/hmm.c
index 816c2356f2449f..824e7e160d8167 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -113,6 +113,11 @@ static struct hmm *hmm_get_or_create(struct mm_struct *mm)
 	return NULL;
 }
 
+static void hmm_fee_rcu(struct rcu_head *rcu)
+{
+	kfree(container_of(rcu, struct hmm, rcu));
+}
+
 static void hmm_free(struct kref *kref)
 {
 	struct hmm *hmm = container_of(kref, struct hmm, kref);
@@ -125,7 +130,7 @@ static void hmm_free(struct kref *kref)
 		mm->hmm = NULL;
 	spin_unlock(&mm->page_table_lock);
 
-	kfree(hmm);
+	mmu_notifier_call_srcu(&hmm->rcu, hmm_fee_rcu);
 }
 
 static inline void hmm_put(struct hmm *hmm)
@@ -153,10 +158,14 @@ void hmm_mm_destroy(struct mm_struct *mm)
 
 static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 {
-	struct hmm *hmm = mm_get_hmm(mm);
+	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
 	struct hmm_mirror *mirror;
 	struct hmm_range *range;
 
+	/* hmm is in progress to free */
+	if (!kref_get_unless_zero(&hmm->kref))
+		return;
+
 	/* Report this HMM as dying. */
 	hmm->dead = true;
 
@@ -194,13 +203,15 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 static int hmm_invalidate_range_start(struct mmu_notifier *mn,
 			const struct mmu_notifier_range *nrange)
 {
-	struct hmm *hmm = mm_get_hmm(nrange->mm);
+	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
 	struct hmm_mirror *mirror;
 	struct hmm_update update;
 	struct hmm_range *range;
 	int ret = 0;
 
-	VM_BUG_ON(!hmm);
+	/* hmm is in progress to free */
+	if (!kref_get_unless_zero(&hmm->kref))
+		return 0;
 
 	update.start = nrange->start;
 	update.end = nrange->end;
@@ -248,9 +259,11 @@ static int hmm_invalidate_range_start(struct mmu_notifier *mn,
 static void hmm_invalidate_range_end(struct mmu_notifier *mn,
 			const struct mmu_notifier_range *nrange)
 {
-	struct hmm *hmm = mm_get_hmm(nrange->mm);
+	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
 
-	VM_BUG_ON(!hmm);
+	/* hmm is in progress to free */
+	if (!kref_get_unless_zero(&hmm->kref))
+		return;
 
 	mutex_lock(&hmm->lock);
 	hmm->notifiers--;
-- 
2.21.0



* [RFC PATCH 02/11] mm/hmm: Use hmm_mirror not mm as an argument for hmm_register_range
  2019-05-23 15:34 [RFC PATCH 00/11] mm/hmm: Various revisions from a locking/code review Jason Gunthorpe
  2019-05-23 15:34 ` [RFC PATCH 01/11] mm/hmm: Fix use after free with struct hmm in the mmu notifiers Jason Gunthorpe
@ 2019-05-23 15:34 ` Jason Gunthorpe
  2019-05-23 18:22   ` Christoph Hellwig
  2019-05-23 15:34 ` [RFC PATCH 03/11] mm/hmm: Hold a mmgrab from hmm to mm Jason Gunthorpe
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 45+ messages in thread
From: Jason Gunthorpe @ 2019-05-23 15:34 UTC (permalink / raw)
  To: linux-rdma, linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard
  Cc: Jason Gunthorpe

From: Jason Gunthorpe <jgg@mellanox.com>

Ralph observes that hmm_register_range() can only be called by a driver
while a mirror is registered. Make this clear in the API by passing in the
mirror structure as a parameter.

This also simplifies understanding the lifetime model for struct hmm, as
the hmm pointer must be valid as part of a registered mirror, so all we
need in hmm_register_range() is a simple kref_get.

Suggested-by: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 include/linux/hmm.h |  7 ++++---
 mm/hmm.c            | 14 +++++---------
 2 files changed, 9 insertions(+), 12 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 8b91c90d3b88cb..87d29e085a69f7 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -503,7 +503,7 @@ static inline bool hmm_mirror_mm_is_alive(struct hmm_mirror *mirror)
  * Please see Documentation/vm/hmm.rst for how to use the range API.
  */
 int hmm_range_register(struct hmm_range *range,
-		       struct mm_struct *mm,
+		       struct hmm_mirror *mirror,
 		       unsigned long start,
 		       unsigned long end,
 		       unsigned page_shift);
@@ -539,7 +539,8 @@ static inline bool hmm_vma_range_done(struct hmm_range *range)
 }
 
 /* This is a temporary helper to avoid merge conflict between trees. */
-static inline int hmm_vma_fault(struct hmm_range *range, bool block)
+static inline int hmm_vma_fault(struct hmm_mirror *mirror,
+				struct hmm_range *range, bool block)
 {
 	long ret;
 
@@ -552,7 +553,7 @@ static inline int hmm_vma_fault(struct hmm_range *range, bool block)
 	range->default_flags = 0;
 	range->pfn_flags_mask = -1UL;
 
-	ret = hmm_range_register(range, range->vma->vm_mm,
+	ret = hmm_range_register(range, mirror,
 				 range->start, range->end,
 				 PAGE_SHIFT);
 	if (ret)
diff --git a/mm/hmm.c b/mm/hmm.c
index 824e7e160d8167..fa1b04fcfc2549 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -927,7 +927,7 @@ static void hmm_pfns_clear(struct hmm_range *range,
  * Track updates to the CPU page table see include/linux/hmm.h
  */
 int hmm_range_register(struct hmm_range *range,
-		       struct mm_struct *mm,
+		       struct hmm_mirror *mirror,
 		       unsigned long start,
 		       unsigned long end,
 		       unsigned page_shift)
@@ -935,7 +935,6 @@ int hmm_range_register(struct hmm_range *range,
 	unsigned long mask = ((1UL << page_shift) - 1UL);
 
 	range->valid = false;
-	range->hmm = NULL;
 
 	if ((start & mask) || (end & mask))
 		return -EINVAL;
@@ -946,15 +945,12 @@ int hmm_range_register(struct hmm_range *range,
 	range->start = start;
 	range->end = end;
 
-	range->hmm = hmm_get_or_create(mm);
-	if (!range->hmm)
-		return -EFAULT;
-
 	/* Check if hmm_mm_destroy() was call. */
-	if (range->hmm->mm == NULL || range->hmm->dead) {
-		hmm_put(range->hmm);
+	if (mirror->hmm->mm == NULL || mirror->hmm->dead)
 		return -EFAULT;
-	}
+
+	range->hmm = mirror->hmm;
+	kref_get(&range->hmm->kref);
 
 	/* Initialize range to track CPU page table update */
 	mutex_lock(&range->hmm->lock);
-- 
2.21.0



* [RFC PATCH 03/11] mm/hmm: Hold a mmgrab from hmm to mm
  2019-05-23 15:34 [RFC PATCH 00/11] mm/hmm: Various revisions from a locking/code review Jason Gunthorpe
  2019-05-23 15:34 ` [RFC PATCH 01/11] mm/hmm: Fix use after free with struct hmm in the mmu notifiers Jason Gunthorpe
  2019-05-23 15:34 ` [RFC PATCH 02/11] mm/hmm: Use hmm_mirror not mm as an argument for hmm_register_range Jason Gunthorpe
@ 2019-05-23 15:34 ` Jason Gunthorpe
  2019-05-23 15:34 ` [RFC PATCH 04/11] mm/hmm: Simplify hmm_get_or_create and make it reliable Jason Gunthorpe
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 45+ messages in thread
From: Jason Gunthorpe @ 2019-05-23 15:34 UTC (permalink / raw)
  To: linux-rdma, linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard
  Cc: Jason Gunthorpe

From: Jason Gunthorpe <jgg@mellanox.com>

So long as a struct hmm pointer exists, so should the struct mm it is
linked to. Take the mmgrab() as soon as a hmm is created, and mmdrop() it
once the hmm refcount goes to zero.

Since mmdrop() (ie a 0 kref on struct mm) is now impossible with a !NULL
mm->hmm, delete hmm_mm_destroy().
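
For illustration, a tiny sketch of the two mm reference counts involved (the
'mm_pin' names here are invented, not part of this patch): mmgrab()/mmdrop()
pin only struct mm_struct itself (mm->mm_count), while mmget()/mmput() pin the
whole address space (mm->mm_users) and are what is needed before walking vmas.

struct mm_pin {
	struct mm_struct *mm;
};

static void mm_pin_init(struct mm_pin *pin, struct mm_struct *mm)
{
	pin->mm = mm;
	mmgrab(mm);		/* struct mm_struct cannot be freed now */
}

static int mm_pin_use(struct mm_pin *pin)
{
	/* The address space may already be torn down even though the
	 * struct is still allocated, so a full reference is required.
	 */
	if (!mmget_not_zero(pin->mm))
		return -ESRCH;
	/* ... safe to take mmap_sem, call find_vma(), etc ... */
	mmput(pin->mm);
	return 0;
}

static void mm_pin_destroy(struct mm_pin *pin)
{
	mmdrop(pin->mm);	/* may free the struct if this was the last ref */
}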

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 include/linux/hmm.h |  3 ---
 kernel/fork.c       |  1 -
 mm/hmm.c            | 21 +++------------------
 3 files changed, 3 insertions(+), 22 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 87d29e085a69f7..2a7346384ead13 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -584,14 +584,11 @@ static inline int hmm_vma_fault(struct hmm_mirror *mirror,
 }
 
 /* Below are for HMM internal use only! Not to be used by device driver! */
-void hmm_mm_destroy(struct mm_struct *mm);
-
 static inline void hmm_mm_init(struct mm_struct *mm)
 {
 	mm->hmm = NULL;
 }
 #else /* IS_ENABLED(CONFIG_HMM_MIRROR) */
-static inline void hmm_mm_destroy(struct mm_struct *mm) {}
 static inline void hmm_mm_init(struct mm_struct *mm) {}
 #endif /* IS_ENABLED(CONFIG_HMM_MIRROR) */
 
diff --git a/kernel/fork.c b/kernel/fork.c
index b4cba953040a0f..51b114ec6c395c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -672,7 +672,6 @@ void __mmdrop(struct mm_struct *mm)
 	WARN_ON_ONCE(mm == current->active_mm);
 	mm_free_pgd(mm);
 	destroy_context(mm);
-	hmm_mm_destroy(mm);
 	mmu_notifier_mm_destroy(mm);
 	check_mm(mm);
 	put_user_ns(mm->user_ns);
diff --git a/mm/hmm.c b/mm/hmm.c
index fa1b04fcfc2549..e27058e92508b9 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -29,6 +29,7 @@
 #include <linux/swapops.h>
 #include <linux/hugetlb.h>
 #include <linux/memremap.h>
+#include <linux/sched/mm.h>
 #include <linux/jump_label.h>
 #include <linux/dma-mapping.h>
 #include <linux/mmu_notifier.h>
@@ -82,6 +83,7 @@ static struct hmm *hmm_get_or_create(struct mm_struct *mm)
 	hmm->notifiers = 0;
 	hmm->dead = false;
 	hmm->mm = mm;
+	mmgrab(hmm->mm);
 
 	spin_lock(&mm->page_table_lock);
 	if (!mm->hmm)
@@ -130,6 +132,7 @@ static void hmm_free(struct kref *kref)
 		mm->hmm = NULL;
 	spin_unlock(&mm->page_table_lock);
 
+	mmdrop(hmm->mm);
 	mmu_notifier_call_srcu(&hmm->rcu, hmm_fee_rcu);
 }
 
@@ -138,24 +141,6 @@ static inline void hmm_put(struct hmm *hmm)
 	kref_put(&hmm->kref, hmm_free);
 }
 
-void hmm_mm_destroy(struct mm_struct *mm)
-{
-	struct hmm *hmm;
-
-	spin_lock(&mm->page_table_lock);
-	hmm = mm_get_hmm(mm);
-	mm->hmm = NULL;
-	if (hmm) {
-		hmm->mm = NULL;
-		hmm->dead = true;
-		spin_unlock(&mm->page_table_lock);
-		hmm_put(hmm);
-		return;
-	}
-
-	spin_unlock(&mm->page_table_lock);
-}
-
 static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 {
 	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
-- 
2.21.0



* [RFC PATCH 04/11] mm/hmm: Simplify hmm_get_or_create and make it reliable
  2019-05-23 15:34 [RFC PATCH 00/11] mm/hmm: Various revisions from a locking/code review Jason Gunthorpe
                   ` (2 preceding siblings ...)
  2019-05-23 15:34 ` [RFC PATCH 03/11] mm/hmm: Hold a mmgrab from hmm to mm Jason Gunthorpe
@ 2019-05-23 15:34 ` Jason Gunthorpe
  2019-05-23 23:38   ` Ralph Campbell
  2019-05-23 15:34 ` [RFC PATCH 05/11] mm/hmm: Improve locking around hmm->dead Jason Gunthorpe
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 45+ messages in thread
From: Jason Gunthorpe @ 2019-05-23 15:34 UTC (permalink / raw)
  To: linux-rdma, linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard
  Cc: Jason Gunthorpe

From: Jason Gunthorpe <jgg@mellanox.com>

As coded, this function can false-fail in various racy situations. Make it
reliable by running it only under the write side of the mmap_sem and by
avoiding the false-failing compare/exchange pattern.

Also make the locking very easy to understand by only ever reading or
writing mm->hmm while holding the write side of the mmap_sem.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 mm/hmm.c | 75 ++++++++++++++++++++------------------------------------
 1 file changed, 27 insertions(+), 48 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index e27058e92508b9..ec54be54d81135 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -40,16 +40,6 @@
 #if IS_ENABLED(CONFIG_HMM_MIRROR)
 static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
 
-static inline struct hmm *mm_get_hmm(struct mm_struct *mm)
-{
-	struct hmm *hmm = READ_ONCE(mm->hmm);
-
-	if (hmm && kref_get_unless_zero(&hmm->kref))
-		return hmm;
-
-	return NULL;
-}
-
 /**
  * hmm_get_or_create - register HMM against an mm (HMM internal)
  *
@@ -64,11 +54,20 @@ static inline struct hmm *mm_get_hmm(struct mm_struct *mm)
  */
 static struct hmm *hmm_get_or_create(struct mm_struct *mm)
 {
-	struct hmm *hmm = mm_get_hmm(mm);
-	bool cleanup = false;
+	struct hmm *hmm;
 
-	if (hmm)
-		return hmm;
+	lockdep_assert_held_exclusive(mm->mmap_sem);
+
+	if (mm->hmm) {
+		if (kref_get_unless_zero(&mm->hmm->kref))
+			return mm->hmm;
+		/*
+		 * The hmm is being freed by some other CPU and is pending an
+		 * RCU grace period, but this CPU can NULL it now since we
+		 * have the mmap_sem.
+		 */
+		mm->hmm = NULL;
+	}
 
 	hmm = kmalloc(sizeof(*hmm), GFP_KERNEL);
 	if (!hmm)
@@ -85,54 +84,34 @@ static struct hmm *hmm_get_or_create(struct mm_struct *mm)
 	hmm->mm = mm;
 	mmgrab(hmm->mm);
 
-	spin_lock(&mm->page_table_lock);
-	if (!mm->hmm)
-		mm->hmm = hmm;
-	else
-		cleanup = true;
-	spin_unlock(&mm->page_table_lock);
-
-	if (cleanup)
-		goto error;
-
-	/*
-	 * We should only get here if hold the mmap_sem in write mode ie on
-	 * registration of first mirror through hmm_mirror_register()
-	 */
 	hmm->mmu_notifier.ops = &hmm_mmu_notifier_ops;
-	if (__mmu_notifier_register(&hmm->mmu_notifier, mm))
-		goto error_mm;
+	if (__mmu_notifier_register(&hmm->mmu_notifier, mm)) {
+		kfree(hmm);
+		return NULL;
+	}
 
+	mm->hmm = hmm;
 	return hmm;
-
-error_mm:
-	spin_lock(&mm->page_table_lock);
-	if (mm->hmm == hmm)
-		mm->hmm = NULL;
-	spin_unlock(&mm->page_table_lock);
-error:
-	kfree(hmm);
-	return NULL;
 }
 
 static void hmm_fee_rcu(struct rcu_head *rcu)
 {
+	struct hmm *hmm = container_of(rcu, struct hmm, rcu);
+
+	down_write(&hmm->mm->mmap_sem);
+	if (hmm->mm->hmm == hmm)
+		hmm->mm->hmm = NULL;
+	up_write(&hmm->mm->mmap_sem);
+	mmdrop(hmm->mm);
+
 	kfree(container_of(rcu, struct hmm, rcu));
 }
 
 static void hmm_free(struct kref *kref)
 {
 	struct hmm *hmm = container_of(kref, struct hmm, kref);
-	struct mm_struct *mm = hmm->mm;
-
-	mmu_notifier_unregister_no_release(&hmm->mmu_notifier, mm);
 
-	spin_lock(&mm->page_table_lock);
-	if (mm->hmm == hmm)
-		mm->hmm = NULL;
-	spin_unlock(&mm->page_table_lock);
-
-	mmdrop(hmm->mm);
+	mmu_notifier_unregister_no_release(&hmm->mmu_notifier, hmm->mm);
 	mmu_notifier_call_srcu(&hmm->rcu, hmm_fee_rcu);
 }
 
-- 
2.21.0



* [RFC PATCH 05/11] mm/hmm: Improve locking around hmm->dead
  2019-05-23 15:34 [RFC PATCH 00/11] mm/hmm: Various revisions from a locking/code review Jason Gunthorpe
                   ` (3 preceding siblings ...)
  2019-05-23 15:34 ` [RFC PATCH 04/11] mm/hmm: Simplify hmm_get_or_create and make it reliable Jason Gunthorpe
@ 2019-05-23 15:34 ` Jason Gunthorpe
  2019-05-24 13:40   ` Jason Gunthorpe
  2019-05-23 15:34 ` [RFC PATCH 06/11] mm/hmm: Remove duplicate condition test before wait_event_timeout Jason Gunthorpe
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 45+ messages in thread
From: Jason Gunthorpe @ 2019-05-23 15:34 UTC (permalink / raw)
  To: linux-rdma, linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard
  Cc: Jason Gunthorpe

From: Jason Gunthorpe <jgg@mellanox.com>

This value is read without any locking, so it is just an unreliable hint;
however, in many cases we need certainty that code is not racing with
mmput()/hmm_release().

For the two functions doing find_vma(), document that the caller is
expected to hold mmap_sem and thus also hold a mmget().

For hmm_range_register(), acquire a mmget internally, as it must not race
with hmm_release() when it sets valid.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 mm/hmm.c | 27 +++++++++++++++++++--------
 1 file changed, 19 insertions(+), 8 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index ec54be54d81135..d97ec293336ea5 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -909,8 +909,10 @@ int hmm_range_register(struct hmm_range *range,
 	range->start = start;
 	range->end = end;
 
-	/* Check if hmm_mm_destroy() was call. */
-	if (mirror->hmm->mm == NULL || mirror->hmm->dead)
+	/*
+	 * We cannot set range->value to true if hmm_release has already run.
+	 */
+	if (!mmget_not_zero(mirror->hmm->mm))
 		return -EFAULT;
 
 	range->hmm = mirror->hmm;
@@ -928,6 +930,7 @@ int hmm_range_register(struct hmm_range *range,
 	if (!range->hmm->notifiers)
 		range->valid = true;
 	mutex_unlock(&range->hmm->lock);
+	mmput(mirror->hmm->mm);
 
 	return 0;
 }
@@ -979,9 +982,13 @@ long hmm_range_snapshot(struct hmm_range *range)
 	struct vm_area_struct *vma;
 	struct mm_walk mm_walk;
 
-	/* Check if hmm_mm_destroy() was call. */
-	if (hmm->mm == NULL || hmm->dead)
-		return -EFAULT;
+	/*
+	 * Caller must hold the mmap_sem, and that requires the caller to have
+	 * a mmget.
+	 */
+	lockdep_assert_held(hmm->mm->mmap_sem);
+	if (WARN_ON(!atomic_read(&hmm->mm->mm_users)))
+		return -EINVAL;
 
 	do {
 		/* If range is no longer valid force retry. */
@@ -1077,9 +1084,13 @@ long hmm_range_fault(struct hmm_range *range, bool block)
 	struct mm_walk mm_walk;
 	int ret;
 
-	/* Check if hmm_mm_destroy() was call. */
-	if (hmm->mm == NULL || hmm->dead)
-		return -EFAULT;
+	/*
+	 * Caller must hold the mmap_sem, and that requires the caller to have
+	 * a mmget.
+	 */
+	lockdep_assert_held(hmm->mm->mmap_sem);
+	if (WARN_ON(!atomic_read(&hmm->mm->mm_users)))
+		return -EINVAL;
 
 	do {
 		/* If range is no longer valid force retry. */
-- 
2.21.0



* [RFC PATCH 06/11] mm/hmm: Remove duplicate condition test before wait_event_timeout
  2019-05-23 15:34 [RFC PATCH 00/11] mm/hmm: Various revisions from a locking/code review Jason Gunthorpe
                   ` (4 preceding siblings ...)
  2019-05-23 15:34 ` [RFC PATCH 05/11] mm/hmm: Improve locking around hmm->dead Jason Gunthorpe
@ 2019-05-23 15:34 ` Jason Gunthorpe
  2019-05-23 15:34 ` [RFC PATCH 07/11] mm/hmm: Delete hmm_mirror_mm_is_alive() Jason Gunthorpe
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 45+ messages in thread
From: Jason Gunthorpe @ 2019-05-23 15:34 UTC (permalink / raw)
  To: linux-rdma, linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard
  Cc: Jason Gunthorpe

From: Jason Gunthorpe <jgg@mellanox.com>

The wait_event_timeout() macro already tests the condition as its first
action, so there is no reason to open code another version of this; all
the open coded test does is skip the might_sleep() debugging in common
cases, which is not helpful (see the macro's rough shape quoted below).

Further, based on prior patches, we can now simplify the required condition
test:
 - If range is valid memory then so is range->hmm
 - If hmm_release() has run then range->valid is set to false
   at the same time as dead, so no reason to check both.
 - A valid hmm has a valid hmm->mm.

Also, add the READ_ONCE for range->valid as there is no lock held here.
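
For reference, the macro has roughly this shape (paraphrased from
include/linux/wait.h): the condition is evaluated immediately, and only the
might_sleep() debugging check runs unconditionally.

#define wait_event_timeout(wq_head, condition, timeout)			\
({									\
	long __ret = timeout;						\
	might_sleep();							\
	if (!___wait_cond_timeout(condition))				\
		__ret = __wait_event_timeout(wq_head, condition, timeout); \
	__ret;								\
})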

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 include/linux/hmm.h | 12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 2a7346384ead13..7f3b751fcab1ce 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -216,17 +216,9 @@ static inline unsigned long hmm_range_page_size(const struct hmm_range *range)
 static inline bool hmm_range_wait_until_valid(struct hmm_range *range,
 					      unsigned long timeout)
 {
-	/* Check if mm is dead ? */
-	if (range->hmm == NULL || range->hmm->dead || range->hmm->mm == NULL) {
-		range->valid = false;
-		return false;
-	}
-	if (range->valid)
-		return true;
-	wait_event_timeout(range->hmm->wq, range->valid || range->hmm->dead,
+	wait_event_timeout(range->hmm->wq, range->valid,
 			   msecs_to_jiffies(timeout));
-	/* Return current valid status just in case we get lucky */
-	return range->valid;
+	return READ_ONCE(range->valid);
 }
 
 /*
-- 
2.21.0



* [RFC PATCH 07/11] mm/hmm: Delete hmm_mirror_mm_is_alive()
  2019-05-23 15:34 [RFC PATCH 00/11] mm/hmm: Various revisions from a locking/code review Jason Gunthorpe
                   ` (5 preceding siblings ...)
  2019-05-23 15:34 ` [RFC PATCH 06/11] mm/hmm: Remove duplicate condition test before wait_event_timeout Jason Gunthorpe
@ 2019-05-23 15:34 ` Jason Gunthorpe
  2019-05-23 15:34 ` [RFC PATCH 08/11] mm/hmm: Use lockdep instead of comments Jason Gunthorpe
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 45+ messages in thread
From: Jason Gunthorpe @ 2019-05-23 15:34 UTC (permalink / raw)
  To: linux-rdma, linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard
  Cc: Jason Gunthorpe

From: Jason Gunthorpe <jgg@mellanox.com>

Now that it is clarified that callers of hmm_range_dma_map() must hold
the mmap_sem, and thus also an mmget, there is no purpose for this function.

It was the last user of 'dead', so delete that field as well.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 include/linux/hmm.h | 27 ---------------------------
 mm/hmm.c            |  4 ----
 2 files changed, 31 deletions(-)

diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 7f3b751fcab1ce..6671643703a7ab 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -91,7 +91,6 @@
  * @mirrors_sem: read/write semaphore protecting the mirrors list
  * @wq: wait queue for user waiting on a range invalidation
  * @notifiers: count of active mmu notifiers
- * @dead: is the mm dead ?
  */
 struct hmm {
 	struct mm_struct	*mm;
@@ -104,7 +103,6 @@ struct hmm {
 	wait_queue_head_t	wq;
 	struct rcu_head		rcu;
 	long			notifiers;
-	bool			dead;
 };
 
 /*
@@ -466,31 +464,6 @@ struct hmm_mirror {
 int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm);
 void hmm_mirror_unregister(struct hmm_mirror *mirror);
 
-/*
- * hmm_mirror_mm_is_alive() - test if mm is still alive
- * @mirror: the HMM mm mirror for which we want to lock the mmap_sem
- * Returns: false if the mm is dead, true otherwise
- *
- * This is an optimization it will not accurately always return -EINVAL if the
- * mm is dead ie there can be false negative (process is being kill but HMM is
- * not yet inform of that). It is only intented to be use to optimize out case
- * where driver is about to do something time consuming and it would be better
- * to skip it if the mm is dead.
- */
-static inline bool hmm_mirror_mm_is_alive(struct hmm_mirror *mirror)
-{
-	struct mm_struct *mm;
-
-	if (!mirror || !mirror->hmm)
-		return false;
-	mm = READ_ONCE(mirror->hmm->mm);
-	if (mirror->hmm->dead || !mm)
-		return false;
-
-	return true;
-}
-
-
 /*
  * Please see Documentation/vm/hmm.rst for how to use the range API.
  */
diff --git a/mm/hmm.c b/mm/hmm.c
index d97ec293336ea5..2695925c0c5927 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -80,7 +80,6 @@ static struct hmm *hmm_get_or_create(struct mm_struct *mm)
 	mutex_init(&hmm->lock);
 	kref_init(&hmm->kref);
 	hmm->notifiers = 0;
-	hmm->dead = false;
 	hmm->mm = mm;
 	mmgrab(hmm->mm);
 
@@ -130,9 +129,6 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 	if (!kref_get_unless_zero(&hmm->kref))
 		return;
 
-	/* Report this HMM as dying. */
-	hmm->dead = true;
-
 	/* Wake-up everyone waiting on any range. */
 	mutex_lock(&hmm->lock);
 	list_for_each_entry(range, &hmm->ranges, list) {
-- 
2.21.0



* [RFC PATCH 08/11] mm/hmm: Use lockdep instead of comments
  2019-05-23 15:34 [RFC PATCH 00/11] mm/hmm: Various revisions from a locking/code review Jason Gunthorpe
                   ` (6 preceding siblings ...)
  2019-05-23 15:34 ` [RFC PATCH 07/11] mm/hmm: Delete hmm_mirror_mm_is_alive() Jason Gunthorpe
@ 2019-05-23 15:34 ` Jason Gunthorpe
  2019-06-07 19:33   ` Souptick Joarder
  2019-05-23 15:34 ` [RFC PATCH 09/11] mm/hmm: Remove racy protection against double-unregistration Jason Gunthorpe
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 45+ messages in thread
From: Jason Gunthorpe @ 2019-05-23 15:34 UTC (permalink / raw)
  To: linux-rdma, linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard
  Cc: Jason Gunthorpe

From: Jason Gunthorpe <jgg@mellanox.com>

So we can check locking at runtime.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 mm/hmm.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index 2695925c0c5927..46872306f922bb 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -256,11 +256,11 @@ static const struct mmu_notifier_ops hmm_mmu_notifier_ops = {
  *
  * To start mirroring a process address space, the device driver must register
  * an HMM mirror struct.
- *
- * THE mm->mmap_sem MUST BE HELD IN WRITE MODE !
  */
 int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm)
 {
+	lockdep_assert_held_exclusive(mm->mmap_sem);
+
 	/* Sanity check */
 	if (!mm || !mirror || !mirror->ops)
 		return -EINVAL;
-- 
2.21.0



* [RFC PATCH 09/11] mm/hmm: Remove racy protection against double-unregistration
  2019-05-23 15:34 [RFC PATCH 00/11] mm/hmm: Various revisions from a locking/code review Jason Gunthorpe
                   ` (7 preceding siblings ...)
  2019-05-23 15:34 ` [RFC PATCH 08/11] mm/hmm: Use lockdep instead of comments Jason Gunthorpe
@ 2019-05-23 15:34 ` Jason Gunthorpe
  2019-06-07 19:38   ` Souptick Joarder
  2019-05-23 15:34 ` [RFC PATCH 10/11] mm/hmm: Poison hmm_range during unregister Jason Gunthorpe
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 45+ messages in thread
From: Jason Gunthorpe @ 2019-05-23 15:34 UTC (permalink / raw)
  To: linux-rdma, linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard
  Cc: Jason Gunthorpe

From: Jason Gunthorpe <jgg@mellanox.com>

No other register/unregister kernel API attempts to provide this kind of
protection, as it is inherently racy, so just drop it.

Callers should provide their own protection; it appears nouveau already
does, but just in case leave a debugging POISON behind.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 mm/hmm.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index 46872306f922bb..6c3b7398672c29 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -286,18 +286,13 @@ EXPORT_SYMBOL(hmm_mirror_register);
  */
 void hmm_mirror_unregister(struct hmm_mirror *mirror)
 {
-	struct hmm *hmm = READ_ONCE(mirror->hmm);
-
-	if (hmm == NULL)
-		return;
+	struct hmm *hmm = mirror->hmm;
 
 	down_write(&hmm->mirrors_sem);
 	list_del_init(&mirror->list);
-	/* To protect us against double unregister ... */
-	mirror->hmm = NULL;
 	up_write(&hmm->mirrors_sem);
-
 	hmm_put(hmm);
+	memset(&mirror->hmm, POISON_INUSE, sizeof(mirror->hmm));
 }
 EXPORT_SYMBOL(hmm_mirror_unregister);
 
-- 
2.21.0



* [RFC PATCH 10/11] mm/hmm: Poison hmm_range during unregister
  2019-05-23 15:34 [RFC PATCH 00/11] mm/hmm: Various revisions from a locking/code review Jason Gunthorpe
                   ` (8 preceding siblings ...)
  2019-05-23 15:34 ` [RFC PATCH 09/11] mm/hmm: Remove racy protection against double-unregistration Jason Gunthorpe
@ 2019-05-23 15:34 ` Jason Gunthorpe
  2019-06-07 20:13   ` Souptick Joarder
  2019-05-23 15:34 ` [RFC PATCH 11/11] mm/hmm: Do not use list*_rcu() for hmm->ranges Jason Gunthorpe
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 45+ messages in thread
From: Jason Gunthorpe @ 2019-05-23 15:34 UTC (permalink / raw)
  To: linux-rdma, linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard
  Cc: Jason Gunthorpe

From: Jason Gunthorpe <jgg@mellanox.com>

Trying to misuse a range outside its lifetime is a kernel bug. Use WARN_ON
and poison bytes to detect this condition.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 mm/hmm.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index 6c3b7398672c29..02752d3ef2ed92 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -936,8 +936,7 @@ EXPORT_SYMBOL(hmm_range_register);
  */
 void hmm_range_unregister(struct hmm_range *range)
 {
-	/* Sanity check this really should not happen. */
-	if (range->hmm == NULL || range->end <= range->start)
+	if (WARN_ON(range->end <= range->start))
 		return;
 
 	mutex_lock(&range->hmm->lock);
@@ -945,9 +944,13 @@ void hmm_range_unregister(struct hmm_range *range)
 	mutex_unlock(&range->hmm->lock);
 
 	/* Drop reference taken by hmm_range_register() */
-	range->valid = false;
 	hmm_put(range->hmm);
-	range->hmm = NULL;
+
+	/* The range is now invalid, leave it poisoned. */
+	range->valid = false;
+	range->start = ULONG_MAX;
+	range->end = 0;
+	memset(&range->hmm, POISON_INUSE, sizeof(range->hmm));
 }
 EXPORT_SYMBOL(hmm_range_unregister);
 
-- 
2.21.0



* [RFC PATCH 11/11] mm/hmm: Do not use list*_rcu() for hmm->ranges
  2019-05-23 15:34 [RFC PATCH 00/11] mm/hmm: Various revisions from a locking/code review Jason Gunthorpe
                   ` (9 preceding siblings ...)
  2019-05-23 15:34 ` [RFC PATCH 10/11] mm/hmm: Poison hmm_range during unregister Jason Gunthorpe
@ 2019-05-23 15:34 ` Jason Gunthorpe
  2019-06-07 20:22   ` Souptick Joarder
  2019-05-23 19:04 ` [RFC PATCH 00/11] mm/hmm: Various revisions from a locking/code review John Hubbard
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 45+ messages in thread
From: Jason Gunthorpe @ 2019-05-23 15:34 UTC (permalink / raw)
  To: linux-rdma, linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard
  Cc: Jason Gunthorpe

From: Jason Gunthorpe <jgg@mellanox.com>

This list is always read and written while holding hmm->lock so there is
no need for the confusing _rcu annotations.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 mm/hmm.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index 02752d3ef2ed92..b4aafa90a109a5 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -912,7 +912,7 @@ int hmm_range_register(struct hmm_range *range,
 	/* Initialize range to track CPU page table update */
 	mutex_lock(&range->hmm->lock);
 
-	list_add_rcu(&range->list, &range->hmm->ranges);
+	list_add(&range->list, &range->hmm->ranges);
 
 	/*
 	 * If there are any concurrent notifiers we have to wait for them for
@@ -940,7 +940,7 @@ void hmm_range_unregister(struct hmm_range *range)
 		return;
 
 	mutex_lock(&range->hmm->lock);
-	list_del_rcu(&range->list);
+	list_del(&range->list);
 	mutex_unlock(&range->hmm->lock);
 
 	/* Drop reference taken by hmm_range_register() */
-- 
2.21.0



* Re: [RFC PATCH 02/11] mm/hmm: Use hmm_mirror not mm as an argument for hmm_register_range
  2019-05-23 15:34 ` [RFC PATCH 02/11] mm/hmm: Use hmm_mirror not mm as an argument for hmm_register_range Jason Gunthorpe
@ 2019-05-23 18:22   ` Christoph Hellwig
  0 siblings, 0 replies; 45+ messages in thread
From: Christoph Hellwig @ 2019-05-23 18:22 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma, linux-mm, Jerome Glisse, Ralph Campbell,
	John Hubbard, Jason Gunthorpe

Nitpick:

s/hmm_register_range/hmm_range_register/ in the subject and patch
description.



* Re: [RFC PATCH 00/11] mm/hmm: Various revisions from a locking/code review
  2019-05-23 15:34 [RFC PATCH 00/11] mm/hmm: Various revisions from a locking/code review Jason Gunthorpe
                   ` (10 preceding siblings ...)
  2019-05-23 15:34 ` [RFC PATCH 11/11] mm/hmm: Do not use list*_rcu() for hmm->ranges Jason Gunthorpe
@ 2019-05-23 19:04 ` John Hubbard
  2019-05-23 19:37   ` Jason Gunthorpe
  2019-05-23 20:59   ` Jerome Glisse
  2019-05-24 13:35 ` Jason Gunthorpe
  2019-05-24 14:36 ` Jason Gunthorpe
  13 siblings, 2 replies; 45+ messages in thread
From: John Hubbard @ 2019-05-23 19:04 UTC (permalink / raw)
  To: Jason Gunthorpe, linux-rdma, linux-mm, Jerome Glisse, Ralph Campbell
  Cc: Jason Gunthorpe

On 5/23/19 8:34 AM, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> This patch series arised out of discussions with Jerome when looking at the
> ODP changes, particularly informed by use after free races we have already
> found and fixed in the ODP code (thanks to syzkaller) working with mmu
> notifiers, and the discussion with Ralph on how to resolve the lifetime model.
> 
> Overall this brings in a simplified locking scheme and easy to explain
> lifetime model:
> 
>   If a hmm_range is valid, then the hmm is valid, if a hmm is valid then the mm
>   is allocated memory.
> 
>   If the mm needs to still be alive (ie to lock the mmap_sem, find a vma, etc)
>   then the mmget must be obtained via mmget_not_zero().
> 
> Locking of mm->hmm is shifted to use the mmap_sem consistently for all
> read/write and unlocked accesses are removed.
> 
> The use unlocked reads on 'hmm->dead' are also eliminated in favour of using
> standard mmget() locking to prevent the mm from being released. Many of the
> debugging checks of !range->hmm and !hmm->mm are dropped in favour of poison -
> which is much clearer as to the lifetime intent.
> 
> The trailing patches are just some random cleanups I noticed when reviewing
> this code.
> 
> I expect Jerome & Ralph will have some design notes so this is just RFC, and
> it still needs a matching edit to nouveau. It is only compile tested.
> 

Thanks so much for doing this. Jerome has already absorbed these into his
hmm-5.3 branch, along with Ralph's other fixes, so we can start testing,
as well as reviewing, the whole set. We'll have feedback soon.


thanks,
-- 
John Hubbard
NVIDIA



* Re: [RFC PATCH 00/11] mm/hmm: Various revisions from a locking/code review
  2019-05-23 19:04 ` [RFC PATCH 00/11] mm/hmm: Various revisions from a locking/code review John Hubbard
@ 2019-05-23 19:37   ` Jason Gunthorpe
  2019-05-23 20:59   ` Jerome Glisse
  1 sibling, 0 replies; 45+ messages in thread
From: Jason Gunthorpe @ 2019-05-23 19:37 UTC (permalink / raw)
  To: John Hubbard; +Cc: linux-rdma, linux-mm, Jerome Glisse, Ralph Campbell

On Thu, May 23, 2019 at 12:04:16PM -0700, John Hubbard wrote:
> On 5/23/19 8:34 AM, Jason Gunthorpe wrote:
> > From: Jason Gunthorpe <jgg@mellanox.com>
> > 
> > This patch series arised out of discussions with Jerome when looking at the
> > ODP changes, particularly informed by use after free races we have already
> > found and fixed in the ODP code (thanks to syzkaller) working with mmu
> > notifiers, and the discussion with Ralph on how to resolve the lifetime model.
> > 
> > Overall this brings in a simplified locking scheme and easy to explain
> > lifetime model:
> > 
> >   If a hmm_range is valid, then the hmm is valid, if a hmm is valid then the mm
> >   is allocated memory.
> > 
> >   If the mm needs to still be alive (ie to lock the mmap_sem, find a vma, etc)
> >   then the mmget must be obtained via mmget_not_zero().
> > 
> > Locking of mm->hmm is shifted to use the mmap_sem consistently for all
> > read/write and unlocked accesses are removed.
> > 
> > The use unlocked reads on 'hmm->dead' are also eliminated in favour of using
> > standard mmget() locking to prevent the mm from being released. Many of the
> > debugging checks of !range->hmm and !hmm->mm are dropped in favour of poison -
> > which is much clearer as to the lifetime intent.
> > 
> > The trailing patches are just some random cleanups I noticed when reviewing
> > this code.
> > 
> > I expect Jerome & Ralph will have some design notes so this is just RFC, and
> > it still needs a matching edit to nouveau. It is only compile tested.
> > 
> 
> Thanks so much for doing this. Jerome has already absorbed these into his
> hmm-5.3 branch, along with Ralph's other fixes, so we can start testing,
> as well as reviewing, the whole set. We'll have feedback soon.

Yes, I looked at Jerome's v2s and he found a few great fixups.

My only dislike is re-introducing a READ_ONCE(mm->hmm) when a major
point of this series was to remove that use-after-free stuff.

But Jerome says it is a temporary defect while he works out some
cross-tree API stuff.

Jason



* Re: [RFC PATCH 00/11] mm/hmm: Various revisions from a locking/code review
  2019-05-23 19:04 ` [RFC PATCH 00/11] mm/hmm: Various revisions from a locking/code review John Hubbard
  2019-05-23 19:37   ` Jason Gunthorpe
@ 2019-05-23 20:59   ` Jerome Glisse
  1 sibling, 0 replies; 45+ messages in thread
From: Jerome Glisse @ 2019-05-23 20:59 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jason Gunthorpe, linux-rdma, linux-mm, Ralph Campbell, Jason Gunthorpe

On Thu, May 23, 2019 at 12:04:16PM -0700, John Hubbard wrote:
> On 5/23/19 8:34 AM, Jason Gunthorpe wrote:
> > From: Jason Gunthorpe <jgg@mellanox.com>
> > 
> > This patch series arised out of discussions with Jerome when looking at the
> > ODP changes, particularly informed by use after free races we have already
> > found and fixed in the ODP code (thanks to syzkaller) working with mmu
> > notifiers, and the discussion with Ralph on how to resolve the lifetime model.
> > 
> > Overall this brings in a simplified locking scheme and easy to explain
> > lifetime model:
> > 
> >   If a hmm_range is valid, then the hmm is valid, if a hmm is valid then the mm
> >   is allocated memory.
> > 
> >   If the mm needs to still be alive (ie to lock the mmap_sem, find a vma, etc)
> >   then the mmget must be obtained via mmget_not_zero().
> > 
> > Locking of mm->hmm is shifted to use the mmap_sem consistently for all
> > read/write and unlocked accesses are removed.
> > 
> > The use unlocked reads on 'hmm->dead' are also eliminated in favour of using
> > standard mmget() locking to prevent the mm from being released. Many of the
> > debugging checks of !range->hmm and !hmm->mm are dropped in favour of poison -
> > which is much clearer as to the lifetime intent.
> > 
> > The trailing patches are just some random cleanups I noticed when reviewing
> > this code.
> > 
> > I expect Jerome & Ralph will have some design notes so this is just RFC, and
> > it still needs a matching edit to nouveau. It is only compile tested.
> > 
> 
> Thanks so much for doing this. Jerome has already absorbed these into his
> hmm-5.3 branch, along with Ralph's other fixes, so we can start testing,
> as well as reviewing, the whole set. We'll have feedback soon.
> 

I force-pushed an updated branch with a couple of fixes:

https://cgit.freedesktop.org/~glisse/linux/log/?h=hmm-5.3

Seems to work OK so far; still doing testing.

Cheers,
Jérôme



* Re: [RFC PATCH 04/11] mm/hmm: Simplify hmm_get_or_create and make it reliable
  2019-05-23 15:34 ` [RFC PATCH 04/11] mm/hmm: Simplify hmm_get_or_create and make it reliable Jason Gunthorpe
@ 2019-05-23 23:38   ` Ralph Campbell
  2019-05-24  1:23     ` Jason Gunthorpe
  0 siblings, 1 reply; 45+ messages in thread
From: Ralph Campbell @ 2019-05-23 23:38 UTC (permalink / raw)
  To: Jason Gunthorpe, linux-rdma, linux-mm, Jerome Glisse, John Hubbard
  Cc: Jason Gunthorpe


On 5/23/19 8:34 AM, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> As coded this function can false-fail in various racy situations. Make it
> reliable by running only under the write side of the mmap_sem and avoiding
> the false-failing compare/exchange pattern.
> 
> Also make the locking very easy to understand by only ever reading or
> writing mm->hmm while holding the write side of the mmap_sem.
> 
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> ---
>   mm/hmm.c | 75 ++++++++++++++++++++------------------------------------
>   1 file changed, 27 insertions(+), 48 deletions(-)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index e27058e92508b9..ec54be54d81135 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -40,16 +40,6 @@
>   #if IS_ENABLED(CONFIG_HMM_MIRROR)
>   static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
>   
> -static inline struct hmm *mm_get_hmm(struct mm_struct *mm)
> -{
> -	struct hmm *hmm = READ_ONCE(mm->hmm);
> -
> -	if (hmm && kref_get_unless_zero(&hmm->kref))
> -		return hmm;
> -
> -	return NULL;
> -}
> -
>   /**
>    * hmm_get_or_create - register HMM against an mm (HMM internal)
>    *
> @@ -64,11 +54,20 @@ static inline struct hmm *mm_get_hmm(struct mm_struct *mm)
>    */
>   static struct hmm *hmm_get_or_create(struct mm_struct *mm)
>   {
> -	struct hmm *hmm = mm_get_hmm(mm);
> -	bool cleanup = false;
> +	struct hmm *hmm;
>   
> -	if (hmm)
> -		return hmm;
> +	lockdep_assert_held_exclusive(mm->mmap_sem);
> +
> +	if (mm->hmm) {
> +		if (kref_get_unless_zero(&mm->hmm->kref))
> +			return mm->hmm;
> +		/*
> +		 * The hmm is being freed by some other CPU and is pending a
> +		 * RCU grace period, but this CPU can NULL now it since we
> +		 * have the mmap_sem.
> +		 */
> +		mm->hmm = NULL;

Shouldn't there be a "return NULL;" here so it doesn't fall through and
allocate a struct hmm below?

> +	}
>   
>   	hmm = kmalloc(sizeof(*hmm), GFP_KERNEL);
>   	if (!hmm)
> @@ -85,54 +84,34 @@ static struct hmm *hmm_get_or_create(struct mm_struct *mm)
>   	hmm->mm = mm;
>   	mmgrab(hmm->mm);
>   
> -	spin_lock(&mm->page_table_lock);
> -	if (!mm->hmm)
> -		mm->hmm = hmm;
> -	else
> -		cleanup = true;
> -	spin_unlock(&mm->page_table_lock);
> -
> -	if (cleanup)
> -		goto error;
> -
> -	/*
> -	 * We should only get here if hold the mmap_sem in write mode ie on
> -	 * registration of first mirror through hmm_mirror_register()
> -	 */
>   	hmm->mmu_notifier.ops = &hmm_mmu_notifier_ops;
> -	if (__mmu_notifier_register(&hmm->mmu_notifier, mm))
> -		goto error_mm;
> +	if (__mmu_notifier_register(&hmm->mmu_notifier, mm)) {
> +		kfree(hmm);
> +		return NULL;
> +	}
>   
> +	mm->hmm = hmm;
>   	return hmm;
> -
> -error_mm:
> -	spin_lock(&mm->page_table_lock);
> -	if (mm->hmm == hmm)
> -		mm->hmm = NULL;
> -	spin_unlock(&mm->page_table_lock);
> -error:
> -	kfree(hmm);
> -	return NULL;
>   }
>   
>   static void hmm_fee_rcu(struct rcu_head *rcu)

I see Jerome already saw and named this hmm_free_rcu()
which I agree with.

>   {
> +	struct hmm *hmm = container_of(rcu, struct hmm, rcu);
> +
> +	down_write(&hmm->mm->mmap_sem);
> +	if (hmm->mm->hmm == hmm)
> +		hmm->mm->hmm = NULL;
> +	up_write(&hmm->mm->mmap_sem);
> +	mmdrop(hmm->mm);
> +
>   	kfree(container_of(rcu, struct hmm, rcu));
>   }
>   
>   static void hmm_free(struct kref *kref)
>   {
>   	struct hmm *hmm = container_of(kref, struct hmm, kref);
> -	struct mm_struct *mm = hmm->mm;
> -
> -	mmu_notifier_unregister_no_release(&hmm->mmu_notifier, mm);
>   
> -	spin_lock(&mm->page_table_lock);
> -	if (mm->hmm == hmm)
> -		mm->hmm = NULL;
> -	spin_unlock(&mm->page_table_lock);
> -
> -	mmdrop(hmm->mm);
> +	mmu_notifier_unregister_no_release(&hmm->mmu_notifier, hmm->mm);
>   	mmu_notifier_call_srcu(&hmm->rcu, hmm_fee_rcu);
>   }
>   
> 

-----------------------------------------------------------------------------------
This email message is for the sole use of the intended recipient(s) and may contain
confidential information.  Any unauthorized review, use, disclosure or distribution
is prohibited.  If you are not the intended recipient, please contact the sender by
reply email and destroy all copies of the original message.
-----------------------------------------------------------------------------------



* Re: [RFC PATCH 04/11] mm/hmm: Simplify hmm_get_or_create and make it reliable
  2019-05-23 23:38   ` Ralph Campbell
@ 2019-05-24  1:23     ` Jason Gunthorpe
  2019-05-24 17:06       ` Ralph Campbell
  0 siblings, 1 reply; 45+ messages in thread
From: Jason Gunthorpe @ 2019-05-24  1:23 UTC (permalink / raw)
  To: Ralph Campbell; +Cc: linux-rdma, linux-mm, Jerome Glisse, John Hubbard

On Thu, May 23, 2019 at 04:38:28PM -0700, Ralph Campbell wrote:
> 
> On 5/23/19 8:34 AM, Jason Gunthorpe wrote:
> > From: Jason Gunthorpe <jgg@mellanox.com>
> > 
> > As coded this function can false-fail in various racy situations. Make it
> > reliable by running only under the write side of the mmap_sem and avoiding
> > the false-failing compare/exchange pattern.
> > 
> > Also make the locking very easy to understand by only ever reading or
> > writing mm->hmm while holding the write side of the mmap_sem.
> > 
> > Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> >   mm/hmm.c | 75 ++++++++++++++++++++------------------------------------
> >   1 file changed, 27 insertions(+), 48 deletions(-)
> > 
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > index e27058e92508b9..ec54be54d81135 100644
> > +++ b/mm/hmm.c
> > @@ -40,16 +40,6 @@
> >   #if IS_ENABLED(CONFIG_HMM_MIRROR)
> >   static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
> > -static inline struct hmm *mm_get_hmm(struct mm_struct *mm)
> > -{
> > -	struct hmm *hmm = READ_ONCE(mm->hmm);
> > -
> > -	if (hmm && kref_get_unless_zero(&hmm->kref))
> > -		return hmm;
> > -
> > -	return NULL;
> > -}
> > -
> >   /**
> >    * hmm_get_or_create - register HMM against an mm (HMM internal)
> >    *
> > @@ -64,11 +54,20 @@ static inline struct hmm *mm_get_hmm(struct mm_struct *mm)
> >    */
> >   static struct hmm *hmm_get_or_create(struct mm_struct *mm)
> >   {
> > -	struct hmm *hmm = mm_get_hmm(mm);
> > -	bool cleanup = false;
> > +	struct hmm *hmm;
> > -	if (hmm)
> > -		return hmm;
> > +	lockdep_assert_held_exclusive(mm->mmap_sem);
> > +
> > +	if (mm->hmm) {
> > +		if (kref_get_unless_zero(&mm->hmm->kref))
> > +			return mm->hmm;
> > +		/*
> > +		 * The hmm is being freed by some other CPU and is pending a
> > +		 * RCU grace period, but this CPU can NULL now it since we
> > +		 * have the mmap_sem.
> > +		 */
> > +		mm->hmm = NULL;
> 
> Shouldn't there be a "return NULL;" here so it doesn't fall through and
> allocate a struct hmm below?

No, this function should only return NULL on memory allocation
failure.

In this case another thread is busy freeing the hmm but wasn't able to
update mm->hmm to null due to a locking constraint. So we make it null
on behalf of the other thread and allocate a fresh new hmm that is
valid. The freeing thread will complete the free and do nothing with
mm->hmm.

> >   static void hmm_fee_rcu(struct rcu_head *rcu)
> 
> I see Jerome already saw and named this hmm_free_rcu()
> which I agree with.

I do love my typos :)

> >   {
> > +	struct hmm *hmm = container_of(rcu, struct hmm, rcu);
> > +
> > +	down_write(&hmm->mm->mmap_sem);
> > +	if (hmm->mm->hmm == hmm)
> > +		hmm->mm->hmm = NULL;
> > +	up_write(&hmm->mm->mmap_sem);
> > +	mmdrop(hmm->mm);
> > +
> >   	kfree(container_of(rcu, struct hmm, rcu));
> >   }
> >   static void hmm_free(struct kref *kref)
> >   {
> >   	struct hmm *hmm = container_of(kref, struct hmm, kref);
> > -	struct mm_struct *mm = hmm->mm;
> > -
> > -	mmu_notifier_unregister_no_release(&hmm->mmu_notifier, mm);
> > -	spin_lock(&mm->page_table_lock);
> > -	if (mm->hmm == hmm)
> > -		mm->hmm = NULL;
> > -	spin_unlock(&mm->page_table_lock);
> > -
> > -	mmdrop(hmm->mm);
> > +	mmu_notifier_unregister_no_release(&hmm->mmu_notifier, hmm->mm);
> >   	mmu_notifier_call_srcu(&hmm->rcu, hmm_fee_rcu);
> >   }
> > 
> 
> This email message is for the sole use of the intended recipient(s) and may contain
> confidential information.  Any unauthorized review, use, disclosure or distribution
> is prohibited.  If you are not the intended recipient, please contact the sender by
> reply email and destroy all copies of the original message.

Ah, you should not send this trailer to the public mailing lists.

Thanks,
Jason



* Re: [RFC PATCH 00/11] mm/hmm: Various revisions from a locking/code review
  2019-05-23 15:34 [RFC PATCH 00/11] mm/hmm: Various revisions from a locking/code review Jason Gunthorpe
                   ` (11 preceding siblings ...)
  2019-05-23 19:04 ` [RFC PATCH 00/11] mm/hmm: Various revisions from a locking/code review John Hubbard
@ 2019-05-24 13:35 ` Jason Gunthorpe
  2019-05-24 14:36 ` Jason Gunthorpe
  13 siblings, 0 replies; 45+ messages in thread
From: Jason Gunthorpe @ 2019-05-24 13:35 UTC (permalink / raw)
  To: linux-rdma, linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard

On Thu, May 23, 2019 at 12:34:25PM -0300, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> This patch series arised out of discussions with Jerome when looking at the
> ODP changes, particularly informed by use after free races we have already
> found and fixed in the ODP code (thanks to syzkaller) working with mmu
> notifiers, and the discussion with Ralph on how to resolve the lifetime model.
> 
> Overall this brings in a simplified locking scheme and easy to explain
> lifetime model:
> 
>  If a hmm_range is valid, then the hmm is valid, if a hmm is valid then the mm
>  is allocated memory.
> 
>  If the mm needs to still be alive (ie to lock the mmap_sem, find a vma, etc)
>  then the mmget must be obtained via mmget_not_zero().
> 
> Locking of mm->hmm is shifted to use the mmap_sem consistently for all
> read/write and unlocked accesses are removed.
> 
> The use unlocked reads on 'hmm->dead' are also eliminated in favour of using
> standard mmget() locking to prevent the mm from being released. Many of the
> debugging checks of !range->hmm and !hmm->mm are dropped in favour of poison -
> which is much clearer as to the lifetime intent.
> 
> The trailing patches are just some random cleanups I noticed when reviewing
> this code.
> 
> I expect Jerome & Ralph will have some design notes so this is just RFC, and
> it still needs a matching edit to nouveau. It is only compile tested.
> 
> Regards,
> Jason
> 
> Jason Gunthorpe (11):
>   mm/hmm: Fix use after free with struct hmm in the mmu notifiers
>   mm/hmm: Use hmm_mirror not mm as an argument for hmm_register_range
>   mm/hmm: Hold a mmgrab from hmm to mm
>   mm/hmm: Simplify hmm_get_or_create and make it reliable
>   mm/hmm: Improve locking around hmm->dead
>   mm/hmm: Remove duplicate condition test before wait_event_timeout
>   mm/hmm: Delete hmm_mirror_mm_is_alive()
>   mm/hmm: Use lockdep instead of comments
>   mm/hmm: Remove racy protection against double-unregistration
>   mm/hmm: Poison hmm_range during unregister
>   mm/hmm: Do not use list*_rcu() for hmm->ranges
> 
>  include/linux/hmm.h |  50 ++----------
>  kernel/fork.c       |   1 -
>  mm/hmm.c            | 184 +++++++++++++++++++-------------------------
>  3 files changed, 88 insertions(+), 147 deletions(-)

Jerome, I was doing some more checking of this and noticed the lockdep
asserts are not compile-tested when lockdep is turned off. Since you took
and revised the series, can you please fold in these hunks to fix the
compile failures with lockdep on. Thanks

commit f0653c4d4c1dadeaf58d49f1c949ab1d2fda05d3
diff --git a/mm/hmm.c b/mm/hmm.c
index 836adf613f81c8..2a08b78550b90d 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -56,7 +56,7 @@ static struct hmm *hmm_get_or_create(struct mm_struct *mm)
 {
 	struct hmm *hmm;
 
-	lockdep_assert_held_exclusive(mm->mmap_sem);
+	lockdep_assert_held_exclusive(&mm->mmap_sem);
 
 	if (mm->hmm) {
 		if (kref_get_unless_zero(&mm->hmm->kref))
@@ -262,7 +262,7 @@ static const struct mmu_notifier_ops hmm_mmu_notifier_ops = {
  */
 int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm)
 {
-	lockdep_assert_held_exclusive(mm->mmap_sem);
+	lockdep_assert_held_exclusive(&mm->mmap_sem);
 
 	/* Sanity check */
 	if (!mm || !mirror || !mirror->ops)
@@ -987,7 +987,7 @@ long hmm_range_snapshot(struct hmm_range *range)
 	struct mm_walk mm_walk;
 
 	/* Caller must hold the mmap_sem, and range hold a reference on mm. */
-	lockdep_assert_held(hmm->mm->mmap_sem);
+	lockdep_assert_held(&hmm->mm->mmap_sem);
 	if (WARN_ON(!atomic_read(&hmm->mm->mm_users)))
 		return -EINVAL;
 
@@ -1086,7 +1086,7 @@ long hmm_range_fault(struct hmm_range *range, bool block)
 	int ret;
 
 	/* Caller must hold the mmap_sem, and range hold a reference on mm. */
-	lockdep_assert_held(hmm->mm->mmap_sem);
+	lockdep_assert_held(&hmm->mm->mmap_sem);
 	if (WARN_ON(!atomic_read(&hmm->mm->mm_users)))
 		return -EINVAL;
 



* Re: [RFC PATCH 05/11] mm/hmm: Improve locking around hmm->dead
  2019-05-23 15:34 ` [RFC PATCH 05/11] mm/hmm: Improve locking around hmm->dead Jason Gunthorpe
@ 2019-05-24 13:40   ` Jason Gunthorpe
  0 siblings, 0 replies; 45+ messages in thread
From: Jason Gunthorpe @ 2019-05-24 13:40 UTC (permalink / raw)
  To: linux-rdma, linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard

On Thu, May 23, 2019 at 12:34:30PM -0300, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> This value is being read without any locking, so it is just an unreliable
> hint, however in many cases we need to have certainty that code is not
> racing with mmput()/hmm_release().
> 
> For the two functions doing find_vma(), document that the caller is
> expected to hold mmap_sem and thus also have a mmget().
> 
> For hmm_range_register acquire a mmget internally as it must not race with
> hmm_release() when it sets valid.
> 
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
>  mm/hmm.c | 27 +++++++++++++++++++--------
>  1 file changed, 19 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index ec54be54d81135..d97ec293336ea5 100644
> +++ b/mm/hmm.c
> @@ -909,8 +909,10 @@ int hmm_range_register(struct hmm_range *range,
>  	range->start = start;
>  	range->end = end;
>  
> -	/* Check if hmm_mm_destroy() was call. */
> -	if (mirror->hmm->mm == NULL || mirror->hmm->dead)
> +	/*
> +	 * We cannot set range->value to true if hmm_release has already run.
> +	 */
> +	if (!mmget_not_zero(mirror->hmm->mm))
>  		return -EFAULT;
>  
>  	range->hmm = mirror->hmm;
> @@ -928,6 +930,7 @@ int hmm_range_register(struct hmm_range *range,
>  	if (!range->hmm->notifiers)
>  		range->valid = true;
>  	mutex_unlock(&range->hmm->lock);
> +	mmput(mirror->hmm->mm);

Hi Jerome, since you revised this patch to move the mmput to
hmm_range_unregister(), hmm_release() cannot run while a range
exists, and thus we can have this further simplification rolled into
this patch. Can you update your git? Thanks:

diff --git a/mm/hmm.c b/mm/hmm.c
index 2a08b78550b90d..ddd05f2ebe739a 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -128,17 +128,17 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 {
 	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
 	struct hmm_mirror *mirror;
-	struct hmm_range *range;
 
 	/* hmm is in progress to free */
 	if (!kref_get_unless_zero(&hmm->kref))
 		return;
 
-	/* Wake-up everyone waiting on any range. */
 	mutex_lock(&hmm->lock);
-	list_for_each_entry(range, &hmm->ranges, list)
-		range->valid = false;
-	wake_up_all(&hmm->wq);
+	/*
+	 * Since hmm_range_register() holds the mmget() lock hmm_release() is
+	 * prevented as long as a range exists.
+	 */
+	WARN_ON(!list_empty(&hmm->ranges));
 	mutex_unlock(&hmm->lock);
 
 	down_write(&hmm->mirrors_sem);
@@ -908,9 +908,7 @@ int hmm_range_register(struct hmm_range *range,
 	range->hmm = mm->hmm;
 	kref_get(&range->hmm->kref);
 
-	/*
-	 * We cannot set range->value to true if hmm_release has already run.
-	 */
+	/* Prevent hmm_release() from running while the range is valid */
 	if (!mmget_not_zero(mm))
 		return -EFAULT;
 


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 00/11] mm/hmm: Various revisions from a locking/code review
  2019-05-23 15:34 [RFC PATCH 00/11] mm/hmm: Various revisions from a locking/code review Jason Gunthorpe
                   ` (12 preceding siblings ...)
  2019-05-24 13:35 ` Jason Gunthorpe
@ 2019-05-24 14:36 ` Jason Gunthorpe
  2019-05-24 16:49   ` Jerome Glisse
  13 siblings, 1 reply; 45+ messages in thread
From: Jason Gunthorpe @ 2019-05-24 14:36 UTC (permalink / raw)
  To: linux-rdma, linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard

On Thu, May 23, 2019 at 12:34:25PM -0300, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> This patch series arised out of discussions with Jerome when looking at the
> ODP changes, particularly informed by use after free races we have already
> found and fixed in the ODP code (thanks to syzkaller) working with mmu
> notifiers, and the discussion with Ralph on how to resolve the lifetime model.

So the last big difference with ODP's flow is how 'range->valid'
works.

In ODP this was done using the rwsem umem->umem_rwsem which is
obtained for read in invalidate_start and released in invalidate_end.

Then any other threads that wish to only work on a umem which is not
undergoing invalidation will obtain the write side of the lock, and
within that lock's critical section the virtual address range is known
to not be invalidating.
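
In code form the ODP pattern is roughly the following (a hand-written
sketch for illustration only, not the actual ib_umem_odp code; the
struct and function names are made up):

 struct my_umem {
 	struct rw_semaphore umem_rwsem;
 	/* ... device mapping state ... */
 };

 static void my_invalidate_range_start(struct my_umem *umem)
 {
 	/* Held for read across the whole invalidation critical section */
 	down_read(&umem->umem_rwsem);
 	/* ... tear down the device mappings covering the range ... */
 }

 static void my_invalidate_range_end(struct my_umem *umem)
 {
 	up_read(&umem->umem_rwsem);
 }

 static int my_establish_device_mapping(struct my_umem *umem,
 				       unsigned long start, unsigned long end)
 {
 	/* Excludes any invalidate_start/end critical section */
 	down_write(&umem->umem_rwsem);
 	/* Safe here to read the CPU page tables and program the device */
 	up_write(&umem->umem_rwsem);
 	return 0;
 }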

I cannot understand how hmm gets to the same approach. It has
range->valid, but it is not locked by anything that I can see, so when
we test it in places like hmm_range_fault it seems useless..

Jerome, how does this work?

I have a feeling we should copy the approach from ODP and use an
actual lock here.

Jason


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 00/11] mm/hmm: Various revisions from a locking/code review
  2019-05-24 14:36 ` Jason Gunthorpe
@ 2019-05-24 16:49   ` Jerome Glisse
  2019-05-24 16:59     ` Jason Gunthorpe
  2019-05-24 17:47     ` Ralph Campbell
  0 siblings, 2 replies; 45+ messages in thread
From: Jerome Glisse @ 2019-05-24 16:49 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: linux-rdma, linux-mm, Ralph Campbell, John Hubbard

On Fri, May 24, 2019 at 11:36:49AM -0300, Jason Gunthorpe wrote:
> On Thu, May 23, 2019 at 12:34:25PM -0300, Jason Gunthorpe wrote:
> > From: Jason Gunthorpe <jgg@mellanox.com>
> > 
> > This patch series arised out of discussions with Jerome when looking at the
> > ODP changes, particularly informed by use after free races we have already
> > found and fixed in the ODP code (thanks to syzkaller) working with mmu
> > notifiers, and the discussion with Ralph on how to resolve the lifetime model.
> 
> So the last big difference with ODP's flow is how 'range->valid'
> works.
> 
> In ODP this was done using the rwsem umem->umem_rwsem which is
> obtained for read in invalidate_start and released in invalidate_end.
> 
> Then any other threads that wish to only work on a umem which is not
> undergoing invalidation will obtain the write side of the lock, and
> within that lock's critical section the virtual address range is known
> to not be invalidating.
> 
> I cannot understand how hmm gets to the same approach. It has
> range->valid, but it is not locked by anything that I can see, so when
> we test it in places like hmm_range_fault it seems useless..
> 
> Jerome, how does this work?
> 
> I have a feeling we should copy the approach from ODP and use an
> actual lock here.

range->valid is used as an early bail-out when an invalidation is
happening in hmm_range_fault(), to avoid doing useless work. The
synchronization is explained in the documentation:


Locking within the sync_cpu_device_pagetables() callback is the most important
aspect the driver must respect in order to keep things properly synchronized.
The usage pattern is::

 int driver_populate_range(...)
 {
      struct hmm_range range;
      ...

      range.start = ...;
      range.end = ...;
      range.pfns = ...;
      range.flags = ...;
      range.values = ...;
      range.pfn_shift = ...;
      hmm_range_register(&range);

      /*
       * Just wait for range to be valid, safe to ignore return value as we
       * will use the return value of hmm_range_snapshot() below under the
       * mmap_sem to ascertain the validity of the range.
       */
      hmm_range_wait_until_valid(&range, TIMEOUT_IN_MSEC);

 again:
      down_read(&mm->mmap_sem);
      ret = hmm_range_snapshot(&range);
      if (ret) {
          up_read(&mm->mmap_sem);
          if (ret == -EAGAIN) {
            /*
             * No need to check hmm_range_wait_until_valid() return value
             * on retry we will get proper error with hmm_range_snapshot()
             */
            hmm_range_wait_until_valid(&range, TIMEOUT_IN_MSEC);
            goto again;
          }
          hmm_range_unregister(&range);
          return ret;
      }
      take_lock(driver->update);
      if (!hmm_range_valid(&range)) {
          release_lock(driver->update);
          up_read(&mm->mmap_sem);
          goto again;
      }

      // Use pfns array content to update device page table

      hmm_range_unregister(&range);
      release_lock(driver->update);
      up_read(&mm->mmap_sem);
      return 0;
 }

The driver->update lock is the same lock that the driver takes inside its
sync_cpu_device_pagetables() callback. That lock must be held before calling
hmm_range_valid() to avoid any race with a concurrent CPU page table update.
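
A minimal sketch of what that callback can look like on the driver side
(the prototype is simplified, and my_driver / my_device_unmap_range are
made-up names; take_lock()/release_lock() are the same placeholders used
in the example above):

 static void driver_sync_cpu_device_pagetables(struct my_driver *drv,
 					      unsigned long start,
 					      unsigned long end)
 {
 	take_lock(drv->update);
 	/* Wipe the device page table entries covering [start, end) */
 	my_device_unmap_range(drv, start, end);
 	release_lock(drv->update);
 }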


Cheers,
Jérôme


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 00/11] mm/hmm: Various revisions from a locking/code review
  2019-05-24 16:49   ` Jerome Glisse
@ 2019-05-24 16:59     ` Jason Gunthorpe
  2019-05-24 17:01       ` Jerome Glisse
  2019-05-24 17:47     ` Ralph Campbell
  1 sibling, 1 reply; 45+ messages in thread
From: Jason Gunthorpe @ 2019-05-24 16:59 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: linux-rdma, linux-mm, Ralph Campbell, John Hubbard

On Fri, May 24, 2019 at 12:49:02PM -0400, Jerome Glisse wrote:
> On Fri, May 24, 2019 at 11:36:49AM -0300, Jason Gunthorpe wrote:
> > On Thu, May 23, 2019 at 12:34:25PM -0300, Jason Gunthorpe wrote:
> > > From: Jason Gunthorpe <jgg@mellanox.com>
> > > 
> > > This patch series arised out of discussions with Jerome when looking at the
> > > ODP changes, particularly informed by use after free races we have already
> > > found and fixed in the ODP code (thanks to syzkaller) working with mmu
> > > notifiers, and the discussion with Ralph on how to resolve the lifetime model.
> > 
> > So the last big difference with ODP's flow is how 'range->valid'
> > works.
> > 
> > In ODP this was done using the rwsem umem->umem_rwsem which is
> > obtained for read in invalidate_start and released in invalidate_end.
> > 
> > Then any other threads that wish to only work on a umem which is not
> > undergoing invalidation will obtain the write side of the lock, and
> > within that lock's critical section the virtual address range is known
> > to not be invalidating.
> > 
> > I cannot understand how hmm gets to the same approach. It has
> > range->valid, but it is not locked by anything that I can see, so when
> > we test it in places like hmm_range_fault it seems useless..
> > 
> > Jerome, how does this work?
> > 
> > I have a feeling we should copy the approach from ODP and use an
> > actual lock here.
> 
> range->valid is use as bail early if invalidation is happening in
> hmm_range_fault() to avoid doing useless work. The synchronization
> is explained in the documentation:

That just says the hmm APIs handle locking. I asked how the APIs
implement that locking internally.

Are you trying to say that if I do this, hmm will still work completely
correctly?

diff --git a/mm/hmm.c b/mm/hmm.c
index 8396a65710e304..42977744855d26 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -981,8 +981,8 @@ long hmm_range_snapshot(struct hmm_range *range)
 
 	do {
 		/* If range is no longer valid force retry. */
-		if (!range->valid)
-			return -EAGAIN;
+/*		if (!range->valid)
+			return -EAGAIN;*/
 
 		vma = find_vma(hmm->mm, start);
 		if (vma == NULL || (vma->vm_flags & device_vma))
@@ -1080,10 +1080,10 @@ long hmm_range_fault(struct hmm_range *range, bool block)
 
 	do {
 		/* If range is no longer valid force retry. */
-		if (!range->valid) {
+/*		if (!range->valid) {
 			up_read(&hmm->mm->mmap_sem);
 			return -EAGAIN;
-		}
+		}*/
 
 		vma = find_vma(hmm->mm, start);
 		if (vma == NULL || (vma->vm_flags & device_vma))
@@ -1134,7 +1134,7 @@ long hmm_range_fault(struct hmm_range *range, bool block)
 			start = hmm_vma_walk.last;
 
 			/* Keep trying while the range is valid. */
-		} while (ret == -EBUSY && range->valid);
+		} while (ret == -EBUSY /*&& range->valid*/);
 
 		if (ret) {
 			unsigned long i;


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 00/11] mm/hmm: Various revisions from a locking/code review
  2019-05-24 16:59     ` Jason Gunthorpe
@ 2019-05-24 17:01       ` Jerome Glisse
  2019-05-24 17:52         ` Jason Gunthorpe
  0 siblings, 1 reply; 45+ messages in thread
From: Jerome Glisse @ 2019-05-24 17:01 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: linux-rdma, linux-mm, Ralph Campbell, John Hubbard

On Fri, May 24, 2019 at 01:59:31PM -0300, Jason Gunthorpe wrote:
> On Fri, May 24, 2019 at 12:49:02PM -0400, Jerome Glisse wrote:
> > On Fri, May 24, 2019 at 11:36:49AM -0300, Jason Gunthorpe wrote:
> > > On Thu, May 23, 2019 at 12:34:25PM -0300, Jason Gunthorpe wrote:
> > > > From: Jason Gunthorpe <jgg@mellanox.com>
> > > > 
> > > > This patch series arised out of discussions with Jerome when looking at the
> > > > ODP changes, particularly informed by use after free races we have already
> > > > found and fixed in the ODP code (thanks to syzkaller) working with mmu
> > > > notifiers, and the discussion with Ralph on how to resolve the lifetime model.
> > > 
> > > So the last big difference with ODP's flow is how 'range->valid'
> > > works.
> > > 
> > > In ODP this was done using the rwsem umem->umem_rwsem which is
> > > obtained for read in invalidate_start and released in invalidate_end.
> > > 
> > > Then any other threads that wish to only work on a umem which is not
> > > undergoing invalidation will obtain the write side of the lock, and
> > > within that lock's critical section the virtual address range is known
> > > to not be invalidating.
> > > 
> > > I cannot understand how hmm gets to the same approach. It has
> > > range->valid, but it is not locked by anything that I can see, so when
> > > we test it in places like hmm_range_fault it seems useless..
> > > 
> > > Jerome, how does this work?
> > > 
> > > I have a feeling we should copy the approach from ODP and use an
> > > actual lock here.
> > 
> > range->valid is use as bail early if invalidation is happening in
> > hmm_range_fault() to avoid doing useless work. The synchronization
> > is explained in the documentation:
> 
> That just says the hmm APIs handle locking. I asked how the apis
> implement that locking internally.
> 
> Are you trying to say that if I do this, hmm will still work completely
> correctly?

Yes, it will keep working correctly. You would just be doing potentially
useless work.

> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 8396a65710e304..42977744855d26 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -981,8 +981,8 @@ long hmm_range_snapshot(struct hmm_range *range)
>  
>  	do {
>  		/* If range is no longer valid force retry. */
> -		if (!range->valid)
> -			return -EAGAIN;
> +/*		if (!range->valid)
> +			return -EAGAIN;*/
>  
>  		vma = find_vma(hmm->mm, start);
>  		if (vma == NULL || (vma->vm_flags & device_vma))
> @@ -1080,10 +1080,10 @@ long hmm_range_fault(struct hmm_range *range, bool block)
>  
>  	do {
>  		/* If range is no longer valid force retry. */
> -		if (!range->valid) {
> +/*		if (!range->valid) {
>  			up_read(&hmm->mm->mmap_sem);
>  			return -EAGAIN;
> -		}
> +		}*/
>  
>  		vma = find_vma(hmm->mm, start);
>  		if (vma == NULL || (vma->vm_flags & device_vma))
> @@ -1134,7 +1134,7 @@ long hmm_range_fault(struct hmm_range *range, bool block)
>  			start = hmm_vma_walk.last;
>  
>  			/* Keep trying while the range is valid. */
> -		} while (ret == -EBUSY && range->valid);
> +		} while (ret == -EBUSY /*&& range->valid*/);
>  
>  		if (ret) {
>  			unsigned long i;


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 04/11] mm/hmm: Simplify hmm_get_or_create and make it reliable
  2019-05-24  1:23     ` Jason Gunthorpe
@ 2019-05-24 17:06       ` Ralph Campbell
  0 siblings, 0 replies; 45+ messages in thread
From: Ralph Campbell @ 2019-05-24 17:06 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: linux-rdma, linux-mm, Jerome Glisse, John Hubbard


On 5/23/19 6:23 PM, Jason Gunthorpe wrote:
> On Thu, May 23, 2019 at 04:38:28PM -0700, Ralph Campbell wrote:
>>
>> On 5/23/19 8:34 AM, Jason Gunthorpe wrote:
>>> From: Jason Gunthorpe <jgg@mellanox.com>
>>>
>>> As coded this function can false-fail in various racy situations. Make it
>>> reliable by running only under the write side of the mmap_sem and avoiding
>>> the false-failing compare/exchange pattern.
>>>
>>> Also make the locking very easy to understand by only ever reading or
>>> writing mm->hmm while holding the write side of the mmap_sem.
>>>
>>> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
>>>    mm/hmm.c | 75 ++++++++++++++++++++------------------------------------
>>>    1 file changed, 27 insertions(+), 48 deletions(-)
>>>
>>> diff --git a/mm/hmm.c b/mm/hmm.c
>>> index e27058e92508b9..ec54be54d81135 100644
>>> +++ b/mm/hmm.c
>>> @@ -40,16 +40,6 @@
>>>    #if IS_ENABLED(CONFIG_HMM_MIRROR)
>>>    static const struct mmu_notifier_ops hmm_mmu_notifier_ops;
>>> -static inline struct hmm *mm_get_hmm(struct mm_struct *mm)
>>> -{
>>> -	struct hmm *hmm = READ_ONCE(mm->hmm);
>>> -
>>> -	if (hmm && kref_get_unless_zero(&hmm->kref))
>>> -		return hmm;
>>> -
>>> -	return NULL;
>>> -}
>>> -
>>>    /**
>>>     * hmm_get_or_create - register HMM against an mm (HMM internal)
>>>     *
>>> @@ -64,11 +54,20 @@ static inline struct hmm *mm_get_hmm(struct mm_struct *mm)
>>>     */
>>>    static struct hmm *hmm_get_or_create(struct mm_struct *mm)
>>>    {
>>> -	struct hmm *hmm = mm_get_hmm(mm);
>>> -	bool cleanup = false;
>>> +	struct hmm *hmm;
>>> -	if (hmm)
>>> -		return hmm;
>>> +	lockdep_assert_held_exclusive(mm->mmap_sem);
>>> +
>>> +	if (mm->hmm) {
>>> +		if (kref_get_unless_zero(&mm->hmm->kref))
>>> +			return mm->hmm;
>>> +		/*
>>> +		 * The hmm is being freed by some other CPU and is pending a
>>> +		 * RCU grace period, but this CPU can NULL now it since we
>>> +		 * have the mmap_sem.
>>> +		 */
>>> +		mm->hmm = NULL;
>>
>> Shouldn't there be a "return NULL;" here so it doesn't fall through and
>> allocate a struct hmm below?
> 
> No, this function should only return NULL on memory allocation
> failure.
> 
> In this case another thread is busy freeing the hmm but wasn't able to
> update mm->hmm to null due to a locking constraint. So we make it null
> on behalf of the other thread and allocate a fresh new hmm that is
> valid. The freeing thread will complete the free and do nothing with
> mm->hmm.
> 
>>>    static void hmm_fee_rcu(struct rcu_head *rcu)
>>
>> I see Jerome already saw and named this hmm_free_rcu()
>> which I agree with.
> 
> I do love my typos :)
> 
>>>    {
>>> +	struct hmm *hmm = container_of(rcu, struct hmm, rcu);
>>> +
>>> +	down_write(&hmm->mm->mmap_sem);
>>> +	if (hmm->mm->hmm == hmm)
>>> +		hmm->mm->hmm = NULL;
>>> +	up_write(&hmm->mm->mmap_sem);
>>> +	mmdrop(hmm->mm);
>>> +
>>>    	kfree(container_of(rcu, struct hmm, rcu));
>>>    }
>>>    static void hmm_free(struct kref *kref)
>>>    {
>>>    	struct hmm *hmm = container_of(kref, struct hmm, kref);
>>> -	struct mm_struct *mm = hmm->mm;
>>> -
>>> -	mmu_notifier_unregister_no_release(&hmm->mmu_notifier, mm);
>>> -	spin_lock(&mm->page_table_lock);
>>> -	if (mm->hmm == hmm)
>>> -		mm->hmm = NULL;
>>> -	spin_unlock(&mm->page_table_lock);
>>> -
>>> -	mmdrop(hmm->mm);
>>> +	mmu_notifier_unregister_no_release(&hmm->mmu_notifier, hmm->mm);
>>>    	mmu_notifier_call_srcu(&hmm->rcu, hmm_fee_rcu);
>>>    }
>>>
>>
>> This email message is for the sole use of the intended recipient(s) and may contain
>> confidential information.  Any unauthorized review, use, disclosure or distribution
>> is prohibited.  If you are not the intended recipient, please contact the sender by
>> reply email and destroy all copies of the original message.
> 
> Ah, you should not send this trailer to the public mailing lists.
> 
> Thanks,
> Jason
> 

Sorry, I have to apply the "magic" header to suppress this each time I
send email to a public list.


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 00/11] mm/hmm: Various revisions from a locking/code review
  2019-05-24 16:49   ` Jerome Glisse
  2019-05-24 16:59     ` Jason Gunthorpe
@ 2019-05-24 17:47     ` Ralph Campbell
  2019-05-24 17:51       ` Jerome Glisse
  1 sibling, 1 reply; 45+ messages in thread
From: Ralph Campbell @ 2019-05-24 17:47 UTC (permalink / raw)
  To: Jerome Glisse, Jason Gunthorpe; +Cc: linux-rdma, linux-mm, John Hubbard


On 5/24/19 9:49 AM, Jerome Glisse wrote:
> On Fri, May 24, 2019 at 11:36:49AM -0300, Jason Gunthorpe wrote:
>> On Thu, May 23, 2019 at 12:34:25PM -0300, Jason Gunthorpe wrote:
>>> From: Jason Gunthorpe <jgg@mellanox.com>
>>>
>>> This patch series arised out of discussions with Jerome when looking at the
>>> ODP changes, particularly informed by use after free races we have already
>>> found and fixed in the ODP code (thanks to syzkaller) working with mmu
>>> notifiers, and the discussion with Ralph on how to resolve the lifetime model.
>>
>> So the last big difference with ODP's flow is how 'range->valid'
>> works.
>>
>> In ODP this was done using the rwsem umem->umem_rwsem which is
>> obtained for read in invalidate_start and released in invalidate_end.
>>
>> Then any other threads that wish to only work on a umem which is not
>> undergoing invalidation will obtain the write side of the lock, and
>> within that lock's critical section the virtual address range is known
>> to not be invalidating.
>>
>> I cannot understand how hmm gets to the same approach. It has
>> range->valid, but it is not locked by anything that I can see, so when
>> we test it in places like hmm_range_fault it seems useless..
>>
>> Jerome, how does this work?
>>
>> I have a feeling we should copy the approach from ODP and use an
>> actual lock here.
> 
> range->valid is use as bail early if invalidation is happening in
> hmm_range_fault() to avoid doing useless work. The synchronization
> is explained in the documentation:
> 
> 
> Locking within the sync_cpu_device_pagetables() callback is the most important
> aspect the driver must respect in order to keep things properly synchronized.
> The usage pattern is::
> 
>   int driver_populate_range(...)
>   {
>        struct hmm_range range;
>        ...
> 
>        range.start = ...;
>        range.end = ...;
>        range.pfns = ...;
>        range.flags = ...;
>        range.values = ...;
>        range.pfn_shift = ...;
>        hmm_range_register(&range);
> 
>        /*
>         * Just wait for range to be valid, safe to ignore return value as we
>         * will use the return value of hmm_range_snapshot() below under the
>         * mmap_sem to ascertain the validity of the range.
>         */
>        hmm_range_wait_until_valid(&range, TIMEOUT_IN_MSEC);
> 
>   again:
>        down_read(&mm->mmap_sem);
>        ret = hmm_range_snapshot(&range);
>        if (ret) {
>            up_read(&mm->mmap_sem);
>            if (ret == -EAGAIN) {
>              /*
>               * No need to check hmm_range_wait_until_valid() return value
>               * on retry we will get proper error with hmm_range_snapshot()
>               */
>              hmm_range_wait_until_valid(&range, TIMEOUT_IN_MSEC);
>              goto again;
>            }
>            hmm_range_unregister(&range);
>            return ret;
>        }
>        take_lock(driver->update);
>        if (!hmm_range_valid(&range)) {
>            release_lock(driver->update);
>            up_read(&mm->mmap_sem);
>            goto again;
>        }
> 
>        // Use pfns array content to update device page table
> 
>        hmm_range_unregister(&range);
>        release_lock(driver->update);
>        up_read(&mm->mmap_sem);
>        return 0;
>   }
> 
> The driver->update lock is the same lock that the driver takes inside its
> sync_cpu_device_pagetables() callback. That lock must be held before calling
> hmm_range_valid() to avoid any race with a concurrent CPU page table update.
> 
> 
> Cheers,
> Jérôme


Given the above, the following patch looks necessary to me.
Also, looking at drivers/gpu/drm/nouveau/nouveau_svm.c, it
doesn't check the return value to avoid calling up_read(&mm->mmap_sem).
Besides, it's better to keep the mmap_sem lock/unlock in the caller.

diff --git a/mm/hmm.c b/mm/hmm.c
index 836adf613f81..8b6ef97a8d71 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -1092,10 +1092,8 @@ long hmm_range_fault(struct hmm_range *range, bool block)

  	do {
  		/* If range is no longer valid force retry. */
-		if (!range->valid) {
-			up_read(&hmm->mm->mmap_sem);
+		if (!range->valid)
  			return -EAGAIN;
-		}

  		vma = find_vma(hmm->mm, start);
  		if (vma == NULL || (vma->vm_flags & device_vma))


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 00/11] mm/hmm: Various revisions from a locking/code review
  2019-05-24 17:47     ` Ralph Campbell
@ 2019-05-24 17:51       ` Jerome Glisse
  0 siblings, 0 replies; 45+ messages in thread
From: Jerome Glisse @ 2019-05-24 17:51 UTC (permalink / raw)
  To: Ralph Campbell; +Cc: Jason Gunthorpe, linux-rdma, linux-mm, John Hubbard

On Fri, May 24, 2019 at 10:47:16AM -0700, Ralph Campbell wrote:
> 
> On 5/24/19 9:49 AM, Jerome Glisse wrote:
> > On Fri, May 24, 2019 at 11:36:49AM -0300, Jason Gunthorpe wrote:
> > > On Thu, May 23, 2019 at 12:34:25PM -0300, Jason Gunthorpe wrote:
> > > > From: Jason Gunthorpe <jgg@mellanox.com>
> > > > 
> > > > This patch series arised out of discussions with Jerome when looking at the
> > > > ODP changes, particularly informed by use after free races we have already
> > > > found and fixed in the ODP code (thanks to syzkaller) working with mmu
> > > > notifiers, and the discussion with Ralph on how to resolve the lifetime model.
> > > 
> > > So the last big difference with ODP's flow is how 'range->valid'
> > > works.
> > > 
> > > In ODP this was done using the rwsem umem->umem_rwsem which is
> > > obtained for read in invalidate_start and released in invalidate_end.
> > > 
> > > Then any other threads that wish to only work on a umem which is not
> > > undergoing invalidation will obtain the write side of the lock, and
> > > within that lock's critical section the virtual address range is known
> > > to not be invalidating.
> > > 
> > > I cannot understand how hmm gets to the same approach. It has
> > > range->valid, but it is not locked by anything that I can see, so when
> > > we test it in places like hmm_range_fault it seems useless..
> > > 
> > > Jerome, how does this work?
> > > 
> > > I have a feeling we should copy the approach from ODP and use an
> > > actual lock here.
> > 
> > range->valid is use as bail early if invalidation is happening in
> > hmm_range_fault() to avoid doing useless work. The synchronization
> > is explained in the documentation:
> > 
> > 
> > Locking within the sync_cpu_device_pagetables() callback is the most important
> > aspect the driver must respect in order to keep things properly synchronized.
> > The usage pattern is::
> > 
> >   int driver_populate_range(...)
> >   {
> >        struct hmm_range range;
> >        ...
> > 
> >        range.start = ...;
> >        range.end = ...;
> >        range.pfns = ...;
> >        range.flags = ...;
> >        range.values = ...;
> >        range.pfn_shift = ...;
> >        hmm_range_register(&range);
> > 
> >        /*
> >         * Just wait for range to be valid, safe to ignore return value as we
> >         * will use the return value of hmm_range_snapshot() below under the
> >         * mmap_sem to ascertain the validity of the range.
> >         */
> >        hmm_range_wait_until_valid(&range, TIMEOUT_IN_MSEC);
> > 
> >   again:
> >        down_read(&mm->mmap_sem);
> >        ret = hmm_range_snapshot(&range);
> >        if (ret) {
> >            up_read(&mm->mmap_sem);
> >            if (ret == -EAGAIN) {
> >              /*
> >               * No need to check hmm_range_wait_until_valid() return value
> >               * on retry we will get proper error with hmm_range_snapshot()
> >               */
> >              hmm_range_wait_until_valid(&range, TIMEOUT_IN_MSEC);
> >              goto again;
> >            }
> >            hmm_range_unregister(&range);
> >            return ret;
> >        }
> >        take_lock(driver->update);
> >        if (!hmm_range_valid(&range)) {
> >            release_lock(driver->update);
> >            up_read(&mm->mmap_sem);
> >            goto again;
> >        }
> > 
> >        // Use pfns array content to update device page table
> > 
> >        hmm_range_unregister(&range);
> >        release_lock(driver->update);
> >        up_read(&mm->mmap_sem);
> >        return 0;
> >   }
> > 
> > The driver->update lock is the same lock that the driver takes inside its
> > sync_cpu_device_pagetables() callback. That lock must be held before calling
> > hmm_range_valid() to avoid any race with a concurrent CPU page table update.
> > 
> > 
> > Cheers,
> > Jérôme
> 
> 
> Given the above, the following patch looks necessary to me.
> Also, looking at drivers/gpu/drm/nouveau/nouveau_svm.c, it
> doesn't check the return value to avoid calling up_read(&mm->mmap_sem).
> Besides, it's better to keep the mmap_sem lock/unlock in the caller.

No, nouveau uses the old API, so check hmm_vma_fault() within hmm.h. I
have a patch to convert it to the new API for 5.3.


> 
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 836adf613f81..8b6ef97a8d71 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -1092,10 +1092,8 @@ long hmm_range_fault(struct hmm_range *range, bool
> block)
> 
>  	do {
>  		/* If range is no longer valid force retry. */
> -		if (!range->valid) {
> -			up_read(&hmm->mm->mmap_sem);
> +		if (!range->valid)
>  			return -EAGAIN;
> -		}
> 
>  		vma = find_vma(hmm->mm, start);
>  		if (vma == NULL || (vma->vm_flags & device_vma))
> 


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 00/11] mm/hmm: Various revisions from a locking/code review
  2019-05-24 17:01       ` Jerome Glisse
@ 2019-05-24 17:52         ` Jason Gunthorpe
  2019-05-24 18:03           ` Jerome Glisse
  0 siblings, 1 reply; 45+ messages in thread
From: Jason Gunthorpe @ 2019-05-24 17:52 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: linux-rdma, linux-mm, Ralph Campbell, John Hubbard

On Fri, May 24, 2019 at 01:01:49PM -0400, Jerome Glisse wrote:
> On Fri, May 24, 2019 at 01:59:31PM -0300, Jason Gunthorpe wrote:
> > On Fri, May 24, 2019 at 12:49:02PM -0400, Jerome Glisse wrote:
> > > On Fri, May 24, 2019 at 11:36:49AM -0300, Jason Gunthorpe wrote:
> > > > On Thu, May 23, 2019 at 12:34:25PM -0300, Jason Gunthorpe wrote:
> > > > > From: Jason Gunthorpe <jgg@mellanox.com>
> > > > > 
> > > > > This patch series arised out of discussions with Jerome when looking at the
> > > > > ODP changes, particularly informed by use after free races we have already
> > > > > found and fixed in the ODP code (thanks to syzkaller) working with mmu
> > > > > notifiers, and the discussion with Ralph on how to resolve the lifetime model.
> > > > 
> > > > So the last big difference with ODP's flow is how 'range->valid'
> > > > works.
> > > > 
> > > > In ODP this was done using the rwsem umem->umem_rwsem which is
> > > > obtained for read in invalidate_start and released in invalidate_end.
> > > > 
> > > > Then any other threads that wish to only work on a umem which is not
> > > > undergoing invalidation will obtain the write side of the lock, and
> > > > within that lock's critical section the virtual address range is known
> > > > to not be invalidating.
> > > > 
> > > > I cannot understand how hmm gets to the same approach. It has
> > > > range->valid, but it is not locked by anything that I can see, so when
> > > > we test it in places like hmm_range_fault it seems useless..
> > > > 
> > > > Jerome, how does this work?
> > > > 
> > > > I have a feeling we should copy the approach from ODP and use an
> > > > actual lock here.
> > > 
> > > range->valid is use as bail early if invalidation is happening in
> > > hmm_range_fault() to avoid doing useless work. The synchronization
> > > is explained in the documentation:
> > 
> > That just says the hmm APIs handle locking. I asked how the apis
> > implement that locking internally.
> > 
> > Are you trying to say that if I do this, hmm will still work completely
> > correctly?
> 
> Yes it will keep working correctly. You would just be doing potentialy
> useless work.

I don't see how it works correctly.

Apply the comment out patch I showed and this trivially happens:

      CPU0                                               CPU1
  hmm_invalidate_start()
    ops->sync_cpu_device_pagetables()
      device_lock()
       // Wipe out page tables in device, enable faulting
      device_unlock()

						     DEVICE PAGE FAULT
						       device_lock()
						       hmm_range_register()
                                                       hmm_range_dma_map()
						       device_unlock()
  hmm_invalidate_end()

The mmu notifier spec says:

 	 * Invalidation of multiple concurrent ranges may be
	 * optionally permitted by the driver. Either way the
	 * establishment of sptes is forbidden in the range passed to
	 * invalidate_range_begin/end for the whole duration of the
	 * invalidate_range_begin/end critical section.

And I understand "establishment of sptes is forbidden" means
"hmm_range_dma_map() must fail with EAGAIN".

This is why ODP uses an actual lock held across the critical region
which completely prohibits reading the CPU pages tables, or
establishing new mappings.

So, I still think we need a true lock, not a 'maybe valid' flag.

Jason


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 00/11] mm/hmm: Various revisions from a locking/code review
  2019-05-24 17:52         ` Jason Gunthorpe
@ 2019-05-24 18:03           ` Jerome Glisse
  2019-05-24 18:32             ` Jason Gunthorpe
  0 siblings, 1 reply; 45+ messages in thread
From: Jerome Glisse @ 2019-05-24 18:03 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: linux-rdma, linux-mm, Ralph Campbell, John Hubbard

On Fri, May 24, 2019 at 02:52:03PM -0300, Jason Gunthorpe wrote:
> On Fri, May 24, 2019 at 01:01:49PM -0400, Jerome Glisse wrote:
> > On Fri, May 24, 2019 at 01:59:31PM -0300, Jason Gunthorpe wrote:
> > > On Fri, May 24, 2019 at 12:49:02PM -0400, Jerome Glisse wrote:
> > > > On Fri, May 24, 2019 at 11:36:49AM -0300, Jason Gunthorpe wrote:
> > > > > On Thu, May 23, 2019 at 12:34:25PM -0300, Jason Gunthorpe wrote:
> > > > > > From: Jason Gunthorpe <jgg@mellanox.com>
> > > > > > 
> > > > > > This patch series arised out of discussions with Jerome when looking at the
> > > > > > ODP changes, particularly informed by use after free races we have already
> > > > > > found and fixed in the ODP code (thanks to syzkaller) working with mmu
> > > > > > notifiers, and the discussion with Ralph on how to resolve the lifetime model.
> > > > > 
> > > > > So the last big difference with ODP's flow is how 'range->valid'
> > > > > works.
> > > > > 
> > > > > In ODP this was done using the rwsem umem->umem_rwsem which is
> > > > > obtained for read in invalidate_start and released in invalidate_end.
> > > > > 
> > > > > Then any other threads that wish to only work on a umem which is not
> > > > > undergoing invalidation will obtain the write side of the lock, and
> > > > > within that lock's critical section the virtual address range is known
> > > > > to not be invalidating.
> > > > > 
> > > > > I cannot understand how hmm gets to the same approach. It has
> > > > > range->valid, but it is not locked by anything that I can see, so when
> > > > > we test it in places like hmm_range_fault it seems useless..
> > > > > 
> > > > > Jerome, how does this work?
> > > > > 
> > > > > I have a feeling we should copy the approach from ODP and use an
> > > > > actual lock here.
> > > > 
> > > > range->valid is use as bail early if invalidation is happening in
> > > > hmm_range_fault() to avoid doing useless work. The synchronization
> > > > is explained in the documentation:
> > > 
> > > That just says the hmm APIs handle locking. I asked how the apis
> > > implement that locking internally.
> > > 
> > > Are you trying to say that if I do this, hmm will still work completely
> > > correctly?
> > 
> > Yes it will keep working correctly. You would just be doing potentialy
> > useless work.
> 
> I don't see how it works correctly.
> 
> Apply the comment out patch I showed and this trivially happens:
> 
>       CPU0                                               CPU1
>   hmm_invalidate_start()
>     ops->sync_cpu_device_pagetables()
>       device_lock()
>        // Wipe out page tables in device, enable faulting
>       device_unlock()
> 
>                                                        DEVICE PAGE FAULT
>                                                        device_lock()
>                                                        hmm_range_register()
>                                                        hmm_range_dma_map()
>                                                        device_unlock()
>   hmm_invalidate_end()

No, in the above scenario hmm_range_register() will not mark the range
as valid, thus the driver will bail out after taking its lock and checking
the range->valid value.

> 
> The mmu notifier spec says:
> 
>  	 * Invalidation of multiple concurrent ranges may be
> 	 * optionally permitted by the driver. Either way the
> 	 * establishment of sptes is forbidden in the range passed to
> 	 * invalidate_range_begin/end for the whole duration of the
> 	 * invalidate_range_begin/end critical section.
> 
> And I understand "establishment of sptes is forbidden" means
> "hmm_range_dmap_map() must fail with EAGAIN". 

No, it means that a secondary page table entry (SPTE) must not materialize,
thus what hmm_range_dma_map() is doing is fine and safe as long as the
driver does not use the result to populate the device page table if there
was an invalidation for the range.

> 
> This is why ODP uses an actual lock held across the critical region
> which completely prohibits reading the CPU pages tables, or
> establishing new mappings.
> 
> So, I still think we need a true lock, not a 'maybe valid' flag.

The rationale in HMM is to never block mm so that mm can always make
progress, as whatever mm is doing will take precedence, and thus it
would be useless to block mm while we do something if what we are
doing is about to become invalid.

Cheers,
Jérôme


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 00/11] mm/hmm: Various revisions from a locking/code review
  2019-05-24 18:03           ` Jerome Glisse
@ 2019-05-24 18:32             ` Jason Gunthorpe
  2019-05-24 18:46               ` Jerome Glisse
  0 siblings, 1 reply; 45+ messages in thread
From: Jason Gunthorpe @ 2019-05-24 18:32 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: linux-rdma, linux-mm, Ralph Campbell, John Hubbard

On Fri, May 24, 2019 at 02:03:22PM -0400, Jerome Glisse wrote:
> On Fri, May 24, 2019 at 02:52:03PM -0300, Jason Gunthorpe wrote:
> > On Fri, May 24, 2019 at 01:01:49PM -0400, Jerome Glisse wrote:
> > > On Fri, May 24, 2019 at 01:59:31PM -0300, Jason Gunthorpe wrote:
> > > > On Fri, May 24, 2019 at 12:49:02PM -0400, Jerome Glisse wrote:
> > > > > On Fri, May 24, 2019 at 11:36:49AM -0300, Jason Gunthorpe wrote:
> > > > > > On Thu, May 23, 2019 at 12:34:25PM -0300, Jason Gunthorpe wrote:
> > > > > > > From: Jason Gunthorpe <jgg@mellanox.com>
> > > > > > > 
> > > > > > > This patch series arised out of discussions with Jerome when looking at the
> > > > > > > ODP changes, particularly informed by use after free races we have already
> > > > > > > found and fixed in the ODP code (thanks to syzkaller) working with mmu
> > > > > > > notifiers, and the discussion with Ralph on how to resolve the lifetime model.
> > > > > > 
> > > > > > So the last big difference with ODP's flow is how 'range->valid'
> > > > > > works.
> > > > > > 
> > > > > > In ODP this was done using the rwsem umem->umem_rwsem which is
> > > > > > obtained for read in invalidate_start and released in invalidate_end.
> > > > > > 
> > > > > > Then any other threads that wish to only work on a umem which is not
> > > > > > undergoing invalidation will obtain the write side of the lock, and
> > > > > > within that lock's critical section the virtual address range is known
> > > > > > to not be invalidating.
> > > > > > 
> > > > > > I cannot understand how hmm gets to the same approach. It has
> > > > > > range->valid, but it is not locked by anything that I can see, so when
> > > > > > we test it in places like hmm_range_fault it seems useless..
> > > > > > 
> > > > > > Jerome, how does this work?
> > > > > > 
> > > > > > I have a feeling we should copy the approach from ODP and use an
> > > > > > actual lock here.
> > > > > 
> > > > > range->valid is use as bail early if invalidation is happening in
> > > > > hmm_range_fault() to avoid doing useless work. The synchronization
> > > > > is explained in the documentation:
> > > > 
> > > > That just says the hmm APIs handle locking. I asked how the apis
> > > > implement that locking internally.
> > > > 
> > > > Are you trying to say that if I do this, hmm will still work completely
> > > > correctly?
> > > 
> > > Yes it will keep working correctly. You would just be doing potentialy
> > > useless work.
> > 
> > I don't see how it works correctly.
> > 
> > Apply the comment out patch I showed and this trivially happens:
> > 
> >       CPU0                                               CPU1
> >   hmm_invalidate_start()
> >     ops->sync_cpu_device_pagetables()
> >       device_lock()
> >        // Wipe out page tables in device, enable faulting
> >       device_unlock()
> > 
> >                                                        DEVICE PAGE FAULT
> >                                                        device_lock()
> >                                                        hmm_range_register()
> >                                                        hmm_range_dma_map()
> >                                                        device_unlock()
> >   hmm_invalidate_end()
> 
> No in the above scenario hmm_range_register() will not mark the range
> as valid thus the driver will bailout after taking its lock and checking
> the range->valid value.

I see your confusion, I only asked about removing valid from hmm.c,
not the unlocked use of valid in your hmm.rst example. My mistake,
sorry for being unclear.

Here is the big 3 CPU ladder diagram that shows how 'valid' does not
work:

       CPU0                                               CPU1                                          CPU2
                                                        DEVICE PAGE FAULT
                                                        range = hmm_range_register()

   // Overlaps with range
   hmm_invalidate_start()
     range->valid = false
     ops->sync_cpu_device_pagetables()
       take_lock(driver->update);
        // Wipe out page tables in device, enable faulting
       release_lock(driver->update);
												    // Does not overlap with range
												    hmm_invalidate_start()
												    hmm_invalidate_end()
													list_for_each
													    range->valid =  true


                                                        device_lock()
							// Note range->valid = true now
							hmm_range_snapshot(&range);
							take_lock(driver->update);
							if (!hmm_range_valid(&range))
							    goto again
							ESTABLISH SPTES
                                                        device_unlock()
   hmm_invalidate_end()

And I can make this more complicated (i.e. overlapping parallel
invalidates, etc.) and show that any 'bool' valid cannot work.

> > The mmu notifier spec says:
> > 
> >  	 * Invalidation of multiple concurrent ranges may be
> > 	 * optionally permitted by the driver. Either way the
> > 	 * establishment of sptes is forbidden in the range passed to
> > 	 * invalidate_range_begin/end for the whole duration of the
> > 	 * invalidate_range_begin/end critical section.
> > 
> > And I understand "establishment of sptes is forbidden" means
> > "hmm_range_dmap_map() must fail with EAGAIN". 
> 
> No it means that secondary page table entry (SPTE) must not
> materialize thus what hmm_range_dmap_map() is doing if fine and safe
> as long as the driver do not use the result to populate the device
> page table if there was an invalidation for the range.

Okay, so we agree: if there is an invalidate_start/end critical region
then it is OK to *call* hmm_range_dma_map(); however, the driver must
not *use* the result, and you are expecting this bit:

      take_lock(driver->update);
      if (!hmm_range_valid(&range)) {
         goto again

In your hmm.rst to prevent the pfns from being used by the driver?

I think the above ladder shows that hmm_range_valid can return true
during an invalidate_start/end critical region, so this is a problem.

I still think the best solution is to move device_lock() into mirror
and have hmm manage it for the driver as ODP does. It is certainly the
simplest solution to understand.

Jason


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 00/11] mm/hmm: Various revisions from a locking/code review
  2019-05-24 18:32             ` Jason Gunthorpe
@ 2019-05-24 18:46               ` Jerome Glisse
  2019-05-24 22:09                 ` Jason Gunthorpe
  0 siblings, 1 reply; 45+ messages in thread
From: Jerome Glisse @ 2019-05-24 18:46 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: linux-rdma, linux-mm, Ralph Campbell, John Hubbard

On Fri, May 24, 2019 at 03:32:25PM -0300, Jason Gunthorpe wrote:
> On Fri, May 24, 2019 at 02:03:22PM -0400, Jerome Glisse wrote:
> > On Fri, May 24, 2019 at 02:52:03PM -0300, Jason Gunthorpe wrote:
> > > On Fri, May 24, 2019 at 01:01:49PM -0400, Jerome Glisse wrote:
> > > > On Fri, May 24, 2019 at 01:59:31PM -0300, Jason Gunthorpe wrote:
> > > > > On Fri, May 24, 2019 at 12:49:02PM -0400, Jerome Glisse wrote:
> > > > > > On Fri, May 24, 2019 at 11:36:49AM -0300, Jason Gunthorpe wrote:
> > > > > > > On Thu, May 23, 2019 at 12:34:25PM -0300, Jason Gunthorpe wrote:
> > > > > > > > From: Jason Gunthorpe <jgg@mellanox.com>
> > > > > > > > 
> > > > > > > > This patch series arised out of discussions with Jerome when looking at the
> > > > > > > > ODP changes, particularly informed by use after free races we have already
> > > > > > > > found and fixed in the ODP code (thanks to syzkaller) working with mmu
> > > > > > > > notifiers, and the discussion with Ralph on how to resolve the lifetime model.
> > > > > > > 
> > > > > > > So the last big difference with ODP's flow is how 'range->valid'
> > > > > > > works.
> > > > > > > 
> > > > > > > In ODP this was done using the rwsem umem->umem_rwsem which is
> > > > > > > obtained for read in invalidate_start and released in invalidate_end.
> > > > > > > 
> > > > > > > Then any other threads that wish to only work on a umem which is not
> > > > > > > undergoing invalidation will obtain the write side of the lock, and
> > > > > > > within that lock's critical section the virtual address range is known
> > > > > > > to not be invalidating.
> > > > > > > 
> > > > > > > I cannot understand how hmm gets to the same approach. It has
> > > > > > > range->valid, but it is not locked by anything that I can see, so when
> > > > > > > we test it in places like hmm_range_fault it seems useless..
> > > > > > > 
> > > > > > > Jerome, how does this work?
> > > > > > > 
> > > > > > > I have a feeling we should copy the approach from ODP and use an
> > > > > > > actual lock here.
> > > > > > 
> > > > > > range->valid is use as bail early if invalidation is happening in
> > > > > > hmm_range_fault() to avoid doing useless work. The synchronization
> > > > > > is explained in the documentation:
> > > > > 
> > > > > That just says the hmm APIs handle locking. I asked how the apis
> > > > > implement that locking internally.
> > > > > 
> > > > > Are you trying to say that if I do this, hmm will still work completely
> > > > > correctly?
> > > > 
> > > > Yes it will keep working correctly. You would just be doing potentialy
> > > > useless work.
> > > 
> > > I don't see how it works correctly.
> > > 
> > > Apply the comment out patch I showed and this trivially happens:
> > > 
> > >       CPU0                                               CPU1
> > >   hmm_invalidate_start()
> > >     ops->sync_cpu_device_pagetables()
> > >       device_lock()
> > >        // Wipe out page tables in device, enable faulting
> > >       device_unlock()
> > > 
> > >                                                        DEVICE PAGE FAULT
> > >                                                        device_lock()
> > >                                                        hmm_range_register()
> > >                                                        hmm_range_dma_map()
> > >                                                        device_unlock()
> > >   hmm_invalidate_end()
> > 
> > No in the above scenario hmm_range_register() will not mark the range
> > as valid thus the driver will bailout after taking its lock and checking
> > the range->valid value.
> 
> I see your confusion, I only asked about removing valid from hmm.c,
> not the unlocked use of valid in your hmm.rst example. My mistake,
> sorry for being unclear.

No, I did understand properly, and it is fine to remove all the valid
checks within hmm_range_fault() or hmm_range_snapshot(); nothing bad
will come out of that.

> 
> Here is the big 3 CPU ladder diagram that shows how 'valid' does not
> work:
> 
>        CPU0                                               CPU1                                          CPU2
>                                                         DEVICE PAGE FAULT
>                                                         range = hmm_range_register()
>
>   // Overlaps with range
>   hmm_invalidate_start()
>     range->valid = false
>     ops->sync_cpu_device_pagetables()
>       take_lock(driver->update);
>        // Wipe out page tables in device, enable faulting
>       release_lock(driver->update);
>                                                                                                    // Does not overlap with range
>                                                                                                    hmm_invalidate_start()
>                                                                                                    hmm_invalidate_end()
>                                                                                                        list_for_each
>                                                                                                            range->valid =  true

                                                                                                             ^
No, this cannot happen, because CPU0 still has an invalidate_range in
progress and thus hmm->notifiers > 0, so hmm_invalidate_range_end() will
not set range->valid to true.
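
To spell out the counting scheme being referred to, roughly (a simplified
sketch of the mm/hmm.c logic at this point in the series; the mmu notifier
callback arguments are elided):

 static void hmm_invalidate_start_logic(struct hmm *hmm)
 {
 	struct hmm_range *range;

 	mutex_lock(&hmm->lock);
 	hmm->notifiers++;
 	list_for_each_entry(range, &hmm->ranges, list)
 		range->valid = false;
 	mutex_unlock(&hmm->lock);
 }

 static void hmm_invalidate_end_logic(struct hmm *hmm)
 {
 	struct hmm_range *range;

 	mutex_lock(&hmm->lock);
 	hmm->notifiers--;
 	if (!hmm->notifiers) {
 		/* Only the last concurrent invalidation re-validates */
 		list_for_each_entry(range, &hmm->ranges, list)
 			range->valid = true;
 		wake_up_all(&hmm->wq);
 	}
 	mutex_unlock(&hmm->lock);
 }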

>
>
>                                                        device_lock()
>                                                        // Note range->valid = true now
>                                                        hmm_range_snapshot(&range);
>                                                        take_lock(driver->update);
>                                                        if (!hmm_range_valid(&range))
>                                                            goto again
>                                                        ESTABLISHE SPTES
>                                                        device_unlock()
>   hmm_invalidate_end()
> 
> 
> And I can make this more complicated (ie overlapping parallel
> invalidates, etc) and show any 'bool' valid cannot work.

It does work. If you want, I can remove the range->valid = true from
hmm_invalidate_range_end() and move it within hmm_range_wait_until_valid(),
i.e. modify the hmm_range_wait_until_valid() logic; this might look
cleaner.

> > > The mmu notifier spec says:
> > > 
> > >  	 * Invalidation of multiple concurrent ranges may be
> > > 	 * optionally permitted by the driver. Either way the
> > > 	 * establishment of sptes is forbidden in the range passed to
> > > 	 * invalidate_range_begin/end for the whole duration of the
> > > 	 * invalidate_range_begin/end critical section.
> > > 
> > > And I understand "establishment of sptes is forbidden" means
> > > "hmm_range_dmap_map() must fail with EAGAIN". 
> > 
> > No it means that secondary page table entry (SPTE) must not
> > materialize thus what hmm_range_dmap_map() is doing if fine and safe
> > as long as the driver do not use the result to populate the device
> > page table if there was an invalidation for the range.
> 
> Okay, so we agree, if there is an invalidate_start/end critical region
> then it is OK to *call* hmm_range_dmap_map(), however the driver must
> not *use* the result, and you are expecting this bit:
> 
>       take_lock(driver->update);
>       if (!hmm_range_valid(&range)) {
>          goto again
> 
> In your hmm.rst to prevent the pfns from being used by the driver?
> 
> I think the above ladder shows that hmm_range_valid can return true
> during a invalidate_start/end critical region, so this is a problem.
>
> I still think the best solution is to move device_lock() into mirror
> and have hmm manage it for the driver as ODP does. It is certainly the
> simplest solution to understand.

It is inefficient and would block forward progress of mm code further
than needed.

Cheers,
Jérôme


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [RFC PATCH 00/11] mm/hmm: Various revisions from a locking/code review
  2019-05-24 18:46               ` Jerome Glisse
@ 2019-05-24 22:09                 ` Jason Gunthorpe
  2019-05-27 19:58                   ` Jason Gunthorpe
  0 siblings, 1 reply; 45+ messages in thread
From: Jason Gunthorpe @ 2019-05-24 22:09 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: linux-rdma, linux-mm, Ralph Campbell, John Hubbard

On Fri, May 24, 2019 at 02:46:08PM -0400, Jerome Glisse wrote:
> > Here is the big 3 CPU ladder diagram that shows how 'valid' does not
> > work:
> > 
> >        CPU0                                               CPU1                                          CPU2
> >                                                         DEVICE PAGE FAULT
> >                                                         range = hmm_range_register()
> >
> >   // Overlaps with range
> >   hmm_invalidate_start()
> >     range->valid = false
> >     ops->sync_cpu_device_pagetables()
> >       take_lock(driver->update);
> >        // Wipe out page tables in device, enable faulting
> >       release_lock(driver->update);
> >                                                                                                    // Does not overlap with range
> >                                                                                                    hmm_invalidate_start()
> >                                                                                                    hmm_invalidate_end()
> >                                                                                                        list_for_each
> >                                                                                                            range->valid =  true
> 
>                                                                                                              ^
> No this can not happen because CPU0 still has invalidate_range in progress and
> thus hmm->notifiers > 0 so the hmm_invalidate_range_end() will not set the
> range->valid as true.

Oh, Okay, I now see how this all works, thank you

> > And I can make this more complicated (ie overlapping parallel
> > invalidates, etc) and show any 'bool' valid cannot work.
> 
> It does work. 

Well, I meant the bool alone cannot work, but this is really bool + a
counter.
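
(For reference, the "bool + a counter" scheme being discussed boils down
to roughly the following simplified sketch of the current notifier
handlers; the exact code is visible as the removed '-' lines in the
seqcount patch later in this thread.)

static int hmm_invalidate_range_start(struct mmu_notifier *mn,
				      const struct mmu_notifier_range *nrange)
{
	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
	struct hmm_range *range;

	mutex_lock(&hmm->lock);
	hmm->notifiers++;
	list_for_each_entry(range, &hmm->ranges, list) {
		if (nrange->end < range->start || nrange->start >= range->end)
			continue;
		range->valid = false;
	}
	mutex_unlock(&hmm->lock);

	/* ... each mirror's sync_cpu_device_pagetables() is called here ... */
	return 0;
}

static void hmm_invalidate_range_end(struct mmu_notifier *mn,
				     const struct mmu_notifier_range *nrange)
{
	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);

	mutex_lock(&hmm->lock);
	hmm->notifiers--;
	if (!hmm->notifiers) {
		struct hmm_range *range;

		/*
		 * Only once *no* invalidation is in flight anywhere in the
		 * mm are the ranges marked valid again, which is why CPU2's
		 * unrelated invalidate_end in the ladder above cannot
		 * re-validate the range.
		 */
		list_for_each_entry(range, &hmm->ranges, list)
			range->valid = true;
		wake_up_all(&hmm->wq);
	}
	mutex_unlock(&hmm->lock);
}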

> If you want I can remove the range->valid = true from
> hmm_invalidate_range_end() and move it into hmm_range_wait_until_valid(),
> i.e. modifying the hmm_range_wait_until_valid() logic; this might look
> cleaner.

Let me reflect on it for a bit. I have to say I don't like the lack of
clarity here, and I don't like the valid=true loop in invalidate_end; it
is pretty clunky.

I'm thinking a more obvious API for drivers, as something like:

again:
    hmm_range_start();
     [..]
    if (hmm_range_test_retry())
          goto again

    driver_lock()
      if (hmm_range_end())
           goto again
    driver_unlock();

Just because it makes it very clear to the driver author what to do
and how this is working, and makes it clear that there is no such
thing as 'valid' - what we *really* have is a locking collision
forcing a retry, i.e. this is really closer to a seq-lock scheme, not a
valid/invalid scheme. Being able to explain the concept does matter
for maintainability...

And I'm thinking the above API design would comfortably support a more
efficient seq-lock-like approach without the loop in invalidate_end...

But I haven't quite thought it all through yet. Next week!
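
(To make the shape concrete, a driver path under this hypothetical API
might look like the sketch below. hmm_range_start(), hmm_range_test_retry()
and hmm_range_end() are the placeholder names from the snippet above, with
hmm_range_end() reporting a locking collision that forces a retry;
driver_lock()/driver_unlock() stand in for the driver's update lock, and
mmap_sem handling around the snapshot is elided. This is illustrative only,
not the actual mm/hmm.c API.)

static long driver_mirror_range(struct hmm_range *range)
{
	long ret;

again:
	hmm_range_start(range);

	/* mmap_sem must be held around the snapshot; elided here */
	ret = hmm_range_snapshot(range);
	if (ret == -EAGAIN || hmm_range_test_retry(range))
		goto again;
	if (ret < 0)
		return ret;

	driver_lock();
	if (hmm_range_end(range)) {
		/* Collided with an invalidation, the pfns are stale */
		driver_unlock();
		goto again;
	}
	/* Only here is it safe to push range->pfns into the device */
	driver_unlock();
	return 0;
}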

> > I still think the best solution is to move device_lock() into mirror
> > and have hmm manage it for the driver as ODP does. It is certainly the
> > simplest solution to understand.
> 
> It is inefficient and would block forward progress by mm code further
> than needed.

I'm not sure how you get to that, we already have the device_lock()
and it already blocks forward progress by mm code.

Really the big unfortunate thing here is that valid is manipulated
outside the device_lock, but logically it is covered under the
device_lock.

Jason



* Re: [RFC PATCH 00/11] mm/hmm: Various revisions from a locking/code review
  2019-05-24 22:09                 ` Jason Gunthorpe
@ 2019-05-27 19:58                   ` Jason Gunthorpe
  0 siblings, 0 replies; 45+ messages in thread
From: Jason Gunthorpe @ 2019-05-27 19:58 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: linux-rdma, linux-mm, Ralph Campbell, John Hubbard

On Fri, May 24, 2019 at 07:09:23PM -0300, Jason Gunthorpe wrote:
> On Fri, May 24, 2019 at 02:46:08PM -0400, Jerome Glisse wrote:
> > > Here is the big 3 CPU ladder diagram that shows how 'valid' does not
> > > work:
> > > 
> > >        CPU0                                               CPU1                                          CPU2
> > >                                                         DEVICE PAGE FAULT
> > >                                                         range = hmm_range_register()
> > >
> > >   // Overlaps with range
> > >   hmm_invalidate_start()
> > >     range->valid = false
> > >     ops->sync_cpu_device_pagetables()
> > >       take_lock(driver->update);
> > >        // Wipe out page tables in device, enable faulting
> > >       release_lock(driver->update);
> > >                                                                                                    // Does not overlap with range
> > >                                                                                                    hmm_invalidate_start()
> > >                                                                                                    hmm_invalidate_end()
> > >                                                                                                        list_for_each
> > >                                                                                                            range->valid =  true
> > 
> >                                                                                                              ^
> > No, this cannot happen because CPU0 still has an invalidate_range in progress,
> > and thus hmm->notifiers > 0, so hmm_invalidate_range_end() will not set
> > range->valid to true.
> 
> Oh, Okay, I now see how this all works, thank you
> 
> > > And I can make this more complicated (ie overlapping parallel
> > > invalidates, etc) and show any 'bool' valid cannot work.
> > 
> > It does work. 
> 
> Well, I meant the bool alone cannot work, but this is really bool + a
> counter.

I couldn't shake the unease that a plain bool shouldn't work for this type
of locking, especially since ODP also uses a sequence lock as well as the
rwsem...

What about this situation:

       CPU0                                               CPU1
                                                        DEVICE PAGE FAULT
                                                        range = hmm_range_register()
							hmm_range_snapshot(&range);


   // Overlaps with range
   hmm_invalidate_start()
     range->valid = false
     ops->sync_cpu_device_pagetables()
       take_lock(driver->update);
        // Wipe out page tables in device, enable faulting
       release_lock(driver->update);
   hmm_invalidate_end()
      range->valid = true


							take_lock(driver->update);
							if (!hmm_range_valid(&range))
							    goto again
							ESTABLISH SPTES
                                                        release_lock(driver->update);


The ODP patch appears to follow this pattern, as the dma_map and the
mlx5_ib_update_xlt are in different locking regions. We should dump
the result of hmm_range_snapshot() in this case; certainly the driver
shouldn't be tasked with fixing this.

So, something like this is what I'm thinking about:

From 41b6a6120e30978e7335ada04fb9168db4e5fd29 Mon Sep 17 00:00:00 2001
From: Jason Gunthorpe <jgg@mellanox.com>
Date: Mon, 27 May 2019 16:48:53 -0300
Subject: [PATCH] RFC mm/hmm: Replace the range->valid with a seqcount

Instead of trying to use a single valid to keep track of parallel
invalidations use a traditional seqcount retry lock.

Replace the range->valid with the concept of an 'update critical region'
bounded by hmm_range_start_update() / hmm_range_end_update() which can
fail and need retry if it is ever interrupted by a parallel
invalidation. Updaters must create the critical section and can only
finish their update while holding the device_lock.

Continue to take a very loose approach to track invalidation, now with a
single global seqcount for all ranges. This is done to minimize the
overhead in the mmu notifier, and expects there will only be a small
number of ranges active at once. It could be converted to a seqcount per
range if necessary.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 Documentation/vm/hmm.rst | 22 +++++--------
 include/linux/hmm.h      | 60 ++++++++++++++++++++++++---------
 mm/hmm.c                 | 71 ++++++++++++++++++++++------------------
 3 files changed, 93 insertions(+), 60 deletions(-)

diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst
index 7c1e929931a07f..7e827058964579 100644
--- a/Documentation/vm/hmm.rst
+++ b/Documentation/vm/hmm.rst
@@ -229,32 +229,27 @@ The usage pattern is::
        * will use the return value of hmm_range_snapshot() below under the
        * mmap_sem to ascertain the validity of the range.
        */
-      hmm_range_wait_until_valid(&range, TIMEOUT_IN_MSEC);
-
  again:
+       if (!hmm_range_start_update(&range, TIMEOUT_IN_MSEC))
+             goto err
+
       down_read(&mm->mmap_sem);
       ret = hmm_range_snapshot(&range);
       if (ret) {
           up_read(&mm->mmap_sem);
-          if (ret == -EAGAIN) {
-            /*
-             * No need to check hmm_range_wait_until_valid() return value
-             * on retry we will get proper error with hmm_range_snapshot()
-             */
-            hmm_range_wait_until_valid(&range, TIMEOUT_IN_MSEC);
-            goto again;
-          }
+          if (ret == -EAGAIN)
+              goto again;
           hmm_mirror_unregister(&range);
           return ret;
       }
       take_lock(driver->update);
-      if (!hmm_range_valid(&range)) {
+      if (!hmm_range_end_update(&range)) {
           release_lock(driver->update);
           up_read(&mm->mmap_sem);
           goto again;
       }
 
-      // Use pfns array content to update device page table
+      // Use pfns array content to update device page table, must hold driver->update
 
       hmm_mirror_unregister(&range);
       release_lock(driver->update);
@@ -264,7 +259,8 @@ The usage pattern is::
 
 The driver->update lock is the same lock that the driver takes inside its
 sync_cpu_device_pagetables() callback. That lock must be held before calling
-hmm_range_valid() to avoid any race with a concurrent CPU page table update.
+hmm_range_end_update() to avoid any race with a concurrent CPU page table
+update.
 
 HMM implements all this on top of the mmu_notifier API because we wanted a
 simpler API and also to be able to perform optimizations latter on like doing
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index 26dfd9377b5094..9096113cfba8de 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -90,7 +90,9 @@
  * @mmu_notifier: mmu notifier to track updates to CPU page table
  * @mirrors_sem: read/write semaphore protecting the mirrors list
  * @wq: wait queue for user waiting on a range invalidation
- * @notifiers: count of active mmu notifiers
+ * @active_invalidates: count of active mmu notifier invalidations
+ * @range_invalidated: seqcount indicating that an active range was
+ *                     maybe invalidated
  */
 struct hmm {
 	struct mm_struct	*mm;
@@ -102,7 +104,8 @@ struct hmm {
 	struct rw_semaphore	mirrors_sem;
 	wait_queue_head_t	wq;
 	struct rcu_head		rcu;
-	long			notifiers;
+	unsigned int		active_invalidates;
+	seqcount_t		range_invalidated;
 };
 
 /*
@@ -169,7 +172,7 @@ enum hmm_pfn_value_e {
  * @pfn_flags_mask: allows to mask pfn flags so that only default_flags matter
  * @page_shift: device virtual address shift value (should be >= PAGE_SHIFT)
  * @pfn_shifts: pfn shift value (should be <= PAGE_SHIFT)
- * @valid: pfns array did not change since it has been fill by an HMM function
+ * @update_seq: sequence number for the seqcount lock read side
  */
 struct hmm_range {
 	struct hmm		*hmm;
@@ -184,7 +187,7 @@ struct hmm_range {
 	uint64_t		pfn_flags_mask;
 	uint8_t			page_shift;
 	uint8_t			pfn_shift;
-	bool			valid;
+	unsigned int		update_seq;
 };
 
 /*
@@ -208,27 +211,52 @@ static inline unsigned long hmm_range_page_size(const struct hmm_range *range)
 }
 
 /*
- * hmm_range_wait_until_valid() - wait for range to be valid
+ * hmm_range_start_update() - wait for range to be valid
  * @range: range affected by invalidation to wait on
  * @timeout: time out for wait in ms (ie abort wait after that period of time)
  * Return: true if the range is valid, false otherwise.
  */
-static inline bool hmm_range_wait_until_valid(struct hmm_range *range,
-					      unsigned long timeout)
+// FIXME: hmm should handle the timeout for the driver too.
+static inline unsigned int hmm_range_start_update(struct hmm_range *range,
+						  unsigned long timeout)
 {
-	wait_event_timeout(range->hmm->wq, range->valid,
+	wait_event_timeout(range->hmm->wq,
+			   READ_ONCE(range->hmm->active_invalidates) == 0,
 			   msecs_to_jiffies(timeout));
-	return READ_ONCE(range->valid);
+
+	// FIXME: wants a non-raw seq helper
+	seqcount_lockdep_reader_access(&range->hmm->range_invalidated);
+	range->update_seq = raw_seqcount_begin(&range->hmm->range_invalidated);
+	return !read_seqcount_retry(&range->hmm->range_invalidated,
+				    range->update_seq);
 }
 
 /*
- * hmm_range_valid() - test if a range is valid or not
+ * hmm_range_needs_retry() - test if a range that has begun update
+ *                 via hmm_range_start_update() needs to be retried.
  * @range: range
- * Return: true if the range is valid, false otherwise.
+ * Return: true if the update needs to be restarted from hmm_range_start_update(),
+ *         false otherwise.
+ */
+static inline bool hmm_range_needs_retry(struct hmm_range *range)
+{
+	return read_seqcount_retry(&range->hmm->range_invalidated,
+				   range->update_seq);
+}
+
+/*
+ * hmm_range_end_update() - finish an update of a range
+ * @range: range
+ *
+ * This must only be called while holding the device lock as described in
+ * hmm.rst, and must be called before making any of the update visible.
+ *
+ * Return: true if the update was not interrupted by an invalidation of the
+ * covered virtual memory range, false if the update needs to be retried.
  */
-static inline bool hmm_range_valid(struct hmm_range *range)
+static inline bool hmm_range_end_update(struct hmm_range *range)
 {
-	return range->valid;
+	return !hmm_range_needs_retry(range);
 }
 
 /*
@@ -511,7 +539,7 @@ static inline int hmm_mirror_range_register(struct hmm_range *range,
 /* This is a temporary helper to avoid merge conflict between trees. */
 static inline bool hmm_vma_range_done(struct hmm_range *range)
 {
-	bool ret = hmm_range_valid(range);
+	bool ret = !hmm_range_needs_retry(range);
 
 	hmm_range_unregister(range);
 	return ret;
@@ -537,7 +565,7 @@ static inline int hmm_vma_fault(struct hmm_range *range, bool block)
 	if (ret)
 		return (int)ret;
 
-	if (!hmm_range_wait_until_valid(range, HMM_RANGE_DEFAULT_TIMEOUT)) {
+	if (!hmm_range_start_update(range, HMM_RANGE_DEFAULT_TIMEOUT)) {
 		hmm_range_unregister(range);
 		/*
 		 * The mmap_sem was taken by driver we release it here and
@@ -549,6 +577,8 @@ static inline int hmm_vma_fault(struct hmm_range *range, bool block)
 	}
 
 	ret = hmm_range_fault(range, block);
+	if (!hmm_range_end_update(range))
+		ret = -EAGAIN;
 	if (ret <= 0) {
 		hmm_range_unregister(range);
 		if (ret == -EBUSY || !ret) {
diff --git a/mm/hmm.c b/mm/hmm.c
index 8396a65710e304..09725774ff6112 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -79,7 +79,8 @@ static struct hmm *hmm_get_or_create(struct mm_struct *mm)
 	INIT_LIST_HEAD(&hmm->ranges);
 	mutex_init(&hmm->lock);
 	kref_init(&hmm->kref);
-	hmm->notifiers = 0;
+	hmm->active_invalidates = 0;
+	seqcount_init(&hmm->range_invalidated);
 	hmm->mm = mm;
 	mmgrab(mm);
 
@@ -155,13 +156,22 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
 	hmm_put(hmm);
 }
 
+static bool any_range_overlaps(struct hmm *hmm, unsigned long start, unsigned long end)
+{
+	struct hmm_range *range;
+
+	list_for_each_entry(range, &hmm->ranges, list)
+		// FIXME: check me
+		if (range->start <= end && range->end < start)
+			return true;
+	return false;
+}
 static int hmm_invalidate_range_start(struct mmu_notifier *mn,
 			const struct mmu_notifier_range *nrange)
 {
 	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
 	struct hmm_mirror *mirror;
 	struct hmm_update update;
-	struct hmm_range *range;
 	int ret = 0;
 
 	/* hmm is in progress to free */
@@ -179,13 +189,22 @@ static int hmm_invalidate_range_start(struct mmu_notifier *mn,
 		ret = -EAGAIN;
 		goto out;
 	}
-	hmm->notifiers++;
-	list_for_each_entry(range, &hmm->ranges, list) {
-		if (update.end < range->start || update.start >= range->end)
-			continue;
-
-		range->valid = false;
-	}
+	/*
+	 * The seqcount is used as a fast but inaccurate indication that
+	 * another CPU working with a range needs to retry due to invalidation
+	 * of the same virtual address space covered by the range by this
+	 * CPU.
+	 *
+	 * It is based on the assumption that the ranges will be short lived,
+	 * so there is no need to be aggressively accurate in the mmu notifier
+	 * fast path. Any notifier intersection will cause all registered
+	 * ranges to retry.
+	 */
+	hmm->active_invalidates++;
+	// FIXME: needs a seqcount helper
+	if (!(hmm->range_invalidated.sequence & 1) &&
+	    any_range_overlaps(hmm, update.start, update.end))
+		write_seqcount_begin(&hmm->range_invalidated);
 	mutex_unlock(&hmm->lock);
 
 	if (mmu_notifier_range_blockable(nrange))
@@ -218,15 +237,11 @@ static void hmm_invalidate_range_end(struct mmu_notifier *mn,
 		return;
 
 	mutex_lock(&hmm->lock);
-	hmm->notifiers--;
-	if (!hmm->notifiers) {
-		struct hmm_range *range;
-
-		list_for_each_entry(range, &hmm->ranges, list) {
-			if (range->valid)
-				continue;
-			range->valid = true;
-		}
+	hmm->active_invalidates--;
+	if (hmm->active_invalidates == 0) {
+		// FIXME: needs a seqcount helper
+		if (hmm->range_invalidated.sequence & 1)
+			write_seqcount_end(&hmm->range_invalidated);
 		wake_up_all(&hmm->wq);
 	}
 	mutex_unlock(&hmm->lock);
@@ -882,7 +897,7 @@ int hmm_range_register(struct hmm_range *range,
 {
 	unsigned long mask = ((1UL << page_shift) - 1UL);
 
-	range->valid = false;
+	range->update_seq = 0;
 	range->hmm = NULL;
 
 	if ((start & mask) || (end & mask))
@@ -908,15 +923,7 @@ int hmm_range_register(struct hmm_range *range,
 
 	/* Initialize range to track CPU page table updates. */
 	mutex_lock(&range->hmm->lock);
-
 	list_add(&range->list, &range->hmm->ranges);
-
-	/*
-	 * If there are any concurrent notifiers we have to wait for them for
-	 * the range to be valid (see hmm_range_wait_until_valid()).
-	 */
-	if (!range->hmm->notifiers)
-		range->valid = true;
 	mutex_unlock(&range->hmm->lock);
 
 	return 0;
@@ -947,7 +954,6 @@ void hmm_range_unregister(struct hmm_range *range)
 	hmm_put(hmm);
 
 	/* The range is now invalid, leave it poisoned. */
-	range->valid = false;
 	memset(&range->hmm, POISON_INUSE, sizeof(range->hmm));
 }
 EXPORT_SYMBOL(hmm_range_unregister);
@@ -981,7 +987,7 @@ long hmm_range_snapshot(struct hmm_range *range)
 
 	do {
 		/* If range is no longer valid force retry. */
-		if (!range->valid)
+		if (hmm_range_needs_retry(range))
 			return -EAGAIN;
 
 		vma = find_vma(hmm->mm, start);
@@ -1080,7 +1086,7 @@ long hmm_range_fault(struct hmm_range *range, bool block)
 
 	do {
 		/* If range is no longer valid force retry. */
-		if (!range->valid) {
+		if (hmm_range_needs_retry(range)) {
 			up_read(&hmm->mm->mmap_sem);
 			return -EAGAIN;
 		}
@@ -1134,7 +1140,7 @@ long hmm_range_fault(struct hmm_range *range, bool block)
 			start = hmm_vma_walk.last;
 
 			/* Keep trying while the range is valid. */
-		} while (ret == -EBUSY && range->valid);
+		} while (ret == -EBUSY && !hmm_range_needs_retry(range));
 
 		if (ret) {
 			unsigned long i;
@@ -1195,7 +1201,8 @@ long hmm_range_dma_map(struct hmm_range *range,
 			continue;
 
 		/* Check if range is being invalidated */
-		if (!range->valid) {
+		if (hmm_range_needs_retry(range)) {
+			// ?? EAGAIN??
 			ret = -EBUSY;
 			goto unmap;
 		}
-- 
2.21.0



* Re: [RFC PATCH 01/11] mm/hmm: Fix use after free with struct hmm in the mmu notifiers
  2019-05-23 15:34 ` [RFC PATCH 01/11] mm/hmm: Fix use after free with struct hmm in the mmu notifiers Jason Gunthorpe
@ 2019-06-06 23:54   ` Ira Weiny
  2019-06-07 14:17     ` Jason Gunthorpe
  0 siblings, 1 reply; 45+ messages in thread
From: Ira Weiny @ 2019-06-06 23:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma, linux-mm, Jerome Glisse, Ralph Campbell,
	John Hubbard, Jason Gunthorpe

On Thu, May 23, 2019 at 12:34:26PM -0300, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> mmu_notifier_unregister_no_release() is not a fence and the mmu_notifier
> system will continue to reference hmm->mn until the srcu grace period
> expires.
> 
> Resulting in use after free races like this:
> 
>          CPU0                                     CPU1
>                                                __mmu_notifier_invalidate_range_start()
>                                                  srcu_read_lock
>                                                  hlist_for_each ()
>                                                    // mn == hmm->mn
> hmm_mirror_unregister()
>   hmm_put()
>     hmm_free()
>       mmu_notifier_unregister_no_release()
>          hlist_del_init_rcu(hmm-mn->list)
> 			                           mn->ops->invalidate_range_start(mn, range);
> 					             mm_get_hmm()
>       mm->hmm = NULL;
>       kfree(hmm)
>                                                      mutex_lock(&hmm->lock);
> 
> Use SRCU to kfree the hmm memory so that the notifiers can rely on hmm
> existing. Get the now-safe hmm struct through container_of and directly
> check kref_get_unless_zero to lock it against free.
> 
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> ---
>  include/linux/hmm.h |  1 +
>  mm/hmm.c            | 25 +++++++++++++++++++------
>  2 files changed, 20 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> index 51ec27a8466816..8b91c90d3b88cb 100644
> --- a/include/linux/hmm.h
> +++ b/include/linux/hmm.h
> @@ -102,6 +102,7 @@ struct hmm {
>  	struct mmu_notifier	mmu_notifier;
>  	struct rw_semaphore	mirrors_sem;
>  	wait_queue_head_t	wq;
> +	struct rcu_head		rcu;
>  	long			notifiers;
>  	bool			dead;
>  };
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 816c2356f2449f..824e7e160d8167 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -113,6 +113,11 @@ static struct hmm *hmm_get_or_create(struct mm_struct *mm)
>  	return NULL;
>  }
>  
> +static void hmm_fee_rcu(struct rcu_head *rcu)

NIT: "free"

Other than that looks good.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>

> +{
> +	kfree(container_of(rcu, struct hmm, rcu));
> +}
> +
>  static void hmm_free(struct kref *kref)
>  {
>  	struct hmm *hmm = container_of(kref, struct hmm, kref);
> @@ -125,7 +130,7 @@ static void hmm_free(struct kref *kref)
>  		mm->hmm = NULL;
>  	spin_unlock(&mm->page_table_lock);
>  
> -	kfree(hmm);
> +	mmu_notifier_call_srcu(&hmm->rcu, hmm_fee_rcu);
>  }
>  
>  static inline void hmm_put(struct hmm *hmm)
> @@ -153,10 +158,14 @@ void hmm_mm_destroy(struct mm_struct *mm)
>  
>  static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>  {
> -	struct hmm *hmm = mm_get_hmm(mm);
> +	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
>  	struct hmm_mirror *mirror;
>  	struct hmm_range *range;
>  
> +	/* hmm is in progress to free */
> +	if (!kref_get_unless_zero(&hmm->kref))
> +		return;
> +
>  	/* Report this HMM as dying. */
>  	hmm->dead = true;
>  
> @@ -194,13 +203,15 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
>  static int hmm_invalidate_range_start(struct mmu_notifier *mn,
>  			const struct mmu_notifier_range *nrange)
>  {
> -	struct hmm *hmm = mm_get_hmm(nrange->mm);
> +	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
>  	struct hmm_mirror *mirror;
>  	struct hmm_update update;
>  	struct hmm_range *range;
>  	int ret = 0;
>  
> -	VM_BUG_ON(!hmm);
> +	/* hmm is in progress to free */
> +	if (!kref_get_unless_zero(&hmm->kref))
> +		return 0;
>  
>  	update.start = nrange->start;
>  	update.end = nrange->end;
> @@ -248,9 +259,11 @@ static int hmm_invalidate_range_start(struct mmu_notifier *mn,
>  static void hmm_invalidate_range_end(struct mmu_notifier *mn,
>  			const struct mmu_notifier_range *nrange)
>  {
> -	struct hmm *hmm = mm_get_hmm(nrange->mm);
> +	struct hmm *hmm = container_of(mn, struct hmm, mmu_notifier);
>  
> -	VM_BUG_ON(!hmm);
> +	/* hmm is in progress to free */
> +	if (!kref_get_unless_zero(&hmm->kref))
> +		return;
>  
>  	mutex_lock(&hmm->lock);
>  	hmm->notifiers--;
> -- 
> 2.21.0
> 



* Re: [RFC PATCH 01/11] mm/hmm: Fix use after free with struct hmm in the mmu notifiers
  2019-06-06 23:54   ` Ira Weiny
@ 2019-06-07 14:17     ` Jason Gunthorpe
  0 siblings, 0 replies; 45+ messages in thread
From: Jason Gunthorpe @ 2019-06-07 14:17 UTC (permalink / raw)
  To: Ira Weiny
  Cc: linux-rdma, linux-mm, Jerome Glisse, Ralph Campbell, John Hubbard

On Thu, Jun 06, 2019 at 04:54:41PM -0700, Ira Weiny wrote:
> On Thu, May 23, 2019 at 12:34:26PM -0300, Jason Gunthorpe wrote:
> > From: Jason Gunthorpe <jgg@mellanox.com>
> > 
> > mmu_notifier_unregister_no_release() is not a fence and the mmu_notifier
> > system will continue to reference hmm->mn until the srcu grace period
> > expires.
> > 
> > Resulting in use after free races like this:
> > 
> >          CPU0                                     CPU1
> >                                                __mmu_notifier_invalidate_range_start()
> >                                                  srcu_read_lock
> >                                                  hlist_for_each ()
> >                                                    // mn == hmm->mn
> > hmm_mirror_unregister()
> >   hmm_put()
> >     hmm_free()
> >       mmu_notifier_unregister_no_release()
> >          hlist_del_init_rcu(hmm-mn->list)
> > 			                           mn->ops->invalidate_range_start(mn, range);
> > 					             mm_get_hmm()
> >       mm->hmm = NULL;
> >       kfree(hmm)
> >                                                      mutex_lock(&hmm->lock);
> > 
> > Use SRCU to kfree the hmm memory so that the notifiers can rely on hmm
> > existing. Get the now-safe hmm struct through container_of and directly
> > check kref_get_unless_zero to lock it against free.
> > 
> > Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> >  include/linux/hmm.h |  1 +
> >  mm/hmm.c            | 25 +++++++++++++++++++------
> >  2 files changed, 20 insertions(+), 6 deletions(-)
> > 
> > diff --git a/include/linux/hmm.h b/include/linux/hmm.h
> > index 51ec27a8466816..8b91c90d3b88cb 100644
> > +++ b/include/linux/hmm.h
> > @@ -102,6 +102,7 @@ struct hmm {
> >  	struct mmu_notifier	mmu_notifier;
> >  	struct rw_semaphore	mirrors_sem;
> >  	wait_queue_head_t	wq;
> > +	struct rcu_head		rcu;
> >  	long			notifiers;
> >  	bool			dead;
> >  };
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > index 816c2356f2449f..824e7e160d8167 100644
> > +++ b/mm/hmm.c
> > @@ -113,6 +113,11 @@ static struct hmm *hmm_get_or_create(struct mm_struct *mm)
> >  	return NULL;
> >  }
> >  
> > +static void hmm_fee_rcu(struct rcu_head *rcu)
> 
> NIT: "free"
> 
> Other than that looks good.
> 
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>

Fixed in v2, thanks

Jason



* Re: [RFC PATCH 08/11] mm/hmm: Use lockdep instead of comments
  2019-05-23 15:34 ` [RFC PATCH 08/11] mm/hmm: Use lockdep instead of comments Jason Gunthorpe
@ 2019-06-07 19:33   ` Souptick Joarder
  2019-06-07 19:39     ` Jason Gunthorpe
  0 siblings, 1 reply; 45+ messages in thread
From: Souptick Joarder @ 2019-06-07 19:33 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma, Linux-MM, Jerome Glisse, Ralph Campbell,
	John Hubbard, Jason Gunthorpe

On Thu, May 23, 2019 at 9:05 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> From: Jason Gunthorpe <jgg@mellanox.com>
>
> So we can check locking at runtime.
>
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> ---
>  mm/hmm.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 2695925c0c5927..46872306f922bb 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -256,11 +256,11 @@ static const struct mmu_notifier_ops hmm_mmu_notifier_ops = {
>   *
>   * To start mirroring a process address space, the device driver must register
>   * an HMM mirror struct.
> - *
> - * THE mm->mmap_sem MUST BE HELD IN WRITE MODE !
>   */
>  int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm)
>  {
> +       lockdep_assert_held_exclusive(mm->mmap_sem);
> +

Gentle query, is the same required in hmm_mirror_unregister()?

>         /* Sanity check */
>         if (!mm || !mirror || !mirror->ops)
>                 return -EINVAL;
> --
> 2.21.0
>



* Re: [RFC PATCH 09/11] mm/hmm: Remove racy protection against double-unregistration
  2019-06-07 19:38   ` Souptick Joarder
@ 2019-06-07 19:37     ` Jason Gunthorpe
  2019-06-07 19:55       ` Souptick Joarder
  0 siblings, 1 reply; 45+ messages in thread
From: Jason Gunthorpe @ 2019-06-07 19:37 UTC (permalink / raw)
  To: Souptick Joarder
  Cc: linux-rdma, Linux-MM, Jerome Glisse, Ralph Campbell, John Hubbard

On Sat, Jun 08, 2019 at 01:08:37AM +0530, Souptick Joarder wrote:
> On Thu, May 23, 2019 at 9:05 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > From: Jason Gunthorpe <jgg@mellanox.com>
> >
> > No other register/unregister kernel API attempts to provide this kind of
> > protection as it is inherently racy, so just drop it.
> >
> > Callers should provide their own protection, it appears nouveau already
> > does, but just in case drop a debugging POISON.
> >
> > Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> >  mm/hmm.c | 9 ++-------
> >  1 file changed, 2 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > index 46872306f922bb..6c3b7398672c29 100644
> > +++ b/mm/hmm.c
> > @@ -286,18 +286,13 @@ EXPORT_SYMBOL(hmm_mirror_register);
> >   */
> >  void hmm_mirror_unregister(struct hmm_mirror *mirror)
> >  {
> > -       struct hmm *hmm = READ_ONCE(mirror->hmm);
> > -
> > -       if (hmm == NULL)
> > -               return;
> > +       struct hmm *hmm = mirror->hmm;
> 
> How about removing struct hmm *hmm and replacing the code like below -
> 
> down_write(&mirror->hmm->mirrors_sem);
> list_del_init(&mirror->list);
> up_write(&mirror->hmm->mirrors_sem);
> hmm_put(hmm);
> memset(&mirror->hmm, POISON_INUSE, sizeof(mirror->hmm));
> 
> Similar to hmm_mirror_register().

I think we get there in patch 10, right?

When the series is all done the function looks like this:

void hmm_mirror_unregister(struct hmm_mirror *mirror)
{
        struct hmm *hmm = mirror->hmm;

        down_write(&hmm->mirrors_sem);
        list_del(&mirror->list);
        up_write(&hmm->mirrors_sem);
        hmm_put(hmm);
        memset(&mirror->hmm, POISON_INUSE, sizeof(mirror->hmm));
}

I think this mostly matches what you wrote above, or do you think we
should s/hmm/mirror->hmm/ anyhow? I think Ralph just added that :)

Regards,
Jason



* Re: [RFC PATCH 09/11] mm/hmm: Remove racy protection against double-unregistration
  2019-05-23 15:34 ` [RFC PATCH 09/11] mm/hmm: Remove racy protection against double-unregistration Jason Gunthorpe
@ 2019-06-07 19:38   ` Souptick Joarder
  2019-06-07 19:37     ` Jason Gunthorpe
  0 siblings, 1 reply; 45+ messages in thread
From: Souptick Joarder @ 2019-06-07 19:38 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma, Linux-MM, Jerome Glisse, Ralph Campbell,
	John Hubbard, Jason Gunthorpe

On Thu, May 23, 2019 at 9:05 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> From: Jason Gunthorpe <jgg@mellanox.com>
>
> No other register/unregister kernel API attempts to provide this kind of
> protection as it is inherently racy, so just drop it.
>
> Callers should provide their own protection, it appears nouveau already
> does, but just in case drop a debugging POISON.
>
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> ---
>  mm/hmm.c | 9 ++-------
>  1 file changed, 2 insertions(+), 7 deletions(-)
>
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 46872306f922bb..6c3b7398672c29 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -286,18 +286,13 @@ EXPORT_SYMBOL(hmm_mirror_register);
>   */
>  void hmm_mirror_unregister(struct hmm_mirror *mirror)
>  {
> -       struct hmm *hmm = READ_ONCE(mirror->hmm);
> -
> -       if (hmm == NULL)
> -               return;
> +       struct hmm *hmm = mirror->hmm;

How about removing struct hmm *hmm and replacing the code like below -

down_write(&mirror->hmm->mirrors_sem);
list_del_init(&mirror->list);
up_write(&mirror->hmm->mirrors_sem);
hmm_put(hmm);
memset(&mirror->hmm, POISON_INUSE, sizeof(mirror->hmm));

Similar to hmm_mirror_register().


>         down_write(&hmm->mirrors_sem);
>         list_del_init(&mirror->list);
> -       /* To protect us against double unregister ... */
> -       mirror->hmm = NULL;
>         up_write(&hmm->mirrors_sem);
> -
>         hmm_put(hmm);
> +       memset(&mirror->hmm, POISON_INUSE, sizeof(mirror->hmm));
>  }
>  EXPORT_SYMBOL(hmm_mirror_unregister);
>
> --
> 2.21.0
>



* Re: [RFC PATCH 08/11] mm/hmm: Use lockdep instead of comments
  2019-06-07 19:33   ` Souptick Joarder
@ 2019-06-07 19:39     ` Jason Gunthorpe
  2019-06-07 21:02       ` Souptick Joarder
  0 siblings, 1 reply; 45+ messages in thread
From: Jason Gunthorpe @ 2019-06-07 19:39 UTC (permalink / raw)
  To: Souptick Joarder
  Cc: linux-rdma, Linux-MM, Jerome Glisse, Ralph Campbell, John Hubbard

On Sat, Jun 08, 2019 at 01:03:48AM +0530, Souptick Joarder wrote:
> On Thu, May 23, 2019 at 9:05 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > From: Jason Gunthorpe <jgg@mellanox.com>
> >
> > So we can check locking at runtime.
> >
> > Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> >  mm/hmm.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > index 2695925c0c5927..46872306f922bb 100644
> > +++ b/mm/hmm.c
> > @@ -256,11 +256,11 @@ static const struct mmu_notifier_ops hmm_mmu_notifier_ops = {
> >   *
> >   * To start mirroring a process address space, the device driver must register
> >   * an HMM mirror struct.
> > - *
> > - * THE mm->mmap_sem MUST BE HELD IN WRITE MODE !
> >   */
> >  int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm)
> >  {
> > +       lockdep_assert_held_exclusive(mm->mmap_sem);
> > +
> 
> > Gentle query, is the same required in hmm_mirror_unregister()?

No.. The unregistration path does its actual work in the SRCU
callback, which runs in a different context from this function. So any
locking held by the caller of unregister will not apply.

The hmm_range_free SRCU callback obtains the write side of mmap_sem to
protect the same data that the write side above in register is
touching, mostly &mm->hmm.

Jason



* Re: [RFC PATCH 09/11] mm/hmm: Remove racy protection against double-unregistration
  2019-06-07 19:37     ` Jason Gunthorpe
@ 2019-06-07 19:55       ` Souptick Joarder
  0 siblings, 0 replies; 45+ messages in thread
From: Souptick Joarder @ 2019-06-07 19:55 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma, Linux-MM, Jerome Glisse, Ralph Campbell, John Hubbard

On Sat, Jun 8, 2019 at 1:07 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Sat, Jun 08, 2019 at 01:08:37AM +0530, Souptick Joarder wrote:
> > On Thu, May 23, 2019 at 9:05 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >
> > > From: Jason Gunthorpe <jgg@mellanox.com>
> > >
> > > No other register/unregister kernel API attempts to provide this kind of
> > > protection as it is inherently racy, so just drop it.
> > >
> > > Callers should provide their own protection, it appears nouveau already
> > > does, but just in case drop a debugging POISON.
> > >
> > > Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> > >  mm/hmm.c | 9 ++-------
> > >  1 file changed, 2 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/mm/hmm.c b/mm/hmm.c
> > > index 46872306f922bb..6c3b7398672c29 100644
> > > +++ b/mm/hmm.c
> > > @@ -286,18 +286,13 @@ EXPORT_SYMBOL(hmm_mirror_register);
> > >   */
> > >  void hmm_mirror_unregister(struct hmm_mirror *mirror)
> > >  {
> > > -       struct hmm *hmm = READ_ONCE(mirror->hmm);
> > > -
> > > -       if (hmm == NULL)
> > > -               return;
> > > +       struct hmm *hmm = mirror->hmm;
> >
> > How about removing struct hmm *hmm and replacing the code like below -
> >
> > down_write(&mirror->hmm->mirrors_sem);
> > list_del_init(&mirror->list);
> > up_write(&mirror->hmm->mirrors_sem);
> > hmm_put(hmm);
> > memset(&mirror->hmm, POISON_INUSE, sizeof(mirror->hmm));
> >
> > Similar to hmm_mirror_register().
>
> I think we get there in patch 10, right?

No, Patch 10 of this series has modified hmm_range_unregister().
>
> When the series is all done the function looks like this:
>
> void hmm_mirror_unregister(struct hmm_mirror *mirror)
> {
>         struct hmm *hmm = mirror->hmm;
>
>         down_write(&hmm->mirrors_sem);
>         list_del(&mirror->list);
>         up_write(&hmm->mirrors_sem);
>         hmm_put(hmm);
>         memset(&mirror->hmm, POISON_INUSE, sizeof(mirror->hmm));
> }
>
> I think this mostly matches what you wrote above, or do you think we
> should s/hmm/mirror->hmm/ anyhow? I think Ralph just added that :)

I prefer s/hmm/mirror->hmm/ and removing struct hmm *hmm :)



* Re: [RFC PATCH 10/11] mm/hmm: Poison hmm_range during unregister
  2019-05-23 15:34 ` [RFC PATCH 10/11] mm/hmm: Poison hmm_range during unregister Jason Gunthorpe
@ 2019-06-07 20:13   ` Souptick Joarder
  2019-06-07 20:18     ` Jason Gunthorpe
  0 siblings, 1 reply; 45+ messages in thread
From: Souptick Joarder @ 2019-06-07 20:13 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma, Linux-MM, Jerome Glisse, Ralph Campbell,
	John Hubbard, Jason Gunthorpe

On Thu, May 23, 2019 at 9:05 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> From: Jason Gunthorpe <jgg@mellanox.com>
>
> Trying to misuse a range outside its lifetime is a kernel bug. Use WARN_ON
> and poison bytes to detect this condition.
>
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

Acked-by: Souptick Joarder <jrdr.linux@gmail.com>

> ---
>  mm/hmm.c | 11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 6c3b7398672c29..02752d3ef2ed92 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -936,8 +936,7 @@ EXPORT_SYMBOL(hmm_range_register);
>   */
>  void hmm_range_unregister(struct hmm_range *range)
>  {
> -       /* Sanity check this really should not happen. */
> -       if (range->hmm == NULL || range->end <= range->start)
> +       if (WARN_ON(range->end <= range->start))
>                 return;

Does it make any sense to sanity check for range == NULL as well?
>
>         mutex_lock(&range->hmm->lock);
> @@ -945,9 +944,13 @@ void hmm_range_unregister(struct hmm_range *range)
>         mutex_unlock(&range->hmm->lock);
>
>         /* Drop reference taken by hmm_range_register() */
> -       range->valid = false;
>         hmm_put(range->hmm);
> -       range->hmm = NULL;
> +
> +       /* The range is now invalid, leave it poisoned. */
> +       range->valid = false;
> +       range->start = ULONG_MAX;
> +       range->end = 0;
> +       memset(&range->hmm, POISON_INUSE, sizeof(range->hmm));
>  }
>  EXPORT_SYMBOL(hmm_range_unregister);
>
> --
> 2.21.0
>



* Re: [RFC PATCH 10/11] mm/hmm: Poison hmm_range during unregister
  2019-06-07 20:13   ` Souptick Joarder
@ 2019-06-07 20:18     ` Jason Gunthorpe
  0 siblings, 0 replies; 45+ messages in thread
From: Jason Gunthorpe @ 2019-06-07 20:18 UTC (permalink / raw)
  To: Souptick Joarder
  Cc: linux-rdma, Linux-MM, Jerome Glisse, Ralph Campbell, John Hubbard

On Sat, Jun 08, 2019 at 01:43:12AM +0530, Souptick Joarder wrote:
> On Thu, May 23, 2019 at 9:05 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > From: Jason Gunthorpe <jgg@mellanox.com>
> >
> > Trying to misuse a range outside its lifetime is a kernel bug. Use WARN_ON
> > and poison bytes to detect this condition.
> >
> > Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> 
> Acked-by: Souptick Joarder <jrdr.linux@gmail.com>
> 
> >  mm/hmm.c | 11 +++++++----
> >  1 file changed, 7 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/hmm.c b/mm/hmm.c
> > index 6c3b7398672c29..02752d3ef2ed92 100644
> > +++ b/mm/hmm.c
> > @@ -936,8 +936,7 @@ EXPORT_SYMBOL(hmm_range_register);
> >   */
> >  void hmm_range_unregister(struct hmm_range *range)
> >  {
> > -       /* Sanity check this really should not happen. */
> > -       if (range->hmm == NULL || range->end <= range->start)
> > +       if (WARN_ON(range->end <= range->start))
> >                 return;
> 
> Does it make any sense to sanity check for range == NULL as well?

The purpose of the sanity check is to make API misuse into a reliable
crash, so if range is NULL then it will already reliably crash due to
the next lines.

This approach is to help driver authors use the API properly.

However, looking closer, this will already crash reliably if we double
unregister, as range->hmm->lock will instantly crash due to the poison,
and the test no longer works right anyhow since v2 dropped setting the
start/end values. I've deleted the check for v3:

Thanks,
Jason

From 461d880d1e898dc8e9ff6236b1730a5996df8738 Mon Sep 17 00:00:00 2001
From: Jason Gunthorpe <jgg@mellanox.com>
Date: Thu, 23 May 2019 11:40:24 -0300
Subject: [PATCH] mm/hmm: Poison hmm_range during unregister
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Trying to misuse a range outside its lifetime is a kernel bug. Use poison
bytes to help detect this condition. Double unregister will reliably crash.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Reviewed-by: Jérôme Glisse <jglisse@redhat.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
---
v2
- Keep range start/end valid after unregistration (Jerome)
v3
- Revise some comments (John)
- Remove start/end WARN_ON (Souptick)
---
 mm/hmm.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/mm/hmm.c b/mm/hmm.c
index d4ac179c899c4e..288fcd1ffca5b5 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -925,10 +925,6 @@ void hmm_range_unregister(struct hmm_range *range)
 {
 	struct hmm *hmm = range->hmm;
 
-	/* Sanity check this really should not happen. */
-	if (hmm == NULL || range->end <= range->start)
-		return;
-
 	mutex_lock(&hmm->lock);
 	list_del_rcu(&range->list);
 	mutex_unlock(&hmm->lock);
@@ -937,7 +933,14 @@ void hmm_range_unregister(struct hmm_range *range)
 	range->valid = false;
 	mmput(hmm->mm);
 	hmm_put(hmm);
-	range->hmm = NULL;
+
+	/*
+	 * The range is now invalid and the ref on the hmm is dropped, so
+         * poison the pointer.  Leave other fields in place, for the caller's
+         * use.
+         */
+	range->valid = false;
+	memset(&range->hmm, POISON_INUSE, sizeof(range->hmm));
 }
 EXPORT_SYMBOL(hmm_range_unregister);
 
-- 
2.21.0



* Re: [RFC PATCH 11/11] mm/hmm: Do not use list*_rcu() for hmm->ranges
  2019-05-23 15:34 ` [RFC PATCH 11/11] mm/hmm: Do not use list*_rcu() for hmm->ranges Jason Gunthorpe
@ 2019-06-07 20:22   ` Souptick Joarder
  0 siblings, 0 replies; 45+ messages in thread
From: Souptick Joarder @ 2019-06-07 20:22 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma, Linux-MM, Jerome Glisse, Ralph Campbell,
	John Hubbard, Jason Gunthorpe

On Thu, May 23, 2019 at 9:05 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> From: Jason Gunthorpe <jgg@mellanox.com>
>
> This list is always read and written while holding hmm->lock so there is
> no need for the confusing _rcu annotations.
>
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>

Acked-by: Souptick Joarder <jrdr.linux@gmail.com>

> ---
>  mm/hmm.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 02752d3ef2ed92..b4aafa90a109a5 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -912,7 +912,7 @@ int hmm_range_register(struct hmm_range *range,
>         /* Initialize range to track CPU page table update */
>         mutex_lock(&range->hmm->lock);
>
> -       list_add_rcu(&range->list, &range->hmm->ranges);
> +       list_add(&range->list, &range->hmm->ranges);
>
>         /*
>          * If there are any concurrent notifiers we have to wait for them for
> @@ -940,7 +940,7 @@ void hmm_range_unregister(struct hmm_range *range)
>                 return;
>
>         mutex_lock(&range->hmm->lock);
> -       list_del_rcu(&range->list);
> +       list_del(&range->list);
>         mutex_unlock(&range->hmm->lock);
>
>         /* Drop reference taken by hmm_range_register() */
> --
> 2.21.0
>



* Re: [RFC PATCH 08/11] mm/hmm: Use lockdep instead of comments
  2019-06-07 19:39     ` Jason Gunthorpe
@ 2019-06-07 21:02       ` Souptick Joarder
  2019-06-08  1:15         ` Jason Gunthorpe
  0 siblings, 1 reply; 45+ messages in thread
From: Souptick Joarder @ 2019-06-07 21:02 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-rdma, Linux-MM, Jerome Glisse, Ralph Campbell, John Hubbard

On Sat, Jun 8, 2019 at 1:09 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Sat, Jun 08, 2019 at 01:03:48AM +0530, Souptick Joarder wrote:
> > On Thu, May 23, 2019 at 9:05 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >
> > > From: Jason Gunthorpe <jgg@mellanox.com>
> > >
> > > So we can check locking at runtime.
> > >
> > > Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> > >  mm/hmm.c | 4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/mm/hmm.c b/mm/hmm.c
> > > index 2695925c0c5927..46872306f922bb 100644
> > > +++ b/mm/hmm.c
> > > @@ -256,11 +256,11 @@ static const struct mmu_notifier_ops hmm_mmu_notifier_ops = {
> > >   *
> > >   * To start mirroring a process address space, the device driver must register
> > >   * an HMM mirror struct.
> > > - *
> > > - * THE mm->mmap_sem MUST BE HELD IN WRITE MODE !
> > >   */
> > >  int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm)
> > >  {
> > > +       lockdep_assert_held_exclusive(mm->mmap_sem);
> > > +
> >
> > Gentle query, does the same required in hmm_mirror_unregister() ?
>
> No.. The unregistration path does its actual work in the srcu
> callback, which is in a different context than this function. So any
> locking held by the caller of unregister will not apply.
>
> The hmm_range_free SRCU callback obtains the write side of mmap_sem to
> protect the same data that the write side above in register is
> touching, mostly &mm->hmm.

Looking into https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/tree/?h=hmm,
I am unable to trace hmm_range_free(). Am I looking at the correct tree?



* Re: [RFC PATCH 08/11] mm/hmm: Use lockdep instead of comments
  2019-06-07 21:02       ` Souptick Joarder
@ 2019-06-08  1:15         ` Jason Gunthorpe
  0 siblings, 0 replies; 45+ messages in thread
From: Jason Gunthorpe @ 2019-06-08  1:15 UTC (permalink / raw)
  To: Souptick Joarder
  Cc: linux-rdma, Linux-MM, Jerome Glisse, Ralph Campbell, John Hubbard

On Sat, Jun 08, 2019 at 02:32:23AM +0530, Souptick Joarder wrote:
> On Sat, Jun 8, 2019 at 1:09 AM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Sat, Jun 08, 2019 at 01:03:48AM +0530, Souptick Joarder wrote:
> > > On Thu, May 23, 2019 at 9:05 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > >
> > > > From: Jason Gunthorpe <jgg@mellanox.com>
> > > >
> > > > So we can check locking at runtime.
> > > >
> > > > Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> > > >  mm/hmm.c | 4 ++--
> > > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/mm/hmm.c b/mm/hmm.c
> > > > index 2695925c0c5927..46872306f922bb 100644
> > > > +++ b/mm/hmm.c
> > > > @@ -256,11 +256,11 @@ static const struct mmu_notifier_ops hmm_mmu_notifier_ops = {
> > > >   *
> > > >   * To start mirroring a process address space, the device driver must register
> > > >   * an HMM mirror struct.
> > > > - *
> > > > - * THE mm->mmap_sem MUST BE HELD IN WRITE MODE !
> > > >   */
> > > >  int hmm_mirror_register(struct hmm_mirror *mirror, struct mm_struct *mm)
> > > >  {
> > > > +       lockdep_assert_held_exclusive(mm->mmap_sem);
> > > > +
> > >
> > > Gentle query, does the same required in hmm_mirror_unregister() ?
> >
> > No.. The unregistration path does its actual work in the srcu
> > callback, which is in a different context than this function. So any
> > locking held by the caller of unregister will not apply.
> >
> > The hmm_range_free SRCU callback obtains the write side of mmap_sem to
> > protect the same data that the write side above in register is
> > touching, mostly &mm->hmm.
> 
> Looking into https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/tree/?h=hmm,
> I am unable to trace hmm_range_free(). Am I looking at the correct tree?

The cover letter for the v2 posting has a note about the git tree for
this series:

https://github.com/jgunthorpe/linux/tree/hmm

The above rdma.git is only for already applied patches on their way to
Linus. This series is still in review.

Thanks,
Jason




Thread overview: 45+ messages
2019-05-23 15:34 [RFC PATCH 00/11] mm/hmm: Various revisions from a locking/code review Jason Gunthorpe
2019-05-23 15:34 ` [RFC PATCH 01/11] mm/hmm: Fix use after free with struct hmm in the mmu notifiers Jason Gunthorpe
2019-06-06 23:54   ` Ira Weiny
2019-06-07 14:17     ` Jason Gunthorpe
2019-05-23 15:34 ` [RFC PATCH 02/11] mm/hmm: Use hmm_mirror not mm as an argument for hmm_register_range Jason Gunthorpe
2019-05-23 18:22   ` Christoph Hellwig
2019-05-23 15:34 ` [RFC PATCH 03/11] mm/hmm: Hold a mmgrab from hmm to mm Jason Gunthorpe
2019-05-23 15:34 ` [RFC PATCH 04/11] mm/hmm: Simplify hmm_get_or_create and make it reliable Jason Gunthorpe
2019-05-23 23:38   ` Ralph Campbell
2019-05-24  1:23     ` Jason Gunthorpe
2019-05-24 17:06       ` Ralph Campbell
2019-05-23 15:34 ` [RFC PATCH 05/11] mm/hmm: Improve locking around hmm->dead Jason Gunthorpe
2019-05-24 13:40   ` Jason Gunthorpe
2019-05-23 15:34 ` [RFC PATCH 06/11] mm/hmm: Remove duplicate condition test before wait_event_timeout Jason Gunthorpe
2019-05-23 15:34 ` [RFC PATCH 07/11] mm/hmm: Delete hmm_mirror_mm_is_alive() Jason Gunthorpe
2019-05-23 15:34 ` [RFC PATCH 08/11] mm/hmm: Use lockdep instead of comments Jason Gunthorpe
2019-06-07 19:33   ` Souptick Joarder
2019-06-07 19:39     ` Jason Gunthorpe
2019-06-07 21:02       ` Souptick Joarder
2019-06-08  1:15         ` Jason Gunthorpe
2019-05-23 15:34 ` [RFC PATCH 09/11] mm/hmm: Remove racy protection against double-unregistration Jason Gunthorpe
2019-06-07 19:38   ` Souptick Joarder
2019-06-07 19:37     ` Jason Gunthorpe
2019-06-07 19:55       ` Souptick Joarder
2019-05-23 15:34 ` [RFC PATCH 10/11] mm/hmm: Poison hmm_range during unregister Jason Gunthorpe
2019-06-07 20:13   ` Souptick Joarder
2019-06-07 20:18     ` Jason Gunthorpe
2019-05-23 15:34 ` [RFC PATCH 11/11] mm/hmm: Do not use list*_rcu() for hmm->ranges Jason Gunthorpe
2019-06-07 20:22   ` Souptick Joarder
2019-05-23 19:04 ` [RFC PATCH 00/11] mm/hmm: Various revisions from a locking/code review John Hubbard
2019-05-23 19:37   ` Jason Gunthorpe
2019-05-23 20:59   ` Jerome Glisse
2019-05-24 13:35 ` Jason Gunthorpe
2019-05-24 14:36 ` Jason Gunthorpe
2019-05-24 16:49   ` Jerome Glisse
2019-05-24 16:59     ` Jason Gunthorpe
2019-05-24 17:01       ` Jerome Glisse
2019-05-24 17:52         ` Jason Gunthorpe
2019-05-24 18:03           ` Jerome Glisse
2019-05-24 18:32             ` Jason Gunthorpe
2019-05-24 18:46               ` Jerome Glisse
2019-05-24 22:09                 ` Jason Gunthorpe
2019-05-27 19:58                   ` Jason Gunthorpe
2019-05-24 17:47     ` Ralph Campbell
2019-05-24 17:51       ` Jerome Glisse
