linux-rdma.vger.kernel.org archive mirror
* [PATCH -rc 0/6] Bug fixes for odp
@ 2019-10-01 15:38 Jason Gunthorpe
  2019-10-01 15:38 ` [PATCH 1/6] RDMA/mlx5: Do not allow rereg of a ODP MR Jason Gunthorpe
                   ` (6 more replies)
  0 siblings, 7 replies; 14+ messages in thread
From: Jason Gunthorpe @ 2019-10-01 15:38 UTC (permalink / raw)
  To: linux-rdma; +Cc: Jason Gunthorpe

From: Jason Gunthorpe <jgg@mellanox.com>

Assorted bug fixes for the ODP feature, closing races and other bad
locking problems we have been seeing in the field.

Jason Gunthorpe (6):
  RDMA/mlx5: Do not allow rereg of a ODP MR
  RDMA/mlx5: Fix a race with mlx5_ib_update_xlt on an implicit MR
  RDMA/odp: Lift umem_mutex out of ib_umem_odp_unmap_dma_pages()
  RDMA/mlx5: Order num_pending_prefetch properly with synchronize_srcu
  RDMA/mlx5: Put live in the correct place for ODP MRs
  RDMA/mlx5: Add missing synchronize_srcu() for MW cases

 drivers/infiniband/core/umem_odp.c           |  6 +-
 drivers/infiniband/hw/mlx5/devx.c            | 58 +++++------------
 drivers/infiniband/hw/mlx5/mlx5_ib.h         |  3 +-
 drivers/infiniband/hw/mlx5/mr.c              | 68 ++++++++------------
 drivers/infiniband/hw/mlx5/odp.c             | 58 +++++++++++++----
 drivers/net/ethernet/mellanox/mlx5/core/mr.c |  8 +--
 6 files changed, 96 insertions(+), 105 deletions(-)

-- 
2.23.0


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/6] RDMA/mlx5: Do not allow rereg of a ODP MR
  2019-10-01 15:38 [PATCH -rc 0/6] Bug fixes for odp Jason Gunthorpe
@ 2019-10-01 15:38 ` Jason Gunthorpe
  2019-10-01 15:38 ` [PATCH 2/6] RDMA/mlx5: Fix a race with mlx5_ib_update_xlt on an implicit MR Jason Gunthorpe
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: Jason Gunthorpe @ 2019-10-01 15:38 UTC (permalink / raw)
  To: linux-rdma; +Cc: Jason Gunthorpe, Artemy Kovalyov

From: Jason Gunthorpe <jgg@mellanox.com>

This code is completely broken: the umem of an ODP MR simply cannot be
discarded without a lot more locking, nor can an ODP mkey be blithely
destroyed via destroy_mkey().
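
The fix is to simply refuse the operation; as a minimal sketch (the real
change is in the diff below), the rereg entry point now bails out early
for ODP:

    /* ODP MRs cannot be re-registered; their umem and mkey need much
     * more careful teardown than the rereg path provides. */
    if (is_odp_mr(mr))
            return -EOPNOTSUPP;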

Fixes: 6aec21f6a832 ("IB/mlx5: Page faults handling infrastructure")
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 drivers/infiniband/hw/mlx5/mr.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 1eff031ef04842..e7f840f306e46a 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -1441,6 +1441,9 @@ int mlx5_ib_rereg_user_mr(struct ib_mr *ib_mr, int flags, u64 start,
 	if (!mr->umem)
 		return -EINVAL;
 
+	if (is_odp_mr(mr))
+		return -EOPNOTSUPP;
+
 	if (flags & IB_MR_REREG_TRANS) {
 		addr = virt_addr;
 		len = length;
@@ -1486,8 +1489,6 @@ int mlx5_ib_rereg_user_mr(struct ib_mr *ib_mr, int flags, u64 start,
 		}
 
 		mr->allocated_from_cache = 0;
-		if (IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING))
-			mr->live = 1;
 	} else {
 		/*
 		 * Send a UMR WQE
@@ -1516,7 +1517,6 @@ int mlx5_ib_rereg_user_mr(struct ib_mr *ib_mr, int flags, u64 start,
 
 	set_mr_fields(dev, mr, npages, len, access_flags);
 
-	update_odp_mr(mr);
 	return 0;
 
 err:
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 2/6] RDMA/mlx5: Fix a race with mlx5_ib_update_xlt on an implicit MR
  2019-10-01 15:38 [PATCH -rc 0/6] Bug fixes for odp Jason Gunthorpe
  2019-10-01 15:38 ` [PATCH 1/6] RDMA/mlx5: Do not allow rereg of a ODP MR Jason Gunthorpe
@ 2019-10-01 15:38 ` Jason Gunthorpe
  2019-10-02  8:18   ` Leon Romanovsky
  2019-10-01 15:38 ` [PATCH 3/6] RDMA/odp: Lift umem_mutex out of ib_umem_odp_unmap_dma_pages() Jason Gunthorpe
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 14+ messages in thread
From: Jason Gunthorpe @ 2019-10-01 15:38 UTC (permalink / raw)
  To: linux-rdma; +Cc: Jason Gunthorpe, Artemy Kovalyov

From: Jason Gunthorpe <jgg@mellanox.com>

mlx5_ib_update_xlt() must be protected against parallel free of the MR it
is accessing, and it must also be called single-threaded while updating the
HW. Otherwise we can have races of the form:

    CPU0                               CPU1
  mlx5_ib_update_xlt()
   mlx5_odp_populate_klm()
     odp_lookup() == NULL
     pklm = ZAP
                                      implicit_mr_get_data()
 				        implicit_mr_alloc()
 					  <update interval tree>
					mlx5_ib_update_xlt
					  mlx5_odp_populate_klm()
					    odp_lookup() != NULL
					    pklm = VALID
					   mlx5_ib_post_send_wait()

    mlx5_ib_post_send_wait() // Replaces VALID with ZAP

This can be solved by putting both the SRCU and the umem_mutex lock around
every call to mlx5_ib_update_xlt(). This ensures that the content of the
interval tree relevant to mlx5_odp_populate_klm() (i.e. mr->parent == mr)
will not change while it is running, and thus the posted WRs to update the
KLM will always reflect the correct information.

The race above then resolves either by CPU1 waiting until CPU0 completes
the ZAP, or by CPU0 running after the add and storing VALID instead.

The pagefault path adding children already holds the umem_mutex and SRCU,
so the only missed lock is during MR destruction.
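
As a minimal sketch of the destroy-side pattern described above
(illustrative only; the real change is in mr_leaf_free_action() in the
diff below), the XLT update on the parent is now wrapped like this:

    /* Hold SRCU and the parent's umem_mutex so the children list seen
     * by mlx5_odp_populate_klm() cannot change under the update. */
    srcu_key = srcu_read_lock(&mr->dev->mr_srcu);
    mutex_lock(&odp_imr->umem_mutex);
    mlx5_ib_update_xlt(imr, idx, 1, 0,
                       MLX5_IB_UPD_XLT_INDIRECT | MLX5_IB_UPD_XLT_ATOMIC);
    mutex_unlock(&odp_imr->umem_mutex);
    srcu_read_unlock(&mr->dev->mr_srcu, srcu_key);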

Fixes: 81713d3788d2 ("IB/mlx5: Add implicit MR support")
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 drivers/infiniband/hw/mlx5/odp.c | 34 ++++++++++++++++++++++++++++++--
 1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index 2e9b4306179745..3401c06b7e54f5 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -178,6 +178,29 @@ void mlx5_odp_populate_klm(struct mlx5_klm *pklm, size_t offset,
 		return;
 	}
 
+	/*
+	 * The locking here is pretty subtle. Ideally the implicit children
+	 * list would be protected by the umem_mutex, however that is not
+	 * possible. Instead this uses a weaker update-then-lock pattern:
+	 *
+	 *  srcu_read_lock()
+	 *    <change children list>
+	 *    mutex_lock(umem_mutex)
+	 *     mlx5_ib_update_xlt()
+	 *    mutex_unlock(umem_mutex)
+	 *    destroy lkey
+	 *
+	 * ie any change the children list must be followed by the locked
+	 * update_xlt before destroying.
+	 *
+	 * The umem_mutex provides the acquire/release semantic needed to make
+	 * the children list visible to a racing thread. While SRCU is not
+	 * technically required, using it gives consistent use of the SRCU
+	 * locking around the children list.
+	 */
+	lockdep_assert_held(&to_ib_umem_odp(mr->umem)->umem_mutex);
+	lockdep_assert_held(&mr->dev->mr_srcu);
+
 	odp = odp_lookup(offset * MLX5_IMR_MTT_SIZE,
 			 nentries * MLX5_IMR_MTT_SIZE, mr);
 
@@ -202,15 +225,22 @@ static void mr_leaf_free_action(struct work_struct *work)
 	struct ib_umem_odp *odp = container_of(work, struct ib_umem_odp, work);
 	int idx = ib_umem_start(odp) >> MLX5_IMR_MTT_SHIFT;
 	struct mlx5_ib_mr *mr = odp->private, *imr = mr->parent;
+	struct ib_umem_odp *odp_imr = to_ib_umem_odp(imr->umem);
+	int srcu_key;
 
 	mr->parent = NULL;
 	synchronize_srcu(&mr->dev->mr_srcu);
 
-	ib_umem_odp_release(odp);
-	if (imr->live)
+	if (imr->live) {
+		srcu_key = srcu_read_lock(&mr->dev->mr_srcu);
+		mutex_lock(&odp_imr->umem_mutex);
 		mlx5_ib_update_xlt(imr, idx, 1, 0,
 				   MLX5_IB_UPD_XLT_INDIRECT |
 				   MLX5_IB_UPD_XLT_ATOMIC);
+		mutex_unlock(&odp_imr->umem_mutex);
+		srcu_read_unlock(&mr->dev->mr_srcu, srcu_key);
+	}
+	ib_umem_odp_release(odp);
 	mlx5_mr_cache_free(mr->dev, mr);
 
 	if (atomic_dec_and_test(&imr->num_leaf_free))
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 3/6] RDMA/odp: Lift umem_mutex out of ib_umem_odp_unmap_dma_pages()
  2019-10-01 15:38 [PATCH -rc 0/6] Bug fixes for odp Jason Gunthorpe
  2019-10-01 15:38 ` [PATCH 1/6] RDMA/mlx5: Do not allow rereg of a ODP MR Jason Gunthorpe
  2019-10-01 15:38 ` [PATCH 2/6] RDMA/mlx5: Fix a race with mlx5_ib_update_xlt on an implicit MR Jason Gunthorpe
@ 2019-10-01 15:38 ` Jason Gunthorpe
  2019-10-01 15:38 ` [PATCH 4/6] RDMA/mlx5: Order num_pending_prefetch properly with synchronize_srcu Jason Gunthorpe
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: Jason Gunthorpe @ 2019-10-01 15:38 UTC (permalink / raw)
  To: linux-rdma; +Cc: Jason Gunthorpe, Artemy Kovalyov

From: Jason Gunthorpe <jgg@mellanox.com>

This fixes a race of the form:
    CPU0                               CPU1
mlx5_ib_invalidate_range()     mlx5_ib_invalidate_range()
				 // This one actually makes npages == 0
				 ib_umem_odp_unmap_dma_pages()
				 if (npages == 0 && !dying)
  // This one does nothing
  ib_umem_odp_unmap_dma_pages()
  if (npages == 0 && !dying)
     dying = 1;
                                    dying = 1;
				    schedule_work(&umem_odp->work);
     // Double schedule of the same work
     schedule_work(&umem_odp->work);  // BOOM

npages and dying must be read and written under the umem_mutex lock.

Since mlx5 must also call mlx5_ib_update_xlt() whenever
ib_umem_odp_unmap_dma_pages() is called, and both need to be done in the
same locking region, hoist the lock out of the unmap function.

This avoids an expensive double critical section in
mlx5_ib_invalidate_range().
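
As a minimal sketch of the resulting caller pattern (illustrative only;
see mlx5_ib_invalidate_range() and mlx5_ib_free_implicit_mr() in the diff
below), the unmap and the npages/dying test now sit in one critical
section:

    mutex_lock(&umem_odp->umem_mutex);
    ib_umem_odp_unmap_dma_pages(umem_odp, ib_umem_start(umem_odp),
                                ib_umem_end(umem_odp));
    if (!umem_odp->npages && mr->parent && !umem_odp->dying) {
            /* Only one invalidation can win and schedule the work. */
            umem_odp->dying = 1;
            atomic_inc(&mr->parent->num_leaf_free);
            schedule_work(&umem_odp->work);
    }
    mutex_unlock(&umem_odp->umem_mutex);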

Fixes: 81713d3788d2 ("IB/mlx5: Add implicit MR support")
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 drivers/infiniband/core/umem_odp.c |  6 ++++--
 drivers/infiniband/hw/mlx5/odp.c   | 12 ++++++++----
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index f67a30fda1ed9a..163ff7ba92b7f1 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -451,8 +451,10 @@ void ib_umem_odp_release(struct ib_umem_odp *umem_odp)
 	 * that the hardware will not attempt to access the MR any more.
 	 */
 	if (!umem_odp->is_implicit_odp) {
+		mutex_lock(&umem_odp->umem_mutex);
 		ib_umem_odp_unmap_dma_pages(umem_odp, ib_umem_start(umem_odp),
 					    ib_umem_end(umem_odp));
+		mutex_unlock(&umem_odp->umem_mutex);
 		kvfree(umem_odp->dma_list);
 		kvfree(umem_odp->page_list);
 	}
@@ -719,6 +721,8 @@ void ib_umem_odp_unmap_dma_pages(struct ib_umem_odp *umem_odp, u64 virt,
 	u64 addr;
 	struct ib_device *dev = umem_odp->umem.ibdev;
 
+	lockdep_assert_held(&umem_odp->umem_mutex);
+
 	virt = max_t(u64, virt, ib_umem_start(umem_odp));
 	bound = min_t(u64, bound, ib_umem_end(umem_odp));
 	/* Note that during the run of this function, the
@@ -726,7 +730,6 @@ void ib_umem_odp_unmap_dma_pages(struct ib_umem_odp *umem_odp, u64 virt,
 	 * faults from completion. We might be racing with other
 	 * invalidations, so we must make sure we free each page only
 	 * once. */
-	mutex_lock(&umem_odp->umem_mutex);
 	for (addr = virt; addr < bound; addr += BIT(umem_odp->page_shift)) {
 		idx = (addr - ib_umem_start(umem_odp)) >> umem_odp->page_shift;
 		if (umem_odp->page_list[idx]) {
@@ -757,7 +760,6 @@ void ib_umem_odp_unmap_dma_pages(struct ib_umem_odp *umem_odp, u64 virt,
 			umem_odp->npages--;
 		}
 	}
-	mutex_unlock(&umem_odp->umem_mutex);
 }
 EXPORT_SYMBOL(ib_umem_odp_unmap_dma_pages);
 
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index 3401c06b7e54f5..1930d78c3091cb 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -308,7 +308,6 @@ void mlx5_ib_invalidate_range(struct ib_umem_odp *umem_odp, unsigned long start,
 				   idx - blk_start_idx + 1, 0,
 				   MLX5_IB_UPD_XLT_ZAP |
 				   MLX5_IB_UPD_XLT_ATOMIC);
-	mutex_unlock(&umem_odp->umem_mutex);
 	/*
 	 * We are now sure that the device will not access the
 	 * memory. We can safely unmap it, and mark it as dirty if
@@ -319,10 +318,11 @@ void mlx5_ib_invalidate_range(struct ib_umem_odp *umem_odp, unsigned long start,
 
 	if (unlikely(!umem_odp->npages && mr->parent &&
 		     !umem_odp->dying)) {
-		WRITE_ONCE(umem_odp->dying, 1);
+		umem_odp->dying = 1;
 		atomic_inc(&mr->parent->num_leaf_free);
 		schedule_work(&umem_odp->work);
 	}
+	mutex_unlock(&umem_odp->umem_mutex);
 }
 
 void mlx5_ib_internal_fill_odp_caps(struct mlx5_ib_dev *dev)
@@ -585,15 +585,19 @@ void mlx5_ib_free_implicit_mr(struct mlx5_ib_mr *imr)
 		if (mr->parent != imr)
 			continue;
 
+		mutex_lock(&umem_odp->umem_mutex);
 		ib_umem_odp_unmap_dma_pages(umem_odp, ib_umem_start(umem_odp),
 					    ib_umem_end(umem_odp));
 
-		if (umem_odp->dying)
+		if (umem_odp->dying) {
+			mutex_unlock(&umem_odp->umem_mutex);
 			continue;
+		}
 
-		WRITE_ONCE(umem_odp->dying, 1);
+		umem_odp->dying = 1;
 		atomic_inc(&imr->num_leaf_free);
 		schedule_work(&umem_odp->work);
+		mutex_unlock(&umem_odp->umem_mutex);
 	}
 	up_read(&per_mm->umem_rwsem);
 
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 4/6] RDMA/mlx5: Order num_pending_prefetch properly with synchronize_srcu
  2019-10-01 15:38 [PATCH -rc 0/6] Bug fixes for odp Jason Gunthorpe
                   ` (2 preceding siblings ...)
  2019-10-01 15:38 ` [PATCH 3/6] RDMA/odp: Lift umem_mutex out of ib_umem_odp_unmap_dma_pages() Jason Gunthorpe
@ 2019-10-01 15:38 ` Jason Gunthorpe
  2019-10-01 15:38 ` [PATCH 5/6] RDMA/mlx5: Put live in the correct place for ODP MRs Jason Gunthorpe
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 14+ messages in thread
From: Jason Gunthorpe @ 2019-10-01 15:38 UTC (permalink / raw)
  To: linux-rdma; +Cc: Jason Gunthorpe, Artemy Kovalyov

From: Jason Gunthorpe <jgg@mellanox.com>

During destroy, setting live = 0 and then calling synchronize_srcu()
prevents num_pending_prefetch from incrementing, and also ensures that all
work holding that count has been queued on the WQ. Testing the counter
before the synchronize_srcu() causes races of the form:

    CPU0                                         CPU1
  dereg_mr()
                                          mlx5_ib_advise_mr_prefetch()
            				   srcu_read_lock()
                                            num_pending_prefetch_inc()
					      if (!live)
   live = 0
   atomic_read() == 0
     // skip flush_workqueue()
                                              atomic_inc()
 					      queue_work();
            				   srcu_read_unlock()
   WARN_ON(atomic_read())  // Fails

Swap the order so that the synchronize_srcu() prevents this.
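
In outline, the corrected ordering in dereg_mr() is (a sketch of the
relevant fragment only; see the diff below):

    /* Prevent new page faults and prefetch requests from succeeding. */
    mr->live = 0;

    /* Wait for all running page-fault handlers to finish; after this no
     * new num_pending_prefetch holders can appear. */
    synchronize_srcu(&dev->mr_srcu);

    /* Now it is safe to drain whatever prefetch work is already queued. */
    if (atomic_read(&mr->num_pending_prefetch))
            flush_workqueue(system_unbound_wq);
    WARN_ON(atomic_read(&mr->num_pending_prefetch));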

Fixes: a6bc3875f176 ("IB/mlx5: Protect against prefetch of invalid MR")
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 drivers/infiniband/hw/mlx5/mr.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index e7f840f306e46a..0ee8fa01177fc9 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -1609,13 +1609,14 @@ static void dereg_mr(struct mlx5_ib_dev *dev, struct mlx5_ib_mr *mr)
 		 */
 		mr->live = 0;
 
+		/* Wait for all running page-fault handlers to finish. */
+		synchronize_srcu(&dev->mr_srcu);
+
 		/* dequeue pending prefetch requests for the mr */
 		if (atomic_read(&mr->num_pending_prefetch))
 			flush_workqueue(system_unbound_wq);
 		WARN_ON(atomic_read(&mr->num_pending_prefetch));
 
-		/* Wait for all running page-fault handlers to finish. */
-		synchronize_srcu(&dev->mr_srcu);
 		/* Destroy all page mappings */
 		if (!umem_odp->is_implicit_odp)
 			mlx5_ib_invalidate_range(umem_odp,
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 5/6] RDMA/mlx5: Put live in the correct place for ODP MRs
  2019-10-01 15:38 [PATCH -rc 0/6] Bug fixes for odp Jason Gunthorpe
                   ` (3 preceding siblings ...)
  2019-10-01 15:38 ` [PATCH 4/6] RDMA/mlx5: Order num_pending_prefetch properly with synchronize_srcu Jason Gunthorpe
@ 2019-10-01 15:38 ` Jason Gunthorpe
  2019-10-01 15:38 ` [PATCH 6/6] RDMA/mlx5: Add missing synchronize_srcu() for MW cases Jason Gunthorpe
  2019-10-04 18:55 ` [PATCH -rc 0/6] Bug fixes for odp Jason Gunthorpe
  6 siblings, 0 replies; 14+ messages in thread
From: Jason Gunthorpe @ 2019-10-01 15:38 UTC (permalink / raw)
  To: linux-rdma; +Cc: Jason Gunthorpe, Artemy Kovalyov

From: Jason Gunthorpe <jgg@mellanox.com>

live is used to signal to the pagefault thread that the MR is initialized
and ready for use. It should be set only after the umem is assigned and all
other setup is completed. This prevents races (at least) of the form:

    CPU0                                     CPU1
mlx5_ib_alloc_implicit_mr()
 implicit_mr_alloc()
  live = 1
 imr->umem = umem
                                    num_pending_prefetch_inc()
                                      if (live)
				        atomic_inc(num_pending_prefetch)
 atomic_set(num_pending_prefetch,0) // Overwrites other thread's store

Further, live is being used with SRCU as the 'update' in an
acquire/release fashion, so it cannot be read and written raw.

Move all live = 1 assignments to after MR initialization is completed and
use smp_store_release()/smp_load_acquire() for manipulating it.

Add a missing live = 0 when an implicit MR child is deleted, before
queuing work to do synchronize_srcu().

The barriers in update_odp_mr() were a broken attempt to create
acquire/release semantics, were not applied consistently, and missed the
point; delete the function as well.
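
Conceptually the pairing looks like this (a sketch assembled from the
registration and pagefault paths changed below):

    /* Registration: publish 'live' only after all other setup is done. */
    to_ib_umem_odp(mr->umem)->private = mr;
    atomic_set(&mr->num_pending_prefetch, 0);
    smp_store_release(&mr->live, 1);

    /* Pagefault/prefetch: the load-acquire pairs with the store-release
     * above, so a 'live' MR is guaranteed to be fully initialized. */
    if (!smp_load_acquire(&mr->live) || !mr->ibmr.pd) {
            mlx5_ib_dbg(dev, "got dead MR\n");
            return -EFAULT;
    }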

Fixes: 6aec21f6a832 ("IB/mlx5: Page faults handling infrastructure")
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 drivers/infiniband/hw/mlx5/mlx5_ib.h |  2 +-
 drivers/infiniband/hw/mlx5/mr.c      | 36 ++++------------------------
 drivers/infiniband/hw/mlx5/odp.c     | 14 ++++++-----
 3 files changed, 14 insertions(+), 38 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 2ceaef3ea3fb92..15e42825cc976e 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -606,7 +606,7 @@ struct mlx5_ib_mr {
 	struct mlx5_ib_dev     *dev;
 	u32 out[MLX5_ST_SZ_DW(create_mkey_out)];
 	struct mlx5_core_sig_ctx    *sig;
-	int			live;
+	unsigned int		live;
 	void			*descs_alloc;
 	int			access_flags; /* Needed for rereg MR */
 
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 0ee8fa01177fc9..3a27bddfcf31f5 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -84,32 +84,6 @@ static bool use_umr_mtt_update(struct mlx5_ib_mr *mr, u64 start, u64 length)
 		length + (start & (MLX5_ADAPTER_PAGE_SIZE - 1));
 }
 
-static void update_odp_mr(struct mlx5_ib_mr *mr)
-{
-	if (is_odp_mr(mr)) {
-		/*
-		 * This barrier prevents the compiler from moving the
-		 * setting of umem->odp_data->private to point to our
-		 * MR, before reg_umr finished, to ensure that the MR
-		 * initialization have finished before starting to
-		 * handle invalidations.
-		 */
-		smp_wmb();
-		to_ib_umem_odp(mr->umem)->private = mr;
-		/*
-		 * Make sure we will see the new
-		 * umem->odp_data->private value in the invalidation
-		 * routines, before we can get page faults on the
-		 * MR. Page faults can happen once we put the MR in
-		 * the tree, below this line. Without the barrier,
-		 * there can be a fault handling and an invalidation
-		 * before umem->odp_data->private == mr is visible to
-		 * the invalidation handler.
-		 */
-		smp_wmb();
-	}
-}
-
 static void reg_mr_callback(int status, struct mlx5_async_work *context)
 {
 	struct mlx5_ib_mr *mr =
@@ -1346,8 +1320,6 @@ struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 	mr->umem = umem;
 	set_mr_fields(dev, mr, npages, length, access_flags);
 
-	update_odp_mr(mr);
-
 	if (use_umr) {
 		int update_xlt_flags = MLX5_IB_UPD_XLT_ENABLE;
 
@@ -1363,10 +1335,12 @@ struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 start, u64 length,
 		}
 	}
 
-	if (IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)) {
-		mr->live = 1;
+	if (is_odp_mr(mr)) {
+		to_ib_umem_odp(mr->umem)->private = mr;
 		atomic_set(&mr->num_pending_prefetch, 0);
 	}
+	if (IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING))
+		smp_store_release(&mr->live, 1);
 
 	return &mr->ibmr;
 error:
@@ -1607,7 +1581,7 @@ static void dereg_mr(struct mlx5_ib_dev *dev, struct mlx5_ib_mr *mr)
 		/* Prevent new page faults and
 		 * prefetch requests from succeeding
 		 */
-		mr->live = 0;
+		WRITE_ONCE(mr->live, 0);
 
 		/* Wait for all running page-fault handlers to finish. */
 		synchronize_srcu(&dev->mr_srcu);
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index 1930d78c3091cb..3f9478d1937668 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -231,7 +231,7 @@ static void mr_leaf_free_action(struct work_struct *work)
 	mr->parent = NULL;
 	synchronize_srcu(&mr->dev->mr_srcu);
 
-	if (imr->live) {
+	if (smp_load_acquire(&imr->live)) {
 		srcu_key = srcu_read_lock(&mr->dev->mr_srcu);
 		mutex_lock(&odp_imr->umem_mutex);
 		mlx5_ib_update_xlt(imr, idx, 1, 0,
@@ -318,6 +318,7 @@ void mlx5_ib_invalidate_range(struct ib_umem_odp *umem_odp, unsigned long start,
 
 	if (unlikely(!umem_odp->npages && mr->parent &&
 		     !umem_odp->dying)) {
+		WRITE_ONCE(mr->live, 0);
 		umem_odp->dying = 1;
 		atomic_inc(&mr->parent->num_leaf_free);
 		schedule_work(&umem_odp->work);
@@ -459,8 +460,6 @@ static struct mlx5_ib_mr *implicit_mr_alloc(struct ib_pd *pd,
 	mr->ibmr.lkey = mr->mmkey.key;
 	mr->ibmr.rkey = mr->mmkey.key;
 
-	mr->live = 1;
-
 	mlx5_ib_dbg(dev, "key %x dev %p mr %p\n",
 		    mr->mmkey.key, dev->mdev, mr);
 
@@ -514,6 +513,8 @@ static struct ib_umem_odp *implicit_mr_get_data(struct mlx5_ib_mr *mr,
 		mtt->parent = mr;
 		INIT_WORK(&odp->work, mr_leaf_free_action);
 
+		smp_store_release(&mtt->live, 1);
+
 		if (!nentries)
 			start_idx = addr >> MLX5_IMR_MTT_SHIFT;
 		nentries++;
@@ -566,6 +567,7 @@ struct mlx5_ib_mr *mlx5_ib_alloc_implicit_mr(struct mlx5_ib_pd *pd,
 	init_waitqueue_head(&imr->q_leaf_free);
 	atomic_set(&imr->num_leaf_free, 0);
 	atomic_set(&imr->num_pending_prefetch, 0);
+	smp_store_release(&imr->live, 1);
 
 	return imr;
 }
@@ -807,7 +809,7 @@ static int pagefault_single_data_segment(struct mlx5_ib_dev *dev,
 	switch (mmkey->type) {
 	case MLX5_MKEY_MR:
 		mr = container_of(mmkey, struct mlx5_ib_mr, mmkey);
-		if (!mr->live || !mr->ibmr.pd) {
+		if (!smp_load_acquire(&mr->live) || !mr->ibmr.pd) {
 			mlx5_ib_dbg(dev, "got dead MR\n");
 			ret = -EFAULT;
 			goto srcu_unlock;
@@ -1675,12 +1677,12 @@ static bool num_pending_prefetch_inc(struct ib_pd *pd,
 
 		mr = container_of(mmkey, struct mlx5_ib_mr, mmkey);
 
-		if (mr->ibmr.pd != pd) {
+		if (!smp_load_acquire(&mr->live)) {
 			ret = false;
 			break;
 		}
 
-		if (!mr->live) {
+		if (mr->ibmr.pd != pd) {
 			ret = false;
 			break;
 		}
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 6/6] RDMA/mlx5: Add missing synchronize_srcu() for MW cases
  2019-10-01 15:38 [PATCH -rc 0/6] Bug fixes for odp Jason Gunthorpe
                   ` (4 preceding siblings ...)
  2019-10-01 15:38 ` [PATCH 5/6] RDMA/mlx5: Put live in the correct place for ODP MRs Jason Gunthorpe
@ 2019-10-01 15:38 ` Jason Gunthorpe
  2019-10-03  8:54   ` Leon Romanovsky
  2019-10-04 18:55 ` [PATCH -rc 0/6] Bug fixes for odp Jason Gunthorpe
  6 siblings, 1 reply; 14+ messages in thread
From: Jason Gunthorpe @ 2019-10-01 15:38 UTC (permalink / raw)
  To: linux-rdma; +Cc: Jason Gunthorpe, Artemy Kovalyov

From: Jason Gunthorpe <jgg@mellanox.com>

While MR uses live as the SRCU 'update', the MW case uses the xarray
directly: xa_erase() causes the MW to become inaccessible to the pagefault
thread.

Thus whenever an MW is removed from the xarray we must synchronize_srcu()
before freeing it.

This must be done before freeing the mkey, as re-use of the mkey while the
pagefault thread is still using the stale mkey is undesirable.

Add the missing synchronizes to the MW and DEVX indirect mkey cases, and
delete the bogus protection against double destroy in
mlx5_core_destroy_mkey().
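
In outline, the MW destroy path after this change is (sketch of
mlx5_ib_dealloc_mw(); the CONFIG_INFINIBAND_ON_DEMAND_PAGING guard is
omitted here for brevity, the full version is in the diff below):

    /* Make the MW unreachable from the pagefault thread ... */
    xa_erase(&dev->mdev->priv.mkey_table, mlx5_base_mkey(mmw->mmkey.key));
    /* ... and wait out any reader already inside its SRCU section. */
    synchronize_srcu(&dev->mr_srcu);
    /* Only now can the HW mkey be destroyed and the memory freed. */
    err = mlx5_core_destroy_mkey(dev->mdev, &mmw->mmkey);
    if (err)
            return err;
    kfree(mmw);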

Fixes: 534fd7aac56a ("IB/mlx5: Manage indirection mkey upon DEVX flow for ODP")
Fixes: 6aec21f6a832 ("IB/mlx5: Page faults handling infrastructure")
Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
---
 drivers/infiniband/hw/mlx5/devx.c            | 58 ++++++--------------
 drivers/infiniband/hw/mlx5/mlx5_ib.h         |  1 -
 drivers/infiniband/hw/mlx5/mr.c              | 21 +++++--
 drivers/net/ethernet/mellanox/mlx5/core/mr.c |  8 +--
 4 files changed, 33 insertions(+), 55 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/devx.c b/drivers/infiniband/hw/mlx5/devx.c
index 59022b7441448f..d609f4659afb7a 100644
--- a/drivers/infiniband/hw/mlx5/devx.c
+++ b/drivers/infiniband/hw/mlx5/devx.c
@@ -1298,29 +1298,6 @@ static int devx_handle_mkey_create(struct mlx5_ib_dev *dev,
 	return 0;
 }
 
-static void devx_free_indirect_mkey(struct rcu_head *rcu)
-{
-	kfree(container_of(rcu, struct devx_obj, devx_mr.rcu));
-}
-
-/* This function to delete from the radix tree needs to be called before
- * destroying the underlying mkey. Otherwise a race might occur in case that
- * other thread will get the same mkey before this one will be deleted,
- * in that case it will fail via inserting to the tree its own data.
- *
- * Note:
- * An error in the destroy is not expected unless there is some other indirect
- * mkey which points to this one. In a kernel cleanup flow it will be just
- * destroyed in the iterative destruction call. In a user flow, in case
- * the application didn't close in the expected order it's its own problem,
- * the mkey won't be part of the tree, in both cases the kernel is safe.
- */
-static void devx_cleanup_mkey(struct devx_obj *obj)
-{
-	xa_erase(&obj->ib_dev->mdev->priv.mkey_table,
-		 mlx5_base_mkey(obj->devx_mr.mmkey.key));
-}
-
 static void devx_cleanup_subscription(struct mlx5_ib_dev *dev,
 				      struct devx_event_subscription *sub)
 {
@@ -1362,8 +1339,16 @@ static int devx_obj_cleanup(struct ib_uobject *uobject,
 	int ret;
 
 	dev = mlx5_udata_to_mdev(&attrs->driver_udata);
-	if (obj->flags & DEVX_OBJ_FLAGS_INDIRECT_MKEY)
-		devx_cleanup_mkey(obj);
+	if (obj->flags & DEVX_OBJ_FLAGS_INDIRECT_MKEY) {
+		/*
+		 * The pagefault_single_data_segment() does commands against
+		 * the mmkey, we must wait for that to stop before freeing the
+		 * mkey, as another allocation could get the same mkey #.
+		 */
+		xa_erase(&obj->ib_dev->mdev->priv.mkey_table,
+			 mlx5_base_mkey(obj->devx_mr.mmkey.key));
+		synchronize_srcu(&dev->mr_srcu);
+	}
 
 	if (obj->flags & DEVX_OBJ_FLAGS_DCT)
 		ret = mlx5_core_destroy_dct(obj->ib_dev->mdev, &obj->core_dct);
@@ -1382,12 +1367,6 @@ static int devx_obj_cleanup(struct ib_uobject *uobject,
 		devx_cleanup_subscription(dev, sub_entry);
 	mutex_unlock(&devx_event_table->event_xa_lock);
 
-	if (obj->flags & DEVX_OBJ_FLAGS_INDIRECT_MKEY) {
-		call_srcu(&dev->mr_srcu, &obj->devx_mr.rcu,
-			  devx_free_indirect_mkey);
-		return ret;
-	}
-
 	kfree(obj);
 	return ret;
 }
@@ -1491,26 +1470,21 @@ static int UVERBS_HANDLER(MLX5_IB_METHOD_DEVX_OBJ_CREATE)(
 				   &obj_id);
 	WARN_ON(obj->dinlen > MLX5_MAX_DESTROY_INBOX_SIZE_DW * sizeof(u32));
 
-	if (obj->flags & DEVX_OBJ_FLAGS_INDIRECT_MKEY) {
-		err = devx_handle_mkey_indirect(obj, dev, cmd_in, cmd_out);
-		if (err)
-			goto obj_destroy;
-	}
-
 	err = uverbs_copy_to(attrs, MLX5_IB_ATTR_DEVX_OBJ_CREATE_CMD_OUT, cmd_out, cmd_out_len);
 	if (err)
-		goto err_copy;
+		goto obj_destroy;
 
 	if (opcode == MLX5_CMD_OP_CREATE_GENERAL_OBJECT)
 		obj_type = MLX5_GET(general_obj_in_cmd_hdr, cmd_in, obj_type);
-
 	obj->obj_id = get_enc_obj_id(opcode | obj_type << 16, obj_id);
 
+	if (obj->flags & DEVX_OBJ_FLAGS_INDIRECT_MKEY) {
+		err = devx_handle_mkey_indirect(obj, dev, cmd_in, cmd_out);
+		if (err)
+			goto obj_destroy;
+	}
 	return 0;
 
-err_copy:
-	if (obj->flags & DEVX_OBJ_FLAGS_INDIRECT_MKEY)
-		devx_cleanup_mkey(obj);
 obj_destroy:
 	if (obj->flags & DEVX_OBJ_FLAGS_DCT)
 		mlx5_core_destroy_dct(obj->ib_dev->mdev, &obj->core_dct);
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 15e42825cc976e..1a98ee2e01c4b9 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -639,7 +639,6 @@ struct mlx5_ib_mw {
 struct mlx5_ib_devx_mr {
 	struct mlx5_core_mkey	mmkey;
 	int			ndescs;
-	struct rcu_head		rcu;
 };
 
 struct mlx5_ib_umr_context {
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 3a27bddfcf31f5..630599311586ec 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -1962,14 +1962,25 @@ struct ib_mw *mlx5_ib_alloc_mw(struct ib_pd *pd, enum ib_mw_type type,
 
 int mlx5_ib_dealloc_mw(struct ib_mw *mw)
 {
+	struct mlx5_ib_dev *dev = to_mdev(mw->device);
 	struct mlx5_ib_mw *mmw = to_mmw(mw);
 	int err;
 
-	err =  mlx5_core_destroy_mkey((to_mdev(mw->device))->mdev,
-				      &mmw->mmkey);
-	if (!err)
-		kfree(mmw);
-	return err;
+	if (IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)) {
+		xa_erase(&dev->mdev->priv.mkey_table,
+			 mlx5_base_mkey(mmw->mmkey.key));
+		/*
+		 * pagefault_single_data_segment() may be accessing mmw under
+		 * SRCU if the user bound an ODP MR to this MW.
+		 */
+		synchronize_srcu(&dev->mr_srcu);
+	}
+
+	err = mlx5_core_destroy_mkey(dev->mdev, &mmw->mmkey);
+	if (err)
+		return err;
+	kfree(mmw);
+	return 0;
 }
 
 int mlx5_ib_check_mr_status(struct ib_mr *ibmr, u32 check_mask,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mr.c b/drivers/net/ethernet/mellanox/mlx5/core/mr.c
index 9231b39d18b21c..c501bf2a025210 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mr.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mr.c
@@ -112,17 +112,11 @@ int mlx5_core_destroy_mkey(struct mlx5_core_dev *dev,
 	u32 out[MLX5_ST_SZ_DW(destroy_mkey_out)] = {0};
 	u32 in[MLX5_ST_SZ_DW(destroy_mkey_in)]   = {0};
 	struct xarray *mkeys = &dev->priv.mkey_table;
-	struct mlx5_core_mkey *deleted_mkey;
 	unsigned long flags;
 
 	xa_lock_irqsave(mkeys, flags);
-	deleted_mkey = __xa_erase(mkeys, mlx5_base_mkey(mkey->key));
+	__xa_erase(mkeys, mlx5_base_mkey(mkey->key));
 	xa_unlock_irqrestore(mkeys, flags);
-	if (!deleted_mkey) {
-		mlx5_core_dbg(dev, "failed xarray delete of mkey 0x%x\n",
-			      mlx5_base_mkey(mkey->key));
-		return -ENOENT;
-	}
 
 	MLX5_SET(destroy_mkey_in, in, opcode, MLX5_CMD_OP_DESTROY_MKEY);
 	MLX5_SET(destroy_mkey_in, in, mkey_index, mlx5_mkey_to_idx(mkey->key));
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/6] RDMA/mlx5: Fix a race with mlx5_ib_update_xlt on an implicit MR
  2019-10-01 15:38 ` [PATCH 2/6] RDMA/mlx5: Fix a race with mlx5_ib_update_xlt on an implicit MR Jason Gunthorpe
@ 2019-10-02  8:18   ` Leon Romanovsky
  2019-10-02 14:39     ` Jason Gunthorpe
  0 siblings, 1 reply; 14+ messages in thread
From: Leon Romanovsky @ 2019-10-02  8:18 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: linux-rdma, Jason Gunthorpe, Artemy Kovalyov

On Tue, Oct 01, 2019 at 12:38:17PM -0300, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
>
> mlx5_ib_update_xlt() must be protected against parallel free of the MR it
> is accessing, also it must be called single threaded while updating the
> HW. Otherwise we can have races of the form:
>
>     CPU0                               CPU1
>   mlx5_ib_update_xlt()
>    mlx5_odp_populate_klm()
>      odp_lookup() == NULL
>      pklm = ZAP
>                                       implicit_mr_get_data()
>  				        implicit_mr_alloc()
>  					  <update interval tree>
> 					mlx5_ib_update_xlt
> 					  mlx5_odp_populate_klm()
> 					    odp_lookup() != NULL
> 					    pklm = VALID
> 					   mlx5_ib_post_send_wait()
>
>     mlx5_ib_post_send_wait() // Replaces VALID with ZAP
>
> This can be solved by putting both the SRCU and the umem_mutex lock around
> every call to mlx5_ib_update_xlt(). This ensures that the content of the
> interval tree relavent to mlx5_odp_populate_klm() (ie mr->parent == mr)
> will not change while it is running, and thus the posted WRs to update the
> KLM will always reflect the correct information.
>
> The race above will resolve by either having CPU1 wait till CPU0 completes
> the ZAP or CPU0 will run after the add and instead store VALID.
>
> The pagefault path adding children already holds the umem_mutex and SRCU,
> so the only missed lock is during MR destruction.
>
> Fixes: 81713d3788d2 ("IB/mlx5: Add implicit MR support")
> Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> ---
>  drivers/infiniband/hw/mlx5/odp.c | 34 ++++++++++++++++++++++++++++++--
>  1 file changed, 32 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
> index 2e9b4306179745..3401c06b7e54f5 100644
> --- a/drivers/infiniband/hw/mlx5/odp.c
> +++ b/drivers/infiniband/hw/mlx5/odp.c
> @@ -178,6 +178,29 @@ void mlx5_odp_populate_klm(struct mlx5_klm *pklm, size_t offset,
>  		return;
>  	}
>
> +	/*
> +	 * The locking here is pretty subtle. Ideally the implicit children
> +	 * list would be protected by the umem_mutex, however that is not
> +	 * possible. Instead this uses a weaker update-then-lock pattern:
> +	 *
> +	 *  srcu_read_lock()
> +	 *    <change children list>
> +	 *    mutex_lock(umem_mutex)
> +	 *     mlx5_ib_update_xlt()
> +	 *    mutex_unlock(umem_mutex)
> +	 *    destroy lkey
> +	 *
> +	 * ie any change the children list must be followed by the locked
> +	 * update_xlt before destroying.
> +	 *
> +	 * The umem_mutex provides the acquire/release semantic needed to make
> +	 * the children list visible to a racing thread. While SRCU is not
> +	 * technically required, using it gives consistent use of the SRCU
> +	 * locking around the children list.
> +	 */
> +	lockdep_assert_held(&to_ib_umem_odp(mr->umem)->umem_mutex);
> +	lockdep_assert_held(&mr->dev->mr_srcu);
> +
>  	odp = odp_lookup(offset * MLX5_IMR_MTT_SIZE,
>  			 nentries * MLX5_IMR_MTT_SIZE, mr);
>
> @@ -202,15 +225,22 @@ static void mr_leaf_free_action(struct work_struct *work)
>  	struct ib_umem_odp *odp = container_of(work, struct ib_umem_odp, work);
>  	int idx = ib_umem_start(odp) >> MLX5_IMR_MTT_SHIFT;
>  	struct mlx5_ib_mr *mr = odp->private, *imr = mr->parent;
> +	struct ib_umem_odp *odp_imr = to_ib_umem_odp(imr->umem);
> +	int srcu_key;
>
>  	mr->parent = NULL;
>  	synchronize_srcu(&mr->dev->mr_srcu);

Are you sure that this line is still needed?

>
> -	ib_umem_odp_release(odp);
> -	if (imr->live)
> +	if (imr->live) {
> +		srcu_key = srcu_read_lock(&mr->dev->mr_srcu);
> +		mutex_lock(&odp_imr->umem_mutex);
>  		mlx5_ib_update_xlt(imr, idx, 1, 0,
>  				   MLX5_IB_UPD_XLT_INDIRECT |
>  				   MLX5_IB_UPD_XLT_ATOMIC);
> +		mutex_unlock(&odp_imr->umem_mutex);
> +		srcu_read_unlock(&mr->dev->mr_srcu, srcu_key);
> +	}
> +	ib_umem_odp_release(odp);
>  	mlx5_mr_cache_free(mr->dev, mr);
>
>  	if (atomic_dec_and_test(&imr->num_leaf_free))
> --
> 2.23.0
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/6] RDMA/mlx5: Fix a race with mlx5_ib_update_xlt on an implicit MR
  2019-10-02  8:18   ` Leon Romanovsky
@ 2019-10-02 14:39     ` Jason Gunthorpe
  2019-10-02 15:41       ` Leon Romanovsky
  0 siblings, 1 reply; 14+ messages in thread
From: Jason Gunthorpe @ 2019-10-02 14:39 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: linux-rdma, Artemy Kovalyov

On Wed, Oct 02, 2019 at 11:18:26AM +0300, Leon Romanovsky wrote:
> > @@ -202,15 +225,22 @@ static void mr_leaf_free_action(struct work_struct *work)
> >  	struct ib_umem_odp *odp = container_of(work, struct ib_umem_odp, work);
> >  	int idx = ib_umem_start(odp) >> MLX5_IMR_MTT_SHIFT;
> >  	struct mlx5_ib_mr *mr = odp->private, *imr = mr->parent;
> > +	struct ib_umem_odp *odp_imr = to_ib_umem_odp(imr->umem);
> > +	int srcu_key;
> >
> >  	mr->parent = NULL;
> >  	synchronize_srcu(&mr->dev->mr_srcu);
> 
> Are you sure that this line is still needed?

Yes, in this case the mr->parent is the SRCU 'update' and it blocks
seeing this MR in the pagefault handler.

It is necessary before calling ib_umem_odp_release() below, which frees
the memory.

Jason

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/6] RDMA/mlx5: Fix a race with mlx5_ib_update_xlt on an implicit MR
  2019-10-02 14:39     ` Jason Gunthorpe
@ 2019-10-02 15:41       ` Leon Romanovsky
  2019-10-03 12:48         ` Jason Gunthorpe
  0 siblings, 1 reply; 14+ messages in thread
From: Leon Romanovsky @ 2019-10-02 15:41 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: linux-rdma, Artemy Kovalyov

On Wed, Oct 02, 2019 at 11:39:28AM -0300, Jason Gunthorpe wrote:
> On Wed, Oct 02, 2019 at 11:18:26AM +0300, Leon Romanovsky wrote:
> > > @@ -202,15 +225,22 @@ static void mr_leaf_free_action(struct work_struct *work)
> > >  	struct ib_umem_odp *odp = container_of(work, struct ib_umem_odp, work);
> > >  	int idx = ib_umem_start(odp) >> MLX5_IMR_MTT_SHIFT;
> > >  	struct mlx5_ib_mr *mr = odp->private, *imr = mr->parent;
> > > +	struct ib_umem_odp *odp_imr = to_ib_umem_odp(imr->umem);
> > > +	int srcu_key;
> > >
> > >  	mr->parent = NULL;
> > >  	synchronize_srcu(&mr->dev->mr_srcu);
> >
> > Are you sure that this line is still needed?
>
> Yes, in this case the mr->parent is the SRCU 'update' and it blocks
> seeing this MR in the pagefault handler.
>
> It is necessary before calling ib_umem_odp_release below that frees
> the memory

Sorry for not being clear; I thought that the synchronize_srcu() should be
moved after your read_lock/unlock additions to reuse the grace period.

Thanks

>
> Jason

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 6/6] RDMA/mlx5: Add missing synchronize_srcu() for MW cases
  2019-10-01 15:38 ` [PATCH 6/6] RDMA/mlx5: Add missing synchronize_srcu() for MW cases Jason Gunthorpe
@ 2019-10-03  8:54   ` Leon Romanovsky
  2019-10-03 12:33     ` Jason Gunthorpe
  0 siblings, 1 reply; 14+ messages in thread
From: Leon Romanovsky @ 2019-10-03  8:54 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: linux-rdma, Jason Gunthorpe, Artemy Kovalyov

On Tue, Oct 01, 2019 at 12:38:21PM -0300, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
>
> While MR uses live as the SRCU 'update', the MW case uses the xarray
> directly, xa_erase() causes the MW to become inaccessible to the pagefault
> thread.
>
> Thus whenever a MW is removed from the xarray we must synchronize_srcu()
> before freeing it.
>
> This must be done before freeing the mkey as re-use of the mkey while the
> pagefault thread is using the stale mkey is undesirable.
>
> Add the missing synchronizes to MW and DEVX indirect mkey and delete the
> bogus protection against double destroy in mlx5_core_destroy_mkey()
>
> Fixes: 534fd7aac56a ("IB/mlx5: Manage indirection mkey upon DEVX flow for ODP")
> Fixes: 6aec21f6a832 ("IB/mlx5: Page faults handling infrastructure")
> Reviewed-by: Artemy Kovalyov <artemyko@mellanox.com>
> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
> ---
>  drivers/infiniband/hw/mlx5/devx.c            | 58 ++++++--------------
>  drivers/infiniband/hw/mlx5/mlx5_ib.h         |  1 -
>  drivers/infiniband/hw/mlx5/mr.c              | 21 +++++--
>  drivers/net/ethernet/mellanox/mlx5/core/mr.c |  8 +--
>  4 files changed, 33 insertions(+), 55 deletions(-)
>
> diff --git a/drivers/infiniband/hw/mlx5/devx.c b/drivers/infiniband/hw/mlx5/devx.c
> index 59022b7441448f..d609f4659afb7a 100644
> --- a/drivers/infiniband/hw/mlx5/devx.c
> +++ b/drivers/infiniband/hw/mlx5/devx.c
> @@ -1298,29 +1298,6 @@ static int devx_handle_mkey_create(struct mlx5_ib_dev *dev,
>  	return 0;
>  }
>
> -static void devx_free_indirect_mkey(struct rcu_head *rcu)
> -{
> -	kfree(container_of(rcu, struct devx_obj, devx_mr.rcu));
> -}
> -
> -/* This function to delete from the radix tree needs to be called before
> - * destroying the underlying mkey. Otherwise a race might occur in case that
> - * other thread will get the same mkey before this one will be deleted,
> - * in that case it will fail via inserting to the tree its own data.
> - *
> - * Note:
> - * An error in the destroy is not expected unless there is some other indirect
> - * mkey which points to this one. In a kernel cleanup flow it will be just
> - * destroyed in the iterative destruction call. In a user flow, in case
> - * the application didn't close in the expected order it's its own problem,
> - * the mkey won't be part of the tree, in both cases the kernel is safe.
> - */
> -static void devx_cleanup_mkey(struct devx_obj *obj)
> -{
> -	xa_erase(&obj->ib_dev->mdev->priv.mkey_table,
> -		 mlx5_base_mkey(obj->devx_mr.mmkey.key));
> -}
> -
>  static void devx_cleanup_subscription(struct mlx5_ib_dev *dev,
>  				      struct devx_event_subscription *sub)
>  {
> @@ -1362,8 +1339,16 @@ static int devx_obj_cleanup(struct ib_uobject *uobject,
>  	int ret;
>
>  	dev = mlx5_udata_to_mdev(&attrs->driver_udata);
> -	if (obj->flags & DEVX_OBJ_FLAGS_INDIRECT_MKEY)
> -		devx_cleanup_mkey(obj);
> +	if (obj->flags & DEVX_OBJ_FLAGS_INDIRECT_MKEY) {
> +		/*
> +		 * The pagefault_single_data_segment() does commands against
> +		 * the mmkey, we must wait for that to stop before freeing the
> +		 * mkey, as another allocation could get the same mkey #.
> +		 */
> +		xa_erase(&obj->ib_dev->mdev->priv.mkey_table,
> +			 mlx5_base_mkey(obj->devx_mr.mmkey.key));
> +		synchronize_srcu(&dev->mr_srcu);
> +	}
>
>  	if (obj->flags & DEVX_OBJ_FLAGS_DCT)
>  		ret = mlx5_core_destroy_dct(obj->ib_dev->mdev, &obj->core_dct);
> @@ -1382,12 +1367,6 @@ static int devx_obj_cleanup(struct ib_uobject *uobject,
>  		devx_cleanup_subscription(dev, sub_entry);
>  	mutex_unlock(&devx_event_table->event_xa_lock);
>
> -	if (obj->flags & DEVX_OBJ_FLAGS_INDIRECT_MKEY) {
> -		call_srcu(&dev->mr_srcu, &obj->devx_mr.rcu,
> -			  devx_free_indirect_mkey);
> -		return ret;
> -	}
> -
>  	kfree(obj);
>  	return ret;
>  }
> @@ -1491,26 +1470,21 @@ static int UVERBS_HANDLER(MLX5_IB_METHOD_DEVX_OBJ_CREATE)(
>  				   &obj_id);
>  	WARN_ON(obj->dinlen > MLX5_MAX_DESTROY_INBOX_SIZE_DW * sizeof(u32));
>
> -	if (obj->flags & DEVX_OBJ_FLAGS_INDIRECT_MKEY) {
> -		err = devx_handle_mkey_indirect(obj, dev, cmd_in, cmd_out);
> -		if (err)
> -			goto obj_destroy;
> -	}
> -
>  	err = uverbs_copy_to(attrs, MLX5_IB_ATTR_DEVX_OBJ_CREATE_CMD_OUT, cmd_out, cmd_out_len);
>  	if (err)
> -		goto err_copy;
> +		goto obj_destroy;
>
>  	if (opcode == MLX5_CMD_OP_CREATE_GENERAL_OBJECT)
>  		obj_type = MLX5_GET(general_obj_in_cmd_hdr, cmd_in, obj_type);
> -
>  	obj->obj_id = get_enc_obj_id(opcode | obj_type << 16, obj_id);
>
> +	if (obj->flags & DEVX_OBJ_FLAGS_INDIRECT_MKEY) {
> +		err = devx_handle_mkey_indirect(obj, dev, cmd_in, cmd_out);
> +		if (err)
> +			goto obj_destroy;
> +	}
>  	return 0;
>
> -err_copy:
> -	if (obj->flags & DEVX_OBJ_FLAGS_INDIRECT_MKEY)
> -		devx_cleanup_mkey(obj);
>  obj_destroy:
>  	if (obj->flags & DEVX_OBJ_FLAGS_DCT)
>  		mlx5_core_destroy_dct(obj->ib_dev->mdev, &obj->core_dct);
> diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
> index 15e42825cc976e..1a98ee2e01c4b9 100644
> --- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
> +++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
> @@ -639,7 +639,6 @@ struct mlx5_ib_mw {
>  struct mlx5_ib_devx_mr {
>  	struct mlx5_core_mkey	mmkey;
>  	int			ndescs;
> -	struct rcu_head		rcu;
>  };
>
>  struct mlx5_ib_umr_context {
> diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
> index 3a27bddfcf31f5..630599311586ec 100644
> --- a/drivers/infiniband/hw/mlx5/mr.c
> +++ b/drivers/infiniband/hw/mlx5/mr.c
> @@ -1962,14 +1962,25 @@ struct ib_mw *mlx5_ib_alloc_mw(struct ib_pd *pd, enum ib_mw_type type,
>
>  int mlx5_ib_dealloc_mw(struct ib_mw *mw)
>  {
> +	struct mlx5_ib_dev *dev = to_mdev(mw->device);
>  	struct mlx5_ib_mw *mmw = to_mmw(mw);
>  	int err;
>
> -	err =  mlx5_core_destroy_mkey((to_mdev(mw->device))->mdev,
> -				      &mmw->mmkey);
> -	if (!err)
> -		kfree(mmw);
> -	return err;
> +	if (IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)) {
> +		xa_erase(&dev->mdev->priv.mkey_table,
> +			 mlx5_base_mkey(mmw->mmkey.key));
> +		/*
> +		 * pagefault_single_data_segment() may be accessing mmw under
> +		 * SRCU if the user bound an ODP MR to this MW.
> +		 */
> +		synchronize_srcu(&dev->mr_srcu);
> +	}
> +
> +	err = mlx5_core_destroy_mkey(dev->mdev, &mmw->mmkey);
> +	if (err)
> +		return err;
> +	kfree(mmw);

You are skipping kfree() in the case of an error returned by
mlx5_core_destroy_mkey(). IMHO, that is right for -ENOENT, but not right
for mlx5_cmd_exec() failures.

Thanks

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 6/6] RDMA/mlx5: Add missing synchronize_srcu() for MW cases
  2019-10-03  8:54   ` Leon Romanovsky
@ 2019-10-03 12:33     ` Jason Gunthorpe
  0 siblings, 0 replies; 14+ messages in thread
From: Jason Gunthorpe @ 2019-10-03 12:33 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: linux-rdma, Artemy Kovalyov

On Thu, Oct 03, 2019 at 11:54:49AM +0300, Leon Romanovsky wrote:

> > diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
> > index 3a27bddfcf31f5..630599311586ec 100644
> > +++ b/drivers/infiniband/hw/mlx5/mr.c
> > @@ -1962,14 +1962,25 @@ struct ib_mw *mlx5_ib_alloc_mw(struct ib_pd *pd, enum ib_mw_type type,
> >
> >  int mlx5_ib_dealloc_mw(struct ib_mw *mw)
> >  {
> > +	struct mlx5_ib_dev *dev = to_mdev(mw->device);
> >  	struct mlx5_ib_mw *mmw = to_mmw(mw);
> >  	int err;
> >
> > -	err =  mlx5_core_destroy_mkey((to_mdev(mw->device))->mdev,
> > -				      &mmw->mmkey);
> > -	if (!err)
> > -		kfree(mmw);
> > -	return err;
> > +	if (IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)) {
> > +		xa_erase(&dev->mdev->priv.mkey_table,
> > +			 mlx5_base_mkey(mmw->mmkey.key));
> > +		/*
> > +		 * pagefault_single_data_segment() may be accessing mmw under
> > +		 * SRCU if the user bound an ODP MR to this MW.
> > +		 */
> > +		synchronize_srcu(&dev->mr_srcu);
> > +	}
> > +
> > +	err = mlx5_core_destroy_mkey(dev->mdev, &mmw->mmkey);
> > +	if (err)
> > +		return err;
> > +	kfree(mmw);
> 
> You are skipping kfree() in case of error returned by mlx5_core_destroy_mkey().
> IMHO, it is right for -ENOENT, but is not right for mlx5_cmd_exec() failures.

This is pre-existing behavior; it seems reasonable to always free.

Again, allowing errors on destroy is such an annoying anti-pattern...

But fixing this should be a follow-up patch.

Jason

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/6] RDMA/mlx5: Fix a race with mlx5_ib_update_xlt on an implicit MR
  2019-10-02 15:41       ` Leon Romanovsky
@ 2019-10-03 12:48         ` Jason Gunthorpe
  0 siblings, 0 replies; 14+ messages in thread
From: Jason Gunthorpe @ 2019-10-03 12:48 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: linux-rdma, Artemy Kovalyov

On Wed, Oct 02, 2019 at 06:41:14PM +0300, Leon Romanovsky wrote:
> On Wed, Oct 02, 2019 at 11:39:28AM -0300, Jason Gunthorpe wrote:
> > On Wed, Oct 02, 2019 at 11:18:26AM +0300, Leon Romanovsky wrote:
> > > > @@ -202,15 +225,22 @@ static void mr_leaf_free_action(struct work_struct *work)
> > > >  	struct ib_umem_odp *odp = container_of(work, struct ib_umem_odp, work);
> > > >  	int idx = ib_umem_start(odp) >> MLX5_IMR_MTT_SHIFT;
> > > >  	struct mlx5_ib_mr *mr = odp->private, *imr = mr->parent;
> > > > +	struct ib_umem_odp *odp_imr = to_ib_umem_odp(imr->umem);
> > > > +	int srcu_key;
> > > >
> > > >  	mr->parent = NULL;
> > > >  	synchronize_srcu(&mr->dev->mr_srcu);
> > >
> > > Are you sure that this line is still needed?
> >
> > Yes, in this case the mr->parent is the SRCU 'update' and it blocks
> > seeing this MR in the pagefault handler.
> >
> > It is necessary before calling ib_umem_odp_release below that frees
> > the memory
> 
> sorry for not being clear, I thought that synchronize_srcu() should be
> moved after your read_lock/unlock additions to reuse grace period.

It has to be before, to ensure that the page fault handler does not
undo the invalidate below with new pages.

Jason

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH -rc 0/6] Bug fixes for odp
  2019-10-01 15:38 [PATCH -rc 0/6] Bug fixes for odp Jason Gunthorpe
                   ` (5 preceding siblings ...)
  2019-10-01 15:38 ` [PATCH 6/6] RDMA/mlx5: Add missing synchronize_srcu() for MW cases Jason Gunthorpe
@ 2019-10-04 18:55 ` Jason Gunthorpe
  6 siblings, 0 replies; 14+ messages in thread
From: Jason Gunthorpe @ 2019-10-04 18:55 UTC (permalink / raw)
  To: linux-rdma

On Tue, Oct 01, 2019 at 12:38:15PM -0300, Jason Gunthorpe wrote:
> From: Jason Gunthorpe <jgg@mellanox.com>
> 
> Various assorted bug fixes for the ODP feature closing races and other bad
> locking things we be seeing in the field.
> 
> Jason Gunthorpe (6):
>   RDMA/mlx5: Do not allow rereg of a ODP MR
>   RDMA/mlx5: Fix a race with mlx5_ib_update_xlt on an implicit MR
>   RDMA/odp: Lift umem_mutex out of ib_umem_odp_unmap_dma_pages()
>   RDMA/mlx5: Order num_pending_prefetch properly with synchronize_srcu
>   RDMA/mlx5: Put live in the correct place for ODP MRs
>   RDMA/mlx5: Add missing synchronize_srcu() for MW cases

Applied to for-rc

Jason

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2019-10-04 18:55 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-01 15:38 [PATCH -rc 0/6] Bug fixes for odp Jason Gunthorpe
2019-10-01 15:38 ` [PATCH 1/6] RDMA/mlx5: Do not allow rereg of a ODP MR Jason Gunthorpe
2019-10-01 15:38 ` [PATCH 2/6] RDMA/mlx5: Fix a race with mlx5_ib_update_xlt on an implicit MR Jason Gunthorpe
2019-10-02  8:18   ` Leon Romanovsky
2019-10-02 14:39     ` Jason Gunthorpe
2019-10-02 15:41       ` Leon Romanovsky
2019-10-03 12:48         ` Jason Gunthorpe
2019-10-01 15:38 ` [PATCH 3/6] RDMA/odp: Lift umem_mutex out of ib_umem_odp_unmap_dma_pages() Jason Gunthorpe
2019-10-01 15:38 ` [PATCH 4/6] RDMA/mlx5: Order num_pending_prefetch properly with synchronize_srcu Jason Gunthorpe
2019-10-01 15:38 ` [PATCH 5/6] RDMA/mlx5: Put live in the correct place for ODP MRs Jason Gunthorpe
2019-10-01 15:38 ` [PATCH 6/6] RDMA/mlx5: Add missing synchronize_srcu() for MW cases Jason Gunthorpe
2019-10-03  8:54   ` Leon Romanovsky
2019-10-03 12:33     ` Jason Gunthorpe
2019-10-04 18:55 ` [PATCH -rc 0/6] Bug fixes for odp Jason Gunthorpe
