linux-rdma.vger.kernel.org archive mirror
* [PATCH for-next v10 00/11] Fix race conditions in rxe_pool
@ 2022-02-25 19:57 Bob Pearson
  2022-02-25 19:57 ` [PATCH for-next v10 01/11] RDMA/rxe: Reverse the sense of RXE_POOL_NO_ALLOC Bob Pearson
                   ` (11 more replies)
  0 siblings, 12 replies; 21+ messages in thread
From: Bob Pearson @ 2022-02-25 19:57 UTC
  To: jgg, zyjzyj2000, linux-rdma; +Cc: Bob Pearson

Several race conditions have been found in the current rdma_rxe
driver.  Most of them are races between normal operations and
destroying objects.  This patch series
 - Makes several minor cleanups in rxe_pool.[ch].
 - Replaces the red-black trees currently used for indices with
   xarrays.
 - Corrects several reference counting errors.
 - Adds wait_for_completion() calls to the verbs API paths which
   destroy objects.
 - Changes read side locking to RCU.

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
v10
  Rebased to current wip/jgg-for-next.
  Split some patches into smaller ones.
v9
  Corrected issues reported by Jason Gunthorpe.
  Converted locking in rxe_mcast.c and rxe_pool.c to use RCU
  Split up the patches into smaller changes
v8
  Fixed an additional race in 3/8 which was not handled correctly.
v7
  Corrected issues reported by Jason Gunthorpe
Link: https://lore.kernel.org/linux-rdma/20211207190947.GH6385@nvidia.com/
Link: https://lore.kernel.org/linux-rdma/20211207191857.GI6385@nvidia.com/
Link: https://lore.kernel.org/linux-rdma/20211207192824.GJ6385@nvidia.com/
v6
  Fixed a kzalloc flags bug.
  Fixed comment bug reported by 'Kernel Test Robot'.
  Changed type of rxe_pool.c in __rxe_fini().
v5
  Removed patches already accepted into for-next and addressed comments
  from Jason Gunthorpe.
v4
  Restructured patch series to change to xarray earlier which
  greatly simplified the changes.
  Rebased to current for-next
v3
  Changed rxe_alloc to use GFP_KERNEL
  Addressed other comments by Jason Gunthorpe
  Merged the previous 06/10 and 07/10 patches into one since they overlapped
  Added some minor cleanups as 10/10
v2
  Rebased to current for-next.
  Added 4 additional patches

Bob Pearson (11):
  RDMA/rxe: Reverse the sense of RXE_POOL_NO_ALLOC
  RDMA/rxe: Delete _locked() APIs for pool objects
  RDMA/rxe: Replace obj by elem in declaration
  RDMA/rxe: Replace red-black trees by xarrays
  RDMA/rxe: Stop lookup of partially built objects
  RDMA/rxe: Add wait_for_completion to pool objects
  RDMA/rxe: Fix ref error in rxe_av.c
  RDMA/rxe: Replace mr by rkey in responder resources
  RDMA/rxe: Convert read side locking to rcu
  RDMA/rxe: Move max_elem into rxe_type_info
  RDMA/rxe: Cleanup rxe_pool.c

 drivers/infiniband/sw/rxe/rxe.c       |  87 +----
 drivers/infiniband/sw/rxe/rxe_av.c    |  19 +-
 drivers/infiniband/sw/rxe/rxe_loc.h   |   5 +-
 drivers/infiniband/sw/rxe/rxe_mr.c    |   4 +-
 drivers/infiniband/sw/rxe/rxe_mw.c    |  13 +-
 drivers/infiniband/sw/rxe/rxe_net.c   |  17 +-
 drivers/infiniband/sw/rxe/rxe_pool.c  | 453 ++++++++++++++------------
 drivers/infiniband/sw/rxe/rxe_pool.h  |  74 ++---
 drivers/infiniband/sw/rxe/rxe_qp.c    |  10 +-
 drivers/infiniband/sw/rxe/rxe_req.c   |  55 ++--
 drivers/infiniband/sw/rxe/rxe_resp.c  | 125 ++++---
 drivers/infiniband/sw/rxe/rxe_verbs.c |  55 ++--
 drivers/infiniband/sw/rxe/rxe_verbs.h |   1 -
 13 files changed, 462 insertions(+), 456 deletions(-)


base-commit: 3ac3107872b8dd4b5c4c1b598fcbc24983cd009b

Patch applies to current wip/jgg-for-next with or without the last
two (5-6/6) patches in the multicast series.
-- 
2.32.0



* [PATCH for-next v10 01/11] RDMA/rxe: Reverse the sense of RXE_POOL_NO_ALLOC
  2022-02-25 19:57 [PATCH for-next v10 00/11] Fix race conditions in rxe_pool Bob Pearson
@ 2022-02-25 19:57 ` Bob Pearson
  2022-02-28 17:15   ` Jason Gunthorpe
  2022-02-25 19:57 ` [PATCH for-next v10 02/11] RDMA/rxe: Delete _locked() APIs for pool objects Bob Pearson
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 21+ messages in thread
From: Bob Pearson @ 2022-02-25 19:57 UTC
  To: jgg, zyjzyj2000, linux-rdma; +Cc: Bob Pearson

MR is the only remaining object type whose memory is allocated by the
pool, through rxe_alloc(). All other types are already allocated when
they are attached with rxe_add_to_pool(). The sense of the flag is
therefore reversed, from RXE_POOL_NO_ALLOC to RXE_POOL_ALLOC. Add checks
to rxe_alloc() and rxe_add_to_pool() to make sure each pool uses the
call that matches its setting of this flag.
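
The flag check described above can be modeled in userspace roughly as
follows. This is only a sketch, not the driver code: struct pool,
pool_alloc() and pool_add() are hypothetical stand-ins for struct
rxe_pool, rxe_alloc() and __rxe_add_to_pool(), and calloc() stands in
for kzalloc().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical userspace model; names are not the driver's. */
enum pool_flags {
	POOL_ALLOC = 1u << 2,	/* mirrors RXE_POOL_ALLOC = BIT(2) */
};

struct pool {
	const char *name;
	unsigned int flags;
	size_t elem_size;
};

/* Pool-allocated path: only valid when POOL_ALLOC is set. */
static void *pool_alloc(struct pool *pool)
{
	if (!(pool->flags & POOL_ALLOC)) {
		fprintf(stderr, "%s: pool %s must call pool_add\n",
			__func__, pool->name);
		return NULL;
	}

	return calloc(1, pool->elem_size);
}

/* Caller-allocated path: only valid when POOL_ALLOC is clear. */
static int pool_add(struct pool *pool, void *obj)
{
	if (pool->flags & POOL_ALLOC) {
		fprintf(stderr, "%s: pool %s must call pool_alloc\n",
			__func__, pool->name);
		return -EINVAL;
	}

	(void)obj;	/* a real pool would record the object here */
	return 0;
}
```

In this model a pool with POOL_ALLOC set accepts only pool_alloc(),
and a pool without it accepts only pool_add(), so a mismatched call
fails immediately instead of leaking or double-freeing memory later.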

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_pool.c | 27 ++++++++++++++++++---------
 drivers/infiniband/sw/rxe/rxe_pool.h |  2 +-
 2 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
index 16056b918ace..63681a8398b8 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.c
+++ b/drivers/infiniband/sw/rxe/rxe_pool.c
@@ -21,19 +21,17 @@ static const struct rxe_type_info {
 		.name		= "rxe-uc",
 		.size		= sizeof(struct rxe_ucontext),
 		.elem_offset	= offsetof(struct rxe_ucontext, elem),
-		.flags          = RXE_POOL_NO_ALLOC,
 	},
 	[RXE_TYPE_PD] = {
 		.name		= "rxe-pd",
 		.size		= sizeof(struct rxe_pd),
 		.elem_offset	= offsetof(struct rxe_pd, elem),
-		.flags		= RXE_POOL_NO_ALLOC,
 	},
 	[RXE_TYPE_AH] = {
 		.name		= "rxe-ah",
 		.size		= sizeof(struct rxe_ah),
 		.elem_offset	= offsetof(struct rxe_ah, elem),
-		.flags		= RXE_POOL_INDEX | RXE_POOL_NO_ALLOC,
+		.flags		= RXE_POOL_INDEX,
 		.min_index	= RXE_MIN_AH_INDEX,
 		.max_index	= RXE_MAX_AH_INDEX,
 	},
@@ -41,7 +39,7 @@ static const struct rxe_type_info {
 		.name		= "rxe-srq",
 		.size		= sizeof(struct rxe_srq),
 		.elem_offset	= offsetof(struct rxe_srq, elem),
-		.flags		= RXE_POOL_INDEX | RXE_POOL_NO_ALLOC,
+		.flags		= RXE_POOL_INDEX,
 		.min_index	= RXE_MIN_SRQ_INDEX,
 		.max_index	= RXE_MAX_SRQ_INDEX,
 	},
@@ -50,7 +48,7 @@ static const struct rxe_type_info {
 		.size		= sizeof(struct rxe_qp),
 		.elem_offset	= offsetof(struct rxe_qp, elem),
 		.cleanup	= rxe_qp_cleanup,
-		.flags		= RXE_POOL_INDEX | RXE_POOL_NO_ALLOC,
+		.flags		= RXE_POOL_INDEX,
 		.min_index	= RXE_MIN_QP_INDEX,
 		.max_index	= RXE_MAX_QP_INDEX,
 	},
@@ -58,7 +56,6 @@ static const struct rxe_type_info {
 		.name		= "rxe-cq",
 		.size		= sizeof(struct rxe_cq),
 		.elem_offset	= offsetof(struct rxe_cq, elem),
-		.flags          = RXE_POOL_NO_ALLOC,
 		.cleanup	= rxe_cq_cleanup,
 	},
 	[RXE_TYPE_MR] = {
@@ -66,7 +63,7 @@ static const struct rxe_type_info {
 		.size		= sizeof(struct rxe_mr),
 		.elem_offset	= offsetof(struct rxe_mr, elem),
 		.cleanup	= rxe_mr_cleanup,
-		.flags		= RXE_POOL_INDEX,
+		.flags		= RXE_POOL_INDEX | RXE_POOL_ALLOC,
 		.min_index	= RXE_MIN_MR_INDEX,
 		.max_index	= RXE_MAX_MR_INDEX,
 	},
@@ -75,7 +72,7 @@ static const struct rxe_type_info {
 		.size		= sizeof(struct rxe_mw),
 		.elem_offset	= offsetof(struct rxe_mw, elem),
 		.cleanup	= rxe_mw_cleanup,
-		.flags		= RXE_POOL_INDEX | RXE_POOL_NO_ALLOC,
+		.flags		= RXE_POOL_INDEX,
 		.min_index	= RXE_MIN_MW_INDEX,
 		.max_index	= RXE_MAX_MW_INDEX,
 	},
@@ -264,6 +261,12 @@ void *rxe_alloc(struct rxe_pool *pool)
 	struct rxe_pool_elem *elem;
 	void *obj;
 
+	if (!(pool->flags & RXE_POOL_ALLOC)) {
+		pr_warn_once("%s: Pool %s must call rxe_add_to_pool\n",
+				__func__, pool->name);
+		return NULL;
+	}
+
 	if (atomic_inc_return(&pool->num_elem) > pool->max_elem)
 		goto out_cnt;
 
@@ -286,6 +289,12 @@ void *rxe_alloc(struct rxe_pool *pool)
 
 int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem)
 {
+	if (pool->flags & RXE_POOL_ALLOC) {
+		pr_warn_once("%s: Pool %s must call rxe_alloc\n",
+				__func__, pool->name);
+		return -EINVAL;
+	}
+
 	if (atomic_inc_return(&pool->num_elem) > pool->max_elem)
 		goto out_cnt;
 
@@ -310,7 +319,7 @@ void rxe_elem_release(struct kref *kref)
 	if (pool->cleanup)
 		pool->cleanup(elem);
 
-	if (!(pool->flags & RXE_POOL_NO_ALLOC)) {
+	if (pool->flags & RXE_POOL_ALLOC) {
 		obj = elem->obj;
 		kfree(obj);
 	}
diff --git a/drivers/infiniband/sw/rxe/rxe_pool.h b/drivers/infiniband/sw/rxe/rxe_pool.h
index 8fc95c6b7b9b..44b944c8c360 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.h
+++ b/drivers/infiniband/sw/rxe/rxe_pool.h
@@ -9,7 +9,7 @@
 
 enum rxe_pool_flags {
 	RXE_POOL_INDEX		= BIT(1),
-	RXE_POOL_NO_ALLOC	= BIT(4),
+	RXE_POOL_ALLOC		= BIT(2),
 };
 
 enum rxe_elem_type {
-- 
2.32.0



* [PATCH for-next v10 02/11] RDMA/rxe: Delete _locked() APIs for pool objects
  2022-02-25 19:57 [PATCH for-next v10 00/11] Fix race conditions in rxe_pool Bob Pearson
  2022-02-25 19:57 ` [PATCH for-next v10 01/11] RDMA/rxe: Reverse the sense of RXE_POOL_NO_ALLOC Bob Pearson
@ 2022-02-25 19:57 ` Bob Pearson
  2022-02-25 19:57 ` [PATCH for-next v10 03/11] RDMA/rxe: Replace obj by elem in declaration Bob Pearson
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: Bob Pearson @ 2022-02-25 19:57 UTC
  To: jgg, zyjzyj2000, linux-rdma; +Cc: Bob Pearson

Caller-managed locks for indexed objects are no longer used, so the
_locked() APIs are deleted.

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_pool.c | 67 ++++------------------------
 drivers/infiniband/sw/rxe/rxe_pool.h | 24 ++--------
 2 files changed, 12 insertions(+), 79 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
index 63681a8398b8..0a1233b2d4c5 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.c
+++ b/drivers/infiniband/sw/rxe/rxe_pool.c
@@ -189,17 +189,6 @@ static int rxe_insert_index(struct rxe_pool *pool, struct rxe_pool_elem *new)
 	return 0;
 }
 
-int __rxe_add_index_locked(struct rxe_pool_elem *elem)
-{
-	struct rxe_pool *pool = elem->pool;
-	int err;
-
-	elem->index = alloc_index(pool);
-	err = rxe_insert_index(pool, elem);
-
-	return err;
-}
-
 int __rxe_add_index(struct rxe_pool_elem *elem)
 {
 	struct rxe_pool *pool = elem->pool;
@@ -207,55 +196,24 @@ int __rxe_add_index(struct rxe_pool_elem *elem)
 	int err;
 
 	write_lock_irqsave(&pool->pool_lock, flags);
-	err = __rxe_add_index_locked(elem);
+	elem->index = alloc_index(pool);
+	err = rxe_insert_index(pool, elem);
 	write_unlock_irqrestore(&pool->pool_lock, flags);
 
 	return err;
 }
 
-void __rxe_drop_index_locked(struct rxe_pool_elem *elem)
-{
-	struct rxe_pool *pool = elem->pool;
-
-	clear_bit(elem->index - pool->index.min_index, pool->index.table);
-	rb_erase(&elem->index_node, &pool->index.tree);
-}
-
 void __rxe_drop_index(struct rxe_pool_elem *elem)
 {
 	struct rxe_pool *pool = elem->pool;
 	unsigned long flags;
 
 	write_lock_irqsave(&pool->pool_lock, flags);
-	__rxe_drop_index_locked(elem);
+	clear_bit(elem->index - pool->index.min_index, pool->index.table);
+	rb_erase(&elem->index_node, &pool->index.tree);
 	write_unlock_irqrestore(&pool->pool_lock, flags);
 }
 
-void *rxe_alloc_locked(struct rxe_pool *pool)
-{
-	struct rxe_pool_elem *elem;
-	void *obj;
-
-	if (atomic_inc_return(&pool->num_elem) > pool->max_elem)
-		goto out_cnt;
-
-	obj = kzalloc(pool->elem_size, GFP_ATOMIC);
-	if (!obj)
-		goto out_cnt;
-
-	elem = (struct rxe_pool_elem *)((u8 *)obj + pool->elem_offset);
-
-	elem->pool = pool;
-	elem->obj = obj;
-	kref_init(&elem->ref_cnt);
-
-	return obj;
-
-out_cnt:
-	atomic_dec(&pool->num_elem);
-	return NULL;
-}
-
 void *rxe_alloc(struct rxe_pool *pool)
 {
 	struct rxe_pool_elem *elem;
@@ -327,12 +285,14 @@ void rxe_elem_release(struct kref *kref)
 	atomic_dec(&pool->num_elem);
 }
 
-void *rxe_pool_get_index_locked(struct rxe_pool *pool, u32 index)
+void *rxe_pool_get_index(struct rxe_pool *pool, u32 index)
 {
-	struct rb_node *node;
 	struct rxe_pool_elem *elem;
+	struct rb_node *node;
+	unsigned long flags;
 	void *obj;
 
+	read_lock_irqsave(&pool->pool_lock, flags);
 	node = pool->index.tree.rb_node;
 
 	while (node) {
@@ -352,17 +312,6 @@ void *rxe_pool_get_index_locked(struct rxe_pool *pool, u32 index)
 	} else {
 		obj = NULL;
 	}
-
-	return obj;
-}
-
-void *rxe_pool_get_index(struct rxe_pool *pool, u32 index)
-{
-	unsigned long flags;
-	void *obj;
-
-	read_lock_irqsave(&pool->pool_lock, flags);
-	obj = rxe_pool_get_index_locked(pool, index);
 	read_unlock_irqrestore(&pool->pool_lock, flags);
 
 	return obj;
diff --git a/drivers/infiniband/sw/rxe/rxe_pool.h b/drivers/infiniband/sw/rxe/rxe_pool.h
index 44b944c8c360..7fec5d96d695 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.h
+++ b/drivers/infiniband/sw/rxe/rxe_pool.h
@@ -68,9 +68,7 @@ int rxe_pool_init(struct rxe_dev *rxe, struct rxe_pool *pool,
 /* free resources from object pool */
 void rxe_pool_cleanup(struct rxe_pool *pool);
 
-/* allocate an object from pool holding and not holding the pool lock */
-void *rxe_alloc_locked(struct rxe_pool *pool);
-
+/* allocate an object from pool */
 void *rxe_alloc(struct rxe_pool *pool);
 
 /* connect already allocated object to pool */
@@ -79,32 +77,18 @@ int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem);
 #define rxe_add_to_pool(pool, obj) __rxe_add_to_pool(pool, &(obj)->elem)
 
 /* assign an index to an indexed object and insert object into
- *  pool's rb tree holding and not holding the pool_lock
+ * pool's rb tree
  */
-int __rxe_add_index_locked(struct rxe_pool_elem *elem);
-
-#define rxe_add_index_locked(obj) __rxe_add_index_locked(&(obj)->elem)
-
 int __rxe_add_index(struct rxe_pool_elem *elem);
 
 #define rxe_add_index(obj) __rxe_add_index(&(obj)->elem)
 
-/* drop an index and remove object from rb tree
- * holding and not holding the pool_lock
- */
-void __rxe_drop_index_locked(struct rxe_pool_elem *elem);
-
-#define rxe_drop_index_locked(obj) __rxe_drop_index_locked(&(obj)->elem)
-
+/* drop an index and remove object from rb tree */
 void __rxe_drop_index(struct rxe_pool_elem *elem);
 
 #define rxe_drop_index(obj) __rxe_drop_index(&(obj)->elem)
 
-/* lookup an indexed object from index holding and not holding the pool_lock.
- * takes a reference on object
- */
-void *rxe_pool_get_index_locked(struct rxe_pool *pool, u32 index);
-
+/* lookup an indexed object from index. takes a reference on object */
 void *rxe_pool_get_index(struct rxe_pool *pool, u32 index);
 
 /* cleanup an object when all references are dropped */
-- 
2.32.0



* [PATCH for-next v10 03/11] RDMA/rxe: Replace obj by elem in declaration
  2022-02-25 19:57 [PATCH for-next v10 00/11] Fix race conditions in rxe_pool Bob Pearson
  2022-02-25 19:57 ` [PATCH for-next v10 01/11] RDMA/rxe: Reverse the sense of RXE_POOL_NO_ALLOC Bob Pearson
  2022-02-25 19:57 ` [PATCH for-next v10 02/11] RDMA/rxe: Delete _locked() APIs for pool objects Bob Pearson
@ 2022-02-25 19:57 ` Bob Pearson
  2022-02-25 19:57 ` [PATCH for-next v10 04/11] RDMA/rxe: Replace red-black trees by xarrays Bob Pearson
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: Bob Pearson @ 2022-02-25 19:57 UTC
  To: jgg, zyjzyj2000, linux-rdma; +Cc: Bob Pearson

Fix a harmless typo by replacing obj with elem in the declarations of
the cleanup fields. The old name had no effect but was confusing.

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_pool.c | 2 +-
 drivers/infiniband/sw/rxe/rxe_pool.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
index 0a1233b2d4c5..2e2d01a73639 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.c
+++ b/drivers/infiniband/sw/rxe/rxe_pool.c
@@ -12,7 +12,7 @@ static const struct rxe_type_info {
 	const char *name;
 	size_t size;
 	size_t elem_offset;
-	void (*cleanup)(struct rxe_pool_elem *obj);
+	void (*cleanup)(struct rxe_pool_elem *elem);
 	enum rxe_pool_flags flags;
 	u32 min_index;
 	u32 max_index;
diff --git a/drivers/infiniband/sw/rxe/rxe_pool.h b/drivers/infiniband/sw/rxe/rxe_pool.h
index 7fec5d96d695..a8582ad85b1e 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.h
+++ b/drivers/infiniband/sw/rxe/rxe_pool.h
@@ -39,7 +39,7 @@ struct rxe_pool {
 	struct rxe_dev		*rxe;
 	const char		*name;
 	rwlock_t		pool_lock; /* protects pool add/del/search */
-	void			(*cleanup)(struct rxe_pool_elem *obj);
+	void			(*cleanup)(struct rxe_pool_elem *elem);
 	enum rxe_pool_flags	flags;
 	enum rxe_elem_type	type;
 
-- 
2.32.0



* [PATCH for-next v10 04/11] RDMA/rxe: Replace red-black trees by xarrays
  2022-02-25 19:57 [PATCH for-next v10 00/11] Fix race conditions in rxe_pool Bob Pearson
                   ` (2 preceding siblings ...)
  2022-02-25 19:57 ` [PATCH for-next v10 03/11] RDMA/rxe: Replace obj by elem in declaration Bob Pearson
@ 2022-02-25 19:57 ` Bob Pearson
  2022-02-28 16:57   ` Jason Gunthorpe
  2022-02-25 19:57 ` [PATCH for-next v10 05/11] RDMA/rxe: Stop lookup of partially built objects Bob Pearson
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 21+ messages in thread
From: Bob Pearson @ 2022-02-25 19:57 UTC
  To: jgg, zyjzyj2000, linux-rdma; +Cc: Bob Pearson

Currently the rxe driver uses red-black trees to add indices to the rxe
object pools. Linux xarrays provide a better way to implement the same
functionality for indices. This patch replaces red-black trees with
xarrays for pool objects. Since xarrays already have a spinlock, use
that in place of the pool rwlock. Make sure that all changes to the
xarray (index) and the kref (ref count) occur atomically.
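
The lookup side of this change relies on taking a reference only if
the count has not already dropped to zero. A minimal userspace sketch
of that idea follows. It is an illustration under stated assumptions,
not the driver code: struct elem, ref_get_unless_zero(), ref_put(),
lookup() and the fixed table are hypothetical, the table stands in for
the xarray, and C11 atomics stand in for kref. The patch additionally
holds the xarray spinlock around the lookup and around the final put
plus erase; this single-threaded model omits that locking.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a pool element with a reference count. */
struct elem {
	atomic_uint ref_cnt;
	void *obj;
};

/* Take a reference only if the count is still nonzero. */
static bool ref_get_unless_zero(struct elem *e)
{
	unsigned int old = atomic_load(&e->ref_cnt);

	while (old != 0) {
		if (atomic_compare_exchange_weak(&e->ref_cnt, &old, old + 1))
			return true;
		/* a failed exchange reloaded old; retry with the new value */
	}

	return false;
}

/* Drop a reference; returns true if this was the last one. */
static bool ref_put(struct elem *e)
{
	return atomic_fetch_sub(&e->ref_cnt, 1) == 1;
}

/* Tiny fixed table standing in for the pool's xarray. */
#define TABLE_SIZE 8
static struct elem *table[TABLE_SIZE];

/* Return the object at index with a reference held, or NULL. */
static void *lookup(unsigned int index)
{
	struct elem *e;

	if (index >= TABLE_SIZE)
		return NULL;

	e = table[index];
	if (e && ref_get_unless_zero(e))
		return e->obj;

	return NULL;
}
```

An element whose count has reached zero is still present in the table
until it is erased, but lookup() refuses to hand it out, which is the
property that keeps a lookup from racing with object destruction.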

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe.c       |  87 ++-------
 drivers/infiniband/sw/rxe/rxe_mr.c    |   1 -
 drivers/infiniband/sw/rxe/rxe_mw.c    |   8 -
 drivers/infiniband/sw/rxe/rxe_pool.c  | 244 ++++++++++----------------
 drivers/infiniband/sw/rxe/rxe_pool.h  |  43 ++---
 drivers/infiniband/sw/rxe/rxe_verbs.c |  12 --
 6 files changed, 118 insertions(+), 277 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
index fce3994d8f7a..29e2b93f6d7e 100644
--- a/drivers/infiniband/sw/rxe/rxe.c
+++ b/drivers/infiniband/sw/rxe/rxe.c
@@ -114,83 +114,26 @@ static void rxe_init_ports(struct rxe_dev *rxe)
 }
 
 /* init pools of managed objects */
-static int rxe_init_pools(struct rxe_dev *rxe)
+static void rxe_init_pools(struct rxe_dev *rxe)
 {
-	int err;
-
-	err = rxe_pool_init(rxe, &rxe->uc_pool, RXE_TYPE_UC,
-			    rxe->max_ucontext);
-	if (err)
-		goto err1;
-
-	err = rxe_pool_init(rxe, &rxe->pd_pool, RXE_TYPE_PD,
-			    rxe->attr.max_pd);
-	if (err)
-		goto err2;
-
-	err = rxe_pool_init(rxe, &rxe->ah_pool, RXE_TYPE_AH,
-			    rxe->attr.max_ah);
-	if (err)
-		goto err3;
-
-	err = rxe_pool_init(rxe, &rxe->srq_pool, RXE_TYPE_SRQ,
-			    rxe->attr.max_srq);
-	if (err)
-		goto err4;
-
-	err = rxe_pool_init(rxe, &rxe->qp_pool, RXE_TYPE_QP,
-			    rxe->attr.max_qp);
-	if (err)
-		goto err5;
-
-	err = rxe_pool_init(rxe, &rxe->cq_pool, RXE_TYPE_CQ,
-			    rxe->attr.max_cq);
-	if (err)
-		goto err6;
-
-	err = rxe_pool_init(rxe, &rxe->mr_pool, RXE_TYPE_MR,
-			    rxe->attr.max_mr);
-	if (err)
-		goto err7;
-
-	err = rxe_pool_init(rxe, &rxe->mw_pool, RXE_TYPE_MW,
-			    rxe->attr.max_mw);
-	if (err)
-		goto err8;
-
-	return 0;
-
-err8:
-	rxe_pool_cleanup(&rxe->mr_pool);
-err7:
-	rxe_pool_cleanup(&rxe->cq_pool);
-err6:
-	rxe_pool_cleanup(&rxe->qp_pool);
-err5:
-	rxe_pool_cleanup(&rxe->srq_pool);
-err4:
-	rxe_pool_cleanup(&rxe->ah_pool);
-err3:
-	rxe_pool_cleanup(&rxe->pd_pool);
-err2:
-	rxe_pool_cleanup(&rxe->uc_pool);
-err1:
-	return err;
+	rxe_pool_init(rxe, &rxe->uc_pool, RXE_TYPE_UC, rxe->max_ucontext);
+	rxe_pool_init(rxe, &rxe->pd_pool, RXE_TYPE_PD, rxe->attr.max_pd);
+	rxe_pool_init(rxe, &rxe->ah_pool, RXE_TYPE_AH, rxe->attr.max_ah);
+	rxe_pool_init(rxe, &rxe->srq_pool, RXE_TYPE_SRQ, rxe->attr.max_srq);
+	rxe_pool_init(rxe, &rxe->qp_pool, RXE_TYPE_QP, rxe->attr.max_qp);
+	rxe_pool_init(rxe, &rxe->cq_pool, RXE_TYPE_CQ, rxe->attr.max_cq);
+	rxe_pool_init(rxe, &rxe->mr_pool, RXE_TYPE_MR, rxe->attr.max_mr);
+	rxe_pool_init(rxe, &rxe->mw_pool, RXE_TYPE_MW, rxe->attr.max_mw);
 }
 
 /* initialize rxe device state */
-static int rxe_init(struct rxe_dev *rxe)
+static void rxe_init(struct rxe_dev *rxe)
 {
-	int err;
-
 	/* init default device parameters */
 	rxe_init_device_param(rxe);
 
 	rxe_init_ports(rxe);
-
-	err = rxe_init_pools(rxe);
-	if (err)
-		return err;
+	rxe_init_pools(rxe);
 
 	/* init pending mmap list */
 	spin_lock_init(&rxe->mmap_offset_lock);
@@ -202,8 +145,6 @@ static int rxe_init(struct rxe_dev *rxe)
 	rxe->mcg_tree = RB_ROOT;
 
 	mutex_init(&rxe->usdev_lock);
-
-	return 0;
 }
 
 void rxe_set_mtu(struct rxe_dev *rxe, unsigned int ndev_mtu)
@@ -225,11 +166,7 @@ void rxe_set_mtu(struct rxe_dev *rxe, unsigned int ndev_mtu)
  */
 int rxe_add(struct rxe_dev *rxe, unsigned int mtu, const char *ibdev_name)
 {
-	int err;
-
-	err = rxe_init(rxe);
-	if (err)
-		return err;
+	rxe_init(rxe);
 
 	rxe_set_mtu(rxe, mtu);
 
diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index 453ef3c9d535..35628b8a00b4 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -691,7 +691,6 @@ int rxe_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata)
 
 	mr->state = RXE_MR_STATE_INVALID;
 	rxe_drop_ref(mr_pd(mr));
-	rxe_drop_index(mr);
 	rxe_drop_ref(mr);
 
 	return 0;
diff --git a/drivers/infiniband/sw/rxe/rxe_mw.c b/drivers/infiniband/sw/rxe/rxe_mw.c
index 32dd8c0b8b9e..7df36c40eec2 100644
--- a/drivers/infiniband/sw/rxe/rxe_mw.c
+++ b/drivers/infiniband/sw/rxe/rxe_mw.c
@@ -20,7 +20,6 @@ int rxe_alloc_mw(struct ib_mw *ibmw, struct ib_udata *udata)
 		return ret;
 	}
 
-	rxe_add_index(mw);
 	mw->rkey = ibmw->rkey = (mw->elem.index << 8) | rxe_get_next_key(-1);
 	mw->state = (mw->ibmw.type == IB_MW_TYPE_2) ?
 			RXE_MW_STATE_FREE : RXE_MW_STATE_VALID;
@@ -329,10 +328,3 @@ struct rxe_mw *rxe_lookup_mw(struct rxe_qp *qp, int access, u32 rkey)
 
 	return mw;
 }
-
-void rxe_mw_cleanup(struct rxe_pool_elem *elem)
-{
-	struct rxe_mw *mw = container_of(elem, typeof(*mw), elem);
-
-	rxe_drop_index(mw);
-}
diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
index 2e2d01a73639..c298467337b8 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.c
+++ b/drivers/infiniband/sw/rxe/rxe_pool.c
@@ -21,17 +21,20 @@ static const struct rxe_type_info {
 		.name		= "rxe-uc",
 		.size		= sizeof(struct rxe_ucontext),
 		.elem_offset	= offsetof(struct rxe_ucontext, elem),
+		.min_index	= 1,
+		.max_index	= UINT_MAX,
 	},
 	[RXE_TYPE_PD] = {
 		.name		= "rxe-pd",
 		.size		= sizeof(struct rxe_pd),
 		.elem_offset	= offsetof(struct rxe_pd, elem),
+		.min_index	= 1,
+		.max_index	= UINT_MAX,
 	},
 	[RXE_TYPE_AH] = {
 		.name		= "rxe-ah",
 		.size		= sizeof(struct rxe_ah),
 		.elem_offset	= offsetof(struct rxe_ah, elem),
-		.flags		= RXE_POOL_INDEX,
 		.min_index	= RXE_MIN_AH_INDEX,
 		.max_index	= RXE_MAX_AH_INDEX,
 	},
@@ -39,7 +42,6 @@ static const struct rxe_type_info {
 		.name		= "rxe-srq",
 		.size		= sizeof(struct rxe_srq),
 		.elem_offset	= offsetof(struct rxe_srq, elem),
-		.flags		= RXE_POOL_INDEX,
 		.min_index	= RXE_MIN_SRQ_INDEX,
 		.max_index	= RXE_MAX_SRQ_INDEX,
 	},
@@ -48,7 +50,6 @@ static const struct rxe_type_info {
 		.size		= sizeof(struct rxe_qp),
 		.elem_offset	= offsetof(struct rxe_qp, elem),
 		.cleanup	= rxe_qp_cleanup,
-		.flags		= RXE_POOL_INDEX,
 		.min_index	= RXE_MIN_QP_INDEX,
 		.max_index	= RXE_MAX_QP_INDEX,
 	},
@@ -57,13 +58,15 @@ static const struct rxe_type_info {
 		.size		= sizeof(struct rxe_cq),
 		.elem_offset	= offsetof(struct rxe_cq, elem),
 		.cleanup	= rxe_cq_cleanup,
+		.min_index	= 1,
+		.max_index	= UINT_MAX,
 	},
 	[RXE_TYPE_MR] = {
 		.name		= "rxe-mr",
 		.size		= sizeof(struct rxe_mr),
 		.elem_offset	= offsetof(struct rxe_mr, elem),
 		.cleanup	= rxe_mr_cleanup,
-		.flags		= RXE_POOL_INDEX | RXE_POOL_ALLOC,
+		.flags		= RXE_POOL_ALLOC,
 		.min_index	= RXE_MIN_MR_INDEX,
 		.max_index	= RXE_MAX_MR_INDEX,
 	},
@@ -71,44 +74,15 @@ static const struct rxe_type_info {
 		.name		= "rxe-mw",
 		.size		= sizeof(struct rxe_mw),
 		.elem_offset	= offsetof(struct rxe_mw, elem),
-		.cleanup	= rxe_mw_cleanup,
-		.flags		= RXE_POOL_INDEX,
 		.min_index	= RXE_MIN_MW_INDEX,
 		.max_index	= RXE_MAX_MW_INDEX,
 	},
 };
 
-static int rxe_pool_init_index(struct rxe_pool *pool, u32 max, u32 min)
-{
-	int err = 0;
-
-	if ((max - min + 1) < pool->max_elem) {
-		pr_warn("not enough indices for max_elem\n");
-		err = -EINVAL;
-		goto out;
-	}
-
-	pool->index.max_index = max;
-	pool->index.min_index = min;
-
-	pool->index.table = bitmap_zalloc(max - min + 1, GFP_KERNEL);
-	if (!pool->index.table) {
-		err = -ENOMEM;
-		goto out;
-	}
-
-out:
-	return err;
-}
-
-int rxe_pool_init(
-	struct rxe_dev		*rxe,
-	struct rxe_pool		*pool,
-	enum rxe_elem_type	type,
-	unsigned int		max_elem)
+void rxe_pool_init(struct rxe_dev *rxe, struct rxe_pool *pool,
+		   enum rxe_elem_type type, unsigned int max_elem)
 {
 	const struct rxe_type_info *info = &rxe_type_info[type];
-	int			err = 0;
 
 	memset(pool, 0, sizeof(*pool));
 
@@ -123,114 +97,54 @@ int rxe_pool_init(
 
 	atomic_set(&pool->num_elem, 0);
 
-	rwlock_init(&pool->pool_lock);
-
-	if (pool->flags & RXE_POOL_INDEX) {
-		pool->index.tree = RB_ROOT;
-		err = rxe_pool_init_index(pool, info->max_index,
-					  info->min_index);
-		if (err)
-			goto out;
-	}
-
-out:
-	return err;
+	xa_init_flags(&pool->xa, XA_FLAGS_ALLOC);
+	pool->limit.max = info->max_index;
+	pool->limit.min = info->min_index;
 }
 
 void rxe_pool_cleanup(struct rxe_pool *pool)
 {
-	if (atomic_read(&pool->num_elem) > 0)
-		pr_warn("%s pool destroyed with unfree'd elem\n",
-			pool->name);
-
-	if (pool->flags & RXE_POOL_INDEX)
-		bitmap_free(pool->index.table);
-}
-
-static u32 alloc_index(struct rxe_pool *pool)
-{
-	u32 index;
-	u32 range = pool->index.max_index - pool->index.min_index + 1;
-
-	index = find_next_zero_bit(pool->index.table, range, pool->index.last);
-	if (index >= range)
-		index = find_first_zero_bit(pool->index.table, range);
-
-	WARN_ON_ONCE(index >= range);
-	set_bit(index, pool->index.table);
-	pool->index.last = index;
-	return index + pool->index.min_index;
-}
-
-static int rxe_insert_index(struct rxe_pool *pool, struct rxe_pool_elem *new)
-{
-	struct rb_node **link = &pool->index.tree.rb_node;
-	struct rb_node *parent = NULL;
 	struct rxe_pool_elem *elem;
-
-	while (*link) {
-		parent = *link;
-		elem = rb_entry(parent, struct rxe_pool_elem, index_node);
-
-		if (elem->index == new->index) {
-			pr_warn("element already exists!\n");
-			return -EINVAL;
+	unsigned long index = 0;
+	unsigned long max = ULONG_MAX;
+	unsigned int elem_count = 0;
+	unsigned int obj_count = 0;
+
+	do {
+		elem = xa_find(&pool->xa, &index, max, XA_PRESENT);
+		if (elem) {
+			elem_count++;
+			xa_erase(&pool->xa, index);
+			if (pool->flags & RXE_POOL_ALLOC) {
+				kfree(elem->obj);
+				obj_count++;
+			}
 		}
+	} while (elem);
 
-		if (elem->index > new->index)
-			link = &(*link)->rb_left;
-		else
-			link = &(*link)->rb_right;
-	}
-
-	rb_link_node(&new->index_node, parent, link);
-	rb_insert_color(&new->index_node, &pool->index.tree);
-
-	return 0;
-}
-
-int __rxe_add_index(struct rxe_pool_elem *elem)
-{
-	struct rxe_pool *pool = elem->pool;
-	unsigned long flags;
-	int err;
-
-	write_lock_irqsave(&pool->pool_lock, flags);
-	elem->index = alloc_index(pool);
-	err = rxe_insert_index(pool, elem);
-	write_unlock_irqrestore(&pool->pool_lock, flags);
-
-	return err;
-}
-
-void __rxe_drop_index(struct rxe_pool_elem *elem)
-{
-	struct rxe_pool *pool = elem->pool;
-	unsigned long flags;
-
-	write_lock_irqsave(&pool->pool_lock, flags);
-	clear_bit(elem->index - pool->index.min_index, pool->index.table);
-	rb_erase(&elem->index_node, &pool->index.tree);
-	write_unlock_irqrestore(&pool->pool_lock, flags);
+	if (elem_count || obj_count)
+		pr_warn("Freed %d indices & %d objects from pool %s\n",
+			elem_count, obj_count, pool->name + 4);
 }
 
 void *rxe_alloc(struct rxe_pool *pool)
 {
 	struct rxe_pool_elem *elem;
 	void *obj;
+	int err;
 
 	if (!(pool->flags & RXE_POOL_ALLOC)) {
-		pr_warn_once("%s: Pool %s must call rxe_add_to_pool\n",
+		pr_warn_once("%s: pool %s must call rxe_add_to_pool\n",
 				__func__, pool->name);
 		return NULL;
 	}
 
 	if (atomic_inc_return(&pool->num_elem) > pool->max_elem)
-		goto out_cnt;
+		goto err_cnt;
 
 	obj = kzalloc(pool->elem_size, GFP_KERNEL);
 	if (!obj)
-		goto out_cnt;
+		goto err_cnt;
 
 	elem = (struct rxe_pool_elem *)((u8 *)obj + pool->elem_offset);
 
@@ -238,42 +152,77 @@ void *rxe_alloc(struct rxe_pool *pool)
 	elem->obj = obj;
 	kref_init(&elem->ref_cnt);
 
+	err = xa_alloc_cyclic_bh(&pool->xa, &elem->index, elem, pool->limit,
+			&pool->next, GFP_KERNEL);
+	if (err)
+		goto err_free;
+
 	return obj;
 
-out_cnt:
+err_free:
+	kfree(obj);
+err_cnt:
 	atomic_dec(&pool->num_elem);
 	return NULL;
 }
 
 int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem)
 {
+	int err;
+
 	if (pool->flags & RXE_POOL_ALLOC) {
-		pr_warn_once("%s: Pool %s must call rxe_alloc\n",
+		pr_warn_once("%s: pool %s must call rxe_alloc\n",
 				__func__, pool->name);
 		return -EINVAL;
 	}
 
 	if (atomic_inc_return(&pool->num_elem) > pool->max_elem)
-		goto out_cnt;
+		goto err_cnt;
 
 	elem->pool = pool;
 	elem->obj = (u8 *)elem - pool->elem_offset;
 	kref_init(&elem->ref_cnt);
 
+	err = xa_alloc_cyclic_bh(&pool->xa, &elem->index, elem, pool->limit,
+			&pool->next, GFP_KERNEL);
+	if (err)
+		goto err_cnt;
+
 	return 0;
 
-out_cnt:
+err_cnt:
 	atomic_dec(&pool->num_elem);
 	return -EINVAL;
 }
 
-void rxe_elem_release(struct kref *kref)
+void *rxe_pool_get_index(struct rxe_pool *pool, u32 index)
+{
+	struct rxe_pool_elem *elem;
+	unsigned long flags;
+	void *obj;
+
+	spin_lock_irqsave(&pool->xa.xa_lock, flags);
+	elem = xa_load(&pool->xa, index);
+	if (elem && kref_get_unless_zero(&elem->ref_cnt))
+		obj = elem->obj;
+	else
+		obj = NULL;
+	spin_unlock_irqrestore(&pool->xa.xa_lock, flags);
+
+	return obj;
+}
+
+static void rxe_elem_release(struct kref *kref, unsigned long flags)
+	__releases(&pool->xa.xa_lock)
 {
 	struct rxe_pool_elem *elem =
 		container_of(kref, struct rxe_pool_elem, ref_cnt);
 	struct rxe_pool *pool = elem->pool;
 	void *obj;
 
+	__xa_erase(&pool->xa, elem->index);
+
+	spin_unlock_irqrestore(&pool->xa.xa_lock, flags);
 	if (pool->cleanup)
 		pool->cleanup(elem);
 
@@ -285,34 +234,31 @@ void rxe_elem_release(struct kref *kref)
 	atomic_dec(&pool->num_elem);
 }
 
-void *rxe_pool_get_index(struct rxe_pool *pool, u32 index)
+int __rxe_add_ref(struct rxe_pool_elem *elem)
 {
-	struct rxe_pool_elem *elem;
-	struct rb_node *node;
-	unsigned long flags;
-	void *obj;
-
-	read_lock_irqsave(&pool->pool_lock, flags);
-	node = pool->index.tree.rb_node;
+	return kref_get_unless_zero(&elem->ref_cnt);
+}
 
-	while (node) {
-		elem = rb_entry(node, struct rxe_pool_elem, index_node);
+/* local copy of kref_put_lock_irqsave same as kref_put_lock
+ * except for _irqsave locks
+ */
+static int kref_put_lock_irqsave(struct kref *kref,
+		 void (*release)(struct kref *kref, unsigned long flags),
+		 spinlock_t *lock)
+{
+	unsigned long flags;
 
-		if (elem->index > index)
-			node = node->rb_left;
-		else if (elem->index < index)
-			node = node->rb_right;
-		else
-			break;
+	if (refcount_dec_and_lock_irqsave(&kref->refcount, lock, &flags)) {
+		release(kref, flags);
+		return 1;
 	}
+	return 0;
+}
 
-	if (node) {
-		kref_get(&elem->ref_cnt);
-		obj = elem->obj;
-	} else {
-		obj = NULL;
-	}
-	read_unlock_irqrestore(&pool->pool_lock, flags);
+int __rxe_drop_ref(struct rxe_pool_elem *elem)
+{
+	struct rxe_pool *pool = elem->pool;
 
-	return obj;
+	return kref_put_lock_irqsave(&elem->ref_cnt, rxe_elem_release,
+			&pool->xa.xa_lock);
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_pool.h b/drivers/infiniband/sw/rxe/rxe_pool.h
index a8582ad85b1e..422987c90cb9 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.h
+++ b/drivers/infiniband/sw/rxe/rxe_pool.h
@@ -8,8 +8,7 @@
 #define RXE_POOL_H
 
 enum rxe_pool_flags {
-	RXE_POOL_INDEX		= BIT(1),
-	RXE_POOL_ALLOC		= BIT(2),
+	RXE_POOL_ALLOC		= BIT(1),
 };
 
 enum rxe_elem_type {
@@ -29,16 +28,12 @@ struct rxe_pool_elem {
 	void			*obj;
 	struct kref		ref_cnt;
 	struct list_head	list;
-
-	/* only used if indexed */
-	struct rb_node		index_node;
 	u32			index;
 };
 
 struct rxe_pool {
 	struct rxe_dev		*rxe;
 	const char		*name;
-	rwlock_t		pool_lock; /* protects pool add/del/search */
 	void			(*cleanup)(struct rxe_pool_elem *elem);
 	enum rxe_pool_flags	flags;
 	enum rxe_elem_type	type;
@@ -48,21 +43,16 @@ struct rxe_pool {
 	size_t			elem_size;
 	size_t			elem_offset;
 
-	/* only used if indexed */
-	struct {
-		struct rb_root		tree;
-		unsigned long		*table;
-		u32			last;
-		u32			max_index;
-		u32			min_index;
-	} index;
+	struct xarray		xa;
+	struct xa_limit		limit;
+	u32			next;
 };
 
 /* initialize a pool of objects with given limit on
  * number of elements. gets parameters from rxe_type_info
  * pool elements will be allocated out of a slab cache
  */
-int rxe_pool_init(struct rxe_dev *rxe, struct rxe_pool *pool,
+void rxe_pool_init(struct rxe_dev *rxe, struct rxe_pool *pool,
 		  enum rxe_elem_type type, u32 max_elem);
 
 /* free resources from object pool */
@@ -76,29 +66,18 @@ int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem);
 
 #define rxe_add_to_pool(pool, obj) __rxe_add_to_pool(pool, &(obj)->elem)
 
-/* assign an index to an indexed object and insert object into
- * pool's rb tree
- */
-int __rxe_add_index(struct rxe_pool_elem *elem);
-
-#define rxe_add_index(obj) __rxe_add_index(&(obj)->elem)
-
-/* drop an index and remove object from rb tree */
-void __rxe_drop_index(struct rxe_pool_elem *elem);
-
-#define rxe_drop_index(obj) __rxe_drop_index(&(obj)->elem)
-
 /* lookup an indexed object from index. takes a reference on object */
 void *rxe_pool_get_index(struct rxe_pool *pool, u32 index);
 
-/* cleanup an object when all references are dropped */
-void rxe_elem_release(struct kref *kref);
-
 /* take a reference on an object */
-#define rxe_add_ref(obj) kref_get(&(obj)->elem.ref_cnt)
+int __rxe_add_ref(struct rxe_pool_elem *elem);
+
+#define rxe_add_ref(obj) __rxe_add_ref(&(obj)->elem)
 
 /* drop a reference on an object */
-#define rxe_drop_ref(obj) kref_put(&(obj)->elem.ref_cnt, rxe_elem_release)
+int __rxe_drop_ref(struct rxe_pool_elem *elem);
+
+#define rxe_drop_ref(obj) __rxe_drop_ref(&(obj)->elem)
 
 #define rxe_read_ref(obj) kref_read(&(obj)->elem.ref_cnt)
 
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.c b/drivers/infiniband/sw/rxe/rxe_verbs.c
index 80df9a8f71a1..f0c5715ac500 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.c
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.c
@@ -181,7 +181,6 @@ static int rxe_create_ah(struct ib_ah *ibah,
 		return err;
 
 	/* create index > 0 */
-	rxe_add_index(ah);
 	ah->ah_num = ah->elem.index;
 
 	if (uresp) {
@@ -189,7 +188,6 @@ static int rxe_create_ah(struct ib_ah *ibah,
 		err = copy_to_user(&uresp->ah_num, &ah->ah_num,
 					 sizeof(uresp->ah_num));
 		if (err) {
-			rxe_drop_index(ah);
 			rxe_drop_ref(ah);
 			return -EFAULT;
 		}
@@ -230,7 +228,6 @@ static int rxe_destroy_ah(struct ib_ah *ibah, u32 flags)
 {
 	struct rxe_ah *ah = to_rah(ibah);
 
-	rxe_drop_index(ah);
 	rxe_drop_ref(ah);
 	return 0;
 }
@@ -438,7 +435,6 @@ static int rxe_create_qp(struct ib_qp *ibqp, struct ib_qp_init_attr *init,
 	if (err)
 		return err;
 
-	rxe_add_index(qp);
 	err = rxe_qp_from_init(rxe, qp, pd, init, uresp, ibqp->pd, udata);
 	if (err)
 		goto qp_init;
@@ -446,7 +442,6 @@ static int rxe_create_qp(struct ib_qp *ibqp, struct ib_qp_init_attr *init,
 	return 0;
 
 qp_init:
-	rxe_drop_index(qp);
 	rxe_drop_ref(qp);
 	return err;
 }
@@ -501,7 +496,6 @@ static int rxe_destroy_qp(struct ib_qp *ibqp, struct ib_udata *udata)
 		return ret;
 
 	rxe_qp_destroy(qp);
-	rxe_drop_index(qp);
 	rxe_drop_ref(qp);
 	return 0;
 }
@@ -908,7 +902,6 @@ static struct ib_mr *rxe_get_dma_mr(struct ib_pd *ibpd, int access)
 	if (!mr)
 		return ERR_PTR(-ENOMEM);
 
-	rxe_add_index(mr);
 	rxe_add_ref(pd);
 	rxe_mr_init_dma(pd, access, mr);
 
@@ -932,7 +925,6 @@ static struct ib_mr *rxe_reg_user_mr(struct ib_pd *ibpd,
 		goto err2;
 	}
 
-	rxe_add_index(mr);
 
 	rxe_add_ref(pd);
 
@@ -944,7 +936,6 @@ static struct ib_mr *rxe_reg_user_mr(struct ib_pd *ibpd,
 
 err3:
 	rxe_drop_ref(pd);
-	rxe_drop_index(mr);
 	rxe_drop_ref(mr);
 err2:
 	return ERR_PTR(err);
@@ -967,8 +958,6 @@ static struct ib_mr *rxe_alloc_mr(struct ib_pd *ibpd, enum ib_mr_type mr_type,
 		goto err1;
 	}
 
-	rxe_add_index(mr);
-
 	rxe_add_ref(pd);
 
 	err = rxe_mr_init_fast(pd, max_num_sg, mr);
@@ -979,7 +968,6 @@ static struct ib_mr *rxe_alloc_mr(struct ib_pd *ibpd, enum ib_mr_type mr_type,
 
 err2:
 	rxe_drop_ref(pd);
-	rxe_drop_index(mr);
 	rxe_drop_ref(mr);
 err1:
 	return ERR_PTR(err);
-- 
2.32.0



* [PATCH for-next v10 05/11] RDMA/rxe: Stop lookup of partially built objects
  2022-02-25 19:57 [PATCH for-next v10 00/11] Fix race conditions in rxe_pool Bob Pearson
                   ` (3 preceding siblings ...)
  2022-02-25 19:57 ` [PATCH for-next v10 04/11] RDMA/rxe: Replace red-black trees by xarrays Bob Pearson
@ 2022-02-25 19:57 ` Bob Pearson
  2022-02-28 17:01   ` Jason Gunthorpe
  2022-02-25 19:57 ` [PATCH for-next v10 06/11] RDMA/rxe: Add wait_for_completion to pool objects Bob Pearson
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 21+ messages in thread
From: Bob Pearson @ 2022-02-25 19:57 UTC (permalink / raw)
  To: jgg, zyjzyj2000, linux-rdma; +Cc: Bob Pearson

Currently the rdma_rxe driver has a security weakness: it adds
partially initialized objects to the pool indices, which allows
external actors to gain access to them by sending packets that refer
to their index (e.g. qpn, rkey, etc.).

This patch adds a member to the pool element struct indicating whether
the object may be looked up by its index. This variable is set only
after the object is completely created and is cleared as early as
possible when the object is destroyed.

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
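The gating described in the commit message above can be sketched in
userspace C. This is a minimal model, not the driver code: struct elem,
struct pool, get_unless_zero() and pool_get_index() are hypothetical
stand-ins for struct rxe_pool_elem, the pool xarray,
kref_get_unless_zero() and rxe_pool_get_index(), and locking is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define POOL_SIZE 8

/* Hypothetical userspace stand-in for struct rxe_pool_elem. */
struct elem {
	int refcnt;	/* models the kref; 0 means the object is going away */
	bool enabled;	/* set only once the object is fully constructed */
	void *obj;
};

/* Hypothetical stand-in for the pool's xarray: a fixed table of slots. */
struct pool {
	struct elem *slot[POOL_SIZE];
};

/* Models kref_get_unless_zero(): take a reference unless the count is 0. */
static bool get_unless_zero(struct elem *e)
{
	if (e->refcnt == 0)
		return false;
	e->refcnt++;
	return true;
}

/* Models the lookup after this patch: disabled entries are not returned. */
static void *pool_get_index(struct pool *p, unsigned int index)
{
	struct elem *e;

	assert(p != NULL);
	if (index >= POOL_SIZE)
		return NULL;

	e = p->slot[index];
	if (e && e->enabled && get_unless_zero(e))
		return e->obj;

	return NULL;
}
```

In this model an element that already occupies a slot stays invisible to
pool_get_index() until enabled is set, and no reference is taken on a
failed lookup.
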
 drivers/infiniband/sw/rxe/rxe_mr.c    |  1 +
 drivers/infiniband/sw/rxe/rxe_mw.c    |  3 +++
 drivers/infiniband/sw/rxe/rxe_pool.c  |  2 +-
 drivers/infiniband/sw/rxe/rxe_pool.h  |  5 +++++
 drivers/infiniband/sw/rxe/rxe_verbs.c | 22 +++++++++++++++++++++-
 5 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index 35628b8a00b4..157cfb710a7e 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -689,6 +689,7 @@ int rxe_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata)
 		return -EINVAL;
 	}
 
+	rxe_disable(mr);
 	mr->state = RXE_MR_STATE_INVALID;
 	rxe_drop_ref(mr_pd(mr));
 	rxe_drop_ref(mr);
diff --git a/drivers/infiniband/sw/rxe/rxe_mw.c b/drivers/infiniband/sw/rxe/rxe_mw.c
index 7df36c40eec2..e4d207f24eba 100644
--- a/drivers/infiniband/sw/rxe/rxe_mw.c
+++ b/drivers/infiniband/sw/rxe/rxe_mw.c
@@ -25,6 +25,8 @@ int rxe_alloc_mw(struct ib_mw *ibmw, struct ib_udata *udata)
 			RXE_MW_STATE_FREE : RXE_MW_STATE_VALID;
 	spin_lock_init(&mw->lock);
 
+	rxe_enable(mw);
+
 	return 0;
 }
 
@@ -56,6 +58,7 @@ int rxe_dealloc_mw(struct ib_mw *ibmw)
 	struct rxe_mw *mw = to_rmw(ibmw);
 	struct rxe_pd *pd = to_rpd(ibmw->pd);
 
+	rxe_disable(mw);
 	spin_lock_bh(&mw->lock);
 	rxe_do_dealloc_mw(mw);
 	spin_unlock_bh(&mw->lock);
diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
index c298467337b8..19d8fb77b166 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.c
+++ b/drivers/infiniband/sw/rxe/rxe_pool.c
@@ -203,7 +203,7 @@ void *rxe_pool_get_index(struct rxe_pool *pool, u32 index)
 
 	spin_lock_irqsave(&pool->xa.xa_lock, flags);
 	elem = xa_load(&pool->xa, index);
-	if (elem && kref_get_unless_zero(&elem->ref_cnt))
+	if (elem && elem->enabled && kref_get_unless_zero(&elem->ref_cnt))
 		obj = elem->obj;
 	else
 		obj = NULL;
diff --git a/drivers/infiniband/sw/rxe/rxe_pool.h b/drivers/infiniband/sw/rxe/rxe_pool.h
index 422987c90cb9..b963c300d636 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.h
+++ b/drivers/infiniband/sw/rxe/rxe_pool.h
@@ -29,6 +29,7 @@ struct rxe_pool_elem {
 	struct kref		ref_cnt;
 	struct list_head	list;
 	u32			index;
+	bool			enabled;
 };
 
 struct rxe_pool {
@@ -81,4 +82,8 @@ int __rxe_drop_ref(struct rxe_pool_elem *elem);
 
 #define rxe_read_ref(obj) kref_read(&(obj)->elem.ref_cnt)
 
+#define rxe_enable(obj) ((obj)->elem.enabled = true)
+
+#define rxe_disable(obj) ((obj)->elem.enabled = false)
+
 #endif /* RXE_POOL_H */
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.c b/drivers/infiniband/sw/rxe/rxe_verbs.c
index f0c5715ac500..077be3eb9763 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.c
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.c
@@ -150,6 +150,7 @@ static int rxe_dealloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
 	struct rxe_pd *pd = to_rpd(ibpd);
 
 	rxe_drop_ref(pd);
+
 	return 0;
 }
 
@@ -197,6 +198,9 @@ static int rxe_create_ah(struct ib_ah *ibah,
 	}
 
 	rxe_init_av(init_attr->ah_attr, &ah->av);
+
+	rxe_enable(ah);
+
 	return 0;
 }
 
@@ -228,6 +232,8 @@ static int rxe_destroy_ah(struct ib_ah *ibah, u32 flags)
 {
 	struct rxe_ah *ah = to_rah(ibah);
 
+	rxe_disable(ah);
+
 	rxe_drop_ref(ah);
 	return 0;
 }
@@ -310,6 +316,8 @@ static int rxe_create_srq(struct ib_srq *ibsrq, struct ib_srq_init_attr *init,
 	if (err)
 		goto err2;
 
+	rxe_enable(srq);
+
 	return 0;
 
 err2:
@@ -368,6 +376,8 @@ static int rxe_destroy_srq(struct ib_srq *ibsrq, struct ib_udata *udata)
 {
 	struct rxe_srq *srq = to_rsrq(ibsrq);
 
+	rxe_disable(srq);
+
 	if (srq->rq.queue)
 		rxe_queue_cleanup(srq->rq.queue);
 
@@ -439,6 +449,8 @@ static int rxe_create_qp(struct ib_qp *ibqp, struct ib_qp_init_attr *init,
 	if (err)
 		goto qp_init;
 
+	rxe_enable(qp);
+
 	return 0;
 
 qp_init:
@@ -495,6 +507,7 @@ static int rxe_destroy_qp(struct ib_qp *ibqp, struct ib_udata *udata)
 	if (ret)
 		return ret;
 
+	rxe_disable(qp);
 	rxe_qp_destroy(qp);
 	rxe_drop_ref(qp);
 	return 0;
@@ -807,9 +820,10 @@ static int rxe_destroy_cq(struct ib_cq *ibcq, struct ib_udata *udata)
 {
 	struct rxe_cq *cq = to_rcq(ibcq);
 
-	rxe_cq_disable(cq);
 
+	rxe_cq_disable(cq);
 	rxe_drop_ref(cq);
+
 	return 0;
 }
 
@@ -905,6 +919,8 @@ static struct ib_mr *rxe_get_dma_mr(struct ib_pd *ibpd, int access)
 	rxe_add_ref(pd);
 	rxe_mr_init_dma(pd, access, mr);
 
+	rxe_enable(mr);
+
 	return &mr->ibmr;
 }
 
@@ -932,6 +948,8 @@ static struct ib_mr *rxe_reg_user_mr(struct ib_pd *ibpd,
 	if (err)
 		goto err3;
 
+	rxe_enable(mr);
+
 	return &mr->ibmr;
 
 err3:
@@ -964,6 +982,8 @@ static struct ib_mr *rxe_alloc_mr(struct ib_pd *ibpd, enum ib_mr_type mr_type,
 	if (err)
 		goto err2;
 
+	rxe_enable(mr);
+
 	return &mr->ibmr;
 
 err2:
-- 
2.32.0



* [PATCH for-next v10 06/11] RDMA/rxe: Add wait_for_completion to pool objects
  2022-02-25 19:57 [PATCH for-next v10 00/11] Fix race conditions in rxe_pool Bob Pearson
                   ` (4 preceding siblings ...)
  2022-02-25 19:57 ` [PATCH for-next v10 05/11] RDMA/rxe: Stop lookup of partially built objects Bob Pearson
@ 2022-02-25 19:57 ` Bob Pearson
  2022-02-28 17:05   ` Jason Gunthorpe
  2022-02-25 19:57 ` [PATCH for-next v10 07/11] RDMA/rxe: Fix ref error in rxe_av.c Bob Pearson
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 21+ messages in thread
From: Bob Pearson @ 2022-02-25 19:57 UTC (permalink / raw)
  To: jgg, zyjzyj2000, linux-rdma; +Cc: Bob Pearson

With reference counting, the deletion of an object can be deferred
until some other holder drops its reference. The destroy verbs can
then return to rdma-core while references to the object are still
outstanding. Adding wait_for_completion in this path prevents this.

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
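The drop-then-wait pattern described above can be modelled in userspace
with POSIX threads. This is only a sketch under stated assumptions:
struct completion here is a hypothetical mutex/condvar stand-in for the
kernel type, elem_put() and elem_put_wait() are hypothetical names
standing in for __rxe_drop_ref() and __rxe_drop_wait(), and the timeout
is given in milliseconds rather than jiffies.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical userspace stand-in for the kernel's struct completion. */
struct completion {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool done;
};

/* Hypothetical stand-in for struct rxe_pool_elem. */
struct elem {
	pthread_mutex_t ref_lock;
	int refcnt;
	struct completion complete;
	bool cleaned_up;
};

static void elem_init(struct elem *e)
{
	pthread_mutex_init(&e->ref_lock, NULL);
	pthread_mutex_init(&e->complete.lock, NULL);
	pthread_cond_init(&e->complete.cond, NULL);
	e->complete.done = false;
	e->refcnt = 1;
	e->cleaned_up = false;
}

/* Models the release path: the last put only signals the completion. */
static void elem_put(struct elem *e)
{
	bool last;

	pthread_mutex_lock(&e->ref_lock);
	assert(e->refcnt > 0);
	last = (--e->refcnt == 0);
	pthread_mutex_unlock(&e->ref_lock);

	if (last) {
		pthread_mutex_lock(&e->complete.lock);
		e->complete.done = true;
		pthread_cond_broadcast(&e->complete.cond);
		pthread_mutex_unlock(&e->complete.lock);
	}
}

/* Drop our reference, wait up to timeout_ms for the remaining holders,
 * then clean up. Returns true if the last reference was dropped in time.
 */
static bool elem_put_wait(struct elem *e, long timeout_ms)
{
	struct timespec deadline;
	bool done;
	int rc = 0;

	elem_put(e);

	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec += timeout_ms / 1000;
	deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
	if (deadline.tv_nsec >= 1000000000L) {
		deadline.tv_sec++;
		deadline.tv_nsec -= 1000000000L;
	}

	pthread_mutex_lock(&e->complete.lock);
	while (!e->complete.done && rc == 0)
		rc = pthread_cond_timedwait(&e->complete.cond,
					    &e->complete.lock, &deadline);
	done = e->complete.done;
	pthread_mutex_unlock(&e->complete.lock);

	/* As in the patch, cleanup proceeds even if the wait timed out. */
	e->cleaned_up = true;

	return done;
}
```

The point of the split is that the release callback no longer frees
anything; the destroying caller does the cleanup after the wait, so the
destroy verb does not return while the teardown is still pending.
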
 drivers/infiniband/sw/rxe/rxe_mr.c    |  2 +-
 drivers/infiniband/sw/rxe/rxe_mw.c    |  2 +-
 drivers/infiniband/sw/rxe/rxe_pool.c  | 49 +++++++++++++++++++++------
 drivers/infiniband/sw/rxe/rxe_pool.h  |  6 ++++
 drivers/infiniband/sw/rxe/rxe_verbs.c | 23 +++++++------
 5 files changed, 59 insertions(+), 23 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index 157cfb710a7e..245785def608 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -692,7 +692,7 @@ int rxe_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata)
 	rxe_disable(mr);
 	mr->state = RXE_MR_STATE_INVALID;
 	rxe_drop_ref(mr_pd(mr));
-	rxe_drop_ref(mr);
+	rxe_drop_wait(mr);
 
 	return 0;
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_mw.c b/drivers/infiniband/sw/rxe/rxe_mw.c
index e4d207f24eba..c45d832d7427 100644
--- a/drivers/infiniband/sw/rxe/rxe_mw.c
+++ b/drivers/infiniband/sw/rxe/rxe_mw.c
@@ -63,8 +63,8 @@ int rxe_dealloc_mw(struct ib_mw *ibmw)
 	rxe_do_dealloc_mw(mw);
 	spin_unlock_bh(&mw->lock);
 
-	rxe_drop_ref(mw);
 	rxe_drop_ref(pd);
+	rxe_drop_wait(mw);
 
 	return 0;
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
index 19d8fb77b166..1d1e10290991 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.c
+++ b/drivers/infiniband/sw/rxe/rxe_pool.c
@@ -6,6 +6,8 @@
 
 #include "rxe.h"
 
+#define RXE_POOL_TIMEOUT	(200)
+#define RXE_MAX_POOL_TIMEOUTS	(3)
 #define RXE_POOL_ALIGN		(16)
 
 static const struct rxe_type_info {
@@ -123,7 +125,7 @@ void rxe_pool_cleanup(struct rxe_pool *pool)
 	} while (elem);
 
 	if (elem_count || obj_count)
-		pr_warn("Freed %d indices & %d objects from pool %s\n",
+		pr_warn("Freed %d indices and %d objects from pool %s\n",
 			elem_count, obj_count, pool->name + 4);
 }
 
@@ -151,6 +153,7 @@ void *rxe_alloc(struct rxe_pool *pool)
 	elem->pool = pool;
 	elem->obj = obj;
 	kref_init(&elem->ref_cnt);
+	init_completion(&elem->complete);
 
 	err = xa_alloc_cyclic_bh(&pool->xa, &elem->index, elem, pool->limit,
 			&pool->next, GFP_KERNEL);
@@ -182,6 +185,7 @@ int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem)
 	elem->pool = pool;
 	elem->obj = (u8 *)elem - pool->elem_offset;
 	kref_init(&elem->ref_cnt);
+	init_completion(&elem->complete);
 
 	err = xa_alloc_cyclic_bh(&pool->xa, &elem->index, elem, pool->limit,
 			&pool->next, GFP_KERNEL);
@@ -218,20 +222,12 @@ static void rxe_elem_release(struct kref *kref, unsigned long flags)
 	struct rxe_pool_elem *elem =
 		container_of(kref, struct rxe_pool_elem, ref_cnt);
 	struct rxe_pool *pool = elem->pool;
-	void *obj;
 
 	__xa_erase(&pool->xa, elem->index);
 
 	spin_unlock_irqrestore(&pool->xa.xa_lock, flags);
-	if (pool->cleanup)
-		pool->cleanup(elem);
-
-	if (pool->flags & RXE_POOL_ALLOC) {
-		obj = elem->obj;
-		kfree(obj);
-	}
 
-	atomic_dec(&pool->num_elem);
+	complete(&elem->complete);
 }
 
 int __rxe_add_ref(struct rxe_pool_elem *elem)
@@ -262,3 +258,36 @@ int __rxe_drop_ref(struct rxe_pool_elem *elem)
 	return kref_put_lock_irqsave(&elem->ref_cnt, rxe_elem_release,
 			&pool->xa.xa_lock);
 }
+
+
+int __rxe_drop_wait(struct rxe_pool_elem *elem)
+{
+	struct rxe_pool *pool = elem->pool;
+	static int timeout = RXE_POOL_TIMEOUT;
+	int ret;
+
+	__rxe_drop_ref(elem);
+
+	if (timeout) {
+		ret = wait_for_completion_timeout(&elem->complete, timeout);
+		if (!ret) {
+			pr_warn("Timed out waiting for %s#%d\n",
+				pool->name + 4, elem->index);
+			if (++pool->timeouts == RXE_MAX_POOL_TIMEOUTS) {
+				timeout = 0;
+				pr_warn("Reached max %s timeouts.\n",
+					pool->name + 4);
+			}
+		}
+	}
+
+	if (pool->cleanup)
+		pool->cleanup(elem);
+
+	if (pool->flags & RXE_POOL_ALLOC)
+		kfree(elem->obj);
+
+	atomic_dec(&pool->num_elem);
+
+	return ret;
+}
diff --git a/drivers/infiniband/sw/rxe/rxe_pool.h b/drivers/infiniband/sw/rxe/rxe_pool.h
index b963c300d636..f98d2950bb9f 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.h
+++ b/drivers/infiniband/sw/rxe/rxe_pool.h
@@ -28,6 +28,7 @@ struct rxe_pool_elem {
 	void			*obj;
 	struct kref		ref_cnt;
 	struct list_head	list;
+	struct completion	complete;
 	u32			index;
 	bool			enabled;
 };
@@ -38,6 +39,7 @@ struct rxe_pool {
 	void			(*cleanup)(struct rxe_pool_elem *elem);
 	enum rxe_pool_flags	flags;
 	enum rxe_elem_type	type;
+	unsigned int		timeouts;
 
 	unsigned int		max_elem;
 	atomic_t		num_elem;
@@ -86,4 +88,8 @@ int __rxe_drop_ref(struct rxe_pool_elem *elem);
 
 #define rxe_disable(obj) ((obj)->elem.enabled = false)
 
+int __rxe_drop_wait(struct rxe_pool_elem *elem);
+
+#define rxe_drop_wait(obj) __rxe_drop_wait(&(obj)->elem)
+
 #endif /* RXE_POOL_H */
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.c b/drivers/infiniband/sw/rxe/rxe_verbs.c
index 077be3eb9763..ced6972396c4 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.c
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.c
@@ -115,7 +115,7 @@ static void rxe_dealloc_ucontext(struct ib_ucontext *ibuc)
 {
 	struct rxe_ucontext *uc = to_ruc(ibuc);
 
-	rxe_drop_ref(uc);
+	rxe_drop_wait(uc);
 }
 
 static int rxe_port_immutable(struct ib_device *dev, u32 port_num,
@@ -149,7 +149,7 @@ static int rxe_dealloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
 {
 	struct rxe_pd *pd = to_rpd(ibpd);
 
-	rxe_drop_ref(pd);
+	rxe_drop_wait(pd);
 
 	return 0;
 }
@@ -189,7 +189,7 @@ static int rxe_create_ah(struct ib_ah *ibah,
 		err = copy_to_user(&uresp->ah_num, &ah->ah_num,
 					 sizeof(uresp->ah_num));
 		if (err) {
-			rxe_drop_ref(ah);
+			rxe_drop_wait(ah);
 			return -EFAULT;
 		}
 	} else if (ah->is_user) {
@@ -233,8 +233,8 @@ static int rxe_destroy_ah(struct ib_ah *ibah, u32 flags)
 	struct rxe_ah *ah = to_rah(ibah);
 
 	rxe_disable(ah);
+	rxe_drop_wait(ah);
 
-	rxe_drop_ref(ah);
 	return 0;
 }
 
@@ -322,7 +322,7 @@ static int rxe_create_srq(struct ib_srq *ibsrq, struct ib_srq_init_attr *init,
 
 err2:
 	rxe_drop_ref(pd);
-	rxe_drop_ref(srq);
+	rxe_drop_wait(srq);
 err1:
 	return err;
 }
@@ -382,7 +382,8 @@ static int rxe_destroy_srq(struct ib_srq *ibsrq, struct ib_udata *udata)
 		rxe_queue_cleanup(srq->rq.queue);
 
 	rxe_drop_ref(srq->pd);
-	rxe_drop_ref(srq);
+	rxe_drop_wait(srq);
+
 	return 0;
 }
 
@@ -454,7 +455,7 @@ static int rxe_create_qp(struct ib_qp *ibqp, struct ib_qp_init_attr *init,
 	return 0;
 
 qp_init:
-	rxe_drop_ref(qp);
+	rxe_drop_wait(qp);
 	return err;
 }
 
@@ -509,7 +510,7 @@ static int rxe_destroy_qp(struct ib_qp *ibqp, struct ib_udata *udata)
 
 	rxe_disable(qp);
 	rxe_qp_destroy(qp);
-	rxe_drop_ref(qp);
+	rxe_drop_wait(qp);
 	return 0;
 }
 
@@ -822,7 +823,7 @@ static int rxe_destroy_cq(struct ib_cq *ibcq, struct ib_udata *udata)
 
 
 	rxe_cq_disable(cq);
-	rxe_drop_ref(cq);
+	rxe_drop_wait(cq);
 
 	return 0;
 }
@@ -954,7 +955,7 @@ static struct ib_mr *rxe_reg_user_mr(struct ib_pd *ibpd,
 
 err3:
 	rxe_drop_ref(pd);
-	rxe_drop_ref(mr);
+	rxe_drop_wait(mr);
 err2:
 	return ERR_PTR(err);
 }
@@ -988,7 +989,7 @@ static struct ib_mr *rxe_alloc_mr(struct ib_pd *ibpd, enum ib_mr_type mr_type,
 
 err2:
 	rxe_drop_ref(pd);
-	rxe_drop_ref(mr);
+	rxe_drop_wait(mr);
 err1:
 	return ERR_PTR(err);
 }
-- 
2.32.0



* [PATCH for-next v10 07/11] RDMA/rxe: Fix ref error in rxe_av.c
  2022-02-25 19:57 [PATCH for-next v10 00/11] Fix race conditions in rxe_pool Bob Pearson
                   ` (5 preceding siblings ...)
  2022-02-25 19:57 ` [PATCH for-next v10 06/11] RDMA/rxe: Add wait_for_completion to pool objects Bob Pearson
@ 2022-02-25 19:57 ` Bob Pearson
  2022-02-28 17:06   ` Jason Gunthorpe
  2022-02-25 19:57 ` [PATCH for-next v10 08/11] RDMA/rxe: Replace mr by rkey in responder resources Bob Pearson
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 21+ messages in thread
From: Bob Pearson @ 2022-02-25 19:57 UTC (permalink / raw)
  To: jgg, zyjzyj2000, linux-rdma; +Cc: Bob Pearson

The commit referenced below can take a reference to the AH which is
never dropped. This only happens in the UD request path. This patch
optionally passes that AH back to the caller so that it can hold the
reference while the AV is being accessed and then drop it. Code to
do this is added to rxe_req.c. The AV is also passed to rxe_prepare
in rxe_net.c as an optimization.

Fixes: e2fe06c90806 ("RDMA/rxe: Lookup kernel AH from ah index in UD WQEs")
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
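The optional out-parameter convention described above can be sketched in
userspace C. This is a simplified model, not the driver code: struct ah,
struct av, ah_lookup(), ah_put() and get_av() are hypothetical stand-ins,
the pool is reduced to a single entry with index 1, and the pd fields are
plain integers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced address vector and address handle. */
struct av {
	int dmac;
};

struct ah {
	int refcnt;
	int pd;
	struct av av;
};

/* Models dropping a reference on the AH. */
static void ah_put(struct ah *ah)
{
	assert(ah->refcnt > 0);
	ah->refcnt--;
}

/* Models an index lookup on a one-entry pool (index 1): takes a
 * reference on success.
 */
static struct ah *ah_lookup(struct ah *entry, int ah_num)
{
	if (!entry || ah_num != 1)
		return NULL;
	entry->refcnt++;
	return entry;
}

/* If ahp is non-NULL the caller receives the AH and owns the reference;
 * otherwise the reference is dropped before returning.
 */
static struct av *get_av(struct ah *entry, int ah_num, int qp_pd,
			 struct ah **ahp)
{
	struct ah *ah;

	if (ahp)
		*ahp = NULL;

	ah = ah_lookup(entry, ah_num);
	if (!ah)
		return NULL;

	if (ah->pd != qp_pd) {
		ah_put(ah);	/* do not leak the lookup reference */
		return NULL;
	}

	if (ahp)
		*ahp = ah;	/* caller must call ah_put() when done */
	else
		ah_put(ah);

	return &ah->av;
}
```

A caller that needs the AV to stay valid passes a non-NULL ahp and drops
the reference once it has finished with the AV, on both the success and
the error paths.
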
 drivers/infiniband/sw/rxe/rxe_av.c   | 19 +++++++++-
 drivers/infiniband/sw/rxe/rxe_loc.h  |  5 ++-
 drivers/infiniband/sw/rxe/rxe_net.c  | 17 +++++----
 drivers/infiniband/sw/rxe/rxe_req.c  | 55 +++++++++++++++++-----------
 drivers/infiniband/sw/rxe/rxe_resp.c |  2 +-
 5 files changed, 63 insertions(+), 35 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_av.c b/drivers/infiniband/sw/rxe/rxe_av.c
index 38c7b6fb39d7..360a567159fe 100644
--- a/drivers/infiniband/sw/rxe/rxe_av.c
+++ b/drivers/infiniband/sw/rxe/rxe_av.c
@@ -99,11 +99,14 @@ void rxe_av_fill_ip_info(struct rxe_av *av, struct rdma_ah_attr *attr)
 	av->network_type = type;
 }
 
-struct rxe_av *rxe_get_av(struct rxe_pkt_info *pkt)
+struct rxe_av *rxe_get_av(struct rxe_pkt_info *pkt, struct rxe_ah **ahp)
 {
 	struct rxe_ah *ah;
 	u32 ah_num;
 
+	if (ahp)
+		*ahp = NULL;
+
 	if (!pkt || !pkt->qp)
 		return NULL;
 
@@ -117,10 +120,22 @@ struct rxe_av *rxe_get_av(struct rxe_pkt_info *pkt)
 	if (ah_num) {
 		/* only new user provider or kernel client */
 		ah = rxe_pool_get_index(&pkt->rxe->ah_pool, ah_num);
-		if (!ah || ah->ah_num != ah_num || rxe_ah_pd(ah) != pkt->qp->pd) {
+		if (!ah) {
 			pr_warn("Unable to find AH matching ah_num\n");
 			return NULL;
 		}
+
+		if (rxe_ah_pd(ah) != pkt->qp->pd) {
+			pr_warn("PDs don't match for AH and QP\n");
+			rxe_drop_ref(ah);
+			return NULL;
+		}
+
+		if (ahp)
+			*ahp = ah;
+		else
+			rxe_drop_ref(ah);
+
 		return &ah->av;
 	}
 
diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
index 2ad28351ba44..c7ba112d5c67 100644
--- a/drivers/infiniband/sw/rxe/rxe_loc.h
+++ b/drivers/infiniband/sw/rxe/rxe_loc.h
@@ -19,7 +19,7 @@ void rxe_av_to_attr(struct rxe_av *av, struct rdma_ah_attr *attr);
 
 void rxe_av_fill_ip_info(struct rxe_av *av, struct rdma_ah_attr *attr);
 
-struct rxe_av *rxe_get_av(struct rxe_pkt_info *pkt);
+struct rxe_av *rxe_get_av(struct rxe_pkt_info *pkt, struct rxe_ah **ahp);
 
 /* rxe_cq.c */
 int rxe_cq_chk_attr(struct rxe_dev *rxe, struct rxe_cq *cq,
@@ -94,7 +94,8 @@ void rxe_mw_cleanup(struct rxe_pool_elem *arg);
 /* rxe_net.c */
 struct sk_buff *rxe_init_packet(struct rxe_dev *rxe, struct rxe_av *av,
 				int paylen, struct rxe_pkt_info *pkt);
-int rxe_prepare(struct rxe_pkt_info *pkt, struct sk_buff *skb);
+int rxe_prepare(struct rxe_av *av, struct rxe_pkt_info *pkt,
+		struct sk_buff *skb);
 int rxe_xmit_packet(struct rxe_qp *qp, struct rxe_pkt_info *pkt,
 		    struct sk_buff *skb);
 const char *rxe_parent_name(struct rxe_dev *rxe, unsigned int port_num);
diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
index a8cfa7160478..b06f22ffc5a8 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.c
+++ b/drivers/infiniband/sw/rxe/rxe_net.c
@@ -271,13 +271,13 @@ static void prepare_ipv6_hdr(struct dst_entry *dst, struct sk_buff *skb,
 	ip6h->payload_len = htons(skb->len - sizeof(*ip6h));
 }
 
-static int prepare4(struct rxe_pkt_info *pkt, struct sk_buff *skb)
+static int prepare4(struct rxe_av *av, struct rxe_pkt_info *pkt,
+		    struct sk_buff *skb)
 {
 	struct rxe_qp *qp = pkt->qp;
 	struct dst_entry *dst;
 	bool xnet = false;
 	__be16 df = htons(IP_DF);
-	struct rxe_av *av = rxe_get_av(pkt);
 	struct in_addr *saddr = &av->sgid_addr._sockaddr_in.sin_addr;
 	struct in_addr *daddr = &av->dgid_addr._sockaddr_in.sin_addr;
 
@@ -297,11 +297,11 @@ static int prepare4(struct rxe_pkt_info *pkt, struct sk_buff *skb)
 	return 0;
 }
 
-static int prepare6(struct rxe_pkt_info *pkt, struct sk_buff *skb)
+static int prepare6(struct rxe_av *av, struct rxe_pkt_info *pkt,
+		    struct sk_buff *skb)
 {
 	struct rxe_qp *qp = pkt->qp;
 	struct dst_entry *dst;
-	struct rxe_av *av = rxe_get_av(pkt);
 	struct in6_addr *saddr = &av->sgid_addr._sockaddr_in6.sin6_addr;
 	struct in6_addr *daddr = &av->dgid_addr._sockaddr_in6.sin6_addr;
 
@@ -322,16 +322,17 @@ static int prepare6(struct rxe_pkt_info *pkt, struct sk_buff *skb)
 	return 0;
 }
 
-int rxe_prepare(struct rxe_pkt_info *pkt, struct sk_buff *skb)
+int rxe_prepare(struct rxe_av *av, struct rxe_pkt_info *pkt,
+		struct sk_buff *skb)
 {
 	int err = 0;
 
 	if (skb->protocol == htons(ETH_P_IP))
-		err = prepare4(pkt, skb);
+		err = prepare4(av, pkt, skb);
 	else if (skb->protocol == htons(ETH_P_IPV6))
-		err = prepare6(pkt, skb);
+		err = prepare6(av, pkt, skb);
 
-	if (ether_addr_equal(skb->dev->dev_addr, rxe_get_av(pkt)->dmac))
+	if (ether_addr_equal(skb->dev->dev_addr, av->dmac))
 		pkt->mask |= RXE_LOOPBACK_MASK;
 
 	return err;
diff --git a/drivers/infiniband/sw/rxe/rxe_req.c b/drivers/infiniband/sw/rxe/rxe_req.c
index 5eb89052dd66..f44535f82bea 100644
--- a/drivers/infiniband/sw/rxe/rxe_req.c
+++ b/drivers/infiniband/sw/rxe/rxe_req.c
@@ -358,6 +358,7 @@ static inline int get_mtu(struct rxe_qp *qp)
 }
 
 static struct sk_buff *init_req_packet(struct rxe_qp *qp,
+				       struct rxe_av *av,
 				       struct rxe_send_wqe *wqe,
 				       int opcode, int payload,
 				       struct rxe_pkt_info *pkt)
@@ -365,7 +366,6 @@ static struct sk_buff *init_req_packet(struct rxe_qp *qp,
 	struct rxe_dev		*rxe = to_rdev(qp->ibqp.device);
 	struct sk_buff		*skb;
 	struct rxe_send_wr	*ibwr = &wqe->wr;
-	struct rxe_av		*av;
 	int			pad = (-payload) & 0x3;
 	int			paylen;
 	int			solicited;
@@ -374,21 +374,9 @@ static struct sk_buff *init_req_packet(struct rxe_qp *qp,
 
 	/* length from start of bth to end of icrc */
 	paylen = rxe_opcode[opcode].length + payload + pad + RXE_ICRC_SIZE;
-
-	/* pkt->hdr, port_num and mask are initialized in ifc layer */
-	pkt->rxe	= rxe;
-	pkt->opcode	= opcode;
-	pkt->qp		= qp;
-	pkt->psn	= qp->req.psn;
-	pkt->mask	= rxe_opcode[opcode].mask;
-	pkt->paylen	= paylen;
-	pkt->wqe	= wqe;
+	pkt->paylen = paylen;
 
 	/* init skb */
-	av = rxe_get_av(pkt);
-	if (!av)
-		return NULL;
-
 	skb = rxe_init_packet(rxe, av, paylen, pkt);
 	if (unlikely(!skb))
 		return NULL;
@@ -447,13 +435,13 @@ static struct sk_buff *init_req_packet(struct rxe_qp *qp,
 	return skb;
 }
 
-static int finish_packet(struct rxe_qp *qp, struct rxe_send_wqe *wqe,
-		       struct rxe_pkt_info *pkt, struct sk_buff *skb,
-		       int paylen)
+static int finish_packet(struct rxe_qp *qp, struct rxe_av *av,
+			 struct rxe_send_wqe *wqe, struct rxe_pkt_info *pkt,
+			 struct sk_buff *skb, int paylen)
 {
 	int err;
 
-	err = rxe_prepare(pkt, skb);
+	err = rxe_prepare(av, pkt, skb);
 	if (err)
 		return err;
 
@@ -608,6 +596,7 @@ static int rxe_do_local_ops(struct rxe_qp *qp, struct rxe_send_wqe *wqe)
 int rxe_requester(void *arg)
 {
 	struct rxe_qp *qp = (struct rxe_qp *)arg;
+	struct rxe_dev *rxe = to_rdev(qp->ibqp.device);
 	struct rxe_pkt_info pkt;
 	struct sk_buff *skb;
 	struct rxe_send_wqe *wqe;
@@ -619,6 +608,8 @@ int rxe_requester(void *arg)
 	struct rxe_send_wqe rollback_wqe;
 	u32 rollback_psn;
 	struct rxe_queue *q = qp->sq.queue;
+	struct rxe_ah *ah;
+	struct rxe_av *av;
 
 	rxe_add_ref(qp);
 
@@ -705,14 +696,28 @@ int rxe_requester(void *arg)
 		payload = mtu;
 	}
 
-	skb = init_req_packet(qp, wqe, opcode, payload, &pkt);
+	pkt.rxe = rxe;
+	pkt.opcode = opcode;
+	pkt.qp = qp;
+	pkt.psn = qp->req.psn;
+	pkt.mask = rxe_opcode[opcode].mask;
+	pkt.wqe = wqe;
+
+	av = rxe_get_av(&pkt, &ah);
+	if (unlikely(!av)) {
+		pr_err("qp#%d Failed no address vector\n", qp_num(qp));
+		wqe->status = IB_WC_LOC_QP_OP_ERR;
+		goto err_drop_ah;
+	}
+
+	skb = init_req_packet(qp, av, wqe, opcode, payload, &pkt);
 	if (unlikely(!skb)) {
 		pr_err("qp#%d Failed allocating skb\n", qp_num(qp));
 		wqe->status = IB_WC_LOC_QP_OP_ERR;
-		goto err;
+		goto err_drop_ah;
 	}
 
-	ret = finish_packet(qp, wqe, &pkt, skb, payload);
+	ret = finish_packet(qp, av, wqe, &pkt, skb, payload);
 	if (unlikely(ret)) {
 		pr_debug("qp#%d Error during finish packet\n", qp_num(qp));
 		if (ret == -EFAULT)
@@ -720,9 +725,12 @@ int rxe_requester(void *arg)
 		else
 			wqe->status = IB_WC_LOC_QP_OP_ERR;
 		kfree_skb(skb);
-		goto err;
+		goto err_drop_ah;
 	}
 
+	if (ah)
+		rxe_drop_ref(ah);
+
 	/*
 	 * To prevent a race on wqe access between requester and completer,
 	 * wqe members state and psn need to be set before calling
@@ -751,6 +759,9 @@ int rxe_requester(void *arg)
 
 	goto next_wqe;
 
+err_drop_ah:
+	if (ah)
+		rxe_drop_ref(ah);
 err:
 	wqe->state = wqe_state_error;
 	__rxe_do_task(&qp->comp.task);
diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
index c369d78fc8e8..b5ebe853748a 100644
--- a/drivers/infiniband/sw/rxe/rxe_resp.c
+++ b/drivers/infiniband/sw/rxe/rxe_resp.c
@@ -633,7 +633,7 @@ static struct sk_buff *prepare_ack_packet(struct rxe_qp *qp,
 	if (ack->mask & RXE_ATMACK_MASK)
 		atmack_set_orig(ack, qp->resp.atomic_orig);
 
-	err = rxe_prepare(ack, skb);
+	err = rxe_prepare(&qp->pri_av, ack, skb);
 	if (err) {
 		kfree_skb(skb);
 		return NULL;
-- 
2.32.0



* [PATCH for-next v10 08/11] RDMA/rxe: Replace mr by rkey in responder resources
  2022-02-25 19:57 [PATCH for-next v10 00/11] Fix race conditions in rxe_pool Bob Pearson
                   ` (6 preceding siblings ...)
  2022-02-25 19:57 ` [PATCH for-next v10 07/11] RDMA/rxe: Fix ref error in rxe_av.c Bob Pearson
@ 2022-02-25 19:57 ` Bob Pearson
  2022-02-25 19:57 ` [PATCH for-next v10 09/11] RDMA/rxe: Convert read side locking to rcu Bob Pearson
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 21+ messages in thread
From: Bob Pearson @ 2022-02-25 19:57 UTC (permalink / raw)
  To: jgg, zyjzyj2000, linux-rdma; +Cc: Bob Pearson

Currently rxe saves a copy of the MR in responder resources for RDMA
reads. Since the responder resources are never freed, just overwritten
if more are needed, the reference to this MR may not be dropped until
the QP is destroyed. This patch uses the rkey instead of the MR and,
on subsequent packets of a multipacket read reply message, looks up
the MR from the rkey for each packet. This makes it possible for a user
to deregister an MR or unbind a MW on the fly and get correct behaviour.

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
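The idea of keeping only the rkey and revalidating per packet can be
sketched in userspace C. This is a reduced model covering only the MR
case (no memory windows): struct mr, struct pool, mr_lookup(), mr_put()
and recheck_mr() are hypothetical stand-ins, and the only detail taken
from the patch is that the pool index is derived as rkey >> 8.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_SIZE 4

/* Hypothetical, reduced memory region. */
struct mr {
	int refcnt;
	uint32_t rkey;
	bool valid;	/* models mr->state == RXE_MR_STATE_VALID */
};

/* Hypothetical stand-in for the MR pool index. */
struct pool {
	struct mr *slot[POOL_SIZE];
};

static void mr_put(struct mr *mr)
{
	assert(mr->refcnt > 0);
	mr->refcnt--;
}

/* Models an index lookup that takes a reference on success. */
static struct mr *mr_lookup(struct pool *p, uint32_t index)
{
	struct mr *mr;

	if (index >= POOL_SIZE)
		return NULL;
	mr = p->slot[index];
	if (!mr || mr->refcnt == 0)
		return NULL;
	mr->refcnt++;
	return mr;
}

/* Look the MR up again from the saved rkey and return it with a
 * reference held, or NULL if it is gone, rekeyed or no longer valid.
 */
static struct mr *recheck_mr(struct pool *p, uint32_t rkey)
{
	struct mr *mr = mr_lookup(p, rkey >> 8);

	if (!mr)
		return NULL;

	if (mr->rkey != rkey || !mr->valid) {
		mr_put(mr);	/* drop the lookup reference on failure */
		return NULL;
	}

	return mr;
}
```

Because the responder keeps only the rkey between packets, no reference
is pinned while the read is idle; each packet takes a reference, uses the
MR and drops it again.
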
 drivers/infiniband/sw/rxe/rxe_qp.c    |  10 +--
 drivers/infiniband/sw/rxe/rxe_resp.c  | 123 ++++++++++++++++++--------
 drivers/infiniband/sw/rxe/rxe_verbs.h |   1 -
 3 files changed, 87 insertions(+), 47 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_qp.c b/drivers/infiniband/sw/rxe/rxe_qp.c
index 5f270cbf18c6..26d461a8d71c 100644
--- a/drivers/infiniband/sw/rxe/rxe_qp.c
+++ b/drivers/infiniband/sw/rxe/rxe_qp.c
@@ -135,12 +135,8 @@ static void free_rd_atomic_resources(struct rxe_qp *qp)
 
 void free_rd_atomic_resource(struct rxe_qp *qp, struct resp_res *res)
 {
-	if (res->type == RXE_ATOMIC_MASK) {
+	if (res->type == RXE_ATOMIC_MASK)
 		kfree_skb(res->atomic.skb);
-	} else if (res->type == RXE_READ_MASK) {
-		if (res->read.mr)
-			rxe_drop_ref(res->read.mr);
-	}
 	res->type = 0;
 }
 
@@ -825,10 +821,8 @@ static void rxe_qp_do_cleanup(struct work_struct *work)
 	if (qp->pd)
 		rxe_drop_ref(qp->pd);
 
-	if (qp->resp.mr) {
+	if (qp->resp.mr)
 		rxe_drop_ref(qp->resp.mr);
-		qp->resp.mr = NULL;
-	}
 
 	if (qp_type(qp) == IB_QPT_RC)
 		sk_dst_reset(qp->sk->sk);
diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
index b5ebe853748a..b1ec003f0bb8 100644
--- a/drivers/infiniband/sw/rxe/rxe_resp.c
+++ b/drivers/infiniband/sw/rxe/rxe_resp.c
@@ -642,6 +642,78 @@ static struct sk_buff *prepare_ack_packet(struct rxe_qp *qp,
 	return skb;
 }
 
+static struct resp_res *rxe_prepare_read_res(struct rxe_qp *qp,
+					struct rxe_pkt_info *pkt)
+{
+	struct resp_res *res;
+	u32 pkts;
+
+	res = &qp->resp.resources[qp->resp.res_head];
+	rxe_advance_resp_resource(qp);
+	free_rd_atomic_resource(qp, res);
+
+	res->type = RXE_READ_MASK;
+	res->replay = 0;
+	res->read.va = qp->resp.va + qp->resp.offset;
+	res->read.va_org = qp->resp.va + qp->resp.offset;
+	res->read.resid = qp->resp.resid;
+	res->read.length = qp->resp.resid;
+	res->read.rkey = qp->resp.rkey;
+
+	pkts = max_t(u32, (reth_len(pkt) + qp->mtu - 1)/qp->mtu, 1);
+	res->first_psn = pkt->psn;
+	res->cur_psn = pkt->psn;
+	res->last_psn = (pkt->psn + pkts - 1) & BTH_PSN_MASK;
+
+	res->state = rdatm_res_state_new;
+
+	return res;
+}
+
+/**
+ * rxe_recheck_mr - revalidate MR from rkey and get a reference
+ * @qp: the qp
+ * @rkey: the rkey
+ *
+ * This code allows the MR to be invalidated or deregistered, or
+ * the MW (if one was used) to be invalidated or deallocated.
+ * It is assumed that the access permissions, if originally good,
+ * are still OK and that the mappings are unchanged.
+ *
+ * Return: mr on success else NULL
+ */
+static struct rxe_mr *rxe_recheck_mr(struct rxe_qp *qp, u32 rkey)
+{
+	struct rxe_dev *rxe = to_rdev(qp->ibqp.device);
+	struct rxe_mr *mr;
+	struct rxe_mw *mw;
+
+	if (rkey_is_mw(rkey)) {
+		mw = rxe_pool_get_index(&rxe->mw_pool, rkey >> 8);
+		if (!mw || mw->rkey != rkey)
+			return NULL;
+
+		if (mw->state != RXE_MW_STATE_VALID) {
+			rxe_drop_ref(mw);
+			return NULL;
+		}
+
+		mr = mw->mr;
+		rxe_drop_ref(mw);
+	} else {
+		mr = rxe_pool_get_index(&rxe->mr_pool, rkey >> 8);
+		if (!mr || mr->rkey != rkey)
+			return NULL;
+	}
+
+	if (mr->state != RXE_MR_STATE_VALID) {
+		rxe_drop_ref(mr);
+		return NULL;
+	}
+
+	return mr;
+}
+
 /* RDMA read response. If res is not NULL, then we have a current RDMA request
  * being processed or replayed.
  */
@@ -656,53 +728,26 @@ static enum resp_states read_reply(struct rxe_qp *qp,
 	int opcode;
 	int err;
 	struct resp_res *res = qp->resp.res;
+	struct rxe_mr *mr;
 
 	if (!res) {
-		/* This is the first time we process that request. Get a
-		 * resource
-		 */
-		res = &qp->resp.resources[qp->resp.res_head];
-
-		free_rd_atomic_resource(qp, res);
-		rxe_advance_resp_resource(qp);
-
-		res->type		= RXE_READ_MASK;
-		res->replay		= 0;
-
-		res->read.va		= qp->resp.va +
-					  qp->resp.offset;
-		res->read.va_org	= qp->resp.va +
-					  qp->resp.offset;
-
-		res->first_psn		= req_pkt->psn;
-
-		if (reth_len(req_pkt)) {
-			res->last_psn	= (req_pkt->psn +
-					   (reth_len(req_pkt) + mtu - 1) /
-					   mtu - 1) & BTH_PSN_MASK;
-		} else {
-			res->last_psn	= res->first_psn;
-		}
-		res->cur_psn		= req_pkt->psn;
-
-		res->read.resid		= qp->resp.resid;
-		res->read.length	= qp->resp.resid;
-		res->read.rkey		= qp->resp.rkey;
-
-		/* note res inherits the reference to mr from qp */
-		res->read.mr		= qp->resp.mr;
-		qp->resp.mr		= NULL;
-
-		qp->resp.res		= res;
-		res->state		= rdatm_res_state_new;
+		res = rxe_prepare_read_res(qp, req_pkt);
+		qp->resp.res = res;
 	}
 
 	if (res->state == rdatm_res_state_new) {
+		mr = qp->resp.mr;
+		qp->resp.mr = NULL;
+
 		if (res->read.resid <= mtu)
 			opcode = IB_OPCODE_RC_RDMA_READ_RESPONSE_ONLY;
 		else
 			opcode = IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST;
 	} else {
+		mr = rxe_recheck_mr(qp, res->read.rkey);
+		if (!mr)
+			return RESPST_ERR_RKEY_VIOLATION;
+
 		if (res->read.resid > mtu)
 			opcode = IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE;
 		else
@@ -718,10 +763,12 @@ static enum resp_states read_reply(struct rxe_qp *qp,
 	if (!skb)
 		return RESPST_ERR_RNR;
 
-	err = rxe_mr_copy(res->read.mr, res->read.va, payload_addr(&ack_pkt),
+	err = rxe_mr_copy(mr, res->read.va, payload_addr(&ack_pkt),
 			  payload, RXE_FROM_MR_OBJ);
 	if (err)
 		pr_err("Failed copying memory\n");
+	if (mr)
+		rxe_drop_ref(mr);
 
 	if (bth_pad(&ack_pkt)) {
 		u8 *pad = payload_addr(&ack_pkt) + payload;
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
index 4ee51de50b95..897f77ff604c 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.h
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
@@ -157,7 +157,6 @@ struct resp_res {
 			struct sk_buff	*skb;
 		} atomic;
 		struct {
-			struct rxe_mr	*mr;
 			u64		va_org;
 			u32		rkey;
 			u32		length;
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH for-next v10 09/11] RDMA/rxe: Convert read side locking to rcu
  2022-02-25 19:57 [PATCH for-next v10 00/11] Fix race conditions in rxe_pool Bob Pearson
                   ` (7 preceding siblings ...)
  2022-02-25 19:57 ` [PATCH for-next v10 08/11] RDMA/rxe: Replace mr by rkey in responder resources Bob Pearson
@ 2022-02-25 19:57 ` Bob Pearson
  2022-02-28 17:12   ` Jason Gunthorpe
  2022-02-25 19:57 ` [PATCH for-next v10 10/11] RDMA/rxe: Move max_elem into rxe_type_info Bob Pearson
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 21+ messages in thread
From: Bob Pearson @ 2022-02-25 19:57 UTC (permalink / raw)
  To: jgg, zyjzyj2000, linux-rdma; +Cc: Bob Pearson

Use rcu_read_lock() for protecting read side operations in rxe_pool.c.

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_pool.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
index 1d1e10290991..713df1ce2bbc 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.c
+++ b/drivers/infiniband/sw/rxe/rxe_pool.c
@@ -202,16 +202,15 @@ int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem)
 void *rxe_pool_get_index(struct rxe_pool *pool, u32 index)
 {
 	struct rxe_pool_elem *elem;
-	unsigned long flags;
 	void *obj;
 
-	spin_lock_irqsave(&pool->xa.xa_lock, flags);
+	rcu_read_lock();
 	elem = xa_load(&pool->xa, index);
 	if (elem && elem->enabled && kref_get_unless_zero(&elem->ref_cnt))
 		obj = elem->obj;
 	else
 		obj = NULL;
-	spin_unlock_irqrestore(&pool->xa.xa_lock, flags);
+	rcu_read_unlock();
 
 	return obj;
 }
@@ -259,13 +258,18 @@ int __rxe_drop_ref(struct rxe_pool_elem *elem)
 			&pool->xa.xa_lock);
 }
 
-
 int __rxe_drop_wait(struct rxe_pool_elem *elem)
 {
 	struct rxe_pool *pool = elem->pool;
 	static int timeout = RXE_POOL_TIMEOUT;
 	int ret;
 
+	if (elem->enabled) {
+		pr_warn_once("%s#%d: should be disabled by now\n",
+			     elem->pool->name + 4, elem->index);
+		elem->enabled = false;
+	}
+
 	__rxe_drop_ref(elem);
 
 	if (timeout) {
@@ -284,6 +288,7 @@ int __rxe_drop_wait(struct rxe_pool_elem *elem)
 	if (pool->cleanup)
 		pool->cleanup(elem);
 
+	/* if we reach here it is safe to free the object */
 	if (pool->flags & RXE_POOL_ALLOC)
 		kfree(elem->obj);
 
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH for-next v10 10/11] RDMA/rxe: Move max_elem into rxe_type_info
  2022-02-25 19:57 [PATCH for-next v10 00/11] Fix race conditions in rxe_pool Bob Pearson
                   ` (8 preceding siblings ...)
  2022-02-25 19:57 ` [PATCH for-next v10 09/11] RDMA/rxe: Convert read side locking to rcu Bob Pearson
@ 2022-02-25 19:57 ` Bob Pearson
  2022-02-25 19:57 ` [PATCH for-next v10 11/11] RDMA/rxe: Cleanup rxe_pool.c Bob Pearson
  2022-02-25 20:46 ` [PATCH for-next v10 00/11] Fix race conditions in rxe_pool Jason Gunthorpe
  11 siblings, 0 replies; 21+ messages in thread
From: Bob Pearson @ 2022-02-25 19:57 UTC (permalink / raw)
  To: jgg, zyjzyj2000, linux-rdma; +Cc: Bob Pearson

Move the maximum number of elements from a parameter in rxe_pool_init
to a member of the rxe_type_info array.

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe.c      | 16 ++++++++--------
 drivers/infiniband/sw/rxe/rxe_pool.c | 13 +++++++++++--
 drivers/infiniband/sw/rxe/rxe_pool.h |  2 +-
 3 files changed, 20 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
index 29e2b93f6d7e..fab00a753af1 100644
--- a/drivers/infiniband/sw/rxe/rxe.c
+++ b/drivers/infiniband/sw/rxe/rxe.c
@@ -116,14 +116,14 @@ static void rxe_init_ports(struct rxe_dev *rxe)
 /* init pools of managed objects */
 static void rxe_init_pools(struct rxe_dev *rxe)
 {
-	rxe_pool_init(rxe, &rxe->uc_pool, RXE_TYPE_UC, rxe->max_ucontext);
-	rxe_pool_init(rxe, &rxe->pd_pool, RXE_TYPE_PD, rxe->attr.max_pd);
-	rxe_pool_init(rxe, &rxe->ah_pool, RXE_TYPE_AH, rxe->attr.max_ah);
-	rxe_pool_init(rxe, &rxe->srq_pool, RXE_TYPE_SRQ, rxe->attr.max_srq);
-	rxe_pool_init(rxe, &rxe->qp_pool, RXE_TYPE_QP, rxe->attr.max_qp);
-	rxe_pool_init(rxe, &rxe->cq_pool, RXE_TYPE_CQ, rxe->attr.max_cq);
-	rxe_pool_init(rxe, &rxe->mr_pool, RXE_TYPE_MR, rxe->attr.max_mr);
-	rxe_pool_init(rxe, &rxe->mw_pool, RXE_TYPE_MW, rxe->attr.max_mw);
+	rxe_pool_init(rxe, &rxe->uc_pool, RXE_TYPE_UC);
+	rxe_pool_init(rxe, &rxe->pd_pool, RXE_TYPE_PD);
+	rxe_pool_init(rxe, &rxe->ah_pool, RXE_TYPE_AH);
+	rxe_pool_init(rxe, &rxe->srq_pool, RXE_TYPE_SRQ);
+	rxe_pool_init(rxe, &rxe->qp_pool, RXE_TYPE_QP);
+	rxe_pool_init(rxe, &rxe->cq_pool, RXE_TYPE_CQ);
+	rxe_pool_init(rxe, &rxe->mr_pool, RXE_TYPE_MR);
+	rxe_pool_init(rxe, &rxe->mw_pool, RXE_TYPE_MW);
 }
 
 /* initialize rxe device state */
diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
index 713df1ce2bbc..20b97a90b4c8 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.c
+++ b/drivers/infiniband/sw/rxe/rxe_pool.c
@@ -18,6 +18,7 @@ static const struct rxe_type_info {
 	enum rxe_pool_flags flags;
 	u32 min_index;
 	u32 max_index;
+	u32 max_elem;
 } rxe_type_info[RXE_NUM_TYPES] = {
 	[RXE_TYPE_UC] = {
 		.name		= "rxe-uc",
@@ -25,6 +26,7 @@ static const struct rxe_type_info {
 		.elem_offset	= offsetof(struct rxe_ucontext, elem),
 		.min_index	= 1,
 		.max_index	= UINT_MAX,
+		.max_elem	= UINT_MAX,
 	},
 	[RXE_TYPE_PD] = {
 		.name		= "rxe-pd",
@@ -32,6 +34,7 @@ static const struct rxe_type_info {
 		.elem_offset	= offsetof(struct rxe_pd, elem),
 		.min_index	= 1,
 		.max_index	= UINT_MAX,
+		.max_elem	= UINT_MAX,
 	},
 	[RXE_TYPE_AH] = {
 		.name		= "rxe-ah",
@@ -39,6 +42,7 @@ static const struct rxe_type_info {
 		.elem_offset	= offsetof(struct rxe_ah, elem),
 		.min_index	= RXE_MIN_AH_INDEX,
 		.max_index	= RXE_MAX_AH_INDEX,
+		.max_elem	= RXE_MAX_AH_INDEX - RXE_MIN_AH_INDEX + 1,
 	},
 	[RXE_TYPE_SRQ] = {
 		.name		= "rxe-srq",
@@ -46,6 +50,7 @@ static const struct rxe_type_info {
 		.elem_offset	= offsetof(struct rxe_srq, elem),
 		.min_index	= RXE_MIN_SRQ_INDEX,
 		.max_index	= RXE_MAX_SRQ_INDEX,
+		.max_elem	= RXE_MAX_SRQ_INDEX - RXE_MIN_SRQ_INDEX + 1,
 	},
 	[RXE_TYPE_QP] = {
 		.name		= "rxe-qp",
@@ -54,6 +59,7 @@ static const struct rxe_type_info {
 		.cleanup	= rxe_qp_cleanup,
 		.min_index	= RXE_MIN_QP_INDEX,
 		.max_index	= RXE_MAX_QP_INDEX,
+		.max_elem	= RXE_MAX_QP_INDEX - RXE_MIN_QP_INDEX + 1,
 	},
 	[RXE_TYPE_CQ] = {
 		.name		= "rxe-cq",
@@ -62,6 +68,7 @@ static const struct rxe_type_info {
 		.cleanup	= rxe_cq_cleanup,
 		.min_index	= 1,
 		.max_index	= UINT_MAX,
+		.max_elem	= UINT_MAX,
 	},
 	[RXE_TYPE_MR] = {
 		.name		= "rxe-mr",
@@ -71,6 +78,7 @@ static const struct rxe_type_info {
 		.flags		= RXE_POOL_ALLOC,
 		.min_index	= RXE_MIN_MR_INDEX,
 		.max_index	= RXE_MAX_MR_INDEX,
+		.max_elem	= RXE_MAX_MR_INDEX - RXE_MIN_MR_INDEX + 1,
 	},
 	[RXE_TYPE_MW] = {
 		.name		= "rxe-mw",
@@ -78,11 +86,12 @@ static const struct rxe_type_info {
 		.elem_offset	= offsetof(struct rxe_mw, elem),
 		.min_index	= RXE_MIN_MW_INDEX,
 		.max_index	= RXE_MAX_MW_INDEX,
+		.max_elem	= RXE_MAX_MW_INDEX - RXE_MIN_MW_INDEX + 1,
 	},
 };
 
 void rxe_pool_init(struct rxe_dev *rxe, struct rxe_pool *pool,
-		   enum rxe_elem_type type, unsigned int max_elem)
+		   enum rxe_elem_type type)
 {
 	const struct rxe_type_info *info = &rxe_type_info[type];
 
@@ -91,7 +100,7 @@ void rxe_pool_init(struct rxe_dev *rxe, struct rxe_pool *pool,
 	pool->rxe		= rxe;
 	pool->name		= info->name;
 	pool->type		= type;
-	pool->max_elem		= max_elem;
+	pool->max_elem		= info->max_elem;
 	pool->elem_size		= ALIGN(info->size, RXE_POOL_ALIGN);
 	pool->elem_offset	= info->elem_offset;
 	pool->flags		= info->flags;
diff --git a/drivers/infiniband/sw/rxe/rxe_pool.h b/drivers/infiniband/sw/rxe/rxe_pool.h
index f98d2950bb9f..5450f62b01bd 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.h
+++ b/drivers/infiniband/sw/rxe/rxe_pool.h
@@ -56,7 +56,7 @@ struct rxe_pool {
  * pool elements will be allocated out of a slab cache
  */
 void rxe_pool_init(struct rxe_dev *rxe, struct rxe_pool *pool,
-		  enum rxe_elem_type type, u32 max_elem);
+		  enum rxe_elem_type type);
 
 /* free resources from object pool */
 void rxe_pool_cleanup(struct rxe_pool *pool);
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH for-next v10 11/11] RDMA/rxe: Cleanup rxe_pool.c
  2022-02-25 19:57 [PATCH for-next v10 00/11] Fix race conditions in rxe_pool Bob Pearson
                   ` (9 preceding siblings ...)
  2022-02-25 19:57 ` [PATCH for-next v10 10/11] RDMA/rxe: Move max_elem into rxe_type_info Bob Pearson
@ 2022-02-25 19:57 ` Bob Pearson
  2022-02-25 20:46 ` [PATCH for-next v10 00/11] Fix race conditions in rxe_pool Jason Gunthorpe
  11 siblings, 0 replies; 21+ messages in thread
From: Bob Pearson @ 2022-02-25 19:57 UTC (permalink / raw)
  To: jgg, zyjzyj2000, linux-rdma; +Cc: Bob Pearson

Minor cleanup of rxe_pool.c. Add document comment headers for
the subroutines. Increase alignment for pool elements.

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_pool.c | 110 ++++++++++++++++++++++-----
 1 file changed, 93 insertions(+), 17 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
index 20b97a90b4c8..8c207c90304a 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.c
+++ b/drivers/infiniband/sw/rxe/rxe_pool.c
@@ -1,14 +1,20 @@
 // SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
 /*
+ * Copyright (c) 2022 Hewlett Packard Enterprise, Inc. All rights reserved.
  * Copyright (c) 2016 Mellanox Technologies Ltd. All rights reserved.
  * Copyright (c) 2015 System Fabric Works, Inc. All rights reserved.
  */
 
 #include "rxe.h"
 
-#define RXE_POOL_TIMEOUT	(200)
-#define RXE_MAX_POOL_TIMEOUTS	(3)
-#define RXE_POOL_ALIGN		(16)
+#define RXE_POOL_TIMEOUT	200		/* jiffies */
+#define RXE_MAX_POOL_TIMEOUTS	3
+
+#ifdef L1_CACHE_BYTES
+#define RXE_POOL_ALIGN		L1_CACHE_BYTES
+#else
+#define RXE_POOL_ALIGN		64
+#endif
 
 static const struct rxe_type_info {
 	const char *name;
@@ -21,7 +27,7 @@ static const struct rxe_type_info {
 	u32 max_elem;
 } rxe_type_info[RXE_NUM_TYPES] = {
 	[RXE_TYPE_UC] = {
-		.name		= "rxe-uc",
+		.name		= "uc",
 		.size		= sizeof(struct rxe_ucontext),
 		.elem_offset	= offsetof(struct rxe_ucontext, elem),
 		.min_index	= 1,
@@ -29,7 +35,7 @@ static const struct rxe_type_info {
 		.max_elem	= UINT_MAX,
 	},
 	[RXE_TYPE_PD] = {
-		.name		= "rxe-pd",
+		.name		= "pd",
 		.size		= sizeof(struct rxe_pd),
 		.elem_offset	= offsetof(struct rxe_pd, elem),
 		.min_index	= 1,
@@ -37,7 +43,7 @@ static const struct rxe_type_info {
 		.max_elem	= UINT_MAX,
 	},
 	[RXE_TYPE_AH] = {
-		.name		= "rxe-ah",
+		.name		= "ah",
 		.size		= sizeof(struct rxe_ah),
 		.elem_offset	= offsetof(struct rxe_ah, elem),
 		.min_index	= RXE_MIN_AH_INDEX,
@@ -45,7 +51,7 @@ static const struct rxe_type_info {
 		.max_elem	= RXE_MAX_AH_INDEX - RXE_MIN_AH_INDEX + 1,
 	},
 	[RXE_TYPE_SRQ] = {
-		.name		= "rxe-srq",
+		.name		= "srq",
 		.size		= sizeof(struct rxe_srq),
 		.elem_offset	= offsetof(struct rxe_srq, elem),
 		.min_index	= RXE_MIN_SRQ_INDEX,
@@ -53,7 +59,7 @@ static const struct rxe_type_info {
 		.max_elem	= RXE_MAX_SRQ_INDEX - RXE_MIN_SRQ_INDEX + 1,
 	},
 	[RXE_TYPE_QP] = {
-		.name		= "rxe-qp",
+		.name		= "qp",
 		.size		= sizeof(struct rxe_qp),
 		.elem_offset	= offsetof(struct rxe_qp, elem),
 		.cleanup	= rxe_qp_cleanup,
@@ -62,7 +68,7 @@ static const struct rxe_type_info {
 		.max_elem	= RXE_MAX_QP_INDEX - RXE_MIN_QP_INDEX + 1,
 	},
 	[RXE_TYPE_CQ] = {
-		.name		= "rxe-cq",
+		.name		= "cq",
 		.size		= sizeof(struct rxe_cq),
 		.elem_offset	= offsetof(struct rxe_cq, elem),
 		.cleanup	= rxe_cq_cleanup,
@@ -71,7 +77,7 @@ static const struct rxe_type_info {
 		.max_elem	= UINT_MAX,
 	},
 	[RXE_TYPE_MR] = {
-		.name		= "rxe-mr",
+		.name		= "mr",
 		.size		= sizeof(struct rxe_mr),
 		.elem_offset	= offsetof(struct rxe_mr, elem),
 		.cleanup	= rxe_mr_cleanup,
@@ -81,7 +87,7 @@ static const struct rxe_type_info {
 		.max_elem	= RXE_MAX_MR_INDEX - RXE_MIN_MR_INDEX + 1,
 	},
 	[RXE_TYPE_MW] = {
-		.name		= "rxe-mw",
+		.name		= "mw",
 		.size		= sizeof(struct rxe_mw),
 		.elem_offset	= offsetof(struct rxe_mw, elem),
 		.min_index	= RXE_MIN_MW_INDEX,
@@ -90,6 +96,12 @@ static const struct rxe_type_info {
 	},
 };
 
+/**
+ * rxe_pool_init - initialize a rxe object pool
+ * @rxe: rxe device pool belongs to
+ * @pool: object pool
+ * @type: pool type
+ */
 void rxe_pool_init(struct rxe_dev *rxe, struct rxe_pool *pool,
 		   enum rxe_elem_type type)
 {
@@ -113,6 +125,10 @@ void rxe_pool_init(struct rxe_dev *rxe, struct rxe_pool *pool,
 	pool->limit.min = info->min_index;
 }
 
+/**
+ * rxe_pool_cleanup - free any remaining pool resources
+ * @pool: object pool
+ */
 void rxe_pool_cleanup(struct rxe_pool *pool)
 {
 	struct rxe_pool_elem *elem;
@@ -135,9 +151,15 @@ void rxe_pool_cleanup(struct rxe_pool *pool)
 
 	if (elem_count || obj_count)
 		pr_warn("Freed %d indices and %d objects from pool %s\n",
-			elem_count, obj_count, pool->name + 4);
+			elem_count, obj_count, pool->name);
 }
 
+/**
+ * rxe_alloc - allocate a new pool object
+ * @pool: object pool
+ *
+ * Returns: object on success else NULL
+ */
 void *rxe_alloc(struct rxe_pool *pool)
 {
 	struct rxe_pool_elem *elem;
@@ -178,6 +200,13 @@ void *rxe_alloc(struct rxe_pool *pool)
 	return NULL;
 }
 
+/**
+ * __rxe_add_to_pool - add rdma-core allocated object to rxe object pool
+ * @pool: object pool
+ * @elem: rxe_pool_elem embedded in object
+ *
+ * Returns: 0 on success else an error
+ */
 int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem)
 {
 	int err;
@@ -208,6 +237,13 @@ int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem)
 	return -EINVAL;
 }
 
+/**
+ * rxe_pool_get_index - find object in pool with given index
+ * @pool: object pool
+ * @index: index
+ *
+ * Returns: object on success else NULL
+ */
 void *rxe_pool_get_index(struct rxe_pool *pool, u32 index)
 {
 	struct rxe_pool_elem *elem;
@@ -224,6 +260,16 @@ void *rxe_pool_get_index(struct rxe_pool *pool, u32 index)
 	return obj;
 }
 
+/**
+ * rxe_elem_release - remove object index and complete
+ * @kref: kref embedded in pool element
+ * @flags: flags for lock release
+ *
+ * Context: called holding pool->xa.xa_lock which must be
+ *	    dropped. Called from __rxe_drop_ref() when the ref count
+ *	    reaches zero. Completes the object scheduling
+ *	    a waiter on the object.
+ */
 static void rxe_elem_release(struct kref *kref, unsigned long flags)
 	__releases(&pool->xa.xa_lock)
 {
@@ -238,13 +284,28 @@ static void rxe_elem_release(struct kref *kref, unsigned long flags)
 	complete(&elem->complete);
 }
 
+/**
+ * __rxe_add_ref - gets a kref on the object unless ref count is zero
+ * @elem: rxe_pool_elem embedded in object
+ *
+ * Returns: 1 if reference is added else 0 because
+ *	    ref count has reached zero
+ */
 int __rxe_add_ref(struct rxe_pool_elem *elem)
 {
 	return kref_get_unless_zero(&elem->ref_cnt);
 }
 
-/* local copy of kref_put_lock_irqsave same as kref_put_lock
- * except for _irqsave locks
+/**
+ * kref_put_lock_irqsave - local copy of kref_put_lock
+ * @kref: kref embedded in elem
+ * @release: cleanup function called when ref count reaches zero
+ * @lock: irqsave spinlock to take if ref count reaches zero
+ *
+ * Same as kref_put_lock except for _irqsave locks
+ *
+ * Returns: 1 if ref count reaches zero and release is called
+ *	    while holding the lock else 0.
  */
 static int kref_put_lock_irqsave(struct kref *kref,
 		 void (*release)(struct kref *kref, unsigned long flags),
@@ -259,6 +320,15 @@ static int kref_put_lock_irqsave(struct kref *kref,
 	return 0;
 }
 
+/**
+ * __rxe_drop_ref - puts a kref on the object
+ * @elem: rxe_pool_elem embedded in object
+ *
+ * Puts a kref on the object and if ref count reaches zero
+ * takes the lock and calls release() which must free the lock.
+ *
+ * Returns: 1 if ref count reaches zero and release called else 0
+ */
 int __rxe_drop_ref(struct rxe_pool_elem *elem)
 {
 	struct rxe_pool *pool = elem->pool;
@@ -267,6 +337,12 @@ int __rxe_drop_ref(struct rxe_pool_elem *elem)
 			&pool->xa.xa_lock);
 }
 
+/**
+ * __rxe_drop_wait - put a kref on object and wait for completion
+ * @elem: rxe_pool_elem embedded in object
+ *
+ * Returns: non-zero value if completion is reached else 0 if timeout
+ */
 int __rxe_drop_wait(struct rxe_pool_elem *elem)
 {
 	struct rxe_pool *pool = elem->pool;
@@ -275,7 +351,7 @@ int __rxe_drop_wait(struct rxe_pool_elem *elem)
 
 	if (elem->enabled) {
 		pr_warn_once("%s#%d: should be disabled by now\n",
-			     elem->pool->name + 4, elem->index);
+			     elem->pool->name, elem->index);
 		elem->enabled = false;
 	}
 
@@ -285,11 +361,11 @@ int __rxe_drop_wait(struct rxe_pool_elem *elem)
 		ret = wait_for_completion_timeout(&elem->complete, timeout);
 		if (!ret) {
 			pr_warn("Timed out waiting for %s#%d\n",
-				pool->name + 4, elem->index);
+				pool->name, elem->index);
 			if (++pool->timeouts == RXE_MAX_POOL_TIMEOUTS) {
 				timeout = 0;
 				pr_warn("Reached max %s timeouts.\n",
-					pool->name + 4);
+					pool->name);
 			}
 		}
 	}
-- 
2.32.0


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH for-next v10 00/11] Fix race conditions in rxe_pool
  2022-02-25 19:57 [PATCH for-next v10 00/11] Fix race conditions in rxe_pool Bob Pearson
                   ` (10 preceding siblings ...)
  2022-02-25 19:57 ` [PATCH for-next v10 11/11] RDMA/rxe: Cleanup rxe_pool.c Bob Pearson
@ 2022-02-25 20:46 ` Jason Gunthorpe
  11 siblings, 0 replies; 21+ messages in thread
From: Jason Gunthorpe @ 2022-02-25 20:46 UTC (permalink / raw)
  To: Bob Pearson; +Cc: zyjzyj2000, linux-rdma

On Fri, Feb 25, 2022 at 01:57:40PM -0600, Bob Pearson wrote:
> There are several race conditions discovered in the current rdma_rxe
> driver.  They mostly relate to races between normal operations and
> destroying objects.  This patch series
>  - Makes several minor cleanups in rxe_pool.[ch]
>  - Replaces the red-black trees currently used by xarrays for indices
>  - Corrects several reference counting errors
>  - Adds wait for completions to the paths in verbs APIs which destroy
>    objects.
>  - Changes read side locking to rcu.
> 
> Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
> ---
> v10
>   Rebased to current wip/jgg-for-next.
>   Split some patches into smaller ones.

Before I look at this, can I apply it without the last two mcast
patches?

Thanks,
Jason

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH for-next v10 04/11] RDMA/rxe: Replace red-black trees by xarrays
  2022-02-25 19:57 ` [PATCH for-next v10 04/11] RDMA/rxe: Replace red-black trees by xarrays Bob Pearson
@ 2022-02-28 16:57   ` Jason Gunthorpe
  2022-02-28 17:28     ` Robert Pearson
  0 siblings, 1 reply; 21+ messages in thread
From: Jason Gunthorpe @ 2022-02-28 16:57 UTC (permalink / raw)
  To: Bob Pearson; +Cc: zyjzyj2000, linux-rdma

On Fri, Feb 25, 2022 at 01:57:44PM -0600, Bob Pearson wrote:

> +void *rxe_pool_get_index(struct rxe_pool *pool, u32 index)
> +{
> +	struct rxe_pool_elem *elem;
> +	unsigned long flags;
> +	void *obj;
> +
> +	spin_lock_irqsave(&pool->xa.xa_lock, flags);

You can't reach into the xa_lock like this, use the xa_lock()
family of functions instead, everywhere.

Jason

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH for-next v10 05/11] RDMA/rxe: Stop lookup of partially built objects
  2022-02-25 19:57 ` [PATCH for-next v10 05/11] RDMA/rxe: Stop lookup of partially built objects Bob Pearson
@ 2022-02-28 17:01   ` Jason Gunthorpe
  0 siblings, 0 replies; 21+ messages in thread
From: Jason Gunthorpe @ 2022-02-28 17:01 UTC (permalink / raw)
  To: Bob Pearson; +Cc: zyjzyj2000, linux-rdma

On Fri, Feb 25, 2022 at 01:57:45PM -0600, Bob Pearson wrote:
> Currently the rdma_rxe driver has a security weakness due to adding
> objects which are partially initialized to indices allowing external
> actors to gain access to them by sending packets which refer to
> their index (e.g. qpn, rkey, etc).
> 
> This patch adds a member to the pool element struct indicating whether
> the object should/or should not allow looking up from its index. This
> variable is set only after the object is completely created and unset
> as soon as possible when the object is destroyed.

Why do we have to put incompletely initialized pointers into the
xarray?

Either:

 1) Do the xa_alloc after everything is setup properly, splitting
    allocation and ID assignment.

 2) Do xa_alloc(XA_ZERO_ENTRY) at the start to reserve the ID
    then xa_store to set the pointer (can't fail) or xa_erase()
    to abort it

> @@ -81,4 +82,8 @@ int __rxe_drop_ref(struct rxe_pool_elem *elem);
>  
>  #define rxe_read_ref(obj) kref_read(&(obj)->elem.ref_cnt)
>  
> +#define rxe_enable(obj) ((obj)->elem.enabled = true)
> +
> +#define rxe_disable(obj) ((obj)->elem.enabled = false)

None of this is locked properly. A release/acquire needs to happen to
ensure all the stores that initialized the memory are visible to the
reader. Both of the above will ensure that happens.
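
As a rough illustration of the reserve-then-publish idea described above, here is a minimal userspace sketch using C11 atomics. It is not the xarray API: the table, `slot_reserve()`, `slot_publish()`, `slot_abort()`, `slot_lookup()` and `reserved_marker` are hypothetical names invented for this example, with `reserved_marker` standing in for the role XA_ZERO_ENTRY plays. The release store when publishing pairs with the acquire load in the lookup, so a reader that sees the pointer also sees the stores that initialized the object.

```c
#include <stdatomic.h>
#include <stddef.h>

#define SLOTS 8

/* Hypothetical object; stands in for a pool object such as a QP or MR. */
struct obj {
	int index;
	int payload;
};

/* Sentinel for "index reserved, object not yet visible" (assumption:
 * plays the role that XA_ZERO_ENTRY plays for an xarray).
 */
static struct obj reserved_marker;

/* Static storage, so every slot starts out as NULL (free). */
static _Atomic(struct obj *) table[SLOTS];

/* Reserve a free slot so its index can be used while the object is
 * still being built. Returns the index, or -1 if the table is full.
 */
static int slot_reserve(void)
{
	for (int i = 0; i < SLOTS; i++) {
		struct obj *expected = NULL;

		if (atomic_compare_exchange_strong(&table[i], &expected,
						   &reserved_marker))
			return i;
	}
	return -1;
}

/* Publish a fully initialized object. The release store orders all
 * earlier initialization stores before the pointer becomes visible.
 */
static void slot_publish(int i, struct obj *o)
{
	atomic_store_explicit(&table[i], o, memory_order_release);
}

/* Give up a reservation if object setup failed. */
static void slot_abort(int i)
{
	atomic_store_explicit(&table[i], NULL, memory_order_release);
}

/* Look up an object. The acquire load pairs with the release store in
 * slot_publish(). Reserved or empty slots are reported as NULL.
 */
static struct obj *slot_lookup(int i)
{
	struct obj *o;

	if (i < 0 || i >= SLOTS)
		return NULL;
	o = atomic_load_explicit(&table[i], memory_order_acquire);
	return (o == &reserved_marker) ? NULL : o;
}
```

With this shape a lookup can never return a half-built object, so a separate enabled flag is not needed for that purpose; whether the same holds for the rxe pools depends on details not shown in this thread.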

Jason

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH for-next v10 06/11] RDMA/rxe: Add wait_for_completion to pool objects
  2022-02-25 19:57 ` [PATCH for-next v10 06/11] RDMA/rxe: Add wait_for_completion to pool objects Bob Pearson
@ 2022-02-28 17:05   ` Jason Gunthorpe
  0 siblings, 0 replies; 21+ messages in thread
From: Jason Gunthorpe @ 2022-02-28 17:05 UTC (permalink / raw)
  To: Bob Pearson; +Cc: zyjzyj2000, linux-rdma

On Fri, Feb 25, 2022 at 01:57:46PM -0600, Bob Pearson wrote:
>  int __rxe_add_ref(struct rxe_pool_elem *elem)
> @@ -262,3 +258,36 @@ int __rxe_drop_ref(struct rxe_pool_elem *elem)
>  	return kref_put_lock_irqsave(&elem->ref_cnt, rxe_elem_release,
>  			&pool->xa.xa_lock);

Also can't touch the xa_lock to do stuff like this,

> +int __rxe_drop_wait(struct rxe_pool_elem *elem)

I think I would call this something else since it basically unconditionally
frees the memory.

> +{
> +	struct rxe_pool *pool = elem->pool;
> +	static int timeout = RXE_POOL_TIMEOUT;
> +	int ret;
> +
> +	__rxe_drop_ref(elem);

> +	if (timeout) {
> +		ret = wait_for_completion_timeout(&elem->complete, timeout);
> +		if (!ret) {
> +			pr_warn("Timed out waiting for %s#%d\n",
> +				pool->name + 4, elem->index);

This is a WARN_ON event, kernel is broken, and you should leak the
memory rather than cause memory corruption.

> +			if (++pool->timeouts == RXE_MAX_POOL_TIMEOUTS) {
> +				timeout = 0;
> +				pr_warn("Reached max %s timeouts.\n",
> +					pool->name + 4);
> +			}

Why?

> +		}
> +	}
> +
> +	if (pool->cleanup)
> +		pool->cleanup(elem);
> +
> +	if (pool->flags & RXE_POOL_ALLOC)
> +		kfree(elem->obj);
> +
> +	atomic_dec(&pool->num_elem);
> +
> +	return ret;

And we return a failure code but freed the memory? This shouldn't fail

But the idea is right..

Jason

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH for-next v10 07/11] RDMA/rxe: Fix ref error in rxe_av.c
  2022-02-25 19:57 ` [PATCH for-next v10 07/11] RDMA/rxe: Fix ref error in rxe_av.c Bob Pearson
@ 2022-02-28 17:06   ` Jason Gunthorpe
  0 siblings, 0 replies; 21+ messages in thread
From: Jason Gunthorpe @ 2022-02-28 17:06 UTC (permalink / raw)
  To: Bob Pearson; +Cc: zyjzyj2000, linux-rdma

On Fri, Feb 25, 2022 at 01:57:47PM -0600, Bob Pearson wrote:
> The commit referenced below can take a reference to the AH which is
> never dropped. This only happens in the UD request path. This patch
> optionally passes that AH back to the caller so that it can hold the
> reference while the AV is being accessed and then drop it. Code to
> do this is added to rxe_req.c. The AV is also passed to rxe_prepare
> in rxe_net.c as an optimization.
> 
> Fixes: e2fe06c90806 ("RDMA/rxe: Lookup kernel AH from ah index in UD WQEs")
> Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
> ---
>  drivers/infiniband/sw/rxe/rxe_av.c   | 19 +++++++++-
>  drivers/infiniband/sw/rxe/rxe_loc.h  |  5 ++-
>  drivers/infiniband/sw/rxe/rxe_net.c  | 17 +++++----
>  drivers/infiniband/sw/rxe/rxe_req.c  | 55 +++++++++++++++++-----------
>  drivers/infiniband/sw/rxe/rxe_resp.c |  2 +-
>  5 files changed, 63 insertions(+), 35 deletions(-)

This should be ordered earlier as the prior patch turns this into a
bigger problem?

Jason

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH for-next v10 09/11] RDMA/rxe: Convert read side locking to rcu
  2022-02-25 19:57 ` [PATCH for-next v10 09/11] RDMA/rxe: Convert read side locking to rcu Bob Pearson
@ 2022-02-28 17:12   ` Jason Gunthorpe
  0 siblings, 0 replies; 21+ messages in thread
From: Jason Gunthorpe @ 2022-02-28 17:12 UTC (permalink / raw)
  To: Bob Pearson; +Cc: zyjzyj2000, linux-rdma

On Fri, Feb 25, 2022 at 01:57:49PM -0600, Bob Pearson wrote:
> Use rcu_read_lock() for protecting read side operations in rxe_pool.c.
> 
> Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
> ---
>  drivers/infiniband/sw/rxe/rxe_pool.c | 13 +++++++++----
>  1 file changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
> index 1d1e10290991..713df1ce2bbc 100644
> --- a/drivers/infiniband/sw/rxe/rxe_pool.c
> +++ b/drivers/infiniband/sw/rxe/rxe_pool.c
> @@ -202,16 +202,15 @@ int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem)
>  void *rxe_pool_get_index(struct rxe_pool *pool, u32 index)
>  {
>  	struct rxe_pool_elem *elem;
> -	unsigned long flags;
>  	void *obj;
>  
> -	spin_lock_irqsave(&pool->xa.xa_lock, flags);
> +	rcu_read_lock();
>  	elem = xa_load(&pool->xa, index);
>  	if (elem && elem->enabled && kref_get_unless_zero(&elem->ref_cnt))
>  		obj = elem->obj;
>  	else
>  		obj = NULL;
> -	spin_unlock_irqrestore(&pool->xa.xa_lock, flags);
> +	rcu_read_unlock();

Where is the kfree_rcu to go along with this?
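
For context, the quoted lookup returns an object only when kref_get_unless_zero() succeeds. Below is a minimal userspace sketch of that increment-unless-zero step using C11 atomics; `struct ref`, `ref_init()`, `ref_get_unless_zero()` and `ref_put()` are hypothetical stand-ins, not the kernel's kref implementation. It covers only the reference count half of the problem: the memory holding the count must still be valid while a reader inspects it, which is what the deferred free asked about here would provide.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for struct kref. */
struct ref {
	atomic_uint count;
};

/* A new object starts with one reference held by its creator. */
static void ref_init(struct ref *r)
{
	atomic_init(&r->count, 1);
}

/* Take a reference only if the count has not already dropped to zero.
 * Returns true if a reference was taken.
 */
static bool ref_get_unless_zero(struct ref *r)
{
	unsigned int old = atomic_load_explicit(&r->count,
						memory_order_relaxed);

	while (old != 0) {
		/* On failure 'old' is updated to the current value. */
		if (atomic_compare_exchange_weak_explicit(&r->count, &old,
							  old + 1,
							  memory_order_acquire,
							  memory_order_relaxed))
			return true;
	}
	return false;
}

/* Drop a reference. Returns true if this was the last one, in which
 * case the caller is responsible for releasing the object.
 */
static bool ref_put(struct ref *r)
{
	return atomic_fetch_sub_explicit(&r->count, 1,
					 memory_order_acq_rel) == 1;
}
```

Once the count has reached zero, every later ref_get_unless_zero() fails, so a lookup racing with destruction returns nothing instead of reviving the object.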

Jason

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH for-next v10 01/11] RDMA/rxe: Reverse the sense of RXE_POOL_NO_ALLOC
  2022-02-25 19:57 ` [PATCH for-next v10 01/11] RDMA/rxe: Reverse the sense of RXE_POOL_NO_ALLOC Bob Pearson
@ 2022-02-28 17:15   ` Jason Gunthorpe
  0 siblings, 0 replies; 21+ messages in thread
From: Jason Gunthorpe @ 2022-02-28 17:15 UTC (permalink / raw)
  To: Bob Pearson; +Cc: zyjzyj2000, linux-rdma

On Fri, Feb 25, 2022 at 01:57:41PM -0600, Bob Pearson wrote:
> @@ -264,6 +261,12 @@ void *rxe_alloc(struct rxe_pool *pool)
>  	struct rxe_pool_elem *elem;
>  	void *obj;
>  
> +	if (!(pool->flags & RXE_POOL_ALLOC)) {
> +		pr_warn_once("%s: Pool %s must call rxe_add_to_pool\n",
> +				__func__, pool->name);

Just make these

if (WARN_ON(!(pool->flags & RXE_POOL_ALLOC)))
   return NULL;

Jason

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH for-next v10 04/11] RDMA/rxe: Replace red-black trees by xarrays
  2022-02-28 16:57   ` Jason Gunthorpe
@ 2022-02-28 17:28     ` Robert Pearson
  2022-02-28 17:56       ` Jason Gunthorpe
  0 siblings, 1 reply; 21+ messages in thread
From: Robert Pearson @ 2022-02-28 17:28 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Zhu Yanjun, RDMA mailing list

There is an xa_lock_irqsave()/xa_unlock_irqrestore() pair but I actually
need some things that they don't support. In particular there is no
option to call xa_alloc_cyclic_irqsave, and I also need an irqsave
version of kref_put_lock and had to code one which calls the refcount
version, but again that takes the address of a lock and not an xarray.
All this is because rdmacm is crazy and makes verbs api calls with
spinlocks held.

Bob

On Mon, Feb 28, 2022 at 10:57 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Fri, Feb 25, 2022 at 01:57:44PM -0600, Bob Pearson wrote:
>
> > +void *rxe_pool_get_index(struct rxe_pool *pool, u32 index)
> > +{
> > +     struct rxe_pool_elem *elem;
> > +     unsigned long flags;
> > +     void *obj;
> > +
> > +     spin_lock_irqsave(&pool->xa.xa_lock, flags);
>
> You can't reach into the xa_lock like this, use the xa_lock()
> family of functions instead, everywhere.
>
> Jason


* Re: [PATCH for-next v10 04/11] RDMA/rxe: Replace red-black trees by xarrays
  2022-02-28 17:28     ` Robert Pearson
@ 2022-02-28 17:56       ` Jason Gunthorpe
  0 siblings, 0 replies; 21+ messages in thread
From: Jason Gunthorpe @ 2022-02-28 17:56 UTC (permalink / raw)
  To: Robert Pearson; +Cc: Zhu Yanjun, RDMA mailing list

On Mon, Feb 28, 2022 at 11:28:44AM -0600, Robert Pearson wrote:
> There is a xa_lock_irqsave()/xa_unlock_irqrestore but I actually need
> some things that they don't support.
> In particular there is not an option to call xa_alloc_cyclic_irqsave

Well, yes, there is actually good reason for this. The lock/unlock
scheme that the allocating xarray functions use can't be trivially
nested like this.

When, and only when, you need to allocate the non-blocking AH, you
should use this pattern:

xa_lock_irqsave(..)
__xa_alloc_cyclic_reserve(..., GFP_ATOMIC);
xa_unlock_irqrestore(..)

Everything else should use a simple _irq variant (is this even needed?)
without nesting under another spinlock.

> and I also need an irqsave version of kref_put_lock and had to code
> one which calls the refcount version but again that takes the
> address of a lock and not an xarray. All this is because rdmacm is
> crazy and makes verbs api calls with spinlocks held.

You shouldn't need kref_put_lock at all. It isn't really a kref, just
use a normal refcount and trigger the completion when it reaches
0. Nothing fancy required.
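
The refcount-plus-completion idea above can be illustrated with a small
userspace sketch. This is not the rxe code and not kernel API: struct
elem and the elem_*() helpers are hypothetical names, C11 atomics stand
in for refcount_t, and a mutex/condvar pair stands in for a struct
completion. It only shows the shape of the pattern: the last put signals,
and the destroy path waits.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a pool element. */
struct elem {
	atomic_int refcnt;      /* plain reference count */
	pthread_mutex_t lock;   /* protects 'done' */
	pthread_cond_t cond;    /* signalled when the count reaches zero */
	bool done;              /* plays the role of a completion */
};

static void elem_init(struct elem *e)
{
	atomic_init(&e->refcnt, 1);     /* creator holds the first reference */
	pthread_mutex_init(&e->lock, NULL);
	pthread_cond_init(&e->cond, NULL);
	e->done = false;
}

/* Take a reference unless the count has already dropped to zero. */
static bool elem_get(struct elem *e)
{
	int old = atomic_load(&e->refcnt);

	while (old != 0) {
		/* On failure 'old' is reloaded and the loop retries. */
		if (atomic_compare_exchange_weak(&e->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; the final put triggers the completion. */
static void elem_put(struct elem *e)
{
	if (atomic_fetch_sub(&e->refcnt, 1) == 1) {
		pthread_mutex_lock(&e->lock);
		e->done = true;
		pthread_cond_broadcast(&e->cond);
		pthread_mutex_unlock(&e->lock);
	}
}

/* Destroy path: drop the creator's reference, then wait for all users. */
static void elem_destroy_wait(struct elem *e)
{
	elem_put(e);
	pthread_mutex_lock(&e->lock);
	while (!e->done)
		pthread_cond_wait(&e->cond, &e->lock);
	pthread_mutex_unlock(&e->lock);
}
```

In this sketch no lock is taken on the put path except to signal the
waiter, which is why nothing like kref_put_lock is needed. A lookup that
races with destroy simply fails in elem_get() once the count is zero.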

Jason


end of thread, other threads:[~2022-02-28 18:19 UTC | newest]

Thread overview: 21+ messages
2022-02-25 19:57 [PATCH for-next v10 00/11] Fix race conditions in rxe_pool Bob Pearson
2022-02-25 19:57 ` [PATCH for-next v10 01/11] RDMA/rxe: Reverse the sense of RXE_POOL_NO_ALLOC Bob Pearson
2022-02-28 17:15   ` Jason Gunthorpe
2022-02-25 19:57 ` [PATCH for-next v10 02/11] RDMA/rxe: Delete _locked() APIs for pool objects Bob Pearson
2022-02-25 19:57 ` [PATCH for-next v10 03/11] RDMA/rxe: Replace obj by elem in declaration Bob Pearson
2022-02-25 19:57 ` [PATCH for-next v10 04/11] RDMA/rxe: Replace red-black trees by xarrays Bob Pearson
2022-02-28 16:57   ` Jason Gunthorpe
2022-02-28 17:28     ` Robert Pearson
2022-02-28 17:56       ` Jason Gunthorpe
2022-02-25 19:57 ` [PATCH for-next v10 05/11] RDMA/rxe: Stop lookup of partially built objects Bob Pearson
2022-02-28 17:01   ` Jason Gunthorpe
2022-02-25 19:57 ` [PATCH for-next v10 06/11] RDMA/rxe: Add wait_for_completion to pool objects Bob Pearson
2022-02-28 17:05   ` Jason Gunthorpe
2022-02-25 19:57 ` [PATCH for-next v10 07/11] RDMA/rxe: Fix ref error in rxe_av.c Bob Pearson
2022-02-28 17:06   ` Jason Gunthorpe
2022-02-25 19:57 ` [PATCH for-next v10 08/11] RDMA/rxe: Replace mr by rkey in responder resources Bob Pearson
2022-02-25 19:57 ` [PATCH for-next v10 09/11] RDMA/rxe: Convert read side locking to rcu Bob Pearson
2022-02-28 17:12   ` Jason Gunthorpe
2022-02-25 19:57 ` [PATCH for-next v10 10/11] RDMA/rxe: Move max_elem into rxe_type_info Bob Pearson
2022-02-25 19:57 ` [PATCH for-next v10 11/11] RDMA/rxe: Cleanup rxe_pool.c Bob Pearson
2022-02-25 20:46 ` [PATCH for-next v10 00/11] Fix race conditions in rxe_pool Jason Gunthorpe
