* [PATCH for-next v3 00/10] Correct race conditions in rdma_rxe
@ 2021-10-22 19:18 Bob Pearson
  2021-10-22 19:18 ` [PATCH for-next v3 01/10] RDMA/rxe: Make rxe_alloc() take pool lock Bob Pearson
                   ` (9 more replies)
  0 siblings, 10 replies; 14+ messages in thread
From: Bob Pearson @ 2021-10-22 19:18 UTC
  To: jgg, zyjzyj2000, linux-rdma; +Cc: Bob Pearson

Several race conditions have been discovered in the rdma_rxe driver;
these patches correct them. Most relate to races between normal
operations and destroying objects.

This patch series applies cleanly to current for-next:
commit 71ee1f127543 ("Merge branch 'mlx5_mkey' into rdma.git for-next")

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
v3
  Changed rxe_alloc to use GFP_KERNEL
  Addressed other review comments from Jason Gunthorpe
  Merged the previous 06/10 and 07/10 patches into one since they overlapped
  Added some minor cleanups as 10/10
v2
  Rebased to current for-next.
  Added 4 additional patches

Bob Pearson (10):
  RDMA/rxe: Make rxe_alloc() take pool lock
  RDMA/rxe: Copy setup parameters into rxe_pool
  RDMA/rxe: Save object pointer in pool element
  RDMA/rxe: Combine rxe_add_index with rxe_alloc
  RDMA/rxe: Combine rxe_add_key with rxe_alloc
  RDMA/rxe: Separate out last rxe_drop_ref
  RDMA/rxe: Rewrite rxe_mcast.c
  RDMA/rxe: Fix ref error in rxe_av.c
  RDMA/rxe: Replace mr by rkey in responder resources
  RDMA/rxe: Minor cleanup in rxe_pool.c

 drivers/infiniband/sw/rxe/rxe.c       |   8 -
 drivers/infiniband/sw/rxe/rxe.h       |   1 +
 drivers/infiniband/sw/rxe/rxe_av.c    |  24 +-
 drivers/infiniband/sw/rxe/rxe_cq.c    |   9 +-
 drivers/infiniband/sw/rxe/rxe_loc.h   |  19 +-
 drivers/infiniband/sw/rxe/rxe_mcast.c | 210 ++++++++-----
 drivers/infiniband/sw/rxe/rxe_mr.c    |  11 +-
 drivers/infiniband/sw/rxe/rxe_mw.c    |  28 +-
 drivers/infiniband/sw/rxe/rxe_net.c   |  39 +--
 drivers/infiniband/sw/rxe/rxe_pool.c  | 408 +++++++++++++++++---------
 drivers/infiniband/sw/rxe/rxe_pool.h  |  84 +++---
 drivers/infiniband/sw/rxe/rxe_qp.c    |  10 +-
 drivers/infiniband/sw/rxe/rxe_req.c   |  55 ++--
 drivers/infiniband/sw/rxe/rxe_resp.c  | 125 +++++---
 drivers/infiniband/sw/rxe/rxe_srq.c   |   8 +
 drivers/infiniband/sw/rxe/rxe_verbs.c |  67 ++---
 drivers/infiniband/sw/rxe/rxe_verbs.h |   4 +-
 17 files changed, 653 insertions(+), 457 deletions(-)

-- 
2.30.2



* [PATCH for-next v3 01/10] RDMA/rxe: Make rxe_alloc() take pool lock
  2021-10-22 19:18 [PATCH for-next v3 00/10] Correct race conditions in rdma_rxe Bob Pearson
@ 2021-10-22 19:18 ` Bob Pearson
  2021-10-25 17:55   ` Jason Gunthorpe
  2021-10-22 19:18 ` [PATCH for-next v3 02/10] RDMA/rxe: Copy setup parameters into rxe_pool Bob Pearson
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 14+ messages in thread
From: Bob Pearson @ 2021-10-22 19:18 UTC
  To: jgg, zyjzyj2000, linux-rdma; +Cc: Bob Pearson

In rxe_pool.c there are two separate pool APIs for creating a new
object, rxe_alloc() and rxe_alloc_locked(). Currently they are
identical except for GFP_KERNEL vs GFP_ATOMIC. Make rxe_alloc() take
the pool lock, which is in line with the other APIs in the library and
was the original intent, but keep kzalloc() outside of the lock to
avoid having to use GFP_ATOMIC. Make a similar change to
rxe_add_to_pool(). The code checking the pool limit is potentially
racy without a lock, and the lock taken around finishing the pool
element provides a memory barrier before 'publishing' the object.
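
The resulting locking pattern, as a minimal sketch (simplified from
the diff below, with the error labels inlined): the limit check and
the final initialization both run under the pool lock, while the
allocation that may sleep sits between the two critical sections.

    /* sketch only -- see the real rxe_alloc() in the diff below */
    write_lock_irqsave(&pool->pool_lock, flags);
    if (atomic_inc_return(&pool->num_elem) > pool->max_elem) {
            atomic_dec(&pool->num_elem);
            write_unlock_irqrestore(&pool->pool_lock, flags);
            return NULL;
    }
    write_unlock_irqrestore(&pool->pool_lock, flags);

    obj = kzalloc(info->size, GFP_KERNEL);      /* may sleep */
    if (!obj) {
            atomic_dec(&pool->num_elem);
            return NULL;
    }

    write_lock_irqsave(&pool->pool_lock, flags);
    elem = (struct rxe_pool_entry *)(obj + info->elem_offset);
    elem->pool = pool;
    kref_init(&elem->ref_cnt);  /* 'publish' under the lock */
    write_unlock_irqrestore(&pool->pool_lock, flags);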

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
v3
  Kept rxe_alloc and rxe_alloc_locked separate to allow use of GFP_KERNEL
  in rxe_alloc

 drivers/infiniband/sw/rxe/rxe_pool.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
index 7b4cb46edfd9..8138e459b6c1 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.c
+++ b/drivers/infiniband/sw/rxe/rxe_pool.c
@@ -356,39 +356,51 @@ void *rxe_alloc(struct rxe_pool *pool)
 {
 	struct rxe_type_info *info = &rxe_type_info[pool->type];
 	struct rxe_pool_entry *elem;
+	unsigned long flags;
 	u8 *obj;
 
+	write_lock_irqsave(&pool->pool_lock, flags);
 	if (atomic_inc_return(&pool->num_elem) > pool->max_elem)
 		goto out_cnt;
+	write_unlock_irqrestore(&pool->pool_lock, flags);
 
 	obj = kzalloc(info->size, GFP_KERNEL);
-	if (!obj)
-		goto out_cnt;
+	if (!obj) {
+		atomic_dec(&pool->num_elem);
+		return NULL;
+	}
 
+	write_lock_irqsave(&pool->pool_lock, flags);
 	elem = (struct rxe_pool_entry *)(obj + info->elem_offset);
-
 	elem->pool = pool;
 	kref_init(&elem->ref_cnt);
+	write_unlock_irqrestore(&pool->pool_lock, flags);
 
 	return obj;
 
 out_cnt:
 	atomic_dec(&pool->num_elem);
+	write_unlock_irqrestore(&pool->pool_lock, flags);
 	return NULL;
 }
 
 int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_entry *elem)
 {
+	unsigned long flags;
+
+	write_lock_irqsave(&pool->pool_lock, flags);
 	if (atomic_inc_return(&pool->num_elem) > pool->max_elem)
 		goto out_cnt;
 
 	elem->pool = pool;
 	kref_init(&elem->ref_cnt);
+	write_unlock_irqrestore(&pool->pool_lock, flags);
 
 	return 0;
 
 out_cnt:
 	atomic_dec(&pool->num_elem);
+	write_unlock_irqrestore(&pool->pool_lock, flags);
 	return -EINVAL;
 }
 
-- 
2.30.2



* [PATCH for-next v3 02/10] RDMA/rxe: Copy setup parameters into rxe_pool
  2021-10-22 19:18 [PATCH for-next v3 00/10] Correct race conditions in rdma_rxe Bob Pearson
  2021-10-22 19:18 ` [PATCH for-next v3 01/10] RDMA/rxe: Make rxe_alloc() take pool lock Bob Pearson
@ 2021-10-22 19:18 ` Bob Pearson
  2021-10-22 19:18 ` [PATCH for-next v3 03/10] RDMA/rxe: Save object pointer in pool element Bob Pearson
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 14+ messages in thread
From: Bob Pearson @ 2021-10-22 19:18 UTC
  To: jgg, zyjzyj2000, linux-rdma; +Cc: Bob Pearson

In rxe_pool.c copy the remaining pool setup parameters from
rxe_type_info into rxe_pool. This saves looking up rxe_type_info in
the performance path.
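
As a sketch, a hot-path allocation after this patch reads the cached
fields directly instead of indexing the type table:

    /* sketch: pool->elem_size and pool->elem_offset were copied
     * from rxe_type_info[type] at rxe_pool_init() time
     */
    obj = kzalloc(pool->elem_size, GFP_ATOMIC);
    elem = (struct rxe_pool_entry *)(obj + pool->elem_offset);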

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_pool.c | 54 ++++++++++++----------------
 drivers/infiniband/sw/rxe/rxe_pool.h |  6 ++--
 2 files changed, 25 insertions(+), 35 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
index 8138e459b6c1..86251145705f 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.c
+++ b/drivers/infiniband/sw/rxe/rxe_pool.c
@@ -7,9 +7,8 @@
 #include "rxe.h"
 #include "rxe_loc.h"
 
-/* info about object pools
- */
-struct rxe_type_info rxe_type_info[RXE_NUM_TYPES] = {
+/* info about object pools */
+static const struct rxe_type_info rxe_type_info[RXE_NUM_TYPES] = {
 	[RXE_TYPE_UC] = {
 		.name		= "rxe-uc",
 		.size		= sizeof(struct rxe_ucontext),
@@ -88,11 +87,6 @@ struct rxe_type_info rxe_type_info[RXE_NUM_TYPES] = {
 	},
 };
 
-static inline const char *pool_name(struct rxe_pool *pool)
-{
-	return rxe_type_info[pool->type].name;
-}
-
 static int rxe_pool_init_index(struct rxe_pool *pool, u32 max, u32 min)
 {
 	int err = 0;
@@ -127,35 +121,36 @@ int rxe_pool_init(
 	enum rxe_elem_type	type,
 	unsigned int		max_elem)
 {
+	const struct rxe_type_info *info = &rxe_type_info[type];
 	int			err = 0;
-	size_t			size = rxe_type_info[type].size;
 
 	memset(pool, 0, sizeof(*pool));
 
 	pool->rxe		= rxe;
+	pool->name		= info->name;
 	pool->type		= type;
 	pool->max_elem		= max_elem;
-	pool->elem_size		= ALIGN(size, RXE_POOL_ALIGN);
-	pool->flags		= rxe_type_info[type].flags;
+	pool->elem_size		= ALIGN(info->size, RXE_POOL_ALIGN);
+	pool->elem_offset	= info->elem_offset;
+	pool->flags		= info->flags;
 	pool->index.tree	= RB_ROOT;
 	pool->key.tree		= RB_ROOT;
-	pool->cleanup		= rxe_type_info[type].cleanup;
+	pool->cleanup		= info->cleanup;
 
 	atomic_set(&pool->num_elem, 0);
 
 	rwlock_init(&pool->pool_lock);
 
-	if (rxe_type_info[type].flags & RXE_POOL_INDEX) {
-		err = rxe_pool_init_index(pool,
-					  rxe_type_info[type].max_index,
-					  rxe_type_info[type].min_index);
+	if (info->flags & RXE_POOL_INDEX) {
+		err = rxe_pool_init_index(pool, info->max_index,
+					  info->min_index);
 		if (err)
 			goto out;
 	}
 
-	if (rxe_type_info[type].flags & RXE_POOL_KEY) {
-		pool->key.key_offset = rxe_type_info[type].key_offset;
-		pool->key.key_size = rxe_type_info[type].key_size;
+	if (info->flags & RXE_POOL_KEY) {
+		pool->key.key_offset = info->key_offset;
+		pool->key.key_size = info->key_size;
 	}
 
 out:
@@ -166,7 +161,7 @@ void rxe_pool_cleanup(struct rxe_pool *pool)
 {
 	if (atomic_read(&pool->num_elem) > 0)
 		pr_warn("%s pool destroyed with unfree'd elem\n",
-			pool_name(pool));
+			pool->name);
 
 	kfree(pool->index.table);
 }
@@ -329,18 +324,17 @@ void __rxe_drop_index(struct rxe_pool_entry *elem)
 
 void *rxe_alloc_locked(struct rxe_pool *pool)
 {
-	struct rxe_type_info *info = &rxe_type_info[pool->type];
 	struct rxe_pool_entry *elem;
 	u8 *obj;
 
 	if (atomic_inc_return(&pool->num_elem) > pool->max_elem)
 		goto out_cnt;
 
-	obj = kzalloc(info->size, GFP_ATOMIC);
+	obj = kzalloc(pool->elem_size, GFP_ATOMIC);
 	if (!obj)
 		goto out_cnt;
 
-	elem = (struct rxe_pool_entry *)(obj + info->elem_offset);
+	elem = (struct rxe_pool_entry *)(obj + pool->elem_offset);
 
 	elem->pool = pool;
 	kref_init(&elem->ref_cnt);
@@ -354,7 +348,6 @@ void *rxe_alloc_locked(struct rxe_pool *pool)
 
 void *rxe_alloc(struct rxe_pool *pool)
 {
-	struct rxe_type_info *info = &rxe_type_info[pool->type];
 	struct rxe_pool_entry *elem;
 	unsigned long flags;
 	u8 *obj;
@@ -364,14 +357,14 @@ void *rxe_alloc(struct rxe_pool *pool)
 		goto out_cnt;
 	write_unlock_irqrestore(&pool->pool_lock, flags);
 
-	obj = kzalloc(info->size, GFP_KERNEL);
+	obj = kzalloc(pool->elem_size, GFP_KERNEL);
 	if (!obj) {
 		atomic_dec(&pool->num_elem);
 		return NULL;
 	}
 
 	write_lock_irqsave(&pool->pool_lock, flags);
-	elem = (struct rxe_pool_entry *)(obj + info->elem_offset);
+	elem = (struct rxe_pool_entry *)(obj + pool->elem_offset);
 	elem->pool = pool;
 	kref_init(&elem->ref_cnt);
 	write_unlock_irqrestore(&pool->pool_lock, flags);
@@ -409,14 +402,13 @@ void rxe_elem_release(struct kref *kref)
 	struct rxe_pool_entry *elem =
 		container_of(kref, struct rxe_pool_entry, ref_cnt);
 	struct rxe_pool *pool = elem->pool;
-	struct rxe_type_info *info = &rxe_type_info[pool->type];
 	u8 *obj;
 
 	if (pool->cleanup)
 		pool->cleanup(elem);
 
 	if (!(pool->flags & RXE_POOL_NO_ALLOC)) {
-		obj = (u8 *)elem - info->elem_offset;
+		obj = (u8 *)elem - pool->elem_offset;
 		kfree(obj);
 	}
 
@@ -425,7 +417,6 @@ void rxe_elem_release(struct kref *kref)
 
 void *rxe_pool_get_index_locked(struct rxe_pool *pool, u32 index)
 {
-	struct rxe_type_info *info = &rxe_type_info[pool->type];
 	struct rb_node *node;
 	struct rxe_pool_entry *elem;
 	u8 *obj;
@@ -445,7 +436,7 @@ void *rxe_pool_get_index_locked(struct rxe_pool *pool, u32 index)
 
 	if (node) {
 		kref_get(&elem->ref_cnt);
-		obj = (u8 *)elem - info->elem_offset;
+		obj = (u8 *)elem - pool->elem_offset;
 	} else {
 		obj = NULL;
 	}
@@ -467,7 +458,6 @@ void *rxe_pool_get_index(struct rxe_pool *pool, u32 index)
 
 void *rxe_pool_get_key_locked(struct rxe_pool *pool, void *key)
 {
-	struct rxe_type_info *info = &rxe_type_info[pool->type];
 	struct rb_node *node;
 	struct rxe_pool_entry *elem;
 	u8 *obj;
@@ -491,7 +481,7 @@ void *rxe_pool_get_key_locked(struct rxe_pool *pool, void *key)
 
 	if (node) {
 		kref_get(&elem->ref_cnt);
-		obj = (u8 *)elem - info->elem_offset;
+		obj = (u8 *)elem - pool->elem_offset;
 	} else {
 		obj = NULL;
 	}
diff --git a/drivers/infiniband/sw/rxe/rxe_pool.h b/drivers/infiniband/sw/rxe/rxe_pool.h
index 1feca1bffced..fb10e0098415 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.h
+++ b/drivers/infiniband/sw/rxe/rxe_pool.h
@@ -44,8 +44,6 @@ struct rxe_type_info {
 	size_t			key_size;
 };
 
-extern struct rxe_type_info rxe_type_info[];
-
 struct rxe_pool_entry {
 	struct rxe_pool		*pool;
 	struct kref		ref_cnt;
@@ -61,14 +59,16 @@ struct rxe_pool_entry {
 
 struct rxe_pool {
 	struct rxe_dev		*rxe;
+	const char		*name;
 	rwlock_t		pool_lock; /* protects pool add/del/search */
-	size_t			elem_size;
 	void			(*cleanup)(struct rxe_pool_entry *obj);
 	enum rxe_pool_flags	flags;
 	enum rxe_elem_type	type;
 
 	unsigned int		max_elem;
 	atomic_t		num_elem;
+	size_t			elem_size;
+	size_t			elem_offset;
 
 	/* only used if indexed */
 	struct {
-- 
2.30.2



* [PATCH for-next v3 03/10] RDMA/rxe: Save object pointer in pool element
  2021-10-22 19:18 [PATCH for-next v3 00/10] Correct race conditions in rdma_rxe Bob Pearson
  2021-10-22 19:18 ` [PATCH for-next v3 01/10] RDMA/rxe: Make rxe_alloc() take pool lock Bob Pearson
  2021-10-22 19:18 ` [PATCH for-next v3 02/10] RDMA/rxe: Copy setup parameters into rxe_pool Bob Pearson
@ 2021-10-22 19:18 ` Bob Pearson
  2021-10-22 19:18 ` [PATCH for-next v3 04/10] RDMA/rxe: Combine rxe_add_index with rxe_alloc Bob Pearson
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 14+ messages in thread
From: Bob Pearson @ 2021-10-22 19:18 UTC
  To: jgg, zyjzyj2000, linux-rdma; +Cc: Bob Pearson

In rxe_pool.c there are currently many cases where it is necessary to
compute the offset from a pool element struct to the object containing
it in a type-independent way, where the offset is different for each
type. By saving a pointer to the object in the pool element when it is
created, this extra work can be avoided.
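
A before/after sketch of the lookup paths (simplified from the diff
below):

    /* before: per-type offset arithmetic on every lookup */
    obj = (u8 *)elem - pool->elem_offset;

    /* after: a single dereference of the saved back pointer */
    obj = elem->obj;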

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_pool.c | 30 ++++++++++++++++------------
 drivers/infiniband/sw/rxe/rxe_pool.h |  1 +
 2 files changed, 18 insertions(+), 13 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
index 86251145705f..2655bd372f59 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.c
+++ b/drivers/infiniband/sw/rxe/rxe_pool.c
@@ -220,7 +220,8 @@ static int rxe_insert_key(struct rxe_pool *pool, struct rxe_pool_entry *new)
 		elem = rb_entry(parent, struct rxe_pool_entry, key_node);
 
 		cmp = memcmp((u8 *)elem + pool->key.key_offset,
-			     (u8 *)new + pool->key.key_offset, pool->key.key_size);
+			     (u8 *)new + pool->key.key_offset,
+			     pool->key.key_size);
 
 		if (cmp == 0) {
 			pr_warn("key already exists!\n");
@@ -325,7 +326,7 @@ void __rxe_drop_index(struct rxe_pool_entry *elem)
 void *rxe_alloc_locked(struct rxe_pool *pool)
 {
 	struct rxe_pool_entry *elem;
-	u8 *obj;
+	void *obj;
 
 	if (atomic_inc_return(&pool->num_elem) > pool->max_elem)
 		goto out_cnt;
@@ -334,9 +335,10 @@ void *rxe_alloc_locked(struct rxe_pool *pool)
 	if (!obj)
 		goto out_cnt;
 
-	elem = (struct rxe_pool_entry *)(obj + pool->elem_offset);
+	elem = (struct rxe_pool_entry *)((u8 *)obj + pool->elem_offset);
 
 	elem->pool = pool;
+	elem->obj = obj;
 	kref_init(&elem->ref_cnt);
 
 	return obj;
@@ -350,7 +352,7 @@ void *rxe_alloc(struct rxe_pool *pool)
 {
 	struct rxe_pool_entry *elem;
 	unsigned long flags;
-	u8 *obj;
+	void *obj;
 
 	write_lock_irqsave(&pool->pool_lock, flags);
 	if (atomic_inc_return(&pool->num_elem) > pool->max_elem)
@@ -364,8 +366,9 @@ void *rxe_alloc(struct rxe_pool *pool)
 	}
 
 	write_lock_irqsave(&pool->pool_lock, flags);
-	elem = (struct rxe_pool_entry *)(obj + pool->elem_offset);
+	elem = (struct rxe_pool_entry *)((u8 *)obj + pool->elem_offset);
 	elem->pool = pool;
+	elem->obj = obj;
 	kref_init(&elem->ref_cnt);
 	write_unlock_irqrestore(&pool->pool_lock, flags);
 
@@ -386,6 +389,7 @@ int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_entry *elem)
 		goto out_cnt;
 
 	elem->pool = pool;
+	elem->obj = (u8 *)elem - pool->elem_offset;
 	kref_init(&elem->ref_cnt);
 	write_unlock_irqrestore(&pool->pool_lock, flags);
 
@@ -402,13 +406,13 @@ void rxe_elem_release(struct kref *kref)
 	struct rxe_pool_entry *elem =
 		container_of(kref, struct rxe_pool_entry, ref_cnt);
 	struct rxe_pool *pool = elem->pool;
-	u8 *obj;
+	void *obj;
 
 	if (pool->cleanup)
 		pool->cleanup(elem);
 
 	if (!(pool->flags & RXE_POOL_NO_ALLOC)) {
-		obj = (u8 *)elem - pool->elem_offset;
+		obj = elem->obj;
 		kfree(obj);
 	}
 
@@ -419,7 +423,7 @@ void *rxe_pool_get_index_locked(struct rxe_pool *pool, u32 index)
 {
 	struct rb_node *node;
 	struct rxe_pool_entry *elem;
-	u8 *obj;
+	void *obj;
 
 	node = pool->index.tree.rb_node;
 
@@ -436,7 +440,7 @@ void *rxe_pool_get_index_locked(struct rxe_pool *pool, u32 index)
 
 	if (node) {
 		kref_get(&elem->ref_cnt);
-		obj = (u8 *)elem - pool->elem_offset;
+		obj = elem->obj;
 	} else {
 		obj = NULL;
 	}
@@ -446,7 +450,7 @@ void *rxe_pool_get_index_locked(struct rxe_pool *pool, u32 index)
 
 void *rxe_pool_get_index(struct rxe_pool *pool, u32 index)
 {
-	u8 *obj;
+	void *obj;
 	unsigned long flags;
 
 	read_lock_irqsave(&pool->pool_lock, flags);
@@ -460,7 +464,7 @@ void *rxe_pool_get_key_locked(struct rxe_pool *pool, void *key)
 {
 	struct rb_node *node;
 	struct rxe_pool_entry *elem;
-	u8 *obj;
+	void *obj;
 	int cmp;
 
 	node = pool->key.tree.rb_node;
@@ -481,7 +485,7 @@ void *rxe_pool_get_key_locked(struct rxe_pool *pool, void *key)
 
 	if (node) {
 		kref_get(&elem->ref_cnt);
-		obj = (u8 *)elem - pool->elem_offset;
+		obj = elem->obj;
 	} else {
 		obj = NULL;
 	}
@@ -491,7 +495,7 @@ void *rxe_pool_get_key_locked(struct rxe_pool *pool, void *key)
 
 void *rxe_pool_get_key(struct rxe_pool *pool, void *key)
 {
-	u8 *obj;
+	void *obj;
 	unsigned long flags;
 
 	read_lock_irqsave(&pool->pool_lock, flags);
diff --git a/drivers/infiniband/sw/rxe/rxe_pool.h b/drivers/infiniband/sw/rxe/rxe_pool.h
index fb10e0098415..e9bda4b14f86 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.h
+++ b/drivers/infiniband/sw/rxe/rxe_pool.h
@@ -46,6 +46,7 @@ struct rxe_type_info {
 
 struct rxe_pool_entry {
 	struct rxe_pool		*pool;
+	void			*obj;
 	struct kref		ref_cnt;
 	struct list_head	list;
 
-- 
2.30.2



* [PATCH for-next v3 04/10] RDMA/rxe: Combine rxe_add_index with rxe_alloc
  2021-10-22 19:18 [PATCH for-next v3 00/10] Correct race conditions in rdma_rxe Bob Pearson
                   ` (2 preceding siblings ...)
  2021-10-22 19:18 ` [PATCH for-next v3 03/10] RDMA/rxe: Save object pointer in pool element Bob Pearson
@ 2021-10-22 19:18 ` Bob Pearson
  2021-10-22 19:18 ` [PATCH for-next v3 05/10] RDMA/rxe: Combine rxe_add_key " Bob Pearson
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 14+ messages in thread
From: Bob Pearson @ 2021-10-22 19:18 UTC
  To: jgg, zyjzyj2000, linux-rdma; +Cc: Bob Pearson

Currently rxe objects which have an index require adding and dropping
the indices in an API call separate from allocating and freeing the
object. These two steps are always performed together, so this patch
combines them into a single operation.

By taking a single pool lock around allocating the object and adding
its index metadata, and likewise around dropping the index metadata
and releasing the object, the possibility of a race condition where
the metadata is inconsistent with the state of the object is removed.
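
From a caller's point of view the change looks roughly like this (a
sketch; the MR call site is illustrative):

    /* before: the object briefly exists without its index */
    mr = rxe_alloc(&rxe->mr_pool);
    rxe_add_index(mr);

    /* after: the index is added inside the same locked alloc,
     * and dropped again when the last reference is released
     */
    mr = rxe_alloc(&rxe->mr_pool);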

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_mr.c    |  1 -
 drivers/infiniband/sw/rxe/rxe_mw.c    |  5 +-
 drivers/infiniband/sw/rxe/rxe_pool.c  | 67 ++++++++++++++++-----------
 drivers/infiniband/sw/rxe/rxe_pool.h  | 22 ---------
 drivers/infiniband/sw/rxe/rxe_verbs.c | 13 ------
 5 files changed, 41 insertions(+), 67 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index 53271df10e47..6e71f67ccfe9 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -693,7 +693,6 @@ int rxe_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata)
 
 	mr->state = RXE_MR_STATE_INVALID;
 	rxe_drop_ref(mr_pd(mr));
-	rxe_drop_index(mr);
 	rxe_drop_ref(mr);
 
 	return 0;
diff --git a/drivers/infiniband/sw/rxe/rxe_mw.c b/drivers/infiniband/sw/rxe/rxe_mw.c
index 9534a7fe1a98..854d0c283521 100644
--- a/drivers/infiniband/sw/rxe/rxe_mw.c
+++ b/drivers/infiniband/sw/rxe/rxe_mw.c
@@ -20,7 +20,6 @@ int rxe_alloc_mw(struct ib_mw *ibmw, struct ib_udata *udata)
 		return ret;
 	}
 
-	rxe_add_index(mw);
 	mw->rkey = ibmw->rkey = (mw->pelem.index << 8) | rxe_get_next_key(-1);
 	mw->state = (mw->ibmw.type == IB_MW_TYPE_2) ?
 			RXE_MW_STATE_FREE : RXE_MW_STATE_VALID;
@@ -335,7 +334,5 @@ struct rxe_mw *rxe_lookup_mw(struct rxe_qp *qp, int access, u32 rkey)
 
 void rxe_mw_cleanup(struct rxe_pool_entry *elem)
 {
-	struct rxe_mw *mw = container_of(elem, typeof(*mw), pelem);
-
-	rxe_drop_index(mw);
+	/* nothing to do currently */
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
index 2655bd372f59..bfed7bb48cd1 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.c
+++ b/drivers/infiniband/sw/rxe/rxe_pool.c
@@ -166,12 +166,16 @@ void rxe_pool_cleanup(struct rxe_pool *pool)
 	kfree(pool->index.table);
 }
 
+/* should never fail because there are at least as many indices as
+ * max objects
+ */
 static u32 alloc_index(struct rxe_pool *pool)
 {
 	u32 index;
 	u32 range = pool->index.max_index - pool->index.min_index + 1;
 
-	index = find_next_zero_bit(pool->index.table, range, pool->index.last);
+	index = find_next_zero_bit(pool->index.table, range,
+				   pool->index.last);
 	if (index >= range)
 		index = find_first_zero_bit(pool->index.table, range);
 
@@ -192,7 +196,8 @@ static int rxe_insert_index(struct rxe_pool *pool, struct rxe_pool_entry *new)
 		elem = rb_entry(parent, struct rxe_pool_entry, index_node);
 
 		if (elem->index == new->index) {
-			pr_warn("element already exists!\n");
+			pr_warn("element with index = 0x%x already exists!\n",
+					new->index);
 			return -EINVAL;
 		}
 
@@ -281,31 +286,21 @@ void __rxe_drop_key(struct rxe_pool_entry *elem)
 	write_unlock_irqrestore(&pool->pool_lock, flags);
 }
 
-int __rxe_add_index_locked(struct rxe_pool_entry *elem)
+static int rxe_add_index(struct rxe_pool_entry *elem)
 {
 	struct rxe_pool *pool = elem->pool;
 	int err;
 
 	elem->index = alloc_index(pool);
 	err = rxe_insert_index(pool, elem);
+	if (err)
+		clear_bit(elem->index - pool->index.min_index,
+			  pool->index.table);
 
 	return err;
 }
 
-int __rxe_add_index(struct rxe_pool_entry *elem)
-{
-	struct rxe_pool *pool = elem->pool;
-	unsigned long flags;
-	int err;
-
-	write_lock_irqsave(&pool->pool_lock, flags);
-	err = __rxe_add_index_locked(elem);
-	write_unlock_irqrestore(&pool->pool_lock, flags);
-
-	return err;
-}
-
-void __rxe_drop_index_locked(struct rxe_pool_entry *elem)
+static void rxe_drop_index(struct rxe_pool_entry *elem)
 {
 	struct rxe_pool *pool = elem->pool;
 
@@ -313,20 +308,11 @@ void __rxe_drop_index_locked(struct rxe_pool_entry *elem)
 	rb_erase(&elem->index_node, &pool->index.tree);
 }
 
-void __rxe_drop_index(struct rxe_pool_entry *elem)
-{
-	struct rxe_pool *pool = elem->pool;
-	unsigned long flags;
-
-	write_lock_irqsave(&pool->pool_lock, flags);
-	__rxe_drop_index_locked(elem);
-	write_unlock_irqrestore(&pool->pool_lock, flags);
-}
-
 void *rxe_alloc_locked(struct rxe_pool *pool)
 {
 	struct rxe_pool_entry *elem;
 	void *obj;
+	int err;
 
 	if (atomic_inc_return(&pool->num_elem) > pool->max_elem)
 		goto out_cnt;
@@ -341,6 +327,14 @@ void *rxe_alloc_locked(struct rxe_pool *pool)
 	elem->obj = obj;
 	kref_init(&elem->ref_cnt);
 
+	if (pool->flags & RXE_POOL_INDEX) {
+		err = rxe_add_index(elem);
+		if (err) {
+			kfree(elem);
+			goto out_cnt;
+		}
+	}
+
 	return obj;
 
 out_cnt:
@@ -353,6 +347,7 @@ void *rxe_alloc(struct rxe_pool *pool)
 	struct rxe_pool_entry *elem;
 	unsigned long flags;
 	void *obj;
+	int err;
 
 	write_lock_irqsave(&pool->pool_lock, flags);
 	if (atomic_inc_return(&pool->num_elem) > pool->max_elem)
@@ -370,6 +365,14 @@ void *rxe_alloc(struct rxe_pool *pool)
 	elem->pool = pool;
 	elem->obj = obj;
 	kref_init(&elem->ref_cnt);
+
+	if (pool->flags & RXE_POOL_INDEX) {
+		err = rxe_add_index(elem);
+		if (err) {
+			kfree(elem);
+			goto out_cnt;
+		}
+	}
 	write_unlock_irqrestore(&pool->pool_lock, flags);
 
 	return obj;
@@ -383,6 +386,7 @@ void *rxe_alloc(struct rxe_pool *pool)
 int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_entry *elem)
 {
 	unsigned long flags;
+	int err;
 
 	write_lock_irqsave(&pool->pool_lock, flags);
 	if (atomic_inc_return(&pool->num_elem) > pool->max_elem)
@@ -391,6 +395,12 @@ int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_entry *elem)
 	elem->pool = pool;
 	elem->obj = (u8 *)elem - pool->elem_offset;
 	kref_init(&elem->ref_cnt);
+
+	if (pool->flags & RXE_POOL_INDEX) {
+		err = rxe_add_index(elem);
+		if (err)
+			goto out_cnt;
+	}
 	write_unlock_irqrestore(&pool->pool_lock, flags);
 
 	return 0;
@@ -411,6 +421,9 @@ void rxe_elem_release(struct kref *kref)
 	if (pool->cleanup)
 		pool->cleanup(elem);
 
+	if (pool->flags & RXE_POOL_INDEX)
+		rxe_drop_index(elem);
+
 	if (!(pool->flags & RXE_POOL_NO_ALLOC)) {
 		obj = elem->obj;
 		kfree(obj);
diff --git a/drivers/infiniband/sw/rxe/rxe_pool.h b/drivers/infiniband/sw/rxe/rxe_pool.h
index e9bda4b14f86..f76addf87b4a 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.h
+++ b/drivers/infiniband/sw/rxe/rxe_pool.h
@@ -109,28 +109,6 @@ int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_entry *elem);
 
 #define rxe_add_to_pool(pool, obj) __rxe_add_to_pool(pool, &(obj)->pelem)
 
-/* assign an index to an indexed object and insert object into
- *  pool's rb tree holding and not holding the pool_lock
- */
-int __rxe_add_index_locked(struct rxe_pool_entry *elem);
-
-#define rxe_add_index_locked(obj) __rxe_add_index_locked(&(obj)->pelem)
-
-int __rxe_add_index(struct rxe_pool_entry *elem);
-
-#define rxe_add_index(obj) __rxe_add_index(&(obj)->pelem)
-
-/* drop an index and remove object from rb tree
- * holding and not holding the pool_lock
- */
-void __rxe_drop_index_locked(struct rxe_pool_entry *elem);
-
-#define rxe_drop_index_locked(obj) __rxe_drop_index_locked(&(obj)->pelem)
-
-void __rxe_drop_index(struct rxe_pool_entry *elem);
-
-#define rxe_drop_index(obj) __rxe_drop_index(&(obj)->pelem)
-
 /* assign a key to a keyed object and insert object into
  * pool's rb tree holding and not holding pool_lock
  */
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.c b/drivers/infiniband/sw/rxe/rxe_verbs.c
index 0aa0d7e52773..84ea03bc6a26 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.c
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.c
@@ -181,7 +181,6 @@ static int rxe_create_ah(struct ib_ah *ibah,
 		return err;
 
 	/* create index > 0 */
-	rxe_add_index(ah);
 	ah->ah_num = ah->pelem.index;
 
 	if (uresp) {
@@ -189,7 +188,6 @@ static int rxe_create_ah(struct ib_ah *ibah,
 		err = copy_to_user(&uresp->ah_num, &ah->ah_num,
 					 sizeof(uresp->ah_num));
 		if (err) {
-			rxe_drop_index(ah);
 			rxe_drop_ref(ah);
 			return -EFAULT;
 		}
@@ -230,7 +228,6 @@ static int rxe_destroy_ah(struct ib_ah *ibah, u32 flags)
 {
 	struct rxe_ah *ah = to_rah(ibah);
 
-	rxe_drop_index(ah);
 	rxe_drop_ref(ah);
 	return 0;
 }
@@ -438,7 +435,6 @@ static int rxe_create_qp(struct ib_qp *ibqp, struct ib_qp_init_attr *init,
 	if (err)
 		return err;
 
-	rxe_add_index(qp);
 	err = rxe_qp_from_init(rxe, qp, pd, init, uresp, ibqp->pd, udata);
 	if (err)
 		goto qp_init;
@@ -446,7 +442,6 @@ static int rxe_create_qp(struct ib_qp *ibqp, struct ib_qp_init_attr *init,
 	return 0;
 
 qp_init:
-	rxe_drop_index(qp);
 	rxe_drop_ref(qp);
 	return err;
 }
@@ -491,7 +486,6 @@ static int rxe_destroy_qp(struct ib_qp *ibqp, struct ib_udata *udata)
 	struct rxe_qp *qp = to_rqp(ibqp);
 
 	rxe_qp_destroy(qp);
-	rxe_drop_index(qp);
 	rxe_drop_ref(qp);
 	return 0;
 }
@@ -898,7 +892,6 @@ static struct ib_mr *rxe_get_dma_mr(struct ib_pd *ibpd, int access)
 	if (!mr)
 		return ERR_PTR(-ENOMEM);
 
-	rxe_add_index(mr);
 	rxe_add_ref(pd);
 	rxe_mr_init_dma(pd, access, mr);
 
@@ -922,8 +915,6 @@ static struct ib_mr *rxe_reg_user_mr(struct ib_pd *ibpd,
 		goto err2;
 	}
 
-	rxe_add_index(mr);
-
 	rxe_add_ref(pd);
 
 	err = rxe_mr_init_user(pd, start, length, iova, access, mr);
@@ -934,7 +925,6 @@ static struct ib_mr *rxe_reg_user_mr(struct ib_pd *ibpd,
 
 err3:
 	rxe_drop_ref(pd);
-	rxe_drop_index(mr);
 	rxe_drop_ref(mr);
 err2:
 	return ERR_PTR(err);
@@ -957,8 +947,6 @@ static struct ib_mr *rxe_alloc_mr(struct ib_pd *ibpd, enum ib_mr_type mr_type,
 		goto err1;
 	}
 
-	rxe_add_index(mr);
-
 	rxe_add_ref(pd);
 
 	err = rxe_mr_init_fast(pd, max_num_sg, mr);
@@ -969,7 +957,6 @@ static struct ib_mr *rxe_alloc_mr(struct ib_pd *ibpd, enum ib_mr_type mr_type,
 
 err2:
 	rxe_drop_ref(pd);
-	rxe_drop_index(mr);
 	rxe_drop_ref(mr);
 err1:
 	return ERR_PTR(err);
-- 
2.30.2



* [PATCH for-next v3 05/10] RDMA/rxe: Combine rxe_add_key with rxe_alloc
  2021-10-22 19:18 [PATCH for-next v3 00/10] Correct race conditions in rdma_rxe Bob Pearson
                   ` (3 preceding siblings ...)
  2021-10-22 19:18 ` [PATCH for-next v3 04/10] RDMA/rxe: Combine rxe_add_index with rxe_alloc Bob Pearson
@ 2021-10-22 19:18 ` Bob Pearson
  2021-10-22 19:18 ` [PATCH for-next v3 06/10] RDMA/rxe: Separate out last rxe_drop_ref Bob Pearson
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 14+ messages in thread
From: Bob Pearson @ 2021-10-22 19:18 UTC
  To: jgg, zyjzyj2000, linux-rdma; +Cc: Bob Pearson

Currently adding a key to an rxe object and dropping it again require
API calls separate from allocating and freeing the object, but the two
are always performed together. This patch combines them into single
APIs, which requires adding new rxe_alloc_with_key(_locked) calls.

By combining allocating an object and adding its key metadata inside a
single locked sequence, and dropping the key metadata when the object
is released, the possibility of a race condition where the object
state and key metadata state are inconsistent is removed.
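
For example, the multicast group path in the patch below becomes
(sketch):

    /* before: the group is briefly visible without its key */
    grp = rxe_alloc_locked(&rxe->mc_grp_pool);
    rxe_add_key_locked(grp, mgid);

    /* after: allocation and key insertion in one locked call */
    grp = rxe_alloc_with_key_locked(&rxe->mc_grp_pool, mgid);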

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_mcast.c |  5 +-
 drivers/infiniband/sw/rxe/rxe_pool.c  | 81 +++++++++++++--------------
 drivers/infiniband/sw/rxe/rxe_pool.h  | 24 ++------
 3 files changed, 45 insertions(+), 65 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_mcast.c b/drivers/infiniband/sw/rxe/rxe_mcast.c
index 1c1d1b53312d..337dc2c68051 100644
--- a/drivers/infiniband/sw/rxe/rxe_mcast.c
+++ b/drivers/infiniband/sw/rxe/rxe_mcast.c
@@ -15,18 +15,16 @@ static struct rxe_mc_grp *create_grp(struct rxe_dev *rxe,
 	int err;
 	struct rxe_mc_grp *grp;
 
-	grp = rxe_alloc_locked(&rxe->mc_grp_pool);
+	grp = rxe_alloc_with_key_locked(&rxe->mc_grp_pool, mgid);
 	if (!grp)
 		return ERR_PTR(-ENOMEM);
 
 	INIT_LIST_HEAD(&grp->qp_list);
 	spin_lock_init(&grp->mcg_lock);
 	grp->rxe = rxe;
-	rxe_add_key_locked(grp, mgid);
 
 	err = rxe_mcast_add(rxe, mgid);
 	if (unlikely(err)) {
-		rxe_drop_key_locked(grp);
 		rxe_drop_ref(grp);
 		return ERR_PTR(err);
 	}
@@ -174,6 +172,5 @@ void rxe_mc_cleanup(struct rxe_pool_entry *arg)
 	struct rxe_mc_grp *grp = container_of(arg, typeof(*grp), pelem);
 	struct rxe_dev *rxe = grp->rxe;
 
-	rxe_drop_key(grp);
 	rxe_mcast_delete(rxe, &grp->mgid);
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
index bfed7bb48cd1..7fd5543ef1ae 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.c
+++ b/drivers/infiniband/sw/rxe/rxe_pool.c
@@ -245,47 +245,6 @@ static int rxe_insert_key(struct rxe_pool *pool, struct rxe_pool_entry *new)
 	return 0;
 }
 
-int __rxe_add_key_locked(struct rxe_pool_entry *elem, void *key)
-{
-	struct rxe_pool *pool = elem->pool;
-	int err;
-
-	memcpy((u8 *)elem + pool->key.key_offset, key, pool->key.key_size);
-	err = rxe_insert_key(pool, elem);
-
-	return err;
-}
-
-int __rxe_add_key(struct rxe_pool_entry *elem, void *key)
-{
-	struct rxe_pool *pool = elem->pool;
-	unsigned long flags;
-	int err;
-
-	write_lock_irqsave(&pool->pool_lock, flags);
-	err = __rxe_add_key_locked(elem, key);
-	write_unlock_irqrestore(&pool->pool_lock, flags);
-
-	return err;
-}
-
-void __rxe_drop_key_locked(struct rxe_pool_entry *elem)
-{
-	struct rxe_pool *pool = elem->pool;
-
-	rb_erase(&elem->key_node, &pool->key.tree);
-}
-
-void __rxe_drop_key(struct rxe_pool_entry *elem)
-{
-	struct rxe_pool *pool = elem->pool;
-	unsigned long flags;
-
-	write_lock_irqsave(&pool->pool_lock, flags);
-	__rxe_drop_key_locked(elem);
-	write_unlock_irqrestore(&pool->pool_lock, flags);
-}
-
 static int rxe_add_index(struct rxe_pool_entry *elem)
 {
 	struct rxe_pool *pool = elem->pool;
@@ -342,6 +301,31 @@ void *rxe_alloc_locked(struct rxe_pool *pool)
 	return NULL;
 }
 
+void *rxe_alloc_with_key_locked(struct rxe_pool *pool, void *key)
+{
+	struct rxe_pool_entry *elem;
+	u8 *obj;
+	int err;
+
+	obj = rxe_alloc_locked(pool);
+	if (!obj)
+		return NULL;
+
+	elem = (struct rxe_pool_entry *)(obj + pool->elem_offset);
+	memcpy((u8 *)elem + pool->key.key_offset, key, pool->key.key_size);
+	err = rxe_insert_key(pool, elem);
+	if (err) {
+		kfree(obj);
+		goto out_cnt;
+	}
+
+	return obj;
+
+out_cnt:
+	atomic_dec(&pool->num_elem);
+	return NULL;
+}
+
 void *rxe_alloc(struct rxe_pool *pool)
 {
 	struct rxe_pool_entry *elem;
@@ -383,6 +367,18 @@ void *rxe_alloc(struct rxe_pool *pool)
 	return NULL;
 }
 
+void *rxe_alloc_with_key(struct rxe_pool *pool, void *key)
+{
+	unsigned long flags;
+	void *obj;
+
+	write_lock_irqsave(&pool->pool_lock, flags);
+	obj = rxe_alloc_with_key_locked(pool, key);
+	write_unlock_irqrestore(&pool->pool_lock, flags);
+
+	return obj;
+}
+
 int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_entry *elem)
 {
 	unsigned long flags;
@@ -424,6 +420,9 @@ void rxe_elem_release(struct kref *kref)
 	if (pool->flags & RXE_POOL_INDEX)
 		rxe_drop_index(elem);
 
+	if (pool->flags & RXE_POOL_KEY)
+		rb_erase(&elem->key_node, &pool->key.tree);
+
 	if (!(pool->flags & RXE_POOL_NO_ALLOC)) {
 		obj = elem->obj;
 		kfree(obj);
diff --git a/drivers/infiniband/sw/rxe/rxe_pool.h b/drivers/infiniband/sw/rxe/rxe_pool.h
index f76addf87b4a..e0242d488cc8 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.h
+++ b/drivers/infiniband/sw/rxe/rxe_pool.h
@@ -104,31 +104,15 @@ void *rxe_alloc_locked(struct rxe_pool *pool);
 
 void *rxe_alloc(struct rxe_pool *pool);
 
+void *rxe_alloc_with_key_locked(struct rxe_pool *pool, void *key);
+
+void *rxe_alloc_with_key(struct rxe_pool *pool, void *key);
+
 /* connect already allocated object to pool */
 int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_entry *elem);
 
 #define rxe_add_to_pool(pool, obj) __rxe_add_to_pool(pool, &(obj)->pelem)
 
-/* assign a key to a keyed object and insert object into
- * pool's rb tree holding and not holding pool_lock
- */
-int __rxe_add_key_locked(struct rxe_pool_entry *elem, void *key);
-
-#define rxe_add_key_locked(obj, key) __rxe_add_key_locked(&(obj)->pelem, key)
-
-int __rxe_add_key(struct rxe_pool_entry *elem, void *key);
-
-#define rxe_add_key(obj, key) __rxe_add_key(&(obj)->pelem, key)
-
-/* remove elem from rb tree holding and not holding the pool_lock */
-void __rxe_drop_key_locked(struct rxe_pool_entry *elem);
-
-#define rxe_drop_key_locked(obj) __rxe_drop_key_locked(&(obj)->pelem)
-
-void __rxe_drop_key(struct rxe_pool_entry *elem);
-
-#define rxe_drop_key(obj) __rxe_drop_key(&(obj)->pelem)
-
 /* lookup an indexed object from index holding and not holding the pool_lock.
  * takes a reference on object
  */
-- 
2.30.2



* [PATCH for-next v3 06/10] RDMA/rxe: Separate out last rxe_drop_ref
  2021-10-22 19:18 [PATCH for-next v3 00/10] Correct race conditions in rdma_rxe Bob Pearson
                   ` (4 preceding siblings ...)
  2021-10-22 19:18 ` [PATCH for-next v3 05/10] RDMA/rxe: Combine rxe_add_key " Bob Pearson
@ 2021-10-22 19:18 ` Bob Pearson
  2021-10-22 19:18 ` [PATCH for-next v3 07/10] RDMA/rxe: Rewrite rxe_mcast.c Bob Pearson
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 14+ messages in thread
From: Bob Pearson @ 2021-10-22 19:18 UTC
  To: jgg, zyjzyj2000, linux-rdma; +Cc: Bob Pearson

Currently rxe_drop_ref() can be called from a destroy or deallocate
verb or by someone releasing a reference that protects a pointer. This
leads to the possibility that an object will be destroyed after
rdma-core thinks it already has been. This was the original intent of
using krefs, but it can cause other problems.

This patch separates out the last rxe_drop_ref() as rxe_fini_ref(),
which can return an error if the object is not ready to be destroyed,
and changes rxe_drop_ref() to return an error if the object has
already been destroyed. Now the destroy verbs will return -EBUSY if
the object cannot be destroyed, and other add/drop references will
return -EINVAL if the object has already been, or would be, destroyed.
Correct programs should not normally see these values.

Locking is added so that cleanup is executed atomically. Some drop
references have their order changed so that references are dropped
after the pointer they protect has been successfully removed.

This change requires that rxe_qp_destroy() be moved from the object
cleanup routine to rxe_destroy_qp(), the main verb API for destroying
the QP, so that the first stage of cleanup can run while the QP
reference count is still positive.

This patch exposes some referencing errors, which are addressed in the
following patches.
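
The core of rxe_fini_ref() is refcount_dec_if_one(), which succeeds
only for the holder of the last reference. A minimal sketch of the
idea (simplified from __rxe_fini() in the diff below):

    static int fini_sketch(struct rxe_pool_entry *elem)
    {
            struct rxe_pool *pool = elem->pool;

            /* fails, leaving the count untouched, unless we
             * hold the only remaining reference
             */
            if (!refcount_dec_if_one(&elem->refcnt))
                    return -EBUSY;

            /* the count is now zero, so refcount_inc_not_zero()
             * in the lookup paths cannot hand out new references
             */
            if (pool->cleanup)
                    pool->cleanup(elem);
            if (!(pool->flags & RXE_POOL_NO_ALLOC))
                    kfree(elem->obj);
            atomic_dec(&pool->num_elem);
            return 0;
    }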

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe.h       |   1 +
 drivers/infiniband/sw/rxe/rxe_av.c    |   1 +
 drivers/infiniband/sw/rxe/rxe_cq.c    |   9 +-
 drivers/infiniband/sw/rxe/rxe_loc.h   |   3 +-
 drivers/infiniband/sw/rxe/rxe_mr.c    |  10 +-
 drivers/infiniband/sw/rxe/rxe_mw.c    |  23 +--
 drivers/infiniband/sw/rxe/rxe_pool.c  | 233 +++++++++++++++++++-------
 drivers/infiniband/sw/rxe/rxe_pool.h  |  32 +++-
 drivers/infiniband/sw/rxe/rxe_srq.c   |   8 +
 drivers/infiniband/sw/rxe/rxe_verbs.c |  42 +++--
 10 files changed, 253 insertions(+), 109 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe.h b/drivers/infiniband/sw/rxe/rxe.h
index 1bb3fb618bf5..adecf146d999 100644
--- a/drivers/infiniband/sw/rxe/rxe.h
+++ b/drivers/infiniband/sw/rxe/rxe.h
@@ -14,6 +14,7 @@
 
 #include <linux/module.h>
 #include <linux/skbuff.h>
+#include <linux/refcount.h>
 
 #include <rdma/ib_verbs.h>
 #include <rdma/ib_user_verbs.h>
diff --git a/drivers/infiniband/sw/rxe/rxe_av.c b/drivers/infiniband/sw/rxe/rxe_av.c
index 38c7b6fb39d7..5084841a7cd0 100644
--- a/drivers/infiniband/sw/rxe/rxe_av.c
+++ b/drivers/infiniband/sw/rxe/rxe_av.c
@@ -117,6 +117,7 @@ struct rxe_av *rxe_get_av(struct rxe_pkt_info *pkt)
 	if (ah_num) {
 		/* only new user provider or kernel client */
 		ah = rxe_pool_get_index(&pkt->rxe->ah_pool, ah_num);
+		rxe_drop_ref(ah);
 		if (!ah || ah->ah_num != ah_num || rxe_ah_pd(ah) != pkt->qp->pd) {
 			pr_warn("Unable to find AH matching ah_num\n");
 			return NULL;
diff --git a/drivers/infiniband/sw/rxe/rxe_cq.c b/drivers/infiniband/sw/rxe/rxe_cq.c
index 6848426c074f..0c05d612ae63 100644
--- a/drivers/infiniband/sw/rxe/rxe_cq.c
+++ b/drivers/infiniband/sw/rxe/rxe_cq.c
@@ -141,18 +141,15 @@ int rxe_cq_post(struct rxe_cq *cq, struct rxe_cqe *cqe, int solicited)
 	return 0;
 }
 
-void rxe_cq_disable(struct rxe_cq *cq)
+void rxe_cq_cleanup(struct rxe_pool_entry *arg)
 {
+	struct rxe_cq *cq = container_of(arg, typeof(*cq), pelem);
 	unsigned long flags;
 
+	/* TODO get rid of this */
 	spin_lock_irqsave(&cq->cq_lock, flags);
 	cq->is_dying = true;
 	spin_unlock_irqrestore(&cq->cq_lock, flags);
-}
-
-void rxe_cq_cleanup(struct rxe_pool_entry *arg)
-{
-	struct rxe_cq *cq = container_of(arg, typeof(*cq), pelem);
 
 	if (cq->queue)
 		rxe_queue_cleanup(cq->queue);
diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
index 1ca43b859d80..a25d1c9f6adb 100644
--- a/drivers/infiniband/sw/rxe/rxe_loc.h
+++ b/drivers/infiniband/sw/rxe/rxe_loc.h
@@ -35,8 +35,6 @@ int rxe_cq_resize_queue(struct rxe_cq *cq, int new_cqe,
 
 int rxe_cq_post(struct rxe_cq *cq, struct rxe_cqe *cqe, int solicited);
 
-void rxe_cq_disable(struct rxe_cq *cq);
-
 void rxe_cq_cleanup(struct rxe_pool_entry *arg);
 
 /* rxe_mcast.c */
@@ -187,6 +185,7 @@ int rxe_srq_from_init(struct rxe_dev *rxe, struct rxe_srq *srq,
 int rxe_srq_from_attr(struct rxe_dev *rxe, struct rxe_srq *srq,
 		      struct ib_srq_attr *attr, enum ib_srq_attr_mask mask,
 		      struct rxe_modify_srq_cmd *ucmd, struct ib_udata *udata);
+void rxe_srq_cleanup(struct rxe_pool_entry *arg);
 
 void rxe_dealloc(struct ib_device *ib_dev);
 
diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index 6e71f67ccfe9..6c50c8562fd8 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -684,6 +684,8 @@ int rxe_mr_set_page(struct ib_mr *ibmr, u64 addr)
 int rxe_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata)
 {
 	struct rxe_mr *mr = to_rmr(ibmr);
+	struct rxe_pd *pd = mr_pd(mr);
+	int err;
 
 	if (atomic_read(&mr->num_mw) > 0) {
 		pr_warn("%s: Attempt to deregister an MR while bound to MWs\n",
@@ -692,8 +694,12 @@ int rxe_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata)
 	}
 
 	mr->state = RXE_MR_STATE_INVALID;
-	rxe_drop_ref(mr_pd(mr));
-	rxe_drop_ref(mr);
+
+	err = rxe_fini_ref(mr);
+	if (err)
+		return err;
+
+	rxe_drop_ref(pd);
 
 	return 0;
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_mw.c b/drivers/infiniband/sw/rxe/rxe_mw.c
index 854d0c283521..599699f93332 100644
--- a/drivers/infiniband/sw/rxe/rxe_mw.c
+++ b/drivers/infiniband/sw/rxe/rxe_mw.c
@@ -56,12 +56,15 @@ int rxe_dealloc_mw(struct ib_mw *ibmw)
 	struct rxe_mw *mw = to_rmw(ibmw);
 	struct rxe_pd *pd = to_rpd(ibmw->pd);
 	unsigned long flags;
+	int err;
 
 	spin_lock_irqsave(&mw->lock, flags);
 	rxe_do_dealloc_mw(mw);
 	spin_unlock_irqrestore(&mw->lock, flags);
 
-	rxe_drop_ref(mw);
+	err = rxe_fini_ref(mw);
+	if (err)
+		return err;
 	rxe_drop_ref(pd);
 
 	return 0;
@@ -177,9 +180,9 @@ static void rxe_do_bind_mw(struct rxe_qp *qp, struct rxe_send_wqe *wqe,
 	}
 
 	if (mw->length) {
+		/* take over ref on mr from caller */
 		mw->mr = mr;
 		atomic_inc(&mr->num_mw);
-		rxe_add_ref(mr);
 	}
 
 	if (mw->ibmw.type == IB_MW_TYPE_2) {
@@ -192,7 +195,7 @@ int rxe_bind_mw(struct rxe_qp *qp, struct rxe_send_wqe *wqe)
 {
 	int ret;
 	struct rxe_mw *mw;
-	struct rxe_mr *mr;
+	struct rxe_mr *mr = NULL;
 	struct rxe_dev *rxe = to_rdev(qp->ibqp.device);
 	u32 mw_rkey = wqe->wr.wr.mw.mw_rkey;
 	u32 mr_lkey = wqe->wr.wr.mw.mr_lkey;
@@ -217,25 +220,23 @@ int rxe_bind_mw(struct rxe_qp *qp, struct rxe_send_wqe *wqe)
 		}
 
 		if (unlikely(mr->lkey != mr_lkey)) {
+			rxe_drop_ref(mr);
 			ret = -EINVAL;
-			goto err_drop_mr;
+			goto err_drop_mw;
 		}
-	} else {
-		mr = NULL;
 	}
 
 	spin_lock_irqsave(&mw->lock, flags);
-
 	ret = rxe_check_bind_mw(qp, wqe, mw, mr);
-	if (ret)
+	if (ret) {
+		if (mr)
+			rxe_drop_ref(mr);
 		goto err_unlock;
+	}
 
 	rxe_do_bind_mw(qp, wqe, mw, mr);
 err_unlock:
 	spin_unlock_irqrestore(&mw->lock, flags);
-err_drop_mr:
-	if (mr)
-		rxe_drop_ref(mr);
 err_drop_mw:
 	rxe_drop_ref(mw);
 err:
diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
index 7fd5543ef1ae..4be43ae58219 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.c
+++ b/drivers/infiniband/sw/rxe/rxe_pool.c
@@ -33,6 +33,7 @@ static const struct rxe_type_info rxe_type_info[RXE_NUM_TYPES] = {
 		.name		= "rxe-srq",
 		.size		= sizeof(struct rxe_srq),
 		.elem_offset	= offsetof(struct rxe_srq, pelem),
+		.cleanup	= rxe_srq_cleanup,
 		.flags		= RXE_POOL_INDEX | RXE_POOL_NO_ALLOC,
 		.min_index	= RXE_MIN_SRQ_INDEX,
 		.max_index	= RXE_MAX_SRQ_INDEX,
@@ -195,9 +196,13 @@ static int rxe_insert_index(struct rxe_pool *pool, struct rxe_pool_entry *new)
 		parent = *link;
 		elem = rb_entry(parent, struct rxe_pool_entry, index_node);
 
-		if (elem->index == new->index) {
-			pr_warn("element with index = 0x%x already exists!\n",
-					new->index);
+		/* this can happen if memory was recycled and/or the
+		 * old object was not deleted from the pool index
+		 */
+		if (unlikely(elem == new || elem->index == new->index)) {
+			pr_warn("%s#%d rf=%d: already in pool\n",
+					pool->name, new->index,
+					refcount_read(&new->refcnt));
 			return -EINVAL;
 		}
 
@@ -281,21 +286,20 @@ void *rxe_alloc_locked(struct rxe_pool *pool)
 		goto out_cnt;
 
 	elem = (struct rxe_pool_entry *)((u8 *)obj + pool->elem_offset);
-
 	elem->pool = pool;
 	elem->obj = obj;
-	kref_init(&elem->ref_cnt);
+	refcount_set(&elem->refcnt, 1);
 
 	if (pool->flags & RXE_POOL_INDEX) {
 		err = rxe_add_index(elem);
-		if (err) {
-			kfree(elem);
-			goto out_cnt;
-		}
+		if (err)
+			goto out_free;
 	}
 
 	return obj;
 
+out_free:
+	kfree(elem);
 out_cnt:
 	atomic_dec(&pool->num_elem);
 	return NULL;
@@ -307,20 +311,33 @@ void *rxe_alloc_with_key_locked(struct rxe_pool *pool, void *key)
 	u8 *obj;
 	int err;
 
-	obj = rxe_alloc_locked(pool);
+	if (atomic_inc_return(&pool->num_elem) > pool->max_elem)
+		goto out_cnt;
+
+	obj = kzalloc(pool->elem_size, GFP_ATOMIC);
 	if (!obj)
-		return NULL;
+		goto out_cnt;
+
+	elem = (struct rxe_pool_entry *)((u8 *)obj + pool->elem_offset);
+	elem->pool = pool;
+	elem->obj = obj;
+	refcount_set(&elem->refcnt, 1);
+
+	if (pool->flags & RXE_POOL_INDEX) {
+		err = rxe_add_index(elem);
+		if (err)
+			goto out_free;
+	}
 
-	elem = (struct rxe_pool_entry *)(obj + pool->elem_offset);
 	memcpy((u8 *)elem + pool->key.key_offset, key, pool->key.key_size);
 	err = rxe_insert_key(pool, elem);
-	if (err) {
-		kfree(obj);
-		goto out_cnt;
-	}
+	if (err)
+		goto out_free;
 
 	return obj;
 
+out_free:
+	kfree(obj);
 out_cnt:
 	atomic_dec(&pool->num_elem);
 	return NULL;
@@ -348,19 +365,19 @@ void *rxe_alloc(struct rxe_pool *pool)
 	elem = (struct rxe_pool_entry *)((u8 *)obj + pool->elem_offset);
 	elem->pool = pool;
 	elem->obj = obj;
-	kref_init(&elem->ref_cnt);
+	refcount_set(&elem->refcnt, 1);
 
 	if (pool->flags & RXE_POOL_INDEX) {
 		err = rxe_add_index(elem);
-		if (err) {
-			kfree(elem);
-			goto out_cnt;
-		}
+		if (err)
+			goto out_free;
 	}
 	write_unlock_irqrestore(&pool->pool_lock, flags);
 
 	return obj;
 
+out_free:
+	kfree(elem);
 out_cnt:
 	atomic_dec(&pool->num_elem);
 	write_unlock_irqrestore(&pool->pool_lock, flags);
@@ -369,14 +386,48 @@ void *rxe_alloc(struct rxe_pool *pool)
 
 void *rxe_alloc_with_key(struct rxe_pool *pool, void *key)
 {
+	struct rxe_pool_entry *elem;
 	unsigned long flags;
-	void *obj;
+	u8 *obj;
+	int err;
+
+	write_lock_irqsave(&pool->pool_lock, flags);
+	if (atomic_inc_return(&pool->num_elem) > pool->max_elem)
+		goto out_cnt;
+	write_unlock_irqrestore(&pool->pool_lock, flags);
+
+	obj = kzalloc(pool->elem_size, GFP_KERNEL);
+	if (!obj) {
+		atomic_dec(&pool->num_elem);
+		return NULL;
+	}
 
 	write_lock_irqsave(&pool->pool_lock, flags);
-	obj = rxe_alloc_with_key_locked(pool, key);
+	elem = (struct rxe_pool_entry *)((u8 *)obj + pool->elem_offset);
+	elem->pool = pool;
+	elem->obj = obj;
+	refcount_set(&elem->refcnt, 1);
+
+	if (pool->flags & RXE_POOL_INDEX) {
+		err = rxe_add_index(elem);
+		if (err)
+			goto out_free;
+	}
+
+	memcpy((u8 *)elem + pool->key.key_offset, key, pool->key.key_size);
+	err = rxe_insert_key(pool, elem);
+	if (err)
+		goto out_free;
 	write_unlock_irqrestore(&pool->pool_lock, flags);
 
 	return obj;
+
+out_free:
+	kfree(elem);
+out_cnt:
+	atomic_dec(&pool->num_elem);
+	write_unlock_irqrestore(&pool->pool_lock, flags);
+	return NULL;
 }
 
 int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_entry *elem)
@@ -390,7 +441,7 @@ int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_entry *elem)
 
 	elem->pool = pool;
 	elem->obj = (u8 *)elem - pool->elem_offset;
-	kref_init(&elem->ref_cnt);
+	refcount_set(&elem->refcnt, 1);
 
 	if (pool->flags & RXE_POOL_INDEX) {
 		err = rxe_add_index(elem);
@@ -407,35 +458,11 @@ int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_entry *elem)
 	return -EINVAL;
 }
 
-void rxe_elem_release(struct kref *kref)
-{
-	struct rxe_pool_entry *elem =
-		container_of(kref, struct rxe_pool_entry, ref_cnt);
-	struct rxe_pool *pool = elem->pool;
-	void *obj;
-
-	if (pool->cleanup)
-		pool->cleanup(elem);
-
-	if (pool->flags & RXE_POOL_INDEX)
-		rxe_drop_index(elem);
-
-	if (pool->flags & RXE_POOL_KEY)
-		rb_erase(&elem->key_node, &pool->key.tree);
-
-	if (!(pool->flags & RXE_POOL_NO_ALLOC)) {
-		obj = elem->obj;
-		kfree(obj);
-	}
-
-	atomic_dec(&pool->num_elem);
-}
-
 void *rxe_pool_get_index_locked(struct rxe_pool *pool, u32 index)
 {
 	struct rb_node *node;
 	struct rxe_pool_entry *elem;
-	void *obj;
+	void *obj = NULL;
 
 	node = pool->index.tree.rb_node;
 
@@ -450,12 +477,8 @@ void *rxe_pool_get_index_locked(struct rxe_pool *pool, u32 index)
 			break;
 	}
 
-	if (node) {
-		kref_get(&elem->ref_cnt);
+	if (node && refcount_inc_not_zero(&elem->refcnt))
 		obj = elem->obj;
-	} else {
-		obj = NULL;
-	}
 
 	return obj;
 }
@@ -476,7 +499,7 @@ void *rxe_pool_get_key_locked(struct rxe_pool *pool, void *key)
 {
 	struct rb_node *node;
 	struct rxe_pool_entry *elem;
-	void *obj;
+	void *obj = NULL;
 	int cmp;
 
 	node = pool->key.tree.rb_node;
@@ -495,12 +518,8 @@ void *rxe_pool_get_key_locked(struct rxe_pool *pool, void *key)
 			break;
 	}
 
-	if (node) {
-		kref_get(&elem->ref_cnt);
+	if (node && refcount_inc_not_zero(&elem->refcnt))
 		obj = elem->obj;
-	} else {
-		obj = NULL;
-	}
 
 	return obj;
 }
@@ -516,3 +535,97 @@ void *rxe_pool_get_key(struct rxe_pool *pool, void *key)
 
 	return obj;
 }
+
+int __rxe_add_ref_locked(struct rxe_pool_entry *elem)
+{
+	return refcount_inc_not_zero(&elem->refcnt) ? 0 : -EINVAL;
+}
+
+int __rxe_add_ref(struct rxe_pool_entry *elem)
+{
+	struct rxe_pool *pool = elem->pool;
+	unsigned long flags;
+	int ret;
+
+	read_lock_irqsave(&pool->pool_lock, flags);
+	ret = __rxe_add_ref_locked(elem);
+	read_unlock_irqrestore(&pool->pool_lock, flags);
+
+	return ret;
+}
+
+int __rxe_drop_ref_locked(struct rxe_pool_entry *elem)
+{
+	return refcount_dec_not_one(&elem->refcnt) ? 0 : -EINVAL;
+}
+
+int __rxe_drop_ref(struct rxe_pool_entry *elem)
+{
+	struct rxe_pool *pool = elem->pool;
+	unsigned long flags;
+	int ret;
+
+	read_lock_irqsave(&pool->pool_lock, flags);
+	ret = __rxe_drop_ref_locked(elem);
+	read_unlock_irqrestore(&pool->pool_lock, flags);
+
+	return ret;
+}
+
+static int __rxe_fini(struct rxe_pool_entry *elem)
+{
+	struct rxe_pool *pool = elem->pool;
+	int done;
+
+	done = refcount_dec_if_one(&elem->refcnt);
+	if (done) {
+		if (pool->flags & RXE_POOL_INDEX)
+			rxe_drop_index(elem);
+		if (pool->flags & RXE_POOL_KEY)
+			rb_erase(&elem->key_node, &pool->key.tree);
+		atomic_dec(&pool->num_elem);
+		return 0;
+	} else {
+		return -EBUSY;
+	}
+}
+
+/* can only be used by pools that have a cleanup
+ * routine that can run while holding a spinlock
+ */
+int __rxe_fini_ref_locked(struct rxe_pool_entry *elem)
+{
+	struct rxe_pool *pool = elem->pool;
+	int ret;
+
+	ret = __rxe_fini(elem);
+
+	if (!ret) {
+		if (pool->cleanup)
+			pool->cleanup(elem);
+		if (!(pool->flags & RXE_POOL_NO_ALLOC))
+			kfree(elem->obj);
+	}
+
+	return ret;
+}
+
+int __rxe_fini_ref(struct rxe_pool_entry *elem)
+{
+	struct rxe_pool *pool = elem->pool;
+	unsigned long flags;
+	int ret;
+
+	read_lock_irqsave(&pool->pool_lock, flags);
+	ret = __rxe_fini(elem);
+	read_unlock_irqrestore(&pool->pool_lock, flags);
+
+	if (!ret) {
+		if (pool->cleanup)
+			pool->cleanup(elem);
+		if (!(pool->flags & RXE_POOL_NO_ALLOC))
+			kfree(elem->obj);
+	}
+
+	return ret;
+}
diff --git a/drivers/infiniband/sw/rxe/rxe_pool.h b/drivers/infiniband/sw/rxe/rxe_pool.h
index e0242d488cc8..7df52d34e306 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.h
+++ b/drivers/infiniband/sw/rxe/rxe_pool.h
@@ -47,7 +47,7 @@ struct rxe_type_info {
 struct rxe_pool_entry {
 	struct rxe_pool		*pool;
 	void			*obj;
-	struct kref		ref_cnt;
+	refcount_t		refcnt;
 	struct list_head	list;
 
 	/* only used if keyed */
@@ -127,13 +127,33 @@ void *rxe_pool_get_key_locked(struct rxe_pool *pool, void *key);
 
 void *rxe_pool_get_key(struct rxe_pool *pool, void *key);
 
-/* cleanup an object when all references are dropped */
-void rxe_elem_release(struct kref *kref);
-
 /* take a reference on an object */
-#define rxe_add_ref(elem) kref_get(&(elem)->pelem.ref_cnt)
+int __rxe_add_ref_locked(struct rxe_pool_entry *elem);
+
+#define rxe_add_ref_locked(obj) __rxe_add_ref_locked(&(obj)->pelem)
+
+int __rxe_add_ref(struct rxe_pool_entry *elem);
+
+#define rxe_add_ref(obj) __rxe_add_ref(&(obj)->pelem)
 
 /* drop a reference on an object */
-#define rxe_drop_ref(elem) kref_put(&(elem)->pelem.ref_cnt, rxe_elem_release)
+int __rxe_drop_ref_locked(struct rxe_pool_entry *elem);
+
+#define rxe_drop_ref_locked(obj) __rxe_drop_ref_locked(&(obj)->pelem)
+
+int __rxe_drop_ref(struct rxe_pool_entry *elem);
+
+#define rxe_drop_ref(obj) __rxe_drop_ref(&(obj)->pelem)
+
+/* drop last reference on an object */
+int __rxe_fini_ref_locked(struct rxe_pool_entry *elem);
+
+#define rxe_fini_ref_locked(obj) __rxe_fini_ref_locked(&(obj)->pelem)
+
+int __rxe_fini_ref(struct rxe_pool_entry *elem);
+
+#define rxe_fini_ref(obj) __rxe_fini_ref(&(obj)->pelem)
+
+#define rxe_read_ref(obj) refcount_read(&(obj)->pelem.refcnt)
 
 #endif /* RXE_POOL_H */
diff --git a/drivers/infiniband/sw/rxe/rxe_srq.c b/drivers/infiniband/sw/rxe/rxe_srq.c
index eb1c4c3b3a78..bb00643a2929 100644
--- a/drivers/infiniband/sw/rxe/rxe_srq.c
+++ b/drivers/infiniband/sw/rxe/rxe_srq.c
@@ -154,3 +154,11 @@ int rxe_srq_from_attr(struct rxe_dev *rxe, struct rxe_srq *srq,
 	srq->rq.queue = NULL;
 	return err;
 }
+
+void rxe_srq_cleanup(struct rxe_pool_entry *arg)
+{
+	struct rxe_srq *srq = container_of(arg, typeof(*srq), pelem);
+
+	if (srq->rq.queue)
+		rxe_queue_cleanup(srq->rq.queue);
+}
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.c b/drivers/infiniband/sw/rxe/rxe_verbs.c
index 84ea03bc6a26..61fa633775f3 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.c
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.c
@@ -115,7 +115,7 @@ static void rxe_dealloc_ucontext(struct ib_ucontext *ibuc)
 {
 	struct rxe_ucontext *uc = to_ruc(ibuc);
 
-	rxe_drop_ref(uc);
+	rxe_fini_ref(uc);
 }
 
 static int rxe_port_immutable(struct ib_device *dev, u32 port_num,
@@ -149,8 +149,7 @@ static int rxe_dealloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
 {
 	struct rxe_pd *pd = to_rpd(ibpd);
 
-	rxe_drop_ref(pd);
-	return 0;
+	return rxe_fini_ref(pd);
 }
 
 static int rxe_create_ah(struct ib_ah *ibah,
@@ -188,7 +187,7 @@ static int rxe_create_ah(struct ib_ah *ibah,
 		err = copy_to_user(&uresp->ah_num, &ah->ah_num,
 					 sizeof(uresp->ah_num));
 		if (err) {
-			rxe_drop_ref(ah);
+			rxe_fini_ref(ah);
 			return -EFAULT;
 		}
 	} else if (ah->is_user) {
@@ -228,8 +227,7 @@ static int rxe_destroy_ah(struct ib_ah *ibah, u32 flags)
 {
 	struct rxe_ah *ah = to_rah(ibah);
 
-	rxe_drop_ref(ah);
-	return 0;
+	return rxe_fini_ref(ah);
 }
 
 static int post_one_recv(struct rxe_rq *rq, const struct ib_recv_wr *ibwr)
@@ -313,8 +311,8 @@ static int rxe_create_srq(struct ib_srq *ibsrq, struct ib_srq_init_attr *init,
 	return 0;
 
 err2:
+	rxe_fini_ref(srq);
 	rxe_drop_ref(pd);
-	rxe_drop_ref(srq);
 err1:
 	return err;
 }
@@ -367,12 +365,15 @@ static int rxe_query_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr)
 static int rxe_destroy_srq(struct ib_srq *ibsrq, struct ib_udata *udata)
 {
 	struct rxe_srq *srq = to_rsrq(ibsrq);
+	struct rxe_pd *pd = srq->pd;
+	int err;
+
+	err = rxe_fini_ref(srq);
+	if (err)
+		return err;
 
-	if (srq->rq.queue)
-		rxe_queue_cleanup(srq->rq.queue);
+	rxe_drop_ref(pd);
 
-	rxe_drop_ref(srq->pd);
-	rxe_drop_ref(srq);
 	return 0;
 }
 
@@ -437,12 +438,12 @@ static int rxe_create_qp(struct ib_qp *ibqp, struct ib_qp_init_attr *init,
 
 	err = rxe_qp_from_init(rxe, qp, pd, init, uresp, ibqp->pd, udata);
 	if (err)
-		goto qp_init;
+		goto err_fini;
 
 	return 0;
 
-qp_init:
-	rxe_drop_ref(qp);
+err_fini:
+	rxe_fini_ref(qp);
 	return err;
 }
 
@@ -486,8 +487,8 @@ static int rxe_destroy_qp(struct ib_qp *ibqp, struct ib_udata *udata)
 	struct rxe_qp *qp = to_rqp(ibqp);
 
 	rxe_qp_destroy(qp);
-	rxe_drop_ref(qp);
-	return 0;
+
+	return rxe_fini_ref(qp);
 }
 
 static int validate_send_wr(struct rxe_qp *qp, const struct ib_send_wr *ibwr,
@@ -797,10 +798,7 @@ static int rxe_destroy_cq(struct ib_cq *ibcq, struct ib_udata *udata)
 {
 	struct rxe_cq *cq = to_rcq(ibcq);
 
-	rxe_cq_disable(cq);
-
-	rxe_drop_ref(cq);
-	return 0;
+	return rxe_fini_ref(cq);
 }
 
 static int rxe_resize_cq(struct ib_cq *ibcq, int cqe, struct ib_udata *udata)
@@ -924,8 +922,8 @@ static struct ib_mr *rxe_reg_user_mr(struct ib_pd *ibpd,
 	return &mr->ibmr;
 
 err3:
+	rxe_fini_ref(mr);
 	rxe_drop_ref(pd);
-	rxe_drop_ref(mr);
 err2:
 	return ERR_PTR(err);
 }
@@ -956,8 +954,8 @@ static struct ib_mr *rxe_alloc_mr(struct ib_pd *ibpd, enum ib_mr_type mr_type,
 	return &mr->ibmr;
 
 err2:
+	rxe_fini_ref(mr);
 	rxe_drop_ref(pd);
-	rxe_drop_ref(mr);
 err1:
 	return ERR_PTR(err);
 }
-- 
2.30.2



* [PATCH for-next v3 07/10] RDMA/rxe: Rewrite rxe_mcast.c
  2021-10-22 19:18 [PATCH for-next v3 00/10] Correct race conditions in rdma_rxe Bob Pearson
                   ` (5 preceding siblings ...)
  2021-10-22 19:18 ` [PATCH for-next v3 06/10] RDMA/rxe: Separate out last rxe_drop_ref Bob Pearson
@ 2021-10-22 19:18 ` Bob Pearson
  2021-10-22 19:18 ` [PATCH for-next v3 08/10] RDMA/rxe: Fix ref error in rxe_av.c Bob Pearson
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 14+ messages in thread
From: Bob Pearson @ 2021-10-22 19:18 UTC (permalink / raw)
  To: jgg, zyjzyj2000, linux-rdma; +Cc: Bob Pearson

Rewrite rxe_mcast.c to support rxe_fini_ref() and clean up referencing.
Drop mc_elem as an rxe pool object and just use kmalloc()/kfree().
Take QP references as well as mc_grp references to force all multicast
groups to be detached before their QPs are destroyed. Simplify the API
presented to rxe_verbs.c.
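As a reference point for reviewers, the association amounts to one
small struct linked into two lists; a minimal sketch (the exact field
layout is an assumption here, shown only to illustrate the referencing
scheme):

	/* one attach = one elem, kmalloc()ed and linked into both
	 * lists; each detach unlinks it and kfree()s it
	 */
	struct rxe_mc_elem {
		struct list_head	qp_list;  /* entry on grp->qp_list */
		struct list_head	grp_list; /* entry on qp->grp_list */
		struct rxe_qp		*qp;	  /* holds a qp reference */
		struct rxe_mc_grp	*grp;	  /* holds a grp reference */
	};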

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe.c       |   8 -
 drivers/infiniband/sw/rxe/rxe_loc.h   |  11 +-
 drivers/infiniband/sw/rxe/rxe_mcast.c | 205 +++++++++++++++++---------
 drivers/infiniband/sw/rxe/rxe_net.c   |  22 ---
 drivers/infiniband/sw/rxe/rxe_pool.c  |   6 -
 drivers/infiniband/sw/rxe/rxe_pool.h  |   1 -
 drivers/infiniband/sw/rxe/rxe_verbs.c |  12 +-
 drivers/infiniband/sw/rxe/rxe_verbs.h |   3 +-
 8 files changed, 143 insertions(+), 125 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
index 8e0f9c489cab..4298a1d20ad5 100644
--- a/drivers/infiniband/sw/rxe/rxe.c
+++ b/drivers/infiniband/sw/rxe/rxe.c
@@ -31,7 +31,6 @@ void rxe_dealloc(struct ib_device *ib_dev)
 	rxe_pool_cleanup(&rxe->mr_pool);
 	rxe_pool_cleanup(&rxe->mw_pool);
 	rxe_pool_cleanup(&rxe->mc_grp_pool);
-	rxe_pool_cleanup(&rxe->mc_elem_pool);
 
 	if (rxe->tfm)
 		crypto_free_shash(rxe->tfm);
@@ -165,15 +164,8 @@ static int rxe_init_pools(struct rxe_dev *rxe)
 	if (err)
 		goto err9;
 
-	err = rxe_pool_init(rxe, &rxe->mc_elem_pool, RXE_TYPE_MC_ELEM,
-			    rxe->attr.max_total_mcast_qp_attach);
-	if (err)
-		goto err10;
-
 	return 0;
 
-err10:
-	rxe_pool_cleanup(&rxe->mc_grp_pool);
 err9:
 	rxe_pool_cleanup(&rxe->mw_pool);
 err8:
diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
index a25d1c9f6adb..78312df8eea3 100644
--- a/drivers/infiniband/sw/rxe/rxe_loc.h
+++ b/drivers/infiniband/sw/rxe/rxe_loc.h
@@ -38,19 +38,12 @@ int rxe_cq_post(struct rxe_cq *cq, struct rxe_cqe *cqe, int solicited);
 void rxe_cq_cleanup(struct rxe_pool_entry *arg);
 
 /* rxe_mcast.c */
-int rxe_mcast_get_grp(struct rxe_dev *rxe, union ib_gid *mgid,
-		      struct rxe_mc_grp **grp_p);
-
 int rxe_mcast_add_grp_elem(struct rxe_dev *rxe, struct rxe_qp *qp,
-			   struct rxe_mc_grp *grp);
-
+			   union ib_gid *mgid);
 int rxe_mcast_drop_grp_elem(struct rxe_dev *rxe, struct rxe_qp *qp,
 			    union ib_gid *mgid);
-
 void rxe_drop_all_mcast_groups(struct rxe_qp *qp);
 
-void rxe_mc_cleanup(struct rxe_pool_entry *arg);
-
 /* rxe_mmap.c */
 struct rxe_mmap_info {
 	struct list_head	pending_mmaps;
@@ -104,8 +97,6 @@ int rxe_prepare(struct rxe_pkt_info *pkt, struct sk_buff *skb);
 int rxe_xmit_packet(struct rxe_qp *qp, struct rxe_pkt_info *pkt,
 		    struct sk_buff *skb);
 const char *rxe_parent_name(struct rxe_dev *rxe, unsigned int port_num);
-int rxe_mcast_add(struct rxe_dev *rxe, union ib_gid *mgid);
-int rxe_mcast_delete(struct rxe_dev *rxe, union ib_gid *mgid);
 
 /* rxe_qp.c */
 int rxe_qp_chk_init(struct rxe_dev *rxe, struct ib_qp_init_attr *init);
diff --git a/drivers/infiniband/sw/rxe/rxe_mcast.c b/drivers/infiniband/sw/rxe/rxe_mcast.c
index 337dc2c68051..685440a20669 100644
--- a/drivers/infiniband/sw/rxe/rxe_mcast.c
+++ b/drivers/infiniband/sw/rxe/rxe_mcast.c
@@ -5,116 +5,192 @@
  */
 
 #include "rxe.h"
-#include "rxe_loc.h"
 
-/* caller should hold mc_grp_pool->pool_lock */
-static struct rxe_mc_grp *create_grp(struct rxe_dev *rxe,
-				     struct rxe_pool *pool,
-				     union ib_gid *mgid)
+static int rxe_mcast_add(struct rxe_dev *rxe, union ib_gid *mgid)
 {
 	int err;
+	unsigned char ll_addr[ETH_ALEN];
+
+	ipv6_eth_mc_map((struct in6_addr *)mgid->raw, ll_addr);
+	err = dev_mc_add(rxe->ndev, ll_addr);
+
+	return err;
+}
+
+static int rxe_mcast_delete(struct rxe_dev *rxe, union ib_gid *mgid)
+{
+	int err;
+	unsigned char ll_addr[ETH_ALEN];
+
+	ipv6_eth_mc_map((struct in6_addr *)mgid->raw, ll_addr);
+	err = dev_mc_del(rxe->ndev, ll_addr);
+
+	return err;
+}
+
+static int rxe_mcast_get_grp(struct rxe_dev *rxe, union ib_gid *mgid,
+		      struct rxe_mc_grp **grp_p)
+{
+	struct rxe_pool *pool = &rxe->mc_grp_pool;
 	struct rxe_mc_grp *grp;
+	unsigned long flags;
+	int err = 0;
+
+	/* Perform the lookup and create steps while holding the
+	 * mc_grp_pool lock to prevent races where two concurrent calls
+	 * both fail to look up the group and then both create it.
+	 */
+	write_lock_irqsave(&pool->pool_lock, flags);
+	grp = rxe_pool_get_key_locked(pool, mgid);
+	if (grp)
+		goto done;
 
 	grp = rxe_alloc_with_key_locked(&rxe->mc_grp_pool, mgid);
-	if (!grp)
-		return ERR_PTR(-ENOMEM);
+	if (!grp) {
+		err = -ENOMEM;
+		goto done;
+	}
 
 	INIT_LIST_HEAD(&grp->qp_list);
 	spin_lock_init(&grp->mcg_lock);
 	grp->rxe = rxe;
 
 	err = rxe_mcast_add(rxe, mgid);
-	if (unlikely(err)) {
-		rxe_drop_ref(grp);
-		return ERR_PTR(err);
+	if (err) {
+		rxe_fini_ref_locked(grp);
+		grp = NULL;
+		goto done;
 	}
 
-	return grp;
+	/* match the reference taken by get_key */
+	rxe_add_ref_locked(grp);
+done:
+	*grp_p = grp;
+	write_unlock_irqrestore(&pool->pool_lock, flags);
+
+	return err;
 }
 
-int rxe_mcast_get_grp(struct rxe_dev *rxe, union ib_gid *mgid,
-		      struct rxe_mc_grp **grp_p)
+static void rxe_mcast_put_grp(struct rxe_mc_grp *grp)
 {
-	int err;
-	struct rxe_mc_grp *grp;
+	struct rxe_dev *rxe = grp->rxe;
 	struct rxe_pool *pool = &rxe->mc_grp_pool;
 	unsigned long flags;
 
-	if (rxe->attr.max_mcast_qp_attach == 0)
-		return -EINVAL;
-
 	write_lock_irqsave(&pool->pool_lock, flags);
 
-	grp = rxe_pool_get_key_locked(pool, mgid);
-	if (grp)
-		goto done;
+	rxe_drop_ref_locked(grp);
 
-	grp = create_grp(rxe, pool, mgid);
-	if (IS_ERR(grp)) {
-		write_unlock_irqrestore(&pool->pool_lock, flags);
-		err = PTR_ERR(grp);
-		return err;
+	if (rxe_read_ref(grp) == 1) {
+		rxe_mcast_delete(rxe, &grp->mgid);
+		rxe_fini_ref_locked(grp);
 	}
 
-done:
 	write_unlock_irqrestore(&pool->pool_lock, flags);
-	*grp_p = grp;
-	return 0;
 }
 
+/**
+ * rxe_mcast_add_grp_elem() - Associate a multicast address with a QP
+ * @rxe: the rxe device
+ * @qp: the QP
+ * @mgid: the multicast address
+ *
+ * Each multicast group can be associated with one or more QPs and
+ * each QP can be associated with zero or more multicast groups.
+ * Each QP/group association is described by a rxe_mc_elem struct,
+ * which contains two list_head structs and is linked both into the
+ * group's list of QPs and the QP's list of groups. The elem holds
+ * pointers to the group and the QP and takes a reference on each
+ * of them.
+ *
+ * Return: 0 on success or an error on failure.
+ */
 int rxe_mcast_add_grp_elem(struct rxe_dev *rxe, struct rxe_qp *qp,
-			   struct rxe_mc_grp *grp)
+			   union ib_gid *mgid)
 {
-	int err;
 	struct rxe_mc_elem *elem;
+	struct rxe_mc_grp *grp;
+	int err;
+
+	if (rxe->attr.max_mcast_qp_attach == 0)
+		return -EINVAL;
+
+	/* takes a ref on grp if successful */
+	err = rxe_mcast_get_grp(rxe, mgid, &grp);
+	if (err)
+		return err;
 
-	/* check to see of the qp is already a member of the group */
 	spin_lock_bh(&qp->grp_lock);
 	spin_lock_bh(&grp->mcg_lock);
+
+	/* check to see if the qp is already a member of the group */
 	list_for_each_entry(elem, &grp->qp_list, qp_list) {
-		if (elem->qp == qp) {
-			err = 0;
-			goto out;
-		}
+		if (elem->qp == qp)
+			goto drop_ref;
 	}
 
 	if (grp->num_qp >= rxe->attr.max_mcast_qp_attach) {
 		err = -ENOMEM;
-		goto out;
+		goto drop_ref;
 	}
 
-	elem = rxe_alloc_locked(&rxe->mc_elem_pool);
-	if (!elem) {
+	if (atomic_read(&rxe->total_mcast_qp_attach) >=
+			rxe->attr.max_total_mcast_qp_attach) {
 		err = -ENOMEM;
-		goto out;
+		goto drop_ref;
 	}
 
-	/* each qp holds a ref on the grp */
-	rxe_add_ref(grp);
+	elem = kmalloc(sizeof(*elem), GFP_KERNEL);
+	if (!elem) {
+		err = -ENOMEM;
+		goto drop_ref;
+	}
 
+	atomic_inc(&rxe->total_mcast_qp_attach);
 	grp->num_qp++;
+	rxe_add_ref(qp);
 	elem->qp = qp;
+	/* still holding a ref on grp */
 	elem->grp = grp;
 
 	list_add(&elem->qp_list, &grp->qp_list);
 	list_add(&elem->grp_list, &qp->grp_list);
 
-	err = 0;
-out:
+	goto done;
+
+drop_ref:
+	rxe_drop_ref(grp);
+
+done:
 	spin_unlock_bh(&grp->mcg_lock);
 	spin_unlock_bh(&qp->grp_lock);
+
 	return err;
 }
 
+/**
+ * rxe_mcast_drop_grp_elem() - Dissociate multicast address and QP
+ * @rxe: the rxe device
+ * @qp: the QP
+ * @mgid: the multicast address
+ *
+ * Walk the group's list of elements to find the one which matches
+ * the QP, then delete it from the group and QP lists, drop the
+ * references it holds and free the elem. If the last QP has been
+ * removed from the group, delete the group as well.
+ *
+ * Return: 0 on success or an error on failure.
+ */
 int rxe_mcast_drop_grp_elem(struct rxe_dev *rxe, struct rxe_qp *qp,
 			    union ib_gid *mgid)
 {
-	struct rxe_mc_grp *grp;
 	struct rxe_mc_elem *elem, *tmp;
+	struct rxe_mc_grp *grp;
+	int err = 0;
 
 	grp = rxe_pool_get_key(&rxe->mc_grp_pool, mgid);
 	if (!grp)
-		goto err1;
+		return -EINVAL;
 
 	spin_lock_bh(&qp->grp_lock);
 	spin_lock_bh(&grp->mcg_lock);
@@ -123,26 +199,28 @@ int rxe_mcast_drop_grp_elem(struct rxe_dev *rxe, struct rxe_qp *qp,
 		if (elem->qp == qp) {
 			list_del(&elem->qp_list);
 			list_del(&elem->grp_list);
+			rxe_drop_ref(grp);
+			rxe_drop_ref(qp);
 			grp->num_qp--;
-
-			spin_unlock_bh(&grp->mcg_lock);
-			spin_unlock_bh(&qp->grp_lock);
-			rxe_drop_ref(elem);
-			rxe_drop_ref(grp);	/* ref held by QP */
-			rxe_drop_ref(grp);	/* ref from get_key */
-			return 0;
+			kfree(elem);
+			atomic_dec(&rxe->total_mcast_qp_attach);
+			goto done;
 		}
 	}
 
+	err = -EINVAL;
+done:
 	spin_unlock_bh(&grp->mcg_lock);
 	spin_unlock_bh(&qp->grp_lock);
-	rxe_drop_ref(grp);			/* ref from get_key */
-err1:
-	return -EINVAL;
+
+	rxe_mcast_put_grp(grp);
+
+	return err;
 }
 
 void rxe_drop_all_mcast_groups(struct rxe_qp *qp)
 {
+	struct rxe_dev *rxe = to_rdev(qp->ibqp.device);
 	struct rxe_mc_grp *grp;
 	struct rxe_mc_elem *elem;
 
@@ -162,15 +240,10 @@ void rxe_drop_all_mcast_groups(struct rxe_qp *qp)
 		list_del(&elem->qp_list);
 		grp->num_qp--;
 		spin_unlock_bh(&grp->mcg_lock);
-		rxe_drop_ref(grp);
-		rxe_drop_ref(elem);
-	}
-}
 
-void rxe_mc_cleanup(struct rxe_pool_entry *arg)
-{
-	struct rxe_mc_grp *grp = container_of(arg, typeof(*grp), pelem);
-	struct rxe_dev *rxe = grp->rxe;
-
-	rxe_mcast_delete(rxe, &grp->mgid);
+		kfree(elem);
+		atomic_dec(&rxe->total_mcast_qp_attach);
+		rxe_drop_ref(qp);
+		rxe_mcast_put_grp(grp);
+	}
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
index 2cb810cb890a..fcdf998ec896 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.c
+++ b/drivers/infiniband/sw/rxe/rxe_net.c
@@ -20,28 +20,6 @@
 
 static struct rxe_recv_sockets recv_sockets;
 
-int rxe_mcast_add(struct rxe_dev *rxe, union ib_gid *mgid)
-{
-	int err;
-	unsigned char ll_addr[ETH_ALEN];
-
-	ipv6_eth_mc_map((struct in6_addr *)mgid->raw, ll_addr);
-	err = dev_mc_add(rxe->ndev, ll_addr);
-
-	return err;
-}
-
-int rxe_mcast_delete(struct rxe_dev *rxe, union ib_gid *mgid)
-{
-	int err;
-	unsigned char ll_addr[ETH_ALEN];
-
-	ipv6_eth_mc_map((struct in6_addr *)mgid->raw, ll_addr);
-	err = dev_mc_del(rxe->ndev, ll_addr);
-
-	return err;
-}
-
 static struct dst_entry *rxe_find_route4(struct net_device *ndev,
 				  struct in_addr *saddr,
 				  struct in_addr *daddr)
diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
index 4be43ae58219..2cd4d8803a0e 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.c
+++ b/drivers/infiniband/sw/rxe/rxe_pool.c
@@ -76,16 +76,10 @@ static const struct rxe_type_info rxe_type_info[RXE_NUM_TYPES] = {
 		.name		= "rxe-mc_grp",
 		.size		= sizeof(struct rxe_mc_grp),
 		.elem_offset	= offsetof(struct rxe_mc_grp, pelem),
-		.cleanup	= rxe_mc_cleanup,
 		.flags		= RXE_POOL_KEY,
 		.key_offset	= offsetof(struct rxe_mc_grp, mgid),
 		.key_size	= sizeof(union ib_gid),
 	},
-	[RXE_TYPE_MC_ELEM] = {
-		.name		= "rxe-mc_elem",
-		.size		= sizeof(struct rxe_mc_elem),
-		.elem_offset	= offsetof(struct rxe_mc_elem, pelem),
-	},
 };
 
 static int rxe_pool_init_index(struct rxe_pool *pool, u32 max, u32 min)
diff --git a/drivers/infiniband/sw/rxe/rxe_pool.h b/drivers/infiniband/sw/rxe/rxe_pool.h
index 7df52d34e306..071e5c557d31 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.h
+++ b/drivers/infiniband/sw/rxe/rxe_pool.h
@@ -26,7 +26,6 @@ enum rxe_elem_type {
 	RXE_TYPE_MR,
 	RXE_TYPE_MW,
 	RXE_TYPE_MC_GRP,
-	RXE_TYPE_MC_ELEM,
 	RXE_NUM_TYPES,		/* keep me last */
 };
 
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.c b/drivers/infiniband/sw/rxe/rxe_verbs.c
index 61fa633775f3..a475fc04e05b 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.c
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.c
@@ -986,20 +986,10 @@ static int rxe_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg,
 
 static int rxe_attach_mcast(struct ib_qp *ibqp, union ib_gid *mgid, u16 mlid)
 {
-	int err;
 	struct rxe_dev *rxe = to_rdev(ibqp->device);
 	struct rxe_qp *qp = to_rqp(ibqp);
-	struct rxe_mc_grp *grp;
-
-	/* takes a ref on grp if successful */
-	err = rxe_mcast_get_grp(rxe, mgid, &grp);
-	if (err)
-		return err;
-
-	err = rxe_mcast_add_grp_elem(rxe, qp, grp);
 
-	rxe_drop_ref(grp);
-	return err;
+	return rxe_mcast_add_grp_elem(rxe, qp, mgid);
 }
 
 static int rxe_detach_mcast(struct ib_qp *ibqp, union ib_gid *mgid, u16 mlid)
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
index 35e041450090..4f1d7777f755 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.h
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
@@ -403,7 +403,8 @@ struct rxe_dev {
 	struct rxe_pool		mr_pool;
 	struct rxe_pool		mw_pool;
 	struct rxe_pool		mc_grp_pool;
-	struct rxe_pool		mc_elem_pool;
+
+	atomic_t		total_mcast_qp_attach;
 
 	spinlock_t		pending_lock; /* guard pending_mmaps */
 	struct list_head	pending_mmaps;
-- 
2.30.2



* [PATCH for-next v3 08/10] RDMA/rxe: Fix ref error in rxe_av.c
  2021-10-22 19:18 [PATCH for-next v3 00/10] Correct race conditions in rdma_rxe Bob Pearson
                   ` (6 preceding siblings ...)
  2021-10-22 19:18 ` [PATCH for-next v3 07/10] RDMA/rxe: Rewrite rxe_mcast.c Bob Pearson
@ 2021-10-22 19:18 ` Bob Pearson
  2021-10-22 19:18 ` [PATCH for-next v3 09/10] RDMA/rxe: Replace mr by rkey in responder resources Bob Pearson
  2021-10-22 19:18 ` [PATCH for-next v3 10/10] RDMA/rxe: Minor cleanup in rxe_pool.c Bob Pearson
  9 siblings, 0 replies; 14+ messages in thread
From: Bob Pearson @ 2021-10-22 19:18 UTC (permalink / raw)
  To: jgg, zyjzyj2000, linux-rdma; +Cc: Bob Pearson

The commit referenced in the Fixes tag below can take a reference to
the AH which is then never dropped. This only happens in the UD request
path. This patch optionally passes that AH back to the caller so that
it can hold the reference while the AV is being accessed and drop it
when done. Code to do this is added to rxe_req.c. The AV is also passed
to rxe_prepare() in rxe_net.c as an optimization.
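
The resulting caller pattern, as a minimal sketch of what rxe_req.c
does below (error handling elided):

	struct rxe_ah *ah;
	struct rxe_av *av;

	/* on success the AH, if one was used, is passed back so the
	 * caller holds its reference while the AV is in use; ah is
	 * NULL for QPs that do not use an AH
	 */
	av = rxe_get_av(&pkt, &ah);
	if (!av)
		return -EINVAL;

	/* ... build and transmit the packet using av ... */

	if (ah)
		rxe_drop_ref(ah);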

Fixes: e2fe06c90806 ("RDMA/rxe: Lookup kernel AH from ah index in UD WQEs")
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_av.c   | 25 +++++++++++--
 drivers/infiniband/sw/rxe/rxe_loc.h  |  5 ++-
 drivers/infiniband/sw/rxe/rxe_net.c  | 17 +++++----
 drivers/infiniband/sw/rxe/rxe_req.c  | 55 +++++++++++++++++-----------
 drivers/infiniband/sw/rxe/rxe_resp.c |  2 +-
 5 files changed, 67 insertions(+), 37 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_av.c b/drivers/infiniband/sw/rxe/rxe_av.c
index 5084841a7cd0..8a6910a01e66 100644
--- a/drivers/infiniband/sw/rxe/rxe_av.c
+++ b/drivers/infiniband/sw/rxe/rxe_av.c
@@ -99,11 +99,14 @@ void rxe_av_fill_ip_info(struct rxe_av *av, struct rdma_ah_attr *attr)
 	av->network_type = type;
 }
 
-struct rxe_av *rxe_get_av(struct rxe_pkt_info *pkt)
+struct rxe_av *rxe_get_av(struct rxe_pkt_info *pkt, struct rxe_ah **ahp)
 {
 	struct rxe_ah *ah;
 	u32 ah_num;
 
+	if (ahp)
+		*ahp = NULL;
+
 	if (!pkt || !pkt->qp)
 		return NULL;
 
@@ -117,11 +120,25 @@ struct rxe_av *rxe_get_av(struct rxe_pkt_info *pkt)
 	if (ah_num) {
 		/* only new user provider or kernel client */
 		ah = rxe_pool_get_index(&pkt->rxe->ah_pool, ah_num);
-		rxe_drop_ref(ah);
-		if (!ah || ah->ah_num != ah_num || rxe_ah_pd(ah) != pkt->qp->pd) {
-			pr_warn("Unable to find AH matching ah_num\n");
+		if (!ah) {
+			pr_warn("%s: Unable to find AH matching ah_num\n",
+				__func__);
+			return NULL;
+		}
+
+		if (rxe_ah_pd(ah) != pkt->qp->pd) {
+			pr_warn("%s: PDs don't match for AH and QP\n",
+				__func__);
+			rxe_drop_ref(ah);
 			return NULL;
 		}
+
+		/* let caller hold ref to ah */
+		if (ahp)
+			*ahp = ah;
+		else
+			rxe_drop_ref(ah);
+
 		return &ah->av;
 	}
 
diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
index 78312df8eea3..a689ee8386b8 100644
--- a/drivers/infiniband/sw/rxe/rxe_loc.h
+++ b/drivers/infiniband/sw/rxe/rxe_loc.h
@@ -19,7 +19,7 @@ void rxe_av_to_attr(struct rxe_av *av, struct rdma_ah_attr *attr);
 
 void rxe_av_fill_ip_info(struct rxe_av *av, struct rdma_ah_attr *attr);
 
-struct rxe_av *rxe_get_av(struct rxe_pkt_info *pkt);
+struct rxe_av *rxe_get_av(struct rxe_pkt_info *pkt, struct rxe_ah **ahp);
 
 /* rxe_cq.c */
 int rxe_cq_chk_attr(struct rxe_dev *rxe, struct rxe_cq *cq,
@@ -93,7 +93,8 @@ void rxe_mw_cleanup(struct rxe_pool_entry *arg);
 /* rxe_net.c */
 struct sk_buff *rxe_init_packet(struct rxe_dev *rxe, struct rxe_av *av,
 				int paylen, struct rxe_pkt_info *pkt);
-int rxe_prepare(struct rxe_pkt_info *pkt, struct sk_buff *skb);
+int rxe_prepare(struct rxe_av *av, struct rxe_pkt_info *pkt,
+		struct sk_buff *skb);
 int rxe_xmit_packet(struct rxe_qp *qp, struct rxe_pkt_info *pkt,
 		    struct sk_buff *skb);
 const char *rxe_parent_name(struct rxe_dev *rxe, unsigned int port_num);
diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
index fcdf998ec896..996aabd6e57b 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.c
+++ b/drivers/infiniband/sw/rxe/rxe_net.c
@@ -271,13 +271,13 @@ static void prepare_ipv6_hdr(struct dst_entry *dst, struct sk_buff *skb,
 	ip6h->payload_len = htons(skb->len - sizeof(*ip6h));
 }
 
-static int prepare4(struct rxe_pkt_info *pkt, struct sk_buff *skb)
+static int prepare4(struct rxe_av *av, struct rxe_pkt_info *pkt,
+		    struct sk_buff *skb)
 {
 	struct rxe_qp *qp = pkt->qp;
 	struct dst_entry *dst;
 	bool xnet = false;
 	__be16 df = htons(IP_DF);
-	struct rxe_av *av = rxe_get_av(pkt);
 	struct in_addr *saddr = &av->sgid_addr._sockaddr_in.sin_addr;
 	struct in_addr *daddr = &av->dgid_addr._sockaddr_in.sin_addr;
 
@@ -297,11 +297,11 @@ static int prepare4(struct rxe_pkt_info *pkt, struct sk_buff *skb)
 	return 0;
 }
 
-static int prepare6(struct rxe_pkt_info *pkt, struct sk_buff *skb)
+static int prepare6(struct rxe_av *av, struct rxe_pkt_info *pkt,
+		    struct sk_buff *skb)
 {
 	struct rxe_qp *qp = pkt->qp;
 	struct dst_entry *dst;
-	struct rxe_av *av = rxe_get_av(pkt);
 	struct in6_addr *saddr = &av->sgid_addr._sockaddr_in6.sin6_addr;
 	struct in6_addr *daddr = &av->dgid_addr._sockaddr_in6.sin6_addr;
 
@@ -322,16 +322,17 @@ static int prepare6(struct rxe_pkt_info *pkt, struct sk_buff *skb)
 	return 0;
 }
 
-int rxe_prepare(struct rxe_pkt_info *pkt, struct sk_buff *skb)
+int rxe_prepare(struct rxe_av *av, struct rxe_pkt_info *pkt,
+		struct sk_buff *skb)
 {
 	int err = 0;
 
 	if (skb->protocol == htons(ETH_P_IP))
-		err = prepare4(pkt, skb);
+		err = prepare4(av, pkt, skb);
 	else if (skb->protocol == htons(ETH_P_IPV6))
-		err = prepare6(pkt, skb);
+		err = prepare6(av, pkt, skb);
 
-	if (ether_addr_equal(skb->dev->dev_addr, rxe_get_av(pkt)->dmac))
+	if (ether_addr_equal(skb->dev->dev_addr, av->dmac))
 		pkt->mask |= RXE_LOOPBACK_MASK;
 
 	return err;
diff --git a/drivers/infiniband/sw/rxe/rxe_req.c b/drivers/infiniband/sw/rxe/rxe_req.c
index 0c9d2af15f3d..891cf98c74a0 100644
--- a/drivers/infiniband/sw/rxe/rxe_req.c
+++ b/drivers/infiniband/sw/rxe/rxe_req.c
@@ -361,6 +361,7 @@ static inline int get_mtu(struct rxe_qp *qp)
 }
 
 static struct sk_buff *init_req_packet(struct rxe_qp *qp,
+				       struct rxe_av *av,
 				       struct rxe_send_wqe *wqe,
 				       int opcode, int payload,
 				       struct rxe_pkt_info *pkt)
@@ -368,7 +369,6 @@ static struct sk_buff *init_req_packet(struct rxe_qp *qp,
 	struct rxe_dev		*rxe = to_rdev(qp->ibqp.device);
 	struct sk_buff		*skb;
 	struct rxe_send_wr	*ibwr = &wqe->wr;
-	struct rxe_av		*av;
 	int			pad = (-payload) & 0x3;
 	int			paylen;
 	int			solicited;
@@ -378,21 +378,9 @@ static struct sk_buff *init_req_packet(struct rxe_qp *qp,
 
 	/* length from start of bth to end of icrc */
 	paylen = rxe_opcode[opcode].length + payload + pad + RXE_ICRC_SIZE;
-
-	/* pkt->hdr, port_num and mask are initialized in ifc layer */
-	pkt->rxe	= rxe;
-	pkt->opcode	= opcode;
-	pkt->qp		= qp;
-	pkt->psn	= qp->req.psn;
-	pkt->mask	= rxe_opcode[opcode].mask;
-	pkt->paylen	= paylen;
-	pkt->wqe	= wqe;
+	pkt->paylen = paylen;
 
 	/* init skb */
-	av = rxe_get_av(pkt);
-	if (!av)
-		return NULL;
-
 	skb = rxe_init_packet(rxe, av, paylen, pkt);
 	if (unlikely(!skb))
 		return NULL;
@@ -453,13 +441,13 @@ static struct sk_buff *init_req_packet(struct rxe_qp *qp,
 	return skb;
 }
 
-static int finish_packet(struct rxe_qp *qp, struct rxe_send_wqe *wqe,
-		       struct rxe_pkt_info *pkt, struct sk_buff *skb,
-		       int paylen)
+static int finish_packet(struct rxe_qp *qp, struct rxe_av *av,
+			 struct rxe_send_wqe *wqe, struct rxe_pkt_info *pkt,
+			 struct sk_buff *skb, int paylen)
 {
 	int err;
 
-	err = rxe_prepare(pkt, skb);
+	err = rxe_prepare(av, pkt, skb);
 	if (err)
 		return err;
 
@@ -614,6 +602,7 @@ static int rxe_do_local_ops(struct rxe_qp *qp, struct rxe_send_wqe *wqe)
 int rxe_requester(void *arg)
 {
 	struct rxe_qp *qp = (struct rxe_qp *)arg;
+	struct rxe_dev *rxe = to_rdev(qp->ibqp.device);
 	struct rxe_pkt_info pkt;
 	struct sk_buff *skb;
 	struct rxe_send_wqe *wqe;
@@ -625,6 +614,8 @@ int rxe_requester(void *arg)
 	struct rxe_send_wqe rollback_wqe;
 	u32 rollback_psn;
 	struct rxe_queue *q = qp->sq.queue;
+	struct rxe_ah *ah;
+	struct rxe_av *av;
 
 	rxe_add_ref(qp);
 
@@ -711,14 +702,28 @@ int rxe_requester(void *arg)
 		payload = mtu;
 	}
 
-	skb = init_req_packet(qp, wqe, opcode, payload, &pkt);
+	pkt.rxe = rxe;
+	pkt.opcode = opcode;
+	pkt.qp = qp;
+	pkt.psn = qp->req.psn;
+	pkt.mask = rxe_opcode[opcode].mask;
+	pkt.wqe = wqe;
+
+	av = rxe_get_av(&pkt, &ah);
+	if (unlikely(!av)) {
+		pr_err("qp#%d Failed no address vector\n", qp_num(qp));
+		wqe->status = IB_WC_LOC_QP_OP_ERR;
+		goto err_drop_ah;
+	}
+
+	skb = init_req_packet(qp, av, wqe, opcode, payload, &pkt);
 	if (unlikely(!skb)) {
 		pr_err("qp#%d Failed allocating skb\n", qp_num(qp));
 		wqe->status = IB_WC_LOC_QP_OP_ERR;
-		goto err;
+		goto err_drop_ah;
 	}
 
-	ret = finish_packet(qp, wqe, &pkt, skb, payload);
+	ret = finish_packet(qp, av, wqe, &pkt, skb, payload);
 	if (unlikely(ret)) {
 		pr_debug("qp#%d Error during finish packet\n", qp_num(qp));
 		if (ret == -EFAULT)
@@ -726,9 +731,12 @@ int rxe_requester(void *arg)
 		else
 			wqe->status = IB_WC_LOC_QP_OP_ERR;
 		kfree_skb(skb);
-		goto err;
+		goto err_drop_ah;
 	}
 
+	if (ah)
+		rxe_drop_ref(ah);
+
 	/*
 	 * To prevent a race on wqe access between requester and completer,
 	 * wqe members state and psn need to be set before calling
@@ -757,6 +765,9 @@ int rxe_requester(void *arg)
 
 	goto next_wqe;
 
+err_drop_ah:
+	if (ah)
+		rxe_drop_ref(ah);
 err:
 	wqe->state = wqe_state_error;
 	__rxe_do_task(&qp->comp.task);
diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
index e8f435fa6e4d..f589f4dde35c 100644
--- a/drivers/infiniband/sw/rxe/rxe_resp.c
+++ b/drivers/infiniband/sw/rxe/rxe_resp.c
@@ -632,7 +632,7 @@ static struct sk_buff *prepare_ack_packet(struct rxe_qp *qp,
 	if (ack->mask & RXE_ATMACK_MASK)
 		atmack_set_orig(ack, qp->resp.atomic_orig);
 
-	err = rxe_prepare(ack, skb);
+	err = rxe_prepare(&qp->pri_av, ack, skb);
 	if (err) {
 		kfree_skb(skb);
 		return NULL;
-- 
2.30.2



* [PATCH for-next v3 09/10] RDMA/rxe: Replace mr by rkey in responder resources
  2021-10-22 19:18 [PATCH for-next v3 00/10] Correct race conditions in rdma_rxe Bob Pearson
                   ` (7 preceding siblings ...)
  2021-10-22 19:18 ` [PATCH for-next v3 08/10] RDMA/rxe: Fix ref error in rxe_av.c Bob Pearson
@ 2021-10-22 19:18 ` Bob Pearson
  2021-10-22 19:18 ` [PATCH for-next v3 10/10] RDMA/rxe: Minor cleanup in rxe_pool.c Bob Pearson
  9 siblings, 0 replies; 14+ messages in thread
From: Bob Pearson @ 2021-10-22 19:18 UTC (permalink / raw)
  To: jgg, zyjzyj2000, linux-rdma; +Cc: Bob Pearson

Currently rxe saves a copy of the MR in the responder resources for
RDMA reads. Since the responder resources are never freed, only
overwritten when more are needed, this MR may hold its reference until
the QP is destroyed. This patch stores the rkey instead of the MR and,
on subsequent packets of a multi-packet read reply message, looks up
the MR from the rkey for each packet. This makes it possible for a
user to deregister an MR or unbind a MW on the fly and still get
correct behaviour.
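
The per-packet flow for the middle and last packets of a read reply
then becomes, roughly (a sketch of read_reply() in the diff below):

	/* revalidate the rkey and take an MR reference for just
	 * this packet
	 */
	mr = rxe_recheck_mr(qp, res->read.rkey);
	if (!mr)
		return RESPST_ERR_RKEY_VIOLATION;

	err = rxe_mr_copy(mr, res->read.va, payload_addr(&ack_pkt),
			  payload, RXE_FROM_MR_OBJ);
	rxe_drop_ref(mr);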

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_qp.c    |  10 +--
 drivers/infiniband/sw/rxe/rxe_resp.c  | 123 ++++++++++++++++++--------
 drivers/infiniband/sw/rxe/rxe_verbs.h |   1 -
 3 files changed, 87 insertions(+), 47 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_qp.c b/drivers/infiniband/sw/rxe/rxe_qp.c
index 975321812c87..5b9273b171af 100644
--- a/drivers/infiniband/sw/rxe/rxe_qp.c
+++ b/drivers/infiniband/sw/rxe/rxe_qp.c
@@ -135,12 +135,8 @@ static void free_rd_atomic_resources(struct rxe_qp *qp)
 
 void free_rd_atomic_resource(struct rxe_qp *qp, struct resp_res *res)
 {
-	if (res->type == RXE_ATOMIC_MASK) {
+	if (res->type == RXE_ATOMIC_MASK)
 		kfree_skb(res->atomic.skb);
-	} else if (res->type == RXE_READ_MASK) {
-		if (res->read.mr)
-			rxe_drop_ref(res->read.mr);
-	}
 	res->type = 0;
 }
 
@@ -816,10 +812,8 @@ static void rxe_qp_do_cleanup(struct work_struct *work)
 	if (qp->pd)
 		rxe_drop_ref(qp->pd);
 
-	if (qp->resp.mr) {
+	if (qp->resp.mr)
 		rxe_drop_ref(qp->resp.mr);
-		qp->resp.mr = NULL;
-	}
 
 	if (qp_type(qp) == IB_QPT_RC)
 		sk_dst_reset(qp->sk->sk);
diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
index f589f4dde35c..c776289842e5 100644
--- a/drivers/infiniband/sw/rxe/rxe_resp.c
+++ b/drivers/infiniband/sw/rxe/rxe_resp.c
@@ -641,6 +641,78 @@ static struct sk_buff *prepare_ack_packet(struct rxe_qp *qp,
 	return skb;
 }
 
+static struct resp_res *rxe_prepare_read_res(struct rxe_qp *qp,
+					struct rxe_pkt_info *pkt)
+{
+	struct resp_res *res;
+	u32 pkts;
+
+	res = &qp->resp.resources[qp->resp.res_head];
+	rxe_advance_resp_resource(qp);
+	free_rd_atomic_resource(qp, res);
+
+	res->type = RXE_READ_MASK;
+	res->replay = 0;
+	res->read.va = qp->resp.va + qp->resp.offset;
+	res->read.va_org = qp->resp.va + qp->resp.offset;
+	res->read.resid = qp->resp.resid;
+	res->read.length = qp->resp.resid;
+	res->read.rkey = qp->resp.rkey;
+
+	pkts = max_t(u32, (reth_len(pkt) + qp->mtu - 1)/qp->mtu, 1);
+	res->first_psn = pkt->psn;
+	res->cur_psn = pkt->psn;
+	res->last_psn = (pkt->psn + pkts - 1) & BTH_PSN_MASK;
+
+	res->state = rdatm_res_state_new;
+
+	return res;
+}
+
+/**
+ * rxe_recheck_mr - revalidate MR from rkey and get a reference
+ * @qp: the qp
+ * @rkey: the rkey
+ *
+ * This allows the MR to be invalidated or deregistered, or the MW,
+ * if one was used, to be invalidated or deallocated, between
+ * packets. It is assumed that the access permissions, if originally
+ * good, remain OK and that the mappings are unchanged.
+ *
+ * Return: mr on success else NULL
+ */
+static struct rxe_mr *rxe_recheck_mr(struct rxe_qp *qp, u32 rkey)
+{
+	struct rxe_dev *rxe = to_rdev(qp->ibqp.device);
+	struct rxe_mr *mr;
+	struct rxe_mw *mw;
+
+	if (rkey_is_mw(rkey)) {
+		mw = rxe_pool_get_index(&rxe->mw_pool, rkey >> 8);
+		if (!mw || mw->rkey != rkey)
+			return NULL;
+
+		if (mw->state != RXE_MW_STATE_VALID) {
+			rxe_drop_ref(mw);
+			return NULL;
+		}
+
+		mr = mw->mr;
+		rxe_drop_ref(mw);
+	} else {
+		mr = rxe_pool_get_index(&rxe->mr_pool, rkey >> 8);
+		if (!mr || mr->rkey != rkey)
+			return NULL;
+	}
+
+	if (mr->state != RXE_MR_STATE_VALID) {
+		rxe_drop_ref(mr);
+		return NULL;
+	}
+
+	return mr;
+}
+
 /* RDMA read response. If res is not NULL, then we have a current RDMA request
  * being processed or replayed.
  */
@@ -655,53 +727,26 @@ static enum resp_states read_reply(struct rxe_qp *qp,
 	int opcode;
 	int err;
 	struct resp_res *res = qp->resp.res;
+	struct rxe_mr *mr;
 
 	if (!res) {
-		/* This is the first time we process that request. Get a
-		 * resource
-		 */
-		res = &qp->resp.resources[qp->resp.res_head];
-
-		free_rd_atomic_resource(qp, res);
-		rxe_advance_resp_resource(qp);
-
-		res->type		= RXE_READ_MASK;
-		res->replay		= 0;
-
-		res->read.va		= qp->resp.va +
-					  qp->resp.offset;
-		res->read.va_org	= qp->resp.va +
-					  qp->resp.offset;
-
-		res->first_psn		= req_pkt->psn;
-
-		if (reth_len(req_pkt)) {
-			res->last_psn	= (req_pkt->psn +
-					   (reth_len(req_pkt) + mtu - 1) /
-					   mtu - 1) & BTH_PSN_MASK;
-		} else {
-			res->last_psn	= res->first_psn;
-		}
-		res->cur_psn		= req_pkt->psn;
-
-		res->read.resid		= qp->resp.resid;
-		res->read.length	= qp->resp.resid;
-		res->read.rkey		= qp->resp.rkey;
-
-		/* note res inherits the reference to mr from qp */
-		res->read.mr		= qp->resp.mr;
-		qp->resp.mr		= NULL;
-
-		qp->resp.res		= res;
-		res->state		= rdatm_res_state_new;
+		res = rxe_prepare_read_res(qp, req_pkt);
+		qp->resp.res = res;
 	}
 
 	if (res->state == rdatm_res_state_new) {
+		mr = qp->resp.mr;
+		qp->resp.mr = NULL;
+
 		if (res->read.resid <= mtu)
 			opcode = IB_OPCODE_RC_RDMA_READ_RESPONSE_ONLY;
 		else
 			opcode = IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST;
 	} else {
+		mr = rxe_recheck_mr(qp, res->read.rkey);
+		if (!mr)
+			return RESPST_ERR_RKEY_VIOLATION;
+
 		if (res->read.resid > mtu)
 			opcode = IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE;
 		else
@@ -717,10 +762,12 @@ static enum resp_states read_reply(struct rxe_qp *qp,
 	if (!skb)
 		return RESPST_ERR_RNR;
 
-	err = rxe_mr_copy(res->read.mr, res->read.va, payload_addr(&ack_pkt),
+	err = rxe_mr_copy(mr, res->read.va, payload_addr(&ack_pkt),
 			  payload, RXE_FROM_MR_OBJ);
 	if (err)
 		pr_err("Failed copying memory\n");
+	if (mr)
+		rxe_drop_ref(mr);
 
 	if (bth_pad(&ack_pkt)) {
 		u8 *pad = payload_addr(&ack_pkt) + payload;
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
index 4f1d7777f755..0cfbef7a36c9 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.h
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
@@ -157,7 +157,6 @@ struct resp_res {
 			struct sk_buff	*skb;
 		} atomic;
 		struct {
-			struct rxe_mr	*mr;
 			u64		va_org;
 			u32		rkey;
 			u32		length;
-- 
2.30.2



* [PATCH for-next v3 10/10] RDMA/rxe: Minor cleanup in rxe_pool.c
  2021-10-22 19:18 [PATCH for-next v3 00/10] Correct race conditions in rdma_rxe Bob Pearson
                   ` (8 preceding siblings ...)
  2021-10-22 19:18 ` [PATCH for-next v3 09/10] RDMA/rxe: Replace mr by rkey in responder resources Bob Pearson
@ 2021-10-22 19:18 ` Bob Pearson
  9 siblings, 0 replies; 14+ messages in thread
From: Bob Pearson @ 2021-10-22 19:18 UTC (permalink / raw)
  To: jgg, zyjzyj2000, linux-rdma; +Cc: Bob Pearson

Change pr_warn() to pr_debug() in the pool warning paths.
Check for the RXE_POOL_INDEX flag in rxe_pool_cleanup() before freeing
the index table.

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_pool.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
index 2cd4d8803a0e..82efc4626bf6 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.c
+++ b/drivers/infiniband/sw/rxe/rxe_pool.c
@@ -88,7 +88,7 @@ static int rxe_pool_init_index(struct rxe_pool *pool, u32 max, u32 min)
 	size_t size;
 
 	if ((max - min + 1) < pool->max_elem) {
-		pr_warn("not enough indices for max_elem\n");
+		pr_debug("not enough indices for max_elem\n");
 		err = -EINVAL;
 		goto out;
 	}
@@ -155,10 +155,11 @@ int rxe_pool_init(
 void rxe_pool_cleanup(struct rxe_pool *pool)
 {
 	if (atomic_read(&pool->num_elem) > 0)
-		pr_warn("%s pool destroyed with unfree'd elem\n",
+		pr_debug("%s pool destroyed with unfree'd elem\n",
 			pool->name);
 
-	kfree(pool->index.table);
+	if (pool->flags & RXE_POOL_INDEX)
+		kfree(pool->index.table);
 }
 
 /* should never fail because there are at least as many indices as
@@ -194,7 +195,7 @@ static int rxe_insert_index(struct rxe_pool *pool, struct rxe_pool_entry *new)
 		 * old object was not deleted from the pool index
 		 */
 		if (unlikely(elem == new || elem->index == new->index)) {
-			pr_warn("%s#%d rf=%d: already in pool\n",
+			pr_debug("%s#%d(%d): already in pool\n",
 					pool->name, new->index,
 					refcount_read(&new->refcnt));
 			return -EINVAL;
@@ -228,7 +229,9 @@ static int rxe_insert_key(struct rxe_pool *pool, struct rxe_pool_entry *new)
 			     pool->key.key_size);
 
 		if (cmp == 0) {
-			pr_warn("key already exists!\n");
+			pr_debug("%s#%d(%d): key already in pool\n",
+					pool->name, new->index,
+					refcount_read(&new->refcnt));
 			return -EINVAL;
 		}
 
-- 
2.30.2



* Re: [PATCH for-next v3 01/10] RDMA/rxe: Make rxe_alloc() take pool lock
  2021-10-22 19:18 ` [PATCH for-next v3 01/10] RDMA/rxe: Make rxe_alloc() take pool lock Bob Pearson
@ 2021-10-25 17:55   ` Jason Gunthorpe
  2021-10-25 19:09     ` Robert Pearson
  0 siblings, 1 reply; 14+ messages in thread
From: Jason Gunthorpe @ 2021-10-25 17:55 UTC (permalink / raw)
  To: Bob Pearson; +Cc: zyjzyj2000, linux-rdma

On Fri, Oct 22, 2021 at 02:18:16PM -0500, Bob Pearson wrote:
> In rxe_pool.c there are two separate pool APIs for creating a new object
> rxe_alloc() and rxe_alloc_locked(). Currently they are identical except
> for GFP_KERNEL vs GFP_ATOMIC. Make rxe_alloc() take the pool lock which
> is in line with the other APIs in the library and was the original intent.
> Keep kzalloc() outside of the lock to avoid using GFP_ATOMIC.
> Make similar change to rxe_add_to_pool. The code checking the pool limit
> is potentially racy without a lock. The lock around finishing the pool
> element provides a memory barrier before 'publishing' the object.
> 
> Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
> v3
>   Kept rxe_alloc and rxe_alloc_locked separate to allow use of GFP_KERNEL
>   in rxe_alloc
> 
>  drivers/infiniband/sw/rxe/rxe_pool.c | 18 +++++++++++++++---
>  1 file changed, 15 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
> index 7b4cb46edfd9..8138e459b6c1 100644
> +++ b/drivers/infiniband/sw/rxe/rxe_pool.c
> @@ -356,39 +356,51 @@ void *rxe_alloc(struct rxe_pool *pool)
>  {
>  	struct rxe_type_info *info = &rxe_type_info[pool->type];
>  	struct rxe_pool_entry *elem;
> +	unsigned long flags;
>  	u8 *obj;
>  
> +	write_lock_irqsave(&pool->pool_lock, flags);
>  	if (atomic_inc_return(&pool->num_elem) > pool->max_elem)
>  		goto out_cnt;
> +	write_unlock_irqrestore(&pool->pool_lock, flags);

Atomics don't need locks; this isn't racy as-is today.
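
For example, something like this (a sketch, not a patch) already
handles the limit check without ever taking the pool lock:

	if (atomic_inc_return(&pool->num_elem) > pool->max_elem) {
		/* atomic_inc_return() is race-free against concurrent
		 * allocators, so no lock is needed around the check
		 */
		atomic_dec(&pool->num_elem);
		return NULL;
	}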


>  	obj = kzalloc(info->size, GFP_KERNEL);
> -	if (!obj)
> -		goto out_cnt;
> +	if (!obj) {
> +		atomic_dec(&pool->num_elem);
> +		return NULL;
> +	}
>  
> +	write_lock_irqsave(&pool->pool_lock, flags);
>  	elem = (struct rxe_pool_entry *)(obj + info->elem_offset);
> -
>  	elem->pool = pool;
>  	kref_init(&elem->ref_cnt);
> +	write_unlock_irqrestore(&pool->pool_lock, flags);

And none of this needs a lock either

The 'release' operation should be provided by the code that writes a
non-thread-local pointer, not by the code that initializes a pointer
that never leaves the stack.
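
In other words, the barrier belongs at the publish point, e.g.
(hypothetical; 'slot' is illustrative, not an rxe structure):

	elem->pool = pool;
	kref_init(&elem->ref_cnt);
	/* release barrier when the pointer becomes visible to other
	 * threads, not while initializing it on the stack
	 */
	smp_store_release(&slot->entry, elem);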

Jason


* Re: [PATCH for-next v3 01/10] RDMA/rxe: Make rxe_alloc() take pool lock
  2021-10-25 17:55   ` Jason Gunthorpe
@ 2021-10-25 19:09     ` Robert Pearson
  2021-10-25 19:41       ` Jason Gunthorpe
  0 siblings, 1 reply; 14+ messages in thread
From: Robert Pearson @ 2021-10-25 19:09 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Zhu Yanjun, RDMA mailing list

Here (not related to the mcast issue) I am guilty of hiding my plans.
A little later I was going to move creating the metadata (RB tree and
index table) into the lock, and then later, after converting to
xarrays, replace the lock with the xa_lock, which you have to have
anyway because it's part of xa_alloc().
As you point out, it is the index/key tree change that is visible to
everyone. The races here are in trying to get the index or change the
RB tree. Perhaps I should merge this change with the later one where I
get rid of add/drop_index(). Or rearrange things, convert to xarrays,
and then worry about races.
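
Something like this is where I was headed (a hypothetical sketch of
the xarray conversion; the field names are illustrative):

	/* xa_alloc_cyclic() takes xa_lock internally, so the separate
	 * pool lock around index allocation goes away
	 */
	err = xa_alloc_cyclic(&pool->xa, &elem->index, elem,
			      XA_LIMIT(pool->min_index, pool->max_index),
			      &pool->next, GFP_KERNEL);
	if (err < 0)
		return err;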


* Re: [PATCH for-next v3 01/10] RDMA/rxe: Make rxe_alloc() take pool lock
  2021-10-25 19:09     ` Robert Pearson
@ 2021-10-25 19:41       ` Jason Gunthorpe
  0 siblings, 0 replies; 14+ messages in thread
From: Jason Gunthorpe @ 2021-10-25 19:41 UTC (permalink / raw)
  To: Robert Pearson; +Cc: Zhu Yanjun, RDMA mailing list

On Mon, Oct 25, 2021 at 02:09:36PM -0500, Robert Pearson wrote:
> Here (not related to the mcast issue) I am guilty of hiding my plans.
> A little later I was going to move creating the metadata (RB tree and
> index table) into the lock and then later after converting to xarrays
> replace the lock with the xa_lock which you have to have anyway
> because it's part of xa_alloc().
> As you point out it is the index/key tree change that is visible to
> everyone. The races here are in trying to get the index or change the
> RB tree. Perhaps I should merge this change with the later one where I
> get rid of add/drop_index(). Or, rearrange things and convert to
> xarrays and then worry about races.

Right, it should be later

Jason

