* [PATCH for-next 0/9] RDMA/rxe: Misc fixes and cleanups
@ 2023-07-21 20:50 Bob Pearson
  2023-07-21 20:50 ` [PATCH for-next 1/9] RDMA/rxe: Fix handling sleepable in rxe_pool.c Bob Pearson
                   ` (8 more replies)
  0 siblings, 9 replies; 38+ messages in thread
From: Bob Pearson @ 2023-07-21 20:50 UTC (permalink / raw)
  To: jgg, leon, zyjzyj2000, jhack, linux-rdma; +Cc: Bob Pearson

This patch set includes several miscellaneous fixes and cleanups
developed while getting the rxe driver to pass a 24 hour simulated
cable pull fail-over fail-back stress test using Lustre on a 256 node
system. These patches apply over for-next with three recently submitted
patches as prerequisites:

	RDMA/rxe: Fix incomplete state save in rxe_requester
	RDMA/core: Support drivers use of rcu locking
	RDMA/rxe: Enable rcu locking of indexed objects

Bob Pearson (9):
  RDMA/rxe: Fix handling sleepable in rxe_pool.c
  RDMA/rxe: Fix xarray locking in rxe_pool.c
  RDMA/rxe: Fix freeing busy objects
  RDMA/rxe: Fix delayed send packet handling
  RDMA/rxe: Optimize rxe_init_packet in rxe_net.c
  RDMA/rxe: Delete unused field elem->list
  RDMA/rxe: Add elem->valid field
  RDMA/rxe: Report leaked objects
  RDMA/rxe: Protect pending send packets

 drivers/infiniband/sw/rxe/rxe.c       |  26 ++++++
 drivers/infiniband/sw/rxe/rxe.h       |   3 +
 drivers/infiniband/sw/rxe/rxe_net.c   | 119 +++++++++++++++++---------
 drivers/infiniband/sw/rxe/rxe_pool.c  |  85 ++++++++++--------
 drivers/infiniband/sw/rxe/rxe_pool.h  |   9 +-
 drivers/infiniband/sw/rxe/rxe_qp.c    |   1 -
 drivers/infiniband/sw/rxe/rxe_verbs.c |  86 ++++++-------------
 7 files changed, 185 insertions(+), 144 deletions(-)


base-commit: b3d2b014b259ba758d72d7026685091bde1cf2d6
prerequisite-patch-id: c3994e7a93e37e0ce4f50e0c768f3c1a0059a02f
prerequisite-patch-id: 48e13f6ccb560fdeacbd20aaf6696782c23d1190
prerequisite-patch-id: da75fb8eaa863df840e7b392b5048fcc72b0bef3
-- 
2.39.2


* [PATCH for-next 1/9] RDMA/rxe: Fix handling sleepable in rxe_pool.c
  2023-07-21 20:50 [PATCH for-next 0/9] RDMA/rxe: Misc fixes and cleanups Bob Pearson
@ 2023-07-21 20:50 ` Bob Pearson
  2023-07-31 18:08   ` Jason Gunthorpe
  2023-07-21 20:50 ` [PATCH for-next 2/9] RDMA/rxe: Fix xarray locking " Bob Pearson
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 38+ messages in thread
From: Bob Pearson @ 2023-07-21 20:50 UTC (permalink / raw)
  To: jgg, leon, zyjzyj2000, jhack, linux-rdma; +Cc: Bob Pearson

Since the AH create and destroy verbs APIs can be called in
interrupt context, the driver must not sleep on those paths. This
was partially dealt with in previous fixes but some cases remain
where the code can still sleep.

This patch fixes this by extending __rxe_finalize() to take a
sleepable parameter and using it in the AH verbs calls. It also
fixes one call to rxe_cleanup() that did not use the sleepable API.
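
For illustration, the resulting calling pattern in the AH verbs looks
roughly like the sketch below; it is condensed from the diff that
follows and is not a complete listing:

	bool sleepable = init_attr->flags & RDMA_CREATE_AH_SLEEPABLE;

	err = rxe_add_to_pool_ah(&rxe->ah_pool, ah, sleepable);
	...
	/* pool code uses GFP_KERNEL when sleepable, GFP_ATOMIC otherwise */
	rxe_finalize_ah(ah, sleepable);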

Fixes: 215d0a755e1b ("RDMA/rxe: Stop lookup of partially built objects")
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_pool.c  |  5 +++--
 drivers/infiniband/sw/rxe/rxe_pool.h  |  5 +++--
 drivers/infiniband/sw/rxe/rxe_verbs.c | 11 ++++++-----
 3 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
index 6215c6de3a84..de0043b6d3f3 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.c
+++ b/drivers/infiniband/sw/rxe/rxe_pool.c
@@ -247,10 +247,11 @@ int __rxe_put(struct rxe_pool_elem *elem)
 	return kref_put(&elem->ref_cnt, rxe_elem_release);
 }
 
-void __rxe_finalize(struct rxe_pool_elem *elem)
+void __rxe_finalize(struct rxe_pool_elem *elem, bool sleepable)
 {
 	void *xa_ret;
+	gfp_t gfp_flags = sleepable ? GFP_KERNEL : GFP_ATOMIC;
 
-	xa_ret = xa_store(&elem->pool->xa, elem->index, elem, GFP_KERNEL);
+	xa_ret = xa_store(&elem->pool->xa, elem->index, elem, gfp_flags);
 	WARN_ON(xa_err(xa_ret));
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_pool.h b/drivers/infiniband/sw/rxe/rxe_pool.h
index ef2f6f88e254..d764c51ed6f7 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.h
+++ b/drivers/infiniband/sw/rxe/rxe_pool.h
@@ -77,7 +77,8 @@ int __rxe_cleanup(struct rxe_pool_elem *elem, bool sleepable);
 
 #define rxe_read(obj) kref_read(&(obj)->elem.ref_cnt)
 
-void __rxe_finalize(struct rxe_pool_elem *elem);
-#define rxe_finalize(obj) __rxe_finalize(&(obj)->elem)
+void __rxe_finalize(struct rxe_pool_elem *elem, bool sleepable);
+#define rxe_finalize(obj) __rxe_finalize(&(obj)->elem, true)
+#define rxe_finalize_ah(obj, sleepable) __rxe_finalize(&(obj)->elem, sleepable)
 
 #endif /* RXE_POOL_H */
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.c b/drivers/infiniband/sw/rxe/rxe_verbs.c
index b899924b81eb..5e93dbac17b3 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.c
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.c
@@ -265,6 +265,7 @@ static int rxe_create_ah(struct ib_ah *ibah,
 	struct rxe_dev *rxe = to_rdev(ibah->device);
 	struct rxe_ah *ah = to_rah(ibah);
 	struct rxe_create_ah_resp __user *uresp = NULL;
+	bool sleepable = init_attr->flags & RDMA_CREATE_AH_SLEEPABLE;
 	int err, cleanup_err;
 
 	if (udata) {
@@ -276,8 +277,7 @@ static int rxe_create_ah(struct ib_ah *ibah,
 		ah->is_user = false;
 	}
 
-	err = rxe_add_to_pool_ah(&rxe->ah_pool, ah,
-			init_attr->flags & RDMA_CREATE_AH_SLEEPABLE);
+	err = rxe_add_to_pool_ah(&rxe->ah_pool, ah, sleepable);
 	if (err) {
 		rxe_dbg_dev(rxe, "unable to create ah");
 		goto err_out;
@@ -307,12 +307,12 @@ static int rxe_create_ah(struct ib_ah *ibah,
 	}
 
 	rxe_init_av(init_attr->ah_attr, &ah->av);
-	rxe_finalize(ah);
+	rxe_finalize_ah(ah, sleepable);
 
 	return 0;
 
 err_cleanup:
-	cleanup_err = rxe_cleanup(ah);
+	cleanup_err = rxe_cleanup_ah(ah, sleepable);
 	if (cleanup_err)
 		rxe_err_ah(ah, "cleanup failed, err = %d", cleanup_err);
 err_out:
@@ -354,9 +354,10 @@ static int rxe_query_ah(struct ib_ah *ibah, struct rdma_ah_attr *attr)
 static int rxe_destroy_ah(struct ib_ah *ibah, u32 flags)
 {
 	struct rxe_ah *ah = to_rah(ibah);
+	bool sleepable = flags & RDMA_DESTROY_AH_SLEEPABLE;
 	int err;
 
-	err = rxe_cleanup_ah(ah, flags & RDMA_DESTROY_AH_SLEEPABLE);
+	err = rxe_cleanup_ah(ah, sleepable);
 	if (err)
 		rxe_err_ah(ah, "cleanup failed, err = %d", err);
 
-- 
2.39.2


* [PATCH for-next 2/9] RDMA/rxe: Fix xarray locking in rxe_pool.c
  2023-07-21 20:50 [PATCH for-next 0/9] RDMA/rxe: Misc fixes and cleanups Bob Pearson
  2023-07-21 20:50 ` [PATCH for-next 1/9] RDMA/rxe: Fix handling sleepable in rxe_pool.c Bob Pearson
@ 2023-07-21 20:50 ` Bob Pearson
  2023-07-21 20:50 ` [PATCH for-next 3/9] RDMA/rxe: Fix freeing busy objects Bob Pearson
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 38+ messages in thread
From: Bob Pearson @ 2023-07-21 20:50 UTC (permalink / raw)
  To: jgg, leon, zyjzyj2000, jhack, linux-rdma; +Cc: Bob Pearson

The kernel rdma verbs APIs are not documented to require that the
caller be in process context, and it is well known that the address
handle calls are sometimes made in interrupt context while the
other calls are made in process context. Replacing the plain xarray
spin_lock in the rxe_pool APIs with the irqsave variant makes most
of the verbs API calls safe in any context.

This patch replaces the xa locks with
xa_lock_irqsave()/xa_unlock_irqrestore().
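
For illustration, the locking pattern used in rxe_pool.c after this
change is roughly the sketch below, condensed from the diff that
follows:

	struct xarray *xa = &pool->xa;
	unsigned long flags;

	/* the irqsave variant is safe even if the caller runs in
	 * interrupt context
	 */
	xa_lock_irqsave(xa, flags);
	err = __xa_alloc_cyclic(xa, &elem->index, NULL, pool->limit,
				&pool->next, gfp_flags);
	xa_unlock_irqrestore(xa, flags);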

Fixes: 3225717f6dfa ("RDMA/rxe: Replace red-black trees by xarrays")
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_pool.c | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
index de0043b6d3f3..b88492f5f300 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.c
+++ b/drivers/infiniband/sw/rxe/rxe_pool.c
@@ -119,8 +119,10 @@ void rxe_pool_cleanup(struct rxe_pool *pool)
 int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem,
 				bool sleepable)
 {
-	int err;
+	struct xarray *xa = &pool->xa;
+	unsigned long flags;
 	gfp_t gfp_flags;
+	int err;
 
 	if (atomic_inc_return(&pool->num_elem) > pool->max_elem)
 		goto err_cnt;
@@ -138,8 +140,10 @@ int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem,
 
 	if (sleepable)
 		might_sleep();
-	err = xa_alloc_cyclic(&pool->xa, &elem->index, NULL, pool->limit,
+	xa_lock_irqsave(xa, flags);
+	err = __xa_alloc_cyclic(xa, &elem->index, NULL, pool->limit,
 			      &pool->next, gfp_flags);
+	xa_unlock_irqrestore(xa, flags);
 	if (err < 0)
 		goto err_cnt;
 
@@ -179,6 +183,7 @@ int __rxe_cleanup(struct rxe_pool_elem *elem, bool sleepable)
 	struct rxe_pool *pool = elem->pool;
 	struct xarray *xa = &pool->xa;
 	static int timeout = RXE_POOL_TIMEOUT;
+	unsigned long flags;
 	int ret, err = 0;
 	void *xa_ret;
 
@@ -188,7 +193,9 @@ int __rxe_cleanup(struct rxe_pool_elem *elem, bool sleepable)
 	/* erase xarray entry to prevent looking up
 	 * the pool elem from its index
 	 */
-	xa_ret = xa_erase(xa, elem->index);
+	xa_lock_irqsave(xa, flags);
+	xa_ret = __xa_erase(xa, elem->index);
+	xa_unlock_irqrestore(xa, flags);
 	WARN_ON(xa_err(xa_ret));
 
 	/* if this is the last call to rxe_put complete the
@@ -249,9 +256,13 @@ int __rxe_put(struct rxe_pool_elem *elem)
 
 void __rxe_finalize(struct rxe_pool_elem *elem, bool sleepable)
 {
-	void *xa_ret;
 	gfp_t gfp_flags = sleepable ? GFP_KERNEL : GFP_ATOMIC;
+	struct xarray *xa = &elem->pool->xa;
+	unsigned long flags;
+	void *xa_ret;
 
-	xa_ret = xa_store(&elem->pool->xa, elem->index, elem, gfp_flags);
+	xa_lock_irqsave(xa, flags);
+	xa_ret = __xa_store(xa, elem->index, elem, gfp_flags);
+	xa_unlock_irqrestore(xa, flags);
 	WARN_ON(xa_err(xa_ret));
 }
-- 
2.39.2


* [PATCH for-next 3/9] RDMA/rxe: Fix freeing busy objects
  2023-07-21 20:50 [PATCH for-next 0/9] RDMA/rxe: Misc fixes and cleanups Bob Pearson
  2023-07-21 20:50 ` [PATCH for-next 1/9] RDMA/rxe: Fix handling sleepable in rxe_pool.c Bob Pearson
  2023-07-21 20:50 ` [PATCH for-next 2/9] RDMA/rxe: Fix xarray locking " Bob Pearson
@ 2023-07-21 20:50 ` Bob Pearson
  2023-07-31 18:11   ` Jason Gunthorpe
  2023-07-21 20:50 ` [PATCH for-next 4/9] RDMA/rxe: Fix delayed send packet handling Bob Pearson
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 38+ messages in thread
From: Bob Pearson @ 2023-07-21 20:50 UTC (permalink / raw)
  To: jgg, leon, zyjzyj2000, jhack, linux-rdma; +Cc: Bob Pearson

Currently the rxe driver calls wait_for_completion_timeout() in
__rxe_cleanup() to wait for all references to the object to be
dropped before doing the final cleanup and returning to the
rdma-core destroy verb. If this does not happen within the timeout
interval it triggers a WARN_ON() and returns to the destroy verb
caller without any indication that there was an error. This is
incorrect.

A heavily loaded system can take an arbitrarily long time to finish
the work required before all the references are dropped; there is
no guaranteed bound on how long this takes.

Another frequent cause of these timeouts is ref counting bugs
introduced by changes to the driver, so it is still helpful to
report the timeouts when they occur.

This patch changes the return type of rxe_cleanup() to void and
updates its callers in rxe_verbs.c to match. This is better aligned
with rdma-core, which sometimes does not check the return value of
destroy verb calls and assumes they always succeed; in particular
this is the case for kernel QPs.

Since an error can no longer be returned, this patch puts the
completion timeout or busy wait code into a subroutine called
wait_until_done() and places the call in a loop with a 10 second
timeout that issues a WARN_ON() on each pass through the loop. This
is slow enough not to flood the logs.
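
For illustration, the sleepable side of the wait behaves roughly like
the sketch below; the posted diff inlines this logic in __rxe_cleanup()
rather than adding a separate helper:

	/* wait for the last reference to be dropped, warning every
	 * RXE_POOL_TIMEOUT (10 seconds) instead of giving up
	 */
	while (!completion_done(&elem->complete) && timeout) {
		ret = wait_for_completion_timeout(&elem->complete, timeout);
		WARN_ON(!ret);
	}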

Fixes: 215d0a755e1b ("RDMA/rxe: Stop lookup of partially built objects")
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_pool.c  | 39 ++++++-------
 drivers/infiniband/sw/rxe/rxe_pool.h  |  2 +-
 drivers/infiniband/sw/rxe/rxe_verbs.c | 79 +++++++--------------------
 3 files changed, 38 insertions(+), 82 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
index b88492f5f300..9877a798258a 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.c
+++ b/drivers/infiniband/sw/rxe/rxe_pool.c
@@ -6,7 +6,7 @@
 
 #include "rxe.h"
 
-#define RXE_POOL_TIMEOUT	(200)
+#define RXE_POOL_TIMEOUT	(10000)	/* 10 seconds */
 #define RXE_POOL_ALIGN		(16)
 
 static const struct rxe_type_info {
@@ -175,16 +175,17 @@ static void rxe_elem_release(struct kref *kref)
 {
 	struct rxe_pool_elem *elem = container_of(kref, typeof(*elem), ref_cnt);
 
-	complete(&elem->complete);
+	complete_all(&elem->complete);
 }
 
-int __rxe_cleanup(struct rxe_pool_elem *elem, bool sleepable)
+void __rxe_cleanup(struct rxe_pool_elem *elem, bool sleepable)
 {
 	struct rxe_pool *pool = elem->pool;
 	struct xarray *xa = &pool->xa;
-	static int timeout = RXE_POOL_TIMEOUT;
+	int timeout = RXE_POOL_TIMEOUT;
+	unsigned long until;
 	unsigned long flags;
-	int ret, err = 0;
+	int ret;
 	void *xa_ret;
 
 	if (sleepable)
@@ -209,39 +210,31 @@ int __rxe_cleanup(struct rxe_pool_elem *elem, bool sleepable)
 	 * return to rdma-core
 	 */
 	if (sleepable) {
-		if (!completion_done(&elem->complete) && timeout) {
+		while (!completion_done(&elem->complete) && timeout) {
 			ret = wait_for_completion_timeout(&elem->complete,
 					timeout);
-
-			/* Shouldn't happen. There are still references to
-			 * the object but, rather than deadlock, free the
-			 * object or pass back to rdma-core.
-			 */
-			if (WARN_ON(!ret))
-				err = -EINVAL;
+			WARN_ON(!ret);
 		}
 	} else {
-		unsigned long until = jiffies + timeout;
-
 		/* AH objects are unique in that the destroy_ah verb
 		 * can be called in atomic context. This delay
 		 * replaces the wait_for_completion call above
 		 * when the destroy_ah call is not sleepable
 		 */
-		while (!completion_done(&elem->complete) &&
-				time_before(jiffies, until))
-			mdelay(1);
-
-		if (WARN_ON(!completion_done(&elem->complete)))
-			err = -EINVAL;
+		while (!completion_done(&elem->complete) && timeout) {
+			until = jiffies + timeout;
+			while (!completion_done(&elem->complete) &&
+			       time_before(jiffies, until)) {
+				mdelay(10);
+			}
+			WARN_ON(!completion_done(&elem->complete));
+		}
 	}
 
 	if (pool->cleanup)
 		pool->cleanup(elem);
 
 	atomic_dec(&pool->num_elem);
-
-	return err;
 }
 
 int __rxe_get(struct rxe_pool_elem *elem)
diff --git a/drivers/infiniband/sw/rxe/rxe_pool.h b/drivers/infiniband/sw/rxe/rxe_pool.h
index d764c51ed6f7..efef4b05d1ed 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.h
+++ b/drivers/infiniband/sw/rxe/rxe_pool.h
@@ -71,7 +71,7 @@ int __rxe_get(struct rxe_pool_elem *elem);
 int __rxe_put(struct rxe_pool_elem *elem);
 #define rxe_put(obj) __rxe_put(&(obj)->elem)
 
-int __rxe_cleanup(struct rxe_pool_elem *elem, bool sleepable);
+void __rxe_cleanup(struct rxe_pool_elem *elem, bool sleepable);
 #define rxe_cleanup(obj) __rxe_cleanup(&(obj)->elem, true)
 #define rxe_cleanup_ah(obj, sleepable) __rxe_cleanup(&(obj)->elem, sleepable)
 
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.c b/drivers/infiniband/sw/rxe/rxe_verbs.c
index 5e93dbac17b3..67995c259916 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.c
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.c
@@ -218,11 +218,8 @@ static int rxe_alloc_ucontext(struct ib_ucontext *ibuc, struct ib_udata *udata)
 static void rxe_dealloc_ucontext(struct ib_ucontext *ibuc)
 {
 	struct rxe_ucontext *uc = to_ruc(ibuc);
-	int err;
 
-	err = rxe_cleanup(uc);
-	if (err)
-		rxe_err_uc(uc, "cleanup failed, err = %d", err);
+	rxe_cleanup(uc);
 }
 
 /* pd */
@@ -248,11 +245,8 @@ static int rxe_alloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
 static int rxe_dealloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
 {
 	struct rxe_pd *pd = to_rpd(ibpd);
-	int err;
 
-	err = rxe_cleanup(pd);
-	if (err)
-		rxe_err_pd(pd, "cleanup failed, err = %d", err);
+	rxe_cleanup(pd);
 
 	return 0;
 }
@@ -266,7 +260,7 @@ static int rxe_create_ah(struct ib_ah *ibah,
 	struct rxe_ah *ah = to_rah(ibah);
 	struct rxe_create_ah_resp __user *uresp = NULL;
 	bool sleepable = init_attr->flags & RDMA_CREATE_AH_SLEEPABLE;
-	int err, cleanup_err;
+	int err;
 
 	if (udata) {
 		/* test if new user provider */
@@ -312,9 +306,7 @@ static int rxe_create_ah(struct ib_ah *ibah,
 	return 0;
 
 err_cleanup:
-	cleanup_err = rxe_cleanup_ah(ah, sleepable);
-	if (cleanup_err)
-		rxe_err_ah(ah, "cleanup failed, err = %d", cleanup_err);
+	rxe_cleanup_ah(ah, sleepable);
 err_out:
 	rxe_err_ah(ah, "returned err = %d", err);
 	return err;
@@ -355,11 +347,8 @@ static int rxe_destroy_ah(struct ib_ah *ibah, u32 flags)
 {
 	struct rxe_ah *ah = to_rah(ibah);
 	bool sleepable = flags & RDMA_DESTROY_AH_SLEEPABLE;
-	int err;
 
-	err = rxe_cleanup_ah(ah, sleepable);
-	if (err)
-		rxe_err_ah(ah, "cleanup failed, err = %d", err);
+	rxe_cleanup_ah(ah, sleepable);
 
 	return 0;
 }
@@ -372,7 +361,7 @@ static int rxe_create_srq(struct ib_srq *ibsrq, struct ib_srq_init_attr *init,
 	struct rxe_pd *pd = to_rpd(ibsrq->pd);
 	struct rxe_srq *srq = to_rsrq(ibsrq);
 	struct rxe_create_srq_resp __user *uresp = NULL;
-	int err, cleanup_err;
+	int err;
 
 	if (udata) {
 		if (udata->outlen < sizeof(*uresp)) {
@@ -414,9 +403,7 @@ static int rxe_create_srq(struct ib_srq *ibsrq, struct ib_srq_init_attr *init,
 	return 0;
 
 err_cleanup:
-	cleanup_err = rxe_cleanup(srq);
-	if (cleanup_err)
-		rxe_err_srq(srq, "cleanup failed, err = %d", cleanup_err);
+	rxe_cleanup(srq);
 err_out:
 	rxe_err_dev(rxe, "returned err = %d", err);
 	return err;
@@ -515,11 +502,8 @@ static int rxe_post_srq_recv(struct ib_srq *ibsrq, const struct ib_recv_wr *wr,
 static int rxe_destroy_srq(struct ib_srq *ibsrq, struct ib_udata *udata)
 {
 	struct rxe_srq *srq = to_rsrq(ibsrq);
-	int err;
 
-	err = rxe_cleanup(srq);
-	if (err)
-		rxe_err_srq(srq, "cleanup failed, err = %d", err);
+	rxe_cleanup(srq);
 
 	return 0;
 }
@@ -532,7 +516,7 @@ static int rxe_create_qp(struct ib_qp *ibqp, struct ib_qp_init_attr *init,
 	struct rxe_pd *pd = to_rpd(ibqp->pd);
 	struct rxe_qp *qp = to_rqp(ibqp);
 	struct rxe_create_qp_resp __user *uresp = NULL;
-	int err, cleanup_err;
+	int err;
 
 	if (udata) {
 		if (udata->inlen) {
@@ -581,9 +565,7 @@ static int rxe_create_qp(struct ib_qp *ibqp, struct ib_qp_init_attr *init,
 	return 0;
 
 err_cleanup:
-	cleanup_err = rxe_cleanup(qp);
-	if (cleanup_err)
-		rxe_err_qp(qp, "cleanup failed, err = %d", cleanup_err);
+	rxe_cleanup(qp);
 err_out:
 	rxe_err_dev(rxe, "returned err = %d", err);
 	return err;
@@ -649,9 +631,7 @@ static int rxe_destroy_qp(struct ib_qp *ibqp, struct ib_udata *udata)
 		goto err_out;
 	}
 
-	err = rxe_cleanup(qp);
-	if (err)
-		rxe_err_qp(qp, "cleanup failed, err = %d", err);
+	rxe_cleanup(qp);
 
 	return 0;
 
@@ -1062,7 +1042,7 @@ static int rxe_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
 	struct rxe_dev *rxe = to_rdev(dev);
 	struct rxe_cq *cq = to_rcq(ibcq);
 	struct rxe_create_cq_resp __user *uresp = NULL;
-	int err, cleanup_err;
+	int err;
 
 	if (udata) {
 		if (udata->outlen < sizeof(*uresp)) {
@@ -1101,9 +1081,7 @@ static int rxe_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
 	return 0;
 
 err_cleanup:
-	cleanup_err = rxe_cleanup(cq);
-	if (cleanup_err)
-		rxe_err_cq(cq, "cleanup failed, err = %d", cleanup_err);
+	rxe_cleanup(cq);
 err_out:
 	rxe_err_dev(rxe, "returned err = %d", err);
 	return err;
@@ -1208,9 +1186,7 @@ static int rxe_destroy_cq(struct ib_cq *ibcq, struct ib_udata *udata)
 		goto err_out;
 	}
 
-	err = rxe_cleanup(cq);
-	if (err)
-		rxe_err_cq(cq, "cleanup failed, err = %d", err);
+	rxe_cleanup(cq);
 
 	return 0;
 
@@ -1258,7 +1234,7 @@ static struct ib_mr *rxe_reg_user_mr(struct ib_pd *ibpd, u64 start,
 	struct rxe_dev *rxe = to_rdev(ibpd->device);
 	struct rxe_pd *pd = to_rpd(ibpd);
 	struct rxe_mr *mr;
-	int err, cleanup_err;
+	int err;
 
 	if (access & ~RXE_ACCESS_SUPPORTED_MR) {
 		rxe_err_pd(pd, "access = %#x not supported (%#x)", access,
@@ -1290,9 +1266,7 @@ static struct ib_mr *rxe_reg_user_mr(struct ib_pd *ibpd, u64 start,
 	return &mr->ibmr;
 
 err_cleanup:
-	cleanup_err = rxe_cleanup(mr);
-	if (cleanup_err)
-		rxe_err_mr(mr, "cleanup failed, err = %d", cleanup_err);
+	rxe_cleanup(mr);
 err_free:
 	kfree(mr);
 	rxe_err_pd(pd, "returned err = %d", err);
@@ -1339,7 +1313,7 @@ static struct ib_mr *rxe_alloc_mr(struct ib_pd *ibpd, enum ib_mr_type mr_type,
 	struct rxe_dev *rxe = to_rdev(ibpd->device);
 	struct rxe_pd *pd = to_rpd(ibpd);
 	struct rxe_mr *mr;
-	int err, cleanup_err;
+	int err;
 
 	if (mr_type != IB_MR_TYPE_MEM_REG) {
 		err = -EINVAL;
@@ -1370,9 +1344,7 @@ static struct ib_mr *rxe_alloc_mr(struct ib_pd *ibpd, enum ib_mr_type mr_type,
 	return &mr->ibmr;
 
 err_cleanup:
-	cleanup_err = rxe_cleanup(mr);
-	if (cleanup_err)
-		rxe_err_mr(mr, "cleanup failed, err = %d", err);
+	rxe_cleanup(mr);
 err_free:
 	kfree(mr);
 err_out:
@@ -1383,25 +1355,16 @@ static struct ib_mr *rxe_alloc_mr(struct ib_pd *ibpd, enum ib_mr_type mr_type,
 static int rxe_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata)
 {
 	struct rxe_mr *mr = to_rmr(ibmr);
-	int err, cleanup_err;
 
 	/* See IBA 10.6.7.2.6 */
 	if (atomic_read(&mr->num_mw) > 0) {
-		err = -EINVAL;
-		rxe_dbg_mr(mr, "mr has mw's bound");
-		goto err_out;
+		rxe_err_mr(mr, "mr has mw's bound");
+		return -EINVAL;
 	}
 
-	cleanup_err = rxe_cleanup(mr);
-	if (cleanup_err)
-		rxe_err_mr(mr, "cleanup failed, err = %d", cleanup_err);
-
+	rxe_cleanup(mr);
 	kfree_rcu(mr, elem.rcu);
 	return 0;
-
-err_out:
-	rxe_err_mr(mr, "returned err = %d", err);
-	return err;
 }
 
 static ssize_t parent_show(struct device *device,
-- 
2.39.2


* [PATCH for-next 4/9] RDMA/rxe: Fix delayed send packet handling
  2023-07-21 20:50 [PATCH for-next 0/9] RDMA/rxe: Misc fixes and cleanups Bob Pearson
                   ` (2 preceding siblings ...)
  2023-07-21 20:50 ` [PATCH for-next 3/9] RDMA/rxe: Fix freeing busy objects Bob Pearson
@ 2023-07-21 20:50 ` Bob Pearson
  2023-07-23 13:03   ` Zhu Yanjun
  2023-07-31 18:12   ` Jason Gunthorpe
  2023-07-21 20:50 ` [PATCH for-next 5/9] RDMA/rxe: Optimize rxe_init_packet in rxe_net.c Bob Pearson
                   ` (4 subsequent siblings)
  8 siblings, 2 replies; 38+ messages in thread
From: Bob Pearson @ 2023-07-21 20:50 UTC (permalink / raw)
  To: jgg, leon, zyjzyj2000, jhack, linux-rdma; +Cc: Bob Pearson

In cable pull testing some NICs can hold a send packet long enough
for the ULP protocol stacks to destroy the qp and for the cleanup
routines to time out waiting for all qp references to be released.
When the NIC driver finally frees the skb the qp pointer is no
longer valid and dereferencing it causes a seg fault in
rxe_skb_tx_dtor().

This patch passes the qp index instead of the qp pointer to the skb
destructor callback. The callback then looks up the qp from the
index; if the qp has already been destroyed the lookup returns NULL
and the qp is never referenced, avoiding the seg fault.
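
For illustration, the index is stashed and recovered roughly as in the
sketch below, condensed from the diff that follows:

	/* rxe_send(): store the qp index, not the qp pointer */
	sk->sk_user_data = (void *)(uintptr_t)qp->elem.index;

	/* rxe_skb_tx_dtor(): recover the index and look the qp up again;
	 * a NULL result means the qp has already been destroyed
	 */
	index = (unsigned int)(uintptr_t)skb->sk->sk_user_data;
	qp = rxe_pool_get_index(&rxe->qp_pool, index);
	if (!qp)
		goto out_put_ibdev;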

Fixes: 8700e3e7c485 ("Soft RoCE driver")
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_net.c | 87 ++++++++++++++++++++++-------
 drivers/infiniband/sw/rxe/rxe_qp.c  |  1 -
 2 files changed, 67 insertions(+), 21 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
index cd59666158b1..10e4a752ff7c 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.c
+++ b/drivers/infiniband/sw/rxe/rxe_net.c
@@ -131,19 +131,28 @@ static struct dst_entry *rxe_find_route(struct net_device *ndev,
 	return dst;
 }
 
+static struct rxe_dev *get_rxe_from_skb(struct sk_buff *skb)
+{
+	struct rxe_dev *rxe;
+	struct net_device *ndev = skb->dev;
+
+	rxe = rxe_get_dev_from_net(ndev);
+	if (!rxe && is_vlan_dev(ndev))
+		rxe = rxe_get_dev_from_net(vlan_dev_real_dev(ndev));
+
+	return rxe;
+}
+
 static int rxe_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
 {
 	struct udphdr *udph;
 	struct rxe_dev *rxe;
-	struct net_device *ndev = skb->dev;
 	struct rxe_pkt_info *pkt = SKB_TO_PKT(skb);
 
 	/* takes a reference on rxe->ib_dev
 	 * drop when skb is freed
 	 */
-	rxe = rxe_get_dev_from_net(ndev);
-	if (!rxe && is_vlan_dev(ndev))
-		rxe = rxe_get_dev_from_net(vlan_dev_real_dev(ndev));
+	rxe = get_rxe_from_skb(skb);
 	if (!rxe)
 		goto drop;
 
@@ -345,46 +354,84 @@ int rxe_prepare(struct rxe_av *av, struct rxe_pkt_info *pkt,
 
 static void rxe_skb_tx_dtor(struct sk_buff *skb)
 {
-	struct sock *sk = skb->sk;
-	struct rxe_qp *qp = sk->sk_user_data;
-	int skb_out = atomic_dec_return(&qp->skb_out);
+	struct rxe_dev *rxe;
+	unsigned int index;
+	struct rxe_qp *qp;
+	int skb_out;
+
+	/* takes a ref on ib device if success */
+	rxe = get_rxe_from_skb(skb);
+	if (!rxe)
+		goto out;
+
+	/* recover source qp index from sk->sk_user_data
+	 * free the reference taken in rxe_send
+	 */
+	index = (int)(uintptr_t)skb->sk->sk_user_data;
+	sock_put(skb->sk);
 
+	/* lookup qp from index, takes a ref on success */
+	qp = rxe_pool_get_index(&rxe->qp_pool, index);
+	if (!qp)
+		goto out_put_ibdev;
+
+	skb_out = atomic_dec_return(&qp->skb_out);
 	if (unlikely(qp->need_req_skb &&
 		     skb_out < RXE_INFLIGHT_SKBS_PER_QP_LOW))
 		rxe_sched_task(&qp->req.task);
 
 	rxe_put(qp);
+out_put_ibdev:
+	ib_device_put(&rxe->ib_dev);
+out:
+	return;
 }
 
 static int rxe_send(struct sk_buff *skb, struct rxe_pkt_info *pkt)
 {
+	struct rxe_qp *qp = pkt->qp;
+	struct sock *sk;
 	int err;
 
-	skb->destructor = rxe_skb_tx_dtor;
-	skb->sk = pkt->qp->sk->sk;
+	/* qp can be destroyed while this packet is waiting on
+	 * the tx queue. So need to protect sk.
+	 */
+	sk = qp->sk->sk;
+	sock_hold(sk);
+	skb->sk = sk;
 
-	rxe_get(pkt->qp);
 	atomic_inc(&pkt->qp->skb_out);
 
+	sk->sk_user_data = (void *)(long)qp->elem.index;
+	skb->destructor = rxe_skb_tx_dtor;
+
 	if (skb->protocol == htons(ETH_P_IP)) {
-		err = ip_local_out(dev_net(skb_dst(skb)->dev), skb->sk, skb);
+		err = ip_local_out(dev_net(skb_dst(skb)->dev), sk, skb);
 	} else if (skb->protocol == htons(ETH_P_IPV6)) {
-		err = ip6_local_out(dev_net(skb_dst(skb)->dev), skb->sk, skb);
+		err = ip6_local_out(dev_net(skb_dst(skb)->dev), sk, skb);
 	} else {
-		rxe_dbg_qp(pkt->qp, "Unknown layer 3 protocol: %d\n",
-				skb->protocol);
-		atomic_dec(&pkt->qp->skb_out);
-		rxe_put(pkt->qp);
-		kfree_skb(skb);
-		return -EINVAL;
+		rxe_dbg_qp(qp, "Unknown layer 3 protocol: %d",
+			   skb->protocol);
+		err = -EINVAL;
+		goto err_not_sent;
 	}
 
+	/* IP consumed the packet, the destructor will handle cleanup */
 	if (unlikely(net_xmit_eval(err))) {
-		rxe_dbg_qp(pkt->qp, "error sending packet: %d\n", err);
-		return -EAGAIN;
+		rxe_dbg_qp(qp, "Error sending packet: %d", err);
+		err = -EAGAIN;
+		goto err_out;
 	}
 
 	return 0;
+
+err_not_sent:
+	skb->destructor = NULL;
+	atomic_dec(&pkt->qp->skb_out);
+	kfree_skb(skb);
+	sock_put(sk);
+err_out:
+	return err;
 }
 
 /* fix up a send packet to match the packets
diff --git a/drivers/infiniband/sw/rxe/rxe_qp.c b/drivers/infiniband/sw/rxe/rxe_qp.c
index a569b111a9d2..dcbf71031453 100644
--- a/drivers/infiniband/sw/rxe/rxe_qp.c
+++ b/drivers/infiniband/sw/rxe/rxe_qp.c
@@ -194,7 +194,6 @@ static int rxe_qp_init_req(struct rxe_dev *rxe, struct rxe_qp *qp,
 	err = sock_create_kern(&init_net, AF_INET, SOCK_DGRAM, 0, &qp->sk);
 	if (err < 0)
 		return err;
-	qp->sk->sk->sk_user_data = qp;
 
 	/* pick a source UDP port number for this QP based on
 	 * the source QPN. this spreads traffic for different QPs
-- 
2.39.2


* [PATCH for-next 5/9] RDMA/rxe: Optimize rxe_init_packet in rxe_net.c
  2023-07-21 20:50 [PATCH for-next 0/9] RDMA/rxe: Misc fixes and cleanups Bob Pearson
                   ` (3 preceding siblings ...)
  2023-07-21 20:50 ` [PATCH for-next 4/9] RDMA/rxe: Fix delayed send packet handling Bob Pearson
@ 2023-07-21 20:50 ` Bob Pearson
  2023-07-21 20:50 ` [PATCH for-next 6/9] RDMA/rxe: Delete unused field elem->list Bob Pearson
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 38+ messages in thread
From: Bob Pearson @ 2023-07-21 20:50 UTC (permalink / raw)
  To: jgg, leon, zyjzyj2000, jhack, linux-rdma; +Cc: Bob Pearson

This patch replaces the code that looks up the ndev from the
ib_device and port number of an rxe device with the ndev saved when
the rxe device was created. Since the rxe driver does not support
multi-port operation or path migration, the ndev is constant for
the life of the rxe device. The ndev lookup in rxe_init_packet()
has a significant performance impact under heavy load and can
consume a noticeable fraction of the overall CPU time.
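
For illustration, the resulting fast path is roughly the sketch below,
condensed from the diff that follows:

	struct net_device *ndev = rxe->ndev;	/* saved at device creation */

	skb = alloc_skb(paylen + hdr_len + LL_RESERVED_SPACE(ndev), GFP_ATOMIC);
	if (unlikely(!skb))
		goto out;
	skb_reserve(skb, hdr_len + LL_RESERVED_SPACE(ndev));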

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_net.c | 25 ++++---------------------
 1 file changed, 4 insertions(+), 21 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
index 10e4a752ff7c..c1b2eaf82334 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.c
+++ b/drivers/infiniband/sw/rxe/rxe_net.c
@@ -508,14 +508,9 @@ struct sk_buff *rxe_init_packet(struct rxe_dev *rxe, struct rxe_av *av,
 {
 	unsigned int hdr_len;
 	struct sk_buff *skb = NULL;
-	struct net_device *ndev;
-	const struct ib_gid_attr *attr;
+	struct net_device *ndev = rxe->ndev;
 	const int port_num = 1;
 
-	attr = rdma_get_gid_attr(&rxe->ib_dev, port_num, av->grh.sgid_index);
-	if (IS_ERR(attr))
-		return NULL;
-
 	if (av->network_type == RXE_NETWORK_TYPE_IPV4)
 		hdr_len = ETH_HLEN + sizeof(struct udphdr) +
 			sizeof(struct iphdr);
@@ -523,25 +518,14 @@ struct sk_buff *rxe_init_packet(struct rxe_dev *rxe, struct rxe_av *av,
 		hdr_len = ETH_HLEN + sizeof(struct udphdr) +
 			sizeof(struct ipv6hdr);
 
-	rcu_read_lock();
-	ndev = rdma_read_gid_attr_ndev_rcu(attr);
-	if (IS_ERR(ndev)) {
-		rcu_read_unlock();
-		goto out;
-	}
-	skb = alloc_skb(paylen + hdr_len + LL_RESERVED_SPACE(ndev),
-			GFP_ATOMIC);
-
-	if (unlikely(!skb)) {
-		rcu_read_unlock();
+	skb = alloc_skb(paylen + hdr_len + LL_RESERVED_SPACE(ndev), GFP_ATOMIC);
+	if (unlikely(!skb))
 		goto out;
-	}
 
 	skb_reserve(skb, hdr_len + LL_RESERVED_SPACE(ndev));
 
 	/* FIXME: hold reference to this netdev until life of this skb. */
-	skb->dev	= ndev;
-	rcu_read_unlock();
+	skb->dev = ndev;
 
 	if (av->network_type == RXE_NETWORK_TYPE_IPV4)
 		skb->protocol = htons(ETH_P_IP);
@@ -554,7 +538,6 @@ struct sk_buff *rxe_init_packet(struct rxe_dev *rxe, struct rxe_av *av,
 	pkt->mask	|= RXE_GRH_MASK;
 
 out:
-	rdma_put_gid_attr(attr);
 	return skb;
 }
 
-- 
2.39.2


* [PATCH for-next 6/9] RDMA/rxe: Delete unused field elem->list
  2023-07-21 20:50 [PATCH for-next 0/9] RDMA/rxe: Misc fixes and cleanups Bob Pearson
                   ` (4 preceding siblings ...)
  2023-07-21 20:50 ` [PATCH for-next 5/9] RDMA/rxe: Optimize rxe_init_packet in rxe_net.c Bob Pearson
@ 2023-07-21 20:50 ` Bob Pearson
  2023-07-21 20:50 ` [PATCH for-next 7/9] RDMA/rxe: Add elem->valid field Bob Pearson
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 38+ messages in thread
From: Bob Pearson @ 2023-07-21 20:50 UTC (permalink / raw)
  To: jgg, leon, zyjzyj2000, jhack, linux-rdma; +Cc: Bob Pearson

The list field in struct rxe_pool_elem is not used. This patch removes it.

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_pool.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_pool.h b/drivers/infiniband/sw/rxe/rxe_pool.h
index efef4b05d1ed..daef1ed72722 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.h
+++ b/drivers/infiniband/sw/rxe/rxe_pool.h
@@ -23,7 +23,6 @@ struct rxe_pool_elem {
 	struct rxe_pool		*pool;
 	void			*obj;
 	struct kref		ref_cnt;
-	struct list_head	list;
 	struct completion	complete;
 	struct rcu_head		rcu;
 	u32			index;
-- 
2.39.2


* [PATCH for-next 7/9] RDMA/rxe: Add elem->valid field
  2023-07-21 20:50 [PATCH for-next 0/9] RDMA/rxe: Misc fixes and cleanups Bob Pearson
                   ` (5 preceding siblings ...)
  2023-07-21 20:50 ` [PATCH for-next 6/9] RDMA/rxe: Delete unused field elem->list Bob Pearson
@ 2023-07-21 20:50 ` Bob Pearson
  2023-07-31 18:15   ` Jason Gunthorpe
  2023-07-21 20:50 ` [PATCH for-next 8/9] RDMA/rxe: Report leaked objects Bob Pearson
  2023-07-21 20:50 ` [PATCH for-next 9/9] RDMA/rxe: Protect pending send packets Bob Pearson
  8 siblings, 1 reply; 38+ messages in thread
From: Bob Pearson @ 2023-07-21 20:50 UTC (permalink / raw)
  To: jgg, leon, zyjzyj2000, jhack, linux-rdma; +Cc: Bob Pearson

Currently the rxe driver stores NULL in the pool xarray to indicate
that the pool element must not be looked up from its index. A side
effect is that the pool elements cannot be iterated over during
driver cleanup. This patch adds a new valid field to struct
rxe_pool_elem as an alternative way to accomplish the same thing
while keeping the element stored in the xarray.
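
For illustration, the valid field pairs a release store on the write
side with an acquire load on the read side, roughly as in the sketch
below, condensed from the diff that follows:

	/* rxe_finalize() / rxe_cleanup(): publish or retract the element */
	smp_store_release(&elem->valid, 1);
	...
	smp_store_release(&elem->valid, 0);

	/* rxe_pool_get_index(): only hand out elements marked valid */
	valid = elem ? smp_load_acquire(&elem->valid) : 0;
	if (valid && kref_get_unless_zero(&elem->ref_cnt))
		obj = elem->obj;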

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_pool.c | 32 ++++++++++++----------------
 drivers/infiniband/sw/rxe/rxe_pool.h |  1 +
 2 files changed, 15 insertions(+), 18 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
index 9877a798258a..cb812bd969c6 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.c
+++ b/drivers/infiniband/sw/rxe/rxe_pool.c
@@ -141,7 +141,7 @@ int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem,
 	if (sleepable)
 		might_sleep();
 	xa_lock_irqsave(xa, flags);
-	err = __xa_alloc_cyclic(xa, &elem->index, NULL, pool->limit,
+	err = __xa_alloc_cyclic(xa, &elem->index, elem, pool->limit,
 			      &pool->next, gfp_flags);
 	xa_unlock_irqrestore(xa, flags);
 	if (err < 0)
@@ -158,11 +158,14 @@ void *rxe_pool_get_index(struct rxe_pool *pool, u32 index)
 {
 	struct rxe_pool_elem *elem;
 	struct xarray *xa = &pool->xa;
+	int valid;
 	void *obj;
 
 	rcu_read_lock();
 	elem = xa_load(xa, index);
-	if (elem && kref_get_unless_zero(&elem->ref_cnt))
+	/* get current value of elem->valid */
+	valid = elem ? smp_load_acquire(&elem->valid) : 0;
+	if (valid && kref_get_unless_zero(&elem->ref_cnt))
 		obj = elem->obj;
 	else
 		obj = NULL;
@@ -191,13 +194,8 @@ void __rxe_cleanup(struct rxe_pool_elem *elem, bool sleepable)
 	if (sleepable)
 		might_sleep();
 
-	/* erase xarray entry to prevent looking up
-	 * the pool elem from its index
-	 */
-	xa_lock_irqsave(xa, flags);
-	xa_ret = __xa_erase(xa, elem->index);
-	xa_unlock_irqrestore(xa, flags);
-	WARN_ON(xa_err(xa_ret));
+	/* prevent looking up element from index */
+	smp_store_release(&elem->valid, 0);
 
 	/* if this is the last call to rxe_put complete the
 	 * object. It is safe to touch obj->elem after this since
@@ -231,6 +229,11 @@ void __rxe_cleanup(struct rxe_pool_elem *elem, bool sleepable)
 		}
 	}
 
+	xa_lock_irqsave(xa, flags);
+	xa_ret = __xa_erase(xa, elem->index);
+	xa_unlock_irqrestore(xa, flags);
+	WARN_ON(xa_err(xa_ret));
+
 	if (pool->cleanup)
 		pool->cleanup(elem);
 
@@ -249,13 +252,6 @@ int __rxe_put(struct rxe_pool_elem *elem)
 
 void __rxe_finalize(struct rxe_pool_elem *elem, bool sleepable)
 {
-	gfp_t gfp_flags = sleepable ? GFP_KERNEL : GFP_ATOMIC;
-	struct xarray *xa = &elem->pool->xa;
-	unsigned long flags;
-	void *xa_ret;
-
-	xa_lock_irqsave(xa, flags);
-	xa_ret = __xa_store(xa, elem->index, elem, gfp_flags);
-	xa_unlock_irqrestore(xa, flags);
-	WARN_ON(xa_err(xa_ret));
+	/* allow looking up element from index */
+	smp_store_release(&elem->valid, 1);
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_pool.h b/drivers/infiniband/sw/rxe/rxe_pool.h
index daef1ed72722..5cca33c5cfb5 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.h
+++ b/drivers/infiniband/sw/rxe/rxe_pool.h
@@ -26,6 +26,7 @@ struct rxe_pool_elem {
 	struct completion	complete;
 	struct rcu_head		rcu;
 	u32			index;
+	int			valid;
 };
 
 struct rxe_pool {
-- 
2.39.2


* [PATCH for-next 8/9] RDMA/rxe: Report leaked objects
  2023-07-21 20:50 [PATCH for-next 0/9] RDMA/rxe: Misc fixes and cleanups Bob Pearson
                   ` (6 preceding siblings ...)
  2023-07-21 20:50 ` [PATCH for-next 7/9] RDMA/rxe: Add elem->valid field Bob Pearson
@ 2023-07-21 20:50 ` Bob Pearson
  2023-07-31 18:15   ` Jason Gunthorpe
  2023-07-21 20:50 ` [PATCH for-next 9/9] RDMA/rxe: Protect pending send packets Bob Pearson
  8 siblings, 1 reply; 38+ messages in thread
From: Bob Pearson @ 2023-07-21 20:50 UTC (permalink / raw)
  To: jgg, leon, zyjzyj2000, jhack, linux-rdma; +Cc: Bob Pearson

If an error leaves objects unfreed, this patch reports a detailed
list of the leaked objects before the module exits, instead of a
single WARN_ON.

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_pool.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
index cb812bd969c6..3249c2741491 100644
--- a/drivers/infiniband/sw/rxe/rxe_pool.c
+++ b/drivers/infiniband/sw/rxe/rxe_pool.c
@@ -113,7 +113,17 @@ void rxe_pool_init(struct rxe_dev *rxe, struct rxe_pool *pool,
 
 void rxe_pool_cleanup(struct rxe_pool *pool)
 {
-	WARN_ON(!xa_empty(&pool->xa));
+	unsigned long index;
+	struct rxe_pool_elem *elem;
+
+	xa_lock(&pool->xa);
+	xa_for_each(&pool->xa, index, elem) {
+		rxe_err_dev(pool->rxe, "%s#%d: Leaked", pool->name,
+				elem->index);
+	}
+	xa_unlock(&pool->xa);
+
+	xa_destroy(&pool->xa);
 }
 
 int __rxe_add_to_pool(struct rxe_pool *pool, struct rxe_pool_elem *elem,
-- 
2.39.2


* [PATCH for-next 9/9] RDMA/rxe: Protect pending send packets
  2023-07-21 20:50 [PATCH for-next 0/9] RDMA/rxe: Misc fixes and cleanups Bob Pearson
                   ` (7 preceding siblings ...)
  2023-07-21 20:50 ` [PATCH for-next 8/9] RDMA/rxe: Report leaked objects Bob Pearson
@ 2023-07-21 20:50 ` Bob Pearson
  2023-07-31 18:17   ` Jason Gunthorpe
  8 siblings, 1 reply; 38+ messages in thread
From: Bob Pearson @ 2023-07-21 20:50 UTC (permalink / raw)
  To: jgg, leon, zyjzyj2000, jhack, linux-rdma; +Cc: Bob Pearson

Network interruptions may cause long delays in the processing of
send packets, during which time the rxe driver may be unloaded.
When such a packet is finally freed its destructor, which lives in
the rxe driver, is called and causes a seg fault. This has been
observed in cable pull fail-over/fail-back testing.

This patch takes a reference on the module for each pending send
packet to protect against this.
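
For illustration, the module reference follows each pending packet
roughly as in the sketch below, condensed from the diff that follows:

	/* rxe_send(): one reference per in-flight skb */
	rxe_module_get();

	/* rxe_skb_tx_dtor(): dropped when the skb is finally freed */
	rxe_module_put();

	/* rxe_module_exit(): drop the initial reference and wait for the
	 * last destructor to run before unloading
	 */
	rxe_module_put();
	wait_for_completion(&rxe_module.complete);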

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe.c     | 26 ++++++++++++++++++++++++++
 drivers/infiniband/sw/rxe/rxe.h     |  3 +++
 drivers/infiniband/sw/rxe/rxe_net.c |  7 +++++++
 3 files changed, 36 insertions(+)

diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
index 54c723a6edda..6b55c595f8f8 100644
--- a/drivers/infiniband/sw/rxe/rxe.c
+++ b/drivers/infiniband/sw/rxe/rxe.c
@@ -208,10 +208,33 @@ static struct rdma_link_ops rxe_link_ops = {
 	.newlink = rxe_newlink,
 };
 
+static struct rxe_module {
+	struct kref		ref_cnt;
+	struct completion	complete;
+} rxe_module;
+
+static void rxe_module_release(struct kref *kref)
+{
+	complete(&rxe_module.complete);
+}
+
+int rxe_module_get(void)
+{
+	return kref_get_unless_zero(&rxe_module.ref_cnt);
+}
+
+int rxe_module_put(void)
+{
+	return kref_put(&rxe_module.ref_cnt, rxe_module_release);
+}
+
 static int __init rxe_module_init(void)
 {
 	int err;
 
+	kref_init(&rxe_module.ref_cnt);
+	init_completion(&rxe_module.complete);
+
 	err = rxe_alloc_wq();
 	if (err)
 		return err;
@@ -229,6 +252,9 @@ static int __init rxe_module_init(void)
 
 static void __exit rxe_module_exit(void)
 {
+	rxe_module_put();
+	wait_for_completion(&rxe_module.complete);
+
 	rdma_link_unregister(&rxe_link_ops);
 	ib_unregister_driver(RDMA_DRIVER_RXE);
 	rxe_net_exit();
diff --git a/drivers/infiniband/sw/rxe/rxe.h b/drivers/infiniband/sw/rxe/rxe.h
index d33dd6cf83d3..077e3ad8f39a 100644
--- a/drivers/infiniband/sw/rxe/rxe.h
+++ b/drivers/infiniband/sw/rxe/rxe.h
@@ -158,4 +158,7 @@ void rxe_port_up(struct rxe_dev *rxe);
 void rxe_port_down(struct rxe_dev *rxe);
 void rxe_set_port_state(struct rxe_dev *rxe);
 
+int rxe_module_get(void);
+int rxe_module_put(void);
+
 #endif /* RXE_H */
diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
index c1b2eaf82334..0e447420a441 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.c
+++ b/drivers/infiniband/sw/rxe/rxe_net.c
@@ -384,6 +384,7 @@ static void rxe_skb_tx_dtor(struct sk_buff *skb)
 out_put_ibdev:
 	ib_device_put(&rxe->ib_dev);
 out:
+	rxe_module_put();
 	return;
 }
 
@@ -400,6 +401,12 @@ static int rxe_send(struct sk_buff *skb, struct rxe_pkt_info *pkt)
 	sock_hold(sk);
 	skb->sk = sk;
 
+	/* the module may be potentially removed while this packet
+	 * is waiting on the tx queue causing seg faults. So need
+	 * to protect the module.
+	 */
+	rxe_module_get();
+
 	atomic_inc(&pkt->qp->skb_out);
 
 	sk->sk_user_data = (void *)(long)qp->elem.index;
-- 
2.39.2


* Re: [PATCH for-next 4/9] RDMA/rxe: Fix delayed send packet handling
  2023-07-21 20:50 ` [PATCH for-next 4/9] RDMA/rxe: Fix delayed send packet handling Bob Pearson
@ 2023-07-23 13:03   ` Zhu Yanjun
  2023-07-23 17:24     ` Bob Pearson
  2023-07-24 17:59     ` Leon Romanovsky
  2023-07-31 18:12   ` Jason Gunthorpe
  1 sibling, 2 replies; 38+ messages in thread
From: Zhu Yanjun @ 2023-07-23 13:03 UTC (permalink / raw)
  To: Bob Pearson; +Cc: jgg, leon, jhack, linux-rdma

On Sat, Jul 22, 2023 at 4:51 AM Bob Pearson <rpearsonhpe@gmail.com> wrote:
>
> In cable pull testing some NICs can hold a send packet long enough
> to allow ulp protocol stacks to destroy the qp and the cleanup
> routines to timeout waiting for all qp references to be released.
> When the NIC driver finally frees the SKB the qp pointer is no longer
> valid and causes a seg fault in rxe_skb_tx_dtor().

If a NIC holds on to a packet, the packets sent after it will also
be held by that NIC, so the QP will not work well anyway. This is a
serious problem, and it should be fixed in that kind of NIC. We
should not fix it in RXE.

Also, a QP exists for at least several seconds. Can a packet really
be held in a NIC for such a long time? Does this problem exist in
the real world, or is this an imagined scenario?

I cannot imagine this kind of scenario. Please, Jason or Leon, also check this.

Thanks.
Zhu Yanjun
>
> This patch passes the qp index instead of the qp to the skb destructor
> callback function. The call back is required to lookup the qp from the
> index and if it has been destroyed the lookup will return NULL and the
> qp will not be referenced avoiding the seg fault.
>
> Fixes: 8700e3e7c485 ("Soft RoCE driver")
> Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
> ---
>  drivers/infiniband/sw/rxe/rxe_net.c | 87 ++++++++++++++++++++++-------
>  drivers/infiniband/sw/rxe/rxe_qp.c  |  1 -
>  2 files changed, 67 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
> index cd59666158b1..10e4a752ff7c 100644
> --- a/drivers/infiniband/sw/rxe/rxe_net.c
> +++ b/drivers/infiniband/sw/rxe/rxe_net.c
> @@ -131,19 +131,28 @@ static struct dst_entry *rxe_find_route(struct net_device *ndev,
>         return dst;
>  }
>
> +static struct rxe_dev *get_rxe_from_skb(struct sk_buff *skb)
> +{
> +       struct rxe_dev *rxe;
> +       struct net_device *ndev = skb->dev;
> +
> +       rxe = rxe_get_dev_from_net(ndev);
> +       if (!rxe && is_vlan_dev(ndev))
> +               rxe = rxe_get_dev_from_net(vlan_dev_real_dev(ndev));
> +
> +       return rxe;
> +}
> +
>  static int rxe_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
>  {
>         struct udphdr *udph;
>         struct rxe_dev *rxe;
> -       struct net_device *ndev = skb->dev;
>         struct rxe_pkt_info *pkt = SKB_TO_PKT(skb);
>
>         /* takes a reference on rxe->ib_dev
>          * drop when skb is freed
>          */
> -       rxe = rxe_get_dev_from_net(ndev);
> -       if (!rxe && is_vlan_dev(ndev))
> -               rxe = rxe_get_dev_from_net(vlan_dev_real_dev(ndev));
> +       rxe = get_rxe_from_skb(skb);
>         if (!rxe)
>                 goto drop;
>
> @@ -345,46 +354,84 @@ int rxe_prepare(struct rxe_av *av, struct rxe_pkt_info *pkt,
>
>  static void rxe_skb_tx_dtor(struct sk_buff *skb)
>  {
> -       struct sock *sk = skb->sk;
> -       struct rxe_qp *qp = sk->sk_user_data;
> -       int skb_out = atomic_dec_return(&qp->skb_out);
> +       struct rxe_dev *rxe;
> +       unsigned int index;
> +       struct rxe_qp *qp;
> +       int skb_out;
> +
> +       /* takes a ref on ib device if success */
> +       rxe = get_rxe_from_skb(skb);
> +       if (!rxe)
> +               goto out;
> +
> +       /* recover source qp index from sk->sk_user_data
> +        * free the reference taken in rxe_send
> +        */
> +       index = (int)(uintptr_t)skb->sk->sk_user_data;
> +       sock_put(skb->sk);
>
> +       /* lookup qp from index, takes a ref on success */
> +       qp = rxe_pool_get_index(&rxe->qp_pool, index);
> +       if (!qp)
> +               goto out_put_ibdev;
> +
> +       skb_out = atomic_dec_return(&qp->skb_out);
>         if (unlikely(qp->need_req_skb &&
>                      skb_out < RXE_INFLIGHT_SKBS_PER_QP_LOW))
>                 rxe_sched_task(&qp->req.task);
>
>         rxe_put(qp);
> +out_put_ibdev:
> +       ib_device_put(&rxe->ib_dev);
> +out:
> +       return;
>  }
>
>  static int rxe_send(struct sk_buff *skb, struct rxe_pkt_info *pkt)
>  {
> +       struct rxe_qp *qp = pkt->qp;
> +       struct sock *sk;
>         int err;
>
> -       skb->destructor = rxe_skb_tx_dtor;
> -       skb->sk = pkt->qp->sk->sk;
> +       /* qp can be destroyed while this packet is waiting on
> +        * the tx queue. So need to protect sk.
> +        */
> +       sk = qp->sk->sk;
> +       sock_hold(sk);
> +       skb->sk = sk;
>
> -       rxe_get(pkt->qp);
>         atomic_inc(&pkt->qp->skb_out);
>
> +       sk->sk_user_data = (void *)(long)qp->elem.index;
> +       skb->destructor = rxe_skb_tx_dtor;
> +
>         if (skb->protocol == htons(ETH_P_IP)) {
> -               err = ip_local_out(dev_net(skb_dst(skb)->dev), skb->sk, skb);
> +               err = ip_local_out(dev_net(skb_dst(skb)->dev), sk, skb);
>         } else if (skb->protocol == htons(ETH_P_IPV6)) {
> -               err = ip6_local_out(dev_net(skb_dst(skb)->dev), skb->sk, skb);
> +               err = ip6_local_out(dev_net(skb_dst(skb)->dev), sk, skb);
>         } else {
> -               rxe_dbg_qp(pkt->qp, "Unknown layer 3 protocol: %d\n",
> -                               skb->protocol);
> -               atomic_dec(&pkt->qp->skb_out);
> -               rxe_put(pkt->qp);
> -               kfree_skb(skb);
> -               return -EINVAL;
> +               rxe_dbg_qp(qp, "Unknown layer 3 protocol: %d",
> +                          skb->protocol);
> +               err = -EINVAL;
> +               goto err_not_sent;
>         }
>
> +       /* IP consumed the packet, the destructor will handle cleanup */
>         if (unlikely(net_xmit_eval(err))) {
> -               rxe_dbg_qp(pkt->qp, "error sending packet: %d\n", err);
> -               return -EAGAIN;
> +               rxe_dbg_qp(qp, "Error sending packet: %d", err);
> +               err = -EAGAIN;
> +               goto err_out;
>         }
>
>         return 0;
> +
> +err_not_sent:
> +       skb->destructor = NULL;
> +       atomic_dec(&pkt->qp->skb_out);
> +       kfree_skb(skb);
> +       sock_put(sk);
> +err_out:
> +       return err;
>  }
>
>  /* fix up a send packet to match the packets
> diff --git a/drivers/infiniband/sw/rxe/rxe_qp.c b/drivers/infiniband/sw/rxe/rxe_qp.c
> index a569b111a9d2..dcbf71031453 100644
> --- a/drivers/infiniband/sw/rxe/rxe_qp.c
> +++ b/drivers/infiniband/sw/rxe/rxe_qp.c
> @@ -194,7 +194,6 @@ static int rxe_qp_init_req(struct rxe_dev *rxe, struct rxe_qp *qp,
>         err = sock_create_kern(&init_net, AF_INET, SOCK_DGRAM, 0, &qp->sk);
>         if (err < 0)
>                 return err;
> -       qp->sk->sk->sk_user_data = qp;
>
>         /* pick a source UDP port number for this QP based on
>          * the source QPN. this spreads traffic for different QPs
> --
> 2.39.2
>

* Re: [PATCH for-next 4/9] RDMA/rxe: Fix delayed send packet handling
  2023-07-23 13:03   ` Zhu Yanjun
@ 2023-07-23 17:24     ` Bob Pearson
  2023-07-24 17:59     ` Leon Romanovsky
  1 sibling, 0 replies; 38+ messages in thread
From: Bob Pearson @ 2023-07-23 17:24 UTC (permalink / raw)
  To: Zhu Yanjun; +Cc: jgg, leon, jhack, linux-rdma

On 7/23/23 08:03, Zhu Yanjun wrote:
> On Sat, Jul 22, 2023 at 4:51 AM Bob Pearson <rpearsonhpe@gmail.com> wrote:
>>
>> In cable pull testing some NICs can hold a send packet long enough
>> to allow ulp protocol stacks to destroy the qp and the cleanup
>> routines to timeout waiting for all qp references to be released.
>> When the NIC driver finally frees the SKB the qp pointer is no longer
>> valid and causes a seg fault in rxe_skb_tx_dtor().
> 
> If a packet is held in some NICs, the later packets of this packet
> will also be held by this NIC. So this will cause QP not to work well.
> This is a serious problem. And the problem should be fixed in this
> kind of NICs. We should not fix it in RXE.
> 
> And a QP exists for at least several seconds. A packet can be held in
> NIC for such a long time? This problem does exist in the real world,
> or you imagine this scenario?

This was a real bug observed in a 24 hour stress test in which we were running
IOR traffic over Lustre on a 256 node cluster. On top of that we ran a test that
randomly disabled links for 10-20 seconds and then re-enabled them. Lustre
automatically detects the down links and reroutes the storage traffic to a new RC
connection, and when the link recovers it re-establishes RC connections on the
link. The failover is triggered by Lustre timing out on an IO operation, after
which it tears down the QPs. When the link comes back up the tx queue is finally
processed by the NIC and the skbs are freed on the sender, which calls the rxe
skb destructor function, which in turn attempts to update a counter in the QP
state. During the test we would observe 2-3 kernel panics from referencing the
destroyed qp pointer. We made this change to avoid delaying the qp destroy call
for an arbitrarily long time; instead we take a reference on the struct sock and
use the qpn instead of the qp pointer. A later patch in this series fixes a
related but different problem that can occur at the end of the test if someone
attempts to rmmod the rxe driver while a packet is still pending. In that case it
is the code in the destructor function that is gone, not the QP, but it still
needs to be protected with ref counts.

Bob
> 
> I can not imagine this kind of scenario. Please Jason or Leon also check this.
> 
> Thanks.
> Zhu Yanjun
>>
>> This patch passes the qp index instead of the qp to the skb destructor
>> callback function. The call back is required to lookup the qp from the
>> index and if it has been destroyed the lookup will return NULL and the
>> qp will not be referenced avoiding the seg fault.
>>
>> Fixes: 8700e3e7c485 ("Soft RoCE driver")
>> Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
>> ---
>>  drivers/infiniband/sw/rxe/rxe_net.c | 87 ++++++++++++++++++++++-------
>>  drivers/infiniband/sw/rxe/rxe_qp.c  |  1 -
>>  2 files changed, 67 insertions(+), 21 deletions(-)
>>
>> diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
>> index cd59666158b1..10e4a752ff7c 100644
>> --- a/drivers/infiniband/sw/rxe/rxe_net.c
>> +++ b/drivers/infiniband/sw/rxe/rxe_net.c
>> @@ -131,19 +131,28 @@ static struct dst_entry *rxe_find_route(struct net_device *ndev,
>>         return dst;
>>  }
>>
>> +static struct rxe_dev *get_rxe_from_skb(struct sk_buff *skb)
>> +{
>> +       struct rxe_dev *rxe;
>> +       struct net_device *ndev = skb->dev;
>> +
>> +       rxe = rxe_get_dev_from_net(ndev);
>> +       if (!rxe && is_vlan_dev(ndev))
>> +               rxe = rxe_get_dev_from_net(vlan_dev_real_dev(ndev));
>> +
>> +       return rxe;
>> +}
>> +
>>  static int rxe_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
>>  {
>>         struct udphdr *udph;
>>         struct rxe_dev *rxe;
>> -       struct net_device *ndev = skb->dev;
>>         struct rxe_pkt_info *pkt = SKB_TO_PKT(skb);
>>
>>         /* takes a reference on rxe->ib_dev
>>          * drop when skb is freed
>>          */
>> -       rxe = rxe_get_dev_from_net(ndev);
>> -       if (!rxe && is_vlan_dev(ndev))
>> -               rxe = rxe_get_dev_from_net(vlan_dev_real_dev(ndev));
>> +       rxe = get_rxe_from_skb(skb);
>>         if (!rxe)
>>                 goto drop;
>>
>> @@ -345,46 +354,84 @@ int rxe_prepare(struct rxe_av *av, struct rxe_pkt_info *pkt,
>>
>>  static void rxe_skb_tx_dtor(struct sk_buff *skb)
>>  {
>> -       struct sock *sk = skb->sk;
>> -       struct rxe_qp *qp = sk->sk_user_data;
>> -       int skb_out = atomic_dec_return(&qp->skb_out);
>> +       struct rxe_dev *rxe;
>> +       unsigned int index;
>> +       struct rxe_qp *qp;
>> +       int skb_out;
>> +
>> +       /* takes a ref on ib device if success */
>> +       rxe = get_rxe_from_skb(skb);
>> +       if (!rxe)
>> +               goto out;
>> +
>> +       /* recover source qp index from sk->sk_user_data
>> +        * free the reference taken in rxe_send
>> +        */
>> +       index = (int)(uintptr_t)skb->sk->sk_user_data;
>> +       sock_put(skb->sk);
>>
>> +       /* lookup qp from index, takes a ref on success */
>> +       qp = rxe_pool_get_index(&rxe->qp_pool, index);
>> +       if (!qp)
>> +               goto out_put_ibdev;
>> +
>> +       skb_out = atomic_dec_return(&qp->skb_out);
>>         if (unlikely(qp->need_req_skb &&
>>                      skb_out < RXE_INFLIGHT_SKBS_PER_QP_LOW))
>>                 rxe_sched_task(&qp->req.task);
>>
>>         rxe_put(qp);
>> +out_put_ibdev:
>> +       ib_device_put(&rxe->ib_dev);
>> +out:
>> +       return;
>>  }
>>
>>  static int rxe_send(struct sk_buff *skb, struct rxe_pkt_info *pkt)
>>  {
>> +       struct rxe_qp *qp = pkt->qp;
>> +       struct sock *sk;
>>         int err;
>>
>> -       skb->destructor = rxe_skb_tx_dtor;
>> -       skb->sk = pkt->qp->sk->sk;
>> +       /* qp can be destroyed while this packet is waiting on
>> +        * the tx queue. So need to protect sk.
>> +        */
>> +       sk = qp->sk->sk;
>> +       sock_hold(sk);
>> +       skb->sk = sk;
>>
>> -       rxe_get(pkt->qp);
>>         atomic_inc(&pkt->qp->skb_out);
>>
>> +       sk->sk_user_data = (void *)(long)qp->elem.index;
>> +       skb->destructor = rxe_skb_tx_dtor;
>> +
>>         if (skb->protocol == htons(ETH_P_IP)) {
>> -               err = ip_local_out(dev_net(skb_dst(skb)->dev), skb->sk, skb);
>> +               err = ip_local_out(dev_net(skb_dst(skb)->dev), sk, skb);
>>         } else if (skb->protocol == htons(ETH_P_IPV6)) {
>> -               err = ip6_local_out(dev_net(skb_dst(skb)->dev), skb->sk, skb);
>> +               err = ip6_local_out(dev_net(skb_dst(skb)->dev), sk, skb);
>>         } else {
>> -               rxe_dbg_qp(pkt->qp, "Unknown layer 3 protocol: %d\n",
>> -                               skb->protocol);
>> -               atomic_dec(&pkt->qp->skb_out);
>> -               rxe_put(pkt->qp);
>> -               kfree_skb(skb);
>> -               return -EINVAL;
>> +               rxe_dbg_qp(qp, "Unknown layer 3 protocol: %d",
>> +                          skb->protocol);
>> +               err = -EINVAL;
>> +               goto err_not_sent;
>>         }
>>
>> +       /* IP consumed the packet, the destructor will handle cleanup */
>>         if (unlikely(net_xmit_eval(err))) {
>> -               rxe_dbg_qp(pkt->qp, "error sending packet: %d\n", err);
>> -               return -EAGAIN;
>> +               rxe_dbg_qp(qp, "Error sending packet: %d", err);
>> +               err = -EAGAIN;
>> +               goto err_out;
>>         }
>>
>>         return 0;
>> +
>> +err_not_sent:
>> +       skb->destructor = NULL;
>> +       atomic_dec(&pkt->qp->skb_out);
>> +       kfree_skb(skb);
>> +       sock_put(sk);
>> +err_out:
>> +       return err;
>>  }
>>
>>  /* fix up a send packet to match the packets
>> diff --git a/drivers/infiniband/sw/rxe/rxe_qp.c b/drivers/infiniband/sw/rxe/rxe_qp.c
>> index a569b111a9d2..dcbf71031453 100644
>> --- a/drivers/infiniband/sw/rxe/rxe_qp.c
>> +++ b/drivers/infiniband/sw/rxe/rxe_qp.c
>> @@ -194,7 +194,6 @@ static int rxe_qp_init_req(struct rxe_dev *rxe, struct rxe_qp *qp,
>>         err = sock_create_kern(&init_net, AF_INET, SOCK_DGRAM, 0, &qp->sk);
>>         if (err < 0)
>>                 return err;
>> -       qp->sk->sk->sk_user_data = qp;
>>
>>         /* pick a source UDP port number for this QP based on
>>          * the source QPN. this spreads traffic for different QPs
>> --
>> 2.39.2
>>


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH for-next 4/9] RDMA/rxe: Fix delayed send packet handling
  2023-07-23 13:03   ` Zhu Yanjun
  2023-07-23 17:24     ` Bob Pearson
@ 2023-07-24 17:59     ` Leon Romanovsky
  2023-07-24 18:26       ` Bob Pearson
  1 sibling, 1 reply; 38+ messages in thread
From: Leon Romanovsky @ 2023-07-24 17:59 UTC (permalink / raw)
  To: Zhu Yanjun; +Cc: Bob Pearson, jgg, jhack, linux-rdma

On Sun, Jul 23, 2023 at 09:03:34PM +0800, Zhu Yanjun wrote:
> On Sat, Jul 22, 2023 at 4:51 AM Bob Pearson <rpearsonhpe@gmail.com> wrote:
> >
> > In cable pull testing some NICs can hold a send packet long enough
> > to allow ulp protocol stacks to destroy the qp and the cleanup
> > routines to timeout waiting for all qp references to be released.
> > When the NIC driver finally frees the SKB the qp pointer is no longer
> > valid and causes a seg fault in rxe_skb_tx_dtor().
> 
> If a NIC holds on to a packet, it will also hold the packets sent
> after it, so the QP will not work well. This is a serious problem,
> and it should be fixed in that kind of NIC. We should not fix it in
> RXE.
> 
> Also, a QP exists for at least several seconds. Can a packet really
> be held in a NIC for that long? Does this problem exist in the real
> world, or is it an imagined scenario?
> 
> I cannot imagine this kind of scenario. Please, Jason or Leon, also check this.

I stopped following RXE changes a long time ago; they don't make any sense to me.
Even this patch, which moves an existing IB reference count from one place to
another and does it for every SKB, is beyond my understanding.

Thanks

> 
> Thanks.
> Zhu Yanjun
> >
> > This patch passes the qp index instead of the qp to the skb destructor
> > callback function. The call back is required to lookup the qp from the
> > index and if it has been destroyed the lookup will return NULL and the
> > qp will not be referenced avoiding the seg fault.
> >
> > Fixes: 8700e3e7c485 ("Soft RoCE driver")
> > Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
> > ---
> >  drivers/infiniband/sw/rxe/rxe_net.c | 87 ++++++++++++++++++++++-------
> >  drivers/infiniband/sw/rxe/rxe_qp.c  |  1 -
> >  2 files changed, 67 insertions(+), 21 deletions(-)
> >
> > diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
> > index cd59666158b1..10e4a752ff7c 100644
> > --- a/drivers/infiniband/sw/rxe/rxe_net.c
> > +++ b/drivers/infiniband/sw/rxe/rxe_net.c
> > @@ -131,19 +131,28 @@ static struct dst_entry *rxe_find_route(struct net_device *ndev,
> >         return dst;
> >  }
> >
> > +static struct rxe_dev *get_rxe_from_skb(struct sk_buff *skb)
> > +{
> > +       struct rxe_dev *rxe;
> > +       struct net_device *ndev = skb->dev;
> > +
> > +       rxe = rxe_get_dev_from_net(ndev);
> > +       if (!rxe && is_vlan_dev(ndev))
> > +               rxe = rxe_get_dev_from_net(vlan_dev_real_dev(ndev));
> > +
> > +       return rxe;
> > +}
> > +
> >  static int rxe_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
> >  {
> >         struct udphdr *udph;
> >         struct rxe_dev *rxe;
> > -       struct net_device *ndev = skb->dev;
> >         struct rxe_pkt_info *pkt = SKB_TO_PKT(skb);
> >
> >         /* takes a reference on rxe->ib_dev
> >          * drop when skb is freed
> >          */
> > -       rxe = rxe_get_dev_from_net(ndev);
> > -       if (!rxe && is_vlan_dev(ndev))
> > -               rxe = rxe_get_dev_from_net(vlan_dev_real_dev(ndev));
> > +       rxe = get_rxe_from_skb(skb);
> >         if (!rxe)
> >                 goto drop;
> >
> > @@ -345,46 +354,84 @@ int rxe_prepare(struct rxe_av *av, struct rxe_pkt_info *pkt,
> >
> >  static void rxe_skb_tx_dtor(struct sk_buff *skb)
> >  {
> > -       struct sock *sk = skb->sk;
> > -       struct rxe_qp *qp = sk->sk_user_data;
> > -       int skb_out = atomic_dec_return(&qp->skb_out);
> > +       struct rxe_dev *rxe;
> > +       unsigned int index;
> > +       struct rxe_qp *qp;
> > +       int skb_out;
> > +
> > +       /* takes a ref on ib device if success */
> > +       rxe = get_rxe_from_skb(skb);
> > +       if (!rxe)
> > +               goto out;
> > +
> > +       /* recover source qp index from sk->sk_user_data
> > +        * free the reference taken in rxe_send
> > +        */
> > +       index = (int)(uintptr_t)skb->sk->sk_user_data;
> > +       sock_put(skb->sk);
> >
> > +       /* lookup qp from index, takes a ref on success */
> > +       qp = rxe_pool_get_index(&rxe->qp_pool, index);
> > +       if (!qp)
> > +               goto out_put_ibdev;
> > +
> > +       skb_out = atomic_dec_return(&qp->skb_out);
> >         if (unlikely(qp->need_req_skb &&
> >                      skb_out < RXE_INFLIGHT_SKBS_PER_QP_LOW))
> >                 rxe_sched_task(&qp->req.task);
> >
> >         rxe_put(qp);
> > +out_put_ibdev:
> > +       ib_device_put(&rxe->ib_dev);
> > +out:
> > +       return;
> >  }
> >
> >  static int rxe_send(struct sk_buff *skb, struct rxe_pkt_info *pkt)
> >  {
> > +       struct rxe_qp *qp = pkt->qp;
> > +       struct sock *sk;
> >         int err;
> >
> > -       skb->destructor = rxe_skb_tx_dtor;
> > -       skb->sk = pkt->qp->sk->sk;
> > +       /* qp can be destroyed while this packet is waiting on
> > +        * the tx queue. So need to protect sk.
> > +        */
> > +       sk = qp->sk->sk;
> > +       sock_hold(sk);
> > +       skb->sk = sk;
> >
> > -       rxe_get(pkt->qp);
> >         atomic_inc(&pkt->qp->skb_out);
> >
> > +       sk->sk_user_data = (void *)(long)qp->elem.index;
> > +       skb->destructor = rxe_skb_tx_dtor;
> > +
> >         if (skb->protocol == htons(ETH_P_IP)) {
> > -               err = ip_local_out(dev_net(skb_dst(skb)->dev), skb->sk, skb);
> > +               err = ip_local_out(dev_net(skb_dst(skb)->dev), sk, skb);
> >         } else if (skb->protocol == htons(ETH_P_IPV6)) {
> > -               err = ip6_local_out(dev_net(skb_dst(skb)->dev), skb->sk, skb);
> > +               err = ip6_local_out(dev_net(skb_dst(skb)->dev), sk, skb);
> >         } else {
> > -               rxe_dbg_qp(pkt->qp, "Unknown layer 3 protocol: %d\n",
> > -                               skb->protocol);
> > -               atomic_dec(&pkt->qp->skb_out);
> > -               rxe_put(pkt->qp);
> > -               kfree_skb(skb);
> > -               return -EINVAL;
> > +               rxe_dbg_qp(qp, "Unknown layer 3 protocol: %d",
> > +                          skb->protocol);
> > +               err = -EINVAL;
> > +               goto err_not_sent;
> >         }
> >
> > +       /* IP consumed the packet, the destructor will handle cleanup */
> >         if (unlikely(net_xmit_eval(err))) {
> > -               rxe_dbg_qp(pkt->qp, "error sending packet: %d\n", err);
> > -               return -EAGAIN;
> > +               rxe_dbg_qp(qp, "Error sending packet: %d", err);
> > +               err = -EAGAIN;
> > +               goto err_out;
> >         }
> >
> >         return 0;
> > +
> > +err_not_sent:
> > +       skb->destructor = NULL;
> > +       atomic_dec(&pkt->qp->skb_out);
> > +       kfree_skb(skb);
> > +       sock_put(sk);
> > +err_out:
> > +       return err;
> >  }
> >
> >  /* fix up a send packet to match the packets
> > diff --git a/drivers/infiniband/sw/rxe/rxe_qp.c b/drivers/infiniband/sw/rxe/rxe_qp.c
> > index a569b111a9d2..dcbf71031453 100644
> > --- a/drivers/infiniband/sw/rxe/rxe_qp.c
> > +++ b/drivers/infiniband/sw/rxe/rxe_qp.c
> > @@ -194,7 +194,6 @@ static int rxe_qp_init_req(struct rxe_dev *rxe, struct rxe_qp *qp,
> >         err = sock_create_kern(&init_net, AF_INET, SOCK_DGRAM, 0, &qp->sk);
> >         if (err < 0)
> >                 return err;
> > -       qp->sk->sk->sk_user_data = qp;
> >
> >         /* pick a source UDP port number for this QP based on
> >          * the source QPN. this spreads traffic for different QPs
> > --
> > 2.39.2
> >

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH for-next 4/9] RDMA/rxe: Fix delayed send packet handling
  2023-07-24 17:59     ` Leon Romanovsky
@ 2023-07-24 18:26       ` Bob Pearson
  0 siblings, 0 replies; 38+ messages in thread
From: Bob Pearson @ 2023-07-24 18:26 UTC (permalink / raw)
  To: Leon Romanovsky, Zhu Yanjun; +Cc: jgg, jhack, linux-rdma

On 7/24/23 12:59, Leon Romanovsky wrote:
> On Sun, Jul 23, 2023 at 09:03:34PM +0800, Zhu Yanjun wrote:
>> On Sat, Jul 22, 2023 at 4:51 AM Bob Pearson <rpearsonhpe@gmail.com> wrote:
>>>
>>> In cable pull testing some NICs can hold a send packet long enough
>>> to allow ulp protocol stacks to destroy the qp and the cleanup
>>> routines to timeout waiting for all qp references to be released.
>>> When the NIC driver finally frees the SKB the qp pointer is no longer
>>> valid and causes a seg fault in rxe_skb_tx_dtor().
>>
>> If a NIC holds on to a packet, it will also hold the packets sent
>> after it, so the QP will not work well. This is a serious problem,
>> and it should be fixed in that kind of NIC. We should not fix it in
>> RXE.
>>
>> Also, a QP exists for at least several seconds. Can a packet really
>> be held in a NIC for that long? Does this problem exist in the real
>> world, or is it an imagined scenario?
>>
>> I cannot imagine this kind of scenario. Please, Jason or Leon, also check this.
> 
> I stopped following RXE changes a long time ago; they don't make any sense to me.
> Even this patch, which moves an existing IB reference count from one place to
> another and does it for every SKB, is beyond my understanding.

At the end of the day, we used to take a reference on the QP for each send packet
and drop it in the destructor function. (Why? We keep a count of pending send
packets to avoid overloading the send queue.) But with link failures, which can
take many seconds to recover, this gets in the way of lustre cleaning up a bad
connection when it times out first. So instead we take a reference on the sock
struct: if you destroy the QP you also destroy the QP's socket struct, which drops
its reference on the sock struct, and without the extra hold the sock would be
freed before the packet is finally sent and the destructor gets called. That
reference is dropped in the destructor. We still need to get to the qp pointer in
the destructor, so we pass the qp# in sk->sk_user_data instead of the qp pointer.
Now if the qp has been destroyed the lookup from the qp# fails and we just don't
touch the counter. It all just works.

Bob
> 
> Thanks
> 
>>
>> Thanks.
>> Zhu Yanjun
>>>
>>> This patch passes the qp index instead of the qp to the skb destructor
>>> callback function. The call back is required to lookup the qp from the
>>> index and if it has been destroyed the lookup will return NULL and the
>>> qp will not be referenced avoiding the seg fault.
>>>
>>> Fixes: 8700e3e7c485 ("Soft RoCE driver")
>>> Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
>>> ---
>>>  drivers/infiniband/sw/rxe/rxe_net.c | 87 ++++++++++++++++++++++-------
>>>  drivers/infiniband/sw/rxe/rxe_qp.c  |  1 -
>>>  2 files changed, 67 insertions(+), 21 deletions(-)
>>>
>>> diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
>>> index cd59666158b1..10e4a752ff7c 100644
>>> --- a/drivers/infiniband/sw/rxe/rxe_net.c
>>> +++ b/drivers/infiniband/sw/rxe/rxe_net.c
>>> @@ -131,19 +131,28 @@ static struct dst_entry *rxe_find_route(struct net_device *ndev,
>>>         return dst;
>>>  }
>>>
>>> +static struct rxe_dev *get_rxe_from_skb(struct sk_buff *skb)
>>> +{
>>> +       struct rxe_dev *rxe;
>>> +       struct net_device *ndev = skb->dev;
>>> +
>>> +       rxe = rxe_get_dev_from_net(ndev);
>>> +       if (!rxe && is_vlan_dev(ndev))
>>> +               rxe = rxe_get_dev_from_net(vlan_dev_real_dev(ndev));
>>> +
>>> +       return rxe;
>>> +}
>>> +
>>>  static int rxe_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
>>>  {
>>>         struct udphdr *udph;
>>>         struct rxe_dev *rxe;
>>> -       struct net_device *ndev = skb->dev;
>>>         struct rxe_pkt_info *pkt = SKB_TO_PKT(skb);
>>>
>>>         /* takes a reference on rxe->ib_dev
>>>          * drop when skb is freed
>>>          */
>>> -       rxe = rxe_get_dev_from_net(ndev);
>>> -       if (!rxe && is_vlan_dev(ndev))
>>> -               rxe = rxe_get_dev_from_net(vlan_dev_real_dev(ndev));
>>> +       rxe = get_rxe_from_skb(skb);
>>>         if (!rxe)
>>>                 goto drop;
>>>
>>> @@ -345,46 +354,84 @@ int rxe_prepare(struct rxe_av *av, struct rxe_pkt_info *pkt,
>>>
>>>  static void rxe_skb_tx_dtor(struct sk_buff *skb)
>>>  {
>>> -       struct sock *sk = skb->sk;
>>> -       struct rxe_qp *qp = sk->sk_user_data;
>>> -       int skb_out = atomic_dec_return(&qp->skb_out);
>>> +       struct rxe_dev *rxe;
>>> +       unsigned int index;
>>> +       struct rxe_qp *qp;
>>> +       int skb_out;
>>> +
>>> +       /* takes a ref on ib device if success */
>>> +       rxe = get_rxe_from_skb(skb);
>>> +       if (!rxe)
>>> +               goto out;
>>> +
>>> +       /* recover source qp index from sk->sk_user_data
>>> +        * free the reference taken in rxe_send
>>> +        */
>>> +       index = (int)(uintptr_t)skb->sk->sk_user_data;
>>> +       sock_put(skb->sk);
>>>
>>> +       /* lookup qp from index, takes a ref on success */
>>> +       qp = rxe_pool_get_index(&rxe->qp_pool, index);
>>> +       if (!qp)
>>> +               goto out_put_ibdev;
>>> +
>>> +       skb_out = atomic_dec_return(&qp->skb_out);
>>>         if (unlikely(qp->need_req_skb &&
>>>                      skb_out < RXE_INFLIGHT_SKBS_PER_QP_LOW))
>>>                 rxe_sched_task(&qp->req.task);
>>>
>>>         rxe_put(qp);
>>> +out_put_ibdev:
>>> +       ib_device_put(&rxe->ib_dev);
>>> +out:
>>> +       return;
>>>  }
>>>
>>>  static int rxe_send(struct sk_buff *skb, struct rxe_pkt_info *pkt)
>>>  {
>>> +       struct rxe_qp *qp = pkt->qp;
>>> +       struct sock *sk;
>>>         int err;
>>>
>>> -       skb->destructor = rxe_skb_tx_dtor;
>>> -       skb->sk = pkt->qp->sk->sk;
>>> +       /* qp can be destroyed while this packet is waiting on
>>> +        * the tx queue. So need to protect sk.
>>> +        */
>>> +       sk = qp->sk->sk;
>>> +       sock_hold(sk);
>>> +       skb->sk = sk;
>>>
>>> -       rxe_get(pkt->qp);
>>>         atomic_inc(&pkt->qp->skb_out);
>>>
>>> +       sk->sk_user_data = (void *)(long)qp->elem.index;
>>> +       skb->destructor = rxe_skb_tx_dtor;
>>> +
>>>         if (skb->protocol == htons(ETH_P_IP)) {
>>> -               err = ip_local_out(dev_net(skb_dst(skb)->dev), skb->sk, skb);
>>> +               err = ip_local_out(dev_net(skb_dst(skb)->dev), sk, skb);
>>>         } else if (skb->protocol == htons(ETH_P_IPV6)) {
>>> -               err = ip6_local_out(dev_net(skb_dst(skb)->dev), skb->sk, skb);
>>> +               err = ip6_local_out(dev_net(skb_dst(skb)->dev), sk, skb);
>>>         } else {
>>> -               rxe_dbg_qp(pkt->qp, "Unknown layer 3 protocol: %d\n",
>>> -                               skb->protocol);
>>> -               atomic_dec(&pkt->qp->skb_out);
>>> -               rxe_put(pkt->qp);
>>> -               kfree_skb(skb);
>>> -               return -EINVAL;
>>> +               rxe_dbg_qp(qp, "Unknown layer 3 protocol: %d",
>>> +                          skb->protocol);
>>> +               err = -EINVAL;
>>> +               goto err_not_sent;
>>>         }
>>>
>>> +       /* IP consumed the packet, the destructor will handle cleanup */
>>>         if (unlikely(net_xmit_eval(err))) {
>>> -               rxe_dbg_qp(pkt->qp, "error sending packet: %d\n", err);
>>> -               return -EAGAIN;
>>> +               rxe_dbg_qp(qp, "Error sending packet: %d", err);
>>> +               err = -EAGAIN;
>>> +               goto err_out;
>>>         }
>>>
>>>         return 0;
>>> +
>>> +err_not_sent:
>>> +       skb->destructor = NULL;
>>> +       atomic_dec(&pkt->qp->skb_out);
>>> +       kfree_skb(skb);
>>> +       sock_put(sk);
>>> +err_out:
>>> +       return err;
>>>  }
>>>
>>>  /* fix up a send packet to match the packets
>>> diff --git a/drivers/infiniband/sw/rxe/rxe_qp.c b/drivers/infiniband/sw/rxe/rxe_qp.c
>>> index a569b111a9d2..dcbf71031453 100644
>>> --- a/drivers/infiniband/sw/rxe/rxe_qp.c
>>> +++ b/drivers/infiniband/sw/rxe/rxe_qp.c
>>> @@ -194,7 +194,6 @@ static int rxe_qp_init_req(struct rxe_dev *rxe, struct rxe_qp *qp,
>>>         err = sock_create_kern(&init_net, AF_INET, SOCK_DGRAM, 0, &qp->sk);
>>>         if (err < 0)
>>>                 return err;
>>> -       qp->sk->sk->sk_user_data = qp;
>>>
>>>         /* pick a source UDP port number for this QP based on
>>>          * the source QPN. this spreads traffic for different QPs
>>> --
>>> 2.39.2
>>>


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH for-next 1/9] RDMA/rxe: Fix handling sleepable in rxe_pool.c
  2023-07-21 20:50 ` [PATCH for-next 1/9] RDMA/rxe: Fix handling sleepable in rxe_pool.c Bob Pearson
@ 2023-07-31 18:08   ` Jason Gunthorpe
  0 siblings, 0 replies; 38+ messages in thread
From: Jason Gunthorpe @ 2023-07-31 18:08 UTC (permalink / raw)
  To: Bob Pearson; +Cc: leon, zyjzyj2000, jhack, linux-rdma

On Fri, Jul 21, 2023 at 03:50:14PM -0500, Bob Pearson wrote:
> Since the AH creation and destroy verbs APIs can be called in
> interrupt context it is necessary to not sleep in these cases.
> This was partially dealt with in previous fixes but some cases
> remain where the code could still potentially sleep.
> 
> This patch fixes this by extending the __rxe_finalize() call to
> have a sleepable parameter and using this in the AH verbs calls.
> It also fixes one call to rxe_cleanup which did not use the
> sleepable API.
> 
> Fixes: 215d0a755e1b ("RDMA/rxe: Stop lookup of partially built objects")
> Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
> ---
>  drivers/infiniband/sw/rxe/rxe_pool.c  |  5 +++--
>  drivers/infiniband/sw/rxe/rxe_pool.h  |  5 +++--
>  drivers/infiniband/sw/rxe/rxe_verbs.c | 11 ++++++-----
>  3 files changed, 12 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
> index 6215c6de3a84..de0043b6d3f3 100644
> --- a/drivers/infiniband/sw/rxe/rxe_pool.c
> +++ b/drivers/infiniband/sw/rxe/rxe_pool.c
> @@ -247,10 +247,11 @@ int __rxe_put(struct rxe_pool_elem *elem)
>  	return kref_put(&elem->ref_cnt, rxe_elem_release);
>  }
>  
> -void __rxe_finalize(struct rxe_pool_elem *elem)
> +void __rxe_finalize(struct rxe_pool_elem *elem, bool sleepable)
>  {
>  	void *xa_ret;
> +	gfp_t gfp_flags = sleepable ? GFP_KERNEL : GFP_ATOMIC;
>  
> -	xa_ret = xa_store(&elem->pool->xa, elem->index, elem, GFP_KERNEL);
> +	xa_ret = xa_store(&elem->pool->xa, elem->index, elem, gfp_flags);
>  	WARN_ON(xa_err(xa_ret));
>  }

You don't need to do this; there is no allocation here. When you call
xa_alloc_cyclic() it reserves all the memory necessary for this.

Just add a comment and mark it GFP_ATOMIC.
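
Something along these lines (untested, just to show the shape; all the
names come from the hunk above):

void __rxe_finalize(struct rxe_pool_elem *elem)
{
	void *xa_ret;

	/* the slot was already reserved by xa_alloc_cyclic() when the
	 * object was added to the pool, so this store never allocates
	 * and GFP_ATOMIC is safe in any calling context
	 */
	xa_ret = xa_store(&elem->pool->xa, elem->index, elem, GFP_ATOMIC);
	WARN_ON(xa_err(xa_ret));
}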

Jason

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH for-next 3/9] RDMA/rxe: Fix freeing busy objects
  2023-07-21 20:50 ` [PATCH for-next 3/9] RDMA/rxe: Fix freeing busy objects Bob Pearson
@ 2023-07-31 18:11   ` Jason Gunthorpe
  2023-07-31 18:16     ` Bob Pearson
  0 siblings, 1 reply; 38+ messages in thread
From: Jason Gunthorpe @ 2023-07-31 18:11 UTC (permalink / raw)
  To: Bob Pearson; +Cc: leon, zyjzyj2000, jhack, linux-rdma

On Fri, Jul 21, 2023 at 03:50:16PM -0500, Bob Pearson wrote:
> @@ -175,16 +175,17 @@ static void rxe_elem_release(struct kref *kref)
>  {
>  	struct rxe_pool_elem *elem = container_of(kref, typeof(*elem), ref_cnt);
>  
> -	complete(&elem->complete);
> +	complete_all(&elem->complete);
>  }
>  
> -int __rxe_cleanup(struct rxe_pool_elem *elem, bool sleepable)
> +void __rxe_cleanup(struct rxe_pool_elem *elem, bool sleepable)
>  {

You should definately put this one change in its own patch doing just
that.

Jason

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH for-next 4/9] RDMA/rxe: Fix delayed send packet handling
  2023-07-21 20:50 ` [PATCH for-next 4/9] RDMA/rxe: Fix delayed send packet handling Bob Pearson
  2023-07-23 13:03   ` Zhu Yanjun
@ 2023-07-31 18:12   ` Jason Gunthorpe
  2023-07-31 18:20     ` Bob Pearson
  1 sibling, 1 reply; 38+ messages in thread
From: Jason Gunthorpe @ 2023-07-31 18:12 UTC (permalink / raw)
  To: Bob Pearson; +Cc: leon, zyjzyj2000, jhack, linux-rdma

On Fri, Jul 21, 2023 at 03:50:17PM -0500, Bob Pearson wrote:
> In cable pull testing some NICs can hold a send packet long enough
> to allow ulp protocol stacks to destroy the qp and the cleanup
> routines to timeout waiting for all qp references to be released.
> When the NIC driver finally frees the SKB the qp pointer is no longer
> valid and causes a seg fault in rxe_skb_tx_dtor().
> 
> This patch passes the qp index instead of the qp to the skb destructor
> callback function. The call back is required to lookup the qp from the
> index and if it has been destroyed the lookup will return NULL and the
> qp will not be referenced avoiding the seg fault.

And what if it is a different QP returned?

Jason

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH for-next 7/9] RDMA/rxe: Add elem->valid field
  2023-07-21 20:50 ` [PATCH for-next 7/9] RDMA/rxe: Add elem->valid field Bob Pearson
@ 2023-07-31 18:15   ` Jason Gunthorpe
  0 siblings, 0 replies; 38+ messages in thread
From: Jason Gunthorpe @ 2023-07-31 18:15 UTC (permalink / raw)
  To: Bob Pearson; +Cc: leon, zyjzyj2000, jhack, linux-rdma

On Fri, Jul 21, 2023 at 03:50:20PM -0500, Bob Pearson wrote:
> Currently the rxe driver stores NULL in the pool xarray to
> indicate that the pool element should not be looked up from its
> index. This prevents looping over pool elements during driver
> cleanup. This patch adds a new valid field to struct rxe_pool_elem
> as an alternative way to accomplish the same thing.

Wah? Why? Yuk.

> @@ -249,13 +252,6 @@ int __rxe_put(struct rxe_pool_elem *elem)
>  
>  void __rxe_finalize(struct rxe_pool_elem *elem, bool sleepable)
>  {
> -	gfp_t gfp_flags = sleepable ? GFP_KERNEL : GFP_ATOMIC;
> -	struct xarray *xa = &elem->pool->xa;
> -	unsigned long flags;
> -	void *xa_ret;
> -
> -	xa_lock_irqsave(xa, flags);
> -	xa_ret = __xa_store(xa, elem->index, elem, gfp_flags);
> -	xa_unlock_irqrestore(xa, flags);
> -	WARN_ON(xa_err(xa_ret));
> +	/* allow looking up element from index */
> +	smp_store_release(&elem->valid, 1);
>  }

That is horrible, why?

Jason

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH for-next 8/9] RDMA/rxe: Report leaked objects
  2023-07-21 20:50 ` [PATCH for-next 8/9] RDMA/rxe: Report leaked objects Bob Pearson
@ 2023-07-31 18:15   ` Jason Gunthorpe
  2023-07-31 18:23     ` Bob Pearson
  0 siblings, 1 reply; 38+ messages in thread
From: Jason Gunthorpe @ 2023-07-31 18:15 UTC (permalink / raw)
  To: Bob Pearson; +Cc: leon, zyjzyj2000, jhack, linux-rdma

On Fri, Jul 21, 2023 at 03:50:21PM -0500, Bob Pearson wrote:
> This patch gives a more detailed list of objects that are not
> freed in case of error before the module exits.
> 
> Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
> ---
>  drivers/infiniband/sw/rxe/rxe_pool.c | 12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
> index cb812bd969c6..3249c2741491 100644
> --- a/drivers/infiniband/sw/rxe/rxe_pool.c
> +++ b/drivers/infiniband/sw/rxe/rxe_pool.c
> @@ -113,7 +113,17 @@ void rxe_pool_init(struct rxe_dev *rxe, struct rxe_pool *pool,
>  
>  void rxe_pool_cleanup(struct rxe_pool *pool)
>  {
> -	WARN_ON(!xa_empty(&pool->xa));
> +	unsigned long index;
> +	struct rxe_pool_elem *elem;
> +
> +	xa_lock(&pool->xa);
> +	xa_for_each(&pool->xa, index, elem) {
> +		rxe_err_dev(pool->rxe, "%s#%d: Leaked", pool->name,
> +				elem->index);
> +	}
> +	xa_unlock(&pool->xa);
> +
> +	xa_destroy(&pool->xa);
>  }

Is this why? Just count the number of unfinalized objects and report
if there is difference, don't mess up the xarray.
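
Roughly like this (a sketch only; the elem_count counter is a made-up
field, not something that exists today):

void rxe_pool_cleanup(struct rxe_pool *pool)
{
	/* elem_count would be incremented when an object is added to
	 * the pool and decremented when it is released, so whatever
	 * is left at teardown is a leak
	 */
	int leaked = atomic_read(&pool->elem_count);

	if (leaked)
		rxe_err_dev(pool->rxe, "%s pool: %d objects leaked",
			    pool->name, leaked);

	WARN_ON(!xa_empty(&pool->xa));
}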

Jason

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH for-next 3/9] RDMA/rxe: Fix freeing busy objects
  2023-07-31 18:11   ` Jason Gunthorpe
@ 2023-07-31 18:16     ` Bob Pearson
  2023-07-31 18:22       ` Jason Gunthorpe
  0 siblings, 1 reply; 38+ messages in thread
From: Bob Pearson @ 2023-07-31 18:16 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: leon, zyjzyj2000, jhack, linux-rdma

On 7/31/23 13:11, Jason Gunthorpe wrote:
> On Fri, Jul 21, 2023 at 03:50:16PM -0500, Bob Pearson wrote:
>> @@ -175,16 +175,17 @@ static void rxe_elem_release(struct kref *kref)
>>  {
>>  	struct rxe_pool_elem *elem = container_of(kref, typeof(*elem), ref_cnt);
>>  
>> -	complete(&elem->complete);
>> +	complete_all(&elem->complete);
>>  }
>>  
>> -int __rxe_cleanup(struct rxe_pool_elem *elem, bool sleepable)
>> +void __rxe_cleanup(struct rxe_pool_elem *elem, bool sleepable)
>>  {
> 
> You should definately put this one change in its own patch doing just
> that.
> 
> Jason

Please explain a little more. I only found this change because the
change in rxe_complete to repeat the wait_for_completion call didn't work.
This seemed to fix it but the man page and comments didn't explain
complete_all very well.

Bob

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH for-next 9/9] RDMA/rxe: Protect pending send packets
  2023-07-21 20:50 ` [PATCH for-next 9/9] RDMA/rxe: Protect pending send packets Bob Pearson
@ 2023-07-31 18:17   ` Jason Gunthorpe
  2023-07-31 18:26     ` Bob Pearson
  0 siblings, 1 reply; 38+ messages in thread
From: Jason Gunthorpe @ 2023-07-31 18:17 UTC (permalink / raw)
  To: Bob Pearson; +Cc: leon, zyjzyj2000, jhack, linux-rdma

On Fri, Jul 21, 2023 at 03:50:22PM -0500, Bob Pearson wrote:
> Network interruptions may cause long delays in the processing of
> send packets during which time the rxe driver may be unloaded.
> This will cause seg faults when the packet is ultimately freed as
> it calls the destructor function in the rxe driver. This has been
> observed in cable pull fail over fail back testing.

No, module reference counts are only for code that is touching
function pointers.

If your driver is becoming removed and that messes it up then you need
to prevent the driver from unloading by adding something to the remove
function (dellink, I guess in this case)
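
Something along these lines, as a sketch only (tx_drain_wq and
pending_skb_cnt are made-up names, not existing fields in struct
rxe_dev):

	/* in the per-device remove path, before the memory goes away:
	 * rxe_send() bumps pending_skb_cnt, rxe_skb_tx_dtor() drops it
	 * and wakes the queue, so nothing can still point at the
	 * destructor once this returns
	 */
	wait_event(rxe->tx_drain_wq,
		   atomic_read(&rxe->pending_skb_cnt) == 0);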

Jason

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH for-next 4/9] RDMA/rxe: Fix delayed send packet handling
  2023-07-31 18:12   ` Jason Gunthorpe
@ 2023-07-31 18:20     ` Bob Pearson
  2023-07-31 18:23       ` Jason Gunthorpe
  0 siblings, 1 reply; 38+ messages in thread
From: Bob Pearson @ 2023-07-31 18:20 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: leon, zyjzyj2000, jhack, linux-rdma

On 7/31/23 13:12, Jason Gunthorpe wrote:
> On Fri, Jul 21, 2023 at 03:50:17PM -0500, Bob Pearson wrote:
>> In cable pull testing some NICs can hold a send packet long enough
>> to allow ulp protocol stacks to destroy the qp and the cleanup
>> routines to timeout waiting for all qp references to be released.
>> When the NIC driver finally frees the SKB the qp pointer is no longer
>> valid and causes a seg fault in rxe_skb_tx_dtor().
>>
>> This patch passes the qp index instead of the qp to the skb destructor
>> callback function. The call back is required to lookup the qp from the
>> index and if it has been destroyed the lookup will return NULL and the
>> qp will not be referenced avoiding the seg fault.
> 
> And what if it is a different QP returned?
> 
> Jason

Since we are using xarray cyclic alloc you would have to create 16M QPs before the
index was reused. This is as good as it gets I think.
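
For reference, the allocation looks roughly like this (a sketch; 'limit'
and 'next' stand in for whatever the pool actually keeps):

	/* the cursor in 'next' keeps advancing, so a freed QP index is
	 * not handed out again until the whole index range wraps around
	 */
	err = xa_alloc_cyclic(&pool->xa, &elem->index, NULL, limit,
			      &next, GFP_KERNEL);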

Bob

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH for-next 3/9] RDMA/rxe: Fix freeing busy objects
  2023-07-31 18:16     ` Bob Pearson
@ 2023-07-31 18:22       ` Jason Gunthorpe
  0 siblings, 0 replies; 38+ messages in thread
From: Jason Gunthorpe @ 2023-07-31 18:22 UTC (permalink / raw)
  To: Bob Pearson; +Cc: leon, zyjzyj2000, jhack, linux-rdma

On Mon, Jul 31, 2023 at 01:16:48PM -0500, Bob Pearson wrote:
> On 7/31/23 13:11, Jason Gunthorpe wrote:
> > On Fri, Jul 21, 2023 at 03:50:16PM -0500, Bob Pearson wrote:
> >> @@ -175,16 +175,17 @@ static void rxe_elem_release(struct kref *kref)
> >>  {
> >>  	struct rxe_pool_elem *elem = container_of(kref, typeof(*elem), ref_cnt);
> >>  
> >> -	complete(&elem->complete);
> >> +	complete_all(&elem->complete);
> >>  }
> >>  
> >> -int __rxe_cleanup(struct rxe_pool_elem *elem, bool sleepable)
> >> +void __rxe_cleanup(struct rxe_pool_elem *elem, bool sleepable)
> >>  {
> > 
> > You should definately put this one change in its own patch doing just
> > that.
> > 
> > Jason
> 
> Please explain a little more. 

I mean just remove the function return in one patch, these functions
are not supposed to fail anyhow.

> I only found this change because the change in rxe_complete to
> repeat the wait_for_completion call didn't work.  This seemed to fix
> it but the man page and comments didn't explain complete_all very
> well.

I didn't mean the complete_all hunk, I meant the __rxe_cleanup hunk :)

Jason

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH for-next 4/9] RDMA/rxe: Fix delayed send packet handling
  2023-07-31 18:20     ` Bob Pearson
@ 2023-07-31 18:23       ` Jason Gunthorpe
  2023-07-31 18:33         ` Bob Pearson
  0 siblings, 1 reply; 38+ messages in thread
From: Jason Gunthorpe @ 2023-07-31 18:23 UTC (permalink / raw)
  To: Bob Pearson; +Cc: leon, zyjzyj2000, jhack, linux-rdma

On Mon, Jul 31, 2023 at 01:20:35PM -0500, Bob Pearson wrote:
> On 7/31/23 13:12, Jason Gunthorpe wrote:
> > On Fri, Jul 21, 2023 at 03:50:17PM -0500, Bob Pearson wrote:
> >> In cable pull testing some NICs can hold a send packet long enough
> >> to allow ulp protocol stacks to destroy the qp and the cleanup
> >> routines to timeout waiting for all qp references to be released.
> >> When the NIC driver finally frees the SKB the qp pointer is no longer
> >> valid and causes a seg fault in rxe_skb_tx_dtor().
> >>
> >> This patch passes the qp index instead of the qp to the skb destructor
> >> callback function. The call back is required to lookup the qp from the
> >> index and if it has been destroyed the lookup will return NULL and the
> >> qp will not be referenced avoiding the seg fault.
> > 
> > And what if it is a different QP returned?
> > 
> > Jason
> 
> Since we are using xarray cyclic alloc you would have to create 16M QPs before the
> index was reused. This is as good as it gets I think.

Sounds terrible, why can't you store the QP pointer instead and hold a
refcount on it?

Jason

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH for-next 8/9] RDMA/rxe: Report leaked objects
  2023-07-31 18:15   ` Jason Gunthorpe
@ 2023-07-31 18:23     ` Bob Pearson
  2023-07-31 18:31       ` Jason Gunthorpe
  0 siblings, 1 reply; 38+ messages in thread
From: Bob Pearson @ 2023-07-31 18:23 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: leon, zyjzyj2000, jhack, linux-rdma

On 7/31/23 13:15, Jason Gunthorpe wrote:
> On Fri, Jul 21, 2023 at 03:50:21PM -0500, Bob Pearson wrote:
>> This patch gives a more detailed list of objects that are not
>> freed in case of error before the module exits.
>>
>> Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
>> ---
>>  drivers/infiniband/sw/rxe/rxe_pool.c | 12 +++++++++++-
>>  1 file changed, 11 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
>> index cb812bd969c6..3249c2741491 100644
>> --- a/drivers/infiniband/sw/rxe/rxe_pool.c
>> +++ b/drivers/infiniband/sw/rxe/rxe_pool.c
>> @@ -113,7 +113,17 @@ void rxe_pool_init(struct rxe_dev *rxe, struct rxe_pool *pool,
>>  
>>  void rxe_pool_cleanup(struct rxe_pool *pool)
>>  {
>> -	WARN_ON(!xa_empty(&pool->xa));
>> +	unsigned long index;
>> +	struct rxe_pool_elem *elem;
>> +
>> +	xa_lock(&pool->xa);
>> +	xa_for_each(&pool->xa, index, elem) {
>> +		rxe_err_dev(pool->rxe, "%s#%d: Leaked", pool->name,
>> +				elem->index);
>> +	}
>> +	xa_unlock(&pool->xa);
>> +
>> +	xa_destroy(&pool->xa);
>>  }
> 
> Is this why? Just count the number of unfinalized objects and report
> if there is difference, don't mess up the xarray.
> 
> Jason
This is why I made the last change but I really didn't like that there was no
way to lookup the qp from its index since we were using a NULL xarray entry to
indicate the state of the qp. Making it explicit, i.e. a variable in the struct
seems much more straight forward. Not sure why you hated the last change.

Bob

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH for-next 9/9] RDMA/rxe: Protect pending send packets
  2023-07-31 18:17   ` Jason Gunthorpe
@ 2023-07-31 18:26     ` Bob Pearson
  2023-07-31 18:32       ` Jason Gunthorpe
  0 siblings, 1 reply; 38+ messages in thread
From: Bob Pearson @ 2023-07-31 18:26 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: leon, zyjzyj2000, jhack, linux-rdma

On 7/31/23 13:17, Jason Gunthorpe wrote:
> On Fri, Jul 21, 2023 at 03:50:22PM -0500, Bob Pearson wrote:
>> Network interruptions may cause long delays in the processing of
>> send packets during which time the rxe driver may be unloaded.
>> This will cause seg faults when the packet is ultimately freed as
>> it calls the destructor function in the rxe driver. This has been
>> observed in cable pull fail over fail back testing.
> 
> No, module reference counts are only for code that is touching
> function pointers.

this is exactly the case here. it is the skb destructor function that
is carried by the skb.

> 
> If your driver is becoming removed and that messes it up then you need
> to prevent the driver from unloading by adding something to the remove
> function (dellink, I guess in this case)
> 
> Jason


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH for-next 8/9] RDMA/rxe: Report leaked objects
  2023-07-31 18:23     ` Bob Pearson
@ 2023-07-31 18:31       ` Jason Gunthorpe
  2023-07-31 18:42         ` Bob Pearson
  0 siblings, 1 reply; 38+ messages in thread
From: Jason Gunthorpe @ 2023-07-31 18:31 UTC (permalink / raw)
  To: Bob Pearson; +Cc: leon, zyjzyj2000, jhack, linux-rdma

On Mon, Jul 31, 2023 at 01:23:59PM -0500, Bob Pearson wrote:
> On 7/31/23 13:15, Jason Gunthorpe wrote:
> > On Fri, Jul 21, 2023 at 03:50:21PM -0500, Bob Pearson wrote:
> >> This patch gives a more detailed list of objects that are not
> >> freed in case of error before the module exits.
> >>
> >> Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
> >> ---
> >>  drivers/infiniband/sw/rxe/rxe_pool.c | 12 +++++++++++-
> >>  1 file changed, 11 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
> >> index cb812bd969c6..3249c2741491 100644
> >> --- a/drivers/infiniband/sw/rxe/rxe_pool.c
> >> +++ b/drivers/infiniband/sw/rxe/rxe_pool.c
> >> @@ -113,7 +113,17 @@ void rxe_pool_init(struct rxe_dev *rxe, struct rxe_pool *pool,
> >>  
> >>  void rxe_pool_cleanup(struct rxe_pool *pool)
> >>  {
> >> -	WARN_ON(!xa_empty(&pool->xa));
> >> +	unsigned long index;
> >> +	struct rxe_pool_elem *elem;
> >> +
> >> +	xa_lock(&pool->xa);
> >> +	xa_for_each(&pool->xa, index, elem) {
> >> +		rxe_err_dev(pool->rxe, "%s#%d: Leaked", pool->name,
> >> +				elem->index);
> >> +	}
> >> +	xa_unlock(&pool->xa);
> >> +
> >> +	xa_destroy(&pool->xa);
> >>  }
> > 
> > Is this why? Just count the number of unfinalized objects and report
> > if there is difference, don't mess up the xarray.
> > 
> > Jason
> This is why I made the last change but I really didn't like that there was no
> way to lookup the qp from its index since we were using a NULL xarray entry to
> indicate the state of the qp. Making it explicit, i.e. a variable in the struct
> seems much more straight forward. Not sure why you hated the last
> change.

Because if you don't call finalize, abort has to be deterministic, and
abort can't be that if someone else can access the pointer and, e.g.,
take a reference.

It breaks the entire model of how the memory ownership works during creation.

Jason

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH for-next 9/9] RDMA/rxe: Protect pending send packets
  2023-07-31 18:26     ` Bob Pearson
@ 2023-07-31 18:32       ` Jason Gunthorpe
  2023-07-31 18:44         ` Bob Pearson
  0 siblings, 1 reply; 38+ messages in thread
From: Jason Gunthorpe @ 2023-07-31 18:32 UTC (permalink / raw)
  To: Bob Pearson; +Cc: leon, zyjzyj2000, jhack, linux-rdma

On Mon, Jul 31, 2023 at 01:26:23PM -0500, Bob Pearson wrote:
> On 7/31/23 13:17, Jason Gunthorpe wrote:
> > On Fri, Jul 21, 2023 at 03:50:22PM -0500, Bob Pearson wrote:
> >> Network interruptions may cause long delays in the processing of
> >> send packets during which time the rxe driver may be unloaded.
> >> This will cause seg faults when the packet is ultimately freed as
> >> it calls the destructor function in the rxe driver. This has been
> >> observed in cable pull fail over fail back testing.
> > 
> > No, module reference counts are only for code that is touching
> > function pointers.
> 
> this is exactly the case here. it is the skb destructor function that
> is carried by the skb.

It can't possibly call it correctly without also having the rxe
ib_device reference too though??

Jason

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH for-next 4/9] RDMA/rxe: Fix delayed send packet handling
  2023-07-31 18:23       ` Jason Gunthorpe
@ 2023-07-31 18:33         ` Bob Pearson
  2023-08-04 14:17           ` Jason Gunthorpe
  0 siblings, 1 reply; 38+ messages in thread
From: Bob Pearson @ 2023-07-31 18:33 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: leon, zyjzyj2000, jhack, linux-rdma

On 7/31/23 13:23, Jason Gunthorpe wrote:
> On Mon, Jul 31, 2023 at 01:20:35PM -0500, Bob Pearson wrote:
>> On 7/31/23 13:12, Jason Gunthorpe wrote:
>>> On Fri, Jul 21, 2023 at 03:50:17PM -0500, Bob Pearson wrote:
>>>> In cable pull testing some NICs can hold a send packet long enough
>>>> to allow ulp protocol stacks to destroy the qp and the cleanup
>>>> routines to timeout waiting for all qp references to be released.
>>>> When the NIC driver finally frees the SKB the qp pointer is no longer
>>>> valid and causes a seg fault in rxe_skb_tx_dtor().
>>>>
>>>> This patch passes the qp index instead of the qp to the skb destructor
>>>> callback function. The call back is required to lookup the qp from the
>>>> index and if it has been destroyed the lookup will return NULL and the
>>>> qp will not be referenced avoiding the seg fault.
>>>
>>> And what if it is a different QP returned?
>>>
>>> Jason
>>
>> Since we are using xarray cyclic alloc you would have to create 16M QPs before the
>> index was reused. This is as good as it gets I think.
> 
> Sounds terrible, why can't you store the QP pointer instead and hold a
> refcount on it?

The goal here was to make the packet send semantics 'fire and forget', i.e. once we
send the packet we don't have any dependencies hanging around. But we still wanted
to count the pending packets to avoid overrunning the send queue.

This allows lustre to do its normal error recovery and destroy the qp and try to create
a new one when it times out.

Bob
> 
> Jason


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH for-next 8/9] RDMA/rxe: Report leaked objects
  2023-07-31 18:31       ` Jason Gunthorpe
@ 2023-07-31 18:42         ` Bob Pearson
  2023-07-31 18:43           ` Jason Gunthorpe
  0 siblings, 1 reply; 38+ messages in thread
From: Bob Pearson @ 2023-07-31 18:42 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: leon, zyjzyj2000, jhack, linux-rdma

On 7/31/23 13:31, Jason Gunthorpe wrote:
> On Mon, Jul 31, 2023 at 01:23:59PM -0500, Bob Pearson wrote:
>> On 7/31/23 13:15, Jason Gunthorpe wrote:
>>> On Fri, Jul 21, 2023 at 03:50:21PM -0500, Bob Pearson wrote:
>>>> This patch gives a more detailed list of objects that are not
>>>> freed in case of error before the module exits.
>>>>
>>>> Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
>>>> ---
>>>>  drivers/infiniband/sw/rxe/rxe_pool.c | 12 +++++++++++-
>>>>  1 file changed, 11 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
>>>> index cb812bd969c6..3249c2741491 100644
>>>> --- a/drivers/infiniband/sw/rxe/rxe_pool.c
>>>> +++ b/drivers/infiniband/sw/rxe/rxe_pool.c
>>>> @@ -113,7 +113,17 @@ void rxe_pool_init(struct rxe_dev *rxe, struct rxe_pool *pool,
>>>>  
>>>>  void rxe_pool_cleanup(struct rxe_pool *pool)
>>>>  {
>>>> -	WARN_ON(!xa_empty(&pool->xa));
>>>> +	unsigned long index;
>>>> +	struct rxe_pool_elem *elem;
>>>> +
>>>> +	xa_lock(&pool->xa);
>>>> +	xa_for_each(&pool->xa, index, elem) {
>>>> +		rxe_err_dev(pool->rxe, "%s#%d: Leaked", pool->name,
>>>> +				elem->index);
>>>> +	}
>>>> +	xa_unlock(&pool->xa);
>>>> +
>>>> +	xa_destroy(&pool->xa);
>>>>  }
>>>
>>> Is this why? Just count the number of unfinalized objects and report
>>> if there is difference, don't mess up the xarray.
>>>
>>> Jason
>> This is why I made the last change but I really didn't like that there was no
>> way to lookup the qp from its index since we were using a NULL xarray entry to
>> indicate the state of the qp. Making it explicit, i.e. a variable in the struct
>> seems much more straight forward. Not sure why you hated the last
>> change.
> 
> > Because if you don't call finalize, abort has to be deterministic, and
> > abort can't be that if someone else can access the pointer and, e.g.,
> > take a reference.

rxe_pool_get_index() is the only 'correct' way to look up the pointer and
it checks the valid state (now). Scanning the xarray or just looking up
the qp is something outside the scope of the normal flows. Like listing
orphan objects on module exit.

Memory ownership didn't change. It is still the same. The only change is
how we mark whether the object is valid for lookup.
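
I.e. the lookup path ends up doing roughly this (a sketch from memory,
not the literal patch text):

	struct rxe_pool_elem *elem = xa_load(&pool->xa, index);

	/* pairs with the smp_store_release() in __rxe_finalize() */
	if (!elem || !smp_load_acquire(&elem->valid) ||
	    !kref_get_unless_zero(&elem->ref_cnt))
		return NULL;

	return elem->obj;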

Bob
> 
> It breaks the entire model of how the memory ownership works during creation.
> 
> Jason


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH for-next 8/9] RDMA/rxe: Report leaked objects
  2023-07-31 18:42         ` Bob Pearson
@ 2023-07-31 18:43           ` Jason Gunthorpe
  2023-07-31 18:51             ` Bob Pearson
  0 siblings, 1 reply; 38+ messages in thread
From: Jason Gunthorpe @ 2023-07-31 18:43 UTC (permalink / raw)
  To: Bob Pearson; +Cc: leon, zyjzyj2000, jhack, linux-rdma

On Mon, Jul 31, 2023 at 01:42:09PM -0500, Bob Pearson wrote:
> On 7/31/23 13:31, Jason Gunthorpe wrote:
> > On Mon, Jul 31, 2023 at 01:23:59PM -0500, Bob Pearson wrote:
> >> On 7/31/23 13:15, Jason Gunthorpe wrote:
> >>> On Fri, Jul 21, 2023 at 03:50:21PM -0500, Bob Pearson wrote:
> >>>> This patch gives a more detailed list of objects that are not
> >>>> freed in case of error before the module exits.
> >>>>
> >>>> Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
> >>>> ---
> >>>>  drivers/infiniband/sw/rxe/rxe_pool.c | 12 +++++++++++-
> >>>>  1 file changed, 11 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
> >>>> index cb812bd969c6..3249c2741491 100644
> >>>> --- a/drivers/infiniband/sw/rxe/rxe_pool.c
> >>>> +++ b/drivers/infiniband/sw/rxe/rxe_pool.c
> >>>> @@ -113,7 +113,17 @@ void rxe_pool_init(struct rxe_dev *rxe, struct rxe_pool *pool,
> >>>>  
> >>>>  void rxe_pool_cleanup(struct rxe_pool *pool)
> >>>>  {
> >>>> -	WARN_ON(!xa_empty(&pool->xa));
> >>>> +	unsigned long index;
> >>>> +	struct rxe_pool_elem *elem;
> >>>> +
> >>>> +	xa_lock(&pool->xa);
> >>>> +	xa_for_each(&pool->xa, index, elem) {
> >>>> +		rxe_err_dev(pool->rxe, "%s#%d: Leaked", pool->name,
> >>>> +				elem->index);
> >>>> +	}
> >>>> +	xa_unlock(&pool->xa);
> >>>> +
> >>>> +	xa_destroy(&pool->xa);
> >>>>  }
> >>>
> >>> Is this why? Just count the number of unfinalized objects and report
> >>> if there is difference, don't mess up the xarray.
> >>>
> >>> Jason
> >> This is why I made the last change but I really didn't like that there was no
> >> way to lookup the qp from its index since we were using a NULL xarray entry to
> >> indicate the state of the qp. Making it explicit, i.e. a variable in the struct
> >> seems much more straight forward. Not sure why you hated the last
> >> change.
> > 
> > Because if you don't call finalize abort has to be deterministic, and
> > abort can't be that if it someone else can access the poitner and, eg,
> > take a reference.
> 
> rxe_pool_get_index() is the only 'correct' way to look up the pointer and
> it checks the valid state (now). Scanning the xarray or just looking up
> the qp is something outside the scope of the normal flows. Like listing
> orphan objects on module exit.

Maybe in this instance, but people keep changing this, and it is
fundamentally wrong to store a pointer to incompletely initialized
memory for other threads to see.

Especially for some minor debugging feature.

Jason

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH for-next 9/9] RDMA/rxe: Protect pending send packets
  2023-07-31 18:32       ` Jason Gunthorpe
@ 2023-07-31 18:44         ` Bob Pearson
  2023-08-01 22:56           ` Jason Gunthorpe
  0 siblings, 1 reply; 38+ messages in thread
From: Bob Pearson @ 2023-07-31 18:44 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: leon, zyjzyj2000, jhack, linux-rdma

On 7/31/23 13:32, Jason Gunthorpe wrote:
> On Mon, Jul 31, 2023 at 01:26:23PM -0500, Bob Pearson wrote:
>> On 7/31/23 13:17, Jason Gunthorpe wrote:
>>> On Fri, Jul 21, 2023 at 03:50:22PM -0500, Bob Pearson wrote:
>>>> Network interruptions may cause long delays in the processing of
>>>> send packets during which time the rxe driver may be unloaded.
>>>> This will cause seg faults when the packet is ultimately freed as
>>>> it calls the destructor function in the rxe driver. This has been
>>>> observed in cable pull fail over fail back testing.
>>>
>>> No, module reference counts are only for code that is touching
>>> function pointers.
>>
>> this is exactly the case here. it is the skb destructor function that
>> is carried by the skb.
> 
> It can't possibly call it correctly without also having the rxe
> ib_device reference too though??

Nope. This was causing seg faults in testing when there was a long network
hang and the admin tried to reload the rxe driver. The skb code doesn't care
about the ib device at all.

Bob
> 
> Jason


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH for-next 8/9] RDMA/rxe: Report leaked objects
  2023-07-31 18:43           ` Jason Gunthorpe
@ 2023-07-31 18:51             ` Bob Pearson
  2023-08-04 14:16               ` Jason Gunthorpe
  0 siblings, 1 reply; 38+ messages in thread
From: Bob Pearson @ 2023-07-31 18:51 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: leon, zyjzyj2000, jhack, linux-rdma

On 7/31/23 13:43, Jason Gunthorpe wrote:
> On Mon, Jul 31, 2023 at 01:42:09PM -0500, Bob Pearson wrote:
>> On 7/31/23 13:31, Jason Gunthorpe wrote:
>>> On Mon, Jul 31, 2023 at 01:23:59PM -0500, Bob Pearson wrote:
>>>> On 7/31/23 13:15, Jason Gunthorpe wrote:
>>>>> On Fri, Jul 21, 2023 at 03:50:21PM -0500, Bob Pearson wrote:
>>>>>> This patch gives a more detailed list of objects that are not
>>>>>> freed in case of error before the module exits.
>>>>>>
>>>>>> Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
>>>>>> ---
>>>>>>  drivers/infiniband/sw/rxe/rxe_pool.c | 12 +++++++++++-
>>>>>>  1 file changed, 11 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
>>>>>> index cb812bd969c6..3249c2741491 100644
>>>>>> --- a/drivers/infiniband/sw/rxe/rxe_pool.c
>>>>>> +++ b/drivers/infiniband/sw/rxe/rxe_pool.c
>>>>>> @@ -113,7 +113,17 @@ void rxe_pool_init(struct rxe_dev *rxe, struct rxe_pool *pool,
>>>>>>  
>>>>>>  void rxe_pool_cleanup(struct rxe_pool *pool)
>>>>>>  {
>>>>>> -	WARN_ON(!xa_empty(&pool->xa));
>>>>>> +	unsigned long index;
>>>>>> +	struct rxe_pool_elem *elem;
>>>>>> +
>>>>>> +	xa_lock(&pool->xa);
>>>>>> +	xa_for_each(&pool->xa, index, elem) {
>>>>>> +		rxe_err_dev(pool->rxe, "%s#%d: Leaked", pool->name,
>>>>>> +				elem->index);
>>>>>> +	}
>>>>>> +	xa_unlock(&pool->xa);
>>>>>> +
>>>>>> +	xa_destroy(&pool->xa);
>>>>>>  }
>>>>>
>>>>> Is this why? Just count the number of unfinalized objects and report
>>>>> if there is difference, don't mess up the xarray.
>>>>>
>>>>> Jason
>>>> This is why I made the last change but I really didn't like that there was no
>>>> way to lookup the qp from its index since we were using a NULL xarray entry to
>>>> indicate the state of the qp. Making it explicit, i.e. a variable in the struct
>>>> seems much more straight forward. Not sure why you hated the last
>>>> change.
>>>
> >>> Because if you don't call finalize, abort has to be deterministic, and
> >>> abort can't be that if someone else can access the pointer and, e.g.,
> >>> take a reference.
>>
>> rxe_pool_get_index() is the only 'correct' way to look up the pointer and
>> it checks the valid state (now). Scanning the xarray or just looking up
>> the qp is something outside the scope of the normal flows. Like listing
>> orphan objects on module exit.
> 
> Maybe in this instance, but people keep changing this, and it is
> fundamentally wrong to store a pointer to incompletely initialized
> memory for other threads to see.
> 
> Especially for some minor debugging feature.

Maybe warning comments would help. There are lots of ways to break the code, sigh.
The typical one is someone making a change that breaks reference counting.
> 
> Jason


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH for-next 9/9] RDMA/rxe: Protect pending send packets
  2023-07-31 18:44         ` Bob Pearson
@ 2023-08-01 22:56           ` Jason Gunthorpe
  2023-08-02 14:39             ` Bob Pearson
  0 siblings, 1 reply; 38+ messages in thread
From: Jason Gunthorpe @ 2023-08-01 22:56 UTC (permalink / raw)
  To: Bob Pearson; +Cc: leon, zyjzyj2000, jhack, linux-rdma

On Mon, Jul 31, 2023 at 01:44:47PM -0500, Bob Pearson wrote:
> On 7/31/23 13:32, Jason Gunthorpe wrote:
> > On Mon, Jul 31, 2023 at 01:26:23PM -0500, Bob Pearson wrote:
> >> On 7/31/23 13:17, Jason Gunthorpe wrote:
> >>> On Fri, Jul 21, 2023 at 03:50:22PM -0500, Bob Pearson wrote:
> >>>> Network interruptions may cause long delays in the processing of
> >>>> send packets during which time the rxe driver may be unloaded.
> >>>> This will cause seg faults when the packet is ultimately freed as
> >>>> it calls the destructor function in the rxe driver. This has been
> >>>> observed in cable pull fail over fail back testing.
> >>>
> >>> No, module reference counts are only for code that is touching
> >>> function pointers.
> >>
> >> this is exactly the case here. it is the skb destructor function that
> >> is carried by the skb.
> > 
> > It can't possibly call it correctly without also having the rxe
> > ib_device reference too though??
> 
> Nope. This was causing seg faults in testing when there was a long network
> hang and the admin tried to reload the rxe driver. The skb code doesn't care
> about the ib device at all.

I don't get it, there aren't globals in rxe, so WTF is it doing if it
isn't somehow tracing back to memory that is under the ib_device
lifetime?

Jason

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH for-next 9/9] RDMA/rxe: Protect pending send packets
  2023-08-01 22:56           ` Jason Gunthorpe
@ 2023-08-02 14:39             ` Bob Pearson
  2023-08-02 14:57               ` Jason Gunthorpe
  0 siblings, 1 reply; 38+ messages in thread
From: Bob Pearson @ 2023-08-02 14:39 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: leon, zyjzyj2000, jhack, linux-rdma

On 8/1/23 17:56, Jason Gunthorpe wrote:
> On Mon, Jul 31, 2023 at 01:44:47PM -0500, Bob Pearson wrote:
>> On 7/31/23 13:32, Jason Gunthorpe wrote:
>>> On Mon, Jul 31, 2023 at 01:26:23PM -0500, Bob Pearson wrote:
>>>> On 7/31/23 13:17, Jason Gunthorpe wrote:
>>>>> On Fri, Jul 21, 2023 at 03:50:22PM -0500, Bob Pearson wrote:
>>>>>> Network interruptions may cause long delays in the processing of
>>>>>> send packets during which time the rxe driver may be unloaded.
>>>>>> This will cause seg faults when the packet is ultimately freed as
>>>>>> it calls the destructor function in the rxe driver. This has been
>>>>>> observed in cable pull fail over fail back testing.
>>>>>
>>>>> No, module reference counts are only for code that is touching
>>>>> function pointers.
>>>>
>>>> this is exactly the case here. it is the skb destructor function that
>>>> is carried by the skb.
>>>
>>> It can't possibly call it correctly without also having the rxe
>>> ib_device reference too though??
>>
>> Nope. This was causing seg faults in testing when there was a long network
>> hang and the admin tried to reload the rxe driver. The skb code doesn't care
>> about the ib device at all.
> 
> I don't get it, there aren't globals in rxe, so WTF is it doing if it
> isn't somehow tracing back to memory that is under the ib_device
> lifetime?
> 
> Jason

When the rxe driver builds a send packet it puts the address of its destructor
routine in the skb before calling ip_local_out to send it. A pointer into the
driver's code is then left hanging around in the skb. If the module exit routine
is not delayed until all the skbs are freed, this can cause seg faults. The only
way to trigger it is to call rmmod on the driver too early, but people have done
this occasionally and reported it as a bug.
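
So the patch pins the module while such skbs are outstanding; very
roughly (my sketch here, not the literal hunk):

	/* in rxe_send(), before handing the skb to the IP stack */
	__module_get(THIS_MODULE);
	skb->destructor = rxe_skb_tx_dtor;

	/* and at the end of rxe_skb_tx_dtor() */
	module_put(THIS_MODULE);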

Bob

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH for-next 9/9] RDMA/rxe: Protect pending send packets
  2023-08-02 14:39             ` Bob Pearson
@ 2023-08-02 14:57               ` Jason Gunthorpe
  0 siblings, 0 replies; 38+ messages in thread
From: Jason Gunthorpe @ 2023-08-02 14:57 UTC (permalink / raw)
  To: Bob Pearson; +Cc: leon, zyjzyj2000, jhack, linux-rdma

On Wed, Aug 02, 2023 at 09:39:55AM -0500, Bob Pearson wrote:
> On 8/1/23 17:56, Jason Gunthorpe wrote:
> > On Mon, Jul 31, 2023 at 01:44:47PM -0500, Bob Pearson wrote:
> >> On 7/31/23 13:32, Jason Gunthorpe wrote:
> >>> On Mon, Jul 31, 2023 at 01:26:23PM -0500, Bob Pearson wrote:
> >>>> On 7/31/23 13:17, Jason Gunthorpe wrote:
> >>>>> On Fri, Jul 21, 2023 at 03:50:22PM -0500, Bob Pearson wrote:
> >>>>>> Network interruptions may cause long delays in the processing of
> >>>>>> send packets during which time the rxe driver may be unloaded.
> >>>>>> This will cause seg faults when the packet is ultimately freed as
> >>>>>> it calls the destructor function in the rxe driver. This has been
> >>>>>> observed in cable pull fail over fail back testing.
> >>>>>
> >>>>> No, module reference counts are only for code that is touching
> >>>>> function pointers.
> >>>>
> >>>> this is exactly the case here. it is the skb destructor function that
> >>>> is carried by the skb.
> >>>
> >>> It can't possibly call it correctly without also having the rxe
> >>> ib_device reference too though??
> >>
> >> Nope. This was causing seg faults in testing when there was a long network
> >> hang and the admin tried to reload the rxe driver. The skb code doesn't care
> >> about the ib device at all.
> > 
> > I don't get it, there aren't globals in rxe, so WTF is it doing if it
> > isn't somehow tracing back to memory that is under the ib_device
> > lifetime?
> > 
> > Jason
> 
> When the rxe driver builds a send packet it stores the address of its destructor
> function in the skb before calling ip_local_out to send it. That pointer into the
> driver's code stays live for as long as the skb does. If the module exit routine is
> not delayed until all of those skb's have been freed, freeing one later calls into
> unloaded code and seg faults. The only way to trigger this is to call rmmod on the
> driver too early, but people have done that occasionally and reported it as a bug.

You are missing the point, the destructor currently does this:

static void rxe_skb_tx_dtor(struct sk_buff *skb)
{
	struct sock *sk = skb->sk;
	struct rxe_qp *qp = sk->sk_user_data;

So you've already UAF'd because rxe_qp is freed memory well before you
get to unloading the module.

This series changed it to do this:
 
 static void rxe_skb_tx_dtor(struct sk_buff *skb)
 {
	struct rxe_dev *rxe;
	unsigned int index;
	struct rxe_qp *qp;
	int skb_out;

	/* takes a ref on ib device if success */
	rxe = get_rxe_from_skb(skb);
	if (!rxe)
		goto out;

 
static struct rxe_dev *get_rxe_from_skb(struct sk_buff *skb)
{
	struct rxe_dev *rxe;
	struct net_device *ndev = skb->dev;

	rxe = rxe_get_dev_from_net(ndev);
	if (!rxe && is_vlan_dev(ndev))
		rxe = rxe_get_dev_from_net(vlan_dev_real_dev(ndev));


Which seems totally nutz, you are now relying on the global hash table
in ib_core to resolve the ib device.

Again, why can't this code do something sane like refcount the qp or
ib_device so the destruction doesn't progress until all the SKBs are
flushed out?
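
Something along these lines would already be safe (sketch only; rxe_get()/rxe_put()
and the field names stand in for whatever helpers the driver really has):

static void rxe_skb_tx_dtor(struct sk_buff *skb)
{
	struct rxe_qp *qp = skb->sk->sk_user_data;

	/* qp cannot be freed memory here: a reference was taken for
	 * this skb when it was handed to the stack
	 */
	atomic_dec(&qp->skb_out);
	rxe_put(qp);
}

static void rxe_setup_tx_skb(struct rxe_qp *qp, struct sk_buff *skb)
{
	rxe_get(qp);			/* held until the destructor runs */
	skb->sk = qp->sk->sk;
	skb->destructor = rxe_skb_tx_dtor;
	atomic_inc(&qp->skb_out);
}

Then qp destroy simply cannot complete until the count drops to zero, which is
exactly what the existing wait-for-references step in the cleanup path is for.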

Jason

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH for-next 8/9] RDMA/rxe: Report leaked objects
  2023-07-31 18:51             ` Bob Pearson
@ 2023-08-04 14:16               ` Jason Gunthorpe
  0 siblings, 0 replies; 38+ messages in thread
From: Jason Gunthorpe @ 2023-08-04 14:16 UTC (permalink / raw)
  To: Bob Pearson; +Cc: leon, zyjzyj2000, jhack, linux-rdma

On Mon, Jul 31, 2023 at 01:51:59PM -0500, Bob Pearson wrote:
> On 7/31/23 13:43, Jason Gunthorpe wrote:
> > On Mon, Jul 31, 2023 at 01:42:09PM -0500, Bob Pearson wrote:
> >> On 7/31/23 13:31, Jason Gunthorpe wrote:
> >>> On Mon, Jul 31, 2023 at 01:23:59PM -0500, Bob Pearson wrote:
> >>>> On 7/31/23 13:15, Jason Gunthorpe wrote:
> >>>>> On Fri, Jul 21, 2023 at 03:50:21PM -0500, Bob Pearson wrote:
> >>>>>> This patch gives a more detailed list of objects that are not
> >>>>>> freed in case of error before the module exits.
> >>>>>>
> >>>>>> Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
> >>>>>> ---
> >>>>>>  drivers/infiniband/sw/rxe/rxe_pool.c | 12 +++++++++++-
> >>>>>>  1 file changed, 11 insertions(+), 1 deletion(-)
> >>>>>>
> >>>>>> diff --git a/drivers/infiniband/sw/rxe/rxe_pool.c b/drivers/infiniband/sw/rxe/rxe_pool.c
> >>>>>> index cb812bd969c6..3249c2741491 100644
> >>>>>> --- a/drivers/infiniband/sw/rxe/rxe_pool.c
> >>>>>> +++ b/drivers/infiniband/sw/rxe/rxe_pool.c
> >>>>>> @@ -113,7 +113,17 @@ void rxe_pool_init(struct rxe_dev *rxe, struct rxe_pool *pool,
> >>>>>>  
> >>>>>>  void rxe_pool_cleanup(struct rxe_pool *pool)
> >>>>>>  {
> >>>>>> -	WARN_ON(!xa_empty(&pool->xa));
> >>>>>> +	unsigned long index;
> >>>>>> +	struct rxe_pool_elem *elem;
> >>>>>> +
> >>>>>> +	xa_lock(&pool->xa);
> >>>>>> +	xa_for_each(&pool->xa, index, elem) {
> >>>>>> +		rxe_err_dev(pool->rxe, "%s#%d: Leaked", pool->name,
> >>>>>> +				elem->index);
> >>>>>> +	}
> >>>>>> +	xa_unlock(&pool->xa);
> >>>>>> +
> >>>>>> +	xa_destroy(&pool->xa);
> >>>>>>  }
> >>>>>
> >>>>> Is this why? Just count the number of unfinalized objects and report
> >>>>> if there is difference, don't mess up the xarray.
> >>>>>
> >>>>> Jason
> >>>> This is why I made the last change, but I really didn't like that there was no
> >>>> way to look up the qp from its index since we were using a NULL xarray entry to
> >>>> indicate the state of the qp. Making it explicit, i.e. a variable in the struct,
> >>>> seems much more straightforward. Not sure why you hated the last
> >>>> change.
> >>>
> >>> Because if you don't call finalize, abort has to be deterministic, and
> >>> abort can't be that if someone else can access the pointer and, e.g.,
> >>> take a reference.
> >>
> >> rxe_pool_get_index() is the only 'correct' way to look up the pointer and
> >> it checks the valid state (now). Scanning the xarray or just looking up
> >> the qp is something outside the scope of the normal flows. Like listing
> >> orphan objects on module exit.
> > 
> > Maybe in this instance, but people keep changing this and it is
> > fundamentally wrong to store a pointer to incompletely initialized
> > memory for other threads to see.
> > 
> > Especially for some minor debugging feature.
> 
> Maybe warning comments would help. There are lots of ways to break the code, sigh.
> The typical one is someone makes a change that breaks reference
> counting.

Not doing it wrong in the first place is the most important thing.

Don't put incompletely initialized objects in global memory. It is a
very simple rule.
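
For the pool that means the usual reserve-then-publish shape (sketch only; adjust
to the pool's actual fields):

	/* alloc path: reserve an index but publish nothing, so a
	 * concurrent rxe_pool_get_index() sees NULL for this slot
	 */
	err = xa_alloc_cyclic(&pool->xa, &elem->index, NULL,
			      pool->limit, &pool->next, GFP_KERNEL);

	/* ... fully initialize the object ... */

	/* finalize: only now make the pointer visible to lookups */
	xa_store(&pool->xa, elem->index, elem, GFP_KERNEL);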

Jason

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH for-next 4/9] RDMA/rxe: Fix delayed send packet handling
  2023-07-31 18:33         ` Bob Pearson
@ 2023-08-04 14:17           ` Jason Gunthorpe
  0 siblings, 0 replies; 38+ messages in thread
From: Jason Gunthorpe @ 2023-08-04 14:17 UTC (permalink / raw)
  To: Bob Pearson; +Cc: leon, zyjzyj2000, jhack, linux-rdma

On Mon, Jul 31, 2023 at 01:33:15PM -0500, Bob Pearson wrote:
> On 7/31/23 13:23, Jason Gunthorpe wrote:
> > On Mon, Jul 31, 2023 at 01:20:35PM -0500, Bob Pearson wrote:
> >> On 7/31/23 13:12, Jason Gunthorpe wrote:
> >>> On Fri, Jul 21, 2023 at 03:50:17PM -0500, Bob Pearson wrote:
> >>>> In cable-pull testing some NICs can hold a send packet long enough
> >>>> to allow ulp protocol stacks to destroy the qp and the cleanup
> >>>> routines to time out waiting for all qp references to be released.
> >>>> When the NIC driver finally frees the SKB the qp pointer is no longer
> >>>> valid and causes a seg fault in rxe_skb_tx_dtor().
> >>>>
> >>>> This patch passes the qp index instead of the qp to the skb destructor
> >>>> callback function. The callback is required to look up the qp from the
> >>>> index, and if the qp has been destroyed the lookup will return NULL and
> >>>> the qp will not be referenced, avoiding the seg fault.
> >>>
> >>> And what if it is a different QP returned?
> >>>
> >>> Jason
> >>
> >> Since we are using xarray cyclic alloc you would have to create 16M QPs before the
> >> index was reused. This is as good as it gets I think.
> > 
> > Sounds terrible, why can't you store the QP pointer instead and hold a
> > refcount on it?
> 
> The goal here was to make the packet send semantics 'fire and forget', i.e. once we
> send the packet there are no dependencies hanging around. But we still wanted to count
> the pending packets to avoid overrunning the send queue.

Well, you can't have it both ways really.

Maybe you need another bit of memory to track the packet counters that
can be refcounted independently of the qp.

And wait for those refcounts to zero out before allowing the driver to
unprobe.
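
Shape-wise something like this (all names invented, just to show the idea;
kref_init()/init_completion() at qp create time not shown):

/* small object that outlives the qp; drained only after the last
 * pending skb has dropped its reference
 */
struct rxe_tx_counters {
	struct kref		ref;
	atomic_t		skb_out;
	struct completion	drained;
};

static void rxe_tx_counters_release(struct kref *kref)
{
	struct rxe_tx_counters *c =
		container_of(kref, struct rxe_tx_counters, ref);

	complete(&c->drained);
}

static void rxe_skb_tx_dtor(struct sk_buff *skb)
{
	struct rxe_tx_counters *c = skb->sk->sk_user_data;

	atomic_dec(&c->skb_out);
	kref_put(&c->ref, rxe_tx_counters_release);	/* one ref per skb */
}

/* teardown: drop the qp's own reference, then block until every
 * pending skb has been freed before letting unprobe continue
 */
static void rxe_tx_counters_drain(struct rxe_tx_counters *c)
{
	kref_put(&c->ref, rxe_tx_counters_release);
	wait_for_completion(&c->drained);
	kfree(c);
}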

Jason

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2023-08-04 14:17 UTC | newest]

Thread overview: 38+ messages
2023-07-21 20:50 [PATCH for-next 0/9] RDMA/rxe: Misc fixes and cleanups Bob Pearson
2023-07-21 20:50 ` [PATCH for-next 1/9] RDMA/rxe: Fix handling sleepable in rxe_pool.c Bob Pearson
2023-07-31 18:08   ` Jason Gunthorpe
2023-07-21 20:50 ` [PATCH for-next 2/9] RDMA/rxe: Fix xarray locking " Bob Pearson
2023-07-21 20:50 ` [PATCH for-next 3/9] RDMA/rxe: Fix freeing busy objects Bob Pearson
2023-07-31 18:11   ` Jason Gunthorpe
2023-07-31 18:16     ` Bob Pearson
2023-07-31 18:22       ` Jason Gunthorpe
2023-07-21 20:50 ` [PATCH for-next 4/9] RDMA/rxe: Fix delayed send packet handling Bob Pearson
2023-07-23 13:03   ` Zhu Yanjun
2023-07-23 17:24     ` Bob Pearson
2023-07-24 17:59     ` Leon Romanovsky
2023-07-24 18:26       ` Bob Pearson
2023-07-31 18:12   ` Jason Gunthorpe
2023-07-31 18:20     ` Bob Pearson
2023-07-31 18:23       ` Jason Gunthorpe
2023-07-31 18:33         ` Bob Pearson
2023-08-04 14:17           ` Jason Gunthorpe
2023-07-21 20:50 ` [PATCH for-next 5/9] RDMA/rxe: Optimize rxe_init_packet in rxe_net.c Bob Pearson
2023-07-21 20:50 ` [PATCH for-next 6/9] RDMA/rxe: Delete unused field elem->list Bob Pearson
2023-07-21 20:50 ` [PATCH for-next 7/9] RDMA/rxe: Add elem->valid field Bob Pearson
2023-07-31 18:15   ` Jason Gunthorpe
2023-07-21 20:50 ` [PATCH for-next 8/9] RDMA/rxe: Report leaked objects Bob Pearson
2023-07-31 18:15   ` Jason Gunthorpe
2023-07-31 18:23     ` Bob Pearson
2023-07-31 18:31       ` Jason Gunthorpe
2023-07-31 18:42         ` Bob Pearson
2023-07-31 18:43           ` Jason Gunthorpe
2023-07-31 18:51             ` Bob Pearson
2023-08-04 14:16               ` Jason Gunthorpe
2023-07-21 20:50 ` [PATCH for-next 9/9] RDMA/rxe: Protect pending send packets Bob Pearson
2023-07-31 18:17   ` Jason Gunthorpe
2023-07-31 18:26     ` Bob Pearson
2023-07-31 18:32       ` Jason Gunthorpe
2023-07-31 18:44         ` Bob Pearson
2023-08-01 22:56           ` Jason Gunthorpe
2023-08-02 14:39             ` Bob Pearson
2023-08-02 14:57               ` Jason Gunthorpe
