* [PATCH for-next v3 00/12] RDMA/rxe: Various fixes and cleanups
@ 2024-03-29 14:55 Bob Pearson
  2024-03-29 14:55 ` [PATCH for-next v3 01/12] RDMA/rxe: Fix seg fault in rxe_comp_queue_pkt Bob Pearson
                   ` (12 more replies)
  0 siblings, 13 replies; 15+ messages in thread
From: Bob Pearson @ 2024-03-29 14:55 UTC (permalink / raw)
  To: yanjun.zhu, jgg, leon, linux-rdma, jhack; +Cc: Bob Pearson

This series of patches is the result of high scale testing on a large
HPC system with a large attached Lustre file system. Several errors
were found which had not previously been seen at smaller scales. In
this case, testing used up to 1600 QPs on 1024 compute nodes attached
to about 100 flash storage nodes. Each patch has its own description.

v3
	Fixed an error in "Don't call rxe_requester from rxe_completer"
	Moved run_requester_again from a global to rxe_req_info.again.
	The control parameter has to be local to each qp.
v2
	Minor edits to some of the commit messages.
	Added a missing change to "Don't schedule rxe_completer...".
	Added a missing change to "Get rid of pkt resend on err".
	Added one additional commit.

Bob Pearson (12):
  RDMA/rxe: Fix seg fault in rxe_comp_queue_pkt
  RDMA/rxe: Allow good work requests to be executed
  RDMA/rxe: Remove redundant scheduling of rxe_completer
  RDMA/rxe: Merge request and complete tasks
  RDMA/rxe: Remove save/rollback_state in rxe_requester
  RDMA/rxe: Don't schedule rxe_completer from rxe_requester
  RDMA/rxe: Don't call rxe_requester from rxe_completer
  RDMA/rxe: Don't call direct between tasks
  RDMA/rxe: Fix incorrect rxe_put in error path
  RDMA/rxe: Make rxe_loopback match rxe_send behavior
  RDMA/rxe: Get rid of pkt resend on err
  RDMA/rxe: Let destroy qp succeed with stuck packet

 drivers/infiniband/sw/rxe/rxe_comp.c        | 32 ++++----
 drivers/infiniband/sw/rxe/rxe_hw_counters.c |  2 +-
 drivers/infiniband/sw/rxe/rxe_hw_counters.h |  2 +-
 drivers/infiniband/sw/rxe/rxe_loc.h         |  3 +-
 drivers/infiniband/sw/rxe/rxe_net.c         | 69 +++++++++--------
 drivers/infiniband/sw/rxe/rxe_qp.c          | 46 +++++-------
 drivers/infiniband/sw/rxe/rxe_req.c         | 82 ++++++---------------
 drivers/infiniband/sw/rxe/rxe_resp.c        | 14 +---
 drivers/infiniband/sw/rxe/rxe_verbs.c       | 17 ++---
 drivers/infiniband/sw/rxe/rxe_verbs.h       |  7 +-
 10 files changed, 111 insertions(+), 163 deletions(-)

-- 
2.43.0



* [PATCH for-next v3 01/12] RDMA/rxe: Fix seg fault in rxe_comp_queue_pkt
  2024-03-29 14:55 [PATCH for-next v3 00/12] RDMA/rxe: Various fixes and cleanups Bob Pearson
@ 2024-03-29 14:55 ` Bob Pearson
  2024-03-29 14:55 ` [PATCH for-next v3 02/12] RDMA/rxe: Allow good work requests to be executed Bob Pearson
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 15+ messages in thread
From: Bob Pearson @ 2024-03-29 14:55 UTC (permalink / raw)
  To: yanjun.zhu, jgg, leon, linux-rdma, jhack; +Cc: Bob Pearson

In rxe_comp_queue_pkt() an incoming response packet skb is enqueued
on the resp_pkts queue, and then a decision is made whether to run
the completer task inline or schedule it. Finally the skb is
dereferenced to bump a 'hw' performance counter. This is wrong
because if the completer task is already running in a separate thread
it may have already processed and freed the skb, which can cause a
seg fault. This has been observed infrequently in testing at high
scale.

This patch fixes this by delaying the enqueue of the packet until
after the counter has been accessed.
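
To make the race concrete, here is a rough sketch of the problematic
interleaving (illustrative only, not taken from a trace):

	/* rxe_comp_queue_pkt()                completer task (other cpu)
	 *
	 * skb_queue_tail(&qp->resp_pkts, skb);
	 *                                      skb = skb_dequeue(&qp->resp_pkts);
	 *                                      ...process response...
	 *                                      kfree_skb(skb);
	 * SKB_TO_PKT(skb)->rxe                 <-- use after free
	 */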

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Fixes: 0b1e5b99a48b ("IB/rxe: Add port protocol stats")
---
 drivers/infiniband/sw/rxe/rxe_comp.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_comp.c b/drivers/infiniband/sw/rxe/rxe_comp.c
index b78b8c0856ab..c997b7cbf2a9 100644
--- a/drivers/infiniband/sw/rxe/rxe_comp.c
+++ b/drivers/infiniband/sw/rxe/rxe_comp.c
@@ -131,12 +131,12 @@ void rxe_comp_queue_pkt(struct rxe_qp *qp, struct sk_buff *skb)
 {
 	int must_sched;
 
-	skb_queue_tail(&qp->resp_pkts, skb);
-
-	must_sched = skb_queue_len(&qp->resp_pkts) > 1;
+	must_sched = skb_queue_len(&qp->resp_pkts) > 0;
 	if (must_sched != 0)
 		rxe_counter_inc(SKB_TO_PKT(skb)->rxe, RXE_CNT_COMPLETER_SCHED);
 
+	skb_queue_tail(&qp->resp_pkts, skb);
+
 	if (must_sched)
 		rxe_sched_task(&qp->comp.task);
 	else
-- 
2.43.0



* [PATCH for-next v3 02/12] RDMA/rxe: Allow good work requests to be executed
  2024-03-29 14:55 [PATCH for-next v3 00/12] RDMA/rxe: Various fixes and cleanups Bob Pearson
  2024-03-29 14:55 ` [PATCH for-next v3 01/12] RDMA/rxe: Fix seg fault in rxe_comp_queue_pkt Bob Pearson
@ 2024-03-29 14:55 ` Bob Pearson
  2024-03-30 15:01   ` Zhu Yanjun
  2024-03-29 14:55 ` [PATCH for-next v3 03/12] RDMA/rxe: Remove redundant scheduling of rxe_completer Bob Pearson
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 15+ messages in thread
From: Bob Pearson @ 2024-03-29 14:55 UTC (permalink / raw)
  To: yanjun.zhu, jgg, leon, linux-rdma, jhack; +Cc: Bob Pearson

A previous commit incorrectly added an 'if (!err)' check before
scheduling the requester task in rxe_post_send_kernel(). But if send
wrs were successfully added to the send queue before a bad wr, they
might never get executed.

This commit fixes this by scheduling the requester task if any wqes
were successfully posted in rxe_post_send_kernel() in rxe_verbs.c.

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Fixes: 5bf944f24129 ("RDMA/rxe: Add error messages")
---
 drivers/infiniband/sw/rxe/rxe_verbs.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.c b/drivers/infiniband/sw/rxe/rxe_verbs.c
index 614581989b38..a49784e5156c 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.c
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.c
@@ -888,6 +888,7 @@ static int rxe_post_send_kernel(struct rxe_qp *qp,
 {
 	int err = 0;
 	unsigned long flags;
+	int good = 0;
 
 	spin_lock_irqsave(&qp->sq.sq_lock, flags);
 	while (ibwr) {
@@ -895,12 +896,15 @@ static int rxe_post_send_kernel(struct rxe_qp *qp,
 		if (err) {
 			*bad_wr = ibwr;
 			break;
+		} else {
+			good++;
 		}
 		ibwr = ibwr->next;
 	}
 	spin_unlock_irqrestore(&qp->sq.sq_lock, flags);
 
-	if (!err)
+	/* kickoff processing of any posted wqes */
+	if (good)
 		rxe_sched_task(&qp->req.task);
 
 	spin_lock_irqsave(&qp->state_lock, flags);
-- 
2.43.0



* [PATCH for-next v3 03/12] RDMA/rxe: Remove redundant scheduling of rxe_completer
  2024-03-29 14:55 [PATCH for-next v3 00/12] RDMA/rxe: Various fixes and cleanups Bob Pearson
  2024-03-29 14:55 ` [PATCH for-next v3 01/12] RDMA/rxe: Fix seg fault in rxe_comp_queue_pkt Bob Pearson
  2024-03-29 14:55 ` [PATCH for-next v3 02/12] RDMA/rxe: Allow good work requests to be executed Bob Pearson
@ 2024-03-29 14:55 ` Bob Pearson
  2024-03-29 14:55 ` [PATCH for-next v3 04/12] RDMA/rxe: Merge request and complete tasks Bob Pearson
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 15+ messages in thread
From: Bob Pearson @ 2024-03-29 14:55 UTC (permalink / raw)
  To: yanjun.zhu, jgg, leon, linux-rdma, jhack; +Cc: Bob Pearson

In rxe_post_send_kernel(), if the qp is in the error state after
posting the work requests, the rxe_completer() task is scheduled.

But the only way to move the qp into the error state is to call
rxe_qp_error(), which also schedules the rxe_completer() task to
drain the queues. Calling it a second time has no effect. This
commit removes the redundant call.

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_verbs.c | 5 -----
 1 file changed, 5 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.c b/drivers/infiniband/sw/rxe/rxe_verbs.c
index a49784e5156c..71b0f834030f 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.c
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.c
@@ -907,11 +907,6 @@ static int rxe_post_send_kernel(struct rxe_qp *qp,
 	if (good)
 		rxe_sched_task(&qp->req.task);
 
-	spin_lock_irqsave(&qp->state_lock, flags);
-	if (qp_state(qp) == IB_QPS_ERR)
-		rxe_sched_task(&qp->comp.task);
-	spin_unlock_irqrestore(&qp->state_lock, flags);
-
 	return err;
 }
 
-- 
2.43.0



* [PATCH for-next v3 04/12] RDMA/rxe: Merge request and complete tasks
  2024-03-29 14:55 [PATCH for-next v3 00/12] RDMA/rxe: Various fixes and cleanups Bob Pearson
                   ` (2 preceding siblings ...)
  2024-03-29 14:55 ` [PATCH for-next v3 03/12] RDMA/rxe: Remove redundant scheduling of rxe_completer Bob Pearson
@ 2024-03-29 14:55 ` Bob Pearson
  2024-03-29 14:55 ` [PATCH for-next v3 05/12] RDMA/rxe: Remove save/rollback_state in rxe_requester Bob Pearson
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 15+ messages in thread
From: Bob Pearson @ 2024-03-29 14:55 UTC (permalink / raw)
  To: yanjun.zhu, jgg, leon, linux-rdma, jhack; +Cc: Bob Pearson

Currently the rxe driver has three work queue tasks per qp.
These are req.task, comp.task and resp.task, which call
rxe_requester(), rxe_completer() and rxe_responder() respectively,
either directly or on work queues. Each of these subroutines checks
whether there is work to be performed on the send queue, the
response packet queue or the request packet queue, and runs until
there is no work remaining, yielding the cpu and rescheduling itself
as needed.

This commit combines req.task and comp.task into a single send.task
and renames resp.task to recv.task. The combined send.task calls
rxe_requester() and rxe_completer() serially and continues until all
work on both the send queue and the response packet queue is done.

In various benchmarks the performance is either improved or
unchanged. At high scale there is a significant reduction in the
load on the cpu.

This is the first step in combining these two tasks. Once they are
serialized, cross rescheduling of req.task and comp.task can be
handled more efficiently by simply letting send.task continue to
run. This will be done in the next several patches.
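
As a simplified model (assuming the usual rxe task loop behavior of
re-invoking the task function while it returns 0), the combined task
is driven roughly like this:

	/* sketch only; not the literal rxe_task.c code */
	do {
		/* rxe_sender() runs rxe_requester() then rxe_completer()
		 * and returns -EAGAIN only when both report no more work
		 */
		ret = rxe_sender(qp);
	} while (ret == 0);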

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_comp.c        | 20 +++++-----
 drivers/infiniband/sw/rxe/rxe_hw_counters.c |  2 +-
 drivers/infiniband/sw/rxe/rxe_hw_counters.h |  2 +-
 drivers/infiniband/sw/rxe/rxe_loc.h         |  3 +-
 drivers/infiniband/sw/rxe/rxe_net.c         |  4 +-
 drivers/infiniband/sw/rxe/rxe_qp.c          | 44 ++++++++-------------
 drivers/infiniband/sw/rxe/rxe_req.c         | 25 ++++++++++--
 drivers/infiniband/sw/rxe/rxe_resp.c        |  6 +--
 drivers/infiniband/sw/rxe/rxe_verbs.c       |  6 +--
 drivers/infiniband/sw/rxe/rxe_verbs.h       |  6 +--
 10 files changed, 63 insertions(+), 55 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_comp.c b/drivers/infiniband/sw/rxe/rxe_comp.c
index c997b7cbf2a9..ea64a25fe876 100644
--- a/drivers/infiniband/sw/rxe/rxe_comp.c
+++ b/drivers/infiniband/sw/rxe/rxe_comp.c
@@ -122,7 +122,7 @@ void retransmit_timer(struct timer_list *t)
 	spin_lock_irqsave(&qp->state_lock, flags);
 	if (qp->valid) {
 		qp->comp.timeout = 1;
-		rxe_sched_task(&qp->comp.task);
+		rxe_sched_task(&qp->send_task);
 	}
 	spin_unlock_irqrestore(&qp->state_lock, flags);
 }
@@ -133,14 +133,14 @@ void rxe_comp_queue_pkt(struct rxe_qp *qp, struct sk_buff *skb)
 
 	must_sched = skb_queue_len(&qp->resp_pkts) > 0;
 	if (must_sched != 0)
-		rxe_counter_inc(SKB_TO_PKT(skb)->rxe, RXE_CNT_COMPLETER_SCHED);
+		rxe_counter_inc(SKB_TO_PKT(skb)->rxe, RXE_CNT_SENDER_SCHED);
 
 	skb_queue_tail(&qp->resp_pkts, skb);
 
 	if (must_sched)
-		rxe_sched_task(&qp->comp.task);
+		rxe_sched_task(&qp->send_task);
 	else
-		rxe_run_task(&qp->comp.task);
+		rxe_run_task(&qp->send_task);
 }
 
 static inline enum comp_state get_wqe(struct rxe_qp *qp,
@@ -325,7 +325,7 @@ static inline enum comp_state check_ack(struct rxe_qp *qp,
 					qp->comp.psn = pkt->psn;
 					if (qp->req.wait_psn) {
 						qp->req.wait_psn = 0;
-						rxe_sched_task(&qp->req.task);
+						rxe_sched_task(&qp->send_task);
 					}
 				}
 				return COMPST_ERROR_RETRY;
@@ -476,7 +476,7 @@ static void do_complete(struct rxe_qp *qp, struct rxe_send_wqe *wqe)
 	 */
 	if (qp->req.wait_fence) {
 		qp->req.wait_fence = 0;
-		rxe_sched_task(&qp->req.task);
+		rxe_sched_task(&qp->send_task);
 	}
 }
 
@@ -515,7 +515,7 @@ static inline enum comp_state complete_ack(struct rxe_qp *qp,
 		if (qp->req.need_rd_atomic) {
 			qp->comp.timeout_retry = 0;
 			qp->req.need_rd_atomic = 0;
-			rxe_sched_task(&qp->req.task);
+			rxe_sched_task(&qp->send_task);
 		}
 	}
 
@@ -541,7 +541,7 @@ static inline enum comp_state complete_wqe(struct rxe_qp *qp,
 
 		if (qp->req.wait_psn) {
 			qp->req.wait_psn = 0;
-			rxe_sched_task(&qp->req.task);
+			rxe_sched_task(&qp->send_task);
 		}
 	}
 
@@ -737,7 +737,7 @@ int rxe_completer(struct rxe_qp *qp)
 
 			if (qp->req.wait_psn) {
 				qp->req.wait_psn = 0;
-				rxe_sched_task(&qp->req.task);
+				rxe_sched_task(&qp->send_task);
 			}
 
 			state = COMPST_DONE;
@@ -792,7 +792,7 @@ int rxe_completer(struct rxe_qp *qp)
 							RXE_CNT_COMP_RETRY);
 					qp->req.need_retry = 1;
 					qp->comp.started_retry = 1;
-					rxe_sched_task(&qp->req.task);
+					rxe_sched_task(&qp->send_task);
 				}
 				goto done;
 
diff --git a/drivers/infiniband/sw/rxe/rxe_hw_counters.c b/drivers/infiniband/sw/rxe/rxe_hw_counters.c
index a012522b577a..437917a7d8f2 100644
--- a/drivers/infiniband/sw/rxe/rxe_hw_counters.c
+++ b/drivers/infiniband/sw/rxe/rxe_hw_counters.c
@@ -14,7 +14,7 @@ static const struct rdma_stat_desc rxe_counter_descs[] = {
 	[RXE_CNT_RCV_RNR].name             =  "rcvd_rnr_err",
 	[RXE_CNT_SND_RNR].name             =  "send_rnr_err",
 	[RXE_CNT_RCV_SEQ_ERR].name         =  "rcvd_seq_err",
-	[RXE_CNT_COMPLETER_SCHED].name     =  "ack_deferred",
+	[RXE_CNT_SENDER_SCHED].name        =  "ack_deferred",
 	[RXE_CNT_RETRY_EXCEEDED].name      =  "retry_exceeded_err",
 	[RXE_CNT_RNR_RETRY_EXCEEDED].name  =  "retry_rnr_exceeded_err",
 	[RXE_CNT_COMP_RETRY].name          =  "completer_retry_err",
diff --git a/drivers/infiniband/sw/rxe/rxe_hw_counters.h b/drivers/infiniband/sw/rxe/rxe_hw_counters.h
index 71f4d4fa9dc8..051f9e1c3852 100644
--- a/drivers/infiniband/sw/rxe/rxe_hw_counters.h
+++ b/drivers/infiniband/sw/rxe/rxe_hw_counters.h
@@ -18,7 +18,7 @@ enum rxe_counters {
 	RXE_CNT_RCV_RNR,
 	RXE_CNT_SND_RNR,
 	RXE_CNT_RCV_SEQ_ERR,
-	RXE_CNT_COMPLETER_SCHED,
+	RXE_CNT_SENDER_SCHED,
 	RXE_CNT_RETRY_EXCEEDED,
 	RXE_CNT_RNR_RETRY_EXCEEDED,
 	RXE_CNT_COMP_RETRY,
diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
index 746110898a0e..ded46119151b 100644
--- a/drivers/infiniband/sw/rxe/rxe_loc.h
+++ b/drivers/infiniband/sw/rxe/rxe_loc.h
@@ -164,7 +164,8 @@ void rxe_dealloc(struct ib_device *ib_dev);
 
 int rxe_completer(struct rxe_qp *qp);
 int rxe_requester(struct rxe_qp *qp);
-int rxe_responder(struct rxe_qp *qp);
+int rxe_sender(struct rxe_qp *qp);
+int rxe_receiver(struct rxe_qp *qp);
 
 /* rxe_icrc.c */
 int rxe_icrc_init(struct rxe_dev *rxe);
diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
index cd59666158b1..928508558df4 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.c
+++ b/drivers/infiniband/sw/rxe/rxe_net.c
@@ -351,7 +351,7 @@ static void rxe_skb_tx_dtor(struct sk_buff *skb)
 
 	if (unlikely(qp->need_req_skb &&
 		     skb_out < RXE_INFLIGHT_SKBS_PER_QP_LOW))
-		rxe_sched_task(&qp->req.task);
+		rxe_sched_task(&qp->send_task);
 
 	rxe_put(qp);
 }
@@ -443,7 +443,7 @@ int rxe_xmit_packet(struct rxe_qp *qp, struct rxe_pkt_info *pkt,
 	if ((qp_type(qp) != IB_QPT_RC) &&
 	    (pkt->mask & RXE_END_MASK)) {
 		pkt->wqe->state = wqe_state_done;
-		rxe_sched_task(&qp->comp.task);
+		rxe_sched_task(&qp->send_task);
 	}
 
 	rxe_counter_inc(rxe, RXE_CNT_SENT_PKTS);
diff --git a/drivers/infiniband/sw/rxe/rxe_qp.c b/drivers/infiniband/sw/rxe/rxe_qp.c
index e3589c02013e..c7d99063594b 100644
--- a/drivers/infiniband/sw/rxe/rxe_qp.c
+++ b/drivers/infiniband/sw/rxe/rxe_qp.c
@@ -265,8 +265,7 @@ static int rxe_qp_init_req(struct rxe_dev *rxe, struct rxe_qp *qp,
 	qp->req.opcode		= -1;
 	qp->comp.opcode		= -1;
 
-	rxe_init_task(&qp->req.task, qp, rxe_requester);
-	rxe_init_task(&qp->comp.task, qp, rxe_completer);
+	rxe_init_task(&qp->send_task, qp, rxe_sender);
 
 	qp->qp_timeout_jiffies = 0; /* Can't be set for UD/UC in modify_qp */
 	if (init->qp_type == IB_QPT_RC) {
@@ -337,7 +336,7 @@ static int rxe_qp_init_resp(struct rxe_dev *rxe, struct rxe_qp *qp,
 			return err;
 	}
 
-	rxe_init_task(&qp->resp.task, qp, rxe_responder);
+	rxe_init_task(&qp->recv_task, qp, rxe_receiver);
 
 	qp->resp.opcode		= OPCODE_NONE;
 	qp->resp.msn		= 0;
@@ -514,14 +513,12 @@ int rxe_qp_chk_attr(struct rxe_dev *rxe, struct rxe_qp *qp,
 static void rxe_qp_reset(struct rxe_qp *qp)
 {
 	/* stop tasks from running */
-	rxe_disable_task(&qp->resp.task);
-	rxe_disable_task(&qp->comp.task);
-	rxe_disable_task(&qp->req.task);
+	rxe_disable_task(&qp->recv_task);
+	rxe_disable_task(&qp->send_task);
 
 	/* drain work and packet queuesc */
-	rxe_requester(qp);
-	rxe_completer(qp);
-	rxe_responder(qp);
+	rxe_sender(qp);
+	rxe_receiver(qp);
 
 	if (qp->rq.queue)
 		rxe_queue_reset(qp->rq.queue);
@@ -548,9 +545,8 @@ static void rxe_qp_reset(struct rxe_qp *qp)
 	cleanup_rd_atomic_resources(qp);
 
 	/* reenable tasks */
-	rxe_enable_task(&qp->resp.task);
-	rxe_enable_task(&qp->comp.task);
-	rxe_enable_task(&qp->req.task);
+	rxe_enable_task(&qp->recv_task);
+	rxe_enable_task(&qp->send_task);
 }
 
 /* move the qp to the error state */
@@ -562,9 +558,8 @@ void rxe_qp_error(struct rxe_qp *qp)
 	qp->attr.qp_state = IB_QPS_ERR;
 
 	/* drain work and packet queues */
-	rxe_sched_task(&qp->resp.task);
-	rxe_sched_task(&qp->comp.task);
-	rxe_sched_task(&qp->req.task);
+	rxe_sched_task(&qp->recv_task);
+	rxe_sched_task(&qp->send_task);
 	spin_unlock_irqrestore(&qp->state_lock, flags);
 }
 
@@ -575,8 +570,7 @@ static void rxe_qp_sqd(struct rxe_qp *qp, struct ib_qp_attr *attr,
 
 	spin_lock_irqsave(&qp->state_lock, flags);
 	qp->attr.sq_draining = 1;
-	rxe_sched_task(&qp->comp.task);
-	rxe_sched_task(&qp->req.task);
+	rxe_sched_task(&qp->send_task);
 	spin_unlock_irqrestore(&qp->state_lock, flags);
 }
 
@@ -821,19 +815,15 @@ static void rxe_qp_do_cleanup(struct work_struct *work)
 		del_timer_sync(&qp->rnr_nak_timer);
 	}
 
-	if (qp->resp.task.func)
-		rxe_cleanup_task(&qp->resp.task);
+	if (qp->recv_task.func)
+		rxe_cleanup_task(&qp->recv_task);
 
-	if (qp->req.task.func)
-		rxe_cleanup_task(&qp->req.task);
-
-	if (qp->comp.task.func)
-		rxe_cleanup_task(&qp->comp.task);
+	if (qp->send_task.func)
+		rxe_cleanup_task(&qp->send_task);
 
 	/* flush out any receive wr's or pending requests */
-	rxe_requester(qp);
-	rxe_completer(qp);
-	rxe_responder(qp);
+	rxe_sender(qp);
+	rxe_receiver(qp);
 
 	if (qp->sq.queue)
 		rxe_queue_cleanup(qp->sq.queue);
diff --git a/drivers/infiniband/sw/rxe/rxe_req.c b/drivers/infiniband/sw/rxe/rxe_req.c
index d8c41fd626a9..31a611ced3c5 100644
--- a/drivers/infiniband/sw/rxe/rxe_req.c
+++ b/drivers/infiniband/sw/rxe/rxe_req.c
@@ -108,7 +108,7 @@ void rnr_nak_timer(struct timer_list *t)
 		/* request a send queue retry */
 		qp->req.need_retry = 1;
 		qp->req.wait_for_rnr_timer = 0;
-		rxe_sched_task(&qp->req.task);
+		rxe_sched_task(&qp->send_task);
 	}
 	spin_unlock_irqrestore(&qp->state_lock, flags);
 }
@@ -659,7 +659,7 @@ static int rxe_do_local_ops(struct rxe_qp *qp, struct rxe_send_wqe *wqe)
 	 * which can lead to a deadlock. So go ahead and complete
 	 * it now.
 	 */
-	rxe_sched_task(&qp->comp.task);
+	rxe_sched_task(&qp->send_task);
 
 	return 0;
 }
@@ -786,7 +786,7 @@ int rxe_requester(struct rxe_qp *qp)
 						       qp->req.wqe_index);
 			wqe->state = wqe_state_done;
 			wqe->status = IB_WC_SUCCESS;
-			rxe_sched_task(&qp->comp.task);
+			rxe_sched_task(&qp->send_task);
 			goto done;
 		}
 		payload = mtu;
@@ -855,7 +855,7 @@ int rxe_requester(struct rxe_qp *qp)
 		 */
 		qp->need_req_skb = 1;
 
-		rxe_sched_task(&qp->req.task);
+		rxe_sched_task(&qp->send_task);
 		goto exit;
 	}
 
@@ -878,3 +878,20 @@ int rxe_requester(struct rxe_qp *qp)
 out:
 	return ret;
 }
+
+int rxe_sender(struct rxe_qp *qp)
+{
+	int req_ret;
+	int comp_ret;
+
+	/* process the send queue */
+	req_ret = rxe_requester(qp);
+
+	/* process the response queue */
+	comp_ret = rxe_completer(qp);
+
+	/* exit the task loop if both requester and completer
+	 * are ready
+	 */
+	return (req_ret && comp_ret) ? -EAGAIN : 0;
+}
diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
index 963382f625d7..3ce7a32b5dcf 100644
--- a/drivers/infiniband/sw/rxe/rxe_resp.c
+++ b/drivers/infiniband/sw/rxe/rxe_resp.c
@@ -58,9 +58,9 @@ void rxe_resp_queue_pkt(struct rxe_qp *qp, struct sk_buff *skb)
 			(skb_queue_len(&qp->req_pkts) > 1);
 
 	if (must_sched)
-		rxe_sched_task(&qp->resp.task);
+		rxe_sched_task(&qp->recv_task);
 	else
-		rxe_run_task(&qp->resp.task);
+		rxe_run_task(&qp->recv_task);
 }
 
 static inline enum resp_states get_req(struct rxe_qp *qp,
@@ -1485,7 +1485,7 @@ static void flush_recv_queue(struct rxe_qp *qp, bool notify)
 	qp->resp.wqe = NULL;
 }
 
-int rxe_responder(struct rxe_qp *qp)
+int rxe_receiver(struct rxe_qp *qp)
 {
 	struct rxe_dev *rxe = to_rdev(qp->ibqp.device);
 	enum resp_states state;
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.c b/drivers/infiniband/sw/rxe/rxe_verbs.c
index 71b0f834030f..d07f7bd3b2ae 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.c
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.c
@@ -905,7 +905,7 @@ static int rxe_post_send_kernel(struct rxe_qp *qp,
 
 	/* kickoff processing of any posted wqes */
 	if (good)
-		rxe_sched_task(&qp->req.task);
+		rxe_sched_task(&qp->send_task);
 
 	return err;
 }
@@ -935,7 +935,7 @@ static int rxe_post_send(struct ib_qp *ibqp, const struct ib_send_wr *wr,
 
 	if (qp->is_user) {
 		/* Utilize process context to do protocol processing */
-		rxe_run_task(&qp->req.task);
+		rxe_run_task(&qp->send_task);
 	} else {
 		err = rxe_post_send_kernel(qp, wr, bad_wr);
 		if (err)
@@ -1045,7 +1045,7 @@ static int rxe_post_recv(struct ib_qp *ibqp, const struct ib_recv_wr *wr,
 
 	spin_lock_irqsave(&qp->state_lock, flags);
 	if (qp_state(qp) == IB_QPS_ERR)
-		rxe_sched_task(&qp->resp.task);
+		rxe_sched_task(&qp->recv_task);
 	spin_unlock_irqrestore(&qp->state_lock, flags);
 
 	return err;
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
index ccb9d19ffe8a..af8939b8c7a1 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.h
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
@@ -113,7 +113,6 @@ struct rxe_req_info {
 	int			need_retry;
 	int			wait_for_rnr_timer;
 	int			noack_pkts;
-	struct rxe_task		task;
 };
 
 struct rxe_comp_info {
@@ -124,7 +123,6 @@ struct rxe_comp_info {
 	int			started_retry;
 	u32			retry_cnt;
 	u32			rnr_retry;
-	struct rxe_task		task;
 };
 
 enum rdatm_res_state {
@@ -196,7 +194,6 @@ struct rxe_resp_info {
 	unsigned int		res_head;
 	unsigned int		res_tail;
 	struct resp_res		*res;
-	struct rxe_task		task;
 };
 
 struct rxe_qp {
@@ -229,6 +226,9 @@ struct rxe_qp {
 	struct sk_buff_head	req_pkts;
 	struct sk_buff_head	resp_pkts;
 
+	struct rxe_task		send_task;
+	struct rxe_task		recv_task;
+
 	struct rxe_req_info	req;
 	struct rxe_comp_info	comp;
 	struct rxe_resp_info	resp;
-- 
2.43.0



* [PATCH for-next v3 05/12] RDMA/rxe: Remove save/rollback_state in rxe_requester
  2024-03-29 14:55 [PATCH for-next v3 00/12] RDMA/rxe: Various fixes and cleanups Bob Pearson
                   ` (3 preceding siblings ...)
  2024-03-29 14:55 ` [PATCH for-next v3 04/12] RDMA/rxe: Merge request and complete tasks Bob Pearson
@ 2024-03-29 14:55 ` Bob Pearson
  2024-03-29 14:55 ` [PATCH for-next v3 06/12] RDMA/rxe: Don't schedule rxe_completer from rxe_requester Bob Pearson
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 15+ messages in thread
From: Bob Pearson @ 2024-03-29 14:55 UTC (permalink / raw)
  To: yanjun.zhu, jgg, leon, linux-rdma, jhack; +Cc: Bob Pearson

Now that req.task and comp.task are merged it is no longer
necessary to call save_state() before calling rxe_xmit_packet() and
rollback_state() if rxe_xmit_packet() fails. This was done
originally to prevent races between rxe_completer() and
rxe_requester(), which now cannot happen.

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_req.c | 40 ++---------------------------
 1 file changed, 2 insertions(+), 38 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_req.c b/drivers/infiniband/sw/rxe/rxe_req.c
index 31a611ced3c5..e20462c3040d 100644
--- a/drivers/infiniband/sw/rxe/rxe_req.c
+++ b/drivers/infiniband/sw/rxe/rxe_req.c
@@ -573,30 +573,6 @@ static void update_wqe_psn(struct rxe_qp *qp,
 		qp->req.psn = (qp->req.psn + 1) & BTH_PSN_MASK;
 }
 
-static void save_state(struct rxe_send_wqe *wqe,
-		       struct rxe_qp *qp,
-		       struct rxe_send_wqe *rollback_wqe,
-		       u32 *rollback_psn)
-{
-	rollback_wqe->state = wqe->state;
-	rollback_wqe->first_psn = wqe->first_psn;
-	rollback_wqe->last_psn = wqe->last_psn;
-	rollback_wqe->dma = wqe->dma;
-	*rollback_psn = qp->req.psn;
-}
-
-static void rollback_state(struct rxe_send_wqe *wqe,
-			   struct rxe_qp *qp,
-			   struct rxe_send_wqe *rollback_wqe,
-			   u32 rollback_psn)
-{
-	wqe->state = rollback_wqe->state;
-	wqe->first_psn = rollback_wqe->first_psn;
-	wqe->last_psn = rollback_wqe->last_psn;
-	wqe->dma = rollback_wqe->dma;
-	qp->req.psn = rollback_psn;
-}
-
 static void update_state(struct rxe_qp *qp, struct rxe_pkt_info *pkt)
 {
 	qp->req.opcode = pkt->opcode;
@@ -676,8 +652,6 @@ int rxe_requester(struct rxe_qp *qp)
 	int opcode;
 	int err;
 	int ret;
-	struct rxe_send_wqe rollback_wqe;
-	u32 rollback_psn;
 	struct rxe_queue *q = qp->sq.queue;
 	struct rxe_ah *ah;
 	struct rxe_av *av;
@@ -799,9 +773,6 @@ int rxe_requester(struct rxe_qp *qp)
 	pkt.mask = rxe_opcode[opcode].mask;
 	pkt.wqe = wqe;
 
-	/* save wqe state before we build and send packet */
-	save_state(wqe, qp, &rollback_wqe, &rollback_psn);
-
 	av = rxe_get_av(&pkt, &ah);
 	if (unlikely(!av)) {
 		rxe_dbg_qp(qp, "Failed no address vector\n");
@@ -834,10 +805,6 @@ int rxe_requester(struct rxe_qp *qp)
 	if (ah)
 		rxe_put(ah);
 
-	/* update wqe state as though we had sent it */
-	update_wqe_state(qp, wqe, &pkt);
-	update_wqe_psn(qp, wqe, &pkt, payload);
-
 	err = rxe_xmit_packet(qp, &pkt, skb);
 	if (err) {
 		if (err != -EAGAIN) {
@@ -845,11 +812,6 @@ int rxe_requester(struct rxe_qp *qp)
 			goto err;
 		}
 
-		/* the packet was dropped so reset wqe to the state
-		 * before we sent it so we can try to resend
-		 */
-		rollback_state(wqe, qp, &rollback_wqe, rollback_psn);
-
 		/* force a delay until the dropped packet is freed and
 		 * the send queue is drained below the low water mark
 		 */
@@ -859,6 +821,8 @@ int rxe_requester(struct rxe_qp *qp)
 		goto exit;
 	}
 
+	update_wqe_state(qp, wqe, &pkt);
+	update_wqe_psn(qp, wqe, &pkt, payload);
 	update_state(qp, &pkt);
 
 	/* A non-zero return value will cause rxe_do_task to
-- 
2.43.0



* [PATCH for-next v3 06/12] RDMA/rxe: Don't schedule rxe_completer from rxe_requester
  2024-03-29 14:55 [PATCH for-next v3 00/12] RDMA/rxe: Various fixes and cleanups Bob Pearson
                   ` (4 preceding siblings ...)
  2024-03-29 14:55 ` [PATCH for-next v3 05/12] RDMA/rxe: Remove save/rollback_state in rxe_requester Bob Pearson
@ 2024-03-29 14:55 ` Bob Pearson
  2024-03-29 14:55 ` [PATCH for-next v3 07/12] RDMA/rxe: Don't call rxe_requester from rxe_completer Bob Pearson
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 15+ messages in thread
From: Bob Pearson @ 2024-03-29 14:55 UTC (permalink / raw)
  To: yanjun.zhu, jgg, leon, linux-rdma, jhack; +Cc: Bob Pearson

Now that rxe_completer() is always called serially after
rxe_requester() there is no reason to schedule rxe_completer()
from rxe_requester().

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_net.c | 6 ------
 drivers/infiniband/sw/rxe/rxe_req.c | 9 ++-------
 2 files changed, 2 insertions(+), 13 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
index 928508558df4..a2fc118e7ec1 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.c
+++ b/drivers/infiniband/sw/rxe/rxe_net.c
@@ -440,12 +440,6 @@ int rxe_xmit_packet(struct rxe_qp *qp, struct rxe_pkt_info *pkt,
 		return err;
 	}
 
-	if ((qp_type(qp) != IB_QPT_RC) &&
-	    (pkt->mask & RXE_END_MASK)) {
-		pkt->wqe->state = wqe_state_done;
-		rxe_sched_task(&qp->send_task);
-	}
-
 	rxe_counter_inc(rxe, RXE_CNT_SENT_PKTS);
 	goto done;
 
diff --git a/drivers/infiniband/sw/rxe/rxe_req.c b/drivers/infiniband/sw/rxe/rxe_req.c
index e20462c3040d..34c55dee0774 100644
--- a/drivers/infiniband/sw/rxe/rxe_req.c
+++ b/drivers/infiniband/sw/rxe/rxe_req.c
@@ -545,6 +545,8 @@ static void update_wqe_state(struct rxe_qp *qp,
 	if (pkt->mask & RXE_END_MASK) {
 		if (qp_type(qp) == IB_QPT_RC)
 			wqe->state = wqe_state_pending;
+		else
+			wqe->state = wqe_state_done;
 	} else {
 		wqe->state = wqe_state_processing;
 	}
@@ -631,12 +633,6 @@ static int rxe_do_local_ops(struct rxe_qp *qp, struct rxe_send_wqe *wqe)
 	wqe->status = IB_WC_SUCCESS;
 	qp->req.wqe_index = queue_next_index(qp->sq.queue, qp->req.wqe_index);
 
-	/* There is no ack coming for local work requests
-	 * which can lead to a deadlock. So go ahead and complete
-	 * it now.
-	 */
-	rxe_sched_task(&qp->send_task);
-
 	return 0;
 }
 
@@ -760,7 +756,6 @@ int rxe_requester(struct rxe_qp *qp)
 						       qp->req.wqe_index);
 			wqe->state = wqe_state_done;
 			wqe->status = IB_WC_SUCCESS;
-			rxe_sched_task(&qp->send_task);
 			goto done;
 		}
 		payload = mtu;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH for-next v3 07/12] RDMA/rxe: Don't call rxe_requester from rxe_completer
  2024-03-29 14:55 [PATCH for-next v3 00/12] RDMA/rxe: Various fixes and cleanups Bob Pearson
                   ` (5 preceding siblings ...)
  2024-03-29 14:55 ` [PATCH for-next v3 06/12] RDMA/rxe: Don't schedule rxe_completer from rxe_requester Bob Pearson
@ 2024-03-29 14:55 ` Bob Pearson
  2024-03-29 14:55 ` [PATCH for-next v3 08/12] RDMA/rxe: Don't call direct between tasks Bob Pearson
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 15+ messages in thread
From: Bob Pearson @ 2024-03-29 14:55 UTC (permalink / raw)
  To: yanjun.zhu, jgg, leon, linux-rdma, jhack; +Cc: Bob Pearson

Instead of rescheduling rxe_requester() from rxe_completer(), just
extend the duration of rxe_sender() by one pass. Setting the
qp->req.again flag forces rxe_completer() to return 0, which will
cause rxe_sender() to be called at least one more time.
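
The intended control flow, condensed from the diff below:

	/* in rxe_completer(): instead of rescheduling the task,
	 * record that the requester needs another pass
	 */
	qp->req.again = 1;

	/* at completer exit: a set flag turns -EAGAIN into 0, so the
	 * task loop invokes rxe_sender() (and rxe_requester()) again
	 */
	ret = (qp->req.again) ? 0 : -EAGAIN;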

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_comp.c  | 17 ++++++++++-------
 drivers/infiniband/sw/rxe/rxe_verbs.h |  1 +
 2 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_comp.c b/drivers/infiniband/sw/rxe/rxe_comp.c
index ea64a25fe876..357c1d516efb 100644
--- a/drivers/infiniband/sw/rxe/rxe_comp.c
+++ b/drivers/infiniband/sw/rxe/rxe_comp.c
@@ -325,7 +325,7 @@ static inline enum comp_state check_ack(struct rxe_qp *qp,
 					qp->comp.psn = pkt->psn;
 					if (qp->req.wait_psn) {
 						qp->req.wait_psn = 0;
-						rxe_sched_task(&qp->send_task);
+						qp->req.again = 1;
 					}
 				}
 				return COMPST_ERROR_RETRY;
@@ -476,7 +476,7 @@ static void do_complete(struct rxe_qp *qp, struct rxe_send_wqe *wqe)
 	 */
 	if (qp->req.wait_fence) {
 		qp->req.wait_fence = 0;
-		rxe_sched_task(&qp->send_task);
+		qp->req.again = 1;
 	}
 }
 
@@ -515,7 +515,7 @@ static inline enum comp_state complete_ack(struct rxe_qp *qp,
 		if (qp->req.need_rd_atomic) {
 			qp->comp.timeout_retry = 0;
 			qp->req.need_rd_atomic = 0;
-			rxe_sched_task(&qp->send_task);
+			qp->req.again = 1;
 		}
 	}
 
@@ -541,7 +541,7 @@ static inline enum comp_state complete_wqe(struct rxe_qp *qp,
 
 		if (qp->req.wait_psn) {
 			qp->req.wait_psn = 0;
-			rxe_sched_task(&qp->send_task);
+			qp->req.again = 1;
 		}
 	}
 
@@ -654,6 +654,8 @@ int rxe_completer(struct rxe_qp *qp)
 	int ret;
 	unsigned long flags;
 
+	qp->req.again = 0;
+
 	spin_lock_irqsave(&qp->state_lock, flags);
 	if (!qp->valid || qp_state(qp) == IB_QPS_ERR ||
 			  qp_state(qp) == IB_QPS_RESET) {
@@ -737,7 +739,7 @@ int rxe_completer(struct rxe_qp *qp)
 
 			if (qp->req.wait_psn) {
 				qp->req.wait_psn = 0;
-				rxe_sched_task(&qp->send_task);
+				qp->req.again = 1;
 			}
 
 			state = COMPST_DONE;
@@ -792,7 +794,7 @@ int rxe_completer(struct rxe_qp *qp)
 							RXE_CNT_COMP_RETRY);
 					qp->req.need_retry = 1;
 					qp->comp.started_retry = 1;
-					rxe_sched_task(&qp->send_task);
+					qp->req.again = 1;
 				}
 				goto done;
 
@@ -843,8 +845,9 @@ int rxe_completer(struct rxe_qp *qp)
 	ret = 0;
 	goto out;
 exit:
-	ret = -EAGAIN;
+	ret = (qp->req.again) ? 0 : -EAGAIN;
 out:
+	qp->req.again = 0;
 	if (pkt)
 		free_pkt(pkt);
 	return ret;
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
index af8939b8c7a1..3c1354f82283 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.h
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
@@ -113,6 +113,7 @@ struct rxe_req_info {
 	int			need_retry;
 	int			wait_for_rnr_timer;
 	int			noack_pkts;
+	int			again;
 };
 
 struct rxe_comp_info {
-- 
2.43.0



* [PATCH for-next v3 08/12] RDMA/rxe: Don't call direct between tasks
  2024-03-29 14:55 [PATCH for-next v3 00/12] RDMA/rxe: Various fixes and cleanups Bob Pearson
                   ` (6 preceding siblings ...)
  2024-03-29 14:55 ` [PATCH for-next v3 07/12] RDMA/rxe: Don't call rxe_requester from rxe_completer Bob Pearson
@ 2024-03-29 14:55 ` Bob Pearson
  2024-03-29 14:55 ` [PATCH for-next v3 09/12] RDMA/rxe: Fix incorrect rxe_put in error path Bob Pearson
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 15+ messages in thread
From: Bob Pearson @ 2024-03-29 14:55 UTC (permalink / raw)
  To: yanjun.zhu, jgg, leon, linux-rdma, jhack; +Cc: Bob Pearson

Replace calls to rxe_run_task() with rxe_sched_task().
This prevents the tasks from all running on the same cpu.

This change slightly reduces performance for single qp send and write
benchmarks in loopback mode but greatly improves the performance
with multiple qps, because if rxe_run_task() is used all the work
tends to be performed on one cpu. For actual on-the-wire benchmarks
there is no noticeable performance change.

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_comp.c  | 13 ++-----------
 drivers/infiniband/sw/rxe/rxe_resp.c  | 12 +-----------
 drivers/infiniband/sw/rxe/rxe_verbs.c |  2 +-
 3 files changed, 4 insertions(+), 23 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_comp.c b/drivers/infiniband/sw/rxe/rxe_comp.c
index 357c1d516efb..d48af2180745 100644
--- a/drivers/infiniband/sw/rxe/rxe_comp.c
+++ b/drivers/infiniband/sw/rxe/rxe_comp.c
@@ -129,18 +129,9 @@ void retransmit_timer(struct timer_list *t)
 
 void rxe_comp_queue_pkt(struct rxe_qp *qp, struct sk_buff *skb)
 {
-	int must_sched;
-
-	must_sched = skb_queue_len(&qp->resp_pkts) > 0;
-	if (must_sched != 0)
-		rxe_counter_inc(SKB_TO_PKT(skb)->rxe, RXE_CNT_SENDER_SCHED);
-
+	rxe_counter_inc(SKB_TO_PKT(skb)->rxe, RXE_CNT_SENDER_SCHED);
 	skb_queue_tail(&qp->resp_pkts, skb);
-
-	if (must_sched)
-		rxe_sched_task(&qp->send_task);
-	else
-		rxe_run_task(&qp->send_task);
+	rxe_sched_task(&qp->send_task);
 }
 
 static inline enum comp_state get_wqe(struct rxe_qp *qp,
diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
index 3ce7a32b5dcf..c6a7fa3054fa 100644
--- a/drivers/infiniband/sw/rxe/rxe_resp.c
+++ b/drivers/infiniband/sw/rxe/rxe_resp.c
@@ -49,18 +49,8 @@ static char *resp_state_name[] = {
 /* rxe_recv calls here to add a request packet to the input queue */
 void rxe_resp_queue_pkt(struct rxe_qp *qp, struct sk_buff *skb)
 {
-	int must_sched;
-	struct rxe_pkt_info *pkt = SKB_TO_PKT(skb);
-
 	skb_queue_tail(&qp->req_pkts, skb);
-
-	must_sched = (pkt->opcode == IB_OPCODE_RC_RDMA_READ_REQUEST) ||
-			(skb_queue_len(&qp->req_pkts) > 1);
-
-	if (must_sched)
-		rxe_sched_task(&qp->recv_task);
-	else
-		rxe_run_task(&qp->recv_task);
+	rxe_sched_task(&qp->recv_task);
 }
 
 static inline enum resp_states get_req(struct rxe_qp *qp,
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.c b/drivers/infiniband/sw/rxe/rxe_verbs.c
index d07f7bd3b2ae..c7d4d8ab5a09 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.c
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.c
@@ -935,7 +935,7 @@ static int rxe_post_send(struct ib_qp *ibqp, const struct ib_send_wr *wr,
 
 	if (qp->is_user) {
 		/* Utilize process context to do protocol processing */
-		rxe_run_task(&qp->send_task);
+		rxe_sched_task(&qp->send_task);
 	} else {
 		err = rxe_post_send_kernel(qp, wr, bad_wr);
 		if (err)
-- 
2.43.0



* [PATCH for-next v3 09/12] RDMA/rxe: Fix incorrect rxe_put in error path
  2024-03-29 14:55 [PATCH for-next v3 00/12] RDMA/rxe: Various fixes and cleanups Bob Pearson
                   ` (7 preceding siblings ...)
  2024-03-29 14:55 ` [PATCH for-next v3 08/12] RDMA/rxe: Don't call direct between tasks Bob Pearson
@ 2024-03-29 14:55 ` Bob Pearson
  2024-03-29 14:55 ` [PATCH for-next v3 10/12] RDMA/rxe: Make rxe_loopback match rxe_send behavior Bob Pearson
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 15+ messages in thread
From: Bob Pearson @ 2024-03-29 14:55 UTC (permalink / raw)
  To: yanjun.zhu, jgg, leon, linux-rdma, jhack; +Cc: Bob Pearson

In rxe_send() a ref is taken on the qp to keep it alive until
kfree_skb() has a chance to call the skb destructor
rxe_skb_tx_dtor(), which drops the reference. If the packet has an
incorrect protocol the error path just calls kfree_skb(), which will
call the destructor, which will drop the ref. Currently the driver
also calls rxe_put(), which is incorrect. Additionally, since the
packets sent to rxe_send() are under the control of the driver and
it only ever produces IPv4 or IPv6 packets, the simplest fix is to
remove all the code in this block.
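
In other words, a sketch of the refcounting described above:

	rxe_get(pkt->qp);	/* ref held for the skb's lifetime */
	...
	kfree_skb(skb);		/* -> rxe_skb_tx_dtor() -> rxe_put(qp) */
	rxe_put(pkt->qp);	/* second put: one put too many */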

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
Fixes: 9eb7f8e44d13 ("IB/rxe: Move refcounting earlier in rxe_send()")
---
 drivers/infiniband/sw/rxe/rxe_net.c | 12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
index a2fc118e7ec1..d81440038f91 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.c
+++ b/drivers/infiniband/sw/rxe/rxe_net.c
@@ -366,18 +366,10 @@ static int rxe_send(struct sk_buff *skb, struct rxe_pkt_info *pkt)
 	rxe_get(pkt->qp);
 	atomic_inc(&pkt->qp->skb_out);
 
-	if (skb->protocol == htons(ETH_P_IP)) {
+	if (skb->protocol == htons(ETH_P_IP))
 		err = ip_local_out(dev_net(skb_dst(skb)->dev), skb->sk, skb);
-	} else if (skb->protocol == htons(ETH_P_IPV6)) {
+	else
 		err = ip6_local_out(dev_net(skb_dst(skb)->dev), skb->sk, skb);
-	} else {
-		rxe_dbg_qp(pkt->qp, "Unknown layer 3 protocol: %d\n",
-				skb->protocol);
-		atomic_dec(&pkt->qp->skb_out);
-		rxe_put(pkt->qp);
-		kfree_skb(skb);
-		return -EINVAL;
-	}
 
 	if (unlikely(net_xmit_eval(err))) {
 		rxe_dbg_qp(pkt->qp, "error sending packet: %d\n", err);
-- 
2.43.0



* [PATCH for-next v3 10/12] RDMA/rxe: Make rxe_loopback match rxe_send behavior
  2024-03-29 14:55 [PATCH for-next v3 00/12] RDMA/rxe: Various fixes and cleanups Bob Pearson
                   ` (8 preceding siblings ...)
  2024-03-29 14:55 ` [PATCH for-next v3 09/12] RDMA/rxe: Fix incorrect rxe_put in error path Bob Pearson
@ 2024-03-29 14:55 ` Bob Pearson
  2024-03-29 14:55 ` [PATCH for-next v3 11/12] RDMA/rxe: Get rid of pkt resend on err Bob Pearson
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 15+ messages in thread
From: Bob Pearson @ 2024-03-29 14:55 UTC (permalink / raw)
  To: yanjun.zhu, jgg, leon, linux-rdma, jhack; +Cc: Bob Pearson

The rxe send path currently counts the number of skbs outstanding
between the rxe driver and the ethernet driver to prevent too many
packets from accumulating while waiting to send. This patch makes
the local loopback path behave the same way. The loopback path
forwards the packets to the receive path, which will eventually call
kfree_skb() on all packets and drop the qp references. This makes
the loopback path more useful for software testing.

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_net.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
index d81440038f91..d081409450a4 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.c
+++ b/drivers/infiniband/sw/rxe/rxe_net.c
@@ -386,6 +386,12 @@ static int rxe_loopback(struct sk_buff *skb, struct rxe_pkt_info *pkt)
 {
 	memcpy(SKB_TO_PKT(skb), pkt, sizeof(*pkt));
 
+	skb->destructor = rxe_skb_tx_dtor;
+	skb->sk = pkt->qp->sk->sk;
+
+	rxe_get(pkt->qp);
+	atomic_inc(&pkt->qp->skb_out);
+
 	if (skb->protocol == htons(ETH_P_IP))
 		skb_pull(skb, sizeof(struct iphdr));
 	else
-- 
2.43.0



* [PATCH for-next v3 11/12] RDMA/rxe: Get rid of pkt resend on err
  2024-03-29 14:55 [PATCH for-next v3 00/12] RDMA/rxe: Various fixes and cleanups Bob Pearson
                   ` (9 preceding siblings ...)
  2024-03-29 14:55 ` [PATCH for-next v3 10/12] RDMA/rxe: Make rxe_loopback match rxe_send behavior Bob Pearson
@ 2024-03-29 14:55 ` Bob Pearson
  2024-03-29 14:55 ` [PATCH for-next v3 12/12] RDMA/rxe: Let destroy qp succeed with stuck packet Bob Pearson
  2024-04-22 20:10 ` [PATCH for-next v3 00/12] RDMA/rxe: Various fixes and cleanups Jason Gunthorpe
  12 siblings, 0 replies; 15+ messages in thread
From: Bob Pearson @ 2024-03-29 14:55 UTC (permalink / raw)
  To: yanjun.zhu, jgg, leon, linux-rdma, jhack; +Cc: Bob Pearson

Currently the rxe driver detects packet drops by ip_local_out(),
which occur before the packet is sent on the wire, and attempts to
resend them. This is redundant with the usual retry mechanism, which
covers packets that get dropped in transit to or from the remote
node.

The way this is implemented is not robust, since it sets need_req_skb
and waits for the number of local skbs outstanding for this qp to
drop below a low water mark. This is racy: the skb may be sent to
the destructor before the requester can set the need_req_skb flag,
which will cause a deadlock in the send path for that qp.

This patch removes this mechanism, since the normal retry path will
correct the error and resend the packet, and it makes no difference
whether the packet is dropped locally or later.
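
A hypothetical interleaving showing the race:

	/* requester                      skb destructor
	 *
	 * rxe_xmit_packet() fails
	 *                                skb freed; skb_out drops below
	 *                                the low water mark, but
	 *                                need_req_skb is still 0, so
	 *                                nothing gets rescheduled
	 * qp->need_req_skb = 1;
	 * waits forever                  <-- send path deadlocks
	 */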

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_net.c |  7 +------
 drivers/infiniband/sw/rxe/rxe_req.c | 14 ++------------
 2 files changed, 3 insertions(+), 18 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
index d081409450a4..b58eab75df97 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.c
+++ b/drivers/infiniband/sw/rxe/rxe_net.c
@@ -371,12 +371,7 @@ static int rxe_send(struct sk_buff *skb, struct rxe_pkt_info *pkt)
 	else
 		err = ip6_local_out(dev_net(skb_dst(skb)->dev), skb->sk, skb);
 
-	if (unlikely(net_xmit_eval(err))) {
-		rxe_dbg_qp(pkt->qp, "error sending packet: %d\n", err);
-		return -EAGAIN;
-	}
-
-	return 0;
+	return err;
 }
 
 /* fix up a send packet to match the packets
diff --git a/drivers/infiniband/sw/rxe/rxe_req.c b/drivers/infiniband/sw/rxe/rxe_req.c
index 34c55dee0774..cd14c4c2dff9 100644
--- a/drivers/infiniband/sw/rxe/rxe_req.c
+++ b/drivers/infiniband/sw/rxe/rxe_req.c
@@ -802,18 +802,8 @@ int rxe_requester(struct rxe_qp *qp)
 
 	err = rxe_xmit_packet(qp, &pkt, skb);
 	if (err) {
-		if (err != -EAGAIN) {
-			wqe->status = IB_WC_LOC_QP_OP_ERR;
-			goto err;
-		}
-
-		/* force a delay until the dropped packet is freed and
-		 * the send queue is drained below the low water mark
-		 */
-		qp->need_req_skb = 1;
-
-		rxe_sched_task(&qp->send_task);
-		goto exit;
+		wqe->status = IB_WC_LOC_QP_OP_ERR;
+		goto err;
 	}
 
 	update_wqe_state(qp, wqe, &pkt);
-- 
2.43.0



* [PATCH for-next v3 12/12] RDMA/rxe: Let destroy qp succeed with stuck packet
  2024-03-29 14:55 [PATCH for-next v3 00/12] RDMA/rxe: Various fixes and cleanups Bob Pearson
                   ` (10 preceding siblings ...)
  2024-03-29 14:55 ` [PATCH for-next v3 11/12] RDMA/rxe: Get rid of pkt resend on err Bob Pearson
@ 2024-03-29 14:55 ` Bob Pearson
  2024-04-22 20:10 ` [PATCH for-next v3 00/12] RDMA/rxe: Various fixes and cleanups Jason Gunthorpe
  12 siblings, 0 replies; 15+ messages in thread
From: Bob Pearson @ 2024-03-29 14:55 UTC (permalink / raw)
  To: yanjun.zhu, jgg, leon, linux-rdma, jhack; +Cc: Bob Pearson

In some situations a sent packet may get queued in the NIC longer
than the timeout of a ULP. Currently if this happens the ULP may
try to reset the link by destroying the qp and setting up an
alternate connection, but this will fail because the rxe driver is
waiting for the packet to finish getting sent and be returned to
the skb destructor function, where the qp reference holding things
up will be dropped. This patch modifies the way the qp is passed to
the destructor: the qp index is passed rather than a qp pointer.
The destructor then attempts to look up the qp from its index and
exits early if the lookup fails. This requires taking a reference
on the struct sock rather than the qp, allowing the qp to be
destroyed while the sk is still around waiting for the packet to
finish.
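
The key idea, condensed from the diff below:

	/* at qp init time: store the qp index, not a pointer */
	qp->sk->sk->sk_user_data = (void *)(uintptr_t)qp->elem.index;

	/* in rxe_skb_tx_dtor(): look the qp up by index; if the qp
	 * has already been destroyed the lookup fails and only the
	 * sock reference taken at send time is dropped
	 */
	qp = rxe_pool_get_index(&rxe->qp_pool, qp_index);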

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_net.c | 42 +++++++++++++++++++++--------
 drivers/infiniband/sw/rxe/rxe_qp.c  |  2 +-
 2 files changed, 32 insertions(+), 12 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
index b58eab75df97..dc22f3922a59 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.c
+++ b/drivers/infiniband/sw/rxe/rxe_net.c
@@ -345,25 +345,44 @@ int rxe_prepare(struct rxe_av *av, struct rxe_pkt_info *pkt,
 
 static void rxe_skb_tx_dtor(struct sk_buff *skb)
 {
-	struct sock *sk = skb->sk;
-	struct rxe_qp *qp = sk->sk_user_data;
-	int skb_out = atomic_dec_return(&qp->skb_out);
+	struct net_device *ndev = skb->dev;
+	struct rxe_dev *rxe;
+	unsigned int qp_index;
+	struct rxe_qp *qp;
+	int skb_out;
+
+	rxe = rxe_get_dev_from_net(ndev);
+	if (!rxe && is_vlan_dev(ndev))
+		rxe = rxe_get_dev_from_net(vlan_dev_real_dev(ndev));
+	if (WARN_ON(!rxe))
+		return;
 
-	if (unlikely(qp->need_req_skb &&
-		     skb_out < RXE_INFLIGHT_SKBS_PER_QP_LOW))
+	qp_index = (int)(uintptr_t)skb->sk->sk_user_data;
+	if (!qp_index)
+		return;
+
+	qp = rxe_pool_get_index(&rxe->qp_pool, qp_index);
+	if (!qp)
+		goto put_dev;
+
+	skb_out = atomic_dec_return(&qp->skb_out);
+	if (qp->need_req_skb && skb_out < RXE_INFLIGHT_SKBS_PER_QP_LOW)
 		rxe_sched_task(&qp->send_task);
 
 	rxe_put(qp);
+put_dev:
+	ib_device_put(&rxe->ib_dev);
+	sock_put(skb->sk);
 }
 
 static int rxe_send(struct sk_buff *skb, struct rxe_pkt_info *pkt)
 {
 	int err;
+	struct sock *sk = pkt->qp->sk->sk;
 
+	sock_hold(sk);
+	skb->sk = sk;
 	skb->destructor = rxe_skb_tx_dtor;
-	skb->sk = pkt->qp->sk->sk;
-
-	rxe_get(pkt->qp);
 	atomic_inc(&pkt->qp->skb_out);
 
 	if (skb->protocol == htons(ETH_P_IP))
@@ -379,12 +398,13 @@ static int rxe_send(struct sk_buff *skb, struct rxe_pkt_info *pkt)
  */
 static int rxe_loopback(struct sk_buff *skb, struct rxe_pkt_info *pkt)
 {
+	struct sock *sk = pkt->qp->sk->sk;
+
 	memcpy(SKB_TO_PKT(skb), pkt, sizeof(*pkt));
 
+	sock_hold(sk);
+	skb->sk = sk;
 	skb->destructor = rxe_skb_tx_dtor;
-	skb->sk = pkt->qp->sk->sk;
-
-	rxe_get(pkt->qp);
 	atomic_inc(&pkt->qp->skb_out);
 
 	if (skb->protocol == htons(ETH_P_IP))
diff --git a/drivers/infiniband/sw/rxe/rxe_qp.c b/drivers/infiniband/sw/rxe/rxe_qp.c
index c7d99063594b..d2f7b5195c19 100644
--- a/drivers/infiniband/sw/rxe/rxe_qp.c
+++ b/drivers/infiniband/sw/rxe/rxe_qp.c
@@ -244,7 +244,7 @@ static int rxe_qp_init_req(struct rxe_dev *rxe, struct rxe_qp *qp,
 	err = sock_create_kern(&init_net, AF_INET, SOCK_DGRAM, 0, &qp->sk);
 	if (err < 0)
 		return err;
-	qp->sk->sk->sk_user_data = qp;
+	qp->sk->sk->sk_user_data = (void *)(uintptr_t)qp->elem.index;
 
 	/* pick a source UDP port number for this QP based on
 	 * the source QPN. this spreads traffic for different QPs
-- 
2.43.0



* Re: [PATCH for-next v3 02/12] RDMA/rxe: Allow good work requests to be executed
  2024-03-29 14:55 ` [PATCH for-next v3 02/12] RDMA/rxe: Allow good work requests to be executed Bob Pearson
@ 2024-03-30 15:01   ` Zhu Yanjun
  0 siblings, 0 replies; 15+ messages in thread
From: Zhu Yanjun @ 2024-03-30 15:01 UTC (permalink / raw)
  To: Bob Pearson, jgg, leon, linux-rdma, jhack


On 2024/3/29 15:55, Bob Pearson wrote:
> A previous commit incorrectly added an 'if (!err)' check before
> scheduling the requester task in rxe_post_send_kernel(). But if send
> wrs were successfully added to the send queue before a bad wr, they
> might never get executed.
>
> This commit fixes this by scheduling the requester task if any wqes
> were successfully posted in rxe_post_send_kernel() in rxe_verbs.c.
>
> Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
> Fixes: 5bf944f24129 ("RDMA/rxe: Add error messages")
> ---
>   drivers/infiniband/sw/rxe/rxe_verbs.c | 6 +++++-
>   1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.c b/drivers/infiniband/sw/rxe/rxe_verbs.c
> index 614581989b38..a49784e5156c 100644
> --- a/drivers/infiniband/sw/rxe/rxe_verbs.c
> +++ b/drivers/infiniband/sw/rxe/rxe_verbs.c
> @@ -888,6 +888,7 @@ static int rxe_post_send_kernel(struct rxe_qp *qp,
>   {
>   	int err = 0;
>   	unsigned long flags;
> +	int good = 0;

It would be better to sort the above variable declarations in RCT
(Reverse Christmas Tree) order.
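
For example (illustrative only), RCT ordering sorts the declarations
from the longest line to the shortest:

	unsigned long flags;
	int good = 0;
	int err = 0;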

Thanks,

Zhu Yanjun

>   
>   	spin_lock_irqsave(&qp->sq.sq_lock, flags);
>   	while (ibwr) {
> @@ -895,12 +896,15 @@ static int rxe_post_send_kernel(struct rxe_qp *qp,
>   		if (err) {
>   			*bad_wr = ibwr;
>   			break;
> +		} else {
> +			good++;
>   		}
>   		ibwr = ibwr->next;
>   	}
>   	spin_unlock_irqrestore(&qp->sq.sq_lock, flags);
>   
> -	if (!err)
> +	/* kickoff processing of any posted wqes */
> +	if (good)
>   		rxe_sched_task(&qp->req.task);
>   
>   	spin_lock_irqsave(&qp->state_lock, flags);


* Re: [PATCH for-next v3 00/12] RDMA/rxe: Various fixes and cleanups
  2024-03-29 14:55 [PATCH for-next v3 00/12] RDMA/rxe: Various fixes and cleanups Bob Pearson
                   ` (11 preceding siblings ...)
  2024-03-29 14:55 ` [PATCH for-next v3 12/12] RDMA/rxe: Let destroy qp succeed with stuck packet Bob Pearson
@ 2024-04-22 20:10 ` Jason Gunthorpe
  12 siblings, 0 replies; 15+ messages in thread
From: Jason Gunthorpe @ 2024-04-22 20:10 UTC (permalink / raw)
  To: Bob Pearson; +Cc: yanjun.zhu, leon, linux-rdma, jhack

On Fri, Mar 29, 2024 at 09:55:02AM -0500, Bob Pearson wrote:
> This series of patches is the result of high scale testing on a large
> HPC system with a large attached Lustre file system. Several errors
> were found which had not previously been seen at smaller scales. In
> this case, testing used up to 1600 QPs on 1024 compute nodes attached
> to about 100 flash storage nodes. Each patch has its own description.
> 
> v3
> 	Fixed an error in "Don't call rxe_requester from rxe_completer"
> 	Moved run_requester_again from a global to rxe_req_info.again.
> 	The control parameter has to be local to each qp.
> v2
> 	Minor edits to some of the commit messages.
> 	Added a missing change to "Don't schedule rxe_completer...".
> 	Added a missing change to "Get rid of pkt resend on err".
> 	Added one additional commit.
> 
> Bob Pearson (12):
>   RDMA/rxe: Fix seg fault in rxe_comp_queue_pkt
>   RDMA/rxe: Allow good work requests to be executed
>   RDMA/rxe: Remove redundant scheduling of rxe_completer
>   RDMA/rxe: Merge request and complete tasks
>   RDMA/rxe: Remove save/rollback_state in rxe_requester
>   RDMA/rxe: Don't schedule rxe_completer from rxe_requester
>   RDMA/rxe: Don't call rxe_requester from rxe_completer
>   RDMA/rxe: Don't call direct between tasks
>   RDMA/rxe: Fix incorrect rxe_put in error path
>   RDMA/rxe: Make rxe_loopback match rxe_send behavior
>   RDMA/rxe: Get rid of pkt resend on err
>   RDMA/rxe: Let destroy qp succeed with stuck packet

Applied to for-next, thanks

Jason

