* [PATCH for-next v3 00/13] Implement work queues for rdma_rxe
@ 2022-10-29  3:09 Bob Pearson
  2022-10-29  3:09 ` [PATCH for-next v3 01/13] RDMA/rxe: Make task interface pluggable Bob Pearson
                   ` (14 more replies)
  0 siblings, 15 replies; 23+ messages in thread
From: Bob Pearson @ 2022-10-29  3:09 UTC (permalink / raw)
  To: jgg, leon, zyjzyj2000, jhack, linux-rdma; +Cc: Bob Pearson

This patch series implements work queues as an alternative for
the main tasklets in the rdma_rxe driver. The patch series starts
with a patch that makes the internal API for task execution pluggable
and implements an inline and a tasklet-based set of functions.
The remaining patches clean up the qp reset and error code in the
three tasklets and modify the locking logic to prevent multiple
calls to the tasklet scheduling routine. After this preparation the
work queue equivalent set of functions is added and the tasklet
version is dropped.
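
For reference, the pluggable interface introduced in patch 01 boils
down to a per-task ops vector that the public entry points dispatch
through, with the backend chosen when the task is created. Roughly
(simplified from the patches below):

	struct rxe_task_ops {
		void (*sched)(struct rxe_task *task);
		void (*run)(struct rxe_task *task);
		void (*disable)(struct rxe_task *task);
		void (*enable)(struct rxe_task *task);
		void (*cleanup)(struct rxe_task *task);
	};

	void rxe_sched_task(struct rxe_task *task)
	{
		task->ops->sched(task);	/* inline, tasklet or (later) workqueue */
	}

	/* backend selected at init time, e.g. for the responder */
	rxe_init_task(&qp->resp.task, qp, rxe_responder,
		      RXE_TASK_TYPE_WORKQUEUE);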

The main advantage of the work queue version of deferred task execution
is that it has much better scalability and overall performance than the
tasklet variant. The perftest microbenchmarks in local loopback mode
(not a very realistic test case) can reach approximately 100Gb/sec with
work queues compared to about 16Gb/sec for tasklets.

This version of the patch series drops the tasklet version as an option
but keeps the option of switching between the workqueue and inline
versions.

This patch series is derived from an earlier patch set developed by
Ian Ziemba at HPE, which is used in some Lustre storage clients attached
to Lustre servers with hardware RoCE v2 NICs.

It is based on the current version of wip/jgg-for-next.

v3:
Link: https://lore.kernel.org/linux-rdma/202210220559.f7taTL8S-lkp@intel.com/
The v3 version drops the first few patches, which have already been
accepted into for-next, as well as the last patch of the v2 version,
which introduced module parameters to select between the task interfaces.
It also drops the tasklet version entirely and fixes a minor error
(a missing static declaration) caught by the kernel test robot
<lkp@intel.com>.

v2:
The v2 version of the patch set has some minor changes that address
comments from Leon Romanovsky regarding locking of the valid parameter
and the setup parameters for alloc_workqueue. It also has one
additional cleanup patch.

Bob Pearson (13):
  RDMA/rxe: Make task interface pluggable
  RDMA/rxe: Split rxe_drain_resp_pkts()
  RDMA/rxe: Simplify reset state handling in rxe_resp.c
  RDMA/rxe: Handle qp error in rxe_resp.c
  RDMA/rxe: Cleanup comp tasks in rxe_qp.c
  RDMA/rxe: Remove __rxe_do_task()
  RDMA/rxe: Make tasks schedule each other
  RDMA/rxe: Implement disable/enable_task()
  RDMA/rxe: Replace TASK_STATE_START by TASK_STATE_IDLE
  RDMA/rxe: Replace task->destroyed by task state INVALID.
  RDMA/rxe: Add workqueue support for tasks
  RDMA/rxe: Make WORKQUEUE default for RC tasks
  RDMA/rxe: Remove tasklets from rxe_task.c

 drivers/infiniband/sw/rxe/rxe.c      |   9 +-
 drivers/infiniband/sw/rxe/rxe_comp.c |  24 ++-
 drivers/infiniband/sw/rxe/rxe_qp.c   |  80 ++++-----
 drivers/infiniband/sw/rxe/rxe_req.c  |   4 +-
 drivers/infiniband/sw/rxe/rxe_resp.c |  70 +++++---
 drivers/infiniband/sw/rxe/rxe_task.c | 258 +++++++++++++++++++--------
 drivers/infiniband/sw/rxe/rxe_task.h |  56 +++---
 7 files changed, 329 insertions(+), 172 deletions(-)


base-commit: 692373d186205dfb1b56f35f22702412d94d9420
-- 
2.34.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH for-next v3 01/13] RDMA/rxe: Make task interface pluggable
  2022-10-29  3:09 [PATCH for-next v3 00/13] Implement work queues for rdma_rxe Bob Pearson
@ 2022-10-29  3:09 ` Bob Pearson
  2022-11-11  2:28   ` Yanjun Zhu
  2022-10-29  3:09 ` [PATCH for-next v3 02/13] RDMA/rxe: Split rxe_drain_resp_pkts() Bob Pearson
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 23+ messages in thread
From: Bob Pearson @ 2022-10-29  3:09 UTC (permalink / raw)
  To: jgg, leon, zyjzyj2000, jhack, linux-rdma; +Cc: Bob Pearson, Ian Ziemba

Make the internal interface to the task operations pluggable and
add a new 'inline' type.

Signed-off-by: Ian Ziemba <ian.ziemba@hpe.com>
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_qp.c   |   8 +-
 drivers/infiniband/sw/rxe/rxe_task.c | 160 ++++++++++++++++++++++-----
 drivers/infiniband/sw/rxe/rxe_task.h |  44 +++++---
 3 files changed, 165 insertions(+), 47 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_qp.c b/drivers/infiniband/sw/rxe/rxe_qp.c
index 3f6d62a80bea..b5e108794aa1 100644
--- a/drivers/infiniband/sw/rxe/rxe_qp.c
+++ b/drivers/infiniband/sw/rxe/rxe_qp.c
@@ -238,8 +238,10 @@ static int rxe_qp_init_req(struct rxe_dev *rxe, struct rxe_qp *qp,
 
 	skb_queue_head_init(&qp->req_pkts);
 
-	rxe_init_task(&qp->req.task, qp, rxe_requester);
-	rxe_init_task(&qp->comp.task, qp, rxe_completer);
+	rxe_init_task(&qp->req.task, qp, rxe_requester, RXE_TASK_TYPE_TASKLET);
+	rxe_init_task(&qp->comp.task, qp, rxe_completer,
+			(qp_type(qp) == IB_QPT_RC) ? RXE_TASK_TYPE_TASKLET :
+						     RXE_TASK_TYPE_INLINE);
 
 	qp->qp_timeout_jiffies = 0; /* Can't be set for UD/UC in modify_qp */
 	if (init->qp_type == IB_QPT_RC) {
@@ -286,7 +288,7 @@ static int rxe_qp_init_resp(struct rxe_dev *rxe, struct rxe_qp *qp,
 
 	skb_queue_head_init(&qp->resp_pkts);
 
-	rxe_init_task(&qp->resp.task, qp, rxe_responder);
+	rxe_init_task(&qp->resp.task, qp, rxe_responder, RXE_TASK_TYPE_TASKLET);
 
 	qp->resp.opcode		= OPCODE_NONE;
 	qp->resp.msn		= 0;
diff --git a/drivers/infiniband/sw/rxe/rxe_task.c b/drivers/infiniband/sw/rxe/rxe_task.c
index 0208d833a41b..8dfbfa164eff 100644
--- a/drivers/infiniband/sw/rxe/rxe_task.c
+++ b/drivers/infiniband/sw/rxe/rxe_task.c
@@ -24,12 +24,11 @@ int __rxe_do_task(struct rxe_task *task)
  * a second caller finds the task already running
  * but looks just after the last call to func
  */
-static void do_task(struct tasklet_struct *t)
+static void do_task(struct rxe_task *task)
 {
+	unsigned int iterations = RXE_MAX_ITERATIONS;
 	int cont;
 	int ret;
-	struct rxe_task *task = from_tasklet(task, t, tasklet);
-	unsigned int iterations = RXE_MAX_ITERATIONS;
 
 	spin_lock_bh(&task->lock);
 	switch (task->state) {
@@ -90,28 +89,21 @@ static void do_task(struct tasklet_struct *t)
 	task->ret = ret;
 }
 
-int rxe_init_task(struct rxe_task *task, void *arg, int (*func)(void *))
+static void disable_task(struct rxe_task *task)
 {
-	task->arg	= arg;
-	task->func	= func;
-	task->destroyed	= false;
-
-	tasklet_setup(&task->tasklet, do_task);
-
-	task->state = TASK_STATE_START;
-	spin_lock_init(&task->lock);
+	/* todo */
+}
 
-	return 0;
+static void enable_task(struct rxe_task *task)
+{
+	/* todo */
 }
 
-void rxe_cleanup_task(struct rxe_task *task)
+/* busy wait until any previous tasks are done */
+static void cleanup_task(struct rxe_task *task)
 {
 	bool idle;
 
-	/*
-	 * Mark the task, then wait for it to finish. It might be
-	 * running in a non-tasklet (direct call) context.
-	 */
 	task->destroyed = true;
 
 	do {
@@ -119,32 +111,144 @@ void rxe_cleanup_task(struct rxe_task *task)
 		idle = (task->state == TASK_STATE_START);
 		spin_unlock_bh(&task->lock);
 	} while (!idle);
+}
 
-	tasklet_kill(&task->tasklet);
+/* silently treat schedule as inline for inline tasks */
+static void inline_sched(struct rxe_task *task)
+{
+	do_task(task);
 }
 
-void rxe_run_task(struct rxe_task *task)
+static void inline_run(struct rxe_task *task)
 {
-	if (task->destroyed)
-		return;
+	do_task(task);
+}
 
-	do_task(&task->tasklet);
+static void inline_disable(struct rxe_task *task)
+{
+	disable_task(task);
 }
 
-void rxe_sched_task(struct rxe_task *task)
+static void inline_enable(struct rxe_task *task)
 {
-	if (task->destroyed)
-		return;
+	enable_task(task);
+}
+
+static void inline_cleanup(struct rxe_task *task)
+{
+	cleanup_task(task);
+}
+
+static const struct rxe_task_ops inline_ops = {
+	.sched = inline_sched,
+	.run = inline_run,
+	.enable = inline_enable,
+	.disable = inline_disable,
+	.cleanup = inline_cleanup,
+};
 
+static void inline_init(struct rxe_task *task)
+{
+	task->ops = &inline_ops;
+}
+
+/* use tsklet_xxx to avoid name collisions with tasklet_xxx */
+static void tsklet_sched(struct rxe_task *task)
+{
 	tasklet_schedule(&task->tasklet);
 }
 
-void rxe_disable_task(struct rxe_task *task)
+static void tsklet_do_task(struct tasklet_struct *tasklet)
 {
+	struct rxe_task *task = container_of(tasklet, typeof(*task), tasklet);
+
+	do_task(task);
+}
+
+static void tsklet_run(struct rxe_task *task)
+{
+	do_task(task);
+}
+
+static void tsklet_disable(struct rxe_task *task)
+{
+	disable_task(task);
 	tasklet_disable(&task->tasklet);
 }
 
-void rxe_enable_task(struct rxe_task *task)
+static void tsklet_enable(struct rxe_task *task)
 {
 	tasklet_enable(&task->tasklet);
+	enable_task(task);
+}
+
+static void tsklet_cleanup(struct rxe_task *task)
+{
+	cleanup_task(task);
+	tasklet_kill(&task->tasklet);
+}
+
+static const struct rxe_task_ops tsklet_ops = {
+	.sched = tsklet_sched,
+	.run = tsklet_run,
+	.enable = tsklet_enable,
+	.disable = tsklet_disable,
+	.cleanup = tsklet_cleanup,
+};
+
+static void tsklet_init(struct rxe_task *task)
+{
+	tasklet_setup(&task->tasklet, tsklet_do_task);
+	task->ops = &tsklet_ops;
+}
+
+int rxe_init_task(struct rxe_task *task, void *arg, int (*func)(void *),
+		  enum rxe_task_type type)
+{
+	task->arg	= arg;
+	task->func	= func;
+	task->destroyed	= false;
+	task->type	= type;
+	task->state	= TASK_STATE_START;
+
+	spin_lock_init(&task->lock);
+
+	switch (type) {
+	case RXE_TASK_TYPE_INLINE:
+		inline_init(task);
+		break;
+	case RXE_TASK_TYPE_TASKLET:
+		tsklet_init(task);
+		break;
+	default:
+		pr_debug("%s: invalid task type = %d\n", __func__, type);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+void rxe_sched_task(struct rxe_task *task)
+{
+	task->ops->sched(task);
+}
+
+void rxe_run_task(struct rxe_task *task)
+{
+	task->ops->run(task);
+}
+
+void rxe_enable_task(struct rxe_task *task)
+{
+	task->ops->enable(task);
+}
+
+void rxe_disable_task(struct rxe_task *task)
+{
+	task->ops->disable(task);
+}
+
+void rxe_cleanup_task(struct rxe_task *task)
+{
+	task->ops->cleanup(task);
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_task.h b/drivers/infiniband/sw/rxe/rxe_task.h
index 7b88129702ac..31963129ff7a 100644
--- a/drivers/infiniband/sw/rxe/rxe_task.h
+++ b/drivers/infiniband/sw/rxe/rxe_task.h
@@ -7,6 +7,21 @@
 #ifndef RXE_TASK_H
 #define RXE_TASK_H
 
+struct rxe_task;
+
+struct rxe_task_ops {
+	void (*sched)(struct rxe_task *task);
+	void (*run)(struct rxe_task *task);
+	void (*disable)(struct rxe_task *task);
+	void (*enable)(struct rxe_task *task);
+	void (*cleanup)(struct rxe_task *task);
+};
+
+enum rxe_task_type {
+	RXE_TASK_TYPE_INLINE	= 0,
+	RXE_TASK_TYPE_TASKLET	= 1,
+};
+
 enum {
 	TASK_STATE_START	= 0,
 	TASK_STATE_BUSY		= 1,
@@ -19,24 +34,19 @@ enum {
  * called again.
  */
 struct rxe_task {
-	struct tasklet_struct	tasklet;
-	int			state;
-	spinlock_t		lock;
-	void			*arg;
-	int			(*func)(void *arg);
-	int			ret;
-	bool			destroyed;
+	struct tasklet_struct		tasklet;
+	int				state;
+	spinlock_t			lock;
+	void				*arg;
+	int				(*func)(void *arg);
+	int				ret;
+	bool				destroyed;
+	const struct rxe_task_ops	*ops;
+	enum rxe_task_type		type;
 };
 
-/*
- * init rxe_task structure
- *	arg  => parameter to pass to fcn
- *	func => function to call until it returns != 0
- */
-int rxe_init_task(struct rxe_task *task, void *arg, int (*func)(void *));
-
-/* cleanup task */
-void rxe_cleanup_task(struct rxe_task *task);
+int rxe_init_task(struct rxe_task *task, void *arg, int (*func)(void *),
+		  enum rxe_task_type type);
 
 /*
  * raw call to func in loop without any checking
@@ -54,4 +64,6 @@ void rxe_disable_task(struct rxe_task *task);
 /* allow task to run */
 void rxe_enable_task(struct rxe_task *task);
 
+void rxe_cleanup_task(struct rxe_task *task);
+
 #endif /* RXE_TASK_H */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH for-next v3 02/13] RDMA/rxe: Split rxe_drain_resp_pkts()
  2022-10-29  3:09 [PATCH for-next v3 00/13] Implement work queues for rdma_rxe Bob Pearson
  2022-10-29  3:09 ` [PATCH for-next v3 01/13] RDMA/rxe: Make task interface pluggable Bob Pearson
@ 2022-10-29  3:09 ` Bob Pearson
  2022-10-29  3:10 ` [PATCH for-next v3 03/13] RDMA/rxe: Simplify reset state handling in rxe_resp.c Bob Pearson
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 23+ messages in thread
From: Bob Pearson @ 2022-10-29  3:09 UTC (permalink / raw)
  To: jgg, leon, zyjzyj2000, jhack, linux-rdma; +Cc: Bob Pearson, Ian Ziemba

Split rxe_drain_resp_pkts() into two subroutines which perform separate
functions.

Signed-off-by: Ian Ziemba <ian.ziemba@hpe.com>
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_comp.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_comp.c b/drivers/infiniband/sw/rxe/rxe_comp.c
index 66f392810c86..76dc0a4702fd 100644
--- a/drivers/infiniband/sw/rxe/rxe_comp.c
+++ b/drivers/infiniband/sw/rxe/rxe_comp.c
@@ -532,17 +532,21 @@ static inline enum comp_state complete_wqe(struct rxe_qp *qp,
 	return COMPST_GET_WQE;
 }
 
-static void rxe_drain_resp_pkts(struct rxe_qp *qp, bool notify)
+static void rxe_drain_resp_pkts(struct rxe_qp *qp)
 {
 	struct sk_buff *skb;
-	struct rxe_send_wqe *wqe;
-	struct rxe_queue *q = qp->sq.queue;
 
 	while ((skb = skb_dequeue(&qp->resp_pkts))) {
 		rxe_put(qp);
 		kfree_skb(skb);
 		ib_device_put(qp->ibqp.device);
 	}
+}
+
+static void rxe_drain_send_queue(struct rxe_qp *qp, bool notify)
+{
+	struct rxe_send_wqe *wqe;
+	struct rxe_queue *q = qp->sq.queue;
 
 	while ((wqe = queue_head(q, q->type))) {
 		if (notify) {
@@ -573,6 +577,7 @@ int rxe_completer(void *arg)
 	struct sk_buff *skb = NULL;
 	struct rxe_pkt_info *pkt = NULL;
 	enum comp_state state;
+	bool notify;
 	int ret;
 
 	if (!rxe_get(qp))
@@ -580,8 +585,9 @@ int rxe_completer(void *arg)
 
 	if (!qp->valid || qp->comp.state == QP_STATE_ERROR ||
 	    qp->comp.state == QP_STATE_RESET) {
-		rxe_drain_resp_pkts(qp, qp->valid &&
-				    qp->comp.state == QP_STATE_ERROR);
+		notify = qp->valid && (qp->comp.state == QP_STATE_ERROR);
+		rxe_drain_resp_pkts(qp);
+		rxe_drain_send_queue(qp, notify);
 		goto exit;
 	}
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH for-next v3 03/13] RDMA/rxe: Simplify reset state handling in rxe_resp.c
  2022-10-29  3:09 [PATCH for-next v3 00/13] Implement work queues for rdma_rxe Bob Pearson
  2022-10-29  3:09 ` [PATCH for-next v3 01/13] RDMA/rxe: Make task interface pluggable Bob Pearson
  2022-10-29  3:09 ` [PATCH for-next v3 02/13] RDMA/rxe: Split rxe_drain_resp_pkts() Bob Pearson
@ 2022-10-29  3:10 ` Bob Pearson
  2022-11-11  3:04   ` Yanjun Zhu
  2022-10-29  3:10 ` [PATCH for-next v3 04/13] RDMA/rxe: Handle qp error " Bob Pearson
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 23+ messages in thread
From: Bob Pearson @ 2022-10-29  3:10 UTC (permalink / raw)
  To: jgg, leon, zyjzyj2000, jhack, linux-rdma; +Cc: Bob Pearson, Ian Ziemba

Make rxe_responder() more like rxe_completer() and take qp reset
handling out of the state machine.

Signed-off-by: Ian Ziemba <ian.ziemba@hpe.com>
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_resp.c | 12 +++---------
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
index c32bc12cc82f..c4f365449aa5 100644
--- a/drivers/infiniband/sw/rxe/rxe_resp.c
+++ b/drivers/infiniband/sw/rxe/rxe_resp.c
@@ -40,7 +40,6 @@ enum resp_states {
 	RESPST_ERR_LENGTH,
 	RESPST_ERR_CQ_OVERFLOW,
 	RESPST_ERROR,
-	RESPST_RESET,
 	RESPST_DONE,
 	RESPST_EXIT,
 };
@@ -75,7 +74,6 @@ static char *resp_state_name[] = {
 	[RESPST_ERR_LENGTH]			= "ERR_LENGTH",
 	[RESPST_ERR_CQ_OVERFLOW]		= "ERR_CQ_OVERFLOW",
 	[RESPST_ERROR]				= "ERROR",
-	[RESPST_RESET]				= "RESET",
 	[RESPST_DONE]				= "DONE",
 	[RESPST_EXIT]				= "EXIT",
 };
@@ -1281,8 +1279,9 @@ int rxe_responder(void *arg)
 
 	switch (qp->resp.state) {
 	case QP_STATE_RESET:
-		state = RESPST_RESET;
-		break;
+		rxe_drain_req_pkts(qp, false);
+		qp->resp.wqe = NULL;
+		goto exit;
 
 	default:
 		state = RESPST_GET_REQ;
@@ -1441,11 +1440,6 @@ int rxe_responder(void *arg)
 
 			goto exit;
 
-		case RESPST_RESET:
-			rxe_drain_req_pkts(qp, false);
-			qp->resp.wqe = NULL;
-			goto exit;
-
 		case RESPST_ERROR:
 			qp->resp.goto_error = 0;
 			pr_debug("qp#%d moved to error state\n", qp_num(qp));
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH for-next v3 04/13] RDMA/rxe: Handle qp error in rxe_resp.c
  2022-10-29  3:09 [PATCH for-next v3 00/13] Implement work queues for rdma_rxe Bob Pearson
                   ` (2 preceding siblings ...)
  2022-10-29  3:10 ` [PATCH for-next v3 03/13] RDMA/rxe: Simplify reset state handling in rxe_resp.c Bob Pearson
@ 2022-10-29  3:10 ` Bob Pearson
  2022-10-29  3:10 ` [PATCH v3 05/13] RDMA/rxe: Cleanup comp tasks in rxe_qp.c Bob Pearson
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 23+ messages in thread
From: Bob Pearson @ 2022-10-29  3:10 UTC (permalink / raw)
  To: jgg, leon, zyjzyj2000, jhack, linux-rdma; +Cc: Bob Pearson, Ian Ziemba

Split rxe_drain_req_pkts() into two subroutines which perform
separate tasks. Handle the qp error and reset states and !qp->valid
in the same way as rxe_comp.c. Flush recv wqes for a qp in the error
state.

Signed-off-by: Ian Ziemba <ian.ziemba@hpe.com>
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_resp.c | 64 ++++++++++++++++++++--------
 1 file changed, 47 insertions(+), 17 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
index c4f365449aa5..16298d88a9d7 100644
--- a/drivers/infiniband/sw/rxe/rxe_resp.c
+++ b/drivers/infiniband/sw/rxe/rxe_resp.c
@@ -1028,7 +1028,6 @@ static enum resp_states do_complete(struct rxe_qp *qp,
 		return RESPST_CLEANUP;
 }
 
-
 static int send_common_ack(struct rxe_qp *qp, u8 syndrome, u32 psn,
 				  int opcode, const char *msg)
 {
@@ -1243,22 +1242,56 @@ static enum resp_states do_class_d1e_error(struct rxe_qp *qp)
 	}
 }
 
-static void rxe_drain_req_pkts(struct rxe_qp *qp, bool notify)
+static void rxe_drain_req_pkts(struct rxe_qp *qp)
 {
 	struct sk_buff *skb;
-	struct rxe_queue *q = qp->rq.queue;
 
 	while ((skb = skb_dequeue(&qp->req_pkts))) {
 		rxe_put(qp);
 		kfree_skb(skb);
 		ib_device_put(qp->ibqp.device);
 	}
+}
+
+static int complete_flush(struct rxe_qp *qp, struct rxe_recv_wqe *wqe)
+{
+	struct rxe_cqe cqe;
+	struct ib_wc *wc = &cqe.ibwc;
+	struct ib_uverbs_wc *uwc = &cqe.uibwc;
+
+	memset(&cqe, 0, sizeof(cqe));
 
-	if (notify)
-		return;
+	if (qp->rcq->is_user) {
+		uwc->status		= IB_WC_WR_FLUSH_ERR;
+		uwc->qp_num		= qp->ibqp.qp_num;
+		uwc->wr_id		= wqe->wr_id;
+	} else {
+		wc->status		= IB_WC_WR_FLUSH_ERR;
+		wc->qp			= &qp->ibqp;
+		wc->wr_id		= wqe->wr_id;
+	}
 
-	while (!qp->srq && q && queue_head(q, q->type))
+	if (rxe_cq_post(qp->rcq, &cqe, 0))
+		return -ENOMEM;
+
+	return 0;
+}
+
+/* drain the receive queue. Complete each wqe with a flush error
+ * if notify is true or until a cq overflow occurs.
+ */
+static void rxe_drain_recv_queue(struct rxe_qp *qp, bool notify)
+{
+	struct rxe_recv_wqe *wqe;
+	struct rxe_queue *q = qp->rq.queue;
+
+	while ((wqe = queue_head(q, q->type))) {
+		if (notify && complete_flush(qp, wqe))
+			notify = 0;
 		queue_advance_consumer(q, q->type);
+	}
+
+	qp->resp.wqe = NULL;
 }
 
 int rxe_responder(void *arg)
@@ -1267,6 +1300,7 @@ int rxe_responder(void *arg)
 	struct rxe_dev *rxe = to_rdev(qp->ibqp.device);
 	enum resp_states state;
 	struct rxe_pkt_info *pkt = NULL;
+	bool notify;
 	int ret;
 
 	if (!rxe_get(qp))
@@ -1274,20 +1308,16 @@ int rxe_responder(void *arg)
 
 	qp->resp.aeth_syndrome = AETH_ACK_UNLIMITED;
 
-	if (!qp->valid)
-		goto exit;
-
-	switch (qp->resp.state) {
-	case QP_STATE_RESET:
-		rxe_drain_req_pkts(qp, false);
-		qp->resp.wqe = NULL;
+	if (!qp->valid || qp->resp.state == QP_STATE_ERROR ||
+	    qp->resp.state == QP_STATE_RESET) {
+		notify = qp->valid && (qp->resp.state == QP_STATE_ERROR);
+		rxe_drain_req_pkts(qp);
+		rxe_drain_recv_queue(qp, notify);
 		goto exit;
-
-	default:
-		state = RESPST_GET_REQ;
-		break;
 	}
 
+	state = RESPST_GET_REQ;
+
 	while (1) {
 		pr_debug("qp#%d state = %s\n", qp_num(qp),
 			 resp_state_name[state]);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH v3 05/13] RDMA/rxe: Cleanup comp tasks in rxe_qp.c
  2022-10-29  3:09 [PATCH for-next v3 00/13] Implement work queues for rdma_rxe Bob Pearson
                   ` (3 preceding siblings ...)
  2022-10-29  3:10 ` [PATCH for-next v3 04/13] RDMA/rxe: Handle qp error " Bob Pearson
@ 2022-10-29  3:10 ` Bob Pearson
  2022-10-29  3:10 ` [PATCH for-next v3 06/13] RDMA/rxe: Remove __rxe_do_task() Bob Pearson
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 23+ messages in thread
From: Bob Pearson @ 2022-10-29  3:10 UTC (permalink / raw)
  To: jgg, leon, zyjzyj2000, jhack, linux-rdma; +Cc: Bob Pearson, Ian Ziemba

Take advantage of inline task behavior to clean up code in rxe_qp.c
for completer tasks.

Signed-off-by: Ian Ziemba <ian.ziemba@hpe.com>
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_qp.c | 18 ++++--------------
 1 file changed, 4 insertions(+), 14 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_qp.c b/drivers/infiniband/sw/rxe/rxe_qp.c
index b5e108794aa1..3691eb97c576 100644
--- a/drivers/infiniband/sw/rxe/rxe_qp.c
+++ b/drivers/infiniband/sw/rxe/rxe_qp.c
@@ -480,8 +480,7 @@ static void rxe_qp_reset(struct rxe_qp *qp)
 
 	/* stop request/comp */
 	if (qp->sq.queue) {
-		if (qp_type(qp) == IB_QPT_RC)
-			rxe_disable_task(&qp->comp.task);
+		rxe_disable_task(&qp->comp.task);
 		rxe_disable_task(&qp->req.task);
 	}
 
@@ -524,9 +523,7 @@ static void rxe_qp_reset(struct rxe_qp *qp)
 	rxe_enable_task(&qp->resp.task);
 
 	if (qp->sq.queue) {
-		if (qp_type(qp) == IB_QPT_RC)
-			rxe_enable_task(&qp->comp.task);
-
+		rxe_enable_task(&qp->comp.task);
 		rxe_enable_task(&qp->req.task);
 	}
 }
@@ -537,10 +534,7 @@ static void rxe_qp_drain(struct rxe_qp *qp)
 	if (qp->sq.queue) {
 		if (qp->req.state != QP_STATE_DRAINED) {
 			qp->req.state = QP_STATE_DRAIN;
-			if (qp_type(qp) == IB_QPT_RC)
-				rxe_sched_task(&qp->comp.task);
-			else
-				__rxe_do_task(&qp->comp.task);
+			rxe_sched_task(&qp->comp.task);
 			rxe_sched_task(&qp->req.task);
 		}
 	}
@@ -556,11 +550,7 @@ void rxe_qp_error(struct rxe_qp *qp)
 
 	/* drain work and packet queues */
 	rxe_sched_task(&qp->resp.task);
-
-	if (qp_type(qp) == IB_QPT_RC)
-		rxe_sched_task(&qp->comp.task);
-	else
-		__rxe_do_task(&qp->comp.task);
+	rxe_sched_task(&qp->comp.task);
 	rxe_sched_task(&qp->req.task);
 }
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH for-next v3 06/13] RDMA/rxe: Remove __rxe_do_task()
  2022-10-29  3:09 [PATCH for-next v3 00/13] Implement work queues for rdma_rxe Bob Pearson
                   ` (4 preceding siblings ...)
  2022-10-29  3:10 ` [PATCH v3 05/13] RDMA/rxe: Cleanup comp tasks in rxe_qp.c Bob Pearson
@ 2022-10-29  3:10 ` Bob Pearson
  2022-10-29  3:10 ` [PATCH for-next v3 07/13] RDMA/rxe: Make tasks schedule each other Bob Pearson
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 23+ messages in thread
From: Bob Pearson @ 2022-10-29  3:10 UTC (permalink / raw)
  To: jgg, leon, zyjzyj2000, jhack, linux-rdma; +Cc: Bob Pearson, Ian Ziemba

The subroutine __rxe_do_task() is not thread-safe. It is only
used in the rxe_qp_reset() and rxe_qp_do_cleanup() routines.
After the changes to error handling in the tasklet functions, the
queues can be drained by calling the task functions once directly,
outside of the tasklet code. This allows __rxe_do_task() to be removed.

Signed-off-by: Ian Ziemba <ian.ziemba@hpe.com>
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_qp.c   | 60 ++++++++++++++--------------
 drivers/infiniband/sw/rxe/rxe_task.c | 13 ------
 drivers/infiniband/sw/rxe/rxe_task.h |  6 ---
 3 files changed, 29 insertions(+), 50 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_qp.c b/drivers/infiniband/sw/rxe/rxe_qp.c
index 3691eb97c576..50f6b8b8ad9d 100644
--- a/drivers/infiniband/sw/rxe/rxe_qp.c
+++ b/drivers/infiniband/sw/rxe/rxe_qp.c
@@ -477,28 +477,23 @@ static void rxe_qp_reset(struct rxe_qp *qp)
 {
 	/* stop tasks from running */
 	rxe_disable_task(&qp->resp.task);
-
-	/* stop request/comp */
-	if (qp->sq.queue) {
-		rxe_disable_task(&qp->comp.task);
-		rxe_disable_task(&qp->req.task);
-	}
+	rxe_disable_task(&qp->comp.task);
+	rxe_disable_task(&qp->req.task);
 
 	/* move qp to the reset state */
 	qp->req.state = QP_STATE_RESET;
 	qp->comp.state = QP_STATE_RESET;
 	qp->resp.state = QP_STATE_RESET;
 
-	/* let state machines reset themselves drain work and packet queues
-	 * etc.
-	 */
-	__rxe_do_task(&qp->resp.task);
+	/* drain work and packet queues */
+	rxe_responder(qp);
+	rxe_completer(qp);
+	rxe_requester(qp);
 
-	if (qp->sq.queue) {
-		__rxe_do_task(&qp->comp.task);
-		__rxe_do_task(&qp->req.task);
+	if (qp->rq.queue)
+		rxe_queue_reset(qp->rq.queue);
+	if (qp->sq.queue)
 		rxe_queue_reset(qp->sq.queue);
-	}
 
 	/* cleanup attributes */
 	atomic_set(&qp->ssn, 0);
@@ -521,11 +516,8 @@ static void rxe_qp_reset(struct rxe_qp *qp)
 
 	/* reenable tasks */
 	rxe_enable_task(&qp->resp.task);
-
-	if (qp->sq.queue) {
-		rxe_enable_task(&qp->comp.task);
-		rxe_enable_task(&qp->req.task);
-	}
+	rxe_enable_task(&qp->comp.task);
+	rxe_enable_task(&qp->req.task);
 }
 
 /* drain the send queue */
@@ -543,15 +535,25 @@ static void rxe_qp_drain(struct rxe_qp *qp)
 /* move the qp to the error state */
 void rxe_qp_error(struct rxe_qp *qp)
 {
+	/* stop tasks from running */
+	rxe_disable_task(&qp->resp.task);
+	rxe_disable_task(&qp->comp.task);
+	rxe_disable_task(&qp->req.task);
+
 	qp->req.state = QP_STATE_ERROR;
 	qp->resp.state = QP_STATE_ERROR;
 	qp->comp.state = QP_STATE_ERROR;
 	qp->attr.qp_state = IB_QPS_ERR;
 
 	/* drain work and packet queues */
-	rxe_sched_task(&qp->resp.task);
-	rxe_sched_task(&qp->comp.task);
-	rxe_sched_task(&qp->req.task);
+	rxe_responder(qp);
+	rxe_completer(qp);
+	rxe_requester(qp);
+
+	/* reenable tasks */
+	rxe_enable_task(&qp->resp.task);
+	rxe_enable_task(&qp->comp.task);
+	rxe_enable_task(&qp->req.task);
 }
 
 /* called by the modify qp verb */
@@ -770,24 +772,20 @@ static void rxe_qp_do_cleanup(struct work_struct *work)
 
 	qp->valid = 0;
 	qp->qp_timeout_jiffies = 0;
-	rxe_cleanup_task(&qp->resp.task);
 
 	if (qp_type(qp) == IB_QPT_RC) {
 		del_timer_sync(&qp->retrans_timer);
 		del_timer_sync(&qp->rnr_nak_timer);
 	}
 
+	rxe_cleanup_task(&qp->resp.task);
 	rxe_cleanup_task(&qp->req.task);
 	rxe_cleanup_task(&qp->comp.task);
 
-	/* flush out any receive wr's or pending requests */
-	if (qp->req.task.func)
-		__rxe_do_task(&qp->req.task);
-
-	if (qp->sq.queue) {
-		__rxe_do_task(&qp->comp.task);
-		__rxe_do_task(&qp->req.task);
-	}
+	/* drain any receive wr's or pending requests */
+	rxe_responder(qp);
+	rxe_completer(qp);
+	rxe_requester(qp);
 
 	if (qp->sq.queue)
 		rxe_queue_cleanup(qp->sq.queue);
diff --git a/drivers/infiniband/sw/rxe/rxe_task.c b/drivers/infiniband/sw/rxe/rxe_task.c
index 8dfbfa164eff..120693c9a795 100644
--- a/drivers/infiniband/sw/rxe/rxe_task.c
+++ b/drivers/infiniband/sw/rxe/rxe_task.c
@@ -6,19 +6,6 @@
 
 #include "rxe.h"
 
-int __rxe_do_task(struct rxe_task *task)
-
-{
-	int ret;
-
-	while ((ret = task->func(task->arg)) == 0)
-		;
-
-	task->ret = ret;
-
-	return ret;
-}
-
 /*
  * this locking is due to a potential race where
  * a second caller finds the task already running
diff --git a/drivers/infiniband/sw/rxe/rxe_task.h b/drivers/infiniband/sw/rxe/rxe_task.h
index 31963129ff7a..d594468fcf56 100644
--- a/drivers/infiniband/sw/rxe/rxe_task.h
+++ b/drivers/infiniband/sw/rxe/rxe_task.h
@@ -48,12 +48,6 @@ struct rxe_task {
 int rxe_init_task(struct rxe_task *task, void *arg, int (*func)(void *),
 		  enum rxe_task_type type);
 
-/*
- * raw call to func in loop without any checking
- * can call when tasklets are disabled
- */
-int __rxe_do_task(struct rxe_task *task);
-
 void rxe_run_task(struct rxe_task *task);
 
 void rxe_sched_task(struct rxe_task *task);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH for-next v3 07/13] RDMA/rxe: Make tasks schedule each other
  2022-10-29  3:09 [PATCH for-next v3 00/13] Implement work queues for rdma_rxe Bob Pearson
                   ` (5 preceding siblings ...)
  2022-10-29  3:10 ` [PATCH for-next v3 06/13] RDMA/rxe: Remove __rxe_do_task() Bob Pearson
@ 2022-10-29  3:10 ` Bob Pearson
  2022-10-29  3:10 ` [PATCH for-next v3 08/13] RDMA/rxe: Implement disable/enable_task() Bob Pearson
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 23+ messages in thread
From: Bob Pearson @ 2022-10-29  3:10 UTC (permalink / raw)
  To: jgg, leon, zyjzyj2000, jhack, linux-rdma; +Cc: Bob Pearson, Ian Ziemba

Replace rxe_run_task() by rxe_sched_task() when tasks call each other.
These calls are not performance critical and mainly involve error paths,
but they run the risk of causing deadlocks.

Signed-off-by: Ian Ziemba <ian.ziemba@hpe.com>
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_comp.c | 8 ++++----
 drivers/infiniband/sw/rxe/rxe_req.c  | 4 ++--
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_comp.c b/drivers/infiniband/sw/rxe/rxe_comp.c
index 76dc0a4702fd..f2256f67edbf 100644
--- a/drivers/infiniband/sw/rxe/rxe_comp.c
+++ b/drivers/infiniband/sw/rxe/rxe_comp.c
@@ -316,7 +316,7 @@ static inline enum comp_state check_ack(struct rxe_qp *qp,
 					qp->comp.psn = pkt->psn;
 					if (qp->req.wait_psn) {
 						qp->req.wait_psn = 0;
-						rxe_run_task(&qp->req.task);
+						rxe_sched_task(&qp->req.task);
 					}
 				}
 				return COMPST_ERROR_RETRY;
@@ -463,7 +463,7 @@ static void do_complete(struct rxe_qp *qp, struct rxe_send_wqe *wqe)
 	 */
 	if (qp->req.wait_fence) {
 		qp->req.wait_fence = 0;
-		rxe_run_task(&qp->req.task);
+		rxe_sched_task(&qp->req.task);
 	}
 }
 
@@ -477,7 +477,7 @@ static inline enum comp_state complete_ack(struct rxe_qp *qp,
 		if (qp->req.need_rd_atomic) {
 			qp->comp.timeout_retry = 0;
 			qp->req.need_rd_atomic = 0;
-			rxe_run_task(&qp->req.task);
+			rxe_sched_task(&qp->req.task);
 		}
 	}
 
@@ -731,7 +731,7 @@ int rxe_completer(void *arg)
 							RXE_CNT_COMP_RETRY);
 					qp->req.need_retry = 1;
 					qp->comp.started_retry = 1;
-					rxe_run_task(&qp->req.task);
+					rxe_sched_task(&qp->req.task);
 				}
 				goto done;
 
diff --git a/drivers/infiniband/sw/rxe/rxe_req.c b/drivers/infiniband/sw/rxe/rxe_req.c
index 41f1d84f0acb..fba7572e1d0c 100644
--- a/drivers/infiniband/sw/rxe/rxe_req.c
+++ b/drivers/infiniband/sw/rxe/rxe_req.c
@@ -733,7 +733,7 @@ int rxe_requester(void *arg)
 						       qp->req.wqe_index);
 			wqe->state = wqe_state_done;
 			wqe->status = IB_WC_SUCCESS;
-			rxe_run_task(&qp->comp.task);
+			rxe_sched_task(&qp->comp.task);
 			goto done;
 		}
 		payload = mtu;
@@ -817,7 +817,7 @@ int rxe_requester(void *arg)
 	qp->req.wqe_index = queue_next_index(qp->sq.queue, qp->req.wqe_index);
 	wqe->state = wqe_state_error;
 	qp->req.state = QP_STATE_ERROR;
-	rxe_run_task(&qp->comp.task);
+	rxe_sched_task(&qp->comp.task);
 exit:
 	ret = -EAGAIN;
 out:
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH for-next v3 08/13] RDMA/rxe: Implement disable/enable_task()
  2022-10-29  3:09 [PATCH for-next v3 00/13] Implement work queues for rdma_rxe Bob Pearson
                   ` (6 preceding siblings ...)
  2022-10-29  3:10 ` [PATCH for-next v3 07/13] RDMA/rxe: Make tasks schedule each other Bob Pearson
@ 2022-10-29  3:10 ` Bob Pearson
  2022-10-29  3:10 ` [PATCH for-next v3 09/13] RDMA/rxe: Replace TASK_STATE_START by TASK_STATE_IDLE Bob Pearson
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 23+ messages in thread
From: Bob Pearson @ 2022-10-29  3:10 UTC (permalink / raw)
  To: jgg, leon, zyjzyj2000, jhack, linux-rdma; +Cc: Bob Pearson, Ian Ziemba

Implement common disable_task() and enable_task() routines by
adding a new PAUSED state to the do_task() state machine.
These replace tasklet_disable() and tasklet_enable() with code that
can be shared by all the task types. Move the rxe_sched_task() call
that reschedules the task outside of the locks to avoid a deadlock.
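
As a rough sketch (derived from the code below), the transitions become:

	START  --run/sched-->  BUSY    --func() != 0-->  START
	BUSY   --run/sched-->  ARMED   (func runs once more, then idles)
	any    --disable-->    PAUSED  --enable-->       START (task re-run)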

Signed-off-by: Ian Ziemba <ian.ziemba@hpe.com>
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_task.c | 107 ++++++++++++++++-----------
 drivers/infiniband/sw/rxe/rxe_task.h |   1 +
 2 files changed, 66 insertions(+), 42 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_task.c b/drivers/infiniband/sw/rxe/rxe_task.c
index 120693c9a795..d824de82f2ae 100644
--- a/drivers/infiniband/sw/rxe/rxe_task.c
+++ b/drivers/infiniband/sw/rxe/rxe_task.c
@@ -6,36 +6,46 @@
 
 #include "rxe.h"
 
-/*
- * this locking is due to a potential race where
- * a second caller finds the task already running
- * but looks just after the last call to func
- */
-static void do_task(struct rxe_task *task)
+static bool task_is_idle(struct rxe_task *task)
 {
-	unsigned int iterations = RXE_MAX_ITERATIONS;
-	int cont;
-	int ret;
+	if (task->destroyed)
+		return false;
 
 	spin_lock_bh(&task->lock);
 	switch (task->state) {
 	case TASK_STATE_START:
 		task->state = TASK_STATE_BUSY;
 		spin_unlock_bh(&task->lock);
-		break;
-
+		return true;
 	case TASK_STATE_BUSY:
 		task->state = TASK_STATE_ARMED;
 		fallthrough;
 	case TASK_STATE_ARMED:
-		spin_unlock_bh(&task->lock);
-		return;
-
+	case TASK_STATE_PAUSED:
+		break;
 	default:
+		WARN_ON(1);
+		break;
+	}
+	spin_unlock_bh(&task->lock);
+
+	return false;
+}
+
+static void do_task(struct rxe_task *task)
+{
+	unsigned int iterations = RXE_MAX_ITERATIONS;
+	bool resched = false;
+	int cont;
+	int ret;
+
+	/* flush out pending tasks */
+	spin_lock_bh(&task->lock);
+	if (task->state == TASK_STATE_PAUSED) {
 		spin_unlock_bh(&task->lock);
-		pr_warn("%s failed with bad state %d\n", __func__, task->state);
 		return;
 	}
+	spin_unlock_bh(&task->lock);
 
 	do {
 		cont = 0;
@@ -43,47 +53,52 @@ static void do_task(struct rxe_task *task)
 
 		spin_lock_bh(&task->lock);
 		switch (task->state) {
+		case TASK_STATE_START:
 		case TASK_STATE_BUSY:
 			if (ret) {
 				task->state = TASK_STATE_START;
-			} else if (iterations--) {
+			} else if (task->type == RXE_TASK_TYPE_INLINE ||
+					iterations--) {
 				cont = 1;
 			} else {
-				/* reschedule the tasklet and exit
-				 * the loop to give up the cpu
-				 */
-				tasklet_schedule(&task->tasklet);
 				task->state = TASK_STATE_START;
+				resched = true;
 			}
 			break;
-
-		/* someone tried to run the task since the last time we called
-		 * func, so we will call one more time regardless of the
-		 * return value
-		 */
 		case TASK_STATE_ARMED:
 			task->state = TASK_STATE_BUSY;
 			cont = 1;
 			break;
-
+		case TASK_STATE_PAUSED:
+			break;
 		default:
-			pr_warn("%s failed with bad state %d\n", __func__,
-				task->state);
+			WARN_ON(1);
+			break;
 		}
 		spin_unlock_bh(&task->lock);
 	} while (cont);
 
+	if (resched)
+		rxe_sched_task(task);
+
 	task->ret = ret;
 }
 
 static void disable_task(struct rxe_task *task)
 {
-	/* todo */
+	spin_lock_bh(&task->lock);
+	task->state = TASK_STATE_PAUSED;
+	spin_unlock_bh(&task->lock);
 }
 
 static void enable_task(struct rxe_task *task)
 {
-	/* todo */
+	spin_lock_bh(&task->lock);
+	task->state = TASK_STATE_START;
+	spin_unlock_bh(&task->lock);
+
+	/* restart task in case */
+	rxe_run_task(task);
 }
 
 /* busy wait until any previous tasks are done */
@@ -95,7 +110,8 @@ static void cleanup_task(struct rxe_task *task)
 
 	do {
 		spin_lock_bh(&task->lock);
-		idle = (task->state == TASK_STATE_START);
+		idle = (task->state == TASK_STATE_START ||
+			task->state == TASK_STATE_PAUSED);
 		spin_unlock_bh(&task->lock);
 	} while (!idle);
 }
@@ -103,22 +119,26 @@ static void cleanup_task(struct rxe_task *task)
 /* silently treat schedule as inline for inline tasks */
 static void inline_sched(struct rxe_task *task)
 {
-	do_task(task);
+	if (task_is_idle(task))
+		do_task(task);
 }
 
 static void inline_run(struct rxe_task *task)
 {
-	do_task(task);
+	if (task_is_idle(task))
+		do_task(task);
 }
 
 static void inline_disable(struct rxe_task *task)
 {
-	disable_task(task);
+	if (!task->destroyed)
+		disable_task(task);
 }
 
 static void inline_enable(struct rxe_task *task)
 {
-	enable_task(task);
+	if (!task->destroyed)
+		enable_task(task);
 }
 
 static void inline_cleanup(struct rxe_task *task)
@@ -142,31 +162,34 @@ static void inline_init(struct rxe_task *task)
 /* use tsklet_xxx to avoid name collisions with tasklet_xxx */
 static void tsklet_sched(struct rxe_task *task)
 {
-	tasklet_schedule(&task->tasklet);
+	if (task_is_idle(task))
+		tasklet_schedule(&task->tasklet);
 }
 
 static void tsklet_do_task(struct tasklet_struct *tasklet)
 {
 	struct rxe_task *task = container_of(tasklet, typeof(*task), tasklet);
 
-	do_task(task);
+	if (!task->destroyed)
+		do_task(task);
 }
 
 static void tsklet_run(struct rxe_task *task)
 {
-	do_task(task);
+	if (task_is_idle(task))
+		do_task(task);
 }
 
 static void tsklet_disable(struct rxe_task *task)
 {
-	disable_task(task);
-	tasklet_disable(&task->tasklet);
+	if (!task->destroyed)
+		disable_task(task);
 }
 
 static void tsklet_enable(struct rxe_task *task)
 {
-	tasklet_enable(&task->tasklet);
-	enable_task(task);
+	if (!task->destroyed)
+		enable_task(task);
 }
 
 static void tsklet_cleanup(struct rxe_task *task)
diff --git a/drivers/infiniband/sw/rxe/rxe_task.h b/drivers/infiniband/sw/rxe/rxe_task.h
index d594468fcf56..792832786456 100644
--- a/drivers/infiniband/sw/rxe/rxe_task.h
+++ b/drivers/infiniband/sw/rxe/rxe_task.h
@@ -26,6 +26,7 @@ enum {
 	TASK_STATE_START	= 0,
 	TASK_STATE_BUSY		= 1,
 	TASK_STATE_ARMED	= 2,
+	TASK_STATE_PAUSED	= 3,
 };
 
 /*
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH for-next v3 09/13] RDMA/rxe: Replace TASK_STATE_START by TASK_STATE_IDLE
  2022-10-29  3:09 [PATCH for-next v3 00/13] Implement work queues for rdma_rxe Bob Pearson
                   ` (7 preceding siblings ...)
  2022-10-29  3:10 ` [PATCH for-next v3 08/13] RDMA/rxe: Implement disable/enable_task() Bob Pearson
@ 2022-10-29  3:10 ` Bob Pearson
  2022-10-29  3:10 ` [PATCH for-next v3 10/13] RDMA/rxe: Replace task->destroyed by task state INVALID Bob Pearson
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 23+ messages in thread
From: Bob Pearson @ 2022-10-29  3:10 UTC (permalink / raw)
  To: jgg, leon, zyjzyj2000, jhack, linux-rdma; +Cc: Bob Pearson, Ian Ziemba

Replace the enum TASK_STATE_START by TASK_STATE_IDLE.

Signed-off-by: Ian Ziemba <ian.ziemba@hpe.com>
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_task.c | 14 +++++++-------
 drivers/infiniband/sw/rxe/rxe_task.h |  2 +-
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_task.c b/drivers/infiniband/sw/rxe/rxe_task.c
index d824de82f2ae..0fd0d97e8272 100644
--- a/drivers/infiniband/sw/rxe/rxe_task.c
+++ b/drivers/infiniband/sw/rxe/rxe_task.c
@@ -13,7 +13,7 @@ static bool task_is_idle(struct rxe_task *task)
 
 	spin_lock_bh(&task->lock);
 	switch (task->state) {
-	case TASK_STATE_START:
+	case TASK_STATE_IDLE:
 		task->state = TASK_STATE_BUSY;
 		spin_unlock_bh(&task->lock);
 		return true;
@@ -53,15 +53,15 @@ static void do_task(struct rxe_task *task)
 
 		spin_lock_bh(&task->lock);
 		switch (task->state) {
-		case TASK_STATE_START:
+		case TASK_STATE_IDLE:
 		case TASK_STATE_BUSY:
 			if (ret) {
-				task->state = TASK_STATE_START;
+				task->state = TASK_STATE_IDLE;
 			} else if (task->type == RXE_TASK_TYPE_INLINE ||
 					iterations--) {
 				cont = 1;
 			} else {
-				task->state = TASK_STATE_START;
+				task->state = TASK_STATE_IDLE;
 				resched = true;
 			}
 			break;
@@ -94,7 +94,7 @@ static void disable_task(struct rxe_task *task)
 static void enable_task(struct rxe_task *task)
 {
 	spin_lock_bh(&task->lock);
-	task->state = TASK_STATE_START;
+	task->state = TASK_STATE_IDLE;
 	spin_unlock_bh(&task->lock);
 
 	/* restart task in case */
@@ -110,7 +110,7 @@ static void cleanup_task(struct rxe_task *task)
 
 	do {
 		spin_lock_bh(&task->lock);
-		idle = (task->state == TASK_STATE_START ||
+		idle = (task->state == TASK_STATE_IDLE ||
 			task->state == TASK_STATE_PAUSED);
 		spin_unlock_bh(&task->lock);
 	} while (!idle);
@@ -219,7 +219,7 @@ int rxe_init_task(struct rxe_task *task, void *arg, int (*func)(void *),
 	task->func	= func;
 	task->destroyed	= false;
 	task->type	= type;
-	task->state	= TASK_STATE_START;
+	task->state	= TASK_STATE_IDLE;
 
 	spin_lock_init(&task->lock);
 
diff --git a/drivers/infiniband/sw/rxe/rxe_task.h b/drivers/infiniband/sw/rxe/rxe_task.h
index 792832786456..0146307fc517 100644
--- a/drivers/infiniband/sw/rxe/rxe_task.h
+++ b/drivers/infiniband/sw/rxe/rxe_task.h
@@ -23,7 +23,7 @@ enum rxe_task_type {
 };
 
 enum {
-	TASK_STATE_START	= 0,
+	TASK_STATE_IDLE		= 0,
 	TASK_STATE_BUSY		= 1,
 	TASK_STATE_ARMED	= 2,
 	TASK_STATE_PAUSED	= 3,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH for-next v3 10/13] RDMA/rxe: Replace task->destroyed by task state INVALID.
  2022-10-29  3:09 [PATCH for-next v3 00/13] Implement work queues for rdma_rxe Bob Pearson
                   ` (8 preceding siblings ...)
  2022-10-29  3:10 ` [PATCH for-next v3 09/13] RDMA/rxe: Replace TASK_STATE_START by TASK_STATE_IDLE Bob Pearson
@ 2022-10-29  3:10 ` Bob Pearson
  2022-10-29  3:10 ` [PATCH for-next v3 11/13] RDMA/rxe: Add workqueue support for tasks Bob Pearson
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 23+ messages in thread
From: Bob Pearson @ 2022-10-29  3:10 UTC (permalink / raw)
  To: jgg, leon, zyjzyj2000, jhack, linux-rdma; +Cc: Bob Pearson, Ian Ziemba

Add a new state TASK_STATE_INVALID to replace the flag task->destroyed.
Make all changes to task->state, now including TASK_STATE_INVALID,
protected by task->lock.

Signed-off-by: Ian Ziemba <ian.ziemba@hpe.com>
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_task.c | 49 ++++++++++------------------
 drivers/infiniband/sw/rxe/rxe_task.h |  3 +-
 2 files changed, 19 insertions(+), 33 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_task.c b/drivers/infiniband/sw/rxe/rxe_task.c
index 0fd0d97e8272..da175f2a0dbf 100644
--- a/drivers/infiniband/sw/rxe/rxe_task.c
+++ b/drivers/infiniband/sw/rxe/rxe_task.c
@@ -8,9 +8,6 @@
 
 static bool task_is_idle(struct rxe_task *task)
 {
-	if (task->destroyed)
-		return false;
-
 	spin_lock_bh(&task->lock);
 	switch (task->state) {
 	case TASK_STATE_IDLE:
@@ -19,12 +16,8 @@ static bool task_is_idle(struct rxe_task *task)
 		return true;
 	case TASK_STATE_BUSY:
 		task->state = TASK_STATE_ARMED;
-		fallthrough;
-	case TASK_STATE_ARMED:
-	case TASK_STATE_PAUSED:
 		break;
 	default:
-		WARN_ON(1);
 		break;
 	}
 	spin_unlock_bh(&task->lock);
@@ -41,7 +34,8 @@ static void do_task(struct rxe_task *task)
 
 	/* flush out pending tasks */
 	spin_lock_bh(&task->lock);
-	if (task->state == TASK_STATE_PAUSED) {
+	if (task->state == TASK_STATE_PAUSED ||
+	    task->state == TASK_STATE_INVALID) {
 		spin_unlock_bh(&task->lock);
 		return;
 	}
@@ -69,10 +63,7 @@ static void do_task(struct rxe_task *task)
 			task->state = TASK_STATE_BUSY;
 			cont = 1;
 			break;
-		case TASK_STATE_PAUSED:
-			break;
 		default:
-			WARN_ON(1);
 			break;
 		}
 		spin_unlock_bh(&task->lock);
@@ -87,14 +78,16 @@ static void do_task(struct rxe_task *task)
 static void disable_task(struct rxe_task *task)
 {
 	spin_lock_bh(&task->lock);
-	task->state = TASK_STATE_PAUSED;
+	if (task->state != TASK_STATE_INVALID)
+		task->state = TASK_STATE_PAUSED;
 	spin_unlock_bh(&task->lock);
 }
 
 static void enable_task(struct rxe_task *task)
 {
 	spin_lock_bh(&task->lock);
-	task->state = TASK_STATE_IDLE;
+	if (task->state != TASK_STATE_INVALID)
+		task->state = TASK_STATE_IDLE;
 	spin_unlock_bh(&task->lock);
 
 	/* restart task in case */
@@ -104,16 +97,16 @@ static void enable_task(struct rxe_task *task)
 /* busy wait until any previous tasks are done */
 static void cleanup_task(struct rxe_task *task)
 {
-	bool idle;
-
-	task->destroyed = true;
+	bool busy;
 
 	do {
 		spin_lock_bh(&task->lock);
-		idle = (task->state == TASK_STATE_IDLE ||
-			task->state == TASK_STATE_PAUSED);
+		busy = (task->state == TASK_STATE_BUSY ||
+			task->state == TASK_STATE_ARMED);
+		if (!busy)
+			task->state = TASK_STATE_INVALID;
 		spin_unlock_bh(&task->lock);
-	} while (!idle);
+	} while (busy);
 }
 
 /* silently treat schedule as inline for inline tasks */
@@ -131,14 +124,12 @@ static void inline_run(struct rxe_task *task)
 
 static void inline_disable(struct rxe_task *task)
 {
-	if (!task->destroyed)
-		disable_task(task);
+	disable_task(task);
 }
 
 static void inline_enable(struct rxe_task *task)
 {
-	if (!task->destroyed)
-		enable_task(task);
+	enable_task(task);
 }
 
 static void inline_cleanup(struct rxe_task *task)
@@ -168,10 +159,7 @@ static void tsklet_sched(struct rxe_task *task)
 
 static void tsklet_do_task(struct tasklet_struct *tasklet)
 {
-	struct rxe_task *task = container_of(tasklet, typeof(*task), tasklet);
-
-	if (!task->destroyed)
-		do_task(task);
+	do_task(container_of(tasklet, struct rxe_task, tasklet));
 }
 
 static void tsklet_run(struct rxe_task *task)
@@ -182,14 +170,12 @@ static void tsklet_run(struct rxe_task *task)
 
 static void tsklet_disable(struct rxe_task *task)
 {
-	if (!task->destroyed)
-		disable_task(task);
+	disable_task(task);
 }
 
 static void tsklet_enable(struct rxe_task *task)
 {
-	if (!task->destroyed)
-		enable_task(task);
+	enable_task(task);
 }
 
 static void tsklet_cleanup(struct rxe_task *task)
@@ -217,7 +203,6 @@ int rxe_init_task(struct rxe_task *task, void *arg, int (*func)(void *),
 {
 	task->arg	= arg;
 	task->func	= func;
-	task->destroyed	= false;
 	task->type	= type;
 	task->state	= TASK_STATE_IDLE;
 
diff --git a/drivers/infiniband/sw/rxe/rxe_task.h b/drivers/infiniband/sw/rxe/rxe_task.h
index 0146307fc517..2c4ef4d339f1 100644
--- a/drivers/infiniband/sw/rxe/rxe_task.h
+++ b/drivers/infiniband/sw/rxe/rxe_task.h
@@ -27,6 +27,7 @@ enum {
 	TASK_STATE_BUSY		= 1,
 	TASK_STATE_ARMED	= 2,
 	TASK_STATE_PAUSED	= 3,
+	TASK_STATE_INVALID	= 4,
 };
 
 /*
@@ -41,7 +42,7 @@ struct rxe_task {
 	void				*arg;
 	int				(*func)(void *arg);
 	int				ret;
-	bool				destroyed;
+	bool				invalid;
 	const struct rxe_task_ops	*ops;
 	enum rxe_task_type		type;
 };
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH for-next v3 11/13] RDMA/rxe: Add workqueue support for tasks
  2022-10-29  3:09 [PATCH for-next v3 00/13] Implement work queues for rdma_rxe Bob Pearson
                   ` (9 preceding siblings ...)
  2022-10-29  3:10 ` [PATCH for-next v3 10/13] RDMA/rxe: Replace task->destroyed by task state INVALID Bob Pearson
@ 2022-10-29  3:10 ` Bob Pearson
  2022-10-29  3:10 ` [PATCH for-next v3 12/13] RDMA/rxe: Make WORKQUEUE default for RC tasks Bob Pearson
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 23+ messages in thread
From: Bob Pearson @ 2022-10-29  3:10 UTC (permalink / raw)
  To: jgg, leon, zyjzyj2000, jhack, linux-rdma; +Cc: Bob Pearson, Ian Ziemba

Add a third task type RXE_TASK_TYPE_WORKQUEUE to rxe_task.c.

Signed-off-by: Ian Ziemba <ian.ziemba@hpe.com>
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe.c      |  9 +++-
 drivers/infiniband/sw/rxe/rxe_task.c | 66 ++++++++++++++++++++++++++++
 drivers/infiniband/sw/rxe/rxe_task.h | 10 ++++-
 3 files changed, 83 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
index 51daac5c4feb..6d80218334ca 100644
--- a/drivers/infiniband/sw/rxe/rxe.c
+++ b/drivers/infiniband/sw/rxe/rxe.c
@@ -210,10 +210,16 @@ static int __init rxe_module_init(void)
 {
 	int err;
 
-	err = rxe_net_init();
+	err = rxe_alloc_wq();
 	if (err)
 		return err;
 
+	err = rxe_net_init();
+	if (err) {
+		rxe_destroy_wq();
+		return err;
+	}
+
 	rdma_link_register(&rxe_link_ops);
 	pr_info("loaded\n");
 	return 0;
@@ -224,6 +230,7 @@ static void __exit rxe_module_exit(void)
 	rdma_link_unregister(&rxe_link_ops);
 	ib_unregister_driver(RDMA_DRIVER_RXE);
 	rxe_net_exit();
+	rxe_destroy_wq();
 
 	pr_info("unloaded\n");
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_task.c b/drivers/infiniband/sw/rxe/rxe_task.c
index da175f2a0dbf..c1177752088d 100644
--- a/drivers/infiniband/sw/rxe/rxe_task.c
+++ b/drivers/infiniband/sw/rxe/rxe_task.c
@@ -6,6 +6,22 @@
 
 #include "rxe.h"
 
+static struct workqueue_struct *rxe_wq;
+
+int rxe_alloc_wq(void)
+{
+	rxe_wq = alloc_workqueue("rxe_wq", WQ_CPU_INTENSIVE, WQ_MAX_ACTIVE);
+	if (!rxe_wq)
+		return -ENOMEM;
+
+	return 0;
+}
+
+void rxe_destroy_wq(void)
+{
+	destroy_workqueue(rxe_wq);
+}
+
 static bool task_is_idle(struct rxe_task *task)
 {
 	spin_lock_bh(&task->lock);
@@ -198,6 +214,53 @@ static void tsklet_init(struct rxe_task *task)
 	task->ops = &tsklet_ops;
 }
 
+static void work_sched(struct rxe_task *task)
+{
+	if (task_is_idle(task))
+		queue_work(rxe_wq, &task->work);
+}
+
+static void work_do_task(struct work_struct *work)
+{
+	do_task(container_of(work, struct rxe_task, work));
+}
+
+static void work_run(struct rxe_task *task)
+{
+	if (task_is_idle(task))
+		do_task(task);
+}
+
+static void work_enable(struct rxe_task *task)
+{
+	enable_task(task);
+}
+
+static void work_disable(struct rxe_task *task)
+{
+	disable_task(task);
+	flush_workqueue(rxe_wq);
+}
+
+static void work_cleanup(struct rxe_task *task)
+{
+	cleanup_task(task);
+}
+
+static const struct rxe_task_ops work_ops = {
+	.sched = work_sched,
+	.run = work_run,
+	.enable = work_enable,
+	.disable = work_disable,
+	.cleanup = work_cleanup,
+};
+
+static void work_init(struct rxe_task *task)
+{
+	INIT_WORK(&task->work, work_do_task);
+	task->ops = &work_ops;
+}
+
 int rxe_init_task(struct rxe_task *task, void *arg, int (*func)(void *),
 		  enum rxe_task_type type)
 {
@@ -215,6 +278,9 @@ int rxe_init_task(struct rxe_task *task, void *arg, int (*func)(void *),
 	case RXE_TASK_TYPE_TASKLET:
 		tsklet_init(task);
 		break;
+	case RXE_TASK_TYPE_WORKQUEUE:
+		work_init(task);
+		break;
 	default:
 		pr_debug("%s: invalid task type = %d\n", __func__, type);
 		return -EINVAL;
diff --git a/drivers/infiniband/sw/rxe/rxe_task.h b/drivers/infiniband/sw/rxe/rxe_task.h
index 2c4ef4d339f1..d1156b935635 100644
--- a/drivers/infiniband/sw/rxe/rxe_task.h
+++ b/drivers/infiniband/sw/rxe/rxe_task.h
@@ -20,6 +20,7 @@ struct rxe_task_ops {
 enum rxe_task_type {
 	RXE_TASK_TYPE_INLINE	= 0,
 	RXE_TASK_TYPE_TASKLET	= 1,
+	RXE_TASK_TYPE_WORKQUEUE	= 2,
 };
 
 enum {
@@ -36,7 +37,10 @@ enum {
  * called again.
  */
 struct rxe_task {
-	struct tasklet_struct		tasklet;
+	union {
+		struct tasklet_struct		tasklet;
+		struct work_struct		work;
+	};
 	int				state;
 	spinlock_t			lock;
 	void				*arg;
@@ -47,6 +51,10 @@ struct rxe_task {
 	enum rxe_task_type		type;
 };
 
+int rxe_alloc_wq(void);
+
+void rxe_destroy_wq(void);
+
 int rxe_init_task(struct rxe_task *task, void *arg, int (*func)(void *),
 		  enum rxe_task_type type);
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH for-next v3 12/13] RDMA/rxe: Make WORKQUEUE default for RC tasks
  2022-10-29  3:09 [PATCH for-next v3 00/13] Implement work queues for rdma_rxe Bob Pearson
                   ` (10 preceding siblings ...)
  2022-10-29  3:10 ` [PATCH for-next v3 11/13] RDMA/rxe: Add workqueue support for tasks Bob Pearson
@ 2022-10-29  3:10 ` Bob Pearson
  2022-10-29  3:10 ` [PATCH for-next v3 13/13] RDMA/rxe: Remove tasklets from rxe_task.c Bob Pearson
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 23+ messages in thread
From: Bob Pearson @ 2022-10-29  3:10 UTC (permalink / raw)
  To: jgg, leon, zyjzyj2000, jhack, linux-rdma; +Cc: Bob Pearson

Change RXE_TASK_TYPE_TASKLET to RXE_TASK_TYPE_WORKQUEUE in rxe_qp.c.
This makes work queues the default for tasks except for UD completion
tasks.

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_qp.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_qp.c b/drivers/infiniband/sw/rxe/rxe_qp.c
index 50f6b8b8ad9d..ca467d8991a9 100644
--- a/drivers/infiniband/sw/rxe/rxe_qp.c
+++ b/drivers/infiniband/sw/rxe/rxe_qp.c
@@ -238,9 +238,10 @@ static int rxe_qp_init_req(struct rxe_dev *rxe, struct rxe_qp *qp,
 
 	skb_queue_head_init(&qp->req_pkts);
 
-	rxe_init_task(&qp->req.task, qp, rxe_requester, RXE_TASK_TYPE_TASKLET);
+	rxe_init_task(&qp->req.task, qp, rxe_requester,
+			RXE_TASK_TYPE_WORKQUEUE);
 	rxe_init_task(&qp->comp.task, qp, rxe_completer,
-			(qp_type(qp) == IB_QPT_RC) ? RXE_TASK_TYPE_TASKLET :
+			(qp_type(qp) == IB_QPT_RC) ? RXE_TASK_TYPE_WORKQUEUE :
 						     RXE_TASK_TYPE_INLINE);
 
 	qp->qp_timeout_jiffies = 0; /* Can't be set for UD/UC in modify_qp */
@@ -288,7 +289,8 @@ static int rxe_qp_init_resp(struct rxe_dev *rxe, struct rxe_qp *qp,
 
 	skb_queue_head_init(&qp->resp_pkts);
 
-	rxe_init_task(&qp->resp.task, qp, rxe_responder, RXE_TASK_TYPE_TASKLET);
+	rxe_init_task(&qp->resp.task, qp, rxe_responder,
+			RXE_TASK_TYPE_WORKQUEUE);
 
 	qp->resp.opcode		= OPCODE_NONE;
 	qp->resp.msn		= 0;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH for-next v3 13/13] RDMA/rxe: Remove tasklets from rxe_task.c
  2022-10-29  3:09 [PATCH for-next v3 00/13] Implement work queues for rdma_rxe Bob Pearson
                   ` (11 preceding siblings ...)
  2022-10-29  3:10 ` [PATCH for-next v3 12/13] RDMA/rxe: Make WORKQUEUE default for RC tasks Bob Pearson
@ 2022-10-29  3:10 ` Bob Pearson
  2022-11-02 10:17 ` [PATCH for-next v3 00/13] Implement work queues for rdma_rxe Daisuke Matsuda (Fujitsu)
  2022-11-18  5:02 ` Daisuke Matsuda (Fujitsu)
  14 siblings, 0 replies; 23+ messages in thread
From: Bob Pearson @ 2022-10-29  3:10 UTC (permalink / raw)
  To: jgg, leon, zyjzyj2000, jhack, linux-rdma; +Cc: Bob Pearson

Remove the option to select tasklets. Keep the pluggable
interface to allow for future expansion.

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_task.c | 51 ----------------------------
 drivers/infiniband/sw/rxe/rxe_task.h |  8 ++---
 2 files changed, 2 insertions(+), 57 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_task.c b/drivers/infiniband/sw/rxe/rxe_task.c
index c1177752088d..d9c4ab2e58c8 100644
--- a/drivers/infiniband/sw/rxe/rxe_task.c
+++ b/drivers/infiniband/sw/rxe/rxe_task.c
@@ -166,54 +166,6 @@ static void inline_init(struct rxe_task *task)
 	task->ops = &inline_ops;
 }
 
-/* use tsklet_xxx to avoid name collisions with tasklet_xxx */
-static void tsklet_sched(struct rxe_task *task)
-{
-	if (task_is_idle(task))
-		tasklet_schedule(&task->tasklet);
-}
-
-static void tsklet_do_task(struct tasklet_struct *tasklet)
-{
-	do_task(container_of(tasklet, struct rxe_task, tasklet));
-}
-
-static void tsklet_run(struct rxe_task *task)
-{
-	if (task_is_idle(task))
-		do_task(task);
-}
-
-static void tsklet_disable(struct rxe_task *task)
-{
-	disable_task(task);
-}
-
-static void tsklet_enable(struct rxe_task *task)
-{
-	enable_task(task);
-}
-
-static void tsklet_cleanup(struct rxe_task *task)
-{
-	cleanup_task(task);
-	tasklet_kill(&task->tasklet);
-}
-
-static const struct rxe_task_ops tsklet_ops = {
-	.sched = tsklet_sched,
-	.run = tsklet_run,
-	.enable = tsklet_enable,
-	.disable = tsklet_disable,
-	.cleanup = tsklet_cleanup,
-};
-
-static void tsklet_init(struct rxe_task *task)
-{
-	tasklet_setup(&task->tasklet, tsklet_do_task);
-	task->ops = &tsklet_ops;
-}
-
 static void work_sched(struct rxe_task *task)
 {
 	if (task_is_idle(task))
@@ -275,9 +227,6 @@ int rxe_init_task(struct rxe_task *task, void *arg, int (*func)(void *),
 	case RXE_TASK_TYPE_INLINE:
 		inline_init(task);
 		break;
-	case RXE_TASK_TYPE_TASKLET:
-		tsklet_init(task);
-		break;
 	case RXE_TASK_TYPE_WORKQUEUE:
 		work_init(task);
 		break;
diff --git a/drivers/infiniband/sw/rxe/rxe_task.h b/drivers/infiniband/sw/rxe/rxe_task.h
index d1156b935635..fbc7e2bf4e5a 100644
--- a/drivers/infiniband/sw/rxe/rxe_task.h
+++ b/drivers/infiniband/sw/rxe/rxe_task.h
@@ -19,8 +19,7 @@ struct rxe_task_ops {
 
 enum rxe_task_type {
 	RXE_TASK_TYPE_INLINE	= 0,
-	RXE_TASK_TYPE_TASKLET	= 1,
-	RXE_TASK_TYPE_WORKQUEUE	= 2,
+	RXE_TASK_TYPE_WORKQUEUE	= 1,
 };
 
 enum {
@@ -37,10 +36,7 @@ enum {
  * called again.
  */
 struct rxe_task {
-	union {
-		struct tasklet_struct		tasklet;
-		struct work_struct		work;
-	};
+	struct work_struct		work;
 	int				state;
 	spinlock_t			lock;
 	void				*arg;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* RE: [PATCH for-next v3 00/13] Implement work queues for rdma_rxe
  2022-10-29  3:09 [PATCH for-next v3 00/13] Implement work queues for rdma_rxe Bob Pearson
                   ` (12 preceding siblings ...)
  2022-10-29  3:10 ` [PATCH for-next v3 13/13] RDMA/rxe: Remove tasklets from rxe_task.c Bob Pearson
@ 2022-11-02 10:17 ` Daisuke Matsuda (Fujitsu)
  2022-11-02 11:20   ` Bob Pearson
  2022-11-18  5:02 ` Daisuke Matsuda (Fujitsu)
  14 siblings, 1 reply; 23+ messages in thread
From: Daisuke Matsuda (Fujitsu) @ 2022-11-02 10:17 UTC (permalink / raw)
  To: 'Bob Pearson', jgg, leon, zyjzyj2000, jhack, linux-rdma

On Sat, Oct 29, 2022 12:10 PM Bob Pearson wrote:
> This patch series implements work queues as an alternative for
> the main tasklets in the rdma_rxe driver. The patch series starts
> with a patch that makes the internal API for task execution pluggable
> and implements an inline and a tasklet based set of functions.
> The remaining patches cleanup the qp reset and error code in the
> three tasklets and modify the locking logic to prevent making
> multiple calls to the tasklet scheduling routine. After
> this preparation the work queue equivalent set of functions is
> added and the tasklet version is dropped.

Thank you for posting the 3rd series.
It looks fine at a glance, but now I am concerned about problems
that can be potentially caused by concurrency.

> 
> The advantages of the work queue version of deferred task execution
> is mainly that the work queue variant has much better scalability
> and overall performance than the tasklet variant.  The perftest
> microbenchmarks in local loopback mode (not a very realistic test
> case) can reach approximately 100Gb/sec with work queues compared to
> about 16Gb/sec for tasklets.

As you wrote, the advantage of the work queue version is that the number of works
that can run in parallel scales with the number of logical CPUs. However, the
dispatched works (rxe_requester, rxe_responder, and rxe_completer) were
designed for serial execution on tasklets, so we must not rely on them functioning
properly under parallel execution.

There could be 3 problems, which stem from the fact that works are not necessarily
executed in the same order the packets are received. Works are enqueued to worker
pools on each CPU, and each CPU respectively schedules the works, so the ordering
of works among CPUs is not guaranteed.

[1]
On UC/UD connections, responder does not check the psn of inbound packets,
so the payloads can be copied to MRs without checking the order. If there are
works that write to overlapping memory locations, they can potentially cause
data corruption depending on the order.

[2]
On RC connections, responder checks the psn, and drops the packet if it is not
the expected one. Requester can retransmit the request in this case, so the order
seems to be guaranteed for RC.

However, responder updates the next expected psn (qp->resp.psn) BEFORE
replying an ACK packet. If the work is preempted soon after storing the next psn,
another work on another CPU can potentially reply another ACK packet earlier.
This behaviour is against the spec.
Cf. IB Specification Vol 1-Release-1.5 " 9.5 TRANSACTION ORDERING"

[3]
Again on RC connections, the next expected psn (qp->resp.psn) can be
loaded and stored at the same time from different threads. It seems we
have to use a synchronization method, perhaps like READ_ONCE() and
WRITE_ONCE() macros, to prevent loading an old value. This one is just an
example; there can be other variables that need similar consideration.


All the problems above can be solved by making the work queue single-
threaded. We can do it by using flags=WQ_UNBOUND and max_active=1
for alloc_workqueue(), but this should be the last resort since this spoils
the performance benefit of the work queue.

I am not sure what we can do with [1] right now.
For [2] and [3], we could just move the update of the psn to after the ack reply,
and use *_ONCE() macros for shared variables.
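
A minimal sketch of what that could look like (reply_ack() is only a
placeholder, not the driver's actual ack path, and the masking is an
assumption for illustration):

=====
/* sketch only; reply_ack() stands in for the real ack path */
static void store_expected_psn(struct rxe_qp *qp, u32 pkt_psn)
{
	reply_ack(qp, pkt_psn);
	/* publish the next expected psn only after the ack has been posted */
	WRITE_ONCE(qp->resp.psn, (pkt_psn + 1) & BTH_PSN_MASK);
}

static bool psn_is_expected(struct rxe_qp *qp, u32 pkt_psn)
{
	/* pair the WRITE_ONCE() above with READ_ONCE() on the reader side */
	return pkt_psn == READ_ONCE(qp->resp.psn);
}
=====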

Thanks,
Daisuke

> 
> This version of the patch series drops the tasklet version as an option
> but keeps the option of switching between the workqueue and inline
> versions.
> 
> This patch series is derived from an earlier patch set developed by
> Ian Ziemba at HPE which is used in some Lustre storage clients attached
> to Lustre servers with hard RoCE v2 NICs.
> 
> It is based on the current version of wip/jgg-for-next.
> 
> v3:
> Link: https://lore.kernel.org/linux-rdma/202210220559.f7taTL8S-lkp@intel.com/
> The v3 version drops the first few patches which have already been accepted
> in for-next. It also drops the last patch of the v2 version which
> introduced module parameters to select between the task interfaces. It also
> drops the tasklet version entirely. It fixes a minor error caught by
> the kernel test robot <lkp@intel.com> with a missing static declaration.
> 
> v2:
> The v2 version of the patch set has some minor changes that address
> comments from Leon Romanovsky regarding locking of the valid parameter
> and the setup parameters for alloc_workqueue. It also has one
> additional cleanup patch.
> 
> Bob Pearson (13):
>   RDMA/rxe: Make task interface pluggable
>   RDMA/rxe: Split rxe_drain_resp_pkts()
>   RDMA/rxe: Simplify reset state handling in rxe_resp.c
>   RDMA/rxe: Handle qp error in rxe_resp.c
>   RDMA/rxe: Cleanup comp tasks in rxe_qp.c
>   RDMA/rxe: Remove __rxe_do_task()
>   RDMA/rxe: Make tasks schedule each other
>   RDMA/rxe: Implement disable/enable_task()
>   RDMA/rxe: Replace TASK_STATE_START by TASK_STATE_IDLE
>   RDMA/rxe: Replace task->destroyed by task state INVALID.
>   RDMA/rxe: Add workqueue support for tasks
>   RDMA/rxe: Make WORKQUEUE default for RC tasks
>   RDMA/rxe: Remove tasklets from rxe_task.c
> 
>  drivers/infiniband/sw/rxe/rxe.c      |   9 +-
>  drivers/infiniband/sw/rxe/rxe_comp.c |  24 ++-
>  drivers/infiniband/sw/rxe/rxe_qp.c   |  80 ++++-----
>  drivers/infiniband/sw/rxe/rxe_req.c  |   4 +-
>  drivers/infiniband/sw/rxe/rxe_resp.c |  70 +++++---
>  drivers/infiniband/sw/rxe/rxe_task.c | 258 +++++++++++++++++++--------
>  drivers/infiniband/sw/rxe/rxe_task.h |  56 +++---
>  7 files changed, 329 insertions(+), 172 deletions(-)
> 
> 
> base-commit: 692373d186205dfb1b56f35f22702412d94d9420
> --
> 2.34.1


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH for-next v3 00/13] Implement work queues for rdma_rxe
  2022-11-02 10:17 ` [PATCH for-next v3 00/13] Implement work queues for rdma_rxe Daisuke Matsuda (Fujitsu)
@ 2022-11-02 11:20   ` Bob Pearson
  2022-11-04  4:59     ` Daisuke Matsuda (Fujitsu)
  0 siblings, 1 reply; 23+ messages in thread
From: Bob Pearson @ 2022-11-02 11:20 UTC (permalink / raw)
  To: Daisuke Matsuda (Fujitsu), jgg, leon, zyjzyj2000, jhack, linux-rdma

On 11/2/22 05:17, Daisuke Matsuda (Fujitsu) wrote:
> On Sat, Oct 29, 2022 12:10 PM Bob Pearson wrote:
>> This patch series implements work queues as an alternative for
>> the main tasklets in the rdma_rxe driver. The patch series starts
>> with a patch that makes the internal API for task execution pluggable
>> and implements an inline and a tasklet based set of functions.
>> The remaining patches cleanup the qp reset and error code in the
>> three tasklets and modify the locking logic to prevent making
>> multiple calls to the tasklet scheduling routine. After
>> this preparation the work queue equivalent set of functions is
>> added and the tasklet version is dropped.
> 
> Thank you for posting the 3rd series.
> It looks fine at a glance, but now I am concerned about problems
> that can be potentially caused by concurrency.
> 
>>
>> The advantages of the work queue version of deferred task execution
>> is mainly that the work queue variant has much better scalability
>> and overall performance than the tasklet variant.  The perftest
>> microbenchmarks in local loopback mode (not a very realistic test
>> case) can reach approximately 100Gb/sec with work queues compared to
>> about 16Gb/sec for tasklets.
> 
> As you wrote, the advantage of work queue version is that the number works
> that can run parallelly scales with the number of logical CPUs. However, the
> dispatched works (rxe_requester, rxe_responder, and rxe_completer) are
> designed for serial execution on tasklet, so we must not rely on them functioning
> properly on parallel execution.

Work queues are serial for each separate work task just like tasklets. There isn't
a problem here. The tasklets for different tasks can run in parallel but tend to
do so less than work queue tasks. The reason is that tasklets are scheduled by
default on the same cpu as the thread that scheduled it while work queues are scheduled
by the kernel scheduler and get spread around.
> 
> There could be 3 problems, which stem from the fact that works are not necessarily
> executed in the same order the packets are received. Works are enqueued to worker
> pools on each CPU, and each CPU respectively schedules the works, so the ordering
> of works among CPUs is not guaranteed.
> 
> [1]
> On UC/UD connections, responder does not check the psn of inbound packets,
> so the payloads can be copied to MRs without checking the order. If there are
> works that write to overlapping memory locations, they can potentially cause
> data corruption depending the order.
> 
> [2]
> On RC connections, responder checks the psn, and drops the packet if it is not
> the expected one. Requester can retransmit the request in this case, so the order
> seems to be guaranteed for RC.
> 
> However, responder updates the next expected psn (qp->resp.psn) BEFORE
> replying an ACK packet. If the work is preempted soon after storing the next psn,
> another work on another CPU can potentially reply another ACK packet earlier.
> This behaviour is against the spec.
> Cf. IB Specification Vol 1-Release-1.5 " 9.5 TRANSACTION ORDERING"
> 
> [3]
> Again on RC connections, the next expected psn (qp->resp.psn) can be
> loaded and stored at the same time from different threads. It seems we
> have to use a synchronization method, perhaps like READ_ONCE() and
> WRITE_ONCE() macros, to prevent loading an old value. This one is just an
> example; there can be other variables that need similar consideration.
> 
> 
> All the problems above can be solved by making the work queue single-
> threaded. We can do it by using flags=WQ_UNBOUND and max_active=1
> for alloc_workqueue(), but this should be the last resort since this spoils
> the performance benefit of work queue.
> 
> I am not sure what we can do with [1] right now.
> For [2] and [3], we could just move the update of psn later than the ack reply,
> and use *_ONCE() macros for shared variables.
> 
> Thanks,
> Daisuke
> 
>>
>> This version of the patch series drops the tasklet version as an option
>> but keeps the option of switching between the workqueue and inline
>> versions.
>>
>> This patch series is derived from an earlier patch set developed by
>> Ian Ziemba at HPE which is used in some Lustre storage clients attached
>> to Lustre servers with hard RoCE v2 NICs.
>>
>> It is based on the current version of wip/jgg-for-next.
>>
>> v3:
>> Link: https://lore.kernel.org/linux-rdma/202210220559.f7taTL8S-lkp@intel.com/
>> The v3 version drops the first few patches which have already been accepted
>> in for-next. It also drops the last patch of the v2 version which
>> introduced module parameters to select between the task interfaces. It also
>> drops the tasklet version entirely. It fixes a minor error caught by
>> the kernel test robot <lkp@intel.com> with a missing static declaration.
>>
>> v2:
>> The v2 version of the patch set has some minor changes that address
>> comments from Leon Romanovsky regarding locking of the valid parameter
>> and the setup parameters for alloc_workqueue. It also has one
>> additional cleanup patch.
>>
>> Bob Pearson (13):
>>   RDMA/rxe: Make task interface pluggable
>>   RDMA/rxe: Split rxe_drain_resp_pkts()
>>   RDMA/rxe: Simplify reset state handling in rxe_resp.c
>>   RDMA/rxe: Handle qp error in rxe_resp.c
>>   RDMA/rxe: Cleanup comp tasks in rxe_qp.c
>>   RDMA/rxe: Remove __rxe_do_task()
>>   RDMA/rxe: Make tasks schedule each other
>>   RDMA/rxe: Implement disable/enable_task()
>>   RDMA/rxe: Replace TASK_STATE_START by TASK_STATE_IDLE
>>   RDMA/rxe: Replace task->destroyed by task state INVALID.
>>   RDMA/rxe: Add workqueue support for tasks
>>   RDMA/rxe: Make WORKQUEUE default for RC tasks
>>   RDMA/rxe: Remove tasklets from rxe_task.c
>>
>>  drivers/infiniband/sw/rxe/rxe.c      |   9 +-
>>  drivers/infiniband/sw/rxe/rxe_comp.c |  24 ++-
>>  drivers/infiniband/sw/rxe/rxe_qp.c   |  80 ++++-----
>>  drivers/infiniband/sw/rxe/rxe_req.c  |   4 +-
>>  drivers/infiniband/sw/rxe/rxe_resp.c |  70 +++++---
>>  drivers/infiniband/sw/rxe/rxe_task.c | 258 +++++++++++++++++++--------
>>  drivers/infiniband/sw/rxe/rxe_task.h |  56 +++---
>>  7 files changed, 329 insertions(+), 172 deletions(-)
>>
>>
>> base-commit: 692373d186205dfb1b56f35f22702412d94d9420
>> --
>> 2.34.1
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH for-next v3 00/13] Implement work queues for rdma_rxe
  2022-11-02 11:20   ` Bob Pearson
@ 2022-11-04  4:59     ` Daisuke Matsuda (Fujitsu)
  2022-11-05 21:15       ` Bob Pearson
  0 siblings, 1 reply; 23+ messages in thread
From: Daisuke Matsuda (Fujitsu) @ 2022-11-04  4:59 UTC (permalink / raw)
  To: 'Bob Pearson', jgg, leon, zyjzyj2000, jhack, linux-rdma

On Wed, Nov 2, 2022 8:21 PM Bob Pearson wrote:
> On 11/2/22 05:17, Daisuke Matsuda (Fujitsu) wrote:
> > On Sat, Oct 29, 2022 12:10 PM Bob Pearson wrote:
> >> This patch series implements work queues as an alternative for
> >> the main tasklets in the rdma_rxe driver. The patch series starts
> >> with a patch that makes the internal API for task execution pluggable
> >> and implements an inline and a tasklet based set of functions.
> >> The remaining patches cleanup the qp reset and error code in the
> >> three tasklets and modify the locking logic to prevent making
> >> multiple calls to the tasklet scheduling routine. After
> >> this preparation the work queue equivalent set of functions is
> >> added and the tasklet version is dropped.
> >
> > Thank you for posting the 3rd series.
> > It looks fine at a glance, but now I am concerned about problems
> > that can be potentially caused by concurrency.
> >
> >>
> >> The advantages of the work queue version of deferred task execution
> >> is mainly that the work queue variant has much better scalability
> >> and overall performance than the tasklet variant.  The perftest
> >> microbenchmarks in local loopback mode (not a very realistic test
> >> case) can reach approximately 100Gb/sec with work queues compared to
> >> about 16Gb/sec for tasklets.
> >
> > As you wrote, the advantage of work queue version is that the number works
> > that can run parallelly scales with the number of logical CPUs. However, the
> > dispatched works (rxe_requester, rxe_responder, and rxe_completer) are
> > designed for serial execution on tasklet, so we must not rely on them functioning
> > properly on parallel execution.
> 
> Work queues are serial for each separate work task just like tasklets. There isn't
> a problem here. The tasklets for different tasks can run in parallel but tend to
> do so less than work queue tasks. The reason is that tasklets are scheduled by
> default on the same cpu as the thread that scheduled it while work queues are scheduled
> by the kernel scheduler and get spread around.

=====
rxe_wq = alloc_workqueue("rxe_wq", WQ_CPU_INTENSIVE, WQ_MAX_ACTIVE);
=====
You are using the WQ_CPU_INTENSIVE flag. This allows works to be scheduled by
the system scheduler, but each work is still enqueued to worker pools of each CPU
and thus bound to the CPU the issuer is running on. It seems the behaviour you
expect can be achieved by the WQ_UNBOUND flag. Unbound work items will run
on any CPU at the cost of cache utilization.

Two of the same tasklets never run concurrently on two different processors by nature,
but that is not the case with work queues. If two softirqs running on different CPUs
enqueue responder works at almost the same time, it is possible that they are dispatched
and run on the different CPUs at the same time. I mean the problems may arise in such
a situation.

Please let me know if I missed anything. I referred to the following document.
The easiest solution is to use @flags= WQ_UNBOUND and @max_active=1 to let works
run serially.
cf. https://www.kernel.org/doc/html/latest/core-api/workqueue.html
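
A minimal sketch of that fallback, reusing the workqueue name from the
snippet above:

=====
/* serialize all rxe work items: unbound and at most one in flight */
rxe_wq = alloc_workqueue("rxe_wq", WQ_UNBOUND, 1);
if (!rxe_wq)
	return -ENOMEM;
=====

If I am reading the API right, alloc_ordered_workqueue("rxe_wq", 0) would be
an equivalent spelling of the same thing.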

Thanks,
Daisuke

> >
> > There could be 3 problems, which stem from the fact that works are not necessarily
> > executed in the same order the packets are received. Works are enqueued to worker
> > pools on each CPU, and each CPU respectively schedules the works, so the ordering
> > of works among CPUs is not guaranteed.
> >
> > [1]
> > On UC/UD connections, responder does not check the psn of inbound packets,
> > so the payloads can be copied to MRs without checking the order. If there are
> > works that write to overlapping memory locations, they can potentially cause
> > data corruption depending the order.
> >
> > [2]
> > On RC connections, responder checks the psn, and drops the packet if it is not
> > the expected one. Requester can retransmit the request in this case, so the order
> > seems to be guaranteed for RC.
> >
> > However, responder updates the next expected psn (qp->resp.psn) BEFORE
> > replying an ACK packet. If the work is preempted soon after storing the next psn,
> > another work on another CPU can potentially reply another ACK packet earlier.
> > This behaviour is against the spec.
> > Cf. IB Specification Vol 1-Release-1.5 " 9.5 TRANSACTION ORDERING"
> >
> > [3]
> > Again on RC connections, the next expected psn (qp->resp.psn) can be
> > loaded and stored at the same time from different threads. It seems we
> > have to use a synchronization method, perhaps like READ_ONCE() and
> > WRITE_ONCE() macros, to prevent loading an old value. This one is just an
> > example; there can be other variables that need similar consideration.
> >
> >
> > All the problems above can be solved by making the work queue single-
> > threaded. We can do it by using flags=WQ_UNBOUND and max_active=1
> > for alloc_workqueue(), but this should be the last resort since this spoils
> > the performance benefit of work queue.
> >
> > I am not sure what we can do with [1] right now.
> > For [2] and [3], we could just move the update of psn later than the ack reply,
> > and use *_ONCE() macros for shared variables.
> >
> > Thanks,
> > Daisuke
> >
> >>
> >> This version of the patch series drops the tasklet version as an option
> >> but keeps the option of switching between the workqueue and inline
> >> versions.
> >>
> >> This patch series is derived from an earlier patch set developed by
> >> Ian Ziemba at HPE which is used in some Lustre storage clients attached
> >> to Lustre servers with hard RoCE v2 NICs.
> >>
> >> It is based on the current version of wip/jgg-for-next.
> >>
> >> v3:
> >> Link: https://lore.kernel.org/linux-rdma/202210220559.f7taTL8S-lkp@intel.com/
> >> The v3 version drops the first few patches which have already been accepted
> >> in for-next. It also drops the last patch of the v2 version which
> >> introduced module parameters to select between the task interfaces. It also
> >> drops the tasklet version entirely. It fixes a minor error caught by
> >> the kernel test robot <lkp@intel.com> with a missing static declaration.
> >>
> >> v2:
> >> The v2 version of the patch set has some minor changes that address
> >> comments from Leon Romanovsky regarding locking of the valid parameter
> >> and the setup parameters for alloc_workqueue. It also has one
> >> additional cleanup patch.
> >>
> >> Bob Pearson (13):
> >>   RDMA/rxe: Make task interface pluggable
> >>   RDMA/rxe: Split rxe_drain_resp_pkts()
> >>   RDMA/rxe: Simplify reset state handling in rxe_resp.c
> >>   RDMA/rxe: Handle qp error in rxe_resp.c
> >>   RDMA/rxe: Cleanup comp tasks in rxe_qp.c
> >>   RDMA/rxe: Remove __rxe_do_task()
> >>   RDMA/rxe: Make tasks schedule each other
> >>   RDMA/rxe: Implement disable/enable_task()
> >>   RDMA/rxe: Replace TASK_STATE_START by TASK_STATE_IDLE
> >>   RDMA/rxe: Replace task->destroyed by task state INVALID.
> >>   RDMA/rxe: Add workqueue support for tasks
> >>   RDMA/rxe: Make WORKQUEUE default for RC tasks
> >>   RDMA/rxe: Remove tasklets from rxe_task.c
> >>
> >>  drivers/infiniband/sw/rxe/rxe.c      |   9 +-
> >>  drivers/infiniband/sw/rxe/rxe_comp.c |  24 ++-
> >>  drivers/infiniband/sw/rxe/rxe_qp.c   |  80 ++++-----
> >>  drivers/infiniband/sw/rxe/rxe_req.c  |   4 +-
> >>  drivers/infiniband/sw/rxe/rxe_resp.c |  70 +++++---
> >>  drivers/infiniband/sw/rxe/rxe_task.c | 258 +++++++++++++++++++--------
> >>  drivers/infiniband/sw/rxe/rxe_task.h |  56 +++---
> >>  7 files changed, 329 insertions(+), 172 deletions(-)
> >>
> >>
> >> base-commit: 692373d186205dfb1b56f35f22702412d94d9420
> >> --
> >> 2.34.1
> >


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH for-next v3 00/13] Implement work queues for rdma_rxe
  2022-11-04  4:59     ` Daisuke Matsuda (Fujitsu)
@ 2022-11-05 21:15       ` Bob Pearson
  2022-11-07  8:21         ` Daisuke Matsuda (Fujitsu)
  0 siblings, 1 reply; 23+ messages in thread
From: Bob Pearson @ 2022-11-05 21:15 UTC (permalink / raw)
  To: Daisuke Matsuda (Fujitsu), jgg, leon, zyjzyj2000, jhack, linux-rdma

On 11/3/22 23:59, Daisuke Matsuda (Fujitsu) wrote:
> On Wed, Nov 2, 2022 8:21 PM Bob Pearson wrote:
>> On 11/2/22 05:17, Daisuke Matsuda (Fujitsu) wrote:
>>> On Sat, Oct 29, 2022 12:10 PM Bob Pearson wrote:
>>>> This patch series implements work queues as an alternative for
>>>> the main tasklets in the rdma_rxe driver. The patch series starts
>>>> with a patch that makes the internal API for task execution pluggable
>>>> and implements an inline and a tasklet based set of functions.
>>>> The remaining patches cleanup the qp reset and error code in the
>>>> three tasklets and modify the locking logic to prevent making
>>>> multiple calls to the tasklet scheduling routine. After
>>>> this preparation the work queue equivalent set of functions is
>>>> added and the tasklet version is dropped.
>>>
>>> Thank you for posting the 3rd series.
>>> It looks fine at a glance, but now I am concerned about problems
>>> that can be potentially caused by concurrency.
>>>
>>>>
>>>> The advantages of the work queue version of deferred task execution
>>>> is mainly that the work queue variant has much better scalability
>>>> and overall performance than the tasklet variant.  The perftest
>>>> microbenchmarks in local loopback mode (not a very realistic test
>>>> case) can reach approximately 100Gb/sec with work queues compared to
>>>> about 16Gb/sec for tasklets.
>>>
>>> As you wrote, the advantage of work queue version is that the number works
>>> that can run parallelly scales with the number of logical CPUs. However, the
>>> dispatched works (rxe_requester, rxe_responder, and rxe_completer) are
>>> designed for serial execution on tasklet, so we must not rely on them functioning
>>> properly on parallel execution.
>>
>> Work queues are serial for each separate work task just like tasklets. There isn't
>> a problem here. The tasklets for different tasks can run in parallel but tend to
>> do so less than work queue tasks. The reason is that tasklets are scheduled by
>> default on the same cpu as the thread that scheduled it while work queues are scheduled
>> by the kernel scheduler and get spread around.
> 
> =====
> rxe_wq = alloc_workqueue("rxe_wq", WQ_CPU_INTENSIVE, WQ_MAX_ACTIVE);
> =====
> You are using the WQ_CPU_INTENSIVE flag. This allows works to be scheduled by
> the system scheduler, but each work is still enqueued to worker pools of each CPU
> and thus bound to the CPU the issuer is running on. It seems the behaviour you
> expect can be achieved by the WQ_UNBOUND flag. Unbound work items will run
> on any CPU at the cost of cache utilization.
> 
> Two of the same tasklets never run concurrently on two different processors by nature,
> but that is not the case with work queues. If two softirqs running on different CPUs
> enqueue responder works at almost the same time, it is possible that they are dispatched
> and run on the different CPUs at the same time. I mean the problems may arise in such
> a situation.
> 
> Please let me know if I missed anything. I referred to the following document.
> The easiest solution is to use @flags= WQ_UNBOUND and @max_active=1 to let works
> run serially.
> cf. https://www.kernel.org/doc/html/latest/core-api/workqueue.html
> 
> Thanks,
> Daisuke
> 
According to this:

    Workqueue guarantees that a work item cannot be re-entrant if the following conditions hold
    after a work item gets queued:

        The work function hasn’t been changed.

        No one queues the work item to another workqueue.

        The work item hasn’t been reinitiated.

    In other words, if the above conditions hold, the work item is guaranteed to be executed by at
    most one worker system-wide at any given time.

    Note that requeuing the work item (to the same queue) in the self function doesn’t break these
    conditions, so it’s safe to do. Otherwise, caution is required when breaking the conditions
    inside a work function.

It should be OK. Each work item checks the state under lock before scheduling the item
and, if it is free, moves it to busy and then schedules it. Only one instance of a work
item should be running at a time.
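
Roughly like this (a condensed sketch of the rxe_task.c logic in this series;
the body of task_is_idle() here is my paraphrase, not a verbatim copy, and the
exact state names may differ):

=====
static bool task_is_idle(struct rxe_task *task)
{
	bool idle;

	/* check and change the state under the task lock so that only
	 * one caller can move the task from idle to busy
	 */
	spin_lock_bh(&task->lock);
	idle = (task->state == TASK_STATE_IDLE);
	if (idle)
		task->state = TASK_STATE_BUSY;
	spin_unlock_bh(&task->lock);

	return idle;
}

static void work_sched(struct rxe_task *task)
{
	if (task_is_idle(task))
		queue_work(rxe_wq, &task->work);
}
=====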

I only know what I see from running top. It seems that the work items do get spread out over
time on the cpus.

The CPU_INTENSIVE flag is certainly correct for our application, which will run all the
cpus at 100% for extended periods of time. We are benchmarking storage with IOR.

Bob

>>>
>>> There could be 3 problems, which stem from the fact that works are not necessarily
>>> executed in the same order the packets are received. Works are enqueued to worker
>>> pools on each CPU, and each CPU respectively schedules the works, so the ordering
>>> of works among CPUs is not guaranteed.
>>>
>>> [1]
>>> On UC/UD connections, responder does not check the psn of inbound packets,
>>> so the payloads can be copied to MRs without checking the order. If there are
>>> works that write to overlapping memory locations, they can potentially cause
>>> data corruption depending the order.
>>>
>>> [2]
>>> On RC connections, responder checks the psn, and drops the packet if it is not
>>> the expected one. Requester can retransmit the request in this case, so the order
>>> seems to be guaranteed for RC.
>>>
>>> However, responder updates the next expected psn (qp->resp.psn) BEFORE
>>> replying an ACK packet. If the work is preempted soon after storing the next psn,
>>> another work on another CPU can potentially reply another ACK packet earlier.
>>> This behaviour is against the spec.
>>> Cf. IB Specification Vol 1-Release-1.5 " 9.5 TRANSACTION ORDERING"
>>>
>>> [3]
>>> Again on RC connections, the next expected psn (qp->resp.psn) can be
>>> loaded and stored at the same time from different threads. It seems we
>>> have to use a synchronization method, perhaps like READ_ONCE() and
>>> WRITE_ONCE() macros, to prevent loading an old value. This one is just an
>>> example; there can be other variables that need similar consideration.
>>>
>>>
>>> All the problems above can be solved by making the work queue single-
>>> threaded. We can do it by using flags=WQ_UNBOUND and max_active=1
>>> for alloc_workqueue(), but this should be the last resort since this spoils
>>> the performance benefit of work queue.
>>>
>>> I am not sure what we can do with [1] right now.
>>> For [2] and [3], we could just move the update of psn later than the ack reply,
>>> and use *_ONCE() macros for shared variables.
>>>
>>> Thanks,
>>> Daisuke
>>>
>>>>
>>>> This version of the patch series drops the tasklet version as an option
>>>> but keeps the option of switching between the workqueue and inline
>>>> versions.
>>>>
>>>> This patch series is derived from an earlier patch set developed by
>>>> Ian Ziemba at HPE which is used in some Lustre storage clients attached
>>>> to Lustre servers with hard RoCE v2 NICs.
>>>>
>>>> It is based on the current version of wip/jgg-for-next.
>>>>
>>>> v3:
>>>> Link: https://lore.kernel.org/linux-rdma/202210220559.f7taTL8S-lkp@intel.com/
>>>> The v3 version drops the first few patches which have already been accepted
>>>> in for-next. It also drops the last patch of the v2 version which
>>>> introduced module parameters to select between the task interfaces. It also
>>>> drops the tasklet version entirely. It fixes a minor error caught by
>>>> the kernel test robot <lkp@intel.com> with a missing static declaration.
>>>>
>>>> v2:
>>>> The v2 version of the patch set has some minor changes that address
>>>> comments from Leon Romanovsky regarding locking of the valid parameter
>>>> and the setup parameters for alloc_workqueue. It also has one
>>>> additional cleanup patch.
>>>>
>>>> Bob Pearson (13):
>>>>   RDMA/rxe: Make task interface pluggable
>>>>   RDMA/rxe: Split rxe_drain_resp_pkts()
>>>>   RDMA/rxe: Simplify reset state handling in rxe_resp.c
>>>>   RDMA/rxe: Handle qp error in rxe_resp.c
>>>>   RDMA/rxe: Cleanup comp tasks in rxe_qp.c
>>>>   RDMA/rxe: Remove __rxe_do_task()
>>>>   RDMA/rxe: Make tasks schedule each other
>>>>   RDMA/rxe: Implement disable/enable_task()
>>>>   RDMA/rxe: Replace TASK_STATE_START by TASK_STATE_IDLE
>>>>   RDMA/rxe: Replace task->destroyed by task state INVALID.
>>>>   RDMA/rxe: Add workqueue support for tasks
>>>>   RDMA/rxe: Make WORKQUEUE default for RC tasks
>>>>   RDMA/rxe: Remove tasklets from rxe_task.c
>>>>
>>>>  drivers/infiniband/sw/rxe/rxe.c      |   9 +-
>>>>  drivers/infiniband/sw/rxe/rxe_comp.c |  24 ++-
>>>>  drivers/infiniband/sw/rxe/rxe_qp.c   |  80 ++++-----
>>>>  drivers/infiniband/sw/rxe/rxe_req.c  |   4 +-
>>>>  drivers/infiniband/sw/rxe/rxe_resp.c |  70 +++++---
>>>>  drivers/infiniband/sw/rxe/rxe_task.c | 258 +++++++++++++++++++--------
>>>>  drivers/infiniband/sw/rxe/rxe_task.h |  56 +++---
>>>>  7 files changed, 329 insertions(+), 172 deletions(-)
>>>>
>>>>
>>>> base-commit: 692373d186205dfb1b56f35f22702412d94d9420
>>>> --
>>>> 2.34.1
>>>
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: [PATCH for-next v3 00/13] Implement work queues for rdma_rxe
  2022-11-05 21:15       ` Bob Pearson
@ 2022-11-07  8:21         ` Daisuke Matsuda (Fujitsu)
  2022-11-07 16:06           ` Bob Pearson
  0 siblings, 1 reply; 23+ messages in thread
From: Daisuke Matsuda (Fujitsu) @ 2022-11-07  8:21 UTC (permalink / raw)
  To: 'Bob Pearson', jgg, leon, zyjzyj2000, jhack, linux-rdma

On Sun, Nov 6, 2022 6:15 AM Bob Pearson
> On 11/3/22 23:59, Daisuke Matsuda (Fujitsu) wrote:
> > On Wed, Nov 2, 2022 8:21 PM Bob Pearson wrote:
> >> On 11/2/22 05:17, Daisuke Matsuda (Fujitsu) wrote:
> >>> On Sat, Oct 29, 2022 12:10 PM Bob Pearson wrote:
> >>>> This patch series implements work queues as an alternative for
> >>>> the main tasklets in the rdma_rxe driver. The patch series starts
> >>>> with a patch that makes the internal API for task execution pluggable
> >>>> and implements an inline and a tasklet based set of functions.
> >>>> The remaining patches cleanup the qp reset and error code in the
> >>>> three tasklets and modify the locking logic to prevent making
> >>>> multiple calls to the tasklet scheduling routine. After
> >>>> this preparation the work queue equivalent set of functions is
> >>>> added and the tasklet version is dropped.
> >>>
> >>> Thank you for posting the 3rd series.
> >>> It looks fine at a glance, but now I am concerned about problems
> >>> that can be potentially caused by concurrency.
> >>>
> >>>>
> >>>> The advantages of the work queue version of deferred task execution
> >>>> is mainly that the work queue variant has much better scalability
> >>>> and overall performance than the tasklet variant.  The perftest
> >>>> microbenchmarks in local loopback mode (not a very realistic test
> >>>> case) can reach approximately 100Gb/sec with work queues compared to
> >>>> about 16Gb/sec for tasklets.
> >>>
> >>> As you wrote, the advantage of work queue version is that the number works
> >>> that can run parallelly scales with the number of logical CPUs. However, the
> >>> dispatched works (rxe_requester, rxe_responder, and rxe_completer) are
> >>> designed for serial execution on tasklet, so we must not rely on them functioning
> >>> properly on parallel execution.
> >>
> >> Work queues are serial for each separate work task just like tasklets. There isn't
> >> a problem here. The tasklets for different tasks can run in parallel but tend to
> >> do so less than work queue tasks. The reason is that tasklets are scheduled by
> >> default on the same cpu as the thread that scheduled it while work queues are scheduled
> >> by the kernel scheduler and get spread around.
> >
> > =====
> > rxe_wq = alloc_workqueue("rxe_wq", WQ_CPU_INTENSIVE, WQ_MAX_ACTIVE);
> > =====
> > You are using the WQ_CPU_INTENSIVE flag. This allows works to be scheduled by
> > the system scheduler, but each work is still enqueued to worker pools of each CPU
> > and thus bound to the CPU the issuer is running on. It seems the behaviour you
> > expect can be achieved by the WQ_UNBOUND flag. Unbound work items will run
> > on any CPU at the cost of cache utilization.
> >
> > Two of the same tasklets never run concurrently on two different processors by nature,
> > but that is not the case with work queues. If two softirqs running on different CPUs
> > enqueue responder works at almost the same time, it is possible that they are dispatched
> > and run on the different CPUs at the same time. I mean the problems may arise in such
> > a situation.
> >
> > Please let me know if I missed anything. I referred to the following document.
> > The easiest solution is to use @flags= WQ_UNBOUND and @max_active=1 to let works
> > run serially.
> > cf. https://www.kernel.org/doc/html/latest/core-api/workqueue.html
> >
> > Thanks,
> > Daisuke
> >
> According to this:
> 
>     Workqueue guarantees that a work item cannot be re-entrant if the following conditions hold
>     after a work item gets queued:
> 
>         The work function hasn’t been changed.
> 
>         No one queues the work item to another workqueue.
> 
>         The work item hasn’t been reinitiated.
> 
>     In other words, if the above conditions hold, the work item is guaranteed to be executed by at
>     most one worker system-wide at any given time.
> 
>     Note that requeuing the work item (to the same queue) in the self function doesn’t break these
>     conditions, so it’s safe to do. Otherwise, caution is required when breaking the conditions
>     inside a work function.
> 
> I should be OK. Each work item checks the state under lock before scheduling the item and
> if it is free moves it to busy and then schedules it. Only one instance of a work item
> at a time should be running.

Thank you for the explanation.
Per-qp work items should meet the three conditions. That is what I was missing.
Now I see. You are correct.

> 
> I only know what I see from running top. It seems that the work items do get spread out over
> time on the cpus.

It seems process_one_work() schedules items for both UNBOUND and CPU_INTENSIVE
workers in the same way. This is not stated explicitly in the document.

> 
> The CPU_INTENSIVE is certainly correct for our application which will run all the cpus at
> 100% for extended periods of time. We are benchmarking storage with IOR.

It is OK with me. I have not come up with any situations where the CPU_INTENSIVE
flag bothers other rxe users.

Thanks,
Daisuke

> 
> Bob
> 
> >>>
> >>> There could be 3 problems, which stem from the fact that works are not necessarily
> >>> executed in the same order the packets are received. Works are enqueued to worker
> >>> pools on each CPU, and each CPU respectively schedules the works, so the ordering
> >>> of works among CPUs is not guaranteed.
> >>>
> >>> [1]
> >>> On UC/UD connections, responder does not check the psn of inbound packets,
> >>> so the payloads can be copied to MRs without checking the order. If there are
> >>> works that write to overlapping memory locations, they can potentially cause
> >>> data corruption depending the order.
> >>>
> >>> [2]
> >>> On RC connections, responder checks the psn, and drops the packet if it is not
> >>> the expected one. Requester can retransmit the request in this case, so the order
> >>> seems to be guaranteed for RC.
> >>>
> >>> However, responder updates the next expected psn (qp->resp.psn) BEFORE
> >>> replying an ACK packet. If the work is preempted soon after storing the next psn,
> >>> another work on another CPU can potentially reply another ACK packet earlier.
> >>> This behaviour is against the spec.
> >>> Cf. IB Specification Vol 1-Release-1.5 " 9.5 TRANSACTION ORDERING"
> >>>
> >>> [3]
> >>> Again on RC connections, the next expected psn (qp->resp.psn) can be
> >>> loaded and stored at the same time from different threads. It seems we
> >>> have to use a synchronization method, perhaps like READ_ONCE() and
> >>> WRITE_ONCE() macros, to prevent loading an old value. This one is just an
> >>> example; there can be other variables that need similar consideration.
> >>>
> >>>
> >>> All the problems above can be solved by making the work queue single-
> >>> threaded. We can do it by using flags=WQ_UNBOUND and max_active=1
> >>> for alloc_workqueue(), but this should be the last resort since this spoils
> >>> the performance benefit of work queue.
> >>>
> >>> I am not sure what we can do with [1] right now.
> >>> For [2] and [3], we could just move the update of psn later than the ack reply,
> >>> and use *_ONCE() macros for shared variables.
> >>>
> >>> Thanks,
> >>> Daisuke
> >>>
> >>>>
> >>>> This version of the patch series drops the tasklet version as an option
> >>>> but keeps the option of switching between the workqueue and inline
> >>>> versions.
> >>>>
> >>>> This patch series is derived from an earlier patch set developed by
> >>>> Ian Ziemba at HPE which is used in some Lustre storage clients attached
> >>>> to Lustre servers with hard RoCE v2 NICs.
> >>>>
> >>>> It is based on the current version of wip/jgg-for-next.
> >>>>
> >>>> v3:
> >>>> Link: https://lore.kernel.org/linux-rdma/202210220559.f7taTL8S-lkp@intel.com/
> >>>> The v3 version drops the first few patches which have already been accepted
> >>>> in for-next. It also drops the last patch of the v2 version which
> >>>> introduced module parameters to select between the task interfaces. It also
> >>>> drops the tasklet version entirely. It fixes a minor error caught by
> >>>> the kernel test robot <lkp@intel.com> with a missing static declaration.
> >>>>
> >>>> v2:
> >>>> The v2 version of the patch set has some minor changes that address
> >>>> comments from Leon Romanovsky regarding locking of the valid parameter
> >>>> and the setup parameters for alloc_workqueue. It also has one
> >>>> additional cleanup patch.
> >>>>
> >>>> Bob Pearson (13):
> >>>>   RDMA/rxe: Make task interface pluggable
> >>>>   RDMA/rxe: Split rxe_drain_resp_pkts()
> >>>>   RDMA/rxe: Simplify reset state handling in rxe_resp.c
> >>>>   RDMA/rxe: Handle qp error in rxe_resp.c
> >>>>   RDMA/rxe: Cleanup comp tasks in rxe_qp.c
> >>>>   RDMA/rxe: Remove __rxe_do_task()
> >>>>   RDMA/rxe: Make tasks schedule each other
> >>>>   RDMA/rxe: Implement disable/enable_task()
> >>>>   RDMA/rxe: Replace TASK_STATE_START by TASK_STATE_IDLE
> >>>>   RDMA/rxe: Replace task->destroyed by task state INVALID.
> >>>>   RDMA/rxe: Add workqueue support for tasks
> >>>>   RDMA/rxe: Make WORKQUEUE default for RC tasks
> >>>>   RDMA/rxe: Remove tasklets from rxe_task.c
> >>>>
> >>>>  drivers/infiniband/sw/rxe/rxe.c      |   9 +-
> >>>>  drivers/infiniband/sw/rxe/rxe_comp.c |  24 ++-
> >>>>  drivers/infiniband/sw/rxe/rxe_qp.c   |  80 ++++-----
> >>>>  drivers/infiniband/sw/rxe/rxe_req.c  |   4 +-
> >>>>  drivers/infiniband/sw/rxe/rxe_resp.c |  70 +++++---
> >>>>  drivers/infiniband/sw/rxe/rxe_task.c | 258 +++++++++++++++++++--------
> >>>>  drivers/infiniband/sw/rxe/rxe_task.h |  56 +++---
> >>>>  7 files changed, 329 insertions(+), 172 deletions(-)
> >>>>
> >>>>
> >>>> base-commit: 692373d186205dfb1b56f35f22702412d94d9420
> >>>> --
> >>>> 2.34.1
> >>>
> >


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH for-next v3 00/13] Implement work queues for rdma_rxe
  2022-11-07  8:21         ` Daisuke Matsuda (Fujitsu)
@ 2022-11-07 16:06           ` Bob Pearson
  0 siblings, 0 replies; 23+ messages in thread
From: Bob Pearson @ 2022-11-07 16:06 UTC (permalink / raw)
  To: Daisuke Matsuda (Fujitsu), jgg, leon, zyjzyj2000, jhack, linux-rdma

On 11/7/22 02:21, Daisuke Matsuda (Fujitsu) wrote:
> On Sun, Nov 6, 2022 6:15 AM Bob Pearson
>> On 11/3/22 23:59, Daisuke Matsuda (Fujitsu) wrote:
>>> On Wed, Nov 2, 2022 8:21 PM Bob Pearson wrote:
>>>> On 11/2/22 05:17, Daisuke Matsuda (Fujitsu) wrote:
>>>>> On Sat, Oct 29, 2022 12:10 PM Bob Pearson wrote:
>>>>>> This patch series implements work queues as an alternative for
>>>>>> the main tasklets in the rdma_rxe driver. The patch series starts
>>>>>> with a patch that makes the internal API for task execution pluggable
>>>>>> and implements an inline and a tasklet based set of functions.
>>>>>> The remaining patches cleanup the qp reset and error code in the
>>>>>> three tasklets and modify the locking logic to prevent making
>>>>>> multiple calls to the tasklet scheduling routine. After
>>>>>> this preparation the work queue equivalent set of functions is
>>>>>> added and the tasklet version is dropped.
>>>>>
>>>>> Thank you for posting the 3rd series.
>>>>> It looks fine at a glance, but now I am concerned about problems
>>>>> that can be potentially caused by concurrency.
>>>>>
>>>>>>
>>>>>> The advantages of the work queue version of deferred task execution
>>>>>> is mainly that the work queue variant has much better scalability
>>>>>> and overall performance than the tasklet variant.  The perftest
>>>>>> microbenchmarks in local loopback mode (not a very realistic test
>>>>>> case) can reach approximately 100Gb/sec with work queues compared to
>>>>>> about 16Gb/sec for tasklets.
>>>>>
>>>>> As you wrote, the advantage of work queue version is that the number works
>>>>> that can run parallelly scales with the number of logical CPUs. However, the
>>>>> dispatched works (rxe_requester, rxe_responder, and rxe_completer) are
>>>>> designed for serial execution on tasklet, so we must not rely on them functioning
>>>>> properly on parallel execution.
>>>>
>>>> Work queues are serial for each separate work task just like tasklets. There isn't
>>>> a problem here. The tasklets for different tasks can run in parallel but tend to
>>>> do so less than work queue tasks. The reason is that tasklets are scheduled by
>>>> default on the same cpu as the thread that scheduled it while work queues are scheduled
>>>> by the kernel scheduler and get spread around.
>>>
>>> =====
>>> rxe_wq = alloc_workqueue("rxe_wq", WQ_CPU_INTENSIVE, WQ_MAX_ACTIVE);
>>> =====
>>> You are using the WQ_CPU_INTENSIVE flag. This allows works to be scheduled by
>>> the system scheduler, but each work is still enqueued to worker pools of each CPU
>>> and thus bound to the CPU the issuer is running on. It seems the behaviour you
>>> expect can be achieved by the WQ_UNBOUND flag. Unbound work items will run
>>> on any CPU at the cost of cache utilization.
>>>
>>> Two of the same tasklets never run concurrently on two different processors by nature,
>>> but that is not the case with work queues. If two softirqs running on different CPUs
>>> enqueue responder works at almost the same time, it is possible that they are dispatched
>>> and run on the different CPUs at the same time. I mean the problems may arise in such
>>> a situation.
>>>
>>> Please let me know if I missed anything. I referred to the following document.
>>> The easiest solution is to use @flags= WQ_UNBOUND and @max_active=1 to let works
>>> run serially.
>>> cf. https://www.kernel.org/doc/html/latest/core-api/workqueue.html
>>>
>>> Thanks,
>>> Daisuke
>>>
>> According to this:
>>
>>     Workqueue guarantees that a work item cannot be re-entrant if the following conditions hold
>>     after a work item gets queued:
>>
>>         The work function hasn’t been changed.
>>
>>         No one queues the work item to another workqueue.
>>
>>         The work item hasn’t been reinitiated.
>>
>>     In other words, if the above conditions hold, the work item is guaranteed to be executed by at
>>     most one worker system-wide at any given time.
>>
>>     Note that requeuing the work item (to the same queue) in the self function doesn’t break these
>>     conditions, so it’s safe to do. Otherwise, caution is required when breaking the conditions
>>     inside a work function.
>>
>> I should be OK. Each work item checks the state under lock before scheduling the item and
>> if it is free moves it to busy and then schedules it. Only one instance of a work item
>> at a time should be running.
> 
> Thank you for the explanation.
> Per-qp work items should meet the three conditions. That is what I have missing.
> Now I see. You are correct.
> 
>>
>> I only know what I see from running top. It seems that the work items do get spread out over
>> time on the cpus.
> 
> It seems process_one_work() schedules items for both UNBOUND and CPU_INTENSIVE
> workers in the same way. This is not stated explicitly in the document.
> 
>>
>> The CPU_INTENSIVE is certainly correct for our application which will run all the cpus at
>> 100% for extended periods of time. We are benchmarking storage with IOR.
> 
> It is OK with me. I have not come up with any situations where the CPU_INTENSIVE
> flag bothers other rxe users.
> 
> Thanks,
> Daisuke
> 
>>
>> Bob
>>
>>>>>
>>>>> There could be 3 problems, which stem from the fact that works are not necessarily
>>>>> executed in the same order the packets are received. Works are enqueued to worker
>>>>> pools on each CPU, and each CPU respectively schedules the works, so the ordering
>>>>> of works among CPUs is not guaranteed.
>>>>>
>>>>> [1]
>>>>> On UC/UD connections, responder does not check the psn of inbound packets,
>>>>> so the payloads can be copied to MRs without checking the order. If there are
>>>>> works that write to overlapping memory locations, they can potentially cause
>>>>> data corruption depending the order.
>>>>>
>>>>> [2]
>>>>> On RC connections, responder checks the psn, and drops the packet if it is not
>>>>> the expected one. Requester can retransmit the request in this case, so the order
>>>>> seems to be guaranteed for RC.
>>>>>
>>>>> However, responder updates the next expected psn (qp->resp.psn) BEFORE
>>>>> replying an ACK packet. If the work is preempted soon after storing the next psn,
>>>>> another work on another CPU can potentially reply another ACK packet earlier.
>>>>> This behaviour is against the spec.
>>>>> Cf. IB Specification Vol 1-Release-1.5 " 9.5 TRANSACTION ORDERING"
>>>>>
>>>>> [3]
>>>>> Again on RC connections, the next expected psn (qp->resp.psn) can be
>>>>> loaded and stored at the same time from different threads. It seems we
>>>>> have to use a synchronization method, perhaps like READ_ONCE() and
>>>>> WRITE_ONCE() macros, to prevent loading an old value. This one is just an
>>>>> example; there can be other variables that need similar consideration.
>>>>>
>>>>>
>>>>> All the problems above can be solved by making the work queue single-
>>>>> threaded. We can do it by using flags=WQ_UNBOUND and max_active=1
>>>>> for alloc_workqueue(), but this should be the last resort since this spoils
>>>>> the performance benefit of work queue.
>>>>>
>>>>> I am not sure what we can do with [1] right now.
>>>>> For [2] and [3], we could just move the update of psn later than the ack reply,
>>>>> and use *_ONCE() macros for shared variables.
>>>>>
>>>>> Thanks,
>>>>> Daisuke

Thank you for taking the time to review this.

Bob
>>>>>
>>>>>>
>>>>>> This version of the patch series drops the tasklet version as an option
>>>>>> but keeps the option of switching between the workqueue and inline
>>>>>> versions.
>>>>>>
>>>>>> This patch series is derived from an earlier patch set developed by
>>>>>> Ian Ziemba at HPE which is used in some Lustre storage clients attached
>>>>>> to Lustre servers with hard RoCE v2 NICs.
>>>>>>
>>>>>> It is based on the current version of wip/jgg-for-next.
>>>>>>
>>>>>> v3:
>>>>>> Link: https://lore.kernel.org/linux-rdma/202210220559.f7taTL8S-lkp@intel.com/
>>>>>> The v3 version drops the first few patches which have already been accepted
>>>>>> in for-next. It also drops the last patch of the v2 version which
>>>>>> introduced module parameters to select between the task interfaces. It also
>>>>>> drops the tasklet version entirely. It fixes a minor error caught by
>>>>>> the kernel test robot <lkp@intel.com> with a missing static declaration.
>>>>>>
>>>>>> v2:
>>>>>> The v2 version of the patch set has some minor changes that address
>>>>>> comments from Leon Romanovsky regarding locking of the valid parameter
>>>>>> and the setup parameters for alloc_workqueue. It also has one
>>>>>> additional cleanup patch.
>>>>>>
>>>>>> Bob Pearson (13):
>>>>>>   RDMA/rxe: Make task interface pluggable
>>>>>>   RDMA/rxe: Split rxe_drain_resp_pkts()
>>>>>>   RDMA/rxe: Simplify reset state handling in rxe_resp.c
>>>>>>   RDMA/rxe: Handle qp error in rxe_resp.c
>>>>>>   RDMA/rxe: Cleanup comp tasks in rxe_qp.c
>>>>>>   RDMA/rxe: Remove __rxe_do_task()
>>>>>>   RDMA/rxe: Make tasks schedule each other
>>>>>>   RDMA/rxe: Implement disable/enable_task()
>>>>>>   RDMA/rxe: Replace TASK_STATE_START by TASK_STATE_IDLE
>>>>>>   RDMA/rxe: Replace task->destroyed by task state INVALID.
>>>>>>   RDMA/rxe: Add workqueue support for tasks
>>>>>>   RDMA/rxe: Make WORKQUEUE default for RC tasks
>>>>>>   RDMA/rxe: Remove tasklets from rxe_task.c
>>>>>>
>>>>>>  drivers/infiniband/sw/rxe/rxe.c      |   9 +-
>>>>>>  drivers/infiniband/sw/rxe/rxe_comp.c |  24 ++-
>>>>>>  drivers/infiniband/sw/rxe/rxe_qp.c   |  80 ++++-----
>>>>>>  drivers/infiniband/sw/rxe/rxe_req.c  |   4 +-
>>>>>>  drivers/infiniband/sw/rxe/rxe_resp.c |  70 +++++---
>>>>>>  drivers/infiniband/sw/rxe/rxe_task.c | 258 +++++++++++++++++++--------
>>>>>>  drivers/infiniband/sw/rxe/rxe_task.h |  56 +++---
>>>>>>  7 files changed, 329 insertions(+), 172 deletions(-)
>>>>>>
>>>>>>
>>>>>> base-commit: 692373d186205dfb1b56f35f22702412d94d9420
>>>>>> --
>>>>>> 2.34.1
>>>>>
>>>
> 


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH for-next v3 01/13] RDMA/rxe: Make task interface pluggable
  2022-10-29  3:09 ` [PATCH for-next v3 01/13] RDMA/rxe: Make task interface pluggable Bob Pearson
@ 2022-11-11  2:28   ` Yanjun Zhu
  0 siblings, 0 replies; 23+ messages in thread
From: Yanjun Zhu @ 2022-11-11  2:28 UTC (permalink / raw)
  To: Bob Pearson, jgg, leon, zyjzyj2000, jhack, linux-rdma; +Cc: Ian Ziemba

On 2022/10/29 11:09, Bob Pearson wrote:
> Make the internal interface to the task operations pluggable and
> add a new 'inline' type.
> 
> Signed-off-by: Ian Ziemba <ian.ziemba@hpe.com>
> Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
> ---
>   drivers/infiniband/sw/rxe/rxe_qp.c   |   8 +-
>   drivers/infiniband/sw/rxe/rxe_task.c | 160 ++++++++++++++++++++++-----
>   drivers/infiniband/sw/rxe/rxe_task.h |  44 +++++---
>   3 files changed, 165 insertions(+), 47 deletions(-)
> 
> diff --git a/drivers/infiniband/sw/rxe/rxe_qp.c b/drivers/infiniband/sw/rxe/rxe_qp.c
> index 3f6d62a80bea..b5e108794aa1 100644
> --- a/drivers/infiniband/sw/rxe/rxe_qp.c
> +++ b/drivers/infiniband/sw/rxe/rxe_qp.c
> @@ -238,8 +238,10 @@ static int rxe_qp_init_req(struct rxe_dev *rxe, struct rxe_qp *qp,
>   
>   	skb_queue_head_init(&qp->req_pkts);
>   
> -	rxe_init_task(&qp->req.task, qp, rxe_requester);
> -	rxe_init_task(&qp->comp.task, qp, rxe_completer);
> +	rxe_init_task(&qp->req.task, qp, rxe_requester, RXE_TASK_TYPE_TASKLET);
> +	rxe_init_task(&qp->comp.task, qp, rxe_completer,
> +			(qp_type(qp) == IB_QPT_RC) ? RXE_TASK_TYPE_TASKLET :
> +						     RXE_TASK_TYPE_INLINE);
>   
>   	qp->qp_timeout_jiffies = 0; /* Can't be set for UD/UC in modify_qp */
>   	if (init->qp_type == IB_QPT_RC) {
> @@ -286,7 +288,7 @@ static int rxe_qp_init_resp(struct rxe_dev *rxe, struct rxe_qp *qp,
>   
>   	skb_queue_head_init(&qp->resp_pkts);
>   
> -	rxe_init_task(&qp->resp.task, qp, rxe_responder);
> +	rxe_init_task(&qp->resp.task, qp, rxe_responder, RXE_TASK_TYPE_TASKLET);
>   
>   	qp->resp.opcode		= OPCODE_NONE;
>   	qp->resp.msn		= 0;
> diff --git a/drivers/infiniband/sw/rxe/rxe_task.c b/drivers/infiniband/sw/rxe/rxe_task.c
> index 0208d833a41b..8dfbfa164eff 100644
> --- a/drivers/infiniband/sw/rxe/rxe_task.c
> +++ b/drivers/infiniband/sw/rxe/rxe_task.c
> @@ -24,12 +24,11 @@ int __rxe_do_task(struct rxe_task *task)
>    * a second caller finds the task already running
>    * but looks just after the last call to func
>    */
> -static void do_task(struct tasklet_struct *t)
> +static void do_task(struct rxe_task *task)
>   {
> +	unsigned int iterations = RXE_MAX_ITERATIONS;
>   	int cont;
>   	int ret;
> -	struct rxe_task *task = from_tasklet(task, t, tasklet);
> -	unsigned int iterations = RXE_MAX_ITERATIONS;
>   
>   	spin_lock_bh(&task->lock);
>   	switch (task->state) {
> @@ -90,28 +89,21 @@ static void do_task(struct tasklet_struct *t)
>   	task->ret = ret;
>   }
>   
> -int rxe_init_task(struct rxe_task *task, void *arg, int (*func)(void *))
> +static void disable_task(struct rxe_task *task)
>   {
> -	task->arg	= arg;
> -	task->func	= func;
> -	task->destroyed	= false;
> -
> -	tasklet_setup(&task->tasklet, do_task);
> -
> -	task->state = TASK_STATE_START;
> -	spin_lock_init(&task->lock);
> +	/* todo */
> +}
>   
> -	return 0;
> +static void enable_task(struct rxe_task *task)
> +{
> +	/* todo */
>   }
>   
> -void rxe_cleanup_task(struct rxe_task *task)
> +/* busy wait until any previous tasks are done */
> +static void cleanup_task(struct rxe_task *task)
>   {
>   	bool idle;
>   
> -	/*
> -	 * Mark the task, then wait for it to finish. It might be
> -	 * running in a non-tasklet (direct call) context.
> -	 */
>   	task->destroyed = true;
>   
>   	do {
> @@ -119,32 +111,144 @@ void rxe_cleanup_task(struct rxe_task *task)
>   		idle = (task->state == TASK_STATE_START);
>   		spin_unlock_bh(&task->lock);
>   	} while (!idle);
> +}
>   
> -	tasklet_kill(&task->tasklet);
> +/* silently treat schedule as inline for inline tasks */
> +static void inline_sched(struct rxe_task *task)
> +{
> +	do_task(task);
>   }
>   
> -void rxe_run_task(struct rxe_task *task)
> +static void inline_run(struct rxe_task *task)
>   {
> -	if (task->destroyed)
> -		return;
> +	do_task(task);
> +}
>   
> -	do_task(&task->tasklet);
> +static void inline_disable(struct rxe_task *task)
> +{
> +	disable_task(task);
>   }
>   
> -void rxe_sched_task(struct rxe_task *task)
> +static void inline_enable(struct rxe_task *task)
>   {
> -	if (task->destroyed)
> -		return;
> +	enable_task(task);
> +}
> +
> +static void inline_cleanup(struct rxe_task *task)
> +{
> +	cleanup_task(task);
> +}
> +
> +static const struct rxe_task_ops inline_ops = {
> +	.sched = inline_sched,
> +	.run = inline_run,
> +	.enable = inline_enable,
> +	.disable = inline_disable,
> +	.cleanup = inline_cleanup,
> +};
>   
> +static void inline_init(struct rxe_task *task)
> +{
> +	task->ops = &inline_ops;
> +}
> +
> +/* use tsklet_xxx to avoid name collisions with tasklet_xxx */
> +static void tsklet_sched(struct rxe_task *task)
> +{
>   	tasklet_schedule(&task->tasklet);
>   }
>   
> -void rxe_disable_task(struct rxe_task *task)
> +static void tsklet_do_task(struct tasklet_struct *tasklet)
>   {
> +	struct rxe_task *task = container_of(tasklet, typeof(*task), tasklet);
> +
> +	do_task(task);
> +}
> +
> +static void tsklet_run(struct rxe_task *task)
> +{
> +	do_task(task);
> +}
> +
> +static void tsklet_disable(struct rxe_task *task)
> +{
> +	disable_task(task);
>   	tasklet_disable(&task->tasklet);
>   }
>   
> -void rxe_enable_task(struct rxe_task *task)
> +static void tsklet_enable(struct rxe_task *task)
>   {
>   	tasklet_enable(&task->tasklet);
> +	enable_task(task);
> +}
> +
> +static void tsklet_cleanup(struct rxe_task *task)
> +{
> +	cleanup_task(task);
> +	tasklet_kill(&task->tasklet);
> +}
> +
> +static const struct rxe_task_ops tsklet_ops = {
> +	.sched = tsklet_sched,
> +	.run = tsklet_run,
> +	.enable = tsklet_enable,
> +	.disable = tsklet_disable,
> +	.cleanup = tsklet_cleanup,
> +};
> +
> +static void tsklet_init(struct rxe_task *task)
> +{
> +	tasklet_setup(&task->tasklet, tsklet_do_task);
> +	task->ops = &tsklet_ops;
> +}
> +
> +int rxe_init_task(struct rxe_task *task, void *arg, int (*func)(void *),
> +		  enum rxe_task_type type)
> +{
> +	task->arg	= arg;
> +	task->func	= func;
> +	task->destroyed	= false;
> +	task->type	= type;
> +	task->state	= TASK_STATE_START;
> +
> +	spin_lock_init(&task->lock);

About this spin_lock_init, I remembered this commit:
news://nntp.lore.kernel.org:119/20220710043709.707649-1-yanjun.zhu@linux.dev

Can this spin_lock_init be moved into the function rxe_qp_init_misc()?
That would avoid the "initialize spin locks before use" error.
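
If it helps, here is a minimal sketch of what that could look like (purely
illustrative, not from the patch; the helper name is hypothetical, and the
task/lock field names are taken from the quoted diff):

/* hypothetical helper, called early in rxe_qp_init_misc(), so the task
 * locks are valid before anything can schedule a task
 */
static void rxe_qp_init_task_locks(struct rxe_qp *qp)
{
	spin_lock_init(&qp->req.task.lock);
	spin_lock_init(&qp->comp.task.lock);
	spin_lock_init(&qp->resp.task.lock);
}

rxe_init_task() would then no longer need to call spin_lock_init() itself.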

Zhu Yanjun

> +
> +	switch (type) {
> +	case RXE_TASK_TYPE_INLINE:
> +		inline_init(task);
> +		break;
> +	case RXE_TASK_TYPE_TASKLET:
> +		tsklet_init(task);
> +		break;
> +	default:
> +		pr_debug("%s: invalid task type = %d\n", __func__, type);
> +		return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +void rxe_sched_task(struct rxe_task *task)
> +{
> +	task->ops->sched(task);
> +}
> +
> +void rxe_run_task(struct rxe_task *task)
> +{
> +	task->ops->run(task);
> +}
> +
> +void rxe_enable_task(struct rxe_task *task)
> +{
> +	task->ops->enable(task);
> +}
> +
> +void rxe_disable_task(struct rxe_task *task)
> +{
> +	task->ops->disable(task);
> +}
> +
> +void rxe_cleanup_task(struct rxe_task *task)
> +{
> +	task->ops->cleanup(task);
>   }
> diff --git a/drivers/infiniband/sw/rxe/rxe_task.h b/drivers/infiniband/sw/rxe/rxe_task.h
> index 7b88129702ac..31963129ff7a 100644
> --- a/drivers/infiniband/sw/rxe/rxe_task.h
> +++ b/drivers/infiniband/sw/rxe/rxe_task.h
> @@ -7,6 +7,21 @@
>   #ifndef RXE_TASK_H
>   #define RXE_TASK_H
>   
> +struct rxe_task;
> +
> +struct rxe_task_ops {
> +	void (*sched)(struct rxe_task *task);
> +	void (*run)(struct rxe_task *task);
> +	void (*disable)(struct rxe_task *task);
> +	void (*enable)(struct rxe_task *task);
> +	void (*cleanup)(struct rxe_task *task);
> +};
> +
> +enum rxe_task_type {
> +	RXE_TASK_TYPE_INLINE	= 0,
> +	RXE_TASK_TYPE_TASKLET	= 1,
> +};
> +
>   enum {
>   	TASK_STATE_START	= 0,
>   	TASK_STATE_BUSY		= 1,
> @@ -19,24 +34,19 @@ enum {
>    * called again.
>    */
>   struct rxe_task {
> -	struct tasklet_struct	tasklet;
> -	int			state;
> -	spinlock_t		lock;
> -	void			*arg;
> -	int			(*func)(void *arg);
> -	int			ret;
> -	bool			destroyed;
> +	struct tasklet_struct		tasklet;
> +	int				state;
> +	spinlock_t			lock;
> +	void				*arg;
> +	int				(*func)(void *arg);
> +	int				ret;
> +	bool				destroyed;
> +	const struct rxe_task_ops	*ops;
> +	enum rxe_task_type		type;
>   };
>   
> -/*
> - * init rxe_task structure
> - *	arg  => parameter to pass to fcn
> - *	func => function to call until it returns != 0
> - */
> -int rxe_init_task(struct rxe_task *task, void *arg, int (*func)(void *));
> -
> -/* cleanup task */
> -void rxe_cleanup_task(struct rxe_task *task);
> +int rxe_init_task(struct rxe_task *task, void *arg, int (*func)(void *),
> +		  enum rxe_task_type type);
>   
>   /*
>    * raw call to func in loop without any checking
> @@ -54,4 +64,6 @@ void rxe_disable_task(struct rxe_task *task);
>   /* allow task to run */
>   void rxe_enable_task(struct rxe_task *task);
>   
> +void rxe_cleanup_task(struct rxe_task *task);
> +
>   #endif /* RXE_TASK_H */



* Re: [PATCH for-next v3 03/13] RDMA/rxe: Simplify reset state handling in rxe_resp.c
  2022-10-29  3:10 ` [PATCH for-next v3 03/13] RDMA/rxe: Simplify reset state handling in rxe_resp.c Bob Pearson
@ 2022-11-11  3:04   ` Yanjun Zhu
  0 siblings, 0 replies; 23+ messages in thread
From: Yanjun Zhu @ 2022-11-11  3:04 UTC (permalink / raw)
  To: Bob Pearson, jgg, leon, zyjzyj2000, jhack, linux-rdma; +Cc: Ian Ziemba

On 2022/10/29 11:10, Bob Pearson wrote:
> Make rxe_responder() more like rxe_completer() and take qp reset
> handling out of the state machine.

From the RDMA spec, qp reset is one of the qp states. If rxe moves qp reset
handling out of the state machine while other devices still handle qp reset
in their state machines, will this make any difference to connections
between rxe and other ib devices, such as irdma and mlx devices?

rxe should still be able to make basic connections with other ib devices.

Zhu Yanjun

> 
> Signed-off-by: Ian Ziemba <ian.ziemba@hpe.com>
> Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
> ---
>   drivers/infiniband/sw/rxe/rxe_resp.c | 12 +++---------
>   1 file changed, 3 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
> index c32bc12cc82f..c4f365449aa5 100644
> --- a/drivers/infiniband/sw/rxe/rxe_resp.c
> +++ b/drivers/infiniband/sw/rxe/rxe_resp.c
> @@ -40,7 +40,6 @@ enum resp_states {
>   	RESPST_ERR_LENGTH,
>   	RESPST_ERR_CQ_OVERFLOW,
>   	RESPST_ERROR,
> -	RESPST_RESET,
>   	RESPST_DONE,
>   	RESPST_EXIT,
>   };
> @@ -75,7 +74,6 @@ static char *resp_state_name[] = {
>   	[RESPST_ERR_LENGTH]			= "ERR_LENGTH",
>   	[RESPST_ERR_CQ_OVERFLOW]		= "ERR_CQ_OVERFLOW",
>   	[RESPST_ERROR]				= "ERROR",
> -	[RESPST_RESET]				= "RESET",
>   	[RESPST_DONE]				= "DONE",
>   	[RESPST_EXIT]				= "EXIT",
>   };
> @@ -1281,8 +1279,9 @@ int rxe_responder(void *arg)
>   
>   	switch (qp->resp.state) {
>   	case QP_STATE_RESET:
> -		state = RESPST_RESET;
> -		break;
> +		rxe_drain_req_pkts(qp, false);
> +		qp->resp.wqe = NULL;
> +		goto exit;
>   
>   	default:
>   		state = RESPST_GET_REQ;
> @@ -1441,11 +1440,6 @@ int rxe_responder(void *arg)
>   
>   			goto exit;
>   
> -		case RESPST_RESET:
> -			rxe_drain_req_pkts(qp, false);
> -			qp->resp.wqe = NULL;
> -			goto exit;
> -
>   		case RESPST_ERROR:
>   			qp->resp.goto_error = 0;
>   			pr_debug("qp#%d moved to error state\n", qp_num(qp));



* RE: [PATCH for-next v3 00/13] Implement work queues for rdma_rxe
  2022-10-29  3:09 [PATCH for-next v3 00/13] Implement work queues for rdma_rxe Bob Pearson
                   ` (13 preceding siblings ...)
  2022-11-02 10:17 ` [PATCH for-next v3 00/13] Implement work queues for rdma_rxe Daisuke Matsuda (Fujitsu)
@ 2022-11-18  5:02 ` Daisuke Matsuda (Fujitsu)
  14 siblings, 0 replies; 23+ messages in thread
From: Daisuke Matsuda (Fujitsu) @ 2022-11-18  5:02 UTC (permalink / raw)
  To: 'Bob Pearson', jgg, leon, zyjzyj2000, jhack, linux-rdma

On Sat, Oct 29, 2022 12:10 PM Bob Pearson wrote:

<...>

> Bob Pearson (13):
>   RDMA/rxe: Make task interface pluggable
>   RDMA/rxe: Split rxe_drain_resp_pkts()
>   RDMA/rxe: Simplify reset state handling in rxe_resp.c
>   RDMA/rxe: Handle qp error in rxe_resp.c
>   RDMA/rxe: Cleanup comp tasks in rxe_qp.c
>   RDMA/rxe: Remove __rxe_do_task()
>   RDMA/rxe: Make tasks schedule each other
>   RDMA/rxe: Implement disable/enable_task()
>   RDMA/rxe: Replace TASK_STATE_START by TASK_STATE_IDLE
>   RDMA/rxe: Replace task->destroyed by task state INVALID.
>   RDMA/rxe: Add workqueue support for tasks
>   RDMA/rxe: Make WORKQUEUE default for RC tasks
>   RDMA/rxe: Remove tasklets from rxe_task.c

Hello Bob,

I have found a soft lockup issue reproducible with the rdma-core testcases.
It does not happen on the latest for-next tree, but it does with this patch
series applied, so we need to fix the issue before the series is merged.

I did the test on a VM with 8 CPUs. I fetched the latest rdma-core, built it,
and executed the following command inside the 'rdma-core' directory:
# while true; do ./build/bin/run_tests.py -v --dev rxe_ens6 --gid 1; sleep 2; done
(Please specify your 'dev' and 'gid' appropriately.)

Within 10 minutes, my console froze and became unresponsive, showing the
test progress below:
=====
test_create_ah (tests.test_addr.AHTest)
Test ibv_create_ah. ... ok
test_create_ah_roce (tests.test_addr.AHTest)
Verify that AH can't be created without GRH in RoCE ... ok
test_destroy_ah (tests.test_addr.AHTest)
Test ibv_destroy_ah. ... ok
test_atomic_cmp_and_swap (tests.test_atomic.AtomicTest) ... ok
test_atomic_fetch_and_add (tests.test_atomic.AtomicTest) ... ok
test_atomic_invalid_lkey (tests.test_atomic.AtomicTest) ...
=====
Note that this does not always happen; you may have to wait several minutes.

Here is the backtrace:
=====
[ 1212.135650] watchdog: BUG: soft lockup - CPU#3 stuck for 1017s! [python3:3428]
[ 1212.138144] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs rpcrdma rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_cm rdma_rxe ib_uverbs ip6_udp_tunnel udp_tunnel ib_core rfkill sunrpc intel_rapl_msr intel_rapl_common kvm_intel kvm irqbypass joydev nd_pmem virtio_balloon dax_pmem nd_btt i2c_piix4 pcspkr drm xfs libcrc32c sd_mod t10_pi sr_mod crc64_rocksoft_generic cdrom crc64_rocksoft crc64 sg nd_e820 ata_generic libnvdimm crct10dif_pclmul crc32_pclmul crc32c_intel ata_piix virtio_net libata ghash_clmulni_intel e1000 net_failover sha512_ssse3 failover virtio_console serio_raw dm_mirror dm_region_hash dm_log dm_mod fuse
[ 1212.147152] CPU: 3 PID: 3428 Comm: python3 Kdump: loaded Tainted: G             L     6.1.0-rc1+ #29
[ 1212.148425] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
[ 1212.149754] RIP: 0010:__local_bh_enable_ip+0x26/0x70
[ 1212.150464] Code: 00 00 66 90 0f 1f 44 00 00 65 8b 05 c4 7e d2 4e a9 00 00 0f 00 75 31 83 ee 01 f7 de 65 01 35 b1 7e d2 4e 65 8b 05 aa 7e d2 4e <a9> 00 ff ff 00 74 1b 65 ff 0d 9c 7e d2 4e 65 8b 05 95 7e d2 4e 85
[ 1212.153081] RSP: 0018:ffffaf0b8054baf8 EFLAGS: 00000203
[ 1212.153822] RAX: 0000000000000001 RBX: ffff8d2189171450 RCX: 00000000ffffffff
[ 1212.154823] RDX: 0000000000000001 RSI: 00000000fffffe00 RDI: ffffffffc0b51baa
[ 1212.155826] RBP: ffff8d2189171474 R08: 0000011a38ad4004 R09: 0000000000000101
[ 1212.156862] R10: ffffffffb2c06100 R11: 0000000000000000 R12: 0000000000000000
[ 1212.157860] R13: ffff8d218df5eda0 R14: ffff8d2181058328 R15: 0000000000000000
[ 1212.158858] FS:  00007f8cec55f740(0000) GS:ffff8d23b5cc0000(0000) knlGS:0000000000000000
[ 1212.159989] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1212.160837] CR2: 00007f8ceaa15024 CR3: 0000000108c2c002 CR4: 0000000000060ee0
[ 1212.161841] Call Trace:
[ 1212.162205]  <TASK>
[ 1212.162522]  work_cleanup+0x3a/0x40 [rdma_rxe]
[ 1212.163172]  rxe_qp_do_cleanup+0x54/0x1e0 [rdma_rxe]
[ 1212.163888]  execute_in_process_context+0x23/0x70
[ 1212.164575]  __rxe_cleanup+0xc6/0x170 [rdma_rxe]
[ 1212.165245]  rxe_destroy_qp+0x28/0x40 [rdma_rxe]
[ 1212.165909]  ib_destroy_qp_user+0x90/0x1b0 [ib_core]
[ 1212.166646]  uverbs_free_qp+0x35/0x90 [ib_uverbs]
[ 1212.167333]  destroy_hw_idr_uobject+0x1e/0x50 [ib_uverbs]
[ 1212.168103]  uverbs_destroy_uobject+0x37/0x1c0 [ib_uverbs]
[ 1212.168899]  uobj_destroy+0x3c/0x80 [ib_uverbs]
[ 1212.169532]  ib_uverbs_run_method+0x203/0x320 [ib_uverbs]
[ 1212.170412]  ? uverbs_free_qp+0x90/0x90 [ib_uverbs]
[ 1212.171151]  ib_uverbs_cmd_verbs+0x172/0x220 [ib_uverbs]
[ 1212.171912]  ? free_unref_page_commit+0x7e/0x170
[ 1212.172583]  ? xa_destroy+0x82/0x110
[ 1212.173104]  ? kvfree_call_rcu+0x27d/0x310
[ 1212.173692]  ? ioctl_has_perm.constprop.0.isra.0+0xbd/0x120
[ 1212.174481]  ib_uverbs_ioctl+0xa4/0x110 [ib_uverbs]
[ 1212.175182]  __x64_sys_ioctl+0x8a/0xc0
[ 1212.175725]  do_syscall_64+0x3b/0x90
[ 1212.176243]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 1212.176973] RIP: 0033:0x7f8cebe3ec6b
[ 1212.177489] Code: 73 01 c3 48 8b 0d b5 b1 1b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 85 b1 1b 00 f7 d8 64 89 01 48
[ 1212.180072] RSP: 002b:00007ffee64927a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 1212.181150] RAX: ffffffffffffffda RBX: 00007ffee64928d8 RCX: 00007f8cebe3ec6b
[ 1212.182131] RDX: 00007ffee64928c0 RSI: 00000000c0181b01 RDI: 0000000000000003
[ 1212.183099] RBP: 00007ffee64928a0 R08: 000056424f8c0c70 R09: 000000000002ae30
[ 1212.184064] R10: 00007f8cead75e10 R11: 0000000000000246 R12: 00007ffee649287c
[ 1212.185061] R13: 0000000000000022 R14: 000056424f8c0560 R15: 00007f8cebb903d0
[ 1212.186031]  </TASK>

Message from syslogd@c9ibremote at Nov 17 17:03:12 ...
 kernel:watchdog: BUG: soft lockup - CPU#3 stuck for 1017s! [python3:3428]
=====
It seems that ibv_destroy_qp(3) was issued from userspace, right?
I am not very familiar with uverbs ;(

The process got stuck at the do-while loop below:
=====
rxe_task.c
---
/* busy wait until any previous tasks are done */
static void cleanup_task(struct rxe_task *task)
{
        bool busy;

        do {
                spin_lock_bh(&task->lock);
                busy = (task->state == TASK_STATE_BUSY ||
                        task->state == TASK_STATE_ARMED);
                if (!busy)
                        task->state = TASK_STATE_INVALID;
                spin_unlock_bh(&task->lock);
        } while (busy);
}
=====
Typically task->state is 0 (i.e. "TASK_STATE_IDLE") when we reach here,
but in the infinite loop, task->state was constantly 1 (i.e. "TASK_STATE_BUSY").

IMO, the bottom halves completed their work but left task->state at
"TASK_STATE_BUSY", so the ibv_destroy_qp(3) issued from userspace afterwards
got stuck, but I am not sure how this counter-intuitive state transition
occurs.
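
In case it helps narrow this down, here is a purely hypothetical debug
variant of cleanup_task() (not part of the series; all names are taken from
the code quoted above, and the 10-second warning interval is arbitrary):
=====
/* same logic as cleanup_task(), but warn periodically if the busy wait
 * is not making progress, and report the state it is stuck in
 */
static void cleanup_task(struct rxe_task *task)
{
        unsigned long warn_at = jiffies + 10 * HZ;
        int state;
        bool busy;

        do {
                spin_lock_bh(&task->lock);
                state = task->state;
                busy = (state == TASK_STATE_BUSY ||
                        state == TASK_STATE_ARMED);
                if (!busy)
                        task->state = TASK_STATE_INVALID;
                spin_unlock_bh(&task->lock);

                if (busy && time_after(jiffies, warn_at)) {
                        pr_warn("%s: still busy, task->state = %d\n",
                                __func__, state);
                        warn_at = jiffies + 10 * HZ;
                }
        } while (busy);
}
=====
At least this would confirm whether the task stays in TASK_STATE_BUSY forever
or keeps bouncing between states without ever going idle.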

Do you have any idea why this happens and how to fix it?
I cannot spend enough time inspecting this issue further right now,
but I will update you if I find anything helpful.

Thanks,
Daisuke

> 
>  drivers/infiniband/sw/rxe/rxe.c      |   9 +-
>  drivers/infiniband/sw/rxe/rxe_comp.c |  24 ++-
>  drivers/infiniband/sw/rxe/rxe_qp.c   |  80 ++++-----
>  drivers/infiniband/sw/rxe/rxe_req.c  |   4 +-
>  drivers/infiniband/sw/rxe/rxe_resp.c |  70 +++++---
>  drivers/infiniband/sw/rxe/rxe_task.c | 258 +++++++++++++++++++--------
>  drivers/infiniband/sw/rxe/rxe_task.h |  56 +++---
>  7 files changed, 329 insertions(+), 172 deletions(-)
> 
> 
> base-commit: 692373d186205dfb1b56f35f22702412d94d9420
> --
> 2.34.1



end of thread

Thread overview: 23+ messages
2022-10-29  3:09 [PATCH for-next v3 00/13] Implement work queues for rdma_rxe Bob Pearson
2022-10-29  3:09 ` [PATCH for-next v3 01/13] RDMA/rxe: Make task interface pluggable Bob Pearson
2022-11-11  2:28   ` Yanjun Zhu
2022-10-29  3:09 ` [PATCH for-next v3 02/13] RDMA/rxe: Split rxe_drain_resp_pkts() Bob Pearson
2022-10-29  3:10 ` [PATCH for-next v3 03/13] RDMA/rxe: Simplify reset state handling in rxe_resp.c Bob Pearson
2022-11-11  3:04   ` Yanjun Zhu
2022-10-29  3:10 ` [PATCH for-next v3 04/13] RDMA/rxe: Handle qp error " Bob Pearson
2022-10-29  3:10 ` [PATCH v3 05/13] RDMA/rxe: Cleanup comp tasks in rxe_qp.c Bob Pearson
2022-10-29  3:10 ` [PATCH for-next v3 06/13] RDMA/rxe: Remove __rxe_do_task() Bob Pearson
2022-10-29  3:10 ` [PATCH for-next v3 07/13] RDMA/rxe: Make tasks schedule each other Bob Pearson
2022-10-29  3:10 ` [PATCH for-next v3 08/13] RDMA/rxe: Implement disable/enable_task() Bob Pearson
2022-10-29  3:10 ` [PATCH for-next v3 09/13] RDMA/rxe: Replace TASK_STATE_START by TASK_STATE_IDLE Bob Pearson
2022-10-29  3:10 ` [PATCH for-next v3 10/13] RDMA/rxe: Replace task->destroyed by task state INVALID Bob Pearson
2022-10-29  3:10 ` [PATCH for-next v3 11/13] RDMA/rxe: Add workqueue support for tasks Bob Pearson
2022-10-29  3:10 ` [PATCH for-next v3 12/13] RDMA/rxe: Make WORKQUEUE default for RC tasks Bob Pearson
2022-10-29  3:10 ` [PATCH for-next v3 13/13] RDMA/rxe: Remove tasklets from rxe_task.c Bob Pearson
2022-11-02 10:17 ` [PATCH for-next v3 00/13] Implement work queues for rdma_rxe Daisuke Matsuda (Fujitsu)
2022-11-02 11:20   ` Bob Pearson
2022-11-04  4:59     ` Daisuke Matsuda (Fujitsu)
2022-11-05 21:15       ` Bob Pearson
2022-11-07  8:21         ` Daisuke Matsuda (Fujitsu)
2022-11-07 16:06           ` Bob Pearson
2022-11-18  5:02 ` Daisuke Matsuda (Fujitsu)
