* [RFC PATCH 0/7] RDMA/rxe: On-Demand Paging on SoftRoCE
@ 2022-09-07  2:42 Daisuke Matsuda
  2022-09-07  2:42 ` [RFC PATCH 1/7] IB/mlx5: Change ib_umem_odp_map_dma_single_page() to retain umem_mutex Daisuke Matsuda
                   ` (8 more replies)
  0 siblings, 9 replies; 25+ messages in thread
From: Daisuke Matsuda @ 2022-09-07  2:42 UTC (permalink / raw)
  To: linux-rdma, leonro, jgg, zyjzyj2000
  Cc: nvdimm, linux-kernel, rpearsonhpe, yangx.jy, lizhijian, y-goto,
	Daisuke Matsuda

Hi everyone,

This patch series implements the On-Demand Paging feature on the SoftRoCE
(rxe) driver, which has so far been available only in the mlx5 driver[1].

[Overview]
When applications register a memory region (MR), RDMA drivers normally pin
pages in the MR so that physical addresses are never changed during RDMA
communication. This requires the MR to fit in physical memory and
inevitably leads to memory pressure. On the other hand, On-Demand Paging
(ODP) allows applications to register MRs without pinning pages. The pages
are paged in when the driver requires them and paged out when the OS
reclaims them. As a result, it is possible to register a large MR that does
not fit in physical memory without consuming much physical memory.
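
Seen from the application, an ODP-enabled MR is requested simply by adding
IBV_ACCESS_ON_DEMAND to the access flags of ibv_reg_mr(3). A minimal
user-space sketch (the helper below is illustrative only; error handling is
trimmed and the PD/buffer are assumed to be set up elsewhere):

#include <infiniband/verbs.h>

/* pd: an allocated ibv_pd; buf/len: the region to expose over RDMA */
static struct ibv_mr *reg_odp_mr(struct ibv_pd *pd, void *buf, size_t len)
{
	/* IBV_ACCESS_ON_DEMAND asks the driver not to pin the pages;
	 * they are faulted in on first access instead.
	 */
	return ibv_reg_mr(pd, buf, len,
			  IBV_ACCESS_LOCAL_WRITE |
			  IBV_ACCESS_REMOTE_READ |
			  IBV_ACCESS_REMOTE_WRITE |
			  IBV_ACCESS_ON_DEMAND);
}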

[Why to add this feature?]
We, Fujitsu, have contributed to RDMA with a view to using it with
persistent memory. Persistent memory can host a filesystem that allows
applications to read/write files directly without involving the page cache.
This is called FS-DAX (filesystem direct access) mode. The problem is that
data on a DAX-enabled filesystem cannot be duplicated with software RAID or
other hardware methods. Data replication with RDMA, which features
high-speed connections, is the best solution to the problem.

However, there is a known issue that hinders using RDMA with FS-DAX. When
RDMA operations to a file and updates of the file metadata are processed
concurrently on the same node, illegal memory accesses can be executed,
disregarding the updated metadata. This is because RDMA operations do not
go through the page cache but access data directly. There was an effort[2]
to solve this problem, but it was rejected in the end. Though there is no
general solution available, it is possible to work around the problem using
the ODP feature, which has been available only in mlx5. ODP enables drivers
to update metadata before processing RDMA operations.

We have been enhancing rxe to expedite the use of persistent memory. Our
contributions to rxe include RDMA Atomic write[3] and RDMA Flush[4]. Once
they are merged into rxe along with ODP, an environment will be ready for
developers to create and test software for RDMA with FS-DAX. There is a
library (librpma)[5] being developed for this purpose. This environment can
be used by anybody with an ordinary computer and a normal NIC, without any
special hardware, though it is inferior to hardware implementations in
terms of performance.

[Design considerations]
ODP has been available only in mlx5, but functions and data structures
that can be used commonly are provided in ib_uverbs (infiniband/core). The
interface is heavily dependent on the HMM infrastructure[6], and this
patchset uses it as much as possible. While mlx5 has both the Explicit and
Implicit ODP features along with the prefetch feature, this patchset
implements the Explicit ODP feature only.

As an important change, it is necessary to convert the triple tasklets
(requester, responder and completer) to workqueues because they must be
able to sleep in order to trigger page faults before accessing MRs. I did a
test shown in the 2nd patch and found that the change makes the latency
higher while improving the bandwidth. Though it may be possible to create a
new independent workqueue just for page fault execution, that is not a very
sensible solution since the tasklets would have to busy-wait for its
completion in that case.

If the responder and completer sleep, it becomes more likely that packets
are dropped because of overflow in the receive queues. There are multiple
queues involved, but, as SoftRoCE uses UDP, the most important ones are the
UDP buffers. Their size can be configured with the net.core.rmem_default
and net.core.rmem_max sysctl parameters. Users should raise these values in
case of packet drops, but page fault handling is typically not long enough
to cause the problem.

[How does ODP work?]
"struct ib_umem_odp" is used to manage pages. It is created for each
ODP-enabled MR on its registration. This struct holds a pair of arrays
(dma_list/pfn_list) that serve as a driver page table. DMA addresses and
PFNs are stored in the driver page table. They are updated on page-in and
page-out, both of which use the common interface in ib_uverbs.
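
For reference, the relevant part of struct ib_umem_odp (abridged from
include/rdma/ib_umem_odp.h; the exact layout varies across kernel versions)
looks like this:

struct ib_umem_odp {
	struct ib_umem umem;
	struct mmu_interval_notifier notifier;

	/* driver page table: one entry per page of the MR */
	unsigned long *pfn_list;
	dma_addr_t *dma_list;

	/* serializes mapping/unmapping against invalidation */
	struct mutex umem_mutex;

	/* ... */
};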

Page-in can occur when the requester, responder or completer accesses an MR
in order to process RDMA operations. If they find that the pages being
accessed are not present in physical memory, or that the requisite
permissions are not set on the pages, they trigger a page fault to make the
pages present with proper permissions and, at the same time, update the
driver page table. After confirming the presence of the pages, they execute
memory access such as read, write or atomic operations.
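
A minimal sketch of that page-in step, using only the common ib_umem_odp
interface as changed in patch 1 (the helper name and surrounding driver
code are illustrative, not part of this series):

#include <rdma/ib_umem_odp.h>

/* Fault in [iova, iova + length) of an ODP MR and keep the mapping
 * stable while the caller accesses it.
 */
static int odp_map_and_lock_range(struct ib_umem_odp *umem_odp,
				  u64 iova, u64 length, bool for_write)
{
	u64 access_mask = ODP_READ_ALLOWED_BIT;
	int np;

	if (for_write)
		access_mask |= ODP_WRITE_ALLOWED_BIT;

	/* Triggers a page fault if needed and refreshes dma_list/pfn_list.
	 * With patch 1 applied, umem_mutex is held on return even on error,
	 * so the caller always unlocks it when done with the pages.
	 */
	np = ib_umem_odp_map_dma_and_lock(umem_odp, iova, length,
					  access_mask, true);
	return np < 0 ? np : 0;
}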

Page-out is triggered by page reclaim or filesystem events (e.g. a metadata
update of a file that is being used as an MR). When creating an ODP-enabled
MR, the driver registers an MMU notifier callback. When the kernel issues a
page invalidation notification, the callback is invoked to unmap DMA
addresses and update the driver page table. After that, the kernel releases
the pages.

[Supported operations]
All operations are supported on RC connections. Atomic write[3] and
Flush[4] operations, which are still under discussion, will also be
supported after their patches are merged. On UD connections, Send, Recv and
SRQ-Recv are supported. Because other operations are not supported on mlx5,
I follow that decision for now.

[How to test ODP?]
There are only a few resources available for testing. The pyverbs testcases
in rdma-core and perftest[7] are the recommended ones. Note that you may
have to build perftest from upstream since older versions do not handle ODP
capabilities correctly.
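
When in doubt, the capabilities advertised by the device can be checked
from user space with ibv_query_device_ex(3). A minimal sketch (the helper
is illustrative only; context setup omitted):

#include <infiniband/verbs.h>

/* Returns non-zero if the device reports ODP support for RC Send. */
static int rc_send_odp_supported(struct ibv_context *ctx)
{
	struct ibv_device_attr_ex attr;

	if (ibv_query_device_ex(ctx, NULL, &attr))
		return 0;

	if (!(attr.odp_caps.general_caps & IBV_ODP_SUPPORT))
		return 0;

	return !!(attr.odp_caps.per_transport_caps.rc_odp_caps &
		  IBV_ODP_SUPPORT_SEND);
}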

[Future work]
My next work will be the prefetch feature. It allows applications to
trigger page faults in advance using ibv_advise_mr(3) to optimize
performance. Some existing software, such as librpma, uses this feature.
Additionally, I think we can also add the Implicit ODP feature in the
future.
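
For completeness, a hedged sketch of how an application would use that
existing verb to prefetch a range of an ODP MR (the helper is illustrative;
error handling trimmed):

#include <stdint.h>
#include <infiniband/verbs.h>

/* Ask the driver to prefetch pages of an ODP MR before I/O starts. */
static int prefetch_odp_range(struct ibv_pd *pd, struct ibv_mr *mr,
			      void *addr, uint32_t length)
{
	struct ibv_sge sge = {
		.addr   = (uintptr_t)addr,
		.length = length,
		.lkey   = mr->lkey,
	};

	return ibv_advise_mr(pd, IBV_ADVISE_MR_ADVICE_PREFETCH_WRITE,
			     IBV_ADVISE_MR_FLAG_FLUSH, &sge, 1);
}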

[1] [RFC 00/20] On demand paging
https://www.spinics.net/lists/linux-rdma/msg18906.html

[2] [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
https://lore.kernel.org/nvdimm/20190809225833.6657-1-ira.weiny@intel.com/

[3] [RESEND PATCH v5 0/2] RDMA/rxe: Add RDMA Atomic Write operation
https://www.spinics.net/lists/linux-rdma/msg111428.html

[4] [PATCH v4 0/6] RDMA/rxe: Add RDMA FLUSH operation
https://www.spinics.net/lists/kernel/msg4462045.html

[5] librpma: Remote Persistent Memory Access Library
https://github.com/pmem/rpma

[6] Heterogeneous Memory Management (HMM)
https://www.kernel.org/doc/html/latest/mm/hmm.html

[7] linux-rdma/perftest: Infiniband Verbs Performance Tests
https://github.com/linux-rdma/perftest

Daisuke Matsuda (7):
  IB/mlx5: Change ib_umem_odp_map_dma_single_page() to retain umem_mutex
  RDMA/rxe: Convert the triple tasklets to workqueues
  RDMA/rxe: Cleanup code for responder Atomic operations
  RDMA/rxe: Add page invalidation support
  RDMA/rxe: Allow registering MRs for On-Demand Paging
  RDMA/rxe: Add support for Send/Recv/Write/Read operations with ODP
  RDMA/rxe: Add support for the traditional Atomic operations with ODP

 drivers/infiniband/core/umem_odp.c    |   6 +-
 drivers/infiniband/hw/mlx5/odp.c      |   4 +-
 drivers/infiniband/sw/rxe/Makefile    |   5 +-
 drivers/infiniband/sw/rxe/rxe.c       |  18 ++
 drivers/infiniband/sw/rxe/rxe_comp.c  |  42 +++-
 drivers/infiniband/sw/rxe/rxe_loc.h   |  11 +-
 drivers/infiniband/sw/rxe/rxe_mr.c    |   7 +-
 drivers/infiniband/sw/rxe/rxe_net.c   |   4 +-
 drivers/infiniband/sw/rxe/rxe_odp.c   | 329 ++++++++++++++++++++++++++
 drivers/infiniband/sw/rxe/rxe_param.h |   2 +-
 drivers/infiniband/sw/rxe/rxe_qp.c    |  68 +++---
 drivers/infiniband/sw/rxe/rxe_recv.c  |   2 +-
 drivers/infiniband/sw/rxe/rxe_req.c   |  14 +-
 drivers/infiniband/sw/rxe/rxe_resp.c  | 175 +++++++-------
 drivers/infiniband/sw/rxe/rxe_resp.h  |  44 ++++
 drivers/infiniband/sw/rxe/rxe_task.c  | 152 ------------
 drivers/infiniband/sw/rxe/rxe_task.h  |  69 ------
 drivers/infiniband/sw/rxe/rxe_verbs.c |  16 +-
 drivers/infiniband/sw/rxe/rxe_verbs.h |  10 +-
 drivers/infiniband/sw/rxe/rxe_wq.c    | 161 +++++++++++++
 drivers/infiniband/sw/rxe/rxe_wq.h    |  71 ++++++
 21 files changed, 824 insertions(+), 386 deletions(-)
 create mode 100644 drivers/infiniband/sw/rxe/rxe_odp.c
 create mode 100644 drivers/infiniband/sw/rxe/rxe_resp.h
 delete mode 100644 drivers/infiniband/sw/rxe/rxe_task.c
 delete mode 100644 drivers/infiniband/sw/rxe/rxe_task.h
 create mode 100644 drivers/infiniband/sw/rxe/rxe_wq.c
 create mode 100644 drivers/infiniband/sw/rxe/rxe_wq.h

-- 
2.31.1



* [RFC PATCH 1/7] IB/mlx5: Change ib_umem_odp_map_dma_single_page() to retain umem_mutex
  2022-09-07  2:42 [RFC PATCH 0/7] RDMA/rxe: On-Demand Paging on SoftRoCE Daisuke Matsuda
@ 2022-09-07  2:42 ` Daisuke Matsuda
  2022-09-07  2:43 ` [RFC PATCH 2/7] RDMA/rxe: Convert the triple tasklets to workqueues Daisuke Matsuda
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 25+ messages in thread
From: Daisuke Matsuda @ 2022-09-07  2:42 UTC (permalink / raw)
  To: linux-rdma, leonro, jgg, zyjzyj2000
  Cc: nvdimm, linux-kernel, rpearsonhpe, yangx.jy, lizhijian, y-goto,
	Daisuke Matsuda

ib_umem_odp_map_dma_single_page(), which has been used only by the mlx5
driver, holds umem_mutex on success and releases it on failure. This
behavior is not convenient for other drivers to use, so change it to always
retain the mutex on return.

Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
---
 drivers/infiniband/core/umem_odp.c | 6 ++----
 drivers/infiniband/hw/mlx5/odp.c   | 4 +++-
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index c459c4d011cf..92617a021439 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -328,8 +328,8 @@ static int ib_umem_odp_map_dma_single_page(
  *
  * Maps the range passed in the argument to DMA addresses.
  * The DMA addresses of the mapped pages is updated in umem_odp->dma_list.
- * Upon success the ODP MR will be locked to let caller complete its device
- * page table update.
+ * The umem mutex is locked and not released in this function. The caller should
+ * complete its device page table update before releasing the lock.
  *
  * Returns the number of pages mapped in success, negative error code
  * for failure.
@@ -456,8 +456,6 @@ int ib_umem_odp_map_dma_and_lock(struct ib_umem_odp *umem_odp, u64 user_virt,
 	/* upon success lock should stay on hold for the callee */
 	if (!ret)
 		ret = dma_index - start_idx;
-	else
-		mutex_unlock(&umem_odp->umem_mutex);
 
 out_put_mm:
 	mmput(owning_mm);
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index bc97958818bb..a0de27651586 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -572,8 +572,10 @@ static int pagefault_real_mr(struct mlx5_ib_mr *mr, struct ib_umem_odp *odp,
 		access_mask |= ODP_WRITE_ALLOWED_BIT;
 
 	np = ib_umem_odp_map_dma_and_lock(odp, user_va, bcnt, access_mask, fault);
-	if (np < 0)
+	if (np < 0) {
+		mutex_unlock(&odp->umem_mutex);
 		return np;
+	}
 
 	/*
 	 * No need to check whether the MTTs really belong to this MR, since
-- 
2.31.1



* [RFC PATCH 2/7] RDMA/rxe: Convert the triple tasklets to workqueues
  2022-09-07  2:42 [RFC PATCH 0/7] RDMA/rxe: On-Demand Paging on SoftRoCE Daisuke Matsuda
  2022-09-07  2:42 ` [RFC PATCH 1/7] IB/mlx5: Change ib_umem_odp_map_dma_single_page() to retain umem_mutex Daisuke Matsuda
@ 2022-09-07  2:43 ` Daisuke Matsuda
  2022-09-09 19:39   ` Bob Pearson
  2022-09-11  7:10   ` Yanjun Zhu
  2022-09-07  2:43 ` [RFC PATCH 3/7] RDMA/rxe: Cleanup code for responder Atomic operations Daisuke Matsuda
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 25+ messages in thread
From: Daisuke Matsuda @ 2022-09-07  2:43 UTC (permalink / raw)
  To: linux-rdma, leonro, jgg, zyjzyj2000
  Cc: nvdimm, linux-kernel, rpearsonhpe, yangx.jy, lizhijian, y-goto,
	Daisuke Matsuda

In order to implement On-Demand Paging on the rxe driver, the triple
tasklets (requester, responder, and completer) must be allowed to sleep so
that they can trigger page faults when the pages being accessed are not
present.

This patch replaces the tasklets with workqueues, but still allows direct
invocation of the work functions from softirq context when it is obvious
that MRs are not going to be accessed and no work is being processed in the
workqueue.

Because workqueues have no counterparts to tasklet_disable() and
tasklet_enable(), an atomic value is introduced to keep works suspended
while a qp reset is in progress.

As a reference, the performance change was measured using the ib_send_bw
and ib_send_lat commands over an RC connection. Both the client and the
server were placed on the same KVM host. The option "-n 100000" was given
to the respective commands to iterate 100000 times.

Before applying this patch:
[ib_send_bw]
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 65536      100000           0.00               203.13             0.003250
---------------------------------------------------------------------------------------
[ib_send_lat]
---------------------------------------------------------------------------------------
 #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]    t_avg[usec]    t_stdev[usec]   99% percentile[usec]   99.9% percentile[usec]
 2       100000          30.91          1519.05      34.23             36.06            5.15            48.49                   82.37
---------------------------------------------------------------------------------------

After applying this patch:
[ib_send_bw]
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 65536      100000           0.00               240.88             0.003854
---------------------------------------------------------------------------------------
[ib_send_lat]
---------------------------------------------------------------------------------------
 #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]    t_avg[usec]    t_stdev[usec]   99% percentile[usec]   99.9% percentile[usec]
 2       100000          40.88          2994.82      47.80             50.25            13.70           76.42                   185.04
---------------------------------------------------------------------------------------

The test was conducted 3 times for each kernel, and the results with the
median "BW average" and "t_typical" are shown above. They show that the
conversion improves bandwidth while causing higher latency.

Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
---
 drivers/infiniband/sw/rxe/Makefile    |   2 +-
 drivers/infiniband/sw/rxe/rxe_comp.c  |  42 ++++---
 drivers/infiniband/sw/rxe/rxe_loc.h   |   2 +-
 drivers/infiniband/sw/rxe/rxe_net.c   |   4 +-
 drivers/infiniband/sw/rxe/rxe_param.h |   2 +-
 drivers/infiniband/sw/rxe/rxe_qp.c    |  68 +++++------
 drivers/infiniband/sw/rxe/rxe_recv.c  |   2 +-
 drivers/infiniband/sw/rxe/rxe_req.c   |  14 +--
 drivers/infiniband/sw/rxe/rxe_resp.c  |  16 +--
 drivers/infiniband/sw/rxe/rxe_task.c  | 152 ------------------------
 drivers/infiniband/sw/rxe/rxe_task.h  |  69 -----------
 drivers/infiniband/sw/rxe/rxe_verbs.c |   8 +-
 drivers/infiniband/sw/rxe/rxe_verbs.h |   8 +-
 drivers/infiniband/sw/rxe/rxe_wq.c    | 161 ++++++++++++++++++++++++++
 drivers/infiniband/sw/rxe/rxe_wq.h    |  71 ++++++++++++
 15 files changed, 322 insertions(+), 299 deletions(-)
 delete mode 100644 drivers/infiniband/sw/rxe/rxe_task.c
 delete mode 100644 drivers/infiniband/sw/rxe/rxe_task.h
 create mode 100644 drivers/infiniband/sw/rxe/rxe_wq.c
 create mode 100644 drivers/infiniband/sw/rxe/rxe_wq.h

diff --git a/drivers/infiniband/sw/rxe/Makefile b/drivers/infiniband/sw/rxe/Makefile
index 5395a581f4bb..358f6b06aa64 100644
--- a/drivers/infiniband/sw/rxe/Makefile
+++ b/drivers/infiniband/sw/rxe/Makefile
@@ -20,6 +20,6 @@ rdma_rxe-y := \
 	rxe_mmap.o \
 	rxe_icrc.o \
 	rxe_mcast.o \
-	rxe_task.o \
+	rxe_wq.o \
 	rxe_net.o \
 	rxe_hw_counters.o
diff --git a/drivers/infiniband/sw/rxe/rxe_comp.c b/drivers/infiniband/sw/rxe/rxe_comp.c
index fb0c008af78c..0348da06c4dd 100644
--- a/drivers/infiniband/sw/rxe/rxe_comp.c
+++ b/drivers/infiniband/sw/rxe/rxe_comp.c
@@ -9,7 +9,7 @@
 #include "rxe.h"
 #include "rxe_loc.h"
 #include "rxe_queue.h"
-#include "rxe_task.h"
+#include "rxe_wq.h"
 
 enum comp_state {
 	COMPST_GET_ACK,
@@ -118,21 +118,37 @@ void retransmit_timer(struct timer_list *t)
 
 	if (qp->valid) {
 		qp->comp.timeout = 1;
-		rxe_run_task(&qp->comp.task, 1);
+		rxe_run_work(&qp->comp.work, 1);
 	}
 }
 
-void rxe_comp_queue_pkt(struct rxe_qp *qp, struct sk_buff *skb)
+void rxe_comp_queue_pkt(struct rxe_pkt_info *pkt, struct sk_buff *skb)
 {
+	struct rxe_qp *qp = pkt->qp;
 	int must_sched;
 
 	skb_queue_tail(&qp->resp_pkts, skb);
 
-	must_sched = skb_queue_len(&qp->resp_pkts) > 1;
+	/* Schedule a workqueue when processing READ and ATOMIC acks.
+	 * In these cases, completer may sleep to access ODP-enabled MRs.
+	 */
+	switch (pkt->opcode) {
+	case IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST:
+	case IB_OPCODE_RC_RDMA_READ_RESPONSE_LAST:
+	case IB_OPCODE_RC_RDMA_READ_RESPONSE_ONLY:
+	case IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE:
+	case IB_OPCODE_RC_ATOMIC_ACKNOWLEDGE:
+		must_sched = 1;
+		break;
+
+	default:
+		must_sched = skb_queue_len(&qp->resp_pkts) > 1;
+	}
+
 	if (must_sched != 0)
 		rxe_counter_inc(SKB_TO_PKT(skb)->rxe, RXE_CNT_COMPLETER_SCHED);
 
-	rxe_run_task(&qp->comp.task, must_sched);
+	rxe_run_work(&qp->comp.work, must_sched);
 }
 
 static inline enum comp_state get_wqe(struct rxe_qp *qp,
@@ -305,7 +321,7 @@ static inline enum comp_state check_ack(struct rxe_qp *qp,
 					qp->comp.psn = pkt->psn;
 					if (qp->req.wait_psn) {
 						qp->req.wait_psn = 0;
-						rxe_run_task(&qp->req.task, 0);
+						rxe_run_work(&qp->req.work, 0);
 					}
 				}
 				return COMPST_ERROR_RETRY;
@@ -452,7 +468,7 @@ static void do_complete(struct rxe_qp *qp, struct rxe_send_wqe *wqe)
 	 */
 	if (qp->req.wait_fence) {
 		qp->req.wait_fence = 0;
-		rxe_run_task(&qp->req.task, 0);
+		rxe_run_work(&qp->req.work, 0);
 	}
 }
 
@@ -466,7 +482,7 @@ static inline enum comp_state complete_ack(struct rxe_qp *qp,
 		if (qp->req.need_rd_atomic) {
 			qp->comp.timeout_retry = 0;
 			qp->req.need_rd_atomic = 0;
-			rxe_run_task(&qp->req.task, 0);
+			rxe_run_work(&qp->req.work, 0);
 		}
 	}
 
@@ -512,7 +528,7 @@ static inline enum comp_state complete_wqe(struct rxe_qp *qp,
 
 		if (qp->req.wait_psn) {
 			qp->req.wait_psn = 0;
-			rxe_run_task(&qp->req.task, 1);
+			rxe_run_work(&qp->req.work, 1);
 		}
 	}
 
@@ -646,7 +662,7 @@ int rxe_completer(void *arg)
 
 			if (qp->req.wait_psn) {
 				qp->req.wait_psn = 0;
-				rxe_run_task(&qp->req.task, 1);
+				rxe_run_work(&qp->req.work, 1);
 			}
 
 			state = COMPST_DONE;
@@ -714,7 +730,7 @@ int rxe_completer(void *arg)
 							RXE_CNT_COMP_RETRY);
 					qp->req.need_retry = 1;
 					qp->comp.started_retry = 1;
-					rxe_run_task(&qp->req.task, 0);
+					rxe_run_work(&qp->req.work, 0);
 				}
 				goto done;
 
@@ -757,8 +773,8 @@ int rxe_completer(void *arg)
 		}
 	}
 
-	/* A non-zero return value will cause rxe_do_task to
-	 * exit its loop and end the tasklet. A zero return
+	/* A non-zero return value will cause rxe_do_work to
+	 * exit its loop and end the work. A zero return
 	 * will continue looping and return to rxe_completer
 	 */
 done:
diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
index 22f6cc31d1d6..0f8cb9e38cc9 100644
--- a/drivers/infiniband/sw/rxe/rxe_loc.h
+++ b/drivers/infiniband/sw/rxe/rxe_loc.h
@@ -181,7 +181,7 @@ void rxe_icrc_generate(struct sk_buff *skb, struct rxe_pkt_info *pkt);
 
 void rxe_resp_queue_pkt(struct rxe_qp *qp, struct sk_buff *skb);
 
-void rxe_comp_queue_pkt(struct rxe_qp *qp, struct sk_buff *skb);
+void rxe_comp_queue_pkt(struct rxe_pkt_info *pkt, struct sk_buff *skb);
 
 static inline unsigned int wr_opcode_mask(int opcode, struct rxe_qp *qp)
 {
diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
index c53f4529f098..5526970882c8 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.c
+++ b/drivers/infiniband/sw/rxe/rxe_net.c
@@ -346,7 +346,7 @@ static void rxe_skb_tx_dtor(struct sk_buff *skb)
 
 	if (unlikely(qp->need_req_skb &&
 		     skb_out < RXE_INFLIGHT_SKBS_PER_QP_LOW))
-		rxe_run_task(&qp->req.task, 1);
+		rxe_run_work(&qp->req.work, 1);
 
 	rxe_put(qp);
 }
@@ -430,7 +430,7 @@ int rxe_xmit_packet(struct rxe_qp *qp, struct rxe_pkt_info *pkt,
 	if ((qp_type(qp) != IB_QPT_RC) &&
 	    (pkt->mask & RXE_END_MASK)) {
 		pkt->wqe->state = wqe_state_done;
-		rxe_run_task(&qp->comp.task, 1);
+		rxe_run_work(&qp->comp.work, 1);
 	}
 
 	rxe_counter_inc(rxe, RXE_CNT_SENT_PKTS);
diff --git a/drivers/infiniband/sw/rxe/rxe_param.h b/drivers/infiniband/sw/rxe/rxe_param.h
index 86c7a8bf3cbb..1c0251812fc8 100644
--- a/drivers/infiniband/sw/rxe/rxe_param.h
+++ b/drivers/infiniband/sw/rxe/rxe_param.h
@@ -105,7 +105,7 @@ enum rxe_device_param {
 	RXE_INFLIGHT_SKBS_PER_QP_HIGH	= 64,
 	RXE_INFLIGHT_SKBS_PER_QP_LOW	= 16,
 
-	/* Max number of interations of each tasklet
+	/* Max number of interations of each work
 	 * before yielding the cpu to let other
 	 * work make progress
 	 */
diff --git a/drivers/infiniband/sw/rxe/rxe_qp.c b/drivers/infiniband/sw/rxe/rxe_qp.c
index 516bf9b95e48..c7c7acd4ecfa 100644
--- a/drivers/infiniband/sw/rxe/rxe_qp.c
+++ b/drivers/infiniband/sw/rxe/rxe_qp.c
@@ -13,7 +13,7 @@
 #include "rxe.h"
 #include "rxe_loc.h"
 #include "rxe_queue.h"
-#include "rxe_task.h"
+#include "rxe_wq.h"
 
 static int rxe_qp_chk_cap(struct rxe_dev *rxe, struct ib_qp_cap *cap,
 			  int has_srq)
@@ -172,9 +172,9 @@ static void rxe_qp_init_misc(struct rxe_dev *rxe, struct rxe_qp *qp,
 
 	spin_lock_init(&qp->state_lock);
 
-	spin_lock_init(&qp->req.task.state_lock);
-	spin_lock_init(&qp->resp.task.state_lock);
-	spin_lock_init(&qp->comp.task.state_lock);
+	spin_lock_init(&qp->req.work.state_lock);
+	spin_lock_init(&qp->resp.work.state_lock);
+	spin_lock_init(&qp->comp.work.state_lock);
 
 	spin_lock_init(&qp->sq.sq_lock);
 	spin_lock_init(&qp->rq.producer_lock);
@@ -242,10 +242,10 @@ static int rxe_qp_init_req(struct rxe_dev *rxe, struct rxe_qp *qp,
 
 	skb_queue_head_init(&qp->req_pkts);
 
-	rxe_init_task(rxe, &qp->req.task, qp,
-		      rxe_requester, "req");
-	rxe_init_task(rxe, &qp->comp.task, qp,
-		      rxe_completer, "comp");
+	rxe_init_work(rxe, &qp->req.work, qp,
+		      rxe_requester, "rxe_req");
+	rxe_init_work(rxe, &qp->comp.work, qp,
+		      rxe_completer, "rxe_comp");
 
 	qp->qp_timeout_jiffies = 0; /* Can't be set for UD/UC in modify_qp */
 	if (init->qp_type == IB_QPT_RC) {
@@ -292,8 +292,8 @@ static int rxe_qp_init_resp(struct rxe_dev *rxe, struct rxe_qp *qp,
 
 	skb_queue_head_init(&qp->resp_pkts);
 
-	rxe_init_task(rxe, &qp->resp.task, qp,
-		      rxe_responder, "resp");
+	rxe_init_work(rxe, &qp->resp.work, qp,
+		      rxe_responder, "rxe_resp");
 
 	qp->resp.opcode		= OPCODE_NONE;
 	qp->resp.msn		= 0;
@@ -481,14 +481,14 @@ int rxe_qp_chk_attr(struct rxe_dev *rxe, struct rxe_qp *qp,
 /* move the qp to the reset state */
 static void rxe_qp_reset(struct rxe_qp *qp)
 {
-	/* stop tasks from running */
-	rxe_disable_task(&qp->resp.task);
+	/* flush workqueue and stop works from running */
+	rxe_disable_work(&qp->resp.work);
 
 	/* stop request/comp */
 	if (qp->sq.queue) {
 		if (qp_type(qp) == IB_QPT_RC)
-			rxe_disable_task(&qp->comp.task);
-		rxe_disable_task(&qp->req.task);
+			rxe_disable_work(&qp->comp.work);
+		rxe_disable_work(&qp->req.work);
 	}
 
 	/* move qp to the reset state */
@@ -499,11 +499,11 @@ static void rxe_qp_reset(struct rxe_qp *qp)
 	/* let state machines reset themselves drain work and packet queues
 	 * etc.
 	 */
-	__rxe_do_task(&qp->resp.task);
+	__rxe_do_work(&qp->resp.work);
 
 	if (qp->sq.queue) {
-		__rxe_do_task(&qp->comp.task);
-		__rxe_do_task(&qp->req.task);
+		__rxe_do_work(&qp->comp.work);
+		__rxe_do_work(&qp->req.work);
 		rxe_queue_reset(qp->sq.queue);
 	}
 
@@ -526,14 +526,14 @@ static void rxe_qp_reset(struct rxe_qp *qp)
 
 	cleanup_rd_atomic_resources(qp);
 
-	/* reenable tasks */
-	rxe_enable_task(&qp->resp.task);
+	/* reenable workqueue */
+	rxe_enable_work(&qp->resp.work);
 
 	if (qp->sq.queue) {
 		if (qp_type(qp) == IB_QPT_RC)
-			rxe_enable_task(&qp->comp.task);
+			rxe_enable_work(&qp->comp.work);
 
-		rxe_enable_task(&qp->req.task);
+		rxe_enable_work(&qp->req.work);
 	}
 }
 
@@ -544,10 +544,10 @@ static void rxe_qp_drain(struct rxe_qp *qp)
 		if (qp->req.state != QP_STATE_DRAINED) {
 			qp->req.state = QP_STATE_DRAIN;
 			if (qp_type(qp) == IB_QPT_RC)
-				rxe_run_task(&qp->comp.task, 1);
+				rxe_run_work(&qp->comp.work, 1);
 			else
-				__rxe_do_task(&qp->comp.task);
-			rxe_run_task(&qp->req.task, 1);
+				__rxe_do_work(&qp->comp.work);
+			rxe_run_work(&qp->req.work, 1);
 		}
 	}
 }
@@ -561,13 +561,13 @@ void rxe_qp_error(struct rxe_qp *qp)
 	qp->attr.qp_state = IB_QPS_ERR;
 
 	/* drain work and packet queues */
-	rxe_run_task(&qp->resp.task, 1);
+	rxe_run_work(&qp->resp.work, 1);
 
 	if (qp_type(qp) == IB_QPT_RC)
-		rxe_run_task(&qp->comp.task, 1);
+		rxe_run_work(&qp->comp.work, 1);
 	else
-		__rxe_do_task(&qp->comp.task);
-	rxe_run_task(&qp->req.task, 1);
+		__rxe_do_work(&qp->comp.work);
+	rxe_run_work(&qp->req.work, 1);
 }
 
 /* called by the modify qp verb */
@@ -786,21 +786,21 @@ static void rxe_qp_do_cleanup(struct work_struct *work)
 
 	qp->valid = 0;
 	qp->qp_timeout_jiffies = 0;
-	rxe_cleanup_task(&qp->resp.task);
+	rxe_cleanup_work(&qp->resp.work);
 
 	if (qp_type(qp) == IB_QPT_RC) {
 		del_timer_sync(&qp->retrans_timer);
 		del_timer_sync(&qp->rnr_nak_timer);
 	}
 
-	rxe_cleanup_task(&qp->req.task);
-	rxe_cleanup_task(&qp->comp.task);
+	rxe_cleanup_work(&qp->req.work);
+	rxe_cleanup_work(&qp->comp.work);
 
 	/* flush out any receive wr's or pending requests */
-	__rxe_do_task(&qp->req.task);
+	__rxe_do_work(&qp->req.work);
 	if (qp->sq.queue) {
-		__rxe_do_task(&qp->comp.task);
-		__rxe_do_task(&qp->req.task);
+		__rxe_do_work(&qp->comp.work);
+		__rxe_do_work(&qp->req.work);
 	}
 
 	if (qp->sq.queue)
diff --git a/drivers/infiniband/sw/rxe/rxe_recv.c b/drivers/infiniband/sw/rxe/rxe_recv.c
index f3ad7b6dbd97..3b525e9d903e 100644
--- a/drivers/infiniband/sw/rxe/rxe_recv.c
+++ b/drivers/infiniband/sw/rxe/rxe_recv.c
@@ -226,7 +226,7 @@ static inline void rxe_rcv_pkt(struct rxe_pkt_info *pkt, struct sk_buff *skb)
 	if (pkt->mask & RXE_REQ_MASK)
 		rxe_resp_queue_pkt(pkt->qp, skb);
 	else
-		rxe_comp_queue_pkt(pkt->qp, skb);
+		rxe_comp_queue_pkt(pkt, skb);
 }
 
 static void rxe_rcv_mcast_pkt(struct rxe_dev *rxe, struct sk_buff *skb)
diff --git a/drivers/infiniband/sw/rxe/rxe_req.c b/drivers/infiniband/sw/rxe/rxe_req.c
index f63771207970..bd05dd3de499 100644
--- a/drivers/infiniband/sw/rxe/rxe_req.c
+++ b/drivers/infiniband/sw/rxe/rxe_req.c
@@ -105,7 +105,7 @@ void rnr_nak_timer(struct timer_list *t)
 	/* request a send queue retry */
 	qp->req.need_retry = 1;
 	qp->req.wait_for_rnr_timer = 0;
-	rxe_run_task(&qp->req.task, 1);
+	rxe_run_work(&qp->req.work, 1);
 }
 
 static struct rxe_send_wqe *req_next_wqe(struct rxe_qp *qp)
@@ -608,7 +608,7 @@ static int rxe_do_local_ops(struct rxe_qp *qp, struct rxe_send_wqe *wqe)
 	 * which can lead to a deadlock. So go ahead and complete
 	 * it now.
 	 */
-	rxe_run_task(&qp->comp.task, 1);
+	rxe_run_work(&qp->comp.work, 1);
 
 	return 0;
 }
@@ -733,7 +733,7 @@ int rxe_requester(void *arg)
 						       qp->req.wqe_index);
 			wqe->state = wqe_state_done;
 			wqe->status = IB_WC_SUCCESS;
-			rxe_run_task(&qp->comp.task, 0);
+			rxe_run_work(&qp->comp.work, 0);
 			goto done;
 		}
 		payload = mtu;
@@ -795,7 +795,7 @@ int rxe_requester(void *arg)
 		rollback_state(wqe, qp, &rollback_wqe, rollback_psn);
 
 		if (err == -EAGAIN) {
-			rxe_run_task(&qp->req.task, 1);
+			rxe_run_work(&qp->req.work, 1);
 			goto exit;
 		}
 
@@ -805,8 +805,8 @@ int rxe_requester(void *arg)
 
 	update_state(qp, &pkt);
 
-	/* A non-zero return value will cause rxe_do_task to
-	 * exit its loop and end the tasklet. A zero return
+	/* A non-zero return value will cause rxe_do_work to
+	 * exit its loop and end the work. A zero return
 	 * will continue looping and return to rxe_requester
 	 */
 done:
@@ -817,7 +817,7 @@ int rxe_requester(void *arg)
 	qp->req.wqe_index = queue_next_index(qp->sq.queue, qp->req.wqe_index);
 	wqe->state = wqe_state_error;
 	qp->req.state = QP_STATE_ERROR;
-	rxe_run_task(&qp->comp.task, 0);
+	rxe_run_work(&qp->comp.work, 0);
 exit:
 	ret = -EAGAIN;
 out:
diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
index b36ec5c4d5e0..e97c55b292f0 100644
--- a/drivers/infiniband/sw/rxe/rxe_resp.c
+++ b/drivers/infiniband/sw/rxe/rxe_resp.c
@@ -83,15 +83,11 @@ static char *resp_state_name[] = {
 /* rxe_recv calls here to add a request packet to the input queue */
 void rxe_resp_queue_pkt(struct rxe_qp *qp, struct sk_buff *skb)
 {
-	int must_sched;
-	struct rxe_pkt_info *pkt = SKB_TO_PKT(skb);
-
+	/* responder can sleep to access ODP-enabled MRs, so schedule the
+	 * workqueue for all incoming requests.
+	 */
 	skb_queue_tail(&qp->req_pkts, skb);
-
-	must_sched = (pkt->opcode == IB_OPCODE_RC_RDMA_READ_REQUEST) ||
-			(skb_queue_len(&qp->req_pkts) > 1);
-
-	rxe_run_task(&qp->resp.task, must_sched);
+	rxe_run_work(&qp->resp.work, 1);
 }
 
 static inline enum resp_states get_req(struct rxe_qp *qp,
@@ -1464,8 +1460,8 @@ int rxe_responder(void *arg)
 		}
 	}
 
-	/* A non-zero return value will cause rxe_do_task to
-	 * exit its loop and end the tasklet. A zero return
+	/* A non-zero return value will cause rxe_do_work to
+	 * exit its loop and end the work. A zero return
 	 * will continue looping and return to rxe_responder
 	 */
 done:
diff --git a/drivers/infiniband/sw/rxe/rxe_task.c b/drivers/infiniband/sw/rxe/rxe_task.c
deleted file mode 100644
index 2248cf33d776..000000000000
--- a/drivers/infiniband/sw/rxe/rxe_task.c
+++ /dev/null
@@ -1,152 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
-/*
- * Copyright (c) 2016 Mellanox Technologies Ltd. All rights reserved.
- * Copyright (c) 2015 System Fabric Works, Inc. All rights reserved.
- */
-
-#include <linux/kernel.h>
-#include <linux/interrupt.h>
-#include <linux/hardirq.h>
-
-#include "rxe.h"
-
-int __rxe_do_task(struct rxe_task *task)
-
-{
-	int ret;
-
-	while ((ret = task->func(task->arg)) == 0)
-		;
-
-	task->ret = ret;
-
-	return ret;
-}
-
-/*
- * this locking is due to a potential race where
- * a second caller finds the task already running
- * but looks just after the last call to func
- */
-void rxe_do_task(struct tasklet_struct *t)
-{
-	int cont;
-	int ret;
-	struct rxe_task *task = from_tasklet(task, t, tasklet);
-	unsigned int iterations = RXE_MAX_ITERATIONS;
-
-	spin_lock_bh(&task->state_lock);
-	switch (task->state) {
-	case TASK_STATE_START:
-		task->state = TASK_STATE_BUSY;
-		spin_unlock_bh(&task->state_lock);
-		break;
-
-	case TASK_STATE_BUSY:
-		task->state = TASK_STATE_ARMED;
-		fallthrough;
-	case TASK_STATE_ARMED:
-		spin_unlock_bh(&task->state_lock);
-		return;
-
-	default:
-		spin_unlock_bh(&task->state_lock);
-		pr_warn("%s failed with bad state %d\n", __func__, task->state);
-		return;
-	}
-
-	do {
-		cont = 0;
-		ret = task->func(task->arg);
-
-		spin_lock_bh(&task->state_lock);
-		switch (task->state) {
-		case TASK_STATE_BUSY:
-			if (ret) {
-				task->state = TASK_STATE_START;
-			} else if (iterations--) {
-				cont = 1;
-			} else {
-				/* reschedule the tasklet and exit
-				 * the loop to give up the cpu
-				 */
-				tasklet_schedule(&task->tasklet);
-				task->state = TASK_STATE_START;
-			}
-			break;
-
-		/* someone tried to run the task since the last time we called
-		 * func, so we will call one more time regardless of the
-		 * return value
-		 */
-		case TASK_STATE_ARMED:
-			task->state = TASK_STATE_BUSY;
-			cont = 1;
-			break;
-
-		default:
-			pr_warn("%s failed with bad state %d\n", __func__,
-				task->state);
-		}
-		spin_unlock_bh(&task->state_lock);
-	} while (cont);
-
-	task->ret = ret;
-}
-
-int rxe_init_task(void *obj, struct rxe_task *task,
-		  void *arg, int (*func)(void *), char *name)
-{
-	task->obj	= obj;
-	task->arg	= arg;
-	task->func	= func;
-	snprintf(task->name, sizeof(task->name), "%s", name);
-	task->destroyed	= false;
-
-	tasklet_setup(&task->tasklet, rxe_do_task);
-
-	task->state = TASK_STATE_START;
-	spin_lock_init(&task->state_lock);
-
-	return 0;
-}
-
-void rxe_cleanup_task(struct rxe_task *task)
-{
-	bool idle;
-
-	/*
-	 * Mark the task, then wait for it to finish. It might be
-	 * running in a non-tasklet (direct call) context.
-	 */
-	task->destroyed = true;
-
-	do {
-		spin_lock_bh(&task->state_lock);
-		idle = (task->state == TASK_STATE_START);
-		spin_unlock_bh(&task->state_lock);
-	} while (!idle);
-
-	tasklet_kill(&task->tasklet);
-}
-
-void rxe_run_task(struct rxe_task *task, int sched)
-{
-	if (task->destroyed)
-		return;
-
-	if (sched)
-		tasklet_schedule(&task->tasklet);
-	else
-		rxe_do_task(&task->tasklet);
-}
-
-void rxe_disable_task(struct rxe_task *task)
-{
-	tasklet_disable(&task->tasklet);
-}
-
-void rxe_enable_task(struct rxe_task *task)
-{
-	tasklet_enable(&task->tasklet);
-}
diff --git a/drivers/infiniband/sw/rxe/rxe_task.h b/drivers/infiniband/sw/rxe/rxe_task.h
deleted file mode 100644
index 11d183fd3338..000000000000
--- a/drivers/infiniband/sw/rxe/rxe_task.h
+++ /dev/null
@@ -1,69 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
-/*
- * Copyright (c) 2016 Mellanox Technologies Ltd. All rights reserved.
- * Copyright (c) 2015 System Fabric Works, Inc. All rights reserved.
- */
-
-#ifndef RXE_TASK_H
-#define RXE_TASK_H
-
-enum {
-	TASK_STATE_START	= 0,
-	TASK_STATE_BUSY		= 1,
-	TASK_STATE_ARMED	= 2,
-};
-
-/*
- * data structure to describe a 'task' which is a short
- * function that returns 0 as long as it needs to be
- * called again.
- */
-struct rxe_task {
-	void			*obj;
-	struct tasklet_struct	tasklet;
-	int			state;
-	spinlock_t		state_lock; /* spinlock for task state */
-	void			*arg;
-	int			(*func)(void *arg);
-	int			ret;
-	char			name[16];
-	bool			destroyed;
-};
-
-/*
- * init rxe_task structure
- *	arg  => parameter to pass to fcn
- *	func => function to call until it returns != 0
- */
-int rxe_init_task(void *obj, struct rxe_task *task,
-		  void *arg, int (*func)(void *), char *name);
-
-/* cleanup task */
-void rxe_cleanup_task(struct rxe_task *task);
-
-/*
- * raw call to func in loop without any checking
- * can call when tasklets are disabled
- */
-int __rxe_do_task(struct rxe_task *task);
-
-/*
- * common function called by any of the main tasklets
- * If there is any chance that there is additional
- * work to do someone must reschedule the task before
- * leaving
- */
-void rxe_do_task(struct tasklet_struct *t);
-
-/* run a task, else schedule it to run as a tasklet, The decision
- * to run or schedule tasklet is based on the parameter sched.
- */
-void rxe_run_task(struct rxe_task *task, int sched);
-
-/* keep a task from scheduling */
-void rxe_disable_task(struct rxe_task *task);
-
-/* allow task to run */
-void rxe_enable_task(struct rxe_task *task);
-
-#endif /* RXE_TASK_H */
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.c b/drivers/infiniband/sw/rxe/rxe_verbs.c
index 9ebe9decad34..7510f25c5ea3 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.c
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.c
@@ -697,9 +697,9 @@ static int rxe_post_send_kernel(struct rxe_qp *qp, const struct ib_send_wr *wr,
 		wr = next;
 	}
 
-	rxe_run_task(&qp->req.task, 1);
+	rxe_run_work(&qp->req.work, 1);
 	if (unlikely(qp->req.state == QP_STATE_ERROR))
-		rxe_run_task(&qp->comp.task, 1);
+		rxe_run_work(&qp->comp.work, 1);
 
 	return err;
 }
@@ -721,7 +721,7 @@ static int rxe_post_send(struct ib_qp *ibqp, const struct ib_send_wr *wr,
 
 	if (qp->is_user) {
 		/* Utilize process context to do protocol processing */
-		rxe_run_task(&qp->req.task, 0);
+		rxe_run_work(&qp->req.work, 0);
 		return 0;
 	} else
 		return rxe_post_send_kernel(qp, wr, bad_wr);
@@ -761,7 +761,7 @@ static int rxe_post_recv(struct ib_qp *ibqp, const struct ib_recv_wr *wr,
 	spin_unlock_irqrestore(&rq->producer_lock, flags);
 
 	if (qp->resp.state == QP_STATE_ERROR)
-		rxe_run_task(&qp->resp.task, 1);
+		rxe_run_work(&qp->resp.work, 1);
 
 err1:
 	return err;
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
index a51819d0c345..b09b4cb9897a 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.h
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
@@ -10,7 +10,7 @@
 #include <linux/interrupt.h>
 #include <linux/workqueue.h>
 #include "rxe_pool.h"
-#include "rxe_task.h"
+#include "rxe_wq.h"
 #include "rxe_hw_counters.h"
 
 static inline int pkey_match(u16 key1, u16 key2)
@@ -125,7 +125,7 @@ struct rxe_req_info {
 	int			need_retry;
 	int			wait_for_rnr_timer;
 	int			noack_pkts;
-	struct rxe_task		task;
+	struct rxe_work		work;
 };
 
 struct rxe_comp_info {
@@ -137,7 +137,7 @@ struct rxe_comp_info {
 	int			started_retry;
 	u32			retry_cnt;
 	u32			rnr_retry;
-	struct rxe_task		task;
+	struct rxe_work		work;
 };
 
 enum rdatm_res_state {
@@ -204,7 +204,7 @@ struct rxe_resp_info {
 	unsigned int		res_head;
 	unsigned int		res_tail;
 	struct resp_res		*res;
-	struct rxe_task		task;
+	struct rxe_work		work;
 };
 
 struct rxe_qp {
diff --git a/drivers/infiniband/sw/rxe/rxe_wq.c b/drivers/infiniband/sw/rxe/rxe_wq.c
new file mode 100644
index 000000000000..8b35186eac85
--- /dev/null
+++ b/drivers/infiniband/sw/rxe/rxe_wq.c
@@ -0,0 +1,161 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/*
+ * Copyright (c) 2016 Mellanox Technologies Ltd. All rights reserved.
+ * Copyright (c) 2015 System Fabric Works, Inc. All rights reserved.
+ */
+
+#include <linux/kernel.h>
+#include <linux/interrupt.h>
+#include <linux/hardirq.h>
+
+#include "rxe.h"
+
+int __rxe_do_work(struct rxe_work *work)
+
+{
+	int ret;
+
+	while ((ret = work->func(work->arg)) == 0)
+		;
+
+	work->ret = ret;
+
+	return ret;
+}
+
+/*
+ * this locking is due to a potential race where
+ * a second caller finds the work already running
+ * but looks just after the last call to func
+ */
+void rxe_do_work(struct work_struct *w)
+{
+	int cont;
+	int ret;
+
+	struct rxe_work *work = container_of(w, typeof(*work), work);
+	unsigned int iterations = RXE_MAX_ITERATIONS;
+
+	spin_lock_bh(&work->state_lock);
+	switch (work->state) {
+	case WQ_STATE_START:
+		work->state = WQ_STATE_BUSY;
+		spin_unlock_bh(&work->state_lock);
+		break;
+
+	case WQ_STATE_BUSY:
+		work->state = WQ_STATE_ARMED;
+		fallthrough;
+	case WQ_STATE_ARMED:
+		spin_unlock_bh(&work->state_lock);
+		return;
+
+	default:
+		spin_unlock_bh(&work->state_lock);
+		pr_warn("%s failed with bad state %d\n", __func__, work->state);
+		return;
+	}
+
+	do {
+		cont = 0;
+		ret = work->func(work->arg);
+
+		spin_lock_bh(&work->state_lock);
+		switch (work->state) {
+		case WQ_STATE_BUSY:
+			if (ret) {
+				work->state = WQ_STATE_START;
+			} else if (iterations--) {
+				cont = 1;
+			} else {
+				/* reschedule the work and exit
+				 * the loop to give up the cpu
+				 */
+				queue_work(work->worker, &work->work);
+				work->state = WQ_STATE_START;
+			}
+			break;
+
+		/* someone tried to run the work since the last time we called
+		 * func, so we will call one more time regardless of the
+		 * return value
+		 */
+		case WQ_STATE_ARMED:
+			work->state = WQ_STATE_BUSY;
+			cont = 1;
+			break;
+
+		default:
+			pr_warn("%s failed with bad state %d\n", __func__,
+				work->state);
+		}
+		spin_unlock_bh(&work->state_lock);
+	} while (cont);
+
+	work->ret = ret;
+}
+
+int rxe_init_work(void *obj, struct rxe_work *work,
+		  void *arg, int (*func)(void *), char *name)
+{
+	work->obj	= obj;
+	work->arg	= arg;
+	work->func	= func;
+	snprintf(work->name, sizeof(work->name), "%s", name);
+	work->destroyed	= false;
+	atomic_set(&work->suspended, 0);
+
+	work->worker = create_singlethread_workqueue(name);
+	INIT_WORK(&work->work, rxe_do_work);
+
+	work->state = WQ_STATE_START;
+	spin_lock_init(&work->state_lock);
+
+	return 0;
+}
+
+void rxe_cleanup_work(struct rxe_work *work)
+{
+	bool idle;
+
+	/*
+	 * Mark the work, then wait for it to finish. It might be
+	 * running in a non-workqueue (direct call) context.
+	 */
+	work->destroyed = true;
+	flush_workqueue(work->worker);
+
+	do {
+		spin_lock_bh(&work->state_lock);
+		idle = (work->state == WQ_STATE_START);
+		spin_unlock_bh(&work->state_lock);
+	} while (!idle);
+
+	destroy_workqueue(work->worker);
+}
+
+void rxe_run_work(struct rxe_work *work, int sched)
+{
+	if (work->destroyed)
+		return;
+
+	/* busy-loop while qp reset is in progress */
+	while (atomic_read(&work->suspended))
+		continue;
+
+	if (sched)
+		queue_work(work->worker, &work->work);
+	else
+		rxe_do_work(&work->work);
+}
+
+void rxe_disable_work(struct rxe_work *work)
+{
+	atomic_inc(&work->suspended);
+	flush_workqueue(work->worker);
+}
+
+void rxe_enable_work(struct rxe_work *work)
+{
+	atomic_dec(&work->suspended);
+}
diff --git a/drivers/infiniband/sw/rxe/rxe_wq.h b/drivers/infiniband/sw/rxe/rxe_wq.h
new file mode 100644
index 000000000000..b40af598dcc0
--- /dev/null
+++ b/drivers/infiniband/sw/rxe/rxe_wq.h
@@ -0,0 +1,71 @@
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+/*
+ * Copyright (c) 2016 Mellanox Technologies Ltd. All rights reserved.
+ * Copyright (c) 2015 System Fabric Works, Inc. All rights reserved.
+ */
+
+#ifndef RXE_WQ_H
+#define RXE_WQ_H
+
+enum {
+	WQ_STATE_START	= 0,
+	WQ_STATE_BUSY		= 1,
+	WQ_STATE_ARMED	= 2,
+};
+
+/*
+ * data structure to describe a 'work' which is a short
+ * function that returns 0 as long as it needs to be
+ * called again.
+ */
+struct rxe_work {
+	void			*obj;
+	struct workqueue_struct	*worker;
+	struct work_struct	work;
+	int			state;
+	spinlock_t		state_lock; /* spinlock for work state */
+	void			*arg;
+	int			(*func)(void *arg);
+	int			ret;
+	char			name[16];
+	bool			destroyed;
+	atomic_t		suspended; /* used to {dis,en}able workqueue */
+};
+
+/*
+ * init rxe_work structure
+ *	arg  => parameter to pass to fcn
+ *	func => function to call until it returns != 0
+ */
+int rxe_init_work(void *obj, struct rxe_work *work,
+		  void *arg, int (*func)(void *), char *name);
+
+/* cleanup work */
+void rxe_cleanup_work(struct rxe_work *work);
+
+/*
+ * raw call to func in loop without any checking
+ * can call when workqueues are suspended.
+ */
+int __rxe_do_work(struct rxe_work *work);
+
+/*
+ * common function called by any of the main workqueues
+ * If there is any chance that there is additional
+ * work to do someone must reschedule the work before
+ * leaving
+ */
+void rxe_do_work(struct work_struct *w);
+
+/* run a work, else schedule it to run as a workqueue, The decision
+ * to run or schedule workqueue is based on the parameter sched.
+ */
+void rxe_run_work(struct rxe_work *work, int sched);
+
+/* keep a work from scheduling */
+void rxe_disable_work(struct rxe_work *work);
+
+/* allow work to run */
+void rxe_enable_work(struct rxe_work *work);
+
+#endif /* RXE_WQ_H */
-- 
2.31.1



* [RFC PATCH 3/7] RDMA/rxe: Cleanup code for responder Atomic operations
  2022-09-07  2:42 [RFC PATCH 0/7] RDMA/rxe: On-Demand Paging on SoftRoCE Daisuke Matsuda
  2022-09-07  2:42 ` [RFC PATCH 1/7] IB/mlx5: Change ib_umem_odp_map_dma_single_page() to retain umem_mutex Daisuke Matsuda
  2022-09-07  2:43 ` [RFC PATCH 2/7] RDMA/rxe: Convert the triple tasklets to workqueues Daisuke Matsuda
@ 2022-09-07  2:43 ` Daisuke Matsuda
  2022-09-07  2:43 ` [RFC PATCH 4/7] RDMA/rxe: Add page invalidation support Daisuke Matsuda
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 25+ messages in thread
From: Daisuke Matsuda @ 2022-09-07  2:43 UTC (permalink / raw)
  To: linux-rdma, leonro, jgg, zyjzyj2000
  Cc: nvdimm, linux-kernel, rpearsonhpe, yangx.jy, lizhijian, y-goto,
	Daisuke Matsuda

Currently, rxe_responder() directly calls the function to execute Atomic
operations. This needs to be modified to insert some conditional branches
for the new RDMA Atomic Write operation and the ODP feature.

Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
---
 drivers/infiniband/sw/rxe/rxe_resp.c | 102 +++++++++++++++++----------
 1 file changed, 64 insertions(+), 38 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
index e97c55b292f0..cadc8fa64dd0 100644
--- a/drivers/infiniband/sw/rxe/rxe_resp.c
+++ b/drivers/infiniband/sw/rxe/rxe_resp.c
@@ -591,60 +591,86 @@ static struct resp_res *rxe_prepare_res(struct rxe_qp *qp,
 /* Guarantee atomicity of atomic operations at the machine level. */
 static DEFINE_SPINLOCK(atomic_ops_lock);
 
-static enum resp_states atomic_reply(struct rxe_qp *qp,
-					 struct rxe_pkt_info *pkt)
+enum resp_states rxe_process_atomic(struct rxe_qp *qp,
+				    struct rxe_pkt_info *pkt, u64 *vaddr)
 {
-	u64 *vaddr;
 	enum resp_states ret;
-	struct rxe_mr *mr = qp->resp.mr;
 	struct resp_res *res = qp->resp.res;
 	u64 value;
 
-	if (!res) {
-		res = rxe_prepare_res(qp, pkt, RXE_ATOMIC_MASK);
-		qp->resp.res = res;
+	/* check vaddr is 8 bytes aligned. */
+	if (!vaddr || (uintptr_t)vaddr & 7) {
+		ret = RESPST_ERR_MISALIGNED_ATOMIC;
+		goto out;
 	}
 
-	if (!res->replay) {
-		if (mr->state != RXE_MR_STATE_VALID) {
-			ret = RESPST_ERR_RKEY_VIOLATION;
-			goto out;
-		}
+	spin_lock(&atomic_ops_lock);
+	res->atomic.orig_val = value = *vaddr;
 
-		vaddr = iova_to_vaddr(mr, qp->resp.va + qp->resp.offset,
-					sizeof(u64));
+	if (pkt->opcode == IB_OPCODE_RC_COMPARE_SWAP) {
+		if (value == atmeth_comp(pkt))
+			value = atmeth_swap_add(pkt);
+	} else {
+		value += atmeth_swap_add(pkt);
+	}
 
-		/* check vaddr is 8 bytes aligned. */
-		if (!vaddr || (uintptr_t)vaddr & 7) {
-			ret = RESPST_ERR_MISALIGNED_ATOMIC;
-			goto out;
-		}
+	*vaddr = value;
+	spin_unlock(&atomic_ops_lock);
 
-		spin_lock_bh(&atomic_ops_lock);
-		res->atomic.orig_val = value = *vaddr;
+	qp->resp.msn++;
 
-		if (pkt->opcode == IB_OPCODE_RC_COMPARE_SWAP) {
-			if (value == atmeth_comp(pkt))
-				value = atmeth_swap_add(pkt);
-		} else {
-			value += atmeth_swap_add(pkt);
-		}
+	/* next expected psn, read handles this separately */
+	qp->resp.psn = (pkt->psn + 1) & BTH_PSN_MASK;
+	qp->resp.ack_psn = qp->resp.psn;
 
-		*vaddr = value;
-		spin_unlock_bh(&atomic_ops_lock);
+	qp->resp.opcode = pkt->opcode;
+	qp->resp.status = IB_WC_SUCCESS;
 
-		qp->resp.msn++;
+	ret = RESPST_ACKNOWLEDGE;
+out:
+	return ret;
+}
 
-		/* next expected psn, read handles this separately */
-		qp->resp.psn = (pkt->psn + 1) & BTH_PSN_MASK;
-		qp->resp.ack_psn = qp->resp.psn;
+static  enum resp_states rxe_atomic_ops(struct rxe_qp *qp,
+					struct rxe_pkt_info *pkt,
+					struct rxe_mr *mr)
+{
+	u64 *vaddr;
+	int ret;
 
-		qp->resp.opcode = pkt->opcode;
-		qp->resp.status = IB_WC_SUCCESS;
+	vaddr = iova_to_vaddr(mr, qp->resp.va + qp->resp.offset,
+			      sizeof(u64));
+
+	if (pkt->mask & RXE_ATOMIC_MASK) {
+		ret = rxe_process_atomic(qp, pkt, vaddr);
+	} else {
+		/*ATOMIC WRITE operation will come here. */
+		ret = RESPST_ERR_UNSUPPORTED_OPCODE;
 	}
 
-	ret = RESPST_ACKNOWLEDGE;
-out:
+	return ret;
+}
+
+static enum resp_states rxe_atomic_reply(struct rxe_qp *qp,
+					 struct rxe_pkt_info *pkt)
+{
+	struct rxe_mr *mr = qp->resp.mr;
+	struct resp_res *res = qp->resp.res;
+	int ret;
+
+	if (!res) {
+		res = rxe_prepare_res(qp, pkt, RXE_ATOMIC_MASK);
+		qp->resp.res = res;
+	}
+
+	if (!res->replay) {
+		if (mr->state != RXE_MR_STATE_VALID)
+			return RESPST_ERR_RKEY_VIOLATION;
+
+		ret = rxe_atomic_ops(qp, pkt, mr);
+	} else
+		ret = RESPST_ACKNOWLEDGE;
+
 	return ret;
 }
 
@@ -1327,7 +1353,7 @@ int rxe_responder(void *arg)
 			state = read_reply(qp, pkt);
 			break;
 		case RESPST_ATOMIC_REPLY:
-			state = atomic_reply(qp, pkt);
+			state = rxe_atomic_reply(qp, pkt);
 			break;
 		case RESPST_ACKNOWLEDGE:
 			state = acknowledge(qp, pkt);
-- 
2.31.1



* [RFC PATCH 4/7] RDMA/rxe: Add page invalidation support
  2022-09-07  2:42 [RFC PATCH 0/7] RDMA/rxe: On-Demand Paging on SoftRoCE Daisuke Matsuda
                   ` (2 preceding siblings ...)
  2022-09-07  2:43 ` [RFC PATCH 3/7] RDMA/rxe: Cleanup code for responder Atomic operations Daisuke Matsuda
@ 2022-09-07  2:43 ` Daisuke Matsuda
  2022-09-07  2:43 ` [RFC PATCH 5/7] RDMA/rxe: Allow registering MRs for On-Demand Paging Daisuke Matsuda
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 25+ messages in thread
From: Daisuke Matsuda @ 2022-09-07  2:43 UTC (permalink / raw)
  To: linux-rdma, leonro, jgg, zyjzyj2000
  Cc: nvdimm, linux-kernel, rpearsonhpe, yangx.jy, lizhijian, y-goto,
	Daisuke Matsuda

On page invalidation, an MMU notifier callback is invoked to unmap DMA
addresses and update umem_odp->dma_list. The callback is registered when an
ODP-enabled MR is created.

Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
---
 drivers/infiniband/sw/rxe/Makefile  |  3 ++-
 drivers/infiniband/sw/rxe/rxe_odp.c | 34 +++++++++++++++++++++++++++++
 2 files changed, 36 insertions(+), 1 deletion(-)
 create mode 100644 drivers/infiniband/sw/rxe/rxe_odp.c

diff --git a/drivers/infiniband/sw/rxe/Makefile b/drivers/infiniband/sw/rxe/Makefile
index 358f6b06aa64..924f4acb2816 100644
--- a/drivers/infiniband/sw/rxe/Makefile
+++ b/drivers/infiniband/sw/rxe/Makefile
@@ -22,4 +22,5 @@ rdma_rxe-y := \
 	rxe_mcast.o \
 	rxe_wq.o \
 	rxe_net.o \
-	rxe_hw_counters.o
+	rxe_hw_counters.o \
+	rxe_odp.o
diff --git a/drivers/infiniband/sw/rxe/rxe_odp.c b/drivers/infiniband/sw/rxe/rxe_odp.c
new file mode 100644
index 000000000000..0f702787a66e
--- /dev/null
+++ b/drivers/infiniband/sw/rxe/rxe_odp.c
@@ -0,0 +1,34 @@
+// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
+/*
+ * Copyright (c) 2022 Fujitsu Ltd. All rights reserved.
+ */
+
+#include <rdma/ib_umem_odp.h>
+
+bool rxe_ib_invalidate_range(struct mmu_interval_notifier *mni,
+			     const struct mmu_notifier_range *range,
+			     unsigned long cur_seq)
+{
+	struct ib_umem_odp *umem_odp =
+		container_of(mni, struct ib_umem_odp, notifier);
+	unsigned long start;
+	unsigned long end;
+
+	if (!mmu_notifier_range_blockable(range))
+		return false;
+
+	mutex_lock(&umem_odp->umem_mutex);
+	mmu_interval_set_seq(mni, cur_seq);
+
+	start = max_t(u64, ib_umem_start(umem_odp), range->start);
+	end = min_t(u64, ib_umem_end(umem_odp), range->end);
+
+	ib_umem_odp_unmap_dma_pages(umem_odp, start, end);
+
+	mutex_unlock(&umem_odp->umem_mutex);
+	return true;
+}
+
+const struct mmu_interval_notifier_ops rxe_mn_ops = {
+	.invalidate = rxe_ib_invalidate_range,
+};
-- 
2.31.1



* [RFC PATCH 5/7] RDMA/rxe: Allow registering MRs for On-Demand Paging
  2022-09-07  2:42 [RFC PATCH 0/7] RDMA/rxe: On-Demand Paging on SoftRoCE Daisuke Matsuda
                   ` (3 preceding siblings ...)
  2022-09-07  2:43 ` [RFC PATCH 4/7] RDMA/rxe: Add page invalidation support Daisuke Matsuda
@ 2022-09-07  2:43 ` Daisuke Matsuda
  2022-09-08 16:57   ` Haris Iqbal
  2022-09-07  2:43 ` [RFC PATCH 6/7] RDMA/rxe: Add support for Send/Recv/Write/Read operations with ODP Daisuke Matsuda
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 25+ messages in thread
From: Daisuke Matsuda @ 2022-09-07  2:43 UTC (permalink / raw)
  To: linux-rdma, leonro, jgg, zyjzyj2000
  Cc: nvdimm, linux-kernel, rpearsonhpe, yangx.jy, lizhijian, y-goto,
	Daisuke Matsuda

Allow applications to register an ODP-enabled MR, in which case the flag
IB_ACCESS_ON_DEMAND is passed to rxe_reg_user_mr(). However, no RDMA
operations are supported yet; they will be enabled in the two subsequent
patches.

rxe_odp_do_pagefault() is called to initialize an ODP-enabled MR here. It
syncs the process address space from the CPU page table to the driver page
table (dma_list/pfn_list in umem_odp) when called with the
RXE_PAGEFAULT_SNAPSHOT flag. Additionally, it can be used to trigger page
faults when the pages being accessed are not present or do not have proper
read/write permissions, and possibly to prefetch pages in the future.

Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
---
 drivers/infiniband/sw/rxe/rxe.c       |  7 +++
 drivers/infiniband/sw/rxe/rxe_loc.h   |  5 ++
 drivers/infiniband/sw/rxe/rxe_mr.c    |  7 ++-
 drivers/infiniband/sw/rxe/rxe_odp.c   | 80 +++++++++++++++++++++++++++
 drivers/infiniband/sw/rxe/rxe_resp.c  | 21 +++++--
 drivers/infiniband/sw/rxe/rxe_verbs.c |  8 ++-
 drivers/infiniband/sw/rxe/rxe_verbs.h |  2 +
 7 files changed, 121 insertions(+), 9 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
index 51daac5c4feb..0719f451253c 100644
--- a/drivers/infiniband/sw/rxe/rxe.c
+++ b/drivers/infiniband/sw/rxe/rxe.c
@@ -73,6 +73,13 @@ static void rxe_init_device_param(struct rxe_dev *rxe)
 			rxe->ndev->dev_addr);
 
 	rxe->max_ucontext			= RXE_MAX_UCONTEXT;
+
+	if (IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)) {
+		rxe->attr.kernel_cap_flags |= IBK_ON_DEMAND_PAGING;
+
+		/* IB_ODP_SUPPORT_IMPLICIT is not supported right now. */
+		rxe->attr.odp_caps.general_caps |= IB_ODP_SUPPORT;
+	}
 }
 
 /* initialize port attributes */
diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
index 0f8cb9e38cc9..03b4078b90a3 100644
--- a/drivers/infiniband/sw/rxe/rxe_loc.h
+++ b/drivers/infiniband/sw/rxe/rxe_loc.h
@@ -64,6 +64,7 @@ int rxe_mmap(struct ib_ucontext *context, struct vm_area_struct *vma);
 
 /* rxe_mr.c */
 u8 rxe_get_next_key(u32 last_key);
+void rxe_mr_init(int access, struct rxe_mr *mr);
 void rxe_mr_init_dma(struct rxe_pd *pd, int access, struct rxe_mr *mr);
 int rxe_mr_init_user(struct rxe_pd *pd, u64 start, u64 length, u64 iova,
 		     int access, struct rxe_mr *mr);
@@ -188,4 +189,8 @@ static inline unsigned int wr_opcode_mask(int opcode, struct rxe_qp *qp)
 	return rxe_wr_opcode_info[opcode].mask[qp->ibqp.qp_type];
 }
 
+/* rxe_odp.c */
+int rxe_create_user_odp_mr(struct ib_pd *pd, u64 start, u64 length, u64 iova,
+			   int access_flags, struct rxe_mr *mr);
+
 #endif /* RXE_LOC_H */
diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index 814116ec4778..0ae72a4516be 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -48,7 +48,7 @@ int mr_check_range(struct rxe_mr *mr, u64 iova, size_t length)
 				| IB_ACCESS_REMOTE_WRITE	\
 				| IB_ACCESS_REMOTE_ATOMIC)
 
-static void rxe_mr_init(int access, struct rxe_mr *mr)
+void rxe_mr_init(int access, struct rxe_mr *mr)
 {
 	u32 lkey = mr->elem.index << 8 | rxe_get_next_key(-1);
 	u32 rkey = (access & IB_ACCESS_REMOTE) ? lkey : 0;
@@ -438,7 +438,10 @@ int copy_data(
 		if (bytes > 0) {
 			iova = sge->addr + offset;
 
-			err = rxe_mr_copy(mr, iova, addr, bytes, dir);
+			if (mr->odp_enabled)
+				err = -EOPNOTSUPP;
+			else
+				err = rxe_mr_copy(mr, iova, addr, bytes, dir);
 			if (err)
 				goto err2;
 
diff --git a/drivers/infiniband/sw/rxe/rxe_odp.c b/drivers/infiniband/sw/rxe/rxe_odp.c
index 0f702787a66e..1f6930ba714c 100644
--- a/drivers/infiniband/sw/rxe/rxe_odp.c
+++ b/drivers/infiniband/sw/rxe/rxe_odp.c
@@ -5,6 +5,8 @@
 
 #include <rdma/ib_umem_odp.h>
 
+#include "rxe.h"
+
 bool rxe_ib_invalidate_range(struct mmu_interval_notifier *mni,
 			     const struct mmu_notifier_range *range,
 			     unsigned long cur_seq)
@@ -32,3 +34,81 @@ bool rxe_ib_invalidate_range(struct mmu_interval_notifier *mni,
 const struct mmu_interval_notifier_ops rxe_mn_ops = {
 	.invalidate = rxe_ib_invalidate_range,
 };
+
+#define RXE_PAGEFAULT_RDONLY BIT(1)
+#define RXE_PAGEFAULT_SNAPSHOT BIT(2)
+static int rxe_odp_do_pagefault(struct rxe_mr *mr, u64 user_va, int bcnt, u32 flags)
+{
+	int np;
+	u64 access_mask;
+	bool fault = !(flags & RXE_PAGEFAULT_SNAPSHOT);
+	struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
+
+	access_mask = ODP_READ_ALLOWED_BIT;
+	if (umem_odp->umem.writable && !(flags & RXE_PAGEFAULT_RDONLY))
+		access_mask |= ODP_WRITE_ALLOWED_BIT;
+
+	/*
+	 * umem mutex is held after return from ib_umem_odp_map_dma_and_lock().
+	 * Release it when access to user MR is done or not required.
+	 */
+	np = ib_umem_odp_map_dma_and_lock(umem_odp, user_va, bcnt,
+					  access_mask, fault);
+
+	return np;
+}
+
+static int rxe_init_odp_mr(struct rxe_mr *mr)
+{
+	int ret;
+	struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
+
+	ret = rxe_odp_do_pagefault(mr, mr->umem->address, mr->umem->length,
+				   RXE_PAGEFAULT_SNAPSHOT);
+	mutex_unlock(&umem_odp->umem_mutex);
+
+	return ret >= 0 ? 0 : ret;
+}
+
+int rxe_create_user_odp_mr(struct ib_pd *pd, u64 start, u64 length, u64 iova,
+			   int access_flags, struct rxe_mr *mr)
+{
+	int err;
+	struct ib_umem_odp *umem_odp;
+	struct rxe_dev *dev = container_of(pd->device, struct rxe_dev, ib_dev);
+
+	if (!IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING))
+		return -EOPNOTSUPP;
+
+	rxe_mr_init(access_flags, mr);
+
+	if (!start && length == U64_MAX) {
+		if (iova != 0)
+			return -EINVAL;
+		if (!(dev->attr.odp_caps.general_caps & IB_ODP_SUPPORT_IMPLICIT))
+			return -EINVAL;
+
+		/* Never reach here, for implicit ODP is not implemented. */
+	}
+
+	umem_odp = ib_umem_odp_get(pd->device, start, length, access_flags,
+				   &rxe_mn_ops);
+	if (IS_ERR(umem_odp))
+		return PTR_ERR(umem_odp);
+
+	umem_odp->private = mr;
+
+	mr->odp_enabled = true;
+	mr->ibmr.pd = pd;
+	mr->umem = &umem_odp->umem;
+	mr->access = access_flags;
+	mr->length = length;
+	mr->iova = iova;
+	mr->offset = ib_umem_offset(&umem_odp->umem);
+	mr->state = RXE_MR_STATE_VALID;
+	mr->type = IB_MR_TYPE_USER;
+
+	err = rxe_init_odp_mr(mr);
+
+	return err;
+}
diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
index cadc8fa64dd0..dd8632e783f6 100644
--- a/drivers/infiniband/sw/rxe/rxe_resp.c
+++ b/drivers/infiniband/sw/rxe/rxe_resp.c
@@ -535,8 +535,12 @@ static enum resp_states write_data_in(struct rxe_qp *qp,
 	int	err;
 	int data_len = payload_size(pkt);
 
-	err = rxe_mr_copy(qp->resp.mr, qp->resp.va + qp->resp.offset,
-			  payload_addr(pkt), data_len, RXE_TO_MR_OBJ);
+	if (qp->resp.mr->odp_enabled)
+		err = -EOPNOTSUPP;
+	else
+		err = rxe_mr_copy(qp->resp.mr, qp->resp.va + qp->resp.offset,
+				  payload_addr(pkt), data_len, RXE_TO_MR_OBJ);
+
 	if (err) {
 		rc = RESPST_ERR_RKEY_VIOLATION;
 		goto out;
@@ -667,7 +671,10 @@ static enum resp_states rxe_atomic_reply(struct rxe_qp *qp,
 		if (mr->state != RXE_MR_STATE_VALID)
 			return RESPST_ERR_RKEY_VIOLATION;
 
-		ret = rxe_atomic_ops(qp, pkt, mr);
+		if (mr->odp_enabled)
+			ret = RESPST_ERR_UNSUPPORTED_OPCODE;
+		else
+			ret = rxe_atomic_ops(qp, pkt, mr);
 	} else
 		ret = RESPST_ACKNOWLEDGE;
 
@@ -831,8 +838,12 @@ static enum resp_states read_reply(struct rxe_qp *qp,
 	if (!skb)
 		return RESPST_ERR_RNR;
 
-	err = rxe_mr_copy(mr, res->read.va, payload_addr(&ack_pkt),
-			  payload, RXE_FROM_MR_OBJ);
+	if (mr->odp_enabled)
+		err = -EOPNOTSUPP;
+	else
+		err = rxe_mr_copy(mr, res->read.va, payload_addr(&ack_pkt),
+				  payload, RXE_FROM_MR_OBJ);
+
 	if (err)
 		pr_err("Failed copying memory\n");
 	if (mr)
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.c b/drivers/infiniband/sw/rxe/rxe_verbs.c
index 7510f25c5ea3..b00e9b847382 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.c
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.c
@@ -926,10 +926,14 @@ static struct ib_mr *rxe_reg_user_mr(struct ib_pd *ibpd,
 		goto err2;
 	}
 
-
 	rxe_get(pd);
 
-	err = rxe_mr_init_user(pd, start, length, iova, access, mr);
+	if (access & IB_ACCESS_ON_DEMAND)
+		err = rxe_create_user_odp_mr(&pd->ibpd, start, length, iova,
+					     access, mr);
+	else
+		err = rxe_mr_init_user(pd, start, length, iova, access, mr);
+
 	if (err)
 		goto err3;
 
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
index b09b4cb9897a..98d2bb737ebc 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.h
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
@@ -324,6 +324,8 @@ struct rxe_mr {
 	atomic_t		num_mw;
 
 	struct rxe_map		**map;
+
+	bool		        odp_enabled;
 };
 
 enum rxe_mw_state {
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [RFC PATCH 6/7] RDMA/rxe: Add support for Send/Recv/Write/Read operations with ODP
  2022-09-07  2:42 [RFC PATCH 0/7] RDMA/rxe: On-Demand Paging on SoftRoCE Daisuke Matsuda
                   ` (4 preceding siblings ...)
  2022-09-07  2:43 ` [RFC PATCH 5/7] RDMA/rxe: Allow registering MRs for On-Demand Paging Daisuke Matsuda
@ 2022-09-07  2:43 ` Daisuke Matsuda
  2022-09-08  8:29   ` Leon Romanovsky
  2022-09-07  2:43 ` [RFC PATCH 7/7] RDMA/rxe: Add support for the traditional Atomic " Daisuke Matsuda
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 25+ messages in thread
From: Daisuke Matsuda @ 2022-09-07  2:43 UTC (permalink / raw)
  To: linux-rdma, leonro, jgg, zyjzyj2000
  Cc: nvdimm, linux-kernel, rpearsonhpe, yangx.jy, lizhijian, y-goto,
	Daisuke Matsuda

rxe_mr_copy() is used widely to copy data to/from a user MR. The requester
uses it to load payloads of requesting packets; the responder uses it to
process Send, Write, and Read operations; the completer uses it to copy
data from response packets of Read and Atomic operations to a user MR.

Allow these operations to be used with ODP by adding a counterpart function
rxe_odp_mr_copy(). It consists of the following steps:
 1. Check the driver page table (umem_odp->dma_list) to see if pages being
    accessed are present with appropriate permission.
 2. If necessary, trigger page fault to map the pages.
 3. Convert their user space addresses to kernel logical addresses using
    PFNs in the driver page table (umem_odp->pfn_list).
 4. Execute data copy to/from the pages.

umem_mutex is used to ensure that dma_list (an array of addresses of an MR)
is not changed while it is checked and that mapped pages are not
invalidated before data copy completes.
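
To make steps 1 and 3 above concrete, this is the address arithmetic used in
rxe_odp_mr_copy() below, shown with made-up example values (4 KiB pages
assumed):

	idx    = (iova - ib_umem_start(umem_odp)) >> umem_odp->page_shift;
	offset = iova & (BIT(umem_odp->page_shift) - 1);

	/* e.g. ib_umem_start() == 0x7f0000000000, iova == 0x7f0000003010
	 * with page_shift == 12 gives idx == 3 and offset == 0x10, so the
	 * copy starts at
	 *   page_address(hmm_pfn_to_page(umem_odp->pfn_list[3])) + 0x10
	 * and walks page by page, resetting offset to 0 after the first one.
	 */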

Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
---
 drivers/infiniband/sw/rxe/rxe.c      |  10 ++
 drivers/infiniband/sw/rxe/rxe_loc.h  |   2 +
 drivers/infiniband/sw/rxe/rxe_mr.c   |   2 +-
 drivers/infiniband/sw/rxe/rxe_odp.c  | 173 +++++++++++++++++++++++++++
 drivers/infiniband/sw/rxe/rxe_resp.c |   6 +-
 5 files changed, 190 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
index 0719f451253c..dd287fc60e9d 100644
--- a/drivers/infiniband/sw/rxe/rxe.c
+++ b/drivers/infiniband/sw/rxe/rxe.c
@@ -79,6 +79,16 @@ static void rxe_init_device_param(struct rxe_dev *rxe)
 
 		/* IB_ODP_SUPPORT_IMPLICIT is not supported right now. */
 		rxe->attr.odp_caps.general_caps |= IB_ODP_SUPPORT;
+
+		rxe->attr.odp_caps.per_transport_caps.ud_odp_caps |= IB_ODP_SUPPORT_SEND;
+		rxe->attr.odp_caps.per_transport_caps.ud_odp_caps |= IB_ODP_SUPPORT_RECV;
+		rxe->attr.odp_caps.per_transport_caps.ud_odp_caps |= IB_ODP_SUPPORT_SRQ_RECV;
+
+		rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_SEND;
+		rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_RECV;
+		rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_WRITE;
+		rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_READ;
+		rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_SRQ_RECV;
 	}
 }
 
diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
index 03b4078b90a3..91982b5a690c 100644
--- a/drivers/infiniband/sw/rxe/rxe_loc.h
+++ b/drivers/infiniband/sw/rxe/rxe_loc.h
@@ -192,5 +192,7 @@ static inline unsigned int wr_opcode_mask(int opcode, struct rxe_qp *qp)
 /* rxe_odp.c */
 int rxe_create_user_odp_mr(struct ib_pd *pd, u64 start, u64 length, u64 iova,
 			   int access_flags, struct rxe_mr *mr);
+int rxe_odp_mr_copy(struct rxe_mr *mr, u64 iova, void *addr, int length,
+		    enum rxe_mr_copy_dir dir);
 
 #endif /* RXE_LOC_H */
diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index 0ae72a4516be..2091e865dd8f 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -439,7 +439,7 @@ int copy_data(
 			iova = sge->addr + offset;
 
 			if (mr->odp_enabled)
-				err = -EOPNOTSUPP;
+				err = rxe_odp_mr_copy(mr, iova, addr, bytes, dir);
 			else
 				err = rxe_mr_copy(mr, iova, addr, bytes, dir);
 			if (err)
diff --git a/drivers/infiniband/sw/rxe/rxe_odp.c b/drivers/infiniband/sw/rxe/rxe_odp.c
index 1f6930ba714c..85c34995c704 100644
--- a/drivers/infiniband/sw/rxe/rxe_odp.c
+++ b/drivers/infiniband/sw/rxe/rxe_odp.c
@@ -3,6 +3,8 @@
  * Copyright (c) 2022 Fujitsu Ltd. All rights reserved.
  */
 
+#include <linux/hmm.h>
+
 #include <rdma/ib_umem_odp.h>
 
 #include "rxe.h"
@@ -112,3 +114,174 @@ int rxe_create_user_odp_mr(struct ib_pd *pd, u64 start, u64 length, u64 iova,
 
 	return err;
 }
+
+static inline bool rxe_is_pagefault_neccesary(struct ib_umem_odp *umem_odp,
+					      u64 iova, int length, u32 perm)
+{
+	int idx;
+	u64 addr;
+	bool need_fault = false;
+
+	addr = iova & (~(BIT(umem_odp->page_shift) - 1));
+
+	/* Skim through all pages that are to be accessed. */
+	while (addr < iova + length) {
+		idx = (addr - ib_umem_start(umem_odp)) >> umem_odp->page_shift;
+
+		if (!(umem_odp->dma_list[idx] & perm)) {
+			need_fault = true;
+			break;
+		}
+
+		addr += BIT(umem_odp->page_shift);
+	}
+	return need_fault;
+}
+
+/* umem mutex is always locked when returning from this function. */
+static int rxe_odp_map_range(struct rxe_mr *mr, u64 iova, int length, u32 flags)
+{
+	struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
+	const int max_tries = 3;
+	int cnt = 0;
+
+	int err;
+	u64 perm;
+	bool need_fault;
+
+	if (unlikely(length < 1))
+		return -EINVAL;
+
+	perm = ODP_READ_ALLOWED_BIT;
+	if (!(flags & RXE_PAGEFAULT_RDONLY))
+		perm |= ODP_WRITE_ALLOWED_BIT;
+
+	mutex_lock(&umem_odp->umem_mutex);
+
+	/*
+	 * A successful return from rxe_odp_do_pagefault() does not guarantee
+	 * that all pages in the range became present. Recheck the DMA address
+	 * array, allowing max 3 tries for pagefault.
+	 */
+	while ((need_fault = rxe_is_pagefault_neccesary(umem_odp,
+							iova, length, perm))) {
+
+		if (cnt >= max_tries)
+			break;
+
+		mutex_unlock(&umem_odp->umem_mutex);
+
+		/* rxe_odp_do_pagefault() locks the umem mutex. */
+		err = rxe_odp_do_pagefault(mr, iova, length, flags);
+		if (err < 0)
+			return err;
+
+		cnt++;
+	}
+
+	if (need_fault)
+		return -EFAULT;
+
+	return 0;
+}
+
+static inline void *rxe_odp_get_virt(struct ib_umem_odp *umem_odp, int umem_idx,
+				     size_t offset)
+{
+	struct page *page;
+	void *virt;
+
+	/*
+	 * Step 1. Get page struct from the pfn array.
+	 * Step 2. Convert page struct to kernel logical address.
+	 * Step 3. Add offset in the page to the address.
+	 */
+	page = hmm_pfn_to_page(umem_odp->pfn_list[umem_idx]);
+	virt = page_address(page);
+
+	if (!virt)
+		return NULL;
+
+	virt += offset;
+
+	return virt;
+}
+
+static int __rxe_odp_mr_copy(struct rxe_mr *mr, u64 iova, void *addr,
+			     int length, enum rxe_mr_copy_dir dir)
+{
+	struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
+
+	int idx, bytes;
+	u8 *user_va;
+	size_t offset;
+
+	idx = (iova - ib_umem_start(umem_odp)) >> umem_odp->page_shift;
+	offset = iova & (BIT(umem_odp->page_shift) - 1);
+
+	while (length > 0) {
+		u8 *src, *dest;
+
+		user_va = (u8 *)rxe_odp_get_virt(umem_odp, idx, offset);
+		if (!user_va)
+			return -EFAULT;
+
+		src = (dir == RXE_TO_MR_OBJ) ? addr : user_va;
+		dest = (dir == RXE_TO_MR_OBJ) ? user_va : addr;
+
+		bytes = BIT(umem_odp->page_shift) - offset;
+
+		if (bytes > length)
+			bytes = length;
+
+		memcpy(dest, src, bytes);
+
+		length  -= bytes;
+		idx++;
+		offset = 0;
+	}
+
+	/* The mutex was locked in rxe_odp_map_range().
+	 * Now it is safe to invalidate the MR, so unlock it
+	 */
+	mutex_unlock(&umem_odp->umem_mutex);
+
+	return 0;
+}
+
+int rxe_odp_mr_copy(struct rxe_mr *mr, u64 iova, void *addr, int length,
+		    enum rxe_mr_copy_dir dir)
+{
+	struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
+	u32 flags = 0;
+
+	int err;
+
+	if (length == 0)
+		return 0;
+
+	WARN_ON_ONCE(!mr->odp_enabled);
+
+	switch (dir) {
+	case RXE_TO_MR_OBJ:
+		break;
+
+	case RXE_FROM_MR_OBJ:
+		flags = RXE_PAGEFAULT_RDONLY;
+		break;
+
+	default:
+		return -EINVAL;
+	}
+
+	/* umem mutex is locked here to prevent MR invalidation before data copy
+	 * completes; On success, it is unlocked in __rxe_odp_mr_copy()
+	 */
+	err = rxe_odp_map_range(mr, iova, length, flags);
+	if (err) {
+		mutex_unlock(&umem_odp->umem_mutex);
+		return err;
+	}
+
+	return __rxe_odp_mr_copy(mr, iova, addr, length, dir);
+}
diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
index dd8632e783f6..bf439004c378 100644
--- a/drivers/infiniband/sw/rxe/rxe_resp.c
+++ b/drivers/infiniband/sw/rxe/rxe_resp.c
@@ -536,7 +536,8 @@ static enum resp_states write_data_in(struct rxe_qp *qp,
 	int data_len = payload_size(pkt);
 
 	if (qp->resp.mr->odp_enabled)
-		err = -EOPNOTSUPP;
+		err = rxe_odp_mr_copy(qp->resp.mr, qp->resp.va + qp->resp.offset,
+				      payload_addr(pkt), data_len, RXE_TO_MR_OBJ);
 	else
 		err = rxe_mr_copy(qp->resp.mr, qp->resp.va + qp->resp.offset,
 				  payload_addr(pkt), data_len, RXE_TO_MR_OBJ);
@@ -839,7 +840,8 @@ static enum resp_states read_reply(struct rxe_qp *qp,
 		return RESPST_ERR_RNR;
 
 	if (mr->odp_enabled)
-		err = -EOPNOTSUPP;
+		err = rxe_odp_mr_copy(mr, res->read.va, payload_addr(&ack_pkt),
+				      payload, RXE_FROM_MR_OBJ);
 	else
 		err = rxe_mr_copy(mr, res->read.va, payload_addr(&ack_pkt),
 				  payload, RXE_FROM_MR_OBJ);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [RFC PATCH 7/7] RDMA/rxe: Add support for the traditional Atomic operations with ODP
  2022-09-07  2:42 [RFC PATCH 0/7] RDMA/rxe: On-Demand Paging on SoftRoCE Daisuke Matsuda
                   ` (5 preceding siblings ...)
  2022-09-07  2:43 ` [RFC PATCH 6/7] RDMA/rxe: Add support for Send/Recv/Write/Read operations with ODP Daisuke Matsuda
@ 2022-09-07  2:43 ` Daisuke Matsuda
  2022-09-08  8:40 ` [RFC PATCH 0/7] RDMA/rxe: On-Demand Paging on SoftRoCE Zhu Yanjun
  2022-09-09  3:07 ` Li Zhijian
  8 siblings, 0 replies; 25+ messages in thread
From: Daisuke Matsuda @ 2022-09-07  2:43 UTC (permalink / raw)
  To: linux-rdma, leonro, jgg, zyjzyj2000
  Cc: nvdimm, linux-kernel, rpearsonhpe, yangx.jy, lizhijian, y-goto,
	Daisuke Matsuda

Enable 'fetch and add' and 'compare and swap' operations to manipulate
data in an ODP-enabled MR. This consists of the following steps:
 1. Check the driver page table (umem_odp->dma_list) to see if the target
    page is both readable and writable.
 2. If not, then trigger page fault to map the page.
 3. Convert its user space address to a kernel logical address using PFNs
    in the driver page table (umem_odp->pfn_list).
 4. Execute the operation.

umem_mutex is used to ensure that dma_list (an array of addresses of an MR)
is not changed while it is checked and that the target page is not
invalidated before data access completes.
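
For illustration (simplified; error handling is omitted, and the in-tree
rxe_process_atomic() additionally takes a spinlock and rejects misaligned
addresses), the operation performed on the kernel address obtained in step 3
is roughly:

	u64 *vaddr = rxe_odp_get_virt_atomic(qp, mr);	/* umem_mutex held */
	u64 orig = *vaddr;

	if (pkt->opcode == IB_OPCODE_RC_COMPARE_SWAP) {
		if (orig == atmeth_comp(pkt))
			*vaddr = atmeth_swap_add(pkt);
	} else {
		/* fetch and add */
		*vaddr = orig + atmeth_swap_add(pkt);
	}
	/* 'orig' is returned to the requester in the Atomic ACK. */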

Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
---
 drivers/infiniband/sw/rxe/rxe.c      |  1 +
 drivers/infiniband/sw/rxe/rxe_loc.h  |  2 ++
 drivers/infiniband/sw/rxe/rxe_odp.c  | 42 ++++++++++++++++++++++++++
 drivers/infiniband/sw/rxe/rxe_resp.c | 38 ++----------------------
 drivers/infiniband/sw/rxe/rxe_resp.h | 44 ++++++++++++++++++++++++++++
 5 files changed, 91 insertions(+), 36 deletions(-)
 create mode 100644 drivers/infiniband/sw/rxe/rxe_resp.h

diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
index dd287fc60e9d..8190af3e9afe 100644
--- a/drivers/infiniband/sw/rxe/rxe.c
+++ b/drivers/infiniband/sw/rxe/rxe.c
@@ -88,6 +88,7 @@ static void rxe_init_device_param(struct rxe_dev *rxe)
 		rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_RECV;
 		rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_WRITE;
 		rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_READ;
+		rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_ATOMIC;
 		rxe->attr.odp_caps.per_transport_caps.rc_odp_caps |= IB_ODP_SUPPORT_SRQ_RECV;
 	}
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
index 91982b5a690c..1d10a58bbd5b 100644
--- a/drivers/infiniband/sw/rxe/rxe_loc.h
+++ b/drivers/infiniband/sw/rxe/rxe_loc.h
@@ -194,5 +194,7 @@ int rxe_create_user_odp_mr(struct ib_pd *pd, u64 start, u64 length, u64 iova,
 			   int access_flags, struct rxe_mr *mr);
 int rxe_odp_mr_copy(struct rxe_mr *mr, u64 iova, void *addr, int length,
 		    enum rxe_mr_copy_dir dir);
+enum resp_states rxe_odp_atomic_ops(struct rxe_qp *qp, struct rxe_pkt_info *pkt,
+				    struct rxe_mr *mr);
 
 #endif /* RXE_LOC_H */
diff --git a/drivers/infiniband/sw/rxe/rxe_odp.c b/drivers/infiniband/sw/rxe/rxe_odp.c
index 85c34995c704..3297d124c90e 100644
--- a/drivers/infiniband/sw/rxe/rxe_odp.c
+++ b/drivers/infiniband/sw/rxe/rxe_odp.c
@@ -8,6 +8,7 @@
 #include <rdma/ib_umem_odp.h>
 
 #include "rxe.h"
+#include "rxe_resp.h"
 
 bool rxe_ib_invalidate_range(struct mmu_interval_notifier *mni,
 			     const struct mmu_notifier_range *range,
@@ -285,3 +286,44 @@ int rxe_odp_mr_copy(struct rxe_mr *mr, u64 iova, void *addr, int length,
 
 	return __rxe_odp_mr_copy(mr, iova, addr, length, dir);
 }
+
+static inline void *rxe_odp_get_virt_atomic(struct rxe_qp *qp, struct rxe_mr *mr)
+{
+	struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
+	u64 iova = qp->resp.va + qp->resp.offset;
+	int idx;
+	size_t offset;
+
+	if (rxe_odp_map_range(mr, iova, sizeof(char), 0))
+		return NULL;
+
+	idx = (iova - ib_umem_start(umem_odp)) >> umem_odp->page_shift;
+	offset = iova & (BIT(umem_odp->page_shift) - 1);
+
+	return rxe_odp_get_virt(umem_odp, idx, offset);
+}
+
+enum resp_states rxe_odp_atomic_ops(struct rxe_qp *qp, struct rxe_pkt_info *pkt,
+				    struct rxe_mr *mr)
+{
+	struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
+	u64 *vaddr;
+	int ret;
+
+	if (WARN_ON(!mr->odp_enabled))
+		return RESPST_ERR_RKEY_VIOLATION;
+
+	/* umem mutex is locked here to prevent MR invalidation before memory
+	 * access completes.
+	 */
+	vaddr = (u64 *)rxe_odp_get_virt_atomic(qp, mr);
+
+	if (pkt->mask & RXE_ATOMIC_MASK)
+		ret = rxe_process_atomic(qp, pkt, vaddr);
+	else
+		/* ATOMIC WRITE operation will come here. */
+		ret = RESPST_ERR_RKEY_VIOLATION;
+
+	mutex_unlock(&umem_odp->umem_mutex);
+	return ret;
+}
diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
index bf439004c378..663e5b32c9cb 100644
--- a/drivers/infiniband/sw/rxe/rxe_resp.c
+++ b/drivers/infiniband/sw/rxe/rxe_resp.c
@@ -9,41 +9,7 @@
 #include "rxe.h"
 #include "rxe_loc.h"
 #include "rxe_queue.h"
-
-enum resp_states {
-	RESPST_NONE,
-	RESPST_GET_REQ,
-	RESPST_CHK_PSN,
-	RESPST_CHK_OP_SEQ,
-	RESPST_CHK_OP_VALID,
-	RESPST_CHK_RESOURCE,
-	RESPST_CHK_LENGTH,
-	RESPST_CHK_RKEY,
-	RESPST_EXECUTE,
-	RESPST_READ_REPLY,
-	RESPST_ATOMIC_REPLY,
-	RESPST_COMPLETE,
-	RESPST_ACKNOWLEDGE,
-	RESPST_CLEANUP,
-	RESPST_DUPLICATE_REQUEST,
-	RESPST_ERR_MALFORMED_WQE,
-	RESPST_ERR_UNSUPPORTED_OPCODE,
-	RESPST_ERR_MISALIGNED_ATOMIC,
-	RESPST_ERR_PSN_OUT_OF_SEQ,
-	RESPST_ERR_MISSING_OPCODE_FIRST,
-	RESPST_ERR_MISSING_OPCODE_LAST_C,
-	RESPST_ERR_MISSING_OPCODE_LAST_D1E,
-	RESPST_ERR_TOO_MANY_RDMA_ATM_REQ,
-	RESPST_ERR_RNR,
-	RESPST_ERR_RKEY_VIOLATION,
-	RESPST_ERR_INVALIDATE_RKEY,
-	RESPST_ERR_LENGTH,
-	RESPST_ERR_CQ_OVERFLOW,
-	RESPST_ERROR,
-	RESPST_RESET,
-	RESPST_DONE,
-	RESPST_EXIT,
-};
+#include "rxe_resp.h"
 
 static char *resp_state_name[] = {
 	[RESPST_NONE]				= "NONE",
@@ -673,7 +639,7 @@ static enum resp_states rxe_atomic_reply(struct rxe_qp *qp,
 			return RESPST_ERR_RKEY_VIOLATION;
 
 		if (mr->odp_enabled)
-			ret = RESPST_ERR_UNSUPPORTED_OPCODE;
+			ret = rxe_odp_atomic_ops(qp, pkt, mr);
 		else
 			ret = rxe_atomic_ops(qp, pkt, mr);
 	} else
diff --git a/drivers/infiniband/sw/rxe/rxe_resp.h b/drivers/infiniband/sw/rxe/rxe_resp.h
new file mode 100644
index 000000000000..cb907b49175f
--- /dev/null
+++ b/drivers/infiniband/sw/rxe/rxe_resp.h
@@ -0,0 +1,44 @@
+/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
+
+#ifndef RXE_RESP_H
+#define RXE_RESP_H
+
+enum resp_states {
+	RESPST_NONE,
+	RESPST_GET_REQ,
+	RESPST_CHK_PSN,
+	RESPST_CHK_OP_SEQ,
+	RESPST_CHK_OP_VALID,
+	RESPST_CHK_RESOURCE,
+	RESPST_CHK_LENGTH,
+	RESPST_CHK_RKEY,
+	RESPST_EXECUTE,
+	RESPST_READ_REPLY,
+	RESPST_ATOMIC_REPLY,
+	RESPST_COMPLETE,
+	RESPST_ACKNOWLEDGE,
+	RESPST_CLEANUP,
+	RESPST_DUPLICATE_REQUEST,
+	RESPST_ERR_MALFORMED_WQE,
+	RESPST_ERR_UNSUPPORTED_OPCODE,
+	RESPST_ERR_MISALIGNED_ATOMIC,
+	RESPST_ERR_PSN_OUT_OF_SEQ,
+	RESPST_ERR_MISSING_OPCODE_FIRST,
+	RESPST_ERR_MISSING_OPCODE_LAST_C,
+	RESPST_ERR_MISSING_OPCODE_LAST_D1E,
+	RESPST_ERR_TOO_MANY_RDMA_ATM_REQ,
+	RESPST_ERR_RNR,
+	RESPST_ERR_RKEY_VIOLATION,
+	RESPST_ERR_INVALIDATE_RKEY,
+	RESPST_ERR_LENGTH,
+	RESPST_ERR_CQ_OVERFLOW,
+	RESPST_ERROR,
+	RESPST_RESET,
+	RESPST_DONE,
+	RESPST_EXIT,
+};
+
+enum resp_states rxe_process_atomic(struct rxe_qp *qp,
+				    struct rxe_pkt_info *pkt, u64 *vaddr);
+
+#endif /* RXE_RESP_H */
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 6/7] RDMA/rxe: Add support for Send/Recv/Write/Read operations with ODP
  2022-09-07  2:43 ` [RFC PATCH 6/7] RDMA/rxe: Add support for Send/Recv/Write/Read operations with ODP Daisuke Matsuda
@ 2022-09-08  8:29   ` Leon Romanovsky
  2022-09-09  2:45     ` matsuda-daisuke
  0 siblings, 1 reply; 25+ messages in thread
From: Leon Romanovsky @ 2022-09-08  8:29 UTC (permalink / raw)
  To: Daisuke Matsuda
  Cc: linux-rdma, jgg, zyjzyj2000, nvdimm, linux-kernel, rpearsonhpe,
	yangx.jy, lizhijian, y-goto

On Wed, Sep 07, 2022 at 11:43:04AM +0900, Daisuke Matsuda wrote:
> rxe_mr_copy() is used widely to copy data to/from a user MR. The requester
> uses it to load payloads of requesting packets; the responder uses it to
> process Send, Write, and Read operations; the completer uses it to copy
> data from response packets of Read and Atomic operations to a user MR.
> 
> Allow these operations to be used with ODP by adding a counterpart function
> rxe_odp_mr_copy(). It consists of the following steps:
>  1. Check the driver page table (umem_odp->dma_list) to see if pages being
>     accessed are present with appropriate permission.
>  2. If necessary, trigger page fault to map the pages.
>  3. Convert their user space addresses to kernel logical addresses using
>     PFNs in the driver page table (umem_odp->pfn_list).
>  4. Execute data copy to/from the pages.
> 
> umem_mutex is used to ensure that dma_list (an array of addresses of an MR)
> is not changed while it is checked and that mapped pages are not
> invalidated before data copy completes.
> 
> Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
> ---
>  drivers/infiniband/sw/rxe/rxe.c      |  10 ++
>  drivers/infiniband/sw/rxe/rxe_loc.h  |   2 +
>  drivers/infiniband/sw/rxe/rxe_mr.c   |   2 +-
>  drivers/infiniband/sw/rxe/rxe_odp.c  | 173 +++++++++++++++++++++++++++
>  drivers/infiniband/sw/rxe/rxe_resp.c |   6 +-
>  5 files changed, 190 insertions(+), 3 deletions(-)

<...>

> +/* umem mutex is always locked when returning from this function. */
> +static int rxe_odp_map_range(struct rxe_mr *mr, u64 iova, int length, u32 flags)
> +{
> +	struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
> +	const int max_tries = 3;
> +	int cnt = 0;
> +
> +	int err;
> +	u64 perm;
> +	bool need_fault;
> +
> +	if (unlikely(length < 1))
> +		return -EINVAL;
> +
> +	perm = ODP_READ_ALLOWED_BIT;
> +	if (!(flags & RXE_PAGEFAULT_RDONLY))
> +		perm |= ODP_WRITE_ALLOWED_BIT;
> +
> +	mutex_lock(&umem_odp->umem_mutex);
> +
> +	/*
> +	 * A successful return from rxe_odp_do_pagefault() does not guarantee
> +	 * that all pages in the range became present. Recheck the DMA address
> +	 * array, allowing max 3 tries for pagefault.
> +	 */
> +	while ((need_fault = rxe_is_pagefault_neccesary(umem_odp,
> +							iova, length, perm))) {
> +
> +		if (cnt >= max_tries)
> +			break;
> +
> +		mutex_unlock(&umem_odp->umem_mutex);
> +
> +		/* rxe_odp_do_pagefault() locks the umem mutex. */

Maybe it is correct and safe to release the lock in the middle, but it is
not clear. The whole pattern of taking a lock in one function and later
releasing it in another doesn't look right to me.
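
To spell out the pattern: on the success path of this patch the lock
travels like this, which is what makes it hard to follow:

	rxe_odp_mr_copy()
	    rxe_odp_map_range()
	        mutex_lock(&umem_odp->umem_mutex)	<-- taken here
	        (dropped and re-taken while faulting pages in)
	    __rxe_odp_mr_copy()
	        memcpy() to/from the mapped pages
	        mutex_unlock(&umem_odp->umem_mutex)	<-- released here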

Thanks

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 0/7] RDMA/rxe: On-Demand Paging on SoftRoCE
  2022-09-07  2:42 [RFC PATCH 0/7] RDMA/rxe: On-Demand Paging on SoftRoCE Daisuke Matsuda
                   ` (6 preceding siblings ...)
  2022-09-07  2:43 ` [RFC PATCH 7/7] RDMA/rxe: Add support for the traditional Atomic " Daisuke Matsuda
@ 2022-09-08  8:40 ` Zhu Yanjun
  2022-09-08 10:25   ` matsuda-daisuke
  2022-09-09  3:07 ` Li Zhijian
  8 siblings, 1 reply; 25+ messages in thread
From: Zhu Yanjun @ 2022-09-08  8:40 UTC (permalink / raw)
  To: Daisuke Matsuda
  Cc: RDMA mailing list, Leon Romanovsky, Jason Gunthorpe, nvdimm,
	LKML, Bob Pearson, Xiao Yang, Li Zhijian, y-goto

On Wed, Sep 7, 2022 at 10:44 AM Daisuke Matsuda
<matsuda-daisuke@fujitsu.com> wrote:
>
> Hi everyone,
>
> This patch series implements the On-Demand Paging feature on SoftRoCE(rxe)
> driver, which has been available only in mlx5 driver[1] so far.
>
> [Overview]
> When applications register a memory region(MR), RDMA drivers normally pin
> pages in the MR so that physical addresses are never changed during RDMA
> communication. This requires the MR to fit in physical memory and
> inevitably leads to memory pressure. On the other hand, On-Demand Paging
> (ODP) allows applications to register MRs without pinning pages. They are
> paged-in when the driver requires and paged-out when the OS reclaims. As a
> result, it is possible to register a large MR that does not fit in physical
> memory without taking up so much physical memory.
>
> [Why to add this feature?]
> We, Fujitsu, have contributed to RDMA with a view to using it with
> persistent memory. Persistent memory can host a filesystem that allows
> applications to read/write files directly without involving page cache.
> This is called FS-DAX(filesystem direct access) mode. There is a problem
> that data on DAX-enabled filesystem cannot be duplicated with software RAID
> or other hardware methods. Data replication with RDMA, which features
> high-speed connections, is the best solution for the problem.
>
> However, there is a known issue that hinders using RDMA with FS-DAX. When
> RDMA operations to a file and update of the file metadata are processed
> concurrently on the same node, illegal memory accesses can be executed,
> disregarding the updated metadata. This is because RDMA operations do not
> go through page cache but access data directly. There was an effort[2] to
> solve this problem, but it was rejected in the end. Though there is no
> general solution available, it is possible to work around the problem using
> the ODP feature that has been available only in mlx5. ODP enables drivers
> to update metadata before processing RDMA operations.
>
> We have enhanced the rxe to expedite the usage of persistent memory. Our
> contribution to rxe includes RDMA Atomic write[3] and RDMA Flush[4]. With
> them being merged to rxe along with ODP, an environment will be ready for
> developers to create and test software for RDMA with FS-DAX. There is a
> library(librpma)[5] being developed for this purpose. This environment
> can be used by anybody without any special hardware but an ordinary
> computer with a normal NIC though it is inferior to hardware
> implementations in terms of performance.
>
> [Design considerations]
> ODP has been available only in mlx5, but functions and data structures
> that can be used commonly are provided in ib_uverbs(infiniband/core). The
> interface is heavily dependent on HMM infrastructure[6], and this patchset
> use them as much as possible. While mlx5 has both Explicit and Implicit ODP
> features along with prefetch feature, this patchset implements the Explicit
> ODP feature only.
>
> As an important change, it is necessary to convert triple tasklets
> (requester, responder and completer) to workqueues because they must be
> able to sleep in order to trigger page fault before accessing MRs. I did a
> test shown in the 2nd patch and found that the change makes the latency
> higher while improving the bandwidth. Though it may be possible to create a
> new independent workqueue for page fault execution, it is a not very
> sensible solution since the tasklets have to busy-wait its completion in
> that case.
>
> If responder and completer sleep, it becomes more likely that packet drop
> occurs because of overflow in receiver queue. There are multiple queues
> involved, but, as SoftRoCE uses UDP, the most important one would be the
> UDP buffers. The size can be configured in net.core.rmem_default and
> net.core.rmem_max sysconfig parameters. Users should change these values in
> case of packet drop, but page fault would be typically not so long as to
> cause the problem.
>
> [How does ODP work?]
> "struct ib_umem_odp" is used to manage pages. It is created for each
> ODP-enabled MR on its registration. This struct holds a pair of arrays
> (dma_list/pfn_list) that serve as a driver page table. DMA addresses and
> PFNs are stored in the driver page table. They are updated on page-in and
> page-out, both of which use the common interface in ib_uverbs.
>
> Page-in can occur when requester, responder or completer access an MR in
> order to process RDMA operations. If they find that the pages being
> accessed are not present on physical memory or requisite permissions are
> not set on the pages, they provoke page fault to make pages present with
> proper permissions and at the same time update the driver page table. After
> confirming the presence of the pages, they execute memory access such as
> read, write or atomic operations.
>
> Page-out is triggered by page reclaim or filesystem events (e.g. metadata
> update of a file that is being used as an MR). When creating an ODP-enabled
> MR, the driver registers an MMU notifier callback. When the kernel issues a
> page invalidation notification, the callback is provoked to unmap DMA
> addresses and update the driver page table. After that, the kernel releases
> the pages.
>
> [Supported operations]
> All operations are supported on RC connection. Atomic write[3] and Flush[4]
> operations, which are still under discussion, are also going to be
> supported after their patches are merged. On UD connection, Send, Recv,
> SRQ-Recv are supported. Because other operations are not supported on mlx5,
> I take after the decision right now.
>
> [How to test ODP?]
> There are only a few resources available for testing. pyverbs testcases in
> rdma-core and perftest[7] are recommendable ones. Note that you may have to
> build perftest from upstream since older versions do not handle ODP
> capabilities correctly.

ibv_rc_pingpong can also test the odp feature.

Please add rxe odp test cases in rdma-core.

Thanks a lot.
Zhu Yanjun

>
> [Future work]
> My next work will be the prefetch feature. It allows applications to
> trigger page fault using ibv_advise_mr(3) to optimize performance. Some
> existing software like librpma use this feature. Additionally, I think we
> can also add the implicit ODP feature in the future.
>
> [1] [RFC 00/20] On demand paging
> https://www.spinics.net/lists/linux-rdma/msg18906.html
>
> [2] [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
> https://lore.kernel.org/nvdimm/20190809225833.6657-1-ira.weiny@intel.com/
>
> [3] [RESEND PATCH v5 0/2] RDMA/rxe: Add RDMA Atomic Write operation
> https://www.spinics.net/lists/linux-rdma/msg111428.html
>
> [4] [PATCH v4 0/6] RDMA/rxe: Add RDMA FLUSH operation
> https://www.spinics.net/lists/kernel/msg4462045.html
>
> [5] librpma: Remote Persistent Memory Access Library
> https://github.com/pmem/rpma
>
> [6] Heterogeneous Memory Management (HMM)
> https://www.kernel.org/doc/html/latest/mm/hmm.html
>
> [7] linux-rdma/perftest: Infiniband Verbs Performance Tests
> https://github.com/linux-rdma/perftest
>
> Daisuke Matsuda (7):
>   IB/mlx5: Change ib_umem_odp_map_dma_single_page() to retain umem_mutex
>   RDMA/rxe: Convert the triple tasklets to workqueues
>   RDMA/rxe: Cleanup code for responder Atomic operations
>   RDMA/rxe: Add page invalidation support
>   RDMA/rxe: Allow registering MRs for On-Demand Paging
>   RDMA/rxe: Add support for Send/Recv/Write/Read operations with ODP
>   RDMA/rxe: Add support for the traditional Atomic operations with ODP
>
>  drivers/infiniband/core/umem_odp.c    |   6 +-
>  drivers/infiniband/hw/mlx5/odp.c      |   4 +-
>  drivers/infiniband/sw/rxe/Makefile    |   5 +-
>  drivers/infiniband/sw/rxe/rxe.c       |  18 ++
>  drivers/infiniband/sw/rxe/rxe_comp.c  |  42 +++-
>  drivers/infiniband/sw/rxe/rxe_loc.h   |  11 +-
>  drivers/infiniband/sw/rxe/rxe_mr.c    |   7 +-
>  drivers/infiniband/sw/rxe/rxe_net.c   |   4 +-
>  drivers/infiniband/sw/rxe/rxe_odp.c   | 329 ++++++++++++++++++++++++++
>  drivers/infiniband/sw/rxe/rxe_param.h |   2 +-
>  drivers/infiniband/sw/rxe/rxe_qp.c    |  68 +++---
>  drivers/infiniband/sw/rxe/rxe_recv.c  |   2 +-
>  drivers/infiniband/sw/rxe/rxe_req.c   |  14 +-
>  drivers/infiniband/sw/rxe/rxe_resp.c  | 175 +++++++-------
>  drivers/infiniband/sw/rxe/rxe_resp.h  |  44 ++++
>  drivers/infiniband/sw/rxe/rxe_task.c  | 152 ------------
>  drivers/infiniband/sw/rxe/rxe_task.h  |  69 ------
>  drivers/infiniband/sw/rxe/rxe_verbs.c |  16 +-
>  drivers/infiniband/sw/rxe/rxe_verbs.h |  10 +-
>  drivers/infiniband/sw/rxe/rxe_wq.c    | 161 +++++++++++++
>  drivers/infiniband/sw/rxe/rxe_wq.h    |  71 ++++++
>  21 files changed, 824 insertions(+), 386 deletions(-)
>  create mode 100644 drivers/infiniband/sw/rxe/rxe_odp.c
>  create mode 100644 drivers/infiniband/sw/rxe/rxe_resp.h
>  delete mode 100644 drivers/infiniband/sw/rxe/rxe_task.c
>  delete mode 100644 drivers/infiniband/sw/rxe/rxe_task.h
>  create mode 100644 drivers/infiniband/sw/rxe/rxe_wq.c
>  create mode 100644 drivers/infiniband/sw/rxe/rxe_wq.h
>
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 0/7] RDMA/rxe: On-Demand Paging on SoftRoCE
  2022-09-08  8:40 ` [RFC PATCH 0/7] RDMA/rxe: On-Demand Paging on SoftRoCE Zhu Yanjun
@ 2022-09-08 10:25   ` matsuda-daisuke
  0 siblings, 0 replies; 25+ messages in thread
From: matsuda-daisuke @ 2022-09-08 10:25 UTC (permalink / raw)
  To: 'Zhu Yanjun'
  Cc: RDMA mailing list, Leon Romanovsky, Jason Gunthorpe, nvdimm,
	LKML, Bob Pearson, yangx.jy, lizhijian, y-goto

On Thu, Sep 8, 2022 5:41 PM Zhu Yanjun wrote:
> > [How to test ODP?]
> > There are only a few resources available for testing. pyverbs testcases in
> > rdma-core and perftest[7] are recommendable ones. Note that you may have to
> > build perftest from upstream since older versions do not handle ODP
> > capabilities correctly.
> 
> ibv_rc_pingpong can also test the odp feature.

This may be the easiest way to try the feature.

> 
> Please add rxe odp test cases in rdma-core.

This RFC implementation is functionally a subset of mlx5.
So, basically we can use the existing one (tests/test_odp.py).
I'll add new cases when I make them for additional testing.

Thanks,
Daisuke Matsuda

> 
> Thanks a lot.
> Zhu Yanjun





^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 5/7] RDMA/rxe: Allow registering MRs for On-Demand Paging
  2022-09-07  2:43 ` [RFC PATCH 5/7] RDMA/rxe: Allow registering MRs for On-Demand Paging Daisuke Matsuda
@ 2022-09-08 16:57   ` Haris Iqbal
  2022-09-09  0:55     ` matsuda-daisuke
  0 siblings, 1 reply; 25+ messages in thread
From: Haris Iqbal @ 2022-09-08 16:57 UTC (permalink / raw)
  To: Daisuke Matsuda
  Cc: linux-rdma, leonro, jgg, zyjzyj2000, nvdimm, linux-kernel,
	rpearsonhpe, yangx.jy, lizhijian, y-goto, haris iqbal

On Wed, Sep 7, 2022 at 4:45 AM Daisuke Matsuda
<matsuda-daisuke@fujitsu.com> wrote:
>
> Allow applications to register an ODP-enabled MR, in which case the flag
> IB_ACCESS_ON_DEMAND is passed to rxe_reg_user_mr(). However, there is no
> RDMA operation supported right now. They will be enabled later in the
> subsequent two patches.
>
> rxe_odp_do_pagefault() is called to initialize an ODP-enabled MR here.
> It syncs process address space from the CPU page table to the driver page
> table(dma_list/pfn_list in umem_odp) when called with a
> RXE_PAGEFAULT_SNAPSHOT flag. Additionally, It can be used to trigger page
> fault when pages being accessed are not present or do not have proper
> read/write permissions and possibly to prefetch pages in the future.
>
> Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
> ---
>  drivers/infiniband/sw/rxe/rxe.c       |  7 +++
>  drivers/infiniband/sw/rxe/rxe_loc.h   |  5 ++
>  drivers/infiniband/sw/rxe/rxe_mr.c    |  7 ++-
>  drivers/infiniband/sw/rxe/rxe_odp.c   | 80 +++++++++++++++++++++++++++
>  drivers/infiniband/sw/rxe/rxe_resp.c  | 21 +++++--
>  drivers/infiniband/sw/rxe/rxe_verbs.c |  8 ++-
>  drivers/infiniband/sw/rxe/rxe_verbs.h |  2 +
>  7 files changed, 121 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/infiniband/sw/rxe/rxe.c b/drivers/infiniband/sw/rxe/rxe.c
> index 51daac5c4feb..0719f451253c 100644
> --- a/drivers/infiniband/sw/rxe/rxe.c
> +++ b/drivers/infiniband/sw/rxe/rxe.c
> @@ -73,6 +73,13 @@ static void rxe_init_device_param(struct rxe_dev *rxe)
>                         rxe->ndev->dev_addr);
>
>         rxe->max_ucontext                       = RXE_MAX_UCONTEXT;
> +
> +       if (IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING)) {
> +               rxe->attr.kernel_cap_flags |= IBK_ON_DEMAND_PAGING;
> +
> +               /* IB_ODP_SUPPORT_IMPLICIT is not supported right now. */
> +               rxe->attr.odp_caps.general_caps |= IB_ODP_SUPPORT;
> +       }
>  }
>
>  /* initialize port attributes */
> diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
> index 0f8cb9e38cc9..03b4078b90a3 100644
> --- a/drivers/infiniband/sw/rxe/rxe_loc.h
> +++ b/drivers/infiniband/sw/rxe/rxe_loc.h
> @@ -64,6 +64,7 @@ int rxe_mmap(struct ib_ucontext *context, struct vm_area_struct *vma);
>
>  /* rxe_mr.c */
>  u8 rxe_get_next_key(u32 last_key);
> +void rxe_mr_init(int access, struct rxe_mr *mr);
>  void rxe_mr_init_dma(struct rxe_pd *pd, int access, struct rxe_mr *mr);
>  int rxe_mr_init_user(struct rxe_pd *pd, u64 start, u64 length, u64 iova,
>                      int access, struct rxe_mr *mr);
> @@ -188,4 +189,8 @@ static inline unsigned int wr_opcode_mask(int opcode, struct rxe_qp *qp)
>         return rxe_wr_opcode_info[opcode].mask[qp->ibqp.qp_type];
>  }
>
> +/* rxe_odp.c */
> +int rxe_create_user_odp_mr(struct ib_pd *pd, u64 start, u64 length, u64 iova,
> +                          int access_flags, struct rxe_mr *mr);
> +
>  #endif /* RXE_LOC_H */
> diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
> index 814116ec4778..0ae72a4516be 100644
> --- a/drivers/infiniband/sw/rxe/rxe_mr.c
> +++ b/drivers/infiniband/sw/rxe/rxe_mr.c
> @@ -48,7 +48,7 @@ int mr_check_range(struct rxe_mr *mr, u64 iova, size_t length)
>                                 | IB_ACCESS_REMOTE_WRITE        \
>                                 | IB_ACCESS_REMOTE_ATOMIC)
>
> -static void rxe_mr_init(int access, struct rxe_mr *mr)
> +void rxe_mr_init(int access, struct rxe_mr *mr)
>  {
>         u32 lkey = mr->elem.index << 8 | rxe_get_next_key(-1);
>         u32 rkey = (access & IB_ACCESS_REMOTE) ? lkey : 0;
> @@ -438,7 +438,10 @@ int copy_data(
>                 if (bytes > 0) {
>                         iova = sge->addr + offset;
>
> -                       err = rxe_mr_copy(mr, iova, addr, bytes, dir);
> +                       if (mr->odp_enabled)
> +                               err = -EOPNOTSUPP;
> +                       else
> +                               err = rxe_mr_copy(mr, iova, addr, bytes, dir);
>                         if (err)
>                                 goto err2;
>
> diff --git a/drivers/infiniband/sw/rxe/rxe_odp.c b/drivers/infiniband/sw/rxe/rxe_odp.c
> index 0f702787a66e..1f6930ba714c 100644
> --- a/drivers/infiniband/sw/rxe/rxe_odp.c
> +++ b/drivers/infiniband/sw/rxe/rxe_odp.c
> @@ -5,6 +5,8 @@
>
>  #include <rdma/ib_umem_odp.h>
>
> +#include "rxe.h"
> +
>  bool rxe_ib_invalidate_range(struct mmu_interval_notifier *mni,
>                              const struct mmu_notifier_range *range,
>                              unsigned long cur_seq)
> @@ -32,3 +34,81 @@ bool rxe_ib_invalidate_range(struct mmu_interval_notifier *mni,
>  const struct mmu_interval_notifier_ops rxe_mn_ops = {
>         .invalidate = rxe_ib_invalidate_range,
>  };
> +
> +#define RXE_PAGEFAULT_RDONLY BIT(1)
> +#define RXE_PAGEFAULT_SNAPSHOT BIT(2)
> +static int rxe_odp_do_pagefault(struct rxe_mr *mr, u64 user_va, int bcnt, u32 flags)
> +{
> +       int np;
> +       u64 access_mask;
> +       bool fault = !(flags & RXE_PAGEFAULT_SNAPSHOT);
> +       struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
> +
> +       access_mask = ODP_READ_ALLOWED_BIT;
> +       if (umem_odp->umem.writable && !(flags & RXE_PAGEFAULT_RDONLY))
> +               access_mask |= ODP_WRITE_ALLOWED_BIT;
> +
> +       /*
> +        * umem mutex is held after return from ib_umem_odp_map_dma_and_lock().
> +        * Release it when access to user MR is done or not required.
> +        */
> +       np = ib_umem_odp_map_dma_and_lock(umem_odp, user_va, bcnt,
> +                                         access_mask, fault);
> +
> +       return np;
> +}
> +
> +static int rxe_init_odp_mr(struct rxe_mr *mr)
> +{
> +       int ret;
> +       struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
> +
> +       ret = rxe_odp_do_pagefault(mr, mr->umem->address, mr->umem->length,
> +                                  RXE_PAGEFAULT_SNAPSHOT);
> +       mutex_unlock(&umem_odp->umem_mutex);
> +
> +       return ret >= 0 ? 0 : ret;
> +}
> +
> +int rxe_create_user_odp_mr(struct ib_pd *pd, u64 start, u64 length, u64 iova,
> +                          int access_flags, struct rxe_mr *mr)
> +{
> +       int err;
> +       struct ib_umem_odp *umem_odp;
> +       struct rxe_dev *dev = container_of(pd->device, struct rxe_dev, ib_dev);
> +
> +       if (!IS_ENABLED(CONFIG_INFINIBAND_ON_DEMAND_PAGING))
> +               return -EOPNOTSUPP;
> +
> +       rxe_mr_init(access_flags, mr);
> +
> +       if (!start && length == U64_MAX) {
> +               if (iova != 0)
> +                       return -EINVAL;
> +               if (!(dev->attr.odp_caps.general_caps & IB_ODP_SUPPORT_IMPLICIT))
> +                       return -EINVAL;
> +
> +               /* Never reach here, for implicit ODP is not implemented. */
> +       }
> +
> +       umem_odp = ib_umem_odp_get(pd->device, start, length, access_flags,
> +                                  &rxe_mn_ops);
> +       if (IS_ERR(umem_odp))
> +               return PTR_ERR(umem_odp);
> +
> +       umem_odp->private = mr;
> +
> +       mr->odp_enabled = true;
> +       mr->ibmr.pd = pd;
> +       mr->umem = &umem_odp->umem;
> +       mr->access = access_flags;
> +       mr->length = length;
> +       mr->iova = iova;
> +       mr->offset = ib_umem_offset(&umem_odp->umem);
> +       mr->state = RXE_MR_STATE_VALID;
> +       mr->type = IB_MR_TYPE_USER;
> +
> +       err = rxe_init_odp_mr(mr);
> +
> +       return err;
> +}
> diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
> index cadc8fa64dd0..dd8632e783f6 100644
> --- a/drivers/infiniband/sw/rxe/rxe_resp.c
> +++ b/drivers/infiniband/sw/rxe/rxe_resp.c
> @@ -535,8 +535,12 @@ static enum resp_states write_data_in(struct rxe_qp *qp,
>         int     err;
>         int data_len = payload_size(pkt);
>
> -       err = rxe_mr_copy(qp->resp.mr, qp->resp.va + qp->resp.offset,
> -                         payload_addr(pkt), data_len, RXE_TO_MR_OBJ);
> +       if (qp->resp.mr->odp_enabled)

You cannot use qp->resp.mr here, because for zero byte operations,
resp.mr is not set in the function check_rkey().

The code fails for RTRS with the following stack trace,

[Thu Sep  8 20:12:22 2022] BUG: kernel NULL pointer dereference,
address: 0000000000000158
[Thu Sep  8 20:12:22 2022] #PF: supervisor read access in kernel mode
[Thu Sep  8 20:12:22 2022] #PF: error_code(0x0000) - not-present page
[Thu Sep  8 20:12:22 2022] PGD 0 P4D 0
[Thu Sep  8 20:12:22 2022] Oops: 0000 [#1] PREEMPT SMP
[Thu Sep  8 20:12:22 2022] CPU: 3 PID: 38 Comm: kworker/u8:1 Not
tainted 6.0.0-rc2-pserver+ #17
[Thu Sep  8 20:12:22 2022] Hardware name: QEMU Standard PC (i440FX +
PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
[Thu Sep  8 20:12:22 2022] Workqueue: rxe_resp rxe_do_work [rdma_rxe]
[Thu Sep  8 20:12:22 2022] RIP: 0010:rxe_responder+0x1910/0x1d90 [rdma_rxe]
[Thu Sep  8 20:12:22 2022] Code: 06 48 63 88 fc 15 63 c0 0f b6 46 01
83 ea 04 c0 e8 04 29 ca 83 e0 03 29 c2 49 8b 87 08 05 00 00 49 03 87
00 05 00 00 4c 63 ea <80> bf 58 01 00 00 00 48 8d 14 0e 48 89 c6 4d 89
ee 44 89 e9 0f 84
[Thu Sep  8 20:12:22 2022] RSP: 0018:ffffb0358015fd80 EFLAGS: 00010246
[Thu Sep  8 20:12:22 2022] RAX: 0000000000000000 RBX: ffff9af4839b5e28
RCX: 0000000000000020
[Thu Sep  8 20:12:22 2022] RDX: 0000000000000000 RSI: ffff9af485094a6a
RDI: 0000000000000000
[Thu Sep  8 20:12:22 2022] RBP: ffff9af488bd7128 R08: 0000000000000000
R09: 0000000000000000
[Thu Sep  8 20:12:22 2022] R10: ffff9af4808eaf7c R11: 0000000000000001
R12: 0000000000000008
[Thu Sep  8 20:12:22 2022] R13: 0000000000000000 R14: ffff9af488bd7380
R15: ffff9af488bd7000
[Thu Sep  8 20:12:22 2022] FS:  0000000000000000(0000)
GS:ffff9af5b7d80000(0000) knlGS:0000000000000000
[Thu Sep  8 20:12:22 2022] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Thu Sep  8 20:12:22 2022] CR2: 0000000000000158 CR3: 000000004a60a000
CR4: 00000000000006e0
[Thu Sep  8 20:12:22 2022] DR0: 0000000000000000 DR1: 0000000000000000
DR2: 0000000000000000
[Thu Sep  8 20:12:22 2022] DR3: 0000000000000000 DR6: 00000000fffe0ff0
DR7: 0000000000000400
[Thu Sep  8 20:12:22 2022] Call Trace:
[Thu Sep  8 20:12:22 2022]  <TASK>
[Thu Sep  8 20:12:22 2022]  ? newidle_balance+0x2e5/0x400
[Thu Sep  8 20:12:22 2022]  ? _raw_spin_unlock+0x12/0x30
[Thu Sep  8 20:12:22 2022]  ? finish_task_switch+0x91/0x2a0
[Thu Sep  8 20:12:22 2022]  rxe_do_work+0x86/0x110 [rdma_rxe]
[Thu Sep  8 20:12:22 2022]  process_one_work+0x1dc/0x3a0
[Thu Sep  8 20:12:22 2022]  worker_thread+0x4a/0x3b0
[Thu Sep  8 20:12:22 2022]  ? process_one_work+0x3a0/0x3a0
[Thu Sep  8 20:12:22 2022]  kthread+0xe7/0x110
[Thu Sep  8 20:12:22 2022]  ? kthread_complete_and_exit+0x20/0x20
[Thu Sep  8 20:12:22 2022]  ret_from_fork+0x22/0x30
[Thu Sep  8 20:12:22 2022]  </TASK>
[Thu Sep  8 20:12:22 2022] Modules linked in: rnbd_server rtrs_server
rtrs_core rdma_ucm rdma_cm iw_cm ib_cm crc32_generic rdma_rxe
ip6_udp_tunnel udp_tunnel ib_uverbs ib_core loop null_blk
[Thu Sep  8 20:12:22 2022] CR2: 0000000000000158
[Thu Sep  8 20:12:22 2022] ---[ end trace 0000000000000000 ]---
[Thu Sep  8 20:12:22 2022] BUG: kernel NULL pointer dereference,
address: 0000000000000158
[Thu Sep  8 20:12:22 2022] RIP: 0010:rxe_responder+0x1910/0x1d90 [rdma_rxe]
[Thu Sep  8 20:12:22 2022] #PF: supervisor read access in kernel mode
[Thu Sep  8 20:12:22 2022] Code: 06 48 63 88 fc 15 63 c0 0f b6 46 01
83 ea 04 c0 e8 04 29 ca 83 e0 03 29 c2 49 8b 87 08 05 00 00 49 03 87
00 05 00 00 4c 63 ea <80> bf 58 01 00 00 00 48 8d 14 0e 48 89 c6 4d 89
ee 44 89 e9 0f 84
[Thu Sep  8 20:12:22 2022] #PF: error_code(0x0000) - not-present page
[Thu Sep  8 20:12:22 2022] RSP: 0018:ffffb0358015fd80 EFLAGS: 00010246
[Thu Sep  8 20:12:22 2022] PGD 0 P4D 0

Technically, for operations with 0 length, the code can simply skip the
*_mr_copy() calls and carry on with success. So maybe you can check
data_len first and copy only if needed.
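
Something along these lines at the top of write_data_in() should avoid the
dereference (rough sketch, untested):

	int data_len = payload_size(pkt);

	/* Zero-length write: qp->resp.mr may be NULL and there is nothing
	 * to copy, so report success without touching the MR. */
	if (data_len == 0)
		return RESPST_NONE;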


> +               err = -EOPNOTSUPP;
> +       else
> +               err = rxe_mr_copy(qp->resp.mr, qp->resp.va + qp->resp.offset,
> +                                 payload_addr(pkt), data_len, RXE_TO_MR_OBJ);
> +
>         if (err) {
>                 rc = RESPST_ERR_RKEY_VIOLATION;
>                 goto out;
> @@ -667,7 +671,10 @@ static enum resp_states rxe_atomic_reply(struct rxe_qp *qp,
>                 if (mr->state != RXE_MR_STATE_VALID)
>                         return RESPST_ERR_RKEY_VIOLATION;
>
> -               ret = rxe_atomic_ops(qp, pkt, mr);
> +               if (mr->odp_enabled)
> +                       ret = RESPST_ERR_UNSUPPORTED_OPCODE;
> +               else
> +                       ret = rxe_atomic_ops(qp, pkt, mr);
>         } else
>                 ret = RESPST_ACKNOWLEDGE;
>
> @@ -831,8 +838,12 @@ static enum resp_states read_reply(struct rxe_qp *qp,
>         if (!skb)
>                 return RESPST_ERR_RNR;
>
> -       err = rxe_mr_copy(mr, res->read.va, payload_addr(&ack_pkt),
> -                         payload, RXE_FROM_MR_OBJ);
> +       if (mr->odp_enabled)
> +               err = -EOPNOTSUPP;
> +       else
> +               err = rxe_mr_copy(mr, res->read.va, payload_addr(&ack_pkt),
> +                                 payload, RXE_FROM_MR_OBJ);
> +
>         if (err)
>                 pr_err("Failed copying memory\n");
>         if (mr)
> diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.c b/drivers/infiniband/sw/rxe/rxe_verbs.c
> index 7510f25c5ea3..b00e9b847382 100644
> --- a/drivers/infiniband/sw/rxe/rxe_verbs.c
> +++ b/drivers/infiniband/sw/rxe/rxe_verbs.c
> @@ -926,10 +926,14 @@ static struct ib_mr *rxe_reg_user_mr(struct ib_pd *ibpd,
>                 goto err2;
>         }
>
> -
>         rxe_get(pd);
>
> -       err = rxe_mr_init_user(pd, start, length, iova, access, mr);
> +       if (access & IB_ACCESS_ON_DEMAND)
> +               err = rxe_create_user_odp_mr(&pd->ibpd, start, length, iova,
> +                                            access, mr);
> +       else
> +               err = rxe_mr_init_user(pd, start, length, iova, access, mr);
> +
>         if (err)
>                 goto err3;
>
> diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
> index b09b4cb9897a..98d2bb737ebc 100644
> --- a/drivers/infiniband/sw/rxe/rxe_verbs.h
> +++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
> @@ -324,6 +324,8 @@ struct rxe_mr {
>         atomic_t                num_mw;
>
>         struct rxe_map          **map;
> +
> +       bool                    odp_enabled;
>  };
>
>  enum rxe_mw_state {
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [RFC PATCH 5/7] RDMA/rxe: Allow registering MRs for On-Demand Paging
  2022-09-08 16:57   ` Haris Iqbal
@ 2022-09-09  0:55     ` matsuda-daisuke
  0 siblings, 0 replies; 25+ messages in thread
From: matsuda-daisuke @ 2022-09-09  0:55 UTC (permalink / raw)
  To: 'Haris Iqbal'
  Cc: linux-rdma, leonro, jgg, zyjzyj2000, nvdimm, linux-kernel,
	rpearsonhpe, yangx.jy, lizhijian, y-goto, haris iqbal

On Fri, Sep 9, 2022 1:58 AM Haris Iqbal wrote:
> On Wed, Sep 7, 2022 at 4:45 AM Daisuke Matsuda
> <matsuda-daisuke@fujitsu.com> wrote:
> >
> > Allow applications to register an ODP-enabled MR, in which case the flag
> > IB_ACCESS_ON_DEMAND is passed to rxe_reg_user_mr(). However, there is no
> > RDMA operation supported right now. They will be enabled later in the
> > subsequent two patches.
> >
> > rxe_odp_do_pagefault() is called to initialize an ODP-enabled MR here.
> > It syncs process address space from the CPU page table to the driver page
> > table(dma_list/pfn_list in umem_odp) when called with a
> > RXE_PAGEFAULT_SNAPSHOT flag. Additionally, It can be used to trigger page
> > fault when pages being accessed are not present or do not have proper
> > read/write permissions and possibly to prefetch pages in the future.
> >
> > Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
> > ---
> >  drivers/infiniband/sw/rxe/rxe.c       |  7 +++
> >  drivers/infiniband/sw/rxe/rxe_loc.h   |  5 ++
> >  drivers/infiniband/sw/rxe/rxe_mr.c    |  7 ++-
> >  drivers/infiniband/sw/rxe/rxe_odp.c   | 80 +++++++++++++++++++++++++++
> >  drivers/infiniband/sw/rxe/rxe_resp.c  | 21 +++++--
> >  drivers/infiniband/sw/rxe/rxe_verbs.c |  8 ++-
> >  drivers/infiniband/sw/rxe/rxe_verbs.h |  2 +
> >  7 files changed, 121 insertions(+), 9 deletions(-)
> >

<...>

> > diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
> > index cadc8fa64dd0..dd8632e783f6 100644
> > --- a/drivers/infiniband/sw/rxe/rxe_resp.c
> > +++ b/drivers/infiniband/sw/rxe/rxe_resp.c
> > @@ -535,8 +535,12 @@ static enum resp_states write_data_in(struct rxe_qp *qp,
> >         int     err;
> >         int data_len = payload_size(pkt);
> >
> > -       err = rxe_mr_copy(qp->resp.mr, qp->resp.va + qp->resp.offset,
> > -                         payload_addr(pkt), data_len, RXE_TO_MR_OBJ);
> > +       if (qp->resp.mr->odp_enabled)
> 
> You cannot use qp->resp.mr here, because for zero byte operations,
> resp.mr is not set in the function check_rkey().
> 
> The code fails for RTRS with the following stack trace,
> 
> [Thu Sep  8 20:12:22 2022] BUG: kernel NULL pointer dereference,
> address: 0000000000000158
> [Thu Sep  8 20:12:22 2022] #PF: supervisor read access in kernel mode
> [Thu Sep  8 20:12:22 2022] #PF: error_code(0x0000) - not-present page
> [Thu Sep  8 20:12:22 2022] PGD 0 P4D 0
> [Thu Sep  8 20:12:22 2022] Oops: 0000 [#1] PREEMPT SMP
> [Thu Sep  8 20:12:22 2022] CPU: 3 PID: 38 Comm: kworker/u8:1 Not
> tainted 6.0.0-rc2-pserver+ #17
> [Thu Sep  8 20:12:22 2022] Hardware name: QEMU Standard PC (i440FX +
> PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
> [Thu Sep  8 20:12:22 2022] Workqueue: rxe_resp rxe_do_work [rdma_rxe]
> [Thu Sep  8 20:12:22 2022] RIP: 0010:rxe_responder+0x1910/0x1d90 [rdma_rxe]
> [Thu Sep  8 20:12:22 2022] Code: 06 48 63 88 fc 15 63 c0 0f b6 46 01
> 83 ea 04 c0 e8 04 29 ca 83 e0 03 29 c2 49 8b 87 08 05 00 00 49 03 87
> 00 05 00 00 4c 63 ea <80> bf 58 01 00 00 00 48 8d 14 0e 48 89 c6 4d 89
> ee 44 89 e9 0f 84
> [Thu Sep  8 20:12:22 2022] RSP: 0018:ffffb0358015fd80 EFLAGS: 00010246
> [Thu Sep  8 20:12:22 2022] RAX: 0000000000000000 RBX: ffff9af4839b5e28
> RCX: 0000000000000020
> [Thu Sep  8 20:12:22 2022] RDX: 0000000000000000 RSI: ffff9af485094a6a
> RDI: 0000000000000000
> [Thu Sep  8 20:12:22 2022] RBP: ffff9af488bd7128 R08: 0000000000000000
> R09: 0000000000000000
> [Thu Sep  8 20:12:22 2022] R10: ffff9af4808eaf7c R11: 0000000000000001
> R12: 0000000000000008
> [Thu Sep  8 20:12:22 2022] R13: 0000000000000000 R14: ffff9af488bd7380
> R15: ffff9af488bd7000
> [Thu Sep  8 20:12:22 2022] FS:  0000000000000000(0000)
> GS:ffff9af5b7d80000(0000) knlGS:0000000000000000
> [Thu Sep  8 20:12:22 2022] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [Thu Sep  8 20:12:22 2022] CR2: 0000000000000158 CR3: 000000004a60a000
> CR4: 00000000000006e0
> [Thu Sep  8 20:12:22 2022] DR0: 0000000000000000 DR1: 0000000000000000
> DR2: 0000000000000000
> [Thu Sep  8 20:12:22 2022] DR3: 0000000000000000 DR6: 00000000fffe0ff0
> DR7: 0000000000000400
> [Thu Sep  8 20:12:22 2022] Call Trace:
> [Thu Sep  8 20:12:22 2022]  <TASK>
> [Thu Sep  8 20:12:22 2022]  ? newidle_balance+0x2e5/0x400
> [Thu Sep  8 20:12:22 2022]  ? _raw_spin_unlock+0x12/0x30
> [Thu Sep  8 20:12:22 2022]  ? finish_task_switch+0x91/0x2a0
> [Thu Sep  8 20:12:22 2022]  rxe_do_work+0x86/0x110 [rdma_rxe]
> [Thu Sep  8 20:12:22 2022]  process_one_work+0x1dc/0x3a0
> [Thu Sep  8 20:12:22 2022]  worker_thread+0x4a/0x3b0
> [Thu Sep  8 20:12:22 2022]  ? process_one_work+0x3a0/0x3a0
> [Thu Sep  8 20:12:22 2022]  kthread+0xe7/0x110
> [Thu Sep  8 20:12:22 2022]  ? kthread_complete_and_exit+0x20/0x20
> [Thu Sep  8 20:12:22 2022]  ret_from_fork+0x22/0x30
> [Thu Sep  8 20:12:22 2022]  </TASK>
> [Thu Sep  8 20:12:22 2022] Modules linked in: rnbd_server rtrs_server
> rtrs_core rdma_ucm rdma_cm iw_cm ib_cm crc32_generic rdma_rxe
> ip6_udp_tunnel udp_tunnel ib_uverbs ib_core loop null_blk
> [Thu Sep  8 20:12:22 2022] CR2: 0000000000000158
> [Thu Sep  8 20:12:22 2022] ---[ end trace 0000000000000000 ]---
> [Thu Sep  8 20:12:22 2022] BUG: kernel NULL pointer dereference,
> address: 0000000000000158
> [Thu Sep  8 20:12:22 2022] RIP: 0010:rxe_responder+0x1910/0x1d90 [rdma_rxe]
> [Thu Sep  8 20:12:22 2022] #PF: supervisor read access in kernel mode
> [Thu Sep  8 20:12:22 2022] Code: 06 48 63 88 fc 15 63 c0 0f b6 46 01
> 83 ea 04 c0 e8 04 29 ca 83 e0 03 29 c2 49 8b 87 08 05 00 00 49 03 87
> 00 05 00 00 4c 63 ea <80> bf 58 01 00 00 00 48 8d 14 0e 48 89 c6 4d 89
> ee 44 89 e9 0f 84
> [Thu Sep  8 20:12:22 2022] #PF: error_code(0x0000) - not-present page
> [Thu Sep  8 20:12:22 2022] RSP: 0018:ffffb0358015fd80 EFLAGS: 00010246
> [Thu Sep  8 20:12:22 2022] PGD 0 P4D 0
> 
> Technically, for operations with 0 length, the code can simply not do
> any of the *_mr_copy, and carry on with success. So maybe you can
> check data_len first and copy only if needed.
> 

Good Catch!
I will fix this in the next post as you suggest.

Many Thanks
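
For reference, a minimal sketch of what I have in mind (assuming the hunk
above sits in write_data_in() in rxe_resp.c; untested):

        /* skip the copy entirely for zero-length payloads so that
         * qp->resp.mr, which check_rkey() leaves unset in that case,
         * is never dereferenced
         */
        if (data_len) {
                if (qp->resp.mr->odp_enabled)
                        err = -EOPNOTSUPP;
                else
                        err = rxe_mr_copy(qp->resp.mr,
                                          qp->resp.va + qp->resp.offset,
                                          payload_addr(pkt), data_len,
                                          RXE_TO_MR_OBJ);
                if (err) {
                        rc = RESPST_ERR_RKEY_VIOLATION;
                        goto out;
                }
        }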

> 
> > +               err = -EOPNOTSUPP;
> > +       else
> > +               err = rxe_mr_copy(qp->resp.mr, qp->resp.va + qp->resp.offset,
> > +                                 payload_addr(pkt), data_len, RXE_TO_MR_OBJ);
> > +
> >         if (err) {
> >                 rc = RESPST_ERR_RKEY_VIOLATION;
> >                 goto out;
> > @@ -667,7 +671,10 @@ static enum resp_states rxe_atomic_reply(struct rxe_qp *qp,
> >                 if (mr->state != RXE_MR_STATE_VALID)
> >                         return RESPST_ERR_RKEY_VIOLATION;
> >
> > -               ret = rxe_atomic_ops(qp, pkt, mr);
> > +               if (mr->odp_enabled)
> > +                       ret = RESPST_ERR_UNSUPPORTED_OPCODE;
> > +               else
> > +                       ret = rxe_atomic_ops(qp, pkt, mr);
> >         } else
> >                 ret = RESPST_ACKNOWLEDGE;
> >
> > @@ -831,8 +838,12 @@ static enum resp_states read_reply(struct rxe_qp *qp,
> >         if (!skb)
> >                 return RESPST_ERR_RNR;
> >
> > -       err = rxe_mr_copy(mr, res->read.va, payload_addr(&ack_pkt),
> > -                         payload, RXE_FROM_MR_OBJ);
> > +       if (mr->odp_enabled)
> > +               err = -EOPNOTSUPP;
> > +       else
> > +               err = rxe_mr_copy(mr, res->read.va, payload_addr(&ack_pkt),
> > +                                 payload, RXE_FROM_MR_OBJ);
> > +
> >         if (err)
> >                 pr_err("Failed copying memory\n");
> >         if (mr)
> > diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.c b/drivers/infiniband/sw/rxe/rxe_verbs.c
> > index 7510f25c5ea3..b00e9b847382 100644
> > --- a/drivers/infiniband/sw/rxe/rxe_verbs.c
> > +++ b/drivers/infiniband/sw/rxe/rxe_verbs.c
> > @@ -926,10 +926,14 @@ static struct ib_mr *rxe_reg_user_mr(struct ib_pd *ibpd,
> >                 goto err2;
> >         }
> >
> > -
> >         rxe_get(pd);
> >
> > -       err = rxe_mr_init_user(pd, start, length, iova, access, mr);
> > +       if (access & IB_ACCESS_ON_DEMAND)
> > +               err = rxe_create_user_odp_mr(&pd->ibpd, start, length, iova,
> > +                                            access, mr);
> > +       else
> > +               err = rxe_mr_init_user(pd, start, length, iova, access, mr);
> > +
> >         if (err)
> >                 goto err3;
> >
> > diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
> > index b09b4cb9897a..98d2bb737ebc 100644
> > --- a/drivers/infiniband/sw/rxe/rxe_verbs.h
> > +++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
> > @@ -324,6 +324,8 @@ struct rxe_mr {
> >         atomic_t                num_mw;
> >
> >         struct rxe_map          **map;
> > +
> > +       bool                    odp_enabled;
> >  };
> >
> >  enum rxe_mw_state {
> > --
> > 2.31.1
> >

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 6/7] RDMA/rxe: Add support for Send/Recv/Write/Read operations with ODP
  2022-09-08  8:29   ` Leon Romanovsky
@ 2022-09-09  2:45     ` matsuda-daisuke
  0 siblings, 0 replies; 25+ messages in thread
From: matsuda-daisuke @ 2022-09-09  2:45 UTC (permalink / raw)
  To: 'Leon Romanovsky'
  Cc: linux-rdma, jgg, zyjzyj2000, nvdimm, linux-kernel, rpearsonhpe,
	yangx.jy, lizhijian, y-goto

On Thu, Sep 8, 2022 5:30 PM Leon Romanovsky wrote:
> On Wed, Sep 07, 2022 at 11:43:04AM +0900, Daisuke Matsuda wrote:
> > rxe_mr_copy() is used widely to copy data to/from a user MR. requester uses
> > it to load payloads of requesting packets; responder uses it to process
> > Send, Write, and Read operaetions; completer uses it to copy data from
> > response packets of Read and Atomic operations to a user MR.
> >
> > Allow these operations to be used with ODP by adding a counterpart function
> > rxe_odp_mr_copy(). It is comprised of the following steps:
> >  1. Check the driver page table(umem_odp->dma_list) to see if pages being
> >     accessed are present with appropriate permission.
> >  2. If necessary, trigger page fault to map the pages.
> >  3. Convert their user space addresses to kernel logical addresses using
> >     PFNs in the driver page table(umem_odp->pfn_list).
> >  4. Execute data copy to/from the pages.
> >
> > umem_mutex is used to ensure that dma_list (an array of addresses of an MR)
> > is not changed while it is checked and that mapped pages are not
> > invalidated before data copy completes.
> >
> > Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
> > ---
> >  drivers/infiniband/sw/rxe/rxe.c      |  10 ++
> >  drivers/infiniband/sw/rxe/rxe_loc.h  |   2 +
> >  drivers/infiniband/sw/rxe/rxe_mr.c   |   2 +-
> >  drivers/infiniband/sw/rxe/rxe_odp.c  | 173 +++++++++++++++++++++++++++
> >  drivers/infiniband/sw/rxe/rxe_resp.c |   6 +-
> >  5 files changed, 190 insertions(+), 3 deletions(-)
> 
> <...>
> 
> > +/* umem mutex is always locked when returning from this function. */
> > +static int rxe_odp_map_range(struct rxe_mr *mr, u64 iova, int length, u32 flags)
> > +{
> > +	struct ib_umem_odp *umem_odp = to_ib_umem_odp(mr->umem);
> > +	const int max_tries = 3;
> > +	int cnt = 0;
> > +
> > +	int err;
> > +	u64 perm;
> > +	bool need_fault;
> > +
> > +	if (unlikely(length < 1))
> > +		return -EINVAL;
> > +
> > +	perm = ODP_READ_ALLOWED_BIT;
> > +	if (!(flags & RXE_PAGEFAULT_RDONLY))
> > +		perm |= ODP_WRITE_ALLOWED_BIT;
> > +
> > +	mutex_lock(&umem_odp->umem_mutex);
> > +
> > +	/*
> > +	 * A successful return from rxe_odp_do_pagefault() does not guarantee
> > +	 * that all pages in the range became present. Recheck the DMA address
> > +	 * array, allowing max 3 tries for pagefault.
> > +	 */
> > +	while ((need_fault = rxe_is_pagefault_neccesary(umem_odp,
> > +							iova, length, perm))) {
> > +
> > +		if (cnt >= max_tries)
> > +			break;
> > +
> > +		mutex_unlock(&umem_odp->umem_mutex);
> > +
> > +		/* rxe_odp_do_pagefault() locks the umem mutex. */
> 
> Maybe it is correct and safe to release lock in the middle, but it is
> not clear. The whole pattern of taking lock in one function and later
> releasing it in another doesn't look right to me.

When the driver finds in rxe_is_pagefault_neccesary() that the pages are
not mapped, it releases the lock so that the kernel can execute page
invalidation in the meantime, and takes the lock again to handle the page
fault in ib_umem_odp_map_dma_and_lock(). Then it proceeds to
rxe_is_pagefault_neccesary() again with the lock held.

I admit the usage of the lock is quite confusing.
It is taken before verifying that the target pages are present.
It is released either when the target pages are missing and a page fault
is required, or when access to the target pages in an MR is done.

I will move some of the lock taking/releasing operations to rxe_odp_mr_copy()
and rxe_odp_atomic_ops() so that people can understand the situation more
easily. Also, I will rethink the way I explain it in the comments and the
patch description.
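
To illustrate, this is roughly the pattern I have in mind after moving the
locking up (rxe_odp_mr_copy_locked() is a hypothetical helper, the argument
lists are simplified, and the retry limit is omitted):

        mutex_lock(&umem_odp->umem_mutex);

        while (rxe_is_pagefault_neccesary(umem_odp, iova, length, perm)) {
                mutex_unlock(&umem_odp->umem_mutex);
                /* faults the pages in; returns with umem_mutex held again */
                rxe_odp_do_pagefault(mr, iova, length, flags);
        }

        /* the pages are present and cannot be invalidated while the mutex
         * is held, so it is safe to access the MR here
         */
        err = rxe_odp_mr_copy_locked(mr, iova, addr, length, dir);

        mutex_unlock(&umem_odp->umem_mutex);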

Thank you,
Daisuke Matsuda

> 
> Thanks

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 0/7] RDMA/rxe: On-Demand Paging on SoftRoCE
  2022-09-07  2:42 [RFC PATCH 0/7] RDMA/rxe: On-Demand Paging on SoftRoCE Daisuke Matsuda
                   ` (7 preceding siblings ...)
  2022-09-08  8:40 ` [RFC PATCH 0/7] RDMA/rxe: On-Demand Paging on SoftRoCE Zhu Yanjun
@ 2022-09-09  3:07 ` Li Zhijian
  2022-09-12  9:21   ` matsuda-daisuke
  8 siblings, 1 reply; 25+ messages in thread
From: Li Zhijian @ 2022-09-09  3:07 UTC (permalink / raw)
  To: Daisuke Matsuda, linux-rdma, leonro, jgg, zyjzyj2000
  Cc: nvdimm, linux-kernel, rpearsonhpe, yangx.jy, y-goto

Daisuke

Great job.

I love this feature. Before starting to review your patches, I tested it with
QEMU (with an fsdax memory backend) migration over RDMA, which worked for
MLX5 before.

This time, with your ODP patches, it works on RXE as well, though
ibv_advise_mr may not be ready yet.


Thanks
Zhijian


On 07/09/2022 10:42, Daisuke Matsuda wrote:
> Hi everyone,
>
> This patch series implements the On-Demand Paging feature on SoftRoCE(rxe)
> driver, which has been available only in mlx5 driver[1] so far.
>
> [Overview]
> When applications register a memory region(MR), RDMA drivers normally pin
> pages in the MR so that physical addresses are never changed during RDMA
> communication. This requires the MR to fit in physical memory and
> inevitably leads to memory pressure. On the other hand, On-Demand Paging
> (ODP) allows applications to register MRs without pinning pages. They are
> paged-in when the driver requires and paged-out when the OS reclaims. As a
> result, it is possible to register a large MR that does not fit in physical
> memory without taking up so much physical memory.
>
> [Why to add this feature?]
> We, Fujitsu, have contributed to RDMA with a view to using it with
> persistent memory. Persistent memory can host a filesystem that allows
> applications to read/write files directly without involving page cache.
> This is called FS-DAX(filesystem direct access) mode. There is a problem
> that data on DAX-enabled filesystem cannot be duplicated with software RAID
> or other hardware methods. Data replication with RDMA, which features
> high-speed connections, is the best solution for the problem.
>
> However, there is a known issue that hinders using RDMA with FS-DAX. When
> RDMA operations to a file and update of the file metadata are processed
> concurrently on the same node, illegal memory accesses can be executed,
> disregarding the updated metadata. This is because RDMA operations do not
> go through page cache but access data directly. There was an effort[2] to
> solve this problem, but it was rejected in the end. Though there is no
> general solution available, it is possible to work around the problem using
> the ODP feature that has been available only in mlx5. ODP enables drivers
> to update metadata before processing RDMA operations.
>
> We have enhanced the rxe to expedite the usage of persistent memory. Our
> contribution to rxe includes RDMA Atomic write[3] and RDMA Flush[4]. With
> them being merged to rxe along with ODP, an environment will be ready for
> developers to create and test software for RDMA with FS-DAX. There is a
> library(librpma)[5] being developed for this purpose. This environment
> can be used by anybody without any special hardware but an ordinary
> computer with a normal NIC though it is inferior to hardware
> implementations in terms of performance.
>
> [Design considerations]
> ODP has been available only in mlx5, but functions and data structures
> that can be used commonly are provided in ib_uverbs(infiniband/core). The
> interface is heavily dependent on HMM infrastructure[6], and this patchset
> use them as much as possible. While mlx5 has both Explicit and Implicit ODP
> features along with prefetch feature, this patchset implements the Explicit
> ODP feature only.
>
> As an important change, it is necessary to convert triple tasklets
> (requester, responder and completer) to workqueues because they must be
> able to sleep in order to trigger page fault before accessing MRs. I did a
> test shown in the 2nd patch and found that the change makes the latency
> higher while improving the bandwidth. Though it may be possible to create a
> new independent workqueue for page fault execution, it is a not very
> sensible solution since the tasklets have to busy-wait its completion in
> that case.
>
> If responder and completer sleep, it becomes more likely that packet drop
> occurs because of overflow in receiver queue. There are multiple queues
> involved, but, as SoftRoCE uses UDP, the most important one would be the
> UDP buffers. The size can be configured in net.core.rmem_default and
> net.core.rmem_max sysconfig parameters. Users should change these values in
> case of packet drop, but page fault would be typically not so long as to
> cause the problem.
>
> [How does ODP work?]
> "struct ib_umem_odp" is used to manage pages. It is created for each
> ODP-enabled MR on its registration. This struct holds a pair of arrays
> (dma_list/pfn_list) that serve as a driver page table. DMA addresses and
> PFNs are stored in the driver page table. They are updated on page-in and
> page-out, both of which use the common interface in ib_uverbs.
>
> Page-in can occur when requester, responder or completer access an MR in
> order to process RDMA operations. If they find that the pages being
> accessed are not present on physical memory or requisite permissions are
> not set on the pages, they provoke page fault to make pages present with
> proper permissions and at the same time update the driver page table. After
> confirming the presence of the pages, they execute memory access such as
> read, write or atomic operations.
>
> Page-out is triggered by page reclaim or filesystem events (e.g. metadata
> update of a file that is being used as an MR). When creating an ODP-enabled
> MR, the driver registers an MMU notifier callback. When the kernel issues a
> page invalidation notification, the callback is provoked to unmap DMA
> addresses and update the driver page table. After that, the kernel releases
> the pages.
>
> [Supported operations]
> All operations are supported on RC connection. Atomic write[3] and Flush[4]
> operations, which are still under discussion, are also going to be
> supported after their patches are merged. On UD connection, Send, Recv,
> SRQ-Recv are supported. Because other operations are not supported on mlx5,
> I take after the decision right now.
>
> [How to test ODP?]
> There are only a few resources available for testing. pyverbs testcases in
> rdma-core and perftest[7] are recommendable ones. Note that you may have to
> build perftest from upstream since older versions do not handle ODP
> capabilities correctly.
>
> [Future work]
> My next work will be the prefetch feature. It allows applications to
> trigger page fault using ibv_advise_mr(3) to optimize performance. Some
> existing software like librpma use this feature. Additionally, I think we
> can also add the implicit ODP feature in the future.
>
> [1] [RFC 00/20] On demand paging
> https://www.spinics.net/lists/linux-rdma/msg18906.html
>
> [2] [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
> https://lore.kernel.org/nvdimm/20190809225833.6657-1-ira.weiny@intel.com/
>
> [3] [RESEND PATCH v5 0/2] RDMA/rxe: Add RDMA Atomic Write operation
> https://www.spinics.net/lists/linux-rdma/msg111428.html
>
> [4] [PATCH v4 0/6] RDMA/rxe: Add RDMA FLUSH operation
> https://www.spinics.net/lists/kernel/msg4462045.html
>
> [5] librpma: Remote Persistent Memory Access Library
> https://github.com/pmem/rpma
>
> [6] Heterogeneous Memory Management (HMM)
> https://www.kernel.org/doc/html/latest/mm/hmm.html
>
> [7] linux-rdma/perftest: Infiniband Verbs Performance Tests
> https://github.com/linux-rdma/perftest
>
> Daisuke Matsuda (7):
>    IB/mlx5: Change ib_umem_odp_map_dma_single_page() to retain umem_mutex
>    RDMA/rxe: Convert the triple tasklets to workqueues
>    RDMA/rxe: Cleanup code for responder Atomic operations
>    RDMA/rxe: Add page invalidation support
>    RDMA/rxe: Allow registering MRs for On-Demand Paging
>    RDMA/rxe: Add support for Send/Recv/Write/Read operations with ODP
>    RDMA/rxe: Add support for the traditional Atomic operations with ODP
>
>   drivers/infiniband/core/umem_odp.c    |   6 +-
>   drivers/infiniband/hw/mlx5/odp.c      |   4 +-
>   drivers/infiniband/sw/rxe/Makefile    |   5 +-
>   drivers/infiniband/sw/rxe/rxe.c       |  18 ++
>   drivers/infiniband/sw/rxe/rxe_comp.c  |  42 +++-
>   drivers/infiniband/sw/rxe/rxe_loc.h   |  11 +-
>   drivers/infiniband/sw/rxe/rxe_mr.c    |   7 +-
>   drivers/infiniband/sw/rxe/rxe_net.c   |   4 +-
>   drivers/infiniband/sw/rxe/rxe_odp.c   | 329 ++++++++++++++++++++++++++
>   drivers/infiniband/sw/rxe/rxe_param.h |   2 +-
>   drivers/infiniband/sw/rxe/rxe_qp.c    |  68 +++---
>   drivers/infiniband/sw/rxe/rxe_recv.c  |   2 +-
>   drivers/infiniband/sw/rxe/rxe_req.c   |  14 +-
>   drivers/infiniband/sw/rxe/rxe_resp.c  | 175 +++++++-------
>   drivers/infiniband/sw/rxe/rxe_resp.h  |  44 ++++
>   drivers/infiniband/sw/rxe/rxe_task.c  | 152 ------------
>   drivers/infiniband/sw/rxe/rxe_task.h  |  69 ------
>   drivers/infiniband/sw/rxe/rxe_verbs.c |  16 +-
>   drivers/infiniband/sw/rxe/rxe_verbs.h |  10 +-
>   drivers/infiniband/sw/rxe/rxe_wq.c    | 161 +++++++++++++
>   drivers/infiniband/sw/rxe/rxe_wq.h    |  71 ++++++
>   21 files changed, 824 insertions(+), 386 deletions(-)
>   create mode 100644 drivers/infiniband/sw/rxe/rxe_odp.c
>   create mode 100644 drivers/infiniband/sw/rxe/rxe_resp.h
>   delete mode 100644 drivers/infiniband/sw/rxe/rxe_task.c
>   delete mode 100644 drivers/infiniband/sw/rxe/rxe_task.h
>   create mode 100644 drivers/infiniband/sw/rxe/rxe_wq.c
>   create mode 100644 drivers/infiniband/sw/rxe/rxe_wq.h
>


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 2/7] RDMA/rxe: Convert the triple tasklets to workqueues
  2022-09-07  2:43 ` [RFC PATCH 2/7] RDMA/rxe: Convert the triple tasklets to workqueues Daisuke Matsuda
@ 2022-09-09 19:39   ` Bob Pearson
  2022-09-12  8:27     ` matsuda-daisuke
  2022-09-11  7:10   ` Yanjun Zhu
  1 sibling, 1 reply; 25+ messages in thread
From: Bob Pearson @ 2022-09-09 19:39 UTC (permalink / raw)
  To: Daisuke Matsuda, linux-rdma, leonro, jgg, zyjzyj2000, Ziemba,
	Ian, Frank Zago
  Cc: nvdimm, linux-kernel, yangx.jy, lizhijian, y-goto

On 9/6/22 21:43, Daisuke Matsuda wrote:
> In order to implement On-Demand Paging on the rxe driver, triple tasklets
> (requester, responder, and completer) must be allowed to sleep so that they
> can trigger page fault when pages being accessed are not present.
> 
> This patch replaces the tasklets with workqueues, but still allows direct-
> call of works from softirq context if it is obvious that MRs are not going
> to be accessed and there is no work being processed in the workqueue.
> 
> As counterparts to tasklet_disable() and tasklet_enable() are missing for
> workqueues, an atomic value is introduced to get works suspended while qp
> reset is in progress.
> 
> As a reference, performance change was measured using ib_send_bw and
> ib_send_lat commands over RC connection. Both the client and the server
> were placed on the same KVM host. An option "-n 100000" was given to the
> respective commands to iterate over 100000 times.
> 
> Before applying this patch:
> [ib_send_bw]
> ---------------------------------------------------------------------------------------
>  #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
>  65536      100000           0.00               203.13             0.003250
> ---------------------------------------------------------------------------------------
> [ib_send_lat]
> ---------------------------------------------------------------------------------------
>  #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]    t_avg[usec]    t_stdev[usec]   99% percentile[usec]   99.9% percentile[usec]
>  2       100000          30.91          1519.05      34.23             36.06            5.15            48.49                   82.37
> ---------------------------------------------------------------------------------------
> 
> After applying this patch:
> [ib_send_bw]
> ---------------------------------------------------------------------------------------
>  #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
>  65536      100000           0.00               240.88             0.003854
> ---------------------------------------------------------------------------------------
> [ib_send_lat]
> ---------------------------------------------------------------------------------------
>  #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]    t_avg[usec]    t_stdev[usec]   99% percentile[usec]   99.9% percentile[usec]
>  2       100000          40.88          2994.82      47.80             50.25            13.70           76.42                   185.04
> ---------------------------------------------------------------------------------------
> 
> The test was conducted 3 times for each kernel, and the results with median
> "BW average" and "t_typical" are shown above. It shows the conversion
> improves the bandwidth while causing higher latency.
> 
> Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
> ---
>  drivers/infiniband/sw/rxe/Makefile    |   2 +-
>  drivers/infiniband/sw/rxe/rxe_comp.c  |  42 ++++---
>  drivers/infiniband/sw/rxe/rxe_loc.h   |   2 +-
>  drivers/infiniband/sw/rxe/rxe_net.c   |   4 +-
>  drivers/infiniband/sw/rxe/rxe_param.h |   2 +-
>  drivers/infiniband/sw/rxe/rxe_qp.c    |  68 +++++------
>  drivers/infiniband/sw/rxe/rxe_recv.c  |   2 +-
>  drivers/infiniband/sw/rxe/rxe_req.c   |  14 +--
>  drivers/infiniband/sw/rxe/rxe_resp.c  |  16 +--
>  drivers/infiniband/sw/rxe/rxe_task.c  | 152 ------------------------
>  drivers/infiniband/sw/rxe/rxe_task.h  |  69 -----------
>  drivers/infiniband/sw/rxe/rxe_verbs.c |   8 +-
>  drivers/infiniband/sw/rxe/rxe_verbs.h |   8 +-
>  drivers/infiniband/sw/rxe/rxe_wq.c    | 161 ++++++++++++++++++++++++++
>  drivers/infiniband/sw/rxe/rxe_wq.h    |  71 ++++++++++++
>  15 files changed, 322 insertions(+), 299 deletions(-)
>  delete mode 100644 drivers/infiniband/sw/rxe/rxe_task.c
>  delete mode 100644 drivers/infiniband/sw/rxe/rxe_task.h
>  create mode 100644 drivers/infiniband/sw/rxe/rxe_wq.c
>  create mode 100644 drivers/infiniband/sw/rxe/rxe_wq.h
> 
> diff --git a/drivers/infiniband/sw/rxe/Makefile b/drivers/infiniband/sw/rxe/Makefile
> index 5395a581f4bb..358f6b06aa64 100644
> --- a/drivers/infiniband/sw/rxe/Makefile
> +++ b/drivers/infiniband/sw/rxe/Makefile
> @@ -20,6 +20,6 @@ rdma_rxe-y := \
>  	rxe_mmap.o \
>  	rxe_icrc.o \
>  	rxe_mcast.o \
> -	rxe_task.o \
> +	rxe_wq.o \
>  	rxe_net.o \
>  	rxe_hw_counters.o
> diff --git a/drivers/infiniband/sw/rxe/rxe_comp.c b/drivers/infiniband/sw/rxe/rxe_comp.c
> index fb0c008af78c..0348da06c4dd 100644
> --- a/drivers/infiniband/sw/rxe/rxe_comp.c
> +++ b/drivers/infiniband/sw/rxe/rxe_comp.c
> @@ -9,7 +9,7 @@
>  #include "rxe.h"
>  #include "rxe_loc.h"
>  #include "rxe_queue.h"
> -#include "rxe_task.h"
> +#include "rxe_wq.h"
>  
>  enum comp_state {
>  	COMPST_GET_ACK,
> @@ -118,21 +118,37 @@ void retransmit_timer(struct timer_list *t)
>  
>  	if (qp->valid) {
>  		qp->comp.timeout = 1;
> -		rxe_run_task(&qp->comp.task, 1);
> +		rxe_run_work(&qp->comp.work, 1);
>  	}
>  }
>  
> -void rxe_comp_queue_pkt(struct rxe_qp *qp, struct sk_buff *skb)
> +void rxe_comp_queue_pkt(struct rxe_pkt_info *pkt, struct sk_buff *skb)
>  {
> +	struct rxe_qp *qp = pkt->qp;
>  	int must_sched;
>  
>  	skb_queue_tail(&qp->resp_pkts, skb);
>  
> -	must_sched = skb_queue_len(&qp->resp_pkts) > 1;
> +	/* Schedule a workqueue when processing READ and ATOMIC acks.
> +	 * In these cases, completer may sleep to access ODP-enabled MRs.
> +	 */
> +	switch (pkt->opcode) {
> +	case IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST:
> +	case IB_OPCODE_RC_RDMA_READ_RESPONSE_LAST:
> +	case IB_OPCODE_RC_RDMA_READ_RESPONSE_ONLY:
> +	case IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE:
> +	case IB_OPCODE_RC_ATOMIC_ACKNOWLEDGE:
> +		must_sched = 1;
> +		break;
> +
> +	default:
> +		must_sched = skb_queue_len(&qp->resp_pkts) > 1;
> +	}
> +
>  	if (must_sched != 0)
>  		rxe_counter_inc(SKB_TO_PKT(skb)->rxe, RXE_CNT_COMPLETER_SCHED);
>  
> -	rxe_run_task(&qp->comp.task, must_sched);
> +	rxe_run_work(&qp->comp.work, must_sched);
>  }
>  
>  static inline enum comp_state get_wqe(struct rxe_qp *qp,
> @@ -305,7 +321,7 @@ static inline enum comp_state check_ack(struct rxe_qp *qp,
>  					qp->comp.psn = pkt->psn;
>  					if (qp->req.wait_psn) {
>  						qp->req.wait_psn = 0;
> -						rxe_run_task(&qp->req.task, 0);
> +						rxe_run_work(&qp->req.work, 0);
>  					}
>  				}
>  				return COMPST_ERROR_RETRY;
> @@ -452,7 +468,7 @@ static void do_complete(struct rxe_qp *qp, struct rxe_send_wqe *wqe)
>  	 */
>  	if (qp->req.wait_fence) {
>  		qp->req.wait_fence = 0;
> -		rxe_run_task(&qp->req.task, 0);
> +		rxe_run_work(&qp->req.work, 0);
>  	}
>  }
>  
> @@ -466,7 +482,7 @@ static inline enum comp_state complete_ack(struct rxe_qp *qp,
>  		if (qp->req.need_rd_atomic) {
>  			qp->comp.timeout_retry = 0;
>  			qp->req.need_rd_atomic = 0;
> -			rxe_run_task(&qp->req.task, 0);
> +			rxe_run_work(&qp->req.work, 0);
>  		}
>  	}
>  
> @@ -512,7 +528,7 @@ static inline enum comp_state complete_wqe(struct rxe_qp *qp,
>  
>  		if (qp->req.wait_psn) {
>  			qp->req.wait_psn = 0;
> -			rxe_run_task(&qp->req.task, 1);
> +			rxe_run_work(&qp->req.work, 1);
>  		}
>  	}
>  
> @@ -646,7 +662,7 @@ int rxe_completer(void *arg)
>  
>  			if (qp->req.wait_psn) {
>  				qp->req.wait_psn = 0;
> -				rxe_run_task(&qp->req.task, 1);
> +				rxe_run_work(&qp->req.work, 1);
>  			}
>  
>  			state = COMPST_DONE;
> @@ -714,7 +730,7 @@ int rxe_completer(void *arg)
>  							RXE_CNT_COMP_RETRY);
>  					qp->req.need_retry = 1;
>  					qp->comp.started_retry = 1;
> -					rxe_run_task(&qp->req.task, 0);
> +					rxe_run_work(&qp->req.work, 0);
>  				}
>  				goto done;
>  
> @@ -757,8 +773,8 @@ int rxe_completer(void *arg)
>  		}
>  	}
>  
> -	/* A non-zero return value will cause rxe_do_task to
> -	 * exit its loop and end the tasklet. A zero return
> +	/* A non-zero return value will cause rxe_do_work to
> +	 * exit its loop and end the work. A zero return
>  	 * will continue looping and return to rxe_completer
>  	 */
>  done:
> diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
> index 22f6cc31d1d6..0f8cb9e38cc9 100644
> --- a/drivers/infiniband/sw/rxe/rxe_loc.h
> +++ b/drivers/infiniband/sw/rxe/rxe_loc.h
> @@ -181,7 +181,7 @@ void rxe_icrc_generate(struct sk_buff *skb, struct rxe_pkt_info *pkt);
>  
>  void rxe_resp_queue_pkt(struct rxe_qp *qp, struct sk_buff *skb);
>  
> -void rxe_comp_queue_pkt(struct rxe_qp *qp, struct sk_buff *skb);
> +void rxe_comp_queue_pkt(struct rxe_pkt_info *pkt, struct sk_buff *skb);
>  
>  static inline unsigned int wr_opcode_mask(int opcode, struct rxe_qp *qp)
>  {
> diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
> index c53f4529f098..5526970882c8 100644
> --- a/drivers/infiniband/sw/rxe/rxe_net.c
> +++ b/drivers/infiniband/sw/rxe/rxe_net.c
> @@ -346,7 +346,7 @@ static void rxe_skb_tx_dtor(struct sk_buff *skb)
>  
>  	if (unlikely(qp->need_req_skb &&
>  		     skb_out < RXE_INFLIGHT_SKBS_PER_QP_LOW))
> -		rxe_run_task(&qp->req.task, 1);
> +		rxe_run_work(&qp->req.work, 1);
>  
>  	rxe_put(qp);
>  }
> @@ -430,7 +430,7 @@ int rxe_xmit_packet(struct rxe_qp *qp, struct rxe_pkt_info *pkt,
>  	if ((qp_type(qp) != IB_QPT_RC) &&
>  	    (pkt->mask & RXE_END_MASK)) {
>  		pkt->wqe->state = wqe_state_done;
> -		rxe_run_task(&qp->comp.task, 1);
> +		rxe_run_work(&qp->comp.work, 1);
>  	}
>  
>  	rxe_counter_inc(rxe, RXE_CNT_SENT_PKTS);
> diff --git a/drivers/infiniband/sw/rxe/rxe_param.h b/drivers/infiniband/sw/rxe/rxe_param.h
> index 86c7a8bf3cbb..1c0251812fc8 100644
> --- a/drivers/infiniband/sw/rxe/rxe_param.h
> +++ b/drivers/infiniband/sw/rxe/rxe_param.h
> @@ -105,7 +105,7 @@ enum rxe_device_param {
>  	RXE_INFLIGHT_SKBS_PER_QP_HIGH	= 64,
>  	RXE_INFLIGHT_SKBS_PER_QP_LOW	= 16,
>  
> -	/* Max number of interations of each tasklet
> +	/* Max number of interations of each work
>  	 * before yielding the cpu to let other
>  	 * work make progress
>  	 */
> diff --git a/drivers/infiniband/sw/rxe/rxe_qp.c b/drivers/infiniband/sw/rxe/rxe_qp.c
> index 516bf9b95e48..c7c7acd4ecfa 100644
> --- a/drivers/infiniband/sw/rxe/rxe_qp.c
> +++ b/drivers/infiniband/sw/rxe/rxe_qp.c
> @@ -13,7 +13,7 @@
>  #include "rxe.h"
>  #include "rxe_loc.h"
>  #include "rxe_queue.h"
> -#include "rxe_task.h"
> +#include "rxe_wq.h"
>  
>  static int rxe_qp_chk_cap(struct rxe_dev *rxe, struct ib_qp_cap *cap,
>  			  int has_srq)
> @@ -172,9 +172,9 @@ static void rxe_qp_init_misc(struct rxe_dev *rxe, struct rxe_qp *qp,
>  
>  	spin_lock_init(&qp->state_lock);
>  
> -	spin_lock_init(&qp->req.task.state_lock);
> -	spin_lock_init(&qp->resp.task.state_lock);
> -	spin_lock_init(&qp->comp.task.state_lock);
> +	spin_lock_init(&qp->req.work.state_lock);
> +	spin_lock_init(&qp->resp.work.state_lock);
> +	spin_lock_init(&qp->comp.work.state_lock);
>  
>  	spin_lock_init(&qp->sq.sq_lock);
>  	spin_lock_init(&qp->rq.producer_lock);
> @@ -242,10 +242,10 @@ static int rxe_qp_init_req(struct rxe_dev *rxe, struct rxe_qp *qp,
>  
>  	skb_queue_head_init(&qp->req_pkts);
>  
> -	rxe_init_task(rxe, &qp->req.task, qp,
> -		      rxe_requester, "req");
> -	rxe_init_task(rxe, &qp->comp.task, qp,
> -		      rxe_completer, "comp");
> +	rxe_init_work(rxe, &qp->req.work, qp,
> +		      rxe_requester, "rxe_req");
> +	rxe_init_work(rxe, &qp->comp.work, qp,
> +		      rxe_completer, "rxe_comp");
>  
>  	qp->qp_timeout_jiffies = 0; /* Can't be set for UD/UC in modify_qp */
>  	if (init->qp_type == IB_QPT_RC) {
> @@ -292,8 +292,8 @@ static int rxe_qp_init_resp(struct rxe_dev *rxe, struct rxe_qp *qp,
>  
>  	skb_queue_head_init(&qp->resp_pkts);
>  
> -	rxe_init_task(rxe, &qp->resp.task, qp,
> -		      rxe_responder, "resp");
> +	rxe_init_work(rxe, &qp->resp.work, qp,
> +		      rxe_responder, "rxe_resp");
>  
>  	qp->resp.opcode		= OPCODE_NONE;
>  	qp->resp.msn		= 0;
> @@ -481,14 +481,14 @@ int rxe_qp_chk_attr(struct rxe_dev *rxe, struct rxe_qp *qp,
>  /* move the qp to the reset state */
>  static void rxe_qp_reset(struct rxe_qp *qp)
>  {
> -	/* stop tasks from running */
> -	rxe_disable_task(&qp->resp.task);
> +	/* flush workqueue and stop works from running */
> +	rxe_disable_work(&qp->resp.work);
>  
>  	/* stop request/comp */
>  	if (qp->sq.queue) {
>  		if (qp_type(qp) == IB_QPT_RC)
> -			rxe_disable_task(&qp->comp.task);
> -		rxe_disable_task(&qp->req.task);
> +			rxe_disable_work(&qp->comp.work);
> +		rxe_disable_work(&qp->req.work);
>  	}
>  
>  	/* move qp to the reset state */
> @@ -499,11 +499,11 @@ static void rxe_qp_reset(struct rxe_qp *qp)
>  	/* let state machines reset themselves drain work and packet queues
>  	 * etc.
>  	 */
> -	__rxe_do_task(&qp->resp.task);
> +	__rxe_do_work(&qp->resp.work);
>  
>  	if (qp->sq.queue) {
> -		__rxe_do_task(&qp->comp.task);
> -		__rxe_do_task(&qp->req.task);
> +		__rxe_do_work(&qp->comp.work);
> +		__rxe_do_work(&qp->req.work);
>  		rxe_queue_reset(qp->sq.queue);
>  	}
>  
> @@ -526,14 +526,14 @@ static void rxe_qp_reset(struct rxe_qp *qp)
>  
>  	cleanup_rd_atomic_resources(qp);
>  
> -	/* reenable tasks */
> -	rxe_enable_task(&qp->resp.task);
> +	/* reenable workqueue */
> +	rxe_enable_work(&qp->resp.work);
>  
>  	if (qp->sq.queue) {
>  		if (qp_type(qp) == IB_QPT_RC)
> -			rxe_enable_task(&qp->comp.task);
> +			rxe_enable_work(&qp->comp.work);
>  
> -		rxe_enable_task(&qp->req.task);
> +		rxe_enable_work(&qp->req.work);
>  	}
>  }
>  
> @@ -544,10 +544,10 @@ static void rxe_qp_drain(struct rxe_qp *qp)
>  		if (qp->req.state != QP_STATE_DRAINED) {
>  			qp->req.state = QP_STATE_DRAIN;
>  			if (qp_type(qp) == IB_QPT_RC)
> -				rxe_run_task(&qp->comp.task, 1);
> +				rxe_run_work(&qp->comp.work, 1);
>  			else
> -				__rxe_do_task(&qp->comp.task);
> -			rxe_run_task(&qp->req.task, 1);
> +				__rxe_do_work(&qp->comp.work);
> +			rxe_run_work(&qp->req.work, 1);
>  		}
>  	}
>  }
> @@ -561,13 +561,13 @@ void rxe_qp_error(struct rxe_qp *qp)
>  	qp->attr.qp_state = IB_QPS_ERR;
>  
>  	/* drain work and packet queues */
> -	rxe_run_task(&qp->resp.task, 1);
> +	rxe_run_work(&qp->resp.work, 1);
>  
>  	if (qp_type(qp) == IB_QPT_RC)
> -		rxe_run_task(&qp->comp.task, 1);
> +		rxe_run_work(&qp->comp.work, 1);
>  	else
> -		__rxe_do_task(&qp->comp.task);
> -	rxe_run_task(&qp->req.task, 1);
> +		__rxe_do_work(&qp->comp.work);
> +	rxe_run_work(&qp->req.work, 1);
>  }
>  
>  /* called by the modify qp verb */
> @@ -786,21 +786,21 @@ static void rxe_qp_do_cleanup(struct work_struct *work)
>  
>  	qp->valid = 0;
>  	qp->qp_timeout_jiffies = 0;
> -	rxe_cleanup_task(&qp->resp.task);
> +	rxe_cleanup_work(&qp->resp.work);
>  
>  	if (qp_type(qp) == IB_QPT_RC) {
>  		del_timer_sync(&qp->retrans_timer);
>  		del_timer_sync(&qp->rnr_nak_timer);
>  	}
>  
> -	rxe_cleanup_task(&qp->req.task);
> -	rxe_cleanup_task(&qp->comp.task);
> +	rxe_cleanup_work(&qp->req.work);
> +	rxe_cleanup_work(&qp->comp.work);
>  
>  	/* flush out any receive wr's or pending requests */
> -	__rxe_do_task(&qp->req.task);
> +	__rxe_do_work(&qp->req.work);
>  	if (qp->sq.queue) {
> -		__rxe_do_task(&qp->comp.task);
> -		__rxe_do_task(&qp->req.task);
> +		__rxe_do_work(&qp->comp.work);
> +		__rxe_do_work(&qp->req.work);
>  	}
>  
>  	if (qp->sq.queue)
> diff --git a/drivers/infiniband/sw/rxe/rxe_recv.c b/drivers/infiniband/sw/rxe/rxe_recv.c
> index f3ad7b6dbd97..3b525e9d903e 100644
> --- a/drivers/infiniband/sw/rxe/rxe_recv.c
> +++ b/drivers/infiniband/sw/rxe/rxe_recv.c
> @@ -226,7 +226,7 @@ static inline void rxe_rcv_pkt(struct rxe_pkt_info *pkt, struct sk_buff *skb)
>  	if (pkt->mask & RXE_REQ_MASK)
>  		rxe_resp_queue_pkt(pkt->qp, skb);
>  	else
> -		rxe_comp_queue_pkt(pkt->qp, skb);
> +		rxe_comp_queue_pkt(pkt, skb);
>  }
>  
>  static void rxe_rcv_mcast_pkt(struct rxe_dev *rxe, struct sk_buff *skb)
> diff --git a/drivers/infiniband/sw/rxe/rxe_req.c b/drivers/infiniband/sw/rxe/rxe_req.c
> index f63771207970..bd05dd3de499 100644
> --- a/drivers/infiniband/sw/rxe/rxe_req.c
> +++ b/drivers/infiniband/sw/rxe/rxe_req.c
> @@ -105,7 +105,7 @@ void rnr_nak_timer(struct timer_list *t)
>  	/* request a send queue retry */
>  	qp->req.need_retry = 1;
>  	qp->req.wait_for_rnr_timer = 0;
> -	rxe_run_task(&qp->req.task, 1);
> +	rxe_run_work(&qp->req.work, 1);
>  }
>  
>  static struct rxe_send_wqe *req_next_wqe(struct rxe_qp *qp)
> @@ -608,7 +608,7 @@ static int rxe_do_local_ops(struct rxe_qp *qp, struct rxe_send_wqe *wqe)
>  	 * which can lead to a deadlock. So go ahead and complete
>  	 * it now.
>  	 */
> -	rxe_run_task(&qp->comp.task, 1);
> +	rxe_run_work(&qp->comp.work, 1);
>  
>  	return 0;
>  }
> @@ -733,7 +733,7 @@ int rxe_requester(void *arg)
>  						       qp->req.wqe_index);
>  			wqe->state = wqe_state_done;
>  			wqe->status = IB_WC_SUCCESS;
> -			rxe_run_task(&qp->comp.task, 0);
> +			rxe_run_work(&qp->comp.work, 0);
>  			goto done;
>  		}
>  		payload = mtu;
> @@ -795,7 +795,7 @@ int rxe_requester(void *arg)
>  		rollback_state(wqe, qp, &rollback_wqe, rollback_psn);
>  
>  		if (err == -EAGAIN) {
> -			rxe_run_task(&qp->req.task, 1);
> +			rxe_run_work(&qp->req.work, 1);
>  			goto exit;
>  		}
>  
> @@ -805,8 +805,8 @@ int rxe_requester(void *arg)
>  
>  	update_state(qp, &pkt);
>  
> -	/* A non-zero return value will cause rxe_do_task to
> -	 * exit its loop and end the tasklet. A zero return
> +	/* A non-zero return value will cause rxe_do_work to
> +	 * exit its loop and end the work. A zero return
>  	 * will continue looping and return to rxe_requester
>  	 */
>  done:
> @@ -817,7 +817,7 @@ int rxe_requester(void *arg)
>  	qp->req.wqe_index = queue_next_index(qp->sq.queue, qp->req.wqe_index);
>  	wqe->state = wqe_state_error;
>  	qp->req.state = QP_STATE_ERROR;
> -	rxe_run_task(&qp->comp.task, 0);
> +	rxe_run_work(&qp->comp.work, 0);
>  exit:
>  	ret = -EAGAIN;
>  out:
> diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
> index b36ec5c4d5e0..e97c55b292f0 100644
> --- a/drivers/infiniband/sw/rxe/rxe_resp.c
> +++ b/drivers/infiniband/sw/rxe/rxe_resp.c
> @@ -83,15 +83,11 @@ static char *resp_state_name[] = {
>  /* rxe_recv calls here to add a request packet to the input queue */
>  void rxe_resp_queue_pkt(struct rxe_qp *qp, struct sk_buff *skb)
>  {
> -	int must_sched;
> -	struct rxe_pkt_info *pkt = SKB_TO_PKT(skb);
> -
> +	/* responder can sleep to access ODP-enabled MRs, so schedule the
> +	 * workqueue for all incoming requests.
> +	 */
>  	skb_queue_tail(&qp->req_pkts, skb);
> -
> -	must_sched = (pkt->opcode == IB_OPCODE_RC_RDMA_READ_REQUEST) ||
> -			(skb_queue_len(&qp->req_pkts) > 1);
> -
> -	rxe_run_task(&qp->resp.task, must_sched);
> +	rxe_run_work(&qp->resp.work, 1);
>  }
>  
>  static inline enum resp_states get_req(struct rxe_qp *qp,
> @@ -1464,8 +1460,8 @@ int rxe_responder(void *arg)
>  		}
>  	}
>  
> -	/* A non-zero return value will cause rxe_do_task to
> -	 * exit its loop and end the tasklet. A zero return
> +	/* A non-zero return value will cause rxe_do_work to
> +	 * exit its loop and end the work. A zero return
>  	 * will continue looping and return to rxe_responder
>  	 */
>  done:
> diff --git a/drivers/infiniband/sw/rxe/rxe_task.c b/drivers/infiniband/sw/rxe/rxe_task.c
> deleted file mode 100644
> index 2248cf33d776..000000000000
> --- a/drivers/infiniband/sw/rxe/rxe_task.c
> +++ /dev/null
> @@ -1,152 +0,0 @@
> -// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
> -/*
> - * Copyright (c) 2016 Mellanox Technologies Ltd. All rights reserved.
> - * Copyright (c) 2015 System Fabric Works, Inc. All rights reserved.
> - */
> -
> -#include <linux/kernel.h>
> -#include <linux/interrupt.h>
> -#include <linux/hardirq.h>
> -
> -#include "rxe.h"
> -
> -int __rxe_do_task(struct rxe_task *task)
> -
> -{
> -	int ret;
> -
> -	while ((ret = task->func(task->arg)) == 0)
> -		;
> -
> -	task->ret = ret;
> -
> -	return ret;
> -}
> -
> -/*
> - * this locking is due to a potential race where
> - * a second caller finds the task already running
> - * but looks just after the last call to func
> - */
> -void rxe_do_task(struct tasklet_struct *t)
> -{
> -	int cont;
> -	int ret;
> -	struct rxe_task *task = from_tasklet(task, t, tasklet);
> -	unsigned int iterations = RXE_MAX_ITERATIONS;
> -
> -	spin_lock_bh(&task->state_lock);
> -	switch (task->state) {
> -	case TASK_STATE_START:
> -		task->state = TASK_STATE_BUSY;
> -		spin_unlock_bh(&task->state_lock);
> -		break;
> -
> -	case TASK_STATE_BUSY:
> -		task->state = TASK_STATE_ARMED;
> -		fallthrough;
> -	case TASK_STATE_ARMED:
> -		spin_unlock_bh(&task->state_lock);
> -		return;
> -
> -	default:
> -		spin_unlock_bh(&task->state_lock);
> -		pr_warn("%s failed with bad state %d\n", __func__, task->state);
> -		return;
> -	}
> -
> -	do {
> -		cont = 0;
> -		ret = task->func(task->arg);
> -
> -		spin_lock_bh(&task->state_lock);
> -		switch (task->state) {
> -		case TASK_STATE_BUSY:
> -			if (ret) {
> -				task->state = TASK_STATE_START;
> -			} else if (iterations--) {
> -				cont = 1;
> -			} else {
> -				/* reschedule the tasklet and exit
> -				 * the loop to give up the cpu
> -				 */
> -				tasklet_schedule(&task->tasklet);
> -				task->state = TASK_STATE_START;
> -			}
> -			break;
> -
> -		/* someone tried to run the task since the last time we called
> -		 * func, so we will call one more time regardless of the
> -		 * return value
> -		 */
> -		case TASK_STATE_ARMED:
> -			task->state = TASK_STATE_BUSY;
> -			cont = 1;
> -			break;
> -
> -		default:
> -			pr_warn("%s failed with bad state %d\n", __func__,
> -				task->state);
> -		}
> -		spin_unlock_bh(&task->state_lock);
> -	} while (cont);
> -
> -	task->ret = ret;
> -}
> -
> -int rxe_init_task(void *obj, struct rxe_task *task,
> -		  void *arg, int (*func)(void *), char *name)
> -{
> -	task->obj	= obj;
> -	task->arg	= arg;
> -	task->func	= func;
> -	snprintf(task->name, sizeof(task->name), "%s", name);
> -	task->destroyed	= false;
> -
> -	tasklet_setup(&task->tasklet, rxe_do_task);
> -
> -	task->state = TASK_STATE_START;
> -	spin_lock_init(&task->state_lock);
> -
> -	return 0;
> -}
> -
> -void rxe_cleanup_task(struct rxe_task *task)
> -{
> -	bool idle;
> -
> -	/*
> -	 * Mark the task, then wait for it to finish. It might be
> -	 * running in a non-tasklet (direct call) context.
> -	 */
> -	task->destroyed = true;
> -
> -	do {
> -		spin_lock_bh(&task->state_lock);
> -		idle = (task->state == TASK_STATE_START);
> -		spin_unlock_bh(&task->state_lock);
> -	} while (!idle);
> -
> -	tasklet_kill(&task->tasklet);
> -}
> -
> -void rxe_run_task(struct rxe_task *task, int sched)
> -{
> -	if (task->destroyed)
> -		return;
> -
> -	if (sched)
> -		tasklet_schedule(&task->tasklet);
> -	else
> -		rxe_do_task(&task->tasklet);
> -}
> -
> -void rxe_disable_task(struct rxe_task *task)
> -{
> -	tasklet_disable(&task->tasklet);
> -}
> -
> -void rxe_enable_task(struct rxe_task *task)
> -{
> -	tasklet_enable(&task->tasklet);
> -}
> diff --git a/drivers/infiniband/sw/rxe/rxe_task.h b/drivers/infiniband/sw/rxe/rxe_task.h
> deleted file mode 100644
> index 11d183fd3338..000000000000
> --- a/drivers/infiniband/sw/rxe/rxe_task.h
> +++ /dev/null
> @@ -1,69 +0,0 @@
> -/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
> -/*
> - * Copyright (c) 2016 Mellanox Technologies Ltd. All rights reserved.
> - * Copyright (c) 2015 System Fabric Works, Inc. All rights reserved.
> - */
> -
> -#ifndef RXE_TASK_H
> -#define RXE_TASK_H
> -
> -enum {
> -	TASK_STATE_START	= 0,
> -	TASK_STATE_BUSY		= 1,
> -	TASK_STATE_ARMED	= 2,
> -};
> -
> -/*
> - * data structure to describe a 'task' which is a short
> - * function that returns 0 as long as it needs to be
> - * called again.
> - */
> -struct rxe_task {
> -	void			*obj;
> -	struct tasklet_struct	tasklet;
> -	int			state;
> -	spinlock_t		state_lock; /* spinlock for task state */
> -	void			*arg;
> -	int			(*func)(void *arg);
> -	int			ret;
> -	char			name[16];
> -	bool			destroyed;
> -};
> -
> -/*
> - * init rxe_task structure
> - *	arg  => parameter to pass to fcn
> - *	func => function to call until it returns != 0
> - */
> -int rxe_init_task(void *obj, struct rxe_task *task,
> -		  void *arg, int (*func)(void *), char *name);
> -
> -/* cleanup task */
> -void rxe_cleanup_task(struct rxe_task *task);
> -
> -/*
> - * raw call to func in loop without any checking
> - * can call when tasklets are disabled
> - */
> -int __rxe_do_task(struct rxe_task *task);
> -
> -/*
> - * common function called by any of the main tasklets
> - * If there is any chance that there is additional
> - * work to do someone must reschedule the task before
> - * leaving
> - */
> -void rxe_do_task(struct tasklet_struct *t);
> -
> -/* run a task, else schedule it to run as a tasklet, The decision
> - * to run or schedule tasklet is based on the parameter sched.
> - */
> -void rxe_run_task(struct rxe_task *task, int sched);
> -
> -/* keep a task from scheduling */
> -void rxe_disable_task(struct rxe_task *task);
> -
> -/* allow task to run */
> -void rxe_enable_task(struct rxe_task *task);
> -
> -#endif /* RXE_TASK_H */
> diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.c b/drivers/infiniband/sw/rxe/rxe_verbs.c
> index 9ebe9decad34..7510f25c5ea3 100644
> --- a/drivers/infiniband/sw/rxe/rxe_verbs.c
> +++ b/drivers/infiniband/sw/rxe/rxe_verbs.c
> @@ -697,9 +697,9 @@ static int rxe_post_send_kernel(struct rxe_qp *qp, const struct ib_send_wr *wr,
>  		wr = next;
>  	}
>  
> -	rxe_run_task(&qp->req.task, 1);
> +	rxe_run_work(&qp->req.work, 1);
>  	if (unlikely(qp->req.state == QP_STATE_ERROR))
> -		rxe_run_task(&qp->comp.task, 1);
> +		rxe_run_work(&qp->comp.work, 1);
>  
>  	return err;
>  }
> @@ -721,7 +721,7 @@ static int rxe_post_send(struct ib_qp *ibqp, const struct ib_send_wr *wr,
>  
>  	if (qp->is_user) {
>  		/* Utilize process context to do protocol processing */
> -		rxe_run_task(&qp->req.task, 0);
> +		rxe_run_work(&qp->req.work, 0);
>  		return 0;
>  	} else
>  		return rxe_post_send_kernel(qp, wr, bad_wr);
> @@ -761,7 +761,7 @@ static int rxe_post_recv(struct ib_qp *ibqp, const struct ib_recv_wr *wr,
>  	spin_unlock_irqrestore(&rq->producer_lock, flags);
>  
>  	if (qp->resp.state == QP_STATE_ERROR)
> -		rxe_run_task(&qp->resp.task, 1);
> +		rxe_run_work(&qp->resp.work, 1);
>  
>  err1:
>  	return err;
> diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
> index a51819d0c345..b09b4cb9897a 100644
> --- a/drivers/infiniband/sw/rxe/rxe_verbs.h
> +++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
> @@ -10,7 +10,7 @@
>  #include <linux/interrupt.h>
>  #include <linux/workqueue.h>
>  #include "rxe_pool.h"
> -#include "rxe_task.h"
> +#include "rxe_wq.h"
>  #include "rxe_hw_counters.h"
>  
>  static inline int pkey_match(u16 key1, u16 key2)
> @@ -125,7 +125,7 @@ struct rxe_req_info {
>  	int			need_retry;
>  	int			wait_for_rnr_timer;
>  	int			noack_pkts;
> -	struct rxe_task		task;
> +	struct rxe_work		work;
>  };
>  
>  struct rxe_comp_info {
> @@ -137,7 +137,7 @@ struct rxe_comp_info {
>  	int			started_retry;
>  	u32			retry_cnt;
>  	u32			rnr_retry;
> -	struct rxe_task		task;
> +	struct rxe_work		work;
>  };
>  
>  enum rdatm_res_state {
> @@ -204,7 +204,7 @@ struct rxe_resp_info {
>  	unsigned int		res_head;
>  	unsigned int		res_tail;
>  	struct resp_res		*res;
> -	struct rxe_task		task;
> +	struct rxe_work		work;
>  };
>  
>  struct rxe_qp {
> diff --git a/drivers/infiniband/sw/rxe/rxe_wq.c b/drivers/infiniband/sw/rxe/rxe_wq.c
> new file mode 100644
> index 000000000000..8b35186eac85
> --- /dev/null
> +++ b/drivers/infiniband/sw/rxe/rxe_wq.c
> @@ -0,0 +1,161 @@
> +// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
> +/*
> + * Copyright (c) 2016 Mellanox Technologies Ltd. All rights reserved.
> + * Copyright (c) 2015 System Fabric Works, Inc. All rights reserved.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/interrupt.h>
> +#include <linux/hardirq.h>
> +
> +#include "rxe.h"
> +
> +int __rxe_do_work(struct rxe_work *work)
> +
> +{
> +	int ret;
> +
> +	while ((ret = work->func(work->arg)) == 0)
> +		;
> +
> +	work->ret = ret;
> +
> +	return ret;
> +}
> +
> +/*
> + * this locking is due to a potential race where
> + * a second caller finds the work already running
> + * but looks just after the last call to func
> + */
> +void rxe_do_work(struct work_struct *w)
> +{
> +	int cont;
> +	int ret;
> +
> +	struct rxe_work *work = container_of(w, typeof(*work), work);
> +	unsigned int iterations = RXE_MAX_ITERATIONS;
> +
> +	spin_lock_bh(&work->state_lock);
> +	switch (work->state) {
> +	case WQ_STATE_START:
> +		work->state = WQ_STATE_BUSY;
> +		spin_unlock_bh(&work->state_lock);
> +		break;
> +
> +	case WQ_STATE_BUSY:
> +		work->state = WQ_STATE_ARMED;
> +		fallthrough;
> +	case WQ_STATE_ARMED:
> +		spin_unlock_bh(&work->state_lock);
> +		return;
> +
> +	default:
> +		spin_unlock_bh(&work->state_lock);
> +		pr_warn("%s failed with bad state %d\n", __func__, work->state);
> +		return;
> +	}
> +
> +	do {
> +		cont = 0;
> +		ret = work->func(work->arg);
> +
> +		spin_lock_bh(&work->state_lock);
> +		switch (work->state) {
> +		case WQ_STATE_BUSY:
> +			if (ret) {
> +				work->state = WQ_STATE_START;
> +			} else if (iterations--) {
> +				cont = 1;
> +			} else {
> +				/* reschedule the work and exit
> +				 * the loop to give up the cpu
> +				 */
> +				queue_work(work->worker, &work->work);
> +				work->state = WQ_STATE_START;
> +			}
> +			break;
> +
> +		/* someone tried to run the work since the last time we called
> +		 * func, so we will call one more time regardless of the
> +		 * return value
> +		 */
> +		case WQ_STATE_ARMED:
> +			work->state = WQ_STATE_BUSY;
> +			cont = 1;
> +			break;
> +
> +		default:
> +			pr_warn("%s failed with bad state %d\n", __func__,
> +				work->state);
> +		}
> +		spin_unlock_bh(&work->state_lock);
> +	} while (cont);
> +
> +	work->ret = ret;
> +}
> +
> +int rxe_init_work(void *obj, struct rxe_work *work,
> +		  void *arg, int (*func)(void *), char *name)
> +{
> +	work->obj	= obj;
> +	work->arg	= arg;
> +	work->func	= func;
> +	snprintf(work->name, sizeof(work->name), "%s", name);
> +	work->destroyed	= false;
> +	atomic_set(&work->suspended, 0);
> +
> +	work->worker = create_singlethread_workqueue(name);
> +	INIT_WORK(&work->work, rxe_do_work);
> +
> +	work->state = WQ_STATE_START;
> +	spin_lock_init(&work->state_lock);
> +
> +	return 0;
> +}
> +
> +void rxe_cleanup_work(struct rxe_work *work)
> +{
> +	bool idle;
> +
> +	/*
> +	 * Mark the work, then wait for it to finish. It might be
> +	 * running in a non-workqueue (direct call) context.
> +	 */
> +	work->destroyed = true;
> +	flush_workqueue(work->worker);
> +
> +	do {
> +		spin_lock_bh(&work->state_lock);
> +		idle = (work->state == WQ_STATE_START);
> +		spin_unlock_bh(&work->state_lock);
> +	} while (!idle);
> +
> +	destroy_workqueue(work->worker);
> +}
> +
> +void rxe_run_work(struct rxe_work *work, int sched)
> +{
> +	if (work->destroyed)
> +		return;
> +
> +	/* busy-loop while qp reset is in progress */
> +	while (atomic_read(&work->suspended))
> +		continue;
> +
> +	if (sched)
> +		queue_work(work->worker, &work->work);
> +	else
> +		rxe_do_work(&work->work);
> +}
> +
> +void rxe_disable_work(struct rxe_work *work)
> +{
> +	atomic_inc(&work->suspended);
> +	flush_workqueue(work->worker);
> +}
> +
> +void rxe_enable_work(struct rxe_work *work)
> +{
> +	atomic_dec(&work->suspended);
> +}
> diff --git a/drivers/infiniband/sw/rxe/rxe_wq.h b/drivers/infiniband/sw/rxe/rxe_wq.h
> new file mode 100644
> index 000000000000..b40af598dcc0
> --- /dev/null
> +++ b/drivers/infiniband/sw/rxe/rxe_wq.h
> @@ -0,0 +1,71 @@
> +/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
> +/*
> + * Copyright (c) 2016 Mellanox Technologies Ltd. All rights reserved.
> + * Copyright (c) 2015 System Fabric Works, Inc. All rights reserved.
> + */
> +
> +#ifndef RXE_WQ_H
> +#define RXE_WQ_H
> +
> +enum {
> +	WQ_STATE_START	= 0,
> +	WQ_STATE_BUSY		= 1,
> +	WQ_STATE_ARMED	= 2,
> +};
> +
> +/*
> + * data structure to describe a 'work' which is a short
> + * function that returns 0 as long as it needs to be
> + * called again.
> + */
> +struct rxe_work {
> +	void			*obj;
> +	struct workqueue_struct	*worker;
> +	struct work_struct	work;
> +	int			state;
> +	spinlock_t		state_lock; /* spinlock for work state */
> +	void			*arg;
> +	int			(*func)(void *arg);
> +	int			ret;
> +	char			name[16];
> +	bool			destroyed;
> +	atomic_t		suspended; /* used to {dis,en}able workqueue */
> +};
> +
> +/*
> + * init rxe_work structure
> + *	arg  => parameter to pass to fcn
> + *	func => function to call until it returns != 0
> + */
> +int rxe_init_work(void *obj, struct rxe_work *work,
> +		  void *arg, int (*func)(void *), char *name);
> +
> +/* cleanup work */
> +void rxe_cleanup_work(struct rxe_work *work);
> +
> +/*
> + * raw call to func in loop without any checking
> + * can call when workqueues are suspended.
> + */
> +int __rxe_do_work(struct rxe_work *work);
> +
> +/*
> + * common function called by any of the main workqueues
> + * If there is any chance that there is additional
> + * work to do someone must reschedule the work before
> + * leaving
> + */
> +void rxe_do_work(struct work_struct *w);
> +
> +/* run a work, else schedule it to run as a workqueue, The decision
> + * to run or schedule workqueue is based on the parameter sched.
> + */
> +void rxe_run_work(struct rxe_work *work, int sched);
> +
> +/* keep a work from scheduling */
> +void rxe_disable_work(struct rxe_work *work);
> +
> +/* allow work to run */
> +void rxe_enable_work(struct rxe_work *work);
> +
> +#endif /* RXE_WQ_H */

Daisuke,

We (HPE) have an internal patch set that also allows using workqueues instead
of tasklets. There is a certain amount of work needed to allow control over
CPU assignment from the ULP or application level, which is critical for high
performance in multithreaded I/O applications. It would be in both of our
interests if we could find a common implementation that works for everyone.
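
As one possible direction (a rough sketch only, not taken from our patches):
an unbound workqueue created with WQ_SYSFS already exposes its cpumask in
sysfs, and queue_work_on() lets the caller pick a cpu explicitly.

#include <linux/workqueue.h>

/* trimmed copy of the rxe_work in this patch, shown for illustration only */
struct rxe_work {
        struct workqueue_struct *worker;
        struct work_struct      work;
};

static struct workqueue_struct *rxe_alloc_wq(const char *name)
{
        /* WQ_UNBOUND + WQ_SYSFS exposes a cpumask attribute under
         * /sys/devices/virtual/workqueue/<name>/, so the cpu set can be
         * changed from user space without any new driver interface
         */
        return alloc_workqueue("%s", WQ_UNBOUND | WQ_SYSFS, 1, name);
}

/* or queue on a cpu that the ulp/application asked for explicitly */
static void rxe_run_work_on(struct rxe_work *work, int cpu)
{
        queue_work_on(cpu, work->worker, &work->work);
}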

Perhaps we could arrange for a call to discuss?

Bob Pearson

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 2/7] RDMA/rxe: Convert the triple tasklets to workqueues
  2022-09-07  2:43 ` [RFC PATCH 2/7] RDMA/rxe: Convert the triple tasklets to workqueues Daisuke Matsuda
  2022-09-09 19:39   ` Bob Pearson
@ 2022-09-11  7:10   ` Yanjun Zhu
  2022-09-11 15:08     ` Bart Van Assche
  1 sibling, 1 reply; 25+ messages in thread
From: Yanjun Zhu @ 2022-09-11  7:10 UTC (permalink / raw)
  To: Daisuke Matsuda, linux-rdma, leonro, jgg, zyjzyj2000
  Cc: nvdimm, linux-kernel, rpearsonhpe, yangx.jy, lizhijian, y-goto

On 2022/9/7 10:43, Daisuke Matsuda wrote:
> In order to implement On-Demand Paging on the rxe driver, triple tasklets
> (requester, responder, and completer) must be allowed to sleep so that they
> can trigger page fault when pages being accessed are not present.
> 
> This patch replaces the tasklets with workqueues, but still allows direct-
> call of works from softirq context if it is obvious that MRs are not going
> to be accessed and there is no work being processed in the workqueue.
> 
> As counterparts to tasklet_disable() and tasklet_enable() are missing for
> workqueues, an atomic value is introduced to get works suspended while qp
> reset is in progress.
> 
> As a reference, performance change was measured using ib_send_bw and
> ib_send_lat commands over RC connection. Both the client and the server
> were placed on the same KVM host. An option "-n 100000" was given to the
> respective commands to iterate over 100000 times.
> 
> Before applying this patch:
> [ib_send_bw]
> ---------------------------------------------------------------------------------------
>   #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
>   65536      100000           0.00               203.13             0.003250
> ---------------------------------------------------------------------------------------
> [ib_send_lat]
> ---------------------------------------------------------------------------------------
>   #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]    t_avg[usec]    t_stdev[usec]   99% percentile[usec]   99.9% percentile[usec]
>   2       100000          30.91          1519.05      34.23             36.06            5.15            48.49                   82.37
> ---------------------------------------------------------------------------------------
> 
> After applying this patch:
> [ib_send_bw]
> ---------------------------------------------------------------------------------------
>   #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
>   65536      100000           0.00               240.88             0.003854
> ---------------------------------------------------------------------------------------
> [ib_send_lat]
> ---------------------------------------------------------------------------------------
>   #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]    t_avg[usec]    t_stdev[usec]   99% percentile[usec]   99.9% percentile[usec]
>   2       100000          40.88          2994.82      47.80             50.25            13.70           76.42                   185.04
> ---------------------------------------------------------------------------------------
> 
> The test was conducted 3 times for each kernel, and the results with median
> "BW average" and "t_typical" are shown above. It shows the conversion
> improves the bandwidth while causing higher latency.

I also implemented a workqueue for rxe. IMO, we could add a variable that
decides whether to use a tasklet or a workqueue.

If the user prefers the tasklet, he can set the variable accordingly; the
default would be the tasklet. Setting the variable to another value would
select the workqueue.
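
For example, something like the following (just a sketch; the parameter name
and the dispatch helper are made up, and it assumes both backends stay
compiled in):

#include <linux/module.h>

static bool rxe_use_wq;	/* default 0: keep the existing tasklets */
module_param_named(use_wq, rxe_use_wq, bool, 0444);
MODULE_PARM_DESC(use_wq, "process QP work in workqueues instead of tasklets");

/* callers pass both handles; only the selected backend is used */
static void rxe_sched(struct rxe_task *task, struct rxe_work *work, int sched)
{
        if (rxe_use_wq)
                rxe_run_work(work, sched);
        else
                rxe_run_task(task, sched);
}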

Zhu Yanjun

> 
> Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
> ---
>   drivers/infiniband/sw/rxe/Makefile    |   2 +-
>   drivers/infiniband/sw/rxe/rxe_comp.c  |  42 ++++---
>   drivers/infiniband/sw/rxe/rxe_loc.h   |   2 +-
>   drivers/infiniband/sw/rxe/rxe_net.c   |   4 +-
>   drivers/infiniband/sw/rxe/rxe_param.h |   2 +-
>   drivers/infiniband/sw/rxe/rxe_qp.c    |  68 +++++------
>   drivers/infiniband/sw/rxe/rxe_recv.c  |   2 +-
>   drivers/infiniband/sw/rxe/rxe_req.c   |  14 +--
>   drivers/infiniband/sw/rxe/rxe_resp.c  |  16 +--
>   drivers/infiniband/sw/rxe/rxe_task.c  | 152 ------------------------
>   drivers/infiniband/sw/rxe/rxe_task.h  |  69 -----------
>   drivers/infiniband/sw/rxe/rxe_verbs.c |   8 +-
>   drivers/infiniband/sw/rxe/rxe_verbs.h |   8 +-
>   drivers/infiniband/sw/rxe/rxe_wq.c    | 161 ++++++++++++++++++++++++++
>   drivers/infiniband/sw/rxe/rxe_wq.h    |  71 ++++++++++++
>   15 files changed, 322 insertions(+), 299 deletions(-)
>   delete mode 100644 drivers/infiniband/sw/rxe/rxe_task.c
>   delete mode 100644 drivers/infiniband/sw/rxe/rxe_task.h
>   create mode 100644 drivers/infiniband/sw/rxe/rxe_wq.c
>   create mode 100644 drivers/infiniband/sw/rxe/rxe_wq.h
> 
> diff --git a/drivers/infiniband/sw/rxe/Makefile b/drivers/infiniband/sw/rxe/Makefile
> index 5395a581f4bb..358f6b06aa64 100644
> --- a/drivers/infiniband/sw/rxe/Makefile
> +++ b/drivers/infiniband/sw/rxe/Makefile
> @@ -20,6 +20,6 @@ rdma_rxe-y := \
>   	rxe_mmap.o \
>   	rxe_icrc.o \
>   	rxe_mcast.o \
> -	rxe_task.o \
> +	rxe_wq.o \
>   	rxe_net.o \
>   	rxe_hw_counters.o
> diff --git a/drivers/infiniband/sw/rxe/rxe_comp.c b/drivers/infiniband/sw/rxe/rxe_comp.c
> index fb0c008af78c..0348da06c4dd 100644
> --- a/drivers/infiniband/sw/rxe/rxe_comp.c
> +++ b/drivers/infiniband/sw/rxe/rxe_comp.c
> @@ -9,7 +9,7 @@
>   #include "rxe.h"
>   #include "rxe_loc.h"
>   #include "rxe_queue.h"
> -#include "rxe_task.h"
> +#include "rxe_wq.h"
>   
>   enum comp_state {
>   	COMPST_GET_ACK,
> @@ -118,21 +118,37 @@ void retransmit_timer(struct timer_list *t)
>   
>   	if (qp->valid) {
>   		qp->comp.timeout = 1;
> -		rxe_run_task(&qp->comp.task, 1);
> +		rxe_run_work(&qp->comp.work, 1);
>   	}
>   }
>   
> -void rxe_comp_queue_pkt(struct rxe_qp *qp, struct sk_buff *skb)
> +void rxe_comp_queue_pkt(struct rxe_pkt_info *pkt, struct sk_buff *skb)
>   {
> +	struct rxe_qp *qp = pkt->qp;
>   	int must_sched;
>   
>   	skb_queue_tail(&qp->resp_pkts, skb);
>   
> -	must_sched = skb_queue_len(&qp->resp_pkts) > 1;
> +	/* Schedule a workqueue when processing READ and ATOMIC acks.
> +	 * In these cases, completer may sleep to access ODP-enabled MRs.
> +	 */
> +	switch (pkt->opcode) {
> +	case IB_OPCODE_RC_RDMA_READ_RESPONSE_FIRST:
> +	case IB_OPCODE_RC_RDMA_READ_RESPONSE_LAST:
> +	case IB_OPCODE_RC_RDMA_READ_RESPONSE_ONLY:
> +	case IB_OPCODE_RC_RDMA_READ_RESPONSE_MIDDLE:
> +	case IB_OPCODE_RC_ATOMIC_ACKNOWLEDGE:
> +		must_sched = 1;
> +		break;
> +
> +	default:
> +		must_sched = skb_queue_len(&qp->resp_pkts) > 1;
> +	}
> +
>   	if (must_sched != 0)
>   		rxe_counter_inc(SKB_TO_PKT(skb)->rxe, RXE_CNT_COMPLETER_SCHED);
>   
> -	rxe_run_task(&qp->comp.task, must_sched);
> +	rxe_run_work(&qp->comp.work, must_sched);
>   }
>   
>   static inline enum comp_state get_wqe(struct rxe_qp *qp,
> @@ -305,7 +321,7 @@ static inline enum comp_state check_ack(struct rxe_qp *qp,
>   					qp->comp.psn = pkt->psn;
>   					if (qp->req.wait_psn) {
>   						qp->req.wait_psn = 0;
> -						rxe_run_task(&qp->req.task, 0);
> +						rxe_run_work(&qp->req.work, 0);
>   					}
>   				}
>   				return COMPST_ERROR_RETRY;
> @@ -452,7 +468,7 @@ static void do_complete(struct rxe_qp *qp, struct rxe_send_wqe *wqe)
>   	 */
>   	if (qp->req.wait_fence) {
>   		qp->req.wait_fence = 0;
> -		rxe_run_task(&qp->req.task, 0);
> +		rxe_run_work(&qp->req.work, 0);
>   	}
>   }
>   
> @@ -466,7 +482,7 @@ static inline enum comp_state complete_ack(struct rxe_qp *qp,
>   		if (qp->req.need_rd_atomic) {
>   			qp->comp.timeout_retry = 0;
>   			qp->req.need_rd_atomic = 0;
> -			rxe_run_task(&qp->req.task, 0);
> +			rxe_run_work(&qp->req.work, 0);
>   		}
>   	}
>   
> @@ -512,7 +528,7 @@ static inline enum comp_state complete_wqe(struct rxe_qp *qp,
>   
>   		if (qp->req.wait_psn) {
>   			qp->req.wait_psn = 0;
> -			rxe_run_task(&qp->req.task, 1);
> +			rxe_run_work(&qp->req.work, 1);
>   		}
>   	}
>   
> @@ -646,7 +662,7 @@ int rxe_completer(void *arg)
>   
>   			if (qp->req.wait_psn) {
>   				qp->req.wait_psn = 0;
> -				rxe_run_task(&qp->req.task, 1);
> +				rxe_run_work(&qp->req.work, 1);
>   			}
>   
>   			state = COMPST_DONE;
> @@ -714,7 +730,7 @@ int rxe_completer(void *arg)
>   							RXE_CNT_COMP_RETRY);
>   					qp->req.need_retry = 1;
>   					qp->comp.started_retry = 1;
> -					rxe_run_task(&qp->req.task, 0);
> +					rxe_run_work(&qp->req.work, 0);
>   				}
>   				goto done;
>   
> @@ -757,8 +773,8 @@ int rxe_completer(void *arg)
>   		}
>   	}
>   
> -	/* A non-zero return value will cause rxe_do_task to
> -	 * exit its loop and end the tasklet. A zero return
> +	/* A non-zero return value will cause rxe_do_work to
> +	 * exit its loop and end the work. A zero return
>   	 * will continue looping and return to rxe_completer
>   	 */
>   done:
> diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
> index 22f6cc31d1d6..0f8cb9e38cc9 100644
> --- a/drivers/infiniband/sw/rxe/rxe_loc.h
> +++ b/drivers/infiniband/sw/rxe/rxe_loc.h
> @@ -181,7 +181,7 @@ void rxe_icrc_generate(struct sk_buff *skb, struct rxe_pkt_info *pkt);
>   
>   void rxe_resp_queue_pkt(struct rxe_qp *qp, struct sk_buff *skb);
>   
> -void rxe_comp_queue_pkt(struct rxe_qp *qp, struct sk_buff *skb);
> +void rxe_comp_queue_pkt(struct rxe_pkt_info *pkt, struct sk_buff *skb);
>   
>   static inline unsigned int wr_opcode_mask(int opcode, struct rxe_qp *qp)
>   {
> diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
> index c53f4529f098..5526970882c8 100644
> --- a/drivers/infiniband/sw/rxe/rxe_net.c
> +++ b/drivers/infiniband/sw/rxe/rxe_net.c
> @@ -346,7 +346,7 @@ static void rxe_skb_tx_dtor(struct sk_buff *skb)
>   
>   	if (unlikely(qp->need_req_skb &&
>   		     skb_out < RXE_INFLIGHT_SKBS_PER_QP_LOW))
> -		rxe_run_task(&qp->req.task, 1);
> +		rxe_run_work(&qp->req.work, 1);
>   
>   	rxe_put(qp);
>   }
> @@ -430,7 +430,7 @@ int rxe_xmit_packet(struct rxe_qp *qp, struct rxe_pkt_info *pkt,
>   	if ((qp_type(qp) != IB_QPT_RC) &&
>   	    (pkt->mask & RXE_END_MASK)) {
>   		pkt->wqe->state = wqe_state_done;
> -		rxe_run_task(&qp->comp.task, 1);
> +		rxe_run_work(&qp->comp.work, 1);
>   	}
>   
>   	rxe_counter_inc(rxe, RXE_CNT_SENT_PKTS);
> diff --git a/drivers/infiniband/sw/rxe/rxe_param.h b/drivers/infiniband/sw/rxe/rxe_param.h
> index 86c7a8bf3cbb..1c0251812fc8 100644
> --- a/drivers/infiniband/sw/rxe/rxe_param.h
> +++ b/drivers/infiniband/sw/rxe/rxe_param.h
> @@ -105,7 +105,7 @@ enum rxe_device_param {
>   	RXE_INFLIGHT_SKBS_PER_QP_HIGH	= 64,
>   	RXE_INFLIGHT_SKBS_PER_QP_LOW	= 16,
>   
> -	/* Max number of interations of each tasklet
> +	/* Max number of interations of each work
>   	 * before yielding the cpu to let other
>   	 * work make progress
>   	 */
> diff --git a/drivers/infiniband/sw/rxe/rxe_qp.c b/drivers/infiniband/sw/rxe/rxe_qp.c
> index 516bf9b95e48..c7c7acd4ecfa 100644
> --- a/drivers/infiniband/sw/rxe/rxe_qp.c
> +++ b/drivers/infiniband/sw/rxe/rxe_qp.c
> @@ -13,7 +13,7 @@
>   #include "rxe.h"
>   #include "rxe_loc.h"
>   #include "rxe_queue.h"
> -#include "rxe_task.h"
> +#include "rxe_wq.h"
>   
>   static int rxe_qp_chk_cap(struct rxe_dev *rxe, struct ib_qp_cap *cap,
>   			  int has_srq)
> @@ -172,9 +172,9 @@ static void rxe_qp_init_misc(struct rxe_dev *rxe, struct rxe_qp *qp,
>   
>   	spin_lock_init(&qp->state_lock);
>   
> -	spin_lock_init(&qp->req.task.state_lock);
> -	spin_lock_init(&qp->resp.task.state_lock);
> -	spin_lock_init(&qp->comp.task.state_lock);
> +	spin_lock_init(&qp->req.work.state_lock);
> +	spin_lock_init(&qp->resp.work.state_lock);
> +	spin_lock_init(&qp->comp.work.state_lock);
>   
>   	spin_lock_init(&qp->sq.sq_lock);
>   	spin_lock_init(&qp->rq.producer_lock);
> @@ -242,10 +242,10 @@ static int rxe_qp_init_req(struct rxe_dev *rxe, struct rxe_qp *qp,
>   
>   	skb_queue_head_init(&qp->req_pkts);
>   
> -	rxe_init_task(rxe, &qp->req.task, qp,
> -		      rxe_requester, "req");
> -	rxe_init_task(rxe, &qp->comp.task, qp,
> -		      rxe_completer, "comp");
> +	rxe_init_work(rxe, &qp->req.work, qp,
> +		      rxe_requester, "rxe_req");
> +	rxe_init_work(rxe, &qp->comp.work, qp,
> +		      rxe_completer, "rxe_comp");
>   
>   	qp->qp_timeout_jiffies = 0; /* Can't be set for UD/UC in modify_qp */
>   	if (init->qp_type == IB_QPT_RC) {
> @@ -292,8 +292,8 @@ static int rxe_qp_init_resp(struct rxe_dev *rxe, struct rxe_qp *qp,
>   
>   	skb_queue_head_init(&qp->resp_pkts);
>   
> -	rxe_init_task(rxe, &qp->resp.task, qp,
> -		      rxe_responder, "resp");
> +	rxe_init_work(rxe, &qp->resp.work, qp,
> +		      rxe_responder, "rxe_resp");
>   
>   	qp->resp.opcode		= OPCODE_NONE;
>   	qp->resp.msn		= 0;
> @@ -481,14 +481,14 @@ int rxe_qp_chk_attr(struct rxe_dev *rxe, struct rxe_qp *qp,
>   /* move the qp to the reset state */
>   static void rxe_qp_reset(struct rxe_qp *qp)
>   {
> -	/* stop tasks from running */
> -	rxe_disable_task(&qp->resp.task);
> +	/* flush workqueue and stop works from running */
> +	rxe_disable_work(&qp->resp.work);
>   
>   	/* stop request/comp */
>   	if (qp->sq.queue) {
>   		if (qp_type(qp) == IB_QPT_RC)
> -			rxe_disable_task(&qp->comp.task);
> -		rxe_disable_task(&qp->req.task);
> +			rxe_disable_work(&qp->comp.work);
> +		rxe_disable_work(&qp->req.work);
>   	}
>   
>   	/* move qp to the reset state */
> @@ -499,11 +499,11 @@ static void rxe_qp_reset(struct rxe_qp *qp)
>   	/* let state machines reset themselves drain work and packet queues
>   	 * etc.
>   	 */
> -	__rxe_do_task(&qp->resp.task);
> +	__rxe_do_work(&qp->resp.work);
>   
>   	if (qp->sq.queue) {
> -		__rxe_do_task(&qp->comp.task);
> -		__rxe_do_task(&qp->req.task);
> +		__rxe_do_work(&qp->comp.work);
> +		__rxe_do_work(&qp->req.work);
>   		rxe_queue_reset(qp->sq.queue);
>   	}
>   
> @@ -526,14 +526,14 @@ static void rxe_qp_reset(struct rxe_qp *qp)
>   
>   	cleanup_rd_atomic_resources(qp);
>   
> -	/* reenable tasks */
> -	rxe_enable_task(&qp->resp.task);
> +	/* reenable workqueue */
> +	rxe_enable_work(&qp->resp.work);
>   
>   	if (qp->sq.queue) {
>   		if (qp_type(qp) == IB_QPT_RC)
> -			rxe_enable_task(&qp->comp.task);
> +			rxe_enable_work(&qp->comp.work);
>   
> -		rxe_enable_task(&qp->req.task);
> +		rxe_enable_work(&qp->req.work);
>   	}
>   }
>   
> @@ -544,10 +544,10 @@ static void rxe_qp_drain(struct rxe_qp *qp)
>   		if (qp->req.state != QP_STATE_DRAINED) {
>   			qp->req.state = QP_STATE_DRAIN;
>   			if (qp_type(qp) == IB_QPT_RC)
> -				rxe_run_task(&qp->comp.task, 1);
> +				rxe_run_work(&qp->comp.work, 1);
>   			else
> -				__rxe_do_task(&qp->comp.task);
> -			rxe_run_task(&qp->req.task, 1);
> +				__rxe_do_work(&qp->comp.work);
> +			rxe_run_work(&qp->req.work, 1);
>   		}
>   	}
>   }
> @@ -561,13 +561,13 @@ void rxe_qp_error(struct rxe_qp *qp)
>   	qp->attr.qp_state = IB_QPS_ERR;
>   
>   	/* drain work and packet queues */
> -	rxe_run_task(&qp->resp.task, 1);
> +	rxe_run_work(&qp->resp.work, 1);
>   
>   	if (qp_type(qp) == IB_QPT_RC)
> -		rxe_run_task(&qp->comp.task, 1);
> +		rxe_run_work(&qp->comp.work, 1);
>   	else
> -		__rxe_do_task(&qp->comp.task);
> -	rxe_run_task(&qp->req.task, 1);
> +		__rxe_do_work(&qp->comp.work);
> +	rxe_run_work(&qp->req.work, 1);
>   }
>   
>   /* called by the modify qp verb */
> @@ -786,21 +786,21 @@ static void rxe_qp_do_cleanup(struct work_struct *work)
>   
>   	qp->valid = 0;
>   	qp->qp_timeout_jiffies = 0;
> -	rxe_cleanup_task(&qp->resp.task);
> +	rxe_cleanup_work(&qp->resp.work);
>   
>   	if (qp_type(qp) == IB_QPT_RC) {
>   		del_timer_sync(&qp->retrans_timer);
>   		del_timer_sync(&qp->rnr_nak_timer);
>   	}
>   
> -	rxe_cleanup_task(&qp->req.task);
> -	rxe_cleanup_task(&qp->comp.task);
> +	rxe_cleanup_work(&qp->req.work);
> +	rxe_cleanup_work(&qp->comp.work);
>   
>   	/* flush out any receive wr's or pending requests */
> -	__rxe_do_task(&qp->req.task);
> +	__rxe_do_work(&qp->req.work);
>   	if (qp->sq.queue) {
> -		__rxe_do_task(&qp->comp.task);
> -		__rxe_do_task(&qp->req.task);
> +		__rxe_do_work(&qp->comp.work);
> +		__rxe_do_work(&qp->req.work);
>   	}
>   
>   	if (qp->sq.queue)
> diff --git a/drivers/infiniband/sw/rxe/rxe_recv.c b/drivers/infiniband/sw/rxe/rxe_recv.c
> index f3ad7b6dbd97..3b525e9d903e 100644
> --- a/drivers/infiniband/sw/rxe/rxe_recv.c
> +++ b/drivers/infiniband/sw/rxe/rxe_recv.c
> @@ -226,7 +226,7 @@ static inline void rxe_rcv_pkt(struct rxe_pkt_info *pkt, struct sk_buff *skb)
>   	if (pkt->mask & RXE_REQ_MASK)
>   		rxe_resp_queue_pkt(pkt->qp, skb);
>   	else
> -		rxe_comp_queue_pkt(pkt->qp, skb);
> +		rxe_comp_queue_pkt(pkt, skb);
>   }
>   
>   static void rxe_rcv_mcast_pkt(struct rxe_dev *rxe, struct sk_buff *skb)
> diff --git a/drivers/infiniband/sw/rxe/rxe_req.c b/drivers/infiniband/sw/rxe/rxe_req.c
> index f63771207970..bd05dd3de499 100644
> --- a/drivers/infiniband/sw/rxe/rxe_req.c
> +++ b/drivers/infiniband/sw/rxe/rxe_req.c
> @@ -105,7 +105,7 @@ void rnr_nak_timer(struct timer_list *t)
>   	/* request a send queue retry */
>   	qp->req.need_retry = 1;
>   	qp->req.wait_for_rnr_timer = 0;
> -	rxe_run_task(&qp->req.task, 1);
> +	rxe_run_work(&qp->req.work, 1);
>   }
>   
>   static struct rxe_send_wqe *req_next_wqe(struct rxe_qp *qp)
> @@ -608,7 +608,7 @@ static int rxe_do_local_ops(struct rxe_qp *qp, struct rxe_send_wqe *wqe)
>   	 * which can lead to a deadlock. So go ahead and complete
>   	 * it now.
>   	 */
> -	rxe_run_task(&qp->comp.task, 1);
> +	rxe_run_work(&qp->comp.work, 1);
>   
>   	return 0;
>   }
> @@ -733,7 +733,7 @@ int rxe_requester(void *arg)
>   						       qp->req.wqe_index);
>   			wqe->state = wqe_state_done;
>   			wqe->status = IB_WC_SUCCESS;
> -			rxe_run_task(&qp->comp.task, 0);
> +			rxe_run_work(&qp->comp.work, 0);
>   			goto done;
>   		}
>   		payload = mtu;
> @@ -795,7 +795,7 @@ int rxe_requester(void *arg)
>   		rollback_state(wqe, qp, &rollback_wqe, rollback_psn);
>   
>   		if (err == -EAGAIN) {
> -			rxe_run_task(&qp->req.task, 1);
> +			rxe_run_work(&qp->req.work, 1);
>   			goto exit;
>   		}
>   
> @@ -805,8 +805,8 @@ int rxe_requester(void *arg)
>   
>   	update_state(qp, &pkt);
>   
> -	/* A non-zero return value will cause rxe_do_task to
> -	 * exit its loop and end the tasklet. A zero return
> +	/* A non-zero return value will cause rxe_do_work to
> +	 * exit its loop and end the work. A zero return
>   	 * will continue looping and return to rxe_requester
>   	 */
>   done:
> @@ -817,7 +817,7 @@ int rxe_requester(void *arg)
>   	qp->req.wqe_index = queue_next_index(qp->sq.queue, qp->req.wqe_index);
>   	wqe->state = wqe_state_error;
>   	qp->req.state = QP_STATE_ERROR;
> -	rxe_run_task(&qp->comp.task, 0);
> +	rxe_run_work(&qp->comp.work, 0);
>   exit:
>   	ret = -EAGAIN;
>   out:
> diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
> index b36ec5c4d5e0..e97c55b292f0 100644
> --- a/drivers/infiniband/sw/rxe/rxe_resp.c
> +++ b/drivers/infiniband/sw/rxe/rxe_resp.c
> @@ -83,15 +83,11 @@ static char *resp_state_name[] = {
>   /* rxe_recv calls here to add a request packet to the input queue */
>   void rxe_resp_queue_pkt(struct rxe_qp *qp, struct sk_buff *skb)
>   {
> -	int must_sched;
> -	struct rxe_pkt_info *pkt = SKB_TO_PKT(skb);
> -
> +	/* responder can sleep to access ODP-enabled MRs, so schedule the
> +	 * workqueue for all incoming requests.
> +	 */
>   	skb_queue_tail(&qp->req_pkts, skb);
> -
> -	must_sched = (pkt->opcode == IB_OPCODE_RC_RDMA_READ_REQUEST) ||
> -			(skb_queue_len(&qp->req_pkts) > 1);
> -
> -	rxe_run_task(&qp->resp.task, must_sched);
> +	rxe_run_work(&qp->resp.work, 1);
>   }
>   
>   static inline enum resp_states get_req(struct rxe_qp *qp,
> @@ -1464,8 +1460,8 @@ int rxe_responder(void *arg)
>   		}
>   	}
>   
> -	/* A non-zero return value will cause rxe_do_task to
> -	 * exit its loop and end the tasklet. A zero return
> +	/* A non-zero return value will cause rxe_do_work to
> +	 * exit its loop and end the work. A zero return
>   	 * will continue looping and return to rxe_responder
>   	 */
>   done:
> diff --git a/drivers/infiniband/sw/rxe/rxe_task.c b/drivers/infiniband/sw/rxe/rxe_task.c
> deleted file mode 100644
> index 2248cf33d776..000000000000
> --- a/drivers/infiniband/sw/rxe/rxe_task.c
> +++ /dev/null
> @@ -1,152 +0,0 @@
> -// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
> -/*
> - * Copyright (c) 2016 Mellanox Technologies Ltd. All rights reserved.
> - * Copyright (c) 2015 System Fabric Works, Inc. All rights reserved.
> - */
> -
> -#include <linux/kernel.h>
> -#include <linux/interrupt.h>
> -#include <linux/hardirq.h>
> -
> -#include "rxe.h"
> -
> -int __rxe_do_task(struct rxe_task *task)
> -
> -{
> -	int ret;
> -
> -	while ((ret = task->func(task->arg)) == 0)
> -		;
> -
> -	task->ret = ret;
> -
> -	return ret;
> -}
> -
> -/*
> - * this locking is due to a potential race where
> - * a second caller finds the task already running
> - * but looks just after the last call to func
> - */
> -void rxe_do_task(struct tasklet_struct *t)
> -{
> -	int cont;
> -	int ret;
> -	struct rxe_task *task = from_tasklet(task, t, tasklet);
> -	unsigned int iterations = RXE_MAX_ITERATIONS;
> -
> -	spin_lock_bh(&task->state_lock);
> -	switch (task->state) {
> -	case TASK_STATE_START:
> -		task->state = TASK_STATE_BUSY;
> -		spin_unlock_bh(&task->state_lock);
> -		break;
> -
> -	case TASK_STATE_BUSY:
> -		task->state = TASK_STATE_ARMED;
> -		fallthrough;
> -	case TASK_STATE_ARMED:
> -		spin_unlock_bh(&task->state_lock);
> -		return;
> -
> -	default:
> -		spin_unlock_bh(&task->state_lock);
> -		pr_warn("%s failed with bad state %d\n", __func__, task->state);
> -		return;
> -	}
> -
> -	do {
> -		cont = 0;
> -		ret = task->func(task->arg);
> -
> -		spin_lock_bh(&task->state_lock);
> -		switch (task->state) {
> -		case TASK_STATE_BUSY:
> -			if (ret) {
> -				task->state = TASK_STATE_START;
> -			} else if (iterations--) {
> -				cont = 1;
> -			} else {
> -				/* reschedule the tasklet and exit
> -				 * the loop to give up the cpu
> -				 */
> -				tasklet_schedule(&task->tasklet);
> -				task->state = TASK_STATE_START;
> -			}
> -			break;
> -
> -		/* someone tried to run the task since the last time we called
> -		 * func, so we will call one more time regardless of the
> -		 * return value
> -		 */
> -		case TASK_STATE_ARMED:
> -			task->state = TASK_STATE_BUSY;
> -			cont = 1;
> -			break;
> -
> -		default:
> -			pr_warn("%s failed with bad state %d\n", __func__,
> -				task->state);
> -		}
> -		spin_unlock_bh(&task->state_lock);
> -	} while (cont);
> -
> -	task->ret = ret;
> -}
> -
> -int rxe_init_task(void *obj, struct rxe_task *task,
> -		  void *arg, int (*func)(void *), char *name)
> -{
> -	task->obj	= obj;
> -	task->arg	= arg;
> -	task->func	= func;
> -	snprintf(task->name, sizeof(task->name), "%s", name);
> -	task->destroyed	= false;
> -
> -	tasklet_setup(&task->tasklet, rxe_do_task);
> -
> -	task->state = TASK_STATE_START;
> -	spin_lock_init(&task->state_lock);
> -
> -	return 0;
> -}
> -
> -void rxe_cleanup_task(struct rxe_task *task)
> -{
> -	bool idle;
> -
> -	/*
> -	 * Mark the task, then wait for it to finish. It might be
> -	 * running in a non-tasklet (direct call) context.
> -	 */
> -	task->destroyed = true;
> -
> -	do {
> -		spin_lock_bh(&task->state_lock);
> -		idle = (task->state == TASK_STATE_START);
> -		spin_unlock_bh(&task->state_lock);
> -	} while (!idle);
> -
> -	tasklet_kill(&task->tasklet);
> -}
> -
> -void rxe_run_task(struct rxe_task *task, int sched)
> -{
> -	if (task->destroyed)
> -		return;
> -
> -	if (sched)
> -		tasklet_schedule(&task->tasklet);
> -	else
> -		rxe_do_task(&task->tasklet);
> -}
> -
> -void rxe_disable_task(struct rxe_task *task)
> -{
> -	tasklet_disable(&task->tasklet);
> -}
> -
> -void rxe_enable_task(struct rxe_task *task)
> -{
> -	tasklet_enable(&task->tasklet);
> -}
> diff --git a/drivers/infiniband/sw/rxe/rxe_task.h b/drivers/infiniband/sw/rxe/rxe_task.h
> deleted file mode 100644
> index 11d183fd3338..000000000000
> --- a/drivers/infiniband/sw/rxe/rxe_task.h
> +++ /dev/null
> @@ -1,69 +0,0 @@
> -/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
> -/*
> - * Copyright (c) 2016 Mellanox Technologies Ltd. All rights reserved.
> - * Copyright (c) 2015 System Fabric Works, Inc. All rights reserved.
> - */
> -
> -#ifndef RXE_TASK_H
> -#define RXE_TASK_H
> -
> -enum {
> -	TASK_STATE_START	= 0,
> -	TASK_STATE_BUSY		= 1,
> -	TASK_STATE_ARMED	= 2,
> -};
> -
> -/*
> - * data structure to describe a 'task' which is a short
> - * function that returns 0 as long as it needs to be
> - * called again.
> - */
> -struct rxe_task {
> -	void			*obj;
> -	struct tasklet_struct	tasklet;
> -	int			state;
> -	spinlock_t		state_lock; /* spinlock for task state */
> -	void			*arg;
> -	int			(*func)(void *arg);
> -	int			ret;
> -	char			name[16];
> -	bool			destroyed;
> -};
> -
> -/*
> - * init rxe_task structure
> - *	arg  => parameter to pass to fcn
> - *	func => function to call until it returns != 0
> - */
> -int rxe_init_task(void *obj, struct rxe_task *task,
> -		  void *arg, int (*func)(void *), char *name);
> -
> -/* cleanup task */
> -void rxe_cleanup_task(struct rxe_task *task);
> -
> -/*
> - * raw call to func in loop without any checking
> - * can call when tasklets are disabled
> - */
> -int __rxe_do_task(struct rxe_task *task);
> -
> -/*
> - * common function called by any of the main tasklets
> - * If there is any chance that there is additional
> - * work to do someone must reschedule the task before
> - * leaving
> - */
> -void rxe_do_task(struct tasklet_struct *t);
> -
> -/* run a task, else schedule it to run as a tasklet, The decision
> - * to run or schedule tasklet is based on the parameter sched.
> - */
> -void rxe_run_task(struct rxe_task *task, int sched);
> -
> -/* keep a task from scheduling */
> -void rxe_disable_task(struct rxe_task *task);
> -
> -/* allow task to run */
> -void rxe_enable_task(struct rxe_task *task);
> -
> -#endif /* RXE_TASK_H */
> diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.c b/drivers/infiniband/sw/rxe/rxe_verbs.c
> index 9ebe9decad34..7510f25c5ea3 100644
> --- a/drivers/infiniband/sw/rxe/rxe_verbs.c
> +++ b/drivers/infiniband/sw/rxe/rxe_verbs.c
> @@ -697,9 +697,9 @@ static int rxe_post_send_kernel(struct rxe_qp *qp, const struct ib_send_wr *wr,
>   		wr = next;
>   	}
>   
> -	rxe_run_task(&qp->req.task, 1);
> +	rxe_run_work(&qp->req.work, 1);
>   	if (unlikely(qp->req.state == QP_STATE_ERROR))
> -		rxe_run_task(&qp->comp.task, 1);
> +		rxe_run_work(&qp->comp.work, 1);
>   
>   	return err;
>   }
> @@ -721,7 +721,7 @@ static int rxe_post_send(struct ib_qp *ibqp, const struct ib_send_wr *wr,
>   
>   	if (qp->is_user) {
>   		/* Utilize process context to do protocol processing */
> -		rxe_run_task(&qp->req.task, 0);
> +		rxe_run_work(&qp->req.work, 0);
>   		return 0;
>   	} else
>   		return rxe_post_send_kernel(qp, wr, bad_wr);
> @@ -761,7 +761,7 @@ static int rxe_post_recv(struct ib_qp *ibqp, const struct ib_recv_wr *wr,
>   	spin_unlock_irqrestore(&rq->producer_lock, flags);
>   
>   	if (qp->resp.state == QP_STATE_ERROR)
> -		rxe_run_task(&qp->resp.task, 1);
> +		rxe_run_work(&qp->resp.work, 1);
>   
>   err1:
>   	return err;
> diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
> index a51819d0c345..b09b4cb9897a 100644
> --- a/drivers/infiniband/sw/rxe/rxe_verbs.h
> +++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
> @@ -10,7 +10,7 @@
>   #include <linux/interrupt.h>
>   #include <linux/workqueue.h>
>   #include "rxe_pool.h"
> -#include "rxe_task.h"
> +#include "rxe_wq.h"
>   #include "rxe_hw_counters.h"
>   
>   static inline int pkey_match(u16 key1, u16 key2)
> @@ -125,7 +125,7 @@ struct rxe_req_info {
>   	int			need_retry;
>   	int			wait_for_rnr_timer;
>   	int			noack_pkts;
> -	struct rxe_task		task;
> +	struct rxe_work		work;
>   };
>   
>   struct rxe_comp_info {
> @@ -137,7 +137,7 @@ struct rxe_comp_info {
>   	int			started_retry;
>   	u32			retry_cnt;
>   	u32			rnr_retry;
> -	struct rxe_task		task;
> +	struct rxe_work		work;
>   };
>   
>   enum rdatm_res_state {
> @@ -204,7 +204,7 @@ struct rxe_resp_info {
>   	unsigned int		res_head;
>   	unsigned int		res_tail;
>   	struct resp_res		*res;
> -	struct rxe_task		task;
> +	struct rxe_work		work;
>   };
>   
>   struct rxe_qp {
> diff --git a/drivers/infiniband/sw/rxe/rxe_wq.c b/drivers/infiniband/sw/rxe/rxe_wq.c
> new file mode 100644
> index 000000000000..8b35186eac85
> --- /dev/null
> +++ b/drivers/infiniband/sw/rxe/rxe_wq.c
> @@ -0,0 +1,161 @@
> +// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
> +/*
> + * Copyright (c) 2016 Mellanox Technologies Ltd. All rights reserved.
> + * Copyright (c) 2015 System Fabric Works, Inc. All rights reserved.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/interrupt.h>
> +#include <linux/hardirq.h>
> +
> +#include "rxe.h"
> +
> +int __rxe_do_work(struct rxe_work *work)
> +
> +{
> +	int ret;
> +
> +	while ((ret = work->func(work->arg)) == 0)
> +		;
> +
> +	work->ret = ret;
> +
> +	return ret;
> +}
> +
> +/*
> + * this locking is due to a potential race where
> + * a second caller finds the work already running
> + * but looks just after the last call to func
> + */
> +void rxe_do_work(struct work_struct *w)
> +{
> +	int cont;
> +	int ret;
> +
> +	struct rxe_work *work = container_of(w, typeof(*work), work);
> +	unsigned int iterations = RXE_MAX_ITERATIONS;
> +
> +	spin_lock_bh(&work->state_lock);
> +	switch (work->state) {
> +	case WQ_STATE_START:
> +		work->state = WQ_STATE_BUSY;
> +		spin_unlock_bh(&work->state_lock);
> +		break;
> +
> +	case WQ_STATE_BUSY:
> +		work->state = WQ_STATE_ARMED;
> +		fallthrough;
> +	case WQ_STATE_ARMED:
> +		spin_unlock_bh(&work->state_lock);
> +		return;
> +
> +	default:
> +		spin_unlock_bh(&work->state_lock);
> +		pr_warn("%s failed with bad state %d\n", __func__, work->state);
> +		return;
> +	}
> +
> +	do {
> +		cont = 0;
> +		ret = work->func(work->arg);
> +
> +		spin_lock_bh(&work->state_lock);
> +		switch (work->state) {
> +		case WQ_STATE_BUSY:
> +			if (ret) {
> +				work->state = WQ_STATE_START;
> +			} else if (iterations--) {
> +				cont = 1;
> +			} else {
> +				/* reschedule the work and exit
> +				 * the loop to give up the cpu
> +				 */
> +				queue_work(work->worker, &work->work);
> +				work->state = WQ_STATE_START;
> +			}
> +			break;
> +
> +		/* someone tried to run the work since the last time we called
> +		 * func, so we will call one more time regardless of the
> +		 * return value
> +		 */
> +		case WQ_STATE_ARMED:
> +			work->state = WQ_STATE_BUSY;
> +			cont = 1;
> +			break;
> +
> +		default:
> +			pr_warn("%s failed with bad state %d\n", __func__,
> +				work->state);
> +		}
> +		spin_unlock_bh(&work->state_lock);
> +	} while (cont);
> +
> +	work->ret = ret;
> +}
> +
> +int rxe_init_work(void *obj, struct rxe_work *work,
> +		  void *arg, int (*func)(void *), char *name)
> +{
> +	work->obj	= obj;
> +	work->arg	= arg;
> +	work->func	= func;
> +	snprintf(work->name, sizeof(work->name), "%s", name);
> +	work->destroyed	= false;
> +	atomic_set(&work->suspended, 0);
> +
> +	work->worker = create_singlethread_workqueue(name);
> +	INIT_WORK(&work->work, rxe_do_work);
> +
> +	work->state = WQ_STATE_START;
> +	spin_lock_init(&work->state_lock);
> +
> +	return 0;
> +}
> +
> +void rxe_cleanup_work(struct rxe_work *work)
> +{
> +	bool idle;
> +
> +	/*
> +	 * Mark the work, then wait for it to finish. It might be
> +	 * running in a non-workqueue (direct call) context.
> +	 */
> +	work->destroyed = true;
> +	flush_workqueue(work->worker);
> +
> +	do {
> +		spin_lock_bh(&work->state_lock);
> +		idle = (work->state == WQ_STATE_START);
> +		spin_unlock_bh(&work->state_lock);
> +	} while (!idle);
> +
> +	destroy_workqueue(work->worker);
> +}
> +
> +void rxe_run_work(struct rxe_work *work, int sched)
> +{
> +	if (work->destroyed)
> +		return;
> +
> +	/* busy-loop while qp reset is in progress */
> +	while (atomic_read(&work->suspended))
> +		continue;
> +
> +	if (sched)
> +		queue_work(work->worker, &work->work);
> +	else
> +		rxe_do_work(&work->work);
> +}
> +
> +void rxe_disable_work(struct rxe_work *work)
> +{
> +	atomic_inc(&work->suspended);
> +	flush_workqueue(work->worker);
> +}
> +
> +void rxe_enable_work(struct rxe_work *work)
> +{
> +	atomic_dec(&work->suspended);
> +}
> diff --git a/drivers/infiniband/sw/rxe/rxe_wq.h b/drivers/infiniband/sw/rxe/rxe_wq.h
> new file mode 100644
> index 000000000000..b40af598dcc0
> --- /dev/null
> +++ b/drivers/infiniband/sw/rxe/rxe_wq.h
> @@ -0,0 +1,71 @@
> +/* SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB */
> +/*
> + * Copyright (c) 2016 Mellanox Technologies Ltd. All rights reserved.
> + * Copyright (c) 2015 System Fabric Works, Inc. All rights reserved.
> + */
> +
> +#ifndef RXE_WQ_H
> +#define RXE_WQ_H
> +
> +enum {
> +	WQ_STATE_START	= 0,
> +	WQ_STATE_BUSY		= 1,
> +	WQ_STATE_ARMED	= 2,
> +};
> +
> +/*
> + * data structure to describe a 'work' which is a short
> + * function that returns 0 as long as it needs to be
> + * called again.
> + */
> +struct rxe_work {
> +	void			*obj;
> +	struct workqueue_struct	*worker;
> +	struct work_struct	work;
> +	int			state;
> +	spinlock_t		state_lock; /* spinlock for work state */
> +	void			*arg;
> +	int			(*func)(void *arg);
> +	int			ret;
> +	char			name[16];
> +	bool			destroyed;
> +	atomic_t		suspended; /* used to {dis,en}able workqueue */
> +};
> +
> +/*
> + * init rxe_work structure
> + *	arg  => parameter to pass to fcn
> + *	func => function to call until it returns != 0
> + */
> +int rxe_init_work(void *obj, struct rxe_work *work,
> +		  void *arg, int (*func)(void *), char *name);
> +
> +/* cleanup work */
> +void rxe_cleanup_work(struct rxe_work *work);
> +
> +/*
> + * raw call to func in loop without any checking
> + * can call when workqueues are suspended.
> + */
> +int __rxe_do_work(struct rxe_work *work);
> +
> +/*
> + * common function called by any of the main workqueues
> + * If there is any chance that there is additional
> + * work to do someone must reschedule the work before
> + * leaving
> + */
> +void rxe_do_work(struct work_struct *w);
> +
> +/* run a work, else schedule it to run as a workqueue, The decision
> + * to run or schedule workqueue is based on the parameter sched.
> + */
> +void rxe_run_work(struct rxe_work *work, int sched);
> +
> +/* keep a work from scheduling */
> +void rxe_disable_work(struct rxe_work *work);
> +
> +/* allow work to run */
> +void rxe_enable_work(struct rxe_work *work);
> +
> +#endif /* RXE_WQ_H */


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 2/7] RDMA/rxe: Convert the triple tasklets to workqueues
  2022-09-11  7:10   ` Yanjun Zhu
@ 2022-09-11 15:08     ` Bart Van Assche
  2022-09-12  7:58       ` matsuda-daisuke
  2022-09-12  8:25       ` Yanjun Zhu
  0 siblings, 2 replies; 25+ messages in thread
From: Bart Van Assche @ 2022-09-11 15:08 UTC (permalink / raw)
  To: Yanjun Zhu, Daisuke Matsuda, linux-rdma, leonro, jgg, zyjzyj2000
  Cc: nvdimm, linux-kernel, rpearsonhpe, yangx.jy, lizhijian, y-goto

On 9/11/22 00:10, Yanjun Zhu wrote:
> I also implemented a workqueue for rxe. IMO, can we add a variable to
> decide to use tasklet or workqueue?
> 
> If user prefer using tasklet, he can set the variable to use
> tasklet. And the default is tasklet. Set the variable to another
> value to use workqueue.

I'm in favor of removing all uses of the tasklet mechanism because of 
the disadvantages of that mechanism. See also:
* "Eliminating tasklets" (https://lwn.net/Articles/239633/).
* "Modernizing the tasklet API" (https://lwn.net/Articles/830964/).
* Sebastian Andrzej Siewior's opinion about tasklets 
(https://lore.kernel.org/all/YvovfXMJQAUBsvBZ@linutronix.de/).

Thanks,

Bart.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [RFC PATCH 2/7] RDMA/rxe: Convert the triple tasklets to workqueues
  2022-09-11 15:08     ` Bart Van Assche
@ 2022-09-12  7:58       ` matsuda-daisuke
  2022-09-12  8:29         ` Yanjun Zhu
  2022-09-12 19:52         ` Bob Pearson
  2022-09-12  8:25       ` Yanjun Zhu
  1 sibling, 2 replies; 25+ messages in thread
From: matsuda-daisuke @ 2022-09-12  7:58 UTC (permalink / raw)
  To: 'Bart Van Assche',
	Yanjun Zhu, linux-rdma, leonro, jgg, zyjzyj2000
  Cc: nvdimm, linux-kernel, rpearsonhpe, yangx.jy, lizhijian, y-goto

On Mon, Sep 12, 2022 12:09 AM Bart Van Assche wrote:
> On 9/11/22 00:10, Yanjun Zhu wrote:
> > I also implemented a workqueue for rxe. IMO, can we add a variable to
> > decide to use tasklet or workqueue?
> >
> > If user prefer using tasklet, he can set the variable to use
> > tasklet. And the default is tasklet. Set the variable to another
> > value to use workqueue.

That's an interesting idea, but I am not sure how users would specify it.
IIRC, the tasklets are created when an rdma link is added, typically by
executing the 'rdma link add' command. I don't think we can add
a device-specific option to the utility (iproute2/rdma).

> 
> I'm in favor of removing all uses of the tasklet mechanism because of
> the disadvantages of that mechanism. See also:
> * "Eliminating tasklets" (https://lwn.net/Articles/239633/).
> * "Modernizing the tasklet API" (https://lwn.net/Articles/830964/).
> * Sebastian Andrzej Siewior's opinion about tasklets
> (https://lore.kernel.org/all/YvovfXMJQAUBsvBZ@linutronix.de/).

I am also in favor of using workqueues alone, not only because of the
disadvantages above but also to avoid complexity. I would like to know
whether anybody would be bothered by the change, especially in terms
of performance.

Thanks,
Daisuke

> 
> Thanks,
> 
> Bart.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 2/7] RDMA/rxe: Convert the triple tasklets to workqueues
  2022-09-11 15:08     ` Bart Van Assche
  2022-09-12  7:58       ` matsuda-daisuke
@ 2022-09-12  8:25       ` Yanjun Zhu
  1 sibling, 0 replies; 25+ messages in thread
From: Yanjun Zhu @ 2022-09-12  8:25 UTC (permalink / raw)
  To: Bart Van Assche, Daisuke Matsuda, linux-rdma, leonro, jgg, zyjzyj2000
  Cc: nvdimm, linux-kernel, rpearsonhpe, yangx.jy, lizhijian, y-goto


On 2022/9/11 23:08, Bart Van Assche wrote:
> On 9/11/22 00:10, Yanjun Zhu wrote:
>> I also implemented a workqueue for rxe. IMO, can we add a variable to
>> decide to use tasklet or workqueue?
>>
>> If user prefer using tasklet, he can set the variable to use
>> tasklet. And the default is tasklet. Set the variable to another
>> value to use workqueue.
>
> I'm in favor of removing all uses of the tasklet mechanism because of 
> the disadvantages of that mechanism. See also:
> * "Eliminating tasklets" (https://lwn.net/Articles/239633/).
> * "Modernizing the tasklet API" (https://lwn.net/Articles/830964/).
> * Sebastian Andrzej Siewior's opinion about tasklets 
> (https://lore.kernel.org/all/YvovfXMJQAUBsvBZ@linutronix.de/).

Thanks, Bart

https://lwn.net/Articles/239633/ argues for removing tasklets, but
https://lwn.net/Articles/240323/ describes the differences between
workqueues and tasklets.

I am not sure whether the differences between tasklets and workqueues
described in https://lwn.net/Articles/240323/ have been resolved. If you
know, please let me know.

And according to https://lwn.net/Articles/830964/ and
https://lore.kernel.org/all/YvovfXMJQAUBsvBZ@linutronix.de/, tasklets can
be replaced by workqueues, timers, or threaded interrupts.

"

In current kernels, tasklets can be replaced by workqueues, timers, or 
threaded interrupts. If threaded interrupts are used, the work may just 
be executed in the interrupt handler itself. Those newer mechanisms do 
not have the disadvantages of tasklets and should satisfy the same 
needs, so developers do not see a reason to keep tasklets. It seems that 
any migration away from tasklets will be done one driver (or subsystem) 
at a time. For example, Takashi Iwai already reported having the 
conversion ready for sound drivers.

"

And in https://lore.kernel.org/all/YvovfXMJQAUBsvBZ@linutronix.de/,
Sebastian argues that threaded interrupts are a good substitute for
tasklets.

For my part, after implementing a workqueue in rxe, I did not find any
benefit from the workqueue, and sometimes the latency is worse with it.

This is why I have not sent the workqueue commits to the upstream mailing
list.

I am not sure whether it is a good idea to replace the tasklets with
workqueues or not.

Let me do more reading on the Linux upstream mailing lists.

Zhu Yanjun

>
> Thanks,
>
> Bart.
>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [RFC PATCH 2/7] RDMA/rxe: Convert the triple tasklets to workqueues
  2022-09-09 19:39   ` Bob Pearson
@ 2022-09-12  8:27     ` matsuda-daisuke
  0 siblings, 0 replies; 25+ messages in thread
From: matsuda-daisuke @ 2022-09-12  8:27 UTC (permalink / raw)
  To: 'Bob Pearson',
	linux-rdma, leonro, jgg, zyjzyj2000, Ziemba, Ian, Frank Zago
  Cc: nvdimm, linux-kernel, yangx.jy, lizhijian, y-goto

On Sat, Sep 10, 2022 4:39 AM Bob Pearson wrote:
> On 9/6/22 21:43, Daisuke Matsuda wrote:
> > In order to implement On-Demand Paging on the rxe driver, triple tasklets
> > (requester, responder, and completer) must be allowed to sleep so that they
> > can trigger page fault when pages being accessed are not present.
> >
> > This patch replaces the tasklets with workqueues, but still allows direct-
> > call of works from softirq context if it is obvious that MRs are not going
> > to be accessed and there is no work being processed in the workqueue.
> >
> > As counterparts to tasklet_disable() and tasklet_enable() are missing for
> > workqueues, an atomic value is introduced to get works suspended while qp
> > reset is in progress.
> >
> > As a reference, performance change was measured using ib_send_bw and
> > ib_send_lat commands over RC connection. Both the client and the server
> > were placed on the same KVM host. An option "-n 100000" was given to the
> > respective commands to iterate over 100000 times.
> >
> > Before applying this patch:
> > [ib_send_bw]
> > ---------------------------------------------------------------------------------------
> >  #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
> >  65536      100000           0.00               203.13             0.003250
> > ---------------------------------------------------------------------------------------
> > [ib_send_lat]
> > ---------------------------------------------------------------------------------------
> >  #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]    t_avg[usec]    t_stdev[usec]   99% percentile[usec]   99.9% percentile[usec]
> >  2       100000          30.91          1519.05      34.23             36.06            5.15            48.49                   82.37
> > ---------------------------------------------------------------------------------------
> >
> > After applying this patch:
> > [ib_send_bw]
> > ---------------------------------------------------------------------------------------
> >  #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
> >  65536      100000           0.00               240.88             0.003854
> > ---------------------------------------------------------------------------------------
> > [ib_send_lat]
> > ---------------------------------------------------------------------------------------
> >  #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]    t_avg[usec]    t_stdev[usec]   99% percentile[usec]   99.9% percentile[usec]
> >  2       100000          40.88          2994.82      47.80             50.25            13.70           76.42                   185.04
> > ---------------------------------------------------------------------------------------
> >
> > The test was conducted 3 times for each kernel, and the results with median
> > "BW average" and "t_typical" are shown above. It shows the conversion
> > improves the bandwidth while causing higher latency.
> >
> > Signed-off-by: Daisuke Matsuda <matsuda-daisuke@fujitsu.com>
> > ---
> >  drivers/infiniband/sw/rxe/Makefile    |   2 +-
> >  drivers/infiniband/sw/rxe/rxe_comp.c  |  42 ++++---
> >  drivers/infiniband/sw/rxe/rxe_loc.h   |   2 +-
> >  drivers/infiniband/sw/rxe/rxe_net.c   |   4 +-
> >  drivers/infiniband/sw/rxe/rxe_param.h |   2 +-
> >  drivers/infiniband/sw/rxe/rxe_qp.c    |  68 +++++------
> >  drivers/infiniband/sw/rxe/rxe_recv.c  |   2 +-
> >  drivers/infiniband/sw/rxe/rxe_req.c   |  14 +--
> >  drivers/infiniband/sw/rxe/rxe_resp.c  |  16 +--
> >  drivers/infiniband/sw/rxe/rxe_task.c  | 152 ------------------------
> >  drivers/infiniband/sw/rxe/rxe_task.h  |  69 -----------
> >  drivers/infiniband/sw/rxe/rxe_verbs.c |   8 +-
> >  drivers/infiniband/sw/rxe/rxe_verbs.h |   8 +-
> >  drivers/infiniband/sw/rxe/rxe_wq.c    | 161 ++++++++++++++++++++++++++
> >  drivers/infiniband/sw/rxe/rxe_wq.h    |  71 ++++++++++++
> >  15 files changed, 322 insertions(+), 299 deletions(-)
> >  delete mode 100644 drivers/infiniband/sw/rxe/rxe_task.c
> >  delete mode 100644 drivers/infiniband/sw/rxe/rxe_task.h
> >  create mode 100644 drivers/infiniband/sw/rxe/rxe_wq.c
> >  create mode 100644 drivers/infiniband/sw/rxe/rxe_wq.h

<...>

> 
> Daiskuke,
> 
> We (hpe) have an internal patch set that also allows using work queues instead of tasklets.
> There is a certain amount of work to allow control over cpu assignment from ulp or application
> level. This is critical for high performance in multithreaded IO applications. It would be in
> both of our interests if we could find a common implementation that works for everyone.

I agree with you, and I am interested in your work.

Would you post the patch set? It may be possible to rebase my work on your implementation.
I simply replaced the tasklets mechanically and modified the conditions for dispatching work
to the responder and completer workqueues. The changes are not very complicated. I'll be away
between 9/15 and 9/20. After that, I can take the time to look into it.

By the way, I would also like you to join in the discussion about whether to preserve tasklets or not.
Cf. https://lore.kernel.org/all/TYCPR01MB8455D739FC9FB034E3485C87E5449@TYCPR01MB8455.jpnprd01.prod.outlook.com/

> 
> Perhaps we could arrange for a call to discuss?

This is an important change that may affect all rxe users,
so I think the discussion should be open to everybody, at least at this stage.

Thanks,
Daisuke


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 2/7] RDMA/rxe: Convert the triple tasklets to workqueues
  2022-09-12  7:58       ` matsuda-daisuke
@ 2022-09-12  8:29         ` Yanjun Zhu
  2022-09-12 19:52         ` Bob Pearson
  1 sibling, 0 replies; 25+ messages in thread
From: Yanjun Zhu @ 2022-09-12  8:29 UTC (permalink / raw)
  To: matsuda-daisuke, 'Bart Van Assche',
	linux-rdma, leonro, jgg, zyjzyj2000
  Cc: nvdimm, linux-kernel, rpearsonhpe, yangx.jy, lizhijian, y-goto


On 2022/9/12 15:58, matsuda-daisuke@fujitsu.com wrote:
> On Mon, Sep 12, 2022 12:09 AM Bart Van Assche wrote:
>> On 9/11/22 00:10, Yanjun Zhu wrote:
>>> I also implemented a workqueue for rxe. IMO, can we add a variable to
>>> decide to use tasklet or workqueue?
>>>
>>> If user prefer using tasklet, he can set the variable to use
>>> tasklet. And the default is tasklet. Set the variable to another
>>> value to use workqueue.
> That's an interesting idea, but I am not sure how users specify it.
> IIRC, tasklets are generated when rdma link is added, typically by
> executing ' rdma link add' command. I don't think we can add
> an device specific option to the utility(iproute2/rdma).

In my workqueue implementation, I used a sysctl variable to choose
between the tasklet and the workqueue.

You can use a sysctl or a driver parameter to decide whether to use the
tasklet or the workqueue; a rough sketch of the sysctl route follows below.

Finally, macros like #if ... #else ... #endif can also serve as a method.
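
As a rough illustration only; the knob name, the sysctl path, and the
surrounding code below are made up here and are not my actual implementation:

  #include <linux/init.h>
  #include <linux/sysctl.h>

  /* hypothetical knob: 0 = tasklet (default), 1 = workqueue */
  static int rxe_use_wq;

  static struct ctl_table rxe_sysctl_table[] = {
  	{
  		.procname	= "rxe_use_wq",
  		.data		= &rxe_use_wq,
  		.maxlen		= sizeof(int),
  		.mode		= 0644,
  		.proc_handler	= proc_dointvec_minmax,
  		.extra1		= SYSCTL_ZERO,
  		.extra2		= SYSCTL_ONE,
  	},
  	{ }
  };

  static int __init rxe_sysctl_init(void)
  {
  	/* the scheduling path would then test rxe_use_wq to pick the mechanism */
  	return register_sysctl("net/rxe", rxe_sysctl_table) ? 0 : -ENOMEM;
  }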

Zhu Yanjun

>
>> I'm in favor of removing all uses of the tasklet mechanism because of
>> the disadvantages of that mechanism. See also:
>> * "Eliminating tasklets" (https://lwn.net/Articles/239633/).
>> * "Modernizing the tasklet API" (https://lwn.net/Articles/830964/).
>> * Sebastian Andrzej Siewior's opinion about tasklets
>> (https://lore.kernel.org/all/YvovfXMJQAUBsvBZ@linutronix.de/).
> I am also in favor of using workqueues alone not only because of the
> disadvantages above but also to avoid complexity. I would like to know
> if there is anybody who will bothered by the change especially in terms
> of performance.
>
> Thanks,
> Daisuke
>
>> Thanks,
>>
>> Bart.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [RFC PATCH 0/7] RDMA/rxe: On-Demand Paging on SoftRoCE
  2022-09-09  3:07 ` Li Zhijian
@ 2022-09-12  9:21   ` matsuda-daisuke
  0 siblings, 0 replies; 25+ messages in thread
From: matsuda-daisuke @ 2022-09-12  9:21 UTC (permalink / raw)
  To: lizhijian, linux-rdma, leonro, jgg, zyjzyj2000
  Cc: nvdimm, linux-kernel, rpearsonhpe, yangx.jy, y-goto

On Fri, Sep 9, 2022 12:08 PM Li, Zhijian wrote:
> 
> Daisuke
> 
> Great job.
> 
> I love this feature, before starting reviewing you patches, i tested it with QEMU(with fsdax memory-backend) migration
> over RDMA where it worked for MLX5 before.

Thank you for trying it out!
Anybody interested in the feature can get the RFC tree from my GitHub:
https://github.com/daimatsuda/linux/tree/odp_rfc

> 
> This time, with your ODP patches, it works on RXE, though ibv_advise_mr may not be ready yet.

ibv_advise_mr(3) is what I referred to as the prefetch feature. Currently, libibverbs always returns
an EOPNOTSUPP error. We need to implement the feature in the rxe driver and add a handler to use
it from librxe in the future.

Thanks,
Daisuke

> 
> 
> Thanks
> Zhijian
> 
> 
> On 07/09/2022 10:42, Daisuke Matsuda wrote:
> > Hi everyone,
> >
> > This patch series implements the On-Demand Paging feature on SoftRoCE(rxe)
> > driver, which has been available only in mlx5 driver[1] so far.
> >
> > [Overview]
> > When applications register a memory region(MR), RDMA drivers normally pin
> > pages in the MR so that physical addresses are never changed during RDMA
> > communication. This requires the MR to fit in physical memory and
> > inevitably leads to memory pressure. On the other hand, On-Demand Paging
> > (ODP) allows applications to register MRs without pinning pages. They are
> > paged-in when the driver requires and paged-out when the OS reclaims. As a
> > result, it is possible to register a large MR that does not fit in physical
> > memory without taking up so much physical memory.
> >
> > [Why to add this feature?]
> > We, Fujitsu, have contributed to RDMA with a view to using it with
> > persistent memory. Persistent memory can host a filesystem that allows
> > applications to read/write files directly without involving page cache.
> > This is called FS-DAX(filesystem direct access) mode. There is a problem
> > that data on DAX-enabled filesystem cannot be duplicated with software RAID
> > or other hardware methods. Data replication with RDMA, which features
> > high-speed connections, is the best solution for the problem.
> >
> > However, there is a known issue that hinders using RDMA with FS-DAX. When
> > RDMA operations to a file and update of the file metadata are processed
> > concurrently on the same node, illegal memory accesses can be executed,
> > disregarding the updated metadata. This is because RDMA operations do not
> > go through page cache but access data directly. There was an effort[2] to
> > solve this problem, but it was rejected in the end. Though there is no
> > general solution available, it is possible to work around the problem using
> > the ODP feature that has been available only in mlx5. ODP enables drivers
> > to update metadata before processing RDMA operations.
> >
> > We have enhanced the rxe to expedite the usage of persistent memory. Our
> > contribution to rxe includes RDMA Atomic write[3] and RDMA Flush[4]. With
> > them being merged to rxe along with ODP, an environment will be ready for
> > developers to create and test software for RDMA with FS-DAX. There is a
> > library(librpma)[5] being developed for this purpose. This environment
> > can be used by anybody without any special hardware but an ordinary
> > computer with a normal NIC though it is inferior to hardware
> > implementations in terms of performance.
> >
> > [Design considerations]
> > ODP has been available only in mlx5, but functions and data structures
> > that can be used commonly are provided in ib_uverbs(infiniband/core). The
> > interface is heavily dependent on HMM infrastructure[6], and this patchset
> > use them as much as possible. While mlx5 has both Explicit and Implicit ODP
> > features along with prefetch feature, this patchset implements the Explicit
> > ODP feature only.
> >
> > As an important change, it is necessary to convert triple tasklets
> > (requester, responder and completer) to workqueues because they must be
> > able to sleep in order to trigger page fault before accessing MRs. I did a
> > test shown in the 2nd patch and found that the change makes the latency
> > higher while improving the bandwidth. Though it may be possible to create a
> > new independent workqueue for page fault execution, it is a not very
> > sensible solution since the tasklets have to busy-wait its completion in
> > that case.
> >
> > If responder and completer sleep, it becomes more likely that packet drop
> > occurs because of overflow in receiver queue. There are multiple queues
> > involved, but, as SoftRoCE uses UDP, the most important one would be the
> > UDP buffers. The size can be configured in net.core.rmem_default and
> > net.core.rmem_max sysconfig parameters. Users should change these values in
> > case of packet drop, but page fault would be typically not so long as to
> > cause the problem.
> >
> > [How does ODP work?]
> > "struct ib_umem_odp" is used to manage pages. It is created for each
> > ODP-enabled MR on its registration. This struct holds a pair of arrays
> > (dma_list/pfn_list) that serve as a driver page table. DMA addresses and
> > PFNs are stored in the driver page table. They are updated on page-in and
> > page-out, both of which use the common interface in ib_uverbs.
> >
> > Page-in can occur when requester, responder or completer access an MR in
> > order to process RDMA operations. If they find that the pages being
> > accessed are not present on physical memory or requisite permissions are
> > not set on the pages, they provoke page fault to make pages present with
> > proper permissions and at the same time update the driver page table. After
> > confirming the presence of the pages, they execute memory access such as
> > read, write or atomic operations.
> >
> > Page-out is triggered by page reclaim or filesystem events (e.g. metadata
> > update of a file that is being used as an MR). When creating an ODP-enabled
> > MR, the driver registers an MMU notifier callback. When the kernel issues a
> > page invalidation notification, the callback is provoked to unmap DMA
> > addresses and update the driver page table. After that, the kernel releases
> > the pages.
> >
> > [Supported operations]
> > All operations are supported on RC connection. Atomic write[3] and Flush[4]
> > operations, which are still under discussion, are also going to be
> > supported after their patches are merged. On UD connection, Send, Recv,
> > SRQ-Recv are supported. Because other operations are not supported on mlx5,
> > I take after the decision right now.
> >
> > [How to test ODP?]
> > There are only a few resources available for testing. pyverbs testcases in
> > rdma-core and perftest[7] are recommendable ones. Note that you may have to
> > build perftest from upstream since older versions do not handle ODP
> > capabilities correctly.
> >
> > [Future work]
> > My next work will be the prefetch feature. It allows applications to
> > trigger page fault using ibv_advise_mr(3) to optimize performance. Some
> > existing software like librpma use this feature. Additionally, I think we
> > can also add the implicit ODP feature in the future.
> >
> > [1] [RFC 00/20] On demand paging
> > https://www.spinics.net/lists/linux-rdma/msg18906.html
> >
> > [2] [RFC PATCH v2 00/19] RDMA/FS DAX truncate proposal V1,000,002 ;-)
> > https://lore.kernel.org/nvdimm/20190809225833.6657-1-ira.weiny@intel.com/
> >
> > [3] [RESEND PATCH v5 0/2] RDMA/rxe: Add RDMA Atomic Write operation
> > https://www.spinics.net/lists/linux-rdma/msg111428.html
> >
> > [4] [PATCH v4 0/6] RDMA/rxe: Add RDMA FLUSH operation
> > https://www.spinics.net/lists/kernel/msg4462045.html
> >
> > [5] librpma: Remote Persistent Memory Access Library
> > https://github.com/pmem/rpma
> >
> > [6] Heterogeneous Memory Management (HMM)
> > https://www.kernel.org/doc/html/latest/mm/hmm.html
> >
> > [7] linux-rdma/perftest: Infiniband Verbs Performance Tests
> > https://github.com/linux-rdma/perftest
> >
> > Daisuke Matsuda (7):
> >    IB/mlx5: Change ib_umem_odp_map_dma_single_page() to retain umem_mutex
> >    RDMA/rxe: Convert the triple tasklets to workqueues
> >    RDMA/rxe: Cleanup code for responder Atomic operations
> >    RDMA/rxe: Add page invalidation support
> >    RDMA/rxe: Allow registering MRs for On-Demand Paging
> >    RDMA/rxe: Add support for Send/Recv/Write/Read operations with ODP
> >    RDMA/rxe: Add support for the traditional Atomic operations with ODP
> >
> >   drivers/infiniband/core/umem_odp.c    |   6 +-
> >   drivers/infiniband/hw/mlx5/odp.c      |   4 +-
> >   drivers/infiniband/sw/rxe/Makefile    |   5 +-
> >   drivers/infiniband/sw/rxe/rxe.c       |  18 ++
> >   drivers/infiniband/sw/rxe/rxe_comp.c  |  42 +++-
> >   drivers/infiniband/sw/rxe/rxe_loc.h   |  11 +-
> >   drivers/infiniband/sw/rxe/rxe_mr.c    |   7 +-
> >   drivers/infiniband/sw/rxe/rxe_net.c   |   4 +-
> >   drivers/infiniband/sw/rxe/rxe_odp.c   | 329 ++++++++++++++++++++++++++
> >   drivers/infiniband/sw/rxe/rxe_param.h |   2 +-
> >   drivers/infiniband/sw/rxe/rxe_qp.c    |  68 +++---
> >   drivers/infiniband/sw/rxe/rxe_recv.c  |   2 +-
> >   drivers/infiniband/sw/rxe/rxe_req.c   |  14 +-
> >   drivers/infiniband/sw/rxe/rxe_resp.c  | 175 +++++++-------
> >   drivers/infiniband/sw/rxe/rxe_resp.h  |  44 ++++
> >   drivers/infiniband/sw/rxe/rxe_task.c  | 152 ------------
> >   drivers/infiniband/sw/rxe/rxe_task.h  |  69 ------
> >   drivers/infiniband/sw/rxe/rxe_verbs.c |  16 +-
> >   drivers/infiniband/sw/rxe/rxe_verbs.h |  10 +-
> >   drivers/infiniband/sw/rxe/rxe_wq.c    | 161 +++++++++++++
> >   drivers/infiniband/sw/rxe/rxe_wq.h    |  71 ++++++
> >   21 files changed, 824 insertions(+), 386 deletions(-)
> >   create mode 100644 drivers/infiniband/sw/rxe/rxe_odp.c
> >   create mode 100644 drivers/infiniband/sw/rxe/rxe_resp.h
> >   delete mode 100644 drivers/infiniband/sw/rxe/rxe_task.c
> >   delete mode 100644 drivers/infiniband/sw/rxe/rxe_task.h
> >   create mode 100644 drivers/infiniband/sw/rxe/rxe_wq.c
> >   create mode 100644 drivers/infiniband/sw/rxe/rxe_wq.h
> >


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC PATCH 2/7] RDMA/rxe: Convert the triple tasklets to workqueues
  2022-09-12  7:58       ` matsuda-daisuke
  2022-09-12  8:29         ` Yanjun Zhu
@ 2022-09-12 19:52         ` Bob Pearson
  2022-09-28  6:40           ` matsuda-daisuke
  1 sibling, 1 reply; 25+ messages in thread
From: Bob Pearson @ 2022-09-12 19:52 UTC (permalink / raw)
  To: matsuda-daisuke, 'Bart Van Assche',
	Yanjun Zhu, linux-rdma, leonro, jgg, zyjzyj2000
  Cc: nvdimm, linux-kernel, yangx.jy, lizhijian, y-goto

On 9/12/22 02:58, matsuda-daisuke@fujitsu.com wrote:
> On Mon, Sep 12, 2022 12:09 AM Bart Van Assche wrote:
>> On 9/11/22 00:10, Yanjun Zhu wrote:
>>> I also implemented a workqueue for rxe. IMO, can we add a variable to
>>> decide to use tasklet or workqueue?
>>>
>>> If user prefer using tasklet, he can set the variable to use
>>> tasklet. And the default is tasklet. Set the variable to another
>>> value to use workqueue.
> 
> That's an interesting idea, but I am not sure how users specify it.
> IIRC, tasklets are generated when rdma link is added, typically by 
> executing ' rdma link add' command. I don't think we can add
> an device specific option to the utility(iproute2/rdma).
> 
>>
>> I'm in favor of removing all uses of the tasklet mechanism because of
>> the disadvantages of that mechanism. See also:
>> * "Eliminating tasklets" (https://lwn.net/Articles/239633/).
>> * "Modernizing the tasklet API" (https://lwn.net/Articles/830964/).
>> * Sebastian Andrzej Siewior's opinion about tasklets
>> (https://lore.kernel.org/all/YvovfXMJQAUBsvBZ@linutronix.de/).
> 
> I am also in favor of using workqueues alone not only because of the
> disadvantages above but also to avoid complexity. I would like to know
> if there is anybody who will bothered by the change especially in terms
> of performance.
> 
> Thanks,
> Daisuke
> 
>>
>> Thanks,
>>
>> Bart.
> 

The HPE patch set for work queues (I should send it in) kept the ability to run either tasklets or work queues.
The change was motivated by a benchmarking exercise: we found that the performance of tasklets was noticeably
better for one IO stream, but for two or more IO streams work queues were better because we could place the
work on separate CPUs. Tasklets have a tendency to bunch up on the same CPU. I am interested in how Matsuda
got better/same performance for work queues.
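
As a rough illustration of the CPU-placement point (this is not the actual HPE
patch set; all names below are hypothetical), an unbound workqueue lets the
scheduler place each work item on any available CPU, whereas a tasklet runs on
the CPU that scheduled it:

/* Hypothetical sketch: allocate an unbound workqueue so responder/completer
 * work items can migrate to idle CPUs instead of piling up on the CPU that
 * handled the receive softirq. These are not the real rxe symbols.
 */
#include <linux/errno.h>
#include <linux/workqueue.h>

static struct workqueue_struct *rxe_wq_sketch;

static int rxe_wq_init_sketch(void)
{
	/* WQ_UNBOUND: the scheduler may run work items on any CPU */
	rxe_wq_sketch = alloc_workqueue("rxe_wq_sketch", WQ_UNBOUND, 0);
	return rxe_wq_sketch ? 0 : -ENOMEM;
}

static void rxe_wq_fini_sketch(void)
{
	destroy_workqueue(rxe_wq_sketch);
}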

Bob

^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [RFC PATCH 2/7] RDMA/rxe: Convert the triple tasklets to workqueues
  2022-09-12 19:52         ` Bob Pearson
@ 2022-09-28  6:40           ` matsuda-daisuke
  0 siblings, 0 replies; 25+ messages in thread
From: matsuda-daisuke @ 2022-09-28  6:40 UTC (permalink / raw)
  To: 'Bob Pearson', 'Bart Van Assche',
	Yanjun Zhu, linux-rdma, leonro, jgg, zyjzyj2000
  Cc: nvdimm, linux-kernel, yangx.jy, lizhijian, y-goto

Dear Bob, Yanjun, and Bart,

Sorry for taking a long time to reply.

> On 9/12/22 02:58, matsuda-daisuke@fujitsu.com wrote:
> > On Mon, Sep 12, 2022 12:09 AM Bart Van Assche wrote:
> >> On 9/11/22 00:10, Yanjun Zhu wrote:
> >>> I also implemented a workqueue for rxe. IMO, can we add a variable to
> >>> decide whether to use a tasklet or a workqueue?
> >>>
> >>> If a user prefers tasklets, he can set the variable to use
> >>> tasklets, and the default is tasklet. Setting the variable to another
> >>> value selects the workqueue.
> >
> > That's an interesting idea, but I am not sure how users would specify it.
> > IIRC, tasklets are created when an rdma link is added, typically by
> > executing the 'rdma link add' command. I don't think we can add
> > a device-specific option to the utility (iproute2/rdma).
> >
> >>
> >> I'm in favor of removing all uses of the tasklet mechanism because of
> >> the disadvantages of that mechanism. See also:
> >> * "Eliminating tasklets" (https://lwn.net/Articles/239633/).
> >> * "Modernizing the tasklet API" (https://lwn.net/Articles/830964/).
> >> * Sebastian Andrzej Siewior's opinion about tasklets
> >> (https://lore.kernel.org/all/YvovfXMJQAUBsvBZ@linutronix.de/).
> >
> > I am also in favor of using workqueues alone, not only because of the
> > disadvantages above but also to avoid complexity. I would like to know
> > if there is anybody who will be bothered by the change, especially in
> > terms of performance.
> >
> > Thanks,
> > Daisuke
> >
> >>
> >> Thanks,
> >>
> >> Bart.
> >
> 
> The HPE patch set for work queues (I should send it in) kept the ability to run either tasklets or work queues.
> The change was motivated by a benchmarking exercise: we found that the performance of tasklets was noticeably
> better for one IO stream, but for two or more IO streams work queues were better because we could place the
> work on separate CPUs. Tasklets have a tendency to bunch up on the same CPU. I am interested in how Matsuda
> got better/same performance for work queues.

As far as I measured the bandwidth with the ib_send_bw command, the performance
was better with workqueues. There seem to be multiple factors that affect
the result. For example, with the current implementation, rxe_responder() can
sometimes be called directly from softirq (NET_RX_SOFTIRQ) context. I changed
rxe_resp_queue_pkt() to always schedule the responder for all incoming
requests. This may have led to better utilization of multiple processors because
softirq code and responder code are more likely to run concurrently on different
cores in that case. Tasklets, by contrast, are likely to run on the same core as
the softirq code because TASKLET_SOFTIRQ is processed after NET_RX_SOFTIRQ in
__do_softirq().
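
As a minimal sketch of that change (the structure and symbol names here are
illustrative only, not the actual rxe code), the idea is to stop running the
responder inline from NET_RX softirq context and always hand the packet off to
a work item, which can then run on another CPU and is allowed to sleep:

#include <linux/kernel.h>
#include <linux/skbuff.h>
#include <linux/workqueue.h>

struct rxe_qp_sketch {
	struct sk_buff_head req_pkts;	/* inbound request packets */
	struct work_struct resp_work;	/* responder work item */
};

static struct workqueue_struct *rxe_wq;	/* created once at driver load */

/* Responder body: runs in process context, so it may sleep, e.g. to
 * resolve an ODP page fault. */
static void rxe_responder_sketch(struct work_struct *work)
{
	struct rxe_qp_sketch *qp =
		container_of(work, struct rxe_qp_sketch, resp_work);
	struct sk_buff *skb;

	while ((skb = skb_dequeue(&qp->req_pkts)))
		consume_skb(skb);	/* the real code would process the request */
}

static void rxe_qp_init_sketch(struct rxe_qp_sketch *qp)
{
	skb_queue_head_init(&qp->req_pkts);
	INIT_WORK(&qp->resp_work, rxe_responder_sketch);
}

/* Called from NET_RX softirq context: never run the responder inline,
 * always defer it to the workqueue. */
static void rxe_resp_queue_pkt_sketch(struct rxe_qp_sketch *qp,
				      struct sk_buff *skb)
{
	skb_queue_tail(&qp->req_pkts, skb);
	queue_work(rxe_wq, &qp->resp_work);
}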

That being said, I think it is also true that the performance of tasklets can be
superior to that of workqueues in some situations. When I measured the bandwidth
of RDMA Read with the ib_read_bw command, it was better with tasklets. Additionally,
the latency is generally around 40% higher with workqueues, so it is possible that
some kinds of workloads do not benefit from using workqueues.

I therefore think we may preserve tasklets despite the disadvantages Bart pointed
out. While I have no objection to removing tasklets entirely, I am in favor of
Yanjun's suggestion of switching between tasklets and workqueues with a sysctl
parameter. I would like to hear what you think about this.
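
A rough sketch of such a switch (shown here as a module parameter rather than a
sysctl entry for brevity; none of the names below are existing rxe symbols)
could look like this:

#include <linux/interrupt.h>
#include <linux/module.h>
#include <linux/workqueue.h>

static bool use_workqueue;	/* default: keep the tasklet behaviour */
module_param(use_workqueue, bool, 0444);
MODULE_PARM_DESC(use_workqueue, "Run rxe tasks in a workqueue instead of a tasklet");

struct rxe_task_sketch {
	struct tasklet_struct tasklet;
	struct work_struct work;
};

/* Single scheduling entry point; callers do not care which mechanism runs. */
static void rxe_sched_task_sketch(struct rxe_task_sketch *task)
{
	if (use_workqueue)
		queue_work(system_unbound_wq, &task->work);
	else
		tasklet_schedule(&task->tasklet);
}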

I would also like to know when Bob is going to post his patch set. Both of us need to use
workqueues, but allowing sleep for ODP and improving performance for multiple IO
streams are different matters. I suppose it will be easier to make the changes one by one.
If you need some more time to post it, I suggest we proceed with Yanjun's idea for now.
That would preserve the current implementation of tasklets, so it would not be hard to add
your change on top of it. Could you let me know your thoughts?


Regards,
Daisuke Matsuda


> 
> Bob

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2022-09-28  6:41 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-07  2:42 [RFC PATCH 0/7] RDMA/rxe: On-Demand Paging on SoftRoCE Daisuke Matsuda
2022-09-07  2:42 ` [RFC PATCH 1/7] IB/mlx5: Change ib_umem_odp_map_dma_single_page() to retain umem_mutex Daisuke Matsuda
2022-09-07  2:43 ` [RFC PATCH 2/7] RDMA/rxe: Convert the triple tasklets to workqueues Daisuke Matsuda
2022-09-09 19:39   ` Bob Pearson
2022-09-12  8:27     ` matsuda-daisuke
2022-09-11  7:10   ` Yanjun Zhu
2022-09-11 15:08     ` Bart Van Assche
2022-09-12  7:58       ` matsuda-daisuke
2022-09-12  8:29         ` Yanjun Zhu
2022-09-12 19:52         ` Bob Pearson
2022-09-28  6:40           ` matsuda-daisuke
2022-09-12  8:25       ` Yanjun Zhu
2022-09-07  2:43 ` [RFC PATCH 3/7] RDMA/rxe: Cleanup code for responder Atomic operations Daisuke Matsuda
2022-09-07  2:43 ` [RFC PATCH 4/7] RDMA/rxe: Add page invalidation support Daisuke Matsuda
2022-09-07  2:43 ` [RFC PATCH 5/7] RDMA/rxe: Allow registering MRs for On-Demand Paging Daisuke Matsuda
2022-09-08 16:57   ` Haris Iqbal
2022-09-09  0:55     ` matsuda-daisuke
2022-09-07  2:43 ` [RFC PATCH 6/7] RDMA/rxe: Add support for Send/Recv/Write/Read operations with ODP Daisuke Matsuda
2022-09-08  8:29   ` Leon Romanovsky
2022-09-09  2:45     ` matsuda-daisuke
2022-09-07  2:43 ` [RFC PATCH 7/7] RDMA/rxe: Add support for the traditional Atomic " Daisuke Matsuda
2022-09-08  8:40 ` [RFC PATCH 0/7] RDMA/rxe: On-Demand Paging on SoftRoCE Zhu Yanjun
2022-09-08 10:25   ` matsuda-daisuke
2022-09-09  3:07 ` Li Zhijian
2022-09-12  9:21   ` matsuda-daisuke
