linux-rdma.vger.kernel.org archive mirror
* [PATCH for-next v5 0/6] RDMA/rxe: Replace mr page map with an xarray
@ 2023-01-16 23:55 Bob Pearson
  2023-01-16 23:55 ` [PATCH for-next v5 1/6] RDMA/rxe: Cleanup mr_check_range Bob Pearson
                   ` (5 more replies)
  0 siblings, 6 replies; 16+ messages in thread
From: Bob Pearson @ 2023-01-16 23:55 UTC (permalink / raw)
  To: jgg, zyjzyj2000, leonro, yangx.jy, linux-rdma; +Cc: Bob Pearson

This patch series replaces the page map carried in each memory region
with a struct xarray. It is based on a sketch developed by Jason
Gunthorpe. The first five patches are preparation to cleanly isolate
all the mr-specific code into rxe_mr.c. The sixth patch is the actual
change.
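
In rough terms (a simplified sketch condensed from the diffs below,
not a verbatim excerpt), the per-page lookup moves from a two-level
table walk to a single xarray load:

	/* before: walk the struct rxe_map / struct rxe_phys_buf tables */
	buf = &mr->map[n / RXE_BUF_PER_MAP]->buf[n % RXE_BUF_PER_MAP];
	va = (void *)(uintptr_t)buf->addr + offset;

	/* after: one xarray indexed by page number, rcu-safe for readers */
	page = xa_load(&mr->page_list, rxe_mr_iova_to_index(mr, iova));
	va = kmap_local_page(page) + rxe_mr_iova_to_page_offset(mr, iova);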

v5:
  Responded to a note from lizhijian@fujitsu.com and restored calls to
  is_pmem_page() which were accidentally dropped in earlier versions.
v4:
  Responded to a comment by Zhu and cleaned up error passing between
  rxe_mr.c and rxe_resp.c.
  Other various cleanups including more careful use of unsigned ints.
  Rebased to current for-next.
v3:
  Fixed an error reported by kernel test robot
v2:
  Rebased to 6.2.0-rc1+
  Minor cleanups
  Fixed an error reported by Jason in 4/6: a missing 'if' after an 'else'.

Bob Pearson (6):
  RDMA/rxe: Cleanup mr_check_range
  RDMA/rxe: Move rxe_map_mr_sg to rxe_mr.c
  RDMA-rxe: Isolate mr code from atomic_reply()
  RDMA-rxe: Isolate mr code from atomic_write_reply()
  RDMA/rxe: Cleanup page variables in rxe_mr.c
  RDMA/rxe: Replace rxe_map and rxe_phys_buf by xarray

 drivers/infiniband/sw/rxe/rxe_loc.h   |  13 +-
 drivers/infiniband/sw/rxe/rxe_mr.c    | 625 +++++++++++++++-----------
 drivers/infiniband/sw/rxe/rxe_resp.c  | 118 ++---
 drivers/infiniband/sw/rxe/rxe_verbs.c |  36 --
 drivers/infiniband/sw/rxe/rxe_verbs.h |  37 +-
 5 files changed, 425 insertions(+), 404 deletions(-)


base-commit: 1ec82317a1daac78c04b0c15af89018ccf9fa2b7
-- 
2.37.2



* [PATCH for-next v5 1/6] RDMA/rxe: Cleanup mr_check_range
  2023-01-16 23:55 [PATCH for-next v5 0/6] RDMA/rxe: Replace mr page map with an xarray Bob Pearson
@ 2023-01-16 23:55 ` Bob Pearson
  2023-01-16 23:55 ` [PATCH for-next v5 2/6] RDMA/rxe: Move rxe_map_mr_sg to rxe_mr.c Bob Pearson
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: Bob Pearson @ 2023-01-16 23:55 UTC (permalink / raw)
  To: jgg, zyjzyj2000, leonro, yangx.jy, linux-rdma; +Cc: Bob Pearson

Remove blank lines and replace EFAULT by EINVAL when an invalid
mr type is used.

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_mr.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index 072eac4b65d2..632ee1e516a1 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -26,8 +26,6 @@ u8 rxe_get_next_key(u32 last_key)
 
 int mr_check_range(struct rxe_mr *mr, u64 iova, size_t length)
 {
-
-
 	switch (mr->ibmr.type) {
 	case IB_MR_TYPE_DMA:
 		return 0;
@@ -41,7 +39,7 @@ int mr_check_range(struct rxe_mr *mr, u64 iova, size_t length)
 
 	default:
 		rxe_dbg_mr(mr, "type (%d) not supported\n", mr->ibmr.type);
-		return -EFAULT;
+		return -EINVAL;
 	}
 }
 
-- 
2.37.2



* [PATCH for-next v5 2/6] RDMA/rxe: Move rxe_map_mr_sg to rxe_mr.c
  2023-01-16 23:55 [PATCH for-next v5 0/6] RDMA/rxe: Replace mr page map with an xarray Bob Pearson
  2023-01-16 23:55 ` [PATCH for-next v5 1/6] RDMA/rxe: Cleanup mr_check_range Bob Pearson
@ 2023-01-16 23:55 ` Bob Pearson
  2023-01-16 23:56 ` [PATCH for-next v5 3/6] RDMA-rxe: Isolate mr code from atomic_reply() Bob Pearson
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 16+ messages in thread
From: Bob Pearson @ 2023-01-16 23:55 UTC (permalink / raw)
  To: jgg, zyjzyj2000, leonro, yangx.jy, linux-rdma; +Cc: Bob Pearson

Move rxe_map_mr_sg() to rxe_mr.c where it makes a little more sense.

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_loc.h   |  2 ++
 drivers/infiniband/sw/rxe/rxe_mr.c    | 36 +++++++++++++++++++++++++++
 drivers/infiniband/sw/rxe/rxe_verbs.c | 36 ---------------------------
 3 files changed, 38 insertions(+), 36 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
index 948ce4902b10..29b6c2143045 100644
--- a/drivers/infiniband/sw/rxe/rxe_loc.h
+++ b/drivers/infiniband/sw/rxe/rxe_loc.h
@@ -69,6 +69,8 @@ int rxe_mr_copy(struct rxe_mr *mr, u64 iova, void *addr, int length,
 		enum rxe_mr_copy_dir dir);
 int copy_data(struct rxe_pd *pd, int access, struct rxe_dma_info *dma,
 	      void *addr, int length, enum rxe_mr_copy_dir dir);
+int rxe_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg,
+		  int sg_nents, unsigned int *sg_offset);
 void *iova_to_vaddr(struct rxe_mr *mr, u64 iova, int length);
 struct rxe_mr *lookup_mr(struct rxe_pd *pd, int access, u32 key,
 			 enum rxe_mr_lookup_type type);
diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index 632ee1e516a1..229c7259644c 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -223,6 +223,42 @@ int rxe_mr_init_fast(int max_pages, struct rxe_mr *mr)
 	return err;
 }
 
+static int rxe_set_page(struct ib_mr *ibmr, u64 addr)
+{
+	struct rxe_mr *mr = to_rmr(ibmr);
+	struct rxe_map *map;
+	struct rxe_phys_buf *buf;
+
+	if (unlikely(mr->nbuf == mr->num_buf))
+		return -ENOMEM;
+
+	map = mr->map[mr->nbuf / RXE_BUF_PER_MAP];
+	buf = &map->buf[mr->nbuf % RXE_BUF_PER_MAP];
+
+	buf->addr = addr;
+	buf->size = ibmr->page_size;
+	mr->nbuf++;
+
+	return 0;
+}
+
+int rxe_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg,
+		  int sg_nents, unsigned int *sg_offset)
+{
+	struct rxe_mr *mr = to_rmr(ibmr);
+	int n;
+
+	mr->nbuf = 0;
+
+	n = ib_sg_to_pages(ibmr, sg, sg_nents, sg_offset, rxe_set_page);
+
+	mr->page_shift = ilog2(ibmr->page_size);
+	mr->page_mask = ibmr->page_size - 1;
+	mr->offset = ibmr->iova & mr->page_mask;
+
+	return n;
+}
+
 static void lookup_iova(struct rxe_mr *mr, u64 iova, int *m_out, int *n_out,
 			size_t *offset_out)
 {
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.c b/drivers/infiniband/sw/rxe/rxe_verbs.c
index 025b35bf014e..7a902e0a0607 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.c
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.c
@@ -948,42 +948,6 @@ static struct ib_mr *rxe_alloc_mr(struct ib_pd *ibpd, enum ib_mr_type mr_type,
 	return ERR_PTR(err);
 }
 
-static int rxe_set_page(struct ib_mr *ibmr, u64 addr)
-{
-	struct rxe_mr *mr = to_rmr(ibmr);
-	struct rxe_map *map;
-	struct rxe_phys_buf *buf;
-
-	if (unlikely(mr->nbuf == mr->num_buf))
-		return -ENOMEM;
-
-	map = mr->map[mr->nbuf / RXE_BUF_PER_MAP];
-	buf = &map->buf[mr->nbuf % RXE_BUF_PER_MAP];
-
-	buf->addr = addr;
-	buf->size = ibmr->page_size;
-	mr->nbuf++;
-
-	return 0;
-}
-
-static int rxe_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg,
-			 int sg_nents, unsigned int *sg_offset)
-{
-	struct rxe_mr *mr = to_rmr(ibmr);
-	int n;
-
-	mr->nbuf = 0;
-
-	n = ib_sg_to_pages(ibmr, sg, sg_nents, sg_offset, rxe_set_page);
-
-	mr->page_shift = ilog2(ibmr->page_size);
-	mr->page_mask = ibmr->page_size - 1;
-	mr->offset = ibmr->iova & mr->page_mask;
-
-	return n;
-}
-
 static ssize_t parent_show(struct device *device,
 			   struct device_attribute *attr, char *buf)
 {
-- 
2.37.2



* [PATCH for-next v5 3/6] RDMA-rxe: Isolate mr code from atomic_reply()
  2023-01-16 23:55 [PATCH for-next v5 0/6] RDMA/rxe: Replace mr page map with an xarray Bob Pearson
  2023-01-16 23:55 ` [PATCH for-next v5 1/6] RDMA/rxe: Cleanup mr_check_range Bob Pearson
  2023-01-16 23:55 ` [PATCH for-next v5 2/6] RDMA/rxe: Move rxe_map_mr_sg to rxe_mr.c Bob Pearson
@ 2023-01-16 23:56 ` Bob Pearson
  2023-01-17 15:09   ` Jason Gunthorpe
  2023-01-16 23:56 ` [PATCH for-next v5 4/6] RDMA-rxe: Isolate mr code from atomic_write_reply() Bob Pearson
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 16+ messages in thread
From: Bob Pearson @ 2023-01-16 23:56 UTC (permalink / raw)
  To: jgg, zyjzyj2000, leonro, yangx.jy, linux-rdma; +Cc: Bob Pearson

Isolate mr specific code from atomic_reply() in rxe_resp.c into
a subroutine rxe_mr_do_atomic_op() in rxe_mr.c.

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_loc.h   |  2 +
 drivers/infiniband/sw/rxe/rxe_mr.c    | 83 ++++++++++++++++++---------
 drivers/infiniband/sw/rxe/rxe_resp.c  | 49 +++++-----------
 drivers/infiniband/sw/rxe/rxe_verbs.h |  5 ++
 4 files changed, 76 insertions(+), 63 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
index 29b6c2143045..bcb1bbcf50df 100644
--- a/drivers/infiniband/sw/rxe/rxe_loc.h
+++ b/drivers/infiniband/sw/rxe/rxe_loc.h
@@ -72,6 +72,8 @@ int copy_data(struct rxe_pd *pd, int access, struct rxe_dma_info *dma,
 int rxe_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg,
 		  int sg_nents, unsigned int *sg_offset);
 void *iova_to_vaddr(struct rxe_mr *mr, u64 iova, int length);
+int rxe_mr_do_atomic_op(struct rxe_mr *mr, u64 iova, int opcode,
+			u64 compare, u64 swap_add, u64 *orig_val);
 struct rxe_mr *lookup_mr(struct rxe_pd *pd, int access, u32 key,
 			 enum rxe_mr_lookup_type type);
 int mr_check_range(struct rxe_mr *mr, u64 iova, size_t length);
diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index 229c7259644c..15a8d44daa35 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -32,13 +32,15 @@ int mr_check_range(struct rxe_mr *mr, u64 iova, size_t length)
 
 	case IB_MR_TYPE_USER:
 	case IB_MR_TYPE_MEM_REG:
-		if (iova < mr->ibmr.iova || length > mr->ibmr.length ||
-		    iova > mr->ibmr.iova + mr->ibmr.length - length)
-			return -EFAULT;
+		if (iova < mr->ibmr.iova ||
+		    iova + length > mr->ibmr.iova + mr->ibmr.length) {
+			rxe_dbg_mr(mr, "iova/length out of range");
+			return -ERANGE;
+		}
 		return 0;
 
 	default:
-		rxe_dbg_mr(mr, "type (%d) not supported\n", mr->ibmr.type);
+		rxe_dbg_mr(mr, "mr type not supported\n");
 		return -EINVAL;
 	}
 }
@@ -299,37 +301,22 @@ void *iova_to_vaddr(struct rxe_mr *mr, u64 iova, int length)
 {
 	size_t offset;
 	int m, n;
-	void *addr;
 
-	if (mr->state != RXE_MR_STATE_VALID) {
-		rxe_dbg_mr(mr, "Not in valid state\n");
-		addr = NULL;
-		goto out;
-	}
+	if (mr->state != RXE_MR_STATE_VALID)
+		return NULL;
 
-	if (!mr->map) {
-		addr = (void *)(uintptr_t)iova;
-		goto out;
-	}
+	if (mr->ibmr.type == IB_MR_TYPE_DMA)
+		return (void *)(uintptr_t)iova;
 
-	if (mr_check_range(mr, iova, length)) {
-		rxe_dbg_mr(mr, "Range violation\n");
-		addr = NULL;
-		goto out;
-	}
+	if (mr_check_range(mr, iova, length))
+		return NULL;
 
 	lookup_iova(mr, iova, &m, &n, &offset);
 
-	if (offset + length > mr->map[m]->buf[n].size) {
-		rxe_dbg_mr(mr, "Crosses page boundary\n");
-		addr = NULL;
-		goto out;
-	}
-
-	addr = (void *)(uintptr_t)mr->map[m]->buf[n].addr + offset;
+	if (offset + length > mr->map[m]->buf[n].size)
+		return NULL;
 
-out:
-	return addr;
+	return (void *)(uintptr_t)mr->map[m]->buf[n].addr + offset;
 }
 
 int rxe_flush_pmem_iova(struct rxe_mr *mr, u64 iova, int length)
@@ -538,6 +525,46 @@ int copy_data(
 	return err;
 }
 
+/* Guarantee atomicity of atomic operations at the machine level. */
+static DEFINE_SPINLOCK(atomic_ops_lock);
+
+int rxe_mr_do_atomic_op(struct rxe_mr *mr, u64 iova, int opcode,
+			u64 compare, u64 swap_add, u64 *orig_val)
+{
+	u64 *va;
+	u64 value;
+
+	if (mr->state != RXE_MR_STATE_VALID) {
+		rxe_dbg_mr(mr, "mr not in valid state");
+		return -EINVAL;
+	}
+
+	va = iova_to_vaddr(mr, iova, sizeof(u64));
+	if (!va) {
+		rxe_dbg_mr(mr, "iova out of range");
+		return -ERANGE;
+	}
+
+	if ((uintptr_t)va & 0x7) {
+		rxe_dbg_mr(mr, "iova not aligned");
+		return RXE_ERR_NOT_ALIGNED;
+	}
+
+	spin_lock_bh(&atomic_ops_lock);
+	value = *orig_val = *va;
+
+	if (opcode == IB_OPCODE_RC_COMPARE_SWAP) {
+		if (value == compare)
+			*va = swap_add;
+	} else {
+		value += swap_add;
+		*va = value;
+	}
+	spin_unlock_bh(&atomic_ops_lock);
+
+	return 0;
+}
+
 int advance_dma_data(struct rxe_dma_info *dma, unsigned int length)
 {
 	struct rxe_sge		*sge	= &dma->sge[dma->cur_sge];
diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
index c74972244f08..1e38e5da1f4c 100644
--- a/drivers/infiniband/sw/rxe/rxe_resp.c
+++ b/drivers/infiniband/sw/rxe/rxe_resp.c
@@ -725,17 +725,12 @@ static enum resp_states process_flush(struct rxe_qp *qp,
 	return RESPST_ACKNOWLEDGE;
 }
 
-/* Guarantee atomicity of atomic operations at the machine level. */
-static DEFINE_SPINLOCK(atomic_ops_lock);
-
 static enum resp_states atomic_reply(struct rxe_qp *qp,
-					 struct rxe_pkt_info *pkt)
+				     struct rxe_pkt_info *pkt)
 {
-	u64 *vaddr;
-	enum resp_states ret;
 	struct rxe_mr *mr = qp->resp.mr;
 	struct resp_res *res = qp->resp.res;
-	u64 value;
+	int err;
 
 	if (!res) {
 		res = rxe_prepare_res(qp, pkt, RXE_ATOMIC_MASK);
@@ -743,33 +738,19 @@ static enum resp_states atomic_reply(struct rxe_qp *qp,
 	}
 
 	if (!res->replay) {
-		if (mr->state != RXE_MR_STATE_VALID) {
-			ret = RESPST_ERR_RKEY_VIOLATION;
-			goto out;
-		}
-
-		vaddr = iova_to_vaddr(mr, qp->resp.va + qp->resp.offset,
-					sizeof(u64));
-
-		/* check vaddr is 8 bytes aligned. */
-		if (!vaddr || (uintptr_t)vaddr & 7) {
-			ret = RESPST_ERR_MISALIGNED_ATOMIC;
-			goto out;
-		}
-
-		spin_lock_bh(&atomic_ops_lock);
-		res->atomic.orig_val = value = *vaddr;
-
-		if (pkt->opcode == IB_OPCODE_RC_COMPARE_SWAP) {
-			if (value == atmeth_comp(pkt))
-				value = atmeth_swap_add(pkt);
-		} else {
-			value += atmeth_swap_add(pkt);
+		u64 iova = qp->resp.va + qp->resp.offset;
+
+		err = rxe_mr_do_atomic_op(mr, iova, pkt->opcode,
+					  atmeth_comp(pkt),
+					  atmeth_swap_add(pkt),
+					  &res->atomic.orig_val);
+		if (unlikely(err)) {
+			if (err == -RXE_ERR_NOT_ALIGNED)
+				return RESPST_ERR_MISALIGNED_ATOMIC;
+			else
+				return RESPST_ERR_RKEY_VIOLATION;
 		}
 
-		*vaddr = value;
-		spin_unlock_bh(&atomic_ops_lock);
-
 		qp->resp.msn++;
 
 		/* next expected psn, read handles this separately */
@@ -780,9 +761,7 @@ static enum resp_states atomic_reply(struct rxe_qp *qp,
 		qp->resp.status = IB_WC_SUCCESS;
 	}
 
-	ret = RESPST_ACKNOWLEDGE;
-out:
-	return ret;
+	return RESPST_ACKNOWLEDGE;
 }
 
 #ifdef CONFIG_64BIT
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
index 19ddfa890480..691200d99d6b 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.h
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
@@ -273,6 +273,11 @@ enum rxe_mr_state {
 	RXE_MR_STATE_VALID,
 };
 
+/* some rxe specific error conditions */
+enum rxe_mr_error {
+	RXE_ERR_NOT_ALIGNED = 0x1001,	/* misaligned address */
+};
+
 enum rxe_mr_copy_dir {
 	RXE_TO_MR_OBJ,
 	RXE_FROM_MR_OBJ,
-- 
2.37.2



* [PATCH for-next v5 4/6] RDMA-rxe: Isolate mr code from atomic_write_reply()
  2023-01-16 23:55 [PATCH for-next v5 0/6] RDMA/rxe: Replace mr page map with an xarray Bob Pearson
                   ` (2 preceding siblings ...)
  2023-01-16 23:56 ` [PATCH for-next v5 3/6] RDMA-rxe: Isolate mr code from atomic_reply() Bob Pearson
@ 2023-01-16 23:56 ` Bob Pearson
  2023-01-17 15:11   ` Jason Gunthorpe
  2023-01-16 23:56 ` [PATCH for-next v5 5/6] RDMA/rxe: Cleanup page variables in rxe_mr.c Bob Pearson
  2023-01-16 23:56 ` [PATCH for-next v5 6/6] RDMA/rxe: Replace rxe_map and rxe_phys_buf by xarray Bob Pearson
  5 siblings, 1 reply; 16+ messages in thread
From: Bob Pearson @ 2023-01-16 23:56 UTC (permalink / raw)
  To: jgg, zyjzyj2000, leonro, yangx.jy, linux-rdma; +Cc: Bob Pearson

Isolate mr specific code from atomic_write_reply() in rxe_resp.c into
a subroutine rxe_mr_do_atomic_write() in rxe_mr.c.
Check length for atomic write operation.
Make iova_to_vaddr() static.

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_loc.h  |  1 +
 drivers/infiniband/sw/rxe/rxe_mr.c   | 34 ++++++++++++++
 drivers/infiniband/sw/rxe/rxe_resp.c | 69 ++++++++++++----------------
 3 files changed, 64 insertions(+), 40 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
index bcb1bbcf50df..b1dda0cf891b 100644
--- a/drivers/infiniband/sw/rxe/rxe_loc.h
+++ b/drivers/infiniband/sw/rxe/rxe_loc.h
@@ -74,6 +74,7 @@ int rxe_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg,
 void *iova_to_vaddr(struct rxe_mr *mr, u64 iova, int length);
 int rxe_mr_do_atomic_op(struct rxe_mr *mr, u64 iova, int opcode,
 			u64 compare, u64 swap_add, u64 *orig_val);
+int rxe_mr_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value);
 struct rxe_mr *lookup_mr(struct rxe_pd *pd, int access, u32 key,
 			 enum rxe_mr_lookup_type type);
 int mr_check_range(struct rxe_mr *mr, u64 iova, size_t length);
diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index 15a8d44daa35..10484f671977 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -565,6 +565,40 @@ int rxe_mr_do_atomic_op(struct rxe_mr *mr, u64 iova, int opcode,
 	return 0;
 }
 
+/* only implemented for 64 bit architectures */
+int rxe_mr_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value)
+{
+#if defined CONFIG_64BIT
+	u64 *va;
+
+	/* See IBA oA19-28 */
+	if (unlikely(mr->state != RXE_MR_STATE_VALID)) {
+		rxe_dbg_mr(mr, "mr not in valid state");
+		return -EINVAL;
+	}
+
+	va = iova_to_vaddr(mr, iova, sizeof(value));
+	if (unlikely(!va)) {
+		rxe_dbg_mr(mr, "iova out of range");
+		return -ERANGE;
+	}
+
+	/* See IBA A19.4.2 */
+	if (unlikely((uintptr_t)va & 0x7 || iova & 0x7)) {
+		rxe_dbg_mr(mr, "misaligned address");
+		return -RXE_ERR_NOT_ALIGNED;
+	}
+
+	/* Do atomic write after all prior operations have completed */
+	smp_store_release(va, value);
+
+	return 0;
+#else
+	WARN_ON(1);
+	return -EINVAL;
+#endif
+}
+
 int advance_dma_data(struct rxe_dma_info *dma, unsigned int length)
 {
 	struct rxe_sge		*sge	= &dma->sge[dma->cur_sge];
diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
index 1e38e5da1f4c..49298ff88d25 100644
--- a/drivers/infiniband/sw/rxe/rxe_resp.c
+++ b/drivers/infiniband/sw/rxe/rxe_resp.c
@@ -764,30 +764,40 @@ static enum resp_states atomic_reply(struct rxe_qp *qp,
 	return RESPST_ACKNOWLEDGE;
 }
 
-#ifdef CONFIG_64BIT
-static enum resp_states do_atomic_write(struct rxe_qp *qp,
-					struct rxe_pkt_info *pkt)
+static enum resp_states atomic_write_reply(struct rxe_qp *qp,
+					   struct rxe_pkt_info *pkt)
 {
-	struct rxe_mr *mr = qp->resp.mr;
-	int payload = payload_size(pkt);
-	u64 src, *dst;
-
-	if (mr->state != RXE_MR_STATE_VALID)
-		return RESPST_ERR_RKEY_VIOLATION;
+	struct resp_res *res = qp->resp.res;
+	struct rxe_mr *mr;
+	u64 value;
+	u64 iova;
+	int err;
 
-	memcpy(&src, payload_addr(pkt), payload);
+	if (!res) {
+		res = rxe_prepare_res(qp, pkt, RXE_ATOMIC_WRITE_MASK);
+		qp->resp.res = res;
+	}
 
-	dst = iova_to_vaddr(mr, qp->resp.va + qp->resp.offset, payload);
-	/* check vaddr is 8 bytes aligned. */
-	if (!dst || (uintptr_t)dst & 7)
-		return RESPST_ERR_MISALIGNED_ATOMIC;
+	if (res->replay)
+		return RESPST_ACKNOWLEDGE;
 
-	/* Do atomic write after all prior operations have completed */
-	smp_store_release(dst, src);
+	mr = qp->resp.mr;
+	value = *(u64 *)payload_addr(pkt);
+	iova = qp->resp.va + qp->resp.offset;
 
-	/* decrease resp.resid to zero */
-	qp->resp.resid -= sizeof(payload);
+#if defined CONFIG_64BIT
+	err = rxe_mr_do_atomic_write(mr, iova, value);
+	if (unlikely(err)) {
+		if (err == -RXE_ERR_NOT_ALIGNED)
+			return RESPST_ERR_MISALIGNED_ATOMIC;
+		else
+			return RESPST_ERR_RKEY_VIOLATION;
+	}
+#else
+	return RESPST_ERR_UNSUPPORTED_OPCODE;
+#endif
 
+	qp->resp.resid = 0;
 	qp->resp.msn++;
 
 	/* next expected psn, read handles this separately */
@@ -796,29 +806,8 @@ static enum resp_states do_atomic_write(struct rxe_qp *qp,
 
 	qp->resp.opcode = pkt->opcode;
 	qp->resp.status = IB_WC_SUCCESS;
-	return RESPST_ACKNOWLEDGE;
-}
-#else
-static enum resp_states do_atomic_write(struct rxe_qp *qp,
-					struct rxe_pkt_info *pkt)
-{
-	return RESPST_ERR_UNSUPPORTED_OPCODE;
-}
-#endif /* CONFIG_64BIT */
 
-static enum resp_states atomic_write_reply(struct rxe_qp *qp,
-					   struct rxe_pkt_info *pkt)
-{
-	struct resp_res *res = qp->resp.res;
-
-	if (!res) {
-		res = rxe_prepare_res(qp, pkt, RXE_ATOMIC_WRITE_MASK);
-		qp->resp.res = res;
-	}
-
-	if (res->replay)
-		return RESPST_ACKNOWLEDGE;
-	return do_atomic_write(qp, pkt);
+	return RESPST_ACKNOWLEDGE;
 }
 
 static struct sk_buff *prepare_ack_packet(struct rxe_qp *qp,
-- 
2.37.2



* [PATCH for-next v5 5/6] RDMA/rxe: Cleanup page variables in rxe_mr.c
  2023-01-16 23:55 [PATCH for-next v5 0/6] RDMA/rxe: Replace mr page map with an xarray Bob Pearson
                   ` (3 preceding siblings ...)
  2023-01-16 23:56 ` [PATCH for-next v5 4/6] RDMA-rxe: Isolate mr code from atomic_write_reply() Bob Pearson
@ 2023-01-16 23:56 ` Bob Pearson
  2023-01-16 23:56 ` [PATCH for-next v5 6/6] RDMA/rxe: Replace rxe_map and rxe_phys_buf by xarray Bob Pearson
  5 siblings, 0 replies; 16+ messages in thread
From: Bob Pearson @ 2023-01-16 23:56 UTC (permalink / raw)
  To: jgg, zyjzyj2000, leonro, yangx.jy, linux-rdma; +Cc: Bob Pearson

Clean up the usage of mr->page_shift and mr->page_mask and introduce
an extractor for mr->ibmr.page_size. Normal usage in the kernel has
page_mask masking out the offset in the page rather than masking out
the page number. The rxe driver had reversed that, which was confusing.
Implicitly there can be a per-mr page_size, which was not uniformly
supported.
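
As an illustration, assuming a 4 KiB page size for the example values
(the mask expressions below are taken from the patch, the numbers are
not):

	/* old rxe convention: page_mask kept the in-page offset bits */
	mr->page_mask = ibmr->page_size - 1;		/* 0x0fff */
	offset = iova & mr->page_mask;

	/* new convention, matching PAGE_MASK: page_mask clears the offset */
	mr->page_mask = ~((u64)page_size - 1);		/* 0xffff...f000 */
	page_addr = iova & mr->page_mask;
	offset = iova & (page_size - 1);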

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_mr.c    | 31 ++++++++++++---------------
 drivers/infiniband/sw/rxe/rxe_verbs.h | 11 +++++++---
 2 files changed, 22 insertions(+), 20 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index 10484f671977..fdf76df4cf3e 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -62,6 +62,9 @@ static void rxe_mr_init(int access, struct rxe_mr *mr)
 	mr->lkey = mr->ibmr.lkey = lkey;
 	mr->rkey = mr->ibmr.rkey = rkey;
 
+	mr->ibmr.page_size = PAGE_SIZE;
+	mr->page_mask = PAGE_MASK;
+	mr->page_shift = PAGE_SHIFT;
 	mr->state = RXE_MR_STATE_INVALID;
 }
 
@@ -151,9 +154,6 @@ int rxe_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length, u64 iova,
 		goto err_release_umem;
 	}
 
-	mr->page_shift = PAGE_SHIFT;
-	mr->page_mask = PAGE_SIZE - 1;
-
 	num_buf			= 0;
 	map = mr->map;
 	if (length > 0) {
@@ -182,7 +182,7 @@ int rxe_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length, u64 iova,
 				goto err_release_umem;
 			}
 			buf->addr = (uintptr_t)vaddr;
-			buf->size = PAGE_SIZE;
+			buf->size = mr_page_size(mr);
 			num_buf++;
 			buf++;
 
@@ -191,10 +191,9 @@ int rxe_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length, u64 iova,
 
 	mr->umem = umem;
 	mr->access = access;
-	mr->offset = ib_umem_offset(umem);
+	mr->page_offset = ib_umem_offset(umem);
 	mr->state = RXE_MR_STATE_VALID;
 	mr->ibmr.type = IB_MR_TYPE_USER;
-	mr->ibmr.page_size = PAGE_SIZE;
 
 	return 0;
 
@@ -248,29 +247,27 @@ int rxe_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg,
 		  int sg_nents, unsigned int *sg_offset)
 {
 	struct rxe_mr *mr = to_rmr(ibmr);
-	int n;
-
-	mr->nbuf = 0;
+	unsigned int page_size = mr_page_size(mr);
 
-	n = ib_sg_to_pages(ibmr, sg, sg_nents, sg_offset, rxe_set_page);
+	mr->page_shift = ilog2(page_size);
+	mr->page_mask = ~((u64)page_size - 1);
+	mr->page_offset = ibmr->iova & (page_size - 1);
 
-	mr->page_shift = ilog2(ibmr->page_size);
-	mr->page_mask = ibmr->page_size - 1;
-	mr->offset = ibmr->iova & mr->page_mask;
+	mr->nbuf = 0;
 
-	return n;
+	return ib_sg_to_pages(ibmr, sg, sg_nents, sg_offset, rxe_set_page);
 }
 
 static void lookup_iova(struct rxe_mr *mr, u64 iova, int *m_out, int *n_out,
 			size_t *offset_out)
 {
-	size_t offset = iova - mr->ibmr.iova + mr->offset;
+	size_t offset = iova - mr->ibmr.iova + mr->page_offset;
 	int			map_index;
 	int			buf_index;
 	u64			length;
 
 	if (likely(mr->page_shift)) {
-		*offset_out = offset & mr->page_mask;
+		*offset_out = offset & (mr_page_size(mr) - 1);
 		offset >>= mr->page_shift;
 		*n_out = offset & mr->map_mask;
 		*m_out = offset >> mr->map_shift;
@@ -329,7 +326,7 @@ int rxe_flush_pmem_iova(struct rxe_mr *mr, u64 iova, int length)
 	if (mr->ibmr.type == IB_MR_TYPE_DMA)
 		return -EFAULT;
 
-	offset = (iova - mr->ibmr.iova + mr->offset) & mr->page_mask;
+	offset = (iova - mr->ibmr.iova + mr->page_offset) & mr->page_mask;
 	while (length > 0) {
 		u8 *va;
 		int bytes;
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
index 691200d99d6b..5c3d1500ca68 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.h
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
@@ -315,11 +315,11 @@ struct rxe_mr {
 	u32			lkey;
 	u32			rkey;
 	enum rxe_mr_state	state;
-	u32			offset;
 	int			access;
 
-	int			page_shift;
-	int			page_mask;
+	unsigned int		page_offset;
+	unsigned int		page_shift;
+	u64			page_mask;
 	int			map_shift;
 	int			map_mask;
 
@@ -334,6 +334,11 @@ struct rxe_mr {
 	struct rxe_map		**map;
 };
 
+static inline unsigned int mr_page_size(struct rxe_mr *mr)
+{
+	return mr ? mr->ibmr.page_size : PAGE_SIZE;
+}
+
 enum rxe_mw_state {
 	RXE_MW_STATE_INVALID	= RXE_MR_STATE_INVALID,
 	RXE_MW_STATE_FREE	= RXE_MR_STATE_FREE,
-- 
2.37.2



* [PATCH for-next v5 6/6] RDMA/rxe: Replace rxe_map and rxe_phys_buf by xarray
  2023-01-16 23:55 [PATCH for-next v5 0/6] RDMA/rxe: Replace mr page map with an xarray Bob Pearson
                   ` (4 preceding siblings ...)
  2023-01-16 23:56 ` [PATCH for-next v5 5/6] RDMA/rxe: Cleanup page variables in rxe_mr.c Bob Pearson
@ 2023-01-16 23:56 ` Bob Pearson
  5 siblings, 0 replies; 16+ messages in thread
From: Bob Pearson @ 2023-01-16 23:56 UTC (permalink / raw)
  To: jgg, zyjzyj2000, leonro, yangx.jy, linux-rdma; +Cc: Bob Pearson

Replace struct rxe_phys_buf and struct rxe_map by struct xarray
in rxe_verbs.h. This allows using rcu locking on reads of
the memory maps stored in each mr.

This is based off of a sketch of a patch from Jason Gunthorpe in the
link below. Some changes were needed to make this work. It applies
cleanly to the current for-next and passes the pyverbs, perftest
and the same blktests test cases which run today.
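
The read side therefore needs no mr-wide lock; sketched from the new
rxe_mr_copy_xarray() below, a per-page lookup is just:

	/* xa_load() takes rcu_read_lock() internally, so a plain load is
	 * safe against concurrent updates done under the xa spinlock
	 */
	page = xa_load(&mr->page_list, index);
	if (!page)
		return -EFAULT;
	va = kmap_local_page(page);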

Link: https://lore.kernel.org/linux-rdma/Y3gvZr6%2FNCii9Avy@nvidia.com/
Co-developed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
v5:
  link: https://lore.kernel.org/linux-rdma/c03afa55-c2f2-5949-289a-4411bdba87f9@fujitsu.com/
  Responded to a note from lizhijian@fujitsu.com and restored calls to
  is_pmem_page() which were accidentally dropped in earlier versions.

 drivers/infiniband/sw/rxe/rxe_loc.h   |   8 +-
 drivers/infiniband/sw/rxe/rxe_mr.c    | 559 +++++++++++++-------------
 drivers/infiniband/sw/rxe/rxe_verbs.h |  21 +-
 3 files changed, 285 insertions(+), 303 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
index b1dda0cf891b..c1f78ed0f991 100644
--- a/drivers/infiniband/sw/rxe/rxe_loc.h
+++ b/drivers/infiniband/sw/rxe/rxe_loc.h
@@ -64,14 +64,14 @@ void rxe_mr_init_dma(int access, struct rxe_mr *mr);
 int rxe_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length, u64 iova,
 		     int access, struct rxe_mr *mr);
 int rxe_mr_init_fast(int max_pages, struct rxe_mr *mr);
-int rxe_flush_pmem_iova(struct rxe_mr *mr, u64 iova, int length);
-int rxe_mr_copy(struct rxe_mr *mr, u64 iova, void *addr, int length,
-		enum rxe_mr_copy_dir dir);
+int rxe_flush_pmem_iova(struct rxe_mr *mr, u64 iova,
+			unsigned int length);
+int rxe_mr_copy(struct rxe_mr *mr, u64 iova, void *addr,
+		unsigned int length, enum rxe_mr_copy_dir dir);
 int copy_data(struct rxe_pd *pd, int access, struct rxe_dma_info *dma,
 	      void *addr, int length, enum rxe_mr_copy_dir dir);
 int rxe_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg,
 		  int sg_nents, unsigned int *sg_offset);
-void *iova_to_vaddr(struct rxe_mr *mr, u64 iova, int length);
 int rxe_mr_do_atomic_op(struct rxe_mr *mr, u64 iova, int opcode,
 			u64 compare, u64 swap_add, u64 *orig_val);
 int rxe_mr_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value);
diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index fdf76df4cf3e..a495c186dc3c 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -62,60 +62,31 @@ static void rxe_mr_init(int access, struct rxe_mr *mr)
 	mr->lkey = mr->ibmr.lkey = lkey;
 	mr->rkey = mr->ibmr.rkey = rkey;
 
+	mr->access = access;
 	mr->ibmr.page_size = PAGE_SIZE;
 	mr->page_mask = PAGE_MASK;
 	mr->page_shift = PAGE_SHIFT;
 	mr->state = RXE_MR_STATE_INVALID;
 }
 
-static int rxe_mr_alloc(struct rxe_mr *mr, int num_buf)
-{
-	int i;
-	int num_map;
-	struct rxe_map **map = mr->map;
-
-	num_map = (num_buf + RXE_BUF_PER_MAP - 1) / RXE_BUF_PER_MAP;
-
-	mr->map = kmalloc_array(num_map, sizeof(*map), GFP_KERNEL);
-	if (!mr->map)
-		goto err1;
-
-	for (i = 0; i < num_map; i++) {
-		mr->map[i] = kmalloc(sizeof(**map), GFP_KERNEL);
-		if (!mr->map[i])
-			goto err2;
-	}
-
-	BUILD_BUG_ON(!is_power_of_2(RXE_BUF_PER_MAP));
-
-	mr->map_shift = ilog2(RXE_BUF_PER_MAP);
-	mr->map_mask = RXE_BUF_PER_MAP - 1;
-
-	mr->num_buf = num_buf;
-	mr->num_map = num_map;
-	mr->max_buf = num_map * RXE_BUF_PER_MAP;
-
-	return 0;
-
-err2:
-	for (i--; i >= 0; i--)
-		kfree(mr->map[i]);
-
-	kfree(mr->map);
-	mr->map = NULL;
-err1:
-	return -ENOMEM;
-}
-
 void rxe_mr_init_dma(int access, struct rxe_mr *mr)
 {
 	rxe_mr_init(access, mr);
 
-	mr->access = access;
 	mr->state = RXE_MR_STATE_VALID;
 	mr->ibmr.type = IB_MR_TYPE_DMA;
 }
 
+static unsigned long rxe_mr_iova_to_index(struct rxe_mr *mr, u64 iova)
+{
+	return (iova >> mr->page_shift) - (mr->ibmr.iova >> mr->page_shift);
+}
+
+static unsigned long rxe_mr_iova_to_page_offset(struct rxe_mr *mr, u64 iova)
+{
+	return iova & (mr_page_size(mr) - 1);
+}
+
 static bool is_pmem_page(struct page *pg)
 {
 	unsigned long paddr = page_to_phys(pg);
@@ -125,82 +96,97 @@ static bool is_pmem_page(struct page *pg)
 				 IORES_DESC_PERSISTENT_MEMORY);
 }
 
+static int rxe_mr_fill_pages_from_sgt(struct rxe_mr *mr, struct sg_table *sgt)
+{
+	XA_STATE(xas, &mr->page_list, 0);
+	struct sg_page_iter sg_iter;
+	struct page *page;
+	bool persistent = !!(mr->access & IB_ACCESS_FLUSH_PERSISTENT);
+
+	__sg_page_iter_start(&sg_iter, sgt->sgl, sgt->orig_nents, 0);
+	if (!__sg_page_iter_next(&sg_iter))
+		return 0;
+
+	do {
+		xas_lock(&xas);
+		while (true) {
+			page = sg_page_iter_page(&sg_iter);
+
+			if (persistent && !is_pmem_page(page)) {
+				rxe_dbg_mr(mr, "Unable to register persistent access to non-pmem device\n");
+				return -EINVAL;
+			}
+
+			xas_store(&xas, page);
+			if (xas_error(&xas))
+				break;
+			xas_next(&xas);
+			if (!__sg_page_iter_next(&sg_iter))
+				break;
+		}
+		xas_unlock(&xas);
+	} while (xas_nomem(&xas, GFP_KERNEL));
+
+	return xas_error(&xas);
+}
+
 int rxe_mr_init_user(struct rxe_dev *rxe, u64 start, u64 length, u64 iova,
 		     int access, struct rxe_mr *mr)
 {
-	struct rxe_map		**map;
-	struct rxe_phys_buf	*buf = NULL;
-	struct ib_umem		*umem;
-	struct sg_page_iter	sg_iter;
-	int			num_buf;
-	void			*vaddr;
+	struct ib_umem *umem;
 	int err;
 
+	rxe_mr_init(access, mr);
+
+	xa_init(&mr->page_list);
+
 	umem = ib_umem_get(&rxe->ib_dev, start, length, access);
 	if (IS_ERR(umem)) {
 		rxe_dbg_mr(mr, "Unable to pin memory region err = %d\n",
 			(int)PTR_ERR(umem));
-		err = PTR_ERR(umem);
-		goto err_out;
+		return PTR_ERR(umem);
 	}
 
-	num_buf = ib_umem_num_pages(umem);
-
-	rxe_mr_init(access, mr);
-
-	err = rxe_mr_alloc(mr, num_buf);
+	err = rxe_mr_fill_pages_from_sgt(mr, &umem->sgt_append.sgt);
 	if (err) {
-		rxe_dbg_mr(mr, "Unable to allocate memory for map\n");
-		goto err_release_umem;
+		ib_umem_release(umem);
+		return err;
 	}
 
-	num_buf			= 0;
-	map = mr->map;
-	if (length > 0) {
-		bool persistent_access = access & IB_ACCESS_FLUSH_PERSISTENT;
-
-		buf = map[0]->buf;
-		for_each_sgtable_page (&umem->sgt_append.sgt, &sg_iter, 0) {
-			struct page *pg = sg_page_iter_page(&sg_iter);
+	mr->umem = umem;
+	mr->ibmr.type = IB_MR_TYPE_USER;
+	mr->state = RXE_MR_STATE_VALID;
 
-			if (persistent_access && !is_pmem_page(pg)) {
-				rxe_dbg_mr(mr, "Unable to register persistent access to non-pmem device\n");
-				err = -EINVAL;
-				goto err_release_umem;
-			}
+	return 0;
+}
 
-			if (num_buf >= RXE_BUF_PER_MAP) {
-				map++;
-				buf = map[0]->buf;
-				num_buf = 0;
-			}
+static int rxe_mr_alloc(struct rxe_mr *mr, int num_buf)
+{
+	XA_STATE(xas, &mr->page_list, 0);
+	int i = 0;
+	int err;
 
-			vaddr = page_address(pg);
-			if (!vaddr) {
-				rxe_dbg_mr(mr, "Unable to get virtual address\n");
-				err = -ENOMEM;
-				goto err_release_umem;
-			}
-			buf->addr = (uintptr_t)vaddr;
-			buf->size = mr_page_size(mr);
-			num_buf++;
-			buf++;
+	xa_init(&mr->page_list);
 
+	do {
+		xas_lock(&xas);
+		while (i != num_buf) {
+			xas_store(&xas, XA_ZERO_ENTRY);
+			if (xas_error(&xas))
+				break;
+			xas_next(&xas);
+			i++;
 		}
-	}
+		xas_unlock(&xas);
+	} while (xas_nomem(&xas, GFP_KERNEL));
 
-	mr->umem = umem;
-	mr->access = access;
-	mr->page_offset = ib_umem_offset(umem);
-	mr->state = RXE_MR_STATE_VALID;
-	mr->ibmr.type = IB_MR_TYPE_USER;
+	err = xas_error(&xas);
+	if (err)
+		return err;
 
-	return 0;
+	mr->num_buf = num_buf;
 
-err_release_umem:
-	ib_umem_release(umem);
-err_out:
-	return err;
+	return 0;
 }
 
 int rxe_mr_init_fast(int max_pages, struct rxe_mr *mr)
@@ -214,7 +200,6 @@ int rxe_mr_init_fast(int max_pages, struct rxe_mr *mr)
 	if (err)
 		goto err1;
 
-	mr->max_buf = max_pages;
 	mr->state = RXE_MR_STATE_FREE;
 	mr->ibmr.type = IB_MR_TYPE_MEM_REG;
 
@@ -224,206 +209,144 @@ int rxe_mr_init_fast(int max_pages, struct rxe_mr *mr)
 	return err;
 }
 
-static int rxe_set_page(struct ib_mr *ibmr, u64 addr)
+static int rxe_set_page(struct ib_mr *ibmr, u64 iova)
 {
 	struct rxe_mr *mr = to_rmr(ibmr);
-	struct rxe_map *map;
-	struct rxe_phys_buf *buf;
+	struct page *page = virt_to_page(iova & mr->page_mask);
+	XA_STATE(xas, &mr->page_list, mr->nbuf);
+	bool persistent = !!(mr->access & IB_ACCESS_FLUSH_PERSISTENT);
+	int err;
+
+	if (persistent && !is_pmem_page(page)) {
+		rxe_dbg_mr(mr, "Unable to register persistent access to non-pmem device\n");
+		return -EINVAL;
+	}
 
 	if (unlikely(mr->nbuf == mr->num_buf))
 		return -ENOMEM;
 
-	map = mr->map[mr->nbuf / RXE_BUF_PER_MAP];
-	buf = &map->buf[mr->nbuf % RXE_BUF_PER_MAP];
+	do {
+		xas_lock(&xas);
+		xas_store(&xas, page);
+		xas_unlock(&xas);
+	} while (xas_nomem(&xas, GFP_KERNEL));
+
+	err = xas_error(&xas);
+	if (err)
+		return err;
 
-	buf->addr = addr;
-	buf->size = ibmr->page_size;
 	mr->nbuf++;
-
 	return 0;
 }
 
-int rxe_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg,
+int rxe_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sgl,
 		  int sg_nents, unsigned int *sg_offset)
 {
 	struct rxe_mr *mr = to_rmr(ibmr);
 	unsigned int page_size = mr_page_size(mr);
 
+	mr->nbuf = 0;
 	mr->page_shift = ilog2(page_size);
 	mr->page_mask = ~((u64)page_size - 1);
-	mr->page_offset = ibmr->iova & (page_size - 1);
-
-	mr->nbuf = 0;
+	mr->page_offset = mr->ibmr.iova & (page_size - 1);
 
-	return ib_sg_to_pages(ibmr, sg, sg_nents, sg_offset, rxe_set_page);
+	return ib_sg_to_pages(ibmr, sgl, sg_nents, sg_offset, rxe_set_page);
 }
 
-static void lookup_iova(struct rxe_mr *mr, u64 iova, int *m_out, int *n_out,
-			size_t *offset_out)
+/*
+ * TODO: Attempting to modify the mr page map between the time
+ * a packet is received and the map is referenced as here
+ * in xa_load(&mr->page_list) will cause problems. It is OK to
+ * deregister the mr since the mr reference counts will preserve
+ * it until memory accesses are complete. Currently reregister mr
+ * operations are not supported by the rxe driver but could be
+ * in the future. Invalidate followed by fast_reg mr will change
+ * the map and then the rkey so delayed packets arriving in the
+ * middle could use the wrong map entries. This isn't new but was
+ * already the case in the earlier implementation. This won't be
+ * a problem for well behaved programs which wait until all the
+ * outstanding packets for the first FMR before remapping to the
+ * second.
+ */
+static int rxe_mr_copy_xarray(struct rxe_mr *mr, u64 iova, void *addr,
+			      unsigned int length, enum rxe_mr_copy_dir dir)
 {
-	size_t offset = iova - mr->ibmr.iova + mr->page_offset;
-	int			map_index;
-	int			buf_index;
-	u64			length;
-
-	if (likely(mr->page_shift)) {
-		*offset_out = offset & (mr_page_size(mr) - 1);
-		offset >>= mr->page_shift;
-		*n_out = offset & mr->map_mask;
-		*m_out = offset >> mr->map_shift;
-	} else {
-		map_index = 0;
-		buf_index = 0;
-
-		length = mr->map[map_index]->buf[buf_index].size;
+	unsigned int page_offset = rxe_mr_iova_to_page_offset(mr, iova);
+	unsigned long index = rxe_mr_iova_to_index(mr, iova);
+	unsigned int bytes;
+	struct page *page;
+	void *va;
 
-		while (offset >= length) {
-			offset -= length;
-			buf_index++;
-
-			if (buf_index == RXE_BUF_PER_MAP) {
-				map_index++;
-				buf_index = 0;
-			}
-			length = mr->map[map_index]->buf[buf_index].size;
-		}
+	while (length) {
+		page = xa_load(&mr->page_list, index);
+		if (!page)
+			return -EFAULT;
 
-		*m_out = map_index;
-		*n_out = buf_index;
-		*offset_out = offset;
+		bytes = min_t(unsigned int, length,
+				mr_page_size(mr) - page_offset);
+		va = kmap_local_page(page);
+		if (dir == RXE_FROM_MR_OBJ)
+			memcpy(addr, va + page_offset, bytes);
+		else
+			memcpy(va + page_offset, addr, bytes);
+		kunmap_local(va);
+
+		page_offset = 0;
+		addr += bytes;
+		length -= bytes;
+		index++;
 	}
-}
-
-void *iova_to_vaddr(struct rxe_mr *mr, u64 iova, int length)
-{
-	size_t offset;
-	int m, n;
-
-	if (mr->state != RXE_MR_STATE_VALID)
-		return NULL;
-
-	if (mr->ibmr.type == IB_MR_TYPE_DMA)
-		return (void *)(uintptr_t)iova;
-
-	if (mr_check_range(mr, iova, length))
-		return NULL;
-
-	lookup_iova(mr, iova, &m, &n, &offset);
-
-	if (offset + length > mr->map[m]->buf[n].size)
-		return NULL;
 
-	return (void *)(uintptr_t)mr->map[m]->buf[n].addr + offset;
+	return 0;
 }
 
-int rxe_flush_pmem_iova(struct rxe_mr *mr, u64 iova, int length)
+static void rxe_mr_copy_dma(struct rxe_mr *mr, u64 iova, void *addr,
+			    unsigned int length, enum rxe_mr_copy_dir dir)
 {
-	size_t offset;
-
-	if (length == 0)
-		return 0;
-
-	if (mr->ibmr.type == IB_MR_TYPE_DMA)
-		return -EFAULT;
-
-	offset = (iova - mr->ibmr.iova + mr->page_offset) & mr->page_mask;
-	while (length > 0) {
-		u8 *va;
-		int bytes;
-
-		bytes = mr->ibmr.page_size - offset;
-		if (bytes > length)
-			bytes = length;
-
-		va = iova_to_vaddr(mr, iova, length);
-		if (!va)
-			return -EFAULT;
+	unsigned int page_offset = iova & (PAGE_SIZE - 1);
+	unsigned int bytes;
+	struct page *page;
+	u8 *va;
 
-		arch_wb_cache_pmem(va, bytes);
-
-		length -= bytes;
+	while (length) {
+		page = virt_to_page(iova & mr->page_mask);
+		bytes = min_t(unsigned int, length,
+				PAGE_SIZE - page_offset);
+		va = kmap_local_page(page);
+
+		if (dir == RXE_TO_MR_OBJ)
+			memcpy(va + page_offset, addr, bytes);
+		else
+			memcpy(addr, va + page_offset, bytes);
+
+		kunmap_local(va);
+		page_offset = 0;
 		iova += bytes;
-		offset = 0;
+		addr += bytes;
+		length -= bytes;
 	}
-
-	return 0;
 }
 
-/* copy data from a range (vaddr, vaddr+length-1) to or from
- * a mr object starting at iova.
- */
-int rxe_mr_copy(struct rxe_mr *mr, u64 iova, void *addr, int length,
-		enum rxe_mr_copy_dir dir)
+int rxe_mr_copy(struct rxe_mr *mr, u64 iova, void *addr,
+		unsigned int length, enum rxe_mr_copy_dir dir)
 {
-	int			err;
-	int			bytes;
-	u8			*va;
-	struct rxe_map		**map;
-	struct rxe_phys_buf	*buf;
-	int			m;
-	int			i;
-	size_t			offset;
+	int err;
 
 	if (length == 0)
 		return 0;
 
 	if (mr->ibmr.type == IB_MR_TYPE_DMA) {
-		u8 *src, *dest;
-
-		src = (dir == RXE_TO_MR_OBJ) ? addr : ((void *)(uintptr_t)iova);
-
-		dest = (dir == RXE_TO_MR_OBJ) ? ((void *)(uintptr_t)iova) : addr;
-
-		memcpy(dest, src, length);
-
+		rxe_mr_copy_dma(mr, iova, addr, length, dir);
 		return 0;
 	}
 
-	WARN_ON_ONCE(!mr->map);
-
 	err = mr_check_range(mr, iova, length);
-	if (err) {
-		err = -EFAULT;
-		goto err1;
-	}
-
-	lookup_iova(mr, iova, &m, &i, &offset);
-
-	map = mr->map + m;
-	buf	= map[0]->buf + i;
-
-	while (length > 0) {
-		u8 *src, *dest;
-
-		va	= (u8 *)(uintptr_t)buf->addr + offset;
-		src = (dir == RXE_TO_MR_OBJ) ? addr : va;
-		dest = (dir == RXE_TO_MR_OBJ) ? va : addr;
-
-		bytes	= buf->size - offset;
-
-		if (bytes > length)
-			bytes = length;
-
-		memcpy(dest, src, bytes);
-
-		length	-= bytes;
-		addr	+= bytes;
-
-		offset	= 0;
-		buf++;
-		i++;
-
-		if (i == RXE_BUF_PER_MAP) {
-			i = 0;
-			map++;
-			buf = map[0]->buf;
-		}
+	if (unlikely(err)) {
+		rxe_dbg_mr(mr, "iova out of range");
+		return err;
 	}
 
-	return 0;
-
-err1:
-	return err;
+	return rxe_mr_copy_xarray(mr, iova, addr, length, dir);
 }
 
 /* copy data in or out of a wqe, i.e. sg list
@@ -495,7 +418,6 @@ int copy_data(
 
 		if (bytes > 0) {
 			iova = sge->addr + offset;
-
 			err = rxe_mr_copy(mr, iova, addr, bytes, dir);
 			if (err)
 				goto err2;
@@ -522,43 +444,104 @@ int copy_data(
 	return err;
 }
 
+int rxe_flush_pmem_iova(struct rxe_mr *mr, u64 iova, unsigned int length)
+{
+	unsigned int page_offset;
+	unsigned long index;
+	struct page *page;
+	unsigned int bytes;
+	int err;
+	u8 *va;
+
+	if (length == 0)
+		return 0;
+
+	if (mr->ibmr.type == IB_MR_TYPE_DMA)
+		return -EFAULT;
+
+	err = mr_check_range(mr, iova, length);
+	if (err)
+		return err;
+
+	while (length > 0) {
+		index = rxe_mr_iova_to_index(mr, iova);
+		page = xa_load(&mr->page_list, index);
+		page_offset = rxe_mr_iova_to_page_offset(mr, iova);
+		if (!page)
+			return -EFAULT;
+		bytes = min_t(unsigned int, length,
+				mr_page_size(mr) - page_offset);
+
+		va = kmap_local_page(page);
+		if (!va)
+			return -EFAULT;
+
+		arch_wb_cache_pmem(va + page_offset, bytes);
+
+		length -= bytes;
+		iova += bytes;
+		page_offset = 0;
+	}
+
+	return 0;
+}
+
 /* Guarantee atomicity of atomic operations at the machine level. */
 static DEFINE_SPINLOCK(atomic_ops_lock);
 
 int rxe_mr_do_atomic_op(struct rxe_mr *mr, u64 iova, int opcode,
 			u64 compare, u64 swap_add, u64 *orig_val)
 {
-	u64 *va;
+	unsigned int page_offset;
+	struct page *page;
 	u64 value;
+	u64 *va;
 
-	if (mr->state != RXE_MR_STATE_VALID) {
+	if (unlikely(mr->state != RXE_MR_STATE_VALID)) {
 		rxe_dbg_mr(mr, "mr not in valid state");
 		return -EINVAL;
 	}
 
-	va = iova_to_vaddr(mr, iova, sizeof(u64));
-	if (!va) {
-		rxe_dbg_mr(mr, "iova out of range");
-		return -ERANGE;
+	if (mr->ibmr.type == IB_MR_TYPE_DMA) {
+		page_offset = iova & (PAGE_SIZE - 1);
+		page = virt_to_page(iova & PAGE_MASK);
+	} else {
+		unsigned long index;
+		int err;
+
+		err = mr_check_range(mr, iova, sizeof(value));
+		if (err) {
+			rxe_dbg_mr(mr, "iova out of range");
+			return -ERANGE;
+		}
+		page_offset = rxe_mr_iova_to_page_offset(mr, iova);
+		index = rxe_mr_iova_to_index(mr, iova);
+		page = xa_load(&mr->page_list, index);
+		if (!page)
+			return -EFAULT;
 	}
 
-	if ((uintptr_t)va & 0x7) {
+	if (unlikely(page_offset & 0x7)) {
 		rxe_dbg_mr(mr, "iova not aligned");
-		return RXE_ERR_NOT_ALIGNED;
+		return -RXE_ERR_NOT_ALIGNED;
 	}
 
+	va = kmap_local_page(page);
+
 	spin_lock_bh(&atomic_ops_lock);
-	value = *orig_val = *va;
+	value = *orig_val = va[page_offset >> 3];
 
 	if (opcode == IB_OPCODE_RC_COMPARE_SWAP) {
 		if (value == compare)
-			*va = swap_add;
+			va[page_offset >> 3] = swap_add;
 	} else {
 		value += swap_add;
-		*va = value;
+		va[page_offset >> 3] = value;
 	}
 	spin_unlock_bh(&atomic_ops_lock);
 
+	kunmap_local(va);
+
 	return 0;
 }
 
@@ -566,6 +549,8 @@ int rxe_mr_do_atomic_op(struct rxe_mr *mr, u64 iova, int opcode,
 int rxe_mr_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value)
 {
 #if defined CONFIG_64BIT
+	unsigned int page_offset;
+	struct page *page;
 	u64 *va;
 
 	/* See IBA oA19-28 */
@@ -574,20 +559,45 @@ int rxe_mr_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value)
 		return -EINVAL;
 	}
 
-	va = iova_to_vaddr(mr, iova, sizeof(value));
-	if (unlikely(!va)) {
-		rxe_dbg_mr(mr, "iova out of range");
-		return -ERANGE;
+	if (mr->ibmr.type == IB_MR_TYPE_DMA) {
+		page_offset = iova & (PAGE_SIZE - 1);
+		page = virt_to_page(iova & PAGE_MASK);
+	} else {
+		unsigned long index;
+		int err;
+
+		/* See IBA oA19-28 */
+		err = mr_check_range(mr, iova, sizeof(value));
+		if (unlikely(err)) {
+			rxe_dbg_mr(mr, "iova out of range");
+			return -ERANGE;
+		}
+		page_offset = rxe_mr_iova_to_page_offset(mr, iova);
+		index = rxe_mr_iova_to_index(mr, iova);
+		page = xa_load(&mr->page_list, index);
+		if (!page)
+			return -EFAULT;
 	}
 
 	/* See IBA A19.4.2 */
-	if (unlikely((uintptr_t)va & 0x7 || iova & 0x7)) {
+	if (unlikely(page_offset & 0x7)) {
 		rxe_dbg_mr(mr, "misaligned address");
 		return -RXE_ERR_NOT_ALIGNED;
 	}
 
+	va = kmap_local_page(page);
+
 	/* Do atomic write after all prior operations have completed */
-	smp_store_release(va, value);
+	/* TODO: This is what was chosen by the implementer but I am
+	 * concerned it isn't what they want. This only guarantees that
+	 * the write will complete before any subsequent reads but the
+	 * comment says all prior operations have completed. That would
+	 * require a full me matching acquire.
+	 * Normal usage has a matching load_acquire and store release.
+	 */
+	smp_store_release(&va[page_offset >> 3], value);
+
+	kunmap_local(va);
 
 	return 0;
 #else
@@ -629,12 +639,6 @@ int advance_dma_data(struct rxe_dma_info *dma, unsigned int length)
 	return 0;
 }
 
-/* (1) find the mr corresponding to lkey/rkey
- *     depending on lookup_type
- * (2) verify that the (qp) pd matches the mr pd
- * (3) verify that the mr can support the requested access
- * (4) verify that mr state is valid
- */
 struct rxe_mr *lookup_mr(struct rxe_pd *pd, int access, u32 key,
 			 enum rxe_mr_lookup_type type)
 {
@@ -755,15 +759,10 @@ int rxe_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata)
 void rxe_mr_cleanup(struct rxe_pool_elem *elem)
 {
 	struct rxe_mr *mr = container_of(elem, typeof(*mr), elem);
-	int i;
 
 	rxe_put(mr_pd(mr));
 	ib_umem_release(mr->umem);
 
-	if (mr->map) {
-		for (i = 0; i < mr->num_map; i++)
-			kfree(mr->map[i]);
-
-		kfree(mr->map);
-	}
+	if (mr->ibmr.type != IB_MR_TYPE_DMA)
+		xa_destroy(&mr->page_list);
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
index 5c3d1500ca68..09437077ceee 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.h
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
@@ -288,17 +288,6 @@ enum rxe_mr_lookup_type {
 	RXE_LOOKUP_REMOTE,
 };
 
-#define RXE_BUF_PER_MAP		(PAGE_SIZE / sizeof(struct rxe_phys_buf))
-
-struct rxe_phys_buf {
-	u64      addr;
-	u64      size;
-};
-
-struct rxe_map {
-	struct rxe_phys_buf	buf[RXE_BUF_PER_MAP];
-};
-
 static inline int rkey_is_mw(u32 rkey)
 {
 	u32 index = rkey >> 8;
@@ -316,22 +305,16 @@ struct rxe_mr {
 	u32			rkey;
 	enum rxe_mr_state	state;
 	int			access;
+	atomic_t		num_mw;
 
 	unsigned int		page_offset;
 	unsigned int		page_shift;
 	u64			page_mask;
-	int			map_shift;
-	int			map_mask;
 
 	u32			num_buf;
 	u32			nbuf;
 
-	u32			max_buf;
-	u32			num_map;
-
-	atomic_t		num_mw;
-
-	struct rxe_map		**map;
+	struct xarray		page_list;
 };
 
 static inline unsigned int mr_page_size(struct rxe_mr *mr)
-- 
2.37.2



* Re: [PATCH for-next v5 3/6] RDMA-rxe: Isolate mr code from atomic_reply()
  2023-01-16 23:56 ` [PATCH for-next v5 3/6] RDMA-rxe: Isolate mr code from atomic_reply() Bob Pearson
@ 2023-01-17 15:09   ` Jason Gunthorpe
  2023-01-17 16:52     ` Bob Pearson
  0 siblings, 1 reply; 16+ messages in thread
From: Jason Gunthorpe @ 2023-01-17 15:09 UTC (permalink / raw)
  To: Bob Pearson; +Cc: zyjzyj2000, leonro, yangx.jy, linux-rdma

On Mon, Jan 16, 2023 at 05:56:00PM -0600, Bob Pearson wrote:

> +		err = rxe_mr_do_atomic_op(mr, iova, pkt->opcode,
> +					  atmeth_comp(pkt),
> +					  atmeth_swap_add(pkt),
> +					  &res->atomic.orig_val);
> +		if (unlikely(err)) {
> +			if (err == -RXE_ERR_NOT_ALIGNED)
> +				return RESPST_ERR_MISALIGNED_ATOMIC;
> +			else
> +				return RESPST_ERR_RKEY_VIOLATION;

Why not just return these RESPST_ constants directly from
rxe_mr_do_atomic_op ?

Jason


* Re: [PATCH for-next v5 4/6] RDMA-rxe: Isolate mr code from atomic_write_reply()
  2023-01-16 23:56 ` [PATCH for-next v5 4/6] RDMA-rxe: Isolate mr code from atomic_write_reply() Bob Pearson
@ 2023-01-17 15:11   ` Jason Gunthorpe
  2023-01-17 16:57     ` Bob Pearson
  0 siblings, 1 reply; 16+ messages in thread
From: Jason Gunthorpe @ 2023-01-17 15:11 UTC (permalink / raw)
  To: Bob Pearson; +Cc: zyjzyj2000, leonro, yangx.jy, linux-rdma

On Mon, Jan 16, 2023 at 05:56:01PM -0600, Bob Pearson wrote:

> +/* only implemented for 64 bit architectures */
> +int rxe_mr_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value)
> +{
> +#if defined CONFIG_64BIT

#ifdef

It is a little more typical style to provide an alternate version of
the function when #ifdefing
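
For example, roughly (sketching the usual pattern rather than quoting
the patch):

#ifdef CONFIG_64BIT
int rxe_mr_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value)
{
	... /* the existing 64-bit body, unchanged */
}
#else
int rxe_mr_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value)
{
	WARN_ON(1);
	return -EINVAL;
}
#endif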

> +	u64 *va;
> +
> +	/* See IBA oA19-28 */
> +	if (unlikely(mr->state != RXE_MR_STATE_VALID)) {
> +		rxe_dbg_mr(mr, "mr not in valid state");
> +		return -EINVAL;
> +	}
> +
> +	va = iova_to_vaddr(mr, iova, sizeof(value));
> +	if (unlikely(!va)) {
> +		rxe_dbg_mr(mr, "iova out of range");
> +		return -ERANGE;
> +	}
> +
> +	/* See IBA A19.4.2 */
> +	if (unlikely((uintptr_t)va & 0x7 || iova & 0x7)) {
> +		rxe_dbg_mr(mr, "misaligned address");
> +		return -RXE_ERR_NOT_ALIGNED;
> +	}
> +
> +	/* Do atomic write after all prior operations have completed */
> +	smp_store_release(va, value);
> +
> +	return 0;
> +#else
> +	WARN_ON(1);
> +	return -EINVAL;
> +#endif
> +}
> +
>  int advance_dma_data(struct rxe_dma_info *dma, unsigned int length)
>  {
>  	struct rxe_sge		*sge	= &dma->sge[dma->cur_sge];
> diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
> index 1e38e5da1f4c..49298ff88d25 100644
> --- a/drivers/infiniband/sw/rxe/rxe_resp.c
> +++ b/drivers/infiniband/sw/rxe/rxe_resp.c
> @@ -764,30 +764,40 @@ static enum resp_states atomic_reply(struct rxe_qp *qp,
>  	return RESPST_ACKNOWLEDGE;
>  }
>  
> -#ifdef CONFIG_64BIT
> -static enum resp_states do_atomic_write(struct rxe_qp *qp,
> -					struct rxe_pkt_info *pkt)
> +static enum resp_states atomic_write_reply(struct rxe_qp *qp,
> +					   struct rxe_pkt_info *pkt)
>  {
> -	struct rxe_mr *mr = qp->resp.mr;
> -	int payload = payload_size(pkt);
> -	u64 src, *dst;
> -
> -	if (mr->state != RXE_MR_STATE_VALID)
> -		return RESPST_ERR_RKEY_VIOLATION;
> +	struct resp_res *res = qp->resp.res;
> +	struct rxe_mr *mr;
> +	u64 value;
> +	u64 iova;
> +	int err;
>  
> -	memcpy(&src, payload_addr(pkt), payload);
> +	if (!res) {
> +		res = rxe_prepare_res(qp, pkt, RXE_ATOMIC_WRITE_MASK);
> +		qp->resp.res = res;
> +	}
>  
> -	dst = iova_to_vaddr(mr, qp->resp.va + qp->resp.offset, payload);
> -	/* check vaddr is 8 bytes aligned. */
> -	if (!dst || (uintptr_t)dst & 7)
> -		return RESPST_ERR_MISALIGNED_ATOMIC;
> +	if (res->replay)
> +		return RESPST_ACKNOWLEDGE;
>  
> -	/* Do atomic write after all prior operations have completed */
> -	smp_store_release(dst, src);
> +	mr = qp->resp.mr;
> +	value = *(u64 *)payload_addr(pkt);
> +	iova = qp->resp.va + qp->resp.offset;
>  
> -	/* decrease resp.resid to zero */
> -	qp->resp.resid -= sizeof(payload);
> +#if defined CONFIG_64BIT

Shouldn't need a #ifdef here

> +	err = rxe_mr_do_atomic_write(mr, iova, value);
> +	if (unlikely(err)) {
> +		if (err == -RXE_ERR_NOT_ALIGNED)
> +			return RESPST_ERR_MISALIGNED_ATOMIC;
> +		else
> +			return RESPST_ERR_RKEY_VIOLATION;

Again why not return the RESPST directly then the stub can return
RESPST_ERR_UNSUPPORTED_OPCODE?

Jason


* Re: [PATCH for-next v5 3/6] RDMA-rxe: Isolate mr code from atomic_reply()
  2023-01-17 15:09   ` Jason Gunthorpe
@ 2023-01-17 16:52     ` Bob Pearson
  0 siblings, 0 replies; 16+ messages in thread
From: Bob Pearson @ 2023-01-17 16:52 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: zyjzyj2000, leonro, yangx.jy, linux-rdma

On 1/17/23 09:09, Jason Gunthorpe wrote:
> On Mon, Jan 16, 2023 at 05:56:00PM -0600, Bob Pearson wrote:
> 
>> +		err = rxe_mr_do_atomic_op(mr, iova, pkt->opcode,
>> +					  atmeth_comp(pkt),
>> +					  atmeth_swap_add(pkt),
>> +					  &res->atomic.orig_val);
>> +		if (unlikely(err)) {
>> +			if (err == -RXE_ERR_NOT_ALIGNED)
>> +				return RESPST_ERR_MISALIGNED_ATOMIC;
>> +			else
>> +				return RESPST_ERR_RKEY_VIOLATION;
> 
> Why not just return these RESPST_ constants directly from
> rxe_mr_do_atomic_op ?
> 
> Jason

The main reason is that the responder state machine is fairly complicated and hiding bits of
it in other files just makes it more difficult to follow. Originally all the code was inline here
but I felt it would be better to keep all the mr specific detail in one file.

Bob


* Re: [PATCH for-next v5 4/6] RDMA-rxe: Isolate mr code from atomic_write_reply()
  2023-01-17 15:11   ` Jason Gunthorpe
@ 2023-01-17 16:57     ` Bob Pearson
  2023-01-17 16:59       ` Jason Gunthorpe
  0 siblings, 1 reply; 16+ messages in thread
From: Bob Pearson @ 2023-01-17 16:57 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: zyjzyj2000, leonro, yangx.jy, linux-rdma

On 1/17/23 09:11, Jason Gunthorpe wrote:
> On Mon, Jan 16, 2023 at 05:56:01PM -0600, Bob Pearson wrote:
> 
>> +/* only implemented for 64 bit architectures */
>> +int rxe_mr_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value)
>> +{
>> +#if defined CONFIG_64BIT
> 
> #ifdef
> 
> It is a little more typical style to provide an alternate version of
> the function when #ifdefing

I will do that.
> 
>> +	u64 *va;
>> +
>> +	/* See IBA oA19-28 */
>> +	if (unlikely(mr->state != RXE_MR_STATE_VALID)) {
>> +		rxe_dbg_mr(mr, "mr not in valid state");
>> +		return -EINVAL;
>> +	}
>> +
>> +	va = iova_to_vaddr(mr, iova, sizeof(value));
>> +	if (unlikely(!va)) {
>> +		rxe_dbg_mr(mr, "iova out of range");
>> +		return -ERANGE;
>> +	}
>> +
>> +	/* See IBA A19.4.2 */
>> +	if (unlikely((uintptr_t)va & 0x7 || iova & 0x7)) {
>> +		rxe_dbg_mr(mr, "misaligned address");
>> +		return -RXE_ERR_NOT_ALIGNED;
>> +	}
>> +
>> +	/* Do atomic write after all prior operations have completed */
>> +	smp_store_release(va, value);
>> +
>> +	return 0;
>> +#else
>> +	WARN_ON(1);
>> +	return -EINVAL;
>> +#endif
>> +}
>> +
>>  int advance_dma_data(struct rxe_dma_info *dma, unsigned int length)
>>  {
>>  	struct rxe_sge		*sge	= &dma->sge[dma->cur_sge];
>> diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
>> index 1e38e5da1f4c..49298ff88d25 100644
>> --- a/drivers/infiniband/sw/rxe/rxe_resp.c
>> +++ b/drivers/infiniband/sw/rxe/rxe_resp.c
>> @@ -764,30 +764,40 @@ static enum resp_states atomic_reply(struct rxe_qp *qp,
>>  	return RESPST_ACKNOWLEDGE;
>>  }
>>  
>> -#ifdef CONFIG_64BIT
>> -static enum resp_states do_atomic_write(struct rxe_qp *qp,
>> -					struct rxe_pkt_info *pkt)
>> +static enum resp_states atomic_write_reply(struct rxe_qp *qp,
>> +					   struct rxe_pkt_info *pkt)
>>  {
>> -	struct rxe_mr *mr = qp->resp.mr;
>> -	int payload = payload_size(pkt);
>> -	u64 src, *dst;
>> -
>> -	if (mr->state != RXE_MR_STATE_VALID)
>> -		return RESPST_ERR_RKEY_VIOLATION;
>> +	struct resp_res *res = qp->resp.res;
>> +	struct rxe_mr *mr;
>> +	u64 value;
>> +	u64 iova;
>> +	int err;
>>  
>> -	memcpy(&src, payload_addr(pkt), payload);
>> +	if (!res) {
>> +		res = rxe_prepare_res(qp, pkt, RXE_ATOMIC_WRITE_MASK);
>> +		qp->resp.res = res;
>> +	}
>>  
>> -	dst = iova_to_vaddr(mr, qp->resp.va + qp->resp.offset, payload);
>> -	/* check vaddr is 8 bytes aligned. */
>> -	if (!dst || (uintptr_t)dst & 7)
>> -		return RESPST_ERR_MISALIGNED_ATOMIC;
>> +	if (res->replay)
>> +		return RESPST_ACKNOWLEDGE;
>>  
>> -	/* Do atomic write after all prior operations have completed */
>> -	smp_store_release(dst, src);
>> +	mr = qp->resp.mr;
>> +	value = *(u64 *)payload_addr(pkt);
>> +	iova = qp->resp.va + qp->resp.offset;
>>  
>> -	/* decrease resp.resid to zero */
>> -	qp->resp.resid -= sizeof(payload);
>> +#if defined CONFIG_64BIT
> 
> Shouldn't need a #ifdef here

This avoids adding a new special error (i.e. NOT_64_bit) and makes it clear
that we won't call into the mr code at all on 32-bit builds.

I really don't understand why Fujitsu did it this way instead of just falling
back to a spinlock on 32-bit architectures. But if I want to keep to the
spirit of their implementation, I think this is fairly clear.
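
Just to show what I mean, a purely hypothetical 32-bit fallback (not part of
this series; the lock name is invented for illustration):

	/* hypothetical !CONFIG_64BIT version, sketch only */
	static DEFINE_SPINLOCK(rxe_atomic_wr_lock);

	int rxe_mr_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value)
	{
		u64 *va;

		if (mr->state != RXE_MR_STATE_VALID)
			return -EINVAL;

		va = iova_to_vaddr(mr, iova, sizeof(value));
		if (!va)
			return -ERANGE;

		/* See IBA A19.4.2 */
		if ((uintptr_t)va & 0x7 || iova & 0x7)
			return -RXE_ERR_NOT_ALIGNED;

		/* the lock serializes rxe's own writers, but the 64-bit
		 * store is still emitted as two 32-bit stores here
		 */
		spin_lock_bh(&rxe_atomic_wr_lock);
		*va = value;
		spin_unlock_bh(&rxe_atomic_wr_lock);

		return 0;
	}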

Bob
> 
>> +	err = rxe_mr_do_atomic_write(mr, iova, value);
>> +	if (unlikely(err)) {
>> +		if (err == -RXE_ERR_NOT_ALIGNED)
>> +			return RESPST_ERR_MISALIGNED_ATOMIC;
>> +		else
>> +			return RESPST_ERR_RKEY_VIOLATION;
> 
> Again why not return the RESPST directly then the stub can return
> RESPST_ERR_UNSUPPORTED_OPCODE?
> 
> Jason


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH for-next v5 4/6] RDMA-rxe: Isolate mr code from atomic_write_reply()
  2023-01-17 16:57     ` Bob Pearson
@ 2023-01-17 16:59       ` Jason Gunthorpe
  2023-01-17 17:04         ` Bob Pearson
  0 siblings, 1 reply; 16+ messages in thread
From: Jason Gunthorpe @ 2023-01-17 16:59 UTC (permalink / raw)
  To: Bob Pearson; +Cc: zyjzyj2000, leonro, yangx.jy, linux-rdma

On Tue, Jan 17, 2023 at 10:57:31AM -0600, Bob Pearson wrote:

> >> -	dst = iova_to_vaddr(mr, qp->resp.va + qp->resp.offset, payload);
> >> -	/* check vaddr is 8 bytes aligned. */
> >> -	if (!dst || (uintptr_t)dst & 7)
> >> -		return RESPST_ERR_MISALIGNED_ATOMIC;
> >> +	if (res->replay)
> >> +		return RESPST_ACKNOWLEDGE;
> >>  
> >> -	/* Do atomic write after all prior operations have completed */
> >> -	smp_store_release(dst, src);
> >> +	mr = qp->resp.mr;
> >> +	value = *(u64 *)payload_addr(pkt);
> >> +	iova = qp->resp.va + qp->resp.offset;
> >>  
> >> -	/* decrease resp.resid to zero */
> >> -	qp->resp.resid -= sizeof(payload);
> >> +#if defined CONFIG_64BIT
> > 
> > Shouldn't need a #ifdef here
> 
> This avoids a new special error (i.e. NOT_64_bit) and makes it clear we
> won't call the code in mr.

? That doesn't seem right
 
> I really don't understand why Fujitsu did it all this way instead of just
> using a spinlock for 32 bit architectures as a fallback. But if I want to
> keep to the spirit of their implementation this is fairly clear I think.

IIRC the IBA definition is that this is supposed to be coherent with
the host CPU; the spinlock version is for the non-atomic atomics, which
only have to be coherent with the "hca".

So a spinlock will not provide coherency with userspace that may be
touching this same atomic memory.
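
To make that concrete (toy example, names made up): a userspace process that
has the same memory mapped may be doing something like

	uint64_t *p = mr_buf;	/* the registered memory */
	uint64_t flag = *(volatile uint64_t *)p;

and it never takes any driver lock, so if the kernel side does the store as
two 32-bit halves under a spinlock the reader can still observe a torn value.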

Jason

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH for-next v5 4/6] RDMA-rxe: Isolate mr code from atomic_write_reply()
  2023-01-17 16:59       ` Jason Gunthorpe
@ 2023-01-17 17:04         ` Bob Pearson
  2023-01-17 18:05           ` Jason Gunthorpe
  0 siblings, 1 reply; 16+ messages in thread
From: Bob Pearson @ 2023-01-17 17:04 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: zyjzyj2000, leonro, yangx.jy, linux-rdma

On 1/17/23 10:59, Jason Gunthorpe wrote:
> On Tue, Jan 17, 2023 at 10:57:31AM -0600, Bob Pearson wrote:
> 
>>>> -	dst = iova_to_vaddr(mr, qp->resp.va + qp->resp.offset, payload);
>>>> -	/* check vaddr is 8 bytes aligned. */
>>>> -	if (!dst || (uintptr_t)dst & 7)
>>>> -		return RESPST_ERR_MISALIGNED_ATOMIC;
>>>> +	if (res->replay)
>>>> +		return RESPST_ACKNOWLEDGE;
>>>>  
>>>> -	/* Do atomic write after all prior operations have completed */
>>>> -	smp_store_release(dst, src);
>>>> +	mr = qp->resp.mr;
>>>> +	value = *(u64 *)payload_addr(pkt);
>>>> +	iova = qp->resp.va + qp->resp.offset;
>>>>  
>>>> -	/* decrease resp.resid to zero */
>>>> -	qp->resp.resid -= sizeof(payload);
>>>> +#if defined CONFIG_64BIT
>>>
>>> Shouldn't need a #ifdef here
>>
>> This avoids a new special error (i.e. NOT_64_bit) and makes it clear we
>> won't call the code in mr.
> 
> ? That doesn't seem right

That was the -3 of the -1, -2, -3 that we just fixed. There are three error paths out
of this state and we need a way to reach each of them. The #ifdef provides that third path.
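
Roughly, from memory (not quoting the patch exactly):

	#ifdef CONFIG_64BIT
		err = rxe_mr_do_atomic_write(mr, iova, value);
		if (err == -RXE_ERR_NOT_ALIGNED)	/* exit 1 */
			return RESPST_ERR_MISALIGNED_ATOMIC;
		else if (err)				/* exit 2 */
			return RESPST_ERR_RKEY_VIOLATION;
	#else
		/* exit 3: mr code never called on 32-bit builds */
		return RESPST_ERR_UNSUPPORTED_OPCODE;
	#endif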
>  
>> I really don't understand why Fujitsu did it all this way instead of just
>> using a spinlock for 32 bit architectures as a fallback. But if I want to
>> keep to the spirit of their implementation this is fairly clear I think.
> 
> IIRC the IBA definition is that this is supposed to be coherent with
> the host CPU, the spinlock version is for the non-atomic atomics which
> only has to be coherent with the "hca"
> 
> So a spinlock will not provide coherency with userspace that may be
> touching this same atomic memory.

Thanks. That makes sense.

Bob
> 
> Jason


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH for-next v5 4/6] RDMA-rxe: Isolate mr code from atomic_write_reply()
  2023-01-17 17:04         ` Bob Pearson
@ 2023-01-17 18:05           ` Jason Gunthorpe
  2023-01-17 20:41             ` Bob Pearson
  0 siblings, 1 reply; 16+ messages in thread
From: Jason Gunthorpe @ 2023-01-17 18:05 UTC (permalink / raw)
  To: Bob Pearson; +Cc: zyjzyj2000, leonro, yangx.jy, linux-rdma

On Tue, Jan 17, 2023 at 11:04:29AM -0600, Bob Pearson wrote:
> On 1/17/23 10:59, Jason Gunthorpe wrote:
> > On Tue, Jan 17, 2023 at 10:57:31AM -0600, Bob Pearson wrote:
> > 
> >>>> -	dst = iova_to_vaddr(mr, qp->resp.va + qp->resp.offset, payload);
> >>>> -	/* check vaddr is 8 bytes aligned. */
> >>>> -	if (!dst || (uintptr_t)dst & 7)
> >>>> -		return RESPST_ERR_MISALIGNED_ATOMIC;
> >>>> +	if (res->replay)
> >>>> +		return RESPST_ACKNOWLEDGE;
> >>>>  
> >>>> -	/* Do atomic write after all prior operations have completed */
> >>>> -	smp_store_release(dst, src);
> >>>> +	mr = qp->resp.mr;
> >>>> +	value = *(u64 *)payload_addr(pkt);
> >>>> +	iova = qp->resp.va + qp->resp.offset;
> >>>>  
> >>>> -	/* decrease resp.resid to zero */
> >>>> -	qp->resp.resid -= sizeof(payload);
> >>>> +#if defined CONFIG_64BIT
> >>>
> >>> Shouldn't need a #ifdef here
> >>
> >> This avoids a new special error (i.e. NOT_64_bit) and makes it clear we
> >> won't call the code in mr.
> > 
> > ? That doesn't seem right
> 
> that was the -3 of the -1, -2, -3 that we just fixed. there are three error paths out
> of this state and we need a way to get to them. The #ifdef provides
> that third path.

I feel like it should be solvable without this ifdef though

Jason

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH for-next v5 4/6] RDMA-rxe: Isolate mr code from atomic_write_reply()
  2023-01-17 18:05           ` Jason Gunthorpe
@ 2023-01-17 20:41             ` Bob Pearson
  2023-01-18 13:53               ` Jason Gunthorpe
  0 siblings, 1 reply; 16+ messages in thread
From: Bob Pearson @ 2023-01-17 20:41 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: zyjzyj2000, leonro, yangx.jy, linux-rdma

On 1/17/23 12:05, Jason Gunthorpe wrote:
> On Tue, Jan 17, 2023 at 11:04:29AM -0600, Bob Pearson wrote:
>> On 1/17/23 10:59, Jason Gunthorpe wrote:
>>> On Tue, Jan 17, 2023 at 10:57:31AM -0600, Bob Pearson wrote:
>>>
>>>>>> -	dst = iova_to_vaddr(mr, qp->resp.va + qp->resp.offset, payload);
>>>>>> -	/* check vaddr is 8 bytes aligned. */
>>>>>> -	if (!dst || (uintptr_t)dst & 7)
>>>>>> -		return RESPST_ERR_MISALIGNED_ATOMIC;
>>>>>> +	if (res->replay)
>>>>>> +		return RESPST_ACKNOWLEDGE;
>>>>>>  
>>>>>> -	/* Do atomic write after all prior operations have completed */
>>>>>> -	smp_store_release(dst, src);
>>>>>> +	mr = qp->resp.mr;
>>>>>> +	value = *(u64 *)payload_addr(pkt);
>>>>>> +	iova = qp->resp.va + qp->resp.offset;
>>>>>>  
>>>>>> -	/* decrease resp.resid to zero */
>>>>>> -	qp->resp.resid -= sizeof(payload);
>>>>>> +#if defined CONFIG_64BIT
>>>>>
>>>>> Shouldn't need a #ifdef here
>>>>
>>>> This avoids a new special error (i.e. NOT_64_bit) and makes it clear we
>>>> won't call the code in mr.
>>>
>>> ? That doesn't seem right
>>
>> that was the -3 of the -1, -2, -3 that we just fixed. there are three error paths out
>> of this state and we need a way to get to them. The #ifdef provides
>> that third path.
> 
> I feel like it should be solvable without this ifdef though
> 
> Jason

You could get rid of the ifdef in the atomic_write_reply() routine, but then the
rxe_mr_do_atomic_write() routine would need a second version in the #else case that
returns something different so that the third exit can be taken, i.e. whatever
replaces the original -3. I really think this is simpler.
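
For comparison, the alternative would look something like this (sketch only;
the error value is picked arbitrarily):

	/* rxe_mr.c */
	#ifndef CONFIG_64BIT
	int rxe_mr_do_atomic_write(struct rxe_mr *mr, u64 iova, u64 value)
	{
		/* stub for 32-bit builds; stands in for the old -3 */
		return -EOPNOTSUPP;
	}
	#endif

and then atomic_write_reply() would still need a matching check to turn that
into RESPST_ERR_UNSUPPORTED_OPCODE, so the distinction just moves around.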

Bob

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH for-next v5 4/6] RDMA-rxe: Isolate mr code from atomic_write_reply()
  2023-01-17 20:41             ` Bob Pearson
@ 2023-01-18 13:53               ` Jason Gunthorpe
  0 siblings, 0 replies; 16+ messages in thread
From: Jason Gunthorpe @ 2023-01-18 13:53 UTC (permalink / raw)
  To: Bob Pearson; +Cc: zyjzyj2000, leonro, yangx.jy, linux-rdma

On Tue, Jan 17, 2023 at 02:41:53PM -0600, Bob Pearson wrote:
> On 1/17/23 12:05, Jason Gunthorpe wrote:
> > On Tue, Jan 17, 2023 at 11:04:29AM -0600, Bob Pearson wrote:
> >> On 1/17/23 10:59, Jason Gunthorpe wrote:
> >>> On Tue, Jan 17, 2023 at 10:57:31AM -0600, Bob Pearson wrote:
> >>>
> >>>>>> -	dst = iova_to_vaddr(mr, qp->resp.va + qp->resp.offset, payload);
> >>>>>> -	/* check vaddr is 8 bytes aligned. */
> >>>>>> -	if (!dst || (uintptr_t)dst & 7)
> >>>>>> -		return RESPST_ERR_MISALIGNED_ATOMIC;
> >>>>>> +	if (res->replay)
> >>>>>> +		return RESPST_ACKNOWLEDGE;
> >>>>>>  
> >>>>>> -	/* Do atomic write after all prior operations have completed */
> >>>>>> -	smp_store_release(dst, src);
> >>>>>> +	mr = qp->resp.mr;
> >>>>>> +	value = *(u64 *)payload_addr(pkt);
> >>>>>> +	iova = qp->resp.va + qp->resp.offset;
> >>>>>>  
> >>>>>> -	/* decrease resp.resid to zero */
> >>>>>> -	qp->resp.resid -= sizeof(payload);
> >>>>>> +#if defined CONFIG_64BIT
> >>>>>
> >>>>> Shouldn't need a #ifdef here
> >>>>
> >>>> This avoids a new special error (i.e. NOT_64_bit) and makes it clear we
> >>>> won't call the code in mr.
> >>>
> >>> ? That doesn't seem right
> >>
> >> that was the -3 of the -1, -2, -3 that we just fixed. there are three error paths out
> >> of this state and we need a way to get to them. The #ifdef provides
> >> that third path.
> > 
> > I feel like it should be solvable without this ifdef though
> > 
> > Jason
> 
> You could get rid of the ifdef in the atomic_write_reply() routine but then the rxe_mr_do_atomic_write() routine would have to have a second version in the #else case
> that would have to return something different so that the third exit could be taken i.e.
> whatever replaces the original -3. I really think this is simpler.

You really should just return the RESPST_* from these functions.
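
Something along these lines (rough sketch, not compiled; I'm assuming a
"keep going" member like RESPST_NONE exists for the success case):

	/* rxe_mr.c */
	#ifdef CONFIG_64BIT
	enum resp_states rxe_mr_do_atomic_write(struct rxe_mr *mr, u64 iova,
						u64 value)
	{
		u64 *va;

		if (mr->state != RXE_MR_STATE_VALID)
			return RESPST_ERR_RKEY_VIOLATION;

		va = iova_to_vaddr(mr, iova, sizeof(value));
		if (!va)
			return RESPST_ERR_RKEY_VIOLATION;

		/* See IBA A19.4.2 */
		if ((uintptr_t)va & 0x7 || iova & 0x7)
			return RESPST_ERR_MISALIGNED_ATOMIC;

		/* Do atomic write after all prior operations have completed */
		smp_store_release(va, value);

		return RESPST_NONE;
	}
	#else
	enum resp_states rxe_mr_do_atomic_write(struct rxe_mr *mr, u64 iova,
						u64 value)
	{
		return RESPST_ERR_UNSUPPORTED_OPCODE;
	}
	#endif

	/* rxe_resp.c, in atomic_write_reply() */
		ret = rxe_mr_do_atomic_write(mr, iova, value);
		if (ret != RESPST_NONE)
			return ret;

Then the stub naturally gives you the UNSUPPORTED_OPCODE path without any
ifdef in rxe_resp.c.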

Jason

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2023-01-18 14:13 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-16 23:55 [PATCH for-next v5 0/6] RDMA/rxe: Replace mr page map with an xarray Bob Pearson
2023-01-16 23:55 ` [PATCH for-next v5 1/6] RDMA/rxe: Cleanup mr_check_range Bob Pearson
2023-01-16 23:55 ` [PATCH for-next v5 2/6] RDMA/rxe: Move rxe_map_mr_sg to rxe_mr.c Bob Pearson
2023-01-16 23:56 ` [PATCH for-next v5 3/6] RDMA-rxe: Isolate mr code from atomic_reply() Bob Pearson
2023-01-17 15:09   ` Jason Gunthorpe
2023-01-17 16:52     ` Bob Pearson
2023-01-16 23:56 ` [PATCH for-next v5 4/6] RDMA-rxe: Isolate mr code from atomic_write_reply() Bob Pearson
2023-01-17 15:11   ` Jason Gunthorpe
2023-01-17 16:57     ` Bob Pearson
2023-01-17 16:59       ` Jason Gunthorpe
2023-01-17 17:04         ` Bob Pearson
2023-01-17 18:05           ` Jason Gunthorpe
2023-01-17 20:41             ` Bob Pearson
2023-01-18 13:53               ` Jason Gunthorpe
2023-01-16 23:56 ` [PATCH for-next v5 5/6] RDMA/rxe: Cleanup page variables in rxe_mr.c Bob Pearson
2023-01-16 23:56 ` [PATCH for-next v5 6/6] RDMA/rxe: Replace rxe_map and rxe_phys_buf by xarray Bob Pearson
