* [PATCH for-rc v4 0/5] RDMA/rxe: Various bug fixes.
@ 2021-09-14 16:42 Bob Pearson
From: Bob Pearson @ 2021-09-14 16:42 UTC
  To: jgg, zyjzyj2000, linux-rdma, bvanassche, mie, rao.shoaib; +Cc: Bob Pearson

This series of patches implements several bug fixes and minor
cleanups of the rxe driver. Specifically, these fix a bug exposed
by the blktests suite.

They apply cleanly to both
commit 1b789bd4dbd48a92f5427d9c37a72a8f6ca17754 (origin/for-rc)
commit 6a217437f9f5482a3f6f2dc5fcd27cf0f62409ac (origin/for-next)

The first patch is a rewrite of an earlier patch. It adds memory
barriers to kernel-to-kernel queues. The logic is the same as in an
earlier patch that covered only user-to-kernel queues. Without this
patch, kernel-to-kernel queues can be expected to fail intermittently
at low frequency, as was seen for the other queues.

The second patch cleans up the state and type enums used by MRs.

The third patch separates the keys in rxe_mr and ib_mr. This allows
the following sequence, seen in the srp driver, to work correctly:

	do {
		ib_post_send( IB_WR_LOCAL_INV )	/* invalidate the old rkey */
		ib_update_fast_reg_key()	/* bump the key portion */
		ib_map_mr_sg()			/* build a new page list */
		ib_post_send( IB_WR_REG_MR )	/* re-register with new key */
	} while ( !done )

The fourth patch creates duplicate mapping tables for fast MRs. This
prevents rkeys referencing fast MRs from accessing data through an
updated map after a call to ib_map_mr_sg(), by keeping the new and old
mappings separate and atomically swapping them when a REG_MR WR is
executed.

The fifth patch checks the type of MRs that receive local or remote
invalidate operations, to prevent invalidating user MRs.
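
A minimal sketch of the kind of check this adds (illustrative only,
not the exact code from patch 5):

	/* Only fast MRs (IB_MR_TYPE_MEM_REG) may be invalidated by
	 * a local or remote invalidate operation; user MRs must be
	 * deregistered through the verbs API instead.
	 */
	if (mr->type != IB_MR_TYPE_MEM_REG) {
		pr_warn("%s: mr of type %d may not be invalidated\n",
			__func__, mr->type);
		return -EINVAL;
	}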

v3->v4:
Two of the patches in v3 were accepted into v5.15 and so have been
dropped here.

The first patch was rewritten to correctly handle queue operations
in rxe_verbs.c, where the code is the client and not the server.

v2->v3:
The v2 version had a typo which broke clean application to for-next.
Additionally, in v3 the order of the patches was changed to make
the series a little cleaner.

Bob Pearson (5):
  RDMA/rxe: Add memory barriers to kernel queues
  RDMA/rxe: Cleanup MR status and type enums
  RDMA/rxe: Separate HW and SW l/rkeys
  RDMA/rxe: Create duplicate mapping tables for FMRs
  RDMA/rxe: Only allow invalidate for appropriate MRs

 drivers/infiniband/sw/rxe/rxe_comp.c  |  12 +-
 drivers/infiniband/sw/rxe/rxe_cq.c    |  25 +--
 drivers/infiniband/sw/rxe/rxe_loc.h   |   2 +
 drivers/infiniband/sw/rxe/rxe_mr.c    | 267 ++++++++++++++++-------
 drivers/infiniband/sw/rxe/rxe_mw.c    |  36 ++--
 drivers/infiniband/sw/rxe/rxe_qp.c    |  12 +-
 drivers/infiniband/sw/rxe/rxe_queue.c |  30 ++-
 drivers/infiniband/sw/rxe/rxe_queue.h | 292 +++++++++++---------------
 drivers/infiniband/sw/rxe/rxe_req.c   |  51 ++---
 drivers/infiniband/sw/rxe/rxe_resp.c  |  40 +---
 drivers/infiniband/sw/rxe/rxe_srq.c   |   2 +-
 drivers/infiniband/sw/rxe/rxe_verbs.c |  92 ++------
 drivers/infiniband/sw/rxe/rxe_verbs.h |  48 ++---
 13 files changed, 438 insertions(+), 471 deletions(-)

-- 
2.30.2



* [PATCH for-rc v4 1/5] RDMA/rxe: Add memory barriers to kernel queues
@ 2021-09-14 16:42 Bob Pearson
From: Bob Pearson @ 2021-09-14 16:42 UTC
  To: jgg, zyjzyj2000, linux-rdma, bvanassche, mie, rao.shoaib; +Cc: Bob Pearson

Earlier patches added memory barriers to protect user space to kernel
space communications. The user space queues were previously shown to
have occasional memory synchronization errors, which were removed by
adding smp_load_acquire()/smp_store_release() barriers. This patch
extends that to the case where queues are used between kernel space
threads.

This patch also extends the queue types to include kernel ULP queues,
which access the other end of the queues in kernel verbs calls like
poll_cq() and post_send()/post_recv().
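
The pairing is the standard one-way shared queue pattern. A
simplified sketch using the helper and field names from rxe_queue.h
(elem and elem_size stand in for the caller's data):

	/* producer: fill the slot, then publish the new index with a
	 * release so the peer cannot see the index before the data
	 */
	memcpy(queue_producer_addr(q, type), elem, elem_size);
	q->index = (q->index + 1) & q->index_mask;
	smp_store_release(&q->buf->producer_index, q->index);

	/* consumer: acquire the peer's index before reading the slot
	 * it guards
	 */
	prod = smp_load_acquire(&q->buf->producer_index);
	if (((prod - q->index) & q->index_mask) != 0)
		memcpy(elem, queue_consumer_addr(q, type), elem_size);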

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_comp.c  |  12 +-
 drivers/infiniband/sw/rxe/rxe_cq.c    |  25 +--
 drivers/infiniband/sw/rxe/rxe_qp.c    |  12 +-
 drivers/infiniband/sw/rxe/rxe_queue.c |  30 ++-
 drivers/infiniband/sw/rxe/rxe_queue.h | 292 +++++++++++---------------
 drivers/infiniband/sw/rxe/rxe_req.c   |  37 ++--
 drivers/infiniband/sw/rxe/rxe_resp.c  |  40 +---
 drivers/infiniband/sw/rxe/rxe_srq.c   |   2 +-
 drivers/infiniband/sw/rxe/rxe_verbs.c |  53 +----
 9 files changed, 189 insertions(+), 314 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_comp.c b/drivers/infiniband/sw/rxe/rxe_comp.c
index d2d802c776fd..48a3864ada29 100644
--- a/drivers/infiniband/sw/rxe/rxe_comp.c
+++ b/drivers/infiniband/sw/rxe/rxe_comp.c
@@ -142,10 +142,7 @@ static inline enum comp_state get_wqe(struct rxe_qp *qp,
 	/* we come here whether or not we found a response packet to see if
 	 * there are any posted WQEs
 	 */
-	if (qp->is_user)
-		wqe = queue_head(qp->sq.queue, QUEUE_TYPE_FROM_USER);
-	else
-		wqe = queue_head(qp->sq.queue, QUEUE_TYPE_KERNEL);
+	wqe = queue_head(qp->sq.queue, QUEUE_TYPE_FROM_CLIENT);
 	*wqe_p = wqe;
 
 	/* no WQE or requester has not started it yet */
@@ -432,10 +429,7 @@ static void do_complete(struct rxe_qp *qp, struct rxe_send_wqe *wqe)
 	if (post)
 		make_send_cqe(qp, wqe, &cqe);
 
-	if (qp->is_user)
-		advance_consumer(qp->sq.queue, QUEUE_TYPE_FROM_USER);
-	else
-		advance_consumer(qp->sq.queue, QUEUE_TYPE_KERNEL);
+	queue_advance_consumer(qp->sq.queue, QUEUE_TYPE_FROM_CLIENT);
 
 	if (post)
 		rxe_cq_post(qp->scq, &cqe, 0);
@@ -539,7 +533,7 @@ static void rxe_drain_resp_pkts(struct rxe_qp *qp, bool notify)
 			wqe->status = IB_WC_WR_FLUSH_ERR;
 			do_complete(qp, wqe);
 		} else {
-			advance_consumer(q, q->type);
+			queue_advance_consumer(q, q->type);
 		}
 	}
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_cq.c b/drivers/infiniband/sw/rxe/rxe_cq.c
index aef288f164fd..4eedaa0244b3 100644
--- a/drivers/infiniband/sw/rxe/rxe_cq.c
+++ b/drivers/infiniband/sw/rxe/rxe_cq.c
@@ -25,11 +25,7 @@ int rxe_cq_chk_attr(struct rxe_dev *rxe, struct rxe_cq *cq,
 	}
 
 	if (cq) {
-		if (cq->is_user)
-			count = queue_count(cq->queue, QUEUE_TYPE_TO_USER);
-		else
-			count = queue_count(cq->queue, QUEUE_TYPE_KERNEL);
-
+		count = queue_count(cq->queue, QUEUE_TYPE_TO_CLIENT);
 		if (cqe < count) {
 			pr_warn("cqe(%d) < current # elements in queue (%d)",
 				cqe, count);
@@ -65,7 +61,7 @@ int rxe_cq_from_init(struct rxe_dev *rxe, struct rxe_cq *cq, int cqe,
 	int err;
 	enum queue_type type;
 
-	type = uresp ? QUEUE_TYPE_TO_USER : QUEUE_TYPE_KERNEL;
+	type = QUEUE_TYPE_TO_CLIENT;
 	cq->queue = rxe_queue_init(rxe, &cqe,
 			sizeof(struct rxe_cqe), type);
 	if (!cq->queue) {
@@ -117,11 +113,7 @@ int rxe_cq_post(struct rxe_cq *cq, struct rxe_cqe *cqe, int solicited)
 
 	spin_lock_irqsave(&cq->cq_lock, flags);
 
-	if (cq->is_user)
-		full = queue_full(cq->queue, QUEUE_TYPE_TO_USER);
-	else
-		full = queue_full(cq->queue, QUEUE_TYPE_KERNEL);
-
+	full = queue_full(cq->queue, QUEUE_TYPE_TO_CLIENT);
 	if (unlikely(full)) {
 		spin_unlock_irqrestore(&cq->cq_lock, flags);
 		if (cq->ibcq.event_handler) {
@@ -134,17 +126,10 @@ int rxe_cq_post(struct rxe_cq *cq, struct rxe_cqe *cqe, int solicited)
 		return -EBUSY;
 	}
 
-	if (cq->is_user)
-		addr = producer_addr(cq->queue, QUEUE_TYPE_TO_USER);
-	else
-		addr = producer_addr(cq->queue, QUEUE_TYPE_KERNEL);
-
+	addr = queue_producer_addr(cq->queue, QUEUE_TYPE_TO_CLIENT);
 	memcpy(addr, cqe, sizeof(*cqe));
 
-	if (cq->is_user)
-		advance_producer(cq->queue, QUEUE_TYPE_TO_USER);
-	else
-		advance_producer(cq->queue, QUEUE_TYPE_KERNEL);
+	queue_advance_producer(cq->queue, QUEUE_TYPE_TO_CLIENT);
 
 	spin_unlock_irqrestore(&cq->cq_lock, flags);
 
diff --git a/drivers/infiniband/sw/rxe/rxe_qp.c b/drivers/infiniband/sw/rxe/rxe_qp.c
index 1ab6af7ddb25..a13991a003d0 100644
--- a/drivers/infiniband/sw/rxe/rxe_qp.c
+++ b/drivers/infiniband/sw/rxe/rxe_qp.c
@@ -231,7 +231,7 @@ static int rxe_qp_init_req(struct rxe_dev *rxe, struct rxe_qp *qp,
 	qp->sq.max_inline = init->cap.max_inline_data = wqe_size;
 	wqe_size += sizeof(struct rxe_send_wqe);
 
-	type = uresp ? QUEUE_TYPE_FROM_USER : QUEUE_TYPE_KERNEL;
+	type = QUEUE_TYPE_FROM_CLIENT;
 	qp->sq.queue = rxe_queue_init(rxe, &qp->sq.max_wr,
 				wqe_size, type);
 	if (!qp->sq.queue)
@@ -248,12 +248,8 @@ static int rxe_qp_init_req(struct rxe_dev *rxe, struct rxe_qp *qp,
 		return err;
 	}
 
-	if (qp->is_user)
-		qp->req.wqe_index = producer_index(qp->sq.queue,
-						QUEUE_TYPE_FROM_USER);
-	else
-		qp->req.wqe_index = producer_index(qp->sq.queue,
-						QUEUE_TYPE_KERNEL);
+	qp->req.wqe_index = queue_get_producer(qp->sq.queue,
+					       QUEUE_TYPE_FROM_CLIENT);
 
 	qp->req.state		= QP_STATE_RESET;
 	qp->req.opcode		= -1;
@@ -293,7 +289,7 @@ static int rxe_qp_init_resp(struct rxe_dev *rxe, struct rxe_qp *qp,
 		pr_debug("qp#%d max_wr = %d, max_sge = %d, wqe_size = %d\n",
 			 qp_num(qp), qp->rq.max_wr, qp->rq.max_sge, wqe_size);
 
-		type = uresp ? QUEUE_TYPE_FROM_USER : QUEUE_TYPE_KERNEL;
+		type = QUEUE_TYPE_FROM_CLIENT;
 		qp->rq.queue = rxe_queue_init(rxe, &qp->rq.max_wr,
 					wqe_size, type);
 		if (!qp->rq.queue)
diff --git a/drivers/infiniband/sw/rxe/rxe_queue.c b/drivers/infiniband/sw/rxe/rxe_queue.c
index 72d95398e604..6e6e023c1b45 100644
--- a/drivers/infiniband/sw/rxe/rxe_queue.c
+++ b/drivers/infiniband/sw/rxe/rxe_queue.c
@@ -111,17 +111,33 @@ struct rxe_queue *rxe_queue_init(struct rxe_dev *rxe, int *num_elem,
 static int resize_finish(struct rxe_queue *q, struct rxe_queue *new_q,
 			 unsigned int num_elem)
 {
-	if (!queue_empty(q, q->type) && (num_elem < queue_count(q, q->type)))
+	enum queue_type type = q->type;
+	u32 prod;
+	u32 cons;
+
+	if (!queue_empty(q, q->type) && (num_elem < queue_count(q, type)))
 		return -EINVAL;
 
-	while (!queue_empty(q, q->type)) {
-		memcpy(producer_addr(new_q, new_q->type),
-					consumer_addr(q, q->type),
-					new_q->elem_size);
-		advance_producer(new_q, new_q->type);
-		advance_consumer(q, q->type);
+	prod = queue_get_producer(new_q, type);
+	cons = queue_get_consumer(q, type);
+
+	while (!queue_empty(q, type)) {
+		memcpy(queue_addr_from_index(new_q, prod),
+		       queue_addr_from_index(q, cons), new_q->elem_size);
+		prod = queue_next_index(new_q, prod);
+		cons = queue_next_index(q, cons);
 	}
 
+	new_q->buf->producer_index = prod;
+	q->buf->consumer_index = cons;
+
+	/* update private index copies */
+	if (type == QUEUE_TYPE_TO_CLIENT)
+		new_q->index = new_q->buf->producer_index;
+	else
+		q->index = q->buf->consumer_index;
+
+	/* exchange rxe_queue headers */
 	swap(*q, *new_q);
 
 	return 0;
diff --git a/drivers/infiniband/sw/rxe/rxe_queue.h b/drivers/infiniband/sw/rxe/rxe_queue.h
index 2702b0e55fc3..6227112ef7a2 100644
--- a/drivers/infiniband/sw/rxe/rxe_queue.h
+++ b/drivers/infiniband/sw/rxe/rxe_queue.h
@@ -10,34 +10,47 @@
 /* for definition of shared struct rxe_queue_buf */
 #include <uapi/rdma/rdma_user_rxe.h>
 
-/* implements a simple circular buffer that can optionally be
- * shared between user space and the kernel and can be resized
- * the requested element size is rounded up to a power of 2
- * and the number of elements in the buffer is also rounded
- * up to a power of 2. Since the queue is empty when the
- * producer and consumer indices match the maximum capacity
- * of the queue is one less than the number of element slots
+/* Implements a simple circular buffer that is shared between the user
+ * and the driver and can be resized. The requested element size is
+ * rounded up to a power of 2 and the number of elements in the buffer
+ * is also rounded up to a power of 2. Since the queue is empty when
+ * the producer and consumer indices match, the maximum capacity of the
+ * queue is one less than the number of element slots.
  *
  * Notes:
- *   - Kernel space indices are always masked off to q->index_mask
- *   before storing so do not need to be checked on reads.
- *   - User space indices may be out of range and must be
- *   masked before use when read.
- *   - The kernel indices for shared queues must not be written
- *   by user space so a local copy is used and a shared copy is
- *   stored when the local copy changes.
+ *   - The driver indices are always masked off to q->index_mask
+ *     before storing so do not need to be checked on reads.
+ *   - The user, whether user space or kernel, is generally
+ *     not trusted, so its parameters are masked to make sure
+ *     they do not access the queue out of bounds on reads.
+ *   - The driver indices for queues must not be written
+ *     by the user, so a local copy is used and a shared copy
+ *     is stored when the local copy changes.
  *   - By passing the type in the parameter list separate from q
- *   the compiler can eliminate the switch statement when the
- *   actual queue type is known when the function is called.
- *   In the performance path this is done. In less critical
- *   paths just q->type is passed.
+ *     the compiler can eliminate the switch statement when the
+ *     actual queue type is known at compile time, i.e. at the
+ *     site of the call.
+ *   - These queues are lock free. The user and driver must protect
+ *     changes to their end of the queues with locks if more than one
+ *     CPU can be accessing it at the same time.
  */
 
-/* type of queue */
+/**
+ * enum queue_type - type of queue
+ * @QUEUE_TYPE_TO_CLIENT:	Queue is written by rxe driver and
+ *				read by client. Used by rxe driver only.
+ * @QUEUE_TYPE_FROM_CLIENT:	Queue is written by client and
+ *				read by rxe driver. Used by rxe driver only.
+ * @QUEUE_TYPE_TO_DRIVER:	Queue is written by client and
+ *				read by rxe driver. Used by kernel client only.
+ * @QUEUE_TYPE_FROM_DRIVER:	Queue is written by rxe driver and
+ *				read by client. Used by kernel client only.
+ */
 enum queue_type {
-	QUEUE_TYPE_KERNEL,
-	QUEUE_TYPE_TO_USER,
-	QUEUE_TYPE_FROM_USER,
+	QUEUE_TYPE_TO_CLIENT,
+	QUEUE_TYPE_FROM_CLIENT,
+	QUEUE_TYPE_TO_DRIVER,
+	QUEUE_TYPE_FROM_DRIVER,
 };
 
 struct rxe_queue {
@@ -69,238 +82,171 @@ struct rxe_queue *rxe_queue_init(struct rxe_dev *rxe, int *num_elem,
 int rxe_queue_resize(struct rxe_queue *q, unsigned int *num_elem_p,
 		     unsigned int elem_size, struct ib_udata *udata,
 		     struct mminfo __user *outbuf,
-		     /* Protect producers while resizing queue */
-		     spinlock_t *producer_lock,
-		     /* Protect consumers while resizing queue */
-		     spinlock_t *consumer_lock);
+		     spinlock_t *producer_lock, spinlock_t *consumer_lock);
 
 void rxe_queue_cleanup(struct rxe_queue *queue);
 
-static inline int next_index(struct rxe_queue *q, int index)
+static inline u32 queue_next_index(struct rxe_queue *q, int index)
 {
-	return (index + 1) & q->buf->index_mask;
+	return (index + 1) & q->index_mask;
 }
 
-static inline int queue_empty(struct rxe_queue *q, enum queue_type type)
+static inline u32 queue_get_producer(const struct rxe_queue *q,
+				     enum queue_type type)
 {
 	u32 prod;
-	u32 cons;
 
 	switch (type) {
-	case QUEUE_TYPE_FROM_USER:
-		/* protect user space index */
+	case QUEUE_TYPE_FROM_CLIENT:
+		/* protect user index */
 		prod = smp_load_acquire(&q->buf->producer_index);
-		cons = q->index;
 		break;
-	case QUEUE_TYPE_TO_USER:
+	case QUEUE_TYPE_TO_CLIENT:
 		prod = q->index;
-		/* protect user space index */
-		cons = smp_load_acquire(&q->buf->consumer_index);
 		break;
-	case QUEUE_TYPE_KERNEL:
+	case QUEUE_TYPE_FROM_DRIVER:
+		/* protect driver index */
+		prod = smp_load_acquire(&q->buf->producer_index);
+		break;
+	case QUEUE_TYPE_TO_DRIVER:
 		prod = q->buf->producer_index;
-		cons = q->buf->consumer_index;
 		break;
 	}
 
-	return ((prod - cons) & q->index_mask) == 0;
+	return prod;
 }
 
-static inline int queue_full(struct rxe_queue *q, enum queue_type type)
+static inline u32 queue_get_consumer(const struct rxe_queue *q,
+				     enum queue_type type)
 {
-	u32 prod;
 	u32 cons;
 
 	switch (type) {
-	case QUEUE_TYPE_FROM_USER:
-		/* protect user space index */
-		prod = smp_load_acquire(&q->buf->producer_index);
+	case QUEUE_TYPE_FROM_CLIENT:
 		cons = q->index;
 		break;
-	case QUEUE_TYPE_TO_USER:
-		prod = q->index;
-		/* protect user space index */
+	case QUEUE_TYPE_TO_CLIENT:
+		/* protect user index */
 		cons = smp_load_acquire(&q->buf->consumer_index);
 		break;
-	case QUEUE_TYPE_KERNEL:
-		prod = q->buf->producer_index;
+	case QUEUE_TYPE_FROM_DRIVER:
 		cons = q->buf->consumer_index;
 		break;
+	case QUEUE_TYPE_TO_DRIVER:
+		/* protect driver index */
+		cons = smp_load_acquire(&q->buf->consumer_index);
+		break;
 	}
 
-	return ((prod + 1 - cons) & q->index_mask) == 0;
+	return cons;
 }
 
-static inline unsigned int queue_count(const struct rxe_queue *q,
-					enum queue_type type)
+static inline int queue_empty(struct rxe_queue *q, enum queue_type type)
 {
-	u32 prod;
-	u32 cons;
-
-	switch (type) {
-	case QUEUE_TYPE_FROM_USER:
-		/* protect user space index */
-		prod = smp_load_acquire(&q->buf->producer_index);
-		cons = q->index;
-		break;
-	case QUEUE_TYPE_TO_USER:
-		prod = q->index;
-		/* protect user space index */
-		cons = smp_load_acquire(&q->buf->consumer_index);
-		break;
-	case QUEUE_TYPE_KERNEL:
-		prod = q->buf->producer_index;
-		cons = q->buf->consumer_index;
-		break;
-	}
+	u32 prod = queue_get_producer(q, type);
+	u32 cons = queue_get_consumer(q, type);
 
-	return (prod - cons) & q->index_mask;
+	return ((prod - cons) & q->index_mask) == 0;
 }
 
-static inline void advance_producer(struct rxe_queue *q, enum queue_type type)
+static inline int queue_full(struct rxe_queue *q, enum queue_type type)
 {
-	u32 prod;
+	u32 prod = queue_get_producer(q, type);
+	u32 cons = queue_get_consumer(q, type);
 
-	switch (type) {
-	case QUEUE_TYPE_FROM_USER:
-		pr_warn_once("Normally kernel should not write user space index\n");
-		/* protect user space index */
-		prod = smp_load_acquire(&q->buf->producer_index);
-		prod = (prod + 1) & q->index_mask;
-		/* same */
-		smp_store_release(&q->buf->producer_index, prod);
-		break;
-	case QUEUE_TYPE_TO_USER:
-		prod = q->index;
-		q->index = (prod + 1) & q->index_mask;
-		q->buf->producer_index = q->index;
-		break;
-	case QUEUE_TYPE_KERNEL:
-		prod = q->buf->producer_index;
-		q->buf->producer_index = (prod + 1) & q->index_mask;
-		break;
-	}
+	return ((prod + 1 - cons) & q->index_mask) == 0;
 }
 
-static inline void advance_consumer(struct rxe_queue *q, enum queue_type type)
+static inline u32 queue_count(const struct rxe_queue *q,
+					enum queue_type type)
 {
-	u32 cons;
+	u32 prod = queue_get_producer(q, type);
+	u32 cons = queue_get_consumer(q, type);
 
-	switch (type) {
-	case QUEUE_TYPE_FROM_USER:
-		cons = q->index;
-		q->index = (cons + 1) & q->index_mask;
-		q->buf->consumer_index = q->index;
-		break;
-	case QUEUE_TYPE_TO_USER:
-		pr_warn_once("Normally kernel should not write user space index\n");
-		/* protect user space index */
-		cons = smp_load_acquire(&q->buf->consumer_index);
-		cons = (cons + 1) & q->index_mask;
-		/* same */
-		smp_store_release(&q->buf->consumer_index, cons);
-		break;
-	case QUEUE_TYPE_KERNEL:
-		cons = q->buf->consumer_index;
-		q->buf->consumer_index = (cons + 1) & q->index_mask;
-		break;
-	}
+	return (prod - cons) & q->index_mask;
 }
 
-static inline void *producer_addr(struct rxe_queue *q, enum queue_type type)
+static inline void queue_advance_producer(struct rxe_queue *q,
+					  enum queue_type type)
 {
 	u32 prod;
 
 	switch (type) {
-	case QUEUE_TYPE_FROM_USER:
-		/* protect user space index */
-		prod = smp_load_acquire(&q->buf->producer_index);
-		prod &= q->index_mask;
+	case QUEUE_TYPE_FROM_CLIENT:
+		pr_warn("%s: attempt to advance client index\n",
+			__func__);
 		break;
-	case QUEUE_TYPE_TO_USER:
+	case QUEUE_TYPE_TO_CLIENT:
 		prod = q->index;
+		prod = (prod + 1) & q->index_mask;
+		q->index = prod;
+		/* protect user index */
+		smp_store_release(&q->buf->producer_index, prod);
+		break;
+	case QUEUE_TYPE_FROM_DRIVER:
+		pr_warn("%s: attempt to advance driver index\n",
+			__func__);
 		break;
-	case QUEUE_TYPE_KERNEL:
+	case QUEUE_TYPE_TO_DRIVER:
 		prod = q->buf->producer_index;
+		prod = (prod + 1) & q->index_mask;
+		q->buf->producer_index = prod;
 		break;
 	}
-
-	return q->buf->data + (prod << q->log2_elem_size);
 }
 
-static inline void *consumer_addr(struct rxe_queue *q, enum queue_type type)
+static inline void queue_advance_consumer(struct rxe_queue *q,
+					  enum queue_type type)
 {
 	u32 cons;
 
 	switch (type) {
-	case QUEUE_TYPE_FROM_USER:
+	case QUEUE_TYPE_FROM_CLIENT:
 		cons = q->index;
+		cons = (cons + 1) & q->index_mask;
+		q->index = cons;
+		/* protect user index */
+		smp_store_release(&q->buf->consumer_index, cons);
 		break;
-	case QUEUE_TYPE_TO_USER:
-		/* protect user space index */
-		cons = smp_load_acquire(&q->buf->consumer_index);
-		cons &= q->index_mask;
+	case QUEUE_TYPE_TO_CLIENT:
+		pr_warn("%s: attempt to advance client index\n",
+			__func__);
 		break;
-	case QUEUE_TYPE_KERNEL:
+	case QUEUE_TYPE_FROM_DRIVER:
 		cons = q->buf->consumer_index;
+		cons = (cons + 1) & q->index_mask;
+		q->buf->consumer_index = cons;
+		break;
+	case QUEUE_TYPE_TO_DRIVER:
+		pr_warn("%s: attempt to advance driver index\n",
+			__func__);
 		break;
 	}
-
-	return q->buf->data + (cons << q->log2_elem_size);
 }
 
-static inline unsigned int producer_index(struct rxe_queue *q,
-						enum queue_type type)
+static inline void *queue_producer_addr(struct rxe_queue *q,
+					enum queue_type type)
 {
-	u32 prod;
+	u32 prod = queue_get_producer(q, type);
 
-	switch (type) {
-	case QUEUE_TYPE_FROM_USER:
-		/* protect user space index */
-		prod = smp_load_acquire(&q->buf->producer_index);
-		prod &= q->index_mask;
-		break;
-	case QUEUE_TYPE_TO_USER:
-		prod = q->index;
-		break;
-	case QUEUE_TYPE_KERNEL:
-		prod = q->buf->producer_index;
-		break;
-	}
-
-	return prod;
+	return q->buf->data + (prod << q->log2_elem_size);
 }
 
-static inline unsigned int consumer_index(struct rxe_queue *q,
-						enum queue_type type)
+static inline void *queue_consumer_addr(struct rxe_queue *q,
+					enum queue_type type)
 {
-	u32 cons;
-
-	switch (type) {
-	case QUEUE_TYPE_FROM_USER:
-		cons = q->index;
-		break;
-	case QUEUE_TYPE_TO_USER:
-		/* protect user space index */
-		cons = smp_load_acquire(&q->buf->consumer_index);
-		cons &= q->index_mask;
-		break;
-	case QUEUE_TYPE_KERNEL:
-		cons = q->buf->consumer_index;
-		break;
-	}
+	u32 cons = queue_get_consumer(q, type);
 
-	return cons;
+	return q->buf->data + (cons << q->log2_elem_size);
 }
 
-static inline void *addr_from_index(struct rxe_queue *q,
-				unsigned int index)
+static inline void *queue_addr_from_index(struct rxe_queue *q, u32 index)
 {
 	return q->buf->data + ((index & q->index_mask)
-				<< q->buf->log2_elem_size);
+				<< q->log2_elem_size);
 }
 
-static inline unsigned int index_from_addr(const struct rxe_queue *q,
+static inline u32 queue_index_from_addr(const struct rxe_queue *q,
 				const void *addr)
 {
 	return (((u8 *)addr - q->buf->data) >> q->log2_elem_size)
@@ -309,7 +255,7 @@ static inline unsigned int index_from_addr(const struct rxe_queue *q,
 
 static inline void *queue_head(struct rxe_queue *q, enum queue_type type)
 {
-	return queue_empty(q, type) ? NULL : consumer_addr(q, type);
+	return queue_empty(q, type) ? NULL : queue_consumer_addr(q, type);
 }
 
 #endif /* RXE_QUEUE_H */
diff --git a/drivers/infiniband/sw/rxe/rxe_req.c b/drivers/infiniband/sw/rxe/rxe_req.c
index 3894197a82f6..801e36cefc29 100644
--- a/drivers/infiniband/sw/rxe/rxe_req.c
+++ b/drivers/infiniband/sw/rxe/rxe_req.c
@@ -49,21 +49,16 @@ static void req_retry(struct rxe_qp *qp)
 	unsigned int cons;
 	unsigned int prod;
 
-	if (qp->is_user) {
-		cons = consumer_index(q, QUEUE_TYPE_FROM_USER);
-		prod = producer_index(q, QUEUE_TYPE_FROM_USER);
-	} else {
-		cons = consumer_index(q, QUEUE_TYPE_KERNEL);
-		prod = producer_index(q, QUEUE_TYPE_KERNEL);
-	}
+	cons = queue_get_consumer(q, QUEUE_TYPE_FROM_CLIENT);
+	prod = queue_get_producer(q, QUEUE_TYPE_FROM_CLIENT);
 
 	qp->req.wqe_index	= cons;
 	qp->req.psn		= qp->comp.psn;
 	qp->req.opcode		= -1;
 
 	for (wqe_index = cons; wqe_index != prod;
-			wqe_index = next_index(q, wqe_index)) {
-		wqe = addr_from_index(qp->sq.queue, wqe_index);
+			wqe_index = queue_next_index(q, wqe_index)) {
+		wqe = queue_addr_from_index(qp->sq.queue, wqe_index);
 		mask = wr_opcode_mask(wqe->wr.opcode, qp);
 
 		if (wqe->state == wqe_state_posted)
@@ -121,15 +116,9 @@ static struct rxe_send_wqe *req_next_wqe(struct rxe_qp *qp)
 	unsigned int cons;
 	unsigned int prod;
 
-	if (qp->is_user) {
-		wqe = queue_head(q, QUEUE_TYPE_FROM_USER);
-		cons = consumer_index(q, QUEUE_TYPE_FROM_USER);
-		prod = producer_index(q, QUEUE_TYPE_FROM_USER);
-	} else {
-		wqe = queue_head(q, QUEUE_TYPE_KERNEL);
-		cons = consumer_index(q, QUEUE_TYPE_KERNEL);
-		prod = producer_index(q, QUEUE_TYPE_KERNEL);
-	}
+	wqe = queue_head(q, QUEUE_TYPE_FROM_CLIENT);
+	cons = queue_get_consumer(q, QUEUE_TYPE_FROM_CLIENT);
+	prod = queue_get_producer(q, QUEUE_TYPE_FROM_CLIENT);
 
 	if (unlikely(qp->req.state == QP_STATE_DRAIN)) {
 		/* check to see if we are drained;
@@ -170,7 +159,7 @@ static struct rxe_send_wqe *req_next_wqe(struct rxe_qp *qp)
 	if (index == prod)
 		return NULL;
 
-	wqe = addr_from_index(q, index);
+	wqe = queue_addr_from_index(q, index);
 
 	if (unlikely((qp->req.state == QP_STATE_DRAIN ||
 		      qp->req.state == QP_STATE_DRAINED) &&
@@ -560,7 +549,8 @@ static void update_state(struct rxe_qp *qp, struct rxe_send_wqe *wqe,
 	qp->req.opcode = pkt->opcode;
 
 	if (pkt->mask & RXE_END_MASK)
-		qp->req.wqe_index = next_index(qp->sq.queue, qp->req.wqe_index);
+		qp->req.wqe_index = queue_next_index(qp->sq.queue,
+						     qp->req.wqe_index);
 
 	qp->need_req_skb = 0;
 
@@ -614,7 +604,7 @@ static int rxe_do_local_ops(struct rxe_qp *qp, struct rxe_send_wqe *wqe)
 
 	wqe->state = wqe_state_done;
 	wqe->status = IB_WC_SUCCESS;
-	qp->req.wqe_index = next_index(qp->sq.queue, qp->req.wqe_index);
+	qp->req.wqe_index = queue_next_index(qp->sq.queue, qp->req.wqe_index);
 
 	if ((wqe->wr.send_flags & IB_SEND_SIGNALED) ||
 	    qp->sq_sig_type == IB_SIGNAL_ALL_WR)
@@ -645,7 +635,8 @@ int rxe_requester(void *arg)
 		goto exit;
 
 	if (unlikely(qp->req.state == QP_STATE_RESET)) {
-		qp->req.wqe_index = consumer_index(q, q->type);
+		qp->req.wqe_index = queue_get_consumer(q,
+						QUEUE_TYPE_FROM_CLIENT);
 		qp->req.opcode = -1;
 		qp->req.need_rd_atomic = 0;
 		qp->req.wait_psn = 0;
@@ -711,7 +702,7 @@ int rxe_requester(void *arg)
 			wqe->last_psn = qp->req.psn;
 			qp->req.psn = (qp->req.psn + 1) & BTH_PSN_MASK;
 			qp->req.opcode = IB_OPCODE_UD_SEND_ONLY;
-			qp->req.wqe_index = next_index(qp->sq.queue,
+			qp->req.wqe_index = queue_next_index(qp->sq.queue,
 						       qp->req.wqe_index);
 			wqe->state = wqe_state_done;
 			wqe->status = IB_WC_SUCCESS;
diff --git a/drivers/infiniband/sw/rxe/rxe_resp.c b/drivers/infiniband/sw/rxe/rxe_resp.c
index 5501227ddc65..c5a0b1f5747a 100644
--- a/drivers/infiniband/sw/rxe/rxe_resp.c
+++ b/drivers/infiniband/sw/rxe/rxe_resp.c
@@ -303,10 +303,7 @@ static enum resp_states get_srq_wqe(struct rxe_qp *qp)
 
 	spin_lock_bh(&srq->rq.consumer_lock);
 
-	if (qp->is_user)
-		wqe = queue_head(q, QUEUE_TYPE_FROM_USER);
-	else
-		wqe = queue_head(q, QUEUE_TYPE_KERNEL);
+	wqe = queue_head(q, QUEUE_TYPE_FROM_CLIENT);
 	if (!wqe) {
 		spin_unlock_bh(&srq->rq.consumer_lock);
 		return RESPST_ERR_RNR;
@@ -322,13 +319,8 @@ static enum resp_states get_srq_wqe(struct rxe_qp *qp)
 	memcpy(&qp->resp.srq_wqe, wqe, size);
 
 	qp->resp.wqe = &qp->resp.srq_wqe.wqe;
-	if (qp->is_user) {
-		advance_consumer(q, QUEUE_TYPE_FROM_USER);
-		count = queue_count(q, QUEUE_TYPE_FROM_USER);
-	} else {
-		advance_consumer(q, QUEUE_TYPE_KERNEL);
-		count = queue_count(q, QUEUE_TYPE_KERNEL);
-	}
+	queue_advance_consumer(q, QUEUE_TYPE_FROM_CLIENT);
+	count = queue_count(q, QUEUE_TYPE_FROM_CLIENT);
 
 	if (srq->limit && srq->ibsrq.event_handler && (count < srq->limit)) {
 		srq->limit = 0;
@@ -357,12 +349,8 @@ static enum resp_states check_resource(struct rxe_qp *qp,
 			qp->resp.status = IB_WC_WR_FLUSH_ERR;
 			return RESPST_COMPLETE;
 		} else if (!srq) {
-			if (qp->is_user)
-				qp->resp.wqe = queue_head(qp->rq.queue,
-						QUEUE_TYPE_FROM_USER);
-			else
-				qp->resp.wqe = queue_head(qp->rq.queue,
-						QUEUE_TYPE_KERNEL);
+			qp->resp.wqe = queue_head(qp->rq.queue,
+					QUEUE_TYPE_FROM_CLIENT);
 			if (qp->resp.wqe) {
 				qp->resp.status = IB_WC_WR_FLUSH_ERR;
 				return RESPST_COMPLETE;
@@ -389,12 +377,8 @@ static enum resp_states check_resource(struct rxe_qp *qp,
 		if (srq)
 			return get_srq_wqe(qp);
 
-		if (qp->is_user)
-			qp->resp.wqe = queue_head(qp->rq.queue,
-					QUEUE_TYPE_FROM_USER);
-		else
-			qp->resp.wqe = queue_head(qp->rq.queue,
-					QUEUE_TYPE_KERNEL);
+		qp->resp.wqe = queue_head(qp->rq.queue,
+				QUEUE_TYPE_FROM_CLIENT);
 		return (qp->resp.wqe) ? RESPST_CHK_LENGTH : RESPST_ERR_RNR;
 	}
 
@@ -936,12 +920,8 @@ static enum resp_states do_complete(struct rxe_qp *qp,
 	}
 
 	/* have copy for srq and reference for !srq */
-	if (!qp->srq) {
-		if (qp->is_user)
-			advance_consumer(qp->rq.queue, QUEUE_TYPE_FROM_USER);
-		else
-			advance_consumer(qp->rq.queue, QUEUE_TYPE_KERNEL);
-	}
+	if (!qp->srq)
+		queue_advance_consumer(qp->rq.queue, QUEUE_TYPE_FROM_CLIENT);
 
 	qp->resp.wqe = NULL;
 
@@ -1213,7 +1193,7 @@ static void rxe_drain_req_pkts(struct rxe_qp *qp, bool notify)
 		return;
 
 	while (!qp->srq && q && queue_head(q, q->type))
-		advance_consumer(q, q->type);
+		queue_advance_consumer(q, q->type);
 }
 
 int rxe_responder(void *arg)
diff --git a/drivers/infiniband/sw/rxe/rxe_srq.c b/drivers/infiniband/sw/rxe/rxe_srq.c
index 610c98d24b5c..a9e7817e2732 100644
--- a/drivers/infiniband/sw/rxe/rxe_srq.c
+++ b/drivers/infiniband/sw/rxe/rxe_srq.c
@@ -93,7 +93,7 @@ int rxe_srq_from_init(struct rxe_dev *rxe, struct rxe_srq *srq,
 	spin_lock_init(&srq->rq.producer_lock);
 	spin_lock_init(&srq->rq.consumer_lock);
 
-	type = uresp ? QUEUE_TYPE_FROM_USER : QUEUE_TYPE_KERNEL;
+	type = QUEUE_TYPE_FROM_CLIENT;
 	q = rxe_queue_init(rxe, &srq->rq.max_wr,
 			srq_wqe_size, type);
 	if (!q) {
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.c b/drivers/infiniband/sw/rxe/rxe_verbs.c
index 267b5a9c345d..46eaa1c5d4b3 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.c
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.c
@@ -218,11 +218,7 @@ static int post_one_recv(struct rxe_rq *rq, const struct ib_recv_wr *ibwr)
 	int num_sge = ibwr->num_sge;
 	int full;
 
-	if (rq->is_user)
-		full = queue_full(rq->queue, QUEUE_TYPE_FROM_USER);
-	else
-		full = queue_full(rq->queue, QUEUE_TYPE_KERNEL);
-
+	full = queue_full(rq->queue, QUEUE_TYPE_TO_DRIVER);
 	if (unlikely(full)) {
 		err = -ENOMEM;
 		goto err1;
@@ -237,11 +233,7 @@ static int post_one_recv(struct rxe_rq *rq, const struct ib_recv_wr *ibwr)
 	for (i = 0; i < num_sge; i++)
 		length += ibwr->sg_list[i].length;
 
-	if (rq->is_user)
-		recv_wqe = producer_addr(rq->queue, QUEUE_TYPE_FROM_USER);
-	else
-		recv_wqe = producer_addr(rq->queue, QUEUE_TYPE_KERNEL);
-
+	recv_wqe = queue_producer_addr(rq->queue, QUEUE_TYPE_TO_DRIVER);
 	recv_wqe->wr_id = ibwr->wr_id;
 	recv_wqe->num_sge = num_sge;
 
@@ -254,10 +246,7 @@ static int post_one_recv(struct rxe_rq *rq, const struct ib_recv_wr *ibwr)
 	recv_wqe->dma.cur_sge		= 0;
 	recv_wqe->dma.sge_offset	= 0;
 
-	if (rq->is_user)
-		advance_producer(rq->queue, QUEUE_TYPE_FROM_USER);
-	else
-		advance_producer(rq->queue, QUEUE_TYPE_KERNEL);
+	queue_advance_producer(rq->queue, QUEUE_TYPE_TO_DRIVER);
 
 	return 0;
 
@@ -633,27 +622,17 @@ static int post_one_send(struct rxe_qp *qp, const struct ib_send_wr *ibwr,
 
 	spin_lock_irqsave(&qp->sq.sq_lock, flags);
 
-	if (qp->is_user)
-		full = queue_full(sq->queue, QUEUE_TYPE_FROM_USER);
-	else
-		full = queue_full(sq->queue, QUEUE_TYPE_KERNEL);
+	full = queue_full(sq->queue, QUEUE_TYPE_TO_DRIVER);
 
 	if (unlikely(full)) {
 		spin_unlock_irqrestore(&qp->sq.sq_lock, flags);
 		return -ENOMEM;
 	}
 
-	if (qp->is_user)
-		send_wqe = producer_addr(sq->queue, QUEUE_TYPE_FROM_USER);
-	else
-		send_wqe = producer_addr(sq->queue, QUEUE_TYPE_KERNEL);
-
+	send_wqe = queue_producer_addr(sq->queue, QUEUE_TYPE_TO_DRIVER);
 	init_send_wqe(qp, ibwr, mask, length, send_wqe);
 
-	if (qp->is_user)
-		advance_producer(sq->queue, QUEUE_TYPE_FROM_USER);
-	else
-		advance_producer(sq->queue, QUEUE_TYPE_KERNEL);
+	queue_advance_producer(sq->queue, QUEUE_TYPE_TO_DRIVER);
 
 	spin_unlock_irqrestore(&qp->sq.sq_lock, flags);
 
@@ -845,18 +824,12 @@ static int rxe_poll_cq(struct ib_cq *ibcq, int num_entries, struct ib_wc *wc)
 
 	spin_lock_irqsave(&cq->cq_lock, flags);
 	for (i = 0; i < num_entries; i++) {
-		if (cq->is_user)
-			cqe = queue_head(cq->queue, QUEUE_TYPE_TO_USER);
-		else
-			cqe = queue_head(cq->queue, QUEUE_TYPE_KERNEL);
+		cqe = queue_head(cq->queue, QUEUE_TYPE_FROM_DRIVER);
 		if (!cqe)
 			break;
 
 		memcpy(wc++, &cqe->ibwc, sizeof(*wc));
-		if (cq->is_user)
-			advance_consumer(cq->queue, QUEUE_TYPE_TO_USER);
-		else
-			advance_consumer(cq->queue, QUEUE_TYPE_KERNEL);
+		queue_advance_consumer(cq->queue, QUEUE_TYPE_FROM_DRIVER);
 	}
 	spin_unlock_irqrestore(&cq->cq_lock, flags);
 
@@ -868,10 +841,7 @@ static int rxe_peek_cq(struct ib_cq *ibcq, int wc_cnt)
 	struct rxe_cq *cq = to_rcq(ibcq);
 	int count;
 
-	if (cq->is_user)
-		count = queue_count(cq->queue, QUEUE_TYPE_TO_USER);
-	else
-		count = queue_count(cq->queue, QUEUE_TYPE_KERNEL);
+	count = queue_count(cq->queue, QUEUE_TYPE_FROM_DRIVER);
 
 	return (count > wc_cnt) ? wc_cnt : count;
 }
@@ -887,10 +857,7 @@ static int rxe_req_notify_cq(struct ib_cq *ibcq, enum ib_cq_notify_flags flags)
 	if (cq->notify != IB_CQ_NEXT_COMP)
 		cq->notify = flags & IB_CQ_SOLICITED_MASK;
 
-	if (cq->is_user)
-		empty = queue_empty(cq->queue, QUEUE_TYPE_TO_USER);
-	else
-		empty = queue_empty(cq->queue, QUEUE_TYPE_KERNEL);
+	empty = queue_empty(cq->queue, QUEUE_TYPE_FROM_DRIVER);
 
 	if ((flags & IB_CQ_REPORT_MISSED_EVENTS) && !empty)
 		ret = 1;
-- 
2.30.2



* [PATCH for-rc v4 2/5] RDMA/rxe: Cleanup MR status and type enums
@ 2021-09-14 16:42 Bob Pearson
From: Bob Pearson @ 2021-09-14 16:42 UTC
  To: jgg, zyjzyj2000, linux-rdma, bvanassche, mie, rao.shoaib; +Cc: Bob Pearson

Eliminate RXE_MR_STATE_ZOMBIE, which is not compatible with the IBA;
RXE_MR_STATE_INVALID is better.

Replace RXE_MR_TYPE_XXX with IB_MR_TYPE_XXX, which covers all the
needed types.

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_mr.c    | 20 ++++++++++++--------
 drivers/infiniband/sw/rxe/rxe_verbs.h |  9 +--------
 2 files changed, 13 insertions(+), 16 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index 5890a8246216..0cc24154762c 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -24,17 +24,22 @@ u8 rxe_get_next_key(u32 last_key)
 
 int mr_check_range(struct rxe_mr *mr, u64 iova, size_t length)
 {
+
+
 	switch (mr->type) {
-	case RXE_MR_TYPE_DMA:
+	case IB_MR_TYPE_DMA:
 		return 0;
 
-	case RXE_MR_TYPE_MR:
+	case IB_MR_TYPE_USER:
+	case IB_MR_TYPE_MEM_REG:
 		if (iova < mr->iova || length > mr->length ||
 		    iova > mr->iova + mr->length - length)
 			return -EFAULT;
 		return 0;
 
 	default:
+		pr_warn("%s: mr type (%d) not supported\n",
+			__func__, mr->type);
 		return -EFAULT;
 	}
 }
@@ -51,7 +56,6 @@ static void rxe_mr_init(int access, struct rxe_mr *mr)
 	mr->ibmr.lkey = lkey;
 	mr->ibmr.rkey = rkey;
 	mr->state = RXE_MR_STATE_INVALID;
-	mr->type = RXE_MR_TYPE_NONE;
 	mr->map_shift = ilog2(RXE_BUF_PER_MAP);
 }
 
@@ -100,7 +104,7 @@ void rxe_mr_init_dma(struct rxe_pd *pd, int access, struct rxe_mr *mr)
 	mr->ibmr.pd = &pd->ibpd;
 	mr->access = access;
 	mr->state = RXE_MR_STATE_VALID;
-	mr->type = RXE_MR_TYPE_DMA;
+	mr->type = IB_MR_TYPE_DMA;
 }
 
 int rxe_mr_init_user(struct rxe_pd *pd, u64 start, u64 length, u64 iova,
@@ -173,7 +177,7 @@ int rxe_mr_init_user(struct rxe_pd *pd, u64 start, u64 length, u64 iova,
 	mr->va = start;
 	mr->offset = ib_umem_offset(umem);
 	mr->state = RXE_MR_STATE_VALID;
-	mr->type = RXE_MR_TYPE_MR;
+	mr->type = IB_MR_TYPE_USER;
 
 	return 0;
 
@@ -203,7 +207,7 @@ int rxe_mr_init_fast(struct rxe_pd *pd, int max_pages, struct rxe_mr *mr)
 	mr->ibmr.pd = &pd->ibpd;
 	mr->max_buf = max_pages;
 	mr->state = RXE_MR_STATE_FREE;
-	mr->type = RXE_MR_TYPE_MR;
+	mr->type = IB_MR_TYPE_MEM_REG;
 
 	return 0;
 
@@ -302,7 +306,7 @@ int rxe_mr_copy(struct rxe_mr *mr, u64 iova, void *addr, int length,
 	if (length == 0)
 		return 0;
 
-	if (mr->type == RXE_MR_TYPE_DMA) {
+	if (mr->type == IB_MR_TYPE_DMA) {
 		u8 *src, *dest;
 
 		src = (dir == RXE_TO_MR_OBJ) ? addr : ((void *)(uintptr_t)iova);
@@ -564,7 +568,7 @@ int rxe_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata)
 		return -EINVAL;
 	}
 
-	mr->state = RXE_MR_STATE_ZOMBIE;
+	mr->state = RXE_MR_STATE_INVALID;
 	rxe_drop_ref(mr_pd(mr));
 	rxe_drop_index(mr);
 	rxe_drop_ref(mr);
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
index ac2a2148027f..c6aca2293294 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.h
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
@@ -267,18 +267,11 @@ struct rxe_qp {
 };
 
 enum rxe_mr_state {
-	RXE_MR_STATE_ZOMBIE,
 	RXE_MR_STATE_INVALID,
 	RXE_MR_STATE_FREE,
 	RXE_MR_STATE_VALID,
 };
 
-enum rxe_mr_type {
-	RXE_MR_TYPE_NONE,
-	RXE_MR_TYPE_DMA,
-	RXE_MR_TYPE_MR,
-};
-
 enum rxe_mr_copy_dir {
 	RXE_TO_MR_OBJ,
 	RXE_FROM_MR_OBJ,
@@ -314,7 +307,7 @@ struct rxe_mr {
 	struct ib_umem		*umem;
 
 	enum rxe_mr_state	state;
-	enum rxe_mr_type	type;
+	enum ib_mr_type		type;
 	u64			va;
 	u64			iova;
 	size_t			length;
-- 
2.30.2



* [PATCH for-rc v4 3/5] RDMA/rxe: Separate HW and SW l/rkeys
@ 2021-09-14 16:42 Bob Pearson
From: Bob Pearson @ 2021-09-14 16:42 UTC
  To: jgg, zyjzyj2000, linux-rdma, bvanassche, mie, rao.shoaib; +Cc: Bob Pearson

Separate the software and simulated hardware lkeys and rkeys for MRs
and MWs. This isolates struct ib_mr and struct ib_mw from the
hardware changes triggered by executing work requests.

This change fixes a bug seen in blktests.
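
For context, ib_update_fast_reg_key() changes the ib_mr keys
immediately, before any WQE executes (shown roughly as defined in
include/rdma/ib_verbs.h):

	static inline void ib_update_fast_reg_key(struct ib_mr *mr, u8 newkey)
	{
		mr->lkey = (mr->lkey & 0xffffff00) | newkey;
		mr->rkey = (mr->rkey & 0xffffff00) | newkey;
	}

With separate copies, rxe continues to use its private mr->lkey and
mr->rkey (the simulated hardware state) until the REG_MR WQE
executes, so in-flight operations that carry the old rkey still
resolve correctly.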

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_loc.h   |  1 +
 drivers/infiniband/sw/rxe/rxe_mr.c    | 69 ++++++++++++++++++++++-----
 drivers/infiniband/sw/rxe/rxe_mw.c    | 30 ++++++------
 drivers/infiniband/sw/rxe/rxe_req.c   | 14 ++----
 drivers/infiniband/sw/rxe/rxe_verbs.h | 18 ++-----
 5 files changed, 81 insertions(+), 51 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
index f0c954575bde..4fd73b51fabf 100644
--- a/drivers/infiniband/sw/rxe/rxe_loc.h
+++ b/drivers/infiniband/sw/rxe/rxe_loc.h
@@ -86,6 +86,7 @@ struct rxe_mr *lookup_mr(struct rxe_pd *pd, int access, u32 key,
 int mr_check_range(struct rxe_mr *mr, u64 iova, size_t length);
 int advance_dma_data(struct rxe_dma_info *dma, unsigned int length);
 int rxe_invalidate_mr(struct rxe_qp *qp, u32 rkey);
+int rxe_reg_fast_mr(struct rxe_qp *qp, struct rxe_send_wqe *wqe);
 int rxe_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata);
 void rxe_mr_cleanup(struct rxe_pool_entry *arg);
 
diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index 0cc24154762c..370212801abc 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -53,8 +53,14 @@ static void rxe_mr_init(int access, struct rxe_mr *mr)
 	u32 lkey = mr->pelem.index << 8 | rxe_get_next_key(-1);
 	u32 rkey = (access & IB_ACCESS_REMOTE) ? lkey : 0;
 
-	mr->ibmr.lkey = lkey;
-	mr->ibmr.rkey = rkey;
+	/* Set ibmr->l/rkey and also copy into the private l/rkey.
+	 * For user MRs these will always be the same; for cases
+	 * where the caller 'owns' the key portion they may differ
+	 * until a REG_MR WQE is executed.
+	 */
+	mr->lkey = mr->ibmr.lkey = lkey;
+	mr->rkey = mr->ibmr.rkey = rkey;
+
 	mr->state = RXE_MR_STATE_INVALID;
 	mr->map_shift = ilog2(RXE_BUF_PER_MAP);
 }
@@ -195,10 +201,8 @@ int rxe_mr_init_fast(struct rxe_pd *pd, int max_pages, struct rxe_mr *mr)
 {
 	int err;
 
-	rxe_mr_init(0, mr);
-
-	/* In fastreg, we also set the rkey */
-	mr->ibmr.rkey = mr->ibmr.lkey;
+	/* always allow remote access for FMRs */
+	rxe_mr_init(IB_ACCESS_REMOTE, mr);
 
 	err = rxe_mr_alloc(mr, max_pages);
 	if (err)
@@ -511,8 +515,8 @@ struct rxe_mr *lookup_mr(struct rxe_pd *pd, int access, u32 key,
 	if (!mr)
 		return NULL;
 
-	if (unlikely((type == RXE_LOOKUP_LOCAL && mr_lkey(mr) != key) ||
-		     (type == RXE_LOOKUP_REMOTE && mr_rkey(mr) != key) ||
+	if (unlikely((type == RXE_LOOKUP_LOCAL && mr->lkey != key) ||
+		     (type == RXE_LOOKUP_REMOTE && mr->rkey != key) ||
 		     mr_pd(mr) != pd || (access && !(access & mr->access)) ||
 		     mr->state != RXE_MR_STATE_VALID)) {
 		rxe_drop_ref(mr);
@@ -535,9 +539,9 @@ int rxe_invalidate_mr(struct rxe_qp *qp, u32 rkey)
 		goto err;
 	}
 
-	if (rkey != mr->ibmr.rkey) {
-		pr_err("%s: rkey (%#x) doesn't match mr->ibmr.rkey (%#x)\n",
-			__func__, rkey, mr->ibmr.rkey);
+	if (rkey != mr->rkey) {
+		pr_err("%s: rkey (%#x) doesn't match mr->rkey (%#x)\n",
+			__func__, rkey, mr->rkey);
 		ret = -EINVAL;
 		goto err_drop_ref;
 	}
@@ -558,6 +562,49 @@ int rxe_invalidate_mr(struct rxe_qp *qp, u32 rkey)
 	return ret;
 }
 
+/* user can (re)register fast MR by executing a REG_MR WQE.
+ * user is expected to hold a reference on the ib mr until the
+ * WQE completes.
+ * Once a fast MR is created this is the only way to change the
+ * private keys. It is the responsibility of the user to maintain
+ * the ib mr keys in sync with rxe mr keys.
+ */
+int rxe_reg_fast_mr(struct rxe_qp *qp, struct rxe_send_wqe *wqe)
+{
+	struct rxe_mr *mr = to_rmr(wqe->wr.wr.reg.mr);
+	u32 key = wqe->wr.wr.reg.key;
+	u32 access = wqe->wr.wr.reg.access;
+
+	/* user can only register MR in free state */
+	if (unlikely(mr->state != RXE_MR_STATE_FREE)) {
+		pr_warn("%s: mr->lkey = 0x%x not free\n",
+			__func__, mr->lkey);
+		return -EINVAL;
+	}
+
+	/* user can only register mr with qp in same protection domain */
+	if (unlikely(qp->ibqp.pd != mr->ibmr.pd)) {
+		pr_warn("%s: qp->pd and mr->pd don't match\n",
+			__func__);
+		return -EINVAL;
+	}
+
+	/* user is only allowed to change key portion of l/rkey */
+	if (unlikely((mr->lkey & ~0xff) != (key & ~0xff))) {
+		pr_warn("%s: key = 0x%x has wrong index mr->lkey = 0x%x\n",
+			__func__, key, mr->lkey);
+		return -EINVAL;
+	}
+
+	mr->access = access;
+	mr->lkey = key;
+	mr->rkey = (access & IB_ACCESS_REMOTE) ? key : 0;
+	mr->iova = wqe->wr.wr.reg.mr->iova;
+	mr->state = RXE_MR_STATE_VALID;
+
+	return 0;
+}
+
 int rxe_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata)
 {
 	struct rxe_mr *mr = to_rmr(ibmr);
diff --git a/drivers/infiniband/sw/rxe/rxe_mw.c b/drivers/infiniband/sw/rxe/rxe_mw.c
index 5ba77df7598e..a5e2ea7d80f0 100644
--- a/drivers/infiniband/sw/rxe/rxe_mw.c
+++ b/drivers/infiniband/sw/rxe/rxe_mw.c
@@ -21,7 +21,7 @@ int rxe_alloc_mw(struct ib_mw *ibmw, struct ib_udata *udata)
 	}
 
 	rxe_add_index(mw);
-	ibmw->rkey = (mw->pelem.index << 8) | rxe_get_next_key(-1);
+	mw->rkey = ibmw->rkey = (mw->pelem.index << 8) | rxe_get_next_key(-1);
 	mw->state = (mw->ibmw.type == IB_MW_TYPE_2) ?
 			RXE_MW_STATE_FREE : RXE_MW_STATE_VALID;
 	spin_lock_init(&mw->lock);
@@ -71,6 +71,8 @@ int rxe_dealloc_mw(struct ib_mw *ibmw)
 static int rxe_check_bind_mw(struct rxe_qp *qp, struct rxe_send_wqe *wqe,
 			 struct rxe_mw *mw, struct rxe_mr *mr)
 {
+	u32 key = wqe->wr.wr.mw.rkey & 0xff;
+
 	if (mw->ibmw.type == IB_MW_TYPE_1) {
 		if (unlikely(mw->state != RXE_MW_STATE_VALID)) {
 			pr_err_once(
@@ -108,7 +110,7 @@ static int rxe_check_bind_mw(struct rxe_qp *qp, struct rxe_send_wqe *wqe,
 		}
 	}
 
-	if (unlikely((wqe->wr.wr.mw.rkey & 0xff) == (mw->ibmw.rkey & 0xff))) {
+	if (unlikely(key == (mw->rkey & 0xff))) {
 		pr_err_once("attempt to bind MW with same key\n");
 		return -EINVAL;
 	}
@@ -161,13 +163,9 @@ static int rxe_check_bind_mw(struct rxe_qp *qp, struct rxe_send_wqe *wqe,
 static void rxe_do_bind_mw(struct rxe_qp *qp, struct rxe_send_wqe *wqe,
 		      struct rxe_mw *mw, struct rxe_mr *mr)
 {
-	u32 rkey;
-	u32 new_rkey;
-
-	rkey = mw->ibmw.rkey;
-	new_rkey = (rkey & 0xffffff00) | (wqe->wr.wr.mw.rkey & 0x000000ff);
+	u32 key = wqe->wr.wr.mw.rkey & 0xff;
 
-	mw->ibmw.rkey = new_rkey;
+	mw->rkey = (mw->rkey & ~0xff) | key;
 	mw->access = wqe->wr.wr.mw.access;
 	mw->state = RXE_MW_STATE_VALID;
 	mw->addr = wqe->wr.wr.mw.addr;
@@ -197,29 +195,29 @@ int rxe_bind_mw(struct rxe_qp *qp, struct rxe_send_wqe *wqe)
 	struct rxe_mw *mw;
 	struct rxe_mr *mr;
 	struct rxe_dev *rxe = to_rdev(qp->ibqp.device);
+	u32 mw_rkey = wqe->wr.wr.mw.mw_rkey;
+	u32 mr_lkey = wqe->wr.wr.mw.mr_lkey;
 	unsigned long flags;
 
-	mw = rxe_pool_get_index(&rxe->mw_pool,
-				wqe->wr.wr.mw.mw_rkey >> 8);
+	mw = rxe_pool_get_index(&rxe->mw_pool, mw_rkey >> 8);
 	if (unlikely(!mw)) {
 		ret = -EINVAL;
 		goto err;
 	}
 
-	if (unlikely(mw->ibmw.rkey != wqe->wr.wr.mw.mw_rkey)) {
+	if (unlikely(mw->rkey != mw_rkey)) {
 		ret = -EINVAL;
 		goto err_drop_mw;
 	}
 
 	if (likely(wqe->wr.wr.mw.length)) {
-		mr = rxe_pool_get_index(&rxe->mr_pool,
-					wqe->wr.wr.mw.mr_lkey >> 8);
+		mr = rxe_pool_get_index(&rxe->mr_pool, mr_lkey >> 8);
 		if (unlikely(!mr)) {
 			ret = -EINVAL;
 			goto err_drop_mw;
 		}
 
-		if (unlikely(mr->ibmr.lkey != wqe->wr.wr.mw.mr_lkey)) {
+		if (unlikely(mr->lkey != mr_lkey)) {
 			ret = -EINVAL;
 			goto err_drop_mr;
 		}
@@ -292,7 +290,7 @@ int rxe_invalidate_mw(struct rxe_qp *qp, u32 rkey)
 		goto err;
 	}
 
-	if (rkey != mw->ibmw.rkey) {
+	if (rkey != mw->rkey) {
 		ret = -EINVAL;
 		goto err_drop_ref;
 	}
@@ -323,7 +321,7 @@ struct rxe_mw *rxe_lookup_mw(struct rxe_qp *qp, int access, u32 rkey)
 	if (!mw)
 		return NULL;
 
-	if (unlikely((rxe_mw_rkey(mw) != rkey) || rxe_mw_pd(mw) != pd ||
+	if (unlikely((mw->rkey != rkey) || rxe_mw_pd(mw) != pd ||
 		     (mw->ibmw.type == IB_MW_TYPE_2 && mw->qp != qp) ||
 		     (mw->length == 0) ||
 		     (access && !(access & mw->access)) ||
diff --git a/drivers/infiniband/sw/rxe/rxe_req.c b/drivers/infiniband/sw/rxe/rxe_req.c
index 801e36cefc29..2981b3ef3cc0 100644
--- a/drivers/infiniband/sw/rxe/rxe_req.c
+++ b/drivers/infiniband/sw/rxe/rxe_req.c
@@ -562,7 +562,6 @@ static void update_state(struct rxe_qp *qp, struct rxe_send_wqe *wqe,
 static int rxe_do_local_ops(struct rxe_qp *qp, struct rxe_send_wqe *wqe)
 {
 	u8 opcode = wqe->wr.opcode;
-	struct rxe_mr *mr;
 	u32 rkey;
 	int ret;
 
@@ -580,14 +579,11 @@ static int rxe_do_local_ops(struct rxe_qp *qp, struct rxe_send_wqe *wqe)
 		}
 		break;
 	case IB_WR_REG_MR:
-		mr = to_rmr(wqe->wr.wr.reg.mr);
-		rxe_add_ref(mr);
-		mr->state = RXE_MR_STATE_VALID;
-		mr->access = wqe->wr.wr.reg.access;
-		mr->ibmr.lkey = wqe->wr.wr.reg.key;
-		mr->ibmr.rkey = wqe->wr.wr.reg.key;
-		mr->iova = wqe->wr.wr.reg.mr->iova;
-		rxe_drop_ref(mr);
+		ret = rxe_reg_fast_mr(qp, wqe);
+		if (unlikely(ret)) {
+			wqe->status = IB_WC_LOC_QP_OP_ERR;
+			return ret;
+		}
 		break;
 	case IB_WR_BIND_MW:
 		ret = rxe_bind_mw(qp, wqe);
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
index c6aca2293294..31c38b2f7d0a 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.h
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
@@ -306,6 +306,8 @@ struct rxe_mr {
 
 	struct ib_umem		*umem;
 
+	u32			lkey;
+	u32			rkey;
 	enum rxe_mr_state	state;
 	enum ib_mr_type		type;
 	u64			va;
@@ -343,6 +345,7 @@ struct rxe_mw {
 	enum rxe_mw_state	state;
 	struct rxe_qp		*qp; /* Type 2 only */
 	struct rxe_mr		*mr;
+	u32			rkey;
 	int			access;
 	u64			addr;
 	u64			length;
@@ -467,26 +470,11 @@ static inline struct rxe_pd *mr_pd(struct rxe_mr *mr)
 	return to_rpd(mr->ibmr.pd);
 }
 
-static inline u32 mr_lkey(struct rxe_mr *mr)
-{
-	return mr->ibmr.lkey;
-}
-
-static inline u32 mr_rkey(struct rxe_mr *mr)
-{
-	return mr->ibmr.rkey;
-}
-
 static inline struct rxe_pd *rxe_mw_pd(struct rxe_mw *mw)
 {
 	return to_rpd(mw->ibmw.pd);
 }
 
-static inline u32 rxe_mw_rkey(struct rxe_mw *mw)
-{
-	return mw->ibmw.rkey;
-}
-
 int rxe_register_device(struct rxe_dev *rxe, const char *ibdev_name);
 
 void rxe_mc_cleanup(struct rxe_pool_entry *arg);
-- 
2.30.2



* [PATCH for-rc v4 4/5] RDMA/rxe: Create duplicate mapping tables for FMRs
@ 2021-09-14 16:42 Bob Pearson
From: Bob Pearson @ 2021-09-14 16:42 UTC
  To: jgg, zyjzyj2000, linux-rdma, bvanassche, mie, rao.shoaib; +Cc: Bob Pearson

For fast memory regions, create duplicate mapping tables so that
ib_map_mr_sg() can build a new mapping table, which is then swapped
into place synchronously with the execution of an IB_WR_REG_MR work
request.

Currently the rxe driver uses the same table both for receiving RDMA
operations and for building new tables in preparation for reusing the
MR. This exposes users to potentially incorrect results.
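
A sketch of the swap performed when the REG_MR WQE executes, using
the cur_map_set/next_map_set fields added below:

	/* ib_map_mr_sg() has filled mr->next_map_set while RDMA
	 * operations kept resolving iovas through mr->cur_map_set;
	 * make the new table live in a single step
	 */
	swap(mr->cur_map_set, mr->next_map_set);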

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_loc.h   |   1 +
 drivers/infiniband/sw/rxe/rxe_mr.c    | 196 +++++++++++++++++---------
 drivers/infiniband/sw/rxe/rxe_mw.c    |   6 +-
 drivers/infiniband/sw/rxe/rxe_verbs.c |  39 ++---
 drivers/infiniband/sw/rxe/rxe_verbs.h |  21 +--
 5 files changed, 161 insertions(+), 102 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_loc.h b/drivers/infiniband/sw/rxe/rxe_loc.h
index 4fd73b51fabf..1ca43b859d80 100644
--- a/drivers/infiniband/sw/rxe/rxe_loc.h
+++ b/drivers/infiniband/sw/rxe/rxe_loc.h
@@ -87,6 +87,7 @@ int mr_check_range(struct rxe_mr *mr, u64 iova, size_t length);
 int advance_dma_data(struct rxe_dma_info *dma, unsigned int length);
 int rxe_invalidate_mr(struct rxe_qp *qp, u32 rkey);
 int rxe_reg_fast_mr(struct rxe_qp *qp, struct rxe_send_wqe *wqe);
+int rxe_mr_set_page(struct ib_mr *ibmr, u64 addr);
 int rxe_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata);
 void rxe_mr_cleanup(struct rxe_pool_entry *arg);
 
diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index 370212801abc..8d658d42abed 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -24,7 +24,7 @@ u8 rxe_get_next_key(u32 last_key)
 
 int mr_check_range(struct rxe_mr *mr, u64 iova, size_t length)
 {
-
+	struct rxe_map_set *set = mr->cur_map_set;
 
 	switch (mr->type) {
 	case IB_MR_TYPE_DMA:
@@ -32,8 +32,8 @@ int mr_check_range(struct rxe_mr *mr, u64 iova, size_t length)
 
 	case IB_MR_TYPE_USER:
 	case IB_MR_TYPE_MEM_REG:
-		if (iova < mr->iova || length > mr->length ||
-		    iova > mr->iova + mr->length - length)
+		if (iova < set->iova || length > set->length ||
+		    iova > set->iova + set->length - length)
 			return -EFAULT;
 		return 0;
 
@@ -65,41 +65,89 @@ static void rxe_mr_init(int access, struct rxe_mr *mr)
 	mr->map_shift = ilog2(RXE_BUF_PER_MAP);
 }
 
-static int rxe_mr_alloc(struct rxe_mr *mr, int num_buf)
+static void rxe_mr_free_map_set(int num_map, struct rxe_map_set *set)
 {
 	int i;
-	int num_map;
-	struct rxe_map **map = mr->map;
 
-	num_map = (num_buf + RXE_BUF_PER_MAP - 1) / RXE_BUF_PER_MAP;
+	for (i = 0; i < num_map; i++)
+		kfree(set->map[i]);
 
-	mr->map = kmalloc_array(num_map, sizeof(*map), GFP_KERNEL);
-	if (!mr->map)
-		goto err1;
+	kfree(set->map);
+	kfree(set);
+}
+
+static int rxe_mr_alloc_map_set(int num_map, struct rxe_map_set **setp)
+{
+	int i;
+	struct rxe_map_set *set;
+
+	set = kmalloc(sizeof(*set), GFP_KERNEL);
+	if (!set)
+		goto err_out;
+
+	set->map = kmalloc_array(num_map, sizeof(struct rxe_map *), GFP_KERNEL);
+	if (!set->map)
+		goto err_free_set;
 
 	for (i = 0; i < num_map; i++) {
-		mr->map[i] = kmalloc(sizeof(**map), GFP_KERNEL);
-		if (!mr->map[i])
-			goto err2;
+		set->map[i] = kmalloc(sizeof(struct rxe_map), GFP_KERNEL);
+		if (!set->map[i])
+			goto err_free_map;
 	}
 
+	*setp = set;
+
+	return 0;
+
+err_free_map:
+	for (i--; i >= 0; i--)
+		kfree(set->map[i]);
+
+	kfree(set->map);
+err_free_set:
+	kfree(set);
+err_out:
+	return -ENOMEM;
+}
+
+/**
+ * rxe_mr_alloc() - Allocate memory map array(s) for MR
+ * @mr: Memory region
+ * @num_buf: Number of buffer descriptors to support
+ * @both: If non zero allocate both mr->map and mr->next_map
+ *	  else just allocate mr->map. Used for fast MRs
+ *
+ * Return: 0 on success else an error
+ */
+static int rxe_mr_alloc(struct rxe_mr *mr, int num_buf, int both)
+{
+	int ret;
+	int num_map;
+
 	BUILD_BUG_ON(!is_power_of_2(RXE_BUF_PER_MAP));
+	num_map = (num_buf + RXE_BUF_PER_MAP - 1) / RXE_BUF_PER_MAP;
 
 	mr->map_shift = ilog2(RXE_BUF_PER_MAP);
 	mr->map_mask = RXE_BUF_PER_MAP - 1;
-
 	mr->num_buf = num_buf;
-	mr->num_map = num_map;
 	mr->max_buf = num_map * RXE_BUF_PER_MAP;
+	mr->num_map = num_map;
 
-	return 0;
+	ret = rxe_mr_alloc_map_set(num_map, &mr->cur_map_set);
+	if (ret)
+		goto err_out;
 
-err2:
-	for (i--; i >= 0; i--)
-		kfree(mr->map[i]);
+	if (both) {
+		ret = rxe_mr_alloc_map_set(num_map, &mr->next_map_set);
+		if (ret) {
+			rxe_mr_free_map_set(mr->num_map, mr->cur_map_set);
+			goto err_out;
+		}
+	}
 
-	kfree(mr->map);
-err1:
+	return 0;
+
+err_out:
 	return -ENOMEM;
 }
 
@@ -116,6 +164,7 @@ void rxe_mr_init_dma(struct rxe_pd *pd, int access, struct rxe_mr *mr)
 int rxe_mr_init_user(struct rxe_pd *pd, u64 start, u64 length, u64 iova,
 		     int access, struct rxe_mr *mr)
 {
+	struct rxe_map_set	*set;
 	struct rxe_map		**map;
 	struct rxe_phys_buf	*buf = NULL;
 	struct ib_umem		*umem;
@@ -123,7 +172,6 @@ int rxe_mr_init_user(struct rxe_pd *pd, u64 start, u64 length, u64 iova,
 	int			num_buf;
 	void			*vaddr;
 	int err;
-	int i;
 
 	umem = ib_umem_get(pd->ibpd.device, start, length, access);
 	if (IS_ERR(umem)) {
@@ -137,18 +185,20 @@ int rxe_mr_init_user(struct rxe_pd *pd, u64 start, u64 length, u64 iova,
 
 	rxe_mr_init(access, mr);
 
-	err = rxe_mr_alloc(mr, num_buf);
+	err = rxe_mr_alloc(mr, num_buf, 0);
 	if (err) {
 		pr_warn("%s: Unable to allocate memory for map\n",
 				__func__);
 		goto err_release_umem;
 	}
 
-	mr->page_shift = PAGE_SHIFT;
-	mr->page_mask = PAGE_SIZE - 1;
+	set = mr->cur_map_set;
+	set->page_shift = PAGE_SHIFT;
+	set->page_mask = PAGE_SIZE - 1;
+
+	num_buf = 0;
+	map = set->map;
 
-	num_buf			= 0;
-	map = mr->map;
 	if (length > 0) {
 		buf = map[0]->buf;
 
@@ -171,26 +221,24 @@ int rxe_mr_init_user(struct rxe_pd *pd, u64 start, u64 length, u64 iova,
 			buf->size = PAGE_SIZE;
 			num_buf++;
 			buf++;
-
 		}
 	}
 
 	mr->ibmr.pd = &pd->ibpd;
 	mr->umem = umem;
 	mr->access = access;
-	mr->length = length;
-	mr->iova = iova;
-	mr->va = start;
-	mr->offset = ib_umem_offset(umem);
 	mr->state = RXE_MR_STATE_VALID;
 	mr->type = IB_MR_TYPE_USER;
 
+	set->length = length;
+	set->iova = iova;
+	set->va = start;
+	set->offset = ib_umem_offset(umem);
+
 	return 0;
 
 err_cleanup_map:
-	for (i = 0; i < mr->num_map; i++)
-		kfree(mr->map[i]);
-	kfree(mr->map);
+	rxe_mr_free_map_set(mr->num_map, mr->cur_map_set);
 err_release_umem:
 	ib_umem_release(umem);
 err_out:
@@ -204,7 +252,7 @@ int rxe_mr_init_fast(struct rxe_pd *pd, int max_pages, struct rxe_mr *mr)
 	/* always allow remote access for FMRs */
 	rxe_mr_init(IB_ACCESS_REMOTE, mr);
 
-	err = rxe_mr_alloc(mr, max_pages);
+	err = rxe_mr_alloc(mr, max_pages, 1);
 	if (err)
 		goto err1;
 
@@ -222,21 +270,24 @@ int rxe_mr_init_fast(struct rxe_pd *pd, int max_pages, struct rxe_mr *mr)
 static void lookup_iova(struct rxe_mr *mr, u64 iova, int *m_out, int *n_out,
 			size_t *offset_out)
 {
-	size_t offset = iova - mr->iova + mr->offset;
+	struct rxe_map_set *set = mr->cur_map_set;
+	size_t offset = iova - set->iova + set->offset;
 	int			map_index;
 	int			buf_index;
 	u64			length;
+	struct rxe_map *map;
 
-	if (likely(mr->page_shift)) {
-		*offset_out = offset & mr->page_mask;
-		offset >>= mr->page_shift;
+	if (likely(set->page_shift)) {
+		*offset_out = offset & set->page_mask;
+		offset >>= set->page_shift;
 		*n_out = offset & mr->map_mask;
 		*m_out = offset >> mr->map_shift;
 	} else {
 		map_index = 0;
 		buf_index = 0;
 
-		length = mr->map[map_index]->buf[buf_index].size;
+		map = set->map[map_index];
+		length = map->buf[buf_index].size;
 
 		while (offset >= length) {
 			offset -= length;
@@ -246,7 +297,8 @@ static void lookup_iova(struct rxe_mr *mr, u64 iova, int *m_out, int *n_out,
 				map_index++;
 				buf_index = 0;
 			}
-			length = mr->map[map_index]->buf[buf_index].size;
+			map = set->map[map_index];
+			length = map->buf[buf_index].size;
 		}
 
 		*m_out = map_index;
@@ -267,7 +319,7 @@ void *iova_to_vaddr(struct rxe_mr *mr, u64 iova, int length)
 		goto out;
 	}
 
-	if (!mr->map) {
+	if (!mr->cur_map_set) {
 		addr = (void *)(uintptr_t)iova;
 		goto out;
 	}
@@ -280,13 +332,13 @@ void *iova_to_vaddr(struct rxe_mr *mr, u64 iova, int length)
 
 	lookup_iova(mr, iova, &m, &n, &offset);
 
-	if (offset + length > mr->map[m]->buf[n].size) {
+	if (offset + length > mr->cur_map_set->map[m]->buf[n].size) {
 		pr_warn("crosses page boundary\n");
 		addr = NULL;
 		goto out;
 	}
 
-	addr = (void *)(uintptr_t)mr->map[m]->buf[n].addr + offset;
+	addr = (void *)(uintptr_t)mr->cur_map_set->map[m]->buf[n].addr + offset;
 
 out:
 	return addr;
@@ -322,7 +374,7 @@ int rxe_mr_copy(struct rxe_mr *mr, u64 iova, void *addr, int length,
 		return 0;
 	}
 
-	WARN_ON_ONCE(!mr->map);
+	WARN_ON_ONCE(!mr->cur_map_set);
 
 	err = mr_check_range(mr, iova, length);
 	if (err) {
@@ -332,7 +384,7 @@ int rxe_mr_copy(struct rxe_mr *mr, u64 iova, void *addr, int length,
 
 	lookup_iova(mr, iova, &m, &i, &offset);
 
-	map = mr->map + m;
+	map = mr->cur_map_set->map + m;
 	buf	= map[0]->buf + i;
 
 	while (length > 0) {
@@ -572,8 +624,9 @@ int rxe_invalidate_mr(struct rxe_qp *qp, u32 rkey)
 int rxe_reg_fast_mr(struct rxe_qp *qp, struct rxe_send_wqe *wqe)
 {
 	struct rxe_mr *mr = to_rmr(wqe->wr.wr.reg.mr);
-	u32 key = wqe->wr.wr.reg.key;
+	u32 key = wqe->wr.wr.reg.key & 0xff;
 	u32 access = wqe->wr.wr.reg.access;
+	struct rxe_map_set *set;
 
 	/* user can only register MR in free state */
 	if (unlikely(mr->state != RXE_MR_STATE_FREE)) {
@@ -589,19 +642,36 @@ int rxe_reg_fast_mr(struct rxe_qp *qp, struct rxe_send_wqe *wqe)
 		return -EINVAL;
 	}
 
-	/* user is only allowed to change key portion of l/rkey */
-	if (unlikely((mr->lkey & ~0xff) != (key & ~0xff))) {
-		pr_warn("%s: key = 0x%x has wrong index mr->lkey = 0x%x\n",
-			__func__, key, mr->lkey);
-		return -EINVAL;
-	}
-
 	mr->access = access;
-	mr->lkey = key;
-	mr->rkey = (access & IB_ACCESS_REMOTE) ? key : 0;
-	mr->iova = wqe->wr.wr.reg.mr->iova;
+	mr->lkey = (mr->lkey & ~0xff) | key;
+	mr->rkey = (access & IB_ACCESS_REMOTE) ? mr->lkey : 0;
 	mr->state = RXE_MR_STATE_VALID;
 
+	set = mr->cur_map_set;
+	mr->cur_map_set = mr->next_map_set;
+	mr->cur_map_set->iova = wqe->wr.wr.reg.mr->iova;
+	mr->next_map_set = set;
+
+	return 0;
+}
+
+int rxe_mr_set_page(struct ib_mr *ibmr, u64 addr)
+{
+	struct rxe_mr *mr = to_rmr(ibmr);
+	struct rxe_map_set *set = mr->next_map_set;
+	struct rxe_map *map;
+	struct rxe_phys_buf *buf;
+
+	if (unlikely(set->nbuf == mr->num_buf))
+		return -ENOMEM;
+
+	map = set->map[set->nbuf / RXE_BUF_PER_MAP];
+	buf = &map->buf[set->nbuf % RXE_BUF_PER_MAP];
+
+	buf->addr = addr;
+	buf->size = ibmr->page_size;
+	set->nbuf++;
+
 	return 0;
 }
 
@@ -626,14 +696,12 @@ int rxe_dereg_mr(struct ib_mr *ibmr, struct ib_udata *udata)
 void rxe_mr_cleanup(struct rxe_pool_entry *arg)
 {
 	struct rxe_mr *mr = container_of(arg, typeof(*mr), pelem);
-	int i;
 
 	ib_umem_release(mr->umem);
 
-	if (mr->map) {
-		for (i = 0; i < mr->num_map; i++)
-			kfree(mr->map[i]);
+	if (mr->cur_map_set)
+		rxe_mr_free_map_set(mr->num_map, mr->cur_map_set);
 
-		kfree(mr->map);
-	}
+	if (mr->next_map_set)
+		rxe_mr_free_map_set(mr->num_map, mr->next_map_set);
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_mw.c b/drivers/infiniband/sw/rxe/rxe_mw.c
index a5e2ea7d80f0..9534a7fe1a98 100644
--- a/drivers/infiniband/sw/rxe/rxe_mw.c
+++ b/drivers/infiniband/sw/rxe/rxe_mw.c
@@ -142,15 +142,15 @@ static int rxe_check_bind_mw(struct rxe_qp *qp, struct rxe_send_wqe *wqe,
 
 	/* C10-75 */
 	if (mw->access & IB_ZERO_BASED) {
-		if (unlikely(wqe->wr.wr.mw.length > mr->length)) {
+		if (unlikely(wqe->wr.wr.mw.length > mr->cur_map_set->length)) {
 			pr_err_once(
 				"attempt to bind a ZB MW outside of the MR\n");
 			return -EINVAL;
 		}
 	} else {
-		if (unlikely((wqe->wr.wr.mw.addr < mr->iova) ||
+		if (unlikely((wqe->wr.wr.mw.addr < mr->cur_map_set->iova) ||
 			     ((wqe->wr.wr.mw.addr + wqe->wr.wr.mw.length) >
-			      (mr->iova + mr->length)))) {
+			      (mr->cur_map_set->iova + mr->cur_map_set->length)))) {
 			pr_err_once(
 				"attempt to bind a VA MW outside of the MR\n");
 			return -EINVAL;
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.c b/drivers/infiniband/sw/rxe/rxe_verbs.c
index 46eaa1c5d4b3..aaf2a92f241b 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.c
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.c
@@ -954,41 +954,26 @@ static struct ib_mr *rxe_alloc_mr(struct ib_pd *ibpd, enum ib_mr_type mr_type,
 	return ERR_PTR(err);
 }
 
-static int rxe_set_page(struct ib_mr *ibmr, u64 addr)
-{
-	struct rxe_mr *mr = to_rmr(ibmr);
-	struct rxe_map *map;
-	struct rxe_phys_buf *buf;
-
-	if (unlikely(mr->nbuf == mr->num_buf))
-		return -ENOMEM;
-
-	map = mr->map[mr->nbuf / RXE_BUF_PER_MAP];
-	buf = &map->buf[mr->nbuf % RXE_BUF_PER_MAP];
-
-	buf->addr = addr;
-	buf->size = ibmr->page_size;
-	mr->nbuf++;
-
-	return 0;
-}
-
+/* build next_map_set from scatterlist
+ * The IB_WR_REG_MR WR will swap map_sets
+ */
 static int rxe_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg,
 			 int sg_nents, unsigned int *sg_offset)
 {
 	struct rxe_mr *mr = to_rmr(ibmr);
+	struct rxe_map_set *set = mr->next_map_set;
 	int n;
 
-	mr->nbuf = 0;
+	set->nbuf = 0;
 
-	n = ib_sg_to_pages(ibmr, sg, sg_nents, sg_offset, rxe_set_page);
+	n = ib_sg_to_pages(ibmr, sg, sg_nents, sg_offset, rxe_mr_set_page);
 
-	mr->va = ibmr->iova;
-	mr->iova = ibmr->iova;
-	mr->length = ibmr->length;
-	mr->page_shift = ilog2(ibmr->page_size);
-	mr->page_mask = ibmr->page_size - 1;
-	mr->offset = mr->iova & mr->page_mask;
+	set->va = ibmr->iova;
+	set->iova = ibmr->iova;
+	set->length = ibmr->length;
+	set->page_shift = ilog2(ibmr->page_size);
+	set->page_mask = ibmr->page_size - 1;
+	set->offset = set->iova & set->page_mask;
 
 	return n;
 }
diff --git a/drivers/infiniband/sw/rxe/rxe_verbs.h b/drivers/infiniband/sw/rxe/rxe_verbs.h
index 31c38b2f7d0a..9eabc8f30359 100644
--- a/drivers/infiniband/sw/rxe/rxe_verbs.h
+++ b/drivers/infiniband/sw/rxe/rxe_verbs.h
@@ -293,6 +293,17 @@ struct rxe_map {
 	struct rxe_phys_buf	buf[RXE_BUF_PER_MAP];
 };
 
+struct rxe_map_set {
+	struct rxe_map		**map;
+	u64			va;
+	u64			iova;
+	size_t			length;
+	u32			offset;
+	u32			nbuf;
+	int			page_shift;
+	int			page_mask;
+};
+
 static inline int rkey_is_mw(u32 rkey)
 {
 	u32 index = rkey >> 8;
@@ -310,26 +321,20 @@ struct rxe_mr {
 	u32			rkey;
 	enum rxe_mr_state	state;
 	enum ib_mr_type		type;
-	u64			va;
-	u64			iova;
-	size_t			length;
-	u32			offset;
 	int			access;
 
-	int			page_shift;
-	int			page_mask;
 	int			map_shift;
 	int			map_mask;
 
 	u32			num_buf;
-	u32			nbuf;
 
 	u32			max_buf;
 	u32			num_map;
 
 	atomic_t		num_mw;
 
-	struct rxe_map		**map;
+	struct rxe_map_set	*cur_map_set;
+	struct rxe_map_set	*next_map_set;
 };
 
 enum rxe_mw_state {
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* [PATCH for-rc v4 5/5] RDMA/rxe: Only allow invalidate for appropriate MRs
  2021-09-14 16:42 [PATCH for-rc v4 0/5] RDMA/rxe: Various bug fixes Bob Pearson
                   ` (3 preceding siblings ...)
  2021-09-14 16:42 ` [PATCH for-rc v4 4/5] RDMA/rxe: Create duplicate mapping tables for FMRs Bob Pearson
@ 2021-09-14 16:42 ` Bob Pearson
  2021-09-15  0:07 ` [PATCH for-rc v4 0/5] RDMA/rxe: Various bug fixes Shoaib Rao
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Bob Pearson @ 2021-09-14 16:42 UTC (permalink / raw)
  To: jgg, zyjzyj2000, linux-rdma, bvanassche, mie, rao.shoaib; +Cc: Bob Pearson

Local and remote invalidate operations are not allowed by IBA for
MRs created by (re)register memory verbs. This patch checks the
MR type in rxe_invalidate_mr().

Signed-off-by: Bob Pearson <rpearsonhpe@gmail.com>
---
 drivers/infiniband/sw/rxe/rxe_mr.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/infiniband/sw/rxe/rxe_mr.c b/drivers/infiniband/sw/rxe/rxe_mr.c
index 8d658d42abed..53271df10e47 100644
--- a/drivers/infiniband/sw/rxe/rxe_mr.c
+++ b/drivers/infiniband/sw/rxe/rxe_mr.c
@@ -605,6 +605,12 @@ int rxe_invalidate_mr(struct rxe_qp *qp, u32 rkey)
 		goto err_drop_ref;
 	}
 
+	if (unlikely(mr->type != IB_MR_TYPE_MEM_REG)) {
+		pr_warn("%s: mr->type (%d) is wrong type\n", __func__, mr->type);
+		ret = -EINVAL;
+		goto err_drop_ref;
+	}
+
 	mr->state = RXE_MR_STATE_FREE;
 	ret = 0;
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH for-rc v4 0/5] RDMA/rxe: Various bug fixes.
  2021-09-14 16:42 [PATCH for-rc v4 0/5] RDMA/rxe: Various bug fixes Bob Pearson
                   ` (4 preceding siblings ...)
  2021-09-14 16:42 ` [PATCH for-rc v4 5/5] RDMA/rxe: Only allow invalidate for appropriate MRs Bob Pearson
@ 2021-09-15  0:07 ` Shoaib Rao
  2021-09-15  0:58   ` Bob Pearson
  2021-09-18  2:13 ` Yi Zhang
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 16+ messages in thread
From: Shoaib Rao @ 2021-09-15  0:07 UTC (permalink / raw)
  To: Bob Pearson, jgg, zyjzyj2000, linux-rdma, bvanassche, mie

Hi Bob, I can verify that rping works after applying this patch series.

Thanks.

Shoaib


On 9/14/21 9:42 AM, Bob Pearson wrote:
> [...]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH for-rc v4 0/5] RDMA/rxe: Various bug fixes.
  2021-09-15  0:07 ` [PATCH for-rc v4 0/5] RDMA/rxe: Various bug fixes Shoaib Rao
@ 2021-09-15  0:58   ` Bob Pearson
  0 siblings, 0 replies; 16+ messages in thread
From: Bob Pearson @ 2021-09-15  0:58 UTC (permalink / raw)
  To: Shoaib Rao, jgg, zyjzyj2000, linux-rdma, bvanassche, mie

On 9/14/21 7:07 PM, Shoaib Rao wrote:
> Hi Bob, I can verify that rping works after applying this patch series.
> 
> Thanks.
> 
> Shoaib
> 
> 
Thanks,

I appreciate your quick response.

Bob

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH for-rc v4 0/5] RDMA/rxe: Various bug fixes.
  2021-09-14 16:42 [PATCH for-rc v4 0/5] RDMA/rxe: Various bug fixes Bob Pearson
                   ` (5 preceding siblings ...)
  2021-09-15  0:07 ` [PATCH for-rc v4 0/5] RDMA/rxe: Various bug fixes Shoaib Rao
@ 2021-09-18  2:13 ` Yi Zhang
  2021-09-25  7:55   ` Yi Zhang
  2021-09-23 18:49 ` Olga Kornievskaia
  2021-09-24 13:54 ` Jason Gunthorpe
  8 siblings, 1 reply; 16+ messages in thread
From: Yi Zhang @ 2021-09-18  2:13 UTC (permalink / raw)
  To: Bob Pearson
  Cc: Jason Gunthorpe, Zhu Yanjun, RDMA mailing list, Bart Van Assche,
	mie, rao.shoaib, Sagi Grimberg

Hi Bob
With this patch series applied, the blktests nvme-rdma test can still
fail with the error below, while the same test passes with siw.

[ 1702.140090] loop0: detected capacity change from 0 to 2097152
[ 1702.150729] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[ 1702.151425] nvmet_rdma: enabling port 0 (10.16.221.116:4420)
[ 1702.158810] nvmet: creating controller 1 for subsystem
blktests-subsystem-1 for NQN
nqn.2014-08.org.nvmexpress:uuid:4c4c4544-0035-4b10-8044-b9c04f463333.
[ 1702.159037] nvme nvme0: creating 32 I/O queues.
[ 1702.171671] nvme nvme0: failed to initialize MR pool sized 128 for QID 32
[ 1702.178482] nvme nvme0: rdma connection establishment failed (-12)
[ 1702.292261] eno2 speed is unknown, defaulting to 1000
[ 1702.297325] eno3 speed is unknown, defaulting to 1000
[ 1702.302389] eno4 speed is unknown, defaulting to 1000
[ 1702.317991] rdma_rxe: unloaded

Failure from:
        /*
         * Currently we don't use SG_GAPS MR's so if the first entry is
         * misaligned we'll end up using two entries for a single data page,
         * so one additional entry is required.
         */
        pages_per_mr = nvme_rdma_get_max_fr_pages(ibdev, queue->pi_support) + 1;
        ret = ib_mr_pool_init(queue->qp, &queue->qp->rdma_mrs,
                              queue->queue_size,
                              IB_MR_TYPE_MEM_REG,
                              pages_per_mr, 0);
        if (ret) {
                dev_err(queue->ctrl->ctrl.device,
                        "failed to initialize MR pool sized %d for QID %d\n",
                        queue->queue_size, nvme_rdma_queue_idx(queue));
                goto out_destroy_ring;
        }


On Wed, Sep 15, 2021 at 12:43 AM Bob Pearson <rpearsonhpe@gmail.com> wrote:
> [...]


-- 
Best Regards,
  Yi Zhang


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH for-rc v4 0/5] RDMA/rxe: Various bug fixes.
  2021-09-14 16:42 [PATCH for-rc v4 0/5] RDMA/rxe: Various bug fixes Bob Pearson
                   ` (6 preceding siblings ...)
  2021-09-18  2:13 ` Yi Zhang
@ 2021-09-23 18:49 ` Olga Kornievskaia
  2021-09-23 19:56   ` Jason Gunthorpe
  2021-09-24 13:54 ` Jason Gunthorpe
  8 siblings, 1 reply; 16+ messages in thread
From: Olga Kornievskaia @ 2021-09-23 18:49 UTC (permalink / raw)
  To: Bob Pearson
  Cc: Jason Gunthorpe, Zhu Yanjun, linux-rdma, bvanassche, mie, rao.shoaib

On Tue, Sep 14, 2021 at 12:49 PM Bob Pearson <rpearsonhpe@gmail.com> wrote:
> [...]

Hi Bob,

After applying these patches on top of 5.15-rc1, rping and NFSoRDMA
mounts work.

Thank you.

> [...]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH for-rc v4 0/5] RDMA/rxe: Various bug fixes.
  2021-09-23 18:49 ` Olga Kornievskaia
@ 2021-09-23 19:56   ` Jason Gunthorpe
  2021-09-24  1:25     ` Bob Pearson
  0 siblings, 1 reply; 16+ messages in thread
From: Jason Gunthorpe @ 2021-09-23 19:56 UTC (permalink / raw)
  To: Olga Kornievskaia
  Cc: Bob Pearson, Zhu Yanjun, linux-rdma, bvanassche, mie, rao.shoaib

On Thu, Sep 23, 2021 at 02:49:54PM -0400, Olga Kornievskaia wrote:
> On Tue, Sep 14, 2021 at 12:49 PM Bob Pearson <rpearsonhpe@gmail.com> wrote:
> > [...]
> 
> Hi Bob,
> 
> After applying these patches on top of 5.15-rc1, rping and NFSoRDMA
> mounts work.

I've lost track with all the mails; are we good on this series now, or
is there still more fixing needed?

Jason

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH for-rc v4 0/5] RDMA/rxe: Various bug fixes.
  2021-09-23 19:56   ` Jason Gunthorpe
@ 2021-09-24  1:25     ` Bob Pearson
  0 siblings, 0 replies; 16+ messages in thread
From: Bob Pearson @ 2021-09-24  1:25 UTC (permalink / raw)
  To: Jason Gunthorpe, Olga Kornievskaia, Yi Zhang
  Cc: Zhu Yanjun, linux-rdma, bvanassche, mie, rao.shoaib


>>
>> Hi Bob,
>>
>> After applying these patches on top of 5.15-rc1, rping and NFSoRDMA
>> mounts work.
> 
> I've lost track with all the mails; are we good on this series now, or
> is there still more fixing needed?
> 
> Jason
> 

There is one remaining report of a problem, from Yi Zhang on 9/17/21, who sees an error in blktests.
Unfortunately I am unable to figure out how to reproduce the error, and the code he shows is not directly
in rxe. I suspect it may be related to resource limits, which would have nothing to do with
the patch series. I believe the patches are all good.

Bob

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH for-rc v4 0/5] RDMA/rxe: Various bug fixes.
  2021-09-14 16:42 [PATCH for-rc v4 0/5] RDMA/rxe: Various bug fixes Bob Pearson
                   ` (7 preceding siblings ...)
  2021-09-23 18:49 ` Olga Kornievskaia
@ 2021-09-24 13:54 ` Jason Gunthorpe
  2021-09-24 15:44   ` Shoaib Rao
  8 siblings, 1 reply; 16+ messages in thread
From: Jason Gunthorpe @ 2021-09-24 13:54 UTC (permalink / raw)
  To: Bob Pearson; +Cc: zyjzyj2000, linux-rdma, bvanassche, mie, rao.shoaib

On Tue, Sep 14, 2021 at 11:42:02AM -0500, Bob Pearson wrote:
> [...]

applied to for-next; this is a little too complicated for for-rc, and
there are no Fixes: lines
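
For reference, a Fixes: trailer names the commit that introduced the
bug being fixed, e.g. (hash borrowed from later in this thread, purely
illustrative, not a claim about which commit these patches fix):

	Fixes: af732adfacb2 ("RDMA/rxe: Enable MW object pool")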

Thanks,
Jason

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH for-rc v4 0/5] RDMA/rxe: Various bug fixes.
  2021-09-24 13:54 ` Jason Gunthorpe
@ 2021-09-24 15:44   ` Shoaib Rao
  0 siblings, 0 replies; 16+ messages in thread
From: Shoaib Rao @ 2021-09-24 15:44 UTC (permalink / raw)
  To: Jason Gunthorpe, Bob Pearson; +Cc: zyjzyj2000, linux-rdma, bvanassche, mie


On 9/24/21 6:54 AM, Jason Gunthorpe wrote:
> On Tue, Sep 14, 2021 at 11:42:02AM -0500, Bob Pearson wrote:
>> [...]
> applied to for-next; this is a little too complicated for for-rc, and
> there are no Fixes: lines
>
> Thanks,
> Jason

Jason, can you pull in my patch as well?

Thanks,

Shoaib


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH for-rc v4 0/5] RDMA/rxe: Various bug fixes.
  2021-09-18  2:13 ` Yi Zhang
@ 2021-09-25  7:55   ` Yi Zhang
  2021-09-25 20:32     ` Pearson, Robert B
  0 siblings, 1 reply; 16+ messages in thread
From: Yi Zhang @ 2021-09-25  7:55 UTC (permalink / raw)
  To: Bob Pearson, Sagi Grimberg
  Cc: Jason Gunthorpe, Zhu Yanjun, RDMA mailing list, Bart Van Assche,
	mie, rao.shoaib

Hi Bob/Sagi

Regarding the blktests nvme-rdma failure with rdma_rxe, I found that
patch [3] reduced the number of MRs and MWs from 256K to 4K, which is
what makes the nvme connect fail [1]. If I change the "number of io
queues to use" to 31, it works [2], and the default io queue count is
the CPU core count (mine is 48), which is why it failed in my
environment. (With 32 queues and an MR pool of 128 per queue, 32 * 128
= 4096 MRs are already needed, so a 4K cap leaves no headroom.)

I have also tried reverting that patch, and blktests runs successfully
on my server. Could we change the limit back to 256K?

[1]
# nvme connect -t rdma -a 10.16.221.74 -s 4420 -n testnqn  -i 32
Failed to write to /dev/nvme-fabrics: Cannot allocate memory
# dmesg
[  857.546927] nvmet: creating controller 1 for subsystem testnqn for
NQN nqn.2014-08.org.nvmexpress:uuid:4c4c4544-0030-4310-8058-b9c04f325732.
[  857.547640] nvme nvme0: creating 32 I/O queues.
[  857.595285] nvme nvme0: failed to initialize MR pool sized 128 for QID 32
[  857.602141] nvme nvme0: rdma connection establishment failed (-12)
[2]
# nvme connect -t rdma -a 10.16.221.74 -s 4420 -n testnqn  -i 31
# dmesg
[  863.792874] nvmet: creating controller 1 for subsystem testnqn for
NQN nqn.2014-08.org.nvmexpress:uuid:4c4c4544-0030-4310-8058-b9c04f325732.
[  863.793638] nvme nvme0: creating 31 I/O queues.
[  863.845594] nvme nvme0: mapped 31/0/0 default/read/poll queues.
[  863.853633] nvme nvme0: new ctrl: NQN "testnqn", addr 10.16.221.74:4420

[3]
commit af732adfacb2c6d886713624af2ff8e555c32aa4
Author: Bob Pearson <rpearsonhpe@gmail.com>
Date:   Mon Jun 7 23:25:46 2021 -0500

    RDMA/rxe: Enable MW object pool

    Currently the rxe driver has a rxe_mw struct object but nothing about
    memory windows is enabled. This patch turns on memory windows and some
    minor cleanup.

    Set device attribute in rxe.c so max_mw = MAX_MW.  Change parameters in
    rxe_param.h so that MAX_MW is the same as MAX_MR.  Reduce the number of
    MRs and MWs to 4K from 256K.  Add device capability bits for 2a and 2b
    memory windows.  Removed RXE_MR_TYPE_MW from the rxe_mr_type enum.
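
For reference, a minimal libibverbs sketch to read the limits the rxe
device actually reports (first RDMA device assumed, error handling
omitted; this is not part of the driver):

	#include <stdio.h>
	#include <infiniband/verbs.h>

	int main(void)
	{
		/* open the first RDMA device in the system */
		struct ibv_device **list = ibv_get_device_list(NULL);
		struct ibv_context *ctx = ibv_open_device(list[0]);
		struct ibv_device_attr attr;

		/* query and print the advertised resource limits */
		ibv_query_device(ctx, &attr);
		printf("max_mr = %d, max_mw = %d, max_qp = %d\n",
		       attr.max_mr, attr.max_mw, attr.max_qp);

		ibv_close_device(ctx);
		ibv_free_device_list(list);
		return 0;
	}

With the 4K cap in place this should report max_mr on the order of
4096, matching the pool-size math above.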



On Sat, Sep 18, 2021 at 10:13 AM Yi Zhang <yi.zhang@redhat.com> wrote:
> [...]



-- 
Best Regards,
  Yi Zhang


^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: [PATCH for-rc v4 0/5] RDMA/rxe: Various bug fixes.
  2021-09-25  7:55   ` Yi Zhang
@ 2021-09-25 20:32     ` Pearson, Robert B
  0 siblings, 0 replies; 16+ messages in thread
From: Pearson, Robert B @ 2021-09-25 20:32 UTC (permalink / raw)
  To: Yi Zhang, Bob Pearson, Sagi Grimberg
  Cc: Jason Gunthorpe, Zhu Yanjun, RDMA mailing list, Bart Van Assche,
	mie, rao.shoaib

Sorry, that was my fault. 256K seemed way too big, but 4K is too small.
This whole area has been problematic lately. There are several use cases
for rxe, and one is to be an easy test target, which is hard if you
can't reach the limits. On the other hand, if you want to really use it
you just want the limits to get out of the way.

One way to fix this would be to add a bunch of module parameters, but
there are too many attributes for that to be practical. Another would be
to have 2 or 3 attribute sets (large, medium, small) that you could pick
with one module parameter; a rough sketch follows below. Or you could
have a configuration file somewhere that the driver could read, but that
seems messy. Or you could pass strings to one module parameter that
could be parsed to set the attributes. If anyone has a preference I'd be
happy to write something.
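
Purely as a sketch of the profile idea (all names and values below are
made up for illustration, not actual rxe code):

	#include <linux/kernel.h>
	#include <linux/module.h>
	#include <linux/string.h>

	/* hypothetical profile table -- limits are placeholders */
	struct rxe_attr_profile {
		const char *name;
		int max_mr;
		int max_mw;
	};

	static const struct rxe_attr_profile rxe_profiles[] = {
		{ "small",	4096,	4096 },
		{ "medium",	65536,	65536 },
		{ "large",	262144,	262144 },
	};

	/* one module parameter selects the whole attribute set */
	static char *attr_profile = "medium";
	module_param(attr_profile, charp, 0444);
	MODULE_PARM_DESC(attr_profile, "resource profile: small|medium|large");

	static const struct rxe_attr_profile *rxe_lookup_profile(void)
	{
		int i;

		for (i = 0; i < ARRAY_SIZE(rxe_profiles); i++)
			if (!strcmp(attr_profile, rxe_profiles[i].name))
				return &rxe_profiles[i];
		return &rxe_profiles[1];	/* fall back to medium */
	}

At load time the driver would copy the chosen profile into the device
attributes it reports; a real table would of course cover more than two
limits.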

Bob

-----Original Message-----
From: Yi Zhang <yi.zhang@redhat.com> 
Sent: Saturday, September 25, 2021 2:56 AM
To: Bob Pearson <rpearsonhpe@gmail.com>; Sagi Grimberg <sagi@grimberg.me>
Cc: Jason Gunthorpe <jgg@nvidia.com>; Zhu Yanjun <zyjzyj2000@gmail.com>; RDMA mailing list <linux-rdma@vger.kernel.org>; Bart Van Assche <bvanassche@acm.org>; mie@igel.co.jp; rao.shoaib@oracle.com
Subject: Re: [PATCH for-rc v4 0/5] RDMA/rxe: Various bug fixes.

[...]


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2021-09-25 20:36 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-14 16:42 [PATCH for-rc v4 0/5] RDMA/rxe: Various bug fixes Bob Pearson
2021-09-14 16:42 ` [PATCH for-rc v4 1/5] RDMA/rxe: Add memory barriers to kernel queues Bob Pearson
2021-09-14 16:42 ` [PATCH for-rc v4 2/5] RDMA/rxe: Cleanup MR status and type enums Bob Pearson
2021-09-14 16:42 ` [PATCH for-rc v4 3/5] RDMA/rxe: Separate HW and SW l/rkeys Bob Pearson
2021-09-14 16:42 ` [PATCH for-rc v4 4/5] RDMA/rxe: Create duplicate mapping tables for FMRs Bob Pearson
2021-09-14 16:42 ` [PATCH for-rc v4 5/5] RDMA/rxe: Only allow invalidate for appropriate MRs Bob Pearson
2021-09-15  0:07 ` [PATCH for-rc v4 0/5] RDMA/rxe: Various bug fixes Shoaib Rao
2021-09-15  0:58   ` Bob Pearson
2021-09-18  2:13 ` Yi Zhang
2021-09-25  7:55   ` Yi Zhang
2021-09-25 20:32     ` Pearson, Robert B
2021-09-23 18:49 ` Olga Kornievskaia
2021-09-23 19:56   ` Jason Gunthorpe
2021-09-24  1:25     ` Bob Pearson
2021-09-24 13:54 ` Jason Gunthorpe
2021-09-24 15:44   ` Shoaib Rao
